UKSG Conference April 2013
Phil Nicolson
Data Governance
What is Data Governance
What is Data Quality
The challenges
Data governance programme
A publisher approach
The outcome: Book author example
ICEDIS
Summary
Data governance
“I think that the key issue here, is that the information is probably incorrect, inaccurate and in a form that almost certainly shouldn't have been used”
Dr John Thomson, cardiologist at Leeds General Infirmary,
Sky News 30/3/2013
Data Governance – a definition
Data governance is defined as the processes, policies, standards, organisation, and technologies required to manage and ensure the availability, accessibility, quality, consistency, auditability, and security of data
Data Quality - definitions
Data are of high quality "if they are fit for their intended uses in operations, decision making and planning"
Data are deemed of high quality if they correctly represent the real-world construct to which they refer
Data Quality
Data quality attributes:
Accurate
Reliable
Complete
Appropriate
Timely
Credible
Up-to-date
The challenge: Data Sources
Multiple data sources – ‘system’ data silos
Multiple locations – ‘geographic’ data silos
Data entered through multiple channels
Data entered by different people
The challenge: Data Sources
Typical publisher systems:
Financial system
CRM/Sales database
Authentication system
Fulfilment
Usage statistics
Submissions system
Author database
…..
Data can be entered by:
Organisation staff
Authors
Society members
Agents in the supply chain
3rd party organisations
…..
The challenge: Institutions
UCL:
University College London (UK)
Université Catholique de Louvain (Belgium)
Universidad Cristiana Latinoamericana (Ecuador)
University College Lillebælt (Denmark)
Centro Universitario Celso Lisboa (Brazil)
Union County Library (USA)
NPL:
National Physical Laboratory (UK)
National Physical Laboratory (India)
York Uni.:
University of York (UK)
York University (Canada)
Northeastern University:
Northeastern University (Boston, USA)
Northeastern University (Shenyang, China)
The challenge: Individuals
How can we uniquely identify individuals?
Of the 700,000 individuals known to the RSC in 2012 there were:
Smith: ~1,500
Jones: ~1,000
Li: >10,000
Consequences of poor data
Biggest obstacle(s) to data quality improvement in your organization?
Lack of accountability and responsibility for data quality 55.4%
Too many information silos 51.8%
Lack of awareness or communication of the magnitude of data quality problems 51.4%
Lack of common understanding of what data quality means 50.2%
Lack of awareness or communication of the opportunities associated with high quality data 45.0%
Lack of senior leadership in tackling data quality issues 44.2%
Lack of data quality policies, plans, and procedures 42.2%
Perception that data quality is an IT issue only rather than an organisation wide issue 41.8%
The State of Information and Data Quality 2012 Industry Survey & Report (IAIDQ): Understanding how Organizations Manage the Quality of their Information and Data Assets. Pierce, Yonke, Malik, Nagaraj
Data Governance – why it is vital
“processes, policies, standards… ensure quality and consistency”
Increase consistency and confidence in our decision making
Maximise the income generation potential of our data
Provide excellent customer service
Designating accountability for information quality
Minimising or eliminating re-work
Optimise staff effectiveness
Decreasing the risk of regulatory fines
Improving data security
Data is one of the most valuable assets within an organisation
Data governance – a new culture
Data governance programme
Plan & prioritise
Sponsorship: director level sponsor?
Program management: business or IT driven?
Organisational structure: local, national, international?
Scope: focus on the most important data?
Ownership: who are the business owners of critical data?
New system implementation: protect investment
Resources: dedicated staff?
Funding: which area of the business will fund the program?
Business drivers: what are the major business drivers?
Barriers: what are the main barriers (cultural, funding, resources, priorities etc.) and can they be mitigated?
Audit & Analyse
Audit existing data quality
Review all relevant systems
How poor is it?
Incomplete data
Invalid
Out of date
….
Clean existing data
Prioritise
Quick wins
Highlight progress
What can be automated?
Introduce unique identifiers
Identifiers available
People
International Standard Name Identifier (ISNI)
Open Researcher and Contributor ID (ORCID)
Scopus Author Identifier
ResearcherID
Organisations
International Standard Name Identifier (ISNI)
Ringgold ID
DUNS Number (D&B) and other business and finance IDs
MDR PID Numbers and other marketing IDs
Library of Congress MARC Code List for Organizations
ISNI
[Diagram: Party ID 1 and Party ID 2, each with its own proprietary information and/or metadata, are each linked to an ISNI Number]
ISNI is designed to be a “bridge identifier”
Author IDs
ORCID is designed to persistently identify and disambiguate scholarly researchers and attach them to research output
ORCID identifiers utilize a format compliant with the ISNI ISO standard
ISNI has reserved a block of identifiers for use by ORCID, so there will be no overlaps in assignments
Recorded as http://orcid.org/0000-0001-2345-6789
http://about.orcid.org/
http://www.isni.org/
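Because ORCID identifiers follow the ISNI format, the final character is a check character calculated with ISO 7064 MOD 11-2, so a mistyped identifier can often be caught at the point of capture. The following is a minimal sketch of that check; the function names are illustrative and not part of any ORCID or ISNI library.

```python
import re

# 16 characters in four hyphen-separated groups; the last may be "X".
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def orcid_check_character(base_digits: str) -> str:
    """Return the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(value: str) -> bool:
    """Check format and check character; accepts the bare ID or the URI form."""
    identifier = value.rsplit("/", 1)[-1].strip().upper()
    if not ORCID_PATTERN.match(identifier):
        return False
    digits = identifier.replace("-", "")
    return orcid_check_character(digits[:15]) == digits[15]
```

A passing check only shows that the identifier is well formed. It does not confirm that the identifier has been registered or that it belongs to the person who supplied it.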
Use cases
Disambiguation of researchers and connection to all their research
Links to contributors, editors, compilers and others involved in the research process
Embed IDs into research workflows and the supply chain
Integrate systems
Institutional IDs
Ringgold is an ISNI Registration Agency
Unique institutional ID number maps data across systems
ISNI numbers should be used across the scholarly supply chain to:
Disambiguate institutional records
Eradicate duplication of data
Map institutions into their hierarchy
Link systems using the institutional ID as the lynchpin
Minimising the impact of data silos
Standard identifiers (both individual and institution) can be used to break down silos by enabling better system linking.
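As a rough illustration of system linking, the sketch below joins records from two silos on a shared institutional identifier rather than on the institution name. The record layout, the field names and the "INST-" identifier values are hypothetical and chosen only for the example.

```python
from collections import defaultdict

# Hypothetical records from two separate systems; identifier values are made up.
crm_records = [
    {"institution_id": "INST-001", "name": "University of York", "country": "UK"},
    {"institution_id": "INST-002", "name": "York University", "country": "Canada"},
]
usage_records = [
    {"institution_id": "INST-001", "name": "York Uni.", "downloads": 120},
    {"institution_id": "INST-002", "name": "York Uni.", "downloads": 75},
    {"institution_id": "INST-001", "name": "Univ. of York", "downloads": 30},
]


def link_by_identifier(crm, usage):
    """Attach total usage to each CRM record by institution_id, ignoring names."""
    downloads = defaultdict(int)
    for row in usage:
        downloads[row["institution_id"]] += row["downloads"]
    return [
        {**record, "downloads": downloads.get(record["institution_id"], 0)}
        for record in crm
    ]


linked = link_by_identifier(crm_records, usage_records)
```

Matching on the name alone would not work here, because "York Uni." appears against both institutions in the usage data.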
Improve data capture
Data quality policy
Web forms
Closer collaboration with 3rd parties to encourage use of industry standard identifiers such as ISNI or ORCID
Data capture - data quality policy
Design to ensure accuracy, quality and consistency
Individual responsibilities:
All staff are responsible for the accuracy and consistency of data
Capture data in such a way that it is uniquely identifiable and easily shared within the organisation and with 3rd parties
Records relating to individuals
Records relating to institutions
Reporting of inaccuracies to Data Owners
Data owners' responsibilities:
All source data systems must have a designated Data Owner
Data owner retains overall responsibility for all records within their source data system
Improve data capture – web forms
Required fields
Validation
Address validation – postcode lookup
Institution validation – institution lookup
‘Internal’ and ‘external’ web form consistency
Language barriers
Help and hints
Free-text fields
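The web form points above (required fields, validation, institution lookup) could be combined along the following lines. This is a sketch under stated assumptions: the field names, the simple email pattern and the in-memory lookup table are all hypothetical stand-ins, and a real form would call an actual institution lookup service.

```python
import re

REQUIRED_FIELDS = ("name", "email", "institution_id")
# Deliberately simple shape check (an assumption), not a full email validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
# Hypothetical table standing in for an institution lookup service.
KNOWN_INSTITUTIONS = {
    "INST-001": "University of York",
    "INST-002": "York University",
}


def validate_submission(form: dict) -> list:
    """Return a list of human-readable problems; an empty list means the form passes."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not str(form.get(field, "")).strip():
            errors.append(f"{field} is required")
    email = str(form.get("email", "")).strip()
    if email and not EMAIL_PATTERN.match(email):
        errors.append("email is not a valid address")
    institution_id = str(form.get("institution_id", "")).strip()
    if institution_id and institution_id not in KNOWN_INSTITUTIONS:
        errors.append("institution_id was not found in the institution lookup")
    return errors
```

Asking the user to pick an institution from a lookup, rather than typing it as free text, is what allows the record to carry an identifier from the moment it is created.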
On-going monitoring
Dashboards
Regular audits
Metrics – Institutional Linking Rate
Staff awareness
Reporting of errors
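The slides do not define how the Institutional Linking Rate is calculated. One plausible reading, assumed here, is the percentage of records in a data source that carry an institutional identifier. A minimal sketch under that assumption, with a hypothetical institution_id field:

```python
def institutional_linking_rate(records):
    """Percentage of records with a non-empty institution identifier (assumed definition)."""
    if not records:
        return 0.0
    linked = sum(1 for record in records if record.get("institution_id"))
    return 100.0 * linked / len(records)


# Hypothetical author records; identifier values are made up.
authors = [
    {"name": "A", "institution_id": "INST-001"},
    {"name": "B", "institution_id": None},
    {"name": "C", "institution_id": "INST-002"},
    {"name": "D", "institution_id": ""},
]
rate = institutional_linking_rate(authors)
```

Tracking this figure for each data source over time gives a single number that can be shown on a dashboard and compared between audits.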
A publisher example
Develop a Data Governance Programme
Data ‘champion’
Engagement – at all levels
Ownership – at all levels
Allocate necessary resources
Guidelines/Policy - Data quality policy
Processes put in place
Education - raise awareness
New staff – training on Data Governance and their wider impact
Change of culture
A publisher example
Ringgold and DataSalon client
All institutional records contain Ringgold Identifiers
System linking via Individual and Institutional identifiers
Data (both good and bad) visible to all via MasterVision
Use of data governance dashboards
Tidying of existing data
Simple reporting of incorrect data across organisation
New data captured correctly
Author database
1. Create a data governance dashboard to monitor problem areas:
• Book authors with no related institution
• Unknown book authors
• Author records without an affiliation entry
• Author records with commas in the affiliation entry
• Book authors without an email address
• Book authors with an invalid email address
2. Correct problem records in existing data
• Dashboard clearly highlighted all records of concern and these records were corrected
3. Ensure new records are created correctly
• Raise staff understanding of the importance of capturing data correctly and the impact it has across the organisation as a whole (data silos)
• Training covering data governance
4. Ensure appropriate Ringgold coverage
• Where institutions were discovered in the Author database that didn’t exist within Identify these were reported to Ringgold. This not only means that individual authors can be linked to the new institution but that any individuals in other data sources at the same institution can be linked. This benefits all users of our data and potentially highlights new sales opportunities.
5. Monitor data quality on an on-going basis
• Books data governance dashboard updated on a weekly basis.
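The dashboard checks listed in step 1 can be pictured as a set of simple rules applied to every author record, with the dashboard showing a count per rule. The sketch below is illustrative only and is not the MasterVision implementation; the field names and the email pattern are assumptions, and the "unknown book authors" check is left out because the slides do not say how it is defined.

```python
import re
from collections import Counter

# Deliberately simple shape check (an assumption), not a full email validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def problems_for(author: dict) -> list:
    """Return the dashboard problem categories that apply to one author record."""
    problems = []
    if not author.get("institution_id"):
        problems.append("no related institution")
    affiliation = (author.get("affiliation") or "").strip()
    if not affiliation:
        problems.append("no affiliation entry")
    elif "," in affiliation:
        problems.append("comma in affiliation entry")
    email = (author.get("email") or "").strip()
    if not email:
        problems.append("no email address")
    elif not EMAIL_PATTERN.match(email):
        problems.append("invalid email address")
    return problems


def dashboard_counts(authors) -> Counter:
    """Count how many records fall into each problem category."""
    counts = Counter()
    for author in authors:
        counts.update(problems_for(author))
    return counts


# Hypothetical sample records.
sample_authors = [
    {"institution_id": "INST-001", "affiliation": "University of York", "email": "a.smith@example.org"},
    {"institution_id": None, "affiliation": "Dept of Chemistry, York Uni.", "email": "b.jones@example"},
    {"institution_id": None, "affiliation": "", "email": ""},
]
counts = dashboard_counts(sample_authors)
```

Re-running the same rules on a schedule, such as the weekly update in step 5, shows whether the counts are falling as records are corrected.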
Author database – results
[Chart: percentage axis from 70% to 100%; series labelled "All data sources" and "ANKO"]
10% will never link:
• Missing data (old records)
• Institution no longer exists
• Retired author
• Genuinely no related institution
End of process:
• 15% increase in authors linked to institutions - information valuable in supporting all areas of the business
• Ready for data migration
ICEDIS
The international standards organization EDItEUR is working to encourage improvements in the ways that "party" information is communicated
Some parts of the supply chain continue to send unstructured name & address records, making matching, disambiguation and automatic ingest near impossible
ICEDIS has collaborated with EDItEUR to develop a highly structured data model for exchanging names, addresses and standard identifiers.
The group has recently been validating the model by means of a "paper pilot", using a small library of about 100 name & address types
An XML schema and HTML documentation are freely available
www.editeur.org www.editeur.org/138/[email protected]
Summary
Your data is a very valuable asset when managed correctly
Establishing a data governance programme will enable you to gain maximum benefit from that data
Data governance is as much about changing the culture of an organisation as it is about processes and procedures
It will take time but the benefits can be enormous
Phil Nicolson
Data Manager
Ringgold Inc.