Page 1: Workshop on Metadata Standards and Best Practices November 19-20 th, 2007 Session 3 Researcher Metadata in RDCs Pascal Heus Open Data Foundation pheus@opendatafoundation.org.

Workshop on Metadata Standards and Best Practices
November 19-20, 2007

Session 3: Researcher Metadata in RDCs

Pascal Heus
Open Data Foundation
pheus@opendatafoundation.org
http://www.opendatafoundation.org

http://www.opendatafoundation.org Open Data Foundation – IZA 2007/11

Outline

• RDC needs
• Metadata in RDCs
• Potential solutions
• Examples
• Conclusions / Q&A

RDC Overview

• Provide an environment in which researchers can perform in-depth analysis of data as efficiently as possible
• Simple access to a data file and codebook is insufficient
• High-quality metadata and a collaborative environment are needed to promote dynamic research
• Should capture the research process
• Should provide benefits to all stakeholders: producers, librarians, researchers, the general public, etc.

Metadata and the survey life cycle

• A survey is not a static process
• It evolves over time and involves many players
• It extends to aggregate data to reach decision makers
• Metadata is crucial to capture knowledge

Importance of metadata

Imagine a world without metadata...
• Users would say:
  – I can’t find the right data! How do I get access?
  – Where is the report / questionnaire / methodology?
  – I don’t understand this survey / file / variable
  – I can’t merge the files
  – How do I weight the data?
  – My results don’t match the report; I can’t reproduce the same results
  – Are these things comparable?
  – I didn’t know someone had done this research before!
• Sound familiar?
  – Metadata is an answer to a researcher’s frustrations
• Producers and archivists are making efforts to improve metadata, but metadata must also be captured by researchers (life cycle!)

When to capture metadata?

• Metadata must be captured at the time the event occurs!
• Documenting after the fact leads to considerable loss of information
• This is true for both producers and researchers
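
One lightweight way to capture events as they happen is an append-only log that is written at the moment each step is run. The sketch below is illustrative only: `record_event`, the field names, and the JSON Lines layout are assumptions made for this example, not part of any RDC tool or metadata standard.

```python
import json
from datetime import datetime, timezone


def record_event(log_path, actor, action, details=""):
    """Append one timestamped event to a JSON Lines log and return the entry.

    The function name and the fields are hypothetical; adapt them to
    whatever the project actually needs to remember.
    """
    entry = {
        # UTC timestamp taken when the event is recorded, not reconstructed later
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "actor": actor,
        "action": action,
        "details": details,
    }
    # Append mode keeps earlier events untouched
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(entry) + "\n")
    return entry
```

A call such as `record_event("research_log.jsonl", "researcher_a", "recode", "recoded marstat into 3 groups")` would add one line to the log. A plain text file is used here only because it needs no infrastructure; a structured metadata tool would serve the same purpose.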

Metadata and the Replication standard

• Replication standard
  – Gary King, Harvard, 1995: http://gking.harvard.edu/projects/repl.shtml
  – "The replication standard holds that sufficient information exists with which to understand, evaluate, and build upon a prior work if a third party can replicate the results without any additional information from the author."
  – The only way to understand and evaluate an empirical analysis fully is to know the exact process by which the data were generated
  – A replication dataset includes all information necessary to replicate empirical results
• Metadata is crucial to meeting the standard
  – Composed of documentation and structured metadata
  – Undocumented data is useless

RDC issues

• Without producer metadata
  – Researchers can’t discover data or work efficiently
• Without researcher metadata
  – Producers don’t know about data usage and quality issues
  – Other researchers are not aware of what has been done
• Without standards
  – Information can’t be properly managed and exchanged between agencies or with the public
• Without tools
  – Knowledge can’t be captured, preserved, or shared

RDC Metadata Framework

[Diagram: producers, researchers, and external users connected through the RDC and its data. Labelled components: producer/archive metadata, research metadata, research output, public-use metadata.]

1. Producers provide data and basic documentation
2. Existing metadata needs to be enhanced
3. Start capturing researcher metadata
4. Knowledge grows and gets reused
5. Provides usage and quality feedback to the producer / RDC
6. Repeat across surveys/topics
7. Metadata facilitates output
8. Public metadata facilitates data discovery and fosters global knowledge
9. Metadata exchange between agencies

RDC Solutions

• Metadata management
  – Adopt standards and provide researchers with comprehensive metadata
  – Use related tools to capture the research process
• Collaborative environment
  – Use web technologies to foster a dynamic research environment
• Connected and remote enclaves
  – Connect RDCs through secure networks
  – Consider a virtual data enclave
• Data disclosure
  – Protect respondents through sound data disclosure techniques
• Train providers / researchers

Simple techniques

• Start with good practices
  – File and variable naming conventions (embed metadata)
  – Code documentation
  – Good statistical methods
• Web tools
  – Take advantage of common web technologies
  – Organize: calendar, events & news, task/to-do lists
  – Knowledge capture/sharing: shared document/script libraries, wikis, blogs, discussion groups, citation databases, etc.

Coding and naming conventions (1)

• Give meaningful names to files
  – Avoid spaces in names, don’t use upper case
  – Version your files (capture progress)
  – Use “middle” extensions
  – Include metadata in the name
• Not too good:
  – report.doc, notes.txt
  – myfile.dta, table2.xls
  – reg.do, test.do, results.
• Better:
  – usda_arms_2005_final_report_v200607.doc
  – usda_arms_results_v200706.dta, usda_farms_by_crop.xls
  – income_regression_v200706.do
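
A naming convention is easier to follow when a small helper builds the names. The sketch below is one possible reading of the rules above (lower case, no spaces, a version stamp, an optional "middle" extension). The function name `build_file_name`, its parameters, and the `vYYYYMM` version format are assumptions inferred from the examples on this slide, not a prescribed standard.

```python
import re
from datetime import date


def build_file_name(parts, extension, version_date=None, middle_extension=None):
    """Join descriptive parts into a lowercase, space-free, versioned file name.

    All names and the 'vYYYYMM' stamp are hypothetical conventions for
    illustration only.
    """
    cleaned = []
    for part in parts:
        # Lowercase, then replace runs of anything but letters/digits with "_"
        token = re.sub(r"[^a-z0-9]+", "_", str(part).lower()).strip("_")
        if token:
            cleaned.append(token)
    if not cleaned:
        raise ValueError("at least one descriptive part is required")

    name = "_".join(cleaned)
    if version_date is not None:
        # Version stamp as year and month, e.g. v200607
        name += "_v" + version_date.strftime("%Y%m")
    if middle_extension:
        # "Middle" extension placed before the real one, e.g. .out.dta
        name += "." + middle_extension.lower()
    return name + "." + extension.lower().lstrip(".")


print(build_file_name(["USDA", "ARMS", 2005, "final report"], "doc", date(2006, 7, 1)))
# prints: usda_arms_2005_final_report_v200607.doc
```

Generating names from their components keeps the metadata in the name consistent across a project, which is the point of the convention.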

Coding and naming conventions (2)

• Give meaningful names to variables
  – Not too good: tmp3, ag_exp2, v324
  – Better: valid_enterprise, agricultural_expenditure, s1q3
• Avoid complex code
• Comments, comments, comments!!
  – Make sure to include lots of comments in your source code
  – This is the best time to capture knowledge!
  – It also promotes replicability and will help you in a few months when you try to remember what you did
• Share source code, use peer review

Not so good code example

local mypath = "c:\data\anonymization\"
global data_in = "`mypath'" + "\" + "Demohh1000.dta"
global data_out = "`mypath'" + "\" + "Demohh1000.out.dta"
global threshold = 0.8
cd "`mypath'"
set more off
use $data_in, clear
tempfile temp
gen fk=1
gen wi=weight
collapse (sum) fk wi, by(town province marstat sex age)
gen pk=fk/wi
gen qk=1-pk
gen rk= (pk/qk) * log(1/pk) if fk==1
replace rk= (pk/(qk^2)) * ((pk*log(pk))+qk) if fk==2
replace rk=(pk/(2*(qk^3))) * (qk*(3*qk-2) - (2*pk^2)*log(pk)) if fk==3
#delimit ;
replace rk= (pk/fk) * (1+ (qk/(fk+1)) + ((2*qk^2) / ((fk+1)*(fk+2))) +
  ((6*qk^3) / ((fk+1)*(fk+2)*(fk+3))) +
  ((24*qk^4) / ((fk+1)*(fk+2)*(fk+3)*(fk+4))) +
  ((120*qk^5) / ((fk+1)*(fk+2)*(fk+3)*(fk+4)*(fk+5))) +
  ((720*qk^6) / ((fk+1)*(fk+2)*(fk+3)*(fk+4)*(fk+5)*(fk+6))) +
  ((5040*qk^7) / ((fk+1)*(fk+2)*(fk+3)*(fk+4)*(fk+5)*(fk+6)*(fk+7)))) if fk>3 ;
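
For reference, the expressions evaluated by this script can be written out as below. This is a direct transcription of the code, not an independent derivation. The symbol names are assumptions chosen to mirror the variables: $f_k$ is `fk` (the number of sample records in cell $k$ after the collapse), $w_k$ is `wi` (the sum of their weights), $p_k = f_k / w_k$ is `pk`, $q_k = 1 - p_k$ is `qk`, and $r_k$ is `rk`.

```latex
r_k =
\begin{cases}
\dfrac{p_k}{q_k}\,\log\dfrac{1}{p_k}
  & \text{if } f_k = 1,\\[2ex]
\dfrac{p_k}{q_k^{2}}\,\bigl(p_k \log p_k + q_k\bigr)
  & \text{if } f_k = 2,\\[2ex]
\dfrac{p_k}{2 q_k^{3}}\,\bigl(q_k (3 q_k - 2) - 2 p_k^{2} \log p_k\bigr)
  & \text{if } f_k = 3,\\[2ex]
\dfrac{p_k}{f_k} \displaystyle\sum_{j=0}^{7}
  \frac{j!\; q_k^{\,j}}{\prod_{i=1}^{j} (f_k + i)}
  & \text{if } f_k > 3.
\end{cases}
```

The $j = 0$ term of the sum is 1 (empty product), and the coefficients 1, 2, 6, 24, 120, 720, 5040 in the code are the factorials $1!$ to $7!$. A comment of this kind next to the code is exactly the sort of documentation the script lacks.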

Better code example

/**
 * Computes the disclosure risk at individual level
 *
 * @author John Anonymous ([email protected])
 * @version 2007.06
 * References:
 * - micro-Argus 4.1 manual, p27-25
 */

// Configuration
local mypath = "C:\data\anonymization\"
global data_in = "`mypath'" + "\" + "Demohh1000.dta"
global data_out = "`mypath'" + "\" + "Demohh1000.out.dta"
global threshold = 0.8

// Initialize (mypath is a local macro, so it is expanded with `' rather than $)
cd "`mypath'"
set more off

// Load the data
use $data_in, clear
tempfile temp

Canada RDC Project

• Consists of 14 Research Data Centres, 6 branch RDCs, and the Federal Research Data Centre in Ottawa
• Data provided by Statistics Canada
• RDCs are now connected through a high-speed secure network
• Project to adopt a DDI 3.0-based metadata framework for survey documentation and research work, and to sponsor development of tools
• ODaF providing technical assistance
• http://www.statcan.ca/english/rdc/index.htm

The Canada RDC Research Life Cycle
[Chuck Humphrey, University of Alberta]

Stages in the life cycle:
• Project Application
• Project Approval
• Project Creation
• Access to Data
• Generate Analysis Files
• Output Disclosure Analysis
• Research Communications

[Diagram grouping label: Managing Data Stages]

Metadata in Canada RDC

[Diagram labels: RDC; Producer, Analyst, Researcher; Original Survey, Master Survey, Virtual Survey, Research Output; Security (at both the entry and the exit of the RDC); Publication, Conferences; Other researchers, Policy Makers, General Public. The numbers on the diagram refer to the steps below.]

1. Producer makes the survey available
2. Analyst packages it for the RDC
3. Researcher gets access and reshapes the data
4. Researcher performs complex analysis
5. Researcher publishes results
6. Information flowing in/out and activities are controlled and monitored
7. Outside users get access to the research output
8. Analyst includes results, activity, and feedback, and reports to the producer

The information flow relies on metadata and also generates new information that must be captured!

Metadata Framework in Canada RDC

[Diagram: survey versions ORIGINAL, MASTER, VIRTUAL, and OUTPUT (Original Survey, Master Survey, Virtual Survey, Research Output, Publication, Conferences), linked by Repurpose and Disclosure steps. Other labels in the figure: Metadata Management (Storage, Query, Registry, Exchange); Virtual File System (Data Files); Security (Authorization, Authentication); i18n; Analysis; Report; Metadata Mining; Compare; 2.0 Editor; Question; Quality; Concepts; Resources; Legacy; SPSS, SAS, Stata; 2.0 / 3.0; DDI 3.0; Tables; Other; Version; Log; Group; Project Admin; Audit Logs; Communication; Collaborative; Intranet; Training; Documentation.]

NORC Data Enclave

• National Opinion Research Center
• Provides a secure environment within which authorized researchers can access sensitive microdata remotely from their offices or onsite
• Data from the National Institute of Standards and Technology’s (NIST) Technology Innovation Program (TIP), the Ewing Marion Kauffman Foundation, and the Economic Research Service at the US Department of Agriculture
• Possibly the first virtual data enclave
• http://dataenclave.norc.org

NORC Virtual Enclave

Benefits (1)

• Data documentation
  – Through good metadata practices, comprehensive documentation is available to researchers
• Preservation, integration and sharing of knowledge
  – The research process is captured and preserved in a harmonized format
  – Research knowledge becomes an integral part of the survey and is available to others
  – Producer gets feedback from data users (usage, quality issues)
  – Reduces duplication of effort and facilitates reuse

Benefits (2)

• Research outputs and dissemination
  – Facilitates production of research outputs
  – Facilitates dissemination and fosters broader visibility of research outputs
• Exchange of information
  – Metadata exchange between RDCs, producers, and librarians
  – Importance of public metadata for sensitive datasets
  – Facilitates data discovery (inside and outside the RDC)

Conclusions

• Metadata plays a crucial role in RDCs
• Benefits all stakeholders
  – Better use of the data (return on investment)
  – Improves research quality
  – Fosters production of high-quality data (more relevant and accurate) accompanied by comprehensive metadata
• Adopting good practices may mean changing the way you work
  – This requires good change management techniques and discipline
  – But the benefits are worth the effort