DwB – Data without Boundaries
Additional Workshop – Metadata
Standards
07/12/2011
1
DDI and the GSBPM
Data without Boundaries
Thematic Workshop on Metadata
EDDI11 - SND, Gothenburg, Sweden, Dec 7 2011
Joachim Wackerow
GESIS - Leibniz Institute for the Social Sciences
Goals of Data without Boundaries
• The DwB project aims at an integrated model for accessing official data:
– a model where the best solutions for access are available irrespective of national boundaries and
– flexible enough to fit national arrangements.
Description of Workshop
• DwB is an FP7 program aiming at developing an integrated model for accessing official data, irrespective of national boundaries.
– In particular, the project proposes to build agreements on standards between different stakeholders such as the Statistical Institutes, the Data Archives and the researchers, who are the final users.
• The thematic meeting will focus on metadata standards relevant for DwB: SDMX, DDI, and the GSBPM.
– Each standard will be described and related to the others, and the ongoing work directed at their articulation will be presented.
Purpose of GSBPM
• The original intention was for the GSBPM to provide a basis for statistical organizations to agree on standard terminology to aid their discussions on developing statistical metadata systems and processes.
• The GSBPM should therefore be seen as a flexible tool to describe and define the set of business processes needed to produce official statistics.
What can DDI do here?
• DDI is a combined informational/data and
process model
• Probably more related to the GSIM
– But GSIM doesn't exist in detail yet
• Now we can try to look at relationships
between process phases of GSBPM and DDI
modules/parts
Quality Management / Metadata Management
1 Specify Needs
– 1.1 Determine needs for information
– 1.2 Consult & confirm needs
– 1.3 Establish output objectives
– 1.4 Identify concepts
– 1.5 Check data availability
– 1.6 Prepare business case
2 Design
– 2.1 Design outputs
– 2.2 Design variable descriptions
– 2.3 Design data collection methodology
– 2.4 Design frame & sample methodology
– 2.5 Design statistical processing methodology
– 2.6 Design production systems & workflow
3 Build
– 3.1 Build data collection instrument
– 3.2 Build or enhance process components
– 3.3 Configure workflows
– 3.4 Test production system
– 3.5 Test statistical business process
– 3.6 Finalize production system
4 Collect
– 4.1 Select sample
– 4.2 Set up collection
– 4.3 Run collection
– 4.4 Finalize collection
5 Process
– 5.1 Integrate data
– 5.2 Classify & code
– 5.3 Review, validate & edit
– 5.4 Impute
– 5.5 Derive new variables & statistical units
– 5.6 Calculate weights
– 5.7 Calculate aggregates
– 5.8 Finalize data files
6 Analyse
– 6.1 Prepare draft outputs
– 6.2 Validate outputs
– 6.3 Scrutinize & explain
– 6.4 Apply disclosure control
– 6.5 Finalize outputs
7 Disseminate
– 7.1 Update output systems
– 7.2 Produce dissemination products
– 7.3 Manage release of dissemination products
– 7.4 Promote dissemination products
– 7.5 Manage user support
8 Archive
– 8.1 Define archive rules
– 8.2 Manage archive repository
– 8.3 Preserve data and associated metadata
– 8.4 Dispose of data & associated metadata
9 Evaluate
– 9.1 Gather evaluation inputs
– 9.2 Conduct evaluation
– 9.3 Agree action plan
Combining standards?
Main Differences between GSBPM and DDI
• The GSBPM places data archiving at the end of the process, after the analysis phase. In the DDI model, archiving may also form the end of processing within a specific organization, but a key difference is that the DDI model is not necessarily limited to processes within one organization. Steps such as “Data Analysis” and “Repurposing” may be carried out by organizations other than the one that collected the data.
• The DDI model replaces the dissemination phase with “Data Distribution”, which takes place before the analysis phase. This reflects a difference in focus between the research and official statistics communities, with the latter putting a stronger emphasis on disseminating data, rather than research based on data disseminated by others.
• The DDI model contains the process of “Repurposing”, defined as the secondary use of a data set, or the creation of a real or virtual harmonized data set. This generally refers to some re-use of a data set that was not originally foreseen in the design and collect phases. In the GSBPM this is covered in phase 1 (Specify Needs), where there is a sub-process to check the availability of existing data and use them wherever possible. It is also reflected in the data integration sub-process within phase 5 (Process).
• The DDI model has separate phases for data discovery and data analysis, whereas these functions are combined within phase 6 (Analyse) in the GSBPM. In some cases, elements of the GSBPM analysis phase may also be covered in the DDI “Data Processing” phase, depending on the extent of analytical work prior to the “Data Distribution” phase.
Main Differences
between GSBPM and DDI
GSBPM
• Data archiving at the end of
the process, after the analysis
phase
• Stronger emphasis on
dissemination
• Availability of existing data in
Specify Needs (1), data
integration in Process (5)
• Analysis (6), combined
DDI
• Similar, but not necessarily limited to processes within one organization
• Data distribution and research based on data disseminated by others
• Repurposing - re-use of a data-set not foreseen in the design and collect phases
• Separate phases for data discovery and data analysis, and for data processing
GSBPM / DDI Top Level Relationships
GSBPM phase: DDI Life Cycle Model
1 Specify Needs: Study Concept; Repurposing (part)
2 Design: (none listed)
3 Build: (none listed)
4 Collect: Data Collection
5 Process: Data Processing (mostly); Repurposing (part)
6 Analyse: Data Discovery; Data Analysis; Data Processing (part)
7 Disseminate: Data Distribution
8 Archive: Data Archiving
9 Evaluate: (none listed)
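As a rough illustration, the top-level relationships above can be written down as a lookup table. This is only a sketch in Python: the names `GSBPM_TO_DDI` and `gsbpm_phases_for` are made up here, the qualifiers ("mostly", "part") are copied from the slide, and phases for which no DDI stage can be read off the slide are left empty as an assumption.

```python
from typing import Dict, List, Tuple

# GSBPM phase -> list of (DDI lifecycle stage, qualifier from the slide).
# Empty lists are an assumption: the slide shows no DDI stage for these phases.
GSBPM_TO_DDI: Dict[str, List[Tuple[str, str]]] = {
    "1 Specify Needs": [("Study Concept", ""), ("Repurposing", "part")],
    "2 Design": [],
    "3 Build": [],
    "4 Collect": [("Data Collection", "")],
    "5 Process": [("Data Processing", "mostly"), ("Repurposing", "part")],
    "6 Analyse": [("Data Discovery", ""), ("Data Analysis", ""),
                  ("Data Processing", "part")],
    "7 Disseminate": [("Data Distribution", "")],
    "8 Archive": [("Data Archiving", "")],
    "9 Evaluate": [],
}


def gsbpm_phases_for(ddi_stage: str) -> List[str]:
    """Return the GSBPM phases that relate (fully or partly) to a DDI stage."""
    return [phase for phase, stages in GSBPM_TO_DDI.items()
            if any(name == ddi_stage for name, _ in stages)]
```

Reading the table in the reverse direction shows, for instance, that DDI "Repurposing" is split across two GSBPM phases.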
Quality Management
• Nothing particular for quality indicators.
• But structured metadata on a detailed level is a good basis.
Metadata Management
• Unique identifiers per agency and support for maintainable containers support metadata management.
GSBPM: 1 Specify Needs
• Study Concept - Repurposing (part)
1 Specify Needs
– 1.1 Determine needs for information
– 1.2 Consult & confirm needs
– 1.3 Establish output objectives
– 1.4 Identify concepts
– 1.5 Check data availability
– 1.6 Prepare business case
GSBPM: 2 Design
2 Design
– 2.1 Design outputs
– 2.2 Design variable descriptions
– 2.3 Design data collection methodology
– 2.4 Design frame & sample methodology
– 2.5 Design statistical processing methodology
– 2.6 Design production systems & workflow
What Is DDI I?
• An international specification for structured metadata
describing social, behavioral, and economic data
• A standardized framework to maintain and exchange
– This is especially true of longitudinal/repeat cross-
sectional studies
• This produces different versions of the metadata
• The metadata versions have to be maintained as they
change over time
– If you reference an item, it should not change: you
reference a specific version of the metadata item
DDI Support for Metadata Reuse
• DDI allows for metadata items to be identifiable
– They have unique IDs
– They can be re-used by referencing those IDs
• DDI allows for metadata items to be published
– The items are published in resource packages
• Metadata items are maintainable
– They live in “schemes” (lists of items of a single type) or in “modules” (metadata for a specific purpose or stage of the lifecycle)
– All maintainable metadata has a known owner or agency
• Maintainable metadata can be versionable
– This reflects changes over time
– The versionable metadata has a version number
DDI Support for Comparison
• For data which is completely the same, DDI provides a way of showing comparability: Grouping
– These things are comparable “by design”
– This typically includes longitudinal/repeat cross-sectional studies
• For data which may be comparable, DDI allows for a statement of what the comparable metadata items are: the Comparison module
– The Comparison module provides the mappings between similar items (“ad-hoc” comparison)
– Mappings are always context-dependent (e.g., they are sufficient for the purposes of particular research, and are only assertions about the equivalence of the metadata items)
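A comparison mapping of this kind could be sketched as follows. This is a hypothetical Python illustration, not the schema of the DDI Comparison module: the `ItemMap` class, its fields, and the item identifiers are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ItemMap:
    """An assertion that two metadata items are comparable in a given context."""
    source_id: str
    target_id: str
    context: str  # the research purpose for which the mapping is considered sufficient


def comparable_items(maps: List[ItemMap], source_id: str, context: str) -> List[str]:
    """Return items asserted comparable to source_id, but only for this context."""
    return [m.target_id for m in maps
            if m.source_id == source_id and m.context == context]


# Hypothetical identifiers: two education variables from different studies.
maps = [
    ItemMap("studyA:EDUC", "studyB:SCHOOLING", context="coarse education level"),
]
```

Keeping the context on each mapping means that an equivalence asserted for one research purpose is not silently reused for another.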
DDI 3 Lifecycle Model and Related Modules
Study Unit
Data Collection
Logical Product
Physical Data Product
Physical Instance
Archive
Groups and Resource Packages are a means of publishing any portion or combination of sections of the life cycle.
Local Holding Package
XML Schemas, DDI Modules, and DDI Schemes
• XML Schemas (<file>.xsd) may correspond to DDI Modules
• DDI Modules may contain DDI Schemes
• DDI Modules correspond to a stage in the lifecycle
DDI Instance
Citation
Coverage
Other Material / Notes
Translation Information
Study Unit
Group
Resource Package
3.1 Local Holding Package
Citation / Series Statement
Abstract / Purpose
Coverage / Universe / Analysis Unit / Kind of Data
Other Material / Notes
Depository Study Unit OR Group Reference: [A reference to the stored version of the deposited study unit.]
Local Added Content: [This contains all content available in a Study Unit whose source is the local archive.]
Citation / Series Statement
Abstract / Purpose
Coverage / Universe
Other Material / Notes
Funding Information / Embargo
DDI Schemes: Purpose
• A maintainable structure that contains a list of versionable things
• Supports registries of information such as concept, question and variable banks that are reused by multiple studies or are used by search systems to locate information across a collection of studies
• Supports a structured means of versioning the list
• May be published within Resource Packages or within DDI modules
• Serve as component parts in capturing reusable metadata within the life-cycle of the data
Building from Component Parts
UniverseScheme
ConceptScheme
CategoryScheme
CodeScheme
QuestionScheme
Instrument
Variable Scheme
NCube Scheme
ControlConstructScheme
LogicalRecord
RecordLayout Scheme [Physical Location]
PhysicalInstance
Versioning and Maintenance
• There are three classes of objects:
– Identifiable (has ID)
– Versionable (has version and ID)
– Maintainable (has agency, version, and ID)
• Very often, identifiable items such as Codes
and Variables are maintained in parent
schemes
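The three classes of objects can be pictured as a small inheritance chain. The following Python sketch is only an illustration under assumptions: the class names mirror the terms on the slide, while `Variable`, `VariableScheme`, and the colon-separated form produced by `qualified_name` are hypothetical and are not the DDI URN syntax.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Identifiable:
    id: str


@dataclass
class Versionable(Identifiable):
    version: str


@dataclass
class Maintainable(Versionable):
    agency: str


@dataclass
class Variable(Identifiable):
    """Following the slide, a variable only carries its own ID."""
    label: str


@dataclass
class VariableScheme(Maintainable):
    """A parent scheme supplies agency and version for the items it holds."""
    variables: List[Variable] = field(default_factory=list)


def qualified_name(scheme: Maintainable, item: Identifiable) -> str:
    # Hypothetical display form; agency and version come from the parent scheme.
    return f"{scheme.agency}:{scheme.id}:{scheme.version}:{item.id}"
```

An item that is only identifiable thus becomes unambiguous through the agency and version of the scheme that maintains it.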
Maintenance Rules
• A maintenance agency is identified by a reserved code based on its domain name (similar to its website and e-mail)
– There is a register of DDI agency identifiers which we will look at later in the course
• Maintenance agencies own the objects they maintain
– Only they are allowed to change or version the objects
• Other organizations may reference external items in their own schemes, but may not change those items
– You can make a copy which you change and maintain, but once you do that, you own it!
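These ownership rules could be enforced along the following lines. This is a minimal Python sketch, assuming a hypothetical `Scheme` record and helper functions; the agency codes and the starting version "1.0.0" for a copy are illustrative choices, not taken from the DDI specification.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Scheme:
    agency: str
    id: str
    version: str
    items: Tuple[str, ...]


def new_version(scheme: Scheme, acting_agency: str, version: str,
                items: Iterable[str]) -> Scheme:
    """Only the owning agency may issue a changed version of a scheme."""
    if acting_agency != scheme.agency:
        raise PermissionError(
            f"{acting_agency} does not own {scheme.agency}:{scheme.id}")
    return replace(scheme, version=version, items=tuple(items))


def copy_as_own(scheme: Scheme, new_agency: str, new_id: str) -> Scheme:
    """Copying creates a separate object that the copying agency now owns."""
    return replace(scheme, agency=new_agency, id=new_id, version="1.0.0")
```

Because `Scheme` is frozen, the original object is never modified; a change always yields a new object, either as a new version by the owner or as a copy under another agency.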
Publication in DDI
• There is a concept of “publication” in DDI which is important for maintenance, versioning, and re-use
• Metadata is “published” when it is exposed outside the agency which produced it, for potential re-use by other organizations or individuals
– Once published, agencies must follow the versioning rules
– Internally, organizations can do whatever they want before publication
• Note that an “agency” can be an organization, a department, a project, or even an individual for DDI purposes
– It must be described in an Organization Scheme, however!
• There is an attribute on maintainable objects called “isPublished” which must be set to “true” when an object is published (it defaults to “false”)
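The effect of the publication flag can be sketched as below. This Python example is a hedged illustration: `MaintainableObject`, `publish`, and `revise` are hypothetical names, and the only versioning rule modelled is the assumption that a published object may change only under a new version number.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MaintainableObject:
    agency: str
    id: str
    version: str
    content: str
    is_published: bool = False  # mirrors the "isPublished" default of "false"


def publish(obj: MaintainableObject) -> MaintainableObject:
    obj.is_published = True
    return obj


def revise(obj: MaintainableObject, content: str,
           new_version: Optional[str] = None) -> MaintainableObject:
    """Change content freely before publication, under a new version after it."""
    if not obj.is_published:
        obj.content = content  # internal draft: edited in place
        return obj
    if new_version is None or new_version == obj.version:
        raise ValueError("a published object may only change under a new version")
    # The published version stays untouched; the new version starts unpublished.
    return MaintainableObject(obj.agency, obj.id, new_version, content)
```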
A study is born
Dagstuhl event 11382, Sept. 2011, Hoyle and Wackerow, 9/19/2011
Checksum of Study Design Document
Could be Archived
Multiple Collection Processes Begin
Processing – (e.g. Data Cleaning,
Restructuring, Recoding)
Initial Data are Archived
Initial Distribution
Initial Distribution – Possibly From
Archive
Initial Data Discovery
Initial Data Analysis
Initial Data Analysis and Data Archived
Publications – Reference and
Referenced by Archive
SECOND WAVE – Revised Concept
SECOND WAVE – Data Collection
SECOND WAVE – Data Processing
SECOND WAVE – Processing Uses Feedback from Stage 1
• Here something learned in the initial distribution affects future processing. This should be recorded.
• These metadata flows may happen between many stages, e.g. from processing to later collection.
SECOND WAVE – Distribution
SECOND WAVE – Discovery
Final Analysis Archived
A Kansan's Cyclone View
Gantt View – Initial Design
• Much of this movement of data between stages is planned from the beginning of the project.
Gantt with Data Flow (Blue)
Gantt With Planned Data and Metadata Flow
• Metadata are generated as data move through the project, as well as before any data are gathered.
Gantt – Collection Changes Project Concept
• Some metadata are unanticipated. Here something learned during the first collection phase causes a reconceptualization.
Gantt – Discovery Changes Future Collection
• Here something learned during discovery changes future collection.
Representing Longitudinal Data in DDI (Extract)
Columns: Level, Dimension, Description, DDI Tag(s)

Level: Project/Study (highest level)
Dimension: Management
Description: Executive control, scientific leadership, funding, etc.
DDI Tag(s): Group; Citation; Purpose; Abstract; FundingInformation; Archive Module; Organization; Individual; Role (research, management, funding, etc.); Location; Email; Telephone

Dimension: Access
Description: How to obtain data and any restrictions on access
DDI Tag(s): Group/Subgroup/StudyUnit; Archive Module; AccessConditions; AccessPermissions; ConfidentialityStatement; Restrictions; LifecycleInformation

Level: Longitudinal Survey
Dimension: Sample Design and Procedures
Description: Universe: population being sampled; refreshment strategy; replacement strategy; potential errors
DDI Tag(s): Group; ConceptualComponents; Universe; Concept; DataCollection; Methodology; SamplingProcedure; DeviationFromSampleDesign; ActionToMinimizeLosses
Reuse
Old study
My study
Generic Longitudinal
Business Process Model
Combined Process and Analyse Phase
Additional Phase Research/Publish
Circle View
Acknowledgements
• Arofan Gregory and Wendy Thomas
– Core collection of DDI slides
• Larry Hoyle
– Managing Metadata for Longitudinal Data (2011)
• Steven Vale
– Generic Statistical Business Process Model (METIS 2009)
– Exploring the relationship between DDI, SDMX and the Generic Statistical Business Process Model (EDDI 2010)
• Dagstuhl Workshop on Longitudinal Data 2011
– Working Group on GLBPM, Ingo Barkow, Jay Greenfield, Arofan Gregory, Marcel Hebing, Larry Hoyle, Wolfgang Zenk-Möltgen
• DDI Alliance Working Paper Series. Best Practices for Longitudinal Data http://www.ddialliance.org/resources/publications/working/BestPractices/LongitudinalData