Moving Beyond Planning to Implementation: Open-Source Tools… Josh Young Ocean Sciences Meeting February 24, 2016
Scope
Imagine a project:
• that includes a well-thought-out and documented data management plan,
• and robust implementation of that plan throughout the project and beyond.
• This talk is not for that project; it is for the rest of us.
So why do we care about data management?
• Internal reasons: do good research, write papers, get tenure, win more grants.
• External reasons: public access & reproducibility
Risk of becoming dark data (Heidorn, 2008)
Why care about external access?
• Intangibles for an Investigator
  • Maybe someday I’ll benefit from someone else’s data
  • Maybe I’ll learn something through informal dialogue
  • Most science funding is from public resources and should/could be considered a public trust resource
  • Peer pressure
• Tangibles for an Investigator
  • Increased efficiency
  • My funders require it.
So why do we care about data management?
• Internal reasons: do good research, write papers, get tenure, win more grants.
• External reasons: greater impact
What is the DMRC & do we really need another Data Plan Project?
• Probably not
• The DMRC is not a Data Plan tool
• Unidata community requested help with implementation
• Therefore, the DMRC is primarily a curated list of tools for implementation
What the DMRC Offers
• Highlights requirements from funding agencies;
• Points to Best Practices developed by others in the Data Management space;
• Sorts available tools by best practice;
• Details available tools.
Requirements
• Highlight data management funding requirements from NASA, NOAA, NSF
• These are the agencies that fund our community, so we try to stay up to date, but remember that the agency-posted information is always the authority
Activity Best Practices & Possible Tools
Activity column based on DataONE Best Practices
What We Are Exploring
• Dataverse by Harvard
• Designed for sharing, archiving, and citing data
• Allows you to create a DOI
• Allows you to store and make data accessible in perpetuity
What We Are Exploring
Known Dataverse Characteristics:
• Largest single file limited to 10 GB
• No limit to number of files
• Users create their own Dataverse
• Designate private or public
• Open to data from all science disciplines
• Does not corrupt at least some software files (e.g. IDV bundles)
• FREE
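One practical consequence of the single-file limit is that a project may want to check its files before attempting a deposit. The following is a minimal sketch, not part of the talk and not a Dataverse API: the function names are hypothetical, and it assumes the 10 GB figure means 10 × 10^9 bytes, which a given installation may define differently.

```python
from pathlib import Path

# Assumption: the slide's 10 GB limit is read as 10 * 10**9 bytes.
# A given Dataverse installation may define or configure the cutoff differently.
MAX_FILE_BYTES = 10 * 10**9


def split_by_size_limit(sizes, limit=MAX_FILE_BYTES):
    """Split a {name: size_in_bytes} mapping into (ok, too_large) name lists."""
    ok, too_large = [], []
    for name, size in sorted(sizes.items()):
        # Files strictly larger than the limit are flagged.
        (too_large if size > limit else ok).append(name)
    return ok, too_large


def scan_directory(root):
    """Collect the sizes of all regular files under a (hypothetical) project folder."""
    root = Path(root)
    return {
        str(path.relative_to(root)): path.stat().st_size
        for path in root.rglob("*")
        if path.is_file()
    }


if __name__ == "__main__":
    # Illustrative sizes only; replace with scan_directory("my_project").
    example = {"radar_volume.nc": 12 * 10**9, "readme.txt": 2048}
    ok, too_large = split_by_size_limit(example)
    print("Ready to deposit:", ok)
    print("Need splitting or another home:", too_large)
```

Files flagged as too large would need to be split, compressed, or hosted elsewhere; which of those is appropriate depends on the data format.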
What We Are Exploring
Possible Dataverse Contributions:
• Description (providing DOIs)
• Sharing (access in perpetuity)
• Preservation (static copy in perpetuity)
• Cost (free), very suitable for projects that might otherwise become long-tail data
Activity Best Practices & Possible Tools
Activity column based on DataONE Best Practices
We Welcome Your Resource Suggestions!
• Please visit: http://goo.gl/forms/Ngp4Xu9nGr
Example Workflow Implementation
• Radar and Lidar data from the University of Wyoming King Air
• Millersville University Plains Elevated Convection at Night (PECAN) data
• North Carolina State University WRF North Atlantic Model Outputs
?
Part of a larger effort: Agile Data Curation
• Means taking implementable steps to improve data management for external access.
• Philosophically, it attempts to apply lessons from agile software development to data management.
Agile Curation Principles, 2nd Generation
(J. Young, K. Benedict, & C. Lenhardt, AGU 2015 Fall Meeting)
1) Delivery, access, use and citation of research data are the primary measures of success.
2) Maximize the impact of research data through the continuous integration of curation activities.
3) Support unanticipated needs for and uses of research data (and documentation) and develop flexible systems to capture new uses.
Agile Curation Principles, 2nd Generation continued
4) Make data open and accessible as early in the process as possible.
5) Encourage crowd-sourced / community feedback to improve and enhance the data. Provide basic metadata for data available early in the process even if the data are not finalized.
6) Identify key individuals in a research project who have the requisite motivation, knowledge, or ability to learn, and get out of their way.
Agile Curation Principles, 2nd Generation continued
7) Data creators and data curators should work closely throughout the data life story to ensure the most efficient and streamlined process.
8) Identify the most effective method(s) for maintaining close communication between the data creators and curators involved and use them.
9) Target the steady delivery of incremental improvements to research data discovery, access and use that is consistent with a sustainable level of effort and available funding.
Agile Curation Principles, 2nd Generation continued
10) Start with the basics and only make systems more complex as needed, while maintaining a low bar to entry.
11) Continuous attention to technical excellence and good design enhances agility.
12) Continuously develop a community of data providers, curators and users that participate in the evolution of the research data systems.
We Welcome Your Stories
• Please email: [email protected]
Balancing infrastructure development & scientific advancement to create sustainable, multidisciplinary solutions
M. Chan
NSF’s EarthCube
GOALS
• Advance science
• Meet grand challenges
• Leverage shared cyberinfrastructure technology
[Diagram labels: CyberInfrastructure, Science, RCNs, Building Blocks, Interactive Activities, End User Workshops, EC Committees]
Get Involved!
[Organization chart labels: Leadership Council, Science Committee, Technology & Architecture Committee, Council of Data Facilities, Liaison Team, Engagement Team, Office]
• Talk to EarthCube Participants!
• Attend EarthCube Workshops!
• Join the mailing list at earthcube.org
• Apply for funding (EC Travel Grants, Distinguished Lecturers)
• Follow on Twitter @earthcube