Speeding Science Solutions for Data Curation from Microsoft (Research) Lee Dirks Director, Education & Scholarly Communication External Research Division Microsoft Corporation

Mar 26, 2015


Mia Blevins
Transcript
Page 1: Speeding Science Solutions for Data Curation from Microsoft (Research) Lee Dirks Director, Education & Scholarly Communication External Research Division.

Speeding Science

Solutions for Data Curation from Microsoft (Research)

Lee Dirks, Director, Education & Scholarly Communication

External Research Division, Microsoft Corporation

Page 2:

• Division within Microsoft Research focused on partnerships between academia, industry and government to advance computer science, education, and research in fields that rely heavily upon advanced computing
• Supporting groundbreaking research to help advance human potential and the wellbeing of our planet
• Developing advanced technologies and services to support every stage of the research process
• Microsoft External Research is committed to interoperability and to providing open access, open tools, and open technology

Page 3:

Mission: Optimize and extend Microsoft software to meet the specific needs of the academic community

Our approach:

Conduct applied projects to enhance academic productivity by evolving Microsoft’s scholarly communication offerings

Microsoft External Research is uniquely positioned to drive this initiative across Microsoft

Page 4:

This work is licensed under a Creative Commons Attribution 3.0 United States License.

The Scholarly Communication Lifecycle

Collaboration: SharePoint, LiveMeeting, Office Live

Office OpenXML, XPS Format, SQL Server & Entity Framework, Rights Management, Data Protection Manager

Office 2010: Word, PowerPoint, Excel, OneNote

Tablet PC/UMPC

Word 2010 + PowerPoint 2010, WPF & Silverlight

“Sea Dragon” / “PhotoSynth” / “Deep Zoom”

Excel 2010, Windows Server HPC, “Astoria” / “Pop Fly”

Discoverability: FAST, MSR Academic Search, “Bookweb”

SharePoint 2010

Page 5:

• Interoperability is essential
– Actively lobby and drive for consensus around technical standards and standardized protocols proactively adopted by the community; enable broad community engagement

• Customers have told Microsoft that interoperability is OUR responsibility

• Leverage existing community protocols, practices, guidelines, etc.
– Example: metadata conventions / taxonomies / ontologies, a traditional strength for libraries and a critical component in enabling Web 2.0

• Optimize for data-driven research
– To both data (scientific) and to information (scholarly publications)
– Reproducible research + computational science
– Properly document / annotate scholarly output

• Data preservation (and provenance) should be baseline
– Documentation of the data’s provenance
– Preservation needs to be like “accessibility” features, i.e., assumed as required

• Semantic knowledge discovery & social networking
– Harnessing collective intelligence must be a consideration, since accessing research is a core step in the life-cycle. Enable knowledge discovery
– Optimize for Web 2.0 scenarios and allow end-users/experts to find things more easily

Page 6:

Open Science

Open Access Open Source

Open Data

http://www.microsoft.com/interop/

“In order to help catalyze and facilitate the growth of advanced CI, a critical component is the adoption of open access policy for data, publications and software.”

NSF Advisory Committee on Cyberinfrastructure (ACCI)

Microsoft Interoperability Principles
• Open Connections to Microsoft Products
• Support for Standards
• Data Portability
• Open Engagement

Page 7:

DataCite is an international consortium to establish easier access to scientific research data on the Internet, to increase acceptance of research data as legitimate, citable contributions to the scientific record, and to support data archiving that will permit results to be verified and re-purposed for future study.

The Open Planets Foundation has been established to provide practical solutions and expertise in digital preservation, building on the €15 million investment made by the European Union and Planets consortium. OPF members benefit from the Planets results, new developments and the growing OPF community that includes experts at some of the most prestigious research, technology and memory institutions in Europe.

The Confederation of Open Access Repositories (COAR) is a not-for-profit association of repository initiatives launched in October 2009. It aims to enhance the visibility and application of research outputs through global networks of Open Access digital repositories.

The Coalition for Networked Information (CNI) is an organization dedicated to supporting the transformative promise of networked information technology for the advancement of scholarly communication and the enrichment of intellectual productivity. Membership includes some 200 institutions representing higher education, publishing, network and telecommunications, information technology, and libraries and library organizations.

ICSTI, the International Council for Scientific and Technical Information, offers a unique forum for interaction between organizations that create, disseminate and use scientific and technical information. ICSTI's mission cuts across scientific and technical disciplines, as well as international borders, to give member organizations the benefit of a truly global community.

CrossRef is a not-for-profit membership association whose mission is to enable easy identification and use of trustworthy electronic content by promoting the cooperative development and application of a sustainable infrastructure. CrossRef's general purpose is to promote the development and cooperative use of new and innovative technologies to speed and facilitate scholarly research.

Page 8:

Page 9:

Source code and binary: http://GenepatternWordAddin.codeplex.com

Services: Connects to GenePattern database

Data: Resulting data (and provenance) stored within Word document

Data: Control and execute query pipelines into GenePattern

Relationships: Inline graphics are synchronized to dataset

Page 10:

Intent: Insert Creative Commons licenses from within Office 2007

Relationships: License information stored as RDF XML within the document OOXML

Source code and binary: http://ccaddin2007.codeplex.com

Services: Integrates with Creative Commons Web API to create new licenses
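
The storage idea on this slide can be illustrated with a small sketch. An Office Open XML file is a ZIP package, and custom XML parts conventionally sit under `customXml/`. The Python below builds a tiny stand-in package in memory and reads a license URI back out of an RDF/XML part. The part name `customXml/item1.xml`, the helper names, and the exact RDF layout (a `cc:license` element with an `rdf:resource` attribute) are illustrative assumptions, not the add-in's documented behavior.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CC_NS = "http://creativecommons.org/ns#"

# Hypothetical RDF/XML payload; the real add-in may lay this out differently.
LICENSE_RDF = (
    f'<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:cc="{CC_NS}">'
    '<cc:Work rdf:about="">'
    '<cc:license rdf:resource="http://creativecommons.org/licenses/by/3.0/us/"/>'
    "</cc:Work>"
    "</rdf:RDF>"
)


def build_sample_package() -> bytes:
    """Create a minimal ZIP that mimics the shape of a .docx package.

    This is a stand-in for illustration, not a valid Word document.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", "<document/>")
        package.writestr("customXml/item1.xml", LICENSE_RDF)
    return buffer.getvalue()


def find_license_uris(package_bytes: bytes) -> list[str]:
    """Scan every XML part under customXml/ for cc:license resources."""
    uris = []
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        for name in package.namelist():
            if not (name.startswith("customXml/") and name.endswith(".xml")):
                continue
            root = ET.fromstring(package.read(name))
            for element in root.iter(f"{{{CC_NS}}}license"):
                uri = element.get(f"{{{RDF_NS}}}resource")
                if uri:
                    uris.append(uri)
    return uris


license_uris = find_license_uris(build_sample_package())
print(license_uris)
```

Keeping the license as machine-readable RDF inside the package means a repository or indexer can discover the terms without opening the document in Word.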

Page 11:

• Phil Bourne
• Lynn Fink
• John Wilbanks

Source code and binary: http://research.microsoft.com/ontology/

Relationships: Ontology browser

Intent: Term recognition & disambiguation

Services: Ontology download web service

Page 12:

Binary (version 2.0): http://research.microsoft.com/authoring/

Relationships: ORE Resource Map creation

Structure: Read, convert, and author NLM XML documents

Structure: Client-side XML validation

Services: Repository deposit via SWORD

Relationships: Citation lookup and reference management

Page 13:

Page 14:

<?xml version="1.0" ?>
<cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C" x2="-2.9149999618530273" y2="0.7699999809265137" />
      <atom id="a2" elementType="C" x2="-1.5813208400249916" y2="1.5399999809265137" />
      <atom id="a3" elementType="O" x2="-0.24764171819695613" y2="0.7699999809265134" />
      <atom id="a4" elementType="O" x2="-1.5813208400249912" y2="3.0799999809265137" />
      <atom id="a5" elementType="H" x2="-4.248679083681063" y2="1.5399999809265137" />
      <atom id="a6" elementType="H" x2="-2.914999961853028" y2="-0.7700000190734864" />
      <atom id="a7" elementType="H" x2="-4.248679083681063" y2="-1.907348645691087E-8" />
      <atom id="a8" elementType="H" x2="1.0860374036310796" y2="1.5399999809265132" />
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1" />
      <bond atomRefs2="a2 a3" order="1" />
      <bond atomRefs2="a2 a4" order="2" />
      <bond atomRefs2="a1 a5" order="1" />
      <bond atomRefs2="a1 a6" order="1" />
      <bond atomRefs2="a1 a7" order="1" />
      <bond atomRefs2="a3 a8" order="1" />
    </bondArray>
  </molecule>
</cml>
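
Because the chemistry is stored as structured CML rather than as a picture, other programs can read it directly. As a minimal sketch, the Python below parses the atoms and bonds from the slide's CML fragment and derives a molecular formula. The 2D coordinates are left out for brevity, and the helper names `element_counts` and `hill_formula` are illustrative, not part of Chem4Word.

```python
import xml.etree.ElementTree as ET
from collections import Counter

CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Atoms and bonds from the slide's CML fragment (x2/y2 coordinates omitted).
CML_TEXT = """<cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C" />
      <atom id="a2" elementType="C" />
      <atom id="a3" elementType="O" />
      <atom id="a4" elementType="O" />
      <atom id="a5" elementType="H" />
      <atom id="a6" elementType="H" />
      <atom id="a7" elementType="H" />
      <atom id="a8" elementType="H" />
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1" />
      <bond atomRefs2="a2 a3" order="1" />
      <bond atomRefs2="a2 a4" order="2" />
      <bond atomRefs2="a1 a5" order="1" />
      <bond atomRefs2="a1 a6" order="1" />
      <bond atomRefs2="a1 a7" order="1" />
      <bond atomRefs2="a3 a8" order="1" />
    </bondArray>
  </molecule>
</cml>"""


def element_counts(cml_text: str) -> Counter:
    """Count atoms per element symbol in a CML document."""
    root = ET.fromstring(cml_text)
    atoms = root.findall(".//cml:atom", CML_NS)
    return Counter(atom.get("elementType") for atom in atoms)


def hill_formula(counts: Counter) -> str:
    """Format counts in Hill order: C, then H, then the rest alphabetically.

    Without carbon, all symbols are simply sorted alphabetically.
    """
    if "C" in counts:
        leading = [symbol for symbol in ("C", "H") if symbol in counts]
        ordered = leading + sorted(s for s in counts if s not in ("C", "H"))
    else:
        ordered = sorted(counts)
    return "".join(f"{s}{counts[s] if counts[s] > 1 else ''}" for s in ordered)


counts = element_counts(CML_TEXT)
formula = hill_formula(counts)
double_bonds = [
    bond.get("atomRefs2")
    for bond in ET.fromstring(CML_TEXT).findall(".//cml:bond", CML_NS)
    if bond.get("order") == "2"
]
print(formula, double_bonds)
```

The fragment describes a two-carbon molecule with one C=O double bond, which is the kind of fact a validity check or a chemistry-aware search could extract from the document.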

Relationships: Navigate and link referenced chemistry

• Peter Murray-Rust
• Joe Townsend
• Jim Downing

Available soon: http://research.microsoft.com/chem4word/

Data: Semantics stored in Chemistry Markup Language

Intent: Recognizes chemical dictionary and ontology terms

Author/edit 1D and 2D chemistry. Change chemical layout styles.

Intelligence: Verifies validity of authored chemistry

Page 15:

Organize collection of individual workflow activities

Author, Execute and Monitor Workflows

Available now: http://research.microsoft.com/collaboration/tools/trident.aspx

Compose and modify workflows via drag & drop canvas

View data products, performance metrics, and provenance data

Page 16:

Page 17:

• The Windows Azure platform offers a flexible, familiar environment for developers to create cloud applications and services. With Windows Azure, you can shorten your time to market and adapt as demand for your service grows. Windows Azure offers a platform that is easily implemented alongside your current environment.

• Offerings:
– Windows Azure: operating system as an online service
– Microsoft SQL Azure: fully relational cloud database solution
– Windows Azure platform AppFabric: connects cloud services and on-premises applications
– Microsoft Codename “Dallas”: information marketplace for data and web services

Page 18:

• Microsoft "Dallas" is a service allowing developers and information workers to easily discover, purchase, and manage premium data subscriptions in the Windows Azure platform.
– Dallas is an information marketplace that brings data, imagery, and real-time web services from leading commercial data providers and authoritative public data sources together into a single location, under a unified provisioning and billing framework.
– Dallas APIs allow developers and information workers to consume this premium content with virtually any platform, application or business workflow.
– More: http://www.microsoft.com/windowsazure/dallas/

Page 19:

• Excel Calculation Services (ECS) is the "engine" of Excel Services that loads the workbook, calculates in full fidelity with Microsoft Office Excel 2007, refreshes external data, and maintains sessions.

• Excel Web Access (EWA) is a Web Part that displays the Microsoft Office Excel workbook in a browser and enables interaction with it by using Dynamic HTML (DHTML) and JavaScript, without the need to download ActiveX controls to the client computer. It can be connected to other Web Parts on dashboards and other Web Part Pages.

• Excel Web Services (EWS) is a Web service hosted in Microsoft Office SharePoint Services that provides several methods that a developer can use as an application programming interface (API) to build custom applications based on the Excel workbook.

• More: http://msdn.microsoft.com/en-us/library/ms546696.aspx

Page 20:

• What is it?
– The Open Data Protocol (OData) is a Web protocol for querying and updating data that provides a way to unlock your data and free it from silos that exist in applications today. OData does this by applying and building upon Web technologies such as HTTP, Atom Publishing Protocol (AtomPub) and JSON to provide access to information from a variety of applications, services, and stores. The protocol emerged from experiences implementing AtomPub clients and servers in a variety of products over the past several years.
– OData is being used to expose and access information from a variety of sources including, but not limited to, relational databases, file systems, content management systems and traditional Web sites.
– OData is consistent with the way the Web works: it makes a deep commitment to URIs for resource identification and commits to an HTTP-based, uniform interface for interacting with those resources (just like the Web). This commitment to core Web principles allows OData to enable a new level of data integration and interoperability across a broad range of clients, servers, services, and tools.
– OData is released under the Open Specification Promise to allow anyone to freely interoperate with OData implementations.

• Find out more
– http://odata.org & http://msdn.com/data
– Contact Pablo Castro ([email protected]) / Blog: http://blogs.msdn.com/pablo
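
To make the "URIs plus a uniform HTTP interface" point concrete, here is a minimal sketch of how an OData query is expressed entirely in a URL. It composes a request address from a service root, an entity set, and system query options such as `$filter`, `$top`, and `$format`. The service root `https://example.org/odata.svc`, the `Datasets` entity set, the `Year` property, and the helper name `build_odata_url` are hypothetical placeholders, and no request is actually sent.

```python
from urllib.parse import quote


def build_odata_url(service_root: str, entity_set: str, **options) -> str:
    """Compose an OData query URL.

    Each keyword becomes a system query option, so filter="..." is emitted
    as $filter=... with the value percent-encoded.
    """
    url = f"{service_root.rstrip('/')}/{entity_set}"
    query = "&".join(
        f"${name}={quote(str(value), safe='')}" for name, value in options.items()
    )
    return f"{url}?{query}" if query else url


# Hypothetical service and entity set, used only to show the URL shape.
url = build_odata_url(
    "https://example.org/odata.svc",
    "Datasets",
    filter="Year ge 2009",
    top=5,
    format="json",
)
print(url)
```

Because the whole query lives in the address, any HTTP client (a browser, a script, or a spreadsheet) could issue the same request with a plain GET.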

Page 21:

• The Open Government Data Initiative (OGDI) is a cloud-based collection of software assets that enables publicly available government data to be easily accessible. Using open standards and application programming interfaces (APIs), developers and government agencies can retrieve the data programmatically for use in new and innovative online applications, or mash-ups, that can help:
– Improve citizen services
– Enhance collaboration between government agencies and private organizations
– Increase government transparency

• OGDI promotes the use of this data by capturing and publishing re-usable software assets, patterns, and practices. The data repository already holds over 60 different government datasets that are readily available for use in new applications, and is continuously updated with additional government datasets.

• More: http://www.microsoft.com/industry/government/opengovdata/

Page 22:

• In partnership with the California Digital Library’s Curation Center
– In collaboration with Tricia Cruse & John Kunze
– Part of DataONE (an NSF DataNet project)

PROPOSED

Page 23:

Proposed functionality under consideration:
• Support for versioning, so that revision history and the original raw data can be easily protected and recovered
• Standardized date/time stamps, so that researchers can easily determine when the data were created and last updated
• A “workbook builder” allowing researchers to select from globally shared standardized layouts for capturing data
• Ability to export metadata in a standard format (e.g., a DataCite citation or an EML document that describes the dataset(s) in a workbook), so that researchers can readily share their data
• Ability to select from a globally shared vocabulary of terms for data descriptions (e.g., column names) and, as needed, to add new terms to that vocabulary, to enable wide collaboration between researchers
• Ability to import term descriptions from the shared vocabulary and annotate them locally to refine their definitions as used in the dataset
• “Speed bumps” to discourage use of macros and customizations that would impede interoperation of data imported from Excel into other applications
• Ability to deposit data and metadata directly into a data archive, to enable compliance with funding agency requirements to preserve and publish research data
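
Two of the proposed features, standardized time stamps and metadata export as a citation, can be sketched in a few lines. The Python below formats a time stamp as ISO 8601 in UTC and renders a citation in the layout "Creator (PublicationYear): Title. Publisher. Identifier", which is one commonly used DataCite-style form. The record fields, the sample values, and the function names are hypothetical; nothing here reflects how the proposed Excel add-in would actually be built.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class DatasetRecord:
    """Minimal descriptive metadata for one dataset (illustrative fields)."""
    creators: list[str]
    publication_year: int
    title: str
    publisher: str
    identifier: str


def format_citation(record: DatasetRecord) -> str:
    """Render 'Creator (PublicationYear): Title. Publisher. Identifier'."""
    creators = "; ".join(record.creators)
    return (
        f"{creators} ({record.publication_year}): {record.title}. "
        f"{record.publisher}. {record.identifier}"
    )


def utc_timestamp(moment: datetime) -> str:
    """Return an ISO 8601 time stamp in UTC, for example 2010-06-01T12:30:00Z."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Hypothetical sample values, not a real dataset or DOI.
record = DatasetRecord(
    creators=["Example, A.", "Sample, B."],
    publication_year=2010,
    title="Hypothetical Field Measurements",
    publisher="Example Data Archive",
    identifier="doi:10.1234/example",
)
citation = format_citation(record)
created = utc_timestamp(datetime(2010, 6, 1, 12, 30, 0, tzinfo=timezone.utc))
print(citation)
print(created)
```

Recording time stamps in a single unambiguous format and generating the citation from structured fields both reduce the manual cleanup needed before a workbook is shared or deposited.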

PROPOSED

Page 24:

Lee Dirks
Director, Education & Scholarly Communication

Microsoft External Research
[email protected]

http://research.microsoft.com/people/ldirks

URL: http://www.microsoft.com/scholarlycomm/
Facebook: Scholarly Communication at Microsoft