SPIRES and INSPIRE Travis Brooks SLAC National Accelerator Laboratory INSPIRE Collaboration PPA Computing 1 July 2010

Dec 17, 2015

Transcript
Page 1:

SPIRES and INSPIRE

Travis Brooks

SLAC National Accelerator Laboratory

INSPIRE Collaboration

PPA Computing

1 July 2010

Page 2:

Infrastructure

• The basic facilities, services and installations needed for the functioning of a community or society (wiktionary.org)

Page 3:

Community

• ~30,000 researchers worldwide

• Questions like:

• What is the universe made of?

• What happened 3µs after the Big Bang?

• Distinction between Theory and Experiment

Page 4:

~15,000 HEP scientists smash stuff at the speed of light to produce new stuff

Page 5:

…and it works!

LHC re-discovering known particles for starters.

First needles in the haystack: one in a million.

Page 6:

Another 15,000 HEP researchers scratch their heads to make sense of all that stuff and then some more

Page 7:

Community

• Experiment

• Large, global collaborations (> 2000 authors!)

• Big centers of research distributed globally

• SLAC, Fermilab, CERN, DESY, KEK

• Theory

• Small, but global collaborations (avg 3 authors)

• Self-contained papers

Page 9:

1960’s - 1970’s

• HEP Lab Libraries store paper preprints

• Distributed via postal mail to major centers

• “Institute-pays” Open Access

• “SPIRES” catalogs (and distributes) preprints received at SLAC

• Centralized, community-driven model

• Users query SPIRES via terminal login accounts

Page 10:

1990’s

CERN Invents WWW

Users query SPIRES at SLAC via 1st Web Site in the U.S.

Page 11:

1991: arXiv.org

Page 12:

Preprint Culture

• Connections/trust/expertise

• Infrastructure from Labs

• SPIRES, WWW, arXiv

• Researcher desire for rapid communication                 

Page 13:

2000’s

• 2007 survey of 2,000 physicists by CERN, DESY, Fermilab and SLAC: “What is your primary HEP Information Resource?”

Gentil-Beccot et al., Information Resources in High-Energy Physics: Surveying the Present Landscape and Charting the Future Course. J.Am.Soc.Inf.Sci. 60:150-160, 2009; arXiv:0804.2701

Page 14:

• 97% of published literature freely available on arXiv

• No Mandates – No Debates

Page 15:

Researchers want speed

Page 16:

Researchers want speed

• SPIRES counts: citations to/from preprints/articles

• Citation peaks at publications

• Scientific discourse proceeds on discipline repository

Page 17:

Citation Advantage

• When an arXiv paper is published, it has already surpassed the citation count a non-arXiv paper will have after 2 years

Page 18:

Read Journals?

Gentil-Beccot et al., arXiv:0906.5418

∼30,000 clicks (choice between arXiv and journal):

• arXiv: 82%

• Publisher server: 18%

As many scientists as analyzed here go straight to arXiv, so 80% arXiv users becomes 90% arXiv users
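
The step from 80% to 90% on this slide can be checked with a line of arithmetic. A minimal sketch, assuming (as the slide suggests) that the scientists who go straight to arXiv are as numerous as those whose clicks were analyzed, and using the slide's rounded 80% figure; the function name `combined_arxiv_share` is illustrative only.

```python
def combined_arxiv_share(analyzed_share: float, direct_ratio: float = 1.0) -> float:
    """Share of all readers who end up on arXiv.

    analyzed_share: fraction of the analyzed clicks that chose arXiv.
    direct_ratio:   size of the straight-to-arXiv group relative to the
                    analyzed group (assumed 1.0, i.e. equally many).
    """
    # The direct group uses arXiv by definition, so it counts in full.
    return (analyzed_share + direct_ratio) / (1.0 + direct_ratio)


print(f"{combined_arxiv_share(0.80):.0%}")  # 90%
```

With the measured 82% instead of the rounded 80%, the same formula gives 91%.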

Page 19:

Benefits to Researchers

• Centralized discipline-based repository with curated metadata/search

• Includes Peer reviewed literature

• Links to every known copy

• DOIs, URLs, arXiv

Page 20:

Numbers

• 834,049 (as of Oct 15)

• 50,077 (During 2008)

• 82,719 (Oct 15 - typical)

• 178 (Last week - typical)

Page 21:

What is SPIRES?

• Deep, carefully curated metadata

• Authors, Affiliations, Citations, Keywords

• Carefully, intentionally limited to HEP

• Associated community information

• Conferences, Institutions, People, Jobs

Page 27:

Future of HEP Information

• Conversations on arXiv

• Noting, but not waiting for peer review.

• Blog/wiki-like

• Rapid turnaround

• Freely accessible content

• Community driven

• Use technology to tighten this relationship further…with an existing community

Page 28:

2010

• Past 40 Years: Information Infrastructure in response to user needs

• Community Needs in 2010:

• Preserve Quality

• Promote Access

• Archive Research Artifacts

Page 29:

2010

• Past 40 Years: Information Infrastructure in response to user needs

• Community Needs:

• Preserve Quality - SCOAP3

• Promote Access - INSPIRE

• Archive Research Artifacts - INSPIRE/HEPData

Page 30:

Quality via Peer Review

• Peer Review and other journal services currently funded by HEP libraries paying for access...

Page 31:

...to material that is freely available

Page 32:

HEP Open Access

• LHC scientists (8000 scientists from 54 countries):

• "We strongly […] support the principles of Open Access Publishing, which includes granting free access of our publications to all. Furthermore, we encourage all our members to publish papers in easily accessible journals, following the principles of the Open Access Paradigm."

Page 33:

SCOAP3 Model

• An international consortium to convert existing (and new) top-quality HEP journals to OA

• Libraries re-direct subscriptions to SCOAP3

• SCOAP3 pays centrally for peer-review service

• Price-per-article established by call for tender

• Articles are (free and libre) Open Access

Page 34:

SCOAP3 Partnerships

Page 35:

SCOAP3 Outlook

• Reach critical mass

• Partnership in Asia and Latin America

• Engage publishers in a call for tender

• Go/No-Go decision

Page 36:

2010

• Past 40 Years: Information Infrastructure in response to user needs

• Community Needs:

• Preserve Quality - SCOAP3

• Promote Access - INSPIRE

• Archive Research Artifacts - INSPIRE/HEPData

Page 37:

Future of HEP Information

• Conversations on arXiv

• Noting, but not waiting for peer review.

• Rapid turnaround of freely accessible content

• Community driven

• Literature growing more complex

• Objects that aren’t papers, but are “information”

• “Datasets”, figures, tables, Computer code

• Use technology to tighten this relationship further…with an existing community

Page 38:

Guts...

Page 39:

SPIRES System

• PL360 Emulated in C!

• SPIRES (non-SQL DBMS + internal scripting language)

• And the clearest, least obfuscated, best documented part of the code base is...

• ...Perl!

Page 40:

INSPIRE

• Joint Project of CERN, DESY, Fermilab and SLAC

• Unify SPIRES content with Invenio platform

• Invenio = Open source digital library

• http://invenio-software.org

• http://inspirebeta.net

Page 41:

INSPIRE Philosophy

• Leverage Users

• Clean, maintainable, sharable codebase

• Open Source/Open Standards

• Continue manual curation...

• ...but utilize automation feeds where possible

• Utilize person-power to

• drive user participation

• exercise judgement (author ID, classification)

Page 42:

Invenio: Modern System

• Stable, modern, extensible software stack (LAMP)

• Fast, even with large repository

• Focused on search

• Open Source (GPL) community

• Substantial HEP use (CERN, ILC, …)

• Over 20 production instances worldwide

• Modular architecture

• Based on open standards

• MARCXML, OAI-PMH, etc.

Page 43:

Opportunities

• Enhanced Search and Discovery

• Automated classification using taxonomy

• User tagging

• Organize your personal papers etc.

• Run a Journal Club

• Author identification

• Claim your papers

Page 50:

User tagging

• Hidden 20 FTE - can be utilized via interactive techniques

• 2007 survey of 2,000 physicists by CERN, DESY, Fermilab and SLAC

Gentil-Beccot et al., Information Resources in High-Energy Physics: Surveying the Present Landscape and Charting the Future Course. J.Am.Soc.Inf.Sci. 60:150-160, 2009; arXiv:0804.2701

Page 51:

Who do we know?

• HEPNames: 80K entries

• Affiliation history for 20K researchers

• Emails for 25K

• 800K papers with authors and (standardized) affiliations

• 5M ‘signatures’ on papers

• 350K unique name strings

Page 52:

Who

• Automatic Disambiguation

• Henning Weiler - PhD student at CERN

• On 963 documents, 21 real authors could be identified for the query "Chen, G".

• 22 orphans remain

• 98% identified
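
The 98% figure is consistent with the two counts on the slide, on one reading of them. A minimal sketch, assuming the 22 "orphans" are documents (out of the 963) that could not be assigned to any of the 21 authors; the variable names are illustrative.

```python
documents = 963  # documents matching the query "Chen, G" (from the slide)
orphans = 22     # documents left unassigned (from the slide; assumed to be documents)

identified_fraction = (documents - orphans) / documents
print(f"{identified_fraction:.1%}")  # 97.7%
```

That rounds to the 98% quoted above.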

Page 55:

User Accounts

• Tied to academic affiliation

• ...and ORCID...

• Ability to correct information and claim papers

• Corrections still vetted by staff

Page 56:

Sources

• Source of 2008 additions

• Many papers have information from multiple sources

• Many arXiv papers will be published later

Page 57:

arXiv

• OAI-PMH Feed

• Rough Metadata (author/title/id)

• LaTeX and/or PDF parsing

• Citations, Authors, Affiliations, Keywords

• Parsed by Perl/Python

• Checked (or redone) by Humans
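
The "rough metadata" step of such a feed can be sketched as follows. This is a minimal illustration rather than the SPIRES/INSPIRE harvester: it assumes the feed is requested with the standard OAI-PMH `ListRecords` verb and the `oai_dc` metadata prefix, and the base URL, record identifier, title and author names below are placeholders. It parses a canned response instead of contacting a server.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url: str, from_date: str) -> str:
    """Build a ListRecords request for records changed since from_date."""
    query = urlencode({"verb": "ListRecords",
                       "metadataPrefix": "oai_dc",
                       "from": from_date})
    return f"{base_url}?{query}"


# Placeholder response; a real feed returns many records per page.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:0001</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A placeholder title</dc:title>
          <dc:creator>Doe, J.</dc:creator>
          <dc:creator>Roe, R.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()


def parse_records(xml_text: str) -> list:
    """Extract id, title and author list from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall("oai:ListRecords/oai:record", NS):
        dc_path = "oai:metadata/oai_dc:dc/"
        records.append({
            "id": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(dc_path + "dc:title", namespaces=NS),
            "authors": [c.text for c in rec.findall(dc_path + "dc:creator", NS)],
        })
    return records


print(list_records_url("https://example.org/oai", "2010-07-01"))
print(parse_records(SAMPLE_RESPONSE))
```

Citations, affiliations and keywords are not in this kind of record, which is why the slide lists LaTeX/PDF parsing and human checking as separate steps.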

Page 58:

Journals

Page 59:

Publishers

• APS (Phys.Rev.D, Phys.Rev.Lett.)

• Elsevier (Phys.Lett.B, Nucl.Phys.B)

• Springer (Eur.Phys.J.C, JHEP (>2010))

• IOP (J.Phys.G, JHEP (<2010))

Page 60:

Feeds

• APS in OAI-PMH

• Full Metadata + References

• Elsevier, Springer, JHEP

• In-house XML via FTP

• Rich Metadata, Most with References

• Fall back to screen-scraping HTML
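
The "preferred feed first, screen-scraping last" ordering on this slide can be sketched as a simple fallback chain. This is an illustration under stated assumptions, not the production harvester: the three fetcher functions are hypothetical stand-ins that return canned results, and the record fields are placeholders.

```python
from typing import Callable, Optional, Sequence, Tuple

Record = dict
Fetcher = Callable[[str], Optional[Record]]


def fetch_with_fallback(article_id: str,
                        fetchers: Sequence[Tuple[str, Fetcher]]) -> Tuple[str, Record]:
    """Try each source in order; return (source_name, record) from the first success."""
    for name, fetch in fetchers:
        try:
            record = fetch(article_id)
        except Exception:
            # A failing source should not stop the chain; a real harvester
            # would log the error here.
            record = None
        if record is not None:
            return name, record
    raise LookupError(f"no source returned metadata for {article_id!r}")


# Hypothetical stand-ins for the three kinds of feed named on the slide.
def from_oai_pmh(article_id: str) -> Optional[Record]:
    return None  # pretend this publisher offers no OAI-PMH feed


def from_ftp_xml(article_id: str) -> Optional[Record]:
    raise ConnectionError("FTP drop not available")  # pretend the FTP feed is down


def from_html_scrape(article_id: str) -> Optional[Record]:
    return {"id": article_id, "title": "A placeholder title", "references": []}


source, record = fetch_with_fallback(
    "example-0001",
    [("oai-pmh", from_oai_pmh), ("ftp-xml", from_ftp_xml), ("html", from_html_scrape)],
)
print(source)  # html
```

Returning the source name alongside the record makes it easy to flag scraped records for closer human checking.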

Page 61:

Users

• In 2008:

• 173 papers added directly by users

• 3,800 papers with user updates/corrections to reference lists

• 4,000 user-updated profiles (institutional history, etc.)

Page 62:

Export

• DOIs, publication information to arXiv, ADS

• bidirectional exchange of XML

• Currently: Rough “API” with in-house XML formats for Physicists building apps

• INSPIRE: OAI-PMH interface, rich API

• NLM DTD

• MARCXML
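
For readers unfamiliar with MARCXML, a record in that format looks roughly like the output of the sketch below. It is a minimal illustration using only the standard MARC 21 tags 100 (main author) and 245 (title) in the Library of Congress "slim" namespace; the leader and control fields are omitted, the function name and sample values are placeholders, and nothing here is meant to describe the tag layout INSPIRE actually exports.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"  # MARCXML "slim" schema namespace


def to_marcxml(title: str, first_author: str) -> str:
    """Serialize a title and first author as a bare-bones MARCXML record."""
    ET.register_namespace("", MARC_NS)  # emit MARC_NS as the default namespace
    record = ET.Element(f"{{{MARC_NS}}}record")

    def add_field(tag: str, code: str, value: str) -> None:
        # Each datafield carries a 3-digit tag, two indicators, and subfields.
        field = ET.SubElement(record, f"{{{MARC_NS}}}datafield",
                              {"tag": tag, "ind1": " ", "ind2": " "})
        subfield = ET.SubElement(field, f"{{{MARC_NS}}}subfield", {"code": code})
        subfield.text = value

    add_field("100", "a", first_author)  # 100 $a: main entry, personal name
    add_field("245", "a", title)         # 245 $a: title statement
    return ET.tostring(record, encoding="unicode")


print(to_marcxml("A placeholder title", "Doe, J."))
```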

Page 63:

2010

• Past 40 Years: Information Infrastructure in response to user needs

• Community Needs:

• Preserve Quality - SCOAP3

• Promote Access - INSPIRE

• Archive Research Artifacts - INSPIRE/HEPData

Page 64:

~15,000 HEP scientists smash stuff at the speed of light to produce new stuff

Page 65:

…and it works!

LHC re-discovering known particles for starters.

First needles in the haystack: one in a million.

Page 66:

Data

• HEPData - Durham U.

• Stores Data “behind” figures/tables

• Submitted from Experiments

• INSPIRE partners with HEPData

• Provides access, linking and deposition in central community location

• Serve “long-tail” of theorists and others with “misc.” materials

• Enables access, citation, etc.

Page 67:

Existing Infrastructure

Page 73:

Data

• Trusted Community Infrastructure

• Future?

• DPHEP Study Group

• Continuing conversation with researchers to develop data preservation strategy

Page 74:

Conclusion

• Access, Quality, and Artifacts

• Emerging from community of researchers

• Aligned with community needs

• Target what scientists need

• Quality - Speed - Completeness

• Building on existing, trusted infrastructures

Page 75:

Infrastructure

• The basic facilities, services and installations needed for the functioning of a community or society (wiktionary.org)

Page 76:

Questions?

• For more information on INSPIRE see

http://www.projecthepinspire.net

http://inspirebeta.net