Content Mining of Science in Cambridge Peter Murray-Rust, Dept of Chemistry, University of Cambridge libraries@cambridge, Cambridge, UK 2016-01-07 What is mining? Why is it useful? Open Access and UK “Hargreaves” legislation How Cambridge can become a world leader
Content Mining of Science in Cambridge
Peter Murray-Rust, Dept of Chemistry, University of Cambridge
libraries@cambridge, Cambridge, UK 2016-01-07
What is mining?
Why is it useful?
Open Access and UK “Hargreaves” legislation
How Cambridge can become a world leader
The Right to Read is the Right to Mine*
*Peter Murray-Rust, 2011
repositories to reports in scientific literature
• Mining chemical reactions from patents
• Creating a bacterial supertree-of-life from 4500 papers
Polly has 20 seconds to read this paper…
…and 10,000 more
ContentMine software can do this in a few minutes
Polly: “there were 10,000 abstracts and due to time pressures, we split this between 6 researchers. It took about 2-3 days of work (working only on this) to get through ~1,600 papers each. So, at a minimum this equates to 12 days of full-time work (and would normally be done over several weeks under normal time pressures).”
400,000 Clinical Trials
In 10 government registries
Mapping trials => papers
http://www.trialsjournal.com/content/16/1/80
2009 => 2015. What’s happened in the last 6 years?
Search the whole scientific literature
for “2009-0100068-41”
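A minimal sketch of such a search, assuming the papers are already available as plain text. The function name `find_identifier` and the sample texts are hypothetical illustrations, not part of the ContentMine software.

```python
import re

def find_identifier(papers, identifier):
    """Return the sorted IDs of papers whose text mentions the identifier."""
    # Escape the identifier so it is matched literally, and reject matches
    # that sit inside a longer run of digits or hyphens.
    pattern = re.compile(r"(?<![\d-])" + re.escape(identifier) + r"(?![\d-])")
    return sorted(pid for pid, text in papers.items() if pattern.search(text))

# Hypothetical sample corpus: paper ID -> full text.
papers = {
    "paper-a": "The trial was registered as 2009-0100068-41 before enrolment.",
    "paper-b": "No registry number is reported in this article.",
    "paper-c": "See record 12009-0100068-419 for unrelated data.",
}

hits = find_identifier(papers, "2009-0100068-41")
print(hits)
```

The lookaround guards keep the identifier from matching inside a longer number, which matters when the same digits appear in unrelated record numbers.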
ContentMine-ing strategy
• Discover. Crawl the COMPLETE relevant literature. => bibliography
• Scrape (download). ALL papers
• Index papers => Facts
• Search/analyze papers => complex science
• Extract, Annotate, Aggregate (“Transformative”)
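The stages above can be sketched as a toy pipeline. This is an illustration under stated assumptions, not ContentMine's implementation: the in-memory `CORPUS`, the `example.org` URLs, the `TERMS` list and all function names are hypothetical, and a real system would crawl and download over the network.

```python
from collections import defaultdict

# Hypothetical stand-in for the literature: URL -> full text.
CORPUS = {
    "https://example.org/p1": "Escherichia coli and Bacillus subtilis were cultured.",
    "https://example.org/p2": "We sequenced Bacillus subtilis isolates.",
    "https://example.org/p3": "This editorial contains no species names.",
}
# Hypothetical dictionary of terms to look for.
TERMS = ("Escherichia coli", "Bacillus subtilis")

def discover():
    """Discover: list every relevant paper (the bibliography)."""
    return sorted(CORPUS)

def scrape(bibliography):
    """Scrape: fetch the text of ALL papers in the bibliography."""
    return {url: CORPUS[url] for url in bibliography}

def index(papers, terms):
    """Index: map each term (a 'fact') to the papers that mention it."""
    facts = defaultdict(set)
    for url, text in papers.items():
        for term in terms:
            if term in text:
                facts[term].add(url)
    return facts

def aggregate(facts):
    """Aggregate: count how many papers mention each term."""
    return {term: len(urls) for term, urls in facts.items()}

bibliography = discover()
papers = scrape(bibliography)
facts = index(papers, TERMS)
summary = aggregate(facts)
print(summary)
```

Keeping each stage as a separate function mirrors the slide: the bibliography, the downloaded papers and the index are each reusable outputs in their own right.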
Semantic re-usable/computable output (ca 4 secs/image)
Supertree for 924 species
Tree
Supertree created from 4300 papers
Copyright and Mining
• UK (“Hargreaves”) 2014 legislation:
– “personal” “non-commercial*” “research” “data analytics”
– legitimizes copying (?to disk), but not publishing
*teaching, textbooks, etc. may be “commercial”
STM Publishers prevent Mining
• FUD & disinformation about legality (Elsevier)
• Monopolies on infrastructure (“API”s, CCC Rightfind)
• Technical obstruction (Wiley Captcha, Macmillan Readcube)
• Restrictive contracts with libraries (ALL) [1]
• Wasting my/our time (ALL)
[1] [You may not] utilize the TDM Output to enhance … subject repositories in a way that would [… ] have the potential to substitute and/or replicate any other existing Elsevier products, services and/or solutions.
WILEY … “new security feature… to prevent systematic download of content”
“[limit of] 100 papers per day”
“essential security feature … to protect both parties (sic)”
CAPTCHA
User has to type words
ContentMine working with Libraries
• Cambridge: Library, Plant Sciences, Public Health, Chemistry
• Cochrane Collaboration on Systematic Reviews of Clinical Trials
• FutureTDM (H2020, LIBER)
• Running workshops and training