Architecture of ContentMine Components contentmine.org

Feb 11, 2017

TheContentMine
Page 1: Architecture of ContentMine Components contentmine.org

Architecture of TheContentMine

These slides are for enlightenment and presentations. Use http://discuss.contentmine.org/t/overall-architecture/142 for up-to-date info. Questions, comments and critiques welcome! All s/w is Open (BSD/Apache2)

Some diagrams are autogenerated from *.dot files which are located in the projects (mainly Norma and AMI)

Page 2

Diagram labels:

catalogue; getpapers; query; Daily Crawl; EuPMC, arXiv, CORE, HAL, (UNIV repos); ToC services

PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS, URLs, DOIs

crawl; quickscrape; norma (Normalizer, Structurer, Semantic Tagger); Text; Data; Figures; ami; UNIV Repos; search; Lookup

CONTENT MINING; Chem; Phylo; Trials; Crystal; Plants; COMMUNITY; plugins; Visualization and Analysis

PLoS ONE, BMC, PeerJ… Nature, IEEE, Elsevier…; Publisher Sites; scrapers; queries; taggers

abstract; methods; references; Captioned Figures; Fig. 1; HTML tables

30,000 pages/day; Semantic ScholarlyHTML; Facts

Latest 20150908

Page 3

Diagram labels:

quickscrape; Norma; Index & Transform

PDF; XML; URL; DOI; DOC; CSV; sHTML

Plugins; Sequences; Species; Bespoke Scrapers; XPath; Taggers; Per-Journal; Chemistry; Phylogenetics; Plants

AMI; Bad HTML; OCR; Diagrams; CAT-alogue index

getpapers; query; Titles + links; Daily Crawl/feed; EuPMC; JToCs

Latest 20150908; limited in scope

Page 4

Starting points for ingestion (getpapers/quickscrape/Norma)

• Search/Crawl/Feed -> PMCID, DOI, URL -> quickscrape -> CTree(PDF, HTML, XML, images/, meta) -> Norma -> CMDir(sHTML|TXT|SVG|image) (good)

• PDF, XML, TXT, HTML -> Norma -> CTree(PDF, rawHTML, TXT, images/, meta?) -> NormaOCR|TXT2HTML -> CTree(sHTML, TXT, SVG) (variable)
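
The two routes above can be written down as plain data, which makes it easier to see where they differ. This is only a sketch: the `Route` class, the route keys `"identifier"` and `"local-file"`, and `choose_route` are hypothetical names invented for illustration, not part of getpapers, quickscrape, or Norma. The stage labels and the "good"/"variable" ratings follow the bullets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """One ingestion route from the slide (hypothetical helper type)."""
    stages: tuple   # ordered stage labels, following the bullets
    quality: str    # the slide's own rating: "good" or "variable"

    def describe(self) -> str:
        # Join the stages with arrows and append the rating.
        return " -> ".join(self.stages) + f" [{self.quality}]"


# Route keys are made up for this sketch; the stages come from the slide.
ROUTES = {
    "identifier": Route(
        stages=(
            "Search/Crawl/Feed",
            "PMCID,DOI,URL",
            "quickscrape",
            "CTree(PDF,HTML,XML,images/,meta)",
            "Norma",
            "CMDir(sHTML|TXT|SVG|image)",
        ),
        quality="good",
    ),
    "local-file": Route(
        stages=(
            "PDF,XML,TXT,HTML",
            "Norma",
            "CTree(PDF,rawHTML,TXT,images/,meta?)",
            "NormaOCR|TXT2HTML",
            "CTree(sHTML,TXT,SVG)",
        ),
        quality="variable",
    ),
}


def choose_route(have_identifier: bool) -> Route:
    """Identifiers go through quickscrape; local files go straight to Norma."""
    return ROUTES["identifier" if have_identifier else "local-file"]


if __name__ == "__main__":
    print(choose_route(have_identifier=True).describe())
    print(choose_route(have_identifier=False).describe())
```

The only structural difference between the routes is whether quickscrape builds the CTree before Norma runs, or Norma builds it from files already on disk.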

20150908

Page 5

Norma Conversions

• Paper -> Scanned -> TIFF (avoid)

• PDF, TIFF, PNG -> Tesseract-N -> HTML, SVG (fast, variable)

• PDF -> PDF2SVG-N -> sHTML, SVG, images/ (slow, accurate-ish)

• PDF -> PDF2TXT-N -> TXT (fast, variable)

• PDF -> PDF2Image-N -> PNG (fast, accurate)
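
The conversion list reads naturally as a lookup table: given an input format and a wanted output, which converters apply? The sketch below encodes the slide as data. The `Converter` type and the `candidates` function are hypothetical names for illustration, not Norma's API, and the speed/accuracy ratings reflect one reading of the slide layout, in which each rating belongs to the conversion listed just before it.

```python
from typing import NamedTuple


class Converter(NamedTuple):
    """One row of the slide's conversion list (hypothetical helper type)."""
    name: str
    inputs: frozenset
    outputs: frozenset
    speed: str
    accuracy: str


# Rows follow the slide; the "Paper -> Scanned -> TIFF (avoid)" line is
# advice rather than a converter, so it is left out.
CONVERTERS = [
    Converter("Tesseract-N", frozenset({"PDF", "TIFF", "PNG"}),
              frozenset({"HTML", "SVG"}), "fast", "variable"),
    Converter("PDF2SVG-N", frozenset({"PDF"}),
              frozenset({"sHTML", "SVG", "images/"}), "slow", "accurate-ish"),
    Converter("PDF2TXT-N", frozenset({"PDF"}),
              frozenset({"TXT"}), "fast", "variable"),
    Converter("PDF2Image-N", frozenset({"PDF"}),
              frozenset({"PNG"}), "fast", "accurate"),
]


def candidates(source: str, target: str) -> list:
    """Names of converters that accept `source` and produce `target`."""
    return [c.name for c in CONVERTERS
            if source in c.inputs and target in c.outputs]


if __name__ == "__main__":
    # PDF -> SVG has two options with different speed/accuracy trade-offs.
    for name in candidates("PDF", "SVG"):
        conv = next(c for c in CONVERTERS if c.name == name)
        print(name, conv.speed, conv.accuracy)
```

Laid out this way, PDF to SVG is the one case on the slide with a real choice: the fast but variable OCR route, or the slower and more accurate PDF2SVG-N route.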

20150908

Page 6

Norma End points

• Norma -> CTree(OpenSHTML-SVG) -> everything?

• Norma -> CTree(sHTML, sections) -> AMI -> all text (+ species, chemText, sequences)

• Norma -> CTree(TXT (unsectioned)) -> AMI -> bagOfWords, regex, IDs, species?

• Norma -> CTree(PNG) -> AMI -> phylo, bar/xy-plots

• Norma -> CTree(SVG) -> AMI -> phylo, bar/xy-plots, chemistry
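
The end points amount to a mapping from what a CTree contains to what AMI can extract from it. A minimal sketch of that mapping follows. `AMI_ENDPOINTS`, `TENTATIVE`, and `analyses_for` are hypothetical names, not AMI's plugin interface; the entries come from the list above, the "species?" entry for unsectioned TXT is recorded as tentative, and the open-ended "OpenSHTML-SVG -> everything?" line is omitted.

```python
# Content type in the CTree -> analyses the slide lists for AMI.
AMI_ENDPOINTS = {
    "sHTML": ["all text", "species", "chemText", "sequences"],
    "TXT": ["bagOfWords", "regex", "IDs", "species"],
    "PNG": ["phylo", "bar/xy-plots"],
    "SVG": ["phylo", "bar/xy-plots", "chemistry"],
}

# Pairs the slide marks with a question mark.
TENTATIVE = {("TXT", "species")}


def analyses_for(available):
    """Analyses reachable from the given content types, first-seen order, no repeats."""
    seen = []
    for kind in available:
        for analysis in AMI_ENDPOINTS.get(kind, []):
            if analysis not in seen:
                seen.append(analysis)
    return seen


if __name__ == "__main__":
    print(analyses_for(["PNG", "SVG"]))
    print(analyses_for(["TXT"]))
```

Unknown content types simply contribute nothing, which matches the slide: only sHTML, TXT, PNG, and SVG are listed as inputs to AMI.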

Page 7

Pre/early Norma toolchain

Transforming PDF and PNG into higher-value components

20150908. Diagram autogenerated from *.dot graph

Page 8

getpapers/quickscrape/Norma workflow

20150908. Diagram autogenerated from *.dot graph

Page 9

Getpapers/quickscrape/Norma: commonest uses

20150908. Diagram autogenerated from *.dot graph

Page 10

AMI: inputs and outputs for common plugins

20150908. Diagram autogenerated from *.dot graph

Page 11

Earlier diagrams

Probably significantly out of date, but may contain useful info.

Page 12

Diagram labels:

NORMALIZE; Norma: Convert PDF, XML to sHTML; Tag sections

Normalized Scientific Literature

AMI: Index, Transform, Extract, Search

PDF2SVG; XSL stylesheets; Taggers; normalization Parameters

“Permanent” Filestore; Temporary Filestore; Extracted facts; indexes; Plugins; Regex

Page 13

Diagram labels:

PDF: Non-Unicode, Pixel glyphs, No words, No structures

ScholarlyHTML; SVG; High-level graphics; PDF2SVG; characters; Sentences; Paras; tables

PNG; OCR; Tagged Sections; SVGBuilder; Captioned Figures

NORMA; XSLT1/2

Page 14

Diagram labels:

Raw HTML: Not well-formed, Bad character semantics

ScholarlyHTML; Well-formed XHTML; PNG; Tagged Sections; Captioned Figures; Tables; Captioned Tables

XML; HtmlTidy; Jsoup; HtmlUnit; XSLT1/2; XSLT1/2

NORMA; Per-journal Stylesheets

Page 15

RSU: Richard Smith-Unna

PMR: Peter Murray-Rust

CL: CottageLabs

Diagram labels: Queues; Repos; Scientific literature; Science Plugins; Science Volunteers

Collaboration with Open Access Button

Page 16

Diagram labels:

quickscrape; Crawl; Feed; Norma; Index & Transform

TXT; XML; URL; DOI; Scientific literature; Repositories; DOC; CSV; sHTML; PDF

Plugins; Regex; Sequences; Species; Bespoke Scrapers; XPath; Per-Journal; Taggers; Per-Journal; Metadata; Chemistry; Phylogenetics; Farming

AMI; Bad HTML; OCR; Diagrams

Open NORMA-lized Scientific Literature + Facts

CANARY pipeline; CAT-alogue index

Page 17