Future of Cognitive Computing and AI
Semantic PDF Processing and Knowledge Representation
Sridhar Iyengar
Distinguished Engineer
Cognitive Computing Research
IBM T.J. Watson Research Center
© 2017, IBM Corporation
Financial Services: Query from Unstructured Data
Financial Documents (.pdf, .html, .docx…)
Ingest
“Show me revenues for Citibank between 2009 and 2015”
[Chart: x-axis years 2009 to 2015, y-axis 0 to 150]
Summary: PDF understanding is hard and requires significant research breakthroughs and product innovations
▪ PDF Documents are optimized for display and often do not
include metadata and structure to facilitate Cognitive post
processing
– Existing technologies and solutions are optimized for printing and
viewing – not cognitive post processing
▪ Need to handle programmatically created documents (via MS Word, PPT….)
as well as legacy and scanned documents (forms, handwritten
notes...)
▪ Approach : The definition of a Semantic Document Structure
Model (DSM) for a consistent internal representation of document
structures to be used in Future WDC Services and products
▪ Currently focused on Table and Diagram Understanding from PDF
– Healthcare, Financial Services, Compliance, Legal…
Research Focus : IBM AI Platform for Business
• Best platform for building applications that incorporate enterprise and industry knowledge
• Time to Value at every step of cognitive application development
• Tools & Methodology to support development, deployment & intuitive usage
Data:
‒ Structured & Unstructured data sources
‒ Multimodal (text, visual, speech, etc.) data sources
‒ Public & private data sources
Training
‒ Create Domain Models and Specialize them
   ▪ For conversations aligned with business process
   ▪ For discovery of insights
‒ Fast adaptation to new domains
‒ Scale from small to large amounts of training data
‒ Tuned model creation for accuracy vs. training time.
‒ Incremental & Automated Knowledge Evolution
Conversation
‒ Tools for SMEs to train from business processes
‒ Inference engines for specific content structure
Discovery
‒ Tools for SMEs to train from business knowledge
‒ Reason about domain knowledge (vs. lexical/syntactic)
Tools & Methodology
‒ Cognitive application lifecycle (code/data/model)
‒ Resilient deployment of cognitive models
Cognitive Computing (AI) Technologies Research
Decision Support
People Insights
Cognitive Software and Data Life Cycle*
Reasoning and Planning
Human Computer Interaction
Conversation
Query and Retrieval*
Knowledge Extraction and Representation*
Learning*
Natural Language & Text Understanding*
Visual Comprehension*
Speech and Audio
Embodied Cognition
Cognitive Computing Platform Infrastructure
Signal Comprehension
Reasoning About Domains
Interaction Systems
Trust and Security
Semantic PDF Processing*
Goal: From Raw Data to Business Artifacts
Example 1: Semantic PDF Processing
Diagram labels: Bulleted List, Line Plot, fragment (×3), Section, Obligation
• Create representation for an obligation
• Models for “obligation language”
• Reason about list or data that refines the obligation
• Create document fragments by parsing out chunks
• Document structure models
• Reason about document chunks
• Create representation for a fragment
• Document fragment models
• Reason about fragment constituents
• Hierarchical Processing
• Machine-learned models and reasoning at all levels
• Learnability of artifacts, models
• Learn how to specify reasoners
From Raw Data to Business Artifacts
Example 1: Semantic PDF Processing
Diagram labels: Bulleted List, Line Plot, fragment (×3), Section, Obligation
Example 2: Semantic MPEG Processing
Diagram labels: .mp4, Scene (×2), Boy, Girl, Night, Soft Music, Candles, Romantic Scene
• Hierarchical Processing
• Machine-learned models and reasoning at all levels
• Learnability of artifacts, models
• Learn how to specify reasoners
Why is PDF Processing hard?
Complexity akin to “Natural Language Understanding”
▪ Thousands of PDF generators (drivers), each with its own rules for placing marks on paper.
▪ Incredible variety in content: complex tables, images, diagrams, formulas, varying resolution in scanned content
▪ No closed-form / algorithmic solution is feasible; must resort to machine learning.
Why is it hard? Variety of tables: 20-25 major table types in discussion with just one major customer
Complex tables: graphical lines can be misleading. Is this 1, 2 or 3 tables?
Table with visual clues only
Multi-row, multi-column column headers
Nested row headers
Tables with textual content
Table with graphic lines
Table interleaved with text and charts
Complex multi-row, multi-column column headers identifiable using graphical lines and visual clues
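To make the multi-row, multi-column header problem concrete, here is a minimal Python sketch, not the method used in the research described here. It assumes cell positions and spans have already been detected, expands spanning header cells into a dense grid, and derives one header path per column. The `Cell` class, the function names, and the sample header ("Company", "Revenue", "2014", "2015") are hypothetical illustrations.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """A detected table cell (hypothetical structure for illustration)."""
    text: str
    row: int
    col: int
    row_span: int = 1
    col_span: int = 1


def expand_to_grid(cells, n_rows, n_cols):
    """Copy each cell's text into every grid position it spans."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        for r in range(c.row, c.row + c.row_span):
            for k in range(c.col, c.col + c.col_span):
                grid[r][k] = c.text
    return grid


def column_header_paths(grid, header_rows):
    """Join the header texts above each column, skipping vertical repeats."""
    paths = []
    for col in range(len(grid[0])):
        parts = []
        for r in range(header_rows):
            text = grid[r][col]
            if text and (not parts or parts[-1] != text):
                parts.append(text)
        paths.append(" / ".join(parts))
    return paths


# Hypothetical two-row header: "Revenue" spans two year columns,
# "Company" spans both header rows.
header_cells = [
    Cell("Company", 0, 0, row_span=2),
    Cell("Revenue", 0, 1, col_span=2),
    Cell("2014", 1, 1),
    Cell("2015", 1, 2),
]
grid = expand_to_grid(header_cells, n_rows=2, n_cols=3)
paths = column_header_paths(grid, header_rows=2)
print(paths)
```

The hard part in practice is the step this sketch takes for granted: deciding where the cells and their spans are when the only evidence is graphical lines and visual clues.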
Why is it hard? Variety in Image, Diagram Types
L. Lin et al. / Pattern Recognition 42 (2009) 1297-1307 (page 1305)
Fig. 8. ROC curves of the detection results for bicycle parts. Each graph shows the ROC curve of the results for a different part of the bicycle using just bottom-up information and bottom-up + top-down information. We can see that the addition of top-down information greatly improves the results. We can also see that the bicycle wheel is the most reliably detected object using only bottom-up cues, so we will look for that part first.
With a quick second glance, even the seat and handlebars may be “seen”, though they are actually occluded. Our algorithm simulates the top-down process (indicated by blue/green downward arrows in Fig. 4) in a similar way, using the constructed And–Or graphs.
Verification of hypotheses: Each of the bottom-up proposals activates a production rule that matches the terminal nodes in the graph, and the algorithm predicts its neighboring nodes subject to the learned relationships and node attributes. For example in Fig. 4, a proposed circle will activate the rule that expands a wheel into two rings. The algorithm then searches for another circle of proportional radius, subject to the concentric relation with existing circle. In Fig. 5(b), the wheels are already verified. The candidate frames are then predicted with their ends affixed to the center points of the wheels. Since we cannot tell the front wheels from the rear ones at this moment, frames facing in two different directions are both predicted and put in the Open List. In Fig. 5(a), the triangle templates are detected using a Generalized Hough Transform only when the wheels are first verified and frames are predicted. If no neighboring nodes are matched, the algorithm stops pursuing this proposal and removes it from the Lists. Otherwise, if all of the neighboring nodes are matched, the production rule is completed. The grouped nodes are then put in the Closed List and lined up to be another bottom-up proposal for the higher level. Note that we may have both bottom-up and top-down information being passed about a particular proposal as shown by the gray arrows in Fig. 3. In Fig. 4, the sub-parts of the frame are predicted in the top-down phase from the frame node (blue arrows); at the same time, they are also proposed in the bottom-up phase based on the triangles we detected (red arrows). Proposals with bidirectional supports such as these are more likely to be accepted. After one particle is accepted from the Open List, any other overlapping particles should update accordingly.
Template match: The pre-defined part templates, such as the bicycle frames or teapot bodies, are represented by sub-sketch-graphs, which are composed of a set of linked edgelets and junctions. Once a template is proposed and placed at a location with initial attributes, the template matching process is then activated. As shown in
PDF rendering
▪ .doc, .ppt rendering to .pdf keeps minimal structure formatting. Geared towards visual fidelity
▪ Often .pdf is created by “screen scraping” or scanning or hybrid ways that do not keep structure information.
Multi-modality: extremely rich information
▪ Images + Text + Tables both co-exist as well as form nested hierarchies possibly with several levels
Nested table (numeric and non-numeric + image)
Tabular representation of images with pictorial cross reference
Images + captions + cross references and text that comments the image
Two major approaches to tackling PDF Processing
▪ Unsupervised Learning and out of the box PDF processing
– Works well for a large class of domains with some compromise in quality
▪ Supervised Learning with a graphical labelling tool
– Potential for improved quality when many similar documents are available
Both approaches can be used together
PDF Understanding: High Level Overview
PDF Parser
Text analytics (Programmatic PDF)
TU: Table understanding (Programmatic PDF)
Image classification
DU: Line plots (LP)
DU: flowcharts (FC)
DU: bubble plots (BP)
DU: scanned tables (ST)
…
Data integration: Linking text to diagrams, tables, serialization….
Learned Semantic Document Representation
PDF Processing Overview in WDC
PDF Docs → WDC DCS Service → HTML, JSON, Plain Text
Text and Simple Table structure
https://www.ibm.com/watson/developercloud/document-conversion.html
Current implementation of DCS has limited Table processing capability and no support for scanned documents, diagrams, graphs etc.
PDF Processing.Next Overview (2017/2018)
WDC DCS.Next Service
PDF, HTML, Word Docs
DSM-XML
JSON
Plain Text
HTML
WDC DCS Service …
PDF2HTML
PDF2JSON
PDF2-DSMXML
New PDF Tools
SME, Data Scientist (Domain Adaptation using ML)
Developer Using DCS API
Text, Tables, Diagrams, Graphs..
PDF, HTML, Word Docs (Training)
PDF Conversion Architecture
Programmatic PDF
PDFBox API: Parse PDF Document
HTML
Layout + Reading Order Inference
HTML Generation
Table Structure Population
Metadata Identification
Table Identification
Cleanse Raw PDF Data
Open Source or Commercial Software
Research Solution
Composite Unit / Region Identification
Scanned PDF
Cleanse Raw OCR Output
OCR Engine API: Scan PDF Document
• ML-based PDF conversion pipeline is source-independent
• SAME ML-based algorithms can be applied directly to data extracted from either scanned or programmatic PDF
• PDF conversion ML algorithms are unsupervised and thus achieve stated performance out of the box with NO training / tuning data required
• Deployable in WDC for document-at-a-time processing (through Document Conversion Service) and in deep enrichment service
• Scanned PDF processing available now using Datacap OCR engine
• Extension using Tesseract engine
Programmatic PDF Extraction
Scanned & Hybrid PDF Extraction
Hybrid PDF
Chart Identification
ML-based PDF Conversion Pipeline
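The source-independence claim above can be illustrated with a small Python sketch. It is not the ML-based pipeline itself. It assumes that both extraction paths (a PDF parser for programmatic PDF, an OCR engine for scanned PDF) are normalized into the same token records with page coordinates, so that one downstream routine serves both. The simple rule-based line grouping stands in for the layout and reading order inference step. The `Token` class, the tolerance value, and all coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """A word with its position; the same record for parser or OCR output."""
    text: str
    x0: float  # left edge
    y0: float  # top edge (assumed: origin at top-left, y grows downward)


def group_into_lines(tokens, y_tolerance=3.0):
    """Group tokens with similar top edges, then order each line left to right."""
    lines = []
    for tok in sorted(tokens, key=lambda t: (t.y0, t.x0)):
        # Compare against the first token of the current line.
        if lines and abs(lines[-1][0].y0 - tok.y0) <= y_tolerance:
            lines[-1].append(tok)
        else:
            lines.append([tok])
    return [" ".join(t.text for t in sorted(line, key=lambda t: t.x0))
            for line in lines]


# Hypothetical coordinates: exact positions from a programmatic PDF ...
programmatic = [
    Token("Bridge", 10, 10), Token("Designer", 80, 10),
    Token("Brooklyn", 10, 30), Token("J.A.Roebling", 80, 30),
]
# ... and slightly jittered positions, as OCR on a scan might produce.
scanned = [
    Token("Bridge", 11, 11.5), Token("Designer", 79, 9.8),
    Token("Brooklyn", 10.5, 31), Token("J.A.Roebling", 81, 29.5),
]

lines_prog = group_into_lines(programmatic)
lines_scan = group_into_lines(scanned)
print(lines_prog)
print(lines_scan)
```

Because everything after extraction sees only tokens and coordinates, the same routine handles both inputs; only the cleansing step in front of it differs by source.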
PDF Conversion Downstream Analytics Example
• HTML output from WCS PDF Conversion is directly consumable by downstream analytics
• WCS PDF Conversion Table processing example (Norbury sample table):
Diagram labels: PDF, WCS PDF Conversion Table Processing, HTML, Structured Facts from Table, Watson Knowledge Graph, NLQ Answering, Answer
Original Scanned PDF table
HTML generated from current PDF Conversion Web service
Bridge          Designer              Length
Brooklyn        J.A.Roebling          1595
Manhattan       G.Lindenthal          1470
Queensborough   Palmer & Hornbostel   1182
Structured facts from existing Table Processing Libraries (with appropriate customization)
NLQ Answering: “Who designed Brooklyn Bridge?” → J.A.Roebling…
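The step from table to answer can be sketched in Python as follows. This is a toy illustration rather than the Watson Knowledge Graph or NLQ Answering components: the table rows come from the slide, while the triple layout, the function names, and the single regular-expression question pattern are assumptions made for the example.

```python
import re

# Rows recovered from the converted table on this slide.
rows = [
    {"Bridge": "Brooklyn", "Designer": "J.A.Roebling", "Length": 1595},
    {"Bridge": "Manhattan", "Designer": "G.Lindenthal", "Length": 1470},
    {"Bridge": "Queensborough", "Designer": "Palmer & Hornbostel", "Length": 1182},
]


def table_to_facts(rows, key_column):
    """Turn each non-key cell into a (subject, predicate, value) triple."""
    facts = []
    for row in rows:
        subject = row[key_column]
        for column, value in row.items():
            if column != key_column:
                facts.append((subject, column, value))
    return facts


def answer(question, facts):
    """Toy matcher that handles only 'Who designed X Bridge?' questions."""
    match = re.search(r"who designed (?:the )?(\w+) bridge", question, re.IGNORECASE)
    if not match:
        return None
    wanted = match.group(1).lower()
    for subject, predicate, value in facts:
        if subject.lower() == wanted and predicate == "Designer":
            return value
    return None


facts = table_to_facts(rows, key_column="Bridge")
result = answer("Who designed Brooklyn Bridge?", facts)
print(result)
```

Once the table structure is recovered correctly, the lookup itself is trivial, which is why conversion quality dominates downstream accuracy.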
Document Structure Model (Document Representation)
• Define common document structure ideal for subsequent semantic analysis
• Defined per feature: Section, Bulleted Lists, Headers, Footers, Footnotes, Tables, ...
Define how section information such as title, number and nesting should be represented
Define how list information such as list items and list type should be represented
Document Structure Model (DSM) - Draft
[Diagram: Logical Data Model, Ontology Representation]
Document sources: ScanPDF, ProgPDF
Objects: Page, PageColumn, Paragraph, TextLine, Phrase, Token, Character, Table, TableCell, GraphicalLine, PageChart, EmbeddedImage
Attributes shown: Color, displayOrder, rowSpan, colSpan, BoundBoxCoords, contents
[1…n] means ordered list
All objects have BoundingBox attribute
• Goal: Define common document structure ideal for subsequent semantic
processing
• Captures both raw extracted information (text, vector graphics) along
with inferred artifacts (tables, charts, paragraphs)
• Start with PDF documents and extend to other formats such as Word
and Excel
• DSM Schema in OWL, Serializations to HTML, JSON...
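A small Python sketch of how such a model might look as data structures is given below. It is an assumption-laden illustration, not the DSM schema, which is defined in OWL. Object names (Page, Paragraph, TextLine, Token, Table, TableCell) and the bounding box and span attributes follow the draft diagram; the field names, the subset of objects chosen, and the JSON layout are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class BoundingBox:
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class Token:
    text: str
    bbox: BoundingBox


@dataclass
class TextLine:
    bbox: BoundingBox
    tokens: List[Token] = field(default_factory=list)  # ordered list [1…n]


@dataclass
class Paragraph:
    bbox: BoundingBox
    lines: List[TextLine] = field(default_factory=list)


@dataclass
class TableCell:
    bbox: BoundingBox
    contents: str
    row_span: int = 1
    col_span: int = 1


@dataclass
class Table:
    bbox: BoundingBox
    cells: List[TableCell] = field(default_factory=list)


@dataclass
class Page:
    bbox: BoundingBox
    paragraphs: List[Paragraph] = field(default_factory=list)  # inferred artifacts
    tables: List[Table] = field(default_factory=list)


def to_json(page: Page) -> str:
    """Serialize the nested structure; asdict recurses through the dataclasses."""
    return json.dumps(asdict(page), indent=2)


# Hypothetical one-token page to show the nesting.
box = BoundingBox(10, 10, 60, 22)
page = Page(
    bbox=BoundingBox(0, 0, 612, 792),
    paragraphs=[Paragraph(box, [TextLine(box, [Token("Bridge", box)])])],
    tables=[Table(box, [TableCell(box, "Brooklyn", col_span=2)])],
)
serialized = to_json(page)
print(serialized)
```

Keeping a bounding box on every object lets raw extracted items and inferred artifacts be traced back to the same region of the page.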
Thank You