
Page 1: Smart Text How to Turn Big Text into Big Data Tom Reamy Chief Knowledge Architect KAPS Group  Program Chair – Text Analytics World.

Smart Text: How to Turn Big Text into Big Data

Tom Reamy

Chief Knowledge Architect

KAPS Group

Program Chair – Text Analytics World

Taxonomy Boot Camp, KMWorld: Washington DC

Internet Librarian: Monterey, CA

Page 2:

KAPS Group: General

Knowledge Architecture Professional Services – Network of Consultants
Partners – Expert System, SAS, SAP, IBM, FAST, Smart Logic, Concept Searching, Attensity, Clarabridge, Lexalytics
Strategy – IM & KM – Text Analytics, Social Media, Integration
Services:
– Taxonomy/Text Analytics development, consulting, customization
– Text Analytics Fast Start – Audit, Evaluation, Pilot
– Social Media: Text based applications – design & development
Clients:
– Genentech, Novartis, Northwestern Mutual Life, Financial Times, Hyatt, Home Depot, Harvard Business Library, British Parliament, Battelle, Amdocs, FDA, GAO, etc.
Applied Theory – Faceted taxonomies, complexity theory, natural categories, emotion taxonomies

Presentations, Articles, White Papers –

Page 3:

Introduction: Big Text and Big Data
Pharma: Semantic Search Application
– Project Components & Approach
– Extraction Rules
Publishing: Processing 700K Proposals
– Adding Structure to Unstructured Text
– Text into Data


Page 4:

Big Text and Big Data

Big Text is Bigger than Big Data
– 80% -> 90% of business information (Social Media)
Big Data tells you WHAT – Smart Text tells you WHY
Big Data – Data Munging = 50-80% of Data Scientist Time
– Variety of Formats // Ambiguity of Human Language
Ontology / Fact Extraction – Pulmonary ISA Disease
– Chronic obstructive pulmonary disease, obstructive pulmonary disease, Copd, copd, COPD, Asthma (Asthema), Emphysema, etc.
Semi-Automatic Hybrid Solutions
– AI not here yet (again)
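The fact-extraction bullet above can be sketched in a few lines of Python. This is only an illustration, not the tooling used in the projects: the `VARIANTS` table, the `PARENT` mapping, and the function names are assumptions, and the variant list is limited to the spellings shown on the slide.

```python
import re

# Hypothetical variant table built only from the spellings on the slide;
# a real ontology would be far larger and curated by people.
VARIANTS = {
    "COPD": [
        "chronic obstructive pulmonary disease",
        "obstructive pulmonary disease",
        "copd",
    ],
    "Asthma": ["asthma", "asthema"],  # "asthema" is a common misspelling
    "Emphysema": ["emphysema"],
}

# Every concept in this toy ontology ISA pulmonary disease.
PARENT = {concept: "Pulmonary Disease" for concept in VARIANTS}


def build_pattern(variants):
    """Compile one case-insensitive regex covering every variant."""
    # Longest variants first, so a full phrase wins over a phrase it contains.
    terms = sorted(
        ((v, c) for c, vs in variants.items() for v in vs),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    lookup = {v: c for v, c in terms}
    regex = re.compile(
        r"\b(" + "|".join(re.escape(v) for v, _ in terms) + r")\b",
        re.IGNORECASE,
    )
    return regex, lookup


def extract_facts(text):
    """Return (surface form, canonical concept, parent class) triples."""
    regex, lookup = build_pattern(VARIANTS)
    facts = []
    for match in regex.finditer(text):
        concept = lookup[match.group(1).lower()]
        facts.append((match.group(1), concept, PARENT[concept]))
    return facts
```

Keeping the variant table separate from the matching code means a person can keep adding spellings while the code stays unchanged, which is one way to read "semi-automatic hybrid".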

Page 5:

Pharma: Project

Agile Methodology
Goal – evaluate the ability of text analysis technologies to:
– Replace manual annotation of scientific documents – automated or semi-automated
– Discover new entities and relationships
– Provide users with self-service capabilities
Goal – feasibility and effort level

Page 6:

Components – Technology, Resources

Cambridge Semantics, Linguamatics, SAS Enterprise Content Categorization
– Initial integration – passing results as XML
Content – scientific journal articles
Taxonomy – MeSH – select small subset
Access to a “customer” – critical for success

Page 7:

Three rounds - Iterations

Visualization – faceted search, sort by date, author, journal
– Cambridge Semantics
Round 1 – PDF from their database
– Needed to create additional structure and metadata
– No such thing as unstructured content
Round 2 & 3 – XML with full metadata from PubMed
Entity Recognition – Species, Document Type, Study Type, Drug Names, Disease Names, Adverse Events

Page 8:

Components & Approach

Rules or sample documents?
– Need more precision and granularity than documents can do
– Training sets – not as easy as thought
First Rules – text indicators to define sections of the document
– Objectives, Abstract, Purpose, Aim – all the “same” section
– Experiment – clusters / vocabulary to define section
Separate logic of the rules from the text
– Stable rules, changing text
Scores – relevancy with thresholds
– Not just frequency of words
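A rough Python sketch of the approach on this slide, under stated assumptions: the indicator vocabulary (`SECTION_INDICATORS`), the per-section weights, and the threshold value are all invented for illustration. It shows the separation of stable logic from changing text, and a relevancy score that depends on where a term occurs, not only on how often.

```python
import re

# Text side: indicator vocabularies (hypothetical). These can change
# without touching the logic below.
SECTION_INDICATORS = {
    "objective": ["objectives", "objective", "abstract", "purpose", "aim"],
    "methods": ["methods", "materials and methods"],
    "results": ["results", "findings"],
}

# Hypothetical weights: a mention in the objective section counts for more.
SECTION_WEIGHTS = {"objective": 3.0, "methods": 1.0, "results": 2.0, None: 0.5}


def label_sections(lines):
    """Logic side: tag each text line with the section opened by the latest heading."""
    heading_to_section = {
        term: section
        for section, terms in SECTION_INDICATORS.items()
        for term in terms
    }
    current = None
    labeled = []
    for line in lines:
        key = line.strip().rstrip(":").lower()
        if key in heading_to_section:
            current = heading_to_section[key]  # heading line: switch section
            continue
        if line.strip():
            labeled.append((current, line.strip()))
    return labeled


def relevancy(labeled, term, threshold=3.0):
    """Weighted count of a term across sections, compared with a threshold."""
    pattern = rf"\b{re.escape(term)}\b"
    score = sum(
        SECTION_WEIGHTS.get(section, 0.5)
        * len(re.findall(pattern, text, re.IGNORECASE))
        for section, text in labeled
    )
    return score, score >= threshold
```

Because "Purpose", "Aim", "Abstract", and "Objectives" all map to one section label, adding a new heading spelling is a vocabulary edit, not a rule change.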

Page 9:

Document Type Rules

(START_2000, (AND, (OR, _/article:"[Abstract]", _/article:"[Methods]", _/article:"[Objective]", _/article:"[Results]", _/article:"[Discussion]",
(OR, _/article:"clinical trial*", _/article:"humans",
(NOT, (DIST_5, (OR, _/article:"approved", _/article:"safe", _/article:"use", _/article:"animals"),

Clinical Trial Rule: If the article has sections like Abstract or Methods AND has phrases around “clinical trials / Humans” and not words like “animals” within 5 words of “clinical trial” words – count it and add up a relevancy score
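The plain-English reading of the clinical trial rule can be approximated in Python as follows. This is a sketch of the description, not of the SAS rule engine: the function name is made up, `START_2000` is assumed to mean the first 2,000 words, and the exact operator semantics of the original rule are not reproduced.

```python
import re

# Term lists copied from the rule on the slide.
SECTION_TERMS = {"abstract", "methods", "objective", "results", "discussion"}
EXCLUDE_TERMS = {"approved", "safe", "use", "animals"}


def clinical_trial_score(text, start_words=2000, distance=5):
    """Count clinical-trial mentions that have no excluded word nearby."""
    # Assumption: the START_2000 window is measured in words.
    tokens = re.findall(r"[a-z]+", text.lower())[:start_words]

    # The article must show at least one section indicator.
    if not SECTION_TERMS.intersection(tokens):
        return 0

    score = 0
    for i, token in enumerate(tokens):
        if token == "humans":
            span = 1
        elif (
            token == "clinical"
            and i + 1 < len(tokens)
            and tokens[i + 1].startswith("trial")  # "clinical trial*"
        ):
            span = 2
        else:
            continue
        # Words within `distance` positions on either side of the mention.
        window = tokens[max(0, i - distance): i + span + distance]
        if not EXCLUDE_TERMS.intersection(window):
            score += 1  # each clean mention adds to the relevancy score
    return score
```

A threshold on the returned score would then decide whether the document is typed as a clinical trial.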

Page 10:

Rules for Drug Names and Diseases

Primary issue – major mentions, not every mention
– Combination of noun phrase extraction and categorization
– Results – virtually 100%
Taxonomy of drug names and diseases
Capture general diseases like thrombosis and specific types like deep vein, cerebral, and cardiac
Combine text about arthritis and synonyms with text like “Journal of Rheumatology”

Page 11:


Page 12:

Rules for Drug Names and Diseases

(OR, _/article/title:"[clonidine]",
(AND, _/article/mesh:"[clonidine]", _/article/abstract:"[clonidine]"),
(MINOC_2, _/article/abstract:"[clonidine]"),
(START_500, (MINOC_2, "[clonidine]")))

Means:
– Any variation of drug name in title – high score
– Any variation in MeSH keywords AND in abstract – high score
– Any variation in abstract at least 2x – good score
– Any variation in first 500 words at least 2x – suspect
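The four readings of the rule can be mirrored in a small Python function. This is a sketch under assumptions: the article is a plain dict with hypothetical `title`, `mesh`, `abstract`, and `body` keys, the unqualified `START_500` clause is applied to `body`, and the labels are the informal ones from the slide, not numeric scores from the actual engine.

```python
import re


def count_mentions(text, variants, first_words=None):
    """Count occurrences of any variant, optionally only in the first N words."""
    words = text.split()
    if first_words is not None:
        words = words[:first_words]
    scoped = " ".join(words)
    pattern = r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b"
    return len(re.findall(pattern, scoped, re.IGNORECASE))


def drug_mention_level(article, variants):
    """Classify how strongly an article is 'about' a drug.

    `article` is a dict with optional 'title', 'mesh', 'abstract', 'body'
    strings (a hypothetical layout for this sketch).
    """
    in_title = count_mentions(article.get("title", ""), variants)
    in_mesh = count_mentions(article.get("mesh", ""), variants)
    in_abstract = count_mentions(article.get("abstract", ""), variants)

    if in_title >= 1:
        return "high"      # any variation in the title
    if in_mesh >= 1 and in_abstract >= 1:
        return "high"      # MeSH keywords AND abstract
    if in_abstract >= 2:
        return "good"      # at least twice in the abstract (MINOC_2)
    if count_mentions(article.get("body", ""), variants, first_words=500) >= 2:
        return "suspect"   # at least twice in the first 500 words
    return "none"
```

Checking the clauses from strongest to weakest evidence gives the major-versus-minor mention distinction described on the earlier slide.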

Page 13:

Rules for Drug Names and Diseases

Results:
– Wide range by type – 70-100% recall and precision
Focus mostly on precision – difficult to test recall
One deep dive area indicated that 90%+ scores for both precision and recall could be built with moderate level of effort
Not linear effort – 30% accuracy does not mean 1/3 done
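For reference, precision and recall for one entity type can be computed as below. The function name and the document IDs in the usage note are made up for illustration.

```python
def precision_recall(predicted, gold):
    """Precision and recall of a set of predictions against a gold set.

    predicted: items the rules tagged (e.g. document IDs for one drug)
    gold:      items a human reviewer says should have been tagged
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

For example, `precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d3", "d5", "d6"})` gives a precision of 0.75 and a recall of 0.6. Precision only requires reviewing what the rules returned, while recall requires a gold set covering everything they should have returned, which is why recall is the harder figure to test.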

Page 14:

Project was a success!

Useful results – as defined by the customer
Reasonable and doable effort level – both for initial development and maintenance
Essential Success Factors
– Rules not documents, training sets (starting point)
– Full platform for disambiguation of noun phrase extraction, major-minor mention
– Separation of logic and text
“Semantic” Search works!
– If you do it smart!


Page 15:

Publishing Project: Reed Construction Data

700,000 Proposals – Wide Variation
Process Proposals – extract data – 30-50 types
Current Manual Process – Internal Teams
– Expensive and Slow
Structure Variety of Unstructured Documents
– Generate Table of Contents
– Generate Sections and Capture Text
Extract Key Information
Save Time & Money, Flexible Hiring, New Offerings


Page 16:

Publishing Project: Components: Technology, Resources

Initial Attempt – failed target, too expensive to complete
KAPS Group and SAS – Enterprise Content Categorization
– Team of 4 – mostly part time
Reed Data Resources – 3 part time +, Current team of proposal processors – develop test documents
4 Months – majority of time/effort on Key Data Extraction
Sections – by Construction codes & text, Automated Table of Contents


Page 17:

Publishing Project: Example Rules – Automated Table of Contents


Page 18:

Publishing Project: Example Rules – Automated Table of Contents

(AND, (OR,
(ORD, "[SectionHeaderTags]", "[Division01B_RegEx]", "[TechnicalSpecPhrases]", (ORDDIST_3, "[SectionBodyPart]", "[SectionBodyDesc]")),
(ORD, "[Division01B_RegEx]", "[TechnicalSpecPhrases]", (ORDDIST_3, "[SectionBodyPart]", "[SectionBodyDesc]"))))

__Division01BRegEx:
00[0-9][0-9][0-9],
00[ _-]?[0-9][0-9][ _-]?[0-9][0-9],
00[ _-]?[0-9][0-9][ _-]?[0-9][0-9][\.][0-9][0-9],

Technical Spec Phrases – Abandonment, Abatement, Abbreviations, Above-Grade, Aboveground, Abrasion-Resistant, Abrasive, Absorption, AC, Acceleration, etc. (~2,000 terms)
Section Header Tags – “Section, Division, Document”
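The three section-number patterns translate directly into Python's `re` syntax. The sketch below is a loose approximation of the first `ORD` clause only (a header tag followed by a section number); the function name and the sample headings in the usage note are invented, and the technical-phrase and section-body checks of the original rule are left out.

```python
import re

# The three patterns from the slide, most specific first so the longest
# form of a section number is the one that matches.
DIVISION_00_PATTERNS = [
    r"00[ _-]?[0-9][0-9][ _-]?[0-9][0-9][\.][0-9][0-9]",
    r"00[ _-]?[0-9][0-9][ _-]?[0-9][0-9]",
    r"00[0-9][0-9][0-9]",
]
DIVISION_00 = re.compile(r"\b(?:" + "|".join(DIVISION_00_PATTERNS) + r")\b")

# Section header tags listed on the slide.
SECTION_HEADER_TAGS = re.compile(r"\b(?:Section|Division|Document)\b", re.IGNORECASE)


def find_toc_entries(lines):
    """Return (section number, title) pairs for lines that look like headers."""
    entries = []
    for line in lines:
        tag = SECTION_HEADER_TAGS.search(line)
        if not tag:
            continue
        # Ordered match: the number must come after the header tag.
        number = DIVISION_00.search(line, tag.end())
        if number:
            title = line[number.end():].strip(" -:\t")
            entries.append((number.group(0), title))
    return entries
```

Run over the lines of a proposal, this yields a rough table of contents. For instance, a made-up line such as `SECTION 00 11 13 - ADVERTISEMENT FOR BIDS` becomes the pair `("00 11 13", "ADVERTISEMENT FOR BIDS")`.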


Page 19:

Publishing Project: Example Rules – Key Data Extraction

Bid Dates/Times
Roles (Architect, Designer, etc.) – names and addresses, etc.
Project Attributes – Cost, Invitation Number, Parking, etc.
Some Easy, Some Hard – Address!
Example:
ARCHITECT: MICHEAL KIM ARCHITECTURE
1 HOLDEN STREET
BROOKLINE, MA 02445
P: (617) 739-6925 F: (772) 325-2991
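A minimal Python sketch of the "some easy, some hard" point, using the sample contact block from this slide flattened to one line. The pattern and function names are assumptions. Role, state, ZIP code, phone, and fax have regular shapes and come out cleanly; the firm name, street, and city are deliberately left as one unsplit string, because there is no reliable delimiter between them.

```python
import re

# Roles are limited to the two named on the slide; a real list would be longer.
CONTACT_BLOCK = re.compile(
    r"(?P<role>ARCHITECT|DESIGNER):\s*"
    r"(?P<name_street_city>.+?),\s*"        # hard part: left unsplit
    r"(?P<state>[A-Z]{2})\s+"
    r"(?P<zip>\d{5}(?:-\d{4})?)"
    r"(?:\s+P:\s*(?P<phone>\(\d{3}\)\s*\d{3}-\d{4}))?"
    r"(?:\s+F:\s*(?P<fax>\(\d{3}\)\s*\d{3}-\d{4}))?"
)


def extract_contact(text):
    """Return the named fields of the first contact block, or None."""
    match = CONTACT_BLOCK.search(text)
    return match.groupdict() if match else None


sample = (
    "ARCHITECT: MICHEAL KIM ARCHITECTURE 1 HOLDEN STREET "
    "BROOKLINE, MA 02445 P: (617) 739-6925 F: (772) 325-2991"
)
contact = extract_contact(sample)
```

Splitting `name_street_city` further would need extra evidence such as a gazetteer of city names or street-type vocabulary, which is where rules and human review come in.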


Page 20:

Publishing Project: Process & Approach


Page 21:

Publishing Project: Example Rules – Key Project Data


Page 22:

Publishing Project: Example Rules – Key Project Data


Page 23:

Conclusion: Lessons Learned

Development requires lots of content, testers, regular meetings
Best Pattern Rule Development = develop a few rules to production level, then adapt to other areas
Hybrid Solutions are best (AI not here yet)
Biggest Problem = Human Creativity
Best Solution = Human Creativity
But – successful project!
Foundation laid for Semi-automated text processing, new data
Next Steps – refine, add, refine, new, refine, refine


Page 24:

Text Analytics: Platform & Foundation for Applications
Semantic Search and (Semi)-Automated Business Processes
AND – Sentiment Analysis-Social Media, Fraud Detection, eDiscovery, Expertise location & analysis, behavior prediction
Data/Fact Extraction can feed/extend Big Data and Semantic Technology applications
Interested?
– Text Analytics World, San Francisco March 30-April 1
  • (Call for Speakers Now)
New Book coming: Text Analytics: Everything You Need to Know to Conquer Information Overload, Mine Social Media for Real Value, and Turn Big Text into Big Data


Page 25:

Questions?

Tom Reamy
[email protected]
KAPS Group
Knowledge Architecture Professional Services
March 30-April 1, San Francisco
