Efficient Practices for Large Scale Text Mining Process
March 2, 2017
Ivelina Nikolova
Senior NLP Engineer
In this webinar you will learn …
• Industry applications that maximize Return on Investment (ROI) of your text mining process
• To describe your text mining problem
• To define the output of the text mining
• To select the appropriate text analysis techniques
• To plan the prerequisites for a successful text mining solution
• DOs and DON’Ts in setting up a text mining process.
Outline
• Business need for text mining solutions
• Introduction to NLP and information extraction
• How to tailor your text analysis process
• Applications and demonstrations
Business needs for text mining solutions

• Analyzing text to capture data from it supports:
  – increased user engagement via content recommendations,
  – shortened research cycles via semantic search,
  – regulatory compliance via smart indexing,
  – better content management, etc.
Some of our customers
Text analysis

• Parsing texts in order to extract machine-readable facts from them.
• Creating sets of structured or semi-structured data out of heaps of unstructured, heterogeneous documents.
• Relies on natural language processing techniques like:
  – automatic morphological analysis,
  – automated syntax analysis,
  – term weights and co-occurrence,
  – lexical semantics,
  and more complex tasks like:
  – named entity recognition,
  – relation extraction, etc.
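As a minimal sketch of the term-weighting technique mentioned above, the classic TF-IDF scheme can be computed in a few lines of plain Python; the toy corpus and the function name are illustrative only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for each term in each tokenized document."""
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [
    ["insulin", "reduces", "glucose", "levels"],
    ["glucose", "induced", "insulin", "release"],
    ["the", "mayor", "of", "london"],
]
w = tf_idf(docs)
# "london" occurs in only one document, so it outweighs "insulin",
# which appears in two of the three documents
```

Terms concentrated in few documents receive high weights, which is why TF-IDF is a common first signal for keyphrase extraction.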
Semantic annotation/enrichment

• Inextricably tied to text analysis
• Links mentions in the text to knowledge base concepts
• Automatic, manual and semi-automatic
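The linking step can be illustrated with a minimal gazetteer lookup that maps surface mentions to knowledge-base URIs; the gazetteer content and URIs below are illustrative, not a real knowledge base.

```python
# Minimal gazetteer lookup: link surface mentions to KB URIs.
# Gazetteer entries and URIs are illustrative examples only.
GAZETTEER = {
    "gulf of mexico": "http://sws.geonames.org/3523271/",
    "london": "http://sws.geonames.org/2643743/",
}

def annotate(text):
    """Return (mention, start, end, uri) for every gazetteer hit."""
    lowered = text.lower()
    annotations = []
    for surface, uri in GAZETTEER.items():
        start = lowered.find(surface)
        while start != -1:
            end = start + len(surface)
            annotations.append((text[start:end], start, end, uri))
            start = lowered.find(surface, end)
    return annotations

anns = annotate("Storms over the Gulf of Mexico reached London.")
```

Real systems add disambiguation on top of the lookup, since one surface form may match several concepts.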
State of the art

• Named Entity Recognition
  – 60% F1 [OKE-challenge@ESWC2015]
  – 82.9% F1 [Leaman and Lu, 2016] in the biomedical domain
  – above 90% for more specific tasks
Designing the text mining process
• Know your business problem
• Know your data
• Find appropriate samples
• Use common formats or formats which can be easily transformed to such
• Get together domain experts, technical staff, NLP engineers and potential users
• Narrow the business problem to information extraction task
• Clearly define the annotation types
• Clearly define the annotation guidelines
• Apply the appropriate algorithm for IE
• Do iterations of evaluation and improvement
• Ensure continuous adaptation by curation and re-training
Clear problem definition
• Define your business problem clearly
  • specific smart search
  • content recommendation
  • content enrichment
  • content aggregation, etc.
  E.g. the system must do <A, B, C>
• Define the text analysis problem clearly
  • Reduce the business problem to an information extraction problem

  Business problem: faceted search by Persons, Organizations, Locations
  Information extraction problem: extract mentions of Persons, Organizations and Locations and link them to the corresponding concepts in the knowledge base
Define the annotation types I

• Annotations – abstract descriptions of the mentions of concepts of interest

  Named entities: Person, Location, Organization; Disease, Symptom, Chemical; SpaceObject, SpaceCraft
  Relations: PersonHasRoleInOrganisation, Causation
Define the annotation types II

• Annotation types
  • Person, Organization, Location
  • Person, Organization, City
  • Person, Organization, City, Country
• Annotation features
  Location: string, geonames instance, latitude, longitude
Define the annotation types II

• Annotation types
  • Person, Organization, Location
  • Person, Organization, City
  • Person, Organization, City, Country
• Annotation features
  Location: string, geonames instance, latitude, longitude
  Chemical: string, InChI, SMILES, CAS
  PersonHasRoleInOrganization: person instance, role instance, organization instance, timestamp

Example annotation:
  string: the Gulf of Mexico
  startOffset: 71
  endOffset: 89
  type: Location
  inst: http://ontology.ontotext.com/resource/tsk7b61yf5ds
  links: [http://sws.geonames.org/3523271/, http://dbpedia.org/resource/Gulf_of_Mexico]
  latitude: 25.368611
  longitude: -90.390556
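An annotation with these fields maps naturally onto a plain data structure; the following sketch mirrors the fields shown on the slide (the class name and field layout are illustrative, not a fixed schema).

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """One mention of a concept, mirroring the fields on the slide."""
    string: str
    start_offset: int
    end_offset: int
    type: str
    inst: str
    links: list = field(default_factory=list)
    features: dict = field(default_factory=dict)

ann = Annotation(
    string="the Gulf of Mexico",
    start_offset=71,
    end_offset=89,
    type="Location",
    inst="http://ontology.ontotext.com/resource/tsk7b61yf5ds",
    links=[
        "http://sws.geonames.org/3523271/",
        "http://dbpedia.org/resource/Gulf_of_Mexico",
    ],
    features={"latitude": 25.368611, "longitude": -90.390556},
)
```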
Locations mentioned in Holocaust documents
Provide examples

• Realistic
• Demonstrating the desired output
• Positive and negative
  • "It therefore increases insulin secretion and reduces POS[glucose] levels, especially postprandially."
  • "It acts by increasing POS[NEG[glucose]-induced insulin] release and by reducing glucagon secretion postprandially."
• A representative and balanced set of the types of problems
• In appropriate/commonly used formats – XML, HTML, TXT, CSV, DOC, PDF
Domain model and knowledge
• Domain model/ontology - describes the types of objects in the problem area and the relations between them
Data

• Data sources – proprietary data, public data, professional data
• Data cleanup
• Data formats
• Data stores
  • For metadata – GraphDB (http://ontotext.com/graphdb/)
  • For content – MongoDB, MarkLogic, etc.
• Data modeling is an inevitable part of the process of semantic data enrichment
  • Start it as early as possible
  • Keep to the common data formats
  • Mistakes and underestimations are expensive because they influence the whole process of developing a text mining solution
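Since triple stores such as GraphDB ingest RDF, the metadata side of the model can be sketched as plain N-Triples serialization; the class URI below is a made-up placeholder, only the two W3C predicate URIs are standard.

```python
def to_ntriples(inst, type_uri, label):
    """Serialize one annotation instance as N-Triples lines (RDF),
    the format a triple store such as GraphDB can ingest."""
    rdf_type = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
    rdfs_label = "http://www.w3.org/2000/01/rdf-schema#label"
    return [
        f"<{inst}> <{rdf_type}> <{type_uri}> .",
        f'<{inst}> <{rdfs_label}> "{label}" .',
    ]

triples = to_ntriples(
    "http://ontology.ontotext.com/resource/tsk7b61yf5ds",
    "http://example.org/Location",   # illustrative class URI
    "the Gulf of Mexico",
)
```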
Gold standard

• Gold standard – annotated data of superior quality
• Annotation guidelines – used as guidance for manually annotating the documents
  POS[London] universities = universities located in London
  NEG[London] City Council
  NEG[London] Mayor
• Manual annotation tools – intuitive UI, visualization features, export formats
  • MANT – Ontotext's in-house tool
  • GATE – http://gate.ac.uk/ and https://gate.ac.uk/teamware/
  • Brat – http://brat.nlplab.org/
• Annotation approach
  • Manual vs. semi-automatic
  • Domain experts vs. crowd annotation, e.g. Mechanical Turk – https://www.mturk.com/
• Inter-annotator agreement
• Train:test ratio – 60:40, 70:30
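Inter-annotator agreement is commonly measured with Cohen's kappa, which corrects raw agreement for chance; a minimal sketch over two annotators' label sequences (the toy labels are illustrative):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences."""
    assert len(a) == len(b)
    n = len(a)
    # observed agreement: fraction of items labeled identically
    observed = sum(x == y for x, y in zip(a, b)) / n
    # expected chance agreement from each annotator's label distribution
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

ann1 = ["PER", "LOC", "LOC", "ORG", "PER", "LOC"]
ann2 = ["PER", "LOC", "ORG", "ORG", "PER", "LOC"]
kappa = cohens_kappa(ann1, ann2)  # 0.75: substantial agreement
```

Low kappa usually means the annotation guidelines are ambiguous and need another revision before more data is annotated.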
Text analysis approach

• Rule-based approach
  • lower number of clear patterns which do not change (or change only slightly) over time
  • high precision
  • appropriate for domains where it is important to know how the decision for extracting a given annotation is taken – e.g. the biomedical domain
• Machine learning approach
  • higher number of patterns which do change over time
  • requires annotated data
  • allows for retraining over time
• Neural network approach
  • Deep neural networks – getting closer to AI
  • Recent advances promise true natural language understanding via complex neural networks
  • Great results in speech recognition, image recognition and machine translation; a breakthrough is expected in NLP
  • Still unclear why and how it works, thus difficult to optimize
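One hedged illustration of the rule-based style: a single regular expression for "Person, role of Organization" phrases. Real rule sets are much larger and domain-tuned; this pattern and its role vocabulary are only a sketch.

```python
import re

# A toy extraction rule: "<Person>, <role> of <Organization>".
# The role list and name shape are deliberately simplistic.
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+), "
    r"(?P<role>CEO|CTO|chairman|director) of "
    r"(?P<org>[A-Z][A-Za-z]+)"
)

def extract_roles(text):
    """Return one dict per matched PersonHasRoleInOrganization phrase."""
    return [m.groupdict() for m in PATTERN.finditer(text)]

facts = extract_roles("Jane Smith, CEO of Acme, spoke first.")
# facts[0] == {"person": "Jane Smith", "role": "CEO", "org": "Acme"}
```

The upside is exactly the one named above: every extracted fact can be traced back to the rule that produced it.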
NER pipeline

• Preprocessing
• Keyphrase extraction
• Gazetteer-based enrichment
• Named entity recognition and disambiguation
• Generic entity extraction
• Result consolidation
• Relation extraction
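The pipeline above can be sketched as a chain of stages, each taking and returning a shared document dict; the stage internals here are stubs standing in for the real components.

```python
# A pipeline sketch: each stage enriches a shared document dict.
def preprocess(doc):
    """Tokenization stub for the preprocessing stage."""
    doc["tokens"] = doc["text"].split()
    return doc

def gazetteer_enrich(doc):
    """Gazetteer-based enrichment with a toy lookup table."""
    known = {"London": "Location", "Ontotext": "Organization"}
    doc["annotations"] = [
        {"string": t, "type": known[t]} for t in doc["tokens"] if t in known
    ]
    return doc

def consolidate(doc):
    """Result consolidation; a no-op placeholder here."""
    return doc

PIPELINE = [preprocess, gazetteer_enrich, consolidate]

def run(text):
    doc = {"text": text}
    for stage in PIPELINE:
        doc = stage(doc)
    return doc

doc = run("Ontotext is based in Sofia and London")
```

Keeping stages as independent functions over one document object makes it cheap to swap a component during the iterations of evaluation and improvement described earlier.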
Results curation / Error analysis

• Curation of results – domain experts manually assess the work of the text analysis components
• Testing interfaces
• Feedback
  • Select a representative set of documents to evaluate manually
  • Provide as full a description of the results and the component used as possible: <pipeline version> <input as sent for processing> <description of the wrong behavior> <description of the correct behavior>
• The earlier this happens, the sooner it triggers revision of the models and improvement of the annotation
Evaluation of the results

• Gold standard split train:test
  • 70:30
  • 80:20
• Which task do you want to evaluate?
  • E.g. extraction at document level or inline annotation
• Evaluation metrics
  • Information extraction tasks – precision, recall, F-measure
  • Recommendations – A/B testing
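The standard information extraction metrics reduce to set arithmetic over gold and predicted annotations; a minimal sketch with an illustrative toy gold standard:

```python
def prf(gold, predicted):
    """Precision, recall and F1 over sets of extracted items."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)               # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("London", "Location"), ("Paris", "Location"),
        ("Acme", "Organization")}
pred = {("London", "Location"), ("Acme", "Organization"),
        ("Berlin", "Location")}
p, r, f1 = prf(gold, pred)  # 2 of 3 predictions are correct
```

For inline annotation the items would be (string, offsets, type) tuples, so a boundary error counts as both a false positive and a false negative.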
Continuous adaptation
Types of extracted information

• Document categorization
  • post, political news, sport news, etc.
• Topic extraction
  • important words and phrases in the text
• Named entity recognition
  • People, Organizations, Locations, Times, Amounts of money, etc.
• Keyterm assignment from predefined hierarchies
• Concept extraction
  • entities from a knowledge base
• Relation extraction
  • relations between types of entities
Applications

• TAG (http://tag.ontotext.com)
• NOW (http://now.ontotext.com)
• Patient Insights (http://patient.ontotext.com/) – contact todor.primov@ontotext.com for credentials
Take away messages – DOs

• A clearly defined business problem needs to be broken down into a clearly defined information extraction problem
• Requires combined efforts from business decision makers, domain experts, natural language processing experts and technical staff
• Data modeling is an inevitable part of the process; consider it as early as possible
• Create clear annotation guidelines based on real-world examples
• Start with an initial small set of balanced and representative documents
• Plan the evaluation of the results in advance
• Choose an appropriate manual annotation tool
• While annotating content, check how the quantity influences the performance
• Select the appropriate text analysis approach
• Plan iterations of curation by domain experts followed by revision of the text analysis approach
• Plan the aspects of continuous adaptation – document quantity, timing, temporality of the information fed into the model
Take away messages – DON'Ts

Most common mistakes are caused by under- or overestimating some phases of the text mining process:

• Underestimated effort for the training corpus – this may lead to a longer phase of determining the correct algorithms and training models.
• Underestimated effort for the evaluation corpus – this may lead to a solution which cannot be practically evaluated and thus formally delivered/released.
• Overestimating the value of the data in the text mining process – if you spend too much effort building your own vocabularies, you will most probably end up with the same text mining solution as if you had bought professionally prepared data.
• Underestimating the data ETL before starting a text mining solution – this may delay the text mining solution because of a delayed training cycle.
• Overexpectations from dynamic data updates – it often turns out that when the solution is ready, it is more important to have a good process for dynamically updating the data than to have the updates instantly available.
• Intolerance towards extraction speed – this may lead to a faster solution which offers lower-quality results. If speed is not crucial, tolerate it.
• No readiness to implement changes in the workflow and collected data – a good automated solution is not one that completely replaces the manual workflow but one that brings higher value to your business. Be ready to slightly change your workflow, start collecting some new data and aim for an automated solution focused on new benefits.
Thank you very much for your attention!
You are welcome to try our demos at http://ontotext.com