Closed Set Extraction (CSE)
Enav Weinreb, Text Metadata Services, Thomson Reuters
November 18th, 2014

Transcript
Page 1

Closed Set Extraction (CSE)
Enav Weinreb
Text Metadata Services, Thomson Reuters
November 18th, 2014

Page 2

AGENDA
• Thomson Reuters, Text Metadata Services
• Closed Set Extraction
  – Problem description
  – Solution
• Aspects of the solution
• Conclusions and questions

Page 3

THOMSON REUTERS
• News, Broker Research, Bonds, Fundamentals, Press Releases
• Case Law, Admin Decisions, Public Records, Dockets, Arbitration
• Editorial Analysis, Scholarly Articles, Patents, Trademarks, Domain Names, Clinical Trials, Drugs

Page 4

TEXT METADATA SERVICES (TMS)
• In charge of extracting metadata from unstructured content
  – Entity extraction and resolution
  – Relations, events, and facts
  – Topics
• Formerly called ClearForest
• Owner of OpenCalais
• I lead a small applied research team within TMS

Page 5

COMPANY EXTRACTION
• Task: identify known companies within input text
• Motivation to concentrate on known companies:
  – Public companies have a special page in TR Eikon
  – Better quality for an easier problem
  – Faster development

Page 6

CLOSED SET EXTRACTION
• Starting point: a lexicon of company aliases
• Given an input document:
  – Find company aliases as candidates (TRIE)
  – Filter out noisy candidates using a machine learning layer
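
A minimal sketch of the trie-based candidate lookup, assuming a token-level trie over lowercased aliases; the function names, tokenization, and data are illustrative, not the production implementation:

```python
# Minimal sketch of trie-based candidate extraction over an alias lexicon.
def build_trie(aliases):
    """Token-level trie; the "$" key marks the end of a complete alias."""
    root = {}
    for alias in aliases:
        node = root
        for token in alias.lower().split():
            node = node.setdefault(token, {})
        node["$"] = alias
    return root

def find_candidates(tokens, trie):
    """Return (start, end, alias) spans whose tokens match a lexicon alias."""
    candidates = []
    for start in range(len(tokens)):
        node = trie
        for end in range(start, len(tokens)):
            node = node.get(tokens[end].lower())
            if node is None:
                break
            if "$" in node:
                candidates.append((start, end + 1, node["$"]))
    return candidates

trie = build_trie(["Apple", "Williams Energy", "News Corp"])
tokens = "shares of Williams Energy rose on higher gas prices".split()
print(find_candidates(tokens, trie))  # [(2, 4, 'Williams Energy')]
```

A production version would also have to handle punctuation, possessives, and overlapping matches; every span found here is only a candidate for the machine learning filter.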

Page 7

EXAMPLES
• “China, Apple's third biggest market in the world behind the US and Europe, has been a hefty cash cow for Apple.”
• “Apple partisans (a real faction) often claim that the typical American grocer stocks no more than 12 varieties: Red Delicious, Gala, Fuji, et al.”

Page 8

TRAINING PROCESS
• Collect input texts
• Identify company candidates (TRIE)
• Send candidates to manual tagging
• Extract features
• Train a classifier (logistic regression)
• Results for Reuters news documents are of very high quality
  – F-measure of ~96%
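
As a rough illustration of the classification step, candidate feature dictionaries could feed a scikit-learn logistic regression like this; the feature names and tiny training set are invented for the example:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# One feature dictionary per manually tagged candidate span (toy data).
X = [{"prev_bigram=CEO of": 1, "has_company_suffix": 1},
     {"prev_unigram=Mr.": 1, "has_company_suffix": 0}]
y = [1, 0]  # 1 = real company mention, 0 = noise

model = make_pipeline(DictVectorizer(), LogisticRegression())
model.fit(X, y)
# Probability that each candidate is a genuine company mention.
print(model.predict_proba(X)[:, 1])
```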

Page 9

HISTORY
• Started with a rule-based algorithm with
  – Good quality, but
  – Bad latency
• Requirements:
  – Under 50 ms per document
  – Only public companies
• We used the rule-based algorithm as a data labeler
• It turns out that learning from an existing rule-based algorithm can have surprisingly positive results

Page 10

EXTRACTING CANDIDATES
• Lexicon search with a basic TRIE implementation
• Coverage: extending the lexicons using the rule-based algorithm
  – Co-mentions with known aliases in a large corpus
• Fuzzy lexicon search?
  – Vowel-free lexicon
  – Fuzzy TRIE???
  – How to treat fuzzy candidates?
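
The vowel-free idea could look something like the following sketch: index every alias under a key with its vowels stripped, so vowel-level misspellings still hit the lexicon. This is an assumption about the approach, not a description of an implemented system:

```python
import re
from collections import defaultdict

def devowel(text):
    """Lowercase the text and strip vowels to form a fuzzy lookup key."""
    return re.sub(r"[aeiou]", "", text.lower())

fuzzy_index = defaultdict(set)
for alias in ["Microsoft", "Williams Energy"]:
    fuzzy_index[devowel(alias)].add(alias)

# The misspelling "Micrsoft" maps to the same key as "Microsoft".
print(fuzzy_index[devowel("Micrsoft")])  # {'Microsoft'}
```

Each fuzzy hit would still need to pass the machine learning filter, which is exactly the open question of how to treat fuzzy candidates.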

Page 11

FEATURE ENGINEERING
• Positional unigrams and bigrams
  – “CEO of” before the instance is a positive feature
  – “Mr.” before the instance is a negative feature
  – 2-3 tokens before and after the instance give the best results
• Company fingerprint – positive and negative
  – “tennis” is in the negative signature for “Williams Energy”
  – “Gas” is in the positive signature for “Williams Energy”
  – Here we look in a wider window around the instance
• Company suffix (Inc., Ltd., etc.)
• Part of speech
• More…
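
A sketch of the positional unigram/bigram features, assuming a window of two tokens on each side of the candidate; the feature-name format is illustrative:

```python
def positional_features(tokens, start, end, window=2):
    """Emit unigram/bigram features from the tokens around a candidate span."""
    features = {}
    left = tokens[max(0, start - window):start]
    right = tokens[end:end + window]
    for i, tok in enumerate(left):
        features[f"left_unigram[-{len(left) - i}]={tok}"] = 1
    for i, tok in enumerate(right):
        features[f"right_unigram[+{i + 1}]={tok}"] = 1
    if len(left) == 2:
        features[f"left_bigram={left[0]} {left[1]}"] = 1
    return features

tokens = "CEO of Williams Energy resigned today".split()
# The candidate span is tokens[2:4] == ['Williams', 'Energy'].
print(positional_features(tokens, 2, 4))
# Includes 'left_bigram=CEO of', a positive feature per the slide.
```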


Page 12

DATA LABELING
• Discovery versus CSE
  – Searching for companies in text versus answering multiple-choice questions
• Ideal for crowdsourcing
• 10 times cheaper than labeling for company discovery algorithms


Page 13

CROWDSOURCING EXAMPLE
Please read the following text:
“A high-achiever from a young age, Kim Williams is known as one of our top business leaders. Now, the workaholic's sudden departure as News Corp's CEO has given him time for reflection.”
In the above text, “Williams” is best described as:
o A company name
o Part of a company name
o Neither a company name nor a part of one
o There is not enough information in the text to determine

Page 14

LOGISTIC REGRESSION VS. CRF
• Faster improvement iterations
• Large training sets: 100Ks of examples labeled by
  – Our rule-based algorithm
  – The crowd
• More data or smarter machine learning?


Page 15

DOMAIN ADAPTATION MADE EASIER
• Domain adaptation is simpler for logistic regression
• Example:
  – Trained a model on Reuters news
  – Ran it on research reports and got a 77% F-measure
  – Adapted a model for research reports
  – F-measure improved to 87%
• Main idea: cluster news and research instances together and boost confidence for research instances within positive news clusters (a sketch follows below)
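
An illustrative sketch of that main idea, assuming instances are represented as NumPy feature vectors and clustered with k-means; the cluster count, purity threshold, and boost amount are assumptions, not the production values:

```python
import numpy as np
from sklearn.cluster import KMeans

def boost_research_scores(news_X, news_y, research_X, research_scores,
                          n_clusters=50, boost=0.15):
    """Boost research-instance confidences in clusters dominated by
    confirmed positive news instances."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        np.vstack([news_X, research_X]))
    news_labels = labels[:len(news_X)]
    research_labels = labels[len(news_X):]

    boosted = research_scores.copy()
    for c in range(n_clusters):
        cluster_y = news_y[news_labels == c]
        # A "positive news cluster": mostly confirmed company mentions.
        if len(cluster_y) and cluster_y.mean() > 0.8:
            boosted[research_labels == c] += boost
    return np.clip(boosted, 0.0, 1.0)
```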



Page 16

SYNTACTIC VS. WORLD-KNOWLEDGE
• Syntactic features
  – Words before/after the candidate instance
  – Part of speech
  – Regular expressions on the candidate
• World-knowledge features
  – Company signature
  – Candidate aliases
• We trained each model separately and then took a logical OR of the results (see the sketch below)
• The resulting model introduces a significant improvement over the baseline
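
A small sketch of the OR combination: a candidate is accepted if either the syntactic model or the world-knowledge model accepts it. The model objects and the 0.5 threshold are illustrative:

```python
def combined_predict(syntactic_model, knowledge_model,
                     syn_features, know_features, threshold=0.5):
    """Accept a candidate if either trained model accepts it (logical OR)."""
    syn_pos = syntactic_model.predict_proba(syn_features)[:, 1] >= threshold
    know_pos = knowledge_model.predict_proba(know_features)[:, 1] >= threshold
    return syn_pos | know_pos
```

OR-ing the decisions favors recall: a mention missed by one feature family can still be rescued by the other.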

Page 17

SOLVING PRODUCTION ISSUES
• Suppose we have a bug in production:
  “ISIS is turning us all into its recruiting sergeants”
• We do not want to wait for the next drop to fix it
• “Negative signature” – get a list of related bugs and create a patch to solve them on-the-fly
  – A content specialist answers a few yes/no questions and the system builds a classifier that identifies further instances of the bug and cleans them
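
One way such an on-the-fly patch could work is a per-alias negative signature of context words; everything here (the signature words, window size, and data structure) is a hypothetical illustration rather than the actual system:

```python
# Hypothetical negative-signature patch: suppress a specific alias when
# context words derived from the specialist's yes/no answers appear nearby.
NEGATIVE_SIGNATURES = {
    "isis": {"militants", "recruiting", "caliphate", "insurgents"},
}

def passes_patch(tokens, start, end, window=10):
    """Return False when a negative-signature word appears near the alias."""
    alias = " ".join(tokens[start:end]).lower()
    signature = NEGATIVE_SIGNATURES.get(alias)
    if not signature:
        return True  # no patch applies to this alias
    context = {t.lower() for t in tokens[max(0, start - window):end + window]}
    return not (context & signature)

tokens = "ISIS is turning us all into its recruiting sergeants".split()
print(passes_patch(tokens, 0, 1))  # False: 'recruiting' triggers the patch
```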

Page 18

DOCUMENT LEVEL TRIAGE
• Let's stay with our friends in the ISIS example
• Sometimes the sentence itself really looks as if it has a company mention:
  “ISIS is turning us all into its recruiting sergeants”
• As humans, we immediately identify that the document is not about a company
• We built a classifier at the document level that answers the question: how likely is this document to contain any company?
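
A sketch of such a document-level triage classifier, here with TF-IDF features and logistic regression; the two-document training set and the idea of gating extraction on the score are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["Apple reported record quarterly revenue on strong iPhone sales.",
        "ISIS is turning us all into its recruiting sergeants."]
has_company = [1, 0]  # toy labels: does the document mention any company?

triage = make_pipeline(TfidfVectorizer(), LogisticRegression())
triage.fit(docs, has_company)

# Documents scoring below a threshold could skip company extraction entirely.
print(triage.predict_proba(docs)[:, 1])
```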

Page 19

UNSUPERVISED VERSION OF CSE
• Find candidate instances in the corpus
• Use clustering and basic statistics to infer clearly positive and clearly negative instances
• Identify unambiguous instances to be used as positive examples
  – Avoid introducing a feature that looks at these aliases
• Use sibling lexicons as negative examples
• Use the negative and positive sets to train a classifier that decides on more instances
• Iterate… (a sketch of the bootstrap loop follows)
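
A sketch of that bootstrap loop, seeding from unambiguous positives and sibling-lexicon negatives represented as feature dictionaries; the confidence threshold, round count, and helper structure are assumptions:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def bootstrap(seed_pos, seed_neg, unlabeled, rounds=5, confidence=0.95):
    """Self-train a candidate classifier from seed feature dictionaries."""
    X = seed_pos + seed_neg
    y = [1] * len(seed_pos) + [0] * len(seed_neg)
    model = make_pipeline(DictVectorizer(), LogisticRegression())
    for _ in range(rounds):
        model.fit(X, y)
        probs = model.predict_proba(unlabeled)[:, 1]
        still_ambiguous = []
        for feats, p in zip(unlabeled, probs):
            if p >= confidence:
                X.append(feats)
                y.append(1)  # confident positive joins the training set
            elif p <= 1 - confidence:
                X.append(feats)
                y.append(0)  # confident negative joins the training set
            else:
                still_ambiguous.append(feats)
        unlabeled = still_ambiguous
        if not unlabeled:  # nothing left to decide
            break
    return model
```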

Page 20

CONCLUSIONS AND FUTURE WORK
• When only known entities matter, entity extraction becomes easier

• Labeling data is cheaper

• Faster improvement iterations

• Easier to use world knowledge

• Easier to apply domain adaptation

Page 21

WE ARE HIRING!