Named Entity Recognition at Scale with Deep Learning
Sijun He (@SijunHe), #TwitterCortex at #ODSCWest
Introduction
Sijun He (@SijunHe)
ML Engineer II, Twitter Cortex
Agenda
1. NER on Tweets
2. Data
3. Model
4. Confidence Estimation
5. System Overview
Named Entity Recognition (NER) on Tweets
Entity types: Person, Location, Organization, Product, Other
Application of NER: Trends
Application of NER: Event Detection
[Fedoryszak et al., 2019]
Application of NER: User Interest
Last Engagements: Twitter (9), US (9), China (7), HK (7), Google (3), LinkedIn (3), Stanford CoreNLP (2), Jeremy Lin (2), Manchester United (1)
Why in-house NER?
● Strategic: a gauge of information extraction and content understanding at Twitter
● Unique linguistic features of tweets
  ○ Limited context due to brevity
  ○ Abbreviations
  ○ Typos
  ○ Informal language
  ○ Temporality
  ○ ...
● Cost of third-party cloud APIs at production volume
Example of NER on Tweet
[Figure: NER output on an example tweet from three systems: Google Natural Language API, Our Model, and spaCy (open-source)]
Generating Training Data
Data Cleaning
● Process character labels into token labels to train NER model
● Regular removal of deleted tweets (GDPR)
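The character-to-token label conversion above can be sketched as follows. This is a minimal illustration, not the production pipeline: it assumes whitespace tokenization and annotations given as `(start, end, type)` character offsets with an exclusive end, and the function name `char_spans_to_bio` is hypothetical.

```python
import re

def char_spans_to_bio(text, spans):
    """Convert character-level entity spans into token-level BIO tags.

    text:  the raw tweet text
    spans: list of (start, end, entity_type), end exclusive (assumed format)
    """
    tokens, tags = [], []
    # Assumption: tokens are maximal runs of non-whitespace characters.
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        tag = "O"
        for start, end, etype in spans:
            # A token belongs to an entity if the two ranges overlap.
            if tok_start < end and tok_end > start:
                # The first overlapping token opens the entity (B-),
                # later tokens continue it (I-).
                tag = ("B-" if tok_start <= start else "I-") + etype
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags

tokens, tags = char_spans_to_bio(
    "John lives in San Jose", [(0, 4, "Per"), (14, 22, "Loc")]
)
print(list(zip(tokens, tags)))
```

A real tokenizer for tweets would also need to handle punctuation, hashtags, and mentions, which whitespace splitting ignores.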
Sampling
● Stratified sampling based on tweet engagement
● Long period of time to capture temporal signal
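Stratified sampling by engagement could look like the sketch below. The bucket boundaries, the `engagements` field name, and the per-bucket quota are all assumptions made for illustration; the slides do not state the actual strata.

```python
import random
from collections import defaultdict

def engagement_bucket(engagements):
    # Hypothetical bucket edges, chosen only for this example.
    if engagements < 10:
        return "low"
    if engagements < 1000:
        return "medium"
    return "high"

def stratified_sample(tweets, per_bucket, seed=0):
    """Draw up to `per_bucket` tweets from each engagement stratum."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for tweet in tweets:
        buckets[engagement_bucket(tweet["engagements"])].append(tweet)
    sample = []
    for name in sorted(buckets):
        members = buckets[name]
        # Sample without replacement; small strata are taken whole.
        sample.extend(rng.sample(members, min(per_bucket, len(members))))
    return sample

example_tweets = [
    {"id": i, "engagements": e}
    for i, e in enumerate([0, 3, 5, 8, 50, 200, 700, 5000, 20000])
]
example_sample = stratified_sample(example_tweets, per_bucket=2)
print([t["id"] for t in example_sample])
```

Sampling an equal number per stratum keeps highly engaged tweets from being drowned out by the much larger low-engagement population. Covering a long time period would add the creation date as a second stratification key.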
Labeling
● Character-based labeling on a crowdsourced labeling platform
  ○ Person
  ○ Location
  ○ Organization
  ○ Product
  ○ Other
NER Model Setup
The model maps each input token to a tag:
John   lives  in  San    Jose
B-Per  O      O   B-Loc  I-Loc
● B: Beginning token of an entity
● I: Inside token of an entity
● O: Not an entity
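To make the B/I/O scheme concrete, here is a small sketch that groups a tag sequence back into entities. The function name `bio_to_entities` is hypothetical, and the lenient handling of an `I-` tag that does not follow a matching entity (it starts a new one) is a design choice of this example.

```python
def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Close the entity that is currently open, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "B" or (prefix == "I" and etype != current_type):
            # Start a new entity (also for an I- tag with no matching open entity).
            flush()
            current_tokens, current_type = [token], etype
        elif prefix == "I":
            current_tokens.append(token)
        else:  # "O"
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities

example_entities = bio_to_entities(
    ["John", "lives", "in", "San", "Jose"],
    ["B-Per", "O", "O", "B-Loc", "I-Loc"],
)
print(example_entities)  # [('John', 'Per'), ('San Jose', 'Loc')]
```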
Model Architectures
● Conditional Random Field [Lafferty et al., 2001]
● Deep Learning Architectures [Li et al., 2018]
● Fine-tuned Language Models [Devlin et al., 2019]
Conditional Random Field (CRF)
[Lafferty et al., 2001]
Hidden State:    B-Per  O      O   B-Loc  I-Loc  O
Observed State:  John   lives  in  San    Jose   .
● Discriminative analog to Hidden Markov Model (HMM)
● Models local context with transition matrix
CRF Transition Matrix
[Figure: CRF transition matrix; axes: From, To]
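The role of the transition matrix can be illustrated with a small Viterbi decoder for a linear-chain CRF. This is a sketch under simplifying assumptions: per-token emission scores are given as plain dictionaries, start and end transitions are omitted, and all scores in the example are made up.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag path.

    emissions:   one dict per token mapping tag -> score
    transitions: dict mapping (from_tag, to_tag) -> score
    """
    best = [{t: emissions[0][t] for t in tags}]
    backpointers = []
    for scores in emissions[1:]:
        row, ptr = {}, {}
        for cur in tags:
            # Best previous tag once the transition score is included.
            prev = max(tags, key=lambda p: best[-1][p] + transitions[(p, cur)])
            row[cur] = best[-1][prev] + transitions[(prev, cur)] + scores[cur]
            ptr[cur] = prev
        best.append(row)
        backpointers.append(ptr)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    return path[::-1]

example_tags = ["B", "I", "O"]
# Illustrative scores: every transition is neutral except O -> I, which is
# heavily penalized because an entity cannot start with an inside tag.
example_transitions = {(p, c): 0.0 for p in example_tags for c in example_tags}
example_transitions[("O", "I")] = -1e4
example_emissions = [
    {"B": 0.0, "I": 0.0, "O": 2.0},
    {"B": 0.5, "I": 1.0, "O": 0.0},
]
print(viterbi(example_emissions, example_transitions, example_tags))  # ['O', 'B']
```

Taken token by token, the second token would be tagged `I`, but the penalized `O` to `I` transition makes the decoder prefer `B`, which is the kind of local context the transition matrix encodes.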
Deep Learning Architectures
[Li et al., 2018]
● Input Layer: Word Embedding, Character Embedding, Hand-crafted Features, ...
● Context Layer: CNN, RNN, LSTM, Transformer, Attention, ...
● Decode Layer: MLP+Softmax, CRF, ...
Char-BiLSTM-CRF
● Input Layer: Word Representation, Character Representation, Other Features
● Context Layer: Bidirectional LSTM
● Decode Layer: CRF
Character Representations
[Li et al., 2018]
Decoder
[Li et al., 2018]
Fine-tuning Pre-trained LM (e.g. BERT)
[Devlin et al., 2019]
Performance on CoNLL 2003
Source: NLP Progress
Model Type    Performance (F1)
CRF           ~0.85
BiLSTM-CRF    ~0.92
BERT large    ~0.93
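CoNLL 2003 scores are entity-level F1: a predicted entity counts as correct only if both its boundaries and its type match the gold annotation exactly. A minimal sketch of that metric follows, assuming entities are represented as hashable tuples such as `(sentence_id, start, end, type)`, a representation chosen for this example.

```python
def entity_f1(gold, predicted):
    """Exact-match entity-level F1 over two collections of entity tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

example_gold = [(0, 0, 1, "Per"), (0, 3, 5, "Loc")]
# The second prediction has the right type but the wrong boundary,
# so it counts as both a false positive and a false negative.
example_pred = [(0, 0, 1, "Per"), (0, 3, 4, "Loc")]
print(entity_f1(example_gold, example_pred))  # 0.5
```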
Confidence Estimation
Input: Sijun He is in San Jose, CA
NER Model output:
Sijun  He     is  in  San    Jose   ,      CA
B-Per  I-Per  O   O   B-Loc  I-Loc  I-Loc  I-Loc
Confidence Estimation output:
Sijun He        Person      0.99
San Jose, CA    Location    0.97
Input: San Jose is in California
NER Model token confidence: San B-Loc 0.9, Jose I-Loc 0.6
● Softmax decoder computes token confidence
● CRF decoder only computes the confidence for the whole sentence
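Token confidence from a softmax decoder can be sketched as below: the confidence of each token is the probability of its highest-scoring tag. The logits are made-up numbers, and `token_confidences` is a hypothetical helper name.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_confidences(logits_per_token, tags):
    """Return (predicted_tag, probability) for every token."""
    results = []
    for logits in logits_per_token:
        probs = softmax(logits)
        best = max(range(len(tags)), key=probs.__getitem__)
        results.append((tags[best], probs[best]))
    return results

example_tag_set = ["B-Loc", "I-Loc", "O"]
# Illustrative logits for the tokens "San" and "Jose".
example_logits = [[3.0, 0.5, 0.2], [0.4, 1.5, 1.0]]
for tag, confidence in token_confidences(example_logits, example_tag_set):
    print(tag, round(confidence, 2))
```

These are per-token numbers. Turning them into one confidence for a multi-token entity such as "San Jose" needs an extra aggregation rule, which the slide does not specify.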
Confidence Estimation with CRF
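[Culotta et al., 2004]
Example sentence: Jane Doe went to Paris .
[Figure: tag lattice in which each token can take tag B, I, or O]
● Total Likelihood: the total likelihood of all possible tag sequences, a.k.a. the normalizer
● Constrained Total Likelihood: the total likelihood of only those tag sequences that satisfy the entity's constraints
  ○ Entity: Jane Doe
  ○ Constraints: (Jane, B), (Doe, I)
● Compute the marginal probability of the entity as the constrained total likelihood divided by the total likelihood
● Constrained Forward-Backward Algorithm: the forward-backward recursion restricted to tag paths that satisfy the constraints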
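The ratio of constrained to unconstrained total likelihood can be sketched with a forward pass in log space. This is a simplified illustration: scores are plain dictionaries with made-up values, start and end transitions are omitted, and only the constraints named on the slide are applied (the original paper also constrains the token after the entity so that the entity does not extend further).

```python
import math

NEG_INF = float("-inf")

def log_sum_exp(values):
    highest = max(values)
    if highest == NEG_INF:
        return NEG_INF
    return highest + math.log(sum(math.exp(v - highest) for v in values))

def log_total_likelihood(emissions, transitions, tags, constraints=None):
    """Forward algorithm. `constraints` maps token position -> required tag."""
    constraints = constraints or {}

    def allowed(pos, tag):
        return constraints.get(pos, tag) == tag

    # alpha[tag] = log of the summed scores of all allowed paths ending in tag.
    alpha = {t: emissions[0][t] if allowed(0, t) else NEG_INF for t in tags}
    for pos in range(1, len(emissions)):
        alpha = {
            cur: (
                log_sum_exp([alpha[p] + transitions[(p, cur)] for p in tags])
                + emissions[pos][cur]
            )
            if allowed(pos, cur)
            else NEG_INF
            for cur in tags
        }
    return log_sum_exp(list(alpha.values()))

def entity_confidence(emissions, transitions, tags, constraints):
    """Marginal probability that the constrained tags hold."""
    log_z = log_total_likelihood(emissions, transitions, tags)
    log_z_constrained = log_total_likelihood(emissions, transitions, tags, constraints)
    return math.exp(log_z_constrained - log_z)

crf_tags = ["B", "I", "O"]
crf_transitions = {(p, c): 0.0 for p in crf_tags for c in crf_tags}
crf_transitions[("O", "I")] = -5.0  # illustrative penalty
# Illustrative emission scores for "Jane Doe went".
crf_emissions = [
    {"B": 2.0, "I": 0.0, "O": 0.5},
    {"B": 0.1, "I": 1.5, "O": 0.3},
    {"B": 0.0, "I": 0.2, "O": 1.0},
]
# Constraints for the entity "Jane Doe": (Jane, B), (Doe, I).
print(entity_confidence(crf_emissions, crf_transitions, crf_tags, {0: "B", 1: "I"}))
```

With no constraints the ratio is 1, and with every token constrained it reduces to the probability of a single tag sequence.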
System Overview
Model Endpoint Proxy
● English NER
● Spanish NER
● Japanese NER
● ...
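One way to picture a proxy in front of per-language models is a dispatch table keyed by language code. Everything in this sketch is an assumption made for illustration: the class and method names, the use of language codes, the stub models, and the choice to return no entities for an unsupported language.

```python
from typing import Callable, Dict, List, Tuple

Entity = Tuple[str, str]  # (entity_text, entity_type)
Model = Callable[[str], List[Entity]]

class ModelEndpointProxy:
    """Routes each request to the NER model registered for its language."""

    def __init__(self) -> None:
        self._models: Dict[str, Model] = {}

    def register(self, language: str, model: Model) -> None:
        self._models[language] = model

    def predict(self, text: str, language: str) -> List[Entity]:
        model = self._models.get(language)
        if model is None:
            # Assumption: unsupported languages yield no entities.
            return []
        return model(text)

# Stub models standing in for real per-language NER models.
proxy = ModelEndpointProxy()
proxy.register("en", lambda text: [("San Jose", "Location")])
proxy.register("es", lambda text: [("Madrid", "Location")])

print(proxy.predict("John lives in San Jose", "en"))
print(proxy.predict("Bonjour", "fr"))
```

Keeping routing in one place lets clients call a single endpoint while models for new languages are added behind it.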
System Overview
[Figure: system diagram with components Tweet Creation, Model Endpoint, Cache, Scribe, HDFS, Online Clients, and Offline Clients; edges labeled Put, Read, and Cache miss]
System Read RPS        120k rps
Model Inference RPS    10k rps
Model Latency p99      20 ms
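The gap between read volume and inference volume is what a cache in front of the model would produce. Below is a minimal read-through cache sketch using the Put, Read, and Cache miss labels from the diagram. The exact data flow is an assumption, as are the class name, the in-memory dictionary, and the stub model.

```python
class NerCache:
    """Serves cached entities and falls back to the model on a cache miss."""

    def __init__(self, model_endpoint):
        self._model_endpoint = model_endpoint
        self._store = {}
        self.inference_calls = 0

    def put(self, tweet_id, entities):
        # Assumed path: entities are written when the tweet is created.
        self._store[tweet_id] = entities

    def read(self, tweet_id, text):
        if tweet_id in self._store:
            return self._store[tweet_id]
        # Cache miss: run inference once, then keep the result.
        self.inference_calls += 1
        entities = self._model_endpoint(text)
        self._store[tweet_id] = entities
        return entities

def stub_model(text):
    # Stand-in for the real model endpoint.
    return [("Twitter", "Organization")] if "Twitter" in text else []

cache = NerCache(stub_model)
for _ in range(12):
    cache.read(tweet_id=1, text="Working at Twitter")
print(cache.inference_calls)  # 1
```

Twelve reads trigger a single inference here, which mirrors how a read rate far above the inference rate can be sustained.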
Named Entities in External Articles
● One of the core pieces of public conversation on Twitter
● Run NER on each article's title and short snippet
● Significant upside in entity signal coverage
Example:
● Named entities in the tweet: none
● Named entities in the linked article: Brunswick, GA; Detroit; Lions; Georgia
Future Work
● Language-specific Model Architecture
● Multilingual Model
● Active Learning for Data Efficiency
References
● Mateusz Fedoryszak, Brent Frederick, Vijay Rajaram and Changtao Zhong, Real-time Event Detection on Social Data Streams, KDD 2019
● John Lafferty, Andrew McCallum and Fernando C.N. Pereira, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, ICML 2001
● Jing Li, Aixin Sun, Jianglei Han and Chenliang Li, A Survey on Deep Learning for Named Entity Recognition, 2018
● Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL-HLT 2019
● NLP Progress
● Aron Culotta and Andrew McCallum, Confidence Estimation for Information Extraction, HLT-NAACL 2004