Named Entity Recognition at Scale with Deep Learning


Sijun He (@SijunHe), #TwitterCortex at #ODSCWest

Introduction

Sijun He (@SijunHe)

ML Engineer II, Twitter Cortex


Agenda

1. NER on Tweets
2. Data
3. Model
4. Confidence Estimation
5. System Overview


Named Entity Recognition (NER) on Tweets

Entity types: Person, Location, Organization, Product, Other

Application of NER: Trends


Application of NER: Events Detection

[Fedoryszak et al., 2019]

Application of NER: User Interest

Last Engagements: Twitter (9), US (9), China (7), HK (7), Google (3), Linkedin (3), Stanford CoreNLP (2), Jeremy Lin (2), Manchester United (1)

Why in-house NER?

● Strategic: Gauge of information extraction and content understanding at Twitter

● Unique linguistic features of tweets
  ○ Limited context due to brevity
  ○ Abbreviations
  ○ Typos
  ○ Informal language
  ○ Temporality
  ○ ...

● Cost of 3rd party Cloud API at production volume


Example of NER on Tweet

[Figure: NER output on an example tweet from three systems: Google Natural Language API, Our Model, and SpaCy (Open-source)]


Generating Training Data

Data Cleaning

● Convert character-level labels into token-level labels to train the NER model

● Regular removal of deleted tweets (GDPR)
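The conversion from character-level labels to token-level labels can be sketched as follows. This is a minimal illustration, not the pipeline used in the talk: the regex tokenizer, the `(start, end, label)` span format, and the function name `char_spans_to_bio` are all assumptions made for the example.

```python
import re

def char_spans_to_bio(text, spans):
    """Convert character-level entity spans into token-level BIO tags.

    spans: list of (start, end, label) tuples, end exclusive (assumed format).
    """
    # Hypothetical tokenizer: runs of word characters, or single punctuation marks.
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\w+|[^\w\s]", text)]
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        inside = False
        for i, (_, t_start, t_end) in enumerate(tokens):
            # A token belongs to the entity if it overlaps the character span.
            if t_start < end and t_end > start:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return [tok for tok, _, _ in tokens], tags

tokens, tags = char_spans_to_bio(
    "John lives in San Jose", [(0, 4, "Per"), (14, 22, "Loc")]
)
print(list(zip(tokens, tags)))
```

Real tweets need a tokenizer that handles hashtags, mentions, URLs, and emoji, so the overlap rule above is only the core idea.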

Sampling

● Stratified sampling based on tweet engagement

● Sample over a long period of time to capture temporal signal
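Stratified sampling by engagement could look roughly like the sketch below. The slides do not say how the strata are defined, so the bucket thresholds, the `engagements` field, and the per-bucket quota are hypothetical choices for illustration only.

```python
import random
from collections import defaultdict

def engagement_bucket(tweet):
    # Hypothetical thresholds; the talk does not specify the strata.
    n = tweet["engagements"]
    if n < 10:
        return "low"
    if n < 1000:
        return "medium"
    return "high"

def stratified_sample(tweets, bucket_of, per_bucket, seed=0):
    """Draw up to per_bucket tweets from every stratum."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for tweet in tweets:
        buckets[bucket_of(tweet)].append(tweet)
    sample = []
    for key in sorted(buckets):
        items = buckets[key]
        sample.extend(rng.sample(items, min(per_bucket, len(items))))
    return sample

tweets = [{"id": i, "engagements": e}
          for i, e in enumerate([0, 3, 5, 50, 200, 5000, 8, 2])]
sample = stratified_sample(tweets, engagement_bucket, per_bucket=2)
```

Sampling a fixed quota per stratum keeps rare, highly engaged tweets from being drowned out by the much larger pool of low-engagement tweets.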

Labeling

● Character-based labeling on a crowdsourced labeling platform
  ○ Person
  ○ Location
  ○ Organization
  ○ Product
  ○ Other


NER Model Setup


Input tokens:  John   lives  in  San    Jose
Model output:  B-Per  O      O   B-Loc  I-Loc

● B: Beginning token of an entity
● I: Inside token of an entity
● O: Not an entity
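To make the tagging scheme concrete, the sketch below groups a BIO-tagged token sequence back into entities. The function name `bio_to_entities` and the lenient handling of a stray `I-` tag (treated as the start of a new entity) are assumptions of this example, not something the slides specify.

```python
def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, label) pairs."""
    entities = []
    current_tokens, current_label = [], None

    def flush():
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        label = tag[2:] if tag != "O" else None
        if tag.startswith("B-") or (tag.startswith("I-") and label != current_label):
            # Start of a new entity (a stray I- tag is treated like B-).
            flush()
            current_tokens, current_label = [token], label
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:
            flush()
            current_tokens, current_label = [], None
    flush()
    return entities

entities = bio_to_entities(
    ["John", "lives", "in", "San", "Jose"],
    ["B-Per", "O", "O", "B-Loc", "I-Loc"],
)
print(entities)  # [('John', 'Per'), ('San Jose', 'Loc')]
```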

Model Architectures

● Conditional Random Field [Lafferty et al., 2001]
● Deep Learning Architectures [Li et al., 2018]
● Fine-tuned Language Models [Devlin et al., 2019]

Conditional Random Field (CRF)

[Lafferty et al., 2001]

Hidden State:    B-Per  O      O   B-Loc  I-Loc  O
Observed State:  John   lives  in  San    Jose   .

● Discriminative analog to Hidden Markov Model (HMM)
● Models local context with transition matrix

CRF Transition Matrix

[Figure: CRF transition matrix with axes labeled From and To]
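How a transition matrix shapes the predicted tag sequence can be illustrated with Viterbi decoding, the standard way to find the highest-scoring path in a linear-chain CRF. The scores below are made up for the example, and this pure-Python version omits start and end transitions for brevity.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions:   one dict per token mapping tag -> score
    transitions: dict mapping (from_tag, to_tag) -> score
    """
    best = [{t: emissions[0][t] for t in tags}]
    back = []
    for em in emissions[1:]:
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[(p, cur)])
            scores[cur] = best[-1][prev] + transitions[(prev, cur)] + em[cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

tags = ["B", "I", "O"]
# Hypothetical scores: all transitions neutral except O -> I, which is heavily penalized.
transitions = {(p, c): 0.0 for p in tags for c in tags}
transitions[("O", "I")] = -1e4
emissions = [
    {"B": 0.0, "I": 0.0, "O": 2.0},
    {"B": 1.0, "I": 1.5, "O": 0.0},
]
path = viterbi(emissions, transitions, tags)
print(path)  # ['O', 'B']
```

Taken token by token, the second position would be tagged `I`, but the penalized `O -> I` transition makes `B` the better choice, which is the kind of local consistency the transition matrix provides.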

Deep Learning Architectures

[Li et al., 2018]

● Input Layer: Word Embedding, Character Embedding, Hand-crafted Features, ...
● Context Layer: CNN, RNN, LSTM, Transformer, Attention, ...
● Decode Layer: MLP+Softmax, CRF, ...

Char-BiLSTM-CRF

● Input Layer: Word Representation, Character Representation, Other Features
● Context Layer: Bidirectional LSTM
● Decode Layer: CRF

Character Representations

[Li et al., 2018]

Decoder

[Li et al., 2018]

Fine-tuning Pre-trained LM (e.g. BERT)

[Figure: Fine-tuning, Devlin et al., 2019]

Performance on CoNLL 2003

Model Type     Performance (F1)
CRF            ~0.85
BiLSTM-CRF     ~0.92
BERT large     ~0.93

Source: NLP Progress, https://nlpprogress.com/english/named_entity_recognition.html
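CoNLL 2003 results are conventionally reported as entity-level F1, where a predicted entity counts as correct only if both its span and its type match the gold annotation exactly. A minimal sketch of that metric follows; the `(sentence_id, start, end, label)` tuple format and the toy data are assumptions of this example.

```python
def entity_f1(gold, predicted):
    """Entity-level F1 with exact span-and-type matching."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy data: entities as (sentence_id, start_token, end_token, label).
gold = [(0, 0, 1, "Per"), (0, 3, 5, "Loc")]
predicted = [(0, 0, 1, "Per"), (0, 3, 4, "Loc")]  # second span is too short
f1 = entity_f1(gold, predicted)
print(f1)  # 0.5
```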


Confidence Estimation

Input: Sijun He is in San Jose, CA

Tokens:  Sijun  He     is  in  San    Jose   ,      CA
Tags:    B-Per  I-Per  O   O   B-Loc  I-Loc  I-Loc  I-Loc

NER Model output:

Entity         Type      Confidence
Sijun He       Person    0.99
San Jose, CA   Location  0.97

NER Model output for the sentence: San Jose is in California

Tokens:            San    Jose
Tags:              B-Loc  I-Loc
Token confidence:  0.9    0.6

● Softmax decoder computes token confidence
● CRF decoder only computes the confidence for the whole sentence

Confidence Estimation with CRF

[Culotta et al., 2004]

[Figure: two tag lattices over the tokens Jane, Doe, went, to, Paris, . with tag rows B, I, O. The first is labeled Total Likelihood, the second Constrained Total Likelihood.]

Entity: Jane Doe
Constraints: (Jane, B), (Doe, I)

● Find the total likelihood of all possible sequences (a.k.a. the normalizer)
● Compute the marginal probability

Constrained Forward-Backward Algorithm
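The idea can be sketched in a few lines: run the forward algorithm once over all tag sequences to get the normalizer, run it again while allowing only sequences that satisfy the constraints, and take the ratio as the entity confidence. The code below is a minimal pure-Python illustration in log space; the emission and transition scores are invented, and the function names are not from the talk or from [Culotta et al., 2004].

```python
import math

def logsumexp(values):
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(emissions, transitions, tags, constraints=None):
    """Forward algorithm; constraints maps token position -> required tag."""
    constraints = constraints or {}

    def allowed(pos, tag):
        return pos not in constraints or constraints[pos] == tag

    alpha = {t: emissions[0][t] if allowed(0, t) else -math.inf for t in tags}
    for pos in range(1, len(emissions)):
        alpha = {
            cur: (logsumexp([alpha[p] + transitions[(p, cur)] for p in tags])
                  + emissions[pos][cur]) if allowed(pos, cur) else -math.inf
            for cur in tags
        }
    return logsumexp(list(alpha.values()))

def entity_confidence(emissions, transitions, tags, constraints):
    """Constrained total likelihood divided by total likelihood."""
    constrained = log_partition(emissions, transitions, tags, constraints)
    total = log_partition(emissions, transitions, tags)
    return math.exp(constrained - total)

tags = ["B", "I", "O"]
tokens = ["Jane", "Doe", "went", "to", "Paris", "."]
# Hypothetical scores for illustration only.
emissions = [
    {"B": 2.0, "I": 0.0, "O": 0.5},
    {"B": 0.5, "I": 2.0, "O": 0.5},
    {"B": 0.0, "I": 0.0, "O": 2.0},
    {"B": 0.0, "I": 0.0, "O": 2.0},
    {"B": 2.0, "I": 0.0, "O": 1.0},
    {"B": 0.0, "I": 0.0, "O": 3.0},
]
transitions = {(p, c): 0.0 for p in tags for c in tags}
transitions[("O", "I")] = -5.0

# Constraints (Jane, B), (Doe, I) expressed by token position.
confidence = entity_confidence(emissions, transitions, tags, {0: "B", 1: "I"})
```

A stricter variant would also constrain the token after the entity to be anything other than `I`, so that longer entities starting at the same position are excluded; the sketch follows the two constraints shown on the slide.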


System Overview

Model Endpoint Proxy

● English NER
● Spanish NER
● Japanese NER
● ...

System Overview

[Figure: system diagram with components Tweet Creation, Model Endpoint, Cache, Scribe, HDFS, Online Clients, and Offline Clients; edges are labeled Put, Read, and Cache miss]

System Read RPS 120k rps

Model Inference RPS 10k rps

Model Latency p99 20 ms
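One plausible reading of the diagram labels (Put, Read, Cache miss) is a read-through cache in front of the model endpoint, which would explain why read traffic can be far higher than inference traffic. The toy sketch below illustrates that pattern only; the class, its methods, and the stand-in model are hypothetical and do not describe Twitter's actual services.

```python
class NerCache:
    """Toy read-through cache in front of a model endpoint (illustrative only)."""

    def __init__(self, model_fn):
        self.model_fn = model_fn   # stands in for the model endpoint
        self.store = {}            # stands in for the cache
        self.inference_calls = 0

    def _infer(self, text):
        self.inference_calls += 1
        return self.model_fn(text)

    def put(self, tweet_id, text):
        # Assumed write path: run NER when the tweet is created and cache the result.
        self.store[tweet_id] = self._infer(text)

    def read(self, tweet_id, text):
        # Assumed read path: serve from cache, fall back to the model on a miss.
        if tweet_id not in self.store:
            self.store[tweet_id] = self._infer(text)
        return self.store[tweet_id]

def fake_model(text):
    # Placeholder for a real NER model: returns capitalized words.
    return [word for word in text.split() if word[:1].isupper()]

cache = NerCache(fake_model)
cache.put(1, "John lives in San Jose")
first = cache.read(1, "John lives in San Jose")   # cache hit
second = cache.read(2, "Paris is nice")           # cache miss
third = cache.read(2, "Paris is nice")            # cache hit
```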

Named Entities in External Articles

● One of the core pieces of public conversation on Twitter
● Run NER on each article's title and short snippet
● Significant upside in entity signal coverage

Example:

No Named Entities in Tweet

Named Entities in the Linked Article:
● Brunswick, GA
● Detroit
● Lions
● Georgia


Future Work

● Language-specific Model Architecture
● Multilingual Model
● Active Learning for Data Efficiency

Reference

● Mateusz Fedoryszak, Brent Frederick, Vijay Rajaram and Changtao Zhong, Real-time Event Detection on Social Data Streams, KDD 2019, https://www.kdd.org/kdd2019/accepted-papers/view/real-time-event-detection-on-social-data-streams
● John Lafferty, Andrew McCallum and Fernando C.N. Pereira, Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, ICML 2001, https://dl.acm.org/citation.cfm?id=655813
● Jing Li, Aixin Sun, Jianglei Han and Chenliang Li, A Survey on Deep Learning for Named Entity Recognition, https://arxiv.org/abs/1812.09449
● Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL-HLT 2019, https://arxiv.org/abs/1810.04805
● NLP Progress, https://nlpprogress.com/english/named_entity_recognition.html
● Aron Culotta and Andrew McCallum, Confidence Estimation for Information Extraction, HLT-NAACL 2004, https://dl.acm.org/citation.cfm?id=1614012

#ThankYou


We are hiring ML Researchers and Engineers! [email protected]
