NAMED ENTITY RECOGNITION
JACOB SU WANG OJO LABS INC.
WE EXPLORE …
IN THIS TALK
• WHAT IS NER, WHAT ARE ITS APPLICATIONS
• WHAT ARE THE METHODS USED IN VARIOUS CONDITIONS
• WHAT MODEL TO USE WHEN?
• HOW DO THE MODELS WORK?
WHAT IS A NAMED ENTITY?
WORDS/PHRASES OF INTEREST IN TEXT
NAMED ENTITIES
• NATURAL NAMED ENTITIES
• PROPER NOUNS
• E.G. PERSON NAME (Steve Jobs), ORGANIZATION (OJO Labs), LOCATION (Austin), ETC.
• DEFINED NAMED ENTITIES
• NON-PROPER-NOUN WORDS/PHRASES WE DEFINE TO BE INFORMATIVE
• E.G. TIME (9pm, 1945), INDICATORS (which, where, etc.), CONTEXTUALLY-SIGNIFICANT TERMS (buffalo hunter, horse trading in Lonesome Dove).
NAMED ENTITY RECOGNITION TASK
pic1: https://www.semanticengine.ws/namedentityrecognition
pic2: https://www.ravn.co.uk/named-entity-recognition-ravn-part-1/
IDENTIFICATION OF WORD/PHRASES OF INTEREST
IN TEXT
APPLICATIONS
QUESTION-ANSWERING: LOCATE INFORMATIVE TEXT
WHY?
Q: WHEN DID ADOLF HITLER COME TO POWER?
MEDICAL DIAGNOSIS: LEVERAGE OPEN DATA SOURCES
WHY?
IS THERE ANY EVIDENCE THAT MEDICINE X IS A GOOD TREATMENT FOR DISEASE Y?
pic: http://www.slideshare.net/larsjuhljensen/one-tagger-many-uses-illustrating-the-power-of-ontologies-in-named-entity-recognition
INFO SUPPLEMENTATION: AMAZON X-RAY
WHY?
pic1: http://www.blogher.com/kindle-paperwhite-smart-bitches-review
pic2: https://adelightfulspace.wordpress.com/2015/09/14/review-amazon-kindle-
TEXT NORMALIZATION: PROPER LEVEL OF ABSTRACTION
WHY?
WHAT DO PEOPLE CARE ABOUT?
WHAT DO COMPANIES CARE ABOUT?
COOL! HOW?
HOW?
LOOK UP A GAZETTEER?
http://www.bph-postcodes.co.uk/licenced_products/postcode-sector-town.cgi
STRING MATCHING ALONE IS NOT GOING TO CUT IT, MAINLY BECAUSE OF WEAK GENERALIZATION!
LOOKING-UP AIN'T GONNA CUT IT!
HOW?
ENGLAND: COUNTRY_NAME OR LOCATION?
1945: NUMBER OR TIME?
CHASE: PERSON_NAME OR ORGANIZATION?
ISSUE 1: AMBIGUITY
LOOKING-UP AIN'T GONNA CUT IT!
HOW?
STREET, ST., STRT, …
UNIVERSITY OF TEXAS, UNIV TX, UT, …
NAMED ENTITY RECOGNITION, NER, …
ISSUE 2: VARIANTS
LOOKING-UP AIN'T GONNA CUT IT!
HOW?
MOST ITEMS WILL BE OOVS!
ISSUE 3: OUT-OF-VOCAB ITEMS
ZIPF’S LAW
SOLUTION: FEATURIZATION
HOW?
… W W W NAMED ENTITY W W W …
MUCH RICHER INFORMATION THAN HAVING ONLY THE ITEM ITSELF!!
SEMANTIC
SYNTACTIC
LEXICAL
MORPHOLOGICAL
FEATURE VECTOR OF WORD 1
HOW?
SOLUTION: FEATURIZATION
• E.G. DISAMBIGUATION
1945 (NUMBER):
- PARSER LIKELY TO GIVE A NUM/ADJ POS TAG
- APPEARS IN CONTEXTS MORE LIKE ARBITRARY-LENGTH DIGIT SEQUENCES THAN "YYYY"-FORMAT SEQUENCES
1945 (TIME):
- MORE LIKELY TO BE THE LAST ITEM BEFORE A DELIMITER (COMMA OR PERIOD)
- MORE LIKELY TO HAVE A PRECEDING 'IN'
HOW?
SOLUTION: FEATURIZATION
• E.G. VARIANTS
UT, UNIVERSITY OF TEXAS, UNIV TX, …
- SIMILAR COOCCURRENCE VECTORS (OVER THE VOCABULARY)
- MORE SIMILAR TO EACH OTHER THAN TO OTHER NON-WORD ABBREVIATIONS (UT, UNIV TX)
HOW?
SOLUTION: FEATURIZATION
• E.G. OUT-OF-VOCAB ITEMS
BARFKNECHT, THORUP, PECKENPAUGH, … (RARE SURNAMES, FREQUENCY BELOW 0.15% IN 100,000 PEOPLE)
- CONTEXTUAL DISTRIBUTION MORE SIMILAR TO "JOHNSON" OR "SMITH" THAN TO RANDOM WORDS
- LIKELY TO BE TAGGED AS "PROPN" (PROPER NOUN) BY A PARSER
FEATURIZATION
METHODS
FEATURIZATION
• FEATURE ENGINEERING
• DOMAIN EXPERT KNOWLEDGE
• LINGUISTIC KNOWLEDGE
• FEATURE ABSTRACTION
• “ALMOST FROM SCRATCH” WITH AUTOMATIC FEATURE DISCOVERY
FEATURE ENGINEERING
… W W W NAMED ENTITY W W W …
• MORPHOLOGY: PREFIX, SUFFIX, STEM, ETC. IN ENGLISH • anti-, con-, dis-, re-, …, -ly, -ness, …
• LEXICAL: GAZETTEER, SPELLING • {city_names}, …, capitalized, …
• SEMANTICS: COOCCURRENCE PATTERN, HEARST PATTERNS • cooccurrence counts, CITIES such as …
• SYNTAX: DEPENDENCY CONTEXT, SYNTACTIC PATH • (NE, dobj, kill), …, V->NP->N, …
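The feature families above can be hand-built in a few lines. A minimal sketch (the gazetteer, feature names, and example sentence are hypothetical toy choices, not from the talk):

```python
# Hand-engineered features for one token in context (toy example).
CITY_GAZETTEER = {"austin", "dallas", "houston"}  # tiny illustrative gazetteer

def token_features(tokens, i):
    """Morphological, lexical, and contextual features for tokens[i]."""
    w = tokens[i]
    return {
        "word.lower": w.lower(),
        "prefix3": w[:3].lower(),            # morphology
        "suffix3": w[-3:].lower(),           # morphology
        "is_capitalized": w[:1].isupper(),   # spelling / lexical
        "in_city_gazetteer": w.lower() in CITY_GAZETTEER,  # gazetteer lookup
        "prev_word": tokens[i - 1].lower() if i > 0 else "<BOS>",  # context
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

feats = token_features(["I", "moved", "to", "Austin", "yesterday"], 3)
# → includes {"in_city_gazetteer": True, "prev_word": "to", "suffix3": "tin"}
```

A real system would feed one such dict per token into a sequence model such as a CRF.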
FEATURE ABSTRACTION
… W W W NAMED ENTITY W W W …
• MORPHOLOGY: CHARACTER EMBEDDINGS • from sequences of characters in words (CNN)
• LEXICAL & SEMANTICS: WORD EMBEDDINGS • from sequences of words in sentences (word2vec with RNN/FNN)
• SYNTAX: WORD EMBEDDING + PHRASE EMBEDDINGS • from sequences of words in sentences (vectors with RecNN)
FEATURE ENGINEERING VS. FEATURE ABSTRACTION
A BIG DIFFERENCE WE CARE ABOUT: INTERPRETABILITY
THEY WORK IN THE CLASSIFICATION TASK, BUT I DON'T KNOW WHAT THEY MEAN!!
WHAT MODEL TO USE?
IT ALL DEPENDS ON DATA AVAILABILITY & KNOWLEDGE STATE

                  SUPERVISED   SEMI-SUPERVISED   UNSUPERVISED
ANNOTATION        FULL         PARTIAL           NONE
LABELING SCHEME   GIVEN        GIVEN             INDUCED
WHAT MODEL TO USE? OUR GENERAL OPINION

                  PURPOSE                                  PERFORMANCE EXPECTATION (ACCURACY/F1)
SUPERVISED        WEAPONIZED, PRODUCTION-LEVEL MODELS      95%+
SEMI-SUPERVISED   EXPLORATORY MODELS, PATTERN DISCOVERY    ~75%
UNSUPERVISED      EXPLORATORY MODELS, PATTERN DISCOVERY    ~65%
OVERVIEW
                  FEATURE ENGINEERING                  FEATURE ABSTRACTION
SUPERVISED        CONDITIONAL RANDOM FIELDS (CRF)      RECURRENT NEURAL NETS (RNN)
SEMI-SUPERVISED   BOOTSTRAPPING, LABEL PROPAGATION     -
UNSUPERVISED      HEARST PATTERN, EXTERNAL TAXONOMY    -
LINEAR-CHAIN GRAPHICAL MODEL
CRF
STATES: LATENT VARIABLES THAT TAKE LABELS AS VALUES
OBSERVATION: WORDS
CRF: EXAMPLE
ADOLF HITLER CAME TO POWER IN 1933
CRF OBJECTIVE: FINDING THE BEST PATH!
COMPUTING THE BEST SEQUENCE OF LABELS (Y's)? EXPENSIVE!!
CRF: BRUTE FORCE?
COMPLEXITY OF BRUTE-FORCE COMPUTATION
E.G. ATIS DATASET:
- TYPICAL SENTENCE: ~15 WORDS
- SIZE OF LABEL SET: 127
- POSSIBLE PATHS / LABEL SEQUENCES: 127^15
WE WANT TO BREAK THE GRAPH INTO SUBCOMPONENTS (FACTORS)
CRF: FACTORIZATION
Ψ(X_t, Y_t): FACTOR AT TIME t
HITLER CAME TO POWER IN
<0, 1, 1, 0, 0, 0, …>
FEATURE INDICATOR FUNCTIONS: f_1, f_2, f_3, f_4, f_5, f_6, …
CRF: FEATURE FUNCTIONS (INDICATOR FUNCTIONS)
f_k(X_t, Y_t) = 1 if Ψ(X_t, Y_t) has feature k, 0 otherwise
E.G.
- X_{t-1} IS WORD "TO"
- X_{t+1} HAS POS TAG "PREP"
- Y_{t+1} IS LABEL "B-PER"
- ETC.
Ψ(X_t, Y_t)
HITLER CAME TO POWER IN
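These indicator functions translate directly into code. A toy sketch (the two features chosen here are hypothetical examples in the spirit of the slide, not an exhaustive feature set):

```python
# Feature indicator functions f_k over the factor at time t.
# X is the word sequence, Y the label sequence; each function
# returns 1 if the factor at t has the feature, else 0.

def f_prev_word_is_to(X, Y, t):
    """Fires when the preceding word is 'to'."""
    return 1 if t > 0 and X[t - 1].lower() == "to" else 0

def f_label_is_b_time(X, Y, t):
    """Fires when the label at t is B-TIME."""
    return 1 if Y[t] == "B-TIME" else 0

X = ["Adolf", "Hitler", "came", "to", "power", "in", "1933"]
Y = ["B-PER", "I-PER", "O", "O", "O", "O", "B-TIME"]
# At t=4 ("power") the first feature fires; at t=6 ("1933") the second does.
```

A trained CRF weights each such function by w_k and sums them inside each factor.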
CRF: BEST SEQUENCE

\hat{Y} = \arg\max_{Y} \frac{\prod_{t=1}^{T} \exp\left(\sum_{k=1}^{K} w_k f_k(X_t, Y_t)\right)}{\sum_{Y'} \prod_{t=1}^{T} \exp\left(\sum_{k=1}^{K} w_k f_k(X_t, Y'_t)\right)}

w_k: WEIGHT OF FEATURE FUNCTION f_k
THE PROBABILITY OF THE ENTIRE SEQUENCE IS THE (NORMALIZED) PRODUCT OF THE "WEIGHTED FEATURE SCORES" OF THE FACTORS
THIS COULD BE FORMULATED INTO AN OPTIMIZATION PROBLEM AND SOLVED WITH VARIOUS ALGORITHMS!
FACTORIZATION: AS YOU LIKE IT
Sutton & McCallum (2011)
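The argmax above does not require enumerating all |labels|^T paths: with the linear-chain factorization, the Viterbi algorithm finds the best path in O(T · |labels|²). A minimal sketch, with a hypothetical toy scoring function standing in for the learned Σ_k w_k f_k:

```python
def viterbi(words, labels, score):
    """Best label sequence under a linear-chain model.
    score(prev_label, label, words, t) returns a log-score for the factor
    at time t (hypothetical interface; a real CRF sums weighted features)."""
    T = len(words)
    best = {y: score(None, y, words, 0) for y in labels}
    backptrs = []
    for t in range(1, T):
        new_best, ptr = {}, {}
        for y in labels:
            prev = max(labels, key=lambda yp: best[yp] + score(yp, y, words, t))
            new_best[y] = best[prev] + score(prev, y, words, t)
            ptr[y] = prev
        best = new_best
        backptrs.append(ptr)
    # Follow back-pointers from the best final label.
    y = max(labels, key=best.get)
    path = [y]
    for ptr in reversed(backptrs):
        y = ptr[y]
        path.append(y)
    return list(reversed(path))

def toy_score(prev, y, words, t):
    """Toy factor score: digit tokens prefer B-TIME, others prefer O."""
    if words[t].isdigit():
        return 2.0 if y == "B-TIME" else 0.0
    return 1.0 if y == "O" else 0.0

path = viterbi(["came", "to", "power", "in", "1933"], ["O", "B-TIME"], toy_score)
# → ['O', 'O', 'O', 'O', 'B-TIME']
```

For the ATIS numbers on the earlier slide, this replaces 127^15 path evaluations with roughly 15 · 127² factor evaluations.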
CRF: WE FOUND THESE MODELS LESS APPROPRIATE
HIDDEN MARKOV MODEL (HMM)
• EMPIRICALLY WEAKER PERFORMANCE
• TROUBLE CAPTURING LONG-DISTANCE DEPENDENCIES
MAXIMUM ENTROPY MARKOV MODEL (MEMM)
• CRF IS AN IMPROVED VERSION OF MEMM
SUPPORT VECTOR MACHINE (SVM)
• SVM, AS A BINARY CLASSIFIER, AGGREGATES ERRORS!
Lafferty et al. (2001); Sutton & McCallum (2011)
RECAP
SUPERVISED + FEATURE ENGINEERING
IN: ADOLF HITLER CAME TO POWER IN 1933.
OUT: B-PER I-PER O O O O B-TIME
NER: LEARNING A MAPPING
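The OUT line uses the BIO scheme (B- begins an entity, I- continues it, O is outside). A minimal sketch converting labeled spans to BIO tags (the helper name is hypothetical):

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans to BIO tags.
    spans: list of (start, end_exclusive, label) over token indices."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label           # B- marks the entity's first token
        for i in range(start + 1, end):
            tags[i] = "I-" + label           # I- marks continuation tokens
    return tags

tokens = ["Adolf", "Hitler", "came", "to", "power", "in", "1933"]
tags = spans_to_bio(tokens, [(0, 2, "PER"), (6, 7, "TIME")])
# → ['B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'B-TIME']
```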
TOOLBOX
SUPERVISED + FEATURE ENGINEERING
LIBRARIES
CRF
• pycrfsuite: https://python-crfsuite.readthedocs.io/en/latest/
• crf++: https://taku910.github.io/crfpp/
HMM
• seqlearn: https://github.com/larsmans/seqlearn
• hmmlearn: https://github.com/hmmlearn/hmmlearn
MEMM • nltk: http://www.nltk.org/_modules/nltk/classify/maxent.html
SVM • sklearn: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
                  FEATURE ENGINEERING                  FEATURE ABSTRACTION
SUPERVISED        CONDITIONAL RANDOM FIELDS (CRF)      RECURRENT NEURAL NETS (RNN)
SEMI-SUPERVISED   BOOTSTRAPPING, LABEL PROPAGATION     -
UNSUPERVISED      HEARST PATTERN, EXTERNAL TAXONOMY    -
LEARNING TYPE & OBJECTIVE
SUPERVISED VS. SEMI-SUPERVISED NER
LEARNING TYPE
SUPERVISED: INDUCTIVE LEARNING
LEARNING A CLASSIFIER THAT INCORPORATES GENERAL RULES
SEMI-SUPERVISED: TRANSDUCTIVE LEARNING
CLASSIFYING UNLABELED DATA AS SPECIFIC CASES BY THEIR SIMILARITY TO LABELED DATA
SUPERVISED LEARNING: INDUCTIVE
SUPERVISED VS. SEMI-SUPERVISED NER
WE ARE LEARNING THESE PARAMETERS!
classifier(label sequences | word sequences; Θ)
SEMI-SUPERVISED LEARNING: TRANSDUCTIVE
SUPERVISED VS. SEMI-SUPERVISED NER
SIMILAR WORDS SHOULD HAVE SAME LABELS
BOOTSTRAPPING: ALTERNATING/MUTUAL BOOTSTRAPPING
BY LEXICAL FEATURES
EXTRACTION
• AUTHOR
{[A-Z][A-Za-z .,&]; [A-Za-z.]; ...}
• TITLE
{[A-Z0-9][A-Za-z0-9 .,:’#!?;&]; [A-Za-z0-9?!]}
• ...
E.G. REGEX CHARACTERIZATION
E.G. LEXICAL RULES
Brin (1999)
Collins & Singer (1999)
BY CONTEXTUAL FEATURES
EXTRACTION
Riloff & Jones (1999)
BY DISTRIBUTIONAL SIMILARITY
EXTRACTION
Pasca et al. (2006)
BOOTSTRAPPING: ALTERNATING/MUTUAL BOOTSTRAPPING
THE SET OF NAMED ENTITIES!
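The alternating loop can be sketched in a few lines. This toy version (corpus, seeds, and the "preceding word" pattern notion are all illustrative stand-ins for the richer extraction patterns of Riloff & Jones (1999)):

```python
def bootstrap(sentences, seeds, rounds=2):
    """Alternate: entities -> contextual patterns -> new entities.
    A 'pattern' here is just the word preceding an entity (toy stand-in)."""
    entities = set(seeds)
    for _ in range(rounds):
        # Step 1: harvest patterns from the current entity set.
        patterns = set()
        for sent in sentences:
            for i, w in enumerate(sent[1:], start=1):
                if w in entities:
                    patterns.add(sent[i - 1])
        # Step 2: harvest new entities from the patterns.
        for sent in sentences:
            for i, w in enumerate(sent[:-1]):
                if w in patterns:
                    entities.add(sent[i + 1])
    return entities

corpus = [
    ["moved", "to", "Austin"],
    ["moved", "to", "Dallas"],
    ["flew", "to", "Houston"],
]
cities = bootstrap(corpus, seeds={"Austin"})
# → {'Austin', 'Dallas', 'Houston'}
```

The same loop also shows why bootstrapping is prone to introducing noise: one bad pattern extracts bad entities, which in turn produce more bad patterns.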
LABEL PROPAGATION
ITEMS
SIMILARITIES
(TOKENS)
*THE ACTUAL GRAPH IS USUALLY FULLY CONNECTED, BUT NOT NECESSARILY SO IN SOME VARIANTS OF LP
LABEL PROPAGATION
L_1
L_2
<0,1,1,0,0, …>
<1,1,0,1,1, …>
<0,1,1,1,0, …>
LABELED NODES: LABEL + FEATURE VEC
UNLABELED NODES: FEATURE VEC ONLY
LABEL PROPAGATION
Network graph from Mejova (2015), interpretation differs here.
LABELED ITEMS
UNLABELED ITEMS
ITEMS
SIMILARITIES
PROPAGATION
FULLY LABELED!
PROCEDURE
LABEL PROPAGATION
SIMILARITY MATRIX: X_{l+u} × X_{l+u} (l LABELED + u UNLABELED ITEMS)
SOFT LABEL ASSIGNMENT DISTRIBUTION: X_{l+u} × C (C CLASSES)
Zhu & Ghahramani (2002)
STEP 1
LABEL PROPAGATION
NEW DISTRIBUTION MATRIX <= SIMILARITY MATRIX * DISTRIBUTION MATRIX
(X_{l+u} × C) = (X_{l+u} × X_{l+u}) * (X_{l+u} × C)
Zhu & Ghahramani (2002)
STEP 1
LABEL PROPAGATION
AN INTUITIVE EXPLANATION OF WHY THE UPDATE WORKS
Zhu & Ghahramani (2002)
THE INFLUENCE OF LABELED DATA (WORD X_j) ON UNLABELED DATA (WORD X_i) IS PROPORTIONAL TO THEIR SIMILARITY!
STEP 2
LABEL PROPAGATION
ROW-NORMALIZE THE DISTRIBUTION MATRIX
EACH ROW IS A PROBABILITY DISTRIBUTION OVER CLASSES!
Zhu & Ghahramani (2002)
STEP 3
LABEL PROPAGATION
CLAMP / "REPLENISH" THE LABELED DATA
Zhu & Ghahramani (2002)
CONVERGENCE
LABEL PROPAGATION
ITERATE UNTIL ALL THE ITEMS ARE LABELED …
Zhu & Ghahramani (2002)
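The three steps (propagate, normalize, clamp) can be sketched in pure Python on a toy graph. The similarity matrix here is hypothetical; a real system would compute it from the nodes' feature vectors:

```python
def label_propagation(sim, labels, n_iter=50):
    """Zhu & Ghahramani (2002)-style label propagation.
    sim: symmetric similarity matrix (list of lists, positive entries).
    labels: dict {node_index: class_index} for the labeled nodes.
    Returns a soft label distribution per node."""
    n = len(sim)
    k = max(labels.values()) + 1
    # Row-normalize similarities into a transition matrix T.
    T = [[sim[i][j] / sum(sim[i]) for j in range(n)] for i in range(n)]
    # Initialize: one-hot rows for labeled nodes, uniform for unlabeled.
    Y = [[1.0 / k] * k for _ in range(n)]
    for i, c in labels.items():
        Y[i] = [1.0 if j == c else 0.0 for j in range(k)]
    for _ in range(n_iter):
        # Step 1: propagate, Y <- T Y.
        Y = [[sum(T[i][j] * Y[j][c] for j in range(n)) for c in range(k)]
             for i in range(n)]
        # Step 2: row-normalize so each row is a distribution over classes.
        Y = [[v / sum(row) for v in row] for row in Y]
        # Step 3: clamp ("replenish") the labeled nodes.
        for i, c in labels.items():
            Y[i] = [1.0 if j == c else 0.0 for j in range(k)]
    return Y

# Node 0 labeled class 0, node 2 labeled class 1; node 1 is more
# similar to node 0, so it should end up mostly class 0.
sim = [[1.0, 0.9, 0.1],
       [0.9, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
Y = label_propagation(sim, {0: 0, 2: 1})
```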
PROS & CONS
BOOTSTRAPPING VS. LABEL PROPAGATION
BOOTSTRAPPING
PROS: EASY IMPLEMENTATION; FAST EXTRACTION
CONS: LABOR-INTENSIVE (HEAVY HUMAN SUPERVISION TO GUARANTEE QUALITY); PRONE TO INTRODUCING NOISE

LABEL PROPAGATION
PROS: CONVERGENCE GUARANTEED; MORE AUTOMATED
CONS: PARAMETERS DIFFICULT TO TUNE (PERFORMANCE DEPENDS HEAVILY ON PARAMETER SETTINGS); SLOW WITH LARGE GRAPHS (SOPHISTICATED VARIANTS)
RECAP
SEMI-SUPERVISED + FEATURE ENGINEERING
SIMILAR WORDS SHOULD HAVE SAME LABELS
TOOLBOX
SEMI-SUPERVISED + FEATURE ENGINEERING
LIBRARIES
BOOTSTRAPPING: NONE NEEDED
LABEL PROPAGATION
• sklearn: http://scikit-learn.org/stable/modules/label_propagation.html • MAD: https://github.com/psorianom/modified_adsorption
Talukdar & Crammer (2009)
                  FEATURE ENGINEERING                  FEATURE ABSTRACTION
SUPERVISED        CONDITIONAL RANDOM FIELDS (CRF)      RECURRENT NEURAL NETS (RNN)
SEMI-SUPERVISED   BOOTSTRAPPING, LABEL PROPAGATION     -
UNSUPERVISED      HEARST PATTERN, EXTERNAL TAXONOMY    -
OBJECTIVE
UNSUPERVISED LEARNING VS. THE OTHERS

                    SUPERVISED       SEMI-SUPERVISED   UNSUPERVISED
ANNOTATION          FULL             PARTIAL           NONE
LABELING SCHEME     GIVEN            GIVEN             INDUCED
LEARNING OBJECTIVE  LABELING STUFF   LABELING STUFF    LABELING STUFF + ALSO INDUCE A LABELING SCHEME
GENERAL IDEA
UNSUPERVISED + FEATURE ENGINEERING
INDUCING A LABELING SCHEME
UNSUPERVISED LEARNING: INDUCING A LABELING SCHEME
HYPERNYMS (LABELS)
HYPONYMS (NAMED ENTITIES)
WHAT IS A HEARST PATTERN?
HEARST PATTERN BASED EXTRACTION
PARADIGM
Y such as X (LABEL CANDIDATE, NE CANDIDATE)
E.G.
CITIES such as Austin, Dallas, and Houston
Hearst (1992)
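Extracting (label candidate, NE candidate) pairs from this pattern can be sketched with a regular expression. This toy version handles only the single "Y such as X1, X2, and X3" template; Hearst (1992) lists several more (e.g. "including", "especially"):

```python
import re

# "Y such as X1, X2, and X3" -> (Y, [X1, X2, X3]); toy version of one
# Hearst pattern only.
HEARST = re.compile(r"(\w+) such as ((?:\w+(?:, )?(?:and )?)+)")

def hearst_pairs(text):
    """Return (label candidate, NE candidate) pairs found in text."""
    pairs = []
    for label, tail in HEARST.findall(text):
        for ne in re.split(r", and |, | and ", tail):
            pairs.append((label, ne))
    return pairs

pairs = hearst_pairs("cities such as Austin, Dallas, and Houston")
# → [('cities', 'Austin'), ('cities', 'Dallas'), ('cities', 'Houston')]
```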
HEARST PATTERN BASED EXTRACTION: STEP 1
STEP 2
HEARST PATTERN BASED EXTRACTION
LABEL(X) = \arg\max_Y SCORE(X, Y)
Evans (2003)
THIS IS A SELECTION PROCESS FOR LABELS
STEP 3
HEARST PATTERN BASED EXTRACTION: GROUNDING
Etzioni et al. (2005)
PMI(X, Y_i[X]) = PMI(Austin, "CITY such as Austin"), where
X = Austin
Y = CITY
Y_i[X] = "CITY such as Austin"
CANDIDATE LABELS Y_1, Y_2, Y_3, …, Y_k IN PATTERN "Y such as X" → BEST GROUNDED PAIR: (Austin, CITY)
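The grounding score can be computed from corpus counts. A sketch using standard pointwise mutual information over hypothetical hit counts (all numbers here are made up for illustration):

```python
import math

def pmi(count_joint, count_x, count_pattern, total):
    """Pointwise mutual information from corpus hit counts:
    PMI(X, Y[X]) = log [ P(X, Y[X]) / (P(X) * P(Y[X])) ]."""
    p_joint = count_joint / total
    p_x = count_x / total
    p_pattern = count_pattern / total
    return math.log(p_joint / (p_x * p_pattern))

# Hypothetical counts in a 1,000,000-token corpus:
# "Austin" appears 1000 times, "CITY such as *" 50 times,
# and "CITY such as Austin" 20 times.
score = pmi(count_joint=20, count_x=1000, count_pattern=50, total=1_000_000)
# A strongly positive score: "Austin" and "CITY such as" co-occur far more
# than chance, grounding the pair (Austin, CITY).
```

The label with the highest such score across candidate patterns wins the argmax in the selection step.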
EXTERNAL TAXONOMY BASED EXTRACTION
EXAMPLE: WORDNET
HYPERNYM (MORE GENERAL)
HYPONYM (MORE SPECIFIC)
EXTERNAL TAXONOMY BASED EXTRACTION
OBJECTIVE: FIND THE SWEET SPOT
FIND ALL CAPITALIZED WORDS / PHRASES
(OPTION: BOOTSTRAP FROM A GAZETTEER)
MANUALLY FIND A SET OF HYPERNYMS THAT COVERS ALL THE NE CANDIDATES!
TOPIC SIGNATURE
SIG(X) = {(word, freq_word) | word in X's context}
EXTERNAL TAXONOMY BASED EXTRACTION
OBJECTIVE: FIND THE SWEET SPOT
argmax_Y SIM(SIG(X), SIG(Y)) = LOCATION
CURRENT NODE = ENTITY (CHILD SIMS: 230, 410, 140)
EXTERNAL TAXONOMY BASED EXTRACTION
OBJECTIVE: FIND THE SWEET SPOT
argmax_Y SIM(SIG(X), SIG(Y)) = VILLAGE
CURRENT NODE = LOCATION (CHILD SIMS: 410, 251, 533)
EXTERNAL TAXONOMY BASED EXTRACTION
OBJECTIVE: FIND THE SWEET SPOT
SWEET SPOT FOUND!!
argmax_Y SIM(SIG(X), SIG(Y)) = VILLAGE
CURRENT NODE = VILLAGE (SIM=533)
NE CANDIDATES: MORDOR, HOBBITON, HOBBIT, WIZARD, DWARF, FAIRY
Alfonseca & Manandhar (2002)
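The greedy descent can be sketched on a toy taxonomy: build a topic signature for the candidate, then walk down the hyponym tree while similarity keeps improving. Everything here (the taxonomy, the signatures, the min-overlap similarity) is a hypothetical stand-in for WordNet and the similarity used by Alfonseca & Manandhar (2002):

```python
from collections import Counter

def signature(contexts):
    """Topic signature: word frequencies over an item's contexts (SIG(X))."""
    return Counter(w for ctx in contexts for w in ctx)

def sim(a, b):
    """Overlap similarity between two signatures (one simple choice)."""
    return sum(min(a[w], b[w]) for w in a)

def descend(taxonomy, signatures, node, target):
    """Greedy walk: move to the child most similar to the target signature;
    stop when no child beats the current node (the 'sweet spot')."""
    while True:
        children = taxonomy.get(node, [])
        if not children:
            return node
        best = max(children, key=lambda c: sim(signatures[c], target))
        if sim(signatures[best], target) <= sim(signatures[node], target):
            return node
        node = best

# Toy taxonomy and signatures (all hypothetical).
taxonomy = {"ENTITY": ["LOCATION", "PERSON"], "LOCATION": ["VILLAGE", "CITY"]}
signatures = {
    "ENTITY":   Counter({"the": 5}),
    "PERSON":   Counter({"born": 4, "the": 2}),
    "LOCATION": Counter({"near": 4, "the": 3}),
    "VILLAGE":  Counter({"near": 5, "small": 4, "the": 3}),
    "CITY":     Counter({"mayor": 4, "the": 3}),
}
target = signature([["small", "village", "near"], ["near", "the", "small"]])
spot = descend(taxonomy, signatures, "ENTITY", target)
# → 'VILLAGE' (the sweet spot for a village-like candidate)
```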
RECAP
UNSUPERVISED + FEATURE ENGINEERING
INDUCING LABELING SCHEME
                  FEATURE ENGINEERING                  FEATURE ABSTRACTION
SUPERVISED        CONDITIONAL RANDOM FIELDS (CRF)      RECURRENT NEURAL NETS (RNN)
SEMI-SUPERVISED   BOOTSTRAPPING, LABEL PROPAGATION     -
UNSUPERVISED      HEARST PATTERN, EXTERNAL TAXONOMY    -
FEATURE ENGINEERING VS. FEATURE ABSTRACTION
FEATURE ENGINEERING
AUTOMATIC FEATURE ABSTRACTION THROUGH JOINT OPTIMIZATION
• WHAT ARE EMBEDDINGS?
FEATURE ABSTRACTION
REPRESENTATION: EMBEDDINGS
REPRESENTATIONS THAT LIVE IN A HIGH-DIMENSIONAL SPACE WHERE THE DISTANCE BETWEEN ITEMS REFLECTS SOME NOTION OF SIMILARITY…
FEATURE ABSTRACTION: E.G. WORD EMBEDDINGS
FEATURE ABSTRACTION: PIPELINE, HOW ARE EMBEDDINGS LEARNED?
ONE-HOT
PREDICTION
RECALIBRATION
ONE-HOT
FEATURE ABSTRACTION: JOINT OPTIMIZATION
MULTICHANNEL EMBEDDINGS
EMBEDDINGS COULD DRAW ON INFORMATION FROM MULTIPLE SOURCES!
MORPHOLOGICAL LEXICAL SEMANTIC SYNTACTIC
dos Santos & Guimaraes (2014a)
MULTICHANNEL EMBEDDINGS
EXAMPLE: CHAR-WORD JOINT FEATURIZATION
dos Santos & Guimaraes (2015)
PROJECTION → EMBEDDING LV1; PROJECTION → EMBEDDING LV2
MULTICHANNEL EMBEDDINGS
ARCHITECTURE
RECURRENT NEURAL NETS
ADOLF HITLER CAME TO POWER IN 1933
B-PER I-PER O O O O B-TIME
TIME DISTRIBUTED PREDICTION
OUTPUT SEQUENCE: LABELS
INPUT SEQUENCE: WORDS
THE MODEL "REMEMBERS" WHAT HAPPENED IN THE 3 PREVIOUS TIME STEPS
PROCESS
RECURRENT NEURAL NETS
ADOLF
B-PER
PROJECTION TO EMBEDDING SPACE
PROJECTION TO LABEL SPACE
PROCESS
RECURRENT NEURAL NETS
ADOLF HITLER
B-PER I-PER
THE PARAMETERS "REMEMBER"
THE TRANSITION HISTORY!
THE SAME HIDDEN LAYER AT DIFFERENT TIME POINTS
PROCESS
RECURRENT NEURAL NETS
ADOLF HITLER CAME
B-PER I-PER O
THE PARAMETERS "REMEMBER"
THE TRANSITION HISTORY!
RESULT
RECURRENT NEURAL NETS
ADOLF HITLER CAME TO POWER IN 1933
B-PER I-PER O O O O B-TIME
AT EACH TIME POINT, THE PREVIOUS HISTORY IS ENCODED IN THE HIDDEN STATE
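The "remembering" can be seen in a minimal single-unit Elman-style forward pass (scalar weights, purely illustrative): the hidden state at each step is a function of the current input and the previous hidden state, so history flows forward through time.

```python
import math

def rnn_forward(inputs, Wx, Wh, h0=0.0):
    """Toy single-unit Elman RNN: h_t = tanh(Wx * x_t + Wh * h_{t-1}).
    The hidden state h_t carries the history of all previous inputs."""
    h, states = h0, []
    for x in inputs:
        h = math.tanh(Wx * x + Wh * h)
        states.append(h)
    return states

# The same input (1.0) at t=0 and t=2 yields different hidden states,
# because h_2 also reflects what happened at t=0 and t=1.
states = rnn_forward([1.0, 0.0, 1.0], Wx=0.5, Wh=0.8)
```

A real tagger would project each h_t to the label space (softmax over B-PER, I-PER, O, …) for the time-distributed prediction shown above.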
RECURRENT NEURAL NETS
STATE-OF-THE-ART: BIDIRECTIONAL LSTM-CRF
INPUT → ENCODER → CLASSIFIER, JOINTLY TRAINED
THE INPUT CAN ALSO BE MULTICHANNEL EMBEDDINGS!
Lample et al. (2016)
TOOLBOX
SUPERVISED NER + FEATURE ABSTRACTION
LIBRARIES
RNN
• word embeddings
  - pre-trained: spacy (https://spacy.io/)
  - create new: https://radimrehurek.com/gensim/models/word2vec.html
• neural nets
  - Keras: https://keras.io/layers/recurrent/
  - Tensorflow: https://www.tensorflow.org/tutorials/recurrent/
  - Theano: http://deeplearning.net/tutorial/rnnslu.html
COMPARISON
FEATURE ENGINEERING VS. FEATURE ABSTRACTION
                     FEATURIZATION   INTERPRETABILITY
FEATURE ENGINEERING  MANUAL          INTERPRETABLE
FEATURE ABSTRACTION  AUTOMATIC       NOT INTERPRETABLE
ARE DEEP LEARNING BASED MODELS NECESSARILY BETTER?
FEATURE ENGINEERING VS. FEATURE ABSTRACTION
• CRF CONVERGES FAST
• CRF IS GOOD IN LOW-DATA SETTINGS
• CRF IS MORE INTERPRETABLE
• PERFORMANCE DIFFERENCE: ~1%
LIME (LOCAL INTERPRETABLE MODEL-AGNOSTIC EXPLANATIONS) https://github.com/marcotcr/lime
SUGGESTIONS ON MODELING: NEW DOMAIN + UNLABELED DATA
• OPTION 1 (ABSENT DOMAIN KNOWLEDGE)
  1) UNSUPERVISED EXPLORATION
  2) PAID LABELING, THEN SUPERVISED MODEL
• OPTION 2 (EXPERT DOMAIN KNOWLEDGE AVAILABLE)
  1) PAID LABELING ON A SMALL SET
  2) SEMI-SUPERVISED EXPLORATION
  3) PAID LABELING, THEN SUPERVISED MODEL
PAID LABELING: TOOLS
CONFIDENT IN DOMAIN KNOWLEDGE
• MECHANICAL TURK (https://www.mturk.com/mturk)
LESS CONFIDENT IN DOMAIN KNOWLEDGE
• CROWDFLOWER (https://www.crowdflower.com/)
• SPARE5 (https://app.spare5.com/fives)
THANK YOU!