CONSTRAINED CONDITIONAL MODELS TUTORIAL Jingyu Chen, Xiao Cheng
Dec 14, 2015
INTRODUCTION
Main ideas:
• Idea 1: Modeling
  Separate modeling and problem formulation from algorithms
  • Similar to the philosophy of probabilistic modeling
• Idea 2: Inference
  Keep the model simple, make expressive decisions (via constraints)
  • Unlike probabilistic modeling, where models become more expressive
  • Inject background knowledge
• Idea 3: Learning
  Expressive structured decisions can be supported by simply learned models
  • Global inference can be used to amplify the simple models (and even minimal supervision).
Task of interest: Structured Prediction
• Common formulation
  • e.g. HMM, CRF, Structured Perceptron, etc.
• Covers a lot of NLP problems:
  • Parsing; Semantic Parsing; Summarization; Transliteration; Co-reference resolution; Textual Entailment…
• IE problems:
  • Entities, relations, attributes…
• How to improve without incurring performance issues?
Pipeline?
• A very crude approximation to the real problem; it propagates errors.
• Ignores dependencies:
  • e.g. in relation extraction, the label of an entity depends on the relations it is involved in, and the relation label depends on the labels of its arguments.
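Model Formulation
• Typical models: $y^* = \arg\max_y w^T f(x, y)$
• With CCM we choose: $y^* = \arg\max_y w^T f(x, y) - \rho^T d(x, y)$
  • $w^T f(x, y)$: local dependencies, e.g. HMM, CRF
  • $\rho$: penalty; $d(x, y)$: violation measure
  • $-\rho^T d(x, y)$: regularization, constraint expressivity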
Multiclass Problem:
One v. All approximation:
Ideal classification, can be expressed through constraints
Implementations
• Modeling: objective function
  $y^* = \arg\max_y w^T f(x, y) - \rho^T d(x, y)$
• Constrained optimization solver: Integer Linear Programming
• Inference: exact ILP, heuristic search, relaxation, dynamic programming
• Learning: learn $w$ and $\rho$; they can be learnt jointly or separately, with semi-supervised learning, etc.
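The objective above can be illustrated with a minimal sketch that scores every candidate labeling exhaustively. The label set, the per-token weights and the single "at most one B" constraint are all hypothetical, chosen only to show how the penalty term changes the arg max; a real system would use an ILP solver or search instead of enumeration.

```python
from itertools import product

LABELS = ["A", "B"]  # hypothetical label set


def local_score(x, y, w):
    """w^T f(x, y): sum of hypothetical per-token weights w[(token, label)]."""
    return sum(w.get((tok, lab), 0.0) for tok, lab in zip(x, y))


def violations(y):
    """d(x, y): one count per constraint. Here a single toy constraint:
    label "B" may appear at most once."""
    return [max(0, y.count("B") - 1)]


def ccm_argmax(x, w, rho):
    """Exhaustively maximise w^T f(x, y) - rho^T d(x, y) over all labelings."""
    def objective(y):
        penalty = sum(r * d for r, d in zip(rho, violations(y)))
        return local_score(x, y, w) - penalty
    return max(product(LABELS, repeat=len(x)), key=objective)


x = ["u", "v", "w"]
w = {("u", "B"): 1.0, ("v", "B"): 0.8, ("w", "A"): 0.5}
unconstrained = ccm_argmax(x, w, rho=[0.0])   # penalty switched off
constrained = ccm_argmax(x, w, rho=[10.0])    # large penalty, nearly a hard constraint
print(unconstrained, constrained)
```

With the penalty switched off the local model labels two tokens "B"; with a large penalty the second "B" is no longer worth its cost.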
How do we use CCM to learn?
EXAMPLE 1: JOINT INFERENCE-BASED LEARNING
Constrained HMM in Information Extraction
Typical work flow
• Define basic classifiers
• Define constraints as linear inequalities
• Combine the two into an objective function
HMMCCM Example
• Information extraction without prior knowledge
• Use HMM
HMMCCM Example
AUTHOR Lars Ole Andersen . Program analysis and
TITLE specialization for the
EDITOR C
BOOKTITLE Programming language
TECH-REPORT . PhD thesis .
INSTITUTION DIKU , University of Copenhagen , May
DATE 1994 .
The HMM output above violates many natural constraints
HMMCCM Example
• Each field must be a consecutive list of words and can appear at most once in a citation.
• State transitions must occur on punctuation marks.
• The citation can only start with AUTHOR or EDITOR.
• The words pp., pages correspond to PAGE.
• Four-digit numbers of the form 19xx or 20xx are DATE.
• Quotations can appear only in TITLE.
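Constraints like the ones listed above can be turned into violation counts d(x, y). The sketch below does this for three of them. The function names are hypothetical, and "transitions must occur on punctuation" is read here as "a label may only change right after a punctuation token", which is an assumption about the intended meaning.

```python
import re


def d_start(tokens, labels):
    """1 if the citation does not start with AUTHOR or EDITOR, else 0."""
    return 0 if labels[0] in ("AUTHOR", "EDITOR") else 1


def d_at_most_once(tokens, labels):
    """Number of times a field re-opens after it was already closed."""
    seen, count, prev = set(), 0, None
    for lab in labels:
        if lab != prev:
            if lab in seen:
                count += 1
            seen.add(lab)
        prev = lab
    return count


def d_transition_on_punct(tokens, labels):
    """Number of label changes whose preceding token is not punctuation."""
    return sum(
        1
        for i in range(1, len(labels))
        if labels[i] != labels[i - 1]
        and not re.fullmatch(r"[^\w\s]+", tokens[i - 1])
    )


# Prefix of the HMM output shown earlier in the slides.
tokens = ["Lars", "Ole", "Andersen", ".", "Program", "analysis", "and",
          "specialization", "for", "the", "C"]
labels = ["AUTHOR"] * 7 + ["TITLE"] * 3 + ["EDITOR"]
print(d_start(tokens, labels), d_at_most_once(tokens, labels),
      d_transition_on_punct(tokens, labels))
```

Both label changes in this prefix happen after an ordinary word ("and", "the"), so the punctuation constraint is violated twice.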
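HMMCCM Example
• How do we use constraints with HMM?
• Standard HMM:
  • Learn the probability of the sequence of labels $y$ and input $x$:
    $P(x, y) = \prod_{i=1}^{n} P(y_i \mid y_{i-1}) \, P(x_i \mid y_i)$, where $P(y_1 \mid y_0)$ is the initial-state distribution
  • Inference, taking the most likely label sequence:
    $y^* = \arg\max_y P(y \mid x) = \arg\max_y P(x, y)$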
HMMCCM Example
• New objective function involving constraints
• Penalize the probability of a sequence if it violates a constraint
  • A penalty is paid each time the constraint is violated
HMMCCM Example
• Transform to linear model
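The transformation to a linear model can be sketched as follows, assuming a first-order HMM in which $P(y_1 \mid y_0)$ denotes the initial-state distribution. The feature vector $f(x, y)$ counts transitions and emissions and the weight vector $w$ holds the corresponding log-probabilities; this is the standard rewriting, not necessarily the exact notation of the original slide.

```latex
\begin{align*}
\log P(x, y)
  &= \sum_{i} \log P(y_i \mid y_{i-1}) + \sum_{i} \log P(x_i \mid y_i) \\
  &= \sum_{a,b} \log P(b \mid a)\,\#\{i : y_{i-1}=a,\ y_i=b\}
   + \sum_{b,o} \log P(o \mid b)\,\#\{i : y_i=b,\ x_i=o\} \\
  &= w^T f(x, y), \\
y^* &= \arg\max_y \; w^T f(x, y) - \rho^T d(x, y).
\end{align*}
```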
HMMCCM Example
• We need to learn the new parameters that maximize the scoring function
• Although the scoring function is no longer the log-likelihood of the dataset, it is still a smooth concave function with a unique global maximum, where the gradient is zero.
HMMCCM Example
Simply counting the probability of the constraints being violated
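A count-based estimate could look like the sketch below. The exact estimator is not given in the slides, so the formula used here is an assumption: the penalty for constraint k is taken to be -log of the smoothed fraction of gold examples that violate it, so that rarely violated constraints get large penalties. The function names and the toy data are hypothetical.

```python
import math


def starts_badly(tokens, labels):
    """Violation if the citation does not start with AUTHOR or EDITOR."""
    return 0 if labels[0] in ("AUTHOR", "EDITOR") else 1


def quote_outside_title(tokens, labels):
    """Number of quotation marks labeled with something other than TITLE."""
    return sum(1 for t, l in zip(tokens, labels) if t == '"' and l != "TITLE")


def estimate_rho(gold_examples, violation_fns, smoothing=1.0):
    """ASSUMED estimator: rho_k = -log(p_k), where p_k is the add-one
    smoothed fraction of gold examples that violate constraint k."""
    n = len(gold_examples)
    rho = []
    for fn in violation_fns:
        violated = sum(1 for x, y in gold_examples if fn(x, y) > 0)
        p = (violated + smoothing) / (n + 2 * smoothing)
        rho.append(-math.log(p))
    return rho


gold = [
    (["Smith", ".", "Parsing"], ["AUTHOR", "AUTHOR", "TITLE"]),
    (["Lee", ".", "Tagging"], ["AUTHOR", "AUTHOR", "TITLE"]),
    (["Proceedings", "of", "X"], ["BOOKTITLE"] * 3),
]
rho = estimate_rho(gold, [starts_badly, quote_outside_title])
print(rho)
```

The constraint that is never violated in the toy data receives the larger penalty.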
HMMCCM Example
Are there other ways to learn?
Can this paradigm be generalized?
TRAINING PARADIGMS
Training paradigms
Decompose, Learn, Inference
Prior knowledge: Features vs. Constraints
                        Feature               Constraint
Data dependent          Yes                   No (if not learnt)
Learnable               Yes                   Yes
Size                    Large                 Small
Improvement approach    Higher order model    Post-processing for L+I
Domain
Penalty type            Soft                  Hard & Soft
Common usage            Local                 Global
Formulation             Propositional/        FOL/
Comparison with MLN
• In Markov Logic Networks (MLNs), constraints are formulated as explicit probabilities, jointly with the overall distribution:
  • e.g.
• Constraints in CCM are formulated as linear inequalities
  • e.g.
• Theoretically the same, very different in practice
Training paradigms
• Learning + Inference (L+I): Train with some constraints, apply all constraints only in inference
  • No need to retrain an existing system
  • Fast and modular
• Inference-Based Training (IBT): Train jointly with constraints and dependencies (e.g. Graphical Models)
  • Better for strong interactions between output labels
• Other training paradigms:
  • Pipeline-like sequential model [Roth, Small, Titov: AI&Stat’09]
  • Constraints Driven Learning (CODL) [Chang et al. ’07, ’12]
Which paradigm is better?
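Algorithmic view of the differences
For each iteration
  For each $(x, y^{GOLD})$ in the training data
    L+I: $y^{PRED} = \arg\max_y w^T f(x, y)$
    IBT: $y^{PRED} = \arg\max_y w^T f(x, y) - \rho^T d(x, y)$
    If $y^{PRED} \neq y^{GOLD}$
      $w = w + f(x, y^{GOLD}) - f(x, y^{PRED})$
    endif
  endfor
endfor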
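The difference between the two paradigms can be sketched with a toy structured perceptron: IBT keeps the penalty term inside the training-time arg max, while L+I trains without it and applies the constraints only at prediction time. The feature function, the single "at most one position labeled 1" constraint and the two training examples are hypothetical, and inference is done by exhaustive search rather than ILP.

```python
from itertools import product

LABELS = [0, 1]


def features(x, y):
    """f(x, y): hypothetical indicator counts of (observation, label) pairs."""
    f = {}
    for obs, lab in zip(x, y):
        f[(obs, lab)] = f.get((obs, lab), 0) + 1
    return f


def score(w, x, y):
    """w^T f(x, y)."""
    return sum(w.get(k, 0.0) * v for k, v in features(x, y).items())


def violation(y):
    """d(x, y) for one toy constraint: at most one position is labeled 1."""
    return max(0, sum(y) - 1)


def predict(w, x, rho):
    """arg max_y w^T f(x, y) - rho * d(x, y), by exhaustive search."""
    return max(product(LABELS, repeat=len(x)),
               key=lambda y: score(w, x, y) - rho * violation(y))


def train(data, rho, use_constraints_in_training, epochs=5):
    """Perceptron loop: IBT keeps the penalty while training, L+I drops it."""
    w = {}
    train_rho = rho if use_constraints_in_training else 0.0
    for _ in range(epochs):
        for x, y_gold in data:
            y_pred = predict(w, x, train_rho)
            if y_pred != tuple(y_gold):
                for k, v in features(x, y_gold).items():
                    w[k] = w.get(k, 0.0) + v
                for k, v in features(x, y_pred).items():
                    w[k] = w.get(k, 0.0) - v
    return w


data = [(("a", "b"), (1, 0)), (("b", "c"), (0, 1))]
w_li = train(data, rho=10.0, use_constraints_in_training=False)   # L+I
w_ibt = train(data, rho=10.0, use_constraints_in_training=True)   # IBT
li_preds = [predict(w_li, x, 10.0) for x, _ in data]    # constraints at inference
ibt_preds = [predict(w_ibt, x, 10.0) for x, _ in data]
print(li_preds, ibt_preds)
```

On this tiny separable example both paradigms recover the gold labelings; the tradeoffs discussed in the slides only show up on harder problems.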
L+I vs. IBT tradeoffs
[Figure: L+I vs. IBT comparison; axis: # of features]
In some cases problems are hard due to lack of training data.
Semi-supervised learning
Choice of paradigm
• IBT:
  • Better when the interaction between output labels is strong
• L+I:
  • Faster computationally
  • Modular: no need to retrain existing classifiers, and works with simple models such as
PARADIGM 2: LEARNING + INFERENCE
An example with Entity-Relation Extraction
Entity-Relation Extraction [RothYi07]
Dole's wife, Elizabeth, is a native of N.C.
• Entities: E1 (Dole), E2 (Elizabeth), E3 (N.C.)
• Relations: R12, R23
• Decision time inference
Entity-Relation Extraction [RothYi07]
• Formulation 1: Joint Global Model
  • Intractable to learn; needs to be decomposed
Entity-Relation Extraction [RothYi07]
• Formulation 2: Local learning + global inference
Entity-Relation Extraction [RothYi07]
Cost function:
c{E1 = per}· x{E1 = per} + c{E1 = loc}· x{E1 = loc} + … + c{R12 = spouse_of}· x{R12 = spouse_of} + … + c{R12 = ∅}· x{R12 = ∅} + …
• Entities: E1 = Dole, E2 = Elizabeth, E3 = N.C.
• Relations: R12, R21, R23, R32, R13, R31
Entity-Relation Extraction [RothYi07]
Exactly one label for each relation and entity
Relation and entity type constraints
Integral constraints, in effect boolean
Entity-Relation Extraction [RothYi07]
• Each entity is either a person, organization or location (or none of these):
  x{E1 = per} + x{E1 = loc} + x{E1 = org} + x{E1 = ∅} = 1
• (R12 = spouse_of) ⇒ (E1 = person) ∧ (E2 = person)
  x{R12 = spouse_of} ≤ x{E1 = per}
  x{R12 = spouse_of} ≤ x{E2 = per}
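The effect of the constraints on the final decision can be sketched with a brute-force search over label assignments, used here in place of an ILP solver. The local scores, the `born_in` label and the restriction to two entities and one relation are hypothetical. Enumerating exactly one label per variable plays the role of the "exactly one label" equality constraints.

```python
from itertools import product

ENTITY_LABELS = ["per", "loc", "org", "none"]
RELATION_LABELS = ["spouse_of", "born_in", "none"]

# Hypothetical local classifier scores c{variable = label}.
scores = {
    "E1": {"per": 0.5, "loc": 0.6, "org": 0.1, "none": 0.0},
    "E2": {"per": 0.8, "loc": 0.1, "org": 0.1, "none": 0.0},
    "R12": {"spouse_of": 0.9, "born_in": 0.2, "none": 0.1},
}


def satisfies_constraints(e1, e2, r12):
    """(R12 = spouse_of) => (E1 = per) and (E2 = per)."""
    if r12 == "spouse_of":
        return e1 == "per" and e2 == "per"
    return True


def infer(constrained=True):
    """Maximise the summed local scores, optionally under the constraint."""
    best, best_score = None, float("-inf")
    for e1, e2, r12 in product(ENTITY_LABELS, ENTITY_LABELS, RELATION_LABELS):
        if constrained and not satisfies_constraints(e1, e2, r12):
            continue
        total = scores["E1"][e1] + scores["E2"][e2] + scores["R12"][r12]
        if total > best_score:
            best, best_score = (e1, e2, r12), total
    return best


local_only = infer(constrained=False)
global_inference = infer(constrained=True)
print(local_only, global_inference)
```

Taken alone, the entity classifier slightly prefers `loc` for E1, but the confident `spouse_of` relation pulls the joint decision to `per`.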
Entity-Relation Extraction [RothYi07]
• Entity classification results
Entity-Relation Extraction [RothYi07]
• Relation identification results
INNER WORKINGS OF INFERENCE
Constraints Encoding
• Atoms
• Existential quantification
• Negation
• Conjunction
• Disjunction
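The encodings on the original slide did not survive extraction. The sketch below checks the standard textbook encodings of these logical forms as linear (in)equalities over 0/1 variables, which may differ in notation from what the slide showed. Each check enumerates the full truth table.

```python
from itertools import product


def negation(a):
    """not a  ->  1 - a"""
    return 1 - a


def disjunction_ok(a, b):
    """a or b  ->  a + b >= 1"""
    return a + b >= 1


def implication_ok(a, b):
    """a => b  ->  a <= b"""
    return a <= b


def conjunction_ok(z, a, b):
    """z = (a and b)  ->  z <= a, z <= b, z >= a + b - 1"""
    return z <= a and z <= b and z >= a + b - 1


def exists_ok(xs):
    """exists i: x_i  ->  sum_i x_i >= 1"""
    return sum(xs) >= 1


def check_encodings():
    """Compare every encoding with its logical meaning on all 0/1 inputs."""
    for a, b in product([0, 1], repeat=2):
        assert negation(a) == int(not a)
        assert disjunction_ok(a, b) == bool(a or b)
        assert implication_ok(a, b) == ((not a) or bool(b))
        for z in (0, 1):
            assert conjunction_ok(z, a, b) == (z == (a and b))
    for xs in product([0, 1], repeat=3):
        assert exists_ok(xs) == any(xs)
    return True


print(check_encodings())
```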
Integer Linear Programming (ILP)
• Powerful tool, very general
• NP-hard even in the binary case, but efficient for most NLP problems
• If ILP cannot solve the problem efficiently, we can fall back to approximate solutions using heuristic search
SENTENCE COMPRESSION
Sentence Compression Example
Modelling Compression with Discourse Constraints, James Clarke and Mirella Lapata, COLING/ACL 2006
• 1. What is sentence compression?
  • Sentence compression is commonly expressed as a word deletion problem: given an input sentence of words W = w1, w2, . . . , wn, the aim is to produce a compression by removing any subset of these words (Knight and Marcu 2002).
A trigram language model: maximize a scoring function by ILP, using the following decision variables:
• p_i: word i starts the compression
• q_{i,j}: sequence w_i, w_j ends the compression
• x_{i,j,k}: trigram w_i, w_j, w_k is in the compression
• y_i: word i is in the compression
• Each p, q, x, y is either 0 or 1
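The scoring function itself is missing from the extracted slide. One objective consistent with the variable definitions above, and with the usual trigram formulation of this model, is sketched below; the index ranges and the `start`/`end` symbols are assumptions.

```latex
\begin{align*}
\max \quad & \sum_{i=1}^{n} p_i \, P(w_i \mid \text{start})
  + \sum_{i<j<k} x_{i,j,k} \, P(w_k \mid w_i, w_j)
  + \sum_{i<j} q_{i,j} \, P(\text{end} \mid w_i, w_j) \\
\text{s.t.} \quad & p_i,\ q_{i,j},\ x_{i,j,k},\ y_i \in \{0, 1\}
\end{align*}
```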
Sentential Constraints:
• 1. Disallow the inclusion of modifiers without their head words.
• 2. Enforce the presence of modifiers when the head is retained in the compression.
• 3. If a verb is present in the compression, then so are its arguments.
Modifier Constraint Example
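The first sentential constraint (a modifier may not be kept without its head) can be sketched as the inequality y_head - y_modifier >= 0 inside a brute-force search. To keep the sketch short, the trigram language model is replaced by hypothetical per-word scores, and the sentence and head/modifier pairs are made up for illustration.

```python
from itertools import product


def compress(word_scores, modifier_of, min_words=1):
    """Choose y_i in {0, 1} maximising sum_i y_i * score_i subject to
    y[head] - y[modifier] >= 0 for every (modifier -> head) pair."""
    n = len(word_scores)
    best, best_score = None, float("-inf")
    for y in product([0, 1], repeat=n):
        if sum(y) < min_words:
            continue
        if any(y[head] - y[mod] < 0 for mod, head in modifier_of.items()):
            continue
        total = sum(s for s, keep in zip(word_scores, y) if keep)
        if total > best_score:
            best, best_score = y, total
    return best


words = ["the", "old", "man", "slept"]
word_scores = [-0.1, 0.4, -0.3, 0.9]   # hypothetical per-word scores
modifier_of = {0: 2, 1: 2}             # "the" and "old" modify "man"

free = compress(word_scores, {})
constrained = compress(word_scores, modifier_of)
print([w for w, k in zip(words, free) if k],
      [w for w, k in zip(words, constrained) if k])
```

Without the constraint the search keeps "old" but drops its head "man"; with it, the head is pulled back into the compression.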
Sentential Constraints:
• 4. Preserve personal pronouns in the compressed output.
Discourse Constraints:
• 1. The center of a sentence is retained in the compression, and the entity realised as the center in the following sentence is also retained.
  • The center of a sentence is the entity with the highest rank.
  • Entities may be ranked by many features, e.g. grammatical role (subjects > objects > others).
Discourse Constraints:
• 2. Lexical chain constraints:
  • A lexical chain is a sequence of semantically related words.
  • Often the longest lexical chain is the most important chain.
SEMANTIC ROLE LABELING
Semantic Role Labeling Example:
• What is SRL?
  • SRL identifies all constituents that fill a semantic role, and determines their roles.
General information:
• Both models (argument identifier and argument classifier) are trained with SNoW.
• Idea: maximize the scoring function
SRL: Argument Identification
• Use a learning scheme that utilizes two classifiers, one to predict the beginnings of possible arguments, and the other the ends. The predictions are combined to form argument candidates.
• Why:
  • When only shallow parsing is available, the system does not have constituents to begin with. Therefore, conceptually, the system has to consider all possible subsequences.
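Combining the two classifiers can be sketched as pairing every predicted beginning with every predicted end at or after it. The threshold, the optional length cap and the probabilities below are hypothetical; the actual system may combine the predictions differently.

```python
def argument_candidates(begin_probs, end_probs, threshold=0.5, max_len=None):
    """Pair each likely argument start with each likely argument end.

    begin_probs[i] / end_probs[i]: hypothetical classifier outputs for
    "an argument begins / ends at token i". Returns inclusive (start, end)
    spans with start <= end.
    """
    begins = [i for i, p in enumerate(begin_probs) if p >= threshold]
    ends = [j for j, p in enumerate(end_probs) if p >= threshold]
    return [
        (b, e)
        for b in begins
        for e in ends
        if b <= e and (max_len is None or e - b + 1 <= max_len)
    ]


begin_probs = [0.9, 0.1, 0.7, 0.0]
end_probs = [0.2, 0.8, 0.1, 0.6]
candidates = argument_candidates(begin_probs, end_probs)
print(candidates)
```

The candidates may overlap; choosing among them is left to the constrained inference step.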
SRL: List of features
• POS tags
• Length
• Verb class
• Head word and POS tag of the head word
• Position
• Path
• Chunk pattern
• Clause relative position
• Clause coverage
• NEG
• MOD
SRL: Constraints
• 1. Arguments cannot overlap with the predicate.
• 2. Arguments cannot exclusively overlap with the clauses.
• 3. If a predicate is outside a clause, its arguments cannot be embedded in that clause.
• 4. No overlapping or embedding arguments.
• 5. No duplicate argument classes for core arguments.
  • Note: conjunction is an exception.
  • [A0 I] [V left] [A1 my pearls] [A2 to my daughter] and [A1 my gold] [A2 to my son].
SRL: Constraints
• 6. If an argument is a reference to some other argument arg, then this referenced argument must exist in the sentence.
• 7. If there is a C-arg argument, then there has to be an arg argument; in addition, the C-arg argument must occur after arg.
  • The label C-arg is used to specify the continuity of the arguments.
• 8. Given a specific verb, some argument types should never occur.
SRL Results:
QA
• Questions?