Probabilistic Programming with Imperative Factor Graphs

Andrew McCallum
Department of Computer Science, University of Massachusetts Amherst

Joint work with Karl Schultz, Sameer Singh, Michael Wick, Sebastian Riedel. Some slide material from Avi Pfeffer.
Probability

• Reasoning under uncertainty is a central challenge for building intelligent systems.
• Probability provides a mathematically sound basis for dealing with uncertainty.
• Combined with utilities, it provides a basis for decision-making under uncertainty.
Probabilistic Modeling in the Last Few Years

• Models ever growing in richness and variety:
  - hierarchical
  - spatio-temporal
  - relational
  - infinite
• Developing the representation, inference, and learning for a new model is a significant task.
Conditional Random Fields (Linear-chain) [Lafferty, McCallum, Pereira 2001]

Finite state model; undirected graphical model, trained to maximize the conditional probability of the output (sequence) given the input (sequence).

[Figure: linear-chain graphical model: state sequence y1 ... y8 with transition factors, over observation sequence x1 ... x8 with observation factors.]

$$p(\vec{y} \mid \vec{x}) = \frac{1}{Z_{\vec{x}}} \prod_{t=1}^{|\vec{x}|} \phi(y_t, y_{t-1})\,\phi(x_t, y_t), \qquad \phi(x_t, y_t) = \exp\Big(\sum_k \lambda_k f_k(x_t, y_t)\Big)$$
Conditional Random Fields (Linear-chain) [Lafferty, McCallum, Pereira 2001]

Wide-spread interest, positive experimental results in many applications:
• Noun phrase, named entity [HLT'03], [CoNLL'03]
• Protein structure prediction [ICML'04]
• IE from bioinformatics text [Bioinformatics '04], ...
• Asian word segmentation [COLING'04], [ACL'04]
• IE from research papers [HLT'04]
• Object classification in images [CVPR '04]
Skip-chain CRF [Sutton, McCallum 2005]

• Joint NER across sentences
• Captures long-distance dependencies

[Figure: skip-chain CRF over "Senator Joe Green said today . . . Green chairs the ...", with a skip edge linking the repeated mentions of "Green".]
Factorial CRF [Sutton, McCallum '04]

• Joint part-of-speech tagging, NP chunking, and NER
• Inference by loopy belief propagation

[Figure: factorial CRF with stacked layers for part-of-speech, noun-phrase boundaries, and named-entity tags over the English words "Those surfers like San Jose".]
Pairwise Affinity CRF [McCallum & Wellner 2003]

[Figure: pairwise affinity CRF over the mentions "Mr. Hill", "Amy Hall", "Dana Hill", "Dana", "she", with edges labeled coreferent (C) or non-coreferent (N).]
Entity Resolution

[Figure: a sequence of slides in which the mentions "Mr. Hill", "Amy Hall", "Dana Hill", "Dana", "she" are grouped into a varying number of hypothesized entities.]
CRF for Co-reference [McCallum & Wellner 2003]

[Figure: coreference CRF over the same five mentions, with pairwise coreferent (C) / non-coreferent (N) decision variables on the edges.]
$$p(\vec{y} \mid \vec{x}) = \frac{1}{Z_{\vec{x}}} \exp\Big(\sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_{ij})\Big)$$
+ mechanism for preserving transitivity
Make pair-wise merging decisions jointly by:
- calculating a joint probability
- including all edge weights
- enforcing transitivity (see the sketch below)
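The transitivity requirement can be made concrete: since coreference is an equivalence relation, no triple of mentions may have exactly two coreferent pairs. Here is a minimal Scala sketch of that consistency check, assuming only a hypothetical symmetric pairwise decision function `coref`; it illustrates the constraint, not the paper's actual mechanism for enforcing it:

```scala
// Pairwise decisions are globally consistent (i.e. they induce a partition
// of the mentions) iff no triple has exactly two coreferent pairs;
// e.g. a~b and b~c but not a~c violates transitivity.
def isTransitive(mentions: Seq[String],
                 coref: (String, String) => Boolean): Boolean =
  mentions.combinations(3).forall { case Seq(a, b, c) =>
    Seq(coref(a, b), coref(b, c), coref(a, c)).count(identity) != 2
  }
```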
Pairwise Affinity is not Enough

[Figure: the same pairwise model, first over the named mentions, then over a document where nearly every mention is "she"; pairwise affinities alone have little to go on.]
Pairwise Comparisons are Not Enough

Examples:
• Are ∀ mentions pronouns?
• Entities have multiple attributes (name, email, institution, location); we need to measure "compatibility" among them.
• Having 2 "given names" is common, but not 4 (e.g. Howard M. Dean / Martin, Dean / Howard Martin).
• We need to measure the size of the clusters of mentions.
• Does there ∃ a pair of last-name strings that differ by more than 5?

We need to ask ∃, ∀ questions about a set of mentions: we want first-order logic! (A sketch of such set-level questions follows.)
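To make the set-level questions concrete, here is a minimal Scala sketch. The `Mention` record and the particular features are hypothetical illustrations of the ∃/∀ idea, not the system's actual feature set:

```scala
// Hypothetical mention record for illustration.
case class Mention(text: String, isPronoun: Boolean, lastName: Option[String])

// Entity-level features are forall/exists queries over a whole cluster of
// mentions, which pairwise comparisons cannot express.
def entityFeatures(cluster: Seq[Mention]): Map[String, Boolean] = Map(
  "allPronouns" -> cluster.forall(_.isPronoun),          // the "forall" question
  "bigCluster"  -> (cluster.size > 10),                  // cluster-size question
  "lastNameGap" -> cluster.flatMap(_.lastName).combinations(2).exists {
    case Seq(n1, n2) => (n1.length - n2.length).abs > 5  // the "exists" question
  }
)
```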
Partition Affinity CRF

Ask arbitrary questions about all entities in a partition with first-order logic...

[Figure: successive slides showing factors that score entire partitions of the mentions "Amy Hall", "she", "she", "she", "she".]
How can we perform inference and learning in models that cannot be "unrolled"?

We can't use belief propagation. We can't use standard integer linear programming.

Don't represent all alternatives... just one at a time.

[Figure: a single current configuration of the mentions; a stochastic jump drawn from the proposal distribution moves it to a neighboring configuration.]
Markov Chain Monte Carlo: Metropolis-Hastings and SampleRank
Metropolis-Hastings for MAP

Maximum a posteriori (MAP) inference: $\arg\max_{y \in \mathcal{F}} P(Y = y \mid x)$ (1)

... over a model

$$P(y \mid x) = \frac{1}{Z_x} \prod_i \psi(x, y_i)$$

... using a proposal distribution $q(y' \mid y) : \mathcal{F} \times \mathcal{F} \to [0, 1]$

MH for MAP:
1. Begin with some initial configuration $y^0 \in \mathcal{F}$.
2. For $i = 1, 2, 3, \ldots$ draw a local modification $y' \in \mathcal{F}$ from $q$.
3. Probabilistically accept the jump as a Bernoulli draw with parameter

$$\alpha = \min\left(1,\; \frac{p(y')}{p(y)} \, \frac{q(y \mid y')}{q(y' \mid y)}\right)$$

(1) $\mathcal{F}$ is the feasible region defined by deterministic constraints, e.g. clustering, parse-tree projectivity.
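Below is a minimal Scala sketch of step 3, under stated assumptions: `score(y)` is a hypothetical helper returning the unnormalized log-score log p(y) (so Z_x cancels in the ratio), and `propose(y)` draws y' from q together with the forward and backward log-proposal probabilities. It illustrates the generic MH step, not FACTORIE's implementation:

```scala
import scala.util.Random

// One Metropolis-Hastings step: accept y' with probability
// alpha = min(1, p(y')/p(y) * q(y|y')/q(y'|y)), computed in log space.
def mhStep[Y](y: Y,
              score: Y => Double,                 // log p(y), unnormalized
              propose: Y => (Y, Double, Double),  // (y', log q(y'|y), log q(y|y'))
              rng: Random): Y = {
  val (yNew, logQForward, logQBackward) = propose(y)
  val logAlpha = (score(yNew) - score(y)) + (logQBackward - logQForward)
  if (math.log(rng.nextDouble()) < logAlpha) yNew else y
}
```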
SampleRank
Given a factor graph with target variables y and observed x:
Problem with traditional ML: inference sits in the inner-most loop of learning.
- Maximum likelihood requires inference to compute marginals.
- Perceptron requires inference for decoding.

Want: push parameter updates (not full inference) into the inner-most loop of learning.

Idea: use MH as a guide, exploit its efficiency, and learn to rank neighboring samples during the random walk (a sketch follows).
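A minimal Scala sketch of that idea, with hypothetical helpers: `features(y)` returns a sparse feature map, `objective(y)` scores a configuration against the labeled truth, and `weights` is the learned parameter vector. When the model's ranking of the current and proposed configurations disagrees with the objective's ranking, make a perceptron-style update. This conveys the flavor of SampleRank, not its exact published form:

```scala
import scala.collection.mutable

def sampleRankUpdate[Y](weights: mutable.Map[String, Double],
                        features: Y => Map[String, Double],
                        objective: Y => Double,
                        current: Y, proposed: Y, rate: Double = 1.0): Unit = {
  def modelScore(y: Y): Double =
    features(y).map { case (f, v) => weights.getOrElse(f, 0.0) * v }.sum
  val objectivePrefersProposed = objective(proposed) > objective(current)
  val modelPrefersProposed     = modelScore(proposed) > modelScore(current)
  if (objectivePrefersProposed != modelPrefersProposed) {
    // rank the two neighboring samples correctly: push weights toward the
    // configuration the objective prefers, away from the other
    val (better, worse) =
      if (objectivePrefersProposed) (proposed, current) else (current, proposed)
    for ((f, v) <- features(better)) weights(f) = weights.getOrElse(f, 0.0) + rate * v
    for ((f, v) <- features(worse))  weights(f) = weights.getOrElse(f, 0.0) - rate * v
  }
}
```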
Figaro [Pfeffer, 2009]

• Generative model of objects and relations.
• Object oriented (also in Scala!)
  - "Models" are the basic building block, composed of other models, derived by inheritance.
  - Models are objects with conditions, constraints, and relations to other objects.
  - Model = data + factors; they are intertwined.
Figaro [Pfeffer, 2009]

People smoke with probability 0.6:
  Smoke(x)  1.5
Friends are 3 times as likely to have the same smoking habit as different:
  ¬Friends(x,y) ∨ ¬Smoke(x) ∨ Smoke(y)  3
  ¬Friends(x,y) ∨ Smoke(x) ∨ ¬Smoke(y)  3

```scala
class Person { val smokes = Flip(0.6) }
val alice, bob, clara = new Person
alice.smokes.condition(true)
val friends = List((alice, bob), (bob, clara))
def constraint(pair: (Boolean, Boolean)) =
  if (pair._1 == pair._2) 3.0 else 1.0
for ((p1, p2) <- friends)
  Pair(p1.smokes, p2.smokes).constrain(constraint)
```
Markov Logic: First-Order Logic as a Template to Define CRF Parameters
[Richardson & Domingos 2005], [Paskin & Russell 2002], [Taskar et al 2003]

Grounding the Markov network requires space $O(n^r)$, where n = number of constants and r = highest clause arity. (For example, n = 1000 constants with arity r = 3 already yields on the order of $10^9$ groundings.)
My Approach

• I'm going to immediately dismiss the generative models. Interesting, but not what performs best in NLP.

Want:
• Discriminatively trained factor graphs.
• Best previous example of this: Markov Logic.
Logic + Probability

• Significant interest in this combination: Poole, Muggleton, De Raedt, Sato, Domingos, ...
• We now hypothesize that in much of this previous work the "logic" aspect is mostly a red herring.
  - The real power: repeated relational structures and tied parameters.
  - Logic is one way to specify these structures, but not the only one, and perhaps not the best.
  - In deterministic programming, Prolog was replaced by imperative languages:
    ✦ programmers have to keep the imperative solver in mind after all
    ✦ much domain knowledge is procedural anyway
  - Logical inference is replaced by probabilistic inference.
Declarative Model Specification

• One of the biggest advances in AI & ML.
• Gone too far? Much domain knowledge is also procedural.
• Upcoming slides: three examples of injecting imperativism into factor graphs.
FACTORIE

• Factor graphs, Imperative, Extensible
• Implemented as a library in Scala [Martin Odersky]:
  - object oriented & functional
  - type inference
  - lazy evaluation
  - everything is an object (int, float, ...)
  - nice syntax for creating "domain-specific languages"
  - runs in the JVM (complete interoperation with Java)
  - "Haskell++ in a Java style"
• A library, not a new "little language":
  - all familiar Java constructs & libraries are available to you
  - integrate data pre-processing & evaluation with model specification
  - Scala makes the syntax not too bad
  - but not as compact as a dedicated language (BLOG, MLN)
Stages of FACTORIE programming

1. Define templates for data (i.e. classes).
   - Use data structures just as in deterministic programming.
   - Only special requirement: provide "undo" capability for changes (see the sketch below).
2. Define templates for factors.
   - Distinct from the data representation above; this makes it easy to modify model scores independently.
   - Use & transform the data's natural relations to define the factors' relations.
3. Optionally, define MCMC proposal functions that leverage domain knowledge.
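A minimal sketch of the "undo" requirement, assuming nothing about FACTORIE's actual internals: each change to a variable records a diff that can be reversed, so a rejected MCMC proposal can be rolled back cheaply:

```scala
import scala.collection.mutable

trait Diff { def undo(): Unit; def redo(): Unit }

// A variable whose mutations are recorded on a diff list.
class IntVariable(private var v: Int) {
  def value: Int = v
  def set(newValue: Int, diffs: mutable.Buffer[Diff]): Unit = {
    val old = v
    v = newValue
    diffs += new Diff {
      def undo(): Unit = v = old       // roll the change back
      def redo(): Unit = v = newValue  // re-apply it
    }
  }
}

// Rejected proposal: undo the recorded changes in reverse order.
def rollback(diffs: mutable.Buffer[Diff]): Unit =
  diffs.reverseIterator.foreach(_.undo())
```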
Scala

• New variable:
  var myHometown: String
  var myAltitude = 10523.2
• New constant:
  val myName = "Andrew"
• New method:
  def climb(increment: Double) = myAltitude += increment
• New class:
  class Skier extends Person
• New trait (like a Java interface with implementations):
  trait FirstAid { def applyBandage = ... }
• New class with a trait:
  class BackcountrySkier extends Skier with FirstAid
• New static object [generic]:
  object GlobalSkierTable extends ArrayList[Skier]
Example: Linear-Chain CRF for Segmentation

```scala
class Label(isBeg: Boolean) extends Bool(isBeg)
class Token(word: String) extends EnumVariable(word)
```

[Figure: label sequence T F F T F F over the words "Bill loves skiing Tom loves snowshoeing".]
Adding the VarSeq trait puts the variables in a sequence, giving each label its label.prev and label.next neighbors:

```scala
class Label(isBeg: Boolean) extends Bool(isBeg) with VarSeq
class Token(word: String) extends EnumVariable(word) with VarSeq
```
Avoid representing relations by indices; do it directly with members, pointers, ... any data structure:

```scala
class Label(isBeg: Boolean) extends Bool(isBeg) with VarSeq {
  val token: Token
}
class Token(word: String) extends EnumVariable(word) with VarSeq {
  val label: Label
  def longerThanSix = word.length > 6
}
```
Key Operation: Scoring a Proposal

• Acceptance probability ~ ratio of model scores; scores of factors that didn't change cancel.
• To efficiently score:
  - The proposal method runs.
  - Automatically build a list of variables that changed.
  - Find the factors that touch the changed variables.
  - Find the other (unchanged) variables needed to calculate those factors' scores.
• How do we find factors from variables and vice versa?
  - In BLOG, a rich, highly-indexed data structure stores the mapping variables ←→ factors.
  - But it is complex to maintain as the structure changes.

A sketch of the delta-scoring step follows.
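Here is a minimal Scala sketch of that delta-scoring step, under stated assumptions: `factorsTouching` stands in for the template unroll mechanism of the next slide, and `scoreBefore`/`scoreAfter` evaluate a factor in the pre- and post-proposal configurations (e.g. via the undo/redo diffs above). These names are hypothetical, not FACTORIE's API:

```scala
// The acceptance ratio needs only the factors whose variables changed; all
// other factor scores are identical in y and y' and cancel in the ratio.
def proposalScoreDelta[V, F](changedVariables: Seq[V],
                             factorsTouching: V => Seq[F],
                             scoreBefore: F => Double,
                             scoreAfter: F => Double): Double = {
  val touched = changedVariables.flatMap(factorsTouching).distinct
  touched.map(f => scoreAfter(f) - scoreBefore(f)).sum
}
```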
Imperativism #2: Model Structure

• Maintain no map structure between factors and variables.
• Finding the relevant factor templates is easy: usually # templates < 50.
• Primitive operation: given a factor template and one changed variable, find the factor's other variables.
• In the factor Template object, define imperative methods that do this:
  - unroll1(v1) returns (v1, v2, v3)
  - unroll2(v2) returns (v1, v2, v3)
  - unroll3(v3) returns (v1, v2, v3)
  - i.e., use a Turing-complete language to determine structure on the fly.
  - If you want to use a data structure instead, access it in the method.
  - If you want a higher-level language for specifying structure, write it in terms of this primitive.
• Another nice attribute: easy to do value-conditioned structure (Case Factor Diagrams, etc.). A sketch of the unroll primitive follows.
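A minimal sketch of the unroll primitive for the running Label/Token example. The stand-in classes below are hypothetical simplifications; FACTORIE's real Template types differ:

```scala
// Stand-ins for the running example's data classes.
class Token(val word: String) { var label: Label = _ }
class Label(var isBeg: Boolean, val token: Token)

// A template whose factors touch a (Label, Token) pair. Given the one
// variable that changed, each unroll method imperatively reconstructs the
// full set of variables its factor touches; no variable-to-factor map needed.
class LabelTokenTemplate {
  def unroll1(label: Label): (Label, Token) = (label, label.token)
  def unroll2(token: Token): (Label, Token) = (token.label, token)
}
```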