Latent Factor Models
Geoff Gordon
Joint work w/ Ajit Singh, Byron Boots, Sajid Siddiqi, Nick Roy
Jan 22, 2016
Motivation
A key component of a cognitive tutor: student cognitive model
Tracks what skills student currently knows—latent factors
[Diagram: example skill network with latent skill nodes circle-area, rectangle-area, and decompose-area feeding an observed right-answer node]
Motivation
Student models are a key bottleneck in cognitive tutor authoring and performance
rough estimate: 20-80 hrs to hand-code model for 1 hr of content
result may be too simple, not rigorously verified
But, demonstrated improvements in learning from better models
E.g., Cen et al. [2007]: 12% less time to learn 6 geometry units (same retention) using a tutor w/ a more accurate model
This talk: automatic discovery of new models and data-driven revision of existing models via (latent) factor analysis
Simple case: snapshot, no side information
Score of student i on item j:

STUDENTS \ ITEMS   1  2  3  4  5  6  …
   A               1  1  0  0  1  0  …
   B               0  1  1  0  0  0  …
   C               1  1  0  1  1  0  …
   D               1  0  0  1  1  0  …
   …               …  …  …  …  …  …  …
Missing data

STUDENTS \ ITEMS   1  2  3  4  5  6  …
   A               1  ?  ?  ?  1  0  …
   B               0  ?  1  0  ?  ?  …
   C               1  1  ?  ?  ?  0  …
   D               1  0  0  1  ?  ?  …
   …               …  …  …  …  …  …  …
Data matrix X
Rows x1, x2, …, xn: one row per student; columns are items.
Simple case: model
Graphical model: latent factor matrices U (n students × k latent factors) and V (m items × k latent factors) are unobserved; together they generate the observed performance matrix X.
U: student latent factors
V: item latent factors
X: observed performance
Linear-Gaussian version
Same structure (student factors U, n × k; item factors V, m × k), with:
U: Gaussian (0 mean, fixed variance)
V: Gaussian (0 mean, fixed variance)
X: Gaussian (fixed variance, mean Ui ⋅ Vj for entry ij)
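To make the generative story concrete, here is a minimal numpy sketch of sampling from this model; the dimensions and noise level are illustrative assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 100, 50, 3                # students, items, latent factors (illustrative)

U = rng.normal(0.0, 1.0, (n, k))    # student factors: Gaussian, 0 mean, fixed var
V = rng.normal(0.0, 1.0, (m, k))    # item factors: Gaussian, 0 mean, fixed var
X = U @ V.T + rng.normal(0.0, 0.1, (n, m))   # observed: Gaussian around Ui . Vj
```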
Matrix form: Principal Components Analysis

    X ≈ U Vᵀ

Data matrix X (n × m) ≈ compressed matrix U (n × k, rows u1 … un) times basis matrix Vᵀ (k × m, rows v1 … vk).
PCA: the picture
PCA: matrix form
Same picture, X ≈ U Vᵀ; the columns of V span the low-rank space.
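Using the synthetic X from the sketch above, a truncated SVD recovers this rank-k factorization. The column-centering step is the usual PCA convention (the slides do not specify it), and U and V are only identified up to an invertible linear map.

```python
# Truncated SVD of the (centered) data gives X ≈ U V^T.
Xc = X - X.mean(axis=0)                  # center columns, as in standard PCA
W, s, Vt = np.linalg.svd(Xc, full_matrices=False)
U_hat = W[:, :k] * s[:k]                 # compressed matrix (basis weights)
Vt_hat = Vt[:k, :]                       # basis matrix; rows span the low-rank space
print(np.linalg.norm(Xc - U_hat @ Vt_hat) / np.linalg.norm(Xc))  # near 0
```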
Interpretation of factors
In U (rows u1 … un: students × basis weights) and Vᵀ (rows v1 … vk: basis vectors × items):
basis vectors are candidate “skills” or “knowledge components”
weights are students’ knowledge levels
PCA is a widely successful model
FACE IMAGES FROM Groundhog Day, EXTRACTED BY CAMBRIDGE FACE DB PROJECT
Data matrix: face images
Rows x1, x2, …, xn are images; columns are pixels.
Result of factoring
U (images × basis weights, rows u1 … un) times Vᵀ (basis vectors × pixels, rows v1 … vk); the basis vectors are often called “eigenfaces”.
Eigenfaces
IMAGE CREDIT: AT&T LABS CAMBRIDGE
PCA: the good
Unsupervised: need no human labels of latent state!
No worry about “expert blind spot”
Of course, labels helpful if available
Post-hoc human interpretation of latents is nice too—e.g., intervention design
PCA: the bad
Linear, Gaussian
PCA assumes E(X) = UVᵀ, i.e., linear in the factors
PCA assumes X – E(X) is i.i.d. Gaussian
Nonlinearity: conjunctive skills
[Surface plot: P(correct) vs. skill 1 and skill 2; success requires both skills]
Nonlinearity: disjunctive skills
[Surface plot: P(correct) vs. skill 1 and skill 2; either skill suffices]
Nonlinearity: “other”
[Surface plot: P(correct) vs. skill 1 and skill 2; some other nonlinear shape]
Non-Gaussianity
Typical hand-developed skill-by-item matrix:

SKILLS \ ITEMS   1  2  3  4  5  6  …
  skill 1        1  1  0  0  1  1  …
  skill 2        0  0  1  1  0  1  …
Result of Gaussian assumption
[Plots: rows of the true and recovered V matrices]
The ugly: MLE only
PCA yields maximum-likelihood estimate
Good, right?
sadly, the usual reasons to want the MLE don’t apply here
e.g., consistency: variance and bias of estimates of U and V do not approach 0 (unless #items/student and #students/item both → ∞)
Result: MLE is typically far too confident of itself
Too certain: example
[Plots: learned coefficients (e.g., a row of U) and the resulting predictions]
Result: “fold-in problem”
Nonsensical results when trying to apply learned model to a new student or item
Similar to overfitting problem in supervised learning: confident-but-wrong parameters do not generalize to new examples
Unlike overfitting, fold-in problem doesn’t necessarily go away with more data
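A hypothetical numeric illustration of the fold-in problem (not from the talk): hold the learned item factors V fixed and fit a new student's factor u by MLE from only a few graded responses. With so few observations the data are typically linearly separable, so the likelihood keeps improving as u grows, and predictions saturate near 0 or 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
k = 5
V = rng.normal(size=(40, k))             # item factors, assumed already learned
seen = np.arange(6)                      # the new student answered only 6 items
x = (rng.random(6) < 0.5).astype(float)  # their right/wrong responses

u = np.zeros(k)                          # MLE for u by gradient ascent
for _ in range(2000):
    u += 0.1 * V[seen].T @ (x - sigmoid(V[seen] @ u))

print(np.abs(u).max())                   # large, and still growing with iterations
print(sigmoid(V @ u).round(2))           # held-out predictions pinned near 0 or 1
```

A Gaussian prior on u (as in the Bayesian model later in the talk) would keep the coefficients finite; the point is that the plain MLE has nothing to stop them.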
Summary: 3 problems w/ PCA
Can’t handle nonlinearity
Can’t handle non-Gaussian distributions
Uses MLE only (⇒ fold-in problem)
Let’s look at each problem in turn
Nonlinearity
In PCA, had Xij ≈ Ui ⋅ Vj
What if
Xij ≈ exp(Ui ⋅ Vj)
Xij ≈ logistic(Ui ⋅ Vj)
…
Non-Gaussianity
In PCA, had Xij ~ Normal(μ), μ = Ui ⋅ Vj
What if
Xij ~ Poisson(μ)
Xij ~ Binomial(p)
…
Exponential family review
Exponential family of distributions:
• P(X | θ) = P0(X) exp(X⋅θ – G(θ))
G(θ) is always strictly convex, differentiable on interior of domain
means G’ is strictly monotone (strictly generalized monotone in 2D or higher)
Exponential family review
Exponential family PDF:
• P(X | θ) = P0(X) exp(X⋅θ – G(θ))
Surprising result: G’(θ) = g(θ) = E(X | θ)
g and g⁻¹ = “link function” (and its inverse)
θ = “natural parameter”
E(X | θ) = “expectation parameter”
Examples
Normal(mean): g = identity
Poisson(log rate): g = exp
Binomial(log odds): g = sigmoid
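A quick numeric sanity check of the identity g(θ) = G′(θ) = E(X | θ) for the Poisson case, where G(θ) = exp(θ) and hence g(θ) = exp(θ); the particular θ and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.3                        # natural parameter = log rate
g_theta = np.exp(theta)            # G(theta) = exp(theta), so G'(theta) = exp(theta)
samples = rng.poisson(g_theta, size=100_000)
print(g_theta, samples.mean())     # both ≈ 3.67: E(X | theta) = g(theta)
```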
Nonlinear & non-Gaussian
Let P(X | θ) be an exponential family with natural parameter θ
Predict Xij ~ P(X | θij), where θij = Ui ⋅ Vj
e.g., in Poisson, E(Xij) = exp(θij)
e.g., in Binomial, E(Xij) = sigmoid(θij)
Optimization problem

    max over U, V:  ∑ij log P(Xij | θij) + log P(U) + log P(V)
    s.t. θij = Ui ⋅ Vj

“Generalized linear” or “exponential family” PCA:
all P(…) terms are exponential families
analogy to GLMs
[Collins et al., 2001] [Gordon, 2002] [Roy & Gordon, 2005]
Special cases
PCA, probabilistic PCA
Poisson PCA
k-means clustering
Max-margin matrix factorization (MMMF)
Almost: pLSI, pHITS, NMF
Comparison to AFM
AFM (the additive factor model) predicts

    logit(pij) = θi + ∑k Qjk (βk + γk Tik)

p = probability correct
θ = student overall performance
β = skill difficulty
Q = item × skill matrix
γ = skill practice slope
T = number of practice opportunities
Theorem
• In GL PCA, finding U which maximizes likelihood (holding V fixed) is a convex optimization problem
• And, finding best V (holding U fixed) is a convex problem
• Further, Hessian is block diagonal
So, an efficient and effective optimization algorithm: alternately improve U and V
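A minimal sketch of that alternating scheme for the Bernoulli/logistic case, with simple gradient steps and a Gaussian (ridge) prior standing in for log P(U) and log P(V). This is an assumption-laden toy: the cited papers presumably use proper convex solvers per subproblem, and real data would need a mask over missing entries.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_pca(X, k, iters=500, lr=0.05, lam=0.1, seed=0):
    """Alternately improve U and V by gradient ascent on the penalized
    Bernoulli log-likelihood; each subproblem is convex per the theorem."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = 0.1 * rng.normal(size=(n, k))
    V = 0.1 * rng.normal(size=(m, k))
    for _ in range(iters):
        R = X - sigmoid(U @ V.T)         # residual in the mean parameter
        U += lr * (R @ V - lam * U)      # improve U holding V fixed
        R = X - sigmoid(U @ V.T)
        V += lr * (R.T @ U - lam * V)    # improve V holding U fixed
    return U, V

# Usage on a toy 0/1 students-by-items matrix:
X = (np.random.default_rng(1).random((59, 139)) < 0.5).astype(float)
U, V = fit_logistic_pca(X, k=5)
```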
Example: compressing histograms w/ Poisson PCA
Points: observed frequencies in ℝ³
Hidden manifold: a 1-parameter family of multinomials
[Figure: data on the simplex with vertices labeled A, B, C]
Example
[Figure sequence: the fitted 1-parameter manifold after iterations 1, 2, 3, 4, 5, and 9]
Remaining problem: MLE
Well-known rule of thumb: if MLE gets you in trouble due to overfitting, move to fully-Bayesian inference
Typical problem: computation
In our case, the computation is just fine if we’re a little clever
Additional wrinkle: switch to hierarchical model
Bayesian hierarchical exponential-family PCA
Graphical model: shared priors R (over student latents) and S (over item latents) generate U (n students × k latent factors) and V (m items × k latent factors), which generate the observed X.
U: student latent factors
V: item latent factors
X: observed performance
R: shared prior for student latents
S: shared prior for item latents
A little clever: MCMC
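The sampler itself can be sketched very simply; below is a toy random-walk Metropolis over (U, V) under a Bernoulli likelihood and Gaussian priors. The talk's actual sampler is not spelled out here, so treat this only as the flavor of the computation; a real implementation would also sample the hierarchical priors R and S.

```python
import numpy as np

def log_post(U, V, X, tau=1.0):
    """Log posterior up to a constant: Bernoulli likelihood + Gaussian priors."""
    theta = U @ V.T
    loglik = np.sum(X * theta - np.logaddexp(0.0, theta))  # stable log(1+e^theta)
    return loglik - 0.5 * tau * (np.sum(U ** 2) + np.sum(V ** 2))

def mcmc(X, k, steps=5000, step=0.02, seed=0):
    """Random-walk Metropolis over (U, V); returns thinned posterior samples."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = 0.1 * rng.normal(size=(n, k))
    V = 0.1 * rng.normal(size=(m, k))
    lp, samples = log_post(U, V, X), []
    for t in range(steps):
        U2 = U + step * rng.normal(size=U.shape)   # propose a small random move
        V2 = V + step * rng.normal(size=V.shape)
        lp2 = log_post(U2, V2, X)
        if np.log(rng.random()) < lp2 - lp:        # Metropolis accept/reject
            U, V, lp = U2, V2, lp2
        if t % 50 == 0:
            samples.append((U.copy(), V.copy()))
    return samples
```

Predictions then average over the retained samples instead of trusting one point estimate, which is what blunts the fold-in problem.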
Experimental comparison: Geometry Area 1996-1997 data
Geometry tutor: 139 items presented to 59 students
On average, each student tested on 60 items
Results: hold-out error
[Chart: hold-out error by method; embedding dimension for *EPCA is K = 15. Credit: Ajit Singh]
Extensions
Relational models
Temporal models
Relational models

Students × items (performance):

STUDENTS \ ITEMS   1  2  3  4  5  6
  john             1  1  0  0  1  0
  sue              0  1  1  0  0  0
  tom              1  1  0  1  1  0

Tags × items (content labels):

TAGS \ ITEMS       1  2  3  4  5  6
  trig             1  1  0  0  1  0
  story            0  1  1  0  0  0
  hard             1  1  0  1  1  0
Relational hierarchical Bayesian exponential-family PCA
Graphical model: shared priors R, S, T generate student factors U (n × k), item factors V (m × k), and tag factors Z (p × k); U and V generate the observed X, while V and Z generate the observed Y.

    X ≈ f(U Vᵀ),   Y ≈ g(V Zᵀ)

X, Y: observed data
U: student latent factors
V: item latent factors
Z: tag latent factors
R, S, T: shared priors
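A sketch of how the coupling works: gradient ascent on both likelihoods at once, with the item factors V shared between the two relations. Both link functions are taken to be logistic purely for illustration, and the hierarchical priors are collapsed into a simple ridge penalty.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_relational(X, Y, k, iters=300, lr=0.02, lam=0.1, seed=0):
    """Jointly factor X ≈ sigmoid(U V^T) (students x items) and
    Y ≈ sigmoid(V Z^T) (items x tags); sharing V ties the relations."""
    rng = np.random.default_rng(seed)
    (n, m), p = X.shape, Y.shape[1]
    U = 0.1 * rng.normal(size=(n, k))
    V = 0.1 * rng.normal(size=(m, k))
    Z = 0.1 * rng.normal(size=(p, k))
    for _ in range(iters):
        Rx = X - sigmoid(U @ V.T)                # residuals per relation
        Ry = Y - sigmoid(V @ Z.T)
        U += lr * (Rx @ V - lam * U)
        Z += lr * (Ry.T @ V - lam * Z)
        V += lr * (Rx.T @ U + Ry @ Z - lam * V)  # V hears from both relations
    return U, V, Z
```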
Example: brain imaging
2000 dictionary words
60 stimulus words
500 brain voxels
X = co-occurrence of (dictionary word, stimulus word) on web
Y = activation of voxel when presented with stimulus
Task: predict X
[Bar chart: mean squared error for EPCA, H-EPCA, and HB-EPCA, plain and relational versions. Credit: Ajit Singh]
Temporal models
So far: latent factors of students and content
e.g., knowledge components
for student: skill at KC
for problem: need for KC
e.g., student affect
But limited idea of evolution through time
e.g., fixed-structure models: proficiency = a + b x, where x = # practice opportunities, a = initial skill level, b = skill learning rate
Temporal models
For evolving factors, we expect far better results if we learn about time explicitly
learning curves, gaming state, affective state, motivational state, self-efficacy, …
[Diagram: dynamic model over transactions 1, 2, 3: latent states X1 → X2 → X3 evolve over time; each Xt emits the observed properties of transaction t (Yt), and instructional decisions Ut feed into the state]
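For concreteness, a minimal linear-Gaussian instance of the pictured structure (a linear dynamical system with inputs); every matrix and dimension below is an illustrative assumption, not a parameter from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, c, T = 3, 8, 2, 100            # latent dim, obs dim, input dim, # transactions

A = 0.9 * np.eye(k)                  # latent-state transition (e.g., learning/forgetting)
B = 0.1 * rng.normal(size=(k, c))    # effect of instructional decisions on the state
C = rng.normal(size=(d, k))          # how latent state produces observed properties

x, Y, U_in = np.zeros(k), [], []
for t in range(T):
    u = rng.normal(size=c)                        # instructional decision at step t
    x = A @ x + B @ u + 0.1 * rng.normal(size=k)  # latent state evolves
    Y.append(C @ x + 0.1 * rng.normal(size=d))    # observed transaction properties
    U_in.append(u)
Y, U_in = np.array(Y), np.array(U_in)
```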
Example: Bayesian Evaluation & Assessment [Beck et al., 2008]
[Diagram: the same latent-state, transaction-properties, and instructional-decisions structure]
The hope
Fit a temporal model
Examine learned parameters and latent states
Discover important evolving factors which affect performance
learning curve, affective state, gaming state, …
Discover how they evolve
The hope
Reduce assumptions about what the factors are
Explore a wider variety of models
Model search guided by data
⇒ discover factors we might otherwise have missed
Walking: original data
[Figure: walking data. Thanks: Byron Boots, Sajid Siddiqi]
[Diagram: same dynamic model over transactions 1, 2, 3: latent states X1 → X2 → X3; observed joint angles Yt; desired direction as the input Ut]
Walking: learned model
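The learned models in these examples come from work on learning dynamical systems (Boots, Siddiqi). As a rough, assumption-heavy illustration only (not the authors' algorithm), a subspace-style recipe: simulate a system, stack windows of observations, take an SVD to get a latent state sequence, and regress successive states to recover the dynamics.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, T, th = 3, 8, 200, 0.3                 # latent dim, obs dim, steps, rotation angle
A = 0.95 * np.array([[np.cos(th), -np.sin(th), 0.0],   # oscillatory dynamics,
                     [np.sin(th),  np.cos(th), 0.0],   # like a periodic gait
                     [0.0,         0.0,        0.8]])
C = rng.normal(size=(d, k))                  # observation matrix
x, Y = np.zeros(k), []
for _ in range(T):
    x = A @ x + 0.1 * rng.normal(size=k)
    Y.append(C @ x + 0.1 * rng.normal(size=d))
Y = np.array(Y)

# Stack length-5 observation windows (a Hankel-style matrix), SVD to get a
# k-dimensional state sequence, then regress states at t+1 on states at t.
H = np.column_stack([Y[t:t + 5].ravel() for t in range(T - 5)])
W, s, _ = np.linalg.svd(H, full_matrices=False)
S = W[:, :k].T @ H                           # estimated latent state sequence
A_hat = S[:, 1:] @ np.linalg.pinv(S[:, :-1]) # fitted transition matrix
print(np.sort(np.abs(np.linalg.eigvals(A_hat))))  # compare to true eigenvalues
print(np.sort(np.abs(np.linalg.eigvals(A))))
```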
Steam: original data
[Figure: frames of a steam video texture]
[Diagram: same dynamic model over transactions 1, 2, 3: latent states X1 → X2 → X3; observed pixels Yt; no inputs Ut]
Steam: learned model