Latent Factor Models
Geoff Gordon
Joint work w/ Ajit Singh, Byron Boots, Sajid Siddiqi, Nick Roy
Jan 22, 2016
Motivation
A key component of a cognitive tutor: student cognitive model
Tracks what skills student currently knows—latent factors
[Diagram: example skill network with latent skill nodes circle-area, rectangle-area, and decompose-area feeding an observed right-answer node]
Motivation
Student models are a key bottleneck in cognitive tutor authoring and performance
rough estimate: 20-80 hrs to hand-code model for 1 hr of content
result may be too simple, not rigorously verified
But, demonstrated improvements in learning from better models
E.g., Cen et al. [2007]: 12% less time to learn 6 geometry units (same retention) using a tutor w/ a more accurate model
This talk: automatic discovery of new models and data-driven revision of existing models via (latent) factor analysis
Simple case: snapshot, no side information
Score of student i on item j:

STUDENTS \ ITEMS   1  2  3  4  5  6  …
   A               1  1  0  0  1  0  …
   B               0  1  1  0  0  0  …
   C               1  1  0  1  1  0  …
   D               1  0  0  1  1  0  …
   …               …  …  …  …  …  …  …
Missing data

STUDENTS \ ITEMS   1  2  3  4  5  6  …
   A               1  ?  ?  ?  1  0  …
   B               0  ?  1  0  ?  ?  …
   C               1  1  ?  ?  ?  0  …
   D               1  0  0  1  ?  ?  …
   …               …  …  …  …  …  …  …
Data matrix X
Rows x1, x2, …, xn: one row per student; columns are items.
Simple case: model
Graphical model: latent factor matrices U (n students × k latent factors) and V (m items × k latent factors) are unobserved; together they generate the observed performance matrix X.
U: student latent factors
V: item latent factors
X: observed performance
Linear-Gaussian version
Same structure (student factors U, n × k; item factors V, m × k), with:
U: Gaussian (0 mean, fixed variance)
V: Gaussian (0 mean, fixed variance)
X: Gaussian (fixed variance, mean Ui ⋅ Vj for entry ij)
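To make the generative story concrete, here is a minimal numpy sketch of sampling from this model; the dimensions and noise level are illustrative assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 100, 50, 3                # students, items, latent factors (illustrative)

U = rng.normal(0.0, 1.0, (n, k))    # student factors: Gaussian, 0 mean, fixed var
V = rng.normal(0.0, 1.0, (m, k))    # item factors: Gaussian, 0 mean, fixed var
X = U @ V.T + rng.normal(0.0, 0.1, (n, m))   # observed: Gaussian around Ui . Vj
```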
Matrix form: Principal Components Analysis

    X ≈ U Vᵀ

Data matrix X (n × m) ≈ compressed matrix U (n × k, rows u1 … un) times basis matrix Vᵀ (k × m, rows v1 … vk).
PCA: the picture
PCA: matrix form
Same picture, X ≈ U Vᵀ; the columns of V span the low-rank space.
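Using the synthetic X from the sketch above, a truncated SVD recovers this rank-k factorization. The column-centering step is the usual PCA convention (the slides do not specify it), and U and V are only identified up to an invertible linear map.

```python
# Truncated SVD of the (centered) data gives X ≈ U V^T.
Xc = X - X.mean(axis=0)                  # center columns, as in standard PCA
W, s, Vt = np.linalg.svd(Xc, full_matrices=False)
U_hat = W[:, :k] * s[:k]                 # compressed matrix (basis weights)
Vt_hat = Vt[:k, :]                       # basis matrix; rows span the low-rank space
print(np.linalg.norm(Xc - U_hat @ Vt_hat) / np.linalg.norm(Xc))  # near 0
```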
Interpretation of factors
In U (rows u1 … un: students × basis weights) and Vᵀ (rows v1 … vk: basis vectors × items):
basis vectors are candidate “skills” or “knowledge components”
weights are students’ knowledge levels
PCA is a widely successful model
FACE IMAGES FROM Groundhog Day, EXTRACTED BY CAMBRIDGE FACE DB PROJECT
Data matrix: face images
Rows x1, x2, …, xn are images; columns are pixels.
Result of factoring
U (images × basis weights, rows u1 … un) times Vᵀ (basis vectors × pixels, rows v1 … vk); the basis vectors are often called “eigenfaces”.
Eigenfaces
IMAGE CREDIT: AT&T LABS CAMBRIDGE
PCA: the good
Unsupervised: need no human labels of latent state!
No worry about “expert blind spot”
Of course, labels helpful if available
Post-hoc human interpretation of latents is nice too—e.g., intervention design
PCA: the bad
Linear, Gaussian
PCA assumes E(X) = UVᵀ, i.e., linear in the factors
PCA assumes X – E(X) is i.i.d. Gaussian
Nonlinearity: conjunctive skills
[Surface plot: P(correct) vs. skill 1 and skill 2; success requires both skills]
Nonlinearity: disjunctive skills
[Surface plot: P(correct) vs. skill 1 and skill 2; either skill suffices]
Nonlinearity: “other”
[Surface plot: P(correct) vs. skill 1 and skill 2; some other nonlinear shape]
Non-Gaussianity
Typical hand-developed skill-by-item matrix:

SKILLS \ ITEMS   1  2  3  4  5  6  …
  skill 1        1  1  0  0  1  1  …
  skill 2        0  0  1  1  0  1  …
Result of Gaussian assumption
[Plots: rows of the true and recovered V matrices]
The ugly: MLE only
PCA yields maximum-likelihood estimate
Good, right?
sadly, the usual reasons to want the MLE don’t apply here
e.g., consistency: variance and bias of estimates of U and V do not approach 0 (unless #items/student and #students/item both → ∞)
Result: MLE is typically far too confident of itself
Too certain: example
[Plots: learned coefficients (e.g., a row of U) and the resulting predictions]
Result: “fold-in problem”
Nonsensical results when trying to apply learned model to a new student or item
Similar to overfitting problem in supervised learning: confident-but-wrong parameters do not generalize to new examples
Unlike overfitting, fold-in problem doesn’t necessarily go away with more data
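A hypothetical numeric illustration of the fold-in problem (not from the talk): hold the learned item factors V fixed and fit a new student's factor u by MLE from only a few graded responses. With so few observations the data are typically linearly separable, so the likelihood keeps improving as u grows, and predictions saturate near 0 or 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
k = 5
V = rng.normal(size=(40, k))             # item factors, assumed already learned
seen = np.arange(6)                      # the new student answered only 6 items
x = (rng.random(6) < 0.5).astype(float)  # their right/wrong responses

u = np.zeros(k)                          # MLE for u by gradient ascent
for _ in range(2000):
    u += 0.1 * V[seen].T @ (x - sigmoid(V[seen] @ u))

print(np.abs(u).max())                   # large, and still growing with iterations
print(sigmoid(V @ u).round(2))           # held-out predictions pinned near 0 or 1
```

A Gaussian prior on u (as in the Bayesian model later in the talk) would keep the coefficients finite; the point is that the plain MLE has nothing to stop them.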
Summary: 3 problems w/ PCA
Can’t handle nonlinearity
Can’t handle non-Gaussian distributions
Uses MLE only (⇒ fold-in problem)
Let’s look at each problem in turn
Nonlinearity
In PCA, had Xij ≈ Ui ⋅ Vj
What if
Xij ≈ exp(Ui ⋅ Vj)
Xij ≈ logistic(Ui ⋅ Vj)
…
Non-Gaussianity
In PCA, had Xij ~ Normal(μ), μ = Ui ⋅ Vj
What if
Xij ~ Poisson(μ)
Xij ~ Binomial(p)
…
Exponential family review
Exponential family of distributions:
• P(X | θ) = P0(X) exp(X⋅θ – G(θ))
G(θ) is always strictly convex, differentiable on interior of domain
means G’ is strictly monotone (strictly generalized monotone in 2D or higher)
Exponential family review
Exponential family PDF:
• P(X | θ) = P0(X) exp(X⋅θ – G(θ))
Surprising result: G’(θ) = g(θ) = E(X | θ)
g and g⁻¹ = “link function” (and its inverse)
θ = “natural parameter”
E(X | θ) = “expectation parameter”
Examples
Normal(mean): g = identity
Poisson(log rate): g = exp
Binomial(log odds): g = sigmoid
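A quick numeric sanity check of the identity g(θ) = G′(θ) = E(X | θ) for the Poisson case, where G(θ) = exp(θ) and hence g(θ) = exp(θ); the particular θ and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.3                        # natural parameter = log rate
g_theta = np.exp(theta)            # G(theta) = exp(theta), so G'(theta) = exp(theta)
samples = rng.poisson(g_theta, size=100_000)
print(g_theta, samples.mean())     # both ≈ 3.67: E(X | theta) = g(theta)
```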
Nonlinear & non-Gaussian
Let P(X | θ) be an exponential family with natural parameter θ
Predict Xij ~ P(X | θij), where θij = Ui ⋅ Vj
e.g., in Poisson, E(Xij) = exp(θij)
e.g., in Binomial, E(Xij) = sigmoid(θij)
Optimization problem

    max over U, V:  ∑ij log P(Xij | θij) + log P(U) + log P(V)
    s.t. θij = Ui ⋅ Vj

“Generalized linear” or “exponential family” PCA:
all P(…) terms are exponential families
analogy to GLMs
[Collins et al., 2001] [Gordon, 2002] [Roy & Gordon, 2005]
Special cases
PCA, probabilistic PCA
Poisson PCA
k-means clustering
Max-margin matrix factorization (MMMF)
Almost: pLSI, pHITS, NMF
Comparison to AFM
AFM (the additive factor model) predicts

    logit(pij) = θi + ∑k Qjk (βk + γk Tik)

p = probability correct
θ = student overall performance
β = skill difficulty
Q = item × skill matrix
γ = skill practice slope
T = number of practice opportunities
Theorem
• In GL PCA, finding U which maximizes likelihood (holding V fixed) is a convex optimization problem
• And, finding best V (holding U fixed) is a convex problem
• Further, Hessian is block diagonal
So, an efficient and effective optimization algorithm: alternately improve U and V
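A minimal sketch of that alternating scheme for the Bernoulli/logistic case, with simple gradient steps and a Gaussian (ridge) prior standing in for log P(U) and log P(V). This is an assumption-laden toy: the cited papers presumably use proper convex solvers per subproblem, and real data would need a mask over missing entries.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_pca(X, k, iters=500, lr=0.05, lam=0.1, seed=0):
    """Alternately improve U and V by gradient ascent on the penalized
    Bernoulli log-likelihood; each subproblem is convex per the theorem."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = 0.1 * rng.normal(size=(n, k))
    V = 0.1 * rng.normal(size=(m, k))
    for _ in range(iters):
        R = X - sigmoid(U @ V.T)         # residual in the mean parameter
        U += lr * (R @ V - lam * U)      # improve U holding V fixed
        R = X - sigmoid(U @ V.T)
        V += lr * (R.T @ U - lam * V)    # improve V holding U fixed
    return U, V

# Usage on a toy 0/1 students-by-items matrix:
X = (np.random.default_rng(1).random((59, 139)) < 0.5).astype(float)
U, V = fit_logistic_pca(X, k=5)
```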
Example: compressing histograms w/ Poisson PCA
Points: observed frequencies in ℝ³
Hidden manifold: a 1-parameter family of multinomials
[Figure: data on the simplex with vertices labeled A, B, C]
Example
[Figure sequence: the fitted 1-parameter manifold after iterations 1, 2, 3, 4, 5, and 9]
Remaining problem: MLE
Well-known rule of thumb: if MLE gets you in trouble due to overfitting, move to fully-Bayesian inference
Typical problem: computation
In our case, the computation is just fine if we’re a little clever
Additional wrinkle: switch to hierarchical model
Bayesian hierarchical exponential-family PCA
Graphical model: shared priors R (over student latents) and S (over item latents) generate U (n students × k latent factors) and V (m items × k latent factors), which generate the observed X.
U: student latent factors
V: item latent factors
X: observed performance
R: shared prior for student latents
S: shared prior for item latents
A little clever: MCMC
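The sampler itself can be sketched very simply; below is a toy random-walk Metropolis over (U, V) under a Bernoulli likelihood and Gaussian priors. The talk's actual sampler is not spelled out here, so treat this only as the flavor of the computation; a real implementation would also sample the hierarchical priors R and S.

```python
import numpy as np

def log_post(U, V, X, tau=1.0):
    """Log posterior up to a constant: Bernoulli likelihood + Gaussian priors."""
    theta = U @ V.T
    loglik = np.sum(X * theta - np.logaddexp(0.0, theta))  # stable log(1+e^theta)
    return loglik - 0.5 * tau * (np.sum(U ** 2) + np.sum(V ** 2))

def mcmc(X, k, steps=5000, step=0.02, seed=0):
    """Random-walk Metropolis over (U, V); returns thinned posterior samples."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = 0.1 * rng.normal(size=(n, k))
    V = 0.1 * rng.normal(size=(m, k))
    lp, samples = log_post(U, V, X), []
    for t in range(steps):
        U2 = U + step * rng.normal(size=U.shape)   # propose a small random move
        V2 = V + step * rng.normal(size=V.shape)
        lp2 = log_post(U2, V2, X)
        if np.log(rng.random()) < lp2 - lp:        # Metropolis accept/reject
            U, V, lp = U2, V2, lp2
        if t % 50 == 0:
            samples.append((U.copy(), V.copy()))
    return samples
```

Predictions then average over the retained samples instead of trusting one point estimate, which is what blunts the fold-in problem.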
Experimental comparison: Geometry Area 1996-1997 data
Geometry tutor: 139 items presented to 59 students
On average, each student tested on 60 items
Results: hold-out error
[Chart: hold-out error by method; embedding dimension for *EPCA is K = 15. Credit: Ajit Singh]
Extensions
Relational models
Temporal models
Relational models

Students × items (performance):

STUDENTS \ ITEMS   1  2  3  4  5  6
  john             1  1  0  0  1  0
  sue              0  1  1  0  0  0
  tom              1  1  0  1  1  0

Tags × items (content labels):

TAGS \ ITEMS       1  2  3  4  5  6
  trig             1  1  0  0  1  0
  story            0  1  1  0  0  0
  hard             1  1  0  1  1  0
Relational hierarchical Bayesian exponential-family PCA
Graphical model: shared priors R, S, T generate student factors U (n × k), item factors V (m × k), and tag factors Z (p × k); U and V generate the observed X, while V and Z generate the observed Y.

    X ≈ f(U Vᵀ),   Y ≈ g(V Zᵀ)

X, Y: observed data
U: student latent factors
V: item latent factors
Z: tag latent factors
R, S, T: shared priors
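A sketch of how the coupling works: gradient ascent on both likelihoods at once, with the item factors V shared between the two relations. Both link functions are taken to be logistic purely for illustration, and the hierarchical priors are collapsed into a simple ridge penalty.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_relational(X, Y, k, iters=300, lr=0.02, lam=0.1, seed=0):
    """Jointly factor X ≈ sigmoid(U V^T) (students x items) and
    Y ≈ sigmoid(V Z^T) (items x tags); sharing V ties the relations."""
    rng = np.random.default_rng(seed)
    (n, m), p = X.shape, Y.shape[1]
    U = 0.1 * rng.normal(size=(n, k))
    V = 0.1 * rng.normal(size=(m, k))
    Z = 0.1 * rng.normal(size=(p, k))
    for _ in range(iters):
        Rx = X - sigmoid(U @ V.T)                # residuals per relation
        Ry = Y - sigmoid(V @ Z.T)
        U += lr * (Rx @ V - lam * U)
        Z += lr * (Ry.T @ V - lam * Z)
        V += lr * (Rx.T @ U + Ry @ Z - lam * V)  # V hears from both relations
    return U, V, Z
```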
Example: brain imaging
2000 dictionary words
60 stimulus words
500 brain voxels
X = co-occurrence of (dictionary word, stimulus word) on web
Y = activation of voxel when presented with stimulus
Task: predict X
[Bar chart: mean squared error for EPCA, H-EPCA, and HB-EPCA, plain and relational versions. Credit: Ajit Singh]
Temporal models
So far: latent factors of students and content
e.g., knowledge components
for student: skill at KC
for problem: need for KC
e.g., student affect
But limited idea of evolution through time
e.g., fixed-structure models: proficiency = a + b x, where x = # practice opportunities, a = initial skill level, b = skill learning rate
Temporal models
For evolving factors, we expect far better results if we learn about time explicitly
learning curves, gaming state, affective state, motivational state, self-efficacy, …
[Diagram: dynamic model over transactions 1, 2, 3: latent states X1 → X2 → X3 evolve over time; each Xt emits the observed properties of transaction t (Yt), and instructional decisions Ut feed into the state]
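For concreteness, a minimal linear-Gaussian instance of the pictured structure (a linear dynamical system with inputs); every matrix and dimension below is an illustrative assumption, not a parameter from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, c, T = 3, 8, 2, 100            # latent dim, obs dim, input dim, # transactions

A = 0.9 * np.eye(k)                  # latent-state transition (e.g., learning/forgetting)
B = 0.1 * rng.normal(size=(k, c))    # effect of instructional decisions on the state
C = rng.normal(size=(d, k))          # how latent state produces observed properties

x, Y, U_in = np.zeros(k), [], []
for t in range(T):
    u = rng.normal(size=c)                        # instructional decision at step t
    x = A @ x + B @ u + 0.1 * rng.normal(size=k)  # latent state evolves
    Y.append(C @ x + 0.1 * rng.normal(size=d))    # observed transaction properties
    U_in.append(u)
Y, U_in = np.array(Y), np.array(U_in)
```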
Example: Bayesian Evaluation & Assessment [Beck et al., 2008]
[Diagram: the same latent-state, transaction-properties, and instructional-decisions structure]
The hope
Fit a temporal model
Examine learned parameters and latent states
Discover important evolving factors which affect performance
learning curve, affective state, gaming state, …
Discover how they evolve
The hope
Reduce assumptions about what the factors are
Explore a wider variety of models
Model search guided by data
⇒ discover factors we might otherwise have missed
Walking: original data
[Figure: walking data. Thanks: Byron Boots, Sajid Siddiqi]
[Diagram: same dynamic model over transactions 1, 2, 3: latent states X1 → X2 → X3; observed joint angles Yt; desired direction as the input Ut]
Walking: learned model
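The learned models in these examples come from work on learning dynamical systems (Boots, Siddiqi). As a rough, assumption-heavy illustration only (not the authors' algorithm), a subspace-style recipe: simulate a system, stack windows of observations, take an SVD to get a latent state sequence, and regress successive states to recover the dynamics.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, T, th = 3, 8, 200, 0.3                 # latent dim, obs dim, steps, rotation angle
A = 0.95 * np.array([[np.cos(th), -np.sin(th), 0.0],   # oscillatory dynamics,
                     [np.sin(th),  np.cos(th), 0.0],   # like a periodic gait
                     [0.0,         0.0,        0.8]])
C = rng.normal(size=(d, k))                  # observation matrix
x, Y = np.zeros(k), []
for _ in range(T):
    x = A @ x + 0.1 * rng.normal(size=k)
    Y.append(C @ x + 0.1 * rng.normal(size=d))
Y = np.array(Y)

# Stack length-5 observation windows (a Hankel-style matrix), SVD to get a
# k-dimensional state sequence, then regress states at t+1 on states at t.
H = np.column_stack([Y[t:t + 5].ravel() for t in range(T - 5)])
W, s, _ = np.linalg.svd(H, full_matrices=False)
S = W[:, :k].T @ H                           # estimated latent state sequence
A_hat = S[:, 1:] @ np.linalg.pinv(S[:, :-1]) # fitted transition matrix
print(np.sort(np.abs(np.linalg.eigvals(A_hat))))  # compare to true eigenvalues
print(np.sort(np.abs(np.linalg.eigvals(A))))
```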
Steam: original data
[Figure: frames of a steam video texture]
[Diagram: same dynamic model over transactions 1, 2, 3: latent states X1 → X2 → X3; observed pixels Yt; no inputs Ut]
Steam: learned model