Where's The Reward?
A Review of Reinforcement Learning for Instructional Sequencing

Shayan Doroudi

shayand/slides/optimizinghumanlearning...

May 30, 2020

Transcript

Page 1:

Where's The Reward?
A Review of Reinforcement Learning for Instructional Sequencing

Shayan Doroudi

Page 4:

Research Question

Over the past 50 years, how successful has RL been in discovering useful adaptive instructional policies?

Page 5:

Research Question

Under what conditions is RL most likely to be successful in advancing instructional sequencing?

Page 6:

Overview

Reinforcement Learning: Towards a “Theory of Instruction”
Part 1: Historical Perspective
Part 2: Systematic Review
Discussion: Where's the Reward?
Part 3: Case Study
Planning for the Future

Page 8:

Theory of Instruction

Atkinson (1972):

1. The possible states of nature
2. The actions that the decision maker can take to transform the state
3. The transformation of the state of nature that results from each action
4. The cost of each action
5. The return resulting from each state of nature

“The derivation of an optimal strategy requires that the instructional problem be stated in a form amenable to a decision-theoretic analysis...”

Page 15:

Markov Decision Process

A Markov Decision Process is defined as a 5-tuple (S, A, T, R, H):

1. The possible states of nature = S
2. The actions that the decision maker can take to transform the state = A
3. The transformation of the state of nature that results from each action = T(s′ ∣ s, a)
4. The cost of each action = R(a)
5. The return resulting from each state of nature = R(s)
6. The horizon, or number of time steps for which the agent takes actions = H
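The mapping above can be sketched as plain data structures. Everything below (state names, actions, probabilities, costs, and the horizon) is a hypothetical illustration for a single skill, not a model taken from the review:

```python
# A toy encoding of the 5-tuple (S, A, T, R, H) for sequencing one skill.
# All names and numbers here are illustrative assumptions.

STATES = ["unlearned", "learned"]           # 1. possible states of nature (S)
ACTIONS = ["show_example", "give_problem"]  # 2. admissible actions (A)

# 3. T(s' | s, a): distribution over next states after each action.
T = {
    ("unlearned", "show_example"): {"unlearned": 0.7, "learned": 0.3},
    ("unlearned", "give_problem"): {"unlearned": 0.8, "learned": 0.2},
    ("learned",   "show_example"): {"unlearned": 0.0, "learned": 1.0},
    ("learned",   "give_problem"): {"unlearned": 0.0, "learned": 1.0},
}

# 4. cost of each action R(a), and 5. return from each state R(s).
ACTION_COST = {"show_example": -1.0, "give_problem": -1.0}
STATE_RETURN = {"unlearned": 0.0, "learned": 10.0}

H = 5  # 6. horizon: number of instructional decisions to make

# Sanity check: every transition distribution sums to 1.
for dist in T.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```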

Page 16:

Theory of Instruction

Atkinson's (1972) “Ingredients for a Theory of Instruction”:

A model of the learning process.
Specification of admissible instructional actions.
Specification of instructional objectives.
A measurement scale that permits costs to be assigned to each of the instructional actions and payoffs to the achievement of instructional objectives.

...taken in conjunction with methods for deriving optimal strategies.

Page 19:

Reinforcement Learning (RL)

Markov Decision Process:
Set of States S
Set of Actions A
Transition Matrix T
Reward Function R
Horizon H

MDP Planning: methods for deriving optimal strategies (e.g., value iteration, policy iteration)

Reinforcement Learning: methods for deriving optimal strategies when T and R are unknown.
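Value iteration, one of the planning methods named above, can be sketched end-to-end on a toy two-state instructional MDP. The states, actions, and numbers below are hypothetical illustrations, not values from any study in the review:

```python
# Finite-horizon value iteration (backward induction) on a toy MDP.
STATES = ["unlearned", "learned"]
ACTIONS = ["show_example", "give_problem"]

# T[(s, a)] = distribution over next states s' (hypothetical numbers).
T = {
    ("unlearned", "show_example"): {"unlearned": 0.7, "learned": 0.3},
    ("unlearned", "give_problem"): {"unlearned": 0.8, "learned": 0.2},
    ("learned",   "show_example"): {"unlearned": 0.0, "learned": 1.0},
    ("learned",   "give_problem"): {"unlearned": 0.0, "learned": 1.0},
}

# Per-step reward: a payoff for being in the learned state minus a unit
# cost for every instructional action taken.
R = {(s, a): (10.0 if s == "learned" else 0.0) - 1.0
     for s in STATES for a in ACTIONS}

def value_iteration(states, actions, T, R, H):
    """Compute V_t and the optimal action at each (t, s) by backward induction."""
    V = {s: 0.0 for s in states}  # value after the final decision
    policy = {}
    for t in reversed(range(H)):
        new_V = {}
        for s in states:
            # Q(s, a) = R(s, a) + sum over s' of T(s' | s, a) * V_{t+1}(s')
            Q = {a: R[(s, a)] + sum(p * V[s2] for s2, p in T[(s, a)].items())
                 for a in actions}
            best = max(Q, key=Q.get)
            policy[(t, s)] = best
            new_V[s] = Q[best]
        V = new_V
    return V, policy

V, policy = value_iteration(STATES, ACTIONS, T, R, H=5)
# With these illustrative numbers, showing an example is the better move for
# an unlearned student, since it has the higher chance of producing learning.
```

RL proper, as the slide notes, is the harder setting where T and R must themselves be estimated (or bypassed) from interaction data before any such plan can be computed.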

Page 22:

Different RL Settings

Online RL: Learn an instructional policy as you interact with students. (Need to balance exploration and exploitation.)

vs.

Offline RL: Learn an instructional policy using prior data.

MDP: The agent knows the state of the world

vs.

Partially observable MDP (POMDP): The agent can only observe signals of the state (e.g., can see if the student responded correctly but does not know the student's cognitive state)
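The POMDP point can be made concrete with a belief update: the tutor never sees the student's cognitive state, so it tracks the probability that the skill is learned and revises it by Bayes' rule after each observed response. The slip and guess probabilities below are hypothetical:

```python
# Belief tracking over a hidden two-state skill (learned / unlearned).
# Observation model (hypothetical numbers):
P_SLIP = 0.1    # P(incorrect answer | learned)
P_GUESS = 0.25  # P(correct answer | unlearned)

def update_belief(b_learned, correct):
    """Posterior P(learned) after observing one correct/incorrect response."""
    if correct:
        num = b_learned * (1 - P_SLIP)
        den = num + (1 - b_learned) * P_GUESS
    else:
        num = b_learned * P_SLIP
        den = num + (1 - b_learned) * (1 - P_GUESS)
    return num / den

b = 0.5                              # prior: 50/50 on the skill being learned
b = update_belief(b, correct=True)   # a correct answer raises the belief
b = update_belief(b, correct=False)  # an incorrect answer lowers it again
```

A POMDP policy then chooses actions as a function of this belief rather than of the (unobservable) true state.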

Page 23:

Overview

Reinforcement Learning: Towards a “Theory of Instruction”
Part 1: Historical Perspective
Part 2: Systematic Review
Discussion: Where's the Reward?
Part 3: Case Study
Planning for the Future

Page 28:

Why History?

Who has been interested in using RL for instructional sequencing and why?
History repeats itself!
Surprising ways in which RL for instructional sequencing has impacted both the field of reinforcement learning and the field of education.
A lot of the literature does not acknowledge the history of this area.

Page 34:

First Wave: 1960s-70s

Why 1960s?
Teaching machines were popular in late 50s-early 60s.
Computers! -> Computer-Assisted Instruction
Dynamic Programming and Markov Decision Processes
Mathematical Psych: studying mathematical models of learning

Page 38:

Ronald Howard

Richard Smallwood: A Decision Structure for Teaching Machines

Edward Sondik: The Optimal Control of Partially Observable Markov Processes

“The results obtained by Smallwood [on the special case of determining optimum teaching strategies] prompted this research into the general problem.”

Page 43:

Operations Research / Engineering: Ronald Howard, Richard Smallwood, Edward Sondik, James Matheson, William Linvill

Mathematical Psychology / CAI: Richard Atkinson, Patrick Suppes

Optimum Teaching Procedures Derived from Mathematical Learning Models

Page 46:

The Dark Ages (c. 1972 - 2000s)

By 1970s - Howard, Smallwood, Matheson et al. go back to operations research (sans education)
1975 - Atkinson leaves research (for administrative positions)

Page 48:

“The mathematical techniques of optimization used in theories of instruction draw upon a wealth of results from other areas of science, especially from tools developed in mathematical economics and operations research over the past two decades, and it would be my prediction that we will see increasingly sophisticated theories of instruction in the near future.”

Suppes (1974), The Place of Theory in Educational Research, AERA Presidential Address

Atkinson (2014): “work [on MOOCs] is promising, but the key to success is individualizing instruction, and necessarily that requires a psychological theory of the learning process”

Page 54:

Second Wave: 2000s

Why 2000s?
Intelligent Tutoring Systems
Reinforcement Learning formed as a field
AIED/EDM: studying statistical models of learning

Parallels 1960s:
Teaching machines and Computer-Assisted Instruction
Dynamic Programming and Markov Decision Processes
Mathematical Psych: studying mathematical models of learning

Page 57:

Reinforcement Learning: Andrew Barto, Balaraman Ravindran

AI in Education / ITS: Beverly Woolf, Joe Beck

Page 58:

Reinforcement Learning: Emma Brunskill

AI in Education / ITS: Vincent Aleven

Shayan Doroudi

Page 63:

The Third Wave: What Lies in the Horizon

Why 2010s?
Massive Open Online Courses (MOOCs)
Deep Reinforcement Learning formed as a field
Deep Learning: building deep models of learning

35% increase in papers/books mentioning “reinforcement learning” from 2016 to 2017 (Google Scholar)

Page 66:

Three Waves: Summary

                        First Wave             Second Wave                    Third Wave
                        (1960s-70s)            (2000s-2010s)                  (2010s)

Medium of Instruction   Teaching Machines/CAI  Intelligent Tutoring Systems   Massive Open Online Courses

Optimization Models     Decision Processes     Reinforcement Learning         Deep RL

Models of Learning      Mathematical           Machine Learning (AIED/EDM)    Deep Learning
                        Psychology

More data-driven
More data-generating

Page 67:

Overview

Reinforcement Learning: Towards a “Theory of Instruction”
Part 1: Historical Perspective
Part 2: Systematic Review
Discussion: Where's the Reward?
Part 3: Case Study
Planning for the Future

Page 72:

Inclusion Criteria

We consider any papers where:

There is (implicitly) a model of the learning process, where different instructional actions probabilistically change the state of a student.
There is an instructional policy that maps past observations from a student (e.g., responses to questions) to instructional actions.
Data collected from students are used to learn either:
  the model
  an adaptive policy
If the model is learned, the instructional policy is designed to (approximately) optimize that model according to some reward function.

Page 77:

What's Not Included?

Adaptive policies that use hand-made or heuristic decision rules (rather than data-driven/optimized decision rules)
Experiments that do not control for everything other than sequence of instruction
Machine teaching experiments
Experiments that use RL for other educational purposes, such as:
  generating data-driven hints (Stamper et al., 2013) or
  giving feedback (Rafferty et al., 2015)

Review Overview

- 27 studies empirically compare an adaptive policy to a baseline
- ≥ 10 papers compare policies learned with student data in simulation
- ≥ 16 papers build policies only on simulated data
- ≥ 7 papers propose using RL for instructional sequencing
- ≥ 3 other papers with policies used on real students

Review Overview

Among papers with empirical comparisons:

- 14 found a significant difference between the adaptive policy and the baseline
- 2 found a significant aptitude-treatment interaction (the policy is significantly better for below-median learners)
- 2 found a significant difference between the adaptive policy and some but not all baselines
- 9 found no significant difference between policies

Studies by Year

Review Summary

Overview

- Reinforcement Learning: Towards a "Theory of Instruction"
- Part 1: Historical Perspective
- Part 2: Systematic Review
- Discussion: Where's the Reward?
- Part 3: Case Study
- Planning for the Future

Where's the Reward? The Pessimistic Story

Studies with a significant difference were often constrained:

- 7 of them compared only to a random policy or another RL-induced policy
- 9 of them were on paired-association tasks or concept-learning tasks, where we have a decent psychological understanding of how humans learn
- 2 of the studies (+ 2 ATI studies) sequenced activity types rather than content
- 2 of the studies did not optimize for learning
- 1 study seems to have been "lucky"

Where's the Reward? The Pessimistic Story

Among papers without a significant difference:

- Only 3 of them compared only to a random policy or another RL-induced policy
- Only 3 of them were on paired-association or concept-learning tasks
- Only 2 of them sequenced activity types rather than content

Papers that showed no significant difference were generally more complex and ambitious along a number of dimensions.

Where's the Reward? The Optimistic Story

Among papers with a significant difference:

- 9 of them used models inspired by cognitive psychology. The policies that were successful for paired-association tasks tended to use more psychologically plausible models than those that were not successful.
- Several used some sort of clever offline policy selection (e.g., importance sampling or robust evaluation)

Overview

- Reinforcement Learning: Towards a "Theory of Instruction"
- Part 1: Historical Perspective
- Part 2: Systematic Review
- Discussion: Where's the Reward?
- Part 3: Case Study
- Planning for the Future

Case Study

- Fractions Tutor
- Two experiments testing RL-induced policies (both with no significant difference)
- Off-policy policy evaluation

Fractions Tutor

Experiment 1

- Used prior data to fit the G-SCOPE model (Hallak et al., 2015).
- Used the G-SCOPE model to derive two new adaptive policies.
- Wanted to compare the adaptive policies to a baseline policy (a fixed, spiraling curriculum).
- Simulated both policies on the G-SCOPE model to predict posttest scores (out of 16 points).

Experiment 1: Policy Evaluation

                     Baseline     Adaptive Policy
Simulated Posttest   5.9 ± 0.9    9.1 ± 0.8
Actual Posttest      5.5 ± 2.6    4.9 ± 2.6

Doroudi, Aleven, and Brunskill, L@S 2017

Single Model Simulation

- Used by Chi, VanLehn, Littman, and Jordan (2011) and Rowe, Mott, and Lester (2014) in educational settings.
- Rowe, Mott, and Lester (2014): a new adaptive policy was estimated to be much better than the random policy.
- But in the experiment, no significant difference was found (Rowe and Lester, 2015).
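A minimal sketch of why single-model simulation can mislead: if a policy is evaluated inside the same (possibly wrong) model that produced it, the estimate inherits the model's optimism. Everything below is invented for illustration; it is not the G-SCOPE model or the tutors from these papers.

```python
import random

class ToyStudentModel:
    """Toy student model: practicing skill i yields mastery w.p. learn[i].
    Purely illustrative; not the models used in the reviewed studies."""
    def __init__(self, learn_probs):
        self.learn = learn_probs

    def reset(self):
        return [False] * len(self.learn)  # no skills mastered yet

    def step(self, state, action, rng):
        state = list(state)
        if rng.random() < self.learn[action]:  # practice may yield mastery
            state[action] = True
        return state

    def posttest(self, state):
        return sum(state)  # one point per mastered skill

def simulate_posttest(model, policy, n_students=2000, horizon=5, seed=0):
    """Mean posttest score of `policy` rolled out inside `model`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_students):
        s = model.reset()
        for _ in range(horizon):
            s = model.step(s, policy(s), rng)
        total += model.posttest(s)
    return total / n_students

def greedy_policy(state):
    """Practice the first unmastered skill (derived from the fitted model)."""
    return state.index(False) if False in state else 0

fitted_model = ToyStudentModel([0.9, 0.9, 0.9])  # what we estimated from data
true_model = ToyStudentModel([0.2, 0.2, 0.2])    # how students actually learn
print(simulate_posttest(fitted_model, greedy_policy))  # looks great in-model
print(simulate_posttest(true_model, greedy_policy))    # much worse in reality
```

The same rollout code produces very different verdicts depending on which model it runs in, which is exactly the gap between the estimated and experimental results above.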

Importance Sampling

- An estimator that gives unbiased and consistent estimates of a policy's value!
- Can have very high variance when the policy is different from the prior data.
- Example: worked example or problem solving?
  - 20 sequential decisions ⇒ need over 2^20 students
  - 50 sequential decisions ⇒ need over 2^50 students!
- Importance sampling can prefer the worse of two policies more often than not (Doroudi et al., 2017b).

Doroudi, Thomas, and Brunskill, UAI 2017, Best Paper
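The sample-size blowup can be sketched directly: the ordinary importance-sampling weight is a product of per-decision probability ratios π(a|s)/μ(a|s), so a deterministic target policy evaluated on data from a uniform-random policy over two actions gets a nonzero weight only when a logged trajectory matches it on all 20 decisions (probability 2^-20). A toy illustration, with invented names and an invented reward function:

```python
import random

def is_estimate(trajectories, target_prob, behavior_prob):
    """Ordinary importance-sampling (IS) off-policy value estimate.

    Each trajectory is (actions, reward); `target_prob(a)` and
    `behavior_prob(a)` give per-decision action probabilities.
    This interface is an illustrative sketch, not the paper's code.
    """
    total = 0.0
    for actions, reward in trajectories:
        weight = 1.0
        for a in actions:
            weight *= target_prob(a) / behavior_prob(a)  # pi/mu per decision
        total += weight * reward
    return total / len(trajectories)

# Log 5000 simulated students under a uniform-random behavior policy
# making 20 binary decisions (say, worked example vs. problem solving).
rng = random.Random(0)
logged = []
for _ in range(5000):
    acts = [rng.randint(0, 1) for _ in range(20)]
    logged.append((acts, sum(acts) / 20))  # toy reward favoring action 1

# Target policy: always choose action 1 (true value 1.0 under this reward).
# With 5000 students, almost surely no trajectory matches all 20 decisions,
# so the estimate is almost surely 0 -- far from the true value.
est = is_estimate(logged, lambda a: float(a == 1), lambda a: 0.5)
print(est)
```

When a match does occur, it carries a weight of 2^20, which is the high-variance behavior the slide describes: the estimator is unbiased on average, but any realistic sample lands far from the truth.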

Robust Evaluation Matrix

                   Policy 1      Policy 2      Policy 3
Student Model 1    V(SM1, P1)    V(SM1, P2)    V(SM1, P3)
Student Model 2    V(SM2, P1)    V(SM2, P2)    V(SM2, P3)
Student Model 3    V(SM3, P1)    V(SM3, P2)    V(SM3, P3)

where V(SMi, Pj) is the estimated value of policy j when simulated under student model i.
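In code, the matrix amounts to evaluating every candidate policy under every plausible student model and preferring policies that look good robustly. A hypothetical sketch (function names and the worst-case selection rule are illustrative assumptions; the point estimates reuse the simulated posttest scores reported in the deck):

```python
def robust_evaluation_matrix(models, policies, evaluate):
    """V[model][policy]: each policy's estimated value under each student model.
    `evaluate(model, policy)` returns an estimated posttest score; these names
    are illustrative placeholders, not the paper's code."""
    return {mn: {pn: evaluate(m, p) for pn, p in policies.items()}
            for mn, m in models.items()}

def robust_best(matrix):
    """Prefer the policy with the best worst-case value across models."""
    policy_names = next(iter(matrix.values())).keys()
    return max(policy_names, key=lambda p: min(matrix[m][p] for m in matrix))

# Point estimates from the deck's tables (simulated posttest out of 16):
matrix = {
    "G-SCOPE": {"Baseline": 5.9, "Adaptive": 9.1},
    "BKT": {"Baseline": 6.5, "Adaptive": 7.0},
    "DKT": {"Baseline": 9.9, "Adaptive": 8.6},
}
print(robust_best(matrix))  # "Adaptive": worst case 7.0 beats Baseline's 5.9
```

Worst-case selection is only one possible decision rule over the matrix; the broader point is that disagreement between rows flags model uncertainty that a single-model simulation hides.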

Robust Evaluation Matrix

                             Baseline     Adaptive Policy    Awesome Policy
G-SCOPE Model                5.9 ± 0.9    9.1 ± 0.8          16
Bayesian Knowledge Tracing   6.5 ± 0.8    7.0 ± 1.0          16
Deep Knowledge Tracing       9.9 ± 1.5    8.6 ± 2.1          16

Doroudi, Aleven, and Brunskill, L@S 2017

Experiment 2

- Used the Robust Evaluation Matrix to test new policies.
- Found a New Adaptive Policy that was very simple but robustly expected to do well:
  - sequence problems in increasing order of average time
  - skip any problems where students have demonstrated mastery of all skills (according to BKT)
- Ran an experiment testing the New Adaptive Policy.
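The mastery-skipping rule presupposes a running estimate of P(skill known), which is what Bayesian Knowledge Tracing maintains. A sketch of the standard BKT posterior-plus-transition update, with arbitrary illustrative parameter values (not the tutor's fitted ones):

```python
def bkt_update(p_know, correct, p_guess=0.2, p_slip=0.1, p_transit=0.15):
    """One Bayesian Knowledge Tracing step: Bayesian posterior of
    P(skill known) given the observed response, then the learning
    transition. Parameter values here are arbitrary illustrations."""
    if correct:
        evidence = p_know * (1 - p_slip)
        posterior = evidence / (evidence + (1 - p_know) * p_guess)
    else:
        evidence = p_know * p_slip
        posterior = evidence / (evidence + (1 - p_know) * (1 - p_guess))
    return posterior + (1 - posterior) * p_transit

def mastered(p_knows, threshold=0.95):
    """The policy skips a problem when every skill it exercises is mastered."""
    return all(p >= threshold for p in p_knows)

# A run of correct answers drives the mastery estimate up from the prior.
p = 0.3
for response in [True, True, True, True]:
    p = bkt_update(p, response)
print(round(p, 3))
```

With these illustrative parameters, four consecutive correct responses push the estimate past a 0.95 mastery threshold, at which point the policy would start skipping problems that exercise only this skill.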

Experiment 2

                  Baseline      New Adaptive Policy
Actual Posttest   8.12 ± 2.9    7.97 ± 2.7

Experiment 2: Insights

Even though we did robust evaluation, two things were not considered adequately:

- How long each problem takes per student
- Student population mismatch

Robust evaluation can help us identify where our models are lacking and lead to building better models over time.

Overview

- Reinforcement Learning: Towards a "Theory of Instruction"
- Part 1: Historical Perspective
- Part 2: Systematic Review
- Discussion: Where's the Reward?
- Part 3: Case Study: Fractions Tutor and Policy Selection
- Planning for the Future

Planning for the Future: Data-Driven + Theory-Driven Approach

- Reinforcement learning researchers should work with learning scientists and psychologists.
- Work on domains where we have, or can develop, decent cognitive models.
- Work in settings where the set of actions is restricted but still meaningful (e.g., worked examples vs. problem solving).
- Compare to good baselines based on the learning sciences (e.g., the expertise reversal effect).
- Do thoughtful and extensive offline evaluations.
- Iterate and replicate! Develop theories of instruction that can help us see where the reward might be.

Page 146: A Review of Reinforcement Learning for …shayand/slides/optimizinghumanlearning...A Review of Reinforcement Learning for Instructional Sequencing Shayan Doroudi 1 2 2 Over the past

Is Data-Driven Sufficient?Is Data-Driven Sufficient?Might we see a revolution in data-driven instructionalsequencing?

49

Page 147: A Review of Reinforcement Learning for …shayand/slides/optimizinghumanlearning...A Review of Reinforcement Learning for Instructional Sequencing Shayan Doroudi 1 2 2 Over the past

Is Data-Driven Sufficient?Is Data-Driven Sufficient?Might we see a revolution in data-driven instructionalsequencing?

More data

49

Page 148: A Review of Reinforcement Learning for …shayand/slides/optimizinghumanlearning...A Review of Reinforcement Learning for Instructional Sequencing Shayan Doroudi 1 2 2 Over the past

Is Data-Driven Sufficient?Is Data-Driven Sufficient?Might we see a revolution in data-driven instructionalsequencing?

More dataMore computational power

49

Page 149: A Review of Reinforcement Learning for …shayand/slides/optimizinghumanlearning...A Review of Reinforcement Learning for Instructional Sequencing Shayan Doroudi 1 2 2 Over the past

Is Data-Driven Sufficient?

Might we see a revolution in data-driven instructional sequencing?

- More data
- More computational power
- Better RL algorithms

Similar advances have recently revolutionized the fields of computer vision, natural language processing, and computational game-playing. Why not instruction?

- Learning is fundamentally different from images, language, and games.
- Baselines are much stronger for instructional sequencing.

49

So, where is the reward?

In the coming years, we will likely see both purely data-driven (deep learning) approaches and theory-plus-data-driven approaches to instructional sequencing.

Only time can tell where the reward lies, but our robust evaluation suggests combining theory and data. By reviewing the history and prior empirical literature, we can get a better sense of the terrain we are operating in.

50

So, where is the reward?

Applying RL to instructional sequencing has been rewarding in other ways:

- Advances have been made to the field of RL.
  - The Optimal Control of Partially Observable Markov Processes
  - Our work on importance sampling (Doroudi et al., 2017b)
- Advances have been made to student modeling.

By continuing to try to optimize instruction, we will likely continue to expand the frontiers of the study of human and machine learning.

51
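The importance-sampling work cited above concerns off-policy evaluation: estimating how a new instructional policy would perform using data logged under a different (behavior) policy, without running a new experiment on students. A minimal sketch of the ordinary per-trajectory importance sampling estimator, assuming trajectories are lists of (state, action, reward) tuples and policies are probability functions (the function name and data layout are illustrative, not taken from the paper):

```python
import numpy as np

def is_estimate(trajectories, target_policy, behavior_policy):
    """Ordinary per-trajectory importance sampling (IS) estimator.

    Each trajectory was collected under behavior_policy; we estimate the
    expected return target_policy would have earned on the same population.
    """
    returns = []
    for traj in trajectories:
        weight = 1.0  # cumulative likelihood ratio for this trajectory
        ret = 0.0     # undiscounted return of this trajectory
        for state, action, reward in traj:
            # Reweight by how much more (or less) likely the target policy
            # is to take this logged action than the behavior policy was.
            weight *= target_policy(state, action) / behavior_policy(state, action)
            ret += reward
        returns.append(weight * ret)
    return float(np.mean(returns))
```

The estimator is unbiased but can have very high variance when the two policies differ substantially, which is one motivation for the variance-reduction work in Doroudi et al. (2017b).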


Acknowledgements

The research reported here was supported, in whole or in part, by the Institute of Education Sciences, U.S. Department of Education, through Grants R305A130215 and R305B150008 to Carnegie Mellon University. The opinions expressed are those of the authors and do not represent views of the Institute or the U.S. Dept. of Education.

This research was done in collaboration with Vincent Aleven, Emma Brunskill, Kenneth Holstein, and Philip Thomas.

52