Learning to Act and Causality
CS4780/5780 – Machine Learning
Fall 2019
Nika Haghtalab & Thorsten Joachims
Cornell University
Reading:
G. Imbens, D. Rubin, Causal Inference for Statistics …, 2015. Chapter 1.
Interactive System Schematic
[Schematic: system π0 chooses action 𝑦 for context 𝑥, achieving utility 𝑈(𝜋0)]
News Recommender
• Context 𝑥:
– User
• Action 𝑦:
– Portfolio of news articles
• Feedback 𝛿(𝑥, 𝑦):
– Reading time in minutes
Music Voice Assistant
• Context 𝑥:
– User and speech
• Action 𝑦:
– Track that is played
• Feedback 𝛿(𝑥, 𝑦):
– Listened to the end
Search Engine
• Context 𝑥:
– Query
• Action 𝑦:
– Ranking
• Feedback 𝛿(𝑥, 𝑦):
– Click / no-click
Log Data from Interactive Systems
• Data
𝑆 = {(𝑥1, 𝑦1, 𝛿1), … , (𝑥𝑛, 𝑦𝑛, 𝛿𝑛)}
Partial Information (aka “Contextual Bandit”) Feedback
• Properties
– Contexts 𝑥𝑖 drawn i.i.d. from unknown 𝑃(𝑋)
– Actions 𝑦𝑖 selected by existing system 𝜋0: 𝑋 → 𝑌
– Feedback 𝛿𝑖 from unknown function 𝛿: 𝑋 × 𝑌 → ℜ
[Schematic: context 𝑥 → 𝜋0 → action 𝑦 → reward / loss 𝛿]
[Zadrozny et al., 2003] [Langford & Li]
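The logging setup above can be sketched on a toy problem. This is a minimal simulation, not the lecture's own code: the problem sizes, the reward table, and the 𝜋0 probabilities are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_contexts, n_actions = 5, 3                  # toy sizes (assumed)
delta = rng.random((n_contexts, n_actions))   # unknown reward function δ(x, y)

def pi0(x, rng):
    """Hypothetical logging policy π0(Y|x): stochastic, favors action 0."""
    probs = np.full(n_actions, 0.2)
    probs[0] = 1.0 - 0.2 * (n_actions - 1)
    return rng.choice(n_actions, p=probs)

# Collect log S = {(x_i, y_i, δ_i)}: contexts i.i.d., actions from π0,
# feedback only for the action actually taken (bandit feedback).
S = []
for _ in range(1000):
    x = int(rng.integers(n_contexts))   # x_i ~ P(X)
    y = int(pi0(x, rng))                # y_i ~ π0(Y|x_i)
    S.append((x, y, delta[x, y]))       # δ_i = δ(x_i, y_i)
```

Note that the log never reveals 𝛿(𝑥, 𝑦′) for actions 𝑦′ that 𝜋0 did not take, which is exactly the "partial information" property.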
Goal
Use interaction log data
𝑆 = {(𝑥1, 𝑦1, 𝛿1), … , (𝑥𝑛, 𝑦𝑛, 𝛿𝑛)}
– for evaluation of system 𝜋
• Offline estimate of online performance of some system 𝜋.
• System 𝜋 can be different from the 𝜋0 that generated the log.
– for learning a new system 𝜋
Evaluation: Outline
• Offline Evaluation of Online Metrics
– A/B Testing (on-policy)
– Counterfactual estimation from logs (off-policy)
• Approach 1: “Model the world”
– Imputation via reward prediction
• Approach 2: “Model the bias”
– Counterfactual model and selection bias
– Inverse propensity scoring (IPS) estimator
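As a preview of the IPS estimator named in the outline, here is a minimal sketch. The log format with logged propensities p0 = 𝜋0(𝑦|𝑥) is an assumption about how the data is stored; the estimator itself is the standard Û(𝜋) = (1/n) Σᵢ 𝛿ᵢ · 𝜋(𝑦ᵢ|𝑥ᵢ)/𝜋0(𝑦ᵢ|𝑥ᵢ).

```python
def ips_estimate(log, pi_new):
    """Inverse propensity scoring (IPS) estimate of U(π).

    log:     list of (x, y, delta, p0), where p0 = π0(y|x) is the
             probability the logging policy assigned to the logged action.
    pi_new:  function (y, x) -> π(y|x), the target policy's probability.
    """
    return sum(d * pi_new(y, x) / p0 for (x, y, d, p0) in log) / len(log)

# Sanity check: if π assigns the same probabilities as π0, the importance
# weights are all 1 and IPS reduces to the average logged reward.
log = [(0, 1, 0.5, 0.25), (1, 0, 1.0, 0.5)]
u_hat = ips_estimate(log, lambda y, x: {(1, 0): 0.25, (0, 1): 0.5}[(y, x)])
```

The reweighting corrects for the selection bias of 𝜋0: actions that 𝜋0 rarely took but 𝜋 often takes get upweighted, and vice versa.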
Online Performance Metrics
Example metrics:
– CTR
– Revenue
– Time-to-success
– Interleaving
– Etc.
Correct choice depends on application and is not the focus of this lecture.
This lecture: Metric encoded as δ(𝑥, 𝑦) [click / payoff / time for (x, y) pair]
System
• Definition [Deterministic Policy]:
Function 𝑦 = 𝜋(𝑥)
that picks action 𝑦 for context 𝑥.
• Definition [Stochastic Policy]:
Distribution 𝜋 𝑦 𝑥
that samples action 𝑦 given context 𝑥.
[Figure: example action distributions 𝜋1(𝑌|𝑥) and 𝜋2(𝑌|𝑥) over 𝑌|𝑥; deterministic policies π1(𝑥), π2(𝑥) place all mass on a single action]
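The two policy types can be sketched as follows. The action set and the probabilities are hypothetical; the point is only the interface: a deterministic policy is a function 𝑥 → 𝑦, a stochastic policy samples 𝑦 ~ 𝜋(𝑌|𝑥).

```python
import numpy as np

rng = np.random.default_rng(1)
actions = ["track_a", "track_b", "track_c"]   # hypothetical action set Y

def pi_det(x):
    """Deterministic policy: always returns the same y = π(x) for a context x."""
    return actions[len(x) % len(actions)]

def pi_stoch(x, rng):
    """Stochastic policy: samples y ~ π(Y|x)."""
    probs = [0.5, 0.3, 0.2]   # π(Y|x); assumed fixed across contexts here
    return rng.choice(actions, p=probs)
```

A deterministic policy is the special case of a stochastic one where 𝜋(𝑦|𝑥) = 1 for exactly one action.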
System Performance
Definition [Utility of Policy]:
The expected reward / utility U(𝜋) of policy 𝜋 is
U(𝜋) = ∫∫ 𝛿(𝑥, 𝑦) 𝜋(𝑦|𝑥) 𝑃(𝑥) 𝑑𝑥 𝑑𝑦
[Figure: action distributions 𝜋(𝑌|𝑥𝑖) and 𝜋(𝑌|𝑥𝑗) for two sampled contexts]
…e.g. reading time of user x for portfolio y
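For a discrete toy problem the integral becomes a sum, so U(𝜋) can be computed exactly and checked against a Monte Carlo average of sampled rewards. All the numbers below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy discrete setup (assumed): 2 contexts, 2 actions
P_x   = np.array([0.5, 0.5])                 # P(x)
pi    = np.array([[0.9, 0.1], [0.2, 0.8]])   # π(y|x), rows indexed by context
delta = np.array([[1.0, 0.0], [0.0, 1.0]])   # δ(x, y)

# Exact utility: U(π) = Σ_x Σ_y δ(x, y) π(y|x) P(x)
U_exact = float((P_x[:, None] * pi * delta).sum())

# Monte Carlo estimate: sample x ~ P(X), then y ~ π(Y|x), average δ(x, y)
draws = [delta[x, rng.choice(2, p=pi[x])]
         for x in rng.choice(2, p=P_x, size=20000)]
U_mc = float(np.mean(draws))
```

The Monte Carlo estimate converges to the exact sum as the number of draws grows, which is what makes on-policy evaluation (A/B testing) straightforward when one can deploy 𝜋 itself.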
Online Evaluation: A/B Testing
Given 𝑆 = 𝑥1, 𝑦1, 𝛿1 , … , 𝑥𝑛, 𝑦𝑛, 𝛿𝑛 collected under π0,