-
Causal Models for Scientific Discovery
Research Challenges and Opportunities
David Jensen
College of Information and Computer Sciences
Computational Social Science Institute
Center for Data Science
University of Massachusetts Amherst
Symposium on Accelerating Science
18 November 2016
-
Sources: The Guardian, July 2005; Wallace Kirkland, for Time
-
Sources: Wikipedia (pile); Argonne National Laboratory (Fermi)
-
Main points
• Representing and reasoning about causality is central to science and scientific discovery.
• Understanding of causal inference has advanced tremendously in the past 25 years through the work of several disparate research communities.
• Several emerging opportunities and challenges exist:
  • Expressiveness: Combining data and knowledge from multiple sources to understand complex phenomena
  • Critique: Inferring errors in modeling assumptions or problem construction
  • Empirical evaluation: Providing realistic empirical tests of methods for causal modeling
-
Causality is central to science
-
Explanation ⇒ Causality
• Explanation is a central activity in science. Effective theories explain previously unexplained phenomena.
• Effective explanations generally take the form of a counterfactual (“What would have happened if conditions had been different?”).
• “…explanatory relationships are relationships that are potentially exploitable for purposes of manipulation and control.”
-
Control & design ⇒ Causality
Sources: Wikipedia (pile)
-
Models
• Because of this, “models” in most scientific fields have causal implications (they let you infer how a system would behave under intervention).
• In contrast, most “models” in machine learning and statistics have been defined as having only associational semantics.
• This leads to substantial confusion among researchers from other fields when first encountering machine learning methods.
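The gap between associational and causal semantics can be made concrete with a toy example. The sketch below is not from the talk: the variables U (a confounder), X (a treatment), and Y (an outcome) and all probabilities are invented for illustration. It computes, by exact enumeration, both the associational contrast P(Y=1 | X=1) - P(Y=1 | X=0) and the interventional contrast P(Y=1 | do(X=1)) - P(Y=1 | do(X=0)).

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical model: U -> X, U -> Y, X -> Y (all variables binary).
P_U = {0: F(1, 2), 1: F(1, 2)}             # P(U = u)
P_X1_GIVEN_U = {0: F(1, 10), 1: F(9, 10)}  # P(X = 1 | U = u)


def p_y1(x, u):
    """P(Y = 1 | X = x, U = u): X has a small effect, U a large one."""
    return F(2, 10) + F(1, 10) * x + F(6, 10) * u


def joint(u, x, y):
    """Joint probability P(U = u, X = x, Y = y) under the model."""
    px = P_X1_GIVEN_U[u] if x == 1 else 1 - P_X1_GIVEN_U[u]
    py = p_y1(x, u) if y == 1 else 1 - p_y1(x, u)
    return P_U[u] * px * py


def p_y1_given_x(x):
    """Associational query: condition on having observed X = x."""
    num = sum(joint(u, x, 1) for u in (0, 1))
    den = sum(joint(u, x, y) for u, y in product((0, 1), repeat=2))
    return num / den


def p_y1_do_x(x):
    """Interventional query: set X = x, leaving U at its prior."""
    return sum(P_U[u] * p_y1(x, u) for u in (0, 1))


assoc_diff = p_y1_given_x(1) - p_y1_given_x(0)
causal_diff = p_y1_do_x(1) - p_y1_do_x(0)
print("associational difference:", assoc_diff)
print("interventional difference:", causal_diff)
```

In this invented model the associational contrast is far larger than the interventional one, because observing X also carries information about U. A model with only associational semantics answers the first query; a scientist asking about intervention wants the second.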
-
Progress in causal modeling
• An explicit theory of causal inference has been worked out over the past 20 years by a small group of computer scientists, philosophers, and statisticians.
• The theory uses directed graphical models to represent causal dependence among variables.
• That theory provides a formal correspondence between causal models and their observable statistical implications. This correspondence has been exploited to produce a number of algorithms for reasoning with causal graphical models (CGMs).
(Pearl 2000, 2009; Spirtes, Glymour, and Scheines 1993, 2001)
-
Key concepts
• Only statistical dependence is directly observable in data. Causal dependence is not observable.
• Statistical dependence underdetermines causal dependence (“correlation is not causation”).
• The observable statistical consequences of a given causal model can be inferred from structure (d-separation).
• Multiple causal structures produce the same observed statistical dependencies (Markov equivalence).
• However, some combinations of conditional independence and known causal dependence imply constraints on the space of causal structures, and some uniquely identify causal structures.
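As a rough illustration of d-separation and Markov equivalence, the sketch below tests d-separation with the standard ancestral moral graph construction and lists the conditional independencies implied by three small graphs. The graph encoding (a dict from child to set of parents) and the function names are choices made here for illustration, not part of the talk.

```python
from itertools import combinations


def ancestors(dag, nodes):
    """Return `nodes` plus all their ancestors; `dag` maps child -> parents."""
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(dag.get(n, ()))
    return seen


def d_separated(dag, x, y, z=()):
    """True if x and y are d-separated given z (x, y, z assumed disjoint)."""
    keep = ancestors(dag, {x, y, *z})
    # Moralize the ancestral subgraph: link each child to its parents,
    # link parents that share a child, and drop edge directions.
    adj = {n: set() for n in keep}
    for child in keep:
        parents = list(dag.get(child, ()))
        for p in parents:
            adj[child].add(p)
            adj[p].add(child)
        for p, q in combinations(parents, 2):
            adj[p].add(q)
            adj[q].add(p)
    # x and y are d-separated iff removing z disconnects them.
    blocked, seen, stack = set(z), set(), [x]
    while stack:
        n = stack.pop()
        if n == y:
            return False
        if n in seen or n in blocked:
            continue
        seen.add(n)
        stack.extend(adj[n])
    return True


def independencies(dag, nodes):
    """All (x, y, z) triples with x d-separated from y given the tuple z."""
    found = set()
    for x, y in combinations(sorted(nodes), 2):
        rest = [n for n in sorted(nodes) if n not in (x, y)]
        for k in range(len(rest) + 1):
            for z in combinations(rest, k):
                if d_separated(dag, x, y, z):
                    found.add((x, y, z))
    return found


chain = {"B": {"A"}, "C": {"B"}}   # A -> B -> C
fork = {"A": {"B"}, "C": {"B"}}    # A <- B -> C
collider = {"B": {"A", "C"}}       # A -> B <- C

for name, dag in [("chain", chain), ("fork", fork), ("collider", collider)]:
    print(name, independencies(dag, "ABC"))
```

The chain and the fork imply the same single independence (A and C given B), so observational data alone cannot tell them apart; they are Markov equivalent. The collider implies a different pattern (A and C are marginally independent), which is the kind of signature that lets some structures be identified.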
-
Expressiveness
-
Source: Honavar, Hill, & Yelick (2016), Accelerating Science: A Computing Research Agenda
-
• Manual Scientific Practice: Rarely searches large spaces of formally represented models
• Machine Learning: Rarely analyzes causal dependence
• Causal Discovery: Rarely discovers relational, temporal, or spatial models
[Figure labels: Causal Analysis; Automated Discovery; Relational, Temporal and Spatial Models]
-
Causal models of independent outcomes
[Figure: a causal process generating outcome variables A, B, ..., Z; a second build shows a causal graph over variables A through J]
-
Key assumption of simple CGMs
[Figure: a causal process generating outcome variables A, B, ..., Z; a second build, labeled “Multiple Dependent Outcomes”, marks parts of the diagram with “x” and “?”]
-
Causal models of dependent outcomes
(Friedman, Getoor, Koller, & Pfeffer 1999; Heckerman, Meek, & Koller 2007; Maier, Marazopoulou, and Jensen 2013)
[Figure: the causal graph over variables A through J, extended with further variables K through T]
-
(Maier, Marazopoulou, and Jensen 2013)
-
Causal models of general processes
[Figure: a causal process represented as a probabilistic program]
Probabilistic program:
    bool c1, c2;
    int count = 0;
    c1 = Bernoulli(0.5);
    if (c1 == true) then
        count = count + 1;
    c2 = Bernoulli(0.5);
    if (c2 == true) then
        count = count + 1;
    observe(c1 == true || c2 == true);
    return(count);
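To show what the probabilistic program on this slide denotes, here is a minimal Python sketch that computes the distribution of the returned `count` in two ways: exact enumeration over the two coin flips, and rejection sampling. It treats `observe` as discarding runs where the condition fails. The function names and the seed are choices made here, not part of the slide.

```python
import random
from fractions import Fraction
from itertools import product


def posterior_count():
    """Exact distribution of `count`, conditioned on the observe statement."""
    weights = {}
    for c1, c2 in product((False, True), repeat=2):
        if not (c1 or c2):
            continue  # observe(c1 || c2) rules this run out
        count = int(c1) + int(c2)
        # Each (c1, c2) pair has prior probability 1/4.
        weights[count] = weights.get(count, 0) + Fraction(1, 4)
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def run_once(rng):
    """One forward run of the program; None means the observe failed."""
    c1 = rng.random() < 0.5
    count = 1 if c1 else 0
    c2 = rng.random() < 0.5
    if c2:
        count += 1
    if not (c1 or c2):
        return None
    return count


def sample_posterior(n, seed=0):
    """Rejection sampling: keep only runs that satisfy the observation."""
    rng = random.Random(seed)
    kept = [c for c in (run_once(rng) for _ in range(n)) if c is not None]
    return {k: kept.count(k) / len(kept) for k in sorted(set(kept))}


print(posterior_count())
print(sample_posterior(20000))
```

Enumeration gives probability 2/3 for a count of 1 and 1/3 for a count of 2; the sampled frequencies should approximate those values.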
-
Critique
-
“[To support science, we would expect] that two different kinds of inferential process would be required to put it into effect. The first, used in estimating parameters from data conditional on the truth of some tentative model, is appropriately called Estimation. The second, used in checking whether, in the light of the data, any model of the kind proposed is plausible, has been aptly named … Criticism.”
George Box (emphasis added)
-
Example assumptions
• Faithfulness
• Causal Markov assumption
• Definitions of variables, entities, relationships, etc.
• Measurement process
• Temporal granularity of measurement
• Latent variables, entities, relationships, etc.
• Structural form of causal dependence
• Functional form of probabilistic dependence
• Compositional form
• Closed world (or form of open world)
• …and many others
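To make one of these assumptions concrete, the sketch below shows a textbook-style violation of faithfulness in a linear model: X affects Y directly and through Z, and the two paths cancel exactly. The coefficients and the helper `total_effect` are invented for illustration. In a linear model where X has no parents, unit variance, and independent noise terms, Cov(X, Y) equals the total effect computed here, so a zero total effect means X and Y look independent even though X causes Y.

```python
from fractions import Fraction


def total_effect(coef, source, target):
    """Sum, over all directed paths from source to target, of the product
    of edge coefficients. `coef` maps (parent, child) -> weight and must
    describe an acyclic graph."""
    if source == target:
        return Fraction(1)
    return sum(
        (w * total_effect(coef, child, target)
         for (parent, child), w in coef.items() if parent == source),
        Fraction(0),
    )


# Hypothetical linear models over X -> Z -> Y plus a direct edge X -> Y.
faithful = {
    ("X", "Z"): Fraction(1),
    ("Z", "Y"): Fraction(1),
    ("X", "Y"): Fraction(1, 2),
}
unfaithful = {
    ("X", "Z"): Fraction(2),
    ("Z", "Y"): Fraction(1, 2),
    ("X", "Y"): Fraction(-1),   # exactly cancels the path through Z
}

print("faithful total effect:  ", total_effect(faithful, "X", "Y"))
print("unfaithful total effect:", total_effect(unfaithful, "X", "Y"))
```

An algorithm that assumes faithfulness would read the zero dependence in the second model as the absence of any causal path from X to Y. Checking whether such an assumption holds is part of what critique would have to address.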
-
Empirical evaluation
-
Goals for Empirical Evaluation Approaches
• Empirical: A pre-existing system created by someone other than the researchers.
• Stochastic: Produces non-deterministic experimental results.
• Identifiable: Amenable to direct experimental investigation to estimate interventional distributions.
• Recoverable: Lacks memory or irreversible effects, which enables complete state recovery during experiments.
• Efficient: Generates large amounts of data with relatively few resources.
• Reproducible: Fairly easy to recreate nearly identical data sets without access to one-of-a-kind hardware or software.
-
Simple example: Database configuration
-
ML for database configuration (setup)
• Assume a fixed database and DB server hardware
• Questions
  • For a given query, what is the expected performance under each set of configuration parameters?
  • For a given query, which configuration will give me the best performance?
• Data
  • Run 11,252 queries that were actually run against the Stack Exchange Data Explorer
  • Each query is run under one of many different joint values of the configuration parameters, using Postgres 9.2.2
(Garant & Jensen 2016)
-
CGM for database configuration
[Figure: causal graphical model for database configuration, shown over several builds]
Variables: Indexing, Page Cost, Memory Level, Block Writes to RAM, Year Created, Join Count, Group-by Count, Block Reads from RAM, Runtime, Retrieved Row Count, Table Count, Total Row Count, Length, Total Queries by User, Block Reads from Disk, Block Hits in Cache
Group labels added in later builds: Database, Query, Processing, User
-
Comparing associational and causal models
• Compare a state-of-the-art associational model (a random forest) to a CGM constructed using greedy equivalence search (GES) (Chickering & Meek 2002)
• Evaluate by comparing to “ground truth” (experimental results for all queries obtained using a specific joint setting of the configuration parameters).
[Figures: results shown over several builds for three outcomes: Cache Hits, Disk Reads, Runtime]
(Garant & Jensen 2016)
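The evaluation idea on these slides, comparing a model's estimate to experimental ground truth, might be sketched as follows. Everything here is an assumption made for illustration: the slides do not state the error metric used, and the configuration names, query names, runtimes, and model estimates below are made-up toy numbers, not results from Garant & Jensen (2016).

```python
from statistics import mean

# Made-up ground truth: runtime in seconds for every query under each
# joint configuration, as would be obtained by actually running them all.
ground_truth = {
    "index_on": {"q1": 1.0, "q2": 2.0, "q3": 3.0},
    "index_off": {"q1": 2.0, "q2": 4.0, "q3": 6.0},
}


def interventional_mean(outcomes_by_query):
    """Mean outcome when every query is forced to run under one config."""
    return mean(outcomes_by_query.values())


def mean_abs_error(estimates, truth):
    """Average absolute gap between a model's estimated interventional
    mean and the experimental one, across configurations. This metric is
    an assumption chosen for the sketch."""
    return mean(
        abs(estimates[cfg] - interventional_mean(truth[cfg])) for cfg in truth
    )


# Hypothetical estimates from two models trained on observational data.
estimates_a = {"index_on": 2.5, "index_off": 3.5}
estimates_b = {"index_on": 2.0, "index_off": 4.5}

print("model A error:", mean_abs_error(estimates_a, ground_truth))
print("model B error:", mean_abs_error(estimates_b, ground_truth))
```

Because the database is fixed and each query can be rerun under any configuration, the interventional quantities that are usually unobservable can be measured directly here, which is what makes this kind of comparison possible.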
-
Main points
• Representing and reasoning about causality is central to science and scientific discovery.
• Understanding of causal inference has advanced tremendously in the past 25 years through the work of several disparate research communities.
• Several emerging opportunities and challenges exist:
  • Expressiveness: Combining data and knowledge from multiple sources to understand complex phenomena
  • Critique: Inferring errors in modeling assumptions or problem construction
  • Empirical evaluation: Providing realistic empirical tests of methods for causal modeling
-
Thanks
David Arbour: Recent developments in learning causal dependence from bivariate joint distributions in relational data (UAI & KDD 2016)
Dan Garant: Empirical evaluation of algorithms for learning causal models (UAI 2016)
Amanda Gentzel: Granger causality methods and empirical evaluation
Katerina Marazopoulou: Extending causal semantics to temporal models (UAI 2015; 2016)
Kaleigh Clary: Additive noise models for learning causal dependence from bivariate joint distributions
-
[email protected] kdl.cs.umass.edu
cs.umass.edu/~jensen/
All opinions are mine and not those of any company, agency of the US Government, or the University of Massachusetts Amherst.