By David Anderson SZTAKI (Budapest, Hungary) WPI D2009

Nov 21, 2021

Page 1:

By David Anderson SZTAKI (Budapest, Hungary) WPI D2009

Page 2:

1997, Deep Blue won against Kasparov
  Average workstation can defeat best Chess players
Computer Chess no longer “interesting”
Go is much harder for computers to play
  Branching factor is ~50-200 versus ~35 in Chess
  Positional evaluation inaccurate, expensive
  Game cannot be scored until the end
Beginners can defeat best Go programs

Page 3:

Two-player, perfect information
Players take turns placing black and white stones on grid
Board is 19x19 (13x13 or 9x9 for beginners)
Object is to surround empty space as territory
Pieces can be captured, but not moved
Winner determined by most points (territory plus captured pieces)

Page 4:

Image from http://ict.ewi.tudelft.nl/~gineke/

Page 5:

Minimax/α-β algorithms require huge trees
Tree depth cannot be cut easily
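
A rough sketch of why these trees get huge: a full-width tree with branching factor b searched to depth d has about b^d leaf positions. The function name and the sample depths below are illustrative assumptions; the branching factors are the approximate figures quoted on Page 2.

```python
def leaf_count(branching_factor: int, depth: int) -> int:
    """Number of leaf positions in a full-width game tree of the given depth."""
    return branching_factor ** depth


# Approximate branching factors quoted in the slides.
CHESS_B = 35
GO_B_LOW, GO_B_HIGH = 50, 200

for depth in (2, 4, 6):  # sample depths in plies, chosen only for illustration
    chess = leaf_count(CHESS_B, depth)
    go_low = leaf_count(GO_B_LOW, depth)
    go_high = leaf_count(GO_B_HIGH, depth)
    print(f"depth {depth}: chess ~{chess:.1e}, go ~{go_low:.1e} to ~{go_high:.1e}")
```

Alpha-beta pruning with ideal move ordering reduces the work to roughly b^(d/2), which still grows very quickly at Go's branching factor.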

Page 6:

Monte-Carlo now more popular
  Simulate random games from the game tree
  Use results to pick best move
Two areas of optimization
  Discovery of good paths in the game tree
  Intelligence of random simulations
    Random games are usually bogus
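
The basic Monte-Carlo idea can be sketched on a toy game, since a Go engine would not fit in a short example. The game here is an assumption chosen for brevity: a single pile of stones, each player removes 1 to 3, and whoever takes the last stone wins. All function names are hypothetical.

```python
import random


def legal_moves(stones: int) -> list[int]:
    """A player may remove 1, 2, or 3 stones, but not more than remain."""
    return [n for n in (1, 2, 3) if n <= stones]


def random_playout(stones: int, player_to_move: int, rng: random.Random) -> int:
    """Play uniformly random moves to the end and return the winner (0 or 1).

    Requires stones > 0. Whoever takes the last stone wins.
    """
    while True:
        stones -= rng.choice(legal_moves(stones))
        if stones == 0:
            return player_to_move
        player_to_move = 1 - player_to_move


def monte_carlo_move(stones: int, player: int, n_playouts: int, rng: random.Random):
    """Try every legal move, simulate random games after it, keep the best win rate."""
    best_move, best_rate = None, -1.0
    for move in legal_moves(stones):
        remaining = stones - move
        if remaining == 0:
            wins = n_playouts  # taking the last stone wins immediately
        else:
            wins = sum(
                random_playout(remaining, 1 - player, rng) == player
                for _ in range(n_playouts)
            )
        rate = wins / n_playouts
        if rate > best_rate:
            best_move, best_rate = move, rate
    return best_move, best_rate


if __name__ == "__main__":
    print(monte_carlo_move(3, 0, 200, random.Random(0)))
```

This flat version spends the same number of simulations on every move, good or bad. Deciding where to spend simulations is the first of the two optimization areas listed above.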

Page 7:

Need to balance between exploration…
  Discovering and simulating new paths
And exploitation…
  Simulating the most promising path
Best method is currently UCT, introduced by Levente Kocsis and Csaba Szepesvári.

Page 8:

Say you have a slot machine with a probability of giving you money. You can infer this probability through experimentation.

Page 9:

What if there are three slot machines, and each has a different probability?

Page 10:

You need to choose between experimenting (exploration) and getting the best reward (exploitation).

Page 11:

The UCB algorithm balances these two goals to minimize loss of reward.
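
A minimal sketch of the three-slot-machine problem, assuming the UCB1 variant, which scores each machine as its average payout plus an exploration bonus of sqrt(2 ln n / n_i), where n is the total number of pulls and n_i the pulls of machine i. The payout probabilities, function names, and seed are made up for illustration.

```python
import math
import random


def ucb1_choose(counts, rewards, total_pulls):
    """Return the index of the machine to pull next under UCB1."""
    # Pull every machine once before comparing scores.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm

    def score(arm):
        mean = rewards[arm] / counts[arm]                       # exploitation
        bonus = math.sqrt(2 * math.log(total_pulls) / counts[arm])  # exploration
        return mean + bonus

    return max(range(len(counts)), key=score)


def run_bandit(probabilities, n_pulls, seed=0):
    """Simulate n_pulls on machines with the given payout probabilities."""
    rng = random.Random(seed)
    counts = [0] * len(probabilities)
    rewards = [0.0] * len(probabilities)
    for pulls_so_far in range(n_pulls):
        arm = ucb1_choose(counts, rewards, pulls_so_far)
        payout = 1.0 if rng.random() < probabilities[arm] else 0.0
        counts[arm] += 1
        rewards[arm] += payout
    return counts


if __name__ == "__main__":
    # Hypothetical machines paying out 20%, 50%, and 80% of the time.
    print(run_bandit([0.2, 0.5, 0.8], 2000))
```

The bonus shrinks as a machine is pulled more often, so rarely tried machines keep getting occasional pulls while most pulls go to the machine that looks best.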

Page 12:

UCT applies UCB to games like Go, deciding which move to explore next by treating the choice as a bandit problem.

Page 13:

Starts with one-level tree of legal board moves

Page 14:

Picks best move according to UCB algorithm

Page 15:

Runs Monte-Carlo simulation, updates node’s win/loss count.

This is one iteration of the UCT process.

Page 16:

If a node gets visited enough times, start looking at its child moves

Page 17:

UCT dives deeper, each time picking the most “interesting” move.

Page 18:

Eventually, UCT has built a large tree of simulation information
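
The UCT steps on Pages 13 to 18 can be sketched on a toy game in place of Go. Everything here is an assumption made for illustration: the game (one pile of stones, remove 1 to 3, taking the last stone wins), the expansion threshold of 5 visits, the UCB1 scoring, and all names.

```python
import math
import random

EXPAND_THRESHOLD = 5  # hypothetical: visits required before a node gets children


class Node:
    def __init__(self, stones, player_to_move, move=None):
        self.stones = stones                  # stones left in the pile
        self.player_to_move = player_to_move  # 0 or 1
        self.move = move                      # move that led to this node
        self.children = []
        self.visits = 0
        self.wins = 0.0                       # wins for the player who made `move`


def legal_moves(stones):
    return [n for n in (1, 2, 3) if n <= stones]


def expand(node):
    node.children = [Node(node.stones - m, 1 - node.player_to_move, m)
                     for m in legal_moves(node.stones)]


def ucb_child(node):
    """Pick the child with the highest UCB1 score; unvisited children go first."""
    def score(child):
        if child.visits == 0:
            return math.inf
        mean = child.wins / child.visits
        return mean + math.sqrt(2 * math.log(node.visits) / child.visits)
    return max(node.children, key=score)


def playout(stones, player_to_move, rng):
    """Random game to the end; returns the winner (who takes the last stone)."""
    while True:
        stones -= rng.choice(legal_moves(stones))
        if stones == 0:
            return player_to_move
        player_to_move = 1 - player_to_move


def uct_iteration(root, rng):
    # Dive down the tree, each time picking the most "interesting" move.
    path, node = [root], root
    while node.children:
        node = ucb_child(node)
        path.append(node)
    # A leaf visited often enough gets its child moves added.
    if node.stones > 0 and node.visits >= EXPAND_THRESHOLD:
        expand(node)
        node = ucb_child(node)
        path.append(node)
    # Simulate from the leaf (terminal leaves are already decided).
    if node.stones == 0:
        winner = 1 - node.player_to_move
    else:
        winner = playout(node.stones, node.player_to_move, rng)
    # Update win/loss counts along the visited path.
    for visited in path:
        visited.visits += 1
        if winner != visited.player_to_move:
            visited.wins += 1


def uct_search(stones, n_iterations, seed=0):
    rng = random.Random(seed)
    root = Node(stones, player_to_move=0)
    expand(root)  # start with a one-level tree of legal moves
    for _ in range(n_iterations):
        uct_iteration(root, rng)
    best = max(root.children, key=lambda child: child.visits)
    return best.move, root
```

Calling `uct_search(5, 5000)` returns the most-visited first move together with the tree that was built. Choosing the final move by visit count rather than raw win rate is a common convention, used here as an assumption.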

Page 19:

UCT is now in most major competitive programs
“MoGo” used UCT to defeat a professional
  Used 800-node grid and a 9-stone handicap
Much research now focused on improving simulation intelligence

Page 20:

Policy decides which move to play next in a random game simulation
High stochasticity makes UCT less accurate
  Takes longer to converge to correct move
Too much determinism makes UCT less effective
  Defeats purpose of Monte-Carlo search
  Might introduce harmful selection bias
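
One way to picture the trade-off, assuming the policy turns pattern weights into move probabilities with a softmax: scaling the same weights up makes the policy more deterministic, which shows up as lower entropy. The weights and the `sharpness` parameter are made up for illustration.

```python
import math


def softmax(weights, sharpness=1.0):
    """Turn weights into probabilities; larger sharpness means more deterministic."""
    scaled = [w * sharpness for w in weights]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; 0 means fully deterministic."""
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    pattern_weights = [1.0, 0.0, -1.0]  # hypothetical weights for three candidate moves
    for sharpness in (0.0, 1.0, 10.0):
        probs = softmax(pattern_weights, sharpness)
        print(sharpness, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

At sharpness 0 every move is equally likely (purely random playouts); at high sharpness nearly all probability sits on one move, so repeated simulations stop adding new information.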

Page 21:

Certain shapes in Go are good
  “Hane” here is a strong attack on B
Others are quite bad!
  B’s “empty triangle” is too dense and wasteful

Page 22:

MoGo uses pattern knowledge with UCT
  Hand-crafted database of interesting 3x3 patterns
  Doubled simulation win rate according to authors
Can pattern knowledge be trained automatically via machine learning?

Page 23:

Paper “Monte-Carlo Simulation Balancing” (by David Silver and Gerald Tesauro)
  Policies accumulate error with each move
  Strong policies minimize this error, but not the whole-game error
Proposes algorithms for minimizing whole-game error with each move
Authors tested on 5x5 Go using 2x2 patterns
  Found that balancing was more effective than raw strength

Page 24:

Implemented the pattern-learning algorithms from “Monte-Carlo Simulation Balancing”
  Strength: Apprenticeship
  Strength: Policy Gradient Reinforcement
  Balance: Policy Gradient Simulation Balancing
  Balance: Two-Step Simulation Balancing
Used 9x9 Go with 3x3 patterns

Page 25:

Used amateur database of 9x9 games for training
Mention-worthy metrics:
  Simulation win rate against purely random
  UCT win rate against UCT with purely random simulations
  UCT win rate against GNU Go

Page 26:

Simplest algorithm
  Looks at every move of every game in the training set
  High preference for chosen moves
  Low preference for unchosen moves
Strongly favored good patterns
  Over-training; poor error compensation
Values diverge toward infinity
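
The preference updates described above can be sketched as a maximum-likelihood step for a softmax policy over pattern weights. This is an assumption about the form of the update, not a transcription of the presenter's code; `softmax_probs`, `apprenticeship_update`, the step size, and the pattern keys are hypothetical.

```python
import math


def softmax_probs(weights, candidate_patterns):
    """Probability of each candidate move, proportional to exp(pattern weight)."""
    exps = [math.exp(weights.get(p, 0.0)) for p in candidate_patterns]
    total = sum(exps)
    return [e / total for e in exps]


def apprenticeship_update(weights, candidate_patterns, chosen_index, step_size=0.1):
    """Raise the weight of the move the human played, lower the alternatives.

    Each weight moves by step_size * (1 if chosen else 0, minus its current
    probability), which is the log-likelihood gradient for a softmax policy.
    """
    probs = softmax_probs(weights, candidate_patterns)
    for i, pattern in enumerate(candidate_patterns):
        indicator = 1.0 if i == chosen_index else 0.0
        weights[pattern] = weights.get(pattern, 0.0) + step_size * (indicator - probs[i])


if __name__ == "__main__":
    weights = {}
    # Hypothetical training data: "hane" is always chosen over "empty_triangle".
    for step in range(1, 1001):
        apprenticeship_update(weights, ["hane", "empty_triangle"], chosen_index=0)
        if step in (10, 100, 1000):
            print(step, round(weights["hane"], 3), round(weights["empty_triangle"], 3))
```

If one pattern is always the chosen move, every update pushes its weight up and the alternatives down, so the weights never settle. That is consistent with the unbounded growth noted above.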

Page 27:

[Bar chart: Apprenticeship vs Pure Random. Y axis: Win rate (%), 0 to 80. X axis: Game type (Playout, UCT vs libEGO, UCT vs GNU Go). Series: Pure Random, Apprenticeship.]

Page 28:

Plays random games from the training set
  If the simulation matches the original game result, patterns get higher preference
  Otherwise, lower preference
Results were promising
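
A minimal sketch of this reward-driven update, assuming a softmax policy over pattern weights and a REINFORCE-style step: the patterns played in a simulation are reinforced when the simulated result matches the recorded game result and penalized otherwise. The trace format, function names, and step size are hypothetical.

```python
import math


def softmax_probs(weights, candidate_patterns):
    """Probability of each candidate move, proportional to exp(pattern weight)."""
    exps = [math.exp(weights.get(p, 0.0)) for p in candidate_patterns]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(weights, trace, simulated_result, recorded_result, step_size=0.1):
    """Update pattern weights from one simulated game.

    trace: list of (candidate_patterns, chosen_index), one entry per simulated move.
    """
    reward = 1.0 if simulated_result == recorded_result else -1.0
    # Accumulate the whole-game gradient first, using the weights as they were
    # during the simulation, then apply it in one step.
    delta = {}
    for candidate_patterns, chosen_index in trace:
        probs = softmax_probs(weights, candidate_patterns)
        for i, pattern in enumerate(candidate_patterns):
            indicator = 1.0 if i == chosen_index else 0.0
            delta[pattern] = delta.get(pattern, 0.0) + reward * (indicator - probs[i])
    for pattern, change in delta.items():
        weights[pattern] = weights.get(pattern, 0.0) + step_size * change


if __name__ == "__main__":
    weights = {}
    trace = [(["a", "b"], 0)]  # hypothetical one-move simulation that chose pattern "a"
    reinforce_update(weights, trace, simulated_result="B", recorded_result="B")
    print(weights)
```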

Page 29:

[Bar chart: Reinforcement vs Pure Random. Y axis: Win rate (%), 0 to 70. X axis: Game type (Playout, UCT vs libEGO, UCT vs GNU Go). Series: Pure Random, Reinforcement.]

Page 30:

For each training game…
  Plays random games to estimate win rate
  Plays more random games to determine which patterns win and lose
Gives preferences to patterns based on error between actual game result and observed win rate
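
A minimal sketch of the steps above, assuming a softmax policy over pattern weights and an update of the form θ ← θ + α (V* − V) g, where V* is the actual game result, V the observed win rate, and g a policy-gradient estimate of how each pattern affects winning. The toy simulation is a hypothetical stand-in for a Go playout, and all names and constants are assumptions.

```python
import math
import random


def softmax_probs(weights, patterns):
    exps = [math.exp(weights.get(p, 0.0)) for p in patterns]
    total = sum(exps)
    return [e / total for e in exps]


def toy_simulation(weights, rng, n_moves=4):
    """Hypothetical playout: each move picks "good" or "bad"; wins are more
    likely the more often "good" was picked. Returns (outcome, trace)."""
    patterns = ["good", "bad"]
    trace, good_picks = [], 0
    for _ in range(n_moves):
        probs = softmax_probs(weights, patterns)
        choice = rng.choices(range(len(patterns)), weights=probs)[0]
        trace.append((patterns, choice, probs))
        good_picks += (choice == 0)
    outcome = 1.0 if rng.random() < good_picks / n_moves else 0.0
    return outcome, trace


def balancing_update(weights, target_value, simulate, rng, m=50, n=50, step_size=0.5):
    # Step 1: play m random games to estimate the current win rate V.
    value = sum(simulate(weights, rng)[0] for _ in range(m)) / m
    # Step 2: play n more games to estimate which patterns go with winning.
    gradient = {}
    for _ in range(n):
        outcome, trace = simulate(weights, rng)
        for patterns, choice, probs in trace:
            for i, pattern in enumerate(patterns):
                indicator = 1.0 if i == choice else 0.0
                gradient[pattern] = (gradient.get(pattern, 0.0)
                                     + outcome * (indicator - probs[i]) / n)
    # Step 3: move the weights in proportion to the error (V* - V).
    error = target_value - value
    for pattern, g in gradient.items():
        weights[pattern] = weights.get(pattern, 0.0) + step_size * error * g
    return value


if __name__ == "__main__":
    rng = random.Random(0)
    weights = {}
    for _ in range(200):
        observed = balancing_update(weights, 0.8, toy_simulation, rng)
    print(weights, observed)
```

The update stops changing the weights once the simulated win rate matches the target, rather than pushing the policy to win as often as possible. That is the difference between balance and raw strength.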

Page 31:

Usually, strong local moves
  Seemed to learn good pattern distribution
Aggressively played useless moves hoping for an opponent mistake
Poor consideration of the whole board

Page 32:

Page 33:

[Bar chart: Simulation Balancing versus Pure Random. Y axis: Win rate (%), 0 to 60. X axis: Game type (Playout, UCT vs libEGO, UCT vs GNU Go). Series: Pure Random, Simulation Balancing.]

Page 34:

Picks random game states
  Computes score estimate of every move at 2-ply depth
Updates pattern preferences based on these results, using actual game result to compensate for error

Page 35:

Game score is hard to estimate, usually inaccurate
Extremely expensive; 10-30 sec to estimate score
Game score doesn’t change meaningfully for many moves
Probably does not scale as board size grows

Page 36:

[Bar chart: Two Step Balancing vs Pure Random. Y axis: Win rate (%), 0 to 70. X axis: Game type (Playout, UCT vs libEGO, UCT vs GNU Go). Series: Pure Random, Two Step Balancing.]

Page 37:

Page 38:

[Bar chart: Algorithm Results. Y axis: Win rate (%), 0 to 80. X axis: Game type (Playout, UCT vs libEGO, UCT vs GNU Go). Series: Pure Random, Apprenticeship, Reinforcement, Simulation Balancing, Two Step Balancing.]

Page 39:

Reinforcement strongest
All algorithms capable of very deterministic policies
  Policies with higher playout win rates were too deterministic and thus usually bad with UCT
Go may be too complex for these algorithms
  Optimizing self-play doesn’t guarantee good moves

Page 40:

Levente Kocsis
SZTAKI
Professors Sárközy and Selkow

Page 41:

Page 42:

Algorithm generates list of patterns
Each pattern has a weight/value
Policy looks at open positions on the board
  Gets the pattern at each open position
  Uses weights as a probability distribution
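
The policy described above can be sketched as follows. The board encoding, the 3x3 pattern key, and the use of exp(weight) to turn weights into probabilities are assumptions made for illustration. Go legality rules (suicide, ko) are ignored, and the sketch assumes at least one open position.

```python
import math
import random

EMPTY, BLACK, WHITE, EDGE = ".", "B", "W", "#"


def pattern_at(board, row, col):
    """Return the 3x3 neighborhood around (row, col) as a string key.

    Points beyond the board edge are encoded as EDGE.
    """
    size = len(board)
    cells = []
    for r in range(row - 1, row + 2):
        for c in range(col - 1, col + 2):
            inside = 0 <= r < size and 0 <= c < size
            cells.append(board[r][c] if inside else EDGE)
    return "".join(cells)


def move_distribution(board, weights):
    """Look at every open position and turn its pattern weight into a probability."""
    moves, scores = [], []
    for r, line in enumerate(board):
        for c, cell in enumerate(line):
            if cell == EMPTY:
                moves.append((r, c))
                # Unknown patterns get weight 0, i.e. a neutral score of 1.
                scores.append(math.exp(weights.get(pattern_at(board, r, c), 0.0)))
    total = sum(scores)
    return moves, [s / total for s in scores]


def sample_move(board, weights, rng):
    """Draw one move for a simulation according to the pattern distribution."""
    moves, probs = move_distribution(board, weights)
    return rng.choices(moves, weights=probs)[0]


if __name__ == "__main__":
    board = ["B.", ".W"]  # tiny hypothetical 2x2 board
    weights = {pattern_at(board, 0, 1): math.log(3)}  # made-up learned weight
    print(move_distribution(board, weights))
    print(sample_move(board, weights, random.Random(0)))
```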