
UTeach CS Principles - Teach Global Impact...Data Analysis Data Mining ... The Texas Chainsaw Massacre is on a list that mostly contains titles such as Teletubbies, Barney and Friends,

Jul 09, 2020


dariahiddleston

UTeach CS Principles Unit 5: Big Data

UNIT TOPIC: Data Analysis Data Mining

You will investigate the use of data mining in the discovery of patterns in large data sets.

You will apply association rule mining to discover knowledge in data sets.

UTeach Computer Science - http://uteachcs.org © 2016 The University of Texas at Austin


Data Mining

Traditional ore mining begins with an exploration (prospecting) of a resource pool (stone), and proceeds to determining if usable resources exist (ore) and to what degree. Prospectors basically have an idea of what they are looking for, and they run small tests to see if they are correct. Sometimes they strike gold, other times they strike out. Like these physical mines that bring us everything from coal to diamonds, we have a new type of mining: data mining.

Data mining is the discovery of patterns in large data sets. Like ore mining, data mining begins with an exploration (analysis) of a resource pool (data), and proceeds to determine whether usable resources exist (correlations) and to what degree (how strong they are). Not all data miners "strike it rich." Like ore mining, data mining can turn up no useful patterns at all. However, like ore mining, data mining sometimes leads to a bonanza of useful information.

In data mining, the emphasis is on the discovery of new knowledge. Data miners want to find new patterns that were previously unobserved. They use statistical analysis of big data to discover what the human eye can't see, just as an ore miner might use a pick, dynamite, or a lab test to uncover ore that was not visible to the naked eye before. This is a form of exploratory data analysis rather than statistical hypothesis testing.

Data Mining Strategies

Data mining involves six common classes of tasks, listed below, along with examples of how these strategies can be used in recommender systems, such as those used by Netflix, Pandora, Amazon, http://www.whatshouldireadnext.com/, and many other content providers. In each of the descriptions below, a Netflix-related example of its usage is given:

Anomaly detection (outlier/change/deviation detection): The identification of unusual data records that might be interesting, or might simply be data errors that require further investigation.

Movie X is unlike any of the other movies in User Y's data set. Remove it from our calculations. (Example: The Texas Chainsaw Massacre is on a list that mostly contains titles such as Teletubbies, Barney and Friends, and Clifford.)

Association rule learning (dependency modeling): Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.

Recommender systems: Users who like Movie X tend to also like Movie Y.


Clustering: The task of discovering groups and structures in the data that are in some way or another "similar," without using known structures in the data.

Dynamically grouped movie categories: "Romantic Comedies in Paris starring former professional football players."

Classification: The task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam."

Movie X is a romantic comedy.

Regression: Attempts to find a function that models the data with the least error.

Type X users typically increase their movie consumption rate by four movies per year.

Summarization: Providing a more compact representation of the data set, including visualization and report generation.

What type of movie does User X typically like? (i.e., sum up User X's preferences in Y words)

These strategies all have different purposes, are sometimes more effective on certain data sets and less on others, and oftentimes work best in conjunction with one another. Therefore, there is no one "best" way to perform data mining. Data miners use multiple strategies to uncover patterns and discover new knowledge.
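As one illustration of the strategies above, the anomaly detection example can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the audience labels, the `find_outliers` helper, and the `min_share` threshold are all made up for illustration and are not how any real recommender system is known to work.

```python
from collections import Counter

# Hypothetical viewing history: (title, audience label).
# The labels are invented for this example.
history = [
    ("Teletubbies", "preschool"),
    ("Barney and Friends", "preschool"),
    ("Clifford", "preschool"),
    ("The Texas Chainsaw Massacre", "horror"),
]


def find_outliers(records, min_share=0.3):
    """Return titles whose label makes up less than min_share of the list."""
    counts = Counter(label for _, label in records)
    total = len(records)
    return [title for title, label in records
            if counts[label] / total < min_share]


outliers = find_outliers(history)
print(outliers)
```

A flagged title is not necessarily an error. As the description above says, it may simply be an interesting record that deserves a closer look.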

Common misconception: Data mining is often confused with Artificial Intelligence (AI).

Data mining is actually an application of techniques commonly associated with AI. "Machine learning" and "decision support" are standard AI techniques, but when we apply them to "knowledge discovery in databases," we refer to them collectively simply as "tools for data mining."

How much power lies in data mining? Read the following article to see "How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did."


Association Rule Mining

Companies Know What You Buy

French toast is one of America's favorite breakfast foods. It's delicious and can be easily prepared at home using a variety of techniques and toppings. Even though it can be prepared a number of ways, almost all French toast recipes call for at least three things:

1. bread
2. milk
3. eggs

If you're going to make French toast, you're going to need bread, you're going to need milk, and you're going to need eggs. What does French toast have to do with big data?

Association Rule Mining

An association rule is a link between one set of items and another. Specifically, association rules identify instances in which the appearance of one set of items (the antecedent) implies that another set of items (the consequent) will also appear.

For example:

{X, Y} ⇒ {Z}

This rule can be read as, “If the antecedents (X and Y) appear then it is likely that the consequent (Z) will also appear.”
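One common way to put a number on "it is likely" is with two measures that are usually called support and confidence. Those names do not appear in this handout; they are the standard terms in association rule mining. The sketch below assumes a tiny, made-up list of transactions, and the helper names `support` and `confidence` are chosen here for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    ante = support(transactions, antecedent)
    both = support(transactions, set(antecedent) | set(consequent))
    return both / ante if ante else 0.0


# Made-up transactions over the placeholder items X, Y, and Z.
transactions = [
    {"X", "Y", "Z"},
    {"X", "Y", "Z"},
    {"X", "Y"},
    {"X", "Z"},
    {"Y"},
]

rule_confidence = confidence(transactions, {"X", "Y"}, {"Z"})
print(round(rule_confidence, 2))
```

Here three of the five transactions contain both X and Y, and two of those three also contain Z, so the rule {X, Y} ⇒ {Z} holds about two thirds of the time in this toy data.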

By using association rules, we can group items together logically and attempt to make predictions. By tracking each transaction, tabulating the items it contains, and then discovering which pairs (or larger groups) of columns often correlate with one another, we can generate association rules that capture these correlations in the data. This applies to French toast preparation.
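The tabulating step can be sketched as a simple count of how often each pair of items shows up on the same receipt. The receipts below are invented for illustration, and counting pairs is only one possible way to do the tabulation; larger groups could be counted the same way by changing the group size passed to `combinations`.

```python
from collections import Counter
from itertools import combinations

# Made-up receipts; each inner list is one transaction.
receipts = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs", "bread", "syrup"],
    ["cereal", "milk"],
]

pair_counts = Counter()
for receipt in receipts:
    # Sort the distinct items so that (bread, milk) and (milk, bread)
    # are counted as the same pair.
    for pair in combinations(sorted(set(receipt)), 2):
        pair_counts[pair] += 1

# Show the most frequently co-occurring pairs first.
for pair, count in pair_counts.most_common(3):
    print(pair, count)
```

Pairs with high counts are candidates for association rules; pairs that appear only once are probably coincidences in a data set this small.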

For example:

If most people who buy milk, bread, and eggs also buy maple syrup, then association rule mining might turn up the following rule:

{milk, bread, eggs} ⇒ {syrup}
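A rule like this one could be discovered rather than guessed. The following sketch starts from a known antecedent and looks for items that appear on "most" of the matching receipts. The receipts, the `suggest_consequents` helper, and the 60% cutoff standing in for "most" are all assumptions made for illustration.

```python
from collections import Counter

# Made-up receipts, each stored as a set of items.
receipts = [
    {"milk", "bread", "eggs", "syrup"},
    {"milk", "bread", "eggs", "syrup", "butter"},
    {"milk", "bread", "eggs"},
    {"milk", "bread", "eggs", "syrup"},
    {"milk", "cereal"},
    {"bread", "butter"},
]


def suggest_consequents(receipts, antecedent, min_confidence=0.6):
    """Map each extra item to the share of antecedent-containing receipts
    it appears on, keeping only shares of at least min_confidence."""
    antecedent = set(antecedent)
    matching = [r for r in receipts if antecedent <= r]
    if not matching:
        return {}
    extras = Counter(item for r in matching for item in r - antecedent)
    return {item: n / len(matching)
            for item, n in extras.items()
            if n / len(matching) >= min_confidence}


rules = suggest_consequents(receipts, {"milk", "bread", "eggs"})
print(rules)
```

In this toy data, syrup appears on three of the four receipts that contain milk, bread, and eggs, so it passes the cutoff, while butter appears on only one and is dropped.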

A retailer such as Walmart can now target store patrons who purchase milk, bread, and eggs to gently suggest that they might also like to buy syrup. The computerized storefront (or physical storefront with a layout determined by computational data mining) does not know that these patrons may be making French toast; it has merely developed association rules to guide product placement. Association rule mining is basically the process behind "How Target Figured Out a Teen Girl was Pregnant..."

Instructions

Your group has been hired by DataMarket, a corporation seeking to open a new chain of stores in your region. Their goal is to provide customers with optimal arrangements of store products, in an attempt to minimize the time and effort required to shop.

You will design a mock store product placement scheme, driven by data collected from competitors’ stores in the area. Use the receipts provided by your teacher (1) to generate association rules that map potentially correlated products, and then (2) to sketch an end cap for data-driven product placement targeting potential shoppers in the area.

As you extract data from the receipts, consider the following guiding questions:

1. What is the best way to use the provided table to organize your data collection?
2. What trends do you find in the data?
3. Are there any negative associations between products?
4. What is the ideal size for sets of antecedents/consequents?
5. What additional information might be helpful?
6. Can you imagine scenarios in which sets of products are grouped together?
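For the question about negative associations, one measure sometimes used is lift, which compares how often two items appear together with how often they would appear together if purchases were unrelated. Lift is not mentioned in this handout; it is offered here only as one possible approach. The receipts and the `lift` helper below are made up for illustration.

```python
def lift(receipts, item_a, item_b):
    """Ratio of observed co-occurrence to the co-occurrence expected if the
    two items were bought independently. Returns None if either item
    never appears."""
    n = len(receipts)
    p_a = sum(item_a in r for r in receipts) / n
    p_b = sum(item_b in r for r in receipts) / n
    p_ab = sum(item_a in r and item_b in r for r in receipts) / n
    if p_a == 0 or p_b == 0:
        return None
    return p_ab / (p_a * p_b)


# Made-up receipts, each stored as a set of items.
receipts = [
    {"coffee", "creamer"},
    {"coffee", "creamer"},
    {"tea", "honey"},
    {"tea", "honey"},
    {"coffee", "tea"},
    {"coffee"},
]

coffee_creamer = lift(receipts, "coffee", "creamer")
coffee_tea = lift(receipts, "coffee", "tea")
print(coffee_creamer, coffee_tea)
```

A lift above 1 suggests the items are bought together more often than chance would predict, and a lift below 1 suggests a negative association. With only a handful of receipts, either result could easily be a coincidence.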
