Introduction to Large Databases & Data Mining
Tips for Assembling Your Data Analysis Toolbox for the 22nd Century
10/05/12 Jim Heasley, Institute for Astronomy


Outline I

• Relational Databases & BIG DATA
  – Big data volumes require a new data handling paradigm
  – Advantages of a relational database
    • Organization of data
    • Data integrity
    • SQL: Structured (and almost standard) query language for queries
  – What a database is not.


Outline II

• Data mining
  – What is it?
  – Common data mining tasks
  – (FREE) Tools available to you to perform many of these tasks.


Outline III

• Examples: Imagined & Real
  – If we only had time travel…
  – Things one might start to do with Pan-STARRS data (right now).


RELATIONAL DATABASES


Basic Definitions

• Database:
  – A collection of related data organized to provide information.
• Data:
  – Known facts that can be recorded and have an implicit meaning.
  – Often integrated from several sources.
  – Stored in a standard format for use by multiple applications.
• Database Management System (DBMS):
  – A software package/system to facilitate the creation and maintenance of a computerized database.
• Database System:
  – The DBMS software together with the data itself and the hardware upon which it runs. Sometimes, the applications are also included.


Two approaches

– Generally, there are two approaches to extract information from data:
  • file processing approach
    – file-based software programs
  • database approach
    – DBMS


File processing approach

– Issues:
  • data redundancy
  • redundant processes/interfaces
  • data integrity
    – ease of maintenance
    – consistency
  • security
    – preservation (valuable company asset)
    – access control

• Each application program has a specific purpose
• Each program uses its own data

[Diagram: application programs 1 … n, each holding its own data and instructions]


Motivation for databases

– Data is a very important asset of an organization
– Motivations for databases
  • to maintain data independent from application programs
  • to avoid:
    – redundant data
    – redundant processes/interfaces
  • to enable:
    – ease of maintenance
    – sharing of data
    – data access control


Database approach

– DBMS: a “general purpose” software
  • is self-describing
  • contains
    – data
    – metadata (i.e. data about data)

[Diagram: application programs 1 … n, each holding only instructions, connected to a DBMS that holds the data and metadata]


Main Characteristics of the Database Approach

• Self-describing nature of a database system:
  – A DBMS catalog stores the description of a particular database (e.g. data structures, types, and constraints)
• Insulation between programs and data:
  – Called program-data independence.
• Data Abstraction:
  – A data model is used to hide storage details and present the users with a conceptual view of the database.
• Support of multiple views of the data:
  – Each user may see a different view of the database, which describes only the data of interest to that user.
• Concurrent Executions
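The self-describing catalog can be seen directly in a small database. The sketch below is illustrative only: it assumes SQLite (through Python's built-in sqlite3 module) as the DBMS, and the table and column names (objects, obj_id, ra_deg, dec_deg) are hypothetical. Other DBMSs expose their catalogs through different system tables.

```python
import sqlite3

# In-memory SQLite database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects ("
    " obj_id INTEGER PRIMARY KEY,"
    " ra_deg REAL NOT NULL,"
    " dec_deg REAL NOT NULL)"
)

# SQLite keeps its catalog in the sqlite_master table: one row per table,
# index, or view, including the SQL text that defined it.
catalog = conn.execute(
    "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
).fetchall()

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, default_value, pk).
columns = conn.execute("PRAGMA table_info(objects)").fetchall()
column_names = [row[1] for row in columns]
column_types = [row[2] for row in columns]

print(catalog)
print(list(zip(column_names, column_types)))
```

An application can therefore discover the structure of the data by asking the database, rather than having that structure hard-coded into each program.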


Characteristics of DBMS

– Data is:
  • integrated, shared, persistent
  • self-describing
– Abstraction
  • program and data independence
– Multiple views of the data
  • different users need different kinds of information


Advantages of Using the Database Approach

• Controlling redundancy
  – Sharing of data among multiple users.
• Restricting unauthorized access to data.
• Providing persistent storage for program objects
• Providing storage structures (e.g. indexes) for efficient query processing
• Backup and recovery services.
• Multiple interfaces to different classes of users.
• Complex relationships among data.
• Integrity constraints.
• Drawing inferences and actions from the stored data using deductive and active rules


Additional advantages of the database approach

– Re-use of data across multiple applications
– Data structure and access can be changed without changing applications
– Enforcement of standards and computation of statistics
– Improved responsiveness, productivity


Additional Implications of Using the Database Approach

• Potential for enforcing standards
• Reduced application development time
• Flexibility to change data structures
• Availability of current information
  – Extremely important for on-line transaction systems such as airline, hotel, car reservations.
• Economies of scale


Disadvantages of the database approach

– Complexity
– Size (of software and application)
– Cost
– Performance
– Risk of (spectacular!) failures


When not to use a DBMS

• Main inhibitors (costs) of using a DBMS:
  – High initial investment and possible need for additional hardware.
  – Overhead for providing generality, security, concurrency control, recovery, and integrity functions.
• When a DBMS may be unnecessary:
  – If the database and applications are simple, well defined, and not expected to change.
  – If access to data by multiple users is not required.
• When no DBMS may suffice:
  – If the database system is not able to handle the complexity of data because of modeling limitations
  – If the database users need special operations not supported by the DBMS.


Database Logic

• Operations within the database are governed by standard set theory and logic. New types of databases that are built upon fuzzy sets, fuzzy logic, and fuzzy measure are currently the subject of active research, but are not (as yet) widely available.
• The two key set operations of interest in databases are INTERSECTION (the JOIN) and UNION (called the same in the DB world).
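A small sketch may make the two operations concrete. It assumes SQLite via Python's sqlite3 module, and the two tables (survey_a, survey_b) and their contents are invented for illustration. The JOIN keeps only the objects whose key appears in both tables, which is the sense in which it acts as an intersection, while the UNION collects the keys found in either.

```python
import sqlite3

# Two hypothetical catalogs that share an object identifier.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE survey_a (obj_id INTEGER PRIMARY KEY, g_mag REAL);
CREATE TABLE survey_b (obj_id INTEGER PRIMARY KEY, r_mag REAL);
INSERT INTO survey_a VALUES (1, 18.2), (2, 19.5), (3, 20.1);
INSERT INTO survey_b VALUES (2, 19.0), (3, 19.8), (4, 21.3);
""")

# JOIN: rows are produced only for obj_id values present in both tables.
joined = conn.execute(
    "SELECT a.obj_id, a.g_mag, b.r_mag "
    "FROM survey_a AS a JOIN survey_b AS b ON a.obj_id = b.obj_id "
    "ORDER BY a.obj_id"
).fetchall()

# UNION: every obj_id that appears in either table, duplicates removed.
all_ids = conn.execute(
    "SELECT obj_id FROM survey_a "
    "UNION SELECT obj_id FROM survey_b "
    "ORDER BY obj_id"
).fetchall()

print(joined)   # objects 2 and 3, with columns from both tables
print(all_ids)  # objects 1 through 4
```

Note that a JOIN returns more than a pure set intersection would: it combines the columns of the matching rows from both tables.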


Structured Query Language

• The user usually interacts with the database by expressing what she/he wants to accomplish as a request in SQL. Note that SQL tells the database what you want to do, but not how to do it.
• There are many helpful tutorials about SQL available on the web. An excellent introduction is available at
  www2.aao.gov.au/2dfgrs/Public/Release/Database/sql_intro.pdf
• This introduction is sufficiently vanilla that it will get you started despite the minor variations between different flavors of SQL.


The Schema

• The logical schema defines how attributes are assigned to the various tables and defines the keys (indexes) that help to tie tables together. A user must have an understanding of the logical schema.
• The physical schema defines how the data tables are stored on the physical storage media (e.g., disks). Generally, users do not need to know the physical schema, although the system developers must leverage it to maximize the performance of their system.


User Queries

• Users develop queries to the database in a declarative query language, usually some form of SQL. A query builds a request for information stored in the database's tables, often making use of internal relationships inherent in the data (e.g., intersections between different tables).


The SQL Select Command

• The most frequently used SQL command (by the typical users) is the SELECT command. This is used to get (i.e. select) data from the database tables.
• The basic syntax of the SELECT command is

  SELECT (list of attributes you want)
  FROM (list of tables containing them)
  WHERE (list of limiting/restricting conditions)
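As a minimal sketch of this syntax in use, the example below runs one SELECT against an in-memory SQLite database through Python's sqlite3 module. The table (stars), its columns, and its three rows are made up for illustration.

```python
import sqlite3

# Hypothetical table of stars with positions and V magnitudes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stars (name TEXT, ra_deg REAL, dec_deg REAL, v_mag REAL)"
)
conn.executemany(
    "INSERT INTO stars VALUES (?, ?, ?, ?)",
    [
        ("A", 10.0, -5.0, 6.1),
        ("B", 10.5, -4.8, 12.3),
        ("C", 200.0, 30.0, 9.7),
    ],
)

# SELECT (attributes) FROM (table) WHERE (restricting condition).
# The ? placeholder is filled from the parameter tuple.
bright = conn.execute(
    "SELECT name, v_mag FROM stars WHERE v_mag < ? ORDER BY v_mag",
    (10.0,),
).fetchall()

print(bright)  # the two stars brighter than V = 10
```

The query states which rows and columns are wanted; how the database scans the table or uses an index to find them is left to the DBMS.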


What a Database isn’t!

While the column arrangement of attributes in database tables might remind the user of a spreadsheet program like Excel, a database is not a computing engine. Further, because of the nature of SQL, the user’s query simply defines what data is wanted, not how to get it. That also includes how the database may choose to execute numerical operations the user embeds in the query.


DATA MINING: CONFLUENCE OF MULTIPLE DISCIPLINES

[Diagram: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]


The purpose of computing is insight, not numbers.

Richard Hamming, in the preface to his 1962 text on numerical methods.


What is Data Mining?

• Finding (meaningful) patterns in data
  – Classification
  – Association Rules
  – Cluster Analysis
  – Anomaly Detection
  – Regression
• Data mining tools have been used extensively in
  – Biology, genetics, medical research (Bioinformatics)
  – Business and Economics
  – Ecology and resource management
  – Engineering
  – Literature
  – Music
  – Voice and facial recognition


Don’t Re-invent the Wheel!


Relationship between Databases & Data Mining

• Databases are often a key component in data mining. One often finds data warehouses providing the information needed by the mining tools.
• However, one usually finds that the actual data mining operations are executed outside the database itself. Databases are excellent information servers but are not good compute engines!


Classification: Definition

• Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
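The train/test procedure described above can be sketched in a few lines of plain Python. Everything specific here is an assumption made for illustration: the records are synthetic two-attribute points, the class labels "star" and "galaxy" are just names, the split is 70/30, and the model is a one-nearest-neighbor rule chosen because it is short, not because it is the recommended classifier.

```python
import math
import random


def nearest_neighbor_predict(train, point):
    """Return the class of the training record closest to `point`."""
    closest = min(train, key=lambda rec: math.dist(rec[0], point))
    return closest[1]


# Synthetic records of the form (attributes, class). The seed keeps the
# example repeatable; the two classes are drawn from well-separated blobs.
rng = random.Random(0)
records = []
for _ in range(50):
    records.append(((rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)), "star"))
    records.append(((rng.gauss(5.0, 0.5), rng.gauss(5.0, 0.5)), "galaxy"))
rng.shuffle(records)

# Divide the data: the training set builds the model, the test set validates it.
split = int(0.7 * len(records))
train_set, test_set = records[:split], records[split:]

# Accuracy is the fraction of previously unseen records classified correctly.
correct = sum(
    nearest_neighbor_predict(train_set, attrs) == label
    for attrs, label in test_set
)
accuracy = correct / len(test_set)
print(f"test accuracy: {accuracy:.2f}")
```

The important point is that the test records play no part in building the model, so the measured accuracy estimates how the model will do on new data.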


Association Rule Mining

• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

[Table: Market-Basket transactions]

Example of Association Rules
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
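Rules like these are usually ranked by two standard measures that the slide does not spell out: support (the fraction of transactions containing all the items in the rule) and confidence (of the transactions containing the left-hand side, the fraction that also contain the right-hand side). The sketch below computes both. The five transactions are illustrative stand-ins, since the slide's transaction table did not survive in this transcript.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(transactions, itemset):
    """Fraction of transactions that contain `itemset`."""
    return count(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions with `antecedent` that also have `consequent`."""
    both = set(antecedent) | set(consequent)
    return count(transactions, both) / count(transactions, antecedent)


# Illustrative market-basket data (not taken from the slide).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

rule_support = support(transactions, {"Diaper", "Beer"})
rule_confidence = confidence(transactions, {"Diaper"}, {"Beer"})
print(rule_support, rule_confidence)
```

For this data, {Diaper} → {Beer} holds in three of the five transactions, and in three of the four transactions that contain Diaper. Neither number says anything about why the items occur together.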


What is Cluster Analysis?

• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
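The two quantities named in the figure can be computed for any proposed grouping. The sketch below is not a clustering algorithm. It only scores a given assignment of points to clusters, using invented 2-D points and two hand-made labelings, to show that a sensible grouping has small intra-cluster and large inter-cluster distances.

```python
import math
from itertools import combinations


def mean_pairwise_distances(points, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance)."""
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        # Pairs with the same label are intra-cluster, the rest inter-cluster.
        (intra if lp == lq else inter).append(math.dist(p, q))
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Two visually obvious clumps of hypothetical points.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
]
good_labels = [0, 0, 0, 1, 1, 1]  # one label per clump
bad_labels = [0, 1, 0, 1, 0, 1]   # labels that mix the clumps

good_intra, good_inter = mean_pairwise_distances(points, good_labels)
bad_intra, bad_inter = mean_pairwise_distances(points, bad_labels)
print(good_intra, good_inter)
print(bad_intra, bad_inter)
```

Clustering algorithms differ mainly in how they search for the labeling that scores well on a criterion of this kind.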


Anomaly/Outlier Detection

• What are anomalies/outliers?
  – The set of data points that are considerably different from the remainder of the data
• Variants of Anomaly/Outlier Detection Problems
  – Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t
  – Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x)
  – Given a database D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
• Applications:
  – Credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection
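The first two variants can be sketched with a very simple anomaly score. The choice of f(x) here, the absolute deviation from the mean in units of the standard deviation, is an assumption made for illustration. Real applications often need more robust scores. The data values are invented.

```python
import statistics


def anomaly_scores(values):
    """Score each value by its distance from the mean in standard deviations."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return [abs(v - mean) / stdev for v in values]


def above_threshold(values, t):
    """Variant 1: all points whose anomaly score exceeds the threshold t."""
    return [v for v, s in zip(values, anomaly_scores(values)) if s > t]


def top_n(values, n):
    """Variant 2: the n points with the largest anomaly scores."""
    ranked = sorted(zip(anomaly_scores(values), values), reverse=True)
    return [v for _, v in ranked[:n]]


# Hypothetical measurements with one obviously discrepant value.
data = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 25.0, 10.0, 9.7, 10.1]

flagged = above_threshold(data, 2.0)
worst_two = top_n(data, 2)
print(flagged)
print(worst_two)
```

The threshold variant may return nothing at all, whereas the top-n variant always returns n points whether or not they are truly anomalous.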


Regression (Prediction)

Regression is the process of finding a function that describes data classes for the purpose of being able to predict discrete numerical data values. Numerous approaches for developing the desired function exist, including classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks. Prediction also encompasses the identification of distribution trends based on the available data.

Both classification and prediction may need to be preceded by relevance analysis, which attempts to identify those attributes or features that do not contribute to the classification or prediction process. These attributes can then be excluded from the analysis. A common relevance analysis technique is principal component analysis.
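As a minimal sketch of the "mathematical formulae" approach, the example below fits a straight line to data by ordinary least squares and then uses the fitted function to predict a value at a new point. The data are invented and lie exactly on a line so the result is easy to check by eye. Real data would scatter around the fit.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)

# Use the fitted function to predict the value at an unseen x.
predicted = slope * 5.0 + intercept
print(slope, intercept, predicted)
```

The other approaches listed above (decision trees, neural networks, and so on) replace the straight line with a more flexible function but follow the same fit-then-predict pattern.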


Machine Learning


Data Mining Environments

There are a large number of data mining software packages available, both commercial and open source. A search of the internet can quickly identify these. A comprehensive review of these packages is far beyond the scope of what we can deal with in this talk, so I will restrict my comments here to several well-known packages used for data analysis and mining: the R statistical analysis package, Matlab (and the open source work-alike Octave), and data mining packages Weka and Scikits.Learn.


• The R Project for Statistical Computing
  www.r-project.org/
• R, also called GNU S, is a strongly functional language and environment for statistically exploring data sets and making many graphical displays of data. It has very strong statistical tools.
• The basic system has been greatly expanded by the addition of packages developed by its user community


Matlab (Octave)

• MATLAB, a commercial product from MathWorks, is a high-level technical computing language and interactive environment for algorithm development, data visualization, data analysis, and numerical modeling.
  http://www.mathworks.com/products/matlab/
• GNU Octave is a high-level interpreted language, primarily intended for numerical computations. It is an open source work-alike version of MATLAB.
  http://www.gnu.org/software/octave/


Weka (Waikato Environment for Knowledge Analysis) is a well-known suite of machine learning software that supports several typical data mining tasks, particularly data preprocessing, clustering, classification, regression, visualization, and feature selection. Its techniques are based on the hypothesis that the data is available as a single flat file or relation, where each data point is labeled by a fixed number of attributes. Weka provides access to SQL databases utilizing Java Database Connectivity and can process the result returned by a database query. Its main user interface is the Explorer, but the same functionality can be accessed from the command line or through the component-based Knowledge Flow interface.

http://www.cs.waikato.ac.nz/~ml/weka/


scikit-learn is a Python module integrating classic machine learning algorithms in the tightly-knit scientific Python world (numpy, scipy, matplotlib). It aims to provide simple and efficient solutions to learning problems, accessible to everybody and reusable in various contexts: machine-learning as a versatile tool for science and engineering.

Tools are available for supervised & unsupervised learning, model selection, data sets, feature extraction.

http://scikit-learn.org/stable/


Pluses, Minuses, Observations

The R and Weka software both have large communities which contribute to extending their functionality through the development of new add-on packages. Further, R and Weka can be interfaced via the RWeka package. There are many excellent on-line tutorials for these packages, and Weka itself is well described in the text Data Mining: Practical Machine Learning Tools and Techniques by Witten, Frank, & Hall. This text provides both a good underpinning of the methods and practical tutorial information. (The text is available as an e-book.)

Scikits.learn, while still fairly new (current release is version 0.7), has a very impressive collection of tools and an extensive user guide. The software is written in Python. My main reservation about this software is that while the user guide presents many examples, there is an implicit assumption that the user knows a great deal about the field of data mining. This may leave new users somewhat in over their heads in trying to determine exactly which tool best serves their need.


EXAMPLES – IMAGINARY & REAL


How could we have helped this lady?


Or these gentlemen?


Or him?


Pan-STARRS Opportunities

• The PS1 Small Area Survey (SAS), covering an area of 81 deg², overlaps with the SDSS Stripe 82. In addition to the deep Stripe 82 database, the images from this region have been examined by the Citizen Science team known as the Galaxy Zoo. This interesting overlap of resources provides data for some exciting data mining experiments.
• Star-Galaxy classification (or more precisely, Star-Galaxy-QSO classification) is an on-going challenge for the PS1 science teams. While this work has been reasonably successful, the efforts thus far seem to have attempted to get by with the simplest possible classification approach. What might happen if we performed a classification exercise wherein we use a wide range of IPP measurements (e.g., psf, Kron, Petrosian magnitude, Petrosian radii, various moments measured in individual frames and stack) with SDSS and Galaxy Zoo data providing classification “truth”?
• A similar analysis, using visual inspection of the images to identify artifacts in the PS1 images and/or stacks, might provide a robust garbage rejection process. Not necessarily glamorous but definitely important.


Empirical Photo-Z Methods

• Artificial Neural Networks
• Support Vector Machines
• Self-Organizing Maps
• Gaussian Process Regression
• Kernel Regression
• Linear/Nonlinear polynomial fitting
• Instance Based Learning & Nearest Neighbors
• Boosted Decision Trees
• Regression Trees

And these are just the ones I’ve found so far!


Galaxy Clusters?

• We all know the best way to identify clusters of galaxies is from their x-ray emission. Unfortunately, current x-ray surveys don’t provide sufficient sky & depth coverage to do this.
• Optical surveys have sufficient depth but suffer from background issues, overlapping foreground & background clusters, etc.
• It has long been hoped that in large scale optical surveys such as Pan-STARRS and LSST, we will be able to use Photo-Z values to sort out real clusters from accidental clustering of galaxies, and overlapping clusters at different distances. (Some of the PS1 partners in Taiwan are working on this problem.)


Galaxy Clusters – Can Data Mining Help?

• While there is a plethora of data mining techniques for finding clusters within data, most are probably not well suited for finding galaxy clusters. Many methods start off by assuming that one knows how many clusters are present in a given region. Clearly this is not the case with our problem. Further, we need to deal with the fact that in the 3-D representation, we have much larger uncertainty along the line of sight due to the accuracy of the Photo-Z measures.
• Some interesting work in this area has made use of a friends-of-friends approach. I think this could be generalized to include better background discrimination including the Photo-Z distribution.
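To make the friends-of-friends idea concrete, here is a deliberately simplified sketch, not the method of any particular paper. Two galaxies are "friends" if they are close both on the sky and along the line of sight, and a group is any set of galaxies connected through chains of friends. The separate, larger linking length along the line of sight is one simple way to allow for the Photo-Z uncertainty. The coordinates, units, and linking lengths are all hypothetical, and the sky separation is treated as flat for brevity.

```python
def friends_of_friends(points, link_sky, link_los):
    """Group points connected by chains of close neighbors.

    points:   list of (x, y, z), with z along the line of sight
    link_sky: linking length in the plane of the sky
    link_los: (larger) linking length along the line of sight
    Returns a sorted list of groups, each a list of point indices.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the group's root, shortening the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            xi, yi, zi = points[i]
            xj, yj, zj = points[j]
            sky_sep = ((xi - xj) ** 2 + (yi - yj) ** 2) ** 0.5
            if sky_sep <= link_sky and abs(zi - zj) <= link_los:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical galaxies: a chain of three, a background interloper that
# overlaps them on the sky, and a separate pair.
galaxies = [
    (0.0, 0.0, 100.0), (0.5, 0.0, 103.0), (1.0, 0.0, 98.0),
    (0.2, 0.1, 300.0),
    (5.0, 5.0, 100.0), (5.4, 5.0, 104.0),
]
groups = friends_of_friends(galaxies, link_sky=0.6, link_los=10.0)
print(groups)
```

The number of groups is an output rather than an input, which addresses the first objection above. Galaxies 0 and 2 end up in the same group even though they are not direct friends, because each is a friend of galaxy 1. The interloper at index 3 is excluded by its line-of-sight distance alone.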


PAU
