Page 1:

SQL Text Mining

Vik Singh

w/ Jim Gray, Mark Manasse (BARC - eScience)

MSR SVC Mountain View

8/23/2006

Page 2:

Road Map

• Motivations
• Methodology & Algorithms
• Experiments & Analysis
• Application
• Future Work


Page 3:

Motivations (1): SDSS SkyServer

• Multi-TB Astronomy Archive
• 5 Yrs, 178M Hits, 77M Views, 1M Unique IPs, 20M SQL Queries (9.8M Unique)
• SQL access to telescope data
• Wish to categorize users' SQL queries
  – Schema redesign, caching, query recommendation, popular templates, segment users

Chart: Web and SQL Traffic, Hits per month on a log scale (1.E+04 to 1.E+07) over 61 months, showing Web, SQL, and an exponential fit to the Web series.

Chart: SQL Traffic (Rows/Month), log scale (1.E+06 to 1.E+10) over 43 months.

Page 4:

(2): Not much prior work

• Could not find research showing how to characterize SQL

• But many sites and databases maintain query logs

• Fortunately there is related work – NLP, IR, Machine Learning


Page 5:

Methodology

• Use unsupervised learning (K-Means)
  – Cluster centers give us query templates
  – Cluster sizes tell us popularity
• Do term analysis over these segments
  – More interesting than total aggregate term stats
  – Can isolate types of users (Bots v. Mortals)


Page 6:

K-Means Algorithm

def KMeans(k, docs):
    clusters = InitialClusterCenters(k, docs)
    while True:
        change = AssignToClusters(docs, clusters)
        if change:
            clusters = RecomputeClusterCenters(k, docs)
        else:
            break
    return docs, clusters


Page 7:

3 Key Factors to Clustering

1. Distance function
2. Choosing K
  – We’ll undershoot K
  – Then break these clusters into ‘tighter’ ones
3. Choosing the initial centroids
  • Not covered in this talk, but we have some cool tricks
  • For now assume traditional approaches (BuckShot, Random)


Page 8:

1st Step: Distance Function

• How syntactically similar are two queries?
  – Not necessarily functionally similar
• SQL term order
  – Column fields and WHERE conditionals can be reordered & represent the same query
  – Solution: Compare token combinations
    • N-Grams (or Shingles)

Page 9:

We also need to clean the SQL

• ‘Templatize’ SQL queries
  – Remove unnecessary, uncommon features
  • 24 REGEX cleaners and substitutions
  • Ex. Substitute ‘STRING’, ‘NUM’, ‘COMPARE’
• Goal here is to MAXIMIZE similarity (a cleaning sketch follows below)
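The deck does not list the actual 24 regex cleaners, so the following is a minimal sketch of the kind of substitutions described (STRING, NUM, COMPARE, plus a LOGIC placeholder seen in the cleaned examples later). The patterns are assumptions, not the project's code.

import re

# Illustrative cleaning substitutions; the real deck uses 24 of them.
CLEANERS = [
    (re.compile(r"'[^']*'"), " string "),                 # quoted literals -> STRING
    (re.compile(r"\b\d+(\.\d+)?\b"), " num "),            # numeric literals -> NUM
    (re.compile(r"==|<=|>=|<>|!=|=|<|>"), " compare "),   # comparison operators -> COMPARE
    (re.compile(r"\b(and|or)\b"), " logic "),             # boolean connectives -> LOGIC
    (re.compile(r"\bdbo\."), " "),                        # drop schema prefixes
    (re.compile(r"[(),\[\]]"), " "),                      # drop punctuation
]

def templatize(sql):
    s = sql.lower()
    for pattern, repl in CLEANERS:
        s = pattern.sub(repl, s)
    return " ".join(s.split())                            # collapse whitespace

# templatize("SELECT TOP 10 ra, dec FROM SpecObj WHERE z > 0.1 AND plate = 266")
# -> 'select top num ra dec from specobj where z compare num logic plate compare num'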


Page 10:

Before & After

SELECT s.ra, s.dec FROM #upload u, SpecObj s WHERE u.up_plate=s.plate and u.up_mjd=s.mjd and u.up_fiber=s.fiberid

select p.objID, rc.name as rc3_name, s.name as stetson_name, p.ra, p.dec, ph.name as type, p.u, p.g, p.r, p.i, p.z, o.distance from (((PhotoPrimary p inner join PhotoType ph on p.type = ph.value) left join RC3 rc on p.objid = rc.objid) left join Stetson s on p.objid = s.objid), dbo.fGetNearbyObjEq(180,-0.5,3) o where o.objid = p.objid and p.type = ph.value order by o.distance

select ra dec from temp specobj where up_plate compare plate logic up_mjd compare mjd logic up_fiber compare fiberid

select objid name name ra dec name u g r i z distance from photoprimary inner join phototype on type compare value left join rc3 on objid compare objid left join stetson on objid compare objid fgetnearbyobjeq where objid compare objid logic type compare value orderby distance


Page 11:

Feature Vector

• What sized N-Grams should we use?
  – Unigrams & Bigrams most common
  – Any more than Tri-Grams usually results in worse clusters
• But this assumes unstructured text
  – We have a highly constrained language
• And we want to capture Joins

select objid name name ra dec name u g r i z distance from photoprimary inner join phototype on type compare value

  – (Need at least size 8-grams here)
• At the same time we want good results too – consistent with the literature
  – So bias smaller grams since they are more likely to occur

Page 12:

Feature Strategy – ‘Use ‘Em All’

• Generate all 1 … 8 sized N-grams for a query
• Sort the tokens within an N-gram
• Why?
  – Increases similarity matches, decreases # N-grams (better for memory)
  – SQL is highly constrained – it is unlikely that reordered terms represent a different style of query
  – Larger N-gram matches are unlikely; similarity should still be rewarded when terms fall within the same N-distance neighborhood
• Jaccard’s Similarity Measure
  • |Intersection(Q1_n, Q2_n)| / |Union(Q1_n, Q2_n)|
• Compute the Jaccard for each N-gram set separately, then take a weighted Fibonacci mean favoring smaller grams (a sketch follows below)
• Since there can be duplicate terms, we append each N-gram with a rolling index
  – ‘Multi-Set Jaccard’
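The deck does not include the feature or distance code; the following is a minimal sketch of the strategy described above: sorted, index-appended N-grams for sizes 1–8, a Jaccard per size, and a reverse-Fibonacci weighted mean favoring the smaller grams (the toy example on the next slide scales three sizes by 3, 2, 1; for eight sizes this weighting puts ~78% on the first three, matching the figure quoted later in the deck). Names and details are illustrative.

from collections import Counter

def fib_weights(k):
    # Reverse Fibonacci weights: the smallest grams get the largest weights
    # (for k = 8 this is [21, 13, 8, 5, 3, 2, 1, 1]).
    fibs = [1, 1]
    while len(fibs) < k:
        fibs.append(fibs[-1] + fibs[-2])
    return list(reversed(fibs[:k]))

def ngram_sets(template, max_n=8):
    """Sorted, index-appended N-grams ('multi-set' shingles) for sizes 1..max_n."""
    tokens = template.split()
    all_sets = []
    for n in range(1, max_n + 1):
        seen, grams = Counter(), set()
        for i in range(len(tokens) - n + 1):
            gram = "_".join(sorted(tokens[i:i + n]))   # token order inside a gram is ignored
            seen[gram] += 1
            grams.add("%s#%d" % (gram, seen[gram]))    # rolling index keeps duplicates distinct
        all_sets.append(grams)
    return all_sets

def similarity(q1, q2, max_n=8):
    """Fibonacci-weighted mean of the per-size Jaccard similarities (the deck's 'distance')."""
    s1, s2 = ngram_sets(q1, max_n), ngram_sets(q2, max_n)
    num = den = 0.0
    for w, a, b in zip(fib_weights(max_n), s1, s2):
        if a or b:                                     # skip sizes neither query reaches
            num += w * len(a & b) / float(len(a | b))
            den += w
    return num / den if den else 0.0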


Page 13:

Ex. Distance between 2 Queries

• A: ‘select objid ra dec r z from galaxy specobj’
• B: ‘select objid ra dec from galaxy fgetnearbyobjeq where objid compare objid’
• For simplicity, just up to size 3
• Not Shown
  – Sort tokens within an N-Gram
    • “ra_dec_from” => “dec_from_ra”
  – Index repeated N-grams within same set size
    • Ex. objid_1, objid_2

Query A
  Unigrams: select, objid, ra, dec, r, z, from, galaxy, specobj
  Bigrams: select_objid, objid_ra, ra_dec, dec_r, r_z, z_from, from_galaxy, galaxy_specobj
  Trigrams: select_objid_ra, objid_ra_dec, ra_dec_r, dec_r_z, r_z_from, z_from_galaxy

Query B
  Unigrams: select, objid, ra, dec, from, galaxy, fgetnearbyobjeq, where, objid, compare, objid
  Bigrams: select_objid, objid_ra, ra_dec, dec_from, from_galaxy, fgetnearbyobjeq_where, where_objid, objid_compare, compare_objid
  Trigrams: select_objid_ra, objid_ra_dec, ra_dec_from, dec_from_galaxy, from_galaxy_fgetnearbyobjeq, galaxy_fgetnearbyobjeq_where, fgetnearbyobjeq_where_objid, where_objid_compare, objid_compare_objid

Jaccard
  Unigrams: 6/14 = 0.43
  Bigrams: 4/13 = 0.31
  Trigrams: 2/12 = 0.17

Distance = (3*(0.43) + 2*(0.31) + 1*(0.17)) / 6 = 0.35


Page 14:

But … this still won’t scale for 20M

• We’re producing 1 … 8 grams each
• Producing close to 1000+ grams per query
  – This is like full-scale document clustering!
• 8 Jaccards over strings in each comparison step
• Piping results in and out of SQL
• O(N^2) clustering algorithm
• Only have 3 months & a single machine …


Page 15:

But there are only 80K templates!

• First, we only need to cluster the distinct queries (9.8M)
• Second, we found MANY queries reduce down to a common template after cleaning
• Number of unique query templates ~ 80K
  – Over 99.5% reduction!
• Filter out the queries which result in errors
  – Brings it down to ~77K
• Clustering these ~77K templates is equivalent to clustering all 20M!
  – We maintain the query-to-template mappings and book-keep template counts (a bookkeeping sketch follows below)
  – Just factor the counts at the end to scale the solution for 20M
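A minimal sketch of the bookkeeping described above, assuming a templatize() cleaner like the one sketched earlier: group raw queries by cleaned template, cluster only the distinct templates, then scale cluster sizes back up by the stored counts. Names are illustrative.

from collections import Counter, defaultdict

def build_templates(raw_queries, templatize):
    """Map every raw query to its cleaned template and keep per-template counts."""
    counts = Counter()
    examples = defaultdict(list)            # template -> sample raw queries
    for q in raw_queries:
        t = templatize(q)
        counts[t] += 1
        examples[t].append(q)
    return counts, examples                 # ~77K distinct templates stand in for 20M queries

def weighted_cluster_sizes(cluster_assignments, counts):
    """Scale each cluster by the number of raw queries its templates represent."""
    sizes = Counter()
    for template, cluster_id in cluster_assignments.items():
        sizes[cluster_id] += counts[template]
    return sizes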

Page 16:

Let’s Cluster

• IronPython + SQL Server 2005
  – Didn't use the Data Mining package in SQL Server
    • Its clustering is optimized for Euclidean distances
• K = 24 (based on previous work at JHU)
• Then search clusters for tighter groups (see the sketch after this list)
  – Within 70% similarity and of size >= 10
• Total clusters: 194
• Computation Time: ~22 Hrs
• 15 Iterations
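The deck's own splitting procedure (the 'Finesse Algorithm') is left to a later TODO slide, so the following is only one plausible way to pull tighter groups out of a K-Means cluster using the 70% similarity and size >= 10 parameters quoted above; it is an illustrative stand-in, not the authors' method.

def tighter_groups(cluster, similarity, threshold=0.70, min_size=10):
    """Grow a group around a seed from the members within `threshold` similarity of it;
    keep groups that reach `min_size`, return the rest as leftovers."""
    remaining, groups, leftovers = list(cluster), [], []
    while remaining:
        seed = remaining.pop(0)
        group, rest = [seed], []
        for q in remaining:
            (group if similarity(seed, q) >= threshold else rest).append(q)
        remaining = rest
        (groups if len(group) >= min_size else leftovers).append(group)
    return groups, leftovers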


Page 17:

Ex. A Popular Cluster Found

Query Template / Count

select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic r compare num logic num 3100261

select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic z compare num logic num 1

select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic z compare num logic num 30

select top num objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic type compare num 1

select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic r compare num logic num 17

select cast objid ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic i compare num logic num logic type compare num 6

select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic u compare num logic num 6

select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic g compare num logic num 93

select cast objid type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic i compare num logic num logic type compare num 4

select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic u compare num logic num 71

select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic g compare num logic num 10

select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic u compare num logic num 9

select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic i compare num logic num 3

select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic i compare num logic num 20

select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic z compare num logic num 1

select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic u compare num logic num 1

select top num run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic r compare num logic num 2

select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic r compare num logic num 101

select cast objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic i compare num logic num 1

select top num run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic g compare num logic num 3

select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic type compare num 9

select top num objid run rerun camcol field obj type ra dec u g r i z colc rowc from fgetnearbyobjeq photoprimary where objid compare objid logic r compare num logic type compare num 5

• Represents 18% of all the SQL queries (3,100,655 hits)
• Sum of Square Errors / Cluster Size = 3.58006287097031E-06
• 0.061 variance omitting top template from cluster
• Tiny variance among queries


Page 18:

And since we collect N-grams …

• … We can find the common phrases within a cluster
• Ex. Top 15 Most Popular Eight-Grams


Page 19:

Two Example Applications

1. Bot Detection

2. Query Recommendation


Page 20:

App 1: Bot Detection

• Belief: Little or no variance and large cluster sizes correspond to Bot agents (a flagging sketch follows below)
• Can isolate bot from human SQL traffic
  – Useful since we don’t have the user-agent strings in the SQL logs
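A direct, hedged reading of the belief above: clusters that are both very large and show little or no variance among their member templates get flagged as likely bot traffic. The thresholds below are illustrative; the deck does not give specific cutoffs.

def flag_bot_clusters(clusters, min_size=100000, max_variance=0.01):
    """Flag clusters whose size and (lack of) variance suggest automated agents.
    Each cluster is assumed to carry precomputed 'size' and 'variance' fields."""
    return [c for c in clusters
            if c["size"] >= min_size and c["variance"] <= max_variance]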

Page 21:

App 1: Bot Detection

Query Template / Count

select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic r compare num logic num 3100261

select top num objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid 2732991

select count * from photoprimary where htmid compare num logic htmid compare num 2188913

select rowc_g colc_g from photoprimary where objid compare num 900055

select rowc_r colc_r from photoprimary where objid compare num 814884

select objid distance ra dec fphototypen type psfmag_u psfmagerr_u psfmag_g psfmagerr_g psfmag_r psfmagerr_r psfmag_i psfmagerr_i psfmag_z psfmagerr_z dered_u dered_g dered_r dered_i dered_z isnull z num specclass objtypename from photoobjall join fgetnearbyobjeq on objid compare objid left outer join specobj on bestobjid compare objid orderby distance 805539

select objid type flags ra dec r petromag_r isoa_r isob_r isophi_r isophierr_r distance from galaxy fgetnearbyobjeq where objid compare objid 570028

select top num weblink cast objid weblink cast objid weblink run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid 448722

select top num ra dec u g r i z petror50_r petror90_r expphi_r expab_r nchild flags lnlexp_r lnldev_r from fgetnearbyobjeq galaxy where objid compare objid logic petromag_z compare num logic num 307124

select extinction_u extinction_g extinction_r extinction_i extinction_z from photoobjall where objid compare num 262222

• Significant # of EXACT query template matches

• Top 10 Query Templates


Page 22:

Check the logs for the top template

IP Organization DNS State Country WebHits SqlHits PageViews
130.167.109.1 National Aeronautics and Space Administration RA.STSCI.EDU AL US 6948077 3114078 4662910

• Sample of SQL log records matching the template

• Even have the same # of whitespaces & case
• Just different numbers being passed into the TVFs
• Steady rate: 2x an hour for weeks at a time
• Product usage pattern (All DR3, then all DR2)
• All from the same user IP (NASA)
• 2nd highest # of queries on our top orgs list
• Smells like a bot

130.167.109.1 BESTDR3
130.167.130.1 BESTDR2
(the same two log records repeated over and over in the sample)


Page 23:

App 2: Query Recommendation

• Since we have a distance function … we can even return similar queries

• New users, students, visitors wishing to query the sky

• But there’s a learning curve
  – SQL (writing 3-way spatial query joins is not easy)
  – Schemas
  – Optimizing & Debugging
• Because queries are quite repetitive, why not suggest known correct ones to the user
  – Spelling Correction!

Page 24:

App 2: Example

Bad User Query:

SELECT TOP 10 ph.ra,ph.dec,str(ph.g - ph.r,11 ??) as color,ISNULL(s.bestObjId, 0) as bestObjId, 'ugri' FROM #x x, #upload up, BESTDR2..PhotoObjAll as ph LEFT OUTER JOIN ?? SpecObjAll s ON ph.objID = s.bestObjID WHERE (ph.type=3 OR ??) AND up.up_id = x.up_id ?? x.objID=p ??.objID ORDER BY x.up_id

=> Its ‘Cleaned’ representation:

select top num ra dec str isnull bestobjid from temp temp photoobjall left outer join specobjall on objid compare bestobjid where type compare num logic logic up_id compare up_id objid compare objid orderby up_id


Page 25:

App 2: Similar Query Results

Template / Ex. Query / Similarity

select top num ra dec str g r isnull bestobjid from temp temp photoobjall left outer join specobjall on objid compare bestobjid where type compare num logic type compare num logic up_id compare up_id logic objid compare objid orderby up_id

SELECT TOP 50 p.ra,p.dec,str(p.g - p.r,11,8) as grModelColor,ISNULL(s.bestObjID,0) as bestObjID, 'ugri' as filter FROM #x x, #upload u, BESTDR2..PhotoObjAll as p LEFT OUTER JOIN BESTDR2..SpecObjAll s ON p.objID = s.bestObjID WHERE ( p.type = 3 OR p.type = 6) AND u.up_id = x.up_id AND x.objID=p.objID ORDER BY x.up_id

74%

select top num ra dec ra dec colc rowc isnull z from temp temp photoobjall left outer join specobjall on objid compare bestobjid where type compare num logic type compare num logic up_id compare up_id logic objid compare objid orderby up_id

SELECT TOP 50 p.ra,p.dec,p.ra,p.[dec],p.colc,p.rowc,ISNULL(s.z,0) as z, 'g' as filter FROM #x x, #upload u, BESTDR2..PhotoObjAll as p LEFT OUTER JOIN BESTDR2..SpecObjAll s ON p.objID = s.bestObjID WHERE ( p.type = 3 OR p.type = 6) AND u.up_id = x.up_id AND x.objID=p.objID ORDER BY x.up_id

66%

select top num ra dec run rerun camcol field obj isnull ra num isnull dec from temp temp photoobjall left outer join specobjall on objid compare bestobjid where type compare num logic type compare num logic up_id compare up_id logic objid compare objid orderby up_id

SELECT TOP 50 p.ra,p.dec,p.run,p.rerun,p.camCol,p.field,p.obj,ISNULL(s.ra,0) as ra,ISNULL(s.[dec],0) as [dec], 'ugriz' as filter FROM #x x, #upload u, BESTDR2..PhotoObjAll as p LEFT OUTER JOIN BESTDR2..SpecObjAll s ON p.objID = s.bestObjID WHERE ( p.type = 3 OR p.type = 6) AND u.up_id = x.up_id AND x.objID=p.objID ORDER BY x.up_id

60%

* Can be computed quickly by comparing only against queries within the same length neighborhood (a recommendation sketch follows below)
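A minimal sketch of the recommendation step, built on a templatize() cleaner and a similarity() function like the ones sketched earlier, and using the length-neighborhood shortcut from the footnote above. Names and the length slack are illustrative.

def recommend(bad_query, known_templates, templatize, similarity, top_k=3, length_slack=5):
    """Suggest known-correct templates similar to a (possibly broken) user query,
    comparing only against templates of roughly the same token length."""
    cleaned = templatize(bad_query)
    n = len(cleaned.split())
    candidates = [t for t in known_templates
                  if abs(len(t.split()) - n) <= length_slack]
    scored = sorted(((similarity(cleaned, t), t) for t in candidates), reverse=True)
    return scored[:top_k]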


Page 26:

Conclusion

• Not discussed
  – Optimizations, use of SQL Server, how we broke down clusters (‘Finesse Algorithm’), initializing centers, defining a center (‘Clustroid’), theoretical properties, weighting scheme
• Future Work
  – More cleaning, using SQL parse trees, other feature representations, better n-grams (wrapping edges), optimizations (min-hashing)
• Findings
  – We found that queries in cleaned representation fall into a small group of clusters (20M => 77K)
  – Queries follow very repetitive template patterns, enabling us to effectively find bots, query outliers, and do query recommendation

Page 27:

References

1. SkyServer Site Logs: http://skyserver.sdss.org/log/

2. Sloan Digital Sky Survey SkyServer project’s website: http://skyserver.sdss.org/

3. G. Abdulla, “Analysis of SDSS SQL server log files”, UCRL-MI-215756-DRAFT. Lawrence Livermore National Laboratory, 2005

4. T. Malik, R. Burns, A. Chaudhary. Bypass Caching: Making Scientific Databases Good Network Citizens. In Proceedings of the 21st International Conference on Data Engineering, 2005.

5. Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, Geoffrey Zweig: Syntactic Clustering of the Web. Computer Networks 29(8-13): 1157-1166 (1997)

6. B. Larsen, C. Aone. Fast and effective text mining using linear-time document clustering. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.

7. R. Kosala, H. Blockeel. Web mining research: a survey. ACM SIGKDD Explorations Newsletter, 2000.

8. Mehran Sahami, Timothy D. Heilman: A web-based kernel function for measuring the similarity of short text snippets. WWW 2006: 377-386

• Plan to publish our web and SQL traffic research in an MSR TR, as well as make the database, docs, slides & code available at http://skyserver.wordpress.com this week


Page 28:

QA


Page 29:

Gives us another optimization

• Current Design Inefficient
  – Many Jaccards over very large sets of strings
• Let’s use hashing to speed up Jaccard and enable us to store more features in memory
  – Hash(Template) => [Hashes(1-Grams), Hashes(2-Grams) …]
• We can use Python’s 32-bit string hash function
• But what about collisions?
  – Not many unique N-grams
    • SQL & schema vocabulary is small
  – What about hashing query templates?
  – Birthday Problem
    • We can expect collisions with probability greater than 50% after hashing:
    • ½ + sqrt(1/4 + 2 * 2^32 * ln(2)) = ~77K (a quick numeric check follows below)
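A small sketch of the hashing idea plus a numeric check of the birthday-problem figure quoted above. Python's built-in hash() (masked to 32 bits) stands in for the 32-bit string hash mentioned in the slide; the function name is illustrative.

import math

def hash_ngram_sets(ngram_sets):
    """Store hashed N-grams instead of strings: smaller in memory, faster to intersect."""
    return [set(hash(g) & 0xFFFFFFFF for g in grams) for grams in ngram_sets]

# Birthday-problem check: with 2^32 possible hash values, the expected number of items
# before a >50% chance of some collision is about the ~77K unique cleaned templates.
n = 0.5 + math.sqrt(0.25 + 2 * (2 ** 32) * math.log(2))
print(round(n))   # ~77163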

Page 30:

Clustering

• Primarily two flavors of clustering
  1. Expectation-Maximization (EM)
  2. K-Means
• Don’t want EM
  – Model-based clustering for unobservable data
  – K-Means is actually just a special case of EM
    • Clusters modelled by spherical Gaussian distributions
    • Each data item is assigned to one cluster
    • The mixture weights are equal

Page 31:

Now how do we compare two sets of N-grams? -- TODO

• Jaccard’s Measure

• Cardinality of the intersection of the two sets divided by the cardinality of the union of the two sets

• Traditional, established metric for similarity

• Captures Euclidean and Cosine properties

• Since sets remove duplicates, we index n-grams with numbers to denote repetition
  – Multi-Set Jaccard’s Measure

Page 32:

(3): How well can we classify SQL?

• Motivation for converting unstructured to structured text

• Can optimize a common layer for SQL-style representation
  – Reusable ML development framework

• Although translating to SQL is not trivial

• Has nice properties
  – Highly constrained, clean vocabulary, relational, propositional, parseable

Page 33:

But we have 8 sets of n-grams … -- TODO

• We compute the Jaccard separately for each size n-gram

– Not a good idea to mix sizes (lose independence)

• Wish to prioritize smaller grams for better results and to maximize similarity

– Since the Jaccard of smaller grams will result in a higher similarity value than larger grams

– Prioritize the first 3 grams

• Reverse Fibonacci sequence enables us to grow the weights proportionally and exponentially

• 78% goes to first 3 grams

• Comforting to know we’re taking into account larger grams (22% sounds reasonable to us)

• Then take the weighted mean of the Jaccard’s scaled by their respective Fibonacci number

• This is arbitrary, but we liked the results

• Compared to exponential (2 based) and just plain ol’ unigrams with no scaling

• Gives us something in between these two
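As a quick check of the 78% / 22% split quoted above, the reversed Fibonacci weights for gram sizes 1–8 can be computed directly. This is a small verification consistent with the numbers in the deck, not code from the project.

# Fibonacci numbers for 8 gram sizes, reversed so the smallest grams get the largest weights.
fibs = [1, 1]
while len(fibs) < 8:
    fibs.append(fibs[-1] + fibs[-2])
weights = list(reversed(fibs))          # [21, 13, 8, 5, 3, 2, 1, 1]

share_small = sum(weights[:3]) / float(sum(weights))
print(weights, round(share_small, 2))   # the 1-, 2-, 3-grams carry ~0.78 of the total weight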

Page 34:

Finally, an example: Distance Between 2 SQL Queries -- TODO

• [include two queries, their n-grams, jaccard of each, multiplied by the fibonacci element, then the mean]

• For now, say n-grams from 1-3 (so it can fit on a slide)

Page 35:

2nd Step: Choosing Initial Centers

• First, what’s a center?
  – Common clustering is done over numerical vectors
    • Euclidean distance and midpoint (‘Centroid’)
  – But we’re dealing with sets of n-grams
    • No clear average midpoint formula
• So instead we use ‘Clustroids’ (a sketch follows below)
  – SQL queries represent centers
  – Since we can’t do an average midpoint, we find the query that minimizes square error with everybody else within the cluster
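A minimal sketch of clustroid selection as described above, assuming distance = 1 - similarity and (per the later scaling slide) an optional random sample in place of a full all-pairs pass. An illustrative reading, not the project's code.

import random

def clustroid(cluster, similarity, sample_size=None):
    """Pick the member query minimizing the sum of squared distances to the others."""
    pool = cluster if sample_size is None else random.sample(cluster, min(sample_size, len(cluster)))
    best, best_err = None, float("inf")
    for candidate in cluster:
        err = sum((1.0 - similarity(candidate, other)) ** 2
                  for other in pool if other is not candidate)
        if err < best_err:
            best, best_err = candidate, err
    return best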

Page 36:

Initialization

• Buckshot (1992) [include cite]
  – Hierarchical Agglomerative Clustering (HAC)
  – Dendrogram Graph
  – Start w/ SQRT(N) samples, greedily combine clusters (All-Pairs) until left with K Clustroids (a sketch follows below)
  – O(N^2 – N*K)
• Hill-Climbing
• No really good method that escapes local minima
• Simulated Annealing
• Genetic K-Means
• Difficult to apply to our features and distance metric
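A rough sketch of Buckshot-style initialization under the description above: sample ~sqrt(N) documents, greedily merge the two most similar clusters until K remain, and return one representative per cluster. The merge criterion (single-link over the sample) and names are assumptions, not the authors' implementation.

import math
import random

def buckshot(docs, k, similarity):
    """Greedy HAC over a sqrt(N) sample; returns K seed documents."""
    sample = random.sample(docs, min(int(math.sqrt(len(docs))) or 1, len(docs)))
    clusters = [[d] for d in sample]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = max(similarity(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return [c[0] for c in clusters]   # clustroid selection could refine each seed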

Page 37:

But this just does not scale for 20M -- TODO

• How many Jaccards for just BuckShot [include]

• Plus how many Jaccards for KMeans
  – (probably more; it stops when the clusters are stable, which is an unknown number of steps)
• One can set a max number of iterations
• So how about sampling more
• Just recomputing clustroids is n^2
  – Need to do all-pairs to find the center with min square error
  – Sample sqrt(n) to make it linear

Page 38:

Can we initialize centers faster? -- TODO

• BuckShot is slow

• Can we do this step faster and produce better clusters?

• Idea: Cuckoo Clustering

• Inspired by Cuckoo Hashing and Genetic KMeans [include cites]

Page 39:

Side Excursion: Cuckoo Clustering – TODO (Should be extra slides)

• [describe algorithm]

• K buckets

• Elect better leaders

• Spill over to next best bucket

• [Analyze run time]

• [include pseudo code]

Page 40:

Cuckoo Clustering – TODO (extra slides)

• Nests (capacity 2)

• K = 4

• Sample size = 8

Page 41:

Clustering 116,425 SQL Queries

• SQL Queries chosen uniformly at random
• K = 56
• Max K-Means Iterations = 20
• MSEWC = Mean Square Error Within Cluster

MSEWC
Random   0.341520927107638
Buckshot 0.331783982119379
Cuckoo   0.329480122484919

Page 42:

Plus … -- TODO

• Each query template and the number of times it occurred

• The most popular ones should represent initial centers

• And, cleaned queries grouped by the same length should be assigned to the same cluster initially

• Get one iteration of AssignClusters for free
• Now we can do KMeans (a bootstrap sketch follows below)
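An illustrative reading of the bootstrap above: take the most frequent templates as initial centers and give cleaned queries of the same length the same starting cluster, which stands in for one AssignClusters pass. Names and the nearest-length rule are assumptions, not the project's code.

from collections import defaultdict

def bootstrap_centers(template_counts, k):
    """Pick the K most frequent templates as the initial centers."""
    ranked = sorted(template_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [t for t, _ in ranked[:k]]

def bootstrap_assignment(templates, centers):
    """Group templates by token length and start each group at the nearest-length center."""
    by_length = defaultdict(list)
    for t in templates:
        by_length[len(t.split())].append(t)
    assignment = {}
    for length, group in by_length.items():
        center = min(centers, key=lambda c: abs(len(c.split()) - length))
        for t in group:
            assignment[t] = center
    return assignment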

Page 43:

Bootstrap Example -- TODO

• [include example, before and after group bys]

Page 44:

Experiment -- TODO

• Setup
  • SQL Server 2005
  • IronPython <== I love this language
• Did not use Data Mining package in SQL Server
  • 'Microsoft Clustering' designed around and optimized for Euclidean vectors
  • Not easy to write a SQL Server C++ plug-in for my style feature sets + Jaccard + sampling logic
• Computation Time: [?]
• KMeans Iterations: [?]

Page 45:

Some Optimizations – TODO (extra slides)

• Pruning when we find similarity = 1
• Hashing
  • Speeds up Jaccard and reduces memory
  • Hash(cleanedQuery) => List of Hash(N-gram)’s
  • Use Python’s 32-bit string hash function
• Collisions?
  • Not many unique N-grams
    – Number of SQL keywords + SDSS schema fields is very small
  • Birthday Problem: We can expect collisions with probability greater than ½ after hashing:
    • ½ + sqrt(1/4 + 2 * 2^32 * ln(2)) = ~77k queries
  • We have ~77k unique cleaned queries
  • Good enough

Page 46:

Clusters Discovered – TODO (include after 80k clustered)

Page 47:

We can do better: ‘Finesse’ -- TODO

• Idea: KMeans does the hard work of grouping similar queries together

• Intuition: We can find 'tighter' query groups within a cluster fast

• We choose a threshold (say within 85% similarity)

• Value of K not really important now
• [include algorithm]

Page 48:

Finesse Algorithm -- TODO

• [include algorithm]

• Going backwards from 1.0 down to the threshold in decremental step values is key to finding optimal clusters and for speed

Page 49:

A Popular Cluster Found

Query Template / Count

select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic r compare num logic num 3100261

select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic z compare num logic num 1

select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic z compare num logic num 30

select top num objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic type compare num 1

select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic r compare num logic num 17

select cast objid ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic i compare num logic num logic type compare num 6

select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic u compare num logic num 6

select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic g compare num logic num 93

select cast objid type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic i compare num logic num logic type compare num 4

select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic u compare num logic num 71

select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic g compare num logic num 10

select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic u compare num logic num 9

select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic i compare num logic num 3

select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic i compare num logic num 20

select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic z compare num logic num 1

select run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic u compare num logic num 1

select top num run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic r compare num logic num 2

select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid logic r compare num logic num 101

select cast objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic i compare num logic num 1

select top num run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic g compare num logic num 3

select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic type compare num 9

select top num objid run rerun camcol field obj type ra dec u g r i z colc rowc from fgetnearbyobjeq photoprimary where objid compare objid logic r compare num logic type compare num 5

• Represents 18% of all the SQL queries (3,100,655 hits)
• Sum of Square Errors / Cluster Size = 3.58006287097031E-06

• Meaning tiny variation within cluster

Page 50:

And since we collect N-grams …

• … We can find the common phrases
• Ex. Top 15 Most Popular Eight-Grams

Phrase / % Share of all 8-grams
dec err_g err_u g i r u z 0.037037037
dec err_u g i r ra u z 0.037037037
dec g i r ra type u z 0.037037037
err_g err_i err_r err_u err_z from i z 0.037037037
err_g err_i err_r err_u err_z i r z 0.037037037
err_g err_i err_r err_u g i r z 0.037037037
err_g err_r err_u g i r u z 0.037037037
camcol dec field obj ra rerun type u 0.035273369
camcol dec field obj ra rerun run type 0.035273369
camcol dec field g obj ra type u 0.035273369
dec field g obj r ra type u 0.035273369
dec g i obj r ra type u 0.035273369
err_i err_z from fgetobjfromrect photoprimary where objid compare 0.02292769
err_z from fgetobjfromrect photoprimary where objid compare objid 0.02292769
from fgetobjfromrect photoprimary where objid compare objid logic 0.02292769

Page 51:

Bot Detection

• Belief: Little or no variance and large cluster sizes correspond to bot agents
• Can isolate bot from mortal SQL traffic
  – No user-agent strings in the SQL logs

Page 52:

Bot Detection

Query Template / Count

select objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetobjfromrect photoprimary where objid compare objid logic r compare num logic num 3100261

select top num objid run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid 2732991

select count * from photoprimary where htmid compare num logic htmid compare num 2188913

select rowc_g colc_g from photoprimary where objid compare num 900055

select rowc_r colc_r from photoprimary where objid compare num 814884

select objid distance ra dec fphototypen type psfmag_u psfmagerr_u psfmag_g psfmagerr_g psfmag_r psfmagerr_r psfmag_i psfmagerr_i psfmag_z psfmagerr_z dered_u dered_g dered_r dered_i dered_z isnull z num specclass objtypename from photoobjall join fgetnearbyobjeq on objid compare objid left outer join specobj on bestobjid compare objid orderby distance 805539

select objid type flags ra dec r petromag_r isoa_r isob_r isophi_r isophierr_r distance from galaxy fgetnearbyobjeq where objid compare objid 570028

select top num weblink cast objid weblink cast objid weblink run rerun camcol field obj type ra dec u g r i z err_u err_g err_r err_i err_z from fgetnearbyobjeq photoprimary where objid compare objid 448722

select top num ra dec u g r i z petror50_r petror90_r expphi_r expab_r nchild flags lnlexp_r lnldev_r from fgetnearbyobjeq galaxy where objid compare objid logic petromag_z compare num logic num 307124

select extinction_u extinction_g extinction_r extinction_i extinction_z from photoobjall where objid compare num 262222

• Significant # of EXACT query template matches

• Top 10 Query Templates

Page 53:

App: Query Recommendation

• Since we have a distance function … we can even find similar queries

• New users, students, visitors wishing to query the sky

• But there’s a learning curve
  – SQL (writing 3-way spatial query joins may not be intuitive)
  – Schemas
  – Optimizing & Debugging

• And since we found queries to be quite repetitive …

• Why not return the most similar (and correct) ones as suggestions to the user?
  – Improving user experience and reach

Page 54:

Example (TODO Better Query – this isn’t a common one)

Find Top 3 Most Similar Queries to:

SELECT top 100 p.psfMag_g - s.Mag_g FROM photoprimary p JOIN specObj s on p.specObjID=s.specObjID where s.specClass=1

Its ‘Cleaned’ representation:

select top num psfmag_g mag_g from photoprimary join specobj on specobjid compare specobjid where specclass compare num

Page 55:

Similar Query Results

Template Ex. Query Similarity

select top num psfmag_g mag_g from photoprimary join specobj on specobjid compare specobjid where specclass compare num

SELECT top 100 p.psfMag_g - s.Mag_g FROM photoprimary p JOIN specObj s on p.specObjID=s.specObjID where s.specClass=1

100%

select top num * from specobj join specline on specobjid compare specobjid where specclass compare num logic lineid compare num

select top 10 * from specObj s join specLine l on s.specObjID = l.specObjID where specClass = 3 and l.lineID=6565

41%

select count * from galaxy join specobj on specobjid compare specobjid where zconf compare num logic specclass compare num

SELECT COUNT(*) FROM BESTDR5..Galaxy as G JOIN BESTDR5..SpecObj as GS on G.specObjID = GS.specObjID WHERE GS.zconf > .9 AND GS.specClass = 2

35%