Browser Data Analysis - IIT Kanpur

Anurag Semwal (16111033)

Akash Singh (16111028)

Task 1: Identification of Boundary Topics

Data extraction:

● Downloaded Chrome browser history data from https://takeout.google.com/settings/takeout/ in JSON format
● The data covers 1 Jan 2016 to 1 Mar 2017 (146281 titles)
● Converted it to a CSV file with the columns date, url and title
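The conversion step can be sketched as follows. This is a minimal sketch, not the authors' script: it assumes the Takeout export is a JSON object with a "Browser History" list whose entries carry "title", "url" and "time_usec" (microseconds since the Unix epoch), and it treats the end of the date range as exclusive. The function names are hypothetical.

```python
import csv
import io
from datetime import datetime, timezone

def history_to_rows(takeout, start, end):
    """Return sorted (date, url, title) rows for visits with start <= time < end."""
    rows = []
    for entry in takeout.get("Browser History", []):
        # Assumption: time_usec is microseconds since the Unix epoch (UTC).
        visited = datetime.fromtimestamp(entry["time_usec"] / 1_000_000, tz=timezone.utc)
        if start <= visited < end:
            rows.append((visited.isoformat(), entry.get("url", ""), entry.get("title", "")))
    rows.sort()
    return rows

def write_csv(rows, handle):
    """Write the rows with the header used in the slides: date, url, title."""
    writer = csv.writer(handle)
    writer.writerow(["date", "url", "title"])
    writer.writerows(rows)

# Tiny in-memory stand-in for the downloaded JSON file (hypothetical entries).
sample = {"Browser History": [
    {"title": "Google Search", "url": "https://www.google.com/", "time_usec": 1480550400000000},
    {"title": "Old page", "url": "http://example.com/", "time_usec": 1400000000000000},
]}
start = datetime(2016, 1, 1, tzinfo=timezone.utc)
end = datetime(2017, 3, 2, tzinfo=timezone.utc)  # exclusive bound, so 1 Mar 2017 is included

rows = history_to_rows(sample, start, end)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

For a real export, `sample` would be replaced by the result of `json.load` on the downloaded file, and `buffer` by an open file handle.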

Topic modelling:

● Run topic modelling on the data to get 4 topics
● Use the trained model to list the predicted topics for each url in decreasing order of probability, i.e. url1 -> [(t1, 0.5), (t2, 0.4), (t3, 0.1)]
● Analyze the data and keep only those urls after which the user switches to a different topic. These are the boundary data of a topic.
● Identified 70807 boundary titles
● Run topic modelling on these boundary titles
● The resulting topics represent the boundary topics
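The boundary-selection step can be sketched as below. It assumes each visit already has a topic distribution in the [(topic, probability), ...] form shown above, in chronological order; the source does not name the topic-modelling library, so the sketch starts from those distributions. The helper names and the sample data are hypothetical.

```python
def top_topic(dist):
    """Most probable topic id of a [(topic_id, prob), ...] list."""
    return max(dist, key=lambda pair: pair[1])[0]

def boundary_indices(dists):
    """Indices i whose next visit (i + 1) has a different most probable topic."""
    tops = [top_topic(d) for d in dists]
    return [i for i in range(len(tops) - 1) if tops[i] != tops[i + 1]]

# Hypothetical per-visit topic distributions, in chronological order.
dists = [
    [(0, 0.7), (1, 0.2), (2, 0.1)],
    [(0, 0.5), (1, 0.4), (2, 0.1)],   # last visit in topic 0 -> boundary
    [(1, 0.6), (0, 0.3), (2, 0.1)],
    [(1, 0.8), (2, 0.2)],             # last visit in topic 1 -> boundary
    [(2, 0.9), (1, 0.1)],
]
print(boundary_indices(dists))  # [1, 3]
```

The titles at the returned indices would then form the corpus for the second round of topic modelling.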


Results:

Original Topics

(0, '0.139*"googl" + 0.132*"search" + 0.024*"iit" + 0.010*"http" + 0.010*"avail" + 0.009*"cs" + 0.007*"kanpur" + 0.007*"quora" + 0.006*"exam" + 0.006*"cse"') -----> academics
(1, '0.046*"gmail" + 0.041*"com" + 0.028*"india" + 0.025*"semwal" + 0.023*"anurag0" + 0.022*"onlin" + 0.018*"amazon" + 0.011*"buy" + 0.011*"best" + 0.011*"price"') -----> mail and online shopping
(2, '0.017*"comput" + 0.015*"quora" + 0.014*"scienc" + 0.010*"stack" + 0.009*"institut" + 0.009*"engin" + 0.008*"iiit" + 0.008*"messag" + 0.008*"overflow" + 0.007*"technolog"') -----> Q&A sites
(3, '0.052*"gate" + 0.034*"2016" + 0.026*"youtub" + 0.015*"free" + 0.014*"overflow" + 0.011*"wikipedia" + 0.011*"m" + 0.009*"encyclopedia" + 0.009*"1" + 0.009*"tech"') -----> reading and watching videos

Boundary Topics

(0, '0.260*"facebook" + 0.053*"limitless" + 0.047*"messag" + 0.038*"comment" + 0.035*"semwal" + 0.031*"gmail" + 0.027*"com" + 0.023*"profil" + 0.022*"pictur" + 0.015*"anurag0"') -----> Facebook messages/comments, checking mail
(1, '0.062*"nan" + 0.042*"comput" + 0.035*"scienc" + 0.030*"career" + 0.023*"gate" + 0.022*"anurag" + 0.019*"overflow" + 0.016*"quora" + 0.014*"stack" + 0.011*"2016"') -----> CSE/career related
(2, '0.086*"googl" + 0.085*"search" + 0.025*"iit" + 0.023*"kanpur" + 0.018*"photo" + 0.015*"youtub" + 0.014*"ubuntu" + 0.012*"webmail" + 0.009*"python" + 0.008*"s"') -----> Google search, accessing mail, YouTube
(3, '0.052*"onlin" + 0.036*"amazon" + 0.030*"shop" + 0.024*"watch" + 0.021*"mobil" + 0.021*"book" + 0.017*"network" + 0.017*"india" + 0.016*"system" + 0.015*"shoe"') -----> online shopping

Task 2: Hypothesis testing

Is the chronologically last link visited in a topic strongly related to the topic to which the user switches?

Extracted data: Browser history data (titles of visited webpages from 1 Jan 2016 to 1 Mar 2017)
Sample: Boundary titles (after which the topic changes): 69948
Test statistic: Paired t-test
X: Probability of the second most probable topic of the boundary url
Y: Probability of the boundary url belonging to the next topic (i.e. the most probable topic of the next visited url)
Null hypothesis: The mean of X is the same as the mean of Y, i.e. Mean(X) = Mean(Y)
Alternate hypothesis: The mean of X is greater than the mean of Y, i.e. Mean(X) > Mean(Y)
Significance level (α): 0.05

Data=50,000

µX        µY        σX        σY        α      t-statistic   p-value (1-sided)
0.23628   0.15498   0.11738   0.10217   0.05   102.46        0.0

Data=146281
Boundary titles: 69948
Data for which x = y: 25714 (36.7 % of the boundary titles)

µX        µY         σX        σY        α      t-statistic   p-value (1-sided)
0.23678   0.158433   0.11784   0.10512   0.05   165.6         0.0
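The test above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' code: X and Y are built from per-visit topic distributions as defined in the task, the paired t-statistic is the mean of the differences divided by its standard error, and the one-sided p-value uses a normal approximation to the t distribution, which is only reasonable for large samples such as the ones here. All function names and the sample data are hypothetical.

```python
import math
from statistics import mean, stdev

def top_topic(dist):
    """Most probable topic id of a [(topic_id, prob), ...] list."""
    return max(dist, key=lambda pair: pair[1])[0]

def second_prob(dist):
    """Probability of the second most probable topic (0.0 if there is none)."""
    probs = sorted((p for _, p in dist), reverse=True)
    return probs[1] if len(probs) > 1 else 0.0

def last_link_pairs(dists):
    """X and Y for every visit after which the most probable topic changes."""
    xs, ys = [], []
    for cur, nxt in zip(dists, dists[1:]):
        next_topic = top_topic(nxt)
        if top_topic(cur) != next_topic:
            xs.append(second_prob(cur))                  # X: second most probable topic
            ys.append(dict(cur).get(next_topic, 0.0))    # Y: probability of the next topic
    return xs, ys

def paired_t(xs, ys):
    """Paired t-statistic and an approximate one-sided p-value for Mean(X) > Mean(Y)."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    # Normal approximation to the t distribution (assumption: n is large).
    p_one_sided = 0.5 * math.erfc(t / math.sqrt(2))
    return t, p_one_sided

# Hypothetical topic distributions in chronological order.
dists = [
    [(0, 0.6), (1, 0.3), (2, 0.1)],
    [(2, 0.5), (0, 0.3), (1, 0.2)],
    [(1, 0.7), (0, 0.2), (2, 0.1)],
    [(0, 0.9), (1, 0.1)],
]
xs, ys = last_link_pairs(dists)
t, p = paired_t(xs, ys)
print(round(t, 3), p < 0.05)
```

With a t-statistic as large as the ones in the tables, the approximate p-value underflows to 0.0 in double precision, which is consistent with the reported p-value of 0.0.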

Conclusion:
● p-value << α (level of significance, 0.05), so the result is statistically highly significant
● We can reject H0 and say that sample X has a higher mean than sample Y. We can therefore say that the chronologically last link visited in a topic is not strongly related to the topic to which the user switches.

Task 3: Hypothesis testing

Is the chronologically first link visited in a topic strongly related to the previous topic from which the user switches?

Extracted data: Browser history data (titles of visited webpages from 1 Jan 2016 to 1 Mar 2017)
Sample: Boundary titles (first url of the new topic): 69948
Test statistic: Paired t-test
X: Probability of the second most probable topic of the boundary url
Y: Probability of the boundary url belonging to the previous topic (i.e. the most probable topic of the previously visited url)
Null hypothesis: The mean of X is the same as the mean of Y, i.e. Mean(X) = Mean(Y)
Alternate hypothesis: The mean of X is greater than the mean of Y, i.e. Mean(X) > Mean(Y)
Significance level (α): 0.05

Data=30000
Boundary titles: 14792
Data for which x = y: 5236 (35 % of the boundary titles)

µX        µY       σX       σY        α      t-statistic   p-value (1-sided)
0.24255   0.1673   0.1123   0.10756   0.05   74.45         0.0
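For this task the pairs are taken from the first url of each new topic and compared against the previous topic. The sketch below builds X and Y that way and also counts the cases where the second most probable topic is the previous topic; it is an assumption that this is what the "x = y" count above measures. The function names and the sample data are hypothetical.

```python
def top_topic(dist):
    """Most probable topic id of a [(topic_id, prob), ...] list."""
    return max(dist, key=lambda pair: pair[1])[0]

def second_topic(dist):
    """(topic_id, prob) of the second most probable topic, or (None, 0.0)."""
    ranked = sorted(dist, key=lambda pair: pair[1], reverse=True)
    return ranked[1] if len(ranked) > 1 else (None, 0.0)

def first_link_pairs(dists):
    """X, Y and the number of x = y cases for the first visit of each new topic."""
    xs, ys, same = [], [], 0
    for prev, cur in zip(dists, dists[1:]):
        prev_topic = top_topic(prev)
        if top_topic(cur) != prev_topic:
            topic2, x = second_topic(cur)
            xs.append(x)                                # X: second most probable topic
            ys.append(dict(cur).get(prev_topic, 0.0))   # Y: probability of the previous topic
            if topic2 == prev_topic:                    # assumed meaning of "x = y"
                same += 1
    return xs, ys, same

# Hypothetical topic distributions in chronological order.
dists = [
    [(0, 0.6), (1, 0.3), (2, 0.1)],
    [(1, 0.5), (0, 0.4), (2, 0.1)],   # first visit in topic 1, previous topic 0
    [(1, 0.7), (2, 0.2), (0, 0.1)],
    [(2, 0.6), (0, 0.3), (1, 0.1)],   # first visit in topic 2, previous topic 1
]
xs, ys, same = first_link_pairs(dists)
print(xs, ys, same / len(xs))
```

The resulting X and Y lists can be passed to any paired t-test routine to obtain the statistic reported in the table.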

Conclusion:
● p-value << α (level of significance, 0.05), so the result is statistically highly significant
● We can reject H0 and say that sample X has a higher mean than sample Y. We can therefore say that the chronologically first link visited in a topic is not strongly related to the topic from which the user switches.

Task 4: Prediction of URLs at any hour

Data :

● Data from 1 Jan 2016 to 1 Mar 2017 (146281 titles), used with LDA to get the topics
● 48308 urls from personal browsing history (1 Dec 2016 to 1 Mar 2017), used for prediction

Hourly Topic Modeling:

● Run topic modelling on the whole data to get 4 topics
● Predict topics for the most recent 48308 urls
● Find the topic frequency for each hour (0-1, 1-2, ..., 23-24)
● Find the most frequent topic for each hour
● Find the most frequent urls for each topic
● For each hour, based on the most frequent topic for that hour, suggest the top 5 urls of that topic to the user

Results: Predicted urls for each hour, where 0 denotes 12 am to 1 am, 1 denotes 1 am to 2 am, ..., 23 denotes 11 pm to 12 am. Hours with identical suggestions are grouped.

Hours 0, 1:
['http://www.amazon.in/', 'https://m.facebook.com/messages/read/?tid=mid.1441536560961%3Af41b8aee8c76a6f223', 'https://m.facebook.com/photo.php?fbid=1374539555891053&id=100000050660083&set=a.159962404015447.44507.100000050660083&notif_t=like&notif_id=1480571116462940&ref=m_notif', 'http://www.hrtchp.com/hrtctickets/Availability.aspx', 'http://gateoverflow.in/']

Hours 2, 3, 4, 5, 6, 7, 10, 11, 12, 13, 14, 16:
['https://m.facebook.com/home.php', 'https://m.facebook.com/?_rdr', 'https://www.facebook.com/', 'https://m.facebook.com/?hrc=2&refsrc=http%3A%2F%2Fh.facebook.com%2Fhr%2Fr&_rdr', 'http://192.168.1.1/main.html']

Hours 8, 9, 15, 17, 18, 19, 20, 21, 22, 23:
['https://authenticate.iitk.ac.in/netaccess/loginuser.html', 'chrome://newtab/', 'https://nwm.iitk.ac.in/?_task=mail&_mbox=INBOX', 'https://m.facebook.com/groups/114609428571319?ref=bookmarks', 'https://webmail.cse.iitk.ac.in/src/login.php']
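The hourly suggestion procedure can be sketched as below. It assumes each recent visit has already been reduced to an (hour, url, topic) triple, where the topic is the most probable topic predicted for that url; the function name and the sample visits are hypothetical.

```python
from collections import Counter, defaultdict

def suggest_by_hour(visits, k=5):
    """Map each hour to the k most frequent urls of that hour's most frequent topic.

    visits: iterable of (hour, url, topic) triples.
    """
    topic_counts = defaultdict(Counter)   # hour  -> topic frequencies
    url_counts = defaultdict(Counter)     # topic -> url frequencies over all hours
    for hour, url, topic in visits:
        topic_counts[hour][topic] += 1
        url_counts[topic][url] += 1

    suggestions = {}
    for hour, counts in topic_counts.items():
        best_topic = counts.most_common(1)[0][0]
        suggestions[hour] = [url for url, _ in url_counts[best_topic].most_common(k)]
    return suggestions

# Hypothetical visits: (hour of day, url, predicted topic).
visits = [
    (0, "http://www.amazon.in/", 3),
    (0, "http://www.amazon.in/", 3),
    (0, "https://www.facebook.com/", 0),
    (1, "http://gateoverflow.in/", 3),
    (9, "chrome://newtab/", 2),
    (9, "chrome://newtab/", 2),
]
suggestions = suggest_by_hour(visits)
for hour in sorted(suggestions):
    print(hour, suggestions[hour])
```

Because url frequencies are counted per topic rather than per hour, every hour that shares a most frequent topic receives the same list, which matches the repeated lists in the results above.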
