Browser Data Analysis - IIT Kanpur
Browser Data Analysis
Anurag Semwal (16111033)
Akash Singh (16111028)
Task 1: Identification of Boundary Topics
Data extraction:
● Downloaded Chrome browser history data in JSON format from https://takeout.google.com/settings/takeout/, covering 1 Jan 2016 to 1 Mar 2017 (146281 titles)
● Converted it to a CSV file with the columns Date, url and title
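The conversion step can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes the export is a JSON object with a "Browser History" list whose entries carry "title", "url" and "time_usec" (microseconds since the Unix epoch), and the function names are hypothetical.

```python
import csv
import io
from datetime import datetime, timezone


def history_to_rows(takeout):
    """Flatten a Takeout-style history dict into Date/url/title rows.

    Assumption: entries live under the "Browser History" key and
    "time_usec" holds microseconds since the Unix epoch (UTC).
    """
    rows = []
    for entry in takeout.get("Browser History", []):
        ts = datetime.fromtimestamp(entry["time_usec"] / 1_000_000, tz=timezone.utc)
        rows.append({
            "Date": ts.strftime("%Y-%m-%d %H:%M:%S"),
            "url": entry.get("url", ""),
            "title": entry.get("title", ""),
        })
    # Chronological order matters later, when consecutive visits are compared.
    rows.sort(key=lambda r: r["Date"])
    return rows


def write_csv(rows, out):
    """Write the rows to a file-like object with the three columns used here."""
    writer = csv.DictWriter(out, fieldnames=["Date", "url", "title"])
    writer.writeheader()
    writer.writerows(rows)
```

In practice the dict would come from `json.load` on the exported file, and `out` would be a file opened with `newline=""`.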
Topic modelling:
● Run topic modelling on the data to obtain 4 topics
● Use the fitted model to generate, for each URL, a list of predicted topics in decreasing order of probability, i.e. url1 -> [(t1, 0.5), (t2, 0.4), (t3, 0.1)]
● Keep only those URLs after which the user switches to a different topic; these are the boundary data within a topic
● Identified 70807 boundary titles
● Run topic modelling again on these boundary titles
● The resulting topics represent the boundary topics
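The boundary-selection step above can be sketched as follows. This is an illustration under assumptions: each visit is represented by its list of (topic, probability) pairs in chronological order, a visit's topic is taken to be its most probable one, and the helper names are hypothetical.

```python
def top_topic(dist):
    """Return the id of the most probable topic in a [(topic, prob), ...] list."""
    return max(dist, key=lambda pair: pair[1])[0]


def boundary_indices(predictions):
    """Indices of visits whose top topic differs from that of the next visit.

    `predictions` is a chronological list of per-visit topic distributions.
    """
    tops = [top_topic(d) for d in predictions]
    return [i for i in range(len(tops) - 1) if tops[i] != tops[i + 1]]
```

The titles at the returned indices would then form the corpus for the second round of topic modelling.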
(3, '0.052*"onlin" + 0.036*"amazon" + 0.030*"shop" + 0.024*"watch" + 0.021*"mobil" + 0.021*"book" + 0.017*"network" + 0.017*"india" + 0.016*"system" + 0.015*"shoe"')] ----> online shopping
Task 2: Hypothesis testing
Is the chronologically last link visited in a topic strongly related to the topic to which the user switches?
Extracted data: Browser history data (titles of visited webpages from 1 Jan 2016 to 1 Mar 2017)
Sample: Boundary titles (after which the topic changes): 69948
Test statistic: Paired t-test
X: Probability of the second most probable topic of the boundary URL
Y: Probability of the boundary URL belonging to the next topic (i.e. the most probable topic of the next visited URL)
Null hypothesis: The mean of X is the same as the mean of Y, i.e. Mean(X) = Mean(Y)
Alternate hypothesis: The mean of X is greater than the mean of Y, i.e. Mean(X) > Mean(Y)
Significance level (α): 0.05
Data = 50,000
| µX | µY | σX | σY | α | t-statistic | p-value (1-sided) |
|---|---|---|---|---|---|---|
| 0.23628 | 0.15498 | 0.11738 | 0.10217 | 0.05 | 102.46 | 0.0 |
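For concreteness, the paired samples X and Y defined above could be assembled as in the sketch below. It assumes each visit is a chronological list of (topic, probability) pairs and that a topic missing from a list has probability 0.0; the function names are hypothetical and this is not necessarily how the authors built their samples.

```python
def second_highest_prob(dist):
    """X: probability of the second most probable topic (0.0 if only one topic)."""
    probs = sorted((p for _, p in dist), reverse=True)
    return probs[1] if len(probs) > 1 else 0.0


def prob_of(dist, topic):
    """Probability assigned to `topic`; assumed 0.0 when the topic is not listed."""
    return dict(dist).get(topic, 0.0)


def build_pairs(predictions):
    """Return (X, Y) for every boundary visit, i.e. one followed by a topic switch."""
    pairs = []
    for cur, nxt in zip(predictions, predictions[1:]):
        cur_top = max(cur, key=lambda tp: tp[1])[0]
        nxt_top = max(nxt, key=lambda tp: tp[1])[0]
        if cur_top != nxt_top:
            # Y is the boundary visit's probability under the next visit's top topic.
            pairs.append((second_highest_prob(cur), prob_of(cur, nxt_top)))
    return pairs
```

Pairs with X equal to Y are the cases where the next topic is exactly the boundary URL's second most probable topic, so by construction Y can never exceed X.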
Data = 146281
Boundary titles: 69948
Data for which x = y: 25714 (36.7 % of the boundary titles)
| µX | µY | σX | σY | α | t-statistic | p-value (1-sided) |
|---|---|---|---|---|---|---|
| 0.23678 | 0.158433 | 0.11784 | 0.10512 | 0.05 | 165.6 | 0.0 |
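The paired t statistic reported in these tables is the mean of the differences X - Y divided by its standard error. A minimal standard-library sketch follows; the one-sided p-value here uses a normal approximation, which is an assumption that is only reasonable for large samples such as these, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(x, y):
    """t = mean(d) / (sd(d) / sqrt(n)) for paired differences d = x - y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(x, y)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


def one_sided_p_normal(t):
    """Upper-tail p-value P(Z > t) under a normal approximation to the t distribution."""
    return 0.5 * math.erfc(t / math.sqrt(2))
```

With a statistic in the hundreds, the upper-tail probability underflows to 0.0 in double precision, which is consistent with the p-value of 0.0 shown in the tables.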
Conclusion:
● p-value << α (level of significance), so the result is statistically highly significant (p < 0.05)
● We can reject H0 and say that sample X has a higher mean than sample Y. We can say that the chronologically last link visited in a topic is not strongly related to the topic to which the user switches.
Task 3: Hypothesis testing
Is the chronologically first link visited in a topic strongly related to the previous topic from which the user switches?
Extracted data: Browser history data (titles of visited webpages from 1 Jan 2016 to 1 Mar 2017)
Sample: Boundary titles (first URL of the new topic): 69948
Test statistic: Paired t-test
X: Probability of the second most probable topic of the boundary URL
Y: Probability of the boundary URL belonging to the previous topic (i.e. the most probable topic of the previously visited URL)
Null hypothesis: The mean of X is the same as the mean of Y, i.e. Mean(X) = Mean(Y)
Alternate hypothesis: The mean of X is greater than the mean of Y, i.e. Mean(X) > Mean(Y)
Significance level (α): 0.05
Data = 30000
Boundary titles: 14792
Data for which x = y: 5236 (35 % of the boundary titles)
| µX | µY | σX | σY | α | t-statistic | p-value (1-sided) |
|---|---|---|---|---|---|---|
| 0.24255 | 0.1673 | 0.1123 | 0.10756 | 0.05 | 74.45 | 0.0 |
Conclusion:
● p-value << α (level of significance), so the result is statistically highly significant (p < 0.05)
● We can reject H0 and say that sample X has a higher mean than sample Y. We can say that the chronologically first link visited in a topic is not strongly related to the topic from which the user switches.
Task 4: Prediction of URLs at any hour
Data:
● Data from 1 Jan 2016 to 1 Mar 2017 (146281 titles), used to fit LDA and obtain the topics
● 48308 URLs from personal browsing history (1 Dec 2016 to 1 Mar 2017), used for prediction
Hourly topic modelling:
● Run topic modelling on the whole data to obtain 4 topics
● Predict topics for the most recent 48308 URLs
● Find the topic frequency for each hour (0-1, 1-2, ..., 23-24)
● Find the most frequent topic for each hour
● Find the most frequent URLs for each topic
● For each hour, based on the most frequent topic for that hour, suggest the top 5 URLs for that topic to the user
Results:
Showing predicted URLs for each hour, e.g. 0 denotes 12 am to 1 am, 1 denotes 1 am to 2 am, ..., 23 denotes 11 pm to 12 am
0 ['http://www.amazon.in/', 'https://m.facebook.com/messages/read/?tid=mid.1441536560961%3Af41b8aee8c76a6f223', 'https://m.facebook.com/photo.php?fbid=1374539555891053&id=100000050660083&set=a.159962404015447.44507.100000050660083&notif_t=like&notif_id=1480571116462940&ref=m_notif', 'http://www.hrtchp.com/hrtctickets/Availability.aspx', 'http://gateoverflow.in/']
1 ['http://www.amazon.in/', 'https://m.facebook.com/messages/read/?tid=mid.1441536560961%3Af41b8aee8c76a6f223', 'https://m.facebook.com/photo.php?fbid=1374539555891053&id=100000050660083&set=a.159962404015447.44507.100000050660083&notif_t=like&notif_id=1480571116462940&ref=m_notif', 'http://www.hrtchp.com/hrtctickets/Availability.aspx', 'http://gateoverflow.in/']
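The hourly suggestion procedure described in this task can be sketched as follows. It assumes each visit has already been reduced to an (hour, topic, url) triple, with the topic being the visit's most probable one; the function name and the input layout are hypothetical, not taken from the authors' code.

```python
from collections import Counter, defaultdict


def suggest_by_hour(visits, top_n=5):
    """Map each hour to the top_n most frequent URLs of that hour's dominant topic.

    `visits` is an iterable of (hour, topic, url) triples, with hour in 0..23.
    """
    topic_counts_by_hour = defaultdict(Counter)
    url_counts_by_topic = defaultdict(Counter)
    for hour, topic, url in visits:
        topic_counts_by_hour[hour][topic] += 1
        url_counts_by_topic[topic][url] += 1  # URL frequency is counted across all hours

    suggestions = {}
    for hour, counts in topic_counts_by_hour.items():
        best_topic = counts.most_common(1)[0][0]
        ranked = url_counts_by_topic[best_topic].most_common(top_n)
        suggestions[hour] = [url for url, _ in ranked]
    return suggestions
```

Because URL frequencies are counted per topic rather than per hour, two hours that share the same dominant topic receive the same suggestion list, as hours 0 and 1 do in the results.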