Popularity Prediction on Twitter EE239AS Project 3
By: Aditya Rao (404434974), Vikas Amar Tikoo (204435535), Saurabh Trikande (604435562), Behnam Shahbazi (704355606)
Nov 16, 2015
Problem 1: Download the training tweet data and calculate these statistics for each hashtag: average number of tweets per hour, average number of followers of users posting the tweets, and average number of retweets. Plot "number of tweets in hour" over time for #SuperBowl and #NFL (a histogram with 1-hour bins). The tweets are stored in separate files for different hashtags and the files are named tweet_[#hashtag].txt. Each tweet file contains one tweet per line, and tweets are sorted with respect to their posting time. Each tweet is a JSON string that you can load in Python as a dictionary.
Starting from the earliest timestamp in each tweet_[#hashtag].txt file to the last one, we tracked the count of tweets, the follower counts of the tweeters, and the retweet counts. These counts were then used to calculate the average number of tweets per hour, the average number of followers of users posting the tweets, and the average number of retweets.
Hashtag        Avg. tweets per hour   Avg. followers   Avg. retweets
#gopatriots    23.0907                1602.07          1.40014
#gohawks       114.298                2393.6           2.01463
#nfl           167.326                4763.34          1.53854
#patriots      297.697                3641.7           1.78282
#sb49          733.102                10230.1          2.51115
#superbowl     857.992                9958.12          2.38827
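These statistics can be computed in a single pass over each tweet file. The sketch below assumes each JSON line exposes a posting timestamp, the author's follower count, and a retweet count; the key names used here (firstpost_date, author/followers, retweet_count) are placeholders, since the exact keys depend on the dataset's schema.

```python
import json

def hashtag_stats(lines):
    """Return (avg tweets/hour, avg followers, avg retweets) for a list
    of tweet JSON lines. Field names are schema-dependent placeholders."""
    timestamps, followers, retweets = [], [], []
    for line in lines:
        tweet = json.loads(line)
        timestamps.append(tweet["firstpost_date"])
        followers.append(tweet["author"]["followers"])
        retweets.append(tweet["retweet_count"])
    # Span of the data in hours, guarding against a degenerate zero span.
    hours = max((max(timestamps) - min(timestamps)) / 3600.0, 1.0)
    n = len(timestamps)
    return n / hours, sum(followers) / n, sum(retweets) / n
```

Called once per tweet_[#hashtag].txt file, this yields the three columns of the table above.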
Problem 2:
Fit a linear regression model using 5 features to predict the number of tweets in the next hour, with features extracted from tweet data in the previous hour. The features you should use are: number of tweets, total number of retweets, sum of the number of followers of users posting the hashtag, maximum number of followers among users posting the hashtag, and time of the day (which could take 24 values that represent hours of the day with respect to a given time reference). Explain your model's training accuracy and the significance of each feature using the t-test and p-value results of fitting the model.
For this problem, the independent features are: the number of tweets, total number of retweets, sum of followers, maximum number of followers, and hour of the day, all computed over the current hour. The variable to be predicted (the predictand) is the number of tweets in the next hour. We used the statsmodels package, as suggested by the professor: a linear regression model using Ordinary Least Squares (OLS) was fit on this set of features to predict the number of tweets in the next hour.
R-square value: the goodness of fit of the model is given by its R-square value, R^2 = 1 - SS_res / SS_tot, i.e., the fraction of the variance in the predicted variable that the model explains. The higher the R-square value, the better the regression model fits the training data.
P-value: this measures the statistical significance of each feature. For each coefficient, the null hypothesis is that the coefficient is zero, i.e., that the feature has no effect on the prediction. A small p-value (typically < 0.05) indicates strong evidence against the null hypothesis, so it is rejected and the feature is considered significant. A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so it cannot be rejected.
T-value: the t-statistic is the estimated coefficient divided by its standard error, so it measures how many standard errors the coefficient lies from zero. Features with large absolute t-values have small p-values, and vice versa.
OLS Regression Results for #gopatriots
Dep. Variable: y            R-squared: 0.608
Model: OLS                  Adj. R-squared: 0.605
Method: Least Squares       F-statistic: 210.2
Date: Fri, 20 Mar 2015      Prob (F-statistic): 4.03e-135
Time: 16:25:15              Log-Likelihood: -4533.8
No. Observations: 683       AIC: 9080.
Df Residuals: 677           BIC: 9107.
Df Model: 5

        coef         std err    t          P>|t|    [95.0% Conf. Int.]
const   11.4988      13.878     0.829      0.408    -15.750    38.748
x1      1.4818       0.124      11.945     0.000    1.238      1.725
x2      -7.101e-06   0.000      -0.067     0.947    -0.000     0.000
x3      -32.4000     1.844      -17.572    0.000    -36.020    -28.780
x4      5.407e-06    0.000      0.041      0.967    -0.000     0.000
x5      0.3197       1.030      0.310      0.756    -1.703     2.342

Omnibus: 1138.167           Durbin-Watson: 2.391
Prob(Omnibus): 0.000        Jarque-Bera (JB): 824584.907
Skew: 10.007                Prob(JB): 0.00
Kurtosis: 172.040           Cond. No. 9.18e+05
OLS Regression Results for #gohawks
Dep. Variable: y            R-squared: 0.609
Model: OLS                  Adj. R-squared: 0.607
Method: Least Squares       F-statistic: 301.0
Date: Fri, 20 Mar 2015      Prob (F-statistic): 3.63e-194
Time: 16:25:46              Log-Likelihood: -7630.2
No. Observations: 972       AIC: 1.527e+04
Df Residuals: 966           BIC: 1.530e+04
Df Model: 5

        coef        std err     t         P>|t|    [95.0% Conf. Int.]
const   66.8464     38.917      1.718     0.086    -9.525      143.217
x1      0.7296      0.090       8.067     0.000    0.552       0.907
x2      5.454e-05   5.01e-05    1.089     0.277    -4.38e-05   0.000
x3      0.0046      0.030       0.151     0.880    -0.055      0.064
x4      -0.0003     0.000       -3.029    0.003    -0.001      -0.000
x5      -0.1487     2.906       -0.051    0.959    -5.851      5.553

Omnibus: 935.689            Durbin-Watson: 2.235
Prob(Omnibus): 0.000        Jarque-Bera (JB): 2260025.080
Skew: 3.089                 Prob(JB): 0.00
Kurtosis: 239.146           Cond. No. 4.31e+06
OLS Regression Results for #nfl

Dep. Variable: y            R-squared: 0.765
Model: OLS                  Adj. R-squared: 0.763
Method: Least Squares       F-statistic: 598.1
Date: Fri, 20 Mar 2015      Prob (F-statistic): 4.06e-286
Time: 16:26:35              Log-Likelihood: -6734.6
No. Observations: 926       AIC: 1.348e+04
Df Residuals: 920           BIC: 1.351e+04
Df Model: 5

        coef        std err     t          P>|t|    [95.0% Conf. Int.]
const   66.0260     22.685      2.911      0.004    21.506      110.546
x1      0.6551      0.062       10.651     0.000    0.534       0.776
x2      8.851e-05   1.4e-05     6.301      0.000    6.09e-05    0.000
x3      -2.1411     0.142       -15.090    0.000    -2.420      -1.863
x4      -8.577e-05  1.98e-05    -4.328     0.000    -0.000      -4.69e-05
x5      -2.0316     1.664       -1.221     0.223    -5.298      1.235

Omnibus: 1153.074           Durbin-Watson: 2.151
Prob(Omnibus): 0.000        Jarque-Bera (JB): 292772.213
Skew: 6.056                 Prob(JB): 0.00
Kurtosis: 89.263            Cond. No. 7.66e+06
OLS Regression Results for #patriots
Dep. Variable: y            R-squared: 0.714
Model: OLS                  Adj. R-squared: 0.712
Method: Least Squares       F-statistic: 485.6
Date: Fri, 20 Mar 2015      Prob (F-statistic): 1.42e-261
Time: 16:27:59              Log-Likelihood: -8754.5
No. Observations: 980       AIC: 1.752e+04
Df Residuals: 974           BIC: 1.755e+04
Df Model: 5

        coef       std err     t         P>|t|    [95.0% Conf. Int.]
const   90.6135    115.171     0.787     0.432    -135.397    316.624
x1      1.0156     0.029       35.254    0.000    0.959       1.072
x2      -0.0001    1.38e-05    -7.940    0.000    -0.000      -8.24e-05
x3      -0.4754    0.196       -2.429    0.015    -0.860      -0.091
x4      0.0004     6.73e-05    6.329     0.000    0.000       0.001
x5      -3.4832    8.492       -0.410    0.682    -20.149     13.182

Omnibus: 1765.278           Durbin-Watson: 1.904
Prob(Omnibus): 0.000        Jarque-Bera (JB): 1699060.510
Skew: 12.227                Prob(JB): 0.00
Kurtosis: 205.513           Cond. No. 1.76e+07
OLS Regression Results for #sb49
Dep. Variable: y            R-squared: 0.841
Model: OLS                  Adj. R-squared: 0.840
Method: Least Squares       F-statistic: 610.9
Date: Fri, 20 Mar 2015      Prob (F-statistic): 1.52e-227
Time: 16:30:23              Log-Likelihood: -5653.1
No. Observations: 582       AIC: 1.132e+04
Df Residuals: 576           BIC: 1.134e+04
Df Model: 5

        coef        std err     t         P>|t|    [95.0% Conf. Int.]
const   96.3861     327.335     0.294     0.769    -546.530    739.302
x1      0.9662      0.029       32.873    0.000    0.908       1.024
x2      -1.182e-05  3.69e-06    -3.203    0.001    -1.91e-05   -4.57e-06
x3      -0.4478     0.120       -3.739    0.000    -0.683      -0.213
x4      0.0003      4.4e-05     5.805     0.000    0.000       0.000
x5      -24.7930    24.256      -1.022    0.307    -72.434     22.848

Omnibus: 971.949            Durbin-Watson: 1.416
Prob(Omnibus): 0.000        Jarque-Bera (JB): 756009.544
Skew: 9.785                 Prob(JB): 0.00
Kurtosis: 178.478           Cond. No. 1.70e+08
OLS Regression Results for #superbowl

Dep. Variable: y            R-squared: 0.835
Model: OLS                  Adj. R-squared: 0.834
Method: Least Squares       F-statistic: 965.2
Date: Fri, 20 Mar 2015      Prob (F-statistic): 0.00
Time: 16:44:26              Log-Likelihood: -9685.4
No. Observations: 962       AIC: 1.938e+04
Df Residuals: 956           BIC: 1.941e+04
Df Model: 5

        coef        std err     t          P>|t|    [95.0% Conf. Int.]
const   -198.4680   365.687     -0.543     0.587    -916.109    519.173
x1      1.0006      0.148       6.739      0.000    0.709       1.292
x2      4.365e-05   2.12e-05    2.058      0.040    2.03e-06    8.53e-05
x3      -5.3961     0.187       -28.851    0.000    -5.763      -5.029
x4      0.0003      9.05e-05    3.699      0.000    0.000       0.001
x5      10.1494     26.746      0.379      0.704    -42.339     62.638

Omnibus: 1403.228           Durbin-Watson: 1.684
Prob(Omnibus): 0.000        Jarque-Bera (JB): 657649.721
Skew: 8.025                 Prob(JB): 0.00
Kurtosis: 130.081           Cond. No. nan
Summary of R-square, p-values, and t-values (features x1 through x5) for each hashtag:

#gopatriots (R-square: 0.608)
  p-values: 5.61560345e-30, 9.46594814e-01, 3.13770111e-57, 9.67025772e-01, 7.56407916e-01
  t-values: 11.94473465, -0.06700838, -17.57219055, 0.04135413, 0.31032686

#gohawks (R-square: 0.609)
  p-values: 2.13058797e-15, 2.76514655e-01, 8.80116134e-01, 2.51562066e-03, 9.59202008e-01
  t-values: 8.06664146, 1.08879771, 0.15086192, -3.02939581, -0.05116828

#nfl (R-square: 0.765)
  p-values: 4.65069358e-25, 4.56544488e-10, 3.87017186e-46, 1.66724655e-05, 2.22531619e-01
  t-values: 10.65072715, 6.30149599, -15.09027989, -4.32830944, -1.22064922

#patriots (R-square: 0.714)
  p-values: 3.88634292e-176, 5.53261083e-15, 1.53168004e-02, 3.76858392e-10, 6.81783371e-01
  t-values: 35.25377308, -7.94034668, -2.42908949, 6.32857316, -0.41015374

#sb49 (R-square: 0.841)
  p-values: 3.01818294e-134, 1.43358226e-03, 2.02823457e-04, 1.06124683e-08, 3.07146570e-01
  t-values: 32.87271467, -3.20331518, -3.73948586, 5.80544251, -1.02213401

#superbowl (R-square: 0.835)
  p-values: 2.75802134e-11, 3.98229394e-02, 3.61804111e-132, 2.28435997e-04, 7.04424847e-01
  t-values: 6.73870583, 2.05839346, -28.85125897, 3.69939775, 0.37946772
Problem 3: Design a regression model using any features from the paper or other new features you may find useful for this problem. Fit your model on the data and report fitting accuracy and significance of variables. For the top 3 features in your measurements, draw a scatter plot of predictand (number of tweets for next hour) versus feature value, using all the samples you have extracted.
We used a combination of the following features:
1. Cumulative favourites count
2. Cumulative friends_count
3. Sum of the number of followers of the original_author
4. Cumulative followers_count
5. Cumulative url_count
6. retweet_count
7. Cumulative number of references (@) in each tweet
8. Tweet user followers count
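The feature set above can be aggregated into hourly windows with a single pass over the tweets. In the sketch below each tweet has already been reduced to a small record; the field names (time, favourites, friends, followers, urls, retweets, mentions) are illustrative stand-ins, not the dataset's actual JSON keys.

```python
from collections import defaultdict

def build_hourly_features(tweets, t0):
    """Aggregate per-hour sums of favourites, friends, followers, urls,
    retweets and @-mentions; returns {hour_index: feature dict}."""
    hours = defaultdict(lambda: {"favourites": 0, "friends": 0, "followers": 0,
                                 "urls": 0, "retweets": 0, "mentions": 0})
    for tw in tweets:
        h = int((tw["time"] - t0) // 3600)  # hour bucket relative to start t0
        bucket = hours[h]
        bucket["favourites"] += tw["favourites"]
        bucket["friends"] += tw["friends"]
        bucket["followers"] += tw["followers"]
        bucket["urls"] += tw["urls"]
        bucket["retweets"] += tw["retweets"]
        bucket["mentions"] += tw["mentions"]
    return dict(hours)
```

Each hour's dict becomes one feature row, with the next hour's tweet count as the target.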
Results for each hashtag:

superbowl:
Dep. Variable: y            R-squared: 0.944
Model: OLS                  Adj. R-squared: 0.943
Method: Least Squares       F-statistic: 1991.
Date: Fri, 20 Mar 2015      Prob (F-statistic): 0.00
Time: 19:01:05              Log-Likelihood: -9168.4
No. Observations: 962       AIC: 1.835e+04
Df Residuals: 953           BIC: 1.840e+04
Df Model: 8

        coef        std err     t          P>|t|    [95.0% Conf. Int.]
const   -178.9265   120.510     -1.485     0.138    -415.422    57.569
x1      -2.4110     0.393       -6.128     0.000    -3.183      -1.639
x2      -0.0039     0.000       -14.151    0.000    -0.004      -0.003
x3      0.0030      0.000       13.116     0.000    0.003       0.004
x4      -4.269e-05  1.33e-05    -3.216     0.001    -6.87e-05   -1.66e-05
x5      2.1702      0.300       7.237      0.000    1.582       2.759
x6      8.5863      0.240       35.752     0.000    8.115       9.058
x7      -0.0001     5.02e-05    -2.484     0.013    -0.000      -2.62e-05
x8      -2.2761     0.141       -16.198    0.000    -2.552      -2.000

Omnibus: 1173.053           Durbin-Watson: 1.971
Prob(Omnibus): 0.000        Jarque-Bera (JB): 556969.469
Skew: 5.575                 Prob(JB): 0.00
Kurtosis: 120.350           Cond. No. 1.08e+08

sb49:

Dep. Variable: y            R-squared: 0.907
Model: OLS                  Adj. R-squared: 0.906
Method: Least Squares       F-statistic: 698.2
Date: Fri, 20 Mar 2015      Prob (F-statistic): 1.00e-289
Time: 18:56:48              Log-Likelihood: -5497.8
No. Observations: 582       AIC: 1.101e+04
Df Residuals: 573           BIC: 1.105e+04
Df Model: 8

        coef        std err     t          P>|t|    [95.0% Conf. Int.]
const   -437.4056   144.302     -3.031     0.003    -720.830    -153.981
x1      -6.3339     0.399       -15.857    0.000    -7.118      -5.549
x2      0.0042      0.000       8.476      0.000    0.003       0.005
x3      0.0001      8.2e-05     1.776      0.076    -1.54e-05   0.000
x4      0.0001      1.61e-05    6.524      0.000    7.33e-05    0.000
x5      2.5526      0.307       8.320      0.000    1.950       3.155
x6      0.0579      0.615       0.094      0.925    -1.150      1.266
x7      -6.148e-05  4.18e-05    -1.471     0.142    -0.000      2.06e-05
x8      -0.2170     0.095       -2.288     0.023    -0.403      -0.031

Omnibus: 989.474            Durbin-Watson: 1.687
Prob(Omnibus): 0.000        Jarque-Bera (JB): 952094.822
Skew: 10.064                Prob(JB): 0.00
Kurtosis: 200.121           Cond. No. 1.00e+08
patriots:
Dep. Variable: y            R-squared: 0.762
Model: OLS                  Adj. R-squared: 0.760
Method: Least Squares       F-statistic: 388.3
Date: Fri, 20 Mar 2015      Prob (F-statistic): 2.42e-296
Time: 18:31:00              Log-Likelihood: -8664.3
No. Observations: 980       AIC: 1.735e+04
Df Residuals: 971           BIC: 1.739e+04
Df Model: 8

        coef       std err     t         P>|t|    [95.0% Conf. Int.]
const   -41.0854   61.195      -0.671    0.502    -161.175    79.004
x1      -2.3666    0.319       -7.410    0.000    -2.993      -1.740
x2      -0.0003    0.000       -0.938    0.348    -0.001      0.000
x3      0.0006     8.4e-05     7.565     0.000    0.000       0.001
x4      0.0005     5.29e-05    9.111     0.000    0.000       0.001
x5      -1.3322    0.331       -4.023    0.000    -1.982      -0.682
x6      3.9433     0.544       7.250     0.000    2.876       5.011
x7      -0.0006    9.92e-05    -5.695    0.000    -0.001      -0.000
x8      -0.4433    0.179       -2.471    0.014    -0.795      -0.091

Omnibus: 1529.777           Durbin-Watson: 1.878
Prob(Omnibus): 0.000        Jarque-Bera (JB): 948872.977
Skew: 9.126                 Prob(JB): 0.00
Kurtosis: 154.343           Cond. No. 1.24e+07
nfl:
Dep. Variable: y            R-squared: 0.797
Model: OLS                  Adj. R-squared: 0.795
Method: Least Squares       F-statistic: 449.5
Date: Fri, 20 Mar 2015      Prob (F-statistic): 3.79e-311
Time: 18:31:52              Log-Likelihood: -6666.7
No. Observations: 926       AIC: 1.335e+04
Df Residuals: 917           BIC: 1.339e+04
Df Model: 8

        coef        std err     t          P>|t|    [95.0% Conf. Int.]
const   49.7828     13.374      3.722      0.000    23.536      76.030
x1      -0.6894     0.175       -3.935     0.000    -1.033      -0.346
x2      -9.57e-05   0.000       -0.945     0.345    -0.000      0.000
x3      0.0006      0.000       4.590      0.000    0.000       0.001
x4      3.722e-05   1.16e-05    3.198      0.001    1.44e-05    6.01e-05
x5      3.0882      0.381       8.116      0.000    2.341       3.835
x6      0.5683      0.136       4.183      0.000    0.302       0.835
x7      -4.286e-05  1.71e-05    -2.502     0.013    -7.65e-05   -9.25e-06
x8      -1.5556     0.154       -10.090    0.000    -1.858      -1.253

Omnibus: 722.402            Durbin-Watson: 2.117
Prob(Omnibus): 0.000        Jarque-Bera (JB): 159610.118
Skew: 2.580                 Prob(JB): 0.00
Kurtosis: 67.110            Cond. No. 4.83e+06
gohawks:
Dep. Variable: y            R-squared: 0.762
Model: OLS                  Adj. R-squared: 0.760
Method: Least Squares       F-statistic: 385.9
Date: Fri, 20 Mar 2015      Prob (F-statistic): 3.37e-294
Time: 18:32:29              Log-Likelihood: -7388.6
No. Observations: 972       AIC: 1.480e+04
Df Residuals: 963           BIC: 1.484e+04
Df Model: 8

        coef        std err     t          P>|t|    [95.0% Conf. Int.]
const   22.5795     17.360      1.301      0.194    -11.489     56.648
x1      -1.7203     0.204       -8.447     0.000    -2.120      -1.321
x2      0.0006      0.000       3.148      0.002    0.000       0.001
x3      0.0010      8.81e-05    11.038     0.000    0.001       0.001
x4      -7.143e-05  3.13e-05    -2.279     0.023    -0.000      -9.94e-06
x5      -1.8436     0.237       -7.780     0.000    -2.309      -1.379
x6      2.8752      0.239       12.031     0.000    2.406       3.344
x7      -2.978e-05  6.07e-05    -0.490     0.624    -0.000      8.94e-05
x8      0.0152      0.024       0.631      0.528    -0.032      0.062

Omnibus: 1745.159           Durbin-Watson: 2.092
Prob(Omnibus): 0.000        Jarque-Bera (JB): 4133794.940
Skew: 11.647                Prob(JB): 0.00
Kurtosis: 321.632           Cond. No. 3.69e+06
gopatriots:
Dep. Variable: y            R-squared: 0.745
Model: OLS                  Adj. R-squared: 0.742
Method: Least Squares       F-statistic: 246.2
Date: Fri, 20 Mar 2015      Prob (F-statistic): 2.62e-194
Time: 18:32:34              Log-Likelihood: -4387.1
No. Observations: 683       AIC: 8792.
Df Residuals: 674           BIC: 8833.
Df Model: 8

        coef       std err    t          P>|t|    [95.0% Conf. Int.]
const   -6.5901    6.058      -1.088     0.277    -18.484    5.304
x1      4.3425     0.315      13.784     0.000    3.724      4.961
x2      0.0010     0.000      2.541      0.011    0.000      0.002
x3      -0.0020    0.000      -11.348    0.000    -0.002     -0.002
x4      -0.0012    0.000      -7.701     0.000    -0.001     -0.001
x5      4.9909     0.598      8.339      0.000    3.816      6.166
x6      -1.1720    0.542      -2.164     0.031    -2.235     -0.109
x7      0.0011     0.000      7.013      0.000    0.001      0.001
x8      -11.7934   2.000      -5.897     0.000    -15.720    -7.867

Omnibus: 910.949            Durbin-Watson: 2.242
Prob(Omnibus): 0.000        Jarque-Bera (JB): 564206.178
Skew: 6.312                 Prob(JB): 0.00
Kurtosis: 143.237           Cond. No. 7.50e+05
Problem 4: Split the feature data (your set of (features, predictand) pairs for windows) into 10 parts to perform cross-validation. Run 10 tests, each time fitting your model on 9 parts and predicting the number of tweets for the 1 remaining part. Calculate the average prediction error |N_predicted - N_real| over samples in the remaining part, and then average these values over the 10 tests. Since we know the Super Bowl's date and time, we can create different regression models for different periods of time: first, when the hashtags haven't become very active; second, their active period; and third, after they pass their high-activity time. Train 3 regression models for these time periods (all times in PST): 1. Before Feb. 1, 8:00 a.m. 2. Between Feb. 1, 8:00 a.m. and 8:00 p.m. 3. After Feb. 1, 8:00 p.m.
Report cross-validation errors for the 3 different models. Note that you should do the 90-10% splitting for each model within its specific time window, i.e., only use data within one of the 3 periods for training and testing each time, so for each period you will run 10 tests.
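The 10-fold procedure above can be sketched as follows, here with a plain least-squares fit via numpy for self-containment (the report's models use statsmodels, but the splitting and error averaging are the same):

```python
import numpy as np

def cv_mean_abs_error(X, y, n_folds=10):
    """10-fold cross-validation of an OLS fit (via np.linalg.lstsq),
    returning the average |N_predicted - N_real| over the folds."""
    n = len(y)
    idx = np.arange(n)
    folds = np.array_split(idx, n_folds)
    errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        # Prepend an intercept column, fit on the 9 training parts.
        Xtr = np.column_stack([np.ones(len(train)), X[train]])
        Xte = np.column_stack([np.ones(len(fold)), X[fold]])
        beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        errors.append(np.mean(np.abs(Xte @ beta - y[fold])))
    return float(np.mean(errors))
```

Run separately on the feature windows of each of the 3 periods, this yields the per-period errors in the table below.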
Time window                        Average |N_predicted - N_real|
Entire set                         2465.29948811
Before Feb 1, 8 a.m.               366.07366939
Between Feb 1, 8 a.m. and 8 p.m.   74546.9344882
After Feb 1, 8 p.m.                903.228146426
Problem 5: Download the test data and run your model to make predictions for the next hour in each case. Each file in the test data contains a hashtag's tweets for a 6-hour window. The file name shows the sample number followed by the period number the data is from, e.g., a file named sample5_period2.txt contains tweets for a 6-hour window that lies in the 2nd time period described in Problem 4. Report your predicted number of tweets for the next hour of each sample window.
We ran the model over the given test data and the prediction results are shown in the table below.
Test File          Hour 1           Hour 2           Hour 3           Hour 4           Hour 5           Hour 6
sample1_period1    164.23806279     132.34524248     43.56582638      110.01934521     135.89274546     182.13761449
sample2_period2    61360.95911616   65750.4988655    72678.45760117   93980.1620378    173371.25716328  201042.82972879
sample3_period3    450.07928674     381.68793746     507.21787718     712.08603085     581.90945857     433.69485984
sample4_period1    386.48481377     292.73741923     105.81407976     108.47630511     139.0595091      142.9422123
sample5_period1    295.99683053     187.15690446     198.00202464     174.99212932     188.41977493     150.94818959
sample6_period2    39699.26796418   71172.18908046   161267.20149039  153584.18795669  132269.81772466  134482.55627018
sample7_period3    147.8020772      100.51927159     804.36872147     786.48359354     771.11762266     796.86308636
sample8_period1    889.94925098     890.08823136     1102.94303293    1076.40527254    1010.36838112    883.50427938
sample9_period2    51834.51856431   57096.8663199    62192.66699246   53997.62054499   72576.15507135   81059.09345971
sample10_period3   588.94669065     568.43356111     553.0441436      535.31943703     516.56800024     499.70149696
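Prediction for a test window amounts to choosing the model for the window's period and applying it to each hour's feature vector. The models mapping below is a hypothetical stand-in for the three fitted regressions of Problem 4:

```python
import numpy as np

def predict_next_hour(features, period, models):
    """Predict the next hour's tweet count from one hour's feature vector.

    models maps period number -> (intercept, coefficient vector); here these
    are placeholders for the three period-specific regressions.
    """
    intercept, coefs = models[period]
    return intercept + float(np.dot(coefs, features))
```

For a sampleN_periodP file, each of its 6 hourly feature vectors is passed through the period-P model to produce the table above.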