Popularity Prediction on Twitter EE239AS Project 3
By: Aditya Rao (404434974), Vikas Amar Tikoo (204435535), Saurabh Trikande (604435562), Behnam Shahbazi (704355606)
Nov 16, 2015
Problem 1: Download the training tweet data and calculate these statistics for each hashtag: average number of tweets per hour, average number of followers of users posting the tweets, and average number of retweets. Plot "number of tweets in hour" over time for #SuperBowl and #NFL (a histogram with 1-hour bins). The tweets are stored in separate files for different hashtags and the files are named tweet_[#hashtag].txt. Each tweet file contains one tweet per line, and tweets are sorted with respect to their posting time. Each tweet is a JSON string that you can load in Python as a dictionary.
Starting from the earliest timestamp in each tweet_[#hashtag].txt file to the last one, we tracked the count of tweets, the follower counts of the tweeters, and the retweet counts. These counts were then used to calculate the average number of tweets per hour, the average number of followers of users posting the tweets, and the average number of retweets.
Hashtag        Avg. tweets per hour   Avg. followers   Avg. retweets
#gopatriots    23.0907                1602.07          1.40014
#gohawks       114.298                2393.6           2.01463
#nfl           167.326                4763.34          1.53854
#patriots      297.697                3641.7           1.78282
#sb49          733.102                10230.1          2.51115
#superbowl     857.992                9958.12          2.38827
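These statistics can be computed in a single pass over each tweet file. The sketch below assumes each JSON line exposes a posting timestamp, the author's follower count, and a retweet count; the key names used here (firstpost_date, author/followers, retweet_count) are placeholders, since the exact keys depend on the dataset's schema.

```python
import json

def hashtag_stats(lines):
    """Return (avg tweets/hour, avg followers, avg retweets) for a list
    of tweet JSON lines. Field names are schema-dependent placeholders."""
    timestamps, followers, retweets = [], [], []
    for line in lines:
        tweet = json.loads(line)
        timestamps.append(tweet["firstpost_date"])
        followers.append(tweet["author"]["followers"])
        retweets.append(tweet["retweet_count"])
    # Span of the data in hours, guarding against a degenerate zero span.
    hours = max((max(timestamps) - min(timestamps)) / 3600.0, 1.0)
    n = len(timestamps)
    return n / hours, sum(followers) / n, sum(retweets) / n
```

Called once per tweet_[#hashtag].txt file, this yields the three columns of the table above.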
Problem 2:
Fit a linear regression model using 5 features to predict the number of tweets in the next hour, with features extracted from tweet data in the previous hour. The features you should use are: number of tweets, total number of retweets, sum of the number of followers of users posting the hashtag, maximum number of followers among users posting the hashtag, and time of the day (which could take 24 values that represent hours of the day with respect to a given time reference). Explain your model's training accuracy and the significance of each feature using the t-test and p-value results of fitting the model.
For this problem, the independent features are: the number of tweets, total number of retweets, sum of followers, maximum number of followers, and hour of the day, all computed over the current hour. The variable to be predicted (the predictand) is the number of tweets in the next hour. We used the statsmodels package, as suggested by the professor: a linear regression model using Ordinary Least Squares (OLS) was fit on this set of features to predict the number of tweets in the next hour.
R-square value: the goodness of fit of the model is given by its R-square value, R^2 = 1 - SS_res / SS_tot, i.e., the fraction of the variance in the predicted variable that the model explains. The higher the R-square value, the better the regression model fits the training data.
P-value: this measures the statistical significance of each feature. For each coefficient, the null hypothesis is that the coefficient is zero, i.e., that the feature has no effect on the prediction. A small p-value (typically < 0.05) indicates strong evidence against the null hypothesis, so it is rejected and the feature is considered significant. A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so it cannot be rejected.
T-value: the t-statistic is the estimated coefficient divided by its standard error, so it measures how many standard errors the coefficient lies from zero. Features with large absolute t-values have small p-values, and vice versa.
OLS Regression Results for #gopatriots
Dep. Variable: y            R-squared: 0.608
Model: OLS                  Adj. R-squared: 0.605
Method: Least Squares       F-statistic: 210.2
Date: Fri, 20 Mar 2015      Prob (F-statistic): 4.03e-135
Time: 16:25:15              Log-Likelihood: -4533.8
No. Observations: 683       AIC: 9080.
Df Residuals: 677           BIC: 9107.
Df Model: 5

        coef         std err    t          P>|t|    [95.0% Conf. Int.]
const   11.4988      13.878     0.829      0.408    -15.750    38.748
x1      1.4818       0.124      11.945     0.000    1.238      1.725
x2      -7.101e-06   0.000      -0.067     0.947    -0.000     0.000
x3      -32.4000     1.844      -17.572    0.000    -36.020    -28.780
x4      5.407e-06    0.000      0.041      0.967    -0.000     0.000
x5      0.3197       1.030      0.310      0.756    -1.703     2.342

Omnibus: 1138.167           Durbin-Watson: 2.391
Prob(Omnibus): 0.000        Jarque-Bera (JB): 824584.907
Skew: 10.007                Prob(JB): 0.00
Kurtosis: 172.040           Cond. No. 9.18e+05
OLS Regression Results for #gohawks
Dep. Variable: y            R-squared: 0.609
Model: OLS                  Adj. R-squared: 0.607
Method: Least Squares       F-statistic: 301.0
Date: Fri, 20 Mar 2015      Prob (F-statistic): 3.63e-194
Time: 16:25:46              Log-Likelihood: -7630.2
No. Observations: 972       AIC: 1.527e+04
Df Residuals: 966           BIC: 1.530e+04
Df Model: 5

        coef        std err     t         P>|t|    [95.0% Conf. Int.]
const   66.8464     38.917      1.718     0.086    -9.525      143.217
x1      0.7296      0.090       8.067     0.000    0.552       0.907
x2      5.454e-05   5.01e-05    1.089     0.277    -4.38e-05   0.000
x3      0.0046      0.030       0.151     0.880    -0.055      0.064
x4      -0.0003     0.000       -3.029    0.003    -0.001      -0.000
x5      -0.1487     2.906       -0.051    0.959    -5.851      5.553

Omnibus: 935.689            Durbin-Watson: 2.235
Prob(Omnibus): 0.000        Jarque-Bera (JB): 2260025.080
Skew: 3.089                 Prob(JB): 0.00
Kurtosis: 239.146           Cond. No. 4.31e+06
OLS Regression Results for #nfl

Dep. Variable: y            R-squared: 0.765
Model: OLS                  Adj. R-squared: 0.763
Method: Least Squares       F-statistic: 598.1
Date: Fri, 20 Mar 2015      Prob (F-statistic): 4.06e-286
Time: 16:26:35              Log-Likelihood: -6734.6
No. Observations: 926       AIC: 1.348e+04
Df Residuals: 920           BIC: 1.351e+04
Df Model: 5

        coef        std err     t          P>|t|    [95.0% Conf. Int.]
const   66.0260     22.685      2.911      0.004    21.506      110.546
x1      0.6551      0.062       10.651     0.000    0.534       0.776
x2      8.851e-05   1.4e-05     6.301      0.000    6.09e-05    0.000
x3      -2.1411     0.142       -15.090    0.000    -2.420      -1.863
x4      -8.577e-05  1.98e-05    -4.328     0.000    -0.000      -4.69e-05
x5      -2.0316     1.664       -1.221     0.223    -5.298      1.235

Omnibus: 1153.074           Durbin-Watson: 2.151
Prob(Omnibus): 0.000        Jarque-Bera (JB): 292772.213
Skew: 6.056                 Prob(JB): 0.00
Kurtosis: 89.263            Cond. No. 7.66e+06
OLS Regression Results for #patriots
Dep. Variable: y            R-squared: 0.714
Model: OLS                  Adj. R-squared: 0.712
Method: Least Squares       F-statistic: 485.6
Date: Fri, 20 Mar 2015      Prob (F-statistic): 1.42e-261
Time: 16:27:59              Log-Likelihood: -8754.5
No. Observations: 980       AIC: 1.752e+04
Df Residuals: 974           BIC: 1.755e+04
Df Model: 5

        coef       std err     t         P>|t|    [95.0% Conf. Int.]
const   90.6135    115.171     0.787     0.432    -135.397    316.624
x1      1.0156     0.029       35.254    0.000    0.959       1.072
x2      -0.0001    1.38e-05    -7.940    0.000    -0.000      -8.24e-05
x3      -0.4754    0.196       -2.429    0.015    -0.860      -0.091
x4      0.0004     6.73e-05    6.329     0.000    0.000       0.001
x5      -3.4832    8.492       -0.410    0.682    -20.149     13.182

Omnibus: 1765.278           Durbin-Watson: 1.904
Prob(Omnibus): 0.000        Jarque-Bera (JB): 1699060.510
Skew: 12.227                Prob(JB): 0.00
Kurtosis: 205.513           Cond. No. 1.76e+07
OLS Regression Results for #sb49
Dep. Variable: y            R-squared: 0.841
Model: OLS                  Adj. R-squared: 0.840
Method: Least Squares       F-statistic: 610.9
Date: Fri, 20 Mar 2015      Prob (F-statistic): 1.52e-227
Time: 16:30:23              Log-Likelihood: -5653.1
No. Observations: 582       AIC: 1.132e+04
Df Residuals: 576           BIC: 1.134e+04
Df Model: 5

        coef        std err     t         P>|t|    [95.0% Conf. Int.]
const   96.3861     327.335     0.294     0.769    -546.530    739.302
x1      0.9662      0.029       32.873    0.000    0.908       1.024
x2      -1.182e-05  3.69e-06    -3.203    0.001    -1.91e-05   -4.57e-06
x3      -0.4478     0.120       -3.739    0.000    -0.683      -0.213
x4      0.0003      4.4e-05     5.805     0.000    0.000       0.000
x5      -24.7930    24.256      -1.022    0.307    -72.434     22.848

Omnibus: 971.949            Durbin-Watson: 1.416
Prob(Omnibus): 0.000        Jarque-Bera (JB): 756009.544
Skew: 9.785                 Prob(JB): 0.00
Kurtosis: 178.478           Cond. No. 1.70e+08
OLS Regression Results for #superbowl

Dep. Variable: y            R-squared: 0.835
Model: OLS                  Adj. R-squared: 0.834
Method: Least Squares       F-statistic: 965.2
Date: Fri, 20 Mar 2015      Prob (F-statistic): 0.00
Time: 16:44:26              Log-Likelihood: -9685.4
No. Observations: 962       AIC: 1.938e+04
Df Residuals: 956           BIC: 1.941e+04
Df Model: 5

        coef        std err     t          P>|t|    [95.0% Conf. Int.]
const   -198.4680   365.687     -0.543     0.587    -916.109    519.173
x1      1.0006      0.148       6.739      0.000    0.709       1.292
x2      4.365e-05   2.12e-05    2.058      0.040    2.03e-06    8.53e-05
x3      -5.3961     0.187       -28.851    0.000    -5.763      -5.029
x4      0.0003      9.05e-05    3.699      0.000    0.000       0.001
x5      10.1494     26.746      0.379      0.704    -42.339     62.638

Omnibus: 1403.228           Durbin-Watson: 1.684
Prob(Omnibus): 0.000        Jarque-Bera (JB): 657649.721
Skew: 8.025                 Prob(JB): 0.00
Kurtosis: 130.081           Cond. No. nan
Summary of R-square, p-values, and t-values (features x1 through x5) for each hashtag:

#gopatriots (R-square: 0.608)
  p-values: 5.61560345e-30, 9.46594814e-01, 3.13770111e-57, 9.67025772e-01, 7.56407916e-01
  t-values: 11.94473465, -0.06700838, -17.57219055, 0.04135413, 0.31032686

#gohawks (R-square: 0.609)
  p-values: 2.13058797e-15, 2.76514655e-01, 8.80116134e-01, 2.51562066e-03, 9.59202008e-01
  t-values: 8.06664146, 1.08879771, 0.15086192, -3.02939581, -0.05116828

#nfl (R-square: 0.765)
  p-values: 4.65069358e-25, 4.56544488e-10, 3.87017186e-46, 1.66724655e-05, 2.22531619e-01
  t-values: 10.65072715, 6.30149599, -15.09027989, -4.32830944, -1.22064922

#patriots (R-square: 0.714)
  p-values: 3.88634292e-176, 5.53261083e-15, 1.53168004e-02, 3.76858392e-10, 6.81783371e-01
  t-values: 35.25377308, -7.94034668, -2.42908949, 6.32857316, -0.41015374

#sb49 (R-square: 0.841)
  p-values: 3.01818294e-134, 1.43358226e-03, 2.02823457e-04, 1.06124683e-08, 3.07146570e-01
  t-values: 32.87271467, -3.20331518, -3.73948586, 5.80544251, -1.02213401

#superbowl (R-square: 0.835)
  p-values: 2.75802134e-11, 3.98229394e-02, 3.61804111e-132, 2.28435997e-04, 7.04424847e-01
  t-values: 6.73870583, 2.05839346, -28.85125897, 3.69939775, 0.37946772
Problem 3: Design a regression model using any features from the paper or other new features you may find useful for this problem. Fit your model on the data and report fitting accuracy and significance of variables. For the top 3 features in your measurements, draw a scatter plot of predictand (number of tweets for next hour) versus feature value, using all the samples you have extracted.
We used a combination of the following features:
1. Cumulative favourites count
2. Cumulative friends_count
3. Sum of the number of followers of the original_author
4. Cumulative followers_count
5. Cumulative url_count
6. retweet_count
7. Cumulative number of references (@) in each tweet
8. Tweet user followers count
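The feature set above can be aggregated into hourly windows with a single pass over the tweets. In the sketch below each tweet has already been reduced to a small record; the field names (time, favourites, friends, followers, urls, retweets, mentions) are illustrative stand-ins, not the dataset's actual JSON keys.

```python
from collections import defaultdict

def build_hourly_features(tweets, t0):
    """Aggregate per-hour sums of favourites, friends, followers, urls,
    retweets and @-mentions; returns {hour_index: feature dict}."""
    hours = defaultdict(lambda: {"favourites": 0, "friends": 0, "followers": 0,
                                 "urls": 0, "retweets": 0, "mentions": 0})
    for tw in tweets:
        h = int((tw["time"] - t0) // 3600)  # hour bucket relative to start t0
        bucket = hours[h]
        bucket["favourites"] += tw["favourites"]
        bucket["friends"] += tw["friends"]
        bucket["followers"] += tw["followers"]
        bucket["urls"] += tw["urls"]
        bucket["retweets"] += tw["retweets"]
        bucket["mentions"] += tw["mentions"]
    return dict(hours)
```

Each hour's dict becomes one feature row, with the next hour's tweet count as the target.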
Results for each hashtag:

superbowl:
Dep. Variable: y            R-squared: 0.944
Model: OLS                  Adj. R-squared: 0.943
Method: Least Squares       F-statistic: 1991.
Date: Fri, 20 Mar 2015      Prob (F-statistic): 0.00
Time: 19:01:05              Log-Likelihood: -9168.4
No. Observations: 962       AIC: 1.835e+04
Df Residuals: 953           BIC: 1.840e+04
Df Model: 8

        coef        std err     t          P>|t|    [95.0% Conf. Int.]
const   -178.9265   120.510     -1.485     0.138    -415.422    57.569
x1      -2.4110     0.393       -6.128     0.000    -3.183      -1.639
x2      -0.0039     0.000       -14.151    0.000    -0.004      -0.003
x3      0.0030      0.000       13.116     0.000    0.003       0.004
x4      -4.269e-05  1.33e-05    -3.216     0.001    -6.87e-05   -1.66e-05
x5      2.1702      0.300       7.237      0.000    1.582       2.759
x6      8.5863      0.240       35.752     0.000    8.115       9.058
x7      -0.0001     5.02e-05    -2.484     0.013    -0.000      -2.62e-05
x8      -2.2761     0.141       -16.198    0.000    -2.552      -2.000

Omnibus: 1173.053           Durbin-Watson: 1.971
Prob(Omnibus): 0.000        Jarque-Bera (JB): 556969.469
Skew: 5.575                 Prob(JB): 0.00
Kurtosis: 120.350           Cond. No. 1.08e+08

sb49:

Dep. Variable: y            R-squared: 0.907
Model: OLS                  Adj. R-squared: 0.906
Method: Least Squares       F-statistic: 698.2
Date: Fri, 20 Mar 2015      Prob (F-statistic): 1.00e-289
Time: 18:56:48              Log-Likelihood: -5497.8
No. Observations: 582       AIC: 1.101e+04
Df Residuals: 573           BIC: 1.105e+04
Df Model: 8

        coef        std err     t          P>|t|    [95.0% Conf. Int.]
const   -437.4056   144.302     -3.031     0.003    -720.830    -153.981
x1      -6.3339     0.399       -15.857    0.000    -7.118      -5.549
x2      0.0042      0.000       8.476      0.000    0.003       0.005
x3      0.0001      8.2e-05     1.776      0.076    -1.54e-05   0.000
x4      0.0001      1.61e-05    6.524      0.000    7.33e-05    0.000
x5      2.5526      0.307       8.320      0.000    1.950       3.155
x6      0.0579      0.615       0.094      0.925    -1.150      1.266
x7      -6.148e-05  4.18e-05    -1.471     0.142    -0.000      2.06e-05
x8      -0.2170     0.095       -2.288     0.023    -0.403      -0.031

Omnibus: 989.474            Durbin-Watson: 1.687
Prob(Omnibus): 0.000        Jarque-Bera (JB): 952094.822
Skew: 10.064                Prob(JB): 0.00
Kurtosis: 200.121           Cond. No. 1.00e+08
patriots:
Dep. Variable: y            R-squared: 0.762
Model: OLS                  Adj. R-squared: 0.760
Method: Least Squares       F-statistic: 388.3
Date: Fri, 20 Mar 2015      Prob (F-statistic): 2.42e-296
Time: 18:31:00              Log-Likelihood: -8664.3
No. Observations: 980       AIC: 1.735e+04
Df Residuals: 971           BIC: 1.739e+04
Df Model: 8

        coef       std err     t         P>|t|    [95.0% Conf. Int.]
const   -41.0854   61.195      -0.671    0.502    -161.175    79.004
x1      -2.3666    0.319       -7.410    0.000    -2.993      -1.740
x2      -0.0003    0.000       -0.938    0.348    -0.001      0.000
x3      0.0006     8.4e-05     7.565     0.000    0.000       0.001
x4      0.0005     5.29e-05    9.111     0.000    0.000       0.001
x5      -1.3322    0.331       -4.023    0.000    -1.982      -0.682
x6      3.9433     0.544       7.250     0.000    2.876       5.011
x7      -0.0006    9.92e-05    -5.695    0.000    -0.001      -0.000
x8      -0.4433    0.179       -2.471    0.014    -0.795      -0.091

Omnibus: 1529.777           Durbin-Watson: 1.878
Prob(Omnibus): 0.000        Jarque-Bera (JB): 948872.977
Skew: 9.126                 Prob(JB): 0.00
Kurtosis: 154.343           Cond. No. 1.24e+07
nfl:
Dep. Variable: y            R-squared: 0.797
Model: OLS                  Adj. R-squared: 0.795
Method: Least Squares       F-statistic: 449.5
Date: Fri, 20 Mar 2015      Prob (F-statistic): 3.79e-311
Time: 18:31:52              Log-Likelihood: -6666.7
No. Observations: 926       AIC: 1.335e+04
Df Residuals: 917           BIC: 1.339e+04
Df Model: 8

        coef        std err     t          P>|t|    [95.0% Conf. Int.]
const   49.7828     13.374      3.722      0.000    23.536      76.030
x1      -0.6894     0.175       -3.935     0.000    -1.033      -0.346
x2      -9.57e-05   0.000       -0.945     0.345    -0.000      0.000
x3      0.0006      0.000       4.590      0.000    0.000       0.001
x4      3.722e-05   1.16e-05    3.198      0.001    1.44e-05    6.01e-05
x5      3.0882      0.381       8.116      0.000    2.341       3.835
x6      0.5683      0.136       4.183      0.000    0.302       0.835
x7      -4.286e-05  1.71e-05    -2.502     0.013    -7.65e-05   -9.25e-06
x8      -1.5556     0.154       -10.090    0.000    -1.858      -1.253

Omnibus: 722.402            Durbin-Watson: 2.117
Prob(Omnibus): 0.000        Jarque-Bera (JB): 159610.118
Skew: 2.580                 Prob(JB): 0.00
Kurtosis: 67.110            Cond. No. 4.83e+06
gohawks:
Dep. Variable: y            R-squared: 0.762
Model: OLS                  Adj. R-squared: 0.760
Method: Least Squares       F-statistic: 385.9
Date: Fri, 20 Mar 2015      Prob (F-statistic): 3.37e-294
Time: 18:32:29              Log-Likelihood: -7388.6
No. Observations: 972       AIC: 1.480e+04
Df Residuals: 963           BIC: 1.484e+04
Df Model: 8

        coef        std err     t          P>|t|    [95.0% Conf. Int.]
const   22.5795     17.360      1.301      0.194    -11.489     56.648
x1      -1.7203     0.204       -8.447     0.000    -2.120      -1.321
x2      0.0006      0.000       3.148      0.002    0.000       0.001
x3      0.0010      8.81e-05    11.038     0.000    0.001       0.001
x4      -7.143e-05  3.13e-05    -2.279     0.023    -0.000      -9.94e-06
x5      -1.8436     0.237       -7.780     0.000    -2.309      -1.379
x6      2.8752      0.239       12.031     0.000    2.406       3.344
x7      -2.978e-05  6.07e-05    -0.490     0.624    -0.000      8.94e-05
x8      0.0152      0.024       0.631      0.528    -0.032      0.062

Omnibus: 1745.159           Durbin-Watson: 2.092
Prob(Omnibus): 0.000        Jarque-Bera (JB): 4133794.940
Skew: 11.647                Prob(JB): 0.00
Kurtosis: 321.632           Cond. No. 3.69e+06
gopatriots:
Dep. Variable: y            R-squared: 0.745
Model: OLS                  Adj. R-squared: 0.742
Method: Least Squares       F-statistic: 246.2
Date: Fri, 20 Mar 2015      Prob (F-statistic): 2.62e-194
Time: 18:32:34              Log-Likelihood: -4387.1
No. Observations: 683       AIC: 8792.
Df Residuals: 674           BIC: 8833.
Df Model: 8

        coef       std err    t          P>|t|    [95.0% Conf. Int.]
const   -6.5901    6.058      -1.088     0.277    -18.484    5.304
x1      4.3425     0.315      13.784     0.000    3.724      4.961
x2      0.0010     0.000      2.541      0.011    0.000      0.002
x3      -0.0020    0.000      -11.348    0.000    -0.002     -0.002
x4      -0.0012    0.000      -7.701     0.000    -0.001     -0.001
x5      4.9909     0.598      8.339      0.000    3.816      6.166
x6      -1.1720    0.542      -2.164     0.031    -2.235     -0.109
x7      0.0011     0.000      7.013      0.000    0.001      0.001
x8      -11.7934   2.000      -5.897     0.000    -15.720    -7.867

Omnibus: 910.949            Durbin-Watson: 2.242
Prob(Omnibus): 0.000        Jarque-Bera (JB): 564206.178
Skew: 6.312                 Prob(JB): 0.00
Kurtosis: 143.237           Cond. No. 7.50e+05
Problem 4: Split the feature data (your set of (features, predictand) pairs for windows) into 10 parts to perform cross-validation. Run 10 tests, each time fitting your model on 9 parts and predicting the number of tweets for the 1 remaining part. Calculate the average prediction error |N_predicted - N_real| over samples in the remaining part, and then average these values over the 10 tests. Since we know the Super Bowl's date and time, we can create different regression models for different periods of time: first, when the hashtags haven't become very active; second, their active period; and third, after they pass their high-activity time. Train 3 regression models for these time periods (all times in PST): 1. Before Feb. 1, 8:00 a.m. 2. Between Feb. 1, 8:00 a.m. and 8:00 p.m. 3. After Feb. 1, 8:00 p.m.
Report cross-validation errors for the 3 different models. Note that you should do the 90-10% splitting for each model within its specific time window, i.e., only use data within one of the 3 periods for training and testing each time, so for each period you will run 10 tests.
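The 10-fold procedure above can be sketched as follows, here with a plain least-squares fit via numpy for self-containment (the report's models use statsmodels, but the splitting and error averaging are the same):

```python
import numpy as np

def cv_mean_abs_error(X, y, n_folds=10):
    """10-fold cross-validation of an OLS fit (via np.linalg.lstsq),
    returning the average |N_predicted - N_real| over the folds."""
    n = len(y)
    idx = np.arange(n)
    folds = np.array_split(idx, n_folds)
    errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        # Prepend an intercept column, fit on the 9 training parts.
        Xtr = np.column_stack([np.ones(len(train)), X[train]])
        Xte = np.column_stack([np.ones(len(fold)), X[fold]])
        beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        errors.append(np.mean(np.abs(Xte @ beta - y[fold])))
    return float(np.mean(errors))
```

Run separately on the feature windows of each of the 3 periods, this yields the per-period errors in the table below.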
Time window                        Average |N_predicted - N_real|
Entire set                         2465.29948811
Before Feb 1, 8 a.m.               366.07366939
Between Feb 1, 8 a.m. and 8 p.m.   74546.9344882
After Feb 1, 8 p.m.                903.228146426
Problem 5: Download the test data and run your model to make predictions for the next hour in each case. Each file in the test data contains a hashtag's tweets for a 6-hour window. The file name shows the sample number followed by the period number the data is from, e.g., a file named sample5_period2.txt contains tweets for a 6-hour window that lies in the 2nd time period described in Problem 4. Report your predicted number of tweets for the next hour of each sample window.
We ran the model over the given test data and the prediction results are shown in the table below.
Test File          Hour 1           Hour 2           Hour 3           Hour 4           Hour 5           Hour 6
sample1_period1    164.23806279     132.34524248     43.56582638      110.01934521     135.89274546     182.13761449
sample2_period2    61360.95911616   65750.4988655    72678.45760117   93980.1620378    173371.25716328  201042.82972879
sample3_period3    450.07928674     381.68793746     507.21787718     712.08603085     581.90945857     433.69485984
sample4_period1    386.48481377     292.73741923     105.81407976     108.47630511     139.0595091      142.9422123
sample5_period1    295.99683053     187.15690446     198.00202464     174.99212932     188.41977493     150.94818959
sample6_period2    39699.26796418   71172.18908046   161267.20149039  153584.18795669  132269.81772466  134482.55627018
sample7_period3    147.8020772      100.51927159     804.36872147     786.48359354     771.11762266     796.86308636
sample8_period1    889.94925098     890.08823136     1102.94303293    1076.40527254    1010.36838112    883.50427938
sample9_period2    51834.51856431   57096.8663199    62192.66699246   53997.62054499   72576.15507135   81059.09345971
sample10_period3   588.94669065     568.43356111     553.0441436      535.31943703     516.56800024     499.70149696
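Prediction for a test window amounts to choosing the model for the window's period and applying it to each hour's feature vector. The models mapping below is a hypothetical stand-in for the three fitted regressions of Problem 4:

```python
import numpy as np

def predict_next_hour(features, period, models):
    """Predict the next hour's tweet count from one hour's feature vector.

    models maps period number -> (intercept, coefficient vector); here these
    are placeholders for the three period-specific regressions.
    """
    intercept, coefs = models[period]
    return intercept + float(np.dot(coefs, features))
```

For a sampleN_periodP file, each of its 6 hourly feature vectors is passed through the period-P model to produce the table above.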