Google Tools for Data, Hal Varian, Univ of Oregon, Oct 2013
Google Tools for Data

Feb 14, 2017

Transcript
Page 1: Google Tools for Data

Google Tools for Data

Hal Varian, Univ of Oregon

Oct 2013

Page 2: Google Tools for Data

Google Trends

Google Correlate

Google Consumer Surveys

Page 3: Google Tools for Data

Searches for [hangover]

Which day of the week are there the most searches for [hangover]?

1: Sunday

2: Monday

3: Tuesday

4: Wednesday

5: Thursday

6: Friday

7: Saturday

Page 4: Google Tools for Data

Search index for [hangover]

Page 5: Google Tools for Data

Hangover by geography

Page 6: Google Tools for Data

Hangover-vodka time series

Page 7: Google Tools for Data

Soup and ice cream

Page 8: Google Tools for Data

Searches for [civil war]

Page 9: Google Tools for Data

Searches for [term paper]

Page 10: Google Tools for Data

Gift for boyfriend v Gift for girlfriend

For boyfriend

For girlfriend

Page 11: Google Tools for Data

Gift for husband v Gift for wife

For husband

For wife

Page 12: Google Tools for Data

Google Trends

Google Correlate

Google Consumer Surveys

Page 13: Google Tools for Data

Searches correlated with [weight loss]

Page 14: Google Tools for Data

Plot of [weight loss] and [best vacation spots]

New Year

Page 15: Google Tools for Data

Correlated with [weight loss] 3 weeks later
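
A lagged correlation of this kind can be sketched as follows. This is only an illustration on synthetic data, not the method Google Correlate itself uses; the series, the 3-week delay, and the function names `pearson` and `lagged_correlation` are assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def lagged_correlation(target, candidate, lag):
    """Correlate target[t] with candidate[t + lag], i.e. candidate `lag` weeks later."""
    if lag == 0:
        return pearson(target, candidate)
    return pearson(target[:-lag], candidate[lag:])

# Synthetic weekly series (hypothetical): the candidate repeats the target 3 weeks later.
weeks = range(156)
target = [math.sin(2 * math.pi * t / 52) for t in weeks]
candidate = [math.sin(2 * math.pi * (t - 3) / 52) for t in weeks]

# Scan a few lags and keep the one with the highest correlation.
best_lag = max(range(9), key=lambda k: lagged_correlation(target, candidate, k))
print("best lag in weeks:", best_lag)
```

Scanning over lags this way is what lets a query series be matched to a target that moves a few weeks earlier or later.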

Page 16: Google Tools for Data

Initial claims: good leading indicator for recessions

Grey bars indicate recessions

Page 17: Google Tools for Data

Google Correlate with initial claims data

Page 18: Google Tools for Data

Initial claims and [unemployment filing]

Page 19: Google Tools for Data

Initial claims, seasonally adjusted

Hard to forecast

Page 20: Google Tools for Data

Regression models

Baseline model $y_t = a\,y_{t-1} + c + e_t$ gives an in-sample mean absolute error (MAE) of 3.1%

Adding the “unemployment filing” query, $y_t = a\,y_{t-1} + b\,q_t + c + e_t$, gives an in-sample MAE of 3.0%

Train using t weeks, forecast t+1 (rolling window forecast): MAE of baseline = 3.2%, MAE with query = 3.2%, 0% improvement

During recession: MAE of baseline = 3.7%, MAE with query = 3.3%, 8.7% improvement
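
The rolling window comparison can be sketched as below. This is a minimal illustration on synthetic data, not the computation behind the figures above: the data-generating coefficients, the 52-week window, and the helper names (`solve`, `ols`, `features`, `rolling_mae`) are all assumptions, and the error is reported in raw units rather than percent.

```python
import random

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

def features(y, q, s, use_query):
    """Regressors for period s: lagged y, intercept, and optionally the query index."""
    row = [y[s - 1], 1.0]
    if use_query:
        row.append(q[s])
    return row

def rolling_mae(y, q, window, use_query):
    """Fit on the previous `window` periods, forecast one step ahead, slide forward."""
    errors = []
    for t in range(window + 1, len(y)):
        train = range(t - window, t)
        beta = ols([features(y, q, s, use_query) for s in train], [y[s] for s in train])
        forecast = sum(b * f for b, f in zip(beta, features(y, q, t, use_query)))
        errors.append(abs(y[t] - forecast))
    return sum(errors) / len(errors)

# Synthetic data (hypothetical): y depends on its own lag and on a query index q.
rng = random.Random(0)
n = 200
q = [rng.gauss(0, 1) for _ in range(n)]
y = [0.0]
for t in range(1, n):
    y.append(0.5 * y[-1] + 1.0 * q[t] + 0.2 + rng.gauss(0, 0.3))

mae_baseline = rolling_mae(y, q, window=52, use_query=False)
mae_query = rolling_mae(y, q, window=52, use_query=True)
print("baseline MAE:", mae_baseline, "with query MAE:", mae_query)
```

Each forecast uses only data available before period t plus the contemporaneous query value, which is the point of using search data that arrives ahead of the official release.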

Page 21: Google Tools for Data

Gun sales background check

Page 22: Google Tools for Data

NICS time series

[stack on] has highest correlation

[gun shops] is chosen by BSTS

Page 23: Google Tools for Data

Trend

Page 24: Google Tools for Data

Seasonal

Page 25: Google Tools for Data

[gun shops]

Page 26: Google Tools for Data

Searches on [gun shop]

Page 27: Google Tools for Data

Our goal: automate model discovery

Challenge 1: spurious correlation

Correlations sometimes arise simply from common seasonality or trend

Challenge 2: fat regression

With more predictors than observations, one can always find a good fit

Challenge 3: overfitting

Within-sample fits typically look better than out-of-sample fits

Page 28: Google Tools for Data

Challenge 1: Spurious Correlation
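
A small synthetic illustration of the problem: two series that share nothing but a linear trend look strongly correlated in levels, and the correlation largely disappears once the trend is removed. First differencing is used here only as a simple stand-in for detrending; it is an assumption of this sketch, not the approach described later in the talk, and all series and names are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def diff(series):
    """First differences: series[t] - series[t - 1]."""
    return [b - a for a, b in zip(series, series[1:])]

rng = random.Random(0)
n = 200
# Two unrelated series: independent noise around the same upward trend.
x = [0.1 * t + rng.gauss(0, 1) for t in range(n)]
y = [0.1 * t + rng.gauss(0, 1) for t in range(n)]

corr_levels = pearson(x, y)            # dominated by the shared trend
corr_diffs = pearson(diff(x), diff(y))  # trend removed, only noise remains
print("levels:", corr_levels, "differences:", corr_diffs)
```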

Page 29: Google Tools for Data

Challenge 2: Fat regression

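Slim regression: $y = Xb$, with more observations than predictors

Fat regression: $y = Xb$, with more predictors than observations

Any square subset of regressors will fit perfectly: $y = X_1 b_1$ and $y = X_2 b_2$

A subset of regressors might fit well by chance: $y \approx X_1 b_1$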
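
The claim that a square subset of regressors fits perfectly can be checked with a short sketch. Everything here is synthetic and hypothetical: the target and all 20 candidate predictors are pure noise, and `solve` is a helper written for this example.

```python
import random

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

rng = random.Random(0)
n = 5    # observations
k = 20   # candidate predictors, none of them related to y
y = [rng.gauss(0, 1) for _ in range(n)]
predictors = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(k)]

# Pick any n of the k predictors, so that the design matrix X is square (n x n).
chosen = rng.sample(range(k), n)
X = [[predictors[j][i] for j in chosen] for i in range(n)]

# A square, non-singular X gives an exact solution of y = X b.
b = solve(X, y)
residuals = [yi - sum(bj * xij for bj, xij in zip(b, row)) for row, yi in zip(X, y)]
max_abs_residual = max(abs(r) for r in residuals)
print("largest residual:", max_abs_residual)
```

The fit is exact up to floating-point error even though the predictors carry no information about `y`, which is why in-sample fit alone cannot guide variable selection here.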

Page 30: Google Tools for Data

Our approach

Estimating time series: use Kalman filter techniques

Express time series as trend + seasonal + noise (“basic structural model”)

Forecast univariate model using Kalman filter

Advantages: flexible, adaptive, interpretable, handles non-stationarity well

Model selection using “spike and slab” Bayesian regression

Spike: prior probability that coefficient is included in regression

Slab: diffuse prior for coefficient, conditional on inclusion

Estimate a posterior probability that variable is in model

Combines well with Kalman techniques

Final forecast is weighted average of many models, with weights given by posterior probabilities (Bayesian model averaging)

Example of “ensemble estimation”

Agnostic with respect to “true model”

Tends to avoid overfitting by avoiding choice of “best” single model
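
The idea of a posterior inclusion probability can be sketched with a deliberately simplified stand-in. Assumptions of this sketch, none of which come from the talk: all models are enumerated instead of sampled, the marginal likelihood under the slab is replaced by the BIC approximation, each variable gets the same prior inclusion probability, and there is no time series component. The data and the names `rss` and `inclusion_probabilities` are hypothetical.

```python
import itertools
import math
import random

def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on the columns of X."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2 for row, yi in zip(X, y))

def inclusion_probabilities(columns, y, prior_inclusion=0.5):
    """Posterior probability that each predictor is in the model (BIC approximation)."""
    n, k = len(y), len(columns)
    log_weights = {}
    for size in range(k + 1):
        for subset in itertools.combinations(range(k), size):
            X = [[1.0] + [columns[j][i] for j in subset] for i in range(n)]
            bic = n * math.log(rss(X, y) / n) + (size + 1) * math.log(n)
            # "Spike" part: independent prior probability of inclusion per variable.
            log_prior = (size * math.log(prior_inclusion)
                         + (k - size) * math.log(1 - prior_inclusion))
            log_weights[subset] = -0.5 * bic + log_prior
    top = max(log_weights.values())
    weights = {s: math.exp(lw - top) for s, lw in log_weights.items()}
    total = sum(weights.values())
    # Sum the normalized weights of every model that contains predictor j.
    return [sum(w for s, w in weights.items() if j in s) / total for j in range(k)]

# Synthetic data (hypothetical): only the first of four predictors matters.
rng = random.Random(1)
n = 100
columns = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(4)]
y = [2.0 * columns[0][i] + rng.gauss(0, 1) for i in range(n)]

probs = inclusion_probabilities(columns, y)
print("inclusion probabilities:", probs)
```

The normalized model weights computed inside `inclusion_probabilities` are the same weights a model-averaged forecast would use, so no single "best" model has to be chosen.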

Page 31: Google Tools for Data

Checklist

Kalman filter: handles seasonality and trend

Spike and slab: handles variable selection

Model averaging: averages over many small models to avoid overfitting
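
For the Kalman filter item, a minimal sketch of the simplest structural model (local level: a random-walk level plus observation noise) is shown below. The trend slope and seasonal components mentioned in the talk would add further state variables and are omitted. The variances, the diffuse initial state, and the function name `local_level_filter` are assumptions made for the example.

```python
import random

def local_level_filter(y, obs_var, level_var, init_level=0.0, init_var=1e6):
    """Kalman filter for y[t] = level[t] + noise, level[t] = level[t-1] + shock.

    Returns the filtered level after each observation and the one-step-ahead
    forecast made before each observation.
    """
    level, var = init_level, init_var
    filtered, forecasts = [], []
    for obs in y:
        # Predict: the level is a random walk, so only its uncertainty grows.
        var = var + level_var
        forecasts.append(level)
        # Update: move the level toward the observation by the Kalman gain.
        gain = var / (var + obs_var)
        level = level + gain * (obs - level)
        var = (1.0 - gain) * var
        filtered.append(level)
    return filtered, forecasts

# Synthetic series (hypothetical): a slowly drifting level observed with noise.
rng = random.Random(0)
true_level = 10.0
series = []
for _ in range(100):
    true_level += rng.gauss(0, 0.1)
    series.append(true_level + rng.gauss(0, 1.0))

filtered, forecasts = local_level_filter(series, obs_var=1.0, level_var=0.01)
print("last filtered level:", filtered[-1])
```

A large `init_var` makes the first observation dominate the initial guess, which is a common way to express ignorance about the starting level.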

Page 32: Google Tools for Data

UM consumer sentiment index

Monthly data from University of Michigan survey

Select predictors using spike and slab from 157 Google economic verticals, using the average value for the first 2 weeks of the month (about 3 weeks before the data is released).

Page 33: Google Tools for Data

Probability of inclusion of predictor (n = 98, k = 195)

White: positive predictor

Black: negative predictor

Financial planning: personal finance, finance education, finance planning, finance schwab, financial literacy

Investing: finance google, stock finance, stocks, etrade, ameritrade, gold

Page 34: Google Tools for Data

Start with “trend”

Page 35: Google Tools for Data

Add “financial planning”

Page 36: Google Tools for Data

Add “Investing”

Page 37: Google Tools for Data

Add “Business News”

Page 38: Google Tools for Data

Add “Search Engines”

Page 39: Google Tools for Data

Add “Energy and Utilities”

Page 40: Google Tools for Data

Google Trends

Google Correlate

Google Consumer Surveys

Page 41: Google Tools for Data

How it works

Page 42: Google Tools for Data

Page 43: Google Tools for Data

Page 44: Google Tools for Data

Page 45: Google Tools for Data

3rd party analysis

Pew Foundation, “A Comparison of Results from Surveys by the Pew Research Center and Google Consumer Surveys”

Nate Silver, “Which Polls Fared Best (and Worst) in the 2012 Presidential Race”

Red: Google

Blue: Pew

Page 46: Google Tools for Data

How this changes surveys

Anyone can do them

The cost is dramatically lower

Results come back in a few hours

Surveys can be replicated … or not

You can detect sensitivity due to wording

Page 47: Google Tools for Data

Challenges for the future

Private sector has high-frequency, real-time data, and a lot of it!

Visa, Mastercard, American Express

UPS and FedEx

Wal-Mart, Target, etc.

Supermarket scanner data

Search engines

Government agencies

Long historical series, but usually low frequency

Carefully constructed but labor intensive, with delayed release and periodic revisions

How to combine the public and private data?

How to integrate massive amounts of private sector real-time information with traditional government statistics?