© Copyright 2013 Pivotal. All rights reserved. Analyzing the power of Tweets in predicting Commodity Futures. Mar 17, 2014. Srivatsan Ramanujam, Senior Data Scientist, Pivotal
Page 1: Analyzing Power of Tweets in Predicting Commodity Futures

@gopivotal @being_bayesian

Analyzing the power of Tweets in predicting Commodity Futures

Mar 17, 2014 Srivatsan Ramanujam Senior Data Scientist

Pivotal

Page 2: Analyzing Power of Tweets in Predicting Commodity Futures

Problem Definition
•  Can we predict corn, soybean and wheat futures based on social chatter on Twitter?
•  The customer: a major agricultural cooperative

Page 3: Analyzing Power of Tweets in Predicting Commodity Futures

Data

Page 4: Analyzing Power of Tweets in Predicting Commodity Futures

Obtaining Data
•  GNIP was used to fetch 5 years of historical tweets matching any of a list of keywords of interest

[Figures: Tweets Table; Poster Information]

Page 5: Analyzing Power of Tweets in Predicting Commodity Futures

GNIP
•  As plugged-in partners, we have worked with GNIP before, and the experience was great!
•  We needed historical data, and GNIP's Historical PowerTrack came in handy
•  Clean API, quick quotes, and convenient download of the results of historical jobs

Page 6: Analyzing Power of Tweets in Predicting Commodity Futures

Grain Futures Vs. Volume of Tweets

Page 7: Analyzing Power of Tweets in Predicting Commodity Futures

The Platform

Page 8: Analyzing Power of Tweets in Predicting Commodity Futures

Data Science Toolkit
•  Appliance
   –  Full Rack DCA with Greenplum Database
•  ETL
   –  Python
•  Modeling
   –  SQL
   –  MADlib
   –  PL/Python, PL/Java
   –  Ark-Tweet-NLP¹ with PL/Java wrappers
•  Visualization
   –  Tableau

¹ CMU ARK Twitter part-of-speech tagger: http://www.ark.cs.cmu.edu/TweetNLP (GPL 2)

Page 9: Analyzing Power of Tweets in Predicting Commodity Futures

Pivotal Greenplum MPP DB
Think of it as multiple PostgreSQL servers

Segments/Workers

Master

Rows are distributed across segments by a particular field (or randomly)
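The row-distribution idea above can be illustrated with a small Python sketch. This is a toy model, not Greenplum's actual implementation: the segment count, the use of CRC32 as the hash, and the function names are all assumptions made for illustration.

```python
import random
import zlib
from collections import defaultdict

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for_key(key, num_segments=NUM_SEGMENTS):
    """Pick a segment from a hash of the distribution key (toy stand-in hash)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, key=None, num_segments=NUM_SEGMENTS, seed=0):
    """Assign rows to segments by a key field, or randomly when no key is given."""
    rng = random.Random(seed)
    segments = defaultdict(list)
    for row in rows:
        if key is None:
            seg = rng.randrange(num_segments)
        else:
            seg = segment_for_key(row[key], num_segments)
        segments[seg].append(row)
    return dict(segments)


# Hypothetical rows: 12 tweets written by 3 users.
rows = [{"tweet_id": i, "user": "user%d" % (i % 3)} for i in range(12)]
by_user = distribute(rows, key="user")   # all rows of one user share a segment
at_random = distribute(rows)             # rows are scattered regardless of user
```

Distributing by a field keeps rows with the same key on the same segment, which is what lets per-key work run without moving data between segments.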

Page 10: Analyzing Power of Tweets in Predicting Commodity Futures

PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
•  Allows users to write Greenplum/PostgreSQL functions in the R, Python, Java, Perl, pgsql or C languages
•  The interpreter/VM of the language 'X' is installed on each node of the Greenplum Database cluster
•  Data parallelism: PL/X piggybacks on Greenplum's MPP architecture

[Diagram: Master Host (Master, Standby Master) receiving SQL, connected through the Interconnect to four Segment Hosts with two Segments each]
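The data-parallel idea behind PL/X can be sketched in plain Python. This is only a simulation under stated assumptions: threads stand in for segments, the `segments` lists and the `count_keyword` function are hypothetical, and no Greenplum API is involved.

```python
from concurrent.futures import ThreadPoolExecutor


def count_keyword(rows, keyword="corn"):
    """Per-segment work: count rows whose text mentions the keyword."""
    return sum(1 for text in rows if keyword in text.lower())


# Hypothetical data: each inner list stands for the rows held by one segment.
segments = [
    ["Corn futures up today", "Rain in Iowa"],
    ["Wheat harvest delayed", "CORN belt drought"],
    ["Soybean exports rise"],
]

# Every "segment" runs the same function on its own rows only;
# the "master" then combines the partial results.
with ThreadPoolExecutor(max_workers=len(segments)) as pool:
    partial_counts = list(pool.map(count_keyword, segments))

total = sum(partial_counts)
```

The point of the sketch is that the function never sees the whole table: each copy works on local rows, so adding segments adds workers.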

Page 11: Analyzing Power of Tweets in Predicting Commodity Futures

Scalable, in-database ML

•  Open Source! https://github.com/madlib/madlib
•  Works on Greenplum DB and PostgreSQL
•  Active development by Pivotal
   -  Latest Release: 1.4 (Dec 2013)
•  Downloads and Docs: http://madlib.net/

Page 12: Analyzing Power of Tweets in Predicting Commodity Futures

MADlib In-Database Functions

Predictive Modeling Library

Linear Systems
•  Sparse and Dense Solvers

Matrix Factorization
•  Singular Value Decomposition (SVD)
•  Low-Rank

Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Multinomial Logistic Regression
•  Cox Proportional Hazards Regression
•  Elastic Net Regularization
•  Sandwich Estimators (Huber-White, clustered, marginal effects)

Machine Learning Algorithms
•  Principal Component Analysis (PCA)
•  Association Rules (Affinity Analysis, Market Basket)
•  Topic Modeling (Parallel LDA)
•  Decision Trees
•  Ensemble Learners (Random Forests)
•  Support Vector Machines
•  Conditional Random Field (CRF)
•  Clustering (K-means)
•  Cross Validation

Descriptive Statistics

Sketch-based Estimators
•  CountMin (Cormode-Muthukrishnan)
•  FM (Flajolet-Martin)
•  MFV (Most Frequent Values)

Correlation
Summary

Support Modules

Array Operations
Sparse Vectors
Random Sampling
Probability Functions

Page 13: Analyzing Power of Tweets in Predicting Commodity Futures

The Models

Page 14: Analyzing Power of Tweets in Predicting Commodity Futures

The Approach

•  In addition to identifying textual cues in tweets that were correlated with commodity futures, we also wanted to analyze whether tweet sentiment was correlated with commodity futures

Page 15: Analyzing Power of Tweets in Predicting Commodity Futures

Sentiment Analysis – Challenges
•  Language on Twitter doesn't adhere to rules of grammar, syntax or spelling
•  We don't have labeled data for our problem. The tweets aren't tagged with sentiment
•  Semi-supervised sentiment prediction can be achieved by dictionary look-ups of tokens in a tweet, but without context, sentiment prediction is futile!

"Cool"
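The limitation of context-free dictionary look-ups can be seen in a minimal Python sketch. The lexicon and both example tweets are invented for illustration; real sentiment dictionaries are much larger, and this is not the approach used in the project.

```python
# Hypothetical mini-lexicon mapping tokens to polarity.
LEXICON = {"cool": 1, "great": 1, "bad": -1, "hurts": -1}


def lookup_sentiment(tweet):
    """Sum dictionary polarities of individual tokens, ignoring context."""
    tokens = tweet.lower().split()
    return sum(LEXICON.get(tok.strip(".,!?"), 0) for tok in tokens)


# "cool" as praise versus "cool" as a temperature description.
praise = lookup_sentiment("These corn prices are cool!")
weather = lookup_sentiment("Cool nights expected across the corn belt")
```

Both tweets get the same positive score, even though the second one only describes the weather. The token "cool" carries no fixed sentiment without its context.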

Page 16: Analyzing Power of Tweets in Predicting Commodity Futures

Sentiment Analysis – Approach

Pipeline stages:
•  Part-of-speech tagger¹: break up tweets into tokens and tag their parts of speech
•  Phrase Extraction
•  Semi-Supervised Sentiment Classification
•  Phrasal Polarity Scoring
•  Sentiment Scored Tweets: use learned phrasal polarities to score sentiment of new tweets

•  Parallelized ArkTweetNLP to achieve fast part-of-speech tagging on tweets
•  Custom (patent pending) algorithm to extract contextual cues and score sentiment of tweets

¹ Part-of-speech tagger: Gp-Ark-Tweet-NLP (http://vatsan.github.io/gp-ark-tweet-nlp/)
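The slides do not describe the custom algorithm, so the following Python sketch is only a generic illustration of scoring tweets with phrase-level polarities. The polarity table, the simplified tag set ("A" adjective, "N" noun, "X" other), the adjective-noun phrase rule and the averaging are all assumptions, not the author's method.

```python
# Hypothetical learned phrase polarities in [-1, 1]; values are invented.
PHRASE_POLARITY = {
    ("cool", "deal"): 0.8,
    ("cool", "weather"): 0.0,
    ("poor", "yield"): -0.9,
}


def extract_phrases(tagged_tokens):
    """Return (adjective, noun) pairs from a list of (token, tag) tuples."""
    phrases = []
    for (tok, tag), (next_tok, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag == "A" and next_tag == "N":
            phrases.append((tok.lower(), next_tok.lower()))
    return phrases


def score_tweet(tagged_tokens):
    """Average the polarity of known phrases; 0.0 when none is recognised."""
    scores = [
        PHRASE_POLARITY[p]
        for p in extract_phrases(tagged_tokens)
        if p in PHRASE_POLARITY
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Tags are written by hand here; a real pipeline would take them from a tagger.
tweet_a = [("Cool", "A"), ("deal", "N"), ("on", "X"), ("wheat", "N")]
tweet_b = [("Cool", "A"), ("weather", "N"), ("and", "X"), ("poor", "A"), ("yield", "N")]

score_a = score_tweet(tweet_a)
score_b = score_tweet(tweet_b)
```

Scoring phrases instead of single tokens lets the same word ("cool") contribute differently depending on the noun it modifies.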

Page 17: Analyzing Power of Tweets in Predicting Commodity Futures

Text Analytics Pipeline with GNIP stream

•  Tweet Stream
•  Stored on HDFS
•  Loaded as external tables into GPDB (gpfdist)
•  Parallel parsing of JSON and extraction of fields using PL/Python
•  Topic Analysis through MADlib pLDA
•  Sentiment Analysis through custom PL/Python functions
•  D3.js
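The JSON parsing step in the pipeline above could look roughly like the following Python function, of the kind that might serve as the body of a PL/Python function. It is a sketch under assumptions: the field names (`postedTime`, `body`, `actor.preferredUsername`, `actor.followersCount`) and the sample record are illustrative and should be checked against the actual feed format.

```python
import json


def extract_fields(raw_json):
    """Parse one tweet record and keep a few fields; None for malformed input."""
    try:
        tweet = json.loads(raw_json)
    except ValueError:
        return None
    if not isinstance(tweet, dict):
        return None
    actor = tweet.get("actor")
    if not isinstance(actor, dict):
        actor = {}
    return {
        "posted_time": tweet.get("postedTime"),
        "body": tweet.get("body"),
        "user": actor.get("preferredUsername"),
        "followers": actor.get("followersCount"),
    }


# Hypothetical record, invented for illustration.
sample = (
    '{"postedTime": "2013-07-01T12:00:00.000Z", "body": "Corn looks strong", '
    '"actor": {"preferredUsername": "farmer_joe", "followersCount": 42}}'
)
parsed = extract_fields(sample)
broken = extract_fields("{not json")
```

Returning `None` for malformed records keeps one bad line from failing the whole load, which matters when each segment parses its share of the stream independently.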

Page 18: Analyzing Power of Tweets in Predicting Commodity Futures

Key Take-Aways

•  There is significant signal in tweets for predicting commodity futures

•  Sentiment analysis of tweets can provide an additional signal in predicting commodity futures. In the sample we analyzed, Twitter sentiment was negatively correlated with commodity futures

•  A blended model of text regression, sentiment analysis and tweet actor information gave us encouraging results, and we believe that combining it with market fundamentals like weather or yield will give better models
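To make "negatively correlated" concrete, here is a minimal Python sketch of a Pearson correlation between a daily sentiment series and a daily price series. The numbers are toy values invented for illustration and are not from the study; the slides do not say which correlation measure was used, so Pearson is an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy numbers, NOT from the study: sentiment falls on days when the price rises.
daily_sentiment = [0.4, 0.1, -0.2, -0.5, 0.3]
daily_price = [100.0, 104.0, 105.0, 110.0, 101.0]

r = pearson(daily_sentiment, daily_price)
```

A value of `r` close to -1 means the two series tend to move in opposite directions. A coefficient like this says nothing about causation or about predictive power on unseen days.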

Page 19: Analyzing Power of Tweets in Predicting Commodity Futures

What’s in it for me?

Page 20: Analyzing Power of Tweets in Predicting Commodity Futures

Pivotal Open Source Contributions
http://gopivotal.com/pivotal-products/open-source-software

•  MADlib – In-database parallel ML
   -  https://github.com/madlib/madlib
•  PyMADlib – Python wrapper for MADlib
   -  https://github.com/gopivotal/pymadlib
•  PivotalR – R wrapper for MADlib
   -  https://github.com/madlib-internal/PivotalR
•  Part-of-speech tagger for Twitter via SQL
   -  http://vatsan.github.io/gp-ark-tweet-nlp/

Questions? @being_bayesian