Analyzing the power of Tweets in predicting Commodity Futures
Mar 17, 2014
Srivatsan Ramanujam, Senior Data Scientist, Pivotal
© Copyright 2013 Pivotal. All rights reserved.
Jul 12, 2015
@gopivotal @being_bayesian
Problem Definition
• Can we predict Corn, Soybean and Wheat futures based on social chatter on Twitter?
• The Customer: A major Agricultural Cooperative
Data
Obtaining Data
• GNIP was used to fetch 5 years of historical tweets matching any of a list of keywords of interest
[Figure: the Tweets table and the Poster Information table]
GNIP
• As plugged-in partners, we had worked with GNIP before, and the experience was great!
• We needed historical data, and GNIP’s Historical PowerTrack came in handy
• Clean API, quick quotes, and convenient download of the results of historical jobs
Grain Futures Vs. Volume of Tweets
The Platform
Data Science Toolkit
• Appliance
  – Full Rack DCA with Greenplum Database
• ETL
  – Python
• Modeling
  – SQL
  – MADlib
  – PL/Python, PL/Java
  – Ark-Tweet-NLP¹ with PL/Java wrappers
• Visualization
  – Tableau
¹ CMU ARK Twitter Parts-of-Speech tagger: http://www.ark.cs.cmu.edu/TweetNLP (GPL 2)
Pivotal Greenplum MPP DB
• Think of it as multiple PostgreSQL servers
[Diagram: a Master and its Segments/Workers]
• Rows are distributed across segments by a particular field (or randomly)
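To make the distribution idea concrete, here is a minimal Python sketch of assigning rows to segments by hashing a chosen field, or at random when no field is given. It is only an illustration: the function names, the `tweet_id` column and the use of CRC32 are assumptions, not Greenplum's actual hashing scheme.

```python
import random
import zlib
from collections import defaultdict


def assign_segment(row, num_segments, dist_key=None, rng=random):
    """Pick a segment for a row (illustrative only, not Greenplum's real hash)."""
    if dist_key is None:
        # Random distribution: any segment may receive the row.
        return rng.randrange(num_segments)
    # Hash distribution: equal key values always land on the same segment.
    key_bytes = str(row[dist_key]).encode("utf-8")
    return zlib.crc32(key_bytes) % num_segments


def distribute(rows, num_segments, dist_key=None):
    """Group rows by the segment they would be stored on."""
    segments = defaultdict(list)
    for row in rows:
        segments[assign_segment(row, num_segments, dist_key)].append(row)
    return segments


# Hypothetical rows; `tweet_id` is an assumed column name.
rows = [{"tweet_id": i, "body": f"tweet {i}"} for i in range(8)]
by_segment = distribute(rows, num_segments=4, dist_key="tweet_id")
```

Hashing on a field keeps all rows that share a key value on one segment, while random assignment only spreads the load.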
PL/X : X in {pgsql, R, Python, Java, Perl, C etc.}
• The interpreter/VM of the language ‘X’ is installed on each node of the Greenplum Database Cluster
• Data Parallelism: PL/X piggybacks on Greenplum’s MPP architecture
• Allows users to write Greenplum/PostgreSQL functions in R, Python, Java, Perl, pgsql or C
[Diagram: SQL arrives at the Master Host (with a Standby Master) and reaches the Segment Hosts over the Interconnect; each Segment Host runs multiple Segments]
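The data-parallel pattern can be sketched in plain Python: the same function runs independently on each segment's rows, and the partial results are then gathered. This only simulates the idea with a thread pool; it is not how Greenplum dispatches PL/X functions, and `count_mentions`, the keyword and the sample segments are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_mentions(segment_rows, keyword="corn"):
    """Per-segment work: count tweets whose text mentions the keyword."""
    return sum(1 for text in segment_rows if keyword in text.lower())


def run_on_all_segments(segments, func):
    """Apply func to every segment in parallel and gather the partial results."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        return list(pool.map(func, segments))


# Hypothetical data: each inner list stands for the rows held by one segment.
segments = [
    ["Corn prices up today", "Rain in Iowa"],
    ["Wheat harvest delayed", "CORN yield looks strong"],
    ["Soybean exports rise"],
]
partials = run_on_all_segments(segments, count_mentions)
total = sum(partials)  # combine the per-segment results
```

Because each segment only touches its own rows, adding segments adds workers without changing the function itself.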
Scalable, in-database ML
• Open Source! https://github.com/madlib/madlib
• Works on Greenplum DB and PostgreSQL
• Active development by Pivotal
  – Latest Release: 1.4 (Dec 2013)
• Downloads and Docs: http://madlib.net/
MADlib In-Database Functions

Predictive Modeling Library
Linear Systems
• Sparse and Dense Solvers
Matrix Factorization
• Singular Value Decomposition (SVD)
• Low-Rank
Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Sandwich Estimators (Huber-White, clustered, marginal effects)
Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Affinity Analysis, Market Basket)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Ensemble Learners (Random Forests)
• Support Vector Machines
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation

Descriptive Statistics
Sketch-based Estimators
• CountMin (Cormode-Muthukrishnan)
• FM (Flajolet-Martin)
• MFV (Most Frequent Values)
Correlation
Summary

Support Modules
Array Operations
Sparse Vectors
Random Sampling
Probability Functions
The Models
The Approach
• In addition to identifying textual cues in tweets that were correlated with commodity futures, we also wanted to analyze whether tweet sentiment was correlated with commodity futures
Sentiment Analysis – Challenges
• Language on Twitter doesn’t adhere to rules of grammar, syntax or spelling
• We don’t have labeled data for our problem. The tweets aren’t tagged with sentiment
• Semi-supervised sentiment prediction can be achieved by dictionary look-ups of tokens in a tweet, but without context, sentiment prediction is futile!
“Cool”
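A short Python sketch shows why context-free dictionary look-ups fall short. The tiny lexicon and the example tweets are made up for illustration; this is the naive baseline the slide warns about, not the approach used in the project.

```python
# Toy polarity lexicon (an assumption for illustration, not a real resource).
LEXICON = {"cool": 1, "great": 1, "bad": -1, "poor": -1}


def lookup_sentiment(tweet):
    """Sum the polarity of each token, ignoring context entirely."""
    tokens = tweet.lower().split()
    return sum(LEXICON.get(token.strip(".,!?"), 0) for token in tokens)


# "cool" as approval vs. "cool" as a temperature: same token, different meaning.
approving = lookup_sentiment("That harvest report was cool!")
weather = lookup_sentiment("Cool nights are slowing corn growth")
```

Both tweets receive the same positive score, although the second one uses "cool" to describe temperature rather than to express approval.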
Sentiment Analysis – Approach
Part-of-speech tagger¹ → Phrase Extraction → Semi-Supervised Sentiment Classification → Phrasal Polarity Scoring → Sentiment Scored Tweets
• Part-of-speech tagger¹: break up tweets into tokens and tag their parts-of-speech
• Sentiment Scored Tweets: use learned phrasal polarities to score sentiment of new tweets
• Parallelized ArkTweetNLP to achieve fast parts-of-speech tagging on tweets
• Custom (patent pending) algorithm to extract contextual cues & score sentiment of tweets
¹ Parts-of-speech tagger: Gp-Ark-Tweet-NLP (http://vatsan.github.io/gp-ark-tweet-nlp/)
Text Analytics Pipeline with GNIP stream
• Tweet Stream
• Stored on HDFS
• Loaded as external tables into GPDB (gpfdist)
• Parallel parsing of JSON and extraction of fields using PL/Python
• Topic Analysis through MADlib pLDA
• Sentiment Analysis through custom PL/Python functions
• D3.js
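The JSON parsing step can be sketched as a plain Python function of the kind that could be wrapped in a PL/Python UDF. The field names (`id`, `body`, `postedTime`, `actor.preferredUsername`) are assumptions about the record layout and should be checked against the actual GNIP payload; the sample record is made up.

```python
import json


def extract_fields(raw_json):
    """Parse one tweet record and return the fields of interest, or None if malformed."""
    try:
        record = json.loads(raw_json)
    except (TypeError, ValueError):
        return None  # skip input that is not valid JSON
    if not isinstance(record, dict):
        return None
    actor = record.get("actor")
    if not isinstance(actor, dict):
        actor = {}  # tolerate records without poster information
    return {
        "tweet_id": record.get("id"),
        "body": record.get("body"),
        "posted_time": record.get("postedTime"),
        "poster": actor.get("preferredUsername"),
    }


# Hypothetical record; the field names are assumed, not taken from the slides.
sample = (
    '{"id": "t1", "body": "Corn looks strong", '
    '"postedTime": "2013-07-01T12:00:00Z", '
    '"actor": {"preferredUsername": "farmer_joe"}}'
)
parsed = extract_fields(sample)
```

Returning `None` for malformed input lets a row-at-a-time caller drop bad records instead of aborting the whole load.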
Key Take-Aways
• There is significant signal in Tweets in predicting commodity futures
• Sentiment Analysis of tweets can provide an additional signal in predicting commodity futures. Twitter sentiment was negatively correlated with commodity futures in the sample we analyzed
• A blended model of Text Regression, Sentiment Analysis and Tweet Actor information gave us encouraging results, and we believe that combining it with market fundamentals like weather or yield will give better models
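Checking whether sentiment moves with prices amounts to computing a correlation between two aligned daily series. The sketch below implements the Pearson coefficient in plain Python; the two series are synthetic placeholders chosen only to exercise the function, not results from this analysis, and the variable names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Synthetic placeholder series (NOT data from the study).
daily_sentiment = [0.1, 0.2, 0.3, 0.4]
daily_price = [4.0, 3.0, 2.0, 1.0]
r = pearson(daily_sentiment, daily_price)
```

A value of `r` near -1 indicates the two series move in opposite directions, a value near +1 that they move together, and a value near 0 that there is little linear relationship.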
What’s in it for me?
Pivotal Open Source Contributions
http://gopivotal.com/pivotal-products/open-source-software
• MADlib – In-database parallel ML - https://github.com/madlib/madlib
• PyMADlib – Python Wrapper for MADlib - https://github.com/gopivotal/pymadlib
• PivotalR – R wrapper for MADlib - https://github.com/madlib-internal/PivotalR
• Part-of-speech tagger for Twitter via SQL - http://vatsan.github.io/gp-ark-tweet-nlp/
Questions? @being_bayesian