Big data analytics
Miha Grčar (1, 2)
(1) Jožef Stefan Institute, (2) Sowa Labs GmbH
Large-scale information extraction and integration infrastructure for supporting financial decision making (FP7-ICT-257928)
http://project-first.eu
Feb 23, 2016
Outline
• What is big data? What caused it? Who should care?
• Solving big data problems
• Examples
Frankfurt, 11/11/2013
What is big data?
• “How many terabytes?”
• We deliberately avoid being specific
• Big data refers to datasets that cannot be captured, stored, managed, and/or analyzed by mainstream storage and processing devices
What caused big data? Storage capacity and processing power
Source: Hilbert and López, “The world’s technological capacity to store, communicate, and compute information,” Science, 2011
What caused big data? Data availability (industry)
Source: IDC; US Bureau of Labor Statistics; McKinsey Global Institute analysis
What caused big data? Data availability (social media and mobile devices)
Source: www.creotivo.com
What caused big data? Data availability (sensors)
Source: Analyst interviews; McKinsey Global Institute analysis
What caused big data? Maturity of technologies & tools
Chart labels: “Emerging, hyped” and “Mature”
Source: Gartner (July, 2012)
Who should care about big data?
Source: US Bureau of Labor Statistics; McKinsey Global Institute analysis
Solving big data problems
• Distributed infrastructure
  – Cloud: Amazon Elastic Compute Cloud (EC2)
• Distributed processing
  – MapReduce / batches: Hadoop
  – Distributed workflows / streams: Twitter Storm
• Distributed storage
  – Distributed FS/DB
  – NoSQL
Solving big data problems
• Distributed infrastructure
  – Cloud: Amazon EC2, Windows Azure, Google Cloud Platform, Cloudwatt…
• Distributed processing
  – MapReduce / batches: Hadoop, MS DryadLINQ, Disco, Misco, Phoenix, Cloud MapReduce, bashreduce, Qizmt…
  – Distributed workflows / streams: Storm (Twitter), S4 (Yahoo)
  – “Real-time Hadoops”: Impala, HFlame, Spark…
• Distributed storage
  – Distributed FS/DB, NoSQL: Google File System, HDFS, Google Big Table, HBase, Cassandra, MongoDB, CouchDB, Hive…
Amazon EC2
EC2 = ECC = Elastic Compute Cloud
• Central part of Amazon.com’s cloud computing service
• ~500,000 physical Linux machines
• Elastic: servers can be started and stopped in response to demand; you pay only for running servers
• Instances (several examples)
  – Micro: 1 ECU, 1 core, 613 MiB
  – High-Memory XL: 6.5 ECUs, 2 cores, 17.1 GiB
  – High-CPU XL: 20 ECUs, 8 cores, 7 GiB
• OS
  – Windows
  – Linux
  – FreeBSD
• Storage
  – Temporary instance storage
  – Persistent Elastic Block Store (EBS)
MapReduce (Hadoop)
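[Figure: counting ballots as a MapReduce analogy]
Input: a bunch of ballots for candidates A, B, and C, all mixed up… (and, one step later, still mixed up…)
Steps: Map, then Reduce
Election results: A: 321,015; B: 179,539; C: 201,734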
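One possible reading of the ballot analogy is sketched below: each worker tallies its own pile of ballots (map), and the partial tallies are then merged into a final result (reduce). The piles, the function names `map_pile` and `reduce_tallies`, and the tiny ballot counts are all hypothetical; this is plain Python for illustration, not Hadoop code.

```python
from collections import Counter
from functools import reduce

def map_pile(pile):
    """Map step (assumed reading): one worker tallies its own pile of ballots."""
    return Counter(pile)

def reduce_tallies(left, right):
    """Reduce step: merge two partial tallies by adding the counts."""
    return left + right

# Hypothetical ballots, already split into piles that separate workers could count.
piles = [
    ["A", "C", "B", "A"],
    ["B", "A", "C"],
    ["C", "A"],
]

partial_tallies = [map_pile(pile) for pile in piles]  # one tally per pile
results = reduce(reduce_tallies, partial_tallies, Counter())

for candidate in sorted(results):
    print(f"{candidate}: {results[candidate]}")
```

Because each pile is tallied independently, the map step could run on as many machines as there are piles; only the small partial tallies need to be brought together.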
MapReduce (Hadoop)
Pipeline stages shown: data, map, sort, copy, merge, reduce, output
Input data:
195005150700+0000
195005151200+0022
195005151800-0011
194903241200+0111
194903241800+0078
Map output: (1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)
Reduce input: 1950 [0, 22, -11]; 1949 [111, 78]
Reduce output: 1950 [22]; 1949 [111]
Source: Tom White: Hadoop, The Definitive Guide, 3rd Ed., 2012 (O’Reilly & Yahoo! Press)
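The maximum-temperature example can be imitated in a few lines of plain Python. This is a minimal single-process sketch, not Hadoop code: the function names are hypothetical, and the record layout (first four characters are the year, last five are a signed temperature) is an assumption inferred from the sample records and their mapped values on the slide.

```python
from collections import defaultdict

records = [
    "195005150700+0000",
    "195005151200+0022",
    "195005151800-0011",
    "194903241200+0111",
    "194903241800+0078",
]

def map_record(record):
    """Map: emit one (year, temperature) pair per input record."""
    year = record[:4]                # assumed layout: year comes first
    temperature = int(record[-5:])   # assumed layout: signed, zero-padded reading
    return year, temperature

def shuffle(pairs):
    """Sort/merge: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_max(year, temperatures):
    """Reduce: keep only the maximum temperature for the year."""
    return year, max(temperatures)

mapped = [map_record(record) for record in records]
grouped = shuffle(mapped)
output = dict(reduce_max(year, temps) for year, temps in grouped.items())
print(output)
```

In a real cluster the map calls run on the nodes that hold the input blocks and the grouping happens over the network; the sketch keeps only the data flow.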
Twitter Storm
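[Figure: a Storm pipeline as an office workflow]
Produce report (spout, data source) → Print (bolt) → Collate & bind (bolt) → Sign (bolt) → Send (bolt, data sink)
The bolts between the data source and the data sink are the data processors.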
Twitter Storm: Basic principle
Spout (data source) → Bolt (data processor) → Bolt (data sink/writer)
Input records:
195005150700+0000
195005151200+0022
195005151800-0011
194903241200+0111
194903241800+0078
Example for the record 194903241200+0111 (reading: 111):
• Received: 111; current max: 22; new max: 111
• Overwrite 22 with 111
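The same running-maximum idea can be sketched as a chain of Python generators, where each record flows through the pipeline as soon as it arrives. This does not use the Storm API; `spout`, `max_bolt`, and `writer_bolt` are hypothetical names, and the record layout (last five characters are a signed temperature) is an assumption based on the sample records.

```python
records = [
    "195005150700+0000",
    "195005151200+0022",
    "195005151800-0011",
    "194903241200+0111",
    "194903241800+0078",
]

def spout(source):
    """Data source: emit raw records one at a time."""
    for record in source:
        yield record

def max_bolt(stream):
    """Data processor: parse each reading and emit it only when it sets a new maximum."""
    current_max = None
    for record in stream:
        received = int(record[-5:])  # assumed layout: signed, zero-padded reading
        if current_max is None or received > current_max:
            current_max = received
            yield current_max

def writer_bolt(stream, store):
    """Data sink: overwrite the stored maximum and remember each write."""
    history = []
    for new_max in stream:
        store["max"] = new_max
        history.append(new_max)
    return history

store = {}
history = writer_bolt(max_bolt(spout(records)), store)
print(history, store)
```

Unlike the batch version, nothing waits for the whole input: the stored maximum is correct for all records seen so far at every moment.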
Twitter Storm: Topology
Twitter Storm: Pipelining and parallelization
[Figure: a stream, with one axis labeled “Pipelining” and the other labeled “Parallelization”]
Examples
• Twitter sentiment and volume
  – Elections
  – Stock trading
• News cohesiveness, volume, and sentiment
  – Correlation with VIX, CDS
  – Correlation with big events
• Vocabulary in news & blogs
  – Pump & dump use case
Slovene elections
• 3 candidates, 3 live debates
• Sentiment analysis provider: Gama System & our team at JSI
• Streamed live, in real time, in prime time during the debates on POP TV
• During and after the debates (3 broadcasts), the sentiment chart was shown 5 times (with commentary)
[Figure: sentiment timeline with markers for the first, second, and third live debates and the elections (first round)]
Annotated moments:
• Supporting the government
• Criticizing the government
• Criticizing a questionable pardoning of a criminal
• Justifying it
• Candidates justifying their wealth
• Candidates joined by their wives
“Democratic.”
Zver: “What kind of a political party leader were you if they (party members) didn’t follow your lead?”
Pahor: “Democratic.”
Polls vs. sentiment vs. outcome
Polls (Danilo Türk / Borut Pahor / Milan Zver, in %):
• Delo Stik (Delo, 9.11.): 44 / 31 / 25
• Mediana (Slovenske novice, 9.11.): 41.67 / 34.72 / 23.61
• Ninamedia (Mladina, 9.11.): 43.8 / 33.6 / 22.6
Twitter sentiment: “Borut Pahor will win”
Actual outcome (November 11, 2012):
1. Borut Pahor 40% (+4%)
2. Danilo Türk 36%
3. Milan Zver 24%
Twitter volume and election results
Source: Smailović, Kranjc, Juršič, Grčar, Gačnik, Mozetič: Monitoring the Twitter sentiment during the Bulgarian elections (2013; to appear)
There’s no such thing as bad publicity.
“We believe that Twitter and other social media reflect the underlying trend in a political race that goes beyond a district’s fundamental geographic and demographic composition. If people must talk about you, even in negative ways, it is a signal that a candidate is on the verge of victory. The attention given to winners creates a situation in which all publicity is good publicity.”
(DiGrazia, McKelvey, Bollen, Rojas: More tweets, more votes: Social media as a quantitative indicator of political behavior, February 2013)
Source: Sowa Labs GmbH
We’re looking at the stock of Amazon.com during 2012.
• The blue line shows the stock price.
• The red line shows the related Twitter sentiment.
• The black line is the 7-day moving average.
• An MA zero cross-over serves as a buy or sell signal.
• The green-red line shows whether we profited (green) or not (red) from blindly following the social signals.
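The signal rule can be sketched as follows. This is an illustration under stated assumptions, not the Sowa Labs implementation: the moving average is taken over the sentiment series with a trailing 7-day window, a cross from non-positive to positive is read as “buy” and a cross from non-negative to negative as “sell”, and `daily_sentiment` is made-up data.

```python
def moving_average(values, window=7):
    """Trailing moving average; None until a full window is available."""
    averages = []
    for i in range(len(values)):
        if i + 1 < window:
            averages.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            averages.append(sum(chunk) / window)
    return averages

def crossover_signals(averages):
    """Emit (day, signal) whenever the moving average crosses zero."""
    signals = []
    previous = None
    for day, current in enumerate(averages):
        if current is None:
            continue
        if previous is not None:
            if previous <= 0 < current:
                signals.append((day, "buy"))    # assumed: upward cross means buy
            elif previous >= 0 > current:
                signals.append((day, "sell"))   # assumed: downward cross means sell
        previous = current
    return signals

# Hypothetical daily sentiment scores (positive minus negative tweets).
daily_sentiment = [-1] * 7 + [3] * 4 + [-5] * 4

averages = moving_average(daily_sentiment)
signals = crossover_signals(averages)
print(signals)
```

The averaging delays each signal by a few days compared with the raw sentiment, which is the price paid for filtering out single-day noise.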
Source: Sowa Labs GmbH
Chart markers: Q4/’11 results, Q1 results, Q2 results, Q3 results
On April 26, 2012, Amazon announced financial results for its first quarter ended March 31, 2012. Amazon had been spending lots of money on expanding its operations, so analysts expected a huge drop in profit for this first quarter. However, Amazon blew analysts’ estimates away. Even though earnings did fall, they didn’t decline nearly as much as analysts had feared.
Amazon earned $130 million or 28 cents per share for the quarter that ended March 31. That was a 35% decline from a year ago, but it was much better than the 7 cents per share forecasts from analysts polled by Thomson Reuters.
Based on this news, Amazon shares surged nearly 16% on Friday morning April 27, 2012.
Source: Sowa Labs GmbH
The sentiment MA cross-over happens well before the price jump.
Source: Sowa Labs GmbH
We’re looking at the stock of Google during 2012.
Chart markers: Q4/’11 results, Q1 results, Q2 results, Q3 results
On October 18, 2012, Google’s shares plunged by 9% after the search giant’s third-quarter earnings came in considerably lower than expected.
The results were accidentally released several hours earlier than expected, leading to a halt in the shares’ trading for a time.
Source: Sowa Labs GmbH
The sentiment MA cross-over happens well before the price plunge.
Sentiment in news: Spain, Greece, Italy, Germany
News cohesiveness and VIX
VIX – implied volatility of S&P500 (aka fear index)
Source: Rudjer Boskovic Institute, Boston University, Jozef Stefan Institute
News cohesiveness and CDS
CDS – Credit Default Swaps (insurance against default)
Source: Rudjer Boskovic Institute, Boston University, Jozef Stefan Institute
Pump & dump
Source: b-next, Goethe Universität, JSI (FIRST)
Pump & dump
[Figure: pump & dump detection model]
Input attributes: Country Black List, Industry Black List, Company Black List, Age, Bankrupt, Trading Volume, Number of Trades, Market Capitalization, Market Segment, Sentiment, Content
Aggregate nodes: Black List, History, Market, Trading, News, Company, Financial Instrument, Comp_FinInst
Output: Pump & Dump
Source: b-next, Goethe Universität, JSI (FIRST)
Quick recap (1/3)
• Big data: volume, velocity, variety
• Enablers
  – Storage capacity & processing power
  – Maturity of technologies
  – Availability of data, e.g., social networks and mobile devices
  – Mindset
• Financial domain: one of the biggest gainers
Quick recap (2/3)
Solving big data problems
– Distributed infrastructure
  • Amazon EC2
– Distributed processing capacity
  • MapReduce (Hadoop)
  • Twitter Storm
– Distributed storage
Quick recap (3/3)
Examples
– Elections
  • No such thing as bad publicity
– Stock trading
  • Sentiment vs. price, Twitter volume vs. trading volume
– News & blogs
  • Volume & sentiment expose big events
  • Cohesiveness vs. VIX & CDS
  • Content and sentiment as inputs into a pump & dump detection model
http://www.sowalabs.de (coming really soon!)