Big data analytics Miha Grčar 1,2 1 Jožef Stefan Institute 2 Sowa Labs GmbH Large-scale information extraction and integration infrastructure for supporting financial decision making (FP7-ICT-257928) http://project-first.eu


Feb 23, 2016

Transcript
Page 1: Big data analytics

Big data analytics

Miha Grčar 1,2

1 Jožef Stefan Institute
2 Sowa Labs GmbH

Large-scale information extraction and integration infrastructure for supporting financial decision making (FP7-ICT-257928) http://project-first.eu

Page 2: Big data analytics

Miha Grčar 2

Outline

• What is big data? What caused it? Who should care?

• Solving big data problems

• Examples

Frankfurt, 11/11/2013

Page 3: Big data analytics

What is big data?

• “How many terabytes?”
• We deliberately avoid being specific
• Big data refers to datasets that cannot be captured, stored, managed, and/or analyzed by the mainstream storage and processing devices

Page 4: Big data analytics

What is big data?

Page 5: Big data analytics

What caused big data?
Storage capacity and processing power

Source: Hilbert and López, “The world’s technological capacity to store, communicate, and compute information,” Science, 2011

Page 6: Big data analytics

What caused big data?
Data availability (industry)

Source: IDC; US Bureau of Labor Statistics; McKinsey Global Institute analysis

Page 7: Big data analytics

What caused big data?
Data availability (social media and mobile devices)

Source: www.creotivo.com

Page 8: Big data analytics

What caused big data?
Data availability (sensors)

Source: Analyst interviews; McKinsey Global Institute analysis

Page 9: Big data analytics

What caused big data?
Maturity of technologies & tools

Source: Gartner (July, 2012)

Chart labels: Emerging, hyped; Mature

Page 10: Big data analytics

Who should care about big data?

Source: US Bureau of Labor Statistics; McKinsey Global Institute analysis

Page 11: Big data analytics

Solving big data problems

• Distributed infrastructure
– Cloud: Amazon Elastic Compute Cloud (EC2)
• Distributed processing
– MapReduce / batches: Hadoop
– Distributed workflows / streams: Twitter Storm
• Distributed storage
– Distributed FS/DB
– NoSQL

Page 12: Big data analytics

Solving big data problems

• Distributed infrastructure
– Cloud: Amazon Elastic Compute Cloud (EC2)
• Distributed processing
– MapReduce / batches: Hadoop
– Distributed workflows / streams: Twitter Storm
• Distributed storage
– Distributed FS/DB
– NoSQL

Amazon EC2, Windows Azure, Google Cloud Platform, Cloudwatt…

Hadoop, MS DryadLINQ, Disco, Misco, Phoenix, Cloud MapReduce, bashreduce, Qizmt…
Storm (Twitter), S4 (Yahoo), “Real-time Hadoops”: Impala, HFlame, Spark…

Google File System, HDFS, Google Big Table, HBase, Cassandra, MongoDB, CouchDB, Hive…

Page 13: Big data analytics

Amazon EC2
EC2 = ECC = Elastic Compute Cloud

• Central part of Amazon.com’s cloud computing service
• ~500,000 physical Linux machines
• Elastic: possibility to start / stop servers with respect to demand; pay only for running servers
• Instances (several examples)
– Micro, 1 ECU, 1 Core, 613 MiB
– High-Memory XL, 6.5 ECUs, 2 Cores, 17.1 GiB
– High-CPU XL, 20 ECUs, 8 Cores, 7 GiB
• OS
– Windows
– Linux
– FreeBSD
• Storage
– Temporary instance-storage
– Persistent Elastic Block Store (EBS)
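The "elastic" idea on the EC2 slide (start and stop servers with demand, pay only for running servers) can be illustrated with a little arithmetic. This is only a sketch: the demand figures and the per-server capacity are hypothetical, and no real EC2 API or price is used.

```python
import math

def servers_needed(requests_per_hour, capacity_per_server):
    """Smallest number of servers that can handle one hour of demand."""
    return math.ceil(requests_per_hour / capacity_per_server)

def server_hours(demand, capacity_per_server, elastic=True):
    """Total server-hours paid for over the whole demand profile."""
    per_hour = [servers_needed(d, capacity_per_server) for d in demand]
    if elastic:
        # Elastic: servers are started and stopped every hour as needed.
        return sum(per_hour)
    # Fixed fleet: sized for the peak hour and kept running all the time.
    return max(per_hour) * len(demand)

# Hypothetical demand profile (requests per hour) and server capacity.
demand = [100, 100, 400, 900, 400, 100]
elastic_hours = server_hours(demand, capacity_per_server=100, elastic=True)
fixed_hours = server_hours(demand, capacity_per_server=100, elastic=False)
print(elastic_hours, fixed_hours)
```

With a spiky demand profile, the elastic fleet accumulates far fewer server-hours than a fleet permanently sized for the peak.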

Page 14: Big data analytics
Page 15: Big data analytics
Page 16: Big data analytics
Page 17: Big data analytics

MapReduce (Hadoop)

Page 18: Big data analytics

MapReduce illustrated with ballot counting (figure labels):

• A bunch of ballots, all mixed up…
• Map
• Still mixed up…
• Reduce
• Election results: A: 321,015; B: 179,539; C: 201,734
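The ballot-counting picture can be sketched in a few lines of plain Python. This is a single-process illustration of the map, group, and reduce steps, not Hadoop code; the function names and the sample ballots are hypothetical and are not the figures from the slide.

```python
from collections import defaultdict

def map_ballot(ballot):
    """Map step: turn one ballot into a (candidate, 1) pair."""
    return (ballot, 1)

def reduce_votes(candidate, ones):
    """Reduce step: add up all the 1s collected for one candidate."""
    return (candidate, sum(ones))

def count_ballots(ballots):
    # Map every ballot independently (this part could run on many machines).
    pairs = [map_ballot(b) for b in ballots]
    # Group the pairs by candidate so each reducer sees one candidate only.
    groups = defaultdict(list)
    for candidate, one in pairs:
        groups[candidate].append(one)
    # Reduce each group to a single total.
    return dict(reduce_votes(c, ones) for c, ones in groups.items())

# Hypothetical mixed-up ballots.
ballots = ["A", "C", "B", "A", "A", "C", "B", "A"]
results = count_ballots(ballots)
print(results)
```

Because each ballot is mapped on its own and each candidate is reduced on its own, both steps can be spread over many workers.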

Page 19: Big data analytics

MapReduce (Hadoop)

Stages (figure labels): data, map, sort, copy, merge, reduce, output

Input records:
195005150700+0000
195005151200+0022
195005151800-0011
194903241200+0111
194903241800+0078

Map output:
1950 0
1950 22
1950 -11
1949 111
1949 78

Reduce input:
1950 [ 0, 22, -11 ]
1949 [ 111, 78 ]

Reduce output:
1950 [ 22 ]
1949 [ 111 ]

Source: Tom White: Hadoop, The Definitive Guide, 3rd Ed., 2012 (O’Reilly & Yahoo! Press)
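The maximum-temperature-per-year example on this slide can be reproduced with a minimal in-memory sketch. It is plain Python rather than the Hadoop API, and it assumes the simplified record layout shown on the slide: a 12-character timestamp (YYYYMMDDhhmm) followed by a signed temperature.

```python
from itertools import groupby
from operator import itemgetter

RECORDS = [
    "195005150700+0000",
    "195005151200+0022",
    "195005151800-0011",
    "194903241200+0111",
    "194903241800+0078",
]

def map_record(record):
    """Map: emit (year, temperature) for one record (assumed layout)."""
    year = record[:4]
    temperature = int(record[12:])
    return (year, temperature)

def shuffle(pairs):
    """Sort by key, then merge all values that share the same key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {year: [t for _, t in group]
            for year, group in groupby(ordered, key=itemgetter(0))}

def reduce_max(year, temperatures):
    """Reduce: keep only the highest temperature for the year."""
    return (year, max(temperatures))

mapped = [map_record(r) for r in RECORDS]
grouped = shuffle(mapped)
max_by_year = dict(reduce_max(y, ts) for y, ts in grouped.items())
print(max_by_year)
```

The intermediate values in `mapped` and `grouped` correspond to the map output and the reduce input listed on the slide.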

Page 20: Big data analytics

MapReduce (Hadoop)

Source: Tom White: Hadoop, The Definitive Guide, 3rd Ed., 2012 (O’Reilly & Yahoo! Press)

Page 21: Big data analytics

Twitter Storm

Analogy (figure labels): Produce report → Print → Collate & bind → Sign → Send
Components: Spout → Bolt → Bolt → Bolt → Bolt
Roles: Data source → Data processors → Data sink

Page 22: Big data analytics

Twitter Storm
Basic principle

Data source (Spout) → Data processor (Bolt) → Data sink/writer (Bolt)

Input records:
195005150700+0000
195005151200+0022
195005151800-0011
194903241200+0111
194903241800+0078

Record in flight: 194903241200+0111 → 111

Received: 111
Current max: 22
New max: 111

Overwrite 22 with 111
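The spout/bolt principle on this slide can be mimicked with Python generators. This is only a sketch of the idea, not the Storm API: the names `spout`, `parse_bolt`, and `MaxWriter` are hypothetical, the record layout (12-character timestamp plus signed temperature) is assumed from the slide, and the sink is assumed to track one overall maximum.

```python
def spout(records):
    """Data source: emits one raw record at a time."""
    for record in records:
        yield record

def parse_bolt(stream):
    """Data processor: extracts the temperature from each record."""
    for record in stream:
        yield record, int(record[12:])

class MaxWriter:
    """Data sink/writer: keeps only the largest value seen so far."""

    def __init__(self):
        self.current_max = None
        self.overwrites = []  # (old, new) pairs, kept for illustration

    def consume(self, stream):
        for _record, value in stream:
            if self.current_max is None or value > self.current_max:
                self.overwrites.append((self.current_max, value))
                self.current_max = value

records = [
    "195005150700+0000",
    "195005151200+0022",
    "195005151800-0011",
    "194903241200+0111",
    "194903241800+0078",
]

sink = MaxWriter()
sink.consume(parse_bolt(spout(records)))
print(sink.current_max, sink.overwrites)
```

Unlike the batch version, nothing is sorted or grouped first: each record updates the running result as soon as it arrives, which is what the "overwrite 22 with 111" step shows.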

Page 23: Big data analytics

Twitter Storm
Topology

Page 24: Big data analytics

Twitter Storm
Pipelining and parallelization

Figure labels: Stream, Pipelining, Parallelization
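One way to picture pipelining combined with parallelization is to parse each record in a first stage and then route it, by key, to one of several copies of the second stage. The sketch below simulates this sequentially in one process; it is not the Storm API, the routing rule is an assumption (similar in spirit to grouping tuples by a field), and the `MaxBolt` and `route` names are hypothetical.

```python
class MaxBolt:
    """One parallel instance of the second stage: max temperature per year."""

    def __init__(self):
        self.max_by_year = {}

    def process(self, year, temperature):
        best = self.max_by_year.get(year)
        if best is None or temperature > best:
            self.max_by_year[year] = temperature

def route(year, n_workers):
    """Deterministic partitioning: a given year always reaches the same worker."""
    return int(year) % n_workers

def run(records, n_workers=2):
    workers = [MaxBolt() for _ in range(n_workers)]
    for record in records:                               # the stream
        year, temp = record[:4], int(record[12:])        # stage 1: parse
        workers[route(year, n_workers)].process(year, temp)  # stage 2: parallel
    return workers

records = [
    "195005150700+0000",
    "195005151200+0022",
    "195005151800-0011",
    "194903241200+0111",
    "194903241800+0078",
]
workers = run(records)
print([w.max_by_year for w in workers])
```

Routing by key matters here: because all records for a year reach the same worker, each worker can keep its own running maximum without coordinating with the others.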

Page 25: Big data analytics

Examples

• Twitter sentiment and volume
– Elections
– Stock trading
• News cohesiveness, volume, and sentiment
– Correlation with VIX, CDS
– Correlation with big events
• Vocabulary in news & blogs
– Pump & dump use case

Page 26: Big data analytics

Slovene elections

• 3 candidates, 3 live debates
• Sentiment analysis provider: Gama System & our team at JSI
• Streamed live, in real time, in prime time during the debates on POP TV
• During and after the debates (3 broadcasts), the sentiment chart was shown 5 times (with commentary)

Page 27: Big data analytics

Chart annotations: First live debate; Second live debate; Third live debate; Elections (first round)

Page 28: Big data analytics

Chart annotations:

• Supporting the gov
• Criticizing the gov
• Criticizing a questionable pardoning of a criminal
• Justifying it
• Candidates justifying their wealth
• Candidates joined by their wives

Page 29: Big data analytics

“Democratic.”

Zver: “What kind of a political party leader were you if they (party members) didn’t follow your lead?”

Pahor: “Democratic.”

Page 30: Big data analytics

Polls vs. sentiment vs. outcome

Polls:
• Delo Stik (Delo, 9.11.): 44 / 31 / 25
• Mediana (Slovenske novice, 9.11.): 41.67 / 34.72 / 23.61
• Ninamedia (Mladina, 9.11.): 43.8 / 33.6 / 22.6

Chart (Mediana poll): Danilo Türk 41.67%, Borut Pahor 34.72%, Milan Zver 23.61%

Twitter sentiment: “Borut Pahor will win”

Actual outcome (November 11, 2012):
1. Borut Pahor 40% (+4%)
2. Danilo Türk 36%
3. Milan Zver 24%

Page 31: Big data analytics
Page 32: Big data analytics

Twitter volume and election results

Source: Smailović, Kranjc, Juršič, Grčar, Gačnik, Mozetič: Monitoring the Twitter sentiment during the Bulgarian elections (2013; to appear)

There’s no such thing as bad publicity.

“We believe that Twitter and other social media reflect the underlying trend in a political race that goes beyond a district’s fundamental geographic and demographic composition. If people must talk about you, even in negative ways, it is a signal that a candidate is on the verge of victory. The attention given to winners creates a situation in which all publicity is good publicity.”

(DiGrazia, McKelvey, Bollen, Rojas: More tweets, more votes: Social media as a quantitative indicator of political behavior, February 2013)

Page 33: Big data analytics

Source: Sowa Labs GmbH

We’re looking at the stock of Amazon.com… during 2012.

The blue line shows the stock price.

The red line shows the related Twitter sentiment.

The black line is the 7-day moving average.

A MA zero cross-over serves as a buy or sell signal.

The green-red line shows whether we profited (green) or not (red) from blindly following the social signals.
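The moving-average zero cross-over rule can be sketched as follows. This is a minimal illustration and not Sowa Labs' actual method: the daily sentiment values are synthetic, the average is assumed to be a trailing 7-day mean, and treating an upward crossing as "buy" and a downward crossing as "sell" is an assumption about how the signal is read.

```python
def moving_average(values, window=7):
    """Trailing moving average; entries before a full window are None."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out

def crossover_signals(ma):
    """Return (day_index, 'buy' | 'sell') wherever the average crosses zero."""
    signals = []
    for i in range(1, len(ma)):
        prev, curr = ma[i - 1], ma[i]
        if prev is None or curr is None:
            continue
        if prev <= 0 < curr:
            signals.append((i, "buy"))    # average turned positive
        elif prev >= 0 > curr:
            signals.append((i, "sell"))   # average turned negative
    return signals

# Synthetic daily sentiment: a negative week, a positive week, a negative week.
daily_sentiment = [-1] * 7 + [1] * 7 + [-1] * 7
ma7 = moving_average(daily_sentiment, window=7)
signals = crossover_signals(ma7)
print(signals)
```

Smoothing delays the signal: in this synthetic series the sentiment flips on day 7, but the 7-day average only crosses zero a few days later.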

Page 34: Big data analytics

Source: Sowa Labs GmbH

Chart labels: Q4/’11 results, Q1 results, Q2 results, Q3 results

On April 26, 2012, Amazon announced financial results for its first quarter ended March 31, 2012. Amazon had been spending lots of money on expanding its operations, so analysts expected a huge drop in profit for this first quarter. However, Amazon blew analysts’ estimates away. Even though earnings did fall, they didn’t decline nearly as much as analysts had feared.

Amazon earned $130 million, or 28 cents per share, for the quarter that ended March 31. That was a 35% decline from a year ago, but it was much better than the 7 cents per share forecast by analysts polled by Thomson Reuters.

Based on this news, Amazon shares surged nearly 16% on Friday morning, April 27, 2012.

Page 35: Big data analytics

Source: Sowa Labs GmbH

The sentiment MA cross-over happens well before the price jump.

Page 36: Big data analytics

Source: Sowa Labs GmbH

We’re looking at the stock of Google… during 2012.

Chart labels: Q4/’11 results, Q1 results, Q2 results, Q3 results

On October 18, 2012, Google’s shares plunged by 9% after the search giant’s third-quarter earnings came in considerably lower than expected.

The results were accidentally released several hours earlier than expected, leading to a halt in the shares’ trading for a time.

Page 37: Big data analytics

Source: Sowa Labs GmbH

The sentiment MA cross-over happens well before the price plunge.

Page 38: Big data analytics

Source: Sowa Labs GmbH

Page 39: Big data analytics

Sentiment in news: Spain, Greece, Italy, Germany

Page 40: Big data analytics

News cohesiveness and VIX

VIX – implied volatility of S&P 500 (aka fear index)

Source: Rudjer Boskovic Institute, Boston University, Jozef Stefan Institute

Page 41: Big data analytics

News cohesiveness and CDS

CDS – Credit Default Swaps (insurance against default)

Source: Rudjer Boskovic Institute, Boston University, Jozef Stefan Institute

Page 42: Big data analytics

Pump & dump

Source: b-next, Goethe Universität, JSI (FIRST)

Page 43: Big data analytics

Pump & dump

Model attributes (figure labels): Country Black List, Industry Black List, Company Black List, Age, Bankrupt, Trading Volume, Number of Trades, Market Capitalization, Market Segment, Sentiment, Content, Black List, History, Market, Trading, News, Company, Financial Instrument, Comp_FinInst, Pump & Dump

Source: b-next, Goethe Universität, JSI (FIRST)

Page 44: Big data analytics

Quick recap (1/3)

• Big data: volume, velocity, variety
• Enablers
– Storage capacity & processing power
– Maturity of technologies
– Availability of data, e.g., social networks and mobile devices
– Mindset
• Financial domain: one of the biggest gainers

Page 45: Big data analytics

Quick recap (2/3)

Solving big data problems

– Distributed infrastructure
• Amazon EC2
– Distributed processing capacity
• MapReduce (Hadoop)
• Twitter Storm
– Distributed storage

Page 46: Big data analytics

Quick recap (3/3)

Examples

– Elections
• No such thing as bad publicity
– Stock trading
• Sentiment vs. price, Twitter volume vs. trading volume
– News & blogs
• Volume & sentiment expose big events
• Cohesiveness vs. VIX & CDS
• Content and sentiment as inputs into a pump & dump detection model

Page 47: Big data analytics

http://www.sowalabs.de (coming really soon!)

Large-scale information extraction and integration infrastructure for supporting financial decision making (FP7-ICT-257928) http://project-first.eu