Real Time Machine Learning Architecture & Sentiment Analysis Quantcon 2017, Singapore Juan CHENG, PHD Data Scientist [email protected] www.infotrie.com @infotrie www.finsents.com @finsents
Jan 21, 2018
● About us
● News analytics signals in Finance
● Big data architecture
● Demo cases
Frederic GEORJON, CEO
Ajil GEORGE, Head of Development Center
Daniel ABROUK, Head of EMEA
Paris/Singapore London
LONG Zhicheng, CTO
Singapore India
FinSentS.com
➔ Real-time information and trading portal
➔ Millions of sources / Multilingual
➔ SaaS or on premises
➔ Real-time Alerts
➔ Actionable signals
Sentiment Data
➔ Through API or third parties
➔ Up to 15 years of history
➔ Low latency / Tick by tick
➔ 50,000+ entities
➔ Stock, Forex, commodities, index, Macroeconomic topics etc.
Consultancy and Training
➔ Trading Technology
➔ Algorithmic trading
➔ Big Data
➔ Natural Language Processing (NLP)
➔ Machine Learning
Access to News / News management
- Visualization tools
- Filtering tools
- On demand view
Feed from multiple sources:
- Social Media
- Web based content
- Private sources
- Internal data
News Content Alerts based on sentiment indicator
Provide accurate information from the Big Data environment and push it in front of users in real time for risk management
Dashboard
- Consolidated Dashboard
- Portfolio Alerts
Actionable indicators
Users receive news signals for trading / hedging / risk management based on a sentiment indicator
Algo Trading / Robo Trading
Real Time algorithmic trading Sentiment indicator and News Analytics
Equity Research / Sales Team Hedging Trader / Prop Trader
- News Tag Cloud
- Filtering newsfeed with Social media blotter, news blotter
- Search Engine on demand
- Topics detection
- Rumours alerts
- News qualification per importance
- Relevant information from single screen
- Automatic Alert
- Integrated to OMS
Provide relevant news analytics indicator for hedging or trade idea generation
News analytics signals fully integrated into algo trading strategies
Reuters MARKET NEWS | Fri Oct 21, 2016 | 2:18am EDT
AT&T acquires Time Warner for $85 billion
NEW YORK - AT&T Inc said it agreed to buy Time Warner Inc for $85.4 billion, the boldest move yet by a telecommunications company to acquire content to stream over its high-speed network to attract a growing number of online viewers.
The trend of consolidation comes as technology advances have been upending traditional entertainment companies. Many in the industry believe that getting bigger is the best way to compete with companies like Google, Apple, Netflix and Facebook.
David Goldman and Paul R. La Monica contributed to this report.
Source
Category
Time
Location
Named Entity
Sentiment
Event
Hacking skills, regex, NLP, named entity recognition, POS taggers
- Companies, indexes
- People, locations, organizations
- Events
- Regions
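As a rough illustration of the regex and entity-lookup step, the sketch below pulls source, category, date and time out of a news header line and finds known entities with a small dictionary lookup. It is not the pipeline used in the deck: `HEADER_RE`, `GAZETTEER`, `parse_header` and `find_entities` are hypothetical names, and a production system would more likely use a trained named entity recognizer than a hand-written list.

```python
import re

# Hypothetical gazetteer for illustration only; a real one would hold
# tens of thousands of companies, people, locations and events.
GAZETTEER = {
    "AT&T": "company",
    "Time Warner": "company",
    "Google": "company",
    "Netflix": "company",
    "NEW YORK": "location",
}

# Assumed header layout: "<Source> <CATEGORY> | <date> | <time> <timezone>"
HEADER_RE = re.compile(
    r"^(?P<source>\w+)\s+(?P<category>[A-Z ]+?)\s*\|\s*"
    r"(?P<date>\w{3} \w{3} \d{1,2}, \d{4})\s*\|\s*"
    r"(?P<time>\d{1,2}:\d{2}[ap]m) (?P<tz>\w+)$"
)


def parse_header(line):
    """Return the header fields as a dict, or None if the line does not match."""
    match = HEADER_RE.match(line)
    return match.groupdict() if match else None


def find_entities(text):
    """Return (name, type) pairs for gazetteer entries, in order of appearance."""
    found = []
    for name, kind in GAZETTEER.items():
        # Require non-word characters (or string boundaries) around the name.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        for match in re.finditer(pattern, text):
            found.append((match.start(), name, kind))
    return [(name, kind) for _, name, kind in sorted(found)]


header = parse_header("Reuters MARKET NEWS | Fri Oct 21, 2016 | 2:18am EDT")
entities = find_entities("AT&T Inc said it agreed to buy Time Warner Inc")
```

Here `header` carries the Source, Category and Time labels, and `entities` the Named Entity label.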
NLP
Text
- Dow Jones, Bloomberg
- Web news, blogs, Twitter
- 1000+ sources
Feature Extraction
Classification
Sentiment
- 15 years history
- Tens of millions of articles
Training
Indexing
- Sector/industry
- Commodity, FX, ETFs
- Political, country risk
- Macroeconomic
- Fear, greed, anger, happiness
Aggregation
● Entity ● Classification● Sentiment
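The aggregation step can be sketched as a weighted average of article-level scores per entity and day. This is only one plausible scheme, not the deck's actual method: the field names (`entity`, `day`, `sentiment`, `weight`), the idea of a per-article relevance weight and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_sentiment(scored_articles):
    """Return {(entity, day): weighted mean sentiment} for scored articles."""
    totals = defaultdict(lambda: [0.0, 0.0])  # [weighted score sum, weight sum]
    for article in scored_articles:
        key = (article["entity"], article["day"])
        totals[key][0] += article["sentiment"] * article["weight"]
        totals[key][1] += article["weight"]
    # Skip keys whose weights sum to zero to avoid dividing by zero.
    return {key: s / w for key, (s, w) in totals.items() if w > 0}


# Made-up sample data, scored on an assumed -5 to +5 scale.
articles = [
    {"entity": "Ping An", "day": "2016-10-09", "sentiment": 4.0, "weight": 1.0},
    {"entity": "Ping An", "day": "2016-10-09", "sentiment": 2.0, "weight": 3.0},
    {"entity": "AT&T", "day": "2016-10-21", "sentiment": -1.0, "weight": 2.0},
]
daily = aggregate_sentiment(articles)
```

Weighting lets a highly relevant article count for more than a passing mention. With all weights set to 1 this reduces to a plain mean.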
Ping An Insurance Group
• SSE: 601318 (A share)
• SEHK: 02318 (H share)
• Also known as Ping An of China
• A holding company whose subsidiaries mainly deal with insurance, banking, and financial services
• Constituent of Shanghai Stock Exchange 50 A Share Index (SSE50)
• A component of Hang Seng Index
NoSQL Database: cache, persistent
Kafka
Apache Storm: filter, topic classification, sentiment calculation, entity detection, stock mapping, sentiment aggregation
DFS: NLP models, ML models
Producers: blogs, Twitter, news, Bloomberg...
Apache Spark: model training, batch cleaning, batch calculation
Solr
Relational Database
Web app
[Chart: Ping An sentiment from Mandarin and English news plotted against the closing price, 08/14/2016 to 11/07/2016. Annotations: positive corporate announcement on stock dividend release; positive announcement on insurance fee income and 17.1% rise of revenue in the first three quarters; lead signal in the subsequent price rise.]
get Articles, Treemap, Tags, Company Sentiment, Sentiment History, Company Static Tags, News Buzz, Data, Articles Tag, Index Score, Asset Sentiment, Article Ids, Leaders Laggers…
Easy API call
Available @ [email protected]@infotrie
Train Document Set:
d1: The sky is blue.
d2: The sun is bright.
Test Document Set:
d3: The sun in the sky is bright.
d4: We can see the shining sun, the bright sun.
Vector Space Model (VSM): a matrix with one row per document (d1, d2, ...) and one column per term (t1, t2, ...)
Train Document Set:
d1: The sky is blue.
d2: The sun is bright.
Vocabulary
Term frequency (TF)
TF alone over-weights terms that are present in almost the entire corpus
TF-IDF
TF example / IDF example
Normalized TF-IDF
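The TF, IDF and normalization steps can be worked through on the four example documents as follows. This sketch makes several assumptions the slides do not state: the stop-word list, the vocabulary being built from the train set, the IDF being computed over the test set, and a smoothed IDF formula, ln((1 + N) / (1 + df)) + 1. Other conventions give different numbers.

```python
import math
import re

STOP_WORDS = {"the", "is", "in", "we", "can", "see"}  # assumed stop-word list

train = ["The sky is blue.", "The sun is bright."]
test = ["The sun in the sky is bright.",
        "We can see the shining sun, the bright sun."]


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


# Vocabulary: the distinct non-stop-word terms of the train set, sorted.
vocabulary = sorted({w for doc in train for w in tokenize(doc)})


def term_frequency(doc):
    """Raw count of each vocabulary term in the document."""
    tokens = tokenize(doc)
    return [tokens.count(term) for term in vocabulary]


def inverse_document_frequency(docs):
    """Smoothed IDF per vocabulary term: ln((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    token_sets = [set(tokenize(d)) for d in docs]
    return [math.log((1 + n) / (1 + sum(term in s for s in token_sets))) + 1
            for term in vocabulary]


def normalized_tfidf(doc, idf):
    """TF-IDF weights scaled to unit Euclidean (L2) length."""
    weights = [tf * w for tf, w in zip(term_frequency(doc), idf)]
    norm = math.sqrt(sum(x * x for x in weights))
    return [x / norm if norm else 0.0 for x in weights]


idf = inverse_document_frequency(test)
vectors = [normalized_tfidf(doc, idf) for doc in test]
```

With this setup the vocabulary is `['blue', 'bright', 'sky', 'sun']`, d3 has term counts `[0, 1, 1, 1]` and d4 has `[0, 1, 0, 2]`. "shining" is ignored because it never appears in the train set.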
Machine Learning
Analytics on Massive Historical Text Data
Analytics on recent past
Realtime analytics
Batch layer / Real-time layer
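One common way to combine a batch layer and a real-time layer is to answer queries from both views at once: a batch view recomputed over the full history, plus a small real-time view covering only what has arrived since the last batch run. The sketch below shows that idea with plain counters. The function names and sample records are hypothetical, and the deck does not say how its two layers are actually merged.

```python
from collections import Counter


def batch_view(historical_articles):
    """Batch layer: article counts per entity over the full history."""
    return Counter(a["entity"] for a in historical_articles)


def realtime_view(recent_articles):
    """Real-time layer: counts for articles newer than the last batch run."""
    return Counter(a["entity"] for a in recent_articles)


def article_count(entity, batch, realtime):
    """Query: merge both views so results are complete and up to date."""
    return batch[entity] + realtime[entity]


# Made-up sample records.
history = [{"entity": "Ping An"}, {"entity": "Ping An"}, {"entity": "AT&T"}]
recent = [{"entity": "Ping An"}]

batch = batch_view(history)
realtime = realtime_view(recent)
total = article_count("Ping An", batch, realtime)
```

When the next batch run absorbs the recent articles, the real-time view is reset, so no article is counted twice.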
Fast and general engine for large-scale distributed data processing
Memory Network CPU’s Disk
Reference: spark
Logistic regression in Hadoop and Spark
Open source distributed real-time computation system that makes it easy to process unbounded streams of data
Storm was benchmarked at processing one million 100 byte messages per second per node on hardware with the following specs:
● Processor: 2x Intel [email protected]
● Memory: 24 GB
Reference: storm
Spout
bolt
✓ Guaranteed data processing
✓ Horizontal scalability
✓ Fault-tolerance
✓ Higher level abstraction than message passing
✓ Real-time machine learning for classification and predictive analytics
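The spout and bolt roles can be mimicked in plain Python with generators: a spout emits tuples into the stream, and each bolt consumes a stream and emits a transformed one. This is a single-process toy, not the Storm API, and it offers none of the guarantees listed above. The word lists, function names and threshold are hypothetical.

```python
POSITIVE = {"rise", "gain", "positive"}   # toy lexicon, for illustration only
NEGATIVE = {"strike", "fall", "loss"}


def news_spout(headlines):
    """Spout: the source of the stream, emitting one tuple per headline."""
    for text in headlines:
        yield {"text": text}


def sentiment_bolt(stream):
    """Bolt: attach a naive lexicon-based score to each tuple."""
    for item in stream:
        words = item["text"].lower().split()
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        yield {**item, "score": score}


def alert_bolt(stream, threshold=-1):
    """Bolt: pass on only tuples whose score is at or below the threshold."""
    for item in stream:
        if item["score"] <= threshold:
            yield item


def run_topology(headlines):
    """Wire spout -> sentiment bolt -> alert bolt and collect the alerts."""
    return list(alert_bolt(sentiment_bolt(news_spout(headlines))))


alerts = run_topology([
    "Revenue rise beats forecast",
    "Union calls strike after loss",
])
```

In Storm, the same wiring is declared as a topology and each stage runs in parallel across worker processes.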
Sentiment in itself is a powerful trading indicator out of which multiple trading strategies can be built
Simulate impact of complex events
➔ Scale analysis pipeline
➔ Live stats
➔ Recommendations
➔ Predictions
➔ Realtime analytics
➔ Online machine learning
Apply similar architecture in
MiFID alert
Improve client's communication
Regulatory
Process complex / low signals events
ESG monitoring
Ecological – Social – Governance
A union calls for a strike in a factory in Argentina?
Negative news coverage of a stock I hold is accelerating in the Chinese press but has not yet reached the English press?
A European company employs children in Bangladesh (*)?
ACTIONS
text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
Job
Executor
Nimbus
Zookeeper
Zookeeper
Worker
Worker
Worker
Worker
Velocity
Big Data
Variety
- News, blogs, social media, analyst reports, company announcement, traders’ chat room…
- Financial reports, price, economic events...
- Weather, GPS, image....
Volume
- ETL
- Machine learning
- Correlation analysis
- Regressions…
- As fast as possible
A. No, I find news noisy. There is just too much of it.
B. No, I'm a quant. I find it hard to quantify news.
C. Yes, but I find using news is not very efficient. I have to relate it to my portfolio manually.
❏ Guaranteed data processing
❏ Horizontal scalability
❏ Fault-tolerance
❏ Higher level abstraction than message passing
❏ Real-time machine learning for classification and predictive analytics
Analysis of an Indonesian Company “Pelindo” in English vs Bahasa Indonesia
Tracking of weak signals: local languages with little to no coverage in English press
[Chart axes: Sentiment (-5/5); News volume (#)]
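One simple way to surface the weak signals described above is to compare daily news volume per language and flag days on which the local-language press is active while the English press is silent. The sketch below assumes daily article counts are already available per language. The function name, thresholds and sample counts are hypothetical, not data from the Pelindo analysis.

```python
def weak_signal_days(local_counts, english_counts, min_local=3, max_english=0):
    """Return the days with active local coverage but little or no English coverage."""
    return sorted(
        day for day, n_local in local_counts.items()
        if n_local >= min_local and english_counts.get(day, 0) <= max_english
    )


# Made-up daily article counts for one company.
bahasa = {"2016-10-01": 5, "2016-10-02": 4, "2016-10-03": 1}
english = {"2016-10-02": 2}

flagged = weak_signal_days(bahasa, english)
```

The same comparison could be made on sentiment scores instead of volume, for example flagging days where local-language sentiment turns negative before English coverage appears.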