Close Encounters with Data Science

Oct 28, 2015

Geoff Yuen, Ph.D. VP Emerging Technology, PCCW [email protected]

What’s new about data?

• Data = values of qualitative or quantitative variables, belonging to a set of items (usually population)

• Data = often unstructured (without pre-established data model), usually raw file, different formats

Examples: chat, genome (DNA base pairs), picture

Lots of Data ≠ Insight

Data itself is not useful; we need insights!

It’s easy to get lost in your data

Jiawei Han. Abel Bliss Professor, Department of Computer Science, UIUC; “Pattern Discovery in Data Mining” Coursera online course with 75000 students 2/2015

2G Network

3G Network

900 MHz

1800 MHz

2100 MHz

2013 4G Network

The O2 mobile network has hundreds of cells to measure the trends in footfall across the country (Telefonica UK)

Network Data

Easier to use

Further protecting anonymity

Extrapolated to represent local population

Footfall is rendered into 200 x 200 metre grid squares

200 x 200 Grid

Drilling into footfall demographics

…

“US has killed Osama Bin Laden”
• average of 3,000 tweets per second
• 27,900,000 tweets in 2.5 hours
• peak of 12,384,000 tweets in one hour

Viral Social Data : From 1 to 14.9 million tweets in 5 minutes (1st May 2011)

The data is the second most important thing

Jeff Leek, Assistant Professor of Biostatistics, Data Science Program, Johns Hopkins University:

Focus on the problem first …

Facebook “Likes” Predicting Personality

Facebook can predict personality based on annotated data better than humans

… except for spouse

http://www.pnas.org/content/112/4/1036.full.pdf

What’s New About Analytics

• Golden Age of Analytics (1995-) Statistical Machine Learning has contributed many algorithms far more powerful than simple regression (list modified from Giovanni Seni, A9):

• 1983 CART (Tree)
• 1996 Lasso
• 1996 Bagging
• 1997 AdaBoost
• 2001 Random Forest
• 2003 Learning Ensembles
• 2004 Regularization & Boosted Lasso
• 2005-2013 Deep Belief / Deep Learning

There are now many ways to predict and classify structured and unstructured data!

1. Kinect Posture Detection

Kinect detection of body segments

Goal: Estimate Pose from Depth Image

A single input depth image is segmented into a dense probabilistic body part labeling, with the parts defined to be spatially localized near skeletal joints of interest

From depth images to joint positions in 3D

Challenges

• 3 trees, each of depth 20, were trained from 1 million images

• Get 3D models for 15 bodies with a variety of weights, heights, etc.

• Synthesize mocap data for all 15 body types

• Capture and sample 500K mocap frames of people kicking, driving, dancing, etc.

Get Lots of Training Data into ‘3 trees’

Kinect's reliable detection of body segments is based on the successful application of a famous analytic algorithm (random forest)
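The random forest idea can be sketched in a few lines. This is a minimal illustration, not the Kinect implementation: each "tree" is simplified to a depth-1 stump trained on a bootstrap sample with one randomly chosen feature, and the forest predicts by majority vote. The two features, the labels "hand" and "torso", and all function names are invented for the example.

```python
import random
from collections import Counter

def train_stump(rows, labels, rng):
    """Fit a depth-1 tree (one threshold) on one randomly chosen feature."""
    feature = rng.randrange(len(rows[0]))
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
        right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
        if not left or not right:
            continue
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = (sum(l != left_label for l in left)
                  + sum(l != right_label for l in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:  # no usable split: always predict the majority class
        majority = Counter(labels).most_common(1)[0][0]
        return (feature, 0.0, majority, majority)
    return (feature, best[1], best[2], best[3])

def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on its own bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], rng))
    return forest

def predict(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(left if row[f] <= t else right
                    for f, t, left, right in forest)
    return votes.most_common(1)[0][0]

# Invented toy data: two per-pixel features, two body-part labels.
ROWS = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.05, 0.1),
        (0.8, 0.9), (0.9, 0.7), (0.85, 0.95), (0.7, 0.8)]
LABELS = ["hand"] * 4 + ["torso"] * 4
FOREST = train_forest(ROWS, LABELS)
```

A real forest grows deep trees (depth 20 in the slide above) over many features; the bootstrap-and-vote structure is the part this sketch is meant to show.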

Opportunities

What application areas can benefit ?
Rehabilitation, motion training (martial arts, tennis, dry land training), elderly fall detection

With an aging population, fall detection and related services can be a major opportunity
• Australia : 30% of adults over 65 experience at least one fall per year, group predicted to increase from 14% to 23% (8.1 million) in 2050, costing $1.4 billion by 2051.
• China : 1405 mil vs 24 mil, a factor of 58 bigger !

Recommend for HK : elderly fall detection and motion training

Flyby Science is hard!

Flyby Science (typical)

Status Quo: Respond in days

Onboard analysis: Respond in minutes

NASA JPL: better flyby surface feature recognition by random forests

2. Deep learning

By 2017, 10 % of computers will be learning rather than processing (Gartner 2013)

Structured Data Unstructured Data

Regression

Linear or Logistic

Problem specific

Learning structure in data

non-Linear (polynomial)

Knowledge specific

Big Data finally found its analytic partner : deep learning

CIFAR-10 (units: accuracy %)

Rank  Result (%)  Method                                                                        Venue
1     94          Lessons learned from manually classifying CIFAR-10                            unpublished 2011
2     91.78       Deeply-Supervised Nets                                                        arXiv 2014
3     91.2        Network In Network                                                            ICLR 2014
4     90.68       Regularization of Neural Networks using DropConnect                           ICML 2013
5     90.65       Maxout Networks                                                               ICML 2013
6     90.61       Improving Deep Neural Networks with Probabilistic Maxout Units                ICLR 2014
7     90.5        Practical Bayesian Optimization of Machine Learning Algorithms                NIPS 2012
8     89          ImageNet Classification with Deep Convolutional Neural Networks               NIPS 2012
9     88.79       Multi-Column Deep Neural Networks for Image Classification                    CVPR 2012
10    84.87       Stochastic Pooling for Regularization of Deep Convolutional Neural Networks   arXiv 2013

• The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class.

• There are 50000 training images and 10000 test images.

• Classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck

Error Back propagation

Parallel Error Correction

Learning Layer by Layer

The new way to train multi-layer NNs…

Train this layer first, then this layer, then this layer, then this layer, finally this layer


EACH of the middle layers is trained to be an auto-encoder ... basically forced to learn good features coming from the previous layer
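A rough sketch of this layer-by-layer idea, under simplifying assumptions that are not from the slides: linear units, untied encoder and decoder weights, plain stochastic gradient descent, and invented toy data. Each layer is trained as an auto-encoder on the output of the layer before it, and only its encoder is kept. All names are hypothetical.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def train_autoencoder(data, n_hidden, epochs=200, lr=0.05, seed=0):
    """Train one linear auto-encoder to reconstruct its own input."""
    rng = random.Random(seed)
    n_in = len(data[0])
    enc = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    dec = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
    for _ in range(epochs):
        for x in data:
            h = matvec(enc, x)                                  # encode
            err = [y - xi for y, xi in zip(matvec(dec, h), x)]  # reconstruction error
            # Gradient w.r.t. the hidden code, taken before dec is updated.
            grad_h = [sum(dec[i][j] * err[i] for i in range(n_in))
                      for j in range(n_hidden)]
            for i in range(n_in):
                for j in range(n_hidden):
                    dec[i][j] -= lr * err[i] * h[j]
            for j in range(n_hidden):
                for i in range(n_in):
                    enc[j][i] -= lr * grad_h[j] * x[i]
    return enc, dec

def pretrain_stack(data, layer_sizes):
    """Greedy layer-wise training: each layer learns from the previous layer's codes."""
    encoders, inputs = [], data
    for size in layer_sizes:
        enc, _ = train_autoencoder(inputs, size)
        encoders.append(enc)
        inputs = [matvec(enc, x) for x in inputs]  # features fed to the next layer
    return encoders

# Invented toy inputs with 4 features; two stacked layers of 3 and 2 units.
DATA = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0],
        [1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
ENCODERS = pretrain_stack(DATA, [3, 2])
```

Practical deep networks use nonlinear units and far more data; the loop in `pretrain_stack` is the part that corresponds to "train this layer first, then this layer".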


Deep Learning for unstructured data

• The previous paradigm for feature detection and prediction from data was based on modelling and optimization. “Deep learning” has now surpassed its performance on many problems, as reported by researchers around the world.

“Tech 2015: Deep Learning And Machine Intelligence Will Eat The World” Forbes 12/2014

• Deep learning scales well with big data to learn a “layering of knowledge” in hidden layers, without the handcrafted feature detectors that past machine learning methods required. Convergence time proof for RBM.

• Demonstrated impressive improvements in diverse areas : speech recognition, object recognition in images, targeted advertising, fraud detection, personalization
• Speech recognition : Microsoft, Google & Apple competing mobile “digital assistants” (Google Now vs Siri vs Cortana 9/2014). Digital assistants will drive mCommerce & 50% US digital purchases in 2017 (Gartner)
• Object recognition : Facebook mining user images for intentions (NYT)
• Real-time translation : Skype
• World Cup / NBA Predicting 2014 (MS)
• Others : Baidu, IBM, Yahoo, Tencent, Netflix, Adobe, NEC, Toyota
• Telco centric vendors : Wise-athena, Dataspark, Zettics

https://www.youtube.com/watch?v=__G0Msn1UaM

Deep learning has created breakthroughs in object and speech recognition.

But also watch other areas : sports prediction, natural language processing, churn prediction, targeted advertising, customer segmentation

2014 Survey of Deep Learning Vendor Claims

Application                     Previous Accuracy   Data used to train model           Latest Accuracy    Company
Speech Recognition              75%                 680 speakers, 10 sentences each    94% (2013)         Google, IBM, Skype, MS
Object recognition              70%                 1.2 mil images                     95% (2015)         Baidu, Google, Facebook
Target Advertising              <1% (Banner Ads)    220K users                         22%                NDA
Personalization                 na                  220K users                         27%                NDA
Churn Prediction (Telco)        69% (SAS)           300 mil CDRs, 1.8 mil users        82%                NDA
Dealer Fraud Detection (Telco)  <40% (reactive)     700 mil CDRs, 1.2 mil users        80% (predictive)   NDA

• Other big companies in related efforts : Baidu, IBM, Yahoo, Tibco, Tencent, Netflix, Adobe, NEC, Toyota

Speech Recognition : the race is on

Contextual Mobile Targeting

Contextual & unstructured data, used with machine learning technology, also improve advertising accuracy (+219%)

Customer visibility: Accuracy and Algorithm speed

Manual test of the algorithm

• Several cameras can observe the same area

• Aggregated signals with a proper threshold will perfectly match the ground truth
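The aggregate-then-threshold step can be sketched as follows. The slides do not say how the camera signals are combined, so averaging per minute is an assumption here, as are the threshold of 0.5, the function names and the sample scores.

```python
def aggregate(signals):
    """Average the per-minute detection scores across cameras (assumed rule)."""
    return [sum(minute) / len(minute) for minute in zip(*signals)]

def detect(signals, threshold=0.5):
    """Flag a minute as 'person present' when the aggregated score reaches the threshold."""
    return [score >= threshold for score in aggregate(signals)]

# Invented scores from two cameras watching the same area, one value per minute.
CAMERA_A = [0.1, 0.9, 0.8, 0.2]
CAMERA_B = [0.0, 0.7, 0.9, 0.4]
GROUND_TRUTH = [False, True, True, False]

DETECTED = detect([CAMERA_A, CAMERA_B])
```

With these sample values the thresholded signal agrees with the ground truth in every minute, which is the kind of match the manual test checks for.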

Algorithm speed

• Calibration: manual

• Runtime: 60 msec/frame

[Figure: ground truth (is a person in front of ATM 7) and aggregated signal per minute, minutes 1-59; vertical axis 0 to 1.2]

Utilization

Average daily utilization is 10%. Highest values (20%) are on weekends, mostly Saturdays, except Chinese New Year. Lowest utilization is on the 11th of March (1%).

Recorded coverage

There is recording in 30%-90% of the hours and 10%-70% of total time. This highly correlates with daily utilization, but the weekly cycle is more obvious.

[Figure: two daily charts for 2015-02-15 to 2015-03-23, with vertical axes 0%-25% and 0%-100%; annotations: Sat (Saturdays), CNY, School holiday, Data error; series label: Daily utilization]

Customer demography: Accuracy and Algorithm speed

Manual test of the algorithm

• Average age/gender accuracy of algorithms with 48x48 = 92%

• Our current algorithm at the desk with face size of 40x40 = 72%

• Accuracy will be improved up to 85%, using tilted face + body corpus

Algorithm speed

• Calibration: one-time

• Runtime: irrelevant

Opportunities

What application areas can benefit ?
• Internet : Baidu targeted advertising, Facebook sentiments from face photos
• Commercial : fraud detection, churn prediction, food detection, weapons detection
• Others : disability assistance, object recognition for the blind, speech recognition for the deaf, cancer tissue recognition

Specific Application Example
• Bank customer recognition

Recommend for HK : biggest market impact may be in health image processing and online education

3. Networks

How did Google beat previous search engines ?

Aside from searched content, Google also uses url data patterns (links)*
An additional datatype can make a huge difference !
* Eric Schmidt “How Google Works”; also see http://www.economist.com/node/3171440

Genetic Basis of Diseases

Asthma : known to have multiple variant gene sequences

“ Simple Regression ” “ Multivariate Sparse Lasso Regression ”

Novel statistical method allows for joint network analysis of correlated phenotypes

Eric Xing (2014)

Advantages

• Greater power to detect weak associations

• Fewer false positives

• Joint association to multiple correlated phenotypes

http://www.cs.cmu.edu/~epxing/Class/10708-14/lecture.html

Asthma Trait Network

FB data only

Asian Telco data versus Facebook - 1

Analysing family relations with graphs

Asian Telco data versus Facebook - 2

Telco data only

+

Campaign Targeting using URL + Social

Data Types         Response rate
Normal / Control   0.20%
With Social        0.49%
Social + URL       2.30%
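To put the response rates above in relative terms, a small worked calculation of the lift over the control group. The rates are the ones reported in the slide; the function and variable names are invented for the example.

```python
# Response rates in percent, as reported above.
RESPONSE_RATES = {
    "Normal / Control": 0.20,
    "With Social": 0.49,
    "Social + URL": 2.30,
}

def lift_over_control(rates, control="Normal / Control"):
    """Divide each response rate by the control group's rate."""
    base = rates[control]
    return {name: rate / base for name, rate in rates.items()}

LIFTS = lift_over_control(RESPONSE_RATES)

for name, lift in LIFTS.items():
    print(f"{name}: {lift:.2f}x the control response rate")
```

Adding social data gives roughly 2.45 times the control response rate, and adding URL data on top gives roughly 11.5 times.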

Romantic Partner Relationship Prediction

Data Types                        Accuracy
SMS No.                           25%
SMS No. + CDR graphs              75%
SMS No. + CDR & Location graphs   85%

…

Combined Social Networking Graph

Data sources : CDR, Facebook, Location, URL, Survey, Registration, Loyalty

Results :
1. Improved demographic prediction : Age (45% -> 63%), Gender (45% -> 70%)
2. Inferring romantic partner from SMS/CDR
3. Inferred family relationships, colleagues & communities

Telco + FB data

Telco Data and Facebook combined !

• Wave 1 churners in red
• Wave 2 churners in pink
• Own customers in yellow
• Competitor customers in green
• Very active customers in blue

Finding: Wave 1 (red) Churners are contagious (followed by pinks) when local community members are less embedded in the network

Viral churn in service providers : prioritize key opinion leaders before they leave !

Capturing network properties can improve prediction
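One simple network property of this kind is the share of a customer's direct contacts who have already churned. The sketch below is only an illustration on an invented call graph; it is not the model behind the findings above, and all names are hypothetical.

```python
def churned_neighbour_share(graph, churned):
    """For each customer, the fraction of direct contacts who already churned."""
    return {
        customer: (sum(n in churned for n in neighbours) / len(neighbours)
                   if neighbours else 0.0)
        for customer, neighbours in graph.items()
    }

# Invented call graph: customer -> list of contacts.
GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["b"],
}
CHURNED = {"a"}  # wave 1 churners

SHARES = churned_neighbour_share(GRAPH, CHURNED)

# Rank the remaining customers by exposure to churners, highest first.
AT_RISK = sorted((c for c in GRAPH if c not in CHURNED),
                 key=lambda c: SHARES[c], reverse=True)
```

A feature like `SHARES` could be fed into a churn model alongside the usual per-customer attributes; here customer "c" is the most exposed, with half of its contacts gone.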

• Finding a friend of a friend in a social network requires one join operation in a relational database (RDBMS), so six degrees of separation require six joins. A graph DB can solve this with six simple traversals, which is fast and scales to millions

Depth = how many levels of friends of friends

Database           Depth   Execution Time (seconds)   Result Count
MySQL              2       0.016                      ~2,500
MySQL              3       30.267                     ~125,000
MySQL              4       1,543.505                  ~600,000
MySQL              5       Not finished (days)        N/A
Neo4J (Graph db)   2       0.01                       ~2,500
Neo4J (Graph db)   3       0.168                      ~110,000
Neo4J (Graph db)   4       1.359                      ~600,000
Neo4J (Graph db)   5       2.132                      ~800,000

• RDBMS join performance suffered beyond 2 levels due to the huge Cartesian product resulting from each join operation.

Real Life Benchmarks - A MySQL DB with 1M users and each user has 50 friends.
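The traversal a graph database performs for "friends of friends" can be sketched as a breadth-first search with a depth limit. This is a plain in-memory illustration on an invented six-person graph, not Neo4J's implementation. Note that with 50 friends per user there are at most 50^2 = 2,500 paths at depth 2 and 50^3 = 125,000 at depth 3, which lines up with the result counts in the benchmark.

```python
from collections import deque

def friends_within(graph, start, max_depth):
    """Return everyone reachable from `start` in at most `max_depth` hops."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        person, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for friend in graph.get(person, ()):
            if friend not in seen:  # each person is visited once
                seen.add(friend)
                frontier.append((friend, depth + 1))
    seen.discard(start)
    return seen

# Invented friendship graph: person -> list of friends.
FRIENDS = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "erin"],
    "dave": ["bob", "frank"],
    "erin": ["carol"],
    "frank": ["dave"],
}

DEPTH_2 = friends_within(FRIENDS, "alice", 2)
```

Each person is visited once, so the work grows with the number of people reached rather than with the size of a joined table.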

How to learn network properties ?

Opportunities

What application areas can benefit ?

• marketing: recommendation, churn and loyalty

• health: family social disease inheritance, personalized medicine, health education and engagement

• education: socially assisted

Recommend for HK :

digital marketing, education and health

Conclusion

Advancing technologies to derive insights from increasing types and amounts of data points to many new opportunities ahead

Questions ? Email [email protected]

Special Thanks to : Mr. William Mak

Data & Analytics