Predictive Modeling with Heterogeneous Sources Xiaoxiao Shi 1 Qi Liu 2 Wei Fan 3 Qiang Yang 4 Philip S. Yu 1 1 University of Illinois at Chicago 2 Tongji University, China 3 IBM T. J. Watson Research Center 4 Hong Kong University of Science and Technology

Dec 29, 2015


Grant Morris
Transcript
Page 1

Predictive Modeling with Heterogeneous Sources

Xiaoxiao Shi (1), Qi Liu (2), Wei Fan (3), Qiang Yang (4), Philip S. Yu (1)

(1) University of Illinois at Chicago
(2) Tongji University, China
(3) IBM T. J. Watson Research Center
(4) Hong Kong University of Science and Technology

Page 2

Why learning with heterogeneous sources?

Standard Supervised Learning

Training (labeled): New York Times
Test (unlabeled): New York Times
Classifier: 85.5%

Page 3

Why heterogeneous sources?

In Reality…

Training (labeled): New York Times
Test (unlabeled): New York Times

Labeled data are insufficient!

47.3%

How to improve the performance?

Page 4

Why heterogeneous sources?

Labeled data from other sources: Reuters

Target domain test (unlabeled): New York Times

82.6%

1. Different distributions

2. Different outputs

3. Different feature spaces

47.3%

Page 5

Real world examples

• Social Network:
  – Can various bookmarking systems help predict social tags for a new system, given that their outputs (social tags) and data (documents) are different?

Wikipedia, ODP, Backflip, Blink, ……

?

Page 6

Real world examples

• Applied Sociology:
  – Can the suburban housing price census data help predict the downtown housing prices?

?

#rooms  #bathrooms  #windows  price
5       2           12        XXX
6       3           11        XXX

#rooms  #bathrooms  #windows  price
2       1           4         XXXXX
4       2           5         XXXXX

Page 7

Other examples

• Bioinformatics
  – Previous years’ flu data → new swine flu
  – Drug efficacy data against breast cancer → drug data against lung cancer
  – ……
• Intrusion detection
  – Existing types of intrusions → unknown types of intrusions
• Sentiment analysis
  – Review from SDM → Review from KDD

Page 8

Learning with Heterogeneous Sources

• The paper mainly attacks two sub-problems:
  – Heterogeneous data distributions
    • Clustering-based KL divergence and a corresponding sampling technique
  – Heterogeneous outputs (for regression problems)
    • Unifying outputs by preserving similarity

Page 9

Learning with Heterogeneous Sources

• General Framework

[Diagram: Source data, Target data; steps: Unifying data distributions, Unifying outputs]

Page 10

Unifying Data Distributions

• Basic idea:
  – Combine the source and target data and perform clustering.
  – Select the clusters in which the target and source data are similarly distributed, evaluated by KL divergence.
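The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's clustering-based KL divergence: the slides do not give its definition, so each cluster is scored here by the closed-form KL divergence between diagonal Gaussians fitted to its target points and its source points. The function names (kmeans, gaussian_kl, select_source_points) and the parameters k, threshold and eps are hypothetical.

```python
import math


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; returns the cluster index assigned to each point."""
    # Farthest-first initialization keeps the sketch deterministic.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def gaussian_kl(a, b, eps=1e-6):
    """KL(N_a || N_b) for diagonal Gaussians fitted to point sets a and b.

    Assumption: a Gaussian fit stands in for the per-cluster distributions.
    """
    total = 0.0
    for xs, ys in zip(zip(*a), zip(*b)):
        mu_a, mu_b = sum(xs) / len(xs), sum(ys) / len(ys)
        var_a = sum((x - mu_a) ** 2 for x in xs) / len(xs) + eps
        var_b = sum((y - mu_b) ** 2 for y in ys) / len(ys) + eps
        total += 0.5 * (math.log(var_b / var_a) + (var_a + (mu_a - mu_b) ** 2) / var_b - 1.0)
    return total


def select_source_points(source, target, k=3, threshold=0.5):
    """Keep source points from clusters where source and target look alike."""
    labels = kmeans(source + target, k)          # step 1: cluster the combined data
    src_labels, tgt_labels = labels[:len(source)], labels[len(source):]
    kept = []
    for c in range(k):
        src_c = [p for p, lab in zip(source, src_labels) if lab == c]
        tgt_c = [p for p, lab in zip(target, tgt_labels) if lab == c]
        if len(src_c) < 2 or len(tgt_c) < 2:
            continue                             # one domain is missing: nothing to compare
        if gaussian_kl(tgt_c, src_c) <= threshold:   # step 2: low divergence, keep cluster
            kept.extend(src_c)
    return kept
```

Points are tuples of floats, and source and target are plain lists. Source points that fall in clusters with no target data, or in clusters where the two domains diverge beyond the threshold, are dropped from the training sample.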

Page 11

An Example

[Figure panels: D, T; Combined Data; Adaptive Clustering]

Page 12

Unifying Outputs

• Basic idea:
  – Generate initial outputs according to the regression model.
  – For the instances similar in the original output space, make their new outputs closer.
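One way to read the basic idea above is as a smoothing step. The sketch below is a guess at a workable form, not the update rule from the paper, which the slides do not state. It assumes scalar outputs, a Gaussian kernel on the original outputs as the similarity measure, and a mixing weight alpha; the function name unify_outputs and the parameters bandwidth, alpha and rounds are hypothetical.

```python
import math


def unify_outputs(original, initial, bandwidth=1.0, alpha=0.5, rounds=10):
    """Pull together the new outputs of instances whose original outputs are similar.

    original: outputs of the instances in their own (source) output space
    initial:  outputs the regression model assigns to the same instances
    """
    n = len(original)
    # Similarity is measured in the ORIGINAL output space (assumed Gaussian kernel).
    weights = [
        [math.exp(-((original[i] - original[j]) ** 2) / (2 * bandwidth ** 2)) for j in range(n)]
        for i in range(n)
    ]
    new = list(initial)
    for _ in range(rounds):
        smoothed = []
        for i in range(n):
            # Similarity-weighted average of the current outputs of all instances.
            neighbour_avg = sum(w * y for w, y in zip(weights[i], new)) / sum(weights[i])
            # Stay anchored to the initial output while moving toward similar instances.
            smoothed.append((1 - alpha) * initial[i] + alpha * neighbour_avg)
        new = smoothed
    return new
```

With the defaults, unify_outputs([3.0, 3.0], [16.0, 37.0]) returns [21.25, 31.75]: two instances that share an original output are each moved halfway toward their mean of 26.5. These are the same values that appear in the figure on the next slide. Instances whose original outputs are far apart have near-zero weight on each other and keep their initial outputs.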

Page 13

[Figure: panels labeled "Initial Outputs" and "Modification"; values shown: 16, 37, 26.5, 21.25, 31.75]

Page 14

Experiment

• Bioinformatics data set:

Page 15

Experiment

Page 16

Experiment

• Applied sociology data set:

Page 17

Experiment

Page 18

Conclusions

• Problem: Learning with Heterogeneous Sources
  • Heterogeneous data distributions
  • Heterogeneous outputs
• Solution:
  • Clustering-based KL divergence helps perform sampling
  • Similarity-preserving output generation helps unify outputs

Page 19

Thanks!