Predictive Modeling with Heterogeneous Sources Xiaoxiao Shi 1 Qi Liu 2 Wei Fan 3 Qiang Yang 4 Philip S. Yu 1 1 University of Illinois at Chicago 2 Tongji University, China 3 IBM T. J. Watson Research Center 4 Hong Kong University of Science and Technology
Predictive Modeling with Heterogeneous Sources
Xiaoxiao Shi¹, Qi Liu², Wei Fan³, Qiang Yang⁴, Philip S. Yu¹
¹ University of Illinois at Chicago
² Tongji University, China
³ IBM T. J. Watson Research Center
⁴ Hong Kong University of Science and Technology
Why learn with heterogeneous sources?
Standard Supervised Learning
• Training (labeled): New York Times
• Test (unlabeled): New York Times
• Classifier result: 85.5%
Why heterogeneous sources?
In Reality…
• Training (labeled): New York Times. Labeled data are insufficient!
• Test (unlabeled): New York Times
• Classifier result: 47.3%
• How to improve the performance?
Why heterogeneous sources?
• Labeled data from other sources: Reuters
• Target domain test (unlabeled): New York Times
• Classifier result: 82.6% (versus 47.3%)
• How the sources can differ from the target:
1. Different distributions
2. Different outputs
3. Different feature spaces
Real world examples
• Social Network:
– Can various bookmarking systems help predict social tags for a new system given that their outputs (social tags) and data (documents) are different?
[Figure: Wikipedia, ODP, Backflip, Blink, …… and a new system marked "?"]
Real world examples
• Applied Sociology:
– Can the suburban housing price census data help predict the downtown housing prices?

#rooms  #bathrooms  #windows  price
5       2           12        XXX
6       3           11        XXX

#rooms  #bathrooms  #windows  price
2       1           4         XXXXX
4       2           5         XXXXX
Other examples
• Bioinformatics
– Previous years' flu data → new swine flu
– Drug efficacy data against breast cancer → drug data against lung cancer
– ……
• Intrusion detection
– Existing types of intrusions → unknown types of intrusions
• Sentiment analysis
– Review from SDM → Review from KDD
Learning with Heterogeneous Sources
• The paper mainly attacks two sub-problems:
– Heterogeneous data distributions
  • Clustering-based KL divergence and a corresponding sampling technique
– Heterogeneous outputs (for regression problems)
  • Unifying outputs by preserving similarity
Learning with Heterogeneous Sources
• General Framework
[Figure: source data and target data pass through two stages, "Unifying data distributions" and "Unifying outputs"]
Unifying Data Distributions
• Basic idea:
– Combine the source and target data and perform clustering.
– Select the clusters in which the target and source data are similarly distributed, evaluated by KL divergence.
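The cluster-selection idea above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the combined data have already been clustered by some algorithm, so that every instance carries a cluster label; it estimates each domain's distribution by the fraction of its instances falling in each cluster; and the names `cluster_kl`, `select_clusters`, `sample_source` and the threshold `max_log_ratio` are hypothetical.

```python
import math
from collections import Counter


def cluster_kl(target_labels, source_labels, eps=1e-9):
    """KL(P_T || P_S), with both distributions estimated from cluster membership."""
    t_counts, s_counts = Counter(target_labels), Counter(source_labels)
    n_t, n_s = len(target_labels), len(source_labels)
    kl = 0.0
    for cluster, t in t_counts.items():
        p_t = t / n_t
        # Assumption: smooth clusters that contain no source instances.
        p_s = max(s_counts.get(cluster, 0) / n_s, eps)
        kl += p_t * math.log(p_t / p_s)
    return kl


def select_clusters(target_labels, source_labels, max_log_ratio=0.5):
    """Keep clusters where target and source proportions are close.

    Assumption: closeness is |log(p_t / p_s)| <= max_log_ratio, which bounds
    the cluster's per-unit contribution to the KL divergence.
    """
    t_counts, s_counts = Counter(target_labels), Counter(source_labels)
    n_t, n_s = len(target_labels), len(source_labels)
    kept = []
    for cluster in sorted(set(t_counts) & set(s_counts)):
        ratio = (t_counts[cluster] / n_t) / (s_counts[cluster] / n_s)
        if abs(math.log(ratio)) <= max_log_ratio:
            kept.append(cluster)
    return kept


def sample_source(source_items, source_labels, kept_clusters):
    """Return only the source instances that fall in the selected clusters."""
    kept = set(kept_clusters)
    return [x for x, c in zip(source_items, source_labels) if c in kept]
```

For example, with target labels `[0]*5 + [1]*5` and source labels `[0]*5 + [1]*4 + [2]`, `select_clusters` keeps clusters 0 and 1 and drops cluster 2, which contains no target data.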
An Example
[Figure: D and T → Combined Data → Adaptive Clustering]
Unifying Outputs
• Basic idea:
– Generate initial outputs according to the regression model.
– For the instances similar in the original output space, make their new outputs closer.
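One way to picture the second step is a similarity-weighted smoothing of the initial outputs. The sketch below is an illustration under assumptions of its own, not the method from the paper: similarity in the original output space is taken to be a Gaussian kernel with width `sigma`, each new output is a blend (weight `alpha`) of the initial output and the similarity-weighted mean of all current outputs, and the function name `unify_outputs` and all of its parameters are hypothetical.

```python
import math


def unify_outputs(initial, original, alpha=0.5, sigma=1.0, steps=1):
    """Pull outputs together for instances that are similar in the original output space.

    initial:  outputs generated by the regression model (new output space)
    original: outputs of the same instances in their original output space
    """
    n = len(initial)
    # Assumption: Gaussian similarity between original outputs.
    weights = [
        [math.exp(-((original[i] - original[j]) ** 2) / (2 * sigma ** 2)) for j in range(n)]
        for i in range(n)
    ]
    current = list(initial)
    for _ in range(steps):
        smoothed = []
        for i in range(n):
            total = sum(weights[i])  # never zero: weights[i][i] == 1
            local_mean = sum(weights[i][j] * current[j] for j in range(n)) / total
            # Blend the model's initial output with the mean of similar instances.
            smoothed.append((1 - alpha) * initial[i] + alpha * local_mean)
        current = smoothed
    return current


# Two instances with identical original outputs move toward each other.
print(unify_outputs([16.0, 37.0], [1.0, 1.0]))    # [21.25, 31.75]
# Two instances with very different original outputs are left unchanged.
print(unify_outputs([16.0, 37.0], [0.0, 100.0]))  # [16.0, 37.0]
```

The inputs 16 and 37 are borrowed from the numbers in the accompanying figure. The values 21.25 and 31.75 also appear there, although the slides do not state how the figure was computed.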
[Figure: initial outputs and their modification; values shown: 16, 37, 26.5, 21.25, 31.75]
Experiment
• Bioinformatics data set: [results omitted in the transcript]
• Applied sociology data set: [results omitted in the transcript]
• Problem: Learning with Heterogeneous Sources:
– Heterogeneous data distributions
– Heterogeneous outputs
• Solution:
– Clustering-based KL divergence helps perform sampling
– Similarity-preserving output generation helps