Clustering Very Large Multi-dimensional Datasets with MapReduce 蔡跳

Jan 01, 2016


Helen Baker
Clustering Very Large Multi-dimensional Datasets with MapReduce 蔡跳

INTRODUCTION

• Large dataset of moderate-to-high dimensional elements
• Serial subspace clustering algorithms
• Data at the TB and PB scale
• e.g., Twitter crawl: > 12 TB; Yahoo! operational data: 5 PB
• Approach: take a fast, scalable serial algorithm and make it run efficiently in parallel

INTRODUCTION

• Bottlenecks: I/O and network
• Best of both Worlds (BoW): automatically spots the bottleneck and picks a good strategy
• Uses serial clustering methods as a plugged-in clustering subroutine

RELATED WORK

• MapReduce: a simplified distributed programming model for parallel computation over large-scale datasets
• Components: mapper, reducer
• Map stage: reads the input file and outputs (key, value) pairs
• Shuffle stage: transfers the mappers' output to the reducers based on the key
• Reduce stage: processes the received pairs and outputs the final result
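
The map, shuffle, and reduce stages can be illustrated with a small single-process simulation. This is only a sketch for intuition, not Hadoop code: the function names (`map_stage`, `shuffle_stage`, `reduce_stage`, `run_job`) and the word-count job are assumptions chosen for illustration and do not come from the slides.

```python
from collections import defaultdict

def map_stage(lines, mapper):
    """Apply the mapper to every input line and collect (key, value) pairs."""
    pairs = []
    for line in lines:
        pairs.extend(mapper(line))
    return pairs

def shuffle_stage(pairs):
    """Group values by key, as the shuffle does when routing pairs to reducers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_stage(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Hypothetical job: count word occurrences.
def word_mapper(line):
    return [(word, 1) for word in line.split()]

def sum_reducer(key, values):
    return sum(values)

def run_job(lines):
    pairs = map_stage(lines, word_mapper)
    groups = shuffle_stage(pairs)
    return reduce_stage(groups, sum_reducer)
```

In a real cluster the shuffle moves data across the network, which is why the amount of mapper output matters for the network cost discussed later in the slides.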

BoW

• ParC: partition the data, then merge the results
• SnI: sample first, trading extra I/O for a lower network cost
• Trade-off between the two

ParC--Parallel Clustering

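• Partition the data and distribute the partitions to different machines
• Each machine clusters the data assigned to it; the resulting clusters are called β-clusters
• Merge the β-clusters to obtain the final clusters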
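
The three ParC steps can be sketched on one machine as follows. Everything concrete here is an assumption made for illustration: the round-robin partitioning, the 1-D gap-based `cluster_partition` standing in for the plugged-in serial clustering method, and the rule that β-clusters are merged when their intervals overlap. The slides only state that the data is partitioned, clustered per machine, and merged.

```python
from collections import defaultdict

def partition(points, r):
    """Split points into r partitions (round-robin; an illustrative choice)."""
    parts = defaultdict(list)
    for i, p in enumerate(points):
        parts[i % r].append(p)
    return parts

def cluster_partition(values, gap=1.5):
    """Hypothetical plug-in serial clusterer for 1-D data.

    Points closer than `gap` to the previous point join its cluster.
    Returns beta-clusters as (low, high) intervals.
    """
    boxes = []
    for v in sorted(values):
        if boxes and v - boxes[-1][1] <= gap:
            boxes[-1] = (boxes[-1][0], v)
        else:
            boxes.append((v, v))
    return boxes

def merge_beta_clusters(boxes):
    """Merge beta-clusters whose intervals overlap (assumed merge rule)."""
    merged = []
    for lo, hi in sorted(boxes):
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged

def parc(points, r):
    """Partition, cluster each partition, then merge the beta-clusters."""
    beta_clusters = []
    for part in partition(points, r).values():
        beta_clusters.extend(cluster_partition(part))
    return merge_beta_clusters(beta_clusters)
```

For example, `parc([1.0, 1.5, 2.0, 2.5, 3.0, 10.0, 10.5, 11.0, 11.5], r=2)` clusters two partitions separately and merges the four resulting intervals into two final clusters.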

SnI--Sample and Ignore

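• Sample the data and cluster the sample to obtain clusters
• Ignore the data that falls inside the space of those clusters
• Run ParC on the remaining data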
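
A minimal sketch of the sample-and-ignore idea, under stated assumptions: the data is 1-D, `cluster` is a hypothetical gap-based stand-in for the plugged-in serial clusterer, clusters are represented as intervals, and the sampling rate and seed are illustrative parameters. The function returns the clusters found in the sample together with the points that still need to be processed by ParC.

```python
import random

def cluster(values, gap=1.5):
    """Hypothetical plug-in serial clusterer: 1-D gap-based grouping into intervals."""
    boxes = []
    for v in sorted(values):
        if boxes and v - boxes[-1][1] <= gap:
            boxes[-1] = (boxes[-1][0], v)
        else:
            boxes.append((v, v))
    return boxes

def inside(v, boxes):
    """True if v falls inside the space of any already-found cluster."""
    return any(lo <= v <= hi for lo, hi in boxes)

def sample_and_ignore(points, rate, seed=0):
    """Cluster a random sample, then drop every point covered by those clusters."""
    if not points:
        return [], []
    rng = random.Random(seed)
    k = min(len(points), max(1, int(len(points) * rate)))
    sample = rng.sample(points, k)
    sample_clusters = cluster(sample)
    remaining = [p for p in points if not inside(p, sample_clusters)]
    return sample_clusters, remaining
```

Only `remaining` would then be passed on to ParC, so fewer points have to be shuffled over the network, at the price of an extra pass over the data for sampling.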

COST-BASED OPTIMIZATION

• ParC Cost:

• Map Cost:

• Shuffle Cost:

• Reduce Cost:

• SnI Cost:

BoW

• Compute the ParC cost -> costC
• Compute the SnI cost -> costCs
• If costC > costCs, then clusters = result of SnI
• Else clusters = result of ParC
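
The selection rule can be written as a short function. This is a sketch only: the cost formulas are not reproduced in this transcript, so `parc_cost`, `sni_cost`, `run_parc`, and `run_sni` are hypothetical callables supplied by the caller rather than the authors' equations or implementations.

```python
def bow(data, parc_cost, sni_cost, run_parc, run_sni):
    """Pick the cheaper strategy according to the two cost estimates.

    Returns the name of the chosen strategy and its clustering result.
    """
    cost_c = parc_cost(data)   # estimated cost of ParC (costC)
    cost_cs = sni_cost(data)   # estimated cost of SnI (costCs)
    if cost_c > cost_cs:
        return "SnI", run_sni(data)
    return "ParC", run_parc(data)
```

Note that only the chosen strategy is executed; the cost estimates are computed up front so that the more expensive plan never runs.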

EXPERIMENTAL RESULTS

• Uses Hadoop
• M45: 1.5 PB storage, 1 TB memory
• DISC/Cloud: 512 cores, 64 machines, 1 TB RAM, 256 TB disk storage

Quality of results

• Average precision and recall of the clustering
• Synthetic data

Scale-up results

• Increasing the number of reducers

Scale-up results

• Increasing the data size, with r=128, m=700

Accuracy of our cost equations

Thank you for listening! Thanks for your time.