2010 Workshop on Massive Data Analytics on the Cloud (MDAC 2010) April 26, 2010 Raleigh, NC, USA In association with the 19th Annual World Wide Web Conference.

2010 Workshop on Massive Data Analytics on the Cloud (MDAC 2010)

April 26, 2010, Raleigh, NC, USA

In association with the 19th Annual World Wide Web Conference (WWW2010)

Making Sense of Mountains of Data

Continuous arrival of high-volume information (evolving, highly variant; structured, semi-structured, and unstructured):
• Billions of mobile devices
• Clickstream, CRM
• Claim data (text, picture, video)
• Call data records
• Location tracking (GPS), iPhone, vehicle use data
• $ Trans tracking (across borders & IP providers)
• Feeds: Census Bureau data, market data, weather data
• Sensor data
• Online Transaction Processing System
• Web data (for search)
• Web buzz data (for reputation analysis)

Petabytes -> Exabytes

• Auto/cross-correlation analytics, predictive analytics
• Deep & wide analytics
• Fine grained: individual product and customer at a time and place
• Feedback/Action

Also labeled in the figure: Dashboards, Embedded Analytics, Financial Planning, Mashups, Scorecards, Search

Massive Data Analytic Platforms

• Google: Original MapReduce implementation
• Microsoft: Dryad
• Yahoo!, Facebook, and many others: Hadoop
• Ecosystems: Hive, Pig, Jaql, Zookeeper, …
• Alternatives to Map/Reduce, e.g. Pregel

(Figure: MapReduce data flow, with nodes labeled M, C, and R and stages labeled Partition and Sort)

• “Easy” parallelism
• Scalability
• Fault-Tolerance
• Elastic
• Flexibility
• Cost / Performance
• 1000’s of processors
• Petabytes of data
• …and growing
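As a rough illustration of the map, partition, sort, and reduce stages named on this slide, the following single-process Python sketch counts words across a few documents. It is not the Hadoop or Google MapReduce API: the function names (`map_phase`, `partition`, `reduce_phase`, `word_count`) and the `num_reducers` parameter are hypothetical, chosen only for this example, and a real platform would run each stage in parallel across many machines.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def partition(pairs, num_reducers):
    """Partition: route each key to one reducer bucket by hashing the key."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[hash(key) % num_reducers].append((key, value))
    return buckets


def reduce_phase(bucket):
    """Sort one bucket by key, then reduce: sum the values of each key group."""
    ordered = sorted(bucket, key=itemgetter(0))
    return {key: sum(value for _, value in group)
            for key, group in groupby(ordered, key=itemgetter(0))}


def word_count(documents, num_reducers=2):
    """Run the stages in sequence; hypothetical driver for illustration only."""
    # Each document stands in for the input split of one map task.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    buckets = partition(mapped, num_reducers)
    counts = {}
    # Every key lands in exactly one bucket, so the per-bucket results
    # can be merged without conflicts.
    for bucket in buckets.values():
        counts.update(reduce_phase(bucket))
    return counts


if __name__ == "__main__":
    print(word_count(["big data on the cloud", "data on Hadoop"]))
```

Because each key is hashed to exactly one bucket, every reducer can work independently, which is the property that makes the "easy" parallelism listed above possible.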

Chairpeople Perspective

• Other parallel systems technology and customers
  – Parallel Database: enterprise data warehousing
  – Parallel ETL (extraction, transformation, load)
  – Search and text analytics

• Hadoop and related technologies
  – Finance, Telco, Healthcare, Retail, Government, …

Questions Posed in Call For Papers

• What kinds of problems are people trying to solve?

• How are existing massive scale-out platforms used, and what extensions would be helpful?

• Other kinds of platforms for different problems?

• How to integrate with existing environments such as data warehouses?

• Challenges in managing massive datasets?

• Legal/moral challenges associated with mining these data sets?

Agenda (morning)

9:00 - 10:30: Session 1

Introduction and Welcome

Invited Talk: "Hadoop: An Industry Perspective"
Dr. Amr Awadallah, CTO, VP-Engineering, Cloudera

10:30 - 11:00: Coffee Break*

11:00 - 12:30: Session 2

Distributed Indexing of Web Scale Datasets for the Cloud
Ioannis Konstantinou, Evangelos Angelou, Dimitrios Tsoumakos, Nectarios Koziris; National Technical University of Athens

Beyond Online Aggregation: Parallel and Incremental Data Mining with Online Map-Reduce
Joos-Hendrik Böse¹, Artur Andrzejak², Mikael Högqvist²; ¹Intl. Comp. Sci. Institute, ²Zuse Institute Berlin (ZIB)

Efficient Updates for a Shared Nothing Analytics Platform
Katerina Doka³, Dimitrios Tsoumakos⁴, Nectarios Koziris³; ³National Technical University of Athens, Greece, ⁴University of Cyprus

12:30 - 1:30: Lunch*

Agenda (afternoon)

1:30 - 3:30: Session 3

Invited Talk: "Large Scale Applications on Hadoop in Yahoo"
Dr. Vijay Narayanan, Yahoo! Labs Silicon Valley

Extracting User Profiles from Large Scale Data
Michal Shmueli-Scheuer, Haggai Roitman, David Carmel, Yosi Mass, David Konopnicki; IBM Research, Haifa

A Novel Approach to Multiple Sequence Alignment using Hadoop Data Grids
Sudha Sadasivam, G. Baktavatchalam; PSG College of Technology

3:30 - 4:00: Coffee Break*

4:00 - 5:30: Session 4

Towards Scalable RDF Graph Analytics on MapReduce
Padmashree Ravindra, Vikas Deshpande, Kemafor Anyanwu; North Carolina State University

SPARQL Basic Graph Pattern Processing with Iterative MapReduce
Jaeseok Myung, Jongheum Yeon, Sang-goo Lee; Seoul National University

Parallelizing Random Walk with Restart for Large-Scale Query Recommendation
Meng-Fen Chiang, Tsung-Wei Wang, Wen-Chih Peng; National Chiao Tung University, Hsinchu, Taiwan

Acknowledgements

Workshop Chairs
Ullas Nambiar, IBM India Research Lab, New Delhi, India
John McPherson, IBM Almaden Research Center, USA
David Konopnicki, IBM Haifa Research Lab, Israel

Steering Committee
Rakesh Agrawal, Microsoft Search Labs, Mountain View, CA, USA
Alon Halevy, Google Inc., Mountain View, CA, USA

Invited Speakers
Amr Awadallah, CTO, VP-Engineering, Cloudera, "Hadoop: An Industry Perspective"
Vijay Narayanan, Yahoo! Labs Silicon Valley, "Large Scale User Modeling on Hadoop"

Program Committee
Amr Awadallah, Cloudera, USA
Andrew McCallum, University of Massachusetts Amherst, USA
Assaf Schuster, Technion - Israel Institute of Technology
Gautam Das, University of Texas, Arlington, USA
Jimeng Sun, IBM Watson Research Center, USA
John Shafer, Microsoft Search Labs, USA
Kevin Chang, University of Illinois at Urbana-Champaign, USA
Kun Liu, Yahoo! Labs, USA
Louiqa Raschid, University of Maryland, College Park, USA
Michal Shmueli-Scheuer, IBM Haifa Research Lab, Israel
Michael Sheng, University of Adelaide, Australia
Mong Li Lee, National University of Singapore, Singapore
Rajeev Gupta, IBM India Research Lab, India
Vanja Josifovski, Yahoo Research, USA
Yannis Sismanis, IBM Almaden Research Center, USA
Yi Chen, Arizona State University, USA
Wen-syan Li, SAP, China
