Big Data Architectures For Machine Learning and Data Mining Seminar Kick-Off Meeting April 6th 2016 Web Technology and Information Systems Group Michael Völske [email protected] 1 © webis 2016

Big Data Architectures For Machine Learning and Data Mining

Seminar Kick-Off Meeting

April 6th 2016

Web Technology and Information Systems Group

Michael Völske
[email protected]

1 © webis 2016


What is Big Data
Different Points of View

“‘Big data’ is data that can’t be processed using standard databases because it is too big, too fast-moving, or too complex for traditional data processing tools.”
AnnaLee Saxenian (Dean, UC Berkeley School of Information)

“Big data is when data grows to the point that the technology supporting the data has to change. It also encompasses a variety of topics relating to how disparate data can be combined, processed into insights, and/or reworked into smart products.”
Anna Smith (Analytics Engineer, Rent the Runway)

“In my view, big data is data that requires novel processing techniques to handle. Typically, big data requires massive parallelism in some fashion (storage and/or compute) to deal with volume and processing variety.”
Brad Peters (Chief Product Officer, Birst)

[http://datascience.berkeley.edu/what-is-big-data/]


What is Big Data
Gartner’s “Three V’s”

Volume: Terabytes; Redundancy; Data Hoarding

Velocity: Real-time; Bursts, Streams; Rate of Change

Variety: Structured; Unstructured; Semi-Structured; Mixed

[http://www.gartner.com/it-glossary/big-data/]


The Big Data Architecture Stack

Data Consumption Layer: Artificial Intelligence, Reasoning under Uncertainty

Data Analytics Layer: Data Mining, Information Extraction & Enrichment

Data Management Layer: Structured Storage, Data Modeling

Hardware / Storage Layer: Distributed File System

Data Acquisition Layer: Crawling, Scraping, APIs (structured, unstructured, and semi-structured data)


Hadoop YARN
Common Infrastructure for Big Data Technologies


Hadoop YARN
Hadoop 2.0 Ecosystem

[http://hortonworks.com/hadoop/yarn/]



Hadoop YARN
YARN Architecture

[Figure: YARN architecture on a four-node cluster (Node 1 to Node 4). Components shown: Client, ResourceManager, NodeManager, App.Mstr (ApplicationMaster), Container. Labeled interactions: Job Submission, Node Status, Resource Request, MapReduce Status.]

[https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html]
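
The interactions named on this slide (job submission, resource requests, containers) can be illustrated with a toy in-memory model. The sketch below is not the Hadoop API: every class and function name is hypothetical, the "most free slots first" placement policy is an assumption made for brevity, and node status reporting is reduced to a simple slot counter.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Toy per-node agent that tracks free container slots on one machine."""
    name: str
    free_slots: int
    containers: list = field(default_factory=list)

    def launch(self, label):
        # Start one container on this node and return its location.
        self.free_slots -= 1
        self.containers.append(label)
        return f"{self.name}:{label}"


class ResourceManager:
    """Toy cluster-wide arbiter that grants containers on request."""

    def __init__(self, node_managers):
        self.node_managers = list(node_managers)

    def allocate(self, label):
        # Assumed policy: place the container on the node with the most free slots.
        node = max(self.node_managers, key=lambda nm: nm.free_slots)
        if node.free_slots <= 0:
            raise RuntimeError("no free container slots in the cluster")
        return node.launch(label)


class ApplicationMaster:
    """Toy per-job coordinator that requests one container per task."""

    def __init__(self, job, resource_manager):
        self.job = job
        self.resource_manager = resource_manager

    def run(self, tasks):
        return [self.resource_manager.allocate(f"{self.job}/{t}") for t in tasks]


def submit_job(resource_manager, job, tasks):
    # Job submission: the master itself first gets a container,
    # then it issues the resource requests for the job's tasks.
    master_location = resource_manager.allocate(f"{job}/app-master")
    master = ApplicationMaster(job, resource_manager)
    return master_location, master.run(tasks)


nodes = [NodeManager(f"node{i}", free_slots=2) for i in range(1, 5)]
rm = ResourceManager(nodes)
master_at, task_containers = submit_job(rm, "wordcount", ["map-0", "map-1", "reduce-0"])
print(master_at)
print(task_containers)
```

In this model the client only talks to the resource manager, while the per-job master negotiates containers for its own tasks, which is the division of labor the slide's labels suggest.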


betaweb Facts
Our Cluster

• 130 nodes
• 1500 cores
• 8 TB RAM
• 2 PB HDD
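
To get a feel for these totals, they can be broken down into rough per-node averages. This is only back-of-the-envelope arithmetic: it assumes decimal units (1 TB = 1000 GB, 1 PB = 1000 TB) and evenly equipped nodes, neither of which the slide states.

```python
# Cluster totals as given on the slide.
NODES = 130
CORES = 1500
RAM_GB = 8 * 1000    # 8 TB, assuming decimal units
DISK_TB = 2 * 1000   # 2 PB, assuming decimal units

# Averages only; the real nodes need not be identical.
cores_per_node = CORES / NODES
ram_gb_per_node = RAM_GB / NODES
disk_tb_per_node = DISK_TB / NODES

print(f"{cores_per_node:.1f} cores, "
      f"{ram_gb_per_node:.1f} GB RAM, "
      f"{disk_tb_per_node:.1f} TB disk per node")
```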


Big Data Architectures For Machine Learning and Data Mining
Seminar Deliverables

1. Short talk

• 10–20 minutes.
• Overview of one big data technology.
• How does it work?
• What is it good for?
• Installation instructions.
• Usage examples.

2. Seminar talk

• 30–45 minutes.
• Detailed introduction to one big data problem including state-of-the-art.
• Discussion about possible big data technologies to solve the problem.
• Presentation of implementation, evaluation, and results.

3. Seminar text

• High-quality text summarizing findings.


Big Data Architectures For Machine Learning and Data Mining
Seminar Topics

Short talk:

• HDFS and basic MapReduce.

• Apache Spark Basics.
• Spark MLlib.
• Spark GraphX.
• Spark Streaming.

• Apache Mahout “Samsara.”

• Apache Flink.

• DeepLearning4J.
• TensorFlow.

Seminar talk (general):

• Near-duplicate detection in large document collections.
• Ad-hoc Search Engine Index.
• Analyzing word similarity with distributed representations.
• Text re-use in Wikipedia.
• Exploring large document collections.
• Social Network Analysis.
• Recommendation Systems.
• Dimensionality Reduction.
• Real-time classification.
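
As a taste of the first short-talk topic, the MapReduce programming model can be tried out without a cluster. The following is a minimal single-process sketch of word counting in the map, shuffle, and reduce style. It does not use Hadoop; the function names are hypothetical, and a real framework would distribute each phase across machines.

```python
from collections import defaultdict


def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input record.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # would do between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single result.
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is big", "data mining on big data"])
print(counts)
```

Because the map and reduce functions only see one record or one key at a time, the same logic can in principle be spread over many nodes, which is the property the short talk would explore.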


Big Data Architectures For Machine Learning and Data Mining
Schedule

This week

• Survey for regular seminar time slot.
  http://doodle.com/poll/9td7qryihy73f295
• Reading: Leskovec, Rajaraman, Ullman. Mining of Massive Datasets.
  http://infolab.stanford.edu/~ullman/mmds/book.pdf

Weeks 2-3

• Tutorial: Installing Hadoop on virtual machines.
• Preparation of short talks.

Weeks 4-5

• Short Talks.
• Assignment of seminar talk topics.

Dates for the seminar talks are to be determined.


Big Data Architectures For Machine Learning and Data Mining
Thank you!

• Add your name and email address to the participants list.

• Watch the course web page for schedule updates.
  www.webis.de → “Teaching” → “SS 2016” → “Big Data Architectures For Machine Learning and Data Mining”

• Homework:

  – Download and install Oracle VirtualBox, as well as the virtual machine image for this course. Links will be provided on the course page.
  – Skim the “Mining of Massive Datasets” book.
  – Take the seminar time slot survey.
  – Further instructions by email.
