Big Data Analytics

Lucas Rego Drumond
Information Systems and Machine Learning Lab (ISMLL), Institute of Computer Science, University of Hildesheim, Germany

May 27, 2020



Outline

1. What is Big Data?

2. Overview

3. Organizational Stuff


1. What is Big Data?


“Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.” - Dan Ariely


Some definitions:

- “A collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.” (http://en.wikipedia.org/wiki/Big_data)

- “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” (www.gartner.com/it-glossary/big-data/)


Big Data is about:

- Storing and accessing large amounts of (unstructured) data

- Processing high volume data streams

- Making sense of the data

- Predictive technologies


Where to find Big Data?

Facebook:

- 1.28 billion users (1.23 billion monthly active in January 2014)

- Size of user data stored by Facebook: 300 petabytes

- Average amount of data that Facebook takes in daily: 600 terabytes

- Size of Facebook’s Graph Search database: 700 terabytes

Source: http://allfacebook.com/orcfile_b130817

Google:

- 3.3 billion searches per day (on average) [1]

- 30 trillion unique URLs identified on the Web [1]

- 20 billion sites crawled a day [1]

- In 2008 Google processed more than 20 petabytes of data per day [2]

Sources:
[1] http://searchengineland.com/google-search-press-129925
[2] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113.

Twitter:

- Average number of tweets per day: 58 million

- Number of Twitter search engine queries every day: 2.1 billion

- Total number of active registered Twitter users: 645,750,000

Source: http://www.statisticbrain.com/twitter-statistics/

- Ensembl database contains the genome of humans and 50 other species
  - “only” 250 GB

Source: http://www.ensembl.org/

- Large Hadron Collider has collected data from over 300 trillion proton-proton collisions
  - Approx. 25 petabytes per year

What to do with Big Data?

We don’t want to know things but to understand them!


What to do with Big Data? - Case Studies

- T-Mobile USA: integrated Big Data across multiple IT systems to combine customer transaction and interaction data in order to better predict customer defections
  - By leveraging social media data along with transaction data from CRM and billing systems, customer defections were cut in half in a single quarter.

- US Xpress: collects data elements ranging from fuel usage to tire condition to truck engine operations to GPS information
  - Optimal fleet management

- McLaren’s Formula One racing team: real-time car sensor data during car races
  - Real-time identification of issues with its racing cars


What to do with Big Data? - The BI Approach

Data Warehouse

- Static databases

- Structured data

- Centralized approaches


What to do with Big Data?

- Massive parallelism

- Heterogeneous data sources

- Unstructured data

- Data streams


Application examples:

- Online personalized advertising

- Sentiment analysis and behavior prediction

- Detecting adverse events and predicting their impact

- Automatic translation

- Image classification and object recognition

- Intelligent public services


How?

In order to deal with large volumes of data we need to address the following challenges:

- Effectively store large amounts of data in a distributed environment

- Query distributed databases

- Parallel and distributed programming models

- Data mining and machine learning techniques to make sense of the data

- Effective data visualisation techniques


2. Overview


Course structure (layered diagram, top to bottom):

- Part III: Machine Learning Algorithms

- Part II: Large Scale Computational Models

- Part I: Distributed Database, Distributed File System


Part I: Distributed Database and Distributed File System


Storing

In a distributed environment the data storage mechanisms should address the following issues:

- Parallel reading and writing

- Data node failures

- High availability


Distributed File Systems

The Google File System Architecture
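
The slide shows only the architecture figure. As a rough sketch of how a distributed file system can address the storage issues listed earlier (parallel access, node failures, availability), the snippet below splits a file into fixed-size chunks and places each chunk on several nodes. The 64 MB chunk size and three replicas follow the defaults described for GFS; the round-robin placement rule and all function and node names are hypothetical simplifications, not the actual GFS placement policy.

```python
import math

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as described for GFS
REPLICAS = 3                   # default replication factor described for GFS


def place_chunks(file_size, nodes, chunk_size=CHUNK_SIZE, replicas=REPLICAS):
    """Map each chunk index to the list of nodes holding a replica.

    Round-robin placement is an illustrative assumption only.
    """
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    n_chunks = math.ceil(file_size / chunk_size)
    placement = {}
    for chunk in range(n_chunks):
        placement[chunk] = [nodes[(chunk + r) % len(nodes)] for r in range(replicas)]
    return placement


def readable_after_failure(placement, failed_node):
    """The file stays readable if every chunk has a replica on a live node."""
    return all(any(n != failed_node for n in holders)
               for holders in placement.values())


nodes = ["node-a", "node-b", "node-c", "node-d"]    # hypothetical chunk servers
placement = place_chunks(200 * 1024 * 1024, nodes)  # a 200 MB file needs 4 chunks
```

Because every chunk lives on three different nodes, different clients can read different replicas in parallel, and the loss of any single node leaves all chunks available.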


Databases

Databases are needed for:

- Querying and indexing

- Transaction processing

State of the art: relational databases

For processing big data one needs a database which:

- Supports a high level of parallelism

- Supports analytical processing

- Has a flexible data model to deal with unstructured data sources


Databases for Big Data - NoSQL

NoSQL - “Not only SQL”

- Wide variety of database technologies

- Dynamic schema

- Sharded indexing

- Horizontal scaling

- Support for columnar storage
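
A minimal sketch of two of the properties listed above: dynamic schema and horizontal scaling through sharding. Documents are plain dictionaries with no fixed set of fields, and each one is routed to a shard by hashing its key. The `ShardedStore` class, its methods, the sample records, and the hash-modulo routing rule are hypothetical illustrations, not the API of any particular NoSQL system.

```python
import hashlib


class ShardedStore:
    """Toy in-memory document store split across several shards."""

    def __init__(self, n_shards):
        self.shards = [{} for _ in range(n_shards)]

    def _shard_for(self, key):
        # Stable hash, so the same key always maps to the same shard.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, document):
        self.shards[self._shard_for(key)][key] = document

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)


store = ShardedStore(n_shards=4)
# Dynamic schema: the two documents do not share the same fields.
store.put("user:1", {"name": "alice", "follows": ["user:2"]})
store.put("user:2", {"name": "bob", "city": "Hildesheim"})
```

With hash-modulo routing, changing the number of shards remaps most keys, which is why production systems usually rely on schemes such as consistent hashing or range partitioning instead.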


NoSQL Databases


Part II: Large Scale Computational Models


Accessing

A computational model is needed to:

- Provide a set of useful computational primitives

- Hide the complexity of distributed and parallel programming

- Ensure fault tolerance

Examples:

- MapReduce

- GraphLab

- Pregel

- Apache Spark


MapReduce
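
The slide shows only a figure. As a minimal sketch of the programming model, here is the classic word-count example expressed as a map step, a shuffle step, and a reduce step. It runs in a single process; in a real MapReduce framework the map and reduce calls would run in parallel on many machines, and the framework would handle the shuffle and fault tolerance. The function names and the toy input are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


documents = ["big data is big", "data streams"]  # toy input
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

The user only writes `map_phase` and `reduce_phase`; everything between them is the part a framework hides, which is the sense in which the model hides the complexity of distributed programming.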


GraphLab


Apache Spark


Part III: Machine Learning Algorithms


Making sense of the data

- Linear and nonlinear models for classification and regression

- Scalable learning algorithms (e.g. Stochastic Gradient Descent)

- Distributed learning algorithms (e.g. ADMM)

- Models for link prediction and link analysis

- Factorization models

- Distributed learning schemes (e.g. NOMAD, FPSGD)
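
To make the Stochastic Gradient Descent item concrete, here is a minimal sketch that fits a one-dimensional linear regression. Each update uses a single example, so the cost of one step does not grow with the size of the data set. The squared-error loss, the learning rate, the number of epochs, and the toy data are all assumptions chosen for illustration.

```python
import random


def sgd_linear_regression(data, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w * x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)  # visit the examples in a fresh random order
        for x, y in data:
            error = (w * x + b) - y
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= lr * error * x
            b -= lr * error
    return w, b


# Toy data generated from y = 2x + 1 (noise-free, for illustration only).
points = [(x / 10.0, 2.0 * x / 10.0 + 1.0) for x in range(-10, 11)]
w, b = sgd_linear_regression(points)
```

On this noise-free toy data the estimates approach the generating values w = 2 and b = 1.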


Classification

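[Figure: linear classifier in the $(x_1, x_2)$ plane with normal vector $w$. The separating hyperplane is $w \cdot x + b = 0$ and the two margin boundaries are $w \cdot x + b = 1$ and $w \cdot x + b = -1$. The figure is labelled with the margin width $\frac{2}{\|w\|}$ and the offset $\frac{b}{\|w\|}$.]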
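
The labels in the figure describe a linear classifier with decision boundary w · x + b = 0 and margin boundaries w · x + b = ±1, which lie a distance of 2/‖w‖ apart. A small sketch of these two quantities follows; the weight vector, the bias, and the test points are hypothetical values chosen so that ‖w‖ = 5.

```python
import math


def decision_value(w, b, x):
    """Signed score w . x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def margin_width(w):
    """Distance between the hyperplanes w . x + b = 1 and w . x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


w, b = [3.0, 4.0], -5.0  # hypothetical parameters with ||w|| = 5
predictions = [1 if decision_value(w, b, x) >= 0 else -1
               for x in [(3.0, 4.0), (0.0, 0.0)]]
```

Since the margin width is 2/‖w‖, making the margin as wide as possible is equivalent to making ‖w‖ as small as possible while keeping all training points on the correct side of their margin boundary.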


Recommender Systems


Graph Analysis

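[Figure: multi-relational graph with the nodes $l$ (Lucas), $n$ (Nico), $o$ (Obama), $c$ (Annexation of Crimea) and $p$ (Paco’s Death), connected by three relations $y_F$, $y_S$ and $y_C$.]

Known relation values:

$y_F(l, o) = 1$, $y_S(l, n) = 1$, $y_S(n, l) = 1$, $y_C(l, p) = 1$, $y_C(o, c) = 1$

Relation values marked as unknown:

$y_F(o, l) = ?$, $y_F(n, o) = ?$, $y_F(n, l) = ?$

$y_S(o, l) = ?$, $y_S(n, o) = ?$, $y_S(n, l) = ?$

$y_C(o, p) = ?$, $y_C(n, o) = ?$, $y_C(l, c) = ?$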


Course Overview

Main goal: predictive analytics from large scale data!

- Introduction (1 lecture)

- Machine learning problems affected by Big Data (3 lectures)

- Distributed learning algorithms (3 lectures)

- Parallel and distributed programming models (4 lectures)

- Large scale storage and retrieval mechanisms (1 lecture)


3. Organizational Stuff


Exercises and tutorials

- There will be a weekly sheet with two exercises, handed out each Friday in the tutorial. The 1st sheet will be handed out on Fri. 24.04.

- Solutions to the exercises can be submitted until the following Thursday before the lecture. The 1st sheet is due Thu. 30.04.

- Exercises will be corrected.

- Tutorials take place each Friday, 10-12. The 1st tutorial is on Friday 24.04.

- Successful participation in the tutorial gives up to 10% bonus points for the exam.


Exams and credit points

- There will be a written exam at the end of the term (2h, 4 problems).

- The course gives 6 ECTS.

- The course can be used in:
  - IMIT MSc. / Informatik / Gebiet KI & ML
  - Wirtschaftsinformatik MSc / Informatik / Gebiet KI & ML


Some books

- Anand Rajaraman, Jure Leskovec, and Jeffrey Ullman: “Mining of Massive Datasets”. Available online: http://infolab.stanford.edu/~ullman/mmds.html

- Gautam Shroff: “The Intelligent Web: Search, smart algorithms, and big data”
