ER 2018 keynote (rev8)-2(공개버전) 2018 keynote...Iterative processing Enhancements: Spark, Flink, … Stream processing Spark Streaming, Flink, … Graph processing Giraph, Turi,

2018. 10 KAIST/DGIST CIKM Copyright © 2009 by Kyu-Young Whang 1

October 23rd, 2018

Recent Trends of Big Data Platforms and Applications*

Kyu-Young WhangACM Fellow, IEEE Life FellowKAIST Distinguished Professor, Professor Emeritus

School of Computing, KAISTDGIST Visiting Chair Professor

Department of Information and Communication Engineering

* A joint work with Jae-Gil Lee, Prof., Dept. of Industrial and Systems Eng., KAIST

ER 2018 keynote

2018. 10 KAIST/DGIST

ContentsContents

Part I: Big Data—Introduction

Part II: Big Data Platforms

Part III: Big Data Applications

Conclusions

Copyright © 2018 by Kyu-Young Whang 2


Part I: Big Data—Introduction



Big Data: DefinitionsBig Data: Definitions

“A term used to refer to the study and applications of data sets that are too complex for traditional data-processing application software to adequately deal with” [Wik18a]

“Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions” The highlighted answer of Google Search as of Sept. 2018

Volume, Velocity, and Variety, Veracity, Value (a.k.a. Three V’s, Five V’s) [Lan01, Goe14, Wik18a]


Copyright © 2018 by Kyu-Young Whang 5Source: http://www.visualcapitalist.com/internet-minute-2018/


Main Sources of Big DataMain Sources of Big Data

Internet-of-Things (IoT)

Web 2.0 (e.g., social media)

Scientific experiments

Mobile data

Cloud data

Healthcare data…..



Big Data from IoTBig Data from IoT

Abundant data are being generated from many sensors in our daily life Example: GPS data from our cell phones

More and more devices are being connected to the Internet (by wireless network) Example: Smart TV’s, Blu-ray Players,

Gaming Consoles, Streaming Media Hubs (Apple TV), Refrigerators

All these are part of what’s called Internet of Things (IoT)



Big Data from Web 2.0Big Data from Web 2.0

Users can provide the data that is on a Web 2.0 site and exercise some control over that data Web 2.0 is bidirectional, i.e., users are creators of user-generated

content as well as consumers

Examples of Web 2.0 include Social networking sites (e.g., Facebook, Twitter) Blogs (e.g., Tumblr) Wikis (e.g., Wikipedia) Photo/video sharing sites (e.g., Flickr, YouTube) …



Social networking sites are attracting significant interestsworldwide and producing big data The data are modeled using a graph, where a node is a person and

an edge is a relationship between them (followers, friends, etc.)


user-generated data a user

a relationship

U1


Big Data from Scientific ExperimentsBig Data from Scientific Experiments

Many scientists are using and producing vast amounts of data through scientific simulations and observations [Gra02]

Scientists try to discover patterns, trends, hidden messages, or even truth from this vast amount of scientific data through intensive analysis

Here comes the notion of data science that “uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data in various forms”[Wik18g]

(Data science deals with any type of data as well as scientific data) Turing award winner Jim Gray envisioned data science as a

fourth paradigm of science following empirical, theoretical, and computational sciences in the human history[Gra02]


=

~1000 AD, empirical science

~1500-1950s,theoretical science

~1950s-1990s, computational science

~1990-now,data science


Examples of Big Data Sources from Scientific ExperimentsExamples of Big Data Sources from Scientific Experiments

CERN particle accelerator The researches proved the existence of the Higgs boson, a particle

predicted to exist 50 years earlier They evaluated massive amounts of data (25 petabytes/year)

collected during the three years of testing [Wik18h]

DNA sequencing The amount of DNA sequence data reaches a fourth of the size of

YouTube's yearly data production Source: Washington Post, July 7, 2015


Illumina DNA Sequencer


Recent Rise of Big DataRecent Rise of Big Data

Computing power improvement⇒ both CPU and data storage capacity

Accessibility improvement ⇒ in terms of computing power and data storage

through Cloud service

Easy distributed, parallel programming ⇒ such as MapReduce and Hadoop on distributed platforms



Computing Power ImprovementComputing Power Improvement

Moore’s Law: “…the number of transistors in a dense integrated circuit doubles about every two years” [Wik18b]

Copyright © 2018 by Kyu-Young Whang 13Source: https://en.wikipedia.org/wiki/Moore%27s_law


Accessibility ImprovementAccessibility Improvement

Users now can lease the computing power and data storage from the Cloud service at a cheap price Example: Amazon EC2, Microsoft Azure, Google Cloud


Create accounts for the virtual machine and storage through the Web


Distributed, Parallel Programming: IntroductionDistributed, Parallel Programming: Introduction

Computing (for analysis) The amount of work required becomes greater than the capacity of a

single CPU. Thus, to fill in the gap, we find the need to employ parallelism.

Data Storage Data are too big to store in one node and too slow to read from a

singe node. The obvious solution is to store data in multiple nodes and read from them simultaneously and in a distributed fashion.


Source: http://www.spiral.net/

gap


Distributed, Parallel Programming: ChallengesDistributed, Parallel Programming: Challenges

However, there are several problems in working with multiple machines Coordination among multiple nodes Dealing with frequent hardware failures when we work with a large

number of inexpensive processors and storage devices → replication

Nevertheless, the programmers do not want to think about these complexities


Image Source : http://hirendave.tech/programmers/tech-humor-a-programmers-revenge-deal-with-frustration-at-work/


MapReduceMapReduce

“A programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster” [Wik18c]

Typically for batch-oriented large-scale parallelization

Inspired by functional programming’s map() and reduce() functions

Proposed by Jeffrey Dean and Sanjay Ghemawat [Dea04] at Google in 2004 Cited more than 25,000 times as of Sept. 2018



MapReduce: Map & ReduceMapReduce: Map & Reduce

MapReduce works by breaking the processing into two phases: the map phase and the reduce phase Map usually performs filtering and sorting Reduce usually performs a summary operation The output of map is provided to the input of reduce Each phase has key-value pairs as input and output, the types of

which may be chosen by the programmer



Example: Count the frequency of each unique word in a large document collection Required for building an inverted index of a search engine

1. Map: Going through all the words in a split and producing a set of key-value pairs <word, frequency>

2. Reduce: Aggregating the frequencies for the same key3. Finally producing counts of all unique words in the output

MapReduce: ExampleMapReduce: Example

Copyright © 2018 by Kyu-Young Whang

19

big … big ... data … data …

big … data … … ER … ER

Split 1

Split 2

Map

Map

Reduce<data, 4><ER, 2>

<big, 1>

<big, 1><big, 1>

big … data ... …

Split 3

Map

Reduce<data, 1><data, 1>

<data, 1><ER, 1><ER 1>

<big, 1><data, 1>

<big, 4>

<big, 4><data, 4><ER, 2>

Output


HadoopHadoop

A software framework for distributed storage and processing of big data using the MapReduce programming model The most popular open-source implementation of MapReduce Being developed as a top-level Apache project

Significance: Why Hadoop? Because Hadoop takes care of these complexities, the programmers

don’t have to worry about the complexities behind parallel distributed processing



In addition,

Hadoop Distributed File System (HDFS) supports fault tolerance and load balancing through replication (default: 3)

Hadoop provides horizontal scalability by using inexpensive commodity machines



Hadoop: Main ComponentsHadoop: Main Components

Hadoop Distributed File System (HDFS): a distributed file system that stores data on a massive number of commodity machines

Hadoop YARN: a platform for managing and scheduling computing resources in clusters (resource manager)

Hadoop MapReduce: an implementation of the MapReduceprogramming model

Hadoop Common: libraries and utilities


Common

Source: http://www.dataintegration.ninja/big-data-and-hadoop-features-and-core-architecture/


Hadoop: Limitations and Other PlatformsHadoop: Limitations and Other Platforms

Iterative processing Enhancements: Spark, Flink, …

Stream processing Spark Streaming, Flink, …

Graph processing Giraph, Turi, … ← not covered in this talk

High-level functionality—such as SQL, schemas, transactions, and indexes SQL-on-Hadoop, NewSQL

…



Iterative ProcessingIterative Processing

Hadoop is optimized for a single batch job, whereas applications such as machine learning typically needs iterative computation that repeats until convergence

Limitations of Hadoop for iterative processing Repeatedly writing the intermediate output (the result of the i-th

iteration) to disk and reading it from disk again at the next iteration, causing excessive disk I/O’s—degrading performance.


Output from Iter1

Input Output from Iter2

Map Reduce Map Reduce Map Reduce

HDFS

Iteration (Job) 1 Iteration (Job) 2 Iteration (Job) 3


Spark for Iterative ProcessingSpark for Iterative Processing

Solutions of Spark In Spark, a main architectural component is the Resilient

Distributed Dataset (RDD), which is a main-memory structure representing a working set (i.e., intermediate results). It is distributed over a cluster of nodes, and is fault-tolerant.

By using RDDs across iterations, we can eliminate expensive disk I/O’s.


RDD’s are stored in main memory

Main components of Spark (Source: http://datastrophic.io/ )


RDD A read-only main memory structure distributed over a cluster of

nodes Acting as a cache for an input, an intermediate result, or an output Speeding up iterative operations (transformations)

Allows parallel execution according to a DAG structure representing data flow

Fault-tolerant (or chunks in RDD) An RDD is made fault tolerant by keeping track of its lineage and

reconstructing it in case of data loss.



Stream ProcessingStream Processing

Hadoop is designed for a job that requires (almost) allinput data ready to start processing, whereas real-time applications need to return the results as soon as an input stream arrives

Limitation of Hadoop for stream processing The entire input should reside on the HDFS (disks) before processing The reducers will start only after all the mappers are completed, i.e.,

after all data splits (each with 128 MB) are read in and processed by Map

Therefore, real-time processing is not feasible



Storm for Stream ProcessingStorm for Stream Processing

Solutions of Storm for stream processing Storm: a streaming engine or a data stream management system(DSMS) In Storm, data are processed in real time as they arrive

Topology: defines a (continuous) query, in the form of a directed acyclic graph(DAG), consisting of Spouts, Bolts, and Streams (edges)

Spout: defines a stream source Bolt: defines work logic executing each record (default) or each micro batch

(Microbatch is a set of data records collected for a very short period of time)


Spout

Bolt

Storm topology

<source: http://storm.apache.org>


Part II: Big Data Platforms



Evolution of Data Management Systems [Wha18]Evolution of Data Management Systems [Wha18]

When MapReduce (or NoSQL) initially came about in 2004, we lost much of the high-level functionality of the relational DBMS—such as SQL, indexing, schemas, and transactions—in return for scalability

Since then, there have been a lot of efforts to restore them Two distinct trends SQL-on-Hadoop NewSQL initiatives


O/Sfile system

Scalability

FunctionalityRDBMS

NoSQL

SQL-on-HadoopNewSQL


NoSQLNoSQL

Providing a mechanism for storage and retrieval of data other than in tabular form (typically, in key-value format)

Mostly using low-level query languages instead of SQLTypically compromising consistency in favor of scalability

and speed Optimized for horizontal scaling to a cluster of machines Add more machines to the cluster when needed without

architectural changesExamples:


Big Table


Advantages and Disadvantages of NoSQLAdvantages and Disadvantages of NoSQL

Advantages Scalability

A large address space for big data spanning over a number of independent nodes(machines) connected through a network (i.e., a distributed file system)

High performance of sequential scan Good for batch-oriented jobs (such as OLAP work load)

Fault tolerance and high availability Through (low-level) replication (with inexpensive commodity hardware)

Disadvantages Lack of transactions and consistency(i.e., ACIDity) Not adequate for OLTP workload (short transactions with typically

random access, requiring ACIDity) Lack of high-level functionality such as SQL, schemas, and secondary

indexes



SQL-on-HadoopSQL-on-Hadoop

Providing an SQL-like high-level language on top of the parallel execution layer of MapReduce/Hadoop

Suitable for processing OLAP-like batch operations

Examples [Abo15]:


(Facebook)

(open source)


NewSQLNewSQL

“…provide the same scalable performance of NoSQL systems for on-line transaction processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system” [Wik18d]

Providing high-level functionality (e.g., SQL, transactions, schemas, and secondary indexes) of conventional DBMSs

Base architecture: “shared-nothing parallel DBMS”Examples [Asl11][Kat13]:


Google Spanner


SQL on Hadoop vs. NewSQL: Representative Systems(by no means exhaustive)SQL on Hadoop vs. NewSQL: Representative Systems(by no means exhaustive)


Shared-nothing parallel DBMS

F1language&modellayer

parallel executionlayer

key-value/relationallayer

storagelayer GFS HDFS Colossus

Spanner VoltDB

Hadapt(Hadoop

DB)Presto

ImpalaSparkSQL

Phoenix

Hadoop (MapReduce)

Hbase

CassandraDBMSs

for storage

Bigtable

Google Global

Cache(GGC)cachelayer

NoSQL/SQL-on-Hadoop NewSQL

Google Facebook

MySQLCluster

Mem

cach

ed/T

AO

Odysseus DBMS

ODYS Parallel-IR

Local disk

KAIST(ODYS)

Spark

Map/Reduce


Shared-Nothing Parallel DBMSsShared-Nothing Parallel DBMSs

Parallel DBMSs can be as good as or even better than MapReduce in performance Stonebreaker et al.[Sto10] have shown that parallel DBMSs are (linearly)

scalable and capable of processing petabyte-scale databases and large-scale query loads

Floratou et al.[Flo11] have shown that parallel DBMSs outperform MapReduce providing high performance and scalability by partitioning and storing tables in multiple nodes configured in a shared-nothing manner

Drawbacks of Parallel DBMS Expensive Too heavy by having too much functionality that is not needed in practical

large-scale applications—including capability of processing global transactions with general workload

Not suitable where faults occur frequently Hard to setup and use



NewSQL FeaturesNewSQL Features

Improvements were made in various aspects towards inexpensive massively-parallel DBMSs Shared-nothing architecture ⇒ all NewSQL systems

Row/column partitioning Hash partitioning

Light-weighted (slimed down) ⇒ VoltDB Single-threaded: minimal locking, no latches Simple logs: snapshots+command logging (to disk) Fault tolerance, availability: relying on replicas (automatic data replication)

Main memory DBMS ⇒ VoltDB, SAP Hana Concurrency control

Timestamp based(minimizing use of locks) Multi-version concurrency control (MVCC)⇒ Spanner, VoltDB, SAP Hana

37Copyright © 2016 by Kyu-Young Whang


ODYS Search Engine[Wha13]—A NewSQL ApproachODYS Search Engine[Wha13]—A NewSQL Approach

A massively-parallel search engine [Wha13] using the Odysseus Object-Relational DBMS developed at KAIST [Wha02,05,15]

Efficiency DB-IR tight integration [Wha02] IR features are implemented directly into the core of the DBMS

speeding up search performance

Scalability Shared-nothing architecture Each slave capable of indexing 100 million Web documents Supporting massive databases (64-bit architecture)



High-level functionality of ODYS search engine SQL (especially, DB-IR integrated queries)

Easy implementation of query interfaces by translating keyword queries into SQL queries

Selection, aggregation, limited join (where only one table is partitioned), etc. Schemas

Activation of advanced query processing methods (i.e., attribute embedding or IR index join) by controlling the schema

Indexes B+-tree indexes for structured data and IR indexes for unstructured data (i.e.,

text) Transactions/consistency

ACIDity for local transactions in a single machine Global transactions possible with 2-phase commit (not fully implemented)

Immediate update allowed


2018. 10 KAIST/DGIST Copyright © 2018 by Kyu-Young Whang 40

Odysseus DB-IR-Spatial Tightly-Integrated DBMS Odysseus DB-IR-Spatial Tightly-Integrated DBMS

An Object-Relational DBMS developed at KAIST for over 26 years (1990 – 2016) [Wha02, 03, 05, 07, 10, 12, 13, 15]

An earlier version of this technology played a vital role in starting up NaverCom Co. (currently, Naver Co.) in 1996-2000, which has been the number one portal in Korea

Best Demonstration Award at the IEEE 21st Int’l Conf. on Data Engineering (ICDE), Tokyo, Japan, Apr. 5-8, 2005 [Wha05]

Tight integration of IR features (U.S. patented [Wha02]) as well as spatial database features with the DBMS Being a DBMS and, at the same time, a search engine [Wha02, Wha03, Wha05, Wha15] Being a DBMS and, at the same time, a GIS engine [Wha07, Wha10]

Concurrency control and recovery Coarse granularity locking version: the shadow-page deferred-update recovery method (US patented

[Wha12]) Fine granularity locking version: the ARIES recovery method [Moh92]

Having many commercial applications

Consisting of approximately 600,000 lines of C/C++ (precision) codes

Open source released (600,000+ lines of C, C++)(Only the coarse granularity locking version has been released as of Aug. 2016)

https://github.com/odysseus-oosql/ODYSSEUS-OOSQLhttps://dblab.kaist.ac.kr/Open-Software/ODYSSEUS/main.html


Structure of the IR Index for DB-IR Tight Integration (U.S. Patented) [Wha02]Structure of the IR Index for DB-IR Tight Integration (U.S. Patented) [Wha02]


Keyword

B+-tree

Large object tree

Posting List

Large object tree

Large object tree

Large object tree

Subindex (for each large object)

B+-tree

# of Postings docId OID # of Occurrences offset

posting

Posting List


Implementation of the IR Index [Wha15]Implementation of the IR Index [Wha15]

IR index is a composite structure consisting of a relation, B+-trees, and sub-indexes


Example: pageInfo relationsiteId

(integer)content(text)

pageInfo_content_Inverted

keyword reverseKeyword nPostings postingList

B+-tree B+-tree

Sub-index

Sub-index

large object

IR index is created

Reversed strings need not be stored. This

field does not contain a value.

title(text)

URL(varchar)

pageInfo_title_Inverted

IR index for pageInfo.content

an index on a tuple instance


ODYS Search Engine: Shared-Nothing Parallel Architecture Using Odysseus DBMS [Wha13]ODYS Search Engine: Shared-Nothing Parallel Architecture Using Odysseus DBMS [Wha13]


ODYS Parallel-IR Masters

. . .

LAN card1 LAN cardnh. . .

machineprocess

disk array

Slave1 Slavens

. . . . . .. . .

. . . . . .

Slave 1+nhns1)-(nhSlave

nhns

. . .Disk1 Diskw

Odysseus DBMS

Shared buffer

. . .Odysseus DBMS

Parent

Child(async. calls)

Hub1 Hubnh


Experimental Setting (10-Node ODYS Prototype)Experimental Setting (10-Node ODYS Prototype)

Master †

One Linux machine (one Quad-Core 3.0GHz CPUs, 6GB RAM)

Slaves ‡ (10 slaves) Four Linux machines (two Dual-Core 3.0GHz CPUs, 4GB RAM) One Linux machine (one Quad-Core 2.5GHz CPU, 4GB RAM) Five Linux machines (one Quad-Core 2.4GHz CPU, 8GB RAM) Four disk arrays (AS-2400~AS-2500, 0.9TB~3.9TB, RAID5, 200MB/s bandwidth, 512MB~1GB cache, average

59.5 MB/s disk transfer rate, 13 disks (arms) + 1 parity disk + 1 hot spare ) One disk array (TN-6416S, 13TB, RAID5, 4Gbit/s bandwidth, 512MB cache, average 83.3MB/s disk transfer

rate, 13 disks (arms) + 1 parity disk) Five internal disk arrays (B110i, 5TB, 768MB/s bandwidth, 81.2MB/s disk transfer rate, 10 disk (arms) + 1

parity disk)

Network †‡

Eleven gigabit LAN cards(Intel 82574L dual-port(1), Intel 82541GI single-port(5), HP NC326i dual-port(5)) A gigabit hub (HP 1410-24G, 1000Mbps, 24port)

Data 114 million Web documents 2 (duplicated)

Size of loaded data : Web page (1.55TB) and IR index (1.84TB) for 228 million Web documents Each slave indexes 22.8 million web documents (Note: A slave is capable of indexing 100 million documents)

† The master (ODYS Parallel-IR) consists of 58,000 lines of C and C++ code‡ The slave (Odysseus DBMS) consists of 600,000 lines of C and C++ code†‡ We use socket-based RPC consisting of 17,000 lines of C, C++, and Python code developed by the authors



Performance Projection for 300-Node Real-World-Scale ODYS Performance Projection for 300-Node Real-World-Scale ODYS

300-Node Real-World-Scale ODYS One ODYS set consisting of 4 masters and 300 slaves 300 slaves capable of indexing 30 billion Web pages Performance projection through performance modelling and

experimental data from the 10-Node ODYS prototype



Estimated Average Total Query Response Time (300-node ODYS) [Wha13](measured with a 10-node ODYS and extended to a 300-node one through performance modelling)


0

100

200

300

400

500

600

700

800

0 2 4 6 8 10 12 14 16

TOTAL-EST-300SLAVE-MAX-EST-300

Arrival rate (million queries/day/set)

Ave

rage

tota

l que

ryre

spon

se ti

me

(ms)

− Query load: 1 billion queries/day †

(QUERY-MIX1)

− Web pages indexed: 6.84 (30) billions(22.8 (100) million Web pages/slave)() indicates max capacity

− Nodes required: 43,472 (for 194ms/query)

− Nodes required: 86,944 (for 148ms/query)

3.5 7

Requires 143 sets of 304 nodes = 43,472 nodes

Requires 286 sets of 304 nodes = 86,944 nodes

194ms

148ms

† Google Search Statistics [Goo18] indicates Google Search handles 3 billion queries/day as of May 30th, 2018Nielsenwire [Nie10] reports that Google handled 214 million queries/day in the U.S. in Feb. 2010


Summary of ODYSSummary of ODYS

We have shown that a massively-parallel search engine can be implemented using a DB-IR tightly-integrated parallel DBMS Capable of handing real-world scale data and query loads

Providing high-level functionality

We have shown detailed implementation of the ODYS search engine Being capable of indexing 100 million Web pages/node with shared-

nothing architecture → high scalability

Having tightly integrated DB-IR capability → high performance

Having SQL, schemas, and indexes → high-level functionality



Part III: Big Data Applications



IBM Watson

Recommender Systems

Intelligent Personal Assistants

Alpha Go



IBM WatsonIBM Watson

A question-answering computer system that can answer the questions in natural language In 2011, Watson competed on Jeopardy! against legendary

champions Brad Rutter and Ken Jennings and won the 1st place


Source: https://www.youtube.com/watch?v=P18EdAKuC1U


The knowledge was built from a vast amount of data Encyclopedias, dictionaries, thesauri, newswire articles, and literary

works as well as databases, taxonomies, and ontologies (e.g., DBPedia, WordNet, and Yago) ← tens of millions of documents[Wik18i]


<Image Source: [Fer10]>

Big Data


IBM Watson for OncologyIBM Watson for Oncology

A solution for cancer treatments A Q&A system similar to Jeopardy! Fueled by Big Data from relevant guidelines, best

practices, and medical journals and textbooks (B.T.W., doctors can hardly afford time to read allthese articles or documents)

Evaluates medical evidence and displays potential treatment options ranked by level of confidence


Image source: http://www.ibm.com/


However, it was also reported that IBM's Watson gave unsafe recommendations for treating cancer “IBM’s Watson Hasn’t Beaten Cancer, But A.I. Still Has Promise”

“…But in the documents obtained by STAT (medical web site), doctors who had tried to use Watson to help them design treatment complained that the system wasn’t ready to practice medicine.”

<Source: Bloomberg, August 25, 2018, https://www.bloomberg.com/view/articles/2018-08-24/ibm-s-watson-failed-against-cancer-but-a-i-still-has-promise >

Possible problems: quality of the information sources



Recommender SystemsRecommender Systems

A platform or engine that suggests items to users by predicting how they would rate the items 35% of consumer purchases on Amazon and 75% of video watches

on Netflix come from recommendations [Mac13] Example: Amazon.com’s recommender system



Basic approach: collaborative filtering Example: Two users U1 and U3 have provided similar feedback (preference) on

other items, and it is reasonable to recommend I3 to U3 Here, again, Big Data plays a vital role in making the recommender system

produce intelligent answer with high confidence. (We have a number of U1’s to make the confidence level high)


I1 I2 I3 I4 I5 I6 I7U1 5.0 5.0 3.0 4.0U2 3.0 2.5U3 4.5 ? 3.0 4.5U4 3.0 4.0U5U6 3.5 5.0 4.0 4.5U7 4.0 5.0 4.0 4.5 4.5

Items

Use

rs

purchases, ratings, or preferences

Big Data: abundant data for recommendation from 26.5M transactions/sec in Amazon


Recently, deep learning has been actively adopted for recommendation Given user profiles and item features, the probability of a user u’s

buying an item i is trained by deep learning techniques

Amazon DSSTNE (pronounced "Destiny") is an open source software library for training and deploying recommendation models


Inpu

t:U

ser p

rofil

es a

ndite

m fe

atur

es

Out

put:

Use

r-ite

m p

roba

bilit

y


Intelligent Personal AssistantIntelligent Personal Assistant

“A mobile software agent that can perform tasks, or services, on behalf of an individual based on a combination of user input, location awareness, and the ability to access information from a variety of online sources…” [Wik18e]

Example:


Apple SiriSamsung

Amazon


Example: Google AssistantExample: Google Assistant

“Google Assistant” exploits Big Data, such as search keywords, locations visited, e-mails, and calendar entries, to provide suitable answers and recommendations Example: It prompts you about half an

hour before you leave to let you know the approximate drive time based on current traffic conditions (by looking at your calendar entries!)



Example: Amazon EchoExample: Amazon Echo

“Echo” is a device connected to Amazon’s intelligent personal assistant, Alexa

Amazon is selling Echo at a very low price ($30) to customers Over 5M Echo devices have been sold in the last 2 years All those people asking Alexa to order kitchen supplies, turn

on the lights, or play music gives Amazon a valuable stockpile of data (adding to Big Data)

By using this Big Data, Amazon builds a “360-degree view” of their customers’ buying habits.


Amazon Echo

Big data are the most valuable assets Facebook knows the people you know and places you go;Google knows the things you use and search on the Internet;Amazon knows the items you buy online (or even offline—through Amazon go, etc.)


AlphaGoAlphaGo

A computer program that plays Go, which is developed by Google DeepMind In March 2016, AlphaGo beat Lee Sedol with 4 to 1 score In May 2017, AlphaGo beat Ke Jie, the world No. 1 ranked

player at the time, by 3 to 0 score.


Source: https://www.inverse.com/article/30681-alphago-documentary-tribeca-film-festival

[Sil16]


“Configuration and Strength”[Wik18f], “Power Consumption”[Dee18]

Learning [Wik18f] AlphaGo was initially trained to mimic human play by attempting to

match the moves of expert players from recorded historical games It was trained using a KGS Go Server database (Big Data) of around 30

million moves from 160,000 games played by 6 to 9 dan human players Supervised learning + reinforcement learning In this version, without the Big Data of human games, it would not have

been possible to reach the level of human players


Power consumption


AlphaGo ZeroAlphaGo Zero

A version created without using data from human games but became stronger than any other previous versions [Sil17]

By playing games against itself (self-play), AlphaGo Zero surpassed the strength of AlphaGo Lee in three days by winning 100 games to 0, reached the level of AlphaGo Master in 21 days, and exceeded all the old versions in 40 days[Dee18]

Training was done soly based on reinforcement learning without recorded moves from human games

Where is the role of Big Data? Answer: it generates its own Big Data through an enormous number of self-plays


[Sil17]


ConclusionsConclusions

Many distributed data processing platforms for Big Data have been actively developed in industry and academia

The ODYS search engine, developed at KAIST, has shown that a massively-parallel search engine with higher functionality can be implemented using a DB-IR tightly-integrated parallel DBMS.

Emerging applications are realizing big data intelligence

The boom of artificial intelligence is fueled by recent Big Data technologies Big Data is essential for training the deep neural network



ReferencesReferences

[Abo15] Daniel Abadi et al., “Tutorial: SQL-on-Hadoop Systems,”, In Proc. 41st Int’l Conf. on Very Large Data Bases, pp. 2050-2051, Kohala Coast, Hawaii, Aug. 2015.

[Abi05] Serge Abiteboul, et al., “The Lowell Database Research Self-Assessment,” Comm. of ACM, Vol. 48, No. 5, pp. 111-118, May 2005.

[Asl11] Matt Aslett, "How Will the Database Incumbents Respond to NoSQL and NewSQL?," Technical Report, the 451 Group, Apr. 2011. (available at https://451research.com/report-short?entityId=66963)

[CRW05] Surajit Chaudhuri, Raghu Ramakrishnan, and Gerhard Weikum, “Integrating DB and IR Technologies: What is the Sound of One Hand Clapping?,” In Proc. 2nd Biennial Conf. on Innovative Data Systems Research, Asilomar, California, pp. 1-12, Jan. 2005.

[Dea04] Dean, J. and Ghemawat, S., “MapReduce: Simplified Data Processing on Large Clusters,” In Proc. 6th Symposium on Operating System Design and Implementation (OSDI), pp. 137-150, Oct. 2014.

[Fer10] Ferrucci, D. et al., "Building Watson: An Overview of the DeepQA Project,” AI Magazine, Vol.31, No. 3, pp. 59-79, July 2010.

[Flo11] Floratou, A., Patel, J. M., Shekita, E. J., and Tata, S., “Column-oriented Storage Techniques for MapReduce,” In Proc. of the VLDB Endowment, Vol. 4, No. 7, pp. 419-429, 2011.

[Goe14] Goes, P., "Design Science Research in Top Information Systems Journals,” MIS Quarterly: Management Information Systems, Vol. 38, No. 1, 2014.

[Goo18] Google Search Statistics - Internet Live Stats, www.internetlivestats.com, retrieved 2018-05-30.[Gra02] Gray, J. and Szalay, A., “The World Wide Telescope: An Archetype for Online Science,” Comm. ACM, Vo. 45, No.

11, pp. 50-54, Nov. 2002.[Kat13] Katarina Grolinger, Wilson A Higashino, Abhinav Tiwari and Miriam AM Capretz, “Data Management in Cloud

Environments: NoSQL and NewSQL Data Stores,” Journal of Cloud Computing: Advances, Systems and Applications, Vol. 2, No. 22, 2013.

[Lan01] Laney, D.,"3D Data Management: Controlling Data Volume, Velocity and Variety,” META Group Research Note, Vol. 6, No. 70, 2001.

[Len04] Lentz, A., “MySQL Storage Engine Architecture,” In MySQL Developer Articles, MySQL AB, May 2004.



[Mac13] MacKenzie, I., Meyer, C., and Noble, S., "How Retailers Can Keep up with Consumers," McKinsey&CompanyReport, Oct. 2013 (https://www.mckinsey.com/industries/retail/our-insights/how-retailers-can-keep-up-with-consumers).

[Moh92] Mohan, C. et al., “ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging.,” ACM Trans. Database Systems, Vol. 17, No. 1, pp.94-162, 1992.

[Nie10] Nielsenwire, “Nielsen Reports February 2010 U.S. Search Rankings,” Technical Report, Mar. 15, 2010 (available at http://blog.nielsen.com/nielsenwire/online_mobile/nielsen-reports-february-2010-u-s-search-rankings/).

[Sil16] Silver D. et al, "Mastering the Game of Go with Deep Neural Networks and Tree Search," Nature, Vol. 529, pp. 484-489, Jan. 2016.

[Sil17] Silver D. et al, "Mastering the Game of Go without Human Knowledge," Nature, Vol. 550, pp. 354-349, Oct. 2017.[Sto10] Stonebraker, M. et al., “MapReduce and Parallel DBMSs: Friends or Foes?,” Communications of the ACM (CACM),

pp. 64-71, Jan. 2010.[Weik07] Gerhard Weikum, “DB&IR: Both Sides Now,” In Proc. 2007 ACM SIGMOD Int’l Conf. on Management of Data, pp.

25-30, Beijing, China, June 12-14, 2007.[Wha02] Whang, K. et al., An Inverted Index Storage Structure Using Subindexes and Large Objects for Tight Coupling of

Information Retrieval with Database Management Systems, U.S. Patent No. 6,349,308, Feb. 19, 2002, Application No. 09/250,487, Feb. 15, 1999.

[Wha03] Whang, K., “Tight Coupling: A Way of Building High-Performance Application Specific Engines,” a presentation at the panel Next-Generation Web Technology and Database Issues, the 8th International Conference on Database Systems for Advanced Applications (DASFAA 2003), Kyoto, Japan, URL:http://db-www.aist-nara.ac.jp/dasfaa2003/ppt.html, Mar. 2003.

[Wha05] Whang, K., Lee, M., Lee, J., Kim, M. and Han, W., “Odysseus: a High-Performance ORDBMS Tightly-Coupled with IR Features,” In Proc. IEEE Int'l Conf. on Data Engineering, Tokyo, Japan, pp. 1104-1105, Apr. 2005. This paper received the Best Demonstrate Award.

[Wha07] Whang, K., Lee, J., Kim, M., Lee, M., Lee, K., “Odysseus: a High-Performance ORDBMS Tightly-Coupled with Spatial Database Features,” In Proc. IEEE Int'l Conf. on Data Engineering, Istanbul, Turkey, p.1493-1494, Apr. 2007.



[Wha10] Whang, K., Lee, J., Kim, M., Lee, M., Lee, K., Han, W., Kim, J., “Tightly-Coupled Spatial Database Features in the Odysseus/OpenGIS DBMS for High-Performance, GeoInformatica, Vol. 14, No. 4, pp. 425-446, 2010.

[Wha12] Whang, K. et al., “A Method for Recovering Data in a Storage System” U.S. Patent No. 8,108,356, Jan. 31, 2012, Application No. 12/208,014, Sept. 10, 2008.

[Wha13] Kyu-Young Whang, Tae-Seob Yun, Yeon-Mi Yeo, Il-Yeol Song, Hyuk-Yoon Kwon, and In-Joong Kim, “ODYS: an Approach to Building a Massively-Parallel Search Engine Using a DB-IR Tightly-Integrated Parallel DBMS for Higher-Level Functionality,” In Proc. 2013 ACM Int’l Conf. on Management of Data (SIGMOD), pp. 313-324, June 2013.

[Wha15] Whang, K., Lee, J., Lee, M., Han, W., Kim, M., Kim, J., “DB-IR Integration Using Tight-Coupling in the Odysseus DBMS, World Wide Web, Vol. 18, No. 3, pp.491-520, 2015.

[Wha18] Whang, K., Yun, T., Park, J., Cho, K., Kim, S., Yi, I., Na, I., and Lee, B., Building Social Networking ServicesSystems Using the Relational Shared-Nothing Parallel DBMS, Tech. Report CS-TR-2018-419, School of Computing, KAIST, August 2018.

[Wik18a] Wikipedia, “Big Data,” https://en.wikipedia.org/wiki/Big_data[Wik18b] Wikipedia, “Moore’s Law,” https://en.wikipedia.org/wiki/Moore%27s_law[Wik18c] Wikipedia, “MapReduce,” https://en.wikipedia.org/wiki/MapReduce[Wik18d] Wikipedia, “NewSQL,” https://en.wikipedia.org/wiki/NewSQL[Wik18e] Wikipedia, “Automated Personal Assistant,” https://en.wikipedia.org/wiki/Automated_personal_assistant[Wik18f] Wikipedia, “AlphaGO,” https://en.wikipedia.org/wiki/AlphaGo[Wik18g] Wikipedia, “Data Science,” https://en.wikipedia.org/wiki/Data Science[Wik18h] Wikipedia, “Higgs Boson,” https://en.wikipedia.org/wiki/Higgs_boson [Wik18i] Wikipedia, “Watson (computer),” https://en.wikipedia.org/wiki/Watson_(computer)[Dee18] Deep Mind, “AlphaGo Zero: Learning from Scratch,” https://deepmind.com/blog/alphago-zero-learning-

scratch/



Thanks!


ER 2018 keynote (rev8)-2(공개버전) 2018 keynote...Iterative processing Enhancements: Spark, Flink, … Stream processing Spark Streaming, Flink, … Graph processing Giraph, Turi,

Documents