BI Seminar project: High-level languages for Big Data Analytics
Janani Chakkaradhari, Jose Luis Lopez Pino

High-level languages for Big Data Analytics (Presentation)

May 21, 2015



Presentation for the course 'Business Intelligence Seminar' of the IT4BI Erasmus Mundus Master's Programme
Transcript
Page 1: High-level languages for Big Data Analytics (Presentation)

BI Seminar project: High-level languages for Big Data Analytics
Janani Chakkaradhari, Jose Luis Lopez Pino

Page 2: High-level languages for Big Data Analytics (Presentation)

Outline
1. Introduction
1.1 The MapReduce programming model
1.2 Hadoop
2. High-Level Languages
2.1 Pig Latin
2.2 Hive
2.3 JAQL
2.4 Other Languages
3. Comparison of HLLs
3.1 Syntax Comparison
3.2 Performance
3.3 Query Compilation
3.4 JOIN Implementation
4. Future Work
4.1 Machine Learning
4.2 Interactive Queries
5. Conclusion

Page 3: High-level languages for Big Data Analytics (Presentation)

Introduction

Page 4: High-level languages for Big Data Analytics (Presentation)

The MapReduce model
• Introduced in 2004 by Google.
• This model allows programmers without any experience in parallel coding to write highly scalable programs and hence process voluminous data sets.
• This high level of scalability is reached thanks to the decomposition of the problem into a large number of tasks.
• The Map function takes a single key/value pair as input and produces a set of key/value pairs.
• The Reduce function takes a key and the set of values related to that key as input; it may also produce a set of values, but commonly it emits only one or zero values as output.

Page 5: High-level languages for Big Data Analytics (Presentation)

The MapReduce model
• Advantages:
• Scalability.
• Handles failures and balances the system.
• Pitfalls:
• Some tasks are complicated to code.
• Some tasks are very expensive.
• Difficult to debug the code.
• Absence of schemas and indexes.
• A lot of bandwidth might be consumed.

Page 6: High-level languages for Big Data Analytics (Presentation)

Hadoop
• An Apache Software Foundation open-source project.
• Hadoop = HDFS + MapReduce.
• DFS: partitions data and stores it on separate machines.
• HDFS: stores large files on clusters of commodity hardware, typically in 64 MB blocks.
• The file system and MapReduce are co-designed.

Page 7: High-level languages for Big Data Analytics (Presentation)

Hadoop
• No separate storage network and processing network.
• The computation is moved to the data nodes.

Page 8: High-level languages for Big Data Analytics (Presentation)

High level languages

Page 9: High-level languages for Big Data Analytics (Presentation)

High level languages
• Two different types:
• Languages created specifically for this model.
• Already existing languages.
• Languages covered in the comparison:
• Pig Latin
• HiveQL
• Jaql
• Other interesting languages:
• Meteor
• DryadLINQ

Page 10: High-level languages for Big Data Analytics (Presentation)

Pig Latin
• Executed over Hadoop.
• Procedural language.
• High-level operations similar to those found in SQL.
• Some interesting operators (see the sketch below):
• FOREACH processes a transformation over every tuple of the set. To make it possible to parallelise this operation, the transformation of one row must not depend on any other row.
• COGROUP groups related tuples of multiple datasets. It is similar to the first step of a join.
• LOAD loads the input data and describes its structure; STORE saves data to a file.
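A minimal Pig Latin sketch of these operators; the file names and fields (visits.txt, pages.txt, user, url, pagerank) are hypothetical:

-- Load two hypothetical datasets and declare their structure
visits = LOAD 'visits.txt' AS (user:chararray, url:chararray, time:long);
pages  = LOAD 'pages.txt'  AS (url:chararray, pagerank:double);
-- FOREACH transforms every tuple independently, so the work can be parallelised
clean  = FOREACH visits GENERATE user, LOWER(url) AS url;
-- COGROUP groups related tuples of both datasets by url (the first step of a join)
grouped = COGROUP clean BY url, pages BY url;
-- STORE saves the result to a file
STORE grouped INTO 'output';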

Page 11: High-level languages for Big Data Analytics (Presentation)

Pig Latin
• Goal: to reduce development time.
• Nested data model.
• User-defined functions (see the sketch below).
• Analytic queries over text files (no need to load the data first).
• Procedural language -> control over the execution plan:
• The user can speed up performance.
• It makes the work of the query optimiser easier.
• Unlike SQL.
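A short sketch of the nested data model and a user-defined function; the jar, the UDF name myudfs.TopQueries and the input file are hypothetical:

REGISTER myudfs.jar;  -- hypothetical jar containing the UDF
queries = LOAD 'queries.txt' AS (user:chararray, query:chararray);
-- GROUP produces a nested result: one bag of query tuples per user
by_user = GROUP queries BY user;
-- Apply the hypothetical UDF to each nested bag of queries
top     = FOREACH by_user GENERATE group AS user, myudfs.TopQueries(queries.query);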

Page 12: High-level languages for Big Data Analytics (Presentation)

HiveQL
• Open-source data warehouse solution built on top of Hadoop.
• The queries look similar to SQL and also add extensions to it.
• Complex column types: map, array and struct are supported as data types.
• It stores the metadata in an RDBMS.

Page 13: High-level languages for Big Data Analytics (Presentation)

HiveQL

• The Metastore acts as the system catalog for Hive.
• It stores all the information about the tables: their partitions, the schema, etc.
• Without the system catalog it is not possible to impose a structure on Hadoop files.
• Facebook uses MySQL to store this metadata, since this information has to be served fast to the compiler.

Page 14: High-level languages for Big Data Analytics (Presentation)

JAQL
• What is Jaql?
• Declarative scripting programming language.
• Used over Hadoop's MapReduce framework.
• Included in IBM's InfoSphere BigInsights and Cognos Consumer Insight products.
• Developed after Pig and Hive: more scalable, more flexible, more reusable.
• Data model:
• Simple: similar to JSON. Values are trees, there are no references, and the textual representation is very similar.
• Flexible: it handles semistructured documents, but also structured records validated against a schema.

Page 15: High-level languages for Big Data Analytics (Presentation)

JAQL
• Control over the evaluation plan.
• The programmer can work at different levels of abstraction using Jaql's syntax:
• Full definition of the execution plan.
• Use of hints to indicate some evaluation features to the optimizer. This feature is present in most of the database engines that use SQL as query language.
• Declarative programming, without any control over the flow.

Page 16: High-level languages for Big Data Analytics (Presentation)

Other languages: Meteor
• Part of the Stratosphere stack.
• PACT:
• Programming model.
• It extends MapReduce with new second-order functions:
• Cross: Cartesian product.
• CoGroup: groups all the records with the same key and processes them together.
• Match: similar to CoGroup, but pairs with the same key can be processed separately.
• Sopremo:
• Semantically rich operator model.
• Extensible.
• Meteor: the query language.
• Optimization: Meteor code -> logical plan using Sopremo operators (optimized) -> final PACT program (physically optimized).

Page 17: High-level languages for Big Data Analytics (Presentation)

Other languages: DryadLINQ
• Code embedded in .NET programming languages.
• Operators:
• Almost all the operators available in LINQ.
• Some specific operators for parallel programming.
• Developers can include their own implementations.
• DryadLINQ code is translated to a Dryad plan.
• Optimization:
• Pipeline operations.
• Remove redundancy.
• Push down aggregations.
• Reduce network traffic.

Page 18: High-level languages for Big Data Analytics (Presentation)

Comparison of HLLs

Page 19: High-level languages for Big Data Analytics (Presentation)

Comparing HLLs

• Different design motivations; developers' preferences:
• Writing concise code -> expressiveness.
• Efficiency -> performance.
• Criteria that impact performance:
• Join implementation.
• Query processing.
• Some other criteria are not included:
• Language paradigm.
• Scalability.

Page 20: High-level languages for Big Data Analytics (Presentation)

Comparison Criteria

• Expressive power
• Performance
• Query Compilation
• JOIN Implementation

Page 21: High-level languages for Big Data Analytics (Presentation)

Expressive power

• Three categories by Robert Stewart:
• Relational complete
• SQL equivalent (aggregate functions)
• Turing complete:
• Conditional branching
• Indefinite iterations by means of recursion
• Emulation of an infinite memory model

Page 24: High-level languages for Big Data Analytics (Presentation)

Expressive power

Page 25: High-level languages for Big Data Analytics (Presentation)

Expressive power

• But this does not mean that SQL, Pig Latin and HiveQL are the same.
• HiveQL:
• Is inspired by SQL, but it does not support the full repertoire included in the SQL-92 specification.
• Includes features notably inspired by MySQL and MapReduce that are not part of SQL.
• Pig Latin:
• It is not inspired by SQL.
• For instance, it does not have the OVER clause.

Page 26: High-level languages for Big Data Analytics (Presentation)

SQL vs. HiveQL (2009)

Feature                            | SQL                               | HiveQL
Transactions                       | Yes                               | No
Indexes                            | Yes                               | No
Create table as select             | Not SQL-92                        | Yes
Subqueries                         | In any clause, correlated or not  | Only in FROM clause, only noncorrelated
Views                              | Yes                               | Not materialized
Extension with map/reduce scripts  | No                                | Yes

Page 27: High-level languages for Big Data Analytics (Presentation)

Query Processing

Page 28: High-level languages for Big Data Analytics (Presentation)

Query Processing

• In order to make a good comparison, we should have basic knowledge of how these HLQLs work.
• How is the abstract user representation of the query or the script converted into MapReduce jobs?

Page 29: High-level languages for Big Data Analytics (Presentation)

Query Processing – Pig Latin
• The goal is to translate the Pig Latin script into equivalent MapReduce jobs that can be executed in the Hadoop environment (see the sketch below).
• The parser first checks for syntactic errors.
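As a rough illustration of this compilation, Pig's EXPLAIN operator prints the plans produced for a script; the script below is hypothetical:

visits  = LOAD 'visits.txt' AS (user:chararray, url:chararray);
grouped = GROUP visits BY url;
counts  = FOREACH grouped GENERATE group AS url, COUNT(visits) AS n;
-- EXPLAIN shows the logical, physical and MapReduce plans for the alias
EXPLAIN counts;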

Page 30: High-level languages for Big Data Analytics (Presentation)

Query Processing – Pig Latin

Page 31: High-level languages for Big Data Analytics (Presentation)

Query Processing – Pig Latin

Page 32: High-level languages for Big Data Analytics (Presentation)

Query Processing - Hive

• Hive gets the HiveQL query string from the client.
• The parser phase converts it into a parse tree representation.
• The logical query plan generator converts it into a logical query representation; it prunes the columns early and pushes the predicates closer to the tables.
• The logical plan is converted into a physical plan and then into MapReduce jobs.

Page 33: High-level languages for Big Data Analytics (Presentation)

Query Processing - JAQL

• JAQL includes two higher-order functions, mapReduceFn and mapAggregate.
• The rewriter engine generates calls to mapReduceFn or mapAggregate.

Page 34: High-level languages for Big Data Analytics (Presentation)

QP - Summary

• Each of these languages has its own methods.
• All support syntax checking, usually done by the compiler.
• Pig currently misses out on optimized storage structures like indexes and column groups.
• HiveQL provides more optimizations:
• It prunes the buckets that are not needed.
• Predicate push-down.
• Query rewriting (e.g. projection push-down) is future work for JAQL.

Page 35: High-level languages for Big Data Analytics (Presentation)

JOIN Implementation

Page 36: High-level languages for Big Data Analytics (Presentation)

JOIN in Pig Latin

• Pig Latin supports inner joins, equijoins and outer joins. The JOIN operator always performs an inner join.
• A join can also be achieved by a COGROUP operation followed by FLATTEN (see the sketch below).
• JOIN creates a flat set of output records, while COGROUP creates a nested set of output records.
• GROUP is used when only one relation is involved; COGROUP when multiple relations are involved.
• FLATTEN turns (a, {(b,c), (d,e)}) into (a, b, c) and (a, d, e).
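A minimal Pig Latin sketch of the equivalence, with hypothetical relations A and B:

A = LOAD 'a.txt' AS (id:int, x:chararray);   -- hypothetical inputs
B = LOAD 'b.txt' AS (id:int, y:chararray);
-- Direct inner equijoin
J1 = JOIN A BY id, B BY id;
-- The same result obtained with COGROUP followed by FLATTEN
G  = COGROUP A BY id, B BY id;
J2 = FOREACH G GENERATE FLATTEN(A), FLATTEN(B);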

Page 37: High-level languages for Big Data Analytics (Presentation)

JOIN in Pig Latin

• Fragment-replicate joins:
• Trivial case, only possible if one of the two relations is small enough to fit into memory.
• The JOIN is done in the map phase.
• Skewed joins:
• For unevenly distributed data.
• Basically computes a histogram of the key space and uses this data to allocate reducers for a given key.
• The JOIN is done in the reduce phase.
• Merge joins:
• Only possible if the relations are already sorted.

Page 38: High-level languages for Big Data Analytics (Presentation)

JOIN in Pig Latin
• The choice of join strategy can be specified by the user, as in the sketch below.
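A minimal Pig Latin sketch of how a strategy is requested through the USING clause; the relations (big, small, other, sorted1, sorted2) are hypothetical and assumed to be already loaded:

-- Fragment-replicate join: the small relation is replicated to every mapper
FR = JOIN big BY id, small BY id USING 'replicated';
-- Skewed join: a histogram of the key space drives reducer allocation
SK = JOIN big BY id, other BY id USING 'skewed';
-- Merge join: both inputs must already be sorted on the join key
MG = JOIN sorted1 BY id, sorted2 BY id USING 'merge';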

Page 39: High-level languages for Big Data Analytics (Presentation)

JOIN in Hive

• Normal map-reduce join:
• The mapper sends all rows with the same key to a single reducer.
• The reducer does the join.
• SELECT t1.a1 AS c1, t2.b1 AS c2 FROM t1 JOIN t2 ON (t1.a2 = t2.b2);
• Map-side joins:
• Small tables are replicated in all the mappers and joined with the other tables.

Page 40: High-level languages for Big Data Analytics (Presentation)

JOIN in JAQL

• Currently JAQL supports equijoins.
• The join expression supports an equijoin of 2 or more inputs. All of the options for inner and outer joins are also supported.

joinedRefs = join w in wroteAbout, p in products
             where w.product == p.name
             into { w.author, p.* };

Page 41: High-level languages for Big Data Analytics (Presentation)

JOIN - Summary

• Both Pig and Hive have the possibility to perform the join in the map phase instead of the reduce phase.
• For a skewed distribution of data, the join performance of JAQL is not comparable to that of the other two languages.

Page 42: High-level languages for Big Data Analytics (Presentation)

Performance

Page 43: High-level languages for Big Data Analytics (Presentation)

Benchmarks

• PigMix is a set of queries to test performance; the set checks scalability and latency.
• Hive's benchmark is mainly based on the queries specified by Pavlo et al. (a selection task, an aggregation task and a join task).
• There are also a Pig Latin implementation and a HiveQL implementation of the TPC-H queries.

Page 44: High-level languages for Big Data Analytics (Presentation)

Performance - Summary

• The paper describes scale-up, scale-out and runtime.
• For skewed data, Pig and Hive seem to be more effective than the JAQL runtime.
• Pig and Hive are better at exploiting an increase in cluster size than JAQL.
• Pig and Hive allow the user to explicitly specify the number of reducer tasks; this feature has a significant influence on performance.

Page 45: High-level languages for Big Data Analytics (Presentation)

Future work

Page 46: High-level languages for Big Data Analytics (Presentation)

Machine Learning

• Example question: what page will the visitor visit next?
• Twitter has extended Pig's support for ML by placing learning algorithms in Pig storage functions.
• In Hive, machine learning is treated as UDAFs (user-defined aggregate functions).
• A new data analytics platform, Ricardo, has been proposed that combines the functionalities of R and Jaql.

Page 47: High-level languages for Big Data Analytics (Presentation)

Interactive queries
• One of the main problems of MapReduce and of all the languages built on top of this framework (Pig, Hive, etc.) is latency.
• As a complement to those technologies, new frameworks that allow programmers to query large datasets in an interactive manner have been developed:
• Dremel by Google.
• The open-source project Apache Drill.
• How do they reduce the latency?
• Store the information as nested columns.
• Query execution based on a multi-level tree architecture.
• Balance the load by means of a query dispatcher.
• Not too many details of the query language are available:
• It is based on SQL.
• It includes the usual operations (selection, projection, etc.).
• SQL-like language features: user-defined functions or nested subqueries.
• The characteristic that distinguishes this language is that it operates with nested tables as inputs and outputs.

Page 48: High-level languages for Big Data Analytics (Presentation)

Conclusions
• The MapReduce programming model has big pitfalls.
• Each programming language tries to solve some of these disadvantages in a different way.
• No single language beats all the other options.
• Comparison:
• Jaql is expressively more powerful.
• JAQL is at a lower level of performance compared to Hive and Pig.
• HiveQL and Pig Latin support map-phase JOINs.
• HiveQL uses more advanced optimization techniques for query processing.
• New technologies to solve those problems:
• Languages: Dremel and Apache Drill.
• Libraries: Mahout.

Page 49: High-level languages for Big Data Analytics (Presentation)

Thank you very much!