Big Data BI – Apache Hive

● „open-source data warehouse solution built on top of Hadoop”
● „files are insufficient data abstractions”
● „SQL is highly popular”
● „need for an open data format”
● „as a familiar data warehousing tool”
● Java, extensible, interoperable
● data warehousing tool → no OLTP, no low-latency by default

bigdata bi 2012.04.27. Sidló Csaba
Basics
● data model: tables ← partitions ← buckets
● relational
● primitive data types; collections: array, map; user-defined types
● HiveQL: SQL-like query language
● + DDL, DML
● user-defined functions: transformation, aggregation
● custom Map-Reduce scripts (any language, streaming interface)
● interfaces: command line, JDBC, ODBC, web interface
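A minimal HiveQL sketch of the points above: collection types in the data model, and a custom script plugged in through the streaming (TRANSFORM) interface. Table name and columns are hypothetical; /bin/cat stands in for a real user script.

```sql
-- Hypothetical table using Hive's collection types
CREATE TABLE user_events(
  userid BIGINT,
  tags   ARRAY<STRING>,
  props  MAP<STRING, STRING>
)
PARTITIONED BY (dt STRING);

-- Accessing collection elements in HiveQL
SELECT userid, tags[0], props['country']
FROM user_events
WHERE dt = '2012-04-27';

-- Custom script via the streaming interface; /bin/cat is a
-- trivial identity "script" here, any executable would do
SELECT TRANSFORM(userid, tags[0])
USING '/bin/cat'
AS (userid STRING, first_tag STRING)
FROM user_events;
```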
Hive history
● Dec 2004: Google MapReduce paper
● 2008: started at Facebook; after refactoring: Hadoop subproject
● Sep 2008: Hadoop subproject
● May 2009: release 0.3.0
● Aug 2009: Facebook VLDB demo
● Sep 2010: Hive, Pig: top-level Apache projects
● 2011: release 0.8.1, lots of activity, e.g. NYC Hive Meetup
– 4800 cores, 600 machines, 16 GB per machine – April 2009
– 8000 cores, 1000 machines, 32 GB per machine – July 2009
– 4 SATA disks of 1 TB each per machine
– 2-level network hierarchy, 40 machines per rack
– total cluster size is 2 PB, projected to be 12 PB in Q3 2009
● ~autumn 2010, Balázs: Hive mostly works, but slow (around 0.6?)
● now:
  ● simple install, even standalone
  ● simple CLI; web GUI
  ● performance (joins especially): ?
  ● compatibility: Hive 0.8.1 tested against Hadoop 0.20.x
HQL DDL
● browsing: show tables; show partitions page_view; describe (extended) page_view;
● definition:
CREATE TABLE page_view(
    viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '1'
    COLLECTION ITEMS TERMINATED BY '2'
    MAP KEYS TERMINATED BY '3'
STORED AS SEQUENCEFILE;
● alter table ...
● views
● external tables: already existing HDFS files
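A minimal sketch of an external table over pre-existing HDFS files; the path and schema below are hypothetical:

```sql
-- EXTERNAL: Hive does not take ownership of the data;
-- DROP TABLE removes only the metadata, not the files.
CREATE EXTERNAL TABLE raw_logs(
  ts     STRING,
  userid BIGINT,
  url    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs/raw';  -- hypothetical, pre-existing HDFS directory
```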
HQL DML
● no row-level update or delete
● deletion: drop table, drop partition; insert overwrite
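Since rows cannot be changed individually, updates and deletes are expressed at table or partition granularity; a sketch using the page_view table from the DDL slide (the staging table is hypothetical):

```sql
-- "Delete" a day of data by dropping its partition
ALTER TABLE page_view DROP PARTITION (dt='2008-08-14', country='US');

-- "Update" by rewriting a whole partition from a query
INSERT OVERWRITE TABLE page_view PARTITION (dt='2008-08-15', country='US')
SELECT viewTime, userid, page_url, referrer_url, ip
FROM page_view_staging;  -- hypothetical staging table
```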
● multi-table insert, insert from queries, insert into files, load files to tables

LOAD DATA LOCAL INPATH './examples/files/kv2.txt'
OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');

FROM users
INSERT INTO TABLE pv_gender_sum
    SELECT gender, count(DISTINCT userid) GROUP BY gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
    SELECT age, count(DISTINCT userid) GROUP BY age
INSERT OVERWRITE LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
    SELECT country, gender, count(DISTINCT userid) GROUP BY country, gender;
HQL select: JOIN
● ANSI equi-join
● data skew → different plans
● normal (reduce-side) join: a reducer receives all records for a given key
● map-side join:
  – mapper loads a small table + a portion of the big table
  – does the join
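The map-side join can be requested with a query hint; table names and aliases below are hypothetical (newer Hive versions can also choose this plan automatically):

```sql
-- /*+ MAPJOIN(d) */ asks Hive to load the small table (alias d)
-- into each mapper's memory and join without a reduce phase
SELECT /*+ MAPJOIN(d) */ f.userid, d.country_name
FROM fact_page_view f
JOIN dim_country d ON (f.country = d.country_code);
```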