HADOOP & ZING
PRESENTER: HUNGVV
W: http://me.zing.vn/hung.vo
E: [email protected]
2011-08
AGENDA
1. Introduction to Hadoop, Hive
2. Using Hadoop in Zing
3. A case study: Log Collecting, Analyzing & Reporting System
Conclusion
Hadoop & Zing
What
- A framework for large-scale data processing
- Inspired by Google's architecture: MapReduce and GFS
- A top-level Apache project; Hadoop is open source
Why
- Fault-tolerant hardware is expensive
- Hadoop is designed to run on cheap commodity hardware
- It automatically handles data replication and node failure
- It does the hard work, so you can focus on processing data
Data Flow into Hadoop
(diagram) Web Servers → Scribe MidTier → Network Storage and Servers → Hadoop Hive Warehouse → MySQL
Hive – Data Warehouse
A system for managing and querying structured data, built on top of Hadoop
- MapReduce for execution
- HDFS for storage
- Metadata in an RDBMS
Key building principles:
- SQL as a familiar data warehousing tool
- Extensibility: types, functions, formats, scripts
- Scalability and performance
- An efficient SQL-to-MapReduce compiler
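As an illustration of the SQL-to-MapReduce idea, here is a minimal, hypothetical HiveQL query against the stats_login table used later in these slides; Hive compiles it into a MapReduce job instead of requiring hand-written map/reduce code:

-- Hypothetical one-liner: the map phase emits (app, 1), the reduce phase sums the counts.
SELECT app, COUNT(1) AS num_logins
FROM stats_login
WHERE dt = '2011-08-01'
GROUP BY app;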
Hive Architecture
(architecture diagram) Components:
- Interfaces: Web UI + Hive CLI + JDBC/ODBC (browse, query, DDL)
- HiveQL: Parser, Planner, Optimizer, Execution
- SerDe: CSV, Thrift, Regex
- UDF/UDAF: substr, sum, average
- File formats: TextFile, SequenceFile, RCFile
- User-defined map-reduce scripts
- Runs on Map Reduce and HDFS
Hive DDL
DDL
- Complex columns
- Partitions
- Buckets (see the sketch after the example below)
Example
CREATE TABLE stats_active_daily(
  username STRING,
  userid INT,
  last_login INT,
  num_login INT,
  num_longsession INT)
PARTITIONED BY(dt STRING, app STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
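The example above shows partitioning only; since buckets are also listed, here is a hedged sketch of a bucketed table. The table name, column choice, and bucket count are assumptions, not part of the original deck:

-- Hypothetical bucketed table: rows are hashed on userid into 32 buckets,
-- which helps with sampling and map-side joins.
CREATE TABLE stats_user_bucketed(
  username STRING,
  userid INT,
  num_login INT)
PARTITIONED BY(dt STRING)
CLUSTERED BY(userid) INTO 32 BUCKETS
STORED AS TEXTFILE;

-- When populating the table, bucketing is enforced with:
-- SET hive.enforce.bucketing = true;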
Hive DML
Data loading
LOAD DATA LOCAL INPATH '/data/scribe_log/STATSLOGIN/STATSLOGIN-$YESTERDAY*'
OVERWRITE INTO TABLE stats_login PARTITION(dt='$YESTERDAY', app='${APP}');

Insert data into Hive tables
INSERT OVERWRITE TABLE stats_active_daily PARTITION (dt='$YESTERDAY', app='${APP}')
SELECT username, userid, MAX(login_time), COUNT(1), SUM(IF(login_type=3,1,0))
FROM stats_login
WHERE dt='$YESTERDAY' AND app='${APP}'
GROUP BY username, userid;
Hive Query Language
SQL
- Where
- Group By
- Equi-Join
- Sub-query in the "From" clause
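A hedged sketch combining an equi-join with a sub-query in the FROM clause, using tables that appear elsewhere in these slides; the username join key on user_information is an assumption:

-- Hypothetical: join user profiles with yesterday's activity summary.
SELECT u.genderid, COUNT(1) AS num_users, AVG(a.num_login) AS avg_logins
FROM user_information u
JOIN (SELECT username, num_login
      FROM stats_active_daily
      WHERE dt='$YESTERDAY' AND app='${APP}') a
ON (u.username = a.username)
GROUP BY u.genderid;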
Multi-table Group-By/InsertMulti-table Group-By/Insert
FROM user_information
INSERT OVERWRITE TABLE log_user_gender PARTITION (dt='$YESTERDAY') SELECT '$YESTERDAY', genderid, COUNT(1) GROUP BY genderid
INSERT OVERWRITE TABLE log_user_age PARTITION (dt='$YESTERDAY') SELECT '$YESTERDAY', YEAR(dob), COUNT(1) GROUP BY YEAR(dob)
INSERT OVERWRITE TABLE log_user_education PARTITION (dt='$YESTERDAY') SELECT '$YESTERDAY', educationid, COUNT(1) GROUP BY educationid
INSERT OVERWRITE TABLE log_user_job PARTITION (dt='$YESTERDAY') SELECT '$YESTERDAY', jobid, COUNT(1) GROUP BY jobid
File Formats
TextFile:
- Easy for other applications to write/read
- Gzip-compressed text files are not splittable
SequenceFile: http://wiki.apache.org/hadoop/SequenceFile
- Only Hadoop can read it
- Supports splittable compression
RCFile: block-based columnar storage (https://issues.apache.org/jira/browse/HIVE-352)
- Uses the SequenceFile block format
- Columnar storage inside a block
- 25% smaller compressed size
- On-par or better query performance, depending on the query
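Choosing a format is a DDL-level decision. A hedged sketch, reusing the daily-stats table from earlier under a hypothetical name:

-- Hypothetical: the same summary table stored as RCFile instead of TextFile.
CREATE TABLE stats_active_daily_rc(
  username STRING,
  userid INT,
  num_login INT,
  num_longsession INT)
PARTITIONED BY(dt STRING, app STRING)
STORED AS RCFILE;
-- SequenceFile is selected the same way: STORED AS SEQUENCEFILE;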
SerDe
Serialization/Deserialization (row format)
- CSV (LazySimpleSerDe)
- Thrift (ThriftSerDe)
- Regex (RegexSerDe)
- Hive binary format (LazyBinarySerDe)
LazySimpleSerDe and LazyBinarySerDe:
- Deserialize a field only when needed
- Reuse objects across different rows
- Text and binary formats
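A hedged sketch of a table that uses RegexSerDe to parse raw log lines; the class name is the hive-contrib version shipped around this time, and the table, columns, and regex are assumptions:

-- May require: ADD JAR .../hive-contrib-<version>.jar; (path is hypothetical)
CREATE TABLE raw_request_log(
  client_ip STRING,
  request_uri STRING,
  execution_time STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\\S+) (\\S+) (\\S+)"
)
STORED AS TEXTFILE;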
UDF/UDAF
Features:
- Use either Java or Hadoop objects (int, Integer, IntWritable)
- Overloading
- Variable-length arguments
- Partial aggregation for UDAF
Example UDF:
import org.apache.hadoop.hive.ql.exec.UDF;

public class UDFExampleAdd extends UDF {
  public int evaluate(int a, int b) {
    return a + b;
  }
}
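A hedged sketch of registering and calling the UDF above from Hive; the jar path and function alias are hypothetical:

-- Hypothetical jar path; UDFExampleAdd is the class defined above.
ADD JAR /path/to/udf_example.jar;
CREATE TEMPORARY FUNCTION example_add AS 'UDFExampleAdd';

SELECT example_add(num_login, num_longsession)
FROM stats_active_daily
WHERE dt='$YESTERDAY' AND app='${APP}';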
What do we use Hadoop for?
- Storing Zing Me core log data
- Storing Zing Me Game/App log data
- Storing backup data
- Processing/analyzing data with Hive
- Storing social data (feeds, comments, voting, chat messages, …) with HBase
Data Usage
Statistics per day:
- ~300 GB of new data added
- ~800 GB of data scanned
- ~10,000 Hive jobs
Where is the data stored?
Hadoop/Hive warehouse:
- 90 TB of data
- 20 nodes, 16 cores per node
- 16 TB per node
- Replication = 2
Log Collecting, Analyzing & Reporting
Needs:
- A simple, high-performance framework for log collection
- Central, highly available & scalable storage
- An easy-to-use tool for data analysis (schema-based, SQL-like queries, …)
- A robust framework for developing reports
Version 1 (RDBMS-style)
- Log data goes directly into a MySQL database (master)
- Data is transformed into another MySQL database (off-load)
- Statistics queries run and export their results into other MySQL tables
- Performance problems:
  - slow log inserts, concurrent inserts
  - slow query time on large datasets
Log Collecting, Analyzing & Reporting
Version 2 (Scribe, Hadoop & Hive)
- Fast logging
- Acceptable query time on large datasets
- Data replication
- Distributed calculation
Log Collecting, Analyzing & Reporting
Components:
- Log Collector
- Log/Data Transformer
- Data Analyzer
- Web Reporter
Process:
- Define the logs
- Integrate logging (into the application)
- Analyze the logs/data
- Develop the reports
Log Collecting, Analyzing & Reporting
Log Collector: Scribe
- A server for aggregating streaming log data
- Designed to scale to a very large number of nodes and to be robust to network and node failures
- Hierarchical stores
- A Thrift service using the non-blocking C++ server
- Thrift clients in C/C++, Java, PHP, …
Log Collecting, Analyzing & Reporting
Log formats (common)
Application-action log:
  server_ip server_domain client_ip username actionid createdtime appdata execution_time
Request log:
  server_ip request_domain request_uri request_time execution_time memory client_ip username application
Game action log:
  time username actionid gameid goldgain coingain expgain itemtype itemid userid_affect appdata
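A hedged sketch of a Hive table matching the application-action log above, following the tab-delimited, partitioned convention used earlier in the deck; the table name and column types are assumptions:

-- Hypothetical table for the application-action log; field order follows
-- the format above, types are assumed.
CREATE TABLE log_app_action(
  server_ip STRING,
  server_domain STRING,
  client_ip STRING,
  username STRING,
  actionid INT,
  createdtime INT,
  appdata STRING,
  execution_time DOUBLE)
PARTITIONED BY(dt STRING, app STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;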
Log Collecting, Analyzing & Reporting
Scribe – file store
port=1463
max_msg_per_second=2000000
max_queue_size=10000000
new_thread_per_category=yes
num_thrift_server_threads=10
check_interval=3

# DEFAULT - write all other categories to /data/scribe_log
<store>
  category=default
  type=file
  file_path=/data/scribe_log
  base_filename=default_log
  max_size=8000000000
  add_newlines=1
  rotate_period=hourly
  #rotate_hour=0
  rotate_minute=1
</store>
Log Collecting, Analyzing & Reporting
Scribe – buffer store
<store>
  category=default
  type=buffer
  target_write_size=20480
  max_write_interval=1
  buffer_send_rate=1
  retry_interval=30
  retry_interval_range=10
  <primary>
    type=network
    remote_host=xxx.yyy.zzz.ttt
    remote_port=1463
  </primary>
  <secondary>
    type=file
    fs_type=std
    file_path=/tmp
    base_filename=zmlog_backup
    max_size=30000000
  </secondary>
</store>
Log Collecting, Analyzing & Reporting
Log/Data Transformer
- Helps import data from multiple types of sources into Hive
- Semi-automated
- Log files to Hive: LOAD DATA LOCAL INPATH … OVERWRITE INTO TABLE …
- MySQL data to Hive: extract with SELECT … INTO OUTFILE …, then import using LOAD DATA
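A hedged sketch of the MySQL-to-Hive path; the MySQL source table, output file, and column list are assumptions, while user_information is the Hive table named earlier in the deck:

-- Hypothetical export from MySQL to a tab-delimited file...
SELECT userid, username, genderid, dob, educationid, jobid
INTO OUTFILE '/tmp/user_information.tsv'
  FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
FROM users;

-- ...then load the flat file into the matching Hive table.
LOAD DATA LOCAL INPATH '/tmp/user_information.tsv'
OVERWRITE INTO TABLE user_information;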
Log Collecting, Analyzing & Reporting
Data Analyzer
- Calculations using the Hive query language (HQL): SQL-like
- Data partitioning and query optimization are very important for speed:
  - distributed data reading
  - optimize queries for one-pass data reading
- Automation: hive --service cli -f hql_file, Bash shell, crontab
- Export data and import it into MySQL for the web report:
  - export with the Hadoop command line: hadoop fs -cat
  - import using LOAD DATA LOCAL INFILE … INTO TABLE …
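A hedged sketch of that export/import step; the warehouse path, flat-file path, and MySQL report table are hypothetical, and the field terminator depends on the Hive table's row format:

-- Export first with the Hadoop CLI, e.g. (hypothetical path):
--   hadoop fs -cat /user/hive/warehouse/log_user_gender/dt=$YESTERDAY/* > /tmp/log_user_gender.tsv
-- Then import the flat file into MySQL for the web reporter:
LOAD DATA LOCAL INFILE '/tmp/log_user_gender.tsv'
INTO TABLE report_user_gender
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';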
Log Collecting, Analyzing & Reporting
Web Reporter
- PHP web application
- Modular
- Standard format and templates
- jpgraph
Log Collecting, Analyzing & Reporting
Applications
- Summarization:
  - user/app indicators: active, churn rate, login, return…
  - user demographics: age, gender, education, job, location…
  - user interactions / app actions
- Data mining
- Spam detection
- Application performance
- Ad-hoc analysis
- …
Log Collecting, Analyzing & Reporting
THANK YOU!