Flume-based Independent News Aggregator

Flume-based news aggregator service on Amazon EC2

Arinto Murdopo, Mário Almeida, Zafar Gilani

SDS, EMDC 2012

Outline

● Introduction
  ○ Cloudera Manager CDH3
  ○ Cloudera Flume
  ○ Hadoop Distributed File System
● Infrastructure setup
● Architecture
● News recommendation
● RSS News aggregator
● Proof of concept
● Issues faced
● Future work
● Conclusions
● References

Introduction

● A flume-based independent news aggregator service.

● Using:
  ○ Amazon EC2 IaaS
  ○ Cloudera Manager CDH3
  ○ Cloudera Flume
  ○ Hadoop Distributed File System

Cloudera Manager CDH3

● Automates the installation and configuration process of CDH3 on an entire cluster.

● We used the free edition (up to 50 nodes).

Cloudera Flume

● A distributed, reliable and available system.
● To efficiently collect, aggregate and move large amounts of log data.
● From many different sources to a centralized or distributed data store (such as Hadoop HDFS).

Hadoop HDFS (1/2)

● For our purposes, Hadoop handles:
  ○ Log receipt and storage.
  ○ Search and log processing.
● It coordinates work among a cluster of machines.

Hadoop HDFS (2/2)

Infrastructure setup

● 2 Agent nodes collecting data:
  ○ Source: RSS feed
  ○ Sink: Collector
● 1 Agent node (Collector):
  ○ Source: Agents
  ○ Sink: HDFS
● HDFS NameNode:
  ○ Replicates data to DataNodes 1, 2 and 3.
● Cloudera Manager CDH3 node:
  ○ Manages all our nodes (Agents and HDFS nodes).
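The Agent-to-Collector-to-HDFS topology above could be expressed as Flume 1.x property files. The sketch below assembles such properties in Java, only to make the wiring explicit. It is not the configuration we actually ran: the exec source, the `rss-reader.jar` command, the host names, the port and the HDFS path are all hypothetical placeholders.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FlumeTopologySketch {

    /** Properties for an Agent node: exec source -> memory channel -> Avro sink. */
    public static Map<String, String> agentConfig(String name, String collectorHost, int port) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put(name + ".sources", "rss");
        p.put(name + ".channels", "mem");
        p.put(name + ".sinks", "toCollector");
        // Assumption: the RSS reader is launched by an exec source (hypothetical jar name).
        p.put(name + ".sources.rss.type", "exec");
        p.put(name + ".sources.rss.command", "java -jar rss-reader.jar");
        p.put(name + ".sources.rss.channels", "mem");
        p.put(name + ".channels.mem.type", "memory");
        // The Avro sink forwards events to the Collector's Avro source.
        p.put(name + ".sinks.toCollector.type", "avro");
        p.put(name + ".sinks.toCollector.hostname", collectorHost);
        p.put(name + ".sinks.toCollector.port", Integer.toString(port));
        p.put(name + ".sinks.toCollector.channel", "mem");
        return p;
    }

    /** Properties for the Collector node: Avro source -> memory channel -> HDFS sink. */
    public static Map<String, String> collectorConfig(String name, int port, String hdfsPath) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put(name + ".sources", "fromAgents");
        p.put(name + ".channels", "mem");
        p.put(name + ".sinks", "toHdfs");
        p.put(name + ".sources.fromAgents.type", "avro");
        p.put(name + ".sources.fromAgents.bind", "0.0.0.0");
        p.put(name + ".sources.fromAgents.port", Integer.toString(port));
        p.put(name + ".sources.fromAgents.channels", "mem");
        p.put(name + ".channels.mem.type", "memory");
        p.put(name + ".sinks.toHdfs.type", "hdfs");
        p.put(name + ".sinks.toHdfs.hdfs.path", hdfsPath);
        p.put(name + ".sinks.toHdfs.channel", "mem");
        return p;
    }

    /** Renders the map in "key = value" form, one property per line. */
    public static String render(Map<String, String> props) {
        StringBuilder sb = new StringBuilder();
        props.forEach((k, v) -> sb.append(k).append(" = ").append(v).append('\n'));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical host names, port and path, for illustration only.
        System.out.print(render(agentConfig("agent1", "collector-host", 4141)));
        System.out.print(render(agentConfig("agent2", "collector-host", 4141)));
        System.out.print(render(collectorConfig("collector", 4141, "hdfs://namenode-host/flume/news")));
    }
}
```

Both Agents point their sink at the same Collector host and port, which is what lets the Collector receive events from either of them.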

Architecture

News Recommendation

● We hosted a webpage on which people can recommend possible news sources.
  ○ http://web.ist.utl.pt/~ist156947/sds/
● We compiled a large list of news websites and blogs from a reasonable variety of countries.
  ○ E.g. Spain, Libya, Russia, Syria, Iran...

RSS News aggregator

● We wrote a Java application to read RSS feeds using:
  ○ java.net.URL to handle the resource pointed to by the URL.
  ○ javax.xml.parsers for XML parsing.
  ○ org.w3c.dom, which provides the DOM interfaces to process XML.
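A minimal sketch of such a reader, using only the three packages listed above. It is not our actual application: the class name, the `title | link` output format and the embedded sample feed are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class RssReaderSketch {

    /** Parses an RSS stream and returns one "title | link" line per <item>. */
    public static List<String> readItems(InputStream in) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(in);
        NodeList items = doc.getElementsByTagName("item");
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            lines.add(text(item, "title") + " | " + text(item, "link"));
        }
        return lines;
    }

    /** Trimmed text of the first descendant element with the given tag, or "" if absent. */
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0) {
            // Live feed: java.net.URL opens the resource the feed URL points to.
            try (InputStream in = new java.net.URL(args[0]).openStream()) {
                readItems(in).forEach(System.out::println);
            }
        } else {
            // Offline sample so the sketch runs without network access.
            String sample = "<rss version=\"2.0\"><channel><title>Sample</title>"
                + "<item><title>First post</title><link>http://example.com/1</link></item>"
                + "<item><title>Second post</title><link>http://example.com/2</link></item>"
                + "</channel></rss>";
            readItems(new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)))
                .forEach(System.out::println);
        }
    }
}
```

Printing one line per item suits a line-oriented Flume source, since each line can become one event. Real feeds differ in which elements they fill in, so missing tags are mapped to an empty string rather than treated as errors.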

Proof of concept (1/3)

● Each Agent collects the RSS feeds and sends them to the Collector Agent.

● The Collector receives the events from both Agents and stores them in HDFS.



● Because we use a replication factor of 3 with three DataNodes, every DataNode ends up holding a copy of all the data.
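The arithmetic behind this can be sketched as follows. It assumes blocks are spread evenly and that HDFS places each replica of a block on a different DataNode. The class and method names are ours, for illustration.

```java
class ReplicationSketch {

    /** Average bytes stored per DataNode, assuming replicas are spread evenly. */
    public static long bytesPerDataNode(long logicalBytes, int replication, int dataNodes) {
        if (dataNodes <= 0 || replication <= 0 || replication > dataNodes) {
            throw new IllegalArgumentException("need 0 < replication <= dataNodes");
        }
        return logicalBytes * replication / dataNodes;
    }

    public static void main(String[] args) {
        long logical = 300L * 1024 * 1024; // 300 MB of collected news, an arbitrary example size
        // Replication 3 on 3 DataNodes: each node stores the full 300 MB.
        System.out.println(bytesPerDataNode(logical, 3, 3));
        // The same data on 6 DataNodes: each node stores about half of it.
        System.out.println(bytesPerDataNode(logical, 3, 6));
    }
}
```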

Issues faced (1/4)

● The DataNode setting dfs.datanode.du.reserved is set by default to 10 GB.
  ○ This means that if a DataNode has less than 10 GB of capacity, no space remains available for the file system. (Warning: Not able to place enough replicas)
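A simplified model of why a small volume ends up with no room for HDFS. It ignores space already in use, and the 8 GB and 30 GB volume sizes are made-up examples, not measurements from our instances.

```java
class ReservedSpaceSketch {

    static final long GB = 1024L * 1024 * 1024;

    /** Space left for HDFS on a volume after subtracting the reserved amount; never negative. */
    public static long usableForHdfs(long volumeCapacityBytes, long reservedBytes) {
        return Math.max(0L, volumeCapacityBytes - reservedBytes);
    }

    public static void main(String[] args) {
        long reserved = 10 * GB; // the 10 GB dfs.datanode.du.reserved value we ran into
        // Hypothetical 8 GB volume: nothing is left for HDFS blocks.
        System.out.println(usableForHdfs(8 * GB, reserved) / GB + " GB usable");
        // Hypothetical 30 GB volume: 20 GB is left.
        System.out.println(usableForHdfs(30 * GB, reserved) / GB + " GB usable");
    }
}
```

Lowering the reserved value or attaching a larger volume are the two obvious ways out. Which one is appropriate depends on the instance type.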

Issues faced (2/4)

● For Cloudera Manager to work, all nodes must run either SUSE or Red Hat.

● Cloudera Manager cannot run on an AWS EC2 micro instance.

● When an instance restarts, its IP address changes.
  ○ So Cloudera Manager loses track of the node.

● Cloudera Manager operates with private DNS names, so any references it makes point to these private names.
  ○ The web UIs are only accessible from our own machines' browsers through the public DNS names.

Issues faced (3/4)

● Some installation guides do not mention the ports that must be open to communicate with the services.
  ○ Cloudera provides a page listing all the required ports.

● Creating folders and changing user permissions is not mentioned in the user guide.
  ○ We had to access Hadoop as user hdfs, create the flume folder and change its owner to flume using the chown command. (AccessControlException)

Issues faced (4/4)

● Although scaling by adding new Agents is easy, it requires fine-tuning each Agent's channel capacity (number of events) and transaction size.
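A sketch of the consistency checks we would apply when tuning these values. It relies on two rules of thumb: a Flume memory channel's transactionCapacity must not exceed its capacity, and a sink's batch size should not exceed the channel's transactionCapacity. The helper itself and the example numbers are hypothetical and not part of Flume.

```java
import java.util.ArrayList;
import java.util.List;

class ChannelTuningSketch {

    /** Returns a list of problems; the list is empty when the settings are consistent. */
    public static List<String> check(int capacity, int transactionCapacity, int sinkBatchSize) {
        List<String> problems = new ArrayList<>();
        if (capacity <= 0 || transactionCapacity <= 0 || sinkBatchSize <= 0) {
            problems.add("all values must be positive");
        }
        if (transactionCapacity > capacity) {
            problems.add("transactionCapacity exceeds channel capacity");
        }
        if (sinkBatchSize > transactionCapacity) {
            problems.add("sink batch size exceeds transactionCapacity");
        }
        return problems;
    }

    public static void main(String[] args) {
        // Example values only: the first set is consistent, the second is not.
        System.out.println(check(10000, 100, 100));
        System.out.println(check(100, 1000, 5000));
    }
}
```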

Future work

● Expand RSS sources.
● Implement a web UI.
● Provide search services on the HDFS.
● Improve the HDFS load balancing.

Conclusions (1/3)

● The default HDFS configuration parameters are not suitable for deploying it on AWS EC2.

● Cloudera Manager makes the installation and configuration process much easier!
  ○ But it also introduces a few constraints that might result in higher operating costs.
● Adapting the RSS reader of the Agents is not trivial!
  ○ Different RSS sources have different contents (e.g. posts with ad banners).

Conclusions (2/3)

● In our experience, the Amazon EC2 service is easier to use and more reliable than other platforms!
  ○ E.g. PlanetLab.

● Flume's architecture, based on streaming data flows, makes it easy to add new sources and sinks.
  ○ The service can scale by adding new Agents.

● Flume is horizontally scalable!
  ○ Its performance is proportional to the number of machines on which it is deployed.

● Fine-tuning Flume's configuration files is not trivial!

● The HDFS NameNode is no longer a single point of failure!
  ○ NameNode replication has been introduced. However, adding passive NameNodes affects the overall performance of the HDFS cluster.

Conclusions (3/3)

References (1/2)

● Cloudera Flume 1.x installation
  ○ https://ccp.cloudera.com/display/CDHDOC/Flume+1.x+Installation
● Cloudera Manager CDH3
  ○ https://ccp.cloudera.com/display/FREE374/Cloudera+Manager+Free+Edition+Installation+Guide
● Cloudera port information
  ○ https://ccp.cloudera.com/display/CDHDOC/Configuring+Ports+for+CDH3
● Cloudera Flume User Guide
  ○ http://archive.cloudera.com/cdh4/cdh/4/flume-ng/FlumeUserGuide.html

References (2/2)

● Find more detailed information on our setup and configuration on our personal blogs:
  ○ http://www.aknahs.pt/
  ○ http://www.otnira.com/
  ○ http://115.186.131.91/~zafar/

Easter Egg: Issues faced

● One Muslim team member declared his love to a female Cloudera member and ended up having to marry her during the project.
  ○ Turns out it was a male.

● One member became angry because another team was using demos in their project and ended up cutting off a poor Rastafarian's hair.
  ○ Turns out that screenshots are better than demos.

● One member managed to get sunburned while doing the project. Before this, it was thought that computer scientists only worked in caves.
  ○ Turns out that he just took a very hot shower.

Special Thanks

● Leandro Navarro - UPC
● Amazon
● jarcec - #flume on irc.freenode.net
● mids - #cloudera on irc.freenode.net

(@mids106) Hanging out in IRC is useful!

https://twitter.com/#!/mids106
