Flume-based news aggregator service on Amazon EC2 Arinto Murdopo Mário Almeida Zafar Gilani SDS, EMDC 2012
May 25, 2015
Outline
● Introduction
  ○ Cloudera Manager CDH3
  ○ Cloudera Flume
  ○ Hadoop Distributed File System
● Infrastructure setup
● Architecture
● News recommendation
● RSS News aggregator
● Proof of concept
● Issues faced
● Future work
● Conclusions
● References
Introduction
● A Flume-based independent news aggregator service.
● Built using:
  ○ Amazon EC2 IaaS
  ○ Cloudera Manager CDH3
  ○ Cloudera Flume
  ○ Hadoop Distributed File System
Cloudera Manager CDH3
● Automates the installation and configuration process of CDH3 on an entire cluster.
● We used the free edition (supports up to 50 nodes).
Cloudera Flume
● A distributed, reliable and available system.
● Efficiently collects, aggregates and moves large amounts of log data.
● Moves data from many different sources to a centralized or distributed data store (such as Hadoop HDFS).
Hadoop HDFS (1/2)
● For our purposes, Hadoop handles:
  ○ Log receipt and storage.
  ○ Search and log processing.
● Coordinates work among a cluster of machines.
Hadoop HDFS (2/2)
Infrastructure setup
● 2 Agent nodes collecting data:
  ○ Source: RSS feed
  ○ Sink: Collector
● 1 Agent node (Collector):
  ○ Source: Agents
  ○ Sink: HDFS
● HDFS NameNode:
  ○ Manages replication of data across DataNodes 1, 2 and 3.
● Cloudera Manager CDH3 node:
  ○ Manages all our nodes (Agents and HDFS nodes).
Architecture
News Recommendation
● We hosted a webpage where people can recommend possible news sources.
  ○ http://web.ist.utl.pt/~ist156947/sds/
● We compiled a large list of news websites and blogs from a reasonable variety of countries.
  ○ E.g. Spain, Libya, Russia, Syria, Iran...
RSS News aggregator
● We wrote a Java application to read RSS feeds using:
  ○ java.net.URL to access the resource the URL points to.
  ○ javax.xml.parsers for XML parsing.
  ○ org.w3c.dom, which provides the DOM interfaces for processing XML.
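A minimal sketch of how an RSS reader can be assembled from the packages listed above. It is not the code of our application: the class name RssTitleReader, the method readItemTitles and the sample feed are hypothetical, and the sketch assumes an RSS 2.0 feed in which each entry is an <item> with a <title> child.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper; names are illustrative and not taken from the original application.
class RssTitleReader {

    // Parses an RSS 2.0 document from a stream and returns the <title> of every <item>.
    static List<String> readItemTitles(InputStream in) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(in);
            NodeList items = doc.getElementsByTagName("item");
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                NodeList titleNodes = item.getElementsByTagName("title");
                // Items without a <title> are skipped rather than treated as errors.
                if (titleNodes.getLength() > 0) {
                    titles.add(titleNodes.item(0).getTextContent().trim());
                }
            }
            return titles;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        // In-memory sample feed so the sketch runs without network access.
        // For a live feed, pass new java.net.URL(feedAddress).openStream() instead.
        String sample = "<rss version=\"2.0\"><channel><title>Sample feed</title>"
                + "<item><title>First story</title></item>"
                + "<item><title>Second story</title></item>"
                + "</channel></rss>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        for (String title : readItemTitles(in)) {
            System.out.println(title);
        }
    }
}
```

Reading titles per <item> rather than document-wide keeps the channel's own <title> out of the result.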
Proof of concept (1/3)
● Each Agent collects RSS feed entries and sends them to the Collector Agent.
● The Collector receives the events from both Agents and stores them in HDFS.
Proof of concept (2/3)
Proof of concept (3/3)
● Because the replication factor is 3 and we have 3 DataNodes, every DataNode holds a copy of every block and therefore ends up with the same amount of data.
Issues faced (1/4)
● The DataNode setting dfs.datanode.du.reserved is set by default to 10 GB.
  ○ If a DataNode has less than 10 GB of capacity, no space remains available for the file system (Warning: Not able to place enough replicas).
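The arithmetic behind this issue can be sketched as follows. This is a simplified model and not HDFS's actual accounting code: it assumes the space offered to HDFS is the volume capacity minus the reserved amount, floored at zero. The class ReservedSpace and the volume sizes are hypothetical.

```java
// Simplified model of reserved space on a DataNode volume; not HDFS's real accounting.
class ReservedSpace {
    static final long GB = 1024L * 1024L * 1024L;

    // Bytes a volume can offer to HDFS once the reserved amount is set aside.
    static long usableBytes(long volumeCapacityBytes, long reservedBytes) {
        return Math.max(0L, volumeCapacityBytes - reservedBytes);
    }

    public static void main(String[] args) {
        long reserved = 10 * GB; // the default reported on this slide

        // Hypothetical 8 GB volume: smaller than the reservation, so nothing is left.
        System.out.println(usableBytes(8 * GB, reserved) / GB);  // prints 0

        // Hypothetical 30 GB volume: 20 GB remain for HDFS blocks.
        System.out.println(usableBytes(30 * GB, reserved) / GB); // prints 20
    }
}
```

Under this model, either a larger volume or a smaller reservation leaves room for replicas.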
Issues faced (2/4)
● For the CDH Manager to work, all nodes must run either SUSE or Red Hat.
● The CDH Manager cannot run on an AWS EC2 micro instance.
● When an instance restarts, its IP address changes.
  ○ So the CDH Manager loses track of the node.
● The CDH Manager operates with private DNS names, so any references it makes point to these private names.
  ○ Web UIs are only accessible from our own machines' web browsers through public DNS names.
Issues faced (3/4)
● Some installation guides do not mention the ports that must be opened to communicate with the services.
  ○ Cloudera provides a page with all the required ports.
● Creating folders and changing user permissions are not mentioned in the user guide.
  ○ We had to access Hadoop as user hdfs, create the flume folder and change its owner to flume with the chown command (otherwise: AccessControlException).
Issues faced (4/4)
● Although scaling by adding new Agents is easy, it requires fine-tuning the channel capacity (number of events) and the transaction size of each Agent.
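Why these two values need tuning can be illustrated with a toy model. This is not Flume's channel implementation: it assumes a bounded in-memory queue, a source that offers a fixed number of events per round, and a sink that removes one batch of at most the transaction size per round. All names and numbers are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;

// Toy model of a bounded channel between a source and a sink; not Flume code.
class BoundedChannelSketch {
    private final ArrayBlockingQueue<String> events;
    private final int transactionSize;

    BoundedChannelSketch(int capacity, int transactionSize) {
        this.events = new ArrayBlockingQueue<>(capacity);
        this.transactionSize = transactionSize;
    }

    // Source side: returns false when the channel is full and the event is rejected.
    boolean put(String event) {
        return events.offer(event);
    }

    // Sink side: removes up to transactionSize events in one batch.
    List<String> takeBatch() {
        List<String> batch = new ArrayList<>();
        events.drainTo(batch, transactionSize);
        return batch;
    }

    // Runs the model and returns how many events the full channel rejected.
    static int simulate(int capacity, int transactionSize, int rounds, int eventsPerRound) {
        BoundedChannelSketch channel = new BoundedChannelSketch(capacity, transactionSize);
        int rejected = 0;
        for (int round = 0; round < rounds; round++) {
            for (int i = 0; i < eventsPerRound; i++) {
                if (!channel.put("event-" + round + "-" + i)) {
                    rejected++;
                }
            }
            channel.takeBatch();
        }
        return rejected;
    }

    public static void main(String[] args) {
        // The source outpaces the sink, so a small channel fills up and rejects events.
        System.out.println("small channel rejected: " + simulate(100, 10, 5, 50));
        // A larger capacity absorbs the same burst without rejections.
        System.out.println("large channel rejected: " + simulate(1000, 10, 5, 50));
    }
}
```

In this model a larger capacity only delays the problem if the source stays faster than the sink; raising the batch size is what increases the drain rate.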
Future work
● Expand RSS sources.
● Implement a web UI.
● Provide search services on the HDFS.
● Improve the HDFS load balancing.
Conclusions (1/3)
● The default HDFS configuration parameters are not suitable for deployment on AWS EC2.
● Cloudera Manager makes the installation and configuration process much easier!
  ○ But it also introduces a few constraints that might result in higher operating costs.
● Adapting the RSS reader of the Agents is not trivial!
  ○ Different RSS sources have different contents (e.g. posts with ad banners).
Conclusions (2/3)
● The Amazon EC2 service is easier to use and more reliable than other cloud providers!
  ○ E.g. PlanetLab.
● Flume's architecture, based on streaming data flows, makes it easier to add new sources and sinks.
  ○ The service can scale by adding new Agents.
● Flume is horizontally scalable!
  ○ Because its performance is proportional to the number of machines on which it is deployed.
Conclusions (3/3)
● Fine-tuning Flume's configuration files is not trivial!
● The HDFS NameNode is no longer a single point of failure!
  ○ Since NameNode replication was introduced. Adding passive NameNodes affects the overall performance of the HDFS cluster, though.
References (1/2)
● Cloudera Flume 1.x installation
  ○ https://ccp.cloudera.com/display/CDHDOC/Flume+1.x+Installation
● Cloudera Manager CDH3
  ○ https://ccp.cloudera.com/display/FREE374/Cloudera+Manager+Free+Edition+Installation+Guide
● Cloudera port information
  ○ https://ccp.cloudera.com/display/CDHDOC/Configuring+Ports+for+CDH3
● Cloudera Flume User Guide
  ○ http://archive.cloudera.com/cdh4/cdh/4/flume-ng/FlumeUserGuide.html
References (2/2)
● Find more detailed information on our setup and configuration on our personal blogs:
  ○ http://www.aknahs.pt/
  ○ http://www.otnira.com/
  ○ http://115.186.131.91/~zafar/
Easter Egg: Issues faced
● One Islamic team member declared love to a female Cloudera member and ended up having to marry her during the project.
  ○ Turns out it was a male.
● One member became angry because another team was using demos in their project and ended up cutting a poor Rastafarian's hair off.
  ○ Turns out that screenshots are better than demos.
● One member managed to get sunburned while doing the project. Before this, it was thought that computer scientists only worked in caves.
  ○ Turns out that he just took a very hot shower.
Special Thanks
● Leandro Navarro - UPC
● Amazon
● jarcec - #flume on irc.freenode.net
● mids - #cloudera on irc.freenode.net (@mids106)
Hanging out in IRC is useful!