Flume-based news aggregator service on Amazon EC2 Arinto Murdopo Mário Almeida Zafar Gilani SDS, EMDC 2012
Page 1: Flume-based Independent News Aggregator

Flume-based news aggregator service on Amazon EC2

Arinto Murdopo, Mário Almeida, Zafar Gilani

SDS, EMDC 2012

Page 2: Flume-based Independent News Aggregator

Outline

● Introduction
  ○ Cloudera Manager CDH3
  ○ Cloudera Flume
  ○ Hadoop Distributed File System
● Infrastructure setup
● Architecture
● News recommendation
● RSS News aggregator
● Proof of concept
● Issues faced
● Future work
● Conclusions
● References

Page 3: Flume-based Independent News Aggregator

Introduction

● A Flume-based independent news aggregator service.
● Using:
  ○ Amazon EC2 IaaS
  ○ Cloudera Manager CDH3
  ○ Cloudera Flume
  ○ Hadoop Distributed File System

Page 4: Flume-based Independent News Aggregator

Cloudera Manager CDH3

● Automates the installation and configuration process of CDH3 on an entire cluster.

● We used the free edition (up to 50 nodes).

Page 5: Flume-based Independent News Aggregator

Cloudera Flume

● A distributed, reliable and available system.
● To efficiently collect, aggregate and move large amounts of log data.
● From many different sources to a centralized or distributed data store (such as Hadoop HDFS).

Page 6: Flume-based Independent News Aggregator

Hadoop HDFS (1/2)

● For our purpose Hadoop handles:
  ○ Log receipt and storage.
  ○ Search and log processing.
● Coordinates work across a cluster of machines.

Page 7: Flume-based Independent News Aggregator

Hadoop HDFS (2/2)

Page 8: Flume-based Independent News Aggregator

Infrastructure setup

● 2 Agent nodes collecting data:
  ○ Source: RSS feed
  ○ Sink: Collector
● 1 Agent node (Collector):
  ○ Source: Agents
  ○ Sink: HDFS
● HDFS NameNode:
  ○ Coordinates the replication of data to DataNodes 1, 2 and 3.
● Cloudera Manager CDH3 node:
  ○ Manages all our nodes (Agents and HDFS nodes).

Page 9: Flume-based Independent News Aggregator

Architecture

Page 10: Flume-based Independent News Aggregator

News Recommendation

● We hosted a webpage where people can recommend possible news sources.
  ○ http://web.ist.utl.pt/~ist156947/sds/
● We compiled a large list of news websites and blogs from a variety of countries.
  ○ E.g. Spain, Libya, Russia, Syria, Iran...

Page 11: Flume-based Independent News Aggregator

RSS News aggregator

● We wrote a Java application to read RSS feeds using:
  ○ java.net.URL to handle the resource that the URL points to.
  ○ javax.xml.parsers for XML parsing.
  ○ org.w3c.dom, which provides the DOM interfaces for processing XML.
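
The slides do not show the reader itself, so the following is only a minimal sketch of how the listed packages could fit together. The class name RssTitleReader, the method readItemTitles and the sample feed are hypothetical, and the sketch assumes an RSS 2.0 layout in which each <item> carries a <title>.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not the authors' code: extracts item titles from an RSS 2.0 feed.
class RssTitleReader {

    /** Returns the trimmed <title> text of every <item> element in the feed. */
    public static List<String> readItemTitles(InputStream rss) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(rss);
            NodeList items = doc.getElementsByTagName("item");
            List<String> titles = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                // Only look for titles inside the item, so the channel title is skipped.
                NodeList titleNodes = item.getElementsByTagName("title");
                if (titleNodes.getLength() > 0) {
                    titles.add(titleNodes.item(0).getTextContent().trim());
                }
            }
            return titles;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse RSS feed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample feed, used so that the sketch runs without network access.
        String sample = "<rss version=\"2.0\"><channel><title>Demo feed</title>"
                + "<item><title>First story</title></item>"
                + "<item><title>Second story</title></item>"
                + "</channel></rss>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(readItemTitles(in));
    }
}
```

For a live feed, the input stream could instead come from new java.net.URL(feedUrl).openStream(), where feedUrl is whichever RSS address the Agent is configured to poll.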

Page 12: Flume-based Independent News Aggregator

Proof of concept (1/3)

● Our Agents collect the RSS feeds and send them to the Collector Agent.

Page 13: Flume-based Independent News Aggregator

Proof of concept (2/3)

● The Collector receives the events from both Agents and stores them in HDFS.

Page 14: Flume-based Independent News Aggregator

Proof of concept (3/3)

● Because the replication factor is 3 and there are 3 DataNodes, every DataNode ends up with the same amount of data.
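
The arithmetic behind this observation can be sketched as follows. The class ReplicationMath and the 600 MB figure are hypothetical, and the sketch assumes that blocks are spread evenly and that a DataNode never holds more than one replica of the same block.

```java
// Hypothetical worked example, not part of the authors' setup.
class ReplicationMath {

    /** Average amount of raw data stored per DataNode, in the same unit as logicalSize. */
    public static long storedPerDataNode(long logicalSize, int replicationFactor, int dataNodes) {
        if (replicationFactor <= 0 || dataNodes <= 0) {
            throw new IllegalArgumentException("replicationFactor and dataNodes must be positive");
        }
        // A block cannot have more replicas than there are DataNodes to hold them.
        int effectiveReplicas = Math.min(replicationFactor, dataNodes);
        return logicalSize * effectiveReplicas / dataNodes;
    }

    public static void main(String[] args) {
        long logicalMb = 600; // assumed size of the collected news data, in MB
        // 3 replicas on 3 DataNodes: every node holds a full copy.
        System.out.println(storedPerDataNode(logicalMb, 3, 3)); // prints 600
        // 3 replicas on 6 DataNodes: each node holds half of the data on average.
        System.out.println(storedPerDataNode(logicalMb, 3, 6)); // prints 300
    }
}
```

With as many DataNodes as replicas, every node stores the complete data set, which is why the nodes fill up identically in this setup.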

Page 15: Flume-based Independent News Aggregator

Issues faced (1/4)

● The DataNode setting dfs.datanode.du.reserved is set to 10 GB by default.
  ○ This means that a DataNode with less than 10 GB of capacity has no space left for the file system. (Warning: Not able to place enough replicas)
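
A rough sketch of why a small volume ends up with no usable space is given below. The class ReservedSpace and the volume sizes are made up, and the model is deliberately simplified: it only subtracts the reserved amount from the capacity, whereas a real DataNode also accounts for the space already in use.

```java
// Hypothetical simplified model of dfs.datanode.du.reserved, not Hadoop code.
class ReservedSpace {

    static final long GB = 1024L * 1024 * 1024;

    /** Space left for HDFS on one volume after the reserved amount is set aside. */
    public static long usableBytes(long volumeCapacityBytes, long reservedBytes) {
        return Math.max(0L, volumeCapacityBytes - reservedBytes);
    }

    public static void main(String[] args) {
        long reserved = 10 * GB; // the value reported in the slides
        // An assumed 8 GB volume is smaller than the reservation, so nothing is left.
        System.out.println(usableBytes(8 * GB, reserved) / GB);  // prints 0
        // An assumed 50 GB volume keeps 40 GB for HDFS blocks.
        System.out.println(usableBytes(50 * GB, reserved) / GB); // prints 40
    }
}
```

Under this model, either a larger volume or a lower reserved value gives the DataNode room to accept replicas again.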

Page 16: Flume-based Independent News Aggregator

Issues faced (2/4)

● For Cloudera Manager to work, all nodes must run either SUSE or Red Hat.
● Cloudera Manager cannot run on an AWS EC2 micro instance.
● When an instance restarts, its IP address changes.
  ○ Cloudera Manager then loses track of the node.
● Cloudera Manager operates with private DNS names, so every reference it makes points to a private DNS name.
  ○ The web UIs are only accessible from our machines' web browsers through the public DNS names.

Page 17: Flume-based Independent News Aggregator

Issues faced (3/4)

● Some installation guides do not mention the ports that must be open for communication with the services.
  ○ Cloudera provides a page listing all the required ports.
● The user guide does not mention creating folders or changing user permissions.
  ○ We had to access Hadoop as the user hdfs, create the flume folder and change its owner to flume with the chown command. (AccessControlException)

Page 18: Flume-based Independent News Aggregator

Issues faced (4/4)

● Although scaling by adding new Agents is easy, it requires fine-tuning each Agent's channel capacity (number of events) and transaction size.
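
To illustrate how the two settings interact, here is a toy model of a bounded channel. It is not Flume's API: the class ToyChannel and its methods are hypothetical, and the parameter names only mirror what we assume to be the corresponding memory channel properties (capacity and transactionCapacity).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

// Toy model only: mimics a channel bounded by capacity and per-transaction size.
class ToyChannel {
    private final int capacity;            // maximum number of events buffered
    private final int transactionCapacity; // maximum number of events per put or take
    private final Deque<String> events = new ArrayDeque<>();

    public ToyChannel(int capacity, int transactionCapacity) {
        if (capacity <= 0 || transactionCapacity <= 0 || transactionCapacity > capacity) {
            throw new IllegalArgumentException("need 0 < transactionCapacity <= capacity");
        }
        this.capacity = capacity;
        this.transactionCapacity = transactionCapacity;
    }

    /** Accepts the whole batch or nothing; returns false when the batch is rejected. */
    public boolean putBatch(List<String> batch) {
        if (batch.size() > transactionCapacity) return false;      // batch too large
        if (events.size() + batch.size() > capacity) return false; // channel would overflow
        events.addAll(batch);
        return true;
    }

    /** Drains at most transactionCapacity events, as a sink would in one transaction. */
    public List<String> takeBatch() {
        List<String> out = new ArrayList<>();
        while (out.size() < transactionCapacity && !events.isEmpty()) {
            out.add(events.poll());
        }
        return out;
    }

    public int size() {
        return events.size();
    }

    public static void main(String[] args) {
        ToyChannel channel = new ToyChannel(100, 20);
        System.out.println(channel.putBatch(Collections.nCopies(21, "event"))); // false: batch too large
        for (int i = 0; i < 5; i++) {
            channel.putBatch(Collections.nCopies(20, "event")); // fills the channel to 100
        }
        System.out.println(channel.putBatch(Collections.nCopies(20, "event"))); // false: channel full
        System.out.println(channel.takeBatch().size()); // 20
        System.out.println(channel.size());             // 80
    }
}
```

The model shows the trade-off: when sources produce events faster than the sink drains them, a small capacity rejects batches, while a small transaction size limits how much each put or take can move.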

Page 19: Flume-based Independent News Aggregator

Future work

● Expand RSS sources.
● Implement a web UI.
● Provide search services on the HDFS.
● Improve the HDFS load balancing.

Page 20: Flume-based Independent News Aggregator

Conclusions (1/3)

● The default HDFS configuration parameters are not suitable for deployment on AWS EC2.
● Cloudera Manager makes the installation and configuration process much easier!
  ○ But it also introduces a few constraints that might result in higher operating costs.
● Adapting the RSS reader of the agents is not trivial!
  ○ Different RSS sources have different contents (e.g. posts with ad banners).

Page 21: Flume-based Independent News Aggregator

Conclusions (2/3)

● The Amazon EC2 service is easier to use and more reliable than other cloud providers!
  ○ E.g. PlanetLab.
● Flume's architecture, based on streaming data flows, makes it easier to add new sources and sinks.
  ○ The service can scale by adding new Agents.
● Flume is horizontally scalable!
  ○ Because its performance is proportional to the number of machines on which it is deployed.

Page 22: Flume-based Independent News Aggregator

Conclusions (3/3)

● Fine-tuning Flume's configuration files is not trivial!
● The HDFS NameNode is no longer a single point of failure!
  ○ This is because NameNode replication was introduced. Adding passive NameNodes does affect the overall performance of the HDFS cluster, though.

Page 24: Flume-based Independent News Aggregator

References (2/2)

● Find more detailed information on our setup and configuration on our personal blogs:
  ○ http://www.aknahs.pt/
  ○ http://www.otnira.com/
  ○ http://115.186.131.91/~zafar/

Page 25: Flume-based Independent News Aggregator

Easter Egg: Issues faced

● One islamic team member declared love to a Cloudera female member and ended up having to marry her during the project.○ Turns out it was a male.

● One member became angry because other team was using demos on their project and ended up cutting a poor rastafarian hair off.○ Turns out that screenshots are better than demos.

● One member managed to get sun burned while doing the project. Before this it was thought that computer scientists would only work in caves.○ Turns out that he just took a very hot shower.

Page 26: Flume-based Independent News Aggregator

Special Thanks

● Leandro Navarro - UPC
● Amazon
● jarcec - #flume on irc.freenode.net
● mids - #cloudera on irc.freenode.net

(@mids106) Hanging out in IRC is useful!
