
Next Revolution Toward Open Platform

Terapot: Massive Email Archiving with Hadoop & Friends

Jaesun Han, Founder & CEO of NexR, jshan@nexrcorp.com

- Commercial Hadoop Application

#2 About NexR

[Diagram: icube-cc (Compute) and icube-sc (Storage) make up a cloud computing platform compatible with Amazon AWS; Hadoop, with provisioning & management, makes up a massive data storage & processing platform; on top run Hadoop & cloud computing services: an academic support program, massive email archiving, and a MapReduce workflow.]

Offering Hadoop & Cloud Computing Platform and Services

#3 What is Email Archiving?

The Objectives of Email Archiving
- Regulatory compliance
- e-Discovery: litigation and legal discovery
- Email backup and disaster recovery
- Messaging system & storage optimization
- Monitoring of internal and external email content

#4 The Architecture of Email Archiving

[Diagram: email servers feed the email archiving server through journaling and mailbox crawling; the archiving server writes email data and indexes to archival storage; auditors, administrators, and employees reach the archive through search and discovery.]

- Data acquisition: journaling, mailbox crawling
- Data processing: indexing, filtering
- Data access: search, discovery

#5 The Challenges of Email Archiving

Explosive growth of digital data
- 6x more data in 2010 (988 EB) than in 2006
- 95% (939 EB) of it unstructured, including email
- Increasing cost and complexity of archiving
→ Requires scalable & low-cost archiving

Reinforcement of data retention regulations
- Retention, disposal, e-discovery, security
- HIPAA (healthcare) 21-23 yrs, SEC 17a-4 (trading) 6 yrs, OSHA (toxic exposure) 30 yrs, SOX (finance) 5 yrs, J-SOX, K-SOX
→ Requires scalable archiving & fast discovery

Need for intelligent data management
- Knowledge management from email data
- Filtering, monitoring, data mining, etc.
→ Requires integration with intelligent systems

#6 New Requirements of Email Archiving

- High Scalability
- Low Cost
- High Performance
- Intelligence

#7 Terapot: When Hadoop Met Email Archiving…

[Diagram: email servers feed Terapot through distributed crawling and through a journaling server; Hadoop HDFS archives the email, Hadoop MapReduce runs crawling, indexing, and other batch jobs, and a distributed search & discovery layer serves queries.]

Scale-out architecture with Hadoop
- Hadoop HDFS for archiving email data (see the sketch below)
- Hadoop MapReduce for crawling & indexing
- Apache Lucene for search & discovery
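As a concrete illustration of the archiving layer, here is a minimal Java sketch, assuming a per-user archive layout; it is not Terapot's actual code. It writes one user's raw messages into an HDFS SequenceFile keyed by message ID, the per-user archive format slide #10 describes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical archive writer: one SequenceFile per user per crawl run.
public class EmailArchiveWriter {
    public static void archiveUser(String user, Iterable<String[]> messages)
            throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/terapot/archive/" + user + ".seq"); // assumed layout
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, Text.class, BytesWritable.class);
        try {
            // msg[0] = message ID, msg[1] = raw RFC 822 text
            for (String[] msg : messages) {
                writer.append(new Text(msg[0]),
                        new BytesWritable(msg[1].getBytes("UTF-8")));
            }
        } finally {
            writer.close();
        }
    }
}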

#8 Features of Terapot

Distributed Massive Email Archiving

High Scalability by Shared-Nothing Architecture

- Thousands of servers, billions of emails

Low Cost by Inexpensive Hardware
- Entry-level servers under $5,000

High Performance by Parallelism
- Fast search: within 1-2 seconds per user account
- Fast discovery in parallel with MapReduce

Intelligence by Data Mining
- Contact network analysis, content analysis, statistics

Supports Both On-premise and Cloud (Hosted) Versions

Developed with Various Open Source Software

#9 The Architecture of Terapot

[Diagram: Terapot clients talk to the Terapot frontend over SOAP, REST, and JSON. Email enters from sources such as POP3 servers, HTTP/FTP/SFTP servers, mail servers, and NAS/NFS. Behind the frontend sit four key components built on Hadoop MapReduce, Lucene, and Hive: batch processing (crawling, indexing, merging, driven by an MR workflow manager), real-time indexing (fed by a mail server), searching (through a search gateway), and analysis (ETL, analyzer, mining). Email is stored on HDFS; indexes are stored on local disks.]
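The slide names SOAP, REST, and JSON but not the API itself, so purely as an illustration, a client call against a hypothetical REST search endpoint might look like this (endpoint, parameters, and response shape are all invented):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Illustrative only: queries an assumed /search endpoint on the frontend.
public class TerapotSearchClient {
    public static String search(String host, String user, String query)
            throws Exception {
        URL url = new URL("http://" + host + "/search?user="
                + URLEncoder.encode(user, "UTF-8")
                + "&q=" + URLEncoder.encode(query, "UTF-8"));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        StringBuilder body = new StringBuilder();
        for (String line; (line = in.readLine()) != null; ) {
            body.append(line);
        }
        in.close();
        return body.toString(); // a JSON hit list, in this sketch
    }
}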

#10 Batch Processing Component

[Diagram: Crawling (MR) pulls mail from the email sources into HDFS, writing one archive file per user as a sequence file. Indexing (MR) builds a temporary Lucene index file per user. Merging produces a merged index file (for backing up) and places index shards, with 3-copy replication, on the local file systems of the search nodes (shard 0, shard 1, …) for search.]

Archiving policies: an archive file per user; several archive files per configured crawling period.
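A sketch of the indexing step under the assumptions above: a map-only Hadoop job reads a user's sequence file and builds the temporary Lucene index on the task's local disk (Lucene 2.9-era API; the field names and index path are hypothetical).

import java.io.File;
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Map-only task: each input split is one user's archive, so each task
// produces one temporary per-user index that the merging step combines.
public class EmailIndexMapper extends Mapper<Text, BytesWritable, Text, Text> {
    private IndexWriter writer;

    @Override
    protected void setup(Context ctx) throws IOException {
        writer = new IndexWriter(
                FSDirectory.open(new File("/tmp/terapot-index")), // task-local dir
                new StandardAnalyzer(Version.LUCENE_29),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    @Override
    protected void map(Text msgId, BytesWritable rawEmail, Context ctx)
            throws IOException {
        Document doc = new Document();
        doc.add(new Field("id", msgId.toString(),
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("body",
                new String(rawEmail.getBytes(), 0, rawEmail.getLength(), "UTF-8"),
                Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }

    @Override
    protected void cleanup(Context ctx) throws IOException {
        writer.close(); // leaves a temporary index file for the merge step
    }
}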

#11 Real-Time Indexing Component

[Diagram: the journaling server forwards each incoming message to the real-time indexing node, which builds a real-time index in memory and keeps the messages in a database; flushing periodically hands the accumulated mail to the batch processing component (crawling, indexing, archiving), which writes the archive and index files to HDFS.]
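The slides don't show the real-time index's internals; one plausible minimal sketch, again with Lucene 2.9-era API and assumed names, keeps the index in a RAMDirectory so journaled mail becomes searchable immediately:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// In-memory index for mail forwarded by the journaling server. A periodic
// flush (not shown) would hand the accumulated mail to the batch component.
public class RealTimeIndexer {
    private final RAMDirectory dir = new RAMDirectory();
    private final IndexWriter writer;

    public RealTimeIndexer() throws Exception {
        writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_29),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    // Called once per journaled message.
    public void onJournaledMessage(String msgId, String body) throws Exception {
        Document doc = new Document();
        doc.add(new Field("id", msgId, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("body", body, Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.commit(); // visible to searchers without waiting for a batch run
    }
}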

#12 Search & Discovery Component

[Diagram: search nodes copy index shards from HDFS to their local file systems and update shard status in ZooKeeper; the search gateway locates index shards and assigns shards through ZooKeeper, then fans queries out as a distributed search across the search nodes and the real-time indexing nodes.]
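A sketch of what "updating shard status" could look like with the ZooKeeper API: each search node registers the shards it serves as ephemeral znodes, so the gateway sees only live shards and can reassign work when a node disappears. The znode layout is an assumption.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Assumes the parent path /terapot/shards/<shardId> already exists.
public class ShardRegistrar {
    public static void register(ZooKeeper zk, String shardId, String nodeId)
            throws Exception {
        String path = "/terapot/shards/" + shardId + "/" + nodeId;
        // Ephemeral: the znode is deleted automatically when the search
        // node's session dies, signalling the gateway to reassign the shard.
        zk.create(path, "serving".getBytes("UTF-8"),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }
}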

#13 Data Analysis Component

[Diagram: an ETL (MR) job extracts, transforms, and loads the email archive files on HDFS into a Hive table; the mining engine issues Hive queries, each compiled into MapReduce jobs, and stores the analysis results in a database; an analyzer web reporter generates reports such as personal contact network analysis and domain statistics.]
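For the domain-statistics report, a sketch of the kind of query the mining engine might issue once ETL has loaded the archive into a Hive table; the emails table, its sender column, and the HiveServer JDBC setup are assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Counts mail per sender domain; Hive compiles the query to MapReduce jobs.
public class DomainStats {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
                "SELECT regexp_extract(sender, '@(.*)$', 1) AS domain, "
                + "COUNT(*) AS cnt FROM emails "
                + "GROUP BY regexp_extract(sender, '@(.*)$', 1) "
                + "ORDER BY cnt DESC");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        conn.close();
    }
}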

#14 Installation & Quantitative Analysis

Installation
- 2 master nodes (HA)
- 10 worker nodes (datanode, tasktracker, searcher, etc.)

Hardware per node:
- CPU: Intel Xeon Nehalem E5504 2.0 GHz, qty 2 (8 cores)
- Memory: DDR3 2 GB PC3-10600 Registered DIMM, qty 9 (18 GB)
- HDD: 1 TB 7200 RPM SATA2, qty 4 (4 TB)

Quantitative Analysis

Assuming:
- 1000 employees
- 16 emails per day for each person
- 215 KB average email size (content 142 KB + attachment 73 KB)
- 1.25 GB per year for 1 employee

Storage:
- index size: about 80% of email
- compression ratio: about 50%

Disk volume required for 1 year:
- email archive (HDFS): 1881 GB
- indexes (HDFS + local): 4559 GB
- total: about 6.4 TB per year

40 TB may cover 6 years of archiving.
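A quick consistency check (this arithmetic is not on the slide): the archive figure follows from the stated assumptions once HDFS's default 3-way replication is folded in.

16 emails/day × 215 KB × 365 days ≈ 1.2 GB of raw mail per employee per year
1000 employees → ≈ 1255 GB of raw email per year
1255 GB × 0.5 (compression) × 3 (HDFS replicas) ≈ 1883 GB, matching the 1881 GB archive figure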

#15 Demonstration

Hadoop & Cloud Computing Company

www.nexrcorp.com

For more information:
- www.nexrcorp.com
- www.terapot.com
- jshan@nexrcorp.com
- @jaesun_han
