
Next Revolution Toward Open Platform

Terapot: Massive Email Archiving with Hadoop & Friends

Jaesun Han, Founder & CEO of NexR, jshan@nexrcorp.com

- Commercial Hadoop Application

#2 About NexR

[Diagram: icube-cc (Compute) and icube-sc (Storage) make up a cloud computing platform compatible with Amazon AWS; Hadoop, with provisioning & management, makes up a massive data storage & processing platform; on top run Hadoop & cloud computing services: an academic support program, massive email archiving, and a MapReduce workflow.]

Offering Hadoop & Cloud Computing Platform and Services

#3 What is Email Archiving?

The Objectives of Email Archiving
- Regulatory compliance
- e-Discovery: litigation and legal discovery
- Email backup and disaster recovery
- Messaging system & storage optimization
- Monitoring of internal and external email content

#4 The Architecture of Email Archiving

[Diagram: email servers feed the email archiving server through journaling and mailbox crawling; the archiving server writes email data and indexes to archival storage; auditors, administrators, and employees reach the archive through search and discovery.]

- Data acquisition: journaling, mailbox crawling
- Data processing: indexing, filtering
- Data access: search, discovery

#5 The Challenges of Email Archiving

Explosive growth of digital data
- 6x more data in 2010 (988 EB) than in 2006
- 95% (939 EB) of it unstructured, including email
- Increasing cost and complexity of archiving
→ Requires scalable & low-cost archiving

Reinforcement of data retention regulations
- Retention, disposal, e-discovery, security
- HIPAA (healthcare) 21-23 yrs, SEC 17a-4 (trading) 6 yrs, OSHA (toxic exposure) 30 yrs, SOX (finance) 5 yrs, J-SOX, K-SOX
→ Requires scalable archiving & fast discovery

Need for intelligent data management
- Knowledge management from email data
- Filtering, monitoring, data mining, etc.
→ Requires integration with intelligent systems

#6 New Requirements of Email Archiving

- High Scalability
- Low Cost
- High Performance
- Intelligence

#7 Terapot: When Hadoop Met Email Archiving…

[Diagram: email servers feed Terapot through distributed crawling and through a journaling server; Hadoop HDFS archives the email, Hadoop MapReduce runs crawling, indexing, and other batch jobs, and a distributed search & discovery layer serves queries.]

Scale-out architecture with Hadoop
- Hadoop HDFS for archiving email data (see the sketch below)
- Hadoop MapReduce for crawling & indexing
- Apache Lucene for search & discovery
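As a concrete illustration of the archiving layer, here is a minimal Java sketch, assuming a per-user archive layout; it is not Terapot's actual code. It writes one user's raw messages into an HDFS SequenceFile keyed by message ID, the per-user archive format slide #10 describes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical archive writer: one SequenceFile per user per crawl run.
public class EmailArchiveWriter {
    public static void archiveUser(String user, Iterable<String[]> messages)
            throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/terapot/archive/" + user + ".seq"); // assumed layout
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, Text.class, BytesWritable.class);
        try {
            // msg[0] = message ID, msg[1] = raw RFC 822 text
            for (String[] msg : messages) {
                writer.append(new Text(msg[0]),
                        new BytesWritable(msg[1].getBytes("UTF-8")));
            }
        } finally {
            writer.close();
        }
    }
}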

#8 Features of Terapot

Distributed Massive Email Archiving

High Scalability by Shared-Nothing Architecture

- Thousands of servers, billions of emails

Low Cost by Inexpensive Hardware
- Entry-level servers under $5,000

High Performance by Parallelism
- Fast search: within 1-2 seconds per user account
- Fast discovery in parallel with MapReduce

Intelligence by Data Mining
- Contact network analysis, content analysis, statistics

Supports Both On-premise and Cloud (Hosted) Versions

Developed with Various Open Source Software

#9 The Architecture of Terapot

[Diagram: Terapot clients talk to the Terapot frontend over SOAP, REST, and JSON. Email enters from sources such as POP3 servers, HTTP/FTP/SFTP servers, mail servers, and NAS/NFS. Behind the frontend sit four key components built on Hadoop MapReduce, Lucene, and Hive: batch processing (crawling, indexing, merging, driven by an MR workflow manager), real-time indexing (fed by a mail server), searching (through a search gateway), and analysis (ETL, analyzer, mining). Email is stored on HDFS; indexes are stored on local disks.]
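The slide names SOAP, REST, and JSON but not the API itself, so purely as an illustration, a client call against a hypothetical REST search endpoint might look like this (endpoint, parameters, and response shape are all invented):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Illustrative only: queries an assumed /search endpoint on the frontend.
public class TerapotSearchClient {
    public static String search(String host, String user, String query)
            throws Exception {
        URL url = new URL("http://" + host + "/search?user="
                + URLEncoder.encode(user, "UTF-8")
                + "&q=" + URLEncoder.encode(query, "UTF-8"));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        StringBuilder body = new StringBuilder();
        for (String line; (line = in.readLine()) != null; ) {
            body.append(line);
        }
        in.close();
        return body.toString(); // a JSON hit list, in this sketch
    }
}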

#10 Batch Processing Component

[Diagram: Crawling (MR) pulls mail from the email sources into HDFS, writing one archive file per user as a sequence file. Indexing (MR) builds a temporary Lucene index file per user. Merging produces a merged index file (for backing up) and places index shards, with 3-copy replication, on the local file systems of the search nodes (shard 0, shard 1, …) for search.]

Archiving policies: an archive file per user; several archive files per configured crawling period.
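A sketch of the indexing step under the assumptions above: a map-only Hadoop job reads a user's sequence file and builds the temporary Lucene index on the task's local disk (Lucene 2.9-era API; the field names and index path are hypothetical).

import java.io.File;
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Map-only task: each input split is one user's archive, so each task
// produces one temporary per-user index that the merging step combines.
public class EmailIndexMapper extends Mapper<Text, BytesWritable, Text, Text> {
    private IndexWriter writer;

    @Override
    protected void setup(Context ctx) throws IOException {
        writer = new IndexWriter(
                FSDirectory.open(new File("/tmp/terapot-index")), // task-local dir
                new StandardAnalyzer(Version.LUCENE_29),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    @Override
    protected void map(Text msgId, BytesWritable rawEmail, Context ctx)
            throws IOException {
        Document doc = new Document();
        doc.add(new Field("id", msgId.toString(),
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("body",
                new String(rawEmail.getBytes(), 0, rawEmail.getLength(), "UTF-8"),
                Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
    }

    @Override
    protected void cleanup(Context ctx) throws IOException {
        writer.close(); // leaves a temporary index file for the merge step
    }
}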

#11 Real-Time Indexing Component

[Diagram: the journaling server forwards each incoming message to the real-time indexing node, which builds a real-time index in memory and keeps the messages in a database; flushing periodically hands the accumulated mail to the batch processing component (crawling, indexing, archiving), which writes the archive and index files to HDFS.]
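The slides don't show the real-time index's internals; one plausible minimal sketch, again with Lucene 2.9-era API and assumed names, keeps the index in a RAMDirectory so journaled mail becomes searchable immediately:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// In-memory index for mail forwarded by the journaling server. A periodic
// flush (not shown) would hand the accumulated mail to the batch component.
public class RealTimeIndexer {
    private final RAMDirectory dir = new RAMDirectory();
    private final IndexWriter writer;

    public RealTimeIndexer() throws Exception {
        writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_29),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    // Called once per journaled message.
    public void onJournaledMessage(String msgId, String body) throws Exception {
        Document doc = new Document();
        doc.add(new Field("id", msgId, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("body", body, Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.commit(); // visible to searchers without waiting for a batch run
    }
}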

#12 Search & Discovery Component

[Diagram: search nodes copy index shards from HDFS to their local file systems and update shard status in ZooKeeper; the search gateway locates index shards and assigns shards through ZooKeeper, then fans queries out as a distributed search across the search nodes and the real-time indexing nodes.]
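A sketch of what "updating shard status" could look like with the ZooKeeper API: each search node registers the shards it serves as ephemeral znodes, so the gateway sees only live shards and can reassign work when a node disappears. The znode layout is an assumption.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Assumes the parent path /terapot/shards/<shardId> already exists.
public class ShardRegistrar {
    public static void register(ZooKeeper zk, String shardId, String nodeId)
            throws Exception {
        String path = "/terapot/shards/" + shardId + "/" + nodeId;
        // Ephemeral: the znode is deleted automatically when the search
        // node's session dies, signalling the gateway to reassign the shard.
        zk.create(path, "serving".getBytes("UTF-8"),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }
}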

#13 Data Analysis Component

[Diagram: an ETL (MR) job extracts, transforms, and loads the email archive files on HDFS into a Hive table; the mining engine issues Hive queries, each compiled into MapReduce jobs, and stores the analysis results in a database; an analyzer web reporter generates reports such as personal contact network analysis and domain statistics.]
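For the domain-statistics report, a sketch of the kind of query the mining engine might issue once ETL has loaded the archive into a Hive table; the emails table, its sender column, and the HiveServer JDBC setup are assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Counts mail per sender domain; Hive compiles the query to MapReduce jobs.
public class DomainStats {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
                "SELECT regexp_extract(sender, '@(.*)$', 1) AS domain, "
                + "COUNT(*) AS cnt FROM emails "
                + "GROUP BY regexp_extract(sender, '@(.*)$', 1) "
                + "ORDER BY cnt DESC");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        conn.close();
    }
}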

#14 Installation & Quantitative Analysis

Installation
- 2 master nodes (HA)
- 10 worker nodes (datanode, tasktracker, searcher, etc.)

Hardware per node:
- CPU: Intel Xeon Nehalem E5504 2.0 GHz, qty 2 (8 cores)
- Memory: DDR3 2 GB PC3-10600 Registered DIMM, qty 9 (18 GB)
- HDD: 1 TB 7200 RPM SATA2, qty 4 (4 TB)

Quantitative Analysis

Assuming:
- 1000 employees
- 16 emails per day for each person
- 215 KB average email size (content 142 KB + attachment 73 KB)
- 1.25 GB per year for 1 employee

Storage:
- index size: about 80% of email
- compression ratio: about 50%

Disk volume required for 1 year:
- email archive (HDFS): 1881 GB
- indexes (HDFS + local): 4559 GB
- total: about 6.4 TB per year

40 TB may cover 6 years of archiving.
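A quick consistency check (this arithmetic is not on the slide): the archive figure follows from the stated assumptions once HDFS's default 3-way replication is folded in.

16 emails/day × 215 KB × 365 days ≈ 1.2 GB of raw mail per employee per year
1000 employees → ≈ 1255 GB of raw email per year
1255 GB × 0.5 (compression) × 3 (HDFS replicas) ≈ 1883 GB, matching the 1881 GB archive figure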

#15 Demonstration

Hadoop & Cloud Computing Company

www.nexrcorp.com

For more information:
- www.nexrcorp.com
- www.terapot.com
- jshan@nexrcorp.com
- @jaesun_han
