Petabyte-Scale Text Processing with Spark

Post on 16-Apr-2017



Oleksii Sliusarenko, Grammarly Inc.
E-mail: aliaxey90 (at) gmail (dot) com

Read the full article in Grammarly tech blog

Modern error correcting

"depending from the weather" → "depending on the weather"

Common Crawl = internet dump

◈ Size: 3 petabytes
◈ Format: WARC, a raw HTTP protocol dump
◈ We need: 1 PB, or 2000 x 480 GB SSD disks
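As a rough illustration of the WARC format mentioned above, here is a minimal sketch of pulling the headers out of one simplified, uncompressed record (the record content is invented; real Common Crawl files are gzipped and are better read with a dedicated library such as warcio):

```python
# Sketch: parsing one simplified WARC record. A record is a version
# line, header lines, a blank line, then the raw HTTP payload.
sample = (
    "WARC/1.0\n"
    "WARC-Type: response\n"
    "WARC-Target-URI: http://example.com/\n"
    "Content-Length: 25\n"
    "\n"
    "HTTP/1.1 200 OK ...body..."
)

def parse_warc_record(raw):
    head, _, body = raw.partition("\n\n")
    lines = head.splitlines()
    version = lines[0]                                   # e.g. WARC/1.0
    headers = dict(line.split(": ", 1) for line in lines[1:])
    return version, headers, body

version, headers, body = parse_warc_record(sample)
print(headers["WARC-Target-URI"])  # http://example.com/
```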

High-level pipeline view

Extract texts → Filter English → Deduplicate → Break into words → Count frequencies
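The steps above can be sketched on a single machine in a few lines (a toy illustration with hypothetical helper functions; the real pipeline runs each step distributed over Spark):

```python
# Toy, single-machine sketch of the high-level pipeline:
# extract -> filter English -> deduplicate -> split -> count.
from collections import Counter

def extract_text(doc):            # stand-in for real HTML extraction
    return doc.replace("<p>", " ").replace("</p>", " ")

def is_english(text):             # naive stand-in for language detection
    return all(ord(c) < 128 for c in text)

docs = ["<p>the cat sat</p>", "<p>the cat sat</p>", "<p>кіт сидів</p>"]

texts = [extract_text(d) for d in docs]
english = [t for t in texts if is_english(t)]
deduped = list(dict.fromkeys(english))          # order-preserving dedup
words = [w for t in deduped for w in t.split()]
freqs = Counter(words)
print(freqs["the"])  # 1 -- the duplicate document was removed
```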

Typical processing step example

Processing example: count each n-gram's frequency.

Input format:  <sentence> <tab> <frequency>
Output format: <n-gram> <tab> <frequency>

Input data example:
My name is Bob.     12
Kiev is a capital.  25

Output data example (excerpt):
name is  12
is       37
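A plain-Python sketch of this step, weighting each n-gram by its sentence's frequency (a real job would express the same logic as a Spark flatMap plus reduceByKey):

```python
# Sketch: weighted n-gram counting over <sentence>\t<frequency> lines.
from collections import Counter

def ngrams(words, n):
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

lines = ["My name is Bob.\t12", "Kiev is a capital.\t25"]

counts = Counter()
for line in lines:
    sentence, freq = line.split("\t")
    words = sentence.rstrip(".").split()
    for n in (1, 2):                     # unigrams and bigrams
        for gram in ngrams(words, n):
            counts[gram] += int(freq)    # weight by sentence frequency

print(counts["name is"])  # 12
print(counts["is"])       # 37 = 12 + 25, "is" occurs in both sentences
```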

Classic and modern approaches

Our alternatives
[Cost chart: candidate setups priced at roughly $12,000, $3,000, and $1,000]

Default choice: Amazon EMR
[Cost chart: EMR runs at $12,000 to $24,000, with OOM errors and segfaults along the way]

Our MapReduce

◈ 12x faster than Hadoop
◈ Easy to learn
◈ Full support
◈ Simple as 2x2=4

Our MapReduce

Distributed failsafe difficulties:
◈ Hardware failures
◈ Network failures

Fixing Spark

3 months!

First of all:
◈ Use the latest stable Spark and the latest stable Hadoop
◈ Build Spark with patch
◈ Don’t forget Hadoop native libraries
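The "build with patch" step might look roughly like this (a hedged sketch; the patch file name is hypothetical and the exact build profiles depend on your Hadoop version):

```shell
# Sketch: building Spark from source with a local patch applied.
git clone https://github.com/apache/spark.git && cd spark
git apply ../my-fix.patch        # hypothetical local patch file
./build/mvn -DskipTests clean package
```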

The hardest button

S3 HEAD request failed for "file path" -

ResponseCode=403, ResponseMessage=Forbidden.

Why???

HTTP Head Request

HTTP body contains the

error description, but it’

s not fetched!

No body!
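The effect is easy to reproduce against a throwaway local server (a sketch, not S3 itself; the 403 text is invented): a HEAD response carries headers only, so the error description in the body never arrives.

```python
# Demo: the same 403 via HEAD and GET. HEAD loses the body, which is
# exactly where the error description lives.
import http.client
import http.server
import threading

class Handler(http.server.BaseHTTPRequestHandler):
    def _respond(self):
        body = b"AccessDenied: check your credentials"
        self.send_response(403)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if self.command == "GET":    # a HEAD response must omit the body
            self.wfile.write(body)

    do_GET = do_HEAD = _respond

    def log_message(self, *args):    # keep the demo output quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

def fetch(method):
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
    conn.request(method, "/some/file")
    data = conn.getresponse().read()
    conn.close()
    return data

head_body = fetch("HEAD")
get_body = fetch("GET")
server.shutdown()

print(head_body)  # b'' -- the error description is simply not there
print(get_body)   # b'AccessDenied: check your credentials'
```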

Possible reasons:

◈ AccessDenied
◈ AccountProblem
◈ CrossLocationLoggingProhibited
◈ InvalidAccessKeyId
◈ InvalidObjectState
◈ InvalidPayer
◈ InvalidSecurity
◈ NotSignedUp
◈ RequestTimeTooSkewed
◈ SignatureDoesNotMatch

We need to go deeper!

Spark → Hadoop → JetS3t → HttpClient (fix here)

Fixing Spark: Fixing S3

◈ Choose the latest filesystem: S3A, not S3 or S3N
◈ conf.setInt("fs.s3a.connection.maximum", 100)
◈ Use DirectOutputCommitter
◈ --conf spark.hadoop.fs.s3a.access.key=…
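Combined on the command line, the settings above might look like this (a sketch; the job name, bucket, and key placeholders are hypothetical, and credentials are usually better supplied via instance roles than flags):

```shell
# Sketch: submitting a job with the S3A settings from the slide.
spark-submit \
  --conf spark.hadoop.fs.s3a.access.key="$AWS_ACCESS_KEY_ID" \
  --conf spark.hadoop.fs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY" \
  --conf spark.hadoop.fs.s3a.connection.maximum=100 \
  my_job.py s3a://my-bucket/input/ s3a://my-bucket/output/
```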

Fixing Spark: Fixing OOM

◈ spark.default.parallelism = cores * 3
◈ spark_mb = system_ram_mb * 4 // 5
◈ set("spark.akka.frameSize", "2047")
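Worked through for a hypothetical 32-core worker with 60 GB of RAM, the two sizing rules give:

```python
# Sketch: the sizing heuristics from the slide, applied to an
# assumed worker with 32 cores and 61440 MB of RAM.
cores = 32
system_ram_mb = 61440

parallelism = cores * 3            # spark.default.parallelism
spark_mb = system_ram_mb * 4 // 5  # leave ~20% of RAM to the OS

print(parallelism)  # 96
print(spark_mb)     # 49152
```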

Fixing Spark: Fixing miscellaneous

◈ Don’t force Kryo class registration
◈ Use bzip2 compression for input files

Our Ultimate Spark Recipe

See Grammarly tech blog for more info

Use spot instances

Spot instance: cheap (80% cheaper!), but transient
Regular instance: expensive, but safe

Was It All Worth It?

◈ We spent the same amount of money
◈ Further experiments will be cheaper
◈ You can save three months!

Take-aways

◈ Don’t reinvent the wheel
◈ New technology will eat a lot of time
◈ Don’t be afraid to dive into code
◈ Look at problems from various angles
◈ Use spot instances

Thanks! Any questions?
You can find me at aliaxey90 (at) gmail (dot) com

Read the full article in Grammarly tech blog
