Page 1: Indexing big data in the cloud

Indexing Big Data in the Cloud

Page 2: Indexing big data in the cloud

Me

Scott Stults

Co-Founder of OpenSource Connections

Solr / Lucene

Bash / Python / Java

Page 3: Indexing big data in the cloud

Eric

Page 4: Indexing big data in the cloud

Big Data

Page 5: Indexing big data in the cloud

Big Data Wrangler

Page 6: Indexing big data in the cloud

How?

Address a Real Project

Be Agile

Make Small Mistaeks Fast

Succeed BIG

Page 7: Indexing big data in the cloud

USPTO Goals

Prototype Search UX

Prove Solr:

Scales

Integrates

Excels

Page 8: Indexing big data in the cloud

Scale?

Page 9: Indexing big data in the cloud

Our Approach

KISS

YAGNI

(This space intentionally left blank)

Page 10: Indexing big data in the cloud

Minimal Flair

Page 11: Indexing big data in the cloud

Record Everything!

Page 12: Indexing big data in the cloud

Some Numbers

Doc Count: 1.1 Million
Zip Files: 313
Docs per Zip File: 4,000
Zip File Size: 75M
File Size: 300M
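
As a rough sanity check on these figures, the sketch below multiplies them out. It assumes that "75M" means 75 megabytes per zip file, which the slide does not state, and it treats 4,000 docs per zip as a round figure, so the document total comes out somewhat above the stated 1.1 million.

```shell
# Rough totals from the slide's figures.
# Assumption: "75M" means 75 megabytes per zip file.
ZIP_FILES=313
DOCS_PER_ZIP=4000   # round figure from the slide
ZIP_SIZE_MB=75

# Upper estimate of the document count across all zip files.
estimate_docs() { echo $(( ZIP_FILES * DOCS_PER_ZIP )); }

# Total compressed size of all zip files, in megabytes.
estimate_compressed_mb() { echo $(( ZIP_FILES * ZIP_SIZE_MB )); }

echo "docs (upper estimate): $(estimate_docs)"
echo "compressed size (MB):  $(estimate_compressed_mb)"
```

Under that assumption the compressed input is about 23 GB in total.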

Page 13: Indexing big data in the cloud

Testing

Start some servers

Process a batch

Check the clock
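
The test loop above amounts to a small timing harness. Here is a minimal sketch, assuming the servers are already running. `process_batch` is a hypothetical stand-in for whatever indexes one batch and is not taken from the talk.

```shell
# Hypothetical stand-in for the real batch job; replace with the actual indexing command.
process_batch() {
  sleep 1
}

# Run the given command and print how many whole seconds it took.
# The command's own stdout is sent to stderr so only the duration is captured.
time_batch() {
  start=$(date +%s)
  "$@" >&2
  end=$(date +%s)
  echo $(( end - start ))
}

echo "batch took $(time_batch process_batch) seconds"
```

Appending each result to a log file, along with the node count used, would keep a record of every run.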

Page 14: Indexing big data in the cloud

start_nodes

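# Launch $MAX_NODES m1.large instances and save the command's output to ~/run-output.
# Block device mappings use the form device=[snapshot-id]:[size-in-GiB]:[delete-on-termination].
start_nodes() {
  ec2-run-instances ami-1b814f72 \
    --block-device-mapping '/dev/sdb=snap-48adde35::true' \
    --block-device-mapping '/dev/sdi1=:10:false' \
    --block-device-mapping '/dev/sdi2=:10:false' \
    --block-device-mapping '/dev/sdi3=:20:false' \
    --instance-type m1.large \
    --key uspto-proto \
    --instance-count "$MAX_NODES" \
    --group default > ~/run-output
}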

Page 15: Indexing big data in the cloud

Gut Check

How fast can we do this?

What can we do in parallel?

Page 16: Indexing big data in the cloud

Scaling

Raise our instance limit

xargs -P

GNU parallel
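
To illustrate the `xargs -P` approach, here is a minimal sketch that fans a per-file job out over several worker processes. The `WORKER` command and the file names are hypothetical placeholders, not taken from the talk. The sketch also assumes file names contain no whitespace. Note that `-P` is available in GNU and BSD xargs but is not part of POSIX.

```shell
# Hypothetical worker command: it only reports the file it was given.
# A real worker would unzip the archive and send its documents to Solr.
WORKER='echo "indexed $1"'

# Read file names from stdin (one per line) and run the worker on each,
# with at most $1 worker processes running at the same time.
run_parallel() {
  xargs -P "$1" -n 1 sh -c "$WORKER" _
}

printf '%s\n' a.zip b.zip c.zip | run_parallel 4
```

Output order is not guaranteed because the workers run concurrently. GNU parallel supports the same pattern, with `-j` setting the number of simultaneous jobs.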

Page 17: Indexing big data in the cloud

Shortcomings

SSH?

Error recovery

One Solr

Page 18: Indexing big data in the cloud

Alternatives

CloudFormation

Puppet / Chef

Multiple Cores / Shards

Hadoop

Page 19: Indexing big data in the cloud

Success

Page 20: Indexing big data in the cloud

Victory Lap

Page 21: Indexing big data in the cloud

Instances / Time

Page 22: Indexing big data in the cloud

Thank You

https://github.com/sstults/patent-indexing

@scottstults

#o19s

