Handling Big Data in Windows Azure Storage
Alan Smith, Active Solution
@alansmith, www.cloudcasts.net

Feb 23, 2016

Transcript
Page 1

Handling Big Data in Windows Azure Storage

Alan Smith, Active Solution
@alansmith
www.cloudcasts.net

Page 2
Page 3
Page 4
Page 5
Page 6

On-Premise

Replication

On-Premise

Page 7
Page 8

Time      Data
30 Days   1.6 TB
10 Days   4.8 TB
2 Days    24.4 TB

MSDN Universal - $150
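
One way to read the table above is that a fixed credit buys a roughly constant amount of storage over time. This is an assumption; the slide does not say so. Under that reading, the product of volume and duration should be about the same in every row, and it is: close to 48 TB-days each. A minimal sketch of the arithmetic, with hypothetical names:

```csharp
using System;

static class StorageBudget
{
    // Volume multiplied by duration, in terabyte-days.
    // Assumption: this product is the quantity the budget constrains.
    public static decimal TerabyteDays(int days, decimal terabytes)
    {
        if (days < 0 || terabytes < 0)
            throw new ArgumentOutOfRangeException();
        return days * terabytes;
    }
}
```

Calling `StorageBudget.TerabyteDays` with the three rows (30 and 1.6, 10 and 4.8, 2 and 24.4) gives 48.0, 48.0 and 48.8.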

Page 9
Page 10

Implementation Challenges

Number of Articles              4,356,508
Number of Indexed Words         27,765,188
Total Number of Index Entries   1,003,489,254
Total Text Content File Size    41.4 GB

Page 11

Text Search Implementation

Windows Azure Storage
• Table Storage – Text Index
• Blob Storage – Pages

Windows Azure Websites
• Azure Wiki Website

Page 12

Text Index Table Design

PartitionKey   Word
RowKey         (10,000 – word count on page)_PageId
PageId         Numeric page ID (Integer)
PageTitle      Title of Page (String)

• Query on PartitionKey (word)
• Ordered by RowKey (word count on page)

Page 13

Text Index Table Example

PartitionKey  RowKey           PageId    PageTitle
azure         999604_33300527  33300527  Capetian Armorial
azure         999635_23352685  23352685  Morphological classification of Czech verbs
azure         999790_25148196  25148196  Armorial of the Communes of Seine-Maritime
azure         999901_00864847  864847    Azure (color)
azure         999913_19961416  19961416  Windows Azure
azure         999913_31687088  31687088  Ministry of Defence (Spain)
azure         999920_14011854  14011854  Coats of arms of the Holy Roman Empire
azure         999926_25312186  25312186  Armorial of the Communes of Eure
azure         999930_01317679  1317679   Lancia Aurelia
azure         999935_00717434  717434    Ordinary (heraldry)
azure         999935_04644383  4644383   Characters of The Order of the Stick
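
The row keys in the example can be reproduced with a small helper. This is a sketch, not the author's code. It assumes a base of 1,000,000, inferred from the six-digit prefixes in the example rows (the design slide states 10,000), and an eight-digit zero-padded page ID, as the rows show. The class and method names are hypothetical. Subtracting the word count from a fixed base makes pages with more occurrences sort first, because Table Storage returns rows in ascending RowKey order.

```csharp
using System;
using System.Globalization;

static class IndexRowKey
{
    // Assumption: inferred from the six-digit prefixes in the example rows.
    const int Base = 1_000_000;

    // Builds "<Base - wordCount, 6 digits>_<pageId, 8 digits>".
    public static string Make(int wordCount, int pageId)
    {
        if (wordCount < 1 || wordCount >= Base)
            throw new ArgumentOutOfRangeException(nameof(wordCount));
        if (pageId < 0)
            throw new ArgumentOutOfRangeException(nameof(pageId));

        return string.Format(CultureInfo.InvariantCulture,
            "{0:D6}_{1:D8}", Base - wordCount, pageId);
    }
}
```

Under these assumptions, `IndexRowKey.Make(87, 19961416)` yields `999913_19961416`, the key shown for the "Windows Azure" page, and `IndexRowKey.Make(99, 864847)` yields `999901_00864847`.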

Page 14

Uploading Page Data

Upload Page Content to Blob Storage

27 XML Content Files (41.4 GB - 4,356,508 Pages)

Windows Azure Storage

Blob Storage (4,356,508 Blobs)

Page 15

Creating Text Index Data

Parse Page Text

27 XML Content Files (41.4 GB - 4,356,508 Pages)

Page IDs and Titles (124 MB)

Index Entries (19,277 Files - 9.83 GB)

Page 16

Index Data Files

typical#2356523,1|2356987,1|2357098,1|2357186,1|2357237,1|2357704,1|2357705,1
history#2375229,1|2375230,1|2375232,1|2375279,1|2375293,3|2375300,1|2375314,2
renowned#2338682,1|2338841,2|2339194,1|2339509,1|2339791,1|2340298,1|2340408,1
line#2372733,1|2372749,2|2372774,2|2372784,2|2372790,1|2372796,1|2372813,1
varies#2316134,1|2317202,1|2318782,1|2319263,1|2319437,1|2319766,1|2319969,1
moore#2348931,2|2349076,2|2349268,1|2349746,8|2349903,1|2350368,2|2350437,1
journal#2371460,2|2371490,1|2371518,2|2371524,1|2371565,3|2371591,6|2371609,2
elderly#2300000,2|2300127,1|2301060,1|2301207,1|2301873,1|2302199,1|2302733,1
bearing#2331971,1|2332125,1|2332422,1|2332610,1|2333094,1|2333854,1|2334189,1

• Contains 1,000 lines
• Each line contains 100 entries for a word (1 transaction)
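
Each line appears to follow the pattern `word#pageId,count|pageId,count|...`. Since all entries on a line share one word, they also share one PartitionKey, which is what lets a line be written as a single table batch (entity group transactions accept at most 100 entities with the same partition key). Below is a minimal parsing sketch under that reading of the format; the type and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class IndexLineParser
{
    // Parses "word#pageId,count|pageId,count|..." into the word and its entries.
    public static (string Word, List<(int PageId, int Count)> Entries) Parse(string line)
    {
        int hash = line.IndexOf('#');
        if (hash < 0) throw new FormatException("Missing '#' separator.");

        string word = line.Substring(0, hash);
        var entries = new List<(int PageId, int Count)>();

        // Entries are separated by '|'; each entry is "pageId,count".
        foreach (string pair in line.Substring(hash + 1)
                     .Split(new[] { '|' }, StringSplitOptions.RemoveEmptyEntries))
        {
            string[] parts = pair.Split(',');
            if (parts.Length != 2) throw new FormatException("Bad entry: " + pair);

            entries.Add((int.Parse(parts[0], CultureInfo.InvariantCulture),
                         int.Parse(parts[1], CultureInfo.InvariantCulture)));
        }
        return (word, entries);
    }
}
```

Parsing the `moore` line from the slide returns the word `moore` and seven entries, the fourth being page 2349746 with a count of 8.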

Page 17

Insert Index Entries

Windows Azure Storage: Blob Storage, Queue, Table Storage

Windows Azure Services: Worker Roles

Page 18

Insert Index Entries

Page 19

Windows Azure

On-Premise

Windows Azure Storage

Tables Blobs Queues

http://azurespeedtest.azurewebsites.net/

Page 20

Windows Azure

On-Premise

Windows Azure Storage

Tables Blobs Queues

Windows Azure Virtual Machines

VM VM

http://azurespeedtest.azurewebsites.net/

Page 21
Page 22
Page 23

using System.Net;
ServicePointManager.DefaultConnectionLimit = 100;  // the default in client applications is 2 connections per host
ServicePointManager.UseNagleAlgorithm = false;     // do not delay small requests
ServicePointManager.Expect100Continue = false;     // skip the 100-Continue handshake round trip

Page 24

Block Blob Operations

• Single HTTP request for blob
• Sequential HTTP requests for blocks
• Parallel HTTP requests for blocks

Legend: Blob Upload, Block Upload, Block Commit

Page 25

Tuning Block Blob Operations

• Single HTTP request for blob
• Sequential HTTP requests for blocks
• Parallel HTTP requests for blocks

SingleBlobUploadThresholdInBytes

ParallelOperationThreadCount

StreamWriteSizeInBytes

Legend: Blob Upload, Block Upload, Block Commit

Page 26

Tuning Blob Operations

CloudBlobClient

Property                          Default  Range    Description
SingleBlobUploadThresholdInBytes  32 MB    1-64 MB  Maximum size of a blob in bytes that may be uploaded as a single blob.
ParallelOperationThreadCount      1        1-64     Number of blocks that may be simultaneously uploaded.

CloudBlockBlob

Property                          Default  Range             Description
StreamWriteSizeInBytes (Block)    4 MB     1-4 MB            Block size for writing to a block blob.
StreamWriteSizeInBytes (Page)              512 bytes – 4 MB  Number of bytes to buffer when writing to a page blob stream.
StreamMinimumReadSizeInBytes               1-4 MB            Minimum number of bytes to buffer when reading from a blob stream.
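
To see how the threshold and block size interact, the sketch below counts the HTTP requests an upload would need under a simplified model: a blob at or below the single-upload threshold goes up in one request, and anything larger is split into blocks followed by one commit. This is an illustration of the settings described above, not the storage client's actual logic, which may apply further conditions. The names are hypothetical.

```csharp
using System;

static class UploadPlan
{
    public const long MB = 1024 * 1024;

    // Number of HTTP requests under the simplified model described above.
    public static long RequestCount(long blobBytes, long singleUploadThresholdBytes, long blockBytes)
    {
        if (blobBytes < 0 || blockBytes <= 0)
            throw new ArgumentOutOfRangeException();

        if (blobBytes <= singleUploadThresholdBytes)
            return 1; // one request carries the whole blob

        long blocks = (blobBytes + blockBytes - 1) / blockBytes; // ceiling division
        return blocks + 1; // one request per block, plus the block commit
    }

    // Sequential rounds needed when up to parallelThreads blocks upload at once.
    public static long Rounds(long blocks, int parallelThreads)
    {
        if (blocks < 0 || parallelThreads < 1)
            throw new ArgumentOutOfRangeException();
        return (blocks + parallelThreads - 1) / parallelThreads;
    }
}
```

With the defaults from the tables (32 MB threshold, 4 MB blocks), a 10 MB blob needs one request in this model, while a 100 MB blob needs 25 block uploads plus a commit. With 8 parallel threads those 25 blocks take 4 rounds instead of 25.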

Page 27

Parallel and Asynchronous Uploads

• Parallel Blobs: files uploaded to several blobs in a blob container at once
• Parallel Blocks: one file uploaded to a single blob, with its blocks in parallel
• Parallel Blobs & Blocks: several blobs at once, each with parallel blocks
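
The "parallel blobs" pattern can be sketched with a bounded number of concurrent uploads. This example uses only the .NET base class library. `uploadAsync` is a hypothetical stand-in for whatever call actually uploads one file, and the concurrency limit is an assumption to tune per workload.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class ParallelUploader
{
    // Uploads every file, allowing at most maxParallel uploads in flight.
    // uploadAsync is a hypothetical placeholder for a real blob upload call.
    public static async Task UploadAllAsync(
        IEnumerable<string> files, Func<string, Task> uploadAsync, int maxParallel)
    {
        if (maxParallel < 1) throw new ArgumentOutOfRangeException(nameof(maxParallel));

        using var gate = new SemaphoreSlim(maxParallel);

        // ToList() starts every task; the semaphore limits how many run at once.
        var tasks = files.Select(async file =>
        {
            await gate.WaitAsync();
            try { await uploadAsync(file); }
            finally { gate.Release(); }
        }).ToList();

        await Task.WhenAll(tasks);
    }
}
```

A call such as `await ParallelUploader.UploadAllAsync(paths, UploadOneAsync, 8)` would keep up to eight uploads running, where `UploadOneAsync` is a method you supply. Combining this with parallel blocks multiplies the number of open connections, which is one reason the connection limit matters.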

Page 28
Page 29
Page 30

Storage Monitoring Tables
• $MetricsCapacityBlob
• $MetricsTransactionsBlob
• $MetricsTransactionsTable
• $MetricsTransactionsQueue

Page 31

Handling Outages
• 29th February 2012 – Major outage due to certificate error
  – MVP Summit 2012 – February 28th – March 2nd
• 22nd February 2013 – Storage outage due to certificate error
  – MVP Summit 2013 – February 18th – 22nd
• MVP Summit November 2013 – November 18th – 21st
  – Correlation does not mean causation!

Page 32

• Consider processing “In the Cloud”
• Modify ServicePointManager Settings
• Use Parallel and Asynchronous Actions
• Tune CloudBlobClient and CloudBlockBlob properties
• Fiddler is Your Friend (Especially the Timeline)
• Use the Source (Windows Azure SDK on GitHub)
• Understand Storage Emulator Limitations
• Understand transient faults
• Understand Pricing Implications
• Leverage Storage Analytics
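
On transient faults: a failed storage request is often worth retrying after a short, growing delay. The sketch below is a generic retry helper with exponential backoff. It is not the storage client's own retry policy, and `isTransient` is a hypothetical caller-supplied test for which errors are retryable.

```csharp
using System;
using System.Threading.Tasks;

static class TransientRetry
{
    // Runs action, retrying transient failures with exponential backoff.
    // Gives up after maxAttempts and lets the last exception propagate.
    public static async Task<T> RunAsync<T>(
        Func<Task<T>> action, Func<Exception, bool> isTransient,
        int maxAttempts, TimeSpan baseDelay)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await action();
            }
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                // The delay doubles on each retry: baseDelay, 2x, 4x, ...
                var delay = TimeSpan.FromMilliseconds(
                    baseDelay.TotalMilliseconds * Math.Pow(2, attempt - 1));
                await Task.Delay(delay);
            }
        }
    }
}
```

Non-transient errors and the final failed attempt are rethrown unchanged, so callers still see the original exception.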

Page 33

Thanks!
http://wikisearch.azurewebsites.net/