Handling Big Data in Windows Azure Storage
Alan Smith, Active Solution
@alansmith, www.cloudcasts.net
Feb 23, 2016
On-Premise
Replication
On-Premise
Time      Data
30 Days   1.6 TB
10 Days   4.8 TB
2 Days    24.4 TB
MSDN Universal - $150
Implementation Challenges
Number of Articles                  4,356,508
Number of Indexed Words             27,765,188
Total number of Index Entries       1,003,489,254
Total Text Content File Size        41.4 GB
Text Search Implementation
Windows Azure Storage
Windows Azure Websites
Table Storage – Text Index
Blob Storage – Pages
Azure Wiki Website
Text Index Table Design
PartitionKey   Word
RowKey         (10,000 – word count on page)_PageId
PageId         Numeric page ID (Integer)
PageTitle      Title of Page (String)
• Query on PartitionKey (word)
• Ordered by RowKey (word count on page)
Text Index Table Example
PartitionKey   RowKey            PageId     PageTitle
azure          999604_33300527   33300527   Capetian Armorial
azure          999635_23352685   23352685   Morphological classification of Czech verbs
azure          999790_25148196   25148196   Armorial of the Communes of Seine-Maritime
azure          999901_00864847   864847     Azure (color)
azure          999913_19961416   19961416   Windows Azure
azure          999913_31687088   31687088   Ministry of Defence (Spain)
azure          999920_14011854   14011854   Coats of arms of the Holy Roman Empire
azure          999926_25312186   25312186   Armorial of the Communes of Eure
azure          999930_01317679   1317679    Lancia Aurelia
azure          999935_00717434   717434     Ordinary (heraldry)
azure          999935_04644383   4644383    Characters of The Order of the Stick
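The RowKey scheme above can be sketched in a few lines. This is a minimal illustration, not the presenter's code: the function names are hypothetical, and the base of 1,000,000 is an assumption inferred from the six-digit prefixes in the example rows (the design slide states 10,000), so it is left as a parameter. Subtracting the count from a fixed base and zero-padding both parts means that an ascending string sort on RowKey returns the pages with the highest word count first.

```python
def make_row_key(word_count: int, page_id: int, base: int = 1_000_000) -> str:
    """Build a RowKey that sorts high word counts first.

    `base` is an assumption taken from the example rows, not from the design slide.
    """
    if not 1 <= word_count < base:
        raise ValueError("word_count must be between 1 and base - 1")
    # Fixed-width zero padding makes string order match numeric order.
    width = len(str(base - 1))
    return f"{base - word_count:0{width}d}_{page_id:08d}"


def parse_row_key(row_key: str, base: int = 1_000_000) -> tuple:
    """Recover (word_count, page_id) from a RowKey built by make_row_key."""
    inverted, page_id = row_key.split("_")
    return base - int(inverted), int(page_id)


if __name__ == "__main__":
    # Pairs of (word count, page ID); the counts are derived from the example rows.
    keys = [make_row_key(count, page) for count, page in [(87, 19961416), (396, 33300527)]]
    for key in sorted(keys):
        print(key, parse_row_key(key))
```

Under that assumption, the "Windows Azure" row corresponds to a word count of 87 and the "Capetian Armorial" row to a count of 396.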
Uploading Page Data
Upload Page Content to Blob Storage
27 XML Content Files (41.4 GB - 4,356,508 Pages)
Windows Azure Storage
Blob Storage (4,356,508 Blobs)
Creating Text Index Data
Parse Page Text
27 XML Content Files (41.4 GB - 4,356,508 Pages)
Page IDs and Titles (124 MB)
Index Entries (19,277 Files - 9.83 GB)
Index Data Files
typical#2356523,1|2356987,1|2357098,1|2357186,1|2357237,1|2357704,1|2357705,1
history#2375229,1|2375230,1|2375232,1|2375279,1|2375293,3|2375300,1|2375314,2
renowned#2338682,1|2338841,2|2339194,1|2339509,1|2339791,1|2340298,1|2340408,1
line#2372733,1|2372749,2|2372774,2|2372784,2|2372790,1|2372796,1|2372813,1
varies#2316134,1|2317202,1|2318782,1|2319263,1|2319437,1|2319766,1|2319969,1
moore#2348931,2|2349076,2|2349268,1|2349746,8|2349903,1|2350368,2|2350437,1
journal#2371460,2|2371490,1|2371518,2|2371524,1|2371565,3|2371591,6|2371609,2
elderly#2300000,2|2300127,1|2301060,1|2301207,1|2301873,1|2302199,1|2302733,1
bearing#2331971,1|2332125,1|2332422,1|2332610,1|2333094,1|2333854,1|2334189,1
• Contains 1,000 lines
• Each line contains 100 entries for a word (1 transaction)
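The line format shown above (word, `#`, then `pageId,count` pairs separated by `|`) is straightforward to parse. The sketch below is an illustration with hypothetical function names, not the presenter's tooling. It also shows why 100 entries per line is a convenient size: a Table Storage batch transaction accepts at most 100 entities that share a PartitionKey, so one line for one word maps onto one batch.

```python
from typing import Iterator, List, Tuple

Entry = Tuple[int, int]  # (page_id, word_count_on_page)


def parse_index_line(line: str) -> Tuple[str, List[Entry]]:
    """Split 'word#id,count|id,count|...' into the word and its entries."""
    word, _, payload = line.strip().partition("#")
    entries = []
    for item in payload.split("|"):
        page_id, count = item.split(",")
        entries.append((int(page_id), int(count)))
    return word, entries


def batches(entries: List[Entry], size: int = 100) -> Iterator[List[Entry]]:
    """Yield chunks of at most `size` entries, one chunk per batch transaction."""
    for start in range(0, len(entries), size):
        yield entries[start:start + size]


if __name__ == "__main__":
    word, entries = parse_index_line("moore#2348931,2|2349076,2|2349268,1|2349746,8")
    for batch in batches(entries):
        print(word, len(batch), batch)
```

Because every entry on a line shares the same word, and therefore the same PartitionKey, each chunk can be submitted as a single batch without regrouping.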
Insert Index Entries
Windows Azure Storage
Blob Storage
Queue
Table Storage
Windows Azure Services
Worker Roles
Insert Index Entries
Windows Azure
On-Premise
Windows Azure Storage
Tables Blobs Queues
http://azurespeedtest.azurewebsites.net/
Windows Azure
On-Premise
Windows Azure Storage
Tables Blobs Queues
Windows Azure Virtual Machines
VM VM
http://azurespeedtest.azurewebsites.net/
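// Allow more concurrent HTTP connections per endpoint (the .NET default is low).
ServicePointManager.DefaultConnectionLimit = 100;
// Disable Nagle's algorithm so small requests are sent without buffering delay.
ServicePointManager.UseNagleAlgorithm = false;
// Skip the "Expect: 100-continue" handshake to save a round trip per request.
ServicePointManager.Expect100Continue = false;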
Block Blob Operations
Single HTTP request for blob
Sequential HTTP requests for blocks
Parallel HTTP requests for blocks
Blob Upload, Block Upload, Block Commit
Tuning Block Blob Operations
Single HTTP request for blob
Sequential HTTP requests for blocks
Parallel HTTP requests for blocks
SingleBlobUploadThresholdInBytes
ParallelOperationThreadCount
StreamWriteSizeInBytes
Blob Upload, Block Upload, Block Commit
Tuning Blob Operations
Property                           Default   Range              Description
SingleBlobUploadThresholdInBytes   32 MB     1-64 MB            Maximum size of a blob in bytes that may be uploaded as a single blob.
ParallelOperationThreadCount       1         1-64               Number of blocks that may be simultaneously uploaded.
StreamWriteSizeInBytes (Block)     4 MB      1-4 MB             Block size for writing to a block blob.
StreamWriteSizeInBytes (Page)                512 bytes – 4 MB   Number of bytes to buffer when writing to a page blob stream.
StreamMinimumReadSizeInBytes                 1-4 MB             Minimum number of bytes to buffer when reading from a blob stream.
CloudBlobClient
CloudBlockBlob
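How the threshold and block size interact can be illustrated with a small calculation. This is a simplified model under stated assumptions, not the storage client's actual logic: `plan_upload` is a hypothetical helper, the defaults are the 32 MB and 4 MB values from the table above, and the real client may also take the thread count into account when choosing between the two paths.

```python
import math

MB = 1024 * 1024


def plan_upload(blob_size: int,
                single_threshold: int = 32 * MB,
                block_size: int = 4 * MB) -> dict:
    """Estimate how many HTTP requests an upload needs (simplified model)."""
    if blob_size <= single_threshold:
        # Small enough to send as one request for the whole blob.
        return {"mode": "single", "blocks": 0, "requests": 1}
    blocks = math.ceil(blob_size / block_size)
    # One request per block, plus one request to commit the block list.
    return {"mode": "blocks", "blocks": blocks, "requests": blocks + 1}


if __name__ == "__main__":
    for size_mb in (10, 32, 100, 1024):
        print(size_mb, "MB ->", plan_upload(size_mb * MB))
```

In this model a 100 MB blob with 4 MB blocks needs 25 block uploads and one commit. Uploading blocks in parallel changes how long those requests take, not how many are issued.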
Parallel and Asynchronous Uploads
Parallel Blobs
Blob Container
Files
Blob Blob Blob
Parallel Blocks
Blob Container
Files
Blob
Parallel Blobs & Blocks
Blob Container
Files
Blob Blob Blob
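The "parallel blobs" pattern amounts to running several independent uploads at once. Below is a minimal sketch using a thread pool. `upload_all` and the `upload_one` callable are hypothetical placeholders: in practice `upload_one` would wrap whatever storage client call performs a single upload, and no real SDK call is shown here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")


def upload_all(paths: Iterable[str],
               upload_one: Callable[[str], T],
               max_workers: int = 8) -> Dict[str, T]:
    """Run upload_one for every path concurrently and collect the results."""
    path_list = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with path_list.
        results = list(pool.map(upload_one, path_list))
    return dict(zip(path_list, results))


if __name__ == "__main__":
    def fake_upload(path: str) -> int:
        # Stand-in for a real upload; returns a pretend byte count.
        return len(path)

    print(upload_all(["a.xml", "bb.xml", "ccc.xml"], fake_upload, max_workers=3))
```

Threads suit this workload because each upload spends most of its time waiting on the network. The worker count plays a similar role to the connection limit and thread-count settings discussed earlier.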
Storage Monitoring Tables
• $MetricsCapacityBlob
• $MetricsTransactionsBlob
• $MetricsTransactionsTable
• $MetricsTransactionsQueue
Handling Outages
• 29th February 2012 – Major outage due to certificate error
  – MVP Summit 2012 – February 28th – March 2nd
• 22nd February 2013 – Storage outage due to certificate error
  – MVP Summit 2013 – February 18th – 22nd
• MVP Summit November 2013 – November 18th – 21st
  – Correlation does not mean causation!
• Consider processing “In the Cloud”
• Modify ServicePointManager Settings
• Use Parallel and Asynchronous Actions
• Tune CloudBlobClient and CloudBlockBlob properties
• Fiddler is Your Friend (Especially the Timeline)
• Use the Source (Windows Azure SDK on GitHub)
• Understand Storage Emulator Limitations
• Understand transient faults
• Understand Pricing Implications
• Leverage Storage Analytics
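As an illustration of the transient-fault point in the list above, a common way to handle short-lived failures is to retry with exponential backoff. The sketch below is generic and not tied to any storage SDK: `TransientError` and `with_retries` are hypothetical names, and which errors count as transient is left to the caller.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def with_retries(operation: Callable[[], T],
                 attempts: int = 4,
                 base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation, retrying on TransientError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last failure.
            # Double the wait each time and add jitter to avoid synchronized retries.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    raise ValueError("attempts must be at least 1")


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky() -> str:
        calls["n"] += 1
        if calls["n"] < 3:
            raise TransientError("simulated timeout")
        return "ok"

    print(with_retries(flaky, base_delay=0.01), "after", calls["n"], "calls")
```

The `sleep` parameter exists so that the waiting can be replaced in tests.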
Thanks!
http://wikisearch.azurewebsites.net/