CS246 Search Engine Scale
Dec 23, 2015
Junghoo "John" Cho (UCLA Computer Science)
High-Level Architecture

Major modules for a search engine?
1. Crawler
   - Page download & refresh
2. Indexer
   - Index construction
   - PageRank computation
3. Query Processor
   - Page ranking
   - Query logging
Scale of Web Search
- Number of pages indexed: ~10B pages
- Index refresh interval: once per month → ~1,200 pages/sec
- Number of queries per day: 290M in Dec 2008 → ~3,000 queries/sec
- Services often run on commodity Intel-Linux boxes
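A quick back-of-envelope check of the query-rate conversion, using the slide's own figures:

```python
# 290M queries/day in Dec 2008, converted to queries per second.
queries_per_day = 290e6
print(queries_per_day / (24 * 3600))  # ~3,356, i.e. the ~3,000 queries/sec above
```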
Other Statistics
- Average page size: 15KB
- Average query size: 40B
- Average result size: 5KB
- Average number of links per page: 10
Size of Dataset (1)
- Total raw HTML data size: 10B pages × 15KB = 150 TB!
- Inverted index roughly the same size as the raw corpus: 150 TB for the index itself
- With appropriate compression at a 3:1 ratio, (150 TB raw + 150 TB index) / 3 → 100 TB of data residing on disk
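The size arithmetic above, spelled out (a sketch that applies the 3:1 ratio to raw corpus and index together):

```python
# 10B pages at 15KB each, an index of comparable size, 3:1 compression.
pages, page_bytes = 10e9, 15e3
raw_tb = pages * page_bytes / 1e12   # 150.0 TB of raw HTML
disk_tb = (raw_tb + raw_tb) / 3      # index ~same size -> ~100 TB on disk
print(raw_tb, disk_tb)               # 150.0, 100.0 (at 1 TB/disk, ~100 disks)
```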
Size of Dataset (2)
- Number of disks necessary for one copy: (100 TB) / (1 TB per disk) = 100 disks
Data Size and Crawling
- Efficient crawling is very important
  - At 1 page/sec → 1,200 machines just for crawling
  - Parallelization through threads/event queues is necessary (see the sketch after this list)
  - Complex crawling algorithms -- no, no!
- A well-optimized crawler: ~100 pages/sec (10 ms/page) → ~12 machines for crawling
- Bandwidth consumption: 1,200 × 15KB × 8 bits ≈ 150 Mbps
  - One dedicated OC3 line (155 Mbps) for crawling → ~$400,000 per year
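A minimal sketch of the thread-based parallelization mentioned above, using only the Python standard library; the URL list, worker count, and fetch helper are illustrative assumptions, not part of the slides:

```python
# Overlap many slow downloads per machine so per-page network latency
# is hidden; the URL list is a stand-in for the crawl frontier.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

URLS = [f"http://example.com/page{i}" for i in range(100)]  # placeholder frontier

def fetch(url):
    try:
        with urlopen(url, timeout=10) as resp:
            return url, resp.read()
    except OSError as err:   # covers DNS failures, timeouts, HTTP errors
        return url, err

with ThreadPoolExecutor(max_workers=50) as pool:
    for url, result in pool.map(fetch, URLS):
        pass  # hand pages to the indexer; log and retry failures
```

With tens of downloads in flight, a single box can plausibly sustain on the order of 100 pages/sec, which is where the ~12-machine figure comes from.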
Data Size and Indexing
- Efficient indexing is very important: 1,200 pages/sec
- Indexing steps:
  1. Load pages, extract words -- network/disk intensive
  2. Sort (word, posting) pairs -- CPU intensive
  3. Write sorted postings -- disk intensive
- Pipeline the indexing steps, so each batch's stages overlap with the next batch's:
  time →
  P1:  L  S  W
  P2:     L  S  W
  P3:        L  S  W
  (L = load, S = sort, W = write; P1-P3 = successive batches of pages)
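A toy sketch of this pipeline, assuming one thread per stage connected by queues (batch contents are invented for illustration):

```python
# Three stages run concurrently: while batch i is being written,
# batch i+1 is being sorted and batch i+2 is being loaded.
import queue, threading

DONE = object()                      # sentinel marking end of stream
load_q, write_q = queue.Queue(2), queue.Queue(2)

def load():                          # network/disk-intensive stage
    for i in range(3):
        load_q.put([f"p{i}-word{j}" for j in (2, 0, 1)])
    load_q.put(DONE)

def sort():                          # CPU-intensive stage
    while (batch := load_q.get()) is not DONE:
        write_q.put(sorted(batch))
    write_q.put(DONE)

def write():                         # disk-intensive stage
    while (batch := write_q.get()) is not DONE:
        print("writing postings:", batch)

threads = [threading.Thread(target=f) for f in (load, sort, write)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```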
Simplified Indexing Model
- Model: copy 50TB of data from disks in S to disks in D through the network
  - S: crawling machines; D: indexing machines
  - 50TB crawled data, 50TB index; ignore the actual processing
- 1TB disk per machine → ~50 machines in S and D each
- 8GB RAM per machine
Data Flow
- Disk → RAM → network → RAM → Disk
- No hardware is error free:
  - Disk (undetected) error rate: ~1 per 10^13
  - Network error rate: ~1 bit per 10^12
  - Memory soft error rate: ~1 bit error per month per 1GB
- Such errors typically go unnoticed for small data
Data Flow
- Assuming a 1Gbit/s link between machines, 1TB per machine at a 30MB/s transfer rate
- → Half a day just for the data transfer
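A quick check of the "half a day" figure:

```python
# 1 TB per machine moved at an effective 30 MB/s.
seconds = 1e12 / 30e6
print(seconds / 3600)   # ~9.3 hours, i.e. roughly half a day
```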
Errors from Disk
- Undetected disk error rate: ~1 per 10^13
- 5×10^13 bytes read in total, 5×10^13 bytes written in total
- → ~10 byte errors from disk reads/writes
Errors from Memory
- 1 bit error per month per 1GB of RAM
- 100 machines with 8GB each → 8 × 100 = 800 bit errors/month
- → ~15 bit errors per half day, i.e. ~15 byte errors from memory corruption
Errors from Network
- Network error rate: ~1 bit error per 10^12 bits
- 5 × 8 × 10^13 = 4 × 10^14 bits transferred
- → ~400 bit errors scattered around the stream → ~400 byte errors
Data Size and Errors (1)
- During index construction/copy, something always goes wrong:
  - ~400 byte errors from the network, ~10 from disk, ~15 from memory
- Very difficult to trace and debug, particularly the disk and memory errors
  - No OS or application yet anticipates such errors
- These are pure hardware errors, but they are very difficult to distinguish from software bugs, which can cause similar corruption
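Re-deriving the error budget from the stated rates (assuming 50 TB read plus 50 TB written at the disks, 50 TB crossing the network, and 100 machines with 8 GB RAM each over roughly half a day):

```python
disk_errs = (5e13 + 5e13) / 1e13   # ~10 byte errors from disk I/O
net_errs  = (5e13 * 8) / 1e12      # ~400 bit (so ~400 byte) errors in transit
ram_errs  = 100 * 8 / 30 / 2       # ~13 bit errors per half day, ~15 as stated
print(disk_errs, net_errs, ram_errs)
```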
Data Size and Errors (2)
- Very difficult to trace and debug:
  - Data corruption in the middle of, say, sorting completely screws up the sort
- Need a data-verification step after every operation (see the sketch below)
- Algorithms and data structures must be resilient to data corruption: checkpoints, etc.
- ECC RAM is a must: it can detect (and typically correct) single-bit errors
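A minimal sketch of per-operation verification: checksum the data before an operation (here, a copy) and verify it afterwards; the file names are illustrative:

```python
import hashlib, shutil

def sha256_of(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

before = sha256_of("postings.dat")
shutil.copyfile("postings.dat", "postings.copy")   # the operation being checked
if sha256_of("postings.copy") != before:
    raise IOError("corruption during copy -- roll back to the last checkpoint")
```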
Data Size and Reliability
- Disk mean time to failure: ~3 years
- (3 × 365 days) / 100 disks ≈ 10 days → one disk failure every 10 days
- Remember, this is just for one copy
- Data organization should be very resilient to disk failure
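Checking the failure interval:

```python
# 100 disks, each with a ~3-year mean time to failure.
print(3 * 365 / 100)   # ~11 days between failures, i.e. one every ~10 days
```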
Data Size and Query Processing
- Index size: 50TB → 50 disks → potentially a 50-machine cluster to answer each query
  - If one machine goes down, the whole cluster goes down
- A multi-tier index structure can be helpful (see the sketch below):
  - Tier 1: popular (high-PageRank) page index
  - Tier 2: less popular page index
  - Most queries can be answered by the tier-1 cluster (with fewer machines)
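A toy sketch of two-tier query routing: serve from the small tier-1 index of popular pages, and fall back to the full tier-2 index only on a miss (the dictionaries stand in for index clusters):

```python
tier1 = {"ucla": ["page_a", "page_b"]}            # popular pages, small cluster
tier2 = {"ucla": ["page_a", "page_b", "page_c"],  # full index, big cluster
         "obscure": ["page_x"]}

def search(query):
    hits = tier1.get(query)
    return hits if hits is not None else tier2.get(query, [])

print(search("ucla"))     # answered by tier 1 alone
print(search("obscure"))  # misses tier 1, falls through to tier 2
```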
Implication of Query Load
- 3,000 queries/sec
- Rule of thumb: 1 query/sec per CPU (depends on the number of disks, memory size, etc.)
  - → ~3,000 machines just to answer queries
- 5KB per answer page → 3,000 × 5KB × 8 bits ≈ 120 Mbps
  - Half a dedicated OC3 line (155 Mbps) → ~$300,000
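Checking the result-page bandwidth:

```python
# 3,000 queries/sec at 5 KB per answer page.
print(3000 * 5e3 * 8 / 1e6)   # ~120 Mbps, vs. 155 Mbps for a full OC3
```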
Hardware at Google
- ~10K-machine Intel-Linux cluster
- Assuming 99.9% uptime (~8 hours of downtime per year) → ~10 machines are always down
  - A nightmare for system administrators
- Assuming a 3-year hardware replacement cycle → set up, replace, and discard ~10 machines every day (see the arithmetic below)
  - Heterogeneity is unavoidable
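The fleet arithmetic, spelled out:

```python
fleet = 10_000
print(fleet * 0.001)      # ~10 machines down at any moment at 99.9% uptime
print(fleet / (3 * 365))  # ~9 machines replaced per day on a 3-year cycle
```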
From a job posting at Google:
  Position Requirements:
  • Able to lift/move 20-30 lbs equipment on a daily basis.