Data Challenges I'm Struggling With
Jim Gray, Microsoft Research
1. Sneakernet is probably the best way to move WAN data at 1GBps. File transfer efforts are currently at 550MBps via Internet2. How do we manage the multi-petabyte file repository we are about to generate?
2. The TerraServer has evolved from a mainframe to a bunch of bricks. The new design has been operating for a year and we are quite pleased with it. But we face "how do you manage a bunch?" and what is the best geoplex strategy?
3. The SkyServer website is built using database technology and web services. We are now moving the web services inside the database. Others are working to design a scale-out version of the server. There are several interesting data challenges in these changes.
4. Using relational tuples to represent spatial volumes as constraints. Point-in-polygon and polygon-overlap queries can then be quickly evaluated. I will briefly describe this idea.
How Do You Move A Terabyte?
| Context | Speed (Mbps) | Rent ($/month) | $/Mbps | $/TB sent | Time/TB |
|---|---|---|---|---|---|
| Home phone | 0.04 | 40 | 1,000 | 3,086 | 6 years |
| Home DSL | 0.6 | 50 | 117 | 360 | 5 months |
| T1 | 1.5 | 1,200 | 800 | 2,469 | 2 months |
| T3 | 43 | 28,000 | 651 | 2,010 | 2 days |
| OC3 | 155 | 49,000 | 316 | 976 | 14 hours |
| OC 192 | 9600 | 1,920,000 | 200 | 617 | 14 minutes |
| 100 Mbps | 100 | | | | 1 day |
| Gbps | 1000 | | | | 2.2 hours |

Source: TeraScale Sneakernet, Microsoft Technical Report May 2002, MSR-TR-2002-54 http://research.microsoft.com/research/pubs/view.aspx?tr_id=569
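The Time/TB and $/TB columns can be reproduced with simple arithmetic. The sketch below is illustrative only: it assumes 1 TB = 10^12 bytes and a 30-day billing month (assumptions chosen because they match the table's figures), and the function names are made up for this example.

```python
BITS_PER_TB = 8 * 10**12          # assumption: decimal terabyte
SECONDS_PER_MONTH = 30 * 86400    # assumption: 30-day billing month


def seconds_per_tb(speed_mbps):
    """Time to push one terabyte through a link of the given speed."""
    return BITS_PER_TB / (speed_mbps * 10**6)


def dollars_per_tb(rent_per_month, speed_mbps):
    """Rent paid while the link is busy sending one terabyte."""
    months_busy = seconds_per_tb(speed_mbps) / SECONDS_PER_MONTH
    return rent_per_month * months_busy


# T1 row: 1.5 Mbps at 1,200 $/month
t1_days = seconds_per_tb(1.5) / 86400
t1_cost = dollars_per_tb(1200, 1.5)
print(f"T1: {t1_days:.0f} days, {t1_cost:.0f} $/TB")
```

Under these assumptions the T1 row works out to roughly two months and about 2,469 $/TB, in line with the table.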
Moving Data Bricks
• WAN costs >> 100$/Mbps/month >> 1$/GB
• Beowulf networking is 10,000x cheaper than WAN; factors of 10^5 matter.
• The cheapest and fastest way to move a Terabyte cross country is sneakernet.
– 24 hours = 4 MB/s
– 50$ shipping vs 1,000$ WAN cost.
Gigabyte Per Second File Mover
• CERN to Pasadena
– Windows TCP/IP stack improvements
– Opteron demo
– Disk-to-disk at 550MBps now (~2 TB/hour)
• What we learned:
– The Linux TCP stack is good/better at high performance; we are catching up.
– NTFS is better than various Linux file systems
– Near the PCI-X limit
– Good way to engage the community.
• GOAL: 1GBps disk-to-disk.
• OC192 = 9.9 Gbps
[Chart: CERN-Caltech Transfer Speeds, Newisys->Newisys. Series: File Transfer MBps and 1 Stream tcp MBps, Mar-04 to Sep-04, on a 0 to 1000 MBps axis, with the PCI-X limit and the tcp limit marked.]
But then what? Managing Petabytes
• CERN files are 30MB
• They produce 1 B files/year.
• How name them?
• How manage them?
• Depends on workload: how use them.
• It’s a DB problem.
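To get a feel for the scale behind these bullets, here is a back-of-the-envelope sketch. It assumes decimal units (1 MB = 10^6 bytes, 1 PB = 10^15 bytes) and reads "1 B files" as one billion files; the variable names are illustrative.

```python
FILE_SIZE_BYTES = 30 * 10**6        # 30 MB per file (decimal MB assumed)
FILES_PER_YEAR = 10**9              # "1 B files/year" read as one billion
SECONDS_PER_YEAR = 365 * 24 * 3600

bytes_per_year = FILE_SIZE_BYTES * FILES_PER_YEAR
petabytes_per_year = bytes_per_year / 10**15
files_per_second = FILES_PER_YEAR / SECONDS_PER_YEAR

print(f"{petabytes_per_year:.0f} PB/year, {files_per_second:.1f} new files/second")
```

At that rate the catalog of file names and locations is itself a large, continuously growing data set, which is one way to read "it's a DB problem".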
TerraServer – What’s new
• Web Service and Web Server
• New ~1 ft²/pixel full color image of 120 urban areas
• Storage Bricks
– “Commodity servers”
– 4 TB raw / 2 TB RAID1 SATA storage
– Dual 2 GHz + 4 GB RAM
– 3 Bricks = TerraServer data
– Data partitioned
– Moving to Yukon
– Working on low TCO
• Load balances mirrors
• Uses surviving database on failure
TerraServer Challenges
• Best Geoplex strategy?
• Moving Web Services into the DB?
• Managing bunches (lower TCO).
SkyServer Challenges
• Parallel data search (data pump).
– How to partition?
– How to manage load?
• Moving web services to DB
– What is the right approach?
• Move objects into DB
– Spatial access methods
– Data analysis in the DB.
A Detail: 3 Ways We Do Spatial?
• Hierarchical mesh (extension to SQL)
– Uses table valued stored procedures
– Acts as a new “spatial access method”
– Porting to Yukon CLR for a 10x speedup.
• Zones: fits SQL like a glove
– Amazingly simple, amazingly good.
• Constraints: a really novel idea
– Lets us do algebra on regions.
• Paper: There Goes the Neighborhood: Relational Algebra for Spatial Data Search
• Complex volumes have holes, and their holes have holes (that is harder).
[Figure: example of a region that is not a convex hull.]
Now in Relational Terms

    create table HalfSpace (
        domainID    int not null              -- domain name
                    foreign key references Domain(domainID),
        convexID    int not null,             -- grouping a set of ½ spaces
        halfSpaceID int identity(1,1),        -- a particular ½ space
        x           float not null,           -- the (a,b,..) parameters
        y           float not null,           -- defining the ½ space
        z           float not null,
        c           float not null,           -- the constant (“c” above)
        primary key (domainID, convexID, halfSpaceID)
    )

(x,y,z) is inside a convex if it is inside all lines of the convex.
(x,y,z) is inside a convex if it is NOT OUTSIDE ANY line of the convex.

    select convexID                           -- return the convex hulls
    from HalfSpace                            -- from the constraints
    where @x * x + @y * y + @z * z < c        -- point outside the line?
    group by all convexID                     -- consider all the lines of a convex
    having count(*) = 0                       -- keep a convex only if no line excludes the point
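The same "not outside any line" test can be sketched outside the database. This is a minimal illustration rather than the talk's implementation: the rows, the IDs, and the function name are hypothetical, and it assumes a point is inside a half-space when x*px + y*py + z*pz >= c.

```python
from collections import Counter

# Hypothetical HalfSpace rows: (convex_id, x, y, z, c).
# Assumed convention: inside when x*px + y*py + z*pz >= c.
HALF_SPACES = [
    (1, 1.0, 0.0, 0.0, 0.0),    # convex 1: px >= 0
    (1, -1.0, 0.0, 0.0, -1.0),  # convex 1: px <= 1
    (2, 1.0, 0.0, 0.0, 5.0),    # convex 2: px >= 5
]


def convexes_containing(point, rows=HALF_SPACES):
    """Return IDs of convexes for which the point is outside no half-space."""
    px, py, pz = point
    outside = Counter()   # per convex: how many half-spaces exclude the point
    all_ids = set()       # every convex, like "group by all"
    for convex_id, x, y, z, c in rows:
        all_ids.add(convex_id)
        if px * x + py * y + pz * z < c:
            outside[convex_id] += 1
    # Mirrors "having count(*) = 0": keep convexes with no violations.
    return sorted(cid for cid in all_ids if outside[cid] == 0)


print(convexes_containing((0.5, 0.0, 0.0)))
```

Tracking every convex ID separately from the violation counts plays the role of `group by all`: a convex with no violating rows still shows up, with a count of zero.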
Zone Approach
• Divide space into zones
• Key points by Zone, offset (on the sphere this needs a wrap-around margin).
• Point search: look in a few zones at a limited offset, ra ± r
– a bounding box that has 1-π/4 false positives
• All inside the relational engine
• Avoids “impedance mismatch”
• Can “batch” all-all comparisons
• 33x faster and parallel: 6 days, not 6 months!
[Figure: search circle of radius r crossing a zone boundary. Labels: r; ra-zoneMax; √(r²+(ra-zoneMax)²) / cos(radians(zoneMax)); zoneMax; x; ra ± x.]
In SQL

    select o1.objID                                 -- find objects
    from zone o1                                    -- in the zoned table
    where o1.zoneID between                         -- where zone #
              floor((@dec-@r)/@zoneHeight) and      -- overlaps the circle
              floor((@dec+@r)/@zoneHeight)
      and o1.ra  between @ra - @r and @ra + @r      -- quick filter on ra
      and o1.dec between @dec-@r and @dec+@r        -- quick filter on dec
      and sqrt( power(o1.cx-@cx,2) + power(o1.cy-@cy,2) + power(o1.cz-@cz,2) )
              < @r                                  -- careful filter on distance

• Bounding box: the quick filters on ra and dec.
• The careful distance filter eliminates the ~21% = 1-π/4 false positives.
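The zone idea can also be sketched in a few lines of Python. This is a simplified illustration, not the SkyServer code: it treats (ra, dec) as flat plane coordinates (so no cos(dec) correction and no wrap-around margin), and the zone height, function names, and sample points are all hypothetical.

```python
import math
from collections import defaultdict

ZONE_HEIGHT = 1.0  # hypothetical zone height, in the same units as dec


def build_zones(points, zone_height=ZONE_HEIGHT):
    """Bucket (obj_id, ra, dec) points by zone number = floor(dec / height)."""
    zones = defaultdict(list)
    for obj_id, ra, dec in points:
        zones[math.floor(dec / zone_height)].append((ra, dec, obj_id))
    return zones


def search(zones, ra, dec, r, zone_height=ZONE_HEIGHT):
    """Find objects within distance r of (ra, dec), flat-plane assumption."""
    lo = math.floor((dec - r) / zone_height)
    hi = math.floor((dec + r) / zone_height)
    hits = []
    for zone_id in range(lo, hi + 1):            # only zones the circle overlaps
        for o_ra, o_dec, obj_id in zones.get(zone_id, []):
            if not (ra - r <= o_ra <= ra + r):    # quick filter on ra
                continue
            if not (dec - r <= o_dec <= dec + r): # quick filter on dec
                continue
            if math.hypot(o_ra - ra, o_dec - dec) < r:  # careful filter
                hits.append(obj_id)
    return sorted(hits)


# Share of the bounding box that lies outside the inscribed circle.
FALSE_POSITIVE_FRACTION = 1 - math.pi / 4

sample = [(1, 10.0, 5.0), (2, 10.9, 5.9), (3, 10.5, 5.5), (4, 20.0, 5.0)]
print(search(build_zones(sample), ra=10.0, dec=5.0, r=1.0))
```

In the sample, object 2 passes both quick filters but sits in a corner of the bounding box, so only the careful distance test rejects it; that corner area is the 1-π/4 share the slide refers to. A database would replace the inner loop with an index range scan on (zoneID, ra).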
Summary
• SQL is a set oriented language
• You can express constraints as rows
• Then you
– Can evaluate LOTS of predicates per second
– Can do set algebra on the predicates.