Sector: An Open Source Cloud for Data Intensive Computing Robert Grossman University of Illinois at Chicago Open Data Group October 20, 2009
Part 1. Sector
http://sector.sourceforge.net
Sector Overview
Sector is the fastest open source large data cloud
– As measured by MalStone & Terasort
Sector is easy to program
– Supports UDFs, MapReduce & Python over streams
Sector is secure
– A HIPAA compliant Sector cloud is being set up
Sector is reliable
– Sector v1.24 has a backup master node server
About Sector
Yunhong Gu from the Laboratory for Advanced Computing at the University of Illinois at Chicago is the Lead Developer of Sector.
Sector is open source (BSD License) and available from sector.sourceforge.net
The current version is 1.24a
Target Configurations
Sector is designed to run on racks of commodity computers
Typical rack configuration today (Oct. 2009)
– Rack of 32 quad-core 1U computers
– Each computer has 4 x 1 TB disks
– Each computer has a 1 Gbps connection to a top-of-rack switch
Sometimes these are called Raywulf clusters
Google’s Large Data Cloud
Google’s Stack:
Applications
Compute Services: Google’s MapReduce
Data Services: Google’s BigTable
Storage Services: Google File System (GFS)
Hadoop’s Large Data Cloud
Hadoop’s Stack:
Applications
Compute Services: Hadoop’s MapReduce
Data Services
Storage Services: Hadoop Distributed File System (HDFS)
Sector’s Large Data Cloud
Sector’s Stack:
Applications
Compute Services: Sphere’s UDFs
Data Services
Storage Services: Sector’s Distributed File System (SDFS)
Routing & Transport Services: UDP-based Data Transport Protocol (UDT)
Comparing Sector and Hadoop
                    Hadoop                     Sector
Storage Cloud       Block-based file system    File-based
Programming Model   MapReduce                  UDF & MapReduce
Protocol            TCP                        UDP-based protocol (UDT)
Replication         At time of writing         Periodically
Security            Not yet                    HIPAA capable
Language            Java                       C++
Terasort - Sector vs Hadoop Performance
            1 Rack    2 Racks   3 Racks   4 Racks
Nodes       32        64        96        128
Cores       128       256       384       512
Hadoop      85m 49s   37m 0s    25m 14s   17m 45s
Sector      28m 25s   15m 20s   10m 19s   7m 56s
Speed up    3.0       2.4       2.4       2.2
Sector/Sphere 1.24a, Hadoop 0.20.1 with no replication on Phase 2 of Open Cloud Testbed with co-located racks.
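The speed-up row can be checked directly from the Terasort run times. A minimal sketch in Python; the function and variable names are illustrative only and are not part of Sector or Hadoop:

```python
def to_seconds(minutes, seconds):
    """Convert a (minutes, seconds) run time to seconds."""
    return minutes * 60 + seconds


# Terasort run times from the slide, keyed by number of racks.
hadoop = {1: (85, 49), 2: (37, 0), 3: (25, 14), 4: (17, 45)}
sector = {1: (28, 25), 2: (15, 20), 3: (10, 19), 4: (7, 56)}


def speedups(baseline, candidate):
    """Ratio of baseline time to candidate time per rack count, to one decimal."""
    return {
        racks: round(to_seconds(*baseline[racks]) / to_seconds(*candidate[racks]), 1)
        for racks in baseline
    }


if __name__ == "__main__":
    for racks, ratio in speedups(hadoop, sector).items():
        print(f"{racks} rack(s): Sector is {ratio}x faster than Hadoop")
```

The ratios reproduce the Speed up row: 3.0, 2.4, 2.4 and 2.2.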
MalStone (OCC-Developed Benchmark)
                               MalStone A   MalStone B
Hadoop                         455m 13s     840m 50s
Hadoop streaming with Python   87m 29s      142m 32s
Sector/Sphere                  33m 40s      43m 44s
Speed up (Sector v Hadoop)     13.5x        19.2x
Sector/Sphere 1.20, Hadoop 0.18.3 with no replication on Phase 1 of Open Cloud Testbed in a single rack. The data consisted of 500 million 100-byte records on each of 20 nodes.
How Do You Program A Data Center?
Idea 1 – Support UDFs Over a Data Center
Think of MapReduce as
– Map acting on (text) records
– With fixed Shuffle and Sort
– Followed by Reduce acting on (text) records
We generalize this framework as follows:
– Support a sequence of User Defined Functions (UDFs) acting on segments (= chunks) of files.
– MapReduce is one special case consisting of a user-defined Map, a system-defined shuffle and sort, and a user-defined Reduce.
– In both cases, the framework takes care of assigning nodes to process data, restarting failed processes, etc.
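The generalization can be sketched in a few lines of Python. This is an in-memory illustration under simplifying assumptions, not Sphere's actual API: a segment is a list of records, a UDF maps one segment to a new segment, and the names apply_udfs, map_reduce, word_map and word_reduce are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def apply_udfs(segments, udfs):
    """Apply each UDF in order to every segment (a segment is a list of records)."""
    for udf in udfs:
        segments = [udf(segment) for segment in segments]
    return segments


def map_reduce(segments, map_fn, reduce_fn):
    """MapReduce as a special case: Map UDF, fixed shuffle and sort, Reduce UDF."""
    # User-defined Map, applied as a UDF over each segment.
    mapped = apply_udfs(
        segments, [lambda seg: [kv for record in seg for kv in map_fn(record)]]
    )
    # System-defined shuffle and sort: group values by key, keys in sorted order.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped):
        groups[key].append(value)
    shuffled = [[(key, groups[key])] for key in sorted(groups)]
    # User-defined Reduce, applied as a UDF over each key group.
    reduced = apply_udfs(
        shuffled, [lambda seg: [(key, reduce_fn(key, values)) for key, values in seg]]
    )
    return list(chain.from_iterable(reduced))


def word_map(record):
    """Emit (word, 1) for every word in a text record."""
    return [(word, 1) for word in record.split()]


def word_reduce(key, values):
    """Sum the counts collected for one word."""
    return sum(values)


if __name__ == "__main__":
    segments = [["a b a", "c"], ["b a"]]
    print(map_reduce(segments, word_map, word_reduce))
```

Node assignment and restart of failed processes, which the framework handles in a real deployment, are left out of this sketch.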
Applying UDF using Sector/Sphere
[Diagram: Application, Sphere Client, three SPEs, input stream, output stream]
1. Split data
2. Locate & schedule Sphere Processing Engine (SPE)
3. Collect results
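The three steps can be imitated on a single machine. In the sketch below, threads from Python's standard concurrent.futures module stand in for SPEs; this is an assumption made for illustration only, since real SPEs run on the nodes that hold the data. The names split, run_on_workers and count_long are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, segment_size):
    """Step 1: split the input stream into fixed-size segments."""
    return [records[i:i + segment_size] for i in range(0, len(records), segment_size)]


def run_on_workers(records, udf, segment_size=2, workers=3):
    """Split the data, schedule the UDF on a pool of workers, collect the results."""
    segments = split(records, segment_size)
    # Step 2: schedule one UDF call per segment on the worker pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Step 3: collect results; pool.map returns them in input order.
        return list(pool.map(udf, segments))


def count_long(segment):
    """Example UDF: count the records longer than three characters."""
    return sum(1 for record in segment if len(record) > 3)


if __name__ == "__main__":
    print(run_on_workers(["alpha", "be", "gamma", "de", "eps"], count_long))
```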
Sector Programming Model
A Sector dataset consists of one or more physical files
Sphere applies User Defined Functions over streams of data consisting of data segments
Data segments can be data records, collections of data records, or files
Examples of UDFs: Map function, Reduce function, Split function for CART, etc.
Outputs of UDFs can be returned to the originating node, written to the local node, or shuffled to another node.
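Shuffling UDF output to another node amounts to routing each output record to a destination chosen from its key. A minimal sketch, assuming (key, value) records and a fixed number of destination buckets; bucket_for and shuffle are hypothetical names, and the CRC32-based rule is just one deterministic choice, not necessarily the one Sphere uses.

```python
import zlib


def bucket_for(key, num_buckets):
    """Pick a destination bucket from a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def shuffle(records, num_buckets):
    """Route each (key, value) record to the bucket chosen for its key."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in records:
        buckets[bucket_for(key, num_buckets)].append((key, value))
    return buckets


if __name__ == "__main__":
    output = [("site-a", 1), ("site-b", 1), ("site-a", 1), ("site-c", 1)]
    for index, bucket in enumerate(shuffle(output, 2)):
        print(index, bucket)
```

Because the bucket depends only on the key, all records with the same key arrive at the same destination, which is what a later Reduce-style UDF needs.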
How Do You Move Data in a Cloud & Between Clouds?
Option 1: Use TCP and close your eyes.
Option 2: ?????
Idea 2: Sector is Built on Top of UDT
UDT is a specialized network transport protocol.
UDT can take advantage of wide area, high performance 10 Gbps networks.
Sector is a wide area distributed file system built over UDT.
Sector is layered over the native file system (vs being a block-based file system).
UDT Has Been Downloaded 25,000+ Times
Sterling Commerce
Nifty TV
Globus
Movie2Me
Power Folder
http://udt.sourceforge.net
Alternatives to TCP – Decreasing Increases AIMD Protocols
Increase of packet sending rate x: x ← x + α(x)
Decrease of packet sending rate x, with decrease factor β: x ← (1 − β) x
[Figure: α(x) plotted against x for AIMD (TCP NewReno), UDT, HighSpeed TCP, and Scalable TCP]
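The two update rules are easy to simulate. The sketch below is a toy model: constant_increase gives the classic AIMD case with a fixed increment, while decreasing_increase is a made-up α(x) that shrinks as the rate grows. It is included only to show the shape of a "decreasing increases" rule and is not UDT's actual formula. The capacity value and all function names are assumptions.

```python
def aimd_step(rate, loss, alpha, beta):
    """One update: x <- (1 - beta) x on loss, otherwise x <- x + alpha(x)."""
    if loss:
        return (1.0 - beta) * rate
    return rate + alpha(rate)


def constant_increase(rate):
    """Classic AIMD: the increment does not depend on the current rate."""
    return 1.0


def decreasing_increase(rate, capacity=100.0):
    """Hypothetical alpha(x): large steps far below capacity, small steps near it."""
    return max((capacity - rate) / 10.0, 0.1)


def simulate(alpha, beta, steps, loss_at, start=1.0):
    """Return the sending rate after each step; loss_at is the set of lossy steps."""
    rate = start
    history = []
    for step in range(steps):
        rate = aimd_step(rate, step in loss_at, alpha, beta)
        history.append(rate)
    return history


if __name__ == "__main__":
    print(simulate(constant_increase, beta=0.5, steps=4, loss_at={2}))
    print(simulate(decreasing_increase, beta=0.5, steps=4, loss_at={2}))
```

With the constant increment the rate climbs by one unit per step and halves on loss; with the decreasing increment it ramps up much faster from a low rate, which is the behavior that matters on long, high-bandwidth paths.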
UDT Makes Wide Area Clouds Possible
Using UDT, Sector can take advantage of wide area high performance networks (10+ Gbps)
10 Gbps per application
What About Security?
Idea 3: Add Security From the Start
Security server maintains information about users and slaves.
User access control: password and client IP address.
File level access control.
Messages are encrypted over SSL. A certificate is used for authentication.
Sector is HIPAA capable.
[Diagram: Security Server, Master, Client, and Slaves, with links labeled AAA, SSL, SSL, and data]
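User access control by password and client IP address can be sketched with Python's standard library. This is an illustration of the idea only, not the Sector security server's code or data format: the user record layout, the PBKDF2 parameters, and the names hash_password and is_allowed are all assumptions.

```python
import hashlib
import hmac
import ipaddress


def hash_password(password, salt):
    """Derive a password hash with PBKDF2-HMAC-SHA256 (parameters are illustrative)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def is_allowed(user_db, user, password, client_ip):
    """Accept a login only if the password matches and the client IP is permitted."""
    record = user_db.get(user)
    if record is None:
        return False
    candidate = hash_password(password, record["salt"])
    # Constant-time comparison avoids leaking how many bytes matched.
    if not hmac.compare_digest(candidate, record["password_hash"]):
        return False
    address = ipaddress.ip_address(client_ip)
    return any(
        address in ipaddress.ip_network(network)
        for network in record["allowed_networks"]
    )


if __name__ == "__main__":
    salt = b"example-salt"
    user_db = {
        "alice": {
            "salt": salt,
            "password_hash": hash_password("s3cret", salt),
            "allowed_networks": ["10.0.0.0/8"],
        }
    }
    print(is_allowed(user_db, "alice", "s3cret", "10.1.2.3"))
    print(is_allowed(user_db, "alice", "s3cret", "192.168.1.5"))
```

File level access control and SSL encryption of messages would sit on top of a check like this and are not shown.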
For More Information About Sector
Yunhong Gu and Robert L Grossman, Sector and Sphere: Towards Simplified Storage and Processing of Large Scale Distributed Data, Philosophical Transactions of the Royal Society A, Volume 367, Number 1897, pages 2429--2445, 2009
http://arxiv.org/abs/0809.1181
http://rsta.royalsocietypublishing.org/content/367/1897/2429
For Related Information
Related information can be found at:
– blog.rgrossman.com
– www.rgrossman.com
Sector Sponsors