Transcript
Page 1: PDS Data Movement and Storage Planning (PMWG)

PDS Data Movement and Storage Planning (PMWG)

PDS MC F2F, UCLA

Dan Crichton, November 28-29, 2012

Page 2: PDS Data Movement and Storage Planning (PMWG)

Growth of Planetary Data Archived from U.S. Solar System Research

Yes, size matters, but so does complexity…

Page 3: PDS Data Movement and Storage Planning (PMWG)

Big Data Challenges

• Storage
• Computation
• Movement of Data
• Heterogeneity
• Distribution

…can affect how we generate, manage, and analyze science data.

…commodity computing can help, if architected correctly

Page 4: PDS Data Movement and Storage Planning (PMWG)

Big Data Technologies

Page 5: PDS Data Movement and Storage Planning (PMWG)


Architecting PDS Towards a Decoupled Architecture

[Architecture diagram: Data Providers → Ingest/Transform → PDS Data Management (Core PDS) → Transform/Distribution → Users. Core PDS preserves and ensures the stability and integrity of PDS data; the distribution side improves user support and the usability of the data in the archive; the ingest side improves efficiency and support to deliver high-quality science products to PDS. Cross-cutting concerns: data movement, computation, storage, heterogeneous data.]

Page 6: PDS Data Movement and Storage Planning (PMWG)

Big Data Challenges

• Storage
• Computation
• Movement of Data
• Heterogeneity
• Distribution

…can affect how we generate, manage, and analyze science data.

Page 7: PDS Data Movement and Storage Planning (PMWG)

Storage Eye Chart

• Direct Attached Storage (DAS): DAS-based storage (usually disk or tape) is directly attached to an internal server (point-to-point).

• Network Attached Storage (NAS): A NAS unit or “appliance” is a dedicated storage server connected to an Ethernet network that provides file-based data storage services to other devices on the network. NAS units remove the responsibility of file serving from other servers on the network.

• Storage Area Network (SAN): A SAN is an architecture that connects detached storage devices, such as disk arrays, tape libraries, and optical jukeboxes, to servers in a way that makes the devices appear as local resources.

• Redundant Array of Inexpensive Disks (RAID): RAID combines multiple inexpensive disk drives into an array that (usually) performs better than a single disk drive. The RAID array appears as a single drive to the connected server. RAID technology is typically employed within a DAS, NAS, or SAN solution.

• Cloud Storage: Cloud storage provides capacity that is accessed over the internet or a wide area network (WAN); it is usually purchased on an as-needed basis, and users can expand capacity on the fly. Providers operate highly scalable storage infrastructure, often in physically dispersed locations.

• Solid State Drive (SSD) Storage: SSD technology is evolving to the point where SSDs can, in some cases, start to supplant traditional storage. SSDs that use DRAM-based technology (volatile memory) cannot survive a power loss, but flash-based SSDs (non-volatile), although slower than DRAM-based SSDs, do not require a battery backup and are therefore becoming acceptable in the enterprise. 1 TB SSDs have recently been announced for industrial applications such as military and medical use. SSD technology is rapidly evolving and will be a major contender in the storage arena in the near future.

Page 8: PDS Data Movement and Storage Planning (PMWG)

Storage Architectural Concepts


Page 9: PDS Data Movement and Storage Planning (PMWG)

Cloud Deployment Models

• Public Cloud:
  • Cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services (e.g., Amazon, RackSpace, Nirvanix)
  • Applications are typically “multi-tenant” and physical infrastructure is shared

• Private Cloud:
  • Cloud infrastructure is operated solely for an organization. It may be managed by the organization or on its behalf by a third party, and may exist on premises or at a provider’s site in a hosting center. It may be built with cloud software (e.g., Eucalyptus)

• Hybrid Cloud:
  • The organization provides and manages some resources in-house and has others provided externally
  • Possibility to leverage existing and future technologies with minimal cost (e.g., backup/archive data managed externally, operational data managed internally)

Photo credit: AcuteSys

Page 10: PDS Data Movement and Storage Planning (PMWG)

Many Benefits of Cloud Computing

• Broad network access: accessible from anywhere
• Resource pooling: shared pool of configurable computing resources; reliability through replicas, etc.
• Rapid elasticity: scale when needed with storage and services/cores, etc.
• Measured service: utility computing, pay by the drink, rapidly provisioned

Page 11: PDS Data Movement and Storage Planning (PMWG)

Challenges of Cloud Storage

• Data Integrity
• Ownership (local control, etc.)
• Security
• ITAR
• Data movement to/from cloud
• Procurement
• Cost arrangements

Page 12: PDS Data Movement and Storage Planning (PMWG)

The Planetary Cloud Experiment

• Utility to PDS

• How does it fit the PDS4 architecture?
  • APIs
  • Decoupled storage and services (see the sketch below)

• Data movement challenges?

• Cloud storage tested as a secondary storage option
  • iRODS @ SDSC, Amazon (S3), Nirvanix

IEEE IT Professional, Sept/Oct 2010
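The “decoupled storage and services” idea can be illustrated with a small storage-service abstraction: the rest of the system codes against one interface, and the backing store (local disk, iRODS, commercial cloud) can be swapped without touching that code. The sketch below is illustrative only, not part of PDS4; the class names, bucket, and key layout are hypothetical, and the cloud backend assumes the boto3 library.

```python
# Minimal sketch of a storage service decoupled from the rest of the archive.
# Names (StorageService, LocalStore, S3Store, bucket/keys) are hypothetical.
from abc import ABC, abstractmethod
from pathlib import Path

import boto3  # assumed available; any cloud SDK could sit behind this interface


class StorageService(ABC):
    """Interface the rest of the archive codes against; backends are swappable."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class LocalStore(StorageService):
    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


class S3Store(StorageService):
    def __init__(self, bucket: str) -> None:
        self.bucket = bucket
        self.s3 = boto3.client("s3")

    def put(self, key: str, data: bytes) -> None:
        self.s3.put_object(Bucket=self.bucket, Key=key, Body=data)

    def get(self, key: str) -> bytes:
        return self.s3.get_object(Bucket=self.bucket, Key=key)["Body"].read()


# An ingest or registry service only sees StorageService, so a secondary copy can
# live on local disk, iRODS, or a commercial cloud without changing that service.
def archive_product(store: StorageService, label: str, product: bytes) -> None:
    store.put(f"archive/{label}", product)
```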

Page 13: PDS Data Movement and Storage Planning (PMWG)

Results of Study

Nirvanix, iRODS @ SDSC, Amazon

• Moving massive amounts of data “online” is a limiting factor… more to come

• Varying cost scenarios
  • (target < $500/TB/year)

• Proprietary APIs (but some open-source cloud implementations are gaining steam)

• But entirely feasible as a decoupled “storage service” in PDS4

• A low-risk option is to explore the cloud as an operational, secondary copy and access point for planetary data

Page 14: PDS Data Movement and Storage Planning (PMWG)

Benchmarking (2009)

Page 15: PDS Data Movement and Storage Planning (PMWG)

MER Planning on the Cloud

* Credit: Khawaja Shams

Page 16: PDS Data Movement and Storage Planning (PMWG)

MER Planning: Backup to the Cloud*

[Diagram: daily Mars planning data is archived, compressed, and encrypted in memory, then pushed to Amazon S3 via parallel uploads. Polyphony schedules backups for each of the last 5 days, every day (5x).]

* Credit: Khawaja Shams, George Chang
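A minimal sketch of the pattern on this slide (compress and encrypt in memory, then upload several days' archives to S3 in parallel). This is not the MER/Polyphony implementation; the bucket name, key layout, and the use of boto3 and the cryptography package are assumptions for illustration.

```python
# Sketch: archive, compress, and encrypt in memory, then upload the last five
# days of planning data to S3 in parallel. Illustration only.
import gzip
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta
from pathlib import Path

import boto3
from cryptography.fernet import Fernet

BUCKET = "example-mer-backups"        # hypothetical bucket
KEY = Fernet.generate_key()           # in practice the key is managed externally
fernet = Fernet(KEY)
s3 = boto3.client("s3")


def backup_one_day(day: date, plan_dir: Path) -> str:
    """Archive one day's planning files in memory and push the blob to S3."""
    raw = b"".join(p.read_bytes() for p in sorted(plan_dir.glob(f"{day:%Y%m%d}*")))
    blob = fernet.encrypt(gzip.compress(raw))          # compress, then encrypt
    key = f"backups/{day:%Y/%m/%d}.gz.enc"
    s3.put_object(Bucket=BUCKET, Key=key, Body=blob)
    return key


def backup_last_five_days(plan_dir: Path) -> list[str]:
    """Upload the last five days in parallel, as a daily scheduler would."""
    days = [date.today() - timedelta(days=i) for i in range(5)]
    with ThreadPoolExecutor(max_workers=5) as pool:
        return list(pool.map(lambda d: backup_one_day(d, plan_dir), days))
```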

Page 17: PDS Data Movement and Storage Planning (PMWG)

MER Planning: Data Integrity on the Cloud

[Diagram: backups are pulled back from S3 and compared against the local data; if a downloaded backup does not match the local data, Polyphony immediately schedules another backup of the inconsistent data.]
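The verify-and-reschedule loop can be sketched as a digest comparison: download the stored copy, compare checksums with the local data, and re-queue the backup on a mismatch. Bucket/key names and the requeue hook below are hypothetical, not the Polyphony interfaces.

```python
# Sketch of the integrity check: pull a backup down from S3, compare a digest
# against the locally held data, and re-schedule the backup on a mismatch.
import hashlib

import boto3

s3 = boto3.client("s3")


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_backup(bucket: str, key: str, local_blob: bytes, requeue) -> bool:
    """Return True if the stored copy matches; otherwise requeue the backup."""
    remote_blob = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    if digest(remote_blob) == digest(local_blob):
        return True
    requeue(key)   # e.g. put the item back on the backup work queue
    return False
```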

Page 18: PDS Data Movement and Storage Planning (PMWG)

Big Data Challenges

• Storage
• Computation
• Movement of Data
• Heterogeneity
• Distribution

…can affect how we generate, manage, and analyze science data.

Page 19: PDS Data Movement and Storage Planning (PMWG)

Cloud Computing and Computation

• On-demand computation (scaling to a massive number of cores)

• Amazon EC2, one of the most popular

• Commoditizing super-computing

• Again, architecting systems to decouple “processing” and “computation” so they can be executed on the cloud is key… two examples
  • LMMP example (to come)
  • Airborne data processing (to come)

• Coupled with computational frameworks (e.g., Apache Hadoop)
  • Open-source implementation of Map-Reduce (a minimal streaming example is sketched below)
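Hadoop Streaming is one way such decoupled processing runs on the cloud: the map and reduce steps are plain scripts that read stdin and write tab-separated key/value pairs, and Hadoop handles distribution and sorting between them. The sketch below is a generic skeleton under those assumptions; counting records per key (the "instrument" field) is a hypothetical stand-in for a real processing step.

```python
# Generic Hadoop Streaming skeleton, shown in one file for brevity; Streaming
# normally runs the mapper and reducer as separate scripts, e.g.
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
#       -input <in> -output <out>
import sys
from itertools import groupby


def mapper(lines):
    """Emit one (key, 1) pair per CSV record, keyed by its first field."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if fields and fields[0]:
            print(f"{fields[0]}\t1")


def reducer(lines):
    """Hadoop sorts by key before the reducer, so equal keys arrive adjacent."""
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        print(f"{key}\t{sum(int(v) for _, v in group)}")


if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if role == "map" else reducer)(sys.stdin)
```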

Page 20: PDS Data Movement and Storage Planning (PMWG)

Lunar Mapping and Modeling Project:

Big Data Challenges*

• The image files LMMP manages range from a few gigabytes to hundreds of gigabytes in size, with new data arriving every day

• Lunar surface images are too large to efficiently load and manipulate in memory

• LMMP must make the data readily available in a timely manner for users to view and analyze

• LMMP needs to accommodate large numbers of users with minimal latency

* Credit: Emily Law, George Chang

Page 21: PDS Data Movement and Storage Planning (PMWG)

Cloud Computing Solutions with Map-Reduce

• Slice a large image into many small images, then repeatedly merge and resize until the last merge-and-reduce step yields a reasonably sized image that depicts the entire image (a single-machine analogue is sketched below)

• Amazon EC2 for computing; S3 for storage

• Installed the Hadoop framework on a number of EC2 instances

• Used a distributed approach with Elastic Map-Reduce in Hadoop to tile images

• Developed a hybrid solution (multi-tiered data access approach) to serve images to users from cloud storage
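Below is a local, single-machine analogue of the tiling approach: the "map" step cuts a large array into tiles and shrinks each one in parallel, and the "reduce" step places the shrunken tiles into one small overview of the whole image. It is an illustration with numpy, not the LMMP/Elastic MapReduce code; tile size, downsample factor, and the random stand-in image are arbitrary choices.

```python
# Single-machine analogue of the tiling approach (illustration only):
# "map" slices a large image array into tiles and downsamples each in parallel;
# "reduce" merges the downsampled tiles into one overview of the whole image.
from concurrent.futures import ProcessPoolExecutor

import numpy as np

TILE = 512     # tile edge in pixels
FACTOR = 8     # downsample factor for the overview


def map_tile(args):
    """Map step: cut out one tile and shrink it by simple block averaging."""
    image, row, col = args
    tile = image[row:row + TILE, col:col + TILE]
    h = (tile.shape[0] // FACTOR) * FACTOR
    w = (tile.shape[1] // FACTOR) * FACTOR
    small = tile[:h, :w].reshape(h // FACTOR, FACTOR, w // FACTOR, FACTOR).mean(axis=(1, 3))
    return row, col, small


def reduce_overview(image_shape, mapped):
    """Reduce step: place each shrunken tile into a single overview image."""
    overview = np.zeros((image_shape[0] // FACTOR, image_shape[1] // FACTOR))
    for row, col, small in mapped:
        r, c = row // FACTOR, col // FACTOR
        overview[r:r + small.shape[0], c:c + small.shape[1]] = small
    return overview


if __name__ == "__main__":
    image = np.random.rand(2048, 2048)   # stand-in; a real job reads tiles from storage
    jobs = [(image, r, c) for r in range(0, 2048, TILE) for c in range(0, 2048, TILE)]
    with ProcessPoolExecutor() as pool:
        tiles = list(pool.map(map_tile, jobs))
    print(reduce_overview(image.shape, tiles).shape)   # (256, 256)
```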


Page 22: PDS Data Movement and Storage Planning (PMWG)

LMMP Tiling Test Results (Cloud vs Local)

• Configuration 1
  • 2x Sun Fire 4170
  • Gigabit network interconnects
  • 72 GB RAM
  • 64 GB SSD storage
  • $10K each, plus administration and infrastructure costs

• Configuration 2
  • 20 EC2 Large instances (4 compute units ~ 4x 1 GHz Xeon)
  • 7.5 GB RAM
  • 850 GB storage
  • $0.34/instance/hour

• Configuration 3
  • 4 EC2 Cluster Compute (CC) instances (33.5 compute units)
  • Gigabit interconnects
  • 23 GB RAM
  • 1.69 TB storage
  • $1.60/instance/hour

Page 23: PDS Data Movement and Storage Planning (PMWG)

Cloud Computing: Addressing Challenges

• Cloud has shown very promising results, but there are challenges
  • Proprietary APIs
  • Support for ITAR-sensitive data
  • Data transfer rates to the commercial cloud
  • Firewall issues
  • Procurement
  • Costs for long-term storage

• More work ahead
  • Amazon EC2/S3 has reported that an “ITAR Region” is available
  • Continued benchmarking and optimization has demonstrated increased data transfer rates, particularly using Internet2
  • JPL is developing a “Virtual Private Cloud” connection to Amazon, causing EC2 nodes to appear inside the JPL firewall
  • Improved procurement process to allow JPL projects to use AWS


Page 24: PDS Data Movement and Storage Planning (PMWG)

Big Data Challenges

• Storage
• Computation
• Movement of Data
• Heterogeneity
• Distribution

…can affect how we generate, manage, and analyze science data.

Page 25: PDS Data Movement and Storage Planning (PMWG)

The Planetary Data Movement Experiment

• Online data movement has been a limiting factor for embracing big data technologies

• Conducted in 2006*, 2009 and 2012

• Evaluate trade-offs for moving data
  • to PDS
  • between Nodes
  • to NSSDC/deep archive
  • to Cloud


* C. Mattmann, S. Kelly, D. Crichton, J. S. Hughes, S. Hardman, R. Joyner and P. Ramirez. A Classification and Evaluation of Data Movement Technologies for the Delivery of Highly Voluminous Scientific Data Products. In Proceedings of the NASA/IEEE Conference on Mass Storage Systems and Technologies (MSST2006), pp. 131-135, College Park, Maryland, May 15-18, 2006

Page 26: PDS Data Movement and Storage Planning (PMWG)

Data Xfer Technologies Evaluated

• FTP uses a single connection for transferring files; in general it is ubiquitous and, where possible, the simplest way for PDS to transfer data electronically

• bbFTP uses multiple threads/connections to improve data transfer. It works well as long as the number of connections is kept to a reasonable limit (a simplified multi-connection sketch follows this list)

• GridFTP uses multiple threads/connections. It is part of the Globus project and is used by the climate research community to move models. In general, tests have shown that it is more difficult to set up due to the security infrastructure, etc.

• iRODS uses multiple threads/connections to improve data transfer. It works well as long as the number of connections is kept to a reasonable limit

• FDT uses multiple threads/connections to improve data transfer. It works well as long as the number of connections is kept to a reasonable limit
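The single-connection vs. multi-connection distinction can be seen in a small sketch that fetches several files over FTP with one connection per worker thread. The host, credentials, and file names are made up, and this only parallelizes across files; tools such as bbFTP, GridFTP, and FDT also split a single large file across streams.

```python
# Sketch: fetch several files with one FTP connection per worker thread.
# Host, directory, and file names are hypothetical.
from concurrent.futures import ThreadPoolExecutor
from ftplib import FTP
from pathlib import Path

HOST = "ftp.example.org"              # hypothetical host
REMOTE_DIR = "/archive/bundle"        # hypothetical remote directory
FILES = ["product_1.img", "product_2.img", "product_3.img", "product_4.img"]


def fetch(name: str) -> str:
    """Open a dedicated connection and retrieve one file in binary mode."""
    with FTP(HOST) as ftp:
        ftp.login()                   # anonymous login
        ftp.cwd(REMOTE_DIR)
        with open(Path("downloads") / name, "wb") as out:
            ftp.retrbinary(f"RETR {name}", out.write)
    return name


if __name__ == "__main__":
    Path("downloads").mkdir(exist_ok=True)
    with ThreadPoolExecutor(max_workers=4) as pool:
        for done in pool.map(fetch, FILES):
            print("fetched", done)
```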

Page 27: PDS Data Movement and Storage Planning (PMWG)

Some of our Findings

• Transfer speeds among the nodes differ greatly; however, the fundamental findings about how best to transfer data in each scenario are consistent

• Parallel transfer mechanisms show improvement over conventional transfer mechanisms (FTP, socket-to-socket) for files larger than ~10 MB

• Packaging/bundling small files helps achieve significantly better transfer performance with parallel data transfer (see the sketch after this list)

• Reliability has improved over the past five years in many of the products we have tested

• However, UDP approaches have suffered, largely because more aggressive network infrastructure treats them as distributed denial of service (DDoS) attacks
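The packaging/bundling finding is easy to apply before any of the transfer tools above: collect many small files into one compressed archive so the mover handles a few large objects instead of thousands of tiny ones. A minimal sketch follows; the paths are hypothetical.

```python
# Sketch: bundle a directory of many small files into one compressed tarball
# before transfer. Paths are hypothetical.
import tarfile
from pathlib import Path


def bundle(source_dir: str, bundle_path: str) -> Path:
    """Pack every file under source_dir into a single gzip-compressed tar."""
    out = Path(bundle_path)
    with tarfile.open(out, "w:gz") as tar:
        for path in sorted(Path(source_dir).rglob("*")):
            if path.is_file():
                tar.add(path, arcname=str(path.relative_to(source_dir)))
    return out


# e.g. bundle("browse_images/", "browse_images.tar.gz"), hand the single tarball
# to FTP/bbFTP/iRODS, and unpack it with tarfile on arrival.
```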


[Figure: transfer rate (Y axis) versus file size (X axis) for data movement over a WAN using TCP/IP; GridFTP: blue, bbFTP: red, FTP: green]

Page 28: PDS Data Movement and Storage Planning (PMWG)

Data Movement Recommendations (2010)

• FTP
  • Efficiency: high for files < 1 GB
  • Scalability: linear
  • Reliability: fault rate dependent on the underlying TCP/IP protocol, but 0 faults over 20 hours of testing and tens of GBs of data
  • Ease of use: easy
  • Ease of deployment: easy (standard component on Linux/UNIX/Mac, and some Windows solutions)
  • Cost (operate & implement): low

• bbFTP
  • Efficiency: high
  • Scalability: based on number of threads
  • Reliability: good (supports retransmit; issue with > 12 threads)
  • Ease of use: easy
  • Ease of deployment: easy to deploy on Unix-based systems with /etc/passwd security; can also use Globus GSI security
  • Cost: low

• GridFTP
  • Efficiency: slightly lower than bbFTP
  • Scalability: based on number of threads
  • Reliability: high (supports retransmit)
  • Ease of use: medium
  • Ease of deployment: difficult to deploy; relies on Grid Security Infrastructure and certificate management for hosts, users, and services
  • Cost: medium (hard to deploy)

• Data Brick
  • Efficiency: low
  • Scalability: based on available storage sizes
  • Reliability: high
  • Ease of use: based on brand
  • Ease of deployment: based on brand
  • Cost: based on brand & volume

• FDT
  • Efficiency: very high
  • Scalability: adaptive
  • Reliability: poor
  • Ease of use: medium
  • Ease of deployment: medium
  • Cost: low

• iRODS
  • Efficiency: high
  • Scalability: adaptive
  • Reliability: excellent
  • Ease of use: easy
  • Ease of deployment: difficult
  • Cost: low

Page 29: PDS Data Movement and Storage Planning (PMWG)

Pilot with DNs (Big Data)

• iRODS has proven the most promising for data transfer

• Setting up an iRODS infrastructure for data movement with 3 zones (GEO, USGS, JPL/IMG) as a pilot
  • Run alongside other mechanisms
  • Expand to other nodes if this proves successful

Page 30: PDS Data Movement and Storage Planning (PMWG)

Benchmarks

JPL to Geo (by file size)

Technology   1 MiB   10 MiB   100 MiB   1 GiB   2 GiB
TCP 1        0.55    0.94     0.93      1.33    0.94
TCP 2        0.55    1.07     2.58      2.68    2.73
TCP 4        0.55    1.19     5.07      5.46    5.45
TCP 8        0.56    1.19     8.95      10.6    10.79
TCP 16       0.56    1.19     12.02     18.45   20.32

Geo to JPL (by file size)

Technology   1 MiB   10 MiB   100 MiB   1 GiB   2 GiB
TCP 1        0.36    0.61     0.66      0.58    0.68
TCP 2        0.36    0.63     1.31      1.36    1.37
TCP 4        0.39    0.62     2.26      2.69    2.7
TCP 8        0.41    0.62     3.8       5.06    5.2
TCP 16       0.41    0.63     5.72      8.06    8.87

Page 31: PDS Data Movement and Storage Planning (PMWG)

Benchmarks (2)

USGS to JPL (by file size)

Technology   1 MiB   10 MiB   100 MiB   1 GiB   2 GiB
TCP 1        1.29    2.11     2.59      2.61    1.78
TCP 2        0.93    2.59     3.6       4.01    2.6
TCP 4        0.9     1.87     4.3       4.17    3.22
TCP 8        0.89    2.56     3.95      4.28    3.86
TCP 16       0.89    2.16     4.16      4.19    3.84

JPL to USGS (by file size)

Technology   1 MiB   10 MiB   100 MiB   1 GiB   2 GiB
TCP 1        0.87    0.89     0.88      0.96    N/A
TCP 2        0.83    1.01     1.71      1.81    N/A
TCP 4        0.77    0.91     2.45      3.03    3.12
TCP 8        0.87    1.02     2.89      3.73    3.76
TCP 16       0.81    0.74     3.55      3.79    4.02

Page 32: PDS Data Movement and Storage Planning (PMWG)

Recommendations

• Data Movement
  • PMWG will update its current data movement recommendations based on these results
  • Run the current data movement deployment in parallel with FTP and other mechanisms as a pilot
  • Consider adding another “zone” at NSSDC for electronic data transfers
  • Capture updated benchmarks for Flagstaff after the network upgrade
  • Other DNs can take this up when they reach the larger data-volume thresholds

• Data Storage
  • We now have quite a bit of experience with cloud computing and related technologies on which to base recommendations
  • Focus on requirements for data storage (e.g., a storage service) once other development activities are under control

• Computation
  • The new PDS4 architecture allows us to run computationally intensive services in many different topologies. Explore as needed.