Transcript
AWS Government, Education, and Nonprofit Symposium | Washington, DC | June 25-26, 2015
Measuring Woody Biomass on the South Side of the Sahara at the 40–50 cm Scale Using AWS
Overview of the NASA Head in the Clouds Project presented at the Amazon Web Services Public Summit 2015
Daniel Duffy, [email protected], on Twitter @dqduffy
High Performance Computing Lead at the NASA Center for Climate Simulation (NCCS) – http://www.nccs.nasa.gov and @NASA_NCCS
Goddard Space Flight Center (GSFC) – http://www.nasa.gov/centers/goddard/home/
ESD Project Won Intel Head in the Clouds Challenge Award to Estimate Biomass in the South Sahara

Project Goal
• Use NGA data to estimate tree and bush biomass over the entire arid and semi-arid zone on the south side of the Sahara

Project Summary
• Estimate carbon stored in trees and bushes in the arid and semi-arid south Sahara
• Establish a carbon baseline for later research on expected CO2 uptake on the south side of the Sahara

Principal Investigators
• Dr. Compton J. Tucker, NASA Goddard Space Flight Center
• Dr. Paul Morin, University of Minnesota
Figure: NGA 40 cm imagery showing automated recognition of tree crowns and their shadows.
Partners and Resources

Intel
• Professional services and funding for AWS resources

Amazon Web Services (AWS)
• Compute and storage
• Support to set up the environment

Cycle Computing
• Cloud resource management software
• Services to install and configure the software

Climate Model Data Services (CDS – GSFC Code 600)
• NGA data support

NASA Center for Climate Simulation (NCCS – GSFC Code 606.2)
• System administration, application support, and data movement

NASA CIO
• General cloud consulting and coordination support
Existing Sub-Saharan Arid and Semi-Arid Sub-Meter Commercial Imagery
• 9,600 strips (~80 TB) to be delivered to GSFC
• ~1,600 strips (~20 TB) at GSFC
• Area of Interest (AOI): the arid and semi-arid zone of sub-Saharan Africa
The DigitalGlobe Constellation
The Entire Archive is Licensed to the USG
• GeoEye
• QuickBird
• IKONOS
• WorldView-1
• WorldView-2
• WorldView-3 (available Q1 2015)
Panchromatic and multispectral mapping at the 40- and 50-cm scale
NASA Head in the Clouds Project: Use Niger as the Test Case

NGA data over Niger
• Currently have about 16,000 total scenes covering Niger (the data is already orthorectified)
• For this test case, approximately 3,120 scenes need to be processed to generate the vegetation index
• Each scene is approximately 30,000 x 30,000 data points (pixels)
• Each scene will be broken up into 100 tiles (3,000 x 3,000); see the tiling sketch below

Where is the data?
• The data currently resides within the NCCS and in AWS

Additional data
• If we are successful and have additional time and resources, other African areas can be studied
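The per-scene tiling described above can be sketched in a few lines of Python. This is only an illustration: the slides do not name the specific vegetation index or the band layout of the NGA imagery, so the NDVI-style formula and band positions below are assumptions, not the project's actual processing code.

```python
import numpy as np

# From the slide: each ~30,000 x 30,000 pixel scene is split into 100 tiles of 3,000 x 3,000.
SCENE_SIZE = 30_000
TILE_SIZE = 3_000

def iter_tiles(scene, tile_size=TILE_SIZE):
    """Yield (row, col, tile) views over a scene array shaped (bands, height, width)."""
    _, height, width = scene.shape
    for r in range(0, height, tile_size):
        for c in range(0, width, tile_size):
            yield r, c, scene[:, r:r + tile_size, c:c + tile_size]

def vegetation_index(tile, red_band=2, nir_band=3):
    """NDVI-style index for one multispectral tile.
    The actual index and band ordering used by the project are not stated
    in the slides, so the band positions here are placeholders."""
    red = tile[red_band].astype(np.float32)
    nir = tile[nir_band].astype(np.float32)
    return (nir - red) / np.maximum(nir + red, 1e-6)

if __name__ == "__main__":
    # A small synthetic 4-band scene so the demo runs quickly; a real scene
    # would be (bands, SCENE_SIZE, SCENE_SIZE) and would yield 100 tiles.
    demo = np.random.randint(0, 2048, size=(4, 6_000, 6_000), dtype=np.uint16)
    for r, c, tile in iter_tiles(demo):
        vi = vegetation_index(tile)
        print(f"tile at ({r}, {c}): mean index {vi.mean():.3f}")
```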
NASA Head in the Clouds Project: Processing Requirements

Based on tests run in the NCCS private cloud, the following processing requirements were estimated:
• The tests were run on a single-core (Intel E5-2670 2.5 GHz) virtual machine with 2 GB of memory
• Each of the 3,120 scenes is broken up into 100 tiles
• Each tile took 24 minutes to process
• Hence, one scene takes 24 * 100 = 2,400 minutes of total processor time (about 40 wall-clock hours on a single core)
• Tiles and scenes can be run in parallel
• Total tiles to process = 312,000
• Total compute hours = 124,800

Target completion time
• Completing in 1 month will take between 175 and 200 virtual machines running non-stop (see the arithmetic check below)
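The totals above follow directly from the per-tile timing; a minimal check of the arithmetic, using only the numbers quoted on this slide:

```python
# Reproduce the processing estimates quoted above.
scenes = 3_120
tiles_per_scene = 100
minutes_per_tile = 24

total_tiles = scenes * tiles_per_scene                       # 312,000 tiles
total_compute_hours = total_tiles * minutes_per_tile / 60    # 124,800 hours

# One single-core VM running non-stop for a 30-day month provides 720 hours,
# so finishing in a month needs roughly 175-200 VMs (allowing some overhead).
hours_per_vm_per_month = 30 * 24
vms_needed = total_compute_hours / hours_per_vm_per_month    # ~173 VMs

print(total_tiles, int(total_compute_hours), round(vms_needed))
# 312000 124800 173
```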
NASA Head in the Clouds Project: Input and Output Data

Input data
• Total input of about 8 TB for the 3,120 scenes
• Average of about 2.63 GB of data per scene
• Average of about 26.3 MB of data per tile (see the quick check below)

Intermediate data products
• It is unclear how much intermediate data will be produced; this will impact the amount of temporary space required for each run

Output data products
• Total output data is estimated to be 25% of the input data
• Estimated total output is about 2 to 3 TB
• Output data will be transferred back to the NCCS
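The per-scene and per-tile averages follow from the 8 TB total; a quick check (treating 1 TB as 1,024 GB; the slide's figures are rounded):

```python
# Quick check of the input/output volumes quoted above.
total_input_gb = 8 * 1024      # ~8 TB of input imagery
scenes = 3_120
tiles_per_scene = 100

gb_per_scene = total_input_gb / scenes                 # ~2.63 GB per scene
mb_per_tile = gb_per_scene * 1024 / tiles_per_scene    # ~26.9 MB per tile (slide quotes ~26.3 MB)
output_tb = 8 * 0.25                                   # ~2 TB of output (25% of input)

print(f"{gb_per_scene:.2f} GB/scene, {mb_per_tile:.1f} MB/tile, ~{output_tb:.0f} TB output")
```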
Cluster configuration requirements

Number of Cores (how many cores are required on a single node for the application?): 1 per tile
Amount of Memory (RAM) (how much memory per node, or per core, is required?): 2 GB per tile
Operating System (OS) (what operating system does the application need?): Linux (CentOS or Debian)
Libraries/Tools/Software (what additional libraries, tools, compilers, or commercial software must be installed?): None
Parallelization (can the application run in parallel, and how: threaded, MPI, or multiple instances?): Inherently parallel; each scene and/or tile can be processed independently
Cluster (how many nodes are required for a parallel run?): 175–200 to complete in 1 month; more can be used (an illustrative launch request follows this table)
Storage (how much space is required per run for input, intermediate, and output files?): Total input 8 TB (approx. 2.6 GB per scene); intermediate to be determined; total output back to the NCCS approx. 2 TB (about 25% of the total input)
Shared Storage (does this storage have to be shared across all nodes?): No
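In the project, Cycle Computing provisioned and managed the cluster. Purely to illustrate the scale of the requirement above, an equivalent raw EC2 request might look like the sketch below; the region, AMI ID, and instance type are placeholders, not the project's actual configuration.

```python
import boto3

# Illustrative only: Cycle Computing handled provisioning in the actual project.
ec2 = boto3.client("ec2", region_name="us-east-1")   # region is an assumption

response = ec2.run_instances(
    ImageId="ami-xxxxxxxx",      # placeholder for a CentOS or Debian image, per the OS requirement
    InstanceType="t2.small",     # 1 vCPU and 2 GB RAM, matching the 1 core / 2 GB per tile requirement
    MinCount=175,                # 175-200 single-core workers to finish in ~1 month
    MaxCount=200,
)
print(f"Launched {len(response['Instances'])} worker instances")
```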
Workflow

The workflow spans the NCCS Science Cloud (internal cloud), with its shared file system holding the NGA data at NASA, and AWS, with Amazon S3 and a pool of virtual machines managed through the Cycle Computing system:
• NGA data external to NASA (PGC, DigitalGlobe) will be copied into the NCCS science cloud NGA data repository.
• Virtual machines in the internal cloud can read the data directly from the shared disk in the NASA internal cloud; no additional data movement is required.
• The Cycle Computing DataMan data transfer software will be used to transfer the data into Amazon S3.
• Data to be processed is staged into Amazon S3 and will be moved to the local storage of the VMs for processing. Products could be stored in S3 for transfer to the NCCS at a later time (a minimal staging sketch follows below).
• A resource manager (batch queue) will be running in AWS. Scientists will interact and launch jobs through the Cycle Computing system directly in AWS.
• Virtual machines will be launched in AWS. After a job is completed, the results will be copied back to the NCCS.
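As a concrete sketch of the S3 staging step described above, a minimal example using boto3 is shown below. The bucket name, key layout, and local file paths are invented for the example and are not the project's actual configuration.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "hitc-niger-data"   # hypothetical bucket name

# Pull one tile from S3 to the VM's local disk for processing.
s3.download_file(BUCKET, "scenes/scene_0001/tile_042.tif", "/mnt/local/tile_042.tif")

# ... run the vegetation-index processing on the local file here ...

# Push the product back to S3 so it can be transferred to the NCCS later.
s3.upload_file("/mnt/local/tile_042_vi.tif", BUCKET, "products/scene_0001/tile_042_vi.tif")
```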
Time line

The project schedule runs from December through September and tracks the following activities:
• Bi-Weekly Tag Ups
• Requirements/Scope
• Setup/Configuration
• Test Runs
• Transfer Data to S3
• Configure S3 Buckets
• Production Runs
• Analysis
• Final Report
Why use Cycle Computing and AWS?
• The bigger goal is to analyze the entire arid and semi-arid zone on the south side of the Sahara
  – About 80 TB
  – 10x the data that the initial project will analyze
• On 200 virtual machines, this will take 10 months!
  – How can we accelerate this? (A quick scaling check follows this list.)
• The number of virtual machines can easily be scaled up using the Cycle Computing software and AWS resources
  – Once the data is in AWS, 80 TB of data can be analyzed in approximately the same amount of time as 8 TB of data
  – Scientists really love this part!
• It might take somewhat longer because the data transfers themselves take time; data transfers and computation can be overlapped
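The scaling argument above is straightforward arithmetic; a quick check using the compute estimate from the Niger test case:

```python
# Scaling check: the full ~80 TB archive is roughly 10x the 8 TB Niger test case.
test_case_compute_hours = 124_800                 # from the processing-requirements slide
full_archive_compute_hours = 10 * test_case_compute_hours

hours_per_vm_per_month = 30 * 24                  # one VM running non-stop for a month
months_on_200_vms = full_archive_compute_hours / (200 * hours_per_vm_per_month)
vms_for_one_month = full_archive_compute_hours / hours_per_vm_per_month

print(round(months_on_200_vms, 1), round(vms_for_one_month))
# ~8.7 months on a fixed pool of 200 VMs (the slide quotes about 10 months,
# allowing for overhead); finishing in one month takes on the order of 1,700 VMs.
```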
Thanks goes to the following…

NASA
• Dr. Compton Tucker (Co-PI)
• Katherine Melocik (GSFC)
• Jennifer Small (GSFC)
• Dr. Tsengdar Lee (HQ)
• Daniel Duffy (GSFC)
• Mark McInerney (GSFC)
• Hoot Thompson (GSFC)
• Garrison Vaughn (GSFC)
• Brittany Wills (GSFC)
• Scott Sinno (GSFC)
• Ray Obrien (ARC)
• Richard Schroeder (ARC)
• Milton Checchi (ARC)

University Partners
• Paul Morin (Co-PI, Univ. Minnesota)
• Claire Porter (Univ. Minnesota)
• Jamon Van Den Hoek (Oak Ridge)
Cycle Computing
• Tim Carroll
• Michael Requa
• Carl Chesal
• Bob Nordlund
• Glen Otero
• Rob Futrick

AWS
• Jamie Baker
• Jeff Layton

There are others… My apologies to those I missed. These are typically the ones on our conference calls!
Thank You. This presentation will be uploaded to SlideShare the week following the Symposium.
http://www.slideshare.net/AmazonWebServices