UCSD SAN DIEGO SUPERCOMPUTER CENTER Fran Berman Dr. Francine Berman Director, San Diego Supercomputer Center Professor and High Performance Computing Endowed.
Post on 24-Dec-2015
UCSD
SAN DIEGO SUPERCOMPUTER CENTER
Fran Berman
Dr. Francine Berman
Director, San Diego Supercomputer Center
Professor and High Performance Computing Endowed Chair, UC San Diego
A Grand Challenge for the Information Age
The Fundamental Driver of the Information Age is Digital Data
Shopping
Entertainment
Information
Business
Education
Health
Digital Data Critical for Research and Education
Data at multiple scales in the Biosciences
• Scales: Organisms, Organs, Cells, Organelles, Bio-polymers, Atoms
• Disciplines: Anatomy, Physiology, Cell Biology, Proteomics, Genomics, Medicinal Chemistry
• What genes are associated with cancer?
• What parts of the brain are responsible for Alzheimer's?
Data from multiple sources in the Geosciences
• Sources: Geologic Map, Geo-Chemical, Geo-Physical, Geo-Chronologic, Foliation Map
• Data Integration: complex “multiple-worlds” mediation
• Where should we drill for oil?
• What is the impact of global warming?
• How are the continents shifting?
[Figure labels: Disciplinary Databases, Data Integration, Data Access and Use, Users]
Today’s Presentation
• Data Cyberinfrastructure Today – Designing and developing infrastructure to enable today’s data-oriented applications
• Challenges in Building and Delivering Capable Data Infrastructure
• Sustainable Digital Preservation – Grand Challenge for the Information Age
Data Cyberinfrastructure Today – Designing and Developing Infrastructure for Today’s Data-Oriented Applications
Today’s Data-oriented Applications Span the Spectrum
[Figure: application classes placed along three axes: COMPUTE (more FLOPS), DATA (more BYTES), NETWORK (more BW)]
• Home, Lab, Campus, Desktop Applications
• Compute-intensive HPC Applications
• Data-intensive and Compute-intensive HPC applications
• Data-intensive applications
• Grid Applications
• Data Grid Applications
Designing Infrastructure for Data:
• Data and High Performance Computing
• Data and Grids
• Data and Cyberinfrastructure Services
Data and High Performance Computing
• For many applications, “balanced systems” are needed to support codes which are both data-intensive and compute-intensive. These are codes for which
  • Grid platforms are not a strong option
  • Data must be local to computation
  • I/O rates exceed WAN capabilities
  • Continuous and frequent I/O is latency intolerant
• Scalability is key
  • Need high-bandwidth and large-capacity local parallel file systems, archival storage
[Figure: COMPUTE (more FLOPS) and DATA (more BYTES) axes showing compute-intensive applications, data-intensive applications, and the data-intensive and compute-intensive HPC applications between them]
Earthquake Simulation at Petascale – better prediction accuracy creates greater data-intensive demands
Estimated figures for simulated 240 second period, 100 hour run-time

                              TeraShake domain           PetaShake domain
                              (600x300x80 km^3)          (800x400x100 km^3)
Fault system interaction      NO                         YES
Inner Scale                   200m                       25m
Resolution of terrain grid    1.8 billion mesh points    2.0 trillion mesh points
Magnitude of Earthquake       7.7                        8.1
Time steps                    20,000 (.012 sec/step)     160,000 (.0015 sec/step)
Surface data                  1.1 TB                     1.2 PB
Volume data                   43 TB                      4.9 PB

Information courtesy of the Southern California Earthquake Center
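The mesh-point and time-step figures in the table can be checked against the domain sizes and inner scales it lists. The sketch below is only a sanity check, and it assumes that the mesh is a uniform grid of cubes with side equal to the inner scale and that the simulated period is the number of time steps times the step size. Neither assumption is stated on the slide, and the function names are illustrative.

```python
# Sanity check of the TeraShake / PetaShake table figures.
# Assumptions (not stated on the slide): uniform cubic cells whose side is
# the inner scale, and simulated period = time steps * seconds per step.

def mesh_points(dims_km, inner_scale_m):
    """Count cells when a box of dims_km is cut into cubes of inner_scale_m."""
    points = 1
    for side_km in dims_km:
        points *= round(side_km * 1000 / inner_scale_m)
    return points

def simulated_seconds(steps, seconds_per_step):
    """Length of the simulated period covered by the given number of steps."""
    return steps * seconds_per_step

terashake_points = mesh_points((600, 300, 80), 200)
petashake_points = mesh_points((800, 400, 100), 25)

print(f"TeraShake: {terashake_points:.2e} mesh points, "
      f"{simulated_seconds(20_000, 0.012):.0f} s simulated")
print(f"PetaShake: {petashake_points:.2e} mesh points, "
      f"{simulated_seconds(160_000, 0.0015):.0f} s simulated")
```

Under these assumptions both runs cover the same 240 seconds, while the finer inner scale raises the mesh size by roughly three orders of magnitude, which is in line with the jump from terabytes to petabytes of output in the table.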
Data and HPC: What you see is what you’ve measured
FLOPS alone are not enough.
Appropriate benchmarks needed to rank/bring visibility to more balanced machines critical for today’s applications.
Information courtesy of Jack Dongarra
Cray XD1 -- Custom Interconnect
Dalco Linux Cluster -- Quadrics Interconnect
Sun Fire Cluster -- Gigabit ethernet Interconnect
• Three systems using the same processor and number of processors
  • AMD Opteron, 64 processors, 2.2 GHz
• Difference is in the way the processors are interconnected
• HPC Challenge benchmarks measure different machine characteristics
  • Linpack and matrix multiply are computationally intensive
  • PTRANS (matrix transpose), RandomAccess, bandwidth/latency tests and other tests begin to reflect stress on memory system
Data and Grids
• Data applications were some of the first applications which
  • required Grid environments
  • could naturally tolerate longer latencies
• Grid model supports key data application profiles
  • Compute at site A with data from site B
  • Store Data Collection at site A with copies at sites B and C
  • Operate instrument at site A, move data to site B for storage, post-processing, etc.
CERN data providing key driver for grid technologies
Data Services Key for TeraGrid Science Gateways
• Science Gateways provide common application interface for science communities on TeraGrid
• Data services key for Gateway communities
• Analysis
• Visualization
• Management
• Remote access, etc.
LEAD
GridChem
NVO
Information and images courtesy of Nancy Wilkins-Diehr
Unifying Data over the Grid – the TeraGrid GPFS WAN Effort
• User wish list
  • Unlimited data capacity (everyone’s aggregate storage almost looks like this)
  • Transparent, high speed access anywhere on the Grid
  • Automatic archiving and retrieval
  • No latency
• TeraGrid GPFS-WAN effort focuses on providing “infinite” (SDSC) storage over the grid
  • Looks like local disk to grid sites
  • Uses automatic migration with a large cache to keep files always “online” and accessible
  • Data automatically archived without user intervention
Information courtesy of Phil Andrews
Data Services – Beyond Storage to Use
What services do users want?
• What are the trends and what is the noise in my data?
• How should I display my data?
• How should I organize my data?
• How can I make my data accessible to my collaborators?
• How can I combine my data with my colleague’s data?
• My data is confidential; how do I make sure that it is seen/used only by the right people?
• How do I make sure that my data will be there when I want it?
Integrated Infrastructure
Services: Integrated Environment Key to Usability
[Figure labels]
• Layers: Data Storage, Data Management, Data Manipulation, Data Access
• Many Data Sources: computers, sensor-nets, instruments
• File systems, database systems, collection management, data integration, etc.
• Simulation, analysis, visualization, modeling
Services:
• Database selection and schema design
• Portal creation and collection publication
• Data analysis
• Data mining
• Data hosting
• Preservation services
• Domain-specific tools
  • Biology Workbench
  • Montage (astronomy mosaicking)
  • Kepler (Workflow management)
• Data visualization
• Data anonymization, etc.
Data Hosting: SDSC DataCentral – A Comprehensive Facility for Research Data
• Broad program to support research and community data collections and databases
• DataCentral services include:
  • Public Data Collections and Database Hosting
  • Long-term storage and preservation (tape and disk)
  • Remote data management and access (SRB, portals)
  • Data Analysis, Visualization and Data Mining
  • Professional, qualified 24/7 support
• DataCentral resources include:
  • 1 PB On-line disk
  • 25 PB StorageTek tape library capacity
  • 540 TB Storage-area Network (SAN)
  • DB2, Oracle, MySQL
  • Storage Resource Broker
  • GPFS-WAN with 700 TB
PDB – 28 TB
Web-based portal access
DataCentral Allocated Collections include
Seismology 3D Ground Motion Collection for the LA Basin
Atmospheric Sciences 50 year Downscaling of Global Analysis over California Region
Earth Sciences NEXRAD Data in Hydrometeorology and Hydrology
Elementary Particle Physics AMANDA data
Biology AfCS Molecule Pages
Biomedical Neuroscience BIRN
Networking Backbone Header Traces
Networking Backscatter Data
Biology Bee Behavior
Biology Biocyc (SRI)
Art C5 landscape Database
Geology Chronos
Biology CKAAPS
Biology DigEmbryo
Earth Science Education ERESE
Earth Sciences UCI ESMF
Earth Sciences EarthRef.org
Earth Sciences ERDA
Earth Sciences ERR
Biology Encyclopedia of Life
Life Sciences Protein Data Bank
Geosciences GEON
Geosciences GEON-LIDAR
Geochemistry Kd
Biology Gene Ontology
Geochemistry GERM
Networking HPWREN
Ecology HyperLter
Networking IMDC
Biology Interpro Mirror
Biology JCSG Data
Government Library of Congress Data
Geophysics Magnetics Information Consortium data
Education UC Merced Japanese Art Collections
Geochemistry NAVDAT
Earthquake Engineering NEESIT data
Education NSDL
Astronomy NVO
Government NARA
Anthropology GAPP
Neurobiology Salk data
Seismology SCEC TeraShake
Seismology SCEC CyberShake
Oceanography SIO Explorer
Networking Skitter
Astronomy Sloan Digital Sky Survey
Geology Sensitive Species Map Server
Geology SD and Tijuana Watershed data
Oceanography Seamount Catalogue
Oceanography Seamounts Online
Biodiversity WhyWhere
Ocean Sciences Southeastern Coastal Ocean Observing and Prediction Data
Structural Engineering TeraBridge
Various TeraGrid data collections
Biology Transporter Classification Database
Biology TreeBase
Art Tsunami Data
Education ArtStor
Biology Yeast regulatory network
Biology Apoptosis Database
Cosmology LUSciD
Data Visualization is key
SCEC Earthquake simulations
Visualization of Cancer Tumors
Prokudin-Gorskii historical images
Information and images courtesy of Amit Chourasia, SCEC, Steve Cutchin, Moores Cancer Center, David Minor, U.S. Library of Congress
Building and Delivering Capable Data Cyberinfrastructure
Infrastructure Should be Non-memorable
• Good infrastructure should be
  • Predictable
  • Pervasive
  • Cost-effective
  • Easy-to-use
  • Reliable
  • Unsurprising
• What’s required to build and provide useful, usable, and capable data Cyberinfrastructure?
Building Capable Data Cyberinfrastructure: Incorporating the “ilities”
• Scalability
• Interoperability
• Reliability
• Capability
• Sustainability
• Predictability
• Accessibility
• Responsibility
• Accountability
• …
Reliability: What can go wrong

Entity at risk   What can go wrong                                   Frequency
File             Corrupted media, disk failure                       1 year
Tape             + Simultaneous failure of 2 copies                  5 years
System           + Systemic errors in vendor SW, or malicious user,  15 years
                 or operator error that deletes multiple copies
Archive          + Natural disaster, obsolescence of standards       50 - 100 years

• How can we maximize data reliability?
  • Replication, UPS systems, heterogeneity, etc.
• How can we measure data reliability?
  • Network availability = 99.999% uptime (“5 nines”)
  • What is the equivalent number of “0’s” for data reliability?
Information courtesy of Reagan Moore
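To make the “5 nines” figure concrete, the sketch below converts an availability percentage into allowed downtime per year. It is a minimal illustration of the network-availability measure mentioned on the slide, not a proposed metric for data reliability; the function name and the 365-day year are assumptions made here.

```python
# Convert an availability percentage into allowed downtime per year.
# Assumption: a 365-day year (leap years ignored).

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_minutes_per_year(availability_percent):
    """Minutes per year a service may be down at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * MINUTES_PER_YEAR

for availability in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(availability)
    print(f"{availability}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

At 99.999% the allowance is a little over five minutes a year, which shows how demanding the equivalent target would be for data that must survive for decades.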
Responsibility and Accountability
• Who owns the data?
• Who takes care of the data?
• Who pays for the data?
• Who can access the data?
• What are reasonable expectations between users and repositories?
• What are reasonable expectations between federated partner repositories?
• What are appropriate models for evaluating repositories?
• What incentives promote good stewardship? What should happen if/when the system fails?
Good Data Infrastructure Incurs Real Costs
• Most valuable data must be replicated
• SDSC research collections have been doubling every 15 months.
• SDSC storage is 25 PB and counting. Data is from supercomputer simulations, digital library collections, etc.
[Figure: Archival Storage (TB), log scale from 10 to 100,000, plotted against Date from June-97 to June-09. Series: Model A (8-yr, 15.2-mo 2X), TB Stored, Planned Capacity]
Information courtesy of Richard Moore
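A 15-month doubling time compounds quickly. The sketch below works through the arithmetic under the assumption that growth simply continues at that rate; applying it to the 25 PB storage figure is an illustration chosen here, not a projection made in the talk, and all function names are hypothetical.

```python
import math

# Exponential growth with a fixed doubling time.
# Assumption: the 15-month doubling rate keeps holding; the 25 PB starting
# point is used only to illustrate the arithmetic.

def projected_size(current_size, months_ahead, doubling_months=15.0):
    """Size after months_ahead if it doubles every doubling_months."""
    return current_size * 2 ** (months_ahead / doubling_months)

def annual_growth_factor(doubling_months=15.0):
    """Multiplicative growth over 12 months."""
    return 2 ** (12 / doubling_months)

def months_to_reach(current_size, target_size, doubling_months=15.0):
    """Months needed to grow from current_size to target_size."""
    return doubling_months * math.log2(target_size / current_size)

start_pb = 25
for months in (15, 30, 60):
    print(f"after {months} months: {projected_size(start_pb, months):.0f} PB")
print(f"growth per year: x{annual_growth_factor():.2f}")
print(f"months to reach 1000 PB: {months_to_reach(start_pb, 1000):.0f}")
```

Under this assumption capacity has to grow by roughly three quarters every year just to keep pace, which is the cost pressure the slide points to.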
Capacity Costs
• Reliability increased by up-to-date and robust hardware and software for
  • Replication (disk, tape, geographically)
  • Backups, updates, syncing
  • Audit trails
  • Verification through checksums, physical media, network transfers, copies, etc.
• Data professionals needed to facilitate
  • Infrastructure maintenance
  • Long-term planning
  • Restoration, and recovery
  • Access, analysis, preservation, and other services
  • Reporting, documentation, etc.
Capability Costs
Information courtesy of Richard Moore
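Verification through checksums, listed above, can be sketched in a few lines. The example below is a generic illustration using Python's standard hashlib module, not a description of SDSC's tooling: it hashes a reference file and reports any replica whose digest differs. The file names and helper functions are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_mismatched_replicas(reference_path, replica_paths):
    """List replicas whose checksum differs from the reference copy."""
    expected = sha256_of(reference_path)
    return [str(path) for path in replica_paths if sha256_of(path) != expected]

def demo():
    """Create one good and one corrupted replica and report the bad one."""
    with tempfile.TemporaryDirectory() as workdir:
        reference = Path(workdir) / "reference.dat"
        good = Path(workdir) / "replica_good.dat"
        bad = Path(workdir) / "replica_bad.dat"
        reference.write_bytes(b"archival payload")
        good.write_bytes(b"archival payload")
        bad.write_bytes(b"archival payl0ad")  # one corrupted byte
        mismatched = find_mismatched_replicas(reference, [good, bad])
        print("replicas failing verification:", [Path(p).name for p in mismatched])

demo()
```

In practice the reference digest would be recorded at ingest and kept in an audit trail, so that replicas can be re-verified later without trusting any single copy.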
Economic Sustainability
• Making Infinite Funding Finite
• Difficult to support infrastructure for data preservation as an infinite, increasing mortgage
• Creative partnerships help create sustainable economic models
Relay Funding
Consortium support
User fees, recharges
Endowments
Hybrid solutions
Geisel Library at UCSD
Preserving Digital Information Over the Long Term
How much Digital Data is there?
Kilo 10^3
Mega 10^6
Giga 10^9
Tera 10^12
Peta 10^15
Exa 10^18
Zetta 10^21
U.S. Library of Congress manages 295 TB of digital data, 230 TB of which is “born digital”
SDSC HPSS tape archive = 25+ PetaBytes
1 novel = 1 MegaByte
Source: “The Expanding Digital Universe: A forecast of Worldwide Information Growth through 2010” IDC Whitepaper, March 2007
• 5 exabytes of digital information produced in 2003
• 161 exabytes of digital information produced in 2006
  • 25% of the 2006 digital universe is born digital (digital pictures, keystrokes, phone calls, etc.)
  • 75% is replicated (emails forwarded, backed up transaction records, movies in DVD format)
• 1 zettabyte aggregate digital information projected for 2010
iPod (up to 20K songs) = 80 GB
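The 2006 and 2010 figures imply a steep annual growth rate. The sketch below computes it, together with the born-digital share of the 2006 total. It assumes the 161 exabyte and 1 zettabyte figures are directly comparable totals and uses decimal prefixes as in the list above; the variable and function names are illustrative.

```python
# Growth rate implied by the slide's figures.
# Assumption: the 2006 and 2010 numbers are comparable totals.

EXABYTE = 10**18
ZETTABYTE = 10**21

digital_universe_2006 = 161 * EXABYTE
projected_2010 = 1 * ZETTABYTE

def annual_growth_factor(start, end, years):
    """Constant yearly multiplier that takes start to end in the given years."""
    return (end / start) ** (1 / years)

factor = annual_growth_factor(digital_universe_2006, projected_2010, 2010 - 2006)
born_digital_2006 = 0.25 * digital_universe_2006

print(f"implied growth: x{factor:.2f} per year")
print(f"born digital in 2006: {born_digital_2006 / EXABYTE:.2f} exabytes")
```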
How much Storage is there?
• 2007 is the “crossover year” where the amount of digital information is greater than the amount of available storage
• Given the projected rates of growth, we will never have enough space again for all digital information
Source: “The Expanding Digital Universe: A forecast of Worldwide Information Growth through 2010” IDC Whitepaper, March 2007
Focus for Preservation: the “most valuable” data
• What is “valuable”?
  • Community reference data collections (e.g. UniProt, PDB)
  • Irreplaceable collections
  • Official collections (e.g. census data, electronic federal records)
  • Collections which are very expensive to replicate (e.g. CERN data)
  • Longitudinal and historical data
  • and others …
Cost
Time
Value
The Data Pyramid: A Framework for Digital Stewardship
• Preservation efforts should focus on collections deemed “most valuable”
• Key issues:
  • What do we preserve?
  • How do we guard against data loss?
  • Who is responsible?
  • Who pays? Etc.
Pyramid levels:
• National, International Scale
  • Digital Data Collections: Reference, nationally important, and irreplaceable data collections
  • Repositories/Facilities: National / International-scale data repositories, archives, and libraries
• “Regional” Scale
  • Digital Data Collections: Key research and community data collections
  • Repositories/Facilities: “Regional”-scale libraries and targeted data centers
• Local Scale
  • Digital Data Collections: Personal data collections
  • Repositories/Facilities: Private repositories
[Figure arrows: Increasing value, Increasing trust, Increasing risk/responsibility, Increasing stability, Increasing infrastructure]
Digital Collections of Community Value
[Figure: The Data Pyramid, with Local Scale, “Regional” Scale, and National, International Scale levels]
• Key techniques for preservation: replication, heterogeneous support
The Chronopolis Model: A Conceptual Model for Preservation Data Grids
• Geographically distributed preservation data grid that supports long-term management and stewardship of, and access to, digital collections
• Implemented by developing and deploying a distributed data grid, and by supporting its human, policy, and technological infrastructure
• Integrates targeted technology forecasting and migration to support long-term life-cycle management and preservation
[Figure labels: Digital Information of Long-Term Value; Distributed Production Preservation Environment; Technology Forecasting and Migration; Administration, Policy, Outreach]
Chronopolis Focus Areas and Demonstration Project Partners
• Chronopolis R&D, Policy, and Infrastructure Focus areas:
• Assessment of the needs of potential user communities and development of appropriate service models
• Development of formal roles and responsibilities of providers, partners, users
• Assessment and prototyping of best practices for bit preservation, authentication, metadata, etc.
• Development of appropriate cost and risk models for long-term preservation
• Development of appropriate success metrics to evaluate usefulness, reliability, and usability of infrastructure
2 Prototypes:
• National Demonstration Project
• Library of Congress Pilot Project
Partners:
• SDSC/UCSD
• U Maryland
• UCSD Libraries
• NCAR
• NARA
• Library of Congress
• NSF
• ICPSR
• Internet Archive
• NVO
Demonstration Project information courtesy of Robert McDonald
National Demonstration Project – Large-scale Replication and Distribution
• Focus on supporting multiple, geographically distributed copies of preservation collections:
• “Bright copy” – Chronopolis site supports ingestion, collection management, user access
• “Dim copy” – Chronopolis site supports remote replica of bright copy and supports user access
• “Dark copy” – Chronopolis site supports reference copy that may be used for disaster recovery but no user access
• Each site may play different roles for different collections
Chronopolis Federation architecture
Chronopolis Sites: SDSC, U Md, NCAR
[Figure: copies of collections C1 and C2 distributed across the sites: Bright copy C1, Dim copy C1, Dark copy C1, Bright copy C2, Dim copy C2, Dark copy C2]
Demonstration collections included:
• National Virtual Observatory (NVO) [1 TB Digital Palomar Observatory Sky Survey]
• Copy of Interuniversity Consortium for Political and Social Research (ICPSR) data [1 TB Web-accessible Data]
• NCAR Observational Data [3 TB of Observational and Re-Analysis Data]
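The bright, dim, and dark copy roles described above can be modeled as a small lookup from collection and site to role. The sketch below is only illustrative: the capabilities per role follow the slide, but the particular assignment of roles to SDSC, U Md, and NCAR is hypothetical, as is the rule that each collection has exactly one copy of each kind.

```python
from enum import Enum

class CopyRole(Enum):
    BRIGHT = "bright"
    DIM = "dim"
    DARK = "dark"

# Capabilities per role, as described on the slide.
CAPABILITIES = {
    CopyRole.BRIGHT: {"ingestion", "collection management", "user access"},
    CopyRole.DIM: {"remote replica", "user access"},
    CopyRole.DARK: {"disaster recovery"},
}

# Hypothetical assignment: the slide does not say which site plays which role.
ROLE_ASSIGNMENT = {
    "C1": {"SDSC": CopyRole.BRIGHT, "U Md": CopyRole.DIM, "NCAR": CopyRole.DARK},
    "C2": {"U Md": CopyRole.BRIGHT, "NCAR": CopyRole.DIM, "SDSC": CopyRole.DARK},
}

def sites_with_user_access(collection):
    """Sites whose copy of the collection may serve users."""
    roles = ROLE_ASSIGNMENT[collection]
    return sorted(site for site, role in roles.items()
                  if "user access" in CAPABILITIES[role])

def has_one_copy_of_each_role(collection):
    """Assumed rule: one bright, one dim, and one dark copy per collection."""
    roles = list(ROLE_ASSIGNMENT[collection].values())
    return sorted(r.value for r in roles) == sorted(r.value for r in CopyRole)

for name in ROLE_ASSIGNMENT:
    print(name, "user access at:", sites_with_user_access(name),
          "| one of each role:", has_one_copy_of_each_role(name))
```

Keeping the role per collection rather than per site reflects the point that each site may play different roles for different collections.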
SDSC/ UCSD Libraries Pilot Project with U.S. Library of Congress
Goal: To “… demonstrate the feasibility and performance of current approaches for a production digital Data Center to support the Library of Congress’ requirements.”
Library of Congress Pilot Project information courtesy of David Minor
Prokudin-Gorskii Photographs
(Library of Congress Prints and Photographs Division)
http://www.loc.gov/exhibits/empire/
(also collection of web crawls from the Internet Archive)
• Historically important 600 GB Library of Congress image collection
• Images over 100 years old with red, blue, green components (kept as separate digital files).
• SDSC stores 5 copies with dark archival copy at NCAR
• Infrastructure must support idiosyncratic file structure. Special logging and monitoring software developed so that both SDSC and Library of Congress could access information
Pilot Projects provided invaluable experience with key Issues
• Technical Issues
  • How to address integrity, verification, provenance, authentication, etc.
• Legal/Policy Issues
  • Who is responsible?
  • Who is liable?
• Social Issues
  • What formats/standards are acceptable to the community?
  • How do we formalize trust?
• Infrastructure Issues
  • What kinds of resources (servers, storage, networks) are required?
  • How should they operate?
• Evaluation Issues
  • What is reliable?
  • What is successful?
• Cost Issues
  • What is cost-effective?
  • How can support be sustained over time?
It’s Hard to be Successful in the Information Age without reliable, persistent information
• Inadequate/unrealistic general solution: “Let X do it” where X is:
  • The Government
  • The Libraries
  • The Archivists
  • The private sector
  • Data owners
  • Data generators, etc.
• Creative partnerships needed to provide preservation solutions with
  • Trusted stewards
  • Feasible costs for users
  • Sustainable costs for infrastructure
  • Very low risk for data loss, etc.
October 31, 2006
Office of CyberInfrastructure
Blue Ribbon Task Force to Focus on Economic Sustainability
[Figure labels: USER; Federal, State, Local, International, Non-profit, College, University, Commercial]
Image courtesy of Chris Greer
• International Blue Ribbon Task Force (BRTF-SDPA) to begin in 2008 to study issues of economic sustainability of digital preservation and access
• Support from
  • National Science Foundation
  • Library of Congress
  • Mellon Foundation
  • Joint Information Systems Committee
  • National Archives and Records Administration
  • Council on Library and Information Resources
BRTF-SDPA
Charge to the Task Force:
1. To conduct a comprehensive analysis of previous and current efforts to develop and/or implement models for sustainable digital information preservation; (First year report)
2. To identify and evaluate best practice regarding sustainable digital preservation among existing collections, repositories, and analogous enterprises;
3. To make specific recommendations for actions that will catalyze the development of sustainable resource strategies for the reliable preservation of digital information; (Second Year report)
4. To provide a research agenda to organize and motivate future work.
How you can be involved:
• Contribute your ideas (oral and written “testimony”)
• Suggest readings (website will serve as a community bibliography)
• Write an article on the issues for a new community (Important component will be to educate decision makers and the public about digital preservation)
Website to be launched this Fall. Will link from www.sdsc.edu
Many Thanks
• Phil Andrews, Reagan Moore, Ian Foster, Jack Dongarra, Authors of the IDC Report, Ben Tolo, Richard Moore, David Moore, Robert McDonald, Southern California Earthquake Center, David Minor, Amit Chourasia, U.S. Library of Congress, Moores Cancer Center, National Archives and Records Administration, NSF, Chris Greer, Nancy Wilkins-Diehr, and many others …
www.sdsc.edu
berman@sdsc.edu