Data Rods: High Speed, Time-Series Analysis of Massive Data Sets Using Pure Object Database Methods
David Gallaher (1), Qin Lv (2), Glenn Grant (1), Garrett Campbell (1)
1) National Snow and Ice Data Center, University of Colorado, Boulder, Colorado, 80309, USA
2) Department of Computer Science, University of Colorado, Boulder, Colorado, 80309, USA
The National Snow and Ice Data Center
• Creates tools for data access
• Manages and distributes scientific data
• Performs scientific research
• Educates the public about the cryosphere
• Supports data users
Affiliations and Sponsorship
Cooperative Institute for Research in Environmental Sciences
University of Colorado at Boulder
World Data Center for Glaciology (since 1976)
Mission: To monitor climate data in Earth’s icy regions, and to analyze and distribute it worldwide, 24x7. The focus is mainly NASA satellite data.
Data Rods - Project Basis
The “Data Rods” project proposes to create a prototype of a high-speed, scalable database structure for rapid retrieval, filtering, and analysis of massive multi-modality data sets.
Objective: Remote Sensing Data Analysis
The Problem:
• Data sets are becoming too large to move over the internet
• Need for basic Boolean logic for time-series anomaly detection
• Data downloads for long time-series analysis are especially cumbersome
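As a rough illustration of what "basic Boolean logic for time-series anomaly detection" could look like, the sketch below flags time steps whose value departs from the series mean by more than a tolerance, then combines that flag with a second condition using a Boolean AND. The function name, the departure rule, and the sample albedo values are all assumptions made for the example, not part of the Data Rods project.

```python
from statistics import mean

def flag_anomalies(series, max_departure):
    """Return one Boolean per time step: True where the value departs
    from the series mean by more than max_departure (hypothetical rule)."""
    baseline = mean(series)
    return [abs(value - baseline) > max_departure for value in series]

# Hypothetical albedo values for one grid cell over six time steps.
albedo = [0.80, 0.82, 0.79, 0.35, 0.81, 0.80]
flags = flag_anomalies(albedo, max_departure=0.2)

# Boolean combination: anomalous AND below an absolute threshold.
low_and_anomalous = [f and v < 0.5 for f, v in zip(flags, albedo)]
```

Running this kind of test where the data live avoids downloading the full series just to find the few time steps of interest.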
Analysis Challenges
• A wide variety of data formats
• Ever-increasing data set sizes
• Myriad analysis and visualization requirements
• There will be uses and analysis of the data that cannot be anticipated (data discovery is not enough)
• Lack of direct access to the data values (e.g., querying for albedo > 15%)
• Our current directory trees impede data access (we really need to consider a database)
“Big Data” Considerations:
The era of search, order, and transmission of data is ending.
• We must develop systems where the data stay fixed and analyses are run against them
• Rapid, scalable data access across time and space
• Direct query of the data, not just the metadata (we need more than what, where, when)
• Web-based spatio-temporal analysis and visualization
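To make the difference between querying metadata and querying the data concrete, here is a minimal sketch of a value-based query such as "albedo > 15%". The in-memory dictionary stands in for a database, and the store layout, the `query` function, and the sample values are hypothetical.

```python
# Hypothetical store: (x, y) grid cell -> list of (time step, albedo) samples.
store = {
    (0, 0): [(0, 0.10), (1, 0.12)],
    (0, 1): [(0, 0.14), (1, 0.22)],
    (1, 0): [(0, 0.40), (1, 0.38)],
}

def query(store, predicate):
    """Return (x, y, t, value) for every stored sample satisfying predicate."""
    return [
        (x, y, t, value)
        for (x, y), samples in sorted(store.items())
        for t, value in samples
        if predicate(value)
    ]

# Select samples by their value, not by what, where, or when they were recorded.
hits = query(store, lambda albedo: albedo > 0.15)
```

Only the matching samples would need to leave the server, which is the point of keeping the data fixed and sending the analysis to it.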
Database Choice
Fast and efficient storage, query and retrieval of entire data sets – not just the metadata
Ability to store colossal amounts of small files
Relational databases can't handle it. The tables grow too big. (Object-relational is no better)
Hadoop excels at unstructured data, but due to its batch-oriented nature it is inefficient for real-time analytics as well as intra-data analysis
A “pure-object” database is seen as the best choice
The Data Rods Project
The “Data Rods” project has created a high-speed, scalable database structure for rapid retrieval, filtering, and analysis of massive data sets.
We’ll cover the following:
• Database design
• Status on development
• Basic analysis examples and performance
• Planned analysis and potential applications
Database design
Gridded data is key.
For consistency, NSIDC's Equal-Area Scalable Earth Grids (EASE-Grids) tool is used.
Common resolutions between data sets (1 km, 5 km, etc.) and point data
The nesting relationship of differing resolutions in EASE-Grid
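The nesting idea can be sketched with integer index arithmetic. The example below assumes, purely for illustration, that a fine grid and a coarse grid share an origin and that each coarse cell contains a whole number of fine cells per side (for example, a factor of 5 between 1 km and 5 km cells). It is not the EASE-Grid definition itself, and both function names are hypothetical.

```python
def parent_cell(col, row, factor):
    """Map a fine-grid cell index to the coarse cell that contains it."""
    return col // factor, row // factor

def child_cells(col, row, factor):
    """List the fine-grid cell indices nested inside one coarse cell."""
    return [
        (col * factor + dc, row * factor + dr)
        for dr in range(factor)
        for dc in range(factor)
    ]

# Assumed nesting factor of 5 (e.g. 1 km cells inside a 5 km cell).
coarse = parent_cell(7, 12, factor=5)
fine = child_cells(*coarse, factor=5)
```

With a relationship like this, data sets stored at different resolutions can be compared cell by cell without resampling on the fly.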
Data Rods Concept
[Figure: data rods diagram with axes X coordinate, Y coordinate, and Time]
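Reading the figure's axes (X coordinate, Y coordinate, Time) as one time series per grid cell, a "rod" can be sketched as the values of a single (x, y) cell in time order. The code below regroups a stack of 2-D time slices into such rods. This is an interpretation of the concept, not the project's actual storage code, and `slices_to_rods` is a hypothetical helper.

```python
def slices_to_rods(slices):
    """Regroup 2-D time slices, indexed as slices[t][y][x], into rods:
    a dict mapping (x, y) to that cell's values in time order."""
    rods = {}
    for grid in slices:
        for y, row in enumerate(grid):
            for x, value in enumerate(row):
                rods.setdefault((x, y), []).append(value)
    return rods

# Three hypothetical 2x2 time slices.
slices = [
    [[1, 2], [3, 4]],      # t = 0
    [[5, 6], [7, 8]],      # t = 1
    [[9, 10], [11, 12]],   # t = 2
]
rods = slices_to_rods(slices)
time_series = rods[(1, 0)]  # every value at x = 1, y = 0, across time
```

Stored this way, a long time series for one location is a single lookup rather than one read per time slice.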