M.Lautenschlager (WDCC, Hamburg) / 03.09.03 / 1 Semantic Data Management for Organising Terabyte Data Archives Michael Lautenschlager World Data Center for Climate (M&D/MPIMET, Hamburg) CAS2K3 Workshop Sept. 2003 in Annecy, France Home: http://www.mad.zmaw.de/wdcc
M.Lautenschlager (WDCC, Hamburg) / 03.09.03 / 1
Semantic Data Management for Organising Terabyte Data Archives
Michael Lautenschlager
World Data Center for Climate (M&D/MPIMET, Hamburg)
CAS2K3 Workshop Sept. 2003 in Annecy, France
Home: http://www.mad.zmaw.de/wdcc
Content:
• General remarks
• DKRZ archive development
• CERA concept (CERA: Climate and Environmental data Retrieval and Archiving)
• CERA data model and structure
• Automatic fill process
• Database access statistics
Semantic data management
• Data consist of numbers and metadata.
• Metadata construct the semantic data context.
• Metadata form a data catalogue which makes data searchable.
• Data are produced, archived and extracted within their semantic context.
Data without explanation are only numbers.
Problems:
• Metadata are of different complexity for different data types.
• Consistency between numbers and metadata has to be ensured.
DKRZ Archive Development
Basic observations and assumptions:
1) Unix-File archive content at the end of 2002: 600 TB including backups
2) Observed archive rate (Jan. - May 2003): 40 TB/month
3) System changes: 50% compute power increase in August 2003
4) CERA DB size at the end of 2002: 12 TB
5) Observed increase (Jan. - May 2003): 1 TB/month
6) The automatic fill process into the CERA DB is going to become operational with 4 TB/month this year and should increase from 10% of the archiving rate to approx. 30% by the end of 2004
DKRZ Archive Development
DKRZ's Archive Increase (Estim. 09.03), conservative estimate
Data amount [TB] at the end of each year:
Year               2002  2003  2004  2005  2006  2007
Unix-File Archive   600  1200  1920  2640  3360  4080
CERA DB              12    40   184   424   664   904
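The chart values can be cross-checked with a short calculation. The following is only a sketch, not the author's method. It assumes (a reading of the figures, not something stated on the slide) that from 2004 onward the file archive grows at a constant 60 TB/month, i.e. the observed 40 TB/month scaled by the 50% compute power increase. The function name `project_archive` is hypothetical.

```python
def project_archive(start_tb, monthly_rate_tb, years):
    """Return the cumulative archive size in TB at the end of each year."""
    sizes = []
    size = start_tb
    for _ in range(years):
        size += 12 * monthly_rate_tb
        sizes.append(size)
    return sizes


# Values read from the chart (TB at the end of each year)
file_archive = {2002: 600, 2003: 1200, 2004: 1920, 2005: 2640, 2006: 3360, 2007: 4080}
cera_db = {2002: 12, 2003: 40, 2004: 184, 2005: 424, 2006: 664, 2007: 904}

# Assumption: 40 TB/month * 1.5 = 60 TB/month from 2004 onward
projected = project_archive(file_archive[2003], 40 * 1.5, years=4)

# Share of each year's file archive increase that ends up in the CERA DB
years = sorted(file_archive)
cera_share = {
    year: (cera_db[year] - cera_db[prev]) / (file_archive[year] - file_archive[prev])
    for prev, year in zip(years, years[1:])
}

print(projected)
for year, share in cera_share.items():
    print(year, f"{share:.0%}")
```

Under this assumption the projection reproduces the chart values for 2004 to 2007, and the CERA DB share of the yearly increase comes out at 20% for 2004 and about 33% from 2005 onward, in line with the "approx. 30%" target.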
Problems with direct file archive access:
• Missing data catalogue: The directory structure of the Unix file system is not sufficient to organise millions of files.
• Data are not stored application-oriented: Raw data contain time series of 4D data blocks (3D in space and type of variable). The access pattern is time series of 2D fields.
• Lack of experience with climate model data: Problems in extracting relevant information from climate model raw data files.
• Lack of computing facilities at client site: Non-modelling scientists are not equipped to handle large amounts of data (1/2 TB = 10 years T106 or 50 years T42 in 6 hour storage intervals).
Year                         2003    2004    2005    2006    2007
Estimated file archive size  1.2 PB  1.9 PB  2.6 PB  3.4 PB  4.1 PB
Limits of model resolution
ECHAM4 (T42): grid resolution 2.8°, time step 40 min
ECHAM4 (T106): grid resolution 1.1°, time step 20 min
Noreiks (MPIM), 2001
CERA Concept: Semantic Data Management
(I) Data catalogue and pointer to Unix files
• Enables search and identification of data
• Allows access to the data as they are
(II) Application-oriented data storage
• Time series of individual variables are stored as BLOB entries in DB tables: allows for fast and selective data access
• Storage in standard file format (GRIB): allows for application of standard data processing routines (PINGOs)
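The idea behind application-oriented storage can be sketched as follows: raw output is ordered by time step, with all variables in each step, while the archive should hold one time series of 2D fields per variable. This is a simplified illustration, not the CERA implementation. The function names are hypothetical, the fields are plain 2D grids without vertical levels, and `struct` packing is used only as a stand-in for the GRIB encoding named on the slide.

```python
import struct
from collections import defaultdict


def field_to_blob(field):
    """Pack a 2D field (list of rows of floats) into bytes.

    Stand-in for GRIB encoding: two unsigned ints (rows, columns)
    followed by the values as little-endian 32-bit floats.
    """
    nlat, nlon = len(field), len(field[0])
    flat = [value for row in field for value in row]
    return struct.pack(f"<II{nlat * nlon}f", nlat, nlon, *flat)


def blob_to_field(blob):
    """Inverse of field_to_blob."""
    nlat, nlon = struct.unpack_from("<II", blob)
    flat = struct.unpack_from(f"<{nlat * nlon}f", blob, offset=8)
    return [list(flat[i * nlon:(i + 1) * nlon]) for i in range(nlat)]


def reorganise(raw_records):
    """Turn time-ordered raw records into per-variable time series of blobs.

    raw_records: iterable of (time_step, {variable_name: 2D field}).
    Returns {variable_name: [(time_step, blob), ...]}.
    """
    series = defaultdict(list)
    for time_step, fields in raw_records:
        for name, field in fields.items():
            series[name].append((time_step, field_to_blob(field)))
    return dict(series)


# Hypothetical raw output: two 6-hourly time steps, two variables, 2x3 grid
raw = [
    (0, {"temp": [[271.0, 272.0, 273.0], [274.0, 275.0, 276.0]],
         "precip": [[0.0, 0.5, 1.0], [1.5, 2.0, 2.5]]}),
    (6, {"temp": [[270.0, 271.0, 272.0], [273.0, 274.0, 275.0]],
         "precip": [[0.0, 0.0, 0.5], [0.5, 1.0, 1.0]]}),
]
series = reorganise(raw)

# Selective access: only the temperature field at time step 6 is decoded
time_step, blob = series["temp"][1]
print(time_step, blob_to_field(blob))
```

After the reorganisation a client interested in one variable reads only that variable's entries and never touches the other variables of the raw blocks.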
Parts of CERA DB
CERA Database System
CERA Database: 7.1 TB (12.2001)
• Data catalogue
• Processed climate data
• Pointer to raw data files
DKRZ Mass Storage Archive
Mass Storage Archive: 210 TB neglecting security copies (12.2001)
Internet Access
Web-based user interface
• Catalogue inspection
• Climate data retrieval
Current database size is 20.5074 Terabyte
Number of experiments: 298
Number of datasets: 29715
Number of BLOBs within CERA at 03-SEP-03: 1262566234
Typical BLOB sizes: 17 kB and 100 kB
Number of data retrievals: 1500 – 8000 / month
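As a rough plausibility check, the quoted database size and BLOB count can be combined into an average BLOB size. This is an illustrative calculation only. It ignores that part of the database holds catalogue metadata rather than BLOBs, and the slide does not say whether a terabyte is meant as 10**12 or 2**40 bytes, so both are shown.

```python
db_size_tb = 20.5074
n_blobs = 1_262_566_234

# Assumption: decimal terabyte (10**12 bytes)
avg_blob_bytes = db_size_tb * 10**12 / n_blobs
# Assumption: binary terabyte (2**40 bytes)
avg_blob_bytes_binary = db_size_tb * 2**40 / n_blobs

print(f"{avg_blob_bytes / 1000:.1f} kB (decimal TB)")
print(f"{avg_blob_bytes_binary / 1000:.1f} kB (binary TB)")
```

With either convention the average is of the same order as the smaller of the two typical BLOB sizes.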
CERA Data: Jan. Temp.
CERA-2 Data Model Blocks
Metadata Entry: This is the central CERA block, providing information on
• the entry's title
• type and relation to other entries
• the project the data belong to
• a summary of the entry
• a list of general keywords related to the data
• creation and review dates of the metadata
Additionally: Modules and Local Extensions
• Module DATA_ORGANIZATION (grid structure)
• Module DATA_ACCESS (physical storage)
• Local extension for specific information on (e.g.) data usage, data access and data administration
Coverage: Information on the volume of space-time covered by the data
Reference: Any publication related to the data together with the publication form
Status: Status information like data quality, processing steps, etc.
Distribution: Distribution information including access restrictions, data format and fees if necessary
Contact: Data related to contact persons and institutes like distributor, investigator, and owner of copyright
Parameter: Block describes data topic, variable and unit
Spatial Reference: Information on the coordinate system used
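The blocks listed above can be pictured as a central record with attached sub-records. The sketch below is an illustration only and not the actual CERA-2 schema: all class names, field names and the example values are hypothetical, and several blocks are reduced to plain strings.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Parameter:
    topic: str
    variable: str
    unit: str


@dataclass
class Coverage:
    # Volume of space-time covered by the data
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float
    start: date
    end: date


@dataclass
class Contact:
    name: str
    institute: str
    role: str  # e.g. "distributor", "investigator", "owner of copyright"


@dataclass
class MetadataEntry:
    # Central block
    title: str
    entry_type: str
    project: str
    summary: str
    keywords: List[str] = field(default_factory=list)
    created: Optional[date] = None
    reviewed: Optional[date] = None
    # Attached blocks (simplified)
    parameters: List[Parameter] = field(default_factory=list)
    coverage: Optional[Coverage] = None
    contacts: List[Contact] = field(default_factory=list)
    references: List[str] = field(default_factory=list)
    status: str = ""
    distribution: str = ""
    spatial_reference: str = ""


# Hypothetical example entry
entry = MetadataEntry(
    title="Hypothetical control run: 2 m temperature",
    entry_type="dataset",
    project="Hypothetical project",
    summary="Monthly mean 2 m temperature from a model run.",
    keywords=["climate", "temperature"],
    created=date(2003, 9, 3),
    parameters=[Parameter(topic="atmosphere", variable="2 m temperature", unit="K")],
    coverage=Coverage(-90.0, 90.0, 0.0, 360.0, date(1990, 1, 1), date(1999, 12, 31)),
    contacts=[Contact("N.N.", "Hypothetical institute", "investigator")],
)
print(entry.title, entry.parameters[0].unit)
```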
Data Model Functions
The CERA2 data model …
• allows for data search according to discipline, keyword, variable, project, author, geographical region and time interval and for data retrieval.
• allows for specification of data processing (aggregation and selection) without attaching the primary data.
• is flexible with respect to local adaptations and storage of different types of geo-referenced data.
• is open for cooperation and interchange with other database systems.
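A catalogue search of the kind described here works on metadata alone and never opens the primary data. The following minimal sketch shows the principle with an in-memory list. The entries, the field names and the `search` function are hypothetical, and only a few of the listed search criteria are covered.

```python
from datetime import date


def overlaps(a_start, a_end, b_start, b_end):
    """True if the closed intervals [a_start, a_end] and [b_start, b_end] intersect."""
    return a_start <= b_end and b_start <= a_end


def search(catalogue, keyword=None, variable=None, project=None, interval=None):
    """Return the titles of all entries that match every given criterion."""
    hits = []
    for entry in catalogue:
        if keyword is not None and keyword not in entry["keywords"]:
            continue
        if variable is not None and variable != entry["variable"]:
            continue
        if project is not None and project != entry["project"]:
            continue
        if interval is not None and not overlaps(entry["start"], entry["end"], *interval):
            continue
        hits.append(entry["title"])
    return hits


# Hypothetical catalogue entries (metadata only, no primary data)
catalogue = [
    {"title": "Run A temperature", "project": "P1", "variable": "temperature",
     "keywords": ["climate", "control run"],
     "start": date(1990, 1, 1), "end": date(1999, 12, 31)},
    {"title": "Run A precipitation", "project": "P1", "variable": "precipitation",
     "keywords": ["climate"],
     "start": date(1990, 1, 1), "end": date(1999, 12, 31)},
    {"title": "Run B temperature", "project": "P2", "variable": "temperature",
     "keywords": ["climate", "scenario"],
     "start": date(2000, 1, 1), "end": date(2049, 12, 31)},
]

print(search(catalogue, variable="temperature",
             interval=(date(1995, 1, 1), date(1996, 1, 1))))
```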
Data Structure in CERA DB
Level 1 - Interface: Metadata entries (XML, ASCII)
Level 2 - Interface: Separate files containing BLOB table data
Experiment Description
Pointer to Unix-Files
Dataset 1 Description
Dataset n Description
BLOB Data Table
BLOB Data Table
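One way to read this structure is as a hierarchy: an experiment description with a pointer to the raw Unix files, several dataset descriptions below it, and BLOB data attached to each dataset. The sketch below expresses that reading with SQLite. It is an assumption-laden illustration, not the CERA schema: table names, column names and the file path are hypothetical, and a single `blob_data` table keyed by dataset replaces the separate BLOB data tables of the diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE experiment (
    id INTEGER PRIMARY KEY,
    description TEXT NOT NULL,
    unix_file_pointer TEXT            -- location of the raw data files
);
CREATE TABLE dataset (
    id INTEGER PRIMARY KEY,
    experiment_id INTEGER NOT NULL REFERENCES experiment(id),
    description TEXT NOT NULL
);
CREATE TABLE blob_data (
    dataset_id INTEGER NOT NULL REFERENCES dataset(id),
    time_step INTEGER NOT NULL,
    field BLOB NOT NULL,              -- one 2D field per row
    PRIMARY KEY (dataset_id, time_step)
);
""")

# Hypothetical content: one experiment, one dataset, four 6-hourly fields
conn.execute("INSERT INTO experiment VALUES (1, 'Hypothetical control run', '/archive/exp001/')")
conn.execute("INSERT INTO dataset VALUES (1, 1, '2 m temperature, 6-hourly')")
conn.executemany(
    "INSERT INTO blob_data VALUES (1, ?, ?)",
    [(t, bytes([t])) for t in (0, 6, 12, 18)],  # placeholder bytes, not real fields
)

# Selective access: part of the time series of a single dataset
rows = conn.execute(
    "SELECT time_step, field FROM blob_data "
    "WHERE dataset_id = ? AND time_step BETWEEN ? AND ? ORDER BY time_step",
    (1, 6, 12),
).fetchall()
print(rows)
```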
Creation of application-oriented data storage must be automatic because of large archive rates!