Large Scale Analytical Data Management
Post on 14-Jun-2015
Database Research
Data Management Systems Research
• SIGMOD, TODS, PVLDB, ICDE, VLDBJ
  – major industry connections (billion $/year)
Expanding Topic Set & Societal Impact
  – Data Stream Processing
  – Data Mining
  – Information Extraction, Text Retrieval
  – RDF and Graph data management
  – MapReduce + Cloud
  – Data Privacy
DB Research Highlights (1/4)
Data Storage and Query: efficiency/scalability
• Computer architecture vs DBMS architecture
  – Columnar storage
  – Fast Compression Methods
  – Differential Storage Techniques (Positional Delta Trees)
  – Vectorized Execution
  – Robust Query Execution (“micro adaptivity”)
  – Just-In-Time (JIT) Compilation
  – Cooperative Scans: sharing scarce I/O bandwidth
• http://www.tpc.org/tpch/results/tpch_perf_results.asp?resulttype=noncluster
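As a rough illustration of how columnar storage and vectorized execution fit together, the sketch below stores each attribute in its own array and runs a filter-plus-sum query one small batch ("vector") of values at a time. It is a minimal sketch only: the function names, the column names, and the vector size of 1024 are assumptions made for this example and are not taken from MonetDB or Vectorwise.

```python
from array import array

VECTOR_SIZE = 1024  # assumed batch size, chosen only for illustration


def scan(column, vector_size=VECTOR_SIZE):
    """Yield consecutive slices (vectors) of a single column."""
    for start in range(0, len(column), vector_size):
        yield column[start:start + vector_size]


def select_less_than(vector, bound):
    """Primitive: positions inside the vector whose value is below bound."""
    return [i for i, value in enumerate(vector) if value < bound]


def sum_selected(vector, selection):
    """Primitive: sum the values at the selected positions of a vector."""
    return sum(vector[i] for i in selection)


def sum_price_where_quantity_below(quantity, price, bound,
                                   vector_size=VECTOR_SIZE):
    """SELECT SUM(price) WHERE quantity < bound, one vector at a time.

    Only the two columns the query needs are touched, and each primitive
    processes a whole vector per call instead of a single row.
    """
    total = 0
    for q_vec, p_vec in zip(scan(quantity, vector_size),
                            scan(price, vector_size)):
        selection = select_less_than(q_vec, bound)
        total += sum_selected(p_vec, selection)
    return total


# Hypothetical table with two columns, each stored as its own array.
quantity = array("i", [5, 30, 12, 7, 50, 3])
price = array("i", [100, 200, 300, 400, 500, 600])
result = sum_price_where_quantity_below(quantity, price, 10)
```

Because the primitives see a batch of values per call, interpretation overhead is paid once per vector rather than once per row; in a compiled engine the same loop shape is what lets the compiler generate tight, cache-friendly code.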
DB Research Highlights (2/4)
Commodity Cluster Computing - Cloud
• Various MonetDB Cluster Projects
  – Shared-nothing data storage, query optimization
• Hadoop VectorWise (VU MSc projects)
  – cluster scalability & failover
  – Tightly integrated Hadoop/YARN/HDFS
• CWI scilens cluster
  – Amdahl number >1: large I/O resources
  – Other uses: web crawl analysis, 500 billion triple BI BSBM benchmark
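The Amdahl number mentioned above is commonly defined as the number of bits of sequential I/O a machine can deliver per second for every instruction it executes per second; a value near or above 1 indicates a machine balanced for data-intensive work. The sketch below shows only the arithmetic. The hardware figures are hypothetical and are not the specification of the scilens cluster.

```python
def amdahl_number(io_bytes_per_second, instructions_per_second):
    """Bits of sequential I/O per second per instruction per second."""
    return (io_bytes_per_second * 8) / instructions_per_second


# Hypothetical node: 4 GB/s of sequential I/O and 16 billion instructions/s.
example = amdahl_number(4e9, 16e9)
```

With these assumed figures the node delivers two bits of I/O per instruction, so it would count as I/O-rich in the sense of the slide.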
DB Research Highlights (3/4)
Adaptive Indexing
• DBA expertise extremely scarce
• Science workloads hard to predict & variable
Database Cracking: “every query is advice on how to store the data”
continuous self-steering data reorganization
+ Approximate Query Execution on Samples
+ Recycling: exploit overlap in workloads
+ Fingerprint Indexing: exploit local correlations
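The idea behind database cracking can be sketched as follows: each range query partitions the piece of the column it touches around its bounds and records the split points, so later queries scan ever smaller pieces. This is a simplified illustration, not the MonetDB implementation; the class and method names are made up for the example, and the partitioning builds new lists rather than swapping values in place as a real system would.

```python
import bisect


class CrackedColumn:
    """A column copy that is reorganized a little by every range query."""

    def __init__(self, values):
        self.values = list(values)  # cracker column (copy of the base data)
        self.pivots = []            # sorted pivot values seen so far
        self.positions = []         # positions[i]: first index with value >= pivots[i]

    def _crack(self, pivot):
        """Split the piece containing pivot; return the split position."""
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.positions[i]  # already cracked on this value
        lo = self.positions[i - 1] if i > 0 else 0
        hi = self.positions[i] if i < len(self.pivots) else len(self.values)
        piece = self.values[lo:hi]
        smaller = [v for v in piece if v < pivot]
        larger = [v for v in piece if v >= pivot]
        self.values[lo:hi] = smaller + larger
        split = lo + len(smaller)
        self.pivots.insert(i, pivot)
        self.positions.insert(i, split)
        return split

    def range_query(self, low, high):
        """Return all values v with low <= v < high, cracking as a side effect."""
        start = self._crack(low)
        end = self._crack(high)
        return self.values[start:end]


original = [13, 4, 9, 2, 12, 7, 1, 19, 3, 6]
column = CrackedColumn(original)
first = column.range_query(5, 10)
```

After this first query the column consists of three pieces (values below 5, values from 5 up to 10, values of 10 and above), so a follow-up query such as `column.range_query(5, 8)` only has to reorganize the middle piece.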
DB Research Highlights (4/4)
Support for non-tabular data
• Text (retrieval)
• Scientific
  – Data vaults: directly query FITS, GeoTIFF, BEM, MSEED, ..
  – SciQL: Arrays as 1st class database objects
  – MonetDB.R: using columns as arrays (and vice versa)
• Semantic Data: RDF
  – “automatically discovering schemas in LOD data”
    • Bridge gap between RDF and relational
• Graph Data Management
  – Benchmark development
Application Areas
– Business Intelligence
  • Marketing/Sales, Fraud Detection, Churn (spin-offs)
  • Social network analysis (LDBC)
– Security
  • Digital Forensics (NFI - XIRAF)
  • ...
– Science
  • Astronomy (LOFAR transient search)
  • Meteorology (Earthquake Analysis - KNMI)
– Linked Data
  • Open government (LOD2)
Areas of Activity
[Figure: diagram of areas of activity. Layer labels: Data; Understand and decide; Analyze and model; Store and process. Areas shown: Reasoning, Knowledge Representation, Multimedia Retrieval, Modeling and Simulation, Machine Learning, Information Retrieval, Decision Theory, Business Analytics, Visual Analytics, Distributed Processing, Large Scale Databases, Software Eng., System / Network Eng.]
Data Science Education
Enormous demand for (“big”) data scientists
• Possibilities/limitations of a wide array of techniques
  – Information extraction, cleaning
  – Ranking, retrieval
  – Data Mining, and its applications
  – DB principles (query optimization, query processing algorithms, storage techniques)
• Understand key performance factors
  – Latency vs bandwidth
  – Networks, computer architecture
  – Algorithm optimization techniques
• Practical skills
  – Modern software engineering methods
  – Rapid prototyping languages
  – Solving problems using Hadoop clusters
Proposal: “Extreme Data Management” MSc course
Opportunities: CWI
• Database Architecture Group
  – research, application, data science experience
  – MonetDB, Vectorwise technologies
  – Scilens: data-intensive large compute cluster
• CWI motivators
  – Dual Appointments
  – Data Science MSc education
    • Attracting top students into MSc projects / PhD
  – DSRC co-positioning in future research funding
Conclusion
• Database research present in Amsterdam
  – research, application, valorisation
• Data Science Education!
  – Proposal: Extreme Data Management course
• ..DSRC and the CWI..