Paul Grooten, Matjaž Jug, Robbert Renssen Next Generation Data Management Architecture in Statistical Organisation
CBS - Key Characteristics
Autonomous Public Body with a Legal Entity (“ZBO”)
Bonaire
180 mEur 2000+
The Hague Heerlen
Official Statistics Economic - Social - Census
National and Regional
Policy and Opinion Support “Machine”
INTELLIGENCE
INFORMATION
DATA
POLICY OPINION
Great Ambitions…
…but also Challenges
Vision of Enterprise Data Lake
Data Lake
Microdata Stat. data
Articles / Visualisations
Reuse / Combine
Smart / flexible processes
Customers / Researchers
Use independently
Customers
Publish
Customers
Retrieve
Reporting units / Respondents
Streaming data
Registers
Explore
Key Capabilities
Ability to:
Discover, access and understand
Load, store, model, retrieve
Transform, harmonize, integrate
Access, derive, catalogue
Use (prepare, visualise, analyse…)
Manage as an asset
Secure
Capability Groups
Consumer Layer (CL)
Data Source Layer (BL)
Data Transformation Layer (DTL)
Data Provisioning Layer (DPL)
Data Governance
Metadata Management
Security & Authorization
Data Preparation
Data Visualization
Data Analyses
Self-service Reporting
Search & Explore
Dashboarding/ Scorecarding
Messaging
Data Access
Data Hub
Data Aggregation
Derive Views
Data Catalog
Data Harmonization
Data Transformation
Data Enrichment
Data Validation
Data Storage
Data Load
Data Access
Data Deletion
Data Extraction
Data Profiling
Data Cleansing
Data Quality Management
Classification Management
Variables Management
Data Set Descriptions
Data Set Relations
Backup & Restore
Change Management
Enterprise Architecture
System Management
Configuration Management
Documentation Management
Usage Monitoring
Authentication
Authorization
User Management
Logging
Auditing
Encryption
Meta Data Catalog
Datameer | Version 0.4
Data Mining
Model Data Source
Ingest data
Data Architecture Layers
Request
Response
(Legacy) Datasources
Data Source Layer (BL)
CSV SQL DB
Web Srv
ETL tooling
XLS
App
CBDS
Request
Consumer Layer (CL)
Web Page S2S
Tooling P V A
P = Data Prep | V = Data Visualization | A = Data Analytics
Security
Data Virtualization
Data Transformation Layer (DTL)
Data Provisioning Layer (DPL)
Building Block 1
Building Block 2
Building Block 3
Building Block 4
Web- Service C
OData Web- Service B
Web- Service A
Security
User Que.
Data Governance
Tech Meta
Metadata Management
Import Conceptual Meta
Conn. String
Existing New
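The decoupling idea behind the four layers can be illustrated with a minimal Python sketch. It is not the CBS implementation or the behaviour of any data virtualization product. All names in it (`read_csv_source`, `read_sql_source`, `persons_with_region`, `VIEWS`, `query`) and the sample rows are hypothetical. The point is only that a consumer asks for a named virtual view and never sees whether the data came from a CSV file or an SQL database.

```python
import csv
import io
import sqlite3
from typing import Callable, Dict, List

Row = Dict[str, str]

# Data Source Layer: two hypothetical sources using different technologies.
CSV_TEXT = "id,region\n1,Heerlen\n2,The Hague\n"


def read_csv_source() -> List[Row]:
    """Read rows from an in-memory CSV text (stands in for a CSV file)."""
    return list(csv.DictReader(io.StringIO(CSV_TEXT)))


def read_sql_source() -> List[Row]:
    """Read rows from an in-memory SQLite table (stands in for an SQL DB)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE persons (id TEXT, sex TEXT)")
    conn.executemany("INSERT INTO persons VALUES (?, ?)", [("1", "M"), ("2", "F")])
    cursor = conn.execute("SELECT id, sex FROM persons ORDER BY id")
    rows = [{"id": r[0], "sex": r[1]} for r in cursor]
    conn.close()
    return rows


# Data Transformation Layer: a virtual view that joins the sources on demand.
# Nothing is copied or stored; the join runs each time the view is queried.
def persons_with_region() -> List[Row]:
    regions = {r["id"]: r["region"] for r in read_csv_source()}
    return [
        {**person, "region": regions[person["id"]]}
        for person in read_sql_source()
        if person["id"] in regions
    ]


# Data Provisioning Layer: a registry that exposes views by name.
VIEWS: Dict[str, Callable[[], List[Row]]] = {
    "persons_with_region": persons_with_region,
}


# Consumer Layer: the consumer only knows the view name.
def query(view_name: str) -> List[Row]:
    return VIEWS[view_name]()


result = query("persons_with_region")
```

Replacing a source in this sketch, for example swapping the CSV reader for a web service call, would change only the Data Source Layer function; the view name used by the consumer stays the same.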
Key Building Blocks
Metadata Model
Semantic Technology
Data Virtualisation
Big Data Platform
Self-Service BI /
workflow orchestration
PoC: from internal Data Lake
Consumer Layer (CL)
Data Provisioning & Transformation Layers (DPL & DTL)
Metadata Management
Data Source Layer (BL)
RinPN GBAGSL GBAGLND
101210 M NL
GBAPersoon2012V1
Import Conceptual Meta XML
Data Prep
DSC
GbaPersoon2012V1
EHB
BE
Harvesting process tech meta
Meta Data Repository
Export Conceptual Meta
OG-ID Desc
354 XXX
OG Table
Data Visual
= New technology for CBS
CIO office | Version 0.7
Security
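Harvesting technical metadata, as shown in the PoC diagram, can be sketched in Python as follows. The sketch assumes the source is an SQLite database, which is a stand-in chosen for illustration and not the platform used in the PoC. The table and column names come from the slide. The `TEXT` column types and the function name `harvest_technical_metadata` are assumptions.

```python
import sqlite3
from typing import Any, Dict


def harvest_technical_metadata(conn: sqlite3.Connection, table: str) -> Dict[str, Any]:
    """Collect column names and declared types for one table."""
    # Guard against odd input, because the table name is placed in the statement.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    # PRAGMA table_info returns one row per column:
    # (cid, name, type, notnull, dflt_value, pk)
    columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return {
        "table": table,
        "columns": [{"name": c[1], "type": c[2]} for c in columns],
    }


# Hypothetical stand-in for the source table shown on the slide.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE GBAPersoon2012V1 (RinPN TEXT, GBAGSL TEXT, GBAGLND TEXT)")
conn.execute("INSERT INTO GBAPersoon2012V1 VALUES ('101210', 'M', 'NL')")

meta = harvest_technical_metadata(conn, "GBAPersoon2012V1")
conn.close()
```

The resulting dictionary is the kind of record a metadata repository could store next to the conceptual metadata imported from XML.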
…towards DaaS Architecture
CIO office | Version 1.3
Security
CL
Datasources
DSL
Data Virtualization
DTL
DPL
Data Governance
Existing New
User Que.
Metadata Management
Tech Meta
UDC = Urban Data Center | CL = Consumer Layer | DPL = Data Provisioning Layer | DTL = Data Transformation Layer | DSL = Data Source Layer
P = Data Prep | V = Data Visualization | A = Data Analytics
Security
EHB
Zone CBS Zone UM
Building Block 3
Building Block 4
Web- Service B
Secured VPN
P V A
Zone UDC1
Building Block 5
Building Block 6
Web- Service C
Secured VPN
P V A
Zone UDC2
Building Block 7
Building Block 8
Web- Service D
Secured VPN
P V A
Building Block 1
Building Block 2
Web- Service A
CBDS DSC
P V A
Data Lake project – work in progress
Status | Topic | Description
Finished | 4-layers Data Architecture | Possibility to decouple the Data Source Layer from the Consumer Layer and create virtual datasets / virtual views. Web Service interface implemented for business register. EHB project demonstrated that the architecture delivers benefits.
Finished | Metadata Model | Develop a model that describes statistical data in a formal and exact way. In theory, any statistical dataset can be mapped to the model, represented as a graph, and the metadata can be used to find data (including ranking).
Finished | PoC Data Virtual | Successfully connected Denodo to the Documentum Database (DSC); improved query possibilities and a performance boost.
In progress | PoC Metadata | Implement the metadata model in the PoolParty semantic web platform, harvest technical and conceptual metadata, and provide URL to DV platform.
In progress | Connect Data Sources | Expand the number of Data Sources to improve usability of the test platform. Perform stress tests.
Scope defined | PoC Multi-Zone DV | Use the Data Lake as a research platform for distributed data. Implement secure infrastructure.
Planned | Data Governance and Security | Define Data Governance for managing and securing virtual datasets.
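The Metadata Model row mentions representing datasets as a graph and using metadata to find data, including ranking. A minimal Python sketch of that idea follows. It is not the CBS metadata model and not the PoolParty API. The predicates `hasVariable` and `measures`, the datasets `ExampleSurvey` and `ExampleRegister`, and the concept labels assigned to `GBAGSL` and `GBAGLND` are assumptions made for illustration. The ranking rule (number of requested concepts a dataset covers) is also an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]

# Metadata graph as (subject, predicate, object) triples. Hypothetical content.
TRIPLES: List[Triple] = [
    ("GBAPersoon2012V1", "hasVariable", "GBAGSL"),
    ("GBAPersoon2012V1", "hasVariable", "GBAGLND"),
    ("GBAGSL", "measures", "sex"),          # assumed concept label
    ("GBAGLND", "measures", "country"),     # assumed concept label
    ("ExampleSurvey", "hasVariable", "SEX"),
    ("SEX", "measures", "sex"),
    ("ExampleRegister", "hasVariable", "REGION"),
    ("REGION", "measures", "region"),
]


def find_datasets(triples: Iterable[Triple], wanted: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (dataset, score) pairs, best match first.

    The score is the number of requested concepts covered by the dataset.
    Datasets covering none of the requested concepts are left out.
    """
    variables: Dict[str, Set[str]] = defaultdict(set)  # dataset -> variables
    concepts: Dict[str, Set[str]] = defaultdict(set)   # variable -> concepts
    for subject, predicate, obj in triples:
        if predicate == "hasVariable":
            variables[subject].add(obj)
        elif predicate == "measures":
            concepts[subject].add(obj)

    wanted_set = set(wanted)
    scores: Dict[str, int] = {}
    for dataset, dataset_vars in variables.items():
        covered: Set[str] = set()
        for var in dataset_vars:
            covered |= concepts.get(var, set())
        hits = covered & wanted_set
        if hits:
            scores[dataset] = len(hits)

    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranking = find_datasets(TRIPLES, ["sex", "country"])
```

With the sample triples, a search for the concepts `sex` and `country` ranks `GBAPersoon2012V1` first because it covers both, followed by `ExampleSurvey`, which covers one.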
What do we want to achieve with the Data Lake Vision?
Stimulate: Statistical Growth, Data access, Re-use
Reduce: Cost, Risk, Time to Market
Foundation for New Business Models