Opening data within organisations #csvconf 2014 - Berlin - @stevenbeeckman
csv,conf 2014 - Open data within organizations

Aug 11, 2014

Data & Analytics

Steven Beeckman

This talk describes how we are trying to open (sometimes sensitive) data within our organization.
Transcript
Page 1: csv,conf 2014 - Open data within organizations

Opening data within organisations

#csvconf 2014 - Berlin - @stevenbeeckman

Page 2: csv,conf 2014 - Open data within organizations

hi

Page 3: csv,conf 2014 - Open data within organizations

I’m @stevenbeeckman - a digital dj

mixcloud.com/gehorschade.kollektiv

Page 4: csv,conf 2014 - Open data within organizations

Conductor for StartupBus Europe!

www.startupbus.com

Page 5: csv,conf 2014 - Open data within organizations

Vienna

Poland

Estonia

Germany

UK

France

Spain

Italy

Greece

Pre-apply now at startupbus.com

Follow @TheStartupBus

Page 6: csv,conf 2014 - Open data within organizations

Who here knows what devops is about?

Page 7: csv,conf 2014 - Open data within organizations

developers building apps vs operations running apps in production

Page 8: csv,conf 2014 - Open data within organizations

There is

a bigger picture

Page 9: csv,conf 2014 - Open data within organizations

there are a bit more than 2 silos

Page 10: csv,conf 2014 - Open data within organizations

Defence 101

Units on the battleground

Units in training

Majors, Colonels and Generals in the staff

Page 11: csv,conf 2014 - Open data within organizations

Defence 101 (bis)

An army needs a very strong HR and logistics machine

Belgian government budget cuts usually hit the defence budget first

Need for integrated management

Page 12: csv,conf 2014 - Open data within organizations

calculating the cost of a training exercise took 4 people 4 weeks

to go bug 5 application owners

for data hidden in

• relational databases

• Excel sheets

• Business Objects reports

• Access databases

• (not so) shared drives

Page 13: csv,conf 2014 - Open data within organizations

some logistics guy deployed in Afghanistan

I can’t access the shared drive, I wish I had my data locally!

Stone Age

I’m tired of these Excel files and Access databases saying something contradictory.

Gimme the damn truth!

Page 14: csv,conf 2014 - Open data within organizations

Requirements

1. Centralize data

2. But protect sensitive data (HR, medical privacy, …)

3. Make the data available offline

4. Nodes should be able to regain current state after loss of communication for 5 days

Page 15: csv,conf 2014 - Open data within organizations

some logistics guy deployed in Afghanistan

I can’t access the shared drive, I wish I had my data locally!

Stone Age 2009

First XML based prototypes

I’m tired of these Excel files and Access databases saying something contradictory.

Gimme the damn truth!

Page 16: csv,conf 2014 - Open data within organizations

XML-based prototypes

• Able to extract at most 40 tables from the logistics application in one night

• Slow

• Problems with identical rows

Page 17: csv,conf 2014 - Open data within organizations

some logistics guy deployed in Afghanistan

I can’t access the shared drive, I wish I had my data locally!

Stone Age 2009

First XML based prototypes

New team & new approach

I’m tired of these Excel files and Access databases saying something contradictory.

Gimme the damn truth!

Page 18: csv,conf 2014 - Open data within organizations

New team

Hand-over to Dept AD&M (“the pros”)

Page 19: csv,conf 2014 - Open data within organizations

New approach

Systems engineering: holistic view of the problem

Take into account the protection of sensitive data

Make it more stable than the prototype

Explicitly not real-time

Check out NASA’s course: http://www.saylor.org/sse101/

Page 20: csv,conf 2014 - Open data within organizations

Conceptually

• lots of data sources with data owners

• 1 central data “warehouse”

• lots of nodes downloading the data they have access rights to

Page 21: csv,conf 2014 - Open data within organizations

HR app

Financial app

Logistics app

Planning app

Excel

Ops unit

data warehouse

another app

Page 22: csv,conf 2014 - Open data within organizations

Inside the data warehouse

Extraction Engine (EE)

File Server

Access Control

Page 23: csv,conf 2014 - Open data within organizations

Extraction Engine (EE)

Based on open-source software:

Linux

MySQL

Talend (Eclipse based ETL workflow tool)

Page 24: csv,conf 2014 - Open data within organizations

What does the EE do every night?

• Detect the metadata (store it in XML format)

• Take a full dump of each data source in csv format

• Calculate delta (deleted rows and inserted rows, in csv format)

• Create two zip files:

• One full copy

• One delta for this day
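
The delta step above can be sketched in a few lines of Python. This is a minimal illustration, not the Extraction Engine's actual code (which the slides say is built on Talend): the function names and the sample data are hypothetical. Rows are compared as a multiset, an assumption chosen here so that identical rows, which troubled the XML prototypes, are counted correctly.

```python
import csv
import io
from collections import Counter


def read_rows(csv_text):
    """Parse CSV text into a list of row tuples (header row excluded)."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header line
    return [tuple(row) for row in reader]


def compute_delta(previous_rows, current_rows):
    """Return (deleted, inserted) rows between two full dumps.

    Rows are counted as a multiset, so identical rows are handled:
    a row present twice yesterday and once today is reported as
    deleted once.
    """
    before = Counter(previous_rows)
    after = Counter(current_rows)
    deleted = list((before - after).elements())
    inserted = list((after - before).elements())
    return deleted, inserted


# Hypothetical dumps of the same table on two consecutive nights.
yesterday = "id,item,qty\n1,boots,10\n2,tents,4\n2,tents,4\n"
today = "id,item,qty\n1,boots,12\n2,tents,4\n3,radios,7\n"

deleted, inserted = compute_delta(read_rows(yesterday), read_rows(today))
print("deleted:", deleted)
print("inserted:", inserted)
```

An updated row shows up as one deletion plus one insertion, which matches the slide's description of a delta as deleted rows and inserted rows.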

Page 25: csv,conf 2014 - Open data within organizations

File server

• Stores the zip files available for the nodes

• Full copy only for the current day (but we have a history for a month)

• Delta zip files for 14 days
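
With these retention rules, a node that has been offline can decide what to download. The sketch below is one possible reading, not the project's actual loader logic: the file names and the function `plan_download` are hypothetical, and it assumes one delta file per day.

```python
from datetime import date, timedelta

DELTA_RETENTION_DAYS = 14  # from the slide: delta zip files are kept for 14 days


def plan_download(last_sync, today):
    """Return the files a node should fetch to reach today's state.

    last_sync is the date of the last dump the node has applied,
    or None for a brand-new node. File names are hypothetical.
    """
    if last_sync is None or (today - last_sync).days > DELTA_RETENTION_DAYS:
        # Too far behind (or never synced): start from the full copy.
        return [f"full-{today.isoformat()}.zip"]
    missed_days = (today - last_sync).days
    # Otherwise replay the daily deltas in chronological order.
    return [
        f"delta-{(last_sync + timedelta(days=n)).isoformat()}.zip"
        for n in range(1, missed_days + 1)
    ]


# A node that lost communication for 5 days catches up with 5 deltas.
print(plan_download(date(2014, 7, 20), date(2014, 7, 25)))
```

Keeping 14 days of deltas leaves a comfortable margin over the requirement that nodes recover after 5 days without communication.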

Page 26: csv,conf 2014 - Open data within organizations

Access control

• Data providers decide for themselves whether their data is

• “public” within the organisation

• “restricted” to a set of nodes
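
The two visibility levels can be modelled roughly as follows. This is an illustrative sketch only: the class, the field names, the dataset names and the node identifiers are all hypothetical, and the slides do not describe how Access Control is actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    visibility: str = "public"              # "public" or "restricted"
    allowed_nodes: frozenset = frozenset()  # only used when restricted


def can_download(dataset, node_id):
    """Public data is open to every node; restricted data only to listed nodes."""
    if dataset.visibility == "public":
        return True
    return node_id in dataset.allowed_nodes


def visible_datasets(catalog, node_id):
    """Names of the datasets a given node is allowed to download."""
    return [d.name for d in catalog if can_download(d, node_id)]


# Hypothetical catalog: the data provider sets the visibility.
catalog = [
    Dataset("vehicle_stock"),
    Dataset("medical_files", "restricted", frozenset({"node-hr-01"})),
]

print(visible_datasets(catalog, "node-ops-07"))
print(visible_datasets(catalog, "node-hr-01"))
```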

Page 27: csv,conf 2014 - Open data within organizations

The nodes

Custom XAMPP package for local development of reporting or JBoss for bigger nodes with validated reports

Custom loader contacting Access Control and filling the MySQL database

Custom “Local Reporting Framework” (XML + XSLT)
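
How a loader might apply one day's delta to the node's local database is sketched below. The real nodes use MySQL; this example substitutes Python's built-in `sqlite3` so that it runs without a server, and the table `stock`, its columns and the function `apply_delta` are hypothetical.

```python
import sqlite3


def apply_delta(conn, deleted, inserted):
    """Apply one delta: remove the deleted rows, then add the inserted rows.

    Each deleted row removes exactly one matching record, so tables that
    contain identical rows stay consistent. Everything runs in a single
    transaction, so a failure leaves the previous state intact.
    """
    with conn:  # commits on success, rolls back on error
        for row in deleted:
            conn.execute(
                "DELETE FROM stock WHERE rowid = ("
                " SELECT rowid FROM stock"
                " WHERE id = ? AND item = ? AND qty = ? LIMIT 1)",
                row,
            )
        conn.executemany("INSERT INTO stock VALUES (?, ?, ?)", inserted)


# Local state after the previous sync (note the duplicated row).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock (id TEXT, item TEXT, qty TEXT)")
conn.executemany(
    "INSERT INTO stock VALUES (?, ?, ?)",
    [("1", "boots", "10"), ("2", "tents", "4"), ("2", "tents", "4")],
)

apply_delta(
    conn,
    deleted=[("1", "boots", "10"), ("2", "tents", "4")],
    inserted=[("1", "boots", "12"), ("3", "radios", "7")],
)
rows = sorted(conn.execute("SELECT id, item, qty FROM stock").fetchall())
print(rows)
```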

Page 28: csv,conf 2014 - Open data within organizations

Current status

Page 29: csv,conf 2014 - Open data within organizations

some logistics guy deployed in Afghanistan

I can’t access the shared drive, I wish I had my data locally!

Stone Age 2009

First XML based prototypes

New team & new approach

I’m tired of these Excel files and Access databases saying something contradictory.

Gimme the damn truth!

2014

Growth

4090

1000

Page 30: csv,conf 2014 - Open data within organizations

@SpaceCatPics

Page 31: csv,conf 2014 - Open data within organizations

"A LARGE SYSTEM IS ONE WHERE YOU DO NOT KNOW THAT SOME OF ITS COMPONENTS EVEN EXIST."

Page 32: csv,conf 2014 - Open data within organizations

Some statistics

• 400 users (nodes)

• > 1 billion rows processed each night

• ~ 75 gigabytes of data processed each night

• making the EE work requires > 2000 tables
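
As a rough sanity check on these figures, the average row size can be estimated. The calculation assumes decimal gigabytes and uses one billion as the lower bound on the row count; both are assumptions, since the slide only gives approximate numbers.

```python
GIGABYTE = 10**9  # assumption: decimal gigabytes

nightly_bytes = 75 * GIGABYTE  # "~ 75 gigabytes" from the slide
nightly_rows = 1_000_000_000   # lower bound: "> 1 billion rows"

# The real row count is above one billion, so this is an upper bound.
max_avg_bytes_per_row = nightly_bytes / nightly_rows
print(f"average row size: at most about {max_avg_bytes_per_row:.0f} bytes")
```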

Page 33: csv,conf 2014 - Open data within organizations

[Bar chart: number of source databases per type: FTP, LDAP, Microsoft SQL Server, MySQL, Oracle, PostgreSQL, SharePoint]

32 source databases

Page 34: csv,conf 2014 - Open data within organizations

big data schema

Page 35: csv,conf 2014 - Open data within organizations

What used to take my team 4 weeks now takes us one click on a button!

A major responsible for military training & exercises

Page 36: csv,conf 2014 - Open data within organizations

Questions?

@stevenbeeckman #csvconf

Hackers, hipsters & hustlers should pre-apply at

www.startupbus.com

Page 37: csv,conf 2014 - Open data within organizations

Image credits

http://www.photographersgallery.com/photo.asp?id=2411

Diagonal full of silos

http://www.pragmaticdevops.com/2014/04/management/hacking-management/devops-as-a-team-or-a-responsibility/

Two silos