Data Science and Big Data - Fachhochschule Salzburg
Data Science and Big Data: Research landscape and impact on the
mobility domain
Martin Köhler
Dynamic Transportation Systems
Mobility Department
Austrian Institute of Technology
Salzburg Data Science Symposium – 20.11.2014
Data-intensive science
Enormous data archives are at hand
Various data sources
Often available in real-time
Investigating huge data volumes
and driving research and industry
Science is increasingly moving from hypothesis-driven to data-driven discoveries
Correlation vs. Causality
Science is changing
A thousand years ago
Science was empirical
describing natural phenomena
Last few hundred years
Theoretical branch using
generalizations
Last few decades
A computational branch
simulating complex phenomena
Today
Data-intensive science,
synthesizing theory, experiment
and computation with statistics
► A new way of thinking is required!
Data-Intensive Science: The Fourth Paradigm, Alex Szalay
Dept of Physics and Astronomy
The Johns Hopkins University
e.g. Ptolemy’s universe of
concentric spheres
e.g. Newtonian/Einsteinian gravity
e.g. Cosmic structure formation
e.g. Matter/energy content of the universe
More data versus rocket science
“In this paper, we evaluate the
performance of different
learning methods on a
prototypical natural language
disambiguation task,
confusion set disambiguation,
when trained on orders of
magnitude more labeled data
than has previously been
used.”
“Some simple math given a
mountain of data can get
you 80% of the way.”
James Shanahan, Berkeley
Scaling to Very Very Large Corpora for Natural Language Disambiguation,
ACL '01 Proceedings of the 39th Annual Meeting on Association for Computational Linguistics
Banko & Brill, 2001
August E. Evrard, PhD. Cyberscience: Computational Science and the Rise of the Fourth Paradigm, 2010
Big Data Definition
“Big Data” is a term encompassing the use of techniques to capture, process,
analyse and visualize potentially large datasets in a reasonable timeframe
not accessible to standard IT technologies. By extension, the platform, tools
and software used for this purpose are collectively called “Big Data
technologies”. NESSI White Paper, December 2012
Four characteristics:
•Volume: The amount of generated data has increased enormously in recent years
•Velocity: Analysing more data in shorter time frames
•Variety: Huge diversity of data formats (arbitrary -> relational -> free text)
•Value: Extracting value (knowledge)
Hardware and software technologies for managing and
analyzing huge amounts of data
Or simply said
IF DATA IS PART OF THE PROBLEM
Big Data Dimensions
Legal dimension
Social dimension
Economic dimension
Technological dimension
Application dimension
Copyright
Privacy
User behaviour
Collaboration
Social implications
Business models
Benchmarking
Pricing
Scalable data processing
Signal processing
Statistics
Linguistics
HCI/Visualization
Electronic archiving
Decision support
Industry solutions
Big Data Technology Stack
Hadoop
Ecosystem
Big Data
Platforms
Data
Ingestion
And
Processing
Efficiency
Trust
Workload
Governance
Tools
Platform
Programming
Parallel
Big Data
Analytics
Data
Science
Transform
question to
algorithm
Machine
Learning
Analysis
Integration
Query
Performance
Transform
Warehousing
Big Data
Utilization
Domain
Expertise
Asking the
right
question
Reporting &
Dashboards
Alerting &
Recommendations
Business
Intelligence
Text Analysis
and Search
Data
Centers
Big Data
Management
Scalable Data
Storage
IaaS
Cloud
Virtualization
Network
Compute
Storage
DBMS
NoSQL
Management
Security
Privacy
Governance
Data Value
Big Data Management
Technologies for efficient management of large amounts of data
• Storage and management of data
• Provisioning and management of the infrastructure
Cloud resources, internal data centers
Storage
Big Data Platforms
Technologies for massively parallel execution of analytics on top of huge data amounts
• Provisioning of parallel and scalable execution systems
• Real-time computation of sensor data
Massively parallel
programming
Programming models
for data-intensive
applications
(e.g. MapReduce)
High-Level query
languages
Scripting languages
and abstract
representations of low-
level data-intensive
query languages
Streaming
Real-time processing of
(sensor-) data which has
to be reduced for storage
Ad-Hoc queries
Real-time access on
large data amounts
(query optimization:
SQL vs. MapReduce)
Google Pregel Apache Drill
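The MapReduce programming model named on this slide can be illustrated with a word count. The following is a minimal single-process sketch in Python; it does not use Hadoop, and the function names map_phase, shuffle and reduce_phase are hypothetical, chosen only to mirror the three conceptual steps of the model.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce one after another on a list of texts."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "data science and big data"]))
```

In a real cluster the map and reduce calls would run in parallel on different nodes, and the shuffle step would move data across the network; the sketch only shows the data flow.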
Big Data Analytics
Technologies for gaining information from large data amounts on the basis of analytical approaches
• Recognize new models
• Pattern matching
• Pattern recognition
Big Data Utilization
Technologies for extracting knowledge and gaining value
• Strengthen the market position
• Simple utilization of huge data amounts
Business
Intelligence
Data-driven
provisioning of efficient
indicators
(reporting, key
performance indicator,
audit, …)
Knowledge
Management
Management and
representation of
knowledge
(Ontologies,
LinkedData,
knowledge
management systems)
Decision Support
support the decision
process; includes data
management,
modelling, innovative
and interactive user
interface
Visualization
Interactive visualization
of complex information
and networks with
multiple abstractions
(Visual Analytics)
Traditional versus Data-intensive Approach
HADOOP
Iterate over structure
Transform and analyze
Hadoop Approach
• Apply schema on read
• Support range of access patterns to
data stored in HDFS: polymorphic
access
Batch Interactive Real-time
Right Engine, Right Job
In-memory
Traditional Approach
• Apply schema on write
• Heavily dependent on IT
Determine list of questions
Design solution
Collect structured data
Ask questions from list
Detect additional questions
Single Query Engine
SQL
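The difference between "schema on write" and "schema on read" can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Hadoop or any SQL engine is implemented: the SCHEMA dictionary, the field names and both function names are hypothetical.

```python
import json

# Hypothetical schema: field name -> type used to convert the value.
SCHEMA = {"sensor": str, "speed": float}


def load_schema_on_write(raw_lines):
    """Traditional approach: validate and convert records before storing."""
    table = []
    for line in raw_lines:
        record = json.loads(line)
        # A missing field raises KeyError, so records that do not fit
        # the predefined schema are rejected at load time.
        row = {name: cast(record[name]) for name, cast in SCHEMA.items()}
        table.append(row)
    return table


def query_schema_on_read(raw_lines, field, cast):
    """Data-intensive approach: keep raw lines, interpret a field on demand."""
    for line in raw_lines:
        record = json.loads(line)
        if field in record:
            yield cast(record[field])
```

With schema on read, a question that was not on the original list (for example about a field "lane" that only some records carry) can still be asked later, because the raw data was never forced into a fixed structure.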
Technical and Scientific Challenges
Visual Analytics
Combine the strengths of human and
electronic data processing
Big Data Analytics
Techniques making use of complete
data set, instead of sampling
Real time analytics, stream
processing
Expect real-time or near real-time
responses from the systems
Content Validation
Validating the vast amount of information
in content networks, Trust
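The stream-processing challenge listed above (near real-time responses on sensor data that has to be reduced before storage) is often approached with windowed aggregation. The following Python sketch shows a tumbling-window mean; the function name, the (timestamp, value) input format and the window length are assumptions made for illustration only.

```python
def tumbling_window_means(readings, window_seconds):
    """Yield (window_start, mean) for consecutive fixed-size time windows.

    `readings` is an iterable of (timestamp_seconds, value) pairs that
    arrive in time order; only one window is held in memory at a time.
    """
    current_start = None
    total = 0.0
    count = 0
    for timestamp, value in readings:
        # Start of the window this reading falls into.
        start = (timestamp // window_seconds) * window_seconds
        if current_start is not None and start != current_start:
            # The previous window is complete: emit its mean and reset.
            yield current_start, total / count
            total, count = 0.0, 0
        current_start = start
        total += value
        count += 1
    if count:
        # Emit the last, possibly incomplete, window.
        yield current_start, total / count
```

Because the function is a generator, each result is available as soon as its window closes, and the raw readings never need to be stored.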
Distributed Storage (IaaS, NoSQL)
Datacenter
Parallel Stream Processing MapReduce Extensions
Use Cases and Enterprise Services
Scientific Data Life Sciences Business Reporting
Datacenter
Datacenter
Future Trends in Big Data
Key aspects in European research (Horizon 2020)
Big Data
Current state
Natural Language Processing
Multi-Lingual
systems
Real-time
cross-stream
processing
European
data portals
Data
Availability
Scalable
analytics
Data Science Application Domains
Earth
Potentials for Smart Cities - Urban Computing
“Urban computing connects urban sensing, data management, data
analytics, and service providing into a recurrent process for an unobtrusive
and continuous improvement of people’s lives, city operation systems, and
the environment.”
Zheng, Y., et al. Urban Computing: Concepts, Methodologies, and Applications,
2014.
Why bother?
Air pollution
Congestion
Noise pollution
Accidents
Smart
City
Urban Computing for Urban Planning
Urban Computing for Transportation
Systems
Urban Computing for
the Environment
Urban Computing for Urban Energy Consumption
Urban Computing for
Social Applications
Urban Computing for
Economy
Urban Computing for Public Safety and Security
Data-driven Analytics enabling Smart Mobility Solutions
Real-time integration and analytics of heterogeneous data sources
Massively parallel execution of generic data analytic workflows
Application-specific visualizations
Data-driven Analytics – Mobility Applications
Crowd Dynamics
Events, Airports, Stations
Dynamic Route Planning
Transport Logistics
Data Acquisition
Floating Car Data, Mobile Phone Data,…
Multi-modal Traffic Flow Modeling
Multi-modal Transport Networks
• Multi-modal data collection
• Data analysis
• Optimization
• Multi-modal traffic simulation and
prediction
Data-driven Analytics – inferring land use
Infer current, actual land use data
from mobile phone and WIFI data
Temporal activity patterns
Spatial clusters
Correlations
Cooperation AIT and MIT
“Inferring land use from mobile phone activity”, Jameson Toole, Michael Ulm, Dietmar Bauer, Marta Gonzalez
“Best Paper Award” at UrbComp 2012
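The idea of relating temporal activity patterns to land use via correlations can be sketched as follows. This is not the method of the cited paper; it is a simplified illustration that assigns a zone the land-use label whose reference activity profile correlates best with the zone's own profile. All profiles, labels and function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def label_zone(profile, reference_profiles):
    """Return the land-use label whose reference profile correlates best."""
    return max(
        reference_profiles,
        key=lambda label: pearson(profile, reference_profiles[label]),
    )


if __name__ == "__main__":
    # Hypothetical activity levels in six time slots from night to late evening.
    references = {
        "residential": [9, 4, 2, 2, 6, 9],
        "commercial": [1, 5, 9, 9, 4, 1],
    }
    print(label_zone([2, 6, 10, 8, 5, 2], references))
```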
FLEET – Real-time traffic measurement and information system
Real time traffic information based on GPS reports of probe vehicles
Accurate short and medium term travel time prediction
Hot spot identification and estimation of traffic queue lengths
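How travel times can be derived from GPS reports of probe vehicles is sketched below. This is not the FLEET algorithm, which the slides do not describe; it is a minimal illustration that assumes each report is already matched to a road segment and carries a hypothetical (vehicle_id, timestamp_s, position_m) tuple.

```python
from collections import defaultdict
from statistics import median


def estimate_travel_time(reports, segment_length_m):
    """Estimate the travel time (seconds) over one road segment.

    `reports` holds (vehicle_id, timestamp_s, position_m) tuples, where
    position_m is the distance travelled along the segment.
    """
    by_vehicle = defaultdict(list)
    for vehicle_id, timestamp_s, position_m in reports:
        by_vehicle[vehicle_id].append((timestamp_s, position_m))

    speeds = []
    for track in by_vehicle.values():
        track.sort()
        (t0, p0), (t1, p1) = track[0], track[-1]
        # Use only vehicles that actually moved forward over time.
        if t1 > t0 and p1 > p0:
            speeds.append((p1 - p0) / (t1 - t0))  # metres per second
    if not speeds:
        return None  # no usable probe vehicle on this segment
    # The median is less sensitive to single outliers than the mean.
    return segment_length_m / median(speeds)
```

A production system would additionally need map matching, outlier filtering and a prediction model for the short and medium term; the sketch covers only the current estimate.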
Big Data in Logistics
M. Koehler, University of Vienna; SKG 2012, Beijing, China
EU Project VPH-Share
• FP7 Integrated Project within Virtual Physiological Human Initiative
• Duration: March 2011 – February 2015
• Cost: 14.5 M€; Funding: 10.7 M€; 20 Partners
• Coordinator: The University of Sheffield, United Kingdom
• Goal: Contribute to the VPH vision of a systematic framework for understanding physiological processes in the human body in terms of anatomical structure and biophysical mechanisms across multiple length and time scales.
Cloud Platform (Public / Private)
Select
Workflow
Patient Data
Workflow Inputs
Workflow Outputs
Infer
missing
items
Run
simulation
Decision Support
Patient Centred Computational Workflows
Retrieve
Existing
Data
Return
Results &
Support
Users
VPH
Outreach
Patient Avatar
Application
Infostructure
HPC Infrastructure (DEISA / PRACE)
Personalised Model
Knowledge Discovery
Data Inference
Compute
Services
Storage
Services
Knowledge
Management
Data Services:
Patient/Population
euHeart
@neurIST
VPH OP
ViroLab
Partners:
CYFRONET, PL
Sheffield Teaching
Hospitals, UK
ATOS Origin, ES
Kings College
London, UK
Universitat
Pompeu
Fabra, ES
Empirica, DE
SCS SRL, IT
NHS IC, UK
INRIA, FR
IOR, IT
Open Univ., UK
Philips Elec., NL
TU Eindhoven, NL
Univ. Auckland, NZ
Uv Amsterdam, NL
UCL, UK
Univ. Vienna, AT
AATRM, ES
FCRB, ES
Project No: 269978
Coordinator: University of
Sheffield, UK
EU Project TRIDEC
Visit the project: http://bigdataaustria.wordpress.com
Code of practice for big data projects
Support and orientation for the implementation of big data projects
Process model Maturity model
Reference architecture
Open Data – a key driver for data science
European Data Innovator Award 2014 goes to
Johann Mittheisz, former CIO of the City of Vienna
& the Open Government Team of Vienna
Global market
IDC expects the global market
to grow from 9.8 billion
USD in 2012 to 32.4 billion USD
in 2017
Yearly growth rate: 27%
Austrian market 2013:
~ 23 million euros
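The yearly growth rate quoted for the global market can be checked against the two market figures with the compound annual growth rate formula. The short Python sketch below uses only the numbers from the slide; the function name is an assumption for illustration.

```python
def compound_annual_growth_rate(start_value, end_value, years):
    """Constant yearly rate that turns start_value into end_value."""
    return (end_value / start_value) ** (1 / years) - 1


# Global market: 9.8 billion USD (2012) to 32.4 billion USD (2017).
rate = compound_annual_growth_rate(9.8, 32.4, 2017 - 2012)
print(f"{rate:.1%}")  # roughly 27 %, in line with the quoted rate
```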
Data Scientists
“We will soon have a huge skills shortage for data-related jobs.”
Neelie Kroes (ICT 2013, Nov. 7, Vilnius)
“Data Scientist: The Sexiest Job of the 21st Century”
http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1
Data scientists
Steps towards a data-driven economy
“Data is a commodity – competence is the key”
Added Value
Market Leadership
Location attractiveness
Enhance competences
Visibility
Objectives
Competence
Enable data access
Legislation
Provide infrastructure
Current status
Focus, create and provide competences
Secure competences for the long-term
Establish holistic institution
Establish (international) legal certainty
Establish general framework for data markets
Incentives for Open Data
Enhance funding for SMEs
Steps
Conclusion
Emerging research field utilizing big data for various application
domains
Data-intensive computing
Machine learning
Data-intensive science and big data have a huge potential to
drive the development of novel applications
by integrating diverse large-scale data sources
by analyzing data sources in real time
by visualizing results meaningfully
Data-driven analytics is a key enabler for
providing more information to stakeholders in a shorter time
supporting better decisions
AIT Austrian Institute of Technology your ingenious partner
Martin Köhler
Mobility Department
Dynamic Transportation Systems
AIT Austrian Institute of Technology GmbH
Giefinggasse 2 | 1210 Vienna | Austria
T +43(0) 50550-6054 | M +43(0) 664 815 79 60 | F +43(0) 50550-6439
martin.koehler@ait.ac.at | http://www.ait.ac.at