Big Data in Research:
Research Analytics Industry Solution
Stuart Long
CTO - Oracle Systems Asia Pacific and Japan
Data Realms:
• Master Data
• Reference Data
• Metadata
• Transaction Data
• Analytical Data
• Unstructured Data
• Big Data
Information Sharing & Delivery
Business Intelligence & Data Warehousing
Data Integration
Content Management
Master Data Management
Enterprise Data Model
Data Governance, Quality, & Lifecycle Management
Data Security
Data Technology Management
Oracle Enterprise Architecture Framework
Oracle Information Architecture Framework
Information Architecture Capability Model
The Information Architecture Spectrum
• Master data, Transactions, Analytical data, Metadata
– Structure: Structured
– Volume: Medium - High
– Security: Database, app, & user access
– Storage & Retrieval: RDBMS / SQL
– Modeling: Pre-defined relational or dimensional modeling
– Processing/Integration: ETL/ELT, CDC, Replication, Message
– Consumption: BI & Statistical Tools, Operational Applications
• Reference data
– Structure: Structured and Semi-Structured
– Volume: Low - Medium
– Security: Platform security
– Storage & Retrieval: XML / XQuery
– Modeling: Flexible & Extensible
– Processing/Integration: ETL/ELT, Message
– Consumption: System-based data consumption
• Documents and Content
– Structure: Unstructured
– Volume: High
– Security: File system based
– Storage & Retrieval: File System / Search
– Modeling: Free Form
– Processing/Integration: OS-level file movement
– Consumption: Content Mgmt
• Big Data (Weblogs, Sensors, Social Media)
– Structure: Structured, Semi-Structured, Unstructured
– Volume: High
– Security: File system & database
– Storage & Retrieval: Distributed FS / NoSQL
– Modeling: Flexible (Key Value)
– Processing/Integration: Hadoop, MapReduce, ETL/ELT, Message
– Consumption: BI & Statistical Tools
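The Big Data realm above pairs key-value modeling with MapReduce processing. As a rough illustration only, the sketch below runs the map, shuffle, and reduce phases in plain Python over a few weblog lines. The log format ("client status path"), the sample records, and every function name are assumptions made for this example; it does not use Hadoop's API.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit (status_code, 1) for each well-formed line of the assumed format."""
    for line in lines:
        fields = line.split()
        if len(fields) == 3:  # assumed format: client status path
            _, status, _ = fields
            yield status, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical sample records, including one malformed line that is skipped.
SAMPLE_LOG = [
    "10.0.0.1 200 /index.html",
    "10.0.0.2 404 /missing",
    "10.0.0.1 200 /data.csv",
    "malformed",
]

status_counts = reduce_phase(shuffle_phase(map_phase(SAMPLE_LOG)))
print(status_counts)
```

In a distributed setting the map and reduce steps would run in parallel across nodes; here they run sequentially so the data flow is easy to follow.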
Evaluating Economic and Architecture Tradeoffs
Big Data
[Figure: Evolution of ESA's EO Data Archives between 1986-2007 and future estimates (up to 2020). Y-axis: Total Archive in Terabytes (TB), 0 to 22,000. X-axis: Year, 1986 to 2020.]
Legend: Future Data Estimates; LANDSAT 2-4 MSS (75-Dec 93); AQUA Modis (April 03-today); ENVISAT LR (March 02-today); ENVISAT HR (March 02-today); TERRA Modis (June 01-today); QUICK SCATT (01-today) / PROBA (May 02-today); LANDSAT 7 ETM (April 99-Dec 03); SEA STAR SeaWifs (Apr 98-today); ERS 2 HR (May 95-today); ERS 2 LBR (May 95-today); JERS SAR/OPS VNIR (92-Sep 98); ERS 1 HR (Jul 91-Mar 00); ERS 1 LBR (Jul 91-Mar 00); SPOT 1-4 HRV (87-today); MOS 1, 1b MESSR (87-Oct 93); NOAA 9-17 AVHRR (86-today); LANDSAT 5 TM (April 84-today); NIMBUS 7 (Nov 78-May 86), SEASAT (Jun-Oct 78)
The LOFAR radio interferometer is producing
1.6 TB/sec (138 PB/day), setting
new frontiers for radio astronomy
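The two LOFAR figures above are the same rate expressed per second and per day. A minimal check of the conversion, assuming decimal units (1 PB = 1000 TB); the function name is illustrative:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def tb_per_sec_to_pb_per_day(tb_per_sec: float) -> float:
    """Convert a rate in TB/s to PB/day, assuming 1 PB = 1000 TB."""
    return tb_per_sec * SECONDS_PER_DAY / 1000


lofar_pb_per_day = tb_per_sec_to_pb_per_day(1.6)
print(f"{lofar_pb_per_day:.1f} PB/day")  # rounds to the 138 PB/day quoted above
```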
The volume of
earth-observation
data from the European Space
Agency’s satellites passed
3 PB in 2007, and the projection
for 2020 is a seven-fold increase
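To put the ESA projection in annual terms, the sketch below derives the 2020 volume and the growth rate it implies. Treating the growth as a constant compound rate between 2007 and 2020 is an assumption made for this illustration, not something the projection itself states.

```python
BASE_YEAR, BASE_VOLUME_PB = 2007, 3.0   # figures quoted above
TARGET_YEAR, GROWTH_FACTOR = 2020, 7.0  # "seven-fold" projection

projected_pb = BASE_VOLUME_PB * GROWTH_FACTOR

# Assumption: smooth compound growth over the whole period.
years = TARGET_YEAR - BASE_YEAR
implied_annual_growth = GROWTH_FACTOR ** (1 / years) - 1

print(f"Projected 2020 archive: {projected_pb:.0f} PB")
print(f"Implied annual growth: {implied_annual_growth:.1%}")
```

Under that assumption the archive grows by roughly 16% per year.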
In genomics:
• Cost of sequencing is dropping
by 50% every 5 months
• “… analysis, not sequencing, will
be the main expense hurdle”
(Cambridge University, UK)
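A halving time of 5 months compounds quickly. The sketch below expresses the quoted rate as a cost multiplier over time; the function name and the sample horizons are illustrative, and the halving rate is taken from the slide as given.

```python
HALVING_PERIOD_MONTHS = 5  # "dropping by 50% every 5 months"


def relative_cost(months: float) -> float:
    """Fraction of today's sequencing cost remaining after `months`."""
    return 0.5 ** (months / HALVING_PERIOD_MONTHS)


for horizon in (5, 12, 60):
    print(f"After {horizon:>2} months: {relative_cost(horizon):.4%} of today's cost")
```

At this rate the cost falls to under a fifth of its starting value within a year, which is consistent with the quoted view that analysis, not sequencing, becomes the main expense.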
Courtesy of BERIS
The volume of worldwide
climate data is expanding
rapidly, creating challenges for
both physical archiving and
sharing, for ease of access of
relevant information in a
multidisciplinary environment
J T Overpeck et al. Science 2011;331:700-702
In high energy physics, the data
recorded by each of the big
experiments at the Large Hadron
Collider will be enough to fill
around 100,000 DVDs every year!
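For a sense of scale, the DVD comparison can be converted into storage units. The 4.7 GB capacity of a single-layer DVD is an assumption here; the slide does not say which disc type is meant.

```python
DVDS_PER_YEAR = 100_000   # per experiment, as quoted above
DVD_CAPACITY_GB = 4.7     # assumption: single-layer DVD

yearly_volume_tb = DVDS_PER_YEAR * DVD_CAPACITY_GB / 1000  # decimal units
print(f"About {yearly_volume_tb:.0f} TB per experiment per year")
```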
The Challenges of Big Data
• Volume: Very large quantities of data
• Velocity: Extremely fast streams of data
• Variety: Wide range of data type characteristics
• Value: High potential value if harnessed correctly
Intel Xeon® 5500 Series:
First Platform with End-to-End HW Virtualization
Intel® Virtualization Technology: holistic platform-centric approach for virtualization usages
• Processor: Intel® VT-x
• Chipset: Intel® VT for Directed I/O (Intel® VT-d)
• Network: Intel® VT for Connectivity (Intel® VT-c)
Low Latency Smart Cache SQL
Oracle First Platform with Data Embedded Instructions
Data Processing Unit DPU Data Aware Storage Data Defined Network
Oracle® Enabling Technology
Optimised for Data Processing and Database
Economies of Real Time Analytics: Waiting for Data
• Today’s research applications are increasingly held back by slow storage
• When requesting data, the server spends most of its time waiting for storage
• Application performance remains sluggish regardless of the server CPU horsepower
• The traditional remedy of adding more DRAM or “short-stroking” HDDs is both expensive and inefficient
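The points above follow the usual bottleneck argument: speeding up the part of a job that is not the bottleneck helps very little. The sketch below applies an Amdahl's-law style estimate. The 80% storage-wait fraction and the speedup factors are hypothetical values chosen only for illustration; they are not measurements from the presentation.

```python
def overall_speedup(storage_wait_fraction: float,
                    cpu_speedup: float = 1.0,
                    storage_speedup: float = 1.0) -> float:
    """Estimate end-to-end speedup when run time splits into storage wait and compute."""
    if not 0.0 <= storage_wait_fraction <= 1.0:
        raise ValueError("storage_wait_fraction must be between 0 and 1")
    compute_fraction = 1.0 - storage_wait_fraction
    new_time = (storage_wait_fraction / storage_speedup
                + compute_fraction / cpu_speedup)
    return 1.0 / new_time


WAIT_FRACTION = 0.8  # hypothetical: 80% of run time spent waiting on storage

faster_cpu = overall_speedup(WAIT_FRACTION, cpu_speedup=2.0)
faster_storage = overall_speedup(WAIT_FRACTION, storage_speedup=10.0)

print(f"2x faster CPU:      {faster_cpu:.2f}x overall")
print(f"10x faster storage: {faster_storage:.2f}x overall")
```

With these assumed inputs, doubling CPU speed yields only about a 1.1x overall gain, while storage that is 10x faster yields about 3.6x.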
Acquire, Organize, Analyze, Visualize
Oracle Big Data Appliance, Oracle Exadata, Oracle Exalytics
InfiniBand
Big Data inside the Research Lifecycle: Oracle’s Engineered Systems Solution
The Research Industry Solutions
The Research Enterprise
• Research Analytics
• Research Data Management
• Research Administration & Control
Our goal: To support researchers, their communities and their organizations
to do better Research by providing
cost-effective, reliable and open solutions
Oracle Research Analytics
A platform that enables Researchers to:
• Work collaboratively on extremely large data sets, with the
performance and innovative tools needed to exploit the data
• Build workflows that best support science and the operations of
complex Research
• Run applications and best adapt them to different scientific loads and
challenges
Challenges to address
• Exponential growth in data and the ability to access critical
information
• Enterprise infrastructure ability to quickly
accommodate new data sources
• Evolve from data analysis to predictive science
• Ability to translate raw data into information and
knowledge
• Managing resources across workloads and platforms
• Process high-volume, low-density information
• Support flexible data structures
• In-database deep analytics
• Perform analysis on big data
• Parallel execution for efficient processing
• Deep, rich set of analytics for extracting maximum business value
Oracle Differentiators
Research Ecosystem
Research Infrastructure
Research Data Management
Research Administration
Research Mission
Research Analytics Flow
Visualization Sharing Discovery Organization
Key Capabilities
• Open standards-based environment
• Minimize development time and effort
• Ensures appropriate levels of access
• Lower cost of research
• Facilitate manipulation of extremely large data sets
• Maximize analytic performance and achieve faster results
• Access to the latest investigative methods & tools
• Enables new science
• Ability to work on extremely large data sets allowing researchers new ways to exploit data
• Ensure trust and security
• Interoperable access to distributed repositories of data
• Facilitate innovative approach to discovery and results
• Support deep rich set of analytics
• Minimize development time/effort
• Reduce time-to-discovery
• Lower cost of research
• Enables new science
Visualization Sharing Discovery
Key Benefits
• High velocity loading and organization of information
• Ability to optimize workloads and system operations
• Ingest a wide range of data types
• Data integration
• Map reduction
• Statistical tools
• Analyze data across a wide variety of data characteristics using deep analytics
• Represent analysis findings
• Transform big data into something easy to analyze
• Load data quickly
• Ensures appropriate levels of access
• Enables cross-disciplinary science & discovery
Organization
Oracle Research Analytics: overview
People. Process. Portfolio.
Oracle’s Integrated Big Data Solution Stack
Oracle Integrated Solution for Big Data
ACQUIRE: Oracle NoSQL Database, HDFS
ORGANIZE: Hadoop, Oracle Big Data Connectors
DECIDE: Analytic Applications
ANALYZE: In-Database Analytics, Data Warehouse
Interactive Discovery, Enterprise Applications
Oracle Exalytics
InfiniBand
Oracle’s Big Data solution
Oracle Big Data Appliance
Oracle Exadata
InfiniBand
Acquire, Organize & Discover, Analyze, Decide
• Endeca Information Discovery
• Cloudera Hadoop
• Oracle NoSQL
• Open-Source R
• Big Data Connectors
• Oracle Data Integrator
• Oracle Business Intelligence
• Oracle Advanced Analytics
• Oracle Database
• Oracle Spatial and Graph
• Pre-configured and optimized for Big Data processing
– 18 Servers, 864GB RAM, 648TB Storage/Rack; easy rack expansion
– NoSQL, Cloudera Hadoop, Oracle R
– Oracle Loader, Oracle Data Integrator, HDFS Connector for integration
• Integrates into your existing architecture
– Streams data into Exadata @15 TB/hour
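The rack figures above imply some useful per-server numbers. The sketch below derives them, assuming that RAM and storage are spread evenly across the 18 servers and that the 15 TB/hour rate is sustained for a full-rack transfer; neither assumption is stated on the slide.

```python
SERVERS_PER_RACK = 18
RAM_GB_PER_RACK = 864
STORAGE_TB_PER_RACK = 648
STREAM_RATE_TB_PER_HOUR = 15  # quoted rate into Exadata

# Assumption: resources are distributed evenly across servers.
ram_gb_per_server = RAM_GB_PER_RACK / SERVERS_PER_RACK
storage_tb_per_server = STORAGE_TB_PER_RACK / SERVERS_PER_RACK

# Assumption: the quoted rate holds for the entire transfer.
hours_to_stream_full_rack = STORAGE_TB_PER_RACK / STREAM_RATE_TB_PER_HOUR

print(f"RAM per server:     {ram_gb_per_server:.0f} GB")
print(f"Storage per server: {storage_tb_per_server:.0f} TB")
print(f"Hours to stream a full rack into Exadata: {hours_to_stream_full_rack:.1f}")
```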
Oracle Big Data Appliance: Engineered Systems for Big Data
Fastest Data Warehouse & OLTP:
– 10X-20X faster load and query times
– 10X storage savings, 80% less power, and a lot less space
Optimized for In-Database Analytics
– Model functions execute in storage
Optimized for Network Throughput
– Network connections in from Big Data capture and out to In-Memory Analytics
1/5th to 1/8th the cost of other alternatives
Oracle Exadata: Engineered Systems for Systems of Record
Data Mining, Statistics, Text Mining, Predictive Analytics
• Comprehensive Predictive Analytic
platform built inside Database
– Data mining, text mining
– Statistical analysis (based on R)
– Built for data analysts / scientists
• Scalable and parallel: analyzes
huge volumes of data
• Tightly integrated with SQL,
enabling broad usage
• Works inside Exadata and
Big Data Appliance
Oracle Advanced Analytics: Advanced In-Database Predictive Analytics
• Spans Relational, Multi-Dimensional, and Unstructured analysis,
combined with Financial & Operational Planning
– In-Memory Optimized Hardware
– In-Memory Oracle BI, TimesTen, Essbase, and Endeca
– Several In-Memory Software Innovations
• Tightly integrated with Exadata
Exalytics In-Memory Machine
Oracle Exalytics: In-Memory Engineered System for Analytics
• Hybrid in-memory search /
analytic engine
– Combines un-structured/structured
and internal/external data (big data)
– Enables search, navigation, and
discovery of data and correlations
• Highly interactive UI for
discovery/exploration
– Social Media Analytics
– Customer 360 Analysis
– Competitive Intelligence
Unified Indexing, Data Mashup, Text Analysis, Unified Search, Faceted Navigation, Interactive Exploration
Information Discovery
Exalytics In-Memory Machine
Oracle Information Discovery: In-Memory Un-Structured & Semi-Structured Analysis
Customer Success in Big Data Architecture
Customer Success: Erasmus Medical Centre
Challenges
• Complex data processing and analysis
• Ability to:
– load huge volumes of data in minimum time
– store these data and the genomic DNA research results on disk
– have an efficient system that delivers query performance
Results
Thanks to an Exadata-based solution, Erasmus Medical Centre achieved:
• An 11-minute query reduced to 1 second, a major advantage for researchers
who need immediate results
• Smart Scan and flash cards improve performance when analyzing data
• Hybrid Columnar Compression makes it practical to manipulate terabytes
of data (compression from 133 GB to 11 GB), with increased performance
• Oracle Database 11g features such as partitioning further improve performance when
manipulating and quantifying data obtained through the study of various genomes
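The Erasmus Medical Centre figures can be restated as ratios. A small sketch using only the numbers quoted on the slide (an 11-minute query reduced to 1 second, 133 compressed to 11); the variable names are illustrative:

```python
QUERY_BEFORE_SECONDS = 11 * 60  # 11-minute query
QUERY_AFTER_SECONDS = 1

SIZE_BEFORE = 133  # uncompressed size, in the unit quoted on the slide
SIZE_AFTER = 11    # size after Hybrid Columnar Compression

query_speedup = QUERY_BEFORE_SECONDS / QUERY_AFTER_SECONDS
compression_ratio = SIZE_BEFORE / SIZE_AFTER
space_saving = 1 - SIZE_AFTER / SIZE_BEFORE

print(f"Query speedup:     {query_speedup:.0f}x")
print(f"Compression ratio: {compression_ratio:.1f}:1")
print(f"Space saved:       {space_saving:.1%}")
```

That works out to a 660x faster query and roughly a 12:1 compression ratio.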
Customer Success: Oregon State University’s COAS
COAS: College of Oceanic and Atmospheric Sciences
Challenges
• To expand its infrastructure to support its leading-edge scientific research on the
influence of the ocean and atmosphere on the Earth’s climate
• To meet the data-intensive demands of its scientific research and foster an
environment that will address current and future workflows
Results
• With Oracle, COAS has an easy-to-manage, integrated system that delivers the
flexibility and scalability necessary to address the exponential data increases
associated with its leading-edge research, as well as to adjust quickly to ever-changing
data availability requirements.
• As a result of extending its infrastructure with Oracle, COAS has improved data
movement and performance by approximately 3 to 4 times, reduced system
administration and management time, and unified research silos to gain a holistic
view of integrated data sets.
• Additionally, COAS can now manage its unusually large input/output (I/O) loads,
enabling the computation, storage, analysis and visualization of massive data flows.
Customer Success: Indiana University
Challenges
• To provide researchers with a first-class database environment that is secure,
reliable and easy to use
• To gain insight into the data rapidly and effectively by building and managing
research-oriented, data-intensive applications
• To provide the tools, templates and plug-ins researchers need to easily leverage research
data to enhance their findings and increase productivity
Results
• Enable research and effective data analysis in different fields
• Provide and run a robust, secure and cost-effective research environment,
protecting data and ensuring that researchers have access to state-of-the-art
technology
• For additional insight into research data, it gives researchers access to
Oracle Data Mining, Oracle Spatial and Oracle OLAP, delivering its Database-as-a-
Service to researchers both within Indiana University and at other universities
around the country