IBM PureData System for Analytics N200x

© 2013 IBM Corporation

IBM® PureData™ System for Analytics

N200x Technical Overview

Adriano Di Massimo

PureData for Analytics Europe IOT


Increasing Variety of data requires new techniques

Increasing Velocity of data requires higher performance

Increasing Volume of data requires growing capacity

35 ZB

by 2020

Big Data Challenges for Both Transactions and

Analytics are Increasing Demands on Data Systems

Millions of transactions per second

Telco subscriber activity logging

Mobile, Cloud, Social, Big Data, Commerce

2020

50x

2010

Analytics

Billions of devices & sensors

(Smart Meters, RFIDs, GPS)


Strategic Big Data: the future Model of Datawarehouse

Source: Top Ten Technology Trends for 2013 – Gartner Symposium Barcelona Nov 2012


IBM PureData System for Analytics (PDA)

• Purpose-built analytics engine

• Integrated database, server and storage

• Standard interfaces

• Low total cost of ownership

Speed: 10-100x faster than traditional systems

Simplicity: Minimal administration and tuning

Scalability: Peta-scale user data capacity

Smart: High-performance advanced analytics

Transforms the User Experience


Announcing a New Model!

PureData for Analytics now has TWO models

• N1001 – economical, high performance and scalability

• N200x – highest performance appliance to date

PureData for Analytics continues to provide:

• Fastest Time to Value on the market today

• Optimized Big Data analytics performance

• Simple administration for fast and agile deployment

• Accelerate analytic performance using large library of analytic functions

The new N200x model addresses these key challenges

• Increased performance

• Better density

• Data center efficiency

PureData System for Analytics N200x


Benefits of the IBM PureData System for Analytics The Fastest Performance of Netezza Technology to Date!


1 Based on a comparison of the IBM PureData System for Analytics N2001 to the IBM PureData System for Analytics N1001. The performance speed refers to the query times on both macro-analytic and mixed workload tests as conducted in IBM engineering lab benchmarks. The N2001 query times were an average of 3x faster than those of the N1001. Individual results may vary.

2 128 GB/sec scan rate assuming an average of 4x compression across the system. Individual results may vary.

3 Capacity of IBM PureData System for Analytics N2001 compared to previous generation IBM PureData System for Analytics N1001.

4 Each N2001 rack contains 34 hot spare drives and 240 active drives for a ratio of 1 spare per 7 drives. Each N1001 rack contains 4 hot spare drives and 92 active drives for a ratio of 1 spare per 23 drives. The N2001 has 3.3x more spares per active drive. Frequency of disk related service calls expected to decrease by 70% assuming the same drive failure rates.
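The ratios quoted in footnote 4 can be checked with a few lines of arithmetic. The sketch below uses only the drive counts from the footnote; the variable names are illustrative, and the step that turns the spare ratio into a service-call reduction is an assumption about how the 70% figure might be derived, not something the slide states.

```python
# Drive counts taken from footnote 4; names are illustrative only.
n2001_spares, n2001_active = 34, 240
n1001_spares, n1001_active = 4, 92

# Active drives covered by each hot spare.
n2001_active_per_spare = n2001_active / n2001_spares   # about 7.06
n1001_active_per_spare = n1001_active / n1001_spares   # exactly 23.0

# Spares per active drive, N2001 relative to N1001.
spare_density_ratio = (n2001_spares / n2001_active) / (n1001_spares / n1001_active)

# ASSUMPTION: disk-related service calls scale with the number of
# active drives each spare has to cover, so the reduction is 1 - 1/ratio.
service_call_reduction = 1 - 1 / spare_density_ratio

print(f"N2001: 1 spare per {n2001_active_per_spare:.1f} active drives")
print(f"N1001: 1 spare per {n1001_active_per_spare:.1f} active drives")
print(f"Spare density ratio: {spare_density_ratio:.1f}x")
print(f"Estimated service call reduction: {service_call_reduction:.0%}")
```

Under that assumption the result lands close to the published 3.3x and roughly 70%.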

Accelerated Performance of Analytic Queries

• 3X faster performance¹ for Big Data analytics

• 128 GB/sec effective scan rate per rack² to tackle Big Data faster

Increased Efficiency of your Data Center

• 50% greater data capacity per rack³ helps optimize data center efficiency

• More capacity and less power per rack than both Oracle and Teradata

Simplicity and Ease of Administration

• Improved system management and resilience to spend less time managing and more time delivering value

• 70% FEWER service calls with more spare drives and faster disk regeneration⁴









The PureData System for Analytics AMPP Architecture

PureData System for Analytics Appliance

[Architecture diagram. Labels: Applications (Advanced Analytics, Loaders, ETL, BI); Host (IBM xSeries, Red Hat Linux); Network Fabric; S-Blades (each with FPGA, Memory, CPU); Disk Enclosures; “Lite”]

Field Programmable Gate Array = a blank canvas until it’s configured



• AMPP Architecture

- Combine the benefits of both technologies: SMP simplicity and MPP performance


Select State, Age, Gender, count(*)
From MultiBillionRowCustomerTable
Where BirthDate < '01/01/1960' And State in ('FL', 'GA', 'SC', 'NC')
Group by State, Age, Gender
Order by State, Age, Gender

S-Blade Data Stream Processing

[Diagram: the clauses of the query mapped onto the S-Blade processing stages.]

• From: From MultiBillionRowCustomerTable (stream via Zone Map)

• FPGA Core: Decompress, Project, Restrict, Visibility

– Select: Select State, Age, Gender, count(*)

– Where: Where BirthDate < '01/01/1960' And State in ('FL', 'GA', 'SC', 'NC')

• CPU Core: SQL & Advanced Analytics

– Group by: Group by State, Age, Gender; Order by State, Age, Gender



• Transparent I/O performance optimization

- Use of FPGA (streaming approach) guarantees the highest and most stable scan rate
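The streaming approach can be pictured as a chain of stages that each consume and emit rows without materializing the whole table. The following is a toy sketch only: `zlib` stands in for the appliance's compression, the column layout and sample rows are invented, the Visibility stage is omitted, and nothing here reflects the actual FPGA implementation.

```python
import zlib
from collections import Counter

def decompress(blocks):
    """Decompress each block and stream its rows out one line at a time."""
    for block in blocks:
        yield from zlib.decompress(block).decode().splitlines()

def project(lines, columns, wanted):
    """Keep only the requested columns of each row."""
    idx = [columns.index(c) for c in wanted]
    for line in lines:
        fields = line.split(",")
        yield tuple(fields[i] for i in idx)

def restrict(rows, predicate):
    """Drop rows that do not satisfy the WHERE predicate."""
    return (row for row in rows if predicate(row))

# Hypothetical table layout and data, for illustration only.
COLUMNS = ["state", "age", "gender", "birth_year", "notes"]
raw = "FL,60,F,1955,a\nGA,30,M,1985,b\nFL,60,F,1955,c\nNY,70,M,1945,d\n"
blocks = [zlib.compress(raw.encode())]

# "FPGA-like" stages: decompress, project, restrict.
rows = project(decompress(blocks), COLUMNS, ["state", "age", "gender", "birth_year"])
rows = restrict(rows, lambda r: int(r[3]) < 1960 and r[0] in {"FL", "GA", "SC", "NC"})

# "CPU-like" stage: group by and order by.
counts = Counter(r[:3] for r in rows)
result = sorted(counts.items())
print(result)
```

Because every stage is a generator, only rows that survive the restriction ever reach the aggregation step, which is the point the slide makes about reducing the work left for the CPU.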



Data Warehouse Workload: Fewer requests, lots of data manipulation

[Diagram: Transactional System used for BI. Labels: Request, CPU, General Purpose Storage]


Data Warehouse Workload: Transaction systems are inefficient for data shuffling

[Diagram: Transactional System used for BI. Labels: Request, Results, CPU, General Purpose Storage]


Data Warehouse Blades: Designed for Tera-scale Business Intelligence

[Diagram: PureData for Analytics System, Asymmetric Massively Parallel Processing. Labels: Request, Results, CPU, Intelligent Storage]


Data Warehouse Blades: Highly efficient data movement

[Diagram: Netezza Performance Server™ System, Asymmetric Massively Parallel Processing. Labels: Request, Results, CPU, Intelligent Storage; 1% of network traffic; 2% of CPU requirements]


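N200x: What’s new

[Diagram: per-core data flow through the FPGA Core and CPU Core (Visibility, SQL & Advanced Analytics) for each model.]

N1001: 1 drive at 120 MB/sec; 480 MB/sec; FPGA core 500 MB/sec; CPU core 800 MB/sec +

N200x: 2.5 drives / core (65 MB/sec + 130 MB/sec + 130 MB/sec = 325 MB/sec); 1300 MB/sec; FPGA core 1000 MB/sec; CPU core 1000 MB/sec +

PureData System for Analytics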

PureData System for Analytics


How We Did it, Conceptually

Balanced Performance

• More Drives with Faster Scan Rates Leading to Faster Performance: from 1 drive @ 120 MB/sec to 2.5 drives @ 130 MB/sec each

• Faster FPGA Cores, Driving Higher Performance: FPGA core from 500 MB/sec to 1000 MB/sec; CPU core from 800 MB/sec + to 1000 MB/sec +

FPGA Core

• Decompress

• Project

• Filter

CPU Core

• Analyze


PureData System for Analytics N1001

• 14 Blades per full rack

• Each S-Blade

– 8 CPU Cores

– 8 FPGA Engines

– Sized to handle 8 disks or 960 MB/sec

• 92 Active Data Slices deliver 11 GB/sec raw disk throughput

[Diagram: S-Blades (Memory, CPU, FPGA) above their disks; disks per S-Blade shown as 8 or 6]


PureData System for Analytics N200x

• 7 Blades per full rack

• Each S-Blade

– 16 CPU Cores

– 16 FPGA Engines

– Sized to handle 40 disks or 5.2 GB/sec

• 240 Active Data Slices deliver 31.2 GB/sec raw disk throughput

3x More Disk Throughput

[Diagram: S-Blades (Memory, CPU, FPGA) above their disks; disks per S-Blade shown as 40, 40, 32, 32, 32, 32, 32]
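The per-blade and per-rack throughput figures on the N1001 and N200x slides follow from the drive counts and the per-drive scan rates (120 MB/sec and 130 MB/sec) quoted elsewhere in the deck. A minimal sketch of that arithmetic, with illustrative names and decimal units (1 GB = 1000 MB) assumed:

```python
def raw_throughput_gb_s(drives, mb_per_sec_per_drive):
    """Aggregate raw disk throughput in GB/sec, assuming 1 GB = 1000 MB."""
    return drives * mb_per_sec_per_drive / 1000

n1001_blade = raw_throughput_gb_s(8, 120)     # one N1001 S-Blade, 8 disks
n200x_blade = raw_throughput_gb_s(40, 130)    # one N200x S-Blade, 40 disks
n1001_rack = raw_throughput_gb_s(92, 120)     # 92 active data slices
n200x_rack = raw_throughput_gb_s(240, 130)    # 240 active data slices

speedup = n200x_rack / n1001_rack

print(f"N1001 blade: {n1001_blade:.2f} GB/sec, rack: {n1001_rack:.2f} GB/sec")
print(f"N200x blade: {n200x_blade:.2f} GB/sec, rack: {n200x_rack:.2f} GB/sec")
print(f"Rack-level ratio: {speedup:.2f}x")
```

The computed rack-level ratio is a little under 3, which the slide rounds to "3x More Disk Throughput".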


Netezza Platform Software v7.1

Highlights

• Scheduler rules for WLM

• Short query prioritization

• Snippet Result Cache

• Faster Bulk Fetching with ODBC

• Password aging and expiry

• nzPortal enhancements

• Cryptographic Standards (SP 800-131A)

• Support for Replication v1.5

• Support for INZA 3.0

Resiliency

• Faster rebalance for failed disks

• Disk validation support

• Large scale disk replacement

• Call Home v1.0

• Enhanced System Health Checks v2.2

• ILMT support for Growth on Demand

Platform & OS

• Client Kit support for AIX 7.1

• RHEL 6.4 certification

SQL Enhancements

• Multiple Schema (3-part naming)

• Orphan column query

• NOT IN / EXISTS improvements

• CASE WHEN improvements

• Support 24 hour datetime

• CESU-8 support

Transaction Enhancement

• Truncate table in TXN

• Improved view validation

• Temp table enhancements

• Deprecate Web Admin

ETL

• ODBC loader support for INTERVAL

Netezza Performance Portal

• Cryptographic standards (SP 800-131A)

• Scheduler rules

• History type AUDIT

• Restrict nzPortal users

• Groom dialogs


Directed Data Processing

• Distribute Restrict Optimization

– Use distribution key to target scans

Transaction history distributed on customer ID

select ) from tx_hist where custid in (1, 2)

select ) from tx_hist where custid = 3

[Diagram: Hosts above the data slices; rows are grouped by distribution key (custid = 1, custid = 2, custid = 3), and each query is directed only to the slices holding the requested custid values]
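The idea behind Distribute Restrict Optimization can be sketched as follows: if rows are placed on data slices by a function of the distribution key, a restriction on that key tells the host which slices can possibly hold matching rows. The slice count and the modulo placement below are stand-ins chosen for illustration; the source does not describe the appliance's actual hash function.

```python
NUM_SLICES = 12  # hypothetical number of data slices

def slice_for(custid, num_slices=NUM_SLICES):
    """Map a distribution-key value to a data slice.

    ASSUMPTION: plain modulo is used here as a stand-in for the real hash.
    """
    return custid % num_slices

def slices_to_scan(custids, num_slices=NUM_SLICES):
    """Return the only slices that need to scan for the given key values."""
    return sorted({slice_for(c, num_slices) for c in custids})

# ... where custid in (1, 2): two slices scan, the other ten stay idle.
print(slices_to_scan([1, 2]))

# ... where custid = 3: a single slice scans.
print(slices_to_scan([3]))
```

Without the optimization every slice would scan its share of `tx_hist`; with it, slices that cannot contain the requested keys are left free for other work.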


Page Granular Zone Maps

• 24X finer granularity (128 KB instead of 3 MB)

[Diagram: data blocks holding October, November and Other values, queried with "where col = October"]

• At 3 MB granularity: Total 12 MB (4 x 3 MB)

• At 128 KB granularity: Total 1 MB (8 x 128 KB)
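A zone map keeps the minimum and maximum value stored in each block so that blocks whose range cannot match a restriction are skipped. The sketch below reproduces the totals on the slide with a made-up layout: four 3 MB blocks of 24 pages each, where only the first two pages of every block hold October rows. Months are encoded as integers (10 for October, 11 for November, 12 for "Other"), and all names are illustrative.

```python
from dataclasses import dataclass

PAGE_KB = 128
PAGES_PER_BLOCK = 24  # 3 MB / 128 KB, hence "24X finer granularity"

@dataclass
class Zone:
    min_value: int
    max_value: int
    size_kb: int

def build_pages():
    """Hypothetical layout: 4 blocks, 2 October pages at the start of each."""
    pages = []
    for _ in range(4):
        for p in range(PAGES_PER_BLOCK):
            month = 10 if p < 2 else 11 if p < 12 else 12
            pages.append(Zone(month, month, PAGE_KB))
    return pages

def to_blocks(pages):
    """Coarse zone map: one min/max entry per 3 MB block."""
    blocks = []
    for i in range(0, len(pages), PAGES_PER_BLOCK):
        chunk = pages[i:i + PAGES_PER_BLOCK]
        blocks.append(Zone(min(z.min_value for z in chunk),
                           max(z.max_value for z in chunk),
                           sum(z.size_kb for z in chunk)))
    return blocks

def kb_scanned(zones, value):
    """Sum the sizes of zones whose [min, max] range could contain value."""
    return sum(z.size_kb for z in zones if z.min_value <= value <= z.max_value)

pages = build_pages()
blocks = to_blocks(pages)
coarse_kb = kb_scanned(blocks, 10)   # where col = October, 3 MB granularity
fine_kb = kb_scanned(pages, 10)      # where col = October, 128 KB granularity
print(coarse_kb / 1024, "MB vs", fine_kb / 1024, "MB")
```

With this layout the coarse map has to read all four blocks (12 MB) because each one contains some October rows, while the page-level map reads only the eight pages that actually do (1 MB).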


Snippet Result Cache

Observation

• BI/Web page generated reports create queries with limited variation

• Repeated tables, columns, restrictions

Keep intermediate results

• From simple table scans

• Using existing storage

Internal Benchmarking Results

• Up to 2.5X faster for tactical queries



Snippet Result Cache

• Preserves intermediate tables generated by snippets for use in subsequent queries

• Queries do NOT have to be identical to benefit

[Diagram: two SQL queries, each broken down into a series of snippets]
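The reason queries need not be identical is that the cache works at the level of snippets, such as a simple table scan, rather than whole statements. A toy sketch of that idea follows; the class, the cache key (table, columns, restriction) and the sample data are all hypothetical, and the real cache's keying and invalidation rules are not described in the source.

```python
class SnippetResultCache:
    """Toy cache of scan-snippet results, keyed by what the snippet reads."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def scan(self, table, columns, restriction, run_scan):
        # ASSUMPTION: two snippets are interchangeable when they read the
        # same table and columns under the same restriction.
        key = (table, tuple(sorted(columns)), restriction)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = run_scan()
        return self._store[key]

ROWS = [("FL", 10), ("GA", 20), ("FL", 5)]

def scan_fl():
    """Stand-in for an expensive table scan restricted to state = 'FL'."""
    return [r for r in ROWS if r[0] == "FL"]

cache = SnippetResultCache()

# Query 1: total amount for FL.
fl_rows = cache.scan("sales", ["state", "amount"], "state = 'FL'", scan_fl)
total = sum(amount for _, amount in fl_rows)

# Query 2: a different query (row count) that shares the same scan snippet.
fl_rows_again = cache.scan("sales", ["amount", "state"], "state = 'FL'", scan_fl)
count = len(fl_rows_again)

print(total, count, cache.hits, cache.misses)
```

The second query differs from the first in what it computes, yet it reuses the stored intermediate result because its scan snippet is the same.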


ODBC Bulk Fetch Enhancements

Delivers a more competitive select performance!

‒ Eliminates expensive conversion routines when the client and database share the same data type

‒ Nearly 4X faster for select data types!

Sample improvements:

Data Type        Today      NPS 7.1     Times Faster   % Gain
Char(ns)         175.704    45.009      3.90           74%
Int1             101.38     54.86       1.85           46%
Int8             76.421     24.198      3.16           68%
Boolean (bit)    195.27     133.3441    1.46           31%
Double           75.684     31.271      2.42           58%
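The two derived columns in the table can be recomputed from the "Today" and "NPS 7.1" timings. The sketch below assumes the usual definitions (times faster = before / after, gain = 1 - after / before); the slide does not state its formulas or the timing units, so treat the names and formulas as assumptions.

```python
# (before, after) timings copied from the table; units are not given.
timings = {
    "Char(ns)": (175.704, 45.009),
    "Int1": (101.38, 54.86),
    "Int8": (76.421, 24.198),
    "Boolean (bit)": (195.27, 133.3441),
    "Double": (75.684, 31.271),
}

def times_faster(before, after):
    """How many times shorter the new timing is."""
    return before / after

def pct_gain(before, after):
    """Percentage of the original time that was saved."""
    return (1 - after / before) * 100

for name, (before, after) in timings.items():
    print(f"{name:14s} {times_faster(before, after):5.2f}x {pct_gain(before, after):5.1f}%")
```

Recomputed this way, the "Times Faster" column matches the slide to two decimals and the "% Gain" column agrees to within about one percentage point, which suggests some values on the slide were truncated rather than rounded.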









Spend Less Time Managing and More Time Innovating

• No dbspace/tablespace sizing and configuration

• No redo/physical/Logical log sizing and configuration

• No page/block sizing and configuration for tables

• No extent sizing and configuration for tables

• No Temp space allocation and monitoring

• No RAID level decisions for dbspaces

• No logical volume creations of files

• No integration of OS kernel recommendations

• No maintenance of OS recommended patch levels

• No JAD sessions to configure host/network/storage

Data Experts, not Database Experts

• Easy Administration Portal

• No software installation

• No indexes and tuning

• No storage administration


IBM Netezza Performance Portal 2.0: Consolidating WebAdmin and Portal for Simple Admin

• Simple web user interface

– Part of the PureData System for Analytics

• New functional and usability enhancements

– Administrative Functions

  • Hardware view & alerts

  • Database objects administration

  • User & Group management

  • View active sessions

  • Workload Management

  • View Events

  • Table skew/storage search

  • Capacity Planning

– Monitor enhancements

  • Usability improvements: monitors can be resized and not-monitored periods can be marked

– Customer requested improvements

  • Show locks

• Monitor System Resources

• Perform System Administration

• Understand & Predict Capacity


Netezza Performance Portal 2.1

• Support for Scheduler rules

• Ability to restrict users from adding Hosts

• New panel for Resource Allocation Performance History

• Ability to view history of BAR operations

• Support for EXPLAIN command with Query History enabled

• Client field filters for Query History view

• History type AUDIT added to Query History

• IBM HTTP server replaces Apache server


Scheduler Rules for WLM

1. Replaces the Gatekeeper Scheduler

2. Ability to limit, prioritize, and abort queries through simple rules

3. Ability to match on group, plan type, priority, estimate, user, db, table, client info & tags

4. Great for large scale environments running in high concurrency

5. Helps to tune out query contention resulting from high use of disk and memory

[Diagram labels: Gatekeeper, GRA, SQB]


Scheduler Rule Examples

• Modifying scheduler rules:

– IF USER IS sam THEN INCREASE PRIORITY

– IF TYPE IS LOAD THEN SET PRIORITY LOW

– IF TAG IS eom THEN EXECUTE AS RESOURCEGROUP group42

– IF ESTIMATE >= 5 ESTIMATE < 12 THEN INCREASE PRIORITY

– IF CLIENT_APPLICATION_NAME IS Cognos THEN ABORT

– IF CLIENT_ACCOUNTING_STRING IN (‘weekly_report’, ‘daily_report’) THEN SET PRIORITY HIGH

• Limiting scheduler rules:

– IF TAG IS cube THEN LIMIT 1

– IF TAG IS cube USER IS sam THEN LIMIT 2

– IF TYPE IS GENERATE STATISTICS THEN LIMIT 1
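To make the "IF conditions THEN action" shape of the modifying rules concrete, here is a small model of how such rules could be evaluated against a query's attributes. This is an illustration only, not the NPS rule engine: the data structures, the attribute names, the priority ladder and the in-order evaluation are all assumptions made for the sketch.

```python
# ASSUMPTION: a simple four-step priority ladder, lowest to highest.
PRIORITIES = ["LOW", "NORMAL", "HIGH", "CRITICAL"]

def matches(conditions, query):
    """A rule matches when every listed attribute equals the query's value."""
    return all(query.get(key) == value for key, value in conditions.items())

def apply_rules(rules, query):
    """Apply each matching rule in order and return the modified query."""
    query = dict(query)
    for conditions, action, arg in rules:
        if not matches(conditions, query):
            continue
        if action == "SET_PRIORITY":
            query["priority"] = arg
        elif action == "INCREASE_PRIORITY":
            i = PRIORITIES.index(query["priority"])
            query["priority"] = PRIORITIES[min(i + 1, len(PRIORITIES) - 1)]
        elif action == "ABORT":
            query["aborted"] = True
    return query

# Hypothetical encodings of three rules from the slide.
RULES = [
    ({"user": "sam"}, "INCREASE_PRIORITY", None),
    ({"type": "LOAD"}, "SET_PRIORITY", "LOW"),
    ({"client_application_name": "Cognos"}, "ABORT", None),
]

sam_query = apply_rules(RULES, {"user": "sam", "type": "QUERY", "priority": "NORMAL"})
load_job = apply_rules(RULES, {"user": "etl", "type": "LOAD", "priority": "NORMAL"})
cognos_query = apply_rules(RULES, {"user": "bi", "type": "QUERY", "priority": "NORMAL",
                                   "client_application_name": "Cognos"})
print(sam_query["priority"], load_job["priority"], cognos_query.get("aborted", False))
```

A rule with several conditions, such as the one combining TAG and USER, would simply carry more than one entry in its conditions dictionary.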


Call Home Service

Real time link between your appliance and IBM

• Automatic problem reporting

• Ongoing Inventory tracking

• Operational status and health for proactive support

Improves support efficiency, effectiveness and the client experience

• Reduces your Total Cost of Ownership (TCO)

• Reduces duration of most common support calls

• Raises our awareness of your issues sooner

• Makes support more proactive without requiring you to do more

• Helps to improve product and support quality over time


How it Works

• Targeted NZEVENTs automatically run nzOpenPmr, collect data and email IBM

• New email identifies you, appliance (identity, location and status) and fault data

• Attached diagnostics include:

+ sysmgr and eventmgr logs

+ SMART logs for disks

+ cluster logs for Host issues

+ crash stacks for core dumps (avg. size: 15 Kbytes)

• Automation opens PMR, posts diagnostic data and replies w/ PMR

Configuration and Enablement

• Requires recent NPS fixpack and functional SMTP routing

• Additional configuration in callHome.txt

+ IBM Customer (ICN)

+ Machine Type, Model and S/N

• Identify your Support contact and email alias

• nzOpenPmr configuration creates new event table entry

SAMPLE callHome.txt

# /nz/data/config/callHome.txt

# Installation-specific attributes.

customer.company = Your Business

customer.address1 = Appliance Install Address

customer.address2 = Installed City, State, Zip

customer.ICN = 1234567

contact1.name = Joe SysAdmin

contact1.phone = 1.617.555.1212

contact1.email = [email protected]

contact1.cell = 1-508-555-9876

contact1.events = ALL

contact2.name = D.B. Admin

contact2.phone = +1.508.555.1212

contact2.email = [email protected]

contact2.cell = +1.508.555.2121

system.description = Test System

system.location = Rm 122 Aisle F Slot 2

system.model = N2001-005

system.MTM = 3565 / DD0

system.serial = NZ3xxxx

system.CC = 2 char Country Code (ISO)
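Since the sample file is a flat list of `key = value` lines with `#` comments, it is easy to sanity-check before enabling the service. The sketch below is a hypothetical helper, not part of NPS: the function names are invented, and the list of required keys is an assumption that maps the slide's "IBM Customer (ICN)" and "Machine Type, Model and S/N" onto the key names used in the sample.

```python
def parse_call_home(text):
    """Parse 'key = value' lines, ignoring blank lines and '#' comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key/value line
        settings[key.strip()] = value.strip()
    return settings

# ASSUMPTION: these are the keys a site would want to verify are filled in.
REQUIRED_KEYS = ("customer.ICN", "system.MTM", "system.model", "system.serial")

def missing_keys(settings, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not settings.get(key)]

SAMPLE = """\
# /nz/data/config/callHome.txt
customer.company = Your Business
customer.ICN = 1234567
system.model = N2001-005
system.MTM = 3565 / DD0
system.serial = NZ3xxxx
"""

settings = parse_call_home(SAMPLE)
print(settings["system.MTM"])
print(missing_keys(settings))
```

Running the same check against an empty or partially filled file lists the identifiers that still need to be supplied.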

Call Home Service – How it Works


• Less than 5 minutes to rebalance a failed Blade

– Unmount and remount disk rather than reboot the blade

• Rebalance occurs under normal “pause” Blade

– Avoids losing any process work (Loads or queries)


Faster Rebalance for failed Drives


Summary of competitive advantages

• Transparent I/O performance optimization

– Use of FPGA (streaming approach) guarantees the highest and most stable scan rate, without any need for expensive performance improvement features like:

  • automatic dynamic storage differentiated by data access behaviour («virtual storage»)

  • «in-memory» solution or

  • «columnar» storage

• Specific RDBMS

– Optimized software by removing all unnecessary and expensive typical OLTP RDBMS features like:

  • Log/journaling management

  • Lock management

  • Referential integrity feature management

• AMPP Architecture

– Combine the benefits of both technologies: SMP simplicity and MPP performance

– Symmetric «Shared Nothing» Architecture has limitations:

  • Frequent «bottlenecks» due to the mix of heterogeneous processes on the same physical resources

  • Risk of unbalanced use of clustered resources due to bad access configuration


Summary of competitive advantages

• Workload Management

– World-class workload manager functionalities

– Maximize resource usage without complex workload management settings

• Availability and Resiliency

– No need of «fallback-like» / table mirroring functionalities

  • Disk availability is guaranteed by RAID 1

  • Zero-downtime in case of node failure is guaranteed by built-in spare S-blades

– Efficient Incremental backup avoiding complex techniques like partitioning archive

• Simplicity

– Zero-tuning

  • «Zone-map»: automatic anti-index approach to avoid scanning of unnecessary data for users query

  • Automatic update of data demographic statistics

  • Automatic partitioning

  • Ad-hoc query enabling technology

– Near-zero administration

– Data model agnostic


Inside the Q


PureData System for Analytics Hardware Overview: Model N1001

• 8 Disk Enclosures

– 96 1TB SAS Drives (4 hot spares)

– RAID 1 Mirroring

• 14 PureData for Analytics S-Blades™

– 2 Intel Quad-Core 2+ GHz CPUs

– 4 Dual-Engine 125 MHz FPGAs

– 24 GB DDR2 RAM

– Linux 64-bit Kernel

• 2 Hosts (Active-Passive):

– 2 Quad-Core Intel 2.6 GHz CPUs

– 7x146 GB SAS Drives

– Red Hat Linux 5 64-bit

• User Data Capacity: 128 TB**

• Data Scan Speed: 145 TB/hr**

• Load Speed (per system): 5+ TB/hr

• Power Requirements: 7.6 kW

• Cooling Requirements: 7.8 kW

**: 4X compression assumed

Scales from ¼ Rack to 10 Racks

32 TB to 1.2 PB of User Data


PureData System for Analytics Hardware Overview: Model N200x

• User Data Capacity: 192 TB*

• Data Scan Speed: 450 TB/hr*

• Load Speed (per system): 5+ TB/hr

• Power Requirements: 7.5 kW

• Cooling Requirements: 27,000 BTU/hr

* Assuming 4X compression

2 Hosts (Active-Passive)

• 2 6-Core Intel 3.46 GHz CPUs

• 7x300 GB SAS Drives

• Red Hat Linux 6 64-bit

7 PureData for Analytics S-Blades™

• 2 Intel 8 Core 2+ GHz CPUs

• 2 8-Engine Xilinx Virtex-6 FPGAs

• 128 GB RAM + 8 GB slice buffer

• Linux 64-bit Kernel

12 Disk Enclosures

• 288 600 GB SAS2 Drives

– 240 for User Data

– 14 for S-Blades

– 34 Spare

• RAID 1 Mirroring

Scales from ½ Rack to 4 Racks


PureData System for Analytics Models

                            | PureData System for Analytics N1001 | PureData System for Analytics N200x
Blade Type                  | HS22                                | HX5
CPU Cores / Blade           | 2 x 4 Core Intel CPUs               | 2 x 8 Core Intel CPUs
# Disks                     | 96 x 3.5” / 1 TB SAS (92 Active)    | 288 x 2.5” / 600GB SAS2 (240 Active)
Raw Capacity                | 96 TB                               | 172.8 TB
Total Disk Bandwidth        | ~11 GB/s                            | ~32 GB/s
S-Blades per Rack (cores)   | 14 (112)                            | 7 (112)
S-Blade Memory              | 24 GB                               | 128 GB
Rack Configurations         | ¼, ½, 1, 1 ½, 2 – 10                | ¼, ½, 1, 2, 4 (6 and 8 rack configs to follow)
FPGA Cores / Blade          | 8 (2 x 4 Engine Xilinx FPGA)        | 16 (2 x 8 Engine Xilinx Virtex 6 FPGA)
User Data / Rack *          | 128 TB                              | 192 TB

* Assuming 4x Compression
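Two of the table's figures can be cross-checked from the others: raw capacity is drive count times drive size, and the "50% greater data capacity per rack" claim made elsewhere in the deck is the ratio of the two User Data rows. A short sketch, with illustrative names and decimal units (1 TB = 1000 GB) assumed:

```python
def raw_capacity_tb(drive_count, drive_size_gb):
    """Raw capacity in TB, assuming 1 TB = 1000 GB."""
    return drive_count * drive_size_gb / 1000

n1001_raw_tb = raw_capacity_tb(96, 1000)   # 96 x 1 TB drives
n200x_raw_tb = raw_capacity_tb(288, 600)   # 288 x 600 GB drives

# User data per rack as published (both assume 4x compression).
n1001_user_tb = 128
n200x_user_tb = 192
capacity_gain_pct = (n200x_user_tb / n1001_user_tb - 1) * 100

print(f"Raw capacity: N1001 {n1001_raw_tb} TB, N200x {n200x_raw_tb} TB")
print(f"User data per rack: {capacity_gain_pct:.0f}% greater on N200x")
```

The raw figures match the table, and the user-data ratio gives the 50% quoted in the summary slides.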


New Offerings for the Entry-Level Market

• PureData System for Analytics ‘Lite’ (Q4’13)

– Entry-Level Striper Configuration (N2002-002)

– 32 TB usable capacity

– 50% better performance than a TwinFin-3 (N1001-002)

– Improved resiliency over TwinFin-3 with more spare drives

• IBM Netezza Platform Development Software

– Virtualized Image supporting VMWare vSphere 5.1

– Documented reference architecture and best practices

– Install Licensing

– 16+ TB usable capacity (compressed)

– Development and Test Only


IBM Netezza Platform Development Software

• Full function NPS 7.x software for DEV and TEST only

– In a fully virtualized offering

• Fully supported, simple to set up, running in minutes

– Just like an appliance

• Licensed per virtual server

• System Limits

– 16 CPU cores

– 64GB RAM

– 4TB raw space (~16TB w/compression)

[Diagram labels: Host, SPU, SPU]


IBM Announces Growth on Demand for PureData System for Analytics

Program Basics

• New Offering called “Growth on Demand”

• Purchase a larger system, license 50% of the capacity and performance

Instant Upgrade

• Grow in easy steps

• Additional capacity enabled by licensing and software configuration

• Capacity can be added, but not reduced with this program

Simple Deployment

• Provision one system

• Expand through licensing

• Zero impact on data center operations


Growth on Demand Single Rack Example

Existing part (seven such parts, one for each model)

New part : min 50% entitled capacity (both storage and performance), one for each existing part

New part : adding 12.5% extra capacity (both storage and performance), one for each PDA model size
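The licensing steps described above amount to simple arithmetic: a minimum entitlement of 50%, plus 12.5% per add-on part, up to the full system. A minimal sketch follows; the function names are illustrative, the limit of four add-ons is inferred from 50% + 4 x 12.5% = 100%, and the 192 TB figure is the N200x full-rack user data capacity quoted elsewhere in the deck.

```python
BASE_PCT = 50.0     # minimum entitled capacity
STEP_PCT = 12.5     # extra capacity enabled by each add-on part
MAX_ADD_ONS = 4     # inferred: 50% + 4 x 12.5% = 100%

def licensed_capacity_pct(add_ons):
    """Percentage of the installed system that is licensed for use."""
    if not 0 <= add_ons <= MAX_ADD_ONS:
        raise ValueError("add_ons must be between 0 and 4")
    return BASE_PCT + STEP_PCT * add_ons

def licensed_user_data_tb(full_rack_tb, add_ons):
    """Licensed user data capacity for a given number of add-on parts."""
    return full_rack_tb * licensed_capacity_pct(add_ons) / 100

for add_ons in range(MAX_ADD_ONS + 1):
    pct = licensed_capacity_pct(add_ons)
    tb = licensed_user_data_tb(192, add_ons)
    print(f"{add_ons} add-on(s): {pct:5.1f}% licensed, {tb:.0f} TB user data")
```

Because the program only allows capacity to be added, the number of add-ons for a given system can only go up over time.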

[Diagram: a ‘Normal’ full rack at 100% capacity next to a ‘Minimum capacity’ full rack at 50% capacity, with four ‘Extra capacity’ add-on parts filling the gap between 50% and 100%]


IBM DB2 Analytics Accelerator: Now even faster with N200x

• The PureData System for Analytics N200x is also the next generation DB2 Analytics Accelerator

• Providing the same improvements to our DB2 for z/OS customers


Big Data Meets Deep Analytics

52

Analytics without constraint


IBM Netezza Analytics Ecosystem

PureData for Analytics AMPP Platform

Software Development Kit

• User-Defined Extensions (UDF, UDA, UDTF, UDAP)

• Language Support (Map/Reduce, Java, Python, Lua, Perl, C, C++, Fortran, PMML)

Netezza In-Database Analytics

• Transformations

• Mathematical

• Geospatial [Esri / nzSpatial]

• Predictive

• Statistics

• Time Series

• Data Mining

3rd Party In-Database Analytics

• Fuzzy Logix

• SAS

• Zementis

• IBM SPSS

• Mathworks

• Revolution Analytics

BI Tools

Visualization Tools


Integrated by Design: IBM Netezza Analytics Version 2.0

Netezza In-Database Analytics 2.0

• Transformations

• Mathematical

• Geospatial

• Predictive

• Statistics

• Time Series

• Data Mining

• No data movement

• Analyze deep and wide data

• High performance, parallel computation


Pre-Built In-Database Analytics

Transformations

• Data Profiling / Descriptive Statistics+

• General Diagnostics

• Statistics+

• Sampling

• Data prep

Mathematical

• Basic Math*

• Permutation and Combination*

• Greatest Common Divisor and Least Common Multiple*

• Conversion of Values*

• Exponential and Logarithm*

• Gamma and Beta Functions

• Matrix Algebra+

• Area Under Curve*

• Interpolation Methods*

Time Series

• Autoregressive+

• Forecasting*

Predictive

• Linear Regression+

• Logistic Regression+

• Classification

• Bayesian

• Sampling

• Model Testing

Geospatial

• Geospatial Data Type

• Geometric Functions

• Geometric Analysis

Data Mining

• Association Rules+

• Clustering+

• Feature Extraction+

• Discriminant Analysis*

Statistics

• Descriptive Statistics+

• Distance Measures*

• Hypothesis Testing*

• Chi-Square & Contingency Tables*

• Univariate & Multivariate Distributions+

• Monte Carlo Simulation*

* Fuzzy Logix DB Lytix capabilities

+ Netezza Analytics and Fuzzy Logix DB Lytix capabilities


What’s New in N200x: Summary

• 50% Greater Storage Capacity per rack

• 3x scan rate vs N1001 series

• Improved Resiliency and Fault Tolerance

– More spare drives per cabinet

– Faster drive regeneration

– Online Firmware upgrades

• NPS 7.0

– Distribute Restrict Optimization

– Page Granular Zone Maps


Catch the Striper “Wave”

Why Upgrade to the IBM PureData System for Analytics N2000 Series Appliance


Why Upgrade Your TwinFin System?

PureData System for Analytics N2002 provides:

• The latest hardware

– 3x faster scan rates¹

– 128 GB/sec effective scan rate per rack²

– 6x more memory per Blade server

– Leverage future software enhancements longer

• Increased data center efficiency with 50% greater data capacity per rack³

• Improved system management & resiliency

• 70% fewer service calls with more spare drives and faster disk regeneration⁴

Catch the Striper Wave before TwinFin comes to end of life

1 Based on a comparison of the IBM PureData System for Analytics N200x to the IBM PureData System for Analytics N1001. The performance speed refers to the query times on both macro-analytic and mixed workload tests as conducted in IBM engineering lab benchmarks. The N200x query times were an average of 3x faster than those of the N1001. Individual results may vary.

2 128 GB/sec scan rate assuming an average of 4x compression across the system. Individual results may vary.

3 Capacity of IBM PureData System for Analytics N200x compared to previous generation IBM PureData System for Analytics N1001.

4 Each N200x rack contains 34 hot spare drives and 240 active drives for a ratio of 1 spare per 7 drives. Each N1001 rack contains 4 hot spare drives and 92 active drives for a ratio of 1 spare per 23 drives. The N200x has 3.3x more spares per active drive. Frequency of disk related service calls expected to decrease by 70% assuming the same drive failure rates.


IBM Netezza’s Market-Leading Evolution

World’s First Data Warehouse Appliance

World’s First 100 TB Data Warehouse Appliance

World’s First Petabyte Data Warehouse Appliance

World’s First Analytic Data Warehouse Appliance

NPS®

8000 Series

TwinFin™ with i-Class™

Advanced Analytics

NPS®

10000 Series

TwinFin™

World’s fastest and “greenest” analytical

platform

2003 2006 2009 2010 2011 2013

PureData™ System for Analytics N2002


Striper Leverages the Latest Hardware

3x faster scan rate

Drives per core have gone from

1 drive @ 120 MB/sec to

2.5 drives @ 130 MB/sec

FPGA cores have gone from

500 MB/sec to 1000 MB/sec

CPU cores have gone from

800 MB/sec to 1000+ MB/sec

6x more memory per Blade

(better leveraged by NPS 7.x)

50% greater data capacity per rack


Striper vs. TwinFin Hardware Comparison

                            | PureData System for Analytics N1001 (TwinFin) | PureData System for Analytics N2002 (Striper)
Blade Type                  | HS22                                          | HX5
CPU Cores / Blade           | 2 x 4 Core Intel CPUs                         | 2 x 8 Core Intel CPUs
# Disks                     | 96 x 3.5” / 1 TB SAS (92 Active)              | 288 x 2.5” / 600GB SAS2 (240 Active)
Raw Capacity                | 96 TB                                         | 172.8 TB
Total Disk Bandwidth        | ~11 GB/s                                      | ~32 GB/s
S-Blades per Rack (cores)   | 14 (112)                                      | 7 (112)
S-Blade Memory              | 24 GB                                         | 128 GB
Rack Configurations         | ¼, ½, 1, 1 ½, 2 – 10                          | entry level, ½, 1, 2, 4
FPGA Cores / Blade          | 8 (2 x 4 Engine Xilinx FPGA)                  | 16 (2 x 8 Engine Xilinx Virtex-6 FPGA)
User Data / Rack *          | 128 TB                                        | 192 TB

* Assuming 4x Compression


PureData System for Analytics N2002 HW Overview

• User Data Capacity: 192 TB²

• Data Scan Speed: 478 TB/hr*

• Load Speed (per system): 5+ TB/hr

• Power Requirements: 7.5 kW

• Cooling Requirements: 27,000 BTU/hr

Scales from ½ Rack to 4 Racks¹

2 Hosts (Active-Passive)

• 2 Intel 2.7 GHz Sandy Bridge CPUs

• 7x300 GB SAS Drives

• Red Hat Linux 6 64-bit

7 PureData for Analytics S-Blades™

• 2 Intel 8 Core 2+ GHz CPUs

• 2 8-Engine Xilinx Virtex-6 FPGAs

• 128 GB RAM + 8 GB slice buffer

• Linux 64-bit Kernel

12 Disk Enclosures

• 288 600 GB SAS2 Drives

– 240 for User Data

– 14 for S-Blades

– 34 Spare

• RAID 1 Mirroring

1 Clients interested in a smaller entry point should refer to the N2002-002 model

2 Assuming 4X compression


Striper Wave Offer

• Best discounting on the purchase of Striper ever!

– Must return TwinFin machine(s)

• Leave the migration to us!* (estimated migration 1-2 weeks based on data and network)

– Review Migration Planning Questionnaire

– Develop Migration Plan

– Support development of test strategy

– Prepare Environment & Install tools for Data & Code Migration

– Migrate Data & Code to new appliance*

– Removal and secure disposal of TwinFin machine(s)

• Most favorable financing available – Pick your Plan**

– Defer Payments for 90 days or more; or

– 0% financing with No Upfront Cost; or

– Lowest FMV Leasing Rates Available.

* Beyond 100 hours of service, IBM can provide additional fee-based migration services via IBM’s Lab Service Team for test execution support, complex environment considerations, handling for large data volumes, etc.

** With approved credit


Appliance Migration Service

Quickly migrate your old Netezza Appliance to the latest PureData System for Analytics Appliance!

Benefits

• Reduce migration risks with proven guidance and expertise

• Leverage best practices & tools to accelerate migration activities

• Accelerate your ROI of new appliance

Deliverables

• Migration Plan

• Migrated data/code in new Appliance*

Features

• Up to 100 hours of Migration Services from IBM for one environment (20 Client Technical Professionals/80 Lab Services)

– Project Management

– Review Migration Planning Questionnaire

– Develop Migration Plan

– Support development of Test Strategy

– Prepare Environment & Install tools for Data & Code Migration

– Migrate Data & Code to new appliance*

• Beyond 100 hours of service, IBM can provide additional fee-based migration services via IBM’s Lab Service Team for test execution support, complex environment considerations, handling for large data volumes, etc.

* IBM will provide ETL/Netezza connectivity, however 100 hours does not include manipulation of ETL code or enablement of newer ETL features

* 100 hours does not include test execution

* Large data volumes/low capacity network may require additional fee-based Services time to complete migration

* Estimated migration 1-2 weeks based on data and network, per environment


TwinFin to Striper Summary

• Better Longevity

– TwinFin has been in the field since 2009

– IBM PureData System for Analytics N2000 series appliances have been out since February 1, 2013 – now is the time to make the switch

– The new system is fully supported and allows you to take full advantage of many new enhancements

• Faster scan rates

• Better resiliency

• Greater density for data center efficiency

• Appealing Financials

– Most favorable discount on Striper possible

– Financing options from IGF

– Bundled migration services


IBM Netezza Replication Services v1.5

Asynchronous, Homogeneous Replication for

PureData System for Analytics (formerly Netezza)

Simplifying Data Replication for Disaster Recovery and Scale


What’s This Replication Thing?

• IBM Netezza Replication Services keeps a collection of databases identical across multiple Netezza appliances. Our solution focuses on replication for Disaster Recovery.

Disaster recovery: a replication use case in which failure of hardware or software in its operational environment causes no permanent loss of data or functionality.


Two Common Approaches When NOT Using Replication

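• Two Common Options: Dual Feed ETL and Backup Shipping

[Diagram: Dual Feed ETL: ETL feeds both the Primary and, over the WAN, the DR Site. Backup Shipping: Full Backup + Incrementals on the Primary are sent over the WAN for Full Restore + Incrementals at the DR Site.]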


Two Common Approaches When NOT Using Replication

Dual ETL Feed

Benefits
• Data can arrive at both systems at roughly the same time.
• Easier to “flip” DR site to be primary site in the event of a failure.

Drawbacks
• Some processes (such as sequences) may result in different values.
• In the event of an ETL error, bad data can be propagated to the DR site.
• Additional overhead for customer.

Backup and Restore

Benefits
• Only changed data is moved across the network.
• Backups can later be stored as part of backup strategy.
• Offers more control over timing of DR loads, not tied to ETL process.

Drawbacks
• Occasional full backups recommended to ensure consistency, especially if backup files are later used for backup storage.
• Can result in very large data transfers, especially during initial full backups.
• Incremental backups do have some impact on system performance.
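To put the "very large data transfers" drawback in perspective, the arithmetic can be sketched as below. This is only an illustration: the function name, the 10 TB and 0.2 TB sizes, the 1000 Mbit/s link and the efficiency factor are all assumed values, not figures from this deck.

```python
def transfer_hours(size_tb, link_mbps, efficiency=1.0):
    """Hours needed to move size_tb terabytes over a link of link_mbps megabits per second.

    efficiency is the usable fraction of the nominal link rate (an assumption;
    measure your own WAN). Decimal units: 1 TB = 10**12 bytes, 1 Mbit = 10**6 bits.
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency in (0, 1]")
    bits = size_tb * 10**12 * 8
    seconds = bits / (link_mbps * 10**6 * efficiency)
    return seconds / 3600


# Hypothetical sizes: an initial full backup versus a daily incremental.
full_backup_hours = transfer_hours(size_tb=10, link_mbps=1000)
incremental_hours = transfer_hours(size_tb=0.2, link_mbps=1000)
print(f"full: {full_backup_hours:.1f} h, incremental: {incremental_hours:.2f} h")
```

With these assumed numbers the full backup takes roughly 50 times longer to ship than the incremental, which is why the initial load dominates planning.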


Replication Requirements Targeted with Our Solution

• Disaster Recovery solution for PureData System for Analytics
  – Protect business critical data
  – Meet regulatory requirements
• Scalable infrastructure that supports:
  – Growing user populations
  – Distributed access to BI and DW applications
  – Geographically dispersed user populations
  – Higher levels of concurrent access for BI and DW apps
  – Reduced application connection and access latencies (“put the data closer”)



Replication Solution Overview

• Homogeneous (PDA / Netezza only)
• Asynchronous, “warm stand-by” (there is latency to the DR box)
  – Synchronous commit for the source PTS
  – Asynchronous transfer to the subordinate PTS and subordinate appliance(s)
• Hybrid replication: SQL statement and by value (the solution decides which mode to use)
  – SQL statement-level replication (preferred, default)
  – Replication by value (when necessary)


Supported Appliances

• IBM PureData System for Analytics N200x (Striper)
• IBM PureData System for Analytics N1001 (TwinFin)
• IBM PureData System for Analytics N1000 (TwinFin)
• IBM Netezza 100 (Skimmer)
• IBM Netezza High Capacity Appliance C1000
• NEC InfoFrame DWH Appliance


You can upgrade to IBM Netezza release 7.1.0.x from any 6.0.x or 6.1.x release, or from an earlier release of 7.1.0.x to a later 7.1.0.x release.


IBM Netezza Replication Services - Architecture


Description of “by SQL” Replication Method

• Preferred method of replication for our solution
  – Master node accepts SQL Data Manipulation Language (DML) and Data Definition Language (DDL) statements that update the replicated databases.
  – SQL statements are captured to a replication log.
  – Logs are copied across the network to multiple Netezza nodes.
  – Subordinates replay the SQL.
  – Fewer performance implications for customer workloads (near-zero impact)
    • Small amount of information to log/transfer
      – The SQL statement that made the change
    • External table files referenced by DML operations are logged
      – Byte-for-byte identical to the original imported data
    • Incoming load rates for up to three simultaneous parallel loads
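The capture, ship and replay steps above can be sketched as a toy simulation. This is an illustration under assumptions, not the product's implementation: two in-memory SQLite databases stand in for the master and subordinate appliances, and `ReplicationLog`, `run_on_master` and `replay_on_subordinate` are hypothetical names.

```python
import sqlite3


class ReplicationLog:
    """Ordered list of committed SQL statements (hypothetical stand-in for the replication log)."""

    def __init__(self):
        self.entries = []

    def append(self, statement):
        self.entries.append(statement)


def run_on_master(master, log, statement):
    """Execute DDL/DML on the master and capture the statement text, not the changed rows."""
    master.execute(statement)
    master.commit()
    log.append(statement)


def replay_on_subordinate(subordinate, log, applied_upto=0):
    """Replay unapplied log entries in order on a subordinate; return the new log position."""
    for statement in log.entries[applied_upto:]:
        subordinate.execute(statement)
    subordinate.commit()
    return len(log.entries)


master = sqlite3.connect(":memory:")
subordinate = sqlite3.connect(":memory:")
log = ReplicationLog()

run_on_master(master, log, "CREATE TABLE sales (id INTEGER, amount INTEGER)")
run_on_master(master, log, "INSERT INTO sales VALUES (1, 100), (2, 250)")
run_on_master(master, log, "UPDATE sales SET amount = amount * 2 WHERE id = 2")

# Asynchronous in the real system; here the replay simply happens later in the script.
position = replay_on_subordinate(subordinate, log)

master_rows = master.execute("SELECT * FROM sales ORDER BY id").fetchall()
subordinate_rows = subordinate.execute("SELECT * FROM sales ORDER BY id").fetchall()
```

Only three short statements cross the "network" here, however many rows they touch, which is the point of the by-SQL approach.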


Description of “by Value” Replication Method

• Alternative method of replicating changes
  – Used when DML or DDL SQL statements are detected that could produce different results on the subordinate.
  – Replays the rows that changed (and the DDL, to ensure appropriate table structure).
• Steps
  – On the master
    • Detect non-deterministic SQL DML operations.
    • Mark the entire transaction to be replicated by the rows that changed and by the DDL statements issued against replicated databases.
    • During commit processing of the transaction on the master, capture to the replication log the set of rows that changed (inserted, updated or deleted) for each table affected by DML.
  – On the subordinate
    • DDL statements against replicated databases are replayed.
    • For each modified table, the new rows are inserted and the old rows deleted.
• Requirement to log the underlying row changes to tables
  – Performance is impacted by waiting for rows to be logged to disk on the source system.
  – A transaction will generally take longer to complete than when replication is disabled.
• This method may be optimal for some workloads compared to “by SQL”
  – A session variable is available to force the selection of this method when logging transactions
    • SET REPLICATE_ALWAYS_BY_VALUE=ON;
• The nzreplshowsql command will output more details
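A minimal sketch of the by-value idea follows, assuming a simple text heuristic for spotting non-deterministic statements. The regular expression, `choose_mode`, `row_delta` and `apply_delta` are hypothetical; how the product actually detects such statements and encodes row changes is not described here.

```python
import re
from collections import Counter

# Hypothetical heuristic: RANDOM(), LIMIT n and window functions (OVER (...)).
NON_DETERMINISTIC = re.compile(r"\b(random\s*\(|limit\s+\d+|over\s*\()", re.IGNORECASE)


def choose_mode(statements, always_by_value=False):
    """Pick the replication mode for a whole transaction (list of SQL strings)."""
    if always_by_value or any(NON_DETERMINISTIC.search(s) for s in statements):
        return "by_value"
    return "by_sql"


def row_delta(before, after):
    """Compare table contents as multisets; return (inserted_rows, deleted_rows)."""
    old, new = Counter(before), Counter(after)
    inserted = list((new - old).elements())
    deleted = list((old - new).elements())
    return inserted, deleted


def apply_delta(rows, inserted, deleted):
    """Subordinate side: delete the old rows, insert the new rows."""
    result = Counter(rows)
    result.subtract(Counter(deleted))
    result.update(inserted)
    return sorted((+result).elements())  # unary plus drops rows whose count fell to zero


before = [(1, "a"), (2, "b")]
after = [(1, "a"), (2, "c"), (3, "d")]
inserted, deleted = row_delta(before, after)
subordinate_rows = apply_delta(before, inserted, deleted)
```

An updated row shows up as one deleted row plus one inserted row, matching the "new rows inserted, old rows deleted" description above.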


IBM Netezza Replication Services - Roles

Subordinate:
Role in a replication set in which execution of UPDATE transactions against non-temporary tables or sequences in a replicated database is prohibited. Temporary table UPDATEs and persistent table SELECTs are fully supported.

Master:
Appliance that is the single source of changes to replicated databases and to global data. The other appliances in the replication set are subordinates. The role of master can be changed from one appliance to another by an administrator, typically in response to failures and planned outages, or to “follow the sun” across time zones.

One master and many subordinates are permitted in a replication set. A subordinate replication host can perform query transactions for load balancing, including creating and updating temporary tables.

Subordinate appliances can have databases outside of replication scope, and those databases have no write restrictions.
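The role rules above reduce to a small decision function. The sketch below only restates the slide; the function and parameter names are hypothetical and the product enforces these rules internally.

```python
def is_allowed(role, db_replicated, is_write, targets_temp_table=False):
    """Return True if an operation is permitted under the role rules described above.

    role: "master" or "subordinate"
    db_replicated: the database is inside the replication scope
    is_write: the operation changes data (a write to a table or sequence)
    targets_temp_table: the write goes to a temporary table
    """
    if role == "master":
        return True   # single source of changes
    if not db_replicated:
        return True   # databases outside replication scope have no write restrictions
    if not is_write:
        return True   # SELECTs on persistent tables are fully supported
    return targets_temp_table  # only temporary tables may be written on a subordinate


print(is_allowed("subordinate", db_replicated=True, is_write=True))
```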


The Persistent Transport System (PTS)

• External server collocated with every node in the replication cluster
• A PTS has three major purposes:
  – Move data and files (synchronize transaction logs) from one node to another.
  – Send control messages from one node to another.
  – Act as a persistent store for recovery from failures.
• PTS H/W Specs:
  – 4 cores, 16 GB RAM, 5 TB+ of disk space, 250 MB/s disk write rate for logs
  – Red Hat Linux 5.7+
• Can be a virtual machine (VM): the new *flexible* PTS! (Valid option as of February 2014.)

Note: we encourage customers to have a test environment, so please plan not only for appliances but also for an appropriate PTS in your test environment.
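When sizing a physical or virtual PTS host, a quick check against the figures above might look like the sketch below. It assumes the listed specs are minimums, which the slide does not state explicitly; `HostSpec` and `shortfalls` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class HostSpec:
    cores: int
    ram_gb: int
    disk_tb: float
    log_write_mb_s: int


# Figures from the slide, treated here as minimums (an assumption).
PTS_MINIMUM = HostSpec(cores=4, ram_gb=16, disk_tb=5, log_write_mb_s=250)


def shortfalls(candidate, minimum=PTS_MINIMUM):
    """Return the names of the attributes where the candidate host falls short."""
    fields = ("cores", "ram_gb", "disk_tb", "log_write_mb_s")
    return [name for name in fields if getattr(candidate, name) < getattr(minimum, name)]


# Hypothetical VM proposed for a test environment.
test_vm = HostSpec(cores=4, ram_gb=8, disk_tb=6, log_write_mb_s=200)
print(shortfalls(test_vm))
```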


Performance Benefits of a Replicated Environment

• Across the replicated cluster, the advantages of asynchronous replication: because applications do not have to wait for transactions on the master to be transported and applied on target systems, asynchronous solutions can be deployed over long distances with (a) negligible impact on application performance, and (b) minimal network bandwidth consumption.
• On the master system, improve performance by offloading BI reporting to one or more replication target systems.
• On target systems, reduce network and database connection latencies by storing data closer to users and client applications.
• Across the replicated cluster, optimal use of network bandwidth, a direct consequence of the "by-SQL" approach of replicating the load file and SQL statement when possible. This contrasts with other databases, which log and transmit index and data structure changes.


IBM Netezza Replication Services v1.5

Increased Resiliency, and Compatibility with Customer Workloads

• Replication PTS HA: the ability to add a second host to the PTS hardware so that the PTS remains available if there is an issue with one host. (Note: this requires appropriate hardware and the Red Hat High Availability Add-On.)
• Replication Relaxed Serializability: replication is compatible with the NPS relaxed serializability feature.
• Replication Master Continue on PTS Error: the ability to allow the source appliance to continue to change data even though a replication error occurred and it cannot log to its PTS.
• Reduced Restrictions: the removal of restrictions on the SQL allowed on replicated databases. (Sequences, non-deterministic SQL, DML which selects from non-replicated data, stored procedures which manipulate timestamps, and TEMP tables now work identically when replication is enabled vs. disabled.)


NPS v7.1 is a Prereq for Replication v1.5


Highlights
• Scheduler rules for WLM
• Short query prioritization
• Snippet Result Cache
• Faster Bulk Fetching with ODBC
• Password aging and expiry
• nzPortal enhancements
• Cryptographic standards (SP 800-131A)
• Support for Replication v1.5
• Support for INZA 3.0

Resiliency
• Faster rebalance for failed disks
• Disk validation support
• Large scale disk replacement
• Call Home v1.0
• Enhanced System Health Checks v2.2
• ILMT support for Growth on Demand

Platform & OS
• Client Kit support for AIX 7.1
• RHEL 6.4 certification

SQL Enhancements
• Multiple Schema (3-part naming)
• Orphan column query
• NOT IN / EXISTS improvements
• CASE WHEN improvements
• Support 24 hour datetime
• CESU-8 support

Transaction Enhancement
• Truncate table in TXN
• Improved view validation
• Temp table enhancements
• Deprecate Web Admin

ETL
• ODBC loader support for INTERVAL

Netezza Performance Portal
• Cryptographic standards (SP 800-131A)
• Scheduler rules
• History type AUDIT
• Restrict nzPortal users
• Groom dialogs


New Features in NPS 7.1 / Replication 1.5

Master Continue on Error

• WHAT IS IT
  – A system parameter (replContinueOnLogError) in the replc.cfg file.
• HOW IT WORKS
  – False (default): if a PTS error occurs while capturing the transaction log, the master aborts any active transaction.
  – True: enables the master to continue processing transactions regardless of the logging error, but replication stops so that loads can continue. The master node enters a "continue on error" state, where write workloads continue even though they are not recorded in the replication log. Because the transaction log is then invalid due to missing data, you must re-synchronize all nodes after resolving the PTS issues.
• HOW TO RECOVER
  – To recover from the replication suspension that results from the "master continue on error" feature, you must follow the backup and restore procedure. First, run the nzreplanalyze command to generate a directive file for synchronization and move the master node from "continue on error" to a suspended state. Then, use nzreplbackup to create a backup and activate the master node. Finally, use nzreplrestore to restore the replication data to the subordinate(s).
• *No other database has this configuration setting!
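The behaviour of the setting and the recovery order can be modelled as a small state machine. This is a sketch only: the `ACTIVE` state label, the outcome strings and the function names are assumptions, the code does not invoke the real commands, and no command options are shown because none are given here.

```python
from enum import Enum


class MasterState(Enum):
    ACTIVE = "active"  # normal replication (label is an assumption)
    CONTINUE_ON_ERROR = "continue on error"
    SUSPENDED = "suspended"


def on_pts_log_error(repl_continue_on_log_error):
    """Model what the master does when it cannot log to its PTS.

    Returns (state_after_error, outcome_for_active_transactions).
    """
    if repl_continue_on_log_error:
        # Writes go on but are not recorded, so the replication log becomes invalid.
        return MasterState.CONTINUE_ON_ERROR, "continue without logging"
    # Default (False): active transactions are aborted; state left unchanged in this model.
    return MasterState.ACTIVE, "abort"


# Recovery order from the slide: (command name, state of the master afterwards).
RECOVERY_STEPS = [
    ("nzreplanalyze", MasterState.SUSPENDED),  # also generates the directive file
    ("nzreplbackup", MasterState.ACTIVE),      # creates the backup and activates the master
    ("nzreplrestore", MasterState.ACTIVE),     # restores replication data to the subordinate(s)
]


def recover(state):
    """Walk the recovery steps from "continue on error"; return (commands, final_state)."""
    if state is not MasterState.CONTINUE_ON_ERROR:
        return [], state
    commands = []
    for name, next_state in RECOVERY_STEPS:
        commands.append(name)
        state = next_state
    return commands, state


state, outcome = on_pts_log_error(repl_continue_on_log_error=True)
commands, final_state = recover(state)
```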


New Features in NPS 7.1 / Replication 1.5

Relaxed Serializability Support

• As of NPS 7.1 and Replication version 1.5, customers can use the "relaxed serializability" setting in NPS on replicated databases!
  – This functionality uses an invisibility list. The invisibility list on the master is replicated for use on the subordinate.
  – There are no constraints on using this setting on the master or subordinate in replication environments.
  – To be clear, serial execution on the subordinate did not change from the prior replication release, but the subordinate now has the invisibility list to "see" the appropriate state of the database.
  – It's worth noting that the appliances behave the same way with relaxed serializability regardless of whether replication is turned on or off.
• NPS configuration notes (a best practice is to use it at a session level)
  – It can be set system wide (globally). This requires a stop and start of the appliance.
  – It can be set with a session variable.

NOTE: customers need to understand what is occurring when they set serializability to false. Therefore, it is a best practice to use it in session scope (as opposed to globally).

The NPS feature will be documented as of NPS 7.1 for the first time.


Replication Reduced Restrictions

• Reduced restrictions: key software development project since January 2013
• Things that now work fine with replication
  – SEQUENCES
  – Non-deterministic SQL (e.g. LIMIT 5, Random(), window functions)
  – DML which selects from non-replicated data (system tables, databases)
  – Stored procedures which manipulate timestamps
  – Session-scope temporary tables and variables
    • TEMP tables now work identically when replication is enabled vs. disabled
  – Transactions larger than 300 KB of SQL statements are now supported
  – UDF, UDTF and UDA


PureData System for Analytics Replication QuickStart Offering

Ensure your solution is implemented efficiently with low risk

Features
• This QuickStart includes the following activities:
  – Install the 10 Gb NICs in the Netezza appliances, then establish and validate connectivity between the replication hardware and the Netezza appliance.
  – Install and configure a basic Netezza Replication Software Solution from one Netezza source to one target.
  – Provide information sharing on how to best use and leverage the Netezza Replication Solution.
  – Conduct a planning workshop to document disaster and recovery scenarios based on the requirements.
• The scope is limited to one Netezza source and one target. Additional nodes can be supported and quoted separately.
• The site survey / pre-engagement checklist is reviewed and completed by the client before any IBM resources come on-site.

Deliverables
• Installation Report
• Disaster and Recovery Scenarios Document

Benefits
• Get a basic replication solution installed and configured quickly, realizing your solution ROI faster
• Leverage IBM deep product expertise to define optimum disaster recovery solutions to satisfy your requirements
• Obtain a replication solution foundation to protect one of your most important assets, your data!

Duration
• 4 weeks

Backed by world-class industry and product experts in deploying Information Management Software


Questions about NPS 7.1 & Replication 1.5

Announcement: http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=AN&subtype=CA&htmlfid=897/ENUS214-055&appname=USN

Fix Central: http://www-933.ibm.com/support/fixcentral/swg/selectFixes?product=ibm/Information+Management/Netezza+NPS+Software+and+Clients&release=NPS_7.1.0&platform=All&function=all

Knowledge Center: http://www-01.ibm.com/support/knowledgecenter/

Replication Services: https://w3-connections.ibm.com/communities/community/NetezzaReplication

Netezza Developer Network download site: https://www14.software.ibm.com/webapp/iwm/web/reg/pick.do?source=swg-im-ibmndn&lang=en_US

Contacts
Doug Dailey, Netezza Product Manager (NPS), [email protected]
Chris Gerlt, Netezza Product Manager (Replication), [email protected]


© International Business Machines Corporation 2014

International Business Machines Corporation New Orchard Road Armonk, NY 10504

IBM, the IBM logo, PureSystems, PureFlex, PureApplication, PureData and ibm.com are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide.

A current list of IBM trademarks is available on the Web at www.ibm.com/legal/copytrade.shtml

All rights reserved.
