35614083 Teradata Architecture

Accenture Confidential

Accenture and Teradata

Teradata Architecture

Key Database Features

The Teradata Database

Explained, Illustrated and Demystified


Teradata Architecture:

Table of Contents

Introduction

Platform Architecture: MPP and SMP

Teradata Architecture: MPP and SMP

Key Differentiators




Introduction:

Purpose and Intended Audience

• The purpose of this deck is to familiarize Accenture practitioners with the Teradata Relational Database Management System (RDBMS)

• We tend to focus on Teradata’s unique architecture or features,

occasionally contrasting them with Oracle, a much more familiar

reference for most readers

• The reader need not be deeply technical to benefit from this deck: we have attempted to provide high-level overview material rather than deep (and sometimes boring) details

• Finally, illustrations are provided to add clarity to certain concepts

where it makes sense


Introduction:

Unique Teradata Attributes

• Teradata is unique among commercial RDBMSs in a number of ways.

(If that weren’t true, there probably would be no need for this deck)

• Key Teradata differentiators:

– It is implemented on Massively Parallel Processing (MPP) hardware

architecture – and always has been

– It was implemented on proprietary hardware with portions of the database embedded in the hardware/firmware (although this is no longer the case)

– The database software is unconditionally parallel

– It is linearly scalable, with hundreds of reference sites exceeding a

terabyte (1,000 gigabytes) in size

– It virtually “owns” the Very Large Database (VLDB) market space – and

has for twelve years

• Some of the above points are discussed in the “Key Differentiators”

section


Introduction



Key Differentiators


Teradata Platform Architecture:

Uni-processors, SMPs and MPPs

• Computers can be broadly categorized into one of three hardware architectures:

– Uni-processor

• The desktop PC is the example

• Generally applied to client, not server, applications

• Not discussed further in this deck

– Symmetric Multi Processing (SMP)

• A single computing system with multiple processing units, often

microprocessors

– Massively Parallel Processing (MPP)

• A collection of computing systems – usually SMPs – that are interconnected and that collaborate on one or more common tasks

• While there are significant differences between these architectures, the

application programming model is essentially unchanged among them:

the platform software deals with the hardware differences



A Closer Look at SMP Hardware

• Typical SMP hardware architectures have:

– Typically two to eight processors, with as many as 64 in the largest systems

• Smaller SMPs often have Compaq or Intel motherboards and run MS Windows

• Larger SMPs are typically RISC machines running UNIX

– Sun, H-P and IBM dominate this space

– All of the processors run from a common, shared memory and they all

access that memory via a common, shared memory bus

– All of the processors share the I/O slots, channels and associated peripheral devices, notably disk storage subsystems

• SMP examples:

– Low-end:

• Compaq ProLiant DL series (2-4 CPUs, desk side)

• NCR’s Model 4455 or similar (1-4 CPUs, desk side)

– Midrange: HP’s NetServer 6000 series (4-6 CPUs, rack mount)

– High-end: Sun’s Enterprise 10000 (16-64 CPUs, free standing)

– Nearly all IBM and compatible mainframes



SMP Hardware

[Figure: SMP hardware with 4 CPUs (in blue), shared memory on a common memory bus, and peripheral devices on a shared I/O bus]



SMP Hardware Scalability

• Scalability options for SMPs include:

– Larger memories

– Faster CPUs

– More CPUs

– More I/O (slots and busses)

– More peripherals (usually disk arrays)

• Scalability limitations for SMPs:

– Every shared hardware subsystem is a potential bottleneck for an SMP

– The most common limiter to SMP scalability is the memory subsystem

• Each CPU must access the single memory via a common bus

• As the number of CPUs increases, there is added contention for memory

accesses – CPUs begin waiting on the memory subsystem

• Eventually, a point of diminishing returns is reached, where the added expense

of additional CPUs fails to provide a commensurate increase in performance
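The diminishing returns described above can be illustrated with a toy model. This is only a sketch: the formula and the 5% contention factor are assumptions chosen for illustration, not measurements of any real SMP, and `smp_speedup` is a hypothetical helper name.

```python
def smp_speedup(cpus: int, contention: float = 0.05) -> float:
    """Toy model of SMP scaling (an assumption, not a measurement).

    Each additional CPU adds a fixed fraction of time spent waiting
    on the single shared memory bus.
    """
    if cpus < 1:
        raise ValueError("cpus must be >= 1")
    return cpus / (1.0 + contention * (cpus - 1))


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32, 64):
        s = smp_speedup(n)
        # Efficiency is the fraction of each CPU doing useful work.
        print(f"{n:>2} CPUs -> speedup {s:5.2f}x, efficiency {s / n:6.1%}")
```

Under this model the speedup per added CPU keeps shrinking and the total can never exceed 1/contention, which is the "wall" that MPP systems are designed to avoid.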



An Introduction to MPP Hardware

• Massively Parallel Processing (MPP) hardware systems consist of two to perhaps hundreds of SMP systems called “nodes”

– Just like a standalone SMP, each node has its own memory and I/O subsystems as well as its own copy of the operating system and application(s)

– The nodes are interconnected via a dedicated, very high-speed, often

proprietary interconnect network

– Most MPP systems run under UNIX, though a few MPP Teradata

installations run under Windows 2000

• MPP examples:

– IBM’s pSeries (formerly RS/6000) and IBM’s “Deep Blue,” the chess-playing machine that defeated Grandmaster Garry Kasparov in May 1997

– The NCR 5250 or 5255 (among other NCR servers), which has never

played chess and probably never will


Platform Architecture:

A 2-node MPP System

[Figure: MPP hardware showing 2 nodes (MPP Node 0 and MPP Node 1), their disk arrays (Disk Array 0 and Disk Array 1) and the interconnect network]



More on MPP Hardware

• MPP hardware architectures are often called “shared nothing” or

“loosely-coupled” systems, since the nodes – the basic MPP building

blocks – share no computing hardware

• The network that interconnects the nodes enables them to

communicate and cooperate to solve a problem

– Exactly how the interconnect is used depends entirely on the application(s)

running in the system

• So, why bother with the complexity of MPP hardware?

– One word: scalability, the ability to add processing nodes without “hitting the wall” before reaching a desired level of performance



Teradata’s MPP Hardware

• For Teradata’s MPP hardware, each node is:

– Made by Solectron to NCR’s design and specifications

– Powered by a 4-CPU Intel Xeon board

– Connected to all the other nodes via NCR’s BYNET interconnect

– Connected via SCSI to its own disk array(s)

– Optionally connected to the disk array(s) of another node in the complex

for fault tolerance purposes (more on this later)



Teradata’s MPP Hardware

[Figure: NCR MPP hardware showing 4 nodes (Node 0 to Node 3), their disk arrays (Disk Array 0 to Disk Array 3) and the BYNET interconnect]



Teradata’s BYNET™ Interconnect

• NCR’s node interconnect subsystem is called the BYNET

• The BYNET is fully scalable:

– When you add a node, you add bandwidth with it, so that the total

bandwidth available scales as the MPP complex grows

– Early Teradata machines did not have a scalable interconnect (YNET)

• The network architecture is “Folded Banyan”

– All nodes are directly connected to all other nodes

• There are always two BYNETs for redundancy purposes

• The BYNET hardware is an ordinary PCI card designed by NCR

• The BYNET is fast:

– 120 megabytes per second per node per BYNET in each direction

• It’s patented by and proprietary to NCR



BYNET Node-to-Node Connections

• Every node has a dedicated bi-directional channel to every other node

• This architecture is duplicated – there are really 2 channels (one shown)

[Figure panels: Point-to-Point Messaging; Broadcast Messaging]



Teradata “Cliques”

• A Teradata clique provides high availability, and is a configuration option

• A clique is a group of nodes – 4 are shown below – that can access a common chunk of disk array storage

• Cliques eliminate any single point of failure

[Figure: four nodes on the BYNET interconnect, attached via shared SCSI to sharable disk]



Why have Cliques?

• Cliques add high availability via automatic failure detection and software re-configuration in the event of one or more hardware failures
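The re-configuration idea can be sketched as follows. This is a minimal illustration, not Teradata's actual algorithm: the round-robin placement, the node and AMP names, and the helper `reassign_amps` are all assumptions made for the example.

```python
from typing import Dict, List


def reassign_amps(clique: Dict[str, List[str]], failed: str) -> Dict[str, List[str]]:
    """Move the failed node's AMPs onto the surviving nodes of the same clique.

    Round-robin placement is an assumption for illustration only.
    The move is possible because clique members can reach the same disk arrays.
    """
    if failed not in clique:
        raise KeyError(f"{failed} is not in this clique")
    survivors = sorted(node for node in clique if node != failed)
    if not survivors:
        raise RuntimeError("no surviving node left in the clique")
    result = {node: list(clique[node]) for node in survivors}
    for i, amp in enumerate(clique[failed]):
        result[survivors[i % len(survivors)]].append(amp)
    return result


# Hypothetical 4-node clique with two AMPs per node.
clique = {
    "node0": ["amp0", "amp1"],
    "node1": ["amp2", "amp3"],
    "node2": ["amp4", "amp5"],
    "node3": ["amp6", "amp7"],
}
after = reassign_amps(clique, "node1")
for node, amps in after.items():
    print(node, amps)
```

In this sketch every AMP keeps running after the failure; the surviving nodes simply carry more of them, so the system stays available at reduced per-node capacity.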

[Figure: BYNET interconnect]



MPP Hardware Illustration

• Below is a medium size Teradata MPP system:

– 16 nodes, each with its own buses, memory and backplane

– 8 cliques of 2 nodes

– 8 disk arrays, one for each clique

– 2 BYNETs, because there are always two BYNETs

– Total BYNET bandwidth is (2 x 2 x 16 x 120) = 7.68 GB/sec!
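The bandwidth arithmetic above can be checked with a few lines of code. The 120 MB/sec figure, the two BYNETs and the two directions come from this deck; the function name is a hypothetical helper, and the other node counts are shown only to illustrate how the total scales.

```python
MB_PER_SEC = 120   # per node, per BYNET, per direction (figure from this deck)
BYNETS = 2         # there are always two BYNETs
DIRECTIONS = 2     # each channel is bi-directional


def total_bynet_bandwidth_mb(nodes: int) -> int:
    """Aggregate BYNET bandwidth in MB/sec for a given node count."""
    return BYNETS * DIRECTIONS * nodes * MB_PER_SEC


for nodes in (2, 4, 16, 32):
    mb = total_bynet_bandwidth_mb(nodes)
    print(f"{nodes:>2} nodes: {mb} MB/sec = {mb / 1000:.2f} GB/sec")
```

For 16 nodes this gives 7680 MB/sec, or 7.68 GB/sec, and doubling the node count doubles the total, which is what "fully scalable" means here.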

...



MPP Operating System Software

• Operating System software

– For MPP Teradata, the choices are the same as for SMP:

• NCR’s version of UNIX: MP-RAS, or

• Windows 2000

– For both OS options:

• The BYNET device driver is an ordinary (UNIX or Windows) one

• For performance reasons, Teradata doesn’t use the native file system; all the Teradata database structures are managed by Teradata on raw disk





Software “Units of Parallelism”

• Teradata software components are known as “Virtual Processors” or

VPROCs

– VPROCs are software threads or processes

• There are two kinds of VPROCs:

– Access Module Processors (AMPs)

• An AMP reads, writes and manipulates all database rows in the partition that

the AMP “owns”

– Parsing Engines (PEs)

• PEs parse SQL statements, reducing them to their component executable steps

• The number of VPROCs is configurable

• VPROCs are in every Teradata node

• VPROCs can migrate around the complex, as in the case of a failed

node

• VPROCs provide parallelism within a node



MPP Platform with AMPs and PEs

• Four-node MPP system showing Virtual Processors – AMPs and PEs –

in each node

[Figure: Nodes 0 to 3 on the BYNET interconnect, each running AMP and PE VPROCs; the nodes hold “w”, “x”, “y” and “z” AMPs and partitions respectively]



Data Partitioning Explained

• Data is automatically distributed to all AMPs – and thus to all disks –

via a proprietary hashing algorithm

– No partitioning or re-partitioning ever required

• File system architecture is fundamentally different

– Rows stored in blocks

– Space allocation is entirely dynamic

• Absolutely minimal DBA effort required

– No reorgs, repartitioning, space management, index rebuilds

– Minimal monitoring required



Data Partitioning Illustrated

• The rows of each table are automatically and unconditionally

distributed to all AMPs (and all available disk storage)

– This enables Teradata’s automatic and unconditional parallelism

[Figure: rows of SYSTEM TABLES, CUSTOMER, ORDERS, LINEITEM, PART and SUPPLIERS spread across AMP1 Disk, AMP2 Disk, AMP3 Disk and AMP4 Disk]



Data Partitioning Explained

• Let’s take a simple case:

– A four-node, eight AMP Teradata MPP system

– A single database table of 100,000 rows

• The system will configure itself with two AMPs in each node

• Then, by hashing the Unique Primary Index, it will distribute all rows to all AMPs, giving each AMP about 12,500 rows and each node about 25,000 rows

• This is the ideal “flat” distribution across the whole system, and will occur if the primary index values are essentially random, like an SSN

• In all processing, each node has to deal with only 1/4 of the total

database

– The name of the game is simple: “Divide and conquer”
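The simple case above can be simulated in a few lines. This is a sketch under stated assumptions: Teradata's hashing algorithm is proprietary, so MD5 from the standard library stands in for it, and the sequential integers used as primary index values are made up for the example.

```python
import hashlib
from collections import Counter

NODES = 4
AMPS_PER_NODE = 2
AMPS = NODES * AMPS_PER_NODE
ROWS = 100_000


def amp_for(primary_index_value: int) -> int:
    """Pick an AMP by hashing the primary index value.

    MD5 is only a stand-in for Teradata's proprietary hash.
    """
    digest = hashlib.md5(str(primary_index_value).encode("ascii")).digest()
    return int.from_bytes(digest[:4], "big") % AMPS


# Count how many of the 100,000 rows land on each AMP.
rows_per_amp = Counter(amp_for(value) for value in range(ROWS))

# Roll the AMP counts up to the node that hosts each AMP.
rows_per_node = Counter()
for amp, count in rows_per_amp.items():
    rows_per_node[amp // AMPS_PER_NODE] += count

for amp in sorted(rows_per_amp):
    print(f"AMP {amp}: {rows_per_amp[amp]} rows")
for node in sorted(rows_per_node):
    print(f"Node {node}: {rows_per_node[node]} rows")
```

Each AMP should end up close to 12,500 rows and each node close to 25,000, without anyone having chosen a partitioning scheme.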



SMP Hardware

• In an SMP architecture, Teradata looks much the same as an ordinary

database such as Oracle:

– A single SMP system does it all

– A single software image can access the entire database


RDBMS Architecture:

Teradata on SMP Hardware

• On SMP hardware architecture, Teradata runs on:

– Windows 2000

– Intel microprocessors

– The above combination is often called “Wintel”

– Almost all Wintel boxes use Compaq or Intel processor boards, typically

populated with Pentium III or Pentium 4 CPUs

– NCR’s SMP machines on either Windows or UNIX (MP-RAS)

• The latter configuration (SMP/UNIX) is often used as a low-cost test platform for a production MPP system under MP-RAS

• Examples of Teradata SMP platforms:

– IBM

– HP

– Compaq

– Dell

– NCR (but only rarely, probably due to cost or client standards)



SMP Hardware and Disk Array

[Figure: SMP box with 4 CPUs connected to a disk array by a SCSI interconnect (dual paths shown)]




Key Differentiators

• Ubiquitous, persistent parallelism

• Unrelenting partitioning

• A really, really mature query optimizer

• The above yield the ability to handle very complex queries, large

complex databases and lots of concurrent users doing lots of

different stuff

• Truly linear scalability

• Mainframe connection via direct FIPS-60 channel connect

– ESCON or “Bus and Tag” media


Scalable, Parallel, High Availability

MPP Hardware

• A group of 1-4 nodes with connections to each other’s storage keeps applications running when node(s) fail

• All critical components have redundant backups

• Nodes have (optional) LAN/WAN/Mainframe connectivity

[Figure: four SMP processing nodes, each with eight CPUs and memory, joined by the BYNET MPP interconnect and attached via point-to-point SCSI or Fibre Channel to LSI Logic or EMC² disk arrays, whose disk array controllers have data cache; server management is also shown]


Shared Nothing Software Architecture

• Basis of Teradata parallelism and scalability

– Divide the work evenly among many processing units

– No single point of vulnerability or chokepoint for any operation


Teradata Data Distribution

• Automatic, Always On

• Rows are distributed evenly by hash partitioning

– Define the row, we’ll do the rest

– Regardless of queries or demographics

• Shared nothing software

[Figure: rows of Table A, Table B and Table C pass through the Teradata Parallel Hash Function on their Primary Index and land on VAMP1 through VAMPn]


Key Data Warehousing Capabilities

• Technology

– Fully automatic space management

– Automatic data distribution

– Always-On, Automatic, Integral, Multi-Level Parallelism

– Continually Improved Cost Based Optimizer

– Full ANSI SQL functionality, complex query optimization


Hash Distribution

• Data automatically distributed to AMPs via hashing

• Even distribution results in scalable performance

• Hash map defined and maintained by the system

– 2**32 hash codes, 64K buckets distributed to AMPs

• Primary Index (PI) column(s) are hashed

• Hash is always the same for the same values

• No partitioning or repartitioning required

[Figure: two sets of AMP and PE VPROCs, each holding a scattered set of row values (7, 14, 41, 87, 3, 94, 16, 21, 53 and 2, 33, 61, 54, 73, 75, 1, 18, 23)]
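The hash code, bucket and AMP relationship can be sketched as below. Several things here are assumptions for illustration: CRC-32 stands in for Teradata's proprietary hash, the bucket is taken as the high-order 16 bits of the 32-bit hash code, and the buckets are dealt to AMPs round-robin. All function names are hypothetical.

```python
import zlib
from typing import List

BUCKETS = 1 << 16  # 64K hash buckets


def row_hash(pi_value: str) -> int:
    """32-bit hash code of a Primary Index value (CRC-32 as a stand-in)."""
    return zlib.crc32(pi_value.encode("utf-8")) & 0xFFFFFFFF


def bucket_of(hash_code: int) -> int:
    """Assumed mapping: the high-order 16 bits select one of 64K buckets."""
    return hash_code >> 16


def build_hash_map(amps: int) -> List[int]:
    """System-maintained table from bucket number to owning AMP (round-robin here)."""
    return [bucket % amps for bucket in range(BUCKETS)]


def amp_of(pi_value: str, hash_map: List[int]) -> int:
    """Find the AMP that owns the row with this Primary Index value."""
    return hash_map[bucket_of(row_hash(pi_value))]


hash_map = build_hash_map(amps=8)
for value in ("CUST-0001", "CUST-0002", "CUST-0001"):
    print(value, "-> AMP", amp_of(value, hash_map))
```

Because the same value always hashes to the same bucket, a lookup by Primary Index goes straight to one AMP. Adding AMPs only requires the system to reassign buckets in the hash map, not to redefine the tables.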


Shared Nothing Software

• Delivers linear scalability

– Maximizes utilization of SMP resources

– To any size configuration

– Allows flexible configurations

– Incremental upgrades

[Figure: sixteen nodes, each running AMP VPROCs]


A Shared Nothing Database Architecture Enables Expansion with Balance

• Amount of parallelism grows at the same rate as the system expands

• Each parallel unit does an equal amount of work

[Figure: hardware scalability and software scalability plotted against work accomplished; four units of hardware power, each paired with a unit of parallelism]


Optimizer - Parallelization

• Cost based optimizer

– Parallel aware

• Rewrites built-in and cost based

• Parallelism is automatic

• Parallelism is unconditional

• Each query step fully parallelized


Shared Everything vs. Nothing

Shared Everything Database Architecture

• A single database buffer used by all units of parallelism (UoPs)

• A single logical data store accessed by all UoPs

• Scalability limited due to control bottlenecks and scalability of single SMP platform

[Figure: all units of parallelism share one set of buffers, locks and control blocks over a single data store]

Shared Nothing Database Architecture

• Each UoP is assigned a data portion

• Query Controller ships functions to UoPs that own the data

• Locks, buffers, etc., not shared

• Highly scalable data volumes

[Figure: four units of parallelism, each with its own data partition]
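The shared nothing pattern, where the controller ships a function to the units that own the data and then combines their answers, can be sketched as follows. This is an illustration only: the order data is made up, `key % units` stands in for the hash partitioning, and Python threads are used to show the structure rather than to deliver real parallel speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Row = Tuple[int, int]  # (order_id, amount), hypothetical data


def partition_rows(rows: List[Row], units: int) -> List[List[Row]]:
    """Give each unit of parallelism its own private slice of the data."""
    parts: List[List[Row]] = [[] for _ in range(units)]
    for order_id, amount in rows:
        parts[order_id % units].append((order_id, amount))  # stand-in for hashing
    return parts


def local_sum(partition: List[Row]) -> int:
    """The function shipped to each unit; it touches only that unit's rows."""
    return sum(amount for _, amount in partition)


def parallel_total(rows: List[Row], units: int = 4) -> int:
    """Controller: scatter the work, then combine the partial results."""
    parts = partition_rows(rows, units)
    with ThreadPoolExecutor(max_workers=units) as pool:
        partials = list(pool.map(local_sum, parts))
    return sum(partials)


rows = [(order_id, order_id % 7) for order_id in range(1000)]
print("total amount:", parallel_total(rows, units=4))
```

No unit ever reads another unit's partition, so there are no shared buffers or locks to contend for, and adding units only shrinks each unit's slice.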


Q/A

Thank You
