Data-centric computing with Netezza Architecture DISC reading group September 24, 2007.

High Level Points

• Supercomputer use model today:

– Compile, submit, wait

– Does a poor job of taking advantage of human insight available in interactive models

• Large datasets can be interactively processed using Netezza

What is Netezza?

• Essentially: A big, fast SQL database

What is Netezza?

• Frontend provides SQL interface

• Backend is a large rack of specialized blades

Custom Backend Blades

• Commodity CPU, NIC, disk

• Custom FPGA replaces disk interface

– Can do basic filtering in hardware, i.e., stream processing before data hits main memory

Division of Data

• Database distributed across multiple (100+) SPUs

• Each SPU controls, manages its slice of DB

• No info on data management, replication, etc.

Division of Labor

• SPU FPGA handles basic filtering tasks

• SPU CPU handles record-level processing: filtering, parsing, projecting, logging, etc.

• SPU CPU handles most operations on intermediate results: sorts, joins, aggregates

• Frontend CPU handles remaining operations

>>> Processing close to disk
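
The division of labor above can be sketched in a few lines of Python. This is only an illustration of the idea, not Netezza code: the function names, the round-robin placement of rows, the toy records, and the choice of four slices are all assumptions made for the example (the slides give no detail on how data is actually placed).

```python
from collections import Counter

NUM_SLICES = 4  # assumption: the slides mention 100+ SPUs; 4 keeps the example small


def distribute(rows, num_slices=NUM_SLICES):
    """Assumed round-robin placement of rows onto slices (hypothetical scheme)."""
    slices = [[] for _ in range(num_slices)]
    for i, row in enumerate(rows):
        slices[i % num_slices].append(row)
    return slices


def fpga_filter(disk_stream, min_year):
    """Stand-in for the FPGA: drop non-matching records while streaming off disk."""
    for row in disk_stream:
        if row["year"] >= min_year:
            yield row


def spu_aggregate(filtered_rows):
    """Stand-in for the SPU CPU: project one column and build a partial aggregate."""
    return Counter(row["venue"] for row in filtered_rows)


def frontend_merge(partials):
    """Stand-in for the frontend CPU: combine the per-slice partial results."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


def run_query(rows, min_year):
    """Roughly: SELECT venue, COUNT(*) ... WHERE year >= min_year GROUP BY venue."""
    partials = [spu_aggregate(fpga_filter(s, min_year)) for s in distribute(rows)]
    return frontend_merge(partials)


# Toy records, invented for illustration only.
papers = [
    {"venue": "venue_a", "year": 2005},
    {"venue": "venue_a", "year": 1999},
    {"venue": "venue_b", "year": 2006},
    {"venue": "venue_a", "year": 2007},
    {"venue": "venue_b", "year": 2001},
]
result = run_query(papers, min_year=2005)
print(result)
```

Only the small per-slice counters travel to the frontend; the raw records never leave the slice that stores them, which is the point of processing close to disk.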

What can this be used for?

• Paper gives 3 examples:

– Citation graph processing

– Search for particular structure in electrical netlist

– Word meaning disambiguation through search of ontology

Citation graph example

• Look through large, sparse graph (16 million nodes, 388 million edges)

• Find both strong (direct edge) and weak couplings (e.g., two papers cite the same work)

• Essentially same code for workstation and Netezza – no need to expose parallel architecture

• Workstation did not finish (DNF); 80-100x speedup on smaller tests
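
To make the two kinds of coupling concrete, here is a minimal sketch using Python's built-in `sqlite3` module in place of Netezza. The table name `cites`, its columns, and the four edges are assumptions for illustration; the paper's actual schema and queries are not shown in the slides. Strong coupling is a direct edge, and weak coupling is found with a self-join on the cited paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical edge table: paper `src` cites paper `dst`.
conn.execute("CREATE TABLE cites (src INTEGER, dst INTEGER)")
conn.executemany(
    "INSERT INTO cites VALUES (?, ?)",
    [(1, 3), (2, 3), (1, 2), (4, 5)],  # toy edges, invented for the example
)

# Strong coupling: a direct citation edge between two papers.
strong = conn.execute(
    "SELECT src, dst FROM cites ORDER BY src, dst"
).fetchall()

# Weak coupling: two distinct papers that cite the same work.
# `a.src < b.src` reports each unordered pair once.
weak = conn.execute(
    """
    SELECT a.src, b.src, COUNT(*) AS shared
    FROM cites AS a
    JOIN cites AS b ON a.dst = b.dst AND a.src < b.src
    GROUP BY a.src, b.src
    ORDER BY a.src, b.src
    """
).fetchall()

print(strong)
print(weak)
```

The same SQL text would run unchanged on a single machine or a parallel backend, which is the sense in which the parallel architecture does not need to be exposed.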

IC netlist example

• Flattened netlist of 3.5 million transistors, 10 million wires

• Search for AND structure

IC example results

• Combinatorial explosion makes directly joining all possibilities for each element impossible

• Can constrain better using fanouts of signals internal to the circuit

• Individual SQL queries for finding possible matches for the individual transistors took under 10 seconds

• Found all uses of the AND macro, as well as many other (1300+) identical structures generated through other means
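
A structural search of this kind can be sketched as a self-join over a transistor table. The example below is deliberately simpler than the AND macro from the paper: it matches a two-transistor CMOS inverter, and then narrows the candidates with a fanout constraint on the input net. The schema (`transistors` with `kind`, `gate`, `source`, `drain`), the net names, and the fanout value of 2 are all assumptions for illustration, and `sqlite3` stands in for Netezza.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical flattened netlist: one row per transistor, terminals named by net.
conn.execute(
    "CREATE TABLE transistors (id INTEGER, kind TEXT, gate TEXT, source TEXT, drain TEXT)"
)
conn.executemany(
    "INSERT INTO transistors VALUES (?, ?, ?, ?, ?)",
    [
        (1, "p", "x", "VDD", "y"),  # inverter x -> y
        (2, "n", "x", "GND", "y"),
        (3, "p", "a", "VDD", "z"),  # inverter a -> z
        (4, "n", "a", "GND", "z"),
        (5, "n", "a", "GND", "w"),  # extra load on net a
    ],
)

# Inverter pattern: a PMOS and an NMOS sharing gate and drain nets,
# with sources tied to the supply rails.
PATTERN = """
    SELECT p.id, n.id, p.gate, p.drain
    FROM transistors AS p
    JOIN transistors AS n ON p.gate = n.gate AND p.drain = n.drain
    {extra_join}
    WHERE p.kind = 'p' AND n.kind = 'n'
      AND p.source = 'VDD' AND n.source = 'GND'
      {extra_where}
    ORDER BY p.id
"""

all_inverters = conn.execute(
    PATTERN.format(extra_join="", extra_where="")
).fetchall()

# Fanout constraint: keep only inverters whose input net drives exactly
# two transistor gates, i.e. nothing but the inverter itself.
FANOUT = """
    JOIN (SELECT gate AS net, COUNT(*) AS fanout
          FROM transistors GROUP BY gate) AS f ON f.net = p.gate
"""
internal_only = conn.execute(
    PATTERN.format(extra_join=FANOUT, extra_where="AND f.fanout = 2")
).fetchall()

print(all_inverters)
print(internal_only)
```

Matching small pieces and constraining them by fanout keeps each join small, instead of joining every candidate for every element of the pattern at once.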

Ontology example

• Expand out all possible interpretations of a phrase

• Ontology specifies lexical elements, IS-A relations, concepts, and constraints on concepts

• Goal is to search the space, expand concepts to find all matches to given phrase

Ontology results

• Partially unfolded ontology

– Greatly expands database size, but reduces iterations / recursions

• Recoded ontology triples as integers

• 5.58 sec. vs. 262 sec.

• Can pipeline multiple queries
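
The two optimizations named above, unfolding the ontology and recoding triples as integers, can be illustrated with a small Python sketch. It fully unfolds a toy IS-A chain into its transitive closure, whereas the paper only partially unfolds; the helper names `encode` and `unfold_is_a` and the three example triples are hypothetical and not taken from the paper.

```python
def encode(triples):
    """Recode string triples as integer triples; returns (coded, term -> id map)."""
    ids = {}

    def intern(term):
        # Assign the next free integer the first time a term is seen.
        return ids.setdefault(term, len(ids))

    coded = [(intern(s), intern(p), intern(o)) for s, p, o in triples]
    return coded, ids


def unfold_is_a(coded, is_a_id):
    """Precompute the transitive closure of IS-A as a set of (child, ancestor) pairs."""
    closure = {(s, o) for s, p, o in coded if p == is_a_id}
    while True:
        # One join step: (a IS-A b) and (b IS-A d) imply (a IS-A d).
        new = {(a, d) for a, b in closure for c, d in closure if b == c} - closure
        if not new:
            return closure
        closure |= new


# Toy ontology, invented for illustration.
triples = [
    ("beagle", "IS-A", "dog"),
    ("dog", "IS-A", "mammal"),
    ("mammal", "IS-A", "animal"),
]
coded, ids = encode(triples)
closure = unfold_is_a(coded, ids["IS-A"])

# With the unfolded table, finding every ancestor is a single lookup
# instead of repeated iteration over the IS-A relation.
names = {i: term for term, i in ids.items()}
ancestors = sorted(names[o] for s, o in closure if s == ids["beagle"])
print(len(coded), len(closure), ancestors)
```

The unfolded relation has six rows instead of three, which shows the trade-off on the slide: a larger table in exchange for fewer iterations at query time.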

Issues

• Works if you can reduce your problem to SQL queries

• All of the problems were based on graph expansion / exploration – how about other domains?

• Issues of database partitioning? How does arbitrary slicing across 108 blades affect performance / scalability, esp. for non-sparse problems?

• Strawman comparison to workstation class machine: how does a traditional DB server / storage cluster compare?
