Data-centric computing with Netezza Architecture DISC reading group September 24, 2007

Dec 22, 2015

Data-centric computing with Netezza Architecture

DISC reading group, September 24, 2007

High Level Points

• Supercomputer use model today:

– Compile, submit, wait

– Does a poor job of taking advantage of human insight available in interactive models

• Large datasets can be interactively processed using Netezza

What is Netezza?

• Essentially: A big, fast SQL database

What is Netezza?

• Frontend provides SQL interface

• Backend is a large rack of specialized blades

Custom Backend Blades

• Commodity CPU, NIC, disk

• Custom FPGA replaces disk interface

– Can do basic filtering in hardware, i.e., stream processing before data hits main memory
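
The idea of filtering in the data path, before records reach main memory, can be illustrated with a small generator pipeline. This is only a software analogy under assumed names: `disk_stream`, `fpga_filter`, and the record layout are hypothetical and do not come from the paper or from Netezza.

```python
from typing import Callable, Iterable, Iterator

Record = dict  # hypothetical record layout; the slides do not describe one


def disk_stream(records: Iterable[Record]) -> Iterator[Record]:
    """Stand-in for records streaming off the disk, one at a time."""
    for record in records:
        yield record


def fpga_filter(stream: Iterable[Record],
                predicate: Callable[[Record], bool]) -> Iterator[Record]:
    """Drop non-matching records before they are ever buffered downstream."""
    return (record for record in stream if predicate(record))


# 1000 synthetic records; only a quarter of them satisfy the predicate.
on_disk = [{"id": i, "year": 1990 + i % 20} for i in range(1000)]

# Only records that pass the filter are materialized for the CPU stage.
in_memory = list(fpga_filter(disk_stream(on_disk), lambda r: r["year"] >= 2005))
```

The point of the sketch is that the CPU stage sees only the surviving records, so the cost of the later stages scales with the filtered size rather than the raw table size.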

Division of Data

• Database distributed across multiple (100+) SPUs

• Each SPU controls, manages its slice of DB

• No info on data management, replication, etc.

Division of Labor

• SPU FPGA handles basic filtering tasks

• SPU CPU handles record-level processing: filtering, parsing, projecting, logging, etc.

• SPU CPU handles most operations on intermediate results: sorts, joins, aggregates

• Frontend CPU handles remaining operations

>>> Processing close to disk
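
A rough sketch of this division of labor: each SPU filters and aggregates its own slice, and the frontend only merges small partial results. The placement rule (`key % NUM_SPUS`) is an assumption made for illustration, since the slides note that the paper gives no detail on data management. All function names are hypothetical.

```python
from collections import Counter

NUM_SPUS = 4  # the slides mention 100+ SPUs; 4 keeps the example readable


def spu_for(key: int) -> int:
    """Assumed placement rule: row key modulo the number of SPUs."""
    return key % NUM_SPUS


def distribute(rows):
    """Split the table into one slice per SPU."""
    slices = [[] for _ in range(NUM_SPUS)]
    for row in rows:
        slices[spu_for(row["id"])].append(row)
    return slices


def spu_query(slice_rows, min_year):
    """Work done next to the disk: filter, then aggregate the local slice."""
    counts = Counter()
    for row in slice_rows:
        if row["year"] >= min_year:   # filtering stage
            counts[row["year"]] += 1  # local partial aggregate
    return counts


def frontend_merge(partials):
    """The frontend only combines the small per-SPU partial results."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


rows = [{"id": i, "year": 2000 + (i // 4) % 8} for i in range(800)]
partials = [spu_query(s, min_year=2005) for s in distribute(rows)]
result = frontend_merge(partials)
```

Here the frontend receives at most a few counters per SPU instead of 800 rows, which is the sense in which processing stays close to the disk.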

What can this be used for?

• Paper gives 3 examples:

– Citation graph processing

– Search for particular structure in electrical netlist

– Word meaning disambiguation through search of ontology

Citation graph example

• Look through large, sparse graph (16 million nodes, 388 million edges)

• Find both strong (direct edge) and weak couplings (e.g., two papers cite the same work)

• Essentially same code for workstation and Netezza – no need to expose parallel architecture

• Workstation did not finish (DNF); 80-100x speedup on smaller tests
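
As a minimal sketch of what such a query could look like, the weak coupling (two papers citing the same work) is a self-join on the edge table. The `cites` table, its columns, and the toy data are assumptions; the paper's actual schema and SQL are not shown in the slides. SQLite is used here only so the example runs anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cites (citing INTEGER, cited INTEGER)")
conn.executemany(
    "INSERT INTO cites VALUES (?, ?)",
    [(1, 3), (2, 3), (1, 4), (2, 4), (5, 4), (1, 2)],
)

# Strong coupling: a direct citation edge between two papers.
strong = conn.execute(
    "SELECT citing, cited FROM cites ORDER BY citing, cited"
).fetchall()

# Weak coupling: two distinct papers that cite the same work, with the
# number of shared references as a simple strength measure.
weak = conn.execute(
    """
    SELECT a.citing, b.citing, COUNT(*) AS shared
    FROM cites AS a
    JOIN cites AS b ON a.cited = b.cited AND a.citing < b.citing
    GROUP BY a.citing, b.citing
    ORDER BY a.citing, b.citing
    """
).fetchall()
```

The query says nothing about how the data is partitioned, which matches the slide's point that the same code can run on a workstation or on the parallel backend.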

IC netlist example

• Flattened netlist of 3.5 million transistors, 10 million wires

• Search for AND structure

IC example results

• Combinatorial explosion makes directly joining all possibilities for each element impossible

• Can constrain better using fanouts of signals internal to the circuit

• Individual SQL queries for finding possible matches for the individual transistors took under 10 seconds

• Found all uses of the AND macro, as well as many other (1300+) identical structures generated through other means
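
To illustrate how a fanout constraint prunes candidate matches, the sketch below looks for a much simpler pattern than the AND macro: two NMOS transistors in series whose shared internal net connects to nothing else. The `transistor` table, the `fanout` view, and the net numbering are all hypothetical, and this is not the query used in the paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transistor "
    "(id INTEGER, kind TEXT, gate INTEGER, source INTEGER, drain INTEGER)"
)
conn.executemany(
    "INSERT INTO transistor VALUES (?, ?, ?, ?, ?)",
    [
        (1, "nmos", 1, 11, 10),  # upper device of a clean series pair
        (2, "nmos", 2, 0, 11),   # lower device; net 11 is purely internal
        (3, "nmos", 3, 13, 12),  # upper device of a decoy
        (4, "nmos", 4, 0, 13),   # net 13 is shared by three terminals,
        (5, "nmos", 5, 0, 13),   # so it is not a private internal node
    ],
)

# Fanout of a net: how many transistor terminals are attached to it.
conn.execute(
    """
    CREATE VIEW fanout AS
    SELECT net, COUNT(*) AS n
    FROM (
        SELECT gate AS net FROM transistor
        UNION ALL SELECT source FROM transistor
        UNION ALL SELECT drain FROM transistor
    )
    GROUP BY net
    """
)

pair_sql = (
    "SELECT up.id, dn.id FROM transistor AS up "
    "JOIN transistor AS dn ON up.source = dn.drain "
    "WHERE up.kind = 'nmos' AND dn.kind = 'nmos'"
)
order = " ORDER BY up.id, dn.id"

# Structural join alone: every series-connected pair is a candidate.
unconstrained = conn.execute(pair_sql + order).fetchall()

# Adding the fanout constraint keeps only pairs with a private internal net.
constrained = conn.execute(
    pair_sql + " AND up.source IN (SELECT net FROM fanout WHERE n = 2)" + order
).fetchall()
```

On this toy netlist the structural join yields three candidate pairs and the fanout constraint leaves one, which is the kind of pruning that keeps the candidate sets small enough to join.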

Ontology example

• Expand out all possible interpretations of a phrase

• Ontology specifies lexical elements, IS-A relations, concepts, and constraints on concepts

• Goal is to search the space, expand concepts to find all matches to given phrase

Ontology results

• Partially unfolded ontology

– Greatly expands database size, but reduces iterations / recursions

• Recoded ontology triples as integers

• 5.58 sec. vs. 262 sec.

• Can pipeline multiple queries
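
The two optimizations above can be sketched on a toy IS-A hierarchy: recode each term as an integer, then store the ancestors of every concept up front so a lookup needs no recursion at query time. The sample triples and table names are hypothetical, and the sketch unfolds the IS-A relation fully, whereas the slides say the ontology was only partially unfolded.

```python
import sqlite3

# Hypothetical IS-A triples; the paper's ontology is not shown in the slides.
triples = [
    ("beagle", "is_a", "dog"),
    ("dog", "is_a", "mammal"),
    ("mammal", "is_a", "animal"),
    ("cat", "is_a", "mammal"),
]

# Recode every term as an integer so joins compare integers, not text.
ids = {}


def encode(term):
    return ids.setdefault(term, len(ids))


edges = [(encode(s), encode(o)) for s, p, o in triples if p == "is_a"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE is_a (child INTEGER, parent INTEGER)")
conn.executemany("INSERT INTO is_a VALUES (?, ?)", edges)

# Unfold: compute every (concept, ancestor) pair once and store it.
conn.execute(
    """
    CREATE TABLE is_a_unfolded AS
    WITH RECURSIVE closure(child, ancestor) AS (
        SELECT child, parent FROM is_a
        UNION
        SELECT c.child, e.parent
        FROM closure AS c JOIN is_a AS e ON e.child = c.ancestor
    )
    SELECT child, ancestor FROM closure
    """
)

# A lookup is now a single non-recursive scan of the unfolded table.
names = {i: term for term, i in ids.items()}
rows = conn.execute(
    "SELECT ancestor FROM is_a_unfolded WHERE child = ? ORDER BY ancestor",
    (ids["beagle"],),
).fetchall()
beagle_ancestors = [names[a] for (a,) in rows]
```

The unfolded table holds 8 rows against 4 original edges, a small-scale version of the trade-off on the slide: more storage in exchange for fewer iterations per query.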

Issues

• Works if you can reduce your problem to SQL queries

• All of the problems were based on graph expansion / exploration – how about other domains?

• Issues of database partitioning? How does arbitrary slicing across 108 blades affect performance / scalability, esp. for non-sparse problems?

• Strawman comparison to workstation class machine: how does a traditional DB server / storage cluster compare?