PLINQ: A Query Language for Data Parallel Programming
Joe Duffy, Microsoft
Declarative Aspects of Multicore Programming (DAMP) Workshop – POPL'07
© 2007, Microsoft, Corp. All rights reserved.
Research Context (1)

New Microsoft technology, Language Integrated Query (LINQ)
- Primary goals:
  - Data-source agnostic, type-safe query language
  - Simplify expression of complex, multi-step operations over sets of data
- Why?
  - Many programs contain textual (untyped) SQL, XPath, XQuery, …
  - Programs must deal with increasingly large quantities of data
    - Hardware: memory and hard disk capacities continue to grow: GB → TB → PB → …
    - Industry software: rich media, interactive visualizations, AI, NLP
    - Databases are ubiquitous, but aren't always the solution
- Features SQL-like relational algebra syntax and libraries
  - Supports queries over in-memory collections, XML, and RDBMSs
- In Microsoft's "developer division"
  - The entity responsible for Visual Studio, Visual C#, Visual Basic, and Visual C++
  - Releasing as part of Visual Studio 2007
Research Context (2)

This talk describes extensions to LINQ to accomplish parallel query execution, i.e. Parallel LINQ (PLINQ)
- Goals:
  - Apply data parallelism to LINQ query execution
  - Preserve the LINQ programming model, with little to no required interface changes
  - Deal efficiently with composition and nesting of query operators
- Audience: developers on Microsoft's .NET platform, C#, VB, and VC++
- Architectures: those running Windows – mostly MIMD (with SIMD/vector extensions), multi-core machines in the range of 2…64 processors, typically AMD or Intel
- Mostly an application of other techniques: RDBMS parallel query execution (Volcano, SQL Server, Oracle), NESL, GpH, and others
Syntax: A Query Language

var q = from x1 in y
        join x2 in z on x1.fA equals x2.fA
        where p(x2.fB)
        orderby x1.fC
        select new { x1.fA, x2.fB, x1.fC };

int r = q.Sum(a => a.fB * a.fC);
Queries == Trees of Operators

A query is comprised of a tree of operators
- Most operators operate on a stream t of type T* and produce a (lazily, on-demand generated) stream u of type U*, i.e. T* → U*

    var q = from x in A where (x % seed) == 0 select x/0.33f;

- Many operators are unary, forming a stream; others are binary, i.e. a tree
- Some operators "terminate" the stream by reducing to a non-stream, i.e. T* → U

    float s = q.Sum();

- As with a program AST, these trees can be analyzed and rewritten
- The declarative, data-intensive, bulk-transformation nature means the execution technique is an implementation detail
  - This is why we can safely introduce parallelism
[Figure: an example query's operator tree, with Where, Select, Where, and Join nodes]
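The stream-of-operators model above can be sketched in Python (for illustration only; PLINQ itself is a C#/.NET library): each operator consumes an iterator of T and lazily yields an iterator of U, and no work happens until a terminal operator such as a sum pulls elements through the pipeline. The operator names and the `seed` value are illustrative.

```python
# Lazy operator pipeline: T* -> U* operators composed into a tree,
# terminated by a T* -> U reduction (sum).
def where(source, predicate):
    for x in source:            # pulls one element at a time, on demand
        if predicate(x):
            yield x

def select(source, projection):
    for x in source:
        yield projection(x)

seed = 3
A = range(1, 20)
# Analogue of: from x in A where (x % seed) == 0 select x/0.33f
q = select(where(A, lambda x: x % seed == 0), lambda x: x / 0.33)  # lazy: no work yet
s = sum(q)                      # terminal operator forces the whole pipeline
```

Because each stage is a generator, the pipeline materializes nothing until `sum` drives it, mirroring the on-demand T* streams described above.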
Declaring Queries

Queries are expressed with one of two mechanisms:
- "Query comprehensions" – syntax extensions to Visual C# and VB to build a query
- Calling query APIs directly
The former is transformed into the latter by compilers, e.g.

    var q = from x in Y where p(x) orderby x.f1 select x.f2;

becomes…

    var q = Enumerable.Select(Enumerable.OrderBy(
        Enumerable.Where(Y, x => p(x)), x => x.f1), x => x.f2);

- Comprehensions allow query declaration "left-to-right" instead of "inside out"
  - But they support only a subset of query operators right now; my hope is that one day this restriction is gone
To obtain results from a query, execution must be forced, e.g.

    foreach (T e in q) a(e);

or

    T[] results = q.ToArray();

etc…
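The comprehension-to-method translation can be modeled in Python (a sketch, not the real LINQ API; the `Where`/`OrderBy`/`Select` helpers here are hypothetical stand-ins for the C# operators): the readable left-to-right query is sugar for the nested, "inside-out" operator calls.

```python
# Plain-Python stand-ins for the C# query operators.
def Where(src, pred):
    return (x for x in src if pred(x))

def OrderBy(src, key):
    return iter(sorted(src, key=key))

def Select(src, proj):
    return (proj(x) for x in src)

Y = [(3, "c"), (1, "a"), (2, "b"), (5, "e")]
p = lambda x: x[0] != 5

# "from x in Y where p(x) orderby x[0] select x[1]" translates to:
q = Select(OrderBy(Where(Y, p), key=lambda x: x[0]), lambda x: x[1])
results = list(q)    # forcing execution, as with foreach / ToArray
```

Forcing the query with `list` plays the role of `foreach` or `ToArray`: until then, `q` is just the composed operator tree.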
Query Inputs and Outputs

Input to an operator is any data sequence expressible in the CLR's type system
- Input to one query operator is often the output of a child operator
- Leaves: arrays, vectors, sets, trees, infinite streams of data
  - Non-linear data types are flattened for execution, but presented to the programmer in original form
- This works because all sequences are unified by a common .NET interface, IEnumerable<T> (standard enumerator, e.g. MoveNext, Current), i.e. T*
Query evaluation is mostly lazy; we can get the "first" U from the output without forcing complete calculation of the input

    var q = from x in infiniteStream where p(x) select x;

- Much like many streaming/vector processing systems
- Some exceptions:
  - A Sort needs to evaluate its whole subtree before producing one item
  - A Join evaluates one of its subtrees fully
  - And so on…
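Lazy evaluation over an unbounded input can be demonstrated with a Python generator (again a sketch of the idea, not PLINQ's machinery): we pull the first matching element from an infinite stream without the query ever evaluating the rest.

```python
# An infinite data source: enumerating it fully would never terminate.
def infinite_stream():
    n = 0
    while True:
        yield n
        n += 1

p = lambda x: x > 0 and x % 7 == 0
q = (x for x in infinite_stream() if p(x))   # still lazy: nothing consumed yet
first = next(q)                              # pulls only as much input as needed
```

A sort or join over `infinite_stream()` would, as the slide notes, have to drain its subtree first and therefore never produce an element.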
C# Query Comprehension Syntax

expr ::= … | query-expr
query-expr ::= from-clause query-body
from-clause ::= 'from' itemNameExpr 'in' srcExpr
query-body ::= join-clause*
    (from-clause join-clause* | let-clause | where-clause)*
    orderby-clause?
    (select-clause | groupby-clause)
    query-continuation?
join-clause ::= 'join' itemNameExpr 'in' srcExpr 'on' keyExpr1 'equals' keyExpr2 ('into' itemNameExpr)?
let-clause ::= 'let' itemNameExpr '=' selExpr
where-clause ::= 'where' predExpr
orderby-clause ::= 'orderby' (keyExpr ('ascending' | 'descending')?)*
select-clause ::= 'select' selExpr
groupby-clause ::= 'group' selExpr 'by' keyExpr
query-continuation ::= 'into' itemNameExpr query-body
Common Query Operators

Binding operators, used to express operations on abstract elements:
- Bind: from x in A – bind variable x to a single element e in the data source A, one at a time, so that x may be referenced in the query text
- Cross-product bind: from x in A from y in B – create the relational cross-product, A × B, binding x and y to members of the resulting pairs (x, y)
- Let bind: let x = e – bind variable x to the result of evaluating expression e
General operators, to perform relational operations:
- Selection: where p – for each element e of type T, yield only those for which the selection predicate, p(e), of form T → bool, evaluates to true
- Sort: orderby k (ascending | descending)? – order the elements of type T ascending or descending based on keys generated with the key-selection function k, of form T → K
- Map: select p – transform each element e from type T to U via the projection function, p(e), of form T → U
- Equi-join: join y in B on k1 equals k2 (into z)? – for each pair of elements (x, y) in the cross-product of the "left" input A and the "right" input B, for which k1(x) == k2(y), bind the result to y (or z if specified, a "group join")
- Grouping: group p by k – yield groupings of data, of type (K, T*), for which k(e), of the form T → K, is equal for all e in the group
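The general operators compose as shown in this Python sketch (helper names are mine, not a real API): a naive equi-join produces (x, y) pairs from the cross-product where the keys match, and then selection, sort, and map refine the result.

```python
# Naive nested-loop form of: join y in B on k1 equals k2
def equi_join(A, B, k1, k2):
    return [(x, y) for x in A for y in B if k1(x) == k2(y)]

A = [("ann", 1), ("bob", 2), ("cat", 3)]     # (name, id)
B = [(1, "WA"), (3, "OR")]                   # (id, state)

joined = equi_join(A, B, k1=lambda x: x[1], k2=lambda y: y[0])
filtered = [xy for xy in joined if xy[1][1] == "WA"]     # where: T -> bool
ordered = sorted(filtered, key=lambda xy: xy[0][0])      # orderby: key T -> K
projected = [xy[0][0] for xy in ordered]                 # select: T -> U
```

Note the O(nm) cost of the nested loop; the Partitioning Techniques slide later shows how a hash table (and hash partitioning) reduces this.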
Some Example Queries

Word counts:

    string doc = …;
    var counts = from w in doc.Split(' ') group w by w;

Weighted average:

    float[] D = …, W = …;
    float avg = D.ZipWith(W, (x,y) => x*y).Sum() / W.Sum();

"Select customers whose billing address is in Washington in the United States, or whose cumulative order total is >= $25 USD; order them by total $ descending, group them by state, and project just their name and total":

    Set<Customer> custs = …;
    Set<Order> ords = …;
    Set<Address> addrs = …;
    var q = from c in custs
            join o in ords on c.ID equals o.CustomerID into co
            join a in addrs on o.BillingID equals a.AddressID
            let ordTotal = co.Sum(o => o.TotalCost)
            where (a.State == "WA" && a.Country == "United States") || ordTotal >= 25.00m
            orderby ordTotal descending
            group new { c.LastName, c.FirstName, ordTotal } by a.State;
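The first two examples have direct executable analogues in Python (illustrative translations, not the C# originals): word counts are a group-by-self followed by counting each group, and the weighted average is a zip-with-multiply feeding two sums.

```python
# Word counts: analogue of "from w in doc.Split(' ') group w by w"
doc = "the cat sat on the mat the end"
counts = {}
for w in doc.split(' '):
    counts[w] = counts.get(w, 0) + 1   # each group's size is its count

# Weighted average: analogue of D.ZipWith(W, (x,y) => x*y).Sum() / W.Sum()
D = [1.0, 2.0, 3.0]
W = [0.5, 0.25, 0.25]
avg = sum(x * y for x, y in zip(D, W)) / sum(W)
```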
Additional Query Operators

Some have no syntactic representation and must be accessed with library calls:
- ForAll(A, a): invoke side-effecting operation a(x) for each element x in A
- Concat(A, B): linearly concatenate the data inputs A and B
- Zip(A, B): combine two inputs A and B into pairs by overlaying data
- Reverse(A): reverse the ordering of elements in vector A
- Range(x, y): generate a stream representing the range [x, y)
- Set operators: Distinct(A), Union(A, B), Intersect(A, B)
- Reductions (a.k.a. aggregations, folds): Aggregate(A, binOp), Count(A), Sum(A), Min(A), Max(A), Average(A), EqualAll(A, B), Any(A, p), All(A, p), Contains(A, e)
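Several of these library-only operators have one-line Python equivalents, which may clarify their semantics (the mapping is mine; the C# names are the ones listed above):

```python
from functools import reduce

A = list(range(0, 5))                          # Range(0, 5) -> the range [0, 5)
B = ["a", "b", "c", "d", "e"]

zipped = list(zip(A, B))                       # Zip(A, B): pair by overlaying
total = reduce(lambda acc, x: acc + x, A, 0)   # Aggregate(A, binOp): a fold
any_even = any(x % 2 == 0 for x in A)          # Any(A, p)
distinct = sorted(set([1, 2, 2, 3]))           # Distinct(A) (order not specified)
```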
Runtime: Parallel Execution
Operator Parallelism

Intra-operator, i.e. partitioning:
- Input to a single operator is "split" into p pieces and run in parallel
- Adjacent and nested operators can enjoy fusion
- Good temporal locality of data – each datum "belongs" to a partition
Inter-operator, i.e. pipelining:
- Operators run concurrently with respect to one another
- Can avoid "data skew", i.e. imbalanced partitions, as can occur with partitioning
- Typically incurs more synchronization overhead and yields considerably worse locality than intra-operator parallelism, so is less attractive
Partitioning is preferred unless there is no other choice
- For example, sometimes the programmer wants a single-CPU view, e.g.:

    foreach (x in q) a(x)

- The consumption action a might be written to assume no parallelism
  - Bad if a(x) costs more than the element production latency; otherwise, parallel tasks just eat up memory, eventually stopping when the bounded buffer fills
  - But a(x) can be parallel too
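Intra-operator (partitioned) parallelism can be sketched in Python with a thread pool (purely illustrative; PLINQ partitions IEnumerable<T> streams with far more machinery, and the partition count here is arbitrary): the input is split into p pieces, each piece runs a fused where+select, and the per-partition outputs are merged.

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(chunk, seed=3):
    # Fused where + select over one partition of the input.
    return [x / 0.33 for x in chunk if x % seed == 0]

A = list(range(1, 1001))
p = 4
chunks = [A[i::p] for i in range(p)]          # strided partitioning of the input

with ThreadPoolExecutor(max_workers=p) as pool:
    partials = list(pool.map(process_partition, chunks))

results = [y for part in partials for y in part]   # merge partition outputs
```

Each partition's work is independent (good locality, no synchronization in the loop body); only the final merge coordinates the threads, which is the appeal of partitioning over pipelining.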
Parallelism Illustrations

q = from x in A where p(x) select x³;

[Figure: three executions of the query over input A, on threads 1–4 –
 1. Intra-operator: A is partitioned and each thread runs a fused "where p(x) select x³" over its partition;
 2. Inter-operator: the where and select operators run pipelined on separate threads;
 3. Both composed: partitioned pipelines.]
Deciding Parallel Execution Strategy

Tree analysis informs decision making:
- Where to introduce parallelism? And what kind? (partition vs. pipeline)
- Based on intrinsic query properties and operator costs
  - Data sizes, selectivity (for filter f, what % satisfies the predicate?)
  - Intelligent "guesses", code analysis, adaptive feedback over time
But not just parallelism; higher-level optimizations too, e.g.
- Common sub-expression elimination, e.g.

    from x in X where p(f(x)) select f(x);

- Reordering operations to:
  - Decrease the cost of query execution, e.g. put a filter before the sort, even if the user wrote it the other way around
  - Achieve better operator fusion, reducing synchronization cost
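The filter-before-sort rewrite can be checked concretely (Python sketch; the rewrite is only valid when the predicate is pure): filtering first produces the same result as sorting first, while giving the sort fewer elements to touch.

```python
data = [9, 2, 7, 4, 1, 8, 3]
pred = lambda x: x % 2 == 0          # a pure predicate, so reordering is safe

as_written = [x for x in sorted(data) if pred(x)]   # orderby, then where
rewritten = sorted(x for x in data if pred(x))      # where pushed below the sort
```

With a selective predicate, `rewritten` sorts only the surviving elements; on large inputs that is the cheaper plan even though both forms are semantically equal.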
Partitioning Techniques

Partitioning can be data-source sensitive
- If a nested query, can fuse existing partitions
- If an array, calculate strides and contiguous ranges (+spatial locality)
- If a (possibly infinite) stream, lazily hand out chunks
Partitioning can be operator sensitive
- E.g. equi-joins employ a hashtable to turn an O(nm) "nested join" into O(n+m)
  - Build a hash table out of one data source; then probe it for matches
  - Only works if all data elements in data source A with key k are in the same partition as those elements in data source B also with key k
  - We can use "hash partitioning" to accomplish this: for p partitions, calculate k for each element e in A and in B, and then assign to a partition based on the key, e.g. k.GetHashCode() % p
- Output of a sort: we can fuse, but must restrict ordering, ordinal and key based
Existing partitions might be repartitioned
- Can't "push down" key partitioning information to leaves: types change during stream data flow, e.g. a select operator
- Nesting: a join processing the output of another join operator
- Or just to combat partition skew
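Hash partitioning for a parallel equi-join can be sketched as follows (Python analogue of the `k.GetHashCode() % p` scheme; helper names are mine): routing every element of A and B to partition `hash(key) % p` guarantees matching keys land in the same partition, so each partition can run an independent O(n+m) hash join.

```python
def hash_partition(src, key, p):
    # Assign each element to a partition by hashing its join key.
    parts = [[] for _ in range(p)]
    for e in src:
        parts[hash(key(e)) % p].append(e)
    return parts

def hash_join(A_part, B_part, k1, k2):
    table = {}
    for x in A_part:                          # build side: hash table on k1
        table.setdefault(k1(x), []).append(x)
    # probe side: look up each B element's key
    return [(x, y) for y in B_part for x in table.get(k2(y), [])]

A = [("ann", 1), ("bob", 2), ("cat", 3), ("dan", 2)]
B = [(2, "x"), (3, "y")]
p = 4
A_parts = hash_partition(A, lambda x: x[1], p)
B_parts = hash_partition(B, lambda y: y[0], p)
joined = [pair
          for i in range(p)                   # each i could run on its own thread
          for pair in hash_join(A_parts[i], B_parts[i],
                                lambda x: x[1], lambda y: y[0])]
```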
Example: Query Nesting and Fusion

Nesting queries inside of others is common; we can fuse partitions

    var q1 = from x in A select x*2;
    var q2 = q1.Sum();

[Figure: three partitioned executions –
 1. Select (alone): "select x*2" runs per partition;
 2. Sum (alone): per-partition + reductions combined by a final +;
 3. Select + Sum: each partition fuses "select x*2" with its partial sum before the final combine.]
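The fused "Select + Sum" case from the figure can be sketched in Python (illustrative partitioning; PLINQ performs this fusion internally): each partition runs the map and its local sum together, and only the partial sums meet at the combine step.

```python
A = list(range(100))
p = 4
chunks = [A[i::p] for i in range(p)]     # existing partitions of the nested query

# Fused per partition: "select x*2" and the local Sum run in one pass.
partial_sums = [sum(x * 2 for x in chunk) for chunk in chunks]

q2 = sum(partial_sums)                   # final combine of p partial results
```

Fusion means no intermediate doubled stream is ever materialized or repartitioned between the Select and the Sum; only p scalars cross partition boundaries.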
Execution of Work

Windows' finest granularity of work is a thread
- Each partition has at most one thread assigned to it, assigned via a gang-scheduling/dynamic-work-stealing-like (à la Cilk) algorithm
- Tension between creating "just the right number of threads" (static+dynamic adaptivity) versus over-partitioning the work: would change some things, but maybe for the better
  - Hard to predict things like IO and blocking
The developer still has shared memory, and can make horrible mistakes, e.g.:

    int s_x = 0;
    var q = from x in A where x == s_x++;

- Analysis can sometimes catch this, but often not (dynamic function invocation, e.g. where x == side_effecting_func(…))
- C#'s, and generally the CLR's, type system doesn't support the notion of purity (though some research systems, e.g. Spec#, provide hope)
- Where is transactional memory when you need it?
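Even before any threads are involved, the `x == s_x++` mistake can be demonstrated deterministically (Python sketch of the hazard, not the C# code itself): a predicate that mutates shared state gives a different answer each time the supposedly declarative query runs, and under parallel partitioning the interleaving of those mutations becomes unpredictable.

```python
s_x = 0

def impure_pred(x):
    # Hidden mutation of shared state, like "x == s_x++" in the slide.
    global s_x
    result = (x == s_x)
    s_x += 1
    return result

A = [0, 1, 5, 3]
first_run = [x for x in A if impure_pred(x)]    # s_x advances as the query runs
second_run = [x for x in A if impure_pred(x)]   # same query text, different result
```

Because the type system cannot mark `impure_pred` as pure, a query engine has no safe way to know whether re-running, reordering, or partitioning this predicate changes the program's meaning.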
Some Conclusions & Observations

Results have been encouraging: about what you'd expect given prior related research
- Good performance, few changes required to the serial programming model
- Given the upcoming public release of LINQ in VS, we hope reach will be good
- Not a silver bullet – just one tool in a developer's belt
Hard to "catch up" to huge parallelism constants on Windows, particularly given small data inputs and/or inexpensive operators

    var q = Range(0,100).Sum(); // add up #s [0,100)

- Also easy to run into memory bottlenecks; possible opportunities for architecture-aware optimizations (we already try to maximize spatial+temporal locality)
Costs are hard to get right
- Too much dynamism in the platform to arrive at a correct #
- Even if we did, it's hard to create heuristics that scale well across platforms
- Too much decomposition, too little, unexpected IO (paging, …), synchronization
- But in the end: do costs really matter? Or is it better to represent concurrency using a fixed granule and let another scheduling mechanism apply policy (work stealing)?
Many queries are candidates for SIMD/vector architectures
- Targeting other instruction sets (SSEx, GPU) could be profitable
The End

No paper yet – tentative plans for '07
Public release dates TBD; for more information, watch: http://www.bluebytesoftware.com/blog/

Thanks for coming …