PLINQ: A Query Language for Data Parallel Programming
Joe Duffy, Microsoft
Declarative Aspects of Multicore Programming (DAMP) Workshop – POPL'07
© 2007, Microsoft, Corp. All rights reserved.
Research Context (1)

New Microsoft technology, Language Integrated Query (LINQ)
- Primary goals:
  - Data-source agnostic, type-safe query language
  - Simplify expression of complex, multi-step operations over sets of data
- Why?
  - Many programs contain textual (untyped) SQL, XPath, XQuery, …
  - Programs must deal with increasingly large quantities of data
    - Hardware: memory and hard disk capacities continue to grow: GB → TB → PB → …
    - Industry software: rich media, interactive visualizations, AI, NLP
    - Databases are ubiquitous, but aren't always the solution
- Features SQL-like relational algebra syntax and libraries
  - Supports queries over in-memory collections, XML, and RDBMSs
- In Microsoft's "developer division"
  - The entity responsible for Visual Studio, Visual C#, Visual Basic, and Visual C++
  - Releasing as part of Visual Studio 2007
Research Context (2)

This talk describes extensions to LINQ to accomplish parallel query execution, i.e. Parallel LINQ (PLINQ)
- Goals:
  - Apply data parallelism to LINQ query execution
  - Preserve the LINQ programming model, with little to no required interface changes
  - Deal efficiently with composition and nesting of query operators
- Audience: developers on Microsoft's .NET platform, C#, VB, and VC++
- Architectures: those running Windows – mostly MIMD (with SIMD/vector extensions), multi-core machines in the range of 2…64 processors, typically AMD or Intel
- Mostly an application of other techniques: RDBMS parallel query execution (Volcano, SQL Server, Oracle), NESL, GpH, and others
Syntax: A Query Language

var q = from x1 in y
        join x2 in z on x1.fA equals x2.fA
        where p(x2.fB)
        orderby x1.fC
        select new { x1.fA, x2.fB, x1.fC };

int r = q.Sum(a => a.fB * a.fC);
Queries == Trees of Operators

A query is comprised of a tree of operators
- Most operators operate on a stream t of type T* and produce a (lazily, on-demand generated) stream u of type U*, i.e. T* → U*

    var q = from x in A where (x % seed) == 0 select x/0.33f;

- Many operators are unary, forming a stream; others are binary, i.e. a tree
- Some operators "terminate" the stream by reducing to a non-stream, i.e. T* → U

    float s = q.Sum();

- As with a program AST, these trees can be analyzed and rewritten
- The declarative, data-intensive, bulk-transformation nature means the execution technique is an implementation detail
  - This is why we can safely introduce parallelism
[Figure: an example query's operator tree, with Where, Select, Where, and Join nodes]
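The stream-of-operators model above can be sketched in Python (for illustration only; PLINQ itself is a C#/.NET library): each operator consumes an iterator of T and lazily yields an iterator of U, and no work happens until a terminal operator such as a sum pulls elements through the pipeline. The operator names and the `seed` value are illustrative.

```python
# Lazy operator pipeline: T* -> U* operators composed into a tree,
# terminated by a T* -> U reduction (sum).
def where(source, predicate):
    for x in source:            # pulls one element at a time, on demand
        if predicate(x):
            yield x

def select(source, projection):
    for x in source:
        yield projection(x)

seed = 3
A = range(1, 20)
# Analogue of: from x in A where (x % seed) == 0 select x/0.33f
q = select(where(A, lambda x: x % seed == 0), lambda x: x / 0.33)  # lazy: no work yet
s = sum(q)                      # terminal operator forces the whole pipeline
```

Because each stage is a generator, the pipeline materializes nothing until `sum` drives it, mirroring the on-demand T* streams described above.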
Declaring Queries

Queries are expressed with one of two mechanisms:
- "Query comprehensions" – syntax extensions to Visual C# and VB to build a query
- Calling query APIs directly
The former is transformed into the latter by compilers, e.g.

    var q = from x in Y where p(x) orderby x.f1 select x.f2;

becomes…

    var q = Enumerable.Select(Enumerable.OrderBy(
        Enumerable.Where(Y, x => p(x)), x => x.f1), x => x.f2);

- Comprehensions allow query declaration "left-to-right" instead of "inside out"
  - But they support only a subset of query operators right now; my hope is that one day this restriction is gone
To obtain results from a query, execution must be forced, e.g.

    foreach (T e in q) a(e);

or

    T[] results = q.ToArray();

etc…
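The comprehension-to-method translation can be modeled in Python (a sketch, not the real LINQ API; the `Where`/`OrderBy`/`Select` helpers here are hypothetical stand-ins for the C# operators): the readable left-to-right query is sugar for the nested, "inside-out" operator calls.

```python
# Plain-Python stand-ins for the C# query operators.
def Where(src, pred):
    return (x for x in src if pred(x))

def OrderBy(src, key):
    return iter(sorted(src, key=key))

def Select(src, proj):
    return (proj(x) for x in src)

Y = [(3, "c"), (1, "a"), (2, "b"), (5, "e")]
p = lambda x: x[0] != 5

# "from x in Y where p(x) orderby x[0] select x[1]" translates to:
q = Select(OrderBy(Where(Y, p), key=lambda x: x[0]), lambda x: x[1])
results = list(q)    # forcing execution, as with foreach / ToArray
```

Forcing the query with `list` plays the role of `foreach` or `ToArray`: until then, `q` is just the composed operator tree.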
Query Inputs and Outputs

Input to an operator is any data sequence expressible in the CLR's type system
- Input to one query operator is often the output of a child operator
- Leaves: arrays, vectors, sets, trees, infinite streams of data
  - Non-linear data types are flattened for execution, but presented to the programmer in original form
- This works because all sequences are unified by a common .NET interface, IEnumerable<T> (standard enumerator, e.g. MoveNext, Current), i.e. T*
Query evaluation is mostly lazy; we can get the "first" U from the output without forcing complete calculation of the input

    var q = from x in infiniteStream where p(x) select x;

- Much like many streaming/vector processing systems
- Some exceptions:
  - A Sort needs to evaluate its whole subtree before producing one item
  - A Join evaluates one of its subtrees fully
  - And so on…
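Lazy evaluation over an unbounded input can be demonstrated with a Python generator (again a sketch of the idea, not PLINQ's machinery): we pull the first matching element from an infinite stream without the query ever evaluating the rest.

```python
# An infinite data source: enumerating it fully would never terminate.
def infinite_stream():
    n = 0
    while True:
        yield n
        n += 1

p = lambda x: x > 0 and x % 7 == 0
q = (x for x in infinite_stream() if p(x))   # still lazy: nothing consumed yet
first = next(q)                              # pulls only as much input as needed
```

A sort or join over `infinite_stream()` would, as the slide notes, have to drain its subtree first and therefore never produce an element.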
C# Query Comprehension Syntax

expr ::= … | query-expr
query-expr ::= from-clause query-body
from-clause ::= 'from' itemNameExpr 'in' srcExpr
query-body ::= join-clause*
    (from-clause join-clause* | let-clause | where-clause)*
    orderby-clause?
    (select-clause | groupby-clause)
    query-continuation?
join-clause ::= 'join' itemNameExpr 'in' srcExpr 'on' keyExpr1 'equals' keyExpr2 ('into' itemNameExpr)?
let-clause ::= 'let' itemNameExpr '=' selExpr
where-clause ::= 'where' predExpr
orderby-clause ::= 'orderby' (keyExpr ('ascending' | 'descending')?)*
select-clause ::= 'select' selExpr
groupby-clause ::= 'group' selExpr 'by' keyExpr
query-continuation ::= 'into' itemNameExpr query-body
Common Query Operators

Binding operators, used to express operations on abstract elements:
- Bind: from x in A – bind variable x to a single element e in the data source A, one at a time, so that x may be referenced in the query text
- Cross-product bind: from x in A from y in B – create the relational cross-product, A × B, binding x and y to members of the resulting pairs (x, y)
- Let bind: let x = e – bind variable x to the result of evaluating expression e
General operators, to perform relational operations:
- Selection: where p – for each element e of type T, yield only those for which the selection predicate, p(e), of form T → bool, evaluates to true
- Sort: orderby k (ascending | descending)? – order the elements of type T ascending or descending based on keys generated with the key-selection function k, of form T → K
- Map: select p – transform each element e from type T to U via the projection function, p(e), of form T → U
- Equi-join: join y in B on k1 equals k2 (into z)? – for each pair of elements (x, y) in the cross-product of the "left" input A and the "right" input B, for which k1(x) == k2(y), bind the result to y (or z if specified, a "group join")
- Grouping: group p by k – yield groupings of data, of type (K, T*), for which k(e), of the form T → K, is equal for all e in the group
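The general operators compose as shown in this Python sketch (helper names are mine, not a real API): a naive equi-join produces (x, y) pairs from the cross-product where the keys match, and then selection, sort, and map refine the result.

```python
# Naive nested-loop form of: join y in B on k1 equals k2
def equi_join(A, B, k1, k2):
    return [(x, y) for x in A for y in B if k1(x) == k2(y)]

A = [("ann", 1), ("bob", 2), ("cat", 3)]     # (name, id)
B = [(1, "WA"), (3, "OR")]                   # (id, state)

joined = equi_join(A, B, k1=lambda x: x[1], k2=lambda y: y[0])
filtered = [xy for xy in joined if xy[1][1] == "WA"]     # where: T -> bool
ordered = sorted(filtered, key=lambda xy: xy[0][0])      # orderby: key T -> K
projected = [xy[0][0] for xy in ordered]                 # select: T -> U
```

Note the O(nm) cost of the nested loop; the Partitioning Techniques slide later shows how a hash table (and hash partitioning) reduces this.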
Some Example Queries

Word counts:

    string doc = …;
    var counts = from w in doc.Split(' ') group w by w;

Weighted average:

    float[] D = …, W = …;
    float avg = D.ZipWith(W, (x,y) => x*y).Sum() / W.Sum();

"Select customers whose billing address is in Washington in the United States, or whose cumulative order total is >= $25 USD; order them by total $ descending, group them by state, and project just their name and total":

    Set<Customer> custs = …;
    Set<Order> ords = …;
    Set<Address> addrs = …;
    var q = from c in custs
            join o in ords on c.ID equals o.CustomerID into co
            join a in addrs on o.BillingID equals a.AddressID
            let ordTotal = co.Sum(o => o.TotalCost)
            where (a.State == "WA" && a.Country == "United States") || ordTotal >= 25.00m
            orderby ordTotal descending
            group new { c.LastName, c.FirstName, ordTotal } by a.State;
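The first two examples have direct executable analogues in Python (illustrative translations, not the C# originals): word counts are a group-by-self followed by counting each group, and the weighted average is a zip-with-multiply feeding two sums.

```python
# Word counts: analogue of "from w in doc.Split(' ') group w by w"
doc = "the cat sat on the mat the end"
counts = {}
for w in doc.split(' '):
    counts[w] = counts.get(w, 0) + 1   # each group's size is its count

# Weighted average: analogue of D.ZipWith(W, (x,y) => x*y).Sum() / W.Sum()
D = [1.0, 2.0, 3.0]
W = [0.5, 0.25, 0.25]
avg = sum(x * y for x, y in zip(D, W)) / sum(W)
```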
Additional Query Operators

Some have no syntactic representation and must be accessed with library calls:
- ForAll(A, a): invoke side-effecting operation a(x) for each element x in A
- Concat(A, B): linearly concatenate the data inputs A and B
- Zip(A, B): combine two inputs A and B into pairs by overlaying data
- Reverse(A): reverse the ordering of elements in vector A
- Range(x, y): generate a stream representing the range [x, y)
- Set operators: Distinct(A), Union(A, B), Intersect(A, B)
- Reductions (a.k.a. aggregations, folds): Aggregate(A, binOp), Count(A), Sum(A), Min(A), Max(A), Average(A), EqualAll(A, B), Any(A, p), All(A, p), Contains(A, e)
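Several of these library-only operators have one-line Python equivalents, which may clarify their semantics (the mapping is mine; the C# names are the ones listed above):

```python
from functools import reduce

A = list(range(0, 5))                          # Range(0, 5) -> the range [0, 5)
B = ["a", "b", "c", "d", "e"]

zipped = list(zip(A, B))                       # Zip(A, B): pair by overlaying
total = reduce(lambda acc, x: acc + x, A, 0)   # Aggregate(A, binOp): a fold
any_even = any(x % 2 == 0 for x in A)          # Any(A, p)
distinct = sorted(set([1, 2, 2, 3]))           # Distinct(A) (order not specified)
```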
Runtime: Parallel Execution
Operator Parallelism

Intra-operator, i.e. partitioning:
- Input to a single operator is "split" into p pieces and run in parallel
- Adjacent and nested operators can enjoy fusion
- Good temporal locality of data – each datum "belongs" to a partition
Inter-operator, i.e. pipelining:
- Operators run concurrently with respect to one another
- Can avoid "data skew", i.e. imbalanced partitions, as can occur with partitioning
- Typically incurs more synchronization overhead and yields considerably worse locality than intra-operator parallelism, so is less attractive
Partitioning is preferred unless there is no other choice
- For example, sometimes the programmer wants a single-CPU view, e.g.:

    foreach (x in q) a(x)

- The consumption action a might be written to assume no parallelism
  - Bad if a(x) costs more than the element production latency; otherwise, parallel tasks just eat up memory, eventually stopping when the bounded buffer fills
  - But a(x) can be parallel too
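Intra-operator (partitioned) parallelism can be sketched in Python with a thread pool (purely illustrative; PLINQ partitions IEnumerable<T> streams with far more machinery, and the partition count here is arbitrary): the input is split into p pieces, each piece runs a fused where+select, and the per-partition outputs are merged.

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(chunk, seed=3):
    # Fused where + select over one partition of the input.
    return [x / 0.33 for x in chunk if x % seed == 0]

A = list(range(1, 1001))
p = 4
chunks = [A[i::p] for i in range(p)]          # strided partitioning of the input

with ThreadPoolExecutor(max_workers=p) as pool:
    partials = list(pool.map(process_partition, chunks))

results = [y for part in partials for y in part]   # merge partition outputs
```

Each partition's work is independent (good locality, no synchronization in the loop body); only the final merge coordinates the threads, which is the appeal of partitioning over pipelining.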
Parallelism Illustrations

q = from x in A where p(x) select x³;

[Figure: three executions of the query over input A, on threads 1–4 –
 1. Intra-operator: A is partitioned and each thread runs a fused "where p(x) select x³" over its partition;
 2. Inter-operator: the where and select operators run pipelined on separate threads;
 3. Both composed: partitioned pipelines.]
Deciding Parallel Execution Strategy

Tree analysis informs decision making:
- Where to introduce parallelism? And what kind? (partition vs. pipeline)
- Based on intrinsic query properties and operator costs
  - Data sizes, selectivity (for filter f, what % satisfies the predicate?)
  - Intelligent "guesses", code analysis, adaptive feedback over time
But not just parallelism; higher-level optimizations too, e.g.
- Common sub-expression elimination, e.g.

    from x in X where p(f(x)) select f(x);

- Reordering operations to:
  - Decrease the cost of query execution, e.g. put a filter before the sort, even if the user wrote it the other way around
  - Achieve better operator fusion, reducing synchronization cost
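The filter-before-sort rewrite can be checked concretely (Python sketch; the rewrite is only valid when the predicate is pure): filtering first produces the same result as sorting first, while giving the sort fewer elements to touch.

```python
data = [9, 2, 7, 4, 1, 8, 3]
pred = lambda x: x % 2 == 0          # a pure predicate, so reordering is safe

as_written = [x for x in sorted(data) if pred(x)]   # orderby, then where
rewritten = sorted(x for x in data if pred(x))      # where pushed below the sort
```

With a selective predicate, `rewritten` sorts only the surviving elements; on large inputs that is the cheaper plan even though both forms are semantically equal.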
Partitioning Techniques

Partitioning can be data-source sensitive
- If a nested query, can fuse existing partitions
- If an array, calculate strides and contiguous ranges (+spatial locality)
- If a (possibly infinite) stream, lazily hand out chunks
Partitioning can be operator sensitive
- E.g. equi-joins employ a hashtable to turn an O(nm) "nested join" into O(n+m)
  - Build a hash table out of one data source; then probe it for matches
  - Only works if all data elements in data source A with key k are in the same partition as those elements in data source B also with key k
  - We can use "hash partitioning" to accomplish this: for p partitions, calculate k for each element e in A and in B, and then assign to a partition based on the key, e.g. k.GetHashCode() % p
- Output of a sort: we can fuse, but must restrict ordering, ordinal and key based
Existing partitions might be repartitioned
- Can't "push down" key partitioning information to leaves: types change during stream data flow, e.g. a select operator
- Nesting: a join processing the output of another join operator
- Or just to combat partition skew
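Hash partitioning for a parallel equi-join can be sketched as follows (Python analogue of the `k.GetHashCode() % p` scheme; helper names are mine): routing every element of A and B to partition `hash(key) % p` guarantees matching keys land in the same partition, so each partition can run an independent O(n+m) hash join.

```python
def hash_partition(src, key, p):
    # Assign each element to a partition by hashing its join key.
    parts = [[] for _ in range(p)]
    for e in src:
        parts[hash(key(e)) % p].append(e)
    return parts

def hash_join(A_part, B_part, k1, k2):
    table = {}
    for x in A_part:                          # build side: hash table on k1
        table.setdefault(k1(x), []).append(x)
    # probe side: look up each B element's key
    return [(x, y) for y in B_part for x in table.get(k2(y), [])]

A = [("ann", 1), ("bob", 2), ("cat", 3), ("dan", 2)]
B = [(2, "x"), (3, "y")]
p = 4
A_parts = hash_partition(A, lambda x: x[1], p)
B_parts = hash_partition(B, lambda y: y[0], p)
joined = [pair
          for i in range(p)                   # each i could run on its own thread
          for pair in hash_join(A_parts[i], B_parts[i],
                                lambda x: x[1], lambda y: y[0])]
```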
Example: Query Nesting and Fusion

Nesting queries inside of others is common; we can fuse partitions

    var q1 = from x in A select x*2;
    var q2 = q1.Sum();

[Figure: three partitioned executions –
 1. Select (alone): "select x*2" runs per partition;
 2. Sum (alone): per-partition + reductions combined by a final +;
 3. Select + Sum: each partition fuses "select x*2" with its partial sum before the final combine.]
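The fused "Select + Sum" case from the figure can be sketched in Python (illustrative partitioning; PLINQ performs this fusion internally): each partition runs the map and its local sum together, and only the partial sums meet at the combine step.

```python
A = list(range(100))
p = 4
chunks = [A[i::p] for i in range(p)]     # existing partitions of the nested query

# Fused per partition: "select x*2" and the local Sum run in one pass.
partial_sums = [sum(x * 2 for x in chunk) for chunk in chunks]

q2 = sum(partial_sums)                   # final combine of p partial results
```

Fusion means no intermediate doubled stream is ever materialized or repartitioned between the Select and the Sum; only p scalars cross partition boundaries.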
Execution of Work

Windows' finest granularity of work is a thread
- Each partition has at most one thread assigned to it, assigned via a gang-scheduling/dynamic-work-stealing-like (à la Cilk) algorithm
- Tension between creating "just the right number of threads" (static+dynamic adaptivity) versus over-partitioning the work: would change some things, but maybe for the better
  - Hard to predict things like IO and blocking
The developer still has shared memory, and can make horrible mistakes, e.g.:

    int s_x = 0;
    var q = from x in A where x == s_x++;

- Analysis can sometimes catch this, but often not (dynamic function invocation, e.g. where x == side_effecting_func(…))
- C#'s, and generally the CLR's, type system doesn't support the notion of purity (though some research systems, e.g. Spec#, provide hope)
- Where is transactional memory when you need it?
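Even before any threads are involved, the `x == s_x++` mistake can be demonstrated deterministically (Python sketch of the hazard, not the C# code itself): a predicate that mutates shared state gives a different answer each time the supposedly declarative query runs, and under parallel partitioning the interleaving of those mutations becomes unpredictable.

```python
s_x = 0

def impure_pred(x):
    # Hidden mutation of shared state, like "x == s_x++" in the slide.
    global s_x
    result = (x == s_x)
    s_x += 1
    return result

A = [0, 1, 5, 3]
first_run = [x for x in A if impure_pred(x)]    # s_x advances as the query runs
second_run = [x for x in A if impure_pred(x)]   # same query text, different result
```

Because the type system cannot mark `impure_pred` as pure, a query engine has no safe way to know whether re-running, reordering, or partitioning this predicate changes the program's meaning.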
Some Conclusions & Observations

Results have been encouraging: about what you'd expect given prior related research
- Good performance, few changes required to the serial programming model
- Given the upcoming public release of LINQ in VS, we hope reach will be good
- Not a silver bullet – just one tool in a developer's belt
Hard to "catch up" to huge parallelism constants on Windows, particularly given small data inputs and/or inexpensive operators

    var q = Range(0,100).Sum(); // add up #s [0,100)

- Also easy to run into memory bottlenecks; possible opportunities for architecture-aware optimizations (we already try to maximize spatial+temporal locality)
Costs are hard to get right
- Too much dynamism in the platform to arrive at a correct #
- Even if we did, it's hard to create heuristics that scale well across platforms
- Too much decomposition, too little, unexpected IO (paging, …), synchronization
- But in the end: do costs really matter? Or is it better to represent concurrency using a fixed granule and let another scheduling mechanism apply policy (work stealing)?
Many queries are candidates for SIMD/vector architectures
- Targeting other instruction sets (SSEx, GPU) could be profitable
The End

No paper yet – tentative plans for '07
Public release dates TBD; for more information, watch: http://www.bluebytesoftware.com/blog/

Thanks for coming …