
On the Factor Refinement Principle and its Implementation on Multicore Architectures

Masters Public Lecture

Presented By: Md. Mohsin Ali
Supervisor: Dr. Marc Moreno Maza

Dept. of Computer Science (ORCCA Lab)
The University of Western Ontario, London, ON, Canada

December 15, 2011

1 / 65

Factor Refinement

Serial Algorithms
- Approach based on the naive refinement
- Approach based on augment refinement
- Approach based on subproduct trees

Motivation

Implementation Challenges on Multicore Architectures

Contribution

Proposed Parallel Algorithms
- A d-n-c illustration
- Parallel algorithms based on the naive refinement
- Parallel approach based on the augment refinement

Parallel Algorithm Based on Subproduct Tree

Conclusion

2 / 65

Factor Refinement I

Factor Refinement

3 / 65

Factor Refinement I

Definition
- Let $D$ be a UFD and $m_1, m_2, \ldots, m_r$ be elements of $D$.
- Let $m$ be the product of $m_1, m_2, \ldots, m_r$.
- We say that elements $n_1, n_2, \ldots, n_s$ of $D$ form a GCD-free basis whenever $\gcd(n_i, n_j) = 1$ for all $1 \le i < j \le s$.
- Let $e_1, e_2, \ldots, e_s$ be positive integers.
- We say that the pairs $(n_1, e_1), (n_2, e_2), \ldots, (n_s, e_s)$ form a refinement of $m_1, m_2, \ldots, m_r$ if the following three conditions hold:
  (i) $n_1, n_2, \ldots, n_s$ is a GCD-free basis,
  (ii) for every $1 \le i \le r$ there exist non-negative integers $f_1, \ldots, f_s$ such that $\prod_{1 \le j \le s} n_j^{f_j} = m_i$,
  (iii) $\prod_{1 \le i \le s} n_i^{e_i} = m$.

When this holds, we also say that $(n_1, e_1), (n_2, e_2), \ldots, (n_s, e_s)$ is a coprime factorization of $m$.

4 / 65

Factor Refinement II

Example

Let $m_1 = 30$, $m_2 = 42$ and their product $m = 1260$. Then

(i) $5^1, 6^2, 7^1$ is a refinement of 30 and 42,

(ii) 5, 6, 7 is a GCD-free basis of 30 and 42,

(iii) $5^1, 6^2, 7^1$ is a coprime factorization of 1260.
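
The definition can be checked mechanically on this example. The following is a minimal sketch for positive integers only; the function names are illustrative and do not come from the lecture.

```python
from itertools import combinations
from math import gcd, prod


def is_gcd_free_basis(ns):
    """Condition (i): every two distinct elements are coprime."""
    return all(gcd(a, b) == 1 for a, b in combinations(ns, 2))


def is_power_product(value, ns):
    """Check value == prod(n ** f) for some non-negative exponents f.

    Assumes ns is pairwise coprime, so each n can be divided out
    independently of the others.
    """
    for n in ns:
        while n > 1 and value % n == 0:
            value //= n
    return value == 1


def is_refinement(pairs, ms):
    """Check the three conditions of the definition for integer inputs."""
    ns = [n for n, _ in pairs]
    return (
        is_gcd_free_basis(ns)
        and all(is_power_product(m_i, ns) for m_i in ms)
        and prod(n ** e for n, e in pairs) == prod(ms)
    )


print(is_refinement([(5, 1), (6, 2), (7, 1)], [30, 42]))
```

Dropping the exponent of 6 to 1 breaks the product condition, and replacing the basis by 10, 6, 21 breaks the GCD-free condition.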

5 / 65

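Factor Refinement III
Applications

- Simplifying systems of polynomial equations and inequations:
  (i) $\{ab \neq 0,\ bc \neq 0,\ ca \neq 0\} \Longrightarrow \{a \neq 0,\ b \neq 0,\ c \neq 0\}$,
  (ii) Below, $\{A, B, C, D, E, F, G\}$ can be seen as a GCD-free basis of $\{S_1, S_2, S_3\}$:
  [Figure: $S_1$, $S_2$, $S_3$ on the left $\Longrightarrow$ the same $S_1$, $S_2$, $S_3$ decomposed into the regions A, B, C, D, E, F, G on the right.]
- consolidation of independent factorizations,
- etc.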

6 / 65

Serial Algorithms:

Approach based on the naive refinement and quadratic arithmetic

7 / 65

Approach based on the naive refinement I

Idea from Bach, Driscoll, and Shallit in 1990 [BDS90].

- Given a partial factorization of an integer $m$, say $m = m_1 m_2$, we compute $d = \gcd(m_1, m_2)$ and write
  $m = (m_1/d)\,(d^2)\,(m_2/d)$.
- This process is continued until all the factors are pairwise coprime.
- This is also used for the general case of more than two inputs, say $m = m_1 m_2 \cdots m_\ell$.

Algebraic complexity

If $m = m_1 m_2 \cdots m_\ell$, then this algorithm takes $O(\mathrm{size}(m)^3)$ bit operations, where

$\mathrm{size}(m) = \begin{cases} 1 & \text{if } m = 0 \\ 1 + \lfloor \log_2 |m| \rfloor & \text{if } m > 0. \end{cases}$
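
The splitting step above can be sketched for integers as follows. This is an illustration of the idea only, not the procedure of [BDS90] as published: it rescans all pairs after every split, and the names `find_noncoprime_pair` and `naive_refine` are made up for this sketch. Exponents are tracked so that $d^2$ is stored as the pair (d, 2) rather than as two copies of d.

```python
from math import gcd


def find_noncoprime_pair(pairs):
    """Return indices (i, j) of two bases with a nontrivial GCD, or None."""
    for i in range(len(pairs)):
        for j in range(i + 1, len(pairs)):
            if gcd(pairs[i][0], pairs[j][0]) > 1:
                return i, j
    return None


def naive_refine(ms):
    """Refine positive integers into pairwise coprime (base, exponent) pairs."""
    pairs = [(m, 1) for m in ms if m > 1]
    hit = find_noncoprime_pair(pairs)
    while hit is not None:
        i, j = hit
        (a, e), (b, f) = pairs[i], pairs[j]
        d = gcd(a, b)
        # a^e * b^f = (a/d)^e * d^(e+f) * (b/d)^f
        split = [(a // d, e), (d, e + f), (b // d, f)]
        rest = [p for k, p in enumerate(pairs) if k != i and k != j]
        pairs = rest + [(n, x) for n, x in split if n > 1]
        hit = find_noncoprime_pair(pairs)
    return sorted(pairs)


print(naive_refine([30, 42]))
```

Each split divides the product of the bases by d > 1, so the loop terminates.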

8 / 65

Serial Algorithms:

Approach based on the augment refinement and quadratic arithmetic

9 / 65

Approach based on augment refinement and quadratic arithmetic I

Again from Bach, Driscoll, and Shallit in 1990 [BDS90].

- The basic idea is the same as before, but the computations are organized more precisely, leading to an improved complexity [BDS90].
- The trick is to keep track of the pairs $(n_j, n_k)$ in an ordered pair list such that only elements adjacent in the list can have a nontrivial GCD.

If $m = m_1 m_2 \cdots m_\ell$, then this algorithm takes $O(\mathrm{size}(m)^2)$ bit operations, where

$\mathrm{size}(m) = \begin{cases} 1 & \text{if } m = 0 \\ 1 + \lfloor \log_2 |m| \rfloor & \text{if } m > 0. \end{cases}$

10 / 65

Serial Algorithms:

Approach based on subproduct trees

11 / 65

Approach based on subproduct trees I

Idea of Asymptotically Fast Algorithm for GCD-free Basis from Dahan, Moreno Maza, Schost, and Xie in 2005 [DMS+05].

- Divide the input into sub-problems until a base case is reached,
- Conquer the sub-problems from leaves to the root applying fast arithmetic based on subproduct trees (described later).

The total number of field operations of this algorithm is $O(M(d) \log_2^3 d)$, where
- $d$ is the sum of the degrees of the input polynomials,
- $M(d)$ is a multiplication time of two univariate polynomials of degree less than $d$.

12 / 65

Motivation I

Parallel Computation of the Minimal Elements of a Poset

- by Leiserson, Li, Moreno Maza, and Xie in 2010 [ELLMX10].
- This is a multithreaded (fork-join parallelism) approach which is divide-and-conquer, free of data races, and inspired by parallel merge sort.
- Its Cilk++ implementation shows nearly linear speed-up on 32-core processors for sufficiently large input data sets.

This work led us to the design and implementation of parallel factor refinement algorithms.

13 / 65

Implementation Challenges on Multicore Architectures

14 / 65

Multithreaded Parallelism on Multicore Architectures I

Multicore architectures

- A multi-core processor is a single computing component with two or more independent and tightly coupled processors, called cores, sharing memory.
- They also share the same bus and memory controller; thus memory bandwidth may limit performance.
- In order to maintain memory consistency, synchronization is needed between cores, which may also limit performance.

15 / 65

Multithreaded Parallelism on Multicore Architectures II

Fork-join parallelism

- This model represents the execution of a multithreaded program as a set of nonblocking threads denoted by the vertices of a dag, where the dag edges indicate dependencies between instructions.
- Assuming unit cost of execution for all threads, the number of vertices of the dag is the work (= running time on a single core).
- The maximum length of a path from the root to a leaf is the span (= running time on ∞ processors).
- The parallelism is the ratio of work to span (= average amount of work along the span).
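
These three quantities can be computed directly on a small dag. The sketch below assumes unit-cost vertices and measures path length in vertices; the dictionary encoding of the dag and the vertex names are choices made for this illustration.

```python
from functools import lru_cache


def work_span_parallelism(dag):
    """dag maps each vertex to the list of vertices that depend on it."""
    work = len(dag)  # unit cost per vertex

    @lru_cache(maxsize=None)
    def longest_path_from(v):
        # Number of vertices on the longest path starting at v.
        return 1 + max((longest_path_from(w) for w in dag[v]), default=0)

    span = max(longest_path_from(v) for v in dag)
    return work, span, work / span


# One fork followed by one join: two strands can run concurrently.
fork_join = {
    "fork": ["left", "right"],
    "left": ["join"],
    "right": ["join"],
    "join": [],
}
print(work_span_parallelism(fork_join))
```

On this dag the work is 4 and the span is 3, so at most 4/3 of a speedup can be expected no matter how many cores are available.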

16 / 65

The Ideal-cache Model I

[Figure: a CPU performing work W, connected to a cache of Z/L cache lines, each of length L, organized by an optimal replacement strategy and backed by main memory; Q counts the cache misses.]

Figure 1: The ideal-cache model.

17 / 65

The Ideal-cache Model II

- The processor can only refer to words that reside in the cache memory, which is a small and fast access memory, containing Z words organized in cache lines of L words each.
- If the referenced line of a word is not in cache, the corresponding line needs to be brought from the main memory. This is a cache miss. If the cache is full, a cache line must be evicted.
- Cache complexity analyzes algorithms in terms of cache misses.

18 / 65

From Cilk to Cilk++ I

The language

- Cilk (resp. Cilk++) is an extension of C (resp. C++) implementing the fork-join parallelism with two keywords, spawn and sync.
- A Cilk (resp. Cilk++) program has the same semantics as its C (resp. C++) elision.

Performance of the work-stealing scheduler

In theory, the scheduler of Cilk (resp. Cilk++) executes any Cilk++ computation in a nearly optimal time on p processors, provided that

- for almost all parallel steps, there are at least p units of work which can be run concurrently,
- each processor is either working or stealing work,
- each thread executes in unit time.

19 / 65

Parallelization overheads I

Overheads and burden

- In practice, the observed speedup factor may be less (sometimes much less) than the theoretical parallelism.
- Many factors explain that fact: simplifying assumptions of the fork-join parallelism model, architecture limitations, costs of executing the parallel constructs, overheads of scheduling.

Parallelism vs. burdened parallelism

- Cilkview is a performance analyzer which calculates the work, the span, and the parallelism of a given Cilk++ program run.
- Cilkview also estimates the running time $T_p$ on p processors as $T_p = T_1/p + 1.7 \cdot \mathrm{burden\_span}$, where burden_span is 15000 instructions times the number of spawns along the span!
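
As a worked instance of the estimate above, the following sketch evaluates the formula with the constants quoted on the slide. It assumes $T_1$ is measured in instructions; the function names and the sample numbers are hypothetical and are not Cilkview output.

```python
BURDEN_PER_SPAWN = 15000  # instructions charged per spawn along the span (from the slide)


def burden_span(spawns_on_span):
    """Burdened span in instructions, as described on the slide."""
    return BURDEN_PER_SPAWN * spawns_on_span


def estimated_time(t1, p, spawns_on_span):
    """Estimate T_p = T_1 / p + 1.7 * burden_span."""
    return t1 / p + 1.7 * burden_span(spawns_on_span)


# Hypothetical run: 10^9 instructions of work, 100 spawns along the span.
t1 = 1e9
for p in (1, 2, 4, 8, 16):
    tp = estimated_time(t1, p, 100)
    print(p, round(t1 / tp, 2))  # cores, estimated speedup
```

The burden term does not shrink with p, which is why the estimated speedup falls behind the core count as p grows.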

20 / 65

Contribution I

- Parallel algorithm based on the naive refinement principle [NOT GOOD for data locality and thus for parallelism on multicore architectures].
- Parallel algorithm based on the augment refinement principle [GOOD for data locality and parallelism].
- Parallel algorithm based on subproduct tree [MORE CHALLENGING for implementation on multicore architectures].

Principle

All are based on algorithms which are divide-and-conquer (d-n-c), multithreaded, and free of data races.

21 / 65

Proposed Parallel Algorithms

A d-n-c illustration

22 / 65

A d-n-c illustration I

Input:
  2, 6, 7, 10, 15, 21, 22, 26

Expand (done in parallel):
  2, 6, 7, 10 | 15, 21, 22, 26
  2, 6 | 7, 10 | 15, 21 | 22, 26
  2 | 6 | 7 | 10 | 15 | 21 | 22 | 26
  2^1 | 6^1 | 7^1 | 10^1 | 15^1 | 21^1 | 22^1 | 26^1

Merge (done in parallel):
  3^1, 2^2 | 7^1, 10^1 | 5^1, 7^1, 3^2 | 11^1, 13^1, 2^2
  3^1, 7^1, 5^1, 2^3 | 5^1, 7^1, 3^2, 11^1, 13^1, 2^2

Output:
  11^1, 13^1, 3^3, 7^2, 5^2, 2^5

Figure 2: Example of algorithm execution.

23 / 65


Parallel algorithms based on the naive refinement

24 / 65

Parallel algorithms based on the naive refinement I
Expanding input and calling Merge

Algorithm 1: ParallelFactorRefinement(A)

Input: Array of square-free positive integers A = m_1, m_2, ..., m_k.
Output: Two arrays of positive integers N = n_1, n_2, ..., n_s and E = e_1, e_2, ..., e_s such that (n_1, e_1), (n_2, e_2), ..., (n_s, e_s) is a refinement of m_1, m_2, ..., m_k. Moreover, n_1, n_2, ..., n_s are square-free.

if k = 1 then
    return [ A ], [ 1 ];
else
    Divide array A into two parts called A1 and A2;
    f1 ← spawn ParallelFactorRefinement(A1);
    f2 ← spawn ParallelFactorRefinement(A2);
    sync;
    return MergeRefinement(f1, f2);

25 / 65

Parallel algorithms based on the naive refinement II
Calling all pairs of GCD and separating factors with those GCDs.

Algorithm 2: MergeRefinement(A, E, B, F)

Input: Arrays of square-free positive integers A = l_1, ..., l_k, E = e_1, ..., e_k, B = m_1, ..., m_r and F = f_1, ..., f_r, where A, E (resp. B, F) is regarded as a factor refinement (l_1, e_1), ..., (l_k, e_k) (resp. (m_1, f_1), ..., (m_r, f_r)).
Output: A factor refinement of the concatenation of A, E and B, F.

 1: Allocate G and H as (k × r) row-major arrays, A′ a k-array and B′ an r-array;
 2: parallel for (i, j) ∈ {1, ..., k} × {1, ..., r} do
        H_{i,j} ← 0;
 3: G ← GcdOfAllPairs(A, B);
 4: parallel for i = 1 to k do
 5:     p_i ← ∏_{1≤j≤r} G_{i,j};
 6:     A′_i ← l_i / p_i;
 7:     for j = 1 to r do
 8:         if G_{i,j} ≠ 1 then
 9:             H_{i,j} ← H_{i,j} + e_i;
10: For each column, do similar work, from Line 4 to 9;
11: Write the non-one entries of A′ (for each row), B′ (for each column), and G to C;
12: Write the corresponding exponents from E, F, and H to D;
13: return C, D;
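
A serial Python sketch of Algorithms 1 and 2 on integers may help in following the data flow. It is an illustration under simplifying assumptions, not the Cilk++ implementation from the lecture: the spawn/sync calls become plain recursion, the exponent array H is folded into the final loop, and all inputs are assumed square-free and greater than 1.

```python
from math import gcd, prod


def merge_refinement(A, E, B, F):
    """Merge two refinements (A, E) and (B, F) via all pairwise GCDs."""
    k, r = len(A), len(B)
    G = [[gcd(a, b) for b in B] for a in A]
    C, D = [], []
    for i in range(k):  # what is left of each l_i (row pass)
        rest = A[i] // prod(G[i])
        if rest != 1:
            C.append(rest)
            D.append(E[i])
    for j in range(r):  # what is left of each m_j (column pass)
        rest = B[j] // prod(G[i][j] for i in range(k))
        if rest != 1:
            C.append(rest)
            D.append(F[j])
    for i in range(k):  # the common parts, with summed exponents
        for j in range(r):
            if G[i][j] != 1:
                C.append(G[i][j])
                D.append(E[i] + F[j])
    return C, D


def factor_refinement(A):
    """Divide-and-conquer driver in the shape of Algorithm 1 (serial here)."""
    if len(A) == 1:
        return [A[0]], [1]
    mid = len(A) // 2
    C1, D1 = factor_refinement(A[:mid])
    C2, D2 = factor_refinement(A[mid:])
    return merge_refinement(C1, D1, C2, D2)


bases, exps = factor_refinement([2, 6, 7, 10, 15, 21, 22, 26])
print(dict(zip(bases, exps)))
```

On the input of Figure 2 this yields the same refinement as the figure: 2^5, 3^3, 5^2, 7^2, 11^1, 13^1.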

26 / 65

Parallel algorithms based on the naive refinement III

Algorithm 3: GcdOfAllPairsInner(A, B, G)

Input: 1-D arrays of positive integers A = l_1, l_2, ..., l_k, B = m_1, m_2, ..., m_r and a 2-D array G = [G_{i,j} | 1 ≤ i ≤ k, 1 ≤ j ≤ r].
Output: All possible pairs of gcd(l_i, m_j) for 1 ≤ i ≤ k, 1 ≤ j ≤ r, with gcd(l_i, m_j) stored in G_{i,j}.
Comment: C is a global variable equal to a base-case threshold, say 16. Assume C ≥ 2.

if k ≤ C and r ≤ C then
    for (i, j) ∈ {1, ..., k} × {1, ..., r} do
        G_{i,j} ← gcd(A_i, B_j);   [Place of INEFFICIENCY for cache complexity]
else if k > C and r ≤ C then
    Divide A into two halves A1 = l_1, ..., l_{k/2}, A2 = l_{k/2+1}, ..., l_k;
    GcdOfAllPairsInner(A1, B, G);
    GcdOfAllPairsInner(A2, B, G);
else if k ≤ C and r > C then
    Divide B into two halves B1 = m_1, ..., m_{r/2}, B2 = m_{r/2+1}, ..., m_r;
    GcdOfAllPairsInner(A, B1, G);
    GcdOfAllPairsInner(A, B2, G);
else
    Divide A and B arrays into two halves
        A1 = l_1, ..., l_{k/2}, A2 = l_{k/2+1}, ..., l_k; B1 = m_1, ..., m_{r/2}, B2 = m_{r/2+1}, ..., m_r;
    spawn GcdOfAllPairsInner(A1, B1, G);
    spawn GcdOfAllPairsInner(A2, B2, G);
    spawn GcdOfAllPairsInner(A1, B2, G);
    spawn GcdOfAllPairsInner(A2, B1, G);
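
The recursive blocking of Algorithm 3 can be sketched serially in Python. One assumption is made explicit here that the pseudocode leaves implicit: each call receives index ranges [i0, i1) and [j0, j1) so that every sub-problem writes into its own block of the shared array G. The four quadrant calls touch disjoint blocks, which is what makes them safe to spawn in parallel.

```python
from math import gcd

BASE = 16  # base-case threshold C from the slide


def gcd_of_all_pairs_inner(A, B, G, i0, i1, j0, j1):
    """Fill G[i][j] = gcd(A[i], B[j]) for i0 <= i < i1 and j0 <= j < j1."""
    k, r = i1 - i0, j1 - j0
    if k <= BASE and r <= BASE:
        for i in range(i0, i1):
            for j in range(j0, j1):
                G[i][j] = gcd(A[i], B[j])
    elif r <= BASE:  # only A is large: halve the rows
        im = (i0 + i1) // 2
        gcd_of_all_pairs_inner(A, B, G, i0, im, j0, j1)
        gcd_of_all_pairs_inner(A, B, G, im, i1, j0, j1)
    elif k <= BASE:  # only B is large: halve the columns
        jm = (j0 + j1) // 2
        gcd_of_all_pairs_inner(A, B, G, i0, i1, j0, jm)
        gcd_of_all_pairs_inner(A, B, G, i0, i1, jm, j1)
    else:  # both large: four independent quadrants (spawned in Cilk++)
        im, jm = (i0 + i1) // 2, (j0 + j1) // 2
        gcd_of_all_pairs_inner(A, B, G, i0, im, j0, jm)
        gcd_of_all_pairs_inner(A, B, G, im, i1, jm, j1)
        gcd_of_all_pairs_inner(A, B, G, i0, im, jm, j1)
        gcd_of_all_pairs_inner(A, B, G, im, i1, j0, jm)


def gcd_of_all_pairs(A, B):
    """Allocate G and compute all pairwise GCDs of A and B."""
    G = [[0] * len(B) for _ in A]
    gcd_of_all_pairs_inner(A, B, G, 0, len(A), 0, len(B))
    return G
```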

27 / 65

Parallel algorithms based on the naive refinement IV

Proposition: work, span and parallelism

On input data of size $n$ for Algorithm 1, under the condition that each arithmetic operation (integer division and integer GCD computation) has a unit cost,

- Work $= O(n^2)$,
- Span $= O(n)$, and
- Parallelism $= O(n)$. GOOD! :-)

28 / 65

Parallel algorithms based on the naive refinement V

Proposition: cache complexity

Under the ideal cache of Z words, with L words per cache line (model of Frigo, Leiserson, Prokop, and Ramachandran, 1999), with C small enough, for any input of size $n$, the number of cache misses of Algorithm 3 is

$Q(n) \in O(n^2/L + n^2/Z + n^2/Z^2)$, BAD! :-(

under the assumption that
- there exists a positive integer w such that each integer coefficient of each input array A or B in Algorithms 1, 2 is stored in w consecutive machine words, and
- each input array A or B is packed, that is, its successive elements occupy consecutive memory slots.

29 / 65

Experimental results I

- The speedup estimates are obtained with the Cilkview feature of Cilk++ (Frigo, Leiserson and Randall, 1998), executing on an Intel(R) Xeon(R) (64 bit) machine with CPU (E7340) speed 2.40GHz and 128.0 GB of RAM, in a 16-core SHARCNET cluster.
- Timings are obtained by running on an Intel(R) Core(TM) i7 (64 bit) machine with CPU (870) speed 2.93GHz and 8.0 GB of RAM, with 8 cores, configured in the ORCCA Lab.
- Timing results are the average of 5 executions.

30 / 65

Experimental results II

[Plot: speedup versus number of cores, both axes from 0 to 16. Curves: ideal speedup (parallelism = 2395.73432), lower performance bound, measured speedup.]

Figure 3: Scalability analysis of the naive refinement based parallel factor refinement algorithm for 4,000 dense square-free univariate polynomials by Cilkview. All are of degree 60 and coefficient operations are taken modulo a small prime.

31 / 65

Experimental results III

[Plot: running time in seconds (0 to 60) versus input size (100 to 600). Curves: Algorithm with Cores = 1, Algorithm with Cores = 8, Maple.]

Figure 4: Running time comparisons of the naive refinement based parallel factor refinement algorithm for dense square-free univariate polynomials. Degree 60 and coefficient operations are taken modulo a small prime.

32 / 65


Parallel approach based on the augment refinement

33 / 65

Parallel approach based on the augment refinement I
Expanding input and calling Merge

Algorithm 4: ParallelFactorRefinementDNC(A)

Input: A sequence of square-free polynomials A = (m_1, m_2, ..., m_k) ∈ K[x].
Output: A sequence of square-free pairwise coprime polynomials N = (n_1, n_2, ..., n_s) ∈ K[x], and a sequence of positive integers E = (e_1, e_2, ..., e_s) such that (n_1, e_1), (n_2, e_2), ..., (n_s, e_s) is a factor refinement of m_1, m_2, ..., m_k.

if k < 2 then
    return (m_1), (1);
else
    Divide A into two sub-sequences called A1 and A2;
    (X1, Y1) ← spawn ParallelFactorRefinementDNC(A1);
    (X2, Y2) ← spawn ParallelFactorRefinementDNC(A2);
    sync;
    return MergeRefinementDNC(X1, Y1, X2, Y2);

34 / 65

Parallel approach based on the augment refinement II
Proceed recursively and calling Merge

Algorithm 5: MergeRefinementDNC(A, E, B, F)

Input: Two sequences of square-free pairwise coprime polynomials A = (a_1, ..., a_n) and B = (b_1, b_2, ..., b_r) ∈ K[x] together with two sequences of positive integers E = (e_1, e_2, ..., e_n) and F = (f_1, f_2, ..., f_r).
Output: A factor refinement of (a_1, e_1), ..., (a_n, e_n), (b_1, f_1), ..., (b_r, f_r).

if n ≤ BASESIZE or r ≤ BASESIZE then
    return MergeRefineTwoSeq(A, E, B, F);
else
    Divide A, E, B, and F into two halves called A1, A2, E1, E2, B1, B2, and F1, F2, respectively;
    (L1, M1, Q1, R1, S1, T1) ← spawn MergeRefinementDNC(A1, E1, B1, F1);
    (L2, M2, Q2, R2, S2, T2) ← spawn MergeRefinementDNC(A2, E2, B2, F2);
    sync;
    (L3, M3, Q3, R3, S3, T3) ← spawn MergeRefinementDNC(L1, M1, S2, T2);
    (L4, M4, Q4, R4, S4, T4) ← spawn MergeRefinementDNC(L2, M2, S1, T1);
    sync;
    return (L3 + L4, M3 + M4, Q1 + Q2 + Q3 + Q4, R1 + R2 + R3 + R4, S3 + S4, T3 + T4);

35 / 65

Parallel approach based on the augment refinement III
Proceed iteratively and calling Merge

Algorithm 6: MergeRefineTwoSeq(A, E, B, F)

Input: Two sequences of square-free pairwise coprime polynomials A = (a_1, ..., a_n) and B = (b_1, b_2, ..., b_r) of K[x] together with two sequences of positive integers E = (e_1, e_2, ..., e_n) and F = (f_1, f_2, ..., f_r).
Output: A factor refinement of (a_1, e_1), ..., (a_n, e_n), (b_1, f_1), ..., (b_r, f_r).

L ← ∅;
M ← ∅;
Q ← ∅;
R ← ∅;
S_0 ← B;
T_0 ← F;
for i from 1 to n do
    (ℓ_i, m_i, Q_i, R_i, S_i, T_i) ← MergeRefinePolySeq(a_i, e_i, S_{i−1}, T_{i−1});
    Q ← Q + Q_i;
    R ← R + R_i;
    if ℓ_i ≠ 1 then
        L ← L + (ℓ_i);
        M ← M + (m_i);
return (L, M, Q, R, S_n, T_n);

36 / 65

Parallel approach based on the augment refinement IV
Proceed iteratively and augment refinement

Algorithm 7: MergeRefinePolySeq(a, e, B, F)

Input: A square-free polynomial a ∈ K[x], a positive integer e, a sequence of square-free pairwise coprime polynomials B = (b_1, b_2, ..., b_n) of K[x] and a sequence of positive integers F = (f_1, f_2, ..., f_n).
Output: A factor refinement of a^e, b_1^{f_1}, ..., b_n^{f_n}.

ℓ_0 ← a;
m_0 ← e;
Q ← ∅;
R ← ∅;
S ← ∅;
T ← ∅;
for i from 1 to n do
    (ℓ_i, m_i, G_i, V_i, d_i, w_i) ← PolyRefine(ℓ_{i−1}, m_{i−1}, b_i, f_i);
    Q ← Q + G_i;
    R ← R + V_i;
    if d_i ≠ 1 then
        S ← S + (d_i);
        T ← T + (w_i);
return (ℓ_n, m_n, Q, R, S, T);

37 / 65

Parallel approach based on the augment refinement V
Refinement of two inputs (Pair-refine)

Algorithm 8: PolyRefine(a, e, b, f)

Input: Two square-free univariate polynomials a, b ∈ K[x] for a field K and two positive integers e, f.
Output: A factor refinement of (a, e), (b, f).

g ← gcd(a, b);
a′ ← a quotient g;
b′ ← b quotient g;
if g = 1 then
    return (a, e, ∅, ∅, b, f);   // Here ∅ designates the empty sequence
else if a = b then
    return (1, 1, (a), (e + f), 1, 1)
else
    (ℓ_1, e_1, G_1, V_1, r_1, f_1) ← PolyRefine(a′, e, g, e + f);
    (ℓ_2, e_2, G_2, V_2, r_2, f_2) ← PolyRefine(r_1, f_1, b′, f);
    if ℓ_2 ≠ 1 then
        G_2 ← G_2 + (ℓ_2);   // Here + designates sequence concatenation
        V_2 ← V_2 + (e_2);
    return (ℓ_1, e_1, G_1 + G_2, V_1 + V_2, r_2, f_2);
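
To see how the six-tuple of Algorithm 8 is assembled, here is a transcription to Python in which positive integers stand in for the polynomials of K[x]. That substitution is an assumption made for this sketch (the lecture states the algorithm for square-free univariate polynomials), and the name `pair_refine` is illustrative.

```python
from math import gcd


def pair_refine(a, e, b, f):
    """Refine (a, e), (b, f) into (left, e_left, mids, mid_exps, right, f_right).

    left and right are what remains of a and b; mids lists the common
    parts found in between, with their exponents in mid_exps.
    """
    g = gcd(a, b)
    if g == 1:
        return a, e, [], [], b, f  # already coprime: nothing in between
    if a == b:
        return 1, 1, [a], [e + f], 1, 1
    a1, b1 = a // g, b // g
    l1, e1, G1, V1, r1, f1 = pair_refine(a1, e, g, e + f)
    l2, e2, G2, V2, r2, f2 = pair_refine(r1, f1, b1, f)
    if l2 != 1:
        G2 = G2 + [l2]  # list concatenation, as "+" in the pseudocode
        V2 = V2 + [e2]
    return l1, e1, G1 + G2, V1 + V2, r2, f2


print(pair_refine(30, 1, 42, 1))
```

For 30 and 42 this gives the left part 5^1, the middle part 6^2 and the right part 7^1, matching the earlier example with m1 = 30 and m2 = 42.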

38 / 65

Parallel approach based on the augment refinement VI

Proposition: work, span and parallelism

On input data of size $n$ for Algorithm 4, under the condition that each arithmetic operation (integer division and integer GCD computation) has a unit cost, where C is the threshold BASESIZE,

- Work $= O(n^2)$,
- Span $= O(Cn)$, and
- Parallelism $= O(n/C)$. NOT BAD! :-)

39 / 65

Parallel approach based on the augment refinement VII

Proposition: cache complexity

Under the ideal cache of Z words, with L words per cache line (model of Frigo, Leiserson, Prokop, and Ramachandran, 1999), with C small enough, for any input of size $n$, the number of cache misses of Algorithm 5 is

$Q(n) = O(n^2/(ZL) + n^2/Z^2)$, GOOD! :-)

under the assumption that
- each input or output sequence in Algorithms 8, 7, 6 or 5 is packed, and
- each element of the field K can be stored in one machine word.

40 / 65

Experimental results I

[Plot: speedup versus number of cores, both axes from 0 to 16. The surviving legend entry is the measured speedup.]

Figure 5: Scalability analysis of the augment refinement based parallel factor refinement algorithm for 200,000 int type inputs by Cilkview.

41 / 65

Experimental results II

[Plot: running time in seconds (5 to 50) versus input size (5000 to 6000). The surviving legend entry is Maple.]

Figure 6: Running time comparisons of the augment refinement based parallel factor refinement algorithm for int type inputs.

42 / 65

Experimental results III

[Plot: speedup versus number of cores, both axes from 0 to 16. The surviving legend entry is the measured speedup.]

Figure 7: Scalability analysis of the augment refinement based parallel factor refinement algorithm for 4,000 dense square-free univariate polynomials by Cilkview. All are of degree 150 and coefficient operations are taken modulo a small prime.

43 / 65

Experimental results IV

[Plot: running time in seconds (0 to 140) versus input size (100 to 600). The surviving legend entry is Maple.]

Figure 8: Running time comparisons of the augment refinement based parallel factor refinement algorithm for dense square-free univariate polynomials. All are of degree 150 and coefficient operations are taken modulo a small prime.

44 / 65

Experimental results V

[Plot: speedup versus number of cores, both axes from 0 to 16. The surviving legend entry is the measured speedup.]

Figure 9: Scalability analysis of the augment based parallel factor refinement algorithm for 4,120 sparse square-free univariate polynomials by Cilkview when the input is already a GCD-free basis. Degree 150 and coefficient operations are taken modulo a small prime.

45 / 65

Experimental results VI

[Plot: running time in seconds (0 to 50) versus input size (200 to 600). The surviving legend entry is Maple.]

Figure 10: Running time comparisons of the augment based parallel factor refinement algorithm for sparse square-free univariate polynomials when the input is already a GCD-free basis. Degree 150 and coefficient operations are taken modulo a small prime.

46 / 65

Parallel Algorithm Based on Subproduct Trees

47 / 65

Subproduct Tree I

- Useful construction to devise fast algorithms with univariate polynomials.
- If $m_0, m_1, \ldots, m_{n-1}$ are monic, non-constant polynomials in K[x], then their subproduct tree is as follows [cost: $O(M(d) \log_2(d))$, $d = \sum_{i=0}^{n-1} \deg(m_i)$]:

  i = k:      m_0 m_1 ... m_{n−1}
  i = k − 1:  m_0 m_1 ... m_{n/2−1}  |  m_{n/2} m_{n/2+1} ... m_{n−1}
  ...
  i = 1:      m_0 m_1  |  m_2 m_3  |  ...  |  m_{n−2} m_{n−1}
  i = 0:      m_0  |  m_1  |  m_2  |  m_3  |  ...  |  m_{n−2}  |  m_{n−1}

Figure 11: Construction of a subproduct tree.
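
The level-by-level construction in Figure 11 can be sketched in a few lines. As a simplifying assumption, integers stand in for the monic polynomials and ordinary multiplication replaces fast polynomial multiplication, so the sketch shows the shape of the tree and not its cost. The function name is illustrative.

```python
def subproduct_tree(leaves):
    """Return the tree as a list of levels.

    levels[0] holds the leaves (i = 0 in Figure 11) and levels[-1]
    holds the single root, the product of all leaves (i = k).
    """
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        nxt = [prev[i] * prev[i + 1] for i in range(0, len(prev) - 1, 2)]
        if len(prev) % 2 == 1:
            nxt.append(prev[-1])  # an unpaired node is carried up unchanged
        levels.append(nxt)
    return levels


print(subproduct_tree([2, 3, 5, 7]))
```

All products within one level are independent of each other, which is what a parallel construction can exploit.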

48 / 65

Fast multiple remainders I

- Let $m_0, m_1, \ldots, m_{n-1}$ be as before and assume their subproduct tree M has been computed.
- To compute f rem m_0, f rem m_1, ..., f rem m_{n−1}, one reduces f by the root of M, then by its children (leading to two remainders), and so on.
- When deg(f) < n, this amounts to $O(M(d) \log_2(d))$ field operations.

  i = k:      m_0 m_1 ... m_{n−1}
  i = k − 1:  m_0 m_1 ... m_{n/2−1}  |  m_{n/2} m_{n/2+1} ... m_{n−1}
  ...
  i = 1:      m_0 m_1  |  m_2 m_3  |  ...  |  m_{n−2} m_{n−1}
  i = 0:      m_0  |  m_1  |  m_2  |  m_3  |  ...  |  m_{n−2}  |  m_{n−1}

Figure 12: Construction of a subproduct tree.

49 / 65

Fast Multiple GCDs I

- Suppose we have a polynomial f ∈ K[x] and a sequence of square-free polynomials A = a_0, a_1, ..., a_{n−1} in K[x] with their subproduct tree.
- We compute multiGcd(f, A), that is, all gcd(f, a_i) for 0 ≤ i ≤ n − 1, as follows:

  root:         a_0 a_1 ... a_{n−1}
                m_{k,0} = f mod (a_0 a_1 ... a_{n−1})
  level k − 1:  a_0 a_1 ... a_{n/2−1}  |  a_{n/2} a_{n/2+1} ... a_{n−1}
                m_{k−1,0} = m_{k,0} mod (a_0 a_1 ... a_{n/2−1})
                m_{k−1,1} = m_{k,0} mod (a_{n/2} a_{n/2+1} ... a_{n−1})
  ...
  level 1:      a_0 a_1  |  a_2 a_3  |  ...  |  a_{n−2} a_{n−1}
                m_{1,0} = m_{2,0} mod (a_0 a_1)
  leaves:       a_0  |  a_1  |  a_2  |  a_3  |  ...  |  a_{n−2}  |  a_{n−1}
                m_{0,0} = m_{1,0} mod a_0, return gcd(m_{0,0}, a_0)
                m_{0,1} = m_{1,0} mod a_1, return gcd(m_{0,1}, a_1)

Figure 13: Multiple GCDs computation.
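
The descent of Figure 13 can be sketched as follows, again with integers standing in for polynomials (an assumption of this sketch). The tree is stored as nested tuples (product, left, right); that encoding and the names `build_tree` and `multi_gcd` are illustrative. The sketch relies on the identity gcd(f mod a, a) = gcd(f, a).

```python
from math import gcd


def build_tree(A):
    """Subproduct tree of A as nested tuples (product, left, right)."""
    if len(A) == 1:
        return (A[0], None, None)
    mid = len(A) // 2
    left, right = build_tree(A[:mid]), build_tree(A[mid:])
    return (left[0] * right[0], left, right)


def multi_gcd(f, tree):
    """Return [gcd(f, a) for each leaf a], reducing f on the way down."""
    value, left, right = tree
    remainder = f % value  # reduce by the product stored at this node
    if left is None:  # leaf: value is one of the a_i
        return [gcd(remainder, value)]
    # The two subtrees are independent and could be processed in parallel.
    return multi_gcd(remainder, left) + multi_gcd(remainder, right)


A = [6, 35, 11, 26]
print(multi_gcd(2210, build_tree(A)))
```

Reducing f at each node keeps the operands small, so the GCDs at the leaves are taken with remainders rather than with f itself.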

50 / 65

Fast All Possible Pairs of GCDs I

- Suppose we have two sequences of square-free polynomials A = a_0, a_1, ..., a_{n−1} and B = b_0, b_1, ..., b_{s−1} in K[x].
- We compute parallelPairsOfGcd(A, B), that is, gcd(a_0, b_0), ..., gcd(a_0, b_{s−1}), ..., gcd(a_{n−1}, b_0), ..., gcd(a_{n−1}, b_{s−1}).
- This is done using the subproduct tree of b_0, b_1, ..., b_{s−1} and calling multiGcd(f, A) at each node.

51 / 65

Merging GCD-free Bases I

- Once we have a routine parallelPairsOfGcd(A, B), we easily derive a routine for merging two GCD-free bases of square-free polynomials.
- Let us call it parallelMergeGcdFreeBases(f1, f2).

52 / 65

Parallel GCD-free Bases I
Expanding input and calling Merge

Algorithm 9: parallelGcdFreeBasis(A)

Input: Sequence of square-free polynomials A = a_1, a_2, ..., a_e.
Output: A GCD-free basis of A.

Build a subproduct tree Sub′ where the root is labeled by the sequence of polynomials A;
for every node N ∈ Sub′, bottom-up, processing all nodes of the same level concurrently do
    if N is not a leaf then
        f1 ← spawn leftChild(N);
        f2 ← spawn rightChild(N);
        sync;
        Label N by parallelMergeGcdFreeBases(f1, f2);
return the label of RootOf(Sub′);

53 / 65

Parallel GCD-free Bases II
Proposition: work, span and parallelism

On input data of size $n$ for Algorithm 9,

- Work $= O(M(d) \log_2^3 d)$, GOOD thanks to fast arithmetic
- Span $= O(\frac{d}{n} \log_2^2(n))$, and
- Parallelism $= O\left(\frac{n \log_2^4 d}{\log_2^2(n)}\right)$, NOT GOOD, because almost similar to QUADRATIC and with higher parallelization overheads

where

(i) $d = \sum_{i=1}^{n} \deg(\mathrm{input}_i)$,

(iii) FFT-based parallel univariate multiplication with $M(d) \in O(d \log_2 d)$,

(iv) computing univariate polynomial GCD in degree $d$ with a work of $O(M(d) \log_2 d)$ and a span of $O(d)$, and

(v) subproduct tree of $n$ arbitrary polynomials needs work of $O(M(d) \log_2 n)$ and span of $O(\log_2^2 d \, \log_2 n)$.

54 / 65

Parallel GCD-free Bases III

Key Observations:

- Better work than that of the quadratic algorithms.
- Almost the same parallelism as that of the quadratic algorithms.
- But implementation on multicore is more challenging, because
  (i) parallelization of polynomial multiplication on multicore is a hard problem, as reported by Chowdhury, Moreno Maza, Pan, and Schost in 2011 [CMPS11] and Moreno Maza and Xie in 2011 [MX11].
  (ii) These parallel multiplications are efficient from degree 100,000 to 1,000,000, which has a negative impact on the span of the subproduct tree.
  (iii) Subproduct tree construction also has
    - a negative impact on memory consumption, and
    - no known cache-friendly algorithm on multicore.

55 / 65

Factor Refinement I

Conclusion

56 / 65

Conclusion I

- We have proposed and analyzed parallel algorithms for factor refinement (and GCD-free basis) computation.
- We have observed that cache complexity has major effects on performance.
- Algorithm 5 with cache complexity $O(n^2/(ZL))$ is more efficient than Algorithm 3 with $O(n^2/L)$, as we verified experimentally.
- Asymptotically fast polynomial arithmetic does not always pay off on multicore architectures.
- Work in progress:
  - Handling multi-precision (= big) integers efficiently.
  - Supporting the multivariate case in the context of polynomial system solving.

57 / 65

Acknowledgement I

I would like to acknowledge Dr. Yuzhen Xie, Dr. Rong Xiao, Dr. Changbo Chen and other ORCCA Lab members for their great help throughout this research work.

58 / 65

Thanks to All!

59 / 65

Work, span and parallelism: naive based principle I

Let $W_1(n)$, $W_2(n)$, $W_3(n)$ (resp. $S_1(n)$, $S_2(n)$, $S_3(n)$) be the work (resp. span) of Algorithms 1, 2 and 3 on input data of order $n$. Thus we have:

$W_1(n) = 2W_1(n/2) + W_2(n)$ and $S_1(n) = S_1(n/2) + S_2(n)$. (1)

$W_2(n) = W_3(n) + O(n^2)$ and $S_2(n) = S_3(n) + O(n)$. (2)

$W_3(n) = 4W_3(n/2) + \Theta(1)$ and $S_3(n) = S_3(n/2) + \Theta(1)$. (3)

From Equation 3, we have:

$W_3(n) = \Theta(n^2)$ and $S_3(n) = \Theta(\log_2(n))$.

From Equation 2, we obtain

$W_2(n) = \Theta(n^2)$ and $S_2(n) = \Theta(n)$.

Finally, from Equation 1, we deduce

$W_1(n) = \Theta(n^2)$ and $S_1(n) = \Theta(n)$.

60 / 65

Cache complexity: naive based principle I

Let $Q(n)$ be the number of cache misses in Algorithm 3. Under the ideal cache model, there exists a constant $\alpha$ such that we have

$Q(n) \le \begin{cases} n^2/L + n & \text{for } n < \alpha Z \\ 4Q(n/2) + \Theta(1) & \text{otherwise.} \end{cases}$

The solution of the above recurrence is as follows:

$Q(n) \le 4Q(n/2) + \Theta(1)$
$\le 4[4Q(n/4) + \Theta(1)] + \Theta(1)$
$\le \cdots$
$\le 4^k Q(n/2^k) + \sum_{j=0}^{k-1} 4^j \Theta(1)$

where $2^k = n/(\alpha Z)$ in the base case. Under this base case,

$Q(n) \le (n/(\alpha Z))^2 \, [(\alpha Z)^2/L + \alpha Z] + \Theta((n/(\alpha Z))^2)$
$\le \Theta(n^2/L + n^2/Z + n^2/Z^2)$.

61 / 65

Work, span and parallelism: augment based principle I

Let us denote by $W_4(n)$, $W_5(n)$, $W_6(n)$, $W_7(n)$, $W_8(n)$ (resp. $S_4(n)$, $S_5(n)$, $S_6(n)$, $S_7(n)$, $S_8(n)$) the work (resp. span) of Algorithms 4, 5, 6, 7 and 8 on input data of order $n$. Thus we have:

$W_4(n) \le 2W_4(n/2) + W_5(n)$ and $S_4(n) \le S_4(n/2) + S_5(n)$. (4)

Algorithm 5 also proceeds in a divide-and-conquer manner, dividing the input data into two parts and performing two recursive calls.

$W_5(n) \le \begin{cases} W_6(n) & \text{for } n < C \\ 4W_5(n/2) + \Theta(1) & \text{otherwise,} \end{cases}$ (5)

and

$S_5(n) \le \begin{cases} S_6(n) & \text{for } n < C \\ 2S_5(n/2) + \Theta(1) & \text{otherwise.} \end{cases}$ (6)

It is observed that $W_6(n)$, $W_7(n)$, $W_8(n)$ fit within $\Theta(n^2)$, $\Theta(n)$, $\Theta(1)$, respectively. Moreover, since Algorithms 6, 7 and 8 are serial, we also have $S_6(n)$, $S_7(n)$, $S_8(n)$ within $\Theta(n^2)$, $\Theta(n)$, $\Theta(1)$, respectively.

Let $k = \lceil \log_2(n/C) \rceil$. Then we have

$W_5(n) \le O(4^k C^2) = O(n^2)$ and $S_5(n) \le O(2^k C^2) = O(Cn)$.

Therefore, from Relation (4), we deduce

$W_4(n) \in O(n^2)$ and $S_4(n) \in O(Cn)$.

62 / 65

Cache complexity: augment based principle I

The recursive structure of Algorithm 5 shows that there exists a positive constant $\alpha$ such that $Q(n)$ satisfies the following relation:

$Q(n) \le \begin{cases} O(n/L + 1) & \text{for } n < \alpha Z \\ 4Q(n/2) + \Theta(1) & \text{otherwise,} \end{cases}$

provided $C < \alpha Z$. The above recurrence leads to the following inequality for all $n \ge 2$:

$Q(n) \le 4Q(n/2) + \Theta(1)$
$\le 4[4Q(n/4) + \Theta(1)] + \Theta(1)$
$\le \cdots$
$\le 4^k Q(n/2^k) + \sum_{j=0}^{k-1} 4^j \Theta(1)$

where $k = \lceil \log_2(n/(\alpha Z)) \rceil$. Since $n/2^k \le \alpha Z$, we deduce:

$Q(n) \le (n/(\alpha Z))^2 \, (\alpha Z/L + 1) + \Theta((n/(\alpha Z))^2)$
$= O(n^2/(ZL) + n^2/Z^2)$.

63 / 65

References I

[BDS90] Eric Bach, James Driscoll, and Jeffrey Shallit. Factor Refinement. In Symposium on Discrete Algorithms, pages 201–211, 1990.

[CMPS11] Muhammad F. I. Chowdhury, Marc Moreno Maza, Wei Pan, and Eric Schost. Complexity and Performance Results for non FFT-based Univariate Polynomial Multiplication, 2011.

[DMS+05] X. Dahan, M. Moreno Maza, E. Schost, W. Wu, and Y. Xie. On the Complexity of the D5 Principle. SIGSAM Bull., 39:97–98, September 2005.

64 / 65

References II

[ELLMX10] Charles E. Leiserson, Liyun Li, M. Moreno Maza, and Yuzhen Xie. Parallel Computation of the Minimal Elements of a Poset. In 4th International Workshop on Parallel and Symbolic Computation, pages 53–62, 2010.

[MX11] Marc Moreno Maza and Yuzhen Xie. Balanced Dense Polynomial Multiplication on Multi-cores. Int. J. Found. Comput. Sci., 22(5):1035–1055, 2011.

65 / 65
