
CSE 548: (Design and) Analysis of Algorithms
Divide-and-conquer Algorithms

R. Sekar


Divide-and-Conquer: A versatile strategy

Steps
1. Break a problem into subproblems that are smaller instances of the same problem
2. Recursively solve these subproblems
3. Combine these answers to obtain the solution to the problem

Benefits
- Conceptual simplification
- Speed up: rapidly (exponentially) reduce the problem space; exploit commonalities in subproblem solutions
- Parallelism: divide-and-conquer algorithms are amenable to parallelization
- Locality: their depth-first nature increases locality, which is extremely important for today's processors


Topics

1. Warmup: Overview, Search, H-Tree, Exponentiation
2. Sorting: Mergesort, Recurrences, Fibonacci Numbers, Quicksort, Lower Bound, Radix sort
3. Selection: Select k-th min, Priority Queues
4. Closest pair
5. Multiplication: Matrix Multiplication, Integer multiplication
6. FFT: Fourier Transform, DFT, FFT Algorithm, Fast multiplication


Binary Search

Problem: Find a key k in an ordered collection.

Examples:
- Sorted array A[n]: compare k with A[n/2], then recursively search in A[0 · · · (n/2 − 1)] (if k < A[n/2]) or A[n/2 · · · n − 1] (otherwise).
- Binary search tree T: compare k with root(T); based on the result, recursively search the left or right subtree of the root.
- B-Tree: a hybrid of the above two. The root stores an array M of m keys and has m + 1 children. Use binary search on M to identify which child can contain k, then recursively search that subtree.
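For the sorted-array case, here is a minimal iterative Python sketch (the function name and 0-based bounds are mine, not the slides'):

```python
def binary_search(A, k):
    """Return an index i with A[i] == k, or None. A must be sorted."""
    lo, hi = 0, len(A) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if A[mid] == k:
            return mid
        elif k < A[mid]:
            hi = mid - 1   # continue in A[lo .. mid-1]
        else:
            lo = mid + 1   # continue in A[mid+1 .. hi]
    return None

assert binary_search([1, 3, 5, 7, 9], 7) == 3
```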


H-tree: Planar embedding of full binary tree

Key properties
- fractal geometry: a divide-and-conquer structure
- n nodes in O(n) area
- all root-to-leaf paths equal in length

Applications
- compact embedding of binary trees in VLSI
- hardware clock distribution


Divide-and-conquer Construction of H-tree

MkHtree(l, b, r, t, n)
  horizLine((b+t)/2, l + (r−l)/4, r − (r−l)/4)
  vertLine(l + (r−l)/4, b + (t−b)/4, t − (t−b)/4)
  vertLine(r − (r−l)/4, b + (t−b)/4, t − (t−b)/4)
  if n ≤ 4 return
  MkHtree(l, (b+t)/2, (l+r)/2, t, n/4)
  MkHtree((l+r)/2, (b+t)/2, r, t, n/4)
  MkHtree(l, b, (l+r)/2, (b+t)/2, n/4)
  MkHtree((l+r)/2, b, r, (b+t)/2, n/4)

Here (l, b) and (r, t) are the corners of the current region; horizLine(y, x1, x2) and vertLine(x, y1, y2) draw line segments.


Divide-and-conquer Construction of H-tree

Questions
- How compact is the embedding? (Ratio of the minimum distance between nodes to the average area per node.)
- What is the root-to-leaf path length?
- Can we do better?
- Finally, how can we show that the algorithm is correct?

(MkHtree as on the previous slide.)


Exponentiation

How many multiplications are required to compute x^n? Can we use a divide-and-conquer approach to make it faster?

ExpBySquaring(n, x)
  if n > 1
    y = ExpBySquaring(⌊n/2⌋, x²)
    if odd(n) y = x ∗ y
    return y
  else return x
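A direct Python transcription of this pseudocode (the function name is mine); it performs O(log n) multiplications:

```python
def exp_by_squaring(n, x):
    """Compute x**n for n >= 1 with O(log n) multiplications."""
    if n > 1:
        y = exp_by_squaring(n // 2, x * x)   # floor(n/2), x squared
        if n % 2 == 1:                       # odd(n): one extra factor of x
            y = x * y
        return y
    else:
        return x

assert exp_by_squaring(10, 3) == 3 ** 10
```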


Merge Sort

Binary search
The ultimate divide-and-conquer algorithm is, of course, binary search: to find a key k in a large file containing keys z[0, 1, . . . , n − 1] in sorted order, we first compare k with z[n/2], and depending on the result we recurse either on the first half of the file, z[0, . . . , n/2 − 1], or on the second half, z[n/2, . . . , n − 1]. The recurrence now is T(n) = T(n/2) + O(1), which is the case a = 1, b = 2, d = 0. Plugging into our master theorem we get the familiar solution: a running time of just O(log n).

2.3 Mergesort
The problem of sorting a list of numbers lends itself immediately to a divide-and-conquer strategy: split the list into two halves, recursively sort each half, and then merge the two sorted sublists.

function mergesort(a[1 . . . n])
Input: An array of numbers a[1 . . . n]
Output: A sorted version of this array
  if n > 1:
    return merge(mergesort(a[1 . . . n/2]), mergesort(a[n/2 + 1 . . . n]))
  else:
    return a

The correctness of this algorithm is self-evident, as long as a correct merge subroutine is specified. If we are given two sorted arrays x[1 . . . k] and y[1 . . . l], how do we efficiently merge them into a single sorted array z[1 . . . k + l]? Well, the very first element of z is either x[1] or y[1], whichever is smaller. The rest of z[·] can then be constructed recursively.

function merge(x[1 . . . k], y[1 . . . l])
  if k = 0: return y[1 . . . l]
  if l = 0: return x[1 . . . k]
  if x[1] ≤ y[1]:
    return x[1] ◦ merge(x[2 . . . k], y[1 . . . l])
  else:
    return y[1] ◦ merge(x[1 . . . k], y[2 . . . l])

Here ◦ denotes concatenation. This merge procedure does a constant amount of work per recursive call (provided the required array space is allocated in advance), for a total running time of O(k + l). Thus merges are linear, and the overall time taken by mergesort is

T(n) = 2T(n/2) + O(n),

or O(n log n).

Looking back at the mergesort algorithm, we see that all the real work is done in merging, which doesn't start until the recursion gets down to singleton arrays.
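A runnable Python version of the above (names are mine); merge is written iteratively, which keeps the O(k + l) bound without deep recursion:

```python
def merge(x, y):
    """Merge two sorted lists in O(len(x) + len(y)) time."""
    z, i, j = [], 0, 0
    while i < len(x) and j < len(y):
        if x[i] <= y[j]:
            z.append(x[i]); i += 1
        else:
            z.append(y[j]); j += 1
    return z + x[i:] + y[j:]   # at least one of the tails is empty

def mergesort(a):
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    return merge(mergesort(a[:mid]), mergesort(a[mid:]))

assert mergesort([10, 2, 5, 3, 13, 7, 1, 6]) == [1, 2, 3, 5, 6, 7, 10, 13]
```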


Merge Sort Illustration

Figure 2.4. The sequence of merge operations in mergesort, starting from singletons and merging in pairs (final output: 1 2 3 5 6 7 10 13).

The singletons are merged in pairs, to yield arrays with two elements. Then pairs of these 2-tuples are merged, producing 4-tuples, and so on. Figure 2.4 shows an example.

This viewpoint also suggests how mergesort might be made iterative. At any given moment, there is a set of "active" arrays (initially, the singletons) which are merged in pairs to give the next batch of active arrays. These arrays can be organized in a queue, and processed by repeatedly removing two arrays from the front of the queue, merging them, and putting the result at the end of the queue.

In the following pseudocode, the primitive operation inject adds an element to the end of the queue while eject removes and returns the element at the front of the queue.

function iterative-mergesort(a[1 . . . n])
Input: elements a1, a2, . . . , an to be sorted
  Q = [ ] (empty queue)
  for i = 1 to n:
    inject(Q, [ai])
  while |Q| > 1:
    inject(Q, merge(eject(Q), eject(Q)))
  return eject(Q)
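With collections.deque supplying inject/eject as append/popleft, the queue-based version is a direct transcription (this sketch reuses merge from the previous code block):

```python
from collections import deque

def iterative_mergesort(a):
    if not a:
        return []
    Q = deque([x] for x in a)   # start from singleton arrays
    while len(Q) > 1:
        Q.append(merge(Q.popleft(), Q.popleft()))   # merge the front two
    return Q.popleft()

assert iterative_mergesort([10, 2, 5, 3, 13, 7, 1, 6]) == [1, 2, 3, 5, 6, 7, 10, 13]
```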


Merge sort time complexity

mergesort(A) makes two recursive invocations of itself, each on an array half the size of A. merge(A, B) takes time that is linear in |A| + |B|. Thus, the runtime is given by the recurrence

T(n) = 2T(n/2) + n

In divide-and-conquer algorithms, we often encounter recurrences of the form

T(n) = aT(n/b) + O(n^d)

Can we solve them once for all?


Master Theorem

If T(n) = aT(n/b) + O(n^d) for constants a > 0, b > 1, and d ≥ 0, then

T(n) = O(n^d)            if d > log_b a
T(n) = O(n^d log n)      if d = log_b a
T(n) = O(n^{log_b a})    if d < log_b a
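For instance: binary search has a = 1, b = 2, d = 0, so d = log_b a and T(n) = O(log n); mergesort has a = 2, b = 2, d = 1, so again d = log_b a and T(n) = O(n log n); the naive recursive matrix multiplication later in these notes has a = 8, b = 2, d = 2 < log_2 8 = 3, giving O(n³), while Strassen's a = 7 gives O(n^{log_2 7}) ≈ O(n^{2.81}).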


Proof of Master Theorem

Figure 2.3. Each problem of size n is divided into a subproblems of size n/b: the recursion tree has branching factor a and depth log_b n, and its width at the leaves is a^{log_b n} = n^{log_b a}.

At depth k of the tree there are a^k subproblems, each of size n/b^k (Figure 2.3). The total work done at this level is

a^k × O((n/b^k)^d) = O(n^d) × (a/b^d)^k.

As k goes from 0 (the root) to log_b n (the leaves), these numbers form a geometric series with ratio a/b^d. Finding the sum of such a series in big-O notation is easy (Exercise 0.2), and comes down to three cases.

1. The ratio is less than 1. Then the series is decreasing, and its sum is just given by its first term, O(n^d).
2. The ratio is greater than 1. The series is increasing and its sum is given by its last term, O(n^{log_b a}):

   n^d (a/b^d)^{log_b n} = n^d (a^{log_b n} / (b^{log_b n})^d) = a^{log_b n} = a^{(log_a n)(log_b a)} = n^{log_b a}.

3. The ratio is exactly 1. In this case all O(log n) terms of the series are equal to O(n^d).

These cases translate directly into the three contingencies in the theorem statement.

The theorem can be proved by induction, or by summing up the series in which each term represents the work done at one level of this tree.


What if the Master Theorem can't be applied?

Guess and check (prove by induction)
- expand the recursion for a few steps to make a guess
- in principle, can be applied to any recurrence

Akra-Bazzi method (not covered in class)
- handles recurrences that are much more complex than those of the Master theorem


More on time complexity: Fibonacci Numbers

function fib1(int n)
  if n = 0 return 0;
  if n = 1 return 1;
  return fib1(n − 1) + fib1(n − 2)

Is this algorithm correct? Yes: it follows the definition of the Fibonacci numbers.

What is its runtime? T(n) = T(n − 1) + T(n − 2) + 3, with T(k) ≤ 2 for k < 2. The solution is an exponential function. (Prove this by induction!)

Can we do better?


Structure of calls to fib1

Figure 0.1. The proliferation of recursive calls in fib1: Fn calls Fn−1 and Fn−2, which in turn call Fn−2, Fn−3, Fn−3, Fn−4, and so on, recomputing the same values over and over.

As with fib1, the correctness of this algorithm is self-evident because it directly uses the definition of Fn. How long does it take? The inner loop consists of a single computer step and is executed n − 1 times. Therefore the number of computer steps used by fib2 is linear in n. From exponential we are down to polynomial, a huge breakthrough in running time. It is now perfectly reasonable to compute F200 or even F200,000.¹ As we will see repeatedly throughout this book, the right algorithm makes all the difference.

More careful analysis
In our discussion so far, we have been counting the number of basic computer steps executed by each algorithm and thinking of these basic steps as taking a constant amount of time. This is a very useful simplification. After all, a processor's instruction set has a variety of basic primitives (branching, storing to memory, comparing numbers, simple arithmetic, and so on) and rather than distinguishing between these elementary operations, it is far more convenient to lump them together into one category. But looking back at our treatment of Fibonacci algorithms, we have been too liberal with what we consider a basic step. It is reasonable to treat addition as a single computer step if small numbers are being added, 32-bit numbers say. But the nth Fibonacci number is about 0.694n bits long, and this can far exceed 32 as n grows. Arithmetic operations on arbitrarily large numbers cannot possibly be performed in a single, constant-time step. We need to audit our earlier running time estimates and make them more honest. We will see in Chapter 1 that the addition of two n-bit numbers takes time roughly proportional to n; this is not too hard to understand if you think back to the grade-school procedure.

¹To better appreciate the importance of this dichotomy between exponential and polynomial algorithms, the reader may want to peek ahead to the story of Sissa and Moore, in Chapter 8.

The call tree is (nearly) a complete binary tree of depth n, containing about 2^n calls to fib1. But there are only n distinct Fibonacci numbers being computed! Each Fibonacci number is thus computed an exponential number of times.


Improved Algorithm for Fibonacci

function fib2(n)
  int f[max(2, n + 1)];
  f[0] = 0; f[1] = 1;
  for (i = 2; i ≤ n; i++)
    f[i] = f[i − 1] + f[i − 2];
  return f[n]

A linear-time algorithm! But wait: we are operating on very large numbers.
- The nth Fibonacci number requires approx. 0.694n bits. (Prove this by induction!)
- An operation on k-bit numbers requires O(k) steps.
That is, computing Fn takes n additions on numbers of up to 0.694n bits, or O(n²) bit operations in total.
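Python's built-in big integers make fib2 directly runnable; the O(n)-bit cost of each addition is simply hidden inside the + operator (sketch, my naming):

```python
def fib2(n):
    """Iterative Fibonacci: n - 1 big-integer additions."""
    f = [0, 1] + [0] * max(0, n - 1)
    for i in range(2, n + 1):
        f[i] = f[i - 1] + f[i - 2]
    return f[n]

assert fib2(10) == 55
assert fib2(20) == 6765   # the numbers (and their bit lengths) grow fast
```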


Quicksort

qs(A, l, h)  /* sorts A[l . . . h] */
  if l ≥ h return;
  (h1, l2) = partition(A, l, h);
  qs(A, l, h1);
  qs(A, l2, h)

partition(A, l, h)
  k = selectPivot(A, l, h); p = A[k];
  swap(A, h, k);
  i = l − 1; j = h;
  while true do
    do i++ while A[i] < p;
    do j−− while A[j] > p;
    if i ≥ j break;
    swap(A, i, j);
  swap(A, i, h);
  return (j, i + 1)
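A runnable Python translation of qs/partition; here selectPivot takes the middle element, and a small guard keeps j from running past l (both are my choices, not fixed by the slide):

```python
def partition(A, l, h):
    """Partition A[l..h] around a pivot parked at A[h].
    Returns (h1, l2): recurse on A[l..h1] and A[l2..h]."""
    k = (l + h) // 2                      # selectPivot: middle element
    p = A[k]
    A[h], A[k] = A[k], A[h]               # park the pivot at A[h]
    i, j = l - 1, h
    while True:
        i += 1
        while A[i] < p:                   # do i++ while A[i] < p
            i += 1
        j -= 1
        while j > l and A[j] > p:         # do j-- while A[j] > p
            j -= 1
        if i >= j:
            break
        A[i], A[j] = A[j], A[i]
    A[i], A[h] = A[h], A[i]               # pivot moves to its final position
    return j, i + 1

def qs(A, l, h):
    if l >= h:
        return
    h1, l2 = partition(A, l, h)
    qs(A, l, h1)
    qs(A, l2, h)

a = [10, 2, 5, 3, 13, 7, 1, 6]
qs(a, 0, len(a) - 1)
assert a == [1, 2, 3, 5, 6, 7, 10, 13]
```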


Analysis of Runtime of qs

General case: given by the recurrence T(n) = n + T(n1) + T(n2), where n1 and n2 are the sizes of the two sub-arrays after partition.

Best case: n1 = n2 = n/2. By the master theorem, T(n) = O(n log n).

Worst case: n1 = 1, n2 = n − 1, i.e., T(n) = n + T(n − 1), which solves to T(n) = O(n²). A fixed choice of pivot index, say h, leads to this worst-case behavior on common inputs, e.g., when the input is already sorted.

Lucky/unlucky split: alternate between best- and worst-case splits:
T(n) = n + T(1) + T(n − 1)                  (worst-case split)
     = n + 1 + (n − 1) + 2T((n − 1)/2)      (best-case split on the n − 1 part)
     = 2n + 2T((n − 1)/2)
which has an O(n log n) solution.

Three-fourths split: T(n) = n + T(0.25n) + T(0.75n) = O(n log n), since each level of the recursion tree does at most n work and the depth is log_{4/3} n.


Average case analysis of qs

Define the input distribution: all permutations equally likely.

Simplifying assumption: all elements are distinct. (A nonessential assumption.)

Set up the recurrence: when all permutations are equally likely, the selected pivot has an equal chance of ending up at the ith position in the sorted order, for all 1 ≤ i ≤ n. Thus, we have the following recurrence for the average case:

T(n) = n + (1/n) Σ_{i=1}^{n−1} (T(i) + T(n − i))

Solve the recurrence: we cannot apply the master theorem, but since we seem to get an O(n log n) bound even in seemingly bad cases, we can try to establish a c·n log n bound via induction.


Establishing average case of qs

Establish the base case. (Trivial.)

The induction step involves a summation of the form Σ_{i=1}^{n−1} i log i.
- Attempt 1: bound log i above by log n. (The induction fails.)
- Attempt 2: split the sum into two parts,

  Σ_{i=1}^{n/2} i log i + Σ_{i=n/2+1}^{n−1} i log i,

  and apply the approximation to each half. (Succeeds with c ≥ 4.)
- Attempt 3: replace the summation with the upper bound

  ∫_{x=1}^{n} x log x dx = [ (x²/2)(log x − 1/2) ]_{x=1}^{n}.

  (Succeeds with the constraint c ≥ 2.)


Randomized Quicksort

Randomized quicksort picks a pivot at random. What is its complexity?

For randomized algorithms, we talk about expected complexity, which is an average over all possible values of the random variable. If the pivot index is picked uniformly at random over the interval [l, h], then:
- every array element is equally likely to be selected as the pivot
- every partition is equally likely
Thus, the expected complexity of randomized quicksort is given by the same recurrence as the average case of qs.


Lower bounds for comparison-based sorting

Sorting algorithms can be depicted as trees: each leaf identifies the input permutation that yields a sorted order.

An n log n lower bound for sorting
Sorting algorithms can be depicted as trees. The one in the following figure sorts an array of three elements, a1, a2, a3. It starts by comparing a1 to a2 and, if the first is larger, compares it with a3; otherwise it compares a2 and a3. And so on. Eventually we end up at a leaf, and this leaf is labeled with the true order of the three elements as a permutation of 1, 2, 3. For example, if a2 < a1 < a3, we get the leaf labeled "2 1 3."

(Figure: the comparison tree for three elements. Internal nodes ask a1 < a2?, a2 < a3?, a1 < a3?, with yes/no branches; the leaves are labeled with the permutations 1 2 3, 1 3 2, 2 1 3, 2 3 1, 3 1 2, 3 2 1.)

The depth of the tree (the number of comparisons on the longest path from root to leaf, in this case 3) is exactly the worst-case time complexity of the algorithm. This way of looking at sorting algorithms is useful because it allows one to argue that mergesort is optimal, in the sense that Ω(n log n) comparisons are necessary for sorting n elements.

Here is the argument: consider any such tree that sorts an array of n elements. Each of its leaves is labeled by a permutation of 1, 2, . . . , n. In fact, every permutation must appear as the label of a leaf. The reason is simple: if a particular permutation is missing, what happens if we feed the algorithm an input ordered according to this same permutation? And since there are n! permutations of n elements, it follows that the tree has at least n! leaves.

We are almost done: this is a binary tree, and we argued that it has at least n! leaves. Recall now that a binary tree of depth d has at most 2^d leaves (proof: an easy induction on d). So the depth of our tree, and hence the complexity of our algorithm, must be at least log(n!).

And it is well known that log(n!) ≥ c · n log n for some c > 0. There are many ways to see this. The easiest is to notice that n! ≥ (n/2)^{n/2}, because n! = 1 · 2 · · · n contains at least n/2 factors larger than n/2, and to then take logs of both sides. Another is to recall Stirling's formula

n! ≈ √(π(2n + 1/3)) · n^n · e^{−n}.

Either way, we have established that any comparison tree that sorts n elements must make, in the worst case, Ω(n log n) comparisons, and hence mergesort is optimal!

Well, there is some fine print: this neat argument applies only to algorithms that use comparisons. Is it conceivable that there are alternative sorting strategies, perhaps using sophisticated numerical manipulations, that work in linear time? The answer is yes, under certain exceptional circumstances: the canonical such example is when the elements to be sorted are integers that lie in a small range (Exercise 2.20).

The tree has n! leaves, and hence a height of log n!. By Stirling's approximation, n! ≈ √(2πn) (n/e)^n, so log n! = Θ(n log n). No comparison-based sorting algorithm can do better!


Bucket sort

Overview
- Divide: partition the input into intervals (buckets), based on key values; a linear scan of the input drops each element into the appropriate bucket
- Recurse: sort each bucket
- Combine: concatenate the bucket contents

(Example figure from Wikipedia Commons.)


Bucket sort (Continued)

- Bucket sort generalizes quicksort to multiple partitions
- Combination = concatenation
- The worst-case quadratic bound applies
- But performance can be much better if the input distribution is uniform. (Exercise: what is the runtime in this case?)
- Used by letter-sorting machines in post offices


Counting Sort

- A special case of bucket sort where each bin corresponds to an interval of size 1.
- No need to recurse. Divide = conquered!
- Makes sense only if the range of key values is small (usually constant).
- Thus, counting sort can be done in O(n) time!
- Hmm. How did we beat the O(n log n) lower bound?
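A minimal sketch for integer keys known to lie in range(m):

```python
def counting_sort(a, m):
    """Sort integers drawn from range(m) in O(n + m) time."""
    count = [0] * m
    for x in a:                 # one linear scan: drop into bins
        count[x] += 1
    out = []
    for v in range(m):          # concatenate the bins
        out.extend([v] * count[v])
    return out

assert counting_sort([3, 1, 0, 3, 2], 4) == [0, 1, 2, 3, 3]
```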


Radix Sorting

- Treat an integer as a sequence of digits
- Sort the digits using counting sort
- LSD sorting: sort first on the least significant digit, and on the most significant digit last. After each round of counting sort, the results can simply be concatenated and given as input to the next stage.
- MSD sorting: sort first on the most significant digit, and on the least significant digit last. Unlike LSD sorting, we cannot concatenate after each stage.
- Note: radix sort does not divide the input into smaller subsets. If you think of the input as multi-dimensional data, then we break the problem down by dimension. (An LSD sketch follows below.)
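A sketch of LSD sorting for non-negative integers in base 10; each round is a stable bucketing pass (a counting sort per digit), which is what makes concatenating between rounds legal:

```python
def lsd_radix_sort(a, base=10):
    """Stable LSD radix sort of non-negative integers."""
    if not a:
        return a
    place, biggest = 1, max(a)
    while place <= biggest:
        buckets = [[] for _ in range(base)]
        for v in a:                              # stable: keeps current order
            buckets[(v // place) % base].append(v)
        a = [v for b in buckets for v in b]      # concatenate the buckets
        place *= base
    return a

assert lsd_radix_sort([170, 45, 75, 90, 802, 24, 2, 66]) == \
    [2, 24, 45, 66, 75, 90, 170, 802]
```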


Stable sorting algorithms

- Stable sorting algorithms don't change the order of equal elements.
- Merge sort and LSD sort are stable. Quicksort is not stable.
- Why is stability important? The effect of sorting on attribute A and then on B is the same as sorting on ⟨B, A⟩. LSD sort won't work without this property!
- Other examples: sorting spreadsheets, or tables on web pages.

(Images from Wikipedia Commons.)


Sorting strings

- Can use LSD or MSD sorting. Easy if all strings are of the same length; variable-length strings require a bit more care.
- Starting point: use a special terminator character t with t < a for every valid character a.
- It is easy to devise an O(nl) algorithm, where n is the number of strings and l is the maximum length of any string. But such an algorithm is not linear in the input size.
- Exercise: devise a linear-time string sorting algorithm. Given a set S of strings, your algorithm should sort in O(|S|) time, where |S| = Σ_{s∈S} |s|.


Select kth largest element

Obvious approach: sort, then pick the kth element. Wasteful: O(n log n).

Better approach: recursive partitioning, searching only on one side.

qsel(A, l, h, k)
  if l = h return A[l];
  (h1, l2) = partition(A, l, h);
  if k ≤ h1
    return qsel(A, l, h1, k)
  else return qsel(A, l2, h, k)

Complexity
- Best case: splits are even: T(n) = n + T(n/2), which has an O(n) solution.
- Skewed 10%/90% split: T(n) ≤ n + T(0.9n), still linear.
- Worst case: T(n) = n + T(n − 1), quadratic!
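A sketch of qsel in Python, using a list-based three-way split instead of the in-place partition above (k is a 0-based rank; the k-th largest is symmetric):

```python
import random

def qsel(A, k):
    """Return the k-th smallest (0-based) element of A; expected O(n)."""
    p = random.choice(A)                    # random pivot
    lt = [x for x in A if x < p]            # one linear partitioning pass
    eq = [x for x in A if x == p]
    gt = [x for x in A if x > p]
    if k < len(lt):
        return qsel(lt, k)                  # answer lies left of the pivot
    if k < len(lt) + len(eq):
        return p                            # the pivot itself is the answer
    return qsel(gt, k - len(lt) - len(eq))  # search right, rank shifted

a = [10, 2, 5, 3, 13, 7, 1, 6]
assert [qsel(a, k) for k in range(len(a))] == sorted(a)
```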


Worst-case O(n) Selection

Intuition: spend a bit more time to select a pivot that ensures reasonably balanced partitions.

The MoM algorithm [Blum, Floyd, Pratt, Rivest and Tarjan 1973]


O(n) Selection: MoM Algorithm

Quick select (qsel) takes no time to pick a pivot, but then spends O(n) to partition. Can we spend more time upfront to make a better selection of the pivot, so that we can avoid highly skewed splits?

Key Idea: use the selection algorithm itself to choose the pivot.
- Divide the input into sets of 5 elements.
- Compute the median of each set (O(5), i.e., constant time per set).
- Use selection recursively on these n/5 medians to pick their median, i.e., choose the median of medians (MoM) as the pivot.
- Partition using MoM, and recurse to find the kth element. (A sketch follows below.)
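A sketch of the resulting deterministic select (list-based partition for clarity; a short final group is handled by the slicing):

```python
def mom_select(A, k):
    """k-th smallest (0-based) via median of medians; worst-case O(n)."""
    if len(A) <= 5:
        return sorted(A)[k]
    groups = [A[i:i + 5] for i in range(0, len(A), 5)]
    medians = [sorted(g)[len(g) // 2] for g in groups]   # O(1) per group
    p = mom_select(medians, len(medians) // 2)           # recursive MoM pivot
    lt = [x for x in A if x < p]
    eq = [x for x in A if x == p]
    gt = [x for x in A if x > p]
    if k < len(lt):
        return mom_select(lt, k)
    if k < len(lt) + len(eq):
        return p
    return mom_select(gt, k - len(lt) - len(eq))

assert mom_select(list(range(100, 0, -1)), 9) == 10
```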


O(n) Selection: MoM Algorithm

Theorem: a MoM-based split won't be worse than 30%/70%.

Result: a guaranteed linear-time algorithm!

Caveat: the constant factor is non-negligible; use it as a fall-back if random pivot selection repeatedly yields unbalanced splits.


Selecting maximum element: Priority Queues

Heap: a tree-based data structure for priority queues.

Heap property: H is a heap if for every subtree h of H,
  ∀k ∈ keys(h): root(h) ≥ k,
where keys(h) includes all keys appearing within h.

Note: no ordering among siblings or cousins.

Supports insert, deleteMax and max. Typically implemented using arrays, i.e., without an explicit tree data structure.

The task of maintaining the max is distributed over subsets of the entire set; alternatively, it can be thought of as maintaining several parallel queues with a single head.

(Images from Wikimedia Commons.)


Binary heap

Array representation: store the heap elements in breadth-first order in the array; node i's children are at indices 2i and 2i + 1. Conceptually, we are dealing with a balanced binary tree.

Max: the element at the root of the array, extracted in O(1) time.

DeleteMax: overwrite the root with the last element of the heap, then fix up the heap. Takes O(log n) time, since only one root-to-leaf path needs to be fixed up.

Insert: append the element to the end of the array, then fix up the heap. Takes O(log n) time, since only the ancestors of the last node need to be fixed up.

MkHeap: fix up the entire heap. Takes O(n) time.

Heapsort: an O(n log n) algorithm: MkHeap followed by n calls to DeleteMax. (A sketch follows below.)
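A compact max-heap sketch over a Python list, with a dummy slot so that node i's children sit at 2i and 2i + 1 as above:

```python
class MaxHeap:
    def __init__(self, items=()):
        self.a = [None] + list(items)
        for i in range(len(self.a) // 2, 0, -1):   # MkHeap: O(n) total
            self._sift_down(i)

    def _sift_down(self, i):                       # fix one root-to-leaf path
        a, n = self.a, len(self.a) - 1
        while 2 * i <= n:
            c = 2 * i
            if c + 1 <= n and a[c + 1] > a[c]:     # pick the larger child
                c += 1
            if a[i] >= a[c]:
                break
            a[i], a[c] = a[c], a[i]
            i = c

    def max(self):                                 # O(1)
        return self.a[1]

    def delete_max(self):                          # O(log n)
        a = self.a
        top = a[1]
        a[1] = a[-1]
        a.pop()
        if len(a) > 1:
            self._sift_down(1)
        return top

    def insert(self, x):                           # O(log n): fix up ancestors
        a = self.a
        a.append(x)
        i = len(a) - 1
        while i > 1 and a[i // 2] < a[i]:
            a[i], a[i // 2] = a[i // 2], a[i]
            i //= 2

h = MaxHeap([3, 9, 1, 7])
assert [h.delete_max() for _ in range(4)] == [9, 7, 3, 1]   # heapsort, in effect
```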


Finding closest pair of points

Problem: given a set of n points in d-dimensional space, identify the two that have the smallest Euclidean distance between them.

Applications: a central problem in graphics, vision, air-traffic control, navigation, molecular modeling, and so on.

(Images from Wikipedia Commons.)


Divide-and-conquer closest pair (2D)

Divide: identify k such that the line x = k divides the points evenly. (A median computation; takes O(n) time.)

Recursive case: find the closest pair in each half.

Combine:
- We can't just take the min of the closest pairs from the two halves.
- We need to consider pairs that straddle the dividing line, and it seems that this will take O(n²) time!


Speeding up search for cross-region pairs

Observation (Key Observation 1)
Let δ1 and δ2 be the minimum distances within each half. We need only consider points within δ = min(δ1, δ2) of the dividing line.

We expect that only a small number of points will be within such a narrow strip. But in the worst case, every point could be within the strip!


Sparsity condition

Consider a point p in the left δ-strip. How many points q1, . . . , qr in the right δ-strip could be within δ of p?

Observation (Key Observation 2)
- q1, . . . , qr must all lie within a 2δ × δ rectangle, as shown in the figure.
- r can't be too large: otherwise some of q1, . . . , qr would crowd together, closer than δ to each other.
- Theorem: r ≤ 6.

So we need to consider at most 6n cross-region pairs! The count remains O(n) in higher dimensions as well.


Closest pair: Summary

Recurrence: T(n) = 2T(n/2) + Ω(n), since the median computation alone is linear-time. Thus, T(n) = Ω(n log n).

To get to O(n log n), we need to:
1. Compute the δ-strip in O(n) time: keep the points in each region sorted on the x-dimension. This takes an additional O(n log n) time, with no change to the overall complexity.
2. Compute q1, . . . , q6 in O(1) time: keep the points in each region sorted also on the y-dimension, maintain this order while deleting points outside the δ-strip, and, in this list, for each p consider only 12 neighbors, 6 on each side of the divide. (A sketch follows below.)
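A sketch of the whole scheme. For simplicity the δ-strip is re-sorted by y inside each call, which gives O(n log² n) rather than O(n log n); maintaining a y-sorted list through the recursion, as described above, removes the extra log factor:

```python
from math import dist, inf   # math.dist needs Python 3.8+

def closest_pair(points):
    """Smallest pairwise Euclidean distance among 2D points."""
    def rec(px):                           # px: points sorted by x
        n = len(px)
        if n <= 3:                         # brute-force base case
            return min((dist(p, q) for i, p in enumerate(px)
                        for q in px[i + 1:]), default=inf)
        mid = n // 2
        xm = px[mid][0]                    # dividing line x = xm
        d = min(rec(px[:mid]), rec(px[mid:]))
        strip = sorted((p for p in px if abs(p[0] - xm) < d),
                       key=lambda p: p[1])         # delta-strip, in y order
        for i, p in enumerate(strip):
            for q in strip[i + 1:i + 8]:   # only O(1) successors can be < d
                d = min(d, dist(p, q))
        return d

    return rec(sorted(points))

pts = [(0, 0), (3, 4), (1, 1), (7, 7), (0.5, 0.9)]
assert abs(closest_pair(pts) - dist((1, 1), (0.5, 0.9))) < 1e-12
```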


Matrix Multiplication

The product Z of two n × n matrices X and Y is given by

Z_ij = Σ_{k=1}^{n} X_ik Y_kj

which leads to an O(n³) algorithm.

This follows by taking expected values of both sides of the following statement:

Time taken on an array of size n
  ≤ (time taken on an array of size 3n/4) + (time to reduce array size to ≤ 3n/4),

and, for the right-hand side, using the familiar property that the expectation of the sum is the sum of the expectations. From this recurrence we conclude that T(n) = O(n): on any input, our algorithm returns the correct answer after a linear number of steps, on the average.

The Unix sort command
Comparing the algorithms for sorting and median-finding we notice that, beyond the common divide-and-conquer philosophy and structure, they are exact opposites. Mergesort splits the array in two in the most convenient way (first half, second half), without any regard to the magnitudes of the elements in each half; but then it works hard to put the sorted subarrays together. In contrast, the median algorithm is careful about its splitting (smaller numbers first, then the larger ones), but its work ends with the recursive call. Quicksort is a sorting algorithm that splits the array in exactly the same way as the median algorithm; and once the subarrays are sorted, by two recursive calls, there is nothing more to do. Its worst-case performance is Θ(n²), like that of median-finding. But it can be proved (Exercise 2.24) that its average case is O(n log n); furthermore, empirically it outperforms other sorting algorithms. This has made quicksort a favorite in many applications; for instance, it is the basis of the code by which really enormous files are sorted.

2.5 Matrix multiplication
The product of two n × n matrices X and Y is a third n × n matrix Z = XY, with (i, j)th entry

Z_ij = Σ_{k=1}^{n} X_ik Y_kj.

To make it more visual, Z_ij is the dot product of the ith row of X with the jth column of Y. In general, XY is not the same as YX; matrix multiplication is not commutative. The preceding formula implies an O(n³) algorithm for matrix multiplication: there are n² entries to be computed, and each takes O(n) time. For quite a while, this was widely believed to be the best possible.


Divide-and-conquer Matrix Multiplication

Divide X and Y into four n/2 × n/2 submatrices:

X = | A  B |     Y = | E  F |
    | C  D |         | G  H |

Recursively invoke matrix multiplication on these submatrices:

XY = | A  B | | E  F |  =  | AE + BG   AF + BH |
     | C  D | | G  H |     | CE + DG   CF + DH |

Divided, but did not conquer! T(n) = 8T(n/2) + O(n²), which is still O(n³).


Strassen’s Matrix Multiplication

Strassen showed that 7 multiplications are enough:

XY = | P5 + P4 − P2 + P6    P1 + P2           |
     | P3 + P4              P1 + P5 − P3 − P7 |

where

P1 = A(F − H)        P5 = (A + D)(E + H)
P2 = (A + B)H        P6 = (B − D)(G + H)
P3 = (C + D)E        P7 = (A − C)(E + F)
P4 = D(G − E)

Now the recurrence T(n) = 7T(n/2) + O(n²) has an O(n^{log_2 7}) = O(n^{2.81}) solution!

The best complexity to date is about O(n^{2.4}), but that algorithm is not very practical.
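A sketch on nested lists for n a power of two, recursing with exactly the seven products above (base case n = 1; a practical version would cut over to the classical algorithm for small n):

```python
def mat_add(X, Y, sign=1):
    """Entry-wise X + sign * Y."""
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def strassen(X, Y):
    """Multiply n x n matrices (n a power of 2) with 7 recursive products."""
    n = len(X)
    if n == 1:
        return [[X[0][0] * Y[0][0]]]
    m = n // 2
    quad = lambda M, r, c: [row[c:c + m] for row in M[r:r + m]]
    A, B, C, D = quad(X, 0, 0), quad(X, 0, m), quad(X, m, 0), quad(X, m, m)
    E, F, G, H = quad(Y, 0, 0), quad(Y, 0, m), quad(Y, m, 0), quad(Y, m, m)
    P1 = strassen(A, mat_add(F, H, -1))               # A(F - H)
    P2 = strassen(mat_add(A, B), H)                   # (A + B)H
    P3 = strassen(mat_add(C, D), E)                   # (C + D)E
    P4 = strassen(D, mat_add(G, E, -1))               # D(G - E)
    P5 = strassen(mat_add(A, D), mat_add(E, H))       # (A + D)(E + H)
    P6 = strassen(mat_add(B, D, -1), mat_add(G, H))   # (B - D)(G + H)
    P7 = strassen(mat_add(A, C, -1), mat_add(E, F))   # (A - C)(E + F)
    Z11 = mat_add(mat_add(P5, P4), mat_add(P6, P2, -1))
    Z12 = mat_add(P1, P2)
    Z21 = mat_add(P3, P4)
    Z22 = mat_add(mat_add(P1, P5), mat_add(P3, P7), -1)
    return ([r1 + r2 for r1, r2 in zip(Z11, Z12)] +
            [r1 + r2 for r1, r2 in zip(Z21, Z22)])

assert strassen([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]
```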


Karatsuba's Algorithm
Same high-level strategy as Strassen, but it predates Strassen.

Divide n-digit (here, n-bit) numbers into halves, each with n/2 digits:

a = a1 a0 = 2^{n/2} a1 + a0
b = b1 b0 = 2^{n/2} b1 + b0
ab = 2^n a1 b1 + 2^{n/2} (a1 b0 + b1 a0) + a0 b0

Key point: instead of 4 multiplications, we can get by with 3, since

a1 b0 + b1 a0 = (a1 + a0)(b1 + b0) − a1 b1 − a0 b0

Recursively compute a1 b1, a0 b0 and (a1 + a0)(b1 + b0).

The recurrence T(n) = 3T(n/2) + O(n) has an O(n^{log_2 3}) = O(n^{1.59}) solution!

Note: this trick for using 3 (not 4) multiplications was noted by Gauss (1777-1855) in the context of complex numbers.
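A sketch on Python's big integers, splitting on bit length (the base-case cutoff of 16 is arbitrary):

```python
def karatsuba(a, b):
    """Multiply non-negative integers using 3 half-size products."""
    if a < 16 or b < 16:                           # small factors: just multiply
        return a * b
    half = max(a.bit_length(), b.bit_length()) // 2
    a1, a0 = a >> half, a & ((1 << half) - 1)      # a = a1 * 2^half + a0
    b1, b0 = b >> half, b & ((1 << half) - 1)
    hi = karatsuba(a1, b1)                         # a1*b1
    lo = karatsuba(a0, b0)                         # a0*b0
    mid = karatsuba(a1 + a0, b1 + b0) - hi - lo    # a1*b0 + b1*a0, one product
    return (hi << (2 * half)) + (mid << half) + lo

assert karatsuba(12345678, 87654321) == 12345678 * 87654321
```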


Toom-Cook Multiplication

Generalizes Karatsuba: divide into N > 2 parts. It is more easily understood when integer multiplication is viewed as polynomial multiplication.


Integer Multiplication Revisited

An integer represented using digits a_{n−1} . . . a_0 over a base d (i.e., 0 ≤ a_i < d) is very similar to the polynomial

A(x) = Σ_{i=0}^{n−1} a_i x^i

Specifically, the value of the integer is A(d). Integer multiplication follows the same steps as polynomial multiplication:

a_{n−1} . . . a_0 × b_{n−1} . . . b_0 = (A(x) × B(x))(d)


Polynomials: Basic Properties

Horner's rule
An nth degree polynomial Σ_{i=0}^{n} a_i x^i can be evaluated in O(n) time:

((· · · ((a_n x + a_{n−1}) x + a_{n−2}) x + · · · + a_1) x + a_0)
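In Python (a[i] is the coefficient of x^i, as above):

```python
def horner(a, x):
    """Evaluate sum(a[i] * x**i) using n multiplications."""
    result = 0
    for coeff in reversed(a):         # ((a_n x + a_{n-1}) x + ...) x + a_0
        result = result * x + coeff
    return result

assert horner([1, 0, 2], 3) == 2 * 3 ** 2 + 1   # 2x^2 + 1 at x = 3
```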

Roots and Interpolation
An nth degree polynomial A(x) has exactly n roots r_1, . . . , r_n. In general, the r_i are complex and need not be distinct. A(x) can be written as a product of linear factors involving these roots:

A(x) = Σ_{i=0}^{n} a_i x^i = a_n Π_{i=1}^{n} (x − r_i)

Alternatively, A(x) can be specified uniquely by giving n + 1 points (x_i, y_i) on it, i.e., with A(x_i) = y_i.


Operations on Polynomials

Representation   Add    Mult
Coefficients     O(n)   O(n²)
Roots            ?      O(n)
Points           O(n)   O(n)

Note: the point representation is the best for computation! But usually only the coefficients are given.

Solution: convert to point form by evaluating A(x) at selected points. But conversion defeats the purpose: it requires O(n) evaluations, each taking O(n) time, so we are back to O(n²) total time.

Toom (and FFT) idea: choose the evaluation points judiciously to speed up evaluation.


Matrix representation of Polynomial Evaluation

Given a polynomial

A(x) = Σ_{t=0}^{n−1} a_t x^t

choose m + 1 points x_0, . . . , x_m for its evaluation. The evaluation can be expressed using matrix multiplication:

| p_0 |   | 1  x_0  x_0²  · · ·  x_0^{n−1} | | a_0     |
| p_1 |   | 1  x_1  x_1²  · · ·  x_1^{n−1} | | a_1     |
| p_2 | = | 1  x_2  x_2²  · · ·  x_2^{n−1} | |  ...    |
| ... |   |          ...                   | | a_{n−1} |
| p_m |   | 1  x_m  x_m²  · · ·  x_m^{n−1} |


Multiplication using Point Representation

Let A(x) and B(x) be the polynomials representing two numbers.

Evaluate both polynomials at the chosen points x_0, . . . , x_m:

P = XA    Q = XB

where P, X, A, Q and B denote the matrices of the last page.

Compute the point-wise product:

| r_0 |   | p_0 ∗ q_0 |
| ... | = |    ...    |
| r_m |   | p_m ∗ q_m |

Compute the polynomial C corresponding to R:

R = XC  ⇒  C = X^{−1} R

To avoid losing information, the number of evaluation points must be at least degree(A) + degree(B) + 1, the number of coefficients of the product R.


Improving complexity ...

Key problem: the complexity of computing X and its inverse X^{−1}.

Toom strategy:
- Use low-degree polynomials, e.g., Toom-2 = Karatsuba uses degree 1: it represents an n-bit number as a 2-digit number over the large base d = 2^{n/2}.
- Fix the evaluation points for a given degree, so that X and X^{−1} can be precomputed. For Toom-2: x_0 = 0, x_1 = 1, x_2 = ∞. (Define A(∞) = a_{n−1}, the leading coefficient.)
- Choose the points so that expensive multiplications can be avoided while computing P = XA, Q = XB and C = X^{−1}R.

Toom-N on n-digit numbers needs 2N − 1 multiplications of n/N-digit numbers:

T(n) = (2N − 1) T(n/N) + O(n)

which, by the Master theorem, has an O(n^{log_N (2N−1)}) solution.


Karatsuba revisited as Toom-2

Given the evaluation points x_0 = 0, x_1 = 1, x_2 = ∞:

X = | 1 0 0 |
    | 1 1 1 |
    | 0 1 0 |

XA = | 1 0 0 | | a_0 |   | a_0       |
     | 1 1 1 | | a_1 | = | a_0 + a_1 |
     | 0 1 0 | |  0  |   | a_1       |

Similarly,

XB = | b_0       |
     | b_0 + b_1 |
     | b_1       |

Point-wise multiplication yields:

R = | a_0 b_0                |
    | (a_0 + a_1)(b_0 + b_1) |
    | a_1 b_1                |

and so on ...


Limitations of Toom

In principle, the complexity can be reduced to n^{1+ε} for arbitrarily small positive ε by increasing N. In reality, the algorithm itself depends on the choice of N; specifically, the constant factors involved increase rapidly with N. As a practical matter, N = 4 or 5 is where we stop.

Question: can we go farther?


FFT and Schonhage-Strassen

Key idea: evaluate the polynomial on the complex plane. Choose the powers of the Nth complex root of unity as the evaluation points. This enables sharing of operations in computing XA, so that it can be done in O(N log N) time, rather than the O(N²) time needed for the naive matrix-multiplication approach.


FFT to the Rescue!

Matrix form of the DFT and its interpretation as polynomial evaluation:

| s_0     |   | 1  1        1           · · ·  1               | | a_0     |
| s_1     |   | 1  ω        ω²          · · ·  ω^{N−1}         | | a_1     |
|  ...    | = |                  ...                           | |  ...    |
| s_j     |   | 1  ω^j      ω^{2j}      · · ·  ω^{j(N−1)}      | | a_j     |
|  ...    |   |                  ...                           | |  ...    |
| s_{N−1} |   | 1  ω^{N−1}  ω^{2(N−1)}  · · ·  ω^{(N−1)(N−1)}  | | a_{N−1} |

Voila! The FFT computes A(x) at the N points x_i = ω^i in O(N log N) time!

O(N log N) integer multiplication:
- Convert to the point representation using the FFT: O(N log N)
- Multiply in the point representation: O(N)
- Convert back to coefficients using FFT^{−1}: O(N log N)
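As a quick illustration (not the slides' code), numpy's FFT realizes this recipe for polynomial multiplication; coefficients are listed low-order first, and rounding recovers the integer results:

```python
import numpy as np

def poly_mult(a, b):
    """Multiply coefficient vectors a and b via the FFT, O(N log N)."""
    N = 1
    while N < len(a) + len(b) - 1:   # enough points for the product
        N *= 2
    fa = np.fft.fft(a, N)            # evaluate at the N roots of unity
    fb = np.fft.fft(b, N)
    c = np.fft.ifft(fa * fb)         # point-wise multiply, then interpolate
    return [round(v.real) for v in c[:len(a) + len(b) - 1]]

# (2 + x)(4 + 3x) = 8 + 10x + 3x^2
assert poly_mult([2, 1], [4, 3]) == [8, 10, 3]
```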


FFT to the Rescue!

(The same DFT matrix as on the previous slide, with points s_0, . . . , s_{n−1} and coefficients a_0, . . . , a_{n−1}.)

The FFT can be thought of as a clever way to choose the evaluation points: evaluations at many distinct points "collapse" together. This is why we are left with 2T(n/2) work after division, instead of 4T(n/2) for a naive choice of points.
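A minimal radix-2 FFT sketch of this collapse (n a power of two): the even- and odd-indexed coefficient halves are evaluated recursively at the (n/2)nd roots of unity, then combined using the ± pairing. Here ω = e^{2πi/n}, the evaluation convention; a DFT proper uses the conjugate root:

```python
from cmath import exp, pi

def fft(a):
    """Evaluate polynomial a (low-order first) at all n nth roots of unity."""
    n = len(a)
    if n == 1:
        return a
    even = fft(a[0::2])                        # A_e at the (n/2)nd roots
    odd = fft(a[1::2])                         # A_o at the (n/2)nd roots
    out = [0] * n
    for k in range(n // 2):
        w = exp(2j * pi * k / n)               # omega^k
        out[k] = even[k] + w * odd[k]          # A(omega^k)
        out[k + n // 2] = even[k] - w * odd[k] # A(-omega^k): the paired point
    return out

vals = fft([1, 2, 3, 4])
assert abs(vals[0] - 10) < 1e-9                # A(1) = 1 + 2 + 3 + 4
```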


FFT-based multiplication: More careful analysis ...

Computations on complex or real numbers can lose precision. For integer operations, we should work in some other ring; usually, we choose a ring based on modular arithmetic. Ex: in mod 33 arithmetic, 2 is a 10th root of 1, i.e., 2^{10} ≡ 1 (mod 33). More generally, 2 is an nth root of unity modulo 2^{n/2} + 1.

Point-wise additions and multiplications are not O(1): we are adding up to n numbers ("digits"), so we need Ω(log n) bits. The total cost thus increases by at least a log n factor, i.e., to O(n log² n).

[Schonhage-Strassen '71] developed an O(n log n log log n) algorithm: recursively apply their own technique for the "inner" operations.


Integer Multiplication Summary

These algorithms are implemented in libraries for arbitrary-precision arithmetic, with applications in public-key cryptography, computer algebra systems, etc.

GNU MP is a popular library; it chooses among several algorithms based on the input size: naive, Karatsuba, Toom-3, Toom-4, or Schonhage-Strassen (at about 50K digits).

Karatsuba is Toom-2. Toom-N is based on:
- evaluating a polynomial at 2N − 1 points,
- performing point-wise multiplication, and
- interpolating to get back the polynomial, while minimizing the operations needed for interpolation.


Fast Fourier Transformation

One of the most widely used algorithms, yet most people are unaware of its use!
- Solving differential equations: applied to many computational problems in engineering, e.g., heat transfer
- Audio: MP3, digital audio processors, music/speech synthesizers, speech recognition, ...
- Image and video: JPEG, MPEG, vision, ...
- Communication: modulation, filtering, radars, software-defined radios, H.264, ...
- Medical diagnostics: MRI, PET, ultrasound, ...
- Quantum computing: see text Ch. 10
- Other: optics, data compression, seismology, ...


Fourier Series

Theorem (Fourier Theorem)
Any (sufficiently smooth) function with period T can be expressed as the sum of a series of sinusoids with periods T/n, for integral n:

a(t) = Σ_{n=0}^{∞} (d_n sin(2πnt/T) + e_n cos(2πnt/T))

(Images from Wikimedia Commons.)


Fourier Series

Using the identity

e^{ix} = cos x + i sin x

the Fourier series becomes

a(t) = Σ_{n=−∞}^{∞} c_n e^{2πint/T}

It can be shown that

c_n = ∫_0^T a(t) e^{−2πint/T} dt

For real a(t), c_n = c*_{−n}.

Example: touch-tone button 1. The signal is the sum of two sinusoids; the time domain shows the combined waveform, and the frequency domain shows two spikes. (Figure from Cleve Moler, Numerical Computing with MATLAB.)


Fourier Transform

What if a is not periodic? Maybe we can start with the Fourier series definition for c_n,

c_n = ∫_0^T a(t) e^{−2πint/T} dt

and let T → ∞? The frequencies are no longer discrete, as the "fundamental frequency" f = 1/T → 0. Instead of discrete coefficients c_n, we get a continuous function, call it s(f):

s(f) = ∫_{−∞}^{∞} a(t) e^{−2πift} dt

F(a) denotes a's Fourier transform. F is almost self-inverting: F(F(a(t))) = a(−t).



How do Fourier Series/Transform help?

Differential equations: turn non-integrable functions into a sum of easily integrable ones.

Some problems are easier to solve in the frequency domain:

Filtering: filter out noise, tuning, ...

Compression: eliminate high-frequency components, ...

Convolution: convolution in the time domain becomes (simpler) multiplication in the frequency domain.

Definition (Convolution)

(a * b)(t) = \int_{-\infty}^{\infty} a(t - x)\, b(x)\, dx

Theorem (Convolution)

F((a * b)(t)) = F(a(t))\, F(b(t))

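The convolution theorem follows from a single change of variables; a short proof sketch we add, using the transform convention above:

\mathcal{F}(a*b)(f)
  = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} a(t-x)\, b(x)\, e^{-2\pi i f t}\, dx\, dt
  \overset{u = t-x}{=}
  \int_{-\infty}^{\infty} a(u)\, e^{-2\pi i f u}\, du \int_{-\infty}^{\infty} b(x)\, e^{-2\pi i f x}\, dx
  = \mathcal{F}(a)(f)\, \mathcal{F}(b)(f)

since e^{-2\pi i f t} = e^{-2\pi i f u}\, e^{-2\pi i f x} when t = u + x.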


Discrete Fourier Transform

Real-world signals are typically sampled; the DFT is a formulation of the FT applicable to such samples.

Nyquist rate: a signal whose highest frequency is n/2 can be losslessly reconstructed from n samples.

The DFT of time-domain samples a_0, \ldots, a_{n-1} yields frequency-domain samples s_0, \ldots, s_{n-1}:

s_f = \sum_{t=0}^{n-1} a_t\, e^{-2\pi i f t/n} \qquad \text{cf.} \qquad s(f) = \int_{-\infty}^{\infty} a(t)\, e^{-2\pi i f t}\, dt

Note: the DFT formulation can be derived from the FT by treating the sampling process as multiplication by a sequence of impulse functions separated by the sampling interval.
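The defining sum translates directly into code. Here is a minimal sketch of the naive O(n^2) DFT (our code, using numpy; not from the slides), which also previews the matrix view on the next slides:

import numpy as np

def dft(a):
    """Naive O(n^2) DFT: s_f = sum_t a_t * exp(-2*pi*i*f*t/n)."""
    n = len(a)
    f, t = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    M = np.exp(-2j * np.pi * f * t / n)       # M[f, t] = omega^(f*t)
    return M @ np.asarray(a, dtype=complex)

a = [1.0, 2.0, 3.0, 4.0]
print(np.allclose(dft(a), np.fft.fft(a)))     # True: numpy uses the same convention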


Background: Complex Plane, Polar Coordinates

(Text's Figure 2.6: "The complex roots of unity are ideal for our divide-and-conquer scheme.")

The complex plane: z = a + bi is plotted at position (a, b).

Polar coordinates: rewrite as z = r(\cos\theta + i\sin\theta) = re^{i\theta}, denoted (r, \theta).

  length r = \sqrt{a^2 + b^2}

  angle \theta \in [0, 2\pi): \cos\theta = a/r, \sin\theta = b/r

  \theta can always be reduced modulo 2\pi.

Examples:

  Number:        -1         i           5 + 5i
  Polar coords:  (1, \pi)   (1, \pi/2)  (5\sqrt{2}, \pi/4)


Polar Coordinates and Multiplication

(Text's Figure 2.6, continued.)

Multiplying is easy in polar coordinates: multiply the lengths and add the angles:

(r_1, \theta_1) \times (r_2, \theta_2) = (r_1 r_2, \theta_1 + \theta_2)

For any z = (r, \theta):

  -z = (r, \theta + \pi), since -1 = (1, \pi).

  If z is on the unit circle (i.e., r = 1), then z^n = (1, n\theta).


Roots of unity on Complex Plane

(Text's Figure 2.6, continued.)

The nth complex roots of unity: solutions to the equation z^n = 1. By the multiplication rule, the solutions are z = (1, \theta) for \theta a multiple of 2\pi/n (the figure shows n = 16, at angles 2\pi/n, 4\pi/n, \ldots).

For even n:

  These numbers are plus-minus paired: -(1, \theta) = (1, \theta + \pi).

  Their squares are the (n/2)nd roots of unity.

Divide-and-conquer step (n a power of 2): to evaluate A(x) at the nth roots of unity, evaluate A_e(x) and A_o(x) at the (n/2)nd roots, which are still plus-minus paired, and divide and conquer.
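The plus-minus pairing and the squaring property are easy to verify numerically; a small sketch (our code, not from the slides):

import cmath

n = 16
roots = [cmath.exp(2j * cmath.pi * k / n) for k in range(n)]

# Plus-minus pairing: the root at angle theta + pi is the negation of the root at theta.
print(all(cmath.isclose(roots[k + n // 2], -roots[k]) for k in range(n // 2)))   # True

# Squaring the nth roots yields only n/2 distinct values: the (n/2)nd roots of unity.
squares = {complex(round(z.real, 9), round(z.imag, 9)) for z in (r * r for r in roots)}
print(len(squares))                                                              # 8, i.e., n/2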


Matrix representation of DFT

Given time-domain samples a_t for t = 0, 1, \ldots, n-1, compute frequency-domain samples s_f for f = 0, 1, \ldots, n-1:

s_f = \sum_{t=0}^{n-1} a_t\, e^{-2\pi i f t/n} = \sum_{t=0}^{n-1} a_t \left(e^{-2\pi i/n}\right)^{ft} = \sum_{t=0}^{n-1} a_t\, \omega^{ft}

where \omega = e^{-2\pi i/n} is a primitive nth complex root of unity. In matrix form:

\begin{pmatrix} s_0 \\ s_1 \\ s_2 \\ \vdots \\ s_j \\ \vdots \\ s_{n-1} \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 1 & \cdots & 1 \\
1 & \omega & \omega^2 & \cdots & \omega^{n-1} \\
1 & \omega^2 & \omega^4 & \cdots & \omega^{2(n-1)} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & \omega^j & \omega^{2j} & \cdots & \omega^{j(n-1)} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & \omega^{n-1} & \omega^{2(n-1)} & \cdots & \omega^{(n-1)(n-1)}
\end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_j \\ \vdots \\ a_{n-1} \end{pmatrix}


Matrix representation of DFT

Note that \omega = e^{-2\pi i/n} represents an nth root of 1. The matrix equation above admits several interpretations:

  Simultaneous equations that can be solved

  Change of basis (rotate the coordinate system)

  Evaluation of the polynomial \sum_{k=0}^{n-1} a_k x^k at x = \omega^j, 0 \le j < n


Speeding up FFT Computation

The matrix-multiplication formulation has an obvious divide-and-conquer implementation:

M_n\, \vec{A}_{0\cdots(n-1)} =
\begin{pmatrix} M^{11}_{n/2} & M^{12}_{n/2} \\ M^{21}_{n/2} & M^{22}_{n/2} \end{pmatrix}
\begin{pmatrix} \vec{A}_{0\cdots\lfloor n/2 \rfloor} \\ \vec{A}_{\lceil n/2 \rceil\cdots(n-1)} \end{pmatrix}

But this algorithm still takes O(n^2) time.

... but wait! There are only O(n) distinct elements in the square matrix M_n: since \omega^n = 1, the entry \omega^{jk} depends only on jk mod n.

With O(n) repetitions of each element in M_n, there is significant scope for sharing operations on submatrices!


Observations about M(ω)

(The slide reproduces the figure from Section 2.6.4 of the text: M_n(\omega), whose (j,k)th entry is \omega^{jk}, redrawn with columns segregated into even columns 2k and odd columns 2k+1. Row j then holds \omega^{2jk} in even columns and \omega^j \cdot \omega^{2jk} in odd columns, while row j + n/2 holds \omega^{2jk} and -\omega^j \cdot \omega^{2jk}.)

Two successive columns differ by a factor of \omega^j in the jth row.

Rows that are n/2 apart differ by a factor of \omega^{kn/2} in the kth column.

Note that \omega^{n/2} = -1, so such rows differ by a factor of -1 in odd columns, and are identical in even columns.


DFT Matrix Multiplication, Rearranged ...

Rearranging M_n(\omega) by splitting even and odd columns (previous slide) reveals: the top-left and bottom-left n/2 \times n/2 submatrices are both M_{n/2}(\omega^2), and the top-right and bottom-right submatrices are M_{n/2}(\omega^2) with the jth row multiplied through by \omega^j and -\omega^j, respectively (simplified using \omega^{n/2} = -1 and \omega^n = 1).

Only two subproblems of size n/2: multiply M_{n/2} by \vec{A}_{even} and by \vec{A}_{odd}:

T(n) = 2T(n/2) + O(n), with an O(n \log n) solution.

But wait! M_n has O(n^2) size; how can we operate on it in O(n) time? (We never materialize M_n: apart from the two recursive products, the combine step touches only the n output entries, as the algorithm on the next slide shows.)


FFT Algorithm

Figure 2.9 (from the text): The fast Fourier transform

function FFT(a, \omega)
Input:  an array a = (a_0, a_1, \ldots, a_{n-1}), for n a power of 2;
        a primitive nth root of unity, \omega
Output: M_n(\omega)\, a

if \omega = 1: return a
(s_0, s_1, \ldots, s_{n/2-1}) = FFT((a_0, a_2, \ldots, a_{n-2}), \omega^2)
(s'_0, s'_1, \ldots, s'_{n/2-1}) = FFT((a_1, a_3, \ldots, a_{n-1}), \omega^2)
for j = 0 to n/2 - 1:
    r_j = s_j + \omega^j s'_j
    r_{j+n/2} = s_j - \omega^j s'_j
return (r_0, r_1, \ldots, r_{n-1})

In short, the product of M_n(\omega) with vector (a_0, \ldots, a_{n-1}), a size-n problem, is expressed in terms of two size-(n/2) problems: the product of M_{n/2}(\omega^2) with (a_0, a_2, \ldots, a_{n-2}) and with (a_1, a_3, \ldots, a_{n-1}). This divide-and-conquer strategy gives running time T(n) = 2T(n/2) + O(n) = O(n \log n).
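A direct Python transcription of Figure 2.9 (our sketch; it favors clarity over the in-place, iterative form used by production libraries, and uses the length rather than \omega = 1 as the base-case test to avoid floating-point equality):

import cmath

def fft(a, omega):
    """Recursive FFT: returns M_n(omega) * a, where n = len(a) is a power of 2."""
    n = len(a)
    if n == 1:                        # base case: M_1 is the 1x1 identity
        return list(a)
    s = fft(a[0::2], omega * omega)   # transform of even-indexed entries
    t = fft(a[1::2], omega * omega)   # transform of odd-indexed entries
    r = [0j] * n
    w = 1                             # w = omega^j in iteration j
    for j in range(n // 2):
        r[j] = s[j] + w * t[j]
        r[j + n // 2] = s[j] - w * t[j]
        w *= omega
    return r

a = [1, 2, 3, 4]
print(fft(a, cmath.exp(-2j * cmath.pi / len(a))))   # [10, -2+2j, -2, -2-2j], as numpy.fft.fft gives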


Convolution in the Discrete World

(\vec{A}_n * \vec{B}_m)_t = \sum_{x=0}^{m-1} a_{t-x}\, b_x \qquad \text{cf.} \qquad (a * b)(t) = \int_{-\infty}^{\infty} a(t-x)\, b(x)\, dx

Linear convolution: a_{t-x} = 0 if x > t.

Circular convolution: a_{t-x} = a_{t-x+n} if x > t. (Equivalent to treating A as a periodic function.)

Zero-extended convolution: first extend A and B to m + n - 1 samples, by letting a_y = 0 for n \le y < m + n - 1 and b_z = 0 for m \le z < m + n - 1.

With zero-extension, the definitions of linear and circular convolution match, and hence become equivalent. Hence, we will deal only with zero-extended convolution.

Theorem (Discrete Convolution)

F(\vec{A}_n * \vec{B}_m) = F(\vec{A}_n)\, F(\vec{B}_m)
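A sketch of zero-extended convolution via the convolution theorem (our code, using numpy; np.fft.fft(x, n) zero-pads x to length n):

import numpy as np

def conv_fft(a, b):
    """Zero-extended (= linear) convolution computed in the frequency domain."""
    n = len(a) + len(b) - 1            # length after zero-extension
    fa = np.fft.fft(a, n)              # F(A), zero-padded
    fb = np.fft.fft(b, n)              # F(B), zero-padded
    return np.fft.ifft(fa * fb).real   # point-wise product, then invert

a, b = [1, 2, 3], [4, 5]
print(conv_fft(a, b))        # [ 4. 13. 22. 15.]
print(np.convolve(a, b))     # [ 4 13 22 15], the direct O(nm) loop, same answer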


Why this fascination with convolution?

Computationally, convolution is a loop that adds products. The convolution theorem says we can replace this O(n) loop by a single point-wise operation on the DFT. That is fascinating!

Wait a minute! What about the cost of computing F first? If we use the FFT, then computing F and inverting it will still be O(n \log n), not quadratic.

Can we use the FFT as a building block to speed up algorithms for other problems? Integer multiplication looks like a convolution, and usually takes O(n^2). Can we make it O(n \log n)?


FFT to the Rescue!

Matrix form of the DFT, and its interpretation as polynomial evaluation: the equation \vec{S} = M_n(\omega)\, \vec{A} (above) says precisely that s_j = A(\omega^j) for the polynomial A(x) = \sum_{k=0}^{n-1} a_k x^k.

Voila! The FFT evaluates A(x) at n points (x_j = \omega^j) in O(n \log n) time!

O(n \log n) integer multiplication:

  Convert to point representation using FFT:       O(n \log n)
  Multiply in point representation:                O(n)
  Convert back to coefficients using FFT^{-1}:     O(n \log n)


FFT to the Rescue!

(The same matrix equation as on the previous slide.)

The FFT can be thought of as a clever way to choose the evaluation points: evaluations at many distinct points "collapse" together, since the n points \omega^j have only n/2 distinct squares.

This is why we are left with 2T(n/2) work after division, instead of 4T(n/2) for a naive choice of points.


FFT-based Multiplication: Summary

FFT works with 2^k points: this increases work by up to 2x.

The product of two degree-n polynomials has degree 2n: we need to work with 2n points, i.e., up to a 4x increase in time.

It requires inverting back to the coefficient representation after multiplication:

\vec{S}_n = M_n(\omega)\, \vec{A}_n, \qquad M_n^{-1}(\omega)\, \vec{S}_n = M_n^{-1}(\omega)\, M_n(\omega)\, \vec{A}_n = \vec{A}_n

It is easy to show that M_n^{-1}(\omega) = M_n(\omega^{-1})/n, and hence:

\vec{A}_n * \vec{B}_n = FFT(FFT(\vec{A}_{2n}, \omega) \cdot FFT(\vec{B}_{2n}, \omega),\ \omega^{-1})/n

We are back to the convolution theorem!
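Putting the pieces together for integers: a toy sketch (our code; base-10 digit vectors and a floating-point FFT with rounding, which is exactly the precision issue the next slide takes up):

import numpy as np

def int_multiply(x, y):
    """Multiply non-negative integers via FFT convolution of their digit vectors."""
    a = [int(d) for d in str(x)][::-1]           # least-significant digit first
    b = [int(d) for d in str(y)][::-1]
    n = len(a) + len(b) - 1                      # digits of the product (pre-carry)
    c = np.fft.ifft(np.fft.fft(a, n) * np.fft.fft(b, n)).real
    digits, carry = [], 0
    for v in c:                                  # round, then propagate carries
        carry += int(round(v))
        digits.append(carry % 10)
        carry //= 10
    while carry:
        digits.append(carry % 10)
        carry //= 10
    return int("".join(map(str, digits[::-1])))

print(int_multiply(12345, 6789) == 12345 * 6789)  # True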


More careful analysis ...

Computations on complex or real numbers can lose precision.

For integer operations, we should work in some other ring: usually, we choose a ring based on modular arithmetic.

Ex: in mod-33 arithmetic, 2 is a 10th root of 1, i.e., 2^{10} \equiv 1 \pmod{33}.

More generally, 2 is an nth root of unity modulo 2^{n/2} + 1.

Point-wise additions and multiplications are not O(1): we are adding up to n numbers ("digits"), so we need \Omega(\log n) bits. The total cost therefore increases by at least a \log n factor, i.e., O(n \log^2 n).

[Schönhage-Strassen '71] developed an O(n \log n \log\log n) algorithm: recursively apply their technique for the "inner" operations.
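The modular root-of-unity claim is easy to sanity-check (our code):

# 2 is an nth root of unity modulo 2^(n/2) + 1, for even n:
for n in [4, 8, 10, 16, 32]:
    m = 2 ** (n // 2) + 1
    assert pow(2, n, m) == 1            # 2^n = 1 (mod m), since 2^(n/2) = -1 (mod m)
    assert pow(2, n // 2, m) == m - 1
print(pow(2, 10, 33))                   # 1, the slide's example: 2 is a 10th root of 1 mod 33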
