CS 124 Course Notes 1 Spring 2002
An algorithm is a recipe or a well-defined procedure for performing a calculation, or in general, for transforming
some input into a desired output. Perhaps the most familiar algorithms are those for adding and multiplying
integers. Here is a multiplication algorithm that is different from the standard algorithm you learned in school: write
the multiplier and multiplicand side by side. Repeat the following operations - divide the first number by 2 (throw
out any fractions) and multiply the second by 2, until the first number is 1. This results in two columns of numbers.
Now cross out all rows in which the first entry is even, and add all entries of the second column that haven’t been
crossed out. The result is the product of the two numbers.
    75        29
    37        58
    18       116   (crossed out)
     9       232
     4       464   (crossed out)
     2       928   (crossed out)
     1      1856
            -----
             2175

In binary, 75 = 1001011; multiplying 29 by 1001011 produces the partial products 29, 58, 232, and 1856, which sum to 2175.
Figure 1.1: A different multiplication algorithm.
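To make the procedure concrete, here is a short Python sketch of the halving-and-doubling method just described (the function and variable names are ours, and x is assumed to be a positive integer):

    def multiply(x, y):
        """Multiply x by y with the halving-and-doubling method described above."""
        total = 0
        while x >= 1:
            if x % 2 == 1:      # the row is not crossed out: the first entry is odd
                total += y
            x = x // 2          # divide the first number by 2, throwing out fractions
            y = y * 2           # multiply the second number by 2
        return total

    # multiply(75, 29) returns 2175, matching Figure 1.1.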
In this course we will ask a number of basic questions about algorithms:
• Does it halt?
The answer for the algorithm given above is clearly yes, provided we are multiplying positive integers. The
reason is that for any integer greater than 1, when we divide it by 2 and throw out the fractional part, we always
get a smaller integer which is greater than or equal to 1. Hence our first number is eventually reduced to 1 and
the process halts.
• Is it correct?
To see that the algorithm correctly computes the product of the integers, observe that if we write a 0 for each
crossed out row, and 1 for each row that is not crossed out, then reading from bottom to top just gives us
the first number in binary. Therefore, the algorithm is just doing standard multiplication, with the multiplier
written in binary.
• Is it fast?
It turns out that the above algorithm is about as fast as the standard algorithm you learned in school. Later in
the course, we will study a faster algorithm for multiplying integers.
• How much memory does it use?
The memory used by this algorithm is also about the same as that of the standard algorithm.
The history of algorithms for simple arithmetic is quite fascinating. Although we take these algorithms for
granted, their widespread use is surprisingly recent. The key to good algorithms for arithmetic was the positional
number system (such as the decimal system). Roman numerals (I, II, III, IV, V, VI, etc) are just the wrong data
structure for performing arithmetic efficiently. The positional number system was first invented by the Mayan
Indians in Central America about 2000 years ago. They used a base 20 system, and it is unknown whether they had
invented algorithms for performing arithmetic, since the Spanish conquerors destroyed most of the Mayan books on
science and astronomy.
The decimal system that we use today was invented in India in roughly 600 AD. This positional number system,
together with algorithms for performing arithmetic, were transmitted to Persia around 750 AD, when several impor-
tant Indian works were translated into Arabic. Around this time the Persian mathematician Al-Khwarizmi wrote his
Arabic textbook on the subject. The word “algorithm” comes from Al-Khwarizmi’s name. Al-Khwarizmi’s work
was translated into Latin around 1200 AD, and the positional number system was propagated throughout Europe
from 1200 to 1600 AD.
The decimal point was not invented until the 10th century AD, by a Syrian mathematician al-Uqlidisi from
Damascus. His work was soon forgotten, and five centuries passed before decimal fractions were re-invented by the
Persian mathematician al-Kashi.
With the invention of computers in this century, the field of algorithms has seen explosive growth. There are a
number of major successes in this field:
• Parsing algorithms - these form the basis of the field of programming languages
• Fast Fourier transform - the field of digital signal processing is built upon this algorithm.
• Linear programming - this algorithm is extensively used in resource scheduling.
• Sorting algorithms - until recently, sorting used up the bulk of computer cycles.
• String matching algorithms - these are extensively used in computational biology.
• Number theoretic algorithms - these algorithms make it possible to implement cryptosystems such as the RSA
public key cryptosystem.
• Compression algorithms - these algorithms allow us to transmit data more efficiently over, for example, phone
lines.
• Geometric algorithms - displaying images quickly on a screen often makes use of sophisticated algorithmic
techniques.
In designing an algorithm, it is often easier and more productive to think of a computer in abstract terms. Of
course, we must carefully choose at what level of abstraction to think. For example, we could think of computer
operations in terms of a high level computer language such as C or Java, or in terms of an assembly language. We
could dip further down, and think of the computer at the level of AND and NOT gates.
For most algorithm design we undertake in this course, it is generally convenient to work at a fairly high level.
We will usually abstract away even the details of the high level programming language, and write our algorithms in
”pseudo-code”, without worrying about implementation details. (Unless, of course, we are dealing with a program-
ming assignment!) Sometimes we have to be careful that we do not abstract away essential features of the problem.
To illustrate this, let us consider a simple but enlightening example.
1.1 Computing the nth Fibonacci number
Remember the famous sequence of numbers invented in the 13th century by the Italian mathematician Leonardo
Fibonacci? The sequence is represented as F0,F1,F2 . . ., where F0 = 0, F1 = 1, and for all n ≥ 2, Fn is defined as
Fn−1 +Fn−2. The first few Fibonacci numbers are 0,1,1,2,3,5,8,13,21,34,55, . . . The value of F31 is greater than a
million! It is easy to see that the Fibonacci numbers grow exponentially. As an exercise, try to show that Fn ≥ 2^{n/2}
for sufficiently large n by a simple induction.
Here is a simple program to compute Fibonacci numbers that slavishly follows the definition.
function F(n: integer): integer
if n = 0 then return 0
else if n = 1 then return 1
else return F(n−1)+F(n−2)
The program is obviously correct. However, it is woefully slow. As it is a recursive algorithm, we can naturally
express its running time on input n with a recurrence equation. In fact, we will simply count the number of addition
operations the program uses, which we denote by T (n). To develop a recurrence equation, we express T (n) in terms
of smaller values of T . We shall see several such recurrence relations in this class.
It is clear that T (0) = 0 and T (1) = 0. Otherwise, for n ≥ 2, we have
T (n) = T (n−1)+T(n−2)+1,
because to compute F(n) we compute F(n− 1) and F(n− 2) and do one other addition besides. This is (almost)
the Fibonacci equation! Hence we can see that the number of addition operations is growing very large; it is at least
2^{n/2} for n ≥ 4.
Can we do better? This is the question we shall always ask of our algorithms. The trouble with the naive
algorithm is its wasteful recursion: the function F is called with the same argument over and over again, exponentially
many times (try to see how many times F(1) is called in the computation of F(5)). A simple trick for improving
performance is to avoid repeated calculations. In this case, this can be easily done by avoiding recursion and just
calculating successive values:
function F(n: integer): integer
array A[0 . . .n] of integer
A[0] = 0; A[1] = 1
for i = 2 to n do:
A[i] = A[i−1]+A[i−2]
return A[n]
This algorithm is of course correct. Now, however, we only do n−1 additions.
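As an illustration, the same method in Python might look as follows (a sketch; the function name is ours). Python's arbitrary-precision integers hide the cost of adding large numbers, a point we return to shortly.

    def fib(n):
        """Compute F(n) by filling an array A[0..n], as in the pseudocode above."""
        if n == 0:
            return 0
        A = [0] * (n + 1)
        A[1] = 1
        for i in range(2, n + 1):
            A[i] = A[i - 1] + A[i - 2]
        return A[n]

Only the last two entries of A are ever consulted, so the array could be replaced by two variables.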
It seems that we have come so far, from exponential to polynomially many operations, that we can stop here.
But in the back of our heads, we should be wondering: can we do even better? Surprisingly, we can. We rewrite our
equations in matrix notation. Then
\begin{pmatrix} F_1 \\ F_2 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix} \cdot \begin{pmatrix} F_0 \\ F_1 \end{pmatrix}.

Similarly,

\begin{pmatrix} F_2 \\ F_3 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix} \cdot \begin{pmatrix} F_1 \\ F_2 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}^2 \cdot \begin{pmatrix} F_0 \\ F_1 \end{pmatrix},

and in general,

\begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}^n \cdot \begin{pmatrix} F_0 \\ F_1 \end{pmatrix}.
So, in order to compute Fn, it suffices to raise this 2 by 2 matrix to the nth power. Each matrix multiplication
takes 12 arithmetic operations, so the question boils down to the following: how many multiplications does it take
to raise a base (matrix, number, anything) to the nth power? The answer is O(logn). To see why, consider the case
where n > 1 is a power of 2. To raise X to the nth power, we compute X^{n/2} and then square it. Hence the number of
multiplications T (n) satisfies
T (n) = T (n/2)+1,
from which we find T (n) = logn. As an exercise, consider what you have to do when n is not a power of 2.
(Hint: consider the connection with the multiplication algorithm of the first section; there too we repeatedly halved
a number...)
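Here is one way the repeated-squaring idea might look in Python for the 2 by 2 matrix above; it handles arbitrary n by walking over the binary expansion of n (a sketch; the names are ours).

    def mat_mult(A, B):
        """Multiply two 2x2 matrices (12 arithmetic operations)."""
        return [[A[0][0]*B[0][0] + A[0][1]*B[1][0], A[0][0]*B[0][1] + A[0][1]*B[1][1]],
                [A[1][0]*B[0][0] + A[1][1]*B[1][0], A[1][0]*B[0][1] + A[1][1]*B[1][1]]]

    def mat_pow(M, n):
        """Raise a 2x2 matrix to the nth power using O(log n) multiplications."""
        result = [[1, 0], [0, 1]]               # the identity matrix
        while n > 0:
            if n % 2 == 1:                      # this bit of n is set
                result = mat_mult(result, M)
            M = mat_mult(M, M)                  # square
            n //= 2
        return result

    def fib(n):
        """F(n) is the top-right entry of [[0,1],[1,1]] raised to the nth power."""
        return mat_pow([[0, 1], [1, 1]], n)[0][1]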
So we have reduced the computation time exponentially again, from n− 1 arithmetic operations to O(log n),
a great achievement. Well, not really. We got a little too abstract in our model. In our accounting of the time
requirements for all three methods, we have made a grave and common error: we have been too liberal about what
constitutes an elementary step. In general, we often assume that each arithmetic step takes unit time, because the
numbers involved will be typically small enough that we can reasonably expect them to fit within a computer’s
word. Remember, the number n is only log n bits in length. But in the present case, we are doing arithmetic on huge
numbers, with about n bits, where n is pretty large. When dealing with such huge numbers, if exact computation
is required we have to use sophisticated long integer packages. Such algorithms take O(n) time to add two n-bit
numbers. Hence the complexity of the first two methods was larger than we actually thought: not really O(Fn) and
O(n), but instead O(nFn) and O(n^2), respectively. The second algorithm is still exponentially faster. What is worse,
the third algorithm involves multiplications of O(n)-bit integers. Let M(n) be the time required to multiply two n-bit
numbers. Then the running time of the third algorithm is in fact O(M(n)).
The comparison between the running times of the second and third algorithms boils down to a most important
and ancient issue: can we multiply two n-bit integers faster than Ω(n^2)? This would be faster than the method we
learn in elementary school or the clever halving method explained in the opening of these notes.
As a final consideration, we might consider the mathematicians’ solution to computing the Fibonacci numbers.
A mathematician would quickly determine that
F_n = (1/√5) [ ((1+√5)/2)^n − ((1−√5)/2)^n ].
Using this, how many operations does it take to compute Fn? Note that this calculation would require floating point
arithmetic. Whether in practice that would lead to a faster or slower algorithm than one using just integer arithmetic
might depend on the computer system on which you run the algorithm.
CS 124 Lecture 2
In order to discuss algorithms effectively, we need to start with a basic set of tools. Here, we explain these tools
and provide a few examples. Rather than spend time honing our use of these tools, we will learn how to use them by
applying them in our studies of actual algorithms.
Induction
The standard form of the induction principle is the following:
If a statement P(n) holds for n = 1, and if for every n ≥ 1 P(n) implies P(n+1), then P holds for all n.
Let us see an example of this:
Claim 2.1 Let S(n) = ∑_{i=1}^{n} i. Then S(n) = n(n+1)/2.

Proof: The proof is by induction.

Base Case: We show the statement is true for n = 1. As S(1) = 1 = 1(2)/2, the statement holds.

Induction Hypothesis: We assume S(n) = n(n+1)/2.

Reduction Step: We show S(n+1) = (n+1)(n+2)/2. Note that S(n+1) = S(n) + n + 1. Hence

S(n+1) = S(n) + n + 1
       = n(n+1)/2 + n + 1
       = (n+1)(n/2 + 1)
       = (n+1)(n+2)/2.
The proof style is somewhat pedantic, but instructional and easy to read. We break things down to the base case
– showing that the statement holds when n = 1; the induction hypothesis – the statement that P(n) is true; and the
reduction step – showing that P(n) implies P(n+1).
Induction is one of the most fundamental proof techniques. The idea behind induction is simple: take a large
problem (P(n + 1)), and somehow reduce its proof to a proof of smaller problems (such as P(n); P(n) is smaller
in the sense that n < n+1). If every problem can thereby be broken down to a small number of instances (we keep
reducing down to P(1)), these can be checked easily. We will see this idea of reduction, whereby we reduce solving
a problem to solving an easier problem, over and over again throughout the course.
As one might imagine, there are other forms of induction besides the specific standard form we gave above.
Here’s a different form of induction, called strong induction:
If a statement P(n) holds for n = 1, and if for every n ≥ 1 the truth of P(i) for all i ≤ n implies P(n+1), then P holds
for all n.
Exercise: show that every number has a unique prime factorization using strong induction.
O Notation
When measuring, for example, the number of steps an algorithm takes in the worst case, our result will generally
be some function T (n) of the input size, n. One might imagine that this function may have some complex form, such
as T(n) = 4n^2 − 3n log n + n^{2/3} + log^3 n − 4. In very rare cases, one might wish to have such an exact form for the
running time, but in general, we are more interested in the rate of growth of T (n) rather than its exact form.
The O notation was developed with this in mind. With the O notation, only the fastest growing term is important,
and constant factors may be ignored. More formally:
Definition 2.2 We say for non-negative functions f (n) and g(n) that f (n) is O(g(n)) if there exist positive constants
c and N such that for all n ≥ N,
f (n) ≤ cg(n).
Let us try some examples. We claim that 2n^3 + 4n^2 is O(n^3). It suffices to show that 2n^3 + 4n^2 ≤ 6n^3 for n ≥ 1,
by definition. But this is clearly true as 4n^3 ≥ 4n^2 for n ≥ 1. (Exercise: show that 2n^3 + 4n^2 is O(n^4).)
We claim 10 log_2 n is O(ln n). This follows from the fact that 10 log_2 n ≤ (10 log_2 e) ln n.
If T(n) is as above, then T(n) is O(n^2). This is a bit harder to prove, because of all the extraneous terms. It is,
however, easy to see; 4n^2 is clearly the fastest growing term, and we can remove the constant with O notation. Note,
though, that T(n) is O(n^3) as well! The O notation is not tight, but more like a ≤ comparison.
Similarly, there is notation for ≥ and = comparisons.
Definition 2.3 We say for non-negative functions f (n) and g(n) that f (n) is Ω(g(n)) if there exist positive con-
stants c and N such that for all n ≥ N,
f (n) ≥ cg(n).
We say that f (n) is Θ(g(n)) if both f (n) is O(g(n)) and f (n) is Ω(g(n)).
The O notation has several useful properties that are easy to prove.
Lemma 2.4 If f1(n) is O(g1(n)) and f2(n) is O(g2(n)) then f1(n)+ f2(n) is O(g1(n)+g2(n)).
Proof: There exist positive constants c1,c2,N1, and N2 such that f1(n) ≤ c1g1(n) for n ≥ N1 and f2(n) ≤ c2g2(n) for
n ≥ N2. Hence f1(n) + f2(n) ≤ max(c1,c2)(g1(n) + g2(n)) for n ≥ max(N1,N2).
Exercise: Prove similar lemmata for f1(n) f2(n). Prove the lemmata when O is replaced by Ω or Θ.
Finally, there is a bit more notation, corresponding to ≪, for when one function is (in some sense) much less than
another.
Definition 2.5 We say for non-negative functions f (n) and g(n) that f (n) is o(g(n)) if

lim_{n→∞} f(n)/g(n) = 0.
Also, f (n) is ω(g(n)) if g(n) is o( f (n)).
We emphasize that the O notation is a tool to help us analyze algorithms. It does not always accurately tell us
how fast an algorithm will run in practice. For example, constant factors make a huge difference in practice (imagine
increasing your bank account by a factor of 10), and they are ignored in the O notation. Like any other tool, the O
notation is only useful if used properly and wisely. Use it as a guide, not as the last word, to judging an algorithm.
Recurrence Relations
A recurrence relation defines a function using an expression that includes the function itself. For example, the
Fibonacci numbers are defined by:
F(n) = F(n−1)+F(n−2), F(1) = F(2) = 1.
This function is well-defined, since we can compute a unique value of F(n) for every positive integer n.
Note that recurrence relations are similar in spirit to the idea of induction. The relation defines a function value
F(n) in terms of the function values at smaller arguments (in this case, n− 1 and n− 2), effectively reducing the
problem of computing F(n) to that of computing F at smaller values. Base cases (the values of F(1) and F(2)) need
to be provided.
Finding exact solutions for recurrence relations is not an extremely difficult process; however, we will not
focus on solution methods for them here. Often a natural thing to do is to try to guess a solution, and then prove it
by induction. Alternatively, one can use a symbolic computation program (such as Maple or Mathematica); these
programs can often generate solutions.
We will occasionally use recurrence relations to describe the running times of algorithms. For our purposes, we
often do not need to have an exact solution for the running time, but merely an idea of its asymptotic rate of growth.
For example, the relation
T (n) = 2T (n/2)+2n, T (1) = 1
has the exact solution (for n a power of 2) of T (n) = 2n log2 n+n. (Exercise: Prove this by induction.) But for our
purposes, it is generally enough to know that the solution is Θ(n log n).
The following theorem is extremely useful for such recurrence relations:
Theorem 2.6 The solution to the recurrence relation T(n) = aT(n/b) + cn^k, where a ≥ 1 and b ≥ 2 are integers
and c and k are positive constants, satisfies:

T(n) is
    O(n^{log_b a})   if a > b^k,
    O(n^k log n)     if a = b^k,
    O(n^k)           if a < b^k.
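As an illustration, the recurrence T(n) = 2T(n/2) + 2n above has a = 2, b = 2, and k = 1, so a = b^k and the theorem gives T(n) = O(n^k log n) = O(n log n), agreeing with the exact solution 2n log_2 n + n. For a made-up recurrence such as T(n) = 4T(n/2) + n, we have a = 4, b = 2, k = 1, so a > b^k and T(n) = O(n^{log_2 4}) = O(n^2).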
Data Structures
We shall regard integers, real numbers, and bits, as well as more complicated objects such as lists and sets, as
primitive data structures. Recall that a list is just an ordered sequence of arbitrary elements.
List q := [x1,x2, . . . ,xn].
x1 is called the head of the list.
xn is called the tail of the list.
n = |q| is the size of the list.
We denote by ◦ the concatenation operation. Thus q ◦ r is the list that results from concatenating the list q with
the list r.
The operations on lists that are especially important for our purposes are:
head(q)       return(x1)
push(q,x)     q := [x] ◦ q
pop(q)        q := [x2, . . . ,xn], return(x1)
inject(q,x)   q := q ◦ [x]
eject(q)      q := [x1,x2, . . . ,xn−1], return(xn)
size(q)       return(n)
The head, pop, and eject operations are not defined for empty lists. Appropriate return values (either an error,
or an empty symbol) can be designed depending on the implementation.
A stack is a list that supports operations head, push, pop.
A queue is a list that supports operations head, inject and pop.
A deque supports all these operations.
Note that we can implement lists either by arrays or using pointers as the usual linked lists. Arrays are often
faster in practice, but they are often more complicated to program (especially if there is no implicit limit on the
number of items). In either case, each of the above operations can be implemented in a constant number of steps.
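As a sketch of such an implementation, Python's collections.deque supports all of these operations in constant (amortized) time; the comments map each call back to the operations above.

    from collections import deque

    q = deque([3, 1, 4])   # the list q = [3, 1, 4]

    q[0]                   # head(q): return the first element
    q.appendleft(2)        # push(q, x): q := [x] ◦ q
    q.popleft()            # pop(q): remove and return the head
    q.append(5)            # inject(q, x): q := q ◦ [x]
    q.pop()                # eject(q): remove and return the last element
    len(q)                 # size(q): the number of elements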
Application: Mergesort
For the rest of the lecture, we will review the procedure mergesort. The input is a list of n numbers, and the
output is a list of the given numbers sorted in increasing order. The main data structure used by the algorithm will be
a queue. We will assume that each queue operation takes 1 step, and that each comparison (is x > y?) takes 1 step.
We will show that mergesort takes O(n logn) steps to sort a sequence of n numbers.
The procedure mergesort relies on a function merge which takes as input two sorted (in increasing order) lists
of numbers and outputs a single sorted list containing all the given numbers (with repetition).
function merge(s, t)
    list s, t
    if s = [ ] then return t
    else if t = [ ] then return s
    else if s(1) ≤ t(1) then u := pop(s)
    else u := pop(t)
    return push(u, merge(s, t))
end merge

function mergesort(s)
    list s, q
    q := [ ]
    for x ∈ s
        inject(q, [x])
    rof
    while size(q) ≥ 2
        u := pop(q)
        v := pop(q)
        inject(q, merge(u, v))
    end
    if q = [ ] then return [ ]
    else return q(1)
end mergesort
The correctness of the function merge follows from the following fact: the smallest number in the input is either
s(1) or t(1), and must be the first number in the output list. The rest of the output list is just the list obtained by
merging s and t after deleting that smallest number.
The number of steps for each invocation of function merge is O(1) steps. Since each recursive invocation of
merge removes an element from either s or t, it follows that function merge halts in O(|s|+ |t|) steps.
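A Python sketch of the same procedure follows (the recursion in merge mirrors the pseudocode; an iterative merge, as in the question below, would avoid Python's recursion limit on long lists and the copying done by slicing):

    from collections import deque

    def merge(s, t):
        """Merge two sorted lists into one sorted list."""
        if not s:
            return t
        if not t:
            return s
        if s[0] <= t[0]:
            return [s[0]] + merge(s[1:], t)
        return [t[0]] + merge(s, t[1:])

    def mergesort(s):
        """Repeatedly merge pairs of sorted lists held on a queue."""
        q = deque([x] for x in s)
        if not q:
            return []
        while len(q) >= 2:
            q.append(merge(q.popleft(), q.popleft()))
        return q[0]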
Question: Can you design an iterative (rather than recursive) version of merge? How much time does it take?
Which version would be faster in practice – the recursive or the iterative?
    v := pop(q)
    previsit(v)
    explored(v) := 1
    for (v,w) ∈ E
        if explored(w) = 0 then
            inject(q,w)
            dist(w) := dist(v)+1
        fi
    rof
end while
end BFS
BFS runs, of course, in linear time O(|E|), under the assumption that |E| ≥ |V |. The reason is that BFS visits
each edge exactly once, and does a constant amount of work per edge.
Figure 4.1: BFS of a directed graph
Although BFS does not have the same subtle properties of DFS, it does provide useful information. BFS visits
vertices in order of increasing distance from s. In fact, our BFS algorithm above labels each vertex with the distance
from s, or the number of edges in the shortest path from s to the vertex. For example, applied to the graph in
Figure 4.1, this algorithm labels the vertices (by the array dist) as shown.
Why are we sure that the array dist is the shortest-path distance from s? A simple induction proof suffices. It
is certainly true if the distance is zero (this happens only at s). And, if it is true for dist(v) = d, then it can be easily
shown to be true for values of dist equal to d +1 —any vertex that receives this value has an edge from a vertex with
dist d, and from no vertex with lower value of dist. Notice that vertices not reachable from s will not be visited or
labeled.
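A Python sketch of BFS with distance labels; the graph is assumed to be given as a dictionary mapping each vertex to its list of out-neighbors (the representation and names are ours):

    from collections import deque

    def bfs(adj, s):
        """Return dist[v], the number of edges on a shortest path from s to v."""
        dist = {s: 0}
        q = deque([s])
        while q:
            v = q.popleft()
            for w in adj[v]:
                if w not in dist:          # w has not been explored yet
                    dist[w] = dist[v] + 1
                    q.append(w)
        return dist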
Single-Source Shortest Paths —Nonnegative Lengths
What if each edge (v,w) of our graph has a length, a positive integer denoted length(v,w), and we wish to find
the shortest paths from s to all vertices reachable from it? (What if we are interested only in the shortest path from s
to a specific node t? As it turns out, all algorithms known for this problem have to compute the shortest path from s
to all vertices reachable from it.) BFS offers a possible solution. We can subdivide each edge (u,v) into length(u,v)
edges, by inserting length(u,v)− 1 “dummy” nodes, and then apply BFS to the new graph. This algorithm solves
the shortest-path problem in time O(∑(u,v)∈E length(u,v)). Unfortunately, this can be very large —lengths could be
in the thousands or millions. So we need to find a better way.
The problem is that this BFS-based algorithm will spend most of its time visiting “dummy” vertices; only
occasionally will it do something truly interesting, like visit a vertex of the original graph. What we would like to
do is run this algorithm, but only do work for the “interesting” steps.
To do this, we need to generalize BFS. Instead of using a queue, we will use a heap or priority queue
of vertices. A heap is a data structure that keeps a set of objects, where each object has an associated value. The
operations a heap H implements include the following:
deletemin(H) return the object with the smallest value
insert(x,y,H) insert a new object x/value y pair in the structure
change(x,y,H) if y is smaller than x’s current value,
change the value of object x to y
We will not distinguish between insert and change, since for our purposes, they are essentially equivalent;
changing the value of a vertex will be like re-inserting it. (In all heap implementations we assume that we have an
array of pointers that gives, for each vertex, its position in the heap, if any. This allows us to always have at most
one copy of each vertex in the heap. Furthermore, it makes changes and inserts essentially equivalent operations.)
Each entry in the heap will stand for a projected future “interesting event” of our extended BFS. Each entry will
correspond to a vertex, and its value will be the current projected time at which we will reach the vertex. Another
way to think of this is to imagine that, each time we reach a new vertex, we can send an explorer down each adjacent
edge, and this explorer moves at a rate of 1 unit distance per second. With our heap, we will keep track of when each
vertex is due to be reached for the first time by some explorer. Note that the projected time until we reach a vertex
can decrease, because the new explorers that arise when we reach a newly explored vertex could reach a vertex first
(see node b in Figure 4.2). But one thing is certain: the most imminent future scheduled arrival of an explorer must
happen, because there is no other explorer who can reach any vertex faster. The heap conveniently delivers this most
imminent event to us.
As in all shortest path algorithms we shall see, we maintain two arrays indexed by V . The first array, dist[v],
will eventually contain the true distance of v from s. The other array, prev[v], will contain the last node before v in
the shortest path from s to v. Our algorithm maintains a useful invariant property: at all times dist[v] will contain a
conservative over-estimate of the true shortest distance of v from s. Of course dist[s] is initialized to its true value 0,
and all other dist’s are initialized to ∞, which is a remarkably conservative overestimate. The algorithm is known as
Dijkstra’s algorithm, named after its inventor.
Algorithm Dijkstra(G = (V,E, length); s ∈ V)
    v, w: vertices
    dist: array[V] of integer
    prev: array[V] of vertices
    H: priority heap of V
    H := {s : 0}
    for v ∈ V do
        dist[v] := ∞, prev[v] := nil
    rof
    dist[s] := 0
    while H ≠ ∅
        v := deletemin(H)
        for (v,w) ∈ E do
            if dist[w] > dist[v] + length(v,w) then
                dist[w] := dist[v] + length(v,w), prev[w] := v, insert(w, dist[w], H)
            fi
        rof
    end while
end Dijkstra
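Python's heapq module can play the role of the priority heap. It has no change operation, so the sketch below simply pushes a new (distance, vertex) pair whenever a distance improves and skips stale entries when they are popped; this is one common way to realize the algorithm (the adjacency-list format and names are ours).

    import heapq

    def dijkstra(adj, s):
        """adj[v] is a list of (w, length) pairs; returns the dist and prev maps."""
        dist = {v: float('inf') for v in adj}
        prev = {v: None for v in adj}
        dist[s] = 0
        H = [(0, s)]
        while H:
            d, v = heapq.heappop(H)
            if d > dist[v]:                    # stale entry; v was already finalized
                continue
            for w, length in adj[v]:
                if dist[w] > dist[v] + length:
                    dist[w] = dist[v] + length
                    prev[w] = v
                    heapq.heappush(H, (dist[w], w))
        return dist, prev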
The Bellman-Ford algorithm, which instead performs |V | − 1 rounds of updates over all the edges, solves the general
single-source shortest path problem, even when some edge lengths are negative, in O(|V | · |E|) time.
Negative Cycles
In fact, there is a further problem that negative edges can cause. Suppose the length of edge (b,a) in Figure 4.2
were changed to −5. Then the graph would have a negative cycle (from a to b and back). On such graphs, it does not
make sense to even ask the shortest path question. What is the shortest path from s to c in the modified graph? The
one that goes directly from s to a to c (cost: 3), or the one that goes from s to a to b to a to c (cost: 1), or the one that
takes the cycle twice (cost: -1)? And so on.
The shortest path problem is ill-posed in graphs with negative cycles. It makes no sense and deserves no
answer. Our algorithm in the previous section works only in the absence of negative cycles. (Where did we assume
no negative cycles in our correctness argument? Answer: When we asserted that a shortest path from s to a exists!)
But it would be useful if our algorithm were able to detect whether there is a negative cycle in the graph, and thus to
report reliably on the meaningfulness of the shortest path answers it provides.
This is easily done. After the |V | − 1 rounds of updates of all edges, do a last update. If any changes occur
during this last round of updates, there is a negative cycle. This must be true, because if there were no negative
cycles, |V |−1 rounds of updates would have been sufficient to find the shortest paths.
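As an illustration, here is a Python sketch of this scheme: |V| − 1 rounds of updates over every edge, followed by one extra round to detect a negative cycle (the edge-list representation and names are ours).

    def bellman_ford(vertices, edges, s):
        """edges is a list of (u, v, length) triples; returns (dist, found_negative_cycle)."""
        dist = {v: float('inf') for v in vertices}
        dist[s] = 0
        for _ in range(len(vertices) - 1):            # |V| - 1 rounds of updates of all edges
            for u, v, length in edges:
                if dist[u] + length < dist[v]:
                    dist[v] = dist[u] + length
        # one last round: any further improvement means a negative cycle is reachable from s
        found_negative_cycle = any(dist[u] + length < dist[v] for u, v, length in edges)
        return dist, found_negative_cycle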
Shortest Paths on DAG’s
There are two subclasses of weighted graphs that automatically exclude the possibility of negative cycles:
graphs with non-negative weights and DAG’s. We have already seen that there is a fast algorithm when the weights
are non-negative. Here we will give a linear algorithm for single-source shortest paths in DAG’s.
Our algorithm is based on the same principle as our algorithm for negative weights. We are trying to find a
sequence of updates, such that all shortest paths are its subsequences. But in a DAG we know that all shortest paths
from s must go in the topological order of the DAG. All we have to do then is first topologically sort the DAG using
a DFS, and then visit all edges coming out of nodes in the topological order. This algorithm solves the general
single-source shortest path problem for DAG’s in O(m) time.
CS124 Lecture 5 Spring 2002
Minimum Spanning Trees
A tree is an undirected graph which is connected and acyclic. It is easy to show that a graph G(V,E) that
satisfies any two of the following properties also satisfies the third, and is therefore a tree:
• G(V,E) is connected
• G(V,E) is acyclic
• |E| = |V |−1
(Exercise: Show that any two of the above properties implies the third (use induction).)
A spanning tree in an undirected graph G(V,E) is a subset of edges T ⊆ E that are acyclic and connect all the
vertices in V . It follows from the above conditions that a spanning tree must consist of exactly n− 1 edges. Now
suppose that each edge has a weight associated with it: w : E → Z. Say that the weight of a tree T is the sum of the
weights of its edges; w(T ) = ∑e∈T w(e). The minimum spanning tree in a weighted graph G(V,E) is one which has
the smallest weight among all spanning trees in G(V,E).
As an example of why one might want to find a minimum spanning tree, consider someone who has to install
the wiring to network together a large computer system. The requirement is that all machines be able to reach each
other via some sequence of intermediate connections. By representing each machine as a vertex and the cost of
wiring two machines together by a weighted edge, the problem of finding the minimum cost wiring scheme reduces
to the minimum spanning tree problem.
In general, the number of spanning trees in G(V,E) grows exponentially in the number of vertices in G(V,E).
(Exercise: Try to determine the number of different spanning trees for a complete graph on n vertices.) Therefore
it is infeasible to search through all possible spanning trees to find the lightest one. Luckily it is not necessary
to examine all possible spanning trees; minimum spanning trees satisfy a very important property which makes it
possible to efficiently zoom in on the answer.
We shall construct the minimum spanning tree by successively selecting edges to include in the tree. We will
guarantee after the inclusion of each new edge that the selected edges, X , form a subset of some minimum spanning
tree, T . How can we guarantee this if we don’t yet know any minimum spanning tree in the graph? The following
property provides this guarantee:
Cut property: Let X ⊆ T where T is a MST in G(V,E). Let S ⊂V such that no edge in X crosses between S
and V −S; i.e. no edge in X has one endpoint in S and one endpoint in V −S. Among edges crossing between S and
V −S, let e be an edge of minimum weight. Then X ∪ {e} ⊆ T ′ where T ′ is a MST in G(V,E).
The cut property says that we can construct our tree greedily. Our greedy algorithms can simply take the
minimum weight edge across two regions not yet connected. Eventually, if we keep acting in this greedy manner,
we will arrive at the point where we have a minimum spanning tree. Although the idea of acting greedily at each
point may seem quite intuitive, it is very unusual for such a strategy to actually lead to an optimal solution, as we
will see when we examine other problems!
Proof: Suppose e ∉ T . Adding e into T creates a unique cycle. We will remove a single edge e ′ from this
unique cycle, thus getting T ′ = (T ∪ {e}) − {e′}. It is easy to see that T ′ must be a tree — it is connected and has
n−1 edges. Furthermore, as we shall show below, it is always possible to select an edge e ′ in the cycle such that it
crosses between S and V −S. Now, since e is a minimum weight edge crossing between S and V −S, w(e ′) ≥ w(e).
Therefore w(T ′) = w(T )+ w(e)−w(e′) ≤ w(T ). However since T is a MST, it follows that T ′ is also a MST and
w(e) = w(e′). Furthermore, since X has no edge crossing between S and V − S, it follows that X ⊆ T ′ and thus
X ∪ {e} ⊆ T ′.
How do we know that there is an edge e′ ≠ e in the unique cycle created by adding e into T , such that e′ crosses
between S and V −S? This is easy to see, because as we trace the cycle, e crosses between S and V −S, and we must
cross back along some other edge to return to the starting point.
In light of this, the basic outline of our minimum spanning tree algorithms is going to be the following:
X := { }
Repeat until |X | = n−1:
    Pick a set S ⊆ V such that no edge in X crosses between S and V −S.
    Let e be a lightest edge in G(V,E) that crosses between S and V −S.
    X := X ∪ {e}.
The difference between minimum spanning tree algorithms lies in how we pick the set S at each step.
Prim’s algorithm:
In the case of Prim’s algorithm, X consists of a single tree, and the set S is the set of vertices of that tree. One
way to think of the algorithm is that it grows a single tree, adding a new vertex at each step, until it has the minimum
spanning tree. In order to find the lightest edge crossing between S and V − S, Prim’s algorithm maintains a heap
containing all those vertices in V − S which are adjacent to some vertex in S. The priority of a vertex v, according
to which the heap is ordered, is the weight of its lightest edge to a vertex in S. This is reminiscent of Dijkstra’s
algorithm (where distance was used for the heap instead of the edge weight). As in Dijkstra’s algorithm, each vertex
v will also have a parent pointer prev(v) which is the other endpoint of the lightest edge from v to a vertex in S. The
pseudocode for Prim’s algorithm is almost identical to that for Dijkstra’s algorithm:
Procedure Prim(G(V,E), s)
    v, w: vertices
    dist: array[V] of integer
    prev: array[V] of vertices
    S: set of vertices, initially empty
    H: priority heap of V
    H := {s : 0}
    for v ∈ V do
        dist[v] := ∞, prev[v] := nil
    rof
    dist[s] := 0
    while H ≠ ∅
        v := deletemin(H)
        S := S ∪ {v}
        for (v,w) ∈ E and w ∈ V −S do
            if dist[w] > length(v,w) then
                dist[w] := length(v,w), prev[w] := v, insert(w, dist[w], H)
            fi
        rof
    end while
end Prim
Note that each vertex is “inserted” on the heap at most once; other insert operations simply change the value on
the heap. The vertices that are removed from the heap form the set S for the cut property. The set X of edges chosen
to be included in the MST are given by the parent pointers of the vertices in the set S. Since the smallest key in the
heap at any time gives the lightest edge crossing between S and V −S, Prim’s algorithm follows the generic outline
for a MST algorithm presented above, and therefore its correctness follows from the cut property.
The running time of Prim’s algorithm is clearly the same as Dijkstra’s algorithm, since the only change is how
we prioritize nodes in the heap. Thus, if we use d-heaps, the running time of Prim’s algorithm is O(m log_{m/n} n).
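A Python sketch of Prim's algorithm in the same style as the Dijkstra sketch earlier (heapq with stale entries skipped; adjacency lists of (w, weight) pairs; the names are ours):

    import heapq

    def prim(adj, s):
        """Return the set of MST edges for a connected weighted graph adj."""
        dist = {v: float('inf') for v in adj}   # lightest known edge from the tree to v
        prev = {v: None for v in adj}
        in_tree = set()
        dist[s] = 0
        H = [(0, s)]
        while H:
            _, v = heapq.heappop(H)
            if v in in_tree:
                continue                        # stale heap entry
            in_tree.add(v)
            for w, weight in adj[v]:
                if w not in in_tree and weight < dist[w]:
                    dist[w] = weight
                    prev[w] = v
                    heapq.heappush(H, (weight, w))
        return {(prev[v], v) for v in adj if prev[v] is not None}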
Kruskal’s algorithm:
Kruskal’s algorithm uses a different strategy from Prim’s algorithm. Instead of growing a single tree, Kruskal’s
algorithm attempts to put the lightest edge possible in the tree at each step. Kruskal’s algorithm starts with the edges
sorted in increasing order by weight. Initially X = { }, and each vertex in the graph is regarded as a trivial tree (with
no edges). Each edge in the sorted list is examined in order, and if its endpoints are in the same tree, then the edge is
discarded; otherwise it is included in X and this causes the two trees containing the endpoints of this edge to merge
into a single tree. Note that, by this process, we are implicitly choosing a set S ⊆ V with no edge in X crossing
between S and V −S, so this fits in our basic outline of a minimum spanning tree algorithm.
To implement Kruskal’s algorithm, given a forest of trees, we must decide given two vertices whether they
belong to the same tree. For the purposes of this test, each tree in the forest can be represented by a set consisting of
the vertices in that tree. We also need to be able to update our data structure to reflect the merging of two trees into a
single tree. Thus our data structure will maintain a collection of disjoint sets (disjoint since each vertex is in exactly
one tree), and support the following three operations:
• MAKESET(x): Create a new set containing only the element x.
• FIND(x): Given an element x, which set does it belong to?
• UNION(x,y): replace the set containing x and the set containing y by their union.
The pseudocode for Kruskal’s algorithm follows:
Function Kruskal(graph G(V,E))
    set X
    X := { }
    E := sort E by weight
    for u ∈ V
        MAKESET(u)
    rof
    for (u,v) ∈ E (in increasing order) do
        if FIND(u) ≠ FIND(v) then
            X := X ∪ {(u,v)}
            UNION(u,v)
        fi
    rof
    return(X)
end Kruskal
The correctness of Kruskal’s algorithm follows from the following argument: Kruskal’s algorithm adds an edge
e into X only if it connects two trees; let S be the set of vertices in one of these two trees. Then e must be the first
edge in the sorted edge list that has one endpoint in S and the other endpoint in V − S, and is therefore the lightest
edge that crosses between S and V −S. Thus the cut property of MST implies the correctness of the algorithm.
The running time of the algorithm, assuming the edges are given in sorted order, is dominated by the set
operations: UNION and FIND. There are n−1 UNION operations (one corresponding to each edge in the spanning
tree), and 2m FIND operations (2 for each edge). Thus the total time of Kruskal’s algorithm is O(m×FIND+n×
UNION). We will soon show that this is O(m log∗ n). Note that, if the edges are not initially given in sorted order,
then to sort them in the obvious way takes O(m logm) time, and this would be the dominant part of the running time
of the algorithm.
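A Python sketch of Kruskal's algorithm; the FIND and UNION here are deliberately bare-bones (the more careful union-find data structure is developed in the next lecture).

    def kruskal(vertices, edges):
        """edges is a list of (weight, u, v) triples; returns the list X of MST edges."""
        parent = {v: v for v in vertices}          # MAKESET for every vertex

        def find(x):                                # walk up to the root of x's tree
            while parent[x] != x:
                x = parent[x]
            return x

        X = []
        for weight, u, v in sorted(edges):          # edges in increasing order of weight
            ru, rv = find(u), find(v)
            if ru != rv:                            # endpoints lie in different trees
                X.append((u, v))
                parent[ru] = rv                     # UNION: link the two roots
        return X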
Exchange Property
Actually spanning trees satisfy an even stronger property than the cut property — the exchange property. The
exchange property is quite remarkable since it implies that we can “walk” from any spanning tree T to a minimum
spanning tree T̂ by a sequence of exchange moves — each such move consists of throwing an edge out of the current
tree that is not in T̂ , and adding a new edge into the current tree that is in T̂ . Moreover, each successive tree in the
“walk” is guaranteed to weigh no more than its predecessor.
Exchange property: Let T and T ′ be spanning trees in G(V,E). Given any e′ ∈ T ′ −T , there exists an edge
e ∈ T −T ′ such that (T − {e}) ∪ {e′} is also a spanning tree.
The proof is quite similar to that of the cut property. Adding e′ into T results in a unique cycle. There must be
some edge in this cycle that is not in T ′ (since otherwise T ′ must have a cycle). Call this edge e. Then deleting e
restores a spanning tree, since connectivity is not affected, and the number of edges is restored to n−1.
To see how one may use this exchange property to “walk” from any spanning tree to a MST: let T be any
spanning tree and let T̂ be a MST in G(V,E). Let e′ be the lightest edge that is not in both trees. Perform an
exchange using this edge. Since the exchange was done with the lightest such edge, the new tree must be lighter than
the old one. Since T̂ is already a MST, it follows that the exchange must have been performed upon T and results in
a lighter spanning tree which has more edges in common with T̂ (if there are several edges of the same weight, then
the new tree might not be lighter, but it still has more edges in common with T̂ ).
Figure 5.1: An example of Prim’s algorithm and Kruskal’s algorithm. Which is which?
CS124 Lecture 6 Spring 2002
Disjoint set (Union-Find)
For Kruskal’s algorithm for the minimum spanning tree problem, we found that we needed a data structure for
maintaining a collection of disjoint sets. That is, we need a data structure that can handle the following operations:
• MAKESET(x) - create a new set containing the single element x
• UNION(x,y) - replace two sets containing x and y by their union.
• FIND(x) - return the name of the set containing the element x
Naturally, this data structure is useful in other situations, so we shall consider its implementation in some detail.
Within our data structure, each set is represented by a tree, so that each element points to a parent in the tree.
The root of each tree will point to itself. In fact, we shall use the root of the tree as the name of the set itself; hence
the name of each set is given by a canonical element, namely the root of the associated tree.
It is convenient to add a fourth operation LINK(x,y) to the above, where we require for LINK that x and y are
two roots. LINK changes the parent pointer of one of the roots, say x, and makes it point to y. It returns the root
of the now composite tree y. With this addition, we have UNION(x,y) = LINK(FIND(x),FIND(y)), so the main
problem is to arrange our data structure so that FIND operations are very efficient.
Notice that the time to do a FIND operation on an element corresponds to its depth in the tree. Hence our goal is
to keep the trees short. Two well-known heuristics for keeping trees short in this setting are UNION BY RANK and
PATH COMPRESSION. We start with the UNION BY RANK heuristic. The idea of UNION BY RANK is to ensure
that when we combine two trees, we try to keep the overall depth of the resulting tree small. This is implemented as
follows: the rank of an element x is initialized to 0 by MAKESET. An element’s rank is only updated by the LINK
operation. If x and y have the same rank r, then invoking LINK(x,y) causes the parent pointer of x to be updated to
point to y, and the rank of y is then updated to r + 1. On the other hand, if x and y have different rank, then when
invoking LINK(x,y) the parent pointer of the element with smaller rank is updated to point to the element with larger
rank. The idea is that the rank of the root is associated with the depth of the tree, so this process keeps the depth
small. (Exercise: Try some examples by hand with and without using the UNION BY RANK heuristic.)
The idea of PATH COMPRESSION is that, once we perform a FIND on some element, we should adjust its
parent pointer so that it points directly to the root; that way, if we ever do another FIND on it, we start out much
closer to the root. Note that, until we do a FIND on an element, it might not be worth the effort to update its parent
pointer, since we may never access it at all. Once we access an item, however, we must walk through every pointer
to the root, so modifying the pointers only changes the cost of this walk by a constant factor.
procedure MAKESET(x)
    p(x) := x
    rank(x) := 0
end

function FIND(x)
    if x ≠ p(x) then
        p(x) := FIND(p(x))
    return(p(x))
end

function LINK(x,y)
    if rank(x) > rank(y) then x ↔ y
    if rank(x) = rank(y) then rank(y) := rank(y)+1
    p(x) := y
    return(y)
end

procedure UNION(x,y)
    LINK(FIND(x),FIND(y))
end
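The same procedures in Python, as a sketch; the dictionaries p and rank stand in for the parent pointers and ranks.

    p, rank = {}, {}

    def makeset(x):
        p[x] = x
        rank[x] = 0

    def find(x):
        if x != p[x]:
            p[x] = find(p[x])       # path compression: point x directly at the root
        return p[x]

    def link(x, y):
        if rank[x] > rank[y]:
            x, y = y, x             # make y the root of larger (or equal) rank
        if rank[x] == rank[y]:
            rank[y] += 1
        p[x] = y
        return y

    def union(x, y):
        return link(find(x), find(y))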
In our analysis, we show that any sequence of m UNION and FIND operations on n elements take at most
O((m + n) log∗ n) steps, where log∗ n is the number of times you must iterate the log2 function on n before getting
a number less than or equal to 1. (So log∗ 4 = 2, log∗ 16 = 3, log∗ 65536 = 4.) We should note that this is not the
tightest analysis possible; however, this analysis is already somewhat complex!
Note that we are going to do an amortized analysis here. That is, we are going to consider the cost of the
algorithm over a sequence of steps, instead of considering the cost of a single operation. In fact a single UNION or
FIND operation could require O(logn) operations. (Exercise: Prove this!) Only by considering an entire sequence
of operations at once can we obtain the above bound. Our argument will require some interesting accounting to total the
cost of a sequence of steps.
We first make a few observations about rank.
• if v ≠ p(v) then rank(p(v)) > rank(v)
• whenever p(v) is updated, rank(p(v)) increases
• the number of elements with rank k is at most n/2^k
• the number of elements with rank at least k is at most n/2^{k−1}
The first two assertions are immediate from the description of the algorithm. The third assertion follows from
the fact that the rank of an element v changes only if LINK(v,w) is executed, rank(v) = rank(w), and v remains
the root of the combined tree; in this case v’s rank is incremented by 1. A simple induction then yields that when
rank(v) is incremented to k, the resulting tree has at least 2^k elements. The last assertion then follows from the third
assertion, as ∑_{j=k}^{∞} n/2^j = n/2^{k−1}.
Exercise: Show that the maximum rank an item can have is logn.
As soon as an element becomes a non-root, its rank is fixed. Let us divide the (non-root) elements into groups
according to their ranks. Group i contains all elements whose rank r satisfies log∗ r = i. For example, elements in
group 3 have ranks in the range (4,16], and the range of ranks associated with group i is (2^{i−1}, 2^{2^{i−1}}]. For convenience
we shall write this more simply by saying group (k, 2^k] to mean the group with these ranks.
It is easy to establish the following assertions about these groups:
• The number of distinct groups is at most log∗ n. (Use the fact that the maximum rank is log n.)
• The number of elements in the group (k, 2^k] is at most n/2^k.
Let us assign 2^k tokens to each element in group (k, 2^k]. The total number of tokens assigned to all elements
from that group is then at most 2^k · n/2^k = n, and the total number of groups is at most log∗ n, so the total number of
tokens given out is at most n log∗ n. We use these tokens to account for the work done by FIND operations.
Recall that the number of steps for a FIND operation is proportional to the number of pointers that the FIND
operation must follow up the tree. We separate the pointers into two groups, depending on the groups of u and
p(u) = v, as follows:
• Type 1: a pointer is of Type 1 if u and v belong to different groups, or v is the root.
• Type 2: a pointer is of Type 2 if u and v belong to the same group.
We account for the two Types of pointers in two different ways. Type 1 links are “charged” directly to the FIND
operation; Type 2 links are “charged” to u, who “pays” for the operation using one of the tokens. Let us consider
these charges more carefully.
The number of Type 1 links each FIND operation goes through is at most log∗ n, since there are only log∗ n
groups, and the group number increases as we move up the tree.
What about Type 2 links? We charge these links directly back to u, who is supposed to pay for them with a
token. Does u have enough tokens? The point here is that each time a FIND operation goes through an element u,
its parent pointer is changed to the current root of the tree (by PATH COMPRESSION), so the rank of its parent
increases by at least 1. If u is in the group (k, 2^k], then the rank of u’s parent can increase fewer than 2^k times before
it moves to a higher group. Therefore the 2k tokens we assign to u are sufficient to pay for all FIND operations that
go through u to a parent in the same group.
We now count the total number of steps for m UNION and FIND operations. Clearly LINK requires just O(1)
steps, and since a UNION operation is just a LINK and 2 FIND operations, it suffices to bound the time for at most
2m FIND operations. Each FIND operation is charged at most log∗ n for a total of O(m log∗ n). The total number
of tokens used is at most n log∗ n, and each token pays for a constant number of steps. Therefore the total number of
steps is O((m+n) log∗ n).
Let us give a more equation-oriented explanation. The total time spent over the course of m UNION and FIND
operations is just
∑_{all FIND ops} (# links passed through).

We split this sum up into two parts:

∑_{all FIND ops} (# links in same group) + ∑_{all FIND ops} (# links in different groups).

(Technically, the case where a link goes to the root should be handled explicitly; however, this is just O(m) links in
total, so we don’t need to worry!) The second term is clearly O(m log∗ n). The first term can be upper bounded by:

∑_{all elements u} (# ranks in the group of u),

because each element u can be charged only once for each rank in its group. (Note here that this is because the links
to the root count in the second sum!) This last sum is bounded above by

∑_{all groups} (# items in group) · (# ranks in group) ≤ ∑_{k=1}^{log∗ n} (n/2^k) · 2^k ≤ n log∗ n.
This completes the proof.
Figure 6.1: Examples of UNION BY RANK and PATH COMPRESSION.
CS124 Lecture 7
In today’s lecture we will be looking a bit more closely at the Greedy approach to designing algorithms. As we
will see, sometimes it works, and sometimes even when it doesn’t, it can provide a useful result.
Horn Formulae
A simple application of the greedy paradigm solves an important special case of the SAT problem. We have
already seen that 2SAT can be solved in linear time. Now consider SAT instances where in each clause, there is at
most one positive literal. Such formulae are called Horn formulae; for example, this is an instance:
In the above, I(A[i] ≠ B[ j]) represents the value 1 if A[i] ≠ B[ j] and 0 if A[i] = B[ j]. We obtain the above expression by considering the
possible edit operations available. Suppose our last operation is a Delete, so that we deleted the ith character of A to
transform A[1 . . . i] to B[1 . . . j]. Then we must have transformed A[1 . . . i−1] to B[1 . . . j], and hence the edit distance
would be D(i−1, j)+1, or the cost of the transformation from A[1 . . . i−1] to B[1 . . . j] plus one for the cost of the
final Delete. Similarly, if the last operation is an Insert, the cost would be D(i, j−1)+1.
The other possibility is that the last operation is a Replace of the ith character of A with the jth character of B,
or a Match between these two characters. If there is a Match, then the two characters must be the same, and the cost
is D(i−1, j−1). If there is a Replace, then the two characters should be different, and the cost is D(i−1, j−1)+1.
We combine these two cases in our formula, using D(i−1, j−1) + I(A[i] ≠ B[ j]).
Our recurrence takes the minimum of all these possibilities, expressing the fact that we want the best possible
choice for the final operation!
It is worth noticing that our recursive description does not work when i or j is 0. However, these cases are
trivial. We have
D(i,0) = i,
since the only way to transform the first i characters of A into nothing is to delete them all. Similarly,
D(0, j) = j.
Again, it is helpful to think of the computation of the D(i, j) as filling up a two-dimensional array. Here, we
begin with the first column and first row filled. We can then fill up the rest of the array in various ways: row by row,
column by column, or diagonal by diagonal!
Besides computing the distance, we may want to compute the actual transformation. To do this, when we fill
the array, we may also picture filling the array with pointers. For example, if the minimal distance for D(i, j) was
obtained by a final Delete operation, then the cell (i, j) in the table should have a pointer to (i− 1, j). Note that a
cell can have multiple pointers, if the minimum distance could have been achieved in multiple ways. Now any path
back from (n,m) to (0,0) corresponds to a sequence of operations that yields the minimum distance D(n,m), so the
transformation can be found by following pointers.
The total computation time and space required for this algorithm is O(nm).
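A Python sketch that fills the array row by row; only the distance is computed, but pointers for recovering the transformation could be stored alongside each entry.

    def edit_distance(A, B):
        """Minimum number of insert, delete, and replace operations turning A into B."""
        n, m = len(A), len(B)
        D = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            D[i][0] = i                        # delete the first i characters of A
        for j in range(m + 1):
            D[0][j] = j                        # insert the first j characters of B
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                differ = 0 if A[i - 1] == B[j - 1] else 1
                D[i][j] = min(D[i - 1][j] + 1,           # final operation is a Delete
                              D[i][j - 1] + 1,           # final operation is an Insert
                              D[i - 1][j - 1] + differ)  # Match or Replace
        return D[n][m]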
9.3 All pairs shortest paths
Let G be a graph with positive edge weights. We want to calculate the shortest paths between every pair of nodes.
One way to do this is to run Dijkstra’s algorithm several times, once for each node. Here we develop a different
dynamic programming solution.
Our subproblems will be shortest paths using only nodes 1 . . . k as intermediate nodes. Of course when k equals
the number of nodes in the graph, n, we will have solved the original problem.
We let the matrix Dk[i, j] represent the length of the shortest path between i and j using intermediate nodes 1 . . .k.
Initially, we set a matrix D0 with the direct distances between nodes, given by di j . Then Dk is easily computed from
the subproblems Dk−1 as follows:
Dk[i, j] = min(Dk−1[i, j],Dk−1[i,k]+Dk−1[k, j]).
The idea is the shortest path using intermediate nodes 1 . . .k either completely avoids node k, in which case it
has the same length as Dk−1[i, j]; or it goes through k, in which case we can glue together the shortest paths found
from i to k and k to j using only intermediate nodes 1 . . .k−1 to find it.
It might seem that we need at least two matrices to code this, but in fact it can all be done in one loop. (Exercise:
think about it!)
D = (di j), distance array, with weights from all i to all j
for k = 1 to n do
for i = 1 to n do
for j = 1 to n do
D[i, j] = min(D[i, j], D[i,k]+D[k, j])
Note that again we can keep an auxiliary array to recall the actual paths. We simply keep track of the last
intermediate node found on the path from i to j. We reconstruct the path by successively reconstructing intermediate
nodes, until we reach the ends.
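The same loop in Python, as a sketch; D is assumed to be an n-by-n list of lists holding the direct distances d_ij, with 0 on the diagonal and float('inf') where there is no edge.

    def all_pairs_shortest_paths(D):
        """Update D in place so that D[i][j] becomes the shortest-path distance."""
        n = len(D)
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    if D[i][k] + D[k][j] < D[i][j]:
                        D[i][j] = D[i][k] + D[k][j]
        return D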
9.4 Traveling salesman problem
Suppose that you are given n cities and the distances di j between them. The traveling salesman problem (TSP) is to
find the shortest tour that takes you from your home city to all the other cities and back again. As there are (n−1)!
possible paths, this can clearly be done in O(n!) time by trying all possible paths. Of course this is not very efficient.
Since the TSP is NP-complete, we cannot really hope to find a polynomial time algorithm. But dynamic
programming gives us a much better algorithm than trying all the paths.
The key is to define the appropriate subproblem. Suppose that we label our home city by the symbol 1, and
other cities are labeled 2, . . . ,n. In this case, we use the following: for a subset S of vertices including 1 and at least
one other city, let C(S, j) be the length of the shortest path that starts at 1, visits all other nodes in S, and ends at j. Note that our
subproblems here look slightly different: instead of finding tours, we are simply finding paths. The important point
is that the shortest path from 1 to j through all the vertices in S consists of some shortest path from 1 to a vertex x,
where x ∈ S − { j}, and the additional edge from x to j.
for all j ≠ 1 do C({1, j}, j) := d1 j
for s = 3 to n do % s is the size of the subset
    for all subsets S of {1, . . . ,n} of size s containing 1 do
        for all j ∈ S, j ≠ 1 do
            C(S, j) := min_{i ∈ S − {1, j}} [C(S − { j}, i) + di j]
opt := min_{j ≠ 1} [C({1, . . . ,n}, j) + d j1]
The idea is to build up paths one node at a time, not worrying (at least temporarily) where they will end up.
Once we have paths that go through all the vertices, it is easy to check the tours, since they consist of a shortest path
through all the vertices plus an additional edge. The algorithm takes time O(n^2 2^n), as there are O(n 2^n) entries in the
table (one for each pair of set and city), and each takes O(n) time to fill. Of course we can add in structures so that
we can actually find the tour as well. Exercise: Consider how memory-efficient you can make this algorithm.
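A Python sketch of this dynamic program, using frozensets of cities as table keys; cities are numbered 0, ..., n−1 with 0 as the home city, so the indexing differs cosmetically from the pseudocode above.

    from itertools import combinations

    def tsp(d):
        """d[i][j] is the distance from city i to j; returns the length of the shortest tour."""
        n = len(d)
        C = {}
        for j in range(1, n):
            C[(frozenset([0, j]), j)] = d[0][j]        # the path that goes straight from 0 to j
        for s in range(3, n + 1):                      # s is the size of the subset
            for rest in combinations(range(1, n), s - 1):
                S = frozenset(rest) | {0}
                for j in rest:
                    C[(S, j)] = min(C[(S - {j}, i)] + d[i][j]
                                    for i in rest if i != j)
        full = frozenset(range(n))
        return min(C[(full, j)] + d[j][0] for j in range(1, n))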
CS124 Lecture 10 Spring 1999
10.1 The Birthday Paradox
How many people do there need to be in a room before with probability greater than 1/2 some two of them have thesame birthday? (Assume birthdays are distributed uniformly at random.)
Surprisingly, only 23. This is easily determined as follows: the probability the first two people have differentbirthdays is (1−1/365). The probability that the third person in the room then has a birthday different from the firsttwo, given the first two people have different birthdays, is (1−2/365), and so on. So the probability that all of thefirst k people have different birthdays is the product of these terms, or
(1 − 1/365) · (1 − 2/365) · (1 − 3/365) · · · (1 − (k−1)/365).
Determining the right value of k is now a simple exercise.
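The exercise takes only a few lines of Python (this snippet is ours, not part of the notes):

prob_all_distinct = 1.0
k = 1
while 1 - prob_all_distinct <= 0.5:
    k += 1
    prob_all_distinct *= 1 - (k - 1) / 365   # k-th person misses the first k-1 birthdays
print(k)   # prints 23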
10.2 Balls into Bins
Mathematically, the birthday paradox is an example of a more general mathematical question, often formulated in terms of balls and bins. Some number of balls n are thrown into some number of bins m. What does the distribution of balls and bins look like?
The birthday paradox is focused on the first time a ball lands in a bin with another ball. One might also ask how many of the bins are empty, how many balls are in the most full bin, and other sorts of questions.
Let us consider the question of how many bins are empty. Look at the first bin. For it to be empty, it has to be missed by all n balls. Since each ball hits the first bin with probability 1/m, the probability the first bin remains empty is
(1 − 1/m)^n ≈ e^{−n/m}.
Since the same argument holds for all bins, on average a fraction e^{−n/m} of the bins will remain empty.
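A quick simulation (ours; the parameters are arbitrary) shows how close the empirical fraction of empty bins is to e^{−n/m}:

import math, random

def empty_fraction(n, m, trials=100):
    empty = 0
    for _ in range(trials):
        bins = [0] * m
        for _ in range(n):
            bins[random.randrange(m)] += 1
        empty += sum(1 for b in bins if b == 0)
    return empty / (trials * m)

n, m = 1000, 1000
print(empty_fraction(n, m), math.exp(-n / m))   # both should be near 0.368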
Exercise: How many bins have 1 ball? 2?
10.3 Hash functions
A hash function is a deterministic mapping from one set into another that appears random. For example, mapping people into their birthdays can be thought of as a hash function.
In general, a hash function is a mapping f : {0, . . . , n−1} → {0, . . . , m−1}. Generally n ≫ m; for example, the number of people in the world is much bigger than the number of possible birthdays. There is a great deal of theory behind designing hash functions that "appear random." We will not go into that theory here, and instead assume that the hash functions we have available are in fact completely random. In other words, we assume that for each i (0 ≤ i ≤ n−1), the probability that f(i) = j is 1/m (for 0 ≤ j ≤ m−1). Notice that this does not mean that every time we look at f(i) we get a different random answer! The value of f(i) is fixed for all time; it is just equally likely to take on any value in the range.
While such completely random hash functions are unavailable in practice, they generally provide a good rough idea of how hashing schemes perform.
(An aside: in reality, birthdays are not completely random either. Seasonal distributions skew the calculation. How might this affect the birthday paradox?)
10.4 Applications: A Password-checker
We now consider a hashing application. Suppose you are administering a computer system, and you would like to make sure that nobody uses a common password. This protects against hackers, who can often determine if someone is using a common password (such as a name, or a common dictionary word) by gaining access to the encrypted password file and using an exhaustive search. When a user attempts to change their password, you would like to check their password against a dictionary of common passwords as quickly as possible.
One way to do this would be to use a standard search technique, such as binary search for the string in the sorted dictionary. This approach has two negative features. First, one must store the entire dictionary, which takes memory. Second, on a large dictionary, this approach might be slow. Instead we present a quick and space-efficient scheme based on hashing. The data structure we consider is commonly called a Bloom filter, after the originator.
Choose a table size m. Create a table consisting of m bits, initially all set to 0. Use a hash function on each of the n words in the dictionary, where the range of the hash function is [0, m). If the word hashes to value k, set the kth bit of the table to 1.
When a user attempts to change the password, hash the user's desired password and check the appropriate entry in the table. If there is a 1 there, reject the password; it could be a common one. Otherwise, accept it. A common password from the dictionary is always rejected. Assuming other strings are hashed to a random location, the probability of rejecting a password that should be accepted is 1 − e^{−n/m}.
It would seem one would need to choose m to be fairly large in order to make the probability of rejecting a potentially good password small. Space can be used more efficiently by making multiple tables, using a different hash function to set the bits for each table. To check a proposed password now requires more time, since several hash functions must be checked. However, as soon as a single 0 entry is found, the password can be accepted. The probability of rejecting a password that should be accepted when using h tables, each of size m, is then
(1 − e^{−n/m})^h.
The total space used is merely hm bits. Notice that the Bloom filter sometimes returns the wrong answer: we may reject a proposed password, even though it is not a common password. This sort of error is probably acceptable, as long as it doesn't happen so frequently as to bother users. Fortunately this error is one-sided; a common password is never accepted. One must set the parameters m and h appropriately to trade off this error probability against space and time requirements.
For example, consider a dictionary of 100,000 common passwords, each of which is on average 7 characters long. Uncompressed this would be 700,000 bytes. Compression might reduce it substantially, to around 300,000 bytes. Of course, then one has the problems of searching efficiently on a compressed list.
Instead, one could keep a 100,000 byte Bloom filter, consisting of 5 tables of 160,000 bits. The probability of rejecting a reasonable password is just over 2%. The cost for checking a password is at most 5 hashes and 5 lookups into the table.
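A minimal sketch of such a filter in Python; using Python's built-in hash, salted by the table number, as a stand-in for h independent random hash functions is our own shortcut for illustration, not something a real password checker should rely on:

class BloomFilter:
    def __init__(self, m, h):
        self.m = m                                     # bits per table
        self.h = h                                     # number of tables / hash functions
        self.tables = [[0] * m for _ in range(h)]

    def _index(self, i, word):
        return hash((i, word)) % self.m                # stand-in for the i-th random hash function

    def add(self, word):
        for i in range(self.h):
            self.tables[i][self._index(i, word)] = 1

    def is_possibly_common(self, word):
        # True means reject (the word may be common); False means it is certainly not in the dictionary
        return all(self.tables[i][self._index(i, word)] == 1 for i in range(self.h))

# bf = BloomFilter(m=160000, h=5)
# for w in common_passwords: bf.add(w)        # common_passwords is a hypothetical list of dictionary words
# bf.is_possibly_common("letmein")            # rejected if it is in the dictionary, and occasionally even if not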
CS 124 Lecture 11
11.1 Applications: Fingerprinting for pattern matching
Suppose we are trying to find a pattern string P in a long document D. How can we do it quickly and efficiently?
Hash the pattern P into say a 16 bit value. Now, run through the file, hashing each set of |P| consecutive characters into a 16 bit value. If we ever get a match for a pattern, we can check to see if it corresponds to an actual pattern match. (In this case, we want to double-check and not report any false matches!) Otherwise we can just move on. We can use more than 16 bits, too; we would like to use enough bits so that we will obtain few false matches.
This scheme is efficient, as long as hashing is efficient. Of course hashing can be a very expensive operation, so in order for this approach to work, we need to be able to hash quickly on average. In fact, a simple hashing technique allows us to do so in constant time per operation!
The easiest way to picture the process is to think of the file as a sequence of digits, and the pattern as a number. Then we move a pointer in the file one character at a time, seeing if the next |P| digits gives us a number equal to the number corresponding to the pattern. Each time we read a character in the file, the number we are looking at changes in a natural way: the leftmost digit a is removed, and a new rightmost digit b is inserted. Hence, we update an old number N and obtain a new number N′ by computing
N′ = 10 · (N − 10^{|P|−1} · a) + b.
When dealing with a string, we will be reading characters (bytes) instead of numbers. Also, we will not want to keep the whole pattern as a number. If the pattern is large, then the corresponding number may be too large to do effective comparisons! Instead, we hash all numbers down into say 16 bits, by reducing them modulo some appropriate prime p. We then do all the mathematics (multiplication, addition) modulo p, i.e.
N′ = [10 · (N − 10^{|P|−1} · a) + b] mod p.
All operations mod p can be made quite efficient, so each new hash value takes only constant time to compute!
This pattern matching technique is often called fingerprinting. The idea is that the hash of the pattern creates an almost unique identifier for the pattern, like a fingerprint. If we ever find two fingerprints that match, we have a good reason to expect that they must come from the same pattern. Of course, unlike real fingerprints, our hashing-based fingerprints do not actually uniquely identify a pattern, so we still need to check for false matches. But since false matches should be rare, the algorithm is very efficient!
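Here is a sketch of the search in Python for the digit-string setting described above (the prime 251 matches the example in Figure 11.1; the function name and structure are ours):

def fingerprint_search(text, pattern, p=251):
    # text and pattern are strings of decimal digits
    k = len(pattern)
    target = int(pattern) % p
    power = pow(10, k - 1, p)                         # 10^(|P|-1) mod p, used to drop the leftmost digit a
    N = int(text[:k]) % p
    matches = []
    for i in range(len(text) - k + 1):
        if N == target and text[i:i + k] == pattern:  # double-check so we never report a false match
            matches.append(i)
        if i + k < len(text):
            a, b = int(text[i]), int(text[i + k])
            N = (10 * (N - power * a) + b) % p        # the update N' = [10(N - 10^(|P|-1) a) + b] mod p
    return matches

# fingerprint_search("6386179357342", "17935")  ->  [4]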
See Figure 11.1 for an example of fingerprinting.
P = 17935, p = 251
Text: 6386179357342...
P mod p = 114
63861 mod p = 107
38617 mod p = 214
86179 mod p = 86
61793 mod p = 47
17935 mod p = 114
79357 mod p = 41
93573 mod p = 201
35734 mod p = 92
57342 mod p = 114
Figure 11.1: A fingerprinting example. The pattern P is a 5 digit number. Note successive calculations take constant time: 38617 mod p = ((63861 mod p) − (60000 mod p)) · 10 + 7 mod p. Also note that false matches are possible (but unlikely); 57342 mod p = 17935 mod p = 114.
One question remains. How should we choose the prime p? We would like the prime we choose to work well, in that it should have few false matches. The problem is that for every prime, there are certainly some bad patterns and documents. If we choose a prime in advance, then someone can try to set up a document and pattern that will cause a lot of false matches, making our fingerprinting algorithm go very slowly.
A natural approach is to choose the prime p randomly. This way, nobody can set up a bad pattern and document in advance, since they are not sure what prime we will choose.
Let us make this a bit more rigorous. Let π(x) represent the number of primes that are less than or equal to x. It will be helpful to use the following fact:
Fact: x/ln x ≤ π(x) ≤ 1.26 x/ln x.
Consider any point in the algorithm, where the pattern and document do not match. If our pattern has length |P|, then at that point we are comparing two numbers that are each less than 10^{|P|}. In particular, their difference (in absolute value) is less than 10^{|P|}. What is the probability that a random prime divides this difference? That is, what is the probability that for the random prime we choose, the two numbers corresponding to the pattern and the current |P| digits in the document are equal modulo p?
First, note that there are at most log2 10^{|P|} distinct primes that divide the difference, since the difference is at most 10^{|P|} (in absolute value), and each distinct prime divisor is at least 2. Hence, if we choose our prime randomly
from all primes up to Z, the probability we have a false match is at most
log2 10^{|P|} / π(Z).
Now the probability that we have a false match anywhere is at most |D| times the probability that we have a false match in any single location, by the union bound. Hence the probability that we have a false match anywhere is at most
|D| · log2 10^{|P|} / π(Z).
Exercise: How big should we make Z in order to make the probability of a false match anywhere in the algorithm less than 1/100?
How could we improve the probability of a false match? One way is to choose from a larger set of primes. Another way is to choose not just one random prime, but several random primes up to Z. This is like choosing several hash functions in the Bloom filter problem. There is a false match only if there is a false match at every random prime we choose. If we choose k primes (with replacement) from the primes up to Z, the probability of a false match at a specific point is at most
(log2 10^{|P|} / π(Z))^k.
CS124 Lecture 12
12.1 Near duplicate documents1
Suppose we are designing a major search engine. We would like to avoid answering user queries with multiple copies of the same page. That is, there may be several pages with exactly the same text. These duplicates occur for a variety of reasons. Some are mirror sites, some are copies of common pages (such as Unix man pages), some are multiple spam advertisements, etc. Returning just one of the duplicates should be sufficient for the end user; returning all of them will clutter the response page, wasting valuable real estate and frustrating the user. How can we cope with duplicate pages?
Determining exact duplicates has a simple solution, based on hashing. Use the text of each page and an appropriate hash function to hash the text into a 64 bit signature. If two documents have the same signature, it is reasonable to assume that they share the same text. (Why? How often is this assumption wrong? Is it a terrible thing if the assumption turns out to be false?) By comparing signatures on the fly, we can avoid returning duplicates.
This solution works extremely well if we want to catch exact duplicates. What if, however, we want to capture the idea of "near duplicate" documents, or similar documents? For example, consider two mirror sites on the Web. It may be that the documents share the same text, except that the text corresponding to the links on the page are different, with each referring to the correct mirror site. In this case, the two pages will not yield the same signature, although again, we would not want to return both pages to the end user, because they are so similar. As another example, consider two copies of a newspaper article, one with a proper copyright notice added, and one without. We do not need to return both pages to the user. Again, hashing the document appears to be of no help. Finally, consider the case of advertisers who submit slightly modified versions of their ads over and over again, trying to get more or better spots on the response pages sent back to users. We want to stop their nefarious plans!
We will describe a scheme used to detect similar documents efficiently, using a hashing based scheme. Like the Bloom filter solution for password dictionaries, our solution is highly efficient in terms of space and time. The cost for this efficiency is accuracy; our algorithm will sometimes make mistakes, because it uses randomness.
12.2 Set resemblance
We describe a more general problem that will relate to our document similarity problem.
Consider two sets of numbers, A and B. For concreteness, we will assume that A and B are subsets of the 64 bit numbers. We may define the resemblance of A and B as
resemblance(A, B) = R(A, B) = |A∩B| / |A∪B|.
The resemblance is a real number between 0 and 1. Intuitively, the resemblance accurately captures how close the two sets are. Sets and documents will be related, as we will see later.
1This lecture is based on the work of Andrei Broder, who developed these ideas, and convinced Altavista to use them! (The second feat may have been even more difficult than the first.)
How quickly can we determine the resemblance of two sets? If the sets are each of size n, the natural approach (compare each element in A to each element in B) is O(n^2). We can do better by sorting the sets. Still, these approaches are all rather slow, when we consider that we will have many sets to deal with and hence many pairs of sets to consider.
Instead we should consider relaxing the problem. Suppose that we do not need an exact calculation of the resemblance R(A, B). A reasonable estimate or approximation of the resemblance will suffice. Also, since we will be answering a variety of queries over a long period of time, it makes sense to consider algorithms that first do a preprocessing phase, in order to handle the queries much more quickly. That is, we will first do some work, preparing the appropriate data structures and data in a preprocessing phase. The advantage of doing all this work in advance will be that queries regarding resemblance can then be quickly answered.
Our estimation process will require a black box that does the following: it produces an effective random permutation on the set of 64 bit numbers. What do we mean by a random permutation? Let us consider just the case of four bit numbers, of which there are 16. Suppose we write each number on a card. Generating a random permutation is like shuffling this deck of 16 cards and looking at the order in which the numbers appear after the shuffling. For example, if we find the number 0011 on the first card, then our permutation maps the number 3 to the number 1. We write this as π(3) = 1, where π is a function that represents the permutation.
Suppose we have an efficient implementation of random permutations, which we think of as a black box procedure. That is, when we invoke the black box procedure BB(1, x) on a 64 bit number x, we get out y = π1(x) for some fixed, completely random permutation π1. Similarly, if we invoke the black box BB(2, x), we get out π2(x) for some different random permutation π2. (In fact in practice we cannot achieve this black box, but we can get close enough that it is useful to think in these terms for analysis.)
Let us use the notation π1(A) to denote the set of elements obtained by computing BB(1, x) for every x in A. Consider the following procedure: we compute the sets π1(A) and π1(B), and record the minimum of each set. When does min π1(A) = min π1(B)? This happens only when there is some element x satisfying π1(x) = min π1(A) = min π1(B). In other words, the element x that attains the minimum over the set A∪B has to be in the intersection A∩B.
If π1 is a random permutation, then every element in A∪B has equal probability of mapping to the minimum element after the permutation is applied. That is, for all x and y in A∪B,
Pr[π1(x) = min π1(A∪B)] = Pr[π1(y) = min π1(A∪B)].
Thus, for the minimum of π1(A) and π1(B) to be the same, the minimum element must lie in π1(A∩B) (see Figure 12.1). Hence
Pr[min π1(A) = min π1(B)] = |A∩B| / |A∪B|.
But this is just the resemblance R(A,B)!
This gives us a way to estimate the resemblance. Instead of taking just one permutation, we take many, say 100. For each set A, we preprocess by computing min πj(A) for j = 1 to 100, and store these values. To estimate the resemblance of two sets A and B, we count how often the minima are the same, and divide by 100. It is like each permutation gives us a coin flip, where the probability of a heads (a match) is exactly the resemblance R(A, B) of the two sets.
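In code, the estimate might look like the following sketch (here random linear functions modulo a large prime stand in for the black-box random permutations, which, as noted above, we cannot implement exactly):

import random

PRIME = (1 << 61) - 1          # a large prime; our choice for this sketch

def make_sketch(A, num_perms=100, seed=0):
    rng = random.Random(seed)  # the same seed must be used for every set, so the "permutations" agree
    params = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(num_perms)]
    return [min((a * x + b) % PRIME for x in A) for (a, b) in params]

def estimated_resemblance(sketch_A, sketch_B):
    matches = sum(1 for u, v in zip(sketch_A, sketch_B) if u == v)
    return matches / len(sketch_A)

# A = set(range(0, 80)); B = set(range(20, 100))          # true resemblance 60/100 = 0.6
# estimated_resemblance(make_sketch(A), make_sketch(B))   # close to 0.6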
Figure 12.1: If the minimum elements of π1(A) and π1(B) are the same, the minimum element must lie in π1(A∩B).
The document "Four score and seven years ago, our founding . . ." yields the shingles: "Four score and seven", "score and seven years", "and seven years ago", "seven years ago our", "years ago our founding".
Figure 12.2: Shingling: the document is broken up into all segments of k consecutive words; each segment leads to a 64 bit hash value.
12.3 Turning Document Similarity into a Set Resemblance Problem
We now return to the original application. How do we turn document similarity into a set resemblance problem? The key idea is to hash pieces of the document, say every four consecutive words, into 64 bit numbers. This process has been called shingling, and each set of consecutive words is called a shingle. (See Figure 12.2.) Using hashing, the shingles give rise to the resulting numbers for the set resemblance problem, so that for each document D there is a set SD. There are many possible variations and improvements possible. For example, one could modify the number of bits in a shingle or the method for shingling. Similarly, one could throw out all shingles that are not 0 mod 16, say, in order to reduce the number of shingles per document.
This approach obscures some important information in the document, such as the order paragraphs appear in, say. However, it seems reasonable to say that if the resulting sets have high resemblance, the documents are reasonably similar.
Once we have the shingles for the document, we associate a document sketch with each document. The sketch of a document SD is a list of say 100 numbers: (min π1(SD), min π2(SD), min π3(SD), . . . , min π100(SD)).
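A sketch of the shingling step (four-word shingles, with Python's built-in hash standing in for a 64 bit hash function; both choices are ours for illustration):

def shingle_set(text, k=4):
    words = text.split()
    # every run of k consecutive words is one shingle, hashed down to a number
    return {hash(tuple(words[i:i + k])) for i in range(len(words) - k + 1)}

# S_D = shingle_set(document_text)     # document_text is the page's text, however it was obtained
# sketch = make_sketch(S_D)            # reusing make_sketch from the set resemblance sketch above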
Now we choose a threshold; for example, we might say that two documents are similar if 90 out of the 100 entries in the sketch match. Now whenever a user queries the search engine, we check the sketches of the documents we wish to return. If two sketches share 90 entries, we only send one of them. (Alternatively, we could catch the duplicates on the crawling side: we check all the documents as we crawl the Web, and whenever two sketches share more than 90 entries, we assume the associated documents are similar, so that we only need to store one of them!)
Recall that our scheme uses random permutations. So, if we set our sketch threshold to 90 out of 100 entries,
this does not guarantee that any pair of documents with high resemblance are caught. Also, some pairs of documents that do not have high resemblance may get marked as having high resemblance. How well does this scheme do?
We analyze how well the scheme does with the following argument. For each permutation πi, the probability that two documents A and B have the same value in the ith position of the sketch is just the resemblance of the two documents R(A, B) = r. (Here the resemblance R(A, B) of course refers to the resemblance of the sets of numbers obtained by shingling A and B.) Hence, the probability p(r) that at least 90 out of the 100 entries in the sketch match is
p(r) = ∑_{k=90}^{100} (100 choose k) r^k (1 − r)^{100−k}.
What does p(r) look like as a function of r? The graph is shown in Figure 12.3. Notice that p(r) stays very small until r approaches 0.9, and then quickly grows towards 1. This is exactly the property we want our scheme to have: if two documents are not similar, we will rarely mistake them for being similar, and if they are similar, we are likely to catch them!
For example, even if the resemblance is 0.8, we will only get 90 matches with probability less than 0.006! When the resemblance is only 0.5, the probability of having 90 entries in the sketch match falls to almost 10^{−18}! If documents are not alike, we will rarely mistake them as being similar.
If documents are alike, we will most likely catch them. If the resemblance is 0.95, the documents will have 90 or more entries in common in the sketch with probability greater than .988; if the resemblance is 0.96, the probability jumps to over .997.
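These figures are easy to check numerically; for example, in Python (math.comb requires Python 3.8 or later):

from math import comb

def p(r, threshold=90, total=100):
    # probability that at least `threshold` of the `total` sketch entries match
    return sum(comb(total, k) * r**k * (1 - r)**(total - k) for k in range(threshold, total + 1))

# p(0.8) is about 0.006, p(0.5) is astronomically small, p(0.95) > 0.98, p(0.96) > 0.99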
We are dealing with a very large number of documents; most search engines currently index twenty-five to over one hundred million Web pages! So even though the probability of making a mistake is small, it will happen. The worst that happens, though, is that the search engine fails to index a few pages that it should, and it fails to catch a few duplicates that it should. These problems are not a big deal.
Figure 12.3: Making the threshold for document similarity 90 out of 100 matches in the sketch yields a graph of the probability two documents are considered similar as a function of their resemblance. Notice the sharp change in behavior near where the resemblance is 0.90. Essentially, the procedure behaves like a low pass filter.
CS124 Lecture 13
Hopefully the ideas we saw in our hashing problems have convinced you that randomness is a useful tool in the design and analysis of algorithms. Just to make sure, we will consider several more examples of how to use randomness to design algorithms.
13.1 Primality testing
A great deal of modern cryptography is based on the fact that factoring is apparently hard. At least nobody has published a fast way to factor yet. (It is rumored the NSA knows how to factor, and is keeping it a secret. Some of you might well have worked or will work for the NSA, at which point you will be required to keep this secret. Shame on you.) Of course, certain numbers are easy to factor: numbers with small prime factors, for example. So often, for cryptographic purposes, we may want to generate very large prime numbers and multiply them together. How can we find large prime numbers?
We are fortunate to find that prime numbers are pretty dense. That is, there's an awful lot of them. Let π(x) be the number of primes less than or equal to x. Then
π(x) ≈ x / ln x,
or more exactly,
lim_{x→∞} π(x) / (x / ln x) = 1.
This means that on average about one out of every ln x numbers is prime, if we are looking for primes about the size of x. So if we want to find prime numbers of say 250 digits, we would have to check about ln 10^250 ≈ 576 numbers on average before finding a prime. (We can search smarter, too, throwing out multiples of 2, 3, 5, etc. in order to check fewer numbers.) Hence, all we need is a good method for testing if a number is prime. With such a test, we can generate large primes easily: just keep generating random large numbers, and test them for primality until we find a suitable prime number.
How can we test if a number n is prime? The pedantic way is to try dividing n by all smaller numbers. Alternatively, we can try to divide n by all primes up to √n. Of course, both of these approaches are quite slow; when n is about 10^250, the value of √n is still huge! The point is that 10^250 has only 250 (or more generally O(log n)) digits, so we'd like the running time of the algorithm to be based on the size 250, not 10^250!
How can we quickly test if a number is prime? Let's start by looking at some ways that work pretty well, but have a few problems. We will use the following result from number theory:
Theorem 13.1 If p is a prime and 1 ≤ a < p, then
a^{p−1} = 1 mod p.
Proof: There are two nice proofs for this fact. One uses a simple induction to prove the equivalent statement that a^p = a mod p. This is clearly true when a = 1. Now
(a+1)^p = ∑_{i=0}^{p} (p choose i) a^{p−i}.
The coefficient (p choose i) is divisible by p, unless i = 0 or i = p. Hence
(a+1)^p = a^p + 1 mod p = a + 1 mod p,
where the last step follows by the induction hypothesis.
An alternative proof uses the following idea. Consider the numbers 1, 2, . . . , p−1. Multiply them all by a, so now we have a, 2a, . . . , (p−1)a. Each of these numbers is distinct mod p, and there are p−1 such numbers, so in fact the sequence a, 2a, . . . , (p−1)a is the same as the sequence 1, 2, . . . , p−1 when considered modulo p, except for the order. Hence
1 · 2 · . . . · (p−1) = a · 2a · . . . · (p−1)a mod p = a^{p−1} · 1 · 2 · . . . · (p−1) mod p.
Thus we have a^{p−1} = 1 mod p.
This immediately suggests one way to check if a number n is prime. Compute 2^{n−1} mod n. If it is not 1, then n is certainly not prime! Note that we can compute 2^{n−1} mod n quite efficiently, using our previously discussed methods for exponentiation, which require only O(log n) multiplications! Thus this test is efficient.
But so far this test is just one-way; if n is composite, we may have that 2^{n−1} = 1 mod n, so we cannot assume that n is prime just because it passes the test. For example, 2^340 = 1 mod 341, and 341 is not prime. Such a number is called a 2-pseudoprime, and unfortunately there are infinitely many of them. (Of course, even though there are infinitely many 2-pseudoprimes, they are not as dense as the primes; that is, there are relatively very few of them. So if we generate a large number n randomly, and see if 2^{n−1} = 1 mod n, we will most likely be right if we then say n is prime if it passes this test. In practice, this might be good enough! This is not a good primality test, however, if an NSA official you know gives you a number to test for primality, and you think they might be trying to fool you. The NSA might be purposely giving you a 2-pseudoprime. They can be tricky that way.)
You might think to try a different base, other than 2. For example, you might choose 3, or a random value of a. Unfortunately, there are infinitely many 3-pseudoprimes. In fact, there are infinitely many composite numbers n such that a^{n−1} = 1 mod n for all a that do not share a factor with n. (That is, for all a such that the greatest common divisor of a and n is 1.) Such n are called Carmichael numbers; the smallest such number is 561. So a test based on this approach is destined to fail for some numbers.
There is a way around this problem, due to Rabin. Let n−1 = 2^t u, with u odd. Suppose we choose a random base a and compute a^{n−1} by first computing a^u and then repeatedly squaring. Along the way, we will check to see for the values a^u, a^{2u}, . . . whether they have the following property:
a^{2^{i−1} u} ≠ ±1 mod n,  a^{2^i u} = 1 mod n.
That is, suppose we find a non-trivial square root of 1 modulo n. It turns out that only composite numbers have non-trivial square roots; prime numbers don't. In fact, if we choose a randomly, and n is composite, for at least 3/4 of the values of a, one of two things will happen: we will either find a non-trivial square root of 1 using this process, or we will find that a^{n−1} ≠ 1 mod n. In either case, we know that n is composite!
A value of a for which either a^{n−1} ≠ 1 mod n or the computation of a^{n−1} yields a non-trivial square root is called a witness to the compositeness of n. We have said that 3/4 of the possible values of a are witnesses (we will not prove this here!). So if we pick a single value of a randomly, and n is composite, we will determine that n is composite with probability at least 3/4. How can we improve the probability of catching when n is composite?
The simplest way is just to repeat the test several times, each time choosing a value of a randomly. (Note that we do not even have to go to the trouble of making sure we try different values of a each time; we can choose values with replacement!) Each time we try this we have a probability of at least 3/4 of catching that n is composite, so if
we try the test k times, we will return the wrong answer in the case where n is composite with probability (1/4)^k. For k = 25, the probability of the algorithm itself making an error is thus (1/2)^50; the probability that a random cosmic ray affected your arithmetic unit is probably higher!
This trick comes up again and again with randomized algorithms. If the probability of catching an error on a single trial is p, the probability of failing to catch an error after t trials is (1 − p)^t, assuming each trial is independent. By making t sufficiently large, the probability of error can be reduced. Since the probability shrinks exponentially in t, few trials can produce a great deal of security in the answer.
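Putting the pieces together, here is a sketch of the resulting randomized primality test in Python (k = 25 trials, as in the discussion above; the code structure is ours):

import random

def is_probably_prime(n, k=25):
    if n < 2:
        return False
    if n in (2, 3):
        return True
    if n % 2 == 0:
        return False
    # write n - 1 = 2^t * u with u odd
    t, u = 0, n - 1
    while u % 2 == 0:
        t += 1
        u //= 2
    for _ in range(k):
        a = random.randrange(2, n - 1)
        x = pow(a, u, n)
        if x == 1 or x == n - 1:
            continue                       # this a is not a witness
        for _ in range(t - 1):
            x = (x * x) % n
            if x == n - 1:
                break
        else:
            return False                   # a is a witness: n is certainly composite
    return True                            # no witness found; n is prime with high probability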
CS 124 Lecture 14
14.1 Cryptography Fundamentals
Cryptography is concerned with the following scenario: two people, Alice and Bob, wish to communicate privately in the presence of an eavesdropper, Eve. In particular, suppose Alice wants to send Bob a message x. (For convenience, we will always assume our message has been converted into a bit string.) Using cryptography, Alice would compute a function e(x), the encoding of x, using some secret key, and transmit e(x) to Bob. Bob receives e(x), and using his own secret key, would compute a function d(e(x)) = x. The function d provides the decoding of the encoding e(x). Eve is presumably unable to recover x from e(x) because she does not have the key; without the key, computing x is either impossible or computationally difficult.
14.1.1 One-Time Pad
A classical cryptographic method is the one-time pad. A one-time pad is a random string of bits r, equal in length to the message x, that Alice and Bob share and is secret. By random, here we mean that r is equally likely to be any bit string of the right length, |r|. Alice computes e(x) = x ⊕ r; Bob computes d(e(x)) = e(x) ⊕ r = x ⊕ r ⊕ r = x.
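A small sketch of the pad in Python, operating on bytes rather than bits purely for convenience (our simplification):

import os

def one_time_pad(message: bytes, pad: bytes) -> bytes:
    # encryption and decryption are the same XOR; the pad must be as long as the message and used only once
    assert len(pad) == len(message)
    return bytes(m ^ r for m, r in zip(message, pad))

# pad = os.urandom(14)
# c = one_time_pad(b"attack at dawn", pad)
# one_time_pad(c, pad) == b"attack at dawn"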
The claim is that Eve gets absolutely no information about the message by seeing e(x). More concretely, we claim
Pr(message is x | e(x)) = Pr(message is x);
that is, knowing e(x) gives no more information to Eve than she already had. This is a nice exercise in conditional probabilities.
Since e(x) provides no information, the one-time pad is completely secure. (Notice that this does not rely on notions of computational difficulty; Eve really obtains no additional information!) There are, however, crucial drawbacks.
• The key r has to be as long as x.
• The key r can only be used once. (To see this, suppose we use the same key r to encode x and y. Then Eve can compute e(x) ⊕ e(y) = x ⊕ y, which might yield useful information!)
• The key r has to be exchanged, by some other means. (Private courier?)
14.1.2 DES
The Data Encryption Standard, or DES, is a U.S. government sponsored cryptographic method proposed in 1976. It uses a 56 bit key, again shared by Alice and Bob, and it encodes blocks of 64 bits using a complicated sequence of bit operations.
Many have suspected that the government engineered the DES standard so that they could break it easily, but nobody has shown a simpler method for breaking DES other than trying the 2^56 possible keys. These days, however, trying even this large number of keys can be accomplished in just a few days with specialized hardware. Hence DES is widely considered no longer secure.
14.1.3 RSA
RSA (named after its inventors, Ron Rivest, Adi Shamir, and Len Adleman) was developed around the same time as DES. RSA is an example of public key cryptography. In public key cryptography, Bob has two keys: a public key, ke, known to everyone, and a private key, kd, known only to Bob. If Alice (or anyone else) wants to send a message x to Bob, she encrypts it as e(x) using the public key; Bob then decrypts it using his private key. For this to be secure, the private key must be hard to compute from the public key, and similarly x must be hard to compute from e(x).
The RSA algorithm depends on some number theory and simple algorithms, which we will consider before describing RSA. We will then describe how RSA is efficient and secure.
14.2 Tools for RSA
14.2.1 Primality
For the time being, we will assume that it is possible to generate large prime numbers. In fact, there are simple and efficient randomized algorithms for generating large primes, that we will consider later in the course.
14.2.2 Euclid’s Greatest Common Divisor Algorithm
Definition: The greatest common divisor (or gcd) of integers a, b ≥ 0 is the largest integer d ≥ 0 such that d|a and d|b, where d|a denotes that d divides a.
Example: gcd(360, 84) = 12.
One way of computing the gcd is to factor the two numbers, and find the common prime factors (with the right multiplicity). Factoring, however, is a problem for which we do not have general efficient algorithms.
The following algorithm, due to Euclid, avoids factoring. Assume a ≥ b ≥ 0.
function Euclid(a,b)
    if b = 0 return(a)
    return(Euclid(b, a mod b))
end Euclid
Euclid's algorithm relies on the fact that gcd(a, b) = gcd(b, a mod b). You should prove this as an exercise.
We need to check that this algorithm is efficient. We will assume that mod operations are efficient (in fact they can be done in O(log^2 a) bit operations). How many mod operations must be performed?
To analyze this, we notice that in the recursive calls of Euclid's algorithm, the numbers always get smaller. For the algorithm to be efficient, we'd like to have only about O(log a) recursive calls. This will require the numbers to shrink by a constant factor after a constant number of rounds. In fact, we can show that the larger number shrinks by a factor of 2 every 2 rounds.
Claim 1: a mod b ≤ a/2.
Proof: The claim is trivially true if b ≤ a/2. If b > a/2, then a mod b = a − b ≤ a/2.
Claim 2: On calling Euclid(a,b), after the second recursive call Euclid(a′,b′) has a′ ≤ a/2.
Proof: For the second recursive call, we will have a′ = a mod b, which is at most a/2 by Claim 1.
14.2.3 Extended Euclid’s Algorithm
Euclid's algorithm can be extended to give not just the greatest common divisor d = gcd(a, b), but also two integers x and y such that ax + by = d. This will prove useful to us subsequently, as we will explain.
Extended-Euclid(a,b)
    if b = 0 return(a,1,0)
    Compute k such that a = bk + (a mod b)
    (d,x,y) = Extended-Euclid(b, a mod b)
    return((d, y, x − ky))
end Extended-Euclid
Claim 3: The Extended Euclid's algorithm returns the correct answer.
Proof: By induction on a + b. It clearly works if b = 0. (Note the understanding that all numbers divide 0!) If b ≠ 0, then we may assume the recursive call provides the correct answer by induction, as a mod b < a. Hence we have x and y such that bx + (a mod b)y = d. But (a mod b) = a − bk, and hence by substitution we get bx + (a − bk)y = d, or ay + b(x − ky) = d. This shows the algorithm provides the correct output.
Note that the Extended Euclid's algorithm is clearly efficient, as it requires only a few extra arithmetic operations per recursive call over Euclid's algorithm.
The Extended Euclid's algorithm is useful if we wish to compute the inverse of a number. That is, suppose we wish to find a^{−1} mod n. The number a has a multiplicative inverse modulo n if and only if the gcd of a and n is 1. Moreover, the Extended Euclid's algorithm gives us that number. Since in this case computing gcd(a, n) gives x, y such that ax + ny = 1, we have that x = a^{−1} mod n.
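A sketch of both routines in Python, together with the resulting modular inverse (raising an error when no inverse exists is our own convention):

def extended_euclid(a, b):
    # returns (d, x, y) with d = gcd(a, b) and a*x + b*y = d
    if b == 0:
        return a, 1, 0
    k = a // b
    d, x, y = extended_euclid(b, a % b)
    return d, y, x - k * y

def inverse_mod(a, n):
    # the inverse exists exactly when gcd(a, n) = 1
    d, x, y = extended_euclid(a, n)
    if d != 1:
        raise ValueError("a has no inverse modulo n")
    return x % n

# inverse_mod(7, 40) == 23, since 7 * 23 = 161 = 4 * 40 + 1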
14.2.4 Exponentiation
Suppose we have to compute x^y mod z, for integers x, y, z. Multiplying x by itself y times is one possibility, but it is too slow. A more efficient approach is to repeatedly square starting from x, to get x^2 mod z, x^4 mod z, x^8 mod z, . . . , x^{2^⌊log y⌋} mod z. Now x^y can be computed by multiplying together modulo z the powers that correspond to ones in the binary representation of y.
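Python's built-in pow(x, y, z) already does this, but the idea spelled out looks roughly as follows:

def power_mod(x, y, z):
    # repeated squaring: one squaring per bit of y
    result = 1
    x = x % z
    while y > 0:
        if y & 1:                      # this bit of y is 1: fold the corresponding power in
            result = (result * x) % z
        x = (x * x) % z                # x, x^2, x^4, x^8, ... (mod z)
        y >>= 1
    return result

# power_mod(2, 340, 341) == 1   (the 2-pseudoprime 341 from the previous lecture)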
14.3 The RSA Protocol
To create a public key, Bob finds two large primes, p and q, of roughly the same size. (Large should be a few hundred decimal digits. Recently, with a lot of work, 512-bit RSA has been broken; this corresponds to n = pq being 512
bits long.) Bob computes n = pq, and also computes a random integer e, such that gcd((p−1)(q−1), e) = 1. (An alternative to choosing e randomly, often used in practice, is to choose e = 3, in which case p and q cannot equal 1 modulo 3.)
The pair (n, e) is Bob's public key, which he announces to the world. Bob's private key is d = e^{−1} mod (p−1)(q−1), which can be computed by the Extended Euclid's algorithm. More specifically, (p, q, d) is Bob's private key.
Suppose Alice wants to send a message to Bob. We think of the message as being a number x from the range [1, n]. (If the message is too big to be represented by a number this small, it must be broken up into pieces; for example, the message could be broken into bit strings of length log n.) To encode the message, Alice computes and sends to Bob
e(x) = x^e mod n.
Upon receipt, Bob computes
d(e(x)) = (e(x))^d mod n.
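A toy-sized run of the protocol in Python (the primes here are absurdly small and chosen only for illustration; pow(e, -1, phi) requires Python 3.8 or later):

p, q = 61, 53                      # toy primes; real keys use primes hundreds of digits long
n = p * q                          # 3233, the public modulus
phi = (p - 1) * (q - 1)            # 3120
e = 17                             # public exponent with gcd(e, phi) = 1
d = pow(e, -1, phi)                # private exponent e^{-1} mod (p-1)(q-1), as in Section 14.2.3

def encrypt(x):
    return pow(x, e, n)

def decrypt(c):
    return pow(c, d, n)

# decrypt(encrypt(65)) == 65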
To show that this operation decodes correctly, we must prove:
Claim 4: d(e(x)) = x.
Proof: We use the steps:
e(x)^d = x^{de} = x^{1+k(p−1)(q−1)} = x mod n.
The first equation recalls the definition of e(x). The second uses the fact that d = e^{−1} mod (p−1)(q−1), and hence de = 1 + k(p−1)(q−1) for some integer k. The last equality is much less trivial. It will help us to have the following lemma:
Claim 5: (Fermat's Little Theorem) If p is prime, then for a ≠ 0 mod p, we have a^{p−1} = 1 mod p.
Proof: Look at the numbers 1, 2, . . . , p−1. Suppose we multiply them all by a modulo p, to get a·1 mod p, a·2 mod p, . . . , a·(p−1) mod p. We claim that the two sets of numbers are the same! This is because every pair of numbers in the second group is different; this follows since if a·i = a·j mod p, then by multiplying by a^{−1}, we must have i = j mod p. But if all the numbers in the second group are different modulo p, since none of them are 0, they must just be 1, 2, . . . , p−1. (To get a feel for this, take an example: when p = 7 and a = 5, multiplying a by the numbers 1, 2, 3, 4, 5, 6 yields 5, 3, 1, 6, 4, 2.)
From the above equality of sets of numbers, we conclude
1 ·2 · · · (p−1) = (a ·1) · (a ·2) · · · (a · (p−1)) mod p.
Multiplying both sides by 1^{−1}, 2^{−1}, . . . , (p−1)^{−1} we have
1 = a^{p−1} mod p.
This proves Claim 5.
We now return to the end of Claim 4, where we must prove
x^{1+k(p−1)(q−1)} = x mod n.
We first claim that x^{1+k(p−1)(q−1)} = x mod p. This is clearly true if x = 0 mod p. If x ≠ 0 mod p, then by Fermat's Little Theorem, x^{p−1} = 1 mod p, and hence x^{k(p−1)(q−1)} = 1 mod p, from which we have x^{1+k(p−1)(q−1)} = x mod p. By the same argument we also have x^{1+k(p−1)(q−1)} = x mod q. But if a number is equal to x both modulo p and modulo q, it is equal to x modulo n = p·q. Hence x^{1+k(p−1)(q−1)} = x mod n, and Claim 4 is proven.
We have shown that the RSA protocol allows for correct encoding and decoding. We also should be convinced it is efficient, since it requires only operations that we know to be efficient, such as Euclid's algorithm and modular exponentiation. One thing we have not yet asked is why the scheme is secure. That is, why can't the eavesdropper Eve recover the message x also?
The answer, unfortunately, is that there is no proof that Eve cannot compute x efficiently from e(x). There is simply a belief that this is a hard problem. It is an unproven assumption that there is no efficient algorithm for computing x from e(x). There is the real but unlikely possibility that someone out there can read all messages sent using RSA!
Let us seek some idea of why RSA is believed to be secure. If Eve obtains e(x) = x^e mod n, what can she do? She could try all possible values of x to try to find the correct one; this clearly takes too long. Or she could try to factor n and compute d. Factoring, however, is a widely known and well studied problem, and nobody has come up with a polynomial time algorithm for the problem. In fact, it is widely believed that no such algorithm exists.
It would be nice if we could make some sort of guarantee. For example, suppose that breaking RSA allowed us to factor n. Then we could say that RSA is as hard as factoring. Unfortunately, this is not the case either. It is possible that RSA could be broken without providing a general factoring algorithm, although it seems that any natural approach for breaking RSA would also provide a way to factor n.
CS124 Lecture 15
15.1 2SAT
We begin by showing yet another possible way to solve the 2SAT problem. Recall that the input to 2SAT is a logical expression that is the conjunction (AND) of a set of clauses, where each clause is the disjunction (OR) of two literals. (A literal is either a Boolean variable or the negation of a Boolean variable.) For example, the following expression is an instance of 2SAT:
(x1 ∨ ¬x2) ∧ (¬x1 ∨ ¬x3) ∧ (x1 ∨ x2) ∧ (x4 ∨ ¬x3) ∧ (x4 ∨ ¬x1).
A solution to an instance of a 2SAT formula is an assignment of the variables to the values T (true) and F (false) so that all the clauses are satisfied; that is, there is at least one true literal in each clause. For example, the assignment x1 = T, x2 = F, x3 = F, x4 = T satisfies the 2SAT formula above.
Here is a simple randomized solution to the 2SAT problem. Start with some truth assignment, say by setting all the variables to false. Find some clause that is not yet satisfied. Randomly choose one of the variables in that clause, say by flipping a coin, and change its value. Continue this process, until either all clauses are satisfied or you get tired of flipping coins.
In the example above, when we begin with all variables set to F, the clause (x1 ∨ x2) is not satisfied. So we might randomly choose to set x1 to be T. In this case this would leave the clause (x4 ∨ ¬x1) unsatisfied, so we would have to flip a variable in that clause, and so on.
Why would this algorithm tend to lead to a solution? Let us suppose that there is a solution, call it S. Suppose we keep track of the number of variables in our current assignment A that match S. Call this number k. We would like to get to the point where k = n, the number of variables in the formula, for then A would match the solution S. How does k evolve over time?
At each step, we choose a clause that is unsatisfied. Hence we know that A and S disagree on the value of at least one of the variables in this clause; if they agreed, the clause would have to be satisfied! If they disagree on both, then clearly changing either one of the values will increase k. If they disagree on the value of one of the two variables, then with probability 1/2 we choose that variable and increase k by 1; with probability 1/2 we choose the other variable and decrease k by 1.
Hence, in the worst case, k behaves like a random walk: it either goes up or down by 1, randomly. This leaves us with the following question: if we start k at 0, how many steps does it take (on average, or with high probability) for k to stumble all the way up to n, the number of variables?
We can check that the average number of steps to walk (randomly) from 0 to n is just n^2. In fact, the average time to walk from i to n is n^2 − i^2. Note that the average time T(i) to walk from i to n is given by:
T(n) = 0
T(i) = T(i−1)/2 + T(i+1)/2 + 1,   1 ≤ i ≤ n−1
T(0) = T(1) + 1.
These equations completely determine T(i), and our solution satisfies these equations!
Hence, on average, we will find a solution in at most n^2 steps. (We might do better; we might not start with all of our variables wrong, or we might have some moves where we must improve the number of matches!)
We can run our algorithm for say 100n^2 steps, and report that no solution was found if none was found. This algorithm might return the wrong answer; there may be a truth assignment, and we have just been unlucky. But most of the time it will be right.
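Here is a sketch of the walk in Python; encoding clauses as pairs of nonzero integers, where literal i stands for xi and −i for its negation, is our own choice:

import random

def random_walk_2sat(clauses, n):
    assignment = {i: False for i in range(1, n + 1)}      # start with all variables false

    def satisfied(lit):
        return assignment[abs(lit)] == (lit > 0)

    for _ in range(100 * n * n):                          # the 100 n^2 cutoff from the discussion above
        unsat = [c for c in clauses if not satisfied(c[0]) and not satisfied(c[1])]
        if not unsat:
            return assignment                             # every clause has a true literal
        lit = random.choice(random.choice(unsat))         # pick an unsatisfied clause, flip one of its variables
        assignment[abs(lit)] = not assignment[abs(lit)]
    return None                                           # give up; report that no solution was found

# clauses = [(1, -2), (-1, -3), (1, 2), (4, -3), (4, -1)]   # the example formula above
# random_walk_2sat(clauses, 4)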
CS124 Lecture 16
An introductory example
Suppose that a company that produces three products wishes to decide the level of production of each so as to
maximize profits. Let x1 be the amount of Product 1 produced in a month, x2 that of Product 2, and x3 that of Product
3. Each unit of Product 1 yields a profit of 100, each unit of Product 2 a profit of 600, and each unit of Product 3 a
profit of 1400. There are limitations on x1, x2, and x3 (besides the obvious one, that x1,x2,x3 ≥ 0). First, x1 cannot
be more than 200, and x2 cannot be more than 300, presumably because of supply limitations. Also, the sum of the
three must be, because of labor constraints, at most 400. Finally, it turns out that Products 2 and 3 use the same
piece of equipment, with Product 3 using three times as much, and hence we have another constraint x2 +3x3 ≤ 600.
What are the best levels of production?
We represent the situation by a linear program, as follows:
max 100x1 + 600x2 + 1400x3
x1 ≤ 200
x2 ≤ 300
x1 + x2 + x3 ≤ 400
x2 + 3x3 ≤ 600
x1,x2,x3 ≥ 0
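If a solver such as scipy is available (an assumption on our part), this small linear program can be handed to it directly; linprog minimizes, so we negate the objective:

from scipy.optimize import linprog

c = [-100, -600, -1400]            # maximize 100 x1 + 600 x2 + 1400 x3
A_ub = [[1, 0, 0],                 # x1 <= 200
        [0, 1, 0],                 # x2 <= 300
        [1, 1, 1],                 # x1 + x2 + x3 <= 400
        [0, 1, 3]]                 # x2 + 3 x3 <= 600
b_ub = [200, 300, 400, 600]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3)
print(res.x, -res.fun)             # the optimal production levels and the maximum profit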
The set of all feasible solutions of this linear program (that is, all vectors in 3-d space that satisfy all constraints)
is precisely the polyhedron shown in Figure 16.1.
We wish to maximize the linear function 100x1 +600x2 +1400x3 over all points of this polyhedron. Geometri-
cally, the linear equation 100x1 + 600x2 + 1400x3 = c can be represented by a plane parallel to the one determined
by the equation 100x1 +600x2 +1400x3 = 0. This means that we want to find the plane of this type that touches the
polyhedron and is as far towards the positive orthant as possible. Obviously, the optimum solution will be a vertex
(or the optimum solution will not be unique, but a vertex will do). Of course, two other possibilities with linear
programming are that (a) the optimum solution may be infinity, or (b) that there may be no feasible solution at all.
[Figure: the feasible region is a polyhedron in (x1, x2, x3)-space bounded by the constraints above (with x1 cut off at 200 and x2 at 300); the optimum is attained at one of its vertices, marked "opt".]
Figure 16.1: The feasible region
In this case, an optimal solution exists, and moreover we shall show that it is easy to find.
Linear programs
Linear programs, in general, have the following form: there is an objective function that one seeks to optimize,
along with constraints on the variables. The objective function and the constraints are all linear in the variables;
that is, all equations have no powers of the variables, nor are the variables multiplied together. As we shall see,
almost all problems can be represented by linear programs, and for many problems it is an extremely convenient
representation. So once we explain how to solve linear programs, the question then becomes how to reduce other
problems to linear programming (LP).
There are polynomial time algorithms for solving linear programs. In practice, however, such problems are
solved by the simplex method devised by George Dantzig in 1947. The simplex method starts from a vertex (in this
case the vertex (0,0,0)) and repeatedly looks for a vertex that is adjacent, and has better objective value. That is, it
is a kind of hill-climbing in the vertices of the polytope. When a vertex is found that has no better neighbor, simplex
stops and declares this vertex to be the optimum. For example, in the figure one of the possible paths followed by
simplex is shown. No known variant of the simplex algorithm has been proven to take polynomial time, and most of
the variations used in practice have been shown to take exponential time on some examples. Fortunately, in practice,
bad cases rarely arise, and the simplex algorithm runs extremely quickly. There are now implementations of simplex
that routinely solve linear programs with many thousands of variables and constraints.
Of course, given a linear program, it is possible either that (a) the optimum solution may be infinity, or (b) that
there may be no feasible solution at all. If this is the case, the simplex algorithm will discover it.
Reductions between versions of simplex
A general linear programming problem may involve constraints that are equalities or inequalities in either
direction. Its variables may be nonnegative, or could be unrestricted in sign. And we may be either minimizing
or maximizing a linear function. It turns out that we can easily translate any such version to any other. One
such translation that is particularly useful is from the general form to the one required by simplex: minimization,
nonnegative variables, and equality constraints.
To turn an inequality ∑aixi ≤ b into an equality constraint, we introduce a new variable s (the slack variable for
this inequality), and rewrite this inequality as ∑aixi + s = b,s ≥ 0. Similarly, any inequality ∑aixi ≥ b is rewritten
as ∑aixi − s = b,s ≥ 0; s is now called a surplus variable.
We handle an unrestricted variable x as follows: we introduce two nonnegative variables, x+ and x−, and
replace x by x+ − x− everywhere. The idea is that we let x = x+ − x−, where we may restrict both x+ and x− to be
nonnegative. This way, x can take on any value, but there are only nonnegative variables.
Finally, to turn a maximization problem into a minimization one, we just multiply the objective function by −1.
A production scheduling example
We have the demand estimates for our product for all months of 1997, di : i = 1, . . . ,12, and they are very
uneven, ranging from 440 to 920. We currently have 30 employees, each of whom produces 20 units of the product
each month at a salary of 2,000; we have no stock of the product. How can we handle such fluctuations in demand?
Three ways:
• overtime —but this is expensive since it costs 80% more than regular production, and has limitations, as
workers can only work 30% overtime.
• hire and fire workers —but hiring costs 320, and firing costs 400.
• store the surplus production —but this costs 8 per item per month
This rather involved problem can be formulated and solved as a linear program. As in all such reductions, the
crucial first step is defining the variables:
• Let wi be the number of workers we have during the ith month —we have w0 = 30.
• Let xi be the production for month i.
• oi is the number of items produced by overtime in month i.
• hi and fi are the number of workers hired/fired in the beginning of month i.
• si is the amount of product stored after the end of month i.
We now must write the constraints:
• xi = 20wi + oi —the amount produced is the one produced by regular production, plus overtime.
• wi = wi−1 + hi − fi,wi ≥ 0 —the changing number of workers.
• si = si−1 +xi−di ≥ 0 —the amount stored in the end of this month is what we started with, plus the production,
minus the demand.
• oi ≤ 6wi —only 30% overtime.
Finally, what is the objective function? It is
min 2000∑wi + 400∑ fi + 320∑hi + 8∑ si + 180∑oi,
where the summations are from i = 1 to 12.
A Communication Network Problem
We have a network whose lines have the bandwidth shown in Figure 16.2. We wish to establish three calls: one
between A and B (call 1), one between B and C (call 2), and one between A and C (call 3). We must give each call
at least 2 units of bandwidth, but possibly more. The link from A to B pays 3 per unit of bandwidth, from B to C
pays 2, and from A to C pays 4. Notice that each call can be routed in two ways (the long and the short path), or by a
combination (for example, two units of bandwidth via the short route, and three via the long route). Suppose we are
a shady network administrator, and our goal is to maximize the network's income (rather than minimize the overall
cost). How do we route these calls to maximize the network’s income?
[Figure: the communication network on nodes A, B, and C; its six links have bandwidths 10, 12, 8, 6, 13, and 11, which appear as the capacity constraints in the linear program below.]
Figure 16.2: A communication network
This is also a linear program. We have variables for each call and each path (long or short); for example, x1 is the bandwidth given to call 1 along the short path, and x′2 the bandwidth given to call 2 along the long path. We demand that (1) no edge bandwidth is exceeded, and (2) each call gets a bandwidth of at least 2.
max 3x1 + 3x′1 + 2x2 + 2x′2 + 4x3 + 4x′3
x1 + x′1 + x2 + x′2 ≤ 10
x1 + x′1 + x3 + x′3 ≤ 12
x2 + x′2 + x3 + x′3 ≤ 8
x1 + x′2 + x′3 ≤ 6
x′1 + x2 + x′3 ≤ 13
x′1 + x′2 + x3 ≤ 11
x1 + x′1 ≥ 2
x2 + x′2 ≥ 2
x3 + x′3 ≥ 2
x1,x′1 . . . ,x′3 ≥ 0
The solution, obtained via simplex in a few milliseconds, is the following: x1 = 0, x′1 = 7, x2 = x′2 = 1.5, x3 = 0.5, x′3 = 4.5.
Question: Suppose that we removed the constraints stating that each call should receive at least two units.
Would the optimum change?
Approximate Separation
An interesting last application: Suppose that we have two sets of points in the plane, the black points (xi,yi) :
i = 1, . . . ,m and the white points (xi,yi) : i = m+1, . . . ,m+n. We wish to separate them by a straight line ax+by = c,
so that for all black points ax + by ≤ c, and for all white points ax + by ≥ c. In general, this would be impossible.
Still, we may want to separate them by a line that minimizes the sum of the “displacement errors” (distance from the
boundary) over all misclassified points. Here is the LP that achieves this:
min e1 + e2 + . . . + em + em+1 + . . . + em+n
e1 ≥ ax1 + by1 − c
e2 ≥ ax2 + by2 − c
...
em ≥ axm + bym − c
em+1 ≥ c − axm+1 − bym+1
...
em+n ≥ c − axm+n − bym+n
ei ≥ 0
Network Flows
Suppose that we are given the network in top of Figure 16.3, where the numbers indicate capacities, that is, the
amount of flow that can go through the edge in unit time. We wish to find the maximum amount of flow that can go
through this network, from S to T .
[Figure: the network has source S, sink T, and intermediate nodes A, B, C, D, with nine edges of capacities 5, 3, 2, 1, 1, 2, 2, 3, and 5. The figure shows the starting network and three successive augmentations of the flow, ending with a flow of value 6 and a minimum cut of capacity 6.]
Figure 16.3: Max flow
This problem can also be reduced to linear programming. We have a nonnegative variable for each edge, rep-
resenting the flow through this edge. These variables are denoted fSA, fSB, . . . We have two kinds of constraints:
capacity constraints such as fSA ≤ 5 (a total of 9 such constraints, one for each edge), and flow conservation con-
straints (one for each node except S and T ), such as fAD + fBD = fDC + fDT (a total of 4 such constraints). We wish
to maximize fSA + fSB, the amount of flow that leaves S, subject to these constraints. It is easy to see that this linear
program is equivalent to the max-flow problem. The simplex method would correctly solve it.
In the case of max-flow, it is very instructive to “simulate” the simplex method, to see what effect its various
iterations would have on the given network. Simplex would start with the all-zero flow, and would try to improve it.
How can it find a small improvement in the flow? Answer: it finds a path from S to T (say, by depth-first search),
and moves flow along this path of total value equal to the minimum capacity of an edge on the path (it can obviously
do no better). This is the first iteration of simplex (see Figure 16.3).
How would simplex continue? It would look for another path from S to T . Since this time we already partially
(or totally) use some of the edges, we should do depth-first search on the edges that have some residual capacity,
above and beyond the flow they already carry. Thus, the edge CT would be ignored, as if it were not there. The
depth-first search would now find the path S−A−D− T , and augment the flow by two more units, as shown in
Figure 16.3.
Next, simplex would again try to find a path from S to T . The path is now S−A−B−D−T (the edges C−T
and A−D are full and are therefore ignored), and we augment the flow as shown in the bottom of Figure 16.3.
Next simplex would again try to find a path. But since edges A−D, C− T , and S−B are full, they must be
ignored, and therefore depth-first search would fail to find a path, after marking the nodes S,A,C as reachable from
S. Simplex then returns the flow shown, of value 6, as maximum.
How can we be sure that it is the maximum? Notice that these reachable nodes define a cut (a set of nodes
containing S but not T ), and the capacity of this cut (the sum of the capacities of the edges going out of this set) is
6, the same as the max-flow value. (It must be the same, since this flow passes through this cut.) The existence of
this cut establishes that the flow is optimum!
There is a complication that we have swept under the rug so far: when we do depth-first search looking for a
path, we use not only the edges that are not completely full, but we must also traverse in the opposite direction all
edges that already have some non-zero flow. This would have the effect of canceling some flow; canceling may be
necessary to achieve optimality, see Figure 16.4. In this figure the only way to augment the current flow is via the
path S−B−A−T , which traverses the edge A−B in the reverse direction (a legal traversal, since A−B is carrying
non-zero flow).
[Figure 16.4: Flows may have to be canceled. A unit-capacity network on nodes S, A, B, T in which, given the flow shown, the only augmenting path S−B−A−T traverses edge A−B backwards.]
In general, a path from the source to the sink along which we can increase the flow is called an augmenting
path. We can look for an augmenting path by doing for example a depth first search along the residual network,
which we now describe. For an edge (u,v), let c(u,v) be its capacity, and let f(u,v) be the flow across the edge.
Note that we adopt the following convention: if 4 units flow from u to v, then f(u,v) = 4, and f(v,u) = −4. That is,
we interpret the fact that we could reverse the flow across an edge as being equivalent to a “negative flow”. Then the
residual capacity of an edge (u,v) is just
c(u,v) − f(u,v).
The residual network has the same vertices as the original graph; the edges of the residual network consist of all
weighted edges with strictly positive residual capacity. The idea is then if we find a path from the source to the sink
in the residual network, we have an augmenting path to increase the flow in the original network. As an exercise,
you may want to consider the residual network at each step in Figure 16.3.
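As one concrete rendering of this definition, the following sketch (a hypothetical helper, not anything from the notes) builds the residual network's edge set from dictionaries of capacities and flows, using the convention above that f(v,u) = −f(u,v):

from collections import defaultdict

def residual_edges(capacity, flow):
    """Return {(u, v): residual capacity} for all ordered pairs with positive residual capacity.
    `capacity` maps (u, v) -> c(u, v); `flow` maps (u, v) -> f(u, v), with f(v, u) = -f(u, v)."""
    nodes = {u for e in capacity for u in e}
    flow = defaultdict(int, flow)        # missing pairs carry zero flow
    residual = {}
    for u in nodes:
        for v in nodes:
            r = capacity.get((u, v), 0) - flow[(u, v)]
            if u != v and r > 0:
                residual[(u, v)] = r
    return residual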
Suppose we look for a path in the residual network using depth first search. In the case where the capacities
are integers, we will always be able to push an integral amount of flow along an augmenting path. Hence, if the
maximum flow is f∗, the total time to find the maximum flow is O(E f∗), since we may have to do an O(E) depth-first
search up to f∗ times. This is not so great.
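Here is a minimal sketch of this DFS-based augmenting-path method for integer capacities. The function name and the dictionary representation of the network are our own choices, and the example capacities at the end are placeholders rather than Figure 16.3's exact values; reverse traversal of edges carrying flow is handled automatically, because residual capacities are computed from net flows.

from collections import defaultdict

def max_flow(capacity, source, sink):
    """capacity: dict mapping (u, v) -> capacity of edge u -> v. Returns the max-flow value."""
    flow = defaultdict(int)              # flow[(u, v)] is the net flow from u to v; flow[(v, u)] = -flow[(u, v)]
    nodes = {u for e in capacity for u in e}

    def residual(u, v):
        # Residual capacity c(u, v) - f(u, v); a reverse edge has capacity 0 but
        # positive residual capacity whenever the forward edge carries flow.
        return capacity.get((u, v), 0) - flow[(u, v)]

    def dfs(u, limit, visited):
        # Try to push up to `limit` units from u to the sink; return the amount pushed.
        if u == sink:
            return limit
        visited.add(u)
        for v in nodes:
            r = residual(u, v)
            if r > 0 and v not in visited:
                pushed = dfs(v, min(limit, r), visited)
                if pushed > 0:
                    flow[(u, v)] += pushed
                    flow[(v, u)] -= pushed
                    return pushed
        return 0

    total = 0
    while True:
        pushed = dfs(source, float('inf'), set())
        if pushed == 0:
            return total
        total += pushed

# Example use, with placeholder capacities (not Figure 16.3's exact values):
caps = {('S','A'): 5, ('S','B'): 2, ('A','C'): 3, ('A','D'): 2, ('A','B'): 1,
        ('B','D'): 2, ('C','T'): 2, ('D','C'): 1, ('D','T'): 5}
print(max_flow(caps, 'S', 'T'))

Each call to dfs finds one augmenting path and pushes the bottleneck amount along it, so with integer capacities the outer loop runs at most f∗ times, matching the O(E f∗) bound above.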
Note that we do not have to do a depth-first search to find an augmenting path in the residual network. In fact,
using a breadth-first search each time yields an algorithm that provably runs in O(V E²) time, regardless of whether
or not the capacities are integers. We will not prove this here. There are other algorithms and approaches to the
max-flow problem that improve on this running time.
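Only the path-finding routine changes if we switch to breadth-first search (this is the Edmonds–Karp variant). A hedged sketch of such a replacement, using the same dictionaries as the previous sketch (the helper name is our own):

from collections import deque

def bfs_augmenting_path(capacity, flow, source, sink):
    """Return a shortest augmenting path as a list of nodes, or None if there is none."""
    nodes = {u for e in capacity for u in e}
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == sink:
            # Walk parent pointers back to the source to recover the path.
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in nodes:
            if v not in parent and capacity.get((u, v), 0) - flow.get((u, v), 0) > 0:
                parent[v] = u
                queue.append(v)
    return None

The main loop would then push the bottleneck residual capacity along the returned path (updating flow in both directions, as before) instead of letting the depth-first search do the pushing.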
To summarize: the max-flow problem can be easily reduced to linear programming and solved by simplex. But
it is easier to understand what simplex would do by following its iterations directly on the network. It repeatedly
finds a path from S to T along edges that are not yet full (have non-zero residual capacity), and also along any reverse
edges with non-zero flow. If an S−T path is found, we augment the flow along this path, and repeat. When a path
cannot be found, the set of nodes reachable from S defines a cut of capacity equal to the max-flow. Thus, the value
of the maximum flow is always equal to the capacity of the minimum cut. This is the important max-flow min-cut
theorem. One direction (that max-flow ≤ min-cut) is easy (think about it: the capacity of any cut is at least the value of any flow); the other
direction is proved by the algorithm just described.
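The argument also says how to exhibit a minimum cut once the algorithm terminates: the nodes still reachable from S in the final residual network form one side of it. A small sketch along the lines of the earlier ones (again with hypothetical names, and assuming the final flow dictionary is available):

def min_cut_side(capacity, flow, source):
    """Nodes reachable from `source` in the residual network of the final flow."""
    nodes = {u for e in capacity for u in e}
    reachable, stack = {source}, [source]
    while stack:
        u = stack.pop()
        for v in nodes:
            if v not in reachable and capacity.get((u, v), 0) - flow.get((u, v), 0) > 0:
                reachable.add(v)
                stack.append(v)
    return reachable

# The capacity of this cut equals the max-flow value:
# sum(c for (u, v), c in capacity.items() if u in reachable and v not in reachable)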
CS124 Lecture 17
[Figure 17.1: Max flow. The same sequence of network diagrams as in Figure 16.3, ending with the minimum cut of capacity 6.]
Network Flows
Suppose that we are given the network at the top of Figure 17.1, where the numbers indicate capacities, that is, the
amount of flow that can go through the edge in unit time. We wish to find the maximum amount of flow that can go
through this network, from S to T.
This problem can also be reduced to linear programming. We have a nonnegative variable for each edge, rep-
resenting the flow through this edge. These variables are denoted fSA, fSB, . . . We have two kinds of constraints:
capacity constraints such as fSA ≤ 5 (a total of 9 such constraints, one for each edge), and flow conservation con-
straints (one for each node except S and T), such as fAD + fBD = fDC + fDT (a total of 4 such constraints). We wish
to maximize fSA + fSB, the amount of flow that leaves S, subject to these constraints. It is easy to see that this linear
program is equivalent to the max-flow problem. The simplex method would correctly solve it.
In the case of max-flow, it is very instructive to “simulate” the simplex method, to see what effect its various
iterations would have on the given network. Simplex would start with the all-zero flow, and would try to improve it.
How can it find a small improvement in the flow? Answer: it finds a path from S to T (say, by depth-first search),