CS 4104 Algorithms Notes
01 July, 2021  119 min read
Contents
 01  Stable Matching
 02  Analysis of Algorithms
 03  Review of Graphs and Priority Queues
 04  Linear-Time Graph Algorithms
 05  Greedy Algorithms
 06  Greedy Graph Algorithms
 07  Applications of Minimum Spanning Trees
 08  Divide and Conquer Algorithms
 09  Dynamic Programming
 10  Network Flow
 11  Network Flow Applications
 12  NP and Computational Tractability
 13  NP-Complete Problems
 14  Coping with NP-Completeness
 15  Solutions
Introduction
Notes from CS 4104 with exercises and derivations from "Algorithm Design" by Kleinberg and Tardos.
Glossary
| Term | Definition |
| --- | --- |
| Matching | each man is paired with ≤ 1 woman and vice versa |
| Perfect Matching | each man is paired with exactly 1 woman and vice versa |
| Rogue Couple | a man and a woman who are not matched, but prefer each other to their current partners |
| Stable Matching | a perfect matching without any rogue couples |
| polynomial running time | there exist constants $c,d > 0$ such that $\forall n$, the running time is bounded by $cn^d$ |
| Asymptotic Upper Bound | a function $f(n)$ is $O(g(n))$ if there exist constants $c > 0, n_0 \geq 0$ s.t. $\forall n \geq n_0, \quad f(n) \leq cg(n)$ |
| Asymptotic Lower Bound | a function $f(n)$ is $\Omega(g(n))$ if there exist constants $c > 0, n_0 \geq 0$ s.t. $\forall n \geq n_0, \quad f(n) \geq cg(n)$ |
| Asymptotic Tight Bound | a function $f(n)$ is $\Theta(g(n))$ if $f(n)$ is $O(g(n))$ and $f(n)$ is $\Omega(g(n))$ |
| Euler Path | only possible if the number of nodes with odd degree is at most 2 |
| $v_1 - v_k$ path | in an undirected graph $G = (V,E)$, a sequence $P$ of nodes $v_1, v_2, \mathellipsis, v_{k-1}, v_k \in V$ s.t. every consecutive pair of nodes $v_i, v_{i+1}, 1 \leq i < k$ is connected by an edge in $E$ |
| simple path | a path in which all nodes are distinct |
| cycle | a path where $k > 2$, the first $k-1$ nodes are distinct, and $v_1 = v_k$ |
| connected undirected graph | for every pair of nodes $u,v \in V$, there is a path from $u$ to $v$ in $G$ |
| distance $d(u,v)$ | the minimum number of edges in any $u$-$v$ path |
| connected component of $G$ containing $s$ | the set of all nodes $u$ such that there is an $s$-$u$ path in $G$ |
| Adjacency Matrix | $n \times n$ boolean matrix, where the entry in row $i$, col $j$ is 1 iff the graph contains the edge $(i, j)$ |
| Adjacency List | array $Adj$, where $Adj[v]$ stores the list of all nodes adjacent to $v$ |
| bipartite | a graph $G = (V, E)$ is bipartite if $V$ can be partitioned into two subsets $X,Y$ such that every edge in $E$ has one endpoint in $X$ and one in $Y$ |
| Greedy algorithm | makes the best current choice without looking back, assuming the choices made prior were perfect |
| inversion | a schedule has an inversion if a job $i$ with deadline $d(i)$ is scheduled before another job $j$ with an earlier deadline $d(j)$, i.e. $d(j) < d(i), s(i) < s(j)$ |
| spanning tree | a subset $T \subseteq E$ is a spanning tree of $G$ if $(V, T)$ is a tree |
01  Stable Matching
Problem
 Each man ranks all the women in order of preference
 Each woman ranks all the men in order of preference
 Each person uses all ranks from $1, ... , n$ s.t. there are no ties or incomplete lists
Male preference matrix:
|  | $w_1$ | $w_2$ | $w_3$ | $w_4$ |
| --- | --- | --- | --- | --- |
| $m_1$ | 1 | 2 | 3 | 4 |
| $m_2$ | 4 | 3 | 1 | 2 |
| $m_3$ | 4 | 3 | 1 | 2 |
| $m_4$ | 3 | 1 | 4 | 2 |
Female preference matrix:
|  | $m_1$ | $m_2$ | $m_3$ | $m_4$ |
| --- | --- | --- | --- | --- |
| $w_1$ | 1 | 2 | 3 | 4 |
| $w_2$ | 4 | 1 | 3 | 2 |
| $w_3$ | 4 | 1 | 2 | 3 |
| $w_4$ | 3 | 1 | 2 | 4 |
 A Matching: each man is paired with ≤ 1 woman and vice versa

A Perfect Matching: each man is paired with exactly 1 woman and vice versa
 "perfect" means a one-to-one mapping
 A Rogue Couple: a man and a woman who are not matched, but prefer each other to their current partners
 A Stable Matching: A perfect matching without any rogue couples
Solution: Gale-Shapley Algorithm
$\boxed{ \begin{aligned} &\text{Initially, all men and all women are free} \\ &\text{Set } S \text{ of matched pairs is empty} \\ &\text{While there is at least one free man who has not proposed to every woman} \\ &\quad\text{Choose such a man } m \\ &\qquad m \text{ proposes to his highest-ranked woman } w \text{ to whom he has not yet proposed} \\ &\quad\text{If } w \text{ is free} \\ &\qquad\text{she becomes engaged to } m \text{ s.t. } S \leftarrow S \cup \{m, w\} \\ &\quad\text{Else if } w \text{ is engaged to } m' \text{ and she prefers } m \text{ to } m' \\ &\qquad\text{she becomes engaged to } m \text{ s.t. } S \leftarrow S \cup \{m, w\} \\ &\qquad m' \text{ becomes free s.t. } S \leftarrow S \backslash \{m', w\} \\ &\quad\text{Otherwise, } m \text{ remains free} \\ &\text{Return set } S \text{ of engaged pairs} \end{aligned} }$

Proof of Perfection
 Suppose the set $S$ of pairs returned by GaleShapley algorithm is not perfect
 $S$ is a matching, therefore there must be at least one free man $m$
 $m$ has proposed to all women (since the algorithm terminated)
 Therefore, each woman must be engaged (since she remains engaged after the first proposal to her)
 Therefore, all men must be engaged, contradicting ⚔️ the assumption that $m$ is free
 Therefore the matching is perfect. QED.
Proof of Stability
 Suppose $S$ is not stable i.e. there are two pairs $(m_1, w_1)$, $(m_2, w_2) \in S$ s.t $m_1$ prefers $w_2 > w_1$ and $w_2$ prefers $m_1 > m_2$
 $m_1$ must have proposed to $w_2$ before $w_1$; at that stage $w_2$ must have rejected $m_1$, otherwise she would have been engaged to $m_1$, which would in turn prevent the pairing of $(m_2, w_2)$ in a later iteration
 When $w_2$ rejected $m_1$, she must have been paired with some man $m_3 > m_1$
 Since $m_2$ is paired with $w_2$ at termination, $w_2$ must prefer $m_2$ to $m_3$ or $m_2 = m_3$
 This contradicts ⚔️ our conclusion (from instability) that $w_2$ prefers $m_1 > m_2$ QED.
Observations
 This algorithm computes a matching i.e. each woman gets paired with at most one man and vice versa
 The man's status can alternate between being free and being engaged
 The woman's status remains engaged after the first proposal
 The ranking of a man's partner remains the same or goes down
 The ranking of a woman's partner can never go down
 The number of proposals made after $k$ iterations is the best indicator of progress
 The max number of total proposals (or iterations) that can be made is $n^2$
 Always produces the same matching in which each man is paired with his best, valid partner, but each woman is paired with her worst, valid partner
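The proposal dynamics above can be sketched in Python. This is a minimal illustration (not from the course), assuming preferences are given as dicts mapping each person to a list ordered from most to least preferred:

```python
from collections import deque

def gale_shapley(men_prefs, women_prefs):
    """men_prefs[m] / women_prefs[w]: lists ordered from most to least preferred."""
    # rank[w][m] = position of m in w's list; a lower rank means w prefers m more
    rank = {w: {m: i for i, m in enumerate(prefs)}
            for w, prefs in women_prefs.items()}
    next_proposal = {m: 0 for m in men_prefs}  # index of next woman m proposes to
    engaged_to = {}                            # w -> current partner m
    free_men = deque(men_prefs)
    while free_men:                            # at most n^2 proposals overall
        m = free_men.popleft()
        w = men_prefs[m][next_proposal[m]]     # highest-ranked not-yet-proposed
        next_proposal[m] += 1
        if w not in engaged_to:
            engaged_to[w] = m                  # w stays engaged from now on
        elif rank[w][m] < rank[w][engaged_to[w]]:
            free_men.append(engaged_to[w])     # previous partner becomes free
            engaged_to[w] = m
        else:
            free_men.append(m)                 # m remains free
    return {m: w for w, m in engaged_to.items()}
```

Note how the code mirrors the observations: a woman's entry in `engaged_to` is never removed (she stays engaged after her first proposal), while men re-enter the `free_men` queue.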
02  Analysis of Algorithms
Polynomial Time
 Essentially brute forcing

e.g. brute force sorting:
 Given $n$ numbers, permute them so that they appear in increasing order
 Try all $n!$ permutations
 For each permutation, check if it is sorted
 $O(n \cdot n!)$
 Desirable scaling property: when the input size doubles, the algorithm should only slow down by some constant $c$
 An algorithm has polynomial running time if there exist constants $c,d > 0$ such that $\forall n$, the running time is bounded by $cn^d$
 an algorithm is said to be efficient if it has a polynomial running time
Upper and Lower Bounds

Asymptotic Upper Bound: a function $f(n)$ is $O(g(n))$ if there exists constants $c > 0, n_0 \geq 0$ s.t. $\forall n \geq n_0, \quad f(n) \leq cg(n)$
 e.g. $100n \log_2n$ is $O(n^2)$ for $c = 100, n_0 = 1$

Asymptotic Lower Bound: a function $f(n)$ is $\Omega(g(n))$ if there exist constants $c > 0, n_0 \geq 0$ s.t. $\forall n \geq n_0, \quad f(n) \geq cg(n)$
 e.g. $\frac{n}{10}\log_2 n$ is $\Omega(n)$ for $c = 1, n_0 = 1024$
In general, we attempt to find a sufficiently large $n_0$ such that our upper and lower bounds are satisfied.
 Asymptotic Tight Bound: a function $f(n)$ is $\Theta(g(n))$ if $f(n)$ is $O(g(n))$ and $f(n)$ is $\Omega(g(n))$
Transitively:
 if $f = O(g)$ and $g = O(h)$, then $f = O(h)$
 if $f = \Omega(g)$ and $g = \Omega(h)$, then $f = \Omega(h)$
 if $f = \Theta(g)$ and $g = \Theta(h)$, then $f = \Theta(h)$
Additively:
 if $f = O(h)$ and $g = O(h)$, then $f + g = O(h)$
 Similar statements hold for lower and tight bounds
 if $k$ is a constant and there are $k$ functions $f_i = O(h), 1\leq i \leq k$, then $f_1 + f_2 + \mathellipsis + f_k = O(h)$
 if $f = O(g)$, then $f + g = \Theta(g)$
Examples
$p,q,n > 0$
| $f(n)$ | $g(n)$ | Reason |
| --- | --- | --- |
| $pn^2 + qn + r$ | $\Theta(n^2)$ | $pn^2 + qn + r \leq (p + q + r)n^2$, and we can ignore low order terms |
| $pn^2 + qn + r$ | $O(n^3)$ | $n^2 \leq n^3, \text{ if } n \geq 1$ |
| $\displaystyle\sum_{0 \leq i \leq d} a_i n^i$ | $\Theta(n^d)$ | if $d > 0$ is an integer constant and $a_d > 0$ |
| $O(n^{1.59})$ | polynomial time | $n^{1.59}$ is $O(n^2)$ |
| $\log_a n$ | $O(\log_b n)$ | $\log_a x = \frac{\log_b x}{\log_b a} \; \forall a,b > 1$, therefore the base is irrelevant |
 $\forall \text{ constant } x > 0, \log n = O(n^x)$ e.g. $\log n = O(n^{0.00001})$
 $\forall \text{ constant } r > 1, d >0, n^d = O(r^n)$ e.g. $n^3 = O(1.1^n)$
03  Review of Graphs and Priority Queues
Priority Queues: Motivation
Sorting
 Instance: Nonempty list $[x_1, x_2, \mathellipsis, x_n]$ of integers
 Solution: A permutation $[y_1, y_2, \mathellipsis, y_n]$ of $x_1, x_2, \mathellipsis, x_n$ such that $y_i \leq y_{i+1}$ for all $1 \leq i < n$
Possible algorithm:
 insert each number into some data structure $D$
 Repeatedly find the smallest number in $D$, output it, and remove it
 To get $O(n \log n)$ running time, each "find minimum" step and each removal step must take $O(\log n)$ time
Priority Queues
 Store a set $S$ of elements where each element $v$ has a priority value $key(v)$
 smaller key values = higher priorities

Supported operations:
 find element with smallest key
 remove smallest element
 insert an element
 delete an element
 update the key of an element
 element deletion and key update require knowledge of the position of the element in the priority queue
 combines benefits of both lists and sorted arrays
 balanced binary tree
 heap order. For every element $v$ at a node $i$, the element $w$ at $i$'s parent satisfies $key(w) \leq key(v)$
 Each node's key is at least as large as its parent's

storing nodes of the heap in an array:
 Node at index $i$ has children at indices $2i$ and $2i+1$, with a parent at index $\lfloor \frac{i}{2} \rfloor$
 index 1 is the root
 if $2i > n$, where $n$ is the current number of elements in the heap, the node at index $i$ is a leaf
Inserting an Element: Heapify-up
 insert a new element at index $n+1$
 Fix the heap order using Heapify-up(H, n+1)
Proof of Running Time Complexity for Heapify-up(i)
 each invocation decreases the second argument by a factor of at least 2
 after $k$ invocations, the argument is at most $\frac{i}{2^k}$
 Therefore $\frac{i}{2^k} \geq 1$, which implies that $k \leq \log_2 i$, so Heapify-up is $O(\log i)$
Deleting an Element: Heapify-down
 Suppose $H$ has $n+1$ elements
 Delete element $H[i]$, moving the element at $H[n+1]$ to $H[i]$
 If the element at $H[i]$ is too small, fix the heap order using Heapify-up(H, i)
 If the element at $H[i]$ is too large, fix the heap order using Heapify-down(H, i)
Proof of Running Time Complexity for Heapify-down(i)
 Each invocation increases its second argument by a factor of at least 2
 after $k$ invocations, the argument is at least $i \cdot 2^k$; since $i \cdot 2^k \leq n$, this implies that $k \leq \log_2 \frac{n}{i}$
 Therefore the running time is $O(\log_2 \frac{n}{i})$
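The 1-indexed array layout and the two fix-up routines can be sketched as follows; a minimal illustration (the class and method names are my own, not from the course):

```python
class MinHeap:
    """1-indexed array-based min-heap: children of i are 2i and 2i+1."""

    def __init__(self):
        self.h = [None]  # index 0 unused so the root sits at index 1

    def insert(self, key):
        self.h.append(key)               # place new element at index n+1
        self._heapify_up(len(self.h) - 1)

    def extract_min(self):
        root = self.h[1]
        last = self.h.pop()              # move last element to the root slot
        if len(self.h) > 1:
            self.h[1] = last
            self._heapify_down(1)
        return root

    def _heapify_up(self, i):
        # swap upward while the element is smaller than its parent
        while i > 1 and self.h[i] < self.h[i // 2]:
            self.h[i], self.h[i // 2] = self.h[i // 2], self.h[i]
            i //= 2

    def _heapify_down(self, i):
        n = len(self.h) - 1
        while 2 * i <= n:                # node i is not a leaf
            c = 2 * i
            if c + 1 <= n and self.h[c + 1] < self.h[c]:
                c += 1                   # pick the smaller child
            if self.h[i] <= self.h[c]:
                break                    # heap order restored
            self.h[i], self.h[c] = self.h[c], self.h[i]
            i = c
```

`_heapify_up` halves its index each swap and `_heapify_down` doubles it, matching the $O(\log i)$ and $O(\log \frac{n}{i})$ bounds proved above.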
Sorting
 Instance: Nonempty list $[x_1, x_2, \mathellipsis, x_n]$ of integers
 Solution: A permutation $[y_1, y_2, \mathellipsis, y_n]$ of $x_1, x_2, \mathellipsis, x_n$ such that $y_i \leq y_{i+1}$ for all $1 \leq i < n$
Final algorithm:
 Insert each number in a priority queue $H$
 Repeatedly find the smallest number in $H$, output it, and delete it from $H$
Thus, each insertion and deletion takes $O(\log n)$ for a total running time of $O(n \log n)$
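The same priority-queue sort can be written with Python's standard `heapq` module (a binary min-heap with $O(\log n)$ push and pop); a minimal sketch:

```python
import heapq

def pq_sort(xs):
    """Sort by inserting everything into a min-heap, then popping minima."""
    h = list(xs)
    heapq.heapify(h)                                   # builds the heap in O(n)
    return [heapq.heappop(h) for _ in range(len(h))]   # n pops, O(log n) each
```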
Graphs: Motivation
 Contact tracing hahaaaaa
Taxonomy of a Graph
 comprised of vertices and (directed / undirected) edges, which can form faces
 Euler Path: Only possible if the number of nodes with odd degree is at most 2

Formally, an Undirected graph $G = (V, E)$ where $V, E$ are sets of vertices and edges with $E \subseteq V \times V$
 Elements of $E$ are unordered pairs
 $Edge(u, v)$ is incident on $u,v$ meaning $u,v$ are neighbors of one another
 Exactly one edge between any pair of nodes
 $G$ contains no self loops, i.e. no edges of the form $(u,u)$

Formally, a directed graph $G = (V, E)$ where $V, E$ are sets of vertices and edges with $E \subseteq V \times V$
 Elements of $E$ are ordered pairs
 $e = (u, v)$ where $u$ is the tail of the edge $e$, $v$ is the head, and we can say that $e$ is directed from $u$ to $v$
 A pair of nodes may be connected by two directed edges $(u,v)$ and $(v,u)$
 $G$ contains no self loops
 A $v_1 - v_k$ path in an undirected graph $G = (V,E)$ is a sequence $P$ of nodes $v_1, v_2, \mathellipsis, v_{k-1}, v_k \in V$ s.t. every consecutive pair of nodes $v_i, v_{i+1}, 1 \leq i < k$ is connected by an edge in $E$
 a path is simple if all its nodes are distinct
 a cycle is a path where $k>2$, the first $k-1$ nodes are distinct, and $v_1 = v_k$
 An undirected graph $G$ is connected if, for every pair of nodes $u,v \in V$, there is a path from $u$ to $v$ in $G$
 The distance $d(u,v)$ between nodes $u,v$ is the minimum number of edges in any $uv$ path
 The connected component of $G$ containing $s$ is the set of all nodes $u$ such that there is an $su$ path in $G$
Computing Connected Components
 Rather than computing the connected component of $G$ that contains $s$ and check if $t$ is in that component, we can "explore" $G$ starting from $s$, maintaining a set $R$ of visited nodes
Breadth First Search
 Explore $G$ starting at $s$ and going "outward" in all directions, adding nodes one "layer" at a time
 Layer $L_0$ only contains $s$
 Layer $L_1$ contains all the neighbors of $s$, and so on

Given layers $L_0, L_1, \mathellipsis, L_j$, the next layer $L_{j+1}$ contains all nodes that:
 do not belong to an earlier layer
 are connected by an edge to a node in layer $L_j$
 The shortest path from $s$ to each node in $L_j$ contains $j$ edges
 Claim: For each $j \geq 1$, layer $L_j$ consists of all nodes exactly at distance $j$ from $s$
 A nontree edge is an edge of $G$ that does not belong to the BFS tree $T$
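The layer-by-layer exploration can be sketched as follows; a minimal illustration (function and variable names are my own), assuming the graph is an adjacency-list dict:

```python
from collections import deque

def bfs_layers(adj, s):
    """adj: dict node -> list of neighbors. Returns distance and BFS-tree
    parent maps; dist[v] = j exactly when v lies in layer L_j."""
    dist, parent = {s: 0}, {s: None}     # layer L_0 contains only s
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:            # v not in any earlier layer
                dist[v] = dist[u] + 1    # v joins layer L_{dist[u]+1}
                parent[v] = u            # (u, v) is a BFS-tree edge
                q.append(v)
    return dist, parent
```

Edges of $G$ not recorded in `parent` are exactly the non-tree edges mentioned above.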
Proof
 There is a path from $s$ to $t$ if and only if $t$ is a member of some layer

Let $v$ be a node in layer $L_{j+1}$ and $u$ be the "first" node in $L_j$ such that $(u,v)$ is an edge in $G$. Consider the graph $T$ formed by all such edges, directed from $u$ to $v$
 Notice that $T$ is a tree because it is connected and the number of edges in $T$ is the number of nodes in all the layers minus one
 $T$ is called the BFS Tree
Inductive Proof of BFS Distance Property
 for every $j \geq 0$, for every node $u \in L_j$, $d(s, u) = j$
 Basis: $k =0$, $d(s,s) = 0$
 Inductive Hypothesis: Assume the claim is true for every node $v \in L_k$, $d(s, v) = k$

Inductive Step: Prove for every node $x \in L_{k+1}$, $d(s, x) = k + 1$
 $d(s,x) = d(s,y) + 1$, where $y \in L_k$ is the node from which $x$ was added to $L_{k+1}$
 Therefore $d(s,x) = k+1$
Depth First Search
 Explore $G$ as if it were a maze: start from $s$, traverse first edge out (to node $v$), traverse first edge out of $v$, . . . , reach a deadend, backtrack, repeat
Properties of each
 BFS and DFS visit the same set of nodes, but in a different order
Representing Graphs

A Graph $G = (V,E)$ has two input parameters: $|V| = n, |E| = m$, with $n-1 \leq m \leq \binom{n}{2}$
 Size of the graph is defined to be $m + n$
 Strive for algorithms whose running time is linear in graph size i.e. $O(m + n)$
 We assume that $V = \{1, 2, \mathellipsis, n\}$

Use an Adjacency Matrix: $n \times n$ boolean matrix, where the entry in row $i$, col $j$ is 1 iff the graph contains the edge $(i, j)$
 the space used is $\Theta(n^2)$, which is optimal in the worst case
 can check if there is an edge between nodes $i$, $j$ in $O(1)$ time
 iterate over all the edges incident on node $i$ in $\Theta(n)$ time

Use an Adjacency List: array $Adj$, where $Adj[v]$ stores the list of all nodes adjacent to $v$
 an edge $e = (u,v)$ appears twice: in $Adj[u]$ and $Adj[v]$
 $n_v$ is the number of neighbors of node $v$
 space used is $O(\displaystyle\sum_{v \in G}n_v) = O(m+n)$
 check if there is an edge between nodes $u, v$ in $O(n_u)$ time
 Iterate over all the edges incident on node $u$ in $\Theta(n_u)$ time
 Inserting an edge takes $O(1)$ time
 Deleting an edge takes $O(n)$ time
| Operation/Space | Adj. Matrix | Adj. List |
| --- | --- | --- |
| Is $(i, j)$ an edge? | $O(1)$ time | $O(n_i)$ time |
| Iterate over edges incident on node $i$ | $O(n)$ time | $O(n_i)$ time |
| Space used | $O(n^2)$ | $O(m + n)$ |
04  Linear-Time Graph Algorithms
Bipartite Graphs

A graph $G = (V, E)$ is bipartite if $V$ can be partitioned into two subsets $X,Y$ such that every edge in $E$ has one endpoint in $X$ and one in $Y$
 $(X \times X) \cap E = \emptyset$ and $(Y \times Y) \cap E = \emptyset$
 Color the nodes in $X$ red and the nodes in $Y$ blue. No edge in $E$ connects nodes of the same color
 no cycle with an odd number of nodes is bipartite, similarly all cycles of even length are bipartite
 Therefore, if a graph is bipartite, then it cannot contain a cycle of odd length
Algorithm for Testing Bipartite
 Assume $G$ is connected, otherwise apply the algorithm to each connected component separately
 Pick an arbitrary node $s$ and color it red.
 Color all its neighbors blue. Color the uncolored neighbors of these nodes red, and so on, till all nodes are colored
 Check if every edge has endpoints of different colours
more formally:
 Run BFS on $G$. Maintain an array for the color
 When we add a node $v$ to layer $i$, set $\text{color}[v]$ to red if $i$ is even, blue if $i$ is odd
 At the end of BFS, scan all the edges to check if there is any edge both of whose endpoints received the same color
Running time is linear proportional to the size of the graph: $O(m + n)$ since we do a constant amount of work per node in addition to the time spent by BFS.
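The two-coloring test can be folded directly into the BFS; a minimal sketch assuming a connected graph given as an adjacency-list dict (`is_bipartite` is my own name):

```python
from collections import deque

def is_bipartite(adj, s):
    """Two-color the graph by BFS from s; checking each edge as it is
    scanned is equivalent to the final edge scan in the notes."""
    color = {s: 0}                       # 0 = red (even layer), 1 = blue (odd)
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in color:
                color[v] = 1 - color[u]  # next layer gets the opposite color
                q.append(v)
            elif color[v] == color[u]:
                return False             # edge inside a layer: odd cycle exists
    return True
```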
Proof of Correctness of a "twocolorable" algorithm
Need to prove that if the algorithm says $G$ is bipartite, then it is actually bipartite AND need to prove that if $G$ is not bipartite, can we determine why?

Let $G$ be a graph and let $L_0, L_1, \mathellipsis, L_k$ be the layers produced by the BFS, starting at node $s$. Then exactly one of the following statements is true:
 No edge of $G$ joins two nodes in the same layer: Bipartite, since nodes in the even layers can be colored red and those in odd layers blue
 There is an edge of $G$ that joins two nodes in the same layer: Not Bipartite
 every edge of $G$ joins nodes in the same layer or in adjacent layers, i.e. $|i - j| \leq 1$ for any edge with endpoints in $L_i$ and $L_j$
05  Greedy Algorithms
 Greedy algorithms: make the best current choice without looking back. Assume the choices made prior were perfect
Example Problem: Interval Scheduling
 At an amusement park, want to compute the largest number of rides you can be on in one day
 Input: start and end time of each ride
 Constraint: cannot be in two places at one time
 Instance: nonempty set $\{(s(i), f(i)), 1 \leq i \leq n\}$ of start and finish times of $n$ jobs
 Solution: The largest subset of mutually compatible jobs
 Two jobs are compatible if they do not overlap
 Problem models the situation where you have a resource, a set of fixed jobs, and you want to schedule as many jobs as possible
 For any input set of jobs, the algorithm must provably compute the largest set of compatible jobs (measured by interval count, not cumulative interval length)
Template for a Greedy Algorithm
 Process jobs in some order. Add next job to the result if it is compatible with the jobs already in the result

key question: in what order should we process jobs?
 earliest start time: increasing order of start time $s(i)$
 earliest finish time: increasing order of finish time $f(i)$
 shortest interval: increasing order of $f(i)  s(i)$
 fewest conflicts: increasing order of the number of conflicting jobs
Earliest Finish Time
 the ordering that produces an optimal solution in general
Proof of Optimality
 Claim: $A$, the set of jobs selected by the algorithm, is a largest possible set of mutually compatible jobs

Proof by contradiction that there's no "better" solution at each step of the algorithm
 need to define "better", and what a step is in terms of the progress of the algorithm
 order the output in terms of increasing finish time for a measure of progress
 The finishing time of the $r$-th job selected by $A$ must be less than or equal to the finishing time of the $r$-th job selected by any other algorithm
 Let $O$ be an optimal set of jobs. We will show that $A = O$
 Let $i_1, i_2, \mathellipsis, i_k$ be the set of jobs in $A$ in order of finishing time
 Let $j_1, j_2, \mathellipsis, j_m$ be the set of jobs in $O$ in order of finishing time, with $m \geq k$
 Claim: for all indices $r \leq k, \quad f(i_r) \leq f(j_r)$

Base case: is it possible for finish time of the first job of our algorithm to be later than the opposing job?
 $f(i_1) > f(j_1)$ is not possible, only $f(i_1) \leq f(j_1)$

Inductive Hypothesis: this is always the case for any generic job index $r$:
 $f(i_r) \leq f(j_r)$ for some $r \leq k$

Inductive Step: show that the same claim holds for the next job:
 $f(i_{r+1}) \leq f(j_{r+1})$ for some $r+1 \leq k$
 $s(j_{r+1}) \geq f(j_r) \geq f(i_r)$, so $j_{r+1}$ was available when the algorithm chose $i_{r+1}$, hence $f(i_{r+1}) \leq f(j_{r+1})$
 claim $m = k$: if $m > k$, then $f(i_k) \leq f(j_k) \leq s(j_{k+1})$, so job $j_{k+1}$ would be compatible with $A$ and the algorithm would have added it
Implementing EFT
 Sort jobs in order of increasing finish time
 Store starting time of jobs in an array $S$
 $k=1$

While $k \leq |S|$
 output job $k$
 Let the finish time of job $k$ be $f$
 Iterate over $S$ from $k$ onwards to find the first index $i$ s.t. $S[i] \geq f$
 $k = i$
 Must be careful to iterate over $S$ s.t. we never scan same index more than once
 Running time is $O(n \log n)$ since it's dominated by the first sorting step
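The steps above can be sketched as follows; a minimal illustration (names are my own), using one sort followed by a single linear scan:

```python
def max_compatible_jobs(jobs):
    """jobs: list of (start, finish) pairs. Greedy by earliest finish time."""
    result = []
    last_finish = float("-inf")
    for s, f in sorted(jobs, key=lambda j: j[1]):  # O(n log n) sort dominates
        if s >= last_finish:        # compatible with everything chosen so far
            result.append((s, f))
            last_finish = f         # only the latest finish time matters
    return result
```

Tracking only `last_finish` is what guarantees each job is examined once, so the scan never revisits an index.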
Scheduling to Minimize Lateness
 Suppose a job $i$ has a length $t(i)$ and a deadline $d(i)$
 want to schedule all $n$ jobs on one resource
 Goal is to assign a starting time $s(i)$ to each job such that each job is delayed as little as possible
 A job $i$ is delayed if $f(i) > d(i)$; the lateness of job $i$ is $\max(0, f(i) - d(i))$

the lateness of a schedule is $\max \limits_{1 \leq i \leq n}(\max(0, f(i) - d(i)))$
 the largest of each job's lateness values
Minimizing Lateness
 Instance: Set $\{ (t(i), d(i)), 1 \leq i \leq n \}$ of lengths and deadlines of $n$ jobs
 Solution: Set $\{ s(i), 1 \leq i \leq n \}$ such that $\max \limits_{1 \leq i \leq n}(\max(0, f(i) - d(i)))$ is as small as possible
Template for Greedy Algorithm

Key question: In what order should we schedule the jobs:
 Shortest length: increasing order of length $t(i)$. Ignores deadlines completely; the shortest job may have a very late deadline:

| $i$ | 1 | 2 |
| --- | --- | --- |
| $t(i)$ | 1 | 10 |
| $d(i)$ | 100 | 10 |

 Shortest slack time: increasing order of $d(i) - t(i)$. Bad for long jobs with late deadlines; the job with the smallest slack may take a long time:

| $i$ | 1 | 2 |
| --- | --- | --- |
| $t(i)$ | 1 | 10 |
| $d(i)$ | 2 | 10 |

 Earliest Deadline: increasing order of deadline $d(i)$. Correct? Does it make sense to tackle jobs with the earliest deadlines first?
Proof of Earliest Deadline Optimality
$\boxed{ \begin{aligned} &\text{Order the jobs in order of increasing deadlines } d(i) \\ &\text{Assume for simplicity of notation that } d(1) \leq \mathellipsis \leq d(n) \\ &\text{Initially, } f = 0 \\ &\text{Consider the jobs } i=1, \mathellipsis, n \text{ in this order }\\ &\quad\text{Assign the job } i \text{ to the time interval from } s(i) = f \text{ to } f(i) = f + t_i\\ &\quad f \leftarrow f + t_i \\ &\text{Return the set of scheduled intervals} [s(i), f(i)], \text{ for } i = 1, \mathellipsis, n\\ \end{aligned} }$

Inversions

A schedule has an inversion if a job $i$ with a deadline $d(i)$ is scheduled before another job $j$ with an earlier deadline $d(j)$ i.e. $d(j) < d(i), s(i) < s(j)$
 if two jobs have the same deadline, they cannot cause an inversion
 in $n$ jobs, the maximum number of inversions is $\binom{n}{2}$
 Claim: if a schedule has an inversion, then there must be a pair of jobs $i, j$ such that $j$ is scheduled immediately after $i$ and $d(j) < d(i)$
Proof of Local Inversion
 If we have an inversion between $l, m$ s.t. $s(l) < s(m), d(l) > d(m)$, then we can find some inverted $i,j$ scheduled between $l,m$
 This is because: if every pair of adjacent jobs is in order of deadline, then the whole list is sorted.
 The contrapositive of this is: if a list is not sorted, then there are two adjacent elements that are out of order.
Properties of Schedules
 Claim 1: The algorithm produces a schedule with no inversion and no idle time (i.e. jobs are tightly packed)

Claim 2: All schedules (produced from the same input) with no inversions have the same lateness
 Case 1: All jobs have distinct deadlines. There is a unique schedule with no inversions and no idle time.
 Case 2: Some jobs have the same deadline. Ordering of the jobs does not change the maximum lateness of these jobs.
 Claim 3: There is an optimal schedule with no idle time.

Claim 4: There is an optimal schedule with no inversions and no idle time.
 Start with an optimal schedule $O$ (that may have inversions) and use an exchange argument to convert $O$ into a schedule that satisfies Claim 4 and has lateness not larger than $O$.
 If $O$ has an inversion, let $i, j$ be consecutive inverted jobs in $O$. After swapping $i, j$ we get a schedule $O'$ with one less inversion.
 Claim: The lateness of $O'$ is no larger than the lateness of $O$
 It is sufficient to prove the last item, since after $n \choose 2$ swaps, we obtain a schedule with no inversions whose lateness is no larger than that of $O$
 In $O$, assume each job $r$ is scheduled for the interval $[s(r), f(r)]$ and has lateness $l(r)$. For $O'$, let the lateness of $r$ be $l'(r)$
 Claim: $l'(k) = l(k), \quad \forall k \neq i,j$
 Claim: $l'(j) \leq l(j)$, since $j$ finishes earlier in $O'$ than in $O$
 Claim: $l'(i) \leq l(j)$ because $l'(i) = f(j) - d(i) \leq f(j) - d(j) = l(j)$

N.B. common mistakes with exchange arguments:
 Wrong: start with algorithm's schedule A and argue it cannot be improved by swapping two jobs
 Correct: Start with an arbitrary schedule O (which can be optimal) and argue that O can be converted into a schedule that is essentially the same as A without increasing lateness
 Wrong: Swap two jobs that are not neighboring in $O$. Pitfall is that the completion time of all intervening jobs then changes
 Correct: Show that an inversion exists between two neighboring jobs and swap them
 Claim 5: The greedy algorithm produces an optimal schedule, follows from 1, 2, 4
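The earliest-deadline-first schedule can be sketched as follows; a minimal illustration (names are my own), returning both the intervals and the maximum lateness:

```python
def edf_schedule(jobs):
    """jobs: list of (length, deadline) pairs. Schedules in increasing
    deadline order with no idle time; returns intervals and max lateness."""
    f = 0
    intervals, lateness = [], 0
    for t, d in sorted(jobs, key=lambda j: j[1]):  # increasing deadline d(i)
        intervals.append((f, f + t))               # job runs on [s(i), f(i)]
        lateness = max(lateness, f + t - d)        # track the worst lateness
        f += t                                     # next job starts immediately
    return intervals, max(lateness, 0)
```

Because jobs are packed back to back in deadline order, the output has no idle time and no inversions, matching Claims 1 and 5.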
Summary
 Greedy algorithms make local decisions

Three strategies:
 Greedy algorithm stays ahead  Show that after each step in the greedy algorithm, its solution is at least as good as that produced by any other algorithm
 Structural bound  First, discover a property that must be satisfied by every possible solution. Then show that the (greedy) algorithm produces a solution with this property
 Exchange argument  Transform the optimal solution in steps into the solution by the greedy algorithm without worsening the quality of the optimal solution
06  Greedy Graph Algorithms
The Shortest Path Problem
 $G(V,E)$ is a connected, directed graph. Each edge has a length $l(e) \geq 0$
 the length of a path $P$ is the sum of the lengths of the edges in $P$
 Goal: compute the shortest path from a specified start node $s$ to every other node in the graph
 Instance: A directed graph $G(V,E)$, a function $l: E \rightarrow \mathbb{R}^+$ and a node $s \in V$
 Solution: A set $\{P_u, u \in V\}$ of paths, where $P_u$ is the shortest path in $G$ from $s$ to $u$
Generalizing BFS
 If all edges have the same weight, or distance, BFS would work, processing nodes in order of distance

What if the graph has integer edge weights, can we make the graph unweighted?
 yes, placing dummy edges and nodes to pad out lengths > 1 at the expense of memory and running time
 Edge weight of $w$ gets $w  1$ nodes
 Size of the graph (and therefore runtime) becomes $m + n + \sum_{e \in E} l(e)$. Pseudopolynomial time depending on input values
Dijkstra's Algorithm
 Dijkstra is famous for pointing out the harm of the `goto` statement; in doing so, he also developed the shortest path algorithm
 Like BFS: explore nodes in nondecreasing order of distance from $s$. Once a node is explored, its distance is fixed
 Unlike BFS: Layers are not uniform; edges are explored by evaluating candidates by edge weight
 For each unexplored node, determine "best" preceding explored node. Record shortest path length only through explored nodes
 Like BFS: Record previous node in path, and build a tree.
Formally:

Maintain a set $S$ of explored nodes
 For each node $u \in S$, compute $d(u)$, which (we will prove, invariant) is the length of the shortest path from $s$ to $u$
 For each node $x \notin S$, maintain a value $d'(x)$, which is the length of the shortest path from $s$ to $x$ using only the nodes in $S$ and $x$ itself
 Greedily add a node $v$ to $S$ that has the smallest value of $d'(v)$
What does $v = \arg \min_{x \in V - S}(d'(x))$ mean?
 Run over all unexplored nodes $x \in V - S$
 examine all $d'$ values for each node
 Return the argument (i.e. the node) that has the smallest value of $d'(x)$
To compute the shortest paths: when adding a node $v$ to $S$, store the predecessor $u$ that minimizes $d'(v)$
Proof of Correctness
 Let $P_u$ be the path computed by the algorithm for an arbitrary node $u$
 Claim: $P_u$ is the shortest path from $s$ to $u$

Prove by induction on the size of $S$
 Base case: $|S| = 1$. The only node in $S$ is $s$
 Inductive Hypothesis: $|S| = k$ for some $k \geq 1$. The algorithm has computed $P_u$ for every node $u \in S$. Strong induction.

Inductive step: $|S| = k+1$ because we add the node $v$ to $S$. Could there be a shorter path $P'$ from $s$ to $v$? We must prove this is not the case.
 poll: $P'$ must contain an edge from $x$ to $y$ where $x$ is explored (in $S$) and $y$ is unexplored (in $V - S$)
 poll: The node $v$ in $P'$ must be explored (in $S$)

Locate key nodes on $P'$

Break $P'$ into subpaths from $s$ to $x$, $x$ to $y$, $y$ to $v$
 use $l$ to denote the lengths of the subpaths in $P'$
 $d(x) \leq l(s, x)$
 $d(u) + l(u,v) \leq d(x) + l(x, y)$
 $0 \leq l(y,v)$
 $d(u) + l(u,v) = d(v) \leq l(P') = l(s,x) + l(x,y) + l(y,v)$

Observations about Dijkstra's Algorithm
 As described, it cannot handle negative edge lengths
 The union of shortest paths output by Dijkstra's forms a tree: why?
 The union of the shortest paths from a fixed source $s$ forms a tree; these paths are not necessarily the ones computed by Dijkstra's
Running time of Dijkstra's
$\boxed{ \begin{aligned} &S = \{s\} \text{ and } d(s) = 0 \\ &\text{while } S \neq V \\ &\quad\text{for every node } x \in V - S \\ &\qquad\text{Set } d'(x) = \min_{(u, x): u \in S} (d(u) + l(u, x)) \\ &\quad\text{Set } v = \arg \min_{x \in V - S}(d'(x)) \\ &\quad\text{Add } v \text{ to } S \text{ and set } d(v) = d'(v)\\ \end{aligned} }$

 $V$ has $n$ nodes and $m$ edges, so there are $n-1$ iterations of the while loop
 In each iteration, for each node $x \in V - S$, compute $d'(x) = \min_{(u, x): u \in S} (d(u) + l(u, x))$, which is proportional to the number of edges incident on $x$
 Running time per iteration is $O(m)$, since the algorithm processes each edge $(u,x)$ in the graph exactly once (when computing $d'(x)$)
 Therefore, the overall running time is $O(mn)$
Optimizing our Algorithm
 Observation: if we add $v$ to $S$, $d'(x)$ changes only if $(v, x)$ is an edge in $G$
 Idea: For each node $x \in V - S$, store the current value of $d'(x)$. Upon adding a node $v$ to $S$, update $d'()$ only for the neighbors of $v$

How do we efficiently compute $v = \arg \min_{x \in V - S}(d'(x))$?
 Priority Queue!
 For each node $x \in V - S$, store the pair $(x, d'(x))$ in a priority queue $Q$ with $d'(x)$ as the key
 Determine the next node $v$ to add to $S$ using $ExtractMin$
 After adding $v$ to $S$, for each node $x \in V - S$ such that there is an edge from $v$ to $x$, check if $d'(x)$ should be updated, i.e. if there is a shorter path from $s$ to $x$ via $v$
 in line 8, if $x$ is not in $Q$, simply insert it
New Runtime
 $ExtractMin$ happens $n  1$ times
 For every node $v$, the running time of step 5 is $O(\deg_v)$, where $\deg_v$ is the number of outgoing neighbors of $v$: $\sum_{v \in V} O(\deg_v) = O(m)$
 $ChangeKey$ is invoked at most once for each edge, $m$ times, and is an $O(\log n)$ operation
 So, the total runtime is $O(m \log n)$
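The optimized algorithm above can be sketched in Python. This is a rough sketch, not the course's reference implementation: Python's `heapq` has no $ChangeKey$ operation, so the usual workaround is to push a new entry whenever $d'(x)$ improves and skip stale entries on extraction; the function name and adjacency-list format are my own choices.

```python
import heapq

def dijkstra(graph, s):
    """Shortest-path distances from s. graph maps node -> list of (neighbor, length)."""
    dist = {s: 0}          # current d'(x) for every discovered node
    done = set()           # the explored set S
    q = [(0, s)]           # min-heap of (d'(x), x), keyed on d'(x)
    while q:
        dx, v = heapq.heappop(q)      # ExtractMin
        if v in done:
            continue                  # stale entry; stands in for ChangeKey
        done.add(v)                   # add v to S and set d(v) = dx
        for x, l in graph[v]:         # update d'() only for neighbors of v
            if x not in done and dx + l < dist.get(x, float("inf")):
                dist[x] = dx + l
                heapq.heappush(q, (dist[x], x))
    return dist
```

Each edge is relaxed once, and each push/pop costs $O(\log n)$, matching the $O(m \log n)$ bound.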
Network Design
 want to connect a set of nodes using a set of edges with certain properties
 Input is usually a graph, and the desired network (output) should use a subset of edges in the graph
 Example: connect all nodes using a cycle of shortest total length. This problem is the NP-complete traveling salesman problem
Minimum Spanning Tree (MST)
 Given an undirected graph $G(V,E)$ with a cost $c(e) > 0$ associated with each edge $e \in E$.
 Find a subset $T$ of edges such that the graph $(V,T)$ is connected and the cost $\sum_{e \in T} c(e)$ is as small as possible
 Instance: An undirected graph $G(V,E)$ and a function $c: E \rightarrow \R^+$
 Solution: A set $T \sube E$ of edges such that $(V, T)$ is connected and the cost $\sum_{e \in T} c(e)$ is as small as possible
 Claim: if $T$ is a minimum-cost solution to this problem, then $(V,T)$ is a tree
 A subset $T$ of $E$ is a spanning tree of $G$ if $(V, T)$ is a tree
Characterizing MSTs

Does the edge of smallest cost belong to an MST? Yes
 Wrong proof: because Kruskal's algorithm adds it.
 Right proof: the cheapest edge $e$ in $G$ is the cheapest edge in cut($S$) for any $S$ containing exactly one endpoint of $e$, so by the cut property it belongs to every MST

Which edges must belong to an MST?
 What happens when we delete an edge from an MST?
 MST breaks up into subtrees
 Which edge should we add to join them?

Which edges cannot belong to an MST?
 What happens when we add an edge to an MST?
 We obtain a cycle
 Which edge in the cycle can we be sure does not belong to an MST?
Greedy Algorithm for the MST Problem

Template: process edges in some order. Add an edge to $T$ if tree property is not violated.
 Increasing cost order: process edges in increasing order of cost. Discard an edge if it creates a cycle – Kruskal's
 Dijkstra-like: start from a node $s$ and grow $T$ outward from $s$: add the node that can be attached most cheaply to the current tree – Prim's
 Decreasing cost order: delete edges in order of decreasing cost as long as the graph remains connected – Reverse-Delete
 each of these works
 Simplifying assumption: all edge costs are distinct
Graph Cuts
 a cut in a graph $G(V,E)$ is a set of edges whose removal disconnects the graph (into two or more connected components)
 Every set $S \sub V$ ($S$ cannot be empty or the entire set $V$) has a corresponding cut: cut($S$) is the set of edges $(v,w)$ such that $v \in S, w \in V - S$
 cut($S$) is a "cut" because deleting the edges in cut($S$) disconnects $S$ from $V - S$

Claim: for every $S \sub V, S \neq \empty$, every MST contains the cheapest edge in cut($S$)
 The proof is by contradiction, using an exchange argument
Proof of Cut Property of MSTs
 Negation of the desired property: There is a set $S \sub V$ and an MST $T$ such that $T$ does not contain the cheapest edge in cut($S$)
 Proof strategy: If $T$ does not contain $e$, show that there is a tree with a smaller cost than $T$ that contains $e$.

Wrong proof:
 Since $T$ is spanning, it must contain some edge, e.g. $f$, in cut($S$)
 $T - \{f\} \cup \{e\}$ has smaller cost than $T$ but may not be a spanning tree

Correct proof:
 Add $e$ to $T$ forming a cycle
 This cycle must contain an edge $e'$ in cut($S$)
 $T - \{e'\} \cup \{e\}$ has smaller cost than $T$ and is a spanning tree
Prim's Algorithm
 Maintain a tree $(S, T)$, i.e., a set of nodes and a set of edges, which we will show is always a tree
 Start with an arbitrary node $s \in S$
 Step 3 adds the cheapest edge in cut($S$) for the currently explored set $S$

In other words, each step in Prim's algorithm computes and adds the cheapest edge in the current value of cut($S$)
 $\arg \min_{(u,v):u \in S, v \in V  S} c(u,v) \equiv \arg \min_{(u,v) \in cut(S)} c(u,v)$
Optimality of Prim's

Claim: Prim's algorithm outputs an MST
 Prove that every edge inserted satisfies the cut property (true by construction, in each iteration $(u, v)$ is necessarily the cheapest edge in cut($S$) for the current value of $S$)
 Prove that the graph constructed is a spanning tree
 Why are there no cycles in $(V, T)$? Each edge added joins a node in $S$ to a node outside $S$, so it cannot close a cycle
 Why is $(V,T)$ a spanning tree (edges in $T$ connect all nodes in $V$)? Because $G$ is connected: if there were an unconnected node $v$ at the end, then Prim's would not have terminated
Final version of Prim's Algorithm
$\boxed{ \begin{aligned} &Insert(Q, s, 0, \empty) \\ &\text{while } S \neq V \\ &\quad(v, a(v), u) = ExtractMin(Q) \\ &\quad\text{Add node } v \text{ to } S \text{ and edge } (u,v) \text{ to } T \\ &\quad\text{for every node } x \in V - S \text{ s.t. } (v, x) \text{ is an edge in } G \\ &\qquad\text{if } c(v,x) < a(x) \text{ then }\\ &\qquad\quad a(x) = c(v,x) \\ &\qquad\quad ChangeKey(Q,x,a(x),v) \\ \end{aligned} }$ $Q$ is a priority queue
 Each element in $Q$ is a triple, the node, its attachment cost, and its predecessor in the MST
 In step 8, if $x$ is not already in $Q$, simply insert $(x, a(x), v)$ into $Q$
 Total of $n - 1$ $ExtractMin$ and $m$ $ChangeKey$ operations, yielding a running time of $O(m \log n)$
 The running time of step 5 is proportional to the degree of $v$; summed over all iterations, this is proportional to the number of edges $m$ in the graph
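A Python sketch of the algorithm above, under the same caveat as before: `heapq` lacks $ChangeKey$, so improved attachment costs are pushed as new entries and stale ones are skipped. The adjacency-list format (each undirected edge listed in both directions) and function name are my own assumptions.

```python
import heapq

def prim(graph, s):
    """MST edges of a connected undirected graph. graph: node -> list of (neighbor, cost)."""
    in_tree = {s}                            # the set S
    T = []                                   # MST edges (predecessor, node)
    q = [(c, s, x) for x, c in graph[s]]     # (attachment cost a(x), predecessor u, node x)
    heapq.heapify(q)
    while q:
        c, u, v = heapq.heappop(q)           # ExtractMin: cheapest edge in cut(S)
        if v in in_tree:
            continue                         # stale entry stands in for ChangeKey
        in_tree.add(v)                       # add v to S and (u, v) to T
        T.append((u, v))
        for x, cx in graph[v]:               # update attachment costs of v's neighbors
            if x not in in_tree:
                heapq.heappush(q, (cx, v, x))
    return T
```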
Kruskal's Algorithm
 Start with an empty set $T$ of edges
 Process edges in $E$ in increasing order of cost
 Add the next edge $e$ to $T$ only if adding $e$ does not create a cycle. Discard $e$ otherwise
 Note: at any iteration, $T$ may contain several connected components and each node in $V$ is in some component

Claim: Kruskal's algorithm outputs an MST
 For every edge $e$ added, demonstrate the existence of a set $S \sub V$ (and $V - S$) such that $e$ and $S$ satisfy the cut property, i.e., $e$ is the cheapest edge in cut($S$)
 If $e = (u,v)$, let $S$ be the set of nodes connected to $u$ in the current graph $T$
 Why is $e$ the cheapest edge in cut($S$)? Because we process edges in increasing order of cost

Prove that the algorithm computes a spanning tree
 $(V,T)$ contains no cycles by construction
 If $(V,T)$ is not connected, then there exists a subset $S$ of nodes not connected to $V - S$. What is the contradiction? Kruskal's would have added the cheapest edge in cut($S$), since it cannot create a cycle
Implementing Kruskal's Algorithm
 start with an empty set $T$ of edges
 Process edges in $E$ in increasing order of cost
 Add the next edge $e$ to $T$ only if adding $e$ does not create a cycle
 Sorting edges takes $O(m \log n)$ time

Key question: "Does adding $e = (u,v)$ to $T$ create a cycle?"
 Maintain set of connected components of $T$
 $Find(u)$: return the name of the connected component of $T$ that $u$ belongs to
 $Union(A,B)$: merge connected components A, B
Analysing Kruskal's Algorithm
 How many $Find$ invocations does Kruskal's need? $2m$
 How many $Union$ invocations? $n1$

Two implementations of $UnionFind$:
 Each $Find$ takes $O(1)$, $k$ invocations of $Union$ take $O(k \log k)$ time in total – $O(m + n \log n)$
 Each $Find$ takes $O(\log n)$, each invocation of $Union$ takes $O(1)$ – $O(m \log n + n)$
 In general $m \geq n - 1$ since the graph is connected, but in either case the total running time is $O(m \log n)$, since we must spend $O(m \log n)$ time sorting the edges by increasing cost, regardless of the underlying implementation of the $UnionFind$ data structure
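The whole algorithm fits in a few lines of Python. This sketch uses a simple union-find with path compression (one reasonable choice among the implementations discussed above); the node labeling and function names are my own assumptions.

```python
def kruskal(n, edges):
    """MST edges for nodes 0..n-1. edges: list of (cost, u, v) tuples."""
    parent = list(range(n))                # Union-Find forest, one root per component

    def find(u):                           # Find with path compression
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    T = []
    for c, u, v in sorted(edges):          # process edges in increasing order of cost
        ru, rv = find(u), find(v)
        if ru != rv:                       # adding (u, v) creates no cycle
            parent[ru] = rv                # Union: merge the two components
            T.append((u, v))
    return T
```

Sorting dominates, giving the $O(m \log n)$ total running time derived above.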
Comments on UnionFind and MST
 useful to maintain connected components of a graph as edges are added to the graph
 Data structure does not support edge deletion efficiently

Current best algorithm for MST runs in $O(m\alpha (m,n))$ time (Chazelle 2000), where $\alpha(m,n)$ is a very slowly growing function of $m$ and $n$; there is also an $O(m)$ expected-time randomized algorithm
 Let $\log^* n =$ the number of times you take $\log n$ before you reach 1
 e.g. $\log^*(2^{10}) = 4$, $\log^*(2^{2^{10}}) = 5$
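The iterated logarithm is easy to compute directly; a small helper mirroring the definition (the name `log_star` is my own) reproduces the two examples above:

```python
from math import log2

def log_star(n):
    """Iterated logarithm: how many times log2 is applied before the value drops to <= 1."""
    count = 0
    while n > 1:
        n = log2(n)   # math.log2 also handles Python ints beyond float range
        count += 1
    return count
```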
 Holy grail: $O(m)$ deterministic algorithm for MST
Cycle Property
 When can we be sure that an edge cannot be in any MST?
 Let $C$ be any cycle in $G$ and let $e = (v,w)$ be the most expensive edge in $C$
 Claim: $e$ does not belong to any MST of G

Proof: exchange argument.
 If a supposed MST $T$ contains $e$, show that there is a tree with smaller cost than $T$ that does not contain $e$
ReverseDelete Algorithm
¯\\_(ツ)_/¯
Observations about MST algorithms
 Any algorithm that constructs a spanning tree using the Cut and Cycle properties (include edges that satisfy the cut property, delete edges that satisfy the cycle property) outputs an MST
07  Applications of Minimum Spanning Trees
Minimum Bottleneck Spanning Tree (MBST)
 MST minimizes the total cost of the spanning network

Consider another network design criterion
 build a network connecting all cities in a mountainous region, but ensure the highest elevation is as low as possible
 total road length is not a criterion
 Idea: compute a spanning tree in which the edge with the highest cost is as cheap as possible
 In an undirected graph $G(V,E)$, let $(V,T)$ be a spanning tree. The bottleneck edge in $T$ is the edge with the largest cost in $T$
 Instance: an undirected graph $G(V,E)$ and a function $c: E \rightarrow \R^+$
 Solution: a set $T \sube E$ of edges such that $(V,T)$ is a spanning tree and there is no spanning tree in $G$ with a cheaper bottleneck edge
Is every MST an MBST, or vice versa?
 If a tree is an MBST, it might not be an MST
 If a tree is an MST, it will be an MBST
Proof:
 Let $T$ be an MST, and suppose $T'$ is a spanning tree with a cheaper bottleneck edge.
 Let $e$ be the bottleneck edge in $T$
 Every edge in $T'$ has lower cost than $e$
 Adding $e$ to $T'$ creates a cycle consisting of $e$ and edges of $T'$, so $e$ is the costliest edge in the cycle
 Since $e$ is the costliest edge in this cycle, by the cycle property, $e$ cannot belong to any MST, which contradicts the fact that $T$ is an MST
Motivation for Clustering
 Given a set of objects and distances between them
 objects can be anything
 Distance function: increasing distance corresponds to increasing dissimilarity
 Goal: group objects into clusters, where each cluster is a group of similar objects
Formalizing the Clustering Problem
 Let $U$ be the set of $n$ objects labeled $p_1, p_2, \mathellipsis, p_n$
 For every pair $p_i, p_j$, we have a distance $d(p_i, p_j)$
 We require that $d(p_i, p_i) = 0$, $d(p_i, p_j) > 0$ for $i \neq j$, and $d(p_i, p_j) = d(p_j, p_i)$
 Given a positive integer $k$, a $k$-clustering of $U$ is a partition of $U$ into $k$ non-empty subsets or clusters $C_1, C_2, ..., C_k$

The spacing of a clustering is the smallest distance between objects in two different subsets:
 $spacing(C_1, C_2, ..., C_k) = \min_{1 \leq i,j \leq k, i\neq j, p \in C_i, q \in C_j} d(p,q)$
$\boxed{ \begin{aligned} &minimum = \infty \\ &\text{for every cluster } C_i \\ &\quad\text{for every cluster } C_j \neq C_i \\ &\qquad\text{for every point } p \in C_i \\ &\qquad\quad\text{for every point } q \in C_j \\ &\qquad\qquad\text{if } d(p,q) < minimum \\ &\qquad\qquad\quad minimum = d(p,q) \\ &\text{return } minimum \\ \end{aligned} }$
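The pseudocode above translates directly to Python; this sketch takes clusters as lists of objects and the distance as a callable (both representations are my own assumptions):

```python
from itertools import combinations

def spacing(clusters, d):
    """Smallest distance between objects in two different clusters.
    clusters: list of lists of objects; d: distance function d(p, q)."""
    best = float("inf")
    for Ci, Cj in combinations(clusters, 2):  # every pair of distinct clusters
        for p in Ci:
            for q in Cj:
                best = min(best, d(p, q))
    return best
```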

Clustering of Maximum Spacing
 Instance: a set $U$ of objects, a distance function $d : U \times U \rightarrow \Reals^+$, and a positive integer $k$
 Solution: a $k$-clustering of $U$ whose spacing is the largest over all possible $k$-clusterings
 Computing the spacing of a given clustering takes $O(n^2)$ time, since each pair of objects in different clusters is examined once
Algorithm for Max Spacing
 intuition: greedily cluster objects in increasing order of distance
 Let $\mathcal C$ be the set of $n$ clusters, with each object in $U$ in its own cluster

Process pairs of objects in increasing order of distance
 Let $(p,q)$ be the next pair with $p \in C_p$ and $q \in C_q$
 If $C_p \neq C_q$, add a new cluster $C_p \cup C_q$ to $\mathcal{C}$ and delete $C_p, C_q$ from $\mathcal{C}$
 Stop when there are $k$ clusters in $\mathcal{C}$
 Same as Kruskal's algorithm, but do not add the last $k-1$ edges in the MST

Given a clustering $\mathcal{C}$, what is spacing($\mathcal{C}$)?
 The length of the next edge that would be added, i.e., the cost of the $(k-1)$st most expensive edge in the MST. Let this cost be $d^*$
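The greedy algorithm above can be sketched as Kruskal's algorithm that stops early and reports the spacing $d^*$. This is a sketch under my own conventions (points as hashable objects, distance as a callable, a simple union-find with path compression):

```python
def max_spacing(points, d, k):
    """Merge clusters in increasing distance order until k remain; return the spacing d*."""
    parent = {p: p for p in points}          # each object starts in its own cluster

    def find(p):                             # Find with path compression
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    pairs = sorted((d(p, q), p, q) for i, p in enumerate(points)
                   for q in points[i + 1:])  # all pairs, increasing distance
    clusters = len(points)
    for dist, p, q in pairs:
        rp, rq = find(p), find(q)
        if rp != rq:                         # p, q are in different clusters
            if clusters == k:
                return dist                  # d*: the next edge Kruskal's would add
            parent[rp] = rq                  # merge the two clusters
            clusters -= 1
    return float("inf")                      # fewer than k+1 clusters exist
```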
Proof of optimality
 Let $\mathcal{C'}$ be any other clustering (with $k$ clusters).

We will prove that spacing($\mathcal{C'}$) $\leq d^*$
 There must be two objects $p_i, p_j \in U$ in the same cluster $C_r \in \mathcal{C}$ but in different clusters in $\mathcal{C'}$: spacing($\mathcal{C'}$) $\leq d(p_i, p_j)$
 Suppose $p_i \in C'_s$ and $p_j \in C'_t$ in $\mathcal{C'}$
 All edges in the path $Q$ connecting $p_i \rightarrow p_j$ in the MST have lengths $\leq d^*$
 In particular, there is an object $p \in C'_s$ and an object $p' \notin C'_s$ s.t. $p$ and $p'$ are adjacent in $Q$
 $d(p, p') \leq d^* \implies$ spacing($\mathcal{C'}$) $\leq d(p, p') \leq d^*$