# Introduction

Notes from CS 4104 with exercises and derivations from "Algorithm Design" by Kleinberg and Tardos.

## Glossary

Term Definition
Matching each man is paired with ≤ 1 woman and vice versa
Perfect Matching each man is paired with exactly 1 woman and vice versa
Rogue Couple a man and a woman who are not matched, but prefer each other to their
Stable Matching a perfect matching without any roogue couples
polynomial running time if there exists constants $c,d > 0$ such that $\forall n$, the running time is bounded by $cn^d$
Asymptotic Upper Bound a function $f(n)$ is $O(g(n))$ if there exists constants $c > 0, n_0 \geq 0$ s.t. $\forall n \geq n_0, \quad f(n) \leq cg(n)$
Asymptotic Lower Bound a function $f(n)$ is $\Omega(g(n))$ if there exist constants $c > 0, n_0 \geq 0$ s.t. $\forall n \geq n_0, \quad f(n) \geq cg(n)$
Asymptotic Tight Bound a function $f(n)$ is $\Theta(g(n))$ if $f(n)$ is $O(g(n))$ and $f(n)$ is $\Omega(g(n))$
Euler Path Only possible if the number of nodes with odd degree is at most 2
A $v_1 - v_k$ path in an undirected graph $G = (V,E)$ is a sequence of $P$ nodes $v_1, v_2, \mathellipsis, v_{k-1}, v_k; v_k \in V$ s.t. every consecutive pair of nodes $v_i; v_{i+1}, 1 \leq i < k$ is connected by an edge $E$
simple path all its nodes are distinct
cycle is a path where $k>2$, the first $k-1$ nodes are distinct and $v_1 = v_k$
connected undirected graph for every pair of nodes $u,v \in V$, there is a path from $u$ to $v$ in $G$
distance $d(u,v)$ the minimum number of edges in any $u-v$ path
connected component of $G$ containing $s$ the set of all nodes $u$ such that there is an $s-u$ path in $G$
Adjaceny Matrix $n \times n$ boolean matrix, where the entry in row $i$, col $j$ is 1 iff the graph contains the edge $(i, j)$
Adjaceny List array $Adj$, where $Adj[v]$ stores the list of all nodes adjacent to $v$
bipartite A graph $G = (V, E)$ is bipartite if $V$ can be partitioned into two subsets $X,Y$ such that every edge in $E$ has one endpoint in $X$ and one in $Y$
Greedy algorithm makes the best current choice without looking back. Assume the choices made prior were perfect
inversion A schedule has an inversion if a job $i$ with a deadline $d(i)$ is scheduled before another job $j$ with an earlier deadline $d(j)$ i.e. $d(j) < d(i), s(i) < s(j)$
spanning tree A subset $T of E$ is a spanning tree of $G$ if $(V, T)$ is a tree

# 01 - Stable Matching

### Problem

• Each man ranks all the women in order of preference
• Each woman ranks all the men in order of preference
• Each person uses all ranks from $1, ... , n$ s.t. there are no ties or incomplete lists

#### Male preference matrix:

$w_1$ $w_2$ $w_3$ $w_4$
$m_1$ 1 2 3 4
$m_2$ 4 3 1 2
$m_3$ 4 3 1 2
$m_4$ 3 1 4 2

#### Female preference matrix:

$m_1$ $m_2$ $m_3$ $m_4$
$w_1$ 1 2 3 4
$w_2$ 4 1 3 2
$w_3$ 4 1 2 3
$w_4$ 3 1 2 4
• A Matching: each man is paired with ≤ 1 woman and vice versa
• A Perfect Matching: each man is paired with exactly 1 woman and vice versa

• "perfect" means one-to-one mapping
• A Rogue Couple: a man and a woman who are not matched, but prefer each other to their current partners
• A Stable Matching: A perfect matching without any rogue couples

### Solution: Gale-Shapley Algorithm

\boxed{ \begin{aligned} &\text{Initially, all men and all women are free} \\ &\text{Set } S \text{ of matched pairs is empty} \\ &\text{While there is at least one free man who has not proposed to every woman} \\ &\quad\text{Choose such a man } m \\ &\qquad m \text{ proposes to his highest-ranked woman } w \text{ to whom he has not yet proposed} \\ &\quad\text{If } w \text{ is free} \\ &\qquad\text{she becomes engaged to } m \text{ s.t. } S \leftarrow S \cup \{m, w\} \\ &\quad\text{Else if } w \text{ is engaged to } m' \text{ and she prefers } m \text{ to } m' \\ &\qquad\text{she becomes engaged to } m \text{ s.t. } S \leftarrow S \cup \{m, w\} \\ &\qquad m' \text{ becomes free s.t. } S \leftarrow S \backslash \{m', w\} \\ &\quad\text{Otherwise, } m \text{ remains free} \\ &\text{Return set } S \text{ of engaged pairs} \end{aligned} }

#### Proof of Perfection

1. Suppose the set $S$ of pairs returned by Gale-Shapley algorithm is not perfect
2. $S$ is a matching, therefore there must be at least one free man $m$
3. $m$ has proposed to all women (since the algorithm terminated)
4. Therefore, each woman must be engaged (since she remains engaged after the first proposal to her)
5. Therefore, all men must be engaged, contradicting ⚔️ the assumption that $m$ is free
6. That matching is perfect since the program terminated QED.

#### Proof of Stability

1. Suppose $S$ is not stable i.e. there are two pairs $(m_1, w_1)$, $(m_2, w_2) \in S$ s.t $m_1$ prefers $w_2 > w_1$ and $w_2$ prefers $m_1 > m_2$
2. $m_1$ must have proposed to $w_2$ prior to $w_1$ because at that stage $w_2$ must have rejected $m_1$; otherwise she would have been paired with $w_1$, which would in turn prevent the pairing of $(m_2, w_2)$ in a later iteration
3. When $w_2$ regected $m_1$, she must have been paired with some man $m_3 > m_1$
4. Since $m_2$ is paired with $w_2$ at termination, $w_2$ must prefer $m_2$ to $m_3$ or $m_2 = m_3$
5. This contradicts ⚔️ our conclusion (from instability) that $w_2$ prefers $m_1 > m_2$ QED.

### Observations

• This algorithm computes a matching i.e. each woman gets pair with at most one man and vice versa
• The man's status can alternate between being free and being engaged
• The woman's status remains engaged after the first proposal
• The ranking of a man's partner remains the same or goes down
• The ranking of a woman's partner can never go down
• The number of proposals made after $k$ iterations is the best indicator of progress
• The max number of total proposals (or iterations) that can be made is $n^2$
• Always produces the same matching in which each man is paired with his best, valid partner, but each woman is paired with her worst, valid partner

# 02 - Analysis of Algorithms

### Polynomial Time

• Essentially brute forcing
• e.g. brute force sorting:

• Given $n$ numbers, permute them so that they appear in increasing order
• Try all $n!$ permutations
• For each permutation, check if it is sorted
• $O(nn!)$
• Desirable scaling property: when the input size doubles, the algorithm should only slow down by some constant $c$
• An algorithm has polynomial running time if there exists constants $c,d > 0$ such that $\forall n$, the running time is bounded by $cn^d$
• an algorithm is said to be efficient if it has a polynomial running time

### Upper and Lower Bounds

• Asymptotic Upper Bound: a function $f(n)$ is $O(g(n))$ if there exists constants $c > 0, n_0 \geq 0$ s.t. $\forall n \geq n_0, \quad f(n) \leq cg(n)$

• e.g. $100n \log_2n$ is $O(n^2)$ for $c = 100, n_0 = 1$
• Asymptotic Lower Bound: a function $f(n)$ is $\Omega(g(n))$ if there exist constants $c > 0, n_0 \geq 0$ s.t. $\forall n \geq n_0, \quad f(n) \geq cg(n)$

• e.g. $\frac{n}{10}\log_2 n$ is $\Omega(n)$ for $c = 1, n_0 = 1024$

In general, we attempt to find a sufficiently large $n_0$ such that our upper and lower bounds are satisfied.

• Asymptotic Tight Bound: a function $f(n)$ is $\Theta(g(n))$ if $f(n)$ is $O(g(n))$ and $f(n)$ is $\Omega(g(n))$

Transitively:

• if $f = O(g)$ and $g = O(h)$, then $f = O(h)$
• if $f = \Omega(g)$ and $g = \Omega(h)$, then $f = \Omega(h)$
• if $f = \Theta(g)$ and $g = \Theta(h)$, then $f = \Theta(h)$

• if $f = O(h)$ and $g = O(h)$, then $f + g = O(h)$
• Similar statements hold for lower and tight bounds
• if $k$ is a constant and there are $k$ functions: $f_i = O(h), 1\leq i \leq k$
\begin{aligned} f_1 + f_2 + ... + f_k = O(h) \end{aligned}
• if $f = O(g)$, then $f + g = \Theta(g)$

### Examples

$p,q,n > 0$

$f(n)$ $g(n)$ Reason
$pn^2 + qn + r$ $\Theta(n^2)$ $pn^2 + qn + r \leq (p + q + r)n^2$, and we can ignore low order terms
$pn^2 + qn + r$ $O(n^3)$ $n^2 \leq n^3, \text{ if } n \geq 1$
$\displaystyle\sum_{0 \leq i \leq d} a_i n^i$ $\Theta(n^d)$ if $d > 0$ is an integer constant and $a_d > 0$
$O(n^{1.59})$ polynomial time $n^{1.59}$ is $O(n^2)$
$\log_a n$ $O(\log_b n)$ $\log_2(x) = \frac{\log_{10}(x)}{\log_{10}(2)} \forall a,b > 1$ therefore the base is irrelevant
• $\forall \text{ constant } x > 0, \log n = O(n^x)$ e.g. $\log n = n^{0.00001}$
• $\forall \text{ constant } r > 1, d >0, n^d = O(r^n)$ e.g. $n^3 = O(1.1^n)$

# 03 - Review of Graphs and Priority Queues

## Priority Queues: Motivation

### Sorting

• Instance: Nonempty list $[x_1, x_2, \mathellipsis, x_n]$ of integers
• Solution: A permutation $[y_1, y_2, \mathellipsis, y_n]$ of $x_1, x_2, \mathellipsis, x_n$ such that $y_i \leq y_{i+1}$ for all $1 \leq i < n$Possible algorithm: • insert each number into some data structure $D$ • Repeatedly find the smallest number in $D$, output it, and remove it • To get $O(n \log n)$ running time, each "find minimum" step and each remove step must additively take $O(\log n)$ ## Priority Queues • Store a set $S$ of elements where each element $v$ has a priority value $key(v)$ • smaller key values = higher priorities • Supported operations: • find element with smallest key • remove smallest element • insert an element • delete an element • update the key of an element • element deletion and key update require knowledge of the position of the element in the priority • combines benefits of both lists and sorted arrays • balanced binary tree • heapy order. For every element $v$ at a node $i$, the element $w$ at $i$'s parent satisfies $key(w) \leq key(v)$ • Each node's key is at least as large as its parent's • storing nodes of the heap in an array: • Node at index $i$ has children at indices $2i$ and $2i+1$, with a parent at index $\lfloor \frac{i}{2} \rfloor$ • index 1 is the root • if $2i > n$, where $n$ is the current number of elements in the heap, the node at index $i$ is a leaf ### Inserting an Element: Heapify-up • insert a new element at index $n+1$ • Fix heap order using Heapify-up(H, n+1) \boxed{ \begin{aligned} &\text{Heapify-up(H, i)} \\ &\text{If } i > 1 \text{ then} \\ &\quad\text{let } j = \text{parent(i) } = \lfloor \frac{i}{2} \rfloor \\ &\quad\text{If key[H[i]] < key[H[j]] then } \\ &\qquad\text{swap entries H[i], H[j]} \\ &\qquad\text{Heapify-up(H, j)} \\ \end{aligned} } #### Proof of Running Time Complexity for Heapify-up(i) • each invocation decreases the second argument by a factor of at least 2 • after $k$ invocations, argument is at most $\frac{i}{2^k}$ • Therefore $\frac{i}{2^k} \geq 1$ which implies that $k \leq \log_2 i$ therefore heapify up is $O(\log i)$ ### Deleting an Element: Heapify-down • Suppose $H$ has $n+1$ elements • Delete element $H[i]$ moving element at $H[n+1]$ to $H[i]$ • If element at $H[i]$ is too small, fix the heap order using Heapify-up(H, i) • If element at $H[i]$ is too large, fix heap order using Heapify-down(H, i) \boxed{ \begin{aligned} &\text{Heapify-down(H, i)} \\ &\text{let } n = length(H) \\ &\text{If } 2i > n \text{ then} \\ &\quad\text{Terminate with } H \text { unchanged} \\ &\text{Else If } 2i < n \text{ then} \\ &\quad\text{Let Left, Right } = 2i, 2i+1 \\ &\quad\text{Let } j \text{ be the index that minimizes key[H[Left]] and key[H[Right]]} \\ &\text{Else } 2i = n \text{ then} \\ &\quad\text{Let } j = 2i \\ &\text{If key[H[j]] < key[H[i]] then } \\ &\quad\text{swap the entries H[i] and H[j]} \\ &\quad\text{Heapify-down(h,j)} \\ \end{aligned} } #### Proof of Running Time Complexity for Heapify-down(i) • Each invocation increases is second argument by a factor of at least 2 • after $k$ invocations arguments must be at least $i2^k \leq n$, which impleies that $k \leq \log_2 \frac{n}{i}$ • Therefore running time is $O(\log_2 \frac{n}{i})$ ## Sorting • Instance: Nonempty list $[x_1, x_2, \mathellipsis, x_n]$ of integers • Solution: A permutation $[y_1, y_2, \mathellipsis, y_n]$ of $x_1, x_2, \mathellipsis, x_n$ such that $y_i \leq y_{i+1}$ for all $1 \leq i < n$

Final algorithm:

• Insert each number in a priority queue $H$
• Repeatedly find the smallest number in $H$, output it, and delete it from $H$

Thus, each insertion and deletion take $O(\log n)$ for a total running time of $O(n \log n)$

## Graphs: Motivation

• Contact tracing hahaaaaa

### Taxonomy of a Graph

• comprised of vertices and (directed / undirect) edges, they can form face.
• Euler Path: Only possible if the number of nodes with odd degree is at most 2
• Formally, an Undirected graph $G = (V, E)$ where $V, E$ are sets of vertices and edges with $E \subseteq V \times V$

• Elements of $E$ are unordered pairs
• $Edge(u, v)$ is incident on $u,v$ meaning $u,v$ are neighbors of one another
• Exactly one edge between any pair of nodes
• $G$ contains no self loops, i.e. no edges of the from $(u,u)$
• Formally, an directed graph $G = (V, E)$ where $V, E$ are sets of vertices and edges with $E \subseteq V \times V$

• Elements of $E$ are ordered pairs
• $e = (u, v)$ where $u$ is the tail of the edge $e$, $v$ is the head, and we can say that $e$ is directed from $u$ to $v$
• A pair of nodes may be connected by two directed edges $(u,v)$ and $(v,u)$
• $G$ contains no self loops
• A $v_1 - v_k$ path in an undirected graph $G = (V,E)$ is a sequence of $P$ nodes $v_1, v_2, \mathellipsis, v_{k-1}, v_k; v_k \in V$ s.t. every consecutive pair of nodes $v_i; v_{i+1}, 1 \leq i < k$ is connected by an edge $E$
• a path is simple if all its nodes are distinct
• a cycle is a path where $k>2$, the first $k-1$ nodes are distinct and $v_1 = v_k$
• An undirected graph $G$ is connected if, for every pair of nodes $u,v \in V$, there is a path from $u$ to $v$ in $G$
• The distance $d(u,v)$ between nodes $u,v$ is the minimum number of edges in any $u-v$ path
• The connected component of $G$ containing $s$ is the set of all nodes $u$ such that there is an $s-u$ path in $G$

#### Computing Connected Components

• Rather than computing the connected component of $G$ that contains $s$ and check if $t$ is in that component, we can "explore" $G$ starting from $s$, maintaining a set $R$ of visited nodes
\boxed{ \begin{aligned} &R \text{will consist of nodes to which } s \text{has a path} \\ &\text{Initially } R = \{s\} \\ &\text{While there is ad edge } (u,v) \text{ where } u \in R, v \notin R \\ &\quad\text{Add } v \text{ to } R \end{aligned} }

• Explore $G$ starting at $s$ and going "outward" in all directions, adding nodes one "layer" at a time
• Layer $L_0$ only contains $s$
• Layer $L_1$ contains $s$ and all its neighbors, etc. etc.
• Layer $L_0, L_1, \mathellipsis , L_j, L_{j+1}$ contains all nodes that:

• do not belong to an earlier layer
• are connected by an edge to a node in layer $L_j$
• The shortest path from $s$ to each node contains $j$ edges
• Claim: For each $j \geq 1$, layer $L_j$ consists of all nodes exactly at distance $j$ from $S$
• A non-tree edge is an edge of $G$ that does not belong to the BFS tree $T$

### Proof

• There is a path from $s$ to $t$ if an only iff $t$ is a member of some layer
• Let $v$ be a node in layer $L_{j+1}$ and $u$ be the "first" node in $L_j$ such that $(u,v)$ is an edge in $G$. Consider the graph $T$ formed by all edges, directed from $u$ to $v$ • Notice that $T$ is a tree becasue it is connected and the number of edges in $T$ is the number of nodes in all the laters minus one • $T$ is called the BFS Tree ### Inductive Proof of BFS Distance Property • for every $j \geq 0$, for every node $u \in L_j$, $d(s, u) = j$ • Basis: $k =0$, $d(s,s) = 0$ • Inductive Hypothesis: Assume the claim is true for every node $v \in L_k$, $d(s, v) = k$ • Inductive Step: Prove for every node $x \in L_{k+1}$, $d(s, x) = k + 1$ • $d(s,x) = d(s, y) + 1$ if $y$ is a node in $L_k$ • Therefore $d(s,x) = k+1$ ## Depth First Search • Explore $G$ as if it were a maze: start from $s$, traverse first edge out (to node $v$), traverse first edge out of $v$, . . . , reach a dead-end, backtrack, repeat \boxed{ \begin{aligned} &\text{Mark } u \text{as explored and add it to the reachable set} R \\ &\text{For each edge } (u,v) \text{incident to } u \\ &\quad\text{If } v \text{ is not marked, invoke DFS(} v \text{)} \\ \end{aligned} } ## Properties of each • BFS and DFS visit the same set of nodes, but in a different order ## Representing Graphs • A Graph $G = (V,E)$ has two input parameters: $|V| =n, |E| = m$, with $n-1 \leq m \leq \binom{n}{2}$ • Size of graph is defined to be $m + n$ • Strive for algorithms whose running time is linear in graph size i.e. $)(m + n)$ • We assume that $V = {1, 2, \mathellipsis, n}$ • Use an Adjaceny Matrix: $n \times n$ boolean matrix, where the entry in row $i$, col $j$ is 1 iff the graph contains the edge $(i, j)$ • the space used is $\Theta(n^2)$, which is optimal in the worst case • can check if there is an edge between nodes $i$, $j$ in $O(1)$ time • iterate over all the edges incident on node $i$ in $\Theta(n)$ time • Use an Adjaceny List: array $Adj$, where $Adj[v]$ stores the list of all nodes adjacent to $v$ • an edge $e = (u,v)$ appears twice: $Adj[u],$Adj[v]
• $n_v$ is the number of neighbors of node $v$
• space used is $O(\displaystyle\sum_{v \in G}n_v) = O(m+n)$
• check if there is an adge between nodes $u, v$ in $O(n_u)$ time
• Iterate over all the edges incident on node $u$ in $\Theta(n_u)$ time
• Inserting an edge takes $O(1)$ time
• Deleting an edge takes $O(n)$ time
Is $(i, j)$ an edge? $O(1)$ time $O(n_i)$ time
Iterate over edges incident on node $i$ $O(n)$ time $O(n_i)$ time
Space used $O(n^2)$ $O(m + n)$

# 04 - Linear-Time Graph Algorithms

## Bipartite Graphs

• A graph $G = (V, E)$ is bipartite if $V$ can be partitioned into two subsets $X,Y$ such that every edge in $E$ has one endpoint in $X$ and one in $Y$

• $X \times X \cap E = \emptyset$ and $Y \times Y \cap E = \emptyset$
• Color the nodes in $X$ red and the nodes in $Y$ blue. No edge in $E$ connects nodes of the same graph
• no cycle with an odd number of nodes is bipartite, similarly all cycles of even length are bipartite
• Therefore, if a graph is bipartite, then it cannot contain a cycle of odd length

### Algorithm for Testing Bipartite

• Assume $G$ is connected, otherwise apply the algorithm to each connected component separately
• Pick an arbitrary node $s$ and color it red.
• Color all it's neighbors blue. Color the uncolored neighbors of these nodes red, and so on till all nodes are colored
• Check if every edge has endpoints of different colours

more formally:

1. Run BFS on $G$. Maintain an array for the color
2. When we add a node $v$ to a layer $i$, set $\text{color[i]}$ to red if $i$ is even, blue of odd
3. At the end of BFS, scan all the edges to check if there is any edge both of whose endpoints received the same color

Running time is linear proprotional to the size of the graph: $O(m + n)$ since we do a constant amount of work per node in addition to the time spent by BFS.

#### Proof of Correctness of a "two-colorable" algorithm

Need to prove that if the algorithm says $G$ is bipartite, then it is actually bipartite AND need to prove that if $G$ is not bipartite, can we determine why?

• Let $G$ be a graph and let $L_0, L_1, \mathellipsis, L_k$ be the layers produced by the BFS, starting at node $s$. Then exactly one of the following statements is true:

• No edge of $G$ joins two nodes in the same layer: Bipartite since nodes in the even laters can be colored red and those in odd blue
• There is an edge of $G$ that joins two nodes in the same layer: Not Bipartite
• $| L_i - L_j | = 1, \quad \forall L \in BFS$

# 05 - Greedy Algorithms

• Greedy algorithms: make the best current choice without looking back. Assume the choices made prior were perfect

## Example Problem: Interval Scheduling

• At an amusement park, want to compute the largest number of rides you can be on in one day
• Input: start and end time of each ride
• Constraint: cannot be in two places at one time
• Instance: nonempty set $\{(s(i), f(i)), 1 \leq i \leq n\}$ of start and finish times of $n$ jobs
• Solution: The largest subset of mutually compatibe jobs
• Two jobs are compatible if they do not overlap
• Problem models the situation where you havea resource, a set of fixed jobs, and you want to schedule as many jobs as possible
• For any input set of jobs, the algorithm must provably compute the largest set of compatible jobs (measured by interval count, not cumulative interval length)

### Template for a Greedy Algorithm

• Process jobs in some order. Add next job to the result if it is compatible with the jobs already in the result
• key question: in what order should we process job?

• earliest start time: increasing order of start time $s(i)$
• earliest finish time: increasing order of start time $f(i)$
• shortest interval: increasing order of $f(i) - s(i)$
• fewest conflicts: increasing order of the number of conflicting jobs

### Earliest Finish Time

• the most optimal general solution
\boxed{ \begin{aligned} &\text{Initially let } R \text{ be the set of all jobs, and let } A \text{ be empty} \\ &\text{While } R \text{ is not yet empty}\\ &\quad\text{Choose a job } i \in R \text{ that has the smallest finishing time } \\ &\quad\text{Add request } i \text{ to } A \\ &\quad\text{Delete all jobs from } R \text{ that are not compatible with job } i \\ &\text{Return the set } A \text{ as the set of accepted/scheduled jobs} \\ \end{aligned} }

### Proof of Optimality

• Claim $|A|$ is a compatible set of jobs that is the larget possible in any set of mutually compatible jobs
• Proof by contradiction that there's no "better" solution at each step of the algorithm

• need to define "better", and what a step is in terms of the progress of the algorithm
• order the output in terms of increasing finish time for a measure of progress
• Finishing time of a jon $r$ selected by $A$ must be less than or equal to the finishing time of a job $r$ selected by any other algorithm
• Let $O$ be an optimal set of jobs. We will show that $|A| = |O|$
• Let $i_1, i_2, \mathellipsis, i_k$ be the set of jobs in $A$ in order of finishing time
• Let $j_1, j_2, \mathellipsis, j_m$ be the set of jobs in $O$ in order of $m \geq k$
• Claim: for all indices $r \leq k, \quad f(i_r) \leq f(j_r)$
• Base case: is it possible for finish time of the first job of our algorithm to be later than the opposing job?

• $f(i_1) > f(j_1)$ is not possible, only $f(i_1) \leq f(j_1)$
• Inductive Hypothesis: this is always the case for any generic job index $r$:

• $f(i_r) \leq f(j_r)$ for some $r \leq k$
• Inductive Step: show that the same claim holds for the next job:

• $f(i_{r+1}) \leq f(j_{r+1})$ for some $r+1 \leq k$
• $s(j_{r+1}) \geq f(i_r) \geq f(i_r)$
• claim $m = k$: $f(i_{k}) \leq f(j_{k}) \leq s(j_{k+1})$

A complete proof can be found here:

### Implementing EFT

• Sort jobs in order of increasing finish time
• Store starting time of jobs in an array $S$
• $k=1$
• While $k \leq |S|$

• output job $k$
• Let the finish time of job $k$ be $f$
• Iterate over $S$ from $k$ onwards to find the first index $i$ s.t. $S[i] \geq f$
• $k = i$
• Must be careful to iterate over $S$ s.t. we never scan same index more than once
• Running time is $O(n \log n)$ since it's dominated by the first sorting step

## Scheduling to Minimize Lateness

• Suppose a job $i$ has a length $t(i)$ and a deadline $d(i)$
• want to schedule all $n$ jobs on one resource
• Goal is to assign a starting time $s(i)$ to each job such that each job is delayed as little as possible
• A job $i$ is delayed if $f(i) > d(i)$; the lateness of job is $\max(0, f(i) - d(i))$
• the lateness of a schedule is $\max \limits_{1 \leq i \leq n}(\max(0, f(i) - d(i)))$

• the largest of each job's lateness values

### Minimizing Lateness

• Instance: Set $\{ (t(i), d(i)), 1 \leq i \leq n \}$ of lengths of deadlines of $n$ jobs
• Solution: Set $\{ s(i), 1 \leq i \leq n \}$ such that $\max \limits_{1 \leq i \leq n}(\max(0, f(i) - d(i)))$ is as small as possible

## Template for Greedy Algorithm

• Key question: In what order should we schedule the jobs:

• Shortest length: increasing order of length $t(i)$. Ignores deadlines completely, shortest job may have a very late deadline:
$i$ 1 2
$t(i)$ 1 10
$d(i)$ 100 10
• Shortest slack time: Increasing order of $d(i) - t(i)$. Bad for long jobs with late deadlines. Job with smallest slack may take a long time:
$i$ 1 2
$t(i)$ 1 10
$d(i)$ 2 10
• Earliest Deadline: Increasing order deadline $d(i)$. Correct? Does it make sense to tackle jobs with earliest deadlines first?

## Proof of Earliest Deadline Optimality

\boxed{ \begin{aligned} &\text{Order the jobs in order of increase deadlines } d(i) \\ &\text{Assume for simplicity of notation that } d(1) \leq \mathellipsis d(n) \\ &\text{Initially, } f = 0 \\ &\text{Consider the jobs } i=1, \mathellipsis, n { in this order }\\ &\quad\text{Assign the job } i \text{ to the time interval from } s(i) = f \text{ to } f(i) = f + t_i\\ &\quad f \leftarrow f + t_i \\ &\text{Return the set of scheduled intervals} [s(i), f(i)], \text{ for } i = 1, \mathellipsis, n\\ \end{aligned} }

### Inversions

• A schedule has an inversion if a job $i$ with a deadline $d(i)$ is scheduled before another job $j$ with an earlier deadline $d(j)$ i.e. $d(j) < d(i), s(i) < s(j)$

• if two jobs have the same deadline, they cannot cause an inversion
• in $n$ jobs, the maximum amount of inversion is $n \choose 2$
• Claim: if a schedule has and inversion, then there must be a pair of jobs $i, j$ such that $j$ is scheduled immediately after $i$ and $d(j) < d(i)$

#### Proof of Local Inversion

• If we have an inversion between $l, m$ s.t. $s(l) < s(m), d(l) > d(m)$, then we can find some inverted $i,j$ scheduled between $l,m$
• This is because: in a list where each element is greater than the last (in terms of deadline), then the list is sorted.
• The contrapositive of this is: if a list is unsorted, then there are two adjacent elements that are unsorted.

## Properties of Schedules

• Claim 1: The algorithm produces a schedule with no inversion and no idle time (i.e. jobs are tightly packed)
• Claim 2: All schedules (produced from the same input) with no inversions have the same lateness

• Case 1: All jobs have distinct deadlines. There is a unique schedule with no inversions and no idle time.
• Case 2: Some jobs have the same deadline. Ordering of the jobs does not change the maximum lateness of these jobs.
• Claim 3: There is an optimal schedule with no idle time.
• Claim 4: There is an optimal schedule with no inversions and no idle time.

• Start with an optimal schedule $O$ (that may have inversions) and use an exchange argument to convert $O$ into a schedule that satisfies Claim 4 and has lateness not larger than $O$.
• If $O$ has an inversion, let $i, j$ be consecutive inverted jobs in $O$. After swapping $i, j$ we get a schedule $O'$ with one less inversion.
• Claim: The lateness of $O'$ is no larger than the lateness of $O$
• It is sufficient to prove the last item, since after $n \choose 2$ swaps, we obtain a schedule with no inversions whose lateness is no larger than that of $O$
• In $O$, assume each job $r$ is scheduled for the interval $[s(r), f(r)]$ and has lateness $\mathcal l(r)$. For $O'$, let the lateness of $r$ be $l'(r)$
• Claim: $l'(k) = l(k), \quad \forall k \neq i,j$
• Claim: $l'(j) = l(j)$,
• Claim: $l'(j) \leq l(j)$ because $l'(j) = f(j) - d_i \leq f(j) - d_j = l(j)$
• N.B. common mistakes with exchange arguments:

• Wrong: start with algorithm's schedule A and argue it cannot be improved by swapping two jobs
• Correct: Start with an arbitrary schedule O (which can be optimal) and argue that O can be converted into a schedule that is essentially the same as A without increasing lateness
• Wrong: Swap two jobs that are not neighboring in $O$. Pitfall is that the completion time of all intervining jobs then changes
• Corrext: Show that an inversion exists between two neighboring jobs and swap them
• Claim 5: The greedy algorithm produces an optimal schedule, follows from 1, 2, 4

## Summary

• Greedy algorithms make local decisions
• Three strategies:

• Greedy algorithm stays ahead - Show that after each step in the greedy algorithm, its solution is at least as good as that produced by any other algorithm
• Structural bound - First, discover a property that must be satisfied by every possible solution. Then show that the (greedy) algorithm produces a solution with this property
• Exchange argument - Transform the optimal solution in steps into the solution by the greedy algorithm without worsening the quality of the optimal solution

# 06 - Greedy Graph Algorithms

## The Shortest Path Problem

• $G(V,E)$ is a connected, directed graph. Each edge has a length $l(e) \geq 0$
• length of a path $P$ us the sum of the lengths of the edges in $P$
• Goal: compute the shortest path from a specified start node $s$ to every other node in the graph
• Instace: A directed graph $G(V,E)$, a function $I: E \rightarrow \reals^+$ and a node $s \in V$
• Solution: A set $\{P_u, u \in V\}$ of paths, where $P_u$ is the shortest path in $G$ from $s$ to $u$

## Generalizing BFS

• If all edges have the same wight, or distance, BFS would work, processing nodes in order of distance.
• What if the graph has integer edge weights, can we make the graph unwieghted?

• yes, placing dummy edges and nodes to pad out lengths > 1 at the expense of memory and running time
• Edge weight of $w$ gets $w - 1$ nodes
• Size of the graph (and therefore runtime) becomes $m + n + \sum_{e \in E} l(e)$. Pseudo-polynomial time depending on input values

## Dijkstra's Algorithm

• Fmaous for pointing out the harm of the goto command, in doing so developed the shortest path algorithm
• Like BFS: explore nodes in non-increasing order of distance $s$. Once a node is explored, its distance is fixed
• Unlike BFS: Layers are not uniform, edges are explored by evaluating candidates by edge wight
• For each unexplored node, determine "best" preceding explored node. Record shortest path length only through explored nodes
• Like BFS: Record previous node in path, and build a tree.

### Formally:

• Maintain a set $S$ of explored nodes

• For each node $u \in S$, compute $d(u)$, which (we will prove, invariant) is the length of the shortest path from $s$ to $u$
• For each node $x \notin S$, matain a value $d'(x)$, which is the length of the shortest path from $s$ to $x$ using only the nodes in $S$ and $x$ itself
• Greedily add a node $v$ to $S$ that has the smallest value of $d'(v)$
\boxed{ \begin{aligned} &S = \{s\} \text{ and } d(s) = 0 \\ &\text{while } S \neq V \\ &\quad\text{for every node } x \in V - S \\ &\qquad\text{Set } d'(x) = \min_{(u, x): u \in S} (d(u) + l(u, x)) \\ &\quad\text{Set } v = \arg \min_{x \in V - S }(d'(x)) \\ &\quad\text{Add } v \text{ to } S \text{ and set } d(v) = d'(v)\\ \end{aligned} }

What does $v = \arg \min_{x \in V - S }(d'(x))$ mean?

• Run over all unexplored nodes $x \in V - S$
• examine all $d'$ values for each node
• Return the argument (i.e. the node) that has the smallest value of $d'(x)$

To compute the shorts paths: when adding a node $v$ to $S$, store the predecessor $u$ that minimizes $d'(v)$

### Proof of Correctness

• Let $P_u$ be the path computed by the algorithm for an arbitrary node $u$
• Claim: $P_u$ is the shortest path from $s$ to $u$
• Prove by induction on the size of $S$

• Base case: $|S| = 1$. The only node in $S$ is $s$
• Inductive Hypothesis: $|S| = k$ for some $k \geq 1$. The algorithm has computed $P_u$ for every node $u \in S$. Strong induction.
• Inductive step: $|S| = k+1$ because we add the node $v$ to $S$. Could the be a shorter path $P$ from $s$ to $v$? We must prove this is not the case.

• poll: $P'$ must contain an edge from x to y where x is explor (in S) and y is unexplored (in V - S)
• poll: The node v in P' must be explored (in S)
• Locate key nodes on $P'$

• Break $P'$ into sub-paths from $s$ to $x$, $x$ to $y$, $y$ to $v$

• use $l$ to denote the lengths of the sub-paths in $P'$
• $d(x) \leq l(s, x)$
• $d(u) + l(u,v) \leq d(x) + l(x, y)$
• $0 \leq l(y,v)$
• $d(u) + l(u,v) = d(v) \leq l(P') = l(s,x) + l(x,y) + l(y,v)$

• As described, it cannot handle negative edge lengths
• Union of shortest paths output by Dijkstra's forms a tree: why?
• Union of the shortest paths from a fixed source $s$ forms a tree; paths not necessarily computed by computed by Dijkstra's

#### Running time of Dijkstra's

\boxed{ \begin{aligned} &S = \{s\} \text{ and } d(s) = 0 \\ &\text{while } S \neq V \\ &\quad\text{for every node } x \in V - S \\ &\qquad\text{Set } d'(x) = \min_{(u, x): u \in S} (d(u) + l(u, x)) \\ &\quad\text{Set } v = \arg \min_{x \in V - S }(d'(x)) \\ &\quad\text{Add } v \text{ to } S \text{ and set } d(v) = d'(v)\\ \end{aligned} }
• $V$ has $n$ nodes and $m$ edges, so their are $n-1$ iterations on the while loops
• In each iteration, for each node $x \in V - S$, compute $d'(x) = \min_{(u, x): u \in S} (d(u) + l(u, x))$ which is proportional the the number of edges incident on $x$
• Running time per iteration is $O(m)$, since the algorithm processes each edge $(u,x)$ in the graph exactly once (when computing $d'(x)$)
• Therefore, the overall running time is $O(mn)$

### Optimizing our Algorithm

• Observation, if we add $v$ to $S$, $d'(x)$ changes only if $(v, x)$ is an edge in $G$
• Idea: For each node $x \in V - S$, store the current value of $d'(x)$. Upon adding a node $v$ to $S$, update $d'()$ only for neighbors of $v$
• How do we efficiently compute $v = \arg \min_{x \in V - S }(d'(x))$

• Priority Queue!
\boxed{ \begin{aligned} &Insert(Q, s, 0) \\ &S = \{s\} \text{ and } d(s) = 0 \\ &\text{while } S \neq V \\ &\quad(v, d'(v)) = ExtractMin(Q) \\ &\quad\text{Add v to S and set d(v) = d'(v)}\\ &\quad\text{for every node } x \in V - S \text{ } s.t. (v,x)\text{ is an edge in } G \\ &\qquad\text{If } d(v) + l(v,x) < d'(x) \\ &\qquad\quad d'(x) d(v) + l(v,x) \\ &\qquad\quad ChangeKey(Q,x,d'(x)) \\ \end{aligned} }
• For each node $x \in V - S$, store the pair $(x, d'(x))$ in a priority queue $Q$ with $d'(x)$ as the key
• Determine the next node $v$ to add to $S$ using $ExtractMin$
• After adding $v$ to $S$, for each node $x \in V - S$ such that there is an edge from $v$ to $x$, check if $d'(x)$ should be updated i.e. if there is a shortest path from $s$ to $x$ via $v$
• in line 8, if $x$ is not in $Q$, simply insert it

### New Runtime

• $ExtractMin$ happens $n - 1$ times
• For every node $v$, the running time of step 5 $O(deg_v)$, the number of outgoing neighbors of $v$: $\sum_{v \in V} O(deg_v) = O(m)$
• $ChangeKey$ is invoked at most once for each edge, $m$ times, and is an $O(\log n)$ operation
• So, the total runtime is $O(m \log n)$

## Network Design

• want to connect a set of nodes using a set of edges with certain properties
• Input is usually a graph, and the desired network (output) should use a subset of edges in the graph
• Example: connect all nodes using a cycle of shortest total length. This problem is the NP-complete traveling salesman problem

## Minimum Spanning Tree (MST)

• Given an undirected graph $G(V,E)$ with a cost $c(e) > 0$ associated with each edge $e \in E$.
• Find a subset $T$ of edges such that the graph $(V,T)$ is connected and the cost $\sum_{e \in T} c(e)$ is as small as possible
• Instance: An undirected graph $G(V,E)$ and a function $c: E \rightarrow \R^+$
• Solution: A set $T \sube E$ of edges such that $(V, T)$ is connected and the cost $\sum_{e \in T} c(e)$ is as small as possible
• Claim: if $T$ is a minimum-cost solution to this problem, then $(V,T)$ is a tree
• A subset $T of E$ is a spanning tree of $G$ if $(V, T)$ is a tree

### Characterizing MSTs

• Does the edge of smallest cost belong to an MST? Yes

• Wrong proof: Because Kruskal's algorithm adds it.
• Right proof: tbd
• Which edges must belong to an MST?

• What happens when we delete an edge from an MST?
• MST breaks up into sub-trees
• Which edge should we add to join them?
• Which edges cannot belong to an MST?

• What happens when we add an edge to an MST?
• We obtain a cycle
• Which edge in the cycle can we be sure does not belong to an MST

### Greedy Algorithm for the MST Problem

• Template: process edges in some order. Add an edge to $T$ if tree property is not violated.

• increasing cost order: Process edges in increasing order of cost. Discard an edge if it creates a cycle – Kruskal's
• Dijkstra-like: Start from a node $s$ and grow $T$ outward from $s$: add the node that can attached most cheaply to current tree – Prim's
• Decreasing cost order: Delete edges in order of decreasing cost as long as graph remains connected – Reverse-Delete
• each of these works
• Simplifying assumption: all edge costs are distinct

## Graph Cuts

• a cut in a graph $G(V,E)$ is a set of edges whose removal disconnects the graph (into two or more connected components)
• Every set $S \sub V$ ($S$ cannot be empty or the entire set $V$) has a corresponding cut: cut($S$) is the set of edges (v,w) such that $v \in S, w \in V - S$
• cut($S$) is a "cut" because deleting the edges in cut($S$) disconnects $S$ from $V - S$
• Claim: for every $S \sub V, S \neq \empty$, every MST contains the cheapest edge in cut$(S$)

• will have to proof by contradiction using exchange argument

### Proof of Cut Property of MSTs

• Negation of the desired property: There is a set $S \sub V$ and an MST $T$ such that $T$ does not contaian the cheapest edge in cut($S$)
• Proof strategy: If $T$ does not contain $e$, show that there is a tree with a smaller cost than $T$ that contains $e$.
• Wrong proof:

• Since $T$ is spanning, it must contain some edge e.g. $f$ in cut(\$S)
• $T - \{f\} \cup \{e\}$ has smaller cost than $T$ but may not be a spanning tree
• Correct proof:

• Add $e$ to $T$ forming a cycle
• This cycle must contain an edge $e'$ in cut($S$)
• $T - \{e'\} \cup \{e\}$ has smaller cost than $T$ and is a spanning tree

## Prim's Algorithm

• Maintain a tree (S, T), i.e., a set of nodes and a set of edges, which we will show will always be a tree
• Start with an arbitrary node $s \in S$
• Step 3 is the cheapest edge in cut(S) for the currently explored set S
• In other words, each step in Prim's algorithm computes and adds the cheapest edge in the current value of cut($S$)

• $\arg \min_{(u,v):u \in S, v \in V - S} c(u,v) \equiv \arg \min_{(u,v) \in cut(S)} c(u,v)$
\boxed{ \begin{aligned} &S = \{s\} \text{ and } T = \empty \\ &\text{while } S \neq V \\ &\quad\text{Compute } (u, v) = \arg \min_{(u,v):u \in S, v \in V - S} c(u,v) \\ &\quad\text{Add the node } v \text{ to } S \text{ and add the edge } (u,v) \text{ to } T \end{aligned} }

### Optimality of Prim's

• Claim: Prim's algorithm outputs an MST

• Prove that every edge inserted satisfies the cut property (true by construction, in each iteration $(u, v)$ is necessarily the cheapest edge in cut($S$) for the current value of $S$)
• Prove that the graph constructed is a spanning tree
• Why are there no cycles in $(V, T)$
• Why is $(V,T)$ a spanning tree (edges in $T$ connect all nodes in $V$) - Because $G$ is connected, if there were an unconnected node $v$ at the end, then Prim's would not have terminated

### Final version of Prim's Algorithm

\boxed{ \begin{aligned} &Insert(Q, s, 0, \empty) \\ &\text{while } S \neq V \\ &\quad(v, a(v), u) = ExtractMin(Q) \\ &\quad\text{Add node } v \text{ to } S \text{ and edge } (u,v) \text{ to } T \\ &\quad\text{for every node } x \in V - S \text{ s.t. } (v, x) \text{ is ans edge in } G \\ &\qquad\text{ if } c(v,x) < a(x) \text{then }\\ &\qquad\quad a(x) = c(v,x) \\ &\qquad\quad ChangeKey(Q,x,a(x),v) \\ \end{aligned} }
• $Q$ is a priority queue
• Each element in $Q$ is a triple, the node, its attachement cost, and its predecessaor in the MST
• In step 8, if $x$ is not already in $Q$, simply insert (x, a(x), v) into $Q$
• Total of $n - 1$ $ExtractMin$ and $m$ $ChangeKey$ operations, yielding a running time of $O(m \log n)$
• running time of step 5 is proportional to the degree of $x$ which is proportional to the number of edges in the graph $m$

## Kruskal's Algorithm

• Start with an empty set $T$ of edges
• Process edges in $E$ in increasing order of cost
• Add the next edge $e$ to $T$ only if adding $e$ does not create a cycle. Discard $e$ otherwise
• Note: at any iteration, $T$ may contain several connected components and each node in $V$ is in some component
• Claim: Kruskal's algorithm outputs an MST

• For every edge $e$ added, demonstrate the existence of a set $S \sub V$ (and $V - S$) such that $e$ and $S$ satisfy the cut property, i.e.e, the cheapest edge in $cut(S)$
• If $e = (u,v)$, let $S$ be the set of nodes connected to $u$ in the current graph $T$
• Why is $e$ the cheapest edge in cut($S$) - because we process them in increasing order of cost
• Prove that the algorithm computes a spanning tree

• $(V,T)$ contains no cycles by construction
• If $(V,T)$ is not connected, then there exists a subset $S$ of nodes not connected to $V-S$. What is the contradiction?

### Implementing Kruskal's Algorithm

• start with an empty set $T$ of edges
• Process edges in $E$ in icreasing order of cost
• Add the next edge $e$ to $T$ only if adding $e$ does not create a cycle
• Sorting edges takes $O(m \log n)$ time
• Key question: "Does adding $e = (u,v)$ to $T$ create a cycle?

• Maintain set of connected components of $T$
• $Find(u)$: return the name of the connected component of $T$ that $u$ belongs to
• $Union(A,B)$: merge connected componenets A, B

### Analysing Kruskal's Algorithm

• How many $Find$ invocations does Kruskal's need? $2m$
• How many $Union$ invocations? $n-1$
• Two implementations of $Union-Find$:

• Each $Find$ take $O(1)$, $k$ invocations of $Union$ takes $O(k \log k)$ time in total – $O(m + n \log n)$
• Each $Find$ take $O(\log n )$, each invocation of $Union$ takes $O(1)$$O(m \log n + n)$
• In general $m < n$, but in either case, the total running time is $O(m \log n)$ since we have to spend $O(m \log n)$ time sorting the edges by increasing cost, regardless of the underlying implementation of the $Union-Find$ data structure

### Comments on Union-Find and MST

• useful to maintain connected components of a graph as edges are added to the graph
• Data structure does not support edge deletion efficiently
• Current best algorith for MST runs in $O(m\alpha (m,n))$ time (Chazelle 2000) where $\alpha(m,n)$ is a function of $m,n$ and $O(m)$ randomisesd time

• Let $\log^* n =$ the number of times you take $\log n$ before you reach 1
• e.g. $\log^*(2^{10}) = 4$, $\log^*(2^{2^{10}}) = 5$
• Holy grail: $O(m)$ deterministic algorithm for MST

## Cycle Property

• When can we be sure that an edge cannot be in any MST
• Let $C$ be any cycle in $G$ and let $e = (v,w)$ be the most expensive edge in $C$
• Claim: $e$ does not belong to any MST of G
• Proof: exchange argument.

• If a supposed MST $T$ contains $e$, show that there is a tree with smaller cost than $T$ that does not contain $e$

### Reverse-Delete Algorithm

¯\(ツ)

• Any algorithm that constructs a spanning tree by including the Cut and Cycle properties (include edges that satisfy the cut P, delete edges that satisfy the cycle property) will be an MST

# 07 - Applications of Minimum Spanning Trees

## Minimum Bottleneck Spanning Tree (MBST)

• MST minimises the total cost of the spanning network
• Consider another network design criterion

• build a network connext all cities in mountainous region, but ensure the highest elevation is as low as possible
• total road length is not a criterion
• Idea: compute an MST in which the edge with the highest cost is as cheap as possible
• In an undirected graph $G(V,E)$, let $(V,T)$ be a spanning tree. The bottleneck edge in $T$ is the edge with the largest fcost in $T$
• Instance: An an undirected graph $G(V,E)$ such that $V,T$ is a spanning tree
• Solution: A Set $T \sube E$ of edges such that $(V,T)$ is a spanning tree and there is no spanning tree in $G$ with a cheaper bottleneck

Is every MST and MBST, or vice versa?

• If a tree is an MBST, it might not be an MST
• If a tree is an MST, it will be an MBST

Proof:

• Let $T$ be the MST, and let $T'$ be a spanning tree with a cheaper bottleneck edge.
• Let $e$ be the bottleneck edge in $T$
• Every edge in $T'$ has lower cost than $e$
• Adding $e$ to $T'$ creates a cycle consisting only of edges, where $e$ is the costliest edge in the cycle
• Since $e$ is the costliest edge in this cycle, by the cycle property, $e$ cannot belong to any MST, which contradicts the fact that $T$ is an MST

## Motivation for Clustering

• Given a set of objects and distances between them
• objects can be anything
• Distance function: increase distance corresponds to dissimilarity
• Goal: group objects into clusters, where each cluster is a group of similar objects

### Formalizing the Clustering Problem

• Let $U$ be the set of $n$ objects labels $p_1, p_2, \mathellipsis, p_n$
• For every pair $p_i, p_j$, we have a distance $d(p_i, p_j)$
• We require that $d(p_i, p_i) = 0$, $d(p_i, p_j) > 0$, $d(p_i, p_j) = d(p_j, p_i)$,
• Given a positive integer $k$, a k-clustering of $U$ is a partition of $U$ into $k$ non-empty subsets or clusters $C_1, C_2, ..., C_k$
• that spacing of a clustering is the smallest distance between objects in 2 different subsets:

• $spacing(C_1, C_2, ..., C_k) = \min_{1 \leq i,j \leq k, i\neq j, p \in Ci, q \in C_j} d(p,q)$
• spacing is the minimum of distance between objects in two different subsets
minimum = inifnity
for every cluster C_i
for every cluster C_j ≠ i
for every point p in C_i
for every point q in C_j
if d(p,q) < minimum
minimum = d(p,q)

return minimum
• Clustering of Maximum Spacing

• Instance: a Set $U$ of objects, a distance function $d : U \times U \rightarrow \Reals^+$
• Solution: A k-clustering of $U$ whose spacing is the largest over all possible k-clusterings
• $O(n^2)$ on $n$ clusters and then $O(i \times j)$ on points in disparate cluster

## Algorithm for Max Spacing

• intuition: greedily cluster objects in increasing order of distance
• Let $\mathcal C$ be the set of $n$ clusters, with each object in $U$ in its own cluster
• Process pairs of objects in increasing order of distance

• Let $(p,q)$ be the next pair with $p \in C_p$ and $q \in C_p$
• If $C_p \neq C_q$ add a new cluster $C_p \cup C_q$ to $\mathcal{C}$, delete $C_p, C_q$ from $\mathcal{C}$
• Stop when there are $k$ cluster in $\mathcal{C}$
• Same as Kruskal's algorithm, but do not add the last $k-1$ edges in MST
• Given a clustering $\mathcal{C}$, what is spacing($\mathcal{C}$)?

• The length of the next edge that would be added - the cost of the $(k-1)$st most expensive edge in the MST. Let this cost be $d^*$

### Proof of optimal computing

• Let $\mathcal{C'}$ be any other cluster (with $k$ clusters).
• We will prove that spacing($\mathcal{C'}$) $\leq d^*$

• There must be two objects $p_i, p_j \in U$ in the same cluster $C_r \in \mathcal{C}$ but in different clusters in $\mathcal{C'}$: spacing($\mathcal{C'}$) $\leq d(p_i, q_i)$
• Suppose $p_i \in C'_s$ and $p_j \in C'_t$ in $\mathcal{C}$
• All edges in the path $Q$ connecting $p_i \rightarrow p_j$ in the MST have lengths $\leq d^*$
• In particular, there is an object $p \in C'_s$ and an object $p' \notin C'_s$ s.t. $p$ and $p'$ are adjacent in $Q$
• $d(p, p') \leq d^* \implies$