7 Minimum Spanning Tree

Minimum Spanning Tree (MST) problem is a global problem since it needs at least $Ω (D)$ rounds in a network of diameter $D$ , i.e., the computation needs the entire graph to be traversed.

The MST is an important and commonly occurring primitive in the design and operation of communication networks. In practice, the weights can be used to model delay, congestion etc. and hence an MST gives a spanning tree that minimizes total delay, congestion etc. One of the most common applications of MST is that can serve a backbone for efficient communication, e.g., it can be used naturally for broadcast.

We assume the clean network model and the synchronous CONGEST model. At the end of the distributed MST algorithm, each node will know which of its incident edges belong to the MST and which do not.

We also assume, for simplicity that all edge weights in the graph are distinct. This constraints the graph from having more than one MST. This assumption is without loss of generality, since one can tag each edge weight with an additional identifier that can be used to break ties.

As in centralized MST algorithms, distributed MST algorithms also rely on two important properties of a MST:

Cut property: A cut in a graph is a partition of the vertex set into two disjoint sets. The cut property states that, given any cut in a graph, the lightest (minimum weight) edge crossing the cut belongs to the MST (due to the assumption of unique edge weights, there is a unique lightest edge crossing the cut.)
Cycle property: Consider any cycle in the graph. The heaviest (maximum weight) edge in the cycle will not be in the MST.

Gallager-Humblet-Spira (GHS) Algorithm:

We are given an undirected, connected, weighted graph $G = (V, E, w)$ . Let $n$ be the number of nodes and $m$ be the number of edges of $G$ . Let $T$ be the MST on $G$ . A MST fragment $F$ of $T$ is defined as a connected subgraph of $T$ i.e., $F$ is a subtree of $T$ .

An outgoing edge of a MST fragment is an edge where one adjacent node to the edge is present in the fragment and the other is not. As an immediate consequence of the cut property, the minimum outgoing edge (MOE) of a fragment $F$ is an edge of the MST.

The GHS algorithm operates in phases. In the first phase, the GHS algorithm starts with each individual node as a fragment by itself and continues till there is only one fragment which is the MST. All fragments find their MOE simultaneously in parallel.

In each phase, the algorithm maintains the following invariant: Each MST fragment has a leader and all nodes know their respective parents and children. The root of the tree will be the leader. Initially, each node is a root node of the fragment, and each fragment is identified by the identifier of the root, called the fragment ID. Each node in the fragment knows its fragment ID.

Description of GHS Algorithm:

The GHS algorithm assigns a level to each fragment, which is a non-decreasing integer with initial value 0. Furthermore, each fragment with a non-zero level has an ID, which is the ID of the core edge/root in the fragment, which is selected when the fragment is constructed. During the execution of the algorithm, each node can classify each of its incident edges into three categories:

Branch edges are those that have been determined to be part of the MST.
Rejected edges are those that have been determined not to be part of the MST.
Basic edges are all edges that are neither branch edges nor rejected edges.

In level-0 fragments, each awakened node will do the following:

Choose its minimum-weight incident edge and mark that edge as a branch edge.
Send a message via the branch edge to notify the node on the other side.
Wait for a message from the other end of the edge.

The edge that is chosen by the two nodes it connects becomes the core edge, and is assigned level 1.

In non-zero-level fragments, a separate algorithm is executed in each level. This algorithm can be separated into three stages: broadcast, convergecast, and change core.

Broadcast:

The two nodes adjacent to the core broadcast messages to the rest of the nodes in the fragment. The messages are sent via the branch edge but not via the core. Each broadcast message contains the ID and level of the fragment. At the end of this stage, each node has received the new fragment ID and level.

Convergecast:

In this stage, all nodes in the fragment cooperate to find the minimum weight outgoing edge of the fragment. The message sent in this stage are opposite direction to that of the broadcast stage. Each node upon receiving all the MOE from its children will find the minimum among them along with its MOE and send the result to its parent.

Change core:

After the completion of the previous stage, the two nodes connected by the core can inform each other of the best edges they received. Then they can identify the minimum outgoing edge from the entire fragment. A message will be sent from the core to the minimum outgoing edge via a path of branch edges. Finally, a message will be sent out via the chosen outgoing edge to request to combine the two fragments that the edge connects. Depending on the levels of those two fragments, one of two combined operations are performed to form a new fragment.

Finding the minimum-weight incident outgoing edge:

As discussed above, every node needs to find its minimum weight outgoing incident edge after the receipt of a broadcast message from the core. If node $n$ receives a broadcast, it will pick its minimum weight basic edge and send a message to the node $n^{'}$ on the other side with its fragment’s ID and level. Then, node $n^{'}$ will decide whether the edge is an outgoing edge and send back a message to notify node $n$ of the result. The decision is made according to the following:

$F_{ID} (n) = F_{ID} (n^{'})$ , Then nodes $n$ and $n^{'}$ belong to the same fragment
$F_{ID} (n) \neq = F_{ID} (n^{'})$ and $L (F) \leq L (F^{'})$ , $F$ and $F^{'}$ are corresponding fragments and $L (F)$ is the level of fragment $F$ , then nodes $n$ and $n^{'}$ belong to different fragments, so the edge is outgoing.
$F_{ID} (n) = F_{ID} (n^{'})$ and $L (F) > L (F^{'})$ , Here we cannot make any conclusion. The reason is that two nodes may belong to the same fragment already, but $n^{'}$ has not discovered this fact yet due to the delay of a broadcast message . In this case, the algorithm lets node $n^{'}$ postpone the response until its level becomes higher than or equal to the level it received from node $n$ .

Combining two fragments:

Let $F$ and $F^{'}$ be the two fragments that need to be combined.

Merge: If both $F$ and $F^{'}$ share a common MOE and $L (F) = L (F^{'})$ then combine. The level of the combined fragment will be $L (F) + 1$ .
Absorb: If $L (F) < L (F^{'})$ then combine. The combined fragment will have its level as $L (F^{'})$ .
Wait: Wait till the above rules apply.

Pseudocode:

Variables:

state : sleep , find , found
- sleep - The node is not initialized
- find - The node is currently helping its fragment search for MOE
- found - MOE for the fragment has been found
status $[q]$ : basic , branch , reject
- basic - Edge is not yet determined
- branch - Edge is a part of the MST
- reject - Edge is not a part of the MST
name - Name of the fragment
level - Level of the fragment
parent - Points towards the combining edge
bestWt , bestNode , rec , testNode - temporary variables

Current node $p$ , Neighbor is $q$

Initialization:

1    pq is the least weight 
2    status[q] = branch 
3    level = 0 
4    state = found 
5    rec = 0 
6    send <connect , 0> to q

Connecting Message:

1     Receive <connect,L> from q:
2     if L < level then:
3 	      // Combine directly with root of p as the combined root
4 	      status[q] = branch
5	      send <initiate , level , name , state> to q
6
7     else if status[q] = basic then 
8	      wait
9
10    else 
11	    // Combine but increase the rank/level by 1
12	    send <initiate , level + 1 , pq , find> to q

Initiate Message:

1    Receive <initate , level' , name' , state'> from q
2	level = level'
3	state = state'
4	name = name'
5
6	bestNode = {} 
7	bestWt = INF
8	testNode = {}
9
10	for each r in adj[p]: // adj[p] - adjacency list of p
11		if (status[r] = branch && (r !=q)) then:
12			send <initiate , level' , name' , state'> to r
13
14	if state = find then:
15		rec = 0 
16		findMin()

findMin() Function:

1	if there exists a q in adj[p] such that status[q] = basic and weight of pq is minimal then:
2		testNode = q
3		send <test , level , name> to testNode
4
5	else
6		testNode = {}
7		report()

Test message:

1	Receive <test , level' , name'> from q
2
3	if level' > level then
4		wait
5	
6	if name = name' then
7		if status[q] = basic then
8			status[q] = reject
9
10		if q != testNode then
11			send <reject> to q
12
13		else
14			findMin()
15
16	else
17		send <accept> to q

Accept and Reject messages:

1	Receive <accept> from q:
2	testNode = {}
3	if weight of pq < bestWt then
4		bestWt = weight of pq
5		bestNode = q
6	report()
7	
8	Receive <reject> from q:
9	if status[q] = basic then:
10		status[q] = reject
11	
12	findMin()

report() Function:

1	report:
2	if((rec == Number such that status[q] == branch && q != parent) && testNode = {}):
3		state = found
4		send <report , bestWt> to parent

Report Message:

1	Receive <report , w> from q:
2	if q != parent then:
3		if w < bestWt then
4			bestWt = w 
5			bestNode = q
6		rec = rec + 1 
7		report()
8	
9	else:
10		if state == find:
11			wait
12	
13		else if w > bestWt then:
14			changeRoot()
15	
16		else if w = bestWt = INF:
17			stop

changeRoot() Function:

1	changeRoot():
2	if status[bestNode] = branch then 
3		send changeroot to bestNode
4	
5	else
6		status[bestNode] = branch
7		send <connect, level> to bestNode
8	
9	Receive changeroot:
10	changeRoot()

Message Complexity:

Every node is rejected only once, one test message and one reject message: Total $2∣ E ∣$ messages.

At every level, a node sends/receives at most:

receives - 1 initiate message , 1 accept message
sends - 1 report message , 1 changeroot/connect message , 1 successful test message

Since each node can be at at most $lo g n$ levels (Every level doubles the fragment size), therefore the message complexity is $2∣ E ∣ + 5 n lo g n$ .

Pipeline Algorithm:

The Pipeline algorithm is essentially an upcast algorithm, where we build a BFS tree over the graph and each node upcasts edges to the root of the BFS tree; the root ends up having (enough) global knowledge of the network topology and locally computes the MST and downcasts the MST edges to all nodes in the network. A naive algorithm would be to upcast all the edges in $Θ (m)$ rounds.

The main idea of the Pipeline MST algorithm is to filter the number of edges broadcasted so that the running time is reduced to $O (n)$ rounds. However, the message complexity can be as much as $Θ (n^{2})$ .

The pipeline algorithm uses the cycle property of MST to filter edges at the intermediate steps. Each node $v$ except the root $r$ , maintains two lists of edges $Q$ and $U$ . Initially, $Q$ only contains edges adjacent to $v$ and $U$ is empty. At each round, $v$ sends min-weight edge in $Q$ that does not create a cycle with the edges in $U$ to its parent and moves this edge from $Q$ to $U$ . If $Q$ is empty, $v$ sends a terminate message to its parent. The parent after receiving an edge from a child, adds the edge in its $Q$ list. A leaf node starts sending edges upwards at round $0$ . An intermediate node starts sending at the first round after it has received at least one message from each of its children.

Pseudocode:

1. Build a BFS Tree B in G. Let r be the root of B
2. Each node v , except the root r , maintains two list of edges, Q(v) and U(v)
3. A leaf node starts sending edges upwards at round 0. An intermediate node starts sending at the first round after it has received at least one message from each of its children.
4. Initially Q(v) contains only edges adjacent to v and U(v) is empty. At each round, v sends the minimum-weight edge in Q(v) that does not create a cycle with the edges in U(v) to its parent and moves this edge from Q(v) to U(v). Any edge that creates a cycle with edges in U(v) is deleted from Q(v). If Q(v) is empty, v sends a terminate message to its parent. The parent after receiving an edge from a child adds the edge in its Q(v) list.
5. The root r computes the MST locally among the edges it hears from its children . The solution is then broadcast over tree B to all nodes.

Analysis:

We make two observations

The edges reported by each intermediate vertex to its parent in the tree are cycle-free.
Each vertex starts sending messages upwards at round $L (v) = height (T (v))$ .

Consider an intermediate vertex $v$ at height $h$ that has still not terminated its participation in the algorithm, at round $t$ , for some $t \geq h$ . A child is active if it has not terminated yet.

Claim:

For each child $u$ of $v$ that is still active at round $t$ , $Q (v)$ at the beginning of round $t$ contains at least one edge.
If $v$ sends to its parent an edge of weight $w_{0}$ at round $t$ , then all of the elements $v$ was informed at round $t - 1$ by its active children were of weight $w_{0}$ or larger.
If $v$ sends to its parent an element of weight $w_{0}$ at round $t$ , then any later element it will learn is of weight $w_{0}$ or larger.
Any non-root node $v$ sends elements in nondecreasing weight order to its parent; it sends the elements in a continuous fashion till it terminates.

Proof:

The proof is by induction on the height of the tree.

Base case: Trivially holds for leaves
Induction Step: Consider an intermediate vertex $v$ at height $h$ and assume that the claims hold for each of its children.
- Let $A_{v}$ be the set of $k$ elements sent by $v$ to its parent during the first $t - h$ rounds. The set of edges in $A_{v}$ are cycle-free , by the line 4 in algorithm (Sent edges are in $U (v)$ and edge which is going to be sent doesn’t form cycle with $U (v)$ ). Consider an active child $u$ of $v$ . Let $A_{u}$ be the set of elements sent by $u$ to $v$ up to round $t - 1$ . $u$ has continuously transmitted to $v$ , since round $L (u) \leq h - 1$ . Hence $∣ A_{u} ∣ \geq k + 1$ . Thus there exists some edge $f \in A_{u} - A_{v}$ such that $A_{u} \cup {f}$ is cycle-free. This element belongs to $Q (v)$ .
- Consider any active child $u$ of $v$ . Let $f$ be the element sent by $u$ on round $t - 1$ . Let $f^{'}$ be some element by $u$ at some round $t^{'} \leq t - 1$ and is still in $Q (v)$ at round $t$ . Then $w (f) \geq w (f^{'}) \geq w_{0}$ .

Running Time and Message Analysis:

The root starts getting messages at time $height (T)$ . The root receives at most $n - 1$ elements from each of its children. The time to know all edges is $O (n - 1 + height (T))$ . If we consider the additional broadcasting, the total time is $O (n)$ . In the worst case, each node can $Θ (n)$ edges upward, and hence the overall message complexity is $O (n^{2})$ .

Garay-Kutten-Peleg (GKP) Algorithm:

Lets look at a distributed MST algorithm that runs in $O (D + n lo g^{*} n)$ time. The GKP algorithm consists of two parts: it combines the GHS algorithm

Resources:

Advanced Distributed Systems Lectures by Prof. Smruti Sarangi (GHS Algorithm)
Distributed Network Algorithms by Prof. Gopal Pandurangan (Chapter 7)

Indefinite Tree

Explorer

7 Minimum Spanning Tree

Gallager-Humblet-Spira (GHS) Algorithm:

Description of GHS Algorithm:

Broadcast:

Convergecast:

Change core:

Finding the minimum-weight incident outgoing edge:

Combining two fragments:

Pseudocode:

Initialization:

Connecting Message:

Initiate Message:

findMin() Function:

Test message:

Accept and Reject messages:

report() Function:

Report Message:

changeRoot() Function:

Message Complexity:

Pipeline Algorithm:

Pseudocode:

Analysis:

Claim:

Proof:

Running Time and Message Analysis:

Garay-Kutten-Peleg (GKP) Algorithm:

Resources:

Graph View

Backlinks