Minimum Spanning Tree (MST) problem is a global problem since it needs at least rounds in a network of diameter , i.e., the computation needs the entire graph to be traversed.
The MST is an important and commonly occurring primitive in the design and operation of communication networks. In practice, the weights can be used to model delay, congestion etc. and hence an MST gives a spanning tree that minimizes total delay, congestion etc. One of the most common applications of MST is that can serve a backbone for efficient communication, e.g., it can be used naturally for broadcast.
We assume the clean network model and the synchronous CONGEST model. At the end of the distributed MST algorithm, each node will know which of its incident edges belong to the MST and which do not.
We also assume, for simplicity that all edge weights in the graph are distinct. This constraints the graph from having more than one MST. This assumption is without loss of generality, since one can tag each edge weight with an additional identifier that can be used to break ties.
As in centralized MST algorithms, distributed MST algorithms also rely on two important properties of a MST:
- Cut property: A cut in a graph is a partition of the vertex set into two disjoint sets. The cut property states that, given any cut in a graph, the lightest (minimum weight) edge crossing the cut belongs to the MST (due to the assumption of unique edge weights, there is a unique lightest edge crossing the cut.)
- Cycle property: Consider any cycle in the graph. The heaviest (maximum weight) edge in the cycle will not be in the MST.
Gallager-Humblet-Spira (GHS) Algorithm:
We are given an undirected, connected, weighted graph . Let be the number of nodes and be the number of edges of . Let be the MST on . A MST fragment of is defined as a connected subgraph of i.e., is a subtree of .
An outgoing edge of a MST fragment is an edge where one adjacent node to the edge is present in the fragment and the other is not. As an immediate consequence of the cut property, the minimum outgoing edge (MOE) of a fragment is an edge of the MST.
The GHS algorithm operates in phases. In the first phase, the GHS algorithm starts with each individual node as a fragment by itself and continues till there is only one fragment which is the MST. All fragments find their MOE simultaneously in parallel.
In each phase, the algorithm maintains the following invariant: Each MST fragment has a leader and all nodes know their respective parents and children. The root of the tree will be the leader. Initially, each node is a root node of the fragment, and each fragment is identified by the identifier of the root, called the fragment ID. Each node in the fragment knows its fragment ID.
Description of GHS Algorithm:
The GHS algorithm assigns a level to each fragment, which is a non-decreasing integer with initial value 0. Furthermore, each fragment with a non-zero level has an ID, which is the ID of the core edge/root in the fragment, which is selected when the fragment is constructed. During the execution of the algorithm, each node can classify each of its incident edges into three categories:
- Branch edges are those that have been determined to be part of the MST.
- Rejected edges are those that have been determined not to be part of the MST.
- Basic edges are all edges that are neither branch edges nor rejected edges.
In level-0 fragments, each awakened node will do the following:
- Choose its minimum-weight incident edge and mark that edge as a branch edge.
- Send a message via the branch edge to notify the node on the other side.
- Wait for a message from the other end of the edge.
The edge that is chosen by the two nodes it connects becomes the core edge, and is assigned level 1.
In non-zero-level fragments, a separate algorithm is executed in each level. This algorithm can be separated into three stages: broadcast, convergecast, and change core.
Broadcast:
The two nodes adjacent to the core broadcast messages to the rest of the nodes in the fragment. The messages are sent via the branch edge but not via the core. Each broadcast message contains the ID and level of the fragment. At the end of this stage, each node has received the new fragment ID and level.
Convergecast:
In this stage, all nodes in the fragment cooperate to find the minimum weight outgoing edge of the fragment. The message sent in this stage are opposite direction to that of the broadcast stage. Each node upon receiving all the MOE from its children will find the minimum among them along with its MOE and send the result to its parent.
Change core:
After the completion of the previous stage, the two nodes connected by the core can inform each other of the best edges they received. Then they can identify the minimum outgoing edge from the entire fragment. A message will be sent from the core to the minimum outgoing edge via a path of branch edges. Finally, a message will be sent out via the chosen outgoing edge to request to combine the two fragments that the edge connects. Depending on the levels of those two fragments, one of two combined operations are performed to form a new fragment.
Finding the minimum-weight incident outgoing edge:
As discussed above, every node needs to find its minimum weight outgoing incident edge after the receipt of a broadcast message from the core. If node receives a broadcast, it will pick its minimum weight basic edge and send a message to the node on the other side with its fragment’s ID and level. Then, node will decide whether the edge is an outgoing edge and send back a message to notify node of the result. The decision is made according to the following:
- , Then nodes and belong to the same fragment
- and , and are corresponding fragments and is the level of fragment , then nodes and belong to different fragments, so the edge is outgoing.
- and , Here we cannot make any conclusion. The reason is that two nodes may belong to the same fragment already, but has not discovered this fact yet due to the delay of a broadcast message . In this case, the algorithm lets node postpone the response until its level becomes higher than or equal to the level it received from node .
Combining two fragments:
Let and be the two fragments that need to be combined.
- Merge: If both and share a common MOE and then combine. The level of the combined fragment will be .
- Absorb: If then combine. The combined fragment will have its level as .
- Wait: Wait till the above rules apply.
Pseudocode:
Variables:
- state : sleep , find , found
- sleep - The node is not initialized
- find - The node is currently helping its fragment search for MOE
- found - MOE for the fragment has been found
- status : basic , branch , reject
- basic - Edge is not yet determined
- branch - Edge is a part of the MST
- reject - Edge is not a part of the MST
- name - Name of the fragment
- level - Level of the fragment
- parent - Points towards the combining edge
- bestWt , bestNode , rec , testNode - temporary variables
Current node , Neighbor is
Initialization:
1 pq is the least weight
2 status[q] = branch
3 level = 0
4 state = found
5 rec = 0
6 send <connect , 0> to q
Connecting Message:
1 Receive <connect,L> from q:
2 if L < level then:
3 // Combine directly with root of p as the combined root
4 status[q] = branch
5 send <initiate , level , name , state> to q
6
7 else if status[q] = basic then
8 wait
9
10 else
11 // Combine but increase the rank/level by 1
12 send <initiate , level + 1 , pq , find> to q
Initiate Message:
1 Receive <initate , level' , name' , state'> from q
2 level = level'
3 state = state'
4 name = name'
5
6 bestNode = {}
7 bestWt = INF
8 testNode = {}
9
10 for each r in adj[p]: // adj[p] - adjacency list of p
11 if (status[r] = branch && (r !=q)) then:
12 send <initiate , level' , name' , state'> to r
13
14 if state = find then:
15 rec = 0
16 findMin()
findMin() Function:
1 if there exists a q in adj[p] such that status[q] = basic and weight of pq is minimal then:
2 testNode = q
3 send <test , level , name> to testNode
4
5 else
6 testNode = {}
7 report()
Test message:
1 Receive <test , level' , name'> from q
2
3 if level' > level then
4 wait
5
6 if name = name' then
7 if status[q] = basic then
8 status[q] = reject
9
10 if q != testNode then
11 send <reject> to q
12
13 else
14 findMin()
15
16 else
17 send <accept> to q
Accept and Reject messages:
1 Receive <accept> from q:
2 testNode = {}
3 if weight of pq < bestWt then
4 bestWt = weight of pq
5 bestNode = q
6 report()
7
8 Receive <reject> from q:
9 if status[q] = basic then:
10 status[q] = reject
11
12 findMin()
report() Function:
1 report:
2 if((rec == Number such that status[q] == branch && q != parent) && testNode = {}):
3 state = found
4 send <report , bestWt> to parent
Report Message:
1 Receive <report , w> from q:
2 if q != parent then:
3 if w < bestWt then
4 bestWt = w
5 bestNode = q
6 rec = rec + 1
7 report()
8
9 else:
10 if state == find:
11 wait
12
13 else if w > bestWt then:
14 changeRoot()
15
16 else if w = bestWt = INF:
17 stop
changeRoot() Function:
1 changeRoot():
2 if status[bestNode] = branch then
3 send changeroot to bestNode
4
5 else
6 status[bestNode] = branch
7 send <connect, level> to bestNode
8
9 Receive changeroot:
10 changeRoot()
Message Complexity:
Every node is rejected only once, one test message and one reject message: Total messages.
At every level, a node sends/receives at most:
- receives - 1 initiate message , 1 accept message
- sends - 1 report message , 1 changeroot/connect message , 1 successful test message
Since each node can be at at most levels (Every level doubles the fragment size), therefore the message complexity is .
Pipeline Algorithm:
The Pipeline algorithm is essentially an upcast algorithm, where we build a BFS tree over the graph and each node upcasts edges to the root of the BFS tree; the root ends up having (enough) global knowledge of the network topology and locally computes the MST and downcasts the MST edges to all nodes in the network. A naive algorithm would be to upcast all the edges in rounds.
The main idea of the Pipeline MST algorithm is to filter the number of edges broadcasted so that the running time is reduced to rounds. However, the message complexity can be as much as .
The pipeline algorithm uses the cycle property of MST to filter edges at the intermediate steps. Each node except the root , maintains two lists of edges and . Initially, only contains edges adjacent to and is empty. At each round, sends min-weight edge in that does not create a cycle with the edges in to its parent and moves this edge from to . If is empty, sends a terminate message to its parent. The parent after receiving an edge from a child, adds the edge in its list. A leaf node starts sending edges upwards at round . An intermediate node starts sending at the first round after it has received at least one message from each of its children.
Pseudocode:
1. Build a BFS Tree B in G. Let r be the root of B
2. Each node v , except the root r , maintains two list of edges, Q(v) and U(v)
3. A leaf node starts sending edges upwards at round 0. An intermediate node starts sending at the first round after it has received at least one message from each of its children.
4. Initially Q(v) contains only edges adjacent to v and U(v) is empty. At each round, v sends the minimum-weight edge in Q(v) that does not create a cycle with the edges in U(v) to its parent and moves this edge from Q(v) to U(v). Any edge that creates a cycle with edges in U(v) is deleted from Q(v). If Q(v) is empty, v sends a terminate message to its parent. The parent after receiving an edge from a child adds the edge in its Q(v) list.
5. The root r computes the MST locally among the edges it hears from its children . The solution is then broadcast over tree B to all nodes.
Analysis:
We make two observations
- The edges reported by each intermediate vertex to its parent in the tree are cycle-free.
- Each vertex starts sending messages upwards at round .
Consider an intermediate vertex at height that has still not terminated its participation in the algorithm, at round , for some . A child is active if it has not terminated yet.
Claim:
- For each child of that is still active at round , at the beginning of round contains at least one edge.
- If sends to its parent an edge of weight at round , then all of the elements was informed at round by its active children were of weight or larger.
- If sends to its parent an element of weight at round , then any later element it will learn is of weight or larger.
- Any non-root node sends elements in nondecreasing weight order to its parent; it sends the elements in a continuous fashion till it terminates.
Proof:
The proof is by induction on the height of the tree.
- Base case: Trivially holds for leaves
- Induction Step: Consider an intermediate vertex at height and assume that the claims hold for each of its children.
- Let be the set of elements sent by to its parent during the first rounds. The set of edges in are cycle-free , by the line 4 in algorithm (Sent edges are in and edge which is going to be sent doesn’t form cycle with ). Consider an active child of . Let be the set of elements sent by to up to round . has continuously transmitted to , since round . Hence . Thus there exists some edge such that is cycle-free. This element belongs to .
- Consider any active child of . Let be the element sent by on round . Let be some element by at some round and is still in at round . Then .
Running Time and Message Analysis:
The root starts getting messages at time . The root receives at most elements from each of its children. The time to know all edges is . If we consider the additional broadcasting, the total time is . In the worst case, each node can edges upward, and hence the overall message complexity is .
Garay-Kutten-Peleg (GKP) Algorithm:
Lets look at a distributed MST algorithm that runs in time. The GKP algorithm consists of two parts: it combines the GHS algorithm
Resources:
- Advanced Distributed Systems Lectures by Prof. Smruti Sarangi (GHS Algorithm)
- Distributed Network Algorithms by Prof. Gopal Pandurangan (Chapter 7)