Homework 3 CpSc 418
Due: Mar. 31, 2022, 11:59pm
100 points. Please submit your solution using handin: handin cs-418 hw3. Your solution should have one file, hw3.pdf.
The Questions
- Bitonic sequences (15 points) See @208. First, a few definitions. Let S = [s0, s1, . . . , sn−1] be a sequence of elements. S is monotonically non-decreasing if mono_up(S) = ∀ 0 ≤ i ≤ j < n. (si ≤ sj)
Because ≤ is transitive, it’s sufficient to just test adjacent elements of the sequence: mono_up(S) = ∀ 0 ≤ i < n − 1. (si ≤ si+1)
Note that this definition implies that the empty sequence (i.e. n = 0) and any singleton sequence (i.e. n = 1) is monotonically non-decreasing. We can also call such sequences monotonically increasing, because it’s easier to say, and just note that the “increasing” isn’t necessarily strict. S is monotonically non-increasing if mono_down(S) = ∀ 0 ≤ i ≤ j < n. (si ≥ sj)
with observations analogous to those for monotonically non-decreasing sequences. S is monotonic if it is either monotonically non-decreasing or monotonically non-increasing: monotonic(S) = mono_up(S) ∨ mono_down(S). S is up-down bitonic if bitonic_up_down(S) = ∃ 0 ≤ i < n. mono_up([s0, . . . , si−1]) ∧ mono_down([si, . . . , sn−1])
Any sequence with two or fewer elements is up-down bitonic. S is down-up bitonic if bitonic_down_up(S) = ∃ 0 ≤ i < n. mono_down([s0, . . . , si−1]) ∧ mono_up([si, . . . , sn−1])
S is bitonic if it is either up-down bitonic or down-up bitonic. bitonic(S) = bitonic_up_down(S) ∨ bitonic_down_up(S)
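These definitions translate directly into brute-force predicates. The sketch below (the helper names are my own, not part of the assignment) checks each definition literally, which can be handy for testing intuitions on small examples:

```python
# Brute-force checkers for the definitions above (illustrative only).

def mono_up(s):
    # Adjacent-pair test; vacuously true for empty and singleton sequences.
    return all(s[i] <= s[i + 1] for i in range(len(s) - 1))

def mono_down(s):
    return all(s[i] >= s[i + 1] for i in range(len(s) - 1))

def bitonic_up_down(s):
    # Try every split point: non-decreasing prefix, non-increasing suffix.
    return any(mono_up(s[:i]) and mono_down(s[i:]) for i in range(len(s) + 1))

def bitonic_down_up(s):
    return any(mono_down(s[:i]) and mono_up(s[i:]) for i in range(len(s) + 1))

def bitonic(s):
    return bitonic_up_down(s) or bitonic_down_up(s)
```

For example, bitonic([1, 3, 5, 4, 2]) holds (split after the 5), while bitonic([1, 2, 1, 2]) does not.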
S′ is a subsequence of S if there is a sequence of indices, H = [h0, h1, . . . , hm−1] with 0 ≤ h0 < h1 < · · · < hm−1 < n, such that S′ = [sh0, sh1, . . . , shm−1], where m is the length of S′. We’ll write s′j to denote shj.
Now for the questions:
(a) (5 points) Prove that if S is a monotonically non-decreasing sequence, and S′ is a subsequence of S, then S′ is a monotonically non-decreasing sequence. Hint: let S = [s0, s1, . . . , sn−1], S′ = [s′0, s′1, . . . , s′m−1], and H be the sequence of indices [h0, h1, . . . , hm−1] such that for 0 ≤ i < m, s′i = shi. From the definition of monotonically non-decreasing sequences, mono_up(S′) = ∀ 0 ≤ i ≤ j < m. (s′i ≤ s′j)
Use the hypothesis that S is monotonically non-decreasing to show that this condition holds. Similar reasoning shows that if S is monotonically non-increasing then so is S′, and likewise if S is monotonic.
(b) (10 points) Prove that if S is an up-down bitonic sequence, and S′ is a subsequence of S, then S′ is up-down bitonic. Hint: From the definition of up-down bitonic, S can be written as the concatenation of two sequences, Sup and Sdown, such that mono_up(Sup) and mono_down(Sdown). Show that you can construct S′up that is a subsequence of Sup and S′down that is a subsequence of Sdown such that S′ is the concatenation of S′up and S′down. Include a sentence to state why S′ = concat(S′up, S′down) then shows that S′ is up-down bitonic. concat(X, Y) is the concatenation of sequences X and Y.
Similar reasoning shows that if S is down-up bitonic then so is S′, and likewise if S is bitonic. Congratulations! You have answered piazza question @208.
- Oddly Bitonic (25 points) In class, we presented the bitonic sorting algorithm and made the simplifying assumption that N, the number of values to sort, was a power of two. What if N is some other integer? We could pad our input to the next power of two larger than N, but this is wasteful of CPU operations, memory, and communication. Fortunately, bitonic merge (and thus bitonic sort) can be extended to handle the case when N is odd.
Recall the basic sketch of a bitonic merge. Let B = [b0, b1, . . . , bN−1] be a bitonic sequence with N even (we’ll get to the N odd case shortly). For simplicity, assume that all elements of B are in {0, 1}. Let C = [c0, c1, . . . , cN−1] where

    ci = min(bi, bi+(N/2)),  if 0 ≤ i < N/2
    ci = max(bi−(N/2), bi),  if N/2 ≤ i < N
We showed that C has the following properties: • Clo = [c0 , . . . , c(N/2)−1 ] is all zeros or Chi = [cN/2 , . . . , cN −1 ] is all ones; • C has the same number of 0s as B. Likewise, C has the same number of 1s as B.
These properties ensure that we can now sort Clo and Chi separately, and the concatenation of their sorted sequences will be the sorted version of B.
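As a sanity check, the even-N merge step can be sketched directly (the function name bitonic_split is my own; this is just the min/max construction of C above):

```python
# One bitonic-merge step for even N: the min/max construction of C.

def bitonic_split(b):
    n = len(b)
    assert n % 2 == 0
    half = n // 2
    c_lo = [min(b[i], b[i + half]) for i in range(half)]   # c_i for 0 <= i < N/2
    c_hi = [max(b[i], b[i + half]) for i in range(half)]   # c_i for N/2 <= i < N
    return c_lo, c_hi

# Example on a 0/1 up-down bitonic sequence:
b = [0, 0, 1, 1, 1, 0, 0, 0]
c_lo, c_hi = bitonic_split(b)   # c_lo == [0, 0, 0, 0]; 0/1 counts are preserved
```

Here Clo comes out all zeros and the multiset of 0s and 1s is unchanged, matching the two properties listed above.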
What if N is odd? Assume that B is down-up bitonic (the up-down case works the same way). This means that B has a bunch of 1s on each end and 0s in the middle. We can extend B to B′ with one element, bN = 1. I could ask you to prove that B′ is down-up bitonic, but I believe that’s obvious. If it’s not obvious, post a question to piazza or ask it at office hours. I also claim that it’s obvious that N + 1 is even, so we can apply the merge step above to the down-up bitonic sequence B′ to obtain C′. Therefore:
• Clo = [c′0, . . . , c′(N−1)/2] is all zeros or Chi = [c′(N+1)/2, . . . , c′N] is all ones;
• Clo and Chi are bitonic;
• C′ has the same number of 0s as B, and C′ has one more 1 than B.
This means we can continue our merge working on Clo and Chi (nearly) as before.
(a) (10 points) Show that Clo is up-down bitonic, and Chi is down-up bitonic.
(b) (3 points) Show that c′N = 1.
(c) (3 points) Show that c′(N−1)/2 = b(N−1)/2.
(d) (4 points)
i. Show that Clo is all 0s or Chi is all ones.
ii. Show that Clo is up-down bitonic.
iii. Show that Chi is down-up bitonic.
These are really simple proofs. One or two sentences will suffice for each of the claims. Long answers are likely to lose points.
(e) (5 points) Show that C has the same number of 0s as B, and C has the same number of 1s as B.
Having shown that c′(N−1)/2 = b(N−1)/2, and discarding c′N when constructing Chi, we don’t actually need bN, nor do we need a compare-and-swap between bN and b(N−1)/2. We can just set c(N−1)/2 to b(N−1)/2 without any compare-and-swap. With the properties that you showed above, we can now sort Clo and Chi separately, and the concatenation of their sorted sequences will be the sorted version of B.
What if B is up-down bitonic? The changes for this case are simple: we just extend B in the other direction, letting b−1 = 0, and the rest of the construction is similar.
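The odd-N step for the down-up case can be sketched as follows (names are my own; the conceptual bN = 1 is never stored, and c′N is never computed, per the observations above):

```python
# Odd-N merge step for a down-up bitonic 0/1 sequence.
# Conceptually we pad with b[N] = 1, but since c'_{(N-1)/2} = b[(N-1)/2]
# and c'_N = 1 is discarded, no padding or extra compare is needed.

def bitonic_split_odd(b):
    n = len(b)
    assert n % 2 == 1
    half = (n + 1) // 2                                        # (N+1)/2
    c_lo = [min(b[i], b[i + half]) for i in range(half - 1)]
    c_lo.append(b[half - 1])       # c'_{(N-1)/2} = b_{(N-1)/2}, no compare-and-swap
    c_hi = [max(b[i - half], b[i]) for i in range(half, n)]    # c'_N dropped
    return c_lo, c_hi
```

On B = [1, 1, 0, 0, 1], this yields Clo = [0, 1, 0] (up-down bitonic) and Chi = [1, 1] (all ones), and together they contain exactly the elements of B.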
- Work-Span and Dynamic Programming: (30 points) Dynamic programming is a method for solving optimization problems when we have a way of breaking the problem into smaller problems, we can define an ordering of the sub-problems, and it’s “easy” to solve a sub-problem once all of the sub-problems that are earlier in the ordering have been solved. A classical example is the editing distance between two strings: if we have two strings, s1 and s2, and have a cost associated with inserting a character, cins, a cost for deleting a character, cdel, and a cost for replacing a character with another one, crpl, what is the minimum-cost sequence of operations that transforms string s1 into string s2? See, for example, A general method applicable to the search for similarities in the amino acid sequence of two proteins, S.B. Needleman & C.D. Wunsch.
Let’s say that s1 has n characters and s2 has m characters. We can solve the editing distance problem by filling in an (n + 1) × (m + 1) tableau, T. Initialize T(0, 0 : m) and T(0 : n, 0) to 0. For each (i, j) with 0 < i ≤ n and 0 < j ≤ m, if we have filled in T(i − 1, j − 1), T(i − 1, j), and T(i, j − 1), then we can set T(i, j) according to:
    T(i, j) = min(T(i − 1, j − 1), T(i − 1, j) + cdel, T(i, j − 1) + cins),         if s1(i) = s2(j)
    T(i, j) = min(T(i − 1, j − 1) + crpl, T(i − 1, j) + cdel, T(i, j − 1) + cins),  if s1(i) ≠ s2(j)
In English, this means:
• If we have already matched the first i − 1 characters of s1 to the first j − 1 characters of s2 with a cost of T(i − 1, j − 1), then
– if the ith character of s1 matches the jth character of s2, then we can match the first i characters of s1 to the first j characters of s2 with a cost of T(i − 1, j − 1). In other words, there is no cost if the corresponding characters match.
– otherwise, we can match the first i characters of s1 to the first j characters of s2 with a cost of T(i − 1, j − 1) + crpl. In other words, we replace the ith character of s1 with the jth character of s2.
• If we have already matched the first i − 1 characters of s1 to the first j characters of s2 with a cost of T(i − 1, j), then we can match the first i characters of s1 to the first j characters of s2 with a cost of T(i − 1, j) + cdel, by deleting the ith character of s1.
• If we have already matched the first i characters of s1 to the first j − 1 characters of s2 with a cost of T(i, j − 1), then we can match the first i characters of s1 to the first j characters of s2 with a cost of T(i, j − 1) + cins, by inserting the jth character of s2 into s1.
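The cases above can be sketched as a straightforward sequential fill. One hedge: row 0 and column 0 are initialized to zero exactly as described here; a textbook edit distance would instead initialize them to j·cins and i·cdel.

```python
# Sequential fill of the (n+1) x (m+1) tableau described above.
# Row 0 and column 0 are initialized to 0, following the handout.

def fill_tableau(s1, s2, c_ins=1, c_del=1, c_rpl=1):
    n, m = len(s1), len(s2)
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # No cost if the characters match, c_rpl otherwise.
            diag = T[i - 1][j - 1] + (0 if s1[i - 1] == s2[j - 1] else c_rpl)
            T[i][j] = min(diag, T[i - 1][j] + c_del, T[i][j - 1] + c_ins)
    return T
```

Each cell depends only on its north, west, and north-west neighbours; that dependence structure is what the work-span questions below are about.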
We choose the cheapest of these options for the cost of T(i, j). Note that there is never an advantage to choosing a more expensive solution for T(i, j) when finding solutions that are later in the tableau. Of course, we want a parallel solution to this problem. There are many published papers on parallel implementations of dynamic programming algorithms. Here, we will examine the problem from the perspective of the work-span model.
(a) (5 points) What is the Work in the work-span model for solving the editing distance using the dynamic programming approach described above when s1 and s2 are both of length n? If you want a working example, try filling in the tableau where s1 is "length" and s2 is "slings".
(b) (5 points) What is the Span in the work-span model when s1 and s2 are both of length n?
(c) (5 points) Let SpeedUp(n, p) denote the optimal speed-up according to work-span when using p processors and s1 and s2 are of length n. What is the lower bound for SpeedUp(n, p) given by Brent’s lemma? You may start by stating the definition of Brent’s lemma using T1, T∞, and p, but that’s not a complete answer. Use your formulas for T1 and T∞ from questions 3a and 3b and then simplify your formula.
(d) (5 points) Derive the actual value of SpeedUp(n, p). In other words, what is the minimum number of time-steps to perform dynamic programming with p processors on inputs of length n, assuming zero communication costs? Show that the optimal speed-up is greater than the lower bound from Brent’s lemma, and give a one- or two-sentence explanation of why this is the case for this particular algorithm.
(e) (5 points) Show that SpeedUp(n, ∞) can be achieved with fewer than ∞ processors. Give a formula for the smallest number of processors that can be used to achieve SpeedUp(n, ∞) as a function of n. We will write pmax(n) to denote this number of processors. Thus, SpeedUp(n, pmax(n)) = SpeedUp(n, ∞).
(f) (5 points) Let E(n) denote the parallel efficiency when p = pmax(n). What is limn→∞ E(n)? Explain why the efficiency you calculated above is less than one, referring to the causes of performance loss that we described in class.
- Bisection Width (30 points) I was going to pose this question specifically about Dragonfly networks, but I found I needed to give lots of details about routing in the network. Instead, this question has you prove a more general result that can be applied to many networks, including Dragonfly and hypercubes.
Let’s say I have a network topology with N terminals (i.e. processors). We represent the network as a directed graph; bidirectional links are represented by a pair of edges. Let P be the set of processors.
We say that A and B are a balanced partition of P if balanced_partition(A, B, P) = (A ∪ B = P) ∧ (A ∩ B = ∅) ∧ (|A| = ⌈|P|/2⌉)
The bisection width of a network is the minimum number of edges between A and B for all balanced partitions of the network.
Showing an upper bound on the bisection width of a network is usually straightforward: any balanced partition provides an upper bound. Lower bounds on bisection width are often more subtle. There are (N choose N/2) balanced partitions of P. How do we know that a particular partition, (A, B), minimizes the number of edges between A and B? In this problem, we’ll show how to derive a lower bound without specifying the partition (A, B).
We’ll use a model where each processor can send one message per time-unit. Likewise, each edge can convey one message per time-unit. A sending schedule is a function, j = send(i, t), which means that processor i sends a message on time-step t that will be delivered to processor j. We can define a receiving schedule, i = receive(j, t′), which means that processor j receives a message from processor i at time t′. I’ll say that a network has the “fast-all-to-all” (faa) property if we can find a schedule where every processor sends a message to every other processor once every N time steps. For those who like formulas, that means we can find a schedule, send_bw(i, t) (the _bw means it’s a bisection witness), such that
faa(send) = ∀ i, j, t. ∃ 0 ≤ k < N. (send(i, t + k) = j)
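As a concrete example (entirely my own, not from the handout), a fully connected network satisfies faa with the round-robin schedule send(i, t) = (i + t) mod N; the brute-force check below verifies the formula for a small N:

```python
# Brute-force check of the faa property for a toy round-robin schedule
# on a fully connected network (my own example, not from the handout).

N = 8

def send(i, t):
    return (i + t) % N           # processor i's destination at time-step t

def faa(schedule, n):
    # For every i, j, and start time t, some 0 <= k < n has schedule(i, t+k) == j.
    # The schedule is periodic with period n, so checking t in range(n) suffices.
    return all(
        any(schedule(i, t + k) == j for k in range(n))
        for i in range(n) for j in range(n) for t in range(n)
    )
```

A schedule that never varies its destination, e.g. schedule(i, t) = i, fails the check.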
In this question, you will show that if there is a schedule for a network that has the faa property, then the bisection width of the network is lower-bounded by Ω(N). Let (A, B) be a balanced partition of P. For simplicity, assume that N is even; thus, |A| = |B| = N/2.
(a) (5 points) How many messages are sent from processors in A to processors in B every N time units?
(b) (2 points) How many messages are sent from processors in B to processors in A every N time units? Hint: see your answer to Question 4a.
(c) (3 points) How many messages are sent in total between processors in A and processors in B (in either direction) every N time units? Hint: see your answers to Questions 4a and 4b.
(d) (10 points) Show that at least Ω(N) edges must be cut to disconnect partitions A and B. Now, pause a moment and think about how you just proved a lower bound for the bisection width without ever considering any specific partitioning.
(e) Upper bound for the bisection width (10 points) Show that a hypercube with N = 2^d processors has an upper bound for bisection width of O(N). Hint: Find a simple way to divide the processors into two sets of N/2 processors each that have O(N) edges between them. The most obvious approach has exactly N edges connecting the two partitions.
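The hint in part 4e can be checked numerically. The sketch below (my own enumeration, not part of the assignment) splits a d-cube on the top address bit and counts the links crossing the cut; in the directed-graph model above, each bidirectional link counts as two edges:

```python
# Count the bidirectional links cut when a d-cube is split on the top
# address bit (processors with the bit clear vs. processors with it set).

def hypercube_cut_by_top_bit(d):
    n = 1 << d                   # N = 2^d processors
    top = 1 << (d - 1)
    cut = 0
    for u in range(n):
        for b in range(d):
            v = u ^ (1 << b)     # neighbour of u across dimension b
            if u < v and (u & top) != (v & top):
                cut += 1         # this link crosses the partition
    return cut
```

For d = 4 this counts N/2 = 8 links, i.e. N = 16 directed edges, matching the hint’s “exactly N edges”.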