Chapter 27
|
Document last modified: |
Resources
Intel Cilk++ Overview - http://software.intel.com/sites/products/evaluation-guides/docs/cilk-plus-evaluation-guide.pdf
Windows - Intel Cilk++ automatically integrates with Visual Studio 8.
- Download and install Cilk++ - http://software.intel.com/en-us/articles/download-intel-cilk-sdk/
- Depending upon the Parallel Studio and Visual Studio versions, you may need to download the updates also.
Examples
Official Intel examples, fail in converting to Visual Studio 10.
Same examples that convert to Visual Studio 10.
- Extract examples from Intel\Cilk directory, this will create a directory with several Cilk++ examples and VS solutions - cilkexamples.msi
- Open one of the solutions in the examples directory.
Or, after installing Cilk++, create a C++ Windows Console application, copy the following to Maximum.cilk and execute
- Download and extract the examples.
- Open one of the solutions in the examples directory.
Maximum.cilk #include <cilk.h> #include <stdio.h> #include <math.h> int Pmax(int A[], int l, int r) { int m, x, y; if(l>=r) return A[l]; m=(int)floor((double)((l+r)/2)); x = cilk_spawn Pmax(A, l, m); y = cilk_spawn Pmax(A, m+1, r); cilk_sync; if (x > y) return x; else return y; } int cilk_main(int argc, char *argv[]) { int A[8] = {6, 3, 5, 7, 8, 1, 4, 2}; int x; x = cilk_spawn Pmax(A, 0, 7); cilk_sync; printf("maximum: %d\n", x); return 0; }
Mac - Cilk from MIT
Compiler as installed on Mac OSX - Cilktest.zip
- Open terminal window, change directory to Cilktest.
- sudo mv /usr/include/mach/i386/_structs.h /usr/include/mach/i386/_structs.h.original
- sudo cp _structs.h /usr/include/mach/i386
Compiler - http://sourceforge.net/projects/cilkformac/files/cilk-5.4.6.tar.gz/download
Installing Cilk on Mac OSX - http://d.hatena.ne.jp/yuyarin/20101026/1288036789
Compile/Execute commands at terminal shell
alias cilkc='cilkc -D__CILK' Create alias for changes to: /usr/include/secure/_string.h cilkc -O2 examples/fib.cilk -o fib Compile fib.cilk example time ./fib --nproc 1 40
Result: 102334155
real 0m6.833s
user 0m5.849s
sys 0m0.009sTimed execution on one core. time ./fib --nproc 2 40
Result: 102334155
real 0m3.507s
user 0m5.904s
sys 0m0.033sTimed execution on two cores. Fibonacci execution comparison using C++, Open MP and Cilk++ - http://myxman.org/dp/node/182
Cilk lecture notes Rice University - http://www.clear.rice.edu/comp422/lecture-notes/comp422-2011-Lecture4-Cilk.pdf
Overview
Current multicore processors generally support shared-memory parallel computation, where multiple threads can access a common state in parallel.
Figure at right illustrates a dual core processor sharing access to common memory with algorithm instructions or data. Each thread (processor core) can execute code independent of other, concurrent threads.
Static-threading, where threads generally exist for life of computation, programmer responsible for partitioning problem to balance load between threads, communication between threads, and scheduling thread execution.
Dynamic-multithreading allows programmers to specify parallelism without specifying communication protocol, load balancing, scheduling, etc. Though still evolving, main features are:
- nested parallelism - allows subroutine to be spawned and execute in parallel with caller.
- loop parallelism - iterations of the loop can execute concurrently.
Model advantages
- Multithreading extension of serial programming
parallel - executes for iterations concurrently.
spawn - executes spawned function in parallel with caller.
sync - waits for any spawned threads to complete.
- Quantify parallelism by work and span.
- Analysis of recursive parallel algorithms using recurrences.
- Faithful to parallel computing evolution, e.g. multi-core technology.
Ideal parallel computer and sequentially consistency
- Set of processors of equal capability
- Shared memory that is sequentially consistent, means results as though only single instruction executed at a time by set of processors.
So even though two or more processors may be simultaneously modifying the same location of shared memory, the result is as though one processor made a modification and then the other.
This says nothing about the order of execution, only that memory remains consistent.
- Ignore cost of scheduling in analysis.
Question 27.0
- Do both sum array A = <1,2,3,4,5,6,7,8>?
- What is the recurrence for sum(A, l, r)? T(n) = ?
- Draw the recursion tree for sum(A, l, r).
- What is the upper bound of each sum?
- What is the recursion tree height, in terms of n?
- Suppose each call to sum(A, l, r) was executed on a new processor.
How many total processors would we need to sum the 8 values?
Upper bound execution time, in terms of n?
- What is 1024 / lg 210?
sum(A, n)
- result = 0
- for i = 1 to n
- result = result + A[ i ]
- return result
sum(A, l, r)
- if l ≥ r return A[ l ]
- m=⌊(l+r)/2⌋
- return sum(A, l, m) + sum(A, m+1, r)
27.1 The basics of dynamic multithreading
A first look
Recursion tree - each box represents a processor and call sum(A, l, r).
The top level processor is result of initial sum( <1,2,3,4,5,6,7,8>, 0, 7)
Subsequent processors are result of spawn.

Work Master Theorem Case 1
|
Span Master Theorem Case 2
Parallelism
|
Question 27.1.1
- The number of subproblems double but the size is reduced by half for each. Upper bound for execution on one processor?
- How many processors are used?
- Recursion tree height in terms of n?
- Upper bound guess for the parallel version?
Fibonacci
Fibonacci (serial version) F0 = 0 F1 = 1
Fi = Fi-1 + Fi-2 for i ≥ 2
Fib( n )
- if n ≤ 1
- return n
- else x = Fib(n-1)
- y = Fib(n-2)
- return x + y
Analysis T(n) = Θ( φn )
T(n) = T(n-1) + T(n-2) + Θ(1)
Assume T(n) ≤ aFn - b a > 1 and b > 0
T(n) = T(n-1) + T(n-2) + Θ(1) ≤ aFn-1 - b + aFn-2 - b + Θ(1) Substitute = a(Fn-1 + Fn-2) - 2b + Θ(1) = aFn - b -(b - Θ(1)) Fibonacci definition ≤ aFn - b for b ≥ Θ(1) T(n) = Θ( φn ) where φ = 1.61803... and Fn = φn (see equation 3.25)
Fibonacci (serial version) Fibonacci (parallel version)
spawn P-Fib(n-1) - spawns execution of P-Fib(n-1) that may execute concurrently with y = P-Fib(n-2)
sync - waits for completion of all executions spawned from current execution.
Fibonacci (parallel version) P-Fib( n )
- if n ≤ 1
- return n
- else x = spawn P-Fib(n-1)
- y = P-Fib(n-2)
- sync
- return x + y
Strands
strand - Group of instruction chains having no parallel control (spawn, sync or return from spawn). These instructions are all executed serially in a single thread.
Following identifies instructions comprising the three types of strands:
Black - base case or procedure instructions up to spawn
Gray - instructions following spawn and up to sync
White - instructions following sync up to return, that is x+yLogically in series - when directed path from strand u to strand v. Below, strands 1 → 2, 1 → 3, 1 → 4, 2 → 3, 2 → 4, 3 → 4 are in series.
Logically in parallel - when no directed path from strand u to strand v. Below, strands 5 and 3 are parallel.
Critical path - longest path in a DAG, passing through strands 1, 2, 3 and 4, of length 4.
Fibonacci (parallel version) P-Fib( 2 ) if 2 ≤ 1
return 2
else x = spawn P-Fib(1)
y = P-Fib(0)
sync
return x + y
Question 27.1.2 - In above, which strands are parallel or serial?
- 1 and 5
- 2 and 5
- 4 and 5
Critical path marked by heavy edges passing through 8 strands of execution.
Because strands 1-8 in same path, execution is logically serial. Defines the longest execution time, upper bound.
Assuming each strand takes unit time, 17 time units of work (each circle is one unit).
Question 27.2
Does the left subroutine exhibit more parallelism (do more work in less time) than the right or just use more processors?
Draw the execution of strands for the left similar to that below at right.
P-Fib( n )
- if n ≤ 1
- return n
- else x = spawn P-Fib(n-1)
- y = spawn P-Fib(n-2)
- sync
- return x + y
P-Fib( n )
- if n ≤ 1
- return n
- else x = spawn P-Fib(n-1)
- y = P-Fib(n-2)
- sync
- return x + y
Following identifies instructions comprising the three types of strands:
Black - base case or procedure instructions up to spawn
Gray - instructions following spawn and up to sync
White - instructions following sync up to return, that is x+y
Performance measures
Serial: When two sub-computations joined in series, work is sum of their work and the span is the sum of their spans.
Parallel: When joined in parallel, work is sum of their work and the span is the maximum of their spans.
work - total time to execute on one processor. Figure below marks 17 time units (strands in this case) of work.
span - longest time to execute in any path in DAG. Figure below marks 8 time units (strands in this case) shaded edges in critical path.
T1 - work, running time on 1 processor
Tp - running time on P processors
T∞- span, running time on ∞ processors
work law - Tp ≥ T1/P
P processors can complete at most P units of work in one step.
At best, P processors requires time T1/P or linear improvement.
T1 = 100s
P = 5
T1/P = 100s /5 = 20s
span law - Tp ≥ T∞
An ∞-processor machine can emulate a P-processor machine, the time on a P-processor machine will always be equal or greater than a ∞-processor machine.
speedup - T1/Tp
Ratio number of times faster on P-processor machine.
T1/T5 = 20s / 10s = 2 indicates a 5-processor machine produces a 2 times speedup.
T1/Tp ≤ P (work law) so speedup of P-processors at most P.
T1/Tp = P = Θ(P) is perfect or linear speedup.
parallelism - T1/T∞
Ratio of work to the span. Can be viewed as:
- Average amount of work performed in parallel along the critical path.
- As upper bound, the maximum possible speedup on ∞-processor machine.
- Limit on obtaining linear speedup.
After number of processors exceeds algorithm parallelism, additional processors cannot improve speedup.
Example P-Fib(4 )
T1 = 17 work
T∞ = 8 span
T1/T∞ = 17/8 = 2.125 parallelism
No matter the number of processors, the algorithm parallelism limits speedup to 2.125 times for P-Fib(4).
Larger n for Fibonacci allows greater parallelism hence greater speedup possible by additional processors.
Question 27.3
- T1/T∞ = parallelism. What happens as parallel execution time T∞ decreases?
- What does a larger number for parallelism mean?
- Consider the recursion tree for Fibonacci and that each node is a separate processor.
What is obvious that supports the statement: "Larger n allows greater parallelism hence greater speedup possible by additional processors."
parallel slackness - (T1/T∞)/P = T1/PT∞
Factor that algorithm parallelism exceeds number of processors.
T1/PT∞ decreases as P increases.
T1/PT∞ < 1 cannot achieve linear speedup, means number of processors already exceeds parallelism.
T1/PT∞ > 1 number of processors is limiting factor.
T1 = 100s
T∞ = 5s
P = 10
parallelism - T1/T∞ = 100s /5s = 20
T1/PT∞ = 100s/ 5s*10 = 2 more processors would reduce parallel slackness.
Scheduling - online, centralized, greedy
Schedulers determine which strands are executed on a given processor.
Text's multithreaded model does not specify which strands execute on which processor. Rather than the programmer, the scheduler must then determine in order to balance processors' load.
Without any a priori knowledge of strands' execution - scheduler must operate online - always available.
Centralized, greedy scheduler:
- knows the global state of computation and processors (what needs to be done and which processors are free) and
- assigns as many strands to processors as possible at each time step.
Best time possible on P processors
Tp = T1/P linear speedup
T1/P for complete step - when P strands available to be assigned to P-processors.
Tp = T∞ plenty of processors
T∞ for incomplete step - when less than P strands available to be assigned to P-processors.
Theorem 27.1 - On an ideal P-processor computer, a greedy scheduler executes a multithreaded computation with work T1 and T∞ span in time:
Tp ≤ T1/P + T∞
Tp is computation time on P-processors.
T1/P + T∞ is upper-bound of a greedy scheduler. The time to execute all complete and incomplete steps.
Corollary 27.2 - Running time Tp of any multithreaded computation scheduled by a greedy scheduler on ideal parallel P-processor computer is within a factor of 2 of optimal.
Analyzing multithreaded algorithms
Fib( n )
- if n ≤ 1
- return n
- else x = Fib(n-1)
- y = Fib(n-2)
- return x + y
P-Fib( n )
- if n ≤ 1
- return n
- else x = spawn P-Fib(n-1)
- y = P-Fib(n-2)
- sync
- return x + y
Serial Fib( n )
T(n) = Θ( φn )
Parallel P-Fib( n )
T1(n) = T(n) = Θ( φn )
T∞(n) = max( T∞(n-1), T∞(n-2) ) + Θ( 1 ) T∞(n-1) is worst (max) case.
= T∞(n-1) + Θ( 1 )
= Θ( n ) by Chip & Conquer
Parallelism increases as n increases
T1(n) / T∞(n) = Θ( φn ) / Θ( n ) = Θ( φn / n )
limit φn / n → ∞
Fibonacci execution comparison using C++, Open MP and Cilk++ - http://myxman.org/dp/node/182
For small values of n
T1(n) < T∞(n)
T1(n)/PT∞(n) < 1 cannot achieve linear speedup, means number of processors already exceeds parallelism.
parallelism less than 1, serial execution time less
2 processors
serial 0.08s
parallel 0.49sCutoff
Execution tests showed a cutoff at n ≤ 30 to use serial Fibonacci function and parallel for n > 30 yielded significant speedup over pure serial or pure parallel.
Use of cutoff, an example of improving performance by coarsening parallelism.
CILK_NPROC=1 ./a.out 35
Serial fib(35) = 9227465 Time: 0.090000
Parallel fib(35) = 9227465 Time: 0.730000
Parallel fib(35) cutoff(30) = 9227465 Time: 0.070000
CILK_NPROC=2 ./a.out 35
Serial fib(35) = 9227465 Time: 0.080000
Parallel fib(35) = 9227465 Time: 0.490000
Parallel fib(35) cutoff(30) = 9227465 Time: 0.040000CILK_NPROC=4 ./a.out 35
Serial fib(35) = 9227465 Time: 0.130000
Parallel fib(35) = 9227465 Time: 0.260000
Parallel fib(35) cutoff(30) = 9227465 Time: 0.020000Question 27.4
The following finds the maximum of an array using divide-and-conquer.
P-max(A, l, r)
if l ≥ r return A[ l ]
m=⌊(l+r)/2⌋
x = spawn P-max(A, l, m)
y = spawn P-max(A, m+1, r)
sync
if x > y return x
else return yFor: P-max( <6, 3, 5, 7, 8, 1, 4, 2>, 0, 7)
- Draw the serial recursion tree by ignoring the spawn and sync.
- Are the parallel threads the same as recursion tree?
- What is the length of the Critical Path?
- What is lg 8 = lg 23?
- Determine the parallelism.
![]() Simulated parallel execution of P-max for large n (1000) shows near linear speedup and parallelism = 121.06
|
|||||||
Work
Parallelism T1(n) / T∞(n) |
Recursion tree height = lg n Number recursions = 2lg n+1-1
|
||||||
Question 27.5
Give a parallel version that multiplies each element of an array by 2 using divide-and-conquer.
P-double( <6, 3, 5, 7, 8, 1, 4, 2>, 0, 7)
Number of calls as function of n?
Recursion tree height as function of n?
T1(n) =
T∞(n) =
T1(n) / T∞(n) =
Coarsening
Spawns are relatively costly compared to simply calling a subroutine but, in divide-and-conquer algorithms, for example, the cost can be amortized over parallel spawns.
At each node, there are two spawns but the overall spawn cost is constant for the next level while amount of work completed doubles.
For small n problems, the percentage of cost due to spawns is higher because there are fewer parallel threads over which to spread the cost.
Small n problems can often be solved more efficiently serially than in parallel.
Solving small problems serially and larger problems in parallel is called parallelism coarsening.
As an example, P-max has been coarsened to solve problems of size 8 or smaller serially, it has a cutoff of 8.
Fibonacci had a cutoff of 30.
P-max(A, l, r)
if l ≥ r return A[ l ]
m=⌊(l+r)/2⌋
x = spawn P-max(A, l, m)
y = spawn P-max(A, m+1, r)
sync
if x > y return x
else return y
P-max(A, l, r)
if r-l ≤ 8
max = A[ l ]
for i = l+1 to r
if A[ i ] > max
max = A[ i ]
return max
m=⌊(l+r)/2⌋
x = spawn P-max(A, l, m)
y = spawn P-max(A, m+1, r)
sync
if x > y return x
else return y
Parallel loops
Multiply n element vector by 2.
VecDouble( A )
- n = A.length
- parallel for i = 1 to n
- Ai = Ai * 2
A 1 2 3 4 20 22 24 26
=
A 1 2 3 4 10 11 12 13
* 2Line 3: n parallel executions, each with a different value for i: Ai = Ai * 2
Therefore each loop iteration uses a different array variable, one assignment is independent of all others, can occur concurrently, that is in any order.
A1 = A1 * 2 A3 = A3 * 2
gives same results asA3 = A3 * 2 A1 = A1 * 2
parallel for can be implemented by compiler using divide-and-conquer.
VecDouble( A )
|
VecDouble( A )
|
P-VecDouble(A, i, i')
|
| Work
T1(n) = 2T1(n/2) +
Θ( 1 ) |
Span T∞(n) = T∞(n/2)
+ Θ( 1 ) |
Parallelism T1(n) / T∞(n)
= Θ( n ) /
Θ( lg n ) |
Question 27.6.1
1. Give a parallel version using parallel for that searches an array for a value and returns the index if found or -1.
LinearSearch( <16, 13, 15, 17, 15, 12, 15, 14>, 15) returns index 3.
2. Give a parallel version using divide-and-conquer that searches an array for a value and returns the index if found or -1.
LinearSearch(A,x,n)
- location = -1
- for i = 1 to n
- if A[ i ] == x
- location = i
- return location
Hint: For analysis, recall that the parallel for operating on an array can be synthesized by a parallel divide-and-conquer of the array. 3. For the parallel algorithm, what is:
T1(n)
T∞(n), the span of parallel for (recall that parallel for is equivalent to divide-and-conquer)
T1(n) / T∞(n)
4. Is there a best or worst case for your algorithm?
5. Does your algorithm always return the same value, given the same data?
Example
Add two n x n arrays A and B.
Mat-Vec( A, B )
- n = A.rows
- let y be a new n x n array
- parallel for i == 1 to n
- parallel for j = 1 to n
- yij = Bij + Aij
- return y
y 1 2 3 1 30 32 34 2 36 38 40 3 42 44 46 =
A 1 2 3 1 10 11 12 2 13 14 15 3 16 17 18 +
B 1 2 3 1 20 21 22 2 23 24 25 3 26 27 28 Line 5: Notice that each addition is independent of other additions, can occur concurrently, that is in any order.
y22 = B22 + A22 y23 = B23 + A23
gives same results asy23 = B23 + A23 y22 = B22 + A22
Example
Multiply n x n array A by n-vector x.
Mat-Vec( A, x ) Correct
- n = A.rows
- let y be a new vector of length n
- parallel for i = 1 to n
- yi = 0
- parallel for i = 1 to n
- for j = 1 to n
- yi = yi + Aij * xj
- return y
Mat-Vec( A, x ) Wrong
- n = A.rows
- let y be a new vector of length n
- parallel for i = 1 to n
- yi = 0
- parallel for i = 1 to n
- parallel for j = 1 to n
- yi = yi + Aij * xj
- return y
y 1 2 3 6 12 18 =
A 1 2 3 1 1 1 1 2 2 2 2 3 3 3 3 *
x 1 2 3 1 2 3 Line 6: Why not use parallel for j = 1 to n
Line 7: Because allows two or more processors to access same variable in shared memory.
When an instruction that is non-atomic:
yi = yi + Aij * xj
reads and writes a shared variable yi, race conditions can occur that produce results different than serial execution.
Correct Wrong
Processor
5y5 = y5 + A5j * xj Processor
9y9 = y9 + A9j * xj
Processor
5y5 = y9 + A5j * xj Processor
9y9 = y5 + A9j * xj Example
In following, what can happen at Line 7 of the wrong version, each processor must first read value of y into processor register.
No guaranteed order in which processors will access shared memory or complete execution.
Serial y ← 0 y ← y + 2
Parallel P1 starts and ends, then
P2 starts and ends.P1: rP1 ← 0
y ← rP1
rP1 ← y
rP1 ← rP1 + 2
y ← rP1
P2: rP2 ← 0
y ← rP2
rP2 ← y
rP2 ← rP2 + 2
y ← rP2
can give different results
Question 27.6.2
What is y at the end of each?
P1 and P2 intermix execution. P1: rP1 ← 0
y ← rP1
rP1 ← y
P2: rP2 ← 0
y ← rP2
P1: rP1 ← rP1 + 2
y ← rP1
P2: rP2 ← y
rP2 ← rP2 + 2
y ← rP2
Analysis
Mat-Vec( A, x )
- n = A.rows
- let y be a new vector of length n
- parallel for i = 1 to n
- yi = 0
- parallel for i = 1 to n
- for j = 1 to n
- yi = yi + Aij * xj
- return y
T1(n), work of Mat-Vec(A: n x n matrix, x), is serial running time by replacing parallel for with for.
T1(n) = Θ( n2 )
parallel for implemented as divide-and-conquer subroutine using nested parallelism.
Mat-Vec( A, x )
- n = A.rows
- let y be a new vector of length n
- Mat-Vec-Init-Loop(y, 1, n)
- Mat-Vec-Main-Loop(A, x, y, n, 1, n)
- return y
Mat-Vec-Main-Loop(A, x, y, n, i, i')
- if i == i' -- Single row of matrix
- for j = 1 to n -- serial for
- yi = yi + Aij * xj
- else mid = ⌊(i+i')/2⌋
- spawn Mat-Vec-Main-Loop(A, x, y, n, i, mid)
- Mat-Vec-Main-Loop(A, x, y, n, mid+1, i')
- sync
- return
Mat-Vec-Init-Loop(y, i, i')
- if i == i' -- Single array element
- yi = 0
- else mid = ⌊(i+i')/2⌋
- spawn Mat-Vec-Init-Loop(y, i, mid)
- Mat-Vec-Init-Loop(y, mid+1, i')
- sync
- return
T∞(n), span of parallel for, as implemented using divide-and-conquer.
lg n - depth of recursion of parallel for i = 1 to n
iter∞(i) span of ith iteration
![]()
T∞(n) = Θ( lg n ) = Θ( lg n ) + Θ( 1 ), for Mat-Vec-Init-Loop(y, 1, n)
Θ( lg n ) recursive spawning dominates the constant work, Θ( 1 ), of yi = 0 and other constant time parts of subroutine
T∞(n) = Θ( n ) = Θ( lg n ) + Θ( n ) + Θ( 1 ), for Mat-Vec-Main-Loop(A, x, y, n, 1, n)
Θ( lg n ) the recursive spawning tree height.
Θ( n ) the maximum iter∞(i), since all n have same span executing Lines 2 & 3
for i = 1 to n yi = yi + Aij * xj.
Θ( 1 ) the constant time parts of subroutine.
parallelism
T1(n) / T∞(n) = Θ( n2 ) / Θ( n ) = Θ( n )
Counting strands does not work here because most of time spent in serial execution.
Question 27.6.3 - For below DAG:
What is the work, T1(n)?
What is the Critical Path of the following execution DAG?
What is its length, T∞(n)?
Do those agree with our analysis, T1(n) / T∞(n) = Θ( n )?
Mat-Vec(A: 8 x 8 matrix, x) DAG is:
Mat-Vec-Main-Loop(A, x, y, n, i, i')
- if i == i' -- Single row of matrix
- for j = 1 to n
- yi = yi + Aij * xj
- else mid = ⌊(i+i')/2⌋
- spawn Mat-Vec-Main-Loop(A, x, y, n, i, mid)
- Mat-Vec-Main-Loop(A, x, y, n, mid+1, i')
- sync
- return
Race conditions
Determinacy race when two or more logically parallel instructions access same memory location and at least one performs a memory write.
Two parallel strands executing: x = x + 1
non-atomic, composed of instructions:
- read x into a register
- increment register
- write register to x
Race-Example
- x = 0
- parallel for i = 1 to 2
- x = x + 1
- print x
Following diagrams how parallel execution produces wrong result due to race condition.
The wrong implementation achieves a span of Θ( lg n ) but introduces a race condition, at Line 7.
Line 5 and 6 allow n2 concurrent threads, with the parallel for j threads attempting to execute: yi = yi + Aij * xj
Mat-Vec( A, x ) Correct
- n = A.rows
- let y be a new vector of length n
- parallel for i == 1 to n
- yi =0
- parallel for i == 1 to n
- for j = 1 to n
- yi = yi + Aij * xj
- return y
Mat-Vec( A, x ) Wrong
- n = A.rows
- let y be a new vector of length n
- parallel for i == 1 to n
- yi =0
- parallel for i == 1 to n
- parallel for j = 1 to n
- yi = yi + Aij * xj
- return y
Question 27.7
Below is our parallel version of linear search using parallel for that searches an array for a value and returns the index if found or -1.
- Does a race condition exist? If so, where?
- Does P-search( <16, 13, 15, 17, 15, 12, 15, 19>, 15, 8) return index 3, 5 or 7?
- Will the answer always be the same, given the same array? If not, how is another answer possible?
P-search(A, x, l, r)
if l ≥ r then
if A[ l ] = x return l
else return -1
m=⌊(l+r)/2⌋
a = spawn P-search(A, x, l, m)
b = spawn P-search(A, x, m+1, r)
sync
if a != -1 return a
else return bP-search(A, x, n)
- location ← -1
- parallel for i ← 1 to n
- if A[i] = x
- location ← i
- return location
27.3 Multithreaded merge sort
|
Example
Merge_Sort(A, 9, 16) result
|
||||||||||||||||||||||||||||||||||||||||||||||||||||
Work T1(n) = 2T1(n/2) + Θ(n) = Θ(n lg n) Merge is Θ(n)
Serial version Merge_Sort (A, p, r) 1 if p < r then 2 q ← floor ((p + r) / 2) 3 Merge_Sort (A, p, q) 4 Merge_Sort (A, q + 1, r) 5 Merge (A, p, q, r)
Parallel version Merge_Sort' (A, p, r) 1 if p < r then 2 q ← ⌊(p + r) / 2⌋ 3 spawn Merge_Sort' (A, p, q) 4 Merge_Sort' (A, q + 1, r) 5 sync 6 Merge (A, p, q, r) Span
T∞(n) = T∞(n/2) + Θ( n ) T∞(n/2) not 2T∞(n/2) because running 2 threads
f(n) > nlog21 = 1
= Θ( n ) by Master Theorem case 3f(n) = Θ( n ) → f(n) = Ω(nlog21+1 )
parallelism
T1(n) / T∞(n) = Θ(n lg n) / Θ( n ) = Θ( lg n )
Not great parallelism, lg 1024 = lg 210 = 10
Reducing the span, T∞(n), increases parallelism.
T∞(n) = Θ( n ) due to serial Merge.
Multithreaded Merge
Text defines a multithreaded P-Merge resulting in P-Merge-Sort parallelism of: Θ( n/lg2 n )
P-Merge-Sort(A, p, r, B, s)
|
P-Merge(T, p1, r1, p2, r2,
A, p3)
|
How P-Merge works
A is merged into B array.
Merges two arrays of length n1 and n2 using divide-and-conquer by reducing array size to n = 0.
1-6. Ensures n1 ≥ n2. 7-8. Base case, arrays are empty. 9. Lower part of T contains one sorted array, upper part of T the other. The middle element, T[ q1 ], of lower sorted array is median, all elements below T[ q1 ] are less than or equal and all elements above are greater than or equal.
10. Using BinarySearch, find where T[ q1 ] could be inserted into upper sorted array, index q2, and array still be sorted. 11. q3 is index where to copy T[ q1 ] into A to maintain sorted order. 12. Copy T[ q1 ] into A[ q3 ] 13. Recursively merge lower half. 14. Recursively merge upper half. Analysis of P-Merge
BinarySearch is serial, Θ(lg n)
PM∞(n) = PM∞(3n/4) + Θ(lg n)
= Θ(lg2 n) Exercise 4.6-2PM1(n) = Θ(n)
PM1(n) / PM∞(n) = Θ(n) / Θ(lg2 n) = Θ(n / lg2 n)
Analysis of P-Merge-Sort
PMS1(n) = 2PMS1(n/2) + PM1(n)
= 2PMS1(n/2) + Θ(n)
= Θ(n lg n) Master Theorem Case 2PMS∞(n) = PMS∞(n/2) + PM∞(n)
= PMS∞(n/2) + Θ(lg2 n)
= Θ(lg3 n) Exercise 4.6-2PMS1(n) / PMS∞(n) = Θ(n lg n) / Θ(lg3 n)
= Θ(n / lg2 n)
Comparison of parallelism: P-Merge-Sort using Serial and Parallel Merge
Very little difference for small values of n.
Serial merge with parallel sort does not improve parallelism as n increases, utilizing only a few processors for large n.
Parallel merge does improve parallelism as n increases, utilizing relatively more processors for large n.
Serial T1(n) / T∞(n) = Θ(n lg n) / Θ( n ) = Θ( lg n )
lg 1024 = lg 210 = 10
lg 4294967296 = lg 232 = 32
Parallel PMS1(n) / PMS∞(n) = Θ(n lg n) / Θ(lg3 n) = Θ(n / lg2 n)
1024/lg2 1024 = 1024/(lg 210)2 = 1024/100 = 10.2
4294967296/lg2 4294967296 =
4294967296/(lg 4294967296)2 =
4294967296/(lg 232)2 =
4294967296/322 =
4294967296/1024 = 4194304
Examples
Analysis - Maximum of an array using divide-and-conquer - coded for MIT Cilk
Work T1(n) = 2T1(n/2) + Θ(1)
= Θ( n ) by Master Theorem Case 1Span
T∞(n) = T∞(n/2) + Θ( 1 )
T∞(n/2) running 2 threads
= Θ( lg n ) + Θ( 1 ) by Master Theorem Case 2
= Θ( lg n )Parallelism
T1(n) / T∞(n) = Θ( n ) / Θ( lg n ) = Θ( n / lg n )
Speedup - measured results
When executed for n=1000 and increasing
numbers of processors, the speedup follows
a near linear increase with the number
of processors.Simulation shows near linear speedup and parallelism = 121.06
Our parallelism analysis for n=1024 yields:
1024 / lg 210 = 102
#include <cilk-lib.cilkh>
#include <stdio.h>
#include <math.h>
cilk int Pmax(int A[], int l, int r) {
int m, x, y;
if(l>=r) return A[l];
m=(int)floor((l+r)/2);
x = spawn Pmax(A, l, m);
y = spawn Pmax(A, m+1, r);
sync;
if (x > y) return x;
else return y;
}
cilk int cilk_main(int argc, char *argv[]) {
int A[8] = {6, 3, 5, 7, 8, 1, 4, 2};
int x;
x = spawn Pmax(A, 0, 7);
sync;
printf("maximum: %d\n", x);
return 0;
}
Intel Cilk++ is slightly different from the examples of the text. Below is the same maximum solution and a parallel search using the Intel syntax, the search uses cilkview to display parallel execution measures. See the resources listed at the beginning of this page for documentation on Intel Cilk++.
Array maximum by divide-and-conquer #include <cilk.h> #include <stdio.h> #include <math.h> int Pmax(int A[], int l, int r) { int m, x, y; if(l>=r) return A[l]; m=(int)floor((double)((l+r)/2)); x = cilk_spawn Pmax(A, l, m); y = cilk_spawn Pmax(A, m+1, r); cilk_sync; if (x > y) return x; else return y; } int cilk_main(int argc, char *argv[]) { int A[8] = {6, 3, 5, 7, 8, 1, 4, 2}; int x; x = cilk_spawn Pmax(A, 0, 7); cilk_sync; printf("maximum: %d\n", x); return 0; }
Array search by parallel for #include <cilk.h> #include <cilkview.h> #include <stdio.h> #include <math.h> int search(int A[], int x, int n) { int location=-1; cilk_for(int i=0;i<n; i++) if( A[i]==x) location = i; return location; } int cilk_main(int argc, char *argv[]) { int A[1000]; int x; cilk::cilkview cv; for(int i=0;i<1000;i++) A[i]=0; A[12]=13; A[789]=13; cv.start(); x = search(A, 13, 1000); cv.stop(); cv.dump("search results", true); printf("location: %d\n", x); printf("%f8 seconds\n", cv.accumulated_milliseconds() / 1000.f ); return 0;Speedup as measured at execution.