Solutions for Chapel course
Part 1: basic language features
Solution to Exercise Basic.1
To see the evolution of the temperature at the top-right corner of the plate, we just need to modify iout and jout. This corner corresponds to the first row (iout=1) and the last column (jout=cols) of the plate.
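For reference, here is a minimal sketch of the relevant declarations (the default plate size is an assumption; the command-line overrides in Exercise Basic.4 show these are config constants):

```chapel
config const rows = 100, cols = 100;  // plate size (defaults here are assumptions)
config const iout = 1, jout = cols;   // top-right corner: first row, last column
```

Because they are config constants, no recompilation is needed to watch a different position.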
$ chpl baseSolver.chpl -o baseSolver
$ sbatch serial.sh
$ tail -f solution.out
Temperature at iteration 0: 25.0
Temperature at iteration 20: 1.48171
Temperature at iteration 40: 0.767179
...
Temperature at iteration 460: 0.068973
Temperature at iteration 480: 0.0661081
Temperature at iteration 500: 0.0634717
Solution to Exercise Basic.2
To get the linear distribution, the 80 degrees must be divided by the number of rows or columns of our plate. So, the following pair of for loops at the start of each time iteration will give us what we want:
// boundary conditions
for i in 1..rows do
T[i,cols+1] = i*80.0/rows; // right side
for j in 1..cols do
T[rows+1,j] = j*80.0/cols; // bottom side
Note that 80 degrees is written as the real number 80.0. Division of two integers in Chapel returns an integer; since rows and cols are integers, we must write 80 as a real so that the result is not truncated.
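The difference is easy to see in a standalone sketch (not part of the solver):

```chapel
const rows = 100;
writeln(80 / rows);    // integer division: prints 0
writeln(80.0 / rows);  // real division: prints 0.8
```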
$ chpl baseSolver.chpl -o baseSolver
$ sbatch serial.sh
$ tail -f solution.out
Temperature at iteration 0: 25.0
Temperature at iteration 20: 2.0859
Temperature at iteration 40: 1.42663
...
Temperature at iteration 460: 0.826941
Temperature at iteration 480: 0.824959
Temperature at iteration 500: 0.823152
Solution to Exercise Basic.3
The idea is simple: after each iteration of the while loop, we must compare all elements of Tnew and T, find the greatest difference, and update delta with that value. The following nested for loops should do the job:
// update delta, the greatest difference between Tnew and T
delta = 0;
for i in 1..rows {
  for j in 1..cols {
    const tmp = abs(Tnew[i,j] - T[i,j]);
    if tmp > delta then delta = tmp;
  }
}
Clearly there is no need to keep the difference at every single position in the array; we just need to update delta whenever we find a greater one.
$ chpl baseSolver.chpl -o baseSolver
$ sbatch serial.sh
$ tail -f solution.out
Temperature at iteration 0: 25.0
Temperature at iteration 20: 2.0859
Temperature at iteration 40: 1.42663
...
Temperature at iteration 460: 0.826941
Temperature at iteration 480: 0.824959
Temperature at iteration 500: 0.823152
Solution to Exercise Basic.4
For example, let's use a 650 x 650 grid and observe the evolution of the temperature at position (200,300) for 10000 iterations, or until the temperature difference between iterations is less than 0.002; also, let's print the temperature every 1000 iterations.
$ chpl --fast baseSolver.chpl -o baseSolver
$ ./baseSolver --rows=650 --cols=650 --iout=200 --jout=300 --niter=10000 --tolerance=0.002 --nout=1000
Temperature at iteration 0: 25.0
Temperature at iteration 1000: 25.0
Temperature at iteration 2000: 25.0
Temperature at iteration 3000: 25.0
Temperature at iteration 4000: 24.9998
Temperature at iteration 5000: 24.9984
Temperature at iteration 6000: 24.9935
Temperature at iteration 7000: 24.9819
Final temperature at the desired position after 7750 iterations is: 24.9671
The greatest difference in temperatures between the last two iterations was: 0.00199985
Solution to Exercise Basic.5
Without --fast the calculation will become slower by ~95X.
Part 2: task parallelism
Solution to Exercise Task.1
The following code is a possible solution:
var x = 10;
config var numthreads = 2;
var messages: [1..numthreads] string;
writeln('This is the main thread: x = ', x);
coforall threadid in 1..numthreads {
var c = threadid**2;
messages[threadid] = 'this is thread ' + threadid:string + ': my value of c is ' + c:string + ' and x is ' + x:string; // add to a string
}
writeln('This message will not appear until all threads are done ...');
for i in 1..numthreads do // serial loop, will be printed in sequential order
writeln(messages[i]);
$ chpl exercise1.chpl -o exercise1
$ sed -i -e 's|coforall --numthreads=5|exercise1 --numthreads=5|' shared.sh
$ sbatch shared.sh
$ cat solution.out
This is the main thread: x = 10
This message will not appear until all threads are done ...
this is thread 1: my value of c is 1 and x is 10
this is thread 2: my value of c is 4 and x is 10
this is thread 3: my value of c is 9 and x is 10
this is thread 4: my value of c is 16 and x is 10
this is thread 5: my value of c is 25 and x is 10
Solution to Exercise Task.2
config const numthreads = 12; // let's pretend we have 12 cores
const n = nelem / numthreads; // number of elements per thread
const r = nelem - n*numthreads; // remainder that did not fit into the even split
var lmax: [1..numthreads] real; // local maximum for each thread
coforall threadid in 1..numthreads { // each iteration processed by a separate thread
var start, finish: int;
start = (threadid-1)*n + 1;
finish = (threadid-1)*n + n;
if threadid == numthreads then finish += r; // add r elements to the last thread
for i in start..finish do
if x[i] > lmax[threadid] then lmax[threadid] = x[i];
}
for threadid in 1..numthreads do // no need for a parallel loop here
if lmax[threadid] > gmax then gmax = lmax[threadid];
$ chpl --fast exercise2.chpl -o exercise2
$ sed -i -e 's|coforall --numthreads=5|exercise2|' shared.sh
$ sbatch shared.sh
$ cat solution.out
the maximum value in x is: 1.0
We use coforall to spawn threads that work concurrently, each on a fraction of the array. The trick here is to determine, based on threadid, the initial and final indices each thread will use. Each thread obtains the maximum in its fraction of the array and, after the coforall is done, the main thread obtains the global maximum from the maxima of all threads.
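As an aside, the same result can be obtained with Chapel's built-in parallel reduction, without managing threads by hand (a sketch; the array x below is a stand-in for the exercise's array):

```chapel
var x: [1..1000] real;       // stand-in for the exercise's array
x[42] = 1.0;                 // plant a known maximum
const gmax = max reduce x;   // built-in parallel reduction over the whole array
writeln('the maximum value in x is: ', gmax);
```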
Solution to Exercise Task.3
var x = 0;
writeln('This is the main thread, my value of x is ', x);
sync {
begin {
var x = 5;
writeln('this is thread 1, my value of x is ', x);
}
begin writeln('this is thread 2, my value of x is ', x);
}
writeln('this message will not appear until all threads are done...');
Solution to Exercise Task.4
The code will most likely deadlock (although sometimes it might not), as we are hitting a race condition. Refer to the diagram for an explanation.
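For concreteness, here is a hedged sketch of the kind of single-lock code that can deadlock (the exercise's exact code may differ). If one thread performs its second add before a slower thread has observed lock == numthreads, the counter skips past numthreads and the slow thread's waitFor never fires:

```chapel
var lock: atomic int;
const numthreads = 5;
lock.write(0);
coforall id in 1..numthreads {
  writeln('greetings from thread ', id, '...');
  lock.add(1);
  lock.waitFor(numthreads);    // first barrier: fine on its own
  writeln('thread ', id, ' is done ...');
  lock.add(1);                 // reusing the same lock for a second barrier...
  lock.waitFor(2*numthreads);  // ...races with the waitFor above and can hang
}
```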
Solution to Exercise Task.5
You need two separate locks, one for each synchronization point:
var lock1, lock2: atomic int;
const numthreads = 5;
lock1.write(0); // the main thread sets the first lock to zero
lock2.write(0); // the main thread sets the second lock to zero
coforall id in 1..numthreads {
writeln('greetings from thread ', id, '... I am waiting for all threads to say hello');
lock1.add(1); // thread id says hello and atomically adds 1 to lock1
lock1.waitFor(numthreads); // then it waits for lock1 = numthreads (which happens once all threads have said hello)
writeln('thread ', id, ' is done ...');
lock2.add(1);
lock2.waitFor(numthreads);
writeln('thread ', id, ' is really done ...');
}
Part 3: data parallelism
Solution to Exercise Data.1
Change the line
for i in 1..n {
to
forall i in 1..n with (+ reduce total) {
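A minimal self-contained sketch of the reduce intent (the loop body is illustrative, not the course's code): each task accumulates into a private copy of total, and the copies are combined with + when the loop ends.

```chapel
config const n = 100000;
var total = 0.0;
forall i in 1..n with (+ reduce total) do
  total += 1.0 / (i:real)**2;
writeln('total = ', total);   // approaches pi^2/6 ~ 1.6449 as n grows
```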
Solution to Exercise Data.2
Run the code with
$ ./test -nl 4 --n=3
$ ./test -nl 4 --n=20
For n=3 we get fewer threads (7 in my case); for n=20 we still get 12 threads (the maximum number of cores available inside our job).
Solution to Exercise Data.3
Something along the lines of m = here.id:string + '-' + m.locale.id:string; should work. In most cases m.locale.id should be the same as here.id (computation follows the data distribution).
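A self-contained sketch of the idea (the block-distributed array here is an assumption; the course code may differ): each element records both the locale executing the iteration and the locale owning the element.

```chapel
use BlockDist;
const mesh = {1..8, 1..8} dmapped Block(boundingBox={1..8, 1..8});
var m: [mesh] string;
forall e in m do
  e = here.id:string + '-' + e.locale.id:string;  // executing locale vs. element's locale
writeln(m);
```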
Solution to Exercise Data.4
It should be forall (i,j) in largerMesh[1..rows,1..cols] do (run on multiple locales in parallel) instead of forall (i,j) in mesh do (run in parallel on locale 0 only). Another possible solution is forall (i,j) in Tnew.domain[1..rows,1..cols] do (run on multiple locales in parallel). Also, we cannot use forall (i,j) in largerMesh do (run in parallel on multiple locales), as this would overwrite the boundaries.
Solution to Exercise Data.5
Just before the temperature output (inside the if count%nout == 0 block), insert the following:
var total = 0.0;
forall (i,j) in largerMesh[1..rows,1..cols] with (+ reduce total) do
total += T[i,j];
and add total to the temperature output. The total decreases as energy leaves the system:
$ chpl --fast parallel3.chpl -o parallel3
$ ./parallel3 -nl 1 --rows=30 --cols=30 --niter=2000 # run this from inside distributed.sh
Temperature at iteration 0: 25.0
Temperature at iteration 20: 3.49566 21496.5
Temperature at iteration 40: 2.96535 21052.6
...
Temperature at iteration 1100: 2.5809 18609.5
Temperature at iteration 1120: 2.58087 18608.6
Temperature at iteration 1140: 2.58085 18607.7
Final temperature at the desired position [1,30] after 1148 iterations is: 2.58084
The largest temperature difference was 9.9534e-05
The simulation took 0.114942 seconds
Solution to Exercise Data.6
Here is one possible solution examining the locality of the finite-difference stencil:
var message: [largerMesh] string = 'empty';
and on the next line, after computing Tnew[i,j], put
message[i,j] = here.id:string + message[i,j].locale.id:string + message[i-1,j].locale.id:string +
  message[i+1,j].locale.id:string + message[i,j-1].locale.id:string + message[i,j+1].locale.id:string + ' ';
and before the end of the while loop add
writeln(message);
assert(1>2); // deliberately fail so the simulation stops after the first iteration
Then run it
$ chpl --fast parallel3.chpl -o parallel3
$ ./parallel3 -nl 4 --rows=8 --cols=8 # run this from inside distributed.sh