Parallelizing Julia set
The Julia set problem was described in one of the earlier sections.
Parallelizing
How would we parallelize this problem with multi-processing? We have a large array, so we can use DistributedArrays and compute it in parallel. Here are the steps:
- Load `Distributed` on the control process.
- Load `DistributedArrays` on all processes.
- `stability` array should be distributed:
stability = dzeros(Int32, height, width);   # distributed 2D array of 0's
- Define function `pixel()` on all processes.
- Create `fillLocalBlock(stability)` to compute local pieces `stability.localpart` on each worker in parallel. If you don’t know where to start, begin with checking the complete example with `fillLocalBlock()` from the previous section. This function will cycle through all local indices `localindices(stability)`. This function needs to be defined on all processes.
- Replace the loop
@btime for i in 1:height, j in 1:width
point = (2*(j-0.5)/width-1) + (2*(i-0.5)/height-1)im
stability[i,j] = pixel(point)
end
with
@btime @sync for w in workers()
@spawnat w fillLocalBlock(stability)
end
- Why do we need `@sync` in the previous `for` block?
- To the best of my knowledge, both Plots’ `heatmap()` and NetCDF’s `ncwrite()` are serial in Julia, and they cannot take distributed arrays. How do we convert a distributed array to a local array to pass to one of these functions?
- Is your parallel code faster?
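The steps above could be sketched roughly as follows. This is one possible layout, not the reference solution: the body of `pixel()` and its constant `c` are assumptions standing in for the definition from the earlier section, the 200×200 size is arbitrary, and `addprocs(2)` is there only so that the sketch also runs without the `-p` flag. Timing with `@btime` is left out to avoid the extra BenchmarkTools dependency.

```julia
using Distributed
nprocs() == 1 && addprocs(2)         # assumption: start 2 local workers unless launched with `julia -p N`
@everywhere using DistributedArrays  # load the package on all processes

# Assumed stand-in for pixel() from the earlier section:
# number of iterations until the point escapes, capped at 255
@everywhere function pixel(z)
    c = 0.355 + 0.355im              # assumed constant, not taken from this section
    for i in 1:255
        z = z^2 + c
        abs(z) >= 4 && return i
    end
    return 255
end

# Fill only the block of `stability` owned by the calling process
@everywhere function fillLocalBlock(stability)
    height, width = size(stability)  # global dimensions
    h, w = localindices(stability)   # global row and column ranges of the local block
    block = localpart(stability)     # the locally stored piece
    for (iLocal, iGlobal) in enumerate(h), (jLocal, jGlobal) in enumerate(w)
        point = (2*(jGlobal-0.5)/width-1) + (2*(iGlobal-0.5)/height-1)im
        block[iLocal, jLocal] = pixel(point)
    end
    return nothing
end

height, width = 200, 200                  # arbitrary small size for the sketch
stability = dzeros(Int32, height, width)  # distributed 2D array of 0's

@sync for w in workers()
    @spawnat w fillLocalBlock(stability)
end
```

Note that the point coordinates are computed from the global indices, while the assignment goes into the local block through local indices starting at 1.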
Results for 1000^2
Finally, here are my timings on (some old iteration of) the training cluster:
Code | Time on login node (p-flavour vCPUs) | Time on compute node (c-flavour vCPUs) |
---|---|---|
julia juliaSetSerial.jl (serial runtime) | 147.214 ms | 123.195 ms |
julia -p 1 juliaSetDistributedArrays.jl (on 1 worker) | 157.043 ms | 128.601 ms |
julia -p 2 juliaSetDistributedArrays.jl (on 2 workers) | 80.198 ms | 66.449 ms |
julia -p 4 juliaSetDistributedArrays.jl (on 4 workers) | 42.965 ms | 66.849 ms |
julia -p 8 juliaSetDistributedArrays.jl (on 8 workers) | 36.067 ms | 67.644 ms |
Lots of things here to discuss!
One could modify our parallel code to offload some computation to the control process (rather than computing only on workers, as we do now), so that we would see a speedup when running on 2 CPUs (control process + 1 worker).
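A minimal sketch of that idea, under stated assumptions: the distributed array is created over `procs()` (which includes the control process) instead of the default worker-only layout, and the loop runs over `procs()` as well. The helper `markOwner()` and the 8×8 size are hypothetical and only show which process filled which block; in the Julia set code, `fillLocalBlock()` would take its place.

```julia
using Distributed
nprocs() == 1 && addprocs(1)         # assumption: one local worker, i.e. 2 processes in total
@everywhere using DistributedArrays

# Hypothetical helper: mark every locally owned element with the ID of the owning process
@everywhere function markOwner(data)
    localpart(data) .= myid()
    return nothing
end

# A process list as the second argument places blocks on those processes;
# procs() includes the control process (ID 1), workers() does not
data = dzeros((8, 8), procs())

@sync for p in procs()               # all processes, not just workers()
    @spawnat p markOwner(data)
end
```

With this layout the control process owns a block too, so it does a share of the work instead of only waiting for the workers. Whether that actually yields a speedup on 2 CPUs would need to be measured.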