Parallelizing N-body

In this section I will describe a project that you can work at home: the direct N-body solver.

Imagine that you place $N$ identical particles randomly into a unit cube, with zero initial velocities. Then you turn on gravity, so that the particles start attracting each other. There are no boundary conditions: these are the only particles in the Universe, and they can fly to $\infty$.

Question

What do you expect these particles will do?

We will adopt the following numerical method:

  • force evaluation via direct summation
  • single variable (adaptive) time step: smaller $\Delta t$ when any two particles are close
  • time integration: more accurate than simple forward Euler + one force evaluation per time step
  • two parameters: softening length and Courant number (I will explain these when we study the code)

In a real simulation, you would replace:

  • direct summation with a tree- or mesh-based $O(N\log N)$ code
  • current integrator with a higher-order scheme, e.g. Runge-Kutta
  • current timestepping with hierarchical particle updates
  • for long-term stable evolution with a small number of particles, use a symplectic orbit integrator

Expected solutions:

  • 2 particles: should pass through each other, infinite oscillations
  • 3 particles: likely form a close binary + distant 3$^{\rm rd}$ particle (hierarchical triple system)
  • many particles: likely form a gravitationally bound system, with occasional ejection

In these clips below the time arrow is not physical time but the time step number. Consequently, the animations slow down when any two particles come close to each other.

 

Below you will find the serial code nbodySerial.jl. I removed all parts related to plotting the results, as it’s slow in Julia, and you would need to install Plots package (takes a while with many dependencies!).

using ProgressMeter

npart = 20
niter = Int(1e5)
courant = 1e-3
softeningLength = 0.01

x = rand(npart, 3);   # uniform [0,1]
v = zeros(npart, 3);

soft = softeningLength^2;

println("Computing ...");
force = zeros(Float32, npart, 3);
oldforce = zeros(Float32, npart, 3);
@showprogress for iter = 1:niter
    tmin = 1.e10
    for i = 1:npart
        force[i,:] .= 0.
        for j = 1:npart
            if i != j
                distSquared = sum((x[i,:] .- x[j,:]).^2) + soft;
                force[i,:] -= (x[i, :] .- x[j,:]) / distSquared^1.5;
                tmin = min(tmin, sqrt(distSquared / sum((v[i,:] .- v[j,:]).^2)));
            end
        end
    end
    dt = min(tmin*courant, 0.001);   # limit the initial step
    for i = 1:npart
        x[i,:] .+= v[i,:] .* dt .+ 0.5 .* oldforce[i,:] .* dt^2;
        v[i,:] .+= 0.5 .* (oldforce[i,:] .+ force[i,:]) .* dt;
        oldforce[i,:] .= force[i,:];
    end
end

Tring running this code with julia nbodySerial.jl; the main loop takes ~6m on the training cluster. Obviously, the most CPU-intensive part is force evaluation – this is what you want to accelerate.

There are many small arrays in the code – let’s use SharedArrays and fill them in parallel, e.g. you would replace

force = zeros(Float32, npart, 3);

with

force = SharedArray{Float32}(npart,3);

When updating shared arrays, you have a choice: either update array[localindices(array)] on each worker, or use a parallel for loop with reduction. I suggest the latter. What do you want to reduce? Hint: what else do you compute besides the force in that loop? For code syntax, check parallelFor.jl code in this earlier section.

Results

With default 20 particles and $10^5$ steps the code runs slower in parallel on the training cluster:

Code Time
julia nbodySerial.jl (serial runtime) 340s
julia -p 1 nbodyDistributedShared.jl (2 processes) 358s

This is the same problem we discussed in the Chapel course: with a fine-grained parallel code the communication overhead dominates the problem. As we increase the problem size, we should see the benefit from parallelization. E.g. with 1000 particles and 10 steps:

Code Time
julia nbodySerial.jl (serial runtime) 83.7s
julia -p 1 nbodyDistributedShared.jl (2 processes) 47.9s

Here is what I got on Cedar with 100 particles and $10^3$ steps:

Run Time
serial 1m23s
2 cores 46s
4 cores 29s
8 cores 22s
16 cores 18s
32 cores 19s