Multi-locale Chapel

Setup

So far we have been working with single-locale Chapel codes that may run on one or many cores on a single compute node, making use of the shared memory space and accelerating computations by launching parallel threads on individual cores. Chapel codes can also run on multiple nodes on a compute cluster. In Chapel this is referred to as multi-locale execution.

Docker side note

If you work inside a Chapel Docker container, e.g., chapel/chapel-gasnet, the container environment simulates a multi-locale cluster, so you can compile and launch multi-locale Chapel codes directly, specifying the number of locales with the -nl flag:

$ chpl --fast mycode.chpl -o mybinary
$ ./mybinary -nl 3

Inside the Docker container your code will not run any faster on multiple locales than on a single locale, since you are emulating a virtual cluster and all tasks run on the same physical node. To achieve actual speedup, you need to run your parallel multi-locale Chapel code on a real HPC cluster.

On an HPC cluster you would need to submit either an interactive or a batch job asking for several nodes and then run a multi-locale Chapel code inside that job. In practice, the exact commands to run multi-locale Chapel codes depend on how Chapel was built on the cluster.
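
For example, on a Slurm cluster an interactive job might look like the following sketch (the flags are illustrative and match the resources we use below; the exact commands and any environment setup will differ between clusters):

$ salloc --time=0:30:0 --nodes=3 --cpus-per-task=2 --mem-per-cpu=1000
$ ./mybinary -nl 3   # run on 3 locales inside the interactive job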

When you compile a Chapel code with the multi-locale Chapel compiler, two binaries will be produced. One is called mybinary and is a launcher binary used to submit the real executable mybinary_real. If the Chapel environment is configured properly with the launcher for the cluster’s physical interconnect, then you would simply compile the code and use the launcher binary mybinary to run a multi-locale code.
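
For illustration, a typical multi-locale compile-and-run session might look like this (the file names are just an example):

$ chpl --fast mycode.chpl -o mybinary
$ ls mybinary*
mybinary  mybinary_real
$ ./mybinary -nl 3   # the launcher starts mybinary_real on 3 locales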

For the rest of this class we assume that you have a working multi-locale Chapel environment, whether provided by a Docker container or by multi-locale Chapel on a physical HPC cluster. We will run all examples on three nodes with two cores per node.

Let’s write a job submission script distributed.sh:

#!/bin/bash
#SBATCH --time=0:5:0         # walltime in d-hh:mm or hh:mm:ss format
#SBATCH --mem-per-cpu=1000   # in MB
#SBATCH --nodes=3
#SBATCH --cpus-per-task=2
#SBATCH --output=solution.out
./test -nl 3   # in this case the 'srun' launcher is already configured for our interconnect

Simple multi-locale codes

Let us test our multi-locale Chapel environment by launching the following code:

writeln(Locales);

$ source /home/razoumov/shared/syncHPC/startMultiLocale.sh   # on the training cluster
$ chpl test.chpl -o test
$ sbatch distributed.sh
$ cat solution.out

This code will print the built-in global array Locales. Running it on three locales will produce

LOCALE0 LOCALE1 LOCALE2
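
The number of locales is also available in the built-in constant numLocales, so an equivalent quick check (a minimal sketch) is:

writeln("running on ", numLocales, " locales");

In our three-node job this would print "running on 3 locales".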

We want to run some code on each locale (node). For that, we can cycle through locales:

for loc in Locales do   // this is still a serial program
  on loc do             // run the next line on locale `loc`
    writeln("this locale is named ", here.name[0..4]);   // `here` is the locale on which the code is running

This will produce

this locale is named node1
this locale is named node2
this locale is named node3

Here the built-in variable here refers to the locale on which the code is currently running, and here.name is its hostname. We started a serial for loop cycling through all locales, and on each locale we printed its name, i.e., the hostname of that node. The program ran serially, starting a task on each locale only after the task on the previous locale had completed. Note the order in which the locales were listed.
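
As a small variation on the same serial loop (assuming the same three-locale job), we can print the numeric locale id from here.id alongside the hostname:

for loc in Locales do
  on loc do
    writeln("locale ", here.id, " is running on host ", here.name);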

To run this code in parallel, starting three simultaneous tasks, one per locale, we simply need to replace for with forall:

forall loc in Locales do   // now this is a parallel loop
  on loc do
    writeln("this locale is named ", here.name[0..4]);

This starts three tasks in parallel, and the order in which the print statement is executed depends on the runtime conditions and can change from run to run:

this locale is named node1
this locale is named node3
this locale is named node2
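
The same parallel pattern works for any per-locale work, not just printing. For instance, here is a sketch (assuming the same three-locale job) in which each locale reports how many tasks it can run concurrently:

forall loc in Locales do
  on loc do
    writeln("locale ", here.id, " can run up to ", here.maxTaskPar, " tasks at once");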

We can print a few other attributes of each locale. Here it is actually useful to revert to the serial for loop so that the print statements appear in order:

use Memory.Diagnostics;
for loc in Locales do
  on loc {
    writeln("locale #", here.id, "...");
    writeln("  ...is named: ", here.name);
    writeln("  ...has ", here.numPUs(), " processor cores");
    writeln("  ...has ", here.physicalMemory(unit=MemUnits.GB, retType=real), " GB of memory");
    writeln("  ...has ", here.maxTaskPar, " maximum parallelism");
  }

$ chpl test.chpl -o test
$ sbatch distributed.sh
$ cat solution.out
locale #0...
  ...is named: node1
  ...has 2 processor cores
  ...has 2.77974 GB of memory
  ...has 2 maximum parallelism
locale #1...
  ...is named: node2
  ...has 2 processor cores
  ...has 2.77974 GB of memory
  ...has 2 maximum parallelism
locale #2...
  ...is named: node3
  ...has 2 processor cores
  ...has 2.77974 GB of memory
  ...has 2 maximum parallelism

Note that while Chapel correctly determines the number of physical cores on each node and the number of cores available to our job on each node (maximum parallelism), the memory it reports is the total physical memory of each node, shared by all running jobs, which is not the same as the memory per node allocated to our job.