Benchmark for measure the possible overlap of computation and communication at MPI collective communications.
This benchmark is inspired by the sandia micro benchmark: W. Lawry, C. Wilson, A. Maccabe, R. Brightwell. COMB: A Portable Benchmark Suite for Assessing MPI Overlap. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER 2002), p. 472, 2002. http://www.cs.sandia.gov/smb/overhead.html
The following communication times are measured: Blocking: Blocking call. E.g. MPI_Allreduce Nonblocking_wait (NB_Wait): Nonblocking (e.g. MPI_Iallreduce) call directly followed by MPI_Wait. Nonblocking_sleep (NB_Sleep): Nonblocking call followed by a busy wait until the work time has passed. Then MPI_Wait. Nonblocking_active (NB_active): Nonblocking call followed by a basy wait where in every iteration MPI_Test is called until the work time has passed. The MPI_wait.
The overhead is computed as the time for the Nonblocking call plus the time for MPI_Wait. The iteration time is the time for the whole communication. The available part of the communication time(avail(%)) is computed as 1-(overhead/base_t), where base_t is the time for calling the method with wait time = 0. The overhead is determined by increasing the work time successive until it is the dominant factor in the iteration time. Then the overhead is computed as iter_t-work_t.
Usage: mpirun ./mpi_collective_benchmark [options]
options: -method: default: allreduce. possible methods: allreduce, barrier, broadcast, gather, allgather, scatter -iterations: default: 10000. Number of iterations for measure the time for one communication -allMethods: default:0. If 1 iterates over all available methods -startSize: default: n, where n is the size of MPI_COMM_WORLD. runs the benchmark for different communicator sizes, starting with startSize. After every run the size is doubled. Finally one run is made for the whole communicator. -verbose: default: 0. If 1 prints intermediate information while determining the overhead. -threshold: default: 2. The threshold when the work time is the dominant factor in the iteration time. (Similar to the threshold in the sandia benchmark) -nohdr: default: 0. Suppress output of the header.
options can be set either in the options.ini file or can be pass at the command-line (-key value).
To get a good 'available' value for the NB_sleep communication, some MPI implementation need to spawn an extra thread. With MPICH you can activate this by setting the environment variable MPI_ASYNC_PROGRESS to 1, with IntelMPI the variable is called I_MPI_ASYNC_PROGRESS. (https://software.intel.com/en-us/mpi-developer-reference-linux-asynchronous-progress-control)