Global operations are those that affect all processes in a communicator at once. This section deals with some of these operations provided by MPI. These routines are highly optimized for the hardware in use and are likely to be more efficient than equivalent "hand-coded" operations.
int MPI_Barrier ( MPI_Comm comm )
A call that causes each process in the communicator "comm" to block until all the processes have called it. Useful for timing codes. Also used in preparation for transmission operations. This routine is time consuming and should be avoided under most conditions. Recall that synchronous message sending routines provide synchronization implicitly.
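As an illustration, here is a minimal sketch of using MPI_Barrier together with MPI_Wtime to time a section of code; the loop is just an arbitrary stand-in for the work being timed, and the variable names are not from any particular example code.

#include <stdio.h>
#include <mpi.h>

int main( int argc, char *argv[] )
{
   int my_rank;
   long i;
   double t_start, t_end, x = 0.0;

   MPI_Init( &argc, &argv );
   MPI_Comm_rank( MPI_COMM_WORLD, &my_rank );

   MPI_Barrier( MPI_COMM_WORLD );              /* start all processes together */
   t_start = MPI_Wtime();

   for ( i = 0; i < 10000000; i++ )            /* stand-in for the work being timed */
      x += 1.0 / (double)( i + 1 );

   MPI_Barrier( MPI_COMM_WORLD );              /* wait for the slowest process */
   t_end = MPI_Wtime();

   if ( my_rank == 0 )
      printf( "elapsed time: %f s ( x = %f )\n", t_end - t_start, x );

   MPI_Finalize();
   return 0;
}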
int MPI_Bcast ( void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm )
The process specified by root sends the content of buffer, count elements of type datatype, to all processes in the communicator comm. This is a very efficient way to send information from any process to all other processes in a communicator. The time used by this routine is much shorter than the time needed for sequential calls to MPI_Ssend/MPI_Recv, especially for a large number of processes.
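A minimal sketch of a broadcast, in which process 0 sends an array of three doubles to every process in MPI_COMM_WORLD, could look like the following; the array name and the values are arbitrary.

#include <stdio.h>
#include <mpi.h>

int main( int argc, char *argv[] )
{
   int my_rank;
   double params[3];

   MPI_Init( &argc, &argv );
   MPI_Comm_rank( MPI_COMM_WORLD, &my_rank );

   if ( my_rank == 0 )                 /* only the root fills the buffer */
   {
      params[0] = 1.0;
      params[1] = 2.5;
      params[2] = -3.0;
   }

   /* every process calls MPI_Bcast; afterwards all copies of params agree */
   MPI_Bcast( params, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD );

   printf( "rank %d: params = %f %f %f\n", my_rank, params[0], params[1], params[2] );

   MPI_Finalize();
   return 0;
}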
The need to compute a global maximum, minimum, sum, product, or other "global" operation on numbers distributed over a set of processes occurs often. MPI provides general purpose reduce commands to perform these tasks.
int MPI_Reduce ( void *local_value, void *results, int count, MPI_Datatype datatype, MPI_Op operator, int target_process, MPI_Comm comm )
int MPI_Allreduce ( void *local_value, void *results, int count, MPI_Datatype datatype, MPI_Op operator, MPI_Comm comm )
These commands perform the operation specified by operator on count elements of the local array local_value and store the results in the array results.
The difference between the MPI_Allreduce() and MPI_Reduce() routines is the disposition of the final result: MPI_Reduce() leaves the final result on target_process only, while MPI_Allreduce() makes the result available on all the nodes in the communicator.
The pre-defined operations are
MPI_MAX    | Maximum
MPI_MIN    | Minimum
MPI_SUM    | Sum
MPI_PROD   | Product
MPI_LAND   | Logical and
MPI_BAND   | Bitwise and
MPI_LOR    | Logical or
MPI_BOR    | Bitwise or
MPI_LXOR   | Logical exclusive or
MPI_BXOR   | Bitwise exclusive or
MPI_MAXLOC | Maximum and location
MPI_MINLOC | Minimum and location
The code global_mpi_operations.c illustrates the use of the MPI global operation routines. Note the syntax of the calls.
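For reference, a minimal sketch along the same lines (not the actual global_mpi_operations.c; the variable names and the contributed values are arbitrary) in which each process contributes one number to a global sum and a global maximum could look like:

#include <stdio.h>
#include <mpi.h>

int main( int argc, char *argv[] )
{
   int my_rank, n_procs;
   double local_value, sum, global_max;

   MPI_Init( &argc, &argv );
   MPI_Comm_rank( MPI_COMM_WORLD, &my_rank );
   MPI_Comm_size( MPI_COMM_WORLD, &n_procs );

   local_value = (double)( my_rank + 1 );      /* each process contributes one number */

   /* global sum; the result is available on process 0 only */
   MPI_Reduce( &local_value, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD );

   /* global maximum; the result is available on every process */
   MPI_Allreduce( &local_value, &global_max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD );

   if ( my_rank == 0 )
      printf( "sum = %f ( expected %f )\n", sum, 0.5 * n_procs * ( n_procs + 1 ) );
   printf( "rank %d: global_max = %f\n", my_rank, global_max );

   MPI_Finalize();
   return 0;
}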
The MPI global operations routines are based on clever algorithms for efficiency. As a learning tool, we look here at a possible implementation of a broadcast algorithm. It is based on the fact that organizing the nodes as a binary tree leads to an efficient broadcasting strategy which can seriously reduce the time needed to send a message to all the nodes in a parallel system. Such a tree, rooted at node 0 and written for a total of 8 nodes, is illustrated as follows
Node 0 sends the info to node 1. Then nodes 0 and 1 send the info to nodes 2 and 3, respectively. Then nodes 0, 1, 2, and 3 send the info to nodes 4, 5, 6, and 7, respectively. This strategy saves a large factor in time compared to a sequential series of sends. This is illustrated as follows:
Each time unit in this diagram includes the overhead time (handshaking protocol) as well as the time necessary to send the info. The larger the number of processes, the more significant the saving factor.
A skeleton implementation of this algorithm is given in the code print_tree.c.
The transmission tree translates into the following rules (generation refers to the time sequence of receive/send steps in the tree; generation ranges from 1 to 3 for a tree with up to 8 nodes, nodes 0 to 7).
if                                          | action       |
2^(generation-1) <= my_rank < 2^generation  | receive from | my_rank - 2^(generation-1)
my_rank < 2^(generation-1)                  | send to      | my_rank + 2^(generation-1)
The code print_tree.c implements these rules.
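For reference, a minimal sketch of such a hand-rolled tree broadcast of a single integer (not the actual print_tree.c; the variable names and the value 42 are arbitrary) could look like:

#include <stdio.h>
#include <mpi.h>

int main( int argc, char *argv[] )
{
   int my_rank, n_procs, step;
   int data = 0;
   MPI_Status status;

   MPI_Init( &argc, &argv );
   MPI_Comm_rank( MPI_COMM_WORLD, &my_rank );
   MPI_Comm_size( MPI_COMM_WORLD, &n_procs );

   if ( my_rank == 0 ) data = 42;              /* the value to broadcast (arbitrary) */

   /* step plays the role of 2^(generation-1) and doubles at every generation */
   for ( step = 1; step < n_procs; step *= 2 )
   {
      if ( step <= my_rank && my_rank < 2 * step )
      {
         /* this node gets its copy of the data in the current generation */
         MPI_Recv( &data, 1, MPI_INT, my_rank - step, 0, MPI_COMM_WORLD, &status );
      }
      else if ( my_rank < step && my_rank + step < n_procs )
      {
         /* this node already holds the data and passes it on */
         MPI_Send( &data, 1, MPI_INT, my_rank + step, 0, MPI_COMM_WORLD );
      }
   }

   printf( "rank %d has data = %d\n", my_rank, data );

   MPI_Finalize();
   return 0;
}

The guard my_rank + step < n_procs simply skips sends to nodes that do not exist when the number of processes is not a power of 2.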