Building and Running WRF with OpenMP and MPI support

This technical note will aid interested users (and their system administrators) in building and testing WRF with OpenMP and MPI support. Familiarity with WRF and the basic installation process described in the official documentation will be helpful.

Please contact Jan.Ploski (at) offis.de if you have corrections, questions or remarks regarding this document.

Introduction

WRF v.2.2 can use two different mechanisms for parallel execution:

  • OpenMP - multiple threads run within a single Unix process named wrf.exe (shared memory on a single machine). Each thread is assigned a part of the grid to process, a so-called "tile". An implementation of OpenMP ships with your commercial Fortran compiler, such as PGI, so no additional installation is necessary.
  • MPI - multiple wrf.exe Unix processes are started (possibly on different machines) and communicate by passing messages over the network (distributed memory). Each process is assigned a part of the grid to process, a so-called "patch". Different implementations of MPI exist. Each implementation consists of a library which you have to compile your code against, as well as tools for running the code in parallel. Installing MPI is a prerequisite and is independent of WRF.

The mechanisms described above are not mutually exclusive. A "hybrid" approach is also possible:

  • OpenMP+MPI - multiple wrf.exe processes are started, each of which starts multiple threads. The processes communicate using MPI. The threads communicate using OpenMP. Thus, there is one patch per process, which is split up into multiple tiles.

Additionally, MPI implementations differ in the networking technologies they support, such as:

  • Gigabit Ethernet
  • InfiniBand

Finally, there are two versions of the RSL library (which provides a software layer on top of MPI) to choose from:

  • RSL
  • RSL_LITE

The official documentation of WRF v.2.2 is short on details regarding the build process. In particular, it does not provide detailed instructions for compiling and running the following cases:

  • OpenMP
  • MPI over InfiniBand
  • Hybrid OpenMP+MPI

The following sections, based on our experience gathered in the e-Science project WISENT, sponsored by the German Federal Ministry of Education and Research, describe the WRF build process in more detail in order to provide an understanding of how the different pieces of software fit together and to help interested users create one of the above configurations.

Basic WRF Build

The basic WRF build process consists of the following steps (in Bash shell):

export NETCDF=/path/to/your/netcdf/installation
./clean -a
./configure
./compile em_real

The "configure" step offers you an interactive choice of the WRF variant to build. A prompt like the one below is displayed:

Please select from among the following supported platforms.
   1.  PC Linux x86_64 (IA64 and AMD Opteron), PGI compiler 5.2 or higher
       (Single-Threaded, No Nesting)
   2.  PC Linux x86_64 (IA64 and AMD Opteron), PGI compiler 5.2 or higher
       (Single-Threaded, RSL, Allows Nesting)
   3.  PC Linux x86_64 (IA64 and AMD Opteron), PGI compiler 5.2 or higher
       (OpenMP, Allows Nesting)
   4.  PC Linux x86_64 (IA64 and AMD Opteron), PGI 5.2 or higher, DM-Parallel
       (RSL, MPICH, Allows nesting)
   5.  PC Linux x86_64 (IA64 and AMD Opteron), PGI 5.2 or higher, DM-Parallel
       (RSL_LITE, MPICH, Allows nesting, No P-LBCs)
   6.  AMD x86_64 Intel xeon i686 ia32 Xeon Linux, ifort compiler
       (single-threaded, no nesting)
   7.  AMD x86_64 Intel xeon i686 ia32 Xeon Linux, ifort compiler
       (single threaded, allows nesting using RSL without MPI)
...

The problem is that the list displayed by "configure" might not contain the desired configuration. For example, there is generally no option to build a hybrid OpenMP+MPI variant at all. On the other hand, some unnecessary configurations might be displayed; for example, the snippet above, copied from our system, contains a configuration for the ifort compiler even though no such compiler is installed.

To resolve such problems, you have to understand how the "configure" script works and how to add additional configurations to the presented list.

Understanding configure

We examined the configure script with the Bash debugger so that you don't have to ;-) This script performs the following steps:

  1. Determines the location of the Perl interpreter on your system.
  2. Determines the location of the NetCDF library (environment variable NETCDF).
  3. Determines the location of the optional HDF5 I/O module (environment variable PHDF5).
  4. Determines the operating system and architecture using the command uname.
  5. Inspects the (optional) environment variable WRF_CHEM. If it is set to 1, WRF will be built with the chemistry option. (A combined example of these environment variables follows this list.)
  6. Runs the script arch/Config.pl with command-line options corresponding to the results of the above checks.
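In practice, this means that (apart from the required NETCDF variable) you can steer configure through a few optional environment variables before invoking it. A minimal sketch, with placeholder paths that you must adapt to your own installations:

export NETCDF=/path/to/your/netcdf/installation
export PHDF5=/path/to/your/parallel/hdf5
export WRF_CHEM=1
./configure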

The script arch/Config.pl displays the list of available target configurations and prompts you to select one. After you have selected an option, the file configure.wrf is created, which contains all the necessary environment settings and compiler options. This file is included directly by the Makefiles, so it uses Makefile syntax. You could edit this file after running configure to influence the compilation, but there is a better way, described next.

Understanding arch/Config.pl

Here is an example invocation of arch/Config.pl done by the configure script (with line breaks inserted for clarity):

arch/Config.pl \
    -perl=perl \
    -netcdf=/opt/netcdf-3.6.1-pgcc \
    -pnetcdf= \
    -phdf5= \
    -os=Linux \
    -mach=x86_64 \
    -ldflags= \
    -compileflags=

The arch/Config.pl script works as follows:

  1. Determines whether the environment variables JASPERLIB and JASPERINC are defined. The optional Jasper library supports Grib2 I/O.
  2. Determines whether the environment variables ESMFLIB and ESMFINC are defined. The optional (and experimental) ESMF library supports coupling WRF with external ESMF components.
  3. Reads the file configure.defaults to create the list of available target configurations.

During the last step, the file arch/configure.defaults is inspected for lines which start with the #ARCH prefix. If the OS name and machine architecture strings (both passed to arch/Config.pl from configure, see above) appear in such a line, the line's content is added to the displayed list of configurations. Otherwise, it is ignored.

After your selection, the corresponding section of arch/configure.defaults (from the initial #ARCH line until the next #ARCH line) is read. All strings that begin with CONFIGURE_ are replaced with the values provided by configure or based on the content of environment variables mentioned above. Finally, the text is written to configure.wrf, surrounded by the content of arch/preamble (with some ESMF-related substitutions) and arch/postamble.
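To give an idea of the format, here is an abridged, illustrative sketch of such a section. It is not a verbatim copy from any particular WRF version; the exact variable names and values differ between sections, and the CONFIGURE_ placeholder is just one example of the substitution described above:

#ARCH    PC Linux x86_64 (IA64 and AMD Opteron), PGI compiler 5.2 or higher  (OpenMP, Allows Nesting)
#
OMP             =       -mp
OMPCPP          =       -D_OPENMP
FCFLAGS         =       $(OMP) -fastsse
LDFLAGS         =       $(OMP)
NETCDFPATH      =       CONFIGURE_NETCDF_PATH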

Creating a Hybrid OpenMP+MPI Configuration

With this knowledge of arch/configure.defaults, you can create custom configurations based on the existing ones by adding another #ARCH section to the file. arch/configure.defaults is also the right place to modify compiler options. In particular, such modifications are necessary to build a hybrid OpenMP+MPI configuration matching your OS and machine architecture:

  1. Make a backup copy of arch/configure.defaults.
  2. Edit arch/configure.defaults.
  3. Copy the section that corresponds to the MPI variant you want to enhance with OpenMP. This should be one of the sections displayed by configure. You can choose RSL or RSL_LITE. According to the official documentation, one should choose compile options that use RSL_LITE whenever possible: the only feature RSL_LITE does not support is the periodic boundary condition in y; it is a bit faster, it works for domains dimensioned greater than 1024x1024, and the positive-definite advection scheme is implemented using RSL_LITE only. Unfortunately, we ran into problems compiling RSL_LITE with PGI compiler version 6.2; the compiler has a bug that can only be worked around by disabling the -fastsse optimization.
  4. Find a section which corresponds to the OpenMP variant for your platform and compiler of choice. In that section, have a look at what OMP and OMPCPP is set to and where $(OMP) and $(OMPCPP) appear in the other options. You will have to adjust your copied MPI configuration accordingly.
  5. For example, on Linux with the PGI compiler, the following adjustments were necessary (see the sketch after this list):
    1. Add the variable definitions OMP = -mp and OMPCPP = -D_OPENMP
    2. Add $(OMP) to FCFLAGS and LDFLAGS
    3. Add $(OMPCPP) to POUND_DEF
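Put together, the OpenMP-related lines of the copied MPI section might end up looking roughly like this. This is an illustrative sketch, not a verbatim copy; the placeholders in angle brackets stand for whatever flags the copied section already contains:

OMP             =       -mp
OMPCPP          =       -D_OPENMP
FCFLAGS         =       $(OMP) <existing Fortran compiler flags>
LDFLAGS         =       $(OMP) <existing linker flags>
POUND_DEF       =       $(OMPCPP) <existing -D definitions>

Also remember that the #ARCH line of your new section must still contain the OS name and machine architecture string (e.g. Linux and x86_64), otherwise configure will not display it.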

Creating an MPI-over-InfiniBand Configuration

You will notice that the configurations displayed by configure make no mention of the networking technology used by MPI. The reason is that the choice of networking technology is implied by which MPI library is installed on your system (you can also have multiple libraries installed).

For example, if you have installed the MPICH library with Gigabit Ethernet support, there will be tools such as mpicc, mpif77, mpif90 on your system, which the WRF build process will use to build a wrf.exe capable of executing over Gigabit Ethernet. In fact, a version of MPICH is shipped with the PGI compiler, so you might not even have to install it separately.
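Before compiling, it can be worth checking which MPI wrapper scripts the build will actually pick up. MPICH-style wrappers usually accept a -show option that prints the underlying compiler invocation:

which mpicc mpif77 mpif90
mpif90 -show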

On the other hand, if you wish to use MPI over InfiniBand, you will likely have to install the MVAPICH library yourself. Instead of downloading it directly, it might be more sensible to use the version that ships with the InfiniBand drivers; in our Linux setup, we installed the OFED 1.1 drivers. MVAPICH provides its own versions of the mpicc, mpif77 and mpif90 tools mentioned above. Therefore, to compile an InfiniBand-enabled version of WRF, you only need to set the PATH and LD_LIBRARY_PATH environment variables to point to the appropriate directories.
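For example, with MVAPICH installed as part of the OFED drivers, the environment could be prepared along these lines before running configure and compile (the installation prefix is a placeholder; check where the drivers put MVAPICH on your system):

export MVAPICH=/path/to/your/mvapich/installation
export PATH=$MVAPICH/bin:$PATH
export LD_LIBRARY_PATH=$MVAPICH/lib:$LD_LIBRARY_PATH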

Running OpenMP and MPI-enabled WRF Models

We next focus on the procedures for running the different model variants.

OpenMP

Running an OpenMP-enabled variant of WRF is rather simple:

ulimit -s unlimited
export KMP_STACKSIZE=500000000
export OMP_NUM_THREADS=8
./wrf.exe

Set the OMP_NUM_THREADS environment variable to the number of threads which should be created. The number should be less than or equal to the number of processors on the target machine. (Under Linux you can cat /proc/cpuinfo to find out how many processors you have.)
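For example, the following one-liner counts the processor entries in /proc/cpuinfo:

grep -c ^processor /proc/cpuinfo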

An OpenMP variant of WRF, just like the basic non-parallel variant, reports its execution progress on the standard output stream. You will see messages such as:

Timing for main: time 2005-03-29_18:00:18 on domain   3:    4.86800 elapsed seconds.
Timing for main: time 2005-03-29_18:00:36 on domain   3:    1.95200 elapsed seconds.
Timing for main: time 2005-03-29_18:00:54 on domain   3:    1.95000 elapsed seconds.
Timing for main: time 2005-03-29_18:00:54 on domain   2:   21.44400 elapsed seconds.
...

The times on the left indicate how far the simulation has progressed in model time, while the elapsed seconds on the right indicate how much wall-clock time the computations take.

Before the timings begin, a line will be output that reports the level of parallelism:

 WRF NUMBER OF TILES =   4

The multi-threaded wrf.exe process will appear in the output of top on the target machine as a single entry consuming n*100% CPU time, where n is the number of threads (at least that is the case under Linux with the Native POSIX Thread Library):

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12500 jploski   18   0 1162m 1.0g 6504 R  399 12.9  12:20.16 wrf.exe

MPI

How you manually run an MPI-enabled variant of WRF (unfortunately) depends on the MPI implementation you use. For MPICH, you should use the mpirun starter tool. For MVAPICH, you have to use the mpirun_rsh starter. Apart from their similar names and purpose, these tools are completely independent, so read the respective documentation for details. Running a set of wrf.exe processes does not differ from running any other set of MPI processes, so a basic understanding of the MPI tools is necessary.

For example, the invocation using the MPICH library looks as follows:

mpirun -np 4 $PWD/wrf.exe

In the above invocation, mpirun reads a text file listing the available machines and their processor counts from a standard location in the file system. Each MPI task is started on a separate machine via rsh or ssh, with a different set of environment variables that indicate the task's execution environment and its number within the task group.
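For MPICH's default ch_p4 device, that machines file is a plain list of host names, each optionally followed by a processor count. A hypothetical example (the host names are placeholders):

node1:2
node2:2
node3:2
node4:2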

For comparison, here is the invocation using the MVAPICH library, with an explicit selection of machines:

mpirun_rsh -np 4 node1 node2 node3 node4 $PWD/wrf.exe

When you run an MPI variant of WRF, the standard output and standard error of each wrf.exe process in the process group are redirected to a pair of files named rsl.out.<number> and rsl.error.<number>, respectively. There must be one such pair of files for every process; if there is not, something is wrong.

The first process of the group reports the progress of execution of the entire group (file rsl.error.0000). The very first line of this file will be something like

node1 -- rsl_nproc_all 16, rsl_myproc 0

rsl_nproc_all should be equal to the total number of running processes.
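A convenient way to follow the progress of a running MPI job is therefore to watch this file:

tail -f rsl.error.0000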

For pure-MPI runs, the following will appear in the output:

 WRF NUMBER OF TILES =   1

The top output on the target machine will contain multiple wrf.exe processes, each consuming about 100% CPU time (on a multi-processor machine):

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12384 jploski   25   0  445m 317m  31m R  100  2.0 103:10.99 wrf.exe
12385 jploski   25   0  466m 334m  26m R  100  2.1 105:27.31 wrf.exe
12386 jploski   25   0  465m 337m  30m R  100  2.1 105:26.68 wrf.exe
12387 jploski   25   0  466m 333m  26m R  100  2.1 105:27.57 wrf.exe

OpenMP+MPI

Running a hybrid variant of WRF requires setting the environment variables just like for the OpenMP variant, while using a starter tool just like for the MPI variant. Keep in mind that the number of "processes" (-np) which you request from the starter tool is really the number of wrf.exe processes, not the number of threads. Therefore, if you have 2 nodes with 4 processors each, you must specify -np 2 and set OMP_NUM_THREADS=4, for a total of 8 concurrent threads.
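As a rough example, a manual hybrid start with MVAPICH on the two nodes described above could look like this (mpirun_rsh passes VAR=value pairs placed before the executable on to the remote processes; the host names are placeholders, and you should check the documentation of your MVAPICH version):

ulimit -s unlimited
mpirun_rsh -np 2 node1 node2 OMP_NUM_THREADS=4 KMP_STACKSIZE=500000000 $PWD/wrf.exe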

There will be a pair of rsl.out and rsl.error files created per wrf.exe process (not per thread). Likewise, the reported rsl_nproc_all will be equal to the number of processes. However, the reported WRF NUMBER OF TILES will match the number of threads per process.

WRF with TORQUE

If you wish to use the TORQUE (or PBS) batch system to submit and manage WRF jobs that use MPI, you should not follow the start-up procedures described in the documentation of your MPI implementation (and you do not need to understand them, either). Instead, familiarize yourself with a dedicated TORQUE-specific starter tool called mpiexec.

mpiexec, not to be confused with the mpiexec command defined by the MPI-2 standard (and provided by implementations such as MPICH2), is a small utility which supports the various start-up protocols used by the different MPI implementations and integrates with TORQUE for improved CPU time accounting. It also provides a uniform user interface for starting MPI programs, which is useful if your cluster has multiple network interconnects.

An example TORQUE job file for an MPI WRF job using the MPICH library (that is, communicating over Gigabit Ethernet):

#PBS -q verylong
#PBS -l nodes=4
#PBS -m n
#PBS -o wrf.OUT
#PBS -e wrf.ERR

cd /path/to/WRFV2/test/em_real
source /path/to/environment/vars/for/mpich.sh
mpiexec -np 4 -comm mpich-p4 ./wrf.exe

A job file for an MPI WRF job using the MVAPICH library (communicating over InfiniBand) is almost identical:

#PBS -q verylong
#PBS -l nodes=4
#PBS -m n
#PBS -o wrf.OUT
#PBS -e wrf.ERR

cd /path/to/WRFV2/test/em_real
source /path/to/environment/vars/for/mvapich.sh
mpiexec -np 4 -comm mpich-ib ./wrf.exe

To run a hybrid OpenMP+MPI model, use something like this:

#PBS -q verylong
#PBS -l nodes=4:ppn=4
#PBS -m n
#PBS -o wrf.OUT
#PBS -e wrf.ERR

ulimit -s unlimited
export KMP_STACKSIZE=500000000
export OMP_NUM_THREADS=4

cd /path/to/WRFV2/test/em_real
source /path/to/environment/vars/for/mvapich.sh
mpiexec -np 4 -npernode 1 -comm mpich-ib ./wrf.exe

The reservation #PBS -l nodes=4:ppn=4 requests 4 nodes with 4 available processors each. The -npernode option passed to mpiexec is important: it forces each MPI task to be started on a separate machine, so that each task indeed has 4 processors available for its 4 threads. Without this option, multiple MPI tasks might be started with all their threads on the same target machine, leading to overload and significantly decreased performance.