Condor Configuration for Submitting Jobs to Globus Toolkit 4

This page, which you should read after the Condor-GT4 Introduction, explains how to set up Condor to submit jobs to a Globus Toolkit 4 host (a so-called "Condor-G" configuration). The target audience is administrators or advanced users who can edit the Condor configuration, install software and reconfigure the firewall on their submission host.

Summary for the Impatient

Set LOWPORT=20000 and HIGHPORT=25000 in condor_config. Open this port range in the firewall for incoming TCP connections from the GT4 host to the submission host. Authenticate with your user certificate using grid-proxy-init (from GT4). The authentication can also take place on a host other than the submission host, provided that you then log in to the submission host using gsissh. Use the following in your job file:

universe = grid
grid_resource = gt4 https://srvgrid01.offis.uni-oldenburg.de/wsrf/services/ManagedJobFactoryService PBS

You can also provide an XML fragment to be inserted into the WS GRAM job description directly:

globus_xml = <queue>dgiseq</queue>
universe = grid
grid_resource = gt4 https://srvgrid01.offis.uni-oldenburg.de/wsrf/services/ManagedJobFactoryService PBS

If you don't understand the above or it does not work, read on.

Step 1: Installing Authentication Tools - grid-proxy-init and gsissh

Before using Condor to submit Grid jobs (defined here as jobs submitted to a host running Globus Toolkit 4), you will need to install Globus Toolkit 4 (GT4) on every machine from which you are going to log into the Condor submission host - or on the submission host itself. We recommend that you perform the installation centrally (e.g., on an NFS server) for the benefit of multiple users. In the end, the only important thing is that the programs grid-proxy-init and gsissh work correctly for anyone interested in submitting Grid jobs with Condor.

Luckily, you do not have to install the whole GT4. The following steps are sufficient for our purposes and (much) faster than the entire GT4 installation process; a condensed shell session follows the list:

  1. Download and unpack the source tarball of the latest stable release of GT4. At the time of writing, it is version 4.0.5.
  2. ./configure --prefix=/opt/gt-4.0.5 or some other preferred installation location.
  3. make gsi-openssh. This compiles only the required components; you do not even need a working Java installation for this step.
  4. make install
  5. Modify your $HOME/.profile (or the appropriate system-wide settings) and set the environment variable GLOBUS_LOCATION to point to the installation location you specified as --prefix.
  6. Also in your .profile script, source $GLOBUS_LOCATION/etc/globus-user-env.sh
  7. (This and the following step only apply directly as described if you wish and are authorized to use D-Grid.) Download the D-Grid CA certificates and unpack them either in $HOME/.globus/certificates or, for a system-wide installation, in /etc/grid-security/certificates. You will have to create these directories yourself.
  8. Set up a cron job which regularly (e.g. daily) updates the certificates by downloading them from the above location.
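
For orientation, the steps above boil down to the following shell session. This is only a sketch: the tarball name, version number and installation prefix are examples that you must adapt to your site, and the CA certificate archive is whatever you downloaded in step 7.

tar xzf gt4.0.5-source.tar.gz        # example name; use the tarball you actually downloaded
cd gt4.0.5                           # the unpacked source directory
./configure --prefix=/opt/gt-4.0.5
make gsi-openssh                     # builds only the GSI/OpenSSH components
make install

# In $HOME/.profile (or a system-wide profile):
export GLOBUS_LOCATION=/opt/gt-4.0.5
. $GLOBUS_LOCATION/etc/globus-user-env.sh

# CA certificates for a per-user installation (steps 7 and 8):
mkdir -p $HOME/.globus/certificates
# ...unpack the downloaded CA certificate archive there and refresh it regularly via cron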

After performing these steps, you are finished with the installation of GT4 authentication tools. To test it, try running grid-proxy-init. At this stage, it should produce an error message like this one:

ERROR: Couldn't find valid credentials to generate a proxy.
Use -debug for further information.

Also try running gsissh as shown below:

gsissh -p 2222 srvgrid01.offis.uni-oldenburg.de
The authenticity of host '[srvgrid01.offis.uni-oldenburg.de]:2222 ([134.106.52.210]:2222)' can't be established.
RSA key fingerprint is 4a:ef:28:2c:f2:4e:d8:2a:d9:59:99:3d:89:d6:1a:3e.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '[srvgrid01.offis.uni-oldenburg.de]:2222,[134.106.52.210]:2222' (RSA) to the list of known hosts.
jploski@srvgrid01.offis.uni-oldenburg.de's password:

gsissh is not needed if you can submit Condor jobs from the same machine on which you have just installed grid-proxy-init. This would be the case if the Condor submission host is your own workstation. However, if submitting Condor jobs from your workstation directly is not possible (for example, because you are located behind a strict firewall), then you will need to use gsissh to connect to a dedicated Condor submission host outside of the firewall.

Right now, abort gsissh with CTRL-C when asked for the password. After everything is set up appropriately, this password prompt should not even appear. Read on.

Step 2: Installing a Valid Grid User Certificate

The GT4 command grid-proxy-init is used to create a so-called proxy certificate (a temporary file) which Condor sends to the GT4 frontend node to establish your (personal) identity and authorize access. Alternatively, the same proxy certificate can be used by gsissh to log into the submission host and submit Condor jobs from there.

In order for grid-proxy-init to create the proxy certificate, it must have access to your Grid user certificate (a file named usercert.pem) and private key (userkey.pem). Copy both of these files to $HOME/.globus on your workstation. The Grid user certificate and private key can be obtained from the Grid-RA (Registration Authority) responsible for your organization. Alternatively, you might be able to obtain a test certificate directly from an administrator of the GT4 site you wish to test. This administrative process is not explained here.

jploski@pcoffis26:~> ls -l .globus
total 22
-rw-r--r--  1 jploski users  1879 2007-04-19 18:02 usercert.pem
-r--------  1 jploski users   951 2007-04-19 18:03 userkey.pem
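
Note the restrictive permissions on userkey.pem in the listing above; grid-proxy-init will typically refuse to use a private key that is readable by other users. If your copies ended up with different permissions, the following commands (assuming the files already reside in $HOME/.globus) reproduce the ones shown:

chmod 444 $HOME/.globus/usercert.pem
chmod 400 $HOME/.globus/userkey.pem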

To test whether GT4 works correctly after installing your user certificate as described above, run the command grid-proxy-init. This time, you should see an output similar to the following:

jploski@pcoffis26:~/condor-gt4> grid-proxy-init
Your identity: /C=DE/O=GridGermany/OU=OFFIS e.V./OU=Grid-RA/CN=Jan Ploski
Enter GRID pass phrase for this identity:
Creating proxy ............................................. Done
Your proxy is valid until: Wed Jul 18 01:26:25 2007

The generated proxy certificate is simply a file named /tmp/x509up_u<your Unix user ID>. This proxy certificate, valid for a limited amount of time, is transferred by Condor to the GT4 host to prove your identity. It is also used by Condor when transferring files between the submission and execution host and by gsissh to authenticate with a server running gsisshd. Thanks to the proxy certificate mechanism you do not have to enter your password upon every job submission or gsissh login. The password is only necessary once to access the userkey.pem file in order to generate the proxy certificate.
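
To verify that the proxy was created (and that only you can read it), list the file directly; the id -u substitution merely fills in your Unix user ID in the name pattern mentioned above:

ls -l /tmp/x509up_u$(id -u)

If your installation also includes grid-proxy-info, that command additionally reports the proxy's subject and remaining lifetime.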

If you wish to use gsissh to log in to the Condor submission host, the proxy certificate is transferred along with your session, so that it can be transparently reused by condor_submit. Note that for gsissh logins to work at all, the submission host must be running the gsisshd daemon (a drop-in replacement for the standard sshd). gsisshd is installed, but not activated, by the procedure described in Step 1. Refer to the GT4 documentation for more details.

Step 3: Firewall Configuration

The firewall on the Condor submission host (the host on which you run condor_submit for Grid jobs) will most likely need to be reconfigured to allow the necessary TCP communication with the target GT4 hosts.

You will need to open a range of ports for incoming TCP connections from each GT4 host to the submission host, for example 20000-25000. These ports are used as follows:

  • For the control channel connection from GT4 to Condor's internal GridFTP server on the submission host
  • For the data channel connections from GT4 to Condor's internal GridFTP server on the submission host
  • For GRAM notifications from GT4 to Condor on the submission host

Besides opening the port range in your firewall, you must also change the condor_config file on the submission host and set HIGHPORT and LOWPORT accordingly there, as shown below. Don't forget to run condor_reconfig or restart Condor after changing this file. Unfortunately, the HIGHPORT and LOWPORT settings have the side effect that other Condor daemons will also start listening for their TCP connections in this port range. Read on if this is a concern, otherwise skip to the next section.
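
A minimal condor_config fragment for the port range used throughout this page (the exact range is only an example; pick one that matches your firewall policy):

LOWPORT = 20000
HIGHPORT = 25000

After editing the file, run condor_reconfig on the submission host so that the running daemons pick up the change.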

Is Setting HIGHPORT And LOWPORT Really Necessary?

In short, yes.

If you examine the behind-the-scenes submission mechanism more closely, you will find the script gridftp_wrapper.sh used by Condor to start the GridFTP server. You might think that the Condor-wide setting of HIGHPORT/LOWPORT could be avoided if you set the Globus-specific port range environment variables in the file gridftp_wrapper.sh and fix the control channel port as shown below:

#!/bin/sh
unset GLOBUS_LOCATION
GRIDMAP=`pwd`/$GRIDMAP
export GRIDMAP
cmd=$1
shift
# Modification: restrict the GridFTP data/source connections to the open port range
export GLOBUS_TCP_PORT_RANGE=20000,25000
export GLOBUS_TCP_SOURCE_RANGE=20000,25000
# Modification: fix the control channel port and record it in gridftp.out
HOSTNAME=`hostname -f`
echo Server listening at $HOSTNAME.de:20000 > gridftp.out
exec $cmd -p 20000 -c /dev/null "$@"
# Unmodified invocation, kept for reference:
#exec $cmd -c /dev/null "$@"

However, this approach does not tell Condor and GT4 which TCP port the GRAM notifications should be sent to. As a result, your jobs will not terminate cleanly (there will be delays) and their status might be misreported.

Example Job Command File

To prepare a job for submission to a GT4 host, set the universe parameter to grid and provide an additional parameter grid_resource with the host address as shown below:

executable = /bin/hostname
transfer_executable = false
universe = grid
grid_resource = gt4 https://srvgrid01.offis.uni-oldenburg.de/wsrf/services/ManagedJobFactoryService PBS
output = test.out
log = test.log
queue

You will have to change srvgrid01.offis.uni-oldenburg.de, of course. The rest of the address should not require modifications, unless the GT4 site uses a different job manager (change PBS to the manager's name in this case).
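
For illustration, a site that only offers the default Fork job manager (the standard GT4 manager which runs jobs directly on the frontend node; whether it is enabled is the site administrator's decision) would be addressed like this:

grid_resource = gt4 https://srvgrid01.offis.uni-oldenburg.de/wsrf/services/ManagedJobFactoryService Fork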

Job Submission

After grid-proxy-init (which only needs to be run once per day or so), submit your job as usual with condor_submit job.cmd. After several seconds, two new entries should appear in the output of condor_q:

jploski@pcoffis26:~/condor-gt4> condor_q

-- Submitter: pcoffis26.offis.uni-oldenburg.de : <134.106.52.79:1070> : pcoffis26.offis.uni-oldenburg.de
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  21.0   jploski         7/17 14:27   0+00:00:00 I  0   9.8  hostname
  22.0   jploski         7/17 14:27   0+00:00:00 R  0   0.0  gridftp_wrapper.sh

2 jobs; 1 idle, 1 running, 0 held

The first entry (21.0) corresponds to the submitted job. The second entry (22.0) is a helper job, which is executed automatically on the submission host. It starts a GridFTP server (which comes bundled with Condor) to manage uploading files to and downloading files from the GT4 host.

If everything is configured properly, the first job will complete after a while and its output will be written to the test.out file, as specified in the job command file. The gridftp_wrapper.sh job might continue running for some time after that, but it, too, will eventually be terminated.

Troubleshooting

To gather more diagnostic information, use the following settings in condor_config:

GRIDMANAGER_LOG = /tmp/GridmanagerLog.$(USERNAME)
GRIDMANAGER_DEBUG = D_FULLDEBUG

Submit your job and then inspect the submitting user's GridmanagerLog file.

You may also obtain relevant information by running strace -f -p pid as root, where pid is the process id of condor_schedd. If you use strace, you will most likely want to restrict the logging to a subset of all system calls. For example, the following invocation was helpful when diagnosing one problem:

strace -f -p 4500 -e trace=write -e write=3 &> /tmp/schedd.trace

The following subsections describe non-trivial problems we have encountered with Condor-G installations in project WISENT and their solutions.

Java Version

First of all, check that the Java version, as specified in the condor_config file, is at least 1.5. If you are using a recent version of Condor, 1.4 (shipped at the time of writing with most Linux distributions) is unfortunately not enough. Luckily, installing a newer version of Java from http://java.sun.com is very easy.
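
To check which JVM Condor is actually using, ask Condor for the value of the JAVA configuration parameter and query that binary for its version (the path below is just an example):

condor_config_val JAVA
/opt/jdk1.5.0/bin/java -version    # substitute the path printed by the previous command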

Firewall Settings, GridFTP Log

If your firewall rules or the HIGHPORT/LOWPORT settings are not correct, you will run into problems during the first submission. For example, the jobs may stay in the queue indefinitely - the status of the hostname job never changes from I to R. Use the following command to obtain more information (replace 22.0 with the job id reported by condor_q above):

condor_q -long 22.0

The output will include a line like this:

Iwd = "/usr/local/condor/local.pcoffis26/spool/cluster22.proc0.subproc0"

This is the working directory of the GridFTP server. You should inspect its contents for troubleshooting. In particular, the file gridftp.out will contain the port number on which the local GridFTP server is listening.
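
A quick way to continue from there, using the spool directory from the example above (the connectivity probe with nc is just one possibility and assumes nc is available on the GT4 host):

cd /usr/local/condor/local.pcoffis26/spool/cluster22.proc0.subproc0
cat gridftp.out          # reports the host and port the GridFTP server listens on
# Then, from the GT4 host, check that this port is actually reachable:
nc -z pcoffis26.offis.uni-oldenburg.de <port from gridftp.out>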

Expired CRLs

If the job changes state to 'H' (held), run condor_q -long <job id> to figure out the hold reason. If you see HoldReason = "Globus error: Staging error for RSL element fileStageIn.", it might mean that you have an issue with expired CRLs in your $HOME/.globus/certificates directory. Download a fresh copy of this directory from the address mentioned above and re-authenticate with grid-proxy-init.
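
To check whether expired CRLs are indeed the cause, you can print the expiry date of each installed CRL with openssl. This sketch assumes the usual layout of a Globus certificates directory, where CRLs are stored as <hash>.r0 files:

for crl in $HOME/.globus/certificates/*.r0; do
    echo -n "$crl: "
    openssl crl -in "$crl" -noout -nextupdate
done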

The expired CRLs problem can be diagnosed in more detail by sniffing TCP packets on the GridFTP connections. If the problem occurs, they will contain messages like the ones shown below:

530-globus_gsi_callback_module: Could not verify credential
530-globus_gsi_callback_module: Could not verify credential
530-globus_gsi_callback_module: Invalid CRL: The available CRL has expired
530 End. Caused by java.io.IOException: 530-globus_xio: Authentication Error
530-globus_gsi_callback_module: Could not verify credential
530-globus_gsi_callback_module: Could not verify credential
530-globus_gsi_callback_module: Invalid CRL: The available CRL has expired
530 End.

Note that if you look into the GridFTP log, you might see messages like this one:

[14324] Tue Jul 17 16:16:42 2007 :: srvgrid01.offis.uni-oldenburg.de:7362: [CLIENT ERROR]: SITE BUFSIZE 16384

These messages are harmless and can be ignored.

Hostname Resolution

If your CRLs are up-to-date, but you see HoldReason = "Globus error: Staging error for RSL element fileStageIn.", it might mean that the file transfers have failed due to an incorrect IP-to-hostname resolution on the submission host. This is more likely if the submission host has multiple IP addresses, some of which are not public. In such circumstances, Condor may be sending a wrong hostname in the sourceUrl or destinationUrl field of the WS-GRAM SOAP message to Globus.

You can inspect the SOAP messages received by Globus in container.log on the Globus host. For that you will likely need to be the administrator of the Globus host (or work together with one). Detailed SOAP message logging has to be enabled in container-log4j.properties first.

The following little test program can help you diagnose hostname resolution. Compile and run it on your submission host:

#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <arpa/inet.h>

int main(void)
{
    char name[80];
    struct hostent* ent;
    int i;

    /* Obtain the local host name and resolve it back to addresses. */
    gethostname(name, sizeof(name));
    printf("gethostname: %s\n", name);

    ent = gethostbyname(name);
    if (!ent)
    {
        fprintf(stderr, "gethostbyname(%s) failed\n", name);
        return 1;
    }
    printf("gethostbyname(%s) = %s\n", name, ent->h_name);

    /* The entries in h_addr_list are IPv4 addresses in network byte order. */
    for (i = 0; ent->h_addr_list[i]; i++)
    {
        struct in_addr in_addr;
        memcpy(&in_addr, ent->h_addr_list[i], sizeof(in_addr));
        printf("ip[%d] = %s\n", i, inet_ntoa(in_addr));
    }
    return 0;
}

Try mapping the displayed IP address back to a hostname using nslookup. If you get back a hostname which is not known on the Globus host, then this is your problem. The solution is most likely to edit /etc/hosts on the submission host so that the hostname resolves to an IP address that can be resolved back to the externally known hostname.
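
As an illustration, using the submission host from the examples on this page (the address and names are taken from the condor_q output above and must of course be replaced by your own), the /etc/hosts entry might look like this:

134.106.52.79   pcoffis26.offis.uni-oldenburg.de   pcoffis26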

The Condor configuration parameter NETWORK_INTERFACE may also be helpful in dealing with hostname resolution.
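
NETWORK_INTERFACE tells the Condor daemons which IP address to advertise on a multi-homed host. A sketch, again using the example address from above:

NETWORK_INTERFACE = 134.106.52.79

Run condor_reconfig (or restart Condor) after changing it.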

Jobs Start Slowly

If you submit a few hundred Grid jobs in the default Condor configuration, you may notice that it takes ages for all the jobs to actually be submitted to the Grid sites. A somewhat plausible excuse is that Condor is designed as a "high-throughput", not a "high-performance" system. However, tweaking a few configuration options can alleviate this problem substantially. Here are the options we use:

UPDATE_INTERVAL = 30
JOB_START_COUNT = 10
JOB_START_DELAY = 2
NEGOTIATOR_INTERVAL = 30
NEGOTIATOR_MATCHLIST_CACHING = FALSE
GRIDMANAGER_MAX_PENDING_SUBMITS_PER_RESOURCE = 20
GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE = 10

Other Issues

We also got error messages in the Globus container.log, which looked like this one:

2007-07-19 18:06:31,367 ERROR service.URLExpander [Thread-93,doMlsd:153] Error expanding a directory URL
Server refused performing the request. Custom message:
  (error code 1) [Nested exception message:  Custom message: Unexpected reply:
500-Command failed. : globus_xio: Unable to connect to 10.0.0.254:20002
500-globus_xio: System error in connect: No route to host
500-globus_xio: A system call failed: No route to host
500 End.]
org.globus.ftp.exception.ServerException: Server refused performing the request.
Custom message:  (error code 1) [Nested exception message:  Custom message:
Unexpected reply: 500-Command failed. : globus_xio: Unable to connect to
10.0.0.254:20002
500-globus_xio: System error in connect: No route to host
500-globus_xio: A system call failed: No route to host
500 End.].  Nested exception is
org.globus.ftp.exception.UnexpectedReplyCodeException:
  Custom message: Unexpected reply: 500-Command failed. :
globus_xio: Unable to connect to 10.0.0.254:20002
500-globus_xio: System error in connect: No route to host
500-globus_xio: A system call failed: No route to host
500 End.
        at org.globus.ftp.vanilla.TransferMonitor.run(TransferMonitor.java:191)
        at org.globus.ftp.vanilla.TransferMonitor.start(TransferMonitor.java:105)
        at org.globus.ftp.FTPClient.transferRunSingleThread(FTPClient.java:1451)
        at org.globus.ftp.FTPClient.performTransfer(FTPClient.java:756)
        at org.globus.ftp.FTPClient.mlsd(FTPClient.java:709)
        at org.globus.ftp.FTPClient.mlsd(FTPClient.java:658)
        at org.globus.ftp.GridFTPClient.mlsd(GridFTPClient.java:163)
        at org.globus.ftp.FTPClient.mlsd(FTPClient.java:641)
        at org.globus.transfer.reliable.service.URLExpander.doMlsd(URLExpander.java:151)
        at org.globus.transfer.reliable.service.URLExpander.run(URLExpander.java:181)

These messages were caused by the name of our GT4 host being resolved to an internal IP address. Adjusting the name resolution (/etc/hosts) to make it use the external IP address fixed the problem.

Additional Information

The following two articles at IBM developerWorks describe integrating Condor and Globus Toolkit in more detail and also explain the benefits of such integration. However, their examples are based on an older version of Globus Toolkit: