Metascheduling with Condor-G
The first part of this tutorial explained how to submit Condor jobs to an explicitly specified Globus Toolkit 4 site. In a Grid context, however, there are usually multiple target sites to choose from. This part shows how to implement so-called metascheduling: selecting the target Grid site automatically based on technical or user-specified criteria.
Match-making with substitution macros
It would be possible to implement metascheduling outside of Condor: based on some external information, simply generate a Condor command file which contains a fixed grid_resource line for the chosen site. Condor's built-in match-making mechanism, however, achieves the same effect more elegantly, using substitution macros in the job command file.
Here is a simple example of a job file with placeholders:
executable = /bin/bash
arguments = yourscript.sh
transfer_executable = false
transfer_input_files = yourscript.sh
when_to_transfer_output = ON_EXIT
universe = grid
grid_resource = gt4 $$(gatekeeper_url) $$(job_manager_type)
output = test.out
error = test.err
log = test.log
queue
The names of the placeholders can be chosen freely. The only requirement is that they match the names of attributes from the machine ClassAd which describes a matching GT4 head node, discussed next. Note that the placeholders can also appear in other (but not all) lines of the job command file, not just in the grid_resource line.
While it is imaginable to use a single placeholder for the entire grid_resource value, splitting it into separate attributes (here, the gatekeeper URL and the job manager type) keeps the machine ClassAds easier to read and reuse.
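The effect of the substitution macros can be illustrated with a short sketch. The following Python snippet corresponds to the "external metascheduling" alternative mentioned above: it fills the $$() placeholders itself, producing the command file that the Condor negotiator would otherwise assemble during match-making. The function and variable names are our own, purely illustrative.

```python
import re

# Abbreviated version of the job command file shown above.
SUBMIT_TEMPLATE = """\
executable = /bin/bash
arguments = yourscript.sh
universe = grid
grid_resource = gt4 $$(gatekeeper_url) $$(job_manager_type)
queue
"""

def expand_macros(template, machine_ad):
    """Replace each $$(attr) with the corresponding machine-ClassAd value."""
    return re.sub(r"\$\$\((\w+)\)",
                  lambda m: machine_ad[m.group(1)],
                  template)

# Values taken from the example machine ClassAd discussed below.
site = {
    "gatekeeper_url": "https://srvgrid01.offis.uni-oldenburg.de/wsrf/services/ManagedJobFactoryService",
    "job_manager_type": "PBS",
}

submit_file = expand_macros(SUBMIT_TEMPLATE, site)
```

With the match-making approach, this expansion step is performed by Condor itself, so the job file with the placeholders can be submitted unchanged regardless of which site is eventually chosen.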
Advertising GT4 head nodes
In case of a local Condor pool, the condor_startd daemon running on each execute machine advertises a machine ClassAd to the collector automatically.
In a Grid context, we are not normally authorized to run condor_startd on the Grid site's head node. Instead, the machine ClassAd can be posted manually with condor_advertise, which merely requires write access to the local pool's collector. Incidentally, this is the same level of access also required for submitting jobs (they are ClassAds, too), meaning that we can advertise the GT4 head nodes from any host where condor_submit already works.
A machine ClassAd (which, when posted, can be seen with condor_status -long) looks like this:
MyType = "Machine"
TargetType = "Job"
Name = "srvgrid01.offis.uni-oldenburg.de"
Machine = "srvgrid01.offis.uni-oldenburg.de"
gatekeeper_url = "https://srvgrid01.offis.uni-oldenburg.de/wsrf/services/ManagedJobFactoryService"
job_manager_type = "PBS"
Requirements = (TARGET.JobUniverse == 9)
Rank = 0.000000
CurrentRank = 0.000000
WantAdRevaluate = True
OpSys = "LINUX"
Arch = "X86_64"
ClassAdLifetime = 60
State = "Owner"
Activity = "Idle"
UpdateSequenceNumber = 1
wisent_GlobusQueue = "test"
wisent_DefaultWRFVariant = "19"
Most of the attributes are self-explanatory. The gatekeeper_url and job_manager_type attributes supply the values for the substitution macros in the job command file. The Requirements expression restricts matches to grid-universe jobs (JobUniverse == 9), ClassAdLifetime is the time in seconds after which the collector discards the ad unless it is refreshed, and WantAdRevaluate = True allows the same ad to be matched to further jobs within one negotiation cycle. The wisent_ attributes are examples of freely chosen, application-specific attributes.
To post the ClassAd manually, run:
condor_advertise UPDATE_STARTD_AD /path/to/classad_file.txt
In a production setup, the above command would have to be executed at regular intervals, and the content of the posted ClassAd file would have to be updated to reflect the Grid site's current state.
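A minimal advertiser loop might look as follows. This is only a sketch: the build_classad helper, the attribute subset, and the 30-second interval are our own choices (the values are copied from the example ad above); only condor_advertise itself is a real command.

```python
import subprocess
import tempfile
import time

def build_classad(seq, state="Owner", activity="Idle"):
    """Render the machine ClassAd with a fresh UpdateSequenceNumber."""
    return "\n".join([
        'MyType = "Machine"',
        'TargetType = "Job"',
        'Name = "srvgrid01.offis.uni-oldenburg.de"',
        'Machine = "srvgrid01.offis.uni-oldenburg.de"',
        'gatekeeper_url = "https://srvgrid01.offis.uni-oldenburg.de/wsrf/services/ManagedJobFactoryService"',
        'job_manager_type = "PBS"',
        'Requirements = (TARGET.JobUniverse == 9)',
        'ClassAdLifetime = 60',
        'State = "%s"' % state,
        'Activity = "%s"' % activity,
        'UpdateSequenceNumber = %d' % seq,
    ]) + "\n"

def advertise_forever(interval=30):
    """Repost the ad periodically so it never expires from the collector."""
    seq = 0
    while True:
        seq += 1
        with tempfile.NamedTemporaryFile("w", suffix=".txt") as f:
            f.write(build_classad(seq))
            f.flush()
            subprocess.run(["condor_advertise", "UPDATE_STARTD_AD", f.name])
        time.sleep(interval)
```

Note that the refresh interval must stay below ClassAdLifetime (60 seconds in the example); otherwise the ad expires from the collector between updates.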
Immediately after that, you should see a new machine appearing in the output of condor_status:
Name          OpSys Arch   State Activity LoadAv Mem  ActvtyTime
srvgrid01.off LINUX X86_64 Owner Idle     [???]  [??] [Unknown]
The "load average" and "memory" are not displayed because the corresponding attributes are missing from the machine ClassAd. These attributes do not make much sense for a GT4 head node managing an entire cluster. Here we can see a small "semantic mismatch" between the Condor command-line tools and the Grid: Condor traditionally expects machines to be real computers capable of accommodating a single job at a time, with an "activity" life cycle reflecting the job handling state. These assumptions simply do not hold for Grid sites consisting of multiple machines. However, the slightly confusing output should not worry us much. Submission of multiple jobs to the target Grid site is possible, as is keeping track of various Grid site attributes. For serious applications, the standard end-user command-line tools can be replaced by more sophisticated versions that understand and make good use of the custom attributes.
If the entry does not appear in condor_status, the security configuration of the pool is a likely culprit: the collector only accepts updates from sufficiently authenticated senders. The following settings worked for us:
SEC_DEFAULT_NEGOTIATION = REQUIRED
SEC_DEFAULT_AUTHENTICATION = REQUIRED
SEC_DEFAULT_ENCRYPTION = OPTIONAL
SEC_DEFAULT_INTEGRITY = OPTIONAL
SEC_DEFAULT_AUTHENTICATION_METHODS = FS
SEC_DEFAULT_CRYPTO_METHODS = 3DES, BLOWFISH
There is nothing special about submitting your job to a GT4 machine advertised by a ClassAd. Use the placeholders as shown above and submit with condor_submit as usual; during match-making, the $$() macros are replaced with the corresponding attribute values from the matched machine ClassAd.
Integration with Grid information systems
The match-making mechanism of Condor is not aware of Grid information systems such as Globus MDS. It is entirely up to you which attributes you insert into the machine ClassAds, and where that information comes from. A simple command-line tool for reading the contents of Globus MDS is wsrf-query. Unfortunately, based on our experience, the contents of Globus MDS are not very useful in making scheduling decisions. While it is quite easy to figure out the number of free nodes using Globus MDS, it is difficult to estimate when and whether the submitted job will start running in a situation where all CPUs are already occupied. Furthermore, it is impossible to determine from MDS how many of your own jobs are already queued at a particular Grid site (you may be able to find out with condor_q on your own submit host, though).
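To illustrate the condor_q route just mentioned, here is a hedged sketch that tallies our own queued grid jobs per target site. It assumes the standard GridResource job-ClassAd attribute and the -format option of condor_q; the function names are our own.

```python
import subprocess
from collections import Counter

def count_jobs_per_site(condor_q_output):
    """Tally jobs by their GridResource value, one value per line."""
    counts = Counter()
    for line in condor_q_output.splitlines():
        line = line.strip()
        if line:
            counts[line] += 1
    return counts

def query_own_jobs():
    """Ask condor_q for the GridResource attribute of every queued job."""
    out = subprocess.run(
        ["condor_q", "-format", "%s\n", "GridResource"],
        capture_output=True, text=True).stdout
    return count_jobs_per_site(out)
```

The resulting per-site counts could then be fed back into the advertised machine ClassAds, e.g. to lower the Rank of heavily loaded sites.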
An obvious difficulty in implementing machine ClassAds for describing dynamic properties of Grid resources lies in ensuring that the information published in the ClassAds is (at least approximately) up-to-date. This requires observing the state changes of the remote resource. Ideally, such changes should be delivered to the ClassAd publisher as soon as they occur, in order to become incorporated into the ClassAds. In reality, Globus MDS does not support such notifications. Likewise, Condor lacks APIs for subscribing to notifications about match-making events (for example, to count jobs assigned to a machine). Accordingly, the less-than-ideal solution of polling MDS and Condor to retrieve the required information must be used. Additionally, the Condor daemon and/or job logs can be monitored (and parsed) to observe relevant state changes.
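As an illustration of the log-parsing approach, the following sketch extracts job state changes from a Condor user log. The event codes (000 = submitted, 001 = executing, 005 = terminated) are part of the standard user-log format; everything else here is an assumption.

```python
import re

# Standard Condor user-log event codes we care about.
EVENT_NAMES = {"000": "submitted", "001": "executing", "005": "terminated"}

def parse_user_log(text):
    """Return (job_id, event_name) pairs from Condor user-log text.

    Each event starts with a line like:
        001 (023.000.000) 04/05 12:01:00 Job executing on host: ...
    """
    events = []
    for m in re.finditer(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\)", text, re.M):
        code, cluster, proc = m.group(1), m.group(2), m.group(3)
        if code in EVENT_NAMES:
            events.append(("%d.%d" % (int(cluster), int(proc)),
                           EVENT_NAMES[code]))
    return events
```

A polling daemon could combine these parsed events with periodic MDS queries and regenerate the advertised ClassAds accordingly.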
Additional links and information