Missing Platform LSF commands part 1: bwait

In the spirit of the POSIX wait command, Platform LSF should have a bwait command. It should work in a way similar to this:

$ bwait 7372 7378 7123 7812
Passed: 7812
Failed: 7372
Failed: 7123
Passed: 7378
$ echo $?

Let’s explain how this hypothetical command would work. As its name suggests, it waits for LSF jobs to finish (reach the EXIT or DONE state, see the job state diagram) and then returns with a meaningful exit status. The exit status would be zero if all the jobs succeeded, and non-zero if one or more failed. Note that POSIX wait returns the exit status of the last PID specified, whereas bwait should return the “worst” exit status across all the job IDs specified.
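To make the “worst exit status” rule concrete, here is a minimal sketch in Python. The function name worst_exit_status and the choice of 1 as the failure value are assumptions for illustration only; the idea above only requires "non-zero".

```python
from typing import Mapping


def worst_exit_status(final_states: Mapping[int, str]) -> int:
    """Return 0 if every job ended in DONE, 1 if at least one ended in EXIT.

    `final_states` maps a job ID to its final LSF state. The value 1 is an
    arbitrary stand-in for "non-zero" (an assumption, not an LSF convention).
    """
    status = 0
    for job_id, state in final_states.items():
        if state == "EXIT":
            status = 1  # one failure is enough to make the overall result a failure
        elif state != "DONE":
            raise ValueError(f"job {job_id} is not finished: {state}")
    return status


# The four jobs from the example session: two passed, two failed.
states = {7372: "EXIT", 7378: "DONE", 7123: "EXIT", 7812: "DONE"}
print(worst_exit_status(states))
```

Contrast this with POSIX wait, where only the last PID listed determines the result: here the order of the job IDs does not matter.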

Unlike the Unix wait, though, it would accept some options, such as --stop-at-n-failures=N and --timeout=M. This way, the user could limit the duration of the wait not only by time, but also by the number of failing jobs, because in many contexts there is no point in continuing once there are too many failures.

Another option I would have is --killall-on-stop. This would cause bwait to kill all unfinished jobs as soon as one of the termination conditions was met (timeout or number of failures). The point is to remove unfinished jobs from the compute cluster and free up CPUs. Again, in some contexts, when there are too many failures, there is no point in running more jobs, so we might as well terminate them.
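The interaction of these three options can be sketched as a small wait loop. This is only an illustration under stated assumptions: the function bwait and its parameters are hypothetical, the poll and kill callables are placeholders for whatever would query and terminate LSF jobs, and returning 1 on a timeout or failure limit is my own choice of "non-zero".

```python
import time
from typing import Callable, Dict, Iterable, List, Optional, Set


def bwait(job_ids: Iterable[int],
          poll: Callable[[Set[int]], Dict[int, str]],
          kill: Callable[[List[int]], None],
          max_failures: Optional[int] = None,   # stands in for --stop-at-n-failures=N
          timeout: Optional[float] = None,      # stands in for --timeout=M (seconds)
          killall_on_stop: bool = False,        # stands in for --killall-on-stop
          interval: float = 1.0,
          clock: Callable[[], float] = time.monotonic,
          sleep: Callable[[float], None] = time.sleep) -> int:
    """Wait for jobs to reach DONE or EXIT; return 0 only if all reached DONE."""
    pending = set(job_ids)
    failures = 0
    start = clock()
    while pending:
        states = poll(pending)  # hypothetical: maps job ID -> LSF state string
        for job_id in sorted(pending):
            state = states.get(job_id)
            if state == "DONE":
                pending.discard(job_id)
            elif state == "EXIT":
                pending.discard(job_id)
                failures += 1
        too_many = max_failures is not None and failures >= max_failures
        timed_out = timeout is not None and clock() - start >= timeout
        if pending and (too_many or timed_out):
            if killall_on_stop:
                kill(sorted(pending))  # free the CPUs held by unfinished jobs
            return 1
        if pending:
            sleep(interval)
    return 1 if failures else 0


# Demo with scripted states instead of a real cluster.
snapshots = iter([
    {7372: "EXIT", 7378: "RUN", 7123: "RUN", 7812: "DONE"},
    {7378: "RUN", 7123: "EXIT"},
])
rc = bwait([7372, 7378, 7123, 7812],
           poll=lambda pending: next(snapshots),
           kill=lambda ids: print("would kill", ids),
           max_failures=2, killall_on_stop=True,
           sleep=lambda seconds: None)
print("exit status:", rc)
```

Passing poll, kill, clock and sleep as arguments keeps the stopping logic separate from the cluster itself, so it can be exercised without LSF.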

As far as printing to stdout goes, in the unix tradition, nothing goes to stdout if all goes well. However, it would be tolerable for bwait to print a pass/fail string along with the job id number to stdout when a job finishes. Maybe this should only be enabled by a --verbose option.

Lastly, bwait needs to do something very important: it must wait for all jobs to close their I/O pipes before returning. Suppose bwait returns, but the jobs have not finished writing to their files. Then there would be a race condition for any subsequent processing of those files. The reason I am stating this requirement is that in the current implementation of LSF, when a job reaches the EXIT or DONE state, the I/O stream to the output file specified by bsub -oo file.log is closed long after the job is declared finished. This forces the user to “guess” when it is safe to start post-processing file.log, typically by using a sleep command without really knowing how long to wait.

Having a bwait command would be superior to polling job statuses with the bjobs command and then sleeping for some arbitrary, hard-to-guess amount of time.
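For comparison, the polling workaround looks roughly like the sketch below. It assumes the default bjobs column layout (JOBID, USER, STAT, ...) and that bjobs is given the job IDs as arguments. The helper names, the sample text, and the 30-second interval are made up for illustration.

```python
import subprocess
import time
from typing import Callable, Dict, List

FINISHED = {"DONE", "EXIT"}


def parse_bjobs(text: str) -> Dict[int, str]:
    """Extract {job ID: state} from bjobs output in the default column layout."""
    states = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and any line that does not start with a numeric job ID.
        if len(fields) >= 3 and fields[0].isdigit():
            states[int(fields[0])] = fields[2]
    return states


def run_bjobs(job_ids: List[int]) -> str:
    """Run bjobs for the given job IDs and return its standard output."""
    result = subprocess.run(["bjobs", *map(str, job_ids)],
                            capture_output=True, text=True)
    return result.stdout


def poll_until_finished(job_ids: List[int],
                        fetch: Callable[[List[int]], str] = run_bjobs,
                        interval: float = 30.0,
                        sleep: Callable[[float], None] = time.sleep) -> Dict[int, str]:
    """Poll until every job is DONE or EXIT, then return the final states."""
    while True:
        states = parse_bjobs(fetch(job_ids))
        if all(states.get(job_id) in FINISHED for job_id in job_ids):
            return states
        sleep(interval)


# Illustrative sample text, not captured from a real cluster.
SAMPLE = """\
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
7372    alice   EXIT  normal     hostA       hostB       sim1       Jan  1 10:00
7378    alice   DONE  normal     hostA       hostC       sim2       Jan  1 10:00
"""
print(parse_bjobs(SAMPLE))
```

Even when this loop returns, it says nothing about whether file.log has been fully written, which is exactly why an extra sleep of unknown length ends up being added after it.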

In part 2, I will discuss the other missing LSF command, btree.


3 Responses to Missing Platform LSF commands part 1: bwait

  1. Mike Page says:

    I came across this web page while searching for similar capabilities.

    The functionality I need is fulfilled by using the -K option on bsub. It locks the command line after submission of a single job, returning control to the user once the job has finished.

    • Martin d'Anjou says:

      If you need to wait for a single job, bsub -K is the right solution. In my case I need to submit multiple jobs at the same time without locking the terminal, and then wait for them.

  2. David C Black says:

    Here is a thought on how to kludge your requirements in. Suppose I want to run three jobs in parallel (A, B & C) and finish with a summary job (F).

    Launch 3 jobs that each in turn launch a single -K job, then wait on the -K jobs. You can set up a separate queue for the first level.
