Missing Platform LSF commands part 1: bwait

In the spirit of the POSIX wait command, Platform LSF should have a bwait command. It should work in a way similar to this:

$ bwait 7372 7378 7123 7812
Passed: 7812
Failed: 7372
Failed: 7123
Passed: 7378
$ echo $?
1

Here is how this hypothetical command would work. As its name suggests, it waits for LSF jobs to finish (that is, to reach the EXIT or DONE state, see the job state diagram) and then returns with a meaningful exit status. The exit status would be zero if all the jobs succeeded, and non-zero if one or more failed. Note that POSIX wait returns the exit status of the last PID specified, whereas bwait should return the “worst” exit status among all the job ids specified.
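To make the “worst exit status” rule concrete, here is a minimal Python sketch. The function name bwait_exit_status is hypothetical, and it assumes that “worst” means the largest numeric exit status, which is only one possible reading of the rule above.

```python
from typing import Mapping


def bwait_exit_status(job_statuses: Mapping[int, int]) -> int:
    """Combine per-job exit statuses into one status for bwait.

    Assumption: exit statuses are non-negative and "worst" means the
    largest value. Returns 0 only if every job returned 0 (or if no
    job ids were given at all).
    """
    return max(job_statuses.values(), default=0)


if __name__ == "__main__":
    # Job ids from the example session; the statuses are illustrative.
    statuses = {7812: 0, 7372: 1, 7123: 1, 7378: 0}
    print(bwait_exit_status(statuses))
```

With this rule a single failing job is enough to make the whole wait fail, no matter where its id appears on the command line.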

Unlike the Unix wait, though, it would accept options such as --stop-at-n-failures=N and --timeout=M. This way, the user could limit the wait not only by elapsed time but also by the number of failing jobs, because in many contexts there is no point in continuing once there are too many failures.
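The two stop conditions could be combined in a single wait loop, roughly as in the Python sketch below. Everything here is hypothetical: wait_for_jobs, its parameters and the returned reason strings are made up for illustration, and poll stands for any function that maps job ids to LSF states such as RUN, DONE or EXIT.

```python
import time
from typing import Callable, Dict, Iterable, Optional, Set, Tuple


def wait_for_jobs(
    job_ids: Iterable[int],
    poll: Callable[[Set[int]], Dict[int, str]],
    stop_at_n_failures: Optional[int] = None,
    timeout: Optional[float] = None,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, Set[int], Set[int]]:
    """Wait until all jobs finish, too many fail, or the timeout expires.

    Returns (reason, failed_ids, unfinished_ids), where reason is one of
    "all-finished", "too-many-failures" or "timeout".
    """
    pending = set(job_ids)
    failed: Set[int] = set()
    deadline = None if timeout is None else clock() + timeout
    reason = "all-finished"
    while pending:
        states = poll(pending)
        for job_id in sorted(pending):  # sorted() copies, so we can mutate
            state = states.get(job_id)
            if state == "DONE":
                pending.discard(job_id)
            elif state == "EXIT":
                pending.discard(job_id)
                failed.add(job_id)
        if not pending:
            break
        if stop_at_n_failures is not None and len(failed) >= stop_at_n_failures:
            reason = "too-many-failures"
            break
        if deadline is not None and clock() >= deadline:
            reason = "timeout"
            break
        sleep(interval)
    return reason, failed, pending
```

Passing poll, clock and sleep as parameters keeps the loop independent of LSF, so the stop logic can be exercised without a cluster.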

Another option I would have is --killall-on-stop. This would cause bwait to kill all unfinished jobs as soon as one of the termination conditions is met (timeout or number of failures). The point is to remove unfinished jobs from the compute cluster and free their CPUs. Again, in some contexts, when there are too many failures, there is no point in running more jobs, so we might as well terminate them.
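The cleanup step itself would be small. The Python sketch below assumes the standard bkill command is on the PATH and is given the set of job ids that have not finished yet; the helper names bkill_command and kill_unfinished are hypothetical.

```python
import subprocess
from typing import Callable, Iterable, List


def bkill_command(unfinished: Iterable[int]) -> List[str]:
    """Build a single bkill invocation covering every unfinished job id."""
    return ["bkill", *(str(job_id) for job_id in sorted(unfinished))]


def kill_unfinished(unfinished: Iterable[int],
                    run: Callable = subprocess.run):
    """Kill the unfinished jobs; do nothing if there are none.

    `run` is injectable so the function can be tried without LSF.
    """
    ids = list(unfinished)
    if not ids:
        return None
    # check=False: a job may finish between the last poll and the kill,
    # so a non-zero bkill status is not treated as fatal in this sketch.
    return run(bkill_command(ids), check=False)
```

Issuing one bkill for all the ids, rather than one call per job, keeps the load on the LSF daemons low when many jobs are left.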

As far as printing to stdout goes, in the unix tradition, nothing goes to stdout if all goes well. However, it would be tolerable for bwait to print a pass/fail string along with the job id number to stdout when a job finishes. Maybe this should only be enabled by a --verbose option.

Lastly, bwait needs to do something very important: it must wait for all jobs to close their I/O pipes before returning. Suppose bwait returned while the jobs were still writing to their files. Any subsequent processing of those files would then be subject to a race condition. I am stating this requirement because, in the current implementation of LSF, the I/O stream to the output file specified with bsub -oo file.log is closed long after the job reaches the EXIT or DONE state and is declared finished. This forces the user to “guess” when it is safe to start post-processing file.log, typically by inserting a sleep command without really knowing how long to wait.

Having a bwait command would be superior to polling job statuses with the bjobs command and then sleeping for some arbitrary, hard-to-guess amount of time.
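For comparison, the polling workaround looks roughly like the Python sketch below. It assumes an LSF version whose bjobs supports the -noheader and -o options and prints one "jobid stat" pair per line, separated by whitespace; the exact output format should be checked against your installation. The function names are hypothetical.

```python
import subprocess
from typing import Dict, Iterable


def parse_bjobs_states(output: str) -> Dict[int, str]:
    """Parse lines of the assumed form '<jobid> <STAT>' into a dict.

    Lines that do not start with a plain numeric job id (blank lines,
    error messages, job array elements) are ignored.
    """
    states: Dict[int, str] = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].isdigit():
            states[int(fields[0])] = fields[1]
    return states


def poll_states(job_ids: Iterable[int]) -> Dict[int, str]:
    """Ask bjobs for the current state of each job id (requires LSF)."""
    cmd = ["bjobs", "-noheader", "-o", "jobid stat",
           *(str(job_id) for job_id in job_ids)]
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return parse_bjobs_states(result.stdout)
```

Even when the polling itself works, a DONE or EXIT state from bjobs says nothing about whether the job's output file has been closed, which is exactly why the trailing sleep is still needed.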

In part 2, I will discuss the other missing LSF command, btree.