June 10, 2009
In the spirit of the POSIX
wait command, Platform LSF should have a
bwait command. It should work in a way similar to this:
    $ bwait 7372 7378 7123 7812
    Passed: 7812
    Failed: 7372
    Failed: 7123
    Passed: 7378
    $ echo $?
    1
Let’s explain how this hypothetical command would work. As its name suggests, it waits for LSF jobs to finish (that is, to reach the EXIT or DONE state; see the job state diagram) and then returns with a proper exit status. The exit status would be zero if all the jobs succeeded, and non-zero if one or more failed. Note that POSIX
wait returns the exit status of the last PID specified, but
bwait should return the “worst” exit status of all the job ids specified.
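To make the exit-status rule concrete, here is a minimal Python sketch. The function names are made up for illustration, and treating the "worst" status as the highest exit code is my assumption; the only firm requirement is zero when everything passed and non-zero otherwise.

```python
# Hypothetical sketch: names and the "worst = highest code" rule are assumptions.
def worst_exit_status(exit_codes):
    """Return 0 if every job succeeded (or none were given), else the largest code."""
    return max(exit_codes, default=0)


def report(results):
    """Build 'Passed: <id>' / 'Failed: <id>' lines from {job_id: exit_code}."""
    return [
        f"{'Passed' if code == 0 else 'Failed'}: {job_id}"
        for job_id, code in results.items()
    ]


# Illustrative exit codes for the four job ids used in the example session.
results = {7812: 0, 7372: 1, 7123: 1, 7378: 0}
print("\n".join(report(results)))
print(worst_exit_status(results.values()))
```

With these made-up exit codes the script prints the same four Passed/Failed lines as the session above, followed by 1.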
Unlike the Unix
wait, though, it would accept some options, like
--timeout=M and a limit on the number of failed jobs. This way, the user could bound the wait not only by elapsed time but also by the number of failing jobs, because in many contexts there is no point in continuing when there are too many failures.
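The two stop conditions can be sketched as a polling loop in Python. Everything here is hypothetical: `wait_for_jobs`, `max_failures` and `poll_interval` are names I made up, and `get_state` stands in for whatever would actually query LSF. The clock and sleep functions are parameters only so the loop can be tried without a cluster.

```python
import time

FINISHED = {"DONE", "EXIT"}  # the two terminal LSF states named above


def wait_for_jobs(job_ids, get_state, timeout=None, max_failures=None,
                  poll_interval=5.0, clock=time.monotonic, sleep=time.sleep):
    """Poll until all jobs finish or a stop condition is met.

    get_state(job_id) is a caller-supplied function returning a state string
    such as "RUN", "DONE" or "EXIT". max_failures, if given, should be >= 1.
    Returns (reason, states) where reason is "finished", "too_many_failures"
    or "timeout".
    """
    deadline = None if timeout is None else clock() + timeout
    states = {}
    while True:
        # Only re-query jobs that have not reached a terminal state yet.
        for job_id in job_ids:
            if states.get(job_id) not in FINISHED:
                states[job_id] = get_state(job_id)
        if all(s in FINISHED for s in states.values()):
            return "finished", states
        failures = sum(1 for s in states.values() if s == "EXIT")
        if max_failures is not None and failures >= max_failures:
            return "too_many_failures", states
        if deadline is not None and clock() >= deadline:
            return "timeout", states
        sleep(poll_interval)


# Scripted states instead of a real cluster, with sleeping disabled.
script = {101: iter(["RUN", "DONE"]), 102: iter(["RUN", "RUN", "EXIT"])}
reason, states = wait_for_jobs([101, 102], lambda j: next(script[j]),
                               sleep=lambda _: None)
print(reason, states)
```

A real bwait would presumably be event-driven inside LSF rather than polling, which is part of why I want it as a built-in command.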
Another option I would have is
--killall-on-stop. This would cause
bwait to kill all unfinished jobs as soon as one of the termination conditions is met (timeout or number of failures). The point is that unfinished jobs would be removed from the compute cluster to free CPUs. Again, in some contexts, when there are too many failures, there is no point in running more jobs, so we might as well terminate them.
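As a rough illustration of the kill-on-stop behaviour, the Python sketch below picks out the jobs that are not yet DONE or EXIT and hands them to the existing `bkill` command in a single call. The function names are hypothetical, and the `run` parameter exists only so the logic can be exercised without an LSF installation.

```python
import subprocess

FINISHED = {"DONE", "EXIT"}  # terminal states; anything else is still in the cluster


def unfinished_jobs(states):
    """Return the job ids whose state is not terminal, in sorted order."""
    return sorted(j for j, s in states.items() if s not in FINISHED)


def kill_unfinished(states, run=subprocess.run):
    """Ask LSF to kill every unfinished job with one bkill invocation.

    `run` is injectable so the selection logic can be tested without a cluster.
    Returns the list of job ids that were passed to bkill.
    """
    job_ids = unfinished_jobs(states)
    if job_ids:
        run(["bkill", *map(str, job_ids)], check=False)
    return job_ids


# Record the command instead of executing it; the states are illustrative.
calls = []
killed = kill_unfinished(
    {7372: "EXIT", 7123: "RUN", 7812: "PEND", 7378: "DONE"},
    run=lambda cmd, **kwargs: calls.append(cmd),
)
print(killed, calls)
```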
As far as printing to stdout goes, in the unix tradition, nothing goes to stdout if all goes well. However, it would be tolerable for
bwait to print a pass/fail string along with the job id number to stdout when a job finishes. Maybe this should only be enabled by an option.
bwait needs to do something very important: it must wait for all jobs to close their I/O pipes before returning. Suppose
bwait returns, but the jobs are not finished writing to files. Then there would be a race condition for any subsequent processing of those files. The reason I am stating this requirement is that in the current implementation of LSF, when a job reaches the EXIT or DONE state, the I/O stream to the output file specified by the
bsub -oo file.log is closed long after the job is declared finished, forcing the user to “guess” when it is safe to start post-processing the
file.log file by using a
sleep command without really knowing how much time to wait.
A bwait command would be superior to polling job statuses with the
bjobs command followed by a
sleep of some hard-to-guess, arbitrary duration.
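For comparison, the polling workaround usually starts by scraping `bjobs` output. The Python sketch below assumes the default column layout (JOBID, USER, STAT, ...) and uses made-up function names and sample data; a site with a customised output format would need a different parser.

```python
import subprocess


def parse_bjobs(text):
    """Map job id -> STAT from default-format `bjobs` output."""
    states = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and anything that does not start with a numeric id.
        if len(fields) >= 3 and fields[0].isdigit():
            states[int(fields[0])] = fields[2]
    return states


def query_states(job_ids, run=subprocess.run):
    """Run `bjobs <ids...>` and parse its stdout (needs an LSF environment)."""
    proc = run(["bjobs", *map(str, job_ids)], capture_output=True, text=True)
    return parse_bjobs(proc.stdout)


# Illustrative output only; user, queue and host names are invented.
sample = """\
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
7372    alice   EXIT  normal     hostA       hostB       sim1       Jun 10 09:12
7378    alice   RUN   normal     hostA       hostC       sim2       Jun 10 09:13
"""
print(parse_bjobs(sample))
```

Even with the states in hand, the script still has no way to know when the output files are complete, which is exactly the gap described above.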
In part 2, I will discuss the other missing LSF command,