I recently wrote an article published on AgileSOC titled
A Continuous Integration System For ASIC Development.
Hope you enjoy it!
→ 2 CommentsCategories: asic verification
Tagged: continuous integration
In a previous article, I discussed a hypothetical Platform LSF command, bwait, which would wait for LSF jobs to complete before returning. Another command that I would like to see in the Platform LSF suite is one that tracks parent-child relationships between jobs. Linux has a pstree command; LSF could have a btree command.
A simple scenario for submitting hierarchical LSF jobs would be for launching regressions. You may want to have a master job representing a regression responsible for launching multiple children jobs: the test cases. In this case, a command tracking the parent-child relationship between LSF jobs would be useful:
$ bsub my_regression
Job <56478> is submitted to default queue.
# Some time later
$ btree 56478
56478
56479
56481
56482
...
$
This command would be able to trace a child job back to the parent:
$ btree --find-parent 56481
56478
56479
56481
56482
...
$
Another application for this command is for hierarchical builds, where a parent job dispatches multiple smaller build jobs. For instance, if you describe your hierarchical build in GNU Make and in sub-makes, you are letting GNU Make decide what needs to be built. If you want to know what GNU Make has launched behind the scenes, you need to track the hierarchy of jobs. Again, a btree command becomes useful:
$ bsub make large_build
Job <1234> is submitted to default queue.
# And some time later:
$ btree 1234
1234
1235
1236
1239
...
$
The btree command would be able to report the tree of jobs long after the jobs are done. It could also do all kinds of fancy stuff, like printing log file names and job status along with the job ids, so you could quickly review the status of a large set of jobs by holding on to a single job id.
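Until something like btree exists, the bookkeeping can be approximated by hand. The sketch below is only an illustration and makes a big assumption: that each job appends a "child parent" line to a shared file when it submits a child (inside an LSF job the parent id could come from the LSB_JOBID environment variable). The print_tree function and the file format are hypothetical, not part of LSF.

```shell
#!/bin/bash
# print_tree FILE ROOT [INDENT]
# FILE holds one "child parent" pair per line (hypothetical format).
# Prints ROOT, then every descendant, indented by depth.
print_tree() {
    local file=$1 root=$2 indent=${3:-} child parent
    printf '%s%s\n' "$indent" "$root"
    while read -r child parent; do
        if [[ $parent == "$root" ]]; then
            # Recurse into each direct child with a deeper indent.
            print_tree "$file" "$child" "$indent  "
        fi
    done < "$file"
}
```

Calling print_tree jobs.txt 56478 would then play the role of btree 56478 for whatever relationships were recorded in jobs.txt.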
→ Leave a CommentCategories: asic verification
In the spirit of the POSIX wait command, Platform LSF should have a bwait command. It should work in a way similar to this:
$ bwait 7372 7378 7123 7812
Passed: 7812
Failed: 7372
Failed: 7123
Passed: 7378
$ echo $?
1
Let’s explain how this hypothetical command would work. As its name suggests, it is a command that waits for LSF jobs to be finished (EXIT or DONE, see the job state diagram) and then returns with a proper exit status. The exit status would be zero if all the jobs succeeded, and non-zero if one or more failed. Note that POSIX wait returns the exit status of the last PID specified, but bwait should return the “worst” exit status of all the job ids specified.
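For ordinary local processes, the "worst exit status" behavior can be sketched with the shell's own wait builtin. The wait_all function below is a hypothetical helper written for this illustration; it waits for each PID in turn and keeps the largest exit status it sees.

```shell
#!/bin/bash
# wait_all PID...: wait for every PID and return the largest exit status.
wait_all() {
    local worst=0 status pid
    for pid in "$@"; do
        status=0
        wait "$pid" || status=$?
        if (( status > worst )); then
            worst=$status
        fi
    done
    return "$worst"
}
```

A script could start several background commands, collect each $! into a list, and call wait_all on that list to get a single pass/fail answer.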
Unlike the unix wait though, it would accept some options, like --stop-at-n-failures=N and --timeout=M. This way, the user could limit the duration of the wait not only by absolute time, but also by number of failing jobs, because in many contexts, there is no point in continuing when there are too many failures.
Another option I would have is --killall-on-stop. This would cause bwait to kill all unfinished jobs as soon as one of the termination conditions was met (timeout or number of failures). The use for this is that unfinished jobs would be removed from the compute cluster to free CPUs. Again, in some contexts, when there are too many failures, there is no point in running more jobs, so you might as well terminate them.
As far as printing to stdout goes, in the unix tradition, nothing goes to stdout if all goes well. However, it would be tolerable for bwait to print a pass/fail string along with the job id number to stdout when a job finishes. Maybe this should only be enabled by a --verbose option.
Lastly, bwait needs to do something very important: it must wait for all jobs to close their I/O pipes before returning. Suppose bwait returns, but the jobs are not finished writing to files. Then there would be a race condition for any subsequent processing of those files. The reason I am stating this requirement is that in the current implementation of LSF, when a job reaches the EXIT or DONE state, the I/O stream to the output file specified by bsub -oo file.log is closed long after the job is declared finished, forcing the user to “guess” when it’s the right time to start post-processing file.log by using a sleep command without really knowing how long to wait.
Having a bwait command would be superior to polling job statuses with the bjobs command followed by a sleep of some hard-to-guess, arbitrary duration.
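For comparison, here is roughly what the polling workaround looks like today. This is a minimal sketch, not a real bwait: bwait_sketch and job_state are hypothetical names, the BWAIT_INTERVAL variable is made up for the example, and job_state assumes that the default bjobs output carries the job state in its third column (STAT), which you should check against your LSF version. It has none of the options discussed above and does nothing about the I/O race.

```shell
#!/bin/bash
# job_state JOBID: print the job state (DONE, EXIT, RUN, PEND, ...).
# Assumption: default bjobs output, header on line 1, STAT in column 3.
job_state() {
    bjobs "$1" 2>/dev/null | awk 'NR == 2 { print $3 }'
}

# bwait_sketch JOBID...: poll until every job is DONE or EXIT.
# Returns 0 if all jobs are DONE, 1 if at least one is EXIT.
bwait_sketch() {
    local interval=${BWAIT_INTERVAL:-5} failed=0 pending="$*" still id
    while [ -n "$pending" ]; do
        still=""
        for id in $pending; do
            case $(job_state "$id") in
                DONE) echo "Passed: $id" ;;
                EXIT) echo "Failed: $id"; failed=1 ;;
                *)    still="$still $id" ;;   # not finished yet
            esac
        done
        pending=$still
        if [ -n "$pending" ]; then
            sleep "$interval"
        fi
    done
    return "$failed"
}
```

Keeping the state lookup in its own function makes it easy to swap in a different query, or a fake one for trying the loop without a cluster.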
In part 2, I will discuss the other missing LSF command, btree.
→ Leave a CommentCategories: asic verification
In software design, a continuous integration system is a system for compiling and testing code on a continuous basis. When a build fails, the code is rejected and the owner notified. When the code is good (compiles okay and sanity tests pass), the code is accepted and is made available for distributing to the development team.
There are many aspects to a continuous integration system. A hypothetical buildbot could implement the following phases:
This needs to be as independent as possible from the source code management tool, but hooks are necessary for the buildbot to extract code from it, and mark the source code management database with some form of tag when code is accepted after it passes the tests. I mentioned sending pass/fail notifications to owners, but I won’t cover it here (most likely, sendmail is your friend).
At the heart of the buildbot is the build system, whose purpose is to build (compile) the code. A build system based on GNU Make can take advantage of GNU Make parallel prerequisite build capabilities to speed up the build. By default GNU Make will attempt to build as much of your code as possible. That is, parallel make will make all the targets it can except those whose dependencies have failed to build. While this might be desirable when you incrementally compile code for yourself, this behavior is detrimental in the case where a buildbot tries to process independent build requests from a submission queue. The buildbot simply needs to abort at the first failure, notify the owner, clean up, and move on to the next request.
So we want GNU Make to abort at the first failure when we do make -j N. There is nothing in GNU Make allowing that. However, we can edit job.c and add a call to fatal_error_signal(SIGTERM), so that GNU Make will abort, as if we had hit control-c on the command line, at the first error it encounters:
/* GNU Make 3.81, job.c, near line 500 */
if (err && block)
  {
    static int printed = 0;
    fflush (stdout);
    if (!printed)
      {
        error (NILF, _("*** Waiting for unfinished jobs...."));
        fatal_error_signal (SIGTERM);  /* added: abort as if interrupted */
      }
    printed = 1;
  }
Save, compile, and try this modification with a simple makefile:
all: t1 t2
t1:
	sleep 60 && echo done || echo failed
t2:
	sleep 2 && exit 1
The long compile time is emulated with sleep 60 and the error with a simple exit 1 statement. Now run this with the modified make:
$ make -j 2
sleep 60 && echo done || echo failed
sleep 2 && exit 1
make: *** [t2] Error 1
make: *** Waiting for unfinished jobs....
make: *** [t1] Terminated
Terminated
$ ps -ef | grep sleep
martin 22343 1 0 18:36 pts/5 00:00:00 sleep 60
That did not go as expected. GNU Make did exit, but the sleep 60 remains on the system! If we were using command substitution, we would want the command to return at the first error, but it does not return until the sleep 60 is done:
$ time t=$(make -j 2)
make: *** [t2] Error 1
make: *** Waiting for unfinished jobs....
make: *** [t1] Terminated
real 1m0.007s
...
$
Clearly, something is wrong: make returned to the prompt before sleep 60 finished, but the command substitution did not return the same way; it waited until the end.
Changing the makefile to the following alters the behavior:
all: t1 t2
t1:
	sleep 60
t2:
	sleep 2 && exit 1
Notice how I simplify the command in the t1 target: sleep 60 is the only command in the chain. The outcome is better:
$ make -j 2
sleep 60
sleep 2 && exit 1
make: *** [t2] Error 1
make: *** Waiting for unfinished jobs....
make: *** [t1] Terminated
Terminated
$ ps -ef | grep sleep
$
The first observation is that the sleep 60 is killed, and, not shown but tried, command substitution returns as soon as the first target fails. Now this is annoying. How can I know which commands will be terminated properly and which ones won’t?
Paul D. Smith explains that GNU Make has two paths for running commands: simple and complex. A complex command contains things like && and ||, in which case GNU Make opens a shell to run the command and it is that shell that receives SIGTERM. Armed with this information I head to #bash where I learn that I simply need to add a trap to the complex commands, so I rewrite the makefile like so:
all: t1 t2
t1:
	trap 'kill $$(jobs -p)' EXIT; sleep 60 && echo done || echo failed
t2:
	trap 'kill $$(jobs -p)' EXIT; sleep 2 && exit 1
And now make aborts at the first error, command substitution does not hang and no process is left lingering.
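The same idea can be packaged outside the makefile. The sketch below is a different technique from the EXIT trap above, so take it as an alternative under stated assumptions rather than what #bash suggested: it starts the long command in the background, traps TERM and INT, and forwards a kill to the child so nothing is left lingering. The run_killable name is hypothetical.

```shell
#!/bin/bash
# run_killable CMD [ARG...]: run CMD so that a TERM or INT sent to this
# shell also terminates CMD instead of leaving it behind.
run_killable() {
    "$@" &
    local child=$! status=0
    trap 'kill "$child" 2>/dev/null' TERM INT
    # wait returns early (status > 128) when a trapped signal arrives.
    wait "$child" || status=$?
    # If the first wait was interrupted, this reaps the killed child.
    wait "$child" 2>/dev/null || true
    trap - TERM INT
    return "$status"
}
```

A recipe line could then read run_killable sleep 60, assuming the function is made available to the shell that make starts, for instance by putting it in a small wrapper script.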
→ Leave a CommentCategories: asic verification · software development
Tagged: bash, buildbot, continuous integration, gnu-make
A way to solve this problem, which I presented in a previous post, was found with help from the gurus on #bash. The easy way to make this work is to save the arguments to a file, one per line, and read them back at the other end. Using the bash array syntax "${a[@]}", which quotes every element separately, the original arguments can be reconstructed at the remote end with the same separation they had at the origin.
Here is the code that does just that:
$ cat ssh-wrapper
#!/bin/bash
tmp=$(mktemp -p $(pwd))
for i; do
    echo $i >>$tmp
done
ssh -t localhost "$(pwd)/remote_script" "$(pwd)" "./prog" "$tmp"
rm -f $tmp
$ cat remote-script
declare -a a
declare i
while read; do
    echo $REPLY
    a[i++]="$REPLY"
done <$3
cd $1 && $2 "${a[@]}"
$ cat prog
#!/bin/bash
eval "$@"
# This program traps INT and queries the user for input
function process_int {
    echo -n "INT> "
    read LINE
    echo $LINE
    exit 2
}
trap process_int SIGINT
sleep 30
$
Now it works as expected for both types of end user quoting:
# Double quotes quoting
$ ./ssh-wrapper "FOO='a b' && echo ' hi' && echo $FOO"
 hi
^CINT> it works!
it works!
Connection to localhost closed.
# Single quotes quoting
$ ./ssh-wrapper 'FOO="a b" && echo " hi" && echo $FOO'
 hi
a b
^CINT> it works too!
it works too!
$ Connection to localhost closed.
If the two hosts were on different file systems, scp could be used to send the arguments over to a file on the remote host. You could even send environment information this way.
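The round trip through a file can be tried locally, without ssh, to see that the argument boundaries survive. This is a small sketch with hypothetical helper names (save_args, load_args, and the restored array); it uses printf and read -r so that backslashes and leading spaces are kept as-is. Arguments containing newlines would still be split, which is a limitation of the one-per-line format.

```shell
#!/bin/bash
# save_args FILE ARG...: write each argument on its own line.
save_args() {
    local file=$1 arg
    shift
    : > "$file"
    for arg in "$@"; do
        printf '%s\n' "$arg" >> "$file"
    done
}

# load_args FILE: read the lines back into the global array "restored".
load_args() {
    local line
    restored=()
    while IFS= read -r line; do
        restored+=("$line")
    done < "$1"
}
```

After load_args "$tmp", running ./prog "${restored[@]}" hands prog exactly the arguments that were saved, one array element per original argument.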
→ Leave a CommentCategories: asic verification
Tagged: bash, ssh
Note: This has been solved. Please see the answer here.
Wrapping ssh inside a script and passing it arbitrary commands is hard. What I am trying to do here is to come up with a generic way to invoke an ssh-wrapper, have it record the $(pwd), perform an ssh to a remote host, and at the remote host, reposition inside $(pwd), and run an arbitrary, possibly interactive, program.
Let’s first start with an ssh-wrapper
$ cat ssh-wrapper
#!/bin/bash
ssh -t localhost "$(pwd)/remote_script '$*'"
Then we define the remote-script
$ cat remote-script
#!/bin/bash
eval $*
Now let’s try some commands, and compare with what bash -c does:
$ ssh-wrapper "echo ' hi'"
hi
$ bash -c "echo ' hi'"
 hi
Notice that bash -c correctly renders the space preceding hi, but ssh-wrapper does not. This indicates potential future problems. Let’s try another case:
$ ssh-wrapper "echo hi && echo hello"
hi
hello
$ bash -c "echo hi && echo hello"
hi
hello
Nothing wrong here. Let’s try another one:
$ bash -c "FOO='a b' && echo ' hi' && echo $FOO"
 hi
$ ssh-wrapper "FOO='a b' && echo ' hi' && echo $FOO"
remote_script: line 4: b: command not found
It starts to hurt here. The bash -c does the correct thing ($FOO is expanded before the value is set, which is correct), but in the ssh-wrapper, things go wrong. However, a different quoting produces a better outcome:
$ ssh-wrapper 'FOO="a b" && echo " hi" && echo $FOO'
 hi
a b
$ bash -c 'FOO="a b" && echo " hi" && echo $FOO'
 hi
a b
This works better, and even the space preceding “hi” is present. However, we can see that ssh-wrapper is not very end user proof. And end users are non-linear: give them that script, and they will break it right away.
Let’s modify the original scripts to position the remote execution in the correct path and add a potentially interactive program:
$ cat ssh-wrapper
#!/bin/bash
ssh -t localhost "$(pwd)/remote_script $(pwd) ./prog '$*'"
$ cat remote-script
#!/bin/bash
cd $1 && $2 $3
$ cat prog
#!/bin/bash
eval "$@"
# This program traps INT and queries the user for input
function process_int {
    echo -n "INT> "
    read LINE
    echo $LINE
    exit 2
}
trap process_int SIGINT
sleep 30
The difference here is that the remote-script will place us in the path where the ssh-wrapper was originally called before calling ./prog. Now we try our command again:
$ ssh-wrapper 'FOO="a b" && echo " hi" && echo $FOO'
 hi
a b
^CINT> it works!
it works!
$
You need to hit ctrl-c to get out here, then type some arbitrary string. It works, but is far from being robust: the way to make it fail is by changing the quoting:
$ ssh-wrapper "FOO='a b' && echo ' hi' && echo $FOO"
Executing FOO=a
^CINT> ^CConnection to localhost closed.
$
We are no better off. The prog did not interpret the whole command, only FOO=a!
This is annoying, because I may want to allow variables to be expanded on the original command line, and hence it would be very nice to allow users to use double-quotes in their argument to ssh-wrapper. In fact, it is necessary to allow either type of quoting to work: bash -c works with both, and so should ssh-wrapper.
Let’s change remote-script to pass all arguments after $2:
$ cat remote-script
#!/bin/bash
cd $1 && $2 ${@:3}
And we try again:
$ ssh-wrapper "FOO='a b' && echo ' hi' && echo $FOO"
./prog: line 6: b: command not found
^CINT> ^CConnection to localhost closed.
This time, the quoting around the variable assignment is gone, so b is interpreted as a command! At this point I have run out of ideas on how to make this work.
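One ingredient of the problem can be isolated without ssh: an unquoted $* or ${@:3} is split again on whitespace, while the quoted "$@" form keeps each argument in one piece. The helper names below (count_args, forward_unquoted, forward_quoted) are made up for this illustration, and it only covers the local word splitting, not what ssh does to the command string on the way over.

```shell
#!/bin/bash
# count_args: print how many arguments were received.
count_args() { echo "$#"; }

# Forward everything after the first argument, without and with quotes.
forward_unquoted() { count_args ${@:2}; }
forward_quoted()   { count_args "${@:2}"; }

# One user argument goes in after the program name:
#   forward_unquoted ./prog "FOO='a b' && echo hi"   sees 5 words
#   forward_quoted   ./prog "FOO='a b' && echo hi"   sees 1 argument
```

In the unquoted case the assignment arrives as the separate words FOO='a and b', which is the same kind of breakage as the "b: command not found" error above.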
This has been solved. Please see the answer here.
→ Leave a CommentCategories: asic verification
Tagged: bash, ssh
Environment modules is a nice software environment that lets you switch verilog simulator versions very easily without having to edit your environment (.cshrc or other). But it seems to have some drawbacks. One is that it does not return a proper exit status when it fails, making it hard to write scripts that fail when they should. The other is that it does not play well with ssh, or so it seems.
When I hit ctrl-c on VCS at runtime, it drops to the command line interface, aka the verilog CLI. When I try to make this work with ssh and environment modules, I get a nasty surprise: environment modules interferes somewhere.
Let’s first emulate the verilog command line with a simple script, since you don’t need an expensive verilog simulator to see the problem:
$ cat prog
#!/bin/bash
# This program traps INT and queries the user for input
function process_int {
    echo -n "cli> "
    read LINE
    echo to stdout $LINE
    echo to stderr $LINE >&2
    exit 2
}
trap process_int SIGINT
echo test stdout
echo test stderr >&2
sleep 30
$
Next, we want to run this script through ssh and environment modules, hit ctrl-c and be dropped to the cli> prompt:
$ ssh -t localhost 'tcsh -c "source Modules/init/tcsh && module list && ./prog"'
Currently Loaded Modulefiles:
1) make/3.81a
test stdout
test stderr
Connection to localhost closed.
$
Well, this did not drop to the command line interface. Let’s try without module list:
$ ssh -t localhost 'tcsh -c "source Modules/init/tcsh && ./prog"'
test stdout
test stderr
cli> t
to stdout t
to stderr t
Connection to localhost closed.
$
This time, it let me type something at the cli> prompt. Okay, so there is some interaction between ssh and modules. I’ve struggled with this for a few days and it is getting annoying. The same commands work with rsh, but rsh does not return a proper exit status, so I really do want to use ssh.
→ Leave a CommentCategories: asic verification
This is the best rant on the demise of the goto statement I have ever heard. It is from the Tango conference 2008 – Fibers talk by Mikola Lysenko. If you fast forward to 23 min 05 secs, you will hear this:
One way you can think about states [in a state machine] is that they’re kind of a label. And you put this label here for where you want the code to goto after you’re done with this other, sort of, state. A state machine is kind of an indirect goto and so these states and switch statements are just like nested gotos within gotos.
Now if you use coroutines, you can then use structured programming to represent the states. I mean this is a debate that was you know, played out years ago in a more limited context of structured programming versus gotos and ultimately, structured programming won out.
Nowadays I mean if you go into entry level programming course, they’ll just fail you on your projects if you even use a goto statement because those goto statements are that toxic to the integrity of programming code. You’re much better off using structured programming. And yet despite that, people still advocate using these state machines which are basically a really indirect horrible obfuscated goto mess, just split across multiple source files and larger projects.
So it’s like if you make the goto sin big enough, then no one is going to call you on how bad it is! But the thing is if you use coroutines, not a problem! So why have we been using this all along? Oh once again I’d probably appeal to the fact that, ah, you know, ignorance, right, people just don’t know about coroutines.
Now I want my goto back… er… coroutines!
→ 1 CommentCategories: d language · software development
If you use GNU Make in your verification environment, maybe you have dreamed of typing make -j 120. It turns out it is possible, and it is a very interesting use of the GNU Make SHELL variable. You know that make -j N causes GNU make to spawn the build of N prerequisites in parallel. In the code below, prereq1, prereq2 and prereq3 would be built at the same time if -j 3 were used:
SHELL=./myshell
prereq1:
	very_long_processing
prereq2:
	more_very_long_processing
prereq3:
	more_and_more_processing
target: prereq1 prereq2 prereq3
	very_long_compile
This may look like nothing, but it gets more interesting when you define myshell as follows, as explained on the GNU mailing list:
#!/bin/bash
ssh `avail_host` "cd $PWD || exit; $2"
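Now just type make -j 3 target and your build is distributed to 3 available hosts, as returned by the avail_host script (writing this script is an exercise left to the reader ;-)). The target has to be named on the command line because it is not the first rule in the makefile above.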
Shortly after coming up with this, a bit of searching revealed that this was nothing new really. The article Distributed processing by make explains that by using the GNU Make SHELL variable, the number of jobs you can dispatch with make is no longer limited to how many cores or local CPUs you have.
The article also shows how to extend the makefile syntax to control the dispatching of commands. Once you realize that GNU Make effectively calls $SHELL by passing “-c” followed by your entire command, you also realize that you can intercept anything you want in your own implementation of $SHELL to control job submission, such as the single, double and triple equal sign syntax in the aforementioned article. All without changing anything in the GNU Make source code. Wow!
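To make the "-c followed by your entire command" calling convention concrete, here is a small sketch of a dispatching shell written as a function. The names myshell_sketch and pick_host are hypothetical; pick_host stands in for the avail_host script and, in this sketch, always answers localhost, in which case the command is run locally instead of through ssh.

```shell
#!/bin/bash
# pick_host: hypothetical stand-in for avail_host; always picks localhost here.
pick_host() { echo localhost; }

# myshell_sketch -c 'recipe line'
# Mirrors how make calls $SHELL: $1 is -c, $2 is the whole recipe line.
myshell_sketch() {
    if [ "$#" -lt 2 ] || [ "$1" != "-c" ]; then
        echo "usage: myshell_sketch -c 'command'" >&2
        return 64
    fi
    local recipe=$2 host
    host=$(pick_host)
    if [ "$host" = "localhost" ]; then
        # Nothing to dispatch: run the recipe in a local shell.
        bash -c "$recipe"
    else
        # Re-enter the current directory on the remote host first.
        ssh "$host" "cd $(printf '%q' "$PWD") || exit; $recipe"
    fi
}
```

Turned into an executable script that ends with myshell_sketch "$@", this could be assigned to SHELL in the makefile, and any extra dispatching syntax would be parsed out of the recipe string before it is run.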
Exploring a bit further on the GXP website, I have found that the GXP flow for dispatching jobs to multiple machines relies on daemons being spawned at the user space level using the ssh command. I still don’t understand why it would be necessary to spawn daemons though. I would be interested in knowing this.
→ Leave a CommentCategories: asic verification · software development
You run a tool for hours, save its output to a logfile, then parse the log for errors only to realize that there was a critical message in the first few minutes of the run. What a waste of time.
You can instead process stdout of a program on the fly. Using a unix pipe and your most powerful ally, the bash shell, Greg’s wiki explains how you can read stdout on the fly.
Let’s start with a basic example: You want to stop at the first error.
set -o pipefail
(echo "foo"; echo "error"; echo "baz") |
(
    while read; do
        if [ `echo $REPLY | grep -c "error"` -eq 1 ]; then
            echo Found error: $REPLY >/dev/stderr
            exit 2
        fi;
        echo $REPLY
    done
)
exit $?
First we pretend to have a process which outputs messages. This is modeled by the echo statements. They are grouped in a subshell (the parens open a subshell) and the output of this subshell is piped to the next command down the line, which is a subshell too. The second subshell contains the while statement. The while statement reads its stdin and greps for errors. When an error is detected, the problematic input is copied to /dev/stderr and an exception is thrown with the exit statement. If there are no errors, the input is simply copied to stdout with the echo statement. At the end, we re-throw the exit code with exit $? so the caller knows this script has encountered an error.
There are a few shell things in there. The parentheses ( ... ) create a subshell. The while read reads stdin one line at a time. The $REPLY is a bash built-in variable whose value is set by read. The $? built-in variable holds the exit code of the last command, function or script that was run. In this script, the last thing that was run is the subshell containing the while loop, so $? holds the exit code of that subshell.
Here is another version, where we look for both errors and a “must see” expression.
set -o pipefail
var=1
(echo "foo" && echo "bar" && echo "baz") |
(
    while read; do
        if [ `echo $REPLY | grep -c "error"` -eq 1 ]; then
            echo Found error: $REPLY >/dev/stderr
            exit 2
        fi;
        if [ `echo $REPLY | grep -c "must see"` -eq 1 ]; then
            var=0;
        fi;
        echo $REPLY
    done
    exit $var
)
exit $?
Here we have a variable var which we clear when we find the “must see” string. In this example, it will not find the expression, and after the loop exits, exit $var throws a non-zero exit code (var is initialized to 1 at the beginning). The exit code is re-thrown after the loop has exited so callers of this script will know how it ended.
You can do all kinds of sophisticated things here, such as count lines, or print a few hundred lines beyond an error before exiting, or count errors and abort after you’ve seen N of them. You can store the output to a logfile with I/O redirection. It gets a bit hairy when you also want to use tee, but it can be done.
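As one example of those variations, here is a sketch of "abort after you’ve seen N errors" written as a function instead of a subshell. The stop_after_n_errors name is made up for this example, and it uses bash pattern matching rather than grep to spot the word error; both choices are mine and not taken from Greg’s wiki.

```shell
#!/bin/bash
# stop_after_n_errors MAX
# Copies stdin to stdout; returns 2 as soon as MAX lines containing
# "error" have been seen, 0 if the input ends before that.
stop_after_n_errors() {
    local max=$1 count=0 line
    while IFS= read -r line; do
        printf '%s\n' "$line"
        if [[ $line == *error* ]]; then
            count=$((count + 1))
            if (( count >= max )); then
                echo "Aborting after $count errors" >&2
                return 2
            fi
        fi
    done
    return 0
}
```

It is used like the loops above, as in my_tool | stop_after_n_errors 3, where my_tool is whatever program produces the log, with set -o pipefail if the exit code of the tool itself matters too.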
Your first book when you go into ASIC verification should not be the Art of Verification with Vera, but rather the Advanced Bash-Scripting Guide. Read all you can about the shell, it will not be wasted. Hang out on IRC #bash, it helps too. When you reduce every program to its essence, you realize all you ever need is the exit code. You do care about what thousands of log files have to say, but first and foremost you want to know: pass or fail? The exit code is the answer, no matter how sophisticated your entire verification environment becomes: cmd && echo pass || echo fail.
→ 3 CommentsCategories: asic verification