Name your compare method diff

In this article, I will talk about naming the comparison method “diff”, what constitutes a good return value for the diff method, and how I emulate exception handling in SystemVerilog, a language which does not have formal exception handling.

First, we name the comparison method diff rather than compare. Why? Because it is easier. Consider this code:

if (packet1.compare(packet2)) begin
  // what does compare return exactly?
end

Chances are that you have to look up the compare method before you write your code, each time. Now read this code:

if (packet1.diff(packet2)) begin
  // it reads like English
  // so packet1 must be different than packet2
  $display("Packets are different");
end

Much easier. And I bet you don’t have to look up the diff method.

Now onto the concept of what constitutes good return values. Let’s continue with the diff example. When you run the Unix diff command, it returns 0 when there are no differences between two files, or 1 if there is a difference. This makes conditional statements easy to write, both in Unix and in your code. Let’s write a diff method that returns the number of differences:

// Return the number of differences
function int a_class::diff(a_class other);
  diff = 0;
  if (this.a != other.a)
    diff++;
  if (this.b != other.b)
    diff++;
endfunction

Nice, but we can do better. We can enhance the diff method to return messages about the differences:

function int a_class::diff(a_class other, ref string msg[$]);
begin
  diff = 0;
  if (this.a != other.a)
  begin
    msg.push_back($psprintf("This a (%0d) is different than the other a (%0d)",this.a,other.a));
    diff++;
  end
  if (this.b != other.b)
  begin
    msg.push_back($psprintf("This b (%0d) is different than the other b (%0d)",this.b,other.b));
    diff++;
  end
end
endfunction

And now we can print the diff messages in the context where the comparison was made:

  if (foo.diff(bar,msg))
  begin
    foreach(msg[idx]) $display("%s",msg[idx]);
    // This was fatal, so end here
    `vmm_fatal(log,"Unrecoverable error");
  end

Since VMM supports the printing of line and file number, we get a clear picture of where the error was detected. This is better than aborting from inside the comparison method. Until SystemVerilog gets exception handling, this is an acceptable compromise.
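
For reference, the Unix diff convention that the diff method borrows can be checked from a shell. This is a minimal sketch: the describe_diff helper and the file names are made up for illustration, and the only real behavior relied upon is that diff exits with 0 when the files match, 1 when they differ, and 2 when something went wrong.

```bash
#!/bin/bash
# describe_diff FILE1 FILE2: hypothetical helper that turns the exit
# status of diff into a word (0 = same, 1 = different, other = error).
describe_diff() {
  local status=0
  diff "$1" "$2" >/dev/null 2>&1 || status=$?
  case $status in
    0) echo "same" ;;
    1) echo "different" ;;
    *) echo "error" ;;
  esac
}

tmpdir=$(mktemp -d)
printf 'a=1\nb=2\n' >"$tmpdir/packet1"
printf 'a=1\nb=3\n' >"$tmpdir/packet2"

describe_diff "$tmpdir/packet1" "$tmpdir/packet1"   # prints "same"
describe_diff "$tmpdir/packet1" "$tmpdir/packet2"   # prints "different"

rm -rf "$tmpdir"
```

Because zero means "no difference", the status can be used directly in a condition, which is the same property the SystemVerilog diff method aims for.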

GNU Make poetry

I use GNU Make a lot. A while back I wrote this:

Three rules for a thousand Vera files
Seven for the Designers in the Valleys of Verilog
Nine for a dynamic Register Abstraction Layer
One for processing the output log
In the Compute Cloud where the CPUs lie.

One target to rule them all, One vpath to find them
One target to bring them all and in the makefile build them
In the land of GNU Make where the DAG shines.

Git push is for publishing, not for integration

This may seem obvious, but don’t use git push like CVS commit. Repositories should be accepting commits by pulling them, not pushing them. CVS commit may be your integration strategy in a simple world without branches, but git push shouldn’t be used for that.

In the land of ASIC development, where a CVS commit is instantly shared with all the ASIC designers, the worst that can happen is someone unintentionally commits a file on a branch. Then what? You go talk to the verification guy (why do they always know how to fix CVS?) and admit your mistake. He fixes it and everyone gets the fixed up repo on their next update, no biggie. Not so much with git.

With git, as soon as you have pushed the commits, the central repository branch on which you just pushed looks exactly like the local repository branch it was pushed from, with the commits you wish you had done on another branch, and the commits you should have reverted before continuing, but more devastatingly, with the incorrect merge you should not have done. Sure, git gives you the ability to correct all your mistakes IN YOUR LOCAL REPO, but once you’ve pushed them to the central repository, it’s like a printed newspaper: it’s out there and there is no recall, only further editions. Actually, you can fix an incorrect merge, but wouldn’t you want to avoid them in the first place?

So instead of having developers push to the integration repository, the integration manager should obtain the code through a git pull. If the integration tests pass, the commits are then pulled into the release repository by the release manager, where everyone can pull from. If the tests fail, the commits are taken out of the integration repository (where no one can pull from), the contributor is notified, and the integration manager processes the next integration request.

All this may seem obvious, but today when an incorrect merge between two branches was pushed, effectively merging the branches on the release repository, it became clear to me that the workflow needed to prevent this from happening, and that it could be done easily if two conditions were met:

  1. use git pull for getting code into a git repository
  2. use two repositories, one for integration, and another one for releases, thereby isolating the release tree from mistakes on the integration tree
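
The two conditions can be tried out with throwaway local repositories. This is only a sketch under assumptions: the repository names (release, dev, integration), the g wrapper and the run_tests stand-in are all hypothetical, and --ff-only is my own choice to make an accidental merge fail loudly rather than something prescribed above.

```bash
#!/bin/bash
# Pull-based integration with throwaway local repositories.
# Repository names and the run_tests stand-in are hypothetical.
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }
run_tests() { grep -q v2 design.v; }   # stand-in for the real integration tests

work=$(mktemp -d)

# Release repository: the tree everyone pulls from.
g init -q "$work/release"
( cd "$work/release" && echo v1 >design.v && g add design.v && g commit -q -m "baseline" )

# A developer clones it and commits locally. Nothing is pushed anywhere.
g clone -q "$work/release" "$work/dev"
( cd "$work/dev" && echo v2 >design.v && g commit -q -am "new feature" )

# The integration manager pulls the developer's commits and runs the tests.
g clone -q "$work/release" "$work/integration"
if ( cd "$work/integration" && g pull -q --ff-only "$work/dev" HEAD && run_tests ); then
  # Tests passed: the release manager pulls from the integration repository.
  ( cd "$work/release" && g pull -q --ff-only "$work/integration" HEAD )
  echo "accepted"
else
  # Tests failed: the release repository is never touched.
  echo "rejected"
fi
```

Nobody runs git push in this flow, so a bad merge made in a developer repository can only reach the release tree if the integration step lets it through.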

You could argue that development, integration and release can all happen on different branches of the same repository, but in reality, typing git merge followed by git push is an easy mistake to make (never trust humans), whereas a CI system pulling code will not inadvertently merge two branches.

Wait, I have just found git bundle… more ways to skin the cat!

Git Workflow for ASIC Development

I recently purchased Pro Git at my local Chapters bookstore. After reading it, I discovered I could get it for free on-line, but that did not make me feel bad, on the contrary! In a week, I had taken it everywhere, and showed it to half a dozen people.

The reason I blog about this book is because in reading chapter 5, I fell in love with the Integration Manager Workflow as I think it could completely replace the CVS workflow I previously described on AgileSOC.

The Integration Manager Workflow works as follows:

  1. A central blessed repository holds the tape out version of the ASIC database.
  2. Each ASIC developer pulls from the blessed repository to develop new features and implement bug fixes.
  3. Each ASIC developer publishes their changes to their “developer public” repositories.
  4. Through an automated process (partly provided by git built-in commands), they request that their changes be pulled into the integration manager repository.
  5. The integration manager queues the requests, and if they pass the tests, they are accepted and committed into the blessed repository.
  6. Rinse and repeat.

Here is a picture from the book representing the Integration Manager Workflow, reproduced here thanks to the Creative Commons Attribution-Non Commercial-Share Alike 3.0 license:
[Figure: Integration manager workflow]

But that is not all. I discovered GitHub’s network graphs (the link takes you to an article linking to a live clickable network graph). If you’ve ever wondered what the hell designers were up to in your ASIC team, GitHub’s network graphs are the answer. The network graphs show you what happens live in your Integration Manager Workflow. They show commits across the multiple databases (repositories), as well as the merges into the integration repository, from any developer’s point of view. This most impressive hosting solution can be purchased and installed inside your intranet.

I am currently installing gitolite to manage access to the repositories. Hopefully, that will go well!

This blog entry is published under the terms of the Creative Commons Attribution-Non Commercial-Share Alike 3.0 license because it uses work published under this license.

A Continuous Integration System For ASIC Development

I recently wrote an article published on AgileSOC titled
A Continuous Integration System For ASIC Development.

Hope you enjoy it!

Missing Platform LSF commands part 2: btree

In a previous article, I discussed a hypothetical Platform LSF command, bwait, which would wait for LSF jobs to complete before returning. Another command that I would like to see in the Platform LSF suite of commands is a command to track parent-child relationships between jobs. Linux has a pstree command, LSF could have a btree command.

A simple scenario for submitting hierarchical LSF jobs would be launching regressions. You may want to have a master job representing a regression, responsible for launching multiple child jobs: the test cases. In this case, a command tracking the parent-child relationship between LSF jobs would be useful:

$ bsub my_regression
Job <56478> is submitted to default queue.

# Some time later
$ btree 56478
56478
  56479
  56481
  56482
  ...
$

This command would be able to trace a child job back to the parent:

$ btree --find-parent 56481
56478
  56479
  56481
  56482
  ...
$

Another application for this command is for hierarchical builds, where a parent job dispatches multiple smaller build jobs. For instance, if you describe your hierarchical build in GNU Make and in sub-makes, you are letting GNU Make decide what needs to be built. If you want to know what GNU Make has launched behind the scenes, you need to track the hierarchy of jobs. Again, a btree command becomes useful:

$ bsub make large_build
Job <1234> is submitted to default queue.

# And some time later:
$ btree 1234
1234
  1235
  1236
  1239
  ...
$

The btree command would be able to report the tree of jobs long after the jobs are done. It could also do all kinds of fancy stuff like print log file names and the job status along with the job ids, so you could quickly review the status of a large set of jobs by holding on to a single job id number.
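
To make the idea concrete, here is a small bash sketch that produces btree-style output from a table of parent-child links. Everything in it is hypothetical: the parent_of table is hard-coded for illustration, the function names are mine, and no LSF command is called, since how btree would obtain these links from LSF is exactly the missing piece.

```bash
#!/bin/bash
# Emulates the output of the hypothetical btree command from a made-up
# table mapping each child job id to its parent job id (needs bash 4+).
declare -A parent_of=( [56479]=56478 [56481]=56478 [56482]=56478 [56490]=56481 )

# print_tree JOBID [INDENT]: print the job and, recursively, its children.
print_tree() {
  local job=$1 indent=${2:-} child
  echo "${indent}${job}"
  for child in $(printf '%s\n' "${!parent_of[@]}" | sort -n); do
    if [ "${parent_of[$child]}" = "$job" ]; then
      print_tree "$child" "$indent  "
    fi
  done
}

# find_root JOBID: walk up the parent links, as btree --find-parent would.
find_root() {
  local job=$1
  while [ -n "${parent_of[$job]:-}" ]; do
    job=${parent_of[$job]}
  done
  echo "$job"
}

# Start from a child job and print the whole tree it belongs to.
print_tree "$(find_root 56481)"
```

The same two functions cover both use cases: print_tree for a known parent, and find_root followed by print_tree when all you have is a child job id.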

Missing Platform LSF commands part 1: bwait

In the spirit of the POSIX wait command, Platform LSF should have a bwait command. It should work in a way similar to this:

$ bwait 7372 7378 7123 7812
Passed: 7812
Failed: 7372
Failed: 7123
Passed: 7378
$ echo $?
1

Let’s explain how this hypothetical command would work. As its name suggests, it is a command that waits for LSF jobs to be finished (EXIT or DONE, see the job state diagram) and then returns with a proper exit status. The exit status would be zero if all the jobs succeeded, and non-zero if one or more failed. Note that POSIX wait returns the exit status of the last PID specified, but bwait should return the “worst” exit status of all the job ids specified.

Unlike the unix wait though, it would accept some options, like --stop-at-n-failures=N and --timeout=M. This way, the user could limit the duration of the wait not only by absolute time, but also by number of failing jobs, because in many contexts, there is no point in continuing when there are too many failures.

Another option I would have is --killall-on-stop. This would cause bwait to kill all unfinished jobs as soon as one of the termination conditions was met (timeout or number of failures). The use for this is that unfinished jobs would be removed from the compute cluster to free CPUs. Again, in some contexts, when there are too many failures, there is no point in running more jobs, so might as well terminate them.

As far as printing to stdout goes, in the unix tradition, nothing goes to stdout if all goes well. However, it would be tolerable for bwait to print a pass/fail string along with the job id number to stdout when a job finishes. Maybe this should only be enabled by a --verbose option.

Lastly, bwait needs to do something very important: it must wait for all jobs to close their I/O pipes before returning. Suppose bwait returns, but the jobs are not finished writing to files. Then there would be a race condition for any subsequent processing of those files. The reason I am stating this requirement is that in the current implementation of LSF, when a job reaches the EXIT or DONE state, the I/O stream to the output file specified by bsub -oo file.log is closed long after the job is declared as finished, forcing the user to “guess” when it’s the right time to start post-processing the file.log file by using a sleep command without really knowing how much time to wait.

Having a bwait command would be superior to polling job statuses with the bjobs command followed by a sleep of some hard-to-guess arbitrary time.
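
The "worst exit status" rule can be sketched with ordinary background processes standing in for LSF jobs. This is an illustration only: wait_worst is a hypothetical name, it uses the shell's own wait builtin on local PIDs rather than anything from LSF, and it reports jobs in argument order rather than in completion order.

```bash
#!/bin/bash
# wait_worst PID...: wait for every background PID given and return the
# highest ("worst") exit status seen, not just the status of the last PID.
wait_worst() {
  local pid status worst=0
  for pid in "$@"; do
    status=0
    wait "$pid" || status=$?
    if [ "$status" -eq 0 ]; then
      echo "Passed: $pid"
    else
      echo "Failed: $pid"
    fi
    if [ "$status" -gt "$worst" ]; then
      worst=$status
    fi
  done
  return "$worst"
}

# Three stand-in jobs: two succeed, one exits with status 3.
( exit 0 ) & p1=$!
( exit 3 ) & p2=$!
( exit 0 ) & p3=$!

rc=0
wait_worst "$p1" "$p2" "$p3" || rc=$?
echo "exit status: $rc"
```

Options such as --stop-at-n-failures or --timeout would need a loop that reacts to jobs as they finish, which this simple version does not attempt.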

In part 2, I will discuss the other missing LSF command, btree.

Continuous integration system using parallel make

In software design, a continuous integration system is a system for compiling and testing code on a continuous basis. When a build fails, the code is rejected and the owner notified. When the code is good (compiles okay and sanity tests pass), the code is accepted and is made available for distributing to the development team.

There are many aspects to a continuous integration system. A hypothetical buildbot could implement the following phases:

  1. Receive code submitted to the system from a request queue
  2. Build the code. There are two possible outcomes:
    1. The build fails, reject the code and process the next request in the queue
    2. The build succeeds, move on to the test phase
  3. Test the code. There are two possible outcomes:
    1. One or more tests fail, reject the code and process the next request in the queue
    2. All tests pass, accept the code and make it available for distribution among the team
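
The phases above can be sketched as a small shell loop. This is a hedged outline rather than a real buildbot: process_queue, fake_build and fake_test are hypothetical names, each request is just a label, and the accept step only prints a message where tagging and publishing would happen.

```bash
#!/bin/bash
# process_queue BUILD_CMD TEST_CMD REQUEST...: sketch of the buildbot
# phases. BUILD_CMD and TEST_CMD are hypothetical hooks given by the caller.
process_queue() {
  local build_cmd=$1 test_cmd=$2 request
  shift 2
  for request in "$@"; do
    if ! "$build_cmd" "$request"; then
      echo "$request: rejected (build failed)"
      continue                      # move on to the next request
    fi
    if ! "$test_cmd" "$request"; then
      echo "$request: rejected (tests failed)"
      continue
    fi
    echo "$request: accepted"       # tagging and publishing would go here
  done
}

# Stand-in hooks for demonstration only.
fake_build() { [ "$1" != "broken-build" ]; }
fake_test()  { [ "$1" != "failing-test" ]; }

process_queue fake_build fake_test good-change broken-build failing-test
```

Passing the build and test commands in as arguments keeps the loop independent of both the build system and the source code management tool.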

This needs to be as independent as possible from the source code management tool, but hooks are necessary for the buildbot to extract code from it, and mark the source code management database with some form of tag when code is accepted after it passes the tests. I mentioned sending pass/fail notifications to owners, but I won’t cover it here (most likely, sendmail is your friend).

At the heart of the buildbot is the build system, whose purpose is to build (compile) the code. A build system based on GNU Make can take advantage of GNU Make’s parallel build capabilities to speed up the build. When a target fails under parallel make, GNU Make stops starting new jobs but waits for the jobs that are already running to finish. While this might be desirable when you incrementally compile code for yourself, this behavior is detrimental in the case where a buildbot tries to process independent build requests from a submission queue. The buildbot simply needs to abort at the first failure, notify the owner, clean up, and move on to the next request.

So we want GNU Make to abort at the first failure when we do make -j N. There is nothing in GNU Make allowing that. However, we can edit job.c and add a call to fatal_error_signal(SIGTERM), so that GNU Make will abort, as if we had hit control-c on the command line, at the first error it encounters:

/* GNU Make 3.81, job.c, near line 500 */
if (err && block)
  {
    static int printed = 0;
    fflush (stdout);
    if (!printed)
    {
      error (NILF, _("*** Waiting for unfinished jobs...."));
      fatal_error_signal(SIGTERM);
    }
    printed = 1;
  }

Save, compile, and try this modification with a simple makefile:

all: t1 t2
t1:
        sleep 60 && echo done || echo failed
t2:
        sleep 2 && exit 1

The long compile time is emulated with sleep 60 and the error with a simple exit 1 statement. Now running this with the modified make:

$ make -j 2
sleep 60 && echo done || echo failed
sleep 2 && exit 1
make: *** [t2] Error 1
make: *** Waiting for unfinished jobs....
make: *** [t1] Terminated
Terminated
$ ps -ef | grep sleep
martin   22343     1  0 18:36 pts/5    00:00:00 sleep 60

That did not go as expected. GNU Make did exit, but the sleep 60 remains on the system! If we were using command substitution, we would want the command to return at the first error, but it does not return until the sleep 60 is done:

$ time t=$(make -j 2)
make: *** [t2] Error 1
make: *** Waiting for unfinished jobs....
make: *** [t1] Terminated

real    1m0.007s
...
$

Clearly, something is wrong: make returned to the prompt before sleep 60 finished, but the command substitution did not return the same way. It waited until the end.

Changing the makefile to the following alters the behavior:

all: t1 t2
t1:
        sleep 60

t2:
        sleep 2 && exit 1

Notice how I simplify the command in the t1 target: sleep 60 is the only command in the chain. The outcome is better:

$ make -j 2
sleep 60
sleep 2 && exit 1
make: *** [t2] Error 1
make: *** Waiting for unfinished jobs....
make: *** [t1] Terminated
Terminated
$ ps -ef | grep sleep
$

The first observation is that the sleep 60 is killed, and, not shown but tried, command substitution returns as soon as the first target fails. Now this is annoying. How can I know which commands will be terminated properly and which ones won’t?

Paul D. Smith explains that GNU Make has two paths for running commands: simple and complex. A complex command contains things like && and ||, in which case GNU Make opens a shell to run the command and it is that shell that receives SIGTERM. Armed with this information I head to #bash where I learn that I simply need to add a trap to the complex commands, so I rewrite the makefile like so:

all: t1 t2
t1:
        trap 'kill $$(jobs -p)' EXIT; sleep 60 && echo done || echo failed
t2:
        trap 'kill $$(jobs -p)' EXIT; sleep 2 && exit 1

And now make aborts at the first error, command substitution does not hang and no process is left lingering.
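
The effect of the trap can be reproduced outside of make. The sketch below is an assumption-laden stand-in for a complex recipe: start_and_abort is a hypothetical function, the child is started in the background so that jobs -p has something to report, and the extra wait in the trap is my addition so that the child is reaped before the subshell exits.

```bash
#!/bin/bash
# start_and_abort: hypothetical stand-in for a complex make recipe. The
# subshell starts a long-running child, prints its PID, then fails.
start_and_abort() {
  (
    # On exit, kill any background children, then reap them with wait.
    # The wait is an addition to the trap used in the makefile above.
    trap 'kill $(jobs -p) 2>/dev/null; wait' EXIT
    sleep 30 &
    echo "$!"
    exit 1
  )
}

# The child inherits the pipe used by the command substitution, so the
# substitution can only return once the child is gone. With the trap in
# place that happens right away instead of after 30 seconds.
status=0
child=$(start_and_abort) || status=$?
echo "subshell status: $status"
if kill -0 "$child" 2>/dev/null; then
  echo "child $child is still running"
else
  echo "child $child was cleaned up"
fi
```

Removing the trap line brings back both symptoms seen earlier: the command substitution blocks until sleep finishes, and the sleep process outlives the shell that started it.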

Wrapping ssh inside a script, and passing arbitrary commands to another program

A way to solve this problem, which I presented in a previous post, was found with help from the gurus on #bash. The easy way to make this work is to save the arguments to a file, one per line, and read them back at the other end. Using the bash array syntax "${a[@]}", which quotes every element separately, the original arguments can be reconstructed at the remote end with the same separation they had at the origin.

Here is the code that does just that:

$ cat ssh-wrapper
#!/bin/bash
# Save each argument on its own line in a temporary file
tmp=$(mktemp -p "$(pwd)")
for i; do
  printf '%s\n' "$i" >>"$tmp"
done
ssh -t localhost "$(pwd)/remote_script" "$(pwd)" "./prog" "$tmp"
rm -f "$tmp"

$ cat remote-script
declare -a a
declare -i i=0
# Rebuild the argument array, one element per line of the file
while IFS= read -r; do
  echo "$REPLY"
  a[i++]="$REPLY"
done <"$3"
cd "$1" && "$2" "${a[@]}"

$ cat prog
#!/bin/bash
eval "$@"
# This program traps INT and queries the user for input
function process_int {
  echo -n "INT> "
  read LINE
  echo $LINE
  exit 2
}
trap process_int SIGINT
sleep 30
$

Now it works as expected for both types of end user quoting:

# Double quotes quoting
$ ./ssh-wrapper "FOO='a b' && echo ' hi' && echo $FOO"
 hi

^CINT> it works!
it works!
Connection to localhost closed.

# Single quotes quoting
$ ./ssh-wrapper 'FOO="a b" && echo " hi" && echo $FOO'
 hi
a b
^CINT> it works too!
it works too!
$
Connection to localhost closed.

If the two hosts were on different file systems, scp could be used to send the arguments over to a file on the remote host. You could even send environment information this way.
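
The save-and-restore step can be tried locally, without ssh. This is a minimal sketch under assumptions: save_args, load_args and the restored array are hypothetical names, and like the scripts above it cannot carry an argument that itself contains a newline.

```bash
#!/bin/bash
# save_args FILE ARG...: write each argument on its own line.
save_args() {
  local file=$1
  shift
  printf '%s\n' "$@" >"$file"
}

# load_args FILE: read the lines back into the global array "restored".
load_args() {
  restored=()
  local line
  while IFS= read -r line; do
    restored+=("$line")
  done <"$1"
}

tmp=$(mktemp)
save_args "$tmp" "FOO='a b'" " hi" 'echo $FOO'
load_args "$tmp"
rm -f "$tmp"

# Each element comes back intact, including spaces and quote characters.
echo "${#restored[@]} arguments restored"
printf '[%s]\n' "${restored[@]}"
```

The brackets in the last line make leading spaces visible, which is the detail the first version of ssh-wrapper lost.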

Experiment in wrapping ssh inside a script and passing arbitrary commands to another program

Note: This has been solved. Please see the answer here.

Wrapping ssh inside a script and passing it arbitrary commands is hard. What I am trying to do here is to come up with a generic way to invoke an ssh-wrapper, have it record the $(pwd), perform an ssh to a remote host, and at the remote host, reposition inside $(pwd), and run an arbitrary, possibly interactive, program.

Let’s first start with an ssh-wrapper

$ cat ssh-wrapper
#!/bin/bash
ssh -t localhost "$(pwd)/remote_script '$*'"

Then we define the remote-script

$ cat remote-script
#!/bin/bash
eval $*

Now let’s try some commands, and compare with what bash -c does:

$ ssh-wrapper "echo ' hi'"
hi
$ bash -c "echo ' hi'"
 hi

Notice that bash -c correctly renders the space preceding hi, but ssh-wrapper does not. This indicates potential future problems. Let’s try another case:

$ ssh-wrapper "echo hi && echo hello"
hi
hello
$ bash -c "echo hi && echo hello"
hi
hello

No difference here. Let’s try another one:

$ bash -c "FOO='a b' && echo ' hi' && echo $FOO"
 hi

$ ssh-wrapper "FOO='a b' && echo ' hi' && echo $FOO"
remote_script: line 4: b: command not found

It starts to hurt here. The bash -c does the correct thing ($FOO is expanded before the value is set, which is correct), but in the ssh-wrapper, things go wrong. However, a different quoting produces a better outcome:

$ ssh-wrapper 'FOO="a b" && echo " hi" && echo $FOO'
 hi
a b
$ bash -c 'FOO="a b" && echo " hi" && echo $FOO'
 hi
a b

This works better, and even the space preceding “hi” is present. However, we can see that ssh-wrapper is not very end user proof. And end users are non-linear: give them that script, and they will break it right away.

Let’s modify the original scripts to position the remote execution in the correct path and add a potentially interactive program:

$ cat ssh-wrapper
#!/bin/bash
ssh -t localhost "$(pwd)/remote_script $(pwd) ./prog '$*'"

$ cat remote-script
#!/bin/bash
cd $1 && $2 $3

$ cat prog
#!/bin/bash
eval "$@"
# This program traps INT and queries the user for input
function process_int {
  echo -n "INT> "
  read LINE
  echo $LINE
  exit 2
}
trap process_int SIGINT
sleep 30

The difference here is that the remote-script will place us in the path where the ssh-wrapper was originally called before calling ./prog. Now we try our command again:

$ ssh-wrapper 'FOO="a b" && echo " hi" && echo $FOO'
 hi
a b
^CINT> it works!
it works!
$

You need to hit ctrl-c to get out here, then type some arbitrary string. It works, but is far from being robust: the way to make it fail is by changing the quoting:

$ ssh-wrapper "FOO='a b' && echo ' hi' && echo $FOO"
Executing FOO=a
^CINT> ^CConnection to localhost closed.
$

We are not better off. The prog did not interpret the whole command, only FOO=a!

This is annoying, because I may want to allow variables to be expanded on the original command line, and hence it would be very nice to allow users to use double-quotes in their argument to ssh-wrapper. In fact, it is necessary to allow either type of quoting to work: bash -c works with both, and so should ssh-wrapper.

Let’s change remote-script to pass all arguments after $2:

$ cat remote-script
#!/bin/bash
cd $1 && $2 ${@:3}

And we try again:

$ ssh-wrapper "FOO='a b' && echo ' hi' && echo $FOO"
./prog: line 6: b: command not found
^CINT> ^CConnection to localhost closed.

This time, the quoting around the variable assignment is gone, so b is interpreted as a command! At this point I have run out of ideas on how to make this work.
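
One of the effects at play can be isolated in a few lines. An unquoted expansion such as ${@:3} is split again on whitespace, and quote characters that come out of an expansion are ordinary characters, so they no longer protect the space in 'a b'. The function names and the sample arguments below are made up for illustration; this shows the splitting only, not the additional parsing done by ssh and the remote shell.

```bash
#!/bin/bash
# count_args: report how many separate arguments were received.
count_args() { echo "$#"; }

# Forward everything from the third argument on, without and with quotes.
forward_unquoted() { count_args ${@:3}; }
forward_quoted()   { count_args "${@:3}"; }

cmd="FOO='a b' && echo ' hi'"
forward_unquoted /some/dir ./prog "$cmd"   # prints 6: split at every space
forward_quoted   /some/dir ./prog "$cmd"   # prints 1: passed through intact
```

In the unquoted case the receiving program sees FOO='a and b' as two separate words, which is consistent with b ending up being run as a command.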

This has been solved. Please see the answer here.