Why ASIC designers should not fear merging or conflicts

ASIC designers have a very strong sense of ownership of their code: although they collaborate at interfaces, they seldom let another designer edit their files. I finally understood why when I found an analogy.

ASIC designers are like civil engineers. Let’s look at what each type of engineer builds:

Civil Engineer                          ASIC Designer    Verification Engineer
--------------------------------------  ---------------  --------------------------------------
City blocks                             ASIC blocks      Data Generators
Highways, Roads                         Buses, Signals   Transaction data
Intersections, Traffic Control Devices  Muxes, Arbiters  Functions and methods
Buildings                               Memories         Abstract data types, lists, databases

The things ASIC designers (and civil engineers) build are static in nature: they don’t flow, they don’t live; they exist to hold stuff and to do things to stuff. Their creations are instantiated statically, and while they can serve more than one need, they are the place you go to for a service. The file that describes the “road” or the “building” is strongly tied to the engineer who designed it. So designers see nothing limiting in a version control system that locks files while they are being edited. And when two people edit the same file, panic sets in: how can a file be edited by two people at the same time? How is this not breaking the file (and ruining my design)?

The verification engineer, on the other hand, builds the living things that flow through the services provided by the statically instantiated hardware. Those living things travel on buses, flow from block to block, get stored and are read back to be stored somewhere else. They are processed by multiple hardware instances and transformed along the way. Verification engineers have no sense of ownership over the data they throw at the design.

The problem of merging and conflict resolution was solved a long time ago. Have no fear. But I guess it depends on the tools you are using.

This post was originally written in 2008, but was only published in 2016!

Continuous integration system using parallel make

In software design, a continuous integration system compiles and tests code on a continuous basis. When a build fails, the code is rejected and the owner notified. When the code is good (it compiles and the sanity tests pass), the code is accepted and made available for distribution to the development team.

There are many aspects to a continuous integration system. A hypothetical buildbot could implement the following phases:

  1. Receive code submitted to the system from a request queue
  2. Build the code. There are two possible outcomes:
    1. The build fails, reject the code and process the next request in the queue
    2. The build succeeds, move on to the test phase
  3. Test the code. There are two possible outcomes:
    1. One or more tests fail, reject the code and process the next request in the queue
    2. All tests pass, accept the code and make it available for distribution among the team

This needs to be as independent as possible from the source code management tool, but hooks are necessary for the buildbot to extract code from it, and to mark the source code management database with some form of tag when code is accepted after it passes the tests. I mentioned sending pass/fail notifications to owners, but I won’t cover that here (most likely, sendmail is your friend).
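To make the phases concrete, here is a minimal sketch of the accept/reject decision in Python. The checkout and tag hooks (checkout.sh, tag.sh) and the build and test make targets are placeholders for illustration, not a real tool:

#!/usr/bin/env python
# Minimal buildbot sketch: take one request, build it, test it,
# accept or reject. The hooks into the source code management tool
# and the make targets are hypothetical.
import subprocess

def process(request):
    # Extract the submitted code from the source code management tool.
    subprocess.check_call(["./checkout.sh", request])

    # Build phase: a non-zero exit status means the build failed.
    if subprocess.call(["make", "build"]) != 0:
        return "rejected: build failed"

    # Test phase: any failing test rejects the submission.
    if subprocess.call(["make", "test"]) != 0:
        return "rejected: tests failed"

    # Accepted: tag the source code management database so the code
    # can be distributed to the team.
    subprocess.check_call(["./tag.sh", request])
    return "accepted"

A driver loop would pop requests off the submission queue, call process(), and mail the verdict to the owner (sendmail being your friend, as noted above).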

At the heart of the buildbot is the build system; its purpose is to build (compile) the code. A build system based on GNU Make can take advantage of GNU Make’s parallel prerequisite build capability to speed up the build. By default, GNU Make will attempt to build as much of your code as possible: parallel make will build all the targets it can, except those whose dependencies have failed to build. While this might be desirable when you incrementally compile code for yourself, it is detrimental when a buildbot tries to process independent build requests from a submission queue. The buildbot simply needs to abort at the first failure, notify the owner, clean up, and move on to the next request.

So we want GNU Make to abort at the first failure when we run make -j N. There is nothing in GNU Make that allows this. However, we can edit job.c and add a call to fatal_error_signal(SIGTERM), so that at the first error it encounters, GNU Make will abort as if we had hit control-c on the command line:

/* GNU Make 3.81, job.c, near line 500 */
if (err && block)
  {
    static int printed = 0;
    fflush (stdout);
    if (!printed)
      {
        error (NILF, _("*** Waiting for unfinished jobs...."));
        fatal_error_signal (SIGTERM);  /* added: abort as if control-c was hit */
      }
    printed = 1;
  }

Save, compile, and try this modification with a simple makefile:

all: t1 t2
t1:
        sleep 60 && echo done || echo failed
t2:
        sleep 2 && exit 1

The long compile time is emulated with sleep 60, and the error with a simple exit 1. Now run this with the modified make:

$ make -j 2
sleep 60 && echo done || echo failed
sleep 2 && exit 1
make: *** [t2] Error 1
make: *** Waiting for unfinished jobs....
make: *** [t1] Terminated
Terminated
$ ps -ef | grep sleep
martin   22343     1  0 18:36 pts/5    00:00:00 sleep 60

That did not go as expected: GNU Make did exit, but the sleep 60 remains on the system! And if we use command substitution, we would want the command to return at the first error, but it does not return until the sleep 60 is done:

$ time t=$(make -j 2)
make: *** [t2] Error 1
make: *** Waiting for unfinished jobs....
make: *** [t1] Terminated

real    1m0.007s
...
$

Clearly, something is wrong: make returned to the prompt before sleep 60 finished, but the command substitution did not return the same way; it waited until the end.

Changing the makefile to the following alters the behavior:

all: t1 t2
t1:
        sleep 60

t2:
        sleep 2 && exit 1

Notice how I simplified the command in the t1 target: sleep 60 is now the only command in the chain. The outcome is better:

$ make -j 2
sleep 60
sleep 2 && exit 1
make: *** [t2] Error 1
make: *** Waiting for unfinished jobs....
make: *** [t1] Terminated
Terminated
$ ps -ef | grep sleep
$

The first observation is that the sleep 60 is killed this time; and although it is not shown here, I tried command substitution and it returns as soon as the first target fails. Now this is annoying: how can I know which commands will be terminated properly and which ones won’t?

Paul D. Smith explains that GNU Make has two paths for running commands: simple and complex. A complex command contains things like && and ||, in which case GNU Make spawns a shell to run the command, and it is that shell that receives the SIGTERM, not the command itself. Armed with this information, I head to #bash, where I learn that I simply need to add a trap to the complex commands, so I rewrite the makefile like so:

all: t1 t2
t1:
        trap 'kill $$(jobs -p)' EXIT; sleep 60 && echo done || echo failed
t2:
        trap 'kill $$(jobs -p)' EXIT; sleep 2 && exit 1

And now make aborts at the first error, command substitution does not hang, and no processes are left lingering.

The goto sin

This is the best rant on the demise of the goto statement I have ever heard. It is from the Tango conference 2008 – Fibers talk by Mikola Lysenko. If you fast forward to 23 min 05 secs, you will hear this:

One way you can think about states [in a state machine] is that they’re kind of a label. And you put this label here for where you want the code to goto after you’re done with this other, sort of, state. A state machine is kind of an indirect goto and so these states and switch statements are just like nested gotos within gotos.

Now if you use coroutines, you can then use structured programming to represent the states. I mean this is a debate that was you know, played out years ago in a more limited context of structured programming versus gotos and ultimately, structured programming won out.

Nowadays I mean if you go into entry level programming course, they’ll just fail you on your projects if you even use a goto statement because those goto statements are that toxic to the integrity of programming code. You’re much better off using structured programming. And yet despite that, people still advocate using these state machines which are basically a really indirect horrible obfuscated goto mess, just split across multiple source files and larger projects.

So it’s like if you make the goto sin big enough, then no one is going to call you on how bad it is! But the thing is if you use coroutines, not a problem! So why have we been using this all along? Oh once again I’d probably appeal to the fact that, ah, you know, ignorance, right, people just don’t know about coroutines.

Now I want my goto back… er… coroutines!
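To see the contrast he is talking about, here is a minimal sketch of my own (not from the talk): a made-up scanner that extracts the characters between < and >, written first as a switch-style state machine, then as structured code where the “state” is simply where you are in the code, which is what coroutines generalize:

# State machine version: 'state' is an indirect goto target.
def scan_fsm(chars):
    state = "IDLE"
    for c in chars:
        if state == "IDLE":
            if c == "<":
                state = "TAG"      # goto TAG
        elif state == "TAG":
            if c == ">":
                state = "IDLE"     # goto IDLE
            else:
                print("tag char:", c)

# Structured version: no state variable; being "inside a tag" is
# just being inside the inner loop.
def scan_structured(chars):
    it = iter(chars)
    for c in it:
        if c == "<":
            for c in it:
                if c == ">":
                    break
                print("tag char:", c)

scan_fsm("a<bc>d")          # prints tag chars b, c
scan_structured("a<bc>d")   # same output, no labels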

Overriding GNU Make SHELL variable for massive parallel make

If you use GNU Make in your verification environment, maybe you have dreamed of typing make -j 120. It turns out this is possible, thanks to a very interesting use of the GNU Make SHELL variable. You know that make -j N causes GNU Make to build up to N prerequisites in parallel. In the code below, prereq1, prereq2 and prereq3 would be built at the same time if -j 3 were used:

SHELL=./myshell
prereq1:
        very_long_processing
prereq2:
        more_very_long_processing
prereq3:
        more_and_more_processing

target: prereq1 prereq2 prereq3
        very_long_compile

This may look like nothing, but it gets more interesting when you define myshell as follows, as explained on the GNU mailing list:

#!/bin/bash
# GNU Make calls this with $1 = "-c" and $2 = the command to run
ssh $(avail_host) "cd $PWD || exit; $2"

Now just type make -j 3 and your build is distributed to 3 available hosts, as returned by the avail_host script (writing this script is an exercise left to the reader ;-)).

Shortly after coming up with this, a bit of searching revealed that this was nothing new, really. The article Distributed processing by make explains that by using the GNU Make SHELL variable, the number of jobs you can dispatch with make is no longer limited to the number of cores or CPUs on your local machine.

The article also shows how to extend the makefile syntax to control the dispatching of commands. Once you realize that GNU Make effectively calls $SHELL by passing “-c” followed by your entire command, you also realize that you can intercept anything you want in your own implementation of $SHELL to control job submission, such as the single, double and triple equal sign syntax in the aforementioned article. All without changing anything in the GNU Make source code. Wow!
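As a sketch of the idea (with a made-up REMOTE marker instead of the article’s equal-sign syntax), here is a replacement $SHELL written in Python; GNU Make itself does not change at all:

#!/usr/bin/env python
# Hypothetical $SHELL replacement. GNU Make invokes it as:
#   myshell -c 'command'
# so sys.argv[1] is "-c" and sys.argv[2] is the entire command.
import os
import subprocess
import sys

cmd = sys.argv[2]

if cmd.startswith("REMOTE "):
    # Made-up marker: ship the command to another host over ssh,
    # cd'ing to the current directory first, as in myshell above.
    host = subprocess.check_output(["avail_host"]).decode().strip()
    remote = "cd %s || exit; %s" % (os.getcwd(), cmd[len("REMOTE "):])
    sys.exit(subprocess.call(["ssh", host, remote]))
else:
    # Anything else runs locally through a real shell.
    sys.exit(subprocess.call(["/bin/sh", "-c", cmd]))

A recipe line written as REMOTE very_long_processing would then be shipped to another machine, while plain recipes run locally.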

Exploring a bit further on the GXP website, I found that the GXP flow for dispatching jobs to multiple machines relies on daemons spawned in user space using the ssh command. I still don’t understand why spawning daemons would be necessary; I would be interested to know.

How one line of code quadruples testing – in hardware at least!

A sentence in an article by Joel Spolsky caught my attention: “Can’t we just default to IE7 mode? One line of code … Zip! Solved!”. This is the typical one-line solution that quadruples the testing effort. If you thought it only doubled the testing, think again.

Let’s examine this in detail. First we have the code:

...
// When fizBoo is enabled, call it for additional sparkling
if (fizBooEnabled) {
  fizBoo();
}
...

All right, so how does this quadruple the testing? Well, the first thing to note is that in hardware verification, there is no such thing as conditional compilation to remove fizBoo from the silicon: the fizBoo() function is always there, because the ONE chip that gets manufactured has to support all hardware architectures. In software, you can #ifdef things out of the executable based on the architecture, but in hardware, everything is always present to support all of them. This way the same hardware can be used with a variety of operating systems (Windows, Linux) and a variety of architectures (Sun, IBM, low-end PCs, etc.).

So now we have to test fizBoo() in 4 different cases:

  1. fizBooEnabled == 0, in systems where fizBoo() does not need to be called
  2. fizBooEnabled == 1, in systems where fizBoo() adds some sparkling
  3. fizBooEnabled == 0, in systems supporting fizBoo(), but if not called, only the additional sparkling is missing
  4. fizBooEnabled == 1, in systems not supporting fizBoo(), to ensure there is no evil things like memory corruption or a bus error

One line of code quadruples the testing. But this example is easy: easy for the designer to put in, easy for the tester to spot, and easy to cover in a unit test.
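Here is what covering that 2x2 cross product could look like in such a unit test; the simulate() harness below is a hypothetical stand-in for the real simulation environment:

import itertools
from collections import namedtuple

Result = namedtuple("Result", "no_corruption sparkling")

def simulate(fizboo_enabled, system_supports_fizboo):
    # Hypothetical stand-in for the real simulation harness.
    sparkling = bool(fizboo_enabled) and system_supports_fizboo
    return Result(no_corruption=True, sparkling=sparkling)

# Two independent axes -- the fizBooEnabled flag, and whether the
# architecture supports fizBoo -- give four cases from one line of code.
for enabled, supported in itertools.product([0, 1], [False, True]):
    result = simulate(enabled, supported)
    # In all four cases the system must stay sane: no memory
    # corruption, no bus error (cases 1 through 4 above).
    assert result.no_corruption
    # The extra sparkling appears only when the feature is both
    # enabled and supported (case 2); in case 3 it is merely missing.
    assert result.sparkling == (bool(enabled) and supported)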

But there are thousands of features in hardware, and end users are chaotic: they will plug the hardware into the wrong architecture, and they will load the wrong driver on the right architecture. The last thing they want, though, is for the system to go up in flames. The hardware is expected neither to electrocute the user nor to cause memory corruption, regardless of how chaotic the end user is. And end users are not the only chaotic people: your own software team is chaotic; even the team writing the driver for the hardware is chaotic!

In my opinion, this makes hardware verification a much more challenging job than hardware or software design. And because the fabrication cost of chips is so high, as much as possible has to be verified BEFORE fabrication (no one likes a coaster, especially one that costs seven figures to fab). Without actual hardware to verify against, all verification is done in software, through simulations, including the simulation of end user behavior and the simulation of all the relevant pieces of the architectures under which the hardware is expected to be used. This is some serious testing. And this testing had better simulate not only correct use, with its humongous state space, but chaos as well. In fact, the only way to verify correctness is to introduce chaotic behavior in the test environment itself. By building randomness into each step of the input stimulus generation, chaos is introduced into the verification, and it finds more bugs in more areas.
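As a sketch of what “randomness at each step” means, here is a made-up constrained-random transaction generator; the one essential detail is logging the seed so a failing run can be replayed:

import random
import sys

# Take the seed from the command line to replay a run, or draw a new one.
seed = int(sys.argv[1]) if len(sys.argv) > 1 else random.randrange(2**32)
rng = random.Random(seed)
print("seed =", seed)  # always log it: a failure without its seed is lost

# Every field of every transaction is randomized, but only within
# legal constraints. The fields themselves are made up.
for _ in range(10):
    burst_len = rng.choice([1, 2, 4, 8])   # legal burst lengths only
    addr = rng.randrange(0, 2**16, 4)      # word-aligned addresses only
    kind = rng.choice(["read", "write"])
    print(kind, hex(addr), "burst", burst_len)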

This is way more complicated than just using the same seed to reproduce a bug. It means that constrained-random predictive stimulus generators, self-checking scoreboards, automated test benches, and computer farms are needed to automate functional, formal and runtime verification. Actually, I don’t understand why these types of verification are classified under “software development” in the Wikipedia article on verification, since they are used extensively in hardware verification.

ASIC Verification is software development, not hardware

Why do I end up reading about the Knuth Shuffle when I try to verify that all the different sequences of bus transactions have been tested? Why does it make more sense to me to use an SQL backend to handle loads of timing data rather than doing it with a perl script like a hardware designer would? Why do I laugh out loud when I read “How to write unmaintainable code“?
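(For the record, the Knuth Shuffle, also known as Fisher–Yates, is the unbiased way to randomize an ordering; here is a sketch of it over a made-up transaction list:)

import random

def knuth_shuffle(items):
    # Walk the list backwards, swapping each element with a randomly
    # chosen element at or before it; every permutation of the bus
    # transactions is then equally likely.
    for i in range(len(items) - 1, 0, -1):
        j = random.randrange(i + 1)
        items[i], items[j] = items[j], items[i]

transactions = ["read A", "write B", "read C", "write D"]
knuth_shuffle(transactions)
print(transactions)  # one of the 24 orderings, each with probability 1/24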

Simply because ASIC Verification is a software development activity. One hundred percent a software activity. Sure, we bring up waveforms and run verilog simulations all the time, but who thinks in terms of transistors and logic gates anymore? At most, you need to understand the difference between synthesizable code and the code your backend guy will reject because some obscure tool in the backend won’t take it. Other than that, as much of a hardware person as you think you are, you are really only writing software. Software that compiles into millions of logic gates. It might as well compile into machine code, and you, the ASIC Verification engineer, would not know the difference.

However, there is a slight difference. Once the chip is released to the fab, it’s gone. It’s over. The bugs that are in it are in it for good. You don’t patch hardware; you don’t release 4.0.6.5 when 4.0.6.4 has a bug. Unlike software, where new releases come out all the time because users are just an extension of QA, an ASIC cannot come out every week: hardware must go through extensive QA tests that software is never subjected to: static discharge, temperature, humidity, power rail variations, etc. Physical things, you see, take more time than testing a button on a web page. Plus, you cannot download a new graphics processor from sourceforge: physical things have to be fabricated. This means that the ASIC had better be close to feature perfection when it’s time to release it.

This also means that the QA team for the ASIC (i.e. the ASIC verification team) has to anticipate all possible usage models: everything that software developers are going to throw at the ASIC has to be tested in simulation before the hardware is released. Every single driver that will be written for the ASIC has to be anticipated by the QA team, and everything those drivers will ever do has to be simulated. I just have one thing to say: wow.

This is why your ASIC Verification team (aka QA, in software jargon) uses make, python, MySQL and git, and needs to be managed like software developers. This is also why they code in object-oriented languages, why they use every feature their revision control software has to offer, and why they are not afraid to merge their code into what looks to the hardware designers like a code clusterfuck. It’s not a code clusterfuck; it’s software.

But there are still a few things for ASIC Verification engineers to learn: the lessons of software development. Recently, I read rands in repose‘s recent book (managing humans), and an excellent post by Joel on software. Read them; you will learn. The two things that still differentiate us from software developers are that we can read waveforms and we understand what “executing in zero time” means.