
Regression Testing with Random Tests
Cannot Identify Regressions.
- What to do about it.
by Daniel Hansson, CEO Verifyter
ABSTRACT
Most ASIC companies use random tests not only to verify new
designs but also for regression testing. Using random tests for
regression testing is a great idea for coverage, as the randomness
will ensure that total coverage improves over time. Instead of
running the same tests every night, each night's regression test
suite is slightly different, with different seeds. However,
improving coverage is not the specific purpose of regression
testing. The purpose of regression testing is to quickly identify
dips in quality, i.e. regressions, in order to address them and keep
the quality high. And here random tests have one downside: they
cannot identify regressions. But there are ways to address this
issue.
1. IS IT BETTER OR WORSE?
To be more precise, random tests cannot distinguish between a dip
in quality and increased coverage. A random test that fails may do
so because it hit a new, never-before-tested corner case which
reveals a bug in a module that was designed by professors and
PhDs a long time ago in a completely different project. It is great
news to stumble upon such a corner case in order to iron it out,
hopefully before the customer notices it. Alternatively, the
random test may fail because John accidentally sat on his
keyboard while checking in his code update (he is very agile).
This caused some unexpected behavior in functions he was
not even working on (sitting on keyboards often does). This is a
classic case of a regression. In the first case you have great news
to report: an old corner case has been identified and you are a hero.
In the second case, you have to hit the panic button and hold the
release. Distinguishing between good and bad news is always
welcome, not only in the world of regression testing, but alas
random tests cannot help you with this. A random test just tells
you that something failed; it cannot say whether it is a new or an
old problem.
2. FAST FIXES FOR HIGH QUALITY
Another difference between regression bugs and new tests that
cover new corner cases is that regression bugs are comparatively
easy to fix. If you can point out that a developer made an error in a
specific update, then it is often quite easy to fix. Identifying
problems as regressions, and even better, linking the problem to
the revision(s) where it was introduced, results in faster fixes. The
faster you fix regression bugs, the higher the quality of the design
during development, which in turn leads to earlier time to market,
as the developers' work is not hampered by quality dips. So
separating regression bugs from failures due to new test scenarios
also leads to a substantial productivity gain.
3. DIFF DOES NOT WORK
A directed test, as opposed to a random test, is good at identifying
regressions. If a directed test passed earlier but fails now, then
you have most probably identified a regression. Comparing the
revision database today, when the test fails, with its state at some
point in the past, when the test passed, makes it possible to narrow
down when the problem occurred. You can basically do a diff
between the good result and the bad result, both in terms of log
files and the revision database, and draw some conclusions. The
cause of the quality dip, i.e. the regression, is one of the updates
made to the revision database in this time window. You don't
know exactly which one, but you have a list of changes and a
limited set of people you can blame.
Directed tests are great at identifying regressions. They do not
provide the good and steadily improving coverage over time that
random tests do, but in terms of identifying regressions directed
tests are great, and doing a diff between pass and failure gives
you lots of useful information.
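To make the diff idea concrete, here is a minimal sketch of what such a comparison could look like, assuming the design lives in a Git repository and that the passing and failing runs left plain-text log files behind; the tag name and log paths are hypothetical.

```python
# Minimal sketch: narrowing down a directed-test regression by comparing a
# known-good run with the failing run. Assumes a Git repository and plain-text
# simulation logs; the tag name and log paths below are hypothetical.
import difflib
import subprocess

GOOD_REV = "nightly_good"   # last revision where the test passed (hypothetical tag)
BAD_REV = "HEAD"            # current revision where the test fails

# 1. List every commit in the suspect window; one of these introduced the regression.
commits = subprocess.run(
    ["git", "log", "--oneline", f"{GOOD_REV}..{BAD_REV}"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()
print(f"{len(commits)} candidate commits between pass and fail:")
for line in commits:
    print("  " + line)

# 2. Diff the simulation logs from the two runs to see where behaviour diverges.
with open("logs/good/test.log") as f:
    good_log = f.readlines()
with open("logs/bad/test.log") as f:
    bad_log = f.readlines()
for line in difflib.unified_diff(good_log, bad_log, "good/test.log", "bad/test.log"):
    print(line, end="")
```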
Figure 1 Diff Cannot be Used For Random Tests
Diff doesn't work at all with random tests (refer to Fig 1). If a
random test passed yesterday but fails today with a different
seed, this can be due to either a regression in quality or increased
coverage, or the test may even be illegal. An illegal test will
probably lead to a constraint being set to eliminate this type of
test, whereas regressions and coverage improvements will lead to
fixes in the design under test. For random tests we must find a
different solution.
4. BACKTRACKING IS THE WAY
FORWARD
In order to draw conclusions about why a random test failed, we
must retest the very same test on older revisions. This means
rerunning the failing test, using the same seed, on older revisions
in order to identify when the problem started to arise. This is the
only way to compare the test results on older revisions with the
test results on the latest revision. Once you have rerun the same
test on an old revision, you can do the same comparison as you
would with a directed test. If the same test with the same seed
passes on an older revision, then you know a regression has
occurred. If the same test and seed has always failed, then you
know this is a new test. This new test may in turn either be
catching a new corner case, or it may be an illegal test. Either way,
you are now able to distinguish between new tests and regressions
in quality.
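As a rough illustration of the manual procedure, the sketch below walks backwards through older revisions and classifies the failure as either a regression or a new test; checkout and run_test are hypothetical helpers standing in for your own build and simulation scripts.

```python
# A minimal sketch of manual backtracking, assuming revisions are ordered
# oldest-to-newest and that checkout(rev) and run_test(name, seed) are
# hypothetical helpers wrapping your own build and simulation scripts.

def classify_failure(revisions, test_name, seed, checkout, run_test):
    """Rerun the exact failing test (same seed) on older revisions.

    Returns ("regression", first_bad_rev) if the test passed at some point
    in the past, or ("new_test", None) if it has failed on every revision,
    i.e. it covers a new scenario (or is possibly an illegal test).
    """
    last_pass = None
    # Walk backwards from the newest-but-one revision (the newest is known to fail).
    for rev in reversed(revisions[:-1]):
        checkout(rev)
        if run_test(test_name, seed):
            last_pass = rev          # found the last revision where this seed passed
            break
    if last_pass is None:
        return ("new_test", None)    # always failed: new coverage or an illegal test
    first_bad = revisions[revisions.index(last_pass) + 1]
    return ("regression", first_bad)
```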
Backtracking through older revisions used to be a manual process,
consuming expensive engineering time, but this has now been
automated in PinDown, the automatic debug tool. PinDown can
automatically debug any test failure, both random and directed,
down to the exact revision that caused the failure and send the
developers who caused the failure a bug report before the night's
regression has even finished.
Figure 2 shows how PinDown operates on the flow of random test
failures. The stream of random failures is split into regressions
and new tests, where the regressions are diagnosed down to the
exact revision that caused the problem and a bug report is sent to
the person who committed the error. This allows regression
errors to be fixed fast and thus allows the device and testbench to
maintain high quality.
The other category is new tests, i.e. tests that have always failed
and are consequently covering a new test scenario. These are not
failing due to a sudden regression in quality, which might lead to
panic and holding the release, but represent new test coverage,
which is overall positive news.
This setup solves the problem with using random tests in
regression testing. It allows you to keep running random tests,
with the upside of getting good coverage and without the downside
of not being able to identify regressions.
5. CHALLENGES WITH
BACKTRACKING
There are challenges with backtracking; it is not as straightforward
(or backward) as it may sound.
Figure 2 Random Testing with PinDown
The first challenge is random stability, a topic widely discussed as
it affects any debugging with random tests. Random stability is
the art of making the same seed always give you the same test,
even if the testbench has been updated. When you debug a test
failure you want to reproduce the same scenario by providing the
same seed number, and not get a new scenario, where the test may
not even fail, just because the testbench was updated. At one end
of the spectrum, the EDA vendors often claim that they have
perfect random stability, but at the other end of the spectrum it is
impossible to make such guarantees for major changes to the
testbench.
Random stability in the commercial tools has improved in recent
years. Some years ago a vendor, who shall not be named, could
not handle any changes to the testbench, not even comments,
without losing random stability, as the randomness was based on
the number of characters in a file. Luckily those days are over.
These days limited changes to the testbench do not affect random
stability unless you fiddle with the random generation itself or
change the structure of the entire testbench, e.g. instantiate more
modules with random generators or change the dependencies
between modules.
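To see why structural changes are the dangerous ones, consider the following toy Python example. It is not how any particular simulator implements random stability; it merely illustrates the principle that per-module seeding survives the addition of a new module, while a single shared random stream does not.

```python
# Toy Python illustration of random stability. Not how any particular simulator
# seeds its testbench; it just shows why per-module seeding survives structural
# changes while a single shared random stream does not.
import random

SEED = 42

def shared_stream(extra_module=False):
    # One shared stream: a new module added later consumes a value and shifts
    # everything drawn after it, so the same seed no longer gives the same test.
    rng = random.Random(SEED)
    if extra_module:
        rng.randint(0, 255)                       # draw made by the new module
    return [rng.randint(0, 255) for _ in range(4)]

def per_module_streams(extra_module=False):
    # Each module gets its own stream derived from the top-level seed, so adding
    # a module leaves the existing modules' stimulus untouched.
    streams = {name: random.Random(f"{SEED}-{name}") for name in ("dma", "bus")}
    if extra_module:
        streams["usb"] = random.Random(f"{SEED}-usb")
    return [streams["dma"].randint(0, 255) for _ in range(4)]

print(shared_stream(False) == shared_stream(True))            # False: the test changed
print(per_module_streams(False) == per_module_streams(True))  # True: the test is preserved
```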
How does random stability affect backtracking? Well, if you
encounter a pass on an older revision, this is probably because you
have reached a point before the error was introduced (which
allows you to point at the faulty revision), but the pass may also
be because testbench changes have changed the test to test
something else. Capturing the impact of limited testbench
changes is as important as capturing design bugs, but there is
always a risk that the test which passed with the same seed on an
earlier revision was actually producing a different test back then,
as random stability is not guaranteed for major changes. This
problem is bigger when the testbench is undergoing major design
changes and is reduced in the later stages of the project, when the
testbench is updated with smaller changes, such as constraint
changes. The fact that backtracking can help you narrow down the
problem is still very useful, especially with automatic backtracking
such as PinDown, as it can point to the exact revision of the
testbench where the test started to fail. If the commit message for
the faulty revision says something like "changed constraints to
solve an issue", then this revision probably introduced a real error
and the debug analysis was correct. If, on the other hand, the
commit message says "Changed the random generation for one
module", then this revision may not have introduced an actual
error, but just changed the test to test something completely different.
How often do bigger changes occur? Most changes are minor,
like constraint updates, whereas major revamps or new designs
come less frequently. Every big change is followed by a number
of small fixes. According to one paper, 90% of updates are less
than 10 lines of code. The better designed the testbench is, the
more randomly stable it will be, but in most systems the vast
majority of testbench changes will be minor and easily debuggable
by backtracking. But what happens if the debug goes wrong
because the exact same test is not reproducible on older revisions
due to a major change? Well, if the automated debug failed, you
are back to where you are now: manual debug. Automatic
backtracking is about improving productivity, and no damage is
done if there are cases where you still have to debug manually. As
long as the vast majority of all issues can be debugged
automatically, you get the much sought-after overall productivity
improvement.
A second challenge with backtracking is that it consumes time.
All debugging takes time, a lot of time, so this is nothing unique
to backtracking. However, the smarter you make the selection of
older revisions and tests, the faster you can backtrack through the
revision history. PinDown has an algorithm (patent pending, of
course) which does a very good job of this, but if you do
backtracking manually you should use your knowledge of the
design to carefully select the fastest test on a few good older
revisions in order to reach a conclusion fast, as sketched below.
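As one sketch of such smarter selection, the snippet below applies a binary search over the revision list (the same idea as git bisect), assuming the oldest listed revision passes, the newest fails and failures are monotonic in between; checkout and run_test are the same hypothetical helpers as in the earlier sketch.

```python
# Sketch of smarter revision selection: a binary search over the revision list
# (the same idea as git bisect). Assumes the oldest listed revision passes, the
# newest fails, and failures are monotonic in between; checkout() and run_test()
# are the same hypothetical helpers as in the earlier sketch.

def find_first_bad(revisions, test_name, seed, checkout, run_test):
    """Return the first failing revision using roughly log2(N) test reruns."""
    lo, hi = 0, len(revisions) - 1        # revisions[hi] is known to fail
    while lo < hi:
        mid = (lo + hi) // 2
        checkout(revisions[mid])
        if run_test(test_name, seed):
            lo = mid + 1                   # still passing here: the fault came later
        else:
            hi = mid                       # already failing: the fault is here or earlier
    return revisions[hi]
```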
6. CONCLUSION
Random tests are great to use in regression testing to get good
coverage, but they cannot distinguish regressions, i.e. dips in
quality, from improvements in coverage. This can be solved by
backtracking through older revisions, rerunning the failing test
with the same seed on older revisions in order to separate
regressions from tests that fail because they contain a new test
scenario. This process can be done manually or automated with a
tool such as PinDown. The issue of random stability means some
updates of the testbench will still need to be debugged manually,
but the vast majority of all test failures can be automatically
analysed. Identifying regressions quickly and automatically allows
you to maintain high quality, which in the end leads to an earlier
release.