Using Regression Isolation to Decimate Bug Time-to-Fix


Software defects generally fall into two primary classifications: defects that represent functional gaps in a new implementation, and defects that are regressions in the behaviour of the system caused by some change.  This paper discusses a major reduction in the Time-to-Fix for regressions.  In a previous role I led a team of engineers in implementing the system described here.

To improve the Time-to-Fix, we determined that we needed to resolve two problems that accounted for the majority of the time developers spent dealing with regressions:

  • Reduce the effort to isolate the regression.  Early bug analysis can be greatly accelerated if the guess-work about which change caused the regression is removed. Bisection identifies the regressing change with certainty; there is no guessing or supposition.  I have written on the use of bisection in preference to code diving in this blog post.
  • Ensure that the regression goes to the correct team the first time. Bouncing of the defect report between teams is reduced by identifying the engineer who introduced the regression and ensuring that it is fixed by that person or team.  In particular, breaking the reproduce, debug, re-queue cycle prevents days of wasted engineering time.

We implemented and deployed the system described between 2008 and 2010. Read on for more information.

Background Information on the Development Cycle

The codebase in question was complex: 25 million lines of code, 250 checkins each day, 5 operating systems, 300 active products, 500 active developers across 5 countries, and over 50 releases each year.

The release streams were very structured.  There was a monthly major and a weekly minor release tempo.  Releases were branched from a common mainline integration branch; the integration branch was fed by subsystem streams, and the subsystems were fed by subcomponents.  The codebase was very dynamic, with many changes, merges and backports entering the development streams on an ongoing basis.   Below is a representation of the typical development flow over the codebase.

Development and Release Streams
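
To make the branching relationships concrete, below is a minimal sketch of the stream topology as a data structure. The depot paths and stream names are hypothetical, not the actual layout, and the real hierarchy had many more subsystems and subcomponents.

```python
# Hypothetical sketch of the stream topology described above.  Changes flow
# "downstream": subcomponent streams feed subsystem streams, subsystem streams
# feed the mainline integration branch, and releases are branched from mainline.
FED_BY = {
    "//depot/release/2009.06": ["//depot/mainline"],
    "//depot/mainline": ["//depot/subsys/graphics", "//depot/subsys/platform"],
    "//depot/subsys/graphics": ["//depot/subcomp/shaders", "//depot/subcomp/display"],
    "//depot/subsys/platform": ["//depot/subcomp/kernel"],
}

def upstream_streams(stream: str) -> list[str]:
    """All streams whose changes can eventually merge into `stream`."""
    result = []
    for parent in FED_BY.get(stream, []):
        result.append(parent)
        result.extend(upstream_streams(parent))
    return result
```

This feeding relationship is exactly what the cross-stream regression tracing described later walks backwards.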

Given the number of products and the release tempo of the development environment, it was impossible to provide full test coverage over all hardware and all functionality at each stage of development. The diagram below shows how sparse coverage at various stages of development can be combined to ultimately provide high levels of coverage over the complete development cycle.  With this layered approach, most functionality would get some coverage across some hardware at some point between initial development and the final release.  I'll further explore sparse test coverage in a future blog post.

Layered Testing for Full Coverage
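
As a rough illustration of why layered sparse passes add up (the per-stage coverage fraction below is an assumed number, not a measured one), the probability that a given product/feature combination is never exercised falls geometrically with the number of stages it passes through:

```python
# Illustrative only: assume each test stage independently exercises a fixed
# fraction of the product/feature/hardware combinations.
def combined_coverage(per_stage_fraction: float, stages: int) -> float:
    missed_by_every_stage = (1.0 - per_stage_fraction) ** stages
    return 1.0 - missed_by_every_stage

# e.g. 30% coverage per stage across five stages from development to release:
print(f"{combined_coverage(0.30, 5):.0%}")   # ~83% covered at least once
```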

A major impact of this test model is that the regressions identified in the release cycle may have been injected into the system a number of weeks or months previously.  In most organizations, this presents a huge challenge to manage, particularly if first-principles debugging is the primary way that an organization resolves regressions on release branches.

Overview of the Approach

Independent of any particular testing regimen, most regressions were identified by the heavy testing on the release branch.  Such a regression has avoided detection until late in the release cycle, often having been injected months previously.  Although this is a typical type of regression, it also represents a potentially large sink of engineering effort late in a cycle, when it can least be afforded.

Late Cycle Defect Injection

We took the diagram above, looked at a number of scenarios, and determined that we would need both new systems and new processes to support our aims.  The highlighted path represents a realized worst-case scenario for understanding a regression.

Systems Created

We created an automated regression isolation system that would allow us to identify the regressing change with little to no user interaction.

Automatic bisection was employed to isolate the regression within a particular development stream. This was supported by an archive of over six months of continuous-integration build artifacts representing the major active development streams.  More information on my views on using bisection for regression isolation can be found in some of my blog posts here.
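
A minimal sketch of the bisection loop over the archived builds is below; `fetch_build` and `test_passes` are hypothetical stand-ins for pulling an archived CI artifact and running the reproducing test, and the production system wrapped this loop with scheduling and result tracking.

```python
def bisect_regression(changelists, fetch_build, test_passes):
    """Binary-search an ordered list of changelists for the regressing one.

    Assumes `changelists` is ordered oldest to newest, the oldest build is
    known to pass and the newest is known to fail.  Returns the first
    changelist whose archived build fails the reproducing test.
    """
    lo, hi = 0, len(changelists) - 1            # invariant: lo passes, hi fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        build = fetch_build(changelists[mid])   # pull the archived CI artifact
        if test_passes(build):
            lo = mid                            # regression lies after mid
        else:
            hi = mid                            # regression lies at or before mid
    return changelists[hi]
```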

As the regression may need to be isolated across multiple release streams, we had to automatically switch between builds from the different development streams to isolate which subsystems, subcomponents or code changes contributed to the regression in each development stream. This was probably the most powerful part of the system.  Perforce provides rich merge information, so this was surprisingly easy.

In the example below, the regressing changelists (A and B) in both the main integration branch and the subsystem branch were found to be merges.  These merges allowed tracing to the next upstream development stream.  Ultimately, the regressing code change (C) was found in a subcomponent development stream.  I have removed the traversal of the release branches for brevity.

Original Integration Carried forward via Integration
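
The cross-stream hop relied on Perforce's integration records. The sketch below is simplified and the parsing is best-effort (real `p4 filelog` output has more record types than are handled here, and the depot paths are hypothetical), but it shows the idea: once bisection lands on a changelist that is a merge, the files it touched can be traced back to the upstream revisions they were integrated from, giving the next stream to bisect.

```python
import re
import subprocess

def upstream_sources(depot_file: str, rev: int) -> list[str]:
    """Best-effort list of the upstream paths a file revision was integrated
    from, using `p4 filelog`.  Parsing is deliberately simplified."""
    out = subprocess.run(
        ["p4", "filelog", "-m1", f"{depot_file}#{rev}"],
        capture_output=True, text=True, check=True,
    ).stdout
    sources = []
    for line in out.splitlines():
        # Integration records look roughly like:
        #   ... ... merge from //depot/subsys/graphics/foo.c#3,#5
        match = re.match(r"\.\.\. \.\.\. \w+ from (//\S+)", line.strip())
        if match:
            sources.append(match.group(1))
    return sources
```

From the upstream path and revision range, the same bisection loop can then be rerun against that stream's archived builds, repeating until the originating code change is found.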

Process and Workflow Improvements

A key driver for the success of the system was applying it through standard operating procedures.  This removed subjective judgement from the reporting of the regression. Removing judgement from the queueing removed a large amount of engineer animosity, or “There is no way it’s my problem. My code doesn’t do that. Go away” (an actual quote from an engineer).  There was no discussion; the analysis was performed with or without their support.

When any QE or QA test team supporting my team found a regression, they would immediately begin isolating it through the system described above.  This was by rule, with minimal judgement applied.  When the regressing changelist was isolated, the defect report would be created with a reference to the changelist and queued to the engineer responsible for the regression.  Since a single test execution took 5 minutes, a typical regression at the subcomponent level could be isolated in around 15-30 minutes, while a regression at the release level might take around 60 minutes, depending on where the regressing build sat in the development streams.
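
Those times fall straight out of the binary search: each halving of the candidate window costs one 5-minute test run, so isolation time grows only with the logarithm of the number of archived builds in play. A back-of-the-envelope check (the build counts below are illustrative, not measured):

```python
import math

TEST_MINUTES = 5   # one execution of the reproducing test, as described above

def isolation_minutes(candidate_builds: int) -> int:
    """Approximate bisection cost: one test run per halving of the window."""
    return math.ceil(math.log2(candidate_builds)) * TEST_MINUTES

print(isolation_minutes(50))     # ~30 min: a subcomponent stream's recent builds
print(isolation_minutes(4000))   # ~60 min: months of builds at the release level
```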

The issue would always be queued directly to the engineer who introduced the regressing changelist.  This engineer would typically have the best understanding of the reasoning and implementation behind the change, and hence was also the best person to resolve the issue.  The pain of receiving regressions also helped engineers improve their code quality and reduce future regressions.

We also placed time limits on how long an isolated defect could survive once known.  The only two choices were to reanalyze the regressing changelist and provide a solid fix, or to revert it. This forced the engineer to communicate well and provide estimates, or to provide direction for reverting the change.  If there was no or poor communication regarding the regression, a judgement call was made to revert the changelist if the secondary impacts were not high.

Quantitative Measures of Success

Regression Time-to-Fix: This measure represents the time from when the regression was first noted to when the issue was resolved.  Unfortunately, there was insufficient controlled data to determine a statistically relevant measure both before and after; however, the related Time-to-Offending-Engineer metric went from a typical 3-4 days to a typical 3-4 hours. Considering that problem characterization typically takes up 40 to 70% of the defect resolution time, the benefit to Time-to-Fix should be apparent.

Regression Injection Ratio: This measure is the number of identified regressions against the number of code changes (i.e., <# regressions>:<# code changes>). The industry average for our type of codebase was 3:10, and this was confirmed as a reasonable and accurate measure of general team performance across the organization.  Once my team had started taking this approach to regressions, the measure went from approximately 3:10 to around 1:10.

Qualitative Benefits Including Secondary Benefits

Once developers realized that poor quality code would ultimately come back to haunt them as their introduced regressions were isolated, they moved from passively supporting best practices to being clear advocates for capabilities that drive quality, such as automated pre-checkin testing, rigorous code reviews, and test development.

The developers also began to tangibly understand the benefits of other engineering best practices that increased their awareness of the code and consequently helped deliver higher quality code.  The practices that emerged almost organically were focused on architectural documentation, sufficiently analyzed and documented requirements for new features, and historical functional use cases for previously implemented features.

Barriers in Understanding and Adoption

Although I have presented a series of positive outcomes from the introduction of this approach to regressions, there were significant barriers to adoption.  Both of the barriers below can be overcome with patience and education.

In explaining the operation of the system to engineers, it was unfortunately all too common to find little depth of understanding of what should be well-known industry terms (regressions, CM, CI, etc.).  This increased the difficulty in getting engineers to understand and accept what was being put forward to them.  Unfortunately, I believe this is a general malaise of the software industry; I'll discuss aspects of this disparity in a future blog post.

Until trust in the results from regression isolation became commonplace, there was a lot of developer disbelief that either the system or the QA team had correctly isolated a problem to their change.  Most developers began to trust the system by the third or fourth correctly identified regression.  Having a visible and public record showing that you made a mistake is always a hard pill to swallow at first.

Final Thoughts and Comments

The improvements that resulted from the changes introduced by this system were truly astonishing.  Our initial expectation was that we would see some marginal improvements in Time-to-Fix and other quality metrics, but the measured improvements were gratifying.  The marked emergence of a culture of quality within the engineering team was also unexpected. It is clear that when faced with the hard truth of the quality of their own output, engineers will begin to move out of their comfort zone and look to achieve further mastery of their trade.

There are many extensions to the system that I hope to explore at some time in the future.  The opportunity to implement this system provided me far more insight into engineering behaviors than I had expected.  I strongly suggest all organizations look to implement similar systems.  Many startups are well positioned to bring in this sort of system and set a more solid basis for their development practices for years to come.

As always, comments (below), commentary (via reflagging) or feedback (in general) are eagerly welcomed.   What is your approach to regression management, and what have you found works well?
