Summary:

A component that used to work now doesn’t; the regression happened over a year ago and we’re not sure exactly when. I’m now considering replacing the broken, poorly architected component with a mature library.

Background:

I am working on a web application that interacts with a third-party API to import and export data. This is an optional feature, used instead of manually inputting data. Two years ago, I implemented an authentication system from scratch that uses Basic Authentication (i.e. generating the Basic Authentication HTTP header and including it with requests). I then spent two years at another company, but have now returned and am working on the same application again. While I was gone, someone introduced a regression somewhere in the authentication module: now you can’t connect, period. Nada. Zip.
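
For context, here is a minimal sketch of what such a hand-rolled Basic Authentication helper typically looks like; the names and module layout are hypothetical, not our actual component:

    import base64

    def basic_auth_header(username: str, password: str) -> dict:
        """Build the Authorization header for HTTP Basic Authentication.

        Hypothetical sketch of a hand-rolled helper; not the actual component.
        """
        credentials = f"{username}:{password}".encode("utf-8")
        token = base64.b64encode(credentials).decode("ascii")
        return {"Authorization": f"Basic {token}"}

    # The header is then attached to every request to the third-party API, e.g.
    #   requests.get(url, headers=basic_auth_header(username, password))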

And I don’t know when this started because I wasn’t here, but if I dug through the commit logs I could probably get a rough idea. At some point someone disabled the ability to connect to the third-party API because it didn’t work (yeah, I know… great solution -_-), so that change could narrow down when the regression happened and give me a lead on where it was introduced.

What now:

While investigating this, I discovered that there is a mature library for our technology stack written specifically for interacting with this third-party API. (As an aside, it supports both Basic Authentication and OAuth, and we would like to move to OAuth if possible, so switching to this library has nice side benefits.)

The current plan is to replace the non-functioning hand-rolled authentication component with the library, which may involve a database schema change and some non-trivial refactoring.


My question:

Is it important to identify the bug in the old authentication component before starting to replace it with the library?


Some thoughts:

  • why waste time finding the bug if we’re about to replace the component?
  • what if replacing the authentication component somehow doesn’t fix the problem? If I had taken the time to find the bug first, I would have known that before committing to the replacement

Other details:

  • there is no need to start using the library if the current code works, though the side benefits of the library are nice
  • the current authentication code is poorly architected and hard to maintain

Answer:

Well, it depends on how important it is for your company to get authentication working again with the existing component, and whether you can expect to do that significantly faster than replacing the component. If you can repair the system in two days, but replacing the old component with the new one would take two months, you should probably consider repairing it first. If you expect both to take a similar amount of time (or you think you can replace the old component in less time than it would take to find the error), the decision should clearly be to replace the component.

If the developers who maintained the software used your VCS regularly and strictly committed only compilable code, you have a good chance of finding the specific commit that introduced the error by “bisecting”. If you have a revision n1 which works, and a revision n2 > n1 which does not work, check out the code of revision n3 := floor((n1+n2)/2) and test whether it works. If it works, continue with (n3, n2); if it does not, continue with (n1, n3). Repeat this until you have found the one specific commit that introduced the error.
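
If the project history is in Git, the built-in git bisect command automates exactly this loop (and git bisect run can drive it with a test script). As a language-agnostic illustration, here is a sketch of the same halving logic in Python, where revision_works() is a hypothetical helper that checks out a revision, builds it, and tests whether connecting to the API succeeds:

    def find_breaking_revision(n1: int, n2: int, revision_works) -> int:
        """Given a working revision n1 and a broken revision n2 > n1,
        return the first revision that no longer works."""
        while n2 - n1 > 1:
            n3 = (n1 + n2) // 2
            if revision_works(n3):   # hypothetical: check out, build, test
                n1 = n3              # n3 still works, so the break happened later
            else:
                n2 = n3              # n3 is broken, so the break is at or before n3
        return n2                    # the first revision known to be broken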

You have to decide for yourself whether that works in your case; it depends on the size of your system, the checkout and build times, and your ability to test old revisions. My experience is that this can be a very quick way to locate the root cause of a bug. I dealt with a similar situation in the past where we knew a bug had been introduced roughly seven years earlier, and I wanted to know exactly when. Bisecting took me less than an hour to find the commit that caused the problem, and then it was easy to identify the faulty change to the code.

Another answer:

“My question is more aimed at whether it is important to find the offending bug – and if it is, why?”

There are a couple of advantages to finding (not necessarily fixing) the bug.

Without finding the bug, it is impossible to determine the root cause or to answer questions like these:

  • Was it a stupid typo by an overworked, tired intern during an all-nighter before a looming deadline? Or was it something much more nefarious, maybe an attempt at inserting a backdoor into the authentication code? Was it stupidity, negligence, an honest mistake, a misunderstanding? Did the developer in question just make a mistake, or do they not understand the system?
  • Why wasn’t the bug caught by unit tests, functional tests, system tests, regression tests, … during development? Don’t the developers run the test suites before committing?
  • Why wasn’t the bug caught by unit tests, … on the CI server? Did the developers disable CI?
  • Why wasn’t it caught during code review?
  • Why wasn’t it caught by the QA team?
  • Why wasn’t it fixed once caught? Again, do the developers not understand the system?
  • Why doesn’t your coding style prevent such kinds of bugs?
  • Is there a problem with your coding culture?

Once you have found the bug and understood its root cause, you can search for, track, eliminate, and prevent similar bugs.

  • Write a script, add a test, or add an analysis task to your static analyzer/linter/style checker that checks for (the absence of) not only this particular bug but all similar instances; a minimal sketch of such a regression test follows this list.
  • Change your coding style to prevent similar bugs (a simplistic example would be requiring checks of the form 0 == foo instead of foo == 0 to prevent the (in)famous if (userId = 0) backdoor).
  • Perform a system-wide code review to look for similar bugs.
  • Train developers in coding techniques to prevent similar bugs.
  • Adopt a coding culture that prevents this particular bug from recurring.
  • Change or augment the processes that led to this bug not only being introduced but also not being fixed.
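
Tying the first point back to the case at hand, here is a minimal sketch of such a regression test. It assumes the hypothetical basic_auth_header() helper sketched earlier and pins the exact header value for known credentials, so this particular class of bug fails loudly the next time it appears:

    import unittest

    from myapp.auth import basic_auth_header  # hypothetical module path

    class BasicAuthHeaderTest(unittest.TestCase):
        def test_header_for_known_credentials(self):
            # base64("user:pass") is "dXNlcjpwYXNz"
            header = basic_auth_header("user", "pass")
            self.assertEqual(header["Authorization"], "Basic dXNlcjpwYXNz")

    if __name__ == "__main__":
        unittest.main()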

An extreme example of this is OpenBSD. When they find a security bug in the core system, they perform a root cause analysis and then do a complete manual line-by-line review of the entire codebase, looking for similar bugs and revising any processes and coding styles that may have led to the bug. The security community in general is paranoid about this: whenever a new vulnerability is found in some piece of security-relevant code, not only will that library be thoroughly reviewed, but similar libraries as well. A security bug in Apache, for example, will generally also trigger a review of Nginx. It will also often lead to projects like Valgrind or Coverity trying to figure out how they can check for this and similar bugs, and then running their checks against a large body of open source code.

Now, whether or not this makes (business) sense is a question only you and your managers can answer. How important is it for the business to figure out this bug? How important is it for the confidence in the development team? Or the confidence of the developers in themselves?

Also, once you figure out what the bug was, depending on what exactly the root cause turns out to be, a lot of the things I listed may not make sense. For example, if it really was just an honest, stupid typo, doing a full-system code review to look for other typos is probably overkill. Some questions are still interesting, though, such as why the typo wasn’t caught by the compiler, the type system, the static analyzer, tests, or code reviews, and what can be done to ensure that next time it does get caught.
