There is a rather large data flow engine – more than 2000 different flow definitions of “what to do with inbound data”. The engine deals with various data formats (flat-file, CSV, JSON, XML, or even binary), performs filtering, transforms data formats, etc.
The flow engine consists of a number of libraries and tools involved in the processing. Some are 3rd-party, like Saxon for XML transformations or Jackson for JSON parsing, and there is also a variety of in-house converters, filters, etc.
Naturally, there is a need for deploying new versions of the libraries and tools (new features, security fixes, etc.). This is a risk since correctness of the processing is business-critical.
Simple tests, such as unit testing the new feature or integration testing with a fixed set of input data, are not enough. There have been regressions that first appeared several days after deployment. For example, a very rare combination of numeric values exercised a formatting step in a way that triggered a Saxon optimization bug, which caused the formatting to be omitted.
The currently employed method of testing is to compare the new and current versions “online” for several weeks. In effect, a “copy & divert” stage is added so that inbound data is processed by both the current and the new version of a tool/utility, and the results are then compared. This is very cumbersome, time-consuming, and potentially risky.
I’ve been thinking about a more effective way of doing this. Any ideas?
There is only so much you can do, and it sounds like you are already doing most of it. In particular, your “copy & divert” method already goes above and beyond what most people do. I’ve only heard of that being used in extreme situations like spacecraft. One thing you could do, instead of running this comparison online, is to record the inputs and the outputs of the current version, then replay them against the new version offline all at once.
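To make the record-and-replay idea concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the `Recorded` tuple, the `replay` helper, and the two toy "versions" stand in for whatever your engine actually stores and runs. The candidate contains a deliberate bug so that the replay has something to find.

```python
from typing import Callable, Iterable, List, NamedTuple, Tuple


class Recorded(NamedTuple):
    """One captured production case (hypothetical record layout)."""
    flow_id: str
    payload: str
    expected: str  # output produced by the current, trusted version


def replay(records: Iterable[Recorded],
           candidate: Callable[[str, str], str]) -> List[Tuple[Recorded, str]]:
    """Run the candidate over recorded inputs and collect every mismatch."""
    mismatches = []
    for rec in records:
        try:
            actual = candidate(rec.flow_id, rec.payload)
        except Exception as exc:  # a crash is also a difference worth reporting
            actual = f"<error: {exc!r}>"
        if actual != rec.expected:
            mismatches.append((rec, actual))
    return mismatches


# Toy stand-ins for the current and the new version of a formatting step.
def current_version(flow_id: str, payload: str) -> str:
    return "{:.2f}".format(float(payload))


def candidate_version(flow_id: str, payload: str) -> str:
    value = float(payload)
    # Deliberate bug for illustration: whole numbers skip the formatting.
    if value == int(value):
        return str(int(value))
    return "{:.2f}".format(value)


# "Recording": in production this would be written to durable storage.
inputs = ["1.5", "2.25", "3.0"]
records = [Recorded("demo-flow", p, current_version("demo-flow", p)) for p in inputs]

mismatches = replay(records, candidate_version)
for rec, actual in mismatches:
    print(f"{rec.flow_id}: input={rec.payload!r} expected={rec.expected!r} got={actual!r}")
```

The point of the design is that weeks of recorded traffic can be replayed in one batch run, away from production, and re-run as often as you like against each new candidate.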
You may already be doing this, but you should definitely try to create or move particular tests to earlier stages when you detect regressions. If you find a problem in production, write an integration test so you can detect similar problems in the future. If an integration test finds an issue, try to write a unit test that detects it.
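As a rough illustration of pushing a production finding down to the unit level, the sketch below pins captured inputs as a table of regression cases. `format_amount` is a hypothetical stand-in for the unit that misbehaved, and the cases are invented; in practice each entry would come from a real incident together with the output you agreed was correct.

```python
import unittest


def format_amount(value: str) -> str:
    """Hypothetical stand-in for the unit under test."""
    return "{:.2f}".format(float(value))


# Inputs captured from incidents, paired with the agreed-correct output.
# (Invented values for illustration.)
REGRESSION_CASES = [
    ("3.0", "3.00"),
    ("1000000", "1000000.00"),
    ("-0.5", "-0.50"),
]


class RegressionTests(unittest.TestCase):
    def test_captured_cases(self):
        for raw, expected in REGRESSION_CASES:
            # subTest reports each failing case separately instead of
            # stopping at the first one.
            with self.subTest(raw=raw):
                self.assertEqual(format_amount(raw), expected)


def run_regression_suite() -> unittest.TestResult:
    """Load and run the regression tests, returning the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(RegressionTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_regression_suite()` runs the table; adding a new incident is then a one-line change.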
The other thing I didn’t see you mention is fuzz testing or QuickCheck-style (property-based) testing. With these, the computer generates test cases for you, including ones you may not have considered.
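Here is a hand-rolled sketch of that idea using only Python's standard `random` module, combined with the old-versus-new comparison you already do. It is not a real property-testing library (a library such as Hypothesis would add things like shrinking of counterexamples); the generator, the two toy versions, and the deliberate bug are all assumptions made for the example.

```python
import random
from typing import Callable, Optional


# Toy stand-ins for the current and the new version of a formatting step.
def current_version(value: float) -> str:
    return "{:.2f}".format(value)


def candidate_version(value: float) -> str:
    # Deliberate bug for illustration: whole numbers skip the formatting.
    if value == int(value):
        return str(int(value))
    return "{:.2f}".format(value)


def random_amount(rng: random.Random) -> float:
    """Generate a mix of ordinary values and hand-picked edge cases."""
    kind = rng.randrange(3)
    if kind == 0:
        return float(rng.randint(-1000, 1000))  # whole numbers
    if kind == 1:
        return rng.uniform(-1e6, 1e6)           # arbitrary fractions
    return rng.choice([0.0, -0.0, 0.01, 999999.99])


def find_counterexample(old: Callable[[float], str],
                        new: Callable[[float], str],
                        trials: int = 1000,
                        seed: int = 0) -> Optional[float]:
    """Return the first generated input on which the two versions disagree."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        value = random_amount(rng)
        if old(value) != new(value):
            return value
    return None


counterexample = find_counterexample(current_version, candidate_version)
if counterexample is not None:
    print("versions disagree on input:", counterexample)
```

The property being checked is simply "the new version agrees with the current one on every generated input", so no hand-written expected outputs are needed. Any counterexample found this way is a good candidate for a permanent regression test.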