Recently, I was charged with making about 9000 Selenium tests start running in CI/CD nightly. These tests had built up over about 8 years and had up until then been run in an ad hoc way. It was important that failures in the CI/CD be genuine failures, which meant that you could run the test as many times as you liked and it would continue to fail.

Anyway, my immediate observation was that lots of these tests were unreliable, in the sense that they would sometimes fail for all sorts of spurious reasons: test issues, like not waiting for something to be present and so passing only when that thing happened to be present coincidentally. Sometimes one test would change the state of the application and another test might rely on that being (or not being) the case, etc. There were basically arbitrarily complicated reasons a test could be flaky.

My ‘resolution’ was to introduce a lot of retry logic and add mechanisms by which the state of the application got reset between retries. To handle the overhead of all these retries, I introduced a huge amount of parallelism into the testing process, split across multiple deployments of the application to mitigate other limitations, and in this way managed to make the tests run in a time frame that was acceptable to the business.
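A minimal sketch of that retry-with-reset mechanism (the decorator name and the reset hook are invented for illustration; the real setup presumably drives Selenium and a deployed application):

```python
import functools

def retry_with_reset(retries, reset):
    """Re-run a flaky test up to `retries` times, resetting the
    application state before each retry."""
    def decorator(test):
        @functools.wraps(test)
        def wrapper(*args, **kwargs):
            last_error = None
            for attempt in range(retries):
                if attempt > 0:
                    reset()  # restore a known application state between attempts
                try:
                    return test(*args, **kwargs)
                except AssertionError as error:
                    last_error = error  # possibly spurious: try again
            raise last_error  # failed on every attempt: treat as genuine
        return wrapper
    return decorator

# Illustrative usage: a test that only passes once the state is reset.
state = {"dirty": True}

@retry_with_reset(retries=3, reset=lambda: state.update(dirty=False))
def test_checkout():
    assert not state["dirty"]

test_checkout()  # fails once, then passes after the reset
```

Note the trade-off the question describes: the harness hides the first failure, which is exactly the retry overhead that makes the suite slower than it needs to be.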

Everyone is happy now, in as much as the original task was achieved and the CI/CD is providing meaningful value and identifying genuine issues introduced.

I’m not so happy, because I want the tests themselves to be reliable, because unreliability is still harmful with respect to running these tests locally – and I want people to be able to identify if their changes will cause a failure before merging – and also because the tests would run around 8x faster without the retry overhead.

What is a process I can follow to actually iteratively achieve this? It’s hard to present a genuine business justification. My point about running the tests locally isn’t much of an argument, because no one does that anyway (because they’re unreliable, so it is a bit circular). And, as I say, the business is happy with the time, and the design of things is such that you can always throw more resources at the tests to make them faster.


I was in a similar situation a few years ago: the original external team had been asked to create tests, and had effectively only created acceptance-level ones, running the whole application and interacting with it. Over the next four years (at least) we hacked away at this giant. Here are some of the strategies we used; maybe some of them could be useful for you too:

  • Work out a way to figure out which tests are most unreliable. We had a dashboard with a simple HTML table. Each column was a run, and each row was a test. Each cell was either red and said “fail” or green and said “success” (you might want to use a more colourblind-friendly palette, like red and blue). Ordinarily this would’ve been useless because of the size of the table, but the trick was that the rows were sorted so that the tests which failed the most (that is, with the biggest number of “failed” cells in a row) were on top. So over time you could trivially see which were the most unreliable. Correlate that with which tests are most important, and you have yourself a priority list, an excellent way to get management’s attention. You’ll also very quickly identify any tests which never succeed.
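The row-sorting trick from the first bullet can be sketched like this (the test names and run data are invented):

```python
from collections import defaultdict

def flakiness_table(runs):
    """runs: one dict per CI run, mapping test name -> passed (bool).
    Returns (test, outcomes) rows with the flakiest tests on top."""
    outcomes = defaultdict(list)
    for run in runs:
        for test, passed in run.items():
            outcomes[test].append(passed)
    # Sort by number of failures, descending: this is the priority list.
    return sorted(outcomes.items(),
                  key=lambda row: row[1].count(False),
                  reverse=True)

runs = [
    {"login": True,  "search": False, "checkout": False},
    {"login": True,  "search": True,  "checkout": False},
    {"login": False, "search": True,  "checkout": False},
]
for test, results in flakiness_table(runs):
    print(test, ["ok" if passed else "FAIL" for passed in results])
```

Rendering each row as coloured HTML cells is then straightforward, and a test that never succeeds shows up immediately as an all-"FAIL" top row.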
  • The table above could also be used to figure out which tests failed or succeeded together. This wasn’t super common, but would usually indicate some strong interdependency.
  • If you can, try various test runner configurations. Flaky tests often fail more reliably if there’s very little or very much of any particular resource (RAM, CPU, disk speed, network speed). Knowing which resource configurations cause particular tests to fail can be a useful heuristic for figuring out where the problem lies. For example, if a slow network connection causes a test to fail reliably, then there’s probably a too-small timeout somewhere, either in the system itself or in the test.
  • Move down the test pyramid! Changing an acceptance test to an integration test, or an integration test to a unit test, is generally going to speed up the test several orders of magnitude.


I’m adding this answer to address a different point that I think is more important than the specifics of how you could’ve written the codebase (which is important too, but it feels like you’re capable enough of figuring that part out).

An umbrella salesman knows that they can sell umbrellas for a higher price while it’s raining. This is why, e.g., swimming pools are more expensive in summer and heaters are more expensive in winter; people have a more urgent need for what you’re offering and will pay a premium for it.

When you quick-fixed the issue, you took away the sense of urgency that helped you sell your proverbial umbrella to management.

You’ve made the classic developer error of quickly patching the problem and thinking that management would still care about the technical elegance of it all. They cared about it for exactly how long they actually had a problem.

Once the problem was fixed, they stopped caring. They have no interest in fixing the problem in a different way. Why would they, it’s already fixed! (I’m aware that there’s more nuance here, but my point is that management is either not aware or actively does not care).

The best way to use this experience is as a lesson in why shortcuts around best practice aren’t just short-term bad decisions: they get enshrined as long-term ones.

This is why I have stopped giving detailed estimates. I no longer tell my manager that it’s going to take “two days of coding and one day of testing”, because they’re going to hear “two days of coding” and ignore everything else because it doesn’t directly benefit them. Instead, I tell them it’s three days of development work. That’s not a lie, I genuinely believe that writing tests is a necessary part of development work. I’m merely avoiding giving them more granular knowledge than what they need to know.

Similarly, I don’t do quick and dirty fixes anymore, unless I get management to explicitly sign on to doing the proper fix straight after. If they don’t, or they agree to it and then renege on that deal, I will not do another quick and dirty fix, because quick and dirty fixes are caused by managers (who dole out unreasonably short deadlines) and the repercussions are only felt by the developers (who have to work in a codebase of degraded quality).

Note also that I’m not blaming non-technical management for not making technical decisions. That’s why they hired the technical people in the first place. But management is there to make decisions based on the information that’s put before them, and if you keep opening the door to bad practice for them, they will assume that it’s a viable option.
A doctor wouldn’t be allowed to suggest medically unsound practices to a patient, because the patient will usually assume that the doctor only offers reasonable medical advice. The same principle applies to software development: non-technical people cannot make a technical judgment call, and presenting negatively impactful technical decisions to them as valid options is a failing on your end as much as theirs (if not more, since this is not their expertise).


you could run the test as many times as you liked and it would continue to fail.

That’s called deterministic.

Sometimes one test would change the state of the application and another test might rely on that being (or not being) the case, etc. Basically arbitrarily complicated reasons a test could be flaky.

Parallelizable.

unreliability is still harmful with respect to running these tests locally – and I want people to be able to identify if their changes will cause a failure before merging

Regression.

What I don’t hear you saying is that anyone writes unit tests. Everything you’ve said you want is something a good suite of unit tests can give you.

Important to understand here is that a unit isn’t only a single class. It can be many classes. Classes that even when used together are still deterministic, parallelizable, and fast.

A key thing about these tests is they don’t depend on global state, file IO, something on the network, or the DB. Those requirements limit what the unit can be. But they ensure that the tests are reliable without having to put each test in its own separate universe.
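As a sketch of such a multi-class unit (all names here are hypothetical), the clock and the persistence layer are injected, so the test touches no global state, file IO, network, or DB:

```python
class Cart:
    """A 'unit' made of several classes: its collaborators (clock and
    store) are passed in rather than reached for globally."""
    def __init__(self, clock, store):
        self.clock, self.store, self.items = clock, store, []

    def add(self, item, price):
        self.items.append((item, price))

    def checkout(self):
        order = {"total": sum(price for _, price in self.items),
                 "placed_at": self.clock()}
        self.store.save(order)
        return order

class FakeStore:
    """In-memory stand-in for the DB: deterministic and parallelizable."""
    def __init__(self):
        self.saved = []
    def save(self, order):
        self.saved.append(order)

# Deterministic test: fixed clock, in-memory store, no separate universe needed.
store = FakeStore()
cart = Cart(clock=lambda: "2024-01-01T00:00:00", store=store)
cart.add("book", 12.5)
cart.add("pen", 2.5)
order = cart.checkout()
assert order == {"total": 15.0, "placed_at": "2024-01-01T00:00:00"}
assert store.saved == [order]
```

The same three classes exercised together are still one fast, reliable unit test.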

However, not all code is ready for this. And making code unit testable is no small thing. But testable code can solve the same problems legacy code can.

Michael Feathers gave us a set of rules to define what unit tests are. They are not the same as integration tests. What you run them with doesn’t matter. What matters is what they do.

What you want is possible. But it may take a lot of work.

Anyway, my immediate observation was that lots of these tests were unreliable, in the sense that they would sometimes fail for all sorts of spurious reasons: test issues, like not waiting for something to be present and so passing only when that thing happened to be present coincidentally. Sometimes one test would change the state of the application and another test might rely on that being (or not being) the case, etc. There were basically arbitrarily complicated reasons a test could be flaky.

This sounds like you already know what to do:

  • change the tests so they don’t rely on something which happens coincidentally – if necessary, implement some protocol in the application (for example, based on some logging or some event mechanics) which allows the test code to monitor the application state more directly. The test should be able to find out if this “something” is still pending, will happen, or will definitely not happen.

  • change the tests so individual tests do not depend on each other. If a test requires a specific application state, make establishing it part of the initialization code, in a mostly self-contained manner. If necessary, add features to the application to bring it into a specific state, even if those features exist only for testing purposes.

  • or, in short, avoid tests having arbitrarily complicated reasons to be flaky
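The second bullet might look like this in practice (a deliberately simplified sketch; the helper and the state it builds are invented):

```python
def fresh_app_state():
    """Hypothetical initialization helper: builds the exact state a test
    needs from scratch, instead of inheriting whatever a previous test left."""
    return {"users": {"alice": {"logged_in": False}}}

def test_login():
    app = fresh_app_state()  # self-contained setup
    app["users"]["alice"]["logged_in"] = True
    assert app["users"]["alice"]["logged_in"]

def test_fresh_session_is_logged_out():
    app = fresh_app_state()  # does not care whether test_login ran first
    assert not app["users"]["alice"]["logged_in"]

# Either order works, because neither test depends on the other's state.
test_login()
test_fresh_session_is_logged_out()
```

In a real Selenium suite the helper would call the application's test-only setup endpoints rather than build a dict, but the shape is the same: every test begins from a state it constructed itself.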

Of course, the issue with this approach is that you probably have to go through all of the 9000 tests one by one and check if and how each has to be fixed. For this, a systematic prioritization, as suggested in this answer, will definitely be a huge help.

You may also consider splitting the tests into those which can be run locally, because they are fast and reliable, and those which can only run on the CI server, because they are expected to be slow. That way, you may be able to motivate the team to run the local tests more often, getting earlier feedback.

Remember how to eat an elephant: one byte at a time.


Interesting problem, but one detail caught my eye: There are genuine failures which do not occur every time you run the test (timing issues, race conditions, memory corruption etc.). In fact, those are the gnarliest because they can be hard to reproduce but are still intolerable in the field. This class of errors may be masked by a “retry until we succeed” strategy.

Your tests now reliably find certain errors: When a test fails even after many retries, you are sure the software is faulty.

The opposite is not true: a test which passes after a number of retries does not indicate error-free software.

That is a good reason for cleaning up the tests, and one that should convince your management.

Consolidating the integration tests that rely on specific states can reduce their resource load, freeing up those resources for other, later tests.

This may not solve the entire problem; it mostly focuses on:

Sometimes one test would change state of the application and another test might rely on that being/not being the case

To me, the way to solve this is to have Test B, which relies on the state from Test A, simply re-apply all of Test A’s steps before doing Test B proper.

Whether you solve it by making Test A a superclass of Test B (i.e. TestB extends TestA), or by putting the core of Test A’s code into a function that Test A calls and that Test B can call too (by having both import the same module or extend the same base class), could be up for debate. I’d probably go with the superclass approach, because as a result you can later gate Test B on Test A succeeding – and not run it at all if Test A fails (or run Test B first, and only run Test A separately if Test B fails). The module/method import way would allow you to reuse the function across a lot of tests, which could be useful when they need to be modified in a way that doesn’t reflect a change in the individual contract, but it will probably make it harder to gate tests on other tests succeeding.
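A minimal sketch of the superclass approach (the workflow and names are invented): Test B re-applies Test A’s steps through an inherited method, so it never depends on run order:

```python
import unittest

class TestA(unittest.TestCase):
    """Hypothetical workflow test: sets up an account."""
    def run_steps(self):
        self.state = {"account": "created"}

    def test_account_created(self):
        self.run_steps()
        self.assertEqual(self.state["account"], "created")

class TestB(TestA):
    """Needs Test A's state, so it re-applies Test A's steps via the
    inherited run_steps() instead of depending on execution order."""
    def run_steps(self):
        super().run_steps()            # replay Test A's workflow first
        self.state["order"] = "placed"

    def test_order_placed(self):
        self.run_steps()
        self.assertEqual(self.state["order"], "placed")

suite = unittest.TestLoader().loadTestsFromTestCase(TestB)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Because TestB inherits TestA, its suite also re-runs Test A’s own check, which is the gating property described above: a failure in the shared steps surfaces in both places.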

The advantage here can be for all parties:

  1. If a developer makes a change to the workflow of Test A, they only have to change Test A, and then they can run Test B to see if everything still lines up – and effectively test both cases at once.
  2. For business stakeholders, more tests can be added for other, newer features using the existing parallel testing pool resources, getting more value for their total usage.
  3. Even if not gated, knowing that a given test failed clarifies whether all the red “Failed” states stem from one given test (and are therefore less developer work to fix), or whether there’s an issue indicating flakiness, because Test B shouldn’t have succeeded if Test A failed (given that their shared test code explicitly expresses that dependency). That information helps with triage when a later sprint meeting tries to focus on the important bugs.

If the team still wants to keep the “Retry fix” as well, that can also work – but hopefully the buy-in can be on “We’re only expending additional resources on the tests when we know that .”.

Flaky tests are a reality, unfortunately.

Very often the source issues of the problem are common to multiple tests, and it’s important to attack those first.

I have seen tests fail because they ran at a particular time of day. Time-based test flakiness is an infamous source of problems. It’s important to use proper mocking to make sure the tests run at a specific “time” – as far as the tests are concerned.
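One common way to get there, sketched with an invented time-dependent function, is to inject the clock rather than reading it directly (the same effect can be had with `unittest.mock.patch` when injection isn’t possible):

```python
import datetime

def is_business_hours(now_fn=datetime.datetime.now):
    """Hypothetical time-dependent logic; the clock is injectable so tests
    can pin the current time instead of using the real wall clock."""
    return 9 <= now_fn().hour < 17

# Deterministic regardless of when CI happens to run:
assert is_business_hours(lambda: datetime.datetime(2024, 1, 1, 10, 0)) is True
assert is_business_hours(lambda: datetime.datetime(2024, 1, 1, 3, 0)) is False
```

Without the pinned clock, this test would pass during the workday and fail in a nightly run, which is exactly the kind of flakiness described above.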

Introducing delays generally just adds more complexity and flakiness to an already bad problem. In programming, if you have to code something like “wait 2 seconds because it usually takes a while”, that is when you need to step away from the keyboard and rethink your entire approach. You are digging a deeper hole with a bandaid, to mix metaphors.
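The usual replacement for a blind sleep is a poll-until-condition helper, along these lines (a generic sketch, not tied to any particular test framework; Selenium’s own explicit waits work the same way):

```python
import time

def wait_until(condition, timeout=10.0, interval=0.05):
    """Poll `condition` until it returns True or the timeout elapses.
    Replaces 'sleep 2 seconds and hope' with an explicit condition."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

# Illustrative use: wait for something that appears after ~0.2s.
appeared_at = time.monotonic() + 0.2
assert wait_until(lambda: time.monotonic() >= appeared_at, timeout=2.0)
assert not wait_until(lambda: False, timeout=0.2)
```

The test now waits exactly as long as it needs to, and a genuine failure shows up as a timeout rather than as a race that sometimes goes the right way.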

Each test suite is a framework of your own making, reflecting the company’s very specific needs. There are libraries, utilities, and patterns, and it’s important to try to solve the flakiness by leveraging those.

Meaning, if you have a test that fails due to the timing of something else, figure out how to make the test reliably wait for that output, and then spread that technique to all the other problem tests.

Finally, I mark flaky tests as such, and then I run “make tests-flaky” to run those specifically, maybe 100 times at a time using a shell command, to check if they are still passing.
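That repeated-run check can be automated with a small harness like this (a sketch; the flaky test here is simulated, and the `make tests-flaky` target above would wrap something equivalent):

```python
def pass_rate(test, runs=100):
    """Run a (possibly flaky) test many times and report its pass rate,
    to check whether a fix actually made it reliable."""
    passed = 0
    for _ in range(runs):
        try:
            test()
            passed += 1
        except AssertionError:
            pass
    return passed / runs

# A deterministic test passes every time:
assert pass_rate(lambda: None, runs=50) == 1.0

# A simulated flaky test that fails on every second call:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    assert calls["n"] % 2 == 1
assert pass_rate(flaky, runs=50) == 0.5
```

A fixed test should sit at a pass rate of 1.0 over hundreds of runs before its flaky marker is removed.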
