I’m writing a math library and I’ve encountered trouble with floating point imprecision. I want to know if there are any standard workarounds that are used often. I have doubts about my current method and my lack of knowledge and terminology when it comes to these floating point errors emphasizes the doubt.

Specifically, if I want to check the equality of two floats a and b, instead of a == b, I resort to doing a check abs(a - b) < precision where precision is a global variable that can be set by the user.

The issue is that I feel like exposing precision to the user is just passing on the issue instead of solving it. I’d prefer to have the precision as a constant, but I do not know of any way to find a value aside from performing random tests and using an arbitrary value that works in most cases.

Any advice is appreciated! Thank you for your help 🙂


The fixed comparison you suggested is feasible,
but works best when we know the tested values
will fall within anticipated ranges that are not too broad.
If they could span many octaves (many logarithmic decades)
then your users will be much better off with examining
relative error.
That’s the terminology you’re looking for.

There’s a whole
numerical analysis
literature on the topic, which might consume multiple
semesters of undergraduate study.
Cheat sheet: subtracting quantities of similar magnitude is trouble.
But there’s lots of details.

The advice that your users perhaps will really want relates to
condition number,
sometimes informally referred to as the “stiffness” of a problem.
If users are submitting ill-conditioned problems to your library,
then they should have reduced expectations about precision,
and you may be in a position to offer such feedback.
Techniques such as Lagrangian multipliers,
or solving a dual problem,
can sometimes let us cast the original problem in a different way
so it has improved conditioning.

Returning error bars to your caller might be helpful.
One approach to this is very simple:

  • choose some small Δ
  • solve f(x – Δ), f(x), and f(x + Δ)
  • use the differences between those solutions to advise your caller of the answer’s likely precision

Floating point precision is a very difficult problem, even if you can statically analyze all the source code of your library (and its direct and indirect dependencies).

And Rice’s theorem applies to this issue. In general getting a reliable bound on rounding errors of an arbitrary code is (probably) undecidable, or at least a very hard problem.

Specifically, if I want to check the equality of two floats a and b, > instead of a == b, I resort to doing a check abs(a – b) < precision > where precision is a global variable that can be set by the user.

Probably a check like abs(a - b) / (abs(a) + abs(b)) < precision when a and b are not bitwise equal makes better sense.

If your library is coded in C, consider writing your GCC compiler plugin (perhaps reusing some open source code borrowed from bismon) to analyze or prove (or estimate rounding errors in) your C code with the C code of the user base.

Read more about IEEE 754 (it has NaNs…)

More important than checking precision is to make calculations precise in the first place.

Errors are relative to the size of the result. So adding up terms from smallest to largest has smaller error. Calculating two large values and then their difference gives you large relative error. A bit of mathematics often helps, like replacing sqrt(x+1) -sqrt(x) with 1 / (sqrt(x+1) + sqrt(x)).

Re terminology. The word precision is often confused with accuracy. In your case you may want to refer to precision as either float (32bit IEEE) precision representation or double-precision (64bit) representation. For each, you can then define a minimum accuracy value for making such comparisons. In this way, you have clear terminology for saying what you mean and a clear strategy for defining the minimum accuracy for each precision.

Another aspect of this problem is scaling, as mentioned already. If you want to support motion over the domain of the data, with consistent accuracy, then it is possible to achieve invariance (or close to) to scale by using a floating origin. e.g., always set the floating origin to the minimum of a, b, before performing any calculations/logic. This enables greater accuracy because the resolution of floating point values is highest near the origin.

An example of achieving extreme scale (Solar System) with single precision numbers is here: https://youtu.be/_04gv3CnjDU.

Fixed “near” comparison is one approach, but is has its own issues.

Basically, the amount of the precision of a floating point number is not a fixed value, which means that different sizes of numbers have different amounts of error. A number to the power of 100 for example, has a much larger error range than a number to the power of -100. Your simplistic solution is a good “first” approach, but it won’t work well outside of a narrow range of values.

In addition, error on a number should be tracked through the computation. If I add 20 numbers together, the error range isn’t constant, and if I multiply 20 numbers together, the error range doesn’t grow linearly.

The real answer to all of these questions is “measurement error analysis”, but it’s a non-useful answer because in effect, you are asking a question about measurement error analysis.

Assuming you believe the inputted value is 100% correct, then you can automatically know the error of the error of the floating point representation. This error is basically the distance to the next valid floating point number.

The first bit (bit 0) of a floating point number is the sign bit (0 = positive, 1 = negative). This has nothing to do with distance, so we skip it.

The next 8 bits (bit 1-9) is the exponent of the number, in 2’s complement form. This provides a range of -128 to 127 for the exponent.

The remaining 23 bits is the fractional portion of the number’s mantissa. Every non-zero number in binary eventually has a 1, and the 23 bits represent every bit after that initial 1. For you to become familiar with these numbers, you need to realize that these are bits representing fractions, so the first bit is 1/2, the next 1/4, and so on.

This means that the 23 bit is 1/8388608 of the value of the “1” bit, and the error is (+ or -)(1/8388608)^(your float’s power).

Adding two numbers together is (A + B), but as we have errors, it is (A + err(a) + B + err(b)), this results in the answer’s error growing (A + B) with err(a)+err(b). Note that err(a) and err(b) are “plus or minus” values, so the error range is “the smallest err(a) + the smallest err(b) to the largest err(a) + the largest err(b)).

Likewise, when multiplying you are really multiplying (A + err(a))(B + err(b)), leading to an error of B(err(a)) + A(err(b)) + (err(a))(err(b)). Again, this will have to be done with each err(x) being plus or minus, so and then capturing the output of the error in maximum upper and lower bounds). Similar computations can be done for division (A + err(a))/(B + err(b)).

So, if you want true error tracking through your computations, it won’t be enough to just limit you errors to the inputted values, you will also have to follow them through each computation the system makes.

True error analysis is quite laborious, so much of the world assumes the errors won’t accumulate to a significant value sufficient to alter the result. When you get ‘to the next layer’ of the field, you start to apply statistical tools to the error, often applying a statistical distribution to the initial error of a measurement and then tracking the error distribution fields through the computations.