How does one add observability to python code without affecting code quality?


We’re working on an agile project, designing as we go, on a new Python command-line app / systemd service for some fancy in-house project. Right now we’re supposed to be adding an observability / logging layer to the program. According to the product owner and the guy on our team who knows the most about observability (not me), we’re supposed to send basically every function call to an OTEL collector so we can 100% recreate what the program did.

This is supposed to create hierarchical spans of work so we can use a dashboard to see what it did (it’ll basically look like a Gantt chart / timeline of all the work, so we can see bottlenecks or whatever).

I’m of the opinion that this is rather insane, and that if I log every function call I’ll ruin my app’s performance and denude the program of any semblance of thread safety (not that that has been a concern thus far).

Furthermore, it seems like, in order to create all these child spans for OTEL collectors, you need to have some global variables watching everything (which also makes multithreading a bigger chore and muddies up the whole program).

I’m not opposed to a global logger; since that doesn’t really keep any state, it doesn’t hurt too much. What kinds of patterns do people use when they’re adding observability to Python code that don’t:

  • add extra, unnecessary variables to every function
  • add shaky and unsafe read/write global variables
  • slow down the program to a screeching halt on network IO.

Right now I’ve got a pretty good rating with the linter and the complexity of the program is low. It seems like a violently abhorrent tradeoff to add observability to a relatively simple program just so managers can tick a box, but I don’t want to get into the philosophy of this; I just want to figure out a way to do it without making the program look like crap.

threading

You made some remarks about threading, for a GIL-bound Python program, which seemed surprising. Is this really a Flask app you’re describing? The context matters.

In any event, create per-thread data structures and log to those. Then you have the simpler problem of gathering the logs once per second or whatever and sending them to your central monitoring server. The familiar queue module is one candidate for that hand-off, but certainly there are others.
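
Here is a minimal sketch of that idea, using only the standard library. The names record() and GATHER_QUEUE are mine, and the batch size of 100 is arbitrary; the gathering side is picked up again under latency below.

    import queue
    import threading
    import time

    # Shared, thread-safe hand-off point; queue.Queue does its own locking,
    # so no hand-rolled global state is needed.
    GATHER_QUEUE = queue.Queue()

    # Per-thread buffer: each thread only ever touches its own list.
    _local = threading.local()

    def record(event):
        """Append an event to the calling thread's private buffer."""
        buf = getattr(_local, "events", None)
        if buf is None:
            buf = _local.events = []
        buf.append({"thread": threading.get_ident(), "ts": time.time(), **event})
        # Hand over a full batch; the size of 100 is arbitrary.
        if len(buf) >= 100:
            GATHER_QUEUE.put(buf)
            _local.events = []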

intrusiveness

add extra, unnecessary variables to every function

Right, we don’t want to uglify everything in the codebase.

We might decorate each function with @observe, which logs the input args and the {result, exception raised}. But that’s tedious, especially for two-line helpers. Probably better to create a function which, given a module, will iterate through its functions and decorate them en masse.
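
As a sketch, observe() and observe_module() might look like the following; it logs to a stdlib logger, but swapping in OTEL span creation has the same shape.

    import functools
    import inspect
    import logging

    log = logging.getLogger("observe")

    def observe(fn):
        """Log the input args and the {result, exception raised} of each call."""
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            log.debug("call %s args=%r kwargs=%r", fn.__qualname__, args, kwargs)
            try:
                result = fn(*args, **kwargs)
            except Exception as exc:
                log.debug("raise %s -> %r", fn.__qualname__, exc)
                raise
            log.debug("return %s -> %r", fn.__qualname__, result)
            return result
        return wrapper

    def observe_module(module):
        """Decorate, en masse, every function defined in the given module."""
        for name, obj in list(vars(module).items()):
            if inspect.isfunction(obj) and obj.__module__ == module.__name__:
                setattr(module, name, observe(obj))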

Hopefully much of your code consists of methods within classes, so define a class decorator which will iterate over the relevant methods and instrument them. An excellent example of this technique can be found in beartype. It’s very practical.
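
A sketch of the class-decorator version, reusing observe() from above; the OrderProcessor class is just an illustration.

    import inspect

    def observe_class(cls):
        """Class decorator: wrap every plain method defined on the class."""
        for name, member in list(vars(cls).items()):
            # Skip dunders; staticmethod/classmethod objects fail isfunction().
            if inspect.isfunction(member) and not name.startswith("__"):
                setattr(cls, name, observe(member))
        return cls

    @observe_class
    class OrderProcessor:
        def load(self, path):
            ...

        def process(self, order):
            ...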

latency

slow down the program to a screeching halt on network IO

Your instrumentation should not be awaiting network round trips. Append to a list or other data structure, and worry about flushing it to the network every few seconds as a separate problem. Then you have throughput concerns, not latency concerns.
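
For instance, a background thread can drain GATHER_QUEUE from the threading sketch above every few seconds; flusher() and send_to_collector() are placeholder names.

    import queue
    import threading
    import time

    def send_to_collector(batch):
        ...  # e.g. one bulk HTTP POST of the whole batch, or an OTLP export

    def flusher(stop, interval=5.0):
        """Every few seconds, drain whatever has accumulated and ship it."""
        while not stop.is_set():
            time.sleep(interval)
            batch = []
            try:
                while True:
                    batch.extend(GATHER_QUEUE.get_nowait())
            except queue.Empty:
                pass
            if batch:
                send_to_collector(batch)

    stop_event = threading.Event()
    threading.Thread(target=flusher, args=(stop_event,), daemon=True).start()

The hot path only ever touches in-memory structures; the network cost is paid once per batch, off to the side.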

efficiency

Take a look at tqdm. It reports progress events, but does a very good job of suppressing “too many updates” when there’s an event arriving every millisecond. Also take a look at the cProfile module, which does accounting on the number of calls, time spent in the function itself, and total time for the function plus its children.
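
A quick way to get those numbers for a representative run (Python 3.8+ for the context-manager form; main() here stands in for your program’s entry point):

    import cProfile
    import pstats

    # Accounts for ncalls, tottime (self) and cumtime (self + children).
    with cProfile.Profile() as profiler:
        main()                      # placeholder for your program's entry point
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)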

log every function call I’ll ruin my app’s performance

That is a cogent concern. So pay attention to measures like “elapsed time” and “CPU seconds” in order to suppress log records of trivial helpers. If it turns out that a helper is always uninteresting (sub-millisecond elapsed times), then monkey-patch it to un-instrument it.
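
One way to express that is a decorator which times each call and only records the slow ones; the one-millisecond cut-off and the record() helper from the threading sketch are assumptions.

    import functools
    import time

    SLOW_CALL_THRESHOLD = 0.001   # seconds; an assumed cut-off, tune to taste

    def observe_if_slow(fn):
        """Only record calls that took long enough to be interesting."""
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                if elapsed >= SLOW_CALL_THRESHOLD:
                    record({"name": fn.__qualname__, "elapsed": elapsed})
        return wrapper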

GUIDs

You didn’t mention IDs, so I will just make this explicit.

I would expect each unit of work to be tagged with a GUID, which follows it through the stack. And then each (large enough) sub-unit of work might get its own GUID, associated with the parent GUID.
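
contextvars is a convenient way to carry that parent/child relationship without a hand-rolled global; this sketch reuses record() from above, and unit_of_work() is my own name for it.

    import contextvars
    import uuid
    from contextlib import contextmanager

    # The GUID of the unit of work we are currently inside; contextvars keeps
    # this correct per thread (and per asyncio task) without explicit locking.
    _current_id = contextvars.ContextVar("current_id", default=None)

    @contextmanager
    def unit_of_work(name):
        """Tag a unit of work with a GUID, linked to its parent's GUID."""
        parent = _current_id.get()
        guid = str(uuid.uuid4())
        token = _current_id.set(guid)
        record({"name": name, "id": guid, "parent": parent})
        try:
            yield guid
        finally:
            _current_id.reset(token)

Nested with unit_of_work(...) blocks then reproduce the parent/child hierarchy your dashboard wants to draw.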
