Today I had a task to “write a health check” for a long-running service: an orchestration system that deploys a web-app.
I am trying to determine the scope of such a health check, and came up with these questions:
- Is it good enough to consider the service healthy if the orchestration system reports that the task is running?
- Or should we manually ping each service?
- Or should it go further and attempt to ensure that the web-app does what it is supposed to do, like show a web page?
- Does the healthcheck also have to check that dependent services are running, like a database or the orchestration system itself? Or is that the responsibility of another health check?
- And last of all, if one of the dependent services is dead, and the web-app subsequently fails, should the web-app report bad health, or is it good health, because it is not the web-app’s fault?
I know these are five separate questions, but they all relate to the scope of a health check for a long-running service that deploys a web app, so I thought it would make more sense to keep them grouped in a single question.
This is hard for me to implement because I am not sure of the definition of what is healthy, or what a standard health check for something like this should look like.
What should a health check for this specific service contain?
This is hard to implement because of the definition of what is healthy
You answered your own question here. The definition of a health check is going to vary, because what is healthy varies. It also depends on what is issuing the healthcheck.
A good question to ask yourself is, “from the perspective of the asker, is the checked service working as expected?” If the asker is you, you get to define it. If it’s another team/service, you need to identify what the standard/specification for healthchecks is.
Likely in a large organization, you will have some sort of standard for what a healthcheck should do. Figure that out.
Specifically here, in your web-app example the service should not report healthy, because the web-app isn’t healthy. But perhaps your definition of “healthy” would include this as “ok.” That is part of the requirements discussion above (again, even if it’s just your own code).
My recommendation, assuming it is not specified elsewhere, would be to have some sort of status code associated with different failures. When you query the web-app, it might return an error that says “dependent service is dead,” and so your client (or whatever is performing the healthcheck) can know why the service is unhealthy.
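As a minimal sketch of that recommendation: aggregate the results of several named checks into one payload with a per-check status and reason. All check names and status strings here are illustrative, not a standard.

```python
# Sketch of a health report that carries a status code/reason per failure,
# so the caller can see *why* the service is unhealthy. Names are made up.

def build_health_report(checks):
    """Run each named check callable; collect per-check and overall status."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = {"status": "ok"}
        except Exception as exc:  # a real service would catch narrower errors
            results[name] = {"status": "failed", "reason": str(exc)}
    healthy = all(r["status"] == "ok" for r in results.values())
    return {"status": "healthy" if healthy else "unhealthy", "checks": results}

# Example: the service itself responds, but a dependent service is dead.
def check_self():
    pass  # e.g. verify the main loop is alive

def check_dependency():
    raise ConnectionError("dependent service is dead")

report = build_health_report({"self": check_self, "dependency": check_dependency})
print(report["status"])                          # unhealthy
print(report["checks"]["dependency"]["reason"])  # dependent service is dead
```

The point of the structure is that both facts survive in the response: the service is not working as expected, and the reason is a dead dependency rather than the service’s own code.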
For the edited questions:
Is it good enough to consider the service healthy if the orchestration system reports that the task is running?
No. Just because a process is running does not mean it is not hung, totally nonfunctional, or broken in a large variety of other ways.
Or should we manually ping each service?
This might work, depending on the scope of your application’s functionality. If an “are you alive?” ping is all the verification you need, then this might be all that is required. But if the service could easily be “alive and responsive but not actually working,” then perhaps you need to check other things too.
Or should it go further and attempt to ensure that the web-app does what it is supposed to do, like show a web page?
Your healthcheck needs to ensure that the required functionality that is expected works as expected.
If your app returns “healthy” and cannot do what it needs to do, you might as well get rid of the entire healthcheck as it will give false positives (not to mention confuse the heck out of people trying to debug the problem – ‘hey our webserver shows healthy, why can’t we see the page?’).
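One way to avoid that false positive is to check not only the status code but also that the response looks like the page it is supposed to serve. This is a sketch; the status code, body, and marker string are placeholders for whatever your app actually serves.

```python
# Functional check sketch: "did we get the page we expect", not just
# "did the server answer". The marker string is an illustrative placeholder.

def page_looks_healthy(status_code, body, marker="<title>My App</title>"):
    """Healthy only if the request succeeded AND the expected content is present."""
    return status_code == 200 and marker in body

# A 200 response with the wrong body is exactly the false positive described
# above: "our webserver shows healthy, why can't we see the page?"
print(page_looks_healthy(200, "<html><title>My App</title></html>"))  # True
print(page_looks_healthy(200, "<html>Internal error</html>"))         # False
```

In practice you would feed this the result of an HTTP request to the deployed app; keeping the verification logic as a pure function makes it easy to test.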
Does the healthcheck also have to check that some dependent services are also running? Like a database or the orchestration system itself. Or is that the responsibility of another health check?
This depends somewhat. If your service depends on another service, the nature of that interaction should be reflected in the API/network calls your app sends to it, and incorporated into the healthcheck.
For example, a webserver reading from a database needs to have status information about the database built into it – or the web app will simply crash if the API calls fail. You can trivially modify these calls to be incorporated into your healthcheck.
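As a sketch of that database example: reuse the app’s own connection to run a trivial probe query, so the health check exercises the same path the page loads do. `sqlite3` stands in here for whatever database driver the app actually uses.

```python
# Sketch of folding a database probe into the health check. sqlite3 is a
# stand-in for the app's real database driver.
import sqlite3

def database_is_healthy(conn):
    """Run a trivial query over the app's own connection; if this fails,
    the API calls the web app makes would fail too."""
    try:
        conn.execute("SELECT 1").fetchone()
        return True
    except sqlite3.Error:
        return False

conn = sqlite3.connect(":memory:")
print(database_is_healthy(conn))  # True
conn.close()
print(database_is_healthy(conn))  # False: probe fails once the connection is gone
```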
However, if your service is sending events to consumers which listen, without any validation, then it is less important to the functionality of your app that the consumers are alive. “Healthy” to your app is sending the messages, not actually receiving them.
Basically, if your service needs to talk with other services and verify their health anyway, it makes sense to include at least a basic level of dependency checking in your service’s healthcheck. This should make sense conceptually, given what I just said: your application will already be handling this (or randomly crashing, I guess).
And last of all, if one of the dependent services are dead, and the web-app subsequently fails, should the web-app report a bad health, or is it good health, because it is not the web-apps fault?
This is basically answered above. My recommendation would be to have your healthcheck return a code/message/whatever that gives this information. Both pieces of information are important: that the dependent service your service needs is dead and that your service will not work as expected as a result.
Generally a health-check just means “is it alive and is it responding”. Further checks than that are highly specialised and depend entirely on the use of the system. Whether you go the extra mile to check that a system is processing requests correctly is up to you, but you should do the basics first – check it’s there, check it can receive requests and will return a response.
The easiest way to implement a health check is to simply write a command that the service processes using the same mechanism other commands use, and that does nothing but return an acknowledgement. That will show liveness, and that the system is receiving and processing requests.
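A minimal sketch of that idea: route a do-nothing “ping” command through the same dispatch path as real commands, so a reply proves the service is receiving and processing requests. The command names and dispatch shape here are illustrative.

```python
# A "ping" command handled by the same dispatcher as real commands, so an
# acknowledgement demonstrates the whole request-processing path works.

def handle_deploy(payload):
    return {"status": "started", "app": payload}

def handle_ping(payload):
    return {"status": "ack"}  # does no work; getting here is the whole point

HANDLERS = {"deploy": handle_deploy, "ping": handle_ping}

def dispatch(command, payload=None):
    """Single entry point that every command, including the health ping, goes through."""
    handler = HANDLERS.get(command)
    if handler is None:
        return {"status": "error", "reason": "unknown command"}
    return handler(payload)

print(dispatch("ping"))  # {'status': 'ack'}
```

Because the ping uses the same mechanism as everything else, a successful acknowledgement rules out a hung or wedged command loop, which a bare process check cannot do.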
Checking dependent systems is not part of the health check; you need to keep it simple and self-contained. Add a health check to each dependent service in turn. That way you can get a list of running, healthy systems and easily tell when one goes bad, and which one it is!
In my experience, critical services tend to have the following features:
- Heartbeat: if the service runs on a regular basis, this just writes a line to a log file or similar, along with a timestamp, to indicate that the service body kicked in at a given time.
- Breadcrumbs: similar to the above, breadcrumbs are usually just a dump of the method name (and occasionally parameters) to show that the service is processing the service body as expected and whereabouts in the flow it is. Since these can generate more output, they are commonly controlled by config files or similar so they can be turned off once the service has bedded in.
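Both features can be sketched in a few lines. This is an assumption-laden illustration: the config flag, function names, and in-memory log list all stand in for whatever logging and configuration the service really uses.

```python
# Sketch of a timestamped heartbeat plus breadcrumbs that a config flag can
# silence once the service has bedded in. All names here are illustrative.
import datetime

CONFIG = {"breadcrumbs_enabled": True}
log_lines = []  # stand-in for a real log file

def heartbeat():
    """One timestamped line per run, proving the service body kicked in."""
    now = datetime.datetime.now().isoformat()
    log_lines.append(f"{now} heartbeat: service body started")

def breadcrumb(method_name, **params):
    """Record where we are in the flow; cheap to turn off via config."""
    if CONFIG["breadcrumbs_enabled"]:
        log_lines.append(f"breadcrumb: {method_name} {params}")

def deploy_web_app(app):
    heartbeat()
    breadcrumb("deploy_web_app", app=app)
    # ... actual deployment work would go here ...

deploy_web_app("my-app")          # writes heartbeat + breadcrumb
CONFIG["breadcrumbs_enabled"] = False
deploy_web_app("my-app")          # writes heartbeat only
print(len(log_lines))             # 3
```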
It can be tempting to add a lot of other stuff such as the state of various servers, services and databases and the like. Whilst this is no doubt valuable, I’d advise against writing anything too extensive. These might be useful for your own peace of mind but such safeguards tend to get abused once the parties in charge of the various touch points know they’re there. Before you know it, you could be writing a diagnostic app for the entire company.