Let’s assume we have two services – user and auth – with a message broker in between.
The user service handles CRUD actions on the user entity and the auth service handles authentication. When a user is created/updated/deleted, the user service publishes an event that the auth service consumes and writes to its own database.
If something happens, such as pushing a new version of the user service that publishes events in a format the auth service can’t process, then when a user is deleted in the user service, the deletion never reaches the auth service and the user will still be able to authenticate.
What can one do to handle or recover from this situation?
Microservices aim to be loosely coupled and independently deployable.
Could there be something wrong?
If every change to your user service risks altering the event message in a way your auth service can no longer process, there must be something wrong in your design:
- either the two services are in reality tightly coupled: since they are not independent, they should belong to the same microservice. Maybe consider another decomposition strategy (see decomposition patterns here);
- or the interface between the services is at the wrong level of abstraction: the interface (here the message format) leaks details that create a dependency that shouldn’t be there. Then consider rethinking your interface.
Or could it be about facilitating deployment?
If it’s very rare, and caused by a major evolution of one of the services, you need to engineer a transition strategy into your service to support independent deployment over the long run.
You could for example consider:
- Format versioning: the message format should be versioned so that any consuming service can verify message-version compatibility dynamically. Take the real-life example of SAML, whose `xmlns:saml` namespace allows determining which version is used (1.1 is backwards compatible with 1.0, but 2.0 is not backwards compatible). SemVer-like versioning could facilitate compatibility checking.
- Backwards compatibility: design an evolutionary format for your event messages, with a format version number, the old part remaining unchanged, and new information being added for the services that can use it.
- Transitional compatibility: a variant of backwards compatibility where the format multiplexes different format versions. Once all the consumers support the new format, you can release a new version that drops the old parts. A real-world example could be multipart MIME, which allows for alternative subparts. But in your context, I’d see this more as a workaround: evolution is best managed with backwards compatibility, whereas the disruption caused by a new major version can go well beyond reading a message format, and may therefore be better managed with the next proposal:
- Side-by-side, also called the Darwinistic approach: your new major version is released with new events. Old and new services coexist as long as there are still subscribers to the old one. Service discovery can help the other services find the most suitable user service to rely on. You could consider making a bridge service to keep old and new in sync for a while, if relevant. You could also divert new incoming user registrations to the new version of the service using a “blue/green” deployment.
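To make the format-versioning point concrete, here is a minimal Python sketch of dynamic version checking in a consumer, using a SemVer-like rule (same major version means compatible). The `schema_version` field name, the supported major version, and the event payloads are all assumptions for illustration, not part of any real format:

```python
# Hypothetical version check in the auth service's consumer.
# Assumption: every event carries a "schema_version" field like "2.3".

SUPPORTED_MAJOR = 2  # assumed: this consumer was built against format 2.x

def parse_version(version: str):
    """Split "MAJOR.MINOR" into a pair of ints."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)

def is_compatible(event: dict) -> bool:
    # SemVer-style rule: minor bumps are additive and safe,
    # a different major version means the format may have broken.
    major, _minor = parse_version(event.get("schema_version", "1.0"))
    return major == SUPPORTED_MAJOR

# A compatible minor bump is accepted, a major bump is rejected
# (and could then be dead-lettered rather than mis-processed):
assert is_compatible({"schema_version": "2.3", "type": "user.deleted", "user_id": 42})
assert not is_compatible({"schema_version": "3.0", "type": "user.deleted", "user_id": 42})
```

Rejected messages can then be parked (e.g. in a dead-letter queue) instead of being silently mis-processed.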
I believe there are three typical ways you could handle this:
- Guarantee delivery of your messages
- Run a reconciliation process
- Switch to a pulled events approach
A little detail on each approach…
Guarantee delivery of your messages
This is for when you really want to make sure that your messages get through, and as soon as possible.
Firstly, use the transactional outbox pattern in the
user service to ensure that all messages that should be sent as a result of a database transaction are successfully sent to the message broker.
Secondly, design and deploy your message broker to have high availability and (more importantly) very high durability (i.e. 99.99….?% of messages are not lost).
Thirdly, ensure that messages are not ACK’d by the
auth service until they’ve been processed and the results committed to your db (the receive-side analog of the transactional outbox).
If it’s really, really important that you never lose messages, you might also want to keep a Sent Messages log file at the
user service. In the case of data loss in your message broker, you can then replay messages from the log.
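The transactional outbox idea can be sketched as follows, here with SQLite standing in for the user service’s database; the table layout and the `publish` callback are illustrative assumptions, not a prescribed schema:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute(
    "CREATE TABLE outbox ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT, sent INTEGER DEFAULT 0)"
)

def delete_user(user_id: int) -> None:
    # The state change and the outgoing event are committed atomically:
    # either both happen or neither does.
    with db:  # one transaction; commits on success, rolls back on error
        db.execute("DELETE FROM users WHERE id = ?", (user_id,))
        db.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "user.deleted", "user_id": user_id}),),
        )

def relay_outbox(publish) -> None:
    # A separate relay drains the outbox; a row is marked sent only after
    # the broker accepts it, giving at-least-once delivery.
    rows = db.execute("SELECT id, payload FROM outbox WHERE sent = 0").fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))
        db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    db.commit()

db.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
delete_user(1)
sent = []
relay_outbox(sent.append)  # sent now holds the user.deleted event
```

Note that if you keep the sent rows instead of purging them, the outbox table also doubles as the Sent Messages log mentioned above, ready for replay after broker data loss.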
Run a reconciliation process
If you’re okay with the “eventual” of your eventual consistency being a little longer, and the amount of data that’s shared between the services is not prohibitively large, you can run a reconciliation process. This would typically be done either by the
user service regularly exporting a dump, or by the
auth service regularly requesting all the data owned by the
user service which it is caching. Either way,
auth regularly receives the user service’s full picture of the world and can either update itself, if that’s relatively easy to do, or alert humans to intervene if it detects an inconsistency that can’t be handled automatically. It’s a good idea to have a method of ensuring that you’re not overwriting changes in auth from recently received messages with stale data from user that predates those messages.
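A reconciliation pass with that staleness guard might look like this rough Python sketch; the record shape (an email plus an `updated_at` timestamp) is invented for illustration:

```python
def reconcile(auth_cache: dict, user_dump: dict, dump_time: float) -> list:
    """Repair auth's cache from a dump taken by the user service at dump_time.

    Entries the auth service updated *after* the dump was taken are left
    alone, so fresh message-driven updates are never clobbered by stale data.
    """
    actions = []
    # Remove users auth still has but the user service no longer knows about.
    for uid in list(auth_cache):
        if uid not in user_dump and auth_cache[uid]["updated_at"] <= dump_time:
            del auth_cache[uid]
            actions.append(f"deleted {uid}")
    # Add missing users and correct drifted ones.
    for uid, record in user_dump.items():
        cached = auth_cache.get(uid)
        if cached and cached["updated_at"] > dump_time:
            continue  # cache entry is newer than the dump; keep it
        if cached is None or cached["email"] != record["email"]:
            auth_cache[uid] = {"email": record["email"], "updated_at": dump_time}
            actions.append(f"upserted {uid}")
    return actions

# u1 was updated after the dump (kept as-is); u2 drifted; u3 is missing.
cache = {"u1": {"email": "a@example.com", "updated_at": 100.0},
         "u2": {"email": "old@example.com", "updated_at": 50.0}}
dump = {"u2": {"email": "new@example.com"}, "u3": {"email": "c@example.com"}}
actions = reconcile(cache, dump, dump_time=90.0)
```

Entries that can’t be fixed mechanically would instead be reported for human intervention, per the text above.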
Switch to a pulled events approach
This one’s an alternative to using a message broker, so a little orthogonal to your question, but it’s worth considering. Instead of pushing events through a message broker, you can provide an
events/ endpoint on your
user service. The
auth service then becomes responsible for knowing where it is up to in the events stream, and for processing messages in order and calling developers for help if it can’t understand the data it receives.
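Here is a minimal sketch of such a pull-based consumer, with an in-memory function standing in for the user service’s events/ endpoint; the event shapes and the `offset` cursor field are assumptions for illustration:

```python
def fetch_events(after: int) -> list:
    # Stand-in for GET /events?after=<offset> on the user service.
    log = [
        {"offset": 1, "type": "user.created", "user_id": 7},
        {"offset": 2, "type": "user.deleted", "user_id": 7},
    ]
    return [e for e in log if e["offset"] > after]

class AuthConsumer:
    """The auth service tracks its own position in the event stream."""

    def __init__(self):
        self.cursor = 0          # how far into the stream we have processed
        self.known_users = set()

    def poll(self):
        for event in fetch_events(after=self.cursor):
            if event["type"] == "user.created":
                self.known_users.add(event["user_id"])
            elif event["type"] == "user.deleted":
                self.known_users.discard(event["user_id"])
            else:
                # Can't understand the data: stop and call for help rather
                # than advancing past an event we failed to process.
                raise ValueError(f"unknown event type: {event['type']}")
            self.cursor = event["offset"]  # advance only after success

consumer = AuthConsumer()
consumer.poll()  # processes both events in order; user 7 ends up deleted
```

Because the cursor only advances after successful processing, an incomprehensible event halts the consumer at the exact point of failure instead of being silently dropped.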
Avoiding the problem in the first place
You do want to design for the day when this kind of error occurs. But it’s a good idea to also put in place practices that greatly lower its likelihood. One such practice you probably want to look into is consumer-driven contracts, which essentially try to catch such errors at build time by breaking the build.
Handling the problem well
Also, when the problem does happen, it’s nice to handle it gracefully. This is typically done using what’s called a Dead Letter Queue, where any message that causes an error at the receiver gets removed from the inbox and placed in a separate queue. The messages are then kept in that queue until something is changed in the system to allow them to be processed again. In your scenario, you would probably push all the DLQ messages back into the inbox after deploying a new version of the
auth service that has been updated to understand the new message format.
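A toy sketch of that flow, with plain in-memory queues standing in for the broker’s inbox and DLQ (the version check inside `process` is an assumed failure mode, matching the new-format scenario in the question):

```python
from collections import deque

inbox = deque()
dead_letters = deque()  # the Dead Letter Queue

def process(message: dict) -> None:
    # Assumed failure mode: the consumer only understands format 2.0.
    if message.get("schema_version") != "2.0":
        raise ValueError("unsupported message format")

def drain_inbox() -> None:
    while inbox:
        message = inbox.popleft()
        try:
            process(message)
        except ValueError:
            dead_letters.append(message)  # park it instead of losing it

def redrive_dlq() -> None:
    # After deploying a consumer that understands the new format,
    # push every dead-lettered message back into the inbox.
    while dead_letters:
        inbox.append(dead_letters.popleft())

inbox.extend([{"schema_version": "2.0"}, {"schema_version": "3.0"}])
drain_inbox()  # the 3.0 message lands in the DLQ; nothing is lost
```

Real brokers (e.g. RabbitMQ or SQS) offer DLQ support natively; the redrive step is the manual "push back into the inbox" described above.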
The user service owns this data, and the auth service is caching it for its own purposes. While this duplicates data, it’s a good pattern because it increases service autonomy.
How would you recover from that if this was paperwork between departments?
Perhaps someone in the auth department would notice the error and request that the paperwork be redone.
Perhaps someone in the user department would notice that the auth department failed to reply in the expected time frame (or replied with a negative acknowledgement), and raise a flag.
Additionally, we might have versioning approaches that only allow optional fields to be added, and that allow the optional fields to be ignored.
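That “optional fields only” rule amounts to what is often called a tolerant reader: the consumer extracts the fields it knows and ignores the rest. A small Python sketch, with invented field names:

```python
# Tolerant reader sketch: only these fields are required; anything else
# in the message is treated as an optional addition and ignored.
KNOWN_FIELDS = {"user_id", "action"}

def read_event(raw: dict) -> dict:
    missing = KNOWN_FIELDS - raw.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    # Keep only the fields this consumer understands.
    return {k: raw[k] for k in KNOWN_FIELDS}

old_style = {"user_id": 1, "action": "delete"}
# A later producer version added an optional field; old consumers ignore it.
new_style = {"user_id": 1, "action": "delete", "display_name": "Alice"}
assert read_event(old_style) == read_event(new_style)
```

Under this rule, adding a field never breaks existing consumers; only removing or renaming a required field does, and that is exactly what a new major format version should signal.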
Further, there might be some kind of exhaustive testing perhaps driven off of a detailed description of the message schema, so both message senders and message receivers could be independently tested to work to spec (above and beyond just working with each other).
Even though a message broker addresses some delivery problems, we might also digitally sign or encrypt messages so they cannot be tampered with, e.g. accidentally by dropped or flipped bits, or otherwise.
And lastly, perhaps user and auth should be the same service.