Should serialization and deserialization be “atomic” transactions?


I am wondering if serialization and deserialization of classes should always be treated as an “atomic transaction?”

What I mean is, if an error were to occur during the process of serializing or deserializing a member of an object, should the whole serialization/deserialization of the object be considered to have failed?

For a more concrete example, I am going to use C++. Suppose a very basic structure as follows:

#include <cstdint>

struct RGB
{
    uint8_t r;
    uint8_t g;
    uint8_t b;
};

Suppose I have an RGB instance defined like so:

RGB myRGB{0x10, 0x20, 0x30};

Which, if serialized into a raw binary stream, would look like:

0x10 0x20 0x30

Suppose one of the bytes gets lost during transmission, so that the “deserializer” is fed only:

0x10 0x20

I can see two options here.

a) Because the ‘b’ member of the struct cannot be deserialized, the whole struct cannot be deserialized.

b) ‘r’ and ‘g’ can be deserialized, and we will just use the default value for ‘b’.

Both have their merits. The problem I can see with (b) is that, while it ensures you at least get “something”, it is not actually an accurate reconstruction of the thing that was serialized, which (for a more complex example) could result in further errors down the line.

I suppose an option (c) would be:

c) It depends on the application. If the object in question can be default constructed, then option (b) is fine. If the object cannot be default constructed (i.e., requires values in its constructor), all of the values required for construction need to be deserialized atomically.
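
To make (a) and (b) concrete, here is a rough sketch of what the two deserializers might look like. The function names and the use of std::optional are just illustrative choices on my part, not an established API:

#include <cstdint>
#include <optional>
#include <vector>

struct RGB { uint8_t r, g, b; };

// Option (a): all-or-nothing. If any byte is missing, the whole
// deserialization fails and no RGB object is produced.
std::optional<RGB> deserialize_atomic(const std::vector<uint8_t>& bytes)
{
    if (bytes.size() < 3)
        return std::nullopt;  // refuse to hand back a partial object
    return RGB{bytes[0], bytes[1], bytes[2]};
}

// Option (b): best effort. Missing members fall back to a default (0 here),
// so the caller always gets an RGB, just not necessarily the original one.
RGB deserialize_best_effort(const std::vector<uint8_t>& bytes)
{
    RGB out{0, 0, 0};
    if (bytes.size() > 0) out.r = bytes[0];
    if (bytes.size() > 1) out.g = bytes[1];
    if (bytes.size() > 2) out.b = bytes[2];
    return out;
}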

10

Meyer’s Design by Contract (DbC) has something to say about this.

Suppose that one of the bytes gets lost during transmission so that the “deserializer” is fed [just a prefix of the bytestream before EOF truncation].

We are constructing an RGB pixel object. That class makes certain guarantees: values shall always be integers in the range 0..255, and never e.g. a NULL pointer or a NaN value.

The .deserialize() method makes a guarantee: it shall return a valid RGB object that was received from the bytestream. If it can’t do that, it should return nothing at all; it should raise an exception to indicate it was unable to fulfill its contract.

How the calling code responds to that is up to it. Things are out of the hands of .deserialize() once it has done its job.
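
In C++ terms, that contract might look roughly like this. This is only a sketch; the exception type and the free-function shape are my own illustrative choices:

#include <cstdint>
#include <stdexcept>
#include <vector>

struct RGB { uint8_t r, g, b; };

// Contract: either return a valid RGB reconstructed from the bytestream,
// or throw. Never hand back a half-initialized object.
RGB deserialize(const std::vector<uint8_t>& bytes)
{
    if (bytes.size() < 3)
        throw std::runtime_error("RGB deserialize: truncated input");
    return RGB{bytes[0], bytes[1], bytes[2]};
}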

5

I would say it is c) or something similar to c): “it depends on the application and the data that is being transmitted”.

If I am serializing the details of a banking transaction, the idea that you would accept a partial transaction and fill in the missing details seems pretty crazy, right? It should, because that could land the institutions involved in some hot water.

I would go as far as to say that, by default, all transmissions should be considered atomic, and only once it is determined that a partial transmission can be tolerated should you implement any other solution. Generally speaking, it’s only things that don’t matter that you can treat non-atomically. For example, if you are sending out a list of new cat gifs available on a website, maybe you just want to show what you received instead of nothing.

5

Don’t tell lies to your caller. He relies on the results being correct.

Returning an object from a deserializer tells your caller that this was the content found in the serialized data.

If you can’t reliably reconstruct the original object, inform your caller about that exceptional situation. In that error message (exception object?), you might choose to include a best guess about the contents, but you should not disguise such a guesswork as being the truth.

If you are sure, based on your current problem analysis, that it’s also acceptable to e.g. replace some missing fields with their defaults, be sure to document this highly unexpected behaviour in a place where every future user of your deserializer will take notice.
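
One way to express that in C++ is an exception type that carries the guess alongside the error. The following is only a sketch with made-up names:

#include <cstdint>
#include <stdexcept>
#include <string>

struct RGB { uint8_t r, g, b; };

// Hypothetical exception type: it reports the failure and may carry a
// best guess so the caller can decide what to do with it. The guess is
// clearly labelled as such, not disguised as the real value.
struct DeserializationError : std::runtime_error
{
    RGB best_guess;

    DeserializationError(const std::string& what, RGB guess)
        : std::runtime_error(what), best_guess(guess) {}
};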

1

This is highly contextual, both from a business perspective (as JimmyJames mentions, a partial banking transaction) and from a data model perspective.

If you’re dealing with sequential individual elements, such as:

A, B, C, D, E, F, G

Then you can reasonably argue that each individual element’s parsing can be individually decided (assuming no business conflict like was mentioned before).

However, if your data is nested, such as JSON or XML, then you inherently can never fully process an element since the root element is only closed at the end of the transmission. If your parsing fails halfway, you cannot have parsed the root element correctly, and therefore you cannot decide whether the information you did parse is individually complete enough or not.

Again, context applies. Maybe your root element is really just an array of discrete items, at which point the earlier advice might apply. Similarly, maybe even an array of items is only meaningful when fully known, at which point the earlier advice doesn’t apply.

In short, there is no absolute here. If you have to err one way, I would err to assuming atomicity, simply because the impact of making a wrong call is significantly less than when you err the other way.

The behavior of a deserialization/decoding routine should depend upon the extent to which its output will be considered a Source of Truth.

If a deserialization/decoding routine’s output will be used ephemerally, and it’s better to produce mostly-correct output quickly than perfect output slowly (this situation may arise with audiovisual decoders or applications that perform real-time interactive rendering) then producing imperfect output when given imperfect input may be better than simply providing an error indication.

If a deserialization/decoding routine’s output may be used as a Single Source of Truth, then the routine should not produce incorrect output. If, because of faulty input, it cannot produce correct output, then it shouldn’t produce any output at all.

If a deserialization/decoding routine will be used to try to extract useful information from input that is known or expected to be corrupted, and its output, though imperfect, will be better than any other Source of Truth, and will thus become the new Source of Truth, then it should produce the best possible output from even corrupted parts of the input, but its output should also include information about any problems that it “patched over”.

Probably the most important distinction is between the first two usage patterns. In the first scenario, a transient problem that causes the decoder to receive invalid data may result in transient disturbance to the output, but if the output is going to be discarded in any case, the fact that it is invalid won’t matter. If, however, a transient disturbance on the input would prevent the data from being decoded correctly, substituting default data from what will be interpreted as the Source of Truth may result in real data being permanently overwritten, in circumstances where refusing to decode anything would have resulted in the real data being retained until the disturbance subsided and the replacement data could be successfully decoded.

If the structure is very big and is passed over a noisy environment, it may be partitioned into smaller chunks that are more likely to arrive undisturbed than the complete message. In this case only the processing of a single chunk needs to be atomic. If some chunks have not been received correctly, it may be possible to request re-sending of only the failed segments. For instance, a big image could be split into tiles that are displayed as they are received, and the failed tiles are re-sent and displayed later.
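
A rough sketch of such chunk framing, purely illustrative (the field layout and names are assumptions, not any particular protocol):

#include <cstdint>
#include <vector>

// Each chunk carries its index and its own checksum, so a damaged chunk
// can be detected and re-requested without retransmitting the whole message.
struct Chunk
{
    uint32_t index;                // position of this chunk in the message
    std::vector<uint8_t> payload;  // this chunk's share of the data
    uint8_t checksum;              // e.g. XOR over the payload bytes
};

bool chunk_is_intact(const Chunk& chunk)
{
    uint8_t sum = 0;
    for (uint8_t byte : chunk.payload)
        sum ^= byte;
    return sum == chunk.checksum;
}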

An optimal API would provide both alternatives. This can be implemented by returning the object and a success status (or a list of errors), or alternatively by having a parameter such as ignore_errors.
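
One possible shape of such an API, sketched in C++ with illustrative names (DeserializeResult and the ignore_errors parameter are my own assumptions):

#include <cstdint>
#include <string>
#include <vector>

struct RGB { uint8_t r, g, b; };

// The caller always gets the object plus a list of problems encountered,
// and decides for itself whether to accept the result.
struct DeserializeResult
{
    RGB value{};
    std::vector<std::string> errors;  // empty means a clean decode

    bool ok() const { return errors.empty(); }
};

DeserializeResult deserialize(const std::vector<uint8_t>& bytes,
                              bool ignore_errors = false)
{
    DeserializeResult result;
    if (bytes.size() < 3)
        result.errors.push_back("truncated input: expected 3 bytes");
    if (!result.errors.empty() && !ignore_errors)
        return result;  // strict mode: report the failure, no partial data
    if (bytes.size() > 0) result.value.r = bytes[0];
    if (bytes.size() > 1) result.value.g = bytes[1];
    if (bytes.size() > 2) result.value.b = bytes[2];
    return result;
}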

This is useful when loading user data files, which can get corrupted due to various reasons and may be the only available copy. Being able to provide a “recover as much as possible” option can save someone’s day.

In comparison, if the corrupted data is the parameter of a request, it is better just to reject the request and have the other end retry.

In practice though, many deserialization libraries will not provide partial data. The more complex the format gets, the smaller the chance of recovering anything successfully, even after a single corrupted byte.

Take the JSON file format as an example. If you get a sequence of bytes that are supposed to be a JSON document, you either decode it completely, or you fail completely. A JSON decoder will never say “Here is some partial data, but there are more bytes that I couldn’t decode”.

That’s a very reasonable attitude. It makes your life a lot easier, and prevents disastrous mistakes. In the end, if you are given data that you cannot process or that is incomplete, what are the chances that what you can decode is pure nonsense?

Same with plain http or https. The software that you call will do its best to get the correct results, but the caller will either get a correct result or nothing.

6

The general question completely depends on the application. If you are serializing/deserializing a live video stream, you would typically do a best effort and not bother if a few pixels went black. If you are serializing/deserializing economic transactions, you would typically need to detect and flag any anomalies.

If the transfer medium is unreliable, it is normally necessary to add redundant information to be able to detect or correct errors (i.e. add a checksum or its generalization, an error-correcting code). Most media which we regard as reliable employ some mechanism of error detection/correction on top of an unreliable physical medium.

Hence, in the RGB example, if the transfer medium could be unreliable, the data should perhaps not be serialized into three bytes [“r”, “g”, “b”] but four bytes [“r”, “g”, “b”, “XOR(r,g,b)”] or more.
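
A minimal sketch of that four-byte scheme (illustrative only; function names are mine):

#include <cstdint>
#include <optional>

struct RGB { uint8_t r, g, b; };

// Serialize as four bytes: r, g, b, plus an XOR checksum over the three.
// A single XOR byte only detects (some) corruption; it cannot correct it.
void serialize(const RGB& rgb, uint8_t out[4])
{
    out[0] = rgb.r;
    out[1] = rgb.g;
    out[2] = rgb.b;
    out[3] = static_cast<uint8_t>(rgb.r ^ rgb.g ^ rgb.b);
}

// Reject the input when the checksum does not match.
std::optional<RGB> deserialize(const uint8_t in[4])
{
    if ((in[0] ^ in[1] ^ in[2]) != in[3])
        return std::nullopt;
    return RGB{in[0], in[1], in[2]};
}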

Mathematically, for a given amount of redundant information, there is a trade-off between detecting errors and correcting errors. Again, it depends on the application whether it makes most sense to detect errors (for flagging/retrying) or correct errors (for best effort/reduced risk) or do a combination.

It should really be atomic. The concept of fail hard, fail fast applies here, especially since serialization and deserialization are usually used at the border of your system, to communicate with the outside world (even if it’s storage or DB you have full control over, but especially if network access is involved).

For your particular example, nothing worse than a little off-color pixel may be the result of assuming default values. For other, more involved objects, it may be catastrophic failure later down the road. Slowly corrupting data is one of the worst conditions a computer system can be in. If you forget (or are not aware) that the issue exists, you might hunt down weird bugs in your data forever, wasting days or weeks debugging, building workarounds around it, and so on and so forth. Depending on the data it may also be completely impossible to distinguish correct from bad data, which may make all the data worthless.

It should be fairly easy to verify, with some measure of certainty, that the process was error-free by appending some form of checksum/hash to the serialized data.

If you do, on occasion, wish to deserialize known-bad serialized data (for example from a source system you have no control over), then you need to approach that with a lot of sense, and specifically define the cases you wish to accept and “fix” during deserialization. In this case, the operation still needs to be atomic, but you’ll define rules to fix the data issues right there and then. In a sense, this is then similar to being error tolerant when, say, parsing HTML or other structured data, with the exception that the serialized form is usually binary.
