Why store a hash of the decompressed data?


I don’t know anything about compression, so I’m trying to learn about it. In the LZAV compression library API, a comment on the decompress function advises storing a hash of the original (uncompressed) data so that, after decompressing, you can check its validity (i.e., that the decompression succeeded). The comment:

 * Note that while the function does perform checks to avoid OOB memory
 * accesses, and checks for decompressed data length equality, this is not a
 * strict guarantee of a valid decompression. In cases when the compressed
 * data is stored in a long-term storage without embedded data integrity
 * mechanisms (e.g., a database without RAID 1 guarantee, a binary container
 * without a digital signature nor CRC), then a checksum (hash) of the
 * original uncompressed data should be stored, and then evaluated against
 * that of the decompressed data. Also, a separate checksum (hash) of
 * application-defined header, which contains uncompressed and compressed data
 * lengths, should be checked before decompression. A high-performance
 * "komihash" hash function can be used to obtain a hash value of the data.
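
As I read it, the comment describes a layout roughly like the sketch below. This is only my illustration, not LZAV's API: Python's `zlib` stands in for LZAV, a truncated BLAKE2b from `hashlib` stands in for komihash, and the container layout and the names `pack`, `unpack`, and `digest` are all made up for the example.

```python
import hashlib
import struct
import zlib

# Hypothetical application-defined header: uncompressed and compressed lengths.
HEADER = struct.Struct("<QQ")
DIGEST_SIZE = 8


def digest(data: bytes) -> bytes:
    # Truncated BLAKE2b is a stand-in for komihash (assumption for this sketch).
    return hashlib.blake2b(data, digest_size=DIGEST_SIZE).digest()


def pack(original: bytes) -> bytes:
    # zlib is a stand-in for the LZAV compressor (assumption for this sketch).
    compressed = zlib.compress(original)
    header = HEADER.pack(len(original), len(compressed))
    # Layout: header | hash of header | hash of original data | compressed data
    return header + digest(header) + digest(original) + compressed


def unpack(blob: bytes) -> bytes:
    header = blob[:HEADER.size]
    pos = HEADER.size
    header_digest = blob[pos:pos + DIGEST_SIZE]
    pos += DIGEST_SIZE

    # Check the header before trusting the lengths it contains.
    if len(header) != HEADER.size or digest(header) != header_digest:
        raise ValueError("header is corrupted")
    original_len, compressed_len = HEADER.unpack(header)

    data_digest = blob[pos:pos + DIGEST_SIZE]
    pos += DIGEST_SIZE
    compressed = blob[pos:pos + compressed_len]
    if len(compressed) != compressed_len:
        raise ValueError("compressed data is truncated")

    try:
        restored = zlib.decompress(compressed)
    except zlib.error as exc:
        raise ValueError("decompression failed") from exc

    # Compare the decompressed data against the hash of the original data.
    if len(restored) != original_len or digest(restored) != data_digest:
        raise ValueError("decompressed data does not match the stored hash")
    return restored
```

Here `unpack(pack(data))` should return `data`, while a flipped byte anywhere in the blob should raise `ValueError` at one of the three checks.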

Could I forgo hashing the uncompressed data if I instead check the integrity of the bytes where I store the compressed data? For example, I compress some data and save it to storage along with other data. If I verify that stored data before using it, I ensure that the input I pass to the decompressor is the same as what came out of the compressor. If the decompressor’s input is identical to the compressor’s output, then it’s guaranteed to decompress correctly, right?
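
To make the alternative concrete, here is a minimal sketch of what I have in mind. Again `zlib` is only a stand-in for LZAV, SHA-256 is an arbitrary choice of checksum, and `store` and `load` are hypothetical names. Note that this only proves the decompressor receives the exact bytes the compressor produced; it assumes the compressor and decompressor are themselves deterministic and bug-free, which a hash of the original data would not need to assume.

```python
import hashlib
import zlib


def store(original: bytes) -> tuple[bytes, bytes]:
    # Compress, then checksum the compressed bytes (not the original data).
    compressed = zlib.compress(original)
    return compressed, hashlib.sha256(compressed).digest()


def load(compressed: bytes, expected: bytes) -> bytes:
    # Verify the stored bytes before handing them to the decompressor.
    if hashlib.sha256(compressed).digest() != expected:
        raise ValueError("stored compressed data is corrupted")
    return zlib.decompress(compressed)
```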
