Tag Archive for hashing

Are there any standards for storing checksums of a repository?

I have a repository with many files (mostly binary: images and raw data) and some documentation. The files are stored in a hierarchical folder structure, and I want to be able to check the fixity of each file in order to detect data corruption. At the moment I am generating a .json file that mirrors the structure and contains a checksum for each file, plus some metadata such as the date and the algorithm I used to calculate the checksums.
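There is an established standard for exactly this use case: BagIt (RFC 8493) packages payload files together with a manifest of per-file checksums. If you stay with a custom JSON manifest, a minimal sketch of the generation step might look like the following (the field names and layout are illustrative, not any standard):

```python
import hashlib
import json
import os
from datetime import datetime, timezone

def build_manifest(root, algorithm="sha256"):
    """Walk the folder tree and record a checksum for every file."""
    entries = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            h = hashlib.new(algorithm)
            with open(path, "rb") as f:
                # Read in chunks so large binary files don't fill memory.
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
            rel = os.path.relpath(path, root)
            entries[rel] = h.hexdigest()
    return {
        "algorithm": algorithm,
        "generated": datetime.now(timezone.utc).isoformat(),
        "files": entries,
    }

# Writing the manifest next to the data it describes:
# with open("manifest.json", "w") as f:
#     json.dump(build_manifest("/path/to/repo"), f, indent=2)
```

Recording the algorithm in the manifest, as above, lets you migrate to a stronger hash later without ambiguity about how old entries were computed.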

How to rebalance data across nodes?

I am implementing a message queue in which messages are distributed across the nodes of a cluster. The goal is to design the system so it can auto-scale without having to keep a global map of each message and its location.
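A common technique for this is consistent hashing: each message key maps to a position on a hash ring, so adding or removing a node relocates only roughly 1/N of the keys, and no global message-to-node map is needed. A minimal sketch (class and parameter names are illustrative, and the virtual-node count is an arbitrary choice):

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring: adding or removing one node
    only remaps roughly 1/N of the keys."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes        # virtual nodes smooth the distribution
        self._ring = []             # sorted list of (hash, node) pairs
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, key):
        # First ring position clockwise from the key's hash.
        idx = bisect.bisect(self._ring, (self._hash(key), ""))
        if idx == len(self._ring):
            idx = 0  # wrap around
        return self._ring[idx][1]
```

Because placement is a pure function of the key and the current node set, any node can compute where a message lives without consulting a central registry.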

How should I handle different hashes of identical files in .zip archive with different ‘last changed’ date?

We store zipped files, which contain certain fields (metadata), in a cloud provider's storage. These files are derived from other, larger files. Every time we (re)generate them, their 'last changed' date is set to the generation time, even though the content of the file is identical. When we recreate one of these files that has previously been stored online, its file hashes (MD5/SHA) differ. The reason is that the zip format records the 'last changed' timestamp inside the .zip file itself.
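One way around this is to make the archive deterministic: pin the timestamp (and any other varying metadata) on every entry when writing, so regenerated archives are byte-identical and their hashes match. A sketch using Python's `zipfile` (the fixed timestamp and permission bits are arbitrary choices, and `members` is a hypothetical name-to-bytes mapping):

```python
import zipfile

# Earliest timestamp the zip format can represent.
FIXED_TIME = (1980, 1, 1, 0, 0, 0)

def write_deterministic_zip(zip_path, members):
    """Write members (dict: archive name -> bytes) with pinned
    timestamps and permissions, in sorted order, so two runs
    over the same content produce byte-identical archives."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        for name in sorted(members):
            info = zipfile.ZipInfo(name, date_time=FIXED_TIME)
            info.compress_type = zipfile.ZIP_DEFLATED
            info.external_attr = 0o644 << 16  # fixed file permissions
            zf.writestr(info, members[name])
```

The alternative is to stop hashing the archive and instead hash the decompressed member contents, which ignores all container-level metadata.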

Measuring “novelty” of data

I have a heuristic in mind that should allow me to “score” data based on “novelty” that I would like to work in real-ish time.
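As an illustration of one such heuristic (not necessarily the asker's), an incoming item can be scored by the fraction of its content shingles that have never been seen before, with shingles hashed so the seen-set stays compact. A hypothetical sketch:

```python
import hashlib

def shingles(text, k=4):
    """All overlapping substrings of length k (the whole text if shorter)."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

class NoveltyScorer:
    """Scores each item by the fraction of its shingles never
    seen in any earlier item: 1.0 = entirely new, 0.0 = all seen."""

    def __init__(self):
        self._seen = set()  # 8-byte hashes of shingles seen so far

    def score(self, text):
        hs = {hashlib.blake2b(s.encode(), digest_size=8).digest()
              for s in shingles(text)}
        if not hs:
            return 0.0
        new = len(hs - self._seen)
        self._seen |= hs
        return new / len(hs)
```

For true real-time use at scale, the exact set could be swapped for a Bloom filter (bounded memory, small false-positive rate) without changing the scoring logic.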
