My company has a huge C++/C# software system composed of 1800 binaries. The technology range goes from old-school native C++/MFC, through C++/CLI, up to .NET Framework and .NET 6. Almost all possible project types within these technologies exist in our software.
We struggle to find an optimal way to deploy these binaries to the client by sending them an incremental installer, since slight changes in the source code may cause binaries to be built very differently.
We sign the DLLs and edit the version info, which also means that a new build of the software results in 1800 different binaries, even though only a few bytes actually changed in each binary.
It is very hard to keep track of exactly which source code changes caused which changes in the binaries. The complexity of tracking this for inter-related C++ projects is huge, so it is simply not a viable solution for us.
Currently we have an in-house automated PE header comparer system to detect changes in the binaries from one version to the next, but sometimes it fails to recognize minor changes in our binaries, which may result in a catastrophic deployment.
Any good ideas, strategies and/or recommendations on how to produce an incremental patch installer for thousands of C++/C# binaries?
The installer must be as small as possible. Suggestions on 3rd party solutions/frameworks are indeed welcome, as Microsoft does not seem to address this issue at all.
You are looking for a tool that can calculate, transfer, and apply binary diffs.
You cannot transfer just modified files, since all files will be modified. However, most parts within the files will stay constant. Thus, such a tool would have to
- split the files into chunks
- check which chunks already exist in the old version
- create a binary diff that only contains the new chunks
- insert the new chunks into the target file
Rolling hashes for finding matching chunks
A common way to split a file into chunks is content-based slicing using a rolling hash. The rolling hash efficiently calculates a hash over a sliding window of bytes. If we want chunks of 2^n bytes on average, we might draw a chunk boundary whenever the lowest n bits of the hash are zero. Unlike splitting chunks at fixed offsets, this can deal with insertions or shifted offsets.
Each chunk can be identified by a cryptographic hash. Then, to find changed chunks we merely have to perform a set difference.
Example old bytestream:

```
| ..... | ......... | ..... | ........... |
  4d60      5633       929e      aaee
```

Example new bytestream:

```
| ..... | ..... | ..... | ........... |
  4d60    7a5d    929e      aaee
```

Here, the second chunk changed from hash `5633` to hash `7a5d`. So to update the file on the target system, we only have to transfer the contents of chunk `7a5d` and apply the instruction "replace `5633` with `7a5d`".
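To make this concrete, here is a minimal Python sketch of content-defined chunking and the set difference of chunk hashes. The rolling hash is a simple polynomial stand-in (not the exact function used by rsync or Borg), and the window size, mask, and file names are illustrative assumptions:

```python
import hashlib

WINDOW = 48                    # bytes covered by the sliding window
AVG_BITS = 13                  # lowest 13 bits zero -> roughly 2**13 = 8 KiB average chunks
MASK = (1 << AVG_BITS) - 1
BASE, MOD = 257, 1 << 32

def split_chunks(data: bytes) -> list:
    """Content-defined chunking: cut wherever the rolling hash's lowest bits are all zero."""
    pow_w = pow(BASE, WINDOW, MOD)                     # precomputed factor to drop the outgoing byte
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * BASE + b) % MOD
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * pow_w) % MOD   # remove the byte leaving the window
        if i - start + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def chunk_ids(data: bytes) -> dict:
    """Identify every chunk by a cryptographic hash of its contents."""
    return {hashlib.sha256(c).hexdigest(): c for c in split_chunks(data)}

# Hypothetical file names: compare one old and one new binary.
old_ids = chunk_ids(open("app_v1.dll", "rb").read())
new_ids = chunk_ids(open("app_v2.dll", "rb").read())

# Set difference: only chunks missing from the old version must be shipped in the patch.
to_transfer = {h: c for h, c in new_ids.items() if h not in old_ids}
```

Only the chunks in `to_transfer`, plus the list of chunk hashes needed to reassemble each file, would have to go into the patch.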
Alternatively, you can slice the new bytestream into fixed-sized chunks, calculate their rolling hash, and use that to efficiently locate instances of the chunks in the old bytestream – essentially a kind of fast substring search algorithm.
Example new bytestream:

```
| .... | .... | .... | .... | .... | .... |
  0c95   147c   676b   ddab   a224   0dc9
```

Example old bytestream containing some of the new chunks:

```
| .... | .. | .... | ...... | .... | .. | .... |
  0c95        676b            a224        ddab
```

Here, four of the chunks are already present in the old bytestream, so they can be reused when patching the file. Some parts of the file were modified so that chunk `ddab` is now found at a different location, and chunks `147c` and `0dc9` are no longer present. In principle, the locations of chunks in the old bytestream could also overlap.
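A corresponding sketch of this fixed-size-chunk variant, in the spirit of rsync's weak rolling checksum plus strong hash (the checksum here is a simplified Adler-like pair of 16-bit sums; the block size and names are assumptions, and the trailing partial block is ignored for brevity):

```python
import hashlib
from collections import defaultdict

BLOCK = 4096  # fixed chunk size used to slice the new file (an illustrative choice)

def weak_sum(block: bytes) -> int:
    """Weak checksum: two 16-bit sums that can be rolled forward one byte at a time."""
    a = sum(block) & 0xFFFF
    b = sum((len(block) - i) * x for i, x in enumerate(block)) & 0xFFFF
    return (b << 16) | a

def find_reusable_blocks(new: bytes, old: bytes) -> dict:
    """Map offsets of fixed-size chunks of `new` to offsets in `old` where they already exist."""
    table = defaultdict(list)
    for idx in range(0, len(new) - BLOCK + 1, BLOCK):
        blk = new[idx:idx + BLOCK]
        table[weak_sum(blk)].append((idx, hashlib.sha256(blk).digest()))

    found = {}
    if len(old) < BLOCK:
        return found
    a = sum(old[:BLOCK]) & 0xFFFF
    b = sum((BLOCK - i) * x for i, x in enumerate(old[:BLOCK])) & 0xFFFF
    for off in range(len(old) - BLOCK + 1):
        if off:  # roll the weak checksum forward by one byte
            out_b, in_b = old[off - 1], old[off + BLOCK - 1]
            a = (a - out_b + in_b) & 0xFFFF
            b = (b - BLOCK * out_b + a) & 0xFFFF
        for new_idx, strong in table.get((b << 16) | a, ()):
            if new_idx not in found and hashlib.sha256(old[off:off + BLOCK]).digest() == strong:
                found[new_idx] = off   # this chunk of `new` is already present in `old` at `off`
    return found
```

Chunks of the new file without a match are shipped literally in the patch; matched chunks become "copy BLOCK bytes from this offset of the old file" instructions.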
Note that such modifications cannot be performed in place in a file, since the size or offset of chunks might change. If minimizing downtime is important, create the patched/updated files in a separate directory and then restart the services from that directory. Compare the concept of blue–green deployments.
Application of content chunking and binary diffs in Rsync, Borg-Backup, and Git
The above happens to be exactly what the `rsync` tool does with its default “delta encoding” strategy (the fixed-size chunks approach).
Normally, rsync uses an interactive protocol that chops new versions into chunks, then checks which chunks are already available in the old version, and then transfers new data. The new chunks are compressed in transfer.
Rsync also has a non-interactive batch mode where the binary diff is calculated once and saved as a file that can then be applied on the target system. This seems to be exactly what you’re looking for.
Rsync can handle entire directory trees, but any single-file binary diff can be extended to handle multiple files if the byte stream being chunked is an archive format that bundles files in a deterministic order, without compressing them separately. For example, uncompressed TAR archives where files are added alphabetically would be suitable. Freshly created ZIP archives without per-file compression would also work.
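For example, a deterministic, uncompressed archive of a whole binary directory could be produced roughly like this (a sketch using Python's `tarfile` module; the directory and output names are placeholders):

```python
import os
import tarfile

def make_deterministic_tar(src_dir: str, out_path: str) -> None:
    """Bundle a directory into an uncompressed tar, adding files in a stable, sorted order."""
    with tarfile.open(out_path, "w") as tar:          # mode "w" = no compression
        for root, dirs, files in os.walk(src_dir):
            dirs.sort()                               # walk subdirectories in sorted order
            for name in sorted(files):                # add files alphabetically
                path = os.path.join(root, name)
                tar.add(path, arcname=os.path.relpath(path, src_dir), recursive=False)

make_deterministic_tar("deploy/current", "release.tar")   # placeholder paths
```

A chunk-level diff between two such archives then covers all 1800 binaries at once.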
A similar approach is used by Git to synchronize repositories. Git has a concept of “objects” that are identified by hash, for example commits or files. Git can create packfiles in which objects are represented either by their contents or as a diff: instructions to re-assemble the object’s contents from other objects. Usually, two repositories are synchronized interactively, by negotiating known objects and then transferring packfiles for the missing objects, which may be expressed as deltas/diffs against existing objects. But again, it is possible to use a non-interactive approach and transfer a packfile manually. With `git bundle create update.pack old..new`, we create a bundle that contains the changes between commits `old` and `new`. The bundle can then be imported into a different repo via `git bundle unbundle update.pack` (or by fetching from the bundle file as if it were a remote), as long as that repo already contains the `old` revision. While Git is usually used for text files, the packfiles just deal with opaque binary blobs and use a binary diff for delta compression.
A combination of these approaches (splitting chunks in a content-defined manner via rolling hashes, plus content-addressable object storage) is also used by backup tools such as Borg-Backup/Attic to enable deduplicated backups. For example, my desktop backup currently covers 640GB of files (495GB compressed). However, most of those chunks do not change from one backup run to the next and are already present in the backup’s chunk database. Thus, a typical backup run only needs to transfer about 100MB of data for new chunks, and the entire database of over 30 snapshots spanning more than half a year only takes up 595GB of storage thanks to this content-defined deduplication technique.
There is of course a tradeoff between chunk size and the overhead of managing chunks. My Borg-Backup configuration targets an average chunk size of 2MB, which results in a few million chunks and a chunk index about 100MB in size. Rsync handles each file individually and dynamically selects a chunk size based on the file size, typically around 0.7 to 32KB.
Applying these techniques to your problem
You may be able to adapt these tools for your purposes. In particular, Rsync’s batch mode should just work. However, these tools were developed in a Unix context and might not handle some Windows file system features appropriately. If you’re already working at the level of writing custom tools to inspect PE headers, you could also consider applying the principles of Rsync’s batch mode to your own tools. Such custom tools could also include additional domain knowledge, such as selecting the chunk size so that changed file headers are entirely included in the first chunk with high probability.
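As a hypothetical illustration of such domain knowledge, one could force the first chunk boundary exactly at the end of the PE headers (read from `SizeOfHeaders` in the optional header), so that version-info, timestamp, and signature churn stays confined to one small chunk. Here, `split_chunks` stands for whatever content-defined chunker you use:

```python
import struct

def pe_headers_size(data: bytes) -> int:
    """Return SizeOfHeaders from the PE optional header (offset 60 for both PE32 and PE32+)."""
    e_lfanew = struct.unpack_from("<I", data, 0x3C)[0]   # offset of the "PE\0\0" signature
    opt_header = e_lfanew + 4 + 20                       # skip signature (4) + COFF file header (20)
    return struct.unpack_from("<I", data, opt_header + 60)[0]

def chunk_binary(data: bytes, split_chunks):
    """First chunk = the PE headers; the rest is chunked by the supplied content-defined splitter."""
    header_len = pe_headers_size(data)
    return [data[:header_len]] + split_chunks(data[header_len:])
```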
I suggest rsync. Using rsync, you don’t need to reimplement the vast majority of the functionality yourself: the diff algorithm, working with files, the network protocol to transfer data, and so on.
> Currently we have an in-house automated PE header comparer system to detect changes in the binaries from one version to the other, but sometimes it fails
It’s very strange that you don’t compare file hashes to detect whether a file has changed.
It is a very basic, standard, and reliable technique, and it is far more robust than cryptic “PE checks”. Obviously, any part of the file may change, not just the PE header.
> it fails to recognize minor changes in our binaries, which may result in a catastrophic deployment.
> We obviously don’t detect changes by using hashes, because that would always classify all files as different. The PE header of every file will always contain different version information, timestamps, and signatures emitted by the build. So we currently ignore those fields and deliver a binary only if the rest of the fields differ… Furthermore, some C++ compilers write random data into some parts of the binary (the parts not containing any useful program instructions).
That’s exactly why your deployment is fragile and will eventually fail.
> The client has multiple servers and thousands of clients, all on-premise. The server distributes new data to all these clients, so it might overload the network infrastructure if we send a 2GB package to every client machine.
This is what rsync is made for.
You may run rsync as a server to efficiently update files on clients over the network, directly from the server.
https://www.mankier.com/1/rsync#Starting_an_Rsync_Daemon_to_Accept_Connections
https://www.mankier.com/1/rsync#Batch_Mode
From `man rsync`:

> It is famous for its delta-transfer algorithm, which reduces the amount of data sent over the network by sending only the differences between the source files and the existing files in the destination. Rsync is widely used for backups and mirroring and as an improved copy command for everyday use.
Using rsync, you will be able to get binary diffs between two files or two directories and apply them on the client machine.
In the simplest scenario, you can update all files with one rsync command if all your binaries are under a single directory:
```sh
# on the machine with the new files
rsync -r --checksum --compress-choice=zstd --include='*.exe' --include='*.dll' --exclude='*' --only-write-batch=diff_file /deploy/current/ /deploy/previous

# transfer diff_file to the client with your installer

# on the client
rsync --read-batch=diff_file /path/where_binaries_are/
```
You may write a simple program/script to get whatever update logic you need, such as processing each file individually.
https://www.mankier.com/1/rsync#--write-batch
Rsync binaries for Windows
- https://jl-workshop.com/RSync-for-Windows-Free-Download/
- https://community.chocolatey.org/packages/rsync (package: https://community.chocolatey.org/api/v2/package/rsync/6.2.5)
- https://download.samba.org/pub/rsync/binaries/GitHub-CI-builds – these builds need DLLs from the Cygwin project.
.NET
https://github.com/OctopusDeploy/Octodiff
https://www.nuget.org/packages/FastRsyncNet/
Also
Delta Compression Application Programming Interfaces
https://learn.microsoft.com/en-us/previous-versions/bb417345(v=msdn.10)
bsdiff, xdelta, rdiff