I have modeled a problem as a graph that consists of many trees. Some of the nodes in the graph may belong to more than one tree. I am trying to describe a subset of paths in the graph with as few nodes as possible in order to store them efficiently. All paths start from a root and end at a leaf node.
Below are a few examples:
-
Suppose the subset of paths chosen all start from the root node R. Then, I can describe all these paths with “all paths that start from R”. This uniquely determines the paths that were selected. So I just need to store R and a flag that specifies that this node is a root.
-
A similar scenario, but for the case where all the paths end at a specific leaf node L. They can be described with “all paths that end at L”. So I just need to store L and a flag that specifies that this node is a leaf.
-
A similar scenario, but for the case where all the paths pass through a specific intermediate node I. They can be described with “all paths that pass through I”. So I just need to store I and a flag that specifies that this node is an intermediate node.
The problem could get more complicated if the paths need to be described with more than just a root/leaf/intermediate node. For example, I may need to specify many roots, leaves, and intermediate nodes. However, I want the description to contain as few nodes as possible.
Is there any known algorithm/heuristic that I can apply to my problem?
Thanks a lot.
2
You are trying to solve the compression problem (from the way it’s described). This has been proven to be impossible in the general case. Unless you know something about your data, such as some type of pattern that occurs that you can remove you are stuck.
To prove that it’s the compression problem, let’s assign each root node a number starting at 0 and ending at the total -1. We will do the same with the leaf nodes: start at 0 and number them up to the total – 1. Because you have trees, each path can uniquely be identified by its origin and its destination, so a pair of the root node’s ID and the leaf node’s ID.
To efficiently store this you can merge the two numbers using:
rootNumber * number of leaf nodes + leafNumber.
So now you just have an Integer.
Store all the integers in a sequence and you get (effectively) one long random number.
In your example of the possible ways of compressing the storage, marking “all paths start with” and “all nodes end at” are specialized cases, and unless they are more likely than any other configuration, identifying that case is of no benefit (to see why this is the case, investigate Huffman encoding).
As for “All paths go through an intermediate node”, well this is analogous the finally computed number having a factor, i.e. “All numbers are divisible by two”. It’s something that is intrinsic to the system you are dealing with (Divisibility being part of the numeric system). This also does not help, unless, again your system has some bias (aka pattern) in they way the paths are created.
If they are truly random then there is no way to compress your data.