I need to compare and synchronize two tree data structures.
I have Tree A (the “Reference Tree”) and Tree B (the “Target Tree”). I need to bring the Target Tree into compliance with the Reference Tree. I need to find differences — additions and deletions, including those that result in new descendants or branches — and bring the trees into synchronization.
I have to assume there is a body of thought/knowledge around this, given all the JavaScript DOM-based templating/UI systems out there (note: I’m not working in JS, that’s just a handy example).
What names/theories am I looking for here? I’m not asking for actual instruction, I just want to know what schools of thought or practice exist that I should start with, rather from fumbling around from scratch.
6
The main challenge/consideration when doing this is determining what you consider a difference to be. This can be more tricky than you might think. The experience I had that lead me to that conclusion was around XML which has thankfully fallen into disuse, at least for new development. Finding a good XML diffing tool was tough. It might be a place to start your research, though.
Probably the main thing is to determine whether ordering of child nodes matters. For example, are the following trees the same or different?
A A
| |
B C C B
Aside from that and what properties you are considering, the approach to determining differences is pretty straightforward: recursion. A node is the same if it has all the same children and they are the same.
Where things get tricky again is how you interpret those differences. For example:
A A
| |
B C D
|
B C
I can interpret that at least 2 different ways. The first being that a new node D
was inserted. The other being that B
and C
were deleted, D
was added, and B
and C
were added to D
. The latter interpretation is easier, in my estimation, to code. The former will require you to come up with more rules. One last example:
A A
| |
B C D
|
B
Now if you want to see that as D
being inserted and C
being deleted (again, does order matter?), you need to make a lot of assumptions. Ultimately, it comes down to the fact that there are many different transformations that result in the same tree structure. To a large degree, what you should infer from a given before and after image will depend on the way they will be synchronized.
Let me show you an old requirements trick:
I need to compare and synchronize two tree data structures.
I have Tree A (the “Reference Tree”) and Tree B (the “Target Tree”). I need to bring the Target Tree into compliance with the Reference Tree. I need to find differences — additions and deletions, including those that result in new descendants or branches — and bring the trees into synchronization.
I challenge these requirements. I believe they are under specified. As proof, consider a proposed solution: Discard the old B. Make a copy of Tree A and name it B.
I’m sure you have reasons this wont work. But since you didn’t tell us what those are this is what you get from me until you do.
4
I think the basic problem is not difficult to solve, and doesn’t support a “school of thought” of its own – you’d just be looking for something like “algorithms on tree structures”.
As to an algorithm, you start at the root, moving breadth first, compare the node, and alter the target (and any descendants) if they do not match. Typically this algorithm will use recursion.
And compare (and alter) the ordering at each level, if ordering is important.
I suspect things in this area only get complicated when there are specific nuances of an application, that aren’t implied by the general problem. If that’s the case, you may wish to be more specific about the context of your problem.
For example, if you distinguish structural changes from content changes – in other words, if the replacement of a node (and the relations it has with other nodes) is distinct from the replacement of its contents – then you might find that such cases cannot be distinguished from a mere comparison of the contents of each node. The solution in this situation is that you would have to store additional data that represents the unique identity of the node.