Four-way XML comparison in C#


I have 4 XML files: A, B, C, and D. I want to know if the difference between A and B is the same as the difference between C and D.

The XML files are serializations of the same .NET object; one of the primary differences will be in a particular list that describes the features available on a particular product. (The description of a feature is itself another object.)

All four have very similar structures, but there may be values present in one that aren’t present in another, and some values may be changed. For example, if we consider document A:

<xmldoc>
   <a></a>
   <c></c>
   <d></d>
</xmldoc>

Document B:

<xmldoc>
   <a></a>
   <b></b> <!-- Added -->
   <c></c> <!-- C and D are still in the same order (apart from the addition of <b>) -->
   <d></d>
   <e></e> <!-- Also added, but it doesn't affect the order of the other elements -->
</xmldoc>

Now suppose that I have the following documents. Document C is exactly identical to document A:

<xmldoc>
   <a></a>
   <c></c>
   <d></d>
</xmldoc>

Document D is identical to document B.

Since the difference between C and D is exactly the same as the difference between A and B, this should pass. However, suppose that instead we have document D as follows:

<xmldoc>
   <a></a>
   <b></b> 
   <f></f> <!-- Added -->
   <c></c>
   <d></d>
   <e></e>
   <f></f>
</xmldoc>

The difference between C and D is no longer the same as the difference between A and B.

I’m pretty sure that we won’t have a case where document A shows up as:

<xmldoc>
   <c></c>
   <a></a> <!-- Same as the original document A except that this element was reordered; this shouldn't happen -->
   <d></d>
</xmldoc>

My first thought was to use Microsoft’s XML Diff Patch library, which compares two files and generates a DiffGram, which is an XML document that describes the difference between the two files being compared. My thought is that I could compare A to B to get DiffGram X and C to D to get DiffGram Y, and then do a third XML comparison between X and Y.

The idea sounds good on paper; unfortunately it’s not turning out to be so simple. The difference between A and B is very similar to the difference between C and D, but X and Y look nothing like each other.

The problem is it gives DiffGrams like the following:

<xd:node match="4">
   <xd:node match="2">
      <xd:node match="1">
         <xd:remove match="1-3" />
      </xd:node>
   </xd:node>

   <xd:node match="1">
      <xd:node match="1">
         <xd:remove match="1-3" />
      </xd:node>
   </xd:node>
</xd:node>

This has two problems. First, it’s extremely cryptic: I’d prefer it to be more human-readable, but it’s not the end of the world if it isn’t (since my primary purpose here is programmatic). Second (and much more critically), the DiffGram seems to be very tightly coupled to the specific XML files in that particular comparison.

I originally posted on the Software Recommendations Stack Exchange asking for a .NET library (preferably available as a NuGet package) that would be suitable for this purpose, but didn’t have much luck getting a recommendation. (Full disclosure: I haven’t deleted that question yet but intend to do so shortly.) If such a library exists, I haven’t been able to find it (a lot of them seem not to be designed for the purpose I have in mind and/or aren’t written for the .NET Framework), but if anyone is aware of such a library, that would definitely be an acceptable solution as well (in fact, I would strongly prefer that to having to implement it myself).

Has anyone successfully done something like this (either by creating your own solution, using Microsoft’s XML Diff library, or using another third-party library)? If so, what did you do?

I’m hoping that this isn’t too broad of a question (if so let me know and I’ll edit), but what would be a good approach to this if I end up writing this myself?


“My thought is that I could compare A to B to get DiffGram X and C to D to get DiffGram Y, and then do a third XML comparison between X and Y.”

That seems to be a good start. I guess what is missing here is something like a program or XSLT script to transform “DiffGram X” into a readable representation X’. You can then apply the same transformation to DiffGram Y to get a readable Y’. Comparing X’ and Y’ gives you a final DiffGram Z, which might in turn be transformed into a readable Z’.

What this script or program will look like probably depends on what kind of assumptions you can make about the structure of the input files. Do they really consist of arbitrarily nested XML trees? Do you need to compare attributes, namespace differences, elements, and element text as well? I would be astonished if one could not use that knowledge to simplify the DiffGrams.
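As one illustration of how such assumptions can simplify things: if the interesting part of each file is a flat, consistently ordered list of child elements (as in the question’s examples), a diff can be expressed in terms of element names rather than node positions, which makes two diffs directly comparable. The following is only a sketch under that assumption, written in Python for brevity; it does not use the XML Diff and Patch library, the names `child_tags`, `readable_diff`, and `same_change` are made up for this example, and the same idea should carry over to C# with System.Xml.Linq.

```python
import difflib
import xml.etree.ElementTree as ET


def child_tags(xml_text):
    """Return the tag names of the root's direct children, in document order."""
    return [child.tag for child in ET.fromstring(xml_text)]


def readable_diff(old_xml, new_xml):
    """Describe old -> new as (operation, preceding_tag, tags) tuples.

    Assumption: only the root's direct children matter and their relative
    order is stable, so each change can be anchored to the tag before it
    instead of to a numeric position.
    """
    old, new = child_tags(old_xml), child_tags(new_xml)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changes = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op in ("delete", "replace"):
            anchor = old[i1 - 1] if i1 > 0 else None
            changes.append(("remove", anchor, tuple(old[i1:i2])))
        if op in ("insert", "replace"):
            anchor = new[j1 - 1] if j1 > 0 else None
            changes.append(("add", anchor, tuple(new[j1:j2])))
    return changes


def same_change(a_xml, b_xml, c_xml, d_xml):
    """True if the A -> B change equals the C -> D change."""
    return readable_diff(a_xml, b_xml) == readable_diff(c_xml, d_xml)
```

Because the tuples name the elements involved, the third comparison becomes a plain equality check. Nested elements, attributes, and text are ignored here and would need a recursive version.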


The DiffGram representation of changes does not work well for this situation. It is fine for patching files but not really for this type of application. Using DeltaXML gives a more useful representation of the differences between your A and B docs:

<xmldoc deltaxml:deltaV2="A!=B" deltaxml:version="2.0" deltaxml:content-type="full-context" xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1">
 <a deltaxml:deltaV2="A=B" />
 <b deltaxml:deltaV2="B" />
 <c deltaxml:deltaV2="A=B" />
 <d deltaxml:deltaV2="A=B" />
 <e deltaxml:deltaV2="B" />
</xmldoc>

You would then get something very similar for your second comparison, C to D, where C is like A but D has an added element (note that we have called these A and B here so that the result is as close to the first result as possible):

<xmldoc deltaxml:deltaV2="A!=B" deltaxml:version="2.0" deltaxml:content-type="full-context" xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1">
 <a deltaxml:deltaV2="A=B" />
 <b deltaxml:deltaV2="B" />
 <f deltaxml:deltaV2="B" />
 <c deltaxml:deltaV2="A=B" />
 <d deltaxml:deltaV2="A=B" />
 <e deltaxml:deltaV2="B" />
</xmldoc>

This is a basic two-way comparison, which is available for .NET. As you can see, you could compare these two results and get a useful diff (some namespace changes would need to be made so that the delta files are treated as regular files).
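If a full third diff is more than you need, the two delta files can also be reduced to their changed elements and compared directly. The sketch below, in Python for brevity, only parses delta output of the shape shown above and calls no DeltaXML API. It assumes that a `deltaV2` value of `A=B` marks an unchanged element and that only the root’s direct children matter; `changed_children` is a name made up for this example.

```python
import xml.etree.ElementTree as ET

DELTA_NS = "http://www.deltaxml.com/ns/well-formed-delta-v1"
DELTA_ATTR = "{" + DELTA_NS + "}deltaV2"


def changed_children(delta_xml):
    """List (tag, deltaV2 value) for direct children that are not unchanged.

    Assumption: "A=B" marks an unchanged element, as in the output above.
    """
    root = ET.fromstring(delta_xml)
    return [
        (child.tag, child.get(DELTA_ATTR))
        for child in root
        if child.get(DELTA_ATTR) != "A=B"
    ]
```

Applied to the two results above, the first list would contain `b` and `e`, while the second would also contain `f`, so the two lists differ.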

It is also possible using XML merge (though this is Java only) to go one stage better and show all three files in one. As A is the same as C we can treat this as one, so we want to know the changes between A and B and between A and D.

<xmldoc deltaxml:deltaV2="A!=B!=D" deltaxml:version="2.0" deltaxml:content-type="full-context" xmlns:deltaxml="http://www.deltaxml.com/ns/well-formed-delta-v1" xmlns:dxu="http://www.deltaxml.com/ns/unified-delta-v1">
 <a deltaxml:deltaV2="A=B=D" />
 <b deltaxml:deltaV2="B=D" />
 <f deltaxml:deltaV2="D" />
 <c deltaxml:deltaV2="A=B=D" />
 <d deltaxml:deltaV2="A=B=D" />
 <e deltaxml:deltaV2="B=D" />
</xmldoc>

That is probably what you need here. You do not say what your end goal is; perhaps it is a concurrent-edit style of update, i.e. merging the changes made in both edit paths. As you have found, this is quite difficult! I hope this helps.
Robin
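To read an answer off the merged view programmatically, one could look for elements on which B and D disagree. The sketch below is in Python for brevity and rests on an interpretation of the `deltaV2` values inferred only from the examples above: `!=` separates groups of versions whose content differs, `=` joins versions with equal content, and a version that is not listed does not contain the element. The helper names `agrees` and `disagreements` are made up for this example.

```python
import xml.etree.ElementTree as ET

DELTA_ATTR = "{http://www.deltaxml.com/ns/well-formed-delta-v1}deltaV2"


def agrees(delta_value, x, y):
    """True if versions x and y are both absent or share one '=' group.

    Assumption (inferred from the examples above): '!=' separates groups of
    versions whose content differs, '=' joins versions with equal content,
    and a version that is not listed does not contain the element.
    """
    groups = [set(group.split("=")) for group in delta_value.split("!=")]
    present = set().union(*groups)
    if x not in present and y not in present:
        return True
    return any(x in group and y in group for group in groups)


def disagreements(merged_xml, x="B", y="D"):
    """Tags of the root's direct children on which versions x and y differ."""
    root = ET.fromstring(merged_xml)
    return [c.tag for c in root if not agrees(c.get(DELTA_ATTR, ""), x, y)]
```

An empty list would mean that the change from A to B matches the change from A to D, at least at the level of the root’s direct children.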

I developed an XSLT diff stylesheet for comparing any two XML files using XSLT 1.0: https://github.com/sflynn1812/xslt-diff-turbo

You alter the variable at the top of the sheet to specify the file being compared against.

A practical example follows. Suppose file a.xml is compared against file b.xml:

a.xml

<?xml version="1.0" encoding="utf-8" ?>
<a>
  <b>test c</b>
  <c>
    <d>test</d>
  </c>
  <b>test</b>
  <c>
    <d>test</d>
  </c>
  <b>test</b>
  <c>
    <d>test</d>
  </c>
</a>

b.xml

<?xml version="1.0" encoding="utf-8" ?>
<a>
  <b>test 2</b>
  <c>
    <d>test</d>
  </c>
  <b>test</b>
  <c>
    <d>test</d>
  </c>
  <b>test</b>
  <c>
    <d>test</d>
  </c>
</a>

The output would be as shown below, with the mismatches in a.xml that are not in b.xml listed under tree->mismatch, and the mismatches in b.xml that are not in a.xml under compare->mismatch:

<?xml version="1.0" encoding="utf-8"?>
<root>
  <root>
    <tree>
      <mismatch>
        <a>
          <b>test 2</b>
        </a>
      </mismatch>
      <match>
        <a>
          <c>
            <d>test</d>
          </c>
          <b>test</b>
          <c>
            <d>test</d>
          </c>
          <b>test</b>
          <c>
            <d>test</d>
          </c>
        </a>
      </match>
    </tree>
    <compare>
      <mismatch>
        <a>
          <b>test c</b>
        </a>
      </mismatch>
      <match>
        <a>
          <c>
            <d>test</d>
          </c>
          <b>test</b>
          <c>
            <d>test</d>
          </c>
          <b>test</b>
          <c>
            <d>test</d>
          </c>
        </a>
      </match>
    </compare>
  </root>
</root>

In your case, you would diff document A against document B and document C against document D, select the mismatch output of both runs using XPath queries, and then run the XSLT sheet a third time on those two sets of differences.
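The selection step could be sketched as follows, in Python for brevity. Instead of running the stylesheet a third time, this sketch substitutes a plain equality check on the extracted mismatch sections. It assumes the output layout shown above (root/root/tree/mismatch and root/root/compare/mismatch); `shape` and `mismatch_shapes` are names made up for this example.

```python
import xml.etree.ElementTree as ET


def shape(element):
    """Reduce an element to (tag, stripped text, child shapes).

    Indentation, tail text, and attributes are ignored in this sketch.
    """
    text = (element.text or "").strip()
    return (element.tag, text, tuple(shape(child) for child in element))


def mismatch_shapes(diff_output_xml):
    """Return the shapes of tree/mismatch and compare/mismatch.

    Assumption: the output is laid out as in the example above, i.e.
    root/root/tree/mismatch and root/root/compare/mismatch.
    """
    outer = ET.fromstring(diff_output_xml)
    result = {}
    for side in ("tree", "compare"):
        node = outer.find("root/" + side + "/mismatch")
        result[side] = None if node is None else shape(node)
    return result
```

If `mismatch_shapes` returns equal dictionaries for the A/B output and the C/D output, the two comparisons found the same mismatches.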

Just a broad answer. There is a recommendation called the XML Information Set:

https://www.w3.org/TR/xml-infoset

I’d say the most accurate way to compute the difference (or “delta”) between two XML documents, and then to compare those differences themselves, is to use whichever API or component (out of the box, augmented, or custom) most faithfully supports the constructs defined in that recommendation.
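One practical stand-in for comparing information items rather than raw bytes is to canonicalize both documents first. Canonical XML is a separate W3C recommendation, not the Infoset itself, so treat the following as an approximation of the idea. The sketch is in Python for brevity, uses the standard library’s `xml.etree.ElementTree.canonicalize` (Python 3.8 or later), and the helper names are made up for this example.

```python
import xml.etree.ElementTree as ET


def canonical(xml_text):
    """Serialize xml_text as Canonical XML, trimming whitespace around text."""
    return ET.canonicalize(xml_text, strip_text=True)


def same_information(xml_a, xml_b):
    """True if both documents have the same canonical form."""
    return canonical(xml_a) == canonical(xml_b)
```

With this, differences in attribute order, quote style, indentation, or empty-element syntax no longer count as differences, so any diff computed afterwards reflects content only.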
