Diffing HTML
By Adrian Sutton
I think this is the final episode in my series of responses[1] to Alastair’s “Responding to Adrian”. What is the aim of diffing HTML, how hard is it, and how do you go about it?
The aim is really important to identify. The most common and most useful aim that I see for diffing HTML is to show users what changed between two versions of a document. Since the management of most content is centralized[2], this means showing the combined effect of every revision made between the older version being compared and the newer one. If you’ve ever wanted to see what’s changed on a wiki page, you’ve wanted this type of diff. If you’re sending Word documents back and forth between people, you probably want this type of diff too.
Another common and useful type of diff is where two parties have made changes to the same base version and you want to compare them so they can be merged into a single document containing both sets of changes. This is useful when you don’t have a centralized repository controlling things, or if that repository allows concurrent changes.
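As a rough illustration of this merge case, here is a minimal line-level three-way merge sketch built on Python's standard difflib. The names sync_regions and merge3 are made up for this example, and the approach is a simplified diff3-style merge over lists of lines or paragraphs. It makes no attempt to understand HTML structure or natural language.

```python
import difflib

def sync_regions(base, a, b):
    """Find stretches of base that are unchanged in both a and b."""
    a_blocks = difflib.SequenceMatcher(None, base, a, autojunk=False).get_matching_blocks()
    b_blocks = difflib.SequenceMatcher(None, base, b, autojunk=False).get_matching_blocks()
    regions = []
    ia = ib = 0
    while ia < len(a_blocks) and ib < len(b_blocks):
        a_base, a_pos, a_len = a_blocks[ia]
        b_base, b_pos, b_len = b_blocks[ib]
        start = max(a_base, b_base)
        end = min(a_base + a_len, b_base + b_len)
        if start < end:
            # (base start, base end, matching start in a, matching start in b)
            regions.append((start, end,
                            a_pos + (start - a_base),
                            b_pos + (start - b_base)))
        # Advance whichever block finishes first in base.
        if a_base + a_len < b_base + b_len:
            ia += 1
        else:
            ib += 1
    # Sentinel so the text after the last shared region is also processed.
    regions.append((len(base), len(base), len(a), len(b)))
    return regions

def merge3(base, a, b):
    """Merge two edited copies of base; returns (merged, conflicts)."""
    merged, conflicts = [], []
    pz = pa = pb = 0
    for z_start, z_end, a_start, b_start in sync_regions(base, a, b):
        base_chunk = base[pz:z_start]
        a_chunk = a[pa:a_start]
        b_chunk = b[pb:b_start]
        if a_chunk == b_chunk:
            merged += a_chunk            # both sides agree
        elif a_chunk == base_chunk:
            merged += b_chunk            # only b changed this part
        elif b_chunk == base_chunk:
            merged += a_chunk            # only a changed this part
        else:
            conflicts.append((base_chunk, a_chunk, b_chunk))
            merged += base_chunk         # keep base text, report the conflict
        merged += base[z_start:z_end]
        size = z_end - z_start
        pz, pa, pb = z_end, a_start + size, b_start + size
    return merged, conflicts
```

Given a base of three paragraphs, one party rewriting the second and the other appending a fourth, merge3 returns a document with both changes and no conflicts. If both parties rewrite the same paragraph differently, the base text is kept and the three competing versions are returned in the conflicts list for a human to resolve.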
The third use of diff is to store only the changes to a document in order to save space, be that space on the file system or network bandwidth. This is the one form that has absolutely no need for human readability.
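A minimal sketch of this storage-oriented use, again with Python's difflib: the new version is encoded as "copy this range from the old version" and "add this literal text" instructions. The names make_delta and apply_delta and the tuple format are assumptions for illustration, not any real delta format.

```python
import difflib

def make_delta(old, new):
    """Encode new as copy-from-old ranges plus literal added text."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            delta.append(("add", new[j1:j2]))
        # "delete" needs no entry: that range of old is simply never copied.
    return delta

def apply_delta(old, delta):
    """Rebuild the new version from the old version and a delta."""
    parts = []
    for op in delta:
        if op[0] == "copy":
            parts.append(old[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)
```

The only requirement here is that apply_delta(old, make_delta(old, new)) gives back new exactly. Whether the delta would make any sense to a person reading it does not matter.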
It’s also worth noting that diffing HTML is quite different to diffing generic XML. HTML is a document format and the content it generally contains is very natural language intensive. There are many XML formats that have similar attributes[3], but also a lot of XML formats that don’t[4]. For content that isn’t natural language intensive, diffing in any of these use cases comes down to the classic computer science techniques: adding, removing and moving elements; adding, removing and changing attributes; changing element content; and so on. However, for natural language based content, the changes to the XML structure are far less important than the changes to the actual textual content.
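For data-style XML, the structural operations listed above can be sketched with Python's xml.etree.ElementTree. This is a deliberately naive example: the function name diff_elements and the operation labels are invented here, children are paired purely by position, and moved elements are not detected.

```python
import xml.etree.ElementTree as ET

def diff_elements(old, new, path=""):
    """Report structural differences between two elements as tuples."""
    here = f"{path}/{old.tag}"
    if old.tag != new.tag:
        return [("replace-element", here, new.tag)]
    changes = []
    # Attribute changes: removed, added, or changed value.
    for name in sorted(old.attrib.keys() | new.attrib.keys()):
        if name not in new.attrib:
            changes.append(("remove-attribute", here, name))
        elif name not in old.attrib:
            changes.append(("add-attribute", here, name))
        elif old.attrib[name] != new.attrib[name]:
            changes.append(("change-attribute", here, name))
    # Element content, ignoring surrounding whitespace.
    if (old.text or "").strip() != (new.text or "").strip():
        changes.append(("change-text", here))
    # Children are compared position by position (no move detection).
    old_kids, new_kids = list(old), list(new)
    for old_kid, new_kid in zip(old_kids, new_kids):
        changes += diff_elements(old_kid, new_kid, here)
    for extra in old_kids[len(new_kids):]:
        changes.append(("remove-element", here, extra.tag))
    for extra in new_kids[len(old_kids):]:
        changes.append(("add-element", here, extra.tag))
    return changes
```

For something like stored form data this kind of report is usually enough. Run over a paragraph of prose, it would only say that the text of the paragraph changed, which is exactly the part a reader wants explained.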
It may well be that I’ve missed something, and if so please let me know, but despite a reasonable amount of searching I’ve never seen a diff tool for any format that can handle natural language well. For the aim of showing changes between document versions, the most important thing is that the diff output clearly shows the intent of changes, not the effect of them. Line based diff obviously doesn’t cut it here, as a single character spelling correction shouldn’t cause the entire line to be marked as different. This kind of change is where Word’s track changes shines: it can mark that one character change as a single character change while still marking a change from “the” to “tear” as a word change instead of a removed “h” and an inserted “a” and “r”. It is complexities like these that make natural language diffing so hard.
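To make the granularity problem concrete, here is a small sketch using Python's difflib that compares the same text word by word and character by character. The helper names tokenize, word_diff and char_diff and the tokenising regular expression are assumptions for this example. It says nothing about how Word's track changes is implemented.

```python
import difflib
import re

def tokenize(text):
    # Words and individual punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def word_diff(old, new):
    """List the non-equal opcodes between two texts, compared word by word."""
    a, b = tokenize(old), tokenize(new)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [(tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

def char_diff(old, new):
    """The same comparison, character by character."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return [(tag, old[i1:i2], new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]
```

Comparing “I saw the paper” with “I saw tear paper”, word_diff reports one replaced word, while char_diff on “the” and “tear” reports a deleted “h” and an inserted “ar”. Neither function knows which granularity matches the author's intent: word_diff would also report a one-letter spelling fix as a whole replaced word, and choosing between the two views automatically is the hard part.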
XML structural diffing, on the other hand, is a fairly well solved problem as long as you just need to know that the content changed, rather than a clear picture of what was changed. Docucomp, for instance, has very good diffing tools for HTML, but it is largely ineffective at showing the intent of changes to natural language.
In general, I agree with Alastair’s comments on diffing, but he seems to be looking mostly at the latter two use cases[5], whereas I mostly focus on the first since that’s about the only use case I have for diff. I also agree that Word has some significant limitations and bugs in its track changes implementation, but as a technology concept it does show the potential of tracking changes as they happen, instead of diffing after the fact, as a way of showing humans the changes between two versions of a document. A complete implementation of track changes[6] would enable the other two use cases, but with more effort than a post-edit diff.
In the end, it really comes down to what your use case is.
1 – Previous responses: On The Importance Of Rendering Fidelity, The Invisible Formatting Tag Problem and the original The Challenge Of Intuitive WYSIWYG HTML
2 – at least in the area I work in, where CMSs and similar systems are everywhere
3 – Docbook and DITA come to mind
4 – e.g. the data from a form that is stored as XML
5 – which explains why three way diff is a hard requirement
6 – Word can't track structural changes to tables