On Standardizing Office XML
By Adrian Sutton
Interesting argument between Tim Bray and Robert Scoble about what benefit standardizing Office 12’s XML format will provide. Tim Bray suggests that Microsoft should use the open document format as the basis for its XML format, with custom extensions when required. Scoble argues that documents are more complex than Tim believes and that it would be impossible to create a compatible version of Office around the open document format. Maybe Tim Bray has done a lot of work with Word documents and office documents in general that I’m not familiar with – I know he’s done a lot with XML, but that’s not enough to comment on the needs of an office-style document format [1].
Frankly, I’m not entirely sure that either of them is particularly well qualified to comment – an accusation not often leveled at Tim Bray and far too commonly leveled at Scoble (usually unfairly). For the record, I’m not particularly well qualified to comment either, considering I haven’t looked at either specification. Having said that, I have spent the last five years or so creating an HTML editor that should feel familiar to Word users, including the ability to import content from Word.
With that in mind, I’d say they are both potentially right. The vast majority of documents people produce are really quite basic: a bunch of paragraphs and headings, maybe some lists and potentially a table. The advanced users create an automatic table of contents and play around with borders and shading. The really advanced people get into complex tables, columns and indexes. So Tim is right: the vast majority of documents out there are really, really simple and could easily use pretty much any half-baked language for documents.
On the other hand, with the vast number of Word documents out there, even though only a very tiny percentage use more complex features, that’s an awful lot of Word documents, so supporting every last feature down to the tiniest little detail is important to Microsoft. Considering how difficult the task of backwards compatibility is, Microsoft has done a surprisingly good job with it (the lack of forward compatibility is what they use to force people to upgrade so they can read documents that are sent to them).
So, is it possible to just add extensions to the open document format to support the extra Word features? Probably. Would it be simple? Heck no. The trouble with converting to a format that uses a different model for the text is that inevitably you come across some property of the document model that shows through to the user and differs between the two models. Trying to convert Word’s model for styles to CSS is one such problem. CSS has a completely different concept of inheritance to Word, so to create something that is familiar to end users, you spend a lot of time coming up with ways to make the new model appear just like the old model. What makes it really difficult is when you then have to support a document that was created around the new document model and does things that simply aren’t possible in the old model. You’ve got to find a user interface and a series of editing operations that reconcile the two models.
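To make the inheritance mismatch a little more concrete, here is a minimal sketch. It assumes a heavily simplified picture of both models: in the Word-like model a named style inherits from the style it is "based on", while in the CSS-like model an element inherits from its parent element in the document tree. The function names, the dictionary layout and the sample styles are all made up for illustration, and the sketch treats every property as inheritable, which real CSS does not.

```python
# Hypothetical, simplified models; neither mirrors a real file format.

def resolve_based_on(styles, name):
    """Word-like: follow the 'based on' chain between named styles."""
    chain = []
    while name is not None:
        chain.append(styles[name])
        name = styles[name].get("based_on")
    resolved = {}
    for style in reversed(chain):  # base style first, most specific last
        resolved.update(style.get("props", {}))
    return resolved


def resolve_tree(path):
    """CSS-like: inherit down the element tree; path runs from root to target."""
    resolved = {}
    for element in path:
        resolved.update(element.get("props", {}))
    return resolved


# Word-like: the heading's look depends only on its style chain.
styles = {
    "Normal": {"based_on": None, "props": {"font": "Times", "color": "black"}},
    "Heading 1": {"based_on": "Normal", "props": {"size": 16}},
}
word_heading = resolve_based_on(styles, "Heading 1")

# CSS-like: the same heading changes when it is nested in another container.
body = {"props": {"font": "Times", "color": "black"}}
sidebar = {"props": {"color": "gray"}}
h1 = {"props": {"size": 16}}
css_heading_plain = resolve_tree([body, h1])
css_heading_nested = resolve_tree([body, sidebar, h1])
```

In this toy setup the heading resolves identically in both models until it is moved inside the sidebar, at which point the tree-based model changes its colour while the style-chain model would not. That kind of difference is what shows through to the user during editing.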
The other problem you run into is how to test every possible document that could be made. We have a ton of code for handling special cases of Word documents and yet we still keep finding documents that present new corner cases. There’s a difference between identical-looking documents that were created in Word originally, documents that were copied and pasted into Word from a web page, documents that were copied from Outlook, and so on. The number of possible variations is staggering, and it is impossible for any QA department to identify them all.
All up, my suggestion for finding out if it’s possible or not is to find someone who is familiar with the old Word document format (an Open Office developer perhaps), who is also familiar with the Open Document Format, and who is familiar with the new Office 12 XML format. You need to be very, very familiar with all three to be able to accurately gauge how possible and how expensive it would be for Microsoft to base its file format on the open document format. Anyone who isn’t in that position most likely doesn’t have any idea what it would take to use Open Document Format in Word, including me.
[1] Docbook and most XML document formats do not in any way, shape or form count as an Office-style document. Office-style documents are mostly focussed on formatting and not so much on content. Most XML based formats are the opposite of that.