The Rumours Of XML’s Death Have Been Greatly Exaggerated

By Adrian Sutton

July 23, 2004

Mark Pilgrim has posted an interesting article entitled XML on the Web Has Failed, and he’s right to some degree. Character sets remain a huge mess on the internet, but I think he’s pinning the failure on the wrong technology. It’s not XML that has failed but RFC 3023, which specifies the rules for determining the charset of XML delivered over HTTP. RFC 3023 fails because no one likes the way it works and it’s just not implemented anywhere.

The one part of the specification that causes problems is what to do when XML is transferred over HTTP with a Content-Type of text/xml and no charset parameter. That rule is so screwed up (it says to ignore the charset declared in the XML file) because a bunch of proxies translate the character encoding without knowing anything about the content being transferred, except that it’s text/something.

So what’s the solution? Put some common sense back into the mix: if there’s no charset in the HTTP headers and a charset is declared in the XML file, use the charset from the XML file. Then fix the proxies that are destroying content. They’re probably destroying a lot of HTML files as well, since they wouldn’t pay attention to the charset specified in a meta tag.

Claiming that XML has failed is throwing the baby out with the bath water. The problem is just that there are some stupid proxies doing things that, while currently allowed, are pretty obviously going to destroy content at least some of the time. So I propose a very simple solution to this problem. Add one new rule to the HTTP spec:

Proxies MUST NOT modify the body of the HTTP request or response. There’s already enough trouble getting the server and the client to work out the charset without someone in the middle silently screwing things up. This would of course make things like Privoxy non-compliant with the spec, but that’s probably reasonable considering that it can occasionally break the functionality of web pages.
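To make the failure mode concrete, here is a minimal Python sketch of what a blind transcoding step does to an XML document. The `naive_transcode` function is a hypothetical stand-in for such a proxy, not the behaviour of any particular product, and the UTF-8 to ISO-8859-1 conversion is just an assumed example.

```python
import xml.etree.ElementTree as ET


def naive_transcode(body: bytes, source: str = "utf-8", target: str = "iso-8859-1") -> bytes:
    """Hypothetical proxy step: re-encode the bytes without looking at the content."""
    return body.decode(source).encode(target)


def is_well_formed(body: bytes) -> bool:
    """Return True if the standard library parser accepts the document."""
    try:
        ET.fromstring(body)
    except ET.ParseError:
        return False
    return True


# The document declares UTF-8 and contains one non-ASCII character (e with acute).
original = '<?xml version="1.0" encoding="utf-8"?><p>caf\u00e9</p>'.encode("utf-8")

# After transcoding, the declaration still says utf-8 but the bytes are ISO-8859-1.
mangled = naive_transcode(original)

print(is_well_formed(original))
print(is_well_formed(mangled))
```

The bytes and the declaration now disagree, so a parser that trusts the declaration rejects the document, and a parser that ignores it has to guess.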

Compliant behavior operating on compliant data should not result in non-compliant, non-functional or incorrect results. Obviously specs can’t actually specify the above because of the ambiguity of “non-functional” and “incorrect”, but it’s still a good design principle to follow. Proxies that modify the body of HTTP requests and responses definitely don’t follow that principle.
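The common-sense charset rule proposed earlier can be sketched in a few lines of Python. This is an illustration under assumptions, not what RFC 3023 prescribes: the function names are made up for the example, the UTF-8 default is my choice, and the declaration sniffing only handles ASCII-compatible encodings (UTF-16 would need byte-order-mark detection first).

```python
import re
from email.message import Message

# Matches an encoding pseudo-attribute in an XML declaration at the start of the body.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?\sencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._\-]*)["\']'
)


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent


def charset_from_xml_declaration(body):
    """Return the encoding named in the XML declaration, or None."""
    if body.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 byte order mark
        body = body[3:]
    match = _XML_DECL.match(body)
    return match.group(1).decode("ascii").lower() if match else None


def resolve_xml_charset(content_type, body, default="utf-8"):
    """HTTP charset wins; otherwise trust the XML file; otherwise fall back."""
    return (
        charset_from_content_type(content_type)
        or charset_from_xml_declaration(body)
        or default
    )


doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'
print(resolve_xml_charset("text/xml", doc))
print(resolve_xml_charset("text/xml; charset=utf-8", doc))
```

The order of the checks is the whole point: an explicit charset from the server is still respected, but a missing one no longer overrides what the document says about itself.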