Ampersands (Does it ever end?)
By Adrian Sutton
Byron continues on the ampersand issue:
“I’m not going to accept your argument that it’s not harmful to produce invalid HTML. What would your code produce for: http://example.com/entities.cgi?entity=&”
The requirement is that it should produce exactly that, since that will work in all known browsers and would break in all known browsers if the ampersand wasn’t escaped. Since I didn’t personally write the code I can’t be certain that it does output that, but that’s what it should do. It should output whatever is most compatible, so that our users are as happy as possible.
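
To make the escaping concrete, here is a minimal Python sketch of writing a URL into an href attribute and reading it back. The URL is a hypothetical one chosen for illustration, not one from the discussion, and this is not the code our editor uses.

```python
from html import escape, unescape

# Hypothetical link target, chosen only for illustration.
url = "http://example.com/search.cgi?name=value&name2=value2"

# Escape the URL for use inside a double-quoted attribute value.
href = escape(url, quote=True)
anchor = f'<a href="{href}">search</a>'
print(anchor)

# Decoding the attribute value gives back the URL the user asked for.
assert unescape(href) == url
```

The escaped form is what ends up in the markup; the browser decodes it before following the link, so the server still sees a plain ampersand.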
“I seem to recall someone or something in this discussion mentioned using a character other than ampersand entirely. If the links you’re producing are for programs within your control, it may be appropriate to use semi-colons, not ampersands. Of course, this approach will often be impractical or impossible.”
Impractical, impossible and highly incompatible. There are a vast number of pre-existing frameworks that assume GET parameters are of the form name=value&name2=value2. Using anything but an ampersand would break all those frameworks and existing libraries. Besides, in an HTML editor we don’t get to decide what the link should be – the user does.
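
As one illustration of that assumption, here is a small sketch using Python's standard query-string parser. It assumes a recent Python (3.10 or later), where `parse_qs` splits only on ampersands unless told otherwise; other frameworks may behave differently.

```python
from urllib.parse import parse_qs

# Conventional query string: parameters separated by ampersands.
conventional = parse_qs("name=value&name2=value2")

# The same parameters separated by a semicolon. With the default
# settings the parser treats the whole string as one parameter.
semicolon_default = parse_qs("name=value;name2=value2")

# Only an explicit opt-in on the receiving side recovers both parameters.
semicolon_opt_in = parse_qs("name=value;name2=value2", separator=";")

print(conventional)
print(semicolon_default)
print(semicolon_opt_in)
```

The semicolon form only works when whoever receives the request has opted in, which is exactly the control an HTML editor does not have.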
“There are some other questions I’d like to see you answer. Is it OK to use ampersands incorrectly in an XML document because one particular parser isn’t conformant to the standard?”
No. That would greatly reduce compatibility with real-world situations and make life more difficult for our users, so it would be unacceptable. Were it the case that all known XML parsers handled an unescaped ampersand but one didn’t correctly handle an escaped ampersand, my answer would be different. The technology is irrelevant here; the answer is always to do what best meets the user’s requirements.
“Is it OK to expect every parser to let you violate the XML standard to support bugs in others?”
Yes. Any HTML parser that can’t handle unescaped ampersands in a document is not going to be particularly useful unless all the HTML documents have previously been cleaned with something like Tidy. The state of the real world is that unescaped ampersands are the least of the problems an HTML parser is expected to deal with – incorrectly nested tags are much more difficult to handle, but most parsers still get them right.
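
For a sense of what a tolerant parser looks like in practice, here is a minimal sketch using Python's standard `html.parser` module. The `LinkCollector` class and the sample markup are hypothetical, written only for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from anchor tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# A raw ampersand in the attribute and an unclosed <b>: invalid, but common.
sloppy = '<p><b>See <a href="page.cgi?a=1&b=2">this page</a></p>'

collector = LinkCollector()
collector.feed(sloppy)
collector.close()
print(collector.links)
```

The parser reports the link without complaint even though the markup is invalid in two different ways.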
“I am facing the exact problem under discussion right now: a piece of HTML I wish to parse misuses ampersands. The library I am using does not tolerate malformed attribute values. Do you consider this a bug in the library I am using?”
That depends. If the library claims to parse only fully valid HTML, then I would call it a critical limitation but not a bug. The parser would still be useful for documents that have somehow been confirmed to be completely valid (e.g. the help viewer of a program that uses HTML for its help pages). If, however, the library claims to be able to parse HTML, I would definitely consider it a bug. The most common use of the term “HTML” refers to documents that vaguely comply with the HTML standard (but are rarely fully conforming), so I would expect any parser that claims to parse HTML to handle these malformed documents. Where you draw the line on how messed up a document can be will depend entirely on your requirements.
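
If you are stuck with a strict library, one workaround is to clean the input before parsing it. The sketch below uses Python's XML parser as a stand-in for a strict library and a hypothetical helper, `escape_bare_ampersands`, built on a heuristic regular expression. It is an illustration under those assumptions, not a replacement for a real cleaner such as Tidy.

```python
import re
import xml.etree.ElementTree as ET

# Matches an ampersand that does not start a character or entity reference.
BARE_AMPERSAND = re.compile(
    r"&(?!#\d+;|#[xX][0-9a-fA-F]+;|[A-Za-z][A-Za-z0-9]*;)"
)


def escape_bare_ampersands(markup: str) -> str:
    """Hypothetical pre-cleaning step: turn stray '&' into '&amp;'."""
    return BARE_AMPERSAND.sub("&amp;", markup)


malformed = '<a href="page.cgi?a=1&b=2">link</a>'

# The strict parser rejects the raw ampersand outright.
try:
    ET.fromstring(malformed)
    strict_accepts_raw = True
except ET.ParseError:
    strict_accepts_raw = False

# After cleaning, the same parser accepts the markup and decodes the link.
repaired = escape_bare_ampersands(malformed)
href = ET.fromstring(repaired).get("href")
print(strict_accepts_raw, repaired, href)
```

The regular expression is only a heuristic: it leaves anything that looks like an entity reference alone, so a query parameter that happens to end in a semicolon would slip through unescaped.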
“Personally, I do not consider it a bug in the library, since it parses HTML just fine. A document with a raw ampersand in an attribute value is not, strictly (or pedantically) speaking, HTML.”
Bug or not, it’s failing to meet your requirements and causing your program to fail. That is a bad thing, and if you have real-world users they should (and most likely will) expect you to fix it.
It is worth noting that people who want the HTML they take in to be highly compliant and simple to parse should use XHTML. Not only does it make parsing easier (it can be done with any XML parser given the right DTD), but XHTML documents also tend to conform to the standard much more regularly than HTML documents do. Browsers generally have a separate mode for XHTML that expects documents to be well-formed, so browser behavior is more predictable as well.
When all is said and done, the point of software is to get a job done, not to comply with standards. If the software accomplishes its job and the users are happy, that’s all that really matters. Life happens in the real world and users pay money for real-world results; expecting HTML to be well-formed is a pipe dream.