URL Escaping is Evil
By Adrian Sutton
I have come to the conclusion that URL escaping is evil and must be banished from the face of the earth. I’ve got no idea how it manages to work at all – every implementation seems to be different, and support for different character sets is very much hit and miss. Take for instance the string:
© Adrian Sutton
It looks like a pretty simple string. It should be encoded as:
%C2%A9%20Adrian%20Sutton
assuming UTF-8 character encoding (and I literally mean assuming, since there’s no possible way to know for sure). If, however, you were to use the JavaScript escape() function, you could get any one of:
%u00A9+Adrian+Sutton
%C2%A9%20Adrian+Sutton
%u00A9%20Adrian%20Sutton
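To show how much the result depends on guesses about the character set and the space convention, here is a minimal sketch. It uses the standard encodeURIComponent function for the UTF-8 form; the encodeLatin1 helper is hypothetical, written only for this illustration, and assumes ISO-8859-1 with one byte per character. It does not try to reproduce the %u00A9 form.

```javascript
// The sample string from the post.
const text = "© Adrian Sutton";

// encodeURIComponent always encodes as UTF-8 and writes spaces as %20.
const utf8Form = encodeURIComponent(text);

// Form-style variant: same bytes, but spaces written as "+".
const formStyle = utf8Form.replace(/%20/g, "+");

// Hypothetical helper (not a built-in): percent-encode assuming ISO-8859-1.
// Only works for code points up to U+00FF; anything else throws.
function encodeLatin1(s) {
  let out = "";
  for (const ch of s) {
    const code = ch.codePointAt(0);
    if (code > 0xff) {
      throw new RangeError("not representable in ISO-8859-1");
    }
    if (/[A-Za-z0-9\-_.~]/.test(ch)) {
      out += ch; // unreserved characters pass through unchanged
    } else {
      out += "%" + code.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return out;
}
const latin1Form = encodeLatin1(text);

console.log(utf8Form);   // %C2%A9%20Adrian%20Sutton
console.log(formStyle);  // %C2%A9+Adrian+Sutton
console.log(latin1Form); // %A9%20Adrian%20Sutton
```

Three defensible encodings of the same short string, and nothing in any of them says which assumptions produced it.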
It’s impossible to tell whether the + sign in the first two is an encoded space or an actual plus sign (there’s no requirement for + to be escaped in URIs, so many implementations leave it as is). Then you have to deal with the rather odd %u00A9 syntax, which seems to be half URI escaping and half HTML entity, and finally you get to worry about which character set was in use. For the record, here’s what your browser makes of it:
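The + ambiguity can be made concrete with a short sketch, assuming the receiver only has the standard decodeURIComponent function to work with. The variable names are made up for illustration. Both readings of the same input are legal, and they give different strings; the last part shows that guessing the wrong character set can fail outright.

```javascript
// One of the escaped forms from above, as a receiver might see it.
const received = "%C2%A9+Adrian+Sutton";

// Reading 1: "+" is a literal plus sign (plain URI decoding leaves it alone).
const asLiteralPlus = decodeURIComponent(received);

// Reading 2: "+" is an encoded space (form-style decoding).
const asSpace = decodeURIComponent(received.replace(/\+/g, " "));

console.log(asLiteralPlus); // ©+Adrian+Sutton
console.log(asSpace);       // © Adrian Sutton

// Character-set guess gone wrong: %A9 on its own is not valid UTF-8,
// so decodeURIComponent throws a URIError instead of returning "©".
let latin1Failed = false;
try {
  decodeURIComponent("%A9%20Adrian%20Sutton");
} catch (e) {
  latin1Failed = e instanceof URIError;
}
console.log(latin1Failed); // true
```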