Doctypes, Compatibility Modes, Charsets and Fonts
By Adrian Sutton
This information is all covered in much more detail elsewhere on the web, but for my own future reference, here’s a primer on doctypes, compatibility modes, charsets and fonts. I needed it to explain why certain Chinese characters weren’t showing up in IE8. Of course, the best answer is that you need to have the East Asian font pack installed and then it just works (usually), but this is useful background and saves “server side” folks from a number of gotchas.
Doctypes and Compatibility Modes
- IE 7 and above have an insane array of compatibility modes which are out to get you. The most common gotcha is that IE will use compatibility mode (emulating IE7) if the website is in the “intranet zone”. In IE8 you can turn this off under Tools > Compatibility View Settings by unchecking “Display intranet sites in Compatibility View”. You wind up in the intranet zone if you’re accessing a site via a hostname that doesn’t look like a real domain name (e.g. http://dog/, which has no dots, is in the intranet zone).
- If you can avoid falling into compatibility mode, any page that declares its DOCTYPE as
<!DOCTYPE html>
or
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
will render in standards-compliant mode (where the world is as sane as it gets in web development). Go with the shorter version unless you have a reason not to.
- http://hsivonen.iki.fi/doctype/ is the gold standard for information about browser modes.
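For server-side folks who want a quick check for the intranet-zone gotcha, here is a minimal Python sketch. It flags URLs whose hostname contains no dots, which is the pattern in the http://dog/ example. The function name `looks_like_intranet_host` is hypothetical, and the real zone decision also depends on IE and Windows settings (proxy bypass lists, explicit zone assignments), so treat this as an approximation only.

```python
from urllib.parse import urlsplit


def looks_like_intranet_host(url: str) -> bool:
    """Heuristic only: a dotless hostname such as 'dog' is likely treated as intranet.

    This is an assumption-based approximation, not IE's actual zone logic.
    """
    host = urlsplit(url).hostname or ""
    return host != "" and "." not in host


if __name__ == "__main__":
    for url in ("http://dog/", "http://www.example.com/", "http://build-server:8080/app"):
        print(url, looks_like_intranet_host(url))
```

If a test environment trips this check, accessing it through a fully qualified name is one way to see the page the way outside users will.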
Character Sets and Fonts
- If you’re tracking down problems with foreign languages there are two major categories of problems – encoding corruption (where characters come out garbled or as ?) and missing glyphs in fonts (where characters come out as little boxes).
- Corruption is fixed by specifying the same character encoding everywhere: in the bytes you actually send, in the HTTP Content-Type header, and in the page itself. It is a security issue if any webpage is missing a meta tag defining the character set (the smallest variant is
<meta charset="UTF-8">
). It should be the first tag in the <head> of the document, because the HTML specification requires the declaration to fall within the first 1024 bytes of the page.
- Little square boxes mean that the font currently in use doesn’t have a glyph for that particular character and the font fallback routine was unable to find any font on the system which supports it.
- Browsers have a default stylesheet, automatically applied to every page, which commonly sets a specific font-family and font-size for text input elements. So adding the style
body { font-family: 'Arial Unicode MS' }
may get some Asian characters working in the main content but not in text boxes, unless you also add
input { font-family: inherit; }
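To make the first category concrete, here is a small Python sketch of encoding corruption. It shows both symptoms described in the list above: garbled characters when the reader assumes a different encoding from the writer, and question marks when the target encoding cannot represent the text at all. The sample string and variable names are just illustrations. Missing glyphs cannot be reproduced this way, since they depend on the fonts installed on the client.

```python
text = "中文"  # sample Chinese text

# What the server sends: the text encoded as UTF-8 bytes.
payload = text.encode("utf-8")

# Matching encodings on both sides: the text survives the round trip.
round_trip = payload.decode("utf-8")

# Mismatched encodings: a reader assuming Latin-1 sees garbled characters.
garbled = payload.decode("latin-1")

# An encoding that cannot represent the characters turns each one into "?".
question_marks = text.encode("ascii", errors="replace").decode("ascii")

if __name__ == "__main__":
    print(repr(round_trip))
    print(repr(garbled))
    print(repr(question_marks))
```

The two characters become six garbled ones in the Latin-1 case because each takes three bytes in UTF-8 and Latin-1 maps every byte to its own character.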
The security issue mentioned above is that any page which doesn’t define a character set but includes any form of user-supplied content is vulnerable to a cross-site scripting (XSS) injection attack, even if the user-supplied content is escaped properly. The content may include a byte sequence that causes the browser to incorrectly switch to UTF-7 or another unusual character set, which drastically changes the meaning of the content on the page (hence the user content is no longer correctly escaped). Modern browsers have taken steps to remove this risk (including removing support for UTF-7, I believe), but it’s good practice to specify your charset explicitly anyway.
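The following Python sketch illustrates why correct escaping is not enough when the charset is left to guesswork. The payload is a hypothetical example, and the UTF-7 decode step stands in for what a browser would do if it guessed that encoding. It is a simulation of the idea, not a reproduction of any particular browser’s detection logic.

```python
import html

# Hypothetical user-supplied comment, written in UTF-7 notation.
user_input = "+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# Standard HTML escaping finds nothing to escape: there is no <, >, & or quote.
escaped = html.escape(user_input)

# The page goes out as plain ASCII bytes with no declared charset.
page_bytes = escaped.encode("ascii")

# A browser that guessed UTF-7 would effectively decode the bytes like this.
as_seen_in_utf7 = page_bytes.decode("utf-7")

# With the charset declared as UTF-8, the same bytes stay harmless text.
as_seen_in_utf8 = page_bytes.decode("utf-8")

if __name__ == "__main__":
    print(as_seen_in_utf7)
    print(as_seen_in_utf8)
```

Under the UTF-7 interpretation the escaped text turns into a live script tag, while under the declared UTF-8 interpretation it remains the literal characters the user typed.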