Character Encodings
By Adrian Sutton
Jim Winstead has posted a couple of entries on character encodings (1, 2). Some good info in there. My three big tips for dealing with character encodings are these:
1. Know your character encoding and make sure you’re using that encoding everywhere. Now look again, because you probably missed a place where you didn’t think about encoding. In particular, don’t forget that printing something to System.err or System.out in Java uses the platform default encoding, so characters that can’t be represented in that encoding become question marks.
2. When practical, use US-ASCII and escape any characters outside its range. Most formats which support different encodings also provide a way to represent a character which isn’t in the current encoding, through something like HTML’s entities or Java’s escape codes (\u8222 etc). Most encodings are compatible with US-ASCII (EBCDIC being a notable exception), so even if people forget to use the right encoding they can generally get away with it.
3. Remember that character encodings, despite their name, do not apply to characters; they apply to the byte sequences which represent characters. If you have a char variable in Java it has no character encoding as far as you are concerned, it’s just that character. The JVM can choose any representation it likes for that character in physical memory and you shouldn’t care (it actually happens to choose UTF-16 I think, but you still shouldn’t care). You *do* however have to worry about character encodings when you convert from characters (or Strings, which are really just a fancy array of chars) to byte streams. This happens when you use String.getBytes(), or print the characters to any kind of output stream. You also have to worry about the reverse process: new String(byte[]) and reading from an input stream.
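To make tips 2 and 3 a bit more concrete, here is a rough sketch of what happens at the character-to-byte boundary. The class name EncodingBoundaries and the helper escapeNonAscii are made up for illustration, and I’m assuming Java 7 or later so that the StandardCharsets constants are available. It is one way to see the effect, not the only way to handle it.

```java
import java.nio.charset.StandardCharsets;

public class EncodingBoundaries {

    // Hypothetical helper: keeps US-ASCII characters as they are and writes
    // every other char as a Java-style escape (one per UTF-16 code unit).
    public static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the e, escaped so this source file stays US-ASCII
        String text = "caf\u00e9";

        // The same four characters become different byte sequences per encoding.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);        // accented e takes two bytes
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1); // accented e takes one byte
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);    // accented e is not representable

        System.out.println(utf8.length + " " + latin1.length + " " + ascii.length); // 5 4 4

        // The unrepresentable character was replaced by a question mark.
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // caf?

        // Decoding with the charset used for encoding gives the original back.
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(text)); // true

        // An ASCII-only form that survives any ASCII-compatible encoding.
        System.out.println(escapeNonAscii(text));
    }
}
```

Passing a Charset constant instead of a charset name also avoids the checked UnsupportedEncodingException that the String-name overloads declare.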
The first two items should be pretty clear to you if you’ve done any work with character encodings. The third may seem unimportant, but it will help stop you from expecting code like the following to work:
String str = "my string";
byte[] utf8Bytes = str.getBytes("UTF-8"); // characters -> UTF-8 bytes
String strISO88591 = new String(utf8Bytes, "ISO-8859-1"); // UTF-8 bytes decoded as if they were ISO-8859-1
Naturally this won’t work (at least not once the string contains anything outside US-ASCII) because of rules 1 and 3. Rule 1 is broken in that you used a different encoding when working with the same data, and rule 3 is broken because you expected strISO88591 to be a String using the ISO-8859-1 character encoding, but it isn’t, because String objects don’t have a character encoding (as far as you should be concerned).
The big exception to rule 3 is when you’re using a language which doesn’t guarantee it will support whatever characters you throw at it, in which case you basically have to work only with byte arrays and never let the language’s string functions near them. In general though I’d suggest you find a better language or a better library.
If I were to add a fourth suggestion, it would be: remember that just because your character is valid doesn’t mean the font you’re using can display it. Most fonts can’t display anything more than the characters in ISO-8859-1 and a few select others, so if you’re working with mathematical symbols or characters from other languages you’ll need to find a special font that supports them.
BTW, yes, I have spent far too much time working with character encodings and tracking down where people stuffed up with character encodings.
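If you want to watch the encode-with-one, decode-with-another mistake happen, something like the following sketch should do it. The class name MismatchedDecode and the method misreadUtf8AsLatin1 are invented for this example, and I’ve swapped in a string with one non-ASCII character, since a pure US-ASCII string comes through unchanged.

```java
import java.nio.charset.StandardCharsets;

public class MismatchedDecode {

    // Hypothetical helper that repeats the mistake: encode as UTF-8,
    // then decode those bytes as if they were ISO-8859-1.
    public static String misreadUtf8AsLatin1(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "naive" with a diaeresis on the i, escaped so this source file stays US-ASCII
        String original = "na\u00efve";
        String misread = misreadUtf8AsLatin1(original);

        // The accented i is two bytes in UTF-8, and ISO-8859-1 turns
        // each of those bytes into its own character.
        System.out.println(original.length());        // 5
        System.out.println(misread.length());         // 6
        System.out.println(misread.equals(original)); // false

        // A string that stays inside US-ASCII hides the bug entirely.
        System.out.println(misreadUtf8AsLatin1("my string").equals("my string")); // true

        // If ISO-8859-1 bytes are what you need, encode the String directly;
        // the String itself never "has" an encoding.
        byte[] latin1Bytes = original.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(latin1Bytes.length);       // 5
    }
}
```

The US-ASCII case is why this kind of bug tends to sit unnoticed until the first accented name or curly quote turns up in real data.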