String Interning
By Adrian Sutton
Elan Meng investigates the behaviour of string constants, interning, and == compared to .equals. Very informative. The remaining question is why anyone would ever use == instead of .equals on a String, considering how likely it is to cause confusion and the potential for disaster if one of the strings for some reason isn’t interned. The answer is performance.
In the average case, the performance difference between == and .equals is pretty much non-existent. The first thing .equals does (like any good equals method) is check whether the objects are == and return true immediately if they are. The second thing it does is check that the strings have the same length (a Java String is immutable, so its length is fixed and finding it requires no scan of the characters). If an answer still hasn’t been found, the characters of the two strings are compared in order and false is returned as soon as they differ. Now, consider the possible cases:
- The strings are .equals and == (the same object).
- The strings are .equals but not ==.
- The strings are neither .equals nor == and have different lengths.
- The strings are neither .equals nor == and have the same length.
The effective performance hit is then:
- The == check returns true. The only extra cost of using .equals over == is the cost of the method call (which is likely to be inlined anyway).
- The == check will fail, the lengths will match and every character will have to be compared.
- The == check will fail, the lengths will differ and .equals can return false straight away. The only extra overhead is the method call and an integer comparison.
- The == check will fail, the length check will pass and the .equals method will have to iterate over the characters until the first difference.
It may seem from that description that the most significant performance problem occurs when the strings are .equals but not ==. However, this problem is easily mitigated by interning all of the strings when they first enter the section of code. Since the interning should occur outside of tight loops, it should be a reasonably uncommon occurrence, and so the performance hit of having to compare the strings will be irrelevant.
The final situation is the one that causes all the problems, though. There is no way to eliminate or reduce the cost of this case when using the .equals method. When the strings are long, differ only in the last character, and the comparison happens in a performance-critical section of code (particularly inside tight loops), the performance hit can be very significant. Once both strings are interned, == settles this case with a single reference comparison. This is the case with the Xerces parser (and in fact most parsers that work with a lot of strings). So if you look at the source code for something like Xerces, you’ll find that == is used in preference to .equals most of the time, because any strings that come into the library are immediately interned to avoid the performance penalties above in the critical sections of code.
Of course, since you have to be absolutely positively certain that all the strings are interned, you may as well use == all through the “interned section”, as it will help to reveal any bugs caused by strings not being interned (the more times an incorrect comparison occurs, the higher the chance that it will be noticed).
It should of course be noted that unless profiling has revealed string comparison to be a significant performance bottleneck for your application, you shouldn’t even consider this optimization, as it will cause problems sooner or later.
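To make the cases above concrete, here is a minimal sketch in Java. The class name InternDemo and the helper sketchEquals are hypothetical names made up for illustration. The helper only mirrors the three steps described in this post (identity check, length check, character-by-character comparison) and is not the JDK's actual String.equals source. The main method then shows how interning a string lets == give the same answer as .equals.

```java
final class InternDemo {

    // Hypothetical helper: follows the three steps described in the post.
    // This is an illustration, not the real String.equals implementation.
    static boolean sketchEquals(String a, String b) {
        if (a == b) {
            return true;                  // step 1: same object
        }
        if (a == null || b == null) {
            return false;                 // guard added for this sketch only
        }
        if (a.length() != b.length()) {
            return false;                 // step 2: lengths differ
        }
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return false;             // step 3: stop at first difference
            }
        }
        return true;                      // every character matched
    }

    public static void main(String[] args) {
        String literal = "element";             // string literals are interned
        String built = new String("element");   // a distinct object, same characters

        System.out.println(literal == built);            // false: different objects
        System.out.println(literal.equals(built));       // true: same characters
        System.out.println(sketchEquals(literal, built)); // true: had to compare every character

        String interned = built.intern();       // canonical instance for these characters
        System.out.println(literal == interned);         // true: one reference comparison
    }
}
```

The pair literal and built is the ".equals but not ==" case. After the call to intern(), the comparison falls into the first case, which is why interning once at the boundary is enough to make == safe inside the "interned section".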