I Love Regex, I Hate Regex
By Adrian Sutton
I’ve been playing around with writing a mini-wiki that uses the full complement of HTML as its syntax (instead of forcing me to learn yet another markup language) and uses EditLive! for Java as the editor (eating one’s own dog food and all that). Frankly, that’s the way a wiki should work: no messing around with markup at all, just simple, easy-to-use WYSIWYG editing. Anyway, I wrote the back end in PHP since we don’t have any PHP examples in our SDK and I couldn’t be bothered working out why Perl refused to install the MySQL drivers. Loading and saving from the database is simple enough, so I settled in to turn the CamelCase words into hyperlinks. The obvious answer: regex. The obvious problem: working out which regular expression to use (I don’t use regex often since I usually live in the land of custom automatons instead). I’ve wound up with:
$pattern = "/(\b)([A-Z][\w]*[A-Z][\w]*)(\b)/";
$replacement = "\$1<a href=\"view.php?page=\$2\">\$2</a>\$3";
echo preg_replace($pattern, $replacement, $pageData["content"]);
That seems to work but is ugly as sin. It's seriously cool to be able to achieve that in what's essentially one line ($pattern and $replacement didn't need to be separate variables), but I'm going to hate myself when I come back and try to maintain it. I think it should be good enough, even though it would break on something like:
<p class="SomeClassName">
I think it's a reasonable limitation to have that break when “someClassName” would work fine. If the regex were to get any more complex, it'd be time to break it down: detect the tags, then run the above code over just the actual content. The other limitation of the current setup is that you can't have something like:
Camel<b>Case</b>
because all HTML tags are considered word breaks instead of only block and empty tags. I don't think it's worth worrying about for my purposes, though.
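For what it's worth, the "detect the tags, then run the regex over just the content" idea mentioned above could look roughly like this. It's only a sketch: the function name linkCamelCase and the naive tag-splitting pattern are my own assumptions for illustration, not code from the wiki. It uses the same CamelCase pattern as before, minus the two (\b) groups, which capture nothing because \b is zero-width.

```php
<?php
// Hypothetical helper (not part of the original wiki code): link CamelCase
// words in text only, leaving anything inside <...> untouched.
function linkCamelCase(string $html): string
{
    // Same CamelCase pattern as in the post, without the empty (\b) groups.
    $pattern = '/\b([A-Z]\w*[A-Z]\w*)\b/';
    $replacement = '<a href="view.php?page=$1">$1</a>';

    // Naive tag detection: split on anything that looks like <...>.
    // PREG_SPLIT_DELIM_CAPTURE keeps the tags in the result, so the array
    // alternates text (even indexes) and tags (odd indexes).
    $pieces = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($pieces as $i => $piece) {
        if ($i % 2 === 0) {
            // Text between tags: safe to apply the linking regex.
            $pieces[$i] = preg_replace($pattern, $replacement, $piece);
        }
    }

    return implode('', $pieces);
}

// The class attribute is left alone; the word in the text is linked.
echo linkCamelCase('<p class="SomeClassName">See HomePage</p>');
```

This only addresses the first limitation. Camel<b>Case</b> still isn't linked, since each tag still splits the text, and the tag pattern is deliberately simple: it would trip over a ">" inside an attribute value and would happily link CamelCase text that already sits inside an existing link.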