Pain vs Pay-Off

By Adrian Sutton

May 28, 2008

Doug’s discovered a way to improve the effectiveness of simian to avoid adding more duplication to a code base:

The solution is very simple. The simian-report is a an xml file, so I wrote a SAX2 DefaultHandler that was able to parse the number of duplications at the different threshold levels. Putting this into a trivial ant task then gave us a task to help make things no worse even at levels below what the simian-check was doing! Within the first week, the new legacy-check was breaking the build (where the simian-check would never have) and focusing the teams attention on how to make things better. The solution is simple and really cool. I have no doubt that it will have a big impact on the amount of duplication and the overall quality of our code. The trouble is, Simian’s normal reporting is really, really lousy and Doug’s extra check highlights just how useless Simian is at telling you where the duplication is instead of just detecting that there is duplication.

I discovered this new check when the build broke after a simple commit I made that fixed an annoying bug and removed some duplication. Unfortunately, it happened to change the distribution of duplication so that it triggered a simian-legacy failure. I can, after the use of a few choice words, handle failures like that because it shows I could have and probably should have done more to improve the code around the change I made. Someone’s got to go in a clean things up – it might as well be me.

That’s when I encountered Simian’s report. It’s one great big long HTML page listing pretty much every bit of duplication it found broken down into sections of numbers of lines duplicated and it has a pretty useless index at the top. Over VPN it took 15 minutes to download and render. It hung Safari and FireFox for about 5 minutes trying to render it locally. Once it loads you have to sort through the thousands of duplication matches it’s found (matching as little as 2 lines duplicated and with quite liberal settings on what constitutes a match – again, a good thing) to try and find the one new duplication that caused the build failure.

Eventually I discovered the easiest way was divide and conquer. Simian can run even if the code doesn’t compile, so I rolled back each file individually until Simian passed again. Then I patched my changes back in, searched the report for that file name and eliminated as much of the duplication involving that file as I could and reran Simian.

I’m still in two minds as to whether or not the new check is worth the pain it causes. I really like the idea of it, I just really wish there was an Eclipse plugin for Simian that highlighted duplication right there in the code, or at least a HTML report that showed the actual code marked up with duplication. Then again, if you just have to guess at what duplication to remove the code base will get better faster…