Use Asynchronous Integration
By Adrian Sutton
In James Shore’s chapter on Continuous Integration for his upcoming book, he strongly recommends synchronous integration over asynchronous integration. With synchronous integration, each time you check in you stop and wait for two builds to complete before moving on; with asynchronous integration, you check in and move on to your next task, leaving an automated build server to check that everything’s still okay. The whole point of continuous integration is that your team integrates regularly, so that you avoid conflicts. I can’t see how that is encouraged by adding up to twenty minutes of work[1] every time you want to check in.
The advantage of synchronous integration is that you always know the build isn’t broken – sort of. In reality the build may still break, but all the other developers are locked out of the source code repository while you check in and run the test build, about ten minutes every time someone checks in[5]. This lock-out period provides the illusion that the build is always passing, because any time someone checks out the code base or does an update, they get fully working code with all the tests passing. It’s really useful to know that an update won’t bring any unexpected surprises, but adding in this delay means less frequent integration cycles, more time spent waiting around and a greater chance of integration problems.
With asynchronous integration you accept the risk that every now and then the build might break, and in exchange you encourage people to integrate more often, allow people to integrate part way through a story without interrupting their flow[2], and let the whole team know when a problem comes up. Letting the whole team know is an unexpected benefit: when the build breaks, the whole team is informed that an unintended side-effect has been encountered and that help might be needed to resolve it. Most of the build failures we see are complex issues where a new feature impacts the expected results of an existing feature, and since we’re working with a legacy code base, the failing test may not make it obvious why the failure is important. People new to the code base might assume the failing test just needed to be updated because what it was asserting was outside its intended scope. Fortunately, since the whole team sees the problem, the more experienced people may recall why that assertion was important and can advise on how to make the test clearer in its intentions and how to resolve the conflict in functionality.
Being able to check in more regularly is a big advantage as well. If the check-in process takes twenty minutes to complete, you might check in two, maybe three, times a day; with a check-in process taking one to two minutes, we check in five to ten times a day. That’s a lot more regular integration and a lot less chance of conflicts. If the team is working on completely separate areas of the product that’s probably not so useful, but on commercial products you often find the features you’re adding are related. Even with a well-designed code base with limited coupling[3], there’s a good chance of conflicts when you’ve got the whole team working on improving the functionality of list handling.
There are a couple of major problems that many teams encounter with asynchronous integration, and James Shore points them out nicely. Firstly:
If the build succeeds, asynchronous integration does save time. However, if the build fails, you have to interrupt your new task to roll back and fix the old one. To do so, you must leave your new task half-done, switching context (and sometimes pairs as well) to fix the problem, then switching back. It’s wasteful and annoying.
This is annoying when your check-in occurs at the end of a story. If you check in regularly during the development of a story, then when the build fails you’re probably still working in the same area of code, with the same pair, and can quickly identify why the build failed and fix it straight away. As a bonus, you’ve just learnt something about the area you’re working in, which is clearly important. The other way to mitigate this problem is to make the build fail less often. That sounds silly, but there are some simple steps you can take to make build failures far less common:
- Don’t program by coincidence. You may have tests as a net to catch you if you slip up, but that doesn’t mean you should just randomly change code until the tests pass. You need to understand the changes you are making to the code base and have confidence that you haven’t broken anything even without running the tests. Use the tests as a safety net, not as a replacement for intelligence. You will still make changes that cause tests to fail, but you will do it less if you think about what you’re doing.
- Identify the tests for the area you’re working in and run them very regularly while you’re developing. Keeping a custom test suite around is a good way of doing this: call it something like sanity-check and don’t check it in. It’s just a scratch pad that you add related tests to as you come across them. Run these tests before you check in as well.
- Identify the really fast tests and run them before you check in. What tests you use here depends on the state of your project, how you like to work and many other things. The key thing is that the tests run fast enough that you’re not inclined to walk away from your computer while they run. Checking your e-mail or taking a quick mental break is good, but breaking flow is bad. They should run in under a minute. Some ways of picking these tests are:
- All the atomic tests. If you can run all your atomic tests in under a minute, run them.
- All the ‘fragile’ tests. It sounds bad to have fragile tests, but there are likely to be parts of your product that are more likely to break than others. Run these before you check in. Also consider the areas that these tests cover as potentially smelly and see if you can make them more robust.
- The tests for the section of code you’re working in. This is similar to the sanity-check suite described above, but will probably contain more tests in coarser-grained sections. If you know that your menu bar is a completely separate module, you can make the menu bar tests separate too and run them before check-in whenever you change the menu bar. Since it’s completely separate, you should never see a build fail because of changes to the menu bar code if you ran all the menu bar tests.
- Don’t forget to add files. You might think that everyone forgets to add a new file every so often, or misses committing an important change – it’s just a fact of life. That’s the wrong way to think. Don’t commit if there are any files unrecognized by your version control system – which means any files that should never be checked in[4] need to be added to the list of files to ignore. Then make sure you use tools that make it obvious when there are unrecognized files, or better yet, refuse to commit if there are. Finally, when you commit, always do so from the top level of the source tree and always commit every change. If you have changes that you don’t want to commit yet, store a patch and revert them, then run the pre-check-in tests again. Don’t have unrelated changes in your checked-out code – either work on one thing at a time, or check out a new copy of the code for each simultaneous thing you’re trying to do.
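The ad-hoc sanity-check suite suggested above might look like the following sketch in Python’s unittest (the test cases here are hypothetical stand-ins; in a real project they would be imported from your existing test modules):

```python
import unittest

# sanity_check.py -- a scratch suite that is never checked in. Add the
# tests you keep coming back to while working in one area, run it
# constantly during development, and run it again before check-in.

class ListHandlingTest(unittest.TestCase):
    # Hypothetical stand-in for a real test in the area being worked on.
    def test_sorting_preserves_elements(self):
        self.assertEqual(sorted([3, 1, 2]), [1, 2, 3])

class MenuBarTest(unittest.TestCase):
    # Another stand-in, from a second area touched by the current story.
    def test_labels_are_trimmed(self):
        self.assertEqual("  File ".strip(), "File")

def sanity_check():
    """Bundle just the related tests into one fast, ad-hoc suite."""
    loader = unittest.TestLoader()
    suite = unittest.TestSuite()
    suite.addTests(loader.loadTestsFromTestCase(ListHandlingTest))
    suite.addTests(loader.loadTestsFromTestCase(MenuBarTest))
    return suite

if __name__ == "__main__":
    unittest.TextTestRunner(verbosity=2).run(sanity_check())
```

Because the suite lives outside version control, you can add and drop tests freely as your focus moves, without polluting the shared build.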
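The “no unrecognized files” rule can be enforced mechanically rather than by memory. A minimal sketch, assuming Subversion-style `svn status` output where unversioned files are marked with a leading `?` (the function names are ours, not part of any real tool):

```python
def unrecognized_files(status_output):
    """Return the paths the VCS reports as unversioned
    ('?' lines in svn-style status output)."""
    return [line[1:].strip()
            for line in status_output.splitlines()
            if line.startswith("?")]

def ok_to_commit(status_output, ignored=()):
    """Refuse to commit while any unrecognized file is neither added nor
    explicitly on the ignore list -- so a forgotten file fails fast and
    locally, instead of breaking the build for the whole team."""
    return all(path in ignored for path in unrecognized_files(status_output))
```

In practice you would feed this the output of `svn status` (or your version control system’s equivalent) from the top of the source tree, and abort the commit whenever it returns False.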
In practice, rather than switch gears in the middle of a task, many teams simply let the build remain broken for a few hours while they finish their new task. If other people integrate during this time, the existing failures hide any new failures in their integration. Problems compound, leading to a vicious cycle of painful integrations, leading to longer broken builds, leading to more integration problems, leading to more painful integrations. I’ve seen teams that practice asynchronous integration leave the build broken for days at a time.
Don’t do this. Every build failure is critical and needs to be addressed immediately. If your team doesn’t treat build failures this way, don’t use asynchronous integration. That doesn’t mean the team needs to panic when the build breaks, but it does mean that, as a team, you have to make sure someone is immediately working on fixing it. The fact is, synchronous integration only works because the team wants the build to keep passing; otherwise people would just ignore the check-in process. Asynchronous integration works on the same basis, but requires an even higher level of commitment to the process. You can’t use asynchronous integration to sneak continuous integration into your team, and you can’t dictate that continuous integration will be done; you have to foster the commitment in the team before you implement the integration process. Make it a point of professional pride that you don’t leave the build failing for long periods of time. Make sure there aren’t time pressures that encourage leaving the build broken: make sure your client understands that fixing the build is always more important than working on their stories. If you can’t do that, use synchronous integration and identify things your developers can do to stay productive while waiting on the build to complete.
Remember, three check-ins a day with a twenty-minute delay on each one means an hour of downtime per day per developer. While you don’t want to hold your developers to the grindstone constantly, that’s a lot of downtime, so you want to make sure they can do something useful while the build runs. They might check on other pairs to see if they need help, handle their e-mail, read technical publications or catch up on RSS feeds, brainstorm new stories or products with the customer, or discuss how they’ll approach their next task. Some of the time they should take a short walk outside or go make coffee while the build happens. Most likely, though, they’ll need those breaks while working on a story, when it gets tough, so don’t assume all of the downtime will fall conveniently during builds.
One important note: whether you do synchronous or asynchronous integration, you need a fast build. If your build takes too long with asynchronous integration, you’re more likely to get multiple changes bundled into the one build, which makes it harder to identify the cause of problems. Keep working to make your build as fast as possible, and keep monitoring what impact the build time is having on your team.
There’s no one right way to do continuous integration with your team; you need to evaluate the process carefully and find what’s best for you. If you start with asynchronous integration and find that the build stays broken for long periods, you should probably switch to synchronous integration. If you start with synchronous integration and find that you almost never have to roll back a commit, you may be able to reduce the overhead of your integration process by switching to asynchronous integration. You may find that tests fail often but fail consistently on both the local machine and the build machine; in that case you may need to run a full build locally but leave the build on the integration machine to an automated system. Keep watching the health of your build and adjust your processes as needed to fit your team’s needs.
[1] – or, more frustratingly, twenty minutes of waiting around
[2] – obviously at a stable point where all the tests pass
[3] – and let’s face it, those are rare, particularly with legacy code bases
[4] – because they were created as part of the building or testing process
[5] – ten minutes because the first build, on your local machine, is done before you take the integration token