We’ve been going through a number of IT changes lately: server migration, consolidating authentication mechanisms, virtualization, and so on. Some of these changes have been painful because they’ve caused outright breakage; at one point authentication broke and developers couldn’t log into the build system. Other changes have been painful because they’ve caused performance problems that didn’t appear right away and have proven difficult to isolate.
At the superficial level, pure breakage is caused by a lack of testing. The IT team sets up something new but doesn’t test the new setup to the extent that the software team will exercise it. So we end up with an emergency when someone realizes they can’t get at their important files from their PC. And because our IT is outsourced, the IT team may not notice a day later that their changes aren’t working, because they don’t use the system themselves.
But both breakage and performance problems are classic software engineering problems. One well-accepted solution to both is to build and run tests. Monitoring systems (Cacti, Munin, Hobbit, etc.) that provide automated alerts of problems are really an implementation of common unit tests.
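To make the analogy concrete, here’s a minimal sketch of a monitoring check framed the way a test runner frames a unit test: each check is just a function that returns pass/fail, and the runner collects the failures. The host name is a hypothetical placeholder, and this is an illustration of the idea rather than how any particular monitoring tool is implemented.

```python
import socket

def check_tcp(name, host, port, timeout=2.0):
    """Pass if we can open a TCP connection -- i.e. 'is the service up?'"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (name, True, "connected")
    except OSError as e:
        return (name, False, str(e))

def run_checks(checks):
    """Run every check and return only the failures, like a test report."""
    results = [check() for check in checks]
    return [r for r in results if not r[1]]

if __name__ == "__main__":
    failures = run_checks([
        # "buildserver.example.com" is a made-up host for illustration
        lambda: check_tcp("web", "buildserver.example.com", 80),
    ])
    for name, _, detail in failures:
        print(f"ALERT {name}: {detail}")
```

Run it from cron every few minutes and mail the output, and you have the skeleton of an alerting system: red means a failed assertion, green means the suite passed.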
Previously, I had not thought of IT as an environment suited to unit tests. After all, unit tests are for code. Why write a test for a problem whose solution you’re just going to Google? In my pragmatic view, IT is much more “do it once and forget about it until it breaks next year.”
But the more time I spend tracking down IT problems, reporting them, and waiting for the fix, the more I want to just automate it all. So far, setting up Hobbit is our first step toward a monitoring system that will detect first-level problems like, “the web server is down.”
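A first-level “is the web server down?” probe of the kind Hobbit alerts on can be sketched in a few lines: fetch a page and treat any HTTP error, connection failure, or timeout as “down.” The URL below is a hypothetical internal server, not anything from our actual setup.

```python
import urllib.request
import urllib.error

def web_server_up(url, timeout=5.0):
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # DNS failure, connection refused, timeout, or HTTP error: call it down
        return False

if __name__ == "__main__":
    url = "http://intranet.example.com/"  # hypothetical placeholder
    print("up" if web_server_up(url) else "DOWN -- time to page someone")
```

Dedicated monitors add history, escalation, and dashboards on top, but the core assertion is no more than this.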
Using the monitoring system to measure performance is a little trickier. I don’t have much experience with that yet, so I’ll have to save that for a later post.