We’ve been going through a number of IT changes lately: server migration, consolidating authentication mechanisms, virtualization, etc. Some of these changes have been painful because they’ve caused breakage. For example, authentication broke, so developers couldn’t log into the build system. Other changes have been painful because they’ve caused performance problems that did not appear right away and have proven difficult to isolate.
At the superficial level, pure breakage is caused by a lack of testing. The IT team sets up something new, but doesn’t test the new setup to the extent that the software team will exercise it. So we end up with an emergency when someone realizes they don’t have access to their important files from their PC. Or, because our IT is outsourced and the IT team doesn’t use the system themselves, a day can go by before they realize their changes are not working.
But both breakage and performance are classic software engineering problems. One well-accepted solution for both types of issues is to build and run tests. Monitoring systems (cactus, monix, Hobbit, etc.) that provide automated alerts of problems are really an implementation of common unit tests.
Previously, I had not thought of IT as an environment that’s suitable for unit tests. After all, unit tests are for code. Why write a test for a problem when you’re just going to Google for a solution? In my pragmatic view, IT is much more “do it once and forget about it until it breaks next year.”
But the more time I spend tracking down IT problems, reporting them and waiting for the fix, the more I want to just automate it all. So far, setting up Hobbit is our first step at a monitoring system that will detect first-level problems like, “the web server is down.”
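To make the "monitoring is a unit test" idea concrete, here is a minimal sketch of a first-level "is the web server up?" check written by hand in Python. This is not how Hobbit implements its checks; the function names, the URL, and the pass/fail rule (any 2xx or 3xx status counts as up) are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def check_web_server(url, timeout=5.0, opener=urllib.request.urlopen):
    """Run one HTTP check and return (ok, detail).

    `opener` defaults to urllib's urlopen; it is a parameter only so the
    check can be exercised without a network (an illustrative choice).
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return False, f"HTTP {err.code}"
    except OSError as err:
        # Covers connection refused, DNS failure, and timeouts.
        return False, f"unreachable: {err}"
    # Assumption: any 2xx or 3xx response means the server is up.
    return 200 <= status < 400, f"HTTP {status}"


def run_checks(targets, check=check_web_server):
    """Check every named URL and return a list of (name, detail) failures."""
    failures = []
    for name, url in targets.items():
        ok, detail = check(url)
        if not ok:
            failures.append((name, detail))
    return failures
```

Something like `run_checks({"build system": "http://build.example.internal/"})` (a hypothetical hostname) could be run from a scheduler every few minutes, with an alert sent whenever the returned list is not empty. A real monitoring system adds the parts that are tedious to write yourself: scheduling, history, and notification.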
Using the monitoring system to measure performance is a little trickier. I don’t have much experience with that yet, so I’ll have to save that for a later post.
The IT team at the company I work for is outsourced to another company that provides IT services. This post is a brief analysis of the pros and cons that I’ve encountered.
Outsourcing IT allows my company to get cost-efficient service and skills without hiring dedicated employees, but it means that we really don’t have 9-5 service. When something goes down, there’s usually a delay until someone can look into it. Sometimes there will be an IT person on site, but generally if a machine goes down, the right person isn’t around.
There’s also a significant cost to “Emergency Service.” I don’t know what that cost is in dollars, but every time I’ve filed an emergency ticket, I’ve been asked by our accounting department whether or not it’s really an emergency. Now, according to the time and attention people (and I’m not saying that they’re bad people), putting this cost on a work order that’s going to interrupt somebody is actually a good thing. But the real-life effect is that I end up asking myself, “Is this something I can repair in 15 minutes?” So, there’s some set of critical repairs that displace my real work.
Another major consequence of outsourced IT is that the people doing the IT are not using the systems that they maintain every day. This means that they don’t notice intermittent problems (here today, gone tomorrow) and they can’t react quickly to obvious outages.
These factors have driven a couple of changes in the way I approach IT, which I’ll detail in future posts:
- I am effectively responsible for project management of IT projects: doing the planning, scoping and figuring out the details of how infrastructure projects should be implemented.
- Individual issues need to be prioritized relative to tasks for longer-term projects, or else little progress is made on the long-term projects because there are always minor issues to tackle.
- Automated monitoring is extremely important. Update: Monitoring is unit testing for IT.
On the plus side, we’ve got a decent team of people who are (for the most part) on call. On the minus side, I have to devote some of my time to IT work and I have to do much more of the planning and scoping of IT-related tasks. And there’s a lot of project definition that needs to be done to avoid IT disasters.