The Guru College
Jetpacks and Moving Sidewalks
Feature requests for software development fall into a couple of categories. One of them is the “if we had unlimited development resources, what would we want to see?” category. I’ve taken to calling those requests “jetpacks and moving sidewalks”. It’s not practical to think about implementing them now, but they help keep you motivated when slogging through tedious debugging or implementing yet another feature you aren’t really excited about. Like a light in the darkness, they show you where you’re trying to get to and what to design towards.
Currently, my project is redeveloping the monitoring and notification system for the campus. It’s based on Nagios and NDOUtils, with a whole lot of custom code and web front ends thrown in to manage the configuration files. Feature creep is everywhere. I’m trying to keep it down to a manageable level, but it can be hard. Recently, we’ve hit a major slowdown due to other groups on campus, and I’m getting the jetpack feeling again.
First off, I want to let people set their own notification options from a web site: pick the address or addresses that alerts get sent to when something goes wrong, as well as the time periods and event levels each address is valid for. This isn’t for the general pager system or on-call duty; it’s to let people like myself get a lot more data about the state of the network. Paging the on-call staff member would still happen via the operators. This is currently in progress. It’s fleshed out in concept, but needs a lot more user testing before I can release it into the wild.
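To make the idea concrete, here is a minimal sketch of how one of those per-user notification rules might be modelled. It is not the code from my project: the names NotificationRule and recipients, the example addresses, and the choice to store levels as plain Nagios state names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class NotificationRule:
    """One user-defined rule (hypothetical model, not the real schema)."""
    address: str        # where the alert should be sent
    start: time         # beginning of the daily window
    end: time           # end of the daily window (exclusive)
    levels: frozenset   # event levels this address wants, e.g. {"CRITICAL"}

    def matches(self, level: str, when: datetime) -> bool:
        """Return True if this rule applies to an event of `level` at `when`."""
        if level not in self.levels:
            return False
        now = when.time()
        if self.start <= self.end:
            return self.start <= now < self.end
        # The window wraps past midnight, e.g. 22:00 to 06:00.
        return now >= self.start or now < self.end


def recipients(rules, level, when):
    """Collect the distinct addresses whose rules match, in sorted order."""
    return sorted({r.address for r in rules if r.matches(level, when)})
```

With one daytime rule for CRITICAL events and one overnight rule for WARNING and CRITICAL, calling recipients(rules, "WARNING", some_datetime) returns only the addresses whose window and levels both match. Weekday filters and holidays are left out to keep the sketch short.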
Second, I want to get some physical phones to plug into the backs of the servers so they can send and receive SMS messages without using the campus networks. This would let us notify people even when email or the internet is down, and would let the SMS system respond to incoming messages as well. For example, you could reply to an alert with “ack” to acknowledge the problem, or “next” to alert the next on-call person because you are otherwise detained. You could also schedule on-demand host and service checks from the road. In the true jetpack version of this, you’d be able to get the server to actually dial a number and play a pre-recorded message, complete with a touch-tone menu system. I’ll settle for bidirectional communication that’s not tied to our email system first. Of course, we have to get money approved for this, and with the budget realities we are working with, it may take some time.
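As a rough sketch of the reply handling described above, the following turns an incoming SMS body into an action. None of this exists yet: Alert, handle_reply, and the returned dictionary shape are hypothetical. The only real piece is the ACKNOWLEDGE_SVC_PROBLEM line, which follows the Nagios external command format (host, service, sticky, notify, persistent, author, comment). The "next" reply has no Nagios equivalent that I know of, so the sketch just reports it and leaves the escalation to whatever custom code owns the on-call rotation.

```python
import time
from typing import NamedTuple, Optional


class Alert(NamedTuple):
    """The alert a reply refers to (hypothetical; lookup by sender is not shown)."""
    host: str
    service: str


def handle_reply(body: str, sender: str, alert: Alert,
                 now: Optional[int] = None) -> dict:
    """Map an SMS reply to an action. `now` is injectable for testing."""
    ts = int(time.time()) if now is None else now
    word = body.strip().lower()

    if word == "ack":
        # Nagios external command: sticky=2, notify=1, persistent=0.
        cmd = (f"[{ts}] ACKNOWLEDGE_SVC_PROBLEM;{alert.host};{alert.service};"
               f"2;1;0;{sender};Acknowledged via SMS")
        return {"action": "ack", "nagios_command": cmd}

    if word == "next":
        # Escalation would be custom logic, so no Nagios command is produced.
        return {"action": "escalate", "nagios_command": None}

    return {"action": "unknown", "nagios_command": None}
```

The command string would then be written to the Nagios external command file, whose location depends on the installation. Anything the handler does not recognize comes back as "unknown", so the system can reply with a short help message instead of guessing.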
Third is the concept of scale. The pause in progress recently has made me start mining through the nagios-discuss list, and I’ve come across the Distributed Nagios eXecutor. In short, it takes the tedious process of maintaining a lot of configuration files and multiple instances of Nagios off your hands, and handles the dispatch, execution and processing of check commands. It’s designed to scale to as many nodes as you give it, which allows you to add check capacity easily. I need to find out if you can run the nodes in multiple networks (with different network permissions) and have it target checks to the right boxes. If that’s possible, it may be a solid upgrade route for the 2.x or 3.x release of this monitoring project. Even if not, it may help solve the problem of scale as we bring more and more hosts online.