The Guru College

Merlin, The Nagios Wizard

I’ve been dabbling with Merlin lately at work. Merlin is part of the op5 monitoring system, but it’s open source, and it seems to fit the role that is currently filled by NDOUtils. Well, it fits that role, and then some. It’s maintained by Andreas Ericsson, who has commit access to the main Nagios repositories. It’s used in production by hundreds of op5’s customers and is under very active development. Now that it is shaping up for a 1.0 release in the next few months, I’m looking more seriously at using it in production at work.

I know there’s nothing specifically magical about a v1.0 release, but it makes me more comfortable, particularly in a relatively complicated Nagios environment with 2,000 hosts, 50+ users with read/write access to the configuration system, and 70+ people who can be woken up at 3:00 AM if something goes wrong. There are hosts that can only be monitored by Nagios systems inside our 10.x private network, as well as hosts that can only be monitored from our public-facing networks.

I’ve been doing some testing to make sure Merlin catches all the stupid corner cases I’d need to deal with in our current setup. First, and most importantly, failover works, and it works well. When one head node (a NOC in Merlin terms) goes down, all notifications and checks are migrated over to the remaining node(s). A pair of NOCs can be loaded with all the hosts and checks in our environment, and when one is shut down, the other takes over fully within about 8 seconds. Further, the status.dat and retention.dat files can be removed from the ‘down’ node (to make it come up with no knowledge of the network), and when it comes back up, it resyncs in about the same 8 seconds. Impressive.

Further, the NOC model allows for our notification system to fail over as well. When a second NOC registers itself, the service and host checks are split between the NOCs, and each NOC suppresses notifications for the services and hosts handled by the other. This means that for any given service issue, only one NOC will be sending notifications out. Even better, because the Merlin NEB module brokers information about all services and hosts between the active NOCs, when a NOC fails (or is rebooted, or whatever), the responsibility for notifications for its services and hosts moves to the remaining NOC. It’s a slick setup.
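
To make the ownership idea concrete, here is a toy Python sketch of the behavior described above. It is not Merlin's code or its actual distribution algorithm: the round-robin split, the function names, and the NOC and check names are all assumptions made up for illustration. The only point is that every check has exactly one owner at a time, and that ownership (and with it the right to notify) moves when a NOC drops out.

```python
def assign_owners(check_names, active_nocs):
    """Give every check exactly one owning NOC.

    Assumption: a simple round-robin over sorted names. The real split
    Merlin uses may well differ; this only models "one owner per check".
    """
    if not active_nocs:
        raise ValueError("at least one active NOC is required")
    nocs = sorted(active_nocs)
    return {name: nocs[i % len(nocs)]
            for i, name in enumerate(sorted(check_names))}


def should_notify(noc, check_name, owners):
    """A NOC only notifies for the checks it currently owns."""
    return owners[check_name] == noc


# Hypothetical checks and NOC names.
checks = ["db1/disk", "sw1/ping", "web1/http", "web2/http"]

# Two NOCs up: the checks are split, and each has a single notifier.
owners = assign_owners(checks, ["noc-a", "noc-b"])

# noc-a goes down: everything is reassigned to the survivor.
owners_after_failure = assign_owners(checks, ["noc-b"])
```

With both NOCs up, `should_notify("noc-a", c, owners)` and `should_notify("noc-b", c, owners)` are never both true for the same check `c`. After the failure, `noc-b` is the notifier for all of them.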

The only thing left for me to test is how stand-alone pollers work. A NOC is a system that is running as a peer to another. A poller is a node that is only responsible for executing checks for specific hostgroups and servicegroups. This is the mechanism by which we’ll be able to run switch checks from nodes that are located logically with the core switches, server checks from the nodes that are located behind the Cisco FWSM blades, and internet services checked by the nodes that are exposed directly to the Internet. We only have two nodes that should be sending out notifications, and I need to make sure that the pollers aren’t going to try to notify for hosts and services, and I need to test failover and look for corner cases there.
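
A rough sketch of that division of labor, again in Python and again purely illustrative: this is not Merlin configuration syntax or Merlin code, and the poller names, hostgroup names, and the rule that only the two NOCs may notify are assumptions that describe the setup I want to end up with, not behavior I have verified.

```python
# Hypothetical mapping of pollers to the hostgroups they execute checks for.
POLLER_GROUPS = {
    "poller-core": {"core-switches"},
    "poller-fwsm": {"firewalled-servers"},
    "poller-inet": {"internet-services"},
}

# Desired policy (an assumption, still to be tested): only NOCs notify.
NOTIFIERS = {"noc-a", "noc-b"}


def executor_for(hostgroups, poller_groups=POLLER_GROUPS, fallback="noc"):
    """Return the node that should run checks for a host.

    The first poller (in name order) responsible for any of the host's
    hostgroups wins; hosts that match no poller stay on the NOCs.
    """
    wanted = set(hostgroups)
    for poller in sorted(poller_groups):
        if poller_groups[poller] & wanted:
            return poller
    return fallback


def may_notify(node, notifiers=NOTIFIERS):
    """Pollers execute checks but should stay silent."""
    return node in notifiers
```

For example, `executor_for(["core-switches"])` picks `poller-core`, a host in none of the listed hostgroups falls back to the NOCs, and `may_notify("poller-core")` is false.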

In all, Merlin seems to be exactly what most complicated Nagios environments need. I just wish it was part of Nagios Core.
