The Guru College
Designing For Scale
My current project is to rebuild the monitoring and notification system for the University. The old system was built on Nagios 1.4, which has served its purpose very well since it was set up, but it is starting to get long in the tooth, and the previous owner of the project left the employ of the University in 2007. I have spent the last few months learning his code and getting a new back end ready for the system. One of the things that had been postponing the move away from Nagios 1.4 was the removal of the database backend from the core of the Nagios system. The web interface provided by Nagios has always stunk, so we had been running our own set of front end tools to manage the configuration files and keep an eye on the status of the servers being monitored. The database, then, is critical to our monitoring needs.
There is a new-ish database backend for Nagios, called NDOUtils (Nagios Data Output Utility), which runs a daemon that dumps all of the Nagios event data into a MySQL database. The problem is that NDO was never designed for scale, as best I can tell. The first major sin it commits is not generating any SQL indexes by default when you install the database. This is a Very Bad Thing, and it leads to horrible database performance as data builds up in the tables. There is an optional schema file you can process and run to get the indexes, but the installer doesn't apply them for you. The other problem is that all event data is logged to the database: state changes, notification messages, performance data – everything. Admittedly, this isn't a bug, it's a feature. There is even a handy pruning feature that will clean up entries older than a given age, and this pruning age is set to a week for most of the event types.
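The fix is straightforward, if tedious: put indexes on the time columns NDO queries and prunes against. Below is a minimal sketch of retrofitting them from Python, assuming the mysql-connector-python package; the connection details are placeholders and the table and column names are illustrative, so take the real list from the optional index schema that ships with NDOUtils.

    # Sketch: add indexes to the time columns NDO prunes against.
    # Table/column names are illustrative; take the authoritative list
    # from the optional index schema in the NDOUtils distribution.
    import mysql.connector  # assumes the mysql-connector-python package

    INDEXES = [
        ("nagios_servicechecks", "start_time"),
        ("nagios_hostchecks", "start_time"),
        ("nagios_logentries", "logentry_time"),
        ("nagios_statehistory", "state_time"),
    ]

    conn = mysql.connector.connect(host="localhost", user="ndouser",
                                   password="secret", database="nagios")
    cur = conn.cursor()
    for table, column in INDEXES:
        # Building each index is a one-time cost; afterwards the periodic
        # "DELETE ... WHERE <time column> < <cutoff>" pruning can use it.
        cur.execute(f"CREATE INDEX idx_{table}_{column} ON {table} ({column})")
    cur.close()
    conn.close()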
The best post I could find about tuning NDOUtils talks about having 400 hosts and 3,000 service checks. It's a very good article, but I'm trying to solve a different problem. We have 1,500 hosts and over 4,000 service checks, and NSCA isn't overloaded (although xinetd is, but that's a whole other story). Not all checks run every minute, or even every 5 minutes, so I wrote a small script to troll through the Nagios log looking for external events submitted by the servers performing the actual checks. My baseline is around 600 checks per minute coming into our front end Nagios server. Let's do the math: 600 checks * 10,080 minutes in a week = roughly 6 million events in the log. This leads us to the second major sin committed by NDO – it frequently runs a DELETE statement to clear out any data older than the specified age. In SQL, DELETE is one of the most expensive commands you can run, and it's made even worse when there is no index on the table you are deleting from. It becomes a 3 Act Greek tragedy when, every 5 minutes, it tries to trim the oldest 3,000 entries from a table with 6 million rows. And it starts to approach EPIC FAIL when you consider that we will roughly double the number of servers and checks in the next few weeks as we bring the Oracle/HR environment under our monitoring, and maybe double again if we absorb network monitoring.
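Something along these lines will reproduce that baseline number: a minimal sketch, assuming the stock nagios.log format and path, where each line starts with a bracketed Unix timestamp and NSCA submissions appear as PROCESS_SERVICE_CHECK_RESULT or PROCESS_HOST_CHECK_RESULT external commands.

    # Count passive check results per minute in nagios.log.
    import re
    from collections import Counter

    LOG = "/usr/local/nagios/var/nagios.log"   # adjust to your install
    pattern = re.compile(
        r"^\[(\d+)\] EXTERNAL COMMAND: PROCESS_(?:SERVICE|HOST)_CHECK_RESULT;")

    per_minute = Counter()
    with open(LOG) as fh:
        for line in fh:
            m = pattern.match(line)
            if m:
                # bucket each result by the minute it arrived in
                per_minute[int(m.group(1)) // 60] += 1

    if per_minute:
        print(f"peak {max(per_minute.values())}/min, "
              f"average {sum(per_minute.values()) / len(per_minute):.0f}/min "
              f"over {len(per_minute)} minutes")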
Of course, I discovered all of this a few hours after I had sent the demo page for the new system out to the various teams of system administrators to look at. I knew I needed to fix the database performance problem, and quickly. I knew that clearing out the bad data would be an intensive operation, but I didn't have any real idea of just how bad it would be. I turned the prune timer down from 10,080 minutes to 60 (keeping just an hour's worth of data) and restarted the database connector.
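If memory serves, the prune timers live in ndo2db.cfg as the max_*_age options, expressed in minutes, so the change amounted to something like this (option names are from the sample config; your version may have more or fewer of them):

    # ndo2db.cfg: keep an hour of history instead of a week
    max_systemcommands_age=60
    max_servicechecks_age=60
    max_hostchecks_age=60
    max_eventhandlers_age=60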
75 minutes later, it had finished cleaning up the first of six multi-million row tables. I waited until 5 PM, stopped MySQL, dropped the database, reloaded the schema (and the optional indexes), and brought everything back up. It took a few minutes for enough data to flow back into the database for accurate states to show up again, but that was far less than the 8+ hours it would have taken to prune away the old data. I shudder to think what would have happened had I loaded all the other devices and been trying to process six tables with 20 million rows each.