Monitoring Website Availability

There are various tools that can be used to monitor website availability. My favourite is the de facto standard, Nagios. Not only is it open source, it is also very flexible and powerful. The only downside with Nagios is the configuration and the Linux skills it requires (I’m not scared of Linux at all, but some people are). Once you’ve got it set up, though, it’s rock solid and will just keep running, monitoring all of your services.

Installing Nagios
Nagios is very simple to install on most Linux systems. On the Nagios Quickstart website, there are detailed instructions for installing Nagios on Fedora/openSUSE/Ubuntu. On Ubuntu, however, the installation can be simpler still using apt-get:
sudo apt-get install -y nagios3
(It’s worth noting, though, that the Ubuntu package repositories usually carry an older version. A trade-off for ease of installation!)

Configuring Nagios
Once installed, Nagios can be configured. I’m not going to cover that in this blog post, as there are so many guides out there which already do. However, a good starting point is this article, which covers setting up monitoring for public services.

Twitter Integration
I’ve been looking at doing something nice with Nagios to allow it to tweet on Twitter when a website is down: “I’m sorry we’re down, we’ll be back up later”. However, Twurl seems a bit of a pain to install and set up, so I thought it could be interesting to explore this using Taskcentre instead.

Using Taskcentre as an Alternative to Nagios
Twurl looks very painful to integrate into Nagios, so I looked at it from the other angle: “What integrations have I already done with Twitter that I could get to monitor a website?” I’ve previously created a task in Taskcentre which tweets, so, as with any task like this, I broke it down into small manageable steps and tackled them one at a time:

  • Check site to see if up or down
  • Check to see if status has changed
  • Send notification

Check Site Availability
The most basic check would be to ping the website to see if the server is available, but I would never recommend this as a way of checking whether your website is available. Some servers block pings; others will respond to a ping but still not be delivering pages. The best way is to actually request the page and wait for the HTTP status code. Generally, a response in the 200-299 range means the request succeeded, while 4xx and 5xx codes indicate an error (3xx codes are redirects). In this example, I’m focusing on the primary response, “200”, which means “OK”.

This small VBScript (the “VBScript Site Status” screenshot) requests a page from my website. If the request returns a 200, the script sets the “CurrentStatus” variable to “True”. In every other case, it sets it to “False”.
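For illustration, the same check can be sketched in Python rather than VBScript. This is a minimal sketch and not the script from the task: the function names `fetch_status` and `is_site_up` and the 10-second timeout are assumptions made up for this example. Note that `urllib` follows redirects, so a redirected page reports the status of its final destination.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error code such as 404 or 500.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection or timeout: no HTTP response at all.
        return None


def is_site_up(status_code):
    """Apply the rule from the post: only a plain 200 counts as up."""
    return status_code == 200
```

A run of the task would then set the equivalent of “CurrentStatus” with something like `current_status = is_site_up(fetch_status(url))`, where `url` is the page being monitored.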

Check to see if the status has changed
First, we need to know what the previous status was in order to compare it to the current status. Create either a global variable or a task-level variable to store the previous value, then compare the current value to it.

This is pretty easily done using the following decision step:
Variables("CurrentStatus") <> Variables("PreviousStatus")

Afterwards, a small bit of code is required to store the value. This ensures that the next time the task runs, it highlights only a change in status, rather than alerting every time.
Variables("PreviousStatus") = Variables("CurrentStatus")
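Outside Taskcentre there is no built-in variable store, so the previous status has to be kept somewhere between runs. The following Python sketch does the same compare-then-store step using a small JSON file. The file format, the function names, and the choice to treat the very first run as "no change" are all assumptions for this example.

```python
import json
from pathlib import Path


def load_previous_status(state_file):
    """Return the stored status, or None if nothing has been stored yet."""
    path = Path(state_file)
    if not path.exists():
        return None
    return json.loads(path.read_text())["up"]


def save_status(state_file, is_up):
    """Store the status so the next run can compare against it."""
    Path(state_file).write_text(json.dumps({"up": is_up}))


def status_changed(state_file, current_status):
    """Compare with the stored value, then store the new value for next time."""
    previous_status = load_previous_status(state_file)
    save_status(state_file, current_status)
    # Assumption: with no stored value (first run) nothing is reported.
    return previous_status is not None and previous_status != current_status
```

Storing the value on every run, changed or not, mirrors the assignment above and keeps the file in step with the latest check.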

Send Notification
Finally, you need to decide what to do with your alert. In this case, I’ve tweeted on Twitter (a good way to keep in touch with customers). Using a bit of logic, I’ve changed the text so that the message differs depending on the status change. A positive change says “We’re back now”. A negative change says “Sorry we’re down, but we’re looking into it”.
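The message logic can be sketched in a few lines of Python. This covers only the choice of text; actually posting the tweet is left out, since it depends on whichever Twitter integration is in use. The function name `notification_text` is made up for this example.

```python
def notification_text(previous_status, current_status):
    """Return the message to post, or None when the status has not changed."""
    if previous_status == current_status:
        return None
    if current_status:
        # Positive change: the site was down and is now up.
        return "We're back now"
    # Negative change: the site was up and is now down.
    return "Sorry we're down, but we're looking into it"
```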

Obviously this task could instead send an email, send a message via HipChat (as done in my previous post), or even send a text message.

Here’s the completed task (the “Website Availability Task” screenshot). It obviously needs a “schedule” task in there so that it runs every 5 minutes or so. (In the VM I use for testing and prototyping, I tend not to schedule anything to run automatically.)

Further Improvements
As with everything you do, there are always ways to improve this further. My first thoughts are the following, but I bet there are lots more!

  • Check the content of the site, looking for specific text
  • Check a list of sites
  • Flap Detection
  • Different levels of alerts, such as:
    • First Alert – Email Staff
    • Second Alert – Update Twitter
    • Third Alert – Text Message Staff
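As a rough idea of how two of these could look, here is a Python sketch of a very simple form of flap damping (only accept a status after several identical results in a row) and of the alert levels listed above. This is not how Nagios implements flap detection; the class name, the default of three results, and the channel strings are all assumptions for this example.

```python
from collections import deque


class StableStatus:
    """Report a status only after `required` identical results in a row."""

    def __init__(self, required=3):
        self.recent = deque(maxlen=required)
        self.stable = None  # None until enough results have been seen

    def update(self, is_up):
        """Record one check result and return the current stable status."""
        self.recent.append(is_up)
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and len(set(self.recent)) == 1:
            self.stable = is_up
        return self.stable


def alert_channel(consecutive_alerts):
    """Map the number of alerts so far to the escalation steps listed above."""
    if consecutive_alerts <= 1:
        return "email staff"
    if consecutive_alerts == 2:
        return "update Twitter"
    return "text message staff"
```

With `required=3` and a check every 5 minutes, a site would need to fail three checks in a row (roughly 10 to 15 minutes) before being reported as down, which trades some speed for fewer false alarms.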
