Monitoring

Nagios and Nagiosgrapher

http://DEFLECT_CONTROLLER/nagios3/index.php - System monitoring and trends

Nagios runs as a daemon in two places, and Nagios agents run on every other host (i.e. the edges). The first Nagios instance is on DEFLECT CONTROLLER; it collects data about all services and alerts us when things go down or cross configured thresholds. The second Nagios server instance is still to be set up on backup, first to monitor DEFLECT CONTROLLER’s Nagios and eventually to become a live standby.

The most important thing for us to monitor is availability of content, over HTTP, on a per-edge basis.
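A per-edge HTTP availability check could look roughly like the sketch below. It is an illustration, not the check currently deployed: the function names, the 2-second slow-response threshold and the choice to treat any status of 400 or above as a failure are all assumptions. The only fixed convention is the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The request goes straight to one edge's address with the site name in the Host header, so the result reflects that edge alone rather than whichever edge DNS happens to return.

```python
import http.client
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(status, elapsed, warn_seconds=2.0):
    """Map an HTTP status (None = no response) and response time to a Nagios state.

    warn_seconds is an assumed threshold, not a value from the live config.
    """
    if status is None:
        return CRITICAL, "no response"
    if status >= 400:
        return CRITICAL, f"HTTP {status}"
    if elapsed > warn_seconds:
        return WARNING, f"HTTP {status} but slow ({elapsed:.2f}s)"
    return OK, f"HTTP {status} in {elapsed:.2f}s"


def check_edge(edge_address, site, path="/", timeout=10.0):
    """Fetch `path` for `site` directly from a single edge, bypassing DNS."""
    start = time.monotonic()
    conn = http.client.HTTPConnection(edge_address, 80, timeout=timeout)
    try:
        # Sending our own Host header makes the edge serve the right site.
        conn.request("GET", path, headers={"Host": site})
        status = conn.getresponse().status
    except (OSError, http.client.HTTPException):
        status = None
    finally:
        conn.close()
    return classify(status, time.monotonic() - start)
```

A wrapper script would call check_edge once per edge, print the message and exit with the returned state so that Nagios can pick it up.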

Nagios grapher

Nagios Grapher collects data using RRD and presents it in graph format.

Nagios future

monitor an object on the origin too

both directly and via each edge

parent test

Either the HTTP object or availability of the NRPE agent should be configured as the parent for each host. A parent test has this function: if ‘children’ fail while a parent is down, they do not issue alerts - though they will show as RED on the Nagios web interface. This is to streamline diagnosis when a lot of tests fail at once.
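The suppression rule described above can be illustrated with a few lines of code. This is only a model of the behaviour, not how Nagios implements it internally; the function name and the "host/check" labels are made up for the example.

```python
def checks_to_alert(parent_of, passed):
    """Return the failed checks that should actually send an alert.

    parent_of: maps each check to its parent check (None if it has no parent).
    passed:    maps each check to True (passing) or False (failing).
    """
    alerts = []
    for check, ok in passed.items():
        if ok:
            continue
        parent = parent_of.get(check)
        # A failing child stays silent while its parent is also failing;
        # it would still show as RED in the web interface.
        if parent is not None and not passed.get(parent, True):
            continue
        alerts.append(check)
    return sorted(alerts)
```

For example, if the NRPE agent on edge1 is down, the disk and cache checks behind it fail as well, but only the NRPE failure alerts. A disk failure on edge2, whose agent is still up, alerts as usual.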

DNS

DNS check against the nameservers, possibly an integrated DNS-then-HTTP check
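One possible shape for an integrated DNS-then-HTTP check is sketched below. It is a sketch under assumptions: it resolves through the system resolver with socket.getaddrinfo rather than querying each nameserver individually (that would need a DNS library), and the HTTP step is passed in as a function so that the sketch stays independent of how the fetch is done. All names are hypothetical.

```python
import socket

# Nagios plugin exit codes used here.
OK, CRITICAL = 0, 2


def resolve_ipv4(name):
    """Resolve `name` via the system resolver; return its IPv4 addresses, sorted."""
    try:
        infos = socket.getaddrinfo(
            name, 80, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def dns_then_http(site, fetch_status, resolve=resolve_ipv4):
    """Resolve `site`, then request it from every address DNS handed out.

    fetch_status(address, site) must return an HTTP status code,
    or None if the request failed.
    """
    addresses = resolve(site)
    if not addresses:
        return CRITICAL, f"{site}: DNS returned no addresses"
    bad = []
    for address in addresses:
        status = fetch_status(address, site)
        if status is None or status >= 400:
            bad.append(address)
    if bad:
        return CRITICAL, f"{site}: HTTP failed via {', '.join(bad)}"
    return OK, f"{site}: {len(addresses)} address(es) resolved and serving"
```

Doing the two steps in one check separates "DNS is broken" from "DNS works but points at an edge that is not serving", which two independent checks cannot tell apart as easily.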

other basics

disk space, log size, log currency, cache size, traffic-to-origin (per origin?), traffic-to-edge, variation in traffic levels to different edges
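Most of these basics reduce to comparing one number against a warning and a critical threshold. The sketch below shows that pattern for disk usage and log currency (time since the log was last written). The helper names are hypothetical, and any thresholds passed in would have to be chosen per host; none are taken from the existing configuration.

```python
import os
import shutil
import time

# Nagios plugin exit codes used here.
OK, WARNING, CRITICAL = 0, 1, 2


def threshold_state(value, warn, crit):
    """Return a Nagios state for a metric where higher values are worse."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def disk_used_percent(path="/"):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def log_age_seconds(path, now=None):
    """Seconds since `path` was last modified, as a proxy for log currency."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)
```

A disk check would then be threshold_state(disk_used_percent("/"), warn, crit), and a log currency check would pass log_age_seconds(logfile) with thresholds in seconds.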

ATS stats

already being recorded per minute, but not yet fed into monitoring.

Secondary Nagios

The second instance runs (will run!) on backup, and its job is just to check that the first instance of Nagios is functioning. The two Nagios instances will be configured so that if the primary goes down, the secondary continues monitoring in the interim. For the time being, the secondary just monitors the primary and alerts if it goes down.

awstats

HTTP request volumes

logstash and kibana

Incident reporting

Under development. We use a wiki-based incident system, with email reporting for other events.