Host And Service Check Intervals

A common problem Nagios administrators come across is that they receive notifications for services when a host goes down, and they want to know why. The host went down, so shouldn't notifications for its services be suppressed?

When a host goes down (non-OK state), Nagios suppresses notifications for its child services. However, this happens only once the host enters a HARD non-OK state. The Nagios Core documentation on state types explains HARD and SOFT states.

The most common reason service notifications are sent when a host is down is that the check_interval, retry_interval and max_check_attempts values on the HOST and SERVICE objects are identical. With this type of configuration, it comes down to a scheduling race: whichever object reaches a HARD state first decides whether the service notification is sent.

This is much easier to explain with an example.

Here's a host and service definition:

define host{
    use windows-server
    host_name host1
    alias host1
    address 10.25.14.51
    check_interval 5
    retry_interval 1
    max_check_attempts 5
    notification_interval 30
}

define service{
    use local-service
    host_name host1
    service_description Memory Usage
    check_command check_nrpe!CheckMem!ShowAll type=physical MinWarn=512M MinCrit=256M
    check_interval 5
    retry_interval 1
    max_check_attempts 5
    notification_interval 30
}


Here is a scenario based on this configuration:

  • 13:10:10
    • Nagios HOST check for host1 executed
    • Result = OK
    • HARD state
    • Attempt 1/5
    • Next scheduled check 13:15:10
  • 13:10:30
    • host1 dies
    • Nagios does not know about this yet
  • 13:10:50
    • Nagios SERVICE check Memory Usage for host1 executed
    • Check timed out after 30 seconds as host is down
    • Result = CRITICAL
    • SOFT state
    • Attempt 1/5
    • No notification sent
    • Next scheduled check 13:12:20
  • 13:12:20
    • Nagios SERVICE check Memory Usage for host1 executed
    • Check timed out after 30 seconds as host is down
    • Result = CRITICAL
    • SOFT state
    • Attempt 2/5
    • No notification sent
    • Next scheduled check 13:13:50
  • 13:13:50
    • Nagios SERVICE check Memory Usage for host1 executed
    • Check timed out after 30 seconds as host is down
    • Result = CRITICAL
    • SOFT state
    • Attempt 3/5
    • No notification sent
    • Next scheduled check 13:15:20
  • 13:15:10
    • Nagios HOST check for host1 executed
    • Check timed out as host is down
    • Result = CRITICAL
    • SOFT state
    • Attempt 1/5
    • No notification sent
    • Next scheduled check 13:16:10
  • 13:15:20
    • Nagios SERVICE check Memory Usage for host1 executed
    • Check timed out after 30 seconds as host is down
    • Result = CRITICAL
    • SOFT state
    • Attempt 4/5
    • No notification sent
    • Next scheduled check 13:16:50
  • 13:16:10
    • Nagios HOST check for host1 executed
    • Check timed out as host is down
    • Result = CRITICAL
    • SOFT state
    • Attempt 2/5
    • No notification sent
    • Next scheduled check 13:17:10
  • 13:16:50
    • Nagios SERVICE check Memory Usage for host1 executed
    • Check timed out after 30 seconds as host is down
    • Result = CRITICAL
    • HARD state
    • Attempt 5/5
    • Notification SENT
    • Next scheduled check 13:22:20
    • Next notification 13:47:20
  • 13:17:10
    • Nagios HOST check for host1 executed
    • Check timed out as host is down
    • Result = CRITICAL
    • SOFT state
    • Attempt 3/5
    • No notification sent
    • Next scheduled check 13:18:10
  • 13:18:10
    • Nagios HOST check for host1 executed
    • Check timed out as host is down
    • Result = CRITICAL
    • SOFT state
    • Attempt 4/5
    • No notification sent
    • Next scheduled check 13:19:10
  • 13:19:10
    • Nagios HOST check for host1 executed
    • Check timed out as host is down
    • Result = CRITICAL
    • HARD state
    • Attempt 5/5
    • Notification SENT 
    • SERVICE notifications for host1 suppressed as host1 is in a HARD down state
    • Next scheduled check 13:24:10
    • Next notification 13:49:10
  • 13:22:20
    • Nagios SERVICE check Memory Usage for host1 executed
    • Check timed out after 30 seconds as host is down
    • Result = CRITICAL
    • HARD state
    • Attempt 5/5
    • NO Notification sent as host1 is in a HARD down state
    • Next scheduled check 13:27:50

What you can see here is that Nagios did not know that the host was down until almost 5 minutes after the host died (the first failed host check ran at 13:15:10), and the host did not reach a HARD down state until 13:19:10. In the meantime, the Memory Usage service had gone into a SOFT non-OK state at 13:10:50, which triggered the retry_interval to be used for that service. The service therefore reached a HARD state at 13:16:50, beating the host check to it, and a notification was sent.
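The race in this timeline can be reproduced with a little arithmetic. The sketch below is a simplified model, not something Nagios itself provides: the function name hard_state_time and its parameters are made up for illustration, and it assumes each retry starts retry_interval minutes after the previous check finished, as in the timeline above.

```python
from datetime import datetime, timedelta

def hard_state_time(first_failure, retry_interval_min, max_check_attempts,
                    check_duration_s=0):
    """Start time of the check that moves an object into a HARD state.

    Simplified model (an assumption, not Nagios code): after the first
    failed check, every retry starts retry_interval minutes after the
    previous check finished.
    """
    gap = timedelta(minutes=retry_interval_min, seconds=check_duration_s)
    return first_failure + (max_check_attempts - 1) * gap

def at(hhmmss):
    """Parse an HH:MM:SS timestamp from the timeline."""
    return datetime.strptime(hhmmss, "%H:%M:%S")

# First failed checks taken from the timeline; the service check needs
# 30 seconds to time out, the host check duration is ignored as in the text.
service_hard = hard_state_time(at("13:10:50"), 1, 5, check_duration_s=30)
host_hard = hard_state_time(at("13:15:10"), 1, 5)

print("service HARD at", service_hard.strftime("%H:%M:%S"))  # 13:16:50
print("host HARD at   ", host_hard.strftime("%H:%M:%S"))     # 13:19:10
print("service notification suppressed:", host_hard <= service_hard)  # False
```

With identical settings on both objects, the only thing that decides the outcome is which object happened to be checked first after the host died.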


Now let's look at a revised example with different check settings for host and service:

define host{
    use windows-server
    host_name host1
    alias host1
    address 10.25.14.51
    check_interval 2
    retry_interval 1
    max_check_attempts 2
    notification_interval 30
}

define service{
    use local-service
    host_name host1
    service_description Memory Usage
    check_command check_nrpe!CheckMem!ShowAll type=physical MinWarn=512M MinCrit=256M
    check_interval 5
    retry_interval 1
    max_check_attempts 5
    notification_interval 30
}


Here is a scenario based on this configuration:

  • 13:10:10
    • Nagios HOST check for host1 executed
    • Result = OK
    • HARD state
    • Attempt 1/2
    • Next scheduled check 13:12:10
  • 13:10:30
    • host1 dies
    • Nagios does not know about this yet
  • 13:10:50
    • Nagios SERVICE check Memory Usage for host1 executed
    • Check timed out after 30 seconds as host is down
    • Result = CRITICAL
    • SOFT state
    • Attempt 1/5
    • No notification sent
    • Next scheduled check 13:12:20
  • 13:12:10
    • Nagios HOST check for host1 executed
    • Check timed out as host is down
    • Result = CRITICAL
    • SOFT state
    • Attempt 1/2
    • No notification sent
    • Next scheduled check 13:13:10
  • 13:12:20
    • Nagios SERVICE check Memory Usage for host1 executed
    • Check timed out after 30 seconds as host is down
    • Result = CRITICAL
    • SOFT state
    • Attempt 2/5
    • No notification sent
    • Next scheduled check 13:13:50
  • 13:13:10
    • Nagios HOST check for host1 executed
    • Check timed out as host is down
    • Result = CRITICAL
    • HARD state
    • Attempt 2/2
    • Notification SENT 
    • SERVICE notifications for host1 suppressed as host1 is in a HARD down state
    • Next scheduled check 13:15:10
    • Next notification 13:43:10
  • 13:13:50
    • Nagios SERVICE check Memory Usage for host1 executed
    • Check timed out after 30 seconds as host is down
    • Result = CRITICAL
    • SOFT state
    • Attempt 3/5
    • No notification sent
    • Next scheduled check 13:15:20
  • 13:15:10
    • Nagios HOST check for host1 executed
    • Check timed out as host is down
    • Result = CRITICAL
    • HARD state
    • Attempt 2/2
    • No notification sent as the next notification is not due until 13:43:10
    • Next scheduled check 13:17:10
  • 13:15:20
    • Nagios SERVICE check Memory Usage for host1 executed
    • Check timed out after 30 seconds as host is down
    • Result = CRITICAL
    • SOFT state
    • Attempt 4/5
    • No notification sent
    • Next scheduled check 13:16:50
  • 13:16:50
    • Nagios SERVICE check Memory Usage for host1 executed
    • Check timed out after 30 seconds as host is down
    • Result = CRITICAL
    • HARD state
    • Attempt 5/5
    • NO Notification sent as host1 is in a HARD down state
    • Next scheduled check 13:22:20

In this example you can see that Nagios identified that host1 was down (HARD non-OK state) BEFORE the Memory Usage service reached a HARD non-OK state. Because of this, the service notifications were suppressed.
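As a rough rule of thumb, the two configurations in this article can be compared without walking through a full timeline. The sketch below is an assumption-laden simplification rather than anything Nagios calculates: the function names are hypothetical, and it ignores check execution time, timeouts and scheduling latency. It compares the longest time the host can take to reach a HARD state (the host dies just after a successful check) with the shortest time the service can take (the service is checked just after the host dies).

```python
def worst_case_host_hard_min(check_interval, retry_interval, max_check_attempts):
    """Longest time in minutes from host failure to a HARD down state.

    Worst case: the host dies right after a successful host check, so the
    first failed check is a full check_interval away.
    """
    return check_interval + (max_check_attempts - 1) * retry_interval

def best_case_service_hard_min(retry_interval, max_check_attempts):
    """Shortest time in minutes from host failure to a service HARD state.

    Best case for the service: it is checked immediately after the host dies.
    """
    return (max_check_attempts - 1) * retry_interval

def host_always_wins(host, service):
    """True if the host reaches HARD first even in its worst case."""
    return worst_case_host_hard_min(**host) < best_case_service_hard_min(
        service["retry_interval"], service["max_check_attempts"])

# Values copied from the host and service definitions in this article.
original_host = dict(check_interval=5, retry_interval=1, max_check_attempts=5)
revised_host = dict(check_interval=2, retry_interval=1, max_check_attempts=2)
service = dict(check_interval=5, retry_interval=1, max_check_attempts=5)

print(host_always_wins(original_host, service))  # False (9 minutes vs 4)
print(host_always_wins(revised_host, service))   # True  (3 minutes vs 4)
```

Under these assumptions the original host definition can lose the race, while the revised one reaches a HARD state first even when the timing is least favourable.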