| Cheshire Cat Computing http://www.steveshipway.org/forum/ |
|
| EventLog problems recovering automatically, and too quickly http://www.steveshipway.org/forum/viewtopic.php?f=22&t=3995 |
Page 1 of 1 |
| Author: | Bennyvision [ Sat Nov 14, 2009 10:02 am ] |
| Post subject: | EventLog problems recovering automatically, and too quickly |
Hey folks,

I am working on polishing a new Nagios installation, and I am trying the Windows EventLog Agent as a method of allowing my end users to configure their own thresholds for Windows events. I have configured NSCA and it is receiving my events (just testing with starting and stopping the agent so far) just fine:

    [1258145224] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;hntbw598;EventLog Agent;2;HEARTBEAT [CRIT #2]: Service halting
    [1258145225] PASSIVE SERVICE CHECK: hntbw598;EventLog Agent;2;HEARTBEAT [CRIT #2]: Service halting

Yay! It notices when I shut the agent down, and when I start it back up. That's awesome.

So, my next test is to shut the agent down, and make sure it alerts the contact. I have my service set up like so:

    define service {
        service_description     EventLog Agent
        check_command           check_passive_service!0!EventLog Agent running
        host_name               hntbw598
        check_period            24x7 passive checks
        notification_period     24x7 passive checks
        contact_groups          testing-admins
        active_checks_enabled   0
        passive_checks_enabled  1
        max_check_attempts      1
        normal_check_interval   5
        retry_check_interval    2
        notification_interval   360
        notification_options    w,u,c,r
        active_checks_enabled   1
        passive_checks_enabled  1
        notifications_enabled   1
        check_freshness         0
        freshness_threshold     86400
    }

(the check_freshness being set to 0 is intentional, I want the alert to remain for now)

The check_passive_service command is configured like so:

    define command {
        command_name    check_passive_service
        command_line    $USER1$/check_dummy $ARG1$ "$ARG2$"
    }

(I needed the double quotes, otherwise I lost everything but the first word)

So, here is a test run:

    [1258145224] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;hntbw598;EventLog Agent;2;HEARTBEAT [CRIT #2]: Service halting
    [1258145225] PASSIVE SERVICE CHECK: hntbw598;EventLog Agent;2;HEARTBEAT [CRIT #2]: Service halting
    [1258145225] SERVICE ALERT: hntbw598;EventLog Agent;CRITICAL;HARD;1;HEARTBEAT [CRIT #2]: Service halting
    [1258145225] SERVICE NOTIFICATION: cbensend;hntbw598;EventLog Agent;CRITICAL;notify-service-by-email;HEARTBEAT [CRIT #2]: Service halting

Yay! It alerted just fine. However:

    [1258145365] SERVICE ALERT: hntbw598;EventLog Agent;OK;HARD;1;OK: EventLog Agent running
    [1258145365] SERVICE NOTIFICATION: cbensend;hntbw598;EventLog Agent;OK;notify-service-by-email;OK: EventLog Agent running

This recovered all by itself? I did *not* start the agent back up. Nagios just took it upon itself to recover the service, even though the service isn't recovered. It is not running on the remote host, NSCA did not receive another event. And freshness checking is off:

    grep freshness nagios.cfg |grep -v "#"
    check_service_freshness=0
    service_freshness_check_interval=3600
    check_host_freshness=0
    host_freshness_check_interval=60
    additional_freshness_latency=15

Any ideas why it automatically recovers itself?

Thanks! |
|
| Author: | stevesh [ Sat Nov 14, 2009 3:18 pm ] |
| Post subject: | Re: EventLog problems recovering automatically, and too quickly |
It looks like your Nagios is running active service checks, for some reason.

1) Check the Nagios log. This should say when freshness checks or external commands are received, so that you can tell if it's a passive, freshness, or active check.

2) You should have max_check_attempts set to 1 for this sort of thing, as on shutdown you only get one crit notification.

3) For the heartbeat service (that's what this is) you should set your check_command to give a CRITICAL with message "Eventlog agent has died" and enable freshness checks with about a 5 minute timeout. This is because if the agent is running, it will send periodic "I'm OK" heartbeat messages; if it dies, you get nothing, so you need the freshness checks to make things go critical. If it is shut down manually it sends a crit before shutting down, but you can't rely on this if something goes wrong.

4) Check that you don't have the service with checks enabled IN NAGIOS. The definition in the files may say active checks disabled, but someone may have re-enabled them in the web interface.

5) Check you don't have 2 Nagios instances running at once. It can do weird things.

6) Which Nagios version are you running? In later versions, you can set the normal_check_interval to 0 to disable active checks; this protects you from situation (4) above.

Steve |
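Point 3 could look roughly like the service definition below. This is a sketch, not a tested configuration: the host, contact group, timeperiod and command names are copied from the first post, the 300 second threshold is just the "about 5 minutes" suggestion, and it assumes the check_passive_service command shown earlier (check_dummy with a state and a message). The idea is that the check_command is only run when a result goes stale, so it should report CRITICAL rather than OK.

```cfg
# Sketch only. Names are taken from the first post; adjust to your setup.
# Freshness checking must also be enabled globally in nagios.cfg
# (check_service_freshness=1); the post above shows it set to 0.
define service {
    host_name                hntbw598
    service_description      EventLog Agent
    # Run only when a freshness check forces it: state 2 = CRITICAL.
    check_command            check_passive_service!2!Eventlog agent has died
    check_period             24x7 passive checks
    notification_period      24x7 passive checks
    contact_groups           testing-admins
    max_check_attempts       1
    active_checks_enabled    0      ; no regularly scheduled active checks
    passive_checks_enabled   1      ; accept results submitted through NSCA
    check_freshness          1
    freshness_threshold      300    ; seconds allowed without a passive result
    notification_interval    360
    notification_options     w,u,c,r
    notifications_enabled    1
}
```

With this arrangement a missing heartbeat turns the service CRITICAL after the threshold expires, and the next "I'm OK" message from the agent brings it back to OK.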
|
| Author: | Bennyvision [ Tue Nov 17, 2009 2:42 am ] |
| Post subject: | Re: EventLog problems recovering automatically, and too quickly |
Hi Steve,

I want to say that was a contiguous series of events from the Nagios log (i.e., I didn't trim or skip anything), but here is another test, end-to-end with nothing at all trimmed out. Also, I am the only person that has access to this installation, so I am positive that no one is issuing commands that I'm not aware of.

From /var/log/messages:

    Nov 16 07:30:20 hntbw597 nagios: EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;hntbw598;EventLog Agent;2;HEARTBEAT [CRIT #2]: Service halting
    Nov 16 07:30:23 hntbw597 nagios: PASSIVE SERVICE CHECK: hntbw598;EventLog Agent;2;HEARTBEAT [CRIT #2]: Service halting
    Nov 16 07:30:23 hntbw597 nagios: SERVICE ALERT: hntbw598;EventLog Agent;CRITICAL;HARD;1;HEARTBEAT [CRIT #2]: Service halting
    Nov 16 07:30:23 hntbw597 nagios: SERVICE NOTIFICATION: cbensend;hntbw598;EventLog Agent;CRITICAL;notify-service-by-email;HEARTBEAT [CRIT #2]: Service halting
    Nov 16 07:34:23 hntbw597 nagios: SERVICE ALERT: hntbw598;EventLog Agent;OK;HARD;1;OK: EventLog Agent running
    Nov 16 07:34:23 hntbw597 nagios: SERVICE NOTIFICATION: cbensend;hntbw598;EventLog Agent;OK;notify-service-by-email;OK: EventLog Agent running

And from the Nagios log:

    [1258378220] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;hntbw598;EventLog Agent;2;HEARTBEAT [CRIT #2]: Service halting
    [1258378223] PASSIVE SERVICE CHECK: hntbw598;EventLog Agent;2;HEARTBEAT [CRIT #2]: Service halting
    [1258378223] SERVICE ALERT: hntbw598;EventLog Agent;CRITICAL;HARD;1;HEARTBEAT [CRIT #2]: Service halting
    [1258378223] SERVICE NOTIFICATION: cbensend;hntbw598;EventLog Agent;CRITICAL;notify-service-by-email;HEARTBEAT [CRIT #2]: Service halting
    [1258378463] SERVICE ALERT: hntbw598;EventLog Agent;OK;HARD;1;OK: EventLog Agent running
    [1258378463] SERVICE NOTIFICATION: cbensend;hntbw598;EventLog Agent;OK;notify-service-by-email;OK: EventLog Agent running

Only one Nagios daemon is running, BTW. This is Nagios v3.2.0.

From objects.cache (to show the *running* Nagios configuration, not just the config files):

    define host {
        host_name                       hntbw598
        alias                           Nagios development Windows host
        address                         W.X.Y.Z
        check_period                    24x7
        check_command                   check-host-alive
        contact_groups                  testing-admins
        notification_period             24x7
        initial_state                   o
        check_interval                  5.000000
        retry_interval                  1.000000
        max_check_attempts              3
        active_checks_enabled           1
        passive_checks_enabled          1
        obsess_over_host                1
        event_handler_enabled           1
        low_flap_threshold              0.000000
        high_flap_threshold             0.000000
        flap_detection_enabled          1
        flap_detection_options          o,d,u
        freshness_threshold             0
        check_freshness                 0
        notification_options            d,u,r
        notifications_enabled           1
        notification_interval           360.000000
        first_notification_delay        0.000000
        stalking_options                n
        process_perf_data               1
        failure_prediction_enabled      1
        icon_image                      win40.png
        icon_image_alt                  Microsoft Windows
        statusmap_image                 win40.gd2
        notes_url                       http://hntbw597.hntb.org/nagios/notes/hntbw598.html
        retain_status_information       1
        retain_nonstatus_information    1
    }

    define service {
        host_name                       hntbw598
        service_description             EventLog Agent
        check_period                    24x7 passive checks
        check_command                   check_passive_service!0!EventLog Agent running
        contact_groups                  testing-admins
        notification_period             24x7 passive checks
        initial_state                   o
        check_interval                  5.000000
        retry_interval                  2.000000
        max_check_attempts              1
        is_volatile                     0
        parallelize_check               1
        active_checks_enabled           1
        passive_checks_enabled          1
        obsess_over_service             1
        event_handler_enabled           1
        low_flap_threshold              0.000000
        high_flap_threshold             0.000000
        flap_detection_enabled          1
        flap_detection_options          o,w,u,c
        freshness_threshold             86400
        check_freshness                 0
        notification_options            u,w,c,r
        notifications_enabled           1
        notification_interval           360.000000
        first_notification_delay        0.000000
        stalking_options                n
        process_perf_data               1
        failure_prediction_enabled      1
        retain_status_information       1
        retain_nonstatus_information    1
    }

I'll change the freshness check now and see what happens...

Thanks so much!

Benny |
|
| Author: | stevesh [ Tue Nov 17, 2009 8:33 am ] |
| Post subject: | Re: EventLog problems recovering automatically, and too quickly |
You should not have active checks enabled for this service (my point 4 above)! Your active check is causing the second status alert. Change the message in the service definition so you'll be able to see this.

Steve |
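One way to read the "change the message" suggestion is to swap the text passed to check_passive_service for something that could only come from the active check. The fragment below is an illustration only; the marker text is made up, and any distinctive string would serve.

```cfg
# Fragment only: the one directive to change in the existing service definition.
# "Active check ran unexpectedly" is a made-up marker, not text from the thread.
check_command    check_passive_service!0!Active check ran unexpectedly
```

Since check_dummy echoes its second argument after the state label (as in "OK: EventLog Agent running" in the logs above), a recovery alert carrying this marker would point at the service's own check_command rather than at a heartbeat from the agent.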
|
| Author: | Bennyvision [ Tue Nov 17, 2009 8:42 am ] |
| Post subject: | Re: EventLog problems recovering automatically, and too quickly |
Yes, I caught this after my reply earlier today. It appears that NConf is doing some .. uh .. silly things like tying active_checks_enabled/passive_checks_enabled to the *timeperiod*, and if I add it to the service definition to override, it adds a second active_checks_enabled definition to the service. Sigh. So, not my fault, but certainly my fault for not noticing it until now. The NConf guys are releasing a new version this week; hopefully this goofy behavior is fixed. Thanks, Steve! |
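For anyone comparing their own generated files, the duplication described here is visible in the service definition from the first post. The sketch below repeats only the affected directives; the "intended" half is an assumption about what a purely passive service should contain, not output from NConf.

```cfg
# As generated (copied from the first post): the pair appears twice.
#   active_checks_enabled    0
#   passive_checks_enabled   1
#   ...
#   active_checks_enabled    1
#   passive_checks_enabled   1
#
# Intended for a passive-only service: each directive once.
active_checks_enabled    0
passive_checks_enabled   1
```

The objects.cache dump earlier in the thread reports active_checks_enabled 1 for this service, which suggests the later of the two values is the one that took effect.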
|
| Page 1 of 1 | All times are UTC + 12 hours [ DST ] |
|