Saturday, May 25th 2013, 10:10am UTC+2


1

Thursday, August 6th 2009, 12:44pm

Nagios 3.0.6 Just stops working for no Apparent Reason... problem with NDO maybe?

Hi All,
Nagios 3.0.6
NDO: the latest beta, which is getting on a bit to be honest. 1.4b or something, isn't it?

I'm hedging my bets a bit by posting in several different forums to find answers.

I see that when people have had a similar issue to mine, it's either a box that's not powerful enough or NDO (a similar thing, actually). However, I'm running a powerful box with fewer checks than other people who have had this issue, so I don't think it's down to that.

My problem is basically that active Nagios checks seem to just stop occurring, as if someone's hit a pause button. This is the second time it's happened over the past week. Before that I'd had no problems at all running Nagios on this all-powerful box for at least 2 months, and before that it was only on a measly virtual machine and never had any problem. Strangely, when this occurs you can still browse around the Nagios web GUI, but if you try to run anything from the right-hand side pane, where the links to run Nagios commands are, it looks like it's going to execute them... but it never actually gets round to doing so.

No errors are recorded in any Nagios logs from what I can see... the last log entries just state the outputs of the last checks that occurred. The Nagios process stays running and the NDO process stays running. MySQL is running and there is still plenty of disk space. What I have found is that some tables, like nagios_hostchecks, are huge... I mean massive, i.e.:
show table status\G returns 664929 rows (which is no biggy really), 281362432 data_length, 23134208 index_length, and roughly 4800000 for the auto-increment number... some of those numbers seem a bit high, but still.
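
To put those figures in perspective, here is a small worked calculation in Python. It only converts the numbers quoted above (MySQL reports data_length and index_length in bytes); it says nothing about whether this size is related to the stall.

```python
# Figures quoted in the post for nagios_hostchecks.
rows = 664929
data_length = 281362432   # bytes
index_length = 23134208   # bytes

MIB = 1024 * 1024

data_mib = data_length / MIB
index_mib = index_length / MIB
avg_row_bytes = data_length / rows

print("data:  %.1f MiB" % data_mib)
print("index: %.1f MiB" % index_mib)
print("average row size: %.0f bytes" % avg_row_bytes)
```

That works out to roughly 268 MiB of data and about 22 MiB of index, or a bit over 420 bytes per row.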

Would tweaking the trimming numbers within NDO help... and what exactly does this do? I mean, I'm not sure why it needs to keep so much data. What makes use of all those extra days of data: Nagios? NagVis? pnp? I can't think that any of them need it.

Any ideas anyone?
Regards,
Mark

dnsmichi

Super Moderator

Posts: 5,990

Birthday: May 30th 1983 (29)

Gender: male

Location: Nürnberg

Occupation: Consultant / Developer beim besten Arbeitgeber der Welt @netways

Number of monitoring servers: Icinga: 4x dev, 10++ prod, Icinga2: 2x dev

Nagios Version: s/nagios/icinga/

Icinga Version: 1.9.1 / GIT

Distributed monitoring: Ja

Redundant monitoring: Ja

Number of hosts: 1000+

Number of services: 15000+

OS: RHEL, Debian, SUSE

Plugin Version: 1.4.16

IDO-Version: 1.9.1 / GIT MySQL/Postgresql/Oracle

Other Addons: Icinga Web, PNP, check_multi, inGraph, EventDB, LConf

2

Thursday, August 6th 2009, 2:10pm

the blocking behaviour of ndo is a very common problem. if the ndo2db daemon encounters a problem, ndomod's writes to the socket go into blocking mode, which in turn causes nagios to sit there like a dead process: still running, but not doing any checks, neither on its own schedule nor on request from the web interface. we ran into that on a weekend, so a small perl script that checks the freshness of status.dat (where all the check results are written) now runs on another server to see whether there's a problem with nagios - kind of strange, but well.
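
The freshness check described above could look roughly like this. This is a minimal sketch in Python, not the perl script from the post; the path and the 300 second threshold are assumptions, and it presumes the checking machine can read status.dat (for example over a shared mount).

```python
import os
import time

# Assumed values: adjust the path and threshold to your installation.
STATUS_FILE = "/usr/local/nagios/var/status.dat"
MAX_AGE_SECONDS = 300


def check(path, max_age_seconds, now=None):
    """Return (exit_code, message) based on the file's modification time."""
    if now is None:
        now = time.time()
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        # Missing or unreadable file counts as a failure.
        return 2, "CRITICAL: %s is missing or unreadable" % path
    if age > max_age_seconds:
        return 2, "CRITICAL: %s not updated for %d seconds" % (path, age)
    return 0, "OK: %s updated %d seconds ago" % (path, age)


# Example use as a Nagios-style plugin (exit code 0 = OK, 2 = CRITICAL):
#   code, message = check(STATUS_FILE, MAX_AGE_SECONDS)
#   print(message); raise SystemExit(code)
```

The threshold should be comfortably larger than the interval at which Nagios rewrites status.dat, otherwise the check will raise false alarms.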

removing some of the blocking by trimming the database regularly could be a solution. database trimming works like this:

* several interval parameters for the most commonly used tables are set in ndo2db.cfg
* at ndo2db startup, those tables are checked and trimmed by a simple "delete from bla where time < (now - interval)" query. this can lead to quite a long deletion period; i've seen that on oracle, where ndo2db simply died deleting from 3.7 million rows
* while ndo2db is running, a check for old entries is performed every 60 seconds, with much the same delete statement to wait for
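
The age-based delete from the list above can be sketched as follows. This is an illustration using Python's built-in sqlite3 module, not ndo2db's actual code; the table name demo_hostchecks, the column start_time and the helper trim_old_rows are all hypothetical.

```python
import sqlite3
import time


def trim_old_rows(conn, table, time_column, max_age_seconds, now=None):
    """Delete rows older than max_age_seconds; return the number removed."""
    if now is None:
        now = time.time()
    cutoff = now - max_age_seconds
    # Table and column names cannot be bound as parameters, so they must
    # come from trusted configuration, never from user input.
    cur = conn.execute(
        "DELETE FROM %s WHERE %s < ?" % (table, time_column), (cutoff,)
    )
    conn.commit()
    return cur.rowcount


# Demo with an in-memory database and a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_hostchecks (id INTEGER PRIMARY KEY, start_time REAL)"
)
now = 1000000.0
ages = (10, 3600, 90000, 700000)  # seconds before `now`
conn.executemany(
    "INSERT INTO demo_hostchecks (start_time) VALUES (?)",
    [(now - age,) for age in ages],
)

# Keep one day of data: the two rows older than 86400 seconds are removed.
removed = trim_old_rows(conn, "demo_hostchecks", "start_time", 86400, now=now)
remaining = conn.execute("SELECT COUNT(*) FROM demo_hostchecks").fetchone()[0]
```

A single delete like this has to touch every expired row in one go, which is why the first trim on a table that has grown for months can take a long time.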

concerning the amount of data - ndo2db saves nearly everything from nagios, even event logs (timed_events), which is kind of annoying if you do not need the data. for historical purposes, e.g. analysing services and their up/downtime over the past year, it's a good thing to have (even though mysql shouldn't be the preferred rdbms for that, and the db schema is way too normalized).
if you don't want certain data, you can change data_processing_options in the config to an appropriate value (i.e. kick timed_events and so on). several addons only need the configs and some live data to work with ndoutils.
+++ Icinga / LConf Developer +++ Senior Consultant at []NETWAYS> +++
+++ Icinga 1.9 || Icinga 2 +++ Icinga Support || IRC +++

3

Thursday, August 6th 2009, 5:08pm

Thanks for this information. I'm surprised I've not come across this problem before with NDO, as I've been using it for quite a few months now. I guess I'm getting to a stage where I'm adding more checks and this is obviously starting to have an impact. Thank you for your reply; it seemed like I couldn't find any details about this issue elsewhere (perhaps you can point me in the right direction if there is something written about this problem elsewhere?).

Is the only way to fix this to kill NDO, stop Nagios, then start NDO and start Nagios again? Or is there an interactive way to unlock this 'blocking mode'? I think I need to find out more about it before taking any action. However, it's quite a problem if Nagios stops responding and we don't know about it. I've heard the new Nagios people over at Icinga plan to develop an API for this stuff to get rid of the old NDO... which is long awaited, I'd say. Considering NDO isn't under development anymore, I really didn't know whether I should use it in the production environment; however, it seems most people do, and it is needed for most of the really cool third-party apps.

Thanks for your post though... it's been a real help to know this. I assume there is no known fix for whatever is going on?

dnsmichi

4

Thursday, August 6th 2009, 11:37pm

first of all, i've become one of the maintainers of ndoutils oracle, which is a port of the mysql-based ndoutils to oracle. as you said, development of ndoutils quietly stopped and the codebase wasn't that good, so i got a bit deeper into the code. then icinga was announced as a fork, and the code has been patched up a bit on the way from ndoutils to idoutils. the primary goal of ndoutils oracle was to commit changes upstream, which wasn't possible for quite a long time. for that reason i have taken the chance to contribute my knowledge to icinga and idoutils. currently i am working on porting and improving queries and housekeeping for mysql, and on getting postgres and also oracle to work with idoutils and icinga.

the blocking behaviour is well known; it has been discussed on the mailing list in different threads, but since there was no development or support, nothing has changed. we have discussed it here recently (but in german). there are several places in the code where one could start debugging and build a better "buffer" to deal with that bug. let's see if we can catch up on that in icinga and improve idoutils. if you have any further questions, just come back here. imho the nagios-users mailing list is quite good but not always that helpful, and searching it for solutions is sometimes kind of awkward.

as i pointed out for your problem, try the database_trimming_options and, first of all, decide with data_processing_options what you really need in your database. if you're not interested in historical data, just don't allow ndo to collect it and write it to the database. in ndomod.h you will find the defines with their values - add together the ones you need, and that's the value to set in the config (also not very user-friendly, and probably something to change if the users want that...). the default value is -1 btw, meaning everything.
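
Adding the defines together amounts to building a bitmask. A minimal sketch in Python of how such a value is combined; the names and numbers in EXAMPLE_FLAGS are placeholders for illustration only, and the real defines must be taken from ndomod.h in your own source tree.

```python
# Placeholder flags: names and values are illustrative only.
# Look up the real defines in ndomod.h before setting anything.
EXAMPLE_FLAGS = {
    "process_data": 1,
    "timed_events": 2,
    "log_entries": 4,
    "service_checks": 8,
    "host_checks": 16,
}


def combine(names, flags=EXAMPLE_FLAGS):
    """OR together the flags for the given names."""
    total = 0
    for name in names:
        total |= flags[name]
    return total


def everything_except(excluded, flags=EXAMPLE_FLAGS):
    """Combine every known flag apart from the excluded names."""
    return combine(name for name in flags if name not in excluded)


def is_enabled(option_value, flag):
    """-1 has every bit set, so it enables every flag."""
    return (option_value & flag) != 0


only_checks = combine(["service_checks", "host_checks"])
no_timed_events = everything_except({"timed_events"})
```

With these placeholder values, dropping timed_events from "everything" gives 29, while -1 still reports every flag as enabled. Because each flag is a distinct power of two, adding and OR-ing give the same result.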

5

Friday, August 7th 2009, 3:31pm

Thanks again for your reply, and best of luck helping out with the icinga project....

What could this mean for people using Nagios... will Icinga help out at all? A good example here would of course be NDOUtils: will idoutils work with Nagios, etc.? Will they both be helping each other out, or does the community now have to pick one or the other? That could be a real pain. If Icinga produces what it says it will, then I'd think everyone would eventually be using it and Nagios would be a thing of the past... but for some reason I just can't see that happening.

Because it seems quite a pain to have to switch to something that is more or less exactly the same as Nagios (as it stands at the moment, at least from what I understand), but we'd have to move to it anyway because, well, we obviously couldn't stay with NDO. And all the third-party apps like pnp, NagVis etc. would, I assume, already work exactly the same because Icinga is basically the same...? WWOOOHA, this is a little confusing :-)

Cheers,
mark

Added Edit:- Also, I've just seen that 1.4b8 has been released. Are you part of the team making that happen, or has work suddenly started on fixing it?

This post has been edited 1 times, last edit by "nagiosuser123" (Aug 7th 2009, 5:20pm)


dnsmichi

6

Saturday, August 8th 2009, 3:56am

ethan galstad (the so-called "godfather" of nagios) recently went looking for ndoutils developers, since it had occurred to him that there was a lack of them. hendrik and i applied, with a view to merging fixes from icinga/idoutils into the current ndoutils. what happened next is that hendrik got into the team, while i am a bit unknown in the nagios "scene" since i was not that active on the mailing list and so on (i understand that aspect, but as far as my programming skills are concerned, well.. no).

so right now, 1.4b8 has been released, which is a compilation of patches since the last release (i've only been using and testing the release from cvs). maybe development will pick up a bit again and several fixes will be applied.

which brings us to icinga - the fork was a consequence of ethan galstad being the only nagios developer accepting and testing committed patches. that is just not what the community was waiting for. so there was the split, but keep in mind that the first step is still to rename the code to something new without any nagios trademark, and then to work on and improve the code - but this time with a broader range of people working together and also introducing new features. this is what icinga is.

as a matter of fact, idoutils should e.g. work with nagios. but with icinga, it is currently part of the main icinga install, meaning a module next to the core which will be improved in many ways. on the mailing list there was a discussion about deleting it completely and coding something new (which merlin from op5 tries to be). but the main goals are to bring more rdbms to work with idoutils, improve buffering/housekeeping and so on, heavily focused on doing right the things which ndoutils couldn't fix. as an example, if you look at the current and earlier ndoutils code, you will notice that there was supposed to be an implementation for postgres. but in fact only the delete and insert statements would have fit; everything else is either not yet implemented or pure "bullshit".

7

Monday, August 10th 2009, 12:52pm


dnsmichi - Thanks again for that reply... everything is so much clearer now and it's all falling into place. You've been great answering my questions and I very much appreciate it... Thanks a million, and I hope this thread is useful to many others.

Best of health,
Thanks again!!
Mark

8

Monday, October 26th 2009, 8:44am

Re-opening... Still a Problem?

Hi All,

I'm using 3.2.0 Core Nagios and NDOutils 1.4b8...

I'm still having these issues... however, this time it's slightly different to before.
I came in this morning only to find that Nagios hadn't carried out any checks for 99% of my hosts and services, which all report a last check of:-
10-25-2009 22:55:00 - 10-25-2009 22:59:59

So for all these hosts and services... somehow Nagios has scheduled the next check for 99% of them at:-
10-26-2009 23:00:00

Why has it done that? About 90% of those 99% are set to be re-checked every 5 mins and nothing is set to be checked only once every 24 hours, so...????

However, a small few, roughly 10 in total, seemed to be unaffected and have carried on checking as normal. Stranger still, restarting the Nagios process made no difference, as Nagios just picked up from where it left off and didn't start checking all the services again... plus it seems that if I force a re-check of a host or service, it just starts working and re-checking as normal. So odd. Corruption of the scheduler or something... is that even possible?

Don't think it's related to an NDO issue this time; is this something different?
Thanks for any help.
Regards,
M

EDIT:-
Any chance of it being time-zone related... it seems strange that we've just moved into a different time of year and then this happens. I've not seen it as a problem before on this server, and no time settings have been changed at all that I can remember... still, could that be a possibility?
EDIT-EDIT:-
Added a Pic to explain a bit better..

This post has been edited 2 times, last edit by "nagiosuser123" (Oct 26th 2009, 9:10am)


hazet

Intermediate

Posts: 275

Birthday: Sep 8th

Gender: male

Location: Augsburg

Number of monitoring servers: 2

Nagios Version: 3.0.5

Distributed monitoring: Nein

Redundant monitoring: Ja

Number of hosts: 160

Number of services: >2000

OS: Ubuntu 8.04

Plugin Version: 1.4.12

NagVis Version: 1.3

NDO Version: 1.4.7b

Other Addons: PNP 0.6.1,NagTrap 0.1.2

9

Monday, October 26th 2009, 9:12am

Complex problems have simple, easy to understand, wrong answers.
(H. L. Mencken)

10

Monday, October 26th 2009, 11:15am

Thanks hazet!!!

Much obliged, this is exactly the problem..

Cheers and Thanks again,
M