Forum OpenACS Q&A: Ns_Pool: invalid block

Collapse
Posted by Jonathan Ellis on
I noticed that I get this error around a dozen times a day, usually
after one of two queries (that execute fine the rest of the time) but
occasionally with other queries.  It looks like

nsthread(6221) error: Ns_Pool: invalid block: 0x8178d98

then aolserver crashes.  I'm running AOLserver 3.3ad13 with Jerry's
vhr patches (version 6).  Is this an error with aolserver itself or
the postgres driver?  How should I try to fix it?

Collapse
Posted by Jonathan Ellis on
I'm still getting this ns_pool: invalid block error.  This is pg 7.1.3 and pgdriver 2.0.1.  More recently I've had the pg backend die on me without warning which wedges aolserver REALLY badly.  I'd guess the two are probably related...

I thought at first it must be bad ram, but I've swapped half my ram out at a time and it's happened in all configurations.

At this point I'm just crossing my fingers hoping 7.2 fixes it... anyone have a cron script handy to check for wedged aolserver before I go ahead and write one in the meantime? :/

Collapse
Posted by David Walker on
That error is generated by AOLServer.  Perhaps a query on the AOLServer list
would yield some background into what might cause the error.
Collapse
Posted by Jonathan Ellis on
yeah, but ns_pool is what talks to the db... and the db has been crashing so I'm thinking the latter may have caused the former.

I will try asking the aolserver list tho if their listserv is any more user-friendly than it was the last time I tried. :/

Collapse
Posted by David Walker on
The lesson I learned is not to use mailto:listserv.aol.com@mydomain.com as my
email address.  Once I changed that I haven't had any more trouble.  YMMV.
Collapse
Posted by Don Baccus on
The driver will try to restablish the backend connection if a backend dies.  It used to work.  I imagine it still does but the PG group has done such a good job of fixing bugs that crash the backend that I no longer have any simple test cases to test this driver feature!

Do you have enough stack space allocated per thread?

Are you in a position where you can set up things without Jerry's patches?  This would help isolate things.  I ask because my AOLserver 3.3ad13+PG 7.1.2 site stays up literally for months at a time.  I just added a 1/2 gig of RAM to it last week - uptime was 200 days.  I'd restarted AOLserver a few times in the interim as I did site upgrades but other than that it's up all the time and I've never seen this problem you describe.

Collapse
Posted by Jonathan Ellis on
I'll try running it without the vhr setup tomorrow...  at least part of the problem does seem to be vhr related; not sure if the slave backends reconnected to the db ok -- both times I killed the slaves first -- but for sure nothing got served after the db went down until I restarted the master.  Not sure why the one would hose the other but both times the master server has gone down and stayed down until I killed it.  So maybe I didn't need to restart the slaves at all but I won't know until it happens again. :/
Collapse
Posted by Jonathan Ellis on
I've set stack size to 128k now (how do I know if that's really enough?) but it was just at the default before.  (64k?)
Collapse
Posted by Don Baccus on
Ahhh...stacksize...this could be a useful change, let us know :)

128K comes from aD's experiences I believe ... determined experimentally using the WAG algorithm, probably.  But I've had problems when not specifying that much stackspace in the PG version, and have had them go away once I've remembered to allocate that much.

It will be interesting to see if this makes your problems disappear.

Collapse
Posted by mark dalrymple on
The original reason for the big stack sizes was the first couple of versions of the aD oracle driver put huge character arrays on the stack (for error messages and logging).  After the driver changed ownership a couple of times, the stack size was reduced as message bufers became dynamically allocated.
Collapse
Posted by Jonathan Ellis on
The db crashed again. I didn't have to restart the master nsd this time. Odd. I may not have had to restart the slave but ad_validate_security_info was spamming my error log ten times a second so I panicked and restarted it. :) But it looks like the driver was trying to reconnect to pg for each query while it started up. So the driver is exhonerated I think. :)

Here's what pg had to say in the log:

Jan 25 16:50:05 www logger: pq_recvbuf: unexpected EOF on client connection
Jan 25 16:50:05 www last message repeated 2 times
Jan 25 16:50:05 www logger: pq_flush: send() failed: Broken pipe
Jan 25 16:50:05 www logger: pq_recvbuf: unexpected EOF on client connection
Jan 25 16:50:05 www logger: pq_flush: send() failed: Broken pipe
Jan 25 16:50:05 www logger: pq_recvbuf: unexpected EOF on client connection
Jan 25 16:50:06 www logger: pq_flush: send() failed: Broken pipe
Jan 25 16:50:06 www logger: pq_recvbuf: unexpected EOF on client connection
Jan 25 16:50:06 www logger: pq_flush: send() failed: Broken pipe
Jan 25 16:50:06 www logger: pq_recvbuf: unexpected EOF on client connection
Jan 25 16:50:06 www logger: pq_flush: send() failed: Broken pipe
Jan 25 16:50:06 www logger: pq_recvbuf: unexpected EOF on client connection
Jan 25 16:50:11 www logger: Server process (pid 1476) exited with status 139 at
 Fri Jan 25 16:50:11 2002
Jan 25 16:50:11 www logger: Terminating any active server processes...
Jan 25 16:50:11 www logger: NOTICE:  Message from PostgreSQL backend:
Jan 25 16:50:11 www logger: ^IThe Postmaster has informed me that some other ba
ckend^Idied abnormally and possibly corrupted shared memory.
Jan 25 16:50:11 www logger: ^II have rolled back the current transaction and am
^Igoing to terminate your database system connection and exit.
Collapse
Posted by Don Baccus on
Oh!  It looks like you have a query that's killing the PG backend ... you need to isolate it (try turrning on "verbose").  Generally PG-killing queries are repeatable, i.e. if you isolate it you'll be able to crash it in PSQL, too - while you figure out a workaround.

Regarding stacksize ... I think the 128*1024 actually comes from AOLserver folk, come to think of it, normally commented out in the default .tcl file?

Mark's post triggered my memory - aD used more like 512KB with the early ora8 incarnation for the reasons he mentions, IIRC.

Collapse
Posted by Patrick Giagnocavo on
<blockquote>> Jan 25 16:50:05 www logger: pq_recvbuf: unexpected EOF on client connection
</blockquote>
^^^^^^^

The above you will see anytime a client drops the connection.  Fire up
psql then kill it with a kill -11 or a kill -9 and you will get the
same error message.

<blockquote> Jan 25 16:50:11 www logger: Server process (pid 1476) exited with status 139
</blockquote>

You should look up what status 139 means.  I think it means "core dump".  If it saved a core dump file, usually named either "core" or "postmaster.core", and usually in the directory where you started either the AOLServer or the home directory of postgres, then you may be able to load it up under gdb and get a stack trace.  Which may or may not be useful.

You should run ipcclean to clean up the shared memory segments.  I have found sometimes that it is a good idea to reboot to ensure that shared memory is truly cleaned up.

You don't mention which OS, but I am guessing Linux.  Here are the things I would do:

1.  run mprime in torture test mode - get it from www.mersenne.org.  It heavily tests your RAM and CPU, and if it runs fine your RAM is ok.  You can run mprime while everything else is running.

2.  Have you upgraded any of the shared libraries, like glibc, that Postgres or AOLserver are dynamically linked to?  If so, you should recompile, or set LD_LIBRARY_PATH to point to a directory that has the original libraries you compiled against.

3.  Run "vacuum verbose analyze" a few times in succession.  Then use pg_dump to dump the entire database and re-import it into a different database name.  Does it import cleanly?  If so, make a new nsd.tcl config file, changing only the name of the db to connect to.  Stop AOLserver, then restart it with the new nsd.tcl so that it uses the new db.  What happens then?

My guess:  you either upgraded your libraries, or you need to upgrade your libraries.  Sometimes different versions of the thread library will work better than other versions.

Collapse
Posted by Don Baccus on
Well ... could be but I'm guessing he's just managed to write a query that's broken PG.  The message from the postmaster is given whenever a backend dies, which in my experience has always been due to a query that  PG simply barfs on.

Note that he's not getting the invalid block error - I think that was due to inadequate stack space.

Of course, it's a good idea to run RAM and other system tests, just on general principle.

But I think he's going to find there's a PG-breaking query at the root of this new symptom he's seeing.

Collapse
Posted by Jonathan Ellis on
Turn on verbose -- do you mean postmster -d debug_level?  Any idea what level I'd want?  All the docs say is "values up to 4 can make sense."

I've had the ns_pool error 5 times since doubling stack space to 128k.  I've taken out WAY more cached tcl procs than I've added, and I use util_memoize sparingly, so I wouldn't think I'd need to increase it to much more than 128k.  I take it there's not really any way to tell for sure, though. :/

Re patrick's suggestions -- I'll give one of the memory tester's a shot when I take down the server for a few minutes to up the pg debug level.  I haven't played with the shared libraries, and since I rebooted the server less than 24h before the most recent backend crash I don't think that's the problem either.  I also did a totally fresh load from pg_dump at that time, with no errors.

Collapse
Posted by Dave Bauer on
Check out the database pool section of the aolserver config file. Look for this:
    ns_param   Verbose            On
    ns_param   ExtendedTableInfo  On
    ns_param   LogSQLErrors       On 
turning these one will give you the most information in the AOLserver error log about database errors.
Collapse
Posted by Don Baccus on
Hmmm...have you tried removing Jerry's nsvhr patches yet?

The fact that you're still getting ns_pool errors makes me more open to the "busted hardware" theory, too ...

Collapse
Posted by Jonathan Ellis on
I tried the memory tester at http://www.qcc.sk.ca/~charlesc/software/memtester/ but it didn't find any errors.  I've got it running now without vhr and with the pools set to verbose; we'll see what happens...
Collapse
Posted by Jonathan Ellis on
We are getting somewhere now...

I took out my cron job to auto-kickstart the webserver when the db went down, which was stupid because it went down at 1:15 AM this morning and nsd never recovered. :(  I get a bunch of messages that it's trying to reconnect but it never did until I whacked it 7 hours later.  Maybe it's nsd and not the driver but it was definitely DOA.  (Still running w/o nsvhr.)  So that is the bad news.

The good news is that the last query executed before the backend went down crashes 7.1.3 100% of the time.  I tried it on my trusty SS10, too, so this one's definitely not a HW issue.  Selecting soundex(null) -- from when someone leaves a search form blank -- crashes it hard.  Doh! :)

Also I noticed that the 50 or so ns_pool errors in my logs have all happened with the same 3 memory addresses, which may indicate hardware issues, despite what 'memtest' said.  Or maybe not.  Haven't heard anything useful from AOLserver list yet, but I'm on digest format so we'll see tonight.

Collapse
Posted by Jonathan Ellis on
... and btw the user must have been typing urls in manually because normally the form in question won't let you submit it without a reasonably valid name.  No wonder the db only died every blue moon.
Collapse
Posted by David Walker on
Where did soundex come from?  Is it a C function?
Collapse
Posted by Jonathan Ellis on
yeah it's in the contrib directory of 7.1.3

7.2 replaces it with the 'fuzzystrmatch' suite, also in contrib; dunno if that has the same bug

Collapse
Posted by David Walker on
Are you using ns_dbquotevalue?  Is that why there is a NULL there?
Could you switch to DoubleApos or something like that?
Collapse
Posted by Jonathan Ellis on
I'm using the pseudo-bind-variables from the 4.x api.  Fixing it was as simple as adding a special empty_string_p test at the top of the function.
Collapse
Posted by Jonathan Ellis on
Someone on the AOLServer list suggested running without -z and since I did 24h ago I haven't seen the Ns_Pool error.  Looks like we have a winner!