Forum OpenACS Q&A: Ns_Pool: invalid block

Posted by Jonathan Ellis on 12/05/01 01:19 AM

I noticed that I get this error around a dozen times a day, usually
after one of two queries (that execute fine the rest of the time) but
occasionally with other queries. It looks like

nsthread(6221) error: Ns_Pool: invalid block: 0x8178d98

then aolserver crashes. I'm running AOLserver 3.3ad13 with Jerry's
vhr patches (version 6). Is this an error with aolserver itself or
the postgres driver? How should I try to fix it?

2: Response to Ns_Pool: invalid block (response to 1)

Posted by Jonathan Ellis on 01/25/02 04:06 AM

I'm still getting this ns_pool: invalid block error. This is pg 7.1.3 and pgdriver 2.0.1. More recently I've had the pg backend die on me without warning which wedges aolserver REALLY badly. I'd guess the two are probably related...

I thought at first it must be bad ram, but I've swapped half my ram out at a time and it's happened in all configurations.

At this point I'm just crossing my fingers hoping 7.2 fixes it... anyone have a cron script handy to check for wedged aolserver before I go ahead and write one in the meantime? :/

3: Response to Ns_Pool: invalid block (response to 1)

Posted by David Walker on 01/25/02 04:37 PM

That error is generated by AOLServer. Perhaps a query on the AOLServer list
would yield some background into what might cause the error.

4: Response to Ns_Pool: invalid block (response to 1)

Posted by Jonathan Ellis on 01/25/02 04:41 PM

yeah, but ns_pool is what talks to the db... and the db has been crashing so I'm thinking the latter may have caused the former.

I will try asking the aolserver list tho if their listserv is any more user-friendly than it was the last time I tried. :/

5: Response to Ns_Pool: invalid block (response to 1)

Posted by David Walker on 01/25/02 04:50 PM

The lesson I learned is not to use mailto:listserv.aol.com@mydomain.com as my
email address. Once I changed that I haven't had any more trouble. YMMV.

6: Response to Ns_Pool: invalid block (response to 1)

Posted by Don Baccus on 01/25/02 08:08 PM

The driver will try to restablish the backend connection if a backend dies. It used to work. I imagine it still does but the PG group has done such a good job of fixing bugs that crash the backend that I no longer have any simple test cases to test this driver feature!

Do you have enough stack space allocated per thread?

Are you in a position where you can set up things without Jerry's patches? This would help isolate things. I ask because my AOLserver 3.3ad13+PG 7.1.2 site stays up literally for months at a time. I just added a 1/2 gig of RAM to it last week - uptime was 200 days. I'd restarted AOLserver a few times in the interim as I did site upgrades but other than that it's up all the time and I've never seen this problem you describe.

7: Response to Ns_Pool: invalid block (response to 1)

Posted by Jonathan Ellis on 01/25/02 08:20 PM

I'll try running it without the vhr setup tomorrow... at least part of the problem does seem to be vhr related; not sure if the slave backends reconnected to the db ok -- both times I killed the slaves first -- but for sure nothing got served after the db went down until I restarted the master. Not sure why the one would hose the other but both times the master server has gone down and stayed down until I killed it. So maybe I didn't need to restart the slaves at all but I won't know until it happens again. :/

8: Response to Ns_Pool: invalid block (response to 1)

Posted by Jonathan Ellis on 01/25/02 09:05 PM

I've set stack size to 128k now (how do I know if that's really enough?) but it was just at the default before. (64k?)

9: Response to Ns_Pool: invalid block (response to 1)

Posted by Don Baccus on 01/25/02 10:20 PM

Ahhh...stacksize...this could be a useful change, let us know :)

128K comes from aD's experiences I believe ... determined experimentally using the WAG algorithm, probably. But I've had problems when not specifying that much stackspace in the PG version, and have had them go away once I've remembered to allocate that much.

It will be interesting to see if this makes your problems disappear.

10: Response to Ns_Pool: invalid block (response to 1)

Posted by mark dalrymple on 01/25/02 11:05 PM

The original reason for the big stack sizes was the first couple of versions of the aD oracle driver put huge character arrays on the stack (for error messages and logging). After the driver changed ownership a couple of times, the stack size was reduced as message bufers became dynamically allocated.

11: Response to Ns_Pool: invalid block (response to 1)

Posted by Jonathan Ellis on 01/26/02 12:50 AM

The db crashed again. I didn't have to restart the master nsd this time. Odd. I may not have had to restart the slave but ad_validate_security_info was spamming my error log ten times a second so I panicked and restarted it. :) But it looks like the driver was trying to reconnect to pg for each query while it started up. So the driver is exhonerated I think. :)

Here's what pg had to say in the log:

Jan 25 16:50:05 www logger: pq_recvbuf: unexpected EOF on client connection
Jan 25 16:50:05 www last message repeated 2 times
Jan 25 16:50:05 www logger: pq_flush: send() failed: Broken pipe
Jan 25 16:50:05 www logger: pq_recvbuf: unexpected EOF on client connection
Jan 25 16:50:05 www logger: pq_flush: send() failed: Broken pipe
Jan 25 16:50:05 www logger: pq_recvbuf: unexpected EOF on client connection
Jan 25 16:50:06 www logger: pq_flush: send() failed: Broken pipe
Jan 25 16:50:06 www logger: pq_recvbuf: unexpected EOF on client connection
Jan 25 16:50:06 www logger: pq_flush: send() failed: Broken pipe
Jan 25 16:50:06 www logger: pq_recvbuf: unexpected EOF on client connection
Jan 25 16:50:06 www logger: pq_flush: send() failed: Broken pipe
Jan 25 16:50:06 www logger: pq_recvbuf: unexpected EOF on client connection
Jan 25 16:50:11 www logger: Server process (pid 1476) exited with status 139 at
 Fri Jan 25 16:50:11 2002
Jan 25 16:50:11 www logger: Terminating any active server processes...
Jan 25 16:50:11 www logger: NOTICE:  Message from PostgreSQL backend:
Jan 25 16:50:11 www logger: ^IThe Postmaster has informed me that some other ba
ckend^Idied abnormally and possibly corrupted shared memory.
Jan 25 16:50:11 www logger: ^II have rolled back the current transaction and am
^Igoing to terminate your database system connection and exit.

12: Response to Ns_Pool: invalid block (response to 1)

Posted by Don Baccus on 01/26/02 02:03 AM

Oh! It looks like you have a query that's killing the PG backend ... you need to isolate it (try turrning on "verbose"). Generally PG-killing queries are repeatable, i.e. if you isolate it you'll be able to crash it in PSQL, too - while you figure out a workaround.

Regarding stacksize ... I think the 128*1024 actually comes from AOLserver folk, come to think of it, normally commented out in the default .tcl file?

Mark's post triggered my memory - aD used more like 512KB with the early ora8 incarnation for the reasons he mentions, IIRC.

13: Response to Ns_Pool: invalid block (response to 1)

Posted by Patrick Giagnocavo on 01/26/02 02:09 AM

<blockquote>> Jan 25 16:50:05 www logger: pq_recvbuf: unexpected EOF on client connection
</blockquote>
^^^^^^^

The above you will see anytime a client drops the connection. Fire up
psql then kill it with a kill -11 or a kill -9 and you will get the
same error message.

<blockquote> Jan 25 16:50:11 www logger: Server process (pid 1476) exited with status 139
</blockquote>

You should look up what status 139 means. I think it means "core dump". If it saved a core dump file, usually named either "core" or "postmaster.core", and usually in the directory where you started either the AOLServer or the home directory of postgres, then you may be able to load it up under gdb and get a stack trace. Which may or may not be useful.

You should run ipcclean to clean up the shared memory segments. I have found sometimes that it is a good idea to reboot to ensure that shared memory is truly cleaned up.

You don't mention which OS, but I am guessing Linux. Here are the things I would do:

1. run mprime in torture test mode - get it from www.mersenne.org. It heavily tests your RAM and CPU, and if it runs fine your RAM is ok. You can run mprime while everything else is running.

2. Have you upgraded any of the shared libraries, like glibc, that Postgres or AOLserver are dynamically linked to? If so, you should recompile, or set LD_LIBRARY_PATH to point to a directory that has the original libraries you compiled against.

3. Run "vacuum verbose analyze" a few times in succession. Then use pg_dump to dump the entire database and re-import it into a different database name. Does it import cleanly? If so, make a new nsd.tcl config file, changing only the name of the db to connect to. Stop AOLserver, then restart it with the new nsd.tcl so that it uses the new db. What happens then?

My guess: you either upgraded your libraries, or you need to upgrade your libraries. Sometimes different versions of the thread library will work better than other versions.

14: Response to Ns_Pool: invalid block (response to 1)

Posted by Don Baccus on 01/26/02 02:36 AM

Well ... could be but I'm guessing he's just managed to write a query that's broken PG. The message from the postmaster is given whenever a backend dies, which in my experience has always been due to a query that PG simply barfs on.

Note that he's not getting the invalid block error - I think that was due to inadequate stack space.

Of course, it's a good idea to run RAM and other system tests, just on general principle.

But I think he's going to find there's a PG-breaking query at the root of this new symptom he's seeing.

15: Response to Ns_Pool: invalid block (response to 1)

Posted by Jonathan Ellis on 01/26/02 09:43 PM

Turn on verbose -- do you mean postmster -d debug_level? Any idea what level I'd want? All the docs say is "values up to 4 can make sense."

I've had the ns_pool error 5 times since doubling stack space to 128k. I've taken out WAY more cached tcl procs than I've added, and I use util_memoize sparingly, so I wouldn't think I'd need to increase it to much more than 128k. I take it there's not really any way to tell for sure, though. :/

Re patrick's suggestions -- I'll give one of the memory tester's a shot when I take down the server for a few minutes to up the pg debug level. I haven't played with the shared libraries, and since I rebooted the server less than 24h before the most recent backend crash I don't think that's the problem either. I also did a totally fresh load from pg_dump at that time, with no errors.

16: Response to Ns_Pool: invalid block (response to 1)

Posted by Dave Bauer on 01/26/02 10:13 PM

Check out the database pool section of the aolserver config file. Look for this:

    ns_param   Verbose            On
    ns_param   ExtendedTableInfo  On
    ns_param   LogSQLErrors       On

turning these one will give you the most information in the AOLserver error log about database errors.

17: Response to Ns_Pool: invalid block (response to 1)

Posted by Don Baccus on 01/26/02 10:29 PM

Hmmm...have you tried removing Jerry's nsvhr patches yet?

The fact that you're still getting ns_pool errors makes me more open to the "busted hardware" theory, too ...

18: Response to Ns_Pool: invalid block (response to 1)

Posted by Jonathan Ellis on 01/26/02 11:58 PM

I tried the memory tester at http://www.qcc.sk.ca/~charlesc/software/memtester/ but it didn't find any errors. I've got it running now without vhr and with the pools set to verbose; we'll see what happens...

19: Response to Ns_Pool: invalid block (response to 1)

Posted by Jonathan Ellis on 01/28/02 06:58 PM

We are getting somewhere now...

I took out my cron job to auto-kickstart the webserver when the db went down, which was stupid because it went down at 1:15 AM this morning and nsd never recovered. :( I get a bunch of messages that it's trying to reconnect but it never did until I whacked it 7 hours later. Maybe it's nsd and not the driver but it was definitely DOA. (Still running w/o nsvhr.) So that is the bad news.

The good news is that the last query executed before the backend went down crashes 7.1.3 100% of the time. I tried it on my trusty SS10, too, so this one's definitely not a HW issue. Selecting soundex(null) -- from when someone leaves a search form blank -- crashes it hard. Doh! :)

Also I noticed that the 50 or so ns_pool errors in my logs have all happened with the same 3 memory addresses, which may indicate hardware issues, despite what 'memtest' said. Or maybe not. Haven't heard anything useful from AOLserver list yet, but I'm on digest format so we'll see tonight.

20: Response to Ns_Pool: invalid block (response to 1)

Posted by Jonathan Ellis on 01/28/02 07:06 PM

... and btw the user must have been typing urls in manually because normally the form in question won't let you submit it without a reasonably valid name. No wonder the db only died every blue moon.

21: Response to Ns_Pool: invalid block (response to 1)

Posted by David Walker on 01/28/02 07:08 PM

Where did soundex come from? Is it a C function?

22: Response to Ns_Pool: invalid block (response to 1)

Posted by Jonathan Ellis on 01/28/02 07:11 PM

yeah it's in the contrib directory of 7.1.3

7.2 replaces it with the 'fuzzystrmatch' suite, also in contrib; dunno if that has the same bug

23: Response to Ns_Pool: invalid block (response to 1)

Posted by David Walker on 01/28/02 07:22 PM

Are you using ns_dbquotevalue? Is that why there is a NULL there?
Could you switch to DoubleApos or something like that?

24: Response to Ns_Pool: invalid block (response to 1)

Posted by Jonathan Ellis on 01/28/02 07:26 PM

I'm using the pseudo-bind-variables from the 4.x api. Fixing it was as simple as adding a special empty_string_p test at the top of the function.

25: Response to Ns_Pool: invalid block (response to 1)

Posted by Jonathan Ellis on 01/30/02 04:13 PM

Someone on the AOLServer list suggested running without -z and since I did 24h ago I haven't seen the Ns_Pool error. Looks like we have a winner!