Forum OpenACS Q&A: Ns_Pool: invalid block
after one of two queries (that execute fine the rest of the time) but
occasionally with other queries. It looks like
nsthread(6221) error: Ns_Pool: invalid block: 0x8178d98
then aolserver crashes. I'm running AOLserver 3.3ad13 with Jerry's
vhr patches (version 6). Is this an error with aolserver itself or
the postgres driver? How should I try to fix it?
I thought at first it must be bad RAM, but I've swapped out half my RAM at a time and it's happened in every configuration.
At this point I'm just crossing my fingers hoping 7.2 fixes it... anyone have a cron script handy to check for wedged aolserver before I go ahead and write one in the meantime? :/
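For the cron check asked about above, here is a minimal sketch, not a tested production script. It assumes the server answers HTTP on a local port and that a restart command exists; CHECK_URL, RESTART_CMD and TIMEOUT_SECONDS are hypothetical values that would need to match the actual installation.

```python
import subprocess
import urllib.error
import urllib.request

# Hypothetical values; adjust for your own installation.
CHECK_URL = "http://localhost:8000/"
RESTART_CMD = ["/etc/init.d/aolserver", "restart"]
TIMEOUT_SECONDS = 30


def is_responsive(url, timeout=TIMEOUT_SECONDS):
    """Return True if the server answers without a 5xx error within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError as err:
        # The server answered; treat only server-side (5xx) errors as unhealthy.
        return err.code < 500
    except OSError:
        # Connection refused, unreachable host, or timeout: treat as wedged.
        return False


def check_and_restart(probe=is_responsive, restart=None):
    """Probe once and restart if the probe fails. Returns True if a restart ran."""
    if probe(CHECK_URL):
        return False
    if restart is None:
        # Default action: run the (hypothetical) restart command.
        subprocess.run(RESTART_CMD, check=False)
    else:
        restart()
    return True
```

To use it from cron, add a final line calling check_and_restart() and schedule the script every few minutes. Pointing CHECK_URL at a page that runs a trivial database query would also catch the case where nsd is up but has lost its database connections.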
would yield some background into what might cause the error.
I will try asking the aolserver list tho if their listserv is any more user-friendly than it was the last time I tried. :/
email address. Once I changed that I haven't had any more trouble. YMMV.
Do you have enough stack space allocated per thread?
Are you in a position where you can set up things without Jerry's patches? This would help isolate things. I ask because my AOLserver 3.3ad13+PG 7.1.2 site stays up literally for months at a time. I just added a 1/2 gig of RAM to it last week - uptime was 200 days. I'd restarted AOLserver a few times in the interim as I did site upgrades but other than that it's up all the time and I've never seen this problem you describe.
128K comes from aD's experiences I believe ... determined experimentally using the WAG algorithm, probably. But I've had problems when not specifying that much stackspace in the PG version, and have had them go away once I've remembered to allocate that much.
It will be interesting to see if this makes your problems disappear.
Here's what pg had to say in the log:
Jan 25 16:50:05 www logger: pq_recvbuf: unexpected EOF on client connection
Jan 25 16:50:05 www last message repeated 2 times
Jan 25 16:50:05 www logger: pq_flush: send() failed: Broken pipe
Jan 25 16:50:05 www logger: pq_recvbuf: unexpected EOF on client connection
Jan 25 16:50:05 www logger: pq_flush: send() failed: Broken pipe
Jan 25 16:50:05 www logger: pq_recvbuf: unexpected EOF on client connection
Jan 25 16:50:06 www logger: pq_flush: send() failed: Broken pipe
Jan 25 16:50:06 www logger: pq_recvbuf: unexpected EOF on client connection
Jan 25 16:50:06 www logger: pq_flush: send() failed: Broken pipe
Jan 25 16:50:06 www logger: pq_recvbuf: unexpected EOF on client connection
Jan 25 16:50:06 www logger: pq_flush: send() failed: Broken pipe
Jan 25 16:50:06 www logger: pq_recvbuf: unexpected EOF on client connection
Jan 25 16:50:11 www logger: Server process (pid 1476) exited with status 139 at Fri Jan 25 16:50:11 2002
Jan 25 16:50:11 www logger: Terminating any active server processes...
Jan 25 16:50:11 www logger: NOTICE: Message from PostgreSQL backend:
Jan 25 16:50:11 www logger: ^IThe Postmaster has informed me that some other backend^Idied abnormally and possibly corrupted shared memory.
Jan 25 16:50:11 www logger: ^II have rolled back the current transaction and am ^Igoing to terminate your database system connection and exit.
Regarding stacksize ... I think the 128*1024 actually comes from AOLserver folk, come to think of it, normally commented out in the default .tcl file?
Mark's post triggered my memory - aD used more like 512KB with the early ora8 incarnation for the reasons he mentions, IIRC.
The above you will see anytime a client drops the connection. Fire up
psql then kill it with a kill -11 or a kill -9 and you will get the
same error message.
Jan 25 16:50:11 www logger: Server process (pid 1476) exited with status 139
You should look up what status 139 means. I think it means "core dump". If it saved a core dump file, usually named either "core" or "postmaster.core", and usually in the directory where you started AOLserver or in the home directory of the postgres user, then you may be able to load it up under gdb and get a stack trace, which may or may not be useful.
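One way to look up what 139 means is to decode it with Python's os module. This sketch assumes the number in the log is a raw wait()-style status, which is a guess about how the postmaster reports it. If it were a shell-style exit code instead, 139 = 128 + 11 would point at signal 11 as well. The describe_wait_status helper is made up for illustration.

```python
import os
import signal


def describe_wait_status(status):
    """Decode a raw wait()-style status into a readable summary (Unix only)."""
    if os.WIFSIGNALED(status):
        sig = signal.Signals(os.WTERMSIG(status)).name
        core = " (core dumped)" if os.WCOREDUMP(status) else ""
        return f"killed by {sig}{core}"
    if os.WIFEXITED(status):
        return f"exited normally with code {os.WEXITSTATUS(status)}"
    return f"unrecognized status {status}"


print(describe_wait_status(139))
# On Linux this prints: killed by SIGSEGV (core dumped)
```

Under that assumption, 139 is signal 11 (a segmentation fault) with the core-dump bit set, which fits the "core dump" guess.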
You should run ipcclean to clean up the shared memory segments. I have found sometimes that it is a good idea to reboot to ensure that shared memory is truly cleaned up.
You don't mention which OS, but I am guessing Linux. Here are the things I would do:
1. run mprime in torture test mode - get it from www.mersenne.org. It heavily tests your RAM and CPU, and if it runs fine your RAM is ok. You can run mprime while everything else is running.
2. Have you upgraded any of the shared libraries, like glibc, that Postgres or AOLserver are dynamically linked to? If so, you should recompile, or set LD_LIBRARY_PATH to point to a directory that has the original libraries you compiled against.
3. Run "vacuum verbose analyze" a few times in succession. Then use pg_dump to dump the entire database and re-import it into a different database name. Does it import cleanly? If so, make a new nsd.tcl config file, changing only the name of the db to connect to. Stop AOLserver, then restart it with the new nsd.tcl so that it uses the new db. What happens then?
My guess: you either upgraded your libraries, or you need to upgrade your libraries. Sometimes different versions of the thread library will work better than other versions.
Note that he's not getting the invalid block error - I think that was due to inadequate stack space.
Of course, it's a good idea to run RAM and other system tests, just on general principle.
But I think he's going to find there's a PG-breaking query at the root of this new symptom he's seeing.
I've had the ns_pool error 5 times since doubling stack space to 128k. I've taken out WAY more cached tcl procs than I've added, and I use util_memoize sparingly, so I wouldn't think I'd need to increase it to much more than 128k. I take it there's not really any way to tell for sure, though. :/
Re Patrick's suggestions -- I'll give one of the memory testers a shot when I take down the server for a few minutes to up the pg debug level. I haven't played with the shared libraries, and since I rebooted the server less than 24h before the most recent backend crash I don't think that's the problem either. I also did a totally fresh load from pg_dump at that time, with no errors.
ns_param Verbose On
ns_param ExtendedTableInfo On
ns_param LogSQLErrors On
Turning these on will give you the most information in the AOLserver error log about database errors.
The fact that you're still getting ns_pool errors makes me more open to the "busted hardware" theory, too ...
I took out my cron job to auto-kickstart the webserver when the db went down, which was stupid because it went down at 1:15 AM this morning and nsd never recovered. :( I get a bunch of messages that it's trying to reconnect but it never did until I whacked it 7 hours later. Maybe it's nsd and not the driver but it was definitely DOA. (Still running w/o nsvhr.) So that is the bad news.
The good news is that the last query executed before the backend went down crashes 7.1.3 100% of the time. I tried it on my trusty SS10, too, so this one's definitely not a HW issue. Selecting soundex(null) -- from when someone leaves a search form blank -- crashes it hard. Doh! :)
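Until the backend is fixed or upgraded, one stopgap is to never send a blank search term to soundex() in the first place. This is a rough sketch in Python rather than the site's actual Tcl. The table and column names (people, name) and the soundex_search_sql helper are hypothetical, and the %s placeholder follows the Python DB-API "format" style.

```python
def soundex_search_sql(term):
    """Return (sql, params) for a soundex search, or None if the term is blank."""
    if term is None or not term.strip():
        # A blank form field would become soundex(null); skip the query entirely.
        return None
    sql = "select name from people where soundex(name) = soundex(%s)"
    return sql, (term.strip(),)


print(soundex_search_sql(""))
# prints None
```

The calling page would show a "please enter a search term" message whenever the function returns None.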
Also I noticed that the 50 or so ns_pool errors in my logs have all happened with the same 3 memory addresses, which may indicate hardware issues, despite what 'memtest' said. Or maybe not. Haven't heard anything useful from AOLserver list yet, but I'm on digest format so we'll see tonight.
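To check whether the invalid-block addresses really do repeat, a short script can tally them from the server log. This is a sketch that assumes every error line looks like the one quoted at the top of the thread. The sample list is made up for illustration by repeating that one line.

```python
import re
from collections import Counter

# Matches the address in lines like:
# nsthread(6221) error: Ns_Pool: invalid block: 0x8178d98
INVALID_BLOCK = re.compile(r"Ns_Pool: invalid block: (0x[0-9a-fA-F]+)")


def count_invalid_blocks(lines):
    """Count how often each address appears in Ns_Pool invalid block errors."""
    counts = Counter()
    for line in lines:
        match = INVALID_BLOCK.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


# Illustrative sample only: the thread's error line repeated, plus noise.
sample = [
    "nsthread(6221) error: Ns_Pool: invalid block: 0x8178d98",
    "nsthread(6221) error: Ns_Pool: invalid block: 0x8178d98",
    "some unrelated log line",
]
print(count_invalid_blocks(sample).most_common())
# prints [('0x8178d98', 2)]
```

Passing an open log file instead of the sample list works the same way, since the function only iterates over lines.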
7.2 replaces it with the 'fuzzystrmatch' suite, also in contrib; dunno if that has the same bug
Could you switch to DoubleApos or something like that?