Forum .LRN Q&A: How come my Postgres server processes are multiplying like rabbits on Easter?

My local Postgres wizard doesn't like the looks of this either.  I've got my old dotlrn1/service0 running along by itself, plus three new dotLRN 2 instances (service80, service3400, and service5400) running as the nobody user under virtual servers on AOLserver 4.  If I fire up a fifth instance, Postgres gives up (I've only got 512 MB of RAM), but four seem to run OK.  Now I'm wondering about all these processes: is it normal for each server to spawn four or more Postgres processes, and to do so before anybody has even asked for a page?  Please don't ask who killed my sendmail over the holiday; I don't know, and it's not a pretty sight.

I. M. Curious.

service0  2889  1.9  9.2 55100 47652 ?      S    22:45  0:25 [nsd]
postgres  2893  0.0  0.4 69656 2068 ?        S    22:45  0:00 postgres: stats buffer process
postgres  2894  0.0  0.4 68720 2120 ?        S    22:45  0:00 postgres: stats collector process
postgres  2897  0.0  2.7 70540 13896 ?      S    22:45  0:00 postgres: service0 service0 [local] idle
postgres  2898  0.1  0.9 69704 4904 ?        S    22:45  0:01 postgres: service0 service0 [local] idle
postgres  2899  0.0  0.6 69132 3256 ?        S    22:45  0:00 postgres: service0 service0 [local] idle
root      2906  0.0  0.4  6788 2076 ?        S    22:46  0:00 /usr/sbin/sshd
root      2908  0.0  0.2  4384 1488 pts/0    S    22:46  0:00 -bash
root      2979  0.0  0.0  1360  280 ?        S    22:50  0:00 supervise nobody
root      3072  0.0  0.1  2100  956 ?        S    22:52  0:00 /bin/sh ./run
nobody    3073  9.8 35.5 227192 182892 ?    S    22:52  1:30 [nsd]
postgres  3076  0.0  1.9 70728 10032 ?      S    22:52  0:00 postgres: service80 service80 [local] idle
postgres  3077  0.2  0.9 69704 5008 ?        S    22:52  0:02 postgres: service80 service80 [local] idle
postgres  3078  0.0  0.6 69132 3288 ?        S    22:52  0:00 postgres: service80 service80 [local] idle
postgres  3091  0.0  1.9 70700 10088 ?      S    22:53  0:00 postgres: service5400 service5400 [local] idle
postgres  3092  0.2  0.9 69704 5040 ?        S    22:53  0:02 postgres: service5400 service5400 [local] idle
postgres  3093  0.0  0.6 69132 3272 ?        S    22:53  0:00 postgres: service5400 service5400 [local] idle
postgres  3101  0.0  2.3 70700 12192 ?      S    22:54  0:00 postgres: service3400 service3400 [local] idle
postgres  3102  0.2  0.9 69704 5048 ?        S    22:54  0:02 postgres: service3400 service3400 [local] idle
postgres  3103  0.0  0.6 69132 3284 ?        S    22:54  0:00 postgres: service3400 service3400 [local] idle
postgres  3136  0.0  1.2 69632 6340 ?        S    22:55  0:00 postgres: service80 service80 [local] idle
postgres  3137  0.0  1.1 69784 5800 ?        S    22:55  0:00 postgres: service80 service80 [local] idle
postgres  3150  0.0  1.0 69608 5344 ?        S    22:55  0:00 postgres: service3400 service3400 [local] idle
postgres  3151  0.0  1.0 69632 5172 ?        S    22:56  0:00 postgres: service5400 service5400 [local] idle
root      3186  0.0  0.0    0    0 ?        Z    23:00  0:00 [sendmail <defunct>]
postgres  3226  0.0  1.3 70240 6988 ?        S    23:00  0:00 postgres: service0 service0 [local] idle

Bruce, a lot of database queries get executed when you start a dotLRN or OpenACS instance, so it's normal for at least three db connections to be opened per instance.

If you look at your nsd.tcl file, you will see three separate db pools.  Each of them is used right away, so at least one connection gets opened for each pool.
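For reference, a pool section in nsd.tcl looks roughly like the sketch below (pool2 and pool3 follow the same pattern; the datasource, user, and password are placeholders for whatever your install actually uses):

ns_section "ns/db/pool/pool1"
# maximum number of backend connections this pool may hold open
ns_param connections 5
ns_param driver postgres
# host:port:database for the nspostgres driver
ns_param datasource "localhost::service0"
ns_param user service0
ns_param password ""

ns_section "ns/server/service0/db"
# the pools this server is allowed to draw from
ns_param pools "pool1,pool2,pool3"
ns_param defaultpool pool1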

If Postgres is "giving up" well, I don't know what you mean by that.  Perhaps you need to add more swap to your system, or maybe you should change the limit on the number of PG processes in your postgresql.conf file.
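(For the record, the knob is max_connections in postgresql.conf, e.g.

max_connections = 128

and since it sizes shared memory at startup, the postmaster has to be restarted for a change to take effect.)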

Thanks for the helpful comment, Patrick.  I haven't looked at database processes closely since taking an Oracle course a year ago, but now I want to dig in.  The only changes I've made to postgresql.conf are those recommended by the OpenACS documentation, including raising the kernel shared memory limit to kernel.shmmax = 134217728.  Where the docs suggest shared_buffers = 15200, I got error messages at anything above 7600, so I left it at 7600.  I also set sort_mem = 32168.

When I added a fifth instance, the flow of error messages while OpenACS started up slowed noticeably and then halted.  Since that instance can wait until I buy more memory, I've limited myself to four.

I suspect the right question for this forum is how best to benchmark my setup: how people go about checking Postgres, and what values to look for.  I will read up on Postgres application design and study the docs, but what I haven't found yet is a howto aimed at dotLRN/OpenACS newbies, those of us who aren't yet "real" programmers, that points out the major performance indicators and tells us how deep we need to go, so we can study that and then move on to the zillion other things we need to do.  Have you come across such a thing, or might you volunteer to sketch what it would include?  MANY thanks!
Bruce
Each pool in your configuration .tcl file specifies the number of connections allowed for that pool.  You should have three pools for an OpenACS 4 installation.  I think we set the default to five connections per pool, so that would be fifteen connections per OpenACS instance, or sixty for four instances.

PG defaults to a maximum of 100 open connections; you can change that in the configuration file in your PG data directory (/usr/local/pgsql/data by default, I think, if you install from source) to allow fewer or more.  You can also limit your OpenACS pools to one or two connections each ... that just causes requests to stall while waiting for a free connection.

You're not really seeing that many connections, though ... maybe you should cut shared_buffers down from 7600 to leave enough room for PG to open more backends if you want to run four instances.
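Back of the envelope, assuming the stock 8 KB PostgreSQL block size:

7600 shared_buffers x 8 KB = roughly 60 MB of shared memory

and each of those idle backends in your ps listing holds a few MB of its own on top of that, so on a 512 MB box the database already has a serious slice spoken for before the AOLserver processes get theirs.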

Or use AOLserver 4's virtual server capability to run all five out of one nsd instance.  You can share more resources that way (at the cost of having all the virtual servers shut down and restart together) than by running a separate AOLserver process for each site.  The latter is more convenient, the former more frugal with resources.
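If memory serves on the 4.0 config layout (I'm going from memory, so double-check the AOLserver docs, and the names below are made up), the one-process setup just lists every server in a single config file and gives each its own parameter tree:

ns_section "ns/servers"
# one entry per virtual server: name plus a human-readable description
ns_param service80   "dotLRN instance 80"
ns_param service3400 "dotLRN instance 3400"
ns_param service5400 "dotLRN instance 5400"

# then a full parameter tree per server, for example:
ns_section "ns/server/service80"
# ... per-server parameters, db pools, packages, and so on

The communications driver then maps the incoming Host: header onto one of those server names, which is what lets them share a single process and port.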

Many thanks for the advice, Don!  I am indeed running virtual servers on AOLserver 4 for my four dotLRN 2.0 instances, and with the following parameters uncommented in postgresql.conf I got that fourth instance to fire up nicely.  I guess it was the shared_buffers value, down from 7600, that did it.

max_connections = 128
shared_buffers = 256
sort_mem = 32168
checkpoint_segments = 3
fsync = true

Thanks again!

You should be able to bump up the shared buffer number somewhat, though if these are just development instances it shouldn't matter.

But for a production environment you want a larger shared buffer space.
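If I'm reading your earlier numbers right, the wall you hit above 7600 was probably the kernel rather than Postgres: the whole shared memory segment has to fit under kernel.shmmax, and at the stock 8 KB block size

15200 x 8 KB = about 119 MB

which, plus per-connection and free-space-map overhead, can bump into your 134217728 (128 MB) shmmax.  Raising shmmax costs nothing by itself (it's a limit, not an allocation), so raise that first and then grow shared_buffers to a sensible slice of however much RAM the box ends up with.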

Hi!

Many thanks for the tip!  These are production servers that will have about 1,000 users next semester, so I'll follow your advice.  Reading various PostgreSQL docs has given me the impression that the system is pretty robust and doesn't need much care and feeding beyond these few parameters -- or am I missing something?

Now here's a question: how do I test this?  Do we have a standard dotLRN/OpenACS performance testing setup?  I have read with much interest the advice from Janine and others about the Heidelberg server: should I try to follow her steps?  Has someone written this up?  Or might you briefly outline what I should set up to record a current baseline and then monitor on a regular basis, or should I be a big boy and start googling on my own?!  :)
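For instance, I imagine even something crude would give me a first baseline: ApacheBench against a course home page while I count backends and keep an eye on top (host and database names here stand in for my own):

ab -n 500 -c 10 http://my-dotlrn-host/dotlrn/
psql -d service80 -c "select count(*) from pg_stat_activity;"

But I don't know whether that's what the grown-ups actually do.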

All the best,

Bruce

Surely you can scrounge another half gig of RAM for these 1,000 users, no?  My laptop has as much RAM as your production server ... that's sick!

Your system shouldn't require a lot of care and feeding, no, once it's set up properly.  Just make sure you're running scripts based on our documentation examples to vacuum, analyze, and back up your database regularly.
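The documented scripts do more than this (error checking, rotating old dumps, and so on), but the nightly crontab entries boil down to roughly the following, run as the postgres user, with paths and the database name standing in for your own:

# vacuum and analyze every database, then dump one of the service databases
0 1 * * * /usr/local/pgsql/bin/vacuumdb --all --analyze
0 2 * * * /usr/local/pgsql/bin/pg_dump service0 | gzip > /backup/service0-`date +\%Y\%m\%d`.dmp.gz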

Ignore the Heidelberg discussion; their box is a sick puppy, and whatever the cause turns out to be will probably be Solaris and/or Oracle specific, not terribly relevant to Linux/PG .LRN users.