Forum OpenACS Q&A: Performance-Problems. Help needed!

Collapse
Posted by Nima Mazloumi on
Dear all,

we have problems with our aolserver installation and I was wondering if someone could give us some insight.
Please find the details below.

Any help or comment appreciated!

Thank you very much,
Nima Mazloumi

Fujitsu Siemens
4 GB Ram
2 Broadcom Gigabit Ethernet Driver bcm5700 with Broadcom NIC Extension (NICE) ver. 7.3.5
4 Intel(R) Xeon(TM) CPU 3.06GHz, L2 cache: 512K
200 GB Local disk and connection to 4 TB storage system

SuSE Linux 9.2
libhd version 9.31 (ia32)
Linux version 2.6.8-24.11-smp
AOLServer Version 4.0.10
PostgreSQL 7.4.8 (different box, same hardware specification)
OpenACS 5.1.4/dotLRN 2.1.1

We use pound as reverse proxy for SSL since we had problems with the SSL module of the AOLServer.

Here details from the config.tcl:
ns_param maxconnections 100
ns_param keepalivetimeout 15
ns_param maxdropped 0
ns_param maxthreads 20
ns_param minthreads 20
ns_param threadtimeout 3600

I have around 115 concurrent users and all CPUs are at 95%.

A single page to return takes around 11sec, usually its 400ms and less.

top says:
top - 15:54:08 up 25 days, 31 min, 2 users, load average: 10.68, 12.49, 12.18
Tasks: 116 total, 2 running, 114 sleeping, 0 stopped, 0 zombie
Cpu0 : 91.1% us, 2.3% sy, 0.0% ni, 1.3% id, 3.3% wa, 0.3% hi, 1.7% si
Cpu1 : 77.6% us, 1.7% sy, 0.0% ni, 8.6% id, 12.2% wa, 0.0% hi, 0.0% si
Cpu2 : 94.1% us, 2.0% sy, 0.0% ni, 2.0% id, 2.0% wa, 0.0% hi, 0.0% si
Cpu3 : 80.1% us, 1.3% sy, 0.0% ni, 2.3% id, 16.2% wa, 0.0% hi, 0.0% si
Mem: 4073116k total, 3964908k used, 108208k free, 118244k buffers
Swap: 3911816k total, 60712k used, 3851104k free, 2895056k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
root 16 0 1452 620 1276 S 0.3 0.0 6:00.33 syslogd
root 16 0 596 80 452 S 0.0 0.0 0:04.25 init
root RT 0 0 0 0 S 0.0 0.0 0:00.51 migration/0
root 34 19 0 0 0 S 0.0 0.0 0:00.31 ksoftirqd/0
root RT 0 0 0 0 S 0.0 0.0 0:00.87 migration/1
root 34 19 0 0 0 S 0.0 0.0 0:00.02 ksoftirqd/1

The nsd is somewhere at the bottom:
unima2 24 0 769m 706m 7204 S 0.0 17.8 0:36.82 0:36 nsd

vmstat:
20 0 63320 104472 305588 2555188 2 1 2 3 4 4 7 1 91 1
20 0 63320 99456 305588 2555188 0 0 0 85 1324 2839 96 2 2 0

When I shutdown the nsd the cpus have no load.

There are several waiting network connections at localhost:ndmp:
netstat|grep ndmp|grep WAIT |wc
340 2040 27540

PostgreSQL is on a different box. The cpus there have only a load of 10%
tcpip_socket = true
max_connections = 100
#superuser_reserved_connections = 2
port = 5432
#unix_socket_directory = ''
#unix_socket_group = ''
#unix_socket_permissions = 0777 # octal
#virtual_host = '' # what interface to listen on; defaults to any
#rendezvous_name = '' # defaults to the computer name
#authentication_timeout = 60 # 1-600, in seconds
#ssl = false
#password_encryption = true
#krb_server_keyfile = ''
#db_user_namespace = false
shared_buffers = 1000 # min 16, at least max_connections*2, 8KB each
#sort_mem = 1024 # min 64, size in KB
#vacuum_mem = 8192 # min 1024, size in KB
#max_fsm_pages = 20000 # min max_fsm_relations*16, 6 bytes each
#max_fsm_relations = 1000 # min 100, ~50 bytes each
#max_files_per_process = 1000 # min 25
#preload_libraries = ''

PS AUX on webserver box:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 596 180 ? S Oct19 0:01 init [3]
root 2 0.0 0.0 0 0 ? S Oct19 0:33 [migration/0]
root 3 0.0 0.0 0 0 ? SN Oct19 0:00 [ksoftirqd/0]
root 4 0.0 0.0 0 0 ? S Oct19 0:00 [migration/1]
root 5 0.0 0.0 0 0 ? SN Oct19 0:00 [ksoftirqd/1]
root 6 0.0 0.0 0 0 ? S Oct19 0:00 [migration/2]
root 7 0.0 0.0 0 0 ? SN Oct19 0:00 [ksoftirqd/2]
root 8 0.0 0.0 0 0 ? S Oct19 0:00 [migration/3]
root 9 0.0 0.0 0 0 ? SN Oct19 0:00 [ksoftirqd/3]
root 10 0.0 0.0 0 0 ? S< Oct19 0:00 [events/0]
root 11 0.0 0.0 0 0 ? S< Oct19 0:00 [events/1]
root 12 0.0 0.0 0 0 ? S< Oct19 0:00 [events/2]
root 13 0.0 0.0 0 0 ? S< Oct19 0:00 [events/3]
root 14 0.0 0.0 0 0 ? S< Oct19 0:00 [khelper]
root 15 0.0 0.0 0 0 ? S< Oct19 0:00 [netlink/0]
root 16 0.0 0.0 0 0 ? S< Oct19 0:00 [netlink/1]
root 17 0.0 0.0 0 0 ? S< Oct19 0:00 [netlink/2]
root 18 0.0 0.0 0 0 ? S< Oct19 0:00 [netlink/3]
root 19 0.0 0.0 0 0 ? S< Oct19 0:00 [kacpid]
root 44 0.0 0.0 0 0 ? S< Oct19 0:02 [kblockd/0]
root 45 0.0 0.0 0 0 ? S< Oct19 0:00 [kblockd/1]
root 46 0.0 0.0 0 0 ? S< Oct19 0:00 [kblockd/2]
root 47 0.0 0.0 0 0 ? S< Oct19 0:00 [kblockd/3]
root 58 0.0 0.0 0 0 ? S Oct19 0:08 [pdflush]
root 57 0.0 0.0 0 0 ? S Oct19 0:00 [kirqd]
root 59 0.0 0.0 0 0 ? S Oct19 0:14 [pdflush]
root 61 0.0 0.0 0 0 ? S< Oct19 0:00 [aio/0]
root 60 0.0 0.0 0 0 ? S Oct19 0:06 [kswapd0]
root 62 0.0 0.0 0 0 ? S< Oct19 0:00 [aio/1]
root 63 0.0 0.0 0 0 ? S< Oct19 0:00 [aio/2]
root 64 0.0 0.0 0 0 ? S< Oct19 0:00 [aio/3]
root 647 0.0 0.0 0 0 ? S Oct19 0:00 [kseriod]
root 1417 0.0 0.0 0 0 ? S Oct19 0:00 [scsi_eh_0]
root 1418 0.0 0.0 0 0 ? S Oct19 0:00 [ahd_dv_0]
root 1794 0.0 0.0 0 0 ? S Oct19 0:00 [scsi_eh_1]
root 2064 0.0 0.0 0 0 ? S Oct19 0:00 [scsi_eh_2]
root 2339 0.0 0.0 0 0 ? S Oct19 0:00 [md1_raid1]
root 2340 0.0 0.0 0 0 ? S Oct19 0:00 [md0_raid1]
root 2344 0.0 0.0 0 0 ? S< Oct19 0:03 [reiserfs/0]
root 2345 0.0 0.0 0 0 ? S< Oct19 0:00 [reiserfs/1]
root 2346 0.0 0.0 0 0 ? S< Oct19 0:00 [reiserfs/2]
root 2347 0.0 0.0 0 0 ? S< Oct19 0:01 [reiserfs/3]
root 2916 0.0 0.0 0 0 ? S Oct19 0:00 [khubd]
root 2943 0.0 0.0 1368 272 ? S s Oct19 0:00 udevd
root 3306 0.0 0.0 1436 332 ? S Oct19 0:00 [hwscand]
root 3307 0.0 0.0 1436 328 ? S Oct19 0:00 [hwscand]
root 4080 0.0 0.0 1452 656 ? Ss Oct19 0:11 /sbin/syslogd -a /var/lib/ntp/dev/log
root 4098 0.0 0.0 1516 576 ? Ss Oct19 0:00 /sbin/klogd -c 1 -2 -x
nobody 4116 0.0 0.0 1416 396 ? Ss Oct19 0:00 /sbin/portmap
root 4117 0.0 0.0 2032 676 ? Ss Oct19 0:00 /sbin/resmgrd
root 4199 0.0 0.0 2072 700 ? Ss Oct19 0:00 /usr/sbin/xinetd
nobody 4202 0.0 0.0 1824 668 ? Ss Oct19 0:18 /usr/sbin/scanlogd
root 4263 0.0 0.0 5148 1836 ? Ss Oct19 0:01 /usr/sbin/sshd -o PidFile=/var/run/sshd.init.pid
postgres 4302 0.0 0.0 15016 2072 ? S Oct19 0:00 /usr/bin/postmaster -D /var/lib/pgsql/data
postgres 4308 0.0 0.0 9704 1928 ? S Oct19 0:00 postgres: stats buffer process
postgres 4309 0.0 0.0 8712 1932 ? S Oct19 0:00 postgres: stats collector process
ntp 4325 0.0 0.0 2740 2736 ? SLs Oct19 0:00 /usr/sbin/ntpd -p /var/lib/ntp/var/run/ntp/ntpd.pid -u ntp -i /var/lib/ntp
root 4367 0.0 0.0 1388 472 ? Ss Oct19 0:00 /usr/sbin/acpid -c /etc/acpi/events.ignore
root 4416 0.0 0.0 2508 1068 ? S Oct19 0:00 /usr/sbin/powersaved -d -x /usr/lib/powersave/scripts -a resmgr -v 3
at 4486 0.0 0.0 1632 608 ? Ss Oct19 0:00 /usr/sbin/atd
root 4495 0.0 0.0 43068 804 ? Ssl Oct19 0:00 /usr/sbin/nscd
root 4507 0.0 0.0 0 0 ? S Oct19 0:00 [afs_rxlistener]
root 4509 0.0 0.0 0 0 ? S Oct19 0:00 [afs_callback]
root 4511 0.0 0.0 0 0 ? S Oct19 0:00 [afs_rxevent]
root 4513 0.0 0.0 0 0 ? S Oct19 0:00 [afsd]
root 4515 0.0 0.0 0 0 ? S Oct19 0:00 [afs_checkserver]
root 4518 0.0 0.0 0 0 ? S Oct19 0:00 [afs_background]
root 4519 0.0 0.0 0 0 ? S Oct19 0:00 [afs_background]
root 4521 0.0 0.0 0 0 ? S Oct19 0:00 [afs_background]
root 4523 0.0 0.0 0 0 ? S Oct19 0:00 [afs_cachetrim]
root 4590 0.0 0.0 4572 1356 ? Ss Oct19 0:02 /usr/lib/postfix/master
postfix 4603 0.0 0.0 4832 1572 ? S Oct19 0:03 qmgr -l -t fifo -u
root 4842 0.0 0.0 1664 612 ? Ss Oct19 0:00 /usr/sbin/cron
root 4856 0.0 0.0 1600 464 tty2 Ss+ Oct19 0:00 /sbin/mingetty tty2
root 4857 0.0 0.0 1600 464 tty3 Ss+ Oct19 0:00 /sbin/mingetty tty3
root 4858 0.0 0.0 1600 464 tty4 Ss+ Oct19 0:00 /sbin/mingetty tty4
root 4859 0.0 0.0 1600 464 tty5 Ss+ Oct19 0:00 /sbin/mingetty tty5
root 4860 0.0 0.0 1600 464 tty6 Ss+ Oct19 0:00 /sbin/mingetty tty6
root 5379 0.0 0.0 1432 280 ? S< Oct19 0:00 [hwscand]
root 5442 0.0 0.0 10276 3128 ? Ss Oct19 0:00 /usr/local/apache2/bin/httpd -f /usr/local/apache2/conf/httpd.conf -k start
root 5456 0.0 0.0 3624 1336 ? S Oct19 0:00 /usr/local/sbin/pound
root 5457 0.0 0.1 18100 4928 ? Sl Oct19 0:11 /usr/local/sbin/pound
unima2 5492 0.0 13.5 868504 550916 ? Ssl Oct19 0:55 /opt/aolserver4/bin/nsd -u unima2 -t /www/unima2/etc/config.tcl
root 5753 0.0 0.0 1600 464 tty1 Ss+ Oct19 0:00 /sbin/mingetty --noclear tty1
wwwrun 8073 0.0 0.0 10568 3928 ? S Oct19 0:00 /usr/local/apache2/bin/httpd -f /usr/local/apache2/conf/httpd.conf -k start
root 11178 0.0 0.0 1672 612 ? S Oct19 0:00 /usr/sbin/cron
root 11179 0.0 0.0 2364 1060 ? Ss Oct19 0:00 /bin/sh -c /home/unima2/bin/unima2-analyze.sh
root 11182 0.0 1.1 48508 44848 ? S Oct19 0:42 /usr/share/analog-5.32/analog -G -g/www/unima2/etc/analog.cfg
root 16364 0.0 0.0 4584 1156 ? S 00:55 0:00 /usr/sbin/sendmail -FCronDaemon -odi -oem -oi -t
root 16365 0.0 0.0 4568 1292 ? S 00:55 0:00 /usr/sbin/postdrop -r
root 6458 0.0 0.0 8540 2448 ? Ss 10:02 0:00 sshd: kredel [priv]
kredel 6464 0.0 0.0 8580 2596 ? S 10:02 0:00 sshd: kredel@pts/1
kredel 6465 0.0 0.0 4136 1768 pts/1 Ss 10:02 0:00 -bash
wwwrun 18759 0.0 0.1 10568 4088 ? S 10:45 0:00 /usr/local/apache2/bin/httpd -f /usr/local/apache2/conf/httpd.conf -k start
root 22285 0.0 0.0 2592 1128 pts/1 S 10:57 0:00 su -
root 22305 0.0 0.0 2848 1700 pts/1 S+ 10:57 0:00 -bash
wwwrun 24782 0.0 0.1 10576 4092 ? S 11:06 0:00 /usr/local/apache2/bin/httpd -f /usr/local/apache2/conf/httpd.conf -k start
root 1982 0.0 0.0 9684 3028 pts/1 Sl 11:33 0:00 /usr/bin/dsmc schedule
wwwrun 3642 0.0 0.0 10552 4068 ? S 13:34 0:00 /usr/local/apache2/bin/httpd -f /usr/local/apache2/conf/httpd.conf -k start
wwwrun 3719 0.0 0.0 10560 4000 ? S 13:35 0:00 /usr/local/apache2/bin/httpd -f /usr/local/apache2/conf/httpd.conf -k start
wwwrun 12425 0.0 0.0 10572 4064 ? S 14:06 0:00 /usr/local/apache2/bin/httpd -f /usr/local/apache2/conf/httpd.conf -k start
wwwrun 16396 0.0 0.0 10564 3968 ? S 14:22 0:00 /usr/local/apache2/bin/httpd -f /usr/local/apache2/conf/httpd.conf -k start
wwwrun 29376 0.0 0.0 10572 3972 ? S 15:06 0:00 /usr/local/apache2/bin/httpd -f /usr/local/apache2/conf/httpd.conf -k start
wwwrun 31241 0.0 0.0 10348 3416 ? S 15:12 0:00 /usr/local/apache2/bin/httpd -f /usr/local/apache2/conf/httpd.conf -k start
root 6283 0.0 0.0 8684 2524 ? Ss 15:46 0:00 sshd: root@pts/0
root 6308 0.0 0.0 2856 1744 pts/0 Ss 15:46 0:00 -bash
postfix 8938 0.0 0.0 4688 1520 ? S 15:47 0:00 pickup -l -t fifo -u
wwwrun 12585 0.0 0.0 10496 3772 ? S 15:55 0:00 /usr/local/apache2/bin/httpd -f /usr/local/apache2/conf/httpd.conf -k start
postfix 18010 0.0 0.0 5068 1940 ? S 16:13 0:00 cleanup -z -t unix -u
postfix 18011 0.0 0.0 5032 1864 ? S 16:13 0:00 trivial-rewrite -n rewrite -t unix -u
postfix 18012 0.0 0.0 4792 1660 ? S 16:13 0:00 smtp -t unix -u
root 18264 0.0 0.0 2376 716 pts/0 R+ 16:14 0:00 ps aux

Collapse
Posted by Malte Sussdorff on
Did you add any packages? Did you make an upgrade of OpenACS? Are you surfing your site with the developer support bar turned on with database support (to see if the problem lies in the database of with the AOLserver)? Does the apache have to run on this machine? What partitions are growing (run "df" every minute and compare the results), which will help you narrow down the problem further.

If you have not done this, setup multiple partitions or even better an LVM to detect where most of the disc activity occurs. Keep the system Logfiles away from the AOLserver binaries away from the actual web directory that hosts OpenACS.

My approach would be to look for errors either within OpenACS (which you can detect if you installed any new packages or made an upgrade just before the performance problems hit or if the developer support bar shows some *really* slow queries in e.g. the "My Workspace" page).

If no errors can be found there, try to narrow it down in the system, maybe something is wrong there as well. Alternatively try to use the latest AOLserver from CVS to see if the problem goes away.

Collapse
Posted by Nima Mazloumi on
- no new packages
- no upgrade took place. the last was in january and in march I had around 250 concurrent users (twice as now) with equal performance
- turning ds-bar off doesnt improve performance
- database support of ds is turned off
- the apache does nothing but redirect from port 80 to 443
- except for some logging there is not much disk activity
Collapse
Posted by Malte Sussdorff on
Regarding DS I actually meant to keep it turned on to see if there are any expensive queries, but it seems unlikely this is the root of the problem if you did not change any code.

What exactly did you do between March and now on the server?

Collapse
Posted by Tak-Hung Yu on
This may not be related to your problem but shared_buffers = 1000 seems a bit low....


http://www.postgresql.org/docs/7.4/interactive/runtime-config.html