Forum .LRN Q&A: Appreciate help with dotLRN performance

I have been running a course at the Indian Institute of Management, Bangalore, with ample discussion forum postings using dotLRN. The configuration is as follows:

  • AMD Opteron 1.4 GHz
  • 120 GB hard disk space
  • 1 GB high speed RAM
  • SuSE Linux Enterprise Edition
  • PostgreSQL database

After a month or so, I conducted a dotLRN-based survey. Since then, I have noticed a drastic slowdown in performance. Looking at the top processes on the server, it's postgres and nsd that seem to be having dotLRN for lunch. Could this be due to memory leaks? Any clues?

Shankar

Posted by Jun Yamog on
Have you tried to vacuum analyze postgres?
Posted by Shankar Venkatagiri on
Thanks for the pointer. I went ahead and vacuumed the database (vacuumdb -a -f -v). I didn't see any improvement even after this. More specifically, dotLRN drags when I try to load the Class Home page, which is a set of portlets. Any help?

Shankar

Hmm... this is a problem related to survey: calling some PL/pgSQL function in the WHERE clause keeps the query from using the indexes, so it basically doesn't scale. Dave Bauer fixed it in a project, but I'm not sure whether he finally committed it.
Any comments, Dave?
Posted by Dave Bauer on
Yes, the pl/sql functions in the where clause of the queries have been removed on HEAD/5.0
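
The effect described here, a function call in the WHERE clause keeping the planner from using an index, can be reproduced in miniature. The sketch below is only an analogy: it uses SQLite through Python's standard library rather than PostgreSQL, and the `responses` table, its index and the use of `abs()` are hypothetical stand-ins, not the actual survey schema or the functions Dave removed.

```python
import sqlite3


def plan_for(conn, sql):
    """Return the planner's human-readable plan lines for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # last column holds the description


conn = sqlite3.connect(":memory:")

# Hypothetical table and index, invented for this illustration.
conn.execute(
    "CREATE TABLE responses ("
    "response_id INTEGER PRIMARY KEY, survey_id INTEGER, body TEXT)"
)
conn.execute("CREATE INDEX responses_survey_idx ON responses (survey_id)")
conn.executemany(
    "INSERT INTO responses (survey_id, body) VALUES (?, ?)",
    [(i % 50, "x") for i in range(5000)],
)

# Comparing the bare column lets the planner search the index.
direct = plan_for(conn, "SELECT * FROM responses WHERE survey_id = 7")

# Wrapping the column in a function hides it from the index,
# so the planner falls back to scanning every row.
wrapped = plan_for(conn, "SELECT * FROM responses WHERE abs(survey_id) = 7")

print("direct: ", direct)
print("wrapped:", wrapped)
```

The first plan should be an index search and the second a full scan. On PostgreSQL the equivalent check would be running EXPLAIN on the slow survey query.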
Posted by Roberto Mello on
You forgot to analyze the database. That's the -z flag for vacuumdb. The -z flag is more important than the -f (full) flag for performance. I usually vacuum analyze my databases several times a day, but only vacuum full once a day, depending on DML operations of the database.

-Roberto
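
The schedule Roberto describes (analyze often, full vacuum rarely) could be scripted roughly as follows. This is a minimal sketch: the helper names `vacuum_command` and `run_vacuum` are hypothetical, and `run_vacuum` assumes `vacuumdb` is on the PATH and can connect to the server without prompting. The long options used are the spelled-out forms of `-a`, `-z` and `-f`.

```python
import subprocess


def vacuum_command(full=False):
    """Build a vacuumdb command line that always analyzes all databases."""
    command = ["vacuumdb", "--all", "--analyze"]  # same as: vacuumdb -a -z
    if full:
        command.append("--full")  # same as adding -f
    return command


def run_vacuum(full=False):
    """Run vacuumdb; assumes it is installed and the server is reachable."""
    return subprocess.run(vacuum_command(full), check=True)


# Several times a day: plain vacuum plus analyze.
routine = vacuum_command()
# Once a day, at a quiet hour: full vacuum plus analyze.
nightly = vacuum_command(full=True)

print(" ".join(routine))
print(" ".join(nightly))
```

Calling `run_vacuum()` from a scheduler such as cron would then cover the frequent case, with `run_vacuum(full=True)` reserved for off-peak hours, since a full vacuum takes exclusive locks on the tables it processes.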

Posted by Shankar Venkatagiri on
Thanks for the tip. I did go ahead and analyze the db. Not sure I understand all of the output, but will have someone here look at it.

What I am positive about is that when I load the Class Home (set of portlets) the nsd processes take up a huge chunk of memory running whatever script they run. Apologies for the ignorance here:

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME COMMAND
 1748 shikshan  16   0 62528  61M  2336 S     2.9  6.1   0:13 nsd
 1747 shikshan  15   0 62528  61M  2336 S     0.5  6.1   0:14 nsd

Also, the cache goes up significantly when this happens. Any help will be welcomed.

Shankar

Posted by Shankar Venkatagiri on
Hi Dave:

Can you please suggest an easy way to update this query? It should do us a world of good.

Shankar

Posted by Andrew Piskorski on
A resident set size of 61 MB for your AOLserver is a "huge chunk of memory"? I don't think so. In your top output above, note that nsd is only taking 6% of your memory. That's not large, that's trivially small.
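
The arithmetic behind that 6% figure is easy to check. A small sketch, assuming the "1 GB" from the first post means 1024 MB that is all visible to top (the helper name `percent_of_ram` is made up for illustration):

```python
def percent_of_ram(rss_mb, total_ram_mb):
    """Share of physical memory held by a process, in percent."""
    return 100.0 * rss_mb / total_ram_mb


rss_mb = 61          # RSS column from the top output in the thread
total_ram_mb = 1024  # assumption: the 1 GB of RAM from the first post

share = percent_of_ram(rss_mb, total_ram_mb)
print(f"{share:.1f}% of RAM")
```

That comes to roughly 6%, in line with the %MEM column; top's own 6.1 is presumably computed against slightly less than 1024 MB.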
Posted by Shankar Venkatagiri on
Thanks for the clarification. What I notice is that these processes don't "quit". Here's a sampler:

shikshan  1789  0.6  4.7 54276 48848 ?      S    11:36  0:12 [nsd]
shikshan  1790  0.0  4.7 54276 48848 ?      S    11:36  0:00 [nsd]
shikshan  1791  0.0  4.7 54276 48848 ?      S    11:36  0:00 [nsd]
shikshan  1792  0.0  4.7 54276 48848 ?      S    11:36  0:00 [nsd]
shikshan  1797  0.0  4.7 54276 48848 ?      S    11:36  0:01 [nsd]
shikshan  1798  0.0  4.7 54276 48848 ?      S    11:36  0:00 [nsd]
shikshan  1799  0.0  4.7 54276 48848 ?      S    11:36  0:01 [nsd]
shikshan  1800  0.0  4.7 54276 48848 ?      S    11:36  0:01 [nsd]
shikshan  1801  0.0  4.7 54276 48848 ?      S    11:36  0:00 [nsd]
shikshan  1802  0.0  4.7 54276 48848 ?      S    11:36  0:00 [nsd]

Any clues? Also, would using Apache instead of AOLserver improve my situation?

Thanks in advance -
Shankar

Posted by Jeff Davis on
AOLserver is multithreaded so what you are seeing is multiple threads in one process, not processes that fail to exit.

Using apache might improve your situation immensely but I doubt it will do so if you intend to run OpenACS.

Posted by Shankar Venkatagiri on

Thanks for the pointer, Jeff. I ran ps almost 10 minutes after my last interaction with dotLRN. The same processes linger on even now, five hours after I posted the previous message. Could this indicate processes that are not exiting?

I will go ahead and test dotLRN out with apache. I do not, however, seem to understand the distinction between OpenACS and dotLRN. My bad!

Shankar

Posted by Jeff Davis on
Neither dotLRN nor OpenACS will work under apache (well, you might be able to fight with mod_aolserver for a few months and get it to run acceptably, but I certainly would not recommend it). I don't really think of dotLRN as being separate from OpenACS (rather it is a particular install of OpenACS).

You also don't seem to understand the difference between a thread and a process. AOLserver is multithreaded: it creates threads within the server process to handle requests, and typically those threads do not go away until the server process exits. ps on linux has the annoying habit (not sure if you would call it a feature or a bug) of displaying threads as if they were processes. These don't really take any memory beyond what is already taken by the server. Note how they are all listed as being the same size and were all created at the same time; that's because it's all just the same server process.
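
The thread/process distinction can be seen in a few lines of Python. This is only a sketch of the general idea, not of how AOLserver manages its threads, and the worker count is arbitrary: several threads are started and each records the process ID it runs under.

```python
import os
import threading


def record_pid(results, index):
    """Store the process ID and thread name seen from inside a thread."""
    results[index] = (os.getpid(), threading.current_thread().name)


worker_count = 4  # arbitrary number of threads for the illustration
results = [None] * worker_count
threads = [
    threading.Thread(target=record_pid, args=(results, i), name=f"worker-{i}")
    for i in range(worker_count)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

# Every thread reports the same process ID as the main program.
pids = {pid for pid, _ in results}
for pid, name in results:
    print(pid, name)
print("distinct process IDs:", len(pids))
```

All workers report one process ID, because they live inside one process and share its memory. A listing that prints one line per thread therefore repeats the same size on every line rather than adding it up.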

Posted by Shankar Venkatagiri on
Thanks for the clarification. Looks like the problem is elsewhere.

Shankar

Posted by Roberto Mello on
Using the tree-view option of ps will help. It'll show the threads of the main process. Try "ps fax":
  922 ?        S      0:02 /usr/local/lib/aolserver/bin/nsd -u nsadmin -g www-data -t /usr/local/stow/aolserver_local/etc/aolserver/lbn.
  923 ?        S      0:01  \_ /usr/local/lib/aolserver/bin/nsd -u nsadmin -g www-data -t /usr/local/stow/aolserver_local/etc/aolserver/
  924 ?        S      0:00      \_ /usr/local/lib/aolserver/bin/nsd -u nsadmin -g www-data -t /usr/local/stow/aolserver_local/etc/aolser
  925 ?        S      0:11      \_ /usr/local/lib/aolserver/bin/nsd -u nsadmin -g www-data -t /usr/local/stow/aolserver_local/etc/aolser
  933 ?        S      0:06      \_ /usr/local/lib/aolserver/bin/nsd -u nsadmin -g www-data -t /usr/local/stow/aolserver_local/etc/aolser
 1311 ?        S      0:00      \_ /usr/local/lib/aolserver/bin/nsd -u nsadmin -g www-data -t /usr/local/stow/aolserver_local/etc/aolser
28309 ?        S      0:01      \_ /usr/local/lib/aolserver/bin/nsd -u nsadmin -g www-data -t /usr/local/stow/aolserver_local/etc/aolser
28311 ?        S      0:01      \_ /usr/local/lib/aolserver/bin/nsd -u nsadmin -g www-data -t /usr/local/stow/aolserver_local/etc/aolser...

-Roberto

Posted by Shankar Venkatagiri on
I found the sucker - it's the survey indeed. I deleted it after copying its contents, and I have sweet dotLRN back up and running wonderfully, no longer consuming 99% of the CPU cycles as it did earlier. Now this suggests some serious rethinking of the survey module's interaction with the rest of the database. The fault is definitely with the population of the DB.

Shankar

Posted by Andrew Piskorski on
So, just what query or page in the survey module do you say was causing the trouble? This might be quite easy to track down if you have the Developer Support package installed and turned on.

As it stands, you've basically said "something somewhere in the survey package sometimes takes a lot more CPU than I think it should", which is not especially useful as a bug report.