Forum .LRN Q&A: Help Needed in Setting up .LRN to Scale

1: Help Needed in Setting up .LRN to Scale

Posted by Peter Marklund on 04/01/04 04:25 PM

We are currently helping Heidelberg University with some urgent scalability issues that they are facing with their .LRN installation. They have 30,000+ users and run on a big Solaris server with Oracle.

With this post I wanted to solicit any scaling experience that might have been gathered from other big installations of .LRN. We are of course looking at tuning the OpenACS datamodel and Tcl code if necessary, but we're also hoping that there is room for improvement in the OS/Oracle/AOLserver configuration. The Heidelberg setup and configuration currently more or less reflects the OpenACS documentation.

Thanks in advance!

2: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Alfred Essa on 04/01/04 06:54 PM

Peter, Count on us for help.

3: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by mark dalrymple on 04/01/04 07:16 PM

I haven't used .LRN, so I can't comment on configuration specifics, but I'd first try some OS utilities to see where the bottleneck is, so you can narrow down the focus of work.

Does 'top' on the webserver machines show aolserver chewing a bunch of CPU? If so, looking to cache things could be a win.

Are the webserver machines i/o bound? If so, they might be swapping because too much is being cached.

Similarly, on the db machine, see if it is being CPU or i/o bound. If both sets of machines seem to be relatively idle, then there may be locking issues. Poking around the oracle data dictionary can show some of that stuff.

4: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Dirk Gomez on 04/02/04 09:47 AM

Set up Oracle Statspack to *always* monitor what is going on in the database.

You'll find the docs for this in $ORACLE_HOME/rdbms/admin/spdoc.txt

Post the init.ora and the machine's parameters.

5: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Cato Kolås on 04/02/04 09:59 AM

Peter,

maybe you'll find something helpful in this thread:
https://openacs.org/forums/message-view?message_id=156292

Cato

6: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Malte Sussdorff on 04/02/04 10:14 AM

Maybe making a fool out of myself if Heidelberg is already using a two server setup: It is not a good idea to have AOLserver and the database run on the same machine for performance reasons. I can't give you a technical *explanation* just an experience *observation*.

My economics of the situation: Getting a linux box running AOLserver should be around 1000 EUR plus 5 hours of work for setup and give you immediate gratification. Trying to tune OpenACS is most likely more costly *and* time consuming. The additional costs for system administration should be minimal but shouldn't be neglected, I agree.

7: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Joel Aufrecht on 04/02/04 01:53 PM

I'm pasting in some emails so that everything is online:

Here's one thing that makes a big difference. We're regularly analyzing our
tables using the acs-monitoring package. Also, just after doing a new import, I
always analyze the entire schema using

SQL> exec dbms_stats.gather_schema_stats('DBUSER',cascade => true);

Cheers,
Andrew

8: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Peter Alberer on 04/02/04 04:42 PM

We have had good experiences with the following setup:

One server running a reverse proxy (we are using pound)
One server with an aolserver instance (no openacs) serving "static images" (served from the file system not the content rep)
One server with the openacs installation
One db server (pg)
The proxy uses the url to divide the requests between the image server and the openacs server
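The URL-based split described above can be sketched roughly as follows. This is only an illustration of the routing decision, not pound's actual configuration; the path prefixes and backend addresses are hypothetical, since the post does not say which URLs are sent to the image server.

```python
# Hypothetical URL prefixes and backend addresses; the post does not
# say which paths the proxy was configured to split on.
IMAGE_PREFIXES = ("/images/", "/graphics/")
IMAGE_BACKEND = "images.example.internal:8000"
OPENACS_BACKEND = "openacs.example.internal:8000"


def pick_backend(url_path: str) -> str:
    """Return the backend that should serve url_path.

    Static image requests go to the plain aolserver instance;
    everything else goes to the openacs server.
    """
    # str.startswith accepts a tuple and matches any of its prefixes.
    if url_path.startswith(IMAGE_PREFIXES):
        return IMAGE_BACKEND
    return OPENACS_BACKEND
```

With more than one openacs server, the second branch would pick among several backends instead of returning a single address.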

All servers are running linux rh8. We are currently using only one openacs server but of course that could be several machines as well. The proxy could do the load balancing. Unfortunately the more servers you have the more work you will have with handling failover problems :)

As far as openacs itself is concerned i think we (Vienna Univ of Business Admin) will have the same performance problems in a few weeks. I think a good point to start enhancing dotlrn performance is the portal system. For all community-portals (dotlrn_class, dotlrn_class_instance, dotlrn_club, dotlrn_department) but the user portal i have ripped out the real portal-system and created a "static" version where all portlets are called directly with given parameters (=unchangeable portal layout). Next important thing will be to enhance the real portal system and to cache portlet content.

9: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Don Baccus on 04/02/04 08:46 PM

"For all community-portals (dotlrn_class, dotlrn_class_instance, dotlrn_club, dotlrn_department) but the user portal i have ripped out the real portal-system and created a "static" version where all portlets are called directly with given parameters (=unchangeable portal layout)."

Now that I've got the rewritten portal system working within OpenACS, my next two goals will be:

1. reintegration with .LRN (of course!)

2. Maximum caching of portal information. In particular parameters to portlets very, very rarely change and all the database operations to set the render call up can be cached more-or-less permanently. Since the parameters should only be changed through portal package API calls caching can be controlled with 100% accuracy unless someone goes out of their way to break the rules, in which case screw 'em.

"Next important thing will be to enhance the real portal system and to cache portlet content."

Coming up with a useful scheme for this might be tricky, unless we just want to say "cache content for five minutes" or something like that. The problem with any portlet content caching approach implemented by the portal package itself is that portlet content won't match the content you get when you visit the package itself ... very confusing.

On the other hand if we implemented per-application package caching and if portlets share application code properly then they can both be made to render the same content, maybe even coherently if caching's implemented intelligently :)

But one thing is for sure ... portal-level stuff (determining layout, parameters to pass to portlets, etc) can be made to run with zero db hits (after the cache is filled by visitors, of course) without much trouble at all. I've been looking into it ...
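To illustrate the idea of caching portlet parameters permanently and invalidating only through the API, here is a minimal sketch. It is not the portal package's code (which is Tcl); the class name and the `load_from_db` / `save_to_db` callbacks are hypothetical stand-ins for the real database operations.

```python
class PortletParamCache:
    """Cache portlet parameters indefinitely; invalidate only on API writes.

    load_from_db is a hypothetical callback standing in for the database
    queries that set up a portlet's render call.
    """

    def __init__(self, load_from_db):
        self._load_from_db = load_from_db
        self._cache = {}
        self.db_hits = 0  # counts reads that actually reached the database

    def get(self, portlet_id):
        # First visitor fills the cache; later reads cost zero db hits.
        if portlet_id not in self._cache:
            self.db_hits += 1
            self._cache[portlet_id] = self._load_from_db(portlet_id)
        return self._cache[portlet_id]

    def set(self, portlet_id, params, save_to_db):
        # All writes go through this call, so the cache entry can be
        # dropped at exactly the moment the stored value changes.
        save_to_db(portlet_id, params)
        self._cache.pop(portlet_id, None)
```

This only stays accurate if every parameter change goes through `set`; code that writes to the tables directly would leave stale entries behind.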

10: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Alfred Essa on 04/02/04 09:41 PM

Peter, How is it going with Heidelberg?

11: Re: Help Needed in Setting up .LRN to Scale (response to 9)

Posted by Peter Alberer on 04/02/04 10:39 PM

<blockquote>Now that I've got the rewritten portal system working within OpenACS, my next two goals will be:
</blockquote>

When looking in the cvs repository i found two different portal packages. Is the rewritten portal system the "portal" package in the /contrib directory? I currently use the new-portal package from the dotlrn repository. Is this the current solution?

<blockquote>2. Maximum caching of portal information. In particular parameters to portlets very, very rarely change and all the database operations to set the render call up can be cached more-or-less permanently
</blockquote>

what i found difficult to deal with are portlets that use ns_query... to directly get some kind of user input (like the calendar list view). Do you have ideas how to get caching to work with those portlets?

12: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Don Baccus on 04/03/04 12:49 AM

The rewritten package is the contrib/portal package. Open Force started on this project a year ago summer, when they dropped out it lay untouched until last summer. I worked on it for a couple of weeks last summer but got busy with other stuff, but got back to it a week or two before going to Guatemala (University of Galileo) for most of February.

At the moment it does not work with .LRN. Also its caching is probably less effective than new-portal at the moment since I didn't address this problem while reorganizing and rewriting big chunks of it (mostly because, if folks approve my TIP, I'd prefer to use the caching db_* API rather than util_memoize to do the caching).

Mostly I find OF's use of ns_sets for passing stuff around annoying but beyond that haven't looked into what it would take to make specific portlets cache their content. As I said above they really need to coordinate with caching versions of the underlying application if we're going to provide the user consistent views of application content. That's not a short-term fix, obviously, and short-term we may need to kludge things optionally ...

Of course one can ease the pain by minimizing the number of portlets per portal page. Ideal would be one, then you'd have the equivalent of application pages rather than portalled pages! :) OK, I'm being silly, but perhaps this helps make clear that when it comes to content it is really the application's responsibility if we're to present consistent views of content? Portal pages are bad performance-wise because rendering one's the equivalent of rendering index pages for several non-caching applications all at once.

13: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Peter Marklund on 04/03/04 12:41 PM

Thank you all very much for all the great tips and info!

I'm on a one-week Easter vacation now and won't have time to follow up on this thread before the 13th of April. I have reassigned this scalability task to Joel Aufrecht so you can expect status updates from him on how the work progresses.

14: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Nima Mazloumi on 04/03/04 04:33 PM

Joel, where can I find the acs-monitoring package? I looked at cvs and was not able to find it.

15: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Andrew Grumet on 04/03/04 06:38 PM

My mistake, the package is called "monitoring".

http://cvs.openacs.org/cvs/openacs-4/packages/monitoring/

16: Re: Help Needed in Setting up .LRN to Scale (response to 15)

Posted by Alfred Essa on 04/03/04 08:58 PM

Just for the record, can someone from Heidelberg post hardware characteristics (RAM etc). Thank You.

17: Re: Help Needed in Setting up .LRN to Scale (response to 16)

Posted by Michael Hebgen on 04/04/04 07:27 PM

hi all together,

the machine on which we experience performance problems is

hardware:
- a sun fire 280r
- 2048 megabytes storage
- 2 * 36 gb disks, 1 * 200 gb raid
- 1 fastethernet adapter with 1 additional virtual interface

software:
- solaris 2.8
- webct (uses about 128 MB storage, minimal cpu)
- dotlrn 2.0.1
- oracle 1.8.i server & db

michael

18: Re: Help Needed in Setting up .LRN to Scale (response to 17)

Posted by Andrew Piskorski on 04/05/04 02:33 PM

Are you saying your Sun box has 2 GB of RAM? I think that's what you meant by "storage". What CPU does that thing have and how fast is it?

19: Re: Help Needed in Setting up .LRN to Scale (response to 17)

Posted by Andrew Piskorski on 04/05/04 02:53 PM

Googling, looks like a "Sun Fire 280r" is usually a 2 cpu box. I've no idea what a "1 * 200 gb raid" is, nor how fast your two 36 GB disks (must be SCSI) are, but that machine does not sound like a "big Solaris server". In fact, depending on how old it is (and thus how fast the CPUs, etc. are), it sounds fairly puny.

The biggest question is your disk IO. Just what is that "1 * 200 gb raid" exactly? If it is really a RAID 10 array with 4 or 8 disks or something like that, you might be fine. But I don't think that's what you have, and if my assumptions about your hardware are correct, it would probably only cost a few thousand dollars to buy a brand new and much faster Linux box.

Which isn't to say that your scaling problems are hardware related, they might not be. But when good server hardware is so cheap, it doesn't make sense to even try to run a large, high traffic site on a slow machine.

20: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Janine Ohmer on 04/05/04 05:24 PM

So far, the problem looks like this:

Because the box has only 2 GB of RAM, and .LRN isn't the only thing running on it, there is over 1 GB of swap in use. It appears that Oracle's SGA resides at least partly in swap (looking at iostat to see lots of swap activity while queries are run in sqlplus). This, of course, just kills performance.

To make matters worse, most everything is installed out on the disk array, so all of the log files, both nsd and Oracle, are being written to over a single SCSI channel.

My recommendation is to first get some more RAM, at *least* bring it up to 4 GB, and then if possible split things up across multiple disks, either by moving the log files to an internal disk or attaching a second disk array.

I'm pretty sure things will be fine after that, but if not, we'll continue looking into it at that point. I've tried doing a little query tuning but it's a lost cause right now; nothing I do makes any difference.

21: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Alfred Essa on 04/06/04 03:35 PM

A Sun 280R with only 2Gb RAM seems under-powered to run an Oracle installation for 30,000+ users. It's also not advisable to run another application (WebCT) on the same box.

22: Re: Help Needed in Setting up .LRN to Scale (response to 21)

Posted by Dirk Gomez on 04/06/04 03:44 PM

A few questions:

Does the installation not scale adequately or not perform adequately?

And how many people are accessing .LRN during regular times and during peak times?

How big is the Oracle SGA?

What exactly doesn't scale or perform? The whole system, some pages?

23: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Janine Ohmer on 04/06/04 04:23 PM

Currently there are only about 10 users active, probably only one using the site at any given moment, and pretty much everything is slow. Yet there is virtually no load on the system; uptime shows it hovering around 0.2. So at this point it isn't that the system is overloaded; as far as we can tell, it's just that everything is running slowly because it's all running out of swap.

I have advised them to get the system up to at least 4 GB of RAM and see how we do then. More tuning may be needed at that point, but right now it's impossible to tell.

Al, you may be right that the system is underpowered, but furfly has a pretty busy Oracle-based ACS site on a dual Pentium with excellent performance, so you never know. Each installation seems to be different as far as how much load it puts on a system. I have mentioned to Lars that the system might not scale, but I think it's ok to take a wait and see approach for now.

24: Re: Help Needed in Setting up .LRN to Scale (response to 23)

Posted by Dirk Gomez on 04/06/04 05:01 PM

I would hope that this machine is big enough for this load.

What is eating up all the RAM?

25: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Janine Ohmer on 04/06/04 05:12 PM

I assume it is WebCT that has grabbed most of the RAM, as I shut down both nsd and Oracle and only about 200 MB of RAM was released.

2 GB is a paltry amount of RAM for a dual processor Sparc anyway; I would want to see 4 GB in that box even if we were only running one site on it.

26: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Nima Mazloumi on 04/06/04 05:27 PM

Can you turn on developer support info and database statistics, log in to the system to view "my space" as a user that is not admin, and send us the request information for that request? Maybe we can find out what the bottleneck is from that info.

Also shut down webct for a short test to see if it makes a significant difference.

Can you also post your config.tcl without any sensitive data?

Also, can you post your authentication, kernel and main site parameter settings under /acs-admin/

How many authorities exist for your installation? Does it make a difference if you deactivate your URZ Heidelberg or Extern authority?

Have you ever tried postgresql instead? My installation with 22.000 users is quicker.

27: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Michael Hebgen on 04/08/04 02:34 PM

A colleague of mine, our Sun specialist Gerhard Rathmann, has made
some testing over the past days with the following results:

Intensive testing and monitoring of the computer named "athena"
gave the following results:

a) The I/O-Usage of the RAID-Subsystem was about 6MB/sec for
writing and 18MB/sec for reading - not really high.

b) Running programs like iostat, vmstat and top has shown that
the highest data rate was caused by the TSM backup process.
In all other cases the data rate was less than 10 percent of
the values mentioned above.

c) Storage usage is about 90 percent, almost no swap activity
has been observed.

About 1 GB of storage is used by Oracle, the other application
Webct uses about 200 MB (for comparison: our very active Oracle
Server has been recently upgraded from 1 GB to 2 GB and performs
pretty well!!!)

d) CPU usage is below 1 percent - the highest CPU usage was caused
by the testing and monitoring programs mentioned above.

So we conclude that we do not have a performance problem caused by
hardware bottlenecks or by the other application Webct.

It looks like we need some tuning of dotLRN and/or Oracle.

28: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Dirk Gomez on 04/08/04 02:55 PM

I was told that Statspack was installed. Just post the results - the culprits are usually VERY easy to identify.

29: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Janine Ohmer on 04/08/04 05:19 PM

I have been posting the details of our investigation to the internal ticket tracker for this project. In the interest of getting some more eyes on the problem, here is what I have discovered/thought as we went along. I am going to go run another statspack report after I post this and will be back with that after I have looked at it.

---------------------------------------
Based on the system specs Mat sent I think that if we cannot add RAM to this box then we may actually need to reduce the amount of space allocated to Oracle. I have set up statspack and taken a very quick snapshot of loading my own My Space page (and whatever else happened to go on during that time). This is not a very large sample but when I did this for Sloanspace it did help us pinpoint problems. One thing it hopefully will tell me is whether we have excess memory and can cut it back.

To be clear, I don't think this is the whole problem but it is certainly a contributing factor. In my opinion we need to get problems like this cleared up before we start tuning the application.

I'm going to go off now and study the report, which may take some time.
---------------------------------------
Because there is so little data in the report, I can't tell a whole lot about what our performance issues might be. But one thing is clear - we've got too much memory allocated to Oracle. The current size of the shared pool is 250,270,105 bytes, and at the moment I took the snapshot we were using 40% of it. That number is supposed to be between 75% and 85% for optimal performance. That, combined with our memory shortage, points to this being a number we should definitely change.

The number of bytes actually in use was 100,108,042, which is 75% of 133,477,389. Unless I hear any objections, I'll shut down the site and Oracle and change the shared pool size to that number. It may not be enough of a change to make much difference, considering we have almost 2 GB of swap being used, but it's the right thing to do in any case.
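The arithmetic behind that number can be checked with a short sketch. The function names are made up for illustration; the byte counts are the ones quoted above, and the 75% target is the lower end of the 75% to 85% range mentioned earlier.

```python
def shared_pool_usage(bytes_in_use: int, pool_size: int) -> float:
    """Fraction of the shared pool that is actually in use."""
    return bytes_in_use / pool_size


def pool_size_for_target(bytes_in_use: int, target_usage: float = 0.75) -> int:
    """Pool size at which bytes_in_use would equal target_usage of the pool."""
    return round(bytes_in_use / target_usage)


current_pool = 250_270_105  # shared pool size from the statspack report
in_use = 100_108_042        # bytes in use at snapshot time

print(f"current usage: {shared_pool_usage(in_use, current_pool):.0%}")
print(f"pool size for 75% usage: {pool_size_for_target(in_use):,}")
```

A single quick snapshot is a thin basis for this, so the result is best treated as a starting point to revisit once more statspack data is available.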

This is not necessarily the only change we'll want to make to the Oracle configuration, but the site needs to run a bit so I can take another snapshot with some better numbers in it. I think that the sort_area is probably too small, and the db_block_buffers might be too large, but I don't want to change them without some data to back it up. However, I think that even when all the tuning is done, we're still going to need more RAM for this system.

After I make the change to the shared pool size, the next step will be to start looking at the application. I am assuming that you want me to do this, and not just stick to Oracle tuning - let me know if that is not right.

I will wait about 15 minutes for objections and then make this change.
---------------------------------------
Ok, change has been made. Some stats:

With both Oracle and nsd shut down:

Memory: 2048M real, 1283M free, 675M swap in use, 4912M swap free

With Oracle running and nsd shut down:

Memory: 2048M real, 337M free, 1638M swap in use, 3948M swap free

With both running, after nsd had finished initializing:

Memory: 2048M real, 266M free, 1721M swap in use, 3865M swap free

So basically, there is a limit to what we can do here because the system is still using swap even with everything we are running on the box turned off! That might clear up with a reboot, but I expect it would happen again over time.

I will revisit this issue when I have more statspack data to work with but I think it's clear we aren't going to win this one without more RAM. Time to look at the application and see if there's anything we can do there.
---------------------------------------
I have examined several queries in detail, but no silver bullet has been found so far. The only thing that jumps out at me is that it has been a while since tables were last analyzed:

SQL> select last_analyzed from user_tables where table_name = 'ACS_OBJECT_TYPES';

LAST_ANALY
----------
2004-02-10

It would be a good idea to do this weekly, if not more often.
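A check for stale statistics along these lines could be sketched as below. It assumes the `last_analyzed` values have already been fetched from `user_tables` into a dictionary; the function name, the seven-day threshold, the second table name and the reference date are all illustrative assumptions.

```python
from datetime import date, timedelta


def stale_tables(last_analyzed, today, max_age_days=7):
    """Return the sorted names of tables whose statistics are too old.

    last_analyzed maps table name -> date of last analyze, or None if the
    table has never been analyzed.
    """
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name
        for name, analyzed in last_analyzed.items()
        if analyzed is None or analyzed < cutoff
    )


# ACS_OBJECT_TYPES uses the date shown above; the other entry is made up.
stats = {
    "ACS_OBJECT_TYPES": date(2004, 2, 10),
    "HYPOTHETICAL_FRESH_TABLE": date(2004, 4, 5),
}
print(stale_tables(stats, today=date(2004, 4, 8)))
```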

#1 - the dotlrn_users query in /dotlrn/admin/users

This query is *horribly* slow and does three full table scans. Unfortunately, none of my usual tricks worked to eliminate the scans.

#2 - the call to dotlrn_community_admin_p is the culprit here. Again, I was unable (so far) to make it run any faster.

However.... I have not given up, and I will continue working on this on Monday (possibly some on Saturday if I have time). It took a while to hit pay dirt on Sloanspace too; unfortunately (or fortunately, depending on your point of view) this installation doesn't have the Oracle misconfiguration that turned out to be responsible for a lot of our troubles on Sloanspace.
---------------------------------------
I have been thinking about this all weekend, and I kept coming back to the fact that the system is not heavily loaded, yet performance is poor. A situation that can be helped by tuning queries generally exhibits other signs of stress - high system load and Oracle processes using lots of CPU time. Not so here.

I asked Mike to take a look; he ran various OS tools looking at performance while I loaded the /dotlrn/admin/users page over and over. Mike believes he has found a potential problem. Here is what he wrote up for me, and I will comment further after:

"This looks like a disk I/O based performance problem.

The device to pay attention to is sd30 -- an external SCSI-attached disk array.

iostat shows that a large amount of disk I/O results when the page is loaded; kps is total traffic in kilobytes per second, tps is total transactions per second, and serv is service time (disk seek time) in milliseconds.

The disk service time is fine which tells us the disk array is not overloaded and the time to seek from the disk is reasonably speedy.

The ratio between the kps and tps tells us about file sizes -- in this case it looks like a lot of large files are being transferred when the page loads.

This looks to be a case where disk I/O bandwidth isn't sufficient for the query; multiple spindles are needed and the load should be divided between multiple disks (for example, sd30 has both /web and /ora8 which means the same disk is being hit to read from Oracle, write web access logs and transaction logs, as well as reading the html)."
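The kps/tps ratio mentioned in the quote is simply the average number of kilobytes moved per transfer. A small sketch, with made-up sample figures since the actual iostat output was not posted:

```python
def avg_transfer_kb(kps: float, tps: float) -> float:
    """Average transfer size in KB, given iostat's kps and tps columns."""
    if tps == 0:
        return 0.0  # idle device: no transfers, so no meaningful average
    return kps / tps


# Illustrative values only; these are not measurements from sd30.
print(avg_transfer_kb(kps=6400, tps=50))
```

A large average suggests big sequential reads or writes, while a small one points to many small random accesses.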

Mike didn't see any signs of swapping going on during our tests.

Here's my version: a lot of data is going back and forth between the system and that disk array. Data gets read from Oracle tables, and intermediate results get written to the temporary tablespace. Redo, rollback and archive logs are written to. The nsd error and access logs are also written to. It appears that there is just so much data going through that one connection to the disk array that we're experiencing a traffic jam.

Now, it seems a bit odd to me that Oracle is doing this much disk access... I would have expected it and nsd to both keep this data in memory, especially as I reload the same page over and over again. I don't know off the top of my head how to tell how much of the database Oracle has got in memory; that will be tomorrow's research project, along with looking at another statspack report.

I'm not sure what to recommend as a course of action to fix this, assuming we end up agreeing that this is the problem, because I don't know what our options are. Do we have any other systems available which might be more suitable?
---------------------------------------
One thing that bothers me about this forming hypothesis is that we don't see any swap activity during page loads. It seems that we should, if we're going to blame the site's slowness on a disk i/o bottleneck. So I took the query from the /dotlrn/admin/users page and ran it in sqlplus, running iostat at the same time to monitor disk activity. This time I saw *lots* of disk activity on the swap device.

So what does this tell us? For one thing, I think it confirms the theory that the memory Oracle is using resides in the swap partition and not in RAM. That's a guaranteed performance killer, so we definitely have to fix that. It also tells us that some caching is happening somewhere, because when I load that page and the same query executes, there is very little swap activity. Unfortunately this doesn't explain why the page load is so slow anyway... the cache may also be out in the swap partition but that doesn't fully explain it.

At this point I believe that if we could bump the RAM in this system up to at least 4 GB it would help considerably. Mike also feels that there is too much disk activity going to one place - all those log files (nsd and Oracle) should be split up between at least two disks, preferably on separate channels.

In my opinion, it doesn't make sense to continue tuning queries or looking at the finer points of the Oracle installation until the hardware is adequate to support the site; as I saw on Friday, the efforts are unlikely to result in any improvement.
---------------------------------------
Matthais, I'm not sure I understand the question, so let me just state clearly what I think we need to do.

First, if we are going to remain on this system we need more RAM. The system needs to have at least 4 GB (total) just to stop it from using any swap space, and it would be better if we had an extra GB or two (meaning 5 or 6 total) to have room for growth. If we have enough RAM, then everything that is supposed to be loaded into RAM, like Oracle's working area, will be and performance will be much improved.

At that point it's possible that things will be running well enough that the external disk array will no longer be a problem. If it is still a problem, then we will need either access to a second external array, so we can split up the log files, or (even better) an internal disk added to the system.

At this time there is no need for a high performance system, just a few more resources allocated to this one.
---------------------------------------
Here are the results of my experiment. I took snapshots via the top command at each step.

before:

Memory: 2048M real, 301M free, 1563M swap in use, 4023M swap free

nsd shut down:

Memory: 2048M real, 441M free, 1422M swap in use, 4165M swap free

Oracle shut down:

Memory: 2048M real, 1372M free, 459M swap in use, 5130M swap free

WebCT shut down:

Memory: 2048M real, 1413M free, 383M swap in use, 5207M swap free

At this point nothing is running but Solaris, so this is a baseline state. It's possible that a bit more memory would be available if we could reboot, but this looks pretty normal to me.

on the way back up:

Oracle started up:

Memory: 2048M real, 457M free, 1339M swap in use, 4248M swap free

nsd started up:

Memory: 2048M real, 430M free, 1363M swap in use, 4225M swap free

after site has come up all the way and a few pages loaded:

Memory: 2048M real, 275M free, 1480M swap in use, 4107M swap free

WebCT started up:

Memory: 2048M real, 266M free, 1512M swap in use, 4075M swap free

Conclusion:

Oracle grabbed 1915M of RAM, considerably more than was available, so even when it was the only thing running it caused the system to go into swap. It is the major resource hog here. WebCT used a small amount of memory so, at least as far as RAM goes, its presence is not making a significant difference to system performance.
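For anyone who wants to repeat this kind of experiment, the snapshot comparison can be automated with a sketch like the one below. It parses the `Memory:` line that top printed on this Solaris box and reports the change between two snapshots. The function names are made up, and this is not necessarily how the figure above was derived; drop in free RAM and growth in swap in use can overlap, so the two deltas are reported separately rather than summed.

```python
import re

MEMORY_LINE = re.compile(
    r"Memory: (\d+)M real, (\d+)M free, (\d+)M swap in use, (\d+)M swap free"
)


def parse_memory(line: str) -> dict:
    """Parse a top 'Memory:' summary line into integer megabyte fields."""
    match = MEMORY_LINE.search(line)
    if match is None:
        raise ValueError(f"not a Memory line: {line!r}")
    real, free, swap_used, swap_free = map(int, match.groups())
    return {"real": real, "free": free,
            "swap_used": swap_used, "swap_free": swap_free}


def delta(before: dict, after: dict) -> tuple:
    """Return (drop in free RAM, growth in swap in use) in MB."""
    return (before["free"] - after["free"],
            after["swap_used"] - before["swap_used"])


# Snapshots copied from the experiment above.
baseline = parse_memory(
    "Memory: 2048M real, 1413M free, 383M swap in use, 5207M swap free")
oracle_up = parse_memory(
    "Memory: 2048M real, 457M free, 1339M swap in use, 4248M swap free")
print(delta(baseline, oracle_up))
```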

As you might expect, the site ran no faster with WebCT shut down, because the system was basically just as far into swap as it was when I started.

I am still convinced that adding RAM (at least 2 GB) is the most important thing we can do to improve the situation.
---------------------------------------
I forgot to mention one thing - I can make Oracle require less RAM, but I probably can't get it down small enough. And even if I could, it would only work for a short time; Oracle performs best when it is able to load the entire data set into RAM, and if it has a minimal amount of space to work with it will lose the ability to do that as your users add content. So performance would fall off quickly at some point in the not-too-distant future. It is really better to fix this properly now.
---------------------------------------

Ok, that's the trail so far. Comments?

30: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Janine Ohmer on 04/08/04 05:28 PM

The statspack report will have to wait until later in the day or tomorrow, when more data has gathered. I don't have it set up to take automatic snapshots (don't want to make the problems any worse) and because Oracle was shut down since I last took a manual snapshot, the resulting report is invalid. Lots of weird negative numbers, definitely not something to rely on. So I'll be back with that later.

31: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Nima Mazloumi on 04/08/04 05:34 PM

Regardless of the RAM issue, I don't have similar performance problems using postgresql instead. My installation with 22.000 users seems much quicker - as Heiko agreed. Interestingly enough, I had a similar problem at first, after the initial batch synch. But tuning the OpenACS/dotLRN parameters made the performance problem go away (see https://openacs.org/forums/message-view?message_id=157767).

Again, can you kindly post the following info:

What is the request info of ds for simply logging in to the system?

Does the above make a significant difference if webct is shut down?

How are the config.tcl settings?

What are the authentication, kernel and main site parameter settings under /acs-admin/?

How many authorities exist for your installation? Does it make a difference if you deactivate your URZ Heidelberg or Extern authority?

32: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Janine Ohmer on 04/08/04 05:48 PM

Nima, I didn't post all that before because it's a lot of data for people to wade through and I don't think it will make any difference; since I can take a slow query, say the one from /dotlrn/admin/users that lists the users of a particular type, and get just as long a run time running it from sqlplus as I get through the browser, I believe the root of the problem lies at the system/database level, not in the application. However, since we want to get to the bottom of this I'll put together a list of what you've asked for.

In particular, we do not currently have PermissionCacheP turned on, but in that thread you reported having some trouble with it. Since this is a production system I don't want to turn it on if users might encounter errors. Is it working cleanly now?

33: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Nima Mazloumi on 04/08/04 05:54 PM

You are right. The PermissionCacheP parameter was discussed at the end of the thread. And it is true: turning it on increases performance significantly but leads to errors in various places. Hopefully this can be fixed in the future. At present my installation is not using that parameter either, and it still seems faster.

34: Re: Help Needed in Setting up .LRN to Scale (response to 31)

Posted by Janine Ohmer on 04/08/04 06:12 PM

-- What is the request info from ds for simply logging into the system?

Request Information
Main Site : Developer Support : Request Information

Parameters

Request Start Time:
2004-04-08 17:59:39

Request Completion Time:
2004-04-08 17:59:41

Request Duration:
2215 ms

IP:
18.170.5.196

Method:
GET

URL:
/dotlrn/index

Query:
(empty)

Request Processor

+49.4 ms: Applied transformation from /web/product/www / dotlrn/index -> ? - 7.5 ms
+63.3 ms: Served file /web/product/packages/dotlrn/www/index.adp with adp_parse_ad_conn_file - 2146.0 ms
+2211.2 ms: Applied GET filter: (for /dotlrn/index ds_trace_filter) - 10.2 ms
returned filter_ok

show RP debugging information

Comments

rp_handler: trying rp_serve_abstract_file /web/product/www / dotlrn/index
rp_handler: not found
rp_handler: trying rp_serve_abstract_file /web/product/packages/dotlrn/www / index

Headers

Host:
athena2.uni-heidelberg.de

Accept:
*/*

Accept-Language:
en

Pragma:
no-cache

Connection:
Keep-Alive

Referer:
http://athena2.uni-heidelberg.de/register/?blocale=en%5fUS&return%5furl=%2fdotlrn%2findex

User-Agent:
Mozilla/4.0 (compatible; MSIE 5.23; Mac_PowerPC)

UA-OS:
MacOS

UA-CPU:
PPC

Cookie:
ad_session_id=45199%2c110325%2c1%20%7b31%2010812456245179%20E459FB42B0DD8685A1F56F45645651CB19A532BA6BBDC8%7d; ad_user_login=110325%2c145636226%2cC4D30F674%20%7b531%201081468779%2071A84256603EB939973A0131D86AFB873DBA547E82B%7d

Extension:
Security/Remote-Passphrase

Output Headers

Expires:
Thu, 08 Apr 2004 15:59:41 GMT

Pragma:
no-cache

Cache-Control:
no-cache

Content-Type:
text/html; charset=utf-8

MIME-Version:
1.0

Date:
Thu, 08 Apr 2004 15:59:41 GMT

Server:
AOLserver/3.3.1+ad13

Content-Length:
18207

Connection:
close

Database Requests

  Duration
  Pool
Command

  1 ms
  pool2
gethandle (returned nsdb0)

  4 ms
  pool2
dbqd.acs-tcl.tcl.acs-permissions-procs.permission::permission_p_not_cached.select_permission_p: 0or1row nsdb0

select 1
from dual
where 't' = acs_permission.permission_p(:object_id, :party_id, :privilege)

  3 ms
  pool2
dbqd.dotlrn.tcl.dotlrn-security-procs.dotlrn::user_p.select_count: 0or1row nsdb0

select count(*)
from dual
where exists (select 1
from dotlrn_users
where user_id = :user_id)

  4 ms
  pool2
dbqd.dotlrn.tcl.community-procs.dotlrn_community::get_all_communities_by_user.select_communities_by_user: select nsdb0

select dotlrn_communities_full.*
from dotlrn_communities_full,
dotlrn_member_rels_approved
where dotlrn_communities_full.community_id = dotlrn_member_rels_approved.community_id
and dotlrn_member_rels_approved.user_id = :user_id

  1 ms
  pool2
getrow nsdb0

  6 ms
  pool2
dbqd.acs-tcl.tcl.acs-permissions-procs.permission::permission_p_not_cached.select_permission_p: 0or1row nsdb0

select 1
from dual
where 't' = acs_permission.permission_p(:object_id, :party_id, :privilege)

  3 ms
  pool2
dbqd.dotlrn.tcl.dotlrn-procs.dotlrn::get_portal_id_not_cached.select_user_portal_id: 0or1row nsdb0

select portal_id
from dotlrn_users
where user_id = :user_id

  4 ms
  pool2
dbqd.acs-tcl.tcl.acs-permissions-procs.permission::permission_p_not_cached.select_permission_p: 0or1row nsdb0

select 1
from dual
where 't' = acs_permission.permission_p(:object_id, :party_id, :privilege)

  4 ms
  pool2
dbqd.acs-tcl.tcl.acs-permissions-procs.permission::permission_p_not_cached.select_permission_p: 0or1row nsdb0

select 1
from dual
where 't' = acs_permission.permission_p(:object_id, :party_id, :privilege)

  4 ms
  pool2
dbqd.new-portal.tcl.portal-procs.portal::render.portal_select: 0or1row nsdb0

select portals.name,
portals.portal_id,
portals.theme_id,
portal_layouts.layout_id,
portal_layouts.filename as layout_filename,
portal_pages.page_id
from portals,
portal_pages,
portal_layouts
where portal_pages.sort_key = :sort_key
and portal_pages.portal_id = :portal_id
and portal_pages.portal_id = portals.portal_id
and portal_pages.layout_id = portal_layouts.layout_id

  3 ms
  pool2
dbqd.new-portal.tcl.portal-procs.portal::render.element_select: select nsdb0

select portal_element_map.element_id,
portal_element_map.region,
portal_element_map.sort_key
from portal_element_map,
portal_pages
where portal_pages.portal_id = :portal_id
and portal_element_map.page_id = :page_id
and portal_element_map.page_id = portal_pages.page_id
and portal_element_map.state != 'hidden'
order by portal_element_map.region,
portal_element_map.sort_key

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  9 ms
  pool2
dbqd.new-portal.tcl.portal-procs.portal::evaluate_element.element_select: 0or1row nsdb0

select pem.element_id,
pem.datasource_id,
pem.state,
pet.filename as filename,
pet.resource_dir as resource_dir,
pem.pretty_name as pretty_name,
pd.name as ds_name
from portal_element_map pem,
portal_element_themes pet,
portal_datasources pd
where pet.theme_id = :theme_id
and pem.element_id = :element_id
and pem.datasource_id = pd.datasource_id

  3 ms
  pool2
dbqd.new-portal.tcl.portal-procs.portal::element_params_not_cached.params_select: select nsdb0

select key,
value
from portal_element_parameters
where element_id = :element_id

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  6 ms
  pool2
dbqd.acs-tcl.tcl.acs-permissions-procs.permission::permission_p_not_cached.select_permission_p: 0or1row nsdb0

select 1
from dual
where 't' = acs_permission.permission_p(:object_id, :party_id, :privilege)

  5 ms
  pool2
dbqd.dotlrn.www.dotlrn-main-portlet.select_communities: select nsdb0

select dotlrn_communities_all.*,
dotlrn_community.url(dotlrn_communities_all.community_id) as url,
decode(dotlrn_communities_all.community_type, 'dotlrn_community', 'dotlrn_community',
'dotlrn_club', 'dotlrn_club',
'dotlrn_class_instance') as simple_community_type,
decode(dotlrn_community_admin_p(dotlrn_communities_all.community_id, dotlrn_member_rels_approved.user_id),'f',0,1) as admin_p,
tree.tree_level(dotlrn_communities_all.tree_sortkey) as tree_level,
nvl((select tree.tree_level(dotlrn_community_types.tree_sortkey)
from dotlrn_community_types
where dotlrn_community_types.community_type = dotlrn_communities_all.community_type), 0) as community_type_level
from dotlrn_communities_all,
dotlrn_member_rels_approved
where dotlrn_communities_all.community_id = dotlrn_member_rels_approved.community_id
and dotlrn_member_rels_approved.user_id = :user_id
order by dotlrn_communities_all.tree_sortkey

  1 ms
  pool2
getrow nsdb0

  4 ms
  pool2
dbqd.new-portal.tcl.portal-procs.portal::evaluate_element.element_select: 0or1row nsdb0

select pem.element_id,
pem.datasource_id,
pem.state,
pet.filename as filename,
pet.resource_dir as resource_dir,
pem.pretty_name as pretty_name,
pd.name as ds_name
from portal_element_map pem,
portal_element_themes pet,
portal_datasources pd
where pet.theme_id = :theme_id
and pem.element_id = :element_id
and pem.datasource_id = pd.datasource_id

  3 ms
  pool2
dbqd.new-portal.tcl.portal-procs.portal::element_params_not_cached.params_select: select nsdb0

select key,
value
from portal_element_parameters
where element_id = :element_id

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  6 ms
  pool2
dbqd.acs-tcl.tcl.acs-permissions-procs.permission::permission_p_not_cached.select_permission_p: 0or1row nsdb0

select 1
from dual
where 't' = acs_permission.permission_p(:object_id, :party_id, :privilege)

  4 ms
  pool2
dbqd.forums-portlet.www.forums-portlet.select_forums: select nsdb0

select forums_forums.package_id,
acs_object.name(apm_package.parent_id(forums_forums.package_id)) as parent_name,
(select site_node.url(site_nodes.node_id)
from site_nodes
where site_nodes.object_id = forums_forums.package_id) as url,
forums_forums.forum_id,
forums_forums.name,
case when last_modified > (sysdate - 1) then 't' else 'f' end as new_p
from forums_forums_enabled forums_forums,
acs_objects
where acs_objects.object_id = forums_forums.forum_id and
forums_forums.package_id in (0)
order by parent_name,
forums_forums.name

  1 ms
  pool2
getrow nsdb0

  4 ms
  pool2
dbqd.new-portal.tcl.portal-procs.portal::evaluate_element.element_select: 0or1row nsdb0

select pem.element_id,
pem.datasource_id,
pem.state,
pet.filename as filename,
pet.resource_dir as resource_dir,
pem.pretty_name as pretty_name,
pd.name as ds_name
from portal_element_map pem,
portal_element_themes pet,
portal_datasources pd
where pet.theme_id = :theme_id
and pem.element_id = :element_id
and pem.datasource_id = pd.datasource_id

  3 ms
  pool2
dbqd.new-portal.tcl.portal-procs.portal::element_params_not_cached.params_select: select nsdb0

select key,
value
from portal_element_parameters
where element_id = :element_id

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  4 ms
  pool2
dbqd.faq-portlet.www.faq-portlet.select_faqs: select nsdb0

select acs_objects.context_id as package_id,
acs_object.name(apm_package.parent_id(acs_objects.context_id)) as parent_name,
(select site_node.url(site_nodes.node_id)
from site_nodes
where site_nodes.object_id = acs_objects.context_id) as url,
faqs.faq_id,
faqs.faq_name
from faqs,
acs_objects
where faqs.faq_id = acs_objects.object_id
and faqs.disabled_p <> 't'
and acs_objects.context_id in (0)
order by lower(faq_name)

  1 ms
  pool2
getrow nsdb0

  4 ms
  pool2
dbqd.new-portal.tcl.portal-procs.portal::evaluate_element.element_select: 0or1row nsdb0

select pem.element_id,
pem.datasource_id,
pem.state,
pet.filename as filename,
pet.resource_dir as resource_dir,
pem.pretty_name as pretty_name,
pd.name as ds_name
from portal_element_map pem,
portal_element_themes pet,
portal_datasources pd
where pet.theme_id = :theme_id
and pem.element_id = :element_id
and pem.datasource_id = pd.datasource_id

  3 ms
  pool2
dbqd.new-portal.tcl.portal-procs.portal::element_params_not_cached.params_select: select nsdb0

select key,
value
from portal_element_parameters
where element_id = :element_id

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  10 ms
  pool2
dbqd.news-portlet.www.news-portlet.select_news_items: select nsdb0

select news_items_approved.package_id,
acs_object.name(apm_package.parent_id(news_items_approved.package_id)) as parent_name,
(select site_node.url(site_nodes.node_id)
from site_nodes
where site_nodes.object_id = news_items_approved.package_id) as url,
news_items_approved.item_id,
news_items_approved.publish_title,
to_char(news_items_approved.publish_date, 'YYYY-MM-DD HH24:MI:SS') as publish_date_ansi
from news_items_approved
where news_items_approved.publish_date < sysdate
and (news_items_approved.archive_date >= sysdate or news_items_approved.archive_date is null)
and news_items_approved.package_id in (0)
order by parent_name,
news_items_approved.publish_date desc,
news_items_approved.publish_title

  2 ms
  pool2
getrow nsdb0

  5 ms
  pool2
dbqd.new-portal.tcl.portal-procs.portal::evaluate_element.element_select: 0or1row nsdb0

select pem.element_id,
pem.datasource_id,
pem.state,
pet.filename as filename,
pet.resource_dir as resource_dir,
pem.pretty_name as pretty_name,
pd.name as ds_name
from portal_element_map pem,
portal_element_themes pet,
portal_datasources pd
where pet.theme_id = :theme_id
and pem.element_id = :element_id
and pem.datasource_id = pd.datasource_id

  3 ms
  pool2
dbqd.new-portal.tcl.portal-procs.portal::element_params_not_cached.params_select: select nsdb0

select key,
value
from portal_element_parameters
where element_id = :element_id

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  7 ms
  pool2
dbqd.acs-tcl.tcl.acs-permissions-procs.permission::permission_p_not_cached.select_permission_p: 0or1row nsdb0

select 1
from dual
where 't' = acs_permission.permission_p(:object_id, :party_id, :privilege)

  3 ms
  pool2
dbqd.acs-tcl.tcl.acs-permissions-procs.permission::permission_p_not_cached.select_permission_p: 0or1row nsdb0

select 1
from dual
where 't' = acs_permission.permission_p(:object_id, :party_id, :privilege)

  6 ms
  pool2
dbqd.acs-tcl.tcl.acs-permissions-procs.permission::permission_p_not_cached.select_permission_p: 0or1row nsdb0

select 1
from dual
where 't' = acs_permission.permission_p(:object_id, :party_id, :privilege)

  73 ms
  pool2
dbqd.calendar.www.view-one-day-display.select_day_items: select nsdb0

select nvl(e.name, a.name) as name,
nvl(e.status_summary, a.status_summary) as status_summary,
e.event_id as item_id,
(select type from cal_item_types where item_type_id= ci.item_type_id) as item_type,
cals.calendar_id,
cals.calendar_name
from acs_activities a,
acs_events e,
timespans s,
time_intervals t,
cal_items ci,
calendars cals
where e.timespan_id = s.timespan_id
and s.interval_id = t.interval_id
and e.activity_id = a.activity_id
and start_date between
to_date(:current_date_system,:ansi_date_format) and
(to_date(:current_date_system,:ansi_date_format) + (24 - 1/3600)/24)
and ci.cal_item_id = e.event_id
and to_char(start_date, 'HH24:MI') = '00:00'
and to_char(end_date, 'HH24:MI') = '00:00'
and cals.calendar_id = ci.on_which_calendar
and e.event_id = ci.cal_item_id
and on_which_calendar in (110394) and (cals.private_p='f' or (cals.private_p='t' and cals.owner_id= :user_id))

  1 ms
  pool2
getrow nsdb0

  205 ms
  pool2
dbqd.calendar.www.view-one-day-display.select_day_items_with_time: select nsdb0

select to_char(start_date, :ansi_date_format) as ansi_start_date,
to_char(end_date, :ansi_date_format) as ansi_end_date,
nvl(e.name, a.name) as name,
nvl(e.status_summary, a.status_summary) as status_summary,
e.event_id as item_id,
(select type from cal_item_types where item_type_id= ci.item_type_id) as item_type,
cals.calendar_id,
cals.calendar_name
from acs_activities a,
acs_events e,
timespans s,
time_intervals t,
cal_items ci,
calendars cals
where e.timespan_id = s.timespan_id
and s.interval_id = t.interval_id
and e.activity_id = a.activity_id
and start_date between
to_date(:current_date_system,:ansi_date_format) and
(to_date(:current_date_system,:ansi_date_format) + (:end_display_hour - 1/3600)/:end_display_hour)
and ci.cal_item_id = e.event_id
and (to_char(start_date, 'HH24:MI') <> '00:00' or
to_char(end_date, 'HH24:MI') <> '00:00')
and cals.calendar_id = ci.on_which_calendar
and e.event_id = ci.cal_item_id
and on_which_calendar in (110394) and (cals.private_p='f' or (cals.private_p='t' and cals.owner_id= :user_id))
order by to_char(start_date,'HH24')

  1 ms
  pool2
getrow nsdb0

  3 ms
  pool2
dbqd.acs-lang.tcl.locale-procs.lang::user::timezone_no_cache.select_user_timezone: 0or1row nsdb0

select timezone
from user_preferences
where user_id = :user_id

  4 ms
  pool2
dbqd.calendar.www.view-one-day-display.select_day_info: 0or1row nsdb0

select to_char(to_date(:current_date, 'yyyy-mm-dd'), 'Day, DD Month YYYY')
as day_of_the_week,
to_char((to_date(:current_date, 'yyyy-mm-dd') - 1), 'yyyy-mm-dd')
as yesterday,
to_char((to_date(:current_date, 'yyyy-mm-dd') + 1), 'yyyy-mm-dd')
as tomorrow
from dual

  4 ms
  pool2
dbqd.dotlrn.tcl.dotlrn-security-procs.dotlrn::user_p.select_count: 0or1row nsdb0

select count(*)
from dual
where exists (select 1
from dotlrn_users
where user_id = :user_id)

  3 ms
  pool2
dbqd.dotlrn.tcl.dotlrn-security-procs.dotlrn::user_p.select_count: 0or1row nsdb0

select count(*)
from dual
where exists (select 1
from dotlrn_users
where user_id = :user_id)

  3 ms
  pool2
dbqd.new-portal.tcl.portal-procs.portal::navbar.list_page_nums_select: select nsdb0

select pretty_name,
sort_key as page_num
from portal_pages
where portal_id = :portal_id
order by sort_key

  1 ms
  pool2
getrow nsdb0

  130 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
getrow nsdb0

  53 ms
  pool2
dbqd.acs-tcl.tcl.acs-permissions-procs.permission::permission_p_not_cached.select_permission_p: 0or1row nsdb0

select 1
from dual
where 't' = acs_permission.permission_p(:object_id, :party_id, :privilege)

  5 ms
  pool2
dbqd.curriculum.tcl.misc-procs.curriculum::enabled_elements.element_ns_set_list: select nsdb0

select cee.element_id,
cc.curriculum_id,
cc.name as curriculum_name,
cee.url,
cee.external_p,
cee.name
from (select curriculum_id
from cu_curriculums
where package_id = :package_id
MINUS
select curriculum_id
from cu_user_curriculum_map
where user_id = :user_id
and package_id = :package_id) desired,
workflow_cases cas,
workflow_case_fsm cfsm,
cu_curriculums cc,
cu_elements_enabled cee
where cc.package_id = :package_id
and desired.curriculum_id = cc.curriculum_id
and cc.curriculum_id = cee.curriculum_id
and cas.object_id = cc.curriculum_id
and cfsm.case_id = cas.case_id
and cfsm.current_state = :state_id
order by cc.sort_key,
cee.sort_key

  1 ms
  pool2
getrow nsdb0

  1 ms
  pool2
releasehandle nsdb0

  704 ms
(total)

Developer Information

3 database commands totalling 19 ms

page served in 203 ms
mailto:dotlrn@uni-hd.de
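
One way to read the trace above is to compare its reported totals. The sketch below only redoes that arithmetic with figures copied from the Developer Support output. The variable and function names are made up for illustration, and the "slowest" list is simply the four largest durations in the Database Requests table, with shortened labels.

```python
# Figures copied from the Developer Support trace above (milliseconds).
REQUEST_MS = 2215      # "Request Duration"
SERVE_MS = 2146.0      # time reported for serving dotlrn/www/index.adp
DB_TOTAL_MS = 704      # "(total)" at the end of the Database Requests table

# The four largest single entries in the Database Requests table.
# Labels are shortened for illustration only.
SLOWEST_MS = {
    "calendar select_day_items_with_time": 205,
    "getrow after portal::navbar list_page_nums_select": 130,
    "calendar select_day_items": 73,
    "permission_p (last call in the trace)": 53,
}


def summarize():
    """Return the headline numbers derived from the trace."""
    slowest_total = sum(SLOWEST_MS.values())
    return {
        "non_db_ms": SERVE_MS - DB_TOTAL_MS,
        "db_share_pct": round(100 * DB_TOTAL_MS / REQUEST_MS, 1),
        "slowest_total_ms": slowest_total,
        "slowest_share_pct": round(100 * slowest_total / DB_TOTAL_MS, 1),
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value}")
```

On these figures, database calls account for roughly a third of the request as measured by Developer Support, and four entries make up about two thirds of that database time. This is a single request, so it is only a starting point for deciding where to look.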

-- Does the above make a significant difference if webct is shut down?

No, Carl tried this a few days ago and saw no difference at all.

-- How are the config.tcl settings?

ns_log notice "nsd.tcl: starting to read config file..."

######################################################################
#
# Instance-specific settings
# These default settings will only work in limited circumstances
# Two servers with default settings cannot run on the same host
#
######################################################################

#---------------------------------------------------------------------
# change to 80 and 443 for production use
set httpport 80
set httpsport 443

# The hostname and address should be set to actual values.
#set hostname [ns_info hostname]
set hostname athena2.uni-heidelberg.de

#set address [ns_info address]
set address 129.206.100.143

set server "product"
set servername "Athena - dotLRN UNI HD"

set serverroot "/web/${server}"

#---------------------------------------------------------------------
# which database do you want? postgres or oracle
set database oracle

set db_name $server

if { $database == "oracle" } {
set db_password "itsgonenow"
} else {
set db_host localhost
set db_port ""
set db_user $server
}

#---------------------------------------------------------------------
# if debug is false, all debugging will be turned off
set debug false

set homedir /usr/local/aolserver
set bindir [file dirname [ns_info nsd]]

#---------------------------------------------------------------------
# which modules should be loaded? Missing modules break the server, so
# don't uncomment modules unless they have been installed.

ns_section ns/server/${server}/modules
ns_param nssock ${bindir}/nssock.so
ns_param nslog ${bindir}/nslog.so
ns_param nssha1 ${bindir}/nssha1.so
ns_param nscache ${bindir}/nscache.so
ns_param nsrewrite ${bindir}/nsrewrite.so

#---------------------------------------------------------------------
# nsopenssl will fail unless the cert files are present as specified
# later in this file, so it's disabled by default
ns_param nsopenssl ${bindir}/nsopenssl.so

# Full Text Search
#ns_param nsfts ${bindir}/nsfts.so

# PAM authentication
ns_param nspam ${bindir}/nspam.so

# LDAP authentication
#ns_param nsldap ${bindir}/nsldap.so

# These modules aren't used in standard OpenACS installs
#ns_param nsperm ${bindir}/nsperm.so
#ns_param nscgi ${bindir}/nscgi.so
#ns_param nsjava ${bindir}/libnsjava.so

if { [ns_info version] >= 4 } {
# Required for AOLserver 4.x
ns_param nsdb ${bindir}/nsdb.so
} else {
# Required for AOLserver 3.x
ns_param libtdom ${bindir}/libtdom.so
}

#---------------------------------------------------------------------
#
# Rollout email support
#
# These procs help manage differing email behavior on
# dev/staging/production.
#
#---------------------------------------------------------------------

ns_section ns/server/${server}/acs/acs-rollout-support

# EmailDeliveryMode can be:
# default: Email messages are sent in the usual manner.
# log: Email messages are written to the server's error log.
# redirect: Email messages are redirected to the addresses specified
# by the EmailRedirectTo parameter. If this list is absent
# or empty, email messages are written to the server's error log.
# filter: Email messages are sent in the usual manner if the
# recipient appears in the EmailAllow parameter, otherwise they
# are logged.

#ns_param EmailDeliveryMode redirect
#ns_param EmailRedirectTo somenerd@yourdomain.test, othernerd@yourdomain.test
#ns_param EmailAllow somenerd@yourdomain.test,othernerd@yourdomain.test

######################################################################
#
# End of instance-specific settings
#
# Nothing below this point need be changed in a default install.
#
######################################################################

#---------------------------------------------------------------------
#
# AOLserver's directories. Autoconfigurable.
#
#---------------------------------------------------------------------

#---------------------------------------------------------------------
# Where are your pages going to live ?
#
set pageroot ${serverroot}/www
set directoryfile index.tcl,index.adp,index.html,index.htm

#---------------------------------------------------------------------
# Global server parameters
#---------------------------------------------------------------------

ns_section ns/parameters
ns_param serverlog ${serverroot}/log/error.log
ns_param home $homedir
ns_param maxkeepalive 0
ns_param logroll on
ns_param maxbackup 5
ns_param debug $debug

ns_param HackContentType 1
ns_param URLCharset utf-8
ns_param OutputCharset utf-8
ns_param HttpOpenCharset utf-8
ns_param DefaultCharset utf-8

#---------------------------------------------------------------------
# Thread library (nsthread) parameters
#---------------------------------------------------------------------

ns_section ns/threads
ns_param mutexmeter true ;# measure lock contention
# The per-thread stack size must be a multiple of 8k for AOLServer to run under MacOS X
ns_param stacksize [expr 128 * 8192]

#
# MIME types.
#
# Note: AOLserver already has an exhaustive list of MIME types, but in
# case something is missing you can add it here.
#

ns_section ns/mimetypes
ns_param Default text/plain
ns_param NoExtension text/plain
ns_param .pcd image/x-photo-cd
ns_param .prc application/x-pilot
ns_param .xls application/vnd.ms-excel
ns_param .doc application/vnd.ms-word

#
# Tcl Configuration
#
ns_section ns/server/${server}/tcl
ns_param library ${serverroot}/tcl
ns_param autoclose on
ns_param debug $debug

#---------------------------------------------------------------------
#
# Server-level configuration
#
# There is only one server in AOLserver, but this is helpful when multiple
# servers share the same configuration file. This file assumes that only
# one server is in use so it is set at the top in the "server" Tcl variable
# Other host-specific values are set up above as Tcl variables, too.
#
#---------------------------------------------------------------------

ns_section ns/servers
ns_param $server $servername

#
# Server parameters
#
ns_section ns/server/${server}
ns_param directoryfile $directoryfile
ns_param pageroot $pageroot
ns_param maxconnections 5
ns_param maxdropped 0
ns_param maxthreads 5
ns_param minthreads 5
ns_param threadtimeout 120
ns_param globalstats false ;# Enable built-in statistics
ns_param urlstats false ;# Enable URL statistics
ns_param maxurlstats 1000 ;# Max number of URL's to do stats on
#ns_param directoryadp $pageroot/dirlist.adp ;# Choose one or the other
#ns_param directoryproc _ns_dirlist ;# ...but not both!
#ns_param directorylisting fancy ;# Can be simple or fancy

#
# Special HTTP pages
#

ns_param NotFoundResponse "/global/file-not-found.html"
ns_param ServerBusyResponse "/global/busy.html"
ns_param ServerInternalErrorResponse "/global/error.html"

#---------------------------------------------------------------------
#
# ADP (AOLserver Dynamic Page) configuration
#
#---------------------------------------------------------------------

ns_section ns/server/${server}/adp
ns_param map /*.adp ;# Extensions to parse as ADP's
#ns_param map "/*.html" ;# Any extension can be mapped
ns_param enableexpire false ;# Set "Expires: now" on all ADP's
ns_param enabledebug $debug ;# Allow Tclpro debugging with "?debug"
ns_param defaultparser fancy

ns_section ns/server/${server}/adp/parsers
ns_param fancy ".adp"

#---------------------------------------------------------------------
#
# Socket driver module (HTTP) -- nssock
#
#---------------------------------------------------------------------

ns_section ns/server/${server}/module/nssock
ns_param timeout 120
ns_param address $address
ns_param hostname $hostname
ns_param port $httpport

#---------------------------------------------------------------------
#
# OpenSSL
#
#---------------------------------------------------------------------

ns_section "ns/server/${server}/module/nsopenssl"

ns_param ModuleDir ${serverroot}/etc/certs

# NSD-driven connections:
ns_param ServerPort $httpsport
ns_param ServerHostname $hostname
ns_param ServerAddress $address
ns_param ServerCertFile certfile.pem
#ns_param ServerCertFile athena2.pem
ns_param ServerKeyFile keyfile.pem
ns_param ServerProtocols "SSLv2, SSLv3, TLSv1"
ns_param ServerCipherSuite "ALL:!ADH:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP"
ns_param ServerSessionCache false
ns_param ServerSessionCacheID 1
ns_param ServerSessionCacheSize 512
ns_param ServerSessionCacheTimeout 300
#ns_param ServerPeerVerify true
ns_param ServerPeerVerify false
ns_param ServerPeerVerifyDepth 3
ns_param ServerCADir ca
ns_param ServerCAFile ca.pem
ns_param ServerTrace false

# For listening and accepting SSL connections via Tcl/C API:
ns_param SockServerCertFile certfile.pem
#ns_param SockServerCertFile athena2.pem
ns_param SockServerKeyFile keyfile.pem
ns_param SockServerProtocols "SSLv2, SSLv3, TLSv1"
ns_param SockServerCipherSuite "ALL:!ADH:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP"
ns_param SockServerSessionCache false
ns_param SockServerSessionCacheID 2
ns_param SockServerSessionCacheSize 512
ns_param SockServerSessionCacheTimeout 300
#ns_param SockServerPeerVerify true
ns_param SockServerPeerVerify false
ns_param SockServerPeerVerifyDepth 3
ns_param SockServerCADir internal_ca
ns_param SockServerCAFile internal_ca.pem
ns_param SockServerTrace false

# Outgoing SSL connections
ns_param SockClientCertFile certfile.pem
#ns_param SockClientCertFile athena2.pem
ns_param SockClientKeyFile keyfile.pem
ns_param SockClientProtocols "SSLv2, SSLv3, TLSv1"
ns_param SockClientCipherSuite "ALL:!ADH:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP"
ns_param SockClientSessionCache false
ns_param SockClientSessionCacheID 3
ns_param SockClientSessionCacheSize 512
ns_param SockClientSessionCacheTimeout 300
ns_param SockClientPeerVerify true
ns_param SockClientPeerVerify false
ns_param SockServerPeerVerifyDepth 3
ns_param SockClientCADir ca
ns_param SockClientCAFile ca.pem
ns_param SockClientTrace false

# OpenSSL library support:
#ns_param RandomFile /some/file
ns_param SeedBytes 1024

#---------------------------------------------------------------------
#
# Database drivers
# The database driver is specified here.
# Make sure you have the driver compiled and put it in {aolserverdir}/bin
#
#---------------------------------------------------------------------

ns_section "ns/db/drivers"
if { $database == "oracle" } {
ns_param ora8 ${bindir}/ora8.so
} else {
ns_param postgres ${bindir}/nspostgres.so ;# Load PostgreSQL driver
}

#
# Database Pools: This is how AOLserver ``talks'' to the RDBMS. You need
# three for OpenACS: main, log, subquery. Make sure to replace ``yourdb''
# and ``yourpassword'' with the actual values for your db name and the
# password for it, if needed.

# AOLserver can have different pools connecting to different databases
# and even different database servers.
#
ns_section ns/db/pools
ns_param pool1 "Pool 1"
ns_param pool2 "Pool 2"
ns_param pool3 "Pool 3"

ns_section ns/db/pool/pool1
ns_param maxidle 1000000000
ns_param maxopen 1000000000
ns_param connections 5
ns_param verbose $debug
ns_param extendedtableinfo true
ns_param logsqlerrors $debug
if { $database == "oracle" } {
ns_param driver ora8
ns_param datasource {}
ns_param user $db_name
ns_param password $db_password
} else {
ns_param driver postgres
ns_param datasource ${db_host}:${db_port}:${db_name}
ns_param user $db_user
ns_param password ""
}

ns_section ns/db/pool/pool2
ns_param maxidle 1000000000
ns_param maxopen 1000000000
ns_param connections 5
ns_param verbose $debug
ns_param extendedtableinfo true
ns_param logsqlerrors $debug
if { $database == "oracle" } {
ns_param driver ora8
ns_param datasource {}
ns_param user $db_name
ns_param password $db_password
} else {
ns_param driver postgres
ns_param datasource ${db_host}:${db_port}:${db_name}
ns_param user $db_user
ns_param password ""
}

ns_section ns/db/pool/pool3
ns_param maxidle 1000000000
ns_param maxopen 1000000000
ns_param connections 5
ns_param verbose $debug
ns_param extendedtableinfo true
ns_param logsqlerrors $debug
if { $database == "oracle" } {
ns_param driver ora8
ns_param datasource {}
ns_param user $db_name
ns_param password $db_password
} else {
ns_param driver postgres
ns_param datasource ${db_host}:${db_port}:${db_name}
ns_param user $db_user
ns_param password ""
}

ns_section ns/server/${server}/db
ns_param pools "*"
ns_param defaultpool pool1

ns_section ns/server/${server}/redirects
ns_param 404 "global/file-not-found.html"
ns_param 403 "global/forbidden.html"

#---------------------------------------------------------------------
#
# Access log -- nslog
#
#---------------------------------------------------------------------

ns_section ns/server/${server}/module/nslog
ns_param debug false
ns_param dev false
ns_param enablehostnamelookup false
ns_param file ${serverroot}/log/${server}.log
ns_param logcombined true
ns_param extendedheaders COOKIE
#ns_param logrefer false
#ns_param loguseragent false
ns_param maxbackup 1000
ns_param rollday *
ns_param rollfmt %Y-%m-%d-%H:%M
ns_param rollhour 0
ns_param rollonsignal true
ns_param rolllog true

#---------------------------------------------------------------------
#
# nsjava - aolserver module that embeds a java virtual machine. Needed to
# support webmail. See http://nsjava.sourceforge.net for further
# details. This may need to be updated for OpenACS4 webmail
#
#---------------------------------------------------------------------

ns_section ns/server/${server}/module/nsjava
ns_param enablejava off ;# Set to on to enable nsjava.
ns_param verbosejvm off ;# Same as command line -debug.
ns_param loglevel Notice
ns_param destroyjvm off ;# Destroy jvm on shutdown.
ns_param disablejitcompiler off
ns_param classpath /usr/local/jdk/jdk118_v1/lib/classes.zip:${bindir}/nsjava.jar:${pageroot}/webmail/java/activation.jar:${pageroot}/webmail/java/mail.jar:${pageroot}/webmail/java

#---------------------------------------------------------------------
#
# CGI interface -- nscgi, if you have legacy stuff. Tcl or ADP files inside
# AOLserver are vastly superior to CGIs. I haven't tested these params but they
# should be right.
#
#---------------------------------------------------------------------

#ns_section "ns/server/${server}/module/nscgi"
# ns_param map "GET /cgi-bin/ /web/$server/cgi-bin"
# ns_param map "POST /cgi-bin/ /web/$server/cgi-bin"
# ns_param Interps CGIinterps

#ns_section "ns/interps/CGIinterps"
# ns_param .pl "/usr/bin/perl"

#---------------------------------------------------------------------
#
# PAM authentication
#
#---------------------------------------------------------------------

ns_section ns/server/${server}/module/nspam
ns_param PamDomain "aolserver"

ns_log notice "nsd.tcl: finished reading config file."

-- What are the authentication, kernel and main site parameter settings under /acs-admin/?

Are there particular values you're interested in? I tried to copy/paste the pages but the values in the edit boxes don't copy. I won't type them all up unless you really want to see them all...

How many authorities exist for your installation? Does it make a difference if you deactivate your URZ Heidelberg or Extern authority?

I haven't tried this; I don't know much about external authentication and it sounds like something I probably can't do during the day. But since virtually every page in the site is slow, this is an unlikely culprit, isn't it?

35: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Janine Ohmer on 04/08/04 06:31 PM

Nima already pointed out that I forgot to strip out the database password, and it has already been changed. No need to panic. :)

36: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Nima Mazloumi on 04/08/04 06:44 PM

The differences I have seen to my installation so far:

1. sample request info:
It says 2215 ms for request duration, 2146.0 ms for the /dotlrn/www/index.adp page with a total of 704 ms for database stuff.

Mannheim:
49 database commands totalling 677 ms
page served in 1091 ms

Question: The database stuff seems equivalent. So where is the 1sec loss in Heidelberg?

2. config.tcl

Should not be the problem since, as far as I know, you don't have many active connections at present anyway:

Heidelberg:
ns_param maxconnections 5
ns_param maxdropped 0
ns_param maxthreads 5
ns_param minthreads 5

Mannheim:
ns_param maxconnections 100
ns_param maxdropped 0
ns_param maxthreads 50
ns_param minthreads 50
ns_param threadtimeout 3600

Database settings:

Heidelberg (3 x)
ns_param connections 5

Mannheim (3 x)
ns_param connections 10

3. authentication, kernel and main site parameter settings
I don't know yet. Just wanted to compare all of them. Can you save the HTML pages and send them to me? I will convert them and post them for you (leaving out the Heidelberg info stuff) - if you like.

4. authorities
I had performance changes with multiple authorities enabled. Just give it a try.

37: Re: Help Needed in Setting up .LRN to Scale (response to 36)

Posted by Janine Ohmer on 04/08/04 08:03 PM

Where did the extra time go? Good question! I just ran the same login again, and this time it looks like this:

+57.1 ms: Applied transformation from /web/product/www / dotlrn/index -> ? - 7.7 ms
+71.2 ms: Served file /web/product/packages/dotlrn/www/index.adp with adp_parse_ad_conn_file - 2937.4 ms
+3010.5 ms: Applied GET filter: (for /dotlrn/index ds_trace_filter) - 10.5 ms
returned filter_ok

So it's even slower this time, but the time spent in the database is still only 651 ms, 53 ms *less* than last time even though the overall time is roughly 800 ms longer.

The difference is all in serving the file, but there's nothing here to tell us why or how it took so much longer. I imagine that's where the difference is between our numbers and yours, too.
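To make the comparison concrete, here is a small sketch that subtracts the reported database time from the reported page time for the three measurements quoted so far. It assumes the Mannheim "page served in" figure is comparable to the Heidelberg index.adp serve time, which the thread does not confirm.

```python
# Timings copied from the posts above, in milliseconds.
# Assumption: "page_ms" for Mannheim (page served in 1091 ms) is comparable
# to the Heidelberg adp_parse_ad_conn_file serve times.
samples = {
    "Heidelberg, run 1": {"page_ms": 2146.0, "db_ms": 704.0},
    "Heidelberg, run 2": {"page_ms": 2937.4, "db_ms": 651.0},
    "Mannheim": {"page_ms": 1091.0, "db_ms": 677.0},
}


def non_db_ms(page_ms, db_ms):
    """Time spent serving the page that is not accounted for by database calls."""
    return page_ms - db_ms


for name, t in samples.items():
    rest = non_db_ms(t["page_ms"], t["db_ms"])
    share = 100.0 * rest / t["page_ms"]
    print(f"{name}: {rest:.1f} ms outside the database ({share:.0f}% of page time)")
```

The database share is nearly the same on both sites; the gap is in the remainder, which is where the discussion below concentrates.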

Hmm. Will have to think about this. Of course, if my hypothesis is correct and the system is out of RAM, then *everything* will be running somewhat slowly, so it's still possible that this is the problem, that nsd is just puttering along.

As far as the connections go, ideally those numbers would be increased but I don't think this system can handle any more connections, so I think it best to leave that as-is until we figure out the problem.

I will get back to you on the other questions.

38: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Janine Ohmer on 04/08/04 09:23 PM

I have put up a preliminary statspack report here: http://athena2.uni-heidelberg.de/statspack.html

It contains not quite four hours of data so it's not exactly a definitive report, but it's a start.

I only looked it over briefly and didn't make it all the way through (I have a meeting to go to) but the only thing that jumped out at me is that the memory usage in the shared pool is very low. It was 39% when I checked it last week so I cut the shared pool almost in half, and it has only made it up to 40%. Obviously some more trimming could be done here (80% utilization is ideal) but we're only talking about roughly 130 MB at this point so it's not enough to make a huge difference.

I'm not paying a huge amount of attention to the lists of slow SQL because the application is very nearly uniformly slow. We know that performance of .LRN sites is not always this bad, so I think we should be looking for more global problems and not getting bogged down in tuning individual queries (yet).

If anyone spots anything I missed or has a different interpretation, please let me know!

39: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Alfred Essa on 04/09/04 03:21 AM

We plan to set up a variety of configurations (e.g. (solaris, oracle), (solaris, postgres), (linux, oracle), (linux, postgres)) at MIT so we can get some performance benchmarks for .LRN. In the meantime, we strongly suggest that you plunk some money into increasing your RAM from the measly 2Gb. We should be able to identify the problem here, but you will not be able to support 30,000+ users on a Sun 280R with 2Gb RAM. It's not a "big Solaris server".

40: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Janine Ohmer on 04/09/04 05:31 AM

I have been experimenting with Oracle and my results seem to be taking me in a new direction.

I cut down the size of the Oracle SGA radically, from roughly 930 MB to about 46 MB. I did this with no regard at all for formulas; I just grabbed the numbers off of one of my Linux boxes, which runs a fairly busy Oracle site.

This is not an entirely fair test, because Solaris systems don't fully recover from having gone into swap without a reboot. But I did improve the memory situation; after nsd has been running a while things look like this:

Memory: 2048M real, 1041M free, 730M swap in use, 4855M swap free

There's still too much swap in use for my taste, but as I said, that's not going away without rebooting the system.

The good news, which is also the bad news, is that this did not change the site performance one iota. It's no worse than it was, but it's no faster either. So we just reclaimed a bunch of space (though perhaps a bit too drastically) but it didn't help either. Keeping in mind that a reboot still might help us out, it looked like time to move on to other ideas.

I still think that this is a system or database problem, not a site problem, simply because the performance is so uniformly bad. So instead of profiling the application, I took a known slow query (from /dotlrn/admin/users) and ran it in sqlplus, while running a variation on iostat at the same time. I got these results (edited to remove data we don't care about):

athena:/> iostat -xMne 1 60
                            extended device statistics       ---- errors --- 
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    1.4    2.0    0.1    0.0  0.2  0.1   56.5   19.4   0   2  40   0   0  40 c3t0d0
    0.0    3.0    0.0    0.0  0.0  0.0    0.0    7.8   0   1  40   0   0  40 c3t0d0
    0.0   26.0    0.0    0.2  0.0  0.3    0.0   10.2   0  26  40   0   0  40 c3t0d0
    0.0  107.0    0.0    0.8  0.0  1.0    0.0    9.2   0  98  40   0   0  40 c3t0d0
    0.0  125.0    0.0    1.0  0.0  1.0    0.0    7.7   0  97  40   0   0  40 c3t0d0
    0.0  144.0    0.0    1.1  0.0  1.0    0.0    6.9   0  95  40   0   0  40 c3t0d0
    0.0  139.0    0.0    1.1  0.0  0.9    0.0    6.4   0  90  40   0   0  40 c3t0d0
    0.0  141.0    0.0    1.1  0.0  1.0    0.0    6.7   0  95  40   0   0  40 c3t0d0
    0.0  134.0    0.0    1.0  0.0  1.0    0.0    7.2   0  92  40   0   0  40 c3t0d0
    0.0  149.0    0.0    1.2  0.0  1.0    0.0    6.5   0  97  40   0   0  40 c3t0d0
    0.0  144.0    0.0    1.1  0.0  1.0    0.0    6.8   0  97  40   0   0  40 c3t0d0
    0.0  140.0    0.0    1.1  0.0  1.0    0.0    7.4   0  96  40   0   0  40 c3t0d0
    0.0  147.0    0.0    1.1  0.0  1.0    0.0    6.6   0  97  40   0   0  40 c3t0d0
    0.0  156.0    0.0    1.2  0.0  1.0    0.0    6.2   0  97  40   0   0  40 c3t0d0
    0.0  136.0    0.0    1.1  0.0  1.0    0.0    7.3   0  96  40   0   0  40 c3t0d0
    0.0  108.0    0.0    0.8  0.0  1.0    0.0    9.1   0  98  40   0   0  40 c3t0d0
    0.0   92.0    0.0    0.7  0.0  0.9    0.0    9.4   0  87  40   0   0  40 c3t0d0
    0.0   45.0    0.0    0.4  0.0  0.6    0.0   14.4   0  37  40   0   0  40 c3t0d0
    0.0  108.0    0.0    0.8  0.0  1.0    0.0    9.1   0  98  40   0   0  40 c3t0d0
    0.0  112.0    0.0    0.9  0.0  1.0    0.0    8.7   0  98  40   0   0  40 c3t0d0
    0.0  106.0    0.0    0.8  0.0  1.0    0.0    9.7   0  98  40   0   0  40 c3t0d0
    0.0  108.0    0.0    0.8  0.0  1.0    0.0    8.9   0  97  40   0   0  40 c3t0d0
    0.0  109.0    0.0    0.9  0.0  1.0    0.0    9.0   0  98  40   0   0  40 c3t0d0
    0.0  111.0    0.0    0.9  0.0  1.0    0.0    9.3   0  98  40   0   0  40 c3t0d0
    0.0   44.0    0.0    0.3  0.0  0.4    0.0    8.8   0  39  40   0   0  40 c3t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0  40   0   0  40 c3t0d0
    0.0    3.0    0.0    0.0  0.0  0.0    0.0   12.3   0   2  40   0   0  40 c3t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0  40   0   0  40 c3t0d0

This device is the external disk array, which has both Oracle and /web on it.

There are a few interesting things to note here.

One is that there is basically no data being read from the array (r/s). This is good, because it means that all the data used for this query came from memory (we hope it didn't come from swap :).

Another is that a fair number of disk writes are happening (w/s). This is because there are a lot of log files being written to - redo, rollback, archive, trace files, web server logs... and they are all on this one RAID array.

The last interesting thing to note is that there are 40 software errors (s/w) being reported by the disk array. That's 40 total, probably since the last reboot, which is not a whole lot but it's 40 more than there should be. This is probably not important, but it might hint at a problem with one of the disks in the array.
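One more thing can be read off the w/s and Mw/s columns: the average size of each write. The sketch below does that division for three rows copied from the output above. Note that Mw/s is rounded to one decimal place in the iostat output, so the results are only approximate.

```python
def avg_write_kb(writes_per_s, mb_written_per_s):
    """Average size of one write in KB, from iostat's w/s and Mw/s columns."""
    if writes_per_s == 0:
        return 0.0
    return mb_written_per_s * 1024.0 / writes_per_s


# (w/s, Mw/s) pairs copied from three of the busy rows above.
rows = [(107.0, 0.8), (144.0, 1.1), (156.0, 1.2)]

for writes, mb in rows:
    print(f"{writes:6.1f} w/s at {mb:.1f} MB/s -> about {avg_write_kb(writes, mb):.1f} KB per write")
```

All three rows come out a little under 8 KB per write, so the array is being kept busy by many small writes rather than by volume: barely more than 1 MB/s is being written even when the device reports itself almost fully busy.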

The next thing to try here would be to start moving log and data files that are written to frequently to the internal disk, except it doesn't have a whole lot of room and I'm not sure I want to start doing that to a production system if I don't have to.... I'm going to see if we can get this recreated on another system, without the disk array, and see what happens.

I'm also going to give Oracle some of its SGA back; a very short statspack snapshot shows we're now using 95% of the shared pool, which is now too high.

The saga continues....

41: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Janine Ohmer on 04/09/04 05:41 AM

*sigh* I almost forgot to point out the most important piece of that iostat output. The %b column shows how busy the RAID array was at the time the snapshot was taken. And for a period of approximately 21 seconds, the array was over 90% busy. That looks like it could be part of the problem, which is why it seems worthwhile to try eliminating the array and see what happens.
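For anyone who wants to repeat this count on their own iostat output, here is a rough sketch. It pulls the %b field (the tenth column of `iostat -xn` data lines) and counts the one-second samples at or above a threshold. The list of values is copied by hand from the output posted earlier; the first row is skipped because iostat's first report is an average since boot rather than a one-second sample.

```python
def percent_busy(line):
    """Return the %b field (tenth column) of one `iostat -xn` data line."""
    return int(line.split()[9])


def count_at_or_above(values, threshold):
    """Number of samples whose %b is at or above the threshold."""
    return sum(1 for v in values if v >= threshold)


# One data line from the output above, to show the parsing.
sample_line = "0.0  107.0    0.0    0.8  0.0  1.0    0.0    9.2   0  98  40   0   0  40 c3t0d0"

# %b column of all 28 rows above, in order.
busy_pct = [2, 1, 26, 98, 97, 95, 90, 95, 92, 97, 97, 96, 97, 97,
            96, 98, 87, 37, 98, 98, 98, 97, 98, 98, 39, 0, 2, 0]

one_second_samples = busy_pct[1:]  # drop the since-boot average
print(percent_busy(sample_line))
print(count_at_or_above(one_second_samples, 90), "of", len(one_second_samples))
```

With a 90% threshold this counts 19 of the 27 one-second samples, which is in line with the "approximately 21 seconds" estimate.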

42: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Mike Sisk on 04/09/04 06:57 AM

After looking into disk I/O on this machine and doing more performance benchmarking my current thinking is that the problem isn't in the disk array after all.

Initially -- not knowing how the array is configured or its state -- I thought that slow disk writes and contention might be contributing to the problem. RAID 5 arrays are slow on writes and degraded arrays (those with failed or missing drives) are too, but neither looks to be a factor here.

It's a perplexing problem -- we're running several busy Oracle sites, one with a database in excess of 10GB and over 15 million hits per month on less hardware with much better performance.

43: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Andrew Piskorski on 04/10/04 06:23 PM

It would be interesting to find out what mysterious bug or configuration problem is causing this "Heidelberg's dotLRN runs pathetically slowly even when only one user is requesting one page at a time" problem.

But you may well be wasting your time. One, that problem is reasonably likely to be peculiar to your particular installation on that particular Sun box. Two, even if you fix the problem, and especially given that you're running a bunch of other software on the same low-end Sun box, that box is likely to still be far too wimpy to handle the loads you expect.

So why don't you just buy a dual Opteron or dual Xeon Linux box with 4 to 8 GB of RAM and a bunch of fast RAID disks, set that up, and do whatever further debugging and tuning you need to there? Maybe the mysterious performance problem will not reappear, which would be a nice bonus. If it does reappear on the new machine, that also would tell you something and might help your debugging. But the main point is that the few anecdotal reports we have from current high-volume dotLRN users seem to say that your current shared Sun box is unlikely to meet your needs, and that you're going to need a new machine anyway.

Of course, I suspect Mike and Janine know that, so what's the deal? No money in the hardware budget currently to buy a Linux box? Do you really think that shared Sun 280R will meet your client's needs? Or what?

Time constraints? Getting a new machine up and running is going to take quite some time, of course. (Even longer if the customer has bureaucratic purchasing rules.) But Furfly already has various other Oracle installations up and working, right? So how about setting up a Heidelberg Dev site on an entirely different machine, using known-good hardware and a known-good Oracle instance? If the mysterious problem re-appears there as well, then you know with about 99% certainty that it's not the hardware or Oracle, that the mysterious problem has got to be in your site's OpenACS, dotLRN, or AOLserver code or configuration.

44: Re: Help Needed in Setting up .LRN to Scale (response to 43)

Posted by Carl Robert Blesius on 04/10/04 08:37 PM

Thanks for the feedback Andrew.

It does indeed look like this has to do with this particular setup and has nothing to do with .LRN/OpenACS. We just did a comparison on a MUCH SMALLER Sun box (with cobwebs and all) and .LRN was faster than what we are experiencing now.

We are moving forward and will report when we find out the exact problem for posterity.

Carl

P.S. Mike wrote, "we're running several busy Oracle sites, one with a database in excess of 10GB and over 15 million hits per month on less hardware" and I am sure we can do the same thing with a .LRN site, it is just a question of some of the .LRN users cooperating on making it happen with gradual improvements over time.

45: Re: Help Needed in Setting up .LRN to Scale (response to 34)

Posted by Andrew Piskorski on 04/10/04 11:44 PM

Janine, your AOLserver threadtimeout of 120 s is much too low. You do have maxthreads set the same as minthreads (which in this case is probably good), but I don't remember whether that means the threadtimeout setting is ignored or not. Best to be safe and set threadtimeout to something much higher...

maxconnections, maxthreads, and minthreads all set to 5 also seems low, but if this is just for the Dev server and you plan to bump those up for Production then that's probably ok for now.

The 2.2 or 2.9 s shown above for the login page is mostly meaningless, as the time is all in adp_parse_ad_conn_file, which is normal on the very first hit of that page for the thread. The real question is how often does hitting that page give you the slow 2 s adp_parse_ad_conn_file time? Overall, it should be a very low percentage of times.

Normally, after you restart the server, it should run adp_parse_ad_conn_file once per page per thread, only, and then never again. But if your AOLserver is constantly creating and destroying new threads (because it's misconfigured), then adp_parse_ad_conn_file could be sucking up lots and lots of time - much more than just the 2 s per hit you saw on the login page, some pages can take 10 or 20 s or more, especially with slow Sparc CPUs.

That's all AOLserver Tuning 101 of course, but it is an easy mistake to make. From painful experience, I am very suspicious of your 120 s threadtimeout. I suspect that all 5 of your AOLserver threads are being killed and restarted every two minutes, which is an absolute performance killer - you really want to be sure you've ruled that out.
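To get a feel for why this matters, here is a deliberately crude toy model of the suspicion above. It assumes that every thread really is recycled once per threadtimeout interval and that each fresh thread then pays the first-hit cost once for every page it serves. Neither assumption is confirmed in this thread, and the page count of 20 is purely hypothetical.

```python
def recompile_overhead_s(threads, pages_in_use, first_hit_s, threadtimeout_s, window_s):
    """Upper-bound estimate of time lost to first-hit page costs in a time window.

    Toy model: each thread is recycled every threadtimeout_s seconds and, after
    each recycle, pays first_hit_s once for every one of pages_in_use pages.
    """
    recycles = window_s // threadtimeout_s
    return recycles * threads * pages_in_use * first_hit_s


# 5 threads and a 2 s first-hit cost as reported above; 20 pages is hypothetical.
for timeout in (120, 3600):
    lost = recompile_overhead_s(threads=5, pages_in_use=20, first_hit_s=2.0,
                                threadtimeout_s=timeout, window_s=3600)
    print(f"threadtimeout {timeout:4d} s: up to {lost:.0f} thread-seconds lost per hour")
```

Under these assumptions the 120 s setting costs thirty times as much per hour as a 3600 s setting, since the overhead scales with the number of recycles. The absolute numbers mean little; the point is only the ratio.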

46: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Nima Mazloumi on 04/11/04 12:17 AM

Janine, I think Andrew is right. Still IMHO I don't think that the Sun box used in Heidelberg or the Oracle instance running there is the main problem. I believe this simply because the IT department in Heidelberg surely has capable experts in both fields.

My feeling, after installing OpenACS and dotLRN more than a dozen times, is that the problem comes from OpenACS itself.

So one suggestion! Just to make sure that I am wrong! Please install on the same box a clean instance of OpenACS on another port using the same files on the machine but a new database service1 without any users batch synch'ed.

If I am right this instance will run as fast as lightning. And if it does, there was a misconfiguration in the OpenACS params. So please post your kernel, main site and authorities params for a quick check (maybe you can simply make screenshots to save you time).

Greetings,
Nima

47: Re: Help Needed in Setting up .LRN to Scale (response to 46)

Posted by Alfred Essa on 04/11/04 07:30 PM

Janine is just a resource that MIT is providing. Peter Marklund and Collaboraid are taking the lead in solving this problem so comments should be directed to them.

48: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Don Baccus on 04/11/04 10:29 PM

Yes, for Andrew's information let's make clear that this isn't a Furfly client so they don't have control over the hardware being used, etc, as they normally would when they host one of their own clients.

As Al mentions, Janine's wearing her MIT, not Furfly, hat on this one and presumably Mike took a look and chimed in as a personal favor.

This is very curious. If it takes more than two weeks to solve a bunch of us will probably spend our nights and mornings huddled around the box trying to figure out what's going on (since we'll be in Heidelberg).

49: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Janine Ohmer on 04/11/04 10:59 PM

Time for an update.

I've tried cutting down Oracle's SGA so that the system is no longer using swap. Didn't help. I've tried moving the temp datafile, which was causing a lot of disk activity on the external RAID array, to the internal drive. I've tweaked the net8 files, and looked at things every which way. Found a few things not quite right, but nothing to fix the problem.

One major issue I still need to resolve is that the Oracle is version 8.1.7.0. I'm going to install the 8.1.7.4 patchset as soon as someone in Germany can make the installation CD available, and I'm crossing my fingers that it will help.

I took the query from /dotlrn/admin/users and tweaked it every which way. What I found was that it's slow if it has to transfer rows from the database, even if it's only to do an order by. If there's no ordering and it's only returning a count then it's lightning fast, even with the two permission calls. Does this ring a bell for anyone?

I have no doubt that we will need to tweak the application to get this working, but until I can get decent performance out of sqlplus it seems rather pointless to try.

50: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Andrew Grumet on 04/12/04 02:05 AM

There's been so much discussion that I hope the table analysis hasn't fallen through the cracks. Just making sure, Janine, you did analyze the schema anew during your testing, right?

51: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Andrew Piskorski on 04/12/04 04:51 AM

Just what does "the performance is so uniformly bad" mean exactly? Re-reading some of their comments above, it sounds as if the folks on the spot are seeing really awful performance, on all pages, and from sqlplus as well as AOLserver.

Yet the only page load times I've seen mentioned above are 2.2 and 3.0 s, with 2.1 and 2.9 s (respectively) of that time taken up in adp_parse_ad_conn_file. But adp_parse_ad_conn_file should be a one time per page per thread overhead only, so taking that out, we should be left with about a 0.1 s or so page load time - quite respectable!

So I guess those two examples must not be representative of the overall problem? In which case, just what does the overall performance problem look like from the users' point of view - just how slow are these pages really?

52: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Dirk Gomez on 04/12/04 10:03 AM

(I'm on dialup currently, so I can't reply all that often)

Janine, please turn on timed_statistics (need to bounce the instance, about 1% overhead) and don't worry about frequent snapshots. Each snapshot takes only a small number of blocks to be written - this just pales in light of the expected load. If you put perfstat into its own tablespace you could measure it and you would NEVER worry about this again.

With timed_statistics=on all the empty columns will be filled with values and we will see on which wait event Oracle is losing time.

If you think it is the RAID or the Oracle instance which is slacking, do this: create a few files (large, mid-sized, small ones) and copy them around from a script. Measure the performance. Do the same with a few tables and a few access paths (table scan, index access, small commits) and measure the performance. What is the result?
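The file half of that test could look roughly like the sketch below. It is a minimal version that times synced writes of a few file sizes rather than copies, and the sizes are arbitrary placeholders. To test a particular volume, pass a directory on that volume instead of the temporary directory used here.

```python
import os
import tempfile
import time


def timed_write(directory, size_bytes, chunk=64 * 1024):
    """Write size_bytes of zeros to a new file in directory, fsync it,
    delete it, and return the throughput in MB/s."""
    data = b"\0" * chunk
    fd, path = tempfile.mkstemp(dir=directory)
    start = time.perf_counter()
    try:
        with os.fdopen(fd, "wb") as f:
            remaining = size_bytes
            while remaining > 0:
                n = min(chunk, remaining)
                f.write(data[:n])
                remaining -= n
            f.flush()
            os.fsync(f.fileno())  # force the data to disk before stopping the clock
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    return size_bytes / (1024 * 1024) / max(elapsed, 1e-9)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        for label, size in [("small", 4 * 1024), ("mid", 1024 * 1024), ("large", 16 * 1024 * 1024)]:
            print(f"{label:5s} {size:>10d} bytes: {timed_write(scratch, size):8.2f} MB/s")
```

Running it once against the RAID array and once against the internal disk gives a like-for-like comparison that does not involve Oracle at all.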

A question to the Heidelberg people: is this the same machine that serves the WebCT production system? Were/Are you happy with its performance?

Don't ignore the SQL results!

Look how many acs-service-contract queries there are. It almost looks like a denial-of-service attack: one query is executed 84,000 times in about 4 hours. Another one just checks whether there is one service contract - probably just to be able to gracefully tell the user "service-contract foo$$%$%"bar doesn't exist.

What about aggressive caching for service contracts (and replacing it for the next release - I don't like the package anyway because I think it is a complexifying replacement for tcl namespaces that is ALSO expensive)

What is this query about: select dotlrn_communities_all.*,
dotlrn_community.url(... ? It is *extremely* expensive.

The next query in the "ordered by gets" list is also extremely expensive: what does it do?

Number is probably using cc_users.

Can you at least try to cache the service-contracts and then take snapshots during the day when there is at least some activity on the system?
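The caching idea is language-independent, so here is a generic sketch of it in Python rather than OpenACS Tcl. It is not the OpenACS implementation: the class, the loader function and the contract name are all made up for illustration. The point is only that a lookup repeated 84,000 times needs to reach the database once per expiry interval.

```python
import time


class TTLCache:
    """Tiny time-based cache: call the loader at most once per key per max_age_s."""

    def __init__(self, loader, max_age_s, clock=time.monotonic):
        self._loader = loader
        self._max_age_s = max_age_s
        self._clock = clock
        self._store = {}  # key -> (time stored, value)

    def get(self, key):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._max_age_s:
            return entry[1]
        value = self._loader(key)
        self._store[key] = (now, value)
        return value


calls = []


def lookup_contract(name):
    """Stand-in for the expensive database query (hypothetical)."""
    calls.append(name)
    return {"name": name, "exists": True}


cache = TTLCache(lookup_contract, max_age_s=300)
for _ in range(84000):
    cache.get("hypothetical_contract")
print("database lookups:", len(calls))
```

The trade-off is staleness: a contract changed in the database is not seen until the entry expires or is flushed explicitly.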

53: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Don Baccus on 04/12/04 09:34 PM

How is a service contract just an expensive replacement for namespaces? Last time I looked service contracts allow methods to be implemented in SQL or Tcl and no Tcl namespace hack allows for that ...

Just to mention one difference.

How are they expensive? At startup time procs are built for each live method for each contract implementation and these are called directly when you invoke a method.

So there's a little startup time but not much else.

The main problem with service contracts is that they're hard to change once they're defined...

54: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Caroline Meeks on 04/13/04 12:23 AM

This is probably not the problem if you are seeing bad performance in sqlplus but if you have curriculum_bar_p in site-master.tcl try commenting it out and see if it makes a difference.

55: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Carl Robert Blesius on 04/20/04 07:29 AM

Just so people know: our Sun server is serving pages at an acceptable rate again (and we did not have to make any hardware changes). After a lot of painful detective work what actually solved our performance problem was some fine tuning of .LRN sql queries in Oracle (in addition to some fine tuning of Oracle itself).

We will make sure to get any changes that were made on our server back into the source and into the documentation as soon as they have been peer-reviewed.

Thanks for all the help everyone (special thanks to Janine/Sloan/Collaboraid/Dirk).

Will post the details soon here.

56: Re: Help Needed in Setting up .LRN to Scale (response to 55)

Posted by Lars Pind on 04/20/04 06:47 PM

We should mention also that switching to AOLserver 4 improved performance by about 30% as well.

57: Re: Help Needed in Setting up .LRN to Scale (response to 56)

Posted by C. R. Oldham on 04/20/04 07:11 PM

Lars,

Did you come from 3.3, or one of the 3.5 series?

--cro

58: Re: Help Needed in Setting up .LRN to Scale (response to 57)

Posted by Lars Pind on 04/21/04 11:58 AM

From 3.3.

59: Current performance status in Heidelberg (was Re: Help Needed in Setting up .LRN to Scale) (response to 55)

Posted by Martin Magerl on 11/23/04 01:44 PM

Hello,

just a short report regarding current performance status at University of Heidelberg.
On 21st October we moved AOLServer from our Sun, which hosted both the web server and Oracle at that time, to a Linux box, so our current infrastructure is:

- Front-End:
SuSe Linux 9.0
Processor: Athlon 1.8 GHz
Memory: 1.5 Gb
Network: Ethernet, 100Mb/s
DB-Client: Oracle
AOLServer: 4.0.8 (nsopenssl v3_0beta23 / tcl. 8.4.4)
OpenSSL: OpenSSL 0.9.7d 17 Mar 2004
MTA: Postfix 2.0.14

- DB-Server (as mentioned above):
Solaris 2.8 on Sun Fire 280r
Memory: 2 Gb
Disks: 2 * 36 Gb, 1 * 200 Gb raid
Network: Ethernet, 100Mb/s
DB-Server: Oracle 8.1i (patched)
Additional software running until end of year: webct

- Some facts about our dotLRN-installation:
40282 ACS-Users
2580 dotLRN-Users
22 class instances (current semester, about 40 total), 10 communities / 72 subcommunities
162641 ACS-objects, 114108 ACS-permissions and 11129 fs-objects

- Performance relevant AOLServer configuration parameters:
maxthreads 25
minthreads 20
(Changed minthreads != maxthreads, because there seem to be still some memory leakage issues)
threadtimeout 3600
stacksize 512 Kb
db-pool connections: first 20, second 10, third 5
keepalivetimeout 5
maxkeepalive 100
maxconnections 100

Some objective measurements of page response times before and after migration (calculated by measure-resonstimes.sh):

- /dotlrn
before: 2894 ms
after: 841 ms (internal measure by developer support: 80-100 ms)
- /dotlrn/calendar/cal-item-new
before: 2380 ms
after: 633 ms
- /dotlrn/manage-memberships:
before: 3597 ms
after: 1834 ms
- /dotlrn/classes/3520praktischeinformatik/3520urztest/3520urztest/
before: 7575 ms for class start page and 4589 ms for class file-storage
after: 3093 ms for class start page and 1335 ms for class file-storage
(1694 ms for class start page after removing subgroup- and homework-portlet!)

Altogether, each page became at least about twice as fast, although there might still be enough things to tune...
(E.g.: For very large query result sets AOLServer seems to need much more (exponential) time to parse the template via templating system compared to very fast output by manual ns_write commands.)
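For reference, the speedup per page can be computed directly from the before/after figures listed above. A minimal sketch, with the two class pages given short labels:

```python
# (before_ms, after_ms) copied from the measurements above.
timings_ms = {
    "/dotlrn": (2894, 841),
    "/dotlrn/calendar/cal-item-new": (2380, 633),
    "/dotlrn/manage-memberships": (3597, 1834),
    "class start page": (7575, 3093),
    "class file-storage": (4589, 1335),
}


def speedup(before_ms, after_ms):
    """How many times faster the page is after the migration."""
    return before_ms / after_ms


for page, (before, after) in timings_ms.items():
    print(f"{page:32s} {before:5d} -> {after:5d} ms  ({speedup(before, after):.2f}x)")
```

The factors range from just under 2x (manage-memberships) to roughly 3.8x (cal-item-new).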

60: Re: Current performance status in Heidelberg (was Re: Help Needed in Setting up .LRN to Scale) (response to 59)

Posted by Andrew Piskorski on 11/23/04 03:48 PM

Oh, there's something unnecessarily O(n^2) or worse in the Templating System Tcl code? That's bad. Martin, could you please give more details on exactly what pages and queries demonstrate that, so that someone will be able to track it down?

61: Re: Help Needed in Setting up .LRN to Scale (response to 1)

Posted by Don Baccus on 11/23/04 05:54 PM

Yes, please track this down ... also templates are only parsed once and then cached so this doesn't make too much sense, to be honest. Unless there are issues with the underlying put operations that add to the output string, within Tcl itself.

But whatever information you can give us would be useful.

How much space are you setting up for ns_cache? Have you monitored performance to make sure it's large enough to be caching everything?

62: Re: Help Needed in Setting up .LRN to Scale (response to 61)

Posted by Martin Magerl on 11/24/04 04:42 PM

Hi Don, hi Andrew!

Yes, to be honest, with "exponential" I went really over the top.
This kind of behavior was observed, when we ran AOLServer3.3 on Sun Solaris. There we often had, regardless of special pages/nodes, statistics like less than 800ms for db queries, but more than 10000 ms total time for request processor.
Avoiding any expression of severity :), I noticed the following performance facts:
- Pages not belonging to dotLRN, or at least not portal-rendered by it, run faster, i.e. db query time and request processor delivery time get very close to each other.
This makes sense, because rendering portals requires "some" extra steps to be done.
- Especially the memberportlets of dotLRN-(sub-)groups have some weird statistic values (ok, about 200 members in this example):
25 database commands totalling 607 ms
page served in 12360 ms

In contrast, the subcomm's member administration page, which additionally contains the super comm's users not yet included in the subcommunity, shows statistics like:
20 database commands totalling 5333 ms
page served in 9805 ms

I wonder if this behavior may be caused by nested loops, in which case it would not be a problem of the templating system itself (maybe I should do a diff against the dotLRN 2.1 queries... upgrading to 2.1 soon :).

Regarding the templating system, I made a simple performance comparison by just displaying some information for a set of dotLRN users. For the first check I used ns_write output and for the second one the templating system with multirow. Results (manually measured by clock):
- 2500 Users:
a) ns_write: 7 seconds total (including db query)
b) template: 10 seconds total
- 5000 Users:
a) ns_write: 20 seconds
b) template: 30 seconds
- 7500 Users:
a) ns_write: 42 seconds
b) template: 62 seconds

Maybe I have to take into account that some of the extra seconds arise because the templating system builds the complete HTML result before sending it back to the browser, so ns_write has a little head start.
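
As a rough check on how these timings scale, the sketch below estimates a growth exponent from the first and last measurement and the template/ns_write ratio at each size. It uses only the hand-clocked numbers above, so with three data points per method the result is an estimate at best; the helper name `growth_exponent` is just for illustration.

```python
import math

# Manually clocked timings from the list above: number of users -> seconds.
ns_write = {2500: 7, 5000: 20, 7500: 42}
template = {2500: 10, 5000: 30, 7500: 62}


def growth_exponent(n1, t1, n2, t2):
    """Exponent k such that t2/t1 == (n2/n1)**k, i.e. time grows like n**k."""
    return math.log(t2 / t1) / math.log(n2 / n1)


for label, timings in (("ns_write", ns_write), ("template", template)):
    k = growth_exponent(2500, timings[2500], 7500, timings[7500])
    print(f"{label}: time grows roughly like n^{k:.2f}")

for n in sorted(ns_write):
    print(f"{n} users: template/ns_write = {template[n] / ns_write[n]:.2f}")
```

Both variants come out with an exponent between 1.6 and 1.7, so worse than linear but far from exponential, and the templating system costs a fairly steady factor of about 1.4 to 1.5 over ns_write at every size.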

Don, you mentioned the space set up for ns_cache. Do you mean the kernel parameter Memoize-MaxSize? It was 200000 and I set it to 300000, not knowing whether this is a reasonable value.
ns_cache stats say:
Cache           Max      Current  Entries  Flushes  Hits     Misses  Hit Rate
util_memoize    300000   299932   2229     5685     2222425  77774   96%
secret_tokens   32768    4080     102      0        2326     102     95%
nsfp:product    5120000  1364032  61       0        10550    61      99%
ns:dnshost      100      0        0        0        0        0       0%
ns:dnsaddr      100      1        1        0        7        1       87%
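
For reference, a small sketch of how these numbers can be read: it recomputes the hit rate from the Hits and Misses columns and shows how close each cache is to its Max. The rows are copied from the stats above; truncating the hit rate to a whole percent is an assumption that happens to reproduce the reported column, and the helper names are illustrative.

```python
# Rows copied from the ns_cache stats above:
# (cache, max, current, entries, flushes, hits, misses)
rows = [
    ("util_memoize", 300000, 299932, 2229, 5685, 2222425, 77774),
    ("secret_tokens", 32768, 4080, 102, 0, 2326, 102),
    ("nsfp:product", 5120000, 1364032, 61, 0, 10550, 61),
    ("ns:dnshost", 100, 0, 0, 0, 0, 0),
    ("ns:dnsaddr", 100, 1, 1, 0, 7, 1),
]


def hit_rate(hits, misses):
    """Whole-percent hit rate, truncated; 0 when the cache was never queried."""
    total = hits + misses
    return 100 * hits // total if total else 0


def fill_percent(current, maximum):
    """How much of the configured maximum size is in use, in percent."""
    return 100.0 * current / maximum


for name, maximum, current, entries, flushes, hits, misses in rows:
    print(f"{name}: hit rate {hit_rate(hits, misses)}%, "
          f"{fill_percent(current, maximum):.2f}% full, {flushes} flushes")
```

The interesting line is util_memoize: it sits at 99.98% of its Max even after the increase to 300000, which fits Don's question about whether the cache is large enough.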

Is this MaxSize parameter limited by the ns_configured StackSize (right now 512 KB), or is it independent?

What about nsv_buckets? Are those performance relevant?
nsv:7:product          17  3046879   44932   1.47468934605
nsv:6:product          18  1743471   221     0.0126758632636
nsv:5:product          19  33300250  120067  0.360558854663
nsv:4:product          20  455283    15      0.00329465409427
nsv:3:product          21  5668477   29275   0.516452655625
nsv:2:product          22  1415574   1310    0.0925419653088
nsv:1:product          23  763659    63      0.00824975545368
nsv:0:product          24  306966    6       0.00195461386603
ns:cache:util_memoize  81  2337795   4334    0.185388368099
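
A note on reading these rows, as a sketch: assuming the third column is the number of lock attempts and the fourth the number of times the mutex was found busy (those column names are an assumption, the output above has no header), the last column is simply busy/locks as a percentage. The rows below are copied from the listing; `contention_percent` is an illustrative helper.

```python
# Rows copied from the mutex listing above: (name, id, locks, busy, reported).
# "locks" and "busy" are assumed column meanings; the listing has no header.
rows = [
    ("nsv:7:product", 17, 3046879, 44932, 1.47468934605),
    ("nsv:6:product", 18, 1743471, 221, 0.0126758632636),
    ("nsv:5:product", 19, 33300250, 120067, 0.360558854663),
    ("nsv:4:product", 20, 455283, 15, 0.00329465409427),
    ("nsv:3:product", 21, 5668477, 29275, 0.516452655625),
    ("nsv:2:product", 22, 1415574, 1310, 0.0925419653088),
    ("nsv:1:product", 23, 763659, 63, 0.00824975545368),
    ("nsv:0:product", 24, 306966, 6, 0.00195461386603),
    ("ns:cache:util_memoize", 81, 2337795, 4334, 0.185388368099),
]


def contention_percent(locks, busy):
    """Share of lock attempts that found the mutex busy, in percent."""
    return 100.0 * busy / locks


for name, mutex_id, locks, busy, reported in rows:
    print(f"{name}: {contention_percent(locks, busy):.4f}% contention")

worst = max(rows, key=lambda r: contention_percent(r[2], r[3]))
print("most contended:", worst[0])
```

Read this way, even the busiest bucket (nsv:7:product) was found busy on fewer than 2% of lock attempts.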

Don't know whether mutex locks still use them...

Thanks for your answers & help, and sorry for this exaggerated, untrue statement about performance severity (dreaming of O(log(n)) 😊 ).

Martin

P.S.: Just one O(n^2) left: Our logger installation:
Only 184 entries, but about 100 seconds to display index page... but that's really a problem of logger itself.

63: Re: Help Needed in Setting up .LRN to Scale (response to 62)

Posted by Andrew Piskorski on 11/25/04 05:36 PM

Martin, you just dumped a whole lot of info on us, but I don't see some of the simplest and most important stuff. Do you have the OpenACS Developer Support package installed? If not, install it right away.

Your first order of business is to determine where and how AOLserver is spending its time, and so far I don't think you've done that. You posted your AOLserver thread settings above, good.

Now, find a particularly slow page. Hit it, and look at the Developer Support data. Very Important: Note whether or not this was the first time this thread served this page. Developer Support currently doesn't tell you this directly, so this isn't quite as simple as it could be, but by looking at the Developer Support info and/or the AOLserver log, you should be able to figure it out.

Now hit the same page again, and get it to run in a Thread which has served this same page before. Compare the Developer Support numbers between the 1st-time-in-this-thread and Nth-time hits on that page. This is key.

For all hits other than the 1st hit per thread per page, the page should be fast. If it is not, that is interesting and we want to know why. If only the 1st hit per thread per page is slow, and nearly all the time is being taken up in adp_parse_ad_conn_file, then that's normal.

That's why Don was asking you about ns_cache, etc. above. If your cache isn't big enough, presumably the cached compiled Templating System pages might get thrown out, and then you'd end up running the (expensive) adp_parse_ad_conn_file stuff over and over again many times per page per thread - not good.

Yes, nsv_buckets can certainly be performance relevant, but it's very unlikely that your slow pages are being caused by mutex contention for the nsv buckets. If you want to check, make sure you have "ns_param mutexmeter 1" in "ns_section ns/threads" in your AOLserver config file, then use the AOLserver nstelemetry page to check for lock contention.