Forum .LRN Q&A: Re: Help Needed in Setting up .LRN to Scale

I have been posting the details of our investigation to the internal ticket tracker for this project. In the interest of getting some more eyes on the problem, here is what I have discovered/thought as we went along. I am going to go run another statspack report after I post this and will be back with that after I have looked at it.

---------------------------------------
Based on the system specs Mat sent I think that if we cannot add RAM to this box then we may actually need to reduce the amount of space allocated to Oracle. I have set up statspack and taken a very quick snapshot of loading my own My Space page (and whatever else happened to go on during that time). This is not a very large sample but when I did this for Sloanspace it did help us pinpoint problems. One thing it hopefully will tell me is whether we have excess memory and can cut it back.

To be clear, I don't think this is the whole problem but it is certainly a contributing factor. In my opinion we need to get problems like this cleared up before we start tuning the application.

I'm going to go off now and study the report, which may take some time.
---------------------------------------
Because there is so little data in the report, I can't tell a whole lot about what our performance issues might be. But one thing is clear - we've got too much memory allocated to Oracle. The current size of the shared pool is 250,270,105 bytes, and at the moment I took the snapshot we were using 40% of it. That number is supposed to be between 75% and 85% for optimal performance. That, combined with our memory shortage, points to this being a number we should definitely change.

The number of bytes actually in use was 100,108,042, which is 75% of 133,477,389. Unless I hear any objections, I'll shut down the site and Oracle and change the shared pool size to 133,477,389 bytes. It may not be enough of a change to make much difference, considering we have almost 2 GB of swap being used, but it's the right thing to do in any case.
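
For anyone who wants to check the arithmetic, here is a minimal sketch. The function names are made up for illustration, and the 75% to 85% band is just the rule of thumb quoted above, not something Oracle enforces.

```python
def utilization_pct(used_bytes: int, pool_bytes: int) -> float:
    """Percentage of the shared pool that is in use."""
    return used_bytes * 100 / pool_bytes


def pool_size_for(used_bytes: int, target_pct: int) -> int:
    """Pool size in bytes (rounded down) at which used_bytes is target_pct of the pool."""
    return used_bytes * 100 // target_pct


current_pool = 250_270_105  # shared pool size from the statspack report
in_use = 100_108_042        # bytes in use when the snapshot was taken

print(f"current utilization: {utilization_pct(in_use, current_pool):.0f}%")
print(f"pool size for 75% utilization: {pool_size_for(in_use, 75):,} bytes")
print(f"pool size for 85% utilization: {pool_size_for(in_use, 85):,} bytes")
```

Sizing for 75% rather than 85% leaves the most headroom while staying inside the band.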

This is not necessarily the only change we'll want to make to the Oracle configuration, but the site needs to run a bit so I can take another snapshot with some better numbers in it. I think that the sort_area is probably too small, and the db_block_buffers might be too large, but I don't want to change them without some data to back it up. However, I think that even when all the tuning is done, we're still going to need more RAM for this system.

After I make the change to the shared pool size, the next step will be to start looking at the application. I am assuming that you want me to do this, and not just stick to Oracle tuning - let me know if that is not right.

I will wait about 15 minutes for objections and then make this change.
---------------------------------------
Ok, change has been made. Some stats:

With both Oracle and nsd shut down:

Memory: 2048M real, 1283M free, 675M swap in use, 4912M swap free

With Oracle running and nsd shut down:

Memory: 2048M real, 337M free, 1638M swap in use, 3948M swap free

With both running, after nsd had finished initializing:

Memory: 2048M real, 266M free, 1721M swap in use, 3865M swap free

So basically, there is a limit to what we can do here because the system is still using swap even with everything we are running on the box turned off! That might clear up with a reboot, but I expect it would happen again over time.

I will revisit this issue when I have more statspack data to work with but I think it's clear we aren't going to win this one without more RAM. Time to look at the application and see if there's anything we can do there.
---------------------------------------
I have examined several queries in detail, but no silver bullet has been found so far. The only thing that jumps out at me is that it has been a while since tables were last analyzed:

SQL> select last_analyzed from user_tables where table_name = 'ACS_OBJECT_TYPES';

LAST_ANALY
----------
2004-02-10

It would be a good idea to do this weekly, if not more often.
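
A rough sketch of how a weekly check could flag stale statistics, assuming the last_analyzed dates have already been pulled from user_tables. The helper name and the "today" date are hypothetical; only the ACS_OBJECT_TYPES date comes from the output above.

```python
from datetime import date, timedelta


def stale_tables(last_analyzed: dict, today: date, max_age_days: int = 7) -> list:
    """Return the names of tables whose statistics are older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name
        for name, analyzed in last_analyzed.items()
        if analyzed is None or analyzed < cutoff
    )


# Only this one date is real; add the other tables as needed.
last_analyzed = {"ACS_OBJECT_TYPES": date(2004, 2, 10)}
today = date(2004, 3, 1)  # hypothetical placeholder; substitute the real current date

for table in stale_tables(last_analyzed, today):
    # Emit the statement to run in sqlplus; nothing is executed here.
    print(f"ANALYZE TABLE {table} COMPUTE STATISTICS;")
```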

#1 - the dotlrn_users query in /dotlrn/admin/users

This query is *horribly* slow and does three full table scans. Unfortunately, none of my usual tricks worked to eliminate the scans.

#2 - the call to dotlrn_community_admin_p is the culprit here. Again, I was unable (so far) to make it run any faster.

However.... I have not given up, and I will continue working on this on Monday (possibly some on Saturday if I have time). It took a while to hit pay dirt on Sloanspace too; unfortunately (or fortunately, depending on your point of view) this installation doesn't have the Oracle misconfiguration that turned out to be responsible for a lot of our troubles on Sloanspace.
---------------------------------------
I have been thinking about this all weekend, and I kept coming back to the fact that the system is not heavily loaded, yet performance is poor. A situation that can be helped by tuning queries generally exhibits other signs of stress - high system load and Oracle processes using lots of CPU time. Not so here.

I asked Mike to take a look; he ran various OS tools looking at performance while I loaded the /dotlrn/admin/users page over and over. Mike believes he has found a potential problem. Here is what he wrote up for me, and I will comment further after:

"This looks like a disk I/O based performance problem.

The device to pay attention to is sd30 -- an external SCSI-attached disk array.

iostat shows that a large amount of disk I/O results when the page is loaded; kps is total traffic in kilobytes per second, tps is total transactions per second, and serv is service time (disk seek time) in milliseconds.

The disk service time is fine which tells us the disk array is not overloaded and the time to seek from the disk is reasonably speedy.

The ratio between the kps and tps tells us about file sizes -- in this case it looks like a lot of large files are being transferred when the page loads.

This looks to be a case where disk I/O bandwidth isn't sufficient for the query; multiple spindles are needed and the load should be divided between multiple disks (for example, sd30 has both /web and /ora8 which means the same disk is being hit to read from Oracle, write web access logs and transaction logs, as well as reading the html)."
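
The kps-to-tps ratio Mike mentions is just the average number of kilobytes moved per disk transaction. A minimal sketch of that calculation follows; the sample numbers are invented for illustration and are not taken from our iostat output.

```python
def avg_transfer_kb(kps: float, tps: float) -> float:
    """Average kilobytes moved per disk transaction; 0.0 when the disk is idle."""
    if tps == 0:
        return 0.0
    return kps / tps


# Hypothetical iostat samples: (label, kps, tps)
samples = [
    ("many small transfers", 400.0, 100.0),
    ("few large transfers", 6400.0, 100.0),
]

for label, kps, tps in samples:
    print(f"{label}: {avg_transfer_kb(kps, tps):.1f} KB per transaction")
```

Strictly speaking the ratio gives the average size of each I/O request rather than of each file, but a high value still points to large sequential reads or writes.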

Mike didn't see any signs of swapping going on during our tests.

Here's my version: a lot of data is going back and forth between the system and that disk array. Data gets read from Oracle tables, and intermediate results get written to the temporary tablespace. Oracle writes to its redo, rollback and archive logs, and nsd writes to its error and access logs. It appears that there is just so much data going through that one connection to the disk array that we're experiencing a traffic jam.

Now, it seems a bit odd to me that Oracle is doing this much disk access... I would have expected it and nsd to both keep this data in memory, especially as I reload the same page over and over again. I don't know off the top of my head how to tell how much of the database Oracle has got in memory; that will be tomorrow's research project, along with looking at another statspack report.

I'm not sure what to recommend as a course of action to fix this, assuming we end up agreeing that this is the problem, because I don't know what our options are. Do we have any other systems available which might be more suitable?
---------------------------------------
One thing that bothers me about this forming hypothesis is that we don't see any swap activity during page loads. It seems that we should, if we're going to blame the site's slowness on a disk I/O bottleneck. So I took the query from the /dotlrn/admin/users page and ran it in sqlplus, running iostat at the same time to monitor disk activity. This time I saw *lots* of disk activity on the swap device.

So what does this tell us? For one thing, I think it confirms the theory that the memory Oracle is using resides in the swap partition and not in RAM. That's a guaranteed performance killer, so we definitely have to fix that. It also tells us that some caching is happening somewhere, because when I load that page and the same query executes, there is very little swap activity. Unfortunately this doesn't explain why the page load is so slow anyway... the cache may also be out in the swap partition but that doesn't fully explain it.

At this point I believe that if we could bump the RAM in this system up to at least 4 GB it would help considerably. Mike also feels that there is too much disk activity going to one place - all those log files (nsd and Oracle) should be split up between at least two disks, preferably on separate channels.

In my opinion, it doesn't make sense to continue tuning queries or looking at the finer points of the Oracle installation until the hardware is adequate to support the site; as I saw on Friday, the efforts are unlikely to result in any improvement.
---------------------------------------
Matthais, I'm not sure I understand the question, so let me just state clearly what I think we need to do.

First, if we are going to remain on this system we need more RAM. The system needs to have at least 4 GB (total) just to stop it from using any swap space, and it would be better if we had an extra GB or two (meaning 5 or 6 total) to have room for growth. If we have enough RAM, then everything that is supposed to be loaded into RAM, like Oracle's working area, will be and performance will be much improved.
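
One way to sanity-check the 4 GB figure is to add the RAM in use to the swap in use and round up. This is a crude sketch, assuming everything sitting in swap would otherwise live in RAM; the function names are made up for illustration.

```python
import math


def memory_demand_mb(real_mb: int, free_mb: int, swap_used_mb: int) -> int:
    """Rough total demand: RAM in use plus whatever has spilled into swap."""
    return (real_mb - free_mb) + swap_used_mb


def ram_needed_gb(demand_mb: int) -> int:
    """Round the demand up to whole gigabytes."""
    return math.ceil(demand_mb / 1024)


# top output with Oracle and nsd both running, after the shared pool change
demand = memory_demand_mb(real_mb=2048, free_mb=266, swap_used_mb=1721)
print(f"estimated demand: {demand}M, so at least {ram_needed_gb(demand)} GB of RAM")
```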

At that point it's possible that things will be running well enough that the external disk array will no longer be a problem. If it is still a problem, then we will need either access to a second external array, so we can split up the log files, or (even better) an internal disk added to the system.

At this time there is no need for a high performance system, just a few more resources allocated to this one.
---------------------------------------
Here are the results of my experiment. I took snapshots via the top command at each step.

before:

Memory: 2048M real, 301M free, 1563M swap in use, 4023M swap free

nsd shut down:

Memory: 2048M real, 441M free, 1422M swap in use, 4165M swap free

Oracle shut down:

Memory: 2048M real, 1372M free, 459M swap in use, 5130M swap free

WebCT shut down:

Memory: 2048M real, 1413M free, 383M swap in use, 5207M swap free

At this point nothing is running but Solaris, so this is a baseline state. It's possible that a bit more memory would be available if we could reboot, but this looks pretty normal to me.

on the way back up:

Oracle started up:

Memory: 2048M real, 457M free, 1339M swap in use, 4248M swap free

nsd started up:

Memory: 2048M real, 430M free, 1363M swap in use, 4225M swap free

after site has come up all the way and a few pages loaded:

Memory: 2048M real, 275M free, 1480M swap in use, 4107M swap free

WebCT started up:

Memory: 2048M real, 266M free, 1512M swap in use, 4075M swap free

Conclusion:

Oracle grabbed 1915M of memory, considerably more than the RAM that was available, so even when it was the only thing running it caused the system to go into swap. It is the major resource hog here. WebCT used a small amount of memory so, at least as far as RAM goes, its presence is not making a significant difference to system performance.
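
The 1915M figure can be reproduced from the snapshots taken just before and just after Oracle started up. A minimal sketch, assuming the total is the drop in free RAM plus the drop in free swap (the helper names are only for illustration):

```python
import re

MEMORY_LINE = re.compile(
    r"Memory: (\d+)M real, (\d+)M free, (\d+)M swap in use, (\d+)M swap free"
)


def parse_memory(line: str) -> dict:
    """Extract the four megabyte figures from a top 'Memory:' line."""
    match = MEMORY_LINE.search(line)
    if match is None:
        raise ValueError(f"not a top memory line: {line!r}")
    real, free, swap_used, swap_free = map(int, match.groups())
    return {"real": real, "free": free, "swap_used": swap_used, "swap_free": swap_free}


# Baseline (nothing but Solaris) and the snapshot right after Oracle started
before = parse_memory("Memory: 2048M real, 1413M free, 383M swap in use, 5207M swap free")
after = parse_memory("Memory: 2048M real, 457M free, 1339M swap in use, 4248M swap free")

ram_taken = before["free"] - after["free"]             # drop in free RAM
swap_taken = before["swap_free"] - after["swap_free"]  # drop in free swap

print(f"free RAM fell by {ram_taken}M, free swap fell by {swap_taken}M")
print(f"total claimed by Oracle: {ram_taken + swap_taken}M")
```

The same two-line comparison works for the nsd and WebCT steps by swapping in the neighbouring snapshots.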

As you might expect, the site ran no faster with WebCT shut down, because the system was basically just as far into swap as it was when I started.

I am still convinced that adding RAM (at least 2 GB) is the most important thing we can do to improve the situation.
---------------------------------------
I forgot to mention one thing - I can make Oracle require less RAM, but I probably can't get it down small enough. And even if I could, it would only work for a short time; Oracle performs best when it is able to load the entire data set into RAM, and if it has a minimal amount of space to work with it will lose the ability to do that as your users add content. So performance would fall off quickly at some point in the not-too-distant future. It is really better to fix this properly now.
---------------------------------------

Ok, that's the trail so far. Comments?