Forum .LRN Q&A: Details on our efforts at Sloan to improve .LRN performance

Sloan recently upgraded to .LRN 2.1, and since then a group of us (Andrew Grumet, Tracy Adams, Mark Reidesel and me) have been working to improve performance. This post will document the efforts we've made to date, and we'll come back to add more as we go along.

System specs
Hardware: Sun V440 with 4 1-GHz Sparc IIIi CPUs, 16 GB of RAM
OS: Solaris 9
Internal disks are mirrored using Solstice disk suite (i.e. Software RAID).

Database: Oracle 8.1.7.4
Most database file are stored on a Sun 3310 RAID array configured RAID 1+0 (i.e. one big mirrored partition)

AOLserver 3.3+ad13
.LRN 2.1

Use SEToolKit to monitor Solaris Operating system
One of the most useful tools for measuring performance of the Solaris operating system is the setoolkit http://www.setoolkit.com, developed by Rich Pettit. In particular, the tool zoom.se gives a good overview of the usage of CPU, disk, RAM, virtual memory and other parameters using measurements developed by Adrian Cockcroft, one of Sun’s main performance people.

In our case, the zoom monitor showed no obvious problems in the system configuration. There was no disk contention and no paging going on. When the dotlrn system was noticably slow for users, one or more of the CPUs was almost alwasy at essentially 100% utilization. Curiously, though, there were other times when all four CPUs were maxed out, according to zoom, yet from the user perspective the performance was fine. Each time this was seen, it was apparently due to processes within AOLServer which seemed to have hung.

Overall, this new system seems to put a heavier average load on the CPUs that our previous one, for reasons which are not yet known. However, the main result of the operating system monitoring was to indicate that the performance problems were elsewhere — either in the database or in the application.

Increase AOLserver threads
I believe the original value of maxthreads was 10. We first bumped that up to 20, and then brought the rest of the thread parameters in line with recommendations given on the AOLserver mailing list. The parameters now look like this:

ns_param maxconnections 100
ns_param maxkeepalive 100
ns_param keepalivetimeout 15
ns_param maxdropped 0
ns_param maxthreads 20
ns_param minthreads 20
ns_param threadtimeout 3600

Increase number of database connections
We originally had the default of 5/5/5 for the three pools. We increased that to 24/12/4 and that helped considerably. We most likely had threads waiting for a database handle.

Some combination of the increase in threads and connections seems to have cured the problem of AOLserver processes appearing to hang, as mentioned above.

Disable paralle query execution
The Oracle consultant who first set up the original Sloanspace installation some years ago had enabled parallel query execution. That means that Oracle will break up the query into pieces and run them simultaneously on multiple CPUs, stitching the results back together afterwards. This sounds like a good thing, but in reality most of the queries in .LRN execute too quickly to benefit from this treatment. The overhead of the splitting and stitching phases is large enough that this actually causes things to run slower instead of faster in our case. This feature is really meant for data warehouse applications, where transactions tend to be large and run for a long time. Plus we are still on Oracle 8.1.7.4, and the parallel query feature is generally considered to have not been ready for prime time until Oracle 9, so we were getting a double-whammy from this instead of a performance boost.

We first ran into this when we were trying to improve performance of the previous version of Sloanspace. At that time we discovered that the Oracle parameter parallel_automatic_tuning had been turned on. Turning that off improved performance considerably and we thought that we had successfully disabled all parallel query execution, but it turned out that it was still enabled for certain tables. It was additionally necessary to disable it from those tables before it would truly go away.

We didn't see a huge performance increase this time, but it was clearly not providing any performance benefit so there was no reason to have the added complexity. Besides, it made for some really convoluted query execution plans!

Use statspack to look for Oracle configuration issues
We looked at two reports, one taken mid-morning and the other taken mid-afternoon. They were very similar. The conclusions were not earth-shattering:

"The soft parse and execute to parse ratios are a bit low; this usually indicates code that's not using bind variables when it should. The other efficiency numbers look pretty good.

Memory usage is around 86%, which is a little high; I'd like to see it in the 75 - 80% range.

The only wait events worth researching further are "direct path write" (we spent 87 seconds out of 18 minutes on this) and "SQL*Net more data to client" (158 seconds out of 20 minutes). Neither one of these is huge, plus they are both a little fuzzy and hard to nail down, so we decided not to look into them at the moment.

The TEMP tablespace has a lot more writes than ssv2, which is expected. We might be able to get some gains by making sure that TEMP is on it's own disk and controller, and has as fast a disk as possible."

A good resource for learning how to use statspack is Tom Kyte's Expert One-on-One Oracle. His website, http://asktom.oracle.com, is a treasure trove also.

Use tkprof to examine slow queries
Tracy had found a survey query that was horribly slow - well over an hour on the new site, but around a minute on the old one. It turned out to be a view which needed to be rewritten; the results are documented here: https://openacs.org/forums/message-view?message_id=201054. The same view was used by a query in survey-portlet that was called each time someone went to their MySpace page, so this was a real drag on the whole site.

The way we found the problem was by running 10046 event traces on the query and then formatting the output with tkprof. Tom Kyte's book is an excellent resource for how to do this. You get a lot more information about how the query ran and where the trouble spots might be than any other method I'm aware of using free tools.

Use hotsos Oracle profiler to measure individual queries
As part of our tuning efforts on our previous system, we had purchased the hotsos Oracle database performance profiler. Information on this can be found as http://www.hotsos.com and in the O’Reilly book Optimizing Oracle Performance by Cary Millsap and Jeff Holt. Like the tkprof method this used the 10046 tracing events built into the Oracle kernel. It has the advantage that it can give a profile of every query executed by a particular section of code in a way that is easy to read.

We used this to profile one of the slower pages with the unsurprising result that there were two queries used to list the members of a class that might benefit from some tuning. This was indicated by the fact that on query which returned seventy some rows did around 150,000 logical reads. In the short term, the solution is to not put the portal with lists a communities members on the main page for that community and so not run this query.

In our previous system before we had found the parallel execution problem, the profiler indicated that some of the slow queries were taking large amounts of time doing things that did not seem to make sense. Specifically, the SQLNet data transfers were taking 1000 time longer than they should have. This phenomenon disappeared when we turned off parallel tuning. On the current system, we have run into no such anomalies thus far. This indicates that there are no mysterious time consuming database processes on the new system, but it also means that we will not be able to miraculously speed up the entire site by a factor of three by changing on init.ora parameter in the way that we did in the previous exercise.

Conclusion
We may end up doing more profiling of individual queries (using hotsos and/or tkprof) if we find that certain pages are still slow, but right now it's looking pretty good. One task not yet completed is to increase Oracle's shared pool memory allocation, so the memory usage in the Statspack report will drop down a bit. This is not expected to improve performance at all; it just gives Oracle a bit more breathing room. It does not perform well at all if it runs out of memory. :)

Collapse
Posted by Rocael Hernández Rizzardini on
Great post Janine! thx.
You should change to aol4.0.8, its quite stable now and SSL is working fine as well ... and its much faster that 3.x series.

Galileo is using postgres 7.4.3, and we have been fixing several queries with bad performance, stuff like:
* left join
* wheres x is not in (...)
* and some inline query function calls

are giving bad performance, we have commited most of our changes, now working on put as list builders all the tables on /dotlrn/admin  is anybody else working on that as well? (in order to avoid duplicate efforts); this is really important as admin UI for dotlrn.

Though the experience so far has been really good (linux, pg, aol, oacs, .LRN), about 0.5 secs in average for serving any page. Also we moved all of the files (lobs) in the DB to the file system giving us better performance on serving those as well.

Actually, we tried to switch to AOLserver 4.0.8 yesterday but the site crashed after about 10 minutes.  C.R. Oldham was having trouble with his site yesterday too, and I noticed some restarting going on with one of my sites today.

In C.R.'s case, and apparently mine also, the problem was that old browsers with 40 bit encryption were causing nsopenssl to crash, taking nsd with it.  Dossy has a fix for that which we are testing and it looks good so far.  I don't know yet whether the crash we had yesterday at Sloan was caused by this or not, but hopefully we'll have it sorted out soon.

Tried again today and it crashed after about 35 minutes.  Dossy's looking at our coredump.
Rocael, how can other po installations benefit from your query improvements? Is there an upgrade file or a list of changed queries?
Rocael,

Can you give us more details or specific examples of why these are slow?

"Galileo is using postgres 7.4.3, and we have been fixing several queries with bad performance, stuff like:
* left join"

What kind of left joins are slow?

"* wheres x is not in (...)"
IN should perform well in PostgreSQL 7.4.x, maybe NOT IN is not optimized as well?

"* and some inline query function calls"
are these in the column spec of the query or ther WHERE clause?

Dave

well, all of them related to the CR,
also we avoided the use of the CR views x & i since they are quite slow for the extra joins which most of the times are unnecesary.

You can see recent changes on the evaluation packages, now is well tunned, left joins is just a pain with big tables (our cr_items has about 450K rows), basically we changed to use a tcl "approach" than sql approach and its a lot more faster.

Yes, NOT IN is not optimized, see a recent commit on CR.

Most of our problems where due the interaction with the CR, since is heavily used.

All the changes we have been doing are on CVS now, but most of them where on the evaluation packages, though we fixed stuff on other packages.

For the functions I think they were on the WHERE clause.