Forum .LRN Q&A: Re: Help Needed in Setting up .LRN to Scale

Posted by Don Baccus on
Yes, please track this down. Templates are only parsed once and then cached, so this doesn't make much sense, to be honest, unless there are issues within Tcl itself with the underlying put operations that add to the output string.

But whatever information you can give us would be useful.

How much space are you setting up for ns_cache?  Have you monitored performance to make sure it's large enough to be caching everything?

Posted by Martin Magerl on
Hi Don, hi Andrew!

Yes, to be honest, "exponential" was really over the top.
We observed this kind of behavior when we ran AOLserver 3.3 on Sun Solaris. There we often saw, regardless of the particular page or node, statistics like less than 800 ms for db queries but more than 10000 ms total time for the request processor.
To avoid any statement about severity :), here are the performance facts I noticed:
- Pages that do not belong to dotLRN, or at least are not portal-rendered by it, run faster, i.e. db query time and request processor delivery time get very close to each other.
This makes sense, because rendering portals requires "some" extra steps.
- The member portlets of dotLRN (sub)groups in particular show some weird statistics (OK, about 200 members in this example):
  25 database commands totalling 607 ms
  page served in 12360 ms

By contrast, the subcommunity's member administration page, which additionally lists the supercommunity's users not yet in the subcommunity, shows:
  20 database commands totalling 5333 ms
  page served in 9805 ms
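
To make the gap between database time and total time explicit, here is a minimal Python sketch that subtracts one from the other for the two pages quoted above. The dictionary keys and the function name `non_db_share` are made up for illustration; only the four millisecond values come from the post.

```python
# Timings copied from the two pages quoted above (milliseconds).
pages = {
    "member portlet": {"db_ms": 607, "total_ms": 12360},
    "subcommunity member admin": {"db_ms": 5333, "total_ms": 9805},
}


def non_db_share(db_ms, total_ms):
    """Return (ms spent outside the database, that time as a fraction of the total)."""
    outside = total_ms - db_ms
    return outside, outside / total_ms


for name, t in pages.items():
    outside, share = non_db_share(t["db_ms"], t["total_ms"])
    print(f"{name}: {outside} ms outside the db ({share:.0%} of the request)")
```

For the member portlet this gives 11753 ms outside the database (about 95% of the request), against 4472 ms (about 46%) for the admin page, so the portlet's time is spent almost entirely outside the queries.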

I wonder whether this behavior is caused by nested loops, in which case it would not be a problem of the templating system itself. (Maybe I should do a diff against the dotLRN 2.1 queries... upgrading to 2.1 soon :).

Regarding the templating system, I made a simple performance comparison by just displaying some information for a set of dotLRN users. For the first check I used ns_write output, and for the second the templating system with a multirow. Results (measured manually by clock):
- 2500 Users:
a) ns_write: 7 seconds total (including db query)
b) template: 10 seconds total
- 5000 Users:
a) ns_write: 20 seconds
b) template: 30 seconds
- 7500 Users:
a) ns_write: 42 seconds
b) template: 62 seconds
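
As a rough reading of these numbers, the Python sketch below computes the template/ns_write ratio at each size and an empirical growth exponent k (assuming time grows like users**k) between consecutive sizes. The function names are invented for illustration, and with three hand-timed points the result is only indicative.

```python
import math

# (users, ns_write seconds, template seconds), copied from the measurements above.
runs = [(2500, 7, 10), (5000, 20, 30), (7500, 42, 62)]


def overhead_ratios(runs):
    """Template time divided by ns_write time, one value per run."""
    return [tmpl / raw for _, raw, tmpl in runs]


def growth_exponents(runs, column):
    """Estimate k in time ~ users**k between consecutive runs.

    column is 1 for ns_write and 2 for the template timings.
    """
    exponents = []
    for a, b in zip(runs, runs[1:]):
        exponents.append(math.log(b[column] / a[column]) / math.log(b[0] / a[0]))
    return exponents


print([round(r, 2) for r in overhead_ratios(runs)])
print([round(k, 2) for k in growth_exponents(runs, 1)])
print([round(k, 2) for k in growth_exponents(runs, 2)])
```

The ratios come out at roughly 1.4 to 1.5 for all three sizes, while the exponents are above 1.5 for both variants. That suggests the templating system adds a fairly constant factor and that the faster-than-linear growth is present with ns_write too.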

Maybe I have to take into account that some of the extra seconds arise because the templating system builds the complete HTML result before sending it to the browser, so ns_write has a little head start.

Don, you mentioned the space set up for ns_cache. Do you mean the kernel parameter Memoize-MaxSize? It was 200000 and I set it to 300000, not knowing whether this is a reasonable value.
ns_cache stats says:
  Cache          Max      Current  Entries  Flushes  Hits     Misses  Hit Rate
  util_memoize   300000   299932   2229     5685     2222425  77774   96%
  secret_tokens  32768    4080     102      0        2326     102     95%
  nsfp:product   5120000  1364032  61       0        10550    61      99%
  ns:dnshost     100      0        0        0        0        0       0%
  ns:dnsaddr     100      1        1        0        7        1       87%
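
One way to read this table is to compute, per cache, how full it is and what fraction of lookups hit. The Python sketch below does that for three of the rows, assuming the values map to the header columns in order and that Max and Current are in the same unit (presumably bytes). The function name `summarize` is made up for illustration.

```python
# Rows copied from the ns_cache stats above, in header order:
# name, max, current, entries, flushes, hits, misses
rows = [
    ("util_memoize", 300000, 299932, 2229, 5685, 2222425, 77774),
    ("secret_tokens", 32768, 4080, 102, 0, 2326, 102),
    ("nsfp:product", 5120000, 1364032, 61, 0, 10550, 61),
]


def summarize(row):
    """Return the fill level and hit rate of one cache as fractions."""
    name, max_size, current, entries, flushes, hits, misses = row
    lookups = hits + misses
    return {
        "name": name,
        "fill": current / max_size,
        "hit_rate": hits / lookups if lookups else 0.0,
        "flushes": flushes,
    }


for row in rows:
    s = summarize(row)
    print(f"{s['name']}: {s['fill']:.1%} full, "
          f"{s['hit_rate']:.1%} hits, {s['flushes']} flushes")
```

On these figures util_memoize is essentially at its limit (299932 of 300000) and is the only cache with flushes, which may mean entries are being pushed out to make room. The other caches have plenty of headroom.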

Is this MaxSize parameter limited by the StackSize set via ns_config (currently 512 KB), or is it independent?

What about nsv_buckets? Are they relevant to performance?
nsv:7:product 17 3046879 44932 1.47468934605
nsv:6:product 18 1743471 221 0.0126758632636
nsv:5:product 19 33300250 120067 0.360558854663
nsv:4:product 20 455283 15 0.00329465409427
nsv:3:product 21 5668477 29275 0.516452655625
nsv:2:product 22 1415574 1310 0.0925419653088
nsv:1:product 23 763659 63 0.00824975545368
nsv:0:product 24 306966 6 0.00195461386603
ns:cache:util_memoize 81 2337795 4334 0.185388368099
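
The listing above has no column headers. The arithmetic fits an assumed layout of name, id, lock count, busy count, and busy/lock as a percentage, and the Python sketch below checks that reading and picks out the most contended lock. The column meanings and the function name `contention_pct` are assumptions, not something the output itself states.

```python
# Rows copied from the listing above.
# Assumed columns: name, id, lock count, busy count, contention percent.
rows = [
    ("nsv:7:product", 17, 3046879, 44932, 1.47468934605),
    ("nsv:6:product", 18, 1743471, 221, 0.0126758632636),
    ("nsv:5:product", 19, 33300250, 120067, 0.360558854663),
    ("nsv:4:product", 20, 455283, 15, 0.00329465409427),
    ("nsv:3:product", 21, 5668477, 29275, 0.516452655625),
    ("nsv:2:product", 22, 1415574, 1310, 0.0925419653088),
    ("nsv:1:product", 23, 763659, 63, 0.00824975545368),
    ("nsv:0:product", 24, 306966, 6, 0.00195461386603),
    ("ns:cache:util_memoize", 81, 2337795, 4334, 0.185388368099),
]


def contention_pct(locks, busy):
    """Percentage of lock attempts that found the mutex already held."""
    return 100.0 * busy / locks


# Check that the last column really is busy/locks*100 under this assumption.
for name, _id, locks, busy, listed in rows:
    assert abs(contention_pct(locks, busy) - listed) < 1e-6, name

worst = max(rows, key=lambda r: contention_pct(r[2], r[3]))
print(worst[0], round(contention_pct(worst[2], worst[3]), 2))
```

If that reading is right, the most contended bucket (nsv:7:product) is busy on about 1.5% of lock attempts and every other lock is well below 1%.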

I don't know whether mutex locks still use them...

Thanks for your answers & help, and sorry for the exaggerated, untrue statement about performance severity (dreaming of O(log(n)) 😊 ).


P.S.: Just one O(n^2) left: our logger installation.
Only 184 entries, but about 100 seconds to display the index page... but that's really a problem of logger itself.

Posted by Andrew Piskorski on
Martin, you just dumped a whole lot of info on us, but I don't see some of the simplest and most important stuff. Do you have the OpenACS Developer Support package installed? If not, install it right away.

Your first order of business is to determine where and how AOLserver is spending its time, and so far I don't think you've done that. You posted your AOLserver thread settings above, good.

Now, find a particularly slow page. Hit it, and look at the Developer Support data. Very Important: Note whether or not this was the first time this thread served this page. Developer Support currently doesn't tell you this directly, so this isn't quite as simple as it could be, but by looking at the Developer Support info and/or the AOLserver log, you should be able to figure it out.

Now hit the same page again, and get it to run in a thread which has served this same page before. Compare the Developer Support numbers between the 1st-time-in-this-thread and Nth-time hits on that page. This is key.

For all hits other than the 1st hit per thread per page, the page should be fast. If it is not, that is interesting and we want to know why. If only the 1st hit per thread per page is slow, and nearly all the time is being taken up in adp_parse_ad_conn_file, then that's normal.
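
The comparison described above can be sketched in Python once you have collected (thread, page, elapsed time) triples by hand from Developer Support or the server log. The function name, the thread names, the URL and the timings below are all made-up placeholders that only show the shape of the input; they are not real measurements or a real log format.

```python
from collections import defaultdict


def split_first_and_repeat(hits):
    """Split timings into first-hit-per-thread-per-page and repeat hits.

    hits: iterable of (thread_id, url, elapsed_ms) in the order served.
    Returns two dicts mapping url -> list of elapsed_ms.
    """
    seen = set()
    first, repeat = defaultdict(list), defaultdict(list)
    for thread_id, url, ms in hits:
        key = (thread_id, url)
        (repeat if key in seen else first)[url].append(ms)
        seen.add(key)
    return first, repeat


# Placeholder samples (hypothetical values, not measured).
sample = [
    ("conn1", "/members", 9000),
    ("conn1", "/members", 400),
    ("conn2", "/members", 8800),
    ("conn2", "/members", 9100),
]
first, repeat = split_first_and_repeat(sample)
print(dict(first))
print(dict(repeat))
```

If the repeat list for a page stays slow, as the second placeholder thread does here, that is the interesting case worth reporting.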

That's why Don was asking you about ns_cache, etc. above. If your cache isn't big enough, presumably the cached compiled Templating System pages might get thrown out, and then you'd end up running the (expensive) adp_parse_ad_conn_file stuff over and over again many times per page per thread - not good.

Yes, nsv_buckets can certainly be performance relevant, but it's very unlikely that your slow pages are being caused by mutex contention for the nsv buckets. If you want to check, make sure you have "ns_param mutexmeter 1" in "ns_section ns/threads" in your AOLserver config file, then use the AOLserver nstelemetry page to check for lock contention.