Forum .LRN Q&A: Server recommendation for dotLRN with 80,000 users

Hi!
I need help choosing a server for the University of Valencia (Spain) (http://www.uv.es).

I would like some help in choosing a server for a dotLRN site that would have around 80,000 registered users. Postgres is the preferred database. What would be a good server to recommend for use with Debian Linux?

Thanks

/Dario

Posted by Malte Sussdorff on
Hi Dario,

If you have the money and the resources to maintain them, use two servers: one for the database with lots of RAM and another one for AOLserver.

For the database server (also true if you only go with one server):

- Dual Processor (I'm highly satisfied with Opterons, but that may be too expensive)
- 4GB of RAM, or more (that's what we use on aiesec.net; so far it is enough, but less is not recommended).
- Multi-channel SCSI RAID with fast hard disks. Get a lot of hard disks that are fast and can be accessed simultaneously. We use a two-channel setup with two mirrored hard disks each and stripe across both channels. This reduces the risk of your hard disks becoming the bottleneck. Maybe Serial ATA is a better alternative, though.

As for the server for AOLserver, you can use any computer, though you'll need quite some storage space for your users' files. OTOH, "quite some" still does not fill up our 250GB RAID, so it is all a matter of perspective.

Last but not least, think about backups. You will need to back up your database along with all the files in the content-repository (I'd not recommend storing content in the database; use the filesystem-based version instead). Unless you have central backup storage at your university, get a separate (cheap) computer with lots of hard disk space.

Posted by Dario Roig on
Hi Sussdorff!

Thank you for your help, but I have a few more questions, please.

- Which Linux version do you use for 64 bits (dual-processor Opterons)?
- SuSE 64-bit, maybe?

- How many users do you have at your institution?

- Where can I find instructions for switching to the "filesystem-based version"?

Thanks.

Posted by Malte Sussdorff on
Hi Dario,

we are using SuSE Linux Enterprise for Opteron 64bit and it runs smoothly.

AIESEC is currently running the system with 100,000 users; realistically, I'd say it is more like 40,000-50,000.

In the default installation you will have the parameter set to store files in the database instead of the filesystem (in the content-repository-content-files folder of your installation).

You can change this by going to your "site-map" and clicking on the "parameters" link of your "file-storage" packages. There you can set the parameter StoreFilesInDatabaseP to "0".

Posted by Joel Aufrecht on
This is a good time to think about Backup and Recovery.
Dario, one thing you may wish to try is to load the machine with RAM, at least temporarily, when you are adding this many users.  Then, create a RAMdisk using the RAMFS filesystem (not sure if included in the default Debian kernel) and put the Postgres data files there.  Create your users, then stop the Postgres database and copy the files onto your hard drive.

This should greatly speed up insertions, since the limiting factor is the speed of the hard drive.  Since you are just loading users at this point, the database should not be too large, perhaps under 1GB.

Posted by Denis Roy on
When you are planning your hardware setup, it very much depends on the number of concurrent users. Having a lot of data in your database doesn't really matter if there are only say 20 users online at the same time.

Since Malte is referring to AIESEC, let me give you some more concrete info:

We have 120K registered users. Only about 20K are active, since it is also an alumni database. Our webserver is a Dual-P3 at around 933 MHz each with 2 GB RAM. Our database server is a Dual-Opteron system, as Malte already mentioned.

With this setup we support about 70 concurrent users on OACS 4.6.3 and .LRN 1.0 without any noticeable negative impact on performance. When more users try to access the website, performance decreases considerably. I found out that the webserver is the bottleneck, because even when the website is barely responding at all (around 90 concurrent users) the database is still doing fine. A second webserver tested at the same time showed great performance.

We are currently installing a load-balanced solution which will make us very flexible on the frontend (webserver) side. It's just a matter of time until the database will become the bottleneck.

If anyone is interested, I will let you know about the results of our load-balancing and how many users we can handle with it.

Posted by Andrew Piskorski on
Denis, yes, please do let us know how the load-balancing work turns out! These sorts of hands-on reports always make useful background knowledge.

How well are the OpenACS server cluster procs working out for you so far?

Of course, if your site is growing you probably want to add the front-end load-balancing anyway, but in the near term it might be simpler just to swap out that aging dual P3 for a new machine.

Posted by Cato Kolås on
We have around 18000 registered users, but max unique logins during a fortnight is close to 10000.

Specs:
webserver (~.LRN 1.0/OACS 4.6.3):
dual Xeon CPU 2.8 GHz, 3 GB RAM (dell 2650)

oracleserver (8.1.7.3):
dual Xeon CPU 2.8 GHz, 4 GB RAM (dell 2650)

(Both also have hyperthreading turned on and a dedicated 1 Gbit network between the two)

Under heavy load we find that it's the oracleserver that has the highest load. I'm wondering if there is anything more to gain by tweaking the aolserver-config:

ns_param  maxconnections    50
ns_param  maxdropped        0
ns_param  maxthreads        50
ns_param  minthreads        10
ns_param  threadtimeout      120
...
ns_section ns/db/pool/pool1
ns_param  connections        6
...
ns_section ns/db/pool/pool2
ns_param  connections        16
...
ns_section ns/db/pool/pool3
ns_param  connections        5

Some pics:
http://elg.uib.no/~edpck/logins-week.gif
http://elg.uib.no/~edpck/load-week.gif (green: webserver, blue: oracleserver)

All feedback appreciated

Cheers

Posted by Andrew Piskorski on
Cato, my guess is you have your AOLserver threadtimeout set much too low, which is worth experimenting with and changing.

Why do you have so many more connections in your 2nd database pool than in your first? That seems weird.

On both machines, leaving the P4 hyperthreading turned on is also probably a bad idea unless you have a kernel (Linux 2.6.x, I think) which distinguishes between real and virtual/hyperthread processors.
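
For illustration, a raised threadtimeout in the server section of the AOLserver config might look like the fragment below. The section name ns/server/${server} and the value 3600 are assumptions for the sketch, not settings taken from Cato's machine and not a tested recommendation. The usual reasoning, as far as I understand it, is that a short timeout lets idle threads exit, and every replacement thread has to initialise a fresh Tcl interpreter before it can serve a request.

```tcl
# Sketch only: section name and values are illustrative assumptions.
ns_section ns/server/${server}
ns_param  maxconnections    50
ns_param  maxthreads        50
ns_param  minthreads        10
ns_param  threadtimeout     3600   ;# seconds an idle thread waits before exiting (was 120)
```

Changing one value at a time and watching the load graphs before and after is probably the safest way to see whether it helps.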

Posted by Cato Kolås on
<blockquote> Cato, my guess is you have your AOLserver threadtimeout set
much too low, which is worth experimenting with and
changing.
</blockquote>

ok, will do. thx.

<blockquote> Why do you have so many more connections in your 2nd
database pool than in your first? That seems weird.
</blockquote>

I have turned on logging for when pools are empty:
ns_param  WarnEmpty          true
- and tried to adjust the settings based on the number of warnings.

Posted by Nima Mazloumi on
Hi Cato,
can you tell me what tool you use to log the load on the webserver and the database?

Greetings,
Nima

Posted by Cato Kolås on
|Hi Cato,
|can you tell me what tool you use to log the load on the
|webserver and the database?

It's mrtg with a custom script to gather information about the load from the two machines (runs uptime via ssh connections and greps for the load).
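
A script of that kind could be sketched roughly as follows. This is not Cato's actual script: the proc names, the host names in the usage comment, and the choice to scale the load by 100 are all assumptions. It relies on MRTG reading four lines from an external command (two values, an uptime string, and a target name).

```tcl
# Extract the 1-, 5- and 15-minute load averages from one line of uptime output.
proc parse_load {uptime_line} {
    set re {load averages?:\s*([0-9.]+),?\s+([0-9.]+),?\s+([0-9.]+)}
    if {![regexp $re $uptime_line -> one five fifteen]} {
        error "no load average found in: $uptime_line"
    }
    return [list $one $five $fifteen]
}

# Build the four lines MRTG reads from an external command:
# first value, second value, an uptime string, and a target name.
proc mrtg_report {web_uptime db_uptime} {
    # Scale the 1-minute load by 100 so both values are integers.
    set web [expr {round([lindex [parse_load $web_uptime] 0] * 100)}]
    set db  [expr {round([lindex [parse_load $db_uptime] 0] * 100)}]
    return [join [list $web $db [string trim $web_uptime] "load x100: web / db"] \n]
}

# Run only when two host names are given on the command line, e.g.
# (hypothetical): tclsh load.tcl web.example.org db.example.org
if {[info exists argv] && [llength $argv] == 2} {
    set web_host [lindex $argv 0]
    set db_host  [lindex $argv 1]
    puts [mrtg_report [exec ssh $web_host uptime] [exec ssh $db_host uptime]]
}
```

Keeping the parsing in its own proc means it can be checked against a saved line of uptime output without touching either machine.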

Posted by Nima Mazloumi on
Hi Denis,
I am really interested to learn more about your load-balancing solution.

Greetings,
Nima

Posted by Dario Roig on
Hi friends!

Thanks to all for your recommendations.

We already have our system in place:

The Database (Postgres 7.4.1):
  - Dual Opterons
  - 4 GB RAM
  - Harddisk of 128 GB Serial ATA
  - OS: Mandrake 64 bits

The AOLserver 3.3 and dotLRN 2.0 Beta4:
  - Pentium IV
  - 1GB RAM
  - Harddisk 120GB
  - OS: Debian

With 80,000 users and 10,000 groups, we have performance problems even with a single concurrent user. When the user requests a web page inside a dotLRN course, the response takes more than 10 seconds.

Any suggestions, please?

Posted by Dirk Gomez on
Install acs-developer-support. Turn it on. Turn on database queries output. Report the output.

What does top on the server say?

Posted by Joel Aufrecht on
More documentation on how to do developer support and performance debugging:
https://openacs.org/doc/openacs-HEAD/maint-performance.html
Posted by Malte Sussdorff on
Hi Dario,

Do you use multiple hard disks for your 128GB Serial ATA?

If yes, what is your partitioning, where did you install the system, and where are the postgres data files? (The postgres data files should be on a separate (set of) hard disk(s).)

Anyhow, this does not explain the performance loss. So doing as Dirk suggests will help us help you.

Dario, you have some fairly serious server hardware there, but "harddisk of X GB, Serial ATA" doesn't mean much of anything. What type of disks exactly, and most importantly, what RAID controller, and what RAID level (1, 5, 10), and partitioned in what way?

More importantly, such a horribly awful slowdown (10 s page serve time) with just two concurrent users is pretty strange. That's not some disk IO limit due to e.g. RAID 5 vs. RAID 10 either (not with only 2 concurrent users!); it's got to be something that affects low-end performance too, i.e. a mis-configuration problem.

I bet your AOLserver thread settings are wrong, that's probably the easiest single way to make your performance really suck. See my previous post and link above about that.

Posted by Dario Roig on
Hi!

Our problem has been solved. Thank you for your help.

The problem was:

When the file dotlrn-master.tcl is executed, it takes 5 seconds to run the following line of code:

# Curriculum bar
set curriculum_bar_p [llength [site_node::get_children -all -filters { package_key "curriculum" } -node_id $community_id]]

How can we remove the curriculum applet?

thanks.

Posted by Nima Mazloumi on
What did you change to make it work?
Posted by Ola Hansson on
Dario,

I'd be curious to know what happens if you comment out that line and try this one instead:

set curriculum_bar_p [util_memoize "llength [site_node::get_children -all -filters { package_key "curriculum" } -node_id $subsite_node_id]"]
After you have reloaded the page with the new code, is it any faster? Fast enough? If so, this change may be enough, in which case I'll commit it ASAP.

Parenthetically -- The util_memoize call means that you'll have to restart the server for each new curriculum instance that you mount - however, this shouldn't cause any additional problem since curriculum is already (mis)constructed in such a way that you need to restart the server after an instance has been mounted in order to make the user tracking work in the curriculum bar. This "feature" has been documented, at least ...

/Ola

Hi, Ola!

I have tried your code but it doesn't work:

(I have replaced $subsite_node_id with $community_id)

wrong # args: should be "llength list"
    while executing
"llength "
    ("eval" body line 1)
    invoked from within
"eval $script"
    invoked from within
"ns_cache eval util_memoize $script {
        list $current_time [eval $script]
    }"
    (procedure "util_memoize" line 20)
    invoked from within
"util_memoize "llength [site_node::get_children -all -filters { package_key "curriculum" } -node_id $community_id]""

???????????
Best regards,
Agustin

Posted by Ola Hansson on
Can you post the exact line you use, please? That way it is easier to figure out what is wrong.
Posted by Dirk Gomez on
If you don't use the curriculum module, comment it out.

(What is the status of the curriculum-portlet/module anyway?)

Hi again!

I have replaced the line with:

set curriculum_bar_p [util_memoize "llength
    [site_node::get_children -all -filters { package_key
    "curriculum" } -node_id $community_id]"]

Regards,
Agustin

Posted by Ola Hansson on
Hi,

Try putting it in a single line. I get the same error as you do when I use line breaks.

Dirk: I would like to label the package as having a couple of known flaws (and as of today I count the issue at hand as one such flaw) but being basically fit for fight, although I'm not aware of anyone who's banged on it heavily under real-life circumstances yet.

The curriculum-portlet should work but has probably undergone even less testing.

I think one may say the pair's slowly approaching 1.0 status ...

Posted by Jon Griffin on
I believe working as a single line is telling you that you have a space after one of your \.
This will break things in a strange way.
Hi!

Sorry, but in the code I wrote the command on one single line.
I only split it to post the message to the forum.

set curriculum_bar_p [util_memoize "llength
    [site_node::get_children -all -filters { package_key
    "curriculum" } -node_id $community_id]"]

The problem persists.
Any help?

Agustin

Posted by Ola Hansson on
OK. You need to replace the quotation marks (") that surround the util_memoize arg with [list ...], like so:
set curriculum_bar_p [util_memoize [list llength [site_node::get_children -all -filters { package_key "curriculum" } -node_id $community_id]]]
Sorry about the confusion, but this ought to work.
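
The difference between the two forms can be reproduced in plain tclsh. The sketch below uses hypothetical stand-ins (get_children, memoize) for site_node::get_children and util_memoize, and assumes the node has no curriculum children, which is what the "llength " in the error trace suggests. It also illustrates a point about Tcl substitution that may be worth keeping in mind: in both forms the bracketed call runs before the memoizing proc ever sees the script, so only the llength step is cached. Passing the whole call as the script (case 3) is one way to defer it, though whether that is appropriate for the real dotlrn-master.tcl is not something the thread settles.

```tcl
# Hypothetical stand-ins so the example runs without OpenACS.
set calls 0
proc get_children {node_id} {
    # Pretend this is the slow lookup; count how often it runs.
    incr ::calls
    return {}
}

array set cache {}
proc memoize {script} {
    # Minimal cache keyed on the script text; evaluates the script on a miss.
    if {![info exists ::cache($script)]} {
        set ::cache($script) [uplevel #0 $script]
    }
    return $::cache($script)
}

set community_id 42

# 1. Double quotes: the brackets are substituted first, so memoize receives
#    the string "llength " and llength ends up without its argument.
set quoted_failed [catch {memoize "llength [get_children $community_id]"} quoted_msg]

# 2. [list ...]: the empty result stays a single list element, so the
#    script is "llength {}" and evaluates cleanly.
set listed [memoize [list llength [get_children $community_id]]]

# 3. Memoizing the call itself: the second lookup is served from the cache.
set before $calls
set deferred [llength [memoize [list get_children $community_id]]]
set deferred [llength [memoize [list get_children $community_id]]]
set extra_calls [expr {$calls - $before}]
```

The same substitution order explains the line-break failure reported earlier in the thread: inside double quotes, a newline after llength ends that command before its argument arrives.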