Forum OpenACS Development: Experiences clustering OpenACS aolservers?

Hello!

Is anybody using OpenACS with an AOLserver cluster?

I would be pleased to hear about your experiences and problems.

Regards,
Agustin

We'll soon post the work done at Galileo in this regard.
Thanks, Rocael!

I will wait for your news!

Agustin

Posted by Joe Oldak on
We clustered the dotCommunity setup which is live on http://e-voice.org.uk/ - there are three web servers connecting to one database backend.

The cluster cache flushing code in the stock OpenACS isn't right though - especially with respect to the site_node cache. We did some work to improve it in dotCommunity and it (now) seems to be working well for us.

You can look at what we've done with the code by downloading it from www.dotcommunity.org. Or, if you aren't in too much of a hurry, then we'll be hoping to put our changes back into the OpenACS core code in due course...

Joe

Hi, Joe!

I got your code and picked out the files changed
for the cluster part.

I have put the changes on our test server.
They appear to work OK!

The files I have used are:

packages/acs-tcl/tcl/site-nodes-procs-*.xql
packages/acs-tcl/tcl/site-nodes-procs.tcl
www/SYSTEM/flush-site-node-cache.tcl
packages/acs-tcl/tcl/site-nodes-init.tcl

Am I right?
Have I forgotten any file or change?

Regards,
Agustin

Posted by Joe Oldak on
That seems right, yeah.

As long as you try out the adding/deleting/renaming of nodes from one machine, and these changes are mirrored on the others, then you should be fine!

We also made some changes in the functions which send the flush messages to the cluster peers - though this was mainly to do with our specific setup, and so you shouldn't really need them. (since each node in our cluster could have more than one ip address, we added an extra config parameter to store the local addresses of each machine). This is really just to prevent nodes sending flush requests to themselves (which is harmless but pointless!).
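
Roughly, the idea is something like this - a minimal sketch only, where LocalAddresses is an illustrative parameter name and the peer list and flush URL come from wherever your install already builds them:

# Skip any peer whose address is one of our own, then fire the
# flush request at the rest (same mechanism as the stock code).
proc flush_cluster_peers { peer_hosts flush_url } {
    # All the IPs this machine answers on (hypothetical kernel parameter)
    set local_addrs [parameter::get \
                         -package_id [ad_acs_kernel_id] \
                         -parameter LocalAddresses \
                         -default ""]
    foreach host $peer_hosts {
        if { [lsearch -exact $local_addrs $host] >= 0 } {
            # Don't send a flush request to ourselves
            continue
        }
        util_httpget "http://${host}${flush_url}"
    }
}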

(sorry for the previous post)

Thanks, Joe!

Just one more problem.

Has anybody patched the who's-online code to cover all
the servers in the cluster?
Currently only the local users are displayed by that
procedure (it uses nsv* procedures).

By the way, we solved the problems
with different real IPs on the same server by adding a host
route to indicate the output network card.

Regards,
Agustin

Posted by Joe Oldak on
One gotcha to check for - ensure one (and one only) machine is running all the background threads for search indexing etc.

Due to a misconfiguration, we ended up with none of these running for a while. (none of the machines thought they were the canonical server).

Ideally, we could come up with some scheme whereby one of the machines can be elected as the canonical server at runtime, just in case one of the servers dies.
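
The guard itself is nothing more than a check before scheduling the work in the *-init.tcl files - a sketch along these lines, assuming the CanonicalServer kernel parameter holds ip:port (the helper name here is made up):

# Made-up helper: are we the machine named in the CanonicalServer parameter?
proc running_on_canonical_p {} {
    set canonical [parameter::get \
                       -package_id [ad_acs_kernel_id] \
                       -parameter CanonicalServer \
                       -default ""]
    set my_addr [ns_config ns/server/[ns_info server]/module/nssock address]
    set my_port [ns_config ns/server/[ns_info server]/module/nssock port]
    return [string equal $canonical "${my_addr}:${my_port}"]
}

# Only the canonical machine schedules the background work,
# e.g. the search indexer
if { [running_on_canonical_p] } {
    ad_schedule_proc -thread t 3600 search::indexer
}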

Oh! If the canonical server dies, no server will run
the scheduling threads.

We could dynamically choose another cluster server
as canonical, but how could we re-run all the
scheduled procs on that server? Stopping and starting it
doesn't seem like a good solution ...

Agustín

Posted by Joe Oldak on
We don't have an answer for this at the moment. We're working on the theory that if the canonical server dies then we'll notice and do something about it!

Not ideal, but not much we can do about it.

It may of course be the case that it's safe to run the background threads on all the servers - provided care is taken with the db calls to prevent all the machines repeating the actions.

Anyone know if this is the case?
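
To make the "care with the db calls" part concrete, the kind of guard I have in mind is an atomic claim in the database, so every machine can attempt the job but only one actually does the work - a sketch only, using a hypothetical cluster_job_runs table:

# Hypothetical table: cluster_job_runs(job_name, last_run)
proc maybe_run_job { job_name interval_seconds body } {
    # The UPDATE is atomic, so at most one machine claims the job per interval
    db_dml claim_job "
        update cluster_job_runs
           set last_run = current_timestamp
         where job_name = :job_name
           and last_run < current_timestamp - (:interval_seconds * interval '1 second')"
    if { [db_resultrows] > 0 } {
        # We won the claim, so do the actual work
        uplevel 1 $body
    }
}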

Hello again!

We are experiencing our first problems with the cluster 😊

The proc "update_cache_local" has a runtime of 8 seconds.
During this time the cluster is blocked. We are studying the distribution of this time within the procedure, and the greatest cost is the in-memory copying (array set, nsv_array reset) at the beginning and end of the proc that maintains the site nodes (of which we have more than 110,000).

Any idea to optimize it?

Regards,
Agustin

array set nodes [nsv_array get site_nodes]
array set url_by_node_id [nsv_array get site_node_url_by_node_id]
array set url_by_object_id [nsv_array get site_node_url_by_object_id]
array set url_by_package_key [nsv_array get site_node_url_by_package_key]
...
nsv_array reset site_nodes [array get nodes]
nsv_array reset site_node_url_by_node_id [array get url_by_node_id]
nsv_array reset site_node_url_by_object_id [array get url_by_object_id]
nsv_array reset site_node_url_by_package_key [array get url_by_package_key]

Posted by Joe Oldak on
Apologies for the long rambling reply, hopefully you can extract some useful thoughts...

In the "old" version of the code, there were some bits of code which would modify the in-memory copy on certain operations. However now what happens is that it always calls update_cache_local with the node you have changed - this code is necessary for the cluster peers, and so it also runs it on the local peer too. This is less efficient than it used to be - but has the advantage of actually working 😉

Unfortunately the update_cache_local is quite slow, especially with so many site nodes. (what's on your site, so that you have so many!?)

And, also unfortunately, the job of synchronising the local site_nodes store whenever one of the cluster nodes changes the tree is also unavoidable.

There are a few things that could be done about this. I'll just mention a few passing thoughts:

There are a few things that could be done to reduce the number of times the caches are flushed:

Firstly, simply make sure that the cluster nodes aren't sending unnecessary flush requests. And when they do send them, make sure they aren't unnecessarily using the "sync_children" option, as this results in a lot more db grinding.

Sometimes, I find a single request can send several flush requests to the peers, which is unnecessary. Perhaps what could happen is that the flush requests are saved up during the page fetch and sent at the end, with duplicates removed.
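
As a sketch of that last idea (ns_atclose fires when the connection closes; send_to_cluster_peers here is a stand-in for whatever your existing flush mechanism is):

# Queue flush URLs during the request and send the de-duplicated set at the end
proc queue_cluster_flush { flush_url } {
    global cluster_flush_queue
    if { ![info exists cluster_flush_queue] } {
        set cluster_flush_queue [list]
        ns_atclose {
            global cluster_flush_queue
            foreach url [lsort -unique $cluster_flush_queue] {
                send_to_cluster_peers $url
            }
            unset cluster_flush_queue
        }
    }
    lappend cluster_flush_queue $flush_url
}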

One thing that just occurred to me is that update_cache_local actually works on local arrays, so only needs a full writelock on the nsv_arrays at the end of the function, as it copies the data into them. This could certainly alleviate the problem a lot!

(just put a readlock around the start where it loads the arrays into local store, and a writelock where it puts it back)

Of course - if there are a LOT of nodes then the memcopying of this data from the nsv_array to the local arrays could be the time consuming part, rather than the update itself. If this is the case then something cleverer involving in-place editing of the nsv arrays could be done??

Perhaps you could add a bit of code around the function to see if it's the copying in/out that is taking the time, or whether it's the db reading. I imagine in the cases where you are just updating a single node and no children, the copying will be the majority of the time. However, where you are syncing a lot of nodes, the db access will be the majority.
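
Something as crude as this would tell you - a sketch of timing code dropped into update_cache_local around the existing sections:

set t0 [clock clicks -milliseconds]
array set nodes [nsv_array get site_nodes]
# ... the other nsv_array get calls ...
set t1 [clock clicks -milliseconds]

# ... the db queries that rebuild the changed nodes ...
set t2 [clock clicks -milliseconds]

nsv_array reset site_nodes [array get nodes]
# ... the other nsv_array reset calls ...
set t3 [clock clicks -milliseconds]
ns_log Notice "update_cache_local: copy-in [expr {$t1 - $t0}]ms, db [expr {$t2 - $t1}]ms, copy-out [expr {$t3 - $t2}]ms"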

In the long term, it could be good to be able to share among peer nodes the details of what has changed, rather than just the fact that "something" has changed - then we could be more efficient about things.

I'll stop here to avoid further confusion, but will post more responses if you wish to delve further into any of the thoughts!

Hi again!

Another problem we are finding is related
to the ajax chat users. Users of the ajax chat
only see other users who are on the same server
of the cluster.

Does anybody have an idea how to resolve this?

Regards,
Agustin