Forum OpenACS Development: Re: Scalability of acs_objects and acs_object_context_index
Having said that, there are clearly too many object types in the basic design. Over time I'd like to pare them down, but only with careful thought. It's simple to say that an apm_parameter doesn't need permissions, for instance, because anyone who can admin the package should be able to admin its parameters. But the web admin UI allows for the creation of new parameters, and perhaps audit info about who created or last modified a parameter is useful? If so, then in the 4.x design it should be an object. Doing auditing separately might be a good idea, but that's not how 4.x works, and the odds of our making fundamental changes to the overall design at this level AND PROVIDING UPGRADE SCRIPTS for existing sites are very low. On the other hand, a future redesign that abandons existing sites would have a lot more freedom, but so far the community has expressed little interest in this approach. Most of us seem to want to build sites today while incrementally improving 4.x instead.
Now, on to specifics ...
Have you actually measured how long permission_p takes on your data? This is another case of handwaving: saying "permissions are too slow" without posting data. The number of rows in acs_objects is meaningless in itself; the pertinent question is "how long does a permissions check take?"
Can you post the query you're using to test performance? Can you post how long it takes to execute, the query plan Oracle generates for it, and the hardware you're testing on? Can you post pertinent Oracle statistics so we can be sure the large amount of data in the system isn't causing cache misses, etc.?
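For anyone reproducing this kind of measurement, here is a minimal sketch of a timing harness in Python, using the standard sqlite3 module as a stand-in database (assumption: the table, column names and row counts are hypothetical, and SQLite timings say nothing about Oracle on your hardware). It reports the median of several runs plus the plan the engine chose; on Oracle the corresponding tools would be EXPLAIN PLAN or SQL*Plus autotrace.

```python
import sqlite3
import statistics
import time


def time_query(conn, sql, params=(), runs=5):
    """Run sql `runs` times; return (median seconds, row count, plan lines)."""
    timings = []
    rows = []
    for _ in range(runs):
        start = time.perf_counter()
        rows = conn.execute(sql, params).fetchall()  # include fetch time
        timings.append(time.perf_counter() - start)
    # EXPLAIN QUERY PLAN is SQLite's rough analogue of Oracle's EXPLAIN PLAN;
    # the last column of each row is the human-readable plan step.
    plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]
    return statistics.median(timings), len(rows), plan


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Hypothetical table standing in for acs_objects.
    conn.execute("CREATE TABLE objects (object_id INTEGER PRIMARY KEY, object_type TEXT)")
    conn.executemany("INSERT INTO objects VALUES (?, ?)",
                     ((i, "file" if i % 2 else "folder") for i in range(100_000)))
    median_s, n, plan = time_query(
        conn, "SELECT object_id FROM objects WHERE object_type = ?", ("file",))
    print(f"{n} rows, median {median_s * 1000:.2f} ms")
    for line in plan:
        print("  plan:", line)
```

Reporting a median over several runs, rather than a single run, keeps one cold-cache execution from dominating the result.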
The load testing you're doing is extremely useful, but we need a lot more detail before we can analyze the results. I've personally done load testing with only a couple hundred thousand objects rather than a couple million, because at the time I only had a P500 available. In that context, however, permission checks (properly written ones, at least) performed adequately.
In general, the permissions design anticipates that you'll filter the rows a query returns by using acs_object_party_privilege_map in the where clause, rather than by calling acs_permission.permission_p().
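A minimal sketch of the two query shapes, again using Python's sqlite3 as a stand-in (assumption: the table names, columns and party/object ids are hypothetical, and the toy map table is pre-flattened, whereas the real view does much more work). Both queries return the same rows; the difference is whether the permission test is visible to the query planner or hidden inside an opaque per-row function.

```python
import sqlite3


def build_demo_db():
    """Toy schema: objects plus a pre-flattened (object, party, privilege) map.

    Hypothetical stand-in: the real acs_object_party_privilege_map view also
    resolves context inheritance, group membership and the privilege hierarchy.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE objects (object_id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE party_privilege_map (
            object_id INTEGER, party_id INTEGER, privilege TEXT,
            PRIMARY KEY (object_id, party_id, privilege));
    """)
    conn.executemany("INSERT INTO objects VALUES (?, ?)",
                     [(i, f"file {i}") for i in range(1, 11)])
    # Hypothetical grants: party 42 may read the even-numbered objects.
    conn.executemany("INSERT INTO party_privilege_map VALUES (?, 42, 'read')",
                     [(i,) for i in range(2, 11, 2)])
    return conn


def register_permission_p(conn):
    """Register an opaque per-row function, mimicking permission_p() in a WHERE clause."""
    # Snapshot of the grants taken at registration time (fine for a demo).
    grants = set(conn.execute(
        "SELECT object_id, party_id, privilege FROM party_privilege_map"))

    def permission_p(object_id, party_id, privilege):
        return 1 if (object_id, party_id, privilege) in grants else 0

    conn.create_function("permission_p", 3, permission_p)


def readable_via_function(conn, party_id):
    # The planner cannot see inside the function, so it is called for every row.
    sql = ("SELECT object_id FROM objects "
           "WHERE permission_p(object_id, ?, 'read') = 1 ORDER BY object_id")
    return [row[0] for row in conn.execute(sql, (party_id,))]


def readable_via_map(conn, party_id):
    # The permission test is plain SQL, so the planner can pick join order and indexes.
    sql = ("SELECT o.object_id FROM objects o "
           "JOIN party_privilege_map m ON m.object_id = o.object_id "
           "WHERE m.party_id = ? AND m.privilege = 'read' ORDER BY o.object_id")
    return [row[0] for row in conn.execute(sql, (party_id,))]


if __name__ == "__main__":
    conn = build_demo_db()
    register_permission_p(conn)
    print(readable_via_function(conn, 42))  # [2, 4, 6, 8, 10]
    print(readable_via_map(conn, 42))       # [2, 4, 6, 8, 10]
```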
In the past, though, that view performed horrifically slowly, and ArsDigita's advice was to avoid it, which is reflected in how some of the packages are coded. And when we ported to Postgres, it turned out that the view was *always* too slow to use, so we were forced to abandon it. In some contexts an alternate view, all_object_party_privilege_map, was usable, but not in all of them. So, as a last resort, many people working on packages began calling the permission_p function in the where clause to check permissions.
Now, though, with my 4.6.1 permissions upgrade, the acs_object_party_privilege_map view is fast and should be used when your query does something like "return all the rows in this table that the user has the 'read' permission on". While improving the performance of permissions, I worked on file-storage and got a 3x-4x performance improvement by using this view to filter the files the user has read permission on instead of calling the permission_p() function. Before my rewrite of permissions, using the view would have slowed the query down by a couple of orders of magnitude. So you can see how the rewrite has changed my thinking, at least, about how to write queries that do permission checking. (I've updated the permissions design doc for 4.6.3 to reflect this, though I haven't committed the new doc yet.)
Michael Hinds and I are slowly working on getting the new-portals rewrite, which Open Force began but did not complete, working again. It will take some time. One reason I've been pushing to complete this rewrite is that it reduces the number of object types in the datamodel: mapping tables that are strictly local to the package become plain tables rather than types derived from acs_objects. This should considerably reduce the number of objects in a dotLRN installation.
I'm also curious about the hardware you're testing on, and about what kind of performance people expect from a particular hardware platform. A dotLRN installation with nearly one hundred thousand users is a HUGE system, and I personally wouldn't expect to support such a community on a cheap PC, for instance. Your loaded instance has about four times as many objects as the live system at Sloan/MIT, which has been running for about a year now, and three times as many users as there are students at the University of Heidelberg.
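To make the object-count argument concrete, here is a hypothetical sketch (Python's sqlite3 again; the table and type names are invented for illustration and are not the actual new-portals datamodel) contrasting a mapping table whose rows are also objects with a plain package-local mapping table.

```python
import sqlite3

# Hypothetical schema, not the real new-portals tables.
SCHEMA = """
CREATE TABLE acs_objects (
    object_id INTEGER PRIMARY KEY, object_type TEXT NOT NULL);
-- Object-derived mapping: every mapping row is also an acs_objects row.
CREATE TABLE element_map_object (
    map_id INTEGER PRIMARY KEY REFERENCES acs_objects(object_id),
    portal_id INTEGER NOT NULL, element_id INTEGER NOT NULL);
-- Package-local mapping: a plain table keyed by the two foreign keys.
CREATE TABLE element_map_plain (
    portal_id INTEGER NOT NULL, element_id INTEGER NOT NULL,
    PRIMARY KEY (portal_id, element_id));
"""


def map_as_object(conn, portal_id, element_id):
    """Insert an object row first, then the mapping row that extends it."""
    cur = conn.execute(
        "INSERT INTO acs_objects (object_type) VALUES ('element_map')")
    conn.execute("INSERT INTO element_map_object VALUES (?, ?, ?)",
                 (cur.lastrowid, portal_id, element_id))


def map_plain(conn, portal_id, element_id):
    """Insert only the mapping row; acs_objects is untouched."""
    conn.execute("INSERT INTO element_map_plain VALUES (?, ?)",
                 (portal_id, element_id))


def object_count(conn):
    return conn.execute("SELECT count(*) FROM acs_objects").fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for element_id in range(1, 4):
        map_as_object(conn, 1, element_id)
    print(object_count(conn))  # 3
    for element_id in range(1, 4):
        map_plain(conn, 2, element_id)
    print(object_count(conn))  # still 3
```

The trade-off is that plain mapping rows cannot carry their own permissions or audit info, which is fine exactly when the mapping is strictly internal to the package.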
As far as specific object types go, let's take dotlrn_clubs as an example. Clubs are objects because they need user-role permission checking: who are the admins, and who are the members? Admins can modify policy for clubs on an individual basis, and classes, which are implemented similarly to clubs, have a more complex role structure (students, profs, TAs, etc.).
If these weren't objects, dotLRN would have to implement a separate permissions management scheme of its own, adding to its complexity. The first two or three attempts at implementing a general permissions scheme for 4.x performed miserably, and it took me a lot of analysis before I figured out how to improve it to the point where it is usable with reasonably large numbers of objects. So what are the odds that a first attempt at a parallel scheme for dotLRN would scale well? Even if it did, new hackers coming to the system would have more code, with additional complexity, to master before they could extend the base system. Isn't avoiding this kind of redundancy and parallel effort important, too?
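As a rough illustration of the "one generic scheme" point, here is a hypothetical in-memory sketch in Python: grants attach to (object, role) pairs and users hold roles on objects, so the same check serves clubs with admin/member roles and classes with professor/TA/student roles. The class, method names and ids are invented for illustration; this is not the OpenACS permissions API or its datamodel.

```python
class RolePermissions:
    """Hypothetical generic role-based permission store shared by all object types."""

    def __init__(self):
        self._role_privs = {}  # (object_id, role) -> set of privileges
        self._user_roles = {}  # (object_id, user_id) -> set of roles

    def grant(self, object_id, role, privilege):
        self._role_privs.setdefault((object_id, role), set()).add(privilege)

    def add_member(self, object_id, user_id, role):
        self._user_roles.setdefault((object_id, user_id), set()).add(role)

    def has_privilege(self, user_id, object_id, privilege):
        # Same check for clubs, classes, or any other object type.
        roles = self._user_roles.get((object_id, user_id), set())
        return any(privilege in self._role_privs.get((object_id, role), set())
                   for role in roles)


if __name__ == "__main__":
    perms = RolePermissions()
    CLUB, CLASS = 100, 200  # hypothetical object ids
    perms.grant(CLUB, "admin", "admin")
    perms.grant(CLUB, "member", "read")
    perms.grant(CLASS, "professor", "write")
    perms.grant(CLASS, "ta", "write")
    perms.grant(CLASS, "student", "read")
    perms.add_member(CLUB, user_id=1, role="admin")
    perms.add_member(CLASS, user_id=1, role="student")
    print(perms.has_privilege(1, CLUB, "admin"))   # True
    print(perms.has_privilege(1, CLASS, "write"))  # False
```

Adding a new community type with different roles only means new grant rows, not a new checking mechanism.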
There's an underlying philosophical issue that the community may want to discuss: our target price/performance for real websites, in some sense. How much design consistency and ease of development are we willing to trade off in order to run larger sites on less expensive server hardware?
There's no obviously correct answer here. You and Timo argue that execution efficiency should come first, but development efficiency, as we keep working to support more rapid development, is another way to save clients money. There are likely to be trade-offs here. How many in the community truly expect to develop intranet sites for clients with a hundred thousand users (real users, as opposed to the unregistered visitors more typical of public sites)? And for those who do, doesn't the budget for such projects tend to be much larger than for more modest sites? How far should we go to simplify deploying extremely large, busy sites with significant development budgets if doing so increases the effort and cost of customizing or writing new packages for more modest sites?
I think these and similar questions are important, too ... and that there's probably not a single answer that fits the vision of everyone in the community. So they should be considered when we talk about revisiting the design of various aspects.
There's actually a fairly long list of detailed modifications to the kernel that we could pursue to help scalability without sacrificing generality, and we should start talking about them before too long. Again, upgrade scripts for existing installations are a real hassle. For instance, I spent about twice as much time writing and thoroughly testing the upgrade scripts for the permissions code as I did on the actual rewrite, and this is probably typical for changes to fundamental pieces of acs-kernel.