Forum OpenACS Q&A: Augmenting acs_objects - Add package_id, name, overview, etc. to it

Here's a request for comments. I want to see the acs_objects table augmented by some fields. We - I and a bunch of ArsDigita folks - added these fields while on a project - and it was immensely useful:

Code:

        shortname               varchar2(200),
        overview                varchar2(4000),
        overview_html_p         char(1) default 'f'
                                constraint acs_objects_h_ck
                                check (overview_html_p in ('t','f')),

Helps to get rid of the *ugly* and *performance-killing* acs_object.name PL/SQL function - an absolute no-no if you want to present large lists of "objects".
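To make the difference concrete, here's a hedged sketch of the kind of list query this enables (the shortname and overview columns are the proposed additions above; the first query shows the current style):

```sql
-- Current style: one recursive PL/SQL call per row returned
select object_id, acs_object.name(object_id) as name
  from acs_objects
 where creation_user = :user_id;

-- With the proposed columns: a plain column fetch, no recursion
select object_id, shortname, overview
  from acs_objects
 where creation_user = :user_id;
```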

Code:

	package_id		integer,
Makes a "show-me-this-object" page possible which creates the URL path from the package_id and the object_type. This would replace the get_url (or similarly named) PL/SQL function - which is very, very expensive. It also gives each OpenACS object a nifty and short *permanent* URL.

Something like this meta code becomes possible:


set query "
select object_id, shortname, overview, overview_html_p, ...
  from acs_objects, ...
  where ..."

perform_query

loop_thru_results
  set link_string "<a href='show-me-this-object?object_id=$object_id'>$shortname</a><br>$overview"

We added a few more fields, though I'm not always sure anymore why - like:

	state			char(1) default 'l' not null
				constraint acs_objects_state_ck
				check (state in ('l', 'd', 'a')),
Live, Deleted, Archived
        datastore               char(1) default 'a'
                                constraint acs_obj_datstr_nn not null,
Intermedia stuff
        publish_date            date default sysdate,
        toplevel_package_id     integer
Don't remember the background anymore - others may jump in here.

Later on I suggested that a site-wide package would *only* access acs_objects, no package-specific table - a query would just crawl through that very table, no need for expensive joins between large tables. And making a table "wider" is usually no big performance issue in Oracle.

How do we keep acs_objects in sync with the package-specific tables? Via triggers. Ah, and package queries would never access acs_objects, cos acs_objects' fields - kept in sync with triggers - are also part of the package-specific tables. Package-specific queries then don't have to perform expensive joins with the large acs_objects table.

(Unfortunately I never came to implementing this - I was moved to another project.)

Dirk

Dirk,

Sounds interesting. I definitely think it's really useful to be able to generate the canonical URL for an object in an inexpensive way.

May I add one thing to the mix: an object identifier, for example called 'object_no', which starts with 1 for each new package instance, with a unique constraint on (package_id, object_no).

This is necessary, because if you're ever going to migrate your data over to a new site, you cannot guarantee the same object_id's.

Hence, object_id should *never* be part of bookmarkable URLs.
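A minimal sketch of what this could look like (all names here are my assumptions, not existing OpenACS DDL, and it presumes Dirk's proposed package_id column is in place):

```sql
-- object_no starts at 1 for each new package instance
alter table acs_objects add (
    object_no integer
);

-- (package_id, object_no) identifies the object independently
-- of the object_id values a particular site happens to assign
alter table acs_objects add constraint acs_objects_pkg_no_un
    unique (package_id, object_no);
```

A bookmarkable URL could then look like /some-package-instance/object-no/42 and would survive a migration that renumbers object_ids.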

Thoughts on this?

/Lars

I think making it easier to get the name and URL of an object is important. What size lists of objects are we talking about?

Notifications and search use service contracts to get the URL of an object and it seems quick enough. Notifications does use acs_object__name to get the name of objects on the manage page; you can see it at https://openacs.org/notifications/manage

Of course, the length of this page depends on how many items you are subscribed to.

Dave, I'm talking about lists of hundreds or thousands of objects - about bulk pages. A single PL/SQL function call is not expensive; a PL/SQL function with dynamic SQL in the from or where clause of a SQL statement, though, is quite likely to be prohibitively expensive.

Even if the resulting page shows only a couple of objects, Oracle may thrash through the name function for every single row during result set creation - depending on what seems the smarter access path to the CBO.

Adding the field to acs_objects is simply the most inexpensive way of accessing frequently-used information.

I don't understand why you need package_id and object_type to generate the show-me-this-object URL. I remember having seen something like /object-by-id/1234 where 1234 is simply the object_id. Clicking on this URL will generate a redirect to the matching object, so the computation of the URL will only be done when the URL is clicked, and only for one object (instead of for many objects when displaying a list). That's what you intend to provide (among other things), right?

The get_url service contract that Dave is talking about is part of the notifications service contract. I would like to move that out and make it a stand-alone service contract, so that every package can publish the URLs of its objects without having to implement notifications. The only client of this service contract would then be a yet-to-be-written /object-by-id/ page, while listings, notifications etc. can just link to that page. It would make most sense to do this when Tcl service contracts are available, I think.

Dirk, can you expand a bit more on which columns of acs_objects should be duplicated in your opinion in the package data tables and how the automatic synchronization in both directions would work? E.g. do you propose to have a package_id column both in acs_objects _and_ the package's table? Will there be a trigger on both - if yes wouldn't they fire each other infinitely?

Right, show-me-this-object only needs the object_id in the URL, but you need the package_id to create the URL for the object - e.g. you may want to apply a specific template set for a particular subsite.

The automatic synchronization never went beyond being a proposal, but of course the triggers shouldn't fire infinitely. The basic idea was/is that site-wide packages work on site-wide tables and don't know about package-specific tables whereas packages don't work with acs_objects directly.

A site-wide package here is e.g. site-wide search or a linking service - something that is mostly (always?) read-only. It's quite likely that those services don't insert or update at all - and if they do, then in a package context.

I think Til is right about the object URL case. All we need is an object-goto page that accepts an object_id. Then that page can call the URL service contract. This way it is not run for every row.

This doesn't solve the name display issue.

Dave, the package_id should only become a field in acs_objects because the generic redirect page needs it to compute the URL *in* the code. An example URL would be show-this-object?object_id=123456 - no need for package_id there.

As for Lars' suggestion: I'm a bit at a loss - can you explain more verbosely what you want?

Dirk,

The object-goto page only needs to compute the URL for one page, so doing a query via an acs-service-contract shouldn't be a problem. Now, if the package_id is needed for other types of operations, they can be evaluated on their own. I just wanted to show that for redirection to an object-specific page, using acs-service-contract works pretty well. It works by calling the acs-service-contract assigned to the object_type. So far for search and notifications it works fine. Maybe it could be found to not work in all cases.

OK, we didn't use the service contract then - probably cos it wasn't available back then?!

And right - one page can perform a single rather expensive query ;)

Some of this stuff is already in the content repository datamodel, and I think that content-oriented enhancements belong there, not in the underlying object datamodel.

If we stuff too much into objects, then those who keep complaining that objects are too expensive to use in general will have been proven right.  And I'd like to avoid that :)

What we are trying to do is to convince people to store content in the content repository rather than roll their own ad hoc solutions.  Though I realize people continue to do so, that doesn't mean I want to encourage them.  Nor do I want to encourage people to look at acs_objects themselves as being in some sense biased to the building of content types.

And the CR already knows how to expose URLs, how to customize rendering of content types and objects, etc.

Having package_id in cr_items would be very useful.  This is the one item on the list that may make sense to include in acs_objects, too.  Though finding the package owner via context_id - if we follow through and decide that context_id does indeed want to be a true parent link - is fairly simple using the context index map and apm datamodel.  The question becomes "how often do you need to do this" ... a time vs. space tradeoff.

Is the PL/SQL call for the name information really that burdensome in practice?  Is it certain that this isn't just a result of poor query or page design?  Shouldn't we be paginating output rather than listing thousands of objects on a page?

<blockquote> Is the PL/SQL call for the name information really that burdensome in practice?
</blockquote>

Yes it is.

<blockquote> Is it certain that this isn't just a result of poor query or page design?  Shouldn't we be paginating output rather than listing thousands of objects on a page?
</blockquote>

Well, thousands of objects are definitely too much to show on one
page.  However, if you have, say, fifty objects you are already in the
area where scrolling offers better usability than paginating.  Fifty
queries on one page is a very bad idea.

For me, the notifications/manage page has 103 notifications listed. It uses the acs_object__name function to retrieve the names of the objects. It seems fast enough. I don't have any specific data, but we should test to see exactly where the problem is before changing acs_objects.

I just ran these two queries:

select object_id,object_type,context_id,creation_user from acs_objects where object_id <100;

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          9  consistent gets
          0  physical reads
          0  redo size
       2853  bytes sent via SQL*Net to client
        647  bytes received via SQL*Net from client
          4  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
         44  rows processed
I use object_type here to "simulate" the name field. Now on to the PL/SQL function:
select object_id,acs_object.name(object_id),context_id,creation_user from acs_objects where object_id <100;
Statistics
----------------------------------------------------------
        213  recursive calls
          0  db block gets
        511  consistent gets
          0  physical reads
          0  redo size
       3120  bytes sent via SQL*Net to client
        647  bytes received via SQL*Net from client
          4  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
         44  rows processed

A recursive call occurs when one SQL statement requires the execution of a further separate SQL statement.

213 versus 0 recursive calls. 511 vs 9 consistent gets. It already *is* a huge difference for a small query - and it grows bigger the bigger your system gets.

(The get_url function is *way* worse btw).

If PL/SQL is unusable - which is in essence the claim - can either you or Branimir educate me as to why Ars Digita chose to architect ACS 4.x around the principle that object system details should be hidden as much as possible within PL/SQL functions and procedures?

Branimir - surely no one wants to do 50 queries per page, one wants to execute a single query that returns 50 rows ...

I didn't mean to say that PL/SQL is unusable; I just showed that PL/SQL functions have to be used with great care - both in the select and in the where clause.

If a PL/SQL function hides complicated "business logic" - then it's ok. If it's as simple as spitting out a string - PL/SQL is not a good choice.

As to the "why" - no, I can't tell you; I was just a Professional Services dude at ArsDigita, not a core engineer.

I understand Branimir's comment - acs_object.name() itself executes a potentially expensive query, thus the "50 queries" comment.

So we're balancing code speed vs. space.  And there are folks who already argue that objects are too heavyweight, if anything.  Those arguments will simply increase if we add a name field, much less a possibly HTML-formatted description.

BTW "real" content - stuff that should be in the content repository anyway - already has name and description fields.

Do all objects in the system really require this much information?

Another way to present an abstract view of objects so that generic pages can find a name, etc. would be to use standard views to collect the information.  This would still require a join with acs_objects and the type-specific table but would be less expensive than the name_method approach.
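A hedged sketch of that view idea, with hypothetical per-type views collected by UNION ALL (all view and column names below are invented for illustration; only groups is a real kernel table):

```sql
-- One small view per object type, each with a uniform shape
create or replace view group_object_names as
select group_id as object_id, group_name as name
  from groups;

create or replace view note_object_names as
select note_id as object_id, title as name
  from notes;   -- hypothetical package table

-- The collecting view a generic page would query,
-- joining against acs_objects as needed
create or replace view acs_object_names as
select object_id, name from group_object_names
union all
select object_id, name from note_object_names;
```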

103 recursive SQL queries on a single page can perform "fast enough"
if appropriate indexes are in place etc.  That's not the problem.
The problem is that such a programming style produces sites that
don't scale as the number of users of a particular page grows.

Your notifications page is a good example.  It is probably not a page
people visit very often: a *regular* user perhaps visits it a few times
a year.

Something I've seen happening again and again with such pages is this:
After years of use without any problems, an admin sends a newsletter
to all registered users:

TIPS AND TRICKS

Did you know?  You can manage all your email notifications from one
place.  Just visit http://path-to-the-notification-page ...

A few hundred people click on the link within a few minutes time.  Your
site goes down.

<blockquote> If PL/SQL is unusable - which is in essence the claim
</blockquote>

It is not.  The claim is that functions that perform SQL queries are
unusable (most of the time) within SQL queries.  They are also
dangerous because people using them often don't realize what
they are doing.  It looks like one neat SQL query
whereas in fact it may easily be 1000.  An especially nasty example
is trying to use one in the WHERE clause, say:

select count(*) from acs_objects
where acs_object.name(object_id) = 'Foo'

Don't try this on a site with any substantial number of objects.  You
*will* have to kill the database process!
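The safe formulation is to go to the type-specific table directly once you know which type you're after - a sketch, assuming 'Foo' names a group:

```sql
-- One plain query against the type table, instead of a
-- name_method invocation per candidate row of acs_objects
select count(*)
  from groups
 where group_name = 'Foo';
```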

<blockquote> why Ars Digita chose to architect ACS 4.x around the principle that
object system details should be hidden as much as possible within PL/
SQL functions and procedures?
</blockquote>

Well, abstractions always sound like a good idea.  And I do think there
is some merit to hiding things in PL/SQL functions for constructors and destructors.  Accessors, on the other hand, simply don't work.  I think that the people
who designed all that weren't aware of the performance issues back
then.

Actually I think we even abandoned constructors in some parts of
scripts to migrate from an ACS3-based system to an ACS4-based
one.  But even in that case constructors were useful because you could
treat them as documentation.

I don't think that objects are heavyweight at the moment. The acs_objects table has fewer than 10 fields - adding a few fields doesn't hurt, especially if they are accessed frequently.

And space - hard disks are cheap and huge these days. We can drop a few useless indexes if space becomes a pressing issue.

I'd add as many fields to acs_objects as necessary. I count 307 occurrences of "acs_object.name" in HEAD - imho adding the name field makes a lot of sense.

Okay, so assume you add a name column to acs_objects. Have you calculated the time required to update the name, or any other field in that table? It is very, very slow. Probably that won't change too much, but how about the description field? I would guess that acs_object.name scales fairly well with the number of rows in acs_objects, whereas putting changeable data into that table doesn't scale at all.

Poorly designed interfaces that select hundreds, or thousands of rows of data, of any possible object type, should be changed to work nicely.

I also find it difficult to imagine an application where you would search for an object by name by using acs_object.name. The data exists in a separate table that the application should know about.

<blockquote> Do all objects in the system really require this much information?
</blockquote>

No; in a system where rel_segment is an object with its own entry
in acs_objects, this information doesn't belong in acs_objects.
I don't like the idea of having a bunch of columns that are NULL 90%
of the time.

So, the table in the CR that you mention might be the right place.

I personally wouldn't do it because I think that the whole idea of the Content
Repository is basically flawed (See Lars' posting on the subject about a year
ago or so).

Now, as I am also one of those people (just as a reminder, I *don't* want to start
that discussion again 😊) who think that objects like rel_segments are
useless, the optimal solution would be to first remove the useless crap from
acs_objects and then do what Dirk suggested.

Here's what we actually do: We use a separate table called acs_objects_description to
store the information Dirk mentions.  So whenever you decide that your object
type needs to be searchable, bookmarkable and generally behave as a piece of
content, you start adding those rows to acs_objects_description.  You
then join to acs_objects if you need information that is there.
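The thread never shows the actual DDL for that table, but modeled on Dirk's column list above, it might look roughly like this (table layout and constraint names are my guesses):

```sql
create table acs_objects_description (
    object_id        integer
                     constraint acs_obj_desc_obj_fk
                     references acs_objects (object_id)
                     constraint acs_obj_desc_pk primary key,
    shortname        varchar2(200),
    overview         varchar2(4000),
    overview_html_p  char(1) default 'f'
                     constraint acs_obj_desc_html_ck
                     check (overview_html_p in ('t','f'))
);
```

Only objects that actually behave as content get a row here, so acs_objects itself stays narrow.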

Tom, but I just proved that acs_object.name doesn't scale!

As to changing data: if you expect heavy updating/deleting/inserting then you can fiddle around with PCTFREE and PCTUSED. Given that the name field is usually quite small, row migration should almost never happen.

Updating/inserting/deleting into a second table certainly costs, but accesses happen far more frequently.

Following up on Branimir's comment: I always found that we had way too many useless joins between acs_objects and acs_objects_description. And I don't mind NULLed columns; they don't take space and don't cost anything when accessed.

So here's a step-by-step approach: introduce acs_objects_description with the above-mentioned fields and deprecate the name and get_url functions. If things work out well, then go one step further and try to deprecate acs_objects_description, or at least move around some fields (modifying_ip in acs_objects, for example, looks really useless).

Adding an acs_objects_description table is just an ad-hoc way to store metadata about content, so if you have objects that need name and URL descriptions, why not just store them in the CR and be done with it?  What you seem to be proposing is to build an under-featured CR out of the acs_objects table.  So now we'll have a CR that many people think is too complicated and over-featured, and to fix this problem, we will instead add CR-like features to acs_objects and create a parallel lightweight CR repository in acs_objects.

It seems that it would be better to address shortcomings in the CR, rather than create shortcomings in acs_objects.

Dan - an important correction: *nobody* suggested saving URL information in acs_objects or in acs_objects_description.

What is stored in the content repository? Would you store metadata information about a subsite there? Or information about sent email alerts? If so, then replace acs_objects_description with CR.

Null columns do take space in PostgreSQL - there's per-row metadata that is stored along with the data itself.  In general per-row costs in PG are considerably higher than in Oracle.  Per-seat costs are far lower, however :)

Essentially the various name_methods declared for various types just return a type-specific name field.  This is why creating a view could easily provide the abstraction necessary for a top-level page to grab information about a particular object in the system.  Assuming we really need to do this for non-content objects in a general way and that the current solution is really unsuitable.

We could speed up acs_object.name() itself by declaring that name_methods are not inherited ... but my guess is the big expense comes from the "execute" command rather than the query.

The problem with moving stuff to a generic descriptions table is that type-specific pages - say the group admin pages - need to join with the description table as well as the type specific table in order to do its job.  You've traded off speed in one place for less speed in another.

As to overuse of the object system itself ... well, if relational segments aren't objects, then one of three things become true:

1. You can't grant permissions to (say) "all Professors of Class Foo" without expanding the rel_seg and assigning the permission to each member, and without providing triggers to maintain consistency when members are added or removed.

2. Permissions are rewritten to take (table name, key) pairs again like in 3.x.

3. We don't require object_ids to actually point to acs_objects, i.e. we can have objects without the acs_objects metadata.  This would be radical ... maintain the flat key space so system-wide concepts like permissions work.  One could make the argument that the minimal level of auditing provided by acs_objects should be optional, after all.

On the other hand, I freely admit that each row in a relationship rarely needs to be an object.  There is certainly some overuse of objects in the toolkit's implementation.  I just feel like you picked a bad example.

I keep hearing arguments about the evils of the content repository but I don't see people actually running into problems in practice.  I do have a scheme in mind that would allow non-versionable entries into the content repository, meaning there'd only be one object for the item-revision pair.  It turns out that nearly all of the code looks like it would work with such a hack.  If we provided both versionable and non-versionable content, along with an enhanced and much easier to use Tcl API, would people quit picking on the CR?

The code that displays metadata about a subsite will doubtless know it's dealing with subsites ... or at least with packages.  In other words ... it is fine that this code knows where to find package metadata.

I'd say that it seems fundamentally obvious to me that there are relatively few things one wants to do via a web interface to objects en masse.  Let's face it, pages that list every object in the system are by nature not very useful except to the programmer, and as a programmer I don't remember this approach ever being useful to me.

On the other hand, pages that display information about broad classes of objects - content, for instance - do make sense.

That's where I'm coming from and I'd guess Dan's thinking something similar.

"Here's the content contributed by User Resu grouped by content package" ... something like this makes sense.  And the CR provides metadata common to content ... name and description being two examples.

"Here's every object contributed by User Resu - even though she was often unaware that something she did made a bunch of objects" ... not very useful.  How many site admins want to see a list of all of the relation objects this user has created by virtue of joining a group, etc?  How many casual users?

Well, if there is a name and description field in the Content Repository for most of the objects that do matter - what are the costs of accessing the "live revision"? In past incarnations you had to call another PL/SQL function for that ;)

We've been discussing adding name (and description) so far.

What are the feelings about adding package_id? IIRC that was very handy, too. Any thoughts on this?

The need to call a PL/SQL function to get the live revision is due to a religious belief that a line of code that reads

content_item.get_live_revision()

is more "abstract" than one that reads

cr_items.live_revision

an argument I don't make :)

package_id ... not sure one way or another.  I would certainly agree it can be useful for content items.  I know so because I added it to content items when working on the Greenpeace site, and it was very very useful!  We used it when grabbing all content generated by a particular package, i.e. "news for the dutch greenpeace site".

Package instances should create their own root folder and in theory this can be used to find all content generated by the package.  But in our case at Greenpeace we inherited some bad code that didn't bother to correctly generate root folders for each instance, just one shared by all instances...

Can you provide some specific examples where it has been useful?  There's quite a bit of sentiment to make "context_id" be a true "parent_id" (I know this goes against the original design, but in practice aD itself released a bunch of code that depends on it - bboard, for instance).  If we do that, finding all objects generated by a package becomes a matter of crafting the right CONNECT BY or tree sortkey query (depending on which RDBMS you're using), and these aren't terribly expensive.  More expensive than querying for the package_id, but then again more general, because you can query for the subtree underneath any object, not just a package ...
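For concreteness, an Oracle sketch of that alternative (it assumes context_id is treated as a true parent_id, per the discussion above):

```sql
-- Every object beneath a given package instance,
-- walking the context_id hierarchy instead of
-- filtering on a denormalized package_id column
select object_id, object_type
  from acs_objects
 start with object_id = :package_id
connect by prior object_id = context_id;
```

The tree sortkey formulation would play the same role on PostgreSQL.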

Hi,

I was only able to read the thread now; I hope what I post is relevant.  Based on what I see, we can use the CR for this situation rather than adding to acs_objects.

cr_folders already contains a package_id column.  This can be used rather than putting a package_id on cr_items.  Logically we would want to group items in a container like cr_folders.  So a package can get its items that have cr_items.parent_id = cr_folders.folder_id and cr_folders.package_id = :my_package_id.  ETP already functions this way.

Based on experience, content_item__get_path is very expensive, although content_item__get_virtual_path is a little cheaper.  What I did to generate the URL was actually to query some extra rows.  These rows are the ancestors of the cr_item I am interested in.  Then on the Tcl level I generate the URL.  It was a lot faster - a few extra rows on the result set and some Tcl manipulation, rather than using PL/SQL to get the URL.

As far as live or latest content is concerned, using the PL/SQL API is a lot slower than querying cr_items.live_revision or cr_items.latest_revision directly.  The PL/SQL is useful in some situations, but my rule of thumb is to query the live/latest revision directly when possible.

Using a Tcl API to access the CR is definitely easier, at the expense of less flexibility.  I have already made a lot of the Tcl procs to access the CR; after using them I would like to change them a bit, to make things better.

I hope it's relevant.

It is very relevant, Jun ... in fact hopefully Lars will see this because he's planning to modify a couple of his packages to use the CR, and I hinted to him that developing some Tcl API stuff might be a good thing to investigate while he does so (and I can help a bit, I think).  If he had some of your work in hand it would very likely save time and effort.

I think the difference between "live revision" and "latest revision" is clear enough that one rarely needs the PL/SQL fallback.  It's been a bit blurry in the original implementation of the CR in that some code operates on the "latest revision" if no "live revision" is available, but I'd rather tighten up our definition of what these concepts mean and avoid the PL/SQL sugar - because mostly it is just sugar and hiding the fact that either "live" or "latest" may be returned seems a bit confusing.  If you want it use COALESCE in the query where it will be immediately obvious to the next person who has to figure out what your query does.
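Spelled out, the COALESCE version makes the live-else-latest fallback visible right in the query instead of hiding it behind PL/SQL sugar (cr_items does carry both columns):

```sql
-- Prefer the live revision; fall back to the latest revision,
-- and make that choice obvious to the next reader of the query
select item_id,
       coalesce(live_revision, latest_revision) as revision_id
  from cr_items
 where item_id = :item_id;
```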

Just my thoughts.

Lars and Don,

You can see some of the api here.

http://66.250.57.130:9080/api-doc/package-view?version_id=2635

http://66.250.57.130:9080/api-doc/package-view?version_id=2606

Take note that it's in a pre-alpha state, and I will revise these based on my first experiences using it.  OpenACS isn't my day job anymore, although I do have more control of my time now.  I hope I can get help, especially advice.  Unfortunately I still haven't finished the high-level design doc. :(

It seems that some old unused procs still show in the API.  Anyway, sorry for the cruft, but a good amount of the procs are deprecated already.  It should be in better shape by end of Feb.

Jun, hang on. So all the data is in the database, just retrieving it is so
incredibly expensive that you hack together something on the TCL layer to make
it work somewhat properly?
incredibly expensive that you hack together something on the TCL layer to make
it work somewhat properly?

While this is - err, may be - viable for a project under deadline pressure, a
toolkit shouldn't be based on such kludges. You want to change a broken data
model here.

I'm a bit astonished that the acs_object model seems to be taken for "cast in
stone".

As Guy Harrison, a *real* Oracle expert, reasons on page 489 of his *most
excellent* book "Oracle SQL High-Performance Tuning" - the OpenACS object
model performs *worst* of the three possible sub-type/super-type design
models. It's the worst choice in every possible case except for full-table
scans of the supertype table.

(Now, that case doesn't make sense in OpenACS because fields in acs_objects
don't make sense without a join with one or more sub-tables or calling an
expensive PL/SQL function - hence we are falling back to the poorest choice
again - expensive scans.)

Adding semantically useful fields to acs_objects makes the "object" data model
considerably more useful! In my book almost every (every?) acs_object has a
name and a context. Heck, most objects even have a verbose description.

Now back then - mid-2001 - it didn't look like a wise decision to use the
content repository. I was told that these days the CR is in much better
shape, so if the name and the package_id are there, then let's use them
asap. Don, can you try a little scaling test there?

Hi Dirk,

I did not make those Tcl procs to fix a broken data model.  The design is to make it easier to use the CR.  I wanted a design wherein the UI package is separate from the data operations.  As I have experienced, most CMSes have the same functions and operations, but their UIs vary a lot.  Of course the theory must be tested more, and based on my initial stint it seems to be ok.  But I need to change it; most changes are naming conventions and merging or splitting some procs.

The CR seems to work ok for me, since the name (cr_items.name, normally used for the URL) and description (cr_revisions.description) are already there, and you can look items up by package_id via cr_folders.package_id.  The CR may be heavyweight, especially if you do not need versioning, although I have used the CR even when I did not need versioning: the UI does not expose versioning and I only keep one version in the CR.

Hope this is still related with the thread.

Who is saying the object model is cast in stone?  What's being said is that some of us disagree with you on the wisdom of adding new fields which will only be of use to a subset of objects.  In part my objection is due to the fact that in PG unused fields carry a per-row space penalty and people already worry about the cost of objects as they exist today.

If you want to rewrite the object model to use one of Guy Harrison's superior approaches, feel free.  Just remember, though, you'll need to write upgrade scripts so folks can migrate existing sites in situ ...

I guess in this sense, yeah, the object model *is* somewhat cast in stone for OpenACS 4.  We inherited it from aD and for better or worse chose not to rewrite it when we started our migration project.  I doubt you'll find a single member of the OpenACS implementation team that would argue that the model we inherited is perfect or even close to it.  However, when we started our assumption was that aD would stay in business and support ACS Tcl 4.x for Oracle and that we'd just be maintaining the Postgres version.  So we had very practical reasons for not rewriting the object datamodel.

In practice it seems to work quite well as is ...

For what it's worth, in CCM acs_objects is further simplified and some of the fields have moved to acs_auditing.  I am not sure if it's better or not.
Yes, I've seen that and meant to mention it, actually - thanks for pointing that out Jun.
Dirk, Harrison's book Oracle SQL High-Performance Tuning is indeed excellent, but your criticism of the OpenACS object model based on page 489 of that book is simply absurd.

Harrison barely even touches on the subject of object modeling in an RDBMS. His whole section on "mapping entities or classes to tables" is barely 3 pages long, he never mentions any real systems, and the table you are referring to on page 489 is at best a simple back-of-the-envelope take on the performance ramifications. It looks at three very generic and high-level options: "single table", "separate subtype and supertype tables" (which OpenACS uses), and "separate subtype tables only - no supertype table". At best, the whole section is useful solely as a starting point for further thinking about performance implications - it offers no actual answers, and it certainly doesn't even attempt to discuss any other software design considerations whatsoever.

And Harrison, at least, doesn't seem to pretend that those 3 pages are anything more.

Don, Jun, Andrew, I'm not criticizing the acs_objects model based on this book, but based on my 1.5 years' experience with an OpenACS-backed site with 20,000+ users, about 300,000 objects (without acs_rels!!) and a peak load of up to 120 concurrent accesses.

OpenACS' data model was not good enough for that site!

Basic things in the data model had to be changed. Some of these changes were rather light, e.g. adding semantics to acs_objects via two tables, acs_objects_description and acs_content_areas, was not that worrisome.

I think it's reasonable to think about moving something we would have in the majority of the type tables up into acs_objects. As a starting point, here are the object types and counts on openacs.org:
openacs.org=# select object_type, count(*) from acs_objects group by object_type having count(*) > 25;
       object_type      | count
  ----------------------+-------
*+ acs_message          |    38
 + acs_message_revision |    44
   acs_sc_msg_type      |    46
*  apm_package          |   140
*  apm_package_version  |    44
*  apm_parameter        |   128
*  apm_parameter_value  |   658
*+ bookmark             |   378
*+ bt_bug               |   254
*+ bt_patch             |    61
*  content_extlink      |   125
*  content_folder       |   117
*  content_item         |   476
   content_revision     |   168
   cr_item_child_rel    |    64
*+ etp_page_revision    |   341
*+ faq_q_and_a          |    50
*+ forums_message       | 32307
   membership_rel       |  6186
   notification         |  1486
   notification_request |  9832
*  site_node            |    97
*  static_page          |   188
*+ url                  |   504
*  user                 |  6186
   user_portrait_rel    |    31
I put a * next to those I think have a reasonable shot at having a name that is non-stupid (i.e. user_portrait_rel could have a name, but the name would simply be the ids in the table concatenated, which seems like a bad thing to denormalize).

I also put a + by those things I think would be better off in the content repository (etp_page_revision and acs_message_revision are in fact in the CR already iirc).

In looking at the results I am left thinking that most things you actually would want to display to people with an attached name actually should really be in the content repository anyway and would have names there. I do want to see the "all new content sitewide", "all content by user X", and "all new content in subsite Y" be fast enough to use (which includes making permissions fast enough to do this -- something Don is working hard on as we type).

Hi Jeff, so the name is in the CR; then I suggest we switch to using it from there and deprecate acs_objects.get_name for as many object types as possible.

(I'm tempted to suggesting adding a verbose description field to cr_items - but I'll do that later 😃)

There's one way to make acs_objects.name a lot lighter, btw: hardcode the most frequently used values. That's *very* easy and non-breaking.
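A minimal sketch of what such hardcoding could look like (Oracle PL/SQL): short-circuit the most frequent object types before falling back to the generic, expensive per-type dispatch.  The fast-path column choices (persons.first_names/last_name for users, cr_items.name for content items) are assumptions; check them against your actual type definitions before using anything like this:

```sql
-- Hypothetical sketch: fast paths for common object types,
-- generic lookup only as a fallback.
create or replace function object_name_sketch (
    p_object_id in acs_objects.object_id%TYPE
) return varchar2
is
    v_name varchar2(500);
    v_type acs_objects.object_type%TYPE;
begin
    select object_type into v_type
      from acs_objects
     where object_id = p_object_id;

    if v_type = 'user' then
        select first_names || ' ' || last_name into v_name
          from persons
         where person_id = p_object_id;
    elsif v_type = 'content_item' then
        select name into v_name
          from cr_items
         where item_id = p_object_id;
    else
        -- fall back to the existing generic implementation
        v_name := acs_object.name(p_object_id);
    end if;

    return v_name;
end;
/
```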

One point as to why adding a central name field is pretty straightforward and not that worrisome: it's a change that is visible even to a rather naive user. People complain quickly if "object xyz doesn't have a name".

That's only true for objects that are exposed to the user, and one of the fundamental flaws of the thinking of the aD 4.x crew was that exposing all objects to users (admins, at least) via the web is useful.

It's not.  Only certain classes of objects should ever be exposed directly to users, even admins.  Those that are internal should only be manipulated via web pages with a UI that makes sense to the user.

Having an object name in acs_objects wouldn't bother me particularly, especially if we didn't have to carry the cost of an index on it (i.e. discourage lookup by name).  But I think the notion that most named things belong in the CR is reasonable.  I also think that most of the named things that don't belong there - site nodes, for instance - ought to be manipulated by intelligently designed pages for those types, which will "know" where to find the name without calling acs_object.name (this is already true for site nodes).

So I'm still in serious need of convincing by more than one voice.

But dragging around columns for a description and html flag is something I can't see any justification for (again, keeping in mind that in PG there's a per-row cost for NULL columns)...

Don, what about this query : "Show me all the subsites I have access to". This query was performed on *every* single page in ShareNet and originally - iirc - took two seconds. Uh, of course we moved it into a static cache. Not using the get_url and name PL/SQL function made it a pretty light query.

Verbose description: used by search and site-wide linking. Very important packages...adding a varchar2(4000) was not performance-inhibiting on Oracle.

As it stands I'm more interested in deprecating two dangerously useless PL/SQL functions than in throwing more fields onto acs_objects. There seems to be a consensus on how to achieve that: *fine*.

If I have achieved these things as well, I'll pop open an extra bottle of Weißbier tonight:

  • Made people aware that PL/SQL in SQL (or PL/pgSQL in SQL) usually doesn't scale.
  • Got people asking: how do I get this and that information efficiently in a frequently-used site-wide package? Hmm, why not from acs_objects_$annotation_table?
  • Made the point that joining with acs_objects just to get trivial information like the creation user is expensive.
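To illustrate the annotation-table idea from the list above, here is a hedged sketch of a narrow side table keyed by object_id, echoing the fields proposed at the top of this thread.  All names here are made up; the point is only the shape of the query a site-wide package would then run, one cheap join instead of crawling every package-specific table:

```sql
-- Hypothetical narrow annotation table, one row per annotated object.
create table acs_object_annotations (
    object_id       integer
                    constraint acs_obj_annot_fk
                    references acs_objects (object_id)
                    constraint acs_obj_annot_pk
                    primary key,
    shortname       varchar2(200),
    overview        varchar2(4000),
    overview_html_p char(1) default 'f'
                    constraint acs_obj_annot_h_ck
                    check (overview_html_p in ('t','f'))
);

-- A site-wide listing then needs only one join and no PL/SQL calls:
-- select o.object_id, a.shortname, a.overview
--   from acs_objects o, acs_object_annotations a
--  where a.object_id = o.object_id
--    and o.creation_user = :user_id;
```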

Lo and behold - I found this query in the logfile of a HEAD-based system while browsing around:

    select site_node.url(node_id) as url, object_id,
           acs_object.name(object_id) as object_name,
           level
    from site_nodes
    start with node_id = :node_id
    connect by prior parent_id = node_id

It's executed on every click (and the site_node.url function gets more expensive the more subsites you add to the system).

site_nodes does have a name field though which looks sensible. Is it safe to rephrase the query?

Yes, it would be good to use the site node name entry (after checking that the method defined for site_node indeed uses that column.)
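A sketch of that rephrasing, assuming the name column on site_nodes holds what the registered name method would have returned (worth verifying, as noted above), which would drop the per-row acs_object.name call entirely:

```sql
-- Sketch: the same context-bar query using site_nodes' own
-- name column instead of the acs_object.name PL/SQL function.
select site_node.url(node_id) as url, object_id,
       name as object_name,
       level
  from site_nodes
 start with node_id = :node_id
connect by prior parent_id = node_id
 order by level desc
```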

It would also be good to store the url as a field in the site node table, calculating it when the node is mounted or renamed.  This could be done via trigger.

This is code we've inherited from the original aD 4.2.  There are many, many examples of queries like this that we should clean up as time permits.  So far we've been concentrating on big things, i.e. I'm about half-way through my reimplementation of permissions, something that affects *everything*.

But patches to fix/improve suboptimal queries or datamodel choices are *always* more than welcome, though for obvious reasons those that fix both Oracle and PG rather than just one or the other are a heck of a lot more likely to get included ASAP.  And if there's a datamodel change (i.e. to maintain site_node.url as a column in the table rather than compute it every time) then we need upgrade scripts, too ...

Note though that this isn't a very good argument for extending the basic object with extra data.  The code here *knows* that it is building a context bar from site node information.  It doesn't fit the criteria of "code that examines objects extracting data without any type-specific knowledge of the objects" which in my mind would justify extending the object datamodel.