Forum OpenACS Q&A: universal way of displaying lists of objects needed

Hello everyone!

Currently i'm working on a categorization package which packages can use to let users categorize their objects. It will contain an object directory that displays all objects of a selected category. To generate such an object list i need a FAST way (forget about acs_object.name - it's just way too slow) to get the object_name and hopefully also object_overview (possibly with object_overview_html_p). Although Dirk Gomez' proposal didn't get accepted, unfortunately i have to get this into the discussion once more because not all packages do use the CR and for some objects using the CR would just not be necessary (site_nodes, apm_packages etc). Therefore, i will now use the following datamodel for the categorization package and patch all necessary packages to make use of that table.

create table acs_objects_description (
object_id integer not null constraint acs_objects_desc_pk primary key constraint acs_objects_desc_object_id_fk references acs_objects(object_id),
shortname varchar2(200),
overview varchar2(4000),
overview_html_p char(1) default 'f' constraint acs_objects_desc_overview_h_ck check (overview_html_p in ('t','f')),
datastore char(1) default 'a' not null,
package_id integer,
state char(1) default 'l' not null constraint acs_objects_desc_state_ck check (state in ('l', 'd', 'a'))
);
Column datastore can be used to maintain a site-wide index and state shows if the object is live/deleted/archived. Package_id is quite important since i'll work on a templating package that offers a way to assign a different template set to every package instance. Traversing the object hierarchy through context_ids just takes too much time....
Collapse
Posted by Jeff Davis on
I don't think Dirk's proposal was rejected, I think we just wanted to discuss it some more. Some issues I see with what you are talking about are a) I still don't understand what datastore is for b) would all objects get a row in this table or would it populated for only some objects? c) what about things in the CR which already have a name, title, and description?
Collapse
Posted by Peter Marklund on
Timo,
great to hear that you are back doing OpenACS development! I have a couple of questions:
  • Do you have a spec for the categorization package that you can share? How will it differ from the one you and I wrote in Munich for aD? For reference, the old categorization package can be found this thread.
  • Which object types do you plan to be browsing?
  • Would it be feasible to convert the object types that you need to using the CR?
  • Is it established that acs_object.name is too slow? Can you try the new categorization package and alternate between using a name column and the plsql proc and see what the difference is in response time?
My basic feeling right now (not having followed this discussion in great detail) is that name and package_id columns would make sense to add to the acs_objects table, or in a separate table as proposed here.
Collapse
Posted by Timo Hentschel on
Hi Jeff!

I'll try to answer your questions:
a) i've never really been knee-deep in site-wide-search and intermedia in particular, but if i'm not mistaken this column has been used for a trigger to know when to stuff new/changed content into intermedia.
b) only named objects will get a row in this table, so that would mean bboard messages, files, news, site-nodes etc. We certainly don't need to store infos about mere internal objects which the user will never see nor know that they even exist.
c) well, just copy this info via a trigger to the objects_description table. I'm not sure btw (maybe you can enlighten me on that one) whether the description column in cr_revisions always hold a description of the revision OR a description about changes in the current revision etc. More so, is it sure that description is mostly set to some meaningful text (e.g. first 1000 chars of a text message) or does it always depend on the lazy user to add some description (which he rarely does, it guess)?

Collapse
Posted by Timo Hentschel on
Hi Peter!

Yeah... I'm back! To be honest, i don't really remember the old categorization system too well. Can you provide a link to a demo system where i can have quick look? The links you provided in your old posting aren't working any longer....

To answer your question regarding acs_object.name: Yes, i am sure that acs_object.name just has to be slow if used displaying a list of objects. Why? Well, if you think about it, it just has to be slow to have to invoke this proc for every single object whose name you want to display in an object list, because this proc itself will execute a few queries to get the correct name. If you compare this with an approach in which you just have one single query to get the objects together with their names out of a table then you don't need to set up a test system to compare speed because it's evident that the second approach has to be way way faster than the first one. In my opinion query speed is the topmost design-criteria overruling every other criteria possible, so therefore i see it as established that acs_object.name could be used to get the name of one or some few objects, but for displaying an object list my proposal for a table like acs_objects_description is the better since faster solution.

Collapse
Posted by Peter Marklund on
Timo,
sorry, the links are to an old dev server that we are no longer using, you can still obtain the categorization sources from cvs though (see cvs commands in the thread).

Regarding the performance problems with acs_object.name you are probably right.

Are you planning on contributing this new categories package to OpenACS?

Collapse
Posted by Don Baccus on
In my opinion query speed is the topmost design-criteria overruling every other criteria possible
I'm old-school - correctness should be the design criteria overruling every other criteria. Hopefully folks designing airplane control software, etc, still follow this old-fashioned way of thinking.

I've measured acs_object.name() and on my P500 it runs about 3 milliseconds in both Oracle and Postgres, not terribly slow but not terribly fast, either. On a modern system I'd expect performance to be closer to a millisecond.

Denormalizing by carrying the name in acs_objects isn't that big a deal, but ... it does involve munging code to make sure it's maintained correctly. At minimum anyone doing this who expects to find their work accepted into our mainstream OpenACS core is going to have to address the fact that content items already contain a name field, and provide upgrade scripts to migrate that to acs_object as well as fix all our standard packages that use the CR. That would not be a lot of work but it is necessary work.

The name column was the one proposed extension to acs_objects that I felt had some merit, others just seemed to widen objects in ways better handled by keeping content in the already-wide content repository. Denormalizing content URLs for instance makes sense but I'd rather see that stored in content item (in part to heavily encourage folks to put their content there instead of ignoring the CR's existence)

Collapse
Posted by Jun Yamog on
While waiting for this darn site indexer to index a ton of content.  Maybe I can contribute something.

Some people do tend to think that CR is wide and heavy.  While acs_objects lacks some stuff and needs to beef up.  Maybe what we need to have a new convention of none revisioned content.  Content that does not inherit from cr_revisions.  Pretty much like what is cr_folders is right now.  CR folders are from cr_items and not from cr_revisions.  Although its none standard maybe it possible to just use something similar.

There had been many cases when I worked in CR that some content type did not merit revisions.  I did still however made use of revisions and just hide the aspect of revision from the UI.  The above (acs_objects_description) can be derived from cr_items.  Since we are just interested on objects that are displayed on the user and not internal objects, its logical that we use CR.  Although current CR usage does suggest to have revision, but its not hard not to use revisions.  Its like having cr_no_revisions table, although that would not differ from cr_revisions :)

Anyway just some thoughts, from someone trying to catch up with OpenACS.

Collapse
Posted by Timo Hentschel on
Ok, ok, correctness is a little bit more important than speed, i guess...

Regarding the speed of acs_object.name(): It is nice that it runs fast on your machine, but i don't know many million objects your system is holding and how many concurrent page requests your system does have.... Under load a system has to serve every page as fast as possible and i think i made it pretty clear that the invocation of pl/sql procs for every object to be displayed in a possible long object list (yes, normally a long list should be split on several pages, but still...) just has to be way slower than my proposed approach. That doesn't mean that acs_object.name() should not even be used to get the names of a limited number of objects. I'm just saying that this proc is one of the ad inventions that seemed nice on first sight, but if you really think about it and really try to use it in a huge system then you realize how flawed that approach is.

Regarding migration: Yes, i will make sure that migration scripts for the major packages will be written (anyone here for help?).

If you already agree that adding a name column to acs_objects (or a second table as i suggested to not blow up acs_objects unnecessarily since we don't need these infos for every single acs_object) makes some sense, then why not also adding at least package_id if not state, overview and overview_html_p?

I'm also thinking that if using revisions simply doesn't make too much sense in a package then it simply should not use it and instead overwrite the one (and only) revision for each item each time the content gets edited (don't know if that already happens, if not, then we might think about that).

I also have to add that it seems that we will not only need the name of objects that use the CR, but also for objects for which it doesn't make sense to store them in the CR.

I want to give an example: Currently, i'm working on a categorization and a mailing list package. Mailing lists can be assigned to some categories. Hopefully, the category package will get used in other packages, too, let's say bboard and file-storage. In the categories package i want to have a directory in which the user can specify a category and he'll get a list of all objects categorized in that way, so he'll see a list of all bboard-postings, files and mailing-lists assigned a selected category. But mailing-list objects will definetely not be kept in the CR since that really doesn't make any sense, wastes disc space and blows up the already huge CR tables.

That's just one example for objects which will not be in the CR, but still might to get displayed in an object list.

So that's why i'm proposing adding a seperate table for that information as i suggested above and not using only the infos in the CR.

Collapse
Posted by Jun Yamog on
Hi Timo,

Are those mailing list objects are like postings of this forum?  If yes then I guess it would be ok to put them into CR.  So they can also be searchable.  I am not 100% sure but I think this forums is also a CR app.  I think the bboard package is also a CR app, also not sure (a bit out of touch of OACS).

Since you are also looking at using package_id for your templating you may want to look at CR capability of relating a template on an item level.  Also the templating of CR does enable you to relate many templates on a single item.  You can use different context (not context_id) for each item.

Also I think the size of the table rowise is not much an issue for rdbms.  As long as things are properly queried and indexed.  This is what I have picked up here in OACS, of course I am not a db guru.

On the other hand CCM I think adopted the use of acs_objects.name column to display the name of the object, not sure also.

Anyway I am not saying that your proposal is not good, I am just offering some thoughts.  I can not decide on its merits as I am trying to keep up with OACS.  I think Don and others can give better thougths... Long day for me, I better sleep.

Collapse
Posted by Don Baccus on
I also have to add that it seems that we will not only need the name of objects that use the CR, but also for objects for which it doesn't make sense to store them in the CR.
Yes, I agree entirely. My point is that the CR already has a name field and we want to avoid duplication - if people agree to an object name in acs_objects then we should make the CR use that field rather than maintain its own.

If we take the separate table approach then things like the overview field make more sense to me.

In this case wouldn't the logical way to define that separate table be to subtype acs_objects?

acs_object -> named_object ... this would separate out "skinny" non-named, non-overviewed objects from fancy named objects with (possibly) HTML overviews. Then your extra table would just be the type extension table for acs_object and bits of cr_item (name, description) could be folded in, making cr_item skinnier. At the cost of an extra join for cr_items I'm afraid but if enough were to go into the "named_object" field then you wouldn't need that join until you actually pull out a revision ... and they're often BLOB/CLOB so we don't expect great speed out of that operation anyway.

On the other hand pages summarizing objects by name would query against the named_objects type-specific table and not have to chew through internal objects that have no name ...

(BTW probably want to use MIME-type not html_p as does the CR)

Regarding non-revisioned objects ... I've already investigated the implication of setting a revisions revision_id == item_id and at least 95% of the CR code looks as though it would work unchanged. This would give us non-revisionable CR items that would only occupy one object rather than two. acs_object->cr_item->cr_revision rather than (acs_object->cr_item) + (acs_object->cr_revision). I've also toyed with the idea of always making the first revision this way and only creating separate cr_revision objects for versions after the first one ... this is for the future but my point is that I think we can make the CR "skinnier" without killing the advantages of using it.

Also I wasn't implying that 1 or 3 milliseconds was so fast that we could ignore the problem, I was merely providing a datapoint for folks. I'm not terribly interested in worrying about the speed of pages that list thousands of objects, though - we shouldn't have such pages and the fact that we have one such page today in the admin UI is a bug, not a feature IMO. But taking 1/10th of a second to call acs_object.name() for a page with 100 objects displayed is not a particularly short time.

Don, that subtyping sounds more or less exactly like what i originally intended with my proposal of the extra table!!

I'm personally not afraid at all to maintain the naming information in cr_items and my extra table acs_objects_description, because we could use a trigger to copy the data from cr_items to the other table. Ok, it would mean redundant information but it would speed up queries in the CR, i assume (selects are by far more common than inserts/updates, so i think it's more important to focus on fast select queries). The bottom line to this would be: It's not important at all whether a package stores a copy of the object names in their own tables as long as it makes sure that the acs_objects_description table is always uptodate, although i totally agree with you in saying that the datamodel should be normalized if and where possible, but in the end that shouldn't lead to expensive queries if avoidable.

Attempts to cut down the size of CR tables are always good, i think!

PS: I want to clarify that the mailing list objects mentioned above are just objects each describing a mailing list, i didn't mean actual mailing list postings.

PPS: I thought about mime-types, but i thought that when displaying an object in a list of objects the object overview could only be either text or html, but if other scenarios come to your mind, then i would also opt for a mime_type column.

Collapse
Posted by Don Baccus on
OK, let's see what other folks think about a named_object subtype ...

I do, of course, worry about extra joins.  However one of the things I want to work on (or get someone else to work on) is the caching of the results of content-retrieval queries in a generalized, consistent manner.  Maintaining cache consistency system-wide is a difficult problem but I think that it's solvable in a general way for subsystems like the CR.  And that will do a *lot* more to improve performance than removal of a simple join.  You can put 2GB ECC Registered DDR into a server for just over $400 - in this day and age we are crazy-stupid for not providing smart content caching where we can do so safely and consistently!

So I'd like to see us work towards a clean datamodel that, of course, is still efficient but also to keep in mind that they BEST way to make db queries fast is to NOT DO THEM.  Cache!

Collapse
Posted by Dirk Gomez on
Peter,

here's the PROOF that acs_objects.name (and get_url even more so) is expensive:

https://openacs.org/forums/message-view?message_id=75566