Forum OpenACS Development: Augmenting acs_objects - adding package_id, name, and clarifying other fields

Almost a year ago I suggested adding a couple of columns to acs_objects (See https://openacs.org/forums/message-view?message_id=75455.

Apart from that discussion I had a few more and most people expressed approval. However before "TIP"ping I want to solicit updated opinions.

In short, we have learnt how to use the object model two years after it has been conceived of, we know its advantages and disadvantages. It generally is fit for packages that exist on its own.

It is not yet really ready for site-wide services like:

  • a well-performing site-wide search (the one in Oracle doesn't work)
  • a generic linking module
  • a notification service that works on all packages
  • ...

So we have a unified permissions model, but not yet a unified model of retrieving the package_id or the name of an object. PL/SQL or PGSQL is not an option here because it does not perform (look up above thread), hence my suggestion to add package_id and object_name to acs_objects.

An object of a particular type is currently usually only shown within its package's scope although for a user it doesn't matter at all whether he looks at a calendar item, a bugtracker ticket, or an address. On the web everything is a link. A centrally stored package_id and name would enable *every* package to create a link to this particular object_id.

Central services would not only be more efficient in terms of CPU and IO consumption, we'd also get a lot more efficient in programming: E.g. when extending notifications towards a new module I don't necessarily need to know the internals of the package's data model, it is probably enough to look at acs_objects.

However there was at least one unresolved issue: I18N. Don made a great suggestion when Tilmann and I were chatting with him: objects whose name needs to be localized would have the message_key as their name.

And when touching acs_objects now, what other site-wide information may be lacking there?

E.g. when I told Tilmann community_id might be cool there as well - in DotLrn every object eventually belongs to a community, right? - I got a big honking no.

What about archived_p if we think about an archiving service?

Which columns' definition is not clear or not clear enough? E. g. what does last_modified mean for forum posting? The last_modified date for first question or for the particular answer?

Once we are through with this (the second time), I'll tip it.

Hi Dirk,

I am agreeable to most of what you said.  Although on the package_id can I ask for a clarification.  That this mean that is a one to one mapping?  Would it be good if different packages can say display or manipulate the same object?  Or is this package_id is more or less like the mother package?

For example a news object, its mother package_id is the an instance of a news package?

Is the name a link?  Much like cr_items.name?  If so then I guess some of the things in CR will have to move out to be in acs objects.  Is that right?

Objects are not necessarily related to a package_id. If you want that effect, just use the context_id of an object to refer eventually, via the context map to the package_id. But there is no inherient relationship between the two.

Someone decided that groups needed a package_id relationship. The result is that if you create groups via the pl/sql procedures they don't show up in _any_ group administration page.

Bottom line is a package is just another object. Don't give it special meaning by placing it in acs_objects.

Some more applications that would benefit a lot from centrally available object data:

  • The user contributions page
  • Categorization (to browse through categories)
  • Search (I repeat it deliberately since it is so important in my opinion)
  • Calendar with arbitrary objects attached to events
  • ...

If some basic object data would be available via the acs_objects table that allowed any application to efficiently render a list of objects without knowing about the implementation of each particular object type, then these applications would become feasible and OpenACS would gain a lot from the central integration that the object model already offers to a good extent.

The discussion about this stalled last time because of i18n issues with acs_objects.name. Don's idea of optionally putting a #message_key# in there instead of normal text (allowing both) seems like the optimal solution. Background: some object names must be translatable, otherwise they make no sense. Other objects are so infrequently seen by the user, such as acs_rels, that they don't need to be translated. Again other objects have a genuine name that doesn't need translation, e.g. a file-storage file could save the uploaded filename as acs_object.name.

creation_date and last_modified_date are both already there but sometimes such as in the case of forums they are not used in a way that makes them usable for general listings - it would be good if we could define exactly what last_modified means and make packages adhere to that standard. Then it could be displayed in search results for example (don't you just like to look into that forum thread first that was updated yesterday instead of the one last updated in 2000?). And this example shows also why it would be benefitial to have it directly in the acs_objects table, instead of a pl/sql function or a service contract: based on age (current_date-last_modifed) search results could be sorted and ranked. These are all needed features of a good site-wide search, that makes it superior to indexing from the outside.

Yes, the package_id would be the mother package. I don't understand the question to be honest - every object has a package that created it, no? Can you think of an example that doesn't? And the main point is that the package_id allows for guaranteed dispatching to the code that knows how to display it and where. Objects will always be obviously related to their creation package, and no, a package is not just another object, as I see it it is also a bunch of code and a location in the site-map which knows how to deal with its objects.

context_id is only used for the security hierarchy, and I have heard a number of times that it should actually be called security_context_id and not used for anything else, although I can't recall the rationale behind that right now. Anyway I think it would be much more efficient to have the (maybe somewhat redundant) package_id in acs_objects too, compared to having to do a 'connect by' all the time.

Another field that might make sense in acs_objects would be the subsite_id - opinions on that? It would allow for quick filtering e.g. of search results or objects attached to calendar items to be only from the current subsite.

Til.

You can probably calcualte the subsite_id using the site_nodes table and connect by (oracle) or tree_sortkey (postgresql)

Besides that, I agree with everything here. It is necessary to allow multiple instances of a package to store the package_id of an object. Some objects do not have package_id in their type specific table such as groups, acs_rels and other internel system objects.

I don't see this as a problem. All of the packages supplied with OpenACS are using package_id currently in the type specific storage table. This works fine to scope objects to the correct package instance, but does not help in searching or other general operations.

subsite_id in acs_objects: Yes, because not only would I want to limit searches (or whatever service), I definitely want to add the subsite name in every output whatsoever:

* object <foo> in community <bar>

Be it an email notification or a search result. (OpenACS == Open Architecture Community System)

Calculating the subsite_id with the site_nodes table is NOT straight-forward in Oracle because of the limitations of connect by up until 9i. (See the rather klunky PL/SQL procs Harish and I posted here

So no opinions about archived_p, deleted_p or a state column? My gutfeel is that they are not too useful currently but may become so very quickly...then they could probably be added without much ado.

No, the package_id in acs_objects would model a 1:1 (or 1:0 in case package_id doesn't make sense which Tom seems to be hinting at) relationship. It should be the package_id of the "mother package", the one you would most closely associate with this particular object. Is there a possibility of 1:n or even n:m (package_id : object_id) in OpenACS? I doubt it.

Yun, your example is right: for a news object the respective mother package_id is that of the corresponding news package.

Name - exactly as in cr_items. Does it make sense to duplicate this? Probably not, especially as cr_items already has locale, content_type, publish_status etc.

So, is there any reason why we wouldn't stuff every named object's name et al. into cr_items? CR experts - clarify please.

How would a generic object map work? If you want to relate two objects, stick them in this table with a tag on the relationship. You can use the same idea for a package_id, or subsite_id, allowing objects to be associated with zero or more packages or subsites.

create table object_map (
 object_id_1,
 object_id_2,
 tag
);

select * from object_map where tag = 'site image';

select * from object_map where object_id_1 = :my_site_node
and tag = 'site image';

select * from object_map where object_id_2 = :my_image
and tag = 'site image';

How the generic object map would work - sure. But do we need it?!

From past project I would say: no, we don't need it.

Dirk... its Jun :)  I think Tils is getting to you.  He is the only one that calls me Yun.

1 object and multiple packages is possible.  Lets look at for example news object.  A news package is the mother package, but since the news object is a content type, its possible that another cms package can either manipulate it or display it.  I am not disagreeing with putting package_id.  Just would like to make sure that its not enforced to be 1:1.  One of the main purpose of making a central object system is that other packages can either display or manipulate it.  If it was the case of 1:1 then its like ACS 3.x.  Basically it may not only be central services that can touch other objects.  It may well be other packages.  For example integration between logger and bug tracker, these packages may be manipulating a single shared object.  Anyway main point is that I hope I voiced out that 1:1 on package_id will NOT be imposed.

As for the name.  If we are to put it into acs_objects then we should remove cr_items.name.  maybe leave it there for legacy, not sure.  But it should be nice that it should not be duplicate.

As for stuffing all named objects into CR.  Some may disagree on the additional bagage CR will put.

The way I see it I think here are the options:

- add fields to acs_objects, if some exists in CR.  Remove from CR like name.

- put every name object in CR. and slim down CR.

- since acs_object is too light and CR is too heavy.  Create something in between.

I am not sure where to go.  But I do think that change is ok and good.  Just not sure where is the best path.  Hopefully with some more input, you will be able to decide our best path.

BTW, as for the archive_p, delete_p or state column.  publish_status in CR looks similar, but not the same.

Jun, yeah it's Til's fault. Sorry :)

Let me rephrase my question: Can an object belong to multiple to packages.

(That a particular object can be shown in different packages without much trouble is a goal of this proposal.)

I'd stuff name as a varchar2(200) field into acs_objects and remove it over two releases from CR.

What problem does it solve to add package_id and subsite_id to acs_objects, that you couldn't get from a mapping table? When you add fields to acs_objects, does that require you to modify pl and tcl apis for all objects? I guess I keep missing the point of what problem requires the extra fields to be in acs_objects? Shouldn't acs_objects only contain fields used or useable by every object?

I guess we are coming down to the question of performance vs. clean design here. I do agree that mapping tables would be the way to go in an ideal world. But they always have the drawback of one additional join. If we make more of them, then we have more joins (e.g. one for package_id and one for subsite_id).

So I think adding certain items to acs_objects and removing them as Jun suggested from CR would make sense. Especially as we so far have not made a decision on whether to strongly encourage usage of CR in every module. It would be bad to add functionality that can be used in categorization and search to CR and not having all modules use the CR.

Why clean design vs. performance? I am seriously asking whether there is a *must* to have a mapping table or not. I am not yet convinced that there is, so it is clean design to add package_id to acs_objects imho.

The fields to be added are useful for every object.

Performance vs. clean design? I thought it came down to what the requirements were, first off and if the fields added really belong to every acs_object.

If it is conceivable that someone would want an object to have zero or more packages or subsites it should be related to, you are stuck with a mapping table. Performance cannot be deteremined until someone lists what they are trying to do, why that is curretly hard to do, and what the new solution would do to make the situation better. You also need to include negative effects.

If you move package_id out of object specific tables, then display inside the application requires an extra join. But it makes it easier for someone to find all objects related to a package. Which is more likely to impact performance? What if you need a one to many relationship?

Many basic objects don't need this, and package_id and subsite_id could never contain a valid value. users, parties, group, rel_segments, acs_rels, etc. Acs_objects needs to contain fields common to _all_ objects, or at the very least only fields which could have valid values for all objects.

Finally maybe some extra fields could be removed from acs_objects: context_id. It hasn't been considered yet, but it might be useful to have more than one context for an object. Also, a lot of applications don't even use this field, so it has a zero or more relationship to other objects. (If you used a tag of 'context', maybe you could use the same object_map table.)

Its ok Dirk.

Anyway will leave it up to you and OCT on where to go.

I don't think cr_items.name is the same as acs_objects.name would be. cr_items.name is a terrible misnomer - it should be sth like cr_items.urlpart since that's what it is. cr_items.name (should) contain  a short, lowercase identifier with url compatible characters only, and unique within a particular content folder. All requirements that acs_objects.name wouldn't have. The pretty name of a cr_item can be found in cr_revisions.title instead.

acs_objects.name would ideally duplicate the cr_revisions.title of the live revision, or last revision if there is no live revision.

Adding acs_objects.name would centralize a facility that many packages have already - to store a pretty user displayable name for specific objects. E.g. forums_messages.subject or surveys.name. It would propably be advisable for the packages to move storage of name over to the centralised service. Or think about doing all this in the CR (sorry just rambling).

If it goes into acs_objects, then a null value in acs_objects.name could become 'One acs_object_types.pretty_name', e.g. 'One User Portrait' when displayed.

Til,

That change seems to be ok and good.  And is present in RHEA.  But from what I understand Dirk is looking for the urlpart.  Hence its cr_items.name.  Not sure.

So which part is it Dirk?  a url part or pretty name?

Both :)

Can you give me an example when url_part and pretty_name are different? And where it makes sense that they are different?

I think for things with non ascii pretty names you will need a plain ascii url part. What about things like forum threads where the pretty name might be the thread subject but the subjects are not always unique so you would need to disambiguate them in the urlpart. Or what about a document with a long title which you want to show up at a url which is not so long. e.g. something titled "2003 Annual Report to the Audit Committee" which you want at url /audit/2003Report or similiar. The presumption that the url fragment and the content item pretty name should be the same seems to be way too restrictive.
(DaveB pointed out the difference on #OpenACS)

I want the name that I would show to the user in a URL.

Also the URL part I am not sure how we are going to enforce it uniqueness.  Maybe by package_id?  Not sure.  On CR its enforced due to the content folders.  There is also a constraint there.

As for the sample that Jeff has put.  It did pop up when I was working with Talli.  The resolution was, to truncate the title then add a number.  It use to work is to make the title a good url and add number if needed.  But then long titles made long urls.

Please disregard my comments about cr_items.name. I want to add the user-readable name to be added to acs_objects.

In the case of content_repository it is the title of the live revision.

Collapse
Posted by Dirk Gomez on
So here's a few examples of what I refer to as name that should be put into acs_objects (debated on IRC):
  • In Forums it would be the subject of the forum posting.
  • In Calendar it would be the subject of the calendar item.
  • In File-Storage it would be the title of a file.
  • ...

It comes down to a decision by the package maintainer what the name is.

Collapse
Posted by Tom Jackson on

Can I ask why you are debating what to include in _the_ central table to OpenACS on an IRC channel? Does this have anything to do with an open discussion? No.

Maybe before putting denormalized data into acs_objects, you want to come up with _how_ that information is going to be maintained.

Collapse
Posted by Dirk Gomez on
It's not an IRC channel, it is being discussed on #openacs. You can look up the logs somewhere on this site. I have zero intention on this being a closed or secret discussion.

How is it going to be maintained? Similar to the current mechanisms, you just need to stuff a bit more information into acs_objects. That can be done explicitly by changing the respective -2 or -3 pages or implicitly by adding triggers.

The information to be added is readily available to a package, however currently every package has its own way of storing that data. This is a way of unifying immensely important information - so important that site-wide packages can hardly work without them.

Collapse
Posted by Tom Jackson on
How is it going to be maintained? Similar to the current mechanisms, you just need to stuff a bit more information into acs_objects. That can be done explicitly by changing the respective -2 or -3 pages or implicitly by adding triggers.

So just add two triggers to each object table, or modify every object create/update function with special code.

Collapse
Posted by Dirk Gomez on
Back in the very beginning of the Sharenet project did we add a table called acs_objects_description that held name, package_id etc.

I don't recall exactly how long it took us to make the various OpenACS packages we were using on the project to play nicely with acs_objects_description, it wasn't much work though. And the effort paid back handsomely because otherwise expensive and/or very complicated queries had become easy to implement and quick to perform.

Timo did a similar change to the AIESEC code. Maybe he can tell us how long it took there - it's not that long ago.

As for the upgrade. Here's an idea Tilmann had during a f2f meeting (f2f narcosis?). Those fields could be added without NOT NULL constraints to allow for a certain upgrade period e. g. a minor release.

I'd suggest adding a clob the acs_objects that contains an XML doc with the objects extra stuff. I don't think it would have much performance impact and then you can do anything. Oracle has support for indexing and querying it and Postgres does not seem to far behind.

It would be possible to specify tags for things line object_name. It would also be possible to have as many translated names as you like by having a set of name tags.

If template::list::create supported xpaths for elements you could easily build table data from the xml doc.

Services could interface with objects thru this document. For example if you want your object searchable via site wide search you could have a site_wide_search tag and put in data like keywords and view url.

ACS attributes could support xml storage and items in tags could also be attributes.

Collapse
Posted by Dirk Gomez on
Have you actually worked with XML datatypes Barry? I haven't yet seen much code for it save an article for Oracle Magazine. My gut feel is that they may be interesting for content that is truly hierarchical, but not for "networked" content like acs_objects. Eventually I want to be able to associate every object with every other object in the system (paying attention  to permissions of course).

Oh, and my other gutfeel is that they don't perform well on retrieval and are more used to trasport data in between systems.

Collapse
Posted by Tom Jackson on
Eventually I want to be able to associate every object with every other object in the system...

Dirk, your solution of adding an attribute to acs_objects will not allow you to do this, at most you will be able to associate an object with one other object, and the association will be of an unknown type (or a single type). Or maybe you can explain how it will work.

Also, repeating data in a parent table (acs_objects) is denormalizing in the wrong direction isn't it?

Collapse
Posted by Dirk Gomez on
That:s a bit taken out of context. The association business has to do with relational model vs. hierarchical model. If we were to use XML strictly it is practically impossible to associate entities in a way you can do in a  relational model.

Why is repeating data in a parent table the wrong direction for denormalization?

I just started development using 9i a couple of months ago and I'm switching to 9i this weekend for production. I have done a lot of XML in 8i but I settled on just putting XML in clobs and using PLSQL to parse it. The other XML features in 8i did not seem that useful. It appears 9i has an XML datatype, it looks promising but I have not tried it.

I've started using ConText (or whatever they call it) and 9i to index XML documents and search against them. This works pretty well because you can aggregate data from all over the place and search without joins. Joins can be a problem in acs because it often takes many tables to represent something. You are correct about "hierarchical vs networked" data. I don't think you would want to join on data in the XML document nor would you want to store all the forum data in one clob even though you could. It might make sense to store one thread in an XML document though.

As far as transporting data between systems that's what got me using XML in the first place. I transfer several gigs a day in documents with 1000s of tags. In some ways OpenACS has this same problem. There are many mostly independent packages but some global ones like site wide search. XML may be better at sharing data between packages than using columns because you don't need to join. It is also possible to use the acs_attributes table to describe the contents of the XML documents.

My comment was really just a completely different alternative to adding columns to acs_objects. The problem with that table is depending on what you are doing it will never have the right data. With an XML document you can provide a standardized IPC mechanism without extending the table for every attribute.

Finally I switched to AOL 4 a couple of months ago so I could use tDom. We used Java and PLSQL to parse all our XML before, but now TCL is better than either one. I think tDom will open up a whole new set of possibilities with XML and OpenACS.

The attributes we are talking about require only a handful of bytes to store and generally almost every object has them. That is why I think they belong into a central table (note that I don't want them to be moved out of their respective packages' tables)

The XML stuff sounds attractive for search. I started work on a search project and I am very interested in looking into bits of your XML code. The search project currently requires Oracle 8i, but I have good hope to return the code to OpenACS and then I will make it work and play well with Oracle 9i's new features.

Dirk,

So what is your proposal at this point? To add to acs_objects to to add a table to hold information for named objects?

I definitely think storing these attributes in XML is a very bad idea.

I do think that using XML for transferring data between systems can be a good solution.

Let me bring up a couple of reminders ...

Adding extra columns is not zero-cost in PG when the values are NULL.  Every row has a descripter which includes column metadata - adding new columns increases the size of that descripter regardless of the value stored in the column.

I had though we'd talked about deriving a "named_object" type from acs-objects and to carry denormalized data like name and URL there?  Rather than extend acs-object itself?  I thought we'd had consensus (more or less) that package_id was the one candidate that belongs in every object?

While I understand Tom's objection to having package_id in each object this particular problem has cropped up so often and has resulted in so many packages having a package_id field in its private object type that I think it is worth doing.  If all packages do it the same way maybe all of them will do it right :)

I've been thinking actually in terms of reworking the relationsip between objects and cr_items in a way that would more or less give "named_objects" as well as remove one object  in the case where one wants to define content types that aren't revisionable.  I don't have any details beyond that but I'd really like have us spend some time thinking about improving the datamodel holistically rather than just kludge it up to solve a couple of performance issues.

Anyway I've been thinking about this quite a bit and want to do some experimentation after 5.0 is released ...

What is the penalty of a NULLed column in PG? In Oracle it's one byte per field unless it is the last field in a table, then it is zero bytes.

My goal - the TIP's goal - is to have crucial data easily and cheaply available. The packages I (and Tilmann) have mentioned are not available or in a really rugged state. It is so not only because nobody has yet volunteered to take care of them, it is also that OpenACS' data model is not finished.

Quoting myself from this very thread:

Name - exactly as in cr_items. Does it make sense to duplicate this? Probably not, especially as cr_items already has locale, content_type, publish_status etc.

I haven't talked about performance here btw. This is about functionality, that is imho currently direly missing from OpenACS. (I only ruled out PL/SQL as a means of achieving my goals)

Maybe we can move forward because I think there generally is a consensus. We have a few options on how to implement this:

* add a whole slew to acs_objects

* create a table acs_objects_description (sic! that is how I named this table in the Sharenet project :)) that holds name, locale, verbose description and what not. (Package_id still goes into acs_objects)

* Use cr_items instead of acs_objects_description. Advantages here: reasonably well-understood TCL api already exists, the table already exists and is quite fat anyway.

Did I forget something?

Oh yes, what about doing this in a different thread?

My thoughts on the package_id was that first it only makes sense with some object types. In the cases where it makes sense, it could just as well be that an object needs to be related to more than one package. However, I only use a tiny subset of OpenACS. If it greatly simplifies things maybe it is the easiest solution.

I believe the PG penalty's four bytes, perhaps eight.  There's a bit in the column descriptor that says it's NULL (need to do this because NULL by definition is not representable by the column's datatype itself).

Since this is clearly a 5.1 issue ... for now perhaps we can agree in principle and chose a particular solution later?  That seems to be what you're saying with your three options ...

I'm also thinking that playing with cr_items would better integrate Karl's CR work with OpenACS proper ... there's considerable duplication of code as well as a messy datamodel.

An object can certainly be *related* to more than one package, but it can only be *created* by one package since creation is a one-time-event.  acs-rels or the CR's relationship stuff can be used to do the kind of relating you're talking about.

Including package_id is the traditional way to make a package "subsite aware".  We want all non-singleton packages to be "subsite aware" so they can be mounted under multiple subsites  and still work right (i.e. each instance only diddles with its own objects).  This is such a common case that having it be central to objects makes sense to me ... and making it central to objects increases the odds that a package writer will get the "subsite aware" bit right the first time IMO.

I think it's great we talk about this stuff in great detail, Tom.  We need to carefully consider any changes that make the basic object overhead higher.

Is there a package that uses multiple package_ids per object_id?

I'm asking because I haven't come across this requirement. As said somewhere above in the thread, the ShareNet code (based on some completely forked OpenACS 4.0 beta) has a table acs_objects_description which contains fields like name, description, intermedia datastore, content_type etc. And package_id and the mapping is strictly 1:1 between acs_objects and acs_objects_description (well, you don't necessarily need an entry in acs_objects_description)

The AIESEC code uses a table called acs_named_objects which was added by Timo and I think they are quite happy with that too.

What would be the attributes of an "object description" for the same object  but with a different package_id?

If an object needs to be related to more than one package then (as I said above) we have 1:many relationship tools available.

The 1:1 case is bound to be the most common case by far.  I've never seen any case where it's not sufficient.

I didn't realize Timo'd implemented the named object notion we'd tossed about last time this subject came up.  Cool!

I do use 1:many, but then again its still a bit unconventional.  But of course the object is created by a single package only.

Its a notion of packageless objects.  Basically a content item may be manipulated and/or displayed by different packages.  In an analogy.  In Windows we have a text file.  The text file can be edited by Notedpad, Wordpad or Word.  It would not care which package uses it.  As long as the package supports the content type and object type.  It should do just fine.

Anyway its not conventional, but I hope it provides some input about less package centric objects.