Forum OpenACS Development: Gratuitous use of acs_objects?
In reading through the objects documentation, one is greeted with this
line:
This last point cannot be over-stressed: the object model is not meant to be used for large scale application data storage. It is meant to represent and store metadata, not application data.
However, as I look through the different packages, it seems that everything is turned into an acs_object. For example, I installed the bboard package, and every message is an acs_object (actually, every post adds 4 rows to the acs_objects table). Bookmarks also uses objects and adds 2 rows to the table for every bookmark entered into the system.
I wonder if all of these objects are really necessary? Couldn't a bboard forum be an object, but not messages? Could messages inherit needed permissions from a parent forum which is an acs_object? Is message level granularity really necessary? Maybe I'm wrong and it is necessary for all messages to be objects for the system to work correctly, but it does make the acs_objects table grow fast.
In my mind, such extensive use of acs_objects violates the spirit of the "no application data" principle. Even though the bboard message itself, or the title of the bookmark, is stored elsewhere, it still has a little piece of itself in acs_objects.
Maybe I'm confused in my understanding of metadata and application data, but it seems to me that there may be other, better ways of dealing with things than making everything an object. If a bboard message is really "metadata" then what does qualify to be application data?
Is having such a huge acs_objects table really a problem? Am I wrong in my analysis of the use of acs_objects? Very possibly. But it does raise interesting questions.
The reason for acs_objects in the first place was to implement a pseudo-OO layer on ACS. The benefits (not all realized, for sure) included a single location to add functionality (e.g. audit trails) and, since an object can only exist once, the ability to trace back to the parent. This is why the IP address, date, etc. are stored in the base object, usually acs_objects directly (the CR is a separate case).
That said it is obvious that not all data should be acs_objects. I think that most of this "bad" coding was the legacy of AD personnel writing packages before the kernel was stable and certainly before any best practices were in place. The exercise at the time was to release the "new" ACS4 with most packages in place.
If you look at the way I wrote acs-reference et al, I think you will find a good use of the acs_object system. The reference data is not stored in acs_objects; only the fact that the data exists is an object.
I completely agree that many packages went overboard and all tables are stored as acs_objects. Bboard is a problem but I don't think it is possible to fix this easily (i.e. it is better to write it from scratch). I also can see some reason for all messages being acs_objects but that is because the data model wasn't thought through to eliminate that dependence. One reason for using acs_objects is to get the permissioning system for free and I believe that was the major reason, that and the fact that bboard was the 2nd or 3rd package ported from 3.x.
I for one really think that a best practices doc should be written to solve some of these inconsistencies, but alas most of us don't really have time. In the meantime I think peer review of any new package is the best bet, especially since we are now "db independent".
And the CR is built upon acs objects.
Workflow and permissions both work on objects, and any kind of categorization or other "knowledge management"-type feature we build will operate on objects as well.
So in short any information in the system that is meant to be used in a general way should be an object. Only items that are tightly controlled within an application and that aren't meant to be exposed to other packages (search being another example) should be non-object tables.
It is possible to go overboard with both objects and the CR, of course. But careful use of the object model shouldn't be overly expensive. Big tables themselves aren't really much of a problem; RDBMSs are designed to handle them efficiently. The permissions model is a much more significant issue in terms of performance.
Now ... the object model includes a generalized attributes model that lets you define and store data in a single attributes table through an API. The attributes model is fairly expensive and I suspect the quote you've posted is really meant to mean that you shouldn't be using the attributes feature of the object model for large scale application data storage, but should rather be using a separate table. The object model allows for extension by mapping a type-specific table and indeed that's what is most commonly done.
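To make the distinction concrete, here is a minimal sketch of the two storage styles, using Python's sqlite3 module purely for illustration. The table and column names (generic_attr_values, bookmarks, and a two-column acs_objects) are hypothetical simplifications and not the real OpenACS data model.

```python
import sqlite3

# Hypothetical, heavily simplified schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE acs_objects (
    object_id   INTEGER PRIMARY KEY,
    object_type TEXT NOT NULL
);
-- Generic storage: one row per (object, attribute) pair.
CREATE TABLE generic_attr_values (
    object_id  INTEGER REFERENCES acs_objects(object_id),
    attr_name  TEXT NOT NULL,
    attr_value TEXT,
    PRIMARY KEY (object_id, attr_name)
);
-- Type-specific storage: one row per object, one column per attribute.
CREATE TABLE bookmarks (
    bookmark_id INTEGER PRIMARY KEY REFERENCES acs_objects(object_id),
    title       TEXT,
    url         TEXT
);
""")

conn.execute("INSERT INTO acs_objects VALUES (1, 'bookmark')")
conn.execute("INSERT INTO acs_objects VALUES (2, 'bookmark')")

# Object 1 keeps its data in the generic attributes table.
conn.executemany(
    "INSERT INTO generic_attr_values VALUES (?, ?, ?)",
    [(1, "title", "Example"), (1, "url", "http://example.com/")],
)
# Object 2 keeps its data in the type-specific table.
conn.execute("INSERT INTO bookmarks VALUES (2, 'Example', 'http://example.com/')")

# Reading the generic form back needs one join per attribute.
generic_bookmark = conn.execute("""
    SELECT t.attr_value, u.attr_value
    FROM generic_attr_values AS t
    JOIN generic_attr_values AS u ON u.object_id = t.object_id
    WHERE t.object_id = 1 AND t.attr_name = 'title' AND u.attr_name = 'url'
""").fetchone()

# Reading the type-specific form back is a plain single-table select.
typed_bookmark = conn.execute(
    "SELECT title, url FROM bookmarks WHERE bookmark_id = 2"
).fetchone()

generic_rows = conn.execute(
    "SELECT COUNT(*) FROM generic_attr_values WHERE object_id = 1"
).fetchone()[0]
typed_rows = conn.execute(
    "SELECT COUNT(*) FROM bookmarks WHERE bookmark_id = 2"
).fetchone()[0]
```

Both forms hold the same bookmark, but the generic form spends one row and one join per attribute, which is one plausible reading of why the attributes feature would be the expensive part.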
We do need a best practices-type document and we do need to take a careful look at how the object system is used. Bboard's a bad example, unfortunately. There are other bad examples, too.
It is possible to go overboard with both objects and the CR, of course. But careful use of the object model shouldn't be overly expensive.
But using acs_objects is/was more expensive than it should be. While working on a performance issue for my project, Ben Adida discovered an inefficiency in acs_objects that was slowing things down "considerably." He tells me that the patch is slated to be a part of the next beta release. Not an answer to the original question, but everyone should be able to make use of the performance benefits.
But this can be true of *any* schema and *any* set of queries.
Ben talked to me recently about a performance problem in ACS Objects that may well be the same one you're referencing. It wasn't really objects per se, it was the code that computes the PG tree_sortkey for the object.
The tree_sortkey in question is needed for the permissions implementation. The object model per se doesn't require it, and a different permissions model laid on top of the object model wouldn't necessarily require it ...
I remember seeing that comment and having exactly the same reaction when I first started working with aD ACS4, but seeing that paragraph in context makes things clearer -
In the context of ACS 4, this means using the object model to make our data models more flexible, so that new modules can easily gain access to generic features. However, while the API itself doesn't enforce the idea that applications only use the object model for metadata, it is also the case that the data model is not designed to scale to large type hierarchies. In the more limited domain of the metadata model, this is acceptable since the type hierarchy is fairly small. But the object system data model is not designed to support, for example, a huge type tree like the Java runtime libraries might define.
This last point cannot be over-stressed: the object model is not meant to be used for large scale application data storage. It is meant to represent and store metadata, not application data.
the essential point here isn't "don't store application data in objects" but "don't store application data in the object hierarchy". I suspect the bold paragraph would be better rendered as "the object hierarchy is not meant to be used for large scale application data storage."
or that's what I reckon they were getting at, anyway...
That seems like a reasonable interpretation. The whole point of the new ACS object system was to provide a reference point back to acs_objects, but referring to the acs_objects table in your application data primary key is not the same as storing data in the acs_objects table.
I was discussing this with Branimir Dolicki the other day, and he told me about what they'd done on one of their projects. He'd ripped out the "parties" and "person" tables so that your user info isn't spread out across four tables, but only two: acs_objects and users. This by itself reduces the number of acs_objects considerably and speeds things up. It took them about 4 developer days to do it across the board, and they're really happy they did it.
Next step will be to de-acs-objectize acs_rels and membership_rels.
I think it sounds like an interesting exercise.
I agree with Branimir that only users, groups, package instances, and content should be acs_objects. Content being things like a bboard posting, a file in the file storage.
And while we're at it, I also agree with him that the content repository is created upside down. What we really need from the content repository is a central place to store title, description, and other generic info that you need for site-wide search or a "what's new" page. Instead of the kludgy acs_object.name() function, which has to do dynamic SQL to figure out something as banal as the name of an object.
Thoughts on this? Might as well start the discussion about what to clean up for our next big release :)
I've found acs_object.name to be very useful although for any objects I want the name in a result set I end up storing the name in the table, but for admin pages it's nice to pass an object id and let the admin page figure out the name and other stuff in a general way.
I think there are two conflicting goals. When working on admin stuff it's nice to have general methods for objects, but on user pages the general methods are too slow. I'd hate to see the general methods ignored because of this, since I spend at least as much time working on the admin side of a site.
I was discussing this with Branimir Dolicki the other day, and he told me about what they'd done on one of their projects. He'd ripped out the "parties" and "person" tables so that your user info isn't spread out across four tables, but only two: acs_objects and users. This by itself reduces the number of acs_objects considerably and speeds things up. It took them about 4 developer days to do it across the board, and they're really happy they did it.
How does that reduce the number of objects? It will reduce the number of joins and tables, but a user in the hierarchy object->party->person->user occupies only one object. As far as the utility of "persons" ... I too have had the need for non-registered users to be included in the system.
Now ... one thing I think we really need to do is to get rid of the need for a new object subtype to declare and use a new table, and also the restriction that only one type is allowed to point to any given table. The latter restriction prevents you from, say, defining two types ("person" and "user") that share a table, which in the "person" case could be left partially empty. One could still enforce the requirement that the information be there for "user" objects, while providing the flexibility of being able to add "persons" that aren't "users" and having the full advantage of the type system.
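A rough sketch of what two types sharing one table could look like, written with Python's sqlite3 module. Everything here is hypothetical: the simplified acs_objects and persons tables, the add_person_or_user helper, and the choice of email and password as the columns that "user" objects must fill in. The current type system does not allow this, as the post says.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE acs_objects (
    object_id   INTEGER PRIMARY KEY,
    object_type TEXT NOT NULL CHECK (object_type IN ('person', 'user'))
);
-- One table shared by both types; user-only columns may stay NULL for persons.
CREATE TABLE persons (
    person_id   INTEGER PRIMARY KEY REFERENCES acs_objects(object_id),
    first_names TEXT NOT NULL,
    last_name   TEXT NOT NULL,
    email       TEXT,
    password    TEXT
);
""")

# Assumption: these are the columns a 'user' must provide.
USER_REQUIRED = ("email", "password")


def add_person_or_user(object_id, object_type, first_names, last_name,
                       email=None, password=None):
    """Insert a 'person' or a 'user'; only users must fill the extra columns."""
    values = {"email": email, "password": password}
    if object_type == "user":
        missing = [col for col in USER_REQUIRED if not values[col]]
        if missing:
            raise ValueError("user objects require: " + ", ".join(missing))
    with conn:  # commit both inserts together, or roll both back
        conn.execute(
            "INSERT INTO acs_objects (object_id, object_type) VALUES (?, ?)",
            (object_id, object_type),
        )
        conn.execute(
            "INSERT INTO persons (person_id, first_names, last_name, email, password)"
            " VALUES (?, ?, ?, ?, ?)",
            (object_id, first_names, last_name, email, password),
        )


# A non-registered person: the user-only columns are simply left empty.
add_person_or_user(1, "person", "Pat", "Guest")
# A registered user: the extra columns are enforced.
add_person_or_user(2, "user", "Sam", "Member",
                   email="sam@example.com", password="secret")
```

The requirement is enforced per type in the helper rather than by NOT NULL constraints, which is what lets the "person" rows stay partially empty.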
And that gets rid of one of your tables. I think that "parties" are probably an unneeded intermediate level, so that would get us down to two tables. Just like Branimir's solution but without throwing away as much functionality.
BTW the above thinking is representative of my thoughts about the object system. It needs some cleaning up, and some judicious changes would allow for more flexible subtyping leading to subtyping being a lighter-weight thing.
Next step will be to de-acs-objectize acs_rels and membership_rels.
I think there's value in being able to build the complex relationships implied by their being objects. However ... the individual rows in the relationships are also, I believe, objects. And they don't need to be/shouldn't be? I'm blowin' smoke, here, I haven't looked closely at the issue. But if I'm right then the proliferation of objects due to mapping rows could go away, without losing the flexibility of having the rel itself an object.
I agree with Branimir that only users, groups, package instances, and content should be acs_objects. Content being things like a bboard posting, a file in the file storage.
Not sure I'd go this far. In fact, several folks have come up with examples showing that having object types be objects would be useful. It may be that we're not using the object system enough, rather than too much.
We need to concentrate on making the object level as lightweight as possible ... regardless of what we choose to make objects.
And while we're at it, I also agree with him that the content repository is created upside down. What we really need from the content repository is a central place to store title, description, and other generic info that you need for site-wide search or a "what's new" page. Instead of the kludgy acs_object.name() function, which has to do dynamic SQL to figure out something as banal as the name of an object.
I think it's too heavyweight, too. I haven't thought too much about a solution, other than the fact that it's currently rather monolithic and as you point out, lacks a simple foundation.
Thoughts on this? Might as well start the discussion about what to clean up for our next big release :)
definitely there are some packages that abuse acs_objects and permissions. the current survey package is an example of abuse of acs_objects, and bboard is an example of abuse of permissions. something i think would help OACS a lot is better guidelines for application developers as to what to make an acs_object and when/where to use permissions. (note i am not offering myself as the author of these guidelines, i hate writing and am not good at it)
i think the suggestion of having a subclass of acs_objects for permissioned objects is interesting. we could still write general services for acs_objects that would be inherited by all, while making a leaner, meaner permissions infrastructure. i think this would help ease people's concerns about performance of the permissions system given that the set of permissioned_objects would reduce in size drastically. for example, even if acs_rels remained acs_objects, i don't see a need for them to be permissioned_objects. this alone would reduce the number of permissioned rows drastically.
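A small sketch of the permissioned-subclass idea, again using Python's sqlite3 module. The permissioned_objects table and the new_object helper are hypothetical names for illustration; nothing like them exists in the current kernel.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE acs_objects (
    object_id   INTEGER PRIMARY KEY,
    object_type TEXT NOT NULL
);
-- Hypothetical subtype table: only objects listed here take part in permissions.
CREATE TABLE permissioned_objects (
    object_id INTEGER PRIMARY KEY REFERENCES acs_objects(object_id)
);
""")


def new_object(object_id, object_type, permissioned):
    """Every object gets a row in acs_objects; only some become permissioned."""
    conn.execute("INSERT INTO acs_objects VALUES (?, ?)", (object_id, object_type))
    if permissioned:
        conn.execute("INSERT INTO permissioned_objects VALUES (?)", (object_id,))


# One forum that needs permissions, and 100 rels that do not.
new_object(1, "forum", permissioned=True)
for rel_id in range(2, 102):
    new_object(rel_id, "membership_rel", permissioned=False)

all_objects = conn.execute("SELECT COUNT(*) FROM acs_objects").fetchone()[0]
permissioned_count = conn.execute(
    "SELECT COUNT(*) FROM permissioned_objects"
).fetchone()[0]
```

General services would still see all 101 objects, while a permissions implementation built on permissioned_objects would only have to consider one.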
i agree with don that the object_type system needs help. there are a lot of gratuitous constraints and limitations on the acs_object_types table itself. this makes development a pain. i was supposed to test a "loosened up" version of acs_object_types but have not had time to do it. if i get to it i will post my findings, if anyone else feels up to the task of doing this, post here and i will post my suggestions, i know don has some too.
i am a big advocate for a leaner content-repository. let's just say that there were many heated arguments about this at aD and i think the wrong side got their way :). as everyone else has said, CR should just be a storage system, ideally just a table and api. yes one table. it would contain id, title, content_type, content_locale, and content. i might be forgetting one or two, i haven't thought about this in a while. all the other services provided by the current CR should be built on top of this. it shouldn't be the case that all content i store HAS to be versioned. this just doesn't make sense. i know this is a pretty deep change but i think people here aren't averse to change and also see the benefit in this.
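As a sketch only, the one-table store described above might look like this in Python with sqlite3. The five columns come from the post; the table name cr_lite, the store and fetch functions, and treating content_type as a MIME type are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE cr_lite (
    id             INTEGER PRIMARY KEY,
    title          TEXT NOT NULL,
    content_type   TEXT NOT NULL,   -- assumed here to be a MIME type
    content_locale TEXT NOT NULL,
    content        BLOB NOT NULL
)
""")


def store(title, content_type, content_locale, content):
    """Store one item and return its id; no revision row is involved."""
    cur = conn.execute(
        "INSERT INTO cr_lite (title, content_type, content_locale, content)"
        " VALUES (?, ?, ?, ?)",
        (title, content_type, content_locale, content),
    )
    return cur.lastrowid


def fetch(item_id):
    """Return (title, content_type, content_locale, content) or None."""
    return conn.execute(
        "SELECT title, content_type, content_locale, content"
        " FROM cr_lite WHERE id = ?",
        (item_id,),
    ).fetchone()


# Storing a simple image is a single call.
image_bytes = b"\x89PNG placeholder bytes"
portrait_id = store("Portrait", "image/png", "en_US", image_bytes)
portrait = fetch(portrait_id)
```

Versioning and the other services would then be separate tables and functions that reference cr_lite.id, used only by the packages that want them.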
concretely, i would like to see these things fixed for OACS 4.6 but i think that might be too much.
ok, that's about as much as i can write in one sitting, i don't have the stamina that don and lars do. btw, i would participate in these forums a lot more often if i could reply to them via email, i HATE this interface.
As far as the CR goes, I think there has been some exaggeration with respect to its problems. Because of its large API, it can seem kind of daunting, but there is nothing that prevents you from using it in a simple fashion if you so desire. There is nothing that forces you to use versioning, and if you look at how packages use the CR, you can see that quite a few of them make no attempt to use its versioning features. A better approach might be to create a CR-lite interface that allows developers to make use of the CR in a basic way without having to understand all of its features.
The more that is implemented in the core, the less that has to be re-implemented by each individual package. The core will not be light or lean+mean, but the packages will be. Apart from the engineering savings (time to build, previously debugged code, one API to learn), what's more important is the idea of a common data model and the powerful things you can do with it. This is what gives the ACS its advantage over a loose collection of 'best of breed' programs.
The way I'm beginning to think about this is as a separate layer of abstraction above the database (augmenting, not replacing). This would be the lowest level of the toolkit, excluding things like parties, people and groups, but including objects and types, rels and rel types, attributes and values etc. Plumbing.
I see rels and rel_types at the very core of this so I wouldn't like to see them be stripped of their object_id's. I am in favour of giving object types object id's, and hence rel_types.
A concrete example of the push-down of features I'd like to see from the higher layers of the toolkit into this core is the relationship inheritance of groups (component vs. member) into relationships themselves. This mechanism is used already in the toolkit (groups, site nodes, object types, privileges, categories...?) but each is an island implemented uniquely. An integrated, toolkit-wide mechanism for this would enable some really cool interactions.
Considering the ideas floated here, which seem to downplay the importance of centralised, object_id-identified features, I'm wondering where people see the toolkit going in the future. Is this just a performance problem, perhaps with tight client schedules forcing immediate action? Do people feel that an object model such as we have now will never be fast enough? Or do folks disagree with the approach entirely?
We should absolutely, positively allow for a screen name/user name instead of email as the unique user identification method.
To Don: You're right. This only reduces the number of tables and joins.
I haven't looked this through in detail, either. All I have is a strong desire to find ways to make it more lightweight, both in performance, learning curve, and effort to develop with it. I'm trying to figure out to what degree we need a radical or an incremental approach.
I've also thought of the possibility of a more "interface"-based approach. The idea is that, when two object types (say, persons, and users) share some common attributes (say, email), instead of creating a table to hold the common attribute column, you could have the same column in both tables, and have a way of saying "they're really the same attribute", like you do with interfaces in OO languages. That way you can reduce joins. Don't know how useful that'd be in practice, but it's a thought.
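One way to picture the "interface" idea is a small registry that records which column in which table implements a shared attribute. The sketch below uses Python's sqlite3 module; the INTERFACES registry, the has_email interface name, and the lookup function are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Each type keeps its own email column; there is no shared parent table to join.
CREATE TABLE persons (person_id INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE users   (user_id   INTEGER PRIMARY KEY, email TEXT, screen_name TEXT);
INSERT INTO persons VALUES (1, 'guest@example.com');
INSERT INTO users   VALUES (2, 'member@example.com', 'member');
""")

# Hypothetical registry saying "these columns are really the same attribute".
# Each entry is (table, id column, attribute column).
INTERFACES = {
    "has_email": [
        ("persons", "person_id", "email"),
        ("users", "user_id", "email"),
    ],
}


def lookup(interface, object_id):
    """Return the shared attribute for an object, whichever table holds it."""
    for table, id_col, attr_col in INTERFACES[interface]:
        # Identifiers come from the trusted registry above, never from user input.
        row = conn.execute(
            f"SELECT {attr_col} FROM {table} WHERE {id_col} = ?", (object_id,)
        ).fetchone()
        if row is not None:
            return row[0]
    return None


guest_email = lookup("has_email", 1)
member_email = lookup("has_email", 2)
```

Generic code can ask for the email of any object that implements has_email, while each per-type query still touches a single table.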
Stephen: Yes, abstractions, and implementing things in general ways in the core for all applications to use, is very important. But it's also important to allow rapid application development, easy training of new developers, and strong performance.
While I like, in principle, all the cool things you can do with these cool abstractions (and I was part of designing most of them), I think there are significant costs along the three dimensions mentioned above: 1) time to learn, 2) time to develop with, 3) performance.
I believe we went overboard then, and we need to back off a bit. Parallels abound ... SGML to XML, X.500 to LDAP (L is for Light-weight). I've come to see the value of fewer and simpler abstractions, which, admittedly, is a worse model of reality in some respects, but it's workable, and it saves a lot of overhead.
I guess I should write up something about these thoughts soon.
First of all I would like to say that I am very glad to see that the OpenACS community has so many great minds who are willing to pick up with the design of the ACS object model where ArsDigita left off more than a year ago!
I agree with Lars that time to learn, time to develop with and performance are objectives that should be given top-priority. I also worked at ArsDigita so I have witnessed just how devastating over-engineering can be.
Here are my conclusions from this thread so far:
- In the interest of being able to write generic services all
user supplied content should be an ACS object. This conclusion rests
on the assumption that the performance penalty from using tables
with a huge number of rows is negligible or at least acceptable. I think
Don Baccus puts it very well:
"So in short any information in the system that is meant to be used in a general way should be an object. Only items that are tightly controlled within an application and that aren't meant to be exposed to other packages (search being another example) should be non-object tables."
- The performance problems with the permissions API can somehow be dealt with by excluding ACS Objects that don't need permissioning (e.g. Bboard posts) from the permission tables (by means of a flag or otherwise, see Barry Brooks' post above).
- To be able to easily, scalably and generically write crucial services such as site-wide search and the "what's new" page, the columns name and url should be added to acs_objects. While I am writing about acs_object columns I should mention that I have started a thread on the topic of a logic site hierarchy here.
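A minimal sketch of the last point, assuming name and url columns were added to acs_objects. It uses Python's sqlite3 module; the simplified table, the sample rows, their URLs, and the whats_new function are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified acs_objects with the proposed name and url columns.
CREATE TABLE acs_objects (
    object_id     INTEGER PRIMARY KEY,
    object_type   TEXT NOT NULL,
    creation_date TEXT NOT NULL,
    name          TEXT,
    url           TEXT
);
INSERT INTO acs_objects VALUES
    (1, 'bboard_message', '2002-05-01', 'Re: acs_objects', '/bboard/message?id=1'),
    (2, 'bookmark',       '2002-05-03', 'Example site',    '/bookmarks/2'),
    (3, 'file',           '2002-05-02', 'notes.txt',       '/file-storage/3');
""")


def whats_new(limit):
    """Newest named objects of any type, with no per-type lookup of the name."""
    return conn.execute(
        "SELECT name, url FROM acs_objects"
        " WHERE name IS NOT NULL"
        " ORDER BY creation_date DESC, object_id DESC"
        " LIMIT ?",
        (limit,),
    ).fetchall()


latest = whats_new(2)
```

The query never needs to know which package owns a row, which is the property a site-wide "what's new" page or search indexer would rely on.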
i don't think the problems of CR are being exaggerated. the size and convolutedness of its api reflect the underlying code and data model. it is hard to learn and use. it makes you jump through hoops to store a simple image in the system.
to store an item in the current CR i have to create cr_items and cr_revisions, and if i don't care about revisions i still have to do this. this doesn't make sense.
instead of making a CR-lite interface to CR-mammoth, i would rather make CR-lite as it should be and add the other CR services, such as versioning, on top of it.
stephen: having core services used by everyone is great, but forcing app developers to use core services they don't want or need is not great. so always pushing down is not a good idea, you may want to layer core services instead. having simple acs_objects is good. forcing everyone to have versioning, permissioning, categorization, search on every object they create is not good, it doesn't make sense.
i like lars' idea of interfaces. it allows us to create objects that can add each of these services independently.
something i think would help OACS a lot is better guidelines for application developers as to what to make an acs_object and when/where to use permissions. (note i am not offering myself as the author of these guidelines, i hate writing and am not good at it)
Just by coincidence (really!) John Mileham, at the Berklee School of Music, mentioned in passing that they've developed some ideas about "best practices" there. He contacted me in response to the thread about Sloan/Berklee that's over in the general forum.
Anyway, I asked if he'd have the time to write them up and post them, in order to trigger discussion about community-wide guidelines as to how we'd like to see various bits of core functionality used.
He's interested in writing up at least a sketchy description of their thinking for us to take a look at and to pitch in and help create a doc for the community at large.
I think this would be a very good thing to have. Without putting pressure on John or publicly committing him to a lot of work, I'm hoping he'll have time to put something on paper soon.
Roger Williams is also working up some suggestions for changes to the permissions system that he intends to share "around June 1, maybe". I think he means this year :) He's been working through various discussions we've had about permissions in the past, sifting through them and using them as a basis for his proposals. This should help give focus to further discussions as we try to decide what to do in 4.6.
More generally ... this is a great discussion. Stephen, I don't think we're talking so much about removing functionality as streamlining it. If we can streamline stuff and make core functionality easier to use and prove that it scales well without a bunch of hair-pulling on the part of developers, everyone will use that functionality by reflex. We won't be seeing folks suggesting that maybe bboard posts shouldn't be objects because we'll make using objects as pain-free as we possibly can. At least that's where I'm coming from.
Dan ... in terms of rels I really do like having them be objects so you can build rels on top of rels ... my comment (in case it wasn't clear) was only that the ROWS, as Yon also says, shouldn't be objects. I don't think that costs us any practical functionality and has the potential to remove a LOT of objects from systems that make heavy use of acs_relations.
Yon - but it is simple to store an image in the CR once you know what to do. The user page in openacs stores your photo in the CR, and it doesn't make use of versioning. The apm uses the CR to store generated packages, and it also doesn't make use of versioning. Neither use is complicated. A simple interface is all you need to use it and the same CR can support versioned and non-versioned items. When proposing a change to the CR you also have to consider the impact on all the packages that make use of it. Refactoring the CR into multiple layers would have a big impact on many of the packages in openacs.
It is not uncommon to attach data to a relationship between objects (think acs3 groups, still unmatched); the generic acs attributes mechanism supports this. If object types and rel types were objects, then this generic mechanism would support 'static' attributes too.
If rels have properties then users must provide data, in which case we need to record who entered the data, their ip address, etc. If users are entering data, then we'll probably want to use permissions.
The implementation of rels must surely have some unique identifying id, and as per the rest of acs I assume this will be an integer from a sequence. What is the benefit of making rels non-objects? These principles apply to more than just rels.
Taking the example of bboard, a direct permission need not be assigned to each message. The page which displays messages in a forum can assume that all messages have the same permission, the permission of the forum, and make just one check. This can be done today; there is no overhead.
It's important to keep in mind that the permissions of messages still need to be checked, for example on the user contributions page. Does user A have permission to read the three messages user B posted to a private forum? The context_id takes care of this.
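A toy model of how a context_id chain can answer that question without a grant on each message. This is a sketch in plain Python and not the real permissions implementation: it ignores groups and privilege hierarchies, and it assumes that inherit_p set to false simply stops the walk up the chain. The Obj class, the grants set, and has_permission are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Obj:
    object_id: int
    context_id: Optional[int] = None  # parent to inherit permissions from
    inherit_p: bool = True            # assumption: False stops inheritance


# A private forum (10) and three messages posted into it by user B.
objects = {
    10: Obj(10),
    11: Obj(11, context_id=10),
    12: Obj(12, context_id=10),
    13: Obj(13, context_id=10),
    14: Obj(14, context_id=10, inherit_p=False),  # opted out of inheritance
}

# Direct grants as (object_id, party, privilege); only the forum has one.
grants = {(10, "user_b", "read")}


def has_permission(object_id, party, privilege):
    """Check the object itself, then walk up the context chain."""
    current = object_id
    while current is not None:
        if (current, party, privilege) in grants:
            return True
        obj = objects[current]
        current = obj.context_id if obj.inherit_p else None
    return False


# Which of user B's three messages may each user read?
readable_by_a = [m for m in (11, 12, 13) if has_permission(m, "user_a", "read")]
readable_by_b = [m for m in (11, 12, 13) if has_permission(m, "user_b", "read")]
```

A contributions page can run this kind of check per message, while the forum page itself can still make a single check against the forum.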
What is the overhead of using generic, site-wide services? In most cases--with the exception of the context_id / inherit_p pair in acs_objects, and cr_item / cr_revisions in the content repository--the requirement is that the object be identified with a system unique id. In other words there will be one large table of object ids.
Anything created as an acs object is not forced to take advantage of generic services. On the contrary, by having a system-unique id, acs objects give generic services the opportunity to take advantage of them.
The content repository has a large API. This makes it hard to learn, which slows development. But most of this can be solved trivially by extracting the services it provides, e.g. keywords, and making them applicable to all acs objects. The remaining two features of the CR are the facility to store blobs of stuff, and versioning.
The first of these seems like a historical accident. DB usage behind websites evolved into CMS's when people thought it a good idea to store their blobs of indivisible stuff (magazine articles generally) in the db. But these days the line is pretty blurry between what is a piece of content and can be squeezed into blob-of-stuff-plus-title and what deserves its own table with structured attributes.
The second feature of the CR, versioning, is the most interesting because it is not tackled anywhere else in the toolkit. I see versioning playing two roles:
- Revisions: An object is revised, for example a magazine article going through an approval process or a bboard posting being edited by a moderator, where the newest revision becomes the canonical version of the object and the previous revisions serve as an audit trail.
- Alternatives: An object has more than one representation, for example the English and French versions of the description of a product in a catalogue, where each description is equally valid but appropriate in different contexts.
Versioning applies to more than just indivisible blobs of stuff, so there needs to be a mechanism for all acs objects to store and represent revisions and alternatives. The CR is not bloated with useless features, those features need to be moved and expanded. What's left of the CR is a convenient API for storing stuff in the filesystem with metadata in the db.
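The two roles of versioning can be sketched as a small data structure in Python. This is only an illustration under assumptions: the VersionedObject class and its methods are hypothetical, alternatives are keyed by a locale string, and content is plain text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VersionedObject:
    object_id: int
    # alternative key (here a locale) -> revisions of that alternative, oldest first
    revisions: Dict[str, List[str]] = field(default_factory=dict)

    def revise(self, alternative: str, content: str) -> None:
        """Add a revision; the newest one becomes the canonical version."""
        self.revisions.setdefault(alternative, []).append(content)

    def current(self, alternative: str) -> str:
        """The canonical (latest) revision of one alternative."""
        return self.revisions[alternative][-1]

    def audit_trail(self, alternative: str) -> List[str]:
        """Earlier revisions, kept as history."""
        return self.revisions[alternative][:-1]

    def alternatives(self) -> List[str]:
        """All equally valid representations of this object."""
        return sorted(self.revisions)


# A catalogue product description with two alternatives, one of them revised.
product = VersionedObject(object_id=1)
product.revise("en", "A red bicycle")
product.revise("en", "A bright red bicycle")
product.revise("fr", "Une bicyclette rouge")
```

Revisions form a sequence within each alternative, while the alternatives sit side by side, so the two roles stay independent of each other.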