Forum OpenACS Development: PG Object Extensions - pros and cons

Collapse
Posted by Don Baccus on
If you're not familiar with these, you'll probably want to read the somewhat lightweight PG documentation on inheritance and classes (tables are officially called classes in PG). I've cut parts of this out of a note I e-mailed Ben, thus the "you and I" and similar phrases sprinkled throughout.

Here are some of my thoughts on whether or not to use the inheritance features of PG. Tentatively we've been saying "yes, let's do it". The motivation for using them is that conceptually acs_objects are a good fit. However the implementation of inheritance in PG is incomplete ...

Pros and cons

Pro

  • Conceptually slick.
  • More readable datamodel.
  • Gets rid of endless joins, exchanging them for (in essence) magic unions performed inside the RDBMS engine itself, so is likely more efficient. Maybe *much* more efficient.

Con

  • Not entirely there. Indices aren't inherited. Most constraints aren't inherited, including referential integrity constraints.

    (this last wouldn't matter if ACS 4 were written in a 100% object oriented style, because inheritance would replace foreign keys in all cases - but that's not how the ACS 4's written. It is a hybrid and this helps expose the incompleteness of the existing PG inheritance mechanism).

  • Getting rid of lots of joins to parents also means rewriting almost every query in the system.

    Only the most trivial of queries on acs_object-derived tables will work without change. This increases the load on people porting individual modules - it is this factor that really makes me think we should air this particular issue in public. I'm hoping people other than you and I do most of the module-level porting, and they should have a say in a fundamental decision that affects how much query mangling is needed.

    This also magnifies the maintenance issue, though if it is true that aD is truly going to stop 4x development other than minor bug fixes then this is a minor point.

    Ben points out that ACS4 makes very heavy use of views, and that much of the necessary changes might well be isolated and therefore not much of a burden at all. We need to start reading modules...

  • Referential integrity checks on a foreign key pointing to an object table only check that table. This leads to potential RI errors if someone leaves them in by accident without thoroughly understanding the code using the table. Routine testing might not uncover these.

  • Because of the incompleteness of the inheritance system, the PG group decided *not* to adopt the SQL92 syntax for inheritance in PG 7.1, but rather to keep the old syntax. Improving the inheritance mechanism for PG 7.2 is something of a priority, apparently - I hope so. It would be really cool if it were complete.

Because of #1, if we do use the inheritance mechanism we might want to use a preprocessor on our datamodels to automatically define the necessary primary key, indices, etc when deriving one table from another. This would remove the primary source of worry i.e. folks forgetting to define the proper indices on child tables. A forgotten index could lead to test installations running fine, while real installations would get slower and slower as tables filled.

Any volunteers who might want to look into this? Any other hacks come to mind?

Collapse
Posted by Jonathan Ellis on
Wouldn't this make OpenACS even more postgres-specific than ACS is oracle-specific?  I see that as a bad thing; as things are, I come from an oracle background and had no (little, anyway :) trouble getting used to postgres.  I'm working for a company in sybase now and again there is little difference in how plain tables work.  I suspect there will be many more platform-specific issues with an object approach, so it will be that much harder for an oracle (or mysql :) developer to get started.
Collapse
Posted by Don Baccus on
Well ... the PG team's goal is to implement SQL3 object extensions, so  in that sense the path is only PG-specific to the point that other vendors ignore SQL3.

But in today's world, yes, your point should probably be listed as a "con".

Collapse
Posted by Dan Wickstrom on
I think Jonathan points out a contraction in our porting strategy.  On the one hand you have argued that we should aim for sql92 compliance, but on the other hand you're advocating the use of postgresql-specific features such as inheritance to help with performance issues and to obtain a more readable data-model.    As far as readiblity, It will be somewhat simpler, but the net gain is pretty small.  With acs classic you can follow the inheritance path by looking at the acs_object.create_type calls.  In the proposed openacs method of using pg's inheritance constructs, you will have an explicit inherits declaration on a particular table.  So for instance the users table will look something like:

create table users ( ....) inhertits (persons);

That is somewhat clearer, but not a huge win over tracing down the acs_object.create_type calls.

The performance gains are not quantified, so it's not easy to assess the value.  From looking at the acs object model, it seem that there are not a huge amount of joins generated by the model that acs classic uses.  I think that is primarily due to fact that the the core code does not make use of deep inheritance structures.  An exception to that is the new community-core data model which has decompoesed the old 3.x users table into users -> persons -> parties -> acs_objects.  Queries for user information are then done through a couple of views that hide the complexity of joining.  So I think in the case of the users table, there might be some significant gains to be made by using inheritance.  The old user_groups table from acs 3.x has undergone a similar transformation where it has been decomposed to groups -> parties -> acs_objects.  But other than the users and groups tables, most "classes" only inherit from the acs_objects_type.  So in the most common case, you would only need do a simple join against acs_objects when referencing a particular object.  In addition, the application development notes recommend against doing joins and applying application-specific semantics to the acs_objects table when developing applications.

Collapse
Posted by Dan Wickstrom on
I was previewing what I had written so far and I pressed submit by mistake.  So I'm continuing on from my previous post.

I have to admit though that It's hard generalize what I see in the core code to applications that are under development or will be developed in the future, so we might see alot of use of deep inheritance structures where pg's inheritance mechanisms might turn out to be a huge performance win.  It seems that acs_objects will be used extensively for cotent that is generated on the site, so in theory the acs_objects table will grow to be quite large over time on a busy site, so we really make sure we get this right.

So as I see it we can boil it down to a choice between a possible performance gain, which so far is unquantified, through the use of pg's inheritance versus the extra porting work that will be involved in using a different method for implementing acs_objects.  If we use pg's inheritance method, we significantly raise the bar on our porting effort, and we will also be requiring people that port modules to learn quite a bit more about how the data model works, before they can effectively port a module to openacs 4.x.  If we don't use pg's inheritance model, then porting is a simple mechanical process of substituting postgresql sql for oracle sql constructs.  In addition future ports to interbase or other db's will be simpler if we don't use pg's inheritance methods.

Collapse
Posted by Don Baccus on
Thanks for the thoughtful response, Dan.

First let me make it clear that I'm thoroughly on the fence on this one, and have flip-flopped numerous times, much to Ben's disgust (as he leans somewhat to the "in favor" side of the argument).

You're right that the users and user_groups tables are the only (shall  we say) sophisticated users of the acs_objects framework, so the join  problem is minimal, perhaps.  On the other hand, these two tables are  used a *lot*.  Still, the efficiency issue is up in the air.

The major bottleneck in ACS 4 seems to be way in which the generalized  permissions system is used - I'm sure you've seen some of the threads  in web/db in regard to this.  The permissions bottleneck wouldn't be affected either way by a decision to use PG inheritance, so my thinking that we might see an efficiency gain is probably not correct.  Any gain seems likely to be swamped by the time consumed by the constant checking of permissions.

Regarding SQL92, yeah, that's important to me.  The PG inheritance scheme is moving towards SQL3 compliance, but isn't there yet.  I would assume that other SQL engine implementors are working towards, or already support, SQL3 inheritance but am not sure.  Since the SQL standards committee is likely dominated by RDBMS providers (based on my experience with standards committees for other languages), it is likely that the drive for standardization in this area was driven by RDBMS makers, probably because their customers want ORDBMS features.

But ... I've not done a survey, and in particular don't know if IB implements SQL3 inheritance.

Does anyone else?

Given that ACS 4x makes relatively minimal use of acs_objects in the first place, we may be making a mountain out of a molehill anyway.  By  the time ACS 5 arrives PG should have much better support for inheritance, strengthening the argument in favor at that point in time...

Collapse
Posted by Dan Wickstrom on
I know what you mean about flip-flopping.  I was initially against it when I first heard that we might use pg's inheritance.  This was mostly because I had not been hearing good things about it on the pg hackers list.  Then I played around with it some, and I thought well it is kind of slick and it does seem to work pretty well.  Then after starting to look at the acs 4.x object model, I realized that it was a good match, and I was more in favor of using it, but I was still concerned about the extra porting effort.  Now finally that I've looked at a lot of code in the core, I think maybe that it is overkill and not worth the effort.  I think I need a more compelling arguement to be convinced that it is worth the effort.  I'm going to spend some time looking at some of the new acs 4.x packages and see how much use they make of the acs_objects system, and if I have time I'll set up some benchmarks between the two methods and see if I can better quantify the performance gain.
Collapse
Posted by Don Baccus on
I hadn't read both parts of your message before posting my response (despite the order in which they appear here, you typed faster I guess).

Anyway, my comments about ACS 5 vs. ACS 4 reflect my thinking that most ACS 4 packages probably don't make very sophisticated use of acs_objects and probably won't.  I think more thorough redesign and dependence on the object model will be seen in ACS 5.  This is only an  educated guess, however.

If PG 7.2 has better support for inheritance (inheritable indexes, etc) then I have absolutely no doubt that I'll want to use the feature  in any ACS 5 port (which will include a big datamodel port, once again).

And, if PG currently had better support I'd be strongly in favor for 4x.

I'll be very interested in whatever information you can come up with regarding performance issue.  I'm busy at the moment trying to grind out a very large set of enhancements to my personal project at birdnotes.net (including making it a lot less ugly to look at) so I can clear the decks for March 1's porting blowout...

Collapse
Posted by Ben Adida on
Let me clearly state that I am pro-inheritance, but that I will gladly back down if beaten down by the twin forces of Don & Dan :)

On the SQL compliance issue: we can do quite a bit to abstract out the inheritance by layering views in much the same way that ACS 4.x currently does to prevent explicit 5-way joins in some queries. Which means that if we wanted to switch to a trigger-version of inheritance, we should be able to do that relatively easily. The PL/SQL abstractions help with that a lot, too.

I wasn't thinking of cutting out acs_objects. I was thinking that, either through views on pg_inherit or through explicit population of the acs_objects table, we would maintain the same meta-data interface.

At the end of the day, I cannot imagine how the internal inheritance mechanism could be less efficient than explicit joins and explicit triggers. Plus, the porting of the data model will be that much simpler (there are a LOT of triggers!!). That's why I'm leaning in favor of it.

I've ported the users, persons, parties, and acs_objects table and set up a simple benchmark. For example to benchmark the users, persons, parties, acs_objects join, I've created the following function:


create function bm1(integer) returns text as '
declare
        cnt      alias for $1;
        v_i      integer;
        startt   text;
        endt     text;
        trec     record;
begin
        startt := timeofday();
        for v_i in 1..cnt LOOP
            select u.*,p.*,pt.*,o.*  into trec
              from users1 u, persons p, parties pt, acs_objects o
             where o.object_id = v_i
               and pt.party_id = o.object_id
               and p.person_id = pt.party_id
               and u.user_id = p.person_id;
        end loop;
        endt := timeofday();
        return delta_time_msecs(startt,endt);
end;' language 'plpgsql';

the timeofday function is like now(), except it returns up to usec precision:

acspg=# select timeofday();               
timeofday 
-------------------------------------  
Tue Feb 20 19:46:15.262591 2001 EST (1 row)  

It returns a text string but it works to cast the result to a timestamp, and if used for time interval calculation, they work out correctly. I use this as a basis for estimating the time for executing a query.

In addition to the above function, I have created similar functions to test the following: 1) simple select from the acs4 users table, 2) simple join of acs_objects and the parties table, 3) select from users table which is inherited from persons -> parties -> acs_objects. To perform the tests, I have created two similar data-models, one based on acs4, and one based on porting acs4 to use pg inheritance. I then populated the acs_objects table with 11,000 rows where 1000 of the objects correspond to users in the users table. I'm not sure what a reasonable ratio of objects to users should be, but it seems that the number of objects would be much higher than the number of users. I then ran a test where I averaged the number of selects over 1000 iterations. I get the following results:

acspg=# i trial.sql
               method                |   time   |    units
-------------------------------------+----------+--------------
 user inherits person, party, object | 0.292497 | milliseconds
(1 row)
 
        method         |   time   |    units
-----------------------+----------+--------------
 simple select on user | 0.187112 | milliseconds
(1 row)
 
             method              |   time   |    units
---------------------------------+----------+--------------
 simple join of object and party | 0.275417 | milliseconds
(1 row)
 
               method                |   time   |    units
-------------------------------------+----------+--------------
 join of object, party, person, user | 0.667479 | milliseconds
(1 row)                                                                

This test was performed on redhat 6.2, pg 7.1 beta3 using a 667MHz PIII with 512 Mb of ram. If anybody has requests for what they feel might a more realistic test, let me know, and I'll see if I can set it up.

Collapse
Posted by Don Baccus on
Thanks, Dan!

The first two tests (both on users, one using inherits, the other not)  do essentially the same thing, since the inherited classes physically are part of the users table (simple vs. virtual inheritance  in C++ terms).

What you're seeing, then, is the difference in time required to return  rows of sizeof(user) length vs. sizeof(user)+sizeof(party)+sizeof(person)+sizeof(object) length.

The tradeoff, then, is a slower query if all you want is stuff from the user table.  But ... that time's constant for any query on users no matter how many of the inherited columns you want to return.

Another one to test is:

select count(*) from acs_objects

for both cases.  The inheritance case needs to UNION acs_objects, parties, persons, and users while the non-inheritance case only needs to interrogate the acs_objects table.

In other words, the join case to build the entire child class loses vs. inheritance, but grabbing stuff from the base class by implicit UNION will lose vs. the relational datamodel.  The actual numbers would be interesting to see.  Make sure you have all necessary indexes  declared ...

Collapse
Posted by Dan Wickstrom on
Here it is:

         method          |  time   |    units
-------------------------+---------+--------------
 count(*) on acs_objects | 7.23621 | milliseconds
(1 row)
 
          method           |  time   |    units
---------------------------+---------+--------------
 count(*) on acs_objects2* | 9.97299 | milliseconds
(1 row)
 
          method          |  time   |    units
--------------------------+---------+--------------
 count(*) on acs_objects2 | 2.89982 | milliseconds
(1 row)
 
acspg=#     

The first case is the acs classic case, and the second and third cases are with inheritance. The only difference being that I'm selecting from acs_objects* in the second case.

Collapse
Posted by Don Baccus on
Thanks, Dan!  That's not bad at all.

Your examples remind me once again that the PG syntax will be changing  to the (nicer) SQL3 syntax for inheritance in PG 7.2.  It changed briefly in PG 7.1 but got changed back to the old syntax because the PG folks decided their support for inheritance just wasn't good enough  to justify masquerading as SQL3 inheritance.

I'm going to go look at the datamodel again in the next few days, I've  not done so in about three weeks.

Collapse
Posted by Don Baccus on
Another problem I have with the current PG mechanism is that it's very  error-prone.  Forget the "*" on the select and only the base table is  used.  When porting queries failure to do so will fail in a somewhat silent fashion, i.e. it will work but won't return all (for instance) acs_objects.

This is something else the PG 7.2/SQL3 syntax will fix.

Collapse
Posted by Albert Langer on
Caveat to this and all my future messages. I'm a newbie inexperienced in postgresql, Oracle, acs and porting issues and a "know nothing and don't want to" re Tcl and AOLserver.

Just bypassed sql/oracle permissions problem with:
cvs checkout openacs-4/packages/acs-kernel/sql/postgresql

Still wading through a lot of background.

Seems to me that missing from above discussion of Pg extensions is *rules* - assuming the rules system works "as advertized" and is not broken (I wouldn't know).

Focus on *inheritance* may miss central point that you can abstract the entire api for a postgresql database to make it look entirely like plain vanillla SQL2 with simple tables that are really "writable views" using the rules system. I believe that could even include maintenance of metadata like the kernel by rules that trigger creation of necessary inheritance indexes etc and work with the extensive built in catalog facilities that hold *everything* about a postgresql database and already provide most of the metadata needed.

Surely that would make it far easier for package developers to interface to. They would just be working with what appear to be ordinary "tables", doing ordinary CRUD and not having to think much
about the plpgsql magic behind the scenes.

Now that a prototype kernel is available (when permissions fixed 😉
there is a "working" base necessary for investigating other
approaches. Even if it turns out that rewriting existing queries
is a nuisance when porting other packages, it would be far more
difficult trying to do it later. Should get that done, or at least
carefully investigated, for the kernel now, before it becomes a
"lost opportunity".

This has obvious implications for portability - but can be looked at as targeting for future convergence on SQL3. aD was the *stronghold* of "who needs objects". Everything is going to get more and more OO
and aD is going to get heavily into java "middleware" contrary to
"the book". It would be a real shame for a project that has
expertize in a truly ORDBMS that can put most of the data abstraction
right inside the database engine (plus much of the "business logic")
to sacrifice that for an illusory attempt to straddle divergent
approaches.

As mentioned, ACS4 makes heavy use of Oracle views. Postgresql
essentially has fully *writable* views provided the necessary
framework is setup for package developers to use them. That
could make an *immense* difference.