Forum OpenACS Development: New implementation of tree_sortkey

Posted by Don Baccus on
I thought I'd let folks know that the CVS tree is currently partly broken with some partial changes I've made to the way we do tree queries. I'll be committing my completed changes this weekend, after I get back from Seattle.

I've changed the underlying representation from "text" to "bit varying" ("varbit", actually, since PL/pgSQL doesn't accept "bit varying" in all contexts).

What do these changes get us?

  • No reliance on the collating sequence for "text".

    The toolkit should work with locale support built into PG, and with LANG != C. At least, if it doesn't, it won't be because of hierarchical queries.

  • Shorter tree sort keys

    For most trees with modest numbers of nodes at each level, only one byte per level is needed. If more than 128 direct children exist for a particular node, two bytes are used, giving a maximum of 32768 direct children per node (a bit more than with the current implementation); see the sketch after this list. The text-based version requires three bytes per level.

  • Fast performance

    Postgres optimizes byte-aligned bit "chunks", and the new tree sortkey implementation stores byte-aligned values. Postgres is more than happy to use an index to speed "between" joins just as it does with "text". Shorter tree sort keys mean more keys per index page, meaning fewer page splits and somewhat better performance (probably not much, though).
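
To make the encoding concrete, here's a rough illustration of the one-byte/two-byte idea (this uses modern PostgreSQL cast syntax purely for illustration; it is not the committed kernel code): a child numbered below 128 fits in one byte with the high bit clear, while a larger child number takes two bytes with the high bit of the first byte set as a "long form" flag.

select 5::bit(8);                  -- 00000101            (child #5, one byte)
select (200 + 32768)::bit(16);     -- 1000000011001000    (child #200, two bytes)

-- a node's sortkey is its parent's sortkey with one such chunk appended, e.g.
select 5::bit(8) || (200 + 32768)::bit(16) || 7::bit(8);   -- three levels, four bytes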

I'm not sure when Postgres bit varying support matured. I know that neither Ben, Dan, nor I was aware that we could easily use this type for tree sort keys when we first began investigating the hierarchical query issue.

Note that while this saves space, solves the LANG=C dependency problem, and (also true of my recent changes from LIKE to BETWEEN on the text-based version) is very fast for slicing subtrees because the tree sort key index is used, one problem isn't solved.

Generating all ancestors is still somewhat slow.

There are a couple of approaches that might solve this last problem in the future (not this first release cycle; it will be "fast enough" for now):

  • Define a GiST index type and use it, not btrees

    This would allow us to define "contains" and "contained" operators, both usable by the GiST index strategy. "tree_root contains tree_children" and "tree_root contained by tree_children" to get ancestors and subtrees. Probably a bit slower than btrees for slicing out subtrees but certainly faster when building a list of ancestors.

  • Return a set of parent nodes from a function

    You can only return rowsets from SQL functions. If a rowset could be returned from PL/pgSQL we could build the set of ancestors by climbing the tree programmatically.

Keep your eyes open over the next several days; I'll make an announcement when I commit all my changes to CVS. Until then ... expect bits and pieces to be broken ('cause they are).
Posted by David Walker on
Maybe helpful, maybe not. The next postgres version, with "create or replace", will allow self-referencing recursive functions. (You need "create or replace" because currently "create function" will fail due to the reference to a function that does not exist yet. Postgres may support self-referencing functions with a simple "create function", but I don't know that for sure yet.)

I've written a recursive sql getparents function (using a hack that emulates "create or replace") that gets all the parents and returns a rowset. I don't really have the data available to test the performance of this function but it seems like it should be fast.

See https://openacs.org/bboard/q-and-a-fetch-msg.tcl?msg_id=0003Fq&topic_id=12&topic=OpenACS%204%2e0%20Design

Posted by Gilbert Wong on
Don,

Will these be transparent changes?  I've adapted the "old" tree_sortkey triggers (acs_objects_insert_tr and acs_objects_update_tr) for some custom code I created.  If I drop the DB and reload it using the new code, would I have to change my customized triggers?  Thanks.

Posted by Jun Yamog on
Hi Don,

This affects only CVS copies after Dec 6 right?

I have a Dec 3 tree which should be ok right?

As with Gilbert's question, how will the upper levels be affected?  Is this something transparent to us, or do we have to make changes to the packages that have already been made?

Posted by David Walker on
Please let me know if I'm not on the right track. I'm trying to add ideas but I haven't been actively messing with the OpenACS4 code.

I'm wondering whether dynamically generating a query in Tcl would hurt or help speed. The query plan would not be cached, but the tree is handled easily (assuming I understand tree_sortkey).
set path "/tr/ee/so/rt/ke/y"

set path_list [split $path /]
set x [expr {[llength $path_list] - 1}]
while {$x > 0} {
    # emit one select for each ancestor prefix of the path, joined by UNION
    ns_write "select * from permissions where tree_sort = '[join [lrange $path_list 0 $x] /]'\n"
    incr x -1
    if {$x > 0} {
        ns_write "union\n"
    }
}
Posted by Don Baccus on
David, yes, building the list of identifiers in Tcl would work, but it can also be done in PL/pgSQL thus hiding it from the Tcl code (as we're trying to do with all DB-dependent code).  That was one of my ideas for finding parents faster.  I tried the recursive SQL function trick and ran into the roadblock that PG 7.1 doesn't support such recursiveness.  Drat!  I was bummed out.  It's welcome news to learn that PG 7.2 has removed the restriction, I wasn't aware of that.

As far as the triggers go ... for starters, your own custom tables could  continue to use the "text" type triggers, there's no need for every table to do it the same way.  Though overall it's cleaner, of course.  The triggers do need changing but the changes are very minor.  Probably the best thing to do is to look at some examples after I commit my changes this weekend and decide if the benefits are worth your changing this stuff for *your* tables.

Breaking the coupling with the collation sequence is the most important benefit of doing it the new way, IMO.

Jun ... no guarantees on the Dec 3rd CVS tree.  I was making some changes to the use of the old style sortkeys (using BETWEEN rather than LIKE in order to pick up the index on tree_sortkey) and was doing some testing before updating.  But ... that's transition code.  We're not making promises that every night's CVS tree will work.  Overall, the nightly tarballs have been in good shape for all but a handful of nights but I'd consider that "luck".

Posted by Don Baccus on
Oh, I've checked the thread and seen David's "create or replace" hack written in PL/Tcl.  Cute!  I may look into this later.

Frankly I'm not too worried about the speed of finding parents at this point.  For the most part it's done on limited sets of objects, i.e. site nodes, bookmarks, and the like.  The tree_sortkey index can be used to diminish the rows scanned somewhat for larger structures and these limited sets usually aren't going to be large anyway.

So I'm probably willing at this point to put aside a faster solution until we're ready to require PG 7.2 or at least until our second release.  Note it shouldn't require any change in the datamodel's representation of sortkeys so we can make such changes transparent to OpenACS 4 sites in existence when we make this further improvement.

Posted by David Walker on
Just in case postgres 7.2 doesn't support recursive functions
directly I'll add this note on creating recursive functions using
"create or replace".

To use "create or replace" to create a recursive function you first
issue a create function command that returns the proper content type
but is not recursive.  Then you replace that function using "create
or replace" with the final version that is recursive.

It might be considered a hack but it isn't too bad, especially
compared to the hack I did to get recursive functions into PG 7.1
(although that one could have been worse: at least I did not have
to edit any source code to do it).

Hopefully they'll just support recursive functions directly and we
won't have any trouble.

Posted by Ken Kennedy on
Don: Along the "is this transparent" vein, have you changed the next_sortkey function? (I'll look in CVS, but I figured I'd ask so the discussion can be documented here.) I ask because bookmarks (my module) is rife with CONNECT BYs, so I've used the tree_sortkey solution quite a bit in porting to PG. Improvements sound excellent, but if I've got to go back and update my module at this stage of the game...bummer. Not that I won't *grin*, but is there going to be a point where changes like this will be "frozen" until after the release?
Posted by Don Baccus on
Ken ... I've made the necessary changes to every package that uses our hierarchical keys to replace the Oracle CONNECT BY queries.  Including bookmarks.  I've been doing a lot of testing (and fixing my various typos and other boo-boos) the past two days.
Posted by Don Baccus on
Oh, once we hit "beta" fundamental changes like this will be strictly verboten, even by me.

I would've put this off but after I found out that the "LIKE (tree_sortkey || '%')" hack would always (and properly, I might add) do a sequential scan of all the rows of interest rather than use the index on tree_sortkey I knew I had to fix this.  Changing to a "BETWEEN" form worked fine but at that point I realized I could change the representation from text to bit varying without having to change my already-modified queries.

Given that the LANG=C collation dependency was also a significant problem I decided that now was the time to do this.  Generating upgrade scripts for a post-first-release-cycle change of this magnitude would be a major PITA (actually PIMA!)

Posted by Don Baccus on
David ... a problem with most/all of the ideas above is that PG's not going to do an indexed scan on the table containing the potential parents, but rather a sequential scan - which is the problem, in a nutshell.

The current scheme I've got leads to simple enough queries to find all parents, once you get the hang of it:

select parents.*
from foo_table parents, foo_table child
where child.foo_key = :foo_key
  and tree_ancestor_p(parents.tree_sortkey, child.tree_sortkey);
The trick is to come up with a query that will use the index on parents.tree_sortkey, which this query will not.

Generating the correct prefixes via a recursive SQL function would allow you to say something like

select parents.*
from foo_table parents, foo_table child
where child.foo_key = :foo_key
  and parents.tree_sortkey in tree_ancestor_sortkeys(child.tree_sortkey);
but, alas, this will not use the index on parents.tree_sortkey either.

Now ... generating the list of parent sortkeys separately and passing a list of literal tree_sortkeys as the right operand to "in" might do so ... but I'm not sure. I may do some testing.
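
Such a query would look something like this sketch (the keys here are purely hypothetical bit-string literals, computed ahead of time and spliced in):

select parents.*
from foo_table parents
where parents.tree_sortkey in (B'00000001',
                               B'0000000100000011',
                               B'000000010000001100000111');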

But ... using the GiST index method and defining proper "contains" and "contained by" operators would use the index on tree_sortkey and be readable, as well ("@" is the PG "contains" operator):

select parents.*
from foo_table parents, foo_table child
where child.foo_key = :foo_key
  and child.tree_sortkey @ parents.tree_sortkey;
But this is going to have to wait a bit ...
Posted by Don Baccus on
Actually, in an effort to oversimplify I've cast the query in a much worse style than necessary under the current scheme of things. The toolkit never queries an entire table in this way, since objects, bookmarks, bboard threads, content folders etc. are created by packages. Rather, the queries do something like "show all the parents for the child in the subtree starting with the root folder for this package", giving us two opportunities rather than one to restrict the query, leading to a "between" with left and right values determined by the package-specific root object (folder, whatever) id and the given child.
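
In other words, the real queries look more like this sketch (same made-up foo_table and key names as above):

select parents.*
from foo_table parents, foo_table child, foo_table root
where child.foo_key = :child_key
  and root.foo_key = :package_root_key
  and parents.tree_sortkey between root.tree_sortkey and child.tree_sortkey
  and tree_ancestor_p(parents.tree_sortkey, child.tree_sortkey);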

Even the general case can be improved:

select parents.*
from foo_table parents, foo_table child
where child.foo_key = :foo_key
  and parents.tree_sortkey <= child.tree_sortkey
  and tree_ancestor_p(parents.tree_sortkey, child.tree_sortkey);
The index on tree_sortkey can be used on the "<=" (or on the "between" in the more complex example alluded to above) so rather than searching the entire tree, we'll search the tree "above and to the left" of the child node. Of course if the child's the rightmost leaf of the entire tree then we'll still end up searching the entire table. On a more cheerful note, if the child's the leftmost leaf of the entire tree we'll grab just the parents with the "<=". The average case will lie between the two extremes.

This makes the current system a bit more palatable until I either implement a GiST index and the proper "contains" operator, or someone comes up with another bright idea that we can show will make effective use of the index on tree_sortkey.

Also "tree_ancestor_p" only compares the keys, it doesn't touch the RDBMS. It's one line is "return substring(potential_parent, potential_child) = 1". I've encapsulated it in a PL/pgSQL function in order to make it easier to change things in the future if we need or want to.

(PG's bitstrings are pretty cool, supporting "substring", "position", and "length" operators just like text strings)

Posted by Don Baccus on
At the risk of rambling on and on, it appears that using "in" with a list of constant tree sortkeys will indeed use the index on tree_sortkey.  So I now have a couple of approaches in mind which will make selecting the ancestors of a node every bit as efficient as selecting a node's children.

Meaning we're home free with just a bit more work.  Cool...

These queries should really be faster than the equivalent "connect by" queries in Oracle.  The penalty we pay, though, is a slight increase in storage and an increase in update/insert/delete times (the time required to calculate and update new tree_sortkeys).  Since these DML operations are much less frequent than selects I really don't care.

As far as the space required for the new tree_sortkeys, after loading up all the openacs.org users, forums, categories and messages the maximum tree_sortkey for acs_objects is 8 bytes + 4 bytes for the variable length counter (text, bit varying, etc are all stored in counter + data form), for a key that's at the 7th level in the object tree.

Not bad.

Of course, more complex sites will have deeper trees with correspondingly longer keys but most nodes won't have more than 128 immediate children, thus taking a byte for each level in the tree.

Posted by Ken Kennedy on
Holy smokes...thanks, Dan!! (For doing the updates) You're the man!! A lot of work for you, but I definitely appreciate it...looking over what you're doing here, it looks outstanding. Now I'm chomping at the bit for the "brokenness" to clear up, so I can cvs update and get all this cool stuff to play with! *Grin* Looking forward to hearing from you.
Posted by Don Baccus on
A bunch of stuff is checked in and working, but thanks to David's inspired hack that taught me how to declare a recursive SQL function I've broken things again :)

Actually my checked-out copy's pretty good, and I'll be committing my changes tomorrow.

Using Dave's hack makes finding parents very cool, though the actual technique's quite different than Dave's example, being based entirely on the tree_sortkey for the child node:

select parent_nodes.*
from (select tree_ancestor_keys(acs_object_get_tree_sortkey(:child_object_id)) as tree_sortkey) parent_keys,
  acs_objects parent_nodes
where parent_nodes.tree_sortkey = parent_keys.tree_sortkey;
This query is blindingly fast, using the index on tree_sortkey to find the right nodes from the parent_nodes relation.

And it's even somewhat readable, a bonus! tree_ancestor_keys is the recursive SQL function, and acs_object_get_tree_sortkey is necessary because, well, you can play around and figure out why yourselves :)

Kudos to Dave once again for what is truly a deviously sneakily deliciously decadent hack on the PG internal tables!

Posted by David Walker on
Thanks Don.  This project is very important to me.  That is why I
try to generate at least one line of code every six months.  I feel
it is important that I keep up my productivity.
Posted by Dan Wickstrom on
I've been playing around with the new tree query support and I'm curious about a few of the implementation details. I haven't had a chance to follow the new tree query development in detail, so it's possible I missed the answers to my questions when I searched the forums.

Limiting the number of nodes per level to 2^15 nodes seems to be too restrictive and it's likely to cause problems as the size of a site grows. A probable problem area in the kernel is the acs_objects table. The tree formed by acs_objects is likely to be wide and shallow, so it's not too hard to imagine more than 2^15 objects with identical context_ids. The length of the key value at each level is currently encoded with one bit. I think it would be prudent to either allocate two bits per level for encoding the length, thus allowing key lengths of 6, 14, 22, or 30 bits, or as an alternative the current default key lengths could be changed from 7 and 15 bits to 15 and 31 bits. The first method would allow ~ 2^30 nodes per level while the second method would allow ~ 2^31 nodes per level. The storage requirements for the sortkeys would increase, but you would still have byte-aligned data, and I don't think the overall increase in query time would be significant. It should still be better than the old text based sortkeys.

There is a comment in acs-kernel/sql/postgresql/postgres.sql that the old text-based sortkey method was limited to 25k nodes per level. This is an incorrect statement. The old text-based sortkeys, for all practical purposes, had no limit on the number of nodes per-level. If you wanted to be pedantic, the limit = 159^159 ~ 1e350, and there was a tradeoff between the depth of the tree and the number of nodes per-level.

Why is a helper function required to get the tree_sortkey for a table when using the tree_ancestor_keys function? I've tried replacing the helper function with a select and I don't see any difference in the cost or execution time of the query. There is a quote in postgres.sql that indicates that the query will not work without the helper function:

"This query will use the index on tree_sortkey to scan acs_objects. The function to grab the tree_sortkey for the node is necessary (and must be defined for each table that uses our hierarchical query scheme) to avoid restrictions on the use of SQL functions that return sets. "

This however, doesn't seem to be the case as shown by the example below.


testtree=# select o.*
testtree-#  from tree_view o,
testtree-#    (select tree_ancestor_keys(acs_objects_get_tree_sortkey(2051)) as tree_sortkey) parents
testtree-#  where o.tree_sortkey = parents.tree_sortkey;
 level |     object_id     | context_id |                       tree_sortkey                       
-------+-------------------+------------+----------------------------------------------------------
     1 |  1                |            | 00000000
     2 |    307            |          1 | 0000000000000000
     3 |      311          |        307 | 000000000000000000000000
     4 |        323        |        311 | 00000000000000000000000000000000
     5 |          371      |        323 | 0000000000000000000000000000000000000000
     6 |            611    |        371 | 000000000000000000000000000000000000000000000000
     7 |              2051 |        611 | 00000000000000000000000000000000000000000000000000000000
(7 rows)

testtree=# explain select o.*
testtree-#  from tree_view o,
testtree-#    (select tree_ancestor_keys(acs_objects_get_tree_sortkey(2051)) as tree_sortkey) parents
testtree-#  where o.tree_sortkey = parents.tree_sortkey;
NOTICE:  QUERY PLAN:

Nested Loop  (cost=0.00..2.03 rows=1 width=32)
  ->  Subquery Scan parents  (cost=0.00..0.00 rows=0 width=0)
        ->  Result  (cost=0.00..0.00 rows=0 width=0)
  ->  Index Scan using acs_objs_tree_skey_idx on acs_objects  (cost=0.00..2.02 rows=1 width=20)

EXPLAIN
testtree=# select o.*
testtree-#  from tree_view o,
testtree-#    (select tree_ancestor_keys((select tree_sortkey from acs_objects where object_id = 2051)) as tree_sortkey) parents
testtree-#  where o.tree_sortkey = parents.tree_sortkey;
 level |     object_id     | context_id |                       tree_sortkey                       
-------+-------------------+------------+----------------------------------------------------------
     1 |  1                |            | 00000000
     2 |    307            |          1 | 0000000000000000
     3 |      311          |        307 | 000000000000000000000000
     4 |        323        |        311 | 00000000000000000000000000000000
     5 |          371      |        323 | 0000000000000000000000000000000000000000
     6 |            611    |        371 | 000000000000000000000000000000000000000000000000
     7 |              2051 |        611 | 00000000000000000000000000000000000000000000000000000000
(7 rows)

testtree=# explain select o.*
testtree-#  from tree_view o,
testtree-#    (select tree_ancestor_keys((select tree_sortkey from acs_objects where object_id = 2051)) as tree_sortkey) parents
testtree-#  where o.tree_sortkey = parents.tree_sortkey;
NOTICE:  QUERY PLAN:

Nested Loop  (cost=0.00..2.03 rows=1 width=32)
  ->  Subquery Scan parents  (cost=0.00..0.00 rows=0 width=0)
        ->  Result  (cost=0.00..0.00 rows=0 width=0)
              InitPlan
                ->  Index Scan using acs_objects_pkey on acs_objects  (cost=0.00..2.02 rows=1 width=12)
  ->  Index Scan using acs_objs_tree_skey_idx on acs_objects  (cost=0.00..2.02 rows=1 width=20)

EXPLAIN
testtree=# 

Overall, the implementation looks good, and it's nice to be able to eliminate the LANG=C requirement and the dependence on the collation order.

Posted by Don Baccus on
Thanks for the feedback, I've been hoping you'd have time to dig into this ...

First of all, thanks for correcting my misunderstanding about the text sortkeys version, I thought they were limited to two characters per level.

I had considered going to three or four bytes for the "wide" tree representation but had assumed the object tree would be more "branchy" than you do. My choice would be to keep the single-bit "short or long" encoding and go to three or four bytes if we feel we should do that. My reasoning is that bboard, file storage and other content-oriented packages will naturally build trees in the content repository that will be "branchy" and it would be nice to stick with single-byte keys in such cases as much as possible.

We should decide this soon ...

(for those following the conversation who aren't intimate with various bits and pieces of OpenACS 4: there's one sort key placing such objects in the object tree, and another placing them in the CR folder tree).

As far as the function issue, I wasn't thinking of the replacement you suggest, i.e.

testtree=# select o.*
testtree-#  from tree_view o,
testtree-#    (select tree_ancestor_keys((select tree_sortkey from acs_objects where object_id = 2051)) as tree_sortkey) parents
testtree-#  where o.tree_sortkey = parents.tree_sortkey;
But rather other ways of casting the query that I'd tried at first, which led to PG complaining that I was using tree_ancestor_keys in ways it didn't like. The important point is that you have to grab the ancestor keys in a separate subselect and aren't able to combine this with anything else. At least all my attempts to do otherwise failed.

Which is cleaner? Your way's probably slightly faster, my way isolates things a bit better perhaps. Half a dozen of one, six of the other, pretty much?

And thanks for the compliment. I'd stumbled across the fact that BETWEEN would use an index while playing around with the problems we faced due to the unforeseen fact that "LIKE" with a non-constant right operand correctly ignores the index. Having made that change I then searched for another data type that would work, and stumbled across BIT VARYING and, after reading some PG sources, the fact that operations on byte-aligned values are very fast.

So it's still really your implementation, it's just heavily disguised!

Posted by Don Baccus on
In my answer above I meant 7-bit short keys or three or four byte long keys (four, I guess, go for broke) ... I'd like to keep the short key to one byte.
Posted by Dan Wickstrom on
In general, you're probably right about the branchiness of the various trees, but I think acs_objects is a clear exception, and I think we need to make sure we support longer key lengths for that table. Your suggestion of using 7- or 31-bit keys seems like a good compromise. You seem to be suggesting that it's important to retain one-byte keys. Have you done any benchmarking to substantiate the need for one-byte keys? Intuitively, it seems likely that wider keys would slow down comparisons when querying, but I wonder how significant the comparison cost is relative to the overall query cost.

One of the things that I really liked about the old text based method was that it was completely generalized, so that you didn't have to worry about the branchiness of the table ahead of time. The m-vgID method would support really deep hierarchies, or it would support really wide hierarchies, without any pre-implementation requirements to optimize parameters for either case.

The function issue is no big deal, mostly just curiosity on my part, as I wanted to make sure I wasn't missing some subtle detail that could affect performance. I like using the sub-select, because I don't need to know the name of the helper function when I write a query. It's not hard to find out, but it's one less thing to remember when doing new development. Like you said, half a dozen of one, six of the other...

Posted by Don Baccus on
I don't think wider keys significantly slow down queries - indexing's the big deal here.  Once the indexes are used, queries run fast.

But reducing the space overhead seems important to me if we can do so without penalizing ourselves in some other significant way.  There are all sorts of reasons for this.  A simple one - in PG, when you join tables you drag rows around, so the longer the rows are, the more time is spent in byte-shuffling.  I mentioned the impact on indexes somewhere above in a previous post, I think.

I think the 7-bit/31-bit split is a reasonable compromise.  While the text approach implemented unlimited keys (I think I confused your implementation with Philip's old 2-char-per-level implementation for bboard), 2^31 (2 billion plus) direct children per node should be sufficient.

There's no way to use a delimiter ala the text approach unless we go to unary rather than binary keys! :)  Getting rid of the collation dependency (LANG=C) was the only reason I switched representations...

Posted by Don Baccus on
I've implemented 31 bit long keys as suggested by Dan.  Thanks for the feedback!
Posted by Gilbert Wong on
Don,

Have you done the commit already?  What did you need to change in order to move to the 31 bit keys?  I guess what I'm asking is what would I need to change in order to get the 31 bit keys.  I have one table which uses a copy of the acs_objects tree_sortkey insert/update/delete triggers.  Thanks.

Posted by Don Baccus on
If your code is already using the new BIT VARYING representation, you don't need to change your code.  You'll just need to reinstall because the functions which do key manipulations (found in postgresql.sql) now spill over to 31-bit rather than 15-bit long keys.

If you're converting from the old varchar + LIKE style you'll need to change the triggers on your type and switch to the BETWEEN style comparisons (I'd done this for the varchar keys some time ago since the LIKE style doesn't use the index on the varchar sortkey).  Take a look at some of the queries using tree_ancestor_keys.  Just grep and look, the usage is very straightforward even if the path we took to get here is not :)
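
The before-and-after looks roughly like this (table and helper names are made up, not the committed kernel code):

-- old varchar style: the non-constant right operand of LIKE defeats the index
select children.*
from my_table children, my_table parent
where parent.my_key = :parent_key
  and children.tree_sortkey like (parent.tree_sortkey || '%');

-- BETWEEN style: bounded below by the parent's own key and above by the largest
-- key its subtree could hold (tree_subtree_max is a hypothetical helper), so the
-- index on tree_sortkey can be used
select children.*
from my_table children, my_table parent
where parent.my_key = :parent_key
  and children.tree_sortkey between parent.tree_sortkey
                                and tree_subtree_max(parent.tree_sortkey);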

Posted by Gilbert Wong on
Thanks.  I already changed my code to use the bit varying a few weeks ago.  I'll just do a reload as you suggested.  Thanks.