Forum OpenACS Q&A: RFC: Avoid exposing object_ids in permanent URLs

Dirk asked me to explain myself better over in this thread:

https://openacs.org/forums/message-view?message_id=75455

I don't want to further clutter that thread, so I'll deal with it separately.

If you expose raw object_id's in permanent URLs, such as the URL of a forum, the URL of a posting, or any other URL which Google will index or users will bookmark, you're setting yourself up for trouble.

Because when at some point you need to migrate your web site to new software, whether it be a new release of OpenACS, a new cool .NET or Java server system, or whatever it is, all of your permanent URLs are going to have to change.

This problem doesn't occur if you stick with OpenACS and run upgrade scripts and everything behaves well. And perhaps this is the situation we want to design for.

But in my mind, object_id is an internal implementation-specific detail, just like the old ".tcl" or ".adp" extenstions were, and thus shouldn't be exposed in the URLs.

Instead, what I'm proposing is what I did with the bug-tracker. Bugs are still ACS Objects, and thus have a bug_id, references acs_objects.object_id.

But they also have a bug_number (could've been named bug_no), which is unique (package_id, bug_number). This is the number that's exposed in public URLs, and this number starts with 1 for every new package, and grows from there.

I think this leads to much nicer URLs and user interface (imagine if instead it was "Bug #830234" even though it was the first bug created in your project.)

It's like in an object oriented system. You wouldn't want to expose the internal object pointer. Instead you expose some reasonable unique identifier (in this case the combination of URL and bug_no).

Makes sense to people?

/Lars

Collapse
Posted by Dirk Gomez on
OK, now I got it ;)

Does make a lot of sense to me. It's even better to have a reasonable and speaking short URL.

Collapse
Posted by Jeff Davis on
One drawback with the "number by package id" thing is it means you have number_id collisions between packages and you can't easily move things between packages and you can't have a generic dispatch routine (like the yourserver.com/obj/ID). To make this concrete, take bug tracker on openacs.org. There are two instances created: one for openacs.org bugs and the other for openacs bugs generally. Often a bug submitted in one or the other instance should be moved but as it stands the bug id's collide. Furthermore, patches submitted to cvs with bug #s in them are ambiguous since generally you just say bug X/patch Y and don't specify the package #.

I think there is a lot of value to making moving things around on the site easier and I think having a generic object dispatch could be a great thing (especially in the context of aggregated site views like a new-content page and for searching).

If being able to ensure a range of object_ids are available, we could put a hook in to the bootstrap loader to start at some higher number so that you could find the highest number on the current site, start above that and hence ensure the range is available. Of course if you are merging two sites all bets are off, but I am not sure that's a big deal.

Collapse
Posted by Jeff Davis on
Actually I guess since you would not consider a generic dispatch url to be a "permanent" url as it would never do anything other than issue a 302 to the "real URL" that criticism is not really valid.

I do find the ambiguity of bug IDs accross instances on a single site to be a drawback in the design...

Collapse
Posted by Lars Pind on
Jeff,

I agree that the ambiguity is a problem.

Another way of fixing that would be to have a bug-tracker instance have a short pretty name, like "OpenACS Bug", or something even shorter, which can be used in referring to bugs, like "OpenACS Bug #123".

This would also be the string that you put in brackets in the subject line of email notifications going out...

OpenACS Bug #123: Here's what we tell you.

When you move a bug, well, you've moved the bug. Conceivably we could keep the old bug row around and clone it, and have bug-tracker redirect to the new, moved copy. The new one would still have a new number, though, as in "dotLRN Bug #73".

/Lars

Collapse
Posted by Jeff Davis on
I like the disambiguation via extremely short spellable tags.

The only other issue is that DB vendors added sequences for a reason. There are serious performance implications to trying to do "roll your own sequences" and it's easy to either get wrong or create performance problems.

In the category of "getting it wrong" is the current bug number generation for bug-tracker which does this query in bt_bug__new:

select coalesce(max(bug_number),0) + 1
    into   v_bug_number
    from   bt_bugs
    where  project_id = p_project_id;
to get the next bug number but does not do any locking to ensure that no one else ends up with the same number.

Locking the bt_bugs table is probably not an acceptable solution anyway. You could get a little better performance by having a table with (package_id, cur_bug_num) and doing a select for update (which would likely be fast enough for bug tracker for most sites out there but would not work well for a high traffic site with a lot of inserts (think forums for a site getting something like the traffic slashdot gets).

What are you left with? Sequences again only this time you would have a sequence or two per package_id. I think that is probably ok although you end up with a pretty cluttered db. Sequences mostly just work right although generally the tradeoff for avoiding locking is to sometimes have gaps in the sequence.

Collapse
Posted by Jeff Davis on
oh, there is one other possible answer on bug tracker which is that since there is a constraint:
 
constraint bt_bugs_bug_number_un
  unique (project_id, bug_number)
you could do the insert, and loop until the bt_bugs_bug_number_un constraint stopped firing (which would generally be one time through).

I am not sure how you do that with postgresql though and if that works in a transaction or not.

Collapse
Posted by Lars Pind on
If we were to do this OACS-wide, I suppose we'd have the

(object_type, package_id, object_no, object_id)

with a unique (object_type, package_id, object_no)

table, and a helper PL/SQL proc which could do what you suggest: Grab the numer it thinks is right, and keep trying to insert a row into this table until it succeeds, then return the object_no to the caller.

Do you think that could be made to perform well enough?

/Lars

Collapse
Posted by Ola Hansson on
Is the unique constraints on (package_id, bug_no) hurting performance? If so, is it possible to estimate how much and what type of queries/dml's suffer the most? There will be calls to an extra sequence per instance, etc...no?

Some packages have the notion of what I call "sub instances", e.g., in my initial sketch of Curriculum (just to name something I'm familiar with) an individual instance of it can hold any number of "sequences" (courses) which in turn has "activities" (learning resources)... So, to implement this according to your proposal (which I have nothing against, if it proves valuable), I would have uniqueness on three colums (package_id, sequence_id, activity_no).

Packages designed under the paragigm I described may suffer even more, performance-wise. *If* it is a problem in the first place, that is.

Also, packages will be even harder to design right for beginners.

FWIW,

/Ola

Collapse
Posted by Jeff Davis on
Ola,

there is not really a performance issue with the constraint, the issue is that if you had two people simultaneously submitting bugs that you could end up with the same bug number for both inserts and so one would work and the other would cause the constraint to fire and the insert to fail. As it stands now, that will throw an error back to the user rather than attempt to insert the row again.

Retrying the insert on that when that fires is ok and would not be a problem up to a point but yeah, it makes writing code harder since you need to know how to handle exceptions properly and the constraint names need to be known (and you are relying on the constraint being there for things to work right etc.)

The point where it would break down would be when the rate of inserts is high enough that the query to find the max number took longer than the interval between inserts. Inserts would then back up and at some point the service just would not work.

Using a table with package_id,type,number and select for update would be better since even with a lot of rows the select for update bit would be fast so you could support a higher transaction rate before it broke down. At some point even that would be a problem and you would have to go to having a sequence, and even sequences can have problems so what you go to is a sequence that can hand out more at a time and cache sequences which comes at the expense of making sequences more gappy and no longer even necessarily sequential. To get an idea of why it matters, take a look at all those wierd options for oracle sequences; they are there for a reason and when you think about how fast getting a number from a sequence is, think also about sites for which that is not in fact fast enough.

The think to keep in mind though, is that these kinds of transaction rates are well beyond what the vast majority of DB's see.

My real concern with things like the code in bug tracker is less that of possible performance problems, it is that it will outright fail for some users in practice, it will be hard to reproduce, and it is buried down inside a plpgsql proc so is likely to be overlooked.

one solution of that would be to store a key related to the document, for instance the MD5 key of a post.

advantage:
document change => key change
key not in current server => could be located in the web, should the key contains that info

disadvantage:
MD5 key is long, 32 bytes
MD5 is not collision free, but no collision known
quite easy for document, post, images but for a bug

note that malte post a similar idea https://openacs.org/forums/message-view?message_id=65365

Collapse
Posted by Don Baccus on
select max(some_value)+1 from foo

DOES NOT WORK.  As Jeff says, sequences were invented by vendors for a reason.

The *only* way to make that statement work in a concurrent environment is to do a full table lock or the insert-until-it doesn't fail approach.  Both are ugly.

Even if concurrency weren't a problem, as the table grows, the full table scan required by the max() function would become an issue, particularly in PostgreSQL (in Oracle it's not so bad because the aggregate can scan the index rather than the full table, but PG's multiversioning storage system makes that optimization impossible).

I personally don't see exposing object_ids in the general case as being that big a deal.  Migration of data to a non-OACS system is going to be a PITA and this would be just one of many, many things to tackle.

On the other hand I agree that having nice bug numbers is useful.  People communicate about bugs in e-mail and the like, and being able to say "Check out bugs 23 through 25 in OpenACS" is more convenient than saying "Check bugs 75763, 77321, and 78212".

Here's one idea: create a sequence for each instance of the bug tracker that's mounted, using the package_id as part of the sequence name.  Create it in the post instantiate callback proc.  It would then be easy enough to use the dynamic name in your queries that grab the next available number.

Something like:

select nextval('bt_[ad_conn package_id]_seq');

This works beautifully in a high concurrency environment.  It doesn't slow down as your table of bugs grows.  You can do double-click detection using the object_id (use ad_form's "key" psuedo-datatype, which also signs and verifies the object_id which is generated) so you don't have to grab the nextval until you're actually inserting the new bug.  And if there's a failure for some reason leaving a "hole" in the sequence wouldn't be that big a deal, because such failure should be very rare if everything's put together correctly.

Collapse
Posted by Tom Jackson on
Subject: Re: [ aolserver-Bugs-588470 ] Memory leak when running Tcl scipts
Bugs item #588470, was opened at 2002-07-30 03:59

I was just looking at a bug email from Source Forge. This one is for AOLserver. Somehow I doubt that AOLserver has had 588469 previous bugs. I do see the desire to have nice small numbers for certain sequences, but first the vendor supplied sequence routines should always be used for that purpose, and they should probably be inside the __new pl routine that creates the object. However, then you are stuck trying to design new ways to return the new 'nice id' so that you can do things like redirect to the right place after the insert. All of the routine ways of doing things in OpenACS are now going to be ad-hoc, and for little or no advantage.

It is useless to consider moving the data in the design of a site. The problem of conserving external links will be the least significant problem. If you move to a different software, more than likely, it will not have the concept of a package either, so the package_id will present the same issue that the object_id presents.

What would be helpful would be an xml or other hierarchical language method of describing the data in your system. This language would aid in the creation of the datamodel, as well as the ability to dump/restore the data.

I think it would still be near impossible to move to another system when you consider the permissioning scheme the package will rely on.

Overall I don't think having big, not quite sequential numbers is a bad idea. It lends a little obfuscation to the workings of a site. It can make it more difficult to hack. For all the benefits you get from not having to deal with new sequences, and getting it wrong, not having to write extra code which is non-standard, I don't see why this is good. For application specific problems like this, maybe create an unique index on some fields in your primary table, I doubt the object_id foreign key is the only thing that makes the row unique. For the bug tracker maybe who submitted the bug and the date is enough. I doubt that one person would submit two bugs at the same time.

Collapse
Posted by Dave Bauer on
One way to work around this is by naming your objects. ETP does this for each page. It uses the cr_item name and cr_item parent_id to define an item. Each "directory" is a cr_folder so ETP can find the item.

One possibility is to convert the bug "title" into text suitable for using in a URL. This would not make short URLs though. It would work for all the other elements of a bug URL, that is, the bugtracker instance name or site-node, the project and component names could be used as parts of the URL.

Collapse
Posted by Christian Eva on
I agree very much with the proposal to have an external (permanent) id per object type. In my applications (non ACS, DB independent) I did choose to generate the ID's in one specific procedure for the whole application as not all DB's used do have sequences.

I have one table with all the last keys, which can also be initialised to logical start values and an increment value as well as a limit, where the numbering restarts.

This is some overhead but faster than the aggregate function solution and it leaves you in control. As it is in one place its functionality can be changed or improved without changing the logic in the different places of usage.

As it is a logical key (usually the table name) one can use any combination, i.e package-id+subsite-id etc).

This is just a simple approach and in now way the most sophisticated solution. It had to be done quickly and it does it's job since very well.

I insert the code here (out of my environment) just for illustration.

--  DBTYPE=pgs7, generated by zt_gen_sql at Tue Jan 14 16:29:16 GMT 2003
--  for PostgreSQL-7.x, to be executed by psql
--  (C) 1996/99 by Panorgan AG, CH-8820 Wädenswil

--  TABLE taf_idref - TAF ID Referenz Verwaltung

create table taf_idref
(
  i_idref    char(20) not null,    -- ID Referenz Name
  i_last    decimal(10) not null,  -- ID Letzter Wert
  i_start    decimal(10) not null,  -- ID Startwert
  i_incr    decimal(4) not null,  -- ID Inkrement
  i_limit    decimal(10) not null,  -- ID Limite
  primary key ( i_idref))

# ==========================================================
# y_db_getId -- generate uniqe id's for key fields
#
#          be aweare of calling the proc in a db transaction
#          otherwise we will commit the changes in here!
#
# USAGE:    idref      identifier for unique key (name of field)
#          tablename  idref table name (default taf_idref)
#          start      idref start value (default db definition)
#          incr        idref increment value (default db definition)
#          limit      idref limit value (default db definition)
#
#          start, incr, limit are only considered for a new idref!
#
# RETURN:  key        generated key value
# ==========================================================
proc y_db_getId {idref {table {taf_idref}} {start {}} \
                      {incr {}} {limit {}} } {
    global errorInfo errorCode DB_lib_priv Fields

    # make sure we have the table def
    y_table_def $table

    # lock the table to make sure we are the only one fiddling

    switch -- $DB_lib_priv(DB_TYPE) {
        adab    {
            set lock "lock (wait) table $table in exclusive mode"
        }
        orac    {
            set lock "lock table $table in exclusive mode"
        }
    }

    set ret [sql run $lock ]
    Debug $ret 9

    # read the last value
    set cmd "select i_last, i_start, i_incr, i_limit from $table
                    where i_idref = '$idref'"
    # open cursor
    if [catch {sql open $cmd} c] {
        Debug "Error in sql open: $c error=$errorCode"
        if {$errorCode < 0} {
            # return error
            return -code error
        }
    }
    Debug curs=$c 9

    if {![string length $c]} {

        # no record found, insert it
        if {[string length $start]} {
            set i_start $start
        } else {
            set i_start $Fields(i_start.value)
        }
        set i_last $i_start

        if {[string length $incr]} {
            set i_incr $incr
        } else {
            set i_incr $Fields(i_incr.value)
        }

        if {[string length $limit]} {
            set i_limit $limit
        } else {
            set i_limit $Fields(i_limit.value)
        }

        set ins "insert into $table (i_idref,i_last,i_start,i_incr,i_limit) \
                        values ('$idref',$i_last,$i_start,$i_incr,$i_limit)"

        set ret [sql run $ins]
        if {$ret <= 0} {
            # return error
            return -code error -errorcode $ret
        }

    } else {
        # fetch record
        set row [sql fetch $c ]
        Debug $row 9
        set ret [sql close $c]

        array set r $row
        set i_last  $r(i_last)
        set i_start  $r(i_start)
        set i_incr  $r(i_incr)
        set i_limit  $r(i_limit)
    }

    # if freshly inserted
    if  { $i_last == 0 } {
        set i_last $i_start
    }

    # check limit
    if { $i_limit > 0 && $i_last >= $i_limit } {
        set i_last $i_start
    }

    # now increment last and update row on the fly
    incr i_last $i_incr

    set upd "update $table set i_last = $i_last \
                    where i_idref = '$idref'"

    set ret [sql run $upd]
    if {$ret <= 0} {
        # return error
        return -code error -errorcode $ret
    }

    # commit if not within transaction !
    if {$DB_lib_priv(tr_stat) == 0 } {
        set ret [sql run "COMMIT WORK"]
    }

    y_db_replication write $i_last

    return $i_last
}

Collapse
Posted by Jonathan Ellis on
I agree with Tom -- in general, I don't think this is a problem.  Certainly the complication involved in "fixing" it globally isn't worth it.
Collapse
Posted by Mark Aufflick on
The conversation seems to have gotten stuck on bugtracker, but Lars' initial message includes another example of forums. I think that is an example of something that is much more of a "permanent URL" than bugs (at least i hope bugs aren't permanent ;)

For bookmarking purposes and shifting to new packages (or servers) simple text urls are important.

Certainly the cr style naming for forums etc. wouldn't go astray.

Collapse
Posted by Andrew Piskorski on
Lars, I see your concerns, and I remember worrying about this issue myself at one point too, but this smells like inventing a problem to me. I say, use object_id, it's simplest; it's also nice that it is guaranteed to be unique across your entire system, and it's nice that it is a sequence number from the DB rather than something ad-hoc sequence like thing you come up with yourself.

If you don't want to break links when dumping and reloading your data, fine, either, ideally, take the trouble to make sure your object_ids don't change - it can certainly be done. Or not quite as good but still ok, install redirects from all the old to the new URLs, like openacs.org did for the 3.x BBoard to 4.5 Forums transition.

I also see no real benefit to having nice small numbers for Bug IDs. As a user of bug and ticket trackers I don't really care. On the other hand, as a programmer and admin the the fact that an object_id (aka ticket number) is unique system-wide across my OpenACS instance gives me warm fuzzies - as does the fact that using single simple sequence like object_id is very robust to programmer errors, etc.

Collapse
Posted by Lars Pind on
Just to clarify, Tom, I wouldn't want to expose package_id's, either, the path of the URL is sufficient for that.

And Andrew: The problem came up when trying to move my personal web site to a fresh install (it currently runs very old code from way before there was even the faintest notion of an upgrade path),. I tried several tricks to move data so the object_id's would stay the same, but failed. It seemed like the number of objects created during install was larger than it used to, so some ID numbers would already be taken. In any event, it's tough.

Don's idea of creating a new sequence per package ID should work, though it's not terribly pretty.

I guess the prettiest solution is Dave's suggestion of creating a unique short string from the summary to use in the URL, though that doesn't solve the "Bug #123" problem.

/Lars

Collapse
Posted by Andrew Grumet on
Christian wrote
In my applications (non ACS, DB independent) I did choose to generate the ID's in one specific procedure for the whole application as not all DB's used do have sequences.
For those of you who are scratching your heads over this one...You cannot do "create sequence foo;" in SQL Server, and maybe others. The SQL Server way is to use something called "autonumber columns". The idea is that the column generates its own values from an internal, otherwise hidden sequence. When inserting, you simply omit the magical column from the list of columns to insert, and it gets automatically filled with the next sequence value. The sequence value is then retrievable from a session variable called @@IDENTITY that works like sequence.currval.

This can complicate things when you want to write code that works with the two types of databases, because the steps are different. The best approach I have seen is to mimic sequences by creating a sequence generator table like so...

create table foo_id_sequence (
  foo_id bigint identity(1,1),
  c char(1)
);
To get new sequence values, you insert a value for c and then query for @@IDENTITY. This can all be hidden inside a multipurpose proc like db_nextval

It still feels a little crufty to me, but is a lot simpler than trying to re-implement sequences or supporting both approaches in the application layer.

Collapse
Posted by Don Baccus on
Note that Cristian's example uses an exclusive table lock on a table that's then shared by object types that need pretty external URLs.

This is not really good in a high-concurrency environment.

Collapse
Posted by Tom Jackson on

Lars,

The internal structure that is exposed is really just the name of the variable 'object_id', or 'message_id', whatever you end up calling it. When you have a million of something, it is going to get pretty ugly pretty fast, so it really doesn't matter. You could nice the url by removing the var name and using a fixed url subdirectory to specify the object number. like:

http://example.com/forums/myforum/123456/one-message
http://example.com/subsite/bugs/3456789/one-bug

The point is, your package_id is exposed in the url, but by way of translation (using a db lookup). This is possible on the scale at which packages are created. It isn't a useful idea on the scale at which objects are created.

OpenACS handles the problem of object identity (and very well). You are trying to substitute something else in a way that will eventually fail. For instance the (package_id, bug_number) is not an identity for the bug. If the package_id is changed, the identity of the bug hasn't changed, just the package the bug is viewed from has changed.

Believe me, I really understand the desire to make urls simple, and rational, and permanent. (For an example check out saleonall.com, which uses the OpenACS3.x ecommerce module to display around 200,000 pages) Using a .vuh handler to do the final translation might be a better solution.

Collapse
Posted by Andrew Piskorski on
So, what is the conclusion here? The "just use the object_id" camp seems to have mustered the stronger arguments, but then, I'm in that camp... Is there sufficient consensus to make an OpenACS policy on this?

One more argument:

Correct me if I'm wrong, but the latest code in Lars's new Bug Tracker package still has a race condition when creating these pretty "expose it in the URL" ids in bt_patch.new. That's both on the head and on the 4.6 branch, both Postgres and Oracle:

select nvl(max(patch_number),0) +1
into   v_patch_number
from   bt_patches
where  project_id = new.project_id;
The code neither locks the bt_patches table nor does anything else fancy (like pre-allocating rows in the style of some room reservation systems), so as far as I can see it's just plain wrong.

I think we can all agree that Lars is one of our best and most skilled developers, yes? So if he got it wrong, what does that say about how likely other developers are to get it wrong?

Therefore, I suggest that if the, "No, we don't want to expose the object_id in URLs, we want something else" camp really wants their way, that a necessary (but not necessarily sufficient) requirement is that is someone first design and implement a general, scaleable, and correct solution that can be used by any package to map object_ids to some or one "pretty" id representation.

Collapse
Posted by Dave Bauer on
I agree with Jeff. I think we should strive for unique, readable names as the last part of a URL. I am not sure how to accomplish this if you decide to use a unqiue number per package instance.
Collapse
Posted by Tom Jackson on

How 'bout we create our own internet unique 'tinyurl' like is available from TinyUrl.com. Of course ours would be free of ads and would be useable from a simple ns_geturl request. I don't see any other way to crack the bug than to have one single place on the internet to generate unique ids.

Maybe you could install djb's libtai and create an attosecond precision timestamp and convert that into a 16 byte packed representation. Maybe uuencode that. Apparently this would handle up to 1x10^18 objects per second.

Maybe you could request a series of numbers in one request.

Actually I guess a simple udp protocol would work great.