Forum OpenACS Development: GUIDs (Globally Unique Identifiers) as Object IDs?

Hi!

Just a quick & dirty though that crossed through my head: Wouldn't it be possible and quite easy to use GUIDs (large random numbers...) as object_ids in OpenACS/]project-open[? To my knowledge there are no parts of OpenACS that rely rely on a sequential order of IDs.

GUIDs would allow us to scale out horizontally, operating one ]po[ instance for each business unit of a large Indian outsourcing company, for example. This way we could consolidate data at night by just importing DB dumps in a single "reporting server". Unique constraint errors would could just be ignored.

Common master data (users, etc..) would take the opposite direction. We would create them on a "master data server" with normal OpenACS object_ids and distribute them to the BU servers. So they would get overwritten at the reporting server, which is OK.

Deletion of objects should not be a problem, as ]po[ has kept with the ACS tradition of not deleting anything (well, we'd need to double check this, but basically this is true...).

Cheers,
Frank

Collapse
Posted by Gustaf Neumann on
here exists many different algorithms for generating unique identifiers (see e.g. [1]). However, one cannot store a UUID in a integer type, so for OpenACS, every datatype keeping object_ids has to be altered, every view has to regenerated, every function has to be dropped and recreated - it is furthermore not clear what this means for Oracle.

Starting with pg9.2, postgres supports UUIDs as datatype, but does not provide a standard function for generating it (there are add-on modules available) [2]. Instead of sequences, one has to use functions to generate IDs. The organization of the files in the CR has to be altered. The impact of UUIDs on OpenACS magic IDs is as well not clear.... Upgrade scripts look like an intellectual challenge.

So, i would not call this a small change, since it will effects at the end every package (data model and the input validation). Don't expect, that this would solve all aspects for horizontal scaling. If you "ignore unique constraints" you customers will complain about data losses.

Using UUIDs has some sex appeal, i thought about it in the past, but it would be a substantial change.

-gn

[1] https://www.postgresql.org/docs/9.6/static/uuid-ossp.html
[2] https://www.postgresql.org/docs/9.6/static/datatype-uuid.html

Collapse
Posted by Benjamin Brink on
My mind looks at everything via mapping right now, and given the old saying

"everything looks like a hammer when all one knows is a hammer",

I wonder:

What are the weaknesses in addressing this with mapping for everything output and input, where internals remain unchanged:

create table uu_site_object_map (
uu_id uuid,
site_id text,
object_id integer
);

-- indexes created for each case

So long as UUIDs are pre-generated (such as via abundant output from a bitcoin mining pool or blockchain) and confirmed unique?

Collapse
Posted by Gustaf Neumann on
If the goal is to ease importing of objects, adding to uuids to acs_objects might be an option.

OTOH, if you know to import data from some satellite" system B or C into A, one can store the highest known object_id of the last sync on A (e.g. B_max, and C_max). When performing an import from e.g. B, map just the IDs higher than B_max (i.e. ignoring deletes). When the goal is a bidirectional sync, guess then there is no other way than having per "satellite" a separate acs_objects table as basis for a sync....

Collapse
Posted by Frank Bergmann on
Hi!

quite easy to use GUIDs (large random numbers...)
> one cannot store a UUID in a integer type

Yes you can. This is not about adherence to standards here, but about a simple solution to scale out horizontally 😊 So for example I could imagine to just use "large random numbers".

Actually, sequential numbers with a "prefix" * 1000000000 offset (with a separate "prefix" for each system) could also be sufficient to create separate "number circuits". So even sequences might continue to work... And this might actually work with normal integers and without the needs for bigints...

Frank

Collapse
Posted by Gustaf Neumann on
The usual standard use 128 bit [1,2] not without reason. if you want to cook up your own convention, don't call it GUID/UUID otherwise you confuse others with terms used in established standards.

If you mean "1000000000" literally, be aware that the maximum value of an integer in PostgreSQL is "2147483647" (roughly twice that value) [3]. Notice that on sites like openacs.org, we had already overruns on 32bit sequences [4].

Nevertheless, for "site-synching" (see above), i see no need for any kind of UUIDs.

-g

[1] https://www.ietf.org/rfc/rfc4122.txt
[2] https://en.wikipedia.org/wiki/Universally_unique_identifier
[3] https://www.postgresql.org/docs/9.6/static/datatype-numeric.html
[4] https://openacs.org/forums/message-view?message_id=4877368

Collapse
Posted by Benjamin Brink on
Frank,
I mentioned the blockchain for obtaining a pool of preprocessed random hashes to be used as UUIDs, but those are apparently hashed at 256bits. And as Gustaf mentions, using standards outside of their definition tends to confuse things and break convention;

Maybe verifiable, unpredictable 128bit UUID format hashes can be generated using a system/satellite salt reference and object_id et cetera with SHA-3/ It apparently has little risk of collisions and relatively low processor demands, and the bit length can be configured to fit UUID.
cheers,
Ben