Forum OpenACS Development: Categorization
Since this is a request for comments, I thought it would be more appropriate to move the discussion to the OpenACS Design forum. I have written a *minimal* datamodel for categorization. It started as a fun project while chatting at #openacs, but I think it could evolve into a complete package. The datamodel consists of two tables:
- categories -- The hierarchy of categories. Each category can only have one parent. The categories table stores the name of a category and its parent.
- object_category_map -- Membership of objects in categories. An object can be a member of several categories, but not in the same tree branch, i.e. you cannot have an object belong to a specialized category (e.g. Developers) and at the same time belong to a more general category (e.g. Persons), since we can already infer that using the tree_sortkey. The object_category_map table stores the object_id, the object_type and the container/category.
- Minimal Datamodel
- Extension that supports relationships between categories
- API that provides services to other packages
For the tree-like structures, we use the tree_sortkey mechanism (slightly modified) in the same way it is used in other packages like acs-kernel, etc.
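To illustrate why a sort key makes tree queries cheap, here is a minimal Python sketch. It is not the OpenACS implementation: the real tree_sortkey is a varbit value maintained by database triggers, while this sketch uses fixed-width strings, and the class and method names (`CategoryTree`, `add`, `is_ancestor`, `subtree`) are made up for the example. The idea is only that a child's key starts with its parent's key, so ancestry becomes a prefix test.

```python
# Illustrative stand-in for tree_sortkey: each category's key is its parent's
# key plus a fixed-width sibling counter, so a subtree shares a common prefix.
# Assumption: strings replace the varbit keys used by the real mechanism.

class CategoryTree:
    def __init__(self):
        self.sortkey = {0: ""}       # category_id -> key; 0 is the root
        self.child_count = {0: 0}    # category_id -> children created so far

    def add(self, category_id, parent_id=0):
        # Parent key + next sibling number. The counter only grows; this
        # sketch does not handle deleting or moving categories.
        self.child_count[parent_id] += 1
        n = self.child_count[parent_id]
        self.sortkey[category_id] = self.sortkey[parent_id] + f"{n:04d}"
        self.child_count[category_id] = 0

    def is_ancestor(self, a, b):
        """True if category a is a proper ancestor of category b."""
        ka, kb = self.sortkey[a], self.sortkey[b]
        return ka != kb and kb.startswith(ka)

    def subtree(self, category_id):
        """All categories at or below category_id, in tree order."""
        prefix = self.sortkey[category_id]
        ids = [c for c, k in self.sortkey.items() if k.startswith(prefix)]
        return sorted(ids, key=lambda c: self.sortkey[c])


tree = CategoryTree()
tree.add(1)        # Persons
tree.add(2, 1)     # Developers, child of Persons
tree.add(3)        # Science
print(tree.is_ancestor(1, 2))   # Persons is above Developers
print(tree.subtree(1))          # Persons and everything below it
```

With keys like these, "is Developers below Persons?" needs no recursive walk up the parent chain, which is the property the datamodel relies on.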
We would appreciate any help, comments and/or suggestions. Please take a minute and give us some feedback.
Is this essential? I have some categories that would really be most appropriately assigned to two parents. For example, www.labarchive.net (still running 3.2.5) uses categories to classify uploaded experiments. Sometimes having two parents for a category would make more sense. For example, a biochemistry category might most appropriately be listed as a child of BOTH chemistry and biology. Similarly for other interdisciplinary children, joint ventures, etc.
Re-organizing categories seems to work within the proposal for anything I can think of right now. In your example, having Biology and Chemistry as children of Science (and Science in turn as a child of Everything) allows Biochemistry to be a member of both Biology and Chemistry without breaking the requirement that memberships come from different tree branches. Using the categories on your website, one could have Engineering (application of science) as a child of Science or Everything. A Science Fair Biochemistry Project category might share membership with Engineering, Biology and Chemistry, again without breaking the requirements.
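The "different branches" rule can be sketched in a few lines of Python. This is only an illustration under assumptions: the dictionary `parent_of` and the helpers `ancestors` and `conflicting_category` are hypothetical names, not part of the posted datamodel, and the check walks parent pointers instead of comparing tree_sortkey values.

```python
# Sketch of the membership rule: an item may sit in several categories,
# but never in a category and one of its ancestors at the same time.

def ancestors(parent_of, category):
    """Return the set of proper ancestors of a category."""
    seen = set()
    current = parent_of.get(category)
    # The 'seen' check also stops at a root that is listed as its own parent.
    while current is not None and current not in seen:
        seen.add(current)
        current = parent_of.get(current)
    return seen


def conflicting_category(parent_of, existing, new_category):
    """Return an existing membership on the same branch as new_category, or None."""
    new_ancestors = ancestors(parent_of, new_category)
    for cat in existing:
        if cat == new_category or cat in new_ancestors:
            return cat
        if new_category in ancestors(parent_of, cat):
            return cat
    return None


parent_of = {
    "Everything": None,
    "Science": "Everything",
    "Biology": "Science",
    "Chemistry": "Science",
    "Engineering": "Everything",
}

# Biology plus Chemistry is fine: they are siblings on different branches.
print(conflicting_category(parent_of, ["Biology"], "Chemistry"))
# Biology plus Science is rejected: Science is already implied by Biology.
print(conflicting_category(parent_of, ["Biology"], "Science"))
```

A check like this could run before a row is added to the membership table, which is one possible answer to the "categorization maze" concern.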
Neophytos Demetriou, would there/could there be a mechanism to check for overlapping category membership (for people like me who tend to get into categorization mazes), or does addressing this become part of OACS internal configurations and administrating level, and therefore a much lower development priority?
Yes, the tree-like structure will be great. The single-parent or multi-parent issue must be handled by the package that uses it. I guess each package will have its own behaviour. Some packages may need the restriction of a single parent. Some may need multiple parents, or categories residing in several nodes. I'm not really sure how to implement this stuff, so I have to toy around with it... maybe later, kinda sleepy right now... hehehe.
At dmoz, biochemistry is *listed* both as a subcategory of biology and chemistry. However, the entry under chemistry is only a link to biology/biochemistry. In the same way, our package shall maintain only one parent for each category and model other associations using relationships.
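Here is a rough Python sketch of that dmoz-style arrangement: every category keeps exactly one parent, and a second "home" is modelled as a link. The `Category` class, its `links` field, `add_child` and `listing` are illustrative names assumed for this example only; the "@" marker imitates how dmoz flags linked entries.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Category:
    name: str
    parent: "Category | None" = None
    children: list = field(default_factory=list)
    # Hypothetical relationship: cross-references to categories that live
    # under a different parent.
    links: list = field(default_factory=list)

    def path(self):
        """Full path from the root, following the single parent chain."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))


def add_child(parent, name):
    child = Category(name, parent)
    parent.children.append(child)
    return child


def listing(category):
    """Entries shown under a category: real children first, then links."""
    own = [c.name for c in category.children]
    linked = [f"{c.name}@ -> {c.path()}" for c in category.links]
    return own + linked


science = Category("Science")
biology = add_child(science, "Biology")
chemistry = add_child(science, "Chemistry")
biochemistry = add_child(biology, "Biochemistry")
chemistry.links.append(biochemistry)   # listed under Chemistry, owned by Biology

print(listing(chemistry))
```

Biochemistry shows up in both listings, but there is still only one place where it really lives, so the tree stays a tree.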
" Neophytos Demetriou, would there/could there be a mechanism to check for overlapping category membership (for people like me who tend to get into categorization mazes), or does addressing this become part of OACS internal configurations and administrating level, and therefore a much lower development priority?"
Yes, I suppose. I want the categorization package to be able to maintain dmoz categories.
I'm not working full-time on this one, so I cannot promise a release date (it will be as soon as possible). However, I want to keep this thread active for status updates and feedback. For those of you who followed the discussion about the tree_sortkey mechanism and null values in another thread: I had some private email exchanges with Don and we both agree that it would be better to use the current tree_sortkey mechanism for consistency with the rest of the OpenACS packages.
(See: content-create.sql, content-keywords.sql under packages/acs-content-repository/sql/*)
Maybe there is nothing wrong with the existing one... it's just that I did not know about it, and Neophytos did not know about it either. Thanks for the great info.
We should check out the existing one first before making a new one.
I volunteer to help out with the categorization service. I developed a categorization service at aD - see this requirements document. What is the status of the categorization service project? Where can I access the source code?
You should probably e-mail him directly to see if he has any time to contribute ideas, etc, but my suspicion is that you're probably free to pick this up if you're interested.
We also have a recent addition to the community, Dean Des Rosier, who, though new to OpenACS 4, has an extensive background building web-based knowledge-management systems. He's working on a "general ratings" package as a first step towards understanding the toolkit and providing KM-ish facilities in it.
You might contact him as well, since I know he's interested in categorization, too.
    ------------------
    -- OBJECT TYPES --
    ------------------

    select acs_object_type__create_type (
        'category',     -- object_type
        'Category',     -- pretty_name
        'Categories',   -- pretty_plural
        'acs_object',   -- supertype
        'categories',   -- table_name
        'category_id',  -- id_column
        null,           -- package_name
        'f',            -- abstract_p
        null,           -- type_extension_table
        null            -- name_method
    );

    ------------
    -- TABLES --
    ------------

    create table categories (
        category_id     integer
                        constraint categories_category_id_fk
                        references acs_objects(object_id) on delete cascade
                        constraint categories_category_id_pk
                        primary key,
        pretty_name     varchar(200)
                        constraint categories_pretty_name_nn
                        not null,
        parent_id       integer default 0,
        tree_sortkey    varbit,
        constraint categories_category_object_un
        unique (parent_id, category_id)
    );

    create table object_category_map (
        object_id       integer
                        constraint ocm_object_id_nn
                        not null
                        constraint ocm_object_id_fk
                        references acs_objects(object_id) on delete cascade,
        object_type     varchar(100)
                        constraint ocm_object_type_nn
                        not null
                        constraint ocm_object_type_fk
                        references acs_object_types(object_type) on delete cascade,
        category_id     integer
                        constraint ocm_category_id_nn
                        not null
                        constraint ocm_category_id_fk
                        references categories(category_id) on delete cascade
    );

    -------------------------------
    -- One Root to Bind Them All --
    -------------------------------

    -- The root category is its own parent.
    insert into categories (
        category_id, pretty_name, parent_id, tree_sortkey
    ) values (
        0, 'One Root to Bind Them All', 0, int_to_tree_key(0)
    );

    -----------------------------------------------
    -- Add Foreign Key Constraint on 'parent_id' --
    -----------------------------------------------

    alter table categories
        add constraint categories_parent_id_fk
        foreign key (parent_id) references categories (category_id);

    -------------
    -- INDICES --
    -------------

    create index cat_parent_category_idx on categories (parent_id, category_id);
    create index cat_tree_sortkey_idx on categories (tree_sortkey);

    --------------
    -- TRIGGERS --
    --------------

    -- On insert, give the new category the next free key under its parent.
    create function categories_insert_tr () returns opaque as '
    declare
        v_parent_sk     varbit default null;
        v_max_value     integer;
    begin
        select max(tree_leaf_key_to_int(tree_sortkey)) into v_max_value
          from categories
         where parent_id = new.parent_id;

        select tree_sortkey into v_parent_sk
          from categories
         where category_id = new.parent_id;

        new.tree_sortkey := tree_next_key(v_parent_sk, v_max_value);

        return new;
    end;' language 'plpgsql';

    create trigger categories_insert_tr before insert
    on categories for each row
    execute procedure categories_insert_tr ();

    -- On update, rebuild the keys of the moved category and its whole subtree.
    create function categories_update_tr () returns opaque as '
    declare
        v_parent_sk     varbit default null;
        v_max_value     integer;
        v_parent_id     integer;
        v_rec           record;
        clr_keys_p      boolean default ''t'';
    begin
        -- Nothing to do unless the category was moved or renumbered.
        if (new.category_id = old.category_id) and (new.parent_id = old.parent_id) then
            return new;
        end if;

        for v_rec in select category_id
                       from categories
                      where tree_sortkey between new.tree_sortkey
                                             and tree_right(new.tree_sortkey)
                      order by tree_sortkey
        loop
            -- Clear the old keys of the subtree once, on the first pass.
            if clr_keys_p then
                update categories
                   set tree_sortkey = null
                 where tree_sortkey between new.tree_sortkey
                                        and tree_right(new.tree_sortkey);
                clr_keys_p := ''f'';
            end if;

            select parent_id into v_parent_id
              from categories
             where category_id = v_rec.category_id;

            select max(tree_leaf_key_to_int(tree_sortkey)) into v_max_value
              from categories
             where parent_id = v_parent_id;

            select tree_sortkey into v_parent_sk
              from categories
             where category_id = v_parent_id;

            update categories
               set tree_sortkey = tree_next_key(v_parent_sk, v_max_value)
             where category_id = v_rec.category_id;
        end loop;

        return new;
    end;' language 'plpgsql';

    create trigger categories_update_tr after update
    on categories for each row
    execute procedure categories_update_tr ();

    --------------
    -- Packages --
    --------------

    create function category__new(varchar,integer) returns integer as '
    declare
        p_pretty_name   alias for $1;
        p_parent_id     alias for $2;
        v_category_id   integer;
    begin
        v_category_id := acs_object__new(
            null,
            ''category'',
            now(),
            null,
            null,
            null
        );

        insert into categories (
            category_id, pretty_name, parent_id
        ) values (
            v_category_id, p_pretty_name, p_parent_id
        );

        return v_category_id;
    end;' language 'plpgsql';

    create function category__get_name(integer) returns varchar as '
    declare
        p_category_id   alias for $1;
        v_pretty_name   varchar;
    begin
        select pretty_name into v_pretty_name
          from categories
         where category_id = p_category_id;

        return v_pretty_name;
    end;' language 'plpgsql';

    create function category__delete(integer) returns integer as '
    declare
        p_category_id   alias for $1;
    begin
        delete from categories where category_id = p_category_id;
        return 0;
    end;' language 'plpgsql';

    create function object_category_map__new(integer,varchar,integer) returns integer as '
    declare
        p_object_id     alias for $1;
        p_object_type   alias for $2;
        p_category_id   alias for $3;
    begin
        insert into object_category_map (
            object_id, object_type, category_id
        ) values (
            p_object_id, p_object_type, p_category_id
        );

        return 0;
    end;' language 'plpgsql';

    -- Two-argument variant: looks up the object type from acs_objects.
    create function object_category_map__new(integer,integer) returns integer as '
    declare
        p_object_id     alias for $1;
        p_category_id   alias for $2;
        v_object_type   varchar;
    begin
        select object_type into v_object_type
          from acs_objects
         where object_id = p_object_id;

        perform object_category_map__new(p_object_id, v_object_type, p_category_id);

        return 0;
    end;' language 'plpgsql';

    create function object_category_map__delete(integer,integer) returns integer as '
    declare
        p_object_id     alias for $1;
        p_category_id   alias for $2;
    begin
        delete from object_category_map
         where object_id = p_object_id
           and category_id = p_category_id;

        return 0;
    end;' language 'plpgsql';
I don't know what happened to the generic stuff. But if it's about City, which is going to be CR, then we can use the CR categorization.
I do not have time to look at it very closely right now, but this is something that I would definitely be interested in and have been thinking about.
If we are able to support existing classification schemes with a relatively simple solution, we would have the foundations for AMAZING knowledge and sharing tools. Supporting schemes like the Dewey Decimal System, Medical Subject Headings, the Art and Architecture Thesaurus, DMOZ etc. would astound.
The National Library of Medicine's (NLM's) Medical Subject Headings (MeSH) is used as an authoritative convention for resource description/categorization in medicine. Important online medical databases such as Medline (which you can search using PubMed) use it. It is also widely used in the teaching and study of medicine, veterinary medicine and dentistry. Check out the MeSH Browser to get a better idea of what it is. If a general categorization package went in a direction that would make it easy to support MeSH or UMLS* based classification, I think I could easily find money in 2003 to make it happen (it would also make it easy to have every medical school using dotLRN in short order!).
Anyone have time to look at this a little more closely?
Dave, what kind of timeframe are you thinking about?
*The Unified Medical Language System (UMLS) is a related NLM project used to assist in online cataloguing, which maps diverse medical terminologies to single preferred concepts.
I am just going to work on this in my free time, no client is driving it. I just know it is an important site-wide tool and I really want to see it done right.
It looks like those categories are all tree structures and should be able to be modeled with the system Neophytos proposed.
Hopefully we can develop a flexible data model that can start out with a simple user interface and could expand to meet more of the requirements in Peter's package.
An important key is to allow all packages to tie into the system so that data and pages can be found by category. This information could also be used to enhance search.
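As a rough sketch of "find by category", the snippet below loads the two tables from the posted datamodel into an in-memory SQLite database and collects every object mapped to a category or to any of its descendants. Assumptions are flagged loudly: SQLite stands in for PostgreSQL, a recursive query over parent_id stands in for the tree_sortkey range query, and the function names, table contents and object ids are invented for the example.

```python
import sqlite3


def demo_db():
    """Build a tiny in-memory copy of the two tables (sample data is made up)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        create table categories (
            category_id integer primary key,
            pretty_name text not null,
            parent_id   integer default 0
        );
        create table object_category_map (
            object_id   integer not null,
            object_type text not null,
            category_id integer not null references categories(category_id)
        );
        insert into categories values (0, 'Root', 0);   -- root is its own parent
        insert into categories values (1, 'Science', 0);
        insert into categories values (2, 'Biology', 1);
        insert into categories values (3, 'Chemistry', 1);
        insert into categories values (4, 'Engineering', 0);
        insert into object_category_map values (100, 'experiment', 2);
        insert into object_category_map values (100, 'experiment', 3);
        insert into object_category_map values (101, 'experiment', 3);
        insert into object_category_map values (102, 'experiment', 4);
    """)
    return conn


def objects_in_category(conn, category_id):
    """Object ids mapped to the category or to anything below it."""
    rows = conn.execute(
        """
        with recursive subtree(category_id) as (
            select ?
            union   -- plain UNION drops repeats, so the self-parented root terminates
            select c.category_id
              from categories c
              join subtree s on c.parent_id = s.category_id
        )
        select distinct m.object_id
          from object_category_map m
          join subtree s on m.category_id = s.category_id
         order by m.object_id
        """,
        (category_id,),
    ).fetchall()
    return [r[0] for r in rows]


conn = demo_db()
print(objects_in_category(conn, 1))   # everything filed under Science
```

Any package could call something of this shape with its own object types, and a search package could use the same lookup to narrow results to a branch of the tree.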
I'm really glad you are picking up on this. I have released the code I developed at ArsDigita, see
...Now enough theory - have to see if any of this actually works...