Forum OpenACS Development: Categorization

Posted by Neophytos Demetriou on
This is a follow-up to the discussion about a new categorization package. The discussion started here: https://openacs.org/bboard/q-and-a-fetch-msg.tcl?msg_id=0003v1&topic_id=11&topic=OpenACS

Since this is a request for comments, I thought it would be more appropriate to move the discussion to the OpenACS Design forum. I have written a *minimal* datamodel for categorization. This one started as a fun project while chatting at #openacs, but I think it could evolve into a complete package. The datamodel consists of two tables:

  • categories -- The hierarchy of categories. Each category can only have one parent. The categories table stores the name of a category and its parent.
  • object_category_map -- Membership of objects to categories. An object can be a member of several categories but not in the same tree branch, i.e. you cannot have an object belong to a specialized category (e.g. Developers) and at the same time belong to a more general category (e.g. Persons) since we can infer that already using the tree_sortkey. The object_category_map stores information about the object_id, object_type and the container/category.
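
The "not in the same tree branch" rule can be illustrated with a small sketch. This is only an illustration in Python, not part of the proposed package; the `PARENT` dictionary and the category names other than Persons and Developers are hypothetical stand-ins for rows of the categories table.

```python
# Hypothetical in-memory stand-in for the categories table: category -> parent.
# The root is its own parent, mirroring a root row whose parent_id points at itself.
PARENT = {
    "Everything": "Everything",
    "Persons": "Everything",
    "Developers": "Persons",
    "Products": "Everything",
}


def ancestors(category):
    """Return the strict ancestors of a category, nearest first."""
    chain = []
    while PARENT[category] != category:
        category = PARENT[category]
        chain.append(category)
    return chain


def same_branch(a, b):
    """True if a and b are equal or one is an ancestor of the other."""
    return a == b or a in ancestors(b) or b in ancestors(a)


def can_add(existing, new_category):
    """An object may join new_category only if none of its existing
    memberships lies on the same branch."""
    return not any(same_branch(new_category, c) for c in existing)
```

With this sketch, an object already in Developers cannot also be added to Persons, because Persons is an ancestor of Developers, but it can be added to the sibling branch Products.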
As soon as we resolve issues with this minimal datamodel we are going to model relationships between categories (siblings). Next, we'll need an API for using the categorization package services from other packages. For example, each forum shall maintain a table of "forum_categories" that references the main categories table. Each category may contain objects of different types, e.g. bboard postings, news articles, ecommerce products and so on. Finally, we shall need a mechanism to associate keywords with categories, which will probably make use of the search package.

Implementation schedule:

  1. Minimal Datamodel
  2. Extension that supports relationships between categories
  3. API that provides services to other packages
  4. Keywords

For the tree-like structures, we use the tree_sortkey mechanism (slightly modified) in the same way it is used in other packages like acs-kernel, etc.
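
For readers unfamiliar with the idea, here is a simplified sketch of how a sort key can encode a position in a tree. It is an assumption-laden illustration in Python: it uses fixed-width decimal strings, whereas OpenACS stores tree_sortkey as varbit, and the function names `next_key`, `leaf_number` and `in_subtree` are hypothetical.

```python
WIDTH = 4  # hypothetical fixed width per tree level; OpenACS itself uses varbit keys


def next_key(parent_key, max_child):
    """Key for a new child: the parent's key followed by the next free child number.

    max_child is the highest child number already used under the parent,
    or None if the parent has no children yet.
    """
    number = 0 if max_child is None else max_child + 1
    return parent_key + f"{number:0{WIDTH}d}"


def leaf_number(key):
    """Child number stored in the last level of a key."""
    return int(key[-WIDTH:])


def in_subtree(key, root_key):
    """A node lies in the subtree of root_key when root_key is a prefix of its key."""
    return key.startswith(root_key)


root = next_key("", None)
persons = next_key(root, None)
products = next_key(root, leaf_number(persons))
developers = next_key(persons, None)
```

Because every descendant's key starts with its ancestor's key, sorting by key yields a depth-first listing of the tree, and a subtree can be selected with a single range or prefix comparison instead of a recursive query.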

We would appreciate any help, comments and/or suggestions. Please take a minute and give us some feedback.

2: Response to Categorization (response to 1)
Posted by Cathy Sarisky on
" Each category can only have one parent. "

Is this essential? I have some categories that would really be most appropriately assigned to two parents. For example, www.labarchive.net (still running 3.2.5) uses categories to classify uploaded experiments. Sometimes having two parents for a category would make more sense. For example, a biochemistry category might most appropriately be listed as a child of BOTH chemistry and biology. Similarly for other interdisciplinary children, joint ventures, etc.

3: Response to Categorization (response to 1)
Posted by Torben Brosten on
Cathy Sarisky, I was thinking about the category cross-over problem you mention, also, because any attempt to restrict categorization organization tends to cause points of stress in a system, which shows itself in other ways. The 1 parent rule has important performance implications in object-modeling, so it would be best to keep it if we can.

Re-organizing categories seems to work within the proposal for anything I can think of right now. In your example, having Biology and Chemistry as children of Science (and furthermore Science as a child of Everything) allows Biochemistry to be a member of Biology and Chemistry --without breaking the requirement that memberships come from different tree branches. Using the categories on your website, one could have Engineering (application of science) as a child of Science or Everything. A Science Fair Biochemistry Project category might share membership with Engineering, Biology and Chemistry --again, not breaking the requirements.

Neophytos Demetriou, would there/could there be a mechanism to check for overlapping category membership (for people like me who tend to get into categorization mazes), or does addressing this become part of OACS internal configurations and administrating level, and therefore a much lower development priority?

4: Response to Categorization (response to 1)
Posted by Jun Yamog on
Hi Neophytos,

Yes, the tree-like structure will be great. The single-parent or multi-parent issue must be handled by the package that will use it. I guess each package will have its own behaviour. Some packages may need the behaviour or restriction of a single parent. Some may need multiple parents or residing in several nodes. Not really sure how to implement this stuff so I have to toy around with this... maybe later, kinda sleepy right now... hehehe.

5: Response to Categorization (response to 1)
Posted by Neophytos Demetriou on
"Sometimes having two parents for a category would make more sense. For example, a biochemistry category might most appropriately be listed as a child of BOTH chemistry and biology."

At dmoz, biochemistry is *listed* as a subcategory of both biology and chemistry. However, the entry under chemistry is only a link to biology/biochemistry. In the same way, our package shall maintain only one parent for each category and model other associations using relationships.
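
A minimal sketch of that idea, written in Python purely for illustration: each category keeps exactly one parent, and extra listings are stored separately as links. The `PARENT` and `LINKS` structures, the function names and the "@" marker on linked entries are all assumptions of this sketch, not part of the proposed datamodel.

```python
# Hypothetical data: each category has exactly one parent (None for the root)...
PARENT = {
    "Science": None,
    "Biology": "Science",
    "Chemistry": "Science",
    "Biochemistry": "Biology",
}
# ...and additional listings are stored as (listed_under, target) relationships.
LINKS = [("Chemistry", "Biochemistry")]


def path(category):
    """Canonical path of a category, built by following its single parent."""
    parts = []
    while category is not None:
        parts.append(category)
        category = PARENT[category]
    return "/".join(reversed(parts))


def listing(category):
    """Entries shown under a category: real children first, then links.

    Each entry is (label, canonical path); linked entries get an "@" suffix.
    """
    children = sorted(c for c, p in PARENT.items() if p == category)
    linked = sorted(target for source, target in LINKS if source == category)
    return [(c, path(c)) for c in children] + [(t + "@", path(t)) for t in linked]
```

Here Biochemistry appears in the listing of both Biology and Chemistry, but both entries resolve to the same canonical path under Biology, so the tree itself stays single-parent.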

" Neophytos Demetriou, would there/could there be a mechanism to check for overlapping category membership (for people like me who tend to get into categorization mazes), or does addressing this become part of OACS internal configurations and administrating level, and therefore a much lower development priority?"

Yes, I suppose. I want the categorization package to be able to maintain dmoz categories.

I'm not working full-time on this one, so I cannot promise a release date (as soon as possible). However, I want to keep this thread active for status updates and feedback. For those of you who followed the discussion about the tree_sortkey mechanism and null values in another thread: I had some private email exchanges with Don and we both agree that it would be better to use the current tree_sortkey mechanism for consistency with the rest of the OpenACS packages.

6: Response to Categorization (response to 1)
Posted by Stephen . on
What is wrong with the existing implementation of keywords/categories? As far as I can see it supports everything you propose to build.

(See: content-create.sql, content-keywords.sql under packages/acs-content-repository/sql/*)

7: Response to Categorization (response to 1)
Posted by Jun Yamog on
Hmmm....

Maybe there is nothing wrong with the existing one... it's just that I did not know about it and maybe Neophytos did not know about it either. Thanks for the great info.

We should check out the existing one first before making a new one.

8: Response to Categorization (response to 1)
Posted by Peter Alberer on
I think a problem with the existing cat-system is that it is content-repository only. Not every application needs the content-repository.
9: Response to Categorization (response to 1)
Posted by Don Baccus on
Yes, that's the point exactly.  We probably just can't rip out the existing one because the CMS probably uses it, but investigating this is something Neophytos should add to his "todo" list.
Posted by Peter Marklund on

Neophytos,
I volunteer to help out with the categorization service. I developed a categorization service at aD - see this requirements document. What is the status of the categorization service project? Where can I access the source code?

Thanks!

Posted by Don Baccus on
Neophytos has become busy with a variety of things and at the moment isn't doing active OpenACS 4.5 work per se. But we're in contact and he hopes to find time to get involved again.

You should probably e-mail him directly to see if he has any time to contribute ideas, etc, but my suspicion is that you're probably free to pick this up if you're interested.

We also have a recent addition to the community, Dean Des Rosier, who, though new to OpenACS 4, has an extensive background building web-based knowledge-management systems. He's working on a "general ratings" package as a first step towards understanding the toolkit and providing KM-ish facilities in it.

You might contact him as well, since I know he's interested in categorization, too.

Posted by Dave Bauer on
Here is the initial data model that Neophytos worked out from our conversations on IRC. It is just a first draft, but maybe there is something here to work with. The APIs still need to be worked on and I think that is very important to work out the operations it will need to perform. Let's use this to continue this discussion. Dave
------------------
-- OBJECT TYPES --
------------------

select acs_object_type__create_type (
    'category',                  -- object_type
    'Category',                  -- pretty_name
    'Categories',                -- pretty_plural
    'acs_object',                -- supertype 
    'categories',                -- table_name
    'category_id',               -- id_column
    null,                        -- package_name
    'f',                         -- abstract_p
    null,                        -- type_extension_table
    null                         -- name_method
);



------------
-- TABLES --
------------

create table categories (
  category_id		integer
			constraint categories_category_id_fk
			references acs_objects(object_id)
			on delete cascade
			constraint categories_category_id_pk
			primary key,
  pretty_name		varchar(200)
			constraint categories_pretty_name_nn
			not null,
  parent_id		integer default 0,
  tree_sortkey		varbit,
  constraint categories_category_object_un
  unique (parent_id, category_id)
);


create table object_category_map (
  object_id		integer
			constraint ocm_object_id_nn
			not null
			constraint ocm_object_id_fk
			references acs_objects(object_id)
			on delete cascade,
  object_type		varchar(100)
			constraint ocm_object_type_nn
			not null
			constraint ocm_object_type_fk
			references acs_object_types(object_type)
			on delete cascade,
  category_id		integer
			constraint ocm_category_id_nn
			not null
			constraint ocm_category_id_fk
			references categories(category_id)
			on delete cascade
);


-------------------------------
-- One Root to Bind Them All --
-------------------------------
insert into categories (
    category_id,
    pretty_name,
    parent_id,
    tree_sortkey
) values (
    0,
    'One Root to Bind Them All',
    0,
    int_to_tree_key(0)
);


------------------------------------------------
-- Add Foreign Key Constraint on 'parent_id' --
------------------------------------------------
alter table categories add constraint categories_parent_id_fk 
foreign key (parent_id) references categories (category_id);


-------------
-- INDICES --
-------------

create index cat_parent_category_idx on categories (parent_id, category_id);
create index cat_tree_sortkey_idx on categories (tree_sortkey);


--------------
-- TRIGGERS --
--------------


create function categories_insert_tr () returns opaque as '
declare
        v_parent_sk     varbit default null;
        v_max_value     integer;
begin
        select max(tree_leaf_key_to_int(tree_sortkey)) into v_max_value 
        from categories
        where parent_id = new.parent_id;

        select tree_sortkey into v_parent_sk 
        from categories
        where category_id = new.parent_id;

        new.tree_sortkey := tree_next_key(v_parent_sk, v_max_value);

        return new;

end;' language 'plpgsql';


create trigger categories_insert_tr before insert on categories
for each row execute procedure categories_insert_tr ();


create function categories_update_tr () returns opaque as '
declare
        v_parent_sk     varbit default null;
        v_max_value     integer;
        v_parent_id     integer;
        v_rec           record;
        clr_keys_p      boolean default ''t'';
begin
        if (new.category_id = old.category_id) and 
           (new.parent_id = old.parent_id) then

           return new;

        end if;

        for v_rec in select category_id
                     from categories
                     where tree_sortkey between new.tree_sortkey and tree_right(new.tree_sortkey)
                     order by tree_sortkey
        loop

            if clr_keys_p then
               update categories set tree_sortkey = null
               where tree_sortkey between new.tree_sortkey and tree_right(new.tree_sortkey);
               clr_keys_p := ''f'';
            end if;

            select parent_id into v_parent_id
            from categories 
            where category_id = v_rec.category_id;

            select max(tree_leaf_key_to_int(tree_sortkey)) into v_max_value
            from categories
            where parent_id = v_parent_id;

            select tree_sortkey into v_parent_sk 
            from categories
            where category_id = v_parent_id;

            update categories
            set tree_sortkey = tree_next_key(v_parent_sk, v_max_value)
            where category_id = v_rec.category_id;

        end loop;

        return new;

end;' language 'plpgsql';

create trigger categories_update_tr after update on categories 
for each row execute procedure categories_update_tr ();

--------------
-- Packages --
--------------

create function category__new(varchar,integer)
returns integer as '
declare
    p_pretty_name               alias for $1;
    p_parent_id			alias for $2;
    v_category_id               integer;
begin

    v_category_id := acs_object__new(
			null,
			''category'',
			now(),
			null,
			null,
			null
		      );

    insert into categories (
        category_id,
        pretty_name,
        parent_id
    ) values (
        v_category_id,
        p_pretty_name,
        p_parent_id
    );

    return v_category_id;

end;' language 'plpgsql';


create function category__get_name(integer)
returns varchar as '
declare
    p_category_id               alias for $1;
    v_pretty_name               varchar;
begin

    select pretty_name into v_pretty_name
    from categories
    where category_id = p_category_id;

    return v_pretty_name;

end;' language 'plpgsql';


create function category__delete(integer)
returns integer as '
declare
    p_category_id               alias for $1;
begin

    delete from categories
    where category_id = p_category_id;

    return 0;

end;' language 'plpgsql';


create function object_category_map__new(integer,varchar,integer)
returns integer as '
declare
    p_object_id		alias for $1;
    p_object_type	alias for $2;
    p_category_id	alias for $3;
begin

    insert into object_category_map (
        object_id,
	object_type,
	category_id
    ) values (
        p_object_id,
        p_object_type,
        p_category_id
    );

    return 0;

end;' language 'plpgsql';


create function object_category_map__new(integer,integer)
returns integer as '
declare
    p_object_id		alias for $1;
    p_category_id	alias for $2;
    v_object_type	varchar;
begin

    -- Look up the type of the object being categorized.
    select object_type into v_object_type
    from acs_objects
    where object_id = p_object_id;

    perform object_category_map__new(p_object_id,v_object_type,p_category_id);

    return 0;

end;' language 'plpgsql';


create function object_category_map__delete(integer,integer)
returns integer as '
declare
    p_object_id		alias for $1;
    p_category_id	alias for $2;
begin

    delete from object_category_map
    where object_id = p_object_id
    and category_id = p_category_id;

    return 0;

end;' language 'plpgsql';
Posted by Lars Pind on
Thought this article would be interesting to those working on categorization: http://firstmonday.org/issues/issue7_7/bates/: Getting Web Information Retrieval Right This Time.
Posted by Deds Castillo on
Whatever happened to this?  Did it ever get off the ground?  I'm in need of categorization and I'm avoiding rewriting of stuff.
Posted by Jun Yamog on
Hi Deds,

Don't know what happened to the generic stuff. But if it's about City, which is going to be CR, then we can use the CR categorization.

16: Re: Categorization (response to 1)
Posted by Dave Bauer on
Is anyone still interested in working on categorization? I would like to get a basic service working soon, and explore more of the ideas in Peter's categorization package.
17: Re: Categorization (response to 1)
Posted by Carl Robert Blesius on

I do not have time to look at it very closely right now, but this is something that I would definitely be interested in and have been thinking about.

If we are able to support existing classification schemes with a relatively simple solution, we would have the foundations for AMAZING knowledge and sharing tools. Supporting schemes like the Dewey Decimal System, Medical Subject Headings, the Art and Architecture Thesaurus, DMOZ etc. would astound.

The National Library of Medicine's (NLM's) Medical Subject Headings (MeSH) is used as an authoritative convention for resource description/categorization in medicine. Important online medical databases such as Medline (which you can search using PubMed) use it. It is also widely used in the teaching and study of medical, veterinary and dental fields. Check out the MeSH Browser to get a better idea of what it is. If a general categorization package went in a direction that made it easy to support MeSH or UMLS* based classification, I think I could easily find money in 2003 to make it happen (it would also make it easy to have every medical school using dotLRN in short order! 😉)

Anyone have time to look at this a little more closely?

Dave, what kind of timeframe are you thinking about?

*The Unified Medical Language System (UMLS) is a related NLM project used to assist in online cataloguing, which maps diverse medical terminologies to single preferred concepts.

18: Re: Categorization (response to 1)
Posted by Dave Bauer on
Carl,

I am just going to work on this in my free time, no client is driving it. I just know it is an important site-wide tool and I really want to see it done right.

It looks like those categories are all tree structures and should be able to be modeled with the system Neophytos proposed.

Hopefully we can develop a flexible data model that can start out with a simple user interface and could expand to meet more of the requirements in Peter's package.

An important key is to allow all packages to tie into the system so that data and pages can be found by category. This information could also be used to enhance search.
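
As a rough sketch of what "found by category" could look like, the following Python example uses an in-memory SQLite database rather than OpenACS on PostgreSQL. It keeps the categories and object_category_map tables from the draft datamodel but drops the foreign keys to acs_objects and replaces the varbit tree_sortkey with a text key; the sample rows, the object types and the `objects_in` function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table categories (
        category_id  integer primary key,
        pretty_name  text not null,
        tree_sortkey text not null  -- text stand-in for the varbit key
    );
    create table object_category_map (
        object_id   integer not null,
        object_type text not null,
        category_id integer not null references categories(category_id)
    );
    insert into categories values
        (1, 'Science',      '0000'),
        (2, 'Biology',      '00000000'),
        (3, 'Biochemistry', '000000000000'),
        (4, 'Engineering',  '00000001');
    insert into object_category_map values
        (101, 'news_item',      2),
        (102, 'bboard_message', 3),
        (103, 'news_item',      4);
""")


def objects_in(category_id):
    """Objects mapped to a category or to any of its descendants."""
    return conn.execute(
        """
        select m.object_id, m.object_type
        from categories top
        join categories c on c.tree_sortkey like top.tree_sortkey || '%'
        join object_category_map m on m.category_id = c.category_id
        where top.category_id = ?
        order by m.object_id
        """,
        (category_id,),
    ).fetchall()
```

Asking for Biology returns the news item filed directly under it as well as the bboard message filed under Biochemistry, regardless of which package owns each object.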

19: Re: Categorization (response to 1)
Posted by Peter Marklund on
Dave,
I'm really glad you are picking up on this. I have released the code I developed at ArsDigita, see

https://openacs.org/forums/message-view?message_id=67698

20: Re: Categorization (response to 1)
Posted by Ciaran De Buitlear on
Hi, I'm starting a new project (thread here: https://openacs.org/forums/message-view?message_id=85199) and some of the functionality might be related to this thread. We're in the very early stages, but I am thinking about using the "knowledge base" in Oracle Text (formerly known as interMedia), or at least part of it, as a hierarchy of categories. This knowledge base is stored in a tree-like structure (otherwise known as a taxonomy). I am also looking at letting users browse or search the taxonomy itself (or part of it) and update it. They could register their preferences at various points in the taxonomy and also use the taxonomy tree to do Yahoo-like browse searching. There is also the concept of searching for content based on what the content is "about", which I am looking at. I'm sure this will return several categories for each piece of content. So we wouldn't have to actually store any categories with a piece of content; we could just generate them each time we searched...
...Now enough theory - have to see if any of this actually works...
21: Re: Categorization (response to 1)
Posted by Ciaran De Buitlear on
Sorry, I should elaborate here slightly. I feel the main flaw in most categorisation systems is that they rely on the users to categorise the content. I really don't think people will ever do this. I mean, I'm very interested in knowledge management and I probably wouldn't bother most of the time. I have been looking desperately for a system which doesn't involve 'altruistic' users. That's where I'm coming from... In a sense my suggestion doesn't really interfere with any existing categorisation system that anyone might have, as it's "just searching" really, that and using the features of Oracle Text!