Forum OpenACS Development: Indexing the content-repository

Posted by Dirk Gomez on
I'm posting my findings on indexing the content repository. I'd like
feedback on whether I'm right before jumping into actual coding.
This writeup targets the Oracle version of .LRN search.

Firstly, files can be stored either in the database or in the
file system. This is a system-wide parameter that can be changed
at any time.

Oracle 8i (the target version) cannot easily access files in the
file system, so inserts, updates, and deletes of a content-repository
item have to be "intercepted" in order to index the item.

The function which needs to be changed is
revision-procs.tcl:cr_import_content. An imported file needs to be fed
to some binary filter (INSO or some free tool) and then inserted into
the Intermedia index. The item's title and description can be indexed
here as well.

To index existing content, only live revisions (via cr_items)
will be considered. Most of the code for this has to be written
outside the database (probably easiest within OpenACS/AOLserver).

Am I on track?

Posted by Dave Bauer on
Here is what Dirk, Jeff, and I discussed today regarding indexing of CR content.

Right now content revisions are indexed, so multiple versions of the same item can end up in the index. This isn't how we expect it to work, since only the live revision should appear in the search results.

This can be simplified: a trigger on cr_items adds the item_id to the search_observer_queue when the item is created, edited, or deleted. Changes to the latest or live revision also cause the item to be queued for indexing.
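
To make the queueing rule concrete, here is a minimal Python sketch that models the trigger logic outside the database. It is only an illustration of the idea, not the actual trigger: the `Item` and `ObserverQueue` classes, the handler names, and the event strings are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Item:
    """Hypothetical stand-in for a row in cr_items."""
    item_id: int
    live_revision: Optional[int] = None
    latest_revision: Optional[int] = None


@dataclass
class ObserverQueue:
    """Hypothetical in-memory stand-in for search_observer_queue."""
    entries: List[Tuple[int, str]] = field(default_factory=list)

    def enqueue(self, item_id: int, event: str) -> None:
        self.entries.append((item_id, event))


def on_item_insert(queue: ObserverQueue, item: Item) -> None:
    # A newly created item is always queued.
    queue.enqueue(item.item_id, "INSERT")


def on_item_update(queue: ObserverQueue, old: Item, new: Item) -> None:
    # Queue only when the live or latest revision pointer actually changed.
    if (old.live_revision != new.live_revision
            or old.latest_revision != new.latest_revision):
        queue.enqueue(new.item_id, "UPDATE")


def on_item_delete(queue: ObserverQueue, item: Item) -> None:
    # A deleted item is queued so it can be removed from the index.
    queue.enqueue(item.item_id, "DELETE")
```

In this model an update that leaves both revision pointers untouched does not queue anything, which keeps the indexer from reprocessing items whose searchable state has not changed.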

In the content_item datasource callback, the item will be indexed if there is a live revision and publish_status is "live". CR-based applications need to set these attributes correctly for search indexing to work. This may require changes to packages that do not set live_revision or publish_status.

The datasource procedure for content_item will find the revision to index and call a content_type-specific callback if one exists, or otherwise the default content_revision callback.
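
The check and dispatch described above could look roughly like the following Python sketch. It is a model of the logic only: the dictionary-based item, the `TYPE_CALLBACKS` registry, and the function names are hypothetical and do not correspond to existing OpenACS procedures.

```python
from typing import Callable, Dict, Optional

# Hypothetical registry: content_type -> callback that builds a datasource.
TYPE_CALLBACKS: Dict[str, Callable[[int], dict]] = {}


def default_revision_datasource(revision_id: int) -> dict:
    """Fallback used when no content_type-specific callback is registered."""
    return {"object_id": revision_id, "source": "content_revision"}


def content_item_datasource(item: dict) -> Optional[dict]:
    """Return a datasource for the item, or None if it must not be indexed.

    `item` is assumed to carry the keys live_revision, publish_status
    and content_type.
    """
    # Index only items with a live revision and publish_status "live".
    if item.get("live_revision") is None:
        return None
    if item.get("publish_status") != "live":
        return None

    # Prefer a content_type-specific callback, else use the default one.
    callback = TYPE_CALLBACKS.get(item["content_type"],
                                  default_revision_datasource)
    return callback(item["live_revision"])
```

Returning None for unpublished items gives the indexer a simple signal to skip the item or remove it from the index.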

The main content of a revision may be binary, such as a Word document or a PDF. If a callback for converting the binary content to text exists, it will be called. Additional attributes of the revision may also be added to the content for indexing.
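
As a rough illustration of the conversion step, the Python sketch below looks up a converter by MIME type and appends extra attributes to the text. The callback signatures have not been posted yet, so everything here (the `CONVERTERS` registry, `build_index_text`, the choice of title and description as extra attributes) is an assumption for the example.

```python
from typing import Callable, Dict, Optional

# Hypothetical registry: MIME type -> function turning bytes into text.
CONVERTERS: Dict[str, Callable[[bytes], str]] = {
    # Plain text needs no real conversion, only decoding.
    "text/plain": lambda data: data.decode("utf-8", errors="replace"),
}


def convert_to_text(data: bytes, mime_type: str) -> Optional[str]:
    """Run the converter for mime_type if one exists, else return None."""
    converter = CONVERTERS.get(mime_type)
    if converter is None:
        return None
    return converter(data)


def build_index_text(data: bytes, mime_type: str,
                     title: str = "", description: str = "") -> str:
    """Combine revision attributes with the converted body, if any."""
    parts = [title, description]
    body = convert_to_text(data, mime_type)
    if body is not None:
        parts.append(body)
    # Skip empty pieces so the result has no stray separators.
    return "\n".join(p for p in parts if p)
```

With this shape, a revision whose MIME type has no registered converter is still indexed by its attributes alone rather than being dropped.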

At this point the datasource will be returned to the search indexer procedure and the data will be sent to the search engine for indexing.

I will be posting information on the callback signatures for the binary to text conversion.

Posted by Malte Sussdorff on
Though it does not teach people to set the live revision, you should nevertheless index the latest revision if no live revision is set.
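
A minimal Python sketch of this fallback, assuming the two revision ids are available as optional integers (the function name is made up for illustration):

```python
from typing import Optional


def revision_to_index(live_revision: Optional[int],
                      latest_revision: Optional[int]) -> Optional[int]:
    """Prefer the live revision; fall back to the latest one if unset."""
    if live_revision is not None:
        return live_revision
    # May still be None if the item has no revisions at all.
    return latest_revision
```

Note that this changes the rule proposed earlier in the thread, under which an item without a live revision would not be indexed at all.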