Forum OpenACS Development: Response to OpenACS wish-list

Hmm, my casual reading of the docs was that each of the three indexing programs indexes objects like Word documents, PDFs, or databases in a remarkably similar way: you create some sort of external program to feed the index. Each also provides links to, or distributes, Perl scripts that work with Word documents and PDFs. But as I haven't gotten around to setting that up yet, I don't know for sure.

I believe that the appropriate indexing behavior is ACS module specific (to a first approximation). I.e. the bboard module knows best what constitutes a bboard message, and similarly with faqs, file storage, etc.

My first thoughts are to use a mechanism similar to the new-stuff system: each interested module provides a routine that accepts a date, and when called with a given date, the module returns lists of the content (either text, HTML, or XML).

To better support incremental indexing, it might be good to turn that date into a date range. Because some content to be indexed can be huge, it might be better to give each module the option of returning either ALL the content within the date range, or just SOME of it plus an indicator that there is more. And what about PDFs and Word documents stored in the db, where you would like the indexing engine to extract the content from the file? I would think this can be handled if the module returns the MIME type of each piece of content.
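To make the MIME type idea a bit more concrete, here is a rough sketch, in Python purely for illustration (the real thing would presumably be Tcl inside ACS). The names `EXTRACTORS`, `html_to_text`, and `indexable_text` are made up for this example. The indexer looks up a text extractor by MIME type and skips anything it has no extractor for, such as PDFs, which would need an external converter.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Strip tags and collapse runs of whitespace."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


# Hypothetical registry: MIME type -> function turning raw content into text.
EXTRACTORS = {
    "text/plain": lambda content: content,
    "text/html": html_to_text,
}


def indexable_text(mime_type, content):
    """Return plain text for the indexer, or None when no extractor is
    registered (e.g. application/pdf, which needs an external converter)."""
    extractor = EXTRACTORS.get(mime_type)
    return extractor(content) if extractor else None
```

Adding PDF or Word support would then just mean registering another entry that shells out to whatever converter the site has installed.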

Finally, it looks as though some of these indexing programs are easiest to set up if they can spider the site, so the site admin may want to direct each module either to return the content itself, or to return just the URL of the content so it can be handed to the spider.
<blockquote><pre>
ad_proc moduleSearchCallback {begin end content_or_url_p more} {
    @return [list more described_content_list]

    more ::= an empty string if there is no more content,
             or a cookie to pass back to the callback function
             so it can determine the next piece of content to return
    described_content_list ::=
             [list mime/type title author date keyword-list url content]
} {
    # each module supplies its own body
}
</pre></blockquote>
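Here is a rough, runnable sketch of how that callback contract might behave, again in Python just for illustration. Everything specific in it is an assumption on my part: the in-memory `MESSAGES` list standing in for a module's table, the `PAGE_SIZE` batch size, the `/bboard/msg/` URLs, using the last returned id as the cookie, and reading `content_or_url_p` as "true means return content, false means URL only".

```python
from datetime import date

# Hypothetical stand-in for a module's content table.
MESSAGES = [
    {"id": 1, "posted": date(2001, 1, 5), "title": "First", "author": "ann", "body": "hello"},
    {"id": 2, "posted": date(2001, 1, 10), "title": "Second", "author": "bob", "body": "more text"},
    {"id": 3, "posted": date(2001, 1, 20), "title": "Third", "author": "ann", "body": "even more"},
    {"id": 4, "posted": date(2001, 2, 1), "title": "Fourth", "author": "cy", "body": "out of range"},
]

PAGE_SIZE = 2  # assumed batch size; a real module would pick its own


def module_search_callback(begin, end, content_or_url_p, more):
    """Return (more, described_content_list) for items dated begin..end.

    `more` is "" on the first call; afterwards it is the cookie from the
    previous call (here: the last id already returned).
    """
    last_id = int(more) if more else 0
    rows = [m for m in MESSAGES
            if begin <= m["posted"] <= end and m["id"] > last_id]
    rows.sort(key=lambda m: m["id"])
    batch = rows[:PAGE_SIZE]

    described = []
    for m in batch:
        url = "/bboard/msg/%d" % m["id"]  # hypothetical URL scheme
        content = m["body"] if content_or_url_p else ""
        # (mime/type, title, author, date, keyword-list, url, content)
        described.append(("text/plain", m["title"], m["author"],
                          m["posted"], [], url, content))

    # Hand back a cookie only if something is left for a later call.
    next_cookie = str(batch[-1]["id"]) if len(rows) > PAGE_SIZE else ""
    return next_cookie, described


def collect_all(begin, end, content_or_url_p=True):
    """Driver loop: keep calling the module until the cookie comes back empty."""
    items, more = [], ""
    while True:
        more, batch = module_search_callback(begin, end, content_or_url_p, more)
        items.extend(batch)
        if not more:
            return items
```

The indexer never needs to know how a module pages through its content; it just passes the cookie back until it gets an empty string.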

An alternative approach would be to make it more table-driven. A generic indexing routine could fit many modules if it worked off a table like this:
<blockquote>
<pre>
create table IndexDescriptors (
    module_name      varchar, -- name of module
    table_name       varchar, -- name of table containing date and unique id
    unique_id_column varchar, -- column holding the integer uid of the content
    date_column      varchar, -- column holding the date of the content
    title_query      varchar, -- query returning title
    author_query     varchar, -- query returning author
    mime_query       varchar, -- query returning mime/type
    content_query    varchar, -- query returning content
    url_query        varchar  -- query returning URL
);
</pre>
</blockquote>
In this case, for modules that have an obvious date column, and use integers as their unique ids, it would be pretty easy to have a generic routine go through and index them.
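As a rough sketch of what such a generic routine could look like (Python with SQLite, purely for illustration), assume each descriptor row is available as a dict and each `*_query` takes the unique id as its single bind parameter. That convention, the `faq` table, and the `/faq/item/` URLs are all made up for the example. Table and column names are pasted straight into the SQL, so the descriptor table would have to be trusted, admin-maintained data.

```python
import sqlite3


def index_module(conn, d, begin, end):
    """Walk one descriptor `d` and return a dict per item dated begin..end."""
    id_sql = "SELECT {uid} FROM {tbl} WHERE {dt} BETWEEN ? AND ? ORDER BY {uid}".format(
        uid=d["unique_id_column"], tbl=d["table_name"], dt=d["date_column"])
    docs = []
    for (uid,) in conn.execute(id_sql, (begin, end)).fetchall():
        doc = {"module": d["module_name"], "id": uid}
        # Assumed convention: every per-field query binds the uid once.
        for field in ("title", "author", "mime", "content", "url"):
            row = conn.execute(d[field + "_query"], (uid,)).fetchone()
            doc[field] = row[0] if row else None
        docs.append(doc)
    return docs


def demo():
    """Build a throwaway faq table and index January 2001."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE faq (faq_id INTEGER PRIMARY KEY, posted TEXT,"
                 " question TEXT, answer TEXT, author TEXT)")
    conn.executemany("INSERT INTO faq VALUES (?, ?, ?, ?, ?)", [
        (1, "2001-01-05", "Question one", "Answer one", "ann"),
        (2, "2001-03-01", "Question two", "Answer two", "bob"),
    ])
    descriptor = {
        "module_name": "faq",
        "table_name": "faq",
        "unique_id_column": "faq_id",
        "date_column": "posted",
        "title_query": "SELECT question FROM faq WHERE faq_id = ?",
        "author_query": "SELECT author FROM faq WHERE faq_id = ?",
        "mime_query": "SELECT 'text/plain' FROM faq WHERE faq_id = ?",
        "content_query": "SELECT answer FROM faq WHERE faq_id = ?",
        "url_query": "SELECT '/faq/item/' || faq_id FROM faq WHERE faq_id = ?",
    }
    return index_module(conn, descriptor, "2001-01-01", "2001-01-31")
```

Modules whose content doesn't map cleanly onto one table with a date column and an integer id would still need the callback approach.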

Uh, gotta run to swim class with the kids, what are your thoughts?