Forum OpenACS Development: Response to OpenACS wish-list

Posted by Jerry Asher on
Hmm, my casual reading of the docs was that each of the three indexing programs indexes objects like Word documents, PDFs, or databases in a remarkably similar way: you create some sort of external program to feed the index.  And each either links to or distributes Perl scripts that work with Word docs and PDFs.  But as I haven't gotten to setting that up just yet, I don't know for sure.
I believe that the appropriate indexing behavior is ACS module specific (to a first approximation).  I.e. the bboard module knows best what constitutes a bboard message, and similarly with faqs, file storage, etc.
My first thought is to use a mechanism similar to the new-stuff system: each interested module provides a routine that accepts a date, and when called with a given date, the module returns a list of its content (as text, HTML, or XML).
To better support incremental indexing, it might be good to make that date into a date range.  Because some content to be indexed can be huge, it might be better to give each module the option of returning ALL the content within the date range, or just returning SOME of the content and an indicator that there is more.  And what to do about PDFs and Word docs stored in the db, where you would like the indexing engine to extract the content from the PDF?  I would think this can be handled if the module can return the MIME type of each piece of content.
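To make the MIME type idea concrete, here is a rough sketch of how an indexer might pick an extractor per piece of content. It is in Python rather than Tcl just to keep it self-contained, and the names (`EXTRACTORS`, `extract_text`, `html_to_text`) are made up for illustration; nothing here is an existing ACS API. Types with no registered extractor, such as PDFs, simply come back as None so the caller can hand them to an external converter.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects the text nodes of an HTML document, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Strip tags and collapse runs of whitespace."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Hypothetical registry: MIME type -> function turning raw content into text.
EXTRACTORS = {
    "text/plain": lambda content: content,
    "text/html": html_to_text,
}


def extract_text(mime_type, content):
    """Return indexable text, or None when no extractor is registered
    (e.g. application/pdf, which would need an external converter)."""
    extractor = EXTRACTORS.get(mime_type)
    return extractor(content) if extractor else None
```

A module would then only have to report the MIME type honestly; supporting a new format means adding one entry to the registry.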
Finally, it looks as though some of these indexing programs are easiest to set up if they have some ability to spider the site, so the site admin may want to direct each module either to return the content itself, or to return just the URL of the content so it can be handed to the spider.
ad_proc moduleSearchCallback {begin end content_or_url_p more} {
    Returns the module's content (or just its URLs) dated between begin and end.

    @return [list more described_content_list]
      more :== an empty string if there is no more content,
            or a cookie to pass back to the callback function
            so it can determine the next piece of content to return
      described_content_list :==
          [list mime/type title author date keyword-list url content]
} {
    # module-specific body goes here
}
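Roughly, the indexer side of that contract would be a loop that keeps handing the cookie back until it comes back empty. The sketch below is Python, not Tcl, so it runs on its own. Everything in it is assumed for illustration: the sample data, the page size, the `/bboard/msg/` URL, the reading of `content_or_url_p` as "true means include the content", and the choice to return one tuple per item.

```python
from datetime import date

PAGE_SIZE = 2  # assumed batch size; a real module would pick its own

# Made-up sample rows standing in for a module's table.
MESSAGES = [
    {"id": 1, "title": "First", "author": "ann", "date": date(2001, 1, 1), "body": "one"},
    {"id": 2, "title": "Second", "author": "bob", "date": date(2001, 1, 2), "body": "two"},
    {"id": 3, "title": "Third", "author": "cat", "date": date(2001, 1, 3), "body": "three"},
    {"id": 4, "title": "Fourth", "author": "dan", "date": date(2001, 2, 1), "body": "four"},
]


def bboard_callback(begin, end, content_or_url_p, more):
    """Hypothetical module callback: returns (more, items).

    `more` is "" when nothing is left; otherwise it is an opaque cookie
    (here simply the next offset) to pass back on the next call.
    """
    matching = [m for m in MESSAGES if begin <= m["date"] <= end]
    start = int(more) if more else 0
    items = []
    for m in matching[start:start + PAGE_SIZE]:
        url = f"/bboard/msg/{m['id']}"  # hypothetical URL scheme
        content = m["body"] if content_or_url_p else ""
        # (mime/type, title, author, date, keyword-list, url, content)
        items.append(("text/plain", m["title"], m["author"], m["date"], [], url, content))
    next_start = start + PAGE_SIZE
    next_more = str(next_start) if next_start < len(matching) else ""
    return next_more, items


def collect_all(callback, begin, end, content_or_url_p=True):
    """Call the module repeatedly until it reports no more content."""
    more = ""
    collected = []
    while True:
        more, items = callback(begin, end, content_or_url_p, more)
        collected.extend(items)
        if not more:
            return collected
```

Keeping the cookie opaque to the indexer means each module is free to encode an offset, a last-seen id, or a timestamp, whichever is cheapest for its own tables.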

An alternative approach would be to make it more table driven.  In this approach, a generic indexing routine might be made to fit many modules if it could work off a table:
create table IndexDescriptors (
    module_name       varchar, -- name of module
    table_name        varchar, -- name of table containing date and unique id
    unique_id_column  varchar, -- column holding the integer unique id of the content
    date_column       varchar, -- column holding the date of the content
    title_query       varchar, -- query returning title
    author_query      varchar, -- query returning author
    mime_query        varchar, -- query returning mime/type
    content_query     varchar, -- query returning content
    url_query         varchar  -- query returning URL
);
In this case, for modules that have an obvious date column, and use integers as their unique ids, it would be pretty easy to have a generic routine go through and index them.
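As a rough sketch of what such a generic routine might look like, here it is in Python against an in-memory SQLite database, purely so it runs standalone. The assumptions are mine: each stored query takes the unique id as its single bind parameter, dates are ISO strings (so they compare correctly as text), only a subset of the descriptor columns is used, and the `faqs` table and its URL are invented sample data.

```python
import sqlite3


def index_all(conn, since):
    """Yield one dict per row dated on or after `since`, for every descriptor."""
    descriptors = conn.execute(
        "SELECT module_name, table_name, unique_id_column, date_column,"
        "       title_query, content_query, url_query FROM IndexDescriptors"
    ).fetchall()
    for module, table, id_col, date_col, title_q, content_q, url_q in descriptors:
        # Table and column names come from the trusted descriptor table, so
        # they are interpolated; the date itself is passed as a bind parameter.
        ids = conn.execute(
            f"SELECT {id_col} FROM {table} WHERE {date_col} >= ? ORDER BY {id_col}",
            (since,),
        ).fetchall()
        for (uid,) in ids:
            yield {
                "module": module,
                "id": uid,
                "title": conn.execute(title_q, (uid,)).fetchone()[0],
                "content": conn.execute(content_q, (uid,)).fetchone()[0],
                "url": conn.execute(url_q, (uid,)).fetchone()[0],
            }


def demo():
    """Build a tiny sample database and index everything since 2001-02-01."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE IndexDescriptors (
            module_name TEXT, table_name TEXT, unique_id_column TEXT,
            date_column TEXT, title_query TEXT, content_query TEXT,
            url_query TEXT
        );
        CREATE TABLE faqs (
            faq_id INTEGER PRIMARY KEY, posted TEXT, question TEXT, answer TEXT
        );
        INSERT INTO faqs VALUES (1, '2001-01-05', 'Old?', 'Yes');
        INSERT INTO faqs VALUES (2, '2001-03-01', 'New?', 'Also yes');
        INSERT INTO IndexDescriptors VALUES (
            'faq', 'faqs', 'faq_id', 'posted',
            'SELECT question FROM faqs WHERE faq_id = ?',
            'SELECT answer FROM faqs WHERE faq_id = ?',
            'SELECT ''/faq/item/'' || faq_id FROM faqs WHERE faq_id = ?'
        );
    """)
    return list(index_all(conn, "2001-02-01"))
```

With this shape, a module opts in to indexing by inserting one row into the descriptor table; modules whose content does not fit a single table with a date column would still need a callback of their own.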
Uh, gotta run to swim class with the kids, what are your thoughts?