Forum OpenACS Development: Response to OpenACS wish-list

Posted by Jerry Asher on
Kapil, I'm not exactly sure what you mean by "direct access to the index."

I think it's very important to have an interface where the indexing program has direct access to the databases.  As an experiment, last night I pointed htDig at the openacs bboards.  With all the spidering parameters left at their defaults, it took an hour and ten minutes to index 2898 documents.  So not only was spidering slow, but a spidering solution doesn't make it easy to perform incremental indexing.  I think a db-driven search filter would make incremental indexing much, much easier.  (The one difficulty with incremental indexing in all three of these programs is that none of them makes it possible to delete from the index, so periodically the entire index must be rebuilt to eliminate false positives.)
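To make the db-driven idea concrete, here is a minimal Python sketch of incremental indexing straight from a table. Everything in it is an assumption made for illustration: the `messages` table, its `modified` column, and the toy in-memory inverted index are hypothetical and are not how htDig, SWISH-E, or SWISH++ work internally. It only shows the shape of the approach: index rows newer than a high-water mark, and fall back to a full rebuild to get rid of deleted rows.

```python
import sqlite3
from collections import defaultdict


def index_new_rows(conn, index, last_seen):
    """Add rows modified after last_seen to the index; return the new mark."""
    rows = conn.execute(
        "SELECT msg_id, body, modified FROM messages "
        "WHERE modified > ? ORDER BY modified",
        (last_seen,),
    ).fetchall()
    for msg_id, body, modified in rows:
        for word in set(body.lower().split()):
            index[word].add(msg_id)
        last_seen = max(last_seen, modified)
    return last_seen


def rebuild(conn):
    """Full rebuild: the only way this add-only sketch drops deleted rows."""
    index = defaultdict(set)
    last_seen = index_new_rows(conn, index, 0)
    return index, last_seen


# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (msg_id INTEGER PRIMARY KEY, body TEXT, modified INTEGER)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [(1, "htdig spidering is slow", 10), (2, "swish-e indexes quickly", 20)],
)

index, mark = rebuild(conn)

# A new post arrives: only that row is read and indexed.
conn.execute("INSERT INTO messages VALUES (3, 'incremental indexing is cheap', 30)")
mark = index_new_rows(conn, index, mark)

# A post is deleted: the index still points at it until the next rebuild.
conn.execute("DELETE FROM messages WHERE msg_id = 1")
stale = set(index["htdig"])
index, mark = rebuild(conn)
```

The incremental pass touches one row instead of recrawling the whole site, while the stale entry left behind by the delete is the false-positive problem that forces the periodic rebuild.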

The htDig spidering solution is fine for sites whose searchable content doesn't change frequently, but I would probably want something better for a site with active bboards or wps.

(Other statistics from that experiment: using 20% of the CPU (one PIII 500), it looked through a million words and found about 60,000 unique words.  It built a 700K index and a 1.2M excerpts database.  When I get a chance, I will open up that htDig index so that everyone can play with it.  With no disparagement towards Don, it is a much more powerful interface than the standard out-of-the-box search solution.  For example, I can easily search for threads containing adida but not mello.)
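The "adida but not mello" kind of query boils down to set arithmetic on an inverted index. The following Python sketch is only an illustration of that idea; the sample threads and the `build_index` and `search` helpers are made up here and say nothing about htDig's actual query engine or syntax.

```python
def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index


def search(index, include, exclude=()):
    """Return ids of documents with every include word and no exclude word."""
    if not include:
        return set()
    # Intersection returns a new set, so the index itself is never mutated.
    hits = set.intersection(*(index.get(word, set()) for word in include))
    for word in exclude:
        hits -= index.get(word, set())
    return hits


# Hypothetical thread bodies, for illustration only.
threads = {
    1: "adida posted a fix",
    2: "adida and mello discussed search",
    3: "mello on indexing",
}
index = build_index(threads)
result = search(index, ["adida"], exclude=["mello"])  # {1}
```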

That said, I don't care whether the searching program runs embedded in AOLserver or as an external process.  I started cobbling together a SWISH-E AOLserver module, but my intention was STILL to run that as a separate AOLserver.  Mainly I was using AOLserver to get a cheap and quick daemon, chroot, and threading solution that can be controlled with Tcl scripts and scheduled procs.  (I also figure that a nice extension to SWISH-E, SWISH++, or htDig would be to expose the search interface as a) a separate webserver, b) a socket-driven service, or c) an XML-RPC search engine.)
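As a rough picture of option c), here is a small Python sketch that wraps a search function in an XML-RPC server using the standard library's `xmlrpc.server`. The `DOCS` data, the `search` function, the `make_server` helper, and port 8765 are all hypothetical; a real version would call into SWISH-E, SWISH++, or htDig rather than scan a dictionary.

```python
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical stand-in for a real index, for illustration only.
DOCS = {
    1: "htdig spiders the site over http",
    2: "swish-e can index output from a program",
    3: "swish++ is a rewrite in c++",
}


def search(term):
    """Return a sorted list of ids of documents containing the exact word."""
    term = term.lower()
    # XML-RPC cannot marshal Python sets, so a list is returned.
    return sorted(
        doc_id for doc_id, text in DOCS.items() if term in text.lower().split()
    )


def make_server(host="127.0.0.1", port=8765):
    """Create (but do not start) an XML-RPC server exposing search()."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(search, "search")
    return server


# To run the service (this call blocks):
#     make_server().serve_forever()
```

A client on another machine or in another process could then call it with `xmlrpc.client.ServerProxy("http://127.0.0.1:8765").search("swish-e")`, which keeps the search engine decoupled from whatever webserver renders the results.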

I believe it is possible to design a process in which authors create abstracts and store them in meta keywords that any of these three search engines can get to.  htDig will make excerpts on the fly, meaning that what it returns as the excerpt contains at least one of the search terms. Again, it is user preference as to which is better.
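For comparison with author-written abstracts, an on-the-fly excerpt can be sketched in a few lines of Python. This is a simplified illustration, not htDig's actual excerpt algorithm: the `excerpt` function and its `width` parameter are made up here, and it matches substrings rather than whole words.

```python
def excerpt(text, terms, width=60):
    """Return a window of text around the earliest matching term, or None."""
    lower = text.lower()
    positions = [p for p in (lower.find(t.lower()) for t in terms) if p >= 0]
    if not positions:
        return None
    # Center the window roughly on the first hit, clamped to the text bounds.
    start = max(0, min(positions) - width // 2)
    end = min(len(text), start + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix


# Hypothetical document body, for illustration only.
body = (
    "The htDig spider took over an hour to crawl the bboards, "
    "which is too slow for an active site."
)
snippet = excerpt(body, ["bboards"], width=40)
```

The tradeoff is the one described above: a stored abstract reads better, while a computed excerpt shows the reader where the search term actually appears.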