Forum OpenACS Q&A: Response to Search in AOLserver

Posted by Krzysztof Kowalczyk on
Are you indexing database entries?
No, I've only played with files. It should be possible to index db entries, of course, using a hack: save each db entry to a file whose name can be decoded later to recover the table/record id, tell Swish++ to index that file, then delete the file and move on. This is how the httpindex frontend, which indexes pages grabbed directly from a web server, works.
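
To make the file-name hack concrete, here is a minimal sketch in Python. The naming scheme (`<table>__<record id>.txt`), the separator, and the function names are all my own assumptions for illustration; nothing here is part of Swish++ or OpenACS.

```python
import os
from pathlib import Path
from typing import Tuple

SEP = "__"  # hypothetical separator between table name and record id


def encode_name(table: str, record_id: int) -> str:
    """Build a file name that carries the table name and record id."""
    return f"{table}{SEP}{record_id}.txt"


def decode_name(filename: str) -> Tuple[str, int]:
    """Recover (table, record_id) from a name produced by encode_name."""
    stem = Path(filename).stem
    # Split on the LAST separator so table names may contain underscores.
    table, sep, record_id = stem.rpartition(SEP)
    if not sep or not table or not record_id.isdigit():
        raise ValueError(f"not an exported record file: {filename!r}")
    return table, int(record_id)


def export_record(spool_dir: str, table: str, record_id: int, text: str) -> str:
    """Write one db entry to a file the indexer can be pointed at."""
    path = os.path.join(spool_dir, encode_name(table, record_id))
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path
```

A search hit on `bboard__42.txt` can then be mapped back with `decode_name`, which yields the table `bboard` and record id 42, so the page can link to the database row rather than to the temporary file.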

I don't understand why it should be necessary to merge Swish++ with PG. In your earlier posts you mentioned that you're leaning towards an out-of-database solution, which I second: just as with file storage, it's kind of pointless to put into the database a copy of data that's already there and only serves to create an index. But maybe I'm missing some bigger picture here.

swish++ allows incremental updates of the index file - important for indexing things like bboard entries. However my quick poking around didn't see any concurrency protection ...
It's because it doesn't have any concurrency protection. The way incremental indexing works is this: the original index is read-only the whole time; Swish++ creates a copy of this index and adds new documents to it. Since only one process can update the new index, there is no concurrency and thus no problem. When updating is finished, one just has to switch to the new index. The biggest problem I see is that during this operation you need twice as much space for the index, but I don't see this as a showstopper:
  • index is relatively small (under 10% of original files)
  • people who are serious about this stuff and are lucky enough to have things to index will just buy bigger drives
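
The copy-then-switch idea can be sketched in Python as below. This is only an illustration under assumptions: per the description above, Swish++ makes the copy itself, whereas here plain file operations stand in for it, and `update` is a hypothetical callable representing "run the indexer against the working copy". The function name `rebuild_and_swap` is mine as well.

```python
import os
import shutil
import tempfile
from typing import Callable


def rebuild_and_swap(live_index: str, update: Callable[[str], None]) -> None:
    """Copy the live index, update the copy, then switch to the copy.

    The live index is never written to; searchers keep reading it until
    the final rename. Only one process should call this at a time.
    """
    index_dir = os.path.dirname(os.path.abspath(live_index))
    # Create the working copy next to the live index so the final rename
    # stays on one filesystem.
    fd, work_copy = tempfile.mkstemp(prefix="index.new.", dir=index_dir)
    os.close(fd)
    try:
        shutil.copyfile(live_index, work_copy)  # this is the 2x disk space
        update(work_copy)                       # hypothetical indexer step
        os.replace(work_copy, live_index)       # switch to the new index
    except BaseException:
        # On any failure, drop the half-built copy; the live index is intact.
        if os.path.exists(work_copy):
            os.remove(work_copy)
        raise
```

On POSIX systems `os.replace` is an atomic rename when both paths are on the same filesystem. Whether searches still need to be paused depends on how the search process holds the index open, which this sketch does not address. As for space: with an index under 10% of the original files, 1 GB of documents means under 100 MB of index, so under 200 MB while both copies exist.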

I'm a bit sketchy on ACS Classic's search implementation, but I think the way I would like to do it is similar to their approach (with the exception of using an external program to index things, of course):

  • keep track of what needs to be indexed (table/record id)
  • have a periodic task that updates the index by moving records out of database to files and feeding those files to Swish++
  • pause search for a while, replace the old index with the newly created one, and voilà
In theory it's trivial, and I'll implement this unless someone beats me to it (I'll only be able to start working on it in 3 weeks). I've just sent Jim Davidson patches to AOLserver that will make it possible to communicate efficiently with Swish++ from within AOLserver; let's hope he'll integrate them.
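
For what it's worth, the first two steps of the list above (tracking what needs indexing, and the periodic export-and-index pass) could look roughly like this in Python. Everything named here is hypothetical: `IndexQueue`, `run_index_pass`, the `fetch_text` callable standing in for a database query, and the `index_files` callable standing in for a call to the external indexer. The index swap from the third step is left out.

```python
import os
from typing import Callable, List, Set, Tuple

RecordKey = Tuple[str, int]  # (table, record id)


class IndexQueue:
    """Keeps track of which records still need to be indexed."""

    def __init__(self) -> None:
        self._pending: Set[RecordKey] = set()

    def mark_dirty(self, table: str, record_id: int) -> None:
        self._pending.add((table, record_id))

    def drain(self) -> List[RecordKey]:
        """Return all pending keys in sorted order and clear the queue."""
        keys = sorted(self._pending)
        self._pending.clear()
        return keys


def run_index_pass(
    queue: IndexQueue,
    fetch_text: Callable[[str, int], str],    # hypothetical db lookup
    index_files: Callable[[List[str]], None], # hypothetical indexer call
    spool_dir: str,
) -> int:
    """One periodic pass: move pending records to files and index them."""
    keys = queue.drain()
    paths: List[str] = []
    try:
        for table, record_id in keys:
            text = fetch_text(table, record_id)
            path = os.path.join(spool_dir, f"{table}__{record_id}.txt")
            paths.append(path)
            with open(path, "w", encoding="utf-8") as f:
                f.write(text)
        if paths:
            index_files(paths)
    except Exception:
        # Put everything back so the next pass retries these records.
        for table, record_id in keys:
            queue.mark_dirty(table, record_id)
        raise
    finally:
        # The files exist only to feed the indexer; delete them afterwards.
        for path in paths:
            if os.path.exists(path):
                os.remove(path)
    return len(paths)
```

A scheduler would call `run_index_pass` at whatever interval suits the site, and application code would call `mark_dirty` whenever a row is inserted or changed.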