Forum OpenACS Q&A: Response to Search in AOLserver

Posted by Don Baccus on
First of all - I won't be able to spend time on this for four to five weeks, so you should go for it, Krzysztof (I'll be in Nevada for the month of September, with no computer much less net access).

Some specifics:

No, I've only played with files. It should be possible to index db entries, though it requires a hack: save the db entry to a file whose name we can later decode to extract the table/record id, tell Swish++ to index that file, then delete the file and move on. This is how the httpindex frontend works for indexing pages grabbed directly from the web server.
We should be able to do much, much better than this by providing a datasource that knows about Postgres, or by providing a new entry point into Swish++ that takes parameters rather than a file name. Neither should be particularly hard.
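The file-name hack described above can be sketched as follows. This is a minimal illustration only: the `encode_name`/`decode_name`/`index_row` helpers and the `table.id.txt` naming scheme are assumptions for the sketch, not part of Swish++ or httpindex.

```python
import os
import subprocess
import tempfile

def encode_name(table, row_id):
    # Encode table/record id into a file name we can decode later.
    return f"{table}.{row_id}.txt"

def decode_name(filename):
    # Recover (table, row_id) from a file name produced by encode_name.
    table, row_id, _ext = filename.rsplit(".", 2)
    return table, int(row_id)

def index_row(table, row_id, content, index_cmd=None):
    # Dump the db entry to a temp file, hand it to the indexer,
    # then delete the file - mimicking the httpindex-style hack.
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, encode_name(table, row_id))
    with open(path, "w") as f:
        f.write(content)
    try:
        if index_cmd:  # e.g. the Swish++ indexer command line (assumed)
            subprocess.run(index_cmd + [path], check=True)
    finally:
        os.remove(path)
        os.rmdir(tmpdir)
```

A Postgres-aware datasource would replace the temp-file round trip entirely; the point of the sketch is only that the file name carries enough information to map a hit back to a table/record id.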
I don't understand why it should be necessary to merge Swish++ with PG. In your earlier posts you've mentioned that you're leaning towards an out-of-database solution, which I second: just as with file storage, it's rather pointless to put into a database a copy of data that's already there and serves only to create an index - but maybe I'm missing some bigger picture here.
My thinking is that eventually it might be nice to provide a PG function that can query the index directly, so you can join the results to the (in 4.0) repository without doing any intermediate work. That seems like it could be more efficient than querying the Swish++ daemon over a socket.

This is only for searching the index, of course. As far as building the index goes, the only level of integration that would be nice would be the ability to put a trigger on a table like bboard that causes the entry to be indexed automatically on insert, and deleted for delete/update (and reinserted for the latter). The trigger approach isn't strictly necessary, of course, just nice (you can call the indexer directly from Tcl instead).
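The index-maintenance semantics described above (index on insert, remove on delete, remove-then-reinsert on update) can be mirrored in a toy in-memory index. The class and method names here are illustrative only; whether this lives in a Postgres trigger or is called directly from Tcl is the open question in the post.

```python
class ToyIndex:
    """Toy stand-in for the full-text index, keyed by (table, row_id)."""

    def __init__(self):
        self.docs = {}

    def on_insert(self, table, row_id, text):
        # New row: add it to the index.
        self.docs[(table, row_id)] = text

    def on_delete(self, table, row_id):
        # Deleted row: drop its index entry.
        self.docs.pop((table, row_id), None)

    def on_update(self, table, row_id, text):
        # Update = delete the stale entry, then reinsert the new one.
        self.on_delete(table, row_id)
        self.on_insert(table, row_id, text)
```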

Does swish++ support incremental deletes as well as inserts?

But this certainly isn't important in the near term, and the search daemon's an improvement over the swish-e approach. You can make persistent connections to the daemon and pool them a la database drivers (in fact, you could make it a dummy "database" and use the driver protocol as a quick hack).
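Pooling persistent daemon connections a la database drivers could look roughly like this. It's a sketch under assumptions: the `connect` factory stands in for opening a socket to the search daemon, and the class name is made up.

```python
import queue

class SearchConnPool:
    """Minimal pool of persistent connections, a la database driver pools."""

    def __init__(self, connect, size=4):
        # Open `size` persistent connections up front and keep them in a
        # thread-safe queue; acquire/release reuse them instead of reconnecting.
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(connect())

    def acquire(self):
        # Block until a pooled connection is free.
        return self._pool.get()

    def release(self, conn):
        # Return the connection for reuse instead of closing it.
        self._pool.put(conn)
```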

It's because it doesn't have any concurrency protection. The way it works with incremental indexing is: the original index is read-only the whole time, and Swish++ creates a copy of this index and adds new documents to it. Since there can be only one process updating the new index, there is no concurrency and thus no problem.
This sounds like a race condition to me... AOLserver threads "A" and "B" both start updating the index at the same time, reading the same read-only copy, then each in turn writes a new index. You can lock in the AOLserver interface, though. Since searches work off of a read-only copy, they won't be blocked. Having inserts block while searches don't should be OK.
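The locking scheme just described - searches read a stable snapshot while at most one thread at a time copies and republishes the index - might look like this. This is an assumed structure for illustration, not AOLserver's actual interface.

```python
import threading

class CopyOnWriteIndex:
    """Readers always see an immutable snapshot; writers serialize on a lock."""

    def __init__(self):
        self._snapshot = frozenset()         # read-only index searches run against
        self._write_lock = threading.Lock()  # serializes copy-and-republish updates

    def search(self, term):
        # Never blocks: reads the current immutable snapshot.
        return term in self._snapshot

    def add(self, term):
        # Only one writer at a time copies the old index and publishes the new
        # one, closing the thread-A/thread-B race from the post.
        with self._write_lock:
            self._snapshot = frozenset(self._snapshot | {term})
```

Without `_write_lock`, two writers could both copy the same old snapshot and the second publish would silently drop the first writer's additions - exactly the race described above.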

The biggest problem with Swish++ is the lack of phrase-based search, which we can poach from swish-e later anyway, so I'm not worried about this.