integration strategies. Since I haven't found a site indexing engine
that I like, I am of the belief that the OpenACS should support many
different types of site indexers and search engines.
It's no revelation that many sites already use other search engines
(htDig, SWISH, mnogo, even Microsoft Context Server (is that the
name?)) and it would be beneficial (and even synergistic) to be able
to integrate the OpenACS into those sites, letting them keep their
same search engine technology. Ideally, I would like the OpenACS to
come, out of the box, with a search engine that works for any
OpenACS-supported platform: *nix or Windows, Oracle or Postgres or ...? Then
a developer would have a working solution but would also have the
ability to change out/swap/tune that solution with other appropriate
technologies.
Here's a partial list of search engines that have been proposed for
OpenACS sites.
a) Intermedia
b) OpenFTS
c) Lucene
d) htDig
e) SWISH<fork>
f) mnoGoSearch
g) ???
What a panoply of technology! We have C, C++, Perl, and Java. We
have GPL and not. There are engines based on site crawling, engines
based on db crawling, and those that would do both. Some of these
engines are specific to a given database (Intermedia/Oracle, and
OpenFTS/Postgres).
My proposal is to see a "search engine driver" based approach to
search engine integration. This would consist of two sets of
interfaces that each site indexer would have to support, with
the "promise" that once the support is present, the site indexer
would plug right into the ACS.
On the frontend, a search engine would support an AOLserver module
(specifically an AOLserver db-driver-like interface) that exposes
methods to open and close an index, to specify a search, to retrieve
results, and to allow a limited amount of introspection (is this type
of document handled, are excerpts returned, etc.). This would
let /search pages handle all sorts of different search engines
generically in the same manner that ns_db or db_ lets us deal with
many different types of RDBMS.
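
To make the frontend half concrete, here is a rough sketch of what such a driver contract could look like. It is in Python only for brevity (a real version would presumably be an AOLserver module), and every name in it (SearchDriver, Hit, InMemoryDriver, search_page) is made up for illustration, not an existing OpenACS or AOLserver API. The toy in-memory engine just stands in for htDig, OpenFTS, or whatever is plugged in.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Hit:
    url: str
    title: str
    excerpt: str = ""


class SearchDriver(ABC):
    """Hypothetical driver contract; none of these names come from OpenACS."""

    @abstractmethod
    def open(self, index_name): ...

    @abstractmethod
    def close(self): ...

    @abstractmethod
    def search(self, query, limit=10): ...

    # Limited introspection, as described above.
    @abstractmethod
    def handles_type(self, mime_type): ...

    @abstractmethod
    def returns_excerpts(self): ...


class InMemoryDriver(SearchDriver):
    """Toy engine used only to exercise the interface."""

    def __init__(self, docs):
        self._docs = docs  # {url: (title, body)}
        self._open = False

    def open(self, index_name):
        self._open = True

    def close(self):
        self._open = False

    def search(self, query, limit=10):
        if not self._open:
            raise RuntimeError("index is not open")
        words = query.lower().split()
        hits = []
        for url, (title, body) in self._docs.items():
            text = body.lower()
            # A document matches when every query word occurs in its body.
            if all(w in text for w in words):
                hits.append(Hit(url, title, body[:40]))
        return hits[:limit]

    def handles_type(self, mime_type):
        return mime_type == "text/plain"

    def returns_excerpts(self):
        return True


def search_page(driver, query):
    """What a generic /search page might do, whatever engine is plugged in."""
    driver.open("site")
    try:
        hits = driver.search(query)
        show_excerpts = driver.returns_excerpts()
        return [(h.url, h.excerpt if show_excerpts else "") for h in hits]
    finally:
        driver.close()
```

The point of the sketch is that search_page never learns which engine it is talking to; swapping engines means swapping the driver object and nothing else.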
On the backend, a module would support some sort of "search me"
callback interface similar perhaps to the "ACS new stuff" interface,
or similar to the ACS Intermedia interface. As glue then, support of
a site indexer would consist of code that runs through the "search
me" callbacks or scans the ACS Intermedia table returning the various
documents to the site indexer, and perhaps ACS-wide parsers that
might help an indexer understand ACS objects when encountering them
as embedded links within a document (users, bboards and bboard
topics).
With a bit of experience, I would prefer a callback scheme to an ACS
classic/Intermedia table based scheme. There is a conflict within
the ACS today: keep all the content in the db vs. allow some of
the content to reside in the file system. A callback scheme lets the
module determine where the content lies. A callback scheme would let
the site indexer index remote content (via a callback that supports
XML-RPC or SOAP, or any sort of webservice). A table based scheme
may cause the data to be replicated one more time: it's somewhere in
db or file space in its primary format, it's replicated in indexes
within the site indexer, and we then ask it to be stored one more
time within the table to be scanned.
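
For the backend, a "search me" callback scheme might look roughly like the sketch below. Again this is Python for brevity, and all of the names (Document, register_search_me, feed_indexer, the two example callbacks) are hypothetical, not part of the ACS. Each module registers a callback that yields its own documents, so the module alone decides whether they come from the db, the file system, or a remote service, and nothing is copied into an extra table.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator


@dataclass
class Document:
    object_id: str
    title: str
    body: str


# package name -> callback that yields that package's documents (hypothetical registry)
_search_me: Dict[str, Callable[[], Iterator[Document]]] = {}


def register_search_me(package, callback):
    """Called by a module to announce that it has searchable content."""
    _search_me[package] = callback


def bboard_documents():
    # A real module would query the database here; this is stand-in data.
    yield Document("bboard:1", "Search engines", "Which indexer should we use?")


def static_page_documents():
    # A real module would walk the file system here; this is stand-in data.
    yield Document("file:/www/about.html", "About", "About this site")


def feed_indexer(index_one):
    """Glue code: run every callback and hand each document to the indexer."""
    count = 0
    for package, callback in sorted(_search_me.items()):
        for doc in callback():
            index_one(doc)
            count += 1
    return count


register_search_me("bboard", bboard_documents)
register_search_me("static-pages", static_page_documents)
```

Here index_one is whatever function a given site indexer exposes for adding a single document; the glue does not care where each callback found its content.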
What do you folks think? Is this an OpenACS requirement,
distraction, or bloat? What should these interfaces look like?
Callbacks or table driven?