Forum OpenACS Development: OpenFTS indexing non-text formats revisited

I am sure we have discussed this before, but now I am actually implementing a solution.

Right now I am looking at a solution keyed to a content item's MIME type: the datasource service contract Tcl procedure would [exec] an external program to extract the item's text and return it.
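To make the idea concrete, here is a minimal sketch of the MIME-type dispatch, written in Python rather than Tcl purely for illustration. The `EXTRACTORS` table, the `extract_text` name, and the choice of `pdftotext` and `antiword` as converters are all assumptions of this sketch, not anything that exists in OpenACS or the search package.

```python
import subprocess

# Hypothetical mapping from MIME type to an extractor command line.
# Each command is assumed to write the extracted plain text to stdout;
# "{path}" is replaced with the path of the file to convert.
EXTRACTORS = {
    "application/pdf": ["pdftotext", "-q", "{path}", "-"],
    "application/msword": ["antiword", "{path}"],
}


def extract_text(path, mime_type, extractors=EXTRACTORS, timeout=30):
    """Return the text of the file at `path`, or "" if it cannot be extracted."""
    # Plain-text types need no external program.
    if mime_type.startswith("text/"):
        with open(path, encoding="utf-8", errors="replace") as f:
            return f.read()

    template = extractors.get(mime_type)
    if template is None:
        return ""  # no extractor registered: index nothing for this item

    argv = [path if arg == "{path}" else arg for arg in template]
    try:
        result = subprocess.run(
            argv, capture_output=True, timeout=timeout, check=True
        )
    except (OSError, subprocess.SubprocessError):
        # Missing binary, non-zero exit status, or timeout.
        return ""
    return result.stdout.decode("utf-8", errors="replace")
```

Returning an empty string on failure is a design choice of the sketch: one unreadable file should not abort indexing of everything else.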

The consequence is that the text would be extracted again on every call to that service contract.

The alternatives include:

- Storing a text version parallel to the binary item in the filesystem.
- Storing a text version as a related content item in the content repository.
- Storing the text of items to be indexed in a separate table in the database.
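The separate-table alternative could look roughly like the sketch below. It uses Python with SQLite only so that it is self-contained; a real version would live in PostgreSQL or Oracle. The `indexed_text` table, its columns, and the `cached_text` function are hypothetical names, and keying the cache on a revision id is an assumption about how staleness would be detected.

```python
import sqlite3


def open_cache(db_path=":memory:"):
    """Open the database and create the (hypothetical) text cache table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS indexed_text (
               item_id     INTEGER PRIMARY KEY,
               revision_id INTEGER NOT NULL,
               body        TEXT NOT NULL
           )"""
    )
    return conn


def cached_text(conn, item_id, revision_id, extract):
    """Return the stored text for an item, calling extract() only when stale.

    `extract` is a zero-argument callable that does the expensive conversion.
    """
    row = conn.execute(
        "SELECT revision_id, body FROM indexed_text WHERE item_id = ?",
        (item_id,),
    ).fetchone()
    if row is not None and row[0] == revision_id:
        return row[1]  # cache hit: same revision, no external program run

    body = extract()
    conn.execute(
        "INSERT OR REPLACE INTO indexed_text (item_id, revision_id, body) "
        "VALUES (?, ?, ?)",
        (item_id, revision_id, body),
    )
    conn.commit()
    return body
```

With this shape the datasource procedure would run the external extractor once per revision instead of once per call.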

Eventually I'd like to get whatever solution I find back into OpenACS.

Posted by Malte Sussdorff on
Hi Dave, this sounds awesome. Hope there are enough external procedures that will allow you to do your work.

As for the storage, keep it in a separate table, so we could put that table on a different partition or even a different server if need be. I'm just thinking about the many, many files that AIESEC, for example, has in its file storage.

Posted by Tom Mizukami on
What is the current honest assessment of searching within OpenACS? OpenFTS with PG seems a bit ... fragile and InterMedia with Oracle seems broken.

Posted by Dave Bauer on
OpenFTS is a bit annoying to configure and compile, but seems reliable in use.

I am going to be working (hopefully with some help) on integrating the latest tsearch2 for PostgreSQL into search. More details when some exist.

Posted by Alfred Werner on
On a related(?) note - the datatype work I'm doing is generating a large body of regular expressions. Could make it possible to associate (or tag) text with matched 'named entities' that can be found with the regex library.
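As a rough illustration of what tagging text with regex-matched entities might look like, here is a small Python sketch. The two patterns and the `tag_entities` name are invented for the example; they are not taken from the datatype work mentioned above.

```python
import re

# Hypothetical named patterns; a real library would hold many more.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "iso_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def tag_entities(text, patterns=PATTERNS):
    """Return (start, end, entity_name, matched_text) tuples, ordered by position."""
    found = []
    for name, pattern in patterns.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), name, match.group(0)))
    return sorted(found)
```

The resulting tuples could be stored alongside the extracted text so that a search can filter on entity type as well as on words.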
6: OpenACS site wide search (response to 1)
Posted by Andrew Piskorski on
See also Dirk's recent comments on OpenACS search using Oracle Intermedia. Basically he recommends junking the current OpenACS site-wide-search package (Oracle only, uses Intermedia) and porting the Intermedia-based search package from ACES (aka ACS 3.5+) instead.