Forum OpenACS Q&A: Implement site-wide search using OpenFTS with modules such as file-storage

Dear OpenACS Community, Jowell Sabino created within the file-storage module a file called file-storage-sc-create.sql. I suppose that by running the select statements in that file you can register the content of file-storage with the OpenFTS search engine. Does anybody know what else is involved in writing a service contract for the OpenFTS search engine? I also noticed that you can choose to store the files uploaded to the file-storage package either in the file system or in PostgreSQL. I suppose OpenFTS can only search files that are in the DB. Maybe there is some documentation on how to register modules in general so that the search engine starts to index them?

I am really keen on using OpenFTS, especially after I was successful installing it 😊. But so far it only indexes the acs-notes. Have a look at http://unido-dev.lanifex.com/search

Another question I have is how PostgreSQL performs if binary files are stored in the DB. I have to decide whether the file-storage package should store the binary data in the file system or in the DB. I have around 1.6 GB of data so far that has to be migrated.

Performance of Postgres itself is fine. However, due to the lack of any reasonable I/O interface to large objects, we chose to encode binary data before placing it in the database (the chunks have to be inserted as strings, hence the encoding step).

So pulling data in and out of the database and up the pipe to your user's browser can be a bit expensive.
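To make the cost concrete, here is a minimal sketch of chunked encoding. It assumes base64 as the encoding and a 3 KB chunk size; neither is stated in this thread, so treat both as illustrative rather than as what file-storage actually does.

```python
import base64
from typing import Iterable, Iterator

# Assumed chunk size. A multiple of 3 keeps every chunk except possibly
# the last free of base64 padding characters.
CHUNK_BYTES = 3 * 1024


def encode_chunks(data: bytes, chunk_bytes: int = CHUNK_BYTES) -> Iterator[str]:
    """Yield text chunks that could be inserted into the database as strings."""
    for start in range(0, len(data), chunk_bytes):
        yield base64.b64encode(data[start:start + chunk_bytes]).decode("ascii")


def decode_chunks(chunks: Iterable[str]) -> bytes:
    """Rebuild the original bytes from the stored text chunks."""
    return b"".join(base64.b64decode(chunk) for chunk in chunks)


payload = bytes(range(256)) * 40          # 10240 bytes of sample binary data
chunks = list(encode_chunks(payload))
encoded_size = sum(len(chunk) for chunk in chunks)
```

With base64 the stored text is about a third larger than the raw bytes, and every download has to decode each chunk again, which is where the extra expense comes from.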

Backing up if the data's in PG is simpler because pg_dump will guarantee consistency.

If you store the data in the filesystem you'll need to pg_dump the database and tar the file data. To be absolutely certain of getting a consistent backup you'd probably want to shut down during the backup.
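A rough sketch of the two-step backup for filesystem storage follows. The database name, content directory, and output directory are hypothetical placeholders, not values from this thread, and the script assumes the site has already been stopped so that the dump and the archive match.

```python
import subprocess
from pathlib import PurePosixPath
from typing import List


def build_backup_commands(db_name: str, content_root: str, out_dir: str) -> List[List[str]]:
    """Return the pg_dump and tar command lines without running them."""
    out = PurePosixPath(out_dir)
    root = PurePosixPath(content_root)
    # Plain SQL dump of the database written to a file.
    dump_cmd = ["pg_dump", "-f", str(out / f"{db_name}.sql"), db_name]
    # Archive the content directory, using a path relative to its parent.
    tar_cmd = ["tar", "-czf", str(out / "file-storage.tar.gz"),
               "-C", str(root.parent), root.name]
    return [dump_cmd, tar_cmd]


def run_backup(db_name: str, content_root: str, out_dir: str) -> None:
    """Run both steps in order; stop at the first failure."""
    for cmd in build_backup_commands(db_name, content_root, out_dir):
        subprocess.run(cmd, check=True)


# Hypothetical names and paths, for illustration only.
commands = build_backup_commands("mydb", "/srv/site/content-repository", "/backups")
```

If the data lives in PG instead, only the pg_dump step is needed.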

There are tradeoffs either way, which is why we added code to allow you to choose which storage method to use.

Gregor, I'm out of town so I can't help much without the computer I usually work on. If you want to, you can check the first messages from my profile -- there's enough info there about how the search and openfts-driver packages relate. Basically, yes, you can index files as well as content in the database with the current search packages. The search package is very general and the only thing you need to do is write an implementation of the FtsContentProvider contract. OpenFTS-driver only provides an implementation of the search engine contract -- if you want, you could make the search package work with some other engine (htsearch, swish, etc.) by writing an implementation of the search engine contract.

I'm sorry that I cannot be of much help now. If you can wait until the end of next week, ping me again about this and I'll try to provide you with more information.
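The split described above, with one contract for packages that supply content and another for the engine that indexes it, can be pictured with a small Python analogy. This is not the OpenACS service contract API: every class and method name below is invented for illustration, and the in-memory engine merely stands in for OpenFTS, htsearch, or swish.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Tuple


@dataclass
class Datasource:
    """What a content provider hands to the search package (hypothetical shape)."""
    object_id: int
    title: str
    content: str


class ContentProvider(Protocol):
    """Role played by a package such as file-storage."""
    def datasource(self, object_id: int) -> Datasource: ...


class SearchEngine(Protocol):
    """Role played by an engine driver such as openfts-driver."""
    def index(self, doc: Datasource) -> None: ...
    def search(self, query: str) -> List[int]: ...


class FileStorageProvider:
    def __init__(self, files: Dict[int, Tuple[str, str]]) -> None:
        self._files = files  # object_id -> (title, text content)

    def datasource(self, object_id: int) -> Datasource:
        title, content = self._files[object_id]
        return Datasource(object_id, title, content)


class InMemoryEngine:
    def __init__(self) -> None:
        self._docs: Dict[int, str] = {}

    def index(self, doc: Datasource) -> None:
        self._docs[doc.object_id] = f"{doc.title} {doc.content}".lower()

    def search(self, query: str) -> List[int]:
        terms = query.lower().split()
        # Naive substring match on every term; a real engine does far more.
        return sorted(oid for oid, text in self._docs.items()
                      if all(term in text for term in terms))


def index_object(provider: ContentProvider, engine: SearchEngine, object_id: int) -> None:
    """The generic search layer only ever talks to the two interfaces."""
    engine.index(provider.datasource(object_id))


provider = FileStorageProvider({1: ("report.txt", "annual budget"),
                                2: ("notes.txt", "meeting minutes")})
engine = InMemoryEngine()
for oid in (1, 2):
    index_object(provider, engine, oid)
```

Swapping the engine means writing another class with the same two methods; the provider side does not change.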

Now I suppose by performing the select statements in that file you are able to register the content in the file-storage with the OpenFTS search engine.

You also need to install the binding "file_storage_object" that file-storage registered (in the acs-service-contract admin page). Search on file-storage is working for me (OpenFTS 0.2). Let me know how I could help.

Don, Neophytos & Jowell, thanks for the help. Today I have a meeting on the customer site. We will see what they decide. Personally I don't like the idea of users uploading arbitrary binary data into the database, but we will see. Thank you for offering your help and for the outstanding response!

Gregor,

I have some more examples of service contract implementations for Search and OpenFTS. I am testing them on the new openacs.org site. If you want to see them let me know.

Dear Dave,

that would be great! I just returned from the customer meeting where we also discussed OpenFTS. At the moment the editor has to enter keywords manually, but the list is poorly maintained. So he asked whether we could send the search term to a thesaurus and perform the search based on the results. He needs this because, for example, UN is the English acronym for United Nations, but in German it is UNO and in Spanish it is ONU. Whenever a Spanish user searches for ONU he should also get back documents that use UN or UNO, etc.
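A minimal sketch of that idea: expand each search term through a thesaurus before matching. The synonym set comes from the UN/UNO/ONU example above; the function names and the whitespace-based matching are assumptions for illustration and do not use any OpenFTS call.

```python
from typing import List, Set

# One synonym group per concept, stored in lowercase.
SYNONYM_SETS: List[Set[str]] = [{"un", "uno", "onu"}]


def expand_term(term: str) -> List[str]:
    """Return the term together with all of its known synonyms."""
    key = term.lower()
    for group in SYNONYM_SETS:
        if key in group:
            return sorted(group)
    return [key]


def expand_query(query: str) -> List[List[str]]:
    """Expand every whitespace-separated term of the query."""
    return [expand_term(term) for term in query.split()]


def matches(text: str, expanded: List[List[str]]) -> bool:
    """True if, for each query term, the text contains it or one of its synonyms."""
    words = set(text.lower().split())
    return all(any(alt in words for alt in alternatives)
               for alternatives in expanded)


expanded = expand_query("ONU desarrollo")
```

Here `matches("Informe de la UN sobre desarrollo", expanded)` is true even though the document never contains "ONU". In practice the expanded alternatives would be passed to the search engine as an OR group rather than matched in Python.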

Well we will see the outcome of the development.

Thanks for your help!