Forum OpenACS Q&A: Re: .LRN Search thoughts and architecture - Please give feedback

Hi,

Dave wrote:

These are important issues to consider for the
implementation of a specific search driver.

We've just finished a timesheet billing system for Project/Open ahead of schedule, so we are now going forward with the implementation of a search engine. I've looked around a bit more to come up with a concrete architecture.

Here is what I've found:

- The "search" module together with the "tsearch2-driver" module contains a total of 695 lines of Tcl and SQL code.
- It was surprisingly difficult to enable searching for a simple Project/Open object such as a "Project".
- The "txt" table has a really ugly name and only contains two columns. However, we would need a lot more search-relevant information.
- The current API is not sufficient for our purposes. We would need to pass around a lot more information between the application modules and the search module.
- Returning a list of lists for search results doesn't suit the needs of P/O, in particular because permission checking is extremely performance critical. However, permission checking is specific to each business object, so we will need to mix ("tightly integrate") application specific code with code from the "generic search engine driver". Creating a generic API for this purpose will be very complex.
- We will need support for the "popularity" of objects (for example by analyzing web server logs or by counting the number of "hits" on the object). I haven't seen any support for this in the current code.
- I don't really understand why search needs the "observer queue", separating (in terms of execution time and application context) the search packages from the business logic. Updates of the "txt" table should be pretty fast, right?
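To make the "popularity" point a bit more concrete, here is a rough sketch in Python (just for illustration, this is not OpenACS or P/O code). It counts hits per object and reorders a result list so that frequently viewed objects come first. The class name, the method names and the (object_id, object_type) key are all my own assumptions, nothing like this exists in the current packages as far as I can see.

```python
from collections import Counter


class PopularityTracker:
    """Hypothetical hit counter keyed by (object_id, object_type)."""

    def __init__(self):
        self.hits = Counter()

    def record_hit(self, object_id, object_type):
        # Would be called whenever an object's page is viewed,
        # or fed from a web server log analysis job.
        self.hits[(object_id, object_type)] += 1

    def rank(self, results):
        # Most-viewed objects first. sorted() is stable, so objects with
        # the same hit count keep the order the search engine gave them.
        return sorted(results, key=lambda key: -self.hits[key])
```

In a real system the counters would obviously live in the database and be combined with the text relevance score rather than replacing it.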

Conclusions:

- We will need to add search performance optimization based on each object's "business object container" (typically the object's project_id). The reasoning goes like this: if the user doesn't have permissions on the project, then he won't have permissions on the "contained" objects such as discussions and documents. For this purpose we will have to add an additional "container_object_id" column to the "txt" table.
- "Filestorage documents" and "forum items" in P/O are not OpenACS objects (for performance reasons). However, they are also the most interesting objects to search for. So we will need a "txt" table with a primary key composed of object_id and object_type.
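The two conclusions above could be sketched roughly like this. It is Python with an in-memory SQLite database only to keep the example self-contained; the real thing would be Tcl and PostgreSQL with tsearch2. The "body" column, the function names and the can_read_container callback are assumptions of mine, and the LIKE match is just a stand-in for a real full-text query.

```python
import sqlite3


def make_index():
    """Create a "txt"-like table keyed by (object_id, object_type)."""
    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE txt (
            object_id           INTEGER NOT NULL,
            object_type         TEXT    NOT NULL,
            container_object_id INTEGER,   -- typically the project_id
            body                TEXT    NOT NULL,
            PRIMARY KEY (object_id, object_type)
        )""")
    return db


def index_object(db, object_id, object_type, container_object_id, body):
    # Insert the row, or overwrite it if the object is already indexed.
    db.execute(
        "INSERT OR REPLACE INTO txt VALUES (?, ?, ?, ?)",
        (object_id, object_type, container_object_id, body),
    )


def search(db, term, can_read_container):
    """Return (object_id, object_type) hits the user is allowed to see.

    can_read_container is a hypothetical callback that does the expensive
    permission check; it runs once per container, not once per hit.
    """
    rows = db.execute(
        "SELECT object_id, object_type, container_object_id FROM txt "
        "WHERE body LIKE ? ORDER BY object_type, object_id",
        (f"%{term}%",),
    ).fetchall()
    allowed = {}  # container_object_id -> cached permission result
    hits = []
    for object_id, object_type, container_id in rows:
        if container_id not in allowed:
            allowed[container_id] = can_read_container(container_id)
        if allowed[container_id]:
            hits.append((object_id, object_type))
    return hits
```

The composite key lets a forum item and a file share the same numeric id without colliding, and caching the check per container means a project with thousands of matching documents costs one permission lookup instead of thousands.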

I hope the reasons above make it clear why we are probably not going to go with the "search" and "tsearch2-driver" packages for P/O searching. Instead, we are probably going to build a P/O specific search package based on the code of these two modules, but implementing all the additional stuff that we need.

The resulting search package is - obviously - going to be GPLed, so I hope that it may serve as an input to a future (generic?) OpenACS search package.

I'm not at all sure about these conclusions. I'm just writing in a relatively "provocative" style (you may already have noticed it from previous postings...). So please argue with me and prove that I'm wrong.

I would be delighted... :)

Bests,
Frank