Forum OpenACS Development: Project to build a new FtsEngineDriver using tsearch2 for PostgreSQL

I have been looking into what is going on with OpenFTS and tsearch2. tsearch2 offers in-database ranking and headline functions. Also, since it is included with PostgreSQL 7.4 and tsearch1 will not be included with PostgreSQL 7.5, it's about time to look at writing a search driver for tsearch2.

One issue is that the standard tsearch2 install parses the text inside the database. So, to index items that are stored in the filesystem, the text content to be indexed needs to be stored in the database as well. Because the installation process would be much simpler than compiling the nsfts.so driver for the external parser, and because nsfts.so is currently not compatible with tsearch2, I think storing the indexed content in the database is a good first implementation.

Some installations might find storing duplicate text content for filesystem items in the database to be a problem. In that case, a different tsearch2-compatible driver with an external parser can be built.

I have discussed these ideas with Paul Doerwald and Dirk Gomez. Anyone else who is interested in improving the search capabilities for OpenACS, for PostgreSQL or Oracle, let us know by replying to this thread. If you just have a comment or idea, post it here.

More to come.

To support other non-text formats we would have needed a text version of all indexed non-text content anyway, at least with OpenFTS. With the current architecture it runs the conversion to text every time an item shows up as a match on the search results page, and that is not practical for documents with expensive conversion processes, e.g. PDFs. With a text copy in the db, the search results page wouldn't need to be built as streaming HTTP with ns_write. What a relief!
Storing a text version of the document in the database also allows us to send emails with part of the content, or have the search page look like Google (with a small teaser for each document).
comment: It would be nice to have a scoping feature, for example for users who want to search the docs but not the forums.
What if we were to "repurpose" the 'txt' table to store the full text plus any metadata, such as the scope (object type?) as Torben suggested, the context_id, the object_id, etc.?

We would just need a new driver (service contract?) to stuff new data into the txt table and we'd need to rewrite the search pages.

I'm not an expert on drivers and service contracts, but I have used OpenFTS/tsearch since 0.32 so I would be very happy to contribute whatever I can. Perhaps I could rewrite the search pages to be tsearch2 friendly?

Paul,

Right now the txt table includes the object_id (tid) and the indexed terms.

To expand that we might want to store the text to be indexed, but I am not sure how well duplicating it will work. For example, cr_revisions already stores the text of each revision in cr_revisions.content.

We definitely need an efficient way to get the text to generate headlines when returning search results. Doing a separate query for each row really doesn't work.

So, I guess a first draft of the tsearch search might be something like this:

create table txt (
    object_id integer references acs_objects,
    content text,
    tsv tsvector
);

The content to be indexed would be stored in the content column, and the tsvector built from that content would be stored in the tsv column.
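
As a rough sketch of how that table could be used (assuming the tsearch2 contrib module is loaded and its stock 'default' configuration is in place; the index name and the search terms are made up for illustration), the tsv column can be filled from content, and ranking and headlines can then come back in one query instead of one query per row:

```sql
-- Build the tsvector from the stored text.
update txt set tsv = to_tsvector('default', coalesce(content, ''));

-- A GiST index on the tsvector column speeds up the @@ match.
-- (txt_tsv_idx is a hypothetical name.)
create index txt_tsv_idx on txt using gist (tsv);

-- Rank and limit in the inner query, then generate headlines only
-- for the rows that will actually be shown.
select object_id, score, headline('default', content, q) as headline
from (select t.object_id, t.content, q, rank(t.tsv, q) as score
      from txt t, to_tsquery('default', 'search & driver') as q
      where t.tsv @@ q
      order by score desc
      limit 10) as hits
order by score desc;
```

The subselect is there because headline() is comparatively expensive, so it makes sense to call it only for the rows that survive the limit.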

For cr_items it would only contain the live revision of the content to be indexed, so other revisions would not be copied into the txt table.

I think we want to get most of the other attributes from the acs_objects table if possible.

One important point is that tsearch2 can assign 4 different weights to parts of the document. So the title, description, and content, as well as other metadata such as author and categories/keywords, could be assigned different weights. This should probably be configurable somehow. Should there be 4 columns, one for each part (A, B, C, D) to be parsed by tsearch2, instead of one column just for "content"?
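
To make the weighting idea concrete, here is a minimal sketch using tsearch2's setweight() and the || concatenation operator. The table txt_parts and its title, description, keywords, and content columns are hypothetical, as is the mapping of parts to letters; nothing here is a decided design:

```sql
-- Hypothetical layout: one text column per weighted part.
create table txt_parts (
    object_id   integer references acs_objects,
    title       text,
    description text,
    keywords    text,
    content     text,
    tsv         tsvector
);

-- Label each part with a weight class and concatenate the results
-- into a single tsvector that can be searched and ranked as usual.
update txt_parts
set tsv = setweight(to_tsvector('default', coalesce(title, '')), 'A')
       || setweight(to_tsvector('default', coalesce(description, '')), 'B')
       || setweight(to_tsvector('default', coalesce(keywords, '')), 'C')
       || setweight(to_tsvector('default', coalesce(content, '')), 'D');
```

Either way the search itself still runs against one tsv column; the separate text columns only matter when the vector is built.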

Just read through most of the tsearch2 documents, and it looks like a very good solution.  From the documents, it looks like they tried to make it pretty straight-forward to build a featureful search solution.

I love how searches are done with normal select statements, which makes it fairly easy to add other "where" clauses to implement functionality like scoping.
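
For example, scoping could be a join against acs_objects with one extra condition. This is only a sketch against the draft txt table above; the 'forums_message' object type and the search terms are illustrative values, not something the driver defines:

```sql
-- Search everything except forum messages ("docs, not the forums").
select t.object_id, rank(t.tsv, q) as score
from txt t, acs_objects o, to_tsquery('default', 'install & guide') as q
where t.tsv @@ q
  and o.object_id = t.object_id
  and o.object_type <> 'forums_message'  -- hypothetical scope filter
order by score desc;
```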

This appears to be a very useful project.  My employer has needs for it as well, so I'm anxious to help.

Dave,

Thanks for reminding me about the A, B, C and D rankings.

It's definitely a good idea to split these 4 rankings (+ fulltext) into 5 db columns. It's important to note that right now the FTS interface only supports 2 columns (afaicr): title and text. We'll have to keep a backwards-compatible 2-field interface as well as providing a 5-field interface.

In my experience, I tend to use the B ranking for most titles. That keeps the A ranking free as a "trump card" when needed. I find it helpful to use it when I know I have a certain class of results that I want to sort to the top or if I want to give a bit more priority to content that has only one line (such as links) of text versus content that has 5K of text.
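
If the default spacing between the letters isn't enough for that kind of "trump card", rank() also accepts an explicit weights array, given in {D, C, B, A} order. A small sketch against the draft txt table, with made-up numbers and search terms:

```sql
-- Keep A at full weight and push B, C, and D down so that A matches
-- sort clearly above everything else. The values are illustrative only.
select t.object_id,
       rank('{0.05, 0.1, 0.3, 1.0}', t.tsv, q) as score
from txt t, to_tsquery('default', 'search & driver') as q
where t.tsv @@ q
order by score desc;
```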

Do you think we need to assign semantic names to A, B, C, and D, or should we leave them with those nonsemantic names? Personally I haven't been able to think of suitable names, so I might just leave them as A, B, C, and D. The letters probably make enough sense to a programmer anyway and they're the only ones who are going to be dealing with this code.

... and what do you think about Torben's "scope" suggestion (i.e. support for metadata)?

It's definitely the kind of thing that advanced search would want, and ACS has historically had an aversion to advanced searches, but I've found that very often my customers have wanted advanced search. We should support it even if we don't implement it right away.

Do you have any thoughts on how we could support (or implement) advanced searches? Or should we just leave that up to the programmer to expand the search system?

Paul.

I have checked an initial tsearch2-driver into HEAD.

It supports the existing features of the search package.