Paul,
Right now the txt table includes the object_id (tid) and the indexed terms.
To expand that we might want to store the text to be indexed, but I am not sure how much duplicating that will work. For example, cr_revisions store the text of the revisions in cr_revisions.content.
We definitely need an efficient way to get the text to generate headlines when returning search results. Doing a seperate query for each row really doesn't work.
So, I guess a first draft of the tsearch search might be something like this:
create table txt (
object_id integer references acs_objects,
content text,
tsv tsvector
);
And content to be indexed would be stored in the content column, and the tsvector index for the content would be stored in the tsv column.
For cr_items it would only contain the live revision of the content to be indexed, so other revisions would not be copied into the txt table.
I think we want to get most of the other attributes from the acs_objects table if possible.
One important point is that tsearch2 can assign 4 different weights to parts of the document. So the title, description, and content, as well as other metadata such as author, categories/keywords could be assigned to different weights. This should probably be configurable somehow. Should there be 4 coluns for each different part, A, B, C, D to be parsed by tsearch2 instead of one column just for "content"?