Forum OpenACS Development: Response to Full Text Search

Posted by Rafael Calvo on
Neophytos,
Thanks for the update.
I would like to read a bit more about the text indexing procedure, but the papers seem to cover general indexing issues. Do you use any weighting scheme, so that a term that appears often carries more weight than one that appears rarely?
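To make the question concrete, here is a minimal sketch of the kind of weighting I have in mind: plain relative term frequency. The function name `term_weights` is my own and is not taken from your code; a real scheme would probably also discount terms that are common across all documents.

```python
from collections import Counter


def term_weights(terms):
    """Return each term's share of all term occurrences in one document.

    Hypothetical helper: it only illustrates frequency-based weighting.
    """
    counts = Counter(terms)
    total = sum(counts.values())
    # A term that appears more often gets a proportionally larger weight.
    return {term: count / total for term, count in counts.items()}


weights = term_weights(["search", "index", "search", "search"])
```

Here "search" makes up three of the four occurrences, so it would weigh three times as much as "index".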

For indexing, a parser is used that reads the document and converts it into a stream of lexemes. Then, morphology or stemming is applied in order to get the base form and finally, an algorithm calculates an ID for each of the lexemes. The resulting array of integers is stored into the database.
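As I read it, the steps in that description could be sketched roughly as below. This is only my guess at the shape of it: the regex tokenizer, the toy suffix stripping, and the CRC32-based IDs are all assumptions for illustration, not necessarily what you actually do.

```python
import re
import zlib

# Toy suffix list standing in for real morphology or stemming (assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Parser step: turn the document into a stream of lowercase lexemes."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(lexeme):
    """Stemming step: strip one known suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if lexeme.endswith(suffix) and len(lexeme) - len(suffix) >= 3:
            return lexeme[: -len(suffix)]
    return lexeme


def lexeme_id(base):
    """ID step: map a base form to an integer (CRC32 is an assumed choice)."""
    return zlib.crc32(base.encode("utf-8"))


def index_document(text):
    """Return the sorted, de-duplicated array of lexeme IDs for a document."""
    return sorted({lexeme_id(stem(token)) for token in tokenize(text)})


ids = index_document("Indexing indexed indexes")
```

In this sketch all three words reduce to the base form "index", so the stored array holds a single integer and no occurrence counts survive, which is what prompts the question that follows.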

Which does it index: the IDs or the weights (occurrences?) of the terms? Do you use your own stemming algorithms? Any stopwords?
cheers