Neophytos,
Thanks for the update.
I would like to read a bit more about the text indexing procedure, but the papers seem to deal more with general indexing issues. Do you use any weighting scheme, so that a term that appears often weighs more than one that appears rarely?
For indexing, a parser is used that reads the document and converts it into a stream of lexemes. Then, morphology or stemming is
applied in order to get the base form and finally, an algorithm calculates an ID for each of the lexemes. The resulting array of
integers is stored into the database.
Which indexes? The IDs or the weights (occurrences?) of the terms?
Do you use your own stemming algorithms? Any stopwords?
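
To make the questions concrete, here is a rough sketch of how I picture the pipeline you describe (parse, stem, compute an ID), with raw occurrence counts kept as weights. Everything in it is my assumption, not your implementation: the names `STOPWORDS`, `toy_stem`, `lexeme_id` and `index_document` are made up, the stemmer is a crude suffix stripper, and `zlib.crc32` only stands in for whatever algorithm calculates your IDs.

```python
import re
import zlib
from collections import Counter

# Assumed toy stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def toy_stem(word):
    """Crude suffix stripping, standing in for real morphology or stemming."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip if at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lexeme_id(lexeme):
    """Map a lexeme to an integer ID (CRC32 is just a placeholder choice)."""
    return zlib.crc32(lexeme.encode("utf-8"))


def index_document(text):
    """Return {lexeme ID: number of occurrences} for one document."""
    tokens = re.findall(r"[a-z]+", text.lower())                 # parse into lexemes
    stems = [toy_stem(t) for t in tokens if t not in STOPWORDS]  # drop stopwords, stem
    counts = Counter(stems)                                      # occurrences per stem
    return {lexeme_id(stem): n for stem, n in counts.items()}


print(index_document("The indexing of indexed documents"))
```

Here "indexing" and "indexed" collapse to the same stem, so that ID gets a count of 2. If you store only the array of integers, the counts are lost, which is why I am asking whether the weights are kept anywhere.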
cheers