Here is an update on the work being done to
integrate OpenFTS with OpenACS:
OpenFTS is a PostgreSQL-based search engine that
makes use of the GiST
interface available in PostgreSQL. It provides online full text
indexing of data and relevance ranking for database searching. It can
be used to find documents containing terms with the same linguistic
root as the specified word and it can also be used for
indexing/searching of multi-lingual and non-text documents. Currently,
OpenFTS is implemented as a collection of Perl scripts.
We are working with Oleg Bartunov and Teodor Sigaev (XWare) to help them open source their
search tools under the GPL. Dan is currently porting the
Perl scripts to Tcl, and he is going to move some of the
functionality into an AOLserver module.
OpenFTS uses PostgreSQL as a database backend where documents are
stored as arrays of integers. The index access structure for the array
of integers is constructed as an RD-Tree which is implemented using
the GiST interface that is available in PostgreSQL. The RD-Tree is a
variant of the R-Tree, a popular access method for spatial data. RD
stands for "Russian Doll", which describes the transitive containment
relation that is fundamental to the tree structure. The RD-Tree data
structure implementation provides three predicates between sets:
superset, subset, and overlap.
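As a rough illustration of the three predicates and of the "Russian Doll" containment idea, here is a minimal Python sketch. It is not the GiST implementation; the function names are hypothetical, and it assumes only that each document is a collection of integer lexeme IDs and that an inner tree node carries the union of the sets below it.

```python
def is_superset(a, b):
    """True if every element of b is also in a."""
    return set(a) >= set(b)


def is_subset(a, b):
    """True if every element of a is also in b."""
    return set(a) <= set(b)


def overlaps(a, b):
    """True if a and b share at least one element."""
    return not set(a).isdisjoint(b)


def bounding_set(children):
    """Union of the child sets: the key an inner node would carry (assumption)."""
    result = set()
    for child in children:
        result |= set(child)
    return result


doc_a = [3, 7, 11]
doc_b = [7, 20]
node = bounding_set([doc_a, doc_b])

# A query can only match below a node whose bounding set contains it,
# so the subtree is visited for [7, 11] but skipped for [7, 99].
visit = is_superset(node, [7, 11])
skip = not is_superset(node, [7, 99])
```

Because a node's set contains every set beneath it, a failed superset test at the node rules out all of its descendants at once, which is what makes the containment relation useful for searching.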
For indexing, a parser reads the document and converts it
into a stream of lexemes. Morphology or stemming is then applied to
reduce each lexeme to its base form, and finally an algorithm calculates
an integer ID for each lexeme. The resulting array of integers is stored
in the database.
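The indexing steps above can be sketched in Python as follows. Everything here is an assumption made for illustration: the tokenizer is a plain regular expression, the stemmer is a toy suffix stripper standing in for real morphology, and CRC32 is used as the ID algorithm only because it is readily available. The post does not say which algorithm OpenFTS actually uses.

```python
import re
import zlib


def parse(text):
    """Split text into lowercase alphanumeric lexemes (simplified parser)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(lexeme):
    """Toy stemmer: strip a few English suffixes (stand-in for morphology)."""
    for suffix in ("ing", "es", "s"):
        if lexeme.endswith(suffix) and len(lexeme) - len(suffix) >= 3:
            return lexeme[: -len(suffix)]
    return lexeme


def lexeme_id(base_form):
    """Map a base form to an integer ID (CRC32 is an assumption)."""
    return zlib.crc32(base_form.encode("utf-8"))


def index_document(text):
    """Return the sorted array of distinct integer IDs for a document."""
    return sorted({lexeme_id(stem(lexeme)) for lexeme in parse(text)})


ids = index_document("Searching searches")
```

With this toy stemmer, "Searching" and "searches" both reduce to "search", so the example document is stored as a single integer.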
When a search query is received, the parser converts it
into a stream of lexemes, and morphology or stemming is
applied to get the base forms. Each lexeme is then assigned an
integer ID, and finally SQL queries are generated and executed.
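To make the last step concrete, here is a hedged Python sketch of turning the query's integer IDs into a parameterized SQL statement. The table name `documents`, the columns `doc_id` and `lexeme_ids`, and the helper `build_search_query` are all hypothetical; the sketch also assumes the array "contains" operator `@>` and a driver that uses `%s` placeholders. The queries OpenFTS really generates may look quite different.

```python
def build_search_query(query_ids, table="documents", column="lexeme_ids"):
    """Return (sql, params) selecting documents that contain every query ID.

    Table and column names are illustrative assumptions, not OpenFTS's schema.
    """
    sql = f"SELECT doc_id FROM {table} WHERE {column} @> %s"
    params = (sorted(set(query_ids)),)
    return sql, params


sql, params = build_search_query([42, 7, 42])
```

The IDs are passed as a bound parameter rather than pasted into the SQL text, so the statement stays the same whatever the user typed.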
A prominent feature of OpenFTS is the
ability to rank documents according to the proximity between the words of
the search query.
This is accomplished by maintaining coordinate information for the
lexemes of each document.
For example, if the query is "full text search", documents containing
the phrase "full text search" will be ranked higher than documents where
the words "full", "text", and "search" occur in different places.
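One simple way to picture proximity ranking is to score a document by the shortest span of positions that covers every query word. The Python sketch below does that; it is an illustrative assumption, not the ranking formula OpenFTS uses, and the function names are hypothetical. It works on plain word lists rather than integer IDs to keep the example readable, and it ignores word order.

```python
def min_window(tokens, query):
    """Length of the shortest token span containing every query word, or None."""
    needed = set(query)
    # Keep only the positions (coordinates) of words that appear in the query.
    hits = [(i, tok) for i, tok in enumerate(tokens) if tok in needed]
    best = None
    counts = {}
    left = 0
    for right in range(len(hits)):
        word = hits[right][1]
        counts[word] = counts.get(word, 0) + 1
        # Shrink the window from the left while it still covers all words.
        while len(counts) == len(needed):
            span = hits[right][0] - hits[left][0] + 1
            if best is None or span < best:
                best = span
            left_word = hits[left][1]
            counts[left_word] -= 1
            if counts[left_word] == 0:
                del counts[left_word]
            left += 1
    return best


def proximity_score(tokens, query):
    """Score in (0, 1]: 1.0 when the query words are adjacent, 0.0 if any is missing."""
    span = min_window(tokens, query)
    return 0.0 if span is None else len(set(query)) / span


query = ["full", "text", "search"]
phrase_doc = "full text search engine".split()
scattered_doc = "full of ideas about text and how to search".split()
```

Here `proximity_score(phrase_doc, query)` is higher than `proximity_score(scattered_doc, query)`, matching the ordering described above.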
Information about GiST support in PostgreSQL can be found at
http://www.sai.msu.su/~megera/postgres/gist/