Forum OpenACS Q&A: Re: .LRN Search thoughts and architecture - Please give feedback

Hi,

Here is a summary of the issues we found while developing a search architecture for the guys at the HP professional printer division.

Permissions:

Only a fraction of the document result set from a search query may be visible to a particular user. In the HP case we had to consider search result sets with a million to a billion document IDs. What matters in such a context is a fast way to discard documents that a user can't see, so that we avoid slow searches (hierarchical or via wildcard) over document permission hierarchies.

To tackle this issue we came up with several options. The best one for the HP context was to include a fast (denormalized and precalculated) index on "projects" (the topmost containers of documents and document trees). Checking whether a user has read permission on a document's project lets us discard 95% of all non-allowed documents in the result set.
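
A rough sketch of that pre-filter in Python, just to illustrate the idea. All names and data here (READABLE_PROJECTS, DOC_PROJECT, prefilter_by_project) are made up for the example and are not what we built for HP; in a real system both mappings would be precalculated tables in the database.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical precalculated index: user -> ids of projects the user may read.
READABLE_PROJECTS: Dict[str, Set[int]] = {
    "alice": {1, 2},
    "bob": {2},
}

# Hypothetical denormalized mapping: document id -> id of its topmost project.
DOC_PROJECT: Dict[int, int] = {100: 1, 101: 1, 200: 2, 300: 3}


def prefilter_by_project(user: str, doc_ids: Iterable[int]) -> List[int]:
    """Drop documents whose project the user cannot read.

    This is only the cheap first pass. The survivors would still need
    the full per-document permission check.
    """
    readable = READABLE_PROJECTS.get(user, set())
    # Documents with no known project are dropped as well.
    return [d for d in doc_ids if DOC_PROJECT.get(d) in readable]
```

The point is that the check per document is a dictionary lookup plus a set membership test, so it stays cheap even for very large result sets.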

Stemming and Normalization:

Big issue outside of the US. "Bäume" is the plural of "Baum" in German ("trees" and "tree"). Intermedia solves this issue by allowing the user to specify a stemming table that translates "Bäume" back to "Baum". This is pretty simple to apply at indexing time.
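
To show what I mean by applying a stemming table at indexing time, here is a minimal sketch in Python. It is not Intermedia's API; STEM_TABLE, normalize, build_index and search are hypothetical names, and the table holds a single made-up entry.

```python
from collections import defaultdict
from typing import Dict, Set

# Hypothetical user-supplied stemming table: inflected form -> base form.
STEM_TABLE: Dict[str, str] = {"bäume": "baum"}


def normalize(word: str) -> str:
    """Lowercase the word and map it to its base form if the table knows it."""
    w = word.lower()
    return STEM_TABLE.get(w, w)


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Build an inverted index keyed by the normalized word."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[normalize(word)].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], word: str) -> Set[int]:
    """Normalize the query word the same way as the indexed words."""
    return index.get(normalize(word), set())
```

Because the same normalization runs on both the indexed text and the query, a search for "Baum" and a search for "Bäume" hit the same index entry.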

Search by Proximity:

A search for "search engine" should first return documents where "search" and "engine" directly follow each other, and not a document where somebody talks about an "engine" and later about how to "search about an expert". => You'd need to include the position of words in a document in the search index.
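
A small Python sketch of such a positional index, under the assumption that words are simply split on whitespace. The function names are hypothetical, and ranking by the smallest gap between the two words is just one simple way to use the positions.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# word -> document id -> list of word positions in that document
PositionalIndex = Dict[str, Dict[int, List[int]]]


def build_positional_index(docs: Dict[int, str]) -> PositionalIndex:
    index: PositionalIndex = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index


def min_distance(index: PositionalIndex, doc_id: int,
                 first: str, second: str) -> Optional[int]:
    """Smallest gap where `second` occurs after `first`, or None if it never does."""
    firsts = index.get(first, {}).get(doc_id, [])
    seconds = index.get(second, {}).get(doc_id, [])
    gaps = [s - f for f in firsts for s in seconds if s > f]
    return min(gaps) if gaps else None


def rank_by_proximity(index: PositionalIndex, first: str, second: str) -> List[int]:
    """Documents containing both words, closest in-order occurrence first."""
    candidates = set(index.get(first, {})) & set(index.get(second, {}))
    scored = []
    for doc_id in candidates:
        gap = min_distance(index, doc_id, first, second)
        scored.append((gap if gap is not None else float("inf"), doc_id))
    return [doc_id for _, doc_id in sorted(scored)]
```

A gap of 1 means the two words are adjacent, so a document containing the phrase "search engine" sorts ahead of one that mentions both words far apart or in the wrong order.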

Ranking in General:

The current OpenACS search is extremely poor, because it doesn't do any ranking of documents. I use Google to look for stuff in www.openacs.org...
=> You'd need to provide a means for users to plug in their own ranking algorithms.
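
One way this could look, sketched in Python: the search function takes the ranking function as a parameter. This is only an illustration of the plug-in idea, not OpenACS code; Ranker, term_frequency, shortest_first and search are hypothetical names.

```python
from typing import Callable, Dict, List

# A ranker takes (query, document text) and returns a score; higher is better.
Ranker = Callable[[str, str], float]


def term_frequency(query: str, text: str) -> float:
    """Default ranker: count how often the query terms occur in the text."""
    words = text.lower().split()
    return float(sum(words.count(t) for t in query.lower().split()))


def shortest_first(query: str, text: str) -> float:
    """Alternative ranker: prefer shorter documents."""
    return -float(len(text.split()))


def search(query: str, docs: Dict[int, str],
           ranker: Ranker = term_frequency) -> List[int]:
    """Return ids of documents matching any query term, best score first."""
    terms = query.lower().split()
    hits = [d for d, text in docs.items()
            if any(t in text.lower().split() for t in terms)]
    return sorted(hits, key=lambda d: ranker(query, docs[d]), reverse=True)
```

Swapping the ranker changes the order of the results without touching the matching step, e.g. search("search", docs, ranker=shortest_first).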

Just some thoughts. Maybe you want to check out
http://www-db.stanford.edu/%7Ebackrub/google.html
in order to learn about some _real_ stuff...

Bests,
Frank