Forum OpenACS Q&A: Summary of Search discussion at .LRN User Meeting in Madrid May 2005

In attendance: Dave Bauer, Al Essa, Rob Denison, Peter Alberer, and a few other people whose names I unfortunately do not remember. If you were there, please reply to this thread.

Some notes on search as discussed in Madrid:

- How do you detect the language of the search query? The most
straightforward way is to use [ad_conn locale] and assume the
current locale is set correctly. This only matters if the search
engine has multiple dictionaries. Another possibility is to try
detecting the language from the search terms themselves by comparing
them against a dictionary.

- How do you detect the language of content to be indexed? The
simplest method is to require the content author or editor to
specify the language, or to use [ad_conn locale]. Peter Alberer of
Vienna suggested it might be possible to detect the language based
on the content.

- Vienna is using the optional trigram add-on to tsearch2 to find
similar words in addition to stemming. This can be used for
Google-like suggestions of similarly spelled search terms.

- Suggestions for additional ways to organize and rank search results
included grouping results by language, object_type, or category, or
grouping by container object, i.e. content folder, subsite, or
package_id.

- Jeff Davis suggested adding an additional ranking component based on
object type. Administrators would be able to assign relative weights
to different object types; for example, forum posts could be weighted
higher than edit-this-page content pages.

- An Amazon.com-like search scope selection should be available. For
example, in .LRN, while viewing "My Space" the scope would default to
all objects readable by the user; in a class portal, the default
scope would be all objects within that class; and while viewing
file-storage, the scope would default to that particular file-storage
package instance. So there would be three scopes: site-wide, community
(.LRN or subsite), and package instance. Another possibility might be
package type, i.e. search all forums, or all file-storage instances
the user can read.

- It might be possible to take the user-submitted query, process it
with the to_tsquery function, and compare the result with the
original query. Any terms missing from the result should be stopwords
that are not used in the search. This list of stopwords could be
provided to the user as feedback. (This needs to be tested to see if
it works.)

- tsearch2 has a lot of configuration options. It should be possible
to provide a web interface for administration of some of these
options to improve search performance.

- Peter Alberer mentioned that the character set support of external
programs used to convert binary content to text needs to be evaluated.

- Vienna has a feature that can suggest similar searches based on
common search terms. We did not discuss the implementation details. It
should be possible to store the most popular search queries and rank
them for similarity.
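
A rough sketch of the trigram idea mentioned above, in Python. This is not Vienna's implementation and not the add-on's code; the function names (trigrams, similarity, suggest), the padding scheme, and the 0.3 threshold are assumptions chosen for illustration. It scores two words by the share of three-character sequences they have in common, which could be used to suggest similarly spelled terms or to compare stored queries.

```python
def trigrams(word):
    """Return the set of 3-character substrings of a padded, lowercased word.

    Padding with two leading spaces and one trailing space (an assumption
    here) gives extra weight to the start and end of the word.
    """
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams, between 0 and 1."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def suggest(term, vocabulary, threshold=0.3, limit=5):
    """Return up to `limit` words from `vocabulary` that resemble `term`.

    `vocabulary` stands in for the indexed words or the stored popular
    queries; `threshold` is an arbitrary cut-off that would need tuning.
    """
    scored = [(similarity(term, w), w) for w in vocabulary if w != term]
    scored = [(s, w) for s, w in scored if s >= threshold]
    # Best match first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [w for _, w in scored[:limit]]


if __name__ == "__main__":
    print(suggest("serch", ["search", "forum", "storage"]))  # ['search']
```

In a real installation the comparison would more likely be done inside the database than in application code; the sketch is only meant to show why misspellings still score well.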

Once we have a baseline search package that works for Oracle and PostgreSQL, I look forward to improving search with some of these features.
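
To make the object-type weighting suggestion above a bit more concrete, here is a minimal Python sketch. Nothing in it comes from an existing package: the rerank function, the (object_id, object_type, base_rank) tuple layout, the object type names, and the weights are all hypothetical. It assumes the search engine has already produced a base rank per result and simply multiplies that rank by an administrator-assigned weight for the object type.

```python
DEFAULT_WEIGHT = 1.0  # used for object types the administrator has not weighted


def rerank(results, type_weights):
    """Re-order search results by base rank times per-object-type weight.

    results      -- list of (object_id, object_type, base_rank) tuples
    type_weights -- dict mapping object_type to a relative weight
    """
    weighted = [
        (object_id,
         object_type,
         base_rank * type_weights.get(object_type, DEFAULT_WEIGHT))
        for object_id, object_type, base_rank in results
    ]
    # Highest weighted rank first.
    return sorted(weighted, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    # Hypothetical weights: forum posts count double, ETP pages count half.
    weights = {"forums_message": 2.0, "etp_page": 0.5}
    hits = [
        (101, "etp_page", 0.6),
        (102, "forums_message", 0.4),
        (103, "file", 0.5),
    ]
    for object_id, object_type, rank in rerank(hits, weights):
        print(object_id, object_type, rank)
```

With these made-up numbers the forum post moves from last to first, which is the effect the weighting is meant to have. Whether a simple multiplication is the right way to combine the two values is an open question.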