Forum OpenACS Q&A: Idea for doing site-wide-search...

OK, I have been continually frustrated by the lack of site-wide-search
on OpenACS installs, both on my own OpenACS installs and the
openacs.org site.  Here is my idea - note that I have not actually
done any work on this yet.

There is a freeware search engine called Isearch.  It does full-text
indexing, knows how to handle HTML, and has a few other nice features:
you can merge indexes (no need to regenerate everything), it is
completely free, it can handle multiple text databases, it does
relevancy ranking, etc.

Here is my idea.

1.  Use Isearch as a backend for site-wide-search.  Use wget or another
recursive web-sucker to pull down the files and place them in a directory.
Index them; then we need to translate the Isearch URL back into the
actual URL.  That in itself is, I think, relatively easy.  Maybe we can
use the Tcl code that turns arguments into sub-dirs or whatever (I think
Bas' code does this).
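To make the URL-translation part of step 1 concrete, here is a minimal sketch (Python, purely illustrative - the mirror directory and site base URL are hypothetical names, not anything Isearch or wget defines):

```python
# Sketch (an assumption, not Isearch's actual API): after wget mirrors the
# site into a local directory, a search hit comes back as a filesystem path.
# We map it back to the live URL by stripping the mirror root.
import os

MIRROR_ROOT = "/web/search-mirror"         # hypothetical wget target dir
SITE_BASE = "https://openacs.example.org"  # hypothetical site base URL

def mirror_path_to_url(path):
    """Translate a mirrored file path back to the original site URL."""
    rel = os.path.relpath(path, MIRROR_ROOT)
    return SITE_BASE + "/" + rel.replace(os.sep, "/")

print(mirror_path_to_url("/web/search-mirror/bboard/q-and-a.tcl"))
```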

2.  Now we have a problem - if we index everything on the server, some
of the URLs returned will point to places that the user CANNOT go,
because they do not have permission.  In fact, from a security
standpoint, allowing a non-privileged user even to know that
information does or does not exist is considered a bad thing.  Plus,
users will just become plain frustrated and the s-w-s will seem
"broken".

3.  So, we create separate web-crawler users, one for each group that
is defined in the system.  Then we tell the crawler to crawl the site
several times, each time pulling down the info it SHOULD see as a
member of that group.  We put the results in different directories,
naming each index after its group.
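Step 3 might look something like the following sketch (Python, illustrative only - the group names and crawler accounts are made up, and I'm assuming HTTP basic auth for simplicity; a real OpenACS site would need a cookie-based login for the crawler users):

```python
# Sketch of step 3: build one wget mirror command per group, each running
# as that group's dedicated crawler user.  All names here are hypothetical.
GROUPS = ["registered_users", "admins"]

def crawl_command(group):
    """Build the wget command that mirrors the site as this group's crawler."""
    return [
        "wget", "--mirror", "--no-parent",
        "--http-user", group + "_crawler",    # per-group crawler account
        "--http-password", "CHANGEME",
        "-P", "/web/search-mirror/" + group,  # per-group index directory
        "https://openacs.example.org/",
    ]

for group in GROUPS:
    print(" ".join(crawl_command(group)))
    # run for real with: subprocess.run(crawl_command(group), check=True)
```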

4.  Then, when someone accesses the s-w-s, we find out what group(s)
they belong to.  We then call Isearch, telling it to use the
corresponding indices: GROUP1, GROUP2, etc.  Isearch merges the
indices and returns the results.

5.  For speed (since merging might take a while) we could
automatically create merged indices from commonly used group
combinations.
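Steps 4 and 5 together amount to a small lookup: use a premerged index when the user's group combination is a common one, otherwise hand Isearch the individual group indices.  A sketch (Python, hypothetical names throughout):

```python
# Sketch of steps 4-5: pick which Isearch indices to search for a user,
# preferring a precomputed merged index for common group combinations.
# The cache contents and index names are hypothetical.
MERGED_CACHE = {
    frozenset(["registered_users", "admins"]): "merged_ru_admins",
}

def indices_for(groups):
    key = frozenset(groups)
    if key in MERGED_CACHE:   # step 5: reuse a premerged index
        return [MERGED_CACHE[key]]
    return sorted(key)        # step 4: let Isearch merge at query time

print(indices_for(["admins", "registered_users"]))
print(indices_for(["registered_users"]))
```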

6.  Problem - if a page is visible in two or more groups, then when the
indices are merged and the results returned, we may end up with two or
more links to the same content.

7.  Problem - as well, the relevancy results for that page may vary
(since two different groups will have two different sets of text to be
searched).  We need to determine how to handle such a condition -
perhaps take an average of the two rankings?  But if we do that, then
we need to sort the returned results ourselves if we are trying to
return them in order of relevancy.
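Problems 6 and 7 can be handled in one post-processing pass over the per-group result lists: collapse duplicate URLs, average their scores, and re-sort, since the averaged scores break Isearch's ordering.  A sketch (Python, with a made-up (url, score) result shape):

```python
# Sketch of steps 6-7: merge per-group result lists, collapsing duplicate
# URLs, averaging their relevancy scores, and re-sorting ourselves.
# The (url, score) tuples are an assumed data shape, not Isearch output.
from collections import defaultdict

def merge_results(*result_lists):
    scores = defaultdict(list)
    for results in result_lists:
        for url, score in results:
            scores[url].append(score)
    # average the rankings a page got from different group indices
    merged = [(url, sum(s) / len(s)) for url, s in scores.items()]
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged

group1 = [("/bboard/q1", 0.9), ("/news/item2", 0.5)]
group2 = [("/bboard/q1", 0.7)]
print(merge_results(group1, group2))
```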

What do you think?  Is anyone else working on this kind of thing?

./patrick

Posted by Ben Adida on
We're definitely working on that. The solution you mention has a lot of issues, especially given the way that the data is retrieved from the DB for indexing. It seems extremely complex, and it makes it hard to figure out which record was returned by a given search. Don Baccus is working on a number of solutions to this, and I think you should discuss this with him (and on this bboard is fine, too).
Posted by Don Baccus on
I've already got a simple keyword-style search hack working on my laptop (did it yesterday afternoon at the 'ole coffeeshop over an iced Americano).  It is a redo of the original search kludge used on photo.net and in the ACS until Context/Intermedia came along, using a simple ranking procedure (written at the moment in PL/pgTCL) called from a query on the bboard.

I think it will suffice as a stopgap measure - ONLY as a stopgap measure, but one that photo.net used for quite a long time and was certainly much better than nothing at all.  It requires a single sequential scan on bboard content, and can also be used to search other tables (news, for instance) as well.
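(The actual hack is a PL/pgTCL function called from a query inside the database; the following is only an illustration, in Python with made-up sample data, of the kind of simple keyword-count ranking over a sequential scan that such a stopgap does:)

```python
# Illustration only: a naive keyword ranking over a sequential scan of
# message bodies, in the spirit of the photo.net-era search kludge.
# The messages and query are fabricated sample data.
def rank(query_words, text):
    """Count how many times the query words occur in the text."""
    words = text.lower().split()
    return sum(words.count(w.lower()) for w in query_words)

messages = [
    ("msg1", "How do I tune Postgres for OpenACS?"),
    ("msg2", "Postgres vacuum and Postgres indexes"),
]
query = ["postgres"]

# sequential scan + rank, most relevant first
ranked = sorted(messages, key=lambda m: rank(query, m[1]), reverse=True)
print(ranked[0][0])
```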

I plan to make this available in the next few days so folks can play with it.

I've looked at swish/swish++ and a couple of other options.  Ben and I seem to agree that an out-of-database solution would be best.  In-database solutions tend to be slow (Intermedia is slow, at least), and they require you to stuff all your static pages into the database if you want to support a single integrated search facility.

An out-of-database indexer can always reindex everything if the index gets corrupted, so the normal ACID paranoia can be ignored for such a tool.  As long as the database holding content isn't trashed, you can rebuild the search index.

However, I'd like to integrate more closely than the approach you suggest would indicate.

I may make some progress on a more permanent solution before leaving at the end of the month to band hawks in Nevada (I'll be gone virtually all of September), but am not sure.

Posted by Scott Mc Williams on
Don,

I read your response to this question from August 14 and then saw a response on August 19 talking about /doc/sql/rank-for-search.sql. Is that the solution you were talking about as a stopgap for site-wide search, or is that just for bboard searching? Also, is site-wide search going to be a part of 4.x?

Thanks for all the help!

Scott