OK, I have been continually frustrated by the lack of site-wide-search
on OpenACS installs, both on my own installs and on the openacs.org
site. Here is my idea - note that I have not actually done any work on
this yet.
There is a freeware search engine called Isearch. It does full-text
indexing, knows how to handle HTML, and has a few other nice features:
you can merge indexes (no need to regenerate everything), it is
completely free, it can handle multiple text databases, and it does
relevancy ranking.
Here is my idea.
1. Use Isearch as a backend for site-wide-search. Use wget or another
recursive web-sucker to pull down files and place them in a directory.
Index the files; then we need to translate what Isearch hands back into
the actual URL. This in itself is, I think, relatively easy (a sketch
follows). Maybe we need to reuse the Tcl code that turns arguments into
sub-directories or whatever (I think Bas' code does this).
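To make step 1 concrete, here is a rough Tcl sketch. The hostname, the
mirror directory, and the assumption that wget drops files under a
per-host subdirectory of --directory-prefix are all mine, so treat it as
an illustration rather than working code.

    # Sketch for step 1: mirror the site with wget, then map a file path
    # that Isearch hands back into the live URL a user can click on.
    # Hostname and directory layout are assumptions.

    set host        "www.example.com"
    set mirror_root "/web/search-mirror/public"

    # Pull the public pages down recursively, staying on our own host.
    exec wget --recursive --no-parent --directory-prefix=$mirror_root http://$host/

    proc mirror_file_to_url {host mirror_root filename} {
        # wget puts files under $mirror_root/$host/..., so strip that
        # prefix and glue the remainder back onto the hostname.
        set prefix [file join $mirror_root $host]
        if {[string first $prefix $filename] != 0} {
            error "$filename is not under the mirror root"
        }
        set relative [string range $filename [string length $prefix] end]
        return "http://$host$relative"
    }

    # e.g.  mirror_file_to_url $host $mirror_root \
    #           /web/search-mirror/public/www.example.com/bboard/q-and-a.html
    #       => http://www.example.com/bboard/q-and-a.html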
2. Now, we have a problem - if we index everything on the server, some
of the URLs returned will point to places that the user CANNOT go,
because they do not have permission. In fact, from a security
standpoint, allowing a non-privileged user to even know that
information does or does not exist is considered a bad thing. Plus,
users will just become plain frustrated and the s-w-s will seem to be
"broken".
3. So, we create separate web-crawler users, one for each group that
is defined in the system. Then we tell the crawler to crawl the site
several times, each time pulling down only the content it SHOULD see as
a member of that group. We put the results in different directories,
naming each index after its group.
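A minimal sketch of step 3, again in Tcl. The group names, the cookie
array, and the idea of feeding wget a session cookie via --header are
all assumptions on my part; actually logging in as each crawler user to
obtain that cookie is hand-waved.

    set mirror_root "/web/search-mirror"
    set host        "www.example.com"

    # One crawler login per group.  The group names and cookie values are
    # placeholders; the real cookies would come from logging in as each
    # group's crawler user.
    array set cookie {
        registered_users "session-cookie-goes-here"
        staff            "session-cookie-goes-here"
        admins           "session-cookie-goes-here"
    }

    foreach group [array names cookie] {
        set dir [file join $mirror_root $group]
        file mkdir $dir
        # Crawl as this group's user; pages the user cannot see never get
        # fetched, so the resulting index only covers what the group may see.
        exec wget --recursive --no-parent \
            "--header=Cookie: $cookie($group)" \
            --directory-prefix=$dir \
            http://$host/
    }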
4. Then, when someone accesses the s-w-s, we find out what group(s)
they belong to. We call Isearch, telling it to use the following
indices: GROUP1, GROUP2, etc. Isearch merges the indices and returns
the results.
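To illustrate step 4, a rough sketch using the AOLserver ns_db API. The
user_groups / user_group_map table and column names are from my memory
of the ACS data model and may need adjusting, and exactly how you tell
Isearch to search several indices in one call is still an open question.

    # Sketch: which group indices to hand to Isearch for this user.
    proc sws_index_names_for_user {db user_id} {
        # Everybody gets the public index; group indices are added on top.
        set names [list "public"]
        set selection [ns_db select $db "
            select ug.group_name
            from   user_groups ug, user_group_map ugm
            where  ugm.group_id = ug.group_id
            and    ugm.user_id  = $user_id"]
        while {[ns_db getrow $db $selection]} {
            lappend names [ns_set get $selection group_name]
        }
        return $names
    }

    # The result, e.g. {public staff admins}, would then be handed to
    # Isearch as the list of indices to merge for this query.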
5. For speed (since merging might take a while), we could
automatically create merged indices for commonly used group
combinations.
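For instance (pure sketch, and whether Isearch can simply be pointed at
such a pre-merged index is an assumption), a canonical name for a
precomputed merged index so that {staff admins} and {admins staff} map
to the same thing:

    # Hypothetical helper: canonical name for a precomputed merged index.
    proc sws_merged_index_name {group_names} {
        return [join [lsort $group_names] "+"]
    }

    # sws_merged_index_name {staff admins public}  =>  admins+public+staff
    # If an index by that name already exists on disk, use it; otherwise
    # fall back to merging at query time.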
6. Problem - if a page is visible to two or more groups, then when the
indices are merged and results returned, we may end up with two or more
links to the same content.
7. Problem - as well, the relevancy score for that page may vary
between indices (since two different groups will have two different
sets of text to be searched). We need to decide how to handle that
condition - perhaps average the two rankings? But if we do that, then
we need to sort the returned results ourselves if we want to return
them in order of relevancy.
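Here is the kind of post-processing I have in mind for problems 6 and
7, as a Tcl sketch. It assumes each hit has already been parsed out of
Isearch's output into a {url score} pair, which is itself an open
question.

    # Sketch: collapse duplicate URLs coming back from several group
    # indices, average their relevancy scores, and sort by that average.
    proc sws_merge_hits {hits} {
        # hits is a list of {url score} pairs, possibly with repeated urls
        array set total {}
        array set count {}
        foreach hit $hits {
            set url   [lindex $hit 0]
            set score [lindex $hit 1]
            if {![info exists total($url)]} {
                set total($url) 0.0
                set count($url) 0
            }
            set total($url) [expr {$total($url) + $score}]
            incr count($url)
        }
        set merged [list]
        foreach url [array names total] {
            lappend merged [list $url [expr {$total($url) / $count($url)}]]
        }
        # highest average relevancy first -- we do the final sort ourselves
        return [lsort -decreasing -real -index 1 $merged]
    }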
What do you think? Is anyone else working on this kind of thing?
./patrick