Forum OpenACS Q&A: Search in AOLserver

Posted by Krzysztof Kowalczyk on
I have written an article on implementing search in AOLserver by making it talk to Swish++. It's available at www.fifthgate.org.

My experiments are intended to be the first step towards implementing site-wide search in OpenACS. I know that at least Don and Ben have been thinking about it. Do any of you have a design for it, or is the idea still in its infancy?

Posted by Ola Hansson on
Fantastic work!
I can surely use this stuff.

I noticed one minor thing: There was no match to "pilot" but one or two for "palmpilot"...

Posted by Krzysztof Kowalczyk on
I guess it just means that either no document on my site contains the word "pilot", or the word was discarded as being too frequent or too rare (according to Swish++). If it's the latter, one can tweak it with some Max* parameter when indexing with Swish++.
Posted by Janine Ohmer on
interMedia would do the same thing - if "palmpilot" was in the index, "pilot" would not match it.  This is pretty brain-dead and one of these days I need to do some research and see if there is a setting to change this, since this reduces its matching abilities to less than a LIKE query.

Anyway, just thought you'd like to know that Swish++ is at least keeping up with the competition in this regard. :)

Posted by Krzysztof Kowalczyk on
If that is really how interMedia behaves, then Swish++ is one step ahead: you can use an asterisk at the end of a word (i.e. searching for PilotM* would find PilotMain). Sure, regexp it ain't, but it's better than nothing.
Posted by Michael Bryzek on
I asked Paul Dixon at interMedia about the "pilot" vs "palmpilot" search problem. He wrote:
In 8.1.6 we introduced "substring indexing" which rotates tokens so we can index suffixes. We did this for a pharmaceutical company that wanted to wildcard both sides (%benzo%). Trailing wildcards were already OK. That's what this user should use, although I suspect they're being linguistically naive: do they really want a hit if someone queries on "lot"?

A thesaurus or extended knowledge-base might help if the user noticed this spelling error more than once in their logs.

Germans tend to lump a string of words together as one - in that case, and in Dutch, we decompound linguistically according to a dictionary.

There is some documentation about substring indexing at http://oradoc.photo.net/ora816/inter.816/a77063/cdatadi6.htm. I have never used it, but it would be interesting to play with.
Posted by Don Baccus on
Great, I'll have to take a look at this.  Are you indexing database entries?

I spent about half a day earlier this week looking at swish-e, swish++, and some variations on the existing keyword search scheme that barely exists as contributed code in PG.

swish++ allows incremental updates of the index file - important for indexing things like bboard entries.  However my quick poking around didn't see any concurrency protection ...

swish-e 2.0 (in beta, but seems to work OK in my experimenting) has a phrase-search facility as well as the keyword-based search of swish-e 1.* and swish++.  This would be nice to have.

Integrating C++ code into PG is an interesting proposition, not sure it is practical in a portable way - it might be nice to have functions  to search the index that can be incorporated into queries to pull stuff from module tables like the bboard table.  So swish++ may not be the ideal solution.

On the other hand, swish-e doesn't handle incremental updates of the index, though it should be easy to add.

I'll have some more comments in a day or two, before I leave for Nevada.

Posted by Krzysztof Kowalczyk on
Are you indexing database entries?
No, I've only played with files. It should be possible to index db entries, of course using a hack: save db entry to a file with a name that we can decode later to extract table/record id and tell Swish++ to index this file, then delete file and go on. This is how httpindex frontend for indexing pages grabbed directly from web server works.
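
The file-name trick can be sketched in a few lines of shell. This is only an illustration of the idea, not anything Swish++ or httpindex provides: the `<table>__<id>.txt` naming scheme and the three function names are assumptions made up for the example, and the scheme assumes table names never contain a double underscore.

```shell
#!/bin/sh
# Hypothetical naming scheme: <table>__<record id>.txt
# Assumes table names contain no double underscore.

# Build the temporary file name for one database row.
entry_file_name() {
    printf '%s__%s.txt\n' "$1" "$2"
}

# Recover the table name from a file name reported in the search results.
entry_table() {
    name=$(basename "$1" .txt)
    printf '%s\n' "${name%__*}"
}

# Recover the record id from the same file name.
entry_id() {
    name=$(basename "$1" .txt)
    printf '%s\n' "${name##*__}"
}

entry_file_name bboard 1042
```

The indexer only ever sees the file name, so everything needed to get back to the row has to be encoded in it; the file itself can be deleted as soon as it has been indexed.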

I don't understand why it should be necessary to merge Swish++ with PG, in your earlier posts you've mentioned that you're leaning towards out-of-database solution which I second: just like with files storage, it's kind of pointless to put into a database a copy of data that's already there and only serves to create an index - but maybe I'm missing some bigger picture here.

swish++ allows incremental updates of the index file - important for indexing things like bboard entries. However my quick poking around didn't see any concurrency protection ...
It's because it doesn't have any concurrency protection. The way it works with incremental indexing is: the original index is read-only the whole time; Swish++ creates a copy of this index and adds new documents to it. Since there can only be one process that updates the new index, there is no concurrency and thus no problem. When updating is finished, one just has to switch to the new index. The biggest problem I see is that during this operation you need twice as much space for the index, but I don't see this as a showstopper:
  • index is relatively small (under 10% of original files)
  • people who are serious about this stuff and are lucky enough to have things to index will just buy bigger drives

I'm a bit sketchy on ACS Classic's search implementation, but I think the way I would like to do it is similar to their approach (with the exception of using an external program to index things, of course):

  • keep track of what needs to be indexed (table/record id)
  • have a periodic task that updates the index by moving records out of database to files and feeding those files to Swish++
  • pause search for a while, replace the old index with the newly created one, and voilà
In theory it's trivial and I'll implement this unless someone beats me to it (I'll only be able to start working on it in 3 weeks). I've just sent Jim Davidson patches to AOLserver that will make it possible to efficiently communicate with Swish++ from within AOLserver; let's hope he integrates them.
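
The last step of that cycle, putting the new index into service, might look roughly like the sketch below. The function name and the paths in the usage line are hypothetical. The sketch relies on `mv` being a plain rename when both files live on the same filesystem, so a search that opens the index sees either the old file or the new one, never a half-written one. Whether an already running search daemon needs a restart to pick up the new file is not covered here.

```shell
#!/bin/sh
# swap_index NEW LIVE: put a freshly built index into service.
# Both paths are assumed to be on the same filesystem, so that mv
# is a rename rather than a copy.
swap_index() {
    new=$1
    live=$2
    # Refuse to replace a working index with a missing or empty file.
    [ -s "$new" ] || { echo "swap_index: $new is missing or empty" >&2; return 1; }
    mv -f "$new" "$live"
}

# Example call (hypothetical paths):
#   swap_index /web/history/swish++.index.new /web/history/swish++.index
```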
Posted by Janine Ohmer on
although I suspect they're being linguistically naive: do they really want a hit if someone queries on "lot"?

Not linguistically naive, just expecting interMedia to be a lot smarter than it apparently really is. :)

From all the PR about Context and then interMedia being intelligent searching tools, I would expect it to have enough knowledge about the language it's configured to use to recognize when a word fragment is so common as to be meaningless and ignore it.

Having to use % as a wildcard isn't intuitive to the average user, and even with an explanation right on the search page I don't think they would use it.

However, this being the OpenACS bboard makes this a bit off topic, so I'll end my "explain to me again what is so great about interMedia" rant here. :)

Posted by Don Baccus on
First of all - I won't be able to spend time on this for four to five weeks, so you should go for it, Krzysztof (I'll be in Nevada for the month of September, with no computer much less net access).

Some specifics:

No, I've only played with files. It should be possible to index db entries, of course using a hack: save db entry to a file with a name that we can decode later to extract table/record id and tell Swish++ to index this file, then delete file and go on. This is how httpindex frontend for indexing pages grabbed directly from web server works.
We should be able to do much, much better than this by providing a datasource that knows about Postgres, or by providing a new entry into swish++ that takes parameters rather than a file name. Neither should be particularly hard.
I don't understand why it should be necessary to merge Swish++ with PG, in your earlier posts you've mentioned that you're leaning towards out-of-database solution which I second: just like with files storage, it's kind of pointless to put into a database a copy of data that's already there and only serves to create an index - but maybe I'm missing some bigger picture here.
My thinking is that eventually it might be nice to provide a PG function that can query the index directly, so you can join the results to the (in 4.0) repository without doing any intermediate work. Seems like this could be more efficient than querying the swish++ daemon over a socket.

This is only for searching the index, of course. As far as building the index goes, the only level of integration that would be nice would be the ability to put a trigger on a table like bboard that causes the entry to be indexed automatically on insert, and deleted for delete/update (and reinserted for the latter). The trigger approach isn't strictly necessary, of course, just nice (you can call the indexer directly from Tcl instead).

Does swish++ support incremental deletes as well as inserts?

But this certainly isn't important in the near term, and the search daemon's an improvement over the swish-e approach. You can make persistent connections to the daemon and pool them ala database drivers (in fact, you could make it a dummy "database" and use the driver protocol as a quick hack).

It's because it doesn't have any concurrency protection. The way it works with incremental indexing is: the original index is read-only the whole time, Swish++ creates a copy of this index and adds new documents to it. Since there can only be one process that updates new index there is no concurrency and thus no problem.
This sounds like a race condition to me...AOLserver threads "A" and "B" both start updating the index at the same time, reading the same read-only copy, then in turn each write a new index. You can lock in the AOLserver interface, though. Since the search works off of a read-only copy, they won't be blocked. Having inserts block while searches don't should be OK.
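
Inside AOLserver the natural tool for this would be a mutex held around the indexing call. As a stand-in, here is the same idea at the shell level, where a lock directory serializes index updates while searches carry on untouched. The function name and the lock path are assumptions for the example; the only thing it relies on is that `mkdir` either creates the directory or fails, with nothing in between.

```shell
#!/bin/sh
# with_index_lock LOCKDIR COMMAND [ARGS...]
# Run COMMAND while holding an exclusive lock, so that two updaters
# never build a new index from the same read-only copy at once.
with_index_lock() {
    lockdir=$1
    shift
    # Wait until no other updater holds the lock.
    until mkdir "$lockdir" 2>/dev/null; do
        sleep 1
    done
    rc=0
    "$@" || rc=$?
    # Release the lock whether or not the command succeeded.
    rmdir "$lockdir"
    return $rc
}

# Example call (hypothetical lock path and updater script):
#   with_index_lock /tmp/swish-index.lock ./update-index.sh
```

Only the updaters go through the lock, so inserts block each other while searches never wait. A crashed updater would leave the lock directory behind; a real setup would need some way to clean that up.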

The biggest problem with swish++ is the lack of a phrase-based search, which we can poach from swish-e later anyway, so I'm not worried about this.

Posted by Ola Hansson on
Having to use % as a wildcard isn't intuitive to the average user, and even with an explanation right on the search page I don't think they would use it.

On Krzysztof's page I was able to use * as a wildcard to make "PalmPilot" a hit on a search for "*pilot", but I guess it's not one bit more intuitive 😊

I'd like to run swish++ as a daemon and I got the "non-daemon" search running. Why can't I load nsunix, which I presume is needed? All permissions look right to me.

This is what the log file says:

[30/Aug/2000:17:15:59][4955.1024][-main-] Error: nsunix: could not listen: File or directory doesn't exist.
[30/Aug/2000:17:15:59][4955.1024][-main-] Notice: binder: listen(127.0.0.1, 80) = 17
[30/Aug/2000:17:15:59][4955.4101][-nssock-] Notice: waiting for startup
[30/Aug/2000:17:16:00][4955.1024][-main-] Notice: AOLserver/3.0 running.s

A snippet from my nsd.tcl:

ns_section "ns/server/${server}/module/swish-search"
    ns_param PathToSearch "/tmp/swish++-4.6.6/search"
    ns_param PathToIndex "${pageroot}/history/swish++.index"
    ns_param UseDaemon 1
    ns_param SocketName "/tmp/search.socket"


# Unix domain socket driver -- nsunix
#
ns_section "ns/server/${server}/module/nsunix"
ns_param   hostname        $hostname    ;# Hostname used in response to client
ns_param   port            80    ;# Port to listen on
ns_param   socketfile      "/tmp/search.socket" ;# UNIX domain socket driver


ns_section "ns/server/${server}/modules"
    ns_param   nssock          ${bindir}/nssock.so
    ns_param   nslog           ${bindir}/nslog.so
    ns_param   nsperm          ${bindir}/nsperm.so
    ns_param   nscp            ${bindir}/nscp.so
    ns_param   nsunix          ${bindir}/nsunix.so
Posted by Krzysztof Kowalczyk on
Ola, the good news is that you don't need nsunix. The bad news is that you need to patch AOLserver to get daemon search working. There is a chance that this patch will get integrated into the main AOLserver distribution, but for now you have to get your hands dirty if you want to make it work. The patch is at http://www.fifthgate.org/ns_sockunixopen.patch. It's against the latest CVS sources of AOLserver. You need to run patch -p 0 < ns_sockunixopen.patch in the aolserver directory.

This patch adds the ns_sockunixopen Tcl command, which is needed to communicate with the Swish++ daemon.

Posted by Ola Hansson on
I'm much obliged, it's working as expected.
Posted by Ola Hansson on
Every time I turn on my computer, log in as root and fire up the "daemon-search", the ownership of "/tmp/search.socket" changes from nsadmin to root, resulting in a broken search engine.
It looks like the daemon must be run as nsadmin.

What would be a good way to make the search-daemon start on every boot (as nsadmin)?
(I just thought maybe, perhaps you'd want to have this information on your page anyway :-p)

Posted by Krzysztof Kowalczyk on
Warning: this is pure, untested speculation. If you're running RedHat you can use the standard boot startup framework. In short, you should chown nsadmin.nsadmin /usr/local/bin/search and setuid the binary (too lazy to check the syntax) so that search always starts as an nsadmin process regardless of the user who executed it (this is admittedly an ugly kludge but it will take you where you want to go). Test this by logging in as root and starting search - the socket file should now be created with nsadmin's permissions.
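
For reference, the chown-plus-setuid step could be written as below. Like the advice itself, this is untested against a real Swish++ install. The helper name is made up for the example, and it uses the colon form of chown (user:group), which current systems expect in place of the dot. The order matters because chown may clear the setuid bit, so the bit is set last.

```shell
#!/bin/sh
# make_setuid BINARY USER GROUP
# Give the binary to USER:GROUP, then set the setuid bit so that it
# always runs with the owner's uid, whoever starts it.
make_setuid() {
    chown "$2:$3" "$1" &&
    chmod u+s "$1"
}

# On the real machine, as root, with the path and account from this thread:
#   make_setuid /usr/local/bin/search nsadmin nsadmin
#   ls -l /usr/local/bin/search    # the owner's x should now show as s
```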

As for starting a daemon on boot, there is more than one way to do it. One way: save the script that I attach at the end as /etc/rc.d/init.d/swishd and make it executable. Test it by executing swishd start (this should start the daemon). To make it start automatically, execute chkconfig --add swishd. An alternative would be to just launch search from /etc/rc.d/rc.local. The details of those approaches are beyond the scope of this post; this is just general RedHat administration stuff. If you use another distribution there is probably a similar feature, but the details will differ.

#!/bin/sh
#
# swish:       Starts the Swish++ search daemon
#
# chkconfig: 345 43 77
# description: Starts and stops the search as daemon at boot time and shutdown.
# processname: search

# Source function library.
. /etc/rc.d/init.d/functions

# See how we were called.
case "$1" in
  start)
        echo -n "Starting Swish++ search"
        /usr/local/bin/search
        echo
        touch /var/lock/subsys/search
        ;;
  stop)
        echo -n "Shutting down Swish++ search"
        killproc search
        rm -f /var/lock/subsys/search
        echo
        ;;
  restart|reload)
        echo -n "Restarting Swish++ search daemon"
        $0 stop
        $0 start
        RETVAL=$?
        ;;
  *)
        echo "*** Usage: swishd {start|stop|restart|reload}"
        exit 1
esac

exit 0
Posted by MK Tam on
Hi,

Does Swish++ support multibyte encodings, like Big5 or GB?
Thanks.

Posted by MK Tam on
Well, I just get "search failed with: swish_error {search: error: malformed query}" when typing Big5 characters into the search...
Can anybody help?
Posted by Krzysztof Kowalczyk on
I think that Swish++ doesn't support anything but English.