Forum OpenACS Q&A: Full text search

1: Full text search

Posted by mark rohn on 08/21/00 11:56 PM

I see that full text searching is now working. Is full text search
included in version 3.2.4? Also, how is full text searching done?

Thanks in advance

2: Response to Full text search (response to 1)

Posted by Gregory McMullan on 08/22/00 02:19 AM

See the thread of two days ago from Don Baccus entitled "Simple search tool available for bboard module" at http://new.openacs.org/bboard/q-and-a-fetch-msg.tcl?msg_id=0000Tz&topic_id=11&topic=OpenACS - Don says that this is only a stopgap measure, but it is clearly immensely better than nothing.

3: Response to Full text search (response to 1)

Posted by Don Baccus on 08/22/00 03:08 AM

The full text search I implemented exactly mimics the original simple search implemented for photo.net before Context (now InterMedia) was available (and resurrected the first time or two Context was tried, since it had a nasty habit of looping infinitely and stuff like that).

It does a simple ranking based on a list of keywords - it's not phrase-based. The more keywords that are matched, the higher the score you get. It doesn't weight for multiple occurrences of keywords or anything like that. It scales the return value so it lies between 0 and 100, 0 being "no keywords matched", 100 being "all keywords matched".
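
As a rough illustration of that kind of ranking, here is a minimal sketch in Python (not the Tcl actually used in OpenACS). The function name and the choice of case-insensitive substring matching are assumptions made for the example, not details taken from the real code.

```python
def keyword_score(keywords, text):
    """Return a 0-100 score: the percentage of distinct keywords found in text.

    Hypothetical sketch; a keyword counts once no matter how often it occurs.
    """
    # Normalise and de-duplicate the keywords, ignoring empty strings.
    wanted = {k.lower() for k in keywords if k.strip()}
    if not wanted:
        return 0
    haystack = text.lower()
    # Assumption: a keyword "matches" if it appears anywhere as a substring.
    matched = sum(1 for k in wanted if k in haystack)
    # Integer scaling: 0 = no keywords matched, 100 = all keywords matched.
    return 100 * matched // len(wanted)
```

Calling `keyword_score(["swish", "index"], "Playing with swish")` matches one of two keywords, so it scores 50; repeating "swish" in the text would not change that.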

I suspect the simple Tcl ranking function could easily be twiddled to provide more finely-tuned search results - Tcl's a lot more fun for writing this kind of code than Oracle PL/SQL, that's for sure! The current ranking function is about 10 lines of code...

But there's no way to avoid the basic problem that this hack requires a sequential scan of the bboard table (or any table you decide to search), so it is inherently slow. This is the major reason it is a stopgap, as it won't scale.
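
To make the scaling problem concrete, here is a hedged Python sketch of what a search of this style amounts to. The `rows` list of `(msg_id, body)` pairs stands in for the bboard table, and `search` is a hypothetical name; the point is only that every row has to be visited for every query.

```python
def search(rows, keywords):
    """Score every (msg_id, body) row against the keywords.

    Hypothetical sketch: the loop touches each row once per query, so the
    cost grows linearly with the size of the table.
    """
    wanted = {k.lower() for k in keywords if k.strip()}
    if not wanted:
        return []
    hits = []
    for msg_id, body in rows:  # the sequential scan
        text = body.lower()
        matched = sum(1 for k in wanted if k in text)
        if matched:
            hits.append((100 * matched // len(wanted), msg_id))
    # Best score first; ties are broken by message id.
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return hits
```

With ten rows this is instant; with a few hundred thousand posts, each query still reads all of them, which is why no tuning of the ranking function fixes the underlying cost.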

But as Greg mentions, it indeed is better than nothing. photo.net survived surprisingly well with this little hack for quite some time.

Right now Ben and I are leaning towards an out-of-database solution, since a good indexing solution in the database is likely to lead to slow inserts of posts, news items, and other searchable things. Experience with InterMedia tends to back up that point of view (if not outright poison our point of view!)

An out-of-database solution is fine, because you don't really need your search index to be ACID - if it hoses, you just rebuild it.
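
A small Python sketch of why that works: the index below is derived entirely from the source rows, so losing it loses nothing that cannot be regenerated. This is a toy in-memory inverted index with hypothetical names (`build_index`, `lookup`), and it says nothing about how swish or swish++ are actually implemented.

```python
from collections import defaultdict


def build_index(rows):
    """Build a word -> set of message ids map from (msg_id, body) rows.

    Hypothetical sketch: words are split on whitespace and lowercased.
    """
    index = defaultdict(set)
    for msg_id, body in rows:
        for word in body.lower().split():
            index[word].add(msg_id)
    return index


def lookup(index, keyword):
    """Return the sorted ids of messages containing keyword as a whole word."""
    return sorted(index.get(keyword.lower(), ()))
```

If the index is corrupted or deleted, calling `build_index(rows)` again on the same rows produces an identical index; the database remains the only copy that needs transactional guarantees.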

I've been playing with swish and swish++...

4: Response to Full text search (response to 1)

Posted by mark rohn on 08/24/00 08:20 PM

What about PLS? Now that it's gone open source, is it not now an option?

5: Response to Full text search (response to 1)

Posted by Krzysztof Kowalczyk on 08/24/00 08:31 PM

I took a look at the PLS web site and didn't get the impression that it's open source. Quite the contrary. You can download binaries (after registration), but the license states that "no disassembly or reverse engineering allowed". The source code download states that:

"The PLWeb Turbo 3.0 source code distribution allows you to rebuild the application-level binaries of PLWeb Turbo 3.0. The libraries that are part of the search engine (CPL) are provided as binaries in the CPL 6.3 binary distribution and cannot be rebuilt (i.e., no source code is provided)."

In summary: it's not open-source software, and nothing suggests that it ever will be.

6: Response to Full text search (response to 1)

Posted by mark rohn on 08/24/00 08:44 PM

Sorry, you are correct: PLS is not open source, but it is free. The license does, however, read as follows.

2. License Grant. AOL hereby grants You a world-wide, royalty-free, non-exclusive license (a) to use, reproduce, sublicense and distribute the Licensed Software, including as part of one or more Integrated Works; (b) to provide support and maintenance to a third-party in the use of the Licensed Software and in the development and use of one or more Integrated Works; and (c) to use, reproduce, sublicense and distribute the Documentation in connection with all of the foregoing. Perl scripts incorporated with the Licensed Software may be used and modified to facilitate use of the Licensed Software or Integrated Works.

It may not be open source, but free is still free.

Thanks
Mark Rohn

7: Response to Full text search (response to 1)

Posted by Krzysztof Kowalczyk on 08/24/00 08:59 PM

Well, not really. The biggest problem is this: if OpenACS' implementation of search is based on pls and pls gets ditched by AOL, then in a few months the whole work would become obsolete, because the old binaries would no longer work and you couldn't fix them yourself. The fact that they're only providing RedHat 5.2 rpms, which leaves Debian and probably others out of the equation, doesn't give me a lot of confidence in their continuing support.

However, as the (real) Free Software philosophy goes: you've got the source (to OpenACS), and there is nothing to stop you from providing an OpenACS search interface for pls.