Forum OpenACS Development: Response to OpenACS wish-list

Posted by Jerry Asher on
I've been playing with htDig and AOLserver/OpenACS integration. You can see a work in progress at http://www.theashergroup.com/demo/openacs. For the purpose of this demo, I occasionally index https://openacs.org/bboard, https://openacs.org/doc, and https://openacs.org/wp.

Ah, the cruft, the cruft, the cruft.

This demo shows: exact matches, fuzzy matches (stemming, wildcarding (the wildcard char is *), synonym lookups, spelling correction), site and subsite searches, and date range searches.

This demo doesn't yet implement incremental updates, and because I am indexing OpenACS and not my own site, the index is built by crawling, not through any OpenACS search/db integration.

So far, to get this to work, I've:

  • used htDig 3.2.0b3
  • made one patch to AOLserver (to "understand" semicolons in query strings); alternatively, you can make a simple patch to htDig to use ampersands
  • installed xpdf so I could index pdf
  • installed catdoc so I could index doc
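
To illustrate the semicolon issue from the list above: the links htDig generates separate query-string parameters with semicolons, so the server has to accept ";" as well as "&". The following is only a Python sketch of that idea, not the actual AOLserver patch. The function name is made up for illustration, and the parameter names in the usage note are sample values.

```python
from urllib.parse import parse_qsl


def parse_query(query_string):
    """Parse a query string whose pairs are separated by ';' or '&'.

    Literal semicolons are rewritten to ampersands before decoding, so a
    percent-encoded semicolon (%3B) inside a value is left alone.
    """
    normalized = query_string.replace(";", "&")
    result = {}
    for key, value in parse_qsl(normalized, keep_blank_values=True):
        # Collect repeated keys into a list, as form parsers usually do.
        result.setdefault(key, []).append(value)
    return result
```

For example, `parse_query("words=openacs;page=2")` and `parse_query("words=openacs&page=2")` produce the same dictionary.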
The integration is "interesting". htDig has none of the interface we discussed above; it's pretty monolithic and likes to be a cgi program. So what I did was to take its template html wrappers and turn them into template adp wrappers. My search page execs out to htsearch, captures the result, and turns it over to ns_adp_parse, returning the results of that to the user.
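
The exec-and-capture step described above can be sketched roughly as follows. This is Python for illustration only; the real page is AOLserver Tcl/ADP and hands the output to ns_adp_parse. It assumes htsearch can be run from the command line with an optional `-c configfile` and a query string argument, and that its output may start with a CGI `Content-type` header. The function names, the config path, and the timeout are all hypothetical.

```python
import re
import subprocess
from urllib.parse import urlencode


def build_htsearch_command(words, config_path=None, binary="htsearch"):
    """Build the argument list for a command-line htsearch call (assumed usage)."""
    cmd = [binary]
    if config_path:
        cmd += ["-c", config_path]
    # The search terms are passed as a URL-encoded query string.
    cmd.append(urlencode({"words": words}))
    return cmd


def strip_cgi_header(output):
    """Drop a leading CGI header block, if one is present."""
    if not output.lower().startswith("content-type:"):
        return output
    # Headers end at the first blank line.
    parts = re.split(r"\r?\n\r?\n", output, maxsplit=1)
    return parts[1] if len(parts) == 2 else output


def run_search(words, config_path=None, binary="htsearch", timeout=30):
    """Run htsearch, capture its output, and return the result body."""
    completed = subprocess.run(
        build_htsearch_command(words, config_path, binary),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise if htsearch exits with an error
    )
    return strip_cgi_header(completed.stdout)
```

The string returned by `run_search` is what would then be passed to the templating step. Passing the arguments as a list rather than a shell string keeps user-supplied search terms from being interpreted by a shell.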

Future work:

  • incremental indexing
  • db indexing - htDig wants to be a crawler, and that's not great for our OpenACS kind of pages. For example: the main /bboard page has all these wonderful links like "sort by activity" which can take you to a "statistics page". Naively, htDig spiders it correctly, but this is both inefficient and leads to search results where the bboard table of contents (listing every bboard message) often comes to the top of the results. htDig can be easily configured to avoid those links and I've done that. But it's clear that better htdig/db integration would lead to better results, quicker results, and leave your log files usable.

    So one goal of better db integration with htDig would be to create an API (like the newstuff system) where a module could quickly hook into htDig (or whatever indexer is used).

    The other thing db integration can provide is much better date range matching. Right now, htDig thinks any tcl or adp page was just created today, so almost everything but html pages comes up in the "last week" filter.

  • stability - htDig has worked very well for me so far, but there are reports on the net of various bugs: crashes, and occasional db corruption.
  • xml generation and xml/rpc or soap interface: imagine an OpenACS search solution that let any OpenACS site request a search from any other OpenACS site. Do you use Michael Cleverly's community ubersearch? With an xml/rpc or soap interface, this sort of search (and more powerful ones) can be an easy and efficient OpenACS module.
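
The db-integration idea in the list above (a newstuff-style API that modules hook into, with real modification dates) might look something like the sketch below. Nothing here is an htDig or OpenACS API: `Document`, `register_provider`, and `collect_documents` are hypothetical names, and the code is Python purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Document:
    url: str
    title: str
    body: str
    modified: datetime  # taken from the database row, not from a file timestamp


# Hypothetical registry: module name -> callable returning that module's documents.
_providers = {}


def register_provider(module, provider):
    """Let a module (bboard, wp, ...) hook its content into the indexer."""
    _providers[module] = provider


def collect_documents(since=None):
    """Gather documents from every registered module.

    Passing `since` keeps only documents modified at or after that time,
    which is the basis for both date-range matching and incremental runs.
    """
    docs = []
    for module in sorted(_providers):
        for doc in _providers[module]():
            if since is None or doc.modified >= since:
                docs.append(doc)
    return docs
```

Because each module hands over individual messages or pages directly, the indexer never has to crawl navigation links such as "sort by activity", and the "last week" filter can rely on the stored dates.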
Lucene looks interesting. Would that I had seen that two weeks ago, eh?

I am not sure what to make of the ACS 5 search solution. I glanced at it briefly, and while the ACS and OpenACS certainly could use a search capability like that, it appears to be completely RDBMS centric. I.e., it doesn't appear to be implemented using Intermedia or any other specialized indexer, so I am left wondering what its performance will be. I only gave this a brief glance, so I am not certain that they do not plan to rely on Intermedia or some other search engine technology.

Let me know what you think of my search implementation. I will try to make it available to the community soon, and I would really like some community help in discussing htdig/db integration issues.

Time now for The Simpsons and some more cold medicine: summer colds are just the worst.