Forum OpenACS Q&A: ht://dig for OpenACS

Posted by Arnel Estanislao on
We have a website that is based on OpenACS 3.2.4.  A key part of its
functionality was a search engine that scoured a large number of
video-related websites (about 150+ sites, 250,000 documents, so, a
fairly large index).  We were able to successfully deploy the search
engine using Inktomi, formerly Ultraseek.

Due to budgetary constraints, however, the client cancelled the use of
this search engine. We're currently looking for a lower-cost
alternative, one that is preferably open-source. We came across
ht://dig as one option (we've used this in a few of our smaller
projects), and Jerry Asher has integrated it nicely with AOLserver.
Could that handle the requirements of the site?  Are there other
solutions we should be looking into?  Once again, we are looking for
something that can reasonably handle an index of about 250,000+
documents, possibly more as the site grows.

Thanks.

Posted by Jerry Asher on
Hi Arnel,

It's hard to say if htDig will satisfy your requirements.  I would think so under the right circumstances.

What makes me scratch my head here is that your documents don't appear to be on your own machine, but on 150+ external sites.  That leads me to believe you'll always have to have some htDig crawler running to index those documents.  But people's biggest "complaint" about htDig is the length of time it needs to build its index, and I've never tried it with as many as 250,000 documents located on remote sites.
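For what it's worth, crawling a fixed set of remote sites is mostly a matter of the config file: you seed the crawler with each site's start URL and restrict it to those hosts. A minimal sketch (the site URLs and paths below are made up for illustration; the attribute names are the standard htdig.conf ones):

```
# /usr/local/htdig/conf/htdig.conf -- minimal remote-crawl sketch
database_dir:   /usr/local/htdig/db

# Seed the crawler with each of the 150+ sites (hypothetical examples)
start_url:      http://www.video-site-one.com/ \
                http://www.video-site-two.com/

# Keep the crawler from wandering off onto other hosts
limit_urls_to:  ${start_url}

# Cap crawl depth so one badly-linked site can't explode the index
max_hop_count:  10
```

With 150+ start URLs you'd probably generate that list from a script rather than maintain it by hand.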

On the plus side, I know that there are folks using htDig to generate an index of millions of documents, but they are willing to accept index rebuild times on the order of many hours, if not a day or two.

Also, htDig has a variety of incremental indexing schemes which can greatly speed up the indexing process, but they require you to feed htDig only the new documents and to explicitly tell htpurge to delete the old ones.  If the docs are on 150+ external sites, it doesn't seem as though that will be an available tactic.
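For completeness, the incremental cycle usually looks something like this (exact flags differ between htDig 3.1.x and 3.2.x, so treat this as a sketch rather than a recipe; the URL is hypothetical):

```shell
# Update dig: without -i, htdig updates the existing databases rather
# than rebuilding from scratch; -a works on alternate copies so
# searches keep working while the update runs
htdig -a -v -c /usr/local/htdig/conf/htdig.conf

# Explicitly remove a document you know is gone (hypothetical URL)
htpurge -c /usr/local/htdig/conf/htdig.conf \
        -u http://www.video-site-one.com/old-page.html
```

The catch, as above, is knowing *which* documents are new or gone, which you can't easily know for 150+ sites you don't control.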

Finally, another way to speed up indexing is to point htdig (the indexing component) not at a URL for each of your 250,000 documents, but at where they reside on disk (assuming they are on disk and not in the database).  Again, though, that will not work with docs on remote sites.
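If the documents were local, the attribute for this is local_urls, which maps a URL prefix onto a filesystem path so htdig opens the files directly instead of fetching them over HTTP. A sketch, with a hypothetical site and path:

```
# Map the URL space onto the filesystem; htdig reads the files
# directly instead of going through the web server
local_urls:  http://www.yoursite.com/=/web/yoursite/www/
```

That skips the whole HTTP round trip per document, which is where most of the indexing time goes.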

However, it is very easy to get a basic htDig installation running.  An experiment to find out whether it will satisfy your requirements should take anywhere from a few hours to a day or two.  (So try it, and let us know the results!)
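The distribution even ships a rundig script that runs the whole dig-and-merge cycle for you, so the experiment can be as simple as (install paths hypothetical):

```shell
# Point the config at a handful of the 150 sites first, then time a
# full build to extrapolate what 250,000 documents would cost
vi /usr/local/htdig/conf/htdig.conf
time rundig -v

# Then hit the htsearch CGI from a browser to test queries
```

Start with maybe five sites, time the build, and multiply out; that will tell you quickly whether a full rebuild lands in the "overnight" range or somewhere unworkable.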