Forum OpenACS Q&A: ht://dig for OpenACS

Hi Arnel,

It's hard to say if htDig will satisfy your requirements. I would think so under the right circumstances.

What makes me scratch my head here is that your documents don't appear to be on your own machine, but on 150+ sites. That leads me to believe you'll always need an htDig crawler running to index those documents. But people's biggest "complaint" about htDig is how long it takes to build its index, and I've never tried it with as many as 250,000 documents located on remote sites.

On the plus side, I know there are folks using htDig to index millions of documents, but they are willing to accept index rebuild times on the order of many hours, if not a day or two.
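To put rough numbers on that, here is a back-of-envelope sketch in Python. The per-document times are pure guesses on my part, not htDig measurements, and the function name is just something I made up for illustration. It assumes documents are fetched one after another.

```python
def estimated_crawl_hours(num_docs: int, seconds_per_doc: float) -> float:
    """Rough wall-clock hours to fetch and index num_docs documents sequentially.

    seconds_per_doc is an assumed average (network fetch plus parsing);
    measure it on a small sample of your own sites before trusting the result.
    """
    if num_docs < 0 or seconds_per_doc < 0:
        raise ValueError("num_docs and seconds_per_doc must be non-negative")
    return num_docs * seconds_per_doc / 3600.0


if __name__ == "__main__":
    # Hypothetical per-document costs, in seconds.
    for secs in (0.1, 0.5, 1.0):
        hours = estimated_crawl_hours(250_000, secs)
        print(f"{secs:.1f} s/doc -> about {hours:.1f} hours")
```

Even at an optimistic half second per remote document, 250,000 documents works out to well over a day, which is in line with the rebuild times people report.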

Also, htDig has a variety of incremental indexing schemes that can greatly speed up indexing, but they require you to feed htDig only the new documents and to explicitly tell htpurge to delete old ones. If the docs are on 150+ external sites, it doesn't seem as though that tactic will be available.
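The bookkeeping that incremental indexing needs looks roughly like the Python sketch below. It does not call htDig at all; it only works out which URLs are new or changed and which have disappeared, given two snapshots you would have to collect yourself. The function name and the snapshot format (URL mapped to some last-modified marker) are my own assumptions, and how you hand the resulting lists to htDig and htpurge depends on your version.

```python
def plan_incremental_update(previous: dict, current: dict):
    """Compare two snapshots mapping URL -> last-modified marker.

    Returns (to_index, to_purge):
      to_index: URLs that are new, or whose marker changed since last time
      to_purge: URLs present last time but missing now
    """
    to_index = sorted(
        url for url, stamp in current.items()
        if url not in previous or previous[url] != stamp
    )
    to_purge = sorted(url for url in previous if url not in current)
    return to_index, to_purge


if __name__ == "__main__":
    # Hypothetical snapshots; the URLs and dates are placeholders.
    last_run = {
        "http://site-a.example/one.html": "2003-01-01",
        "http://site-a.example/two.html": "2003-01-01",
    }
    this_run = {
        "http://site-a.example/one.html": "2003-02-15",
        "http://site-b.example/new.html": "2003-02-10",
    }
    to_index, to_purge = plan_incremental_update(last_run, this_run)
    print("index:", to_index)
    print("purge:", to_purge)
```

The catch for your case is the snapshot itself: for documents on external sites, you would still have to visit each site to learn what changed.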

Finally, another way to speed up indexing is to point htDig (the indexing component) not at a URL for each of your 250,000 documents, but at the place where they reside on disk (assuming they are on disk and not in the database). Again, though, that will not work with docs on remote sites.
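The idea is just a prefix mapping from URLs to directories. As far as I recall, the htDig config attribute for this is `local_urls`, but check the documentation for your version. Here is a Python sketch of the concept only, not of htDig's actual implementation; the function name, host, and paths are made up.

```python
from typing import Optional


def url_to_local_path(url: str, mappings: list) -> Optional[str]:
    """Translate a URL to a filesystem path using (url_prefix, directory) pairs.

    Returns None when no prefix matches, meaning the document would still
    have to be fetched over HTTP.
    """
    for prefix, directory in mappings:
        if url.startswith(prefix):
            return directory + url[len(prefix):]
    return None


if __name__ == "__main__":
    # Hypothetical mapping: one local site served from /var/www/htdocs/.
    mappings = [("http://www.example.com/", "/var/www/htdocs/")]
    print(url_to_local_path("http://www.example.com/docs/a.html", mappings))
    print(url_to_local_path("http://remote.example.org/b.html", mappings))
```

Anything that falls through to `None` is a remote fetch, which is where all 150+ of your sites would end up.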

However, it is very easy to get basic htDig running. An experiment to find out whether it will satisfy your requirements should take anywhere from a few hours to a day or two. (So try it, and let us know the results!)