Forum OpenACS Q&A: htDig
We're building a site that has a protected area that htDig cannot index.
As a quick fix, I was thinking of letting htDig bypass OpenACS's password screen so that it can index the protected area.
Perhaps this could be done by looking at the HTTP headers and confirming that the request comes from the same box (i.e. htDig is running on the same computer).
htDig will still create the index, but when site users follow a search result to a protected page, they will be required to log in.
It's quick and dirty, but you'll have to admit it's quite an easy way to index a site.
My question now is:
Where should I look in the source code to enable this functionality?
To help secure your hack, you might consider loading up another nssock instance within AOLserver that only accepts requests on a certain port of localhost, and not on your public IP address. Then point htDig at that address and that port. Since the address is localhost, the public and the h4x0r should not be able to gain access to your system. htDig's rewrite rules should fix the URLs back to publicly addressable URLs.
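To make the URL fix-up concrete, here is a minimal sketch in Python of the transformation the rewrite rules need to perform. It is only an illustration, not htDig configuration; the function name, the port and the hostnames are made up for the example.

```python
def rewrite_to_public(url, local_prefix, public_prefix):
    """Map a URL crawled via the localhost-only listener to its public form.

    Both prefixes are hypothetical and should end with "/" so that
    "http://127.0.0.1:8001/" cannot match "http://127.0.0.1:80010/".
    """
    if url.startswith(local_prefix):
        # Keep the path and query string, swap only the scheme/host/port part.
        return public_prefix + url[len(local_prefix):]
    # Anything that was not crawled through the local listener is left alone.
    return url


# Example values (assumptions, not taken from the original post).
LOCAL = "http://127.0.0.1:8001/"
PUBLIC = "http://www.example.com/"

public_url = rewrite_to_public(LOCAL + "private/doc?id=7", LOCAL, PUBLIC)
```

Whatever mechanism you use in htDig, the stored URL should end up as the public one, so that a user clicking a search result hits the normal login check.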
Your strategy should work, but it may have a small problem: since htDig won't maintain cookie state, every time anything wants the user id it will have to go through a code path that encounters your "fix".
Another strategy would be to create a Tcl-based proxy. Modify httpget to log in and obtain/maintain the cookie login information by adding the appropriate cookie headers to each request. Point htDig towards your proxy, and have your proxy httpget the actual pages and return them to htDig. Again, use htDig's rewrite rules to fix up the returned URLs. Once again, you can secure this by having your proxy check that the connection is being made from localhost.
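The cookie-handling part of such a proxy could look roughly like the following. This is a sketch in Python rather than Tcl, using only the standard library, just to show the flow: log in once, keep the cookie, and resend it on every fetch. The login URL, form field names and credentials are placeholders; check your own login form for the real ones.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical values: adjust to whatever your site's login form expects.
LOGIN_URL = "http://127.0.0.1:8001/login"
CREDENTIALS = {"email": "indexer@example.com", "password": "secret"}


def make_session():
    """Return (opener, jar). The opener stores cookies in the jar and
    adds them to every later request made through it."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def login(opener, login_url=LOGIN_URL, credentials=CREDENTIALS):
    """POST the login form once; any session cookie ends up in the jar."""
    body = urllib.parse.urlencode(credentials).encode("ascii")
    with opener.open(login_url, data=body, timeout=30) as resp:
        return resp.status


def fetch(opener, url):
    """Fetch a page on behalf of the indexer, cookies attached."""
    with opener.open(url, timeout=30) as resp:
        return resp.read()
```

The proxy would call `make_session()` and `login()` at startup, then answer each htDig request by calling `fetch()` on the real page and passing the body back.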
Robot detection for PostgreSQL in OpenACS 4 is not ported yet, according to the status document.
It's pretty easy to forge the user agent, so I guess it depends on the nature of the content that is being protected and indexed. Joel wants it indexed, so I guess it's not that confidential and he just wants users to register to see it. But it is protected, so maybe not.
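If the content does matter, a test on the peer address of the connection is harder to fake than the user agent, which is just a header the client fills in. Here is a rough sketch of that check in Python; the function name is made up, and it assumes you pass in the address of the actual socket peer, not anything taken from a request header.

```python
import ipaddress


def is_loopback_peer(peer_addr):
    """Return True only if the connection's peer address is a loopback
    address (127.0.0.0/8 or ::1).

    peer_addr must come from the socket itself. A value copied from a
    header such as X-Forwarded-For can be forged just like User-Agent.
    """
    try:
        return ipaddress.ip_address(peer_addr).is_loopback
    except ValueError:
        # Not a valid IP address at all: treat it as untrusted.
        return False


# The indexer on the same box passes; everything else gets the login page.
allow_without_login = is_loopback_peer("127.0.0.1")
```

Note that this stops being reliable if the server sits behind a reverse proxy on the same machine, since every request would then appear to come from localhost.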