Forum OpenACS Q&A: htDig

1: htDig
Posted by Joel Natividad on
Folks,

We're building a site that has a protected area that htDig cannot index.

As a quick fix, I was thinking that one way of indexing the protected area is to let htDig bypass OpenACS's password screen.

Perhaps this could be done by looking at the HTTP header and confirming that the request comes from the same box (i.e. htDig is running on the same computer).

htDig will still create the index, but when site users try to reach the protected pages from the search results, they will be required to log in.
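
A minimal sketch of the same-box check, written in Python for brevity rather than the Tcl that OpenACS itself uses. It assumes the server can see the TCP peer address of the connection; the function name `is_local_request` is made up for illustration. It tests the peer address of the socket rather than anything in the request headers, because headers are supplied by the client.

```python
import ipaddress


def is_local_request(peer_addr: str) -> bool:
    """Return True only if the connection's peer address is a loopback address.

    peer_addr is the address of the connecting socket (hypothetical input),
    not a value read from a request header.
    """
    try:
        return ipaddress.ip_address(peer_addr).is_loopback
    except ValueError:
        # Anything that is not a valid IP address is treated as non-local.
        return False
```

A request for which this returns False would go through the normal login screen, which is what keeps site users on the ordinary path.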

It's quick and dirty but, you'll have to admit, quite an easy way to index a site.

My question now is:
Where should I look in the source code to enable this functionality?

2: Response to htDig (response to 1)
Posted by Jerry Asher on
If you are using OpenACS 3.2.5, then take a look within tcl/ad-security.tcl at ad_verify_and_get_user_id.

To help secure your hack, you might consider loading up another nssock instance within AOLserver that only accepts requests on a certain port of localhost, and not on your public IP address.  Then point htDig at that address and that port.  Since the address is the localhost, the public and the h4x0r should not be able to gain access to your system.  htDig's rewrite rules should fix the URLs back to publicly addressable URLs.

Your strategy should work, but may have small problems: since htDig won't maintain cookie state, every time anything wants the user id it will have to go through a code path that encounters your "fix".

Another strategy would be to create a Tcl-based proxy.  Modify httpget to log in and obtain/maintain the cookie login information by adding the appropriate cookie headers to each request.  Point htDig at your proxy, and have your proxy httpget the actual pages and return them to htDig.  Again, use htDig's rewrite rules to fix up the returned URLs.  Once again, you can secure this by having your proxy check that the connection is being made on the localhost.
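
The proxy idea can be sketched roughly as follows. This is Python rather than Tcl, and only an illustration under stated assumptions: the upstream address, the port, and the cookie value are placeholders, and the names `build_upstream_request`, `IndexerProxy`, and `run` are invented here. It assumes the session cookie has already been obtained by logging in once by hand.

```python
import ipaddress
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumptions: where the real site listens and which cookie marks a logged-in session.
UPSTREAM = "http://127.0.0.1:8000"
SESSION_COOKIE = "session=PLACEHOLDER"


def build_upstream_request(path, upstream=UPSTREAM, cookie=SESSION_COOKIE):
    """Build the request for the real page, with the login cookie attached."""
    req = urllib.request.Request(upstream + path)
    req.add_header("Cookie", cookie)
    return req


class IndexerProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        # Refuse anything that does not come from the local machine.
        if not ipaddress.ip_address(self.client_address[0]).is_loopback:
            self.send_error(403)
            return
        try:
            with urllib.request.urlopen(build_upstream_request(self.path)) as resp:
                body = resp.read()
                ctype = resp.headers.get("Content-Type", "text/html")
        except urllib.error.HTTPError as err:
            # Pass upstream error codes through to the indexer.
            self.send_error(err.code)
            return
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def run(port=8081):
    """Serve on localhost only, so the proxy is not reachable from outside."""
    HTTPServer(("127.0.0.1", port), IndexerProxy).serve_forever()
```

Calling `run()` starts the proxy; the indexer would then be pointed at `http://127.0.0.1:8081/` instead of the public site, with the rewrite rules mapping the indexed URLs back to the public ones.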

3: Response to htDig (response to 1)
Posted by Dave Bauer on
Would robot detection work here? Does htDig have a user-agent? If you can customize it, you should be able to identify your htDig and set up robot detection to allow htDig to see those parts of your site.

Robot detection for PostgreSQL in OpenACS 4 is not ported yet, according to the status document.

4: Response to htDig (response to 1)
Posted by Jerry Asher on
I've never set up the robot detection stuff, but yes, provided that that works, it is trivial to set htDig's user agent, and so the combination should win.

It's pretty easy to forge the user agent, so I guess it depends on the nature of the content that is being protected and indexed.  Joel wants it indexed, so I guess it's not that confidential and he just wants users to register to see it.  But it is protected, so maybe not.
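
Because the user agent alone can be forged, one option is to require it together with the localhost check suggested earlier in the thread. A rough sketch in Python, not the OpenACS robot detection code: the token `htdig` is an assumption about what the indexer has been configured to send, and `may_skip_login` is a made-up name.

```python
import ipaddress

# Assumption: the indexer is configured to send a user agent containing this token.
INDEXER_AGENT_TOKEN = "htdig"


def may_skip_login(user_agent, peer_addr):
    """Allow the login bypass only for the indexer's user agent AND a loopback peer."""
    if INDEXER_AGENT_TOKEN not in (user_agent or "").lower():
        return False
    try:
        return ipaddress.ip_address(peer_addr).is_loopback
    except ValueError:
        # An unparseable address is never treated as local.
        return False
```

With both conditions required, a remote client that forges the user agent still gets the login screen.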

5: Response to htDig (response to 1)
Posted by Jun Yamog on
Not sure if this would help, but we use htDig on our smaller non-ACS sites.  One site that uses it has basic HTTP authentication.  You can set htDig to log in for you.  Not sure if htDig can work with forms, but maybe it can post.  Dig around the docs; it may do it out of the box.  htDig has been around for quite some time, so I am sure they have done this, or someone out there on the net has.