Forum OpenACS Q&A: tcl web crawler...

Posted by David Kuczek on
I would like to add a feature to my oacs system that crawls
specific websites on a schedule and stores the information in my
oacs database...

Two scenarios exist:

1. The specific site uses IDs for different pages. I could ns_httpget
the sites with a counter, regexp the page and see if I already have
that page in my db... (rather easy)

2. The specific site uses forms to get to different pages. I would
have to fill out the form with specific information and follow the
result links one by one... To make the story even harder: I might have
to login with some name and password first!
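
A minimal sketch of scenario 1, under some assumptions: the proc names, the URL pattern, and the `<title>` regexp are hypothetical, and a Tcl array stands in for the database lookup. The fetch command is passed in as a parameter so the sketch runs in a plain tclsh; inside AOLserver it could be ns_httpget.

```tcl
# Hypothetical sketch: crawl pages addressed by a numeric ID.
# fetch_cmd is a command prefix that takes a URL and returns the HTML.
proc crawl_by_id {fetch_cmd url_pattern first_id last_id seen_var} {
    upvar 1 $seen_var seen
    set new_pages {}
    for {set id $first_id} {$id <= $last_id} {incr id} {
        # Skip IDs already stored (stand-in for a database lookup).
        if {[info exists seen($id)]} {
            continue
        }
        set url [format $url_pattern $id]
        # Tolerate fetch failures so one bad page does not stop the run.
        if {[catch {uplevel #0 $fetch_cmd [list $url]} html]} {
            continue
        }
        # Pull out the part of the page we care about, here the <title>.
        if {[regexp -nocase {<title>([^<]*)</title>} $html -> title]} {
            set seen($id) [string trim $title]
            lappend new_pages $id
        }
    }
    return $new_pages
}

# Stub fetcher standing in for ns_httpget (assumption, no network access).
proc fake_fetch {url} {
    if {[string match "*id=3" $url]} {
        error "simulated timeout"
    }
    return "<html><head><title> Page for $url </title></head></html>"
}

array set pages {1 "already stored"}
set added [crawl_by_id fake_fetch "http://example.com/item?id=%d" 1 4 pages]
puts $added
```

With the stub above, ID 1 is skipped because it is already stored and ID 3 is skipped because its fetch fails, so only IDs 2 and 4 are added. In a real installation the array lookup would become a query against the oacs database, and the whole proc could be run from a scheduled procedure.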

Has anyone built such a system, or does anyone have experience with
technology that fits scenario 2 and/or scenario 1?

Thanks

Posted by Tilmann Singer on
For scenario 2, check out tclwebtest. Filling out forms is easy (see the examples in the documentation). Looping through the links of an HTML page works, for example, like this:
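do_request $url
foreach link [link all] {
  # each link is a key/value list; reset the array so keys from the
  # previous link do not leak into this iteration
  array unset a_link
  array set a_link $link
  # do something with $a_link(url) ...
}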
Unfortunately running it from within aolserver is a bit complicated - you need to make its commands as well as the tcl http package available to the running thread. There was some discussion recently either here or on the aolserver mailing list on how to use external libraries with aolserver but I did not follow it closely.

I have a quick-and-dirty solution lying around somewhere that was done for the acs-automated-testing package, which checks on each request if the libraries are already available and if not then it "source"s them. If you are interested in it I can look that stuff up for you.
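
The "check on each request, source if missing" idea could look roughly like the sketch below. This is not the code from acs-automated-testing, only a guess at its general shape: the helper name `ensure_sourced`, the probe command, and the demo file are all hypothetical.

```tcl
# Hypothetical helper: make a library's commands available in the current
# interpreter, sourcing the file only when a probe command is missing.
proc ensure_sourced {probe_cmd file} {
    if {[llength [info commands $probe_cmd]] > 0} {
        return 0   ;# already loaded in this interpreter
    }
    # Source at global level so the library's procs are defined globally.
    uplevel #0 [list source $file]
    return 1
}

# Demo with a throwaway library file (stand-in for the real library).
set lib [file join [pwd] demo_lib_[pid].tcl]
set fh [open $lib w]
puts $fh {proc demo_hello {} { return "hello" }}
close $fh

set first  [ensure_sourced demo_hello $lib]   ;# sources the file
set second [ensure_sourced demo_hello $lib]   ;# finds the command, does nothing
file delete $lib
puts "$first $second [demo_hello]"
```

Called at the top of a page script, a helper like this costs one `info commands` lookup per request once the library is loaded in that thread's interpreter. Whether this is enough for tclwebtest and the Tcl http package inside AOLserver is an assumption that would need testing.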

Posted by Carl Coryell-Martin on
Me too! I am thinking about implementing the Google SOAP API in aolserver and know that I will either have to get nssoap working or make a bunch of tcl packages available to aolserver threads. I would love to learn more about this.

cheers,

carl

Posted by David Kuczek on
Hello Tilmann,

This would be a good starting point, I believe. It would be great if you could package your oacs solution together with a short how-to!

Carl,

I was also thinking about integrating the google API into acs, but it is not really high up on my priority list... Will you make your work public?

Posted by Don Baccus on
I, too, have been thinking about the google SOAP API, but there's no way in the world that I have time to work on it in the near future.

But I'll gladly give encouragement to anyone who does!

Posted by carl garland on
Userland Software (home of weblogs.com and scripting.com, founders of XML-RPC and cofounders of SOAP) has created an XML-RPC service wrapper for the Google SOAP API. We already have a minimal XML-RPC interface for aolserver, and I have been working on extending it (still lots to do).

If you are interested, I just started a thread in the OpenACS Design forum that explains more about the extension work.

Posted by Tilmann Singer on
David: good idea! I copied the stuff that I have previously put into acs-automated-testing into its own package, check it out from file-storage: https://openacs.org/new-file-storage/one-file.tcl?file_id=357

After you install and mount it somewhere, it should display a (hopefully) self-explanatory form. Sorry, no docs (yet). Let me know if it works for you.

Posted by David Kuczek on
Hello Tilmann, muchas gracias. I've had nothing to do with oacs 4.5 so far... would it be complicated to use it on a 3.2.5 system?

And/Or what would I have to do for that?

Posted by Tilmann Singer on
<blockquote> I've had nothing to do with oacs 4.5 so far
</blockquote>

try it and you will never want to go back ...

<blockquote> would it be complicated to use it on a 3.2.5 system?
</blockquote>

You could try it by unzipping the .apm file, copying tcl/tclwebtest-procs.tcl someplace where it is sourced, and backporting the page www/admin/test-run.tcl, where the stuff actually happens, but I wonder if that would be worth the effort.

Posted by David Kuczek on
Does anyone run tclwebtest on a 3.2.5 installation?

If nobody does, what would I have to do in order to use it on my test server? Install oacs 4.6? I was looking around for installation docs for tclwebtest, but I couldn't find anything. I also read in a thread that tclwebtest is being used for automatically setting up dotLRN and that I should run a cvs update on the 0.4 version. Is this correct?

Thanks

Posted by Tilmann Singer on
You don't have to do anything special to make it work with any OpenACS version; it just simulates user behaviour. You can run it against any website you want, as long as the site's html is not so unusual that it confuses the parser.

No installation necessary, it's just a tcl script - download and run it. And yes, the CVS version is recommended.