Forum OpenACS Q&A: OT: How to search the web...

Posted by Anton Bajri on
Hello.

I have the following problem; it's not directly OACS related, but
interesting anyway. I have to search several sites, on a daily or
weekly basis, for some keywords (I'm looking for news coverage of
childhood/family violence, trafficking of minors, and similar
issues). The first idea was to hit Google with the queries, process
the result pages, and store the hits in a table; an operator would
then walk through all the hits and separate the data from the noise.
But Google's rules of use (reasonably) forbid that kind of automated
querying, say "don't even ask", and warn that they will block access
from offending IPs upon detection.

So the questions are: does anyone know of a service (it could even be
a paid one... eeeech) that could solve this problem? English is not
my first language, so I'm not sure I'm being polite enough... but
does anyone have experience dealing with search services, and is that
a viable option? Or is there another solution, like setting up a
crawler myself?

Thanks in advance,

Jorge

Posted by John Sequeira on
Google doesn't let you spider their site, but they offer a SOAP-based API for noncommercial use IIRC. I've used it via the Perl module, and it works as advertised.

http://www.google.com/apis/
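
To give a feel for it, here's a rough Tcl sketch using the TclSOAP package (untested as written; you need to register with Google for a free license key, "YOUR-KEY-HERE" is a placeholder, and I'm assuming TclSOAP's SOAP::create interface here rather than the Perl module):

    package require SOAP

    # Bind a Tcl proc to Google's doGoogleSearch SOAP method.
    # Endpoint, namespace, and parameter order follow Google's
    # published GoogleSearch.wsdl.
    SOAP::create doGoogleSearch \
        -uri    urn:GoogleSearch \
        -proxy  http://api.google.com/search/beta2 \
        -params {key string q string start int maxResults int
                 filter boolean restrict string safeSearch boolean
                 lr string ie string oe string}

    # Run one query: start at result 0, fetch 10 hits, no filtering.
    set result [doGoogleSearch "YOUR-KEY-HERE" \
                    {"family violence" OR "child abuse"} \
                    0 10 false "" false "" latin1 latin1]

What comes back is a structure of result elements (title, URL, snippet) that you'd walk through and insert into your table for the operator to review.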

Also, check out spyonit.com - they have something like what you're looking for, but it's not programmable.

Posted by Michael A. Cleverly on
Not that I'd advocate breaking Google's rules, but if you do write an HTTP robot to run queries against Google (or against sites that think you're unfit for their content unless you're running the latest MSIE), just make sure the User-Agent string you send isn't "AOLserver", "Tcl HTTP 2.3", "Perl::LWP", or the like. Stick to something mainstream (IE, Netscape, Mozilla, Opera, etc.) and you'll be reasonably safe, in general, from sites that discriminate based on the User-Agent header.
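
If you're doing this from Tcl, the standard http package makes the masquerade a one-liner (a minimal sketch; the User-Agent string below is just an example, copy whatever your real browser sends, and example.com is a placeholder):

    package require http

    # Present a mainstream browser User-Agent instead of the default
    # "Tcl http/2.x" string that gives the robot away.
    http::config -useragent "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

    set token [http::geturl "http://www.example.com/"]
    set page  [http::data $token]
    http::cleanup $token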