Forum OpenACS Development: Image grabbing

Posted by Iuri Sampaio on
With regard to grabbing images as available at and how it works. Also, check out the Flickr website.

Basically, it pulls up the images on a page, you select one, and it is added to their website.

Is there any ad_proc to parse html files?

I searched /api-doc, but none of these seems to achieve the objective:

ad_parse_html_attributes html pos
ad_parse_html_attributes_upvar html_varname pos_varname
ad_html_to_text html
ad_convert_to_html text

I guess I must go with tDOM.
Any ideas?

2: Re: Image grabbing (response to 1)
Posted by Iuri Sampaio on
The problem with tDOM is that I get syntax errors if the HTML is not well formed.

For example, a single li missing its closing tag caused the error below:

[20/Jun/2013:22:05:28][8714.3036609392][-default:18-] Error: GET
referred by ""
error "Unterminated element 'li' (within 'div')" at position 38963
"ript:joinow('joinow');"INSCREVA-SE AGORA/adivdiv class="shadow"/div/li --Error--


4: Re: Image grabbing (response to 2)
Posted by Gustaf Neumann on

Are you aware of the "-html" option of "dom parse"? It handles most HTML pages and has no problems with "missing" end tags for LI.

all the best
-gustaf neumann

5: Re: Image grabbing (response to 4)
Posted by Iuri Sampaio on

I used the "-html" option of "dom parse". Have a look:

set doc [dom parse -html $result]

set doc [dom parse -simple -html $result]

The syntax errors still remain.

3: Re: Image grabbing (response to 1)
Posted by Benjamin Brink on
Hi Iuri,

Take a look at the ecds_* procs in ecommerce. There are procs there that grab an image or html from an external website, as well as procs to parse an html page (even if the html page varies or has poor syntax).

ecds_import_product_from_vendor_site <-- can import image and data from an html page

The ecds_* procs are private, because they are made to work with ecommerce, but you should be able to make them available for site-wide use with a little modification.


6: Re: Image grabbing (response to 3)
Posted by Iuri Sampaio on

ecds_get_image_from_url ""

It returns errors. The proc is called from the file /e-commerce/www/grabber.tcl

So far I haven't been able to grab images from a random external website:

[22/Jun/2013:19:45:35][8714.3018832752][-default:20-] Notice: ecds_get_image_from_url: wgetting (waiting 20 sec to check)

[22/Jun/2013:19:45:55][8714.3018832752][-default:20-] Error: ecds_get_image_from_url: file /var/www/natopia/ecds-url-cache/ does not exist after attempt to fetch from
[22/Jun/2013:19:45:55][8714.3018832752][-default:20-] Notice: ecds_get_image_from_url: status is ERROR

7: Re: Image grabbing (response to 6)
Posted by Torben Brosten on
Hi Iuri,

That url would return errors, because it is not pointing to a single image. The proc expects a url to a single image file.

To test, try something like:



8: Re: Image grabbing (response to 7)
Posted by Iuri Sampaio on

If I understood correctly, I should first get the HTML page.

Then I use a regular expression to build a list of image paths/URLs, and run a loop over that list to get the images.

Is that it?

What ad_proc should I use to get html?
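
The regular-expression step described in this post could look roughly like the sketch below. It is only an illustration in plain Tcl: the proc name extract_img_srcs and the sample HTML are made up for this example and are not part of OpenACS. A regexp is indifferent to unclosed tags such as a missing /li, but it is brittle in other ways (for instance, it only handles quoted src values).

```tcl
# Hypothetical helper (not part of OpenACS): collect the src values of
# all <img> tags in an HTML string using a regular expression.
proc extract_img_srcs {html} {
    set srcs {}
    # Match <img ...>, skip any other attributes, then capture a quoted src value.
    set re {<img\s(?:[^>]*\s)?src\s*=\s*["']([^"']+)["']}
    # -all -inline returns a flat list: full match, captured src, full match, ...
    foreach {match src} [regexp -all -inline -nocase $re $html] {
        lappend srcs $src
    }
    return $srcs
}

# Made-up sample input with an unclosed <li>
set sample {<div><li><img src="/a.png" alt="a"><IMG class="x" SRC='http://example.com/b.jpg'></div>}
puts [extract_img_srcs $sample]
```

Relative paths such as /a.png would still have to be combined with the address of the page they came from before they can be fetched or handed to a client-side album.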

9: Re: Image grabbing (response to 8)
Posted by Iuri Sampaio on
Something coded like this:

set html [ns_httpget ""]

# img_urls: the list of image URLs extracted from $html
foreach url $img_urls {
    set img [ecds_get_image_from_url $url]
}

10: Re: Image grabbing (response to 8)
Posted by Iuri Sampaio on
I guess [ecds_get_image_from_url] isn't the ad_proc I'm looking for.

I want to provide their urls to an ajax/jquery album.

As in the ad_proc below. It returns urlList, which is a list of image URLs to be used in the ajax source code.

That way I don't even need to have them on my server. Everything will be loaded in the ajax on client side.

ad_proc -public url_grabber {
    -url:required
    {-type "img"}
} {
    Returns a list of urls
} {

    set html [ns_httpget $url]

    set doc [dom parse -html $html]

    if {[catch {set root [$doc documentElement]} err]} {
        error "Error parsing HTML: $err"
    }

    # Note: -type is accepted but not used yet; only img nodes are selected
    set img_nodes [$root selectNodes {descendant::img}]

    set urlList {}

    foreach node $img_nodes {
        set name [$node nodeName]
        set attribs [$node attributes *]
        ns_log Notice "$name - $node - $attribs"

        foreach attribute $attribs {
            if {[string tolower $attribute] eq "src"} {
                lappend urlList [$node getAttribute $attribute]
                ns_log Notice " [$node getAttribute $attribute]"
            }
        }
    }

    # Get rid of the DOM representation of your HTML document
    $doc delete

    # finished
    return $urlList
}

set url ""

set img_urls [url_grabber -url $url -type img]

11: Re: Image grabbing (response to 10)
Posted by Iuri Sampaio on

I am able to run "dom parse". I'm one step further now.

13: Re: Image grabbing (response to 10)
Posted by Torben Brosten on
Okay, yeah, you don't want to use the ecds_* procs. The ecds_ procs cache a local copy. It's assumed that the content will be manipulated for use on an ecommerce website.

14: Re: Image grabbing (response to 13)
Posted by Iuri Sampaio on

In fact I want to manipulate images for use on an ecommerce website.

How would I do that with ecds_* procs?

15: Re: Image grabbing (response to 14)
Posted by Torben Brosten on
Hi Iuri,

Hmm.. each case is somewhat unique. Here is an overview of what we did with ecds procs.

ecds-procs.tcl contains general procedures to help with much of that.

An example of how we imported products from a partner website is at:




In either case, an admin feeds a bunch of product references to the page. The code (using ecds_import_product_from_vendor_site) converts each product reference to a url. ecds_get_url is called to collect the html page from a partner website.

Each partner vendor requires a different set of parsing routines, because (at least for us) no two vendors used the same standards. The reality is that most didn't use any standard. Custom procedures for each case were required. An internal abbreviation for each vendor was used as a unique reference, ecds_vendors.abbrev. Each vendor had a minimum of the abbrev and title fields used in the table ecds_vendors.

The html content would be parsed using these and other procs:


Much of the parsing requires different code for each vendor. So, some (many) procs reference unique procs for each vendor. The ecds_vendors.abbrev would be used in the proc name to define the proc uniquely. The following page has an example, where the ecds_vendors.abbrev = 'ex'.


Once product information is generated, the ecommerce product data would be updated using:


To get an image for the product, use:


Then import the image to the product directory:


You do not have to worry about retrieving a page from an external website multiple times if more than one product is represented on one page. ecds_get_url lets you set the local cache refresh period to a relative time compatible with tcl's clock scan.
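
To make "a relative time compatible with tcl's clock scan" concrete, here is a small sketch in plain Tcl. It does not show how ecds_get_url itself takes or applies the refresh period, since that is not spelled out in this thread; the proc name cache_is_fresh and its arguments are hypothetical and only illustrate how a string such as "1 hour" or "2 days" can be turned into an expiry time.

```tcl
# Hypothetical helper (not an ecds_* proc): decide whether a cached copy
# fetched at $fetched_at (epoch seconds) is still fresh at time $now,
# given a relative period such as "1 hour" or "2 days".
proc cache_is_fresh {fetched_at period {now ""}} {
    if {$now eq ""} {
        set now [clock seconds]
    }
    # clock scan interprets the relative period against the base time
    set expires_at [clock scan $period -base $fetched_at]
    return [expr {$now < $expires_at}]
}

set fetched_at [clock seconds]
# A copy fetched just now, with a one hour refresh period
puts [cache_is_fresh $fetched_at "1 hour"]
# A copy fetched two hours ago, with the same refresh period
puts [cache_is_fresh [expr {$fetched_at - 7200}] "1 hour"]
```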

This is a brute force paradigm designed to handle the most difficult cases, where everything is different.

If you are not importing product data to ecommerce and just grabbing images for related presentation, then the task should be straightforward. However, this process caches a local copy on the hard disk in order to not clobber partner websites. The time delays make this process incompatible with instant ajax style requirements.

Also, see the note at the top of ecommerce/tcl/ecds-procs.tcl. A few custom fields need to be defined via the ecommerce admin's "Add a custom field" page.


12: Re: Image grabbing (response to 8)
Posted by Torben Brosten on
ecds_get_url can be used to get the html.