Forum OpenACS Development: Re: Image grabbing

3: Re: Image grabbing (response to 1)
Posted by Benjamin Brink on
Hi Iuri,

Take a look at the ecds_* procs in ecommerce. There are procs there that grab an image or html from an external website, as well as procs to parse an html page (even if the html page varies or has poor syntax).

Specifically:
ecds_get_image_from_url
ecds_get_url
ecds_import_image_to_ecommerce
ecds_import_product_from_vendor_site <-- can import image and data from an html page

The ecds_* procs are private, because they are made to work with ecommerce, but you should be able to make them available for site-wide use with a little modification.

cheers,

6: Re: Image grabbing (response to 3)
Posted by Iuri Sampaio on
Benjamin,

ecds_get_image_from_url "http://weheartit.com/?page=1"

It returns errors. The proc is called from the file /e-commerce/www/grabber.tcl

Beyond that, I wasn't able to grab images from an arbitrary external website.

[22/Jun/2013:19:45:35][8714.3018832752][-default:20-] Notice: ecds_get_image_from_url: wgetting http://weheartit.com/ (waiting 20 sec to check)

[22/Jun/2013:19:45:55][8714.3018832752][-default:20-] Error: ecds_get_image_from_url: file /var/www/natopia/ecds-url-cache/weheartit.com does not exist after attempt to fetch from http://weheartit.com/
[22/Jun/2013:19:45:55][8714.3018832752][-default:20-] Notice: ecds_get_image_from_url: status is ERROR

7: Re: Image grabbing (response to 6)
Posted by Torben Brosten on
Hi Iuri,

That url would return errors, because it is not pointing to a single image. The proc expects a url to a single image file.

To test, try something like:

ecds_get_image_from_url https://openacs.org/templates/slices/openacs.gif

cheers,

8: Re: Image grabbing (response to 7)
Posted by Iuri Sampaio on
Torben,

If I understood correctly, I should first get the html page, then use a regular expression to build a list of image paths/urls, and then run a loop over that list to get the images.

Is that it?

What ad_proc should I use to get html?
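
A minimal sketch of the regular-expression step described above, in plain Tcl. The proc name extract_img_srcs and the sample html are made up for illustration; nothing here comes from the ecds_* procs. It only handles quoted src values, and it would also pick up attributes that end in "src" (such as data-src), so treat it as a starting point rather than a robust parser.

```tcl
# Hypothetical helper: return the src values of all <img> tags in an html string.
proc extract_img_srcs {html} {
    set urls [list]
    # regexp -all -inline returns the full match and the captured group
    # alternately, so read them in pairs and keep only the captured src.
    foreach {match src} [regexp -all -inline -nocase \
            {<img[^>]+src\s*=\s*["']([^"']+)["']} $html] {
        lappend urls $src
    }
    return $urls
}

# Made-up sample page fragment, mixing case and quote styles.
set html {<p><img src="/a.gif" alt="a"><IMG class="x" SRC='http://example.com/b.png'></p>}
set img_urls [extract_img_srcs $html]
```

Relative paths such as /a.gif come back unchanged, so they would still need to be joined with the page's base url before being fetched.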

9: Re: Image grabbing (response to 8)
Posted by Iuri Sampaio on
Something coded as in

set html [ns_httpget "http://iurix.com"]

# img_urls: list of image urls extracted from $html (extraction step not shown)
foreach url $img_urls {
    set img [ecds_get_image_from_url $url]
}

10: Re: Image grabbing (response to 8)
Posted by Iuri Sampaio on
I guess [ecds_get_image_from_url] isn't the ad_proc I'm looking for.

I want to provide their urls to an ajax/jquery album.

As in the ad_proc below, which returns urlList, a list of image urls to be used in the ajax source code.

That way I don't even need to have them on my server. Everything will be loaded by ajax on the client side.

ad_proc -public url_grabber {
    -url:required
    {-type "img"}
} {
    Returns a list of the src urls of the img elements found in the html page at url.
    The type switch is accepted but not used yet.
} {
    set html [ns_httpget $url]

    # dom parse is the step that can fail on broken markup, so catch errors there
    if {[catch {set doc [dom parse -html $html]} err]} {
        error "Error parsing HTML: $err"
    }
    set root [$doc documentElement]

    set img_nodes [$root selectNodes {descendant::img}]

    set urlList {}

    foreach node $img_nodes {
        set name [$node nodeName]
        set attribs [$node attributes *]
        ns_log Notice "$name - $node - $attribs"

        foreach attribute $attribs {
            if {[string tolower $attribute] eq "src"} {
                lappend urlList [$node getAttribute $attribute]
                ns_log Notice " [$node getAttribute $attribute]"
                break
            }
        }
    }

    # Get rid of the DOM representation of your HTML document
    $doc delete

    # finished
    return $urlList
}

set url "http://weheartit.com/"

set img_urls [url_grabber -url $url -type img]

11: Re: Image grabbing (response to 10)
Posted by Iuri Sampaio on
Gustaf,

I am able to run "dom parse". I'm one step further now.

13: Re: Image grabbing (response to 10)
Posted by Torben Brosten on
Okay, yeah, you don't want to use the ecds_* procs. The ecds_ procs cache a local copy. It's assumed that the content will be manipulated for use on an ecommerce website.
14: Re: Image grabbing (response to 13)
Posted by Iuri Sampaio on
Torben,

In fact, I do want to manipulate images for use on an ecommerce website.

How would I do that with ecds_* procs?

15: Re: Image grabbing (response to 14)
Posted by Torben Brosten on
Hi Iuri,

Hmm.. each case is somewhat unique. Here is an overview of what we did with ecds procs.

ecds-procs.tcl contains general procedures to help with much of that.

An example of how we imported products from a partner website is at:

ecommerce/www/admin/products/upload-vendor-imports

and

ecommerce/www/admin/products/vendor-imports-add-update

In either case, an admin feeds a bunch of product references to the page. The code (using ecds_import_product_from_vendor_site) converts each product reference to a url. ecds_get_url is called to collect the html page from a partner website.

Each partner vendor requires a different set of parsing routines, because (at least for us) no two vendors used the same standards. In reality, most didn't use any standard, so custom procedures were required for each case. An internal abbreviation for each vendor, stored in ecds_vendors.abbrev, was used as a unique reference. At a minimum, each vendor had the abbrev and title fields filled in the table ecds_vendors.

The html content would be parsed using these and other procs:

ecds_abbreviate
ecds_convert_html_list_to_tcl_list
ecds_convert_html_table_to_list
ecds_get_category_id_from_title
ecds_get_contents_from_tag
ecds_get_contents_from_tags_list
ecds_email_on_purchase_list
ecds_get_subcategory_id_from_title
ecds_get_subsubcategory_id_from_title
ecds_is_natural_number
ecds_keyword_search_update
ecds_remove_attributes_from_html
ecds_remove_from_list
ecds_remove_html
ecds_remove_tag_contents
ecds_sku_from_brand
ecds_webify

Much of the parsing requires different code for each vendor, so many procs call procs that are unique to a vendor. The ecds_vendors.abbrev is used in the proc name to make the name unique. The following file has an example, where ecds_vendors.abbrev = 'ex'.

ecommerce/tcl/ecds-ex-procs.tcl
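
To illustrate the naming idea only, here is a rough sketch in plain Tcl of dispatching to a per-vendor proc whose name is built from the abbrev. The proc names ecds_ex_parse_title and vendor_parse_title are hypothetical and are not taken from ecds-ex-procs.tcl; the real file may organize things differently.

```tcl
# Hypothetical per-vendor parser for the vendor with abbrev "ex".
proc ecds_ex_parse_title {html} {
    if {![regexp {<title>([^<]*)</title>} $html match title]} {
        return ""
    }
    return [string trim $title]
}

# Hypothetical dispatcher: build the proc name from the vendor abbrev
# and call it, failing clearly if that vendor has no parser defined.
proc vendor_parse_title {abbrev html} {
    set proc_name "ecds_${abbrev}_parse_title"
    if {[info commands $proc_name] eq ""} {
        error "no title parser defined for vendor '$abbrev'"
    }
    return [$proc_name $html]
}

set title [vendor_parse_title ex {<html><title> Blue Widget </title></html>}]
```

Adding a vendor then means adding one more file of procs named after its abbrev, without touching the dispatcher.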

Once product information is generated, the ecommerce product data would be updated using:

ecds_update_ec_products_product
ecds_add_product_to_ec_products
ecds_update_ec_category_map

To get an image for the product, use:

ecds_get_image_from_url

Then import the image to the product directory:

ecds_import_image_to_ecommerce

You do not have to worry about retrieving a page from an external website multiple times if more than one product is represented on one page. ecds_get_url lets you set the local cache refresh period to a relative time compatible with Tcl's clock scan.
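
As a rough illustration of how a relative time understood by clock scan can drive a cache refresh decision, here is a small plain-Tcl sketch. The helper cache_is_stale and its arguments are hypothetical; this is not how ecds_get_url is actually implemented, only the general idea of comparing a fetch time against a relative cutoff.

```tcl
# Hypothetical helper: returns 1 if a copy fetched at fetched_at (epoch
# seconds) is older than the refresh period allows, otherwise 0.
proc cache_is_stale {fetched_at refresh_period {now ""}} {
    if {$now eq ""} {
        set now [clock seconds]
    }
    # clock scan turns a relative phrase such as "2 hours ago" into an
    # absolute time, measured from the -base value.
    set cutoff [clock scan $refresh_period -base $now]
    return [expr {$fetched_at < $cutoff}]
}

set now [clock seconds]
# A copy fetched five hours ago, with a two hour refresh period.
set stale [cache_is_stale [expr {$now - 5 * 3600}] "2 hours ago" $now]
# A copy fetched ten minutes ago, with the same refresh period.
set fresh [cache_is_stale [expr {$now - 600}] "2 hours ago" $now]
```

A page that is still fresh can be read from the local copy instead of being fetched again.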

This is a brute force paradigm designed to handle the most difficult cases, where everything is different.

If you are not importing product data to ecommerce and are just grabbing images for related presentation, then the task should be straightforward. However, this process caches a local copy on the hard disk in order to not clobber partner websites. The time delays make this process incompatible with instant ajax-style requirements.

Also, see the note at the top of ecommerce/tcl/ecds-procs.tcl. A few custom fields need to be defined via the ecommerce admin's "Add a custom field" page.

cheers,
Torben

12: Re: Image grabbing (response to 8)
Posted by Torben Brosten on
ecds_get_url can be used to get the html.