Forum OpenACS Q&A: grabbing text from a website...

Posted by David Kuczek on
I want to grab the text from a website and automatically display it in
a textarea...

Is there another way of doing this than grabbing the page with
ns_httpget and regexping the content? Did anyone write a regexp that
would transform http into text?

Posted by David Kuczek on
... I mean:

Did anyone write a regexp that would transform html into text?

of course...

Posted by Don Baccus on
See Lars Pind's excellent ad_html_text_convert proc in OpenACS 4.5's packages/acs-tcl/tcl/text-html-procs.tcl file.

Though it's part of OpenACS 4.x it should work fine in your 3.x context.

Posted by MaineBob OConnor on

Hi David,

In an OpenACS 3.x install at:

http://www.ercmembers.net/doc/proc-one.tcl?proc_name=util%5fstriphtml

there is this proc:
util_striphtml html

What it does:

Returns a best-guess plain text version of an HTML fragment. Better than ns_striphtml because it doesn't replace &gt; and &lt; with empty string.
Defined in: /web/nsaerc/tcl/ad-utilities.tcl.preload

Source code:
  return [util_expand_entities [util_remove_html_tags $html]]
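For readers without that install at hand, the two-step idea behind that one-liner (strip the tags, then expand the entities) can be sketched in plain Tcl. This is only an illustration under stated assumptions: the sketch_* proc names are hypothetical, the tag regexp is deliberately naive, and only a handful of entities are handled. It is not the actual source of util_remove_html_tags or util_expand_entities.

```tcl
# Hypothetical stand-in for the tag-removal step (not the OpenACS source).
proc sketch_remove_html_tags {html} {
    # Naive: delete every <...> run. Comments, <script> bodies and
    # attribute values containing ">" are not treated specially.
    regsub -all {<[^>]*>} $html {} text
    return $text
}

# Hypothetical stand-in for the entity-expansion step (not the OpenACS source).
proc sketch_expand_entities {text} {
    # Only a few common entities are covered here (an assumption).
    # string map works in a single pass, so "&amp;lt;" ends up as
    # the literal text "&lt;" rather than "<".
    set map [list "&lt;" "<" "&gt;" ">" "&quot;" "\"" "&nbsp;" " " "&amp;" "&"]
    return [string map $map $text]
}

# Same shape as the one-liner above: remove tags first, then expand entities.
proc sketch_striphtml {html} {
    return [sketch_expand_entities [sketch_remove_html_tags $html]]
}

puts [sketch_striphtml {<p>1 &lt; 2 &amp; <b>3 &gt; 2</b></p>}]
# prints: 1 < 2 & 3 > 2
```

The order matters: because the tags are removed before the entities are expanded, an escaped fragment such as &lt;b&gt; survives as the visible text <b> instead of being stripped as if it were markup.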

AND
There are other procs you might find useful at one of my sites:

http://www.ercmembers.net/doc/proc-search.tcl?query_string=html

The above link to /doc/... has been broken for a long time in 'this here' openacs.org 😟

http://www.openacs.org/doc/proc-search.tcl?query_string=html

-Bob