text-html-procs.tcl

Contains procs used to manipulate chunks of text and html, most notably converting between them.

Location:
packages/acs-tcl/tcl/text-html-procs.tcl
Created:
19 July 2000
Author:
Lars Pind <lars@pinds.com>
CVS Identification:
$Id: text-html-procs.tcl,v 1.113 2024/10/27 16:51:11 gustafn Exp $

Procedures in this file

Detailed information

ad_convert_to_html (public, deprecated)

 ad_convert_to_html [ -html_p html_p ] text
Deprecated. Invoking this procedure generates a warning.

Convenient interface to convert text or html into html. Does the same as ad_html_text_convert -to html.

Switches:
-html_p (optional, defaults to "f")
Specify t if the value of text is formatted in HTML, or f if text is plaintext. Deprecated: this proc is a trivial wrapper for ad_html_text_convert.
Parameters:
text (required)
Author:
Lars Pind <lars@pinds.com>
Created:
19 July 2000

Testcases:
No testcase defined.

ad_convert_to_text (public, deprecated)

 ad_convert_to_text [ -html_p html_p ] text
Deprecated. Invoking this procedure generates a warning.

Convenient interface to convert text or html into plaintext. Does the same as ad_html_text_convert -to text.

Switches:
-html_p (optional, defaults to "t")
Specify t if the value of text is formatted in HTML, or f if text is plaintext. Deprecated: this proc is a trivial wrapper for ad_html_text_convert.
Parameters:
text (required)
Author:
Lars Pind <lars@pinds.com>
Created:
19 July 2000

Testcases:
No testcase defined.

ad_dom_sanitize_html (public)

 ad_dom_sanitize_html -html html [ -allowed_tags allowed_tags ] \
    [ -allowed_attributes allowed_attributes ] \
    [ -allowed_protocols allowed_protocols ] \
    [ -unallowed_tags unallowed_tags ] \
    [ -unallowed_attributes unallowed_attributes ] \
    [ -unallowed_protocols unallowed_protocols ] [ -no_js ] \
    [ -no_outer_urls ] [ -validate ] [ -fix ]

Sanitizes HTML according to the specified criteria, essentially by removing unallowed tags and attributes, JavaScript, or references to external URLs. When desired, this proc can also act purely as a validator in order to enforce markup policies on user-submitted content.

Switches:
-html (required)
the markup to be checked.
-allowed_tags (optional)
list of tags we allow in the markup.
-allowed_attributes (optional)
list of attributes we allow in the markup.
-allowed_protocols (optional)
list of protocols we allow in links.
-unallowed_tags (optional)
list of tags we don't allow in the markup.
-unallowed_attributes (optional)
list of attributes we don't allow in the markup.
-unallowed_protocols (optional)
list of protocols we don't allow in the markup. Protocol-relative URLs are allowed, but only if the proc is called from a connection thread, as we need to determine our current connection protocol.
-no_js (optional, boolean)
this flag decides whether every script tag, inline event handler and use of the javascript: pseudo-protocol should be stripped from the markup.
-no_outer_urls (optional, boolean)
this flag tells the proc to remove every reference to external addresses. The proc will try to distinguish between external URLs and acceptable fully specified internal ones. Acceptable URLs will be transformed into absolute local references; others will be stripped together with their attribute. Absolute URLs referring to our host are allowed, but require the proc to be called from a connection thread in order to determine the proper current URL.
-validate (optional, boolean)
With this flag, the stripped markup is not created; the proc just reports whether the original markup respects all the specified requirements.
-fix (optional, boolean)
When parsing fails on the markup as it is, try to fix it by, for example, closing unclosed tags or normalizing attribute specifications. This operation removes most plain whitespace from the text content of the original HTML, together with every comment and the DOCTYPE declaration, if present.
Returns:
sanitized markup or a (0/1) truth value when the -validate flag is specified
Author:
Antonio Pisano

Testcases:
ad_dom_sanitize_html
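
The two modes described above (validate only, or return sanitized markup) might be combined as in the following sketch. The helper name demo_clean_comment is hypothetical, and the code assumes that a true result from -validate means the markup respects the requirements. It only does real work where the OpenACS API is loaded.

```tcl
# Hypothetical helper; "demo_clean_comment" is not part of OpenACS.
proc demo_clean_comment {user_html} {
    # Validation pass: returns a truth value, no markup is produced.
    # Assumption: 1 means the markup respects all requirements.
    if {[ad_dom_sanitize_html -html $user_html -no_js -validate]} {
        return $user_html
    }
    # Sanitizing pass: returns a stripped copy; -fix asks the proc to
    # try repairing markup that fails to parse as it is.
    return [ad_dom_sanitize_html -html $user_html -no_js -fix]
}

# Call the helper only where the OpenACS API is available.
if {[info commands ad_dom_sanitize_html] ne ""} {
    ns_log notice [demo_clean_comment {<p onclick="x()">Hi</p>}]
}
```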

ad_enhanced_text_to_html (public)

 ad_enhanced_text_to_html text

Converts enhanced text format to normal HTML.

Parameters:
text (required)
Author:
Lars Pind <lars@pinds.com>
Created:
2003-01-27

Testcases:
ad_enhanced_text_to_html, ad_html_text_convert, acs_tcl__process_enhanced_correctly

ad_enhanced_text_to_plain_text (public)

 ad_enhanced_text_to_plain_text [ -maxlen maxlen ] text

Converts enhanced text format to normal plaintext format.

Switches:
-maxlen (optional, defaults to "70")
Parameters:
text (required)
Author:
Lars Pind <lars@pinds.com>
Created:
2003-01-27

Testcases:
ad_html_text_convert

ad_html_qualify_links (public)

 ad_html_qualify_links [ -location location ] [ -path path ] html

Converts relative URLs in the HTML text into fully qualified URLs including the hostname. It performs the following operations: 1. Prepend the location (protocol and host) to paths starting with a "/". 2. Prepend the path, if one was passed in, to paths not starting with a "/". Links that are already fully qualified are not modified.

Switches:
-location (optional)
protocol and host (defaults to [ad_url])
-path (optional)
optional path to be prepended to paths not starting with a "/"
Parameters:
html (required)
HTML text, in which substitutions should be performed.

Testcases:
ad_html_qualify_links
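
The two rules above can be illustrated for a single URL with a small plain-Tcl sketch. The proc demo_qualify_url is hypothetical and is not the OpenACS implementation, which works on whole HTML fragments. Leaving pure fragments and protocol-relative URLs untouched is an assumption of this sketch, as is prepending only the path in rule 2.

```tcl
# Hypothetical single-URL illustration of the documented rules;
# not the OpenACS implementation.
proc demo_qualify_url {url location {path ""}} {
    # Already qualified (scheme present), protocol-relative, or a pure
    # fragment: leave unchanged (the last two cases are assumptions).
    if {[regexp {^[a-zA-Z][a-zA-Z0-9+.-]*:} $url]
        || [string match "//*" $url]
        || [string match "#*" $url]} {
        return $url
    }
    # Rule 1: path starting with "/" gets protocol and host prepended.
    if {[string index $url 0] eq "/"} {
        return [string trimright $location /]$url
    }
    # Rule 2: other paths get the optional path prepended, if given.
    if {$path ne ""} {
        return [string trimright $path /]/$url
    }
    return $url
}

puts [demo_qualify_url /doc/index.html https://example.org]
puts [demo_qualify_url logo.png https://example.org /resources/]
```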

ad_html_security_check (public)

 ad_html_security_check [ -allowed_tags allowed_tags ] \
    [ -allowed_attributes allowed_attributes ] \
    [ -allowed_protocols allowed_protocols ] html

Returns a human-readable explanation if the user has used any HTML tag other than the allowed ones. The check uses the provided values. If these values are not provided, the function takes the union of the per-package instance value and the values from the "antispam" section of the kernel parameters.

Switches:
-allowed_tags (optional)
-allowed_attributes (optional)
-allowed_protocols (optional)
Parameters:
html (required)
The HTML text being validated.
Returns:
a human-readable, plaintext explanation of what's wrong with the user's input. If everything is ok, return an empty string.
Author:
Lars Pind <lars@pinds.com>
Created:
20 July 2000

Testcases:
ad_html_security_check_href_allowed, ad_html_security_check_forbidden_protolcols, ad_html_security_check_forbidden_tags

ad_html_text_convert (public)

 ad_html_text_convert [ -from from ] [ -to to ] [ -maxlen maxlen ] \
    [ -truncate_len truncate_len ] [ -ellipsis ellipsis ] \
    [ -more more ] text

Converts a chunk of text from a variety of formats to either text/html or text/plain.

Example: ad_html_text_convert -from "text/html" -to "text/plain" -- "text"

Putting in the -- prevents Tcl from treating a leading - in the text portion as a switch.

HTML to HTML closes any unclosed HTML tags (see util_close_html_tags).

Text to HTML calls ad_text_to_html, and HTML to text calls ad_html_to_text. See those procs for details.

When text is empty, then an empty string will be returned regardless of any format. This is especially useful when displaying content that was created with the richtext widget and might contain empty values for content and format.

Switches:
-from (optional, defaults to "text/plain")
specify what type of text you're providing. Allowed values:
  • text/plain
  • text/enhanced
  • text/markdown
  • text/fixed-width
  • text/html
-to (optional, defaults to "text/html")
specify what format you want this translated into. Allowed values:
  • text/plain
  • text/html
-maxlen (optional, defaults to "70")
The maximum line width when generating text/plain
-truncate_len (optional, defaults to "0")
The maximum total length of the output, including the ellipsis.
-ellipsis (optional, defaults to "...")
This will get put at the end of the truncated string, if the string was truncated. However, this counts towards the total string length, so that the returned string including ellipsis is guaranteed to be shorter than the 'truncate_len' provided.
-more (optional)
This will get put at the end of the truncated string, if the string was truncated.
Parameters:
text (required)
Author:
Lars Pind <lars@pinds.com>
Created:
19 July 2000

Testcases:
ad_html_text_convert, ad_text_html_convert_outlook_word_comments, ad_text_html_convert_to_plain, general_comments_create_link
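
A possible way to combine the conversion and truncation switches documented above is sketched below. The helper name demo_preview and the chosen lengths are hypothetical. The call itself uses only switches listed in this entry and runs only where the OpenACS API is loaded.

```tcl
# Hypothetical page-level helper; "demo_preview" is not part of OpenACS.
proc demo_preview {html} {
    # "--" ends switch parsing, so content starting with "-" is safe.
    return [ad_html_text_convert -from text/html -to text/plain \
                -maxlen 60 -truncate_len 120 -- $html]
}

# Call the helper only where the OpenACS API is available.
if {[info commands ad_html_text_convert] ne ""} {
    ns_log notice [demo_preview {-5 degrees <b>today</b>}]
}
```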

ad_html_text_convertable_p (public, deprecated)

 ad_html_text_convertable_p [ -from from ] [ -to to ]
Deprecated. Invoking this procedure generates a warning.

The name of this proc has a spelling error. Use ad_html_text_convertible_p instead.

Switches:
-from (optional)
-to (optional)

Testcases:
No testcase defined.

ad_html_text_convertible_p (public)

 ad_html_text_convertible_p [ -from from ] [ -to to ]

Returns true if ad_html_text_convert can handle the given from and to MIME types.

Switches:
-from (optional)
-to (optional)

Testcases:
ad_html_text_convert

ad_html_to_text (public)

 ad_html_to_text [ -maxlen maxlen ] [ -showtags ] [ -no_format ] html

Returns a best-guess plain text version of an HTML fragment. Parses the HTML and does some simple formatting. The parser and formatting are pretty stupid, but they're better than nothing.

Switches:
-maxlen (optional, defaults to "70")
the line length you want your output wrapped to.
-showtags (optional, boolean)
causes any unknown (and uninterpreted) tags to get shown in the output.
-no_format (optional, boolean)
causes hyperlink tags not to get listed at the end of the output.
Parameters:
html (required)
Authors:
Lars Pind <lars@pinds.com>
Aaron Swartz <aaron@swartzfam.com>
Created:
19 July 2000

Testcases:
html_to_text, ad_html_to_text_bold, ad_html_to_text_anchor, ad_html_to_text_image, ad_html_to_text_clipped_link, text_to_html

ad_js_escape (public)

 ad_js_escape string

Returns the supplied string with invalid JavaScript characters properly escaped. This makes it possible to use the string safely inside JavaScript code.

Parameters:
string (required)
Author:
Antonio Pisano

Testcases:
ad_js_escape

ad_looks_like_html_p (public)

 ad_looks_like_html_p text

Tries to guess whether the text supplied is text or html.

Parameters:
text (required)
the text you want tested.
Returns:
1 if it looks like html, 0 if not.
Author:
Lars Pind <lars@pinds.com>
Created:
19 July 2000

Testcases:
acs_api_browser_api_describe_function, acs_api_browser_api_proc_documentation, acs_api_browser_api_script_documentation, acs_api_browser_apidoc_format_see, acs_api_browser_apidoc_tclcode_to_html, ad_looks_like_html_p, ad_dimensional

ad_pad (public)

 ad_pad [ -left ] [ -right ] string length padstring

Tcl implementation of the pad string function found in many DBMSs. One of the directional flags -left or -right must be specified and will dictate whether this will be an lpad or an rpad.

Switches:
-left (optional, boolean)
padding will be added to the left of the original string.
-right (optional, boolean)
padding will be added to the right of the original string.
Parameters:
string (required)
length (required)
padstring (required)
Returns:
padded string

Testcases:
ad_pad
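
The lpad/rpad idea can be sketched in plain Tcl as follows. The proc demo_pad is hypothetical and is not the OpenACS implementation. In particular, cutting an over-long string down to length (as some DBMS pad functions do) is an assumption of this sketch.

```tcl
# Hypothetical illustration of lpad/rpad semantics; not the OpenACS code.
proc demo_pad {side string length padstring} {
    if {$side ni {left right}} {
        error "side must be left or right"
    }
    set missing [expr {$length - [string length $string]}]
    if {$missing <= 0 || $padstring eq ""} {
        # Assumption: a string that is already long enough is cut to $length.
        return [string range $string 0 [expr {$length - 1}]]
    }
    # Repeat the pad string often enough, then cut it to the missing width.
    set padlen [string length $padstring]
    set reps [expr {($missing + $padlen - 1) / $padlen}]
    set fill [string range [string repeat $padstring $reps] 0 [expr {$missing - 1}]]
    if {$side eq "left"} {
        return $fill$string
    }
    return $string$fill
}

puts [demo_pad left 7 3 0]
puts [demo_pad right ab 5 xy]
```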

ad_parse_html_attributes (public)

 ad_parse_html_attributes [ -attribute_array attribute_array ] html \
    [ pos ]

This is a wrapper proc for ad_parse_html_attributes_upvar, so you can parse attributes from a string without upvar'ing. See the documentation for the other proc.

Switches:
-attribute_array (optional)
Parameters:
html (required)
pos (optional, defaults to "0")
Author:
Lars Pind <lars@pinds.com>
Created:
November 10, 2000

Testcases:
ad_parse_html_attributes

ad_quotehtml (public, deprecated)

 ad_quotehtml arg
Deprecated. Invoking this procedure generates a warning.

Quotes ampersands, double-quotes, and angle brackets in $arg. Analogous to ns_quotehtml except that it quotes double-quotes (which ns_quotehtml does not).

Parameters:
arg (required)

Testcases:
No testcase defined.

ad_string_truncate (public)

 ad_string_truncate [ -len len ] [ -ellipsis ellipsis ] [ -more more ] \
    [ -equal ] string

Truncates a string to len characters, adding the string provided in the ellipsis parameter if the string was truncated. The length of the resulting string, including the ellipsis, is guaranteed to be shorter than or equal to the len specified. Should always be called as ad_string_truncate [-flags ...] -- string, since otherwise strings that start with a - will be treated as switches and will cause an error.

Switches:
-len (optional, defaults to "200")
The length to truncate to. If zero, no truncation will occur.
-ellipsis (optional, defaults to "...")
This will get put at the end of the truncated string, if the string was truncated. However, this counts towards the total string length, so that the returned string including ellipsis is guaranteed to be shorter than or equal to the 'len' provided.
-more (optional)
This will get put at the end of the truncated string, if the string was truncated.
-equal (optional, boolean)
Parameters:
string (required)
The string to truncate.
Returns:
The truncated string
Author:
Lars Pind <lars@pinds.com>
Created:
September 8, 2002

Testcases:
ad_string_truncate
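
The length guarantee described above (the ellipsis counts towards len) can be illustrated with a plain-Tcl sketch. The proc demo_truncate is hypothetical and simply cuts at a character position; the real proc may choose its break point differently. Appending more without counting it towards the length is an assumption based on the switch descriptions.

```tcl
# Hypothetical illustration of the length guarantee; not the OpenACS code.
proc demo_truncate {string {len 200} {ellipsis "..."} {more ""}} {
    # A len of zero means no truncation; short strings pass through.
    if {$len <= 0 || [string length $string] <= $len} {
        return $string
    }
    # Reserve room for the ellipsis so the result stays within $len.
    set keep [expr {$len - [string length $ellipsis]}]
    if {$keep < 0} {
        set keep 0
    }
    # Assumption: "more" is appended after the ellipsis and not counted.
    return [string range $string 0 [expr {$keep - 1}]]$ellipsis$more
}

puts [demo_truncate "The quick brown fox" 10]
```

With the real proc, the equivalent call would take the form ad_string_truncate -len 10 -- $title, where the -- protects strings that begin with a -.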

ad_string_truncate_middle (public)

 ad_string_truncate_middle [ -ellipsis ellipsis ] [ -len len ] string

Cuts out the middle part of a string if it is too long.

Switches:
-ellipsis (optional, defaults to "...")
placeholder for the portion of text being left out
-len (optional, defaults to "100")
length beyond which text starts being cut
Parameters:
string (required)
Returns:
truncated string

Testcases:
ad_string_truncate_middle

ad_text_to_html (public)

 ad_text_to_html [ -no_links ] [ -no_lines ] [ -no_quote ] \
    [ -includes_html ] [ -encode ] text

Converts plaintext to html. Also translates any recognized email addresses or URLs into a hyperlink.

Switches:
-no_links (optional, boolean)
will prevent it from turning recognized URLs and email addresses into hyperlinks
-no_lines (optional, boolean)
-no_quote (optional, boolean)
will prevent it from HTML-quoting output, so this can be run on semi-HTML input and preserve that formatting. This will also cause spaces/tabs to not be replaced with nbsp's, because this can too easily mess up HTML tags.
-includes_html (optional, boolean)
Set this if the text parameter already contains some HTML which should be preserved.
-encode (optional, boolean)
This will encode international characters into their HTML equivalents, like "ü" into &uuml;
Parameters:
text (required)
Authors:
Branimir Dolicki <branimir@arsdigita.com>
Lars Pind <lars@pinds.com>
Created:
19 July 2000

Testcases:
ad_text_to_html, xowiki_test_cases, create_form_with_form_instance

ad_unquotehtml (public)

 ad_unquotehtml arg

Reverses ns_quotehtml.

Parameters:
arg (required)
See Also:
  • ns_quotehtml

Testcases:
quote_unquote_html
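
What reversing HTML quoting involves can be sketched with string map in plain Tcl. The proc demo_unquote is hypothetical and the list of entities it handles is an assumption, not the set handled by ad_unquotehtml.

```tcl
# Hypothetical illustration; the entity list is an assumption.
proc demo_unquote {text} {
    # string map scans the input once and never rescans replaced text,
    # so "&amp;lt;" becomes "&lt;" rather than "<".
    return [string map {&lt; < &gt; > &quot; \" &#34; \" &#39; ' &amp; &} $text]
}

puts [demo_unquote {&lt;b&gt; &amp;amp; &quot;x&quot;}]
```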

string_truncate (public, deprecated)

 string_truncate [ args... ]
Deprecated. Invoking this procedure generates a warning.

Truncates a string to len characters, adding the string provided in the ellipsis parameter if the string was truncated. The length of the resulting string, including the ellipsis, is guaranteed to be shorter than or equal to the len specified. Should always be called as ad_string_truncate [-flags ...] -- string, since otherwise strings that start with a - will be treated as switches and will cause an error. Deprecated: does not comply with the OpenACS naming convention.

Returns:
The truncated string
Author:
Lars Pind <lars@pinds.com>
Created:
September 8, 2002

Testcases:
No testcase defined.

string_truncate_middle (public, deprecated)

 string_truncate_middle [ args... ]
Deprecated. Invoking this procedure generates a warning.

Cuts out the middle part of a string if it is too long. Deprecated: does not comply with the OpenACS naming convention.

Testcases:
No testcase defined.

util_close_html_tags (public)

 util_close_html_tags html_fragment [ break_soft ] [ break_hard ] \
    [ ellipsis ] [ more ]

Given an HTML fragment, this procedure will close any tags that have been left open. The optional arguments let you specify that the fragment is to be truncated to a certain number of displayable characters. After break_soft displayable characters, it truncates and closes open tags unless you're within non-breaking tags (e.g., A). After break_hard displayable characters, the procedure simply truncates and closes any open HTML tags that might have resulted from the truncation.

Note that the internal syntax table dictates which tags are non-breaking. The syntax table has codes:

  • nobr -- treat tag as nonbreaking.
  • discard -- throws away everything until the corresponding close tag.
  • remove -- nuke this tag and its closing tag but leave contents.
  • close -- close this tag if left open.

Parameters:
html_fragment (required)
break_soft (optional, defaults to "0")
the number of characters you want the HTML fragment truncated to. Will allow certain tags (A, ADDRESS, NOBR) to close first.
break_hard (optional, defaults to "0")
the number of characters you want the HTML fragment truncated to. Will truncate, regardless of what tag is currently in action.
ellipsis (optional)
This will get put at the end of the truncated string, if the string was truncated. However, this counts towards the total string length, so that the returned string including ellipsis is guaranteed to be shorter than the 'len' provided.
more (optional)
This will get put at the end of the truncated string, if the string was truncated.
Author:
Jeff Davis <davis@xarg.net>

Testcases:
util_close_html_tags

util_convert_line_breaks_to_html (public)

 util_convert_line_breaks_to_html [ -includes_html ] [ -contains_pre ] \
    text

Convert line breaks to <p> and <br> tags, respectively.

Switches:
-includes_html (optional, boolean)
-contains_pre (optional, boolean)
Parameters:
text (required)

Testcases:
util_convert_line_breaks_to_html, ad_text_to_html

util_expand_entities (public, deprecated)

 util_expand_entities html
Deprecated. Invoking this procedure generates a warning.

Replaces all occurrences of common HTML entities with their plaintext equivalents in a way that's appropriate for pretty-printing.

Currently, the following entities are converted: &lt;, &gt;, &quot;, &amp;, &mdash; and &#151;.

This proc is more suitable for pretty-printing than its sister proc, util_expand_entities_ie_style. The two differences are that this one is more strict: it requires proper entities, i.e., both the opening ampersand and the closing semicolon, and it doesn't do numeric entities, because they're generally not safe to send to browsers. If we want to do numeric entities in general, we should also consider how they interact with character encodings.

Parameters:
html (required)
See Also:
  • ns_unquotehtml

Testcases:
No testcase defined.

util_expand_entities_ie_style (public, deprecated)

 util_expand_entities_ie_style html
Deprecated. Invoking this procedure generates a warning.

Replaces all occurrences of &#111; and &x0f; type HTML character entities with their ASCII equivalents. It also handles lt, gt, quot, ob, cb and amp.

This proc does the expansion in the style of IE and Netscape, which is to say that it doesn't require the trailing semicolon on the entity to replace it with something else. The reason is that this proc was designed for checking HTML for security issues, and since entities can be used for hiding malicious code, we'd better simulate the liberal interpretation that browsers apply, even though it complicates matters.

Unlike its sister proc, util_expand_entities, it also expands numeric entities (#999 or #xff style).

Parameters:
html (required)
Author:
Lars Pind <lars@pinds.com>
Created:
October 17, 2000
See Also:
  • ns_unquotehtml

Testcases:
No testcase defined.

util_remove_html_tags (public)

 util_remove_html_tags html

Removes everything between < and > from the string.

Parameters:
html (required)

Testcases:
util_remove_html_tags
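
Removing everything between < and > can be sketched with a single regsub in plain Tcl. The proc demo_strip_tags is hypothetical and may differ from the actual implementation. Note that such a pattern does not interpret entities or handle a > inside an attribute value.

```tcl
# Hypothetical illustration; not necessarily how the OpenACS proc works.
proc demo_strip_tags {html} {
    # Delete every run that starts with "<" and ends at the next ">".
    return [regsub -all -- {<[^>]*>} $html {}]
}

puts [demo_strip_tags {<p>Hello <b>world</b></p>}]
```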

wrap_string (public, deprecated)

 wrap_string input [ width ]
Deprecated. Invoking this procedure generates a warning.

Wraps a string to be no wider than width columns (80 by default) by inserting line breaks.

Parameters:
input (required)
width (optional, defaults to "80")
See Also:
  • ns_reflow_text

Testcases:
No testcase defined.