util_close_html_tags (public)
util_close_html_tags html_fragment [ break_soft ] [ break_hard ] \ [ ellipsis ] [ more ]
Defined in packages/acs-tcl/tcl/text-html-procs.tcl
Given an HTML fragment, this procedure will close any tags that have been left open. The optional arguments let you specify that the fragment is to be truncated to a certain number of displayable characters. After break_soft, it truncates and closes open tags unless you're within non-breaking tags (e.g., Af). After break_hard displayable characters, the procedure simply truncates and closes any open HTML tags that might have resulted from the truncation.
Note that the internal syntax table dictates which tags are non-breaking. The syntax table has codes:
- nobr -- treat tag as nonbreaking.
- discard -- throws away everything until the corresponding close tag.
- remove -- nuke this tag and its closing tag but leave contents.
- close -- close this tag if left open.
- Parameters:
- html_fragment
break_soft (defaults to"0"
) - the number of characters you want the HTML fragment truncated to. Will allow certain tags (A, ADDRESS, NOBR) to close first.
break_hard (defaults to"0"
) - the number of characters you want the HTML fragment truncated to. Will truncate, regardless of what tag is currently in action.
ellipsis (optional) - This will get put at the end of the truncated string, if the string was truncated. However, this counts towards the total string length, so that the returned string including ellipsis is guaranteed to be shorter than the 'len' provided.
more (optional) - This will get put at the end of the truncated string, if the string was truncated.
- Author:
- Jeff Davis <davis@xarg.net>
- Partial Call Graph (max 5 caller/called nodes):
- Testcases:
- util_close_html_tags
Source code: # # The code in this function had an exponential behavior based on # the size. On the current OpenACS.org site (Jan 2009), the # function took on certain forums entries 6 to 9 hours # (e.g. /forums/message-view?message_id=357753). This is in # particular a problem, since bots like googlebot will timeout on # these entries (while OpenACS is still computing the content) and # retry after some time until they get the result (which never # happened). So, often multiple computation ran at the same # time. Since OpenACS.org is configured with only a few connection # threads, this is essentially a "bot DOS attack". # # Therefore, the tdom-based code in the next paragraph is used to # speedup the process significantly (most entries are anyway # correct). The forum processing query from above takes now 7.3 # seconds instead of 9h. The tdom-based code was developed as an # emergency measure. # # The code below the mentioned paragraph could be certainly as # well made faster, but this will require some more detailed # analysis. # # The best solution for forums would be to check the fragment not # at rendering time, but at creation time. # # -gustaf neumann (Jan 2009) if {$break_soft == 0 && $break_hard == 0} { if {[::acs::icanuse "ns_parsehtml"]} { # # In case, we have have the command "ns_parsehtml" use it # for closing tags. In cases, were we haved used the # command before, we could obtain from the first pass the # information about unbalanced tags for optimization. # return [util_close_html_tags_ns_parsehtml $html_fragment] } # # We have to protect against crashes, that might happen due to # unsupported numeric entities in tdom. Therefore, we map # numeric entities into something sufficiently opaque # set frag [string map [list &# "\0&#\0"] $html_fragment] try { dom parse -html <body>$frag doc } on error {errorMsg} { # we got an error, so do Tcl based HTML completion processing #ad_log notice "tdom can't parse the provided HTML, error=$errorMsg, checking fragment without tdom\n$frag" ad_log notice "tdom can't parse the provided HTML, error=$errorMsg, checking fragment without tdom" } on ok {r} { $doc documentElement root set html "" # discard forms foreach node [$root selectNodes //form] {$node delete} # output wellformed html set b [lindex [$root selectNodes {//body}] 0] foreach n [$b childNodes] { append html [$n asHTML] } return [string map [list "\0&#\0" &#] $html] } } set frag $html_fragment # original code continues set syn(a) nobr set syn(address) nobr set syn(nobr) nobr # set syn(form) discard # set syn(blink) remove # set syn(table) close set syn(font) close set syn(b) close set syn(big) close set syn(i) close set syn(s) close set syn(small) close set syn(strike) close set syn(sub) close set syn(sup) close set syn(tt) close set syn(u) close set syn(abbr) close set syn(acronym) close set syn(cite) close set syn(code) close set syn(del) close set syn(dfn) close set syn(em) close set syn(ins) close set syn(kbo) close set syn(samp) close set syn(strong) close set syn(var) close set syn(dir) close set syn(dl) close set syn(menu) close set syn(ol) close set syn(ul) close set syn(h1) close set syn(h2) close set syn(h3) close set syn(h4) close set syn(h5) close set syn(h6) close set syn(bdo) close set syn(blockquote) close set syn(center) close set syn(div) close set syn(pre) close set syn(q) close set syn(span) close set out {} set out_len 0 # counts how deep we are nested in nonbreaking tags, tracks the nobr point # and what the nobr string length would be set nobr 0 set nobr_out_point 0 set nobr_tagptr 0 set nobr_len 0 if { $break_hard > 0 } { if { $break_soft == 0 } { set break_soft $break_hard } } set broken_p 0 set discard 0 set tagptr -1 # First try to fix up < not part of a tag. regsub -all -- {<([^/[:alpha:]!])} $frag {\<\1} frag # no we do is chop off any trailing unclosed tag # since when we substr blobs this sometimes happens # this should in theory cut any tags which have been cut open. while {[regexp {(.*)<[^>]*$} $frag match frag]} {} while { "$frag" != "" } { # here we attempt to cut the string into "pretag<TAG TAGBODY>posttag" # and build the output list. if {![regexp "(\[^<]*)(<(/?)(\[^ \r\n\t>]+)(\[^>]*)>)?(.*)" $frag match pretag fulltag close tag tagbody frag]} { # should never get here since above will match anything. ns_log Error "util_close_html_tag - NO MATCH: should never happen! frag=$frag" append out $frag set frag {} } else { #ns_log Notice "pretag=$pretag\n fulltag=$fulltag\n close=$close\n tag=$tag\n tagbody=$tagbody frag length is [string length $frag]" if { ! $discard } { # figure out if we can break with the pretag chunk if { $break_soft } { if {! $nobr && [string length $pretag] + $out_len > $break_soft } { # first chop pretag to the right length set pretag [string range $pretag 0 [expr {$break_soft - $out_len - [string length $ellipsis]}]] # clip the last word regsub "\[^ \t\n\r]*$" $pretag {} pretag append out [string range $pretag 0 $break_soft] set broken_p 1 break } elseif { $nobr && [string length $pretag] + $out_len > $break_hard } { # we are in a nonbreaking tag and are past the hard break # so chop back to the point we got the nobr tag... set tagptr $nobr_tagptr if { $nobr_out_point > 0 } { set out [string range $out 0 $nobr_out_point-1] } else { # here maybe we should decide if we should keep the tag anyway # if zero length result would be the result... set out {} } set broken_p 1 break } } # tack on pretag append out $pretag incr out_len [string length $pretag] } # now deal with the tag if we got one... if { $tag eq "" } { # if the tag is empty we might have one of the bad matched that are not eating # any of the string so check for them if {[string length $match] == [string length $frag]} { append out $frag set frag {} } } else { set tag [string tolower $tag] if { ![info exists syn($tag)]} { # if we don't have an entry in our syntax table just tack it on # and hope for the best. if { ! $discard } { append out $fulltag } } else { if { $close ne "/" } { # new tag # "remove" tags are just ignored here # discard tags if { $discard } { if { $syn($tag) eq "discard" } { incr discard incr tagptr set tagstack($tagptr) $tag } } else { switch -- $syn($tag) { nobr { if { ! $nobr } { set nobr_out_point [string length $out] set nobr_tagptr $tagptr set nobr_len $out_len } incr nobr incr tagptr set tagstack($tagptr) $tag append out $fulltag } discard { incr discard incr tagptr set tagstack($tagptr) $tag } close { incr tagptr set tagstack($tagptr) $tag append out $fulltag } } } } else { # we got a close tag if { $discard } { # if we are in discard mode only watch for # closes to discarded tags if { $syn($tag) eq "discard"} { if {$tagptr > -1} { if { $tag != $tagstack($tagptr) } { #puts "/$tag without $tag" } else { incr tagptr -1 incr discard -1 } } } } else { if { $syn($tag) ne "remove"} { # if tag is a remove tag we just ignore it... if {$tagptr > -1} { if {$tag != $tagstack($tagptr) } { # puts "/$tag without $tag" } else { incr tagptr -1 if { $syn($tag) eq "nobr"} { incr nobr -1 } append out $fulltag } } } } } } } } } # on exit of the look either we parsed it all or we truncated. # we should now walk the stack and close any open tags. # Chop off extra whitespace at the end if { $broken_p } { set end_index [expr {[string length $out] -1}] while { $end_index >= 0 && [string is space [string index $out $end_index]] } { incr end_index -1 } set out [string range $out 0 $end_index] } for { set i $tagptr } { $i > -1 } { incr i -1 } { set tag $tagstack($i) # LARS: Only close tags which we aren't supposed to remove if { $syn($tag) ni {discard remove}} { append out "</$tagstack($i)>" } } if { $broken_p } { append out $ellipsis append out $more } return $outXQL Not present: Generic, PostgreSQL, Oracle