util_close_html

util_close_html_tags (public)

 util_close_html_tags html_fragment [ break_soft ] [ break_hard ] \
    [ ellipsis ] [ more ]

Defined in packages/acs-tcl/tcl/text-html-procs.tcl

Given an HTML fragment, this procedure will close any tags that have been left open. The optional arguments let you specify that the fragment is to be truncated to a certain number of displayable characters. After break_soft, it truncates and closes open tags unless you're within non-breaking tags (e.g., Af). After break_hard displayable characters, the procedure simply truncates and closes any open HTML tags that might have resulted from the truncation.

Note that the internal syntax table dictates which tags are non-breaking. The syntax table has codes:

nobr -- treat tag as nonbreaking.
discard -- throws away everything until the corresponding close tag.
remove -- nuke this tag and its closing tag but leave contents.
close -- close this tag if left open.

Parameters:: html_fragment
break_soft (defaults to "0") - the number of characters you want the HTML fragment truncated to. Will allow certain tags (A, ADDRESS, NOBR) to close first.
break_hard (defaults to "0") - the number of characters you want the HTML fragment truncated to. Will truncate, regardless of what tag is currently in action.
ellipsis (optional) - This will get put at the end of the truncated string, if the string was truncated. However, this counts towards the total string length, so that the returned string including ellipsis is guaranteed to be shorter than the 'len' provided.
more (optional) - This will get put at the end of the truncated string, if the string was truncated.
Author:: Jeff Davis <davis@xarg.net>
Partial Call Graph (max 5 caller/called nodes):
Testcases:: util_close_html_tags

Source code:

    #
    # The code in this function had an exponential behavior based on
    # the size.  On the current OpenACS.org site (Jan 2009), the
    # function took on certain forums entries 6 to 9 hours
    # (e.g. /forums/message-view?message_id=357753). This is in
    # particular a problem, since bots like googlebot will timeout on
    # these entries (while OpenACS is still computing the content) and
    # retry after some time until they get the result (which never
    # happened). So, often multiple computation ran at the same
    # time. Since OpenACS.org is configured with only a few connection
    # threads, this is essentially a "bot DOS attack".
    #
    # Therefore, the tdom-based code in the next paragraph is used to
    # speedup the process significantly (most entries are anyway
    # correct).  The forum processing query from above takes now 7.3
    # seconds instead of 9h. The tdom-based code was developed as an
    # emergency measure.
    #
    # The code below the mentioned paragraph could be certainly as
    # well made faster, but this will require some more detailed
    # analysis.
    #
    # The best solution for forums would be to check the fragment not
    # at rendering time, but at creation time.
    #
    # -gustaf neumann    (Jan 2009)

    if {$break_soft == 0 && $break_hard == 0} {

        if {[::acs::icanuse "ns_parsehtml"]} {
            #
            # In case, we have have the command "ns_parsehtml" use it
            # for closing tags. In cases, were we haved used the
            # command before, we could obtain from the first pass the
            # information about unbalanced tags for optimization.
            #
            return [util_close_html_tags_ns_parsehtml $html_fragment]
        }
        #
        # We have to protect against crashes, that might happen due to
        # unsupported numeric entities in tdom. Therefore, we map
        # numeric entities into something sufficiently opaque
        #
        set frag [string map [list &# "\0&amp;#\0"] $html_fragment]

        try {
            dom parse -html <body>$frag doc
        } on error {errorMsg} {
            # we got an error, so do Tcl based HTML completion processing
            #ad_log notice "tdom can't parse the provided HTML, error=$errorMsg, checking fragment without tdom\n$frag"
            ad_log notice "tdom can't parse the provided HTML, error=$errorMsg, checking fragment without tdom"
        } on ok {r} {
            $doc documentElement root
            set html ""
            # discard forms
            foreach node [$root selectNodes //form] {$node delete}
            # output wellformed html
            set b [lindex [$root selectNodes {//body}] 0]
            foreach n [$b childNodes] {
                append html [$n asHTML]
            }
            return [string map [list "\0&amp;#\0" &#] $html]
        }
    }

    set frag $html_fragment

    # original code continues

    set syn(a) nobr
    set syn(address) nobr
    set syn(nobr) nobr
    #
    set syn(form) discard
    #
    set syn(blink) remove
    #
    set syn(table) close
    set syn(font) close
    set syn(b) close
    set syn(big) close
    set syn(i) close
    set syn(s) close
    set syn(small) close
    set syn(strike) close
    set syn(sub) close
    set syn(sup) close
    set syn(tt) close
    set syn(u) close
    set syn(abbr) close
    set syn(acronym) close
    set syn(cite) close
    set syn(code) close
    set syn(del) close
    set syn(dfn) close
    set syn(em) close
    set syn(ins) close
    set syn(kbo) close
    set syn(samp) close
    set syn(strong) close
    set syn(var) close
    set syn(dir) close
    set syn(dl) close
    set syn(menu) close
    set syn(ol) close
    set syn(ul) close
    set syn(h1) close
    set syn(h2) close
    set syn(h3) close
    set syn(h4) close
    set syn(h5) close
    set syn(h6) close
    set syn(bdo) close
    set syn(blockquote) close
    set syn(center) close
    set syn(div) close
    set syn(pre) close
    set syn(q) close
    set syn(span) close

    set out {}
    set out_len 0

    # counts how deep we are nested in nonbreaking tags, tracks the nobr point
    # and what the nobr string length would be
    set nobr 0
    set nobr_out_point 0
    set nobr_tagptr 0
    set nobr_len 0

    if { $break_hard > 0 } {
        if { $break_soft == 0 } {
            set break_soft $break_hard
        }
    }

    set broken_p 0
    set discard 0
    set tagptr -1

    # First try to fix up < not part of a tag.

    regsub -all -- {<([^/[:alpha:]!])} $frag {\&lt;\1} frag
    # no we do is chop off any trailing unclosed tag
    # since when we substr blobs this sometimes happens

    # this should in theory cut any tags which have been cut open.
    while {[regexp {(.*)<[^>]*$} $frag match frag]} {}

    while { "$frag" != "" } {
        # here we attempt to cut the string into "pretag<TAG TAGBODY>posttag"
        # and build the output list.

        if {![regexp "(\[^<]*)(<(/?)(\[^ \r\n\t>]+)(\[^>]*)>)?(.*)" $frag match pretag fulltag close tag tagbody frag]} {
            # should never get here since above will match anything.
            ns_log Error "util_close_html_tag - NO MATCH: should never happen! frag=$frag"
            append out $frag
            set frag {}
        } else {
            #ns_log Notice "pretag=$pretag\n fulltag=$fulltag\n close=$close\n tag=$tag\n tagbody=$tagbody frag length is [string length $frag]"
            if { ! $discard } {
                # figure out if we can break with the pretag chunk
                if { $break_soft } {
                    if {! $nobr && [string length $pretag] + $out_len > $break_soft } {
                        # first chop pretag to the right length
                        set pretag [string range $pretag 0 [expr {$break_soft - $out_len - [string length $ellipsis]}]]
                        # clip the last word
                        regsub "\[^ \t\n\r]*$" $pretag {} pretag
                        append out [string range $pretag 0 $break_soft]
                        set broken_p 1
                        break
                    } elseif { $nobr &&  [string length $pretag] + $out_len > $break_hard } {
                        # we are in a nonbreaking tag and are past the hard break
                        # so chop back to the point we got the nobr tag...
                        set tagptr $nobr_tagptr
                        if { $nobr_out_point > 0 } {
                            set out [string range $out 0 $nobr_out_point-1]
                        } else {
                            # here maybe we should decide if we should keep the tag anyway
                            # if zero length result would be the result...
                            set out {}
                        }
                        set broken_p 1
                        break
                    }
                }

                # tack on pretag
                append out $pretag
                incr out_len [string length $pretag]
            }

            # now deal with the tag if we got one...
            if  { $tag eq "" } {
                # if the tag is empty we might have one of the bad matched that are not eating
                # any of the string so check for them
                if {[string length $match] == [string length $frag]} {
                    append out $frag
                    set frag {}
                }
            } else {
                set tag [string tolower $tag]
                if { ![info exists syn($tag)]} {
                    # if we don't have an entry in our syntax table just tack it on
                    # and hope for the best.
                    if { ! $discard } {
                        append  out $fulltag
                    }
                } else {
                    if { $close ne "/" } {
                        # new tag
                        # "remove" tags are just ignored here
                        # discard tags
                        if { $discard } {
                            if { $syn($tag) eq "discard" } {
                                incr discard
                                incr tagptr
                                set tagstack($tagptr) $tag
                            }
                        } else {
                            switch -- $syn($tag) {
                                nobr {
                                    if { ! $nobr } {
                                        set nobr_out_point [string length $out]
                                        set nobr_tagptr $tagptr
                                        set nobr_len $out_len
                                    }
                                    incr nobr
                                    incr tagptr
                                    set tagstack($tagptr) $tag
                                    append out $fulltag
                                }
                                discard {
                                    incr discard
                                    incr tagptr
                                    set tagstack($tagptr) $tag
                                }
                                close {
                                    incr tagptr
                                    set tagstack($tagptr) $tag
                                    append out $fulltag
                                }
                            }
                        }
                    } else {
                        # we got a close tag
                        if { $discard } {
                            # if we are in discard mode only watch for
                            # closes to discarded tags
                            if { $syn($tag) eq "discard"} {
                                if {$tagptr > -1} {
                                    if { $tag != $tagstack($tagptr) } {
                                        #puts "/$tag without $tag"
                                    } else {
                                        incr tagptr -1
                                        incr discard -1
                                    }
                                }
                            }
                        } else {
                            if { $syn($tag) ne "remove"} {
                                # if tag is a remove tag we just ignore it...
                                if {$tagptr > -1} {
                                    if {$tag != $tagstack($tagptr) } {
                                        # puts "/$tag without $tag"
                                    } else {
                                        incr tagptr -1
                                        if { $syn($tag) eq "nobr"} {
                                            incr nobr -1
                                        }
                                        append out $fulltag
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }

    # on exit of the look either we parsed it all or we truncated.
    # we should now walk the stack and close any open tags.

    # Chop off extra whitespace at the end
    if { $broken_p } {
        set end_index [expr {[string length $out] -1}]
        while { $end_index >= 0 && [string is space [string index $out $end_index]] } {
            incr end_index -1
        }
        set out [string range $out 0 $end_index]
    }

    for { set i $tagptr } { $i > -1 } { incr i -1 } {
        set tag $tagstack($i)

        # LARS: Only close tags which we aren't supposed to remove
        if { $syn($tag) ni {discard remove}} {
            append out "</$tagstack($i)>"
        }
    }

    if { $broken_p } {
        append out $ellipsis
        append out $more
    }

    return $out

XQL Not present:

Generic, PostgreSQL, Oracle

[ hide source ] | [ make this the default ]