ad_dom_sanitize_html (public)

 ad_dom_sanitize_html -html html [ -allowed_tags allowed_tags ] \
    [ -allowed_attributes allowed_attributes ] \
    [ -allowed_protocols allowed_protocols ] \
    [ -unallowed_tags unallowed_tags ] \
    [ -unallowed_attributes unallowed_attributes ] \
    [ -unallowed_protocols unallowed_protocols ] [ -no_js ] \
    [ -no_outer_urls ] [ -validate ] [ -fix ]

Defined in packages/acs-tcl/tcl/text-html-procs.tcl

Sanitizes HTML by specified criteria, basically removing unallowed tags and attributes, JavaScript or outer references into page URLs. When desired, this proc can act also as just a validator in order to enforce some markup policies.

Switches:
-html
(required)
the markup to be checked.
-allowed_tags
(optional)
list of tags we allow in the markup.
-allowed_attributes
(optional)
list of attributes we allow in the markup.
-allowed_protocols
(optional)
list of attributes we allow into links
-unallowed_tags
(optional)
list of tags we don't allow in the markup.
-unallowed_attributes
(optional)
list of attributes we don't allow in the markup.
-unallowed_protocols
(optional)
list of protocols we don't allow in the markup. Protocol-relative URLs are allowed, but only if proc is called from a connection thread, as we need to determine our current connection protocol.
-no_js
(boolean) (optional)
this flag decides whether every script tag, inline event handlers and the javascript: pseudo-protocol should be stripped from the markup.
-no_outer_urls
(boolean) (optional)
this flag tells the proc to remove every reference to external addresses. Proc will try to distinguish between external URLs and fine fully specified internal ones. Acceptable URLs will be transformed in absolute local references, others will be just stripped together with the attribute. Absolute URLs referring to our host are allowed, but require the proc being called from a connection thread in order to determine the proper current url.
-validate
(boolean) (optional)
This flag will avoid the creation of the stripped markup and just report whether the original one respects all the specified requirements.
-fix
(boolean) (optional)
When parsing fails on markup as it is, try to fix it by, for example, closing unclosed tags or normalizing attribute specification. This operation will remove most of plain whitespace into text content of original HTML, together with every comment and the eventually present DOCTYPE declaration.
Returns:
sanitized markup or a (0/1) truth value when the -validate flag is specified
Author:
Antonio Pisano

Partial Call Graph (max 5 caller/called nodes):
%3 test_ad_dom_sanitize_html ad_dom_sanitize_html (test acs-tcl) ad_dom_sanitize_html ad_dom_sanitize_html test_ad_dom_sanitize_html->ad_dom_sanitize_html ad_conn ad_conn (public) ad_dom_sanitize_html->ad_conn ad_dom_fix_html ad_dom_fix_html (private) ad_dom_sanitize_html->ad_dom_fix_html ad_log ad_log (public) ad_dom_sanitize_html->ad_log dom dom ad_dom_sanitize_html->dom parameter::get parameter::get (public) ad_dom_sanitize_html->parameter::get packages/general-comments/www/comment-add-2.tcl packages/general-comments/ www/comment-add-2.tcl packages/general-comments/www/comment-add-2.tcl->ad_dom_sanitize_html packages/general-comments/www/comment-edit-2.tcl packages/general-comments/ www/comment-edit-2.tcl packages/general-comments/www/comment-edit-2.tcl->ad_dom_sanitize_html

Testcases:
ad_dom_sanitize_html
Source code:
        ## Allowed/Unallowed tags come from the user or default to
        ## those specified in the parameters

        array set allowed_tag {}
        if {![info exists allowed_tags]} {
            set allowed_tags [parameter::get -package_id $::acs::kernel_id -parameter AllowedTag]
        }

        array set allowed_attribute {}
        if {![info exists allowed_attributes]} {
            set allowed_attributes [parameter::get -package_id $::acs::kernel_id -parameter AllowedAttribute]
        }

        array set allowed_protocol {}
        if {![info exists allowed_protocols]} {
            set allowed_protocols [parameter::get -package_id $::acs::kernel_id -parameter AllowedProtocol]
        }

        if {"*" in $allowed_tags} {
            set allowed_tags "*"
        }
        foreach tag $allowed_tags {
            set allowed_tag([string tolower $tag]) 1
        }

        if {"*" in $allowed_attributes} {
            set allowed_attributes "*"
        }
        foreach attribute $allowed_attributes {
            set allowed_attribute([string tolower $attribute]) 1
        }

        if {"*" in $allowed_protocols} {
            set allowed_protocols "*"
        }
        foreach protocol $allowed_protocols {
            set allowed_protocol([string tolower $protocol]) 1
        }

        array set unallowed_tag {}
        if {![info exists unallowed_tags]} {
            set unallowed_tags {}
        }

        array set unallowed_attribute {}
        if {![info exists unallowed_attributes]} {
            set unallowed_attributes {}
        }

        array set unallowed_protocol {}
        if {![info exists unallowed_protocols]} {
            set unallowed_protocols {}
        }

        # TODO: consider default unallowed stuff to come from a parameter

        if {$no_js_p} {
            lappend unallowed_tags "script"
            lappend unallowed_attributes {*}{
                onafterprint onbeforeprint onbeforeunload onerror
                onhashchange onload onmessage onoffline ononline
                onpagehide onpageshow onpopstate onresize onstorage
                onunload onblur onchange oncontextmenu onfocus oninput
                oninvalid onreset onsearch onselect onsubmit onkeydown
                onkeypress onkeyup onclick ondblclick onmousedown
                onmousemove onmouseout onmouseover onmouseup
                onmousewheel onwheel ondrag ondragend ondragenter
                ondragleave ondragover ondragstart ondrop onscroll
                oncopy oncut onpaste onabort oncanplay
                oncanplaythrough oncuechange ondurationchange
                onemptied onended onerror onloadeddata
                onloadedmetadata onloadstart onpause onplay onplaying
                onprogress onratechange onseeked onseeking onstalled
                onsuspend ontimeupdate onvolumechange onwaiting onshow
                ontoggle
            }
            lappend unallowed_protocols "javascript"
        }

        foreach tag $unallowed_tags {
            set unallowed_tag([string tolower $tag]) 1
        }

        foreach attribute $unallowed_attributes {
            set unallowed_attribute([string tolower $attribute]) 1
        }
        foreach protocol $unallowed_protocols {
            set unallowed_protocol([string tolower $protocol]) 1
        }

        ##
        # root of the document must be unique, this will enforce it by
        # wrapping html in an auxiliary root element
        set lmarker "<root>"
        set rmarker "</root>"

        try {
            dom parse -html "${lmarker}${html}${rmarker}" doc

        } on error {errorMsg} {
            if {$fix_p} {
                try {
                    set doc [ad_dom_fix_html -html $html -dom]
                } on error {errorMsg} {
                    ad_log error "Fixing of the document failed. Reported error: $errorMsg"
                    return [expr {$validate_p ? 0 : ""}]
                }
            } else {
                ad_log error "Parsing of the document failed. Reported error: $errorMsg"
                return [expr {$validate_p ? 0 : ""}]
            }
        }

        $doc documentElement root

        # Some sanitizing requires information that is available only
        # from a connection thread such as our local address and
        # current protocol.
        if {[ns_conn isconnected]} {
            set driver_info [util_driver_info]
            set driver_prot [dict get $driver_info proto]
            set driver_host [dict get $driver_info hostname]
            set driver_port [dict get $driver_info port]

            ## create a regex clause of possible addresses referring to
            ## this system
            set our_locations [list]

            # location from conf files
            set configured_location [util::join_location  -proto    $driver_prot  -hostname $driver_host  -port     $driver_port]
            lappend our_locations $configured_location
            regsub {^\w+://} $configured_location {//} no_proto_location
            lappend our_locations $no_proto_location

            # location from connection
            set conn_location [ad_conn location]
            lappend our_locations $conn_location
            regsub {^\w+://} $conn_location {//} no_proto_location
            lappend our_locations $no_proto_location

            set our_locations [join $our_locations |]
            ##
        } else {
            set our_locations ""
            set driver_prot ""
        }

        set queue [$root childNodes]
        while {$queue ne {}} {
            set node [lindex $queue 0]
            set queue [lrange $queue 1 end]

            # skip all non-element nodes
            if {$node eq "" || [$node nodeType] ne "ELEMENT_NODE"} {
                continue
            }

            # 1: check tag is allowed
            set node_name [string tolower [$node nodeName]]
            if {[info exists unallowed_tag($node_name)] ||
                ($allowed_tags ne "*" && ![info exists allowed_tag($node_name)])} {
                # invalid tag!
                if {$validate_p} {
                    return 0
                } else {
                    $node delete
                }
                continue
            }

            # tag itself is allowed, we can inspect its children
            lappend queue {*}[$node childNodes]

            # 2: check tag contains only allowed attributes
            foreach att [$node attributes] {
                set att [string tolower $att]
                if {[info exists unallowed_attribute($att)] ||
                    ($allowed_attributes ne "*" && ![info exists allowed_attribute($att)])} {
                    # invalid attribute!
                    if {$validate_p} {
                        return 0
                    } else {
                        $node removeAttribute $att
                    }
                    continue
                }

                # 3: check for any attribute that could contain a URL
                # whether this is acceptable
                switch -- $att {
                    "href" - "src" - "content" - "action" {
                        set url [string trim [$node getAttribute $att ""]]
                        if {$url eq ""} {
                            continue
                        }

                        set proto ""
                        try {
                            set parsed_url [ns_parseurl $url]
                            if {[dict exists $parsed_url proto]} {
                                set proto [dict get $parsed_url proto]
                            }
                        } on error {errorMsg} {
                            ns_log warning "ad_dom_sanitize_html cannot parse URL '$url': $errorMsg"
                            #
                            # The attribute is invalid. Report it or remove it.
                            #
                            if {$validate_p} {
                                return 0
                            } else {
                                $node removeAttribute $att
                            }
                            continue
                        }
                        if {$proto ne ""} {
                            if {$no_outer_urls_p} {
                                # no external URLs allowed: we still
                                # want to allow fully specified URLs
                                # that refer to this server, but we'll
                                # transform them in a local absolute
                                # reference. For all others, attribute
                                # will be just removed.
                                # - This is ok, points to our system...
                                if {[regsub ^($our_locations) $url {} url]} {
                                    set url /[string trimleft $url "/"]
                                    $node setAttribute $att $url
                                    # ...this is not, points elsewhere!
                                } else {
                                    # invalid attribute!
                                    if {$validate_p} {
                                        return 0
                                    } else {
                                        $node removeAttribute $att
                                    }
                                    continue
                                }
                            }
                        }

                        # to check for allowed protocols we need to
                        # treat URLs without one (e.g. relative or
                        # protocol-relative URLs) as using our same
                        # protocol
                        if {$proto eq ""} {
                            set proto $driver_prot
                        }

                        # check if protocol is allowed
                        if {[info exists unallowed_protocol($proto)] ||
                            ($allowed_protocols ne "*" && ![info exists allowed_protocol($proto)])} {
                            # invalid attribute!
                            if {$validate_p} {
                                return 0
                            } else {
                                $node removeAttribute $att
                            }
                            continue
                        }
                    }
                }
            }
        }

        if {$validate_p} {
            $doc delete
            return 1
        } else {
            if {[package vsatisfies [package require tdom] 0.9.3]} {
                # tDOM 0.9.3 will return the tree including the
                # parent.  To keep the previous behavior, one should
                # specify the -onlyContents flag, that previous
                # versions do not support.
                set html [$root asHTML -onlyContents]
            } else {
                set html [$root asHTML]
            }
            $doc delete
            # remove auxiliary root element from output
            set html [string range $html [string length $lmarker] end-[string length $rmarker]]
            set html [string trim $html]
            return $html
        }
XQL Not present:
Generic, PostgreSQL, Oracle
[ hide source ] | [ make this the default ]
Show another procedure: