ad_dom_sanitize_html (public)

 ad_dom_sanitize_html -html html [ -allowed_tags allowed_tags ] \
    [ -allowed_attributes allowed_attributes ] \
    [ -allowed_protocols allowed_protocols ] \
    [ -unallowed_tags unallowed_tags ] \
    [ -unallowed_attributes unallowed_attributes ] \
    [ -unallowed_protocols unallowed_protocols ] [ -no_js ] \
    [ -no_outer_urls ] [ -validate ] [ -fix ]

Defined in packages/acs-tcl/tcl/text-html-procs.tcl

Sanitizes HTML by specified criteria, basically removing unallowed tags and attributes, JavaScript or outer references into page URLs. When desired, this proc can act also as just a validator in order to enforce some markup policies on user-submitted content.

Switches:
-html (required)
the markup to be checked.
-allowed_tags (optional)
list of tags we allow in the markup.
-allowed_attributes (optional)
list of attributes we allow in the markup.
-allowed_protocols (optional)
list of attributes we allow into links
-unallowed_tags (optional)
list of tags we don't allow in the markup.
-unallowed_attributes (optional)
list of attributes we don't allow in the markup.
-unallowed_protocols (optional)
list of protocols we don't allow in the markup. Protocol-relative URLs are allowed, but only if proc is called from a connection thread, as we need to determine our current connection protocol.
-no_js (optional, boolean)
this flag decides whether every script tag, inline event handlers and the javascript: pseudo-protocol should be stripped from the markup.
-no_outer_urls (optional, boolean)
this flag tells the proc to remove every reference to external addresses. Proc will try to distinguish between external URLs and fine fully specified internal ones. Acceptable URLs will be transformed in absolute local references, others will be just stripped together with the attribute. Absolute URLs referring to our host are allowed, but require the proc being called from a connection thread in order to determine the proper current url.
-validate (optional, boolean)
This flag will avoid the creation of the stripped markup and just report whether the original one respects all the specified requirements.
-fix (optional, boolean)
When parsing fails on markup as it is, try to fix it by, for example, closing unclosed tags or normalizing attribute specification. This operation will remove most of plain whitespace into text content of original HTML, together with every comment and the eventually present DOCTYPE declaration.
Returns:
sanitized markup or a (0/1) truth value when the -validate flag is specified
Author:
Antonio Pisano

Partial Call Graph (max 5 caller/called nodes):
%3 test_ad_dom_sanitize_html ad_dom_sanitize_html (test acs-tcl) ad_dom_sanitize_html ad_dom_sanitize_html test_ad_dom_sanitize_html->ad_dom_sanitize_html aa_test_running_p aa_test_running_p (public) ad_dom_sanitize_html->aa_test_running_p ad_dom_fix_html ad_dom_fix_html (private) ad_dom_sanitize_html->ad_dom_fix_html ad_log ad_log (public) ad_dom_sanitize_html->ad_log dom dom ad_dom_sanitize_html->dom parameter::get parameter::get (public) ad_dom_sanitize_html->parameter::get ad_page_contract_filter_proc_general_comments_safe ad_page_contract_filter_proc_general_comments_safe (public) ad_page_contract_filter_proc_general_comments_safe->ad_dom_sanitize_html

Testcases:
ad_dom_sanitize_html
Source code:
    ## Allowed/Unallowed tags come from the user or default to
    ## those specified in the parameters

    array set allowed_tag {}
    if {![info exists allowed_tags]} {
        set allowed_tags [parameter::get -package_id $::acs::kernel_id -parameter AllowedTag]
    }

    array set allowed_attribute {}
    if {![info exists allowed_attributes]} {
        set allowed_attributes [parameter::get -package_id $::acs::kernel_id -parameter AllowedAttribute]
    }

    array set allowed_protocol {}
    if {![info exists allowed_protocols]} {
        set allowed_protocols [parameter::get -package_id $::acs::kernel_id -parameter AllowedProtocol]
    }

    if {"*" in $allowed_tags} {
        set allowed_tags "*"
    }
    foreach tag $allowed_tags {
        set allowed_tag([string tolower $tag]) 1
    }

    if {"*" in $allowed_attributes} {
        set allowed_attributes "*"
    }
    foreach attribute $allowed_attributes {
        set allowed_attribute([string tolower $attribute]) 1
    }

    if {"*" in $allowed_protocols} {
        set allowed_protocols "*"
    }
    foreach protocol $allowed_protocols {
        set allowed_protocol([string tolower $protocol]) 1
    }

    array set unallowed_tag {}
    if {![info exists unallowed_tags]} {
        set unallowed_tags {}
    }

    array set unallowed_attribute {}
    if {![info exists unallowed_attributes]} {
        set unallowed_attributes {}
    }

    array set unallowed_protocol {}
    if {![info exists unallowed_protocols]} {
        set unallowed_protocols {}
    }

    # TODO: consider default unallowed stuff to come from a parameter

    if {$no_js_p} {
        lappend unallowed_tags "script"
        lappend unallowed_attributes {*}{
            onafterprint onbeforeprint onbeforeunload onerror
            onhashchange onload onmessage onoffline ononline
            onpagehide onpageshow onpopstate onresize onstorage
            onunload onblur onchange oncontextmenu onfocus oninput
            oninvalid onreset onsearch onselect onsubmit onkeydown
            onkeypress onkeyup onclick ondblclick onmousedown
            onmousemove onmouseout onmouseover onmouseup
            onmousewheel onwheel ondrag ondragend ondragenter
            ondragleave ondragover ondragstart ondrop onscroll
            oncopy oncut onpaste onabort oncanplay
            oncanplaythrough oncuechange ondurationchange
            onemptied onended onerror onloadeddata
            onloadedmetadata onloadstart onpause onplay onplaying
            onprogress onratechange onseeked onseeking onstalled
            onsuspend ontimeupdate onvolumechange onwaiting onshow
            ontoggle
        }
        lappend unallowed_protocols "javascript"
    }

    foreach tag $unallowed_tags {
        set unallowed_tag([string tolower $tag]) 1
    }

    foreach attribute $unallowed_attributes {
        set unallowed_attribute([string tolower $attribute]) 1
    }
    foreach protocol $unallowed_protocols {
        set unallowed_protocol([string tolower $protocol]) 1
    }

    ##
    # root of the document must be unique, this will enforce it by
    # wrapping html in an auxiliary root element
    set lmarker "<root>"
    set rmarker "</root>"

    try {
        dom parse -html -- "${lmarker}${html}${rmarker}" doc

    } on error {errorMsg} {
        set severity [expr {$validate_p ? "notice" : "warning"}]
        if {$fix_p} {
            try {
                set doc [ad_dom_fix_html -html $html -dom]
            } on error {errorMsg} {
                if {![aa_test_running_p]} {
                    ad_log $severity "Fixing of the document failed. Reported error: $errorMsg"
                }
                return [expr {$validate_p ? 0 : ""}]
            }
        } else {
            #ns_log notice "PARSING of\n${lmarker}${html}${rmarker}\n FAILED"
            if {![aa_test_running_p]} {
                ad_log $severity "Parsing of the document failed. Reported error: $errorMsg"
            }
            return [expr {$validate_p ? 0 : ""}]
        }
    }

    $doc documentElement root

    #
    # We use the current location to validate URLs without a protocol.
    #
    set current_location [util_current_location]

    set queue [$root childNodes]
    while {$queue ne {}} {
        set node [lindex $queue 0]
        set queue [lrange $queue 1 end]

        # skip all non-element nodes
        if {$node eq "" || [$node nodeType] ne "ELEMENT_NODE"} {
            continue
        }

        # 1: check tag is allowed
        set node_name [string tolower [$node nodeName]]
        if {[info exists unallowed_tag($node_name)] ||
            ($allowed_tags ne "*" && ![info exists allowed_tag($node_name)])} {
            # invalid tag!
            if {$validate_p} {
                return 0
            } else {
                $node delete
            }
            continue
        }

        # tag itself is allowed, we can inspect its children
        lappend queue {*}[$node childNodes]

        # 2: check tag contains only allowed attributes
        foreach att [$node attributes] {
            set att [string tolower $att]
            if {[info exists unallowed_attribute($att)] ||
                ($allowed_attributes ne "*" && ![info exists allowed_attribute($att)])} {
                # invalid attribute!
                if {$validate_p} {
                    return 0
                } else {
                    $node removeAttribute $att
                }
                continue
            }

            # 3: check for any attribute that could contain a URL
            # whether this is acceptable
            switch -- $att {
                "href" - "src" - "content" - "action" {
                    set url [string trim [$node getAttribute $att ""]]
                    if {$url eq ""} {
                        continue
                    }

                    #
                    # Try to detect malicious attempts to
                    # "disguise" a protocol by replacing
                    # characters with HTML entities.
                    #
                    # Tools that target earlier versions of the
                    # HTML specification may not be able to
                    # properly recognize the latest entities.
                    #
                    # Currently, tDOM targets HTML standard 4.01,
                    # hence will not automatically unquote
                    # entities such as "&colon;" and others, that
                    # were introduced later. (See
                    # http://tdom.org/index.html/tktview/d59ea07e74a1903435a947862dd7acd74a4eb92e)
                    #
                    # To overcome this limitation, we pass the URL
                    # through ns_unquotehtml, which on NaviServer
                    # > 4.99.30 will recognize and properly
                    # unescape many of these new entities.
                    #
                    set url [ns_unquotehtml $url]

                    #
                    # Another trick seen by e.g. penetration tools
                    # is to try and sneak in URLs sporting
                    # multiple protocols. We reject those
                    # altogether.
                    #
                    if {![regexp -nocase {^([a-z]+:){2,}} $url]} {
                        #
                        # A normal "0 or 1 protocols" URL
                        #
                    } elseif {$validate_p} {
                        #
                        # Multi-protocol URL and we are
                        # validating. This HTML is invalid.
                        #
                        return 0
                    } else {
                        #
                        # Multi-protocol URL and we are
                        # sanitizing. Remove it from the
                        # result.
                        #
                        $node removeAttribute $att
                        continue
                    }

                    #
                    # Ensure the URL is complete. Relative or protocol
                    # relative URLs will be completed using the
                    # information from our current location.
                    #
                    set url [ns_absoluteurl $url $current_location]
                    if {$no_outer_urls_p && [util::external_url_p $url]} {
                        if {$validate_p} {
                            #
                            # External URL and we are
                            # validating. This HTML is invalid.
                            #
                            return 0
                        } else {
                            #
                            # External URL and we are
                            # sanitizing. Remove it from the
                            # result.
                            #
                            $node removeAttribute $att
                            continue
                        }
                    }

                    #
                    # Parse the URL
                    #
                    try {
                        #
                        # We extract the URL protocol. The URL is
                        # guaranteed to have one at this point.
                        #
                        ns_parseurl $url
                    } on ok {parsed_url} {
                        set proto [dict get $parsed_url proto]
                    } on error {errorMsg} {
                        ns_log warning "ad_dom_sanitize_html cannot parse URL '$url': $errorMsg"
                        if {$validate_p} {
                            #
                            # Cannot parse URL and we are
                            # validating. This HTML is invalid.
                            #
                            return 0
                        } else {
                            #
                            # Cannot parse URL and we are
                            # sanitizing. Remove it from the result.
                            #
                            $node removeAttribute $att
                            continue
                        }
                    }

                    #
                    # Check if the determined protocol is
                    # allowed. Since comparison values (e.g., in
                    # unallowed_protocol) are lower-case, lowercase
                    # the determined protocol as well.
                    #
                    set proto [string tolower $proto]
                    if {[info exists unallowed_protocol($proto)]
                        || ($allowed_protocols ne "*" && ![info exists allowed_protocol($proto)])
                    } {
                        # invalid attribute!
                        if {$validate_p} {
                            return 0
                        } else {
                            $node removeAttribute $att
                        }
                        continue
                    }
                }
            }
        }
    }

    if {$validate_p} {
        $doc delete
        return 1
    } else {
        if {[package vsatisfies [package require tdom] 0.9.3]} {
            # tDOM 0.9.3 will return the tree including the
            # parent.  To keep the previous behavior, one should
            # specify the -onlyContents flag, that previous
            # versions do not support.
            set html [$root asHTML -onlyContents]
        } else {
            set html [$root asHTML]
        }
        $doc delete
        # remove auxiliary root element from output
        set html [string range $html [string length $lmarker] end-[string length $rmarker]]
        set html [string trim $html]
        return $html
    }
XQL Not present:
Generic, PostgreSQL, Oracle
[ hide source ] | [ make this the default ]
Show another procedure: