Carl, here is some feedback, mostly about the tags;
I have not looked at the attributes in detail.
We should allow the tags from HTML 4.01, excluding these:
- programmatic elements (forms, form elements, applet, object, map, script/noscript),
- head elements (link, title, meta, ...),
- frames
- style
Tags/attributes that take URLs are potentially dangerous: the URLs have to be checked to disallow javascript: and links that point to untrusted sites. oacs already checks for untrusted protocols, but it is very hard to do this everywhere (e.g. when parsing inline styles). So, potentially dangerous are
- A and
- IMG,
- but as well the STYLE attribute.
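The protocol check on URL-carrying attributes (HREF on A, SRC on IMG) could be sketched roughly as follows. This is only an illustration in Python, not the code oacs actually uses; the name is_safe_url and the contents of ALLOWED_SCHEMES are assumptions. It looks at the scheme only and does not decide whether a site is trusted.

```python
from urllib.parse import urlsplit

# Assumption: the set of acceptable protocols is a site decision.
ALLOWED_SCHEMES = {"http", "https", "mailto"}

def is_safe_url(url: str) -> bool:
    """Return True if the URL has no scheme or an allowed one.

    Hypothetical helper: it checks the protocol only, not whether
    the target site is trusted.
    """
    # Drop whitespace and control characters, which some browsers
    # ignore inside a scheme (e.g. "java\tscript:").
    cleaned = "".join(ch for ch in url if ch > " " and ch != "\x7f")
    scheme = urlsplit(cleaned).scheme.lower()
    # An empty scheme means a relative URL, which is accepted here.
    return scheme == "" or scheme in ALLOWED_SCHEMES
```

A check like this is easy for HREF and SRC, but it does not help with URLs hidden inside a STYLE value, which is the hard case.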
The XSS page lists e.g.
<DIV STYLE="background-image: url(javascript:alert('XSS'))">
or
<DIV STYLE="width: expression(alert('XSS'));">
which are dangerous for some browsers.
So, the STYLE attribute is dangerous and should be handled with care (e.g. not in the default configuration).
Other attributes, e.g. CLASS, can be used to confuse the user (e.g. by reusing style classes from the navigation) or might break code (e.g. ID, when javascript code in oacs searches for IDs and finds unexpected occurrences).
Here is a slightly completed and sorted list of the HTML 4.01 elements that remain:
abbr acronym address b big blockquote br caption cite code col
colgroup dd del dfn div dl dt em fieldset font h1 h2 h3 h4 h5 h6 hr i
ins kbd legend li ol p pre q s samp small span strike strong sub sup
table tbody td tfoot th thead tr tt u ul var
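To make the list concrete, here is a minimal sketch of how a fragment could be checked against it, using Python's standard html.parser. The names TagCollector and disallowed_tags are hypothetical, and the sketch only reports tags that are not on the list; it does not rewrite the fragment and does not look at attributes.

```python
from html.parser import HTMLParser

# The element list from above, copied verbatim.
ALLOWED_TAGS = frozenset("""
abbr acronym address b big blockquote br caption cite code col
colgroup dd del dfn div dl dt em fieldset font h1 h2 h3 h4 h5 h6 hr i
ins kbd legend li ol p pre q s samp small span strike strong sub sup
table tbody td tfoot th thead tr tt u ul var
""".split())

class TagCollector(HTMLParser):
    """Hypothetical helper: remembers every start tag not on the list."""

    def __init__(self):
        super().__init__()
        self.rejected = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag not in ALLOWED_TAGS:
            self.rejected.append(tag)

def disallowed_tags(fragment: str) -> list:
    """Return the sorted, de-duplicated tags that are not allowed."""
    parser = TagCollector()
    parser.feed(fragment)
    parser.close()
    return sorted(set(parser.rejected))
```

For example, disallowed_tags('<p>see <a href="x">this</a></p>') reports a, since A is not on the list above.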
I am not sure whether we should allow the MS Office tags in the web pages, since these will cause errors in HTML conformance tests.
In general, we should distinguish between public content (where no special rights are required to provide HTML content), as in a public forum, where a conservative policy is required, and content from somewhat trusted and known content developers, where a more liberal policy can be used.
For the general user I would ask myself: why do we want to allow e.g. CLASS, STYLE or ID, and what do we gain by doing so? The general policy should stay conservative.
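The distinction between a conservative and a more liberal policy could be expressed as configuration, roughly like this. It is a sketch in Python under stated assumptions: HtmlPolicy, filter_attributes and the choice to let the liberal policy permit CLASS and ID (but still not STYLE) are made up for illustration. Only the three attributes discussed here are handled; everything else passes through unchanged, so a real filter would need its own attribute list.

```python
from dataclasses import dataclass

# The three attributes questioned above.
RISKY_ATTRIBUTES = frozenset({"class", "style", "id"})

@dataclass(frozen=True)
class HtmlPolicy:
    """Hypothetical policy: which risky attributes are permitted."""
    name: str
    permitted_risky: frozenset = frozenset()

# Default for public content: none of the risky attributes.
CONSERVATIVE = HtmlPolicy("conservative")
# Assumption: trusted authors may use CLASS and ID, STYLE stays off.
LIBERAL = HtmlPolicy("liberal", frozenset({"class", "id"}))

def filter_attributes(attrs, policy):
    """Drop risky attributes that the policy does not permit.

    attrs is a list of (name, value) pairs. Attributes outside
    RISKY_ATTRIBUTES are not examined by this sketch.
    """
    kept = []
    for name, value in attrs:
        lname = name.lower()
        if lname in RISKY_ATTRIBUTES and lname not in policy.permitted_risky:
            continue
        kept.append((name, value))
    return kept
```

With this shape, the conservative policy is what you get when nothing is configured, and anything more liberal has to be switched on explicitly.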