Forum OpenACS Development: HTML quoting in the templating system

Collapse
Posted by Jeff Davis on
The one major bit of work on ACS 4.x (post 4.2 on which openacs is
based) that never found it's way to OpenACS is the noquote stuff done in
the munich office.  I personally think it is an inportant issue to
address but I am not sure what other people think about it.  Should this
be one the plate for 4.7?
Collapse
Posted by Lars Pind on
Jeff,

I think this is important. Thanks for bringing it up.

/Lars

Collapse
Posted by Jeff Davis on
The following is the writeup by Hrvoje Niksic on the work done...

HTML Quoting as Part of the Templating System

The Templating System.

Templating systems, as deployed by most web software, serve to distinguish the programming logic of the system from the presentation that is output to the user.

Before introduction of a templating systems to ACS, pages were built by outputting HTML text directly from Tcl code. Therefore it was hard for a designer or a later reviewer to change the appearance of the page. "Change the color of the table? How do I do that when I cannot even find the body tag?" At this point some suggest to embed Tcl code in the document rather than the other way around, like PHP does. But it doesn't solve the problem, because the code is still tightly coupled with the markup, requiring programmer-level understanding for every change. The only workable solution is to try to uncouple the presentation from the design as much as possible.

ACS 4.0 addressed the problem by introducing a custom-written templating system loosely based on the already-present capabilities of the AolServer, the ADP pages. Unlike the ADP system, which allowed the coder to register his own tags to encapsulate often-used functionality, the new templating system came with a pre-programmed set of tags that performed the basic transformations needed to process the page, and some additional value.

Comparing ACS templating to other templating systems, it is my impression that the former was designed to be useful in real life rather than minimalistic -- which is only makes sense given the tight deadlines most ArsDigita projects have to face. Besides the if tag, multiple tag and @variable@ variable substitution, which are sufficient to implement any template-based page, it also includes features like including one template in another, customizing site- or module-wide look using the master templates, directly importing query results to the template, facilities for building grid-tables, and more. This utilitarian approach to templating urges us to consider the quoting issues as integral part of the system.

Quoting.

In the context of HTML, we define quoting as transforming text in such a way that the HTML-rendered version of the transformed text is identical to the original text. Thus one way to quote the text "<i>" is to transform it to "&lt;i&gt;". When a browser renders the transformed text, entities &lt; and &gt; are converted back to < and >, which makes the rendered version of the transformation equal to the original.

The easiest way to guarantee correct transformation in all cases is to "escape" ("quote") all the characters that HTML considers special. In the minimalistic case, it is enough to transform &, <, and > into their quoted equivalents, &amp;, &lt;, and &gt; respectively. For additional usefulness in quoted fields, it's a good idea to also quote double and single quotes into &quot; and &#39; respectively.

All of this assumes that the text to be quoted is not meant to be rendered as HTML in the first place. So if your text contains "<i>word</i>", and you expect the word to show up in italic, you should not quote that string. If anything, you should quote word before enclosing it between <i> and </i>.

The ACS has a procedure that performs HTML quoting, ad_quotehtml. It accepts the string that needs to be quoted, and returns the quoted string. In ACS 3.x, properly written code was expected to call ad_quotehtml every time it published a string to a web page. For example:

doc_body_append "<ul>
"
set db [ns_db gethandle]
set selection [ns_db select $db {SELECT name FROM bboard_forums}]
while {[ns_db getrow $db $selection]} {
    set_variables_after_query
    doc_body_append "<li>Forum: <tt>[ad_quotehtml $name]</tt>
"
}
doc_body_append "</ul>
"
      
Obviously, this was very error-prone, and the programmers would, as often as not, forget to quote the variables that come from the database or from user input. This would "usually" work, but in some cases it would lead to broken pages and even pose a security problem. For instance, one could imagine a mathematicians' forum being named "0 < 1", or an HTML designers' forum being named "The Woes of <h1>".

In some cases the published variable must not be quoted. Examples for that are the bboard postings that are posted in HTML, or variables containing the result of export_form_vars. All in all, the decision about when to quote had to be made by the programmer on a case-by-case basis, and many programmers simply enjoyed the issue because the resulting code happened to work in 95% of the cases.

Then came ACS 4.

(Lack of) Quoting in ACS 4.

One hoped that ACS 4, with its advanced templating system, would provide an easy and obvious solution for the (lack of) quoting problem. It turned out that this did not happen, partly because no easy solution exists, and partly because the issue was ignored or postponed.

Let's review the ACS 3.x code from above. The most important change is that it comes in two parts: the presentation template, and the programming logic code. The template will look like this:

<ul>
  <multiple name=forums>
    <li>Forum: <tt>@forums.name@</tt>
  </multiple>
</ul>
      
Once you understand the (simple) workings of the multiple tag, this version strikes you as much more readable than the old one. But we're not done yet: we need to write the Tcl code that grabs forum names from the database. The db_multirow proc is designed exactly for this; it retrieves rows from the database and assigns variables from each row to template variables in each pass of a multiple of our choice.
db_multirow forums get_forum_names {
    SELECT name FROM forums
}
      
At this point the careful reader will wonder at which point the forum name gets quoted, and if so, how does the templating system know whether the forum name needs to be quoted or not? The answer is amazingly blunt: no quoting happens anywhere in the process. If a forum name contains HTML special characters, you have a problem.

There are two remedies for this situation, and neither is particularly appealing. One can rewrite the nice db_multirow with a db_foreach loop, manually create a multirow, and feed it the quoted data in the loop. That is ugly and error-prone because it is more typing and it requires you to explicitly name the variables you wish to export at several points. It is exactly the kind of ugly code that db_multirow was designed to avoid.

The alternative approach means less typing, but it's even uglier in its own subtle way. The trick is to remember that our templating still supports all the ADP features, including embedding Tcl code in the template. Thus instead of referring to the multirow variable with the @forums.name@ variable substitutions, we use <%= [ad_quotehtml @forums.name@] %>. This works correctly, but obviously breaks the abstraction barrier between ADP and Tcl syntaxes. The practical result of breaking the abstraction is that every occurrence of Tcl code in an ADP template will have to be painstakingly reviewed and converted once ADPs start being invoked by Java code rather than Tcl.

At this point, most programmers simply give up and don't quote their variables at all. Quoting is handled only in the areas where it is really crucial and where not handling it would quote immediate and visible breakage, such as in the case of displaying the bodies of bboard articles. This is not exaggeration; it has been proven by auditing the ACS 4.0, both manually and through grepping for ad_quotehtml. Strangely, this otherwise sad fact allows us to deploy a very radical but much more robust solution to the problem.

Quote Always, Except When Told Not to.

At the time when we came to realize how serious the quoting deficiencies of ACS 4.0 were, we were about two weeks away from the release of a project for the German Bank. There was simply no time to hunt all the places where a variable needs to be quoted and implement one of the above quoting tricks.

While examining the ADPs, we noticed that most substituted variable fall into one of three categories:

  1. Those that need to be quoted -- names and descriptions of objects, and in general stuff that ultimately comes from the user.
  2. Those for which it doesn't make a difference whether they are quoted or not -- e.g. all the database IDs.
  3. Those that must not be quoted -- e.g. exported form vars stored to a variable.
Finally we also remembered the fact that almost none of the variables are quoted in the current source base.

Our reasoning went further: if it is a fact that most variables are not quoted, and if the majority of variables either require quoting or are not harmed by it, then we are in a much better position if we make the templating system quote all variables by default! That way the variables from the first and the second category will be handled correctly, and the variables from the third category will need to be marked as noquote to function correctly. But even those should not be a problem, because HTML code that ends up quoted in the page is immediately visible, and all you need to do to fix it is add the marker.

We decided to test whether the idea will work by attempting to convert our system to work that way. I spent several minutes making the change to the templating system. Then we went through all the ADPs and replaced the instances of @foo@ where foo contained HTML code with @foo;noquote@.

The change took two people less than one day for the system that consisted of core ACS 4.0.1, and modules bboard, news, chat, and bookmarks. (We were also doing other things, so it's hard to measure correctly.) During two of the following days, we would find a broken page from time to time, typically by spotting the obviously visible HTML markup. Such a page would get fixed it in a matter of seconds by appending ;noquote to the name of the offending variable.

We launched successfully within schedule.

Porting the quoting changes to the ACS.

After some discussion, it was decided that these changes will be included into the next ACS release. Since the change is incompatible, it will be announced to module owners and the general public. Explanation on how to port your existing modules and the "gotchas" that one can expect follows in a separate document.

The discussion about speed, i.e. benchmarking results before and after the change, is also available.

Hrvoje Niksic
Collapse
Posted by Hrvoje Niksic on
Jeff, thanks for bringing this up. I very much wanted noquote to be included in OpenACS. Despite the initial agreement, this didn't happen because I didn't have time to initiate a public discussion, which Don and Ben requested. I hope this will provide an opportunity to rectify the situation.

Aside from what was said in the article, I'll try to briefly describe the advantages and the possible gotchas.

I still think noquote is an much-needed addition to ACS. We went with it in ShareNet (which my original article doesn't mention) and never regretted it. noquote takes very little time to get used to, and rids you of otherwise almost inevitable quoting bugs and many important security problems (!) without any additional effort.

There is a separate document describing porting existing pages to the noquote semantics in some detail:

http://jagor.srce.hr/~hniksic/no-quote-upgrade.html
Some people have expressed concern about the speed of wrapping almost all variables in [ad_quotehtml $var]. To make sure there is no slow-down for very large pages, we included an optional C implementation of ad_quotehtml. This should not be taken to mean that noquote slows down ACS so much as to *require* C code to work. The alternate ad_quotehtml is only provided as an add-on to make sure that high-performance sites that serve huge ADPs are not hindered with the change.

One thing you might want to be careful about is backward compatibility. At the time we introduced noquote, ACS 4 was pretty much an in-house thing. This is no longer the case with OpenACS. To keep the existing pages running, you might want to start by making noquote optional, off by default and turned on by a magic cookie in the ADP and/or by a flag to ad_return_template. I'm not sure if this would work correctly with includes and masters, but it's certainly worth a try.

If someone is willing to work on this on OpenACS side, I'm willing to provide help: the original patches, insight into the idea behind the changes, as well as other random improvements to the templating system that have accumulated in the ShareNet code base.

Collapse
Posted by Jeff Davis on
Hrvoje, I would like to get a copy of the noquote patches if you have them
around (that or just a tarball of the latest ACS version you had,
whichever is easier for you get).

Also any other templating fixes you have would be helpful as well;
smaller ones could go into 4.6 and I will probably try and get the
noquote stuff into OpenACS 4.7...

Collapse
Posted by Eduardo Pérez on
This is a hack to keep current things working correctly rather than a fix.
It's not a fix because when trying to fix some things it breaks others.
The correct fix is using ad_quotehtml where appropriate.

If someone want to continue with this hack,
Why instead declaring a function that does nothing "noquote" you declare a function that does something "quote", this way we finish with the overquote problem easier.

I'm almost sure using ;quote instead of ;noquote is far better idea
Why are you not using ;quote?
Did you discard ;quote in favor of ;quote or you didn't think it at first?

I prefer instead of "quote", "convert_utf8tohtmlutf8" or "convert_texttohtml" but that's a naming convention that only would help OpenACS beginners

Collapse
Posted by Jeff Davis on
No Eduardo, this is not a hack given that in the majority of cases the variables should in fact be quoted. The default behavior should be what is most common. Doing it this way means that if someone is either inexperienced or careless you still get a page which is not susceptible to cross site scripting exploits, and the way it breaks is obvious and easy to fix.

Doing it the other way around makes the failure mode silent and exposes your site to a security risk.

Collapse
Posted by Eduardo Pérez on
I understand your point, don't get me wrong, I know most people don't care about quoting, but they should care.

You know there are cases where you have html stored in the DB and you don't want to quote it when showing it in the html page.

Maybe we should use some kind of strict typing to prevent unquoted text to go directly to a html page.

Another problem with this approach is that the transition is not gradual, once you apply this noquote patch everything breaks, as it's quoted twice, and you have to remove all the code that properly quoted text.
That's why I suggest the alternative ;quote tag.

Collapse
Posted by Jim Lynch on
I'd like to suggest a different place to put the quoting/not-quoting attribute...

ad_page_contract -properties {
    navbar:onevalue,noquote
    keyrows:multirow
} {
}

Why, you may ask? Answer: since it is the goal of the templating system to separate the graphic designer from the programmer, why give me something programmatic to explain to the graphic designer?

On the other hand, this could be done in -both- places (.tcl -and- .adp), because there is a translation process that happens before the page is rendered. Since this is true, the translation process could be altered to insert quoting based on what it's told in the page contract.

For multirow data sources, any desired lack of quoting would be specified on a per-column basis when the multirow data structure is defined. For multirows read from the database, you could have a switch to the db_multirow command, that might look like "-no_quote_on_columns {column1, column3}".

If you offer the choice of where the quoting attribute is set (in the .tcl, or in the .adp), then project managers can explain the quoting to the designers if and only if they want to.

-Jim