In general, we should move towards storing user input in the database as Unicode and rely on the display logic to render it as browser/HTML-friendly text where needed (this should be the default behaviour).
For example, the ASCII Demoroniser (http://www.fourmilab.ch/webtools/demoroniser/), which "corrects moronic Microsoft HTML", has a Unicode variant:
http://rheme.net/unmoroniser/
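I have not checked how the linked scripts are implemented, so the following is only a rough Python sketch of the general idea as I understand it: text pasted from Windows applications often arrives with Windows-1252 "smart" punctuation sitting in the C1 control range (U+0080 to U+009F), and a Unicode-oriented cleanup can map those to the characters that were meant. The function name `fix_cp1252_controls` is made up for illustration.

```python
def fix_cp1252_controls(text: str) -> str:
    """Replace C1 control code points with the characters Windows-1252 assigns to those bytes."""
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                # Reinterpret the code point as a single Windows-1252 byte.
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # A few bytes in this range are undefined in cp1252; leave them alone.
                pass
        out.append(ch)
    return "".join(out)
```

Ordinary non-ASCII text (German umlauts, Icelandic thorn and so on) lies outside that range and passes through unchanged.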
If we set this up correctly, we can easily produce things like valid XML feeds for content without worrying about what users type into text fields. At present I get feed validation errors for any non-typical input, e.g. special characters in German or Icelandic. We would also have less to worry about from user-caused XHTML validation errors as the toolkit's markup gets cleaner over time.
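To make the proposal concrete, here is a minimal sketch in Python of the "store Unicode, escape on output" split. It is not the toolkit's code; the function names (`normalize_for_storage`, `feed_item`) and the `<item>/<title>` layout are assumptions made up for the example, and it assumes the incoming bytes are UTF-8 unless told otherwise.

```python
import unicodedata
from xml.sax.saxutils import escape


def normalize_for_storage(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode user input once, at the boundary, and store it as normalized Unicode."""
    text = raw.decode(encoding, errors="replace")
    return unicodedata.normalize("NFC", text)


def _allowed_in_xml(ch: str) -> bool:
    """True if the character is in the XML 1.0 Char range."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def feed_item(title: str) -> str:
    """Render one stored title as an XML fragment (hypothetical feed layout)."""
    # Drop characters XML 1.0 forbids, then escape &, < and > for output.
    cleaned = "".join(ch for ch in title if _allowed_in_xml(ch))
    return "<item><title>%s</title></item>" % escape(cleaned)
```

The point of the design is that the database only ever holds plain Unicode text, and escaping happens in exactly one place per output format, so a title such as "Grüße & <b>Þór</b>" can be stored as typed and still yield a well-formed feed entry.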