Forum OpenACS Q&A: Re: Help on searching static pages with foreign characters

Why do you think numeric entities interact with character encoding? I think they don't. E.g. &#225; would be the same as &aacute; no matter what encoding the page is in. Three-digit numeric entities always belong to the iso8859-1 charset. Four-digit numeric entities (&#xxxx;) shouldn't depend on the page charset, either.

I can't vouch for this in all possible cases, but it works with three-digit numeric entities on pages set to Russian character encoding in every browser in my zoo: IE/Mozilla/Opera/NN6 (all for Windows NT).
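
The encoding independence claimed above can also be checked at the byte level. The following is only a minimal sketch, assuming Tcl 8.5 or later with the stock koi8-r encoding installed; the sample string is made up for illustration. It serialises the same markup under three page encodings and compares the results.

```tcl
# Hypothetical sample text; &#233; is the numeric entity for e-acute.
set html {caf&#233; au lait}

# Serialise the same markup under three different page encodings.
set bytes {}
foreach enc {iso8859-1 koi8-r utf-8} {
    lappend bytes [encoding convertto $enc $html]
}

# The entity consists of ASCII characters only, so the three byte
# sequences come out identical.
lassign $bytes latin1 koi8r utf8
set same [expr {$latin1 eq $koi8r && $koi8r eq $utf8}]
puts "identical bytes: $same"
```

Since the bytes on the wire are the same, any difference in rendering would have to come from how a browser resolves the entity, not from the page encoding itself.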

Can anybody provide an example to the contrary?

Note: NN4 doesn't handle entities (both numeric and non-numeric) if their charset differs from that of the page; it displays them as ? instead.

P.S. Unrelated complaint to the forum maintainer: filtering the NOBR tag I attempted to use above is stupid. Unless you really want to see auto hyphens this way: iso8859-
1 charset.

Posted by Jeff Davis on
Vadim, I was not talking about numeric entities when referring to the encoding problem. It is rather that something like:
regsub -all {é} $html {\&eacute;} html
requires that the .tcl file be read as iso-8859-1 when the function is defined for it to work correctly. It is better to do
regsub -all "\u00e9" $html {\&eacute;} html
since it is not then sensitive to the encoding used when parsing the .tcl files.
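
To illustrate the difference, here is a rough sketch (assuming Tcl 8.5 or later with the koi8-r encoding available; the sample string is hypothetical). It decodes the single byte that a literal é occupies in an iso-8859-1 file under two different assumed encodings, then performs the substitution with the ASCII-only escape.

```tcl
# The single byte 0xE9, as it would sit in a .tcl file saved as iso-8859-1.
set raw [binary format c 0xE9]

# What the interpreter sees depends on the encoding it assumes for the file.
set as_latin1 [encoding convertfrom iso8859-1 $raw]
set as_koi8r  [encoding convertfrom koi8-r $raw]
puts [expr {$as_latin1 eq "\u00e9"}]
puts [expr {$as_koi8r eq "\u00e9"}]

# The escape is pure ASCII in the source, so it names U+00E9 either way.
set html "caf\u00e9"
regsub -all "\u00e9" $html {\&eacute;} html
puts $html
```

Only the iso-8859-1 decoding yields é; under koi8-r the same byte becomes a Cyrillic letter, so a pattern written as a literal character would silently stop matching.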