Forum OpenACS Q&A: Help on searching static pages with foreign characters

Hi,

We have installed OpenACS 4.5, AOLserver 3.3+ad13, PostgreSQL 7.2.1, OpenFTS 0.2 and the static pages package.

We are having trouble searching for extended characters in our static pages.

The site language is Spanish.

We have characters with an acute accent, like á, é, í, ó, ú, and our most famous special letter 'ñ', a.k.a. &ntilde;.

We create HTML pages in GoLive and put the pages on the server, then load the pages into the static pages module.

When we search for an accented word like 'expectación', OpenFTS searches for exactly 'expectación', but GoLive has saved the word as 'expectaci&oacute;n' using the extended HTML notation.

OpenFTS doesn't find anything.
But if we search for 'expectaci&oacute;n' then it finds the correct pages, BUT it shows the word found in its HTML representation, 'expectaci&oacute;n'. Not acceptable to show to a user.

Is there any method to solve this?
Any parameter for OpenFTS?
Is there any proc in the OpenACS API that could convert extended HTML character entities to single characters and vice versa?

Thanks.

Posted by Jeff Davis
There is a function util_expand_entities_ie_style which does some of what you want, but it does not currently have very many named entities. I think that's a good starting place. You could then change static pages to de-entity-ize your HTML, which would make FTS do the right thing. An alternative would be to make the FTS indexer smarter, but that might be more work.
Here is a table of entities to help you if you decide to rewrite the function. Ideally it should handle all of these entities, in both &#nnn; and &symbol; style. Please report back on how you solve the problem, as I'm interested in it as well :)
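[Editor's sketch] For the numeric &#nnn; side that the existing proc skips, a small decoder can be written in a few lines of Tcl. The proc name here is hypothetical, not an existing OpenACS API, and it handles decimal references only:

```tcl
proc expand_numeric_entities {html} {
    # Decode decimal character references such as &#241; into characters.
    # Splicing by string indices avoids regsub's special treatment of
    # & and \ in replacement text, and never rescans decoded output.
    set out ""
    while {[regexp -indices {&#([0-9]+);} $html match digits]} {
        foreach {m0 m1} $match break
        foreach {d0 d1} $digits break
        scan [string range $html $d0 $d1] %d code   ;# %d avoids octal surprises
        append out [string range $html 0 [expr {$m0 - 1}]]
        append out [format %c $code]
        set html [string range $html [expr {$m1 + 1}] end]
    }
    return $out$html
}
```

Because decoded characters go straight to the output buffer, an input like &#38;#38; decodes to &#38; and stays that way, instead of being re-expanded on a second pass.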
4: Solution: (response to 1)
Posted by Jorge Garcia
We have improved those procedures to include the most important characters we need to use in Spain.

We scan each new static page into the database and then filter the content of the file with:

ad_proc util_condense_entities { html }

Then we write back the modified, filtered file.

Maybe this could serve as a template for other languages.

---------------
#packages/acs-tcl/tcl/text-html-procs.tcl

ad_proc util_expand_entities { html } {

    Replaces all occurrences of common HTML entities with their plaintext equivalents
    in a way that's appropriate for pretty-printing.


    This proc is more suitable for pretty-printing than its
    sister proc, <a href="/api-doc/proc-view?proc=util_expand_entities_ie_style"><code>util_expand_entities_ie_style</code></a>.
    The two differences are that this one is more strict: it requires
    proper entities i.e., both opening ampersand and closing semicolon,
    and it doesn't do numeric entities, because they're generally not safe to send to browsers.
    If we want to do numeric entities in general, we should also
    consider how they interact with character encodings.

} {
    regsub -all {&lt;} $html {<} html
    regsub -all {&gt;} $html {>} html
    regsub -all {&quot;} $html {"} html
    regsub -all {&mdash;} $html {--} html
    regsub -all {&#151;} $html {--} html
    regsub -all {&aacute;} $html {á} html
    regsub -all {&eacute;} $html {é} html
    regsub -all {&iacute;} $html {í} html
    regsub -all {&oacute;} $html {ó} html
    regsub -all {&uacute;} $html {ú} html
    regsub -all {&Aacute;} $html {Á} html
    regsub -all {&Eacute;} $html {É} html
    regsub -all {&Iacute;} $html {Í} html
    regsub -all {&Oacute;} $html {Ó} html
    regsub -all {&Uacute;} $html {Ú} html
    regsub -all {&ntilde;} $html {ñ} html
    regsub -all {&Ntilde;} $html {Ñ} html
    regsub -all {&iquest;} $html {¿} html
    regsub -all {&iexcl;} $html {¡} html
    regsub -all {&ccedil;} $html {ç} html
    regsub -all {&Ccedil;} $html {Ç} html
    regsub -all {&uuml;} $html {ü} html
    regsub -all {&Uuml;} $html {Ü} html
    regsub -all {&amp;} $html {\&} html ;# must run last, after all other entities
    return $html
}

ad_proc util_condense_entities { html } {

    Replaces plaintext extended characters with their HTML entity equivalents.

} {
    regsub -all {&} $html {\&amp;} html ;# must run first, before entities are introduced
    regsub -all {<} $html {\&lt;} html
    regsub -all {>} $html {\&gt;} html
    regsub -all {"} $html {\&quot;} html
    regsub -all {\-\-} $html {\&mdash;} html
    regsub -all {á} $html {\&aacute;} html
    regsub -all {é} $html {\&eacute;} html
    regsub -all {í} $html {\&iacute;} html
    regsub -all {ó} $html {\&oacute;} html
    regsub -all {ú} $html {\&uacute;} html
    regsub -all {Á} $html {\&Aacute;} html
    regsub -all {É} $html {\&Eacute;} html
    regsub -all {Í} $html {\&Iacute;} html
    regsub -all {Ó} $html {\&Oacute;} html
    regsub -all {Ú} $html {\&Uacute;} html
    regsub -all {ñ} $html {\&ntilde;} html
    regsub -all {Ñ} $html {\&Ntilde;} html
    regsub -all {¿} $html {\&iquest;} html
    regsub -all {¡} $html {\&iexcl;} html
    regsub -all {ç} $html {\&ccedil;} html
    regsub -all {Ç} $html {\&Ccedil;} html
    regsub -all {ü} $html {\&uuml;} html
    regsub -all {Ü} $html {\&Uuml;} html
    return $html
}
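[Editor's sketch] As a standalone illustration of the intended round trip, here is an abridged version runnable in plain tclsh: plain proc instead of ad_proc, the table cut down to two letters, and \u escapes used so the sketch does not depend on the file's own encoding:

```tcl
# Abridged stand-ins for the procs above, runnable in plain tclsh.
proc condense {text} {
    regsub -all {&} $text {\&amp;} text          ;# must run first
    regsub -all "\u00f3" $text {\&oacute;} text  ;# ó
    regsub -all "\u00f1" $text {\&ntilde;} text  ;# ñ
    return $text
}
proc expand {text} {
    regsub -all {&oacute;} $text "\u00f3" text
    regsub -all {&ntilde;} $text "\u00f1" text
    regsub -all {&amp;} $text {\&} text          ;# must run last
    return $text
}

set original "expectaci\u00f3n y ma\u00f1ana"
set stored   [condense $original]   ;# expectaci&oacute;n y ma&ntilde;ana
set shown    [expand $stored]       ;# back to the original
```

The ordering comments are the important part: ampersand handling has to bracket the entity substitutions, or the produced entities get mangled.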

5: Re: Solution: (response to 4)
Posted by Jeff Davis
It would probably be faster to do this with string map, and for robustness I think you would want to put in the numeric codes for the characters rather than iso-8859-1 encoded characters (since otherwise, if the .tcl file encoding is not iso-8859-1, this will not behave as expected).

Here's an example of what I am talking about:

set text [string map [list \u00e4 ae \u00f6 oe \u00fc ue \u00df ss] $text]

We could definitely use this in OpenACS.
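[Editor's sketch] For instance, the entity table from the previous post could be collapsed into one pass. The proc name and abridged table here are mine, not OpenACS API:

```tcl
proc expand_entities_map {html} {
    # One pass over the input; string map never rescans text it has
    # already replaced, so &amp; can safely sit in the same table.
    # \uXXXX escapes keep the table independent of this file's encoding.
    set map [list \
        &aacute; \u00e1  &eacute; \u00e9  &iacute; \u00ed \
        &oacute; \u00f3  &uacute; \u00fa \
        &ntilde; \u00f1  &Ntilde; \u00d1 \
        &iquest; \u00bf  &iexcl;  \u00a1 \
        &amp;    &]
    return [string map $map $html]
}
```

Unlike a chain of regsubs, this also gets the ampersand case right for free: an input like &amp;aacute; becomes &aacute; and stops there, because the produced & is never rescanned.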

6: Re: Solution: (response to 5)
Posted by Tilmann Singer
This is also partially done in the util_text_to_url proc: http://dev.openacs.org:8000/cvs/openacs-4/packages/acs-tcl/tcl/utilities-procs.tcl?rev=1.28&content-type=text/x-cvsweb-markup

Maybe we should take this string map call out and put it in its own proc, e.g. util_to_safe_ascii? Better suggestions for the name?

Why do you think numeric entities interact with character encoding? I think they don't. E.g. &#225; would be the same as &aacute; no matter what encoding the page is in. Three-digit numeric entities always belong to the iso8859-1 charset. Four-digit numeric entities (&#xxxx;) shouldn't depend on the page charset, either.

I can't vouch for this in all possible cases, but it works with three-digit numeric entities on pages set to Russian character encoding in every browser in my zoo: IE/Mozilla/Opera/NN6 (all for Windows NT).

Can anybody provide an example to the contrary?

Note: NN4 doesn't handle entities (either numeric or named) whose charset differs from that of the page, and displays them as ? instead.

P.S. Unrelated complaint to the forum maintainer: filtering the NOBR tag I attempted to use above is stupid. Unless you really want to see auto-hyphens this way: iso8859-
1 charset.

I have noticed this JavaScript function:
http://www.devguru.com/Technologies/ecmascript/quickref/unescape.html

Maybe the browser itself could do the dirty work for us?

Maybe the C source code of this function could be used as a template if a full Tcl implementation is needed?

Just thinking :)

Posted by Jeff Davis
Vadim, I was not talking about numeric entities when referring to the encoding problem. It is rather that something like:
regsub -all {é} $html {\&eacute;} html
requires that the .tcl file be read as iso-8859-1 when the function is defined for it to work correctly. It is better to do
regsub -all "\u00e9" $html {\&eacute;} html
since it is not then sensitive to the encoding used when parsing the .tcl files.