Forum OpenACS Q&A: Localized searching

Posted by Lars Pind on
One of the items that came up was localized searching. I'm not sure
I understand what's involved here.

Is it mainly a question of limiting the search to things that are in
a given language?

Or is there more involved?

Of course, if you want to do word stemming in all manner of
different languages, this is going to be infinitely hard.

But if the database is in Unicode, and your search query is in
Unicode, what else is needed?

What are the issues that need to be resolved here?

Thanks,

/Lars

Posted by Tilmann Singer on
Language-specific indexing of tables that hold content in different languages is possible with Oracle interMedia, using a so-called multi-lexer together with a language column that tells it which language each row is in. This is what the nls_language field of the CR's cr_revisions table is for.

I don't know about OpenFTS, but would like to.

Posted by Dan Wickstrom on
OpenFTS has support for multiple languages, but I believe it has some problems with searching in different locales. The locale problems stem mostly from the use of flex as the parsing engine. That said, it wouldn't be too hard to switch to a different parser if a suitable one with good locale support could be found.

Making such a change would be desirable, especially if the new parser were thread-safe. Flex is not, so parsing has to be single-threaded: the parsing section of OpenFTS is treated as a critical section and protected with a mutex. A multi-threaded parser might improve performance, although I've never really noticed parsing being a bottleneck when using OpenFTS.
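
To make the critical-section idea concrete, here is a minimal Python sketch of serializing access to a parser that is not thread-safe. It is not OpenFTS code: `NonReentrantParser`, `parse_document`, and `index_all` are hypothetical names, and the toy parser only stands in for a scanner that keeps its state in shared buffers.

```python
import threading


class NonReentrantParser:
    """Hypothetical stand-in for a scanner that keeps state in a shared buffer."""

    def __init__(self):
        self._buffer = []  # shared scratch state, unsafe to touch from two threads at once

    def parse(self, text):
        self._buffer.clear()
        for word in text.split():
            self._buffer.append(word.lower())
        return list(self._buffer)


_parser = NonReentrantParser()
_parser_lock = threading.Lock()


def parse_document(text):
    """Treat parsing as a critical section: only one thread scans at a time."""
    with _parser_lock:
        return _parser.parse(text)


def index_all(documents):
    """Parse each document on its own thread; the lock serializes the parse step."""
    results = [None] * len(documents)

    def worker(i, text):
        results[i] = parse_document(text)

    threads = [
        threading.Thread(target=worker, args=(i, doc))
        for i, doc in enumerate(documents)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The threads still run concurrently for everything outside `parse_document`, so the lock only costs throughput if parsing itself is the bottleneck.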

Posted by Dan Wickstrom on
Here is a timely post on the PostgreSQL hackers list from Oleg Bartunov, one of the developers of tsearch, which underpins OpenFTS:
> OK, attached is an example of the problem.  Notice how trademarks and
> copyright symbols are being indexed along with the word.  This means that if
> someone searches for 'balance' in the above data set, they won't find
> anything.
>
> I'm not sure how this would be handled.  In the English language, it'd
> probably be safe to say that high ascii characters would be stripped from
> the index?  But you'd want to leave accents and stuff in I guess.  Tricky.

Rather tricky. The problem is that we don't know how to get flex to work
with locales. The parser recognizes Latin words ([a-zA-Z]), non-Latin words ([0-7]),
and mixed words ([a-zA-Z0-7]). Your case (Balance®) is a mixed word.
The right way is to have a locale-aware parser that recognizes words properly. We are inclined to drop flex.
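
The effect described in that post can be reproduced with a small Python sketch. It assumes the locale-unaware scanner counts every high-bit byte as a letter, which is only a reading of the post and not the actual OpenFTS grammar; `tokenize_bytes` and `tokenize_unicode` are hypothetical names.

```python
import re

# Byte-oriented rule: ASCII letters plus every high-bit byte count as word characters.
# ASSUMPTION: this imitates a locale-unaware scanner; it is not the OpenFTS grammar.
BYTE_WORD = re.compile(rb"[A-Za-z\x80-\xff]+")

# Unicode-aware rule: \w on str matches letters, digits, and underscore in any script,
# but not symbols such as the registered-trademark sign.
UNICODE_WORD = re.compile(r"\w+")


def tokenize_bytes(text, encoding="latin-1"):
    """Tokenize the encoded bytes, so symbols above 0x7f stick to the word."""
    raw = text.encode(encoding)
    return [match.decode(encoding) for match in BYTE_WORD.findall(raw)]


def tokenize_unicode(text):
    """Tokenize by Unicode character class, so symbols are split off."""
    return UNICODE_WORD.findall(text)
```

With the byte rule, "Balance®" is indexed as the single token "Balance®", so a search for "balance" misses it. The Unicode-aware rule yields "Balance" while still keeping accented words such as "café" intact.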
Posted by Tilmann Singer on
Do you happen to know whether the multi-language ability of OpenFTS allows (or will allow) indexing content in different languages within the same table, with some mechanism similar to the language column mentioned above?

(BTW, before anybody starts experimenting with interMedia, make sure you have a newer Oracle version than 8.1.7.0.1 for Linux, which has a bug related to UTF8-encoded CLOBs.)

Posted by Dan Wickstrom on
I believe you could, as the dictionary used for word stemming is stored in the indexing table. Of course, you would need to add code to select the correct stemmer for each item you're indexing, and the code is not designed to do this dynamically. It's something that could be added, though.
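
As a rough illustration of picking the stemmer per item, here is a Python sketch that keys a toy inverted index on a language code stored with each row. Nothing here is OpenFTS API: `SUFFIXES`, `stem`, `index_rows`, and `search` are hypothetical, and the suffix rules are deliberately crude placeholders rather than real stemming algorithms.

```python
# Toy suffix-stripping rules keyed by language code.
# ASSUMPTION: illustrative placeholders only, not real stemming algorithms.
SUFFIXES = {
    "en": ("ing", "ed", "s"),
    "de": ("en", "er", "e"),
}


def stem(word, language):
    """Strip the first matching suffix for the language; unknown languages pass through."""
    word = word.lower()
    for suffix in SUFFIXES.get(language, ()):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def index_rows(rows):
    """Build an inverted index mapping (language, stem) to a set of row ids.

    Each row is (row_id, language, text); the language value plays the role
    of the per-row language column discussed above.
    """
    index = {}
    for row_id, language, text in rows:
        for word in text.split():
            index.setdefault((language, stem(word, language)), set()).add(row_id)
    return index


def search(index, query, language):
    """Stem the query with the same language rules that were used at index time."""
    return index.get((language, stem(query, language)), set())
```

For example, after indexing `(1, "en", "searching documents")` and `(2, "de", "Dokumente suchen")`, an English query for "searched" finds row 1 and a German query for "suche" finds row 2. Using the wrong language for the query finds nothing, which is why the query has to be stemmed with the same rules as the row.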