Forum OpenACS Q&A: OpenFTS problem with non US_ASCII characters

This is my first installation of OpenFTS and I followed the installation instructions in the 5.1 docs to the letter.

I enabled the service contracts for edit-this-page, file-storage and news, created some content and searched it with initial success.

I then ran the queries to populate the search_observer_queue in order to index the existing content, but the indexer discarded most of it, complaining about an invalid encoding. The following is a sample from the error log:

[07/Nov/2004:17:30:38][29993.98316][-sched:18-] Error: Aborting transaction due to error:
Database operation "dml" failed (exception ERROR, "ERROR:  invalid byte sequence for encoding "UNICODE": 0xc3
")

ERROR:  invalid byte sequence for encoding "UNICODE": 0xc3

SQL: 
                    insert into index10 
                        (lexem,tid,pos) 
                         values 
                        ('novitÃ',12349,
                        '{89}')
My database is PostgreSQL 7.3, created with UNICODE encoding, and if I query the content with psql I get the correct word, i.e. 'novità'. The browser also shows the content correctly.

Perhaps I need to configure OpenFTS somehow, but I have no clue how to proceed. Any hint would be much appreciated.

Posted by Gary Roesler on
In fts_index.tcl, change the insert statement from

db_dml insert_idx_tbl "
    insert into $self(PREFIX)index$g
        (lexem,tid,pos)
        values
        ('$lexem',$tid,
        '\{[join $rs($lexem) ,]\}')"

to

set converted_lexem [encoding convertto utf-8 $lexem]
db_dml insert_idx_tbl "
    insert into $self(PREFIX)index$g
        (lexem,tid,pos)
        values
        ('$converted_lexem',$tid,
        '\{[join $rs($lexem) ,]\}')"

Posted by Claudio Pasolini on
Thank you very much, Gary: now it works!

Actually I was already quite close to the solution, but I converted the lexem to unicode instead of utf-8, without success.

I also had to shut down and restart AOLserver to get things to work.

Posted by Joel Aufrecht on
I have two related problems. I applied the fix above and restarted but it didn't have any effect, which is unsurprising because my error messages are slightly different:
[08/Nov/2004:13:47:18][16217.163850][-sched:25-] Error: Ns_PgExec: result status: 7 message: ERROR:  Invalid UNICODE character sequence found (0xc200)

transaction error
[08/Nov/2004:13:47:18][16217.163850][-sched:25-] Error: Aborting transaction due to error:
Database operation "dml" failed (exception ERROR, "ERROR:  Invalid UNICODE character sequence found (0xc200)
")

ERROR:  Invalid UNICODE character sequence found (0xc200)

SQL:
                    insert into index8
                        (lexem,tid,pos)
                         values
                        ('0rÂ,16537,
                        '{1005}')
and
[08/Nov/2004:13:47:54][16217.163850][-sched:25-] Error: Ns_PgExec: result status: 7 message: ERROR:  Cannot insert a duplicate key into unique index index10_key

transaction error
[08/Nov/2004:13:47:54][16217.163850][-sched:25-] Error: Aborting transaction due to error:
Database operation "dml" failed (exception ERROR, "ERROR:  Cannot insert a duplicate key into unique index index10_key
")

ERROR:  Cannot insert a duplicate key into unique index index10_key

SQL:
                    insert into index10
                        (lexem,tid,pos)
                         values
                        ('2004',18789,
                        '{20}')
Posted by Claudio Pasolini on
Joel,

Regarding the UNICODE error, something has gone wrong: you are still trying to insert an unconverted (or badly converted) string. Perhaps you could try your patch in a sample Tcl script and verify that it actually does the conversion.
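
A sample script along these lines might do for that check. It is only a sketch meant to be run in a plain tclsh, outside the indexer: the test word is the one from the first post, and the variable names simply mirror the patch above.

```tcl
# Sketch of a standalone check (assumed to run in tclsh, not in AOLserver):
# does `encoding convertto utf-8` produce the expected UTF-8 bytes?
set lexem "novit\u00e0"   ;# 'novità', the word from the first post

# Convert the Tcl string to its UTF-8 byte representation.
set converted_lexem [encoding convertto utf-8 $lexem]

# Dump the converted bytes as hex so the encoding is visible.
binary scan $converted_lexem H* hex

puts "characters in: [string length $lexem]"
puts "bytes out:     [string length $converted_lexem]"
puts "hex:           $hex"
```

The UTF-8 encoding of 'à' is the byte pair c3 a0, so the hex dump should end with those two bytes. The 0xc3 in the first error message is the first byte of that pair.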

I also got the duplicate key problem, but I have ignored it for the moment. It is caused by a double insertion into the search_observer_queue when you create new content (I observed this when creating a news item), and in that case the content is still processed correctly.

Posted by Gary Roesler on
What does your openfts_driver__index function look like?
Posted by Peter Alberer on
I would recommend looking at tsearch2 (there is a driver for OpenACS) rather than OpenFTS. I could not get OpenFTS to work with a German OpenACS installation; with tsearch2 I did not have a single problem.