Forum OpenACS Q&A: Invalid Unicode character sequence found in pg index.

1: Invalid Unicode character sequence found in pg index.

Posted by Ryan Gallimore on 05/14/06 02:27 PM

I have gotten the following error on a daily basis for several weeks now. Any idea what is causing it?

Thanks,
Ryan

13/May/2006:23:09:17 Error: Ns_PgExec: result status: 7 message: ERROR: Invalid UNICODE character sequence found (0xe20000)

transaction error

13/May/2006:23:09:17 Error: Aborting transaction due to error: Database operation "dml" failed (exception ERROR, "ERROR: Invalid UNICODE character sequence found (0xe20000) ")

ERROR: Invalid UNICODE character sequence found (0xe20000)

SQL: insert into index3 (lexem,tid,pos) values ('budgetâ',107532, '{1293}')

2: Re: Invalid Unicode character sequence found in pg index. (response to 1)

Posted by Ryan Gallimore on 05/23/06 11:48 AM

I've determined that the cause of this error is my search indexer for PDFs, where I run

set txt [exec pdftotext $data -]

inside a catch in search-procs.tcl.

Often the output from pdftotext is 99% UNICODE compliant, but there are some scattered a's with ^ on top of them (â), which choke the index table.

Is there any way to run another filter on the output from pdftotext to eliminate non-Unicode characters?

The pdftotext developers claim that fixing their utility to work with Unicode is not easy:
http://www.stillhq.com/ctpfaq/2002/comp.text.pdf-faq-2002-02.sgml

Thanks.

3: Re: Invalid Unicode character sequence found in pg index. (response to 1)

Posted by Torben Brosten on 05/23/06 02:47 PM

Depending on the OS used, you might want to prefilter using either "recode" or "iconv" to silently remove those gremlins.

4: Re: Re: Invalid Unicode character sequence found in pg index. (response to 3)

Posted by Ryan Gallimore on 05/24/06 02:18 AM

I'm using RHE, so iconv is available.
But I'm not sure which encodings to specify as source and target.
For reading into the indexer, should I be converting the output from pdftotext from LATIN1/ASCII to UNICODE or UTF-8? I've tried these combinations with no luck.

Any help from someone familiar with iconv and pdftotext?

Thanks

5: Re: Invalid Unicode character sequence found in pg index. (response to 1)

Posted by Torben Brosten on 05/24/06 08:09 AM

You want to convert to the encoding that your database is using.

'man iconv' explains how to use iconv.

..and..

iconv -l

..lists the available encodings.

http://linuxcommand.org/man_pages/pdftotext1.html

..mentions that the default text output is LATIN1 encoding, but that you can also specify the output to be another encoding using '-enc encoding-name'.
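A LATIN1 default would explain the 0xe2 in the log: in LATIN1 the single byte 0xE2 is "â", while in UTF-8 the same byte only opens a three-byte sequence and is invalid on its own. Here is a minimal Python sketch of that mismatch. The sample bytes are reconstructed from the logged SQL (an assumption), and the helper name is made up for illustration:

```python
# Sample bytes assumed from the logged lexem 'budgetâ' (0xE2 as the final byte).
SAMPLE = b"budget\xe2"


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # Read as UTF-8, the lone 0xE2 is a truncated multi-byte sequence.
    print(is_valid_utf8(SAMPLE))
    # Read as LATIN1, the same byte is simply the character "â".
    text = SAMPLE.decode("latin-1")
    print(text)
    # Re-encoded as UTF-8, "â" becomes two bytes that a UTF-8 database accepts.
    print(is_valid_utf8(text.encode("utf-8")))
```

If that is what is happening, the bytes are not corrupt. They are just labelled with the wrong encoding, so converting them should preserve the accented characters rather than lose them.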

If the db is encoded in UNICODE (PostgreSQL's name for UTF-8), then try having pdftotext output UTF-8 with '-enc UTF-8'.

If pdftotext still outputs text with gremlins, consider processing the pdftotext output with iconv using the '-c' flag, since it silently removes characters that cannot be converted to the '-t' encoding.
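The two options can be compared in a short Python sketch, assuming the database expects UTF-8. Both function names are hypothetical. The first mirrors the "drop what does not fit" behaviour of iconv's '-c' flag, and the second re-labels LATIN1 bytes as UTF-8 instead of discarding them:

```python
def drop_invalid_utf8(raw: bytes) -> str:
    """Decode raw as UTF-8, silently discarding bytes that are not valid UTF-8."""
    return raw.decode("utf-8", errors="ignore")


def latin1_to_utf8(raw: bytes) -> bytes:
    """Treat raw as LATIN1 text and re-encode it as UTF-8, keeping every character."""
    return raw.decode("latin-1").encode("utf-8")


if __name__ == "__main__":
    gremlin = b"budget\xe2 plan"          # assumed sample with a stray 0xE2 byte
    print(drop_invalid_utf8(gremlin))     # the stray byte is dropped
    print(latin1_to_utf8(gremlin))        # the byte is kept and becomes 0xC3 0xA2
```

On the command line the rough equivalents would be 'iconv -c -f UTF-8 -t UTF-8' for the first and 'iconv -f LATIN1 -t UTF-8' for the second. Dropping bytes is the safer choice when the input mixes encodings, while transcoding keeps accented words searchable if the input really is LATIN1.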

iconv and pdftotext should be accessible from the shell, so you can run some test cases to help determine what might work.

cheers,

Torben

6: Re: Re: Re: Invalid Unicode character sequence found in pg index. (response to 4)

Posted by koffi akimbo on 06/04/06 05:30 PM

hello