Forum OpenACS Q&A: Invalid Unicode character sequence found in pg index.

I have gotten the following error on a daily basis for several weeks now. Any idea what is causing it?

Thanks,
Ryan

13/May/2006:23:09:17
    Error: Ns_PgExec: result status: 7 message: ERROR:  Invalid UNICODE character sequence found (0xe20000)

transaction error

13/May/2006:23:09:17 Error: Aborting transaction due to error: Database operation "dml" failed (exception ERROR, "ERROR: Invalid UNICODE character sequence found (0xe20000) ")

ERROR: Invalid UNICODE character sequence found (0xe20000)

SQL: insert into index3 (lexem,tid,pos) values ('budgetâ',107532, '{1293}')

I've determined the cause of this error is my search indexer for pdfs where I run

set txt [exec pdftotext $data -]

inside catch in search-procs.tcl.

Often the output from pdftotext is 99% UNICODE compliant, but there are some scattered A's with ^ on top of them, which choke the index table.

Is there any way to run another filter on the output from pdftotext to eliminate non-unicode characters?

pdftotext claims fixing their utility to work with unicode is not easy.
http://www.stillhq.com/ctpfaq/2002/comp.text.pdf-faq-2002-02.sgml

Thanks.

Depending on the OS used, you might want to prefilter using either "recode" or "iconv" to silently remove those gremlins.
I'm using RHE - so iconv is available.
But I'm not sure what encoding to specify to and from.
For reading into the indexer, should I be converting the output of pdftotext from LATIN1/ASCII to UNICODE or to UTF-8? I've tried these combinations with no luck.

Any help from someone familiar with iconv and pdftotext?

Thanks

You want to convert to the encoding that your database is using.

'man iconv' explains how to use iconv.

..and..

iconv -l

..states the available encodings.
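
To check for one particular name without reading the whole list, a rough sketch like the following might help. The helper name iconv_supports is made up for illustration, and the layout of the 'iconv -l' output differs between iconv implementations, so splitting on commas, slashes and spaces is an assumption that may need adjusting on your system.

```shell
#!/bin/sh
# Hypothetical helper: succeed if "iconv -l" lists the given encoding name.
# Assumption: names are separated by commas, slashes, spaces or newlines.
iconv_supports() {
    iconv -l | tr ',/ ' '\n\n\n' | grep -i -x -F -- "$1" >/dev/null
}

# Report on the two encodings discussed in this thread.
for enc in UTF-8 ISO-8859-1; do
    if iconv_supports "$enc"; then
        echo "$enc: listed by iconv"
    else
        echo "$enc: not listed by iconv"
    fi
done
```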

http://linuxcommand.org/man_pages/pdftotext1.html

..mentions that the default text output is LATIN1 encoding, but that you can also specify the output to be another encoding using '-enc encoding-name'.

If the db is encoded in UNICODE, then try having pdftotext output UNICODE as well, using the '-enc' option.
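
As a sketch of what that could look like from the shell, assuming the db's UNICODE encoding means UTF-8 (as it does in PostgreSQL) and that your pdftotext build accepts UTF-8 as an '-enc' name. The wrapper name pdf_to_utf8 is only for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper: write the text of a PDF to stdout as UTF-8.
# -enc selects pdftotext's output encoding; a "-" in place of the
# output file name sends the text to stdout.
pdf_to_utf8() {
    pdftotext -enc UTF-8 "$1" -
}

# Usage (the path is a placeholder):
#   pdf_to_utf8 /path/to/document.pdf > document.txt
```

In the indexer, the equivalent change would presumably be adding the same '-enc UTF-8' arguments to the existing exec call.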

If pdftotext still outputs text with gremlins, consider processing the pdftotext output with iconv using the '-c' flag, since it silently removes characters that are not convertible to the '-t' encoding.
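
A minimal sketch of such a filter, assuming the target encoding is UTF-8. The function name strip_invalid_utf8 is hypothetical, and the sample input simply mimics the 'budget' lexem from the failing insert, with a lone 0xE2 byte after it.

```shell
#!/bin/sh
# Hypothetical filter: copy stdin to stdout, dropping any byte
# sequences that are not valid UTF-8.
strip_invalid_utf8() {
    # -c omits characters that cannot be converted to the -t encoding.
    # Some iconv builds still exit non-zero after dropping input,
    # so the exit status is deliberately ignored here.
    iconv -c -f UTF-8 -t UTF-8 || true
}

# 'budget' followed by a lone 0xE2 byte, which is not valid UTF-8 on its own.
printf 'budget\342\n' | strip_invalid_utf8
```

Note that this throws the offending bytes away. If they are really LATIN1 characters, converting with '-f ISO-8859-1 -t UTF-8' instead would keep them.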

iconv and pdftotext should be accessible from the shell, so you can run some test cases to help determine what might work.
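
One possible test case, sketched under the assumption that the db wants UTF-8: ask iconv to convert the text from UTF-8 to UTF-8 and see whether it complains. The helper name is_valid_utf8 is made up, and the sample bytes stand in for real pdftotext output.

```shell
#!/bin/sh
# Hypothetical check: succeed only if stdin is valid UTF-8.
# Without -c, iconv fails when it meets an invalid input sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Case 1: the raw sample, 'budget' plus a lone 0xE2 byte.
if printf 'budget\342\n' | is_valid_utf8; then
    echo "raw sample: valid UTF-8"
else
    echo "raw sample: NOT valid UTF-8"
fi

# Case 2: the same bytes, treated as LATIN1 and converted to UTF-8 first.
if printf 'budget\342\n' | iconv -f ISO-8859-1 -t UTF-8 | is_valid_utf8; then
    echo "converted sample: valid UTF-8"
else
    echo "converted sample: NOT valid UTF-8"
fi
```

Feeding real pdftotext output through the same check, with and without a conversion step, should show which combination the db will accept.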

cheers,

Torben
