Forum OpenACS Q&A: Re: Invalid Unicode character sequence found in pg index.

Posted by Ryan Gallimore on 05/23/06 11:48 AM

I've determined that the cause of this error is my search indexer for PDFs, where I run

set txt [exec pdftotext $data -]

inside a catch in search-procs.tcl.

The output from pdftotext is usually about 99% valid Unicode, but there are some scattered A's with a ^ on top of them (Â), which choke the index table.

Is there any way to run another filter on the output from pdftotext to eliminate the non-Unicode characters?
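
To illustrate the kind of filter I have in mind, here is a rough sketch. It assumes that throwing away everything outside printable ASCII is acceptable, which also loses legitimate accented characters, so it may well be too blunt. The proc name strip_non_ascii is made up for this post and is not an existing OpenACS proc.

```tcl
# Hypothetical helper (not part of OpenACS): keep tab, newline, carriage
# return, and printable ASCII (U+0020 to U+007E); drop everything else.
# Lossy by design: valid non-ASCII text is removed along with the junk.
proc strip_non_ascii {txt} {
    regsub -all {[^\t\n\r\u0020-\u007E]} $txt {} txt
    return $txt
}

# Intended use in the indexer (assumes $data holds the PDF path, as in my code):
#   set txt [strip_non_ascii [exec pdftotext $data -]]
```

Something less destructive that only removes the invalid sequences would be better, if such a thing exists.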

pdftotext claims that fixing their utility to work with Unicode is not easy:
http://www.stillhq.com/ctpfaq/2002/comp.text.pdf-faq-2002-02.sgml

Thanks.