Forum OpenACS Q&A: Postgres character conversion issues

Collapse
Posted by Janine Ohmer on
This is a repost of a question I put on StackExchange. Figured I'd increase my odds since there are lots of Postgres users here. And the data in question actually is for a couple of OpenACS sites. :)

From http://serverfault.com/questions/252904/character-set-issues-in-postgres-upgrade:

I am moving some databases from Postgres 7.4.8 to 8.4.5, on CentOS 5. In the old database the encoding is UNICODE. So I did a text pg_dump, created my new databases like so:

createdb --template template0 --encoding unicode testdb

and imported the data:

psql testdb < testdb.dmp >& testdb.out

Unfortunately, these errors are sprinkled throughout:

ERROR: invalid byte sequence for encoding "UTF8": 0x8e

I know that iconv is the tool most often recommended to fix this, but I'm not sure what the conversion is. I tried running it from Unicode to UTF8, since that is what Postgres reports for my new database, but that produced a file full of what looked like Chinese characters. I also tried UTF8 to UTF8; that chokes on the first of the offending characters with this:

iconv: illegal input sequence at position 874924

I'd like to know a) why am I having problems when I'm not changing the character set? And b) is there any way of fixing this short of editing the dump file by hand?

Collapse
Posted by Torben Brosten on
Janine,

It looks like the byte sequence is the complaint, meaning that the characters just before and just after don't make sense. It's probably not the case for all 0x8e cases.

Sometimes inputs, such as registration data, are broken, either purposefully or just some kind of encoding conversion issue etc.

Regardless, look at the dump data surrounding the complaint to identify the cause. It may be a unique combination in the db. A guess substitution may be close enough to keep your task from being delayed much further.

cheers,

Torben