Forum OpenACS Q&A: Postgres character conversion issues
I am moving some databases from Postgres 7.4.8 to 8.4.5, on CentOS 5. In the old database the encoding is UNICODE. So I did a text pg_dump, created my new databases like so:
createdb --template template0 --encoding unicode testdb
and imported the data:
psql testdb < testdb.dmp >& testdb.out
Unfortunately, these errors are sprinkled throughout:
ERROR: invalid byte sequence for encoding "UTF8": 0x8e
I know that iconv is the tool most often recommended to fix this, but I'm not sure what the conversion is. I tried running it from Unicode to UTF8, since that is what Postgres reports for my new database, but that produced a file full of what looked like Chinese characters. I also tried UTF8 to UTF8; that chokes on the first of the offending characters with this:
iconv: illegal input sequence at position 874924
I'd like to know a) why am I having problems when I'm not changing the character set? And b) is there any way of fixing this short of editing the dump file by hand?
It looks like the byte sequence is the complaint, meaning that the characters just before and just after don't make sense. It's probably not the case for all 0x8e cases.
Sometimes inputs, such as registration data, are broken, either purposefully or just some kind of encoding conversion issue etc.
Regardless, look at the dump data surrounding the complaint to identify the cause. It may be a unique combination in the db. A guess substitution may be close enough to keep your task from being delayed much further.