Archive for category Python

Convert a MySQL DB from latin1 to UTF8

For work, I have to convert my very old database from latin1 based( with big5 in it ) to UTF8.

My Enviornment:

  • OS: Gentoo
  • Python: Python 2.4.3
  • MySQL: MySQL 5.0.70-log Gentoo Linux mysql-5.0.70-r1
  • Old encoding: latin1(big5)
  • Try to convert to: UTF8

And here’s my solution, and memo.
Assume your private parameters as following:

  • Your old db name: your_old_db
  • Your new db name: your_new_db
  • Your mysql account: your_db_acc
  • Your mysql server’s ip or dn: your_db_host
# dump out a sql through latin1 charset
mysqldump -u wwwuser -p –default-character-set=latin1 your_old_db > output.sql

# — fix origin output.sql’s big5 problem here ( do some regex replacement, it depends ) —

# convert the "real storage encoding" from "big5" to utf8, latin1 is just a tag note in db
piconv -f big5 -t utf8 output.sql > utf8.sql

# replace all charset and collate in sql file, and delete the line which include "SET NAMES latin1"
# for matching, the longer the better ex: match from "type=Mxxx….", you could do this from the beginning of each single line
sed -e ‘s/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/g’ -e ‘s/CHARSET=gbk/CHARSET=utf8/g’ -e ‘s/ collate gbk_bin//g’ -e ‘s/ COLLATE=gbk_bin//g’ -e ‘/SET NAMES latin1/d’ utf8.sql > utf8_after_sed.sql

# echo SET NAMES utf8 to a new file named tmp.sql
echo "SET NAMES utf8;" > tmp.sql

# merge the two files
cat utf8_after_sed.sql >> tmp.sql

# rename the file to utf_final.sql, just a clarify
mv tmp.sql utf8_final.sql

# import to a new unicode based db, we’re done here
mysql -h your_db_host -u your_db_acc -p –default_character_set utf8 your_new_db < utf8_ultimate.sql

Thanks sooooooo a lot for Tib‘s kindly provide his consultation.

, , , ,

No Comments