23rd July, 2007

Fun With UTF-8!

Monday, 12:34 pm in CodeGirl

Woohoo!  I am the queen of the femmegeeks!

So about a million years ago I stroked my ego a bit by going through and updating sk.log to serve XHTML correctly[1].  Part of this involved enforcing UTF-8 encoding on the site.  Unfortunately, this had the effect of breaking the (admittedly limited) support I had for non-ASCII characters.  Which is kinda ironic, since UTF-8 supposedly makes this sort of stuff easier.

Anyway, my problem was two-fold.  Firstly, the LiveJournal XML-RPC update client just went absolutely spazmo whenever it tried to upload a post containing non-ASCII characters, though this was nothing new.  Secondly, UTF-8 characters weren’t ‘holding’ in the database, so they weren’t being echoed back to the screen properly in posts and comments.  This was annoying.

After on-and-off tinkering, I’ve finally fixed both problems.  The fixes were reasonably simple, but actually finding the information on how to do it wasn’t.  First off, I had to go through my MySQL database and change all the collations to utf8_unicode_ci.  This was tedious, but not terribly difficult.
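If you'd rather script the collation change than click through every table, something along these lines should work.  It's only a sketch: the table names are made up, and it assumes you want MySQL's ALTER TABLE ... CONVERT TO CHARACTER SET form, which converts the stored text as well as changing the collation.  Back up first, because data that was already saved with the wrong encoding can come out double-encoded.

```php
<?php
// Hypothetical table names; substitute the tables in your own schema.
$tables = ['posts', 'comments'];

/**
 * Build the ALTER TABLE statement that converts one table (and its text
 * columns) to utf8 with the utf8_unicode_ci collation.
 */
function build_utf8_alter(string $table): string
{
    // Backtick-quote the identifier, doubling any embedded backticks.
    $quoted = '`' . str_replace('`', '``', $table) . '`';
    return "ALTER TABLE $quoted CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;";
}

// Print the statements so they can be reviewed before being run.
foreach ($tables as $table) {
    echo build_utf8_alter($table), "\n";
}
```

Printing the statements rather than executing them is deliberate: you can eyeball them, then paste them into whatever MySQL client you trust.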

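The next thing to do was to set my HTML forms to send UTF-8 data.  The attribute that actually controls this is accept-charset='UTF-8' on the <form> tag; without it, browsers submit in whatever encoding the page itself was served in.  I also added enctype='multipart/form-data' (the same as you’d do for a file-submit form), though that only changes how the form body is packaged, not its character encoding.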
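Here's roughly what a UTF-8-friendly form looks like when it's spat out from PHP.  The function name, action URL and field name are all invented for the example.  The important bits are the accept-charset attribute, which names the encoding the browser should submit in, and the Content-Type header, which declares the page itself as UTF-8.

```php
<?php
/**
 * Return a comment form that asks the browser to submit UTF-8.
 * The action URL and field names are made up for illustration.
 */
function render_comment_form(string $action): string
{
    // Escape the URL so it is safe inside an HTML attribute.
    $safeAction = htmlspecialchars($action, ENT_QUOTES, 'UTF-8');
    return <<<HTML
<form method="post" action="$safeAction" accept-charset="UTF-8">
  <textarea name="comment" rows="5" cols="40"></textarea>
  <input type="submit" value="Post" />
</form>
HTML;
}

// Declare the page itself as UTF-8 before any output is sent.
header('Content-Type: text/html; charset=UTF-8');
echo render_comment_form('/comment.php');
```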

Next, you have to actually tell MySQL to explicitly expect UTF-8 data.  This is the ‘fun’ part, and the part about which I could find the least information; the only places that seemed to mention it were obscure Googled blog entries like yours truly’s.  To do this, you have to add:

mysql_query( "SET NAMES 'utf8';" );

into your PHP script right after you connect, before any mysql_query() dealing with UTF-8 characters.  This includes SELECTs, UPDATEs and INSERTs.  Running it before every single query works too, but the setting lasts for the life of the connection, so once per connection is enough.  As far as I can tell, you have to do this because the MySQL installations on most shared hosting default the connection character set to latin1.  So you’ll be sending UTF-8 data from your form (since your page and form are set up for UTF-8), but MySQL will try and read it as latin1, even though your tables are set to collate as utf8.  Yeah, it’s stupid, I know.
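For anyone on a newer PHP: the mysql_* functions were removed in PHP 7, so here is a sketch of the same idea using mysqli instead.  The function name and the credentials in the usage comment are placeholders, not anything from sk.log.  The charset is set once, straight after connecting, and then applies to every query on that connection.

```php
<?php
/**
 * Open a MySQL connection and switch it to UTF-8 once, up front.
 * Uses mysqli rather than the old mysql_* functions, which were
 * removed in PHP 7.  All arguments are supplied by the caller.
 */
function connect_utf8(string $host, string $user, string $pass, string $name): mysqli
{
    $db = new mysqli($host, $user, $pass, $name);
    if ($db->connect_errno) {
        throw new RuntimeException('Connect failed: ' . $db->connect_error);
    }
    // Does roughly the same job as running SET NAMES 'utf8' on this connection.
    if (!$db->set_charset('utf8')) {
        throw new RuntimeException('Could not switch to utf8: ' . $db->error);
    }
    return $db;
}

// Usage (placeholder credentials):
// $db = connect_utf8('localhost', 'user', 'secret', 'blog');
// $result = $db->query('SELECT title FROM posts');
```

Doing it inside the connect helper means there's no way to forget it on some stray query later.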

Finally, if you use PHP functions like htmlentities(), you will need to tell these to explicitly use 'UTF-8' via the optional charset argument (the third parameter).
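A quick illustration of the charset argument.  The sample string is mine; the accented character is written as explicit UTF-8 bytes so the example doesn't depend on how the file is saved.  (PHP 5.4 and later default to UTF-8 here anyway, but spelling it out doesn't hurt.)

```php
<?php
// "Voilà <b>" with the à written as its two UTF-8 bytes.
$text = "Voil\xC3\xA0 <b>";

// The third argument names the charset of the input string.
$escaped = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $escaped, "\n"; // prints: Voil&agrave; &lt;b&gt;
```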

And that’s it.  Voilà, your forms should now deal with correctly-encoded UTF-8.

The LJ XML-RPC thing was a bit more annoying.  No matter how hard I tried, I couldn’t get it to accept non-ASCII characters, even as part of a UTF-8 encoded stream.  So in the end I kluged it, using this function some handy soul left in the PHP manual.
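To give a flavour of that sort of kludge, here is a sketch that turns every non-ASCII character into a numeric HTML entity, so the payload going over XML-RPC is pure ASCII.  To be clear, this is an assumption about the general approach and not necessarily the PHP manual function mentioned above; the helper name is hypothetical.

```php
<?php
/**
 * Replace every non-ASCII character in a UTF-8 string with a decimal
 * numeric character reference, leaving plain ASCII untouched.
 * Hypothetical helper for illustration only.
 */
function utf8_to_numeric_entities(string $text): string
{
    $result = preg_replace_callback(
        '/[^\x00-\x7F]/u',
        function (array $match): string {
            // $match[0] holds the 2 to 4 UTF-8 bytes of one character.
            $bytes = array_values(unpack('C*', $match[0]));
            $count = count($bytes);
            // Strip the length marker bits from the lead byte.
            $leadMask = [2 => 0x1F, 3 => 0x0F, 4 => 0x07][$count];
            $code = $bytes[0] & $leadMask;
            // Each continuation byte contributes its low six bits.
            for ($i = 1; $i < $count; $i++) {
                $code = ($code << 6) | ($bytes[$i] & 0x3F);
            }
            return '&#' . $code . ';';
        },
        $text
    );
    // preg_replace_callback() returns null if $text is not valid UTF-8.
    if ($result === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $result;
}

echo utf8_to_numeric_entities("Voil\xC3\xA0"), "\n"; // prints: Voil&#224;
```

Since a LiveJournal entry body is HTML, entities like these should display as the original characters once the post is rendered.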

So hopefully that should be it.  I can now use accented characters with impunity and there should be no more annoying gibberish or un-crossposted posts.  Hooray.

  1. Fun fact: I have not written anything in XHTML since.  Nowadays, everything I write is HTML-strict.

  • v-s.net v0.6 and all content (unless noted) © Dee.