ODE TO A SHIPPING LABEL
by Carlos Bueno

Once there was a little o,
with an accent on top like só.

It started out as UTF8,
(universal since '98),
but the program only knew latin1,
and changed little ó to "Ã³" for fun.

A second program saw the "Ã³"
and said "I know HTML entity!"
So "Ã³" was smartened to "&Atilde;&sup3;"
and passed on through happily.

Another program saw the tangle
(more precisely, ampersands to mangle)
and thus the humble "&Atilde;&sup3;"
became "&amp;Atilde;&amp;sup3;"
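The chain of mistakes in the poem can be reproduced in a few lines. This is a minimal Python 3 sketch, not code from the talk: the variable names and the helper `to_named_entities` are hypothetical, and the helper only handles non-ASCII characters that have an HTML 4 named entity.

```python
import html
from html.entities import codepoint2name


def to_named_entities(text):
    """Hypothetical helper: swap non-ASCII characters for HTML named entities."""
    pieces = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if ord(ch) > 127 and name is not None:
            pieces.append("&" + name + ";")
        else:
            pieces.append(ch)
    return "".join(pieces)


original = "\u00f3"  # ó

# Step 1: UTF-8 bytes (0xC3 0xB3) are misread as latin1, one character per byte.
misread = original.encode("utf-8").decode("latin-1")

# Step 2: a second program "helpfully" converts those characters to entities.
entities = to_named_entities(misread)

# Step 3: a third program escapes the ampersands again.
escaped = html.escape(entities)

print(entities)
print(escaped)
```

Each step is reasonable on its own; the damage comes from the first program guessing the wrong encoding.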
Character Encoding & Unicode - How to (╯°□°)╯︵ ┻━┻ with dignity
Every developer will inevitably feel the pain of character encoding issues. We will cover the fundamentals of character encoding and Unicode that every Python developer should know. We will teach you how to identify the types of problems that occur when dealing with character encoding, and outline a set of best practices and useful libraries for avoiding and fixing those problems.
Transcript
Character Encoding & Unicode How to (╯°□°)╯︵ ┻━┻ with dignity
Esther Nam & Travis Fischer
PyCon US 2014, Montréal
Uni-wat?!
┻━┻ ︵ヽ ノ( ┻━┻
How to (╯°□°)╯︵ ┻━┻ with dignity
“You'll be pleased to know that your talk title crashed our meeting robot, which is a great argument for the relevance of this talk. :-) ...”
– Luke Sneeringer | Program Committee Chair
#! /usr/bin/python
# -*- coding: utf8 -*-

# Opened file should be latin-1 encoded!
# If it’s not, call tech support ASAP
with open("input_file.csv") as input_file:
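The comment in the snippet above documents the expected encoding, but the `open()` call does nothing to enforce it. As a hedged sketch (not code from the talk), the encoding can be stated in the call itself with `io.open`, which accepts an `encoding` argument on both Python 2.7 and Python 3. The temporary directory and the sample rows are assumptions made so the example runs on its own.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Create a small latin-1 encoded sample file (illustrative data only).
path = os.path.join(tempfile.mkdtemp(), "input_file.csv")
with io.open(path, "w", encoding="latin-1") as output_file:
    output_file.write(u"city\nMontr\u00e9al\n")

# State the expected encoding in the call instead of in a comment.
# Reading now yields unicode text, or fails loudly if the bytes cannot be decoded.
with io.open(path, encoding="latin-1") as input_file:
    rows = input_file.read().splitlines()

print(len(rows))
```

Note that latin-1 maps every possible byte to a character, so decoding with it never raises; a wrong guess shows up as wrong characters rather than as an exception.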
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD …>
<html xmlns="http://www.w3.org/1999/xhtml" …>
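A document that declares its own encoding, as in the XML declaration above, spares the reader from guessing. The following Python 3 sketch assumes that this is the point of the slide; the one-element sample document is invented for illustration and uses ISO-8859-1 so that the declaration visibly matters.

```python
import xml.etree.ElementTree as ET

# Bytes as they might arrive from disk or the network. The declaration
# tells the parser that byte 0xE9 is "é" in ISO-8859-1.
document = b'<?xml version="1.0" encoding="ISO-8859-1"?><city>Montr\xe9al</city>'

root = ET.fromstring(document)
city = root.text  # decoded text, no manual .decode() call needed

print(len(city))
```

The same bytes without the declaration would be treated as UTF-8 by the parser and rejected as malformed.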
Best Practices
Example Application
Author | Review
G. van Rossum | If you decide to design your own car there are thousands sort of car…
R. Ebert | Every great car should feel new every time you drive it.
L. Torvalds | Volvo isn’t evil, they just make really crappy cars.
Application Processes Text
PSQL
Encoding: Windows 1252 (CP-1252)
Montreal -> Montréal
psql=# set server_encoding to "utf-8";
Sample Review Text:
My friend said: “I cannot believe this is a Volvo! I had a car just like this when I lived in Montreal.” He told me he had paid 9400€ for his.
Output from UTF-8 encoded PSQL database:
My friend said: �I cannot believe this is a Volvo! I had a car just like this when I lived in Montréal.� He told me he had paid 9400� for his.

Original CP-1252 Data:
My friend said: “I cannot believe this is a Volvo! I had a car just like this when I lived in Montreal.” He told me he had paid 9400€ for his.

Mixed CP-1252 & UTF-8:
My friend said: “I cannot believe this is a Volvo! I had a car just like this when I lived in Montréal.” He told me he had paid 9400€ for his.

Interpreted as UTF-8 by database:
My friend said: �I cannot believe this is a Volvo! I had a car just like this when I lived in Montréal.� He told me he had paid 9400� for his.
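The replacement characters above can be reproduced without a database. This is a sketch under assumptions: the sentence is shortened, the variable names are illustrative, and the way the UTF-8 "é" got into the CP-1252 data is simulated with a byte-level replace, since the slides do not show that step.

```python
# -*- coding: utf-8 -*-

# Original text, stored as CP-1252 bytes (curly quotes and the euro sign
# are the single bytes 0x93, 0x94 and 0x80 in that encoding).
original = u"\u201cI lived in Montreal.\u201d He paid 9400\u20ac."
cp1252_bytes = original.encode("cp1252")

# Simulate an edit that writes "Montréal" as UTF-8 into the same data.
mixed_bytes = cp1252_bytes.replace(b"Montreal", u"Montr\u00e9al".encode("utf-8"))

# A reader that assumes UTF-8 decodes the "é" correctly, but the three
# CP-1252 bytes are invalid UTF-8 and become U+FFFD replacement characters.
as_seen_by_database = mixed_bytes.decode("utf-8", "replace")

print(as_seen_by_database.count(u"\ufffd"))  # prints 3
```

Once data is mixed like this, no single codec decodes it correctly, which is why the encoding has to be fixed at the point where bytes enter the application.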
Traceback (most recent call last):
  File "...", line ..., in <module>
    unicode_row = row_text.decode()
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 31: ordinal not in range(128)
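In Python 2, calling `.decode()` with no argument uses the ASCII codec, which is why byte 0x93 (the CP-1252 left curly quote) fails. In Python 3, `bytes.decode()` defaults to UTF-8 and also rejects a lone 0x93. A minimal sketch of the fix is to name the codec that produced the bytes. The sample row below is an assumption built from the review text, and CP-1252 is taken from the encoding stated earlier in the slides.

```python
# -*- coding: utf-8 -*-

# Bytes as they would come out of a CP-1252 encoded source.
row_text = u"My friend said: \u201cI cannot believe this is a Volvo!\u201d".encode("cp1252")

# Decoding as ASCII fails on the first byte above 0x7F.
try:
    row_text.decode("ascii")
    failed = False
except UnicodeDecodeError as error:
    failed = True
    bad_byte = row_text[error.start:error.end]  # the offending byte, 0x93

# Naming the actual source encoding decodes the whole row.
unicode_row = row_text.decode("cp1252")

print(failed)
```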
>>> u'☃ Brrrr!'.encode('cp1252', 'strict')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/esther/ENV/lib/python2.7/encodings/cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2603' in position 0: character maps to <undefined>
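The `'strict'` argument in the session above is the default error handler. The standard codecs accept other handlers that trade the exception for some loss or escaping. The sketch below shows four of them on the same string; which one is appropriate depends on where the bytes are going, and none of them makes CP-1252 able to represent a snowman.

```python
# -*- coding: utf-8 -*-

text = u"\u2603 Brrrr!"  # the snowman has no CP-1252 mapping

# Substitute a question mark for each unencodable character.
replaced = text.encode("cp1252", "replace")

# Silently drop unencodable characters.
ignored = text.encode("cp1252", "ignore")

# Emit a numeric character reference, useful when the target is HTML or XML.
as_char_ref = text.encode("cp1252", "xmlcharrefreplace")

# Emit a Python-style backslash escape, useful for logs and debugging.
escaped = text.encode("cp1252", "backslashreplace")

print(replaced.decode("ascii"))
```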
Pragmatic Unicode
  http://nedbatchelder.com/text/unipain.html
The Absolute Minimum You Must Know
  http://www.joelonsoftware.com/articles/Unicode.html
Chapter on Strings in “Dive into Python” by Mark Pilgrim
  http://getpython3.com/diveintopython3/strings.html
General questions, relating to UTF or Encoding Form
  http://www.unicode.org/faq/utf_bom.html
Unicode HOWTO (Python 2.7)
  http://docs.python.org/2/howto/unicode.html
“Just what the dickens is ‘Unicode’?”
  https://pythonhosted.org/kitchen/unicode-frustrations.html
Differences between these commonly confused encodings
  http://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html
“Latin-1” in MySQL is more like “CP-1252”
  https://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html
Why it's important to write tests with character boundary values
  http://labs.spotify.com/2013/06/18/creative-usernames/