Unicode & control Day 13 - 9/24/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University

Dec 25, 2015


Tyrone Terry

Unicode & control

Day 13 - 9/24/14

LING 3820 & 6820

Natural Language Processing

Harry Howard

Tulane University


Course organization

24-Sept-2014, NLP, Prof. Howard, Tulane University


http://www.tulane.edu/~howard/LING3820/

The syllabus is under construction.

http://www.tulane.edu/~howard/CompCultEN/


Review of Unicode


ASCII characters

  0 1 2 3 4 5 6 7 8 9 A B C D E F
0 – – – – – – – – – – – – – – – –
1 – – – – – – – – – – – – – – – –
2   ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ –
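
To read the table programmatically: the row label is the first hex digit of a character's code and the column label is the second. The sketch below assumes Python 3; the helper name table_position is made up for illustration and is not part of the course materials.

```python
# Sketch (Python 3 assumed): find a character's row and column in the ASCII table.
def table_position(ch):
    """Return (row, column) as hex digits for a single ASCII character."""
    code = ord(ch)                      # numeric code of the character
    if code > 0x7F:
        raise ValueError("not an ASCII character: %r" % ch)
    # High four bits give the row, low four bits give the column.
    return "%X" % (code >> 4), "%X" % (code & 0xF)

for ch in ("A", "z", "!"):
    row, col = table_position(ch)
    print(ch, "row", row, "column", col, "code", ord(ch))
```

For example, 'A' has code 65, which is 41 in hexadecimal, so it sits in row 4, column 1.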


6.2.1. Character encoding in Python


Open Spyder


6. Non-English characters: one code to rule them all


6.2.2. What happens when you type a non-ASCII character into a Python console?

>>> import sys
>>> sys.getdefaultencoding()

>>> special = 'ó'
>>> special
'\xc3\xb3'
>>> print special
ó
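
The console session above is Python 2, where a plain string holds bytes. As a point of comparison only, here is a minimal sketch assuming Python 3, where a string holds Unicode characters and the two bytes appear only after an explicit encode(). The variable name encoded is an illustrative choice.

```python
# Sketch (Python 3 assumed): one character, two UTF-8 bytes.
special = '\u00f3'                    # the character ó, written as an escape
encoded = special.encode('utf-8')     # explicit conversion from text to bytes

print(len(special))                   # number of characters
print(len(encoded))                   # number of bytes
print(repr(encoded))                  # the same two bytes the Python 2 console showed
```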


6.2.3. How to translate into and out of Unicode with decode() and encode()

>>> S1 = 'ca\xc3\xb1\xc3\xb3n'
>>> uS1 = S1.decode('utf8')
>>> uS1
u'ca\xf1\xf3n'
>>> len(uS1)
5
>>> utf8S1 = uS1.encode('utf8')
>>> print utf8S1
cañón
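
The same round trip can be sketched in Python 3 (an assumption; the slides use Python 2). There the encoded form is a bytes object and decode() returns an ordinary str. The names raw, text and back are illustrative.

```python
# Sketch (Python 3 assumed): decode UTF-8 bytes to text, then encode back.
raw = b'ca\xc3\xb1\xc3\xb3n'          # UTF-8 bytes, same content as S1 above
text = raw.decode('utf-8')            # bytes -> Unicode string
back = text.encode('utf-8')           # Unicode string -> bytes

print(len(raw))                       # counts bytes
print(len(text))                      # counts characters
print(back == raw)                    # the round trip loses nothing
```

The byte string is two units longer than the text because each accented letter takes two bytes in UTF-8.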


6.2.4.1. How to turn on non-ASCII character matching with re.UNICODE

>>> S1 = 'ca\xc3\xb1\xc3\xb3n' # same as before
>>> uS1 = S1.decode('utf8')
>>> uS1
u'ca\xf1\xf3n'
>>> import re
>>> lS1 = re.findall(r'\w{5}', uS1, re.U)
>>> lS1
[u'ca\xf1\xf3n']
>>> eS1 = ''.join(lS1)
>>> eS1
u'ca\xf1\xf3n'
>>> utf8S1 = eS1.encode('utf8')
>>> utf8S1
'ca\xc3\xb1\xc3\xb3n'
>>> print utf8S1
cañón
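
For comparison, a sketch assuming Python 3: there, str patterns match Unicode word characters by default, and the re.ASCII flag restores ASCII-only matching. The variable names are illustrative.

```python
# Sketch (Python 3 assumed): \w with and without Unicode awareness.
import re

text = b'ca\xc3\xb1\xc3\xb3n'.decode('utf-8')              # the word cañón

unicode_hits = re.findall(r'\w{5}', text)                  # Unicode-aware by default
ascii_hits = re.findall(r'\w{5}', text, re.ASCII)          # ñ and ó no longer count as \w
ascii_runs = re.findall(r'\w+', text, re.ASCII)            # only the ASCII stretches remain

print(len(unicode_hits), len(ascii_hits), ascii_runs)
```

With ASCII-only matching the accented letters break the word into pieces, which is the problem that re.UNICODE solves in the Python 2 session above.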


6.2.5. How to translate between Unicode strings and numbers with ord() and unichr()

>>> 'ó'
'\xc3\xb3'
>>> 'ó'.decode('utf8')
u'\xf3'
>>> ord(u'\xf3')
243
>>> unichr(243)
u'\xf3'
>>> test = unichr(243).encode('utf8')
>>> print test
ó
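
A sketch of the same conversions assuming Python 3, where unichr() no longer exists and chr() covers all Unicode code points. The variable names are illustrative.

```python
# Sketch (Python 3 assumed): character <-> number with ord() and chr().
code = ord('\u00f3')                  # character -> code point
char = chr(243)                       # code point -> character
utf8 = char.encode('utf-8')           # character -> UTF-8 bytes

print(code, hex(code))                # decimal and hexadecimal forms of the code point
print(repr(utf8))                     # the two-byte UTF-8 encoding
```

Note that the code point (hex f3) and the UTF-8 bytes (c3 b3) are different things: the first identifies the character, the second is one way of storing it.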


Chapter numbering

I am going to fold the Unicode chapter into §1 & §2 and move the next chapter (§8) up a notch, so the chapter numbering will change.


8. Control

Up to now, your short programs are entirely dependent on you for making decisions. This is fine for pieces of text that fit on a single line, but is clearly insufficient for texts that run to hundreds of lines in length. You will want Python to make decisions for you. How to tell Python to do so is the topic of this chapter, and falls under the rubric of control.


8.1. Conditions

The first step in making a decision is to distinguish those cases in which the decision applies from those in which it does not. In computer science, this is usually known as a condition.


8.1.1. How to check for the presence of an item with in

Perhaps the simplest condition in text processing is whether an item is present or not. Python handles this in a way that looks a lot like English:

>>> greeting = 'Yo!'
>>> 'Y' in greeting
>>> 'o' in greeting
>>> '!' in greeting
>>> 'o!' in greeting
>>> 'Yo!' in greeting
>>> 'Y!' in greeting
>>> 'n' in greeting
>>> '?' in greeting
>>> '' in greeting
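
A small sketch that runs all of these tests at once and collects the answers (written for Python 3; the names probes and results are illustrative). For strings, in asks whether the left side occurs as a contiguous substring, which is why 'Y!' fails and the empty string always succeeds.

```python
# Sketch (Python 3 assumed): substring tests with "in".
greeting = 'Yo!'
probes = ['Y', 'o', '!', 'o!', 'Yo!', 'Y!', 'n', '?', '']

# Map each probe to the result of the membership test.
results = {p: (p in greeting) for p in probes}

for p in probes:
    print(repr(p), results[p])
```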


in & lists

Lists behave much like strings, with the proviso that the string being asked about must match an item in the list exactly; a substring of an item does not count:

>>> fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']
>>> 'apple' in fruit
>>> 'peach' in fruit
>>> 'app' in fruit
>>> '' in fruit
>>> [] in fruit
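
A sketch of the difference, written for Python 3 (variable names are illustrative). The last line shows one possible way to ask the substring question of every item, using the built-in any(); this goes slightly beyond what the slide covers.

```python
# Sketch (Python 3 assumed): "in" on a list compares whole items.
fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']

exact = 'apple' in fruit              # a whole item matches
partial = 'app' in fruit              # a piece of an item does not
empty_string = '' in fruit            # the empty string is not an item either
empty_list = [] in fruit              # nor is an empty list

# To look inside the items, test each one separately.
partial_any = any('app' in item for item in fruit)

print(exact, partial, empty_string, empty_list, partial_any)
```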


Python can understand sequences of in conditions

>>> 'app' in 'apple' in fruit
# 'app' in 'apple' > True
# 'apple' in fruit > True
>>> 'aple' in 'apple' in fruit
>>> 'pea' in 'peach' in fruit
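
Python reads a chain such as a in b in c as (a in b) and (b in c), with b evaluated once. A sketch for Python 3, with illustrative variable names, that also shows what goes wrong if parentheses are added in the wrong place:

```python
# Sketch (Python 3 assumed): chained "in" tests are joined by an implicit "and".
fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']

chained = 'app' in 'apple' in fruit
spelled_out = ('app' in 'apple') and ('apple' in fruit)

misspelled = 'aple' in 'apple' in fruit       # first test fails
missing = 'pea' in 'peach' in fruit           # second test fails: no 'peach' in the list

# Parentheses change the meaning: this asks whether the value True is in the list.
grouped = ('app' in 'apple') in fruit

print(chained, spelled_out, misspelled, missing, grouped)
```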


8.1.2. How to check for the absence of an item with not in

>>> not 'n' in greeting
>>> 'n' not in greeting
>>> 'Y' not in greeting
>>> 'Y!' not in greeting
>>> 'Yo' not in greeting
>>> '' not in greeting
>>> 'apple' not in fruit
>>> 'peach' not in fruit
>>> 'app' not in fruit
>>> '' not in fruit
>>> 'pee' not in 'peach' not in fruit
>>> 'pea' not in 'peach' not in fruit
>>> 'pea' not in 'apple' not in fruit
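
A sketch of a few of these cases for Python 3 (variable names are illustrative). The two spellings not x in y and x not in y give the same answer; the second is the more usual style. Chains of not in are joined by an implicit and, just like chains of in.

```python
# Sketch (Python 3 assumed): absence tests with "not in".
greeting = 'Yo!'
fruit = ['apple', 'cherry', 'mango', 'pear', 'watermelon']

prefix_not = not 'n' in greeting              # parsed as not ('n' in greeting)
infix_not = 'n' not in greeting               # same test, preferred spelling
substring_absent = 'Y!' not in greeting       # 'Y!' is not a contiguous substring
item_absent = 'app' not in fruit              # 'app' is not a whole item

# ('pee' not in 'peach') and ('peach' not in fruit)
both_absent = 'pee' not in 'peach' not in fruit
# ('pea' not in 'peach') and ... : the first test already fails
first_fails = 'pea' not in 'peach' not in fruit
# ('pea' not in 'apple') and ('apple' not in fruit): the second test fails
second_fails = 'pea' not in 'apple' not in fruit

print(prefix_not, infix_not, substring_absent, item_absent)
print(both_absent, first_fails, second_fails)
```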


More on control

Next time
