Strategy for systematic anonymisation of multi-lingual interaction corpora
C. Reffay (1), F.-M. Blondel (1), E. Giguet (2)
(1) STEF – ENS-Cachan / IFÉ – ENS-Lyon; (2) GREYC, Université Caen Basse-Normandie, CNRS

Mar 26, 2015




IC'2012 - C Reffay, F-M Blondel, E Giguet 2

Outline

• Introduction

• Anonymisation process
– Marking process
– Finding new forms
– Replacement process

• Testing the process on a Galanet session

• What did we learn? What works?

• Next step…


The corpus

• Galanet Session 2011-2012: “Nômades...nomadi...nómades... des langues” (Resp.: SandrineD)
• 4 teams: Italy, Brazil, France & Spain
• Duration: 3.5 months
• 103 teenagers, of whom 83 authors wrote 915 messages (message bodies), containing:
– Volume: 47 740 forms, 217 477 characters
– Lexicon: 9 655 distinct forms


The objective is to share!

But anonymising by hand is hard work
– The corpus may be enormous
– Subtleties: homonyms & synonyms

Personal data are not sharable

Anonymisation… the solution?

We need software to support the process


Anonymisation purpose

• Hide personal information systematically
– Names (first names, last names, usernames…)
– Identifiers (passport, National Student Number, …)
– Locations (city, street, address, coordinates)
– Institution/Workplace (school, sport club, firm, …)
– Contact references (e-mail, mobile, MSN, Skype, Twitter, telephone/fax)
– Explicit references (URL of homepages, blogs)
– Social media usernames (Facebook, MySpace, Hi5, Soundcloud, Badoo, Bebo, Friendster, Netlog, …)

• Maintaining text coherence and consistency
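Some of the categories above, such as contact references, follow surface patterns that can be searched for mechanically. The sketch below is only an illustration: both regular expressions, the function name and the sample sentence are assumptions, and only the telephone number is taken from the slides.

```python
import re

# Illustrative patterns only; real corpora need patterns tuned per country and platform.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b0\d{9}\b")  # assumption: 10-digit numbers starting with 0


def find_contact_references(text):
    """Return (kind, matched form) pairs for e-mail addresses and phone numbers."""
    hits = [("e-mail", m.group(0)) for m in EMAIL.finditer(text)]
    hits += [("telephone", m.group(0)) for m in PHONE.finditer(text)]
    return hits


# Made-up sentence; the address uses the reserved example.org domain.
sample = "scrivimi a nome.cognome@example.org o chiama 0609785643"
hits = find_contact_references(sample)
```

Every hit would still have to be confirmed by the researcher before it is marked.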


Personal data: examples

• {(f331s2970m2)2011-11-30T19:24 Gabibr Re: Quelques informations ... answers SandrineD (f331s2970m1)} “Eu amo a língua Francesa! Quem sabe falar francês me adiconem no meu FACEBOOK;) J'aime parler français! Qui peut parler français? M'ajouter dans FACEBOOK;) Nom: GABRIELA MEDEIROS.”

• {(f333s3016m2)2011-12-27T09:25 Miche Re: Les stéréotypes culinaires answers SandrineD (f333s3016m1)} “inviate i vostri documenti alla mia mail [email protected] grazie!!!;)”

• {(f330s2914m8)2011-10-22T19:52 PBS Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Yo me llamo Peimikà Bibiana. Como mi madre es tailandesa y mi padre es italiano, mi primer nombre, Peimikà, es tailandés y significa " dueña del amor ", mientras mi según nombre, Bibiana, es italiano y procede del etrusco " vibius " que significa " vida ". Me gusta mucho tener dos nombres (en Italia es más usual tener un nombre) y sobre todo estoy orgullosa de los orígenes diferentes que tienen y que hacen mi nombre aún más particular (además Peimikà no es muy difundido en tampoco en Tailandia y tampoco Bibiana en Italia”


Just google it!


Peimikà Bibiana… google search (2)


Anonymisation Principles

1. All identified lexical forms must be (computationally) marked even if not modified by a replacement form.

2. Any reference (e.g. name, institution or location) should be imprecise enough to encompass several hundred people.

(Diagram labels: Original lexical form; Replaced by; Replacement form; Mark)

Once anonymised, no participant may be identifiable by an external person


Anonymisation

• Before:

{(f330s2880m3)2011-10-17T08:22 KellyM Re: Qui sommes-nous? answers CarlaN (f330s2880m1)}

Bonjour, je m'appelle Kellly. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Rosa Luxemburg à Canet, non loin de Perpignan…

• After:

{(f330s2880m3)2011-10-17T08:22 FLG01 Re: Qui sommes-nous? answers ILG02 (f330s2880m1)}

Bonjour, je m'appelle Kittty*. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Margherita Duras* à Aigues-Vives*, non loin de Perpignan…


Hypotheses

• A fully automated method does not exist for all corpora

• Some decisions have to be taken by the researcher, not by the software

• Accuracy of the method will be achieved only for a given context (e.g. Galanet)

• “Named entities” do not occur randomly

Let’s find the regularities interactively with the expert: the researcher


Concepts manipulated

• Existing objects: Institution, Participant, Public person, Relative, Street, City…
• Named entities: Name, Surname, Username, First name, Last name, Addresses, Tel. number, MSN…
• Lexical forms: Pedro, KellyM, Eli, Elô, Kelly, Bergamo, Canet, Rosa Luxembourg, 0609785643, …

(Diagram labels: Real world; Corpus; Reference)


Anonymisation process

(Diagram elements)
• Corpus to anonymise
• Initial list of participants, usernames, institution…
• Named entities transformation table
• Process/Rules: discovering new forms
• Marking process => Corpus with marked entities
• Replacement process => Anonymised corpus


Transformation table: example

Synonyms: the same entity has different forms


Homonyms: the same form refers to different entities
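The two notions can be made concrete with a small data structure. This is a minimal sketch, not the authors' actual table: the class, its field names and the identifiers "P1" and "PUB1" are hypothetical, and the replacement value is a placeholder; only the lexical forms come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One real-world object referred to in the corpus (hypothetical schema)."""
    entity_id: str                            # illustrative identifier
    kind: str                                 # e.g. "participant", "public person"
    forms: set = field(default_factory=set)   # synonyms: every form of this entity
    replacement: str = ""                     # empty: marked but left unchanged


def entities_for(form, table):
    """Return every entity that a lexical form may refer to (homonyms)."""
    return [entity for entity in table if form in entity.forms]


table = [
    Entity("P1", "participant", {"Kelly", "Kellly", "KellyM"}, "placeholder"),
    Entity("PUB1", "public person", {"Gene Kelly", "Kelly"}),
]

# "Kelly" is a homonym (two entities); "Kellly" is a synonym of one participant.
candidates = entities_for("Kelly", table)
```

Keeping forms per entity, rather than one replacement per form, is what allows the same form to be replaced in one context and left unchanged in another.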


Marking one form: Example (Kelly)

A- List all occurrences (with their context) with a concordancer
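The slides use an existing concordancer for this step. As a rough illustration of what such a listing contains, here is a minimal keyword-in-context sketch; the function name, the context width and the second half of the sample sentence are assumptions.

```python
import re


def concordance(text, form, width=30):
    """Return (left context, match, right context) for each occurrence of form."""
    pattern = re.compile(r"\b" + re.escape(form) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(0), right))
    return lines


# First part quoted from the slides, second part made up for the example.
sample = "Galdric, Kelly et Antonhy ont vu un film avec Gene Kelly."
hits = concordance(sample, "Kelly")
for left, hit, right in hits:
    print(f"{left:>30} [{hit}] {right}")
```

Reading the left and right contexts side by side is what lets the researcher tell the participant from the public person.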


Marking one form: Example (Kelly)

B- Update the transformation table (e.g. public person Gene Kelly)


Marking one form: Example (Kelly)

C- Associate each occurrence with the appropriate entity (=> in the corpus: surround the occurrence with XML tags)
– Last name, normal form, unchanged: refers to the public person Gene Kelly
– First name, normal form, to be changed: refers to the participant KellyM
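The slides do not show the tag syntax, so the following is only a sketch of what surrounding an occurrence with XML tags could look like. The element name `ne` and the attributes `ref`, `type` and `change` are hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr


def mark(text, start, end, entity_id, category, change):
    """Wrap text[start:end] in a hypothetical <ne> element.

    Single-pass illustration: the surrounding text is XML-escaped here, so the
    function should not be applied twice to the same string.
    """
    attrs = (
        f"ref={quoteattr(entity_id)} "
        f"type={quoteattr(category)} "
        f"change={quoteattr('yes' if change else 'no')}"
    )
    return (
        escape(text[:start])
        + f"<ne {attrs}>{escape(text[start:end])}</ne>"
        + escape(text[end:])
    )


sentence = "Galdric, Kelly et Antonhy"
start = sentence.index("Kelly")
marked = mark(sentence, start, start + len("Kelly"), "KellyM", "first name", True)
```

The `change` attribute records the researcher's decision, so a form can be marked even when it is not replaced.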


Detecting new forms: 2 strategies

• Lexical rules: similar forms
– Eli -> Elô Ely ELY Seli
– Gabriela -> GABRIELA
– José -> Jose

• Context rules: similar context
– First names: “mi chiamo …”, “accord avec …”
– Cities: “Soy de …”, “vivo en …”, “j’habite à …”


1st Strategy: Lexical variation rules

103 known forms, including: Adriana Alèxia Anthony Baptiste Cleissa Eli… Elouise Emmanuel Federica Ferran Gabriela Guillem Iñigo Jaqueline Jean José Kelly Léo Mariana Mary Michela Monica Olalla Oleguer

31 new forms: adriana Alexia Antonhy baptiste Cleisa Elô Ely ELY Seli Louise MAnuel Federiac fran Fran GABRIELA guillem iñigo Jacqueline jean Jose Kellly Leo léo MariAna mary May Miche michelina moni olalla oleguer


2nd Strategy : Context rules

103 known first names (Adrià, …, Veronica)

145 contexts: left/right
Total: more than 250 tested rules

47 rules approved

15 good new forms: Antonhy Belle Bet Christine Fede Federiac Kellly Leo Line Maria May Peimikà Regina fran jean léo


Replacing process

• Before:

{(f330s2880m3)2011-10-17T08:22 KellyM Re: Qui sommes-nous? answers CarlaN (f330s2880m1)}

Bonjour, je m'appelle Kellly. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Rosa Luxemburg à Canet, non loin de Perpignan…

• After:

{(f330s2880m3)2011-10-17T08:22 FLG01 Re: Qui sommes-nous? answers ILG02 (f330s2880m1)}

Bonjour, je m'appelle Kittty*. J'ai 16 ans, je suis une élève en 1ère S dans le lycée Margherita Duras* à Aigues-Vives*, non loin de Perpignan…
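Once occurrences are marked, replacement can be a mechanical pass over the marks. The sketch below assumes a hypothetical mark syntax (`<ne ref="…" change="yes|no">`) and hypothetical entity identifiers E2 to E4; the replacement forms and the trailing asterisk follow the before/after example above, and treating Perpignan as marked but unchanged is an assumption.

```python
import re

# Hypothetical mark syntax: <ne ref="ID" change="yes|no">form</ne>
MARK = re.compile(r'<ne ref="([^"]+)" change="(yes|no)">(.*?)</ne>')

# Replacement forms taken from the slide's before/after example.
replacements = {"E2": "Margherita Duras", "E3": "Aigues-Vives"}


def replace_marked(text, table):
    """Swap each marked form flagged change="yes" for its replacement form."""
    def swap(match):
        ref, change, form = match.groups()
        if change == "yes":
            return table[ref] + "*"   # the asterisk flags a replaced form
        return form                    # marked but left unchanged
    return MARK.sub(swap, text)


marked = (
    'dans le lycée <ne ref="E2" change="yes">Rosa Luxemburg</ne> '
    'à <ne ref="E3" change="yes">Canet</ne>, '
    'non loin de <ne ref="E4" change="no">Perpignan</ne>'
)
result = replace_marked(marked, replacements)
```

Because the decision is stored in the mark, the same pass can be rerun with a different replacement table without touching the marking work.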


Conclusion

1. A new process/algorithm for anonymisation

2. Hypotheses confronted with a first corpus
– 47 rules approved for first names => 15 new forms
– 103 first names => 31 existing derivations
– Anonymisation cannot be 100% automatic => confirmed

3. Is anonymisation possible in a world with Google?
– Use Google to evaluate the frequency of a first name!


Next steps…

• Finalize concrete anonymisation of this corpus
– Discuss some choices with SandrineD for usernames, cities, email addresses, …
– Get feedback from SandrineD

• Verify on a bigger (Galanet) corpus:
– The process
– The rules

• Co-develop the tool:
– within the research community…
– in the (ANR) CORDIAL project?


Grazie! (Thank you!)


More precisely


Discovering new forms: 2 strategies

103 known first names (Adrià, …, Veronica)

• Lexical rules: 317 candidates
– 50 auto: 34 frequent words, 16 known
– 200 easy: 180 common words, 20 usernames
– 67 tests: 5 common, 31 good new forms, 1 relative new (Maria), 30 public names

• Context rules: 145 contexts (left/right)
– Left, one form: 75 => 13780 occ.
– Left, 2-form sequence: 123 => 1700 occ.
– Total: more than 250 tested rules
– 47 rules approved
– 15 good new forms


Contexts of 145 occ. of 103 first names (using TXM, case insensitive)


The corpus lexicon

• A list of (lexical form ► frequency)
– de ► 1015
– que ► 965
– la ► 673
– …
– porque ► 48
– …
– Addams ► 1

9655 unique forms
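A lexicon of this kind is a frequency count over the forms of all message bodies. The sketch below is an approximation: the slides rely on TXM, whose tokenisation is not described here, so splitting on `\w+` and keeping case distinctions are assumptions. The two sample strings are fragments of a message quoted earlier.

```python
import re
from collections import Counter


def lexicon(messages):
    """Count how often each lexical form occurs across message bodies."""
    counts = Counter()
    for body in messages:
        # Assumption: a form is a maximal run of word characters.
        counts.update(re.findall(r"\w+", body))
    return counts


messages = ["Yo me llamo Peimikà Bibiana.", "Me gusta mucho tener dos nombres"]
counts = lexicon(messages)
most_frequent = counts.most_common(3)
```

Rare forms at the bottom of the list (frequency 1, like "Addams") are natural candidates to inspect for named entities.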


Who is concerned?

“Computer applications for teaching and educational purposes draw on data that allow natural persons to be identified directly, but also indirectly. Particular attention must be paid to the collection of sensitive data and to the procedures used to anonymise data.”

(Mallet-Poujol 2004: p 21)

For more information, see European Directive 95/46/EC (European Parliament and Council)


Legal context (95/46/EC)

• (Art7) Member States shall provide that personal data may be processed only if: the data subject has unambiguously given his consent;…

• (Art8) Member States shall prohibit the processing of personal data revealing sensitive information (racial or ethnic origin, political opinions, religious or philosophical beliefs, trade-union membership, and the processing of data concerning health or sex life)

• (Art10) […] Inform the data subject on:
– The identity of the controller of the data collection
– The purposes of the processing
– The recipients or categories of recipients of the data
– The existence of the right of access to and the right to rectify the data concerning him


Text coherence and consistency

• {(f330s2914m11)2011-10-20T16:43 M_Cavalcanti Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “aaah, o meu é uma homenagem a uma de minhas tias e minha avó que se chamam Ana e ao resto de minhas tias que se chamam Maria. Daí, Mariana:)”

• {(f330s2914m10)-2011-10-20T21:06 Eloandrade Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Gostei da criatividade da sua mãe MariAna! Rsrsrs”

• {(f330s2914m3)2011-10-28T00:54 LineCosta Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Ah meu nome é em homenagem a Jacqueline Kennedy, esposa do ex- presidente dos EUA, e também porque sempre foi um dos nomes preferidos do meu pai.: D”

• {(f330s2914m18)2011-10-19T20:36 Eloandrade Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Bem, minha mãe queria que meu nome começasse com a letra E (como o dela!), um certo dia ela viu o nome de uma atriz brasileira chamada Louise Cardoso. Gostou do " Louise ", mas queria com a letra E, então ficou " Elouise "! Só depois, quando eu cresci é que descobri que meu nome era de origem francesa.. Hahaha”


TXM: http://textometrie.ens-lyon.fr/


Named entities

A named entity is a lexical form identifying a precise object (first/last name, communication ref., city, institution, etc.)

Examples:

Names: Christophe, Blondel, Giguet, Paris,

Communication ref.: 0678600614, …

Location: Grenoble, Paris, Parigi, …

Institution: ENS Cachan, CNRS, …


Managing named entities

• Homonyms refer to different objects
– In the corpus we have 2 participants named “Guillem”: the same first name refers to different persons.
– In “Gene Kelly”, Kelly is a public person's last name
– In “Galdric, Kelly et Antonhy”, it is a participant's first name

• Different synonyms refer to the same object
– Kellly & Kelly
– Anthony & Antonhy
– Elô & Elouise


Referring to global entities


Overall method and tools

1. Define a process/algorithm for anonymisation
2. Confront hypotheses with a first corpus
– Using existing tools (Excel, TXM/Calico, Notepad++)
– Doing much of the work by hand (with automation in mind)
– Facing/solving/avoiding problems
– Evaluating/suggesting (new) hypotheses
3. Discuss the result with the original researcher
4. Verify on a second (bigger) corpus
5. Co-develop the tool within the research community


Find Nei/nei with a concordancer

All occurrences refer to the Italian common word “nei”


Another example

• {(f330s2914m5)2011-10-23T21:52 CR_Martins Re: Por que me chamo assim?! answers Eloandrade (f330s2914m1)} “Meu nome é Cleissa Regina, Cleissa porque minha mãe viu na tv uma repórter chamada Cleisa e achou parecido com o nome dela, Cléia e Regina porque o nome do meu pai é Reginaldo. Assim como a PBS gosto muito de ter 2 nomes e Cleissa é bem raro, nunca conheci ninguém chamado assim.”


Peimikà Bibiana… a unique case? No! Let’s try Cleissa Regina…


How to detect new forms?

• Lexical rules (look for similar forms):
– Ignoring accents (e.g. José, Jose)
– Ignoring case (e.g. José, jose, JOSÉ, …)
– Levenshtein distance between 2 forms: number of added, missing or inverted characters
– For a form of length < 5: distance <= 1
– For a form of length >= 5: distance <= 2

• Context rules (e.g. “mi chiamo …”, “merci …”)
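The lexical rules can be sketched as follows. This is an illustration under assumptions, not the authors' tool: it uses the plain Levenshtein distance (an inversion then costs 2, which agrees with the Anthony/Antonhy distance of 2 in the lexical variations table), it compares upper-cased forms, and it takes the length of the known form for the threshold. All function names are hypothetical.

```python
import unicodedata


def strip_accents(form):
    """Remove combining accents, so that 'José' and 'Jose' compare equal."""
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def levenshtein(a, b):
    """Plain edit distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def is_variant(known, candidate):
    """Apply the length-dependent thresholds to the case-insensitive distance."""
    distance = levenshtein(known.upper(), candidate.upper())
    limit = 1 if len(known) < 5 else 2   # assumption: length of the known form
    return known != candidate and distance <= limit
```

For instance, `is_variant("Eli", "Seli")` accepts the candidate, while `is_variant("Eli", "Elouise")` rejects it; accepted candidates still need the researcher's validation, since common words can be close to a first name.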


Lexical variations 1/2

Levenshtein distances (UPPER = after upper-casing, Exact = case-sensitive) and number of differences by type; “-” marks an empty cell.

Known      New         UPPER dist.  Exact dist.  Case  Accents  Add/Sup/Inv
Adriana    adriana     0            1            1     -        -
Alèxia     Alexia      1            1            -     1        -
Anthony    Antonhy     2            2            -     -        2
Baptiste   baptiste    0            1            1     -        -
Cleissa    Cleisa      1            1            -     -        1
Eli        Elô         1            1            -     -        1
Eli        Ely         1            1            -     -        -
Eli        ELY         1            2            -     -        1
Eli        Seli        1            2            1     -        1
Elouise    Louise      1            2            1     -        1
Emmanuel   MAnuel      2            4            2     -        2
Federica   Federiac    2            2            -     -        2
Ferran     fran        2            3            1     -        2
Ferran     Fran        2            2            -     -        2


Lexical variations 2/2

Levenshtein distances (UPPER = after upper-casing, Exact = case-sensitive) and number of differences by type; “-” marks an empty cell.

Known      New         UPPER dist.  Exact dist.  Case  Accents  Add/Sup/Inv
Gabriela   GABRIELA    0            7            7     -        -
Guillem    guillem     0            1            1     -        -
Iñigo      iñigo       0            1            1     -        -
Jaqueline  Jacqueline  1            1            -     -        1
Jean       jean        0            1            1     -        -
José       Jose        1            1            -     1        -
Kelly      Kellly      1            1            -     -        1
Léo        Leo         1            1            -     1        -
Léo        léo         0            1            1     -        -
Mariana    MariAna     0            1            1     -        2
Mary       mary        0            1            1     -        1
Mary       May         1            1            -     -        1
Michela    Miche       2            2            -     -        2
Michela    michelina   2            3            1     -        2
Monica     moni        2            3            1     -        2
Olalla     olalla      0            1            1     -        -
Oleguer    oleguer     0            1            1     -        -


Some good context rules (1/3)

<F> stands for the form captured by the rule; “-” marks an empty cell.

Context             Total  Known  New  New forms detected  Accuracy
sou <F>             10     2      -    -                   20%
appelle <F>         9      4      1    Kelly               56%
Cara <F>            7      1      1    May                 29%
Ciao <F>            6      1      -    -                   17%
Merci <F>           9      1      2    Belle, léo          44%
soy <F>             5      2      -    -                   40%
equipe <F>          5      1      -    -                   20%
Hombre <F>          4      1      -    -                   25%
dicho <F>           3      1      -    -                   33%
llamo <F>           3      2      1    Peimikà             100%
appel <F>           3      1      -    -                   33%
raison <F>          3      1      -    -                   33%
choix <F>           3      1      -    -                   33%
chamam <F>          2      1      1    Maria               100%
tampoco <F>         2      1      -    -                   50%
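A left-context rule of this kind can be sketched as a pattern that captures the form following the context. In the sketch below the function names are hypothetical, the third sample message is made up, and the accuracy formula (known plus validated new forms, divided by total matches) is an assumption that agrees with most, though not all, rows of the table.

```python
import re


def apply_left_context(rule, messages):
    """Collect the forms <F> that directly follow a left context such as 'appelle'."""
    pattern = re.compile(r"\b" + re.escape(rule) + r"\s+(\w+)", re.IGNORECASE)
    found = []
    for body in messages:
        found.extend(pattern.findall(body))
    return found


def score(candidates, known, validated_new):
    """Count known and validated new first names, and the resulting hit rate."""
    n_known = sum(1 for c in candidates if c in known)
    n_new = sum(1 for c in candidates if c in validated_new)
    accuracy = (n_known + n_new) / len(candidates) if candidates else 0.0
    return n_known, n_new, accuracy


messages = [
    "Bonjour, je m'appelle Kellly.",
    "je m'appelle Guillem",
    "on appelle cela un cliché",   # made-up message: the rule also catches noise
]
candidates = apply_left_context("appelle", messages)
# validated_new stands for the researcher's decision on each unknown candidate.
n_known, n_new, accuracy = score(candidates, {"Guillem"}, {"Kellly"})
```

The last message shows why each rule needs an accuracy figure: a context that often precedes a first name also precedes ordinary words.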


Some good context rules (2/3)

<F> stands for the form captured by the rule; “-” marks an empty cell.

Context             Total  Known  New  New forms detected  Accuracy
{BOM} <F>,          62     8      1    Fede                15%
je m’appelle <F>    5      5      -    -                   100%
Accord avec <F>     9      4      1    Bet                 56%
Concordo com a <F>  3      2      1    Line                100%
meu nome é <F>      3      2      -    -                   67%
moi c’est <F>       2      2      -    -                   100%
<F>, ho             8      2      -    -                   25%
<F>, j’habite       2      2      -    -                   100%
<F>, je             8      2      -    -                   25%
je m’appel <F>      1      0      1    jean                100%
suis avec <F>       2      1      -    -                   50%
<F> a dit           1      1      -    -                   100%
dit el <F>          1      1      -    -                   100%
diu el <F>          1      1      1    -                   100%
nombre, <F>         2      1      1    Peimikà             100%


Generic context rules

<F> stands for the form captured by the rule, <Known> for any known first name; “-” marks an empty cell.

Context             Total  Known  New  New forms detected  Accuracy
<F>, <Known>        15     2      1    Regina              20%
<Known> i <F>       3      1      -    -                   33%
<F> i <Known>       1      1      -    -                   100%
<Known> et <F>      6      2      2    Antonhy, Leo        67%
<F> et <Known>      3      2      1    Federiac            100%
<Known> e <F>       3      1      -    -                   33%
<F> e <Known>       3      1      -    -                   33%
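Generic rules use a known first name as the context itself. A minimal sketch of the coordination patterns (<Known> et <F> and <F> et <Known>) might look like this; the function name and the way the pattern is built are assumptions, and the comma rule is left out for brevity.

```python
import re


def generic_rule_candidates(messages, known, conjunctions=("et", "e", "i")):
    """Find forms coordinated with a known first name, in either order."""
    # Longest names first so that a short name never shadows a longer one.
    names = "|".join(re.escape(k) for k in sorted(known, key=len, reverse=True))
    conj = "|".join(conjunctions)
    after = re.compile(rf"\b(?:{names})\s+(?:{conj})\s+(\w+)")
    before = re.compile(rf"\b(\w+)\s+(?:{conj})\s+(?:{names})\b")
    found = set()
    for body in messages:
        found.update(after.findall(body))
        found.update(before.findall(body))
    return found - set(known)   # keep only forms that are not already known


known = {"Kelly", "Galdric"}
new_forms = generic_rule_candidates(["Galdric, Kelly et Antonhy"], known)
```

Each validated new form can be added to the known set and the rule run again, which is one way the list of forms could grow iteratively.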