Lecture 3 dr rachel edita roxas

Post on 01-Nov-2014

The Role of Technology in Language Learning

Rachel Edita Roxas, PhD

Aim of This Presentation

• To present the state of the art in computational linguistics, language documentation, and empathic computing as applied to the development of language learning applications in the Philippines

• To draw implications from these developments for future research, policy-making, and teaching and pedagogy particularly in the context of the current debate on the implementation of mother tongue-based multilingual education in the country

• Why is human language technology (or natural language processing) challenging?

• Why is human language computationally challenging?

• How did we deal with these questions in our HLT research in the Philippines?

Fruit flies like a banana

• The students greeted the teachers when they arrived.

User Profile

Ambiguity

• Multi-modality: text, audio, video

• Multi-disciplinary

• How do we build the data?

• Where do we get the data?

Known and Specified

Known but Unspecified

Unknown and Unspecified

• Building Philippine language resources: grammar, lexicon, morphological information, corpora

• Manual Construction: Rule-based

• Automatic Methods for language resource extraction/generation: Example-based

• Some applications: Domain-specific

Language Resources

• Text: various linguistic levels such as lexical, syntactic and semantic
– The Philippine Corpus
– English-Filipino Lexicon
– Filipino WordNet
– Philippine Component of the International Corpus of English

• Speech Processing

• Video: Sign Language Processing

The Philippine Corpus

• Initial work on the manual collection of documents in Philippine languages was done by Dita, Roxas, and Inventado (2009) for four major Philippine languages (Tagalog, Cebuano, Ilocano, and Hiligaynon) with 250,000 words each, and for Filipino Sign Language with 7,000 signs.

• Current work covers another four Philippine languages.

• The future goal is to be able to collect the corpora for other Philippine languages.

The Philippine Corpus

• The Philippine language corpora are also accessible online. This is of great importance to those interested in Philippine languages, both locally and internationally: the corpora are available online and can be extended by native speakers of these languages from all over the world.

The Philippine Corpus

• An important aspect of building the Philippine corpus, as with other corpus-building endeavors, is storing the data and keeping track of both the data and the processes performed on them. Existing technologies now make digitization possible, which makes storing and tracking data much easier.

• Moreover, Internet connectivity makes data even more accessible to anyone on the web. Palito, an online corpus management system, was therefore designed to use these technologies for corpus building.

Screenshot of Palito’s Front Page:

ccs.dlsu.edu.ph:8086/Palito

Sample of the Document Browsing Feature in Palito

Sample of the Word Frequency Counting Feature of Palito

Sample Concordancer Result in Palito

Sample Video with Gloss and Transcription Viewing in Palito
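
The word frequency counting and concordancer features shown in the Palito screenshots above can be illustrated with a minimal sketch. This is not Palito's implementation; the function names, the tokenization rule, and the sample Tagalog text are assumptions made only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Hyphenated and apostrophized forms (e.g. "mag-aral") stay in one token.
    """
    return re.findall(r"\w+(?:[-']\w+)*", text.lower())


def word_frequencies(text):
    """Count how often each token occurs in the text."""
    return Counter(tokenize(text))


def concordance(text, keyword, width=2):
    """Return keyword-in-context hits as (left context, keyword, right context)."""
    tokens = tokenize(text)
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Made-up sample text, not taken from the Philippine Corpus.
sample = "Ang bata ay nagbasa ng aklat. Ang aklat ay bago."

for word, count in word_frequencies(sample).most_common(3):
    print(word, count)

for left, word, right in concordance(sample, "aklat"):
    print(f"{left:>15} [{word}] {right}")
```

A real corpus manager would also need per-language tokenization rules and indexing for large collections, which this sketch leaves out.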

The Philippine Corpus

• An unexplored but equally challenging area is the collection of historical documents that will allow research on the development of the Philippine languages through the centuries (Roxas, 2007).

• An interesting piece of historical material is the Doctrina Christiana, the first published work in the country (1593), which shows the translation of religious material in the local Philippine script, Baybayin (popularly called Alibata), and in Spanish.

• Current digitization efforts include scanned pages of the document.

A Page from Doctrina Christiana

An English-Filipino Lexicon

• An English-Filipino lexicon currently exists. It was initially based on the English-Filipino dictionary of the Komisyon sa Wikang Filipino and was augmented with new words by Lim et al. (2007).

• It contains 23,520 English and 20,540 Filipino word senses with information on the part of speech and co-occurring words.

Filipino WordNet

• WordNet is a large collection of words in a language which are grouped into sets of synonyms called synsets, with each synset expressing a distinct concept.

• Initial work has been done by Borra, Pease, Roxas, and Dita (2010) on a Filipino WordNet linked to the English WordNet of Fellbaum (1998). A challenge in building a new WordNet arises when a concept has no counterpart among the existing synsets. In Filipino, for example, the word hilamos, which means to wash one’s face, has no equivalent English synset; it is represented as a hyponym of hugas, which means to wash (Borra et al., 2010).
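
The synset structure described above can be sketched as a small data structure. This is only an illustration, not the Filipino WordNet's actual schema: the class, field names, and English labels are hypothetical, and the sketch assumes that hugas is the more general term to which hilamos is attached.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Synset:
    synset_id: str
    lemmas: List[str]
    gloss: str
    # Ids of more general synsets (assumed direction: specific -> general).
    hypernym_ids: List[str] = field(default_factory=list)
    # Hypothetical label of the linked English synset, if one exists.
    english_synset: Optional[str] = None


def closest_english_link(synset_id: str, table: Dict[str, Synset]) -> Optional[str]:
    """Walk up the hypernym links until a synset with an English link is found."""
    seen = set()
    frontier = [synset_id]
    while frontier:
        current = frontier.pop(0)
        if current in seen:
            continue
        seen.add(current)
        synset = table[current]
        if synset.english_synset is not None:
            return synset.english_synset
        frontier.extend(synset.hypernym_ids)
    return None


table = {
    "hugas": Synset("hugas", ["hugas"], "to wash", english_synset="wash"),
    "hilamos": Synset("hilamos", ["hilamos"], "to wash one's face",
                      hypernym_ids=["hugas"]),
}

# hilamos has no English synset of its own, so the lookup falls back to hugas.
print(closest_english_link("hilamos", table))
```

Storing the gap explicitly (no English link) while keeping the relation to a linked synset lets an application still find the nearest English concept.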

The Philippine Component of the International Corpus of English

(ICE-PHI)

• Compiled by Bautista, Lising, and Dayag and released in 2004

• About one million words distributed almost evenly across 500 texts with specified categories

• Approximately 2000 words per text with some being composite to reach the 2000-word minimum

• Samples from the English spoken or written by adults aged 18 and above and who received formal education through the medium of English up to the postsecondary level

Current State of ICE-PHI

1. Since around June 2008, the ICE-PHI team has been processing the lexical corpus. The corpus was automatically tagged using the program MakeTag 1.0, which is approximately 90% accurate, so the tags still need manual verification to reach 99-100% accuracy.

2. Before the analysis of verbs in ten percent of ICE-PHI, undertaken to prepare a grammar of the verb in Philippine English (Borlongan, 2010), words bearing the tags V ‘verb’ and AUX ‘auxiliary’ were carefully verified. The tags were verified using the ICE Corpus Utility Program (ICECUP) 3.1.

Studies Using ICE-PHI

• A number of studies have stemmed from the analysis of ICE-PHI – individual analyses of the component and analyses comparing ICE-PHI with other ICE components (British, Hong Kong, Indian, New Zealand, and Singapore).

• Two recently published studies making use of ICE-PHI are those of Bautista (2008), on the validation of Philippine English grammatical features, and Borlongan (2008), on tag questions in Philippine English. Both studies juxtapose Philippine English with other Englishes.

Speech Processing

• Speech processing studies cover automatic speech recognition (or speech-to-text) and text-to-speech. Studies that use the Filipino Speech Corpus (FSC) include work on speech-to-text (Cayaban, Climaco, Espina, & Guevara, 2001; Corpus, Liampo, Co, & Guevara, 2001; dela Vega, Co, & Guevara, 2002; Sagum, Ensomo, Tan, & Guevara, 2003; Tantan, Tan, & Guevara, 2003) and on text-to-speech (Cayaban, et al., 2001; Co & Guevara, 2003; Corpus, et al., 2001; Espina, Tan, & Guevara, 2002; Tupas, Co, & Guevara, 2002).

Speech Processing

• Other Filipino speech processing applications developed without the FSC include PinoyTalk (Casas, Rivera, Tan, & Villamil, 2004) and Tagapagsalita (Aralar, Coloso, Moneda, Ilao, & Cu, 2008), which use Filipino voice recordings as their corpus. Speaker identification and verification applications were also developed using a small corpus of 10 speakers, each with five recordings of their individual passwords (Go, Manza, Realeza, & Ting, 2001; Jacinto, Nario, See, & Umali, 2002).

Speech Processing

• Ebarvia, Bayona, de Leon, Lopez, Guevara, Calingacion, and Naval (2008) developed a system that automatically recognizes emotions such as anger, boredom, happiness and satisfaction using an actual call center database. Chua, De Guia, Li, and Rojas (2009), on the other hand, came up with an application that recognizes emotions such as happiness, sadness, anger, fear, surprise, disgust, and neutral, using a corpus of 10,500 acted-emotion Filipino speech recordings.

Sign Language Processing

• The Filipino Sign Language (FSL) is also included in attempts to come up with a corpus on Philippine languages.

• Work has been done on processing this data, such as the FSL number recognition system of Sandjaja and Marcos (2009), which uses color-coded gloves for feature extraction through digital signal processing.

Color-coded Glove for FSL Number Recognition

Language Applications for Teaching

• Instructional Aids

• Applications on Reading Comprehension

• Applications on Composition Writing

• Automatic detection of code switching: ERDT Project from June 15, 2011 (for one year)

• Initial scope of the study: Textual information

• Next steps: audio
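
One simple way to approach code-switching detection in text is a lexicon lookup: tag each token with a language and report the positions where the language changes. The slides do not describe the ERDT project's method, so this is only an assumed illustration; the word lists, tag names, and functions below are hypothetical, and ambiguous words such as "may" are resolved arbitrarily by checking the Filipino list first.

```python
import re

# Tiny illustrative word lists (assumptions, not project resources).
FILIPINO_WORDS = {"ako", "ang", "ng", "sa", "na", "ba", "kasi", "may",
                  "pupunta", "mamaya"}
ENGLISH_WORDS = {"i", "the", "to", "a", "you", "meeting", "later",
                 "because", "have"}


def tag_languages(sentence):
    """Tag each token as 'fil', 'eng', or 'unk' by lexicon lookup."""
    tokens = re.findall(r"[a-z'-]+", sentence.lower())
    tagged = []
    for token in tokens:
        if token in FILIPINO_WORDS:
            tagged.append((token, "fil"))
        elif token in ENGLISH_WORDS:
            tagged.append((token, "eng"))
        else:
            tagged.append((token, "unk"))
    return tagged


def switch_points(tagged):
    """Return token indices where the language differs from the last known one."""
    switches = []
    previous = None
    for index, (_, lang) in enumerate(tagged):
        if lang == "unk":
            continue  # unknown words do not count as a switch
        if previous is not None and lang != previous:
            switches.append(index)
        previous = lang
    return switches


tagged = tag_languages("Pupunta ako sa meeting later kasi may deadline")
print(tagged)
print(switch_points(tagged))
```

A lexicon lookup cannot handle words shared by both languages or Filipino affixes attached to English roots, which is part of what makes the task challenging.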

SpeL:IT

• SpeL:IT is courseware that helps children with specific language impairment differentiate and recall similar-sounding words.

• It has 16 stories covering 46 similar-sounding CVC (consonant-vowel-consonant) words, presented through visual and auditory illustrations that can be replayed as often as needed. Each story ends with lesson drills, practice activities, and a story assessment.

Sample Screen at the Level of Word Recognition in SpeL:IT

SalinLahi

SalinLahi is an interactive learning environment for Filipino language learning for children between six and eight years old.

The users interact with Popoy, a boy within the same age bracket, as well as his other family members and friends, and join him in his activities as the users go through the lessons.

Interaction of SalinLahi’s User with Popoy

Interaction of SalinLahi’s User with Popoy’s Family

SalinLahi

• There are eleven lessons on basic Filipino. Each lesson uses images, animation, interactive components, and audio, and culminates in interactive exercises with immediate automatic feedback. The application also keeps track of students’ progress.

Popsicle

Popsicle is an intelligent tutoring system whose primary function is tutoring learners of English as a second language. It identifies and corrects language errors that students commit while they are learning English.

Popsicle

The software initially assesses the grammatical competence of the learner based on an input essay document composed by the user, identifies the grammatical errors in the document, provides feedback and suggestions in natural language, and generates lessons on grammar that are tailor-fit to the individual needs of the learner. The evaluation uses the zone of proximal development (ZPD) in determining the level of user that the system has to consider in marking the input essay.

Zone of Proximal Development for an English Composition in Popsicle

MesCH

• In 2008, Fajardo, Di, Novenario, and Yu developed MesCH (Measurement System for Children’s Reading Comprehension), software that accepts children’s stories and automatically generates multiple-choice questions to test a child’s reading comprehension.

• The program rephrases parts of the story into four wh-questions (who, what, when, where), sequence questions (which came first), and vocabulary questions.

MesCH

• To cite as an example, with the sentence Slimy tadpoles came out from the eggs, the system generates the following possible stems:

1. What came out from the eggs?

2. Where did the slimy tadpoles come out?

3. In the sentence, “Slimy tadpoles came out from the eggs,” what does the verb “came out” mean?

4. In the sentence, “Slimy tadpoles came out from the eggs,” what does the adjective “slimy” mean?
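
The four stems above can be reproduced with simple templates once the sentence has been split into its parts. This sketch is not MesCH's algorithm: the function and parameter names are hypothetical, and the sentence parts are annotated by hand here, whereas the actual system works from the story text.

```python
def generate_stems(subject, verb_past, verb_base, prep_phrase, adjective):
    """Build question stems from hand-annotated parts of one sentence.

    All parameters are assumed to be supplied by an earlier analysis step
    that this sketch does not implement.
    """
    sentence = f"{subject} {verb_past} {prep_phrase}"
    quoted = f'In the sentence, "{sentence},"'
    return [
        # "What" question: replace the subject with the question word.
        f"What {verb_past} {prep_phrase}?",
        # "Where" question: needs the base form of the verb after "did".
        f"Where did the {subject.lower()} {verb_base}?",
        # Vocabulary questions about the verb and the adjective.
        f'{quoted} what does the verb "{verb_past}" mean?',
        f'{quoted} what does the adjective "{adjective}" mean?',
    ]


stems = generate_stems(
    subject="Slimy tadpoles",
    verb_past="came out",
    verb_base="come out",
    prep_phrase="from the eggs",
    adjective="slimy",
)
for number, stem in enumerate(stems, start=1):
    print(f"{number}. {stem}")
```

The templates show why the base form of the verb must be known: the "where" question uses "come out" while the original sentence uses "came out".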

MesCH

• The system considers principles in instructional assessment such as the formulation of four wh-questions and the construction of distractors through the use of entries in WordNet that relate with the correct answer.

HelloPol

HelloPol is a system in which the user can converse in English with the system within the political domain (Alimario, Cabrera, Ching, Sia, & Tan, 2003). Its main objective is to answer the user’s questions so naturally that the user would not realize that he or she is conversing with a program. The system’s replies should therefore be as natural as possible and should not be repetitive.

HelloPol

To perform this way, the system has been fed political news articles, and information extraction has been integrated to automatically extract relevant information from the articles into a more structured representation for use in question answering.

The user may ask factoid questions (who, what, when, where) and the program answers these by referring to the database of information.

Picture Books

• Picture Books generates stories for children from an input picture containing the background and a set of characters and object stickers (Solis, Siy, Tabirao, & Ong, 2009).

• The child chooses the stickers, and the system associates these with a theme and a manually created ontology, which are then used to generate a fable-type story.

Screenshot of Picture Books’ Story Window

Automatic Essay Evaluator

• Another application developed for composition writing is the automatic essay evaluator. The evaluator, which was developed by Cruz, Escutin, Estioko, and Plaza in 2003, evaluates large collections of essay-type documents using the latent semantic analysis (LSA) technique (Cruz, et al., 2003).

• Rule-based natural language parsing is used for the grammar checking of the input sentences, while LSA is used to evaluate the content. The system was trained on corpora of pre-graded essays gathered from a particular high school class, each graded by at least two human teachers according to three criteria: (1) mechanics, (2) organization, and (3) content.
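
The idea behind the content evaluation described above, scoring a new essay by its similarity to pre-graded essays, can be sketched in a simplified form. This sketch is not LSA and not the evaluator's implementation: it compares raw bag-of-words vectors with cosine similarity, whereas LSA would first project the term vectors into a lower-dimensional space through singular value decomposition. The function names, sample essays, and scores are made up for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a text as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_content_score(essay, graded_essays):
    """Similarity-weighted average of the content scores of pre-graded essays.

    Returns None when the essay shares no words with any graded essay.
    """
    vector = term_vector(essay)
    weighted = [(cosine(vector, term_vector(text)), score)
                for text, score in graded_essays]
    total = sum(weight for weight, _ in weighted)
    if total == 0:
        return None
    return sum(weight * score for weight, score in weighted) / total


# Made-up pre-graded essays with hypothetical content scores.
graded = [
    ("rice grows in wet fields", 5.0),
    ("volcano eruptions destroy towns", 2.0),
]
print(predict_content_score("rice grows in wet fields near rivers", graded))
```

Raw word overlap misses synonyms and paraphrases; reducing the term space as LSA does is meant to address that limitation.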

Empathic Computing

• Empathic Research at the College of Computer Studies, De La Salle University

• Empathic computing is a marriage of many disciplines, notably affective computing, digital signal processing, social signal processing, sensor-rich ambient intelligence, ubiquitous computing, and machine learning.

• http://cehci.dlsu.edu.ph

Objective:

It aims to build human-centered systems, with emphasis on feedback based on its user’s emotion, and system-initiated response, i.e. for assistance and support.

Unique emotion & behavior

Unique mirroring feedback

• Emotion Modeling
– Software that automatically recognizes human emotion.
– Considers a person’s voice to determine emotion.
– Considers a person’s non-verbal cues to determine emotions.

• Emotion detection from verbal and non-verbal cues

• Audio and video files can be used to detect emotions.

• Uses Digital Signal Processing

• (Cu, et al, 2010)

Filipino Laughter

Pinoy Laughter is interesting.

5 Kinds of Laughter

• Natutuwa

• Kinikilig

• Nasasabik

• Nahihiya

• Mapanakit

Recognition of Emotion

• 73% accurate using audio information.

• It seems that audio information is more reliable than facial features in identifying the emotion carried by laughter.

• Restraint was noted (laughter not freely expressed)

• In the Filipino context, laughter is sometimes used to mask negative emotions.

Study of Rapport in Dialogues

• Rapport in Dialogues

• Point attention to the facial expressions of the members in the dialogue, the movement of the body, its position, the arms, gestures of the hands, the eye gaze, head movement and its tilting

• (Data was collected between a Filipino and a Japanese in Japan. Both used English as communication medium during this interaction.)

Affective Mirroring

• A psychological phenomenon called affective mirroring occurs during an interaction.

• Typically, there is high rapport when the person you talk with imitates your body position, gestures, movements, and expressions during an interaction.

Implications for Future Research

• Applications such as bilingual/multilingual translators, Philippine languages speech-to-text and text-to-speech systems for mobile and low-cost devices, speech training software, and dialogue analysis for data mining are just some prospects for future research for those involved in computation.

• As more applications become available for the languages that currently have corpora or other data, other languages should likewise be on the agenda of researchers working on computational and corpus linguistics. Of utmost importance is the documentation of endangered languages.

Implications for Future Research

• Reference grammars of Philippine languages have been impressively prepared recently, such as those of Daguman (2004) and Dita (2004), following in the footsteps of Schachter and Otanes (1972). Philippine linguists should perhaps also explore a more corpus-based, corpus-driven approach.

Implications for Policy-Making

• Computational linguistics, language documentation, and the development of language applications should be a priority among policy-makers in education, research, and government. More sectors of the government should support the endeavors of those working in this field of research.

Implications for Policy-Making

• Those who make policies in education should also look into how these resources and applications could ultimately be mapped to the achievement of the country’s educational goals.

• When one talks about technology in a developing country, one must also address how these technologies are financed. There is a big gap between what is ideal and what is realistic: no matter how advanced these technologies are, a school whose financial and material resources are insufficient will not be able to reap their benefits.

Implications for Teaching and Pedagogy

• Curriculum and materials developers should look into how these resources and applications could be integrated – and even streamlined – in the curriculum and textbooks of their respective schools and teaching contexts.

• Classroom teachers can be as creative as they wish in using these resources and applications in delivering instruction. The corpus can readily provide authentic examples for language teaching, particularly for less-documented languages and language varieties, as in the case of Philippine English versus “Standard” or American English. The assessment and evaluation tools can also assist teachers when they evaluate their students.

Implications for Teaching and Pedagogy

• It is also important that teachers, especially those already in service, be trained to use these resources and applications in their teaching contexts. Admittedly, the advances afforded by new technology are not easily adopted, especially by those accustomed to more traditional instructional techniques.

TED lecture: the Birth of a Word

• Deb Roy: The birth of a word (TED)
