J. EDUCATIONAL TECHNOLOGY SYSTEMS, Vol. 35(1) 61-87, 2006-2007
SIGN LANGUAGE SUBTITLING BY HIGHLY
COMPREHENSIBLE “SEMANTROIDS”*
NICOLETTA ADAMO-VILLANI
Purdue University
GERARDO BENI
University of California Riverside
ABSTRACT
We introduce a new method of sign language subtitling aimed at young deaf
children who have not acquired reading skills yet, and can communicate only
via signs. The method is based on: 1) the recently developed concept of
“semantroid™” (an animated 3D avatar limited to head and hands); 2) the
design, development, and psychophysical evaluation of a highly comprehensible model of the semantroid; and 3) the implementation of a new
multi-window, scrolling captioning technique. Based on “semantic intensity”
estimates, we have enhanced the comprehensibility of the semantroid by:
i) the use of non-photorealistic rendering (NPR); and ii) the creation of a
3D face model with distinctive features. We have then validated the comprehensibility of the semantroid through a series of tests on human subjects
which assessed accuracy and speed of recognition of facial stimuli and hand
gestures as a function of mode of representation and facial geometry. Test
results show that, in the context of sign language subtitling (i.e., in limited
space), the most comprehensible semantroid model is a toon-rendered model
with distinctive facial features. Because of its enhanced comprehensibility,
this type of semantroid can be scaled to fit in a very small area, and thus it
is possible to display multiple captioning windows simultaneously. The concurrent display of several progressive animated signed sentences allows for review of information, a feature not present in any sign language subtitling method presented so far. As an example of application, we have applied the multi-window, scrolling captioning technique to a children’s video of a chemistry experiment.

*This research is partially supported by the School of Technology at Purdue University (I3 grant – Proposal #00006585 – http://www.tech.purdue.edu/cgt/I3/), by the Envision Center for Data Perceptualization, and by the PHS-NIH grant “Modeling the non-manuals of American Sign Language” (Award No. 5R01 DC005241-02).

© 2006, Baywood Publishing Co., Inc.
INTRODUCTION
According to the 2004 Annual Survey of Deaf and Hard of Hearing Children and
Youth (Gallaudet Research Institute, 2005), there are about 45,000 deaf school
age (K-12) children in the United States. Deaf children who cannot yet read have no access to visual information (TV, DVDs, interactive media, etc.) that relies on linguistic explanation. Linguistic explanation is given as speech for
hearing children, or as subtitles for non-hearing reading children (Captioning
Web, 2005). Considering that reading comprehension is significantly delayed in
deaf youngsters (the median reading comprehension of 17- and 18-year-old deaf students is at a fourth-grade level) (Holt, Traxler, & Allen, 1997), many young deaf children
are deprived of the opportunities for independent learning provided by visual
media. One solution to the problem is to use sign language as subtitles; but
traditional subtitling methods present the following difficulties.
First, we consider methods which are easily scalable so that the subtitles can fit
in a small portion of the screen. The requirement of fitting in a small area limits
the subtitles to the use of static symbols. The most advanced of such systems is
SignWriting, developed by Valerie Sutton (1974) and used worldwide for writing,
reading, and researching signed languages. SignWriting has many uses but as a
subtitling method has the disadvantage of being: 1) static, like written English; and 2) highly abstracted, so that it requires a significant amount of learning. Because of these two factors, it is not easy to see its advantage over written English for
subtitling. If effort has to be invested in teaching a system of abstract symbols,
it might be more efficient to teach the Deaf how to read English and use English
for subtitles.
A more intuitive alternative to SignWriting could be the use of static images of
signers as represented, e.g., in ASL dictionaries (Flodin, 1994); however, this
method would require too many static images to follow, in real time, the messages
communicated by voice. Even with English subtitles, the speed is often not enough to keep up with the spoken word; and since using images for words
requires a much larger screen space, it would be impossible to follow the spoken
message by presenting such images as in comic strips at the bottom of the screen.
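A back-of-envelope calculation makes the pacing problem concrete. The speaking rate and pixel sizes below are illustrative assumptions, not figures from the paper:

```python
# Illustrative pacing estimate: conversational speech runs at very roughly
# 120-180 words per minute, so a picture-per-word subtitle strip must cycle
# through several images every second, each visible only briefly.
WORDS_PER_MINUTE = 150   # assumed typical speaking rate
IMAGE_WIDTH_PX = 120     # assumed width of one legible signer photo
SCREEN_WIDTH_PX = 720    # assumed width of the caption strip

images_per_second = WORDS_PER_MINUTE / 60       # new images needed per second
images_on_screen = SCREEN_WIDTH_PX // IMAGE_WIDTH_PX  # images that fit at once

# time each image stays on screen before scrolling off
seconds_visible = images_on_screen / images_per_second
print(f"{images_per_second:.1f} new images/s, each visible only ~{seconds_visible:.1f} s")
```

Under these assumptions each static sign image would be replaced after roughly two seconds, far too fast for a young child to parse a comic-strip row of signs.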
Turning to methods that use dynamic (i.e., moving) signing, we consider both
human signers and avatars. For a human signer, the aesthetic/emotional appeal
is not easily controlled or changed by a simple menu in software. Human signers’
appeal to different ages, genders, and ethnic groups varies and cannot be manipulated. An avatar’s appearance, instead, can be easily modified in the user interface.
Moreover, human signers, unlike avatars, cannot easily be made artificially
emphatic without appearing ridiculous. Features that can be emphasized for
enhanced communication include: nails, color and size of eyes, eyebrows, lips,
etc. These types of emphasis can be realized easily in an avatar but not in a human
signer. A second difficulty with a human signer is that the background interferes unless the signer is clothed in black against a black background, as in a
pantomime. But the dark background confuses the shadows at the edges of the
hands and makes the gesture less clear than if the background were light. It is
also difficult in practice to realize a completely neutral, darkly clothed signer on a dark background: some details always remain visible and tend to stand out and be distracting.
The third, and most challenging, problem with both human signers and full
body avatars is size. The full body avatar, like the human signer, must be cut at the
waist in order to fit in the restricted space at the bottom of the display. In doing so, two problems arise: first, the trunk is still visible and remains a distracting factor; second, the hands are positioned at a natural distance from the head and thus, in order to be included, they require a significantly greater vertical extent (see Figure 1).
The objective of this research is the development of a new, improved method
of sign language subtitling which solves the majority of the above mentioned
problems. The method is based on: 1) the concept of “semantroid” (Adamo-
Villani & Beni, 2005); 2) the design, creation, and evaluation of a “highly
comprehensible” semantroid model; and 3) the development of a new scrolling technique that allows for simultaneous display of four animated signed sentences at the bottom of the screen.

Figure 1. Size comparison between a full-body 3D avatar (DePaul University, 2005), left frame; and a semantroid, right frame.
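The multi-window behavior described above can be sketched as a small fixed-capacity queue of animated sentences. The class name and API below are hypothetical; the paper does not specify an implementation:

```python
from collections import deque

class ScrollingCaptionBar:
    """Sketch of a multi-window signed-caption bar: up to `windows` animated
    sentences are displayed side by side, and when a new sentence arrives the
    oldest scrolls off, so recent sentences remain available for review.
    (Hypothetical API, for illustration only.)"""

    def __init__(self, windows=4):
        self.visible = deque(maxlen=windows)  # oldest window on the left

    def push_sentence(self, animation_id):
        # a deque with maxlen drops the oldest entry automatically
        self.visible.append(animation_id)

    def layout(self):
        # left-to-right order in which the caption windows are drawn
        return list(self.visible)

bar = ScrollingCaptionBar()
for sid in ["s1", "s2", "s3", "s4", "s5"]:
    bar.push_sentence(sid)
print(bar.layout())  # the four most recent sentences remain reviewable
```

The key design point is that, unlike conventional one-line captions, earlier signed sentences are not destroyed when a new one starts; they merely shift position.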
In the remainder of this article we discuss the development of a highly comprehensible rendering of the semantroid and its evaluation through a user study.
We describe the design and creation of a “semphace,” and we evaluate the
comprehensibility of semphace facial expressions through psychophysical
studies. The new multi-window, scrolling subtitling technique is discussed in
section 6; conclusive remarks and future work are presented in section 7.
RENDERING OF A
“HIGHLY COMPREHENSIBLE” SEMANTROID
A semantroid (Adamo-Villani & Beni, 2004) (from “semantic” and “android”) is
a reduced avatar (limited to head and hands) which maximizes the semantic
content conveyed while minimizing the perceptual effort required to perceive
it. The concept has been quantified by the notion of “semantic intensity”
(Adamo-Villani & Beni, 2005). There are several advantages to using a semantroid versus using a human signer or a full-body avatar: 1) A semantroid, like a full-body avatar or a human signer, represents the signs naturally and thus requires no learning of abstractions (in contrast with SignWriting) (Sutton, 1974).
2) A semantroid fits in a much smaller space for the same meaning expressed by
either a human signer or an avatar. The semantroid, in fact, can position the hands
as close as possible to the head without significant loss of the meaning of the
gesture. This would be tiring for a human signer and is not realized in a full-body avatar, since its purpose is to look as much like a human signer as possible. A
comparison is shown in Figure 1 where the semantroid image requires a 16.7%
shorter vertical dimension. It is also clear from the figure that, if necessary, the
semantroid vertical length can be reduced further by shifting the hands toward
the head without major loss of meaning. 3) A semantroid can be “optimized”
to improve semantic intensity and hence comprehensibility. “Optimization” of
semantroid comprehensibility is one of the main objectives of this research and
will be discussed next.
The rationale for the use of the semantroid instead of a full avatar has been given
by Adamo-Villani and Beni (2005). The justification is based on comparing the
semantic intensity of the semantroid with the semantic intensity of the avatar.
Broadly speaking, semantic intensity is a measure of the ratio of the quantity of
“meaning” conveyed to the quantity of “effort” required to perceive such a
meaning. “Meaning” is closely related to information but is constrained by the
requirement that the information must be represented directly (visually) without
inference/abstraction analysis on the part of the perceiver. “Effort” is related to the
perceptual effort of the observer in perceiving the meaning, and it is affected by
two main factors:
1. The spatial distribution of the image. Intuitively it takes more effort to
perceive widely scattered elements than more compactly located elements.
2. The distribution of the meaning-conveying property. A measure (Adamo-
Villani & Beni, 2005) of this effort is the ratio of the variance of the possible
meanings to the variance of the nuances of meaning—i.e., of the different
values of the meaning-conveying property for a given meaning.
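As a rough numerical illustration of factor (2), the sketch below compares how well hue values separate two “meanings” when each meaning has many nuances versus few. The function and data are illustrative assumptions, not the paper’s formal measure:

```python
import statistics

# Illustrative sketch (not the paper's formal definition): treat each meaning
# as a cluster of hue values (its "nuances"). Perceptual effort is lower when
# meanings lie far apart relative to the spread of their nuances, so we take
# the ratio of the variance across meanings to the variance within a meaning.
def separability(meaning_hues):
    centers = [statistics.mean(h) for h in meaning_hues]
    between = statistics.pvariance(centers)  # variance of the possible meanings
    within = statistics.mean(statistics.pvariance(h) for h in meaning_hues)
    return between / within

# 3D shading: many hue nuances per region, so the two clusters overlap more
shaded = [[0.10, 0.20, 0.30], [0.25, 0.35, 0.45]]
# toon shading: few flat hues per region, with cluster centers pushed apart
toon = [[0.10, 0.12], [0.60, 0.62]]

print(separability(shaded) < separability(toon))  # toon separates meanings better
```

In this toy example the toon clusters score far higher, matching the intuition that fewer, more widely spaced hues demand less perceptual effort per meaning.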
In the 3D to 2D (toon) shading transformation, factor (1) plays no role since the
overall spatial distribution does not change; factor (2) is the critical one.
Similarly to what we did when comparing the semantroid with a full avatar, we can determine which semantroid representation (or rendering) is the most comprehensible one by comparing semantic intensities. We consider three types of
rendering: i) 3D (or photorealistic rendering); ii) 2D (or toon-rendering); and
iii) hybrid (a combination of 3D and 2D renderings). A priori it is not clear
which rendering would be most effective. But we can use the following con-
siderations to form a plausible hypothesis.
Consider the rendered model as consisting of two parts: a) face and hands; and b) hair, neck, and ears. In case (i) the rendering is 3D for both (a) and (b); in case (ii)
for neither (a) nor (b); and in case (iii) the rendering is 3D only for (a). Since all the
meaning is conveyed by part (a), it is clear that the semantic intensity of part (b) increases in cases (ii) and (iii)—i.e., when the rendering is 2D. This happens
because the number of variations of hues of the “same color” is reduced and the
“distance” in hue is increased. Both factors increase the semantic intensity by
reducing the perceptual effort.
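The hue-reduction effect just described corresponds to the quantization step of cel (toon) shading. The minimal sketch below (illustrative Python, not the authors’ renderer) shows distinct smooth shades collapsing onto fewer flat bands:

```python
# Minimal cel-shading sketch: the continuous Lambertian term is snapped to a
# few flat bands, so the number of distinct hues of the "same color" drops
# and the remaining hues sit farther apart, as described in the text.
def lambert(normal, light):
    # clamped dot product of unit normal and unit light direction
    return max(0.0, sum(n * l for n, l in zip(normal, light)))

def toon_shade(intensity, bands=3):
    # snap the continuous intensity to the center of one of `bands` flat steps
    step = min(int(intensity * bands), bands - 1)
    return (step + 0.5) / bands

normal = (0.0, 0.0, 1.0)
lights = [(0.0, 0.0, 1.0), (0.0, 0.6, 0.8), (0.0, 0.8, 0.6)]

smooth = [lambert(normal, l) for l in lights]  # three distinct shade values
flat = [toon_shade(i) for i in smooth]         # collapsed onto fewer bands
print(len(set(smooth)), len(set(flat)))        # → 3 2
```

With three bands, three different smooth intensities reduce to two flat values, and the jump between those values is larger than any of the original shade differences.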
For part (a) it is difficult to determine whether or not a decrease in information
offsets the reduction in perceptual effort. If it does, the semantic intensity of (a) is
reduced; hence, the hybrid case (iii) will have the largest overall semantic intensity
[hypothesis H1]. If it does not, then the 2D case (ii) will have the largest overall
semantic intensity [hypothesis H2]. A quantitative calculation of the semantic
intensity is not trivial and it is the subject of future research. In this article, we
test empirically the two hypotheses [H1], [H2].
2D and Hybrid Renderings
Non-photorealistic rendering (NPR) is any rendering technique that produces
images of simulated 3D worlds in a style other than realism. Often these styles
are reminiscent of paintings (painterly rendering), or of other techniques of
artistic illustration (sketch, pen and ink, etching, lithograph, etc.). In many
applications, such as visualization and design of effective diagrams, a non-
photorealistic rendering has advantages over a photorealistic image (Agrawala