Recognition of Hands-Free Speech and Hand Pointing Action for Conversational TV

Yasuo Ariki, Tetsuya Takiguchi and Atsushi Sako
Department of Computer and System Engineering, Kobe University

PURPOSE

A conversational TV that users can ask for information about TV contents.
Multi-modal interaction through hands-free speech recognition and hand pointing recognition.
Multimedia analysis of broadcast contents.
Context awareness to understand the user's intention.

CONVERSATIONAL TV

[Figure: system overview. Labels: "What is the meaning of the word BROADBAND?"; "What is that?"; multimedia database; Internet; presentation; transform; integration; analysis; retrieval; modality recognition; hands-free speech; speaker direction; hand pointing action; front-end processing; metadata extraction; news, drama, soccer, baseball; back-end processing; "Show me the goal"; "Show me the latest event"; "Who is he?"]

HANDS-FREE SPEECH RECOGNITION

[Figure: processing flow. Labels: observed signal; beamforming; hands-free speech input; speaker direction estimation; user utterance section detection; acoustic model adaptation; speech recognition]

CONTEXT AWARENESS

[Figure: context awareness. Labels: user recognition; who, how many; watching history; personal profile, user context; speaking style; enquiry, conversation, monologue; hands-free utterance section; speech/noise GMM; emotion recognition; satisfaction; video editing; preferred content retrieval; summary and explanation; context awareness; pronoun, abbreviation; user analysis; contents analysis, editing, retrieval; recognition of intention; action recognition; finger mouse; recognition of requirement; recognition of profile; recognition of mind; speech recognition]

HAND POINTING RECOGNITION

Skin color region extraction and noise reduction from camera images
Two-dimensional coordinate estimation of the fingertip and head in the images
Three-dimensional coordinate estimation of the fingertip and head in the real world by camera calibration
Estimation of the pointed coordinate on the screen

DEMONSTRATION 1

Function
You can tell your television what you want to watch.

Example
The same utterance, such as "Show me the grand slam", leads to different videos.
When Mr. Matsui is on, the grand slam of Mr. Matsui is shown.
When Mr. Ichiro is on, the grand slam of Mr. Ichiro is shown.

Method
Face extraction and recognition
User speech recognition
Estimation of the user's requirement (what, and of whom)
Presentation of explanation video

DEMONSTRATION 2

Function
You can watch sports in your preferred style.

Example
"Show me more individual plays."
"Show me the previous goal."
The television can understand even a request such as "Show me the previous goal".
[Figure labels: scene with events]

Method
Video generation by digital camera work
Event recognition
User speech recognition
Estimation of what the user wants
Presentation of the corresponding video

DEMONSTRATION 3

Function
You can ask your television about what you do not know.

Example
"What is it?"
"It" indicates the last important word.
The television can understand even a question such as "What is it?".
[Figure labels: speech recognition result (Japanese text not recoverable from the source); scenes with the important words]

Method
Announcer speech recognition
Important word extraction (TF/IDF)
User speech recognition
Estimation of the important word
Presentation of explanation video
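
The important word extraction step of Demonstration 3 names TF/IDF but gives no formula or implementation. The following is a minimal sketch only, assuming the common weighting tf x log(N / df), whitespace-tokenized English text, and a fixed score threshold. The poster's system works on recognized Japanese announcer speech, and its tokenizer, weighting and threshold are not stated. The function names `tf_idf_scores` and `last_important_word`, the threshold value and the sample segments are all hypothetical.

```python
import math
from collections import Counter


def tokenize(segment):
    # Assumption: lowercase whitespace tokenization stands in for the
    # (unspecified) processing of the speech recognition output.
    return segment.lower().split()


def tf_idf_scores(segments):
    """Return one {word: score} dict per segment, with score = tf * log(N / df)."""
    tokenized = [tokenize(s) for s in segments]
    n_segments = len(tokenized)
    # Document frequency: number of segments in which each word occurs.
    df = Counter()
    for words in tokenized:
        df.update(set(words))
    scores = []
    for words in tokenized:
        tf = Counter(words)
        scores.append({
            w: (count / len(words)) * math.log(n_segments / df[w])
            for w, count in tf.items()
        })
    return scores


def last_important_word(segments, threshold=0.1):
    """Hypothetical helper: the most recently spoken word whose TF-IDF
    score reaches the threshold, or None if no word qualifies."""
    tokenized = [tokenize(s) for s in segments]
    scores = tf_idf_scores(segments)
    # Walk backwards through time: latest segment first, latest word first.
    for words, seg_scores in zip(reversed(tokenized), reversed(scores)):
        for w in reversed(words):
            if seg_scores[w] >= threshold:
                return w
    return None


# Made-up stand-ins for recognized announcer speech.
segments = [
    "the match ended in a draw",
    "the weather is sunny",
    "the city is getting broadband",
]
print(last_important_word(segments))  # -> broadband
```

In this sketch, a word such as "the" that occurs in every segment scores zero, so a user question like "What is it?" would resolve to the latest word that is distinctive for its segment. A real system would also need stop-word handling and a tuned threshold, neither of which the poster describes.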