Analysis, Indexing and Visualization of Presentation Videos
Michele Merler ([email protected])
Computer Science Department, Columbia University

Motivation & Domain Description

A quickly increasing quantity of presentation videos is publicly available and retrievable on the web, and such videos are employed in a large variety of systems: E-learning, conference proceedings, student presentations, and corporate talks (e.g., one archive alone holds 659 events, 9K authors, 12K lectures, 14K videos).

GOAL: Help users efficiently and effectively access the (educational) information contained in presentation videos.

Domain challenges: the videos are "WILD"
• Many videos are already archived, so no additional sources of information (e.g., electronic copies of the slides) are available
• Low quality: not recorded by professional cameramen, unconstrained camera movements, slide truncation and distance, compression artifacts; light cannot be used as a clue
• Lack of structure: the videos are not edited, so standard shot-based processing does not apply
Proposed Solution

Index presentation videos based on four major cues:
• Text (+ audio transcripts)
• Graphics
• Speaker faces
• Mosaics

BACK-END

1. User Preferred Face Indexes

Experimental setup: 1575 Amazon Mechanical Turk HITs (15 speakers × 3 orderings × 35 unique workers). Workers justified each choice among the following options:
• It has better illumination
• It has better resolution
• I can see/tell more about the whole appearance of the person
• I can see better the eyes and expression of the person
• I prefer this pose of a person in general
• I picked the best out of a bunch of bad pictures
• None of the above (please explain your reason with a few words in the box below)

Results: most people prefer the head & shoulder FRONTAL view, but 35% of the votes went to the left and right ¾ head & shoulder views. This confirms the results of psychological studies on the inference of 3D head information from ¾ views of faces [Burke VR07].

3. Graphics Index Generation

• Graphic regions are described by an LBP histogram + a color histogram
• Online clustering (visual + temporal) with average linkage
• Normalized cross correlation for template matching

A region x_i is compared to a cluster C_j through the average chi-squared distance to its members:

S(x_i, C_j) = (1/|C_j|) Σ_{k=1..|C_j|} χ²(x_i, c_jk)

This visual score is combined with temporal distance terms (the visual term is weighted 0.5 and the temporal terms 0.4, with scaling factors α and β) into an assignment threshold T(x_i, C_j): x_i joins the closest cluster C_j when S(x_i, C_j) < T(x_i, C_j), and starts a new cluster otherwise.
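The online clustering step above can be sketched as follows. This is a minimal illustration, not the poster's exact formulation: the function and field names (`assign_online`, `members`, `last_t`), the fixed weights, and the single threshold `tau` standing in for the adaptive T(x_i, C_j) are all assumptions.

```python
import numpy as np

def chi2(h1, h2, eps=1e-10):
    """Chi-squared distance between two normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def assign_online(x, t, clusters, tau, w_vis=0.5, w_time=0.5):
    """Assign descriptor x (observed at time t) to the closest cluster
    under a combined visual + temporal score, or start a new cluster.
    Each cluster is a dict with 'members' (histograms) and 'last_t'."""
    best_j, best_score = None, np.inf
    for j, C in enumerate(clusters):
        # average-linkage visual distance to all current cluster members
        s_vis = np.mean([chi2(x, c) for c in C["members"]])
        # temporal distance to the cluster's most recent member
        s_time = abs(t - C["last_t"])
        score = w_vis * s_vis + w_time * s_time
        if score < best_score:
            best_j, best_score = j, score
    if best_j is not None and best_score < tau:
        clusters[best_j]["members"].append(x)
        clusters[best_j]["last_t"] = t
    else:
        clusters.append({"members": [x], "last_t": t})
    return clusters
```

Feeding regions in temporal order keeps the pass online: each descriptor either joins an existing graphic cluster or seeds a new one, so no second pass over the video is needed.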
4. Textual Index Generation

Pipeline: LoG edges → connected components → geometric and edge density constraints → Local Adaptive Otsu (LAO) binarization → Tesseract OCR → vocabulary correction.

Evaluation data: 1 hour and 45 minutes of video, 8 student presentations, 13 slides per presentation on average. Recognized characters were counted for Tesseract alone and for LAO + Tesseract, before and after vocabulary correction; for example, slide lines such as "Research Interview with Client" and "House Resident Association Meeting" are recovered from raw OCR output like "Rcscarch" and "Qiant".

[Bar chart: number of recognized characters, Tesseract vs. LAO + Tesseract, before and after vocabulary correction]

2. Automatic Generation of Speaker Face Indexes

Pipeline: face detection → face seeds → tracking → face tracks → track matching → unique speakers → face index generation.

• Face detection: a Viola-Jones detector combined with a color skin filter produces the face seeds
• Tracking: MILTrack provides the prediction f_t^P and the Viola-Jones detector the observation f_t^O; a simplified Kalman filter fuses them into the current face estimate:

f_t ← α · f_t^O + (1 − α) · f_t^P
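The fusion rule above is simple enough to write down directly. This is a sketch under stated assumptions: the name `track_step`, the (x, y, w, h) box convention, and the coast-on-prediction fallback when the detector misses are illustrative additions around the poster's update f_t ← α f_t^O + (1 − α) f_t^P.

```python
import numpy as np

def track_step(prediction, observation=None, alpha=0.5):
    """One tracking update for a face box (x, y, w, h).

    When the Viola-Jones detector fires, blend its observation with the
    MILTrack prediction:  f_t = alpha * f_t^O + (1 - alpha) * f_t^P.
    When it misses, coast on the prediction alone (assumed fallback).
    """
    prediction = np.asarray(prediction, dtype=float)
    if observation is None:
        return prediction
    observation = np.asarray(observation, dtype=float)
    return alpha * observation + (1 - alpha) * prediction
```

With alpha near 1 the track snaps to fresh detections; with alpha near 0 it trusts the tracker and merely nudges toward the detector, which smooths jitter in low-quality footage.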
Textual index results (after vocabulary correction): 2276 ground-truth words, 1126 recognized words, precision 0.495, recall 0.665.

Track matching: faces from different tracks are compared with an LBP descriptor under squared L2 distance, plus skin-ratio matching, to merge tracks of the same speaker into unique speakers.

Face selection relies on 3 quality measures:
1. Resolution: the size w × h of the face region.
2. Pose: left and right ¾ pose classifiers (edge histogram descriptor + SVM with RBF kernel) trained on the FaceTracer dataset; training set (left ¾, front, right ¾) of ~10K images, test set of ~12K images, average test accuracy 81.5%.
3. Skin ratio: skinRatio = #skinPixels / area, where a pixel is classified as skin iff

R/G > 1.185,  RB/(R+G+B)² > 0.107,  RG/(R+G+B)² > 0.112
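The skin-ratio measure is a direct pixelwise test. A minimal sketch, assuming float RGB input: the three thresholds are the ones stated above; the function name, the vectorized layout, and the `eps` guard against division by zero are illustrative.

```python
import numpy as np

def skin_ratio(rgb, eps=1e-10):
    """Fraction of skin pixels in an RGB face crop (H x W x 3, float).

    A pixel is skin iff  R/G > 1.185,
                         R*B/(R+G+B)^2 > 0.107,
                         R*G/(R+G+B)^2 > 0.112.
    Returns skinRatio = #skinPixels / area.
    """
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    s2 = (R + G + B) ** 2 + eps          # squared intensity sum per pixel
    skin = ((R / (G + eps) > 1.185)
            & (R * B / s2 > 0.107)
            & (R * G / s2 > 0.112))
    return float(skin.mean())
```

Because the rule uses only ratios of channels, it is insensitive to uniform brightness scaling, which matters for the poorly lit footage this domain deals with.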
Face index generation: select the "best faces" to present to the end user by maximizing a weighted combination of the three quality measures,

Q = w1 · pose + w2 · resolution + w3 · skinRatio

Results: 51 out of 58 selected faces show the preferred head & shoulder, ¾ profile view. On 3 student presentation videos of 45 minutes each, track matching dominates the runtime (average track matching time: 335 secs), followed by face selection, left/right ¾ pose extraction, skin and resolution extraction, and K-Means computation (K-Means with 100 centers, min-min selection).

5. Semantic Shot Representation

• Segment the video into semantically distinct shots based on the slides; changes in the recognized text are used to assess slide changes
• Enhanced feature-based mosaics: pan-tilt-zoom (PTZ) camera motion is estimated with SIFT + RANSAC on keyframes
• The recognized text is overlaid on the mosaic and the graphics on the slides are enhanced

[Example mosaic: a student slide on converting solid waste into usable energy, with the recognized text overlaid as clickable tags]

FRONT-END

GOAL: Ensure end users' satisfaction with how the information extracted from the videos is presented.

6. Final Browser Interface

The indexes are integrated into the VASTMM browser [1]:
• People index and graphics index panels; clicking on a graphic icon finds that graphic in the video
• Queries can combine cues, e.g. "Phone + P05 + G08" joins a text tag with entries from the people and graphics indexes
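The face-selection score Q = w1 · pose + w2 · resolution + w3 · skinRatio from the back-end can be sketched as follows. Only the form of Q comes from the poster; the weights w = (0.4, 0.3, 0.3), the resolution normalization, and the helper names are assumptions for illustration.

```python
def face_quality(pose_score, resolution, skin_ratio,
                 w=(0.4, 0.3, 0.3), max_res=1.0):
    """Weighted face quality Q = w1*pose + w2*resolution + w3*skinRatio.

    pose_score: classifier confidence for the user-preferred view,
    resolution: face area w*h, normalized by max_res (assumed to be the
    largest face area in the track) so all terms live in [0, 1],
    skin_ratio: fraction of skin pixels in the face region.
    Weights are illustrative, not the poster's values.
    """
    w1, w2, w3 = w
    return w1 * pose_score + w2 * (resolution / max_res) + w3 * skin_ratio

def best_face(candidates, **kw):
    """Index of the highest-quality face among (pose, resolution, skin) tuples."""
    return max(range(len(candidates)),
               key=lambda i: face_quality(*candidates[i], **kw))
```

Ranking every face in a track by Q and keeping the top one yields the single icon shown per speaker in the browser's people index.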