AUTOMATIC SUBTITLE GENERATION
Submitted in partial fulfillment of the requirements
of the degree of
B. E. Computer Engineering
By
Alina Fargose 30
Alria Fargose 31
Supervisor:
Ms. Tejal Carwalo
Department of Computer Engineering
St. Francis Institute of Technology
(Engineering College)
University of Mumbai
2016-2017
Abstract
The use of videos for the purpose of communication has witnessed phenomenal growth in the past few years. However, non-native language speakers and people with hearing disabilities are unable to take full advantage of this powerful medium of communication. To overcome the problems caused by hearing disabilities or language barriers, subtitles are provided for videos. The subtitles are provided in the form of a subtitle file, most commonly having a .srt extension. Several software packages have been developed for manually creating subtitle files; however, software for automatically generating subtitles is scarce.
The main objective of developing this system is to present an automated way to generate subtitles for audio and video. Replacing the tedious method of the current system will save time, reduce the amount of work the administration has to do, and generate the subtitles automatically. The system will first extract the audio and then recognise the extracted audio with the available speech recognition engine. The recognised speech is then converted to text and saved in a text file having the ".srt" extension. The .srt file is linked to the video and the video is played with subtitles.
Contents

1 INTRODUCTION
1.1 Description
1.2 Problem Formulation
1.3 Motivation
1.4 Proposed Solution
1.5 Scope of the Project
2 REVIEW OF LITERATURE
3 SYSTEM ANALYSIS
3.1 Functional Requirements
3.2 Non-Functional Requirements
3.3 Specific Requirements
3.4 Use Case Diagrams and Description
4 ANALYSIS MODELING
4.1 Data Modeling
4.2 Activity Diagrams
4.3 Functional Modeling
4.4 Timeline Chart
5 DESIGN
5.1 Architectural Design
5.2 User Interface Design
6 IMPLEMENTATION
7 TESTING
8 RESULTS AND DISCUSSIONS
9 CONCLUSIONS
10 LITERATURE CITED
11 ACKNOWLEDGEMENTS
List of Figures

3.4.1 Use Case Diagram
4.1.1 ER Diagram
4.2.1 Activity Diagram
4.3.1 Context Level DFD
4.3.2 Level 0 DFD
4.3.3 Level 1 DFD
4.4.1 Timeline Chart 1
4.4.2 Timeline Chart 2
4.4.3 Timeline Chart 3
5.1.1 System Architecture
5.1.2 Transcoding in ffmpeg
5.2.1 Main Screen
5.2.2 Selection of Video
8.1 Selection of the Video
8.2 Processing (Audio Extraction)
8.3 Processing (Subtitle Generation)
8.4 Playing the Video with Subtitles
Abbreviations

JAVE - Java Audio Video Encoder
CMU - Carnegie Mellon University
Chapter 1
Introduction
Video has been around for a long time. Production styles have evolved over the years, distribution channels have emerged, interactivity has blossomed and technology has changed the face of video forever. Video has become one of the most popular multimedia artifacts used on PCs and the Internet. In a majority of cases, sound plays an important role in the video.
Subtitles are text renderings of the dialogue in a video, displayed in real time at the bottom of the screen during playback. The subtitles may be in the same language as the video or in another language that helps people understand the content of the video.
The most natural way of making videos accessible lies in the use of subtitles. Subtitles provide missing information to individuals who have difficulty processing speech and other auditory components of visual media. Subtitles are essential for children who are deaf or hard of hearing, can be very beneficial to those learning English or Hindi as a second language, and can help those with literacy problems and those who are learning to read.
The main idea of developing the system is to present an automated way to generate subtitles using audio extraction and speech recognition techniques, which would replace the traditional method of writing the subtitle file manually. In the traditional method the subtitles had to be typed for a particular video; this system will save time, reduce the amount of work the administration has to do, and minimize the human errors associated with the process. Manual subtitle creation is a long and boring activity that requires the constant presence of the user. Therefore, automatic subtitle generation is used.
Software like Subtitle Editor and Gaupol are built along the same lines, i.e. these software packages require the user to manually type the subtitles. This report details a system in Java that automates the process of typing subtitles by means of speech recognition.
1.1 Description
Automatic subtitle generator is a system that generates subtitles automatically and plays video files along with the subtitles.
The system consists of three modules: audio extraction, speech recognition and subtitle generation. The generated subtitles are stored in a .srt file, and the .srt file is time-synchronized with the video. After the execution of all the above modules, the system plays the input video along with the subtitles.
Audio Extraction:
The speech recognition engine requires a .wav audio file as input. Hence, it is necessary to convert the video into .wav format. Ffmpeg will be used to accomplish this; ffmpeg is used by giants like VLC Media Player, Google Chrome and Blender. In order to use ffmpeg for converting a video into audio, JAVE (Java Audio Video Encoder) is used.
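A minimal sketch of this conversion step, assuming the JAVE 1.x API from Sauronsoftware (class names as documented for that library); the 16 kHz, mono, 16-bit PCM settings are assumptions chosen to match what CMU Sphinx expects:

    import it.sauronsoftware.jave.AudioAttributes;
    import it.sauronsoftware.jave.Encoder;
    import it.sauronsoftware.jave.EncodingAttributes;
    import java.io.File;

    public class AudioExtractor {
        public static void extract(File video, File wav) throws Exception {
            AudioAttributes audio = new AudioAttributes();
            audio.setCodec("pcm_s16le");   // 16-bit PCM (assumed, to suit Sphinx)
            audio.setChannels(1);          // mono
            audio.setSamplingRate(16000);  // 16 kHz

            EncodingAttributes attrs = new EncodingAttributes();
            attrs.setFormat("wav");
            attrs.setAudioAttributes(audio);

            new Encoder().encode(video, wav, attrs); // JAVE drives the bundled ffmpeg
        }
    }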
Speech Recognition:
The .wav file obtained from the audio extraction phase is passed on for speech recognition. An open source speech recognition engine called CMU Sphinx is used in this stage.
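A minimal transcription sketch, assuming the Sphinx-4 high-level API (edu.cmu.sphinx.api) and the default English model resources shipped with the sphinx4-data package; the resource paths and the input file name are assumptions that may differ between Sphinx versions:

    import edu.cmu.sphinx.api.Configuration;
    import edu.cmu.sphinx.api.SpeechResult;
    import edu.cmu.sphinx.api.StreamSpeechRecognizer;
    import java.io.FileInputStream;

    public class Transcriber {
        public static void main(String[] args) throws Exception {
            Configuration config = new Configuration();
            config.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
            config.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
            config.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

            StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(config);
            recognizer.startRecognition(new FileInputStream("extracted.wav")); // placeholder name
            SpeechResult result;
            while ((result = recognizer.getResult()) != null) {
                System.out.println(result.getHypothesis()); // recognized text for one utterance
            }
            recognizer.stopRecognition();
        }
    }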
Subtitle Synchronization:
This is the final phase in the automatic subtitle generation process, in which the generated subtitles are time-synchronized with the video.
1.2 Problem Formulation
Auditory processing disorder (APD), also known as central auditory processing disorder (CAPD), is a complex problem affecting about 5% of school-aged children. These kids can't process the information they hear in the same way as others because their ears and brain don't fully coordinate. Something adversely affects the way the brain recognizes and interprets sounds, most notably the sounds composing speech [1].
Automatic subtitle generation will help such children understand the content of video files by providing captions (subtitles) of what is being said in the video. School lectures can be video-recorded for these kids and shown to them along with the subtitles so that they can perceive education in a better manner [1].
1.3 Motivation
Nowadays, owing to the increased use of syllabus-related videos for teaching in classes, whether at school or college level, some students are unable to grasp what the speaker in the video is trying to explain. If the videos are shown along with captions, it becomes easy to relate what the video or the speaker in it wants to convey. Moreover, videos are not necessarily provided with subtitles, writing them manually is an impractical task for any individual, and searching for a correctly time-synchronized subtitle file may take time. Instead, by using this software, any individual can easily generate the captions and club them with the video, which can help students and others.
By creating subtitles, the following goals are achieved:
● The major benefit is that the viewer does not need to download the subtitle from the Internet if he wants to watch the video with subtitles.
● Captions help children with word identification, meaning, acquisition and retention.
● Captions can help children establish a systematic link between the written word and the
spoken word.
● Captioning has been related to higher comprehension skills when compared to viewers
watching the same media without captions.
● Captions provide missing information for individuals who have difficulty processing
speech and auditory components of the visual media.[2]
1.4 Proposed Solution
In the traditional method the subtitles had to be created manually; the proposed system presents an automated way to generate the subtitles for audio and video.
This system will first extract the audio and then recognise the extracted audio with the available speech recognition engine.
The audio extraction routine is expected to return a suitable audio format that can be used by the speech recognition module as pertinent material. It must handle a defined list of video and audio formats, and it has to verify the file given as input so that it can evaluate the feasibility of extraction.
In the next phase the recognized speech is converted to text format, and a text file having the ".srt" extension is generated.
The speech recognition routine is the key part of the system. Indeed, it directly affects performance and the evaluation of results. First, it must get the type of the input file; if the type is provided, an appropriate processing method is chosen. Otherwise, the routine uses a default configuration.
Later on, this ".srt" file can be opened in a media player to view the subtitles along with the video.
The subtitle generation routine aims to create and write a file containing multiple chunks of text, corresponding to utterances delimited by gaps, together with their respective start and end times. Time synchronization considerations are of prime importance.
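For illustration, a fragment of the SubRip (.srt) format that this routine produces: each entry is a sequence number, a start --> end timestamp pair in hh:mm:ss,mmm form, the caption text, and a blank separator line (the caption text below is invented for the example):

    1
    00:00:01,200 --> 00:00:03,800
    welcome to the lecture

    2
    00:00:04,100 --> 00:00:06,500
    today we will discuss subtitles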
1.5 Scope of the Project
Earlier, when subtitles had to be added to a particular video, we had to use software that would link a .srt file with the video, and the creation of the .srt file is a long and tedious process. Using this software, the user just has to input the video file, and the user will then be able to view the video with the required subtitles. It will provide subtitles for the input video.
Generated captions will help children with word identification, meaning, acquisition and retention. Captions can help children establish a systematic link between the written word and the spoken word [3].
Watching a TV show or movie with closed captioning turned on can help people understand the dialogue more clearly.
Chapter 2
Review of Literature
The last ten years have witnessed the emergence of every kind of video content. Moreover, the appearance of websites dedicated to this phenomenon has increased the importance the public gives to it.
Video plays a vital role in helping people understand and comprehend information, for example songs, movies, video lectures or any other multimedia data relevant to the user. Hence, it becomes important to make videos accessible to people having auditory problems, and even more so to people who need to bridge the gap of their native language. This can best be done by the use of subtitles for the video. However, downloading subtitles for a video from the internet is a tedious process. Consequently, generating subtitles automatically through the software itself, without the use of the internet, is a valid subject of research. Hence, researchers have resolved the above issue through three distinct modules, namely audio extraction, speech recognition and subtitle generation.
Abhinav Mathur et al. have used the above phases for automatic subtitle generation; they are explained below:
Audio Extraction
For subtitle generation, the input file first goes through audio extraction, where the input file must be in one of the formats supported by FFMPEG, such as .mp3, .mp4, .avi, .au and .flac. The input file goes to the demuxer, where the video is separated from the audio. This audio is then encoded: the stream is divided into frames and converted into binary format. The stream is compressed using the MP3 algorithm, in which noise is removed using the psychoacoustic model; this model provides lossy compression of the signal. The binary data is then decoded, where it is converted to a sinusoidal signal. This signal then goes to the muxer, where the separated audio signals are combined to produce a single (extracted) audio signal, which is converted to .wav format and then further used for speech recognition.
Speech Recognition
After the completion of audio extraction, the speech recognition part is carried out. The extracted .wav file is used to generate a .srt file through speech recognition, in which the audio passes through three modules: the Front End, the Decoder and the Knowledge Base.
Subtitle generation
The .srt file generated by the above module, that is, speech recognition, contains the words (lyrics) spoken in the audio file. The file consists of each word or sentence along with the time interval (start time and end time) in which it occurs in the song. This file is then embedded in the video, where the lyrics are synchronized with the time and displayed with the video [4].
Boris Guenebaut mentioned that it is necessary to find solutions for making these media artifacts accessible to most people. Several software packages propose utilities for creating subtitles for videos, but all of them require extensive participation of the user. The researcher described three phases: audio extraction, speech recognition and subtitle generation. The first phase consists of audio extraction; the audio extraction routine is expected to return a suitable audio format that can be used by the speech recognition module. It must handle a defined list of video and audio formats, it has to verify the file given as input, and the audio track has to be returned in the most reliable format.
The second phase is speech recognition; this routine is the key part of the system. Indeed, it directly affects performance and the evaluation of results. The third phase is subtitle generation; the subtitle generation routine aims to create and write a file with chunks of text and their respective start and end times. Time synchronization considerations are of prime importance [5].
Smita Jawale et al. mentioned that for creating subtitles there are three phases that have to be implemented: audio extraction, speech recognition and subtitle generation. In [6] the researchers have used the same three phases for subtitle generation.
Audio Extraction: In the implementation the researchers used a Java library named JAVE (Java Audio Video Encoder), developed by Sauronsoftware, a UK-based IT firm. The JAVE library is a Java wrapper on the ffmpeg project; ffmpeg is used by giants like VLC Media Player, Google Chrome and Blender. JAVE is used to transcode audio and video files from one format to another. Using the JAVE library, only the portion of the video between the start and end times selected by the user is converted to audio in the .wav format suitable for speech recognition.
Speech Recognition: Speech recognition is done using open source software called CMU Sphinx-4. The extracted audio in .wav format is given as input to the transcriber program. The programming is done in Java, with NetBeans used as the IDE. Sphinx provides the developer with three elements: 1. Acoustic Model, 2. Dictionary, 3. Language Model.
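As an illustration of the Dictionary element, entries in the CMU pronouncing dictionary format map each word to its phoneme sequence; the two entries below are representative samples, not taken from the cited implementation:

    HELLO       HH AH L OW
    SUBTITLE    S AH B T AY T AH L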
Subtitle Generation: The module is expected to get a list of words and their respective speech times from the speech recognition module and then to produce an SRT subtitle file. To do so, the module must look at the list of words and use silence (SIL) utterances as the delimitation between two consecutive sentences. The transcribed list of words is then time-stamped to introduce the synchronization.
The drawback of this implementation is that the system is not able to place punctuation in the subtitles, since that involves much more speech analysis and a deeper design [6].
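A sketch of this SIL-based segmentation idea in Java; the Word record and the "<sil>" token handling are illustrative assumptions, not the cited authors' code:

    import java.util.ArrayList;
    import java.util.List;

    public class SilSegmenter {
        // Illustrative record: one recognized token with its start/end time in milliseconds.
        static class Word {
            final String text; final long startMs; final long endMs;
            Word(String text, long startMs, long endMs) {
                this.text = text; this.startMs = startMs; this.endMs = endMs;
            }
        }

        // Group words into caption lines, starting a new line at each silence token.
        static List<List<Word>> segment(List<Word> words) {
            List<List<Word>> lines = new ArrayList<>();
            List<Word> current = new ArrayList<>();
            for (Word w : words) {
                if (w.text.equals("<sil>")) {   // a silence delimits two sentences
                    if (!current.isEmpty()) {
                        lines.add(current);
                        current = new ArrayList<>();
                    }
                } else {
                    current.add(w);
                }
            }
            if (!current.isEmpty()) lines.add(current);
            return lines;
        }
    }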
To overcome the problems caused by hearing disabilities or language barriers, subtitles are provided for videos. The subtitles are provided in the form of a subtitle file, most commonly having a .srt extension. Several software packages have been developed for manually creating subtitle files; however, software for automatically generating subtitles is scarce.
Thus, in the survey we have found various mechanisms for automatic subtitle generation, all of which include three main steps: audio extraction, speech recognition and subtitle generation. Therefore, in the proposed system, ffmpeg, which is fast and accurate, will be used for audio extraction, and CMU Sphinx, which showed the best accuracy among the surveyed engines, will be used for speech recognition and subtitle generation.
Chapter 3
System Analysis
3.1 Functional Requirements
● All MPEG standard formats are supported for audio and video.
● Captions appearing on the screen will remain long enough to read. It is preferable to limit on-screen captions to no more than two lines.
● Captions are synchronized with the spoken words.
● The extracted text is in .srt format. The text displayed will be in a readable form.
● Audio of any format can be extracted, but speech recognition is only done in English and Hindi.
3.2 Non-Functional Requirements
● System requirements: Compatible with all OS.
● Security: No security constraints.
● Performance: The text will synchronize with the video.
● Maintainability: The software is easy to maintain.
● Reliability: It will provide a good level of precision.
● Scalability: The software is scalable, since multiple users can use it at the same time for their benefit.
3.3 Specific Requirements
3.3.1 Hardware Requirements:
● A computer with Windows 7 or a higher operating system.
● Personal computer with following minimal specification:
RAM: 256/512 MB
Hard Disk: 80 GB
● Speakers.
3.3.2 Software Requirements:
● Any standard media player.
● Java SE 8
● Sphinx
3.4 Use Case Diagrams and Description
Fig 3.4.1 Use Case Diagram
The user has to input the video; the user can either play the video or increase/decrease its volume. Once the input is obtained from the user, the audio is extracted from the input video. Then the process of speech recognition takes place. Finally, the subtitle file is generated for the video, and the video and subtitle files are linked and played simultaneously.
Table 3.4.1 Use case template
Title: Automatic subtitle generation system.
Description: The user inputs the video, and the subtitles for the video are generated automatically. The subtitles are generated in three main steps: audio extraction, speech recognition and creation of the subtitle file.
Primary Actor: User.
Preconditions: The user inputs the video and the format of the video is validated.
Postconditions: The subtitles for the video are generated.
Trigger: The user wishes to watch the video with subtitles.
Exceptions: The system displays an error message saying the input format is not valid. The system displays an error message if a video in a language other than English is given as input.
Chapter 4
Analysis Modeling
4.1 Data Modeling
ER Diagram
Fig 4.1.1 ER Diagram
The ER diagram consists of three entities: Video, Words and Subtitle File.
Each entity contains its own attributes. The entity Video contains two attributes, video_path and video_id. The video_path gives the location where the video is stored, and the video_id is the unique id of the video.
The entity Words contains two attributes, word_id and word. The word_id is the unique id assigned to each word.
The entity Subtitle File has three attributes: subtitle_id, timing_info and sentence_info. The subtitle_id is the id given to each subtitle, the timing_info holds the timestamps given to each subtitle, and the sentence_info is the generated subtitle text.
4.2 Activity Diagrams
Fig 4.2.1 Activity diagram
4.3 Functional Modeling
Data Flow Diagram
Fig 4.3.1 Context level DFD
Fig 4.3.2 Level 0 DFD
Fig 4.3.3 Level 1 DFD
4.4 Timeline Chart
Fig 4.4.1 Timeline Chart 1
Fig 4.4.2 Timeline Chart 2
Fig 4.4.3 Timeline Chart 3
Chapter 5
Design
5.1 Architectural Design
Fig 5.1.1 System Architecture
The process of automatic subtitle generation is shown in Fig. 5.1.1:
1. Start: Take a media file as input from the user and validate the input format.
2. Audio extraction: Extract audio from the video file using the ffmpeg library in Java. The extracted audio is used as input for speech recognition.
3. Speech recognition: The Sphinx engine is used for speech recognition. Sphinx identifies words present in its dictionary.
4. Generation of subtitles: A .srt/.txt file is generated after speech recognition.
5. Synchronization of subtitles with the video: The final step is time synchronization between the subtitle file and the video. The spoken words should finally appear as text.
6. Apart from this, the user also has the option of finding a particular topic or word in the video.
Audio Extraction:
Ffmpeg is used for audio extraction. Ffmpeg is a very fast video and audio converter that can also grab from a live audio/video source. Each input or output file can, in principle, contain any number of streams of different types (video/audio/subtitle/attachment/data). The allowed number and/or types of streams may be limited by the container format.
Ffmpeg calls the libavformat library (containing demuxers) to read input files and get packets of encoded data from them. When there are multiple input files, ffmpeg tries to keep them synchronized by tracking the lowest timestamp on any active input stream.
Encoded packets are then passed to the decoder. The decoder produces uncompressed frames (raw video/PCM audio/...) which can be processed further by filtering. After filtering, the frames are passed to the encoder, which encodes them and outputs encoded packets. Finally, those are passed to the muxer, which writes the encoded packets to the output file. This is evident from Fig. 5.1.2.
Fig 5.1.2 Transcoding in ffmpeg algorithm
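As a concrete example, a standalone ffmpeg invocation equivalent to this extraction step might look as follows; the file names are placeholders, and the 16 kHz mono PCM settings are assumptions matching what Sphinx expects:

    ffmpeg -i input.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 output.wav

Here -vn drops the video streams, pcm_s16le selects 16-bit PCM encoding, -ar sets the audio sampling rate and -ac the number of channels.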
Speech Recognition:
The Sphinx engine is used for speech recognition. Sphinx is an open source speech recognition engine; it identifies words present in its dictionary.
Speech recognition relies on three resources:
Acoustic model: An acoustic model is used in automatic speech recognition to represent the relationship between an audio signal and the phonemes or other linguistic units that make up speech.
Language model: Sphinx tries to match sounds with word sequences. The language model provides context to distinguish between words and phrases that sound similar.
Dictionary file: The list of words, with their phonetic pronunciations, stored in alphabetical order.
The output contains the following fields:
a) Text: the transcribed text of the audio file.
b) Word times: when the actual words (or silences) were spoken.
c) Best3: the best three guesses for each of the phrases in the file.
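A sketch of how the word times can be read from a Sphinx-4 result, assuming the edu.cmu.sphinx.api classes; treat the exact method names as assumptions for your Sphinx version:

    import edu.cmu.sphinx.api.SpeechResult;
    import edu.cmu.sphinx.result.WordResult;

    public class WordTimes {
        // Print each recognized word with its start and end time in milliseconds.
        static void print(SpeechResult result) {
            for (WordResult w : result.getWords()) {
                System.out.printf("%s %d %d%n",
                        w.getWord().getSpelling(),   // the token, e.g. "hello" or "<sil>"
                        w.getTimeFrame().getStart(), // start time (ms)
                        w.getTimeFrame().getEnd());  // end time (ms)
            }
        }
    }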
Subtitle Generation:
The .srt file generated by the above module, that is, speech recognition, contains the words (lyrics) spoken in the audio file. The file consists of each word or sentence along with the time interval (start time and end time) in which it occurs. This file is then embedded in the video, where the text is synchronized with the timing and displayed with the video.
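A minimal sketch of writing such a file in Java, given caption texts with start/end times in milliseconds; the parallel arrays and the prior segmentation into caption lines are simplifying assumptions:

    import java.io.PrintWriter;
    import java.util.List;

    public class SrtWriter {
        // Format milliseconds as the SRT timestamp hh:mm:ss,mmm.
        static String stamp(long ms) {
            return String.format("%02d:%02d:%02d,%03d",
                    ms / 3600000, (ms / 60000) % 60, (ms / 1000) % 60, ms % 1000);
        }

        // Write one numbered SRT entry per caption line.
        static void write(PrintWriter out, List<String> lines, long[] start, long[] end) {
            for (int i = 0; i < lines.size(); i++) {
                out.println(i + 1);                                     // sequence number
                out.println(stamp(start[i]) + " --> " + stamp(end[i])); // time interval
                out.println(lines.get(i));                              // caption text
                out.println();                                          // blank separator
            }
        }
    }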
5.2 User Interface Design
Fig 5.2.1 Main screen
Fig 5.2.2 Selection of video
Chapter 7
Testing
The system tries its best to identify the words pronounced in the video. If the system does not find a spoken word in its dictionary, it prints the spelling of a similar-sounding word. This means that the overall accuracy of the system depends on the number of words in the dictionary. A lot of the system's accuracy also depends on the accent of the speaker in the video; if the speaker is not fluent, there is a problem in recognising words. We have tried adding as many words as possible to the dictionary. Individual modules were tested using unit testing, and when these modules were integrated, integration testing was used. Both are explained in detail below.
7.1 Test Cases
Test Case 1: The user provides erroneous input, e.g. a .txt file.
Result: In such a situation the system does not proceed further, and an exception is shown.
Test Case 2: The system comes across a word in the video that is not present in the dictionary.
Result: Here the system prints the spelling of a similar-sounding word.
7.2 Testing Used
7.2.1 Unit Testing
Unit testing focuses verification effort on the smallest unit of software design: the software component or module. Using the component-level design description as a guide, important control paths are tested to uncover errors within the boundary of the module. The relative complexity of tests and uncovered errors is limited by the constrained scope established for unit testing. Unit testing is white-box oriented, and the step can be conducted in parallel for multiple components.
In our project we coded the different modules separately, tested them and then integrated them together as one. We took video files as input from the user; this input has to be in a standard video format, i.e. .avi or .mp4. The system then proceeds to the audio extraction and conversion unit, where we test whether the .wav file is generated in a format suitable for the speech recognition module. The next unit tested was the subtitle generation unit, and the final one was the linking unit.
7.2.2 Integration Testing
Integration testing is the phase in software testing in which individual software modules are combined and tested as a group. It occurs after unit testing and before validation testing. Integration testing takes as its input modules that have been unit tested, groups them into larger aggregates, applies tests defined in an integration test plan to those aggregates, and delivers as its output the integrated system ready for system testing. While the system is working, if any of the modules fails or does not get a proper input, the entire system fails, and thus the user won't be able to view his video with subtitles.
Testing was done on a 30-second video containing 62 words. 60 of those words were recognized, and 25 of the recognized words were right. Measuring correct words against the total of 62, the accuracy rate (25/62) was found to be 40.32%.
Chapter 8
Results and Discussion
Figure 8.1 Selection of the video
Figure 8.2 Processing (Audio Extraction)
Figure 8.3 Processing (Subtitle Generation)
Figure 8.4 Playing the video with subtitles
Chapter 9
Conclusion
Through the implementation of the "Automatic Subtitle Generator", the system aims at creating software that generates subtitles for input videos automatically. This system first extracts the audio, then recognizes the extracted audio with the available speech recognition engine. In the next phase the recognized speech is converted to text format, and a text file having the ".srt" extension is generated. Later on, this ".srt" file can be opened in a media player to view the subtitles along with the video. The major benefit is that the viewer does not need to download subtitles from the internet if he wants to watch the video with subtitles. The tedious task of manually creating subtitles is replaced.
Future Work:
The system is implemented on the Windows operating system. Modifications can be made so that this software works on other operating systems too. The system can also be modified so that subtitles in languages other than English can be generated.
Literature Cited
[1] Understanding Auditory Processing Disorders in Children [online]