AUTOMATIC SUBTITLE GENERATION
Submitted in partial fulfillment of the requirements
of the degree of
B. E. Computer Engineering
By
Alina Fargose 30
Alria Fargose 31
Supervisor:
Ms. Tejal Carwalo
Department of Computer Engineering
St. Francis Institute of Technology
(Engineering College)
University of Mumbai
2016-2017
Abstract
The use of videos for the purpose of communication has witnessed phenomenal growth in the past few years. However, non-native language speakers and people with hearing disabilities are unable to take full advantage of this powerful medium of communication. To overcome the problems caused by hearing disabilities or language barriers, subtitles are provided for videos. The subtitles are provided in the form of a subtitle file, most commonly having a .srt extension. Several software packages have been developed for manually creating subtitle files; however, software for automatically generating subtitles is scarce.
The main objective of developing this system is to present an automated way to generate subtitles for audio and video. Replacing the tedious method of the current system will save time, reduce the amount of work the administration has to do, and generate the subtitles automatically. The system will first extract the audio and then recognise the extracted audio with the available speech recognition engine. The recognised speech is then converted to text and saved in a text file having the ".srt" extension. The .srt file is linked to the video and the video is played with subtitles.
Contents

1 INTRODUCTION
1.1 Description
1.2 Problem Formulation
1.3 Motivation
1.4 Proposed Solution
1.5 Scope of the Project
2 REVIEW OF LITERATURE
3 SYSTEM ANALYSIS
3.1 Functional Requirements
3.2 Non-Functional Requirements
3.3 Specific Requirements
3.4 Use Case Diagrams and Description
4 ANALYSIS MODELING
4.1 Data Modeling
4.2 Activity Diagrams
4.3 Functional Modeling
4.4 Timeline Chart
5 DESIGN
5.1 Architectural Design
5.2 User Interface Design
6 IMPLEMENTATION
7 TESTING
8 RESULTS AND DISCUSSIONS
9 CONCLUSIONS
10 LITERATURE CITED
11 ACKNOWLEDGEMENTS
List of Figures

3.4.1 Use Case Diagram
4.1.1 ER Diagram
4.2.1 Activity Diagram
4.3.1 Context Level DFD
4.3.2 Level 0 DFD
4.3.3 Level 1 DFD
4.4.1 Timeline Chart 1
4.4.2 Timeline Chart 2
4.4.3 Timeline Chart 3
5.1.1 System Architecture
5.1.2 Transcoding in ffmpeg
5.2.1 Main Screen
5.2.2 Selection of Video
8.1 Selection of the Video
8.2 Processing (Audio Extraction)
8.3 Processing (Subtitle Generation)
8.4 Playing the Video with Subtitles
Abbreviations

JAVE - Java Audio Video Encoder
CMU - Carnegie Mellon University
Chapter 1
Introduction
Video has been around for a long time. Production styles have evolved over the years, distribution channels have emerged, interactivity has blossomed and technology has changed the face of video forever. Video has become one of the most popular multimedia artifacts used on PCs and the Internet. In a majority of cases, sound plays an important role in the video.
Subtitles are text renderings of the dialogue in a video, displayed in real time at the bottom of the screen during playback. The subtitles may be in the same language as the video or in another language that helps people understand the content of the video.
The most natural way of making videos accessible lies in the use of subtitles. Subtitles provide missing information to individuals who have difficulty processing speech and other auditory components of visual media. Subtitles are essential for children who are deaf or hard of hearing, can be very beneficial to those learning English or Hindi as a second language, and can help those with literacy problems and those who are learning to read.
The main idea of developing the system is to present an automated way to generate subtitles using audio extraction and speech recognition techniques, which would replace the traditional method of writing the subtitle file manually. In the traditional method the subtitles had to be typed for a particular video; this system will save time, reduce the amount of work the administration has to do, and minimize the human errors associated with the process. Manual subtitle creation is a long and boring activity that requires the constant presence of the user. Therefore, automatic subtitle generation is used.
Software like Subtitle Editor and Gaupol are built along the same lines, i.e. these software packages require the user to manually type the subtitles. This report details a system in Java that automates the process of typing subtitles by means of speech recognition.
1.1 Description
Automatic subtitle generator is a system that generates subtitles automatically and plays video files along with the subtitles.
The system consists of three modules: audio extraction, speech recognition and subtitle generation. The generated subtitles are stored in a .srt file, and the .srt file is time-synchronized with the video. After the execution of all the above modules, the system plays the input video along with the subtitles.
Audio Extraction:
The speech recognition engine requires a .wav audio file as input. Hence, it is necessary to convert the video into .wav format. Ffmpeg will be used to accomplish this; ffmpeg is used by giants like VLC Media Player, Google Chrome and Blender. In order to use ffmpeg for converting a video into audio, JAVE (Java Audio Video Encoder) is used.
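A minimal sketch of this conversion step, assuming the JAVE 1.x API from Sauronsoftware (class names as documented for that library); the 16 kHz, mono, 16-bit PCM settings are assumptions chosen to match what CMU Sphinx expects:

    import it.sauronsoftware.jave.AudioAttributes;
    import it.sauronsoftware.jave.Encoder;
    import it.sauronsoftware.jave.EncodingAttributes;
    import java.io.File;

    public class AudioExtractor {
        public static void extract(File video, File wav) throws Exception {
            AudioAttributes audio = new AudioAttributes();
            audio.setCodec("pcm_s16le");   // 16-bit PCM (assumed, to suit Sphinx)
            audio.setChannels(1);          // mono
            audio.setSamplingRate(16000);  // 16 kHz

            EncodingAttributes attrs = new EncodingAttributes();
            attrs.setFormat("wav");
            attrs.setAudioAttributes(audio);

            new Encoder().encode(video, wav, attrs); // JAVE drives the bundled ffmpeg
        }
    }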
Speech Recognition:
The .wav file obtained from the audio extraction phase is passed on for speech recognition. An open source speech recognition engine called CMU Sphinx is used in this stage.
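A minimal transcription sketch, assuming the Sphinx-4 high-level API (edu.cmu.sphinx.api) and the default English model resources shipped with the sphinx4-data package; the resource paths and the input file name are assumptions that may differ between Sphinx versions:

    import edu.cmu.sphinx.api.Configuration;
    import edu.cmu.sphinx.api.SpeechResult;
    import edu.cmu.sphinx.api.StreamSpeechRecognizer;
    import java.io.FileInputStream;

    public class Transcriber {
        public static void main(String[] args) throws Exception {
            Configuration config = new Configuration();
            config.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
            config.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
            config.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

            StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(config);
            recognizer.startRecognition(new FileInputStream("extracted.wav")); // placeholder name
            SpeechResult result;
            while ((result = recognizer.getResult()) != null) {
                System.out.println(result.getHypothesis()); // recognized text for one utterance
            }
            recognizer.stopRecognition();
        }
    }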
Subtitle Synchronization:
This is the final phase in the automatic subtitle generation process, in which the generated subtitles are time-synchronized with the video.
1.2 Problem Formulation
Auditory processing disorder (APD), also known as central auditory processing disorder (CAPD), is a complex problem affecting about 5% of school-aged children. These kids can't process the information they hear in the same way as others because their ears and brain don't fully coordinate. Something adversely affects the way the brain recognizes and interprets sounds, most notably the sounds composing speech [1].
Automatic subtitle generation will help such children understand the content of video files by providing captions (subtitles) of what is being said in the video. School lectures can be video-recorded for these kids and shown to them along with the subtitles so that they can perceive education in a better manner [1].
1.3 Motivation
Nowadays, owing to the increased use of syllabus-related videos for teaching in classes, whether at school or college level, some students are unable to grasp what the speaker in the video is trying to explain. If the videos are shown along with captions, it becomes easy to relate what the video or the speaker in it wants to convey. Moreover, videos are not necessarily provided with subtitles, writing them manually is an impractical task for any individual, and searching for a correctly time-synchronized subtitle file may take time. Instead, by using this software, any individual can easily generate the captions and club them with the video, which can help students and others.
By creating subtitles, the following goals are achieved:
● The major benefit is that the viewer does not need to download the subtitle from the Internet if he wants to watch the video with subtitles.
● Captions help children with word identification, meaning, acquisition and retention.
● Captions can help children establish a systematic link between the written word and the
spoken word.
● Captioning has been related to higher comprehension skills when compared to viewers
watching the same media without captions.
● Captions provide missing information for individuals who have difficulty processing
speech and auditory components of the visual media.[2]
1.4 Proposed Solution
In the traditional method the subtitles had to be created manually; the proposed system presents an automated way to generate the subtitles for audio and video.
This system will first extract the audio and then recognise the extracted audio with the available speech recognition engine.
The audio extraction routine is expected to return a suitable audio format that can be used by the speech recognition module as pertinent material. It must handle a defined list of video and audio formats, and it has to verify the file given as input so that it can evaluate the feasibility of extraction.
In the next phase the recognized speech is converted to text format, and a text file having the ".srt" extension is generated.
The speech recognition routine is the key part of the system. Indeed, it directly affects performance and the evaluation of results. First, it must get the type of the input file; if the type is provided, an appropriate processing method is chosen. Otherwise, the routine uses a default configuration.
Later on, this ".srt" file can be opened in a media player to view the subtitles along with the video.
The subtitle generation routine aims to create and write a file containing multiple chunks of text, corresponding to utterances delimited by gaps, together with their respective start and end times. Time synchronization considerations are of prime importance.
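For illustration, a fragment of the SubRip (.srt) format that this routine produces: each entry is a sequence number, a start --> end timestamp pair in hh:mm:ss,mmm form, the caption text, and a blank separator line (the caption text below is invented for the example):

    1
    00:00:01,200 --> 00:00:03,800
    welcome to the lecture

    2
    00:00:04,100 --> 00:00:06,500
    today we will discuss subtitles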
1.5 Scope of the Project
Earlier, when subtitles had to be added to a particular video, we had to use software that would link a .srt file with the video, and the creation of the .srt file is a long and tedious process. Using this software, the user just has to input the video file, and the user will then be able to view the video with the required subtitles. It will provide subtitles for the input video.
Generated captions will help children with word identification, meaning, acquisition and retention. Captions can help children establish a systematic link between the written word and the spoken word [3].
Watching a TV show or movie with closed captioning turned on can help people understand the dialogue more clearly.
Chapter 2
Review of Literature
The last ten years have witnessed the emergence of every kind of video content. Moreover, the appearance of websites dedicated to this phenomenon has increased the importance the public gives to it.
Video plays a vital role in helping people understand and comprehend information, for example songs, movies, video lectures or any other multimedia data relevant to the user. Hence, it becomes important to make videos accessible to people having auditory problems, and even more so to people who need to bridge the gap of their native language. This can best be done by the use of subtitles for the video. However, downloading subtitles for a video from the internet is a tedious process. Consequently, generating subtitles automatically through the software itself, without the use of the internet, is a valid subject of research. Hence, researchers have resolved the above issue through three distinct modules, namely audio extraction, speech recognition and subtitle generation.
Abhinav Mathur et al. have used the above phases for automatic subtitle generation; they are explained below:
Audio Extraction
For subtitle generation, the input file first goes through audio extraction, where the input file must be in one of the formats supported by FFMPEG, such as .mp3, .mp4, .avi, .au and .flac. The input file goes to the demuxer, where the video is separated from the audio. This audio is then encoded: the stream is divided into frames and converted into binary format. The stream is compressed using the MP3 algorithm, in which noise is removed using the psychoacoustic model; this model provides lossy compression of the signal. The binary data is then decoded, where it is converted to a sinusoidal signal. This signal then goes to the muxer, where the separated audio signals are combined to produce a single (extracted) audio signal, which is converted to .wav format and then further used for speech recognition.
Speech Recognition
After the completion of audio extraction, the speech recognition part is carried out. The extracted .wav file is used to generate a .srt file through speech recognition, in which the audio passes through three modules: the Front End, the Decoder and the Knowledge Base.
Subtitle generation
The .srt file generated by the above module, that is, speech recognition, contains the words (lyrics) spoken in the audio file. The file consists of each word or sentence along with the time interval (start time and end time) in which it occurs in the song. This file is then embedded in the video, where the lyrics are synchronized with the time and displayed with the video [4].
Boris Guenebaut mentioned that it is necessary to find solutions for making these media artifacts accessible to most people. Several software packages propose utilities for creating subtitles for videos, but all of them require extensive participation of the user. The researcher described three phases: audio extraction, speech recognition and subtitle generation. The first phase consists of audio extraction; the audio extraction routine is expected to return a suitable audio format that can be used by the speech recognition module. It must handle a defined list of video and audio formats, it has to verify the file given as input, and the audio track has to be returned in the most reliable format.
The second phase is speech recognition; this routine is the key part of the system. Indeed, it directly affects performance and the evaluation of results. The third phase is subtitle generation; the subtitle generation routine aims to create and write a file with chunks of text and their respective start and end times. Time synchronization considerations are of prime importance [5].
Smita Jawale et al. mentioned that for creating subtitles there are three phases that have to be implemented: audio extraction, speech recognition and subtitle generation. In [6] the researchers have used the same three phases for subtitle generation.
Audio Extraction: In the implementation the researchers used a Java library named JAVE (Java Audio Video Encoder), developed by Sauronsoftware, a UK-based IT firm. The JAVE library is a Java wrapper on the ffmpeg project; ffmpeg is used by giants like VLC Media Player, Google Chrome and Blender. JAVE is used to transcode audio and video files from one format to another. Using the JAVE library, only the portion of the video between the start and end times selected by the user is converted to audio in the .wav format suitable for speech recognition.
Speech Recognition: Speech recognition is done using open source software called CMU Sphinx-4. The extracted audio in .wav format is given as input to the transcriber program. The programming is done in Java, with NetBeans used as the IDE. Sphinx provides the developer with three elements: 1. Acoustic Model, 2. Dictionary, 3. Language Model.
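As an illustration of the Dictionary element, entries in the CMU pronouncing dictionary format map each word to its phoneme sequence; the two entries below are representative samples, not taken from the cited implementation:

    HELLO       HH AH L OW
    SUBTITLE    S AH B T AY T AH L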
Subtitle Generation: The module is expected to get a list of words and their respective speech times from the speech recognition module and then to produce an SRT subtitle file. To do so, the module must look at the list of words and use silence (SIL) utterances as the delimitation between two consecutive sentences. The transcribed list of words is then time-stamped to introduce the synchronization.
The drawback of this implementation is that the system is not able to place punctuation in the subtitles, since that involves much more speech analysis and a deeper design [6].
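A sketch of this SIL-based segmentation idea in Java; the Word record and the "<sil>" token handling are illustrative assumptions, not the cited authors' code:

    import java.util.ArrayList;
    import java.util.List;

    public class SilSegmenter {
        // Illustrative record: one recognized token with its start/end time in milliseconds.
        static class Word {
            final String text; final long startMs; final long endMs;
            Word(String text, long startMs, long endMs) {
                this.text = text; this.startMs = startMs; this.endMs = endMs;
            }
        }

        // Group words into caption lines, starting a new line at each silence token.
        static List<List<Word>> segment(List<Word> words) {
            List<List<Word>> lines = new ArrayList<>();
            List<Word> current = new ArrayList<>();
            for (Word w : words) {
                if (w.text.equals("<sil>")) {   // a silence delimits two sentences
                    if (!current.isEmpty()) {
                        lines.add(current);
                        current = new ArrayList<>();
                    }
                } else {
                    current.add(w);
                }
            }
            if (!current.isEmpty()) lines.add(current);
            return lines;
        }
    }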
To overcome the problems caused by hearing disabilities or language barriers, subtitles are provided for videos. The subtitles are provided in the form of a subtitle file, most commonly having a .srt extension. Several software packages have been developed for manually creating subtitle files; however, software for automatically generating subtitles is scarce.
Thus, in the survey we have found various mechanisms for automatic subtitle generation, all of which include three main steps: audio extraction, speech recognition and subtitle generation. Therefore, in the proposed system, ffmpeg, which is fast and accurate, will be used for audio extraction, and CMU Sphinx, which showed the best accuracy among the surveyed engines, will be used for speech recognition and subtitle generation.
Chapter 3
System Analysis
3.1 Functional Requirements
● All MPEG standard formats are supported for audio and video.
● Captions appearing on the screen will remain long enough to read. It is preferable to limit on-screen captions to no more than two lines.
● Captions are synchronized with the spoken words.
● The extracted text is in .srt format. The text displayed will be in a readable form.
● Audio of any format can be extracted, but speech recognition is only done in English and Hindi.
3.2 Non-Functional Requirements
● System requirements: Compatible with all OS.
● Security: No security constraints.
● Performance: The text will synchronize with the video.
● Maintainability: The software is easy to maintain.
● Reliability: It will provide a good level of precision.
● Scalability: The software is scalable, since multiple users can use it at the same time for their benefit.
3.3 Specific Requirements
3.3.1 Hardware Requirements:
● A computer with Windows 7 or a higher operating system.
● Personal computer with following minimal specification:
RAM: 256/512 MB
Hard Disk: 80 GB
● Speakers.
3.3.2 Software Requirements:
● Any standard media player.
● Java SE 8
● Sphinx
3.4 Use Case Diagrams and Description
Fig 3.4.1 Use Case Diagram
The user has to input the video; the user can either play the video or increase/decrease its volume. Once the input is obtained from the user, the audio is extracted from the input video. Then the process of speech recognition takes place. Finally, the subtitle file is generated for the video, and the video and subtitle files are linked and played simultaneously.
Table 3.4.1 Use case template
Title: Automatic subtitle generation system.
Description: The user inputs the video, and the subtitles for the video are generated automatically. The subtitles are generated in three main steps: audio extraction, speech recognition and creation of the subtitle file.
Primary Actor: User.
Preconditions: The user inputs the video and the format of the video is validated.
Postconditions: The subtitles for the video are generated.
Trigger: The user wishes to watch the video with subtitles.
Exceptions: The system displays an error message saying the input format is not valid. The system displays an error message if a video in a language other than English is given as input.
Chapter 4
Analysis Modeling
4.1 Data Modeling
ER Diagram
Fig 4.1.1 ER Diagram
The ER diagram consists of three entities: Video, Words and Subtitle File.
Each entity contains its own attributes. The entity Video contains two attributes, video_path and video_id. The video_path gives the location where the video is stored, and the video_id is the unique id of the video.
The entity Words contains two attributes, word_id and word. The word_id is the unique id assigned to each word.
The entity Subtitle File has three attributes: subtitle_id, timing_info and sentence_info. The subtitle_id is the id given to each subtitle, the timing_info holds the timestamps given to each subtitle, and the sentence_info is the generated subtitle text.
4.2 Activity Diagrams
Fig 4.2.1 Activity diagram
4.3 Functional Modeling
Data Flow Diagram
Fig 4.3.1 Context level DFD
Fig 4.3.2 Level 0 DFD
Fig 4.3.3 Level 1 DFD
4.4 Timeline Chart
Fig 4.4.1 Timeline Chart 1
Fig 4.4.2 Timeline Chart 2
Fig 4.4.3 Timeline Chart 3
Chapter 5
Design
5.1 Architectural Design
Fig 5.1.1 System Architecture
The process of automatic subtitle generation is shown in Fig. 5.1.1:
1. Start: Take a media file as input from the user and validate the input format.
2. Audio extraction: Extract audio from the video file using the ffmpeg library in Java. The extracted audio is used as input for speech recognition.
3. Speech recognition: The Sphinx engine is used for speech recognition. Sphinx identifies words present in its dictionary.
4. Generation of subtitles: A .srt/.txt file is generated after speech recognition.
5. Synchronization of subtitles with the video: The final step is time synchronization between the subtitle file and the video. The spoken words should finally appear as text.
6. Apart from this, the user also has the option of finding a particular topic or word in the video.
Audio Extraction:
Ffmpeg is used for audio extraction. Ffmpeg is a very fast video and audio converter that can also grab from a live audio/video source. Each input or output file can, in principle, contain any number of streams of different types (video/audio/subtitle/attachment/data). The allowed number and/or types of streams may be limited by the container format.
Ffmpeg calls the libavformat library (containing demuxers) to read input files and get packets of encoded data from them. When there are multiple input files, ffmpeg tries to keep them synchronized by tracking the lowest timestamp on any active input stream.
Encoded packets are then passed to the decoder. The decoder produces uncompressed frames (raw video/PCM audio/...) which can be processed further by filtering. After filtering, the frames are passed to the encoder, which encodes them and outputs encoded packets. Finally, those are passed to the muxer, which writes the encoded packets to the output file. This is evident from Fig. 5.1.2.
Fig 5.1.2 Transcoding in ffmpeg algorithm
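As a concrete example, a standalone ffmpeg invocation equivalent to this extraction step might look as follows; the file names are placeholders, and the 16 kHz mono PCM settings are assumptions matching what Sphinx expects:

    ffmpeg -i input.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 output.wav

Here -vn drops the video streams, pcm_s16le selects 16-bit PCM encoding, -ar sets the audio sampling rate and -ac the number of channels.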
Speech Recognition:
The Sphinx engine is used for speech recognition. Sphinx is an open source speech recognition engine; it identifies words present in its dictionary.
Speech recognition relies on three resources:
Acoustic model: An acoustic model is used in automatic speech recognition to represent the relationship between an audio signal and the phonemes or other linguistic units that make up speech.
Language model: Sphinx tries to match sounds with word sequences. The language model provides context to distinguish between words and phrases that sound similar.
Dictionary file: The list of words, with their phonetic pronunciations, stored in alphabetical order.
The output contains the following fields:
a) Text: the transcribed text of the audio file.
b) Word times: when the actual words (or silences) were spoken.
c) Best3: the best three guesses for each of the phrases in the file.
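A sketch of how the word times can be read from a Sphinx-4 result, assuming the edu.cmu.sphinx.api classes; treat the exact method names as assumptions for your Sphinx version:

    import edu.cmu.sphinx.api.SpeechResult;
    import edu.cmu.sphinx.result.WordResult;

    public class WordTimes {
        // Print each recognized word with its start and end time in milliseconds.
        static void print(SpeechResult result) {
            for (WordResult w : result.getWords()) {
                System.out.printf("%s %d %d%n",
                        w.getWord().getSpelling(),   // the token, e.g. "hello" or "<sil>"
                        w.getTimeFrame().getStart(), // start time (ms)
                        w.getTimeFrame().getEnd());  // end time (ms)
            }
        }
    }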
Subtitle Generation:
The .srt file generated by the above module, that is, speech recognition, contains the words (lyrics) spoken in the audio file. The file consists of each word or sentence along with the time interval (start time and end time) in which it occurs. This file is then embedded in the video, where the text is synchronized with the timing and displayed with the video.
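A minimal sketch of writing such a file in Java, given caption texts with start/end times in milliseconds; the parallel arrays and the prior segmentation into caption lines are simplifying assumptions:

    import java.io.PrintWriter;
    import java.util.List;

    public class SrtWriter {
        // Format milliseconds as the SRT timestamp hh:mm:ss,mmm.
        static String stamp(long ms) {
            return String.format("%02d:%02d:%02d,%03d",
                    ms / 3600000, (ms / 60000) % 60, (ms / 1000) % 60, ms % 1000);
        }

        // Write one numbered SRT entry per caption line.
        static void write(PrintWriter out, List<String> lines, long[] start, long[] end) {
            for (int i = 0; i < lines.size(); i++) {
                out.println(i + 1);                                     // sequence number
                out.println(stamp(start[i]) + " --> " + stamp(end[i])); // time interval
                out.println(lines.get(i));                              // caption text
                out.println();                                          // blank separator
            }
        }
    }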
5.2 User Interface Design
Fig 5.2.1 Main screen
Fig 5.2.2 Selection of video
Chapter 7
Testing
The system tries its best to identify the words pronounced in the video. If the system does not find a spoken word in its dictionary, it prints the spelling of a similar-sounding word. This means that the overall accuracy of the system depends on the number of words in the dictionary. A lot of the system's accuracy also depends on the accent of the speaker in the video; if the speaker is not fluent, there is a problem in recognising words. We have tried adding as many words as possible to the dictionary. Individual modules were tested using unit testing, and when these modules were integrated, integration testing was used. Both are explained in detail below.
7.1 Test Cases
Test Case 1: The user provides erroneous input, e.g. a .txt file.
Result: In such a situation the system does not proceed further, and an exception is shown.
Test Case 2: The system comes across a word in the video that is not present in the dictionary.
Result: Here the system prints the spelling of a similar-sounding word.
7.2 Testing Used
7.2.1 Unit Testing
Unit testing focuses verification effort on the smallest unit of software design: the software component or module. Using the component-level design description as a guide, important control paths are tested to uncover errors within the boundary of the module. The relative complexity of tests and uncovered errors is limited by the constrained scope established for unit testing. Unit testing is white-box oriented, and the step can be conducted in parallel for multiple components.
In our project we coded the different modules separately, tested them and then integrated them together as one. We took video files as input from the user; this input has to be in a standard video format, i.e. .avi or .mp4. The system then proceeds to the audio extraction and conversion unit, where we test whether the .wav file is generated in a format suitable for the speech recognition module. The next unit tested was the subtitle generation unit, and the final one was the linking unit.
7.2.2 Integration Testing
Integration testing is the phase in software testing in which individual software modules are combined and tested as a group. It occurs after unit testing and before validation testing. Integration testing takes as its input modules that have been unit tested, groups them into larger aggregates, applies tests defined in an integration test plan to those aggregates, and delivers as its output the integrated system ready for system testing. While the system is working, if any of the modules fails or does not get a proper input, the entire system fails, and thus the user won't be able to view his video with subtitles.
Testing was done on a 30-second video containing 62 words. 60 of those words were recognized, and 25 of the recognized words were right. Measuring correct words against the total of 62, the accuracy rate (25/62) was found to be 40.32%.
Chapter 8
Results and Discussion
Figure 8.1 Selection of the video
Figure 8.2 Processing (Audio Extraction)
Figure 8.3 Processing (Subtitle Generation)
Figure 8.4 Playing the video with subtitles
Chapter 9
Conclusion
Through the implementation of the "Automatic Subtitle Generator", the system aims at creating software that generates subtitles for input videos automatically. This system first extracts the audio, then recognizes the extracted audio with the available speech recognition engine. In the next phase the recognized speech is converted to text format, and a text file having the ".srt" extension is generated. Later on, this ".srt" file can be opened in a media player to view the subtitles along with the video. The major benefit is that the viewer does not need to download subtitles from the internet if he wants to watch the video with subtitles. The tedious task of manually creating subtitles is replaced.
Future Work:
The system is implemented on the Windows operating system. Modifications can be made so that this software works on other operating systems too. The system can also be modified so that subtitles in languages other than English can be generated.
Literature Cited
[1] Understanding Auditory Processing Disorders in Children [online]