Hough Based Deinterlacer
Post on 09-Jan-2016
Hough-Based Deinterlacing
Altera Industrial Placement
Summer 2014
Abdulaziz Azman
CID:00680225
Table of Contents
1 Summary
2 Introduction
3 Altera Overview and Project Scope
4 The Deinterlacer Project
4.2 Project Expectations
5 Team Management and Organizational Tools
5.1 Video IP Team Management
5.2 Project Management Tools
6 Implementing and Developing The Algorithm
6.1 The Deinterlacing Challenge
6.1.1 Deinterlacing Background
6.1.2 The Low Angled Problem
6.1.3 Project Motivation
6.2 The Hough-Based Deinterlacer Design
6.2.1 General Method
6.2.2 Hough Transform
6.2.3 Verification of Feasibility of Deinterlacer Design and Methodology
6.3 Research and Improvements
6.3.1 X biased Sobel
6.3.2 The Proximity Hough
6.3.3 Post-Processing Block
6.4 Conclusion and Personal Reflection
7 Altera OpenCL and High-Level Synthesis
7.1 The OpenCL tool flow
7.2 Compromises and Optimizations
7.3 Conclusion and Personal Reflection
8 Offsite and Extra Activities
9 Conclusion
10 Appendix
I Pseudo Code of Conventional and Proximity Hough Transform
II Post-Processing Block
III IBC paper submission
1 Summary
Altera is a semiconductor manufacturing company based in Silicon Valley. Altera manufactures FPGAs, PLDs and
ASICs and has a large business unit portfolio that covers sectors such as broadcast, automotive, industrial and
communications. The company has many sites across the globe, with its European headquarters located in
High Wycombe. Altera Europe focuses on the business unit aspect of the company rather than on research and
manufacturing. The business unit I was assigned to is known as Broadcast. Altera Broadcast aims to provide IP solutions
and software tools that focus on broadcast related components. The current focus in the Altera Broadcast business unit is to
meet customer demands of 4k (UHD) processing by making IP cores that can handle 4k video bandwidth and data. A
challenge identified with 4k displays arises when lower-resolution interlaced sources are displayed on a UHD television. To display
an interlaced video source, the input video undergoes a decompression process known as deinterlacing, which converts the
interlaced video to a progressive video suitable for digital displays. This video decompression process often generates image
artifacts, which are further magnified by the up-scaling process required to match the 4k display resolution. There are two
goals in my project. The first is to develop a deinterlacing method proposed by my supervisor that aims to eliminate these
image artifacts. This phase of the project involves modifying the method and employing additional image processing
techniques to render the solution feasible. A feasible deinterlacer design is one that is able to consistently remove image
artifacts and can be fitted into a single FPGA chip. There were several novel methods that had to be incorporated to generate
significantly improved deinterlaced video outputs. The second goal in my project is to implement the finalized algorithm
in hardware using Altera high-level synthesis tools. The main challenge encountered during the project was creating a
deinterlacer design that produced consistent results. This is difficult due to the vast amount of feature variation in video
sequences. The approach to tackling this problem was to research different computer vision and image processing methods
that could be exploited. The final result of the research is a novel adaptation of an established image processing algorithm
for feature detection and the inclusion of a smoothing post-processing technique. I regard this as a personal achievement
because of the novelty of the algorithm. Arriving at the solution required a good understanding of the deinterlacing problem and
a low-level familiarity with the feature detection algorithm. Using the high-level synthesis tool was also a challenge because
it was my first exposure to it. Courses like Digital System Design and VHDL in my third year were relevant in this phase.
I was already accustomed to hardware performance measures, loop unrolling, memory interfaces and hardware timing
requirements. Throughout the placement I learned to communicate better with colleagues and managers, particularly when
giving progress updates and discussing project resource allocation. I realized the importance of giving your
manager an accurate impression of your capability to complete a specific task. A false impression results in your manager
expecting too much and overloading you with tasks. I attended many Altera Broadcast meetings and
video conferences, which provided insight into the development of IP cores and the management of a team of engineers.
Overall I learnt about both the technical and the productization aspects of an engineering solution. The finished deinterlacer design produced at
the end of the placement has potential to be developed further. Improving the algorithm by incorporating suitable digital
filters could drastically reduce execution time. Inferring more hardware parallelism by increasing the number of data paths
would enable the deinterlacer to perform 4k processing at 60 frames per second.
2 Introduction
In recent years there has been a demand for higher resolution displays. Broadcast and digital television companies around
the world are seeking to satisfy the growing market demand for high-definition (1080i and 1080p) broadcast and
television video. This shift in the market will inevitably require improved video processing components such as
scalers, codecs and deinterlacers. This industrial project focuses on designing a deinterlacer on an FPGA that would meet
the demand for high-quality deinterlaced video.
The project is divided into two main tasks: the algorithm development phase and the hardware implementation
phase. The algorithm development phase is meant to identify weaknesses in the proposed general method and to explore aspects
of the design that can be improved. This involved first implementing the method in C++ and then modifying it
based on feedback from the video outputs generated. The proposed deinterlacing method introduces numerous image
artifacts, hence a large part of the placement was dedicated to researching and devising new methods to improve
these video outputs. The hardware implementation phase is to validate an Altera high-level synthesis (HLS) tool flow. The
HLS tool flow aims to broaden the range of Altera customers by enabling easy translation of a widely used C-based
programming language to a register-transfer level hardware description. This abstraction from hardware is attractive to
software engineers and allows them to exploit the FPGA architecture without having to learn hardware description
languages. This report centers on the research and modifications made to the method, an outline of the translation
process from C to OpenCL and an evaluation of the performance of the hardware generated.
The report also includes an overview of Altera and the engineering team I was assigned to. A basic description of the
management practices and tools used along with a brief account of company-related offsite activities are also included.
3 Altera Overview and Project Scope
Altera is a manufacturer of Programmable Logic Devices (PLDs). The headquarters of Altera is located in Silicon
Valley but there are many Altera sites and offices around the world. The company portfolio includes research and
manufacturing of PLDs as well as providing FPGA related tools and solutions to specific sectors. These sectors are identified
as business units and the main Altera business units (BU) are Industrial, Automotive, Communications and Broadcast. A
hierarchy of the BUs can be seen below in figure 1 along with the teams under the Broadcast BU.
Figure 1 Diagram showing the Business Units inside Altera and teams under the Broadcast BU.
The teams in Altera are constantly changing based on customer demands. This is because Altera adopts a dynamic allocation
of resources. Engineers and funds are constantly being reassigned and reallocated to different teams to meet the immediate
demands of customers. This management approach is practical and strategic as it achieves optimal resource allocation.
Altera engineers are also not restricted to a single BU.
I was assigned to the Video IP team under the Broadcast business unit. The Broadcast BU aims to meet broadcast customer
demands by providing intellectual property cores, design software and tools. The Video IP team under Altera Broadcast
focuses on meeting customer demands for 4k video display and processing by providing solutions such as 4k IP cores and
design tools. Altera invests in making IP readily available to customers to encourage the use of Altera's FPGAs.
Examples of the video IP available are scalers, deinterlacers, SDI and HDMI interfaces and chroma resamplers. The project
assigned to me focuses on developing a deinterlacer component. Deinterlacers are included in digital video systems to
enable interlaced video to be displayed, which makes them an important component of a digital display system.
[Figure 1 diagram: Altera Business Units: Broadcast (teams: Video Codec, Video IP), Automotive, Industrial, Communication]
4 The Deinterlacer Project
There are three phases in the Deinterlacer Project. The first is to implement, develop and conduct a feasibility study on the
deinterlacer method. This method is described in an internal Altera document authored by my supervisor, Jon Harris, and is
explained in section 6. The second is to restructure the deinterlacer algorithm to suit a hardware implementation. The third
phase is the hardware synthesis of the deinterlacer design. While the primary objective of the third phase is to implement
the deinterlacer algorithm in hardware, the secondary objective is to validate an Altera high-level synthesis tool. The tool
is still in its infancy and is available to the public, though some of its features are only available to Altera employees.
The project was divided into tasks using Jira to make monitoring easier. A breakdown of the project plan is
presented in the table of figure 2.
Phase | Task | Description | Time Allocated (weeks) | Time Completed (weeks)
Algorithm Development | Develop Deinterlacer Method in C++ | Implement method described in patent draft and use OpenCV as interface to generate image outputs from C++ functions. | 11 | 14
Algorithm Translation | Translate C++ Code to OpenCL | Rewrite sections of code in a pixel-streaming manner. Modify code in a structure that the OpenCL compiler is able to compile and extract parallelism. | 2 | 4
Hardware Implementation | Use OpenCL tool flow to create first OpenCL-based program. Perform necessary optimizations and compromises | Main task was to use resource estimation to ensure that design will fit into card. Main task was to verify frame outputs from hardware using emulation. This was achieved using optimizations and compromise. Optimizations include inferring pipelines wherever possible, relaxing loops by reducing data dependency and inferring shift registers wherever possible. Compromises include reduction of intercept range and delta range (benchmark is 1080i). | 6 | 2
Hardware Synthesis | Synthesize design to FPGA | | 3 | 2
Figure 2: Table showing breakdown of project plan proposed by placement supervisor.
The time allocation was decided by my supervisor, who based the number of weeks on the fact that the algorithm required
refinement and that I was unfamiliar with the high-level synthesis tool. The algorithm development phase was allocated 11
weeks but took 13 weeks to complete. The overrun was due to the difficulty of making the deinterlacer produce
consistent outputs across a set of industry-standard interlaced video sequences. A large research effort went into making the
deinterlacer design a feasible solution to the deinterlacing problem (elaborated in section 6). Note that the term feasible is
used to describe a solution that consistently removes the targeted artifacts and can be synthesized to fit on a single readily
available FPGA chip. The algorithm translation phase took longer than expected because the C++ code had been written
with no hardware consideration. The initial code used C++ classes and random memory accesses, which are more difficult to translate
into hardware than an algorithm written in a pixel-streaming fashion. Hence the restructuring of the code involved not
only a translation of the algorithm but also major changes to the algorithm to render it suitable for pixel-streaming. The
modified algorithm had to achieve the same effect, or at least a similar effect, as the original software implementation. The
hardware implementation phase was completed much faster than expected: of the 6 weeks allocated, it took
only 2. The speed-up is due to the robustness of the high-level synthesis tools used (elaborated in section 7) and
the close communication between me, my supervisor and an Altera employee who is involved in the development of the
tools used.
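To illustrate what writing code in a pixel-streaming fashion means, the sketch below computes a 3-tap horizontal average while reading each pixel exactly once, in raster order, and keeping all state in a three-entry shift register. It is not code from the project: the function name and the choice of filter are assumptions made purely for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative pixel-streaming filter (hypothetical, not the project's code).
// Each input pixel is consumed once, in arrival order; the only storage is a
// three-entry shift register, which maps naturally onto hardware registers.
std::vector<std::uint8_t> streamingAverage3(const std::vector<std::uint8_t>& stream) {
    std::vector<std::uint8_t> out;
    int window[3] = {0, 0, 0};  // shift register, oldest sample first
    std::size_t received = 0;
    for (std::uint8_t pixel : stream) {
        // Shift the register by one position and append the new pixel.
        window[0] = window[1];
        window[1] = window[2];
        window[2] = pixel;
        ++received;
        // An output is valid once the register holds three real samples.
        if (received >= 3) {
            out.push_back(static_cast<std::uint8_t>((window[0] + window[1] + window[2]) / 3));
        }
    }
    return out;
}
```

A random-access version would instead index freely into a frame held in memory, which forces the synthesis tool to build a memory interface rather than a simple pipeline.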
4.2 Project Expectations
The proposed method of deinterlacing is the first of its kind in that the Hough Transform is used to deinterlace a frame. Due
to its novelty and the limited placement time, the algorithm is not expected to be extremely robust or
efficient. It may produce additional image artifacts while refining the targeted sections of the image. It is expected,
though, that the algorithm should work very well in situations where its parameters are fine-tuned to a
particular video sequence. This inconsistency is not surprising, as even mature and well-established
deinterlacing algorithms do not produce consistent results for all video sequences.
The hardware generated by the HLS tool is not expected to be optimized and may take up a large amount of resources.
The target board used for this design is the Nallatech PCIe-385N FPGA accelerator card, which carries a Stratix V. The Stratix
family is among the higher-end FPGAs available on the market today and hence provides a large number of FPGA
resources. Despite this, the design is expected to use up most of the Stratix V and is therefore very expensive in terms of
FPGA resources. The design is therefore not suitable for incorporation into an FPGA-based display system, which often
includes other video components such as scalers, memory and port interfaces, and video encoders and decoders. The hope
is that, as FPGA resources increase and the HLS tool matures, the deinterlacer design will
become a practical solution for better deinterlaced video output.
5 Team Management and Organizational Tools
A positive aspect of the placement was the constant exposure to the meetings and organizational tools used in the Video IP
team. My manager and supervisor included me, whenever relevant, in video conferences and group discussions. As a
result I became familiar with the dynamics of the Video IP team's management and the management tools it uses for
IP core development.
5.1 Video IP Team Management
Benjamin Cope, who is based in Altera High Wycombe, manages the VIP group. The group consists of 11 engineers, 8
of whom are based in Altera Penang. The engineers in the VIP group work closely together but are not based at the same
Altera site. The team therefore uses video conferences and online management tools to monitor the progress of all the
engineers. The video conferences between Altera Europe and Penang are held once a month to report individual
progress and to highlight group milestones. I attended these meetings despite not contributing to the team project,
though on occasion I provided updates on the progress of my deinterlacer project.
The management system used by the VIP team is called the Agile system, and the online management tool the team uses
is called Jira. The system and the tool are used to keep track of work progress and to generate performance
measures. Though my industrial placement project was not part of the VIP team's project, I was included in their management
system to give me insight into how engineers organize themselves when developing software and IP cores.
The Agile system is a method of organizing a software development process, especially when the team consists of individuals
with different functional expertise. The specializations in the VIP team range from members adept in specific IP cores, such
as deinterlacers, to members who focus on verification. The Agile system introduces management techniques and measures
such as Sprints, Story Points, Issues and Scrums. These provide an iterative and incremental framework for IP or
software development. The short time span allocated to a task (i.e. weeks) allows the team to be very responsive to changes
in customer demands. Because the placement was an individual project, its tasks were divided over periods of months
instead of weeks. There are many online tools that adopt the Agile system to make it more versatile and accessible.
Jira is one such online tool. It provides bug tracking and software development monitoring. Jira
allows issues to be allocated between team members by a simple drag-and-drop action. Issues can be prioritized and labeled
to allow easy tracking. For example, an issue may be labeled as a bug fix or a new feature. End-of-sprint reports can also
be generated through Jira. These reports consist of a burn-down chart of the story points achieved against the
time taken to complete them. The expected trajectory is also displayed in the burn-down chart. Burn-down chart
comparisons give the team an idea of how well they meet expected targets and give managers feedback on how
much work to assign to the team in the future. An example of a burn-down chart for the deinterlacer project is shown in figure
3.
Figure 3 Burn-down chart of the deinterlacer project comparing the expected progress with the actual
progress.
Exposure to the monthly video conferences with Penang and to software management tools familiarized me with the IP core
development process and with the dynamics of a team of engineers located far apart.
5.2 Project Management Tools
Throughout the internship, meetings and project tools were used to help the project progress. One-on-one meetings
between my supervisor and me occurred almost daily, depending on my progress or whether I required assistance.
The third phase of the project involved using the Altera OpenCL tool flow, with which many engineers in Altera Europe are
unfamiliar. Andrei Hagiescu is an Altera employee based in Toronto who is involved in developing the Altera
OpenCL tool flow and in making OpenCL example designs. Andrei was helpful in many aspects of the latter phase of my
project. He provided constant feedback on the latest deinterlacer code revision and advised on an FPGA development
kit that would suit my application. For the code feedback, Andrei evaluated the resource estimation and
optimization reports generated by the AOCL compiler and suggested ways of improving and optimizing the hardware
being generated. Andrei's main role was to guide the second phase of the project in the right general direction.
Code revisions in the hardware implementation phase were exchanged between Andrei and me by email. Errors and bugs
that I encountered in the high-level synthesis tool were reported to Andrei using Fogbugz. Fogbugz is a bug tracking
system in which bugs are reported as incidents that can be assigned to a teammate to help resolve. A bug can also
be assigned to multiple teammates. Fogbugz also retains the thread of correspondence between teammates for easier
tracking. I identified several bugs in the high-level synthesis tool while using it. Workarounds and
suggestions were almost always provided immediately by the Altera employees involved with the HLS compiler.
Having one-on-one meetings with my supervisor and video conferences with Andrei definitely helped me develop my
communication skills. Supervisors and managers who do not have an idea of the amount of effort required to solve a
problem may impose unrealistic goals. It is therefore paramount that you convey an accurate impression of your capabilities
and of the time and effort you estimate a task will require. A false impression would result in your
supervisor expecting more results than are possible.
[Figure 3 plot: "Burndown Chart of Deinterlacer Project". Horizontal axis: Weeks (0 to 40). Vertical axis: Outstanding Task (0 to 5). Series: Expected Progress, Actual Progress. Marked tasks: C++ implementation, OpenCL kernel translation, OpenCL kernel optimisation, Hardware Synthesis.]
6 Implementing and Developing The Algorithm
At the onset of the placement, the proposed deinterlacing method had never been implemented. The method as described was
general and required many additional image processing techniques to render it a feasible solution to the
deinterlacing problem. Because so little validation and research had been done on the proposed method, more than half
of the placement time was allocated to making the algorithm feasible.
6.1 The Deinterlacing Challenge
Deinterlacing has been an ongoing problem for decades. Interlacing is a primitive form of video compression that blindly removes half of
the image information, which makes reconstruction a challenge. As a result, decompression artifacts are difficult to avoid. Despite
this tendency, interlacing needs no complex codecs or compression algorithms, which makes it an
attractive technique for broadcast companies, as they avoid the extra cost of specialized broadcast equipment for the
codecs. The cost of compensating for the simple compression technique is incurred at the receiving end of the transmission,
where better deinterlacers are required to either eliminate or mitigate the decompression artifacts. The widespread
use of interlaced broadcasting has driven a large amount of research in the deinterlacing area. This section of
the report delves deeper into the fundamentals of deinterlacing, the targeted decompression artifact and the project
motivation.
6.1.1 Deinterlacing Background
Deinterlacing is a method of converting interlaced scan video to progressive scan video. Interlaced scan video captures
only the even or the odd horizontal lines of each video frame. A video frame containing only the odd or only the even lines is known
as a sub-field. This terminology will be used throughout this report. Transmitting and storing video in interlaced format is
therefore an advantage, as the same amount of bandwidth or memory supports an increased frame rate. This
improvement in temporal resolution reduces image flickering in analogue television and, as a result, the interlaced format is
commonly used in analogue broadcast systems such as PAL and NTSC. Progressive video, on the other hand,
captures and displays all lines in a video frame and is the format used in almost all digital video
display devices. Digital devices therefore require a deinterlacing component in the system to allow the display of
interlaced scan video. A simple illustration of the conversion between interlaced and progressive video can be seen
in figure 4.
Figure 4 An image created during industrial placement illustrating the capturing, storing and conversion of interlaced scan
fields to progressive video.
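The relationship between a full frame and its two sub-fields can be sketched in a few lines of C++. The `Image` struct and the `extractSubField` function are hypothetical names introduced only for this illustration, and rows are counted from zero, so a parity of 0 keeps rows 0, 2, 4 and so on.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A greyscale image stored row by row (hypothetical helper type).
struct Image {
    std::size_t width;
    std::size_t height;
    std::vector<std::uint8_t> pixels;  // width * height samples, row-major
};

// Keep every second row of `frame`, starting at row `parity`
// (0 keeps rows 0, 2, 4, ...; 1 keeps rows 1, 3, 5, ...).
Image extractSubField(const Image& frame, std::size_t parity) {
    Image field{frame.width, 0, {}};
    for (std::size_t y = parity; y < frame.height; y += 2) {
        const std::size_t start = y * frame.width;
        for (std::size_t x = 0; x < frame.width; ++x) {
            field.pixels.push_back(frame.pixels[start + x]);
        }
        ++field.height;
    }
    return field;
}
```

Each sub-field holds half the rows of the frame, which is why an interlaced stream can carry twice as many fields per second in the same bandwidth. A deinterlacer has to estimate the rows that were dropped.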
Deinterlacing is a widely researched area due to its many applications in television and video storage. Deinterlacing
methods can be crudely divided into 4 techniques: motion-compensation, motion-adaptive, directional
interpolation and non-adaptive (temporal and spatial) interpolation. Deinterlacing methods can also be divided into
intra-field and inter-field deinterlacing. Intra-field methods use only one sub-field to generate a full progressive frame,
while inter-field methods use multiple sub-fields. Directional and non-adaptive interpolation techniques tend to be less
algorithmically and computationally complex than their motion-detection based counterparts. In exchange, they
produce lower quality deinterlaced video output and are more prone to generating image
artifacts. A common image artifact that has yet to be fully resolved is the low-angled image artifact, which occurs in many
directional and spatial non-adaptive interpolation techniques.
6.1.2 The Low Angled Problem
As the name suggests, low-angled image artifacts are artifacts that occur along straight, low-angled lines. The term
low-angled is used to keep the exact angle of the line arbitrary, as the artifacts are subjective and are more apparent on some straight edges than others. For the sake of definition we will take low-angled lines to be lines that are less than
45 degrees to the horizontal. Figure 5 shows the low-angled artifacts generated by spatially non-adaptive deinterlacing
techniques (i.e. LD and LA) and an edge-dependent deinterlacing technique (i.e. ELA).
(a) (b) (c)
Figure 5 Low-angled image artifacts produced by Line Doubling (LD), Line Averaging (LA) and
Edge-dependent Line Averaging (ELA). Images were generated using C++ and OpenCV.
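For reference, the three intra-field techniques named above can be sketched as follows. This is a minimal illustration rather than the code used to produce figure 5: the function names are hypothetical, and the ELA shown is the simple three-direction variant, which is an assumption since the report does not state which variant was used.

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

using Row = std::vector<std::uint8_t>;

// Line Doubling (LD): the missing line is a copy of the line above it.
Row lineDouble(const Row& above) {
    return above;
}

// Line Averaging (LA): each missing pixel is the rounded mean of the
// pixels directly above and below it.
Row lineAverage(const Row& above, const Row& below) {
    Row out(above.size());
    for (std::size_t x = 0; x < above.size(); ++x) {
        out[x] = static_cast<std::uint8_t>((above[x] + below[x] + 1) / 2);
    }
    return out;
}

// Edge-dependent Line Averaging (ELA), three-direction variant (assumed):
// average along the direction (offset -1, 0 or +1) whose two end points
// differ the least.
Row edgeLineAverage(const Row& above, const Row& below) {
    const int n = static_cast<int>(above.size());
    Row out(above.size());
    for (int x = 0; x < n; ++x) {
        int bestOffset = 0;
        int bestDiff = std::abs(above[x] - below[x]);
        for (int d : {-1, 1}) {
            const int xa = x + d;  // sample on the line above
            const int xb = x - d;  // mirrored sample on the line below
            if (xa < 0 || xa >= n || xb < 0 || xb >= n) continue;
            const int diff = std::abs(above[xa] - below[xb]);
            if (diff < bestDiff) {
                bestDiff = diff;
                bestOffset = d;
            }
        }
        out[x] = static_cast<std::uint8_t>(
            (above[x + bestOffset] + below[x - bestOffset] + 1) / 2);
    }
    return out;
}
```

On an edge that moves two pixels between the line above and the line below, LA blends the two sides into a grey step, while this ELA follows the edge. An edge that moves by more than two pixels falls outside the three candidate directions, which is the low-angled case.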
The main reason these artifacts occur at low angles is the inherent lack of vertical resolution in interlaced video.
Recall that a sub-field captures only the odd or the even horizontal lines of a full frame, so half of the horizontal
lines are not captured. Deinterlacing techniques essentially reconstruct these missing lines using
information contained within the sub-field. The closer a straight line is to the horizontal, the less information
about it exists in the interlaced field. This is easy to see if you imagine a perfectly horizontal line in an
image: assuming the line is one pixel wide, the entire line may not be captured at all in an interlaced
format, because the frame line it lies on may be one of those skipped. In contrast, a perfectly vertical line has half of its information
available regardless of whether the odd or the even sub-field is captured. To proceed further into the problem, there are several
crude classifications of straight lines. The relevant types for this problem are highly directionally textured lines and macro-lines.
The distinction lies in the thickness of the lines and the spacing between consecutive lines. Textured lines are a single pixel wide and closely
packed, while macro-lines are multiple pixels wide and far apart from other lines. Both can be seen in figure 6, taken from
the 1080p NTSC Slices sequence.
Figure 6 Showing regions of highly textured, textured and macro-lines in the Slices test sequence created by NTSC.
Existing edge-dependent deinterlacing methods are insufficient because of the locality of the pixels they analyze. Only
neighborhoods of at most 20 pixels across are processed before performing the directional interpolation. This locality
makes most existing edge-dependent deinterlacers suitable for highly directionally textured regions,
shown in yellow in figure 6. It also means that they are unable to reconstruct regions with macro-lines, shown in
blue in figure 6, which require a larger neighborhood of pixels to be processed. The existing schemes can at best approximate
the direction of these macro-lines. This approximation and its drawback can be seen in figure 7, which shows
the deinterlacing of the 1080i (interlaced) NTSC Slice sequence using the Fine-Directional Deinterlacing (FDD) method,
a form of edge-dependent deinterlacing.
(a) (b)
Figure 7 Showing the progressive 1080p (a) and the FDD output (b) of the 1080i NTSC Slice test Sequence respectively.
The solution to reconstructing macro-lines is to analyze the entire sub-field rather than a local pixel
neighborhood. This is done using a Hough Transform (HT), which is elaborated further in section
6.2.2. It is crucial to note that analyzing a delocalized set of pixels and reconstructing a single pixel from their data
results in a higher chance of other image artifacts being generated. This is because the pixel to be interpolated will be affected
by pixel values that are relatively distant. Pixels distant from the pixel to be interpolated tend not to
represent the same image feature in a frame. This is precisely why most existing deinterlacers do not extend the
pixel domain they process beyond 10 pixels. Even deinterlacers that analyze a neighborhood
of only 5 pixels across generate artifacts due to inaccurate interpolation based on distant pixels.
To remove secondary image artifacts produced from pixel interpolation, we can focus on post-processing methods that
double-check the existence of an edge and perform a smoothing based on these edge confirmations. A large section of the initial phase of this project is dedicated to these post-processing methods. Despite the drawbacks of analyzing a large pixel
domain, there are several motivations to pursue a more delocalized pixel domain analysis technique for deinterlacing.
6.1.3 Project Motivation
The increase in demand for HD (1080p) and UHD (2160p) television displays requires the conversion of existing lower-resolution video to higher resolutions. Image artifacts in deinterlaced video inevitably become more apparent after up-scaling.
Hence previously tolerable, minute low-angled image artifacts now become more discernible to the human eye. An
improved deinterlacer would provide a competitive edge to companies whose business models are predicated on
making high-resolution digital televisions. It is important to note that many interlaced video sources are still in use today,
such as in broadcast and video storage, despite interlacing being a primitive form of video compression. The ubiquity of the interlaced
format makes the deinterlacer a crucial component of a digital television for improving the user experience.
FPGA area and resources are growing at a much faster rate than the demand for television resolution. Because
video resolution is increasing more slowly, more FPGA resources are left available. Hence it is reasonable to invest more FPGA resources in better
deinterlacers to improve video quality.
The expected complexity and resource usage of the proposed deinterlacer design is less than that of motion compensation
deinterlacer designs. This is due to the fact that motion-compensation requires computationally heavy motion detectors that
involve computing motion vectors at different pixel domain levels to estimate motion. The expected deinterlaced output
quality and robustness of the proposed design should be comparable to that of motion compensated deinterlacers despite
having a lower resource usage and complexity. This point is revisited at the end of this report once the deinterlacer design
has been explained, implemented and verified.
6.2 The Hough-Based Deinterlacer Design
My industrial placement supervisor, Jon Harris, proposed the Hough-Based Deinterlacer design in a patent draft (yet to be
submitted). The novelty the patent claims is the application of the Hough Transform in deinterlacing video. The document
describes the method as well as a digital hardware realization of the deinterlacer. This section of the report is dedicated to
introducing the general method as described in the patent draft document.
6.2.1 General Method
The Hough-based deinterlacer analyses an entire sub-field for edges. This is in contrast to existing deinterlacers, where only
local pixel domains are processed. The Hough-based deinterlacer aims to extract edge information to dictate the directional
interpolation of the pixels. Edge information may include variables such as the intercept, gradient, start coordinates and end
coordinates of a line. The scheme extracts edge information via the image processing flow shown in figure 8.
Figure 8 Functional flow of the deinterlacer and the image output of each processing stage, panels (a) to (d). The image is
the 200th field of the Table Tennis sequence.
The flow consists of an RGB to luma conversion process, an edge detection process, a line detection process, an offset mask
generation process and an interpolation process. Referring to figure 8, the interpolator uses the processed sub-field from
the luma, Sobel and offset mask generation blocks to perform directional interpolation to produce the deinterlaced output.
The scheme targets the luminance (i.e. brightness information) of the sub-field while ignoring the chrominance (i.e. color
information) because the human eye is more sensitive to variation in the brightness of an image than to variation in its
color. This higher tolerance for changes in color information is precisely the reason why the chroma of an image or video
is usually sub-sampled. An RGB conversion may be required depending on the format of the video input to the design. No
conversion is required if the video input uses the YUV color space, which encodes video as YCbCr: Y contains the
luminance information and Cb and Cr contain the chrominance information. Equation (1) shows the conversion from digital
RGB to digital luminance (i.e. Y).
\[ Y = \frac{66 R + 129 G + 25 B + 128}{2^{8}} + 16 \qquad (1) \]
The red, green, blue and luminance values are each represented as an 8-bit binary number and therefore lie in the range 0 to 255.
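As a minimal sketch of the conversion in equation (1), the function below uses the common 8-bit integer form of the formula. The function name `rgb_to_luma` and the use of a right shift for the division by 256 are assumptions for illustration, not taken from the patent draft or the C implementation.

```python
def rgb_to_luma(r, g, b):
    """Convert 8-bit R, G, B values to an 8-bit luma value.

    Assumption: the division by 2**8 is an integer shift and the +128 term
    rounds the result, as in the usual fixed-point form of this conversion.
    """
    return ((66 * r + 129 * g + 25 * b + 128) >> 8) + 16


# Black and white map to the ends of the luma range produced by this form.
black = rgb_to_luma(0, 0, 0)
white = rgb_to_luma(255, 255, 255)
```

Only integer multiplies, adds and a shift are needed, which is one reason this form is convenient for hardware.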
The Sobel process is used to detect boundaries in the sub-field. Boundaries in images are characterized by a change in
luminance or chrominance. This change can be computed using image kernels that contain gradient operators. There are
numerous gradient operators but the Sobel was chosen due to the higher weighting given to the pixels vertically adjacent to
the center pixel. The coefficient 2 in the kernels also gives the Sobel smoothing properties. Figure 9 shows common
edge operators including the Sobel operators.
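\[
\text{(a)}\;
\begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix},\;
\begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}
\qquad
\text{(b)}\;
\begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix},\;
\begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix}
\qquad
\text{(c)}\;
\begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix},\;
\begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}
\]
Figure 9 The Roberts (a), Prewitt (b) and Sobel (c) edge operators.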
The Sobel threshold sets the minimum output value of the Sobel Transform that we would consider an edge. The
image generated after applying the threshold is a binary image. Increasing the threshold reduces the chance of
detecting erroneous edges but simultaneously reduces the edge information. A compromise based on the user's
preference is therefore necessary to dictate the tolerance in gradient. The effect of varying the Sobel threshold is illustrated
in figure 10. Note the disappearance of the straight edge of the roof eave with increasing Sobel threshold.
Figure 10 Output of the Sobel Transform with increasing Sobel threshold, panels (a) to (d).
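The Sobel and threshold steps described above can be sketched as follows. This is an illustrative pure-Python version, not the project's C code: the names `sobel_edges` and `apply_kernel`, the |gx| + |gy| magnitude approximation and the choice to leave border pixels at 0 are all assumptions.

```python
# Standard 3x3 Sobel kernels (sign convention is an assumption).
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))    # responds to horizontal change
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))    # responds to vertical change


def apply_kernel(luma, kernel, x, y):
    """Weighted sum of the 3x3 neighbourhood centred on (x, y)."""
    return sum(kernel[j][i] * luma[y + j - 1][x + i - 1]
               for j in range(3) for i in range(3))


def sobel_edges(luma, threshold):
    """Return a binary edge image: 1 where the gradient magnitude >= threshold."""
    height, width = len(luma), len(luma[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):          # border pixels are left at 0
        for x in range(1, width - 1):
            gx = apply_kernel(luma, SOBEL_X, x, y)
            gy = apply_kernel(luma, SOBEL_Y, x, y)
            if abs(gx) + abs(gy) >= threshold:
                edges[y][x] = 1
    return edges


# A 5x5 luma image with a horizontal boundary between rows 1 and 2.
luma = [[0] * 5, [0] * 5, [200] * 5, [200] * 5, [200] * 5]
low_threshold = sobel_edges(luma, 400)     # boundary rows are marked as edges
high_threshold = sobel_edges(luma, 1000)   # threshold too high, no edges remain
```

Raising the threshold from 400 to 1000 removes the edge entirely, which mirrors the loss of edge information described for figure 10.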
The Hough Transform uses the binary edge information (Figure 10 (c)) to detect lines in an image through a voting system;
an explanation of the Hough Transform is given in section 6.2.2. Lines are detected and are used to generate an offset mask
(shown in Figure 10 (d)). An offset mask is an intermediate image that gives the location of each pixel requiring directional
interpolation along with the offset value. A simple illustration of the function of the offset mask is shown in figure
The offset value is used to directionally interpolate a pixel. Directional interpolation is derived from the characteristic that
pixels along edges tend to have the same RGB or luminance value. Hence to reconstruct the edge, the pixels to-be
deinterlaced will take the average value between the top and bottom pixels at offsets determined by the angle of the edge
itself. Consider the following illustration in figure 11.
Figure 11 Shows the directional interpolation method along with a comparison of the results.
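The directional interpolation described above can be sketched for a single missing pixel. The function name `interpolate_pixel`, the sign convention (the line above is sampled at x + offset and the line below at x - offset) and the clamping at the image border are assumptions made for this illustration.

```python
def interpolate_pixel(above, below, x, offset):
    """Average the pixels above and below a missing pixel along an edge direction.

    above, below: luma values of the field lines above and below the missing line.
    offset: horizontal offset (delta x) taken from the offset mask; 0 gives the
            plain vertical average.
    """
    width = len(above)
    xa = min(max(x + offset, 0), width - 1)   # clamp to the line
    xb = min(max(x - offset, 0), width - 1)
    return (above[xa] + below[xb]) // 2


# A dark-to-bright boundary that moves one pixel to the left per line.
above = [0, 0, 0, 255, 255, 255]
below = [0, 255, 255, 255, 255, 255]

vertical = [interpolate_pixel(above, below, x, 0) for x in range(6)]
directional = [interpolate_pixel(above, below, x, 1) for x in range(6)]
```

With offset 0 the pixels around the boundary become an intermediate grey, while offset 1 follows the edge and keeps the boundary sharp.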
6.2.2 Hough Transform
The Hough Transform (HT) was first introduced in 1962 in a patent [ii] published by Paul V. C. Hough. The patent describes a
method of extracting image features, such as lines and ellipses, which can be mathematically parameterized. As an example
the patent uses straight lines, which are easily represented by an intercept value and a gradient value. The basic idea in the
HT is that for every high binary edge pixel, we compute the parameters of all possible feature orientations and accumulate
these values in a parameter space.
Say we wish to detect straight lines which are parameterized by an intercept parameter, $c$, and a gradient parameter, $m$.
The size of the parameter space (also known as the Hough Space) would be $N_c \times N_m$; the range and the number of
values of $c$ and $m$ are arbitrary and will be discussed further in the next section. Consider a high binary edge pixel at the
center of a 100x100 image, shown in Figure 12 (a). To find all possible line orientations for that particular pixel we
simply compute the intercepts for the entire $N_m$ range. This set of $N_m$ gradient and intercept pairs is then accumulated
in the Hough Space shown in Figure 12 (b). The Hough Space is represented using a color-map, with a color-bar to the right
of the figure indicating the value of a specific element in the Hough Space.
Figure 12 (a) Edge image with a single high binary pixel at its center and (b) the corresponding set of accumulation points in a color-map representation of the Hough Space.
Referring to figure 12 (b), notice that the maximum element value in the Hough Space is 1 as there is only 1 voting pixel.
We can therefore expect a similar Hough Space profile for individual pixels as we extend to more high binary edge pixels.
Performing the above iteration for an image containing a line angled at 45°, as shown in figure 13 (a), generates the Hough Space accumulation pattern shown in figure 13 (b).
Figure 13 (a) Edge image containing a 45° line and (b) the corresponding Hough Space generated.
The Hough Space has a maximum accumulation of 20 hits, meaning that 20 pixels voted for one specific line, at an intercept of 0 and an angle of 45° as expected. This point is shown in the Hough Space as the region in red. To extract maxima in the Hough Space we apply a threshold. The Hough Space coordinates that satisfy this threshold are then the detected
lines.
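The voting procedure described above can be sketched as follows for lines of the form y = m x + c. This is a minimal illustration of a conventional Hough Transform, not the pseudo code of Appendix I: the names `hough_lines` and `detect_lines`, the dictionary used as the Hough Space and the rounding of the intercept are assumptions.

```python
def hough_lines(edge, gradients):
    """Accumulate votes for lines y = m * x + c over a binary edge image.

    edge: 2D list of 0/1 values, indexed as edge[y][x].
    gradients: candidate gradient values m (the discretisation is arbitrary here).
    Returns a dict mapping (c, m) to the number of voting pixels.
    """
    accumulator = {}
    for y, row in enumerate(edge):
        for x, pixel in enumerate(row):
            if not pixel:
                continue
            for m in gradients:
                c = round(y - m * x)          # intercept of the line through (x, y)
                accumulator[(c, m)] = accumulator.get((c, m), 0) + 1
    return accumulator


def detect_lines(accumulator, threshold):
    """Return the (c, m) pairs whose vote count meets the threshold."""
    return sorted(line for line, votes in accumulator.items() if votes >= threshold)


# A 20x20 edge image containing a 20-pixel line at 45 degrees (y = x).
size = 20
edge = [[1 if x == y else 0 for x in range(size)] for y in range(size)]
space = hough_lines(edge, [0.0, 0.5, 1.0])
lines = detect_lines(space, threshold=20)
```

Here all 20 pixels of the diagonal vote for the same (c, m) element, so a threshold of 20 isolates that single line.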
Since the Hough Transform was invented, a wide variety of versions have been introduced to suit different applications.
In the context of the deinterlacer project, where straight and low-angled lines are concerned, several features were included
to better serve our purpose. There are two major modifications made to the conventional Hough Transform. The first is that a
Cartesian coordinate system is used rather than a polar coordinate system. The second is that the set of angles a line can
take is restricted and bounded. These modifications reduce the generality of the Hough Transform in that many lines are
only coarsely approximated. The following paragraphs explain the reasoning.
To represent a straight line, the parameters may either be expressed in terms of the angle and distance to the origin (i.e. polar
coordinates) or in terms of the gradient and intercept (i.e. Cartesian coordinates). This design adopts the Cartesian coordinate
system because the offset value (i.e. the value used for directional interpolation) can be easily derived from the gradient of
a line. The main drawback of a Cartesian based method is that it cannot represent a vertical line. This disadvantage
is ignored as we are not interested in representing vertical lines.
The term bounded-offset refers to the manner in which the line gradients are discretized. It is more natural to assume that we allow lines to take discrete values at regular intervals. These intervals would then dictate the angle resolution of the
detected lines. The key modification in this design is that the lines are bounded to take values determined by the offset
values. The detectable lines are therefore bounded to angles illustrated below in figure 14.
Figure 14 Angle discretization created by neighboring pixels.
These offset values are denoted by Δx as they are inherently the change in the x direction given that the change in the y direction is always 1, which is true in our case as the interpolation always happens between the current pixel and the pixel
directly above it. Using Δx as a parameter rather than angles in degrees inevitably changes the parameter space.
A key feature of using Δx is the improvement in resolution for very low angles. The resolution of the angles is variable, unlike most parameters where the resolution remains constant throughout. The resolution is given by the derivative of the
equation describing the relationship between Δx and the angle, given in (2). This derivative is shown in (3).
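\[ \theta = \tan^{-1}\left(\frac{1}{\Delta x}\right), \quad \Delta x \in \mathbb{Z}^{+} \qquad (2) \]
\[ \frac{d\theta}{d\Delta x} = -\frac{1}{\Delta x^{2} + 1}, \quad \Delta x \in \mathbb{Z}^{+} \qquad (3) \]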
Note that the derivative is always negative, as expected, since the angle decreases as Δx increases. It is also important to note that as Δx increases the magnitude of the derivative decreases, implying that the angle resolution becomes finer, roughly in proportion to 1/Δx². This advantage is also a drawback for higher valued angles such as those above 18°, where the angle resolution is coarser than 5°. The result is that such lines are either detected as an approximation or remain undetected.
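To make the variable resolution concrete, the short sketch below evaluates the angle for each integer offset and the gap to the next detectable angle. The function names `angle_degrees` and `angle_step` are hypothetical and the offsets are assumed to be positive integers.

```python
import math


def angle_degrees(dx):
    """Angle of a line that moves dx pixels in x for 1 pixel in y."""
    return math.degrees(math.atan(1.0 / dx))


def angle_step(dx):
    """Gap in degrees between the angles for offsets dx and dx + 1."""
    return angle_degrees(dx) - angle_degrees(dx + 1)


# Angle and step for the first few offsets; the step shrinks quickly with dx.
table = [(dx, angle_degrees(dx), angle_step(dx)) for dx in range(1, 11)]
for dx, angle, step in table:
    print(f"dx={dx:2d}  angle={angle:6.2f} deg  step to next={step:5.2f} deg")
```

The step between neighbouring angles is several degrees for small offsets and well under one degree by an offset of 10, which is the behaviour equation (3) predicts.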
6.2.3 Verification of Feasibility of Deinterlacer Design and Methodology
The patent draft describes a method of the deinterlacer design with a demonstration in principle that the design should
resolve low-angled edges. As a proof of concept, the initial phase of the project was dedicated to creating a working
prototype of the deinterlacer.
The metric used for verification is the PSNR value which is a standard measure used by researchers involved with
deinterlacing or video compression and reconstruction. The PSNR value stands for the peak signal-to-noise ratio and can
be derived from a single frame or from multiple frames. The PSNR is the ratio between the peak signal value and the noise,
expressed in decibels. The peak value for an image which has a single color channel represented as 8 bits is 255. The noise
is the square root of the mean squared error (MSE) between the original and the reconstructed image. Hence, in the case of
deinterlaced video, the PSNR value can only be calculated provided that the progressive version of the video is available.
The PSNR is given by (4) and (5), where I(i,j) is the progressive frame, K(i,j) is the deinterlaced frame and both frames
contain M x N pixels.
\[ \mathrm{PSNR} = 20 \log_{10}\left(\frac{255}{\sqrt{\mathrm{MSE}}}\right) \qquad (4) \]
\[ \mathrm{MSE} = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left[ I(i,j) - K(i,j) \right]^{2} \qquad (5) \]
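A minimal sketch of the PSNR measure for a single 8-bit channel is given below. It assumes the standard definition based on the mean squared error and a base-10 logarithm; the function names `mse` and `psnr` and the choice to return infinity for identical frames are assumptions.

```python
import math


def mse(original, deinterlaced):
    """Mean squared error between two equally sized 2D frames."""
    total = 0
    count = 0
    for row_i, row_k in zip(original, deinterlaced):
        for i_val, k_val in zip(row_i, row_k):
            total += (i_val - k_val) ** 2
            count += 1
    return total / count


def psnr(original, deinterlaced, peak=255):
    """Peak signal-to-noise ratio in dB; infinite when the frames are identical."""
    error = mse(original, deinterlaced)
    if error == 0:
        return math.inf
    return 20 * math.log10(peak / math.sqrt(error))


# Every pixel of the reconstructed frame is off by 5, so the MSE is 25.
progressive = [[100, 120], [140, 160]]
reconstructed = [[105, 115], [145, 155]]
score = psnr(progressive, reconstructed)
```

A higher value means the deinterlaced frame is closer to the progressive reference.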
The PSNR measure was a means to confirm quantitatively that the deinterlacer scheme is able to reconstruct low-angled
edges whilst not introducing new image artifacts. Throughout most of the algorithm development phase of the
project, however, the frame outputs of the deinterlacer were assessed visually rather than measured quantitatively. This is
because the end users of the video outputs are people, hence it is crucial that deinterlaced videos are visually satisfactory.
By using these verification methods, various aspects of the algorithm were found lacking. A thorough description of the
caveats discovered and the improvements made in response is presented in the next section.
6.3 Research and Improvements
This section of the report is dedicated to the modifications and additions that were made to the deinterlacer algorithm.
These modifications were necessary to render satisfactory video outputs. My supervisor occasionally made suggestions on
how to improve the design, but most of the modifications were my own inventions. It is important to note that a large
part of the industrial placement was dedicated to the research and improvement of the algorithm. The research method
employed was trial-and-error based and was by no means systematic or exhaustive. To help improve the design,
I experimented with a wide variety of image processing techniques. The challenge was that, due to its novelty, few
deinterlacer papers describe an algorithm similar to the Hough-Based deinterlacer. As a result, I turned to
image processing techniques that are not usually used in deinterlacing, such as connected-component labeling and line
correlation techniques. The unconventionality of the proposed techniques meant that I had to constantly refer my ideas to
my industrial supervisor for feedback on whether he found a solution realistic and worth pursuing. A detailed
account of the research is not included in the report. The following sections only describe and highlight the
modifications that made it into the final revision of the deinterlacer design.
| Caveats in Algorithm | Modifications made in Response |
| Insufficient line resolution for detection | No solution provided; instead the algorithm relies on lines being sufficiently thick to allow line approximation. Though increasing line resolution would improve line detection, this comes at the cost of more hardware resources. |
| Over-detection of lines due to high pixel luma variation or high density of lines | I introduced a discriminatory process in the Hough that takes into account the distance between voting pixels. This reduces the overlap produced by regions with high pixel luminance variation and regions with high line density. This modification was invented by me. |
| Image artefacts produced from directional interpolation are very apparent and are difficult to contain | Introduce a processing block after the Hough Transform as a means of filtering out false positive lines by consolidating detected edges with luma, edge and offset information. Most of the consolidation methods were proposed by my supervisor and some by me. |
| Edge detection needs to prioritize horizontal edges | Removed the y-directional Sobel kernel and only depend on the x-direction Sobel kernel for line detection. This was suggested by me. |
Figure 15 Table showing the caveats in the Hough-based deinterlacer algorithm along with the modifications made in response.
6.3.1 X biased Sobel
The proposed edge detection technique was a Sobel transform. To better capture vertical gradients while ignoring
horizontal gradients, the Sobel kernel that measures the horizontal gradient was removed. The result is an edge detection
method which gives less weight to straight lines closer to the vertical while accentuating straight edges closer to the
horizontal. This simplification both improves the detection rate of low-angled lines (as fewer erroneous high pixels are
derived from horizontal gradient changes) and reduces the complexity of the edge detection process. This reduction in
complexity is, however, insignificant compared to the increase in complexity brought about by other processes. The
difference in the output binary edge image, and the clear improvement, is shown below in figure 16.
Figure 16 (a) Original grayscale image, (b) conventional Sobel Transform, (c) X biased Sobel Transform.
6.3.2 The Proximity Hough
The Proximity Hough (PH) method is the name I gave to a discriminatory process I incorporated into the traditional Hough
method. This new Hough method was invented by me and it forms the bulk of the research value added to the deinterlacer
algorithm. Not only does the line-detection rate improve drastically, but the method is also very resistant to noise. Noise
in this context means high binary pixels which do not form a straight edge; these can be jagged edges, curvy edges or simply
regions of high pixel luma variation.
The objective of introducing the PH method was to tackle the problem of over-detection. Over-detection occurs when the
Hough Space gets too cluttered due to regions in the image with high luma variation or a high straight-line count. The
Hough Spaces in figure 17 clearly demonstrate this effect by performing the Hough Transform on an original image with
high binary pixel clusters and on an image with the high binary pixel clusters artificially removed.
Figure 17 Difference in the Hough Space produced by images with and without high binary pixel clusters, panels (a) to (d).
The PH transform works by storing additional information about each high binary edge pixel inside memory blocks.
We can think of these memory blocks as bins, but in reality they do not accumulate anything and
are simply overwritten with a new value. The information stored in these bins is the coordinates (x,y) of the voting pixel
and the x start point of the line; the accumulation bin is the only bin that accumulates. These bins have exactly the same
dimensions as the accumulation bin, which is essentially the parameter space. The stored coordinates and the x start point
are therefore unique to a particular line. Recall that each element in the parameter space represents a unique line that can
be drawn on an image.
The general method of the PH transform is that once there is a hit for a particular line, the accumulation bin will only
increment if the (x,y) coordinates of the pixel that previously voted for that line are within a tolerable range. This
tolerance is called the proximity threshold, which is the maximum gap allowed between consecutive voting pixels. Hence if
you specify a proximity threshold of 0, then voting pixels must be adjacent or diagonally adjacent to one another. If the
voting pixel is too far from the previous one, the accumulation bin is not incremented and the x start point bin
remains unchanged. The x start point is only updated when a line first gets a hit. Hence it stores the x coordinate of the
first pixel that voted for the respective line, which corresponds to the start x coordinate of the line. The y coordinate is
neglected as it can be derived mathematically from the known intercept and delta x value.
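The proximity check described above can be sketched as follows. This is an illustrative reading of the text, not the pseudo code of Appendix I. The assumptions are: a line is identified by an integer intercept c = y - floor(x / dx) and a positive offset dx, pixels are visited in raster order, the gap is measured as the Chebyshev distance minus one, and the stored coordinates of the last voter are updated on every hit. The name `proximity_hough` is hypothetical.

```python
def proximity_hough(edge, offsets, proximity_threshold=0):
    """Hough voting in which a vote only counts if it is close to the previous voter.

    edge: 2D list of 0/1 values, indexed as edge[y][x].
    offsets: candidate positive delta x values.
    Returns (votes, start_x), both dicts keyed by (intercept, dx).
    """
    votes = {}      # accumulation bin
    last_xy = {}    # coordinates of the most recent voting pixel per line
    start_x = {}    # x coordinate of the first pixel that voted per line
    for y, row in enumerate(edge):
        for x, pixel in enumerate(row):
            if not pixel:
                continue
            for dx in offsets:
                key = (y - x // dx, dx)
                if key not in votes:
                    votes[key] = 1            # first hit: start the line here
                    start_x[key] = x
                else:
                    px, py = last_xy[key]
                    gap = max(abs(x - px), abs(y - py)) - 1
                    if gap <= proximity_threshold:
                        votes[key] += 1       # close enough: accept the vote
                last_xy[key] = (x, y)         # assumption: always track last voter
    return votes, start_x


# A contiguous low-angled staircase line: 3 pixels per row over 4 rows.
line = [[1 if 3 * y <= x < 3 * y + 3 else 0 for x in range(12)] for y in range(4)]
line_votes, line_start = proximity_hough(line, [3])

# Two isolated pixels that happen to lie on the same parameterised line.
sparse = [[0] * 12 for _ in range(4)]
sparse[0][0] = 1
sparse[3][9] = 1
strict_votes, _ = proximity_hough(sparse, [3], proximity_threshold=0)
loose_votes, _ = proximity_hough(sparse, [3], proximity_threshold=100)
```

With a threshold of 0 the two isolated pixels contribute only one vote, whereas a very large threshold reproduces the conventional count of two; the contiguous line keeps all 12 votes in both cases.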
Note that the PH transform uses 3 additional bins compared to the conventional Hough transform. It also requires
more arithmetic computations and operations. Pseudo code of the conventional and the Proximity Hough transform is
included in Appendix I. This increase in complexity requires more FPGA hardware resources and increases the latency.
The generated hardware still sustains a throughput of one pixel every clock cycle, as much of the algorithm is easily
pipelined. It is key to note that in terms of memory, the Proximity Hough requires constant reading and writing to all bins
whereas the conventional Hough only requires a write, as no feedback is required to validate a particular line hit. A
comparison of resource usage between the Proximity Hough and the conventional Hough is shown in figure 18. The
resource estimation is generated using the Altera Offline OpenCL compiler, which is able to estimate the amount of
resources a design would occupy given the development board or acceleration card used. In our case, these values were
generated based on the Nallatech pcie385n A7 accelerator card.
Figure 18 Bar graph of the resource usage, as a percentage of a Stratix V A7 FPGA.
The improvement in detection performance is due to the reduced clustering of the Hough Space. Referring to figure 19 we
can observe that the Hough Space produced using PH (figure 19 (b)) is less clustered than the Hough Space produced using
the conventional Hough (figure 19 (a)). The discriminatory process of the PH reduces the clustering of line hits and hence over-detection is avoided.
Figure 19 Hough Space of the conventional (a) and the Proximity Hough transform (b), along with the binary
edge image input with high binary pixel clusters.
[Bar chart: Resource Usage of Conventional and Proximity Hough Transform. Horizontal axis: Resource Type (Logic utilization, Dedicated Logic Registers, Memory Blocks, DSP Blocks). Vertical axis: Resource Usage (%), 0 to 70. Series: Conventional Hough, Proximity Hough.]
The improvement the PH brings, both in robustness to regions of high pixel luma variation and in the rate of
lines detected, is apparent, though it increases the hardware resource usage. Despite this increase, the video quality output
of the Hough-Based Deinterlacer is greatly dependent on the performance of the line detection process. Hence it is justified
to invest a lot of research and development time and hardware resources in the line detection process.
An interesting improvement that could reduce the resource usage and the complexity of the Hough Transform is to map the
discriminatory process to a form of Hough Space filtering. Parameter space filtering of the Hough Transform is not novel
and has been used in other applications. This improvement would require considerably more research time and hence was
set aside so that hardware implementation and other research areas could be explored. The next major research area is the
post-processing method, which is comparable in terms of resource usage to the line detection process.
6.3.3 Post-Processing Block
The PH transform reduces the image artifacts generated by decreasing the chances of detecting an erroneous line. Despite this
improvement, image artifacts still appear in certain regions of the image, namely along the periphery of edges, at
intersections of edges and at edge endpoints. The objective of the post-processing block is to reduce the probability of
generating image artifacts once the line detection block has completed. Examples of the occurrence of these image artifacts
are shown in figure 20.
Figure 20 (a) Image artifacts occurring along the periphery of edges. (b) Refined output from the
post-processing block.
The post-processing block eliminates the above artifacts by conducting luminance and offset checks. These checks
exploit the characteristic of similar luminance occurring along edges and consolidate the offset mask with the edge
image. The effectiveness of these checks was verified visually. The post-processing block also includes a smoothing of
deinterlaced pixels. This smoothing consists of a blend between the base interpolation and the offset interpolation. This
blend reduces any drastic change in pixel luminance introduced by offset interpolation. Detailed descriptions of the checks
conducted are included in the appendix under the Post-processing section (Appendix II). The introduction of the X-biased
Sobel, Proximity Hough and Post-Processing blocks was crucial in making the initially proposed method a feasible one. I
found this section of the internship particularly challenging due to the elaborate solutions that I had to invent. Several
other solutions, such as connected-component labeling and straight-line correlation techniques, were
initially pursued but later discarded due to the high additional complexity introduced and poor robustness to different video
sequences. The final algorithm still has room for improvement, namely where video sequences with highly textured regions
are concerned, though the algorithm produces impressive edge reconstruction when macro-lines are present in an image.
The next major section of the placement is the implementation of the algorithm in hardware using Altera's High-Level Synthesis tool.
The final post-processing algorithm was translated into hardware. During this translation several key considerations had to
be made to generate a pixel throughput of 1 every clock cycle. This was a challenge as the post-processing algorithm consists
of several blocks (see appendix for detailed description of these blocks) that are inherently dependent upon the outputs of
previous post-processing blocks. To solve this, intermediate shift-buffers were introduced in between these blocks. These
buffers are read from multiple times throughout the main loop but are written to only once. By introducing latency, data-
dependencies in the main loop in the post-processing can be eliminated entirely. A graphical representation of the post-
processing algorithm showing the numerous internal blocks and shift buffers can be found in the appendix. Theoretically,
with the inclusion of these shift-buffers, the final algorithm should be easily translated into hardware that would generate a
throughput of 1 pixel every clock cycle. The challenge in the hardware implementation phase is to write the post-processing
algorithm in a fashion in which the high-level synthesis compiler would generate the intended hardware.
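The role of the intermediate shift-buffers can be illustrated in software with a small model. This is only a behavioural sketch in Python, not the OpenCL kernel code: the class name `ShiftBuffer`, its methods and the example `windowed_sum` stage are hypothetical. The buffer is written once per pixel and read at several fixed delays, which is the access pattern described above.

```python
from collections import deque


class ShiftBuffer:
    """Fixed-depth shift register: one write per step, many reads at fixed taps."""

    def __init__(self, depth, fill=0):
        self._taps = deque([fill] * depth, maxlen=depth)

    def push(self, value):
        """Shift in the newest value; the oldest value falls off the far end."""
        self._taps.appendleft(value)

    def tap(self, delay):
        """Read the value written `delay` steps ago (0 is the newest)."""
        return self._taps[delay]


def windowed_sum(stream, depth=3):
    """Example downstream stage: one output per input, built from several taps."""
    buffer = ShiftBuffer(depth)
    outputs = []
    for value in stream:                 # one iteration models one clock cycle
        buffer.push(value)               # single write
        outputs.append(sum(buffer.tap(d) for d in range(depth)))   # multiple reads
    return outputs


result = windowed_sum([1, 2, 3, 4])
```

Each iteration produces exactly one output, at the cost of the outputs lagging the inputs by the buffer depth, which corresponds to trading latency for a throughput of one pixel per cycle.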
6.4 Conclusion and Personal Reflection
The conclusion of this phase of the placement is that the proposed deinterlacer method is a feasible solution to the
low-angled problem. The algorithm works particularly well for refining the image artifacts at edges with a high luma difference.
This is key as the regions of high luma difference are the regions that are most apparent to the human eye. A good example of
the result at regions with high luma difference is along the Ping Pong table edges shown in figure 21. An unexpected
improvement in performance was also discovered, which further supports the feasibility of the design: slightly curved
edges that exhibit similar image artifacts are also detected and refined by the final algorithm.
Though the development was challenging due to the fine-tuning of design parameters across several video sequences, I was
motivated by the fact that the algorithm I was developing tackles a real engineering problem in the interlaced video
broadcast industry. The algorithm development was very rewarding as many additions to the algorithm, such as the
discriminatory Hough process and the post-processing block, are novel methods. Hence I have contributed to the
deinterlacer design not only by implementing it in C but also by adding 4 months of research value to the algorithm. I have
experienced the difficulty of designing a robust image processing algorithm that produces consistent results across the many
variations in video sequences. At the end of this phase the algorithm is finalized and ready to be implemented in hardware.
Figure 21 Output of the FDD method (a) compared with the final result of the Hough-Based deinterlacer design
(b), taking as input the 187th frame of the Table Tennis sequence.
7 Altera OpenCL and High-Level Synthesis
The second phase of the project involves the implementation of the deinterlacer design in hardware using Altera high-level
synthesis tools. A high-level synthesis tool is a program that generates RTL or logic from a higher-level programming
language such as Python or C. The target tool for the deinterlacer design is called the Altera high-level synthesis (HLS) tool.
The Altera HLS is in its infancy and the objective of this phase of the project is to validate the tool by recording the
development time, identifying bugs and providing user feedback. The Altera HLS compiler uses the same high-level synthesis
process as a more mature high-level synthesis tool called the Altera OpenCL compiler. The Altera OpenCL tool flow has useful
debug tools such as resource estimation and hardware emulation (elaborated in the next section). This provides easier C code
translation to hardware while still enabling validation of the high-level synthesis process working at the back-end of the
Altera OpenCL compiler. Despite this similarity there are several differences between Altera OpenCL and HLS.
The distinction between the Altera OpenCL and HLS compilers is in the target users. The increase in available FPGA
resources, brought about by better manufacturing technology and FPGA architecture, has made FPGAs an attractive
solution for high-performance computing. The FPGA accelerates computations that have few data dependencies and
are easily parallelized. But to leverage the FPGA, the user requires a background in hardware design and
familiarity with hardware development tools. To widen the range of users, Altera provides the Altera OpenCL
compiler (AOCL) as a solution. The AOCL compiler is based on a parallel-programming standard called OpenCL. This
standard supports a multitude of different platforms such as DSPs, CPUs, GPUs and FPGAs. Hence it is possible to create
a high-performance computing system with OpenCL that allows offloading to a variety of different platforms. The AOCL
is an elegant tool for software engineers who simply wish to implement their software on an FPGA, due to the abstraction
from hardware interfaces and the complete generation of the final hardware system. On the other hand, the lack of control
over the hardware interfaces generated is unattractive to hardware designers. Altera provides another compiler, Altera HLS,
to target these hardware designers. The objective of the Altera HLS is mainly to generate single IP cores rather than an
entire hardware system. The generated IP core can then be instantiated and incorporated in a hardware system via Qsys.
The HLS compiler would provide both internal and external benefits to Altera. Internally, the HLS compiler could be used
to accelerate hardware prototyping. This is demonstrated in the hardware translation of the Hough-Based Deinterlacer
design. It is estimated that implementing the design into hardware would require 1 man-year, whilst the hardware translation
phase of this project was completed in under a month. Fast prototyping would help generate quick outputs to assess the
feasibility of an algorithm. The only drawback the HLS compiler would have is that it will not provide the hardware designer
register level control of the hardware. Externally, the HLS tool will provide customers an easy tool to modify or adapt
existing reference designs or Altera IP to suit their application. They would be able to tweak sections of C/C++ code to
include new components or change design parameters.
7.1 The OpenCL tool flow
The Altera OpenCL tool flow is designed to speed up hardware development and design by introducing an Emulator
and Optimization reports generated from OpenCL kernels. The OpenCL kernels are C functions with restrictions and
extensions imposed by OpenCL. These restrictions and extensions provide a framework for data-parallel programming.
The kernels are written in a single-threaded, task-based fashion and hardware parallelism is inferred by unrolling loops
and pipelining computations. Hardware parallelism can also be inferred by replicating (i.e. vectorizing) kernels. Kernels
are launched via an OpenCL host program, which is an executable. In addition to launching kernels, the host program
allocates, reads and writes to the target device's global memory. The host program is built using any standard C compiler,
which gives the flexibility to include any readily available C library in our host program. For this application, the OpenCV
library is included to read and write images to and from the target hardware.
The Altera OpenCL tool flow mimics a software-like debug flow due to the relatively short compile time of the emulation
and optimization report generation. Emulation and Optimization reports take seconds to compile which is in stark contrast
to hardware compiles which usually take hours to complete. The complete OpenCL tool flow is shown in figure 21.
Figure 21 Flow chart of Altera OpenCL tool flow.
The Emulation feature allows functional debug of a design without any hardware generation. The tool produces a binary
file (.aocx) which contains program objects that target the FPGA. The binary file can be executed on any x86 processor
to simulate the hardware generated. Hardware performance is not provided by the emulation as no actual hardware is
running. Once the design generates output and is verified to be correct, the next stage is to improve the efficiency of the
hardware generated and get a resource estimate. This feedback is provided by the optimization report.
The optimization report gives the engineer an idea of how efficiently the compiler has generated the hardware. The report
consists of a list of successfully pipelined code sections, serially executed code sections, data dependencies and a resource
estimate for the design. This stage is crucial for the final hardware performance. Most of the development time is spent
reducing the data dependencies flagged by the report. A snippet from an optimization report is shown below in Figure 22.
Figure 22 Snippet from an optimization report showing sections of pipelined code, serially executed code, data
dependencies and a resource estimation.
7.2 Compromises and Optimizations
The hardware implementation phase consisted of 4 main kernel code revisions: serial hardware execution,
hardware parallelism introduced, data dependencies removed and optimized hardware. The code was
modified based on feedback from the optimization report. The goal was essentially to remove
data dependencies where possible to enable parallelism, to have pixel buffers inferred as shift registers, and to remove
conditional loops and conditional memory accesses. The progress in terms of resource usage and execution time across the 4 code revisions is
displayed in the bar charts of Figures 23 and 24.
Figure 23 Bar chart showing FPGA resource usage for all hardware revisions.
[Chart "FPGA Resource Usage": Memory Blocks, Logic Utilization, DSP Blocks and Logic Registers for each revision (Serial Hardware Execution, Hardware Parallelism Introduced, Data Dependencies Removed, Optimized Hardware).]
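Figure 24 Bar chart showing kernel execution time of all hardware revisions running at 150 MHz.
[Chart "Kernel Execution Time": Sobel Kernel, Hough Kernel, Line Drawer Kernel, Interpolator Kernel and Frame Generator for each revision (Serial Hardware Execution, Hardware Parallelism Introduced, Data Dependencies Removed, Optimized Hardware).]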
The first code revision (serial hardware execution) was meant to get the design to fit on a Stratix V A7 FPGA by
compromising design performance and reusing hardware blocks. To decrease the deinterlacer resource usage, the range of
y-intercept values was reduced, at some cost to the quality of the final video output. Because the angles of interest
are relatively small (i.e. less than 45 degrees), this compromise did not significantly affect the final video output, making it a
reasonable tradeoff.
The second code revision (hardware parallelism introduced) aimed to generate efficient pixel buffers to reduce resource
usage. Pixel buffers were initially included in the algorithm translation phase to remove data dependencies by introducing
latency. A pixel buffer stores a preset number of pixels (depending on the process): the first element of the
buffer is updated with the incoming pixel value and the final element in the buffer is discarded. An efficient way of
implementing pixel buffers is to use shift registers. A shift register mimics an array that has all its elements shifted
by one index while the first element is updated with a new value. For a pixel buffer to be interpreted as a shift register, the
access indexes have to be known at compile time. The shift register forms a delay line with signal taps at the respective access
indexes. Hence the number of access indexes also had to be kept to a minimum to reduce the hardware signal taps generated. Pixel
buffers with dynamic indexes were written in C as circular buffers, which infer a memory block that is
constantly read from and written to.
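The two buffer styles described above can be sketched in plain C as follows. This is a minimal illustration and not the placement's kernel code: the names `shift_buffer`, `shift_in`, `tap_difference`, `ring_buffer`, `ring_push` and `ring_read`, and the sizes `TAPS` and `RING`, are all hypothetical. The first buffer is only ever read at constant indexes, which is the pattern the text says can be interpreted as a shift register; the second is read at a run-time index, which is the pattern the text says infers a memory block.

```c
#include <assert.h>
#include <stddef.h>

#define TAPS 5    /* hypothetical delay-line length */
#define RING 8u   /* hypothetical circular-buffer depth, a power of two */

/* Shift-register style buffer: every access index is a compile-time constant. */
typedef struct {
    unsigned char px[TAPS];
} shift_buffer;

/* Shift all elements up by one index and store the new pixel at index 0.
   The value previously held at index TAPS-1 drops out of the buffer. */
static void shift_in(shift_buffer *sb, unsigned char pixel)
{
    assert(sb != NULL);
    for (int i = TAPS - 1; i > 0; --i) {   /* fixed trip count */
        sb->px[i] = sb->px[i - 1];
    }
    sb->px[0] = pixel;
}

/* Example consumer that reads only two fixed taps (indexes 0 and 2). */
static int tap_difference(const shift_buffer *sb)
{
    assert(sb != NULL);
    return (int)sb->px[0] - (int)sb->px[2];
}

/* Circular-buffer style: the read index is computed at run time. */
typedef struct {
    unsigned char data[RING];
    unsigned head;                         /* index of the newest pixel */
} ring_buffer;

static void ring_push(ring_buffer *rb, unsigned char pixel)
{
    assert(rb != NULL);
    rb->head = (rb->head + 1u) & (RING - 1u);   /* wrap without a modulo */
    rb->data[rb->head] = pixel;
}

/* Read the pixel that was pushed `delay` pushes before the newest one. */
static unsigned char ring_read(const ring_buffer *rb, unsigned delay)
{
    assert(rb != NULL && delay < RING);
    return rb->data[(rb->head + RING - delay) & (RING - 1u)];
}
```

After pushing 10, 20 and 30 into a zero-initialized `shift_buffer`, taps 0 and 2 hold 30 and 10, so `tap_difference` returns 20. The same three pushes into a `ring_buffer` make `ring_read(&rb, 2)` return 10, but through an index that is only known at run time.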
The goal of the third code revision (data dependencies removed) was to remove data dependencies in the post-processing
kernel. As mentioned in Section 6.3.3, the post-processing algorithm inherently has data dependencies, which were removed
by introducing shift buffers. The ideal result is a throughput of one pixel per clock cycle. Although the C code was written to
generate hardware without these data dependencies, the compiler does not necessarily generate the intended hardware.
To achieve this performance, three kinds of construct need to be avoided: conditional reads from
memory, conditional for loops, and buffer indexes that depend on a value from the same buffer. Conditional reads from memory and conditional for loops were removed by always reading from memory and always performing the loop iterations, with a boolean variable validating the final assignment. Indexes that depend on a value from the buffer itself were resolved
by duplicating the buffer and deriving the index from the cloned buffer. These amendments to the C code successfully
generated post-processing hardware that had no data dependencies in the main loop. The removal of data dependencies
from the main loop also allows the compiler to infer efficiently pipelined computations. The efficiency of a pipelined loop is
measured by the number of clock cycles between successively launched iterations. At the end of this process, all serially
executed sections had a pipeline efficiency of 50% (i.e. 2 clock cycles between iterations). Kernel-level pipelines were also
introduced to enable a new set of inputs to be processed while the previous set of inputs was sent to the next kernel. The final
improvement made during this hardware revision was task parallelization through loop unrolling. For loops in the post-processing blocks were unrolled. This unrolling increases resource usage but gives a faster execution time.
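The rewrite of conditional reads and conditional for loops described in this revision can be illustrated with a small plain C sketch. It is not the post-processing kernel itself: the functions `sum_conditional` and `sum_predicated`, the parameter names and the fixed trip count `WINDOW` are hypothetical. The first function reads memory only when enabled and loops a data-dependent number of times; the second always reads and always runs `WINDOW` iterations, and a boolean decides whether each value contributes to the result.

```c
#include <assert.h>
#include <stddef.h>

#define WINDOW 4   /* hypothetical fixed trip count */

/* Data-dependent form: both the memory read and the loop bound depend on inputs. */
static int sum_conditional(const int *mem, size_t base, int enable, int count)
{
    assert(mem != NULL);
    int acc = 0;
    if (enable) {
        for (int i = 0; i < count; ++i) {   /* variable trip count */
            acc += mem[base + i];           /* conditional read */
        }
    }
    return acc;
}

/* Predicated form: always read, always iterate WINDOW times, and let a
   boolean validate the final accumulation.
   Assumption: mem holds at least base + WINDOW elements, so the
   unconditional reads stay in bounds. */
static int sum_predicated(const int *mem, size_t base, int enable, int count)
{
    assert(mem != NULL);
    assert(count >= 0 && count <= WINDOW);
    int acc = 0;
    for (int i = 0; i < WINDOW; ++i) {          /* fixed trip count */
        int value = mem[base + i];              /* unconditional read */
        int valid = enable && (i < count);      /* boolean guard */
        acc += valid ? value : 0;               /* guarded contribution */
    }
    return acc;
}
```

Both functions return the same result whenever `count` is at most `WINDOW`. The predicated form costs extra reads and iterations, which is the kind of trade the text describes making so that the compiler can pipeline the loop.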
The final hardware revision aimed to produce optimized hardware. This was achieved by removing division
by a constant, division by a variable and the modulo operator. These operations are expensive because of the floating-point division involved: division is
more expensive in hardware than multiplication. A division by a constant can be approximated by a
multiplication by a constant followed by a division by a power of 2, which in hardware is a shift operation and hence relatively
cheap. The constant of interest is the width of a video frame. For a video frame with a width of 720, for example, the division can be
approximated by a multiplication by the constant 91 followed by a right shift of 16 bits (a division by 2 to the power of 16). The division by a variable is
translated into a ROM lookup table. Note that the denominator and the numerator of this particular
division operation are bounded to a limited set of discrete values (i.e. -64 to 64 for the denominator and 0 to the width of the video
frame for the numerator). Hence a lookup table is apt. These alterations to the C code reduced the resource usage and
execution time.
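Both replacements can be sketched in plain C. The constants 91 and 16 come from the text (65536 / 720 is about 91.02); everything else is an assumption made for illustration, including the names `div720_approx`, `init_recip_lut` and `div_lut`, the input bound in `div720_approx`, and the choice to organize the lookup table as rounded-up fixed-point reciprocals indexed by the magnitude of the denominator. The report does not say how its ROM table was laid out.

```c
#include <assert.h>
#include <stdint.h>

#define RECIP_720 91u   /* from the text: 2^16 / 720 is about 91.02 */
#define SHIFT     16    /* from the text: divide by 2^16 with a shift */
#define DEN_MAX   64    /* from the text: denominator lies in -64..64 */
#define NUM_MAX   720   /* assumed frame width bounding the numerator */

/* Approximate x / 720 as (x * 91) >> 16. Because 91 is slightly below
   65536 / 720, the result can fall one short of the exact quotient. */
static uint32_t div720_approx(uint32_t x)
{
    assert(x < (1u << 20));   /* hypothetical bound; keeps x * 91 within 32 bits */
    return (x * RECIP_720) >> SHIFT;
}

/* Reciprocal table indexed by |denominator|. In a kernel this would be a
   constant array so that a ROM can be inferred; it is filled at run time
   here only to keep the sketch short. */
static uint32_t recip_lut[DEN_MAX + 1];

static void init_recip_lut(void)
{
    recip_lut[0] = 0u;   /* division by zero is excluded by div_lut */
    for (int d = 1; d <= DEN_MAX; ++d) {
        /* ceil(2^16 / d), rounded up so the product never undershoots */
        recip_lut[d] = (uint32_t)(((1 << SHIFT) + d - 1) / d);
    }
}

/* Divide num by den using one table read, one multiply and one shift.
   The sign is handled separately so the shift acts on an unsigned value. */
static int32_t div_lut(int32_t num, int32_t den)
{
    assert(num >= 0 && num <= NUM_MAX);
    assert(den != 0 && den >= -DEN_MAX && den <= DEN_MAX);
    uint32_t mag = (uint32_t)(den < 0 ? -den : den);
    uint32_t q = ((uint32_t)num * recip_lut[mag]) >> SHIFT;
    return den < 0 ? -(int32_t)q : (int32_t)q;
}
```

Call `init_recip_lut()` once before using `div_lut`. With these bounds the table-based quotient matches C integer division for every numerator from 0 to 720 and every non-zero denominator from -64 to 64. The multiply-and-shift by 91 is less exact: for example `div720_approx(72000)` gives 99 rather than 100, and for inputs below 720 x 576 (576 being an assumed frame height) it is either exact or one too low.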
The improvement in kernel execution time and FPGA resource usage brought about by the four code revisions is apparent
and depends heavily on writing the kernel in a manner from which the compiler can extract parallelism. The key technique
is to reduce data dependencies, conditional reads and conditional for loops. The hardware can be further optimized by reducing division operations and translating functions with a limited input range into lookup tables.
7.3 Conclusion and Personal Reflection
An earlier introduction to the OpenCL and HLS tools at the onset of the placement would have produced a better final
deinterlacer design. There was a lack of hardware consideration during the development of the algorithm. Hardware
considerations such as memory access patterns (i.e. random access or bit-streaming), task parallelism and pipelining were not
taken into account. Therefore sections of the algorithm that could not easily be translated into a bit-streaming structure had
to be rewritten in a bit-streaming fashion to achieve the same output, which often resulted in a degraded version of the
design. About 2 weeks were dedicated to rewriting the algorithm in a bit-streaming fashion. Experimenting with the high-
level synthesis tools and understanding how to generate efficient hardware for video processing would have reduced the
time spent in the hardware translation phase.
Overall this phase took about a month. The weekly video conference calls with Andrei Hagiescu in Toronto were especially
helpful. He provided crucial guidance and feedback on the direction and improvement of the design. The final code revision
generates hardware that is able to process video input at 720i at 30 frames per second. This performance is not up to par with
the market demand of deinterlacing 1080i video at 60 frames per second. I am confident that, given more time to develop
the hardware using OpenCL, or perhaps a translation to Verilog, the design would achieve the benchmark of 1080i at 60 frames per
second.
8 Offsite and Extra Activities
On the 6th of August I visited the Altera site in Penang to meet the Malaysian team managed by Benjamin Cope.
Benjamin Cope is the manager of the Altera team I am in (i.e. the Video IP team) and he suggested the visit. The aim of the
visit was to observe how Altera in Malaysia operates. The site is located on the island of Penang, situated off the west coast
of the Malay Peninsula. The island has a specially designated industrial zone that hosts many manufacturing, research
and electronics companies such as Intel, AMD, Motorola, Agilent and Altera. I was shown around the site by Ivan Teh,
who is the manager of the Malaysian VIP team. I spoke with the Malaysian VIP team about work-life balance
and the benefits Altera provides. I was also impressed by the sheer size of Altera in Malaysia, where there are
about 1000 employees. The visit was eye-opening as it made me realize the scale on which Altera operates internationally
and its strong presence in the growing Southeast Asian region. I would definitely consider working at an Altera site closer to home.
The IBC (International Broadcasting Convention) is an annual event held at the RAI Amsterdam and runs for 5 days. The
conference attracts companies and organizations involved in future solutions of electronic media and technology. Altera
operates a booth and I was given the privilege of submitting a poster that describes the Hough-based deinterlacer. The poster is
included in Appendix III. The poster was displayed in the IBC Future Zone, which is a fairly new exhibition included
in the conference. The purpose of the Future Zone is to showcase interesting ideas and projects from research and
development labs and universities. The organizer of the Future
Zone hopes to create an online archive where submitted posters and papers can be stored and accessed. I went to
Amsterdam on the 12th of September and showcased my design for the entire day along with my placement supervisor, Jon
Harris. During the exhibition, several people approached us expressing their interest in the solution. Some requested a copy
of the poster and some wanted to implement the design on GPUs. The interaction with people from the broadcast
industry regarding the design reassured me that the deinterlacer project addresses a real problem in the deinterlacing world.
There were customers who expressed interest in the design and requested video samples from the deinterlacer for them to
evaluate. These video output sequences were generated and given to the customers for evaluation. A quotation for
the IP core will follow if the customers are pleased with the video sequences. The identity of the customers is
confidential, but their request for video output sequences is testament to the fact that the decompression artifacts generated
by most deinterlacers are a genuine problem in industry. The experience was both interesting and rewarding in that I was
able to work on a real engineering problem and had the opportunity to interact with potential customers.
The offsite activities have provided me with a platform to interact with both Altera employees in Penang and the potential
customers and researchers that attended the IBC conference. This exposure has helped me understand how products, software
and IP cores are developed internally and how they are then presented and marketed to customers. These activities have
definitely been insightful in terms of engineering product inception, development and marketing.
9 Conclusion
The project aims were to develop a deinterlacing algorithm that targets low-angled artifacts and to synthesize the design
into hardware using an Altera high-level synthesis compiler. I personally feel that these two objectives were met, as the
proposed method has been developed into a feasible solution and has been successfully implemented as working
deinterlacer hardware. The feasibility of the deinterlacer is assessed in terms of whether it consistently removes the targeted
decompression artifacts and whether it fits onto a single FPGA chip. Though the final deinterlacer hardware does not
meet the market performance benchmark, this benchmark is expected to be achievable given further development of the
design in either OpenCL or Verilog. I am confident that the deinterlacer IP core is robust and would
make a good addition to the set of deinterlacers available in the Altera Video IP suite licensed to customers. This confidence
comes from the consistent video output generated by the deinterlacer algorithm across 10 common interlaced test video
sequences. The robustness of the algorithm is attributed to the incorporation of two key image processing algorithms, namely the
proximity Hough transform and the post-processing block. These two processes formed the central focus of the research
done for the deinterlacer and have definitely improved the line detection rate and the edge refinement ability of the design.
The industrial placement achievements are summarized in the bulleted list below:
• Invented a discriminatory process for the Hough transform to improve line detection rate and robustness to high pixel luminance variation.
• Invented a post-processing algorithm that consolidates detected edges to eliminate artifacts generated by directional interpolation.
• Successfully implemented the deinterlacer on an FPGA in under a month using the Altera OpenCL compiler.
• Verified that the hardware generates correct output.
• Achieved sub-par video performance in a limited amount of hardware development time and am confident that further hardware optimization can achieve industry-standard video deinterlacing performance.
• Completed algorithm development and hardware implementation of the design in 6 months.
• Received positive feedback from customers who compared the Hough-based deinterlacer design to their existing deinterlacer (identity of company cannot be disclosed).
10 Appendix
I Pseudo Code of Conventional and Proximity Hough Transform
II Post-Processing Block
Post Processing Functional Flow Diagram
Post-processing block descriptions
Module | Sub kernel | Block | ID | Description
Edge Image Generator | SK0 | Edge image generator | EG_0 | Generates binary edge image
OM Generator (offset mask) | SK1 | Edge consolidation | Om_0 | Consolidates Offset Mask and Binary Edge Image
OM Generator (offset mask) | SK1 | Offset Mask Check | Om_1 | Checks presence of offset mask along suggested pixels
OM Generator (offset mask) | SK1 | Offset Mask Expander | Om_2 | Expands offset mask to include more peripheral pixels
OM Generator (offset mask) | SK1 | Offset Mask Estimator | Om_3 | Estimates true offset mask (needed because the detected offset mask is an approximation)
WM Generator (weight mask) | SK2 | Offset mask raw | Wm_1 | Generates a weight mask from an offset mask
WM Generator (weight mask) | SK2 | Post offset mask end roll off | Wm_2 | Introduces a post roll-off weight to smooth the final output
WM Generator (weight mask) | SK2 | Pre offset mask end roll off | Wm_3 | Introduces a pre roll-off weight to smooth the final output
WM Generator (weight mask) | SK3 | Luminance check | Wm_4 | Checks that the top and bottom luminance vary significantly, which implies an edge
WM Generator (weight mask) | SK3 | Top and bottom check | Wm_5 | Checks that the weight mask exists at the top and bottom of the targeted pixel
WM Generator (weight mask) | SK3 | Offset Check | Wm_6 | Checks that luminance along the edge does not vary by more than a tolerance level
WM Generator (weight mask) | SK3 | Average Weight | Wm_7 | Averages the weight mask to allow a smoother transition between edge and non-edge regions
Interpolator | SK4 | Interpolator | Interpolate | Performs linear or directional interpolation based on the offset and weight masks
III IBC paper submission
Using a Bounded Offset Hough Transform For Edge-Dependent Deinterlacing
Jon Harris (Altera) and Abdulaziz Azman (Imperial College London)
Abstract This paper presents an edge-dependent deinterlacer scheme which uses
a Bounded Offset Hough Transform. The scheme aims to resolve low-angled
artefacts which occur in images produced by most intra-field deinterlacing
methods. Existing edge-dependent deinterlacers analyze neighboring pixels to
recover edge information. While this is sufficient for edges closer to the vertical,
low-angled edge information is rarely sufficiently recovered. This is due to the
inherent lack of vertical edge information in low-angled edges. The scheme
proposes the use of a variant of the Hough Transform to analyze non-localized
pixels to extract more edge information which will be used for edge-dependent de-
interlacing. Comparing the output of the scheme with several known deinterlacing
schemes suggests significant improvement in image quality output.
I. Introduction
Interlaced scan signals were initially used in analogue CRT television
to improve the video frame rate without requiring additional bandwidth.
This is achieved by sampling only the odd or the even horizontal lines of an
image. The sampled image from one time instance in an interlaced video
is called a sub-field. Though the vertical resolution is effectively halved,
the temporal resolution is doubled which reduces image flicker in CRT
television. Interlaced video eventually formed the basis of Analogue
broadcast systems such as PAL and NTSC. Modern digital displays such
as LCD and plasma screens use a progressive video format which captures
and displays all horizontal lines at the same instance. To display interlaced
video on modern digital displays, a deinterlacer is required. An ideal
deinterlacer algorithm will be able to fully recover the missing horizontal
information in interlaced video. Full reconstruction is not an easy task and
theoretically impossible based on the Nyquist Sampling Theorem. Visual
artefacts from de-interlacing are therefore hard to avoid. Further
information regarding de-interlacing can be found in [1].
Existing deinterlacers can be crudely categorized into inter-field and
intra-field. Inter-field deinterlacers mainly extract motion information and
perform deinterlacing accordingly. The main drawbacks of inter-field
deinterlacing are its high complexity and its dependence on the reliability of the motion
detection algorithm. Motion detection fails when large displacements are
involved, which often results in poor video output. In contrast, intra-field
deinterlacers only process a single sub-field and are therefore
algorithmically less complex than their inter-field counterparts.
Existing Intra-field deinterlacer schemes process and i