Context-Aware Video Reconstruction for Rolling Shutter Cameras

Bin Fan    Yuchao Dai*    Zhiyuan Zhang    Qi Liu    Mingyi He
School of Electronics and Information, Northwestern Polytechnical University, Xi’an, China

Abstract

With the ubiquity of rolling shutter (RS) cameras, it is becoming increasingly attractive to recover the latent global shutter (GS) video from two consecutive RS frames, which also places a higher demand on realism. Existing solutions, using deep neural networks or optimization, achieve promising performance. However, these methods generate intermediate GS frames through image warping based on the RS model, which inevitably results in black holes and noticeable motion artifacts. In this paper, we alleviate these issues by proposing a context-aware GS video reconstruction architecture. It facilitates advantages such as occlusion reasoning, motion compensation, and temporal abstraction. Specifically, we first estimate the bilateral motion field so that the pixels of the two RS frames are warped to a common GS frame accordingly. Then, a refinement scheme is proposed to guide the GS frame synthesis along with bilateral occlusion masks to produce high-fidelity GS video frames at arbitrary times. Furthermore, we derive an approximated bilateral motion field model, which can serve as an alternative to provide a simple but effective GS frame initialization for related tasks. Experiments on synthetic and real data show that our approach achieves superior performance over state-of-the-art methods in terms of objective metrics and subjective visual quality. Code is available at https://github.com/GitCVfb/CVR.

1. Introduction

Many modern CMOS cameras are equipped with a rolling shutter (RS); they dominate the consumer photography market due to their low cost and design simplicity, and are also prevalent in the automotive sector and the motion picture industry [16, 48, 53, 63]. In this acquisition mode, the pixels on the rolling shutter CMOS sensor plane are exposed from top to bottom in a row-by-row fashion with a constant inter-row delay. This leads to undesirable visual distortions, called the RS effect (e.g. wobble, skew), in the presence of fast motion, which hinders scene understanding and is a nuisance in photography. With the increased demand for high-quality and high-framerate video on consumer-grade devices (e.g. tablets, smartphones), video frame interpolation (VFI) has attracted increasing attention in the computer vision community. Unfortunately, despite their remarkable success, existing VFI methods [2, 18, 38, 39, 57] implicitly assume that the camera employs a global shutter (GS) mechanism, i.e. all pixels are exposed simultaneously. They are therefore unable to produce satisfying in-between frames from rolling shutter video acquired by such devices in dynamic scenes or under fast camera motion, and RS artifacts remain [9].

* Y. Dai is the corresponding author ([email protected]).

Figure 1. GS video reconstruction example. The left column shows the two input consecutive RS images and three ground-truth GS images at times 0, 0.5, and 1, respectively. The rows to the right show five GS frames (at times 0, 0.25, 0.5, 0.75, 1) extracted by [9] (top) and our method (bottom), followed by two corresponding zoomed-in regions. The orange box marks occluded black holes and the red box marks motion artifacts specific to moving objects. Our method recovers higher-fidelity GS images thanks to contextual aggregation and motion enhancement. Note that the black image edges produced by our method arise because those regions are not visible in either RS frame (cf. blue circle). Best viewed on screen.

To address this problem, many RS correction methods [13, 17, 24, 43, 56, 64] have been actively studied to eliminate the RS effect. In analogy to VFI generating non-existent intermediate GS frames from two consecutive GS frames, recovering the latent intermediate GS frames from two consecutive RS frames, e.g. [10, 24, 62, 63], serves as a tractable goal that overcomes the limited acquisition framerate and RS artifacts of commercial RS cameras.

This is significantly challenging because the output GS frames must be temporally and spatially coherent. To this end, traditional methods [62, 63] are often based on the assumption of constant-velocity or constant-acceleration camera motion, which struggles to accurately reflect the real camera motion and scene geometry, resulting in persistent ghosting and unsmooth artifacts [9, 24]. Recent deep learning-based solutions have achieved impressive performance, but they typically recover only one GS image corresponding to a particular scanline, such as the first [10] or central [24, 61] scanline, limiting their potential for view transitions from RS to multiple GS frames.

In this paper, we tackle the task of reviving and reliving all latent views of a scene as beheld by a virtual GS camera within the imaging interval of two consecutive RS frames. We must therefore jointly deal with the VFI and RS correction tasks, i.e. interpolating smooth and trustworthy distortion-free video sequences. It is worth mentioning that the work most relevant to ours is [9], which is dedicated to geometry-aware RS inversion by warping each RS frame to its corresponding virtual GS counterpart. Nevertheless, as illustrated in Fig. 1, the GS images recovered by [9] still suffer from two limitations:

• Masses of black holes (cf. orange box). This is a common issue for warping-based methods (e.g. [9, 44, 62–64]) due to the occlusion between the RS and GS images, and it can permanently destroy valuable image content. To maintain visual consistency, a cropping operation is used to discard the resulting holes, but this may degrade the visual experience.

• Noticeable object-specific motion artifacts (cf. red box). When recording dynamic scenes, moving objects violate the constant-velocity motion assumption of RS cameras used in [9], so it cannot accurately capture the motion boundaries of moving objects, and severe motion artifacts are generated.

In contrast, we investigate contextual aggregation and motion enhancement based on the bilateral motion field (BMF) to alleviate these issues, aiming to synthesize crisp and pleasing GS video frames through occlusion reasoning and temporal abstraction. Specifically, we propose CVR (Context-aware Video Reconstruction architecture), which consists of two stages to recover a faithful and coherent GS video sequence from two input consecutive RS images. In the first, initialization stage, we adopt a motion interpretation module to estimate the initial bilateral motion field, which warps the two RS frames to a common GS version. We design two schemes to achieve this goal: one is based on [9] and requires a pre-trained encoder-decoder network; the other is our proposed approximation of [9], which does not resort to a deep network. We show that this simple approximation provides a feasible initial prediction. Afterward, a second, refinement stage is introduced to handle the black holes and ambiguous misalignments caused by occlusions and object-specific motion patterns. By exploiting bilateral motion residuals and occlusion masks, it guides the subsequent GS frame synthesis to reason about complex motion profiles and occlusions. Furthermore, inspired by [10], we propose a contextual consistency constraint to effectively aggregate contextual information, such that unsmooth areas can be enhanced adaptively. Extensive experimental results demonstrate that our method surpasses state-of-the-art (SOTA) methods by a large margin in removing RS artifacts, while generating high-fidelity GS videos.

The main contributions of this paper are three-fold:

1) We propose a simple yet effective bilateral motion field approximation model, which serves as a reliable initialization for GS frame refinement.

2) We develop a stable and efficient context-aware GS video reconstruction framework, which can reason about complex occlusions, motion patterns specific to objects, and temporal abstractions.

3) Experiments show that our method achieves SOTA results while maintaining an efficient network design.

2. Related Work

Video frame interpolation has been widely studied in recent years and can be categorized into phase-based [31, 32], kernel-based [5, 28, 36], and flow-based [2, 18, 38, 50] methods. With the latest advances in optical flow estimation [7, 51, 52], flow-based VFI methods have been actively studied to explicitly exploit motion information. After the seminal work [18], subsequent improvements were dedicated, on the one hand, to better intermediate flow estimation, such as quadratic [57], rectified quadratic [26], and cubic [4] flow interpolation. Moreover, Bao et al. [2] strengthened the initial flow field using a predicted depth map via a depth-aware flow projection layer. Park et al. estimated a symmetric bilateral motion [38] to produce the intermediate flows directly, and recently developed an asymmetric bilateral motion model [39] to refine the intermediate frame. On the other hand, better refinement and fusion of details have been pursued, including contextual warping [2, 33, 34], occlusion inference [3, 58], and cycle constraints [27, 42] for more accurate frame synthesis, as well as softmax splatting [35] for more efficient forward warping.

All of these VFI approaches share the common assumption that the camera employs a GS mechanism. Hence, they are incapable of correctly synthesizing in-between frames for RS images.


Figure 2. RS mechanism over two consecutive frames (each of the h rows is exposed and read out sequentially, with a constant inter-row delay). We aim at recovering the latent GS images at time t ∈ [0, 1].

In this paper, we integrate an effective motion interpretation module to boost the reliable estimation of the initial flow field, yielding high-quality results without aliasing.

Rolling shutter correction advocates the mitigation or elimination of RS distortion, i.e. recovering the latent GS image, from a single frame [22, 43, 44, 64] or multiple frames [1, 15, 24, 47, 54, 62]. Dai et al. [6] derived the discrete two-view RS epipolar geometry. Zhuang et al. [62] proposed a differential RS epipolar constraint to undistort two consecutive RS images, whose stereo version was further explored in [12]. Likewise, Lao et al. [23] developed a discrete RS homography model to perform plane-based RS correction. Zhuang and Tran [63] presented a differential RS homography to account for the scanline-varying poses of RS cameras. In addition, further assumptions are often made, such as pure rotational motion [14, 22, 44, 45], Ackermann motion [40], and a Manhattan world [41]. With the rise of deep learning, many appealing RS correction results have been achieved. For two input consecutive RS frames, Liu et al. [24] put forward a deep shutter unrolling network to estimate the latent GS frame, and Fan et al. [10] proposed a symmetric network architecture to efficiently aggregate contextual cues. Zhong et al. [61] used a deformable attention module to jointly solve the RS correction and deblurring problems. Unfortunately, these methods can only hallucinate one GS image at a specific moment, e.g. corresponding to the first [10] or central [24, 61] scanline time, and thus fall short of reconstructing a smooth and coherent GS video.

Very recently, Fan and Dai [9] developed the first rolling shutter temporal super-resolution network to extract a high-framerate GS video from two consecutive RS images. It warps each RS frame to the latent GS frame corresponding to any of its scanlines through geometry-aware propagation. As a result, undesirable holes (e.g. black edges) appear due to the occlusion between the RS and GS images. Furthermore, it relies on a constant-velocity motion assumption, which does not accurately capture motion boundaries and produces artifacts around moving objects. Two examples are shown in Figs. 1 and 6. In contrast, we propose a GS frame synthesis module, composed of contextual aggregation and motion enhancement layers, to reason about complex occlusions and motion patterns specific to moving objects, resulting in significantly improved GS video reconstruction.

3. RS-aware Frame Warping

RS image formation model. When an RS camera is in motion during image acquisition, its scanlines are exposed sequentially at different timestamps. Hence, each scanline possesses a different local frame, as illustrated in Fig. 2. Without loss of generality, we assume that all pixels in the same row are exposed instantaneously at the same time. Let $h$ denote the number of rows in the image and $\tau_d$ the constant inter-row delay. The RS image formation model can then be written as:

$$\lfloor I_r(\mathbf{x}) \rfloor_s = \lfloor I_{g_s}(\mathbf{x}) \rfloor_s, \qquad (1)$$

where $I_{g_s}$ is the virtual GS image captured at time $\tau_d (s - h/2)$, and $\lfloor \cdot \rfloor_s$ denotes the extraction of pixel $\mathbf{x}$ in scanline $s$.

RS effect removal by forward warping. Since an RS image can be viewed as the result of successive row-by-row combinations of the virtual GS image sequence within the imaging duration, one can invert the above RS imaging mechanism to remove RS distortions by

$$I_r(\mathbf{x}) = I_{g_s}(\mathbf{x} + \mathbf{u}_{r \to s}), \qquad (2)$$

where $\mathbf{u}_{r \to s}$ is the displacement vector of pixel $\mathbf{x}$ from the RS image $I_r$ to the virtual GS image $I_{g_s}$. Stacking $\mathbf{u}_{r \to s}$ over all pixels yields a pixel-wise motion field, a.k.a. the undistortion flow $\mathbf{U}_{r \to s}$, which can be used for RS-aware forward warping analogous to [9, 10, 24, 61]. However, when multiple pixels are mapped to the same location, forward warping is prone to conflicts, inevitably leading to overlapped pixels and holes. Softmax splatting [35] alleviates these problems by adaptively combining overlapping pixel information. Thus, the target GS frame corresponding to scanline $s$ can be generated by

$$I_{g_s} = \mathcal{W}_F(I_r, \mathbf{U}_{r \to s}), \qquad (3)$$

where $\mathcal{W}_F$ represents the forward warping operator. We use softmax splatting in our implementation.

Problem setup. As depicted in Fig. 2, time $t$ and scanline $s$ correspond to each other. For compactness, in the following we discard the symbol $s$ and use the subscript $t$ to denote the GS image $I^g_t$ corresponding to time $t$. Following [12, 24, 63], we further assume that the readout time ratio [62], i.e. the ratio between the total scanline readout time ($h\tau_d$) and the inter-frame delay, equals one. That is, the idle time between two adjacent RS frames is ignored over a short imaging period (e.g. < 50 ms). This has proved effective for modeling the scanline-varying camera poses while avoiding non-trivial readout calibration [30]. Moreover, it ensures temporally tractable frame interpolation for RS images. See Appendix A for further details. Consequently, the central scanlines of the two consecutive RS images are recorded at time instances 0 and 1, respectively.
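To make this setup concrete, the sketch below (a hypothetical helper of our own, not part of the released code) maps each row of an RS frame to its exposure time under this convention: with the readout ratio equal to one, row $s$ of frame $i$ is exposed at $i + (s - h/2)/h$, so the central scanlines of the two frames land exactly at times 0 and 1.

```python
import torch

def scanline_times(h, w, frame_idx, gamma=1.0):
    """Per-pixel exposure time of an RS frame under the paper's setup.

    Row s of frame i is exposed at tau = i + gamma * (s - h/2) / h, so the
    central scanline of frame 0 maps to t = 0 and that of frame 1 to t = 1.
    Returns an (h, w) tensor of exposure times.
    """
    rows = torch.arange(h, dtype=torch.float32)
    tau = frame_idx + gamma * (rows - h / 2.0) / h   # (h,) per-row times
    return tau.view(h, 1).expand(h, w)               # broadcast over columns

# Example: a 480x640 RS frame pair with readout ratio gamma = 1
tau0 = scanline_times(480, 640, frame_idx=0)   # values in [-0.5, 0.5)
tau1 = scanline_times(480, 640, frame_idx=1)   # values in [0.5, 1.5)
```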


Given two RS frames $I^r_0$ and $I^r_1$ at adjacent times 0 and 1, we aim to synthesize an intermediate GS frame $I^g_t$, $t \in [0, 1]$. This time interval is chosen because, as observed in [10], many details of the recovered GS images corresponding to times $t \in [-0.5, 0) \cup (1, 1.5]$ are more likely to be missing due to excessive deviation from temporal consistency.

3.1. Bilateral Motion Field Initialization

Network-based bilateral motion field (NBMF). To deliver each RS pixel $\mathbf{x}$ exposed at time $\tau$ (i.e. $\tau_0 \in [-0.5, 0.5]$ or $\tau_1 \in [0.5, 1.5]$, with subscripts indicating the image index) to the GS canvas corresponding to the camera pose at time $t \in [0, 1]$, we need to estimate the motion field $\mathbf{U}_{0 \to t}$ or $\mathbf{U}_{1 \to t}$ (cf. Eq. (3)) that constrains each pixel's displacement. Note that the subscripts $0 \to t$ and $1 \to t$ indicate RS-aware forward warping from the RS images $I^r_0$ and $I^r_1$ to $I^g_t$, respectively. Following [9], we extend to the time dimension and model the BMF $\mathbf{U}_{0 \to t}$ and $\mathbf{U}_{1 \to t}$ by a scaling operation on the corresponding optical flow fields $\mathbf{F}_{0 \to 1}$ and $\mathbf{F}_{1 \to 0}$ between the two consecutive RS frames, i.e.

$$\mathbf{U}_{0 \to t}(\mathbf{x}) = \mathbf{C}_{0 \to t}(\mathbf{x}) \cdot \mathbf{F}_{0 \to 1}(\mathbf{x}), \qquad \mathbf{U}_{1 \to t}(\mathbf{x}) = \mathbf{C}_{1 \to t}(\mathbf{x}) \cdot \mathbf{F}_{1 \to 0}(\mathbf{x}), \qquad (4)$$

where

$$\mathbf{C}_{0 \to t}(\mathbf{x}) = \frac{(t - \tau_0)(h - \pi_v)}{h}, \qquad \mathbf{C}_{1 \to t}(\mathbf{x}) = \frac{(\tau_1 - t)(h + \pi'_v)}{h}, \qquad (5)$$

represent the bilateral correction maps. $\pi_v$ and $\pi'_v$ encapsulate the underlying RS geometry [9] and reveal the inter-RS-frame vertical optical flow, depending on the camera parameters, the camera motion, and the depth and position of pixel $\mathbf{x}$. Furthermore, the BMFs corresponding to different time steps $t_1$ and $t_2$ can be directly interconverted by

$$\mathbf{U}_{i \to t_2}(\mathbf{x}) = \frac{t_2 - \tau}{t_1 - \tau} \cdot \mathbf{U}_{i \to t_1}(\mathbf{x}), \quad i = 0, 1. \qquad (6)$$

Note that the motion field for RS removal has a significant time dependence (a.k.a. scanline dependence [9]). To capture the correction map in Eq. (5), a geometric optimization problem was posed in [62, 63] based on the differential formulation [11, 29]. Recently, as shown in Fig. 3 (a), an encoder-decoder network was proposed in [9] to essentially learn the underlying RS geometry, such that the BMF can be computed by Eq. (4) coupled with the estimated bidirectional optical flows; we term this NBMF. Arbitrary-time GS images are then generated by image warping based on explicit intra-frame propagation via Eq. (6). However, since the occluded view is not available during warping, the resulting holes are visually unsatisfactory. Also, [9] is not adaptive to dynamic objects due to its reliance on a constant-velocity motion assumption of the RS camera.

Figure 3. Illustration of the initial BMF estimation, including (a) NBMF estimation (optical flows and correction maps via Eqs. (5) and (6), combined by Eq. (4)) and its approximation, (b) ABMF estimation (correction maps via Eq. (7), combined by Eq. (4)).

Approximated bilateral motion field (ABMF). We observe that $\pi_v$ and $\pi'_v$ in Eq. (5) characterize the latent inter-GS-frame vertical optical flow, which is usually much smaller than the number of image rows $h$ (cf. Appendix B.1 for an in-depth analysis). Hence, we propose the approximation $h - \pi_v \approx h \approx h + \pi'_v$ to rewrite Eq. (5) as:

$$\mathbf{C}_{0 \to t}(\mathbf{x}) = t - \tau_0, \qquad \mathbf{C}_{1 \to t}(\mathbf{x}) = \tau_1 - t, \qquad (7)$$

where the time dependence is retained while the parallax effects (i.e. depth variation and camera motion) are neglected. That is, the correction map becomes independent of the image content and can be pre-computed for a given image resolution. As depicted in Fig. 3 (b), this approximation yields the correction map, and then the ABMF via Eq. (4), in a simple and straightforward manner instead of relying on a specialized deep neural network. Note that the interconversion between ABMFs at different times satisfies Eq. (6) as well. The experimental results in Sec. 6.1 show that our ABMF, coupled with contextual aggregation and motion enhancement, can serve as a strong and tractable baseline for GS frame synthesis.
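As a concrete illustration of Eqs. (4) and (7), the sketch below (our own minimal PyTorch rendering with hypothetical helper names, not the released implementation) builds the ABMF by scaling the bidirectional optical flows with per-row factors that depend only on the target time t and each row's exposure time.

```python
import torch

def abmf(flow_01, flow_10, t, gamma=1.0):
    """Approximated bilateral motion field (Eqs. (4) and (7)).

    flow_01, flow_10: (B, 2, H, W) optical flows between the two RS frames.
    Returns U_{0->t} and U_{1->t} as (B, 2, H, W) tensors.
    """
    B, _, H, W = flow_01.shape
    rows = torch.arange(H, dtype=flow_01.dtype, device=flow_01.device)
    tau0 = gamma * (rows - H / 2.0) / H       # exposure time of each row in frame 0
    tau1 = 1.0 + tau0                         # and in frame 1
    c0 = (t - tau0).view(1, 1, H, 1)          # C_{0->t} = t - tau_0   (Eq. (7))
    c1 = (tau1 - t).view(1, 1, H, 1)          # C_{1->t} = tau_1 - t
    return c0 * flow_01, c1 * flow_10         # Eq. (4)

# Example: initial BMF for the GS frame at t = 0.5
# U0t, U1t = abmf(flow_01, flow_10, t=0.5)
```

Because the correction maps depend only on the row index and t, they can be precomputed once per image resolution and reused across frames.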

4. Context-aware Video Reconstruction

We advocate recovering the intermediate global shutter image $I^g_t$, $t \in [0, 1]$, from two input consecutive rolling shutter images $I^r_0$ and $I^r_1$. In this section, we explain how to design a deep network that reasons about time-aware motion profiles and occlusions, such that photorealistic GS images at arbitrary times can be recovered faithfully.

4.1. Architecture Overview

As shown in Fig. 4, the proposed network consists of two modules: an NBMF-based or ABMF-based motion interpretation module, and a context-aware (i.e. aware of occlusions and partial dynamics) GS frame synthesis module. First, we estimate the bidirectional optical flow fields $\mathbf{F}_{0 \to 1}$ and $\mathbf{F}_{1 \to 0}$ between $I^r_0$ and $I^r_1$, followed by estimation of the BMFs $\mathbf{U}_{0 \to t}$ and $\mathbf{U}_{1 \to t}$ via Eq. (4), based on either NBMF (Eq. (5)) or ABMF (Eq. (7)), as illustrated in Fig. 3. Then, the input RS frames are forward warped using the initial bilateral motions, producing two initial intermediate GS frame candidates at time $t$.


Figure 4. Overall architecture. It comprises two main processes. First, two initial GS frame candidates are obtained from RS frames 0 and 1 by the motion interpretation module (optical flow estimator and BMF estimator); the details of the BMF estimator (i.e. NBMF or ABMF) are elaborated in Fig. 3. Then, the GS frame synthesis module (CAL and MEL, producing the enhanced BMF and the occlusion masks, fused via Eq. (9)) reasons about complex occlusions, motion profiles, and temporal abstractions to generate the final high-fidelity GS image at time t ∈ [0, 1].

Finally, the GS frame synthesis module takes the input RS frames, bidirectional optical flows, bilateral motion fields, and the initial intermediate GS frame candidates to synthesize the final GS reconstruction by aggregating contextual information and adaptively compensating for motion boundaries. We empirically find that our ABMF-based CVR approach (called CVR*) performs well despite its simplicity, while our NBMF-based CVR approach (called CVR) further improves the quality of the final GS images.

Motion interpretation module $\mathcal{M}$ is composed of two submodules: an optical flow estimator and a bilateral motion field estimator. We first utilize the widely used PWC-Net [51] as the optical flow estimator to predict the bidirectional optical flow. To obtain an effective initial BMF, we follow [9] and use a dedicated encoder-decoder U-Net architecture [37, 46], as shown in Fig. 3 (a), to estimate the NBMF for forward warping; this variant is termed $\mathcal{M}_N$. In particular, $\mathcal{M}_N$ needs to be pre-trained using the ground-truth (GT) central-scanline GS images for supervision. Alternatively, we propose to exploit its approximate version as shown in Fig. 3 (b), i.e. an ABMF-based motion interpretation module $\mathcal{M}_A$, to yield a simpler and faster prediction of the initial BMF. Finally, two initial intermediate GS frame candidates $I^g_{0 \to t}$ and $I^g_{1 \to t}$ are generated by Eq. (3) based on the initial BMF estimates $\mathbf{U}_{0 \to t}$ and $\mathbf{U}_{1 \to t}$, respectively.

GS frame synthesis module $\mathcal{G}$ boils down to two main layers: a motion enhancement layer (MEL) and a contextual aggregation layer (CAL). Note that black holes and ambiguous misalignments may exist in the initial intermediate GS frame candidates due to heavy occlusions and partially moving objects, degrading the visual experience. We therefore aim to alleviate artifacts at the boundaries of dynamic objects and to fill the occluded holes. Toward this goal, $I^r_0$, $I^r_1$, $\mathbf{F}_{0 \to 1}$, $\mathbf{F}_{1 \to 0}$, $\mathbf{U}_{0 \to t}$, $\mathbf{U}_{1 \to t}$, $I^g_{0 \to t}$, and $I^g_{1 \to t}$ are concatenated and fed into $\mathcal{G}$ to estimate the BMF residuals $\Delta\mathbf{U}_{0 \to t}$ and $\Delta\mathbf{U}_{1 \to t}$ and the bilateral occlusion masks $\mathbf{O}_{0 \to t}$ and $\mathbf{O}_{1 \to t}$. These time-aware occlusion masks are essential to guide GS frame synthesis in handling occlusions. We employ an encoder-decoder U-Net [37, 46] as the backbone of $\mathcal{G}$, with the same structure as the network in $\mathcal{M}_N$ but different channel widths. The network is fully convolutional with skip connections and leaky ReLU activations. Besides, we apply a sigmoid activation to the output channels corresponding to the bilateral occlusion mask to limit its values to between 0 and 1. Because $\mathcal{G}$ accepts inputs at different time instances, it can implicitly model the temporal abstraction and recover GS frames at an arbitrary time step $t \in [0, 1]$.

Specifically, the final enhanced BMF can be obtained as:

$$\hat{\mathbf{U}}_{0 \to t} = \mathbf{U}_{0 \to t} + \Delta\mathbf{U}_{0 \to t}, \qquad \hat{\mathbf{U}}_{1 \to t} = \mathbf{U}_{1 \to t} + \Delta\mathbf{U}_{1 \to t}, \qquad (8)$$

which improves the quality of the BMF when combined with the proposed contextual consistency constraint, especially at motion boundaries and in unsmooth regions. Subsequently, we produce two refined intermediate GS frame candidates $\hat{I}^g_{0 \to t}$ and $\hat{I}^g_{1 \to t}$ by RS-aware forward warping via Eq. (3). Further, we assume that the content of the target GS image corresponding to $t \in [0, 1]$ can be recovered from at least one of the input RS images, which is reasonable as discussed in [10]. We therefore impose the constraint $\mathbf{O}_{1 \to t} = 1 - \mathbf{O}_{0 \to t}$. Intuitively, $\mathbf{O}_{0 \to t}(\mathbf{x}) = 0$ implies $\mathbf{O}_{1 \to t}(\mathbf{x}) = 1$, i.e. the target pixel can be faithfully rendered by fully trusting $I^r_1$, and vice versa. Similar to [18, 37, 57], we also take advantage of the temporal distances $1 - t$ and $t$ to the input RS frames $I^r_0$ and $I^r_1$, such that temporally closer pixels are assigned higher confidence. Finally, the intermediate GS frame $I^g_t$ is synthesized by

$$I^g_t = \frac{(1 - t)\,\mathbf{O}_{0 \to t}\,\hat{I}^g_{0 \to t} + t\,\mathbf{O}_{1 \to t}\,\hat{I}^g_{1 \to t}}{(1 - t)\,\mathbf{O}_{0 \to t} + t\,\mathbf{O}_{1 \to t}}. \qquad (9)$$
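The fusion of Eq. (9), together with the complementary-mask constraint $\mathbf{O}_{1 \to t} = 1 - \mathbf{O}_{0 \to t}$, can be sketched as follows (a minimal illustration of ours, assuming the refined candidates and the mask predicted by $\mathcal{G}$ are already available; the small eps term is added here purely for numerical stability and is not part of Eq. (9)).

```python
import torch

def fuse_gs_frame(cand0, cand1, occ0, t, eps=1e-6):
    """Occlusion- and time-weighted fusion of the refined GS candidates (Eq. (9)).

    cand0, cand1: (B, 3, H, W) refined intermediate GS frame candidates warped
                  from RS frame 0 and RS frame 1, respectively.
    occ0:         (B, 1, H, W) occlusion mask O_{0->t} in [0, 1] (sigmoid output).
    t:            target time in [0, 1].
    """
    occ1 = 1.0 - occ0              # complementary constraint O_{1->t} = 1 - O_{0->t}
    w0 = (1.0 - t) * occ0          # temporally closer pixels get higher confidence
    w1 = t * occ1
    return (w0 * cand0 + w1 * cand1) / (w0 + w1 + eps)
```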


Figure 5. Qualitative results against baselines (input RS overlaid; DiffSfM [62]; DiffHomo [63]; BMBC [38]; DAIN [2]; cascaded method; DeepUnrollNet [24]; SUNet [10]; RSSR [9]; CVR* (Ours); CVR (Ours); ground truth). Our method successfully removes RS artifacts, yielding higher-fidelity GS images.

Table 1. Quantitative comparisons on recovering GS images at time step t = 0.5. The best and second-best results are marked in red and blue in the original paper. Our method is far superior to the baseline methods, and the proposed ABMF model is effective as an initialization.

Method              | Runtime (s) | PSNR↑ (dB)            | SSIM↑          | LPIPS↓
                    |             | CRM     CR      FR    | CR      FR     | CR       FR
DiffSfM [62]        | 467         | 24.20   21.28   20.14 | 0.775   0.701  | 0.1322   0.1789
DiffHomo [63]       | 424         | 19.60   18.94   18.68 | 0.606   0.609  | 0.1798   0.2229
DeepUnrollNet [24]  | 0.34        | 26.90   26.46   26.52 | 0.807   0.792  | 0.0703   0.1222
SUNet [10]          | 0.21        | 29.28   29.18   28.34 | 0.850   0.837  | 0.0658   0.1205
RSSR*               | 0.09        | 28.20   23.86   21.02 | 0.839   0.768  | 0.0764   0.1866
RSSR [9]            | 0.12        | 30.17   24.78   21.23 | 0.867   0.776  | 0.0695   0.1659
CVR* (Ours)         | 0.12        | 31.82   31.60   28.62 | 0.927   0.845  | 0.0372   0.1117
CVR (Ours)          | 0.14        | 32.02   31.74   28.72 | 0.929   0.847  | 0.0368   0.1107

*: applying our proposed approximated bilateral motion field (ABMF) model.

4.2. Loss Function

Similar to [9, 24, 61], we use the reconstruction loss $\mathcal{L}_r$, the perceptual loss $\mathcal{L}_p$ [19], and the total variation loss $\mathcal{L}_{tv}$ to improve the quality of the final GS and BMF predictions. Moreover, inspired by [10], we propose a contextual consistency constraint loss $\mathcal{L}_c$ to enforce the alignment of the refined intermediate GS frame candidates with the ground truth, which is crucial to facilitate occlusion inference and motion compensation. In short, our loss function $\mathcal{L}$ is defined as:

$$\mathcal{L} = \lambda_r \mathcal{L}_r + \mathcal{L}_p + \lambda_c \mathcal{L}_c + \lambda_{tv} \mathcal{L}_{tv}, \qquad (10)$$

where $\lambda_r$, $\lambda_c$, and $\lambda_{tv}$ are hyper-parameters. More details can be found in Appendix D.
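To illustrate how such a weighted objective might be assembled, here is a minimal sketch. The exact forms of the individual terms are given in Appendix D, so the L1 reconstruction term, the perceptual callback, the candidate-to-ground-truth contextual consistency term, and applying total variation to the motion fields are assumptions of this sketch rather than the paper's precise definitions (the weights λ_r = 10, λ_c = 5, λ_tv = 0.1 follow Sec. 5).

```python
import torch
import torch.nn.functional as F

def total_variation(flow):
    """Total variation of a predicted motion field (finite differences)."""
    dx = (flow[..., :, 1:] - flow[..., :, :-1]).abs().mean()
    dy = (flow[..., 1:, :] - flow[..., :-1, :]).abs().mean()
    return dx + dy

def cvr_loss(pred_gs, gt_gs, cand0, cand1, bmf_pair, perceptual_fn,
             lam_r=10.0, lam_c=5.0, lam_tv=0.1):
    """Weighted objective of Eq. (10): L = lam_r*Lr + Lp + lam_c*Lc + lam_tv*Ltv.

    perceptual_fn is assumed to return the feature-space distance Lp [19];
    Lc here aligns both refined candidates with the ground truth.
    """
    l_r = F.l1_loss(pred_gs, gt_gs)                              # reconstruction
    l_p = perceptual_fn(pred_gs, gt_gs)                          # perceptual
    l_c = F.l1_loss(cand0, gt_gs) + F.l1_loss(cand1, gt_gs)      # contextual consistency
    l_tv = sum(total_variation(u) for u in bmf_pair)             # BMF smoothness
    return lam_r * l_r + l_p + lam_c * l_c + lam_tv * l_tv
```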

5. Experimental Setup

Datasets. We use the standard RS correction benchmark datasets [24], Carla-RS and Fastec-RS, and split the training and test sets as in [24]. The Carla-RS dataset is synthesized using the Carla simulator [8] and involves general 6-DOF camera motions. The Fastec-RS dataset contains real-world RS images synthesized from a high-FPS GS camera mounted on a ground vehicle. Since these datasets provide first- and central-scanline GT supervisory signals, i.e. t = 0, 0.5, and 1, we use this triplet as GT to train our network. Note that we add a small perturbation so that Eq. (9) remains well defined, e.g. transforming the times to t = 0.01, 0.5, and 0.99, respectively. At test time, our method can recover GS video frames at any time t ∈ [0, 1].

Training details. Our method is trained end-to-end using the Adam optimizer [21] with β1 = 0.9 and β2 = 0.999. We empirically set λr = 10, λc = 5, and λtv = 0.1. The experiments are performed on an NVIDIA GeForce RTX 2080Ti GPU with a batch size of 4. We train our network in two stages. First, we train M alone. To train the ABMF-based MA, we fine-tune PWC-Net [51] for 100 epochs from its pre-trained model on the RS benchmark in a self-supervised way [9, 20, 25, 55]; the ABMF can then be computed directly and explicitly. The training details of the NBMF-based MN can be found in [9], with supervision from the central-scanline GT GS images. Second, we jointly train the entire model (i.e. M and G) with L for another 50 epochs. At this stage, the learning rate of G is set to 10^-4 for training from scratch, and that of M is set to 10^-5 for fine-tuning. We keep the vertical resolution constant and apply random crops of 256 pixels in the horizontal direction to augment the training data, similar to [9, 10], for better contextual exploration.

Evaluation strategies. As the Carla-RS dataset provides GT occlusion masks, we perform quantitative evaluations in three settings: Carla-RS with occlusion mask (CRM), Carla-RS without occlusion mask (CR), and Fastec-RS (FR). The standard metrics PSNR and SSIM and the learned perceptual metric LPIPS [59] are used. A higher PSNR/SSIM or a lower LPIPS score indicates better quality. Unless otherwise stated, we report GS images at time t = 0.5 for consistent comparisons.
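For reference, the CRM setting restricts the error computation to pixels marked valid by the GT occlusion mask. One straightforward way to do this for PSNR is sketched below (our own illustration, not necessarily the exact evaluation script used for Table 1).

```python
import torch

def masked_psnr(pred, target, mask, max_val=1.0):
    """PSNR restricted to pixels marked valid by a (B, 1, H, W) binary mask,
    as in the Carla-RS-with-mask (CRM) setting."""
    mask = mask.expand_as(pred)
    mse = ((pred - target) ** 2 * mask).sum() / mask.sum().clamp(min=1)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```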


Figure 6. Example results of recovering six GS video frames (at t = 0, 0.2, 0.4, 0.6, 0.8, 1) from the two input RS images (left column) using RSSR [9], CVR*, and CVR (three rows from top to bottom), respectively. Apart from many unfriendly black holes at the GS image edges, RSSR generates local errors and motion artifacts, as shown in the red circles. Our method produces temporally consistent GS sequences with richer details.

Baselines. We compare with the following baselines. (i) DiffSfM [62] and DiffHomo [63] are traditional two-image RS correction methods that require sophisticated optimization using RS models. (ii) SUNet [10] and DeepUnrollNet [24] recover only one GS frame from two consecutive RS frames by designing specialized CNNs, while RSCD [61] achieves this from three adjacent RS images. (iii) RSSR [9] generates a GS video from two consecutive RS images using deep learning, but suffers from black holes and motion artifacts. Moreover, we integrate the proposed ABMF model into RSSR to obtain RSSR*. (iv) DAIN [2] and BMBC [38] are SOTA VFI methods tailored for GS cameras. (v) The cascaded method generates two GS images sequentially from three consecutive RS inputs using DeepUnrollNet, and then interpolates in-between GS frames using DAIN. (vi) CVR and CVR* are our proposed methods based on NBMF and ABMF, respectively. Note that RSSR*, RSSR, our CVR*, and our CVR form a clear hierarchy of RS-based video reconstruction methods.

6. Results and Analysis

In this section, we compare with the baseline approaches and provide analysis of and insight into our method.

6.1. Comparison with SOTA Methods

We report the quantitative and qualitative results in Table 1 and Fig. 5, respectively. Our proposed method achieves overwhelming dominance in RS effect removal, which is mainly attributed to context aggregation and motion pattern inference. Furthermore, although our proposed ABMF model is inferior to RSSR [9] when used alone to remove the RS effect (i.e. RSSR*), it serves as a strong baseline for GS video frame reconstruction when combined with GS frame refinement. We believe our hierarchical pipeline can provide a fresh perspective on the video reconstruction task for RS cameras. More results and analysis are given in Appendix C.

Note that our method produces a continuous GS sequence, which goes far beyond [10, 24, 61], although [10] can decode plausible details of the GS image at a specific time. Traditional methods [62, 63] cannot estimate the underlying RS geometry robustly and accurately, resulting in ghosting artifacts; they are also computationally inefficient due to their complicated processing. Due to inherent limitations of their network architectures, the VFI methods [2, 38] fail to remove the RS effect. An intuitive cascade of RS correction and VFI methods tends to accumulate errors and is prone to blurring artifacts and local inaccuracies; such cascades also have large models and are thus relatively time-consuming. In contrast, our end-to-end pipeline performs favorably against the SOTA methods in terms of both RS correction and inference efficiency. Note also that obnoxious black holes and object-specific motion artifacts appear in [9], degrading the visual experience, as outlined in Sec. 1. In general, our CVR improves upon RSSR and therefore recovers results with higher realism, while our CVR* offers a new, concise, and efficient framework for related tasks.

6.2. GS Video Reconstruction Results

We apply our method to generate multiple in-between GS frames at arbitrary times t ∈ [0, 1]. Visual results for 5× temporal upsampling are shown in Fig. 6; more results are provided in Appendix C.3. Our method not only successfully removes the RS effect, but also robustly reconstructs smooth and continuous GS videos.

6.3. Ablation Studies

Ablation on motion interpretation module M. We first replace NBMF and ABMF with a linear BMF (LBMF), a widely used BMF initialization scheme in popular VFI methods, e.g. [18, 34, 35, 38, 50]. Then, we replace PWC-Net with the SOTA optical flow estimation pipeline RAFT [52]. Finally, we freeze M and train only G during the training phase.


Figure 7. Visual results of the ablation study (inputs I^r_0 and I^r_1; occlusion masks O_{0→0.5} and O_{1→0.5}; residual magnitudes ‖ΔU_{0→0.5}‖_2 and ‖ΔU_{1→0.5}‖_2; outputs of LBMF, RAFT-based, w/o O, w/o ΔU, CVR (Ours), and ground truth, with corresponding crops). Our context-aware method is also adaptable to motion artifacts specific to moving objects.

Table 2. Ablation results for the CVR architecture on M, G, and L.

Settings     | PSNR↑ (dB)            | SSIM↑
             | CRM     CR      FR    | CR      FR
LBMF         | 26.10   25.97   25.78 | 0.806   0.771
RAFT-based   | 30.50   29.89   27.99 | 0.917   0.840
Freeze M     | 31.94   31.65   28.11 | 0.928   0.837
T · ΔU       | 32.00   31.63   28.56 | 0.929   0.845
w/o ΔU       | 31.90   31.65   28.32 | 0.928   0.841
w/o O        | 28.22   26.31   24.04 | 0.902   0.813
w/o Lr       | 31.80   31.53   28.31 | 0.927   0.840
w/o Lp       | 31.60   31.34   28.49 | 0.929   0.842
w/o Lc       | 31.88   31.64   28.44 | 0.928   0.842
w/o Ltv      | 31.93   31.71   28.45 | 0.928   0.844
Add L′c      | 31.97   31.58   28.55 | 0.929   0.844
Full model   | 32.02   31.74   28.72 | 0.929   0.847

As can be seen from Table 2 and Fig. 7, LBMF is extremely ineffective for the RS-based video reconstruction task, which reveals the superiority of our proposed NBMF as well as ABMF. This could facilitate further research in related fields, especially via the simpler ABMF. Since the RAFT-based full baseline is not easily optimized jointly end-to-end, it is prone to unsmoothness at local motion boundaries. Additionally, training the entire network together with M improves model performance.

Ablation on GS frame synthesis module G. We analyze the role of each component of G in Table 2, including 1) multiplying ΔU by a normalized scanline offset T to explicitly model its scanline dependence as in [9, 24, 60], and 2) removing the MEL (i.e. w/o ΔU) and the CAL (i.e. w/o O) separately. Combined with Fig. 7, one can observe that both lead to performance degradation, especially removing the CAL, which causes aliasing effects during context aggregation, e.g. misaligned wheels and black edges. Moreover, removing the MEL reduces the adaptability of our method to object-specific motion artifacts, especially on the more challenging Fastec-RS dataset. In summary, our method can adaptively infer occlusions and enhance motion boundaries.

Ablation on loss function L. We remove the loss terms one by one to analyze their respective roles. We also directly use the L1 difference between the two refined intermediate GS frame candidates $\hat{I}^g_{0 \to t}$ and $\hat{I}^g_{1 \to t}$ to measure context consistency, which constitutes a self-supervised loss term L′c; adding L′c to Lc creates a loop loss. As shown in Table 2, using L′c does not lead to performance gains, since supervising the final GS prediction (e.g. by Lr) already dominates context alignment. Overall, our loss function L is effective, as the model performs best when all loss terms are used.

6.4. Limitation and Discussion

Our method relies on optical flow estimation, so aliasing artifacts may appear in regions with low or weak texture. Besides, although we assume that the pixels of the target GS image at time t ∈ [0, 1] are visible in at least one of the RS images, some pixels at the edges of the GS image may not be available, e.g. the lower right corner of the GS images at t = 0 in Figs. 1 and 6, due to severe occlusions from fast camera or object motion. Using more frames in the future may make it possible to fill in these invisible regions.

7. Conclusion

In this paper, we have presented a context-aware architecture, CVR, for end-to-end video reconstruction from RS cameras, which incorporates temporal smoothness to recover high-fidelity GS video frames with fewer artifacts and better details. Moreover, we have developed a simple yet efficient pipeline, CVR*, based on the proposed ABMF model, which works robustly with RS cameras. Our framework exploits the spatio-temporal coherence embedded in the latent GS video via motion interpretation and occlusion reasoning, significantly outperforming the SOTA methods. We hope this study can shed light on future research on video frame reconstruction for RS cameras.


References

[1] Cenek Albl, Zuzana Kukelova, Viktor Larsson, Michal Polic, Tomas Pajdla, and Konrad Schindler. From two rolling shutters to one global shutter. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2505–2513, 2020.
[2] Wenbo Bao, Wei-Sheng Lai, Chao Ma, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. Depth-aware video frame interpolation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3703–3712, 2019.
[3] Wenbo Bao, Wei-Sheng Lai, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. MEMC-Net: motion estimation and motion compensation driven neural network for video interpolation and enhancement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(3):933–948, 2021.
[4] Zhixiang Chi, Rasoul Mohammadi Nasiri, Zheng Liu, Juwei Lu, Jin Tang, and Konstantinos N Plataniotis. All at once: temporally adaptive multi-frame interpolation with advanced motion modeling. In Proceedings of European Conference on Computer Vision, pages 107–123, 2020.
[5] Myungsub Choi, Heewon Kim, Bohyung Han, Ning Xu, and Kyoung Mu Lee. Channel attention is all you need for video frame interpolation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 10663–10671, 2020.
[6] Yuchao Dai, Hongdong Li, and Laurent Kneip. Rolling shutter camera relative pose: generalized epipolar geometry. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4132–4140, 2016.
[7] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: learning optical flow with convolutional networks. In Proceedings of IEEE International Conference on Computer Vision, pages 2758–2766, 2015.
[8] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: an open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning, pages 1–16, 2017.
[9] Bin Fan and Yuchao Dai. Inverting a rolling shutter camera: bring rolling shutter images to high framerate global shutter video. In Proceedings of IEEE International Conference on Computer Vision, pages 4228–4237, 2021.
[10] Bin Fan, Yuchao Dai, and Mingyi He. SUNet: symmetric undistortion network for rolling shutter correction. In Proceedings of IEEE International Conference on Computer Vision, pages 4541–4550, 2021.
[11] Bin Fan, Yuchao Dai, Zhiyuan Zhang, and Mingyi He. Fast and robust differential relative pose estimation with radial distortion. IEEE Signal Processing Letters, 29:294–298, 2021.
[12] Bin Fan, Ke Wang, Yuchao Dai, and Mingyi He. Rolling-shutter-stereo-aware motion estimation and image correction. Computer Vision and Image Understanding, 213:103296, 2021.
[13] Bin Fan, Ke Wang, Yuchao Dai, and Mingyi He. RS-DPSNet: deep plane sweep network for rolling shutter stereo images. IEEE Signal Processing Letters, 28:1550–1554, 2021.
[14] Per-Erik Forssen and Erik Ringaby. Rectifying rolling shutter video from hand-held devices. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 507–514, 2010.
[15] Matthias Grundmann, Vivek Kwatra, Daniel Castro, and Irfan Essa. Calibration-free rolling shutter removal. In Proceedings of IEEE International Conference on Computational Photography, pages 1–8, 2012.
[16] Johan Hedborg, Per-Erik Forssen, Michael Felsberg, and Erik Ringaby. Rolling shutter bundle adjustment. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1434–1441, 2012.
[17] Sunghoon Im, Hyowon Ha, Gyeongmin Choe, Hae-Gon Jeon, Kyungdon Joo, and In So Kweon. Accurate 3D reconstruction from small motion clip for rolling shutter cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(4):775–787, 2018.
[18] Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. Super SloMo: high quality estimation of multiple intermediate frames for video interpolation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9000–9008, 2018.
[19] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of European Conference on Computer Vision, pages 694–711, 2016.
[20] Rico Jonschkowski, Austin Stone, Jonathan T Barron, Ariel Gordon, Kurt Konolige, and Anelia Angelova. What matters in unsupervised optical flow. In Proceedings of European Conference on Computer Vision, pages 557–572, 2020.
[21] Diederik P Kingma and Jimmy Ba. Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, 2015.
[22] Yizhen Lao and Omar Ait-Aider. A robust method for strong rolling shutter effects correction using lines with automatic feature selection. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4795–4803, 2018.
[23] Yizhen Lao and Omar Ait-Aider. Rolling shutter homography and its applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(8):2780–2793, 2021.
[24] Peidong Liu, Zhaopeng Cui, Viktor Larsson, and Marc Pollefeys. Deep shutter unrolling network. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5941–5949, 2020.
[25] Peidong Liu, Joel Janai, Marc Pollefeys, Torsten Sattler, and Andreas Geiger. Self-supervised linear motion deblurring. IEEE Robotics and Automation Letters, 5(2):2475–2482, 2020.


[26] Yihao Liu, Liangbin Xie, Li Siyao, Wenxiu Sun, Yu Qiao, and Chao Dong. Enhanced quadratic video interpolation. In Proceedings of European Conference on Computer Vision, pages 41–56, 2020.
[27] Yu-Lun Liu, Yi-Tung Liao, Yen-Yu Lin, and Yung-Yu Chuang. Deep video frame interpolation using cyclic frame generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8794–8802, 2019.
[28] Ziwei Liu, Raymond A Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. In Proceedings of IEEE International Conference on Computer Vision, pages 4463–4471, 2017.
[29] Yi Ma, Jana Kosecka, and Shankar Sastry. Linear differential algorithm for motion recovery: a geometric approach. International Journal of Computer Vision, 36(1):71–89, 2000.
[30] Marci Meingast, Christopher Geyer, and Shankar Sastry. Geometric models of rolling-shutter cameras. arXiv preprint arXiv:cs/0503076, 2005.
[31] Simone Meyer, Abdelaziz Djelouah, Brian McWilliams, Alexander Sorkine-Hornung, Markus Gross, and Christopher Schroers. PhaseNet for video frame interpolation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 498–507, 2018.
[32] Simone Meyer, Oliver Wang, Henning Zimmer, Max Grosse, and Alexander Sorkine-Hornung. Phase-based frame interpolation for video. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1410–1418, 2015.
[33] Guillaume Le Moing, Jean Ponce, and Cordelia Schmid. CCVS: context-aware controllable video synthesis. In Proceedings of Advances in Neural Information Processing Systems, volume 34, 2021.
[34] Simon Niklaus and Feng Liu. Context-aware synthesis for video frame interpolation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1701–1710, 2018.
[35] Simon Niklaus and Feng Liu. Softmax splatting for video frame interpolation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5437–5446, 2020.
[36] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive convolution. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 670–679, 2017.
[37] Avinash Paliwal and Nima Khademi Kalantari. Deep slow motion video reconstruction with hybrid imaging system. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(7):1557–1569, 2020.
[38] Junheum Park, Keunsoo Ko, Chul Lee, and Chang-Su Kim. BMBC: bilateral motion estimation with bilateral cost volume for video interpolation. In Proceedings of European Conference on Computer Vision, pages 109–125, 2020.
[39] Junheum Park, Chul Lee, and Chang-Su Kim. Asymmetric bilateral motion estimation for video frame interpolation. In Proceedings of IEEE International Conference on Computer Vision, pages 14539–14548, 2021.
[40] Pulak Purkait and Christopher Zach. Minimal solvers for monocular rolling shutter compensation under Ackermann motion. In Proceedings of IEEE Winter Conference on Applications of Computer Vision, pages 903–911, 2018.
[41] Pulak Purkait, Christopher Zach, and Ales Leonardis. Rolling shutter correction in Manhattan world. In Proceedings of IEEE International Conference on Computer Vision, pages 882–890, 2017.
[42] Fitsum A Reda, Deqing Sun, Aysegul Dundar, Mohammad Shoeybi, Guilin Liu, Kevin J Shih, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Unsupervised video interpolation using cycle consistency. In Proceedings of IEEE International Conference on Computer Vision, pages 892–900, 2019.
[43] Vijay Rengarajan, Yogesh Balaji, and AN Rajagopalan. Unrolling the shutter: CNN to correct motion distortions. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2291–2299, 2017.
[44] Vijay Rengarajan, Ambasamudram N Rajagopalan, and Rangarajan Aravind. From bows to arrows: rolling shutter rectification of urban scenes. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2773–2781, 2016.
[45] Erik Ringaby and Per-Erik Forssen. Efficient video rectification and stabilisation for cell-phones. International Journal of Computer Vision, 96(3):335–352, 2012.
[46] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-assisted Intervention, pages 234–241, 2015.
[47] Olivier Saurer, Kevin Koser, Jean-Yves Bouguet, and Marc Pollefeys. Rolling shutter stereo. In Proceedings of IEEE International Conference on Computer Vision, pages 465–472, 2013.
[48] David Schubert, Nikolaus Demmel, Lukas von Stumberg, Vladyslav Usenko, and Daniel Cremers. Rolling-shutter modelling for direct visual-inertial odometry. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2462–2469, 2019.
[49] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations, 2015.
[50] Li Siyao, Shiyu Zhao, Weijiang Yu, Wenxiu Sun, Dimitris Metaxas, Chen Change Loy, and Ziwei Liu. Deep animation video interpolation in the wild. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6587–6595, 2021.
[51] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018.
[52] Zachary Teed and Jia Deng. RAFT: recurrent all-pairs field transforms for optical flow. In Proceedings of European Conference on Computer Vision, pages 402–419, 2020.
[53] Subeesh Vasu, Mahesh MR Mohan, and AN Rajagopalan. Occlusion-aware rolling shutter rectification of 3D scenes. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 636–645, 2018.

[54] Ke Wang, Bin Fan, and Yuchao Dai. Relative pose estimation for stereo rolling shutter cameras. In Proceedings of IEEE International Conference on Image Processing, pages 463–467, 2020.
[55] Yang Wang, Yi Yang, Zhenheng Yang, Liang Zhao, Peng Wang, and Wei Xu. Occlusion aware unsupervised learning of optical flow. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4884–4893, 2018.
[56] Huicong Wu, Liang Xiao, and Zhihui Wei. Simultaneous video stabilization and rolling shutter removal. IEEE Transactions on Image Processing, 30:4637–4652, 2021.
[57] Xiangyu Xu, Li Siyao, Wenxiu Sun, Qian Yin, and Ming-Hsuan Yang. Quadratic video interpolation. In Proceedings of Advances in Neural Information Processing Systems, volume 32, 2019.
[58] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T Freeman. Video enhancement with task-oriented flow. International Journal of Computer Vision, 127(8):1106–1125, 2019.
[59] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 586–595, 2018.
[60] Zhihang Zhong, Mingdeng Cao, Xiao Sun, Zhirong Wu, Zhongyi Zhou, Yinqiang Zheng, Stephen Lin, and Imari Sato. Bringing rolling shutter images alive with dual reversed distortion. arXiv preprint arXiv:2203.06451, 2022.
[61] Zhihang Zhong, Yinqiang Zheng, and Imari Sato. Towards rolling shutter correction and deblurring in dynamic scenes. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9219–9228, 2021.
[62] Bingbing Zhuang, Loong-Fah Cheong, and Gim Hee Lee. Rolling-shutter-aware differential SfM and image rectification. In Proceedings of IEEE International Conference on Computer Vision, pages 948–956, 2017.
[63] Bingbing Zhuang and Quoc-Huy Tran. Image stitching and rectification for hand-held cameras. In Proceedings of European Conference on Computer Vision, pages 243–260, 2020.
[64] Bingbing Zhuang, Quoc-Huy Tran, Pan Ji, Loong-Fah Cheong, and Manmohan Chandraker. Learning structure-and-motion-aware rolling shutter correction. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4551–4560, 2019.


In this appendix, we first derive a general parametric form of the bilateral motion field (BMF) by considering the readout time ratio, and then justify our problem setup. Next, we provide thorough analyses of our ABMF model, occlusion reasoning, and motion enhancement. Afterward, we show additional experimental results on RS correction, intermediate flow, generalization, and GS video recovery, which fully demonstrate the superiority of our pipeline. We also include a video demo presenting dynamic GS video reconstruction results. Additional details of the loss function are then given, and we report a partial ablation study of CVR*. Finally, several failure cases are presented to point toward possible future research.

Appendix A. Instructions on Problem Setup

In this section, we present a detailed derivation of the general parameterization of the BMF in the time dimension, followed by an explanation of our problem setup.

A.1. General formulation of BMF

We first give a brief description of the connection between the motion field U_{0→s} and the optical flow F_{0→1}, taking the first RS frame I_0^r as an example. Since this does not contain our contribution, we only give the details necessary to follow the derivation below; more details of this connection can be found in [9]. To estimate U_{0→s}, which warps each pixel (e.g. x in scanline κ) of I_0^r to the GS counterpart corresponding to its scanline s, this connection under the constant-velocity motion model can be formulated as:

U_{0\to s}(\mathbf{x}) = C_{0\to s}(\mathbf{x}) \cdot F_{0\to 1}(\mathbf{x}),   (11)

where

C_{0\to s}(\mathbf{x}) = \frac{\gamma (s - \kappa)(h - \gamma \pi_v)}{h^2}   (12)

denotes the forward correction map. Here, γ ∈ (0, 1] is the readout time ratio [62], h is the number of scanlines, and π_v represents the latent inter-GS-frame vertical optical flow.

Next, we extend it to the time domain and derive a more general formulation than that in the main paper, i.e. γ will be taken into account. According to the definition in the main paper, the one-to-one correspondence between time t and scanline s will satisfy

t = \frac{\gamma}{h}\left(s - \frac{h}{2}\right).   (13)

It is easy to verify that the central scanlines of I_0^r and I_1^r correspond to time instances 0 and 1, respectively. Note that the first scanline of I_0^r will coincide with time −γ/2.

Assume that t ∈ [0, 1] corresponds to the scanline s to be restored, and that τ_0 ∈ [−γ/2, γ/2] corresponds to the exposure time of scanline κ of I_0^r; we then obtain t − τ_0 = γ(s − κ)/h. Further combining with Eq. (12) yields

C_{0\to t}(\mathbf{x}) = \frac{(t - \tau_0)(h - \gamma \pi_v)}{h}.   (14)

Similarly, the backward correction map that accounts for the second RS frame I_1^r, with τ_1 ∈ [1 − γ/2, 1 + γ/2], can be defined as:

C_{1\to t}(\mathbf{x}) = \frac{(\tau_1 - t)(h + \gamma \pi'_v)}{h}.   (15)

Note that Eqs. (14) and (15) model the bilateral correction map through the time paradigm in a general sense. In this way, the general parametric form of BMF is modeled. This generality is reflected by the fact that, unlike Eq. (5) in the main paper, we get τ_0 ∈ [−γ/2, γ/2] and τ_1 ∈ [1 − γ/2, 1 + γ/2] instead of τ_0 ∈ [−0.5, 0.5] and τ_1 ∈ [0.5, 1.5]. This is because we assume γ = 1 in our problem setup. In the following, we will explain the feasibility of this setup.
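For illustration, a minimal NumPy sketch of Eqs. (13)–(15) is given below. It is not the exact network implementation: the flow layout, the sign conventions, and the reuse of the single flow map F_{0→1} with the approximation π'_v ≈ π_v are assumptions made only for this sketch.

```python
import numpy as np

def bilateral_correction_maps(flow_01, gamma, t):
    """Sketch of the bilateral correction maps of Eqs. (13)-(15).

    flow_01 : (H, W, 2) dense optical flow from RS frame 0 to RS frame 1,
              last channel ordered as (horizontal, vertical) -- an assumption.
    gamma   : readout time ratio in (0, 1].
    t       : target GS time in [0, 1] (t = 0 and t = 1 correspond to the
              central scanlines of the two RS frames).
    """
    h = flow_01.shape[0]
    kappa = np.arange(h, dtype=np.float64).reshape(h, 1)   # scanline index of each row

    # Eq. (13): exposure times of the scanlines of RS frame 0 and RS frame 1.
    tau0 = gamma / h * (kappa - h / 2.0)                    # in [-gamma/2, gamma/2]
    tau1 = 1.0 + tau0                                       # in [1-gamma/2, 1+gamma/2]

    # Latent inter-GS-frame vertical flow pi_v, approximated from the
    # inter-RS-frame vertical flow f_v via |pi_v|/h = |f_v/(h+f_v)| (cf. Appendix B.1).
    f_v = flow_01[..., 1]
    pi_v = h * f_v / (h + f_v)

    # Eqs. (14) and (15): forward and backward correction maps.
    c_0t = (t - tau0) * (h - gamma * pi_v) / h
    c_1t = (tau1 - t) * (h + gamma * pi_v) / h              # assumes pi'_v ~ pi_v
    return c_0t, c_1t

# As in Eq. (11), the motion field warping RS frame 0 towards time t is then
# U_{0->t} = c_0t[..., None] * F_{0->1}; the backward field scales the reverse
# flow F_{1->0} by c_1t analogously (this pairing is assumed here).
```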

A.2. Feasibility analysis of our problem setup

Our problem setup with γ = 1 is based on three main reasons. Firstly, the Carla-RS and Fastec-RS datasets proposed by [24] are the only available RS correction benchmark datasets, and they are constructed under the assumption of γ = 1. Since we leverage these two datasets to train our network, our setup can be regarded as adopting γ = 1 in advance. Secondly, as manifested by [9, 10, 24] and our experiments, deep learning models trained on these two datasets generalize successfully to RS data acquired by real cameras (i.e. γ may not be equal to 1). Also, γ is directly set to 1 in [12, 63] to correct real RS images, which avoids non-trivial readout calibration. These studies demonstrate that assuming γ = 1 is generally feasible for modeling RS correction problems. Thirdly, this setup facilitates temporally tractable frame interpolation between two consecutive RS images. Note that we derive the general parameterization of BMF in Appendix A.1, which will help future exploration of more general RS-based video reconstruction tasks (e.g. with extremely small γ).
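For completeness, substituting γ = 1 into the general formulation of Appendix A.1 (a direct specialization, stated here only to connect back to the main paper) gives

t = \frac{s}{h} - \frac{1}{2}, \qquad \tau_0 \in [-0.5,\, 0.5], \qquad \tau_1 \in [0.5,\, 1.5],

C_{0\to t}(\mathbf{x}) = \frac{(t - \tau_0)(h - \pi_v)}{h}, \qquad C_{1\to t}(\mathbf{x}) = \frac{(\tau_1 - t)(h + \pi'_v)}{h},

which is consistent with the ranges discussed around Eq. (5) of the main paper.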

Appendix B. Additional Architecture Analyses

In this section, we provide more in-depth analyses of our method in terms of the proposed ABMF model, time-aware occlusion reasoning, and motion enhancement architectures.

B.1. Further analysis on ABMF model

Here, we show the rationality of our ABMF model proposed in Subsection 3.1 of the main paper, i.e., we need to verify that h ± |π_v| ≈ h (equivalently, |π_v|/h ≈ 0). To this end, we define |π_v|/h as the vertical pixel displacement ratio. From [9], one can get |π_v|/h = |f_v/(h + f_v)|, where f_v denotes the inter-RS-frame vertical optical flow and h is the number of image rows. We employ the state-of-the-art optical flow estimation pipeline RAFT [52] to obtain the optical flow map between two adjacent RS frames. Then, we calculate the average value and standard deviation of |π_v|/h on each image of the test sets of the Carla-RS and Fastec-RS datasets, and plot their respective statistics in Fig. A1. One can observe that the latent inter-GS-frame vertical optical flow value is usually much smaller than the number of image rows, i.e. the proposed ABMF model is concise and reasonable. Note that γ|π_v|/h ≤ |π_v|/h, which indicates that our ABMF model is also valid under the general formulation of Appendix A.1. Furthermore, as illustrated in Fig. A2, the ABMF-based RSSR* tends to produce misalignment errors and unsmooth artifacts at local boundaries (e.g. depth variation, slight blurring, monotonous texture, etc.) due to the isotropic approximation of ABMF. Fortunately, the experimental results demonstrate that our CVR* (i.e. combining ABMF with the GS frame refinement module) enhances local details and improves image quality in a coarse-to-fine manner, which can serve as an effective and efficient baseline for RS-based video reconstruction. Note that our CVR can further improve the fidelity and authenticity of the recovered GS images because a better initial estimate is provided by the network-based bilateral motion field (NBMF).

Figure A1. Statistical results of the vertical pixel displacement ratio |π_v|/h on the Carla-RS and Fastec-RS datasets. Red is the average value (AVG) and blue is the standard deviation (SD).
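As a companion to Fig. A1, the statistic |π_v|/h can be computed per image from an inter-RS-frame flow map in a few lines. The sketch below assumes a RAFT-style (H, W, 2) flow array with the vertical component in the last channel, and is only meant to illustrate the check h ± |π_v| ≈ h.

```python
import numpy as np

def vertical_displacement_ratio_stats(flow_rs):
    """Mean and standard deviation of |pi_v| / h for one inter-RS-frame flow map.

    flow_rs : (H, W, 2) optical flow between two adjacent RS frames
              (e.g. predicted by RAFT); the channel order is an assumption.
    """
    h = flow_rs.shape[0]                      # number of image rows
    f_v = flow_rs[..., 1]                     # vertical flow component
    ratio = np.abs(f_v / (h + f_v))           # |pi_v| / h = |f_v / (h + f_v)|
    return float(ratio.mean()), float(ratio.std())

# If both statistics stay close to zero (as in Fig. A1), then h +/- |pi_v| ~ h,
# i.e. the isotropic approximation behind ABMF is justified.
```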

B.2. Further analysis on occlusion reasoning

As can be seen from Fig. A3, severe occlusion exists in the pool at the lower-left corner of the RS images (cf. blue circle). Since RSSR [9] only uses the contents of a single RS image to synthesize the corresponding GS image, a mass of occluded black holes inevitably appears (cf. red circles). In contrast, we mitigate this issue by effectively aggregating contextual information through occlusion inference. Interestingly, in the examples at time t = 0.5 shown in Fig. A3, the estimated bilateral occlusion masks vividly reflect the intuitive observation discussed in [10], i.e. the first and second rolling shutter images contribute mainly to the lower and upper parts of the latent GS image at time t = 0.5, respectively. Meanwhile, the GS image corresponding to time t = 0 or t = 1 relies more on the RS image that is closer in time, which reasonably follows the RS imaging mechanism. In a nutshell, our method can restore high-quality GS images with richer details, enhancing the visual experience. Note that our method can also model temporal abstractions in an end-to-end manner, which allows adaptively generating time-aware occlusion masks to obtain GS images at arbitrary times.
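To illustrate the role of the bilateral occlusion masks described above, the toy sketch below blends two already-warped GS candidates with normalized mask weights. This is only a conceptual stand-in: in CVR the final GS frame is synthesized by a learned refinement network guided by the masks, not by this fixed formula, and all tensor names are hypothetical.

```python
import numpy as np

def blend_gs_candidates(g_0t, g_1t, mask_0t, mask_1t, eps=1e-6):
    """Toy occlusion-aware blending of two warped GS candidates at time t.

    g_0t, g_1t       : (H, W, 3) GS candidates warped from RS frame 0 and RS frame 1.
    mask_0t, mask_1t : (H, W) non-negative confidence masks (brighter = more
                       reliable, cf. Fig. A3).
    """
    w0 = mask_0t[..., None]
    w1 = mask_1t[..., None]
    # Pixels occluded (black holes) in one candidate get low weight there and are
    # filled from the other candidate, which is the intuition behind Fig. A3.
    return (w0 * g_0t + w1 * g_1t) / (w0 + w1 + eps)
```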

B.3. Further analysis on motion enhancement

We further investigate the effectiveness of our motion enhancement layer in Fig. A4, taking the correction to time t = 1 as an example. As illustrated by the red boxes, the motion enhancement scheme improves the quality of the bilateral motion field. As a result, local image details (e.g. object-specific motion boundaries, small errors, etc.) are refined so as to encourage subsequent contextual aggregation. Combined with the proposed contextual consistency constraint, it promotes high-fidelity GS frame synthesis with the assistance of the bilateral occlusion masks.

Appendix C. Additional Experimental Results

In this section, we present more qualitative and quantitative experimental results on RS effect removal, intermediate flow, and generalization. Furthermore, a video demo is included to show the dynamic results of reconstructing slow-motion GS video from two consecutive RS frames.

C.1. RS effect removal

First, we visualize the intermediate flow in Fig. A5 and compare it with the SOTA RS-based video reconstruction method RSSR [9]. In contrast to RSSR, our pipeline generates intermediate flows with clearer motion boundaries for more accurate frame interpolation due to motion interpretation and occlusion reasoning. Then, we report more RS effect removal results in Fig. A6 and Fig. A8 by comparing with off-the-shelf video frame interpolation (VFI) and RS correction algorithms. Finally, in Table A1, we give quantitative comparison results of GS image recovery at time t = 1. In addition to the superior RS effect removal performance at time t = 0.5, our pipeline also significantly surpasses RSSR at time t = 1. In summary, these experimental results consistently demonstrate that the proposed method has superior RS effect removal capabilities, successfully restoring higher fidelity global shutter video frames with fewer artifacts and richer details.


Figure A2. Example results of the effectiveness of our ABMF model (columns: input RS overlay, RSSR*, RSSR [9], CVR* (Ours), CVR (Ours), ground truth). Since the ABMF model ignores depth variations, the ABMF-based RSSR* may encounter misaligned errors and unsmooth artifacts at motion boundaries, while the (NBMF-based) RSSR [9] can alleviate these problems to some extent, but still not as well as it could be. Combined with the GS frame refinement module, ABMF provides concise and tractable benefits for GS video recovery, while NBMF can yield better initialization to generate higher fidelity GS images.

Figure A3. Example results of the effectiveness of our occlusion reasoning. We show GS frame recovery at times 0, 0.5, and 1, respectively (columns per row: RS 0 and RS 1 inputs, bilateral occlusion masks O_{0→t} and O_{1→t}, RSSR, CVR (Ours), ground truth). The brighter the color in the bilateral occlusion mask, the higher the credibility. Our method can adaptively and efficiently reason about complex occlusions and temporal abstractions, leading to visually more satisfactory GS reconstruction results than RSSR [9].

Table A1. Quantitative comparisons on recovering GS images at time step t = 1. The numbers in red and blue represent the best and second-best performance. In addition to the SOTA quantitative performance for GS image recovery at time t = 0.5, our method also obtains almost consistently the best metrics at time t = 1. Beyond these two time instances, high-quality GS video frames corresponding to any time t ∈ [0, 1] can be accurately estimated by our method.

Method              | PSNR↑ (dB)           | SSIM↑         | LPIPS↓
                    | CRM    CR     FR     | CR     FR     | CR      FR
DeepUnrollNet [24]  | 27.86  27.54  27.02  | 0.829  0.828  | 0.0555  0.0791
RSCD [61]           | -      -      24.84  | -      0.778  | -       0.1070
RSSR [9]            | 29.36  26.57  24.89  | 0.900  0.824  | 0.0553  0.1109
CVR* (Ours)         | 28.28  28.19  26.58  | 0.912  0.833  | 0.0444  0.1014
CVR (Ours)          | 29.41  29.19  26.67  | 0.915  0.838  | 0.0403  0.1011

*: applying our proposed approximated bilateral motion field (ABMF) model.
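For reference, the sketch below shows one common way PSNR / SSIM / LPIPS numbers of the kind reported in Table A1 can be computed with off-the-shelf packages (scikit-image and the lpips package of [59]). The exact evaluation protocol used in the paper (crops, data range, LPIPS backbone) is not restated here, so treat this as an assumption-laden reference only.

```python
import torch
import lpips                                   # perceptual metric of [59]
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net='alex')             # backbone choice is an assumption

def evaluate_pair(pred, gt):
    """pred, gt: (H, W, 3) float32 numpy arrays in [0, 1]."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    # channel_axis requires scikit-image >= 0.19 (older versions use multichannel=True).
    ssim = structural_similarity(gt, pred, channel_axis=2, data_range=1.0)
    to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None] * 2.0 - 1.0
    with torch.no_grad():                      # LPIPS expects NCHW tensors in [-1, 1]
        lp = lpips_fn(to_tensor(pred), to_tensor(gt)).item()
    return psnr, ssim, lp
```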


Figure A4. Example results of the effectiveness of our motion enhancement. The first column shows the input RS frame I_1^r; the second to fifth columns show the initial intermediate GS frame candidate I_{1→1}^g, the refined intermediate GS frame candidate, and their absolute differences ‖I_{1→1}^g − I_1^{gt}‖_2 with the corresponding ground truth, respectively. The sixth column indicates the mean of the BMF residual map ‖ΔU_{1→1}‖_2 (the brighter a pixel, the bigger the motion enhancement). Our CVR effectively enhances ambiguous motion boundaries for more accurate contextual alignment.

Figure A5. Visual results of the intermediate flow estimation (columns: input RS 1, intermediate flow of RSSR, intermediate flow of CVR, estimated GS of RSSR, estimated GS of CVR, ground truth). The estimated GS images in the fourth and fifth columns are obtained by warping the input RS frames according to the intermediate flows in the second and third columns, respectively. Our CVR estimates the intermediate flow with clearer motion boundaries than RSSR [9] and thus generates more accurate and sharper GS content.

C.2. Generalization on other real data

To evaluate the generalization performance of the proposed method on real rolling shutter images, we utilize the data provided by [62] and [14], in which hand-held cameras move quickly in the real world to capture real RS image sequences. As shown in Fig. A9, our CVR and CVR* can effectively and robustly remove the RS effect to obtain consistent distortion-free images, which validates the excellent generalization performance of our method in practice.

C.3. GS video reconstruction demo

We attach a supplementary video demo video.mp4 to dynamically demonstrate the GS video reconstruction results. In the video, we show the 10× temporal upsampling results, i.e. evenly interpolating 11 intermediate GS frames corresponding to time steps 0, 0.1, 0.2, ..., 0.9, 1. In essence, our method is capable of generating GS videos with arbitrary frame rates. Note that except for times 0, 0.5, and 1, our method has not been fed with GS images of other time instances during training. More qualitative results on the RS correction datasets [24] and real RS data [14, 62] can be seen in the supplementary video. With these examples, we can conclude that our method not only achieves state-of-the-art RS effect removal performance that is significantly better than competing methods, but also has the superior ability to recover high-quality and high-framerate GS videos.
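The frame-rate upsampling itself is just a sweep over the time variable; a minimal sketch looks as follows, where `reconstruct_gs` is a hypothetical placeholder for a trained CVR forward pass.

```python
from typing import Callable, List

def gs_video_10x(rs_0, rs_1, reconstruct_gs: Callable) -> List:
    """10x temporal upsampling between two consecutive RS frames.

    reconstruct_gs(rs_0, rs_1, t) stands in for the trained model; only
    t in {0, 0.5, 1} is supervised during training, yet any t in [0, 1]
    can be queried at inference time.
    """
    time_steps = [i / 10.0 for i in range(11)]   # t = 0.0, 0.1, ..., 1.0
    return [reconstruct_gs(rs_0, rs_1, t) for t in time_steps]
```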

Appendix D. Details of Loss Function

Assuming that T GS images at time instances \{t_i\}_{i=1}^{T}, t_i ∈ [0, 1], are to be recovered to supervise the training of our model, and that I^{gt}_{t_i} is the corresponding ground-truth (GT) GS image, our loss function L is a linear combination of the reconstruction loss L_r, perceptual loss L_p, contextual consistency loss L_c, and total variation loss L_tv, i.e.

\mathcal{L} = \lambda_r \mathcal{L}_r + \mathcal{L}_p + \lambda_c \mathcal{L}_c + \lambda_{tv} \mathcal{L}_{tv},   (16)

where λ_r, λ_c, and λ_tv are hyper-parameters. The pixel intensities of images are normalized.

The reconstruction loss L_r models the pixel-wise L1 loss between the final GS frame prediction and the corresponding ground truth, given by

\mathcal{L}_r = \frac{1}{T}\sum_{i=1}^{T}\left\| I^g_{t_i} - I^{gt}_{t_i} \right\|_1.   (17)

The perceptual loss L_p contributes to producing fine details and improves the perceptual quality of the final intermediate GS frame [19] via

\mathcal{L}_p = \frac{1}{T}\sum_{i=1}^{T}\left\| \phi\left(I^g_{t_i}\right) - \phi\left(I^{gt}_{t_i}\right) \right\|_1,   (18)

where φ denotes the conv4_3 features of the pre-trained VGG16 network [49], as widely used in [18, 24, 34].

Figure A6. Visual examples against off-the-shelf VFI approaches (i.e. BMBC [38], DAIN [2], and the cascaded method). Although the cascaded method can compensate for the drawback that the VFI method cannot remove RS artifacts, it is also prone to local errors due to error accumulation, as shown by the red circles.

Table A2. Ablation results for the CVR* architecture on MA and G.

Settings    | PSNR↑ (dB)           | SSIM↑
            | CRM    CR     FR     | CR     FR
RAFT-based  | 30.40  29.91  27.67  | 0.914  0.835
FreezeMA    | 31.69  31.53  28.51  | 0.927  0.843
T · ΔU      | 31.15  30.95  28.01  | 0.916  0.831
w/o ΔU      | 31.61  31.41  27.99  | 0.925  0.831
w/o O       | 30.96  30.80  23.89  | 0.913  0.804
full model  | 31.82  31.60  28.62  | 0.927  0.845

The contextual consistency loss L_c encourages the alignment of the refined intermediate GS frame candidates and their ground-truth frame at time t_i. This can also facilitate the final enhanced BMF to reason about the underlying occlusions and the object-specific motion boundaries, which are crucial for the final GS frame synthesis. Specifically, we define L_c as:

\mathcal{L}_c = \frac{1}{2T}\sum_{i=1}^{T}\left( \left\| I^g_{0\to t_i} - I^{gt}_{t_i} \right\|_1 + \left\| I^g_{1\to t_i} - I^{gt}_{t_i} \right\|_1 \right).   (19)

The total variation loss L_tv enforces piecewise smoothness in the final enhanced BMF [10, 28], i.e.

\mathcal{L}_{tv} = \frac{1}{2T}\sum_{i=1}^{T}\left( \left\| \nabla U_{0\to t_i} \right\|_2 + \left\| \nabla U_{1\to t_i} \right\|_2 \right).   (20)

Appendix E. Ablations of the Proposed CVR*

Additionally, we report the impact of different network architecture designs on our CVR* in Table A2, referring to the GS images at time t = 0.5. Ablation results that are almost consistent with those of CVR in the main paper are obtained, which fully demonstrates the validity of the network architecture we use.

Appendix F. Failure Cases

We have discussed that our pipeline may produce blending and ghosting artifacts in image regions with low/weak/repetitive textures. We attribute this to the fact that our method exploits image-based warping, and thus potential gross errors of the estimated BMF in these challenging regions can easily lead to contextual misalignment. In fact, this is a common challenge for current RS correction methods based on image warping, e.g. [9, 62–64]. We show visual results of the failure cases in Fig. A7. Similar to the training process of VFI methods [18, 37, 57], it will likely be helpful to use more GT GS images at different time steps to supervise the training of our network. In the future, we also plan to improve the BMF estimation or design feature-based aggregation schemes to ameliorate this weakness.

Figure A7. Failure cases in challenging areas (columns: input RS overlay, CVR (Ours), ground truth). The white pillars and car tails lack texture and are thus prone to aliasing artifacts.


Figure A8. Rolling shutter effect removal examples against competing approaches (i.e. SUNet [10], DiffHomo [63], DiffSfM [62], and RSSR [9]); panels per example: RS frame 1, SUNet, DiffHomo, DiffSfM, RSSR, CVR* (Ours), CVR (Ours), and ground truth. Even columns: absolute difference between the corrected global shutter image and the corresponding ground truth.


Figure A9. Generalization results on real rolling shutter data with noticeable rolling shutter artifacts (columns: original RS image, CVR* (Ours), CVR (Ours)). The data in the first three rows are from [62], and the last three rows are from [14]. Consistent and high-quality correction results are obtained by our CVR and CVR*.
