Advanced Driving Assistance System

Yu Huang

Apr 21, 2017

Transcript
Page 1: Advanced Driving Assistance System

Yu Huang

yu.huang07@gmail.com

Sunnyvale, California

Advanced Driving Assistance System

Page 2: Advanced Driving Assistance System

Outline

DAS and ADAS functions;

ADAS software components;

Computer vision for ADAS;

Visual odometry and SLAM in ADAS;

Parking assist in ADAS;

Lane detection and tracking in ADAS;

Obstacle detection and tracking in ADAS;

Free space/Drivable area detection in ADAS;

Traffic Scene Understanding;

Range/Distance estimation in ADAS;

Time to contact for obstacle avoidance in ADAS;

Traffic sign/light recognition;

MobilEye: Deep learning with vision;

Detection and segmentation in ADAS;

E2E deep learning for autonomous driving;

Maneuver anticipation by learning;

Deep reinforcement learning for self-driving;

Learn from Maps: Visual Common Sense;

Synthetic Autonomous Driving using GANs;

Augmented Reality in ADAS;

Appendix A: Deep Reinforcement Learning (RL);

Appendix B: Generative Adversarial Network.

Page 3: Advanced Driving Assistance System

Spectrum of DAS and ADAS Functions

Page 4: Advanced Driving Assistance System

ADAS Functionalities

Page 5: Advanced Driving Assistance System

ADAS Software Component Architecture

Page 6: Advanced Driving Assistance System

ADAS with Computer Vision

Page 7: Advanced Driving Assistance System

Generalized Camera

A general imaging model used to represent an imaging system.

All imaging systems perform a mapping from incoming scene rays to photo-sensitive elements on the image detector: conveniently using a set of virtual sensing elements called raxels.

Raxels include geometric, radiometric and optical properties.

A calibration method that uses structured light patterns to extract the raxel parameters of an arbitrary imaging system.

(a) catadioptric system, (b) dioptric wide-angle system, (c) imaging system made of a camera cluster, and (d) compound camera made of individual sensing elements, each including a receptor and a lens.

Page 8: Advanced Driving Assistance System

Generalized Camera

(a) A raxel is a virtual replacement for a real photosensitive element. A raxel may have radiometric and optical parameters.

(b) The notation for a raxel.

A raxel is a virtual photo-sensitive element that measures the light energy of a compact bundle of rays, which can be represented by a single principal incoming ray.

An imaging system modeled as a set of raxels on a sphere surrounding the imaging system. Each raxel i has a position pi on the sphere, and an orientation qi; aligned with an incoming ray. Multiple raxels may be located at the same point (p1 = p2 = p3), but have different directions.

Page 9: Advanced Driving Assistance System

Generalized Camera

(a) A non-perspective imaging system and a calibration system; the parabolic catadioptric system consists of a perspective camera and a parabolic mirror. The imaging system is mounted on the translating stage. The laptop displays 26 patterns. (b) A sample bit pattern as seen through the parabolic catadioptric system.

(a) A perspective imaging system and a calibration system, consisting of a laptop and a translating stage. The axis of perspective camera is normal to the plane of the screen. (b) A sample bit pattern as seen through the perspective system.

Page 10: Advanced Driving Assistance System

Structure from Motion in Generalized Camera

Use a network of cameras as if they were a single imaging device, even when they do not share a common center of projection.

A schematic of a multi-camera system for an autonomous vehicle, including a forward facing stereo pair, two cameras facing towards the rear, and a panoramic multi-camera cluster in the middle. The lines represent the rays of light sampled by these cameras.

Page 11: Advanced Driving Assistance System

Structure from Motion in Generalized Camera

The generalized imaging model expresses how each pixel samples the light-field. This sampling is assumed to be centered around a ray starting at a point X,Y,Z, with a direction parameterized by (φ, θ), relative to a coordinate system attached to the camera.

The simplified model captures only the direction of the ray, parameterized by its Pluecker vectors q, q’.

Note: The Pluecker vectors of a line are a pair of 3-vectors q, q', named the direction vector and the moment vector; q is a vector of any length in the direction of the line.
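For concreteness, a minimal NumPy sketch of storing a projection ray as a Pluecker line, assuming the common convention that the moment vector is q' = p x q for any point p on the line (not a notation taken from the slides):

```python
import numpy as np

def pluecker_line(point, direction):
    """Pluecker coordinates (q, q') of the ray through `point` along `direction`.

    q  - direction vector (any length along the line);
    q' - moment vector, q' = point x q; it is orthogonal to q and does not
         depend on which point of the line is chosen.
    """
    q = np.asarray(direction, dtype=float)
    q_moment = np.cross(np.asarray(point, dtype=float), q)
    return q, q_moment

# Example: the ray of a raxel located at p_i with orientation q_i.
q, q_moment = pluecker_line(point=[0.1, 0.0, 0.2], direction=[0.0, 0.0, 1.0])
```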

Page 12: Advanced Driving Assistance System

Structure from Motion in Generalized Camera

Generalized Epipolar Constraint btw two views' Pluecker vectors (see the sketch after this list)

Generalized Point Reconstruction in the fiducial coordinate system

Solve the above equation for α1 by

Note: the parameters α1 and α2 are the analogue of depth in conventional cameras:

Generalized Optic Flow Equation

Generalized Differential Epipolar Constraint
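For reference, the generalized epipolar constraint referred to above is commonly written as follows for Pluecker lines (q1, q1') and (q2, q2') observed in two generalized views related by rotation R and translation t (a sketch in Pless-style notation; the slides' own symbols may differ):

```latex
% Generalized epipolar constraint between Pluecker lines in two views:
q_2^{\top} \, [t]_{\times} R \, q_1
  \;+\; q_2^{\top} R \, q'_1
  \;+\; {q'_2}^{\top} R \, q_1 \;=\; 0
```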

Page 13: Advanced Driving Assistance System

Multiple View Geometry in Generalized Camera

The general model covers most existing camera types: pinhole cameras, sensors with radial or more general distortions, catadioptric cameras (central or non-central), etc.

A hierarchy of general camera models: the most general model has unconstrained projection rays, whereas the most constrained model dealt with here is the central model, where all rays pass through a single point.

Intermediate models are what we call axial cameras (all rays touch a single line), and x-slit cameras (rays touch two lines).

A multi-view geometry of completely non-central cameras, leading to the formulation of multi-view matching tensors, analogous to the fundamental/essential matrices, trifocal and quadri-focal tensors of perspective cameras.

Page 14: Advanced Driving Assistance System

Multiple View Geometry in Generalized Camera

Examples of imaging systems; (c)–(e) are non-central devices. (a) Catadioptric system. (b) Central camera (e.g. perspective, with or without radial distortion). (c) Camera looking at reflective sphere. (d) Omnivergent imaging system. (e) Stereo system.

Page 15: Advanced Driving Assistance System

Multiple View Geometry in Generalized Camera

Camera models, defined by 3D points and lines that have an intersection with all projection rays of a camera.

Page 16: Advanced Driving Assistance System

Multiple View Geometry in Generalized Camera

Parameterization of projection rays for different camera models:

In central cameras, all rays go through a single point, the optical center: a finite and infinite optical center. In axial cameras, all rays touch a line, the camera axis: a finite and an infinite camera axis. In x-slit cameras, there exist two lines – camera axes – that cut all projection rays: (i) both axes are finite lines or (ii) one of the two axes is a line at infinity.

Page 17: Advanced Driving Assistance System

Multiple View Geometry in Generalized Camera

Cases of multi-view matching constraints for central and non-central cameras. Columns “useful” contain entries of the form x-y-z etc. that correspond to sub-matrices of M that give rise to matching constraints linking all views: x-y-z refers to submatrices containing x rows from one camera, y from another etc.

Page 18: Advanced Driving Assistance System

Multiple View Geometry in Generalized Camera

Essential matrices for different camera models

Page 19: Advanced Driving Assistance System

Apply Motion Estimation for Predictive Collision Avoidance

Approaches to two of the major tasks for autonomous driving in urban environments: self-localization and ego-motion estimation, and detection of dynamic objects such as cars and pedestrians.

Use a restrictive motion model that allows the motion to be parameterized with only 1 feature correspondence; the vehicle motion has an Instantaneous Center of Rotation (ICR);

The two front wheels are turned by slightly different angles so that the vehicle moves instantaneously along a circle and, thus, turns about the ICR;

This reduces the DoF to two, namely the rotation angle and the radius of curvature;

Circular motion: only 1 feature correspondence suffices for the epipolar geometry;

Straight motion: 1-Point RANSAC for removing outliers;

Very efficient algorithms for outlier removal and motion estimation (see the sketch after this list).
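A minimal sketch of the 1-point outlier-rejection idea above, in its histogram-voting variant: each correspondence votes for a rotation angle under the circular (ICR) motion model, the histogram peak gives the motion hypothesis, and correspondences far from it are rejected. The per-correspondence angle estimate `theta_from_match` is a placeholder, since its closed form depends on the camera model and normalization used:

```python
import numpy as np

def theta_from_match(p_prev, p_curr):
    """Placeholder: rotation angle implied by one normalized image
    correspondence under the circular (ICR) motion model."""
    raise NotImplementedError

def one_point_ransac(matches, n_bins=180, inlier_tol_deg=1.0):
    """Histogram-voting 1-point outlier rejection for planar circular motion."""
    thetas = np.array([theta_from_match(p, q) for p, q in matches])
    hist, edges = np.histogram(thetas, bins=n_bins)
    k = int(np.argmax(hist))
    theta_best = 0.5 * (edges[k] + edges[k + 1])      # histogram peak
    inliers = np.abs(np.degrees(thetas - theta_best)) < inlier_tol_deg
    return theta_best, inliers
```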

Page 20: Advanced Driving Assistance System

Apply Motion Estimation for Predictive Collision Avoidance

Page 21: Advanced Driving Assistance System

Apply Motion Estimation for Predictive Collision Avoidance

Pedestrian & car detection: an appearance based detector that uses the information from camera images, a 2D-laser based detector providing structural info., and a tracking module that uses the combined info. from both sensors and estimates the motion for each tracked object;

The laser based detection applies a boosted Conditional Random Field (CRF) on geometrical and statistical features of 2D scan points.

The image based detector uses extended Implicit Shape Model (ISM).

It operates on a region of interest obtained from projecting the laser detection into the image to constrain the position and scale of the objects.

The tracking module applies an Extended Kalman Filter (EKF) with two motion models, fusing the info. from camera and laser.

Page 22: Advanced Driving Assistance System

Evaluation of Fisheye-Camera Based Visual Multi-Session Localization in a Real-World Scenario

Fully automated valet parking and charging of electric vehicles using only low-cost sensors.

To implement robust visual localization using only cameras and stock vehicle sensors.

Four monocular, wide-angle, fisheye cameras on a consumer car and implemented a mapping and localization pipeline.

Visual features and odometry are combined to build and localize against a key-frame-based 3-d map.

Page 23: Advanced Driving Assistance System

Evaluation of Fisheye-Camera Based Visual Multi-Session Localization in a Real-World Scenario

Page 24: Advanced Driving Assistance System

Bertha Benz’s Autonomous Driving

The Mercedes-Benz S-Class S 500 INTELLIGENT DRIVE followed the same route from Mannheim to Pforzheim, Germany, in a fully autonomous manner;

Equipped with sensor hardware, it relied solely on vision and radar sensors together with digital maps to obtain an understanding of traffic situations.

Page 25: Advanced Driving Assistance System

Bertha Benz’s Autonomous Driving

Page 26: Advanced Driving Assistance System

Bertha Benz’s Autonomous Driving

Page 27: Advanced Driving Assistance System

Bertha Benz’s Autonomous Driving

(a) Landmarks btw the mapping (top) and online (bottom). (b) Detected lane markings (red), sampled map (blue) and residuals (green).

Page 28: Advanced Driving Assistance System

Bertha Benz’s Autonomous Driving

Page 29: Advanced Driving Assistance System

Motion Estimation for Self-driving Cars with a Generalized Camera

Visual ego-motion estimation for a self-driving car equipped with a close-to-market multi-camera system.

By modeling the multi-camera system as a generalized camera and applying the non-holonomic motion constraint of a car, a novel 2-point minimal solution for the generalized essential matrix is derived, from which the full relative motion including metric scale can be obtained.

A degeneracy exists when the car undergoes straight motion in the special case with only intra-camera correspondences, where the scale becomes unobservable; a practical alternative solution is provided.

Page 30: Advanced Driving Assistance System

Motion Estimation for Self-driving Cars with a Generalized Camera

The 6-vector Pluecker line

The epipolar constraint btw two Pluecker lines

Page 31: Advanced Driving Assistance System

Motion Estimation for Self-driving Cars with a Generalized Camera

Relative motion R and t between Vk and Vk+1

Page 32: Advanced Driving Assistance System

Motion Estimation for Self-driving Cars with a Generalized Camera

A non-linear refinement is applied using all the inliers found from RANSAC to get a better estimate of ρ and θ.

Pi and Pi′ - camera projection matrices

Page 33: Advanced Driving Assistance System

Pose Estimation for a Multi-camera System with Known Vertical Direction

Minimal 4-point and linear 8-point algorithms to estimate relative pose of a multi-camera system with known vertical directions, i.e. absolute roll and pitch angles.

Solve the minimal 4-point algorithm with the hidden-variable resultant method; it leads to a degree-8 univariate polynomial that gives up to 8 real solutions.

Page 34: Advanced Driving Assistance System

Pose Estimation for a Multi-camera System with Known Vertical Direction

The pipeline of minimal 4-point algorithm:

1. transforms the Pluecker line correspondences with the roll and pitch angles of the correspondence frames from the IMU.

2. the minimal 4-point algorithm gives the relative pose estimated from the transformed Pluecker line correspondences.

3. relative pose in original correspondence frames is computed.

The generalized epipolar constraint

the generalized essential matrix

Page 35: Advanced Driving Assistance System

Pose Estimation for a Multi-camera System with Known Vertical Direction

It is also possible to solve for tˆ, Rˆy linearly with 8 Pluecker line correspondences.

Solve for the generalized essential matrix by SVD.

Apply RANSAC for robust estimation to reject outlier correspondences and to determine the solution from multiple solutions from minimal 4-point algorithm.

Page 36: Advanced Driving Assistance System

Learning Towards Detection/Tracking of Lane Markings

Large appearance variations in lane markings are caused by factors such as occlusion, shadows, and changing lighting conditions of the scene. A learning-based approach using visual inputs from a front-mounted camera: a pixel-hierarchy feature descriptor to model contextual information shared by lane markings with the surrounding road region; boosting to select relevant contextual features for detecting lane markings; particle filters to track the lane markings, without knowledge of the vehicle speed, by assuming the lane markings to be static and then learning the possible road-scene variations from the statistics of the tracked model parameters. Evaluated on challenging daylight and night-time road video sequences.

Page 37: Advanced Driving Assistance System

Learning Towards Detection/Tracking of Lane Markings

Pipeline of the proposed approach: detection with boosting on contextual features, and particle-filter based tracking to learn some road scene variations

Page 38: Advanced Driving Assistance System

Parking Assistance System

Backing-out and heading-out maneuvers in perpendicular or angle parking lots are among the most dangerous maneuvers. A vision-based ADAS automatically warns the driver in such scenarios. A monocular grayscale camera is installed at the back-right side of the vehicle. A Finite State Machine (FSM), defined according to three CAN-bus variables and a manual signal provided by the user, handles activation/deactivation of the detection module. The traffic detection module computes spatio-temporal images from a set of pre-defined scan-lines which are related to the position of the road.

A spatio-temporal motion descriptor (Spatio-Temporal Histograms of Oriented Lines, STHOL) accounting for the number of lines, their orientation and length of the spatio-temporal images. Some parameters of the proposed descriptor are adapted for nighttime conditions. A Bayesian framework triggers warning using multivariate normal density functions.
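A minimal sketch of such a likelihood test with multivariate normal class models; the feature vector, class names, prior, and threshold below are illustrative assumptions, not taken from the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal

def warning_trigger(feature_vec, mean_vehicle, cov_vehicle,
                    mean_background, cov_background, prior_vehicle=0.5):
    """Return True if the STHOL-style feature vector is more likely to come
    from the 'approaching vehicle' class than from the background class."""
    p_vehicle = multivariate_normal.pdf(feature_vec, mean_vehicle, cov_vehicle)
    p_background = multivariate_normal.pdf(feature_vec, mean_background, cov_background)
    posterior = (p_vehicle * prior_vehicle) / (
        p_vehicle * prior_vehicle + p_background * (1.0 - prior_vehicle) + 1e-12)
    return posterior > 0.5
```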

Page 39: Advanced Driving Assistance System

Parking Assistance System

Driver and camera Field of View (FOV) in countries with right-hand traffic. (a) Back-out perpendicular parking. (b) Back-out angle parking. (c) Heading-out perpendicular parking. (d) Heading-out angle parking

Page 40: Advanced Driving Assistance System

Parking Assistance System

FSM for detection module

Overview of the spatio-temporal detection module

Two examples of the scan-lines and spatio-temporal images.

Page 41: Advanced Driving Assistance System

Parking Assistance System

Overview of the STHOL feature selection architecture.

Page 42: Advanced Driving Assistance System

Obstacle Detection by Monocular Cameras & Wheel Odometry

Extracts static obstacles from depth maps out of multiple consecutive images; Solely relies on the readily available wheel odometry (not visual odometry); To handle the resulting higher pose uncertainty, fuses obstacle detections over time and between cameras to estimate the free and occupied space around the vehicle; Using monocular fisheye cameras, cover a wider field of view and detect obstacles closer to the car, which are often not within the standard field of view of a classical binocular stereo camera setup.

Page 43: Advanced Driving Assistance System

3D Traffic Scene Understanding

A prob. generative model for 3D scene layout and the location/orientation of objects; scene topology, geometry and activities are inferred from short video sequences; a diverse set of visual cues in the form of vehicle tracklets, vanishing points, semantic scene labels, scene flow and occupancy grids; likelihoods for each of the visual cues are integrated into the prob. generative model; all model parameters are learned from training data using contrastive divergence.

Page 44: Advanced Driving Assistance System

Learning-Based Lane Departure Warning Systems with a Personalized Driver Model

Misunderstanding of driver correction behaviors (DCB) is the primary reason for false warnings of lane-departure-prediction systems. A learning-based approach to predicting unintended lane-departure behaviors (LDB) and the chance for drivers to bring the vehicle back to the lane. A personalized driver model for lane-departure and lane-keeping behavior is established by combining the Gaussian mixture model and the hidden Markov model. Based on that, online model-based prediction to predict the forthcoming vehicle trajectory and judge whether the driver will demonstrate an LDB or a DCB. A warning strategy based on model-based prediction that allows the lane-departure warning system to be acceptable for drivers according to the predicted trajectory. In addition, the naturalistic driving data of 10 drivers is collected through the University of Michigan Safety Pilot Model Deployment program to train the personalized driver model and validate this approach.

Page 45: Advanced Driving Assistance System

Learning-Based Lane Departure Warning Systems with a Personalized Driver Model

Lane departure prediction (LDP) aims to estimate whether a vehicle will depart from the lane, thus allowing time for a driver to take effective action to avoid a crash.

TLC (time to lane crossing)-based prediction; vehicle-variable based vehicle-position estimation; detection of the lane boundary using real-time road images;

TLC: predict the road boundary and the vehicle trajectory and then calculate the time when they intersect;

Assuming the road curvature is small, TLC is approximated as the ratio of the lateral distance to the lateral velocity, or as the ratio of the distance to the line crossing to the corresponding velocity;

TLC-based methods tend to have a higher false alarm rate (FAR) when the ego vehicle drives close to the lane boundary; the GMM components are applied to represent the hidden modes of the HMM.

Page 46: Advanced Driving Assistance System

Learning-Based Lane Departure Warning Systems with a Personalized Driver Model

To model drivers’ lane-keeping and lane-departure characteristics with 5 variables: Vehicle Speed (v), Relative Yaw Angle (ψ), Relative Yaw Rate (ψ˙), Road Curvature (ρ), and Lateral Displacement (∆y).

Page 47: Advanced Driving Assistance System

Vision-based ACC with a Single Camera: Bounds on Range and Range Rate Accuracy

Vision-based Adaptive Cruise Control (ACC) system which uses a camera as input; to compute range and range rate from a single camera and discuss how the imaging geometry affects the range and range-rate accuracy.

There are two cues which can be used: the size of the vehicle in the image and the position of the bottom of the vehicle in the image; a much better estimate can be achieved using the road geometry and the point of contact of the vehicle and the road.

Assume a planar road surface and a camera mounted so that the optical axis is parallel to the road surface; a point on the road at a distance Z in front of the camera will project to the image at a height y = fH/Z, where H is the camera height.

To determine the vehicle distance, first detect the point of contact btw the vehicle and the road and then compute the distance to the vehicle: Z = fH/y. The error at 90 m is about 10%, at 44 m about 5% (see the numerical sketch below).
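A minimal numerical sketch of this flat-road range model; the focal length, camera height, and one-pixel localization error below are assumed values chosen to roughly reproduce the quoted error levels, not parameters from the paper:

```python
# Range from the road/vehicle contact point under the flat-road model:
#   y = f * H / Z   =>   Z = f * H / y
def range_from_contact_row(y_pixels, f_pixels=740.0, cam_height_m=1.2):
    return f_pixels * cam_height_m / y_pixels

def range_error_for_pixel_noise(Z, f_pixels=740.0, cam_height_m=1.2, noise_px=1.0):
    """Range error caused by mislocating the contact point by `noise_px` pixels."""
    y = f_pixels * cam_height_m / Z
    return abs(range_from_contact_row(y - noise_px, f_pixels, cam_height_m) - Z)

for Z in (44.0, 90.0):
    err = range_error_for_pixel_noise(Z)
    print(f"Z = {Z:5.1f} m -> ~{100.0 * err / Z:.1f}% error for 1 px noise")
```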

Page 48: Advanced Driving Assistance System

Vision-based ACC with a Single Camera: Bounds on Range and Range Rate Accuracy

Schematic diagram of the imaging geometry. The camera is mounted on vehicle (A) at a height (H). Rear of vehicle (B) is at a distance (Z1) from the camera. The point of contact btw the vehicle and the road projects onto the image plane at a position (y1).

Page 49: Advanced Driving Assistance System

Vision-based ACC with a Single Camera: Bounds on Range and Range Rate Accuracy

A typical sequence where the host vehicle decelerates so as to keep a safe headway distance from the detected vehicle. The detected target vehicle (the truck) is marked by a white rectangle. As the distance to the target vehicle decreases the size of the target vehicle in the image increases.

Page 50: Advanced Driving Assistance System

Vehicle Dynamics Estimation for Camera-based Visibility Distance Estimation

The presence of an area with low visibility is relevant information for an autonomous vehicle, since environment sensing is critical for safety. A generic visibility sensor using an onboard camera in a vehicle: estimate the range to the most distant object belonging to the road plane that has at least 5% contrast. The depth map of the vehicle environment is obtained by aligning the road plane in successive images; it exploits the dynamics of the vehicle, which are given or observed from proprioceptive sensors classically available on production vehicles. Related methods: a method using detection of lane markings (contrast attenuation with distance); a mono-camera method adapted to fog using Koschmieder's model (sketched below); a method using stereo vision, where the distance to the furthest point of the road surface with a contrast greater than 5% gives the visibility distance.
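For reference, a sketch of Koschmieder's model as it is commonly used for fog-visibility estimation with the 5% contrast threshold (the paper's exact notation may differ):

```latex
% Koschmieder's law: apparent luminance of an object at distance d,
% with intrinsic luminance L_0, sky luminance L_\infty, extinction coefficient k:
L(d) = L_0\, e^{-k d} + L_\infty \left(1 - e^{-k d}\right)
% The contrast against the sky decays as C(d) = C_0\, e^{-k d}; with the 5%
% threshold, the (meteorological) visibility distance is
V = -\frac{\ln(0.05)}{k} \approx \frac{3}{k}
```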

Page 51: Advanced Driving Assistance System

Integrated Vehicle and Lane Detection with Distance Estimation

An integrated system that combines vehicle detection, lane detection, and vehicle distance estimation in a collaborative manner. Adaptive search windows for vehicles provide constraints on the width btw lanes.

By exploiting constraints, the search space for lane detection can be efficiently reduced. Local patch constraints for lane detection to improve the reliability of lane detection.

Utilize lane marker with the associated 3D constraint to estimate the camera pose and the distances to frontal vehicles.

Page 52: Advanced Driving Assistance System

Integrated Vehicle and Lane Detection with Distance Estimation

First detect three vanishing points from an image and estimate the focal length from these vanishing points; estimate the camera pose from six 2D-3D point correspondences, i.e. image coordinates and the associated 3D world coordinates with the known lane width W1 (approximately 3.75 m), to generate the projection matrix M; assume the vehicle is located on the road plane Y = 0 (a back-projection sketch follows below).
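A minimal sketch (NumPy; M is a hypothetical 3x4 projection matrix) of how the road-plane assumption Y = 0 gives the ground position of an image point, e.g. the bottom of a detected frontal vehicle:

```python
import numpy as np

def backproject_to_road_plane(u, v, M):
    """Intersect the viewing ray of pixel (u, v) with the road plane Y = 0.

    For a road-plane point (X, 0, Z), the 3x4 projection matrix M acts as a
    homography H = [m1 m3 m4] (columns 1, 3, 4 of M) on the homogeneous plane
    point (X, Z, 1); inverting H recovers the ground-plane coordinates.
    """
    H = np.column_stack([M[:, 0], M[:, 2], M[:, 3]])
    X, Z, w = np.linalg.solve(H, np.array([u, v, 1.0]))
    return X / w, Z / w   # lateral offset and forward distance on the road plane
```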

Page 53: Advanced Driving Assistance System

Integrated Vehicle and Lane Detection with Distance Estimation

Page 54: Advanced Driving Assistance System

Analyze a Ground Vehicle’s Lateral Movements for Reliable Autonomous City Driving with a Single Camera

For safe urban driving, keep a car within a road-lane boundary. Requires human and robotic drivers to recognize the boundary of a road-lane and the location of the vehicle wrt the boundary of a road-lane. Analyzes a stream of perspective images to produce info. about a vehicle’s lateral movements, such as distances from a vehicle to a road lane’s boundary and detection of lane-changing maneuvers.

A perspective transformation btw the camera plane and the roadway plane

Page 55: Advanced Driving Assistance System

Analyze a Ground Vehicle’s Lateral Movements for Reliable Autonomous City Driving with a Single Camera

A vanishing point's location and the horizon line in a perspective image give info. about the road scene geometry. If the roll and yaw angles are zero, the pitch angle is obtained from the vanishing point;

If the road plane is flat and perpendicular to the image plane, the vanishing point maps exactly to the camera center, resulting in a zero pitch angle;

The pitch angle is obtained by analyzing the difference between the vanishing point and the principal point, as sketched below.
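A sketch of the implied relation, assuming zero roll and yaw, focal length f, and principal point (cx, cy) in pixels (the sign depends on the image-axis convention):

```latex
% Pitch angle from the vertical offset of the vanishing point
% relative to the principal point:
\theta_{pitch} = \arctan\!\left(\frac{v_{vp} - c_y}{f}\right)
% (zero when the vanishing point coincides with the principal point)
```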

Page 56: Advanced Driving Assistance System

Range Estimation with a Monocular Camera for Vision-Based Forward Collision Warning System

Range estimation for vision-based FCW with a camera; estimate a virtual horizon from the size/position of vehicles; vision-based FCW in highway/urban traffic environments.

Range Estimation Using Size Info.: if the real width of a vehicle is known, the range to the vehicle is calculated as d = FW/w, where F is the focal length of the camera and w/W are the vehicle width in the image / in 3-D space. Range Estimation Using Position Info.: assuming both roll and yaw angles are zero, the range is obtained from the pitch angle.

Figures: vision-based FCW system; imaging geometry with zero and with nonzero camera pitch angle.

Page 57: Advanced Driving Assistance System

Range Estimation with a Monocular Camera for Vision-Based Forward Collision Warning System

A robust range estimation method which provides range information even when the road inclination varies continuously or lane markings are not visible; determine the horizon only from the size and position of vehicles in the image; vertical coordinate of the horizon Yh = Yb - Hc*wa/Wa, where Hc is the camera height, Yb the vehicle bottom-line position, wa the vehicle width in the image, and Wa the real vehicle width (derivation sketched below).

Calculate range with the estimated virtual horizon.
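A short derivation of the virtual-horizon relation above, assuming a pinhole camera with focal length f (the paper's notation is kept where possible):

```latex
% Image width of a vehicle of real width W_a at range d:   w_a = f W_a / d
% Vehicle bottom line relative to the horizon row:         Y_b - Y_h = f H_c / d
% Eliminating d gives the virtual horizon row
Y_h = Y_b - \frac{H_c\, w_a}{W_a}
% and, once Y_h is known, the range to any vehicle follows from
d = \frac{f H_c}{Y_b - Y_h}
```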

Figures: average horizon vs. virtual horizon, and the min/max width of a vehicle at Yb.

Page 58: Advanced Driving Assistance System

Robust Vehicle Detection and Distance Estimation Under Challenging Lighting Conditions

Real-time monocular-vision based techniques for simultaneous vehicle detection and inter-vehicle distance estimation. A collision warning system by detecting vehicles ahead, and by identifying safety distances to assist a distracted driver, prior to occurrence of an imminent crash.

Page 59: Advanced Driving Assistance System

Robust Vehicle Detection and Distance Estimation Under Challenging Lighting Conditions

A single-sensor multi-info. fusion framework, showing examples of successful vehicle detection.

Page 60: Advanced Driving Assistance System

Robust Vehicle Detection and Distance Estimation Under Challenging Lighting Conditions

Real vehicle distance estimation based on pixel-distance info. in the 2D image plane; distance estimation based on the bird's-eye view.

Page 61: Advanced Driving Assistance System

Pitch Angle Estimation Using a Vehicle Mounted Monocular Camera for Vehicle Target Range Measurement

Range measurement using a Vehicle-Mounted monocular camera for ADAS; Optical flow of feature points is estimated from the monocular camera; Estimate camera ego-motion, and optimize nonlinearly using the GM method; Estimating pitch angle relative to the road surface from the translation vector; The pitch angle and the pitch angle rate decomposed from the rotation matrix are composed using an average transfer method.

Influence of the pitch angle for range measurement

Page 62: Advanced Driving Assistance System

Pitch Angle Estimation Using a Vehicle Mounted Monocular Camera for Vehicle Target Range Measurement

Page 63: Advanced Driving Assistance System

Pitch Angle Estimation Using a Vehicle Mounted Monocular Camera for Vehicle Target Range Measurement

Flowchart of the whole processing framework

Page 64: Advanced Driving Assistance System

Pitch Angle Estimation Using a Vehicle Mounted Monocular Camera for Vehicle Target Range Measurement

Depth estimation with the motion-stereo algorithm.

Page 65: Advanced Driving Assistance System

Pitch Angle Estimation Using a Vehicle Mounted Monocular Camera for Vehicle Target Range Measurement

Pitch angle estimation from the translation vector of vehicle motion.

Page 66: Advanced Driving Assistance System

Pitch Angle Estimation Using a Vehicle Mounted Monocular Camera for Vehicle Target Range Measurement

Pitch angle from the rotation matrix of vehicle motion; Pitch angle synthesis using the average transfer method.

Page 67: Advanced Driving Assistance System

Time To Contact for Obstacle Avoidance

Time to Contact (TTC) for obstacle detection and reactive control of motion that does not require scene reconstruction or 3D depth estimation; TTC is a measure of distance expressed in time units; TTC can be used to provide reactive obstacle avoidance for local navigation; TTC can be measured from the rate of change of size of features; Steer a vehicle using TTC to avoid obstacles while approaching a goal;

TTC does not depend on camera optics or object size, but only on the distance to the obstacle and the camera velocity.

TTC = Z / (-dZ/dt) = s / (ds/dt), where Z is the distance btw camera and obstacle, dZ/dt the velocity of the camera wrt the obstacle, s the size (or scale) of the object in the image, and ds/dt the time derivative of this scale.

Page 68: Advanced Driving Assistance System

Time To Contact for Obstacle Avoidance

Classical methods to compute TTC rely on the estimation of optical flow and its first derivative; Optical flow methods are iterative and tend to be computationally expensive and relatively imprecise; Calculating derivative of optical flow to estimate TTC further amplifies noise, generally leading to an unstable and unreliable estimate of TTC; Temporal derivative of the area of a closed active contour avoids the problems associated with the computation of image velocity fields and their derivative; When affine camera models are assumed, affine image conditions are required; Camera motion is sometimes restricted to planar motion, or to not include vertical displacements or cyclotorsion;

Page 69: Advanced Driving Assistance System

Time To Contact for Obstacle Avoidance

Scale Invariant Ridge Segment (SIRS): in a norm. Laplacian scale space; Bayesian Driving: a prob. distribution of the robot command functions;

Page 70: Advanced Driving Assistance System

Time To Contact for Obstacle Avoidance

Optic flow and TTC (τ): brightness constancy gives u*Ix + v*Iy + It = 0; for radial flow this becomes (x*Ix + y*Iy)/τ + It = 0. With G = x*Ix + y*Iy (the radial gradient), the least-squares estimate is τ = -ΣG² / Σ(G*It).

Projection geometry: x/f = X/Z, y/f = Y/Z. Optic flow (u, v) vs. 3-D motion (U, V, W): u/f = U/Z - (X/Z)(W/Z), v/f = V/Z - (Y/Z)(W/Z), i.e. u = (fU - xW)/Z, v = (fV - yW)/Z.

Case I: translational motion along the optic axis (U = V = 0); C*G + It = 0 with C = -W/Z = -1/τ; minimize Σ(C*G + It)² over C.

Case II: translation relative to a planar object perpendicular to the optic axis; with A = fU/Z, B = fV/Z, the constraint is A*Ix + B*Iy + C*G + It = 0; minimize Σ(A*Ix + B*Iy + C*G + It)² over A, B, C.

Case III: translational motion along the optic axis with the plane Z = Z0 + pX + qY, P = (p/f)(W/Z0), Q = (q/f)(W/Z0); minimize Σ[G(C + Px + Qy) + It]² over C, P, Q.

Case IV: translational motion relative to a planar object (alternating minimization); given P/C and Q/C, let F = 1 + (P/C)x + (Q/C)y and minimize Σ[F(A*Ix + B*Iy + C*G) + It]² over A, B, C; given A/C and B/C, let D = G + (A/C)Ix + (B/C)Iy and minimize Σ[D(C + xP + yQ) + It]² over C, P, Q.
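A minimal NumPy sketch of Case I (pure translation along the optic axis): the least-squares solution of the radial-flow constraint over all pixels gives τ = -ΣG² / Σ(G*It). The frame pair, gradient computation, and image-centered coordinates are assumptions of this sketch:

```python
import numpy as np

def ttc_case1(frame0, frame1):
    """Direct TTC (Case I, pure translation along the optic axis).

    Least-squares solution of G/tau + I_t = 0 over all pixels:
    tau = -sum(G^2) / sum(G * I_t), returned in frame intervals.
    Sign conventions depend on the image axes and the motion direction.
    """
    f0 = frame0.astype(float)
    f1 = frame1.astype(float)
    It = f1 - f0                              # temporal derivative
    Iy, Ix = np.gradient(0.5 * (f0 + f1))     # spatial derivatives
    h, w = f0.shape
    y, x = np.mgrid[0:h, 0:w]
    x = x - w / 2.0                           # principal-point-centered coords
    y = y - h / 2.0
    G = x * Ix + y * Iy                       # radial gradient
    return -np.sum(G * G) / np.sum(G * It)
```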

Page 71: Advanced Driving Assistance System

Time to Contact: Recognizing by Motion Patterns

TTC map method: 1st, segment the image into a large number of superpixels and estimate a TTC value for each superpixel using the standard IBD-based TTC method; 2nd, assume the TTC of each superpixel can be reliably computed and that superpixels belonging to the same coherent object have roughly similar estimated TTC values; 3rd, aggregate the superpixels into different objects based on ranges of estimated TTC values that are close to each other.

Page 72: Advanced Driving Assistance System

Forward Collision Warning with a Single Camera

A vision based Forward Collision Warning (FCW) system for highway safety. Get time to contact (TTC) and possible collision course directly from the size and position of the vehicles in the image without computing a 3D representation.

Page 73: Advanced Driving Assistance System

Forward Collision Warning with a Single Camera

A Forward Collision Warning (FCW) is issued when the time-to-contact (TTC) is lower than a certain threshold - typically 2 seconds.

S - ratio btw the vehicle width in the image in consecutive frames (see the sketch below)
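A sketch of how the TTC follows from the scale ratio S under a constant closing-speed assumption, with Δt the time between the two frames:

```latex
% Image width is inversely proportional to range: w(t) \propto 1/Z(t), so
S = \frac{w(t)}{w(t-\Delta t)} = \frac{Z(t-\Delta t)}{Z(t)}
% With constant closing speed, \dot{Z} \approx \frac{Z(t) - Z(t-\Delta t)}{\Delta t}
% = \frac{Z(t)\,(1 - S)}{\Delta t}, hence
TTC = \frac{Z(t)}{-\dot{Z}} = \frac{\Delta t}{S - 1}
```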

Page 74: Advanced Driving Assistance System

Forward Collision Warning with a Single Camera

Use the position of the vehicle boundaries in the image and their optic flow to determine whether the host vehicle is in fact on a possible collision course. Let Z(0) be the distance at time t = 0 and set Z(0) = 1 in some arbitrary units.

Track the left/right edge points of the followed vehicle as a function of time; these lines are then extrapolated to time t = TTC. If Xl(t) is still to the left and Xr(t) still to the right, then the target vehicle is on a collision course; if both Xl(t) and Xr(t) are to one side, then the target vehicle is not on a collision course with the camera mounted in the host vehicle.

On a collision course.

One of the vehicles performs an avoidance maneuver.

Even with a rough estimate of Z(0), the lateral position can be converted to meters to create a safety margin around the vehicle.

Page 75: Advanced Driving Assistance System

Traffic Sign Recognition in ADAS

Page 76: Advanced Driving Assistance System

Traffic Light Recognition in ADAS

Page 77: Advanced Driving Assistance System

MobilEye: Deep Learning with Vision

Page 78: Advanced Driving Assistance System

MobilEye: Deep Learning with Vision

Page 79: Advanced Driving Assistance System

MobilEye: Deep Learning with Vision

Page 80: Advanced Driving Assistance System

MobilEye: Deep Learning with Vision

Page 81: Advanced Driving Assistance System

MobilEye: Semantic Free Space

Page 82: Advanced Driving Assistance System

MobilEye: Traffic Light Detection

Page 83: Advanced Driving Assistance System

MobilEye: Traffic Scene Understanding

Page 84: Advanced Driving Assistance System

MobilEye: Forward Collision Warning

Page 85: Advanced Driving Assistance System

MobilEye: Lane Departure Warning

Page 86: Advanced Driving Assistance System

MobilEye: Pedestrian Collision Warning

Page 87: Advanced Driving Assistance System

DeepLanes: E2E Lane Position Estimation using Deep NNs

Positioning a vehicle btw lane boundaries is a core task of a self-driving car.

Approach to estimate lane positions directly using a deep neural network that operates on images from laterally-mounted down-facing cameras.

To create a diverse training set, generate semi-artificial images.

Estimate the position of a lane marker with sub-cm accuracy at 100 frames/s on an embedded automotive platform, requiring no pre- or post-processing.

The label ti ∈ [0, . . . , 316] for image Xi corresponds to the row with the pixel of the lane marking that is closest to the bottom border of the image

two cameras

Page 88: Advanced Driving Assistance System

DeepLanes: E2E Lane Position Estimation using Deep NNs

Formulated as the classification task of estimating the lane position.

Using a real-world background, various types of lane markings have been artificially placed to synthesize regular lane markings (a, b) and varying light conditions (c, d).

For a given image Xi, the deep NN computes a softmax probability output vector Yi = (y0, . . . , y316), where yk is the probability that row k of image Xi contains the lane marking.
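A minimal PyTorch-style sketch of this classification formulation; the layer sizes, input resolution, and pooling are illustrative placeholders, not the paper's architecture:

```python
import torch
import torch.nn as nn

NUM_CLASSES = 317  # matches the label range t_i in [0, 316] above

class LanePositionNet(nn.Module):
    """Tiny CNN that classifies which image row contains the lane marking."""
    def __init__(self, num_classes=NUM_CLASSES):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        z = self.features(x).flatten(1)
        return self.classifier(z)   # softmax over rows is applied in the loss

# Training uses a row-index target and a cross-entropy (softmax) loss:
model = LanePositionNet()
images = torch.randn(4, 3, 317, 180)          # placeholder batch
targets = torch.randint(0, NUM_CLASSES, (4,))  # row index of the lane marking
loss = nn.CrossEntropyLoss()(model(images), targets)
```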

Page 89: Advanced Driving Assistance System

Free-Space Detection with Self-Supervised Online Trained FCNs

FCN can be trained in a self-supervised manner and achieve similar results compared to training on manually annotated data, thereby reducing the need for large manually annotated training sets.

Rely on a stereo-vision disparity system, to automatically generate (weak) training labels for the color-based FCN.

Additionally, facilitate online training of the FCN instead of offline.

Consequently, given that the applied FCN is relatively small, the free-space analysis becomes highly adaptive to any traffic scene that the vehicle encounters.

Page 90: Advanced Driving Assistance System

Free-Space Detection with Self-Supervised Online Trained FCNs

Page 91: Advanced Driving Assistance System

Detecting Unexpected Obstacles for Self-Driving Cars: Fusing Deep Learning and Geometric Modeling

A fully convolutional network is used to predict a pixel-wise semantic labeling of (i) free-space, (ii) on-road unexpected obstacles, and (iii) background.

The geometric cues are exploited using a SoA detection approach that predicts obstacles from stereo input images via model-based statistical hypothesis tests.

A principled Bayesian framework to fuse semantic and stereo-based detection.

The mid-level Stixel representation is used to describe obstacles in a flexible, compact and robust manner.

Page 92: Advanced Driving Assistance System

MultiNet: Joint Semantic Reasoning for Autonomous Driving

An approach to joint classification, detection and semantic segmentation via a unified architecture where the encoder is shared amongst the three tasks.

Trained end-to-end and performs extremely well in the challenging KITTI dataset, outperforming the state-of-the-art in the road segmentation task.

Visualization of the label encoding. Blue grid: cells, Red cells: cells containing a car, Grey cells: cells in don’t care area. Green boxes: ground truth boxes

Solving street classification, vehicle detection and road segmentation in one forward pass.

Page 93: Advanced Driving Assistance System

MultiNet: Joint Semantic Reasoning for Autonomous Driving

MultiNet architecture

Page 94: Advanced Driving Assistance System

Deep Learning on Highway Driving (Stanford U.)

Mediated Perception

Lane Detection

Page 95: Advanced Driving Assistance System

Driver Simulator in NVidia

Page 96: Advanced Driving Assistance System

Learning Direct Perception in Autonomous Driving

System architecture: a ConvNet processes the TORCS image and estimates 13 indicators for driving. Based on the indicators and the current speed, a controller computes driving commands that are sent to TORCS to drive the host car.

(The Open Racing Car Simulator)

Computer vision-based autonomous driving systems: mediated perception approaches, behavior reflex approaches and direct perception based approach; Map an input image to a small number of key perception indicators that directly relate to the affordance of a road/traffic state for driving; Train a deep CNN using 12 hours of human driving in a video game and show that our model can work well to drive a car in a very diverse set of virtual environments.

Page 97: Advanced Driving Assistance System

Learning a Driving Simulator at Comma.ai

Apply Variational AutoEncoders with classical, learned cost functions using Generative Adversarial Networks for embedding realistic looking road frames: alternating the training of generative and discriminator networks; Learn a transition model in the embedded space using action conditioned Recurrent Neural Networks with sequences of length of 15 frames: teacher forcing in the first 5 frames and fed the outputs back as new inputs in the remaining 10 frames (RNN hallucination); Successfully simulate all the relevant events for driving.

Page 98: Advanced Driving Assistance System

Deep Learning for Maneuver Anticipation

A sensory-fusion deep learning architecture which jointly learns to anticipate and fuse multiple sensory streams; The architecture consists of Recurrent Neural Networks (RNNs) that use Long Short-Term Memory (LSTM) units to capture long temporal dependencies; A training procedure which allows the network to predict the future given only a partial temporal context; A diverse data set with 1180 miles of natural freeway and city driving, that can anticipate maneuvers 3.5 seconds before they occur in realtime with a precision and recall of 90.5% and 87.4% respectively.

Page 99: Advanced Driving Assistance System

Deep Learning for Maneuver Anticipation

Page 100: Advanced Driving Assistance System

Multi-Agent, Reinforcement Learning for Autonomous Driving

Autonomous driving is a multi-agent setting where the host vehicle must apply sophisticated negotiation skills with other road users when overtaking, giving way, merging, taking left and right turns and while pushing ahead in unstructured urban roadways.

Deep reinforcement learning to the problem of forming long term driving strategies.

How policy gradient iterations can be used, and the variance of the gradient estimation using stochastic gradient ascent can be minimized, without Markovian assumptions.

Decompose the problem into a composition of a Policy for Desires (which is to be learned) and trajectory planning with hard constraints (which is not learned). The goal of Desires is to enable comfort of driving, while hard constraints guarantees the safety of driving.

A hierarchical temporal abstraction called an "Option Graph" with a gating mechanism that significantly reduces the effective horizon and thereby reduces the variance of the gradient estimation even further. The Option Graph plays a similar role to "structured prediction" in supervised learning, reducing sample complexity, while also playing a similar role to the LSTM gating mechanisms used in supervised deep networks.

Page 101: Advanced Driving Assistance System

Learn from Maps: Visual Common Sense for Autonomous Driving

To develop a model for road layout inference given imagery from on-board cameras, without any reliance on high-definition maps.

Leverage the availability of standard navigation maps and corresponding street view images to construct an automatically labeled, large-scale dataset for this complex scene understanding problem.

By matching road vectors and metadata from navigation maps with Google Street View images, assign ground truth road layout attributes (e.g., distance to an intersection, one-way vs. two-way street) to the images.

Then train deep conv. networks to predict these road layout attributes given a single monocular RGB image.

This model learns to correctly infer the road attributes using only panoramas captured by car-mounted cameras as input.

Additionally, this method may be suitable to the novel application of recommending safety improvements to infrastructure (e.g., suggesting an alternative speed limit for a street).

Page 102: Advanced Driving Assistance System

SAD-GAN: Synthetic Autonomous Driving using GANs

Learning synthetic driving using generative neural networks.

To make a controller trainer network using images plus key press data to mimic human learning.

A stable GAN (DCGAN) to make predictions btw driving scenes using key presses.

Train the model on one video game, then test the accuracy and compare it by running the model on other maps to determine the extent of learning.

Figures: generator GAN model and discriminator GAN model.

Page 103: Advanced Driving Assistance System

SAD-GAN: Synthetic Autonomous Driving using GANs

CNN Architecture (AlexNet)

Train a generator network to predict images given an image and a key press. The discriminator is trained to distinguish btw generated images and images from the dataset. After obtaining a sufficiently efficient generator, the generator network is deployed in action to predict all three images from a given image. The three images: result from left, up and right key press from the present situation.

Page 104: Advanced Driving Assistance System

Augmented Reality in ADAS

Page 105: Advanced Driving Assistance System

DEEP REINFORCEMENT LEARNING (DEEP RL)

Appendix A: (Mostly copied from DeepMind’s RL slides)

Page 106: Advanced Driving Assistance System

Deep Reinforcement Learning

Page 107: Advanced Driving Assistance System

Deep Reinforcement Learning

RL is a general-purpose framework for decision-making: RL is for an agent with the capacity to act; each action influences the agent's future state; success is measured by a scalar reward signal; goal: select actions to maximize future reward.

DL is a general-purpose framework for representation learning: given an objective, learn the representation required to achieve that objective, directly from raw inputs, using minimal domain knowledge.

Deep Reinforcement Learning: AI = RL + DL; a single agent can solve any human-level task; RL defines the objective, DL gives the mechanism; RL + DL = general intelligence.

Page 108: Advanced Driving Assistance System

Deep Reinforcement Learning

At each step t the agent: Executes action at

Receives observation ot

Receives scalar reward rt

The environment: Receives action at

Emits observation ot+1

Emits scalar reward rt+1

Experience is a sequence of observations, actions, rewards

o1, r1, a1, ..., ot-1 ,rt-1 ,at-1, ot , rt

The state is a summary of experience st = f (o1, r1, a1, ..., ot-1 ,rt-1 ,at-1, ot , rt)

In a fully observed environment st = f (ot )

Page 109: Advanced Driving Assistance System

An RL agent may include one or more of these components: Policy - the agent's behaviour function; Value function - how good each state and/or action is; Model - the agent's representation of the environment.

A policy is the agent's behaviour, a map from state to action: deterministic policy a = π(s); stochastic policy π(a|s) = P[a|s].

A value function is a prediction of future reward: "How much reward will I get from action a in state s?"

The Q-value function gives the expected total reward from state s and action a under policy π with discount factor γ:

Qπ(s, a) = E[rt+1 + γ rt+2 + γ² rt+3 + … | s, a]

Value functions decompose into a Bellman equation: Qπ(s, a) = E_s',a' [r + γ Qπ(s', a') | s, a]

Deep Reinforcement Learning

Page 110: Advanced Driving Assistance System

An optimal value function is the maximum achievable value

Q*(s, a) = maxπ Qπ(s, a) = Qπ*(s, a)

Once we have Q* we can act optimally,

π*(s) = argmax_a Q*(s, a)

The optimal value maximizes over all future decisions; informally:

Q*(s, a) = rt+1 + γ max_at+1 rt+2 + γ² max_at+2 rt+3 + … = rt+1 + γ max_at+1 Q*(st+1, at+1)

Formally, optimal values decompose into a Bellman equation:

Q*(s, a) = E_s' [r + γ max_a' Q*(s', a') | s, a]

Model is learnt from experience

Acts as proxy for environment

Planner interacts with model

e.g. using look-ahead search

Deep Reinforcement Learning

Page 111: Advanced Driving Assistance System

Deep Reinforcement Learning

Page 112: Advanced Driving Assistance System

A Markov Decision Process (MDP) is defined by (S, A, P), where S is the state space, A the action space, and p(r, s' | s, a) a transition probability distribution.

Extra objects are defined depending on the problem setting: μ - initial state distribution; γ - discount factor.

In each episode, the initial state is sampled from μ, and the process proceeds until the terminal state is reached. For example: a taxi robot reaches its destination (termination = good); a waiter robot finishes a shift (fixed time); a walking robot falls over (termination = bad).

Goal: maximize the expected reward per episode. Deterministic policies: a = π(s); stochastic policies: a ~ π(a | s); parameterized policies: πθ.

Deep Reinforcement Learning

Page 113: Advanced Driving Assistance System

Deep Reinforcement Learning

Reinforcement learning is a difficult learning problem.

A solution based on Dynamic Programming rests on two basic principles:

1. If an action immediately causes a bad result, the agent learns not to take that action again;

2. If all actions in a certain situation lead to bad results, then that situation should be avoided.

The approximation of the optimal value function in a given state is equal to the true value of that state plus some error in the approximation;

Relationship btw successive states, defined by the Bellman equation;

If the function approximator is assumed to be a lookup table (LUT), perform sweeps in state space;

Use a function approximator to generalize and interpolate values of states;

Gradient descent on the mean squared Bellman residual in MDP.

Page 114: Advanced Driving Assistance System

Deep Reinforcement Learning

Q-learning solves the problem of having to take max over a set of integrals; Q-learning finds a mapping from state/action pairs to Q-values;

The Q-value is the expected sum of reinforcements received when performing the associated action and then following the given policy;

Advantage learning (AL) does not share the scaling problem of Q-learning; In AL, the value associated with each action is called an advantage.

The state value is defined to be the maximum advantage in that state;

For the state/action pair (x, u), the advantage is defined as the sum of the state value and the utility (advantage) of performing action u rather than the action currently considered best;

AL can find a sufficiently accurate approximation to the advantage function in a number of training iterations that is independent of this ratio.

Temporal difference (TD) learning learns the value function directly from the experienced return for selecting the action and then following the policy. Multi-step return variants instead of the one-step return TD(0) are called TD(λ), 0 ≤ λ ≤ 1.
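For concreteness, a minimal tabular sketch of the Q-learning update discussed above (discrete states/actions and an ε-greedy behaviour policy are assumptions of this sketch):

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning (one-step TD) update:
    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

def epsilon_greedy(Q, s, epsilon=0.1):
    """Behaviour policy: greedy w.r.t. Q with epsilon exploration."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))
```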

Page 115: Advanced Driving Assistance System

Deep Reinforcement Learning

Value-based RL Estimate the optimal value function Q(s, a) This is the maximum value achievable under any policy

Policy-based RL Search directly for the optimal policy π*

This is the policy achieving maximum future reward

Model-based RL Build a model of the environment Plan (e.g. by look-ahead) using model

Use deep NNs to represent Value function Policy Model

Optimize loss function by SGD

Page 116: Advanced Driving Assistance System

Value-based Deep RL

Q-Networks: Represent value function by Q-network with weights w Q(s, a, w) ≈ Q*(s, a)

Q-learning: Optimal Q-values obey the Bellman equation

Treat the right-hand side r + γ max_a' Q(s', a', w) as a target; minimize the MSE loss by SGD.

Converges to Q* using a table-lookup representation, but diverges using neural networks due to:

correlations between samples, and non-stationary targets (see the loss sketched below).
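A sketch of the squared Bellman-error loss being minimized, with the standard DQN stabilizations (experience replay and a frozen target network w⁻) that address the two issues above:

```latex
% DQN loss minimized by SGD over transitions (s, a, r, s') sampled from a
% replay memory, with target-network weights w^- held fixed:
L(w) = \mathbb{E}_{(s,a,r,s')}\!\left[
  \Big( r + \gamma \max_{a'} Q(s', a'; w^-) - Q(s, a; w) \Big)^{2}
\right]
```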

Page 117: Advanced Driving Assistance System

Deep RL at Atari Game

Page 118: Advanced Driving Assistance System

Deep RL at Atari Game

End-to-end learning of values Q(s, a) from pixels s; the input state s is a stack of raw pixels from the last 4 frames; the output is Q(s, a) for 18 joystick/button positions; the reward is the change in score for that step.

Network architecture and hyper-parameters fixed across all games

Page 119: Advanced Driving Assistance System

Deep RL at Atari Game

Page 120: Advanced Driving Assistance System

Value-based Deep RL

Double DQN: remove the upward bias caused by max_a Q(s, a, w); the current Q-network w is used to select actions, an older Q-network w⁻ is used to evaluate actions.

Prioritized replay: weight experience according to surprise; store experience in a priority queue according to the DQN error.

Dueling network: split the Q-network into two channels - an action-independent value function V(s, v) and an action-dependent advantage function A(s, a, w).

Combined algorithm: 3x mean Atari score vs Nature DQN

Page 121: Advanced Driving Assistance System

Gorila (General Reinforcement Learning Architecture)

• 10x faster than Nature DQN on 38 out of 49 Atari games • Applied to recommender systems within Google

Page 122: Advanced Driving Assistance System

Policy-based Deep RL

Represent policy by deep network with weights u

Define objective function as total discounted reward

Optimize the objective end-to-end by SGD, i.e. adjust the policy parameters u to achieve more reward. How to make high-value actions more likely:

The gradient of a stochastic policy π (a|s, u) is given by

The gradient of a deterministic policy a = π(s) is given by

if a is continuous and Q is differentiable
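The two gradients referred to above, written in their standard forms (a sketch; the slides' exact notation may differ):

```latex
% Stochastic policy gradient (score-function / actor-critic form):
\nabla_u J(u) = \mathbb{E}\big[\, \nabla_u \log \pi(a \mid s, u)\; Q^{\pi}(s, a) \,\big]
% Deterministic policy gradient (a = \pi(s; u), Q differentiable in a):
\nabla_u J(u) = \mathbb{E}\big[\, \nabla_u \pi(s; u)\; \nabla_a Q(s, a; w)\big|_{a=\pi(s;u)} \,\big]
```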

Page 123: Advanced Driving Assistance System

Policy-based Deep RL

Actor-Critic Algorithm: estimate the value function Q(s, a, w) ≈ Qπ(s, a); update the policy parameters u by SGD.

Asynchronous Advantage Actor-Critic Algorithm (A3C): Estimate state-value function

Q-value estimated by an n-step sample

Actor is updated towards target

Critic is updated to minimize MSE w.r.t. target

4x mean Atari score vs Nature DQN

Page 124: Advanced Driving Assistance System

Policy-based Deep RL

Deep RL with Continuous Actions: high-dim. continuous action spaces? Can't easily compute maxaQ(s, a)

Actor-critic algorithms learn without max

Q-values are differentiable w.r.t. a; Deterministic Policy Gradients (DPG) exploit knowledge of the gradient of Q w.r.t. a.

DPG is the continuous analogue of DQN; experience replay builds a data-set from the agent's experience; the critic estimates the value of the current policy by DQN.

To deal with non-stationarity, targets u-, w- are held fixed Actor updates policy in direction that improves Q

In other words critic provides loss function for actor

Deep Deterministic Policy Gradient (DDPG): gives a stable solution with neural networks.

Page 125: Advanced Driving Assistance System

Policy-based Deep RL

Page 126: Advanced Driving Assistance System

Policy-based Deep RL

Fictitious Self-Play (FSP): can deep RL find Nash equilibria in multi-agent games? A Q-network learns the "best response" to opponent policies by applying DQN with experience replay (c.f. fictitious play); a policy network π(a|s, u) learns an average of best responses; actions are a sampled mix of the policy network and the best response.

Neural FSP in Texas Hold'em Poker: heads-up limit Texas Hold'em; NFSP with raw inputs only (no prior knowledge of Poker) vs. SmooCT (3x medal winner 2015, handcrafted knowledge).

Page 127: Advanced Driving Assistance System

Policy-based Deep RL

Page 128: Advanced Driving Assistance System

Model-based Deep RL

Learn a transition model of the environment p(r, s’ | s, a) Plan using the transition model

e.g. Look-ahead using transition model to find optimal actions

Deep Models: represent the transition model p(r, s' | s, a) by a deep network; define an objective function measuring the goodness of the model, e.g. the number of bits to reconstruct the next state; optimize the objective by SGD.

Compounding errors? Errors in the transition model compound over the trajectory; by the end of a long trajectory, rewards can be totally wrong; model-based RL has failed (so far) in Atari!

Deep networks of value/policy can "plan" implicitly: each layer of the network performs an arbitrary computational step; an n-layer network can "look ahead" n steps; are transition models required at all?

Page 129: Advanced Driving Assistance System

Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning

To address the lack of generalization, propose an actor-critic model whose policy is a function of the goal as well as the current state, to better generalize;

To address the data inefficiency issue, propose the AI2-THOR framework, which provides an environment with high-quality 3D scenes and a physics engine.

It enables agents to take actions and interact with objects, and to collect a huge number of training samples efficiently. No need for feature engineering, feature matching or 3D reconstruction.

Page 130: Advanced Driving Assistance System

Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning

Page 131: Advanced Driving Assistance System

Control of Memory, Active Perception, and Action in Minecraft

Deep Q-Network (DQN), Deep Recurrent Q-Network (DRQN), Memory Q-Network (MQN), Recurrent Memory Q-Network (RMQN), and Feedback Recurrent Memory Q-Network (FRMQN).

Page 132: Advanced Driving Assistance System

Control of Memory, Active Perception, and Action in Minecraft

Examples of maps. (a) has an I-structured topology where the location of indicator (yellow/green), goals (red/blue), and spawn locations (black circle) are fixed across episodes. (b) has two goals and two rooms with color patterns. (c) consists of randomly generated walls and two goals. The agent can be spawned anywhere except for goal locations. (d) is similar to (c) except that it has an indicator at the fixed location (yellow/green) and a fixed spawn location.

Page 133: Advanced Driving Assistance System

Generating Text with Deep Reinforcement Learning

A schema for sequence to sequence learning with a Deep Q-network (DQN), which decodes the output sequence iteratively.

This enables the decoder to first tackle the easier portions of the sequence, and then turn to cope with the difficult parts.

In each iteration, an encoder-decoder Long Short-Term Memory (LSTM) network is employed to automatically create, from the input sequence, features that represent the internal states of the DQN and to formulate a list of potential actions for it.

Next, the DQN learns to decide which action (e.g., word) will be selected from the list to modify the current decoded sequence.

The newly modified output sequence is used as the input to the DQN for the next decoding iteration.

In each iteration, the reinforcement learner's attention is biased towards exploring sequence portions which were previously difficult to decode.

Page 134: Advanced Driving Assistance System

Generating Text with Deep Reinforcement Learning

Iteratively decoding with DQN and LSTM; the encoder-decoder LSTM network is depicted as gray-filled rectangles on the bottom; the top-left is the graphical illustration of the DQN with bidirectional LSTMs; the dash arrow line on the right indicates the iteration loop.

Page 135: Advanced Driving Assistance System

Generating Text with Deep Reinforcement Learning

Page 136: Advanced Driving Assistance System

Appendix B:

Generative Adversarial Networks

(GAN) and Applications

(Partially copied from OpenAI’s GAN slides)

Page 137: Advanced Driving Assistance System

Generative Modeling

Have training examples x ~ pdata(x); want a model that can draw samples x ~ pmodel(x), where pmodel ≈ pdata.

Conditional generative models: Speech synthesis: Text ⇒ Speech; Machine Translation: French ⇒ English

French: Si mon tonton tond ton tonton, ton tonton sera tondu. English: If my uncle shaves your uncle, your uncle will be shaved.

Image ⇒ Image segmentation

Environment simulator: reinforcement learning, planning

Leverage unlabeled data


Page 138: Advanced Driving Assistance System

Adversarial Nets Framework

A game between two players: 1. Discriminator D; 2. Generator G.

D tries to discriminate between a sample from the data distribution and a sample from the generator G.

G tries to "trick" D by generating samples that are hard for D to distinguish from data.

Page 139: Advanced Driving Assistance System

GANs

A framework for estimating generative models via an adversarial process, to train 2 models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G.

The training procedure for G is to maximize the probability of D making a mistake.

This framework corresponds to a minimax two-player game: in the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere;

In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with BP.

There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples.
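For reference, the minimax objective of this two-player game from the original GAN paper, written in the document's notation:

min over G, max over D of V(D, G) = E_{x ~ pdata(x)} [ log D(x) ] + E_{z ~ pz(z)} [ log(1 − D(G(z))) ]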

Page 140: Advanced Driving Assistance System

GANs

Page 141: Advanced Driving Assistance System

GANs

Page 142: Advanced Driving Assistance System

GANs

Page 143: Advanced Driving Assistance System

GANs

Rightmost column shows the nearest training example of the neighboring sample, in order to demonstrate that the model has not memorized the training set. Samples are fair random draws, not cherry-picked. Unlike most other visualizations of deep generative models, these images show actual samples from the model distributions, not conditional means given samples of hidden units. Moreover, these samples are uncorrelated because the sampling process does not depend on Markov chain mixing. a) MNIST b) TFD c) CIFAR-10 (fully connected model) d) CIFAR-10 (convolutional discriminator and “deconvolutional” generator).

Page 144: Advanced Driving Assistance System

Conditional Generative Adversarial Nets

GAN extended to a conditional model if both the generator and discriminator are conditioned on some extra information y, such as class labels or data from other modalities.

Conditioning by feeding y into both the discriminator and generator as additional input layer.

Page 145: Advanced Driving Assistance System

Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks

A generative parametric model, LAPGAN, capable of producing high quality samples of natural images.

Uses a cascade of convnets within a Laplacian pyramid framework to generate images in a coarse-to-fine fashion.

At each level of the pyramid, a separate generative convnet model is trained using the Generative Adversarial Nets (GAN) approach.

Samples drawn from the model are of higher quality than alternate approaches.

Page 146: Advanced Driving Assistance System

Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks

Page 147: Advanced Driving Assistance System

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

Bridge the gap between the success of CNNs for supervised learning and unsupervised learning.

A class of CNNs called Deep Convolutional Generative Adversarial Networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning.

Via training, the deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both generator and discriminator.

Additionally, use the learned features for general image representations.

Page 148: Advanced Driving Assistance System

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

Page 149: Advanced Driving Assistance System

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

Page 150: Advanced Driving Assistance System

f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization

Generative neural samplers are probabilistic models that implement sampling using feed-forward neural networks;

These models are expressive and allow efficient computation of samples and derivatives, but cannot be used for computing likelihood or for marginalization;

The generative adversarial training method allows such models to be trained through the use of an auxiliary discriminative neural network;

The generative-adversarial approach is a special case of an existing more general variational divergence estimation approach;

Any f-divergence can be used for training generative neural samplers.

Page 151: Advanced Driving Assistance System

f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization

[26] F. Nielsen and R. Nock. On the chi-square and higher-order chi distances for approximating f-divergences. Signal Processing Letters, IEEE, 21(1):10–13, 2014.

[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pp. 2672–2680, 2014.

Definition:
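The formula itself is not reproduced in this transcript; the standard f-divergence definition used by the paper is

Df(P || Q) = ∫ q(x) f( p(x) / q(x) ) dx

where f: ℝ+ → ℝ is a convex, lower-semicontinuous function with f(1) = 0, and p, q are the densities of P and Q.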

Page 152: Advanced Driving Assistance System

f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization

Variational Divergence Minimization (VDM):

Use the variational lower bound on the f-divergence Df(P|Q) in order to estimate a generative model Q given a true distribution P;

Use two NNs, a generative model Q and a variational function T: Q takes as input a random vector and outputs a sample of interest, parametrized through a vector θ and written Qθ; T takes as input a sample and returns a scalar, parametrized by a vector ω and written Tω.

Learn a generative model Qθ by finding a saddle-point of the following f-GAN objective function, where we minimize w.r.t. θ and maximize w.r.t. ω.
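The objective is not reproduced in this transcript; from the f-GAN paper it is

F(θ, ω) = E_{x ~ P} [ Tω(x) ] − E_{x ~ Qθ} [ f*(Tω(x)) ]

where f* is the convex (Fenchel) conjugate of the function f defining the divergence.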

Page 153: Advanced Driving Assistance System

f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization

Samples from three different divergences

Page 154: Advanced Driving Assistance System

Energy-based GANs

It views the discriminator as an energy function that attributes low energies to the regions near the data manifold and higher energies to other regions;

A generator is seen as being trained to produce contrastive samples with minimal energies, while the discriminator is trained to assign high energies to these generated samples;

Using the discriminator as an energy function allows the use of various architectures and loss functionals in addition to a binary classifier with logistic output;

An instantiation of the EBGAN framework uses an auto-encoder architecture, with the energy being the reconstruction error, in place of the binary discriminator;

A single-scale architecture can be trained to generate high-resolution images.

Page 155: Advanced Driving Assistance System

Energy-based GANs

EBGAN architecture with an auto-encoder discriminator

Propose the idea of a "repelling regularizer", which fits well into the EBGAN auto-encoder model, to keep the model from producing samples that are clustered in one or a few modes of pdata (similar to "mini-batch discrimination" by Salimans et al.).

Implementing the "repelling regularizer" has a pulling-away (PT) effect at the representation level.

The PT term is defined as follows.
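The formula is missing from this transcript; as stated in the EBGAN paper (treat the exact normalization as an assumption here), the PT term operates on a mini-batch of N encoder representations S and penalizes pairwise cosine similarity:

fPT(S) = 1 / (N(N − 1)) · Σ_i Σ_{j≠i} ( S_iᵀ S_j / (||S_i|| ||S_j||) )²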

Page 156: Advanced Driving Assistance System

Energy-based GANs

Generation from LSUN bedroom full-images. Left (a): DCGAN generation. Right (b): EBGAN-PT generation.

Page 157: Advanced Driving Assistance System

InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

GAN learns a generator network G that generates samples from the generator distribution PG by transforming a noise variable z ~ Pnoise(z) into a sample G(z).

This generator is trained by playing against an adversarial discriminator network D that aims to distinguish between samples from the true data distribution Pdata and the generator’s distribution PG.

InfoGAN, an information-theoretic extension to the GAN that is able to learn disentangled representations in a completely unsupervised manner.

InfoGAN is a GAN that also maximizes the mutual information between a small subset of the latent variables and the observation.

Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting of 3D rendered images, and background digits from the central digit on the SVHN dataset.

It also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset.

Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing supervised methods.

Page 158: Advanced Driving Assistance System

InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

GAN uses a simple factored continuous input noise vector z, so it is possible that the noise will be used by the generator in a highly entangled way, causing the individual dimensions of z to not correspond to semantic features of the data.

Decompose the input noise vector into two parts: (i) z, which is treated as source of incompressible noise; (ii) c, which we will call the latent code and will target the salient structured semantic features of the data distribution.

The generator network is provided with both the incompressible noise z and the latent code c, so the form of the generator becomes G(z, c).

Information-theoretic regularization: there should be high mutual information between the latent codes c and the generator distribution G(z, c).
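Concretely, the InfoGAN paper adds a mutual-information term to the standard GAN objective V(D, G), approximated by a variational lower bound L_I(G, Q) ≤ I(c; G(z, c)) using an auxiliary network Q and weight λ:

min over G, Q, max over D of V_InfoGAN(D, G, Q) = V(D, G) − λ · L_I(G, Q)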

Page 159: Advanced Driving Assistance System

InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

Manipulating latent codes on 3D Faces: the effect of the learned continuous latent factors on the outputs as their values vary from −1 to 1.

Page 160: Advanced Driving Assistance System

Generative Adversarial Text to Image Synthesis

A deep architecture and GAN formulation to effectively bridge SoA techniques in text and image modeling, translating visual concepts from characters to pixels.

To train a deep convolutional generative adversarial network (DC-GAN) conditioned on text features encoded by a hybrid character-level CRNN.

Both the generator network G and the discriminator network D perform feed-forward inference conditioned on the text feature.

Page 161: Advanced Driving Assistance System

Generative Adversarial Text to Image Synthesis

Page 162: Advanced Driving Assistance System

Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network

SRGAN, a generative adversarial network (GAN) for image superresolution (SR).

Capable of inferring photo-realistic natural images for 4× upscaling factors.

A perceptual loss function which consists of an adversarial loss and a content loss.

The adversarial loss pushes the solution to the natural image manifold using a discriminator network that is trained to differentiate between the super-resolved images and original photo-realistic images.

A content loss motivated by perceptual similarity instead of similarity in pixel space.

The deep residual network is able to recover photo-realistic textures from heavily downsampled images on public benchmarks.

Page 163: Advanced Driving Assistance System

Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network

Page 164: Advanced Driving Assistance System

Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network

Page 165: Advanced Driving Assistance System

Autoencoder that leverages learned representations to better measure similarities. By combining a VAE with a GAN, use learned feature representations in the GAN discriminator as the basis for the VAE reconstruction objective.

Replace element-wise errors with feature-wise errors to better capture the data distribution while offering invariance towards e.g. translation.

The method learns an embedding in which high-level abstract visual features (e.g. wearing glasses) can be modified using simple arithmetic.

Autoencoding Beyond Pixels Using a Learned Similarity Metric

Page 166: Advanced Driving Assistance System

Autoencoding Beyond Pixels Using a Learned Similarity Metric

Variational autoencoder: consists of two networks that encode a data sample x to a latent representation z and decode latent representation back to data space, respectively;

The VAE regularizes the encoder by imposing a prior over the latent distribution p(z). The VAE loss is minus the sum of the expected log likelihood (the reconstruction error) and a prior regularization term (a Kullback-Leibler divergence);
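Written out in standard VAE notation (the discriminator-feature variant below follows the paper, with Dis_l denoting the l-th hidden layer of the GAN discriminator):

L_VAE = − E_{q(z|x)} [ log p(x|z) ] + D_KL( q(z|x) || p(z) )

In the VAE/GAN model, the pixel-wise reconstruction term is replaced by a feature-wise one, − E_{q(z|x)} [ log p(Dis_l(x) | z) ], and the model is trained jointly with the GAN loss.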

GAN: Discriminator + Generator.

Page 167: Advanced Driving Assistance System

Autoencoding Beyond Pixels Using a Learned Similarity Metric

Page 168: Advanced Driving Assistance System

Image-to-Image Translation with Conditional Adversarial Nets

Conditional adversarial networks as a general-purpose solution to image-to-image translation problems.

These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping.

It is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks.

Page 169: Advanced Driving Assistance System

Image-to-Image Translation with Conditional Adversarial Nets

Training a conditional GAN to predict aerial photos from maps. The discriminator, D, learns to classify between real and synthesized pairs. The generator learns to fool the discriminator. Unlike an unconditional GAN, both the generator and discriminator observe an input image.

Page 170: Advanced Driving Assistance System

Image-to-Image Translation with Conditional Adversarial Nets

Page 171: Advanced Driving Assistance System

Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space

Synthesize novel images by performing gradient ascent in the latent space of a generator network to maximize the activations of one or multiple neurons in a separate classifier network.

Introducing an additional prior on the latent code, improving both sample quality and sample diversity, leading to a state-of-the-art generative model that produces high quality images at higher resolutions (227x227) than previous generative models, and does so for all 1000 ImageNet categories.

A unified probabilistic interpretation of related activation maximization methods and call the general class of models "Plug and Play Generative Networks".

PPGNs are composed of 1) a generator network G that is capable of drawing a wide range of image types and 2) a replaceable "condition" network C that tells the generator what to draw.

Improves the state of the art of Multifaceted Feature Visualization, which generates the set of synthetic inputs that activate a neuron in order to better understand how deep neural networks operate.

Page 172: Advanced Driving Assistance System

Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space

Deep Generator Network-based Activation Maximization (DGN-AM) involves training a generator G to create realistic images from compressed features extracted from a pretrained classifier network E;

To generate images conditioned on a class, an optimization process is launched to find a hidden code h that G maps to an image that highly activates a neuron in another classifier C (not necessarily the same as E);

A major limitation of DGN-AM is the lack of diversity in the generated samples;

Idea: add a prior on the latent code that keeps the optimization along the manifold of realistic-looking images, and unify and interpret activation maximization approaches as a type of energy-based model where the energy function is a sum of multiple constraint terms: (a) priors and (b) conditions;

Metropolis-adjusted Langevin sampling repeatedly adds noise and the gradient of log p(x, y) to generate samples (a Markov chain);

Denoising autoencoders estimate the required gradient;

A special denoising autoencoder that has been trained with multiple losses, including a GAN loss, is used to obtain the best results.
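A rough sketch of this sampling update (the step sizes ε1, ε2, ε3 and the DAE reconstruction function R are notation assumed here, following the paper, and are not shown on the slide):

x_{t+1} = x_t + ε1 · ∂log p(x_t)/∂x + ε2 · ∂log p(y | x_t)/∂x + N(0, ε3²)

with the prior gradient approximated via the denoising autoencoder as ∂log p(x)/∂x ≈ (R(x) − x) / σ².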

Page 173: Advanced Driving Assistance System

Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space

Different variants of PPGN models tested. The Noiseless Joint PPGN-h (e) empirically produces the best images. In all variants, iterative sampling follows the gradients of two terms: the condition (red arrows) and the prior (black arrows). (a) PPGN-x: a p(x) prior modeled via a DAE for images. (b) DGN-AM. (c) PPGN-h: a learned p(h) prior modeled via a multi-layer perceptron DAE for h. (d) Joint PPGN-h: treating G + E1 + E2 as a DAE that models h via x. (e) Noiseless Joint PPGN-h. (f) A pre-trained image classification network (here, AlexNet trained on ImageNet) serves as the encoder network E component. (g) Attaching a recurrent, image-captioning network to the output layer of G.

Page 174: Advanced Driving Assistance System

Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space

Page 175: Advanced Driving Assistance System

A-Fast-RCNN: Hard Positive Generation via Adversary for Object Detection

How to learn an object detector that is invariant to occlusions and deformations? Occlusions and object deformations follow a long-tail distribution.

Learn an adversarial network that generates examples with occlusions and deformations.

The goal of the adversary is to generate examples that are difficult for the object detector to classify.

Create adversarial examples in convolutional feature space rather than generating the pixels directly, since the latter is a much harder problem.

Fast-RCNN -> A-Fast-RCNN: ASDN + ASTN; source code: https://github.com/xiaolonw/adversarial-frcnn

Page 176: Advanced Driving Assistance System

A-Fast-RCNN: Hard Positive Generation via Adversary for Object Detection

Two types of feature generation by adversarial networks competing against the Fast-RCNN (FRCN) detector:

The first type of generation is occlusion: an Adversarial Spatial Dropout Network (ASDN) learns how to occlude a given object such that it becomes hard for FRCN to classify;

The second type of generation is deformation: an Adversarial Spatial Transformer Network (ASTN) learns how to rotate "parts" of the objects and make them hard to recognize by the detector;

Both networks, ASDN and ASTN, are learned simultaneously in conjunction with the FRCN during training;

Joint training prevents the detector from overfitting to the obstacles created by fixed generation policies;

Adversarial networks modify features to make the object harder to recognize.

Page 177: Advanced Driving Assistance System

A-Fast-RCNN: Hard Positive Generation via Adversary for Object Detection

The network architecture of the ASDN and how it combines with the Fast-RCNN approach.

Page 178: Advanced Driving Assistance System

A-Fast-RCNN: Hard Positive Generation via Adversary for Object Detection

Network architecture combining the ASDN and ASTN networks: first, occlusion masks are created, and then channels are rotated to generate hard examples for training.

Page 179: Advanced Driving Assistance System

Beyond Face Rotation: Global and Local Perception GAN for Preserving Frontal View Synthesis

A Two-Pathway Generative Adversarial Network (TP-GAN) for photorealistic frontal view synthesis by simultaneously perceiving global structures and local details.

Four landmark-located patch networks are proposed to attend to local textures in addition to the commonly used global encoder-decoder network.

Combination of adversarial loss, symmetry loss and identity-preserving loss: the combined loss function leverages both the frontal face distribution and pre-trained discriminative deep face models to guide an identity-preserving inference of frontal views from profiles.

Directly leverages the synthesized identity preserving image for downstream tasks like face recognition and attribution estimation.

Page 180: Advanced Driving Assistance System

The Generator contains two pathways, each processing global or local transformations. The Discriminator distinguishes between synthesized frontal views and ground-truth frontal views.

Beyond Face Rotation: Global and Local Perception GAN for Preserving Frontal View Synthesis

Page 181: Advanced Driving Assistance System

Synthesis loss function: Pixel-wise loss;

Symmetry loss;

Adversarial loss;

Identity preserving loss;

Overall objective function:
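The slide's formula is not reproduced here; as a hedged sketch, the overall synthesis objective is a weighted sum of the four losses above (the weights λ1–λ3 and any additional regularizer, e.g. total variation, are assumptions, not taken from the slide):

L_syn = L_pixel + λ1 · L_sym + λ2 · L_adv + λ3 · L_ip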

Beyond Face Rotation: Global and Local Perception GAN for Preserving Frontal View Synthesis

Page 182: Advanced Driving Assistance System

Synthesis results by TP-GAN under different poses. From left to right: poses of 90°, 75°, 60°, 45°, 30° and 15°. The ground-truth frontal images are provided in the last column.

Beyond Face Rotation: Global and Local Perception GAN for Preserving Frontal View Synthesis

Page 183: Advanced Driving Assistance System

How to Train a GAN? Tips and Tricks

1. Normalize the inputs
2. A modified loss function
3. Use a spherical Z (Gaussian rather than uniform distribution)
4. BatchNorm
5. Avoid sparse gradients: ReLU, MaxPool
6. Use soft and noisy labels (see the sketch after this list)
7. DCGAN / hybrid models (KL + GAN or VAE + GAN)
8. Use stability tricks from RL
9. Use the Adam optimizer for the generator (SGD for the discriminator)
10. Track failures early (check norms of gradients)
11. Don't balance loss via statistics (unless you have a good reason to)
12. If you have labels, use them (auxiliary GANs)
13. Add noise to inputs, decay over time
14. [not sure] Train the discriminator more (sometimes), especially when there is noise
15. [not sure] Batch discrimination
16. Discrete variables in C-GANs
17. Dropout in G in both train and test stages
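As an illustration of tip 6 (soft and noisy labels), a minimal numpy sketch; the smoothing ranges and flip probability are arbitrary illustrative choices, not values from the slide:

import numpy as np

def soft_noisy_labels(n, real=True, flip_prob=0.05, rng=None):
    # Discriminator targets: smoothed labels with a small fraction flipped.
    rng = np.random.default_rng() if rng is None else rng
    # Soft labels: real in [0.9, 1.0), fake in [0.0, 0.1) instead of hard 1/0.
    labels = rng.uniform(0.9, 1.0, n) if real else rng.uniform(0.0, 0.1, n)
    # Noisy labels: occasionally flip the target.
    flip = rng.uniform(size=n) < flip_prob
    labels[flip] = 1.0 - labels[flip]
    return labels

# Example: targets for a batch of 8 real images.
print(soft_noisy_labels(8, real=True))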

Page 184: Advanced Driving Assistance System

Improved Techniques for Training GANs

For semi-supervised learning and for the generation of images that humans find visually realistic;

Techniques that are heuristically motivated to encourage convergence:

Feature matching addresses the instability of GANs by specifying a new objective for the generator that prevents it from overtraining on the current discriminator;

Allow the discriminator to look at multiple data examples in combination and perform "mini-batch discrimination": any discriminator model that looks at multiple examples in combination, rather than in isolation, could potentially help avoid collapse of the generator;

Historical averaging: the historical average of the parameters can be updated in an online fashion, so this learning rule scales well to long time series;

One-sided label smoothing: reduces the vulnerability of NNs to adversarial examples;

Virtual batch normalization: each example x is normalized based on the statistics collected on a reference batch of examples that are chosen once and fixed at the start of training, and on x itself (used only in the generator network, because it is too expensive computationally).

Page 185: Advanced Driving Assistance System

Towards Principled Methods for Training GANs

Questions: Why do updates get worse as the discriminator gets better, with both the original and the new cost function?

Why is GAN training massively unstable?

Is the new cost function following a similar divergence to the JSD? If so, what are its properties?

Is there a way to avoid some of these issues?

Jensen-Shannon Divergence
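The definition referenced here (the slide's formula is not reproduced in this transcript):

JSD(P || Q) = (1/2) · KL(P || M) + (1/2) · KL(Q || M), where M = (P + Q) / 2.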

Page 186: Advanced Driving Assistance System

Towards Principled Methods for Training GANs

Theorems 2.1-2.2 tell us that there are perfect discriminators which are smooth and constant almost everywhere in M and P. The fact that the discriminator is constant on both manifolds means that we won't really be able to learn anything by backpropagating through it.

If the two distributions we care about have supports that are disjoint or lie on low dimensional manifolds, the optimal discriminator will be perfect and its gradient will be zero almost everywhere.

To conclude, a theorem (Theorem 2.3) is stated on the divergences of Pr and Pg.

Note: these divergences will be maxed out even if two manifolds lie arbitrarily close to each other.

The samples of generator might look impressively good, yet both KL divergences will be infinity.

Page 187: Advanced Driving Assistance System

Towards Principled Methods for Training GANs

Theorem 2.3 tells us that attempting to use these divergences out of the box to test similarities between the distributions we typically consider might be a terrible idea;

So, if these divergences are always maxed out, attempting to minimize them by gradient descent isn't really possible;

As the approximation to the optimal discriminator gets better, we either see vanishing gradients or the massively unstable behavior seen in practice, depending on which cost function we use.

Page 188: Advanced Driving Assistance System

Towards Principled Methods for Training GANs

This is the inverted KL minus two JSDs. The JSDs appear with the opposite sign, which means they are pushing for the distributions to be different, which seems like a fault in the update.

The KL appearing in the equation is KL(Pg||Pr), not the one equivalent to maximum likelihood (KL(Pr||Pg)).

This KL assigns an extremely high cost to generating fake-looking samples and an extremely low cost to mode dropping; the JSD is symmetric, so it doesn't alter this behaviour.

This explains why GANs (when stabilized) create good-looking samples, and justifies what is commonly conjectured: that GANs suffer from an extensive amount of mode dropping.

Page 189: Advanced Driving Assistance System

Towards Principled Methods for Training GANs

Even if we ignore the fact that the updates have infinite variance, we still arrive at the fact that the distribution of the updates is centered, meaning that if we bound the updates the expected update will be 0, providing no feedback to the gradient;

In all cases, using these updates leads to a noticeable decrease in sample quality;

The variance of the gradients is increasing, which is known to lead to slower convergence and more unstable behaviour in the optimization.

Page 190: Advanced Driving Assistance System

Towards Principled Methods for Training GANs

An important question now is how to fix the instability and vanishing gradients issues;

One way to break the assumptions of these theorems is to add continuous noise to the inputs of the discriminator, thereby smoothing the distribution of the probability mass;

This theorem therefore tells us that the density PX+ε(x) is inversely proportional to the average distance to points in the support of PX, weighted by the probability of these points;

In the case of the support of PX being a manifold, we will have the weighted average of the distance to the points along the manifold;

How we choose the distribution of the noise will impact the notion of distance we are choosing;

Different noises with different types of decays can therefore be used.

Page 191: Advanced Driving Assistance System

Towards Principled Methods for Training GANs

This theorem proves that we will drive our samples g(z) towards points along the data manifold, weighted by their probability and the distance from our samples;

The 2nd term drives our points away from high probability samples, again, weighted by the sample manifold and distance to these samples;

The generator's backprop term goes through samples on a set of positive measure that the discriminator cares about.

Page 192: Advanced Driving Assistance System

Towards Principled Methods for Training GANs

In Theorem 3.3 the two terms can be controlled. The 1st term can be decreased by annealing the noise, and the 2nd term can be minimized by a GAN when the discriminator is trained on the noisy inputs, since it will be approximating the JSD between the two continuous distributions.

Because of the noise, we can train the discriminator to optimality without any problems and get smooth, interpretable gradients.

Page 193: Advanced Driving Assistance System

Wasserstein GAN

What does it mean to learn a probability distribution?

VAEs focus on the approximate likelihood of the examples, and so share the limitations of the standard models and need to fiddle with additional noise terms;

GANs offer much more flexibility in the definition of the objective function, including JSD, and all f-divergences as well as some exotic combinations;

Question: how to measure how close the model distribution and the real distribution are, or equivalently, how to define a distance or divergence between them?

The Earth-Mover (EM) distance or Wasserstein-1

Wasserstein GAN: based on the Kantorovich-Rubinstein duality

Note that f is a 1-Lipschitz function.
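The two expressions referenced above are not reproduced in this transcript; for reference:

W(Pr, Pg) = inf over γ ∈ Π(Pr, Pg) of E_{(x, y) ~ γ} [ ||x − y|| ]

and, by the Kantorovich-Rubinstein duality,

W(Pr, Pθ) = sup over ||f||_L ≤ 1 of ( E_{x ~ Pr} [ f(x) ] − E_{x ~ Pθ} [ f(x) ] )

which WGAN approximates with a parametrized critic fw, kept approximately K-Lipschitz by clipping its weights.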

Page 194: Advanced Driving Assistance System

Wasserstein GAN

The fact that the EM distance is continuous and differentiable a.e. means that we can (and should) train the critic till optimality;

The fact that we constrain the weights limits the possible growth of the function to be at most linear in different parts of the space, forcing the optimal critic to have this behaviour.
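To make the critic training concrete, a minimal PyTorch sketch of the WGAN critic/generator updates with weight clipping; the network sizes, learning rates, clip value and the toy 1-D Gaussian "real" data are illustrative assumptions, not taken from the slides.

import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
generator = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
opt_g = torch.optim.RMSprop(generator.parameters(), lr=5e-5)
clip, n_critic, batch = 0.01, 5, 64

for step in range(1000):
    # Train the critic several steps ("till optimality") per generator step.
    for _ in range(n_critic):
        real = torch.randn(batch, 1) * 0.5 + 2.0               # toy real data
        fake = generator(torch.randn(batch, 8)).detach()
        loss_c = -(critic(real).mean() - critic(fake).mean())  # maximize the EM estimate
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
        for p in critic.parameters():                          # enforce the Lipschitz constraint
            p.data.clamp_(-clip, clip)
    # Generator step: move fake samples toward higher critic scores.
    loss_g = -critic(generator(torch.randn(batch, 8))).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()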

Page 195: Advanced Driving Assistance System

Wasserstein GAN

Page 196: Advanced Driving Assistance System

BEGAN: Boundary Equilibrium GAN

An equilibrium-enforcing method paired with a loss derived from the Wasserstein distance for training auto-encoder based GANs.

Balances the generator and discriminator during training. Gives an approximate convergence measure, fast and stable training, and high visual quality.

A way of controlling the trade-off between image diversity and visual quality.

If generated samples cannot be distinguished by the discriminator from real ones, the distribution of their errors should be the same, including their expected error.

Allows balancing the effort allocated to the generator and discriminator so that neither wins over the other.
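As a hedged sketch of this balancing mechanism, following the BEGAN paper (L(·) is the auto-encoder reconstruction loss, γ the diversity ratio, λk the learning rate of the control variable k):

L_D = L(x) − k_t · L(G(z_D));   L_G = L(G(z_G));   k_{t+1} = k_t + λk · ( γ · L(x) − L(G(z_G)) )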

Page 197: Advanced Driving Assistance System

BEGAN: Boundary Equilibrium GAN

Network architecture for the generator and discriminator in BEGAN.

Page 198: Advanced Driving Assistance System

BEGAN: Boundary Equilibrium GAN

Used 3x3 convolutions with exponential linear units (ELUs) applied at their outputs. Each layer is repeated a number of times (typically 2); more repetitions led to even better visual results. The number of convolution filters increases linearly with each down-sampling. Down-sampling is implemented as sub-sampling with stride 2, and up-sampling is done by nearest neighbor.

At the boundary between the encoder and the decoder, the cube of processed data is mapped via fully connected layers, not followed by any non-linearities, to and from an embedding state.

The generator uses the same architecture (though not the same weights) as the discriminator decoder.

The input state is sampled uniformly. The BEGAN model is easier to train: no batch normalization, no dropout, no transposed convolutions, and no exponential growth of the convolution filters.

Page 199: Advanced Driving Assistance System

Thanks!