Page 1: Deep Learning for Drone Vision in Cinematography

Anastasios Tefas
[email protected]
Contributors: P. Nousi, N. Passalis, D. Triantafyllidou, M. Tzelepi
Artificial Intelligence and Information Analysis Laboratory
Department of Informatics
Aristotle University of Thessaloniki

Page 2: Introduction

Page 3: Convolutional Neural Networks

• Deep Convolutional Neural Networks (CNNs) [1] are among the state-of-the-art techniques for visual information analysis

[1] LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521.7553 (2015): 436-444.

Page 4: Convolutional Neural Networks

• Composed of a series of convolutional and pooling layers
• Usually a fully connected layer is used for classification
• Fully convolutional architectures do exist!
• Capable of learning hierarchies of increasingly abstract visual features (from simple edges to object parts and concepts)
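
For reference, a minimal PyTorch sketch of such an architecture; the layer sizes are illustrative, not the networks used in this work:

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Toy CNN: a convolution + pooling feature hierarchy and a fully connected classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                   # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                   # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)   # fully connected head

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

logits = SmallCNN()(torch.randn(1, 3, 32, 32))                 # -> shape (1, 10)
```

Replacing the fully connected head with, e.g., a 1×1 convolution and a global pooling step gives the fully convolutional variant mentioned above, which can process inputs of arbitrary size.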

Page 5: Assisting Cinematography Tasks

• Unmanned Aerial Vehicles (UAVs), also known as drones, are becoming increasingly popular for video shooting tasks
• Flexible!
• They can capture spectacular shots!

Page 6: Assisting Cinematography Tasks

• Flying drones in a professional shooting setting requires the coordination of several people
  • One for controlling the flight path of each drone
  • One for controlling the main shooting camera of each drone
  • At least one director, technician, etc.

Page 7: Assisting Cinematography Tasks

• Several parts of the shooting process can be automated, reducing the load on human operators
• Goal: one human controls multiple drones

Page 8: Shooting Pipeline

• A drone must be able to quickly identify whether one or more objects of interest exist in a scene
• Apart from the main drone camera, multiple lower-resolution cameras may also be available to aid this task

Page 9: Shooting Pipeline

• Decide whether the detected humans are part of the crowd or are persons of interest (e.g., cyclists)
  • The drone must fly away from the crowd
  • The drone must follow the detected persons of interest

Page 10: Shooting Pipeline

• After detecting a person of interest, the camera must be appropriately rotated toward that person
• Different shot types have different specifications
• We need the position of the person w.r.t. the camera, as well as their pose

Page 11: Assisting Cinematography Tasks

• Several (quite demanding) subtasks!
• Detect whether and where an object of interest exists (cyclist, boat, monument, etc.)

Page 12: Assisting Cinematography Tasks

• Track a detected (or selected) object

Page 13: Assisting Cinematography Tasks

• Detect where a crowd exists
  • Comply with legislation
  • Detect emergency landing points
• Provide heatmaps of the estimated probability of crowd presence at each location

Page 14: Assisting Cinematography Tasks

• Detect where a crowd exists

Page 15: Assisting Cinematography Tasks

• Identify a detected person (e.g., a well-known cyclist)

Page 16: Assisting Cinematography Tasks

• Estimate the pose of a detected object
• Allows for appropriately controlling the camera according to the specifications of each shot type (e.g., orbit around a target or acquire a profile shot)

Page 17: Assisting Cinematography Tasks

• Camera control was traditionally handled as a purely geometric problem
• We can also perform camera control using only visual information

Page 18: Assisting Cinematography Tasks

• The aforementioned tasks can be solved using deep Convolutional Neural Networks (CNNs)
• Deploying a deep CNN on a drone is not straightforward
• Significant memory and processing power constraints exist
• State-of-the-art CNNs, such as VGG-16, consist of hundreds of millions of parameters, making them unsuitable for handling real-time tasks on-board

Page 19: Assisting Cinematography Tasks

• Lightweight models are needed!
• Slight delays can result in control lag
• Different illumination conditions can affect the performance of the models
  • Training set augmentation!

Page 20: Object Detection and Tracking

Page 21: Object Detection

• Faster R-CNN [1]
  • Region-based object detectors are plagued by inefficient external region proposal schemes
  • Key idea: utilize CNN feature maps for both detection and region proposal in a fully convolutional network
  • Also assumes prior "anchor" boxes; region proposals are fine-tuned anchor boxes

[1] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.

Page 22: Object Detection

• Faster R-CNN [1]
  • Trained using a double objective of classification loss plus bounding-box regression loss
  • Precision increases with the number of region proposals, but...
  • ...at 5 fps on a K40 GPU, it is among the slowest deep object detectors

[1] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015.
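
For reference, running a pretrained Faster R-CNN takes only a few lines with torchvision; this assumes a torchvision build that ships the pretrained detection models (the exact weights argument differs across versions):

```python
import torch
import torchvision

# COCO-pretrained Faster R-CNN with a ResNet-50 FPN backbone (newer torchvision
# versions prefer a `weights=` argument instead of `pretrained=True`).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 480, 640)            # a single RGB frame with values in [0, 1]
with torch.no_grad():
    (prediction,) = model([image])         # list of images in, list of dicts out
print(prediction["boxes"].shape, prediction["scores"][:5])
```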

Page 23: Object Detection

• Single Shot Detector (SSD) [1]
  • Fully convolutional object detection at multiple scales
  • Predicts center x, y coordinates for multiple objects as well as their class
  • Several feature maps of different resolutions are used for the final prediction
  • Adjusts priors on bounding boxes instead of outright predicting the width and height

[1] Liu, Wei, et al. "SSD: Single Shot MultiBox Detector." European Conference on Computer Vision. 2016.

Page 24: Object Detection

• Single Shot Detector (SSD) [1]
  • Performs hard negative mining, so not all unannotated regions are considered as negatives
  • On an NVIDIA Titan X: 46 fps for the 300×300 version, 19 fps for the 512×512 version

[1] Liu, Wei, et al. "SSD: Single Shot MultiBox Detector." European Conference on Computer Vision. 2016.

Page 25: Object Detection

• You Only Look Once (YOLO v2) [1][2]
  • Fully convolutional object detection at multiple scales
  • Predicts center x, y coordinates for multiple objects as well as their class
  • Adjusts priors on bounding boxes instead of outright predicting the width and height
  • All unannotated regions in the input image are considered as negative examples
  • On an NVIDIA Titan X: 67 fps for the 416×416 version, 40 fps for the 544×544 version

[1] Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[2] Redmon, Joseph, and Ali Farhadi. "YOLO9000: Better, Faster, Stronger." IEEE Conference on Computer Vision and Pattern Recognition. 2017.
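
To make the "adjusts priors" bullets concrete: YOLO v2 predicts offsets relative to a grid cell and an anchor (prior) box rather than absolute box sizes. A minimal decoding sketch (the grid-cell position and anchor dimensions below are illustrative values):

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode YOLOv2-style outputs (tx, ty, tw, th) for grid cell (cx, cy)
    and an anchor box of size (pw, ph), all in grid-cell units."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    bx = cx + sigmoid(tx)            # box centre stays inside its grid cell
    by = cy + sigmoid(ty)
    bw = pw * math.exp(tw)           # anchor width/height scaled by the prediction
    bh = ph * math.exp(th)
    return bx, by, bw, bh

print(decode_box(0.2, -0.1, 0.05, 0.1, cx=3, cy=5, pw=2.0, ph=3.0))
```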

Page 26: Object Detection

• We evaluated the fastest of these detectors (YOLO) on a GPU-accelerated embedded system (NVIDIA TX-2) that will be available on our drone
• Adjusting the input image size allows for increasing the throughput
• Real-time detection is not yet possible with satisfactory accuracy

Model       Input Size   FPS
YOLO v.2    604          3
YOLO v.2    544          4
YOLO v.2    416          7
YOLO v.2    308          10
Tiny YOLO   604          9
Tiny YOLO   416          15

Page 27: Using object detectors for drone-based shooting

• Fine-tuning a pretrained model on a new domain (e.g., boat/bicycle detection), instead of training from scratch, usually yields better results
• Tiny versions of the proposed detectors (e.g., Tiny YOLO) can increase the detection speed (but at the cost of accuracy)
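
As a hedged, concrete example of fine-tuning on a new domain, the standard torchvision recipe below swaps the classification head of a COCO-pretrained Faster R-CNN for a new one with only the classes of interest; it is shown as a stand-in, since fine-tuning Darknet/YOLO models is driven by their own configuration files:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 3  # background + bicycle + boat (illustrative new domain)

# Start from a COCO-pretrained detector and keep its backbone weights.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

# Replace only the box-classification head so it predicts the new classes.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# ...then train as usual on the new, smaller dataset (fine-tuning).
```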

Page 28: Using object detectors for drone-based shooting

• Reducing the input image size can also increase the detection speed
• However, this can significantly impact the accuracy when detecting very small objects (which is the case for drone shooting)

Model      Input Size   PASCAL VOC 2007 test mAP*
YOLO v.2   544          77.44
YOLO v.2   416          74.60
YOLO v.2   288          67.12
YOLO v.2   160          48.72
YOLO v.2   128          40.68

*Using unofficial evaluation code (results might differ slightly)

Page 29: Lightweight Approach to Object Detection

• Our approach: train lightweight, fully convolutional, object-specific (e.g., face, bicycle, football player) detectors
  • E.g., for face detection we trained a 7-layer fully convolutional face detector on 32 × 32 positive and negative examples [1]
  • During deployment on larger images, the network very efficiently produces a heatmap indicating the probability of a face as well as its location in the image

[1] Triantafyllidou, Danai, Paraskevi Nousi, and Anastasios Tefas. "Fast deep convolutional face detection in the wild exploiting hard sample mining." Big Data Research (2017).
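
To illustrate the idea (this is not the exact 7-layer architecture of [1]): because a fully convolutional network has no fully connected layers, a detector trained on small crops can be applied directly to a full frame, yielding a probability heatmap with one score per receptive-field location. A hypothetical PyTorch sketch:

```python
import torch
import torch.nn as nn

# Hypothetical lightweight fully convolutional detector, trained on 32x32 face
# and non-face crops; the final 8x8 convolution plays the role of a "fully
# connected" layer but slides over the whole input at deployment time.
detector = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 1, 8),
    nn.Sigmoid(),
)

frame = torch.randn(1, 3, 480, 640)   # full video frame
heatmap = detector(frame)             # face-probability map, roughly (1, 1, H/4 - 7, W/4 - 7)
print(heatmap.shape)
```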

Page 30: Face detection examples

Page 31: Face detection examples

Page 32: Face detection examples

Page 33: Lightweight Approach to Object Detection

• Domain-specific knowledge may be exploited to train such lightweight object detectors for specific events
• E.g., for cycling races, train the detector to recognize professional bicycles

Page 34: Bicycle detection

Page 35: Bicycle detection

Page 36: Football player detection

Page 37: Limitations

• Speed vs. accuracy trade-off:
  • Lightweight models don't perform as well as heavier architectures (think of YOLO and its Tiny YOLO variant)
  • In our approach, accuracy is increased by the use of domain-specific object detectors
  • As well as by a strategic training methodology of progressive positive and hard negative mining, which mimics the natural learning process

Page 38: Limitations

• Trained with fixed-size images:
  • Detection of larger or smaller objects requires a forward pass over a spatial pyramid of the input
  • This is made efficient by the fully convolutional architecture of the detectors
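
A short sketch of such a multi-scale forward pass; the stand-in detector and the pyramid scales are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for a fixed-size-trained, fully convolutional detector.
detector = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(8, 1, 8), nn.Sigmoid())

frame = torch.randn(1, 3, 480, 640)
# Rescale the frame so that objects of different apparent sizes fall into the
# detector's fixed receptive field; each scale yields its own heatmap.
heatmaps = {
    s: detector(F.interpolate(frame, scale_factor=s, mode="bilinear", align_corners=False))
    for s in (1.0, 0.75, 0.5)
}
for s, h in heatmaps.items():
    print(s, tuple(h.shape))
```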

Page 39: Combining Detectors with Trackers on Drones

• The deployed detector can be combined with fast trackers to achieve satisfactory real-time performance
• The detector can be called only a few times per second, while the tracker provides the "detections" in the intermediate frames
• We evaluated several trackers on the NVIDIA TX-2:

Model               Device   FPS
ASMS [1]            CPU      81
STRUCK [2]          CPU      7
THUNDERSTRUCK [2]   GPU      100
GOTURN [3]          GPU      30

[1] Vojir, Tomas, Jana Noskova, and Jiri Matas. "Robust scale-adaptive mean-shift for tracking." Pattern Recognition Letters 49 (2014): 250-258.
[2] Hare, Sam, et al. "Struck: Structured output tracking with kernels." IEEE Transactions on Pattern Analysis and Machine Intelligence 38.10 (2016): 2096-2109.
[3] Held, David, Sebastian Thrun, and Silvio Savarese. "Learning to track at 100 fps with deep regression networks." European Conference on Computer Vision. Springer International Publishing, 2016.
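
A minimal scheduling sketch of this detector/tracker hand-off; `detect` and `tracker` are hypothetical stand-ins (e.g., a YOLO wrapper and one of the trackers in the table above):

```python
DETECT_EVERY = 10   # call the (slow) detector every 10 frames, track in between

def run_pipeline(frames, detect, tracker):
    """detect(frame) -> bounding box or None; tracker exposes init(frame, box) and update(frame)."""
    box = None
    for i, frame in enumerate(frames):
        if i % DETECT_EVERY == 0 or box is None:
            detection = detect(frame)        # slow: only a few times per second
            if detection is not None:
                box = detection
                tracker.init(frame, box)     # re-anchor the tracker on every fresh detection
        elif box is not None:
            box = tracker.update(frame)      # fast: every intermediate frame
        yield i, box
```

In practice the detector would typically run asynchronously (e.g., in its own ROS node) so that a slow detection call never blocks tracking.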

Page 40: Object Detection

Page 41: Object Tracking

• In the recent Visual Object Tracking (VOT) challenges, CNNs have taken the top places in terms of either speed or accuracy
• In VOT2015 [1], the two best-scoring trackers, MDNet and DeepSRDCF, were both based on CNNs
• Another CNN-based tracker, SODLT, also achieved high performance on the benchmark
• However, precise tracking comes at the cost of slow models with online updates or heavy architectures
• But they certainly served to show that CNNs can be effectively used in tracking tasks

[1] Kristan, Matej, et al. "The visual object tracking VOT2015 challenge results." Proceedings of the IEEE International Conference on Computer Vision Workshops. 2015.

Page 42: Object Tracking

• In VOT2016 [1], eight of the submitted trackers were CNN-based and another six combined convolutional features with Discriminative Correlation Filters
• Five of the top ten ranked trackers were CNN-based
• The winner of the challenge, C-COT, is based on a VGG-16 architecture and computes convolutions in continuous space via learnable, implicit interpolation
• Among the runners-up, Siam-FC is a somewhat lighter CNN-based model which deploys a learnable correlation layer to measure the similarity between the target and various candidates in a fully convolutional fashion

[1] Kristan, Matej, et al. "The visual object tracking VOT2016 challenge results." Proceedings of the IEEE International Conference on Computer Vision Workshops. 2016.

Page 43: Fully Convolutional Image Segmentation

Page 44: Crowd Detection for Safe Autonomous Drones

• There are limited previous efforts on crowd detection using computer vision techniques
• Related research involving crowds (e.g., crowd understanding, crowd counting, and human detection and tracking in crowds) considers crowded scenes

Page 45: Crowd Detection for Safe Autonomous Drones

• State-of-the-art approaches to crowd analysis utilize deep learning techniques
• In [1], an effective multi-column convolutional neural network architecture is proposed to map an image to its crowd density map
• In [2], a switching convolutional neural network for crowd counting is proposed, aiming to leverage the variation of crowd density within an image

[1] Zhang, Yingying, et al. "Single-image crowd counting via multi-column convolutional neural network." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[2] Sam, Deepak Babu, Shiv Surya, and R. Venkatesh Babu. "Switching convolutional neural network for crowd counting." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

Page 46: Crowd Detection for Safe Autonomous Drones

Switch-CNN: Sam, Deepak Babu, Shiv Surya, and R. Venkatesh Babu. "Switching convolutional neural network for crowd counting." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

Page 47: Deep Detectors on Drones

• In [1], a fully convolutional model for crowd detection is proposed
  • Complies with the computational requirements of the crowd detection task
  • Allows for handling input images of arbitrary dimensions
• Subspace-learning-inspired two-loss convolutional model
  • The softmax loss preserves between-class separability
  • The Euclidean loss aims at bringing the samples of the same class closer to each other

[1] Tzelepi, Maria, and Anastasios Tefas. "Human crowd detection for drone flight safety using convolutional neural networks." European Signal Processing Conference (EUSIPCO), Kos, Greece, 2017.
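
One way to instantiate such a two-loss objective is sketched below; the Euclidean term here pulls each sample's intermediate representation toward its class centroid within the batch. The weighting and centroid definition are illustrative assumptions, not the exact formulation of [1]:

```python
import torch
import torch.nn.functional as F

def two_loss(logits, embeddings, labels, lam=0.1):
    """Cross-entropy (softmax) loss for between-class separability plus a Euclidean
    term that pulls same-class embeddings toward their batch centroid."""
    ce = F.cross_entropy(logits, labels)
    eucl = 0.0
    classes = labels.unique()
    for c in classes:
        members = embeddings[labels == c]
        centroid = members.mean(dim=0, keepdim=True)
        eucl = eucl + ((members - centroid) ** 2).sum(dim=1).mean()
    return ce + lam * eucl / len(classes)

# logits: (B, 2) crowd / non-crowd scores, embeddings: (B, D) from an intermediate layer
loss = two_loss(torch.randn(8, 2), torch.randn(8, 64), torch.randint(0, 2, (8,)))
```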

Page 48: Deep Detectors on Drones

Page 49: Deep Detectors on Drones

Page 50: Deep Detectors on Drones

Page 51: Deep Detectors on Drones

Page 52: Reducing the Complexity of CNNs

Page 53: Convolutional Bag-of-Features Pooling

• Global pooling techniques (GMP [1], SPP [2]) can be used to reduce the size of the fully connected layers and allow the network to handle arbitrarily sized images
• A Bag-of-Features-based approach was used to provide a trainable global pooling layer that is capable of:
  • reducing the size of the model,
  • increasing the feed-forward speed,
  • increasing the accuracy and the scale invariance,
  • adjusting to the available computational resources.

[1] Azizpour, Hossein, et al. "From generic to specific deep representations for visual recognition." Conference on Computer Vision and Pattern Recognition Workshops. 2015.
[2] He, Kaiming, et al. "Spatial pyramid pooling in deep convolutional networks for visual recognition." European Conference on Computer Vision. 2014.

Page 54: Convolutional Bag-of-Features Pooling

• The whole network, including the proposed layer, is optimized end-to-end towards the task at hand.
• The proposed method can be readily implemented with existing deep learning frameworks (TensorFlow, Caffe, PyTorch, etc.)
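
As a rough illustration of such a trainable pooling layer, the PyTorch sketch below softly assigns every spatial feature vector to a set of learnable codewords and averages the assignments into a fixed-size histogram. The codeword count, the soft-assignment form, and all hyperparameters are assumptions for illustration, not the exact formulation of the ICCV 2017 paper:

```python
import torch
import torch.nn as nn

class BoFPooling(nn.Module):
    """Simplified trainable Bag-of-Features pooling: soft-assign each spatial feature
    vector to K learnable codewords, then average the assignments over all positions,
    producing a K-dimensional histogram regardless of the input resolution."""
    def __init__(self, in_channels, num_codewords=64, sigma=1.0):
        super().__init__()
        self.codewords = nn.Parameter(torch.randn(num_codewords, in_channels))
        self.sigma = nn.Parameter(torch.full((num_codewords,), sigma))

    def forward(self, x):                                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        feats = x.permute(0, 2, 3, 1).reshape(b, h * w, 1, c)   # (B, HW, 1, C)
        dists = ((feats - self.codewords.view(1, 1, -1, c)) ** 2).sum(-1)  # (B, HW, K)
        memberships = torch.softmax(-dists / self.sigma.abs().clamp(min=1e-6), dim=-1)
        return memberships.mean(dim=1)                          # (B, K) histogram

pool = BoFPooling(in_channels=128, num_codewords=64)
print(pool(torch.randn(2, 128, 13, 13)).shape)   # torch.Size([2, 64])
print(pool(torch.randn(2, 128, 26, 26)).shape)   # same output size for a larger input
```

Because the output size depends only on the number of codewords, the same classifier can accept different input resolutions, which is what enables the on-the-fly resource adjustment discussed a few slides below.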

Page 55: Convolutional Bag-of-Features Pooling

• More information can be found in our paper "Learning Bag-of-Features Pooling for Deep Convolutional Neural Networks" (ICCV 2017, Friday, poster session 8)

Page 56: Convolutional Bag-of-Features Pooling

• The method was evaluated on a pose estimation task
• Estimate the pose (yaw, pitch, roll) of the main actors (e.g., cyclists, boats, etc.)
• Allows for appropriately controlling the camera according to the specifications of each shot type (e.g., orbit around a target or profile shot)

Page 57: Convolutional Bag-of-Features Pooling

• Use an object detector to locate and crop the object
• Train a CNN to directly regress the pose of the cropped object
• Advantages:
  • No need for 3D models (only a training set of pose-annotated objects is needed)
  • More robust to variations of the object (especially if the training set is appropriately augmented)
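
A hedged sketch of this detect-crop-regress setup; the backbone, crop size, and pose scaling are illustrative, not the actual models used in this work:

```python
import torch
import torch.nn as nn

# Small backbone with global pooling, followed by a 3-output regressor (yaw, pitch, roll).
backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
regressor = nn.Linear(64, 3)

crops   = torch.randn(16, 3, 64, 64)     # detector-cropped, resized objects
targets = torch.rand(16, 3) * 2 - 1      # pose annotations on an illustrative scale
loss = nn.functional.mse_loss(regressor(backbone(crops)), targets)
loss.backward()                          # trained end-to-end with a plain L2 objective
```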

Page 58: Convolutional Bag-of-Features Pooling

• Comparing the proposed pooling technique to other state-of-the-art techniques (AFLW dataset)
• The first number in the CBoF technique indicates the spatial segmentation level

Page 59: Convolutional Bag-of-Features Pooling

• Demonstrating the ability of global pooling techniques to adjust to the available computational resources on-the-fly by altering the input image size (the results are reported on a concept detection task)

Page 60: Knowledge Transfer

Page 61: Knowledge Transfer

• Knowledge transfer techniques (e.g., distillation, hint-based training) also allow for increasing the performance of smaller and more lightweight models
• Neural Network Distillation [1]
  • Train a large and complex model
  • Train a smaller model to regress the output of the larger model
  • The temperature of the softmax activation function is increased to maintain more information
• Hints for Thin Deep Nets [2]
  • The basic distillation idea is followed
  • A random projection is used to provide hints for intermediate layers

[1] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the Knowledge in a Neural Network." NIPS 2014 Deep Learning Workshop. 2014.
[2] Romero, Adriana, et al. "FitNets: Hints for thin deep nets." International Conference on Learning Representations. 2015.
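
A minimal sketch of the distillation objective described above; the temperature, mixing weight, and toy logits are illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Soften both output distributions with temperature T, match them with a KL term,
    and mix in the usual hard-label cross-entropy for the student."""
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

loss = distillation_loss(torch.randn(8, 10), torch.randn(8, 10), torch.randint(0, 10, (8,)))
```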

Page 62: Knowledge Transfer

• Similarity Embedding-based Knowledge Transfer
  • Instead of matching the output of the layers, the smaller model is trained to match the similarities between the training samples
  • Similarity Embeddings [1] were used to this end
  • This allows for directly transferring the knowledge, even when a different number of neurons is used in each layer, without regressing the output of the layer

[1] Passalis, Nikolaos, and Anastasios Tefas. "Dimensionality Reduction Using Similarity-Induced Embeddings." IEEE Transactions on Neural Networks and Learning Systems. 2017.
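
A hedged sketch of the similarity-matching idea: the student is trained so that the batch's pairwise similarity matrix matches the teacher's, which works even when the two layers have different widths. The cosine similarity and squared-error matching used here are assumptions for illustration, not the exact similarity-embedding formulation of [1]:

```python
import torch
import torch.nn.functional as F

def similarity_transfer_loss(student_feats, teacher_feats):
    """Match the pairwise (cosine) similarity matrices of a batch instead of the raw features."""
    s = F.normalize(student_feats, dim=1)
    t = F.normalize(teacher_feats, dim=1)
    return F.mse_loss(s @ s.t(), t @ t.t())

# Student layer with 64 units, teacher layer with 512 units, batch of 8 samples.
loss = similarity_transfer_loss(torch.randn(8, 64), torch.randn(8, 512))
```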

Page 63: Knowledge Transfer

• Preliminary results on a pose classification task are reported
• The nearest centroid classifier was used to evaluate the "quality" of the knowledge transfer on an intermediate layer

Page 64: Reinforcement Learning

Page 65: Camera Control

• Deep Reinforcement Learning techniques can be used to provide optimal end-to-end control of the camera
  • Deep Q-Learning [1] (discrete control)
  • Policy gradients [2] (continuous control)
• The reward function can be used to measure the quality of the obtained shots according to cinematography objectives

[1] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
[2] Lillicrap, Timothy P., et al. "Continuous control with deep reinforcement learning." arXiv preprint. 2015.
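
As an illustration of the last bullet, a hypothetical reward that scores a frame by how close the tracked subject's bounding box is to a desired framing; the target framing values and the quadratic penalty are invented for the example:

```python
def framing_reward(box, frame_w, frame_h, target_cx=0.5, target_cy=0.4, target_area=0.12):
    """box = (x, y, w, h) in pixels; higher reward when the subject sits near the desired
    normalized image position with the desired apparent size (a toy shot-quality measure)."""
    x, y, w, h = box
    cx, cy = (x + w / 2) / frame_w, (y + h / 2) / frame_h
    area = (w * h) / (frame_w * frame_h)
    return -((cx - target_cx) ** 2 + (cy - target_cy) ** 2 + (area - target_area) ** 2)

print(framing_reward((800, 350, 220, 420), frame_w=1920, frame_h=1080))
```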

Page 66: Proof of Concept

• To examine whether it is possible to directly control the camera using visual information, we used a simple PID controller
• The aim was to keep the detected bounding box at a specific position and appropriately adjust the zoom
• Very good results were obtained in our simulations
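
A minimal sketch of such a PID loop on a single axis; the gains, set-point, and the yaw-rate interpretation of the output are illustrative (the actual controller also adjusts tilt and zoom):

```python
class PID:
    """Textbook PID controller (illustrative gains, not the values used in the simulations)."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev_error = 0.0, None

    def step(self, error, dt):
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Keep the detected box centred horizontally: the error is the horizontal distance of the
# box centre from the desired (normalized) image position; the output is a yaw-rate command.
pan = PID(kp=1.0, ki=0.05, kd=0.2)
box_cx, desired_cx = 0.62, 0.50
yaw_rate = pan.step(desired_cx - box_cx, dt=1 / 30)
print(yaw_rate)
```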

Page 67: Proof of Concept

Page 68: Training and Deployment

Page 69: Deep Learning Frameworks

• There are several deep learning frameworks that can be used to train and deploy deep learning models!
• Deploy-oriented frameworks/libraries: Caffe, TensorFlow, Darknet
  • Darknet is not well documented, but the code is quite simple and it is easy to use in deploy-oriented code
• Other frameworks/libraries, e.g., PyTorch, are more research-oriented than deploy-oriented
• Training the models usually requires high-end GPUs (e.g., GTX 1080, Titan X, etc.)
  • Training on a CPU is infeasible!

Page 70: Drone Deployment

• GPU-accelerated hardware, e.g., the NVIDIA TX-2 module, must be used during deployment to ensure adequate performance
• TX-2 example:
  • Integrated 256-core NVIDIA Pascal GPU
  • Hex-core ARMv8 64-bit CPU complex
  • 8 GB of LPDDR4 memory

Page 71: Drone Deployment

• The Robot Operating System (ROS) is used to provide a seamless integration platform
• Other solutions are also possible, but ROS is well established in the robotics community!
• Each deep learning algorithm can be executed as a ROS node
• Grouping several deep learning tasks into the same node can reduce the communication overhead in some cases and improve the performance of the system
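
A minimal rospy sketch of wrapping a detector as a ROS node; the topic names and the detect() stub are assumptions for illustration, and standard sensor_msgs/std_msgs message types are used:

```python
#!/usr/bin/env python
import rospy
from sensor_msgs.msg import Image
from std_msgs.msg import String

def detect(image_msg):
    # Placeholder for the actual CNN forward pass on the incoming frame.
    return "person 0.93 [120, 80, 64, 128]"

def on_image(msg, publisher):
    # Run the detector on every incoming frame and publish the result.
    publisher.publish(String(data=detect(msg)))

if __name__ == "__main__":
    rospy.init_node("detector_node")
    pub = rospy.Publisher("/drone/detections", String, queue_size=1)      # assumed topic
    rospy.Subscriber("/drone/camera/image_raw", Image, on_image, callback_args=pub)
    rospy.spin()
```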

Page 72: Drone Deployment

• NVIDIA also provides a set of deployment optimization tools that can further accelerate the models
• Using the TensorRT library with a crowd detection Caffe model leads to a significant speedup:
  • Without TensorRT: 45 fps
  • With TensorRT: 100 fps

Page 73: NVIDIA Redtail Project

• Drones that fly autonomously and follow a trail through a forest using only visual information
• The technical and implementation details of this project are provided in their paper "Toward Low-Flying Autonomous MAV Trail Navigation using Deep Neural Networks for Environmental Awareness" [1]
• The implementation is open-source and available at https://github.com/NVIDIA-Jetson/redtail
  • ROS nodes, interface with the Pixhawk flight controller, etc.
  • TensorRT implementation of YOLO

[1] Smolyanskiy, Nikolai, et al. "Toward Low-Flying Autonomous MAV Trail Navigation using Deep Neural Networks for Environmental Awareness." arXiv preprint. 2017.

Page 74: NVIDIA Redtail Project

• Object detection at 1 fps using a 16-bit YOLO variant
• TrailNet is used to provide orientation and lateral offset (treated as a classification problem)
• Over-confident networks perform worse
  • Learning the dataset vs. performing well in deployment
  • TrailNet was the only one able to fly autonomously, even though it didn't achieve the best accuracy
• Modifications to the used networks so they can run on devices with limited resources (TX-1)
  • Removing layers, etc.

Page 75: Q & A

Thank you very much for your attention!

Contact: Prof. A. Tefas, [email protected]
www.multidrone.eu

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 731667 (MULTIDRONE).