Distribution Systems for 3D Teleimmersive and Video 360
Content: Similarities and Differences
Klara Nahrstedt
Department of Computer Science
University of Illinois at Urbana-Champaign
ACM Multimedia Systems, June 12, 2018, Amsterdam, Netherlands
Overview
• Motivation
• 3D Teleimmersive Video Representation
• Video 360 Representation
• Similarities and Differences in Content Representation
• Distribution of 3DTI Video
• Distribution of Video 360
• Similarities and Differences in Content Distribution
• Conclusion
3D Teleimmersive (3DTI) Systems
Source: http://tele-immersion.citris-uc.org; http://monet.cs.illinois.edu/projects/cyphy-multi-modal-teleimmersion-for-tele-physiotherapy/teleimmersion-gallery/
High-End Tele-Presence Environments
Traditional telephony and videoconferencing provide some of these elements, including ease of
use and audio quality, yet fail on most others. Our Coliseum effort aims to advance the state of
videoconferencing by applying recent advances in image-based modeling and computer vision to
bring these other elements of face-to-face realism to remote collaboration.
Scene reconstruction, the task of building 3D descriptions using the information contained in
multiple views of a scene, is an established challenge in computer vision [Longuet-Higgins 81].
It has seen remarkable progress over the last few years due to improved algorithms [Seitz 97,
Narayanan 98, Pollefeys 99] and faster computers. The Coliseum system is based on the Image-
Based Visual Hulls (IBVH) image-based rendering scene reconstruction technology of MIT
[Matusik 00]. Our recent Coliseum efforts have shown that the IBVH method can operate at
video rates from multiple camera streams hosted by a single personal computer [Baker 02].
Each Coliseum participant works on a standard PC with LCD monitor and a rig housing five
video cameras spaced at roughly 30 degree increments, as shown in Figure 1. During a
teleconferencing session, Coliseum builds 3D representations of each participant at video rates.
The appropriate views of each participant are rendered for all others and placed in their virtual
environments, one view of which is shown in Figure 2. The impression of a shared space results,
with participants free to move about and express themselves in natural ways, such as through
voice, gesture, and gaze.
Handling five video streams and preparing 3D reprojection views for each of numerous
coparticipating workstations at video rates is a formidable task. Tight control must be exercised
on computation, process organization, and inter-desktop communication. At project inception,
we determined we needed an effective speedup of about one hundred times over the MIT IBVH
processing on a single PC to reach utility. Our purpose in this paper is to detail some of the major
issues in attaining this performance.
Figure 1. The Coliseum immersive videoconferencing system
[Figure: High-end tele-presence environments – Cisco Tele-presence, HP Halo, UNC, HP Coliseum]
Multi-Camera Live Broadcast Systems
http://www.dailymail.co.uk/sciencetech/article-2336893/New-TV-cameras-bring-Matrix-style-bullet-time-
trickery-live-sports-coverage.html
Multi-Camera Broadcast Systems
https://thegadgetflow.com/portfolio/slingstudio-
multi-camera-broadcaster/
https://www.myslingstudio.com/ https://www.cinfo.es/our-products/synthetrick/multicam
https://www.spiideo.com/sports/
360-Degree Video
360 Degrees Cameras – CoolPile.com: http://coolpile.com/tag/360-degrees-cameras
3D Teleimmersive Video Representation
3D Teleimmersive Stereo Video and Free Viewpoint Video Capture
3DTI Viewing
Photo courtesy of Prof. Ruzena Bajcsy.
Singapore, 2014
3D Stereo Video Representation
Wu, Ahsan, Kurillo, Agarwal, Nahrstedt, Bajcsy, “Color-plus-Depth Level-of-Detail in 3D Teleimmersive Video: A Psychophysical Approach”, ACM Multimedia 2011
Free-Viewpoint 3D Video Representation
Example of 3D representation captured by different cameras
[Figure: 3D representations captured by camera-1, camera-2, camera-3, …, camera-8, with the camera direction indicated; source: http://zing.ncsl.nist.gov/~gseidman/vrml/]
[View model: angle θ between camera orientation Oi and user view orientation Ou]
3DTI Data Model
• 3D frame for camera i at time t: f_{i,t}
• Each pixel in the frame carries color+depth data and can be rendered independently
• Stream for camera i: S_i = { f_{i,t1}, f_{i,t2}, … }
• Macro-frame: F_t = { f_{1,t}, f_{2,t}, …, f_{n,t} }
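The data model above can be sketched in code; the class and function names below are illustrative, not from the 3DTI systems themselves, and the per-pixel payload is simplified to (R, G, B, depth) tuples.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Frame:                      # f_{i,t}
    camera_id: int
    timestamp: float
    pixels: List[Tuple[int, int, int, float]]  # (R, G, B, depth) per pixel

@dataclass
class Stream:                     # S_i = { f_{i,t1}, f_{i,t2}, ... }
    camera_id: int
    frames: List[Frame] = field(default_factory=list)

def macro_frame(streams: List[Stream], t: float) -> List[Frame]:
    """F_t = { f_{1,t}, ..., f_{n,t} }: one frame per camera at time t."""
    return [f for s in streams for f in s.frames if f.timestamp == t]

# Example: two cameras, two capture instants.
s1 = Stream(1, [Frame(1, 0.0, []), Frame(1, 0.033, [])])
s2 = Stream(2, [Frame(2, 0.0, []), Frame(2, 0.033, [])])
F0 = macro_frame([s1, s2], 0.0)
print(len(F0))  # 2 frames in the macro-frame at t=0
```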
360-Degree Video Representation
360-Degree Video and the User's Viewport
Generation of 360-Degree Video
• Capturing of multiple 2D videos together with their metadata
• Stitching the videos together and further editing them into a spherical video
• Encoding the spherical video considering projection, interactivity, storage, and delivery formats (this will impact the decoding and rendering processes)
Video 360 Viewing and Navigation
https://en.wikipedia.org/wiki/Head-mounted_display
Controller
Examples of HMDs (Head-Mounted Displays) – Oculus Rift, Samsung Gear VR, HTC Vive
360-Degree Video Data Model
• Field-of-View or Viewport – display region on the Head-Mounted Display
  • Fraction of the omnidirectional view of the scene
  • Viewport defined by a device-specific viewing angle (typically 120 degrees) which horizontally delimits the scene from the head-direction center, called the viewport center
• Viewport Resolution – 4K (3840x2160) pixels
  • Resolution of the full 360-degree video – at least 12K (11520x6480)
• Video Framerate – on the order of the HMD refresh rate of 100 Hz, i.e., 100 fps
• Motion-to-Photon Latency requirement
  • Less than 20 ms for VR – much smaller than an Internet request-reply delay
  • Need viewport prediction
• Bitrate – Video 360 vs. HEVC (8K video at 60 fps is approx. 100 Mbps)
• Tiling – spatial division of the spherical video into independent tiles
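The viewport and full-sphere resolution figures above are consistent under a simple uniform-angular-density assumption, which the following back-of-the-envelope sketch checks (real projections are not uniform, so this is only a sanity check; the 60-degree vertical FOV is an assumption):

```python
# Scale a viewport resolution up to the full 360x180-degree sphere,
# assuming uniform angular pixel density (a simplification).

def full_sphere_resolution(vp_w, vp_h, fov_h_deg, fov_v_deg):
    full_w = round(vp_w * 360 / fov_h_deg)   # horizontal: 360 degrees total
    full_h = round(vp_h * 180 / fov_v_deg)   # vertical: 180 degrees total
    return full_w, full_h

# 4K viewport spanning a 120-degree horizontal, 60-degree vertical FOV:
print(full_sphere_resolution(3840, 2160, 120, 60))  # (11520, 6480) -- the "12K" figure
```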
Tiles and Spherical Maps
Issues with Spherical Mapping to Tiles
• Viewport distortion
• Spatial quality variance
• Considerations of the sphere-to-plane mapping and the viewing probability of tiles are IMPORTANT
• Overall spherical distortion of a segment is the sum of the distortion over all pixels the segment covers
Xie et al. “360ProbDASH: Improving QoE of 360 Video Streaming Using Tile-based HTTP Adaptive Streaming”, ACM MM 2017
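In the spirit of 360ProbDASH's probability-weighted, tile-based adaptation, bitrate-budget allocation can be sketched as below; the two-level bitrates, the greedy upgrade rule, and the function names are illustrative assumptions, not the paper's actual optimizer:

```python
# Allocate a bitrate budget across tiles so that tiles with higher
# viewing probability get the higher-quality representation.

def allocate(tiles, budget):
    """tiles: list of (tile_id, view_prob); each tile can be fetched at a
    low (1 Mbps) or high (4 Mbps) rate. Every tile gets at least the low
    rate; remaining budget upgrades the most probable tiles first."""
    LOW, HIGH = 1.0, 4.0
    rates = {tid: LOW for tid, _ in tiles}
    budget -= LOW * len(tiles)
    for tid, _ in sorted(tiles, key=lambda t: -t[1]):  # most probable first
        if budget >= HIGH - LOW:
            rates[tid] = HIGH
            budget -= HIGH - LOW
    return rates

tiles = [("t1", 0.6), ("t2", 0.3), ("t3", 0.08), ("t4", 0.02)]
print(allocate(tiles, budget=10.0))  # t1 and t2 upgraded, t3/t4 stay low
```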
Video 360 Spherical-to-Plane Projections
Corbillon, Simon, Devlic, Chakareski, "Viewport-Adaptive Navigable 360-Degree Video Delivery", May 2017
Nasrabadi et al., "Adaptive 360-Degree Video Streaming using Scalable Video Coding", ACM Multimedia 2017
Video 360 Capture as Spherical Video
• Equirectangular Projection – stretches the poles and reduces coding efficiency
• Pyramid Projection – sees degradation on the sides
• Cubemap – maps a 90-degree FOV to each side of a cube and hence shows less degradation
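The pole-stretching of the equirectangular projection follows directly from its linear latitude-to-row mapping, sketched below (the function name and frame size are illustrative):

```python
# Equirectangular sphere-to-plane mapping: longitude and latitude map
# linearly to x and y. The linear latitude axis is what stretches the
# poles and wastes coding bits there.

def equirect(yaw_deg, pitch_deg, width, height):
    """Map a direction (yaw in [-180,180], pitch in [-90,90]) to pixel
    coordinates on an equirectangular frame."""
    x = (yaw_deg + 180.0) / 360.0 * width
    y = (90.0 - pitch_deg) / 180.0 * height
    return int(x) % width, min(int(y), height - 1)

print(equirect(0, 0, 3840, 1920))   # (1920, 960): straight ahead -> frame center
print(equirect(0, 90, 3840, 1920))  # (1920, 0): the entire top row collapses to one pole
```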
Encoding and Delivery Formats
• Codecs
  • AVC/H.264, HEVC/H.265
  • VP8, VP9
• Delivery Formats
  • DASH/HLS (Dynamic Adaptive Streaming over HTTP / HTTP Live Streaming)
  • MPEG-DASH standard considers tiling
    • MPD (Media Presentation Description) – modified for Video 360
    • SRD (Spatial Relationship Description) integrated into the MPD
  • HEVC considers video tiles
• MPEG Immersive Media standard ISO/IEC 23090
  • Part 1: Use cases
  • Part 2: OMAF (Omnidirectional Media Application Format)
    • Description of the equirectangular projection format
    • Metadata for interoperable rendering of 360-degree monoscopic and stereoscopic audio-visual data
    • Storage format (ISO Base Media File Format/MP4)
    • Codecs: HEVC, MPEG-H 3D Audio
  • Part 3: Immersive video
  • Part 4: Immersive audio
Graf, Timmerer, Mueller, “Towards Bandwidth Efficient Adaptive Streaming of Omnidirectional Video over HTTP”, ACM MMSys 2017
Similarities and Differences of Representations
Similarity Parameter | 3DTI Video | 360-Degree Video
Multi-camera views | Yes (view) | Yes (viewport)
Joint coordinate system | Yes | Yes
Bitrate consideration | Yes | Yes
View change | Yes | Yes

Difference Parameter | 3DTI Video | 360-Degree Video
Video format | Color-plus-depth | Color
Smallest item to adapt | 3DTI frame | Tile
Frame representation | Frame manipulation at pixel level (RGB, depth, polygons) | Frame manipulation at tile and region-of-interest level
Coding | Simple zlib | Complex HEVC
Resolution | 640x480 or 1080p | 4K to 16K
Resolution for diverse devices | No | Yes
Format for diverse navigation | No | Yes
Distribution Systems of 3DTI Video
Multi-Camera 3DTI Transmission System
[Figure: Two 3DTI sites (Site-1, Site-2) connected over the Internet; each site has cameras (C), microphones (A), a renderer (R), displays, and a gateway (G) attached to a switch. Legend: C = camera, A = microphone, G = gateway, R = renderer]
Approach: Multi-stream Hierarchical Adaptation
Multi-stream Adaptation (Stream Selection)
• Camera orientation: Oi; user view orientation: Ou
• cos θi = (Oi · Ou) / (|Oi| |Ou|), where θi is the angle between the camera i direction and the user view direction
• Selection (SI) – view-centric stream selection: select stream i if cos θi ≥ T, where T is a user-specified parameter
Zhenyu Yang, Klara Nahrstedt, Bin Yu, Ruzena Bajcsy, "A Multi-stream Adaptation Framework for Bandwidth Management in 3D Tele-immersion", ACM NOSSDAV 2006, May 2006, Newport, Rhode Island
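A minimal sketch of this view-centric selection, assuming camera and user orientations are given as 3D vectors and using the cos θ threshold T from the slide (function names are illustrative):

```python
import math

def cos_angle(o_i, o_u):
    """cos of the angle between camera orientation o_i and user view o_u."""
    dot = sum(a * b for a, b in zip(o_i, o_u))
    norm = math.sqrt(sum(a * a for a in o_i)) * math.sqrt(sum(b * b for b in o_u))
    return dot / norm

def select_streams(cameras, o_u, T):
    """cameras: {camera_id: orientation vector}. Keep streams whose
    camera direction is within angle acos(T) of the user view."""
    return sorted(cid for cid, o_i in cameras.items() if cos_angle(o_i, o_u) >= T)

cams = {1: (1, 0, 0), 2: (0, 1, 0), 3: (-1, 0, 0), 4: (0.7, 0.7, 0)}
print(select_streams(cams, o_u=(1, 0, 0), T=0.5))  # cameras facing roughly the view direction
```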
View-Centric Stream Differentiation
[Figure: 3D capture with cameras 2, 4, 6, 8; streams contributing more to the user view are transmitted and rendered at higher priority, less important streams at lower priority]
Timing Performance Validation
Macro-Frame Delay at Sender side
Macro-frame Completion Interval at Receiver Side (End-to-End Delay UIUC-UCB)
Immersive View-Centric Multi-View Multi-Party 3DTI
Z. Yang et al., "ViewCast: View Dissemination and Management for Multi-Party 3D Tele-immersive Environments", ACM Multimedia 2007
Multi-Party Multi-View Telepresence
Example of 3D representation captured by multiple cameras
[Figure: cameras c1–c8 arranged around the participant; a view is composed from the cameras facing the user's view direction]
• Multi-stream contents
• Multi-view environment
• High resource demand
• Multi-stream dependency
• Real-time interactivity
Telepresence Session Control
[Figure: Multi-site session with cameras (C), microphones (A), gateways (G), and renderers (R). Legend: C = camera, A = microphone, G = gateway, R = renderer]
• Decoupled control and data plane
  • Hierarchical control: global session controller; local session controllers at the gateways (G)
• Coordinated global control plane
  • Monitors the data plane
  • Configures the data plane
• Data plane at TI participants
  • Session routing table (SRT)
  • Stream forwarding
Global Session Controller
Session routing table (SRT):
Matching Field (ID) | Forwarding Action | Bitrate
(entries forward streams to Site-X, Site-Y, Site-Z)
ViewCast: Middleware (Overlay) Framework
A three-layer multi-party/multi-stream management framework:
• Tele-immersive Application
• Service Middleware – ViewCast (view-aware stream differentiation/selection, overlay network)
• Network
[Figure: A session controller mediates views V1–V4 among users U2–U4, each with a view request (U2.w, U3.w, U4.w); at each producer, cameras 2, 4, 6, 8 capture 3D streams, and streams contributing more to a user's view are transmitted at higher priority. When U3 changes its view request (U3.w), another user's view can become a victim]
Why is view change a problem?
[Plot: streams per view; with GC = 100% and Ii (Oi) = 24, ViewCast sustains on average 3.2 streams per view, performing better than MC-3 but with a 22% lower rejection ratio]
Immersive and Non-ImmersiveMulti-Party Multi-View (Live Broadcast) Systems
Ahsan Arefin, Zixia Huang, Klara Nahrstedt, Pooja Agarwal, "4D TeleCast: Towards Large Scale Multi-site and Multi-view Dissemination of 3DTI Content", IEEE ICDCS 2012, Macau, China.
TI Components & Participants
• Immersive Participants
  • Tight interactivity
  • Limited scale
• Non-immersive Participants
  • Large scale
[Figure: Immersive sites (Site-1 at Illinois, Site-2 at Berkeley) with cameras (C), microphones (A), sensors (S), renderers (R), and gateways (G) act as producers over the Internet; many non-immersive viewer sites (Site-3 … Site-10), each with a renderer and gateway, consume the content. Legend: C = camera, A = microphone, G = gateway, R = renderer, S = sensors]
View/Stream Concepts among Immersive Participants
[Figure: At each content producer (immersive participant), cameras 2, 4, 6, 8 capture 3D streams; streams contributing more to the user view are transmitted at higher priority, less important streams at lower priority]
View/Stream Concept among Non-Immersive Participants
[Figure: Content producers at Site-A capture 3D streams from cameras 1–8; viewers (non-immersive participants) at Site-B1 see a display whose view is an ordered preference over camera streams, e.g. v1 = [6 > 5 > 4 > 7 > 6 > 5] and v2 = [4 > 3 > 5 > 6 > 7 > 5]; the producers' 3D streams together form the 4D content]
Multi-View Video among Non-Immersive Participants
Approach: 4D TeleCast
[Figure: 4D TeleCast architecture. A producer tier (Site-A, Site-B, Site-C; each with cameras C, microphones A, sensors S, renderers R, and a gateway G) feeds a viewer tier over the Internet. A Global Session Controller (GSC) coordinates Local Session Controllers (LSC) at the sites. Distribution uses a CDN/CDN-P2P infrastructure with CDN-assisted peers (Wang'08, Liu'10, Chang'09): viewers request views such as V1 = {S1, S2, S3}, and the constituent streams s1, s2, s3 are forwarded through the CDN's core and edge servers]
Multi-stream Dependency (Problem Description)
[Figure: View v1 = {S1^A, S2^A} is requested by users U1, U2, U3 through the CDN, with a maximum allowed delay bound dbuff before frames are sent to the display. If stream S2^A violates the delay bound, the on-time stream S1^A becomes a victim stream: the view cannot be rendered and its bandwidth is wasted]
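The victim-stream condition can be sketched as a simple completeness check over a view's streams; the function name and timing values below are illustrative assumptions:

```python
# A view can only be rendered if all of its streams deliver the frame
# for capture time t within the buffering bound dbuff; a frame that
# misses the bound makes the on-time frames of sibling streams wasted
# bandwidth ("victim streams").

def victims(view_streams, arrival, t, dbuff):
    """view_streams: stream ids required by the view.
    arrival: {stream_id: arrival time of its frame for capture time t}.
    Returns ([], True) if all frames meet t + dbuff; otherwise the list
    of on-time (victim) streams and False."""
    late = [s for s in view_streams if arrival[s] > t + dbuff]
    if not late:
        return [], True
    return [s for s in view_streams if s not in late], False

# Stream S2's frame misses the 0.1 s bound, so S1's on-time frame is a victim:
print(victims(["S1", "S2"], {"S1": 0.05, "S2": 0.20}, t=0.0, dbuff=0.1))
```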
Understanding E2E Delay
[Figure: Delay layer hierarchy. Producer streams s1^A, s1^B, s1^C from Site-A, Site-B, Site-C reach viewers u1, u2, u3 organized into Layer-0, Layer-1, Layer-2 by end-to-end delay, where τ = layer size and Δ = distance from the source]
• Use Delay Layer Hierarchy
[Plot: fraction of viewers (0–1) vs. maximum layer of accepted streams (0–18)]
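Reading the figure as placing a viewer in the layer given by its delay distance Δ divided by the layer size τ (an illustrative interpretation, not a stated formula), layer assignment is:

```python
# Delay-layer hierarchy: with layer size tau, a viewer at end-to-end
# delay-distance delta from the producing source sits in layer
# floor(delta / tau); deeper layers tolerate proportionally more delay.

def delay_layer(delta, tau):
    """Layer index for a viewer at delay-distance delta from the source."""
    return int(delta // tau)

# With tau = 50 ms, viewers at 30, 80, and 140 ms land in layers 0, 1, 2:
print([delay_layer(d, 50) for d in (30, 80, 140)])  # [0, 1, 2]
```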
Distribution Systems for Video 360
Pipeline of 360-Degree Video
Graf, Timmerer, Mueller, “Towards Bandwidth Efficient Adaptive Streaming of Omnidirectional Video over HTTP”, ACM MMSys 2017
Challenges of 360-degree Video Distribution
• Real-Time Stitching
• Simulator Sickness in Interactivity Scenarios
  • Must be able to react to HMD head movements as fast as the HMD refresh rate (120 Hz)
  • Viewport extraction in real time
  • Challenge: difficult to predict user orientation more than 3 seconds ahead
  • Challenge: if short-term prediction is needed, how do we avoid rebuffering/stalls with small playout buffers?
• Avoidance of bandwidth waste (downloading viewports that are not needed)
• Tile prefetching errors
MPEG-DASH Video Distribution System for Single 2D Video Stream
dash.js. https://github.com/Dash-Industry-Forum/dash.js/wiki.
MPEG-DASH Video 360 Video Streaming using Tiles
Graf, Timmerer, Mueller, “Towards Bandwidth Efficient Adaptive Streaming of Omnidirectional Video over HTTP”, ACM MMSys 2017
360-Video Streaming Systems
• Tiling for Adaptive Streaming
  • Video divided into tiles
  • Depending on the mapping of the spherical video projection, different tiles will be streamed
  • Tiles currently viewed by the user are streamed at high quality and the rest at low resolution
• Personalized Viewport-Only Streaming – asymmetric panorama viewing
  • Also called asymmetric-panorama viewport-adaptive streaming
  • Methods: Truncated Square Pyramid (TSP) projection, Cubemap
  • Video divided into segments
  • When the client moves their head, the viewport center changes and a new viewport must be displayed
  • Decrease of bitrate without decrease of viewport quality
ISO/IEC JTC1/SC29/WG11/M. 2016. VR/360 Video Truncated Square Pyramid Geometry for OMAF.
Tile-based HTTP Adaptive Streaming and Head Movement Prediction
Xie, Xu, Ban, Zhang, Guo, "360ProbDASH: Improving QoE of 360 Video Streaming Using Tile-based HTTP Adaptive Streaming", ACM Multimedia 2017
Tile-based HTTP Adaptive Streaming for 360 Video
Data Model at the 360ProbDASH Server
ERP – raw panoramic video (equirectangular projection)
• The ERP video is divided into video chunks
• Each chunk is cropped into N tiles, indexed in raster-scan order
• Each tile is encoded into segments at M bitrate levels
• M×N optional segments are stored at the server, ready for pre-fetching and streaming
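The M×N segment layout can be sketched as a simple enumeration (the identifiers and bitrate values below are illustrative):

```python
# Server-side segment catalog: each chunk of the equirectangular
# panorama is cropped into N tiles (raster-scan order) and each tile
# is encoded at M bitrate levels, giving M x N optional segments per
# chunk that the client can pre-fetch selectively.

def segment_catalog(chunk_id, n_tiles, bitrates):
    """Enumerate the M x N segment identifiers for one chunk."""
    return [(chunk_id, tile, rate)
            for tile in range(n_tiles)   # raster-scan tile index
            for rate in bitrates]        # M bitrate levels per tile

catalog = segment_catalog(chunk_id=0, n_tiles=4, bitrates=[1.0, 4.0])
print(len(catalog))  # 4 tiles x 2 levels = 8 optional segments
```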
360ProbDASH Approach
• Pre-fetch segments by predicting the viewport
  • Use a probabilistic model for prediction
• Leverage linear-regression prediction of orientation
• Distribution of prediction errors
  • Long-term predictions are hard
  • Data from 5 users collected for short-term (3 second) prediction error
[Plots: yaw, pitch, and roll prediction errors for Δ = 3 sec; image source: Yourstory.com]
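A minimal sketch of the linear-regression orientation predictor, assuming a plain least-squares fit over recent yaw samples (the paper's exact model may differ; the sample values are illustrative):

```python
# Fit a line to the recent yaw samples and extrapolate delta seconds
# ahead. Long-term prediction is hard, so delta stays small (3 s here).

def predict(samples, delta):
    """samples: [(t, yaw_deg), ...] recent head-orientation history.
    Least-squares line fit, evaluated at t_last + delta."""
    n = len(samples)
    mt = sum(t for t, _ in samples) / n
    my = sum(y for _, y in samples) / n
    num = sum((t - mt) * (y - my) for t, y in samples)
    den = sum((t - mt) ** 2 for t, _ in samples)
    slope = num / den
    t_last = samples[-1][0]
    return my + slope * (t_last + delta - mt)

# Head turning at a steady 10 deg/s; predict 3 s past the last sample:
history = [(0.0, 0.0), (0.5, 5.0), (1.0, 10.0)]
print(predict(history, delta=3.0))  # 40.0 degrees
```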
Tile-based Adaptive Video Streaming
• Ochi et al. use tile-based streaming where the spherical video is mapped to equirectangular video and the video is cut into 8x8 tiles
• Hosseini and Swaminathan use hexaface-sphere-based tiling of 360-degree video to take projection distortion into account
  • Tiles are described with the MPEG-DASH Spatial Relationship Description
• Qian et al. use prediction of head movement to deliver tiles
• Weaknesses of tiling systems
  • Time- and energy-consuming reconstruction
  • Coding inefficiency due to independent tiling
  • Server management of files is difficult due to the large number of quality levels and large MPD files
  • The client selection process is complex
  • Mixed-bitrate tiles can result in visible borders and quality inconsistency when rendering combined tiles
  • Multiple decoders required
D. Ochi, Y. Kunita, A. Kameda, A. Kojima, and S. Iwaki. Live streaming system for omnidirectional video. In Proc. of IEEE Virtual Reality (VR), 2015.
M. Hosseini and V. Swaminathan. Adaptive 360 VR video streaming: Divide and conquer! In IEEE International Symposium on Multimedia (ISM), 2016.
F. Qian, B. Han, L. Ji, and V. Gopalakrishnan. Optimizing 360 video delivery over cellular networks. In ACM SIGCOMM AllThingsCellular, 2016.
QER Viewport-Adaptive Streaming
Corbillon, Simon, Devlic, Chakareski, "Viewport-Adaptive Navigable 360-Degree Video Delivery", May 2017
Viewport Adaptive Streaming System
Corbillon, Simon, Devlic, Chakareski, "Viewport-Adaptive Navigable 360-Degree Video Delivery", May 2017
Approach: QER – Quality Emphasized Region
• Not only bitrate adaptation but also server-side QER adaptation, where different regions have different quality
• Each QER is represented by a Quality Emphasis Center (QEC)
  • The full video is delivered in a certain projection representation (equirectangular, cube, …), but in different versions, each with its own QEC
  • The client device selects the right representation and extracts the viewport
• Viewport-adaptive streaming similar to DASH
  • The client runs an adaptation algorithm to select a video representation; it selects the QER/QEC among the available QERs
  • QEC selection is based on the smallest orthodromic distance
    • Orthodromic distance – shortest distance between two points on the surface of a sphere, measured along the surface
• Video segment length
  • Temporal chunks sent from the server – 1-10 seconds
  • Tradeoff between short and long segments
• Expanded MPD
  • MPD file expanded with new information: the coordinates of each QEC in degrees – two angles, in (0,360) and (-90,90) degrees
  • All representations assume the same reference coordinate system
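QEC selection by smallest orthodromic distance can be sketched directly from the definition above (the representation names and QEC coordinates are illustrative):

```python
import math

def orthodromic(a, b):
    """Great-circle distance (radians, unit sphere) between two
    (yaw_deg, pitch_deg) directions, via the spherical law of cosines."""
    y1, p1 = map(math.radians, a)
    y2, p2 = map(math.radians, b)
    c = math.sin(p1) * math.sin(p2) + math.cos(p1) * math.cos(p2) * math.cos(y1 - y2)
    return math.acos(max(-1.0, min(1.0, c)))  # clamp for float safety

def pick_qer(viewport_center, qecs):
    """Pick the representation whose QEC is closest to the viewport center.
    qecs: {representation_id: (yaw, pitch) of its QEC}."""
    return min(qecs, key=lambda r: orthodromic(viewport_center, qecs[r]))

qecs = {"front": (0, 0), "right": (90, 0), "back": (180, 0), "top": (0, 90)}
print(pick_qer((20, 10), qecs))  # "front" is the closest QEC
```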
QER-Based Viewport Adaptive Streaming
Corbillon, Simon, Devlic, Chakareski, "Viewport-Adaptive Navigable 360-Degree Video Delivery", May 2017
Examples of Experimental Results
• Metrics for the extracted viewport – (1) MS-SSIM (Multi-Scale Structural Similarity) and (2) PSNR
• Original equirectangular video at full quality – 4K video; extracted viewport at 1080p resolution
• QEC – the face at the center is encoded at best quality, other faces at 25% of full quality
• Distance – for d = 0, QEC and viewport center match (quality 0.98); as d increases, quality decreases
• Number of QECs – with more QECs, quality increases; shorter segments perform better
Similarities and Differences of Distribution Systems
Similarity Parameter | 3DTI Video | 360-Degree Video
Dealing with bandwidth | Adapt views | Adapt viewports
View change | Yes | Yes
Navigation | Via mouse | Via mouse
Client adaptation | Yes | Yes
Streaming protocols | TCP-based | TCP-based
Difference Parameter | 3DTI Video | 360-Degree Video
Dealing with bandwidth | Adapt views/streams | Adapt viewports/tiles
Encoding standards | zlib; some efforts in MPEG/OMAF on 3DTI compression | MPEG-DASH considers omnidirectional video tiles
Distribution style | Real-time view-based telepresence style or live view-based broadcast | On-demand DASH-style
Clients | Homogeneous | Heterogeneous
Viewing | Flat 2D or 3D displays | Head-mounted displays
Streaming protocols | TCP-based | HTTP-based standard MPEG-DASH
Navigation | Via mouse only | Via mouse, head movement, hand movement
Conclusion and Summary• 360-degree video is becoming possible for
• 3D teleimmersive video or • Omnidirectional video
• First solutions are coming up in terms of • capture, encoding and viewing
• But distribution represents challenge• Real-time live streaming or • Near-real-time distribution of 360-degree video
• A lot of presented material will be published in a survey paper • “Scalable 36-Degree Video Streaming: Challenges, Solutions and Opportunities”• Authors: Michael Zink, Ramesh Sitaraman, Klara Nahrstedt • Journal Venue: Proceedings of IEEE Special Issue• Editors: Boris Koldehofe, Ralf Steinmetz, …• Coming up in early 2019