Page 1
BigStation: Enable Scalable Real-time Signal Processing in Large MU-MIMO Systems
Qing Yang, Xiaoxiao Li§, Hongyi Yao¶, Ji Fang‡, Kun Tan†, Wenjun Hu†, Jiansong Zhang†, Yongguang Zhang†
†Microsoft Research Asia (MSRA), Beijing, China; MSRA and CUHK, Hong Kong
§MSRA and Tsinghua University, Beijing, China; ¶MSRA and USTC, Hefei, Anhui, China
‡MSRA and BJTU, Beijing, China
Page 2
Motivation
• Demand for more wireless capacity
  – Proliferation of mobile devices: wireless access is primary
  – Data-intensive applications: video, tele-presence
  – "The amount of traffic carried on wireless networks will exceed the amount of wired traffic by 2015" (Cisco VNI 2011–2016)
SIGCOMM 2013, Hong Kong, Aug 2013 2
Page 3
Motivation
Can we engineer the next wireless network to match the existing wired network: giga-bit wireless throughput to every user?
Page 4
How to Gain More Wireless Capacity
• More spectrum (DSA)
  – Spectrum is a scarce, shared resource, and there is a limit
• Spectrum reuse (micro cell, pico cell, …)
  – Existing cells are already small (like Wi-Fi)
  – Increased deployment and management complexity
• Spatial multiplexing (MU-MIMO)
  – More promising
Page 5
Background: MU-MIMO
• Transmit to/Receive from multiple mobile stations
[Figure: an Access Point (AP) with m antennas performs joint signal processing for multiple mobiles with n total client antennas]
Uplink:   Y = HX,  X̂ = (H*H)^-1 H* Y
Downlink: S = H*(HH*)^-1 X,  Y = HS = X
• In theory, capacity scales linearly with the # of AP antennas
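The uplink/downlink equations above can be checked numerically. A minimal sketch with NumPy, assuming a random Gaussian channel with more AP antennas than client antennas (m ≥ n):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4  # m AP antennas, n total client antennas (m >= n)

def cgauss(shape):
    # Random complex Gaussian matrix, standing in for a rich-scattering channel.
    return rng.normal(size=shape) + 1j * rng.normal(size=shape)

# --- Uplink: clients send X, the AP receives Y = H X (H is m x n) ---
H_up = cgauss((m, n))
X = cgauss((n, 1))
Y = H_up @ X
# Zero-forcing: X_hat = (H* H)^-1 H* Y
X_hat = np.linalg.solve(H_up.conj().T @ H_up, H_up.conj().T @ Y)
assert np.allclose(X_hat, X)

# --- Downlink: the AP precodes S = H*(H H*)^-1 X, so clients receive Y = H S = X ---
H_dn = cgauss((n, m))  # downlink channel is n x m
S = H_dn.conj().T @ np.linalg.solve(H_dn @ H_dn.conj().T, X)
assert np.allclose(H_dn @ S, X)
```

With a noiseless channel the zero-forcing estimates recover the transmitted symbols exactly; real systems of course see noise amplification on top of this.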
Page 6
How Many Antennas Do We Need?
• … for giga-bit wireless link per user
# of antennas:     1      2      4      8     16     32     64    128
 20MHz          72.2M   144M   289M   578M   1.2G   2.3G   4.6G   9.2G
 40MHz           150M   300M   600M   1.2G   2.4G   4.8G   9.6G  19.2G
 80MHz           325M   650M   1.3G   2.6G   5.2G  10.4G  20.8G  41.6G
160MHz           650M   1.3G   2.6G   5.2G  10.4G  20.8G  41.6G  83.2G
(Regions of the table: 802.11n at the upper left, 802.11ac in the middle, large-scale MU-MIMO systems at the lower right)
Giga-bit to 20 concurrent users: a 160MHz channel with at least 40 antennas
Page 7
Challenge
• Can we build a scalable AP to support such large-scale MU-MIMO operation?
  – As n, and with it m, grows large?
[Figure: an Access Point (AP) with m antennas performs joint signal processing for mobiles with n total client antennas]
Page 8
Computation and Throughput Requirements: a Back-of-Envelope Estimation
• Setting: 160MHz, 40 antennas
• Data path:
  – 160MHz channel width → r = 5Gbps of samples per antenna
  – 40 antennas → 200Gbps in total
• Computation:
  – Channel inversion (once every frame): O(mn²r/t_f) → 269 GOPS
  – Spatial demultiplexing/precoding: O(mnr) → 1.5 TOPS
  – Channel decoding: O(nr) → 5.5 TOPS
  – 7.27 TOPS in total!
• A state-of-the-art multi-core CPU achieves only ~50 GOPS
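The arithmetic above can be reproduced as a quick sanity check. The final "CPUs' worth" figure is derived here from the slide's numbers, not stated on the slide:

```python
# Stage estimates from the slide, in TOPS (tera-operations per second).
channel_inversion = 0.269   # O(m n^2 r / t_f), once every frame
spatial_demux     = 1.5     # O(m n r)
channel_decoding  = 5.5     # O(n r)

total_tops = channel_inversion + spatial_demux + channel_decoding
assert abs(total_tops - 7.27) < 0.01   # matches "7.27 TOPS in total"

# A single state-of-the-art multi-core CPU: ~50 GOPS = 0.05 TOPS.
cpus_worth = total_tops / 0.05
print(f"{total_tops:.2f} TOPS total, roughly {cpus_worth:.0f} CPUs' worth of compute")
```

The takeaway: no single CPU comes close, so the processing must be spread across many machines.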
Page 9
A Single Central Processing Unit
[Figure: a single central processing unit in the AP performs joint signal processing for all m antennas and n client antennas]
Page 10
BigStation: Parallelizing to Scale
[Figure: the BigStation AP replaces the single central unit with many simple processing units connected by an inter-connecting network]
Page 11
Outline
• Parallel architecture
• Parallel algorithms and optimization
• Performance
• Conclusion
Page 12
Naive Architecture
• A pool of processing servers
  – Sending samples of the same frame to one server…
• Enough processing capability with ⌈t_p/t_f⌉ servers
Page 13
Naive Architecture
• Issue: long processing latency per frame (~1s)
• Wireless protocols require millisecond-scale latency
Page 14
Our Approach: Distributed Pipeline
• Parallelizing MU-MIMO processing into a 3-stage pipeline:
  Channel inversion → Spatial demultiplexing → Channel decoding
• At each stage, the computation is further parallelized among multiple servers
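The 3-stage pipeline can be sketched with queues and worker threads. This is a toy model, not BigStation's implementation: the per-stage lambdas below are placeholders standing in for the real signal-processing steps.

```python
import queue
import threading

def stage(fn, inbox, outbox):
    # Each stage drains its input queue, applies its step, and forwards results.
    while True:
        item = inbox.get()
        if item is None:          # sentinel: shut down and propagate downstream
            outbox.put(None)
            return
        outbox.put(fn(item))

# Hypothetical placeholder steps for the three pipeline stages.
steps = [lambda x: x + 1,         # "channel inversion"
         lambda x: x * 2,         # "spatial demultiplexing"
         lambda x: x - 3]         # "channel decoding"

queues = [queue.Queue() for _ in range(len(steps) + 1)]
threads = [threading.Thread(target=stage, args=(f, qi, qo))
           for f, qi, qo in zip(steps, queues, queues[1:])]
for t in threads:
    t.start()

for x in range(5):                # stream frames into the first stage
    queues[0].put(x)
queues[0].put(None)

out = []
while (item := queues[-1].get()) is not None:
    out.append(item)
assert out == [(x + 1) * 2 - 3 for x in range(5)]
```

Because each stage runs concurrently on its own worker, a new frame can enter the pipeline before earlier frames finish, which is what keeps per-frame latency bounded.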
Page 15
Data Partitioning across Servers
• Exploiting data parallelism inside MU-MIMO
[Figure: for channel inversion and spatial demultiplexing, the OFDM signal is partitioned by subcarriers across servers]
Page 16
Data Partitioning across Servers
• Exploiting data parallelism inside MU-MIMO
[Figure: for channel decoding, the signal is partitioned by spatial streams across servers]
Page 17
Example
• Giga-bit to 20 users
  – 160MHz → 468 parallel subcarriers
• Subcarrier partitioning
  – Each server needs to handle a minimum of 10Mbps of data
• Spatial stream partitioning
  – Each server needs to handle 5Gbps of data
• Generally within an existing server's processing capability
  – Multi-core (4~16)
  – 10G NIC
Page 18
Summary
• Distributed pipeline for low latency
• Exploiting data parallelism across servers at each processing stage
• If a single partition is still beyond the capability of a single processing unit:
  – Build a deeper pipeline (see paper for details)
Page 19
Outline
• Parallel architecture
• Parallel algorithms and optimization
• Performance
• Conclusion
Page 20
Computation Partitioning in a Server
• Three key operations in MU-MIMO
– Matrix multiplication
– Matrix inversion
– Viterbi decoding (channel decoding)
Page 21
Parallel Matrix Multiplication
• Divide-and-conquer
Split H by rows into H1 and H2:
  H*H = [H1* H2*][H1; H2] = H1*H1 + H2*H2
  (Core 1 computes H1*H1; Core 2 computes H2*H2)
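The row-block decomposition can be checked with NumPy; a two-worker thread pool stands in for the two cores:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(1)
m, n = 8, 4
H = rng.normal(size=(m, n)) + 1j * rng.normal(size=(m, n))

# Split H by rows: H = [H1; H2], so H*H = H1*H1 + H2*H2.
H1, H2 = H[:m // 2], H[m // 2:]

def partial_gram(Hk):
    # Each worker computes the Gram matrix of its own row block.
    return Hk.conj().T @ Hk

with ThreadPoolExecutor(max_workers=2) as pool:
    parts = list(pool.map(partial_gram, [H1, H2]))

assert np.allclose(sum(parts), H.conj().T @ H)
```

The partial products are independent, so the only coordination needed is the final sum, which is what makes this divide-and-conquer step cheap to parallelize.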
Page 22
Parallel Matrix Inversion
• Based on Gauss-Jordan method
[Figure: the augmented matrix [H | I], with its columns partitioned between Core 1 and Core 2]
Page 23
Parallel Matrix Inversion
• Based on Gauss-Jordan method
[Figure: Gauss-Jordan elimination reduces [H | I] to [I | H⁻¹], with the column updates partitioned between Core 1 and Core 2]
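A minimal sequential sketch of Gauss-Jordan inversion of the augmented matrix [H | I]. The column updates in the inner elimination loop are independent of one another, which is the property that lets a parallel version split columns across cores (the sketch itself runs on one core):

```python
import numpy as np

def gauss_jordan_inverse(A):
    """Reduce the augmented matrix [A | I] to [I | A^-1]."""
    n = A.shape[0]
    aug = np.hstack([A.astype(complex), np.eye(n, dtype=complex)])
    for i in range(n):
        # Partial pivoting for numerical stability.
        p = i + np.argmax(np.abs(aug[i:, i]))
        aug[[i, p]] = aug[[p, i]]
        aug[i] /= aug[i, i]                 # normalize the pivot row
        for k in range(n):                  # eliminate column i in other rows;
            if k != i:                      # each column's updates are independent
                aug[k] -= aug[k, i] * aug[i]
    return aug[:, n:]

rng = np.random.default_rng(2)
H = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
assert np.allclose(gauss_jordan_inverse(H), np.linalg.inv(H))
```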
Page 24
Parallel Viterbi Decoding
• Challenge: sequential operations on a continuous (soft-)bit stream
• Solution: – Artificially divide bit-stream into blocks
[Figure: each core decodes one block of the divided stream]
Page 25
Parallel Viterbi Decoding
• Challenge: sequential operations on a continuous (soft-)bit stream
• Solution:
  – Artificially divide the bit stream into blocks
  – Add overlaps between blocks to ensure the decoder converges to the optimal path
[Figure: adjacent blocks, decoded on different cores, share overlapped regions]
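The overlapped-block idea can be sketched as a partitioning function. The Viterbi decode step itself is omitted; only the block boundaries and the trimming of the overlap regions are shown:

```python
def overlapped_blocks(stream, L, D):
    """Split `stream` into blocks of L bits, each extended by up to D bits of
    overlap on both sides, so independent decoders can converge to the same
    survivor path before emitting their middle L bits."""
    blocks = []
    for start in range(0, len(stream), L):
        lo = max(0, start - D)
        hi = min(len(stream), start + L + D)
        # (start, extended block, offset of the real data within the block)
        blocks.append((start, stream[lo:hi], start - lo))
    return blocks

stream = list(range(20))
blocks = overlapped_blocks(stream, L=5, D=2)

# Discarding the overlap regions after (hypothetical) per-block decoding
# reconstructs the original stream with no gaps or duplicates.
rebuilt = []
for start, chunk, skip in blocks:
    rebuilt.extend(chunk[skip:skip + 5])
assert rebuilt == stream
```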
Page 26
Parallel Viterbi Decoding
• How to choose the right block size L?
  – A tradeoff between latency and overhead
• Our goal: fully utilize the computation capacity while keeping L minimal
• Optimal size: L* = 2Du/(mv − u)
  u: stream bit rate, v: processing rate per core, m: # of cores, D: overlap size
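The formula can be wrapped in a small helper. The numbers in the example are hypothetical, chosen only to exercise the formula, not taken from the paper:

```python
def optimal_block_size(u, v, m, D):
    """L* = 2 D u / (m v - u): the smallest block size that keeps all m cores
    fully utilized given stream rate u, per-core decode rate v, and overlap D."""
    assert m * v > u, "aggregate decode rate must exceed the stream rate"
    return 2 * D * u / (m * v - u)

# Hypothetical numbers: 100 Mbps stream, 30 Mbps per core, 4 cores, D = 64 bits.
L = optimal_block_size(u=100e6, v=30e6, m=4, D=64)
assert L == 640.0   # 2*64*100e6 / (120e6 - 100e6)
```

As the aggregate decode rate mv approaches the stream rate u, L* grows without bound, capturing the latency/overhead tradeoff: slower cores force larger blocks to amortize the overlap.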
Page 27
Optimization: Lock-free Computing Structure
• Complex interaction between communication and computation threads
[Figure: removing contention at the output buffer with a lock-free structure yields a 1.31x speedup]
Page 28
Optimization: Communication
• Parallelizing communication among multiple cores
• Dealing with incast problem
– Application-level flow control
• Isolating communication and computation on different cores
Page 29
Outline
• Parallel architecture
• Parallel algorithms and optimization
• Performance
• Conclusion
Page 30
Micro-benchmarks
• Platform: Dell server with an Intel Xeon E5520 CPU (2.26 GHz, 4 cores)
Channel inversion
Page 31
Micro-benchmarks
[Figures: spatial demultiplexing and Viterbi decoding throughput]
Page 32
Micro-benchmarks
[Figure: end-to-end benchmarks at 6 users/100Mbps, 20 users/600Mbps, and 50 users/1Gbps]
Page 33
Prototype
• Software radio: Sora MIMO Kit
  – 4x phase-coherent radio chains
  – Extensible with an external clock
Page 34
Capacity Gain
Capped at a constant value due to random user selection!
Page 35
Capacity Gain
[Figure: overprovisioning AP antennas yields a 6.8x capacity gain]
Page 36
Processing Delay
[Figure: processing delay under light load (1 frame per 10ms) and heavy load (back-to-back frames); ~860μs]
Page 37
Things I didn't talk about (related and future work)
• How to get channel state in a scalable way
– Argos [Shepard et al., MobiCom 2012]
– JMB [Rahul et al., SIGCOMM 2012]
• MU-MIMO MAC
– Better user selection than random? (Future work)
• Automatic gain control in large scale MU-MIMO
– Future work
Page 38
Conclusions
• Scalable processing of large MU-MIMO systems is possible
  – Exploiting the parallelism of MU-MIMO operations across processing servers
  – Developing a distributed processing pipeline
• Large-scale MU-MIMO is a promising way to scale wireless capacity by another 100x
  – Yet many challenges remain (user selection, AGC, …)
Page 39
Thanks! I'll take your questions!