SUBMITTED TO IEEE WIRELESS COMMUNICATIONS LETTERS, VOL. XX, NO. X, JUNE 2021 1

Opening the Black Box of Deep Neural Networks in Physical Layer Communication

Jun Liu, Kai Mei, Dongtang Ma, Senior Member, IEEE and Jibo Wei, Member, IEEE

Abstract—Deep neural network (DNN)-based physical layer techniques are attracting considerable interest due to their potential to enhance communication systems. However, most studies in the physical layer have tended to focus on applying DNN models to wireless communication problems rather than on theoretically understanding how a DNN works in a communication system. In this letter, we aim to quantitatively analyse why DNNs can achieve performance in the physical layer comparable to traditional techniques, and at what cost in terms of computational complexity. We further investigate, and experimentally validate, how information flows in a DNN-based communication system under information-theoretic concepts.

Index Terms—Deep neural network (DNN), physical layer communication, information theory.

I. INTRODUCTION

DEEP neural networks (DNNs) have recently drawn a lot of attention as a powerful tool for science and engineering problems that are virtually impossible to formulate explicitly, such as protein structure prediction, image recognition, speech recognition and natural language processing. Although the mathematical theory of communication systems has developed dramatically since Claude Elwood Shannon's monograph "A mathematical theory of communication" [1] laid the foundation of digital communication, the wireless channel-related gap between theory and practice motivates researchers to introduce DNNs into existing physical layer communication. To mitigate this gap, a natural thought is to let a DNN jointly optimize a transmitter and a receiver for a given channel model without being limited to component-wise optimization. In [2], a purely data-driven end-to-end communication system is proposed to jointly optimize transmitter and receiver components. The authors further model the linear and nonlinear steps of processing the received signal as a radio transformer network (RTN), which can be integrated into the end-to-end training process. The ideas of end-to-end learning of communication systems and RTNs through DNNs are extended to orthogonal frequency division multiplexing (OFDM) in [3]. Another natural idea is to recover channel state information (CSI) and estimate the channel as accurately as possible with a DNN so that the effects of fading can be reduced. The authors of [4] propose an end-to-end DNN-based CSI compression feedback and recovery mechanism, which is further extended with long short-term memory (LSTM) in [5]. In [6], a residual learning based DNN designed for OFDM channel estimation is introduced. Furthermore, to mitigate disturbances beyond Gaussian noise, such as channel fading and nonlinear distortion, [7] proposes an online fully complex extreme learning machine-based symbol detection scheme.

Manuscript received June 2, 2021; revised X X, 2021; accepted X X, 2021. This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants 61931020, 61372099 and 61601480. (Corresponding author: Jun Liu.)

Jun Liu, Kai Mei, Dongtang Ma, and Jibo Wei are with the College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China (E-mail: {liujun15, meikai11, dongtangma, wjbhw}@nudt.edu.cn).

Compared with traditional physical layer communication systems, the above-mentioned DNN-based techniques show competitive performance. However, what has been missing is an understanding of the dynamics behind the DNN in physical layer communication.

In this paper, we first attempt to give a mathematical explanation that reveals the mechanism of end-to-end DNN-based communication systems. Then, we try to unveil the role of DNNs in the tasks of CSI recovery, channel estimation and symbol detection. We believe that we have developed a concise way to open, as well as to understand, the "black box" of DNNs in physical layer communication. To summarize, the main contributions of this paper are twofold:

• Instead of proposing a scheme combining a DNN with a typical communication system, we analyse the behaviour of a DNN-based communication system from the perspectives of the whole DNN (communication system), the encoder (transmitter) and the decoder (receiver). Our simulation results verify that the constellations produced by autoencoders are equivalent to the (locally) optimum constellations obtained by the gradient-search algorithm, which minimize the asymptotic probability of error in Gaussian noise under an average power constraint.

• We consider the tasks of CSI recovery, channel estimation and symbol detection as a typical inference problem. The information flow in the DNNs for these tasks is estimated by using the matrix-based functional of Renyi's $\alpha$-entropy to approximate Shannon's entropy.

The remainder of this paper is organized as follows. In Section II, we give the system model and formulate the problem. Next, simulation results are presented in Section III. Finally, conclusions are drawn in Section IV.

Notations: We use boldface lowercase letters $\mathbf{x}$, capital letters $\mathbf{X}$ and calligraphic letters $\mathcal{X}$ to denote column vectors, matrices and sets, respectively. In addition, $\odot$ and $\mathbb{E}\{\cdot\}$ denote the Hadamard product and the expectation operation, respectively.

arXiv:2106.01124v2 [eess.SP] 6 Jun 2021


[Fig. 1: block diagram. Upper part: Information Source → Transmitter → Wireless Channel → Receiver → Destination. Lower part: Autoencoder with Encoder $f_{\boldsymbol{\Theta}_f}$, channel PDF $p(\mathbf{v}|\mathbf{z})$, and Decoder $g_{\boldsymbol{\Theta}_g}$.]

Fig. 1. Schematic diagram of a general communication system and its corresponding autoencoder representation.

II. SYSTEM MODEL AND PROBLEM FORMULATION

In this section, we first describe the considered system model and then provide a detailed explanation of the problem formulation from three different perspectives.

A. System Model

As shown in the upper part of Fig. 1, let us consider the process of message transmission from the perspectives of a typical communication system and an autoencoder, respectively. We assume that an information source generates a sequence of $k$-bit message symbols $s \in \{1, 2, \cdots, M\}$ to be communicated to the destination, where $M = 2^k$. The modulation modules inside the transmitter then map each symbol $s$ to a signal $\mathbf{x} \in \mathbb{R}^N$, where $N$ denotes the dimension of the signal space. The signal alphabet is denoted by $\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_M$. During channel transmission, the $N$-dimensional signal $\mathbf{x}$ is corrupted to $\mathbf{y} \in \mathbb{R}^N$ with conditional probability density function (PDF) $p(\mathbf{y}|\mathbf{x}) = \prod_{n=1}^{N} p(y_n|x_n)$. In this paper, we use $N/2$ band-pass channels, each with separately modulated in-phase and quadrature components, to transmit the $N$-dimensional signal [8]. Finally, the received signal is mapped by the demodulation module inside the receiver to $\hat{s}$, an estimate of the transmitted symbol $s$. The procedures mentioned above have been exhaustively presented by Shannon.
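To make the system model concrete, the following sketch (an illustrative NumPy snippet, not the authors' code; the random alphabet, noise level and sample size are assumptions) simulates mapping symbols to an $N$-dimensional signal alphabet, corrupting each dimension independently with AWGN according to $p(\mathbf{y}|\mathbf{x}) = \prod_n p(y_n|x_n)$, and detecting by minimum distance:

```python
import numpy as np

rng = np.random.default_rng(0)

M, N = 16, 2          # alphabet size M = 2^k, signal-space dimension N
N0 = 0.05             # illustrative noise level (assumption)

# A hypothetical signal alphabet: M random points, power-normalized so the
# average energy per symbol is P_av = 1/M, as the paper later assumes.
alphabet = rng.standard_normal((M, N))
alphabet *= np.sqrt((1.0 / M) / np.mean(np.sum(alphabet**2, axis=1)))

def transmit(symbols, alphabet, N0, rng):
    """Map symbols s in {0,...,M-1} to signals x and corrupt them with AWGN:
    y_n = x_n + w_n with w_n ~ N(0, N0) independently per dimension."""
    x = alphabet[symbols]
    y = x + rng.normal(scale=np.sqrt(N0), size=x.shape)
    return x, y

def detect(y, alphabet):
    """Minimum-distance detection (maximum likelihood under AWGN)."""
    d2 = np.sum((y[:, None, :] - alphabet[None, :, :])**2, axis=2)
    return np.argmin(d2, axis=1)

s = rng.integers(0, M, size=10000)
x, y = transmit(s, alphabet, N0, rng)
s_hat = detect(y, alphabet)
ser = np.mean(s_hat != s)   # empirical symbol error probability
```

Minimum-distance detection is the maximum-likelihood rule for this memoryless Gaussian channel, which is why it serves as the natural baseline for the learned mappings discussed below.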

From the point of view of filtering or signal inference, the idea of an autoencoder-based communication system matches Norbert Wiener's perspective [9]. As shown in the lower part of Fig. 1, the autoencoder consists of an encoder and a decoder, each of which is a feedforward neural network (NN) with parameters (weights) $\boldsymbol{\Theta}_f$ and $\boldsymbol{\Theta}_g$, respectively. Note that each symbol $s$ from the information source usually needs to be encoded as a one-hot vector $\mathbf{s} \in \mathbb{R}^M$ before being fed into the encoder. Under a given constraint (e.g., an average signal power constraint), the PDF of a wireless channel and a loss function that minimizes the symbol error probability, the encoder and decoder learn, respectively, to appropriately represent $\mathbf{s}$ as $\mathbf{z} = f_{\boldsymbol{\Theta}_f}(\mathbf{s})$ and to map the corrupted signal $\mathbf{v}$ to an estimate of the transmitted symbol $\hat{\mathbf{s}} = g_{\boldsymbol{\Theta}_g}(\mathbf{v})$, where $\mathbf{z}, \mathbf{v} \in \mathbb{R}^N$. Here, we use $\mathbf{z}_1, \mathbf{z}_2, \cdots, \mathbf{z}_M$ to denote the transmitted signals from the encoder, in order to distinguish them from the transmitted signals from the transmitter.

B. Understanding Autoencoders on Message Transmission

From the perspective of the whole autoencoder (communication system), the aim is to transmit information to the destination with a low error probability. The symbol error probability, i.e., the probability that the wireless channel has shifted a signal point into another signal's decision region, is

$$P_e = \frac{1}{M}\sum_{m=1}^{M} \Pr\left(\hat{\mathbf{s}} \neq \mathbf{s}_m \mid \mathbf{s}_m \text{ transmitted}\right). \tag{1}$$

The autoencoder can use the cross-entropy loss function

$$\mathcal{L}_{\log}\left(\mathbf{s}, \hat{\mathbf{s}}; \boldsymbol{\Theta}_f, \boldsymbol{\Theta}_g\right) = -\frac{1}{B}\sum_{b=1}^{B}\sum_{i=1}^{M} \mathbf{s}^{(b)}[i] \log\left(\hat{\mathbf{s}}^{(b)}[i]\right) = -\frac{1}{B}\sum_{b=1}^{B} \log\left(\hat{\mathbf{s}}^{(b)}[s]\right) \tag{2}$$

to represent the price paid for inaccuracy of prediction, where $\mathbf{s}^{(b)}[i]$ denotes the $i$th element of the $b$th symbol in a training set with $B$ symbols. To train the autoencoder to minimize the symbol error probability, the optimal parameters can be found by optimizing the loss function

$$\left(\boldsymbol{\Theta}_f^*, \boldsymbol{\Theta}_g^*\right) = \operatorname*{arg\,min}_{\left(\boldsymbol{\Theta}_f, \boldsymbol{\Theta}_g\right)} \left[\mathcal{L}_{\log}\left(\mathbf{s}, \hat{\mathbf{s}}; \boldsymbol{\Theta}_f, \boldsymbol{\Theta}_g\right)\right] \quad \text{subject to } \mathbb{E}\left[\|\mathbf{z}\|_2^2\right] \leq P_{\mathrm{av}} \tag{3}$$

where $P_{\mathrm{av}}$ denotes the average power. In this paper, we set $P_{\mathrm{av}} = 1/M$. A natural question is what the mappings $\mathbf{z} = f_{\boldsymbol{\Theta}_f}(\mathbf{s})$ look like after training is done.
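As a minimal sketch of this training objective (illustrative, not the authors' implementation; the uniform decoder and batch size are assumptions), the snippet below implements the cross-entropy loss of eq. (2) and a batch renormalization that enforces the average power constraint of eq. (3) with equality:

```python
import numpy as np

def cross_entropy_loss(s_onehot, s_hat_prob):
    """Eq. (2): batch-mean of -log of the probability assigned to the
    transmitted symbol. s_onehot, s_hat_prob: (B, M) arrays."""
    B = s_onehot.shape[0]
    return -np.sum(s_onehot * np.log(s_hat_prob)) / B

def normalize_power(z, P_av):
    """Rescale encoder outputs so the batch-average power E[||z||^2]
    meets the constraint of eq. (3) with equality."""
    scale = np.sqrt(P_av / np.mean(np.sum(z**2, axis=1)))
    return z * scale

rng = np.random.default_rng(1)
M, N, B = 16, 2, 8
P_av = 1.0 / M                       # the paper sets P_av = 1/M

s = rng.integers(0, M, size=B)
s_onehot = np.eye(M)[s]              # one-hot encoding of the symbols
z = normalize_power(rng.standard_normal((B, N)), P_av)

# A uniform (untrained) decoder assigns probability 1/M to every symbol,
# so the loss equals log(M); training drives the loss below this value.
uniform = np.full((B, M), 1.0 / M)
loss = cross_entropy_loss(s_onehot, uniform)
```

In a full autoencoder, `normalize_power` would sit after the encoder's last dense layer so that gradient updates can never violate the power constraint.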

C. Encoder: Finding a Good Representation

Let us pay attention to the encoder (transmitter). In the domain of communication, an encoder needs to learn a robust representation $\mathbf{z} = f_{\boldsymbol{\Theta}_f}(\mathbf{s})$ to transmit $\mathbf{s}$ against channel disturbances such as thermal noise, channel fading, nonlinear distortion and phase jitter. This is equivalent to finding a coded (or uncoded) modulation scheme with a signal set of size $M$ that maps a symbol $\mathbf{s}$ to a point $\mathbf{z}$ for a given transmitted power, maximizing the minimum distance between any two constellation points. The problem of finding good signal constellations for a Gaussian channel¹ is usually associated with the search for lattices with high packing density, an old and well-studied problem in the mathematical literature [11].

We use the algorithm proposed in [12] to obtain the optimum constellations. Consider a zero-mean stationary additive white Gaussian noise (AWGN) channel with one-sided spectral density $2N_0$. For large signal-to-noise ratio (SNR), the asymptotic approximation of (1) can be written as

$$P_e \sim \exp\left(-\frac{1}{8N_0}\min_{i \neq j}\left\|\mathbf{z}_i - \mathbf{z}_j\right\|_2^2\right). \tag{4}$$

¹The problem of constellation optimization is usually considered under the condition of a Gaussian channel. Although the problem under the condition of a Rayleigh fading channel has been studied in [10], its prerequisite is that the side information is perfectly known.


To minimize $P_e$, the problem can be formulated as

$$\left\{\mathbf{z}_m^*\right\}_{m=1}^{M} = \operatorname*{arg\,min}_{\{\mathbf{z}_m\}_{m=1}^{M}} \left(P_e\right) \quad \text{subject to } \mathbb{E}\left[\|\mathbf{z}\|_2^2\right] \leq P_{\mathrm{av}} \tag{5}$$

where $\{\mathbf{z}_m^*\}_{m=1}^{M}$ denotes the optimal signal set. This optimization problem can be solved by using a constrained gradient-search algorithm. We arrange $\{\mathbf{z}_m\}_{m=1}^{M}$ into an $M \times N$ matrix

$$\mathbf{Z} = \left[\mathbf{z}_1, \mathbf{z}_2, \cdots, \mathbf{z}_M\right]^T. \tag{6}$$

Then, the $s$th step of the constrained gradient-search algorithm can be described by

$$\mathbf{Z}'_{s+1} = \mathbf{Z}_s - \eta_s \nabla P_e\left(\mathbf{Z}_s\right) \tag{7a}$$

$$\mathbf{Z}_{s+1} = \frac{\mathbf{Z}'_{s+1}}{\sqrt{\sum_i \sum_j \left(\mathbf{Z}'_{s+1}[i,j]\right)^2}} \tag{7b}$$

where $\eta_s$ denotes the step size and $\nabla P_e(\mathbf{Z}_s) \in \mathbb{R}^{M \times N}$ denotes the gradient of $P_e$ with respect to the current constellation points. It can be written as

$$\nabla P_e\left(\mathbf{Z}_s\right) = \left[\mathbf{g}_1, \mathbf{g}_2, \cdots, \mathbf{g}_M\right]^T \tag{8}$$

where

$$\mathbf{g}_m \sim -\sum_{i \neq m} \exp\left(-\frac{\|\mathbf{z}_m - \mathbf{z}_i\|_2^2}{8N_0}\right) \left(\frac{1}{\|\mathbf{z}_m - \mathbf{z}_i\|_2^2} + \frac{1}{4N_0}\right) \mathbf{1}_{\mathbf{z}_m - \mathbf{z}_i}. \tag{9}$$

The vector $\mathbf{1}_{\mathbf{z}_m - \mathbf{z}_i}$ denotes the $N$-dimensional unit vector in the direction of $\mathbf{z}_m - \mathbf{z}_i$.
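The constrained gradient search of eqs. (7a)-(9) can be sketched as follows (an illustrative implementation, not the code of [12]; the noise level, step size and iteration count are assumptions, and the normalization in (7b) is taken as rescaling the total constellation energy to a constant so that the average power constraint holds):

```python
import numpy as np

def grad_Pe(Z, N0):
    """Gradient direction of P_e w.r.t. the constellation, following eq. (9):
    each pairwise term pushes z_m away from its neighbours z_i, weighted by
    the Gaussian pairwise-error factor exp(-||z_m - z_i||^2 / (8 N0))."""
    M, _ = Z.shape
    G = np.zeros_like(Z)
    for m in range(M):
        for i in range(M):
            if i == m:
                continue
            d = Z[m] - Z[i]
            dist2 = np.dot(d, d)
            unit = d / np.sqrt(dist2)              # the unit vector 1_{z_m - z_i}
            G[m] -= np.exp(-dist2 / (8 * N0)) * (1.0 / dist2 + 1.0 / (4 * N0)) * unit
    return G

def gradient_search(Z0, N0, eta, steps):
    """Eqs. (7a)-(7b): a descent step followed by renormalization of the total
    constellation energy to 1, i.e. average power P_av = 1/M."""
    Z = Z0.copy()
    for _ in range(steps):
        Z = Z - eta * grad_Pe(Z, N0)
        Z = Z / np.sqrt(np.sum(Z**2))
    return Z

rng = np.random.default_rng(2)
M, N = 8, 2
Z0 = rng.standard_normal((M, N))
Z0 /= np.sqrt(np.sum(Z0**2))                       # start on the constraint surface
Z = gradient_search(Z0, N0=0.05, eta=2e-4, steps=200)

def min_dist(Z):
    """Minimum pairwise distance, the quantity (4) rewards maximizing."""
    d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    return d[~np.eye(len(Z), dtype=bool)].min()
```

Because each term in (9) repels a point from its nearest neighbours, iterating (7a)-(7b) tends to spread the points into the locally optimal lattice-like constellations shown in Fig. 3(a).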

Comparing (3) with (5), the mechanism of the encoder in an autoencoder-based communication system is unveiled. The mapping function of the encoder can be represented as

$$\left\{f_{\boldsymbol{\Theta}_f^*}\left(\mathbf{s}_m\right)\right\}_{m=1}^{M} \rightarrow \left\{\mathbf{z}_m^*\right\}_{m=1}^{M} \tag{10}$$

when the PDF used for generating training samples is a multivariate zero-mean normal distribution $\mathbf{v} - \mathbf{z} \sim \mathcal{N}_N(\mathbf{0}, \boldsymbol{\Sigma})$, where $\mathbf{0}$ denotes the $N$-dimensional zero vector and $\boldsymbol{\Sigma} = (2N_0/N)\mathbf{I}$ is an $N \times N$ diagonal matrix.

D. Decoder: Inference

Finally, it is time to zoom into the lower right corner of Fig. 1 to investigate what happens inside the decoder (receiver). As shown in Fig. 2(a), the tasks of DNN-based CSI recovery, channel estimation and symbol detection can all be formulated as an inference problem. For convenience, we denote the target output of the decoder as $\mathbf{z}$ instead of $\mathbf{s}$, since we can assume $\mathbf{z} = f_{\boldsymbol{\Theta}_f}(\mathbf{s})$ is a bijection. If the decoder is symmetric, it can also be seen as a sub-autoencoder consisting of a sub-encoder and a sub-decoder, whose bottleneck (or middlemost) layer code is denoted as $\mathbf{u}$. Here we use $\mathbf{z}$ to denote the CSI or transmitted symbol that we wish to predict. The decoder infers a prediction $\hat{\mathbf{z}} = g_{\boldsymbol{\Theta}_g}(\mathbf{v})$ from its corresponding measurable variable $\mathbf{v}$.

[Fig. 2: (a) inference model: $\mathbf{z}$ → channel PDF $p(\mathbf{v}|\mathbf{z})$ → $\mathbf{v}$ → decoder $g_{\boldsymbol{\Theta}_g}$ → $\hat{\mathbf{z}}$, trained from data, with bottleneck code $\mathbf{u}$; (b) chain of conditional distributions $p(\mathbf{t}_1|\mathbf{v}), p(\mathbf{t}_2|\mathbf{t}_1), \cdots, p(\hat{\mathbf{z}}|\mathbf{t}'_1)$ through the hidden representations $\mathbf{t}_i$ and $\mathbf{t}'_i$.]

Fig. 2. (a) An inference model with a DNN decoder of $(2S-1)$ hidden layers for learning. (b) The graph representation of the decoder with $(S-1)$ hidden layers in both the sub-encoder and the sub-decoder. The solid arrows denote the direction of feedforward propagation of the input and the dashed arrows denote the direction of information flow in the error back-propagation phase.

If the joint distribution $p(\mathbf{v}, \mathbf{z})$ is known, the expected (population) risk $C_{p(\mathbf{v},\mathbf{z})}\left(g_{\boldsymbol{\Theta}_g}, \mathcal{L}_{\log}\right)$ can be written as

$$\begin{aligned}
\mathbb{E}\left[\mathcal{L}_{\log}\left(\mathbf{z}, \hat{\mathbf{z}}; \boldsymbol{\Theta}_g\right)\right] &= \sum_{\mathbf{v} \in \mathcal{V}, \mathbf{z} \in \mathcal{Z}} p(\mathbf{v}, \mathbf{z}) \log\left(\frac{1}{Q(\mathbf{z}|\mathbf{v})}\right) \\
&= \sum_{\mathbf{v} \in \mathcal{V}, \mathbf{z} \in \mathcal{Z}} p(\mathbf{v}, \mathbf{z}) \log\left(\frac{1}{p(\mathbf{z}|\mathbf{v})}\right) + \sum_{\mathbf{v} \in \mathcal{V}, \mathbf{z} \in \mathcal{Z}} p(\mathbf{v}, \mathbf{z}) \log\left(\frac{p(\mathbf{z}|\mathbf{v})}{Q(\mathbf{z}|\mathbf{v})}\right) \\
&= H(\mathbf{z}|\mathbf{v}) + D_{\mathrm{KL}}\left(p(\mathbf{z}|\mathbf{v}) \,\|\, Q(\mathbf{z}|\mathbf{v})\right) \\
&\geq H(\mathbf{z}|\mathbf{v})
\end{aligned} \tag{11}$$

where $Q(\cdot|\mathbf{v}) = g_{\boldsymbol{\Theta}_g}(\mathbf{v}) \in p(\mathcal{Z})$ and $D_{\mathrm{KL}}\left(p(\mathbf{z}|\mathbf{v}) \| Q(\mathbf{z}|\mathbf{v})\right)$ denotes the Kullback-Leibler divergence between $p(\mathbf{z}|\mathbf{v})$ and $Q(\mathbf{z}|\mathbf{v})$ [13]². If and only if the decoder is given by the conditional posterior $g_{\boldsymbol{\Theta}_g}(\mathbf{v}) = p(\mathbf{z}|\mathbf{v})$ does the expected (population) risk reach its minimum $\min_{g_{\boldsymbol{\Theta}_g}} C_{p(\mathbf{v},\mathbf{z})}\left(g_{\boldsymbol{\Theta}_g}, \mathcal{L}_{\log}\right) = H(\mathbf{z}|\mathbf{v})$.

In physical layer communication, instead of perfectly knowing the channel-related joint distribution $p(\mathbf{v}, \mathbf{z})$, we only have a set of $B$ i.i.d. samples $\mathcal{D}_B := \left\{\left(\mathbf{v}^{(b)}, \mathbf{z}^{(b)}\right)\right\}_{b=1}^{B}$ from $p(\mathbf{v}, \mathbf{z})$. In this case, the empirical risk is defined as

$$C_{p(\mathbf{v},\mathbf{z})}\left(g_{\boldsymbol{\Theta}_g}, \mathcal{L}, \mathcal{D}_B\right) = \frac{1}{B}\sum_{b=1}^{B} \mathcal{L}\left[\mathbf{z}_b, g_{\boldsymbol{\Theta}_g}\left(\mathbf{v}_b\right)\right]. \tag{12}$$
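A toy example may help connect these two risks. In the sketch below (a hypothetical binary channel, not from the paper; the flip probability and sample sizes are assumptions), the decoder is the true posterior $p(z|v)$, so by eq. (11) its expected risk is exactly $H(z|v)$, and the empirical risk of eq. (12) converges to that value as $B$ grows:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical discrete channel: z is a fair bit, v flips z with probability eps.
eps = 0.1

def H_z_given_v(eps):
    """Minimum expected risk in eq. (11): the conditional entropy H(z|v),
    in nats; for this symmetric channel it is the binary entropy of eps."""
    return -(eps * np.log(eps) + (1 - eps) * np.log(1 - eps))

def empirical_risk(B, eps, rng):
    """Eq. (12) for the ideal decoder g(v) = p(z|v): the mean log-loss
    over B i.i.d. samples drawn from p(v, z)."""
    z = rng.integers(0, 2, size=B)
    flip = rng.random(B) < eps
    v = np.where(flip, 1 - z, z)
    # The posterior assigns probability 1-eps to z == v and eps to z != v.
    p_correct = np.where(v == z, 1 - eps, eps)
    return -np.mean(np.log(p_correct))

# Small vs. large sample: the generalization gap of eq. (13) shrinks with B.
risks = [empirical_risk(B, eps, rng) for B in (10, 100_000)]
```

For a learned decoder the empirical risk additionally exceeds $H(z|v)$ by the KL term in (11); this sketch isolates only the finite-sample fluctuation that eq. (13) measures.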

In practice, the set $\mathcal{D}_B$ drawn from $p(\mathbf{v}, \mathbf{z})$ is finite. This leads to a difference between the empirical and the expected (population) risks, which can be defined as

$$\mathrm{gen}_{p(\mathbf{v},\mathbf{z})}\left(g_{\boldsymbol{\Theta}_g}, \mathcal{L}, \mathcal{D}_B\right) = C_{p(\mathbf{v},\mathbf{z})}\left(g_{\boldsymbol{\Theta}_g}, \mathcal{L}_{\log}\right) - C_{p(\mathbf{v},\mathbf{z})}\left(g_{\boldsymbol{\Theta}_g}, \mathcal{L}, \mathcal{D}_B\right). \tag{13}$$

We can now preliminarily conclude that the DNN-based receiver is an estimator with minimum empirical risk for a given set $\mathcal{D}_B$, whereas its performance is inferior to the optimal estimator with minimum expected (population) risk under a given joint distribution $p(\mathbf{v}, \mathbf{z})$.

²If $X$ is a continuous random variable, the sum becomes an integral when its PDF exists.

[Fig. 3: two rows of constellation plots, panels $M=8, N=2$; $M=16, N=2$; $M=16, N=3$ in each row.]

Fig. 3. Comparisons of (a) optimum constellations obtained by the gradient-search technique and (b) constellations produced by autoencoders.

Furthermore, it is necessary to quantitatively understand how information flows inside the decoder. Fig. 2(b) shows the graph representation of the decoder, where $\mathbf{t}_i$ and $\mathbf{t}'_i$ $(1 \leq i \leq S)$ denote the $i$th hidden layer representations counting from the input layer and the output layer, respectively. Here, we use the method proposed in [14] to illustrate layer-wise mutual information with three kinds of information planes (IPs), where Shannon's entropy is estimated by the matrix-based functional of Renyi's $\alpha$-entropy [15]. Details are given in the Appendix.

III. SIMULATION RESULTS

In this section, we provide simulation results to illustrate the behaviour of DNNs in physical layer communication.

A. Constellation Comparison

Fig. 3(a) shows the optimum constellations obtained by the gradient-search technique proposed in [12]. For $N = 2$ and $N = 3$, the algorithm was run for 1000 and 3000 steps, respectively, with step size $\eta = 2 \times 10^{-4}$. Fig. 3(b) shows the constellations produced by autoencoders which have the same network structures and hyperparameters as the autoencoders in [2]. The autoencoders were trained for $10^6$ epochs, each of which contains $M$ different symbols.

When $N = 2$, the two-dimensional constellations produced by the autoencoders have a pattern similar to the optimum constellations, which form an (almost) hexagonal lattice. Specifically, in the case of $(M = 8, N = 2)$, one of the constellations found by the autoencoder can be obtained by rotating the optimum constellation found by the gradient-search technique. In the case of $(M = 16, N = 2)$, the constellation found by the autoencoder differs from the optimum constellation but still forms a lattice of (almost) equilateral triangles. In the case of $(M = 16, N = 3)$, one signal point of the optimum constellation is almost at the origin while the other 15 signal points lie almost on the surface of a sphere with radius $P_{\mathrm{av}}$ and centre $\mathbf{0}$. This pattern is similar to the surface of a truncated icosahedron, which is composed of pentagonal and hexagonal faces. However, the three-dimensional constellation produced by an autoencoder is a local optimum formed by 16 signal points lying almost in a plane.

From the perspective of computational complexity, the cost of training an autoencoder is significantly higher than that of the traditional algorithm. Specifically, an autoencoder which has 4 dense layers with $M$, $N$, $M$ and $M$ neurons, respectively, needs to train $(2M+1)(M+N) + 2M$ parameters for $10^6$ epochs, whereas the gradient-search algorithm only needs $2M$ trainable parameters for $10^3$ steps.
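The parameter count quoted above can be checked mechanically. The sketch below (illustrative; it assumes fully-connected layers with biases and a one-hot input of length $M$) reproduces the $(2M+1)(M+N)+2M$ figure:

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully-connected layer."""
    return n_in * n_out + n_out

def autoencoder_params(M, N):
    """Parameter count of a 4-dense-layer autoencoder with M, N, M and M
    neurons, fed with a one-hot vector of length M."""
    sizes = [M, M, N, M, M]          # input width followed by the 4 layers
    return sum(dense_params(a, b) for a, b in zip(sizes[:-1], sizes[1:]))
```

Summing $(M \cdot M + M) + (M \cdot N + N) + (N \cdot M + M) + (M \cdot M + M)$ gives $2M^2 + 2MN + 3M + N$, which equals the closed form in the text.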

B. Information Flow

We consider a common channel estimation problem for an OFDM system with $N$ subcarriers. Let $\mathbf{z} \triangleq \left[H[0], H[1], \cdots, H[N-1]\right]^T$ denote the frequency impulse response (FIR) vector of a channel. For convenience, we denote the measurable variable as $\mathbf{v} \triangleq \mathbf{z}_{\mathrm{LS}}$, where $\mathbf{z}_{\mathrm{LS}}$ represents the least-squares (LS) estimate of $\mathbf{z}$, which is usually obtained by training symbol-based channel estimation. In this paper, we use linear interpolation and $N_p = N/4 = 16$ pilots.
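The LS-plus-interpolation front end described above can be sketched as follows (an illustrative snippet, not the authors' code; the tap count, pilot symbols and SNR are assumptions, and subcarriers beyond the last pilot are held constant by the interpolator):

```python
import numpy as np

rng = np.random.default_rng(4)

N, Np = 64, 16                       # subcarriers and pilots (Np = N/4)
pilot_idx = np.arange(0, N, N // Np) # comb-type pilot placement (assumption)

# A hypothetical frequency response: a few random time-domain taps.
h = (rng.standard_normal(4) + 1j * rng.standard_normal(4)) / np.sqrt(8)
z = np.fft.fft(h, N)                 # true vector z = [H[0], ..., H[N-1]]

# Pilot transmission: known unit pilots plus complex AWGN at the receiver.
snr_db = 15
noise_std = np.sqrt(10 ** (-snr_db / 10) / 2)
pilots = np.ones(Np)                 # assumed known pilot symbols
rx = z[pilot_idx] * pilots + noise_std * (
    rng.standard_normal(Np) + 1j * rng.standard_normal(Np))

# LS estimate at the pilot positions, then linear interpolation to all
# subcarriers, applied separately to real and imaginary parts.
z_ls_pilots = rx / pilots
z_ls = (np.interp(np.arange(N), pilot_idx, z_ls_pilots.real)
        + 1j * np.interp(np.arange(N), pilot_idx, z_ls_pilots.imag))

mse = np.mean(np.abs(z_ls - z) ** 2)
```

The pair $(\mathbf{v}, \mathbf{z}) = (\mathtt{z\_ls}, \mathtt{z})$ generated this way is exactly the kind of sample pair the DNN estimator is trained on.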

According to (11), the minimum logarithmic expected (population) risk for this inference problem is $H(\mathbf{z}|\mathbf{z}_{\mathrm{LS}})$, which can be estimated by Renyi's $\alpha$-entropy as $S_\alpha(\mathbf{z}|\mathbf{z}_{\mathrm{LS}}) = S_\alpha(\mathbf{z}, \mathbf{z}_{\mathrm{LS}}) - S_\alpha(\mathbf{z}_{\mathrm{LS}})$ with $\alpha = 1.01$. Fig. 4 illustrates the entropy $S_\alpha(\mathbf{z}|\mathbf{z}_{\mathrm{LS}})$ with respect to different values of SNR and $N$. As can be seen, $S_\alpha(\mathbf{z}|\mathbf{z}_{\mathrm{LS}})$ monotonically decreases as the size of the training set increases. When $B \to \infty$, $S_\alpha(\mathbf{z}|\mathbf{z}_{\mathrm{LS}})$ decreases slowly, because the joint distribution $p(\mathbf{z}, \mathbf{z}_{\mathrm{LS}})$ can then be learned almost perfectly and the empirical risk approaches the expected risk. Interestingly, when $B > 580$, the lower the SNR or the larger the input dimension $N$, the smaller the $B$ needed to obtain the same value of $S_\alpha(\mathbf{z}|\mathbf{z}_{\mathrm{LS}})$.

Fig. 5(a), (b) and (c) illustrate the behaviour of IP-I, IP-II and IP-III in a DNN-based OFDM channel estimator with topology "128−64−32−16−8−16−32−64−128", where a linear activation function is considered and each training sample is constructed by concatenating the real and imaginary parts of the complex channel vector. The batch size is 100 and the learning rate $\eta = 0.001$. We use $V$ and $V'$ to denote the input and output of the decoder, respectively. The number of iterations is indicated by a colour bar. From IP-I, it can be seen that the final value of the mutual information $I(T;V')$ in each layer tends to equal the final value of $I(T;V)$, which means that the information from $V$ has been learnt and transferred to $V'$ by each layer. In IP-II, $I(T';V') < I(T;V)$ in each layer, which implies that none of the layers is overfitting. The tendency of $I(T;V)$ to approach the value of $I(T';V)$ can be observed from IP-III. Finally, from all the IPs, it is easy to notice that the mutual information does not change significantly once the number of iterations exceeds 200. Meanwhile, according to Fig. 5(d), the MSE reaches a very low value and likewise does not change sharply. This means that 200 iterations are enough for the task of 64-subcarrier channel estimation using a DNN with the above-mentioned topology.

[Fig. 5: panels (a) IP-I, (b) IP-II, (c) IP-III showing layer-wise mutual information trajectories for layers $T_1$–$T_4$ over 0–900 iterations (colour bar), and (d) the MSE loss curve over iterations.]

Fig. 5. The three IPs and the loss curve in a DNN-based channel estimator.

IV. CONCLUSION

In this paper, we have proposed a framework for understanding the behaviour of DNNs in physical layer communication. We find that a DNN-based transmitter essentially tries to produce a good representation of the information source. We then quantitatively analyse the information flow in a DNN-based communication system. We believe that this framework has potential for the design of DNN-based physical layer communication.

APPENDIX A
MATRIX-BASED FUNCTIONAL OF RENYI'S $\alpha$-ENTROPY

For a random variable $X$ in a finite set $\mathcal{X}$, its Renyi's entropy of order $\alpha$ is defined as

$$H_\alpha(X) = \frac{1}{1-\alpha} \log \int_{\mathcal{X}} f^\alpha(x)\, dx \tag{14}$$

where $f(x)$ is the PDF of the random variable $X$. Let $\left\{x^{(b)}\right\}_{b=1}^{B}$ be an i.i.d. sample of $B$ realizations from the random variable $X$ with PDF $f(x)$. The Gram matrix $\mathbf{K}$ can be defined as $\mathbf{K}[i,j] = \kappa\left(x_i, x_j\right)$, where $\kappa: \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$ is a real-valued positive definite and infinitely divisible kernel. Then, a matrix-based analogue to Renyi's $\alpha$-entropy for a normalized positive definite matrix $\mathbf{A}$ of size $B \times B$ with trace 1 can be given by the functional

$$S_\alpha(\mathbf{A}) = \frac{1}{1-\alpha} \log_2\left[\sum_{b=1}^{B} \lambda_b(\mathbf{A})^\alpha\right] \tag{15}$$

where $\lambda_b(\mathbf{A})$ denotes the $b$th eigenvalue of $\mathbf{A}$, a normalized version of $\mathbf{K}$:

$$\mathbf{A}[i,j] = \frac{1}{B}\frac{\mathbf{K}[i,j]}{\sqrt{\mathbf{K}[i,i]\,\mathbf{K}[j,j]}}. \tag{16}$$

Now, the joint entropy can be defined as

$$S_\alpha(\mathbf{A}, \mathbf{B}) = S_\alpha\left[\frac{\mathbf{A} \odot \mathbf{B}}{\operatorname{tr}(\mathbf{A} \odot \mathbf{B})}\right]. \tag{17}$$

Finally, the matrix notion of Renyi's mutual information can be defined as

$$I_\alpha(\mathbf{A}; \mathbf{B}) = S_\alpha(\mathbf{A}) + S_\alpha(\mathbf{B}) - S_\alpha(\mathbf{A}, \mathbf{B}). \tag{18}$$
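The functional above can be sketched directly (an illustrative implementation; the Gaussian kernel, whose bandwidth $\sigma$ is an assumption here, is positive definite and infinitely divisible as required):

```python
import numpy as np

def gram(X, sigma=1.0):
    """Gram matrix K[i,j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) with a
    Gaussian kernel; X is a (B, d) array of samples."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))

def normalize(K):
    """Eq. (16): kernel normalization yielding a trace-one PSD matrix A."""
    B = K.shape[0]
    return K / (B * np.sqrt(np.outer(np.diag(K), np.diag(K))))

def S_alpha(A, alpha=1.01):
    """Eq. (15): matrix-based Renyi alpha-entropy from the eigenvalues of A."""
    lam = np.linalg.eigvalsh(A)
    lam = np.clip(lam, 1e-12, None)   # guard against tiny negative round-off
    return np.log2(np.sum(lam ** alpha)) / (1 - alpha)

def I_alpha(A, B_mat, alpha=1.01):
    """Eqs. (17)-(18): joint entropy via the Hadamard product, then MI."""
    AB = A * B_mat                    # Hadamard product
    S_joint = S_alpha(AB / np.trace(AB), alpha)
    return S_alpha(A, alpha) + S_alpha(B_mat, alpha) - S_joint

rng = np.random.default_rng(5)
X = rng.standard_normal((50, 2))
A = normalize(gram(X))
```

A sanity check: a trace-one matrix with $B$ equal eigenvalues $1/B$ (the maximally "spread" case) gives $S_\alpha = \log_2 B$ bits for any $\alpha$, matching the Shannon entropy of a uniform distribution.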

REFERENCES

[1] C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.

[2] T. O'Shea and J. Hoydis, "An introduction to deep learning for the physical layer," IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563–575, 2017.

[3] A. Felix, S. Cammerer, S. Dörner, J. Hoydis, and S. ten Brink, "OFDM-autoencoder for end-to-end learning of communications systems," in 2018 IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC). IEEE, 2018, pp. 1–5.

[4] C.-K. Wen, W.-T. Shih, and S. Jin, "Deep learning for massive MIMO CSI feedback," IEEE Wireless Communications Letters, vol. 7, no. 5, pp. 748–751, 2018.

[5] T. Wang, C.-K. Wen, S. Jin, and G. Y. Li, "Deep learning-based CSI feedback approach for time-varying massive MIMO channels," IEEE Wireless Communications Letters, vol. 8, no. 2, pp. 416–419, 2018.

[6] L. Li, H. Chen, H.-H. Chang, and L. Liu, "Deep residual learning meets OFDM channel estimation," IEEE Wireless Communications Letters, vol. 9, no. 5, pp. 615–618, 2019.

[7] J. Liu, K. Mei, X. Zhang, D. Ma, and J. Wei, "Online extreme learning machine-based channel estimation and equalization for OFDM systems," IEEE Communications Letters, vol. 23, no. 7, pp. 1276–1279, 2019.

[8] B. Sklar and P. K. Ray, Digital Communications: Fundamentals and Applications. Pearson Education, 2014.

[9] S. Yu, M. Emigh, E. Santana, and J. C. Príncipe, "Autoencoders trained with relevant information: blending Shannon and Wiener's perspectives," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 6115–6119.

[10] J. Boutros, E. Viterbo, C. Rastello, and J.-C. Belfiore, "Good lattice constellations for both Rayleigh fading and Gaussian channels," IEEE Transactions on Information Theory, vol. 42, no. 2, pp. 502–518, 1996.

[11] G. C. Jorge, A. A. de Andrade, S. I. Costa, and J. E. Strapasson, "Algebraic constructions of densest lattices," Journal of Algebra, vol. 429, pp. 218–235, 2015.

[12] G. Foschini, R. Gitlin, and S. Weinstein, "Optimization of two-dimensional signal constellations in the presence of Gaussian noise," IEEE Transactions on Communications, vol. 22, no. 1, pp. 28–38, 1974.

[13] A. Zaidi, I. Estella-Aguerri et al., "On the information bottleneck problems: models, connections, applications and information theoretic views," Entropy, vol. 22, no. 2, p. 151, 2020.

[14] S. Yu and J. C. Principe, "Understanding autoencoders with information theoretic concepts," Neural Networks, vol. 117, pp. 104–123, 2019.

[15] L. G. S. Giraldo, M. Rao, and J. C. Principe, "Measures of entropy from data using infinitely divisible kernels," IEEE Transactions on Information Theory, vol. 61, no. 1, pp. 535–548, 2014.