Analysis of the tradeoff between
compression ratio and security level in real-time voice
communication
By
Abdallah Attie
Supervised by:
Dr Ahmad Fadlallah
Dr Mohamad Raad
Defended on 09.07.2014 before a jury composed of:
Dr. Wafaa Abou Diab
Dr. Bassem Bakhash
Dr. Ahmad Fadlallah
Abstract
This project analyzes the tradeoff between security level and compression
ratio in real-time voice communication. The problem it addresses is that
combining variable bitrate compression with length-preserving encryption
makes the stream vulnerable to traffic analysis. The variation in packet
sizes can leak information about the conversation, ranging from language
identification to the detection of certain phrases and the reconstruction
of phonemes. The solution is to rely on constant bitrate compression or to
pad the sent frames to a multiple of 16, 32, or 64 bytes. Each padding
scheme brings a security gain in the form of increased immunity to the
described traffic analysis systems, and the security level increases with
the size of the encryption block.
This research project analyzes the impact of those padding schemes on the
bitrate of the VoIP stream. For this purpose we created a test bed that
simulates the compression, encryption, and sending/receiving of speech over
an RTP socket. The resulting bitrates are calculated with and without the
overhead of packetization. In conclusion, the resulting data give a clear
view of the tradeoff between three parameters: security level, bitrate, and
quality.
Contents
Abstract
List of Figures
List of Tables
List of References
Chapter I Introduction
    Compression
    Types of Speech Coders
    Variable Bit-Rate Coding
    Speech Coding State of the Art
        Adaptive Multi Rate (AMR)
        Opus
        Speex
    Security
        Symmetric and Asymmetric Encryption
        Block and Stream Encryption
        Common Encryption Algorithms
    Report Structure
Chapter II Literature Review and Problem Formulation
    Traffic Analysis of Encrypted Voice Stream
        Information leakage via variable bit-rate
    Example of traffic analysis
    Mitigation Techniques
Chapter III Test-Bed
    Test-bed requirements
    Test-bed elements
        Speex Encoder
        AES Encryption
        RTP Sending/Receiving
        Dataset
    Test-bed overview
Chapter IV Results and Conclusion
    Narrow Band Results
    Wide Band Results
    Statistical Analysis
    Conclusion and future recommendations
List of Figures
Figure I-1: Block Diagram of the Opus Encoder
Figure II-1: Distribution of bit rates used to encode four phonemes with Speex
Figure II-2: Overview of training and detection process
Figure II-3: Robustness to padding
Figure IV-1: NB padding overhead (without packetization)
Figure IV-2: NB rate vs quality (without packetization)
Figure IV-3: NB rate vs quality (with packetization)
Figure IV-4: NB padding overhead (with packetization)
Figure IV-5: WB overhead (without packetization)
Figure IV-6: WB rate versus quality (without packetization)
Figure IV-7: Wide Band overhead (with packetization)
Figure IV-8: Wide Band rate versus quality (with packetization)
Figure IV-9: Stream Cipher 95% Confidence Interval
Figure IV-10: Stream and 128 bit padding Confidence Interval
Figure IV-11: Stream and 512 bit padding Confidence Interval
Figure IV-12: Stream and 256 bit padding Confidence Interval
List of Tables
Table I-1: Characteristics of Standardized Speech Coding Algorithms in Each of Four Broad Categories
Table I-2: Comparison Between the 3 Speech Encoders
Table III-1: Quality versus bitrate for Speex narrowband
Table III-2: Quality versus bitrate for Speex wideband
List of References
[1] M. Arjona Ramírez and M. Minami, "Low bit rate speech coding," in Wiley
Encyclopedia of Telecommunications, J. G. Proakis, Ed., New York: Wiley,
2003, vol. 3, pp. 1299-1308.
[2] P. Kroon, "Evaluation of speech coders," in Speech Coding and Synthesis,
W. Bastiaan Kleijn and K. K. Paliwal, Eds., Amsterdam: Elsevier Science,
1995, pp. 467-494.
[3] M. Hasegawa-Johnson and A. Alwan, "Speech Coding: Fundamentals and
Applications," in Wiley Encyclopedia of Telecommunications, J. G. Proakis,
Ed., New York: Wiley, 2003, vol. 3, pp. 1256-1265.
[4] Wiki.hydrogenaud.io, (2014). Variable Bitrate - Hydrogenaudio
Knowledgebase. [online] Available at:
http://wiki.hydrogenaud.io/index.php?title=VBR [Accessed 22 Jun. 2014].
[5] E. Ekudden et al., "The Adaptive Multi-Rate Speech Coder," Ericsson
Research.
[6] Tools.ietf.org, (2014). RFC 6716 - Definition of the Opus Audio Codec.
[online] Available at: http://tools.ietf.org/html/rfc6716#section-2.1.8
[Accessed 22 Jun. 2014].
[7] Speex.org, (2014). Introduction to CELP Coding. [online] Available at:
http://www.speex.org/docs/manual/speex-manual/node9.html [Accessed 24 Jun.
2014].
[8] C. V. Wright, L. Ballard, F. Monrose, and G. M. Masson, "Language
identification of encrypted VoIP traffic: Alejandra y Roberto or Alice and
Bob?" in Proceedings of the USENIX Security Symposium, 2007.
[9] C. V. Wright, L. Ballard, S. E. Coull, F. Monrose, and G. M. Masson,
"Spot me if you can: Uncovering spoken phrases in encrypted VoIP
conversations," in Proceedings of the IEEE Symposium on Security and
Privacy, 2008.
[10] Tools.ietf.org, (2014). RFC 6562 - Guidelines for the Use of Variable
Bit Rate Audio with Secure RTP. [online] Available at:
http://tools.ietf.org/html/rfc6562#section-2.1.8 [Accessed 30 Jun. 2014].
Chapter I Introduction
Security and performance are two important issues that concern any network
operator. This concern grows when dealing with real-time voice
communication. One reason is that performance directly affects the user
experience in such real-time applications. Furthermore, the content of voice
conversations is generally personal and consequently has stricter privacy
requirements than other applications (web browsing, for example).
Enhancing performance requires compressing the network stream, while
preserving privacy requires encrypting the exchanged data. Our project aims
at finding the best way of combining the two operations (compression and
encryption), since they do not get along by nature: compression removes
redundancy from the data, while encryption, particularly when padding is
applied, adds to it. Our research looks into the options available in both
the encryption and compression domains in order to find the best
combination using existing tools.
Compression
To meet performance requirements, one of the most important techniques is
compressing the data exchanged over the network. Because the data is a voice
signal, it has a structure that makes it compressible at high ratios with
little or no distortion. Speech coding has therefore long been an active
research area in which many approaches have been adopted, with different
perspectives and one goal: minimizing the required bandwidth while
preserving acceptable voice quality.
Speech coding is an application of data compression to digital audio signals
containing speech. It uses speech-specific parameter estimation, based on
audio signal processing techniques, to model the speech signal, combined
with generic data compression algorithms to represent the resulting model
parameters in a compact bit-stream. [1]
The techniques employed in speech coding are similar to those used in audio
data compression and audio coding, where knowledge of psychoacoustics is
used to transmit only the data that is relevant to the human auditory
system. For example, in voice-band speech coding, only information in the
frequency band 400 Hz to 3500 Hz is transmitted, but the reconstructed
signal is still adequate for intelligibility.
A sampling rate of 8 kHz is needed for narrowband coding. Wideband coding
covers the frequency band up to 7-8 kHz, which requires a sampling rate of
16 kHz.
Speech coding differs from other forms of audio coding in that
speech is a much simpler signal than
most other audio signals, and a lot more statistical information
is available about the properties of
speech. As a result, some auditory information which is relevant
in audio coding can be unnecessary
in the speech coding context. In speech coding, the most
important criterion is preservation of
intelligibility and "pleasantness" of speech, with a constrained
amount of transmitted data. [2]
Types of Speech Coders
There are different types of speech coders:
- Waveform coders attempt to code the exact shape of the speech signal
  waveform, without considering the nature of human speech production and
  speech perception. These coders are high-bit-rate coders (typically above
  16 kbps).
- Linear prediction coders (LPCs) assume that the speech signal is the
  output of a linear time-invariant (LTI) model of speech production. The
  transfer function of that model is assumed to be all-pole (autoregressive
  model). The excitation function is a quasi-periodic signal constructed
  from discrete pulses (1-8 per pitch period), pseudorandom noise, or some
  combination of the two. If the excitation is generated only at the
  receiver, based on a transmitted pitch period and voicing information,
  then the system is designated as an LPC voice coder (vocoder). LPC
  vocoders that provide extra information about the spectral shape of the
  excitation have been adopted as coder standards between 2.0 and 4.8 kbps.
- LPC-based analysis-by-synthesis coders (LPC-AS) choose an excitation
  function by explicitly testing a large set of candidate excitations and
  choosing the best. LPC-AS coders are used in most standards between 4.8
  and 16 kbps.
- Sub-band coders are frequency-domain coders that attempt to parameterize
  the speech signal in terms of spectral properties in different frequency
  bands. These coders are less widely used than LPC-based coders but have
  the advantage of being scalable, and they do not model the incoming signal
  as speech. Sub-band coders are widely used for high-quality audio coding.
Table I-1 summarizes the four types of speech coders discussed above. [3]

Speech Coder Class   Rates (kbps)   Complexity   Standardized Applications
Waveform coders      16-64          Low          Landline telephone
Sub-band coders      12-256         Medium       Teleconferencing, audio
LPC-AS               4.8-16         High         Digital cellular
LPC vocoder          2.0-4.8        High         Satellite telephony, military

Table I-1: Characteristics of Standardized Speech Coding Algorithms in Each of Four Broad Categories
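
To make the all-pole model behind LPC coders more concrete, the sketch below estimates linear prediction coefficients for one frame using the autocorrelation method and the Levinson-Durbin recursion. It is only an illustration: the function names, the synthetic frame, and the model order are assumptions and are not taken from any of the standardized coders in Table I-1.

```python
def autocorrelation(frame, max_lag):
    """Return r[0..max_lag], the autocorrelation of one speech frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Fit an all-pole (autoregressive) predictor to autocorrelation r.

    Returns (a, err) where the predicted sample is
        s_hat[n] = sum(a[k] * s[n - 1 - k] for k in range(order))
    and err is the energy of the prediction residual.
    """
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        if err <= 0.0:
            break  # silent or perfectly predictable frame: stop early
        # Reflection coefficient for this stage of the recursion.
        acc = r[i + 1] - sum(a[k] * r[i - k] for k in range(i))
        k_i = acc / err
        # Update the lower-order coefficients, then append the new one.
        new_a = a[:]
        new_a[i] = k_i
        for k in range(i):
            new_a[k] = a[k] - k_i * a[i - 1 - k]
        a = new_a
        err *= (1.0 - k_i * k_i)
    return a, err


# Synthetic frame (assumption): a decaying exponential, s[n] = 0.9 * s[n-1].
frame = [0.9 ** n for n in range(200)]
r = autocorrelation(frame, 2)
coeffs, residual = levinson_durbin(r, 2)
print(coeffs)    # first coefficient close to 0.9, second close to 0
print(residual)  # energy left for the excitation signal to represent
```

In an LPC coder the coefficients describe the spectral envelope, while the residual is what the excitation (pulses, noise, or codebook entries) has to reproduce.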
Variable Bit-Rate Coding
One important technique in speech coding is variable bitrate coding. The
main idea is that not all parts of a speech signal need the same bitrate to
be coded. In Variable Bitrate (VBR) coding, the user chooses the desired
quality level and/or a range of allowable bitrates. The encoder then tries
to maintain the selected quality throughout the stream by choosing the
optimal amount of data to represent each frame of audio. The main advantage
is that the user can specify the quality level and save as much space as
possible; the drawback is that the final file size is quite unpredictable.
Most modern encoders are able to perform VBR encoding, including (but not
limited to) nearly all popular MP3, AAC, (Ogg) Vorbis, Musepack, and WMA
encoders. [4]
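
As a rough illustration of what "choosing the amount of data per frame" means in bytes, the sketch below converts a per-frame bitrate into a payload size. The 20 ms frame duration and the sequence of bitrates are assumptions chosen for the example, not measurements from any encoder.

```python
import math

FRAME_MS = 20  # assumed frame duration in milliseconds


def frame_bytes(bitrate_bps, frame_ms=FRAME_MS):
    """Payload size in bytes of one frame coded at the given bitrate."""
    bits = bitrate_bps * frame_ms / 1000
    return math.ceil(bits / 8)


# Hypothetical bitrates (bits per second) picked frame by frame by a VBR encoder.
vbr_rates = [2150, 8000, 15000, 24600, 8000]
vbr_sizes = [frame_bytes(rate) for rate in vbr_rates]

# A CBR encoder uses the same bitrate for every frame.
cbr_sizes = [frame_bytes(11000)] * len(vbr_rates)

print("VBR frame sizes:", vbr_sizes)  # sizes change from frame to frame
print("CBR frame sizes:", cbr_sizes)  # every frame has the same size
```

The VBR sizes follow the content of the signal, whereas the CBR sizes are constant regardless of what is being said.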
Speech Coding State of the Art
The two most important applications of speech coding are mobile telephony
and Voice over IP. Consequently, the standards for speech compression are
published by telephony standardization bodies such as the International
Telecommunication Union (ITU) and 3GPP, and by the Internet Engineering
Task Force (IETF).
This section presents the most widely used encoders in the domain. Adaptive
Multi-Rate (AMR) is an encoder standardized by 3GPP for mobile telephony
and used in WCDMA networks; its wideband version is also published by the
ITU as G.722.2. Opus and Speex, on the other hand, are two sibling encoders
developed by Xiph.org and adopted by the IETF.
Adaptive Multi Rate (AMR)
The Adaptive Multi-Rate speech coder is based on Algebraic CELP (ACELP)
technology and is referred to as a Multi-Rate ACELP (MR-ACELP) coder. The
coder can operate at 8 different bit-rates, called coder modes. The frame
size is 20 milliseconds, with 4 sub-frames of 5 milliseconds each. A
look-ahead of 5 ms is used. The 12.2 kbit/s mode is equivalent to the GSM
EFR coder, while the 7.40 kbit/s mode is equivalent to the EFR coder for the
IS-136 system.
The AMR speech coder was developed to fulfill a challenging set of
performance requirements for clean speech, speech in background noise,
tandeming, and degraded channel conditions. The highest mode is the GSM EFR
coder, which provides speech quality comparable to fixed-line quality. The
lowest mode provides communication quality. The range of bit-rates and the
high quality provide flexibility to trade quality for capacity as well as to
optimize quality under changing channel conditions. The quality was shown to
be significantly higher than for existing speech services in GSM. [5]
Opus
The Opus codec is developed by Xiph.org and standardized by the IETF for
multimedia streaming and VoIP applications. Opus can handle a wide range of
audio applications, including Voice over IP, videoconferencing, in-game
chat, and even remote live music performances. It can scale from low
bit-rate narrowband speech to very high quality stereo music. Supported
features are [7]:
- Bit-rates from 6 kb/s to 510 kb/s
- Sampling rates from 8 kHz (narrowband) to 48 kHz (fullband)
- Frame sizes from 2.5 ms to 60 ms
- Support for both constant bit-rate (CBR) and variable bit-rate (VBR)
- Audio bandwidth from narrowband to full-band
- Support for speech and music
- Support for mono and stereo
- Support for up to 255 channels (multistream frames)
- Dynamically adjustable bitrate, audio bandwidth, and frame size
- Good loss robustness and packet loss concealment (PLC)
- Floating point and fixed-point implementation
The Opus codec is a real-time interactive audio codec. It is
composed of a layer based on Linear
Prediction [LPC] and a layer based on the Modified Discrete
Cosine Transform [MDCT]. The main
idea behind using two layers is as follows: in speech, linear
prediction techniques (such as Code-
Excited Linear Prediction, or CELP) code low frequencies more
efficiently than transform (e.g.,
MDCT) domain techniques, while the situation is reversed for
music and higher speech frequencies.
Thus, a codec with both layers available can operate over a
wider range than either one alone and can
achieve better quality by combining them than by using either
one individually. [6]
The Opus encoder consists of two main blocks: the SILK encoder
and the CELT encoder. However,
unlike the decoder, a valid (though potentially suboptimal) Opus
encoder is not required to support
all modes and may thus only include a SILK encoder module or a
CELT encoder module. The output
bit-stream of the Opus encoding contains bits from the SILK and
CELT encoders, though these are
not separable due to the use of a range coder. A block diagram
of the encoder is shown in Figure I-1. [6]
Figure I-1: Block Diagram of the Opus Encoder

The Opus encoder is standardized for VoIP applications by the IETF. RFC 6716
defines the encoder and decoder. Furthermore, the IETF has published
specifications for the packet payload format of Opus frames.
Speex
The Speex encoder is the sibling of Opus. It is also developed by Xiph.org
and takes a very similar approach to Opus. The options offered by the two
encoders are largely similar. However, in our research we are more
interested in experimenting with Speex than with Opus. The reason is
explained later in the report.
Speex is based on CELP, which stands for Code Excited Linear Prediction. The
CELP technique is based on three ideas:
- The use of a linear prediction (LP) model to model the vocal tract
- The use of (adaptive and fixed) codebook entries as input (excitation) of
  the LP model
- The search performed in closed-loop in a perceptually weighted domain
Speex is designed to compress voice at bitrates ranging from 2 to 44 kbps.
Some of Speex's features include:
- Narrowband (8 kHz), wideband (16 kHz), and ultra-wideband (32 kHz)
  compression in the same bit-stream
- Intensity stereo encoding
- Packet loss concealment
- Variable bitrate operation (VBR)
- Voice Activity Detection (VAD)
- Discontinuous Transmission (DTX)
- Fixed-point port
- Acoustic echo canceller
- Noise suppression
The following table compares the three coders discussed above.

Codec             Sampling rate (kHz)  Bitrate (kbps)               Delay, frame+lookahead (ms)  Multirate  VBR  License
Speex             8, 16, 32            2.15-24.6 (NB), 4-44.2 (WB)  20+10 (NB), 20+14 (WB)       yes        yes  open-source/free software
Opus              8, 16, 22            6-510                        2.5-60                       yes        yes  open-source
AMR-NB            8                    4.75-12.2                    20+5?                        yes        -    proprietary
AMR-WB (G.722.2)  16                   6.6-23.85                    20+5?                        yes        -    proprietary

Table I-2: Comparison Between the 3 Speech Encoders
To sum up, the bibliographic work has led us to focus on the concept of
variable bitrate (VBR), for reasons explained in Chapter II. Furthermore,
the literature our research builds on works with the Speex encoder.
Consequently, Speex is the encoder used in the test-bed.
Security
The other concern in our research is security. As stated at the beginning of
the chapter, privacy is of great importance for real-time voice
communication applications, whether in mobile telephony or voice over IP. In
this section, we review the concept of encrypting voice data along with the
state of the art in the field.
Encryption is the process of converting plaintext (readable data) into
ciphertext (unreadable data) to protect it against eavesdroppers. The
process has a counterpart, decryption, in which the ciphertext is converted
back at the other end so that it can be understood. As defined in RFC 2828
[Reference], a cryptographic system is "a set of cryptographic algorithms
together with the key management processes that support use of the
algorithms in some application context." This definition covers the whole
mechanism that provides the necessary level of security, comprising network
protocols and data encryption algorithms.
The goals of any cryptographic system fall into 5 categories:
- Authentication: Before data is sent and received using the system, the
  identities of the sender and the receiver should be verified.
- Secrecy or Confidentiality: This is the function by which most people
  identify a secure system. It means that only the authenticated parties are
  able to interpret the message (data) content and no one else.
- Integrity: The content of the communicated data is assured to be free from
  any type of modification between the end points (sender and receiver). A
  basic form of integrity check is the checksum in IPv4 packets.
- Non-Repudiation: Neither the sender nor the receiver can falsely deny that
  they have sent or received a certain message.
- Service Reliability and Availability: Secure systems are often attacked by
  intruders, which may affect their availability and the type of service
  offered to their users. Such systems should provide a way to grant their
  users the quality of service they expect.
The category of interest to us is confidentiality. Consequently, references
to security throughout the report address the confidentiality goal of the
implemented security system. Furthermore, the attack considered on the
system is based on traffic analysis and not on conventional cryptanalysis.
This idea is discussed in detail in the next chapter.
Symmetric and Asymmetric Encryption
Data encryption procedures fall into two main categories depending on the
type of keys used to encrypt and decrypt the secured data: asymmetric and
symmetric encryption. In symmetric encryption, the sender and the receiver
agree on a secret (shared) key, which they then use to encrypt and decrypt
the messages they exchange. The main concern with symmetric encryption is
how to share the secret key securely between the two peers. If the key
becomes known for any reason, the whole system collapses. Asymmetric
encryption, on the other hand, uses two keys: what Key1 encrypts only Key2
can decrypt, and vice versa. It is also known as Public Key Cryptography
(PKC), because each user has two keys: a public key, which is known to
everyone, and a private key, which is known only to the user.
In this project, we are interested in experimenting with symmetric
encryption, because the state of the art in speech encryption is based on
symmetric ciphers. The reason is that symmetric algorithms are in general
less computationally demanding than asymmetric ones. This reduction in
complexity is of great importance for a real-time application that usually
runs on platforms with limited capabilities (mobile phones).
Block and Stream Encryption
Encryption techniques are also commonly categorized by the form of the input
data they operate on. The two types are block ciphers and stream ciphers.
A stream cipher operates on a stream of data bit by bit. It consists of two
major components: a key stream generator and a mixing function. The mixing
function is usually just an XOR function, while the key stream generator is
the main unit of the stream cipher.
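
The two components can be sketched in a few lines. The key stream generator below is a toy construction (hashing the key, a nonce, and a counter with SHA-256) chosen only so the example runs with the standard library; it is an assumption made for illustration and not a vetted cipher. The key, nonce, and frame bytes are likewise hypothetical.

```python
import hashlib


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy key stream generator: SHA-256 over key || nonce || counter."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out.extend(hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(out[:length])


def xor_mix(data: bytes, key: bytes, nonce: bytes) -> bytes:
    """Mixing function: XOR the data with the key stream.

    Applying it twice with the same key and nonce restores the input,
    so the same function encrypts and decrypts.
    """
    ks = keystream(key, nonce, len(data))
    return bytes(d ^ k for d, k in zip(data, ks))


key, nonce = b"demo key", b"frame-0001"   # hypothetical values
frame = b"\x01\x02\x03\x04\x05"           # stand-in for one compressed frame
cipher = xor_mix(frame, key, nonce)

print(len(frame), len(cipher))             # the ciphertext has the same length
print(xor_mix(cipher, key, nonce) == frame)
```

Because the mixing step is a byte-for-byte XOR, the ciphertext is exactly as long as the plaintext.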
In a block cipher, data is encrypted and decrypted in blocks. In its
simplest mode, the plaintext is divided into blocks, which are then fed into
the cipher to produce blocks of ciphertext. ECB (Electronic Codebook Mode)
is the basic form of block cipher operation, where each data block is
encrypted directly to generate its corresponding ciphertext block.
There are many variants of block cipher operation, in which different
techniques are used to strengthen the security of the system. The most
common modes are ECB (Electronic Codebook Mode), CBC (Cipher Block Chaining
Mode), and OFB (Output Feedback Mode). Unlike ECB, CBC mode uses the
ciphertext block from the previous encryption step in the current one, which
forms a chain-like encryption process. OFB operates on the plaintext in a
way similar to the stream cipher described above, where the key stream block
used in every step depends on the key stream block from the previous step.
There are other modes such as CTR (Counter) and CFB (Cipher Feedback). CTR
mode is used to transform a block cipher into a stream cipher. The idea is
simple: the block cipher is used to generate a key stream, which is mixed
(usually XORed) with the plaintext.
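
The difference between encrypting each block directly (ECB) and chaining blocks (CBC) can be seen with a small sketch. The block cipher used here is a toy 4-round Feistel network built on SHA-256, an assumption that stands in for a real cipher such as AES so the example needs only the standard library. All function names, the key, and the all-zero IV are hypothetical; a real system would use an unpredictable IV.

```python
import hashlib

BLOCK = 16          # bytes per block (the AES block size); the cipher is a toy
HALF = BLOCK // 2


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def _round(half: bytes, key: bytes, i: int) -> bytes:
    # Round function of the toy Feistel network.
    return hashlib.sha256(key + bytes([i]) + half).digest()[:HALF]


def encrypt_block(block: bytes, key: bytes) -> bytes:
    left, right = block[:HALF], block[HALF:]
    for i in range(4):
        left, right = right, _xor(left, _round(right, key, i))
    return left + right


def decrypt_block(block: bytes, key: bytes) -> bytes:
    left, right = block[:HALF], block[HALF:]
    for i in reversed(range(4)):
        left, right = _xor(right, _round(left, key, i)), left
    return left + right


def blocks(data: bytes):
    # The input length must already be a multiple of BLOCK bytes.
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def ecb_encrypt(data: bytes, key: bytes) -> bytes:
    # Each block is encrypted on its own.
    return b"".join(encrypt_block(b, key) for b in blocks(data))


def cbc_encrypt(data: bytes, key: bytes, iv: bytes) -> bytes:
    # Each plaintext block is XORed with the previous ciphertext block.
    out, prev = [], iv
    for b in blocks(data):
        prev = encrypt_block(_xor(b, prev), key)
        out.append(prev)
    return b"".join(out)


def cbc_decrypt(data: bytes, key: bytes, iv: bytes) -> bytes:
    out, prev = [], iv
    for b in blocks(data):
        out.append(_xor(decrypt_block(b, key), prev))
        prev = b
    return b"".join(out)


key, iv = b"demo key", bytes(BLOCK)
plain = b"A" * BLOCK * 2                 # two identical plaintext blocks
ecb = ecb_encrypt(plain, key)
cbc = cbc_encrypt(plain, key, iv)
print(ecb[:BLOCK] == ecb[BLOCK:])        # ECB: identical blocks stay identical
print(cbc[:BLOCK] == cbc[BLOCK:])        # CBC: chaining makes them differ
print(cbc_decrypt(cbc, key, iv) == plain)
```

In both modes the output length is a whole number of blocks, which is why block encryption implies padding the input up to the block size.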
The natural choice for real-time voice communication is a stream cipher,
because the transferred data is itself a stream. However, in this project we
also explore the option of using a block cipher. The feasibility of using
block ciphers to encrypt voice data is examined from the perspective of
trading off performance for security. We elaborate on this later in the
report.
Common Encryption Algorithms
Here we discuss 5 of the best-known ciphers in the state of the art. Among
these algorithms, AES and KASUMI are used to secure real-time voice
communication. AES is standardized for voice over IP in the Secure Real-time
Transport Protocol (SRTP), which is a profile of RTP. KASUMI, on the other
hand, was standardized by 3GPP for GSM and subsequent mobile communication
systems.
DES: The Data Encryption Standard was the first encryption standard to be
recommended by NIST (National Institute of Standards and Technology). It is
based on an algorithm proposed by IBM called Lucifer. DES became a standard
in 1977. Since then, many attacks that exploit its weaknesses have been
recorded, which has made it an insecure block cipher.
3DES: As an enhancement of DES, the 3DES (Triple DES) encryption standard
was proposed. In this standard the encryption method is the same as in the
original DES but applied 3 times to increase the security level. However,
3DES is known to be slower than other block ciphers.
AES: The Advanced Encryption Standard is the newer encryption standard
recommended by NIST to replace DES. The Rijndael (pronounced "Rain Doll")
algorithm was selected in 2000, following a competition launched in 1997 to
choose the best encryption standard. No practical attack better than brute
force, in which the attacker tries all possible key combinations, is known
against it. Both AES and DES are block ciphers.
Blowfish: It is one of the most common public domain encryption algorithms,
provided by Bruce Schneier, one of the world's leading cryptologists and
the president of Counterpane Systems, a consulting firm specializing in
cryptography and computer security. Blowfish is a variable-length key,
64-bit block cipher. The Blowfish algorithm was first introduced in 1993.
It can be optimized in hardware applications, though it is mostly used in
software applications. Although it suffers from a weak keys problem, no
attack is known to be successful against it.
KASUMI: It is a block cipher used in UMTS, GSM, and GPRS mobile
communications systems. In
UMTS, KASUMI is used in the confidentiality (f8) and integrity
algorithms (f9) with names UEA1
and UIA1, respectively. In GSM, KASUMI is used in the A5/3 key
stream generator and in GPRS in
the GEA3 key stream generator.
KASUMI was designed for 3GPP to be used in UMTS security system
by the Security Algorithms
Group of Experts (SAGE), a part of the European standards body
ETSI. SAGE agreed with 3GPP
technical specification group (TSG) for system aspects of 3G
security (SA3) to base the development
on an existing algorithm that had already undergone some
evaluation. They chose the cipher
algorithm MISTY1 developed and patented by Mitsubishi Electric
Corporation. The original
algorithm was slightly modified for easier hardware
implementation and to meet other requirements
set for 3G mobile communications security.
In January 2010, Orr Dunkelman, Nathan Keller and Adi Shamir released a
paper showing that they could break KASUMI with a related-key attack and
very modest computational resources. Interestingly, the attack is
ineffective against MISTY1.
Report Structure
In the first chapter of this report, we became acquainted with the state of
the art in both compression and encryption. We reviewed the encoding
concepts along with the widely used encoders. We also briefly reviewed
security. Cipher types and modes were presented with emphasis on VoIP and
mobile telephony applications.
In the second chapter, we present a brief literature review stating the main
problem the project tries to tackle: the unsafe combination of VBR and
stream encryption. The papers describing the security vulnerabilities are
reviewed briefly. The solution to the problem is discussed, and the
perspective adopted in the project is determined.
Chapter III presents the test-bed that we created in order to measure
bitrates. The test-bed consists of 3 main elements (or stages): encoding,
encryption, and sending/receiving.
The fourth and final chapter includes all the obtained results. These
results are the bitrates obtained under different setups spanning the whole
space of options in our field of interest. This chapter also includes the
concluding statement along with future recommendations.
Chapter II Literature Review and Problem Formulation
The main problem to be tackled in this project can be stated briefly. The combination of variable bit-rate compression and length-preserving encryption (stream cipher) induces a security weakness in the form of vulnerability to traffic analysis. The solution is to reduce information leakage by reducing the variation of bitrate in the transmitted stream. This is achieved by relying on constant bitrate (CBR) or by using padding. In brief, our project focuses on the analysis of the cost of padding in terms of bandwidth. We aim at testing padding and reaching a conclusion about its cost and, consequently, its feasibility. The proposition of the research project should answer the question of whether a trusted security level can be gained using existing tools.
In this chapter, we exhibit the weakness invoked by using
variable bit-rate compression and then we
discuss the perspective adopted in tackling this problem.
Traffic Analysis of Encrypted Voice Stream
In 2007, a paper was published under the title "Language Identification of Encrypted VoIP Traffic". It was followed in 2008 by another paper, "Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations". The most important paper in this context was published in 2011 under the title "Phonotactic Reconstruction of Encrypted VoIP Conversations: Hookt on fon-iks". The common idea that can be inferred from the titles is the extraction of certain information (language, some phrases, phoneme reconstruction) from an encrypted VoIP stream. A key point is not revealed in the titles: such extraction relies on variable bit-rate compression.
The Secure RTP (SRTP) framework [RFC3711] is a widely used
framework for securing RTP
sessions [RFC3550]. SRTP provides the ability to encrypt the
payload of an RTP packet, and
optionally add an authentication tag, while leaving the RTP
header and any header extension in the
clear. A range of encryption transforms can be used with SRTP,
but none of the predefined encryption
transforms use any padding; the RTP and SRTP payload sizes match
exactly.
When using SRTP with voice streams compressed using variable bit
rate (VBR) codecs, the length
of the compressed packets will depend on the characteristics of
the speech signal. This variation in
packet size will leak a small amount of information about the
contents of the speech signal. This is
potentially a security risk for some applications. For example,
[spot-me] shows that known phrases
in an encrypted call using the Speex codec in VBR mode can be
recognized with high accuracy in
certain circumstances, and [fon-iks] shows that approximate
transcripts of encrypted VBR calls can
be derived for some codecs without breaking the encryption. How
significant these results are, and
how they generalize to other codecs, is still an open question.
This memo discusses ways in which
such traffic analysis risks may be mitigated.
Information leakage via variable bit-rate
Generally speaking, the codec takes as input the audio stream
from the user, which is typically
sampled at either 8000 or 16000 samples per second (Hz). At some
fixed interval, the codec takes the
n most recent samples from the input, and compresses them into a
packet for efficient transmission
across the network. To achieve the low latency required for
real-time performance, the length of the
interval between packets is typically fixed between 10 and 50ms,
with 20ms being the common case.
Thus for a 16 kHz audio source, we have n = 320 samples per
packet, or 160 samples per packet for
the 8 kHz case.
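The relation between sampling rate, packet interval and samples per packet can be sketched as follows. This is only an illustration; the function name samples_per_packet is ours and is not part of any codec API.

```c
#include <stddef.h>

/* Number of PCM samples the codec consumes for one packet:
 * sampling rate (Hz) times packet interval (ms), divided by 1000. */
size_t samples_per_packet(unsigned sample_rate_hz, unsigned interval_ms)
{
    return ((size_t)sample_rate_hz * interval_ms) / 1000u;
}
```

With a 20 ms interval this gives 320 samples at 16 kHz and 160 samples at 8 kHz, matching the figures above.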
Many common voice codecs are based on a technique called
code-excited linear prediction (CELP).
For each packet, a CELP encoder simply performs a brute-force
search over the entries in a codebook
of audio vectors to output the one that most closely reproduces
the original audio. The quality of the
compressed sound is therefore determined by the number of
entries in the codebook. The index of the
best-fitting codebook entry, together with the linear predictive
coefficients and the gain, make up the
payload of a CELP packet. The larger code books used for
higher-quality encodings require more bits
to index, resulting in higher bit rates and therefore larger
packets.
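To make the link between codebook size and packet size concrete, the following minimal sketch computes how many bits are needed to index one codebook entry. The function name codebook_index_bits is hypothetical, and real CELP codecs combine several codebooks and parameters per frame, so this only illustrates the trend.

```c
/* Bits needed to index one entry of a codebook with `entries` entries:
 * the smallest b such that 2^b >= entries. */
unsigned codebook_index_bits(unsigned long entries)
{
    unsigned bits = 0;
    if (entries > 1) {
        unsigned long v = entries - 1;  /* largest index to represent */
        while (v != 0) {
            v >>= 1;
            bits++;
        }
    }
    return bits;
}
```

Doubling the number of entries adds one bit per index, which is why higher-quality modes produce larger packets.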
In some CELP variants, such as QCELP, Speex's variable bit rate
mode, or the approach advocated
by Zhang et al., the encoder adaptively chooses the bit rate for
each packet in order to achieve a good
balance of audio quality and network bandwidth. This approach is
appealing because the decrease in
data volume may be substantial, with little or no loss in
quality. In a two-way call, each participant is
idle roughly 63% of the time, so the savings may be substantial.
Unfortunately, this approach can also
cause substantial leakage of information in encrypted VoIP calls
because, in the standard specification
for Secure RTP (SRTP), the cryptographic layer does not pad or
otherwise alter the size of the original
RTP payload.
Intuitively, the sizes of CELP packets leak information because the choice of bit rate is largely based on the audio encoded in the packet's payload. For example, the variable bit-rate Speex codec encodes vowel sounds at higher bit rates than fricative sounds like "f" or "s". In phonetic models of speech, sounds are broken down into several different categories, including the aforementioned vowels and fricatives, as well as stops like "b" or "d", and affricatives like "ch". Each of these canonical sounds is called a phoneme, and the pronunciation for each word in the language can then be given as a sequence of phonemes. While there is no consensus on the exact number of phonemes in spoken English, most in the speech community put the number between 40 and 60.
In [9], to demonstrate the relationship between bit rate and phonemes, several recordings from the TIMIT corpus of phonetically-rich English speech were encoded using Speex in wideband variable bit rate mode, and the bit rate used to encode each phoneme was observed. The probabilities for 8 of the 21 possible bit rates are shown for a handful of phonemes in Figure II-1. As expected, the two vowel sounds, "aa" and "aw", are typically encoded at significantly higher bit rates than the fricative "f" or the consonant "k". Moreover, large differences in the frequencies of certain bit rates (namely, 16.6, 27.8, and 34.2 kbps) can be used to distinguish "aa" from "aw" and "f" from "k".
Figure II-1: Distribution of bit rates used to encode four phonemes with Speex
Figure II-2: Packets for "artificial"
Figure II-3: Packets for "intelligence"
In fact, it is these differences in bit rate for the phonemes that make recognizing words and phrases in encrypted traffic possible. To illustrate the patterns that occur in the stream of packet sizes when a certain word is spoken, the authors of [9] examined the sequences of packets generated by encoding several utterances of the words "artificial" and "intelligence" from the TIMIT corpus. They represent the packets for each word visually in Figures II-2 and II-3 as a data image: a grid with bit rate on the y-axis and position in the sequence on the x-axis. Starting with a plain white background, the cell at position (x, y) is darkened each time a packet encoded at bit rate y is observed at position x for the given word. In both graphs, there are several dark gray or black grid cells where the same packet size is consistently produced across different utterances of the word, and in fact, these dark spots are closely related to the phonemes in the two words. In Figure II-2, the bit rate in the 2nd to 5th packets (the "a" in "artificial") is usually quite high (35.8 kbps), as we would expect for a vowel sound. Then, in packets 12 to 14 and 20 to 22, we see much lower bit rates for the fricative "f" and the affricative "sh". Similar trends are visible in Figure II-3; for example, the "t" sound maps consistently to 24.6 kbps in both words.
Example of traffic analysis
In the paper "Uncovering spoken phrases in encrypted VoIP conversations" [9], the method adopted for analyzing the encrypted VoIP stream can be summarized as follows:
To identify a phrase without using any examples of the phrase or any of its constituent words, a concatenative synthesis technique is applied to generate a few hundred synthetic training sequences for the phrase. These sequences are used to train a profile HMM for the phrase, which is then used to search for the phrase in streams of packets. An overview of the entire training and detection process is given in Figure II-4.
Figure II-2: Overview of training and detection process
Mitigation Techniques
One way to prevent word spotting would be to pad packets to a common length, or at least to a coarser granularity. Another way is to refrain from using VBR and use the CBR mode instead. However, this is not optimal. Padding regains the lost security (to a certain extent, as we will see) while preserving some of the benefit of variable bit-rate encoding.
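Padding to a coarser granularity amounts to rounding each payload length up to the next multiple of a block size. A minimal sketch, with a helper name (pad_to_multiple) of our own choosing:

```c
#include <stddef.h>

/* Round a payload length up to the next multiple of `block` bytes.
 * A length that is already a multiple is left unchanged.
 * A block of 0 is treated as "no padding". */
size_t pad_to_multiple(size_t len, size_t block)
{
    if (block == 0)
        return len;
    return ((len + block - 1) / block) * block;
}
```

For example, with a 16-byte (128-bit) block, payloads of 43 and 46 bytes both become 48 bytes, so an observer can no longer tell them apart; with a 64-byte (512-bit) block, every payload of up to 64 bytes has the same size on the wire.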
In the paper [9] the traffic analysis system (search algorithm)
was tested against padding. To explore
the tradeoff between padding and search accuracy, they encrypted
both their training and testing data
sets to multiples of 128, 256 or 512 bits and applied their
approach. The results are presented in
Figure II-4. The use of padding is quite encouraging as a
mitigation technique, as it greatly reduced
the overall accuracy of the search algorithm. When padding to
multiples of 128 bits, the system
achieves only 0.15 recall at 0.16 precision. Increasing padding
so that packets are multiples of 256
bits gives a recall of 0.04 at 0.04 precision.
Figure II-3: Robustness to padding
The debate around the announcement of security flaws in variable bit-rate encoding has led to the publication of an RFC by the IETF. The document, "Guidelines for the Use of Variable Bit Rate Audio with Secure RTP" (RFC 6562), gives guidelines for dealing with variable bit-rate in the SRTP protocol:
For scenarios where VBR is considered unsafe, a constant bit rate (CBR) codec SHOULD be negotiated and used instead, or the VBR codec SHOULD be operated in a CBR mode. However, if the codec does not support CBR, RTP padding SHOULD be used to reduce the information leak to an insignificant level. Packets may be padded to a constant size or to a small range of sizes ([spot-me] achieves good results by padding to the next multiple of 16 octets, but the amount of padding needed to hide the variation in packet size will depend on the codec and the sophistication of the attacker) or may be padded to a size that varies with time. The most secure and RECOMMENDED option is to pad all packets throughout the call to the same size.
In the case where the size of the padded packets varies in time,
the same concerns as for VAD apply.
That is, the padding SHOULD NOT be reduced without waiting for a
certain (random) time. The
RECOMMENDED "hold time" is the same as the one for VAD.
Note that SRTP encrypts the count of the number of octets of
padding added to a packet, but not the
bit in the RTP header that indicates that the packet has been
padded. For this reason, it is
RECOMMENDED to add at least one octet of padding to all packets
in a media stream, so an attacker
cannot tell which packets needed padding.[10]
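The recommendation to always add at least one octet of padding can be sketched as follows. In RTP, the last padding octet carries the number of padding octets (including itself), so the count must fit in one octet. The function name rtp_padding_octets is hypothetical and the code only computes sizes; it does not build packets.

```c
#include <stddef.h>

/* Number of padding octets to append so that len + pad is a multiple of
 * `block`, always in the range 1..block (never 0), so that every packet
 * carries padding. Returns 0 for an unusable block size: the count is
 * stored in a single octet, so block must be between 1 and 255. */
size_t rtp_padding_octets(size_t len, size_t block)
{
    if (block == 0 || block > 255)
        return 0;
    return block - (len % block);
}
```

A payload that is already a multiple of the block size therefore receives a full extra block of padding rather than none.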
Chapter III Test-Bed
In the previous chapter, we presented the security weakness caused by the combination of variable bit-rate encoding and length-preserving encryption. This weakness takes the form of vulnerability to traffic analysis. The performance of the traffic analysis system presented in the previous chapter degrades as packets are padded to increasing block lengths.
Furthermore, because padding preserves security to a great extent, RFC 6562, published by the IETF, recommends either using constant bit-rate encoding or relying on padding, for example to a block length of 16 bytes.
The discussion around the subject did not take into consideration the tradeoff between security and performance. The feasibility of padding remained an open question. A key point to keep in mind is that variable bitrate encoding aims at lowering the needed bandwidth as much as possible. Consequently, the cost of padding in terms of bit-rate and needed bandwidth has to be calculated in order to understand the price we have to pay to achieve security while using variable bitrate.
Answering the question about the feasibility of padding is the main goal of our research project. The answer might be that padding costs more than constant bitrate and, consequently, that padding is not the optimal solution for preserving security. In any case, we aim at a solid understanding of the cost paid for different security levels. The results of our test-bed should clarify the relation between security, quality, and performance.
Quality is a parameter we take in our research as part of
tradeoff formula. The quality of the encoder
is usually mapped to the bitrate used by it. Consequently, the
quality can be inserted into the tradeoff
formulation as a price to pay for preserving both security and
bitrate.
In order to test properly and calculate the obtained bitrates, we need to create a system in which we implement compression, encryption, sending and receiving of a voice stream. The system should allow the manipulation of the parameters that we are interested in.
Test-bed requirements
The created system must be able to implement compression and
encryption of a speech stream.
Furthermore, the system should allow the manipulation of
parameters for both compression and
encryption. One more important requirement is the ability to send and receive the compressed and encrypted stream. Sending/receiving involves the packetization of the stream in a realistic manner that can be related to a real application. The system should also be able to log the obtained bitrates for every setup.
For compression, we should be able to choose the mode
(narrow-band, wideband). In addition to that,
we should be able to choose the quality of compression. The
quality variable is an important variable
that is supported by many algorithms that form the state of the
art. We emphasize the ability to choose
quality since we are interested in inserting quality as a
parameter in the tradeoff setup as we can see
later in the results section.
In encryption, the main requirement is the ability to pad data
to a multiple of 128, 256, 512 bits. Of
course, in addition to that, we need to adopt a cipher which is
trusted in the state of the art. The cipher
should have a low cost in terms of processing time since the
platforms are usually mobile phones with
limited memory and processing power. One additional requirement
is being a symmetric cipher since
all protocols implement symmetric encryption/decryption
mechanisms.
The requirements can be summarized in a compact format as follows:
Compression:
o Widely implemented encoder
o Variable bit-rate compression
o Variable quality setting
Encryption:
o Trusted low cost cipher
o Padding to different sizes
o Symmetric cipher
Test-bed elements
Based on the discussed requirements, the search for an encoder and a cipher is aimed at finding modules widely present in the state of the art. The test-bed is built in a Linux environment (Ubuntu distribution of GNU/Linux). The libraries used are all written in the C programming language; consequently, the test-bed was written in C.
For the encoder, we chose the Speex encoder. This encoder was chosen since it meets all the stated requirements. Furthermore, it is the encoder used in the three articles that describe the security vulnerability.
Regarding encryption, the choice was obvious: the Advanced Encryption Standard. AES is standardized and adopted in SRTP, the main standard for security in voice over IP. However, SRTP specifications and implementations use AES in CTR mode (Counter mode). This mode generates a key stream and mixes it with the data (using the XOR operation) in order to get the encrypted text. The length of the initial plain text is preserved. Consequently, this mode turns a block cipher into a length-preserving stream cipher regardless of the block size of the cipher. Other modes specified by SRTP are f8 and the null cipher.
It is worth mentioning that the RFC giving the guidelines for using variable bit-rate with SRTP recommends relying on higher levels in the hierarchy of the networking model to achieve padding. According to that document, padding is part of the compression or, more generally, the application layer. In our approach, however, we used a block cipher in the test bed. The choice of a block cipher does not affect the desired results in any way. Furthermore, making padding part of the encryption process is justified in terms of security requirements. Implementing padding in compression or in another entity may induce security vulnerabilities that are avoidable by using a block cipher. For example, when padding is done within the RTP payload, the number of padding bytes will be part of the encrypted payload of the RTP packet, but the flag specifying padding in the RTP header will not be encrypted.
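The difference between a length-preserving mode and a padded block mode can be illustrated by comparing ciphertext lengths only. The sketch below assumes a 16-byte AES block and PKCS#7-style padding, which always adds between 1 and 16 bytes; the function names are ours.

```c
#include <stddef.h>

#define AES_BLOCK_BYTES 16u

/* CTR mode XORs the data with a key stream: the length is preserved. */
size_t ctr_ciphertext_len(size_t plain_len)
{
    return plain_len;
}

/* CBC mode with PKCS#7-style padding: the output is the next multiple of
 * the block size strictly greater than the input length, because at least
 * one padding byte is always added. */
size_t cbc_pkcs7_ciphertext_len(size_t plain_len)
{
    return (plain_len / AES_BLOCK_BYTES + 1u) * AES_BLOCK_BYTES;
}
```

A 43-byte frame thus stays 43 bytes under CTR but becomes 48 bytes under padded CBC, and a 48-byte frame becomes 64 bytes.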
Speex Encoder
In our test bed, we used the Speex encoder as the designated compression tool. We used the Speex library and built the encoder step by step using the Speex API (Application Programming Interface). We made this choice because manipulating parameters and managing the encoder's output requires such a construction rather than a prebuilt, ready-to-use module.
The libspeex library contains all the functions for encoding and decoding speech with the Speex codec. When linking on a UNIX system, we must add -lspeex -lm to the compiler command line.
In order to encode speech using Speex, we first need to:
#include <speex/speex.h>
Then in the code, a Speex bit-packing struct must be declared, along with a Speex encoder state:
SpeexBits bits;
void *enc_state;
The two are initialized by:
speex_bits_init(&bits);
enc_state = speex_encoder_init(&speex_nb_mode);
For wideband coding, speex_nb_mode will be replaced by
speex_wb_mode. In most cases, you will
need to know the frame size used at the sampling rate you are
using.
The encoder is set to CBR mode by default. We set it to variable bit-rate mode by using:
speex_encoder_ctl(enc_state,SPEEX_SET_VBR,&vbr);
The variable vbr is an integer value (0 or 1). It is used to set VBR on (1) or off (0).
There are many parameters that can be set for the Speex encoder,
but the most useful one is the quality
parameter that controls the quality vs. bit-rate tradeoff.
This is set by:
speex_encoder_ctl(enc_state,SPEEX_SET_VBR_QUALITY,&quality);
Quality is a float value ranging from 0.0 to 10.0 (inclusive). The mapping between quality and bit-rate is described in the following 2 tables for both narrowband and wideband.
Mode  Quality  Bit-rate (bps)  mflops  Quality/description
0     -        250             0       No transmission (DTX)
1     0        2,150           6       Vocoder (mostly for comfort noise)
2     2        5,950           9       Very noticeable artifacts/noise, good intelligibility
3     3-4      8,000           10      Artifacts/noise sometimes noticeable
4     5-6      11,000          14      Artifacts usually noticeable only with headphones
5     7-8      15,000          11      Need good headphones to tell the difference
6     9        18,200          17.5    Hard to tell the difference even with good headphones
7     10       24,600          14.5    Completely transparent for voice, good quality music
8     1        3,950           10.5    Very noticeable artifacts/noise, good intelligibility
9     -        -               -       reserved
10    -        -               -       reserved
11    -        -               -       reserved
12    -        -               -       reserved
13    -        -               -       Application-defined, interpreted by callback or skipped
14    -        -               -       Speex in-band signaling
15    -        -               -       Terminator code
Table III-1 Quality versus bitrate for Speex narrowband
Mode/Quality  Bit-rate (bps)  Quality/description
0             3,950           Barely intelligible (mostly for comfort noise)
1             5,750           Very noticeable artifacts/noise, poor intelligibility
2             7,750           Very noticeable artifacts/noise, good intelligibility
3             9,800           Artifacts/noise sometimes annoying
4             12,800          Artifacts/noise usually noticeable
5             16,800          Artifacts/noise sometimes noticeable
6             20,600          Need good headphones to tell the difference
7             23,800          Need good headphones to tell the difference
8             27,800          Hard to tell the difference even with good headphones
9             34,200          Hard to tell the difference even with good headphones
10            42,200          Completely transparent for voice, good quality music
Table III-2 Quality versus bitrate for Speex wideband
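To relate these tables to packet sizes, the nominal frame size for a quality setting can be derived from the bit-rate and the frame duration. The sketch below uses the narrowband values of Table III-1; the array and function names are ours, and in VBR mode the rate actually chosen for each frame varies around these nominal values.

```c
#include <stddef.h>

/* Nominal narrowband bit-rates (bps) for quality 0..10, from Table III-1. */
static const unsigned nb_bitrate_bps[11] = {
    2150, 3950, 5950, 8000, 8000, 11000, 11000, 15000, 15000, 18200, 24600
};

/* Bytes needed for one frame of `frame_ms` milliseconds at the nominal
 * rate of the given quality; a partial byte is rounded up.
 * Returns 0 for a quality outside 0..10. */
size_t nb_frame_bytes(int quality, unsigned frame_ms)
{
    unsigned long bits;
    if (quality < 0 || quality > 10)
        return 0;
    bits = ((unsigned long)nb_bitrate_bps[quality] * frame_ms) / 1000ul;
    return (size_t)((bits + 7ul) / 8ul);
}
```

For 20 ms frames, quality 0 corresponds to 6 bytes per frame and quality 10 to 62 bytes per frame.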
Once the initialization is done, for every input frame:
speex_bits_reset(&bits);
speex_encode_int(enc_state, input_frame, &bits);
nbBytes = speex_bits_write(&bits, byte_ptr, MAX_NB_BYTES);
Where input_frame is a (short *) pointing to the beginning of a speech frame, byte_ptr is a (char *) where the encoded frame will be written, MAX_NB_BYTES is the maximum number of bytes that can be written to byte_ptr without causing an overflow, and nbBytes is the number of bytes actually written to byte_ptr (the encoded size in bytes). Before calling speex_bits_write, it is possible to find the number of bytes that need to be written by calling speex_bits_nbytes(&bits), which returns a number of bytes.
After you're done with the encoding, free all resources with:
speex_bits_destroy(&bits);
speex_encoder_destroy(enc_state);
AES Encryption
The choice of the AES cipher is justified in the previous
section of the chapter. However, the
algorithm has a high number of implementations. Among these, a
trusted and well known library in
the state of the art is OpenSSL.
OpenSSL provides two primary libraries: libssl and libcrypto.
The libcrypto library provides the
fundamental cryptographic routines used by libssl. You can
however use libcrypto without using
libssl.
For most uses, users should use the high level interface that is
provided for performing cryptographic
operations. This is known as the EVP interface (short for
Envelope). This interface provides a suite
of functions for performing encryption/decryption (both
symmetric and asymmetric),
signing/verifying, as well as generating hashes and MAC codes,
across the full range of OpenSSL
supported algorithms and modes. Working with the high level
interface means that a lot of the
complexity of performing cryptographic operations is hidden from
view. A single consistent API is
provided. In addition low level issues such as padding and
encryption modes are all handled.
The EVP functions provide a high level interface to OpenSSL
cryptographic functions. They provide
the following features:
A single consistent interface regardless of the underlying
algorithm or mode
Support for an extensive range of algorithms
Encryption/Decryption using both symmetric and asymmetric
algorithms
Sign/Verify
Key derivation
Secure Hash functions
Message Authentication Codes
Support for external crypto engines
AES is available in libcrypto in different modes and with key lengths of 128, 192, and 256 bits. The block size of AES itself is fixed at 128 bits; the figures in names such as EVP_aes_256_cbc() refer to the key length. The library therefore offers no AES variant that pads to 512 bits. To deal with this issue, we used the algorithm in CBC mode through EVP_aes_128_cbc() and EVP_aes_256_cbc() for the 128 and 256 settings, and to get the size of 512 bits, we relied on manual padding.
Although the use of EVP as a high level interface simplifies using the library to a great extent, using EVP in a complex test bed with multi-stage procedures may add complexity.
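How the manual padding is carried out is not detailed here. One possible approach, given purely as an assumption, is to append filler bytes to the plaintext before CBC encryption so that the padded ciphertext length becomes a multiple of the target size. The sketch below only computes the number of filler bytes; the name filler_for_target is hypothetical, and a 16-byte block with PKCS#7-style padding is assumed.

```c
#include <stddef.h>

#define AES_BLOCK_BYTES 16u

/* Smallest number of filler bytes to append to a plaintext of `plain_len`
 * bytes so that its CBC ciphertext with PKCS#7-style padding, of length
 * (n / 16 + 1) * 16 for an n-byte input, is a multiple of `target_bytes`.
 * `target_bytes` must be a non-zero multiple of 16; otherwise 0 is
 * returned. */
size_t filler_for_target(size_t plain_len, size_t target_bytes)
{
    size_t k, q, r;
    if (target_bytes == 0 || target_bytes % AES_BLOCK_BYTES != 0)
        return 0;
    k = target_bytes / AES_BLOCK_BYTES;  /* AES blocks per target unit */
    q = plain_len / AES_BLOCK_BYTES;     /* full blocks in the plaintext */
    r = q % k;
    if (r == k - 1u)
        return 0;                        /* cipher padding already lands on target */
    q += (k - 1u) - r;                   /* full blocks needed before cipher padding */
    return q * AES_BLOCK_BYTES - plain_len;
}
```

For a 64-byte (512-bit) target, a 43-byte frame needs 5 filler bytes (48 bytes of plaintext give 64 bytes of ciphertext), while a 62-byte frame needs none. A real implementation would also have to tell the receiver how many filler bytes to discard.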
To encrypt using EVP, first we have to:
#include <openssl/evp.h>
The encryption process starts with initializing the cipher. We have to create a context, an "opaque" structure that libcrypto uses to record the status of encrypt/decrypt operations:
EVP_CIPHER_CTX *e_ctx = EVP_CIPHER_CTX_new();
Then we have to create a key and IV (initialization vector) for the cipher. A SHA1 digest is used to hash the supplied key material (password) multiple times (rounds). More rounds are more secure but slower. Then after setting the key and IV, we call:
EVP_CIPHER_CTX_init(e_ctx);
EVP_EncryptInit_ex(e_ctx, EVP_aes_256_cbc(), NULL, key, iv);
This initializes AES encryption in CBC mode with a 256-bit key, as selected by the second parameter. To use a 128-bit key instead, we call:
EVP_EncryptInit_ex(e_ctx, EVP_aes_128_cbc(), NULL, key, iv);
Encryption of the Speex frame then takes place in the following manner:
EVP_EncryptInit_ex(e_ctx, NULL, NULL, NULL, NULL);
EVP_EncryptUpdate(e_ctx, ciphertext, &c_len, plaintext, *len);
EVP_EncryptFinal_ex(e_ctx, ciphertext+c_len, &f_len);
Note: neither decompression nor decryption of the stream is implemented in the test bed. Although implementing decoding and decryption would add value and integrity to the results, the results can be calculated without them.
RTP Sending/Receiving
The previous 2 stages of the operations held in the test bed
allow calculating bitrate in the absence of
packetization. To achieve realistic results, we implement
sending and receiving of the stream in two
separate threads. Then we calculate bitrates of the received and
dumped packets.
The library used for RTP sending/receiving is oRTP, an
implementation of the RTP protocol. A number
of calls must be made to initialize the library. The first of
these is RTPCreate(), which
establishes a context. A context is an identifier used by the
library to determine which RTP session a
function call is to be associated with. An application can run
many sessions at the same time, each
created with a separate call to RTPCreate, resulting in a
different context for each. Most library
functions accept a context as the first argument. Once RTPCreate
has been called to initialize the
session, the addresses for the session must be set.
rtperror RTPCreate(context *the_context);
rtperror RTPOpenConnection(context cid);
Sending packets is fairly straightforward. The RTPSend()
function is used to tell the library to send
an RTP packet. It requires the user to pass a pointer to a
buffer, a length, a value for the marker field
in the RTP header, an increment for the timestamp, and the
context. The library will take the buffer,
add the RTP header, perform any required operations, and send
the packet. The library will
automatically send RTCP packets. The initial timestamp and
sequence number are chosen randomly.
rtperror RTPSend(context cid, int32 tsinc, int8 marker, int16 pti, int8 *payload, int len);
Receiving packets is a little more complex. In order to know if
a packet is available for reading, a
process can block, it can poll, or use any other kind of
mechanism. Since the library does not dictate
this policy, it is up to us to determine when data is available
for reading. We choose polling every 20
milliseconds in order to check for a received packet. To do
this, the library allows access to the
receive sockets. There are two: one for RTP, one for RTCP. The
functions RTPSessionGetRTPSocket
and RTPSessionGetRTCPSocket are used to do this. They take as
input the context and a pointer to
a socket. When they return 1, the socket has been filled in. We
then check for the presence of an RTP
packet on these sockets using select().
RTPSessionGetRTPSocket(context cid, socktype *value);
rtperror RTPReceive(context cid, int socket, char *rtp_pkt_stream, int *len);
When a packet is present on either socket, the application
should call the function RTPReceive().
This function takes the context, the socket on which data is
present, a pointer to a buffer, and a pointer
to a length value. The length value should be initialized to the
amount of room in the buffer. The
library will read and process the RTP or RTCP packet. For RTCP,
it will perform all statistics
collection and parsing. The buffer will be filled in with the
entire RTP/RTCP packet, including the
header.
We then save the whole received packet into a file for further
calculation of the obtained bitrate. The
bitrate is calculated based on the previously known duration of
the sent speech.
Dataset
The choice of the data set was guided by the dataset used in the articles published about the subject. They used the TIMIT corpus, a database used for speech recognition. Since the TIMIT database is not open for public use, we chose to work with another speech recognition training database: the census database. The following information describes the chosen dataset:
The directory contains the alphanumeric database (aka "census"
aka "an4") recorded at Carnegie
Mellon University circa 1991. Subjects were asked to spell out
personal information, such as name,
address, telephone number, birthdates, etc. They were instructed
to not use their actual numbers. In
addition to these, subjects also spoke randomly generated
sequences of words containing control
words. The database used internally at CMU has 1018 training and
140 test utterances, whereas the
database provided here has 948 training and 130 test utterances.
All data are sampled at 16 kHz, 16-bit linear sampling. All recordings were made with a close
talking microphone.
In the dataset, we have two directories:
an4_clstk
The directory with training data has 74 sub-directories, one for
each speaker. 21 of
them are female, 53 are male. The total number of utterances is
948, and the average
duration is about 3 seconds, totaling a little less than 50
minutes of speech.
an4test_clstk
The directory with test data has 10 sub-directories, one for
each speaker. 3 of them
are female, 7 are male. The total number of utterances is 130,
totaling around 6
minutes of speech.
Test-bed overview
The presented test-bed can be summarized in the block diagram in Figure III-1.
The process of testing starts with choosing a file from the dataset. The duration of the file is calculated by counting the number of samples in the file. After that, the file is encoded and encrypted. Compression must span the whole range of quality (0 to 10). Encryption must also take place in the 4 presented modes (stream and 3 block sizes). Next, the stream is sent over an RTP socket to a local receiving socket opened by another thread. Packets are dumped and saved in an output file. The recorded frame sizes are used to calculate the bit-rate without the packetization overhead. The size of the sent/received stream is used to calculate the bit-rate including the network overhead.
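The two calculations described above can be sketched as follows, assuming headerless 16-bit mono PCM input sampled at 16 kHz as in the dataset. The helper names duration_seconds and bitrate_bps are ours and do not come from the test-bed code.

```c
#include <stddef.h>

/* Duration of a raw PCM file: number of samples divided by the rate. */
double duration_seconds(size_t pcm_bytes, unsigned sample_rate_hz,
                        unsigned bytes_per_sample)
{
    if (sample_rate_hz == 0 || bytes_per_sample == 0)
        return 0.0;
    return (double)(pcm_bytes / bytes_per_sample) / (double)sample_rate_hz;
}

/* Average bit-rate of a stream of `total_bytes` bytes lasting `seconds`.
 * Pass the sum of frame sizes for the rate without packetization, or the
 * size of the dumped packets for the rate with packetization. */
double bitrate_bps(size_t total_bytes, double seconds)
{
    if (seconds <= 0.0)
        return 0.0;
    return (double)total_bytes * 8.0 / seconds;
}
```

For example, a 96,000-byte file lasts 3 seconds; if it produces 150 frames of 64 bytes each (9,600 bytes), the resulting rate is 25,600 bps.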
Figure III-1 (block diagram of the test-bed): choose file from dataset; calculate time; compress using Speex (set quality parameter); encrypt using AES in CBC mode (set the block size); record frame sizes; send/receive over RTP socket; dump frames; calculate bit-rate.
Chapter IV Results and Conclusion
Tests were held using the test-bed presented in the previous chapter. The resulting bitrates are divided into two categories: narrowband and wideband. We present the bitrate obtained for 4 encryption schemes (stream, and padding to 128, 256, and 512 bits), along with the overhead of the latter three schemes over the original stream bitrate.
Narrow Band Results
Figure IV-2: NB rate vs quality (without packetization)
Figure IV-1: NB padding overhead (without packetization)
As we can infer from these results, the overhead induced by padding in narrowband mode is large. In Figures IV-1 and IV-2, we see the rate versus quality for the three levels of security as well as the overhead induced by padding. The 128-bit padding adds a moderate overhead. The 512-bit padding yields a constant bitrate throughout the whole range of quality. Consequently, using CBR with the highest quality may be a better solution than relying on 512-bit padding. However, for other padding schemes (256 bits, for example) the added overhead is manageable.
An example of a tradeoff based on these results is the following. Compare the rates of stream encryption and 256 bit padding: the average rate for stream encryption at quality 10 and the average rate for 256 bit padding at quality 7 are the same. A tradeoff can be made here: padding to 256 bits and setting the quality to 7 yields a large security gain while keeping the same rate. The price we have to pay is quality.
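The selection described in this example can be sketched as a simple search: given a rate budget (here the stream-encryption rate at quality 10), pick the highest quality whose padded rate does not exceed it. The function name and all rate values below are placeholders chosen only to mirror the example; they are not the measured results.

```python
def best_quality_within_budget(padded_rates: dict, budget_bps: float):
    """Highest quality whose padded rate fits the budget, or None."""
    candidates = [q for q, rate in padded_rates.items() if rate <= budget_bps]
    return max(candidates) if candidates else None


# Placeholder values (bit/s), NOT measurements from the test-bed.
stream_rate_q10 = 24000.0
padded_256_rates = {5: 16000.0, 6: 19000.0, 7: 24000.0,
                    8: 27000.0, 9: 30000.0, 10: 34000.0}

choice = best_quality_within_budget(padded_256_rates, stream_rate_q10)
```

With measured per-quality averages in place of the placeholders, the same search gives the quality that has to be accepted for a chosen padding scheme at a fixed rate.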
Figure IV-3: NB rate vs quality (with packetization)
Figure IV-4: NB padding overhead (with packetization)
Wide Band Results
The same tests were run with Speex set to wide-band mode. The following figures show the obtained results (bit-rate versus quality, and overhead) for the 4 encryption schemes adopted. The results are considerably better than those obtained for narrow band. As we see in figure IV-5, the curves corresponding to the 4 encryption schemes differ less, and consequently the added overhead is smaller. For example, if we make the same tradeoff as in the previous section, the quality drops only to 9 instead of 7. To pad to 512 bits and keep the same bit-rate, the quality drops to 8.
Figure IV-6: WB rate versus quality (without packetization)
Figure IV-5: WB overhead (without packetization)
An important observation is that the overhead is considerably smaller when calculated with packetization. For example, the overhead for 256 bit padding at a quality of 7 is around 30% when calculated without packetization, but less than 20% when calculated with packetization.
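One way to see why packetization shrinks the relative overhead: the per-packet headers add the same number of bytes to the padded and the unpadded frame, so the padding bytes are divided by a larger base. The sketch below uses a hypothetical 49-byte frame padded to 64 bytes and an assumed 40-byte header (IPv4 + UDP + RTP); these numbers are illustrative and are not taken from the measurements.

```python
def overhead_percent(stream_frame: int, padded_frame: int,
                     header_bytes: int = 0) -> float:
    """Padding overhead (%) relative to the unpadded frame plus headers."""
    base = stream_frame + header_bytes
    return (padded_frame - stream_frame) / base * 100.0


# Hypothetical frame: 49 bytes, padded to 64 bytes (512... no: 64-byte block).
without_headers = overhead_percent(49, 64)                 # about 30.6 %
with_headers = overhead_percent(49, 64, header_bytes=40)   # about 16.9 %
```

The absolute cost of padding (15 bytes per frame here) is unchanged; only its share of the total traffic drops.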
Figure IV-8: Wide Band rate versus quality (with packetization)
Figure IV-7: Wide Band overhead (with packetization)
Statistical Analysis
The results shown in the previous sections represent only the average bitrate. To have a clearer perspective, we calculated a 95% confidence interval for each obtained bitrate. The confidence interval gives us more information about the resulting bitrate: its width indicates how much the stream benefits from variable bitrate compression. However, a very wide confidence interval cannot be linked to a manageable overhead.
Another important point is that when the confidence intervals of 2 encryption schemes overlap, this suggests that the 2 schemes can operate at the same rate. Overall, the statistical results give a better perspective for understanding the tradeoff to be made.
The following figures present the confidence intervals for the 4 encryption schemes. Each of the 3 block encryption schemes is compared to the stream cipher. Only wide band results are shown.
Figure IV-9: Stream Cipher 95% Confidence Interval
Figure IV-10: Stream and 128 bit padding Confidence Interval
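The confidence intervals and the overlap check discussed in this section can be sketched as follows. The exact method used for the reported intervals is not stated here, so this sketch assumes a normal approximation of the mean (z = 1.96); the function names and sample bit-rates are hypothetical.

```python
import math
import statistics


def confidence_interval_95(samples):
    """95% confidence interval of the mean (normal approximation, z = 1.96)."""
    mean = statistics.mean(samples)
    half_width = 1.96 * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean - half_width, mean + half_width


def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical per-file bit-rates (kbit/s) for two encryption schemes.
stream_rates = [10.0, 12.0, 14.0, 16.0, 18.0]
padded_rates = [13.0, 15.0, 17.0, 19.0, 21.0]

stream_ci = confidence_interval_95(stream_rates)
padded_ci = confidence_interval_95(padded_rates)
same_rate_plausible = intervals_overlap(stream_ci, padded_ci)
```

For small numbers of files, a Student t quantile in place of 1.96 would give a somewhat wider interval.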
Conclusion and future recommendations
The answer to the main question asked in our problem statement is: yes, using padding for encrypted VoIP streams is feasible. The results clearly show that a three-dimensional tradeoff can be made to get the desired solution. The parameters of the tradeoff are the level of security (represented by the padding block size), the bit-rate, and the quality. The latter two parameters work against each other (a lower bit-rate costs quality), while the security parameter changes the scale of the bitrate range.
Figure IV-12: Stream and 256 bit Confidence Interval
Figure IV-11: Stream and 512 bit padding confidence interval
A remark should be made about the importance of 256 bit padding. It provides a large improvement in immunity to traffic analysis (chapter 2), while the overhead induced by this encryption scheme remains manageable.
A concluding statement can be made about the possibility of solving the security issues presented in the literature without relying on new technology. State-of-the-art tools that are already implemented and standardized can be used, with minor modification, to obtain a large security improvement.
As future work, we recommend applying this approach to video compression. Although nothing has been published yet about such an analysis for video, the concept of information leakage through varying packet sizes is worth studying for all types of network streams, especially for real-time applications.
Another recommendation is to push towards standardizing such an approach as part of security standards. Although an RFC has been published with guidelines for using variable bit-rate encoding with SRTP, the problem is that this standard suggests making padding part of the application layer and a responsibility of the developer. Such an approach may introduce security weaknesses that would be avoidable if padding were part of the standard.