Trade-off between security level and compression in voice communications

Analysis of the tradeoff between compression ratio and security level in real-time voice communication

By

Abdallah Attie

Supervised by:

Dr Ahmad Fadlallah

Dr Mohamad Raad

Defended on 09.07.2014 before a jury composed of:

Dr. Wafaa Abou Diab

Dr. Bassem Bakhash

Dr. Ahmad Fadlallah


Abstract

This project analyzes the tradeoff between security level and compression ratio in real-time

voice communication. The problem addressed is that combining variable-bitrate (VBR) compression

with length-preserving encryption makes the stream vulnerable to traffic analysis. The variation

of packet sizes can leak information about the conversation, ranging from language identification

to the identification of certain phrases and the reconstruction of phonemes. The solution is to rely on

constant-bitrate compression or to pad the sent frames to a multiple of 16, 32, or 64 bytes. Each

padding scheme provides a security gain in the form of increased immunity to the described traffic

analysis systems, and the security level rises as the size of the encryption block increases.

This research project analyzes the impact of those padding schemes on the

bitrate of the VoIP stream. For this purpose, we created a test bed that simulates the compression,

encryption, and sending/receiving of speech over an RTP socket. The resulting bitrates are calculated

with and without the overhead of packetization. In conclusion, the resulting data give a clear

view of the tradeoff between three parameters: security level, bitrate, and quality.


Contents

Abstract
List of Figures
List of Tables
List of References
Chapter I Introduction
Compression
Types of Speech Coders
Variable Bit-Rate Coding
Speech Coding State of the Art
Adaptive Multi Rate (AMR)
Opus
Speex
Security
Symmetric and Asymmetric Encryption
Block and Stream Encryption
Common Encryption Algorithms
Report Structure
Chapter II Literature Review and Problem Formulation
Traffic Analysis of Encrypted Voice Stream
Information leakage via variable bit-rate
Example of traffic analysis
Mitigation Techniques
Chapter III Test-Bed
Test-bed requirements
Test-bed elements
Speex Encoder
AES Encryption
RTP Sending/Receiving
Dataset
Test-bed overview
Chapter IV Results and Conclusion
Narrow Band Results
Wide Band Results
Statistical Analysis
Conclusion and future recommendations


List of Figures

Figure I-1: Block Diagram of the Opus Encoder
Figure II-1: Distribution of bit rates used to encode four phonemes with Speex
Figure II-2: Overview of training and detection process
Figure II-3: Robustness to padding
Figure IV-1: NB padding overhead (without packetization)
Figure IV-2: NB rate vs quality (without packetization)
Figure IV-3: NB rate vs quality (with packetization)
Figure IV-4: NB padding overhead (with packetization)
Figure IV-5: WB overhead (without packetization)
Figure IV-6: WB rate versus quality (without packetization)
Figure IV-7: Wide Band overhead (with packetization)
Figure IV-8: Wide Band rate versus quality (with packetization)
Figure IV-9: Stream Cipher 95% Confidence Interval
Figure IV-10: Stream and 128 bit padding Confidence Interval
Figure IV-11: Stream and 512 bit padding Confidence Interval
Figure IV-12: Stream and 256 bit Confidence Interval


List of Tables

Table I-1: Characteristics of Standardized Speech Coding Algorithms in Each of Four Broad Categories
Table I-2: Comparison Between the 3 Speech Encoders
Table III-1: Quality versus bitrate for Speex narrowband
Table III-2: Quality versus bitrate for Speex wideband


List of References

[1] M. Arjona Ramírez and M. Minami, "Low bit rate speech coding," in Wiley Encyclopedia of Telecommunications, J. G. Proakis, Ed., New York: Wiley, 2003, vol. 3, pp. 1299-1308.

[2] P. Kroon, "Evaluation of speech coders," in Speech Coding and Synthesis, W. Bastiaan Kleijn and K. K. Paliwal, Eds., Amsterdam: Elsevier Science, 1995, pp. 467-494.

[3] M. Hasegawa-Johnson and A. Alwan, "Speech Coding: Fundamentals and Applications," in Wiley Encyclopedia of Telecommunications, J. G. Proakis, Ed., New York: Wiley, 2003, vol. 3, pp. 1256-1265.

[4] Wiki.hydrogenaud.io, (2014). Variable Bitrate - Hydrogenaudio Knowledgebase. [online] Available at: http://wiki.hydrogenaud.io/index.php?title=VBR [Accessed 22 Jun. 2014].

[5] E. Ekudden et al., "The Adaptive Multi-Rate Speech Coder," Ericsson Research.

[6] Tools.ietf.org, (2014). RFC 6716 - Definition of the Opus Audio Codec. [online] Available at: http://tools.ietf.org/html/rfc6716#section-2.1.8 [Accessed 22 Jun. 2014].

[7] Speex.org, (2014). Introduction to CELP Coding. [online] Available at: http://www.speex.org/docs/manual/speex-manual/node9.html [Accessed 24 Jun. 2014].

[8] C. V. Wright, L. Ballard, F. Monrose, and G. M. Masson, "Language identification of encrypted VoIP traffic: Alejandra y Roberto or Alice and Bob?" in Proceedings of the USENIX Security Symposium, 2007.

[9] C. V. Wright, L. Ballard, S. E. Coull, F. Monrose, and G. M. Masson, "Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations," in Proceedings of the IEEE Symposium on Security and Privacy, 2008.

[10] Tools.ietf.org, (2014). RFC 6562 - Guidelines for the Use of Variable Bit Rate Audio with Secure RTP. [online] Available at: http://tools.ietf.org/html/rfc6562#section-2.1.8 [Accessed 30 Jun. 2014].


Chapter I Introduction

Security and performance are two important issues that any network operator should be concerned

about. This concern is heightened when dealing with real-time voice communication. One of the

reasons is that performance directly affects the user experience in such real-time

applications. Furthermore, the content of voice conversations is often personal, and consequently

has stricter privacy requirements than other applications (web browsing, for example).

Enhancing performance requires compression of the network stream, while preserving privacy as a

security aspect requires encryption of the exchanged data. Our project aims at finding the best

way of combining the two operations (compression and encryption), since they do not combine well

by nature: compression removes redundancy from the data, whereas encryption produces output that

appears random and cannot be compressed further. Compression must therefore come first, and the

frame sizes it produces can remain visible after encryption. Our research looks into the possibilities in the

domains of both encryption and compression in order to find the best combination using existing tools.

Compression

To meet performance requirements, one of the most important techniques is compressing the

data exchanged over the network. Because the data is a voice signal, it has a

structure that makes it compressible at high ratios with little or no distortion. Speech

coding has therefore long been an active research area in which many approaches have been adopted with different

perspectives and one goal: minimizing the required bandwidth while preserving voice quality at

an acceptable level.

Speech coding is an application of data compression on digital audio signals containing speech.

Speech coding uses speech-specific parameter estimation using audio signal processing techniques to

model the speech signal, combined with generic data compression algorithms to represent the

resulting modeled parameters in a compact bit-stream. [1]

The techniques employed in speech coding are similar to those used in audio data compression and

audio coding where knowledge in psychoacoustics is used to transmit only data that is relevant to the

human auditory system. For example, in voice-band speech coding, only information in the frequency

band 400 Hz to 3500 Hz is transmitted but the reconstructed signal is still adequate for intelligibility.


A sampling rate of 8 kHz is needed for narrowband coding. Wideband coding, in turn, codes information

in a frequency band reaching 7 to 8 kHz, which requires a sampling rate of 16 kHz.
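The relation between the coded band and the sampling rate quoted above follows from the Nyquist criterion. The snippet below is only an illustrative sketch; the function name is ours and does not come from the report.

```python
def min_sampling_rate_hz(max_freq_hz: float) -> float:
    """Nyquist criterion: sample at no less than twice the highest frequency."""
    return 2.0 * max_freq_hz


# Narrowband speech (content up to about 3.5 kHz) fits under an 8 kHz sampling rate.
assert min_sampling_rate_hz(3500) <= 8000
# Wideband speech reaching 8 kHz needs the full 16 kHz sampling rate.
assert min_sampling_rate_hz(8000) == 16000
```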

Speech coding differs from other forms of audio coding in that speech is a much simpler signal than

most other audio signals, and a lot more statistical information is available about the properties of

speech. As a result, some auditory information which is relevant in audio coding can be unnecessary

in the speech coding context. In speech coding, the most important criterion is preservation of

intelligibility and "pleasantness" of speech, with a constrained amount of transmitted data. [2]

Types of Speech Coders

There are different types of speech encoders:

Waveform coders attempt to code the exact shape of the speech signal waveform, without

considering the nature of human speech production and speech perception. These coders are

high-bit-rate coders (typically above 16 kbps).

Linear prediction coders (LPCs), on the other hand, assume that the speech signal is the output

of a linear time-invariant (LTI) model of speech production. The transfer function of that

model is assumed to be all-pole (autoregressive model). The excitation function is a quasi-periodic

signal constructed from discrete pulses (1 to 8 per pitch period), pseudorandom noise,

or some combination of the two. If the excitation is generated only at the receiver, based on a

transmitted pitch period and voicing information, then the system is designated as an LPC

voice coder (vocoder). LPC vocoders that provide extra information about the spectral shape

of the excitation have been adopted as coder standards between 2.0 and 4.8 kbps.

LPC-based analysis-by-synthesis coders (LPC-AS), on the other hand, choose an excitation

function by explicitly testing a large set of candidate excitations and choosing the best. LPC-

AS coders are used in most standards between 4.8 and 16 kbps.

Sub-band coders are frequency-domain coders that attempt to parameterize the speech signal

in terms of spectral properties in different frequency bands. These coders are less widely used

than LPC-based coders but have the advantage of being scalable and do not model the

incoming signal as speech. Sub-band coders are widely used for high-quality audio coding.

Table I-1 shows the four discussed types of speech coders. [3]

Speech Coder Class   Rates (kbps)   Complexity   Standardized Applications
Waveform coders      16-64          Low          Landline telephone
Sub-band coders      12-256         Medium       Teleconferencing, audio
LPC-AS               4.8-16         High         Digital cellular
LPC vocoder          2.0-4.8        High         Satellite telephony, military

Table I-1: Characteristics of Standardized Speech Coding Algorithms in Each of Four Broad Categories
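To make the all-pole model behind the linear prediction coders described above more concrete, the following sketch estimates predictor coefficients from a signal with the autocorrelation method and the Levinson-Durbin recursion. It is a minimal illustration under our own assumptions: the second-order model `true_coeffs` is hypothetical, and real coders add windowing, quantization, and higher model orders.

```python
def autocorrelation(x, max_lag):
    """r[k] = sum over n of x[n] * x[n-k], for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return (a, err) for the predictor x_hat[n] = sum_k a[k] * x[n-1-k]."""
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        # Reflection coefficient for model order i + 1.
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / err
        new_a = a[:]
        new_a[i] = k
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        err *= 1.0 - k * k  # remaining prediction error energy
    return a, err


def impulse_response(coeffs, length):
    """Output of the all-pole model when driven by a single unit pulse."""
    y = []
    for n in range(length):
        acc = 1.0 if n == 0 else 0.0
        for k, a in enumerate(coeffs):
            if n - 1 - k >= 0:
                acc += a * y[n - 1 - k]
        y.append(acc)
    return y


true_coeffs = [1.3, -0.8]  # hypothetical stable second-order model
signal = impulse_response(true_coeffs, 400)
estimated, residual_energy = levinson_durbin(autocorrelation(signal, 2), 2)
print("estimated predictor coefficients:", estimated)
```

The estimated coefficients should come out very close to the ones used to generate the signal, which is the sense in which the coder only needs to transmit the model parameters plus a description of the excitation.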


Variable Bit-Rate Coding

One of the important techniques in speech coding is using variable bitrate while coding the speech

signal. The main idea behind this technique is the fact that not all speech signals need the same bitrate

in coding. In Variable Bitrate (VBR) coding, the user chooses the desired quality level and/or a range

of allowable bitrates. Then the encoder tries to maintain the selected quality during the whole stream

by choosing the optimal amount of data to represent each frame of audio. The main advantage is that

the user is able to specify the quality level while conserving as much space as possible; the

drawback is that the final output size is quite unpredictable.

Most modern encoders are able to perform VBR encoding, including (but not limited to) nearly all

popular MP3, AAC, (Ogg) Vorbis, Musepack, and WMA encoders. [4]
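As a rough illustration of what variable bitrate means at the frame level, the sketch below converts per-frame sizes into instantaneous and average bitrates. The 20 ms frame duration and the frame sizes are assumptions chosen for the example, not measurements from this project.

```python
FRAME_MS = 20  # frame duration assumed for illustration


def bitrate_kbps(frame_bytes, frame_ms=FRAME_MS):
    """Instantaneous bitrate of one frame (bits per millisecond equals kbit/s)."""
    return frame_bytes * 8 / frame_ms


def average_bitrate_kbps(frame_sizes, frame_ms=FRAME_MS):
    """Average bitrate over a sequence of frames of equal duration."""
    return 8 * sum(frame_sizes) / (len(frame_sizes) * frame_ms)


# Hypothetical VBR output: the encoder spends more bytes on some frames than others.
vbr_frames = [10, 20, 38, 20, 6]
per_frame = [bitrate_kbps(size) for size in vbr_frames]
print("per-frame kbit/s:", per_frame)
print("average kbit/s:", average_bitrate_kbps(vbr_frames))
```

The per-frame values differ from frame to frame, and it is this frame-by-frame variation, rather than the average, that matters for the rest of the report.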

Speech Coding State of the Art

The two most important applications of speech coding are mobile telephony and Voice over IP.

Consequently, the standards for speech compression are published mainly by the bodies

responsible for mobile and telephony technology (ETSI/3GPP and the International

Telecommunication Union, ITU) and by the Internet Engineering Task Force (IETF).

This section presents the most widely used encoders in the domain. Adaptive Multi-Rate (AMR) is

an encoder standardized by ETSI and adopted by 3GPP; it is used in GSM and WCDMA networks. On the other hand,

Opus and Speex are two sibling encoders developed by Xiph.org and adopted by the IETF.

Adaptive Multi Rate (AMR)

The Adaptive Multi-Rate speech coder is based on the Algebraic CELP (ACELP) technology and is

referred to as a Multi-Rate ACELP (MR-ACELP) coder. The coder is capable of operating at 8

different bit-rates denoted coder modes. The frame size is 20 milliseconds with 4 sub-frames of 5

milliseconds. A look-ahead of 5 ms is used. The 12.2 kbit/s mode is equivalent to the GSM EFR

coder while the 7.40 kbit/s mode is equivalent to the EFR coder for the IS-136 system.

The AMR speech coder was developed to fulfill a challenging set of performance requirements for

clean speech, speech in background noise, tandeming, and degraded channel conditions. The highest

mode is the GSM EFR coder, which provides speech quality comparable to fixed-line quality. The

lowest mode provides communication quality. The range of bit-rates and the high quality provide

flexibility to trade quality and capacity as well as to optimize quality under changing channel


conditions. The quality was shown to be significantly higher than for existing speech services in GSM.

[5]
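The frame arithmetic behind the AMR modes can be checked with a short calculation. The list of eight mode rates below is taken from the AMR narrowband specification rather than from the text above (which names only the 12.2 and 7.40 kbit/s modes), so treat it as an outside assumption; the helper name is ours.

```python
# The eight AMR narrowband coder modes in kbit/s (from the AMR specification).
AMR_MODES_KBPS = [4.75, 5.15, 5.90, 6.70, 7.40, 7.95, 10.2, 12.2]
FRAME_MS = 20  # frame size stated in the text


def bits_per_frame(rate_kbps, frame_ms=FRAME_MS):
    """Number of speech bits carried by one frame at the given mode."""
    return round(rate_kbps * frame_ms)


for rate in AMR_MODES_KBPS:
    print(f"{rate:5.2f} kbit/s -> {bits_per_frame(rate)} bits per 20 ms frame")
```

Each mode therefore produces a fixed frame size, and a mode switch shows up on the wire as a change in frame length.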

Opus

The Opus codec was developed by Xiph.org and standardized for multimedia streaming and VoIP

applications by the IETF. Opus can handle a wide range of audio applications, including Voice over

IP, videoconferencing, in-game chat, and even remote live music performances. It can scale from low

bit-rate narrowband speech to very high quality stereo music. Supported features are [7]:

Bit-rates from 6 kb/s to 510 kb/s

Sampling rates from 8 kHz (narrowband) to 48 kHz (fullband)

Frame sizes from 2.5 ms to 60 ms

Support for both constant bit-rate (CBR) and variable bit-rate (VBR)

Audio bandwidth from narrowband to full-band

Support for speech and music

Support for mono and stereo

Support for up to 255 channels (multistream frames)

Dynamically adjustable bitrate, audio bandwidth, and frame size

Good loss robustness and packet loss concealment (PLC)

Floating point and fixed-point implementation

The Opus codec is a real-time interactive audio codec. It is composed of a layer based on Linear

Prediction [LPC] and a layer based on the Modified Discrete Cosine Transform [MDCT]. The main

idea behind using two layers is as follows: in speech, linear prediction techniques (such as Code-

Excited Linear Prediction, or CELP) code low frequencies more efficiently than transform (e.g.,

MDCT) domain techniques, while the situation is reversed for music and higher speech frequencies.

Thus, a codec with both layers available can operate over a wider range than either one alone and can

achieve better quality by combining them than by using either one individually. [6]

The Opus encoder consists of two main blocks: the SILK encoder and the CELT encoder. However,

unlike the decoder, a valid (though potentially suboptimal) Opus encoder is not required to support

all modes and may thus only include a SILK encoder module or a CELT encoder module. The output

bit-stream of the Opus encoding contains bits from the SILK and CELT encoders, though these are

not separable due to the use of a range coder. A block diagram of the encoder is illustrated below.

[6]


The Opus codec is standardized for VoIP applications by the IETF. The reference (RFC 6716) defines

the encoder/decoder. Furthermore, the IETF has published specifications for the packet payload format of

Opus frames.

Speex

The Speex encoder is the sibling of Opus. It is also developed by Xiph.org, and it takes a very similar

approach to Opus. The options featured by the two encoders are similar to a great extent. However,

in our research we are more interested in experimenting with Speex than with Opus. The reason

for this is explained later in the report.

Speex is based on CELP, which stands for Code Excited Linear Prediction. The CELP technique is

based on three ideas:

The use of a linear prediction (LP) model to model the vocal tract

The use of (adaptive and fixed) codebook entries as input (excitation) of the LP model

The search performed in closed-loop in a perceptually weighted domain
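The three ideas listed above can be sketched as a toy analysis-by-synthesis loop: every codebook entry is passed through the LP synthesis filter, and the entry whose output best matches the target is kept. This is a simplified illustration under our own assumptions, not Speex's implementation: it uses a single random fixed codebook, hypothetical LP coefficients, no adaptive codebook, and a plain squared error instead of a perceptually weighted one.

```python
import random


def synthesize(excitation, lp_coeffs):
    """All-pole synthesis filter: y[n] = e[n] + sum_k a[k] * y[n-1-k]."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(lp_coeffs):
            if n - 1 - k >= 0:
                acc += a * out[n - 1 - k]
        out.append(acc)
    return out


def search_codebook(target, codebook, lp_coeffs):
    """Closed-loop search: synthesize each candidate, return (index, gain, error)."""
    best = None
    for idx, code in enumerate(codebook):
        synth = synthesize(code, lp_coeffs)
        energy = sum(s * s for s in synth)
        if energy == 0:
            continue
        # Optimal gain for this candidate in the least-squares sense.
        gain = sum(t * s for t, s in zip(target, synth)) / energy
        err = sum((t - gain * s) ** 2 for t, s in zip(target, synth))
        if best is None or err < best[2]:
            best = (idx, gain, err)
    return best


rng = random.Random(0)
lp_coeffs = [1.3, -0.8]  # hypothetical LP model of the vocal tract
codebook = [[rng.gauss(0, 1) for _ in range(40)] for _ in range(32)]

# Build a target sub-frame from a known entry so the search result can be checked.
target = [0.5 * s for s in synthesize(codebook[7], lp_coeffs)]
index, gain, error = search_codebook(target, codebook, lp_coeffs)
print("chosen entry:", index, "gain:", gain)
```

Only the chosen index and gain would need to be transmitted, since the decoder holds the same codebook and runs the same synthesis filter.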

Speex is designed to compress voice at bitrates ranging from 2 to 44 kbps. Some of Speex's features

include:

Narrowband (8 kHz), wideband (16 kHz), and ultra-wideband (32 kHz) compression in the

same bit-stream

Intensity stereo encoding

Packet loss concealment

Variable bitrate operation (VBR)

Voice Activity Detection (VAD)

Discontinuous Transmission (DTX)

Fixed-point port

Acoustic echo canceller

Noise suppression

Figure I-1: Block Diagram of the Opus Encoder

The following table shows a comparison between the three discussed coders.

Codec              Rate (kHz)   Bitrate (kbps)    Delay, frame+lookahead (ms)   Multirate   VBR   License
Speex              8, 16, 32    2.15-24.6 (NB)    20+10 (NB)                    yes         yes   open-source/free software
                                4-44.2 (WB)       20+14 (WB)
Opus               8, 16, 22    6-510             2.5-60                        yes         yes   open-source
AMR-NB             8            4.75-12.2         20+5?                         yes               proprietary
AMR-WB (G.722.2)   16           6.6-23.85         20+5?                         yes               proprietary

Table I-2 Comparison Between the 3 Speech Encoders

To sum up, the bibliographic work has led us to emphasize the concept of variable bitrate (VBR).

This is due to reasons that are explained in Chapter II. Furthermore, the literature that we build on

in our research is based on the Speex encoder. Consequently, Speex will be our

designated encoder in the test-bed.

Security

The other concern in our research is security. As stated in the beginning of the chapter, privacy has

great importance for real-time voice communication applications, whether in mobile telephony or

voice over IP. In this section, we review the concept of encrypting voice data along with the state of

the art in the field.

Encryption is the process of converting plaintext (unhidden) into ciphertext (hidden) to secure it

against data thieves. The process has a counterpart in which the ciphertext is decrypted at the

other end in order to be understood. As defined in RFC 2828 [Reference], a cryptographic system is "a set of

cryptographic algorithms together with the key management processes that support use of the

algorithms in some application context." This definition covers the whole mechanism that provides

the necessary level of security, comprising network protocols and data encryption algorithms.


The goals of any cryptography system fall into 5 categories:

Authentication: This means that before sending and receiving data using the system, the

identities of the receiver and sender should be verified.

Secrecy or Confidentiality: This function is usually how most people identify a

secure system. It means that only authenticated parties are able to interpret the message

(data) content and no one else.

Integrity: Integrity means that the content of the communicated data is assured to be free

from any type of modification between the end points (sender and receiver). A basic, non-cryptographic

form of integrity check is the checksum carried in IPv4 packets.

Non-Repudiation: This function implies that neither the sender nor the receiver can falsely

deny that they have sent/received a certain message.

Service Reliability and Availability: Secure systems are often attacked by intruders,

which may affect their availability and the type of service offered to their users. Such systems should

provide a way to guarantee their users the quality of service they expect.

The category of our interest is confidentiality. Consequently, the reference to security throughout the

report is meant to address the confidentiality goal of the implemented security system. Furthermore,

the attack on the system is based on traffic analysis and not on conventional cryptanalysis. This idea

is discussed in detail in the next chapter.

Symmetric and Asymmetric Encryption

Data encryption procedures are mainly categorized into two categories depending on the type of

security keys used to encrypt/decrypt the secured data. These two categories are: Asymmetric and

Symmetric encryption techniques. In symmetric encryption, the sender and the receiver agree on a

secret (shared) key. Then they use this secret key to encrypt and decrypt their exchanged messages.

The main concern behind symmetric encryption is how to share the secret key securely between the

two peers. If the key becomes known for any reason, the whole system collapses. In asymmetric

encryption, on the other hand, two keys are used: what Key1 encrypts only

Key2 can decrypt, and vice versa. It is also known as Public Key Cryptography (PKC), because each

user holds two keys: a public key, which is known to everyone, and a private key, which is known only

to the user.
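The key-pair property described above (what one key encrypts, only the other can decrypt, and vice versa) can be demonstrated with textbook RSA on tiny numbers. This is an illustrative sketch only: the primes are far too small to be secure, real systems add padding, and the names used are ours.

```python
# Textbook RSA with tiny primes: insecure, for illustration only.
p, q = 61, 53
n = p * q                  # modulus shared by both keys
phi = (p - 1) * (q - 1)
e = 17                     # public exponent (Key1)
d = pow(e, -1, phi)        # private exponent (Key2); needs Python 3.8+


def apply_key(value, exponent, modulus=n):
    """Modular exponentiation, used for both directions."""
    return pow(value, exponent, modulus)


message = 65

# Encrypt with the public key, decrypt with the private key.
ciphertext = apply_key(message, e)
assert apply_key(ciphertext, d) == message

# And vice versa: transform with the private key, recover with the public key.
signature = apply_key(message, d)
assert apply_key(signature, e) == message
```

Each operation is a modular exponentiation with large numbers in practice, which is considerably more expensive than the bitwise operations of a symmetric cipher.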

In the project, we will be interested in experimenting with symmetric encryption. This is because the

state of the art in the domain of speech encryption is based on symmetric ciphers. The reason

is that symmetric algorithms are in general less computationally complex than asymmetric ones. This reduction in

complexity is of great importance to such real-time applications, which usually run on platforms with

limited capabilities (mobile phones).


Block and Stream Encryption

One of the main categorization methods for encryption techniques commonly used is based on the

form of the input data they operate on. The two types are Block Cipher and Stream Cipher.

A stream cipher operates on a stream of data bit by bit. It consists of two

major components: a key stream generator and a mixing function. The mixing function is usually just an

XOR function, while the key stream generator is the main unit of the stream cipher encryption technique.
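The two components named above can be sketched in a few lines. The key stream generator here is a toy built from SHA-256 over a key, a nonce, and a counter; it is our own stand-in for illustration and not a vetted cipher. The point of the example is that XOR mixing preserves the length of every frame.

```python
import hashlib


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy key stream generator: SHA-256 over key, nonce and a block counter."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def stream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Mixing function: XOR the data with the key stream (same call decrypts)."""
    ks = keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))


key = b"demo key (hypothetical)"
frames = [b"\x01" * 10, b"\x02" * 20, b"\x03" * 38]  # hypothetical VBR frames

for seq, frame in enumerate(frames):
    nonce = seq.to_bytes(4, "big")          # one nonce per frame
    encrypted = stream_xor(key, nonce, frame)
    assert stream_xor(key, nonce, encrypted) == frame
    print(len(frame), "->", len(encrypted))  # ciphertext length equals frame length
```

Because the ciphertext is exactly as long as the plaintext, an observer still sees the original frame sizes even though the content is hidden.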

In a block cipher method, data is encrypted and decrypted in blocks. In its simplest mode, you divide

the plain text into blocks, which are then fed into the cipher system to produce blocks of cipher text.

ECB (Electronic Codebook Mode) is the basic form of block cipher where data blocks are encrypted

directly to generate its correspondent ciphered blocks.

There are many variants of block cipher operation, where different techniques are used to strengthen the

security of the system. The most common modes are: ECB (Electronic Codebook Mode), CBC

(Cipher Block Chaining Mode), and OFB (Output Feedback Mode). Unlike ECB, the CBC mode

uses the cipher block from the previous step of encryption in the current one, which forms a chain-like

encryption process. OFB operates on plain text in a way similar to the stream cipher described

above, where the key stream block used in every step depends on the key stream block from the previous

step. There are other modes like CTR (counter) and CFB (Cipher Feedback). CTR mode is used to

transform a block cipher into a stream cipher. The idea is simple: the block cipher is used to generate a

key stream, which is mixed (mainly XORed) with the plain text.
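The difference between ECB and CBC described above can be seen with a toy block cipher. The 16-byte Feistel construction below is our own illustrative stand-in (it is not AES and must not be used for real encryption), and the input is assumed to be a whole number of blocks.

```python
import hashlib

BLOCK = 16  # block size in bytes


def _round(key: bytes, i: int, half: bytes) -> bytes:
    """Round function of the toy Feistel cipher."""
    return hashlib.sha256(key + bytes([i]) + half).digest()[:BLOCK // 2]


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt_block(key: bytes, block: bytes) -> bytes:
    left, right = block[:8], block[8:]
    for i in range(4):
        left, right = right, _xor(left, _round(key, i, right))
    return left + right


def decrypt_block(key: bytes, block: bytes) -> bytes:
    left, right = block[:8], block[8:]
    for i in reversed(range(4)):
        left, right = _xor(right, _round(key, i, left)), left
    return left + right


def ecb_encrypt(key: bytes, data: bytes) -> bytes:
    """ECB: every block is encrypted independently."""
    return b"".join(encrypt_block(key, data[i:i + BLOCK])
                    for i in range(0, len(data), BLOCK))


def cbc_encrypt(key: bytes, iv: bytes, data: bytes) -> bytes:
    """CBC: each block is XORed with the previous cipher block first."""
    out, prev = [], iv
    for i in range(0, len(data), BLOCK):
        prev = encrypt_block(key, _xor(data[i:i + BLOCK], prev))
        out.append(prev)
    return b"".join(out)


key, iv = b"demo key", bytes(BLOCK)
data = b"A" * (2 * BLOCK)  # two identical plaintext blocks

ecb = ecb_encrypt(key, data)
cbc = cbc_encrypt(key, iv, data)
print("ECB blocks equal:", ecb[:BLOCK] == ecb[BLOCK:])
print("CBC blocks equal:", cbc[:BLOCK] == cbc[BLOCK:])
```

In ECB the repetition in the plaintext survives into the ciphertext, whereas chaining hides it. In both modes the output length is a multiple of the block size, which is the property the padding approach in this project relies on.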

The recommended mode of operation for real-time voice communication is the stream

cipher. This is due to the nature of the transferred data, which is in the form of a stream. However, in this project we

explore the option of using a block cipher. The feasibility of using block ciphers for encryption of voice

data comes from the perspective of trading off performance for security. We will elaborate more on

that later in the course of the report.

Common Encryption Algorithms

Here we discuss five of the best-known ciphers in the state of the art. Among these algorithms,

AES and KASUMI are used to secure real-time voice communication. AES is standardized

for voice over IP in the Secure Real-time Transport Protocol (SRTP), which is a profile of RTP. On

the other hand, KASUMI was standardized by 3GPP and ETSI for GSM and subsequent communication

systems.

DES: The Data Encryption Standard was the first encryption standard to be recommended by NIST

(National Institute of Standards and Technology). It is based on the algorithm proposed by IBM called

Lucifer. DES became a standard in 1977. Since that time, many attacks and methods have been recorded that

exploit the weaknesses of DES, which made it an insecure block cipher.

3DES: As an enhancement of DES, the 3DES (Triple DES) encryption standard was proposed. In this

standard the encryption method is similar to the one in the original DES but applied three times to increase

the encryption level. As a consequence, 3DES is known to be slower than other block cipher methods.

AES: The Advanced Encryption Standard is the encryption standard recommended by NIST to

replace DES. The Rijndael (pronounced "Rain Doll") algorithm was selected in 2000, after a competition launched in 1997 to

select the best encryption standard. Brute force is the only effective attack known against it, in

which the attacker tries all possible key combinations to unlock the encryption. Both AES

and DES are block ciphers.
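To give a feeling for what brute force means for the two standards above, the sketch below estimates the time needed to try every key. The key lengths (56 bits for DES, 128 bits for the smallest AES key) are standard facts not stated in the text, and the search speed of 10^12 keys per second is a purely hypothetical assumption.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_search(key_bits: int, keys_per_second: float) -> float:
    """Worst-case time to try every key of the given length, in years."""
    return (2 ** key_bits) / keys_per_second / SECONDS_PER_YEAR


ASSUMED_RATE = 1e12  # hypothetical attacker speed, keys per second

for name, bits in [("DES", 56), ("AES-128", 128)]:
    print(f"{name}: {years_to_search(bits, ASSUMED_RATE):.3g} years")
```

Under this assumption the DES key space is exhausted in well under a year, while the AES-128 key space remains far out of reach, which is why brute force is not a practical threat to AES.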

Blowfish: It is one of the most common public domain encryption algorithms, provided by Bruce

Schneier, one of the world's leading cryptologists and the president of Counterpane Systems, a

consulting firm specializing in cryptography and computer security. Blowfish is a variable-length key,

64-bit block cipher. The Blowfish algorithm was first introduced in 1993. This algorithm can be

optimized in hardware applications, though it is mostly used in software applications. Though it suffers

from a weak-keys problem, no attack is known to be successful against it.

KASUMI: It is a block cipher used in UMTS, GSM, and GPRS mobile communications systems. In

UMTS, KASUMI is used in the confidentiality (f8) and integrity algorithms (f9) with names UEA1

and UIA1, respectively. In GSM, KASUMI is used in the A5/3 key stream generator and in GPRS in

the GEA3 key stream generator.

KASUMI was designed for 3GPP to be used in UMTS security system by the Security Algorithms

Group of Experts (SAGE), a part of the European standards body ETSI. SAGE agreed with 3GPP

technical specification group (TSG) for system aspects of 3G security (SA3) to base the development

on an existing algorithm that had already undergone some evaluation. They chose the cipher

algorithm MISTY1 developed and patented by Mitsubishi Electric Corporation. The original

algorithm was slightly modified for easier hardware implementation and to meet other requirements

set for 3G mobile communications security.


In January 2010, Orr Dunkelman, Nathan Keller and Adi Shamir released a paper showing that they

could break KASUMI with a related-key attack and very modest computational resources. Interestingly,

the attack is ineffective against MISTY1.

Report Structure

In the first chapter of this report, we were acquainted with the state of the art of both compression

and encryption. We reviewed the encoding concepts along with the widely used encoders. We also

reviewed security in a brief manner. Cipher types and modes were presented with emphasis on the

application of VoIP and Mobile telephony.

In the second chapter, we have a brief literature review stating the main problem the project tries to

tackle: the bad combination between VBR and stream encryption. The papers stating security

vulnerabilities are reviewed briefly. The solution for the problem is discussed and the perspective that

the project works in is determined.

Chapter III exhibits the test-bed that we created in order to test for bitrates. The test-bed consists

of 3 main elements (or stages): encoding, encryption, sending/receiving.

The fourth and final chapter includes all the obtained results. These results are the obtained bitrates

throughout different setups spanning the whole space of options found in our field of interest. This

chapter also includes the concluding statement along with future recommendations.


Chapter II Literature Review and Problem Formulation

The main problem tackled in this project can be stated briefly. The combination of variable bit-rate compression and length-preserving encryption (a stream cipher) induces a security weakness in the form of vulnerability to traffic analysis. The solution is to reduce information leakage by reducing the variation of packet sizes in the transmitted stream. This is achieved by relying on constant bit-rate (CBR) encoding or by using padding. In brief, our project focuses on analyzing the cost of padding in terms of bandwidth. We aim to test padding and reach a conclusion about its cost and, consequently, its feasibility. The outcome of the research project should answer the question of whether a trusted security level can be gained using existing tools.

In this chapter, we exhibit the weakness invoked by using variable bit-rate compression and then we

discuss the perspective adopted in tackling this problem.

Traffic Analysis of Encrypted Voice Stream

In 2007, a paper was published under the title "Language Identification of Encrypted VoIP Traffic". It was followed in 2008 by another paper, "Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations". The most important paper in this context was published in 2011 under the title "Phonotactic Reconstruction of Encrypted VoIP Conversations: Hookt on fon-iks". The common idea that can be inferred from the titles is the extraction of certain information (language, some phrases, phoneme reconstruction) from an encrypted VoIP stream. A key point is not revealed in the titles: such extraction relies on variable bit-rate compression.

The Secure RTP (SRTP) framework [RFC3711] is a widely used framework for securing RTP

sessions [RFC3550]. SRTP provides the ability to encrypt the payload of an RTP packet, and

optionally add an authentication tag, while leaving the RTP header and any header extension in the

clear. A range of encryption transforms can be used with SRTP, but none of the predefined encryption

transforms use any padding; the RTP and SRTP payload sizes match exactly.

When using SRTP with voice streams compressed using variable bit rate (VBR) codecs, the length

of the compressed packets will depend on the characteristics of the speech signal. This variation in

packet size will leak a small amount of information about the contents of the speech signal. This is

potentially a security risk for some applications. For example, [spot-me] shows that known phrases


in an encrypted call using the Speex codec in VBR mode can be recognized with high accuracy in

certain circumstances, and [fon-iks] shows that approximate transcripts of encrypted VBR calls can

be derived for some codecs without breaking the encryption. How significant these results are, and

how they generalize to other codecs, is still an open question. RFC 6562 discusses ways in which

such traffic analysis risks may be mitigated.

Information leakage via variable bit-rate

Generally speaking, the codec takes as input the audio stream from the user, which is typically

sampled at either 8000 or 16000 samples per second (Hz). At some fixed interval, the codec takes the

n most recent samples from the input, and compresses them into a packet for efficient transmission

across the network. To achieve the low latency required for real-time performance, the length of the

interval between packets is typically fixed between 10 and 50ms, with 20ms being the common case.

Thus for a 16 kHz audio source, we have n = 320 samples per packet, or 160 samples per packet for

the 8 kHz case.

Many common voice codecs are based on a technique called code-excited linear prediction (CELP).

For each packet, a CELP encoder simply performs a brute-force search over the entries in a codebook

of audio vectors to output the one that most closely reproduces the original audio. The quality of the

compressed sound is therefore determined by the number of entries in the codebook. The index of the

best-fitting codebook entry, together with the linear predictive coefficients and the gain, make up the

payload of a CELP packet. The larger code books used for higher-quality encodings require more bits

to index, resulting in higher bit rates and therefore larger packets.
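As a rough illustration of the two preceding paragraphs, the following sketch computes the number of samples per frame and the payload size implied by a codec bit rate. The function names are our own (hypothetical), and the calculation assumes one codec frame per packet with the payload rounded up to whole bytes.

```c
#include <stddef.h>

/* Number of audio samples in one frame: sampling rate times frame duration. */
size_t samples_per_frame(unsigned sample_rate_hz, unsigned frame_ms)
{
    return (size_t)sample_rate_hz * frame_ms / 1000u;
}

/* Payload bytes for one frame at a given codec bit rate, rounded up. */
size_t payload_bytes(unsigned bit_rate_bps, unsigned frame_ms)
{
    size_t bits = (size_t)bit_rate_bps * frame_ms / 1000u;
    return (bits + 7u) / 8u;
}
```

For example, a 20 ms frame at 24,600 bps occupies 492 bits (62 bytes), while a frame at 8,000 bps occupies only 20 bytes. With a length-preserving cipher, an eavesdropper sees this difference directly in the packet length.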

In some CELP variants, such as QCELP, Speex's variable bit rate mode, or the approach advocated

by Zhang et al., the encoder adaptively chooses the bit rate for each packet in order to achieve a good

balance of audio quality and network bandwidth. This approach is appealing because the decrease in

data volume may be substantial, with little or no loss in quality. In a two-way call, each participant is

idle roughly 63% of the time, so the savings may be substantial. Unfortunately, this approach can also

cause substantial leakage of information in encrypted VoIP calls because, in the standard specification

for Secure RTP (SRTP), the cryptographic layer does not pad or otherwise alter the size of the original

RTP payload.


Intuitively, the sizes of CELP packets leak information because the choice of bit rate is largely based on the audio encoded in the packet's payload. For example, the variable bit-rate Speex codec encodes vowel sounds at higher bit rates than fricative sounds like "f" or "s". In phonetic models of speech, sounds are broken down into several different categories, including the aforementioned vowels and fricatives, as well as stops like "b" or "d", and affricatives like "ch". Each of these canonical sounds is called a phoneme, and the pronunciation for each word in the language can then be given as a sequence of phonemes. While there is no consensus on the exact number of phonemes in spoken English, most in the speech community put the number between 40 and 60.

In [9], to demonstrate the relationship between bit rate and phonemes, several recordings from the TIMIT corpus of phonetically-rich English speech were encoded using Speex in wideband variable bit rate mode, and the bit rate used to encode each phoneme was observed. The probabilities for 8 of the 21 possible bit rates are shown for a handful of phonemes in Figure II-1. As expected, the two vowel sounds, "aa" and "aw", are typically encoded at significantly higher bit rates than the fricative "f" or the consonant "k". Moreover, large differences in the frequencies of certain bit rates (namely, 16.6, 27.8, and 34.2 kbps) can be used to distinguish "aa" from "aw" and "f" from "k".

Figure II-1: Distribution of bit rates used to encode four phonemes with Speex

Figure II-2: Packets for "artificial"

Figure II-3: Packets for "intelligence"


In fact, it is these differences in bit rate for the phonemes that make recognizing words and phrases in encrypted traffic possible. To illustrate the patterns that occur in the stream of packet sizes when a certain word is spoken, the authors of [9] examined the sequences of packets generated by encoding several utterances of the words "artificial" and "intelligence" from the TIMIT corpus. They represent the packets for each word visually in Figures II-2 and II-3 as a data image: a grid with bit rate on the y-axis and position in the sequence on the x-axis. Starting with a plain white background, the cell at position (x, y) is darkened each time a packet encoded at bit rate y is observed at position x for the given word. In both graphs, several dark gray or black grid cells appear where the same packet size is consistently produced across different utterances of the word, and these dark spots are closely related to the phonemes in the two words. In Figure II-2, the bit rate in the 2nd to 5th packets (the "a" in "artificial") is usually quite high (35.8 kbps), as we would expect for a vowel sound. Then, in packets 12 to 14 and 20 to 22, we see much lower bit rates for the fricative "f" and affricative "sh". Similar trends are visible in Figure II-3; for example, the "t" sound maps consistently to 24.6 kbps in both words.
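The construction of such a data image can be sketched as a simple two-dimensional counter. This is only our illustration of the idea, not the code used in [9]: the type and function names are hypothetical, bit rates are represented by an index from 0 to 20, and the maximum number of positions is an arbitrary assumption.

```c
#include <stddef.h>
#include <string.h>

#define DI_MAX_POS   32   /* packets per utterance kept in the image (assumed) */
#define DI_NUM_RATES 21   /* distinct wideband bit rates, as stated in the text */

typedef struct {
    unsigned count[DI_NUM_RATES][DI_MAX_POS];  /* darkness of each grid cell */
} data_image;

/* Start from a plain white background (all counters zero). */
void data_image_clear(data_image *img)
{
    memset(img, 0, sizeof *img);
}

/* Add one utterance: rate_idx[x] is the bit-rate index of the packet at position x. */
void data_image_add(data_image *img, const unsigned *rate_idx, size_t n)
{
    for (size_t x = 0; x < n && x < DI_MAX_POS; x++) {
        if (rate_idx[x] < DI_NUM_RATES)
            img->count[rate_idx[x]][x]++;   /* darken cell (x, y) */
    }
}
```

Cells whose counter approaches the number of utterances correspond to the dark spots described above: positions where the same bit rate is produced almost every time the word is spoken.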

Example of traffic analysis

In the paper "Uncovering spoken phrases in encrypted VoIP conversations" [9], the method adopted for analyzing the encrypted VoIP stream can be summarized as follows:

To identify a phrase without using any examples of the phrase or any of its constituent words, a concatenative synthesis technique is applied to generate a few hundred synthetic training sequences for the phrase. These sequences are used to train a profile HMM (hidden Markov model) for the phrase, which is then used to search for the phrase in streams of packets. An overview of the entire training and detection process is given in Figure II-4.

Figure II-2: Overview of training and detection process


Mitigation Techniques

One way to prevent word spotting would be to pad packets to a common length, or at least to a coarser granularity. Another way is to refrain from using VBR and use CBR mode instead. However, this is not optimal, since it gives up the bandwidth savings of VBR. Padding regains the lost security (to a certain extent, as we will see) while preserving some of the benefit of variable bit-rate encoding.

In [9], the traffic analysis system (search algorithm) was tested against padding. To explore the tradeoff between padding and search accuracy, the authors encrypted both their training and testing data sets to multiples of 128, 256, or 512 bits and applied their approach. The results are presented in Figure II-4. The use of padding is quite encouraging as a mitigation technique, as it greatly reduced the overall accuracy of the search algorithm. When padding to multiples of 128 bits, the system achieves only 0.15 recall at 0.16 precision. Increasing padding so that packets are multiples of 256 bits gives a recall of 0.04 at 0.04 precision.
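The reason padding degrades the search can be seen by counting how many different packet sizes remain visible to an observer after rounding up to a block boundary. The sketch below is our own illustration, not taken from [9]; the function names are hypothetical, and the eight frame sizes used in the usage note are illustrative values in bytes.

```c
#include <stddef.h>

/* Round len up to the next multiple of block (block > 0). */
size_t pad_to_multiple(size_t len, size_t block)
{
    return (len + block - 1) / block * block;
}

/* Count how many different packet sizes remain visible after padding. */
size_t distinct_padded_sizes(const size_t *sizes, size_t n, size_t block)
{
    size_t distinct = 0;
    for (size_t i = 0; i < n; i++) {
        size_t p = pad_to_multiple(sizes[i], block);
        int seen = 0;
        for (size_t j = 0; j < i; j++) {
            if (pad_to_multiple(sizes[j], block) == p) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            distinct++;
    }
    return distinct;
}
```

With the illustrative frame sizes 6, 10, 15, 20, 28, 38, 46, and 62 bytes, an observer can tell apart eight sizes without padding, four sizes with 16-byte (128-bit) padding, two with 32-byte (256-bit) padding, and a single size with 64-byte (512-bit) padding.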

The debate around the announcement of security flaws in variable bit-rate encoding led to the publication of an RFC by the IETF. The document, "Guidelines for the Use of Variable Bit Rate Audio with Secure RTP" (RFC 6562), gives guidelines for dealing with variable bit-rate audio in the SRTP protocol.

For scenarios where VBR is considered unsafe, a constant bit rate (CBR) codec SHOULD be negotiated and used instead, or the VBR codec SHOULD be operated in a CBR mode. However, if the codec does not support CBR, RTP padding SHOULD be used to reduce the information leak to an insignificant level. Packets may be padded to a constant size or to a small range of sizes ([spot-me] achieves good results by padding to the next multiple of 16 octets, but the amount of padding needed to hide the variation in packet size will depend on the codec and the sophistication of the attacker) or may be padded to a size that varies with time. The most secure and RECOMMENDED option is to pad all packets throughout the call to the same size.

Figure II-3: Robustness to padding

In the case where the size of the padded packets varies in time, the same concerns as for VAD apply.

That is, the padding SHOULD NOT be reduced without waiting for a certain (random) time. The

RECOMMENDED "hold time" is the same as the one for VAD.

Note that SRTP encrypts the count of the number of octets of padding added to a packet, but not the

bit in the RTP header that indicates that the packet has been padded. For this reason, it is

RECOMMENDED to add at least one octet of padding to all packets in a media stream, so an attacker

cannot tell which packets needed padding.[10]
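The padding rule quoted above can be sketched as follows. This is a minimal illustration under our own assumptions, not code from the RFC: the function name is hypothetical, the payload is padded to a multiple of a chosen block size, and at least one octet is always added. The layout follows RTP (RFC 3550), where the last padding octet holds the padding count and the P bit (0x20) in the first header octet marks the packet as padded.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RTP_PADDING_BIT 0x20u  /* P bit in the first octet of the RTP header */

/*
 * pkt holds hdr_len header octets followed by payload_len payload octets.
 * The payload is padded up to the next multiple of `block` octets, always
 * adding at least one octet. Returns the new packet length, or 0 if the
 * result does not fit in `cap` or would need more than 255 padding octets.
 */
size_t rtp_pad_packet(uint8_t *pkt, size_t hdr_len, size_t payload_len,
                      size_t cap, size_t block)
{
    if (pkt == NULL || hdr_len == 0 || block == 0)
        return 0;
    size_t padded_payload = (payload_len / block + 1) * block;
    size_t npad = padded_payload - payload_len;      /* 1..block */
    size_t total = hdr_len + padded_payload;
    if (total > cap || npad > 255)
        return 0;
    memset(pkt + hdr_len + payload_len, 0, npad - 1); /* padding octets */
    pkt[total - 1] = (uint8_t)npad;                   /* last octet: count */
    pkt[0] |= RTP_PADDING_BIT;                        /* mark as padded */
    return total;
}
```

Because every packet receives at least one padding octet, the P bit is set on all packets and no longer distinguishes frames that happened to fall on a block boundary.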


Chapter III Test-Bed

In the previous chapter, we exhibited the security weakness provoked by the combination of variable bit-rate encoding and length-preserving encryption. This weakness takes the form of vulnerability to traffic analysis. The performance of the traffic analysis system presented in the previous chapter degrades as packets are padded to increasing block lengths.

Furthermore, because padding preserves security to a great extent, the IETF recommends in RFC 6562 either using constant bit-rate encoding or relying on padding, for example to a block length of 16 bytes.

The discussion around the subject has not taken into consideration the tradeoff between security and performance. The question of the feasibility of padding remains open. A key point to keep in mind is that variable bit-rate encoding aims at lowering the needed bandwidth as much as possible. Consequently, the cost of padding in terms of bit rate and needed bandwidth has to be calculated in order to understand the price we have to pay to achieve security while using variable bit-rate encoding.

Answering the question of the feasibility of padding is the main goal of our research project. The answer might be that padding costs more than constant bit-rate encoding and that, consequently, padding is not the optimal solution for preserving security. In any case, we aim at obtaining a solid picture of the cost paid for different security levels. The results of our test-bed should give a good understanding of the relation between security, quality, and performance.

Quality is a parameter we take in our research as part of tradeoff formula. The quality of the encoder

is usually mapped to the bitrate used by it. Consequently, the quality can be inserted into the tradeoff

formulation as a price to pay for preserving both security and bitrate.

In order to test properly and calculate the obtained bitrates, we need to create a system in which we implement compression, encryption, sending and receiving of a voice stream. The system should allow the manipulation of the parameters that we are interested in.

Test-bed requirements

The created system must be able to implement compression and encryption of a speech stream. Furthermore, the system should allow the manipulation of parameters for both compression and encryption. One more important requirement is the ability to send and receive the compressed and encrypted stream. Sending/receiving involves packetization of the stream in a realistic manner that can be related to a real application. The system should also be able to log the obtained bitrates for every setup.

For compression, we should be able to choose the mode (narrow-band, wideband). In addition to that,

we should be able to choose the quality of compression. The quality variable is an important variable

that is supported by many algorithms that form the state of the art. We emphasize the ability to choose

quality since we are interested in inserting quality as a parameter in the tradeoff setup as we can see

later in the results section.

In encryption, the main requirement is the ability to pad data to a multiple of 128, 256, or 512 bits. In addition, we need to adopt a cipher which is trusted in the state of the art. The cipher should have a low cost in terms of processing time, since the platforms are usually mobile phones with limited memory and processing power. One additional requirement is that the cipher be symmetric, since voice security protocols such as SRTP rely on symmetric encryption/decryption for the media stream.

The requirements can be summarized in a compact format as follows:

Compression:

o Widely implemented encoder

o Variable bit-rate compression

o Variable quality setting

Encryption:

o Trusted low cost cipher

o Padding to different sizes

o Symmetric cipher

Test-bed elements

Based on the discussed requirements, the search for an encoder and a cipher is aimed at finding

modules widely present in the state of the art. The test-bed is built in a Linux environment (Ubuntu

distribution of GNU/Linux). The libraries used are all written in the C programming language;

consequently, the test-bed was also written in C.


For the encoder, we chose Speex, since it meets all the stated requirements. Furthermore, it is the encoder used in the three articles that describe the security vulnerability.

Regarding encryption, the choice was obvious: the Advanced Encryption Standard (AES). AES is standardized and adopted in SRTP, the main standard for security in voice over IP. However, SRTP specifications and implementations use AES in CTR (counter) mode. This mode generates a key stream and mixes it with the data (using the XOR operation) to obtain the ciphertext. The length of the initial plaintext is preserved. Consequently, this mode turns a block cipher into a length-preserving stream cipher regardless of the block size of the cipher. The other transforms specified by SRTP are AES in f8 mode and the null cipher.
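The length-preserving property of counter mode follows directly from the XOR step, which can be sketched without any cryptographic library. In the sketch below the key stream is simply passed in by the caller; this is an assumption made for illustration only, since in AES-CTR the key stream is produced by encrypting successive counter blocks. The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Combine data with a key stream byte by byte. The output always has
 * exactly n bytes, the same length as the input, whatever the block
 * size of the cipher that generated the key stream.
 */
void xor_keystream(uint8_t *out, const uint8_t *in,
                   const uint8_t *keystream, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] ^ keystream[i];
}
```

Applying the same function twice with the same key stream recovers the original frame. A 20-byte Speex frame therefore yields a 20-byte ciphertext, which is exactly what exposes the frame size to traffic analysis.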

It is worth mentioning that the RFC giving guidelines for using variable bit-rate audio with SRTP recommends relying on higher levels of the networking model to achieve padding: padding is performed by the codec or, more generally, at the application layer (RTP padding). In our approach, however, we used a block cipher in the test bed. The choice of a block cipher does not affect the desired results in any way. Furthermore, making padding part of the encryption process is justified in terms of security requirements. Implementing padding in the compression stage or in another entity may introduce security vulnerabilities that are avoidable by using a block cipher. For example, when padding is done within the RTP packet, the count of padding bytes is part of the encrypted portion of the packet, but the header flag indicating that the packet is padded is not encrypted.

Speex Encoder

In our test bed, we used the Speex encoder as the designated compression tool. We used the Speex library and built the encoder step by step using the Speex API (Application Programming Interface). We chose this approach because manipulating parameters and managing the encoder's output requires such a construction rather than a prebuilt ready-to-use module.

The libspeex library contains all the functions for encoding and decoding speech with the Speex codec.

When linking on a UNIX system, we must add -lspeex -lm to the compiler command line.

In order to encode speech using Speex, we first need to:

#include <speex/speex.h>

Then in the code, a Speex bit-packing struct must be declared, along with a Speex encoder state:


SpeexBits bits;

void *enc_state;

The two are initialized by:

speex_bits_init(&bits);

enc_state = speex_encoder_init(&speex_nb_mode);

For wideband coding, speex_nb_mode is replaced by speex_wb_mode. In most cases, we need to know the frame size used at the selected sampling rate; it can be queried through speex_encoder_ctl() with the SPEEX_GET_FRAME_SIZE request.

The encoder is set to CBR mode by default. We switch it to variable bit-rate mode by using:

speex_encoder_ctl(enc_state,SPEEX_SET_VBR,&vbr);

The variable vbr is an integer (0 or 1) that turns VBR on (1) or off (0).

There are many parameters that can be set for the Speex encoder, but the most useful one is the quality

parameter that controls the quality vs. bit-rate tradeoff.

This is set by:

speex_encoder_ctl(enc_state,SPEEX_SET_VBR_QUALITY,&quality);

Quality is a float value ranging from 0.0 to 10.0 (inclusive). The mapping between quality and bit-rate is described in the following 2 tables for both narrowband and wideband.

Mode  Quality  Bit-rate (bps)  mflops  Quality/description
0     -        250             0       No transmission (DTX)
1     0        2,150           6       Vocoder (mostly for comfort noise)
2     2        5,950           9       Very noticeable artifacts/noise, good intelligibility
3     3-4      8,000           10      Artifacts/noise sometimes noticeable
4     5-6      11,000          14      Artifacts usually noticeable only with headphones
5     7-8      15,000          11      Need good headphones to tell the difference
6     9        18,200          17.5    Hard to tell the difference even with good headphones
7     10       24,600          14.5    Completely transparent for voice, good quality music
8     1        3,950           10.5    Very noticeable artifacts/noise, good intelligibility
9     -        -               -       reserved
10    -        -               -       reserved
11    -        -               -       reserved
12    -        -               -       reserved
13    -        -               -       Application-defined, interpreted by callback or skipped
14    -        -               -       Speex in-band signaling
15    -        -               -       Terminator code

Table III-1 Quality versus bitrate for Speex narrowband

Mode/Quality  Bit-rate (bps)  Quality/description
0             3,950           Barely intelligible (mostly for comfort noise)
1             5,750           Very noticeable artifacts/noise, poor intelligibility
2             7,750           Very noticeable artifacts/noise, good intelligibility
3             9,800           Artifacts/noise sometimes annoying
4             12,800          Artifacts/noise usually noticeable
5             16,800          Artifacts/noise sometimes noticeable
6             20,600          Need good headphones to tell the difference
7             23,800          Need good headphones to tell the difference
8             27,800          Hard to tell the difference even with good headphones
9             34,400          Hard to tell the difference even with good headphones
10            42,400          Completely transparent for voice, good quality music

Table III-2 Quality versus bitrate for Speex wideband

Once the initialization is done, for every input frame:

speex_bits_reset(&bits);

speex_encode_int(enc_state, input_frame, &bits);

nbBytes = speex_bits_write(&bits, byte_ptr, MAX_NB_BYTES);

Here input_frame is a (short *) pointing to the beginning of a speech frame, byte_ptr is a (char *) where the encoded frame will be written, MAX_NB_BYTES is the maximum number of bytes that can be written to byte_ptr without causing an overflow, and nbBytes is the number of bytes actually written to byte_ptr (the encoded size in bytes). Before calling speex_bits_write, it is possible to find the number of bytes that need to be written by calling speex_bits_nbytes(&bits).


After the encoding is done, all resources are freed with:

speex_bits_destroy(&bits);

speex_encoder_destroy(enc_state);

AES Encryption

The choice of the AES cipher is justified in the previous section of the chapter. However, the

algorithm has a high number of implementations. Among these, a trusted and well known library in

the state of the art is OpenSSL.

OpenSSL provides two primary libraries: libssl and libcrypto. The libcrypto library provides the

fundamental cryptographic routines used by libssl. You can however use libcrypto without using

libssl.

For most uses, users should use the high level interface that is provided for performing cryptographic

operations. This is known as the EVP interface (short for Envelope). This interface provides a suite

of functions for performing encryption/decryption (both symmetric and asymmetric),

signing/verifying, as well as generating hashes and MAC codes, across the full range of OpenSSL

supported algorithms and modes. Working with the high level interface means that a lot of the

complexity of performing cryptographic operations is hidden from view. A single consistent API is

provided. In addition low level issues such as padding and encryption modes are all handled.

The EVP functions provide a high level interface to OpenSSL cryptographic functions. They provide

the following features:

A single consistent interface regardless of the underlying algorithm or mode

Support for an extensive range of algorithms

Encryption/Decryption using both symmetric and asymmetric algorithms

Sign/Verify

Key derivation

Secure Hash functions

Message Authentication Codes

Support for external crypto engines


AES is available in libcrypto in several modes and with key sizes of 128, 192, and 256 bits. The AES block size itself is fixed at 128 bits (16 bytes), so the padding applied by the cipher in CBC mode always rounds the data up to a multiple of 128 bits; neither the library nor the AES standard provides a 256-bit or 512-bit block. In the test bed, we used the algorithm in CBC mode through EVP_aes_128_cbc() and EVP_aes_256_cbc() for the 128 and 256 settings, and we relied on manual padding to obtain the 512-bit size.

Although the use of EVP as a high-level interface simplifies using the library to a great extent, using EVP in a complex test bed with multi-stage procedures may introduce complexity.

To encrypt using EVP, first we have to:

#include <openssl/evp.h>

The encryption process starts with initializing the cipher. We have to create a context, an "opaque" structure that libcrypto uses to record the status of an encrypt or decrypt operation:

EVP_CIPHER_CTX *e_ctx = EVP_CIPHER_CTX_new();

Then we have to create a key and IV (initialization vector) for the cipher. A SHA1 digest is used to hash

the supplied key material (password) multiple times (rounds). More rounds are more secure but

slower. Then after setting the key and IV, we call:

EVP_CIPHER_CTX_init(e_ctx);

EVP_EncryptInit_ex(e_ctx, EVP_aes_256_cbc(), NULL, key, iv);

This initializes AES encryption in CBC mode with the 256-bit variant, EVP_aes_256_cbc(), selected by the second parameter.

To select the 128-bit variant instead, we call:

EVP_EncryptInit_ex(e_ctx, EVP_aes_128_cbc(), NULL, key, iv);

Encryption of the Speex frame then takes place in the following manner:

EVP_EncryptInit_ex(e_ctx, NULL, NULL, NULL, NULL);

EVP_EncryptUpdate(e_ctx, ciphertext, &c_len, plaintext, *len);

EVP_EncryptFinal_ex(e_ctx, ciphertext+c_len, &f_len);

Note: neither decoding nor decryption of the stream is implemented in the test bed. Although implementing them would add value and integrity to the results, the bitrates can be calculated without decryption or decoding.


RTP Sending/Receiving

The previous 2 stages of the operations held in the test bed allow calculating bitrate in the absence of

packetization. To achieve realistic results, we implement sending and receiving of the stream in two

separate threads. Then we calculate bitrates of the received and dumped packets.

The library used for RTP sending/receiving is oRTP, an implementation of the RTP protocol. A number

of calls must be made to initialize the library. The first of these is RTPCreate(), which

establishes a context. A context is an identifier used by the library to determine which RTP session a

function call is to be associated with. An application can run many sessions at the same time, each

created with a separate call to RTPCreate, resulting in a different context for each. Most library

functions accept a context as the first argument. Once RTPCreate has been called to initialize the

session, the addresses for the session must be set.

rtperror RTPCreate(context *the_context);

rtperror RTPOpenConnection(context cid);

Sending packets is fairly straightforward. The RTPSend() function is used to tell the library to send

an RTP packet. It requires the user to pass a pointer to a buffer, a length, a value for the marker field

in the RTP header, an increment for the timestamp, and the context. The library will take the buffer,

add the RTP header, perform any required operations, and send the packet. The library will

automatically send RTCP packets. The initial timestamp and sequence number are chosen randomly.

rtperror RTPSend(context cid, int32 tsinc, int8 marker, int16 pti, int8 *payload, int len);

Receiving packets is a little more complex. In order to know if a packet is available for reading, a

process can block, it can poll, or use any other kind of mechanism. Since the library does not dictate

this policy, it is up to us to determine when data is available for reading. We choose polling every 20

milliseconds in order to check for a received packet. To do this, the library allows access to the

receive sockets. There are two: one for RTP, one for RTCP. The functions RTPSessionGetRTPSocket

and RTPSessionGetRTCPSocket are used to do this. They take as input the context and a pointer to

a socket. When they return 1, the socket has been filled in. We then check for the presence of an RTP

packet on these sockets using select().

RTPSessionGetRTPSocket(context cid, socktype *value);

rtperror RTPReceive(context cid, int socket, char *rtp_pkt_stream, int *len);


When a packet is present on either socket, the application should call the function RTPReceive().

This function takes the context, the socket on which data is present, a pointer to a buffer, and a pointer

to a length value. The length value should be initialized to the amount of room in the buffer. The

library will read and process the RTP or RTCP packet. For RTCP, it will perform all statistics

collection and parsing. The buffer will be filled in with the entire RTP/RTCP packet, including the

header.

We then save the whole received packet into a file for further calculation of the obtained bitrate. The

bitrate is calculated based on the previously known duration of the sent speech.

Dataset

The choice of the data set was guided by the dataset used in the articles published about the subject. They used the TIMIT corpus, a database used for speech recognition. Since the TIMIT database is not open for public use, we chose to work with another speech recognition training database: the census database. The following information describes the designated dataset:

The directory contains the alphanumeric database (aka "census" aka "an4") recorded at Carnegie

Mellon University circa 1991. Subjects were asked to spell out personal information, such as name,

address, telephone number, birthdates, etc. They were instructed to not use their actual numbers. In

addition to these, subjects also spoke randomly generated sequences of words containing control

words. The database used internally at CMU has 1018 training and 140 test utterances, whereas the

database provided here has 948 training and 130 test utterances. All data are sampled at 16 kHz, 16-

bit linear sampling. All recordings were made with a close talking microphone.

In the dataset, we have two directories:

an4_clstk

The directory with training data has 74 sub-directories, one for each speaker. 21 of

them are female, 53 are male. The total number of utterances is 948, and the average

duration is about 3 seconds, totaling a little less than 50 minutes of speech.

an4test_clstk

The directory with test data has 10 sub-directories, one for each speaker. 3 of them

are female, 7 are male. The total number of utterances is 130, totaling around 6

minutes of speech.


Test-bed overview

The presented test-bed can be summarized in the block diagram in figure III-1.

The process of testing starts with choosing a file from the dataset. The duration of the file is calculated by counting the number of samples in the file. After that, the file is encoded, encrypted, and sent over an RTP socket. Compression must span the whole quality range (0 to 10). Encryption must also take place in the 4 presented modes (stream and 3 block sizes). The encrypted frames are sent over an RTP socket to a local receiving socket opened by another thread. Packets are dumped and saved in an output file. The recorded frame sizes are used to calculate the bit-rate without the packetization overhead. The size of the sent/received stream is used to calculate the bit-rate including the network overhead.

Figure III-1: Test-bed block diagram (steps: choose file from dataset; calculate time; compress using Speex, setting the quality parameter; encrypt using AES in CBC mode, setting the block size; record frame sizes; send/receive over RTP socket; dump frames; calculate bit-rate)
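The measurement steps of the test-bed can be sketched as follows. This is a minimal sketch and not the actual test-bed code: it assumes one Speex frame (20 ms) per RTP packet, padding that rounds each frame up to the next multiple of the block size, and a fixed 40-byte header per packet (12 bytes RTP, 8 bytes UDP, 20 bytes IPv4). The function names and the frame sizes in the usage example are illustrative.

```python
import math

SAMPLE_RATE_HZ = 16_000  # AN4 recordings are sampled at 16 kHz
HEADER_BYTES = 40        # assumed per packet: 12 (RTP) + 8 (UDP) + 20 (IPv4)


def duration_seconds(num_samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Duration of a recording, derived from its sample count."""
    return num_samples / sample_rate_hz


def padded_size(frame_bytes, block_bits=None):
    """Size of a frame after padding to a multiple of the block size.

    block_bits=None models the stream cipher, which leaves the size unchanged.
    Assumption: a frame that is already a multiple of the block is not extended.
    """
    if block_bits is None:
        return frame_bytes
    block_bytes = block_bits // 8
    return math.ceil(frame_bytes / block_bytes) * block_bytes


def bitrate_bps(frame_sizes, duration_s, block_bits=None, header_bytes=0):
    """Average bit-rate of a sequence of frames, optionally padded and packetized."""
    total_bytes = sum(padded_size(size, block_bits) + header_bytes
                      for size in frame_sizes)
    return total_bytes * 8 / duration_s


# Hypothetical frame sizes in bytes, not taken from the measurements.
frames = [20, 38, 62, 20]
duration = 0.08  # four 20 ms frames
for block_bits in (None, 128, 256, 512):
    without_headers = bitrate_bps(frames, duration, block_bits)
    with_headers = bitrate_bps(frames, duration, block_bits, HEADER_BYTES)
    print(block_bits, round(without_headers), round(with_headers))
```

Passing `header_bytes=0` gives the bit-rate without the packetization overhead, and passing the header size gives the bit-rate as seen on the network.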


Chapter IV Results and Conclusion

Tests were run using the test-bed presented in the previous chapter. The resulting bitrates are divided into two categories: narrowband and wideband. For each category, we present the bitrate obtained for the 4 encryption schemes (stream, and padding to 128, 256, and 512 bits), along with the overhead of the latter three schemes relative to the original stream bitrate.
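As a sketch of how this overhead can be expressed, assume it is measured relative to the stream bitrate at the same quality. The symbols are introduced here for illustration only: $R_{\text{stream}}(q)$ is the average bitrate at quality $q$ under stream encryption, and $R_b(q)$ is the average bitrate with padding to $b$ bits.

```latex
\[
\text{overhead}_b(q) = \frac{R_b(q) - R_{\text{stream}}(q)}{R_{\text{stream}}(q)} \times 100\%,
\qquad b \in \{128, 256, 512\}.
\]
```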

Narrow Band Results

Figure IV-2: NB rate vs quality (without packetization)

Figure IV-1: NB padding overhead (without packetization)


As these results show, the overhead induced by padding in narrow band mode is large. Figures IV-1 and IV-2 show the rate versus quality for the three levels of security, as well as the overhead induced by padding. 128 bit padding adds a moderate overhead. 512 bit padding yields a constant bitrate throughout the whole quality range; consequently, using CBR with the highest quality may be a better solution than relying on this padding. For other padding schemes (256 bit for example), however, the added overhead is manageable.

An example of a tradeoff based on these results is the following. Consider the rates of stream encryption and 256 bit padding: the average rate of stream encryption at quality 10 is the same as that of 256 bit padding at quality 7. A tradeoff can therefore be made: padding to 256 bits and setting the quality to 7 yields a large security gain while keeping the same rate. The price to pay is quality.
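This kind of lookup can be sketched in a few lines. The function name and the rates below are hypothetical placeholders, not the measured results; the only property mirrored from the text is that the padded rate at quality 7 equals the stream rate at quality 10.

```python
def matching_quality(stream_rates, padded_rates, stream_quality=10):
    """Highest padded quality whose average rate fits the stream rate.

    Both arguments map a Speex quality (0 to 10) to an average bit-rate.
    Returns None when no padded quality fits within the budget.
    """
    budget = stream_rates[stream_quality]
    fitting = [q for q, rate in padded_rates.items() if rate <= budget]
    return max(fitting) if fitting else None


# Placeholder rates in kbit/s for a few qualities only; not measured values.
stream = {7: 15.0, 10: 25.0}
padded_256 = {6: 20.0, 7: 25.0, 8: 30.0}
print(matching_quality(stream, padded_256))  # → 7
```

The quality lost in the tradeoff is then the difference between the stream quality and the returned value.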

Figure IV-3: NB rate vs quality (with packetization)

Figure IV-4: NB padding overhead (with packetization)


Wide Band Results

The same tests were run with Speex set to wide-band mode. The following figures show the obtained results (bit-rate versus quality, and overhead) for the 4 encryption schemes. The results are considerably better than those obtained for narrow band. As we see in figure IV-5, the curves corresponding to the 4 encryption schemes lie closer together, and consequently the added overhead is smaller. For example, for the same tradeoff as in the previous section (256 bit padding at the bit-rate of stream encryption at quality 10), the quality drops only to 9 instead of 7. With 512 bit padding at the same bit-rate, the quality drops to 8.

Figure IV-6: WB rate versus quality (without packetization)

Figure IV-5: WB overhead (without packetization)


An important observation is that the overhead is considerably smaller when calculated with packetization, because the packetization overhead is added to the padded and unpadded streams alike. For example, the overhead for 256 bit padding at quality 7 is around 30% when calculated without packetization, but less than 20% when calculated with packetization.
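The effect can be made explicit with a short derivation. The symbols are introduced here for illustration and assume a fixed header size per packet: let $S$ be the average unpadded payload size per packet, $P$ the average padded payload size, and $H$ the header size.

```latex
\begin{align*}
o_{\text{payload}} &= \frac{P - S}{S}, \\
o_{\text{packet}} &= \frac{(P + H) - (S + H)}{S + H} = \frac{P - S}{S + H}
                   = o_{\text{payload}} \cdot \frac{S}{S + H}.
\end{align*}
```

Since $S/(S+H) < 1$ for any $H > 0$, the overhead with packetization is always the smaller of the two. Under this model, an overhead of roughly 30% without packetization and less than 20% with packetization corresponds to $S/(S+H) < 2/3$, that is, to headers larger than half the average unpadded payload.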

Statistical Analysis

The results shown in the previous sections represent only the average bitrate. To gain a clearer perspective, we calculated a 95% confidence interval for each obtained bitrate. The confidence interval gives us more information about the resulting bitrate: its width indicates how much the scheme still benefits from variable bitrate compression. However, a very wide confidence interval cannot be linked to a manageable overhead.

Figure IV-8: Wide Band rate versus quality (with packetization)

Figure IV-7: Wide Band overhead (with packetization)

Another important point is that when the confidence intervals of 2 encryption schemes overlap, the 2 schemes can be considered to operate at the same rate. Overall, the statistical results give a better perspective on the tradeoff to be made.
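A confidence interval and the overlap check can be sketched as follows. The method used for the reported intervals is not stated, so this sketch assumes a normal approximation (z = 1.96) over per-file bitrates; the function names and sample values are illustrative and do not come from the measurements.

```python
import math
import statistics

Z_95 = 1.96  # assumed: normal approximation for a two-sided 95% interval


def confidence_interval_95(bitrates):
    """95% confidence interval (low, high) for the mean of per-file bitrates."""
    mean = statistics.mean(bitrates)
    half_width = Z_95 * statistics.stdev(bitrates) / math.sqrt(len(bitrates))
    return mean - half_width, mean + half_width


def intervals_overlap(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical per-file bitrates in kbit/s for two schemes at one quality.
stream_ci = confidence_interval_95([20.1, 22.4, 19.8, 23.0, 21.2])
padded_ci = confidence_interval_95([21.5, 23.1, 21.0, 23.9, 22.6])
print(stream_ci, padded_ci, intervals_overlap(stream_ci, padded_ci))
```

With a small number of files, a Student's t quantile would be a more careful choice than 1.96.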

The following figures present the confidence interval for the 4 encryption schemes. The 3 block

encryption schemes are compared to the stream cipher. Only wide band results are shown.

Figure IV-9: Stream Cipher 95% Confidence Interval

Figure IV-10: Stream and 128 bit padding Confidence Interval


Conclusion and future recommendations

The answer to the main question posed in this report is yes: padding an encrypted VoIP stream is feasible. The results clearly show that a three-dimensional tradeoff can be made to reach the desired solution. The parameters of the tradeoff are the level of security (represented by the padding block size), the bit-rate, and the quality. The latter two trade off against each other, while the security parameter changes the scale of the bitrate range.

Figure IV-12: Stream and 256 bit Confidence Interval

Figure IV-11: Stream and 512 bit padding confidence interval


A remark should be made about the importance of 256 bit padding. It provides a large improvement in immunity to traffic analysis (chapter 2), while the overhead induced by this scheme remains manageable.

A concluding statement can be made about the possibility of solving the security issues presented in the literature without relying on new technology: state-of-the-art tools that are already implemented and standardized can be used, with minor modification, to gain a large security upgrade.

As future work, we recommend applying this approach to video compression. Although nothing has been published yet about such an analysis for video, the concept of information leakage through varying packet sizes is worth studying for all types of network streams, especially for real-time applications.

Another recommendation is to push towards standardizing this approach as part of security standards. Although an RFC has been published with guidelines for using variable bit-rate encoding with SRTP, it suggests making padding part of the application layer and a responsibility of the developer. Such an approach may induce security weaknesses that would be avoidable if padding were part of the standard.
