IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 7,
NO. 6, DECEMBER 2012 1865
Steganography Integration Into a Low-Bit Rate Speech Codec
Yongfeng Huang, Chenghao Liu, Shanyu Tang, Senior Member, IEEE,
and Sen Bai
Abstract: Low bit-rate speech codecs have been widely used in audio
communications such as VoIP and mobile communications, so
steganography in low bit-rate audio streams would have broad
applications in practice. In this paper, the authors propose a new
algorithm for steganography in low bit-rate VoIP audio streams by
integrating information hiding into the process of speech encoding.
The proposed algorithm performs data embedding while pitch period
prediction is conducted during low bit-rate speech encoding, thus
maintaining synchronization between information hiding and speech
encoding. The steganography algorithm not only achieves high speech
quality and resists detection by steganalysis, but is also compatible
with a standard low bit-rate speech codec, without causing further
delay from data embedding and extraction. Testing shows that, with
the proposed algorithm, the data embedding rate of the secret message
can reach 4 bits/frame (133.3 bits/second).
Index Terms: G.723.1, information hiding, low bit-rate speech codec,
pitch period prediction, VoIP.
I. INTRODUCTION
NOWADAYS people are becoming more and more concerned about the
security of private information transmitted over the Internet.
Protecting private information from attack is regarded as one of the
major problems in the field of information security. Apart from
encryption, digital steganography has been one of the solutions for
protecting data transmission over the network [1].
Steganography is the science of covert communication that conceals
the existence of secret information embedded in cover media over an
insecure network. A great effort has been made to explore methods for
embedding information in cover media, such as plaintext [2], audio
files in WAV or MP3 format [3], and images in BMP or JPEG format [4].
In recent years, computer network protocols and streaming media such
as Voice over Internet Protocol (VoIP) audio streams have been used
as cover media to embed secret messages [5], [6]. Dittmann et al.
[5], for example, suggested the design and evaluation of
steganography in VoIP, indicating possible threats as a result of
embedding secret messages in such a widely used communication
protocol.
Manuscript received June 23, 2012; revised August 30, 2012; accepted
August 30, 2012. Date of publication September 12, 2012; date of
current version November 15, 2012. This work was supported in part by
the National Natural Science Foundation of China under Grant 61271392
and Grant 61272469. The associate editor coordinating the review of
this manuscript and approving it for publication was Prof. Jiwu
Huang. (Corresponding author: S. Tang.)
Y. Huang is with the Department of Electronic Engineering, Tsinghua
University, Qing Hua Yuan, Hai Dian District, Beijing, 100084, China
(e-mail: [email protected]).
C. Liu and S. Bai are with the Department of Information Engineering,
Chongqing Communication Institute, Chongqing, 400035, China (e-mail:
[email protected]; [email protected]).
S. Tang is with the School of Computer Science, China University of
Geosciences, Wuhan City, Hubei Province, 430074, China (e-mail:
[email protected]).
Color versions of one or more of the figures in this paper are
available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIFS.2012.2218599
The methods
of speech steganography can be classified into three categories. The
first is the least significant bit (LSB) replacement/matching method
applied to voice data in pulse code modulation (PCM) format [3]. The
second hides a secret message in a transform domain: the cover data
are first transformed to the transform domain, and some parameters in
that domain are then modified to embed the secret message. Commonly
used transforms include the cepstrum transform [7], the discrete
cosine transform [8], and so on. The third is the Quantization Index
Modulation (QIM)-based method first proposed by Xiao et al. [9]. QIM
hides the secret message by modifying the quantization vector, which
is applicable to various digital media, such as speech, image and
video. It is well suited to information hiding in the media
compression encoding process.
Although some methods have been suggested for speech steganography,
most of them deal with high bit-rate speech formats such as PCM.
However, most codecs used in VoIP are low bit-rate codecs, such as
the Internet low bit-rate codec (iLBC), G.723.1 and G.729A; this
means existing steganographic methods do not necessarily meet all the
requirements of information hiding in VoIP. Up to now, little
attention has been paid to steganography in low bit-rate VoIP audio
streams. For example, in our preliminary work, we proposed a codebook
partition algorithm called the Complementary Neighbor Vertex (CNV)
algorithm for optimally dividing the vector codebook into two
subcodebooks, which are needed by QIM embedding.
In
general, it is more challenging to embed information in low bit-rate
VoIP streams. The first reason is the real-time requirement of VoIP
communications. Most previous steganographic algorithms have been
designed for embedding data in image or audio files. These algorithms
usually take a relatively long time to embed data, so they are not
suitable for steganography in VoIP streams. Secondly, only a few
results have so far shown that conventional steganographic algorithms
can survive low bit-rate compression. Finally, data embedding
replaces the redundancy in the cover media with the secret message;
the less redundancy there is, the more difficult information hiding
becomes. Unfortunately, all low bit-rate codecs are based on analysis
by synthesis (AbS), which uses effective methods such as linear
predictive coding (LPC) to eliminate redundancy. So conventional
steganographic algorithms, e.g., replacing LSBs with the secret
message, are not necessarily suitable for steganography in low
bit-rate VoIP audio streams.
To take on these challenges, we propose a new method for
steganography in low bit-rate VoIP audio streams and design an
enhanced speech codec to integrate the information hiding function.
1556-6013/$31.00 © 2012 IEEE
The rest of the paper is organized as follows. In Section II, related
work is briefly introduced. Section III describes the pitch period
prediction method in the hybrid speech codec. Section IV presents a
new pitch period prediction-based algorithm for steganography in low
bit-rate VoIP streams, and an enhanced speech codec combined with
information hiding. Experimental results are discussed in Section V.
Finally, Section VI concludes with a summary and directions for
future work.
II. RELATED WORK
Over the past few years, a number of attempts have been made to study
steganography in low bit-rate audio streams. Some related works are
introduced below.
Several MP3Stego- and AAC-based audio steganographic systems have
been suggested in recent years [10]–[12]. Wang et al. [1] proposed a
scheme to convey secret messages by embedding them in VoIP streams.
The scheme divides the steganography process into two steps:
compressing the secret message, and embedding its binary bits into
the LSBs of the cover speech encoded by the G.711 codec. Dittmann et
al. [5] presented a more general scheme for steganography in VoIP,
which can be used for transmitting an arbitrary secret message. More
recently, Huang and coworkers [7] suggested an M-sequence based LSB
steganographic algorithm for embedding information in VoIP streams
encoded by the G.729A codec. With their algorithm, embedding data in
a speech frame takes less than 20 μs on average, which is negligible
in comparison with the allowable coding time of 15 ms for each frame
in VoIP. In addition, Huang et al. [6] suggested an algorithm for
embedding data in some parameters of the inactive speech frames
encoded by the G.723.1 codec. However, this algorithm is also based
on LSB substitution in encoded audio streams. Therefore, the
algorithms above would lead to obvious distortion, which affects the
quality of steganographic speech.
Xiao suggested a QIM-based steganography in low bit-rate
low bit-rate
speech while encoding [9]. The QIM method randomly dividesthe
whole codebook into two parts, each colored with white orblack.
When a secret bit of 0 is embedded, the white code-word is used;
the black codeword is used when a secret bit of1 is embedded. On
the receiving side, the hidden bit is ex-tracted by checking which
part of the codebook the codewordbelongs to. It is the first
attempt to perform steganography andcompression operation in the
same codec. However, this infor-mation hiding algorithm has a small
hiding capacity, which isno use in practice.Our work described in
this paper is the first ever effort to
explore a novel method for steganography in low bit-rate
speechbased on pitch period prediction while the speech is
encoded.The steganographic algorithm can not only achievemuch
higherdata hiding capacity than the QIM algorithm [9], but also
assurea good quality of speech.
III. PITCH PERIOD PREDICTION IN HYBRID SPEECH CODEC
As pitch period prediction is required in almost all speech
analysis-synthesis (vocoder) systems, the pitch period predictor is
an essential component in all low bit-rate speech codecs. Because of
the importance of pitch period prediction, a variety of algorithms
for it have been proposed in the speech processing literature
[13]–[15]. However, accurately predicting the pitch period of a
speech signal from the acoustic pressure waveform alone is often
exceedingly difficult, for the reasons below.
1) The glottal excitation waveform is not a perfect train of periodic
pulses. Although finding the period of a perfectly periodic waveform
is straightforward, predicting the period of the speech waveform can
be quite difficult, as the speech waveform varies both in period and
in the detailed structure of the waveform within a period.
2) The interaction between the vocal tract and the glottal excitation
also makes pitch period prediction difficult. In some instances, the
formants of the vocal tract can significantly alter the structure of
the glottal waveform, so that the actual pitch period is difficult to
predict. Such an interaction is most deleterious to pitch period
prediction during fast movements of the articulators, when the
formants also change rapidly.
3) A further problem in accurately predicting the pitch period is the
inherent difficulty in defining the exact beginning and end of each
pitch period during voiced speech segments. Choosing the beginning
and ending locations of the pitch period is often quite arbitrary.
The pitch period discrepancies arise not only from the
quasi-periodicity of the speech waveform, but also from the fact that
peak measurements are sensitive to the formant structure during the
pitch period, whereas zero crossings of the waveform are sensitive to
the formants, noise, and any DC level in the waveform.
4) Another difficulty of pitch period prediction is how to
distinguish between unvoiced speech and low-level voiced speech. In
many cases, transitions between unvoiced speech segments and
low-level voiced speech segments are very subtle, and so they are
extremely hard to pinpoint.
Apart from the difficulties in measuring the pitch period discussed
above, pitch period prediction is also impeded by other factors.
Although it is difficult to predict the pitch period, a number of
sophisticated algorithms have been developed for pitch period
prediction. Basically, these algorithms can be classified into three
categories. The first category mainly utilizes the time-domain
properties of speech signals, the second category employs the
frequency-domain properties of speech signals, and the third category
uses both the time- and frequency-domain properties of speech
signals. Most low bit-rate speech encoders, such as ITU-T G.723.1 and
G.729A, adopt the first type of algorithm. As an example, the pitch
period prediction algorithm of ITU-T G.723.1 is introduced below.
The ITU-T G.723.1 encoder operates on frames of 240 samples each; a
speech frame, denoted by , is equal to 30 ms at an 8 kHz sampling
rate. Each frame is divided into four subframes of 60 samples each.
After a series of processing steps, the input signal of a frame is
converted to the weighted speech signal $f[n]$. For every two
subframes (120 samples), the open loop pitch period, $L_{OL}$, is
computed using the weighted speech signal $f[n]$. The pitch
estimation is performed on blocks of 120 samples. The pitch period is
searched in the range from 18 to 142 samples. Two pitch estimations
are computed for every frame, one for
the first two subframes and the other for the last two. The open loop
pitch period estimation, $L_{OL}$, is computed using the perceptually
weighted speech $f[n]$. A cross correlation criterion, namely
$C_{OL}(j)$, calculated by using the maximization method [13], is
used to determine the pitch period, as shown in (1).
$$C_{OL}(j) = \frac{\left(\sum_{n=0}^{119} f[n]\, f[n-j]\right)^{2}}{\sum_{n=0}^{119} f[n-j]\, f[n-j]}, \quad 18 \le j \le 142 \qquad (1)$$
The index $j$ which maximizes the cross correlation, $C_{OL}(j)$, is
selected as the open loop pitch estimation for the appropriate two
subframes. While searching for the best index, preference is given to
smaller pitch periods to avoid choosing pitch multiples. Maximums of
$C_{OL}(j)$ are searched for beginning with $j = 18$. For every
maximum $C_{OL}(j)$ found, its value is compared to the best previous
maximum found, $C_{OL}(j')$. The following pseudocode shows how it
works:
if $j - j' < 18$
then (if $C_{OL}(j) > C_{OL}(j')$
then (select $C_{OL}(j)$, $j' = j$)
)
else (if $C_{OL}(j)$ exceeds $C_{OL}(j')$ by 1.25 dB
then (select $C_{OL}(j)$, $j' = j$)
)
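The open loop search can be sketched in Python as follows. This is a
minimal illustration under stated assumptions, not the G.723.1
reference implementation: the function names are hypothetical, the
criterion is evaluated at every lag rather than only at local
maximums, non-positive correlations are treated as zero, and the
1.25 dB margin used for lags at least 18 samples apart is taken from
the wording of the G.723.1 Recommendation.

```python
import math

MIN_LAG, MAX_LAG, BLOCK = 18, 142, 120


def cross_correlation(f, start, j):
    """Criterion (1) for the 120-sample block of weighted speech starting at `start`."""
    block = range(start, start + BLOCK)
    num = sum(f[n] * f[n - j] for n in block)
    den = sum(f[n - j] * f[n - j] for n in block)
    if num <= 0.0 or den <= 0.0:
        return 0.0  # assumption: ignore lags with non-positive correlation
    return num * num / den


def open_loop_pitch(f, start):
    """Return the open loop lag, preferring smaller lags over pitch multiples."""
    margin = 10.0 ** (1.25 / 10.0)  # 1.25 dB expressed as a power ratio
    best_j, best_c = MIN_LAG, 0.0
    for j in range(MIN_LAG, MAX_LAG + 1):
        c = cross_correlation(f, start, j)
        if j - best_j < MIN_LAG:
            if c > best_c:
                best_j, best_c = j, c
        elif c > best_c * margin:
            best_j, best_c = j, c
    return best_j


if __name__ == "__main__":
    # Synthetic "weighted speech": a sinusoid with a 40-sample period.
    f = [math.sin(2.0 * math.pi * n / 40.0) for n in range(400)]
    print(open_loop_pitch(f, 200))
```

For the synthetic signal, lags 40, 80 and 120 all give the same
criterion value, and the preference rule keeps the smallest of them.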
Using the pitch period estimation, $L_{OL}$, a closed loop pitch
predictor is computed. The pitch predictor in G.723.1 is a fifth
order pitch predictor. The pitch prediction contribution is treated
as a conventional adaptive codebook contribution. For subframes 0 and
2, the closed loop pitch lag is selected from around the appropriate
open loop pitch lag in the range of $\pm 1$. For subframes 1 and 3,
the closed loop pitch lag is coded differentially using 2 bits and
may differ from the previous subframe lag only by $-1$, 0, $+1$, or
$+2$ [10].
IV. PITCH PERIOD PREDICTION-BASED STEGANOGRAPHY ALGORITHM
A. Embedding Algorithm
In the process of G.723.1 encoding, the open-loop pitch estimation is
conducted first, followed by closed-loop pitch prediction. The
open-loop pitch estimation computes the open-loop pitch period of a
frame of speech signal . For each frame, two pitch periods are
computed by using the first two subframes and the last two subframes,
respectively. The method for computing the open-loop pitch period is
described below.
First, a cross correlation criterion is computed by using (1), and
then the open-loop pitch is searched for following the procedure
below [13]:
1) Suppose , , ;
2) Using (1), compute . If
(2)
and
(3)
then , and .
3) Set , if , return to 2), otherwise stop.
Having obtained the pitch period of a frame of speech signal , search
for the closed-loop pitch period and embed information.
The closed-loop pitch period of a subframe is defined by , , 1, 2, 3,
and its open-loop pitch period is , , 1, representing the open-loop
pitch periods of the first two subframes and the last two subframes,
respectively. Adjusting
yields
(4)
The closed-loop pitch period is assigned a value close to the
open-loop pitch period . The values for odd subframes and for even
subframes are obtained from different ranges as shown in (5).
(5)
The minimum value of is 17, and its maximum is 143. The number of is
equal to the number of elements in , denoted by . represents the th
element in , .
The pitch prediction contribution is treated as a conventional
conventional
adaptive codebook contribution. For subframes 0 and 2, theclosed
loop pitch lag is selected around the appropriate openloop pitch
lag in the range and coded using 7 bits. For sub-frames 1 and 3,
the closed loop pitch lag is coded differentiallyusing 2 bits and
may differ from the previous subframe lag onlyby , 0, or [13]. The
quantized and decoded pitch lagvalues are referred to as from this
point on. The pitch pre-dictor gains are vector quantized using two
codebooks with 85or 170 entries for the high bit rate and 170
entries for the lowbit rate. The 170 entry codebook is the same for
both rates. Forthe high rate, if is less than 58 for subframes 0
and 1 or ifis less than 58 for subframes 2 and 3, then the 85 entry
code-
book is used for the pitch gain quantization. Otherwise, the
pitchgain is quantized using the 170 entry codebook. We studied
thepitch distribution probabilities of closed-loop pitch period of
un-touched G.723.1 VoIP speeches, and Fig. 1 shows the pitch
dis-tribution probability results for four types of untouched
G.723.1VoIP speeches, each with 250 samples.
Fig. 1. Pitch distribution probabilities of four types of untouched
G.723.1 VoIP speech samples.
TABLE I
DATA EMBEDDING AT DIFFERENT EMBEDDING BIT-RATES
In the search for the closed-loop pitch period, data embedding is
accomplished by adjusting the search range of the pitch prediction of
a subframe according to the secret bit to be embedded. For instance,
if the secret bit to be embedded is 0, the subframe search is
performed on the even elements in ; if the secret bit is 1, the odd
elements in are searched. In G.723.1, each frame has four subframes,
, and all subframes require searching for the closed-loop pitch, so
data embedding can be performed on some or all of the subframes.
Therefore, we propose a series of solutions for steganography at four
different embedding bit-rates, as shown in Table I. When the 15
solutions are selected at random, the average data embedding rate is
around 2.1 bits/frame rather than the maximum of 4 bits/frame.
On the basis of the steganography solutions listed in Table I, a new
data embedding algorithm is proposed below.
Step 0: generate a random , , then choose a steganography solution
according to and Table I.
Step 1: according to , decide the embedding bit-rate and where to
embed the secret bit stream , i.e., which i is the subframe in the
frame, .
Step 2: suppose the bit in the bit stream is embedded in the subframe
of the frame; data embedding is conducted by using the following
algorithm.
Step 3: if , then data are embedded in the subframe of the frame,
i.e., the pitch period of the subframe is searched upon.
(6)
(7)
If , then data are embedded in the subframe of the frame, i.e., the
pitch period of the subframe is searched upon .
(8)
Step 4: repeat Step 3 until the completion of data embedding of the
secret message .
For steganography using the data embedding algorithm
algorithm
above, errors in predicting speech pitch periods can be
esti-mated in theory. As G.723.1 samples at 8 KHz, analysis ofthe
closed-loop pitch period prediction shows data embeddingwould lead
to one sampling-point error. So the absolute error
in predicting pitch period caused by data embeddingcan be
computed by
(9)
If the pitch period is , the maximum of is26.144 Hz, and the
relative error is 5.882%;If the pitch period is , the maximum of
is
0.394 Hz, and the relative error is 0.699%.Therefore, the error
in pitch frequency as a result of adjusting
pitch prediction is proportional to the pitch frequency of
speech
signal, but the error has little impact on speech synthesis,
particularly for speech signals with lower pitch frequency. In the
literature [15], the average error of the most advanced algorithms
for predicting pitch periods is found to be samples, indicating that
the pitch period prediction error arising from the data embedding
algorithm is within the normal range.
B. Extracting Algorithm
The sender embeds the secret message in the low bit-rate speech
streams encoded by G.723.1, and the bit streams containing the
message are then sent to the receiver, who extracts the secret
message following the algorithm below.
Step 1: using a negotiating mechanism, the receiver acquires the data
embedding algorithm (steganography solution) for the current speech
frame .
Step 2: compute the pitch periods of the four subframes , , , of the
speech frame f[m] decoded by G.723.1.
Step 3: according to the data embedding algorithm, decide which of
the four subframes , , , contains the secret message, and determine
the bits of the message using the following formula
(10)
Step 4: repeat Step 3 until all speech frames have been decoded,
after which the bit stream of the secret message is converted to the
secret message .
C. Design of the Coder With Steganography
A joint information embedding and lossy compression method is
suggested in the literature [16], but no attempts have been made to
study data embedding integrated into low bit-rate speech encoding. By
using a data embedding algorithm based on pitch period prediction, we
here develop a G.723.1 low bit-rate speech codec with data embedding
functionality, i.e., the embedding and extracting of the secret
message are integrated into the G.723.1 speech codec.
To achieve data embedding while encoding in G.723.1, our specially
designed secret information preprocessing module, steganography
solution selecting module, updating module, and secret information
bit stream framer module are inserted into a normal G.723.1 speech
coder, as shown in Fig. 2. The pitch period prediction module in the
codec is also modified so as to enable the search for the closed-loop
pitch over the pitch period updating set, thus realizing data
embedding. Similarly, in order to achieve secret data extraction, the
novel pitch period odd-even deciding module, steganography solution
selecting module, secret data extraction module, and secret
information postprocessing module are built into the G.723.1 decoder,
as shown in Fig. 3. Fig. 2 illustrates information embedding
integrated into the G.723.1 coder, whereas Fig. 3 shows information
extraction along with G.723.1 decoding.
In the process of information embedding and speech encoding, the
secret message is compressed
Fig. 2. G.723.1 coder with information embedding.
Fig. 3. G.723.1 decoder with information extraction.
to form the secret data bit stream , which is divided into segments
according to the data embedding algorithm. The secret segments are
then embedded into speech streams by adjusting pitch period
prediction.
In the process of speech decoding and information extraction, the
G.723.1 decoder computes the pitch period of a subframe , , 1, 2, 3,
in the current frame , decides the odd-even nature of the pitch
period of the subframe by using the pitch period odd-even deciding
module, and determines the hidden data bit according to the odd-even
nature of and the steganography solution . The hidden data bit is
then used to extract the secret information, , by using the secret
information postprocessing module.
Fig. 4. Comparisons of time-domain amplitude plots of a 3-second CSM
sample at different embedding bit-rates.
V. RESULTS AND DISCUSSION
A. Test Samples and Conditions
To evaluate the performance of the proposed steganographic algorithm,
we conducted experiments using different speech sample files in PCM
format as cover media. The speech samples are classified into four
groups: Chinese Speech Man (CSM), Chinese Speech Woman (CSW), English
Speech Man (ESM), and English Speech Woman (ESW). Each group contains
100 speech samples of 3 seconds and 100 speech samples of 10 seconds,
so the four groups total 800 speech samples. Each speech sample was
sampled at 8000 Hz, quantized to 16 bits, and saved in PCM format.
The 3-second speech samples are defined as the Sample-3 sample set;
the Sample-10 sample set contains the 10-second speech samples.
In our experiments, the ITU-T G.723.1 codec operated at 6.3 kbps,
without silence compression. The fifteen solutions for data embedding
proposed in Table I were used to conduct steganography at four
different embedding bit-rates (1 bit/frame, 2 bits/frame, 3
bits/frame, and 4 bits/frame). Secret data were embedded into each
audio frame by randomly choosing different embedding bit-rates and
steganography solutions with equal probability.
B. Results and Analysis
Fig. 4 compares the time-domain amplitude plot of an original
3-second CSM sample with those of the stego 3-second CSM samples at
four different data embedding bit-rates. Almost no distortion
occurred in the time domain as a result of data embedding in the
speech sample; no differences between the original speech sample and
the stego speech samples were perceptible in the time-domain plots,
indicating that our proposed steganography algorithm had no or very
little impact on the quality of the original speech.
We used the perceptual evaluation of speech quality (PESQ) value to
assess the subjective quality of the stego speech samples. Figs. 5
and 6 show the PESQ values for the original speech samples after
G.723.1 coding without any data embedding and for the stego speech
files processed by G.723.1 with data embedding by means of the
proposed steganography algorithm (detailed in Section IV), when the
3-second and the 10-second speech samples were used as cover media,
respectively. The
Fig. 5. PESQ values for 3-second samples using the proposed
steganography algorithm.
Fig. 6. PESQ values for 10-second samples using the proposed
steganography algorithm.
black curves are the PESQ values for the original speech samples
without data hiding. Steganography was carried out at four different
data embedding bit-rates (red curve: 1 bit/frame, green curve: 2
bits/frame, blue curve: 3 bits/frame, navy curve: 4 bits/frame). As
Figs. 5 and 6 show, for the two types of speech cover media, the
variations in PESQ between the original speech files and the stego
speech files were very small, which means the proposed steganography
algorithm has little effect on PESQ.
Figs. 7 and 8 show comparisons of PESQ values between using the
proposed steganography algorithm and using the CNV algorithm (yellow
curve) presented in the literature [9] for 3-second samples and
10-second samples, respectively. There were no obvious discrepancies
in the PESQ value
Fig. 7. Comparisons of PESQ values for 3-second samples between using
the proposed steganography algorithm and using the CNV algorithm [9].
Fig. 8. Comparisons of PESQ values for 10-second samples between
using the proposed steganography algorithm and using the CNV
algorithm [9].
without (black curve: no hiding) and with data embedding at two
different embedding bit-rates (blue curve: 3 bits/frame, navy curve:
4 bits/frame). As Figs. 7 and 8 show, the variations in PESQ between
the original speech files and the stego speech files were very small,
indicating that the proposed information hiding along with speech
compression encoding had no or very little impact on the quality of
the synthesized speech.
Tables II–V list the PESQ values for the original speech samples and
the stego speech files obtained by using the proposed steganography
algorithm, when the 3-second and the 10-second
TABLE II
PESQ STATISTICS AT 1 BIT/FRAME DATA EMBEDDING BIT-RATE
TABLE III
PESQ STATISTICS AT 2 BITS/FRAME DATA EMBEDDING BIT-RATE
TABLE IV
PESQ STATISTICS AT 3 BITS/FRAME DATA EMBEDDING BIT-RATE
speech samples were used as cover media, respectively. The
statistical results were obtained for steganography experiments
conducted at four different data embedding bit-rates. The PESQ values
ranged from 2.9 to 4.1. On average, data hiding had less effect on
the PESQ values of the male speech samples than on those of the
female speech samples. This is probably due to the fact that the
pitch frequency of female speech has a greater range, and changes
more quickly, than that of male speech. Analysis of Tables II–V shows
that, as the data embedding bit-rate increases, the average worsening
change in PESQ increases: for 3 s samples, ; for 10 s samples, . The
maximum of the average worsening change in PESQ is 0.50%, and the
average
TABLE V
PESQ STATISTICS AT 4 BITS/FRAME DATA EMBEDDING BIT-RATE
TABLE VI
PESQ STATISTICS USING THE STEGANOGRAPHY ALGORITHM PRESENTED IN [9]
TABLE VII
COMPARISONS OF CHANGES IN PESQ BETWEEN THE PROPOSED STEGANOGRAPHY
ALGORITHM AND THE ONE PRESENTED IN [9]
change in PESQ is within the standard error in PESQ for the speech
samples without data hiding. This also means data hiding has a
negligible effect on PESQ.
Table VI lists PESQ statistical results for the stego speech files
obtained by using the steganography algorithm presented in [9], with
cover media having lengths of 3 and 10 seconds. Similarly, data
embedding with the proposed algorithm led to a small change in PESQ,
and the average change in PESQ is also within the standard error in
PESQ for the speech samples without data hiding. However, the
previous steganography algorithm [9] resulted in a larger change in
PESQ than our proposed algorithm, and so it had a slightly higher
impact on PESQ.
Table VII lists comparisons of changes in PESQ between the proposed
steganography algorithm and the CNV algorithm presented in [9]. At
the same embedding bit-rate with 3-second
TABLE VIII
DIFFERENCES IN PESQ BETWEEN NORMAL EN- AND DECODING AND DATA HIDING
USING DIFFERENT ALGORITHMS
speech samples, the overall average standard error for the stego
speech files using the proposed steganography algorithm was 1.60%,
4.04% less than that of the CNV algorithm, with both algorithms
leading to a 0.96% change in PESQ; for 10-second speech samples, the
average worsening changes in PESQ of CSM and CSW with the proposed
algorithm were smaller, those of ESM and ESW were bigger, the overall
worsening change in PESQ was 0.05% larger, and the standard error
(0.84%) was 0.02% larger in comparison with CNV. With the embedding
bit-rate reaching 4 bits/frame, the average worsening change in PESQ
of 3-second speech samples with the proposed algorithm was 0.26%
larger, and the overall standard error (1.61%) was 4.03% smaller
compared with CNV; for 10-second speech samples, the average
worsening change in PESQ was 0.33% larger, and the overall standard
error (0.90%) was 0.08% bigger than with CNV.
Table VIII lists differences in PESQ between normal en- and decoding
and data hiding using different algorithms. When using the proposed
steganography algorithm, the average worsening change in PESQ and the
standard error of both 3 s and 10 s speech samples were within the
range of the standard error of normal en- and decoding. For the
algorithm presented in [9], this was the case for the 10 s speech
samples only. In comparison with the previous algorithm, the proposed
algorithm had less impact on PESQ at lower data embedding bit-rates;
when the data embedding bit-rate increased to 4 bits/frame, the
average worsening change in PESQ was 0.295% larger, and the overall
average standard error was 1.975% less than with the previous
algorithm.
To evaluate the security of the
proposed steganography algorithm, we employed the latest steganalysis
method [17]–[20], which uses a Derivative Mel-Frequency Cepstral
Coefficients (DMFCC)-based Support Vector Machine (SVM) to detect
audio steganography. The SVM used the RBF kernel function as its
default parameter.
The test samples used were 501 CSM samples (300 as training samples,
and 201 as test samples), 533 CSW samples (300 as training samples,
and 233 as test samples), 819 ESM samples (600 as training samples,
and 219 as test samples), 825 ESW samples (600 as training samples,
and 225 as test samples), and Hybrid samples containing CSM, CSW, ESM
and ESW samples. These five sorts of speech samples were used as the
cover media in which data embedding at 4 bits/frame took place by
using the proposed steganography algorithm and
-
HUANG et al.: STEGANOGRAPHY INTEGRATION INTO A LOW-BIT RATE
SPEECH CODEC 1873
TABLE IXSTEGANALYSIS RESULTS OF THE LITERATURE [6] ALGORITHM
USINGDMFCC AT DIFFERENT DETECTION WINDOWS (DATA EMBEDDING
RATE OF 3 BITS/FRAME)
TABLE XSTEGANALYSIS RESULTS OF THE PROPOSED ALGORITHM USING
DMFCC ATDIFFERENT DETECTION WINDOWS (DATA EMBEDDING RATE OF 3
BITS/FRAME)
the one presented in [6]. The steganalysis results are listed
inTables IX and X.In the experiments, we used LIBSVM Version 3.0
[21]. In
the SVM-scale of LIBSVM, the lower is , the upper is 1, and the
other parameters used are default values. In the SVM-train of
LIBSVM, the svm_type is C-SVC, the kernel_type is RBF (radial basis
function), the cost is 1000, the epsilon is 0.00001, and the other
parameters used are default values.

As Table IX shows, when the detection window length was 150 frames,
the accuracy of DMFCC in detecting steganography using the
algorithm suggested in [6] reached 80% for all five types of
speech samples, and increased further to over 90% at a detection
window length of 300 frames. This indicates that DMFCC is very
effective in detecting the old steganography algorithm [6].

Table X shows that the accuracy of DMFCC in detecting steganography
with the proposed algorithm barely achieved 53% for the five types
of speech samples, with a maximum accuracy of 56%, indicating that
the proposed steganography algorithm is unlikely to be detected by
DMFCC audio steganalysis.

We also adopted the
latest DMFCC audio steganalysis, the second-order derivative-based
Markov approach for audio steganalysis [22], [23], to detect VoIP
steganography with the proposed steganographic algorithm, and the
results are presented in Table XI. As Table XI shows, the average
accuracy of Markov-DMFCC steganalysis in detecting steganography
with the proposed algorithm just reached 51% for the five different
types of speech samples, with a maximum accuracy of 54%, which
means the proposed steganographic algorithm is unlikely to be
detected by Markov-DMFCC steganalysis. This was probably because
the Markov transition features analyzed by Markov-DMFCC steganalysis
are ineffective in detecting the proposed steganographic algorithm,
which uses pitch lag parameter substitution.

TABLE XI
STEGANALYSIS RESULTS OF THE PROPOSED ALGORITHM USING THE
MARKOV-DMFCC APPROACH [22], [23] AT DIFFERENT DETECTION WINDOWS
(DATA EMBEDDING RATE OF 3 BITS/FRAME)

Fig. 9. Comparisons of steganalysis results of two algorithms using
DMFCC at different detection window lengths.

Fig. 9 shows comparisons of steganalysis results of the two
algorithms using DMFCC at different detection window lengths when
Hybrid speech samples were used as cover media. As the detection
window length increased, the accuracy of DMFCC in detecting the
steganography algorithm presented in [6] improved significantly;
the detection accuracy attained 90% when the detection window length
reached 200 frames. By contrast, DMFCC was not effective in
detecting the proposed steganography algorithm at different
detection window lengths.

Fig. 10 shows the pitch distribution
probabilities of G.723.1 VoIP samples (duration of 20 seconds)
without and with data embedding. No obvious changes were found in
the statistical property of the closed-loop pitch periods in the
speech samples after passing through the G.723.1 codec without or
with data embedding for four types of VoIP audio samples,
indicating that the proposed steganographic system retains the
statistical property of the original closed-loop pitch periods.

Fig. 10. Pitch distribution probabilities of G.723.1 VoIP samples
(duration of 20 seconds) without and with data embedding.

TABLE XII
STEGANALYSIS RESULTS OF THE PROPOSED ALGORITHM USING SVM AT
DIFFERENT DETECTION WINDOWS

We carried out extra
steganalysis experiments. As our proposed steganographic algorithm
is based on pitch period prediction, a steganalysis method based on
pitch statistical characteristics was specially designed. It
assumes that eavesdroppers know our steganographic algorithm
(Kerckhoffs-compliant) and have VoIP samples of 3 s, 5 s, 10 s, 20 s
and 30 s in length, with and without steganography, available. By
analyzing the pitch lag of VoIP samples with and without
steganography, the eavesdroppers obtained the first-order pitch
statistical characteristics, which were classified by using SVM
(similar to the DMFCC detection method in setup). The detection
results are presented in Table XII. As the table shows, at five
different detection window lengths, the accuracy in detecting
steganography was below 70%, indicating that our proposed
steganographic algorithm is capable of withstanding steganalysis.
VI. CONCLUSIONS

In this paper, we have proposed a new method for steganography in
low bit-rate VoIP streams based on pitch period prediction. On the
basis of ITU G.723.1, a widely used low bit-rate speech codec, we
have developed a much-improved G.723.1 speech codec with
information hiding functionality. Fifteen steganography solutions
have been suggested for use on VoIP speech samples at four data
embedding bit-rates, taking into account the characteristics of
G.723.1. The experimental results have shown that the worsening
change in PESQ of the stego speech files obtained by using the
proposed steganography algorithm was within 1.2%, indicating little
impact on the quality of speech. In comparison with a previous
algorithm [9], the proposed steganography algorithm has been found
to have a slightly larger effect on PESQ for 3 s speech samples, but
less effect for 10 s speech samples, at a data embedding rate of
3 bits/frame; the worsening change in PESQ was 0.298% higher when
the data embedding bit-rate reached 4 bits/frame (a 33.3% increase
over the old algorithm). Steganalysis tests using DMFCC-SVM have
shown that the proposed steganography algorithm could avoid being
detected by steganalysis. Investigation into the applicability of
the proposed algorithm to other low bit-rate speech codecs shall be
the subject of future work. The steganalysis performance with
different classifiers such as Fisher's linear classifier and
logistic regression shall also be part of future work.
REFERENCES

[1] C. Wang and Q. Wu, "Information hiding in real-time VoIP
streams," in Proc. 9th IEEE Int. Symp. Multimedia, Taichung, Taiwan,
2007, pp. 255–262.
[2] S. Zander, G. Armitage, and P. Branch, "A survey of covert
channels and countermeasures in computer network protocols," IEEE
Commun. Surveys Tutorials, vol. 9, no. 3, pp. 44–57, 3rd Quarter,
2007.
[3] N. Aoki, "A technique of lossless steganography for G.711
telephony speech," in Proc. 4th Int. Conf. Intelligent Information
Hiding and Multimedia Signal Processing (IIHMSP2008), 2008,
pp. 608–611.
[4] P.-C. Su and C.-C. J. Kuo, "Steganography in JPEG2000
compressed images," IEEE Trans. Consumer Electron., vol. 49, no. 4,
pp. 824–832, Nov. 2003.
[5] J. Dittmann, D. Hesse, and R. Hillert, "Steganography and
steganalysis in voice over IP scenarios: Operational aspects and
first experiences with a new steganalysis tool set," in Proc. SPIE
Security, Steganography, and Watermarking of Multimedia Contents
VII, Mar. 2005, vol. 5681, pp. 607–618.
[6] Y. F. Huang, S. Tang, and J. Yuan, "Steganography in inactive
frames of VoIP streams encoded by source codec," IEEE Trans. Inf.
Forensics Security, vol. 6, no. 2, pp. 296–306, Jun. 2011.
[7] Y. Su, Y. Huang, and X. Li, "Steganography-oriented noisy
resistance model of G.729a," in Proc. 2006 IMACS Multi-Conf.
Computational Engineering in Systems Applications, Beijing, China,
2006, pp. 11–15.
[8] L. Liu, M. Li, Q. Li, and Y. Liang, "Perceptually transparent
information hiding in G.729 bitstream," in Proc. 4th Int. Conf.
Intelligent Information Hiding and Multimedia Signal Processing,
Harbin, China, 2008, pp. 406–409.
[9] B. Xiao, Y. Huang, and S. Tang, "An approach to information
hiding in low bit-rate speech stream," in Proc. 2008 IEEE Global
Telecommunications Conf., New Orleans, LA, 2008, pp. 1–5.
[10] D. Yan, R. Wang, and L. Zhang, "Quantization step parity-based
steganography for MP3 audio," J. Fund. Inform., vol. 97, no. 12,
pp. 1–14, 2009.
[11] Fabien Petitcolas, Mar. 28, 2012 [Online]. Available:
http://www.petitcolas.net/fabien/steganography/mp3stego/
[12] M. Sheikhan, K. Asadollahi, and R. Shahnazi, "Improvement of
embedding capacity and quality of DWT-based audio steganography
systems," World Appl. Sci. J., vol. 10, no. 12, pp. 1501–1507,
2010.
[13] ITU, ITU-T Recommendation G.723.1, Dual Rate Speech Coder for
Multimedia Communication Transmitting at 5.3 and 6.3 kbit/s, 1996
[Online]. Available:
http://www.itu.int/rec/T-REC-G.723.1-200605-I/en
[14] ITU, ITU-T Recommendation G.729, Coding of Speech at 8 kbit/s
Using Conjugate-Structure Algebraic-Code-Excited Linear-Prediction
(CS-ACELP), 2007 [Online]. Available:
http://www.itu.int/rec/T-REC-G.729/e
[15] R. P. Ramachandran and P. Kabal, "Pitch prediction filters in
speech coding," IEEE Trans. Acoustics Speech Signal Process., vol.
37, no. 4, pp. 467–478, Apr. 1989.
[16] A. Maor and N. Merhav, "On joint information embedding and
lossy compression," IEEE Trans. Inf. Theory, vol. 51, no. 8,
pp. 2998–3008, Aug. 2005.
[17] Y. Huang, S. Tang, C. Bao, and Y. J. Yip, "Steganalysis of
compressed speech to detect covert voice over Internet protocol
channels," IET Inf. Security, vol. 5, no. 1, pp. 26–32, 2011.
[18] Q. Liu, A. H. Sung, and M. Qiao, "Temporal derivative-based
spectrum and mel-cepstrum audio steganalysis," IEEE Trans. Inf.
Forensics Security, vol. 4, no. 3, pp. 359–368, Sep. 2009.
[19] Y. Huang, S. Tang, and Y. Zhang, "Detection of covert voice
over Internet protocol communications using sliding window-based
steganalysis," IET Commun., vol. 5, no. 7, pp. 929–936, 2011.
[20] H. Yong-feng, Y. Jian, and C. Mingchao, "Key distribution in
the covert communication based on VoIP," Chinese J. Electron., vol.
20, no. 2, pp. 357–361, 2011.
[21] C.-C. Chang and C.-J. Lin, LIBSVM: A Library for Support
Vector Machines [DB/OL], Oct. 22, 2011 [Online]. Available:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[22] Q. Liu, A. H. Sung, and M. Qiao, "Novel stream mining for
audio steganalysis," in ACM Multimedia. New York: ACM, 2009,
pp. 95–104.
[23] Q. Liu, A. H. Sung, and M. Qiao, "Derivative-based audio
steganalysis," ACM Trans. Multimedia Comput., Commun., Appl., vol.
7, no. 3, pp. 18:1–18:9, 2011.

Yongfeng Huang received the Ph.D. degree in computer science and
engineering from Huazhong University of Science and Technology in
2000.
He is an Associate Professor in the Department of Electronic
Engineering, Tsinghua University, Beijing. His research interests
include VoIP, P2P IP TV, multimedia network security, and
next-generation Internet. He has published five books and over 70
research papers on computer network and multimedia communications.
As one of the principal researchers, he has designed and constructed
the China Education and Research Network (CERNET), which is the
second largest computer network in China.
Dr. Huang is the principal/joint grant holder of ten externally
funded research projects.

Chenghao Liu received the B.E. and M.S. degrees in the Department
of Information Engineering, Chongqing Communication Institute,
Chongqing, China.
His research interests mainly focus on information hiding and image
processing.

Shanyu Tang (A'08–M'08–SM'10) received the Ph.D. degree from
Imperial College London in 1995.
He is a Distinguished Professor in the School of Computer Science,
China University of Geosciences. He is dedicated to adventurous
research in fractal computing methods for covert communications,
network security, and bio-informatics.
Dr. Tang is the principal grant holder of four externally funded
research projects. He has contributed to 70 scientific publications
(36 refereed journal papers, including IEEE TRANSACTIONS and IEE/IET
journal papers).

Sen Bai received the B.E. degree in mathematics from Sichuan
University, China, in 1985, and the M.S. and Ph.D. degrees in
applied mathematics, control theory and control engineering from
Chongqing University, China, in 1998 and 2002, respectively. He
is a Professor in the Department of Information Engineering,
Chongqing Communication Institute.
Dr. Bai's research interests mainly focus on information hiding,
image processing, and pattern recognition.