Audio CAPTCHA: Existing solutions assessment and a new implementation for VoIP telephony
Chapter 1
INTRODUCTION
With the rapid worldwide growth of VoIP services, spam in VoIP systems is becoming
increasingly important, which is why major companies, such as NEC and Microsoft, have
already developed mechanisms to tackle SPam over Internet Telephony (SPIT). A serious obstacle
when trying to prevent SPIT is identifying VoIP communications that originate from software
robots (‘‘bots’’). Alan Turing’s ‘‘Turing Test’’ paper discusses the special case of a human tester
who wishes to distinguish humans from computer programs. Recently, there has been
considerable interest in applying an alternate form of the Turing Test, the so-called Reverse Turing
Test. The term ‘‘Reverse Turing Test’’ indicates that the tester is not a human but a
machine. In the spam protection world this kind of computer-administered Reverse Turing Test is
also called CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans
Apart). Research interest in this subject has spurred a number of relevant proposals. Commercial
examples include major stakeholders in the field, such as Google and MSN, which require users to
solve a CAPTCHA (visual or audio) in order to use their services. However, there exist computer
programs that can break the CAPTCHA proposed so far.
In this paper, we develop an audio CAPTCHA that is suitable for use in VoIP systems.
Specifically, we first present the background and related work and explain the main aspects of SPIT and
CAPTCHA. Then, we provide the basic requirements of a CAPTCHA, briefly explain why an audio
CAPTCHA is suitable for VoIP systems, and present an algorithm for selecting a suitable
CAPTCHA.
Dept of ISE, BTLIT Page 1
Chapter 2
BACKGROUND
SPIT constitutes an emerging type of threat in VoIP systems. It shares several similarities
with email spam. Both spammers and ‘‘spitters’’ use the Internet to target a group of users and
initiate bulk and unsolicited messages and calls. Compared to traditional telephony, IP telephony
provides a more effective channel, since messages are sent in bulk and at a low cost. Individuals can
use spam-bots to harvest VoIP addresses. Furthermore, since call-route tracing over IP is harder, the
potential for fraud is considerably greater.
A CAPTCHA is a method that is widely used to counter automated spam attacks. The same
technique can be used to mitigate SPIT. Under this approach, each time a callee receives a call from an
unknown caller, an automated Reverse Turing Test is triggered. The ‘‘spit-bot’’ needs to
solve this test in order to complete its attack. Integrating such a technique into a VoIP system raises
two main issues. First, the CAPTCHA module should be combined with other anti-SPIT controls, i.e.,
not every call should pass through the CAPTCHA challenge, since each CAPTCHA requires
considerable computational resources. A simultaneous triggering of several CAPTCHA challenges
can soon lead to denial of service. Challenges would also cause annoyance to users, if they had to
solve one CAPTCHA for every call they make. Second, a CAPTCHA needs to be friendly and easy
to solve (‘‘pass’’) for a human user.
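The first issue can be illustrated with a minimal sketch of a gate that challenges only unknown callers and caps the number of simultaneous challenges. The class name `CaptchaGate`, the cap of ten concurrent challenges, and the ‘‘defer’’ outcome (what to do when the cap is reached) are assumptions made for illustration; the text above does not prescribe them.

```python
from dataclasses import dataclass, field


@dataclass
class CaptchaGate:
    """Hypothetical gate that decides whether an incoming call is challenged."""
    known_callers: set = field(default_factory=set)
    max_concurrent: int = 10  # assumed cap on simultaneous challenges
    active: int = 0           # challenges currently in progress

    def decide(self, caller: str) -> str:
        # Known callers skip the CAPTCHA, so users are not annoyed on every call.
        if caller in self.known_callers:
            return "accept"
        # Refuse to start more challenges than the cap, to avoid self-inflicted DoS.
        if self.active >= self.max_concurrent:
            return "defer"  # assumption: hand the call to another anti-SPIT control
        self.active += 1
        return "challenge"

    def finish(self, caller: str, passed: bool) -> None:
        # Release the challenge slot; remember callers who passed.
        self.active = max(0, self.active - 1)
        if passed:
            self.known_callers.add(caller)
```

A caller who passes once is added to `known_callers`, which is one simple way to avoid challenging the same user on every call.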
2.1. CAPTCHA
A CAPTCHA is a test that most humans should be able to pass, but computer programs
should not. Such a test is often based on hard open AI problems, e.g., automatic recognition of
distorted text, or of human speech against a noisy background. Unlike the original Turing
Test, CAPTCHA challenges are automatically generated and graded by a computer. Since only
humans are able to return a sensible response, an automated Turing Test embedded in a protocol can
verify whether there is a human or a bot behind the challenged computer. Although the original
Turing Test was designed as a measure of progress for AI, a CAPTCHA is rather a human-nature
authentication mechanism.
This paper is focused on audio CAPTCHA. These were initially created to enable people that
are visually impaired to register or make use of a service that requires solving a CAPTCHA. Today,
an audio CAPTCHA would be useful to defend against automated audio VoIP messages, as visual
CAPTCHA are hard to apply in VoIP systems, mainly due to the limitations of end-user devices. For
example, nowadays not many people have a home telephony device with a screen capable of
displaying a proper (high resolution) image CAPTCHA. If an adequate CAPTCHA is used, it should
be hard for a spit-bot to respond correctly and thus manage to initiate a call. Also, audio CAPTCHA
seem attractive, as text-based CAPTCHA have been shown to be breakable.
2.2. Related work
As the audio CAPTCHA technology is practically in its infancy, the relevant research work is
currently limited.
Bigham and Cavender demonstrated that existing audio CAPTCHA are clearly more difficult
and time-consuming to complete than visual CAPTCHA (Bigham and Cavender, 2009).
They compared the existing CAPTCHA implementations, but did not reach
any conclusion on how their characteristics affect the user success rate. They developed and
evaluated an optimized interface for non-visual use, which can be added in-place to an existing audio
CAPTCHA. In their published CAPTCHA evaluation they mentioned that Facebook, Veoh, and
Craigslist use different CAPTCHA; today, all three of them use Recaptcha (Recaptcha Audio
CAPTCHA).
Tam et al. (2008a,b) described a number of security tests of audio CAPTCHA. The authors
used machine learning techniques, which are similar to the ones used for breaking visual CAPTCHA.
They analyzed three audio CAPTCHA taken from popular websites (Google (Google Audio
CAPTCHA), Recaptcha (Recaptcha Audio CAPTCHA), Digg (DIGG)). In some cases they reached
correct solutions with an accuracy of up to 71%. The main issue with this work is that they only
tested the audio CAPTCHA implementations and did not analyze the impact of audio
CAPTCHA characteristics on performance.
Yan and El Ahmad (2008) worked on the usability issues that should be taken into
consideration when developing a CAPTCHA. Their work does not specifically focus on audio
CAPTCHA, with the exception of a few characteristics (e.g., character set). Their work concludes
with a framework for CAPTCHA usability.
Bursztein and Bethard (2009) developed a prototype audio CAPTCHA decoder, called
decaptcha, which is able to successfully break 75% of the eBay audio CAPTCHA. They described
an automated process for downloading audio CAPTCHA, training the decaptcha bot and finally
solving the eBay CAPTCHA.
Finally, Markkola and Lindqvist (2008) proposed a number of ‘‘voice’’ CAPTCHA for
Internet telephony. However, they did not explain in detail how this could be integrated into an
Internet telephony infrastructure. Also, their work lacks experimentation results.
2.3. A new approach
In this paper, apart from classifying the audio CAPTCHA attributes and evaluating the current
audio CAPTCHA implementations, we develop a new audio CAPTCHA for VoIP environments.
The proposed CAPTCHA must be easy for human users to solve, easy for a tester
machine to generate and grade, and hard for a software bot to solve. The validation of its performance
will be made by two means; namely, by user tests and by a bot configured to solve ‘‘difficult’’ audio
CAPTCHAs. The latter requirement implies that a specific kind of test should be developed; i.e., a
test that is easy to generate but intractable to pass without knowledge that is available to humans but
not to machines. Audio recognition fits in this category. For example, humans can easily identify
words in an environment, whereas this is usually hard for machines (Dusan and Rabiner, 2005; von
Ahn et al., 2008). Specification-wise, a CAPTCHA should ideally be 100% effective at identifying
software bots; in practice, it has been shown (Chellapilla et al., 2005) that a CAPTCHA can be designed to
fight bots with a low failure rate (i.e., <0.1%). In general, a CAPTCHA is effective as long as the
cost of using a software robot remains higher than the cost of using a human, even when the
spammers use cheap labor to solve CAPTCHA (Trend Micro’s TrendLabs).
In order to develop a new audio CAPTCHA, we followed an iterative algorithm: (a) we
selected a set of attributes that are appropriate for audio CAPTCHA, (b) we developed a CAPTCHA
that is based on these attributes, and (c) we evaluated the CAPTCHA by calculating the success rates
of a bot and of a number of users, repeating these steps until the results were adequate (Fig. 1).
Fig. 1. A generic CAPTCHA development process.
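The iterative process of steps (a) to (c) can be sketched as a simple loop. This is only an illustration: the three callables, the acceptance thresholds `min_user_rate` and `max_bot_rate`, and the round limit are hypothetical placeholders, since the text does not state what counts as ‘‘adequate’’.

```python
def develop_captcha(select_attributes, build, evaluate,
                    min_user_rate, max_bot_rate, max_rounds=10):
    """Repeat select -> build -> evaluate until the success rates are adequate.

    select_attributes(feedback) -> attributes   (step a)
    build(attributes)           -> captcha      (step b)
    evaluate(captcha)           -> (user_rate, bot_rate)   (step c)
    All three callables and both thresholds are assumptions of this sketch.
    """
    feedback = None  # no measurements exist before the first round
    for round_no in range(1, max_rounds + 1):
        attributes = select_attributes(feedback)
        captcha = build(attributes)
        user_rate, bot_rate = evaluate(captcha)
        # Adequate: humans pass often enough and the bot rarely does.
        if user_rate >= min_user_rate and bot_rate <= max_bot_rate:
            return captcha, round_no
        feedback = (user_rate, bot_rate)  # drives the next attribute selection
    return None, max_rounds  # no adequate CAPTCHA found within the limit
```

Passing the previous round's rates back into `select_attributes` is one possible way to model the feedback arrow of the process; other designs could adjust one attribute at a time.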
Chapter 3
CAPTCHA ATTRIBUTES
A high user success rate is a key factor in deciding whether a new CAPTCHA is effective or
not. This is particularly important in the case of an audio CAPTCHA, as it does not only refer to
VoIP callers, but also to visually impaired users of a VoIP service. Equally important is the bot
success rate, which should be kept to a minimum. Both factors depend on a number of attributes. The
main characteristic of these attributes is that they should all be adjusted in the production procedure
of the CAPTCHA. We classified these attributes into four categories: (a) vocabulary, (b) background
noise, (c) time, and (d) audio production.
3.1. Vocabulary attributes
Audio CAPTCHA designs vary, mainly due to the vocabulary used. Variations depend upon:
(a) the set of characters the audio CAPTCHA consists of, (b) the number of characters of a single
CAPTCHA, and (c) the local settings, e.g., the language that CAPTCHA characters belong to.
3.1.1. Adequate data field
A data field (called ‘‘alphabet’’) is used as a pool for selecting the characters to be included in
an audio CAPTCHA. In order to integrate an audio CAPTCHA into a VoIP system, we chose an
alphabet of ten one-digit numbers, i.e., {0, ..., 9}. Such a choice allows the use of the DTMF method
for answering the audio CAPTCHA. Other examples of audio CAPTCHA that use only digits are the
MSN and the Google ones. Moreover, some CAPTCHA include beep sounds in their vocabulary, so
as to inform the user that the audio CAPTCHA begins. On the other hand, a limited alphabet and
beep sounds may make an audio method quite vulnerable to attacks.
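As a rough illustration of why a digit-only alphabet fits DTMF answering, the sketch below draws a challenge from {0, ..., 9} and grades the keys pressed by the caller. The function names, the length bounds, and the choice to ignore non-digit keys such as `#` are assumptions of this sketch, not part of the described system.

```python
import secrets

DIGITS = "0123456789"  # the ten-digit alphabet; every symbol is a DTMF key


def new_challenge(min_len=3, max_len=5):
    """Draw a random digit string; the length bounds are assumed values."""
    length = min_len + secrets.randbelow(max_len - min_len + 1)
    return "".join(secrets.choice(DIGITS) for _ in range(length))


def grade(challenge, dtmf_keys):
    """Compare the caller's key presses with the expected digits.

    Non-digit keys ('#', '*') are dropped, assuming they only act as
    terminators on the caller's keypad.
    """
    answer = "".join(k for k in dtmf_keys if k in DIGITS)
    return secrets.compare_digest(answer, challenge)
```

For example, `grade("4071", "4071#")` accepts the answer, while `grade("4071", "4017#")` rejects it.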
3.1.2. Spoken characters variation
In order to make the CAPTCHA solution even harder for a bot to solve, we introduce a
number of different human speakers for each digit of the alphabet. For example, if there are X
different speakers for each character, then there will be X different ways to pronounce each character.
This essentially means that each speaker makes a difference for a bot, but hardly for a human.
Another drawback for a CAPTCHA implementation is the use of a fixed number of
characters. A non-variable number of characters, in combination with a limited alphabet, can make a
CAPTCHA vulnerable to attack. For example, if only 3-digit CAPTCHA are used and a bot can
successfully recognize only 2 of the digits, then it can reach a success rate of ≥10% just by guessing
the remaining digit. On the other hand, if the number of digits of a CAPTCHA is not fixed and a bot
can successfully recognize only 2 of them, then the number of remaining digits is not known to the
bot.
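The guessing argument can be checked with a small calculation. The model below is a simplification we assume for illustration: the bot guesses every unrecognized digit uniformly at random and, when the length varies, all candidate lengths are equally likely and the bot bets on a single one.

```python
from fractions import Fraction


def guess_rate_fixed(length, recognized, alphabet=10):
    """Chance of guessing all unrecognized digits when the length is known."""
    return Fraction(1, alphabet ** (length - recognized))


def guess_rate_variable(lengths, recognized, alphabet=10):
    """Best chance when the length is unknown (assumed uniform over `lengths`).

    The bot must bet on one length, so its success rate is the probability
    of that length times the chance of guessing the remaining digits.
    """
    p_len = Fraction(1, len(lengths))
    return max(p_len * guess_rate_fixed(n, recognized, alphabet)
               for n in lengths)


print(guess_rate_fixed(3, 2))              # 1/10
print(guess_rate_variable([3, 4, 5], 2))   # 1/30
```

Under these assumptions, letting the length vary between 3 and 5 digits cuts the bot's best guessing rate from 10% to about 3.3%.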
3.1.3. Language requirements
Another important factor is the mother tongue of the users, as it plays a major role in
achieving a high success rate for human users. This is particularly important in the case of audio
methods, where identifying spoken characters is hard when the mother tongue of the speaker and
that of the user differ. Therefore, the language should meet the
scope of the specific CAPTCHA implementation. As a good practice, the spoken characters should be
no more than a few. The CAPTCHA we developed can be adjusted for non-English users, as it is
created dynamically and different characters can be added easily.
3.2. Noise attributes
Noise is yet another important attribute of an audio CAPTCHA, as it can increase the
difficulty for an automated procedure to solve it.
3.2.1. Background noise
The background noise, which can be added during the production of a voice message, can
make CAPTCHA particularly resistant to attacks by automated bots. Application of background noise
requires a great variety of such noises to be available. These noises should be rotated in an erratic
manner. In our proposal, instead of developing a repository with noises we chose to proceed with a
dynamic production of them, while ensuring that they are distorted in a random manner. The way
various noises are produced should prevent their easy elimination by automated programs that use
learning techniques ( Tam et al., 2008a). In any case, the final version of the audio message, resulting
from the combined use of different distortion techniques and added noise, should be such that the
majority of users can easily recognize it. In the proposed CAPTCHA there was a real-time distortion,
applied in between the characters, as there appears to be no effective method for evaluating how
people understand digits with distortion.
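A minimal sketch of dynamically produced noise follows, assuming audio is represented as a list of float samples in [-1.0, 1.0]. The uniform noise model and the level bounds are assumptions chosen for brevity; they are not the distortion actually used in the proposed CAPTCHA.

```python
import random


def add_background_noise(samples, rng, min_level=0.05, max_level=0.2):
    """Mix freshly generated noise into `samples` (floats in [-1.0, 1.0]).

    The noise amplitude is drawn anew on every call, so no fixed noise
    repository exists for a bot to learn. The level bounds are assumed.
    `rng` is any random.Random instance; random.SystemRandom() would be
    the safer choice outside of tests.
    """
    level = rng.uniform(min_level, max_level)
    noisy = []
    for s in samples:
        v = s + rng.uniform(-level, level)
        noisy.append(max(-1.0, min(1.0, v)))  # clip to the valid sample range
    return noisy
```

Because both the level and the individual noise samples are random, two renderings of the same digit sequence differ sample by sample.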
3.2.2. Intermediate noise
Intermediate noise may prevent an automated program from correctly isolating spoken
characters in a voice message. The developer needs to select the scale at which the intermediate
noise will be applied, because intermediate noise can decrease not only the automated bot success rate
but also that of the user ( Festa, 2003). Also, as this noise should have the same characteristics as the
background noise, it should be created dynamically.
3.3. Time attributes
A set of variables should be defined during the production of an audio snapshot. The variables
refer to the length of the audio message, which depends on: (a) the number of characters spoken, (b)
the characters chosen, and (c) the time required for each character to be announced, which in turn
depends on the speaker of each character. Both the beginning and the end of each spoken character
should also be defined. This depends on the duration of each character, as well as on the duration of
the pause between spoken characters. If the above time parameters follow specific patterns, then the
resistance of the audio CAPTCHA to a bot will decrease significantly. In the proposed CAPTCHA
we aim at eliminating such time-related patterns.
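One way to avoid time-related patterns is to draw every pause at random, as in the sketch below. The function name and the pause bounds (in seconds) are assumed values for illustration only.

```python
import random


def schedule(durations, rng, min_pause=0.2, max_pause=1.0):
    """Return (start, end) times in seconds for each spoken character.

    `durations` holds the length of each character's recording, which in
    turn depends on the chosen speaker. Every pause, including the lead-in
    before the first character, is drawn independently so that boundaries
    do not follow a fixed pattern. The pause bounds are assumptions.
    """
    t = rng.uniform(min_pause, max_pause)  # random lead-in before first digit
    slots = []
    for d in durations:
        slots.append((t, t + d))
        t += d + rng.uniform(min_pause, max_pause)  # random gap to next digit
    return slots
```

The total message length then varies from one CAPTCHA to the next even when the same digits and speakers are used.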
3.4. Audio production attributes
In principle, an audio CAPTCHA production procedure should be automated. In practice, an
acceptable human interference could be allowed only for the adjustment of the various thresholds.
3.4.1. Automated production process
The automation of the CAPTCHA production process is a desirable, though hard to achieve,
property. The various elements that compose an audio CAPTCHA, such as the number of characters
of a message, the speaker of each character, the background sound, the timing and the distortion of
the message, make the process time-costly and demanding in terms of hardware resources. Our
choice is to produce audio CAPTCHA periodically, in order: (a) not to produce them in real-time, and
(b) not to produce identical snapshots for extended time periods.
3.4.2. Audio CAPTCHA reappearance
An audio CAPTCHA should reappear as rarely as possible. However, with short alphabets
every CAPTCHA is actually expected to reappear after a while. Due to the attributes of the voice
messages (e.g., technical distortion, added noise, language, speakers, etc.), as well as to the context of
the user (e.g., noisy environment, etc.), a voice message sometimes cannot be identified by the user
on the first attempt. Therefore, a second chance should be given. In this case, a different CAPTCHA
should be used.
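The second-chance rule can be sketched as a per-call session that never serves the same CAPTCHA twice. The class `ChallengeSession` and its pool of pre-generated answer strings are hypothetical names used for illustration; the limit of two attempts follows the text above.

```python
import secrets


class ChallengeSession:
    """Hypothetical per-call session: at most two attempts, no repeats."""

    def __init__(self, pool, max_attempts=2):
        self.remaining = list(pool)   # answers of periodically produced CAPTCHA
        self.max_attempts = max_attempts
        self.attempts = 0
        self.current = None

    def next_challenge(self):
        """Pick an unused CAPTCHA, or return None when no attempt is left."""
        if self.attempts >= self.max_attempts or not self.remaining:
            return None
        idx = secrets.randbelow(len(self.remaining))
        self.current = self.remaining.pop(idx)  # popped, so it cannot reappear
        self.attempts += 1
        return self.current

    def check(self, answer):
        """Grade the answer against the most recently served CAPTCHA only."""
        return self.current is not None and answer == self.current
```

Removing each served CAPTCHA from the session's pool guarantees that the second attempt uses a different one.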
3.4.3. Audio CAPTCHA reproduction
An audio CAPTCHA should be reproduced in a streaming way. The main reason for this is
that most of the bots need a training session before they are able to solve a CAPTCHA. Therefore, if
the audio reproduction process is not streaming, then the bot could easily download all audio
CAPTCHA that are needed for the training session.
Fig. 2 summarizes the attributes of an audio CAPTCHA.
Fig. 2 Audio CAPTCHA attributes.
Chapter 4
AUDIO CAPTCHA EVALUATION
In this section we evaluate some popular audio CAPTCHA using the above-mentioned
characteristics. First, we collected twelve (12) different audio CAPTCHA, not only from popular
websites (i.e., Google, Hotmail, Recaptcha), but also from other sources (Secure Image CAPTCHA).
For each of them we downloaded 100 examples (in .wav or .mp3 format), resulting in a total of 1200
audio files that were used for the evaluation.
Then, for each audio CAPTCHA we provide a short description of its functionality. We
conclude with a table that includes all these CAPTCHA, together with their attributes.
Two interesting points, regarding our analysis, are:
1. The user success rate was calculated by inviting 10 users to solve 5 CAPTCHA of each
implementation. All CAPTCHA were in English, which was the mother tongue of one (1) of the
participants (as a requirement, all users had to speak English). All users had a university degree.
Also, they all used a PC for more than 20 h/week.
2. The ‘‘automated creation’’ attribute could not be assessed for the commercial CAPTCHA (Google,
MSN), as their relevant algorithms are not publicly available.
4.1. Google
The Google Audio CAPTCHA uses a limited data field of ten digits (0, ..., 9), which does not seem
adequate for every situation; however, it is suitable for a VoIP system. The number of digits for each
audio CAPTCHA is not fixed, but it ranges from 5 to 10 digits. Moreover, this CAPTCHA is
available in multiple languages. This CAPTCHA uses background and intermediate noise. The noise
at the beginning is louder and then a different speaker is used for the announcement of each character.
In addition, the duration of a CAPTCHA ranges from 20 to 50 s (based on our Google Audio
sample). Google uses three beeps every time an audio CAPTCHA begins. These beeps make the
audio CAPTCHA vulnerable to attacks because it is much easier for a bot to know when a
CAPTCHA begins. Furthermore, Google Audio CAPTCHA is announced twice in every audio file,
therefore an attacker can process it twice and has multiple attempts to find the right answer. Finally,
the most important drawback is the user success rate, which is not adequately high.
4.2. MSN
The MSN Audio CAPTCHA uses a limited data field of ten (10) digits, with a fixed number
of spoken characters (10) in each one. The frequency of the spoken characters varies, since a number
of different speakers are used. That makes MSN Audio CAPTCHA vulnerable to attacks. Also, it is
available in multiple languages. MSN uses weak and constant background noise. The distance
between the words is, to a large extent, constant. Moreover, the duration of the CAPTCHA is not
always the same (e.g., one CAPTCHA lasts 0:07 s, another 0:16 s). There are no beeps at the
beginning of this audio CAPTCHA. The main advantage of MSN Audio CAPTCHA is that it is easy for a
user to understand. As a result, the user success rate is high.
4.3. Recaptcha
The Recaptcha Audio CAPTCHA uses a large data field that includes various phrases.
Therefore, the number of spoken words varies and it is available only in English. Recaptcha uses no
background noise. On the other hand, it uses distortion techniques and multiple speakers, with
different pronunciation and different pace. The user can hear the audio CAPTCHA twice in one audio
file (like Google). Recaptcha does not use beeps. The duration of this CAPTCHA is almost fixed.
Moreover, the user success rate is very low. Recaptcha Audio CAPTCHA meets most of the
requirements for an effective tool. Its main drawbacks are the vocabulary (it includes more than digits),
as well as the user success rate, which is low. The latter happens because it does not seem easy for a user
to understand the words and their combination.
4.4. eBay
The eBay Audio CAPTCHA has a limited data field of ten (10) digits (0–9). The number of
spoken characters is always six (6). The CAPTCHA uses different speakers and it is available in
several languages, depending on the specific eBay site (e.g., the digits on www.ebay.fr are
pronounced in French). Moreover, there is a different background noise for each digit, but there is no
intermediate noise. Finally, the duration of the CAPTCHA, as well as the speaker pace, are both
fixed. The main advantages of this implementation are the high user success rate, the lack of beeps at
the beginning or end of the CAPTCHA, and its streaming reproduction.
4.5. Secure Image CAPTCHA
Secure Image CAPTCHA uses an adequate data field of digits (0–9) and letters (A–Z). The
number of spoken characters is fixed and it is available only in English. On the other hand, this
CAPTCHA uses the same speaker all the time. Moreover, it uses simple background noise and there
is no intermediate one. Also, the CAPTCHA duration and the speaker pace are fixed. Secure Image
CAPTCHA is an open-source free PHP CAPTCHA script; therefore most of the attributes can be