Cache Based Side Channel Attacks On AES
A Major Project Report
Submitted in partial fulfillment for the Award of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
By
Ravi Prakash Giri
2011ECS32
To
SHRI MATA VAISHNO DEVI UNIVERSITY,
J&K, INDIA
MAY, 2015
Certificate
This is to certify that I, Ravi Prakash Giri (2011ECS32), have worked under the
guidance of Mrs. Sonika Gupta on the project titled "Cache Based Side Channel
Attacks On AES" in the School of Computer Science & Engineering, College of
Engineering, Shri Mata Vaishno Devi University, Kakryal, Jammu & Kashmir from
2nd Jan 2015 to 17th May 2015 for the award of Bachelor of Technology in
Computer Science & Engineering.
The contents of this project, in full or in parts, have not been submitted to any
other Institute or University for the award of any degree or diploma.
Student’s Signature
Student’s Name
This is to certify that the above student has worked for the project titled
"Cache Based Side Channel Attacks on AES" under my supervision.
Signature:
Guide Name: Mrs. Sonika Gupta
Acknowledgement
I would like to express my sincere gratitude to Prof. Bernard L. Menezes, IIT-Bombay, for his constant motivation, useful suggestions and words of wisdom. He has been my primary source of guidance during my entire project. I would like to extend my gratitude towards my internal guide, Mrs. Sonika Gupta, for her guidance and for providing necessary information regarding the project. I am extremely grateful for the opportunity to work on this project in a team comprising Bholanath Roy, Vibhor Agrawal and Ashokkumar C under the supervision of Prof. Bernard Menezes at IIT-Bombay. A summary of this work has recently been submitted in a paper titled "Design and Implementation of an Espionage Network for Cache-based Side Channel Attacks on AES" to an international conference.
Abstract
Side channel attacks exploit information gained from the physical implementation or design of a cryptographic system rather than its mathematical weaknesses. We have extended and modified existing work in the field of cache-based side channel attacks targeting software implementations of the Advanced Encryption Standard (AES) by designing and implementing an espionage network. Our model includes a spy controller, a ring of spy threads and an analytical operator, all hosted on a single server. The collaborative execution of the spy controller and the spy ring restricts the victim process to accessing very few cache memory lines where the lookup tables reside. Our results indicate that our setup can deduce the encryption key in fewer than 30 encryptions and with far fewer victim interruptions compared to previous work. Moreover, this approach can be adapted to work on various OS platforms.
Chapter 1

Introduction

With the increasing popularity of the internet as a communication as well as a data storage medium, the demand for securing confidential data against unauthorized access has grown considerably during the last decade. Cryptographic schemes that prevent confidential data from being accessed by unauthorized users have become increasingly important, and more and more such schemes are in use. Before being deployed in practice, such schemes typically have to pass a rigorous review process to eliminate design weaknesses. However, ensuring the theoretical soundness of such a scheme does not ensure the concrete security of its physical implementation.
Side-channel cryptanalysis is any attack on a cryptosystem that requires information emitted as a byproduct of the physical implementation. Side channel attacks are an important class of implementation-level attacks on cryptographic systems that exploit leakage of information through data-dependent characteristics of physical implementations, such as electromagnetic radiation, power consumption of the device, running time of certain operations, etc., and are typically specific to the actual implementation of the algorithm. Side channel attacks utilize the fact that, in reality, a cipher is not a pure mathematical function E_k[P] → C, but a function E_k[P] → (C, t), where t is any additional information produced by the physical implementation[13]. An important class of timing attacks are those based on obtaining measurements from cache memory systems.
General classes of side channel attack include:

• Timing attacks are based on measuring how much time various computations take to perform.

• Power-monitoring attacks are those that make use of varying power consumption by the hardware during computation.

• Electromagnetic attacks are based on leaked electromagnetic radiation, which can directly provide plaintexts and other information. Such measurements can be used to infer cryptographic keys using techniques equivalent to those in power analysis, or can be used in non-cryptographic attacks, e.g. TEMPEST (a.k.a. van Eck phreaking or radiation monitoring) attacks.

• Acoustic cryptanalysis attacks exploit sound produced during a computation (rather like power analysis), while differential fault analysis attacks extract secrets by introducing faults into a computation.

• Row hammer attacks are another kind of side channel attack in which off-limits memory can be changed by accessing adjacent memory.
The Advanced Encryption Standard (AES)[6], a relatively new algorithm for secret key cryptography, is now universally supported on servers, browsers, etc. Software implementations of AES, including OpenSSL's, make extensive use of table lookups in lieu of time-consuming mathematical field operations[6]. Cache-based side channel attacks take advantage of the fact that access times to different levels of the memory hierarchy differ, and can hence retrieve the key of a victim performing AES.
1.1 Purpose
The purpose of our experiment is to design and implement an efficient cache-based side channel attack on the Advanced Encryption Standard, the de facto standard of secret key cryptography. In the last 10 years, various attacks on AES have been reported, each with its own complications. The main purpose of our experiment is therefore to develop a much easier attack that requires far fewer victim interruptions and encryptions than any previous work and that works on today's modern processors such as the Core i5 and Core i7.
1.2 Problem Statement
1.2.1 Motivation
Among the many side channels available, the reason we are particularly interested in the cache is that caches form a shared resource for which all processes compete, and the cache is thus affected by every process. While the data stored in the cache is protected by virtual memory mechanisms, the metadata about the contents of the cache, and in particular the memory access patterns of processes using that cache, are not fully protected.

The cache thus provides an easily accessible medium on which an attacker can spy in a concealed manner.
1.2.2 Goals
• To design and implement an espionage network with associated analytic capabilities that retrieves the AES key using fewer encryptions and fewer interruptions to the victim process.

• To demonstrate a complete attack on the OpenSSL implementation of AES, and further to reduce the time quantum provided to the victim process to an extent useful for our attack.

• To understand how both shared and non-shared AES tables can be exploited through the cache.
1.3 Report Overview
This document is a brief report on how the cache can be exploited as a side channel. To start with, the report briefly describes how the cache works and how it can be used as a medium for spying and gathering information which is otherwise meant to be secret.

Chapter 2 of this report describes the related work done in this field. Chapter 3 describes the preliminaries of side channel attacks and cache operation. Chapter 4 gives a broad idea of how the cache can be used as a medium for the attacker in both public key and secret key cryptography. The report mainly concerns the attack on AES, so Chapter 5 explains the AES algorithm and how it is implemented. We will be focusing on this algorithm only to perform our attack. In Chapters 6 and 7 we will go through the techniques to exploit AES in shared as well as non-shared scenarios. The next chapter deals with the design and implementation of our espionage infrastructure for the attack. In the following chapters, we will analyse the results of our attack and discuss countermeasures.
Chapter 2
Related Work
The first consideration of cache memory as a covert channel to extract sensitive information was mentioned by Hu[12]. In April 2005, D.J. Bernstein announced a cache-timing attack against software implementations of AES, which he used to break a custom server using OpenSSL's AES encryption[5]. The attack required over 200 million chosen plaintexts on a Pentium III machine. The custom server was designed to give out as much timing information as possible (the server reports back the number of machine cycles taken by the encryption operation). Although the attack is generic and portable, it needs 2^27.5 encryptions and sample timing measurements with a known key on an identical configuration of the target server.
In 2003, Tsunoo et al.[17] demonstrated a time-driven cache attack on DES. They focused on the overall hit ratio during encryption and performed the attack by exploiting the correlation between cache hits and encryption time. A similar approach was used by Bonneau et al., who emphasized individual cache collisions during encryption instead of the overall hit ratio[13]. Although Bonneau's attack was a considerable improvement over previous work, it still requires 2^13 timing samples.
In October 2005, Dag Arne Osvik, Adi Shamir and Eran Tromer presented a
paper[16] demonstrating several cache-timing attacks against AES. One attack was
able to obtain an entire AES key after only 800 operations triggering encryptions,
in a total of 65 milliseconds. This attack requires the attacker to be able to run
programs on the same system or platform that is performing AES.
A major contribution to access-driven cache attacks appeared in the paper by Tromer et al. presented in 2010[7]. They performed both synchronous and asynchronous attacks. In the synchronous attack, 300 encryptions were required to recover a 128-bit AES key on an Athlon64 system, and in the asynchronous attack, 45.7 bits of information about the key were retrieved effectively. They introduced the Prime + Probe technique to perform an access-driven attack. In the prime phase, the attacker fills the cache with its own data before encryption begins. During encryption, the victim evicts some of the attacker's data from the cache in order to load lookup table entries. In the probe phase, the attacker measures the reloading time of its data and finds the cache misses corresponding to those lines where the victim loaded lookup table entries. Both the attacker and the victim must execute on the same core of the processor for the attack to succeed.
The ability to detect whether a cache line has been evicted was further exploited by Neve et al. in 2007[14]. Advancing the state of asynchronous attacks, they performed an improved access-driven cache attack on the last round of AES to recover a 128-bit key with 20 encryptions. However, this attack was feasible only on single-threaded processors, and the practicality of their implementation was not clear due to insufficient system and OS kernel version details.
Gullasch et al. proposed an efficient access-driven cache attack[11] for the case where the attacker and victim use a shared crypto library. The spy process first flushes the memory lines corresponding to all lookup table entries from all levels of cache and interrupts the victim process after allowing it a single lookup table access. After every interrupt, by measuring reload times it finds which memory line was accessed by the victim. This information is further processed using a neural network to remove noise in order to retrieve the AES key.
Wei et al. used Bernstein's timing attack on AES running inside an ARM Cortex-A8 single-core system in a virtualized environment to extract the AES encryption key[15]. Apecechea et al. in 2014 performed Bernstein's cache-based timing attack in a virtualized environment (Xen and VMware VMMs) to recover the AES secret key[10] from a co-resident VM with 2^29 encryptions. Later they improved on this in the paper by Irazoqui et al.[8] using the Flush + Reload technique, recovering the AES secret key with 2^19 encryptions.
We improve over prior work of the last decade by providing a first practical access-driven attack on the AES algorithm. Our attack works under much weaker assumptions and with far fewer victim interruptions than any of the attacks discussed so far. Moreover, it is very efficient, as it requires only about 25 encryptions to retrieve the complete AES key.
Chapter 3
Preliminaries
3.1 Basics of Cache working
The cache is placed between RAM (main memory) and the CPU. Instructions and data, before reaching the CPU from memory, get stored in the cache and are accessed from there. A cache is a component that stores data so future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation, or a duplicate of data stored elsewhere. A cache hit occurs when the requested data can be found in the cache, while a cache miss occurs when it cannot. When a cache miss occurs, the CPU retrieves the data from main memory and stores it into the cache. This action is motivated by the temporal locality principle: recently accessed data is likely to be accessed again. Cache hits are served by reading data from the cache, which is faster than recomputing a result or reading from a slower data store; thus, the more requests that can be served from the cache, the faster the system performs.

The CPU takes advantage of spatial locality as well: when data is accessed, values stored close to the accessed data are likely to be accessed soon. Hence, when a cache miss occurs, the CPU loads not just the requested data but the whole cache line that includes the nearby data. The cache line is the unit of data that can be written or retrieved at a time.
To understand this in more detail, let us assume we have an n-way set associative cache, where each address can be mapped to n different cache blocks. The cache contains 2^a cache sets, each containing n cache lines, and each line in turn contains 2^b bytes of data. To locate a cache line holding data from memory address A, the least significant b bits are ignored, as the cache line size is 2^b bytes. The next a bits denote the cache set, and the remaining bits form the tag field used to verify the correct entry. Data can go into any line within its set.
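The address decomposition above can be sketched as follows. This is our own illustrative code, assuming a 64-byte line (b = 6) and 64 sets (a = 6), as in a typical 32 KB, 8-way L1 data cache; the function name is ours.

```python
def cache_location(addr, b=6, a=6):
    """Split a memory address into (tag, set index, line offset).

    b: log2(line size in bytes), a: log2(number of cache sets).
    """
    offset = addr & ((1 << b) - 1)             # least significant b bits
    set_index = (addr >> b) & ((1 << a) - 1)   # next a bits select the set
    tag = addr >> (b + a)                      # remaining bits verify the entry
    return tag, set_index, offset

# Construct an address with tag 5, set index 3, offset 7 and decode it back.
addr = (5 << 12) | (3 << 6) | 7
print(cache_location(addr))   # -> (5, 3, 7)
```

Note that two addresses differing only in their low b bits land on the same cache line, which is exactly why a spy can observe table accesses only at line granularity.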
Which line within a set is evicted on a miss is determined by a predetermined cache replacement policy; in general, a Least Recently Used (LRU) replacement policy is used.
3.2 Cache based Side Channel Attacks
Previously, side channel attacks were used to break specialized systems such as smart cards. Nowadays the major focus is on side channel attacks that exploit the shared resources in conventional microprocessors. Such attacks are very powerful because they do not require the attacker's physical presence to observe the side channel and can therefore be launched remotely using only non-privileged operations.

Cache-based side-channel attacks are an example of this class of attacks. Here, an attacker process monitors the cache activity performed by the victim cipher process. If carefully designed, such attacks can leak enough information to recover the secret key. Cache-based side-channel attacks rely on the fact that if the CPU accesses data from memory and that data is not available in the cache, it experiences a delay due to the cache miss, and this delay is large enough to be distinguished from the case where the data is present in the cache. The attacker can thus find out the occurrence and frequency of the victim's cache misses.
The run-time of fast software-based ciphers like AES heavily depends on the speed at which table lookups are performed. A popular style of implementation of AES is its T-table implementation[6]. It combines all four major operations performed throughout an encryption into a single table lookup per state byte, along with xor operations. The index of the loaded entry is determined by a byte of the cipher state. Therefore, information on which table values have been loaded into the cache can reveal information about the secret state of AES.
In any side channel attack, there are essentially two phases:
1. Online phase, where side channel information is gained using repeated encryption/decryption. Here, the attacker measures and tabulates the side channel information (timing, power consumption, etc.) as per the attack method.

2. Offline phase, where the data from the online phase is used to generate results and graphs that help in the prediction and verification of observations regarding the secret value of the cipher. In many cases, analysis from this phase actively steers which encryptions and decryptions are carried out in the online phase.
3.3 Types of Cache based side channel attacks
3.3.1 Time driven
In time-driven attacks[5], the attacker can observe the aggregate profile of an encryption or decryption, i.e. the total execution time taken by the cipher process to complete that encryption or decryption. The attacker then correlates the time taken by the cipher process with the number of cache misses occurring during that encryption: the more cache misses, the longer the execution time. This attack relies on accurate measurement of the encryption timing and executes the timing code synchronously before and after an encryption run.

As this attack is based on overall execution time, and other factors (such as other processes running simultaneously with the victim) can affect the victim process, we need a large number of samples in the offline phase to accurately extract information about the secret key. That said, this type of attack is very easy to carry out and requires minimal coding in the online phase for gathering the required information. To find the relationship between timing information and key values, the attacker can make statistical, algorithm-specific inferences about the cipher state during processing.

For example, it might be inferred that in encryptions with a large number of misses, certain key-related variables are unequal, as they access different parts of memory causing cache misses, while in encryptions with fewer misses they are equal. With such observations, the attacker can relate plaintext to cipher key and hence unravel the key bits.
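As a toy illustration of this reasoning (our own simulation, not the attack of [5]): assume the encryption runs slightly longer whenever the first-round table index p ⊕ k falls in cache line 0 of a lookup table. Averaging the measured times over the samples that each guess predicts should touch that line recovers the key byte up to cache-line granularity, i.e. its high nibble.

```python
import random

random.seed(1)
SECRET_KEY_BYTE = 0x4A
LINE = lambda x: x >> 4   # 16 four-byte table entries per 64-byte cache line

def measure(p):
    """Toy timing model: touching line 0 of the table costs extra cycles."""
    base = 100 + random.randint(0, 3)   # ambient noise
    penalty = 40 if LINE(p ^ SECRET_KEY_BYTE) == 0 else 0
    return base + penalty

samples = [(p, measure(p)) for p in (random.randrange(256) for _ in range(4000))]

def score(guess):
    """Average time over samples that this guess predicts should hit line 0."""
    times = [t for p, t in samples if LINE(p ^ guess) == 0]
    return sum(times) / len(times) if times else 0.0

best = max(range(256), key=score)
print(best >> 4 == SECRET_KEY_BYTE >> 4)   # True: high nibbles agree
```

Since the timing model only distinguishes cache lines, all 16 guesses sharing the correct high nibble score equally well; only the upper 4 bits of the key byte are recoverable, which matches the line-granularity limitation discussed above.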
3.3.2 Trace driven
In trace-driven attacks[3], the attacker is able to capture the profile of cache activity during encryption, down to the granularity of individual memory accesses. In this type of attack, the attacker can figure out the outcome of every memory access (the trace) the cipher process issues in terms of hits and misses.

A trace can be defined as a sequence of cache hits and misses. For example, HMMM, MMMH, HHMM and HMHM are all valid traces, where H represents a cache hit and M a cache miss. The attacker can observe whether a particular memory access to a lookup table yields a hit or a miss, and thus can infer information about the lookup indices. As these indices are key dependent in almost all cases, secret information can be revealed.
This type of information can be obtained using simple power analysis of the target process. As the power consumption of a microprocessor depends on the instruction being executed and on the data being manipulated, the attacker can observe the difference in power consumption when the cache miss routine is being carried out by the victim.
3.3.3 Access driven
These attacks are the most recent of the three and the most powerful among them. Here the attacker and victim processes share the cache memory, and secret information is leaked using the cache as the side channel medium. The attacker can determine information down to the granularity of the cache sets modified by the victim process, and can thus determine which elements of the lookup table were accessed by the cipher.
Figure 3.1: Access based cache attack[3]
The whole process can be summarized as follows. In such attacks, the two processes execute on the same machine, thus sharing the data cache. During encryption, the victim process requests data residing in memory, causing either a cache hit or a miss. The attacker spies on this cache activity of the victim process and, using techniques such as Prime + Probe, determines the cache set being accessed.

Among the three techniques, this is the most powerful and can give the most information to the attacker. However, gathering such information from the system under scrutiny is quite complex.
Chapter 4
Cache Attacks in Cryptographic
Algorithms
4.1 Introduction
Cache-based side channel attacks are applicable to both secret key and public key encryption schemes. The next two subsections briefly describe how they can be used in both scenarios.
4.1.1 Cache attacks in secret key cryptography
The basic principle of a cache-based side channel attack is the difference in access time between a cache hit and a cache miss.

Secret key ciphers such as AES and DES are built from simple mathematical operations which are repeated round after round to strengthen the encryption. In AES, for instance, each encryption consists of 10 almost identical rounds, with each round a combination of four simple mathematical/logical operations.

Owing to their simple nature, these operations can easily be realized in the form of tables/arrays, with intermediate values precomputed, stored and simply accessed as and when needed. This greatly reduces the time required to perform the operations, as all four operations of a round are reduced to a few table accesses.

However, this creates an opening for a side channel attack. These lookup tables are loaded into the cache, and the encryption algorithm uses some combination of key bits to index into them. If the attacker somehow figures out information about the locations accessed by the encryption algorithm, he/she can directly relate it to the key bits.
4.1.2 Cache attacks in public key cryptography
Public key cryptography on the other hand is based on heavy mathematical opera-
tions (in the order of 100’s of bits of numbers). For example, in RSA encryption of
a message, we need to calculate (mp mod n), where p is a large number of the order
of 1000’s of bit.
Due to such huge complexity involved, they, unlike secret key cryptography can-
not be pre-computed and stored in table. Due to this, these operations takes too
much time as compared to secret key counterparts. As there are no tables involved,
we cannot apply the same principle as that in AES to attack such schemes.
However, we can observe that while performing such operations (modular expo-
nentiation, etc), we take different paths based on the secret bits. As an example, I
want to perform operation a50 for some ’a’.
Writing 50 in binary notation: 110010.
Now, starting with result = 1, and moving right bit by bit from left side (MSb)
in exponent,
For every bit 1, we first square the result and multiply with a.
For every bit 0, we simple square the result.
The steps to get the result[4]:

Bit of 110010 considered    Result (initial value = 1)
bit 1 (1)                   1^2 * a = a
bit 2 (1)                   a^2 * a = a^3
bit 3 (0)                   (a^3)^2 = a^6
bit 4 (0)                   (a^6)^2 = a^12
bit 5 (1)                   (a^12)^2 * a = a^25
bit 6 (0)                   (a^25)^2 = a^50

Table 4.1: Steps in calculating a^50
We can clearly see that different operations are performed based on different bit values. This is the basis of side channel attacks in the public key scenario. These functions are loaded into memory and thus map to some cache location(s). Let us assume the square function is mapped to line x and the multiply function to line y. The attacker, instead of spying on the data cache, now watches the instruction cache and tries to figure out at each step whether a multiplication or a squaring is performed, by continuously monitoring both lines x and y. Once the attacker obtains the ordering of the squaring and multiplication operations, he/she can simply read off the unknown secret exponent.
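The left-to-right square-and-multiply procedure described above can be sketched as follows. The trace list records, per exponent bit, whether a multiply was performed, which is precisely the sequence an instruction-cache spy would try to observe (the modulus 1009 is an arbitrary illustrative choice):

```python
def square_and_multiply(a, e, n):
    """Left-to-right square-and-multiply: computes a**e mod n.

    Also returns the operation trace ('S' = square only, 'SM' = square
    then multiply), which leaks the exponent bits one-for-one.
    """
    result = 1
    trace = []
    for bit in bin(e)[2:]:                 # scan exponent from MSB to LSB
        result = (result * result) % n     # always square
        if bit == '1':
            result = (result * a) % n      # multiply only on a 1 bit
            trace.append('SM')
        else:
            trace.append('S')
    return result, trace

r, trace = square_and_multiply(7, 50, 1009)
print(r == pow(7, 50, 1009))   # True
print(trace)                    # ['SM', 'SM', 'S', 'S', 'SM', 'S']
```

Reading the trace as 1 for 'SM' and 0 for 'S' gives back 110010, i.e. the exponent 50, which is exactly the leakage the instruction-cache spy exploits.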
Chapter 5
Advanced Encryption Standard
5.1 Description of the Cipher
AES is based on a design principle known as a substitution-permutation network, a combination of both substitution and permutation, and is fast in both software and hardware. Unlike its predecessor DES, AES does not use a Feistel network. AES is a variant of Rijndael with a fixed block size of 128 bits and a key size of 128, 192 or 256 bits.

Let us briefly look at how AES works and how the AES tables are computed and used. AES operates on a 4 x 4 column-major order matrix of bytes, processing 16 bytes at a time. The key size determines the number of repetitions of the transformation rounds that convert the input into intermediate outputs, the last of which becomes the ciphertext.
Number of rounds are:
1. 10 Rounds for Key size of 128 bits
2. 12 Rounds for Key size of 192 bits
3. 14 Rounds for Key size of 256 bits
Each round consists of several processing steps, comprising four similar but different stages, including one that depends on the encryption key itself. A set of reverse rounds is applied to transform the ciphertext back into the original plaintext using the same encryption key.
5.2 AES Algorithm
The Algorithm consists of four main parts:
1. Key Expansions: Generating round keys for each round using Rijndael’s key
schedule algorithm. AES requires a separate 128-bit round key block for each
round plus one more.
2. Initial Round: Each byte of the state is combined with a block of the round
key using bitwise xor.
3. Rounds:
• Sub Bytes
• Shift Rows
• Mix Columns
• Add round Key
4. Final Round:
• Sub Bytes
• Shift Rows
• Add round Key
5.2.1 Key Expansions

AES uses the Rijndael key schedule[6] to compute a separate round key for each round from the initial key. It uses the Rijndael S-box in the process. Algorithm 1 below describes the procedure for a 128-bit key.

Let w[0] ... w[3] be initialized with the original AES key, where each w[i] is a 4-byte word.
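A minimal sketch of this key expansion for AES-128 follows; here f(x) is the usual combination of a one-byte word rotation, S-box substitution and round-constant addition, and the S-box is generated from its GF(2^8) definition rather than hardcoded (function names are ours):

```python
def gmul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def gf_inv(x):
    """Multiplicative inverse in GF(2^8); 0 maps to 0 by convention."""
    return next((y for y in range(256) if gmul(x, y) == 1), 0)

# Rijndael S-box: GF(2^8) inverse followed by the affine transformation.
SBOX = []
for x in range(256):
    v = gf_inv(x)
    s = v
    for sh in (1, 2, 3, 4):
        s ^= ((v << sh) | (v >> (8 - sh))) & 0xFF
    SBOX.append(s ^ 0x63)

RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]

def key_schedule(key):
    """Expand a 16-byte key into the words w[0..43] of Algorithm 1."""
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    for i in range(4, 44):
        x = list(w[i - 1])
        if i % 4 == 0:                    # x <- f(x)
            x = x[1:] + x[:1]             # RotWord
            x = [SBOX[b] for b in x]      # SubWord
            x[0] ^= RCON[i // 4 - 1]      # round constant
        w.append([a ^ b for a, b in zip(w[i - 4], x)])
    return w

# First expanded word for the FIPS-197 Appendix A example key:
key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
w = key_schedule(key)
print(bytes(w[4]).hex())   # a0fafe17, matching FIPS-197
```

Each group of four words w[4r] ... w[4r+3] forms the round key K(r), so the 44 words cover the initial whitening key plus the 10 round keys.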
5.2.2 Initial Round
Before starting the first round, each byte of the plaintext is combined with the corresponding byte of the initial 128-bit key using a bitwise XOR operation.
5.2.3 Rounds
Each round except the last performs the four steps below:
Algorithm 1 Rijndael key schedule

1: procedure KeySchedule
2:     for i = 4 to 43 do
3:         x ← w[i-1]
4:         if i is a multiple of 4 then
5:             x ← f(x)
6:         end if
7:         w[i] ← w[i-4] ⊕ x
8:     end for
9: end procedure
1. SubBytes() Transformation : In this step, each byte in the state matrix is replaced with another according to a lookup table called the Rijndael S-box (substitution box). This step provides nonlinearity in the cipher. The S-box used is derived from the multiplicative inverse over GF(2^8), known to have good non-linearity properties. It is a fixed, publicly known table; the secrecy lies in the key, not in the algorithm.
Figure 5.1: SubBytes() Transformation[2]
2. ShiftRows() Transformation : In ShiftRows, the rows of the State are cyclically shifted over different offsets. Row 0 is not shifted, row 1 is shifted over C1 bytes, row 2 over C2 bytes and row 3 over C3 bytes. The shift offsets C1, C2 and C3 depend on the block length; for the 128-bit block of AES, they are 1, 2 and 3 respectively. The operation of shifting the rows of the State over the specified offsets is denoted by:
ShiftRow(State).
3. MixColumns() Transformation : Each column is multiplied by a constant 4x4 matrix over the field GF(2^8). In this step, a mixing operation is applied to the four bytes of each column.

Figure 5.2: ShiftRows() Transformation[2]

The MixColumns function takes four bytes as input and outputs four bytes, where each input byte affects all four output bytes. This provides diffusion in the cipher, ensuring that modification of individual bits of the plaintext gets redistributed across the ciphertext.
Figure 5.3: MixColumns() Transformation[2]
4. AddRoundKey() Transformation : In this operation, a Round Key is applied
to the State by a simple bitwise EXOR. The Round Key is derived from the
Cipher Key by means of the key schedule. The Round Key length is equal to
the block length. The transformation that consists of EXORing a Round Key
to the State is denoted by:
AddRoundKey(State,RoundKey)
The transformation is illustrated in Figure 5.4.
5.2.4 Final Round
The MixColumns operation is omitted in the last round, and an additional AddRoundKey operation is performed before the first round (using a whitening key).
Figure 5.4: AddRoundKey() Transformation
5.3 AES Implementation
5.3.1 Round Transformations
The different steps of the round transformation can be combined into a single set of table lookups, allowing very fast implementations on processors with word length 32 or above. In this section we explain how this can be done. One column of the round output e is expressed in terms of the bytes of the round input a. Here, a_{i,j} denotes the byte of a in row i and column j, and a_j denotes column j of State a.

For the key addition and the MixColumn transformation, we have:

e_j = M · b_j ⊕ k_j,

where M is the constant MixColumn matrix whose rows are the circulant shifts of (02, 03, 01, 01).

For the ShiftRow and the ByteSub transformations, we have:

b_{i,j} = S[a_{i,j+Ci}].

In all expressions the column indices must be taken modulo the block size, which is 4 in this case. By substitution, the above expressions can be combined into:

e_j = M · (S[a_{0,j}], S[a_{1,j+C1}], S[a_{2,j+C2}], S[a_{3,j+C3}])^T ⊕ k_j.

The matrix multiplication can be expressed as a linear combination of vectors: each column of M is scaled by the corresponding multiplication factor S[a_{i,j+Ci}], obtained by performing a table lookup on the input byte in the S-box table S[256].
We define tables T0 to T3 as follows:

T0[a] = (02•S[a], S[a], S[a], 03•S[a])
T1[a] = (03•S[a], 02•S[a], S[a], S[a])
T2[a] = (S[a], 03•S[a], 02•S[a], S[a])
T3[a] = (S[a], S[a], 03•S[a], 02•S[a])

These are 4 tables with 256 4-byte word entries each, making up 4 KB of total space. Using these tables, the round transformation can be expressed as:

e_j = T0[a_{0,j}] ⊕ T1[a_{1,j+1}] ⊕ T2[a_{2,j+2}] ⊕ T3[a_{3,j+3}] ⊕ k_j

Hence, a table-lookup implementation with 4 KB of tables takes only 4 table lookups and 4 XORs per column per round. Each table is accessed using an 8-bit index and gives 32 bits of output.
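The tables can be generated directly from the S-box. As a sketch (our own code, with the S-box derived from its GF(2^8) definition), each entry packs the four bytes into one 32-bit word, and the resulting T0[0] agrees with the Te0 table found in OpenSSL-style implementations:

```python
def gmul(a, b):
    """Multiplication in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def gf_inv(x):
    """Multiplicative inverse in GF(2^8); 0 maps to 0 by convention."""
    return next((y for y in range(256) if gmul(x, y) == 1), 0)

# Rijndael S-box: GF(2^8) inverse followed by the affine transformation.
SBOX = []
for x in range(256):
    v = gf_inv(x)
    s = v
    for sh in (1, 2, 3, 4):
        s ^= ((v << sh) | (v >> (8 - sh))) & 0xFF
    SBOX.append(s ^ 0x63)

def word(b0, b1, b2, b3):
    """Pack four bytes into a big-endian 32-bit word."""
    return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3

# Each Ti is one column of the MixColumns matrix scaled by S[a];
# successive tables are byte rotations of one another.
T0 = [word(gmul(2, s), s, s, gmul(3, s)) for s in SBOX]
T1 = [word(gmul(3, s), gmul(2, s), s, s) for s in SBOX]
T2 = [word(s, gmul(3, s), gmul(2, s), s) for s in SBOX]
T3 = [word(s, s, gmul(3, s), gmul(2, s)) for s in SBOX]

# 4 tables x 256 entries x 4 bytes = 4 KB in total.
print(hex(T0[0]), hex(T1[0]))   # 0xc66363a5 0xa5c66363
```

These 4 KB of tables span 64 cache lines of 64 bytes each, which is the footprint a cache-based spy monitors during an encryption.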
There is a separate key setup phase where a given 16-byte secret key k = (k_0, ..., k_15) is expanded into 10 round keys K(r), for r = 1, ..., 10. Each round key is divided into 4 words of 4 bytes each: K(r) = (K(r)_0, K(r)_1, K(r)_2, K(r)_3). The 0th round key is just the raw key: K(0)_j = (k_{4j}, k_{4j+1}, k_{4j+2}, k_{4j+3}) for j = 0, 1, 2, 3.
Given a 16-byte plaintext p = (p_0, ..., p_15), encryption proceeds by computing a 16-byte intermediate state x(r) = (x(r)_0, ..., x(r)_15) at each round r. The initial state x(0) is computed by x(0)_i = p_i ⊕ k_i for i = 0, ..., 15. Then the first 9 rounds are computed by updating the intermediate state as follows[16], for r = 0, ..., 8:

(x(r+1)_0, x(r+1)_1, x(r+1)_2, x(r+1)_3) ← T0[x(r)_0] ⊕ T1[x(r)_5] ⊕ T2[x(r)_10] ⊕ T3[x(r)_15] ⊕ K(r+1)_0
(x(r+1)_4, x(r+1)_5, x(r+1)_6, x(r+1)_7) ← T0[x(r)_4] ⊕ T1[x(r)_9] ⊕ T2[x(r)_14] ⊕ T3[x(r)_3] ⊕ K(r+1)_1
(x(r+1)_8, x(r+1)_9, x(r+1)_10, x(r+1)_11) ← T0[x(r)_8] ⊕ T1[x(r)_13] ⊕ T2[x(r)_2] ⊕ T3[x(r)_7] ⊕ K(r+1)_2
(x(r+1)_12, x(r+1)_13, x(r+1)_14, x(r+1)_15) ← T0[x(r)_12] ⊕ T1[x(r)_1] ⊕ T2[x(r)_6] ⊕ T3[x(r)_11] ⊕ K(r+1)_3

Finally, to compute the last round, the above equations are repeated with r = 9, except that T0, ..., T3 are replaced by T(10)_0, ..., T(10)_3. The resulting x(10) is the ciphertext. Compared to the algebraic formulation of AES, here the lookup tables represent the combination of the ShiftRows, MixColumns and SubBytes operations; the change of lookup tables in the last round is due to the absence of MixColumns.
5.3.2 Last Round Implementation
The last round can be implemented in multiple ways:

• Using an additional table : Here, a separate table of size 1 KB is used. Each entry in this table is simply the substituted index concatenated 4 times, one copy after the other.

• Using the previous tables : In this case, some of the tables used in the previous rounds are reused.
Chapter 6
Cache attacks on Non-shared table
6.1 Overview
Synchronous attacks are applicable in scenarios where the plaintext or ciphertext is
known and the attacker can operate synchronously with the program performing
AES encryption on the same processor, by using some interface that triggers en-
cryption under an unknown key. The main goal of the attacker is to observe the
table accesses at as fine a granularity as possible.
Consider a case where the attacker is able to identify, at each instant, exactly which
table access was made by the victim process; calculating the secret key would then
be trivial. In such a scenario, the attacker simply XORs each table access from the
first round with the corresponding plaintext byte and obtains the whole key straight
away, because the first-round table indices are simply the plaintext bytes XORed
with the corresponding key bytes:
xi = pi ⊕ ki
Knowing the table access exactly means knowing the value xi. So, simply XOR it
with pi to get ki.
However, recovering the exact table access is not so simple and straightforward,
nor can we achieve this granularity of observation.
Since each table entry occupies 4 bytes and the cache block size is a standard
64 bytes, 16 table entries fit into one cache block. The cache block is the minimum
amount of data brought from memory into the cache, so even if the victim process
has accessed a single entry, all 16 entries in that cache block are brought into the
cache.
The attacker thus cannot figure out the exact table access. He or she can only
find out which cache block was accessed by the victim process.
To find this information, we need to consider two scenarios.
1. Non-shared table data : Here, the cache is shared, i.e. both processes are
using the same cache, but the AES tables are not shared. So, at the start, the
attacker does not even know where the AES tables lie in the cache.
2. Shared table data : Here, both processes have access to the same AES tables
in memory. The attacker knows the location of the AES tables and thus the
cache lines to which the tables are mapped. We are targeting OpenSSL
implementations of AES, whose tables are shared by default, so this scenario
is also quite realistic.
In the subsequent sections we will look at the approaches to mount the attack in
both scenarios. We will then comment on the practicality of launching the attacks
in such situations and the problems faced in the implementation.
We will then propose our approach, which is a combination of the above attacks,
and show how it can help us mount the attack in practice.
Let us consider the first scenario, where the data tables are not shared and
the attacker thus does not know the position of the tables in the cache.
6.2 Cache access measurement
We can use one of the two techniques below to find out the cache block(s) accessed
by the victim process.
1. Measurement using Evict+Time[7] : In this method, we manipulate the
state of the cache before each encryption and observe the execution time of
the subsequent encryption. In a chosen-plaintext setting, the method proceeds
as follows :
• For each table l = 0, 1, 2, 3 do
– For each block y = 0, 1, . . . , 15 do
(a) For plaintext p, run AES to bring the blocks used by AES into the
cache.
(b) For the same plaintext p, run AES again and measure the time taken
for encryption, cachedTime, with all blocks in the cache.
(c) (Evict phase) Evict block y of table l.
(d) (Time phase) For the same plaintext p, run AES again and measure
the time taken for encryption after eviction, evictedTime, with one
block evicted.
2. Measurement using Prime+Probe [7] : This measurement method tries
to discover the set of memory blocks read by the encryption a posteriori, by
examining the state of the cache after encryption. The attacker allocates a
contiguous byte array A[0, . . . , S∗W∗B−1]. This method proceeds as follows :
• For each table l = 0, 1, 2, 3 do
– For each block y = 0, 1, . . . , 15 do
(a) Access the W memory blocks in A that map to the same cache set
as block y, so as to evict block y.
(b) (Prime phase) Read the same W memory blocks again and measure
the time taken to read all W blocks, cachedTime, with all W
blocks in the cache.
(c) For plaintext p, run AES to bring the blocks used by AES into the
cache.
(d) (Probe phase) Read the same W blocks once more and measure the
time taken to read all W blocks, to check whether block y was used.
Figure 6.1: Figures a, b and c are for Evict+Time, while figures d and e are for Prime+Probe [7]
The problem with the Evict+Time method is that it gives information about only
one table access per encryption. So, to get information about all the table accesses
during a particular encryption, we need to re-run the encryption of the same plain-
text for each cache set. If we assume the AES tables occupy 64 cache blocks, we
need to run Evict+Time 64 times to measure the accesses of just a single encryption.
This scenario is quite unrealistic, as it requires the same data to be encrypted again
and again.
Here, if we don’t know the offset at which the tables start, we need to fill the
whole cache again and again and then measure the accesses to the small subset of
the cache in which the tables reside. One optimization is to first find out the
location of the AES tables in memory and then apply the above strategies
by filling only a small portion of the cache. To find the location of the tables, we can
use the Prime+Probe attack: we simply give a score to each cache set whenever we
find that some process has accessed that location. If we do this repeatedly, there
is a high chance that the locations accessed by AES accumulate a high score,
because they are accessed every time we probe, while others may not be accessed
each time[9].
Figure 6.2: Graph showing cache sets with high access time. These are likely to be the locations where the AES tables are mapped[9].
Once we have fixed the bounds of the tables, we can fill just these cache lines
in the Prime+Probe attack. The above method will not work in the presence of
hardware prefetching, where for every line accessed the next line is automatically
fetched. We will discuss this in more detail in later sections.
After obtaining the table accesses for each encryption, we will use the One Round
and Two Round attacks to recover the final key. These are discussed in the next
sections.
6.3 First Round Attack
For attacking AES, a natural approach is to observe the lookups performed in the
first round[7]. The table accesses are simply xi = pi ⊕ ki for all i = 0, . . . , 15, each
of which depends on only one key byte and one plaintext byte. We already have the
plaintext for the encryption, so any knowledge about xi reveals some information
about the key bits.
Since each cache block contains 16 table entries and each table contains 256
entries, each table maps to 16 cache blocks. Thus any information about the
access of a particular cache block gives information about its 16 entries as a
whole, i.e. about the upper 4 bits of the index. So, using the one-round attack
we will be able to recover the upper 4 bits of each key byte.
Ideally, we would require the first 16 accesses in order. However, in the given
scenario we do not have that leverage. Rather, we have the cache accesses of the
whole encryption, i.e. we know which of the 64/80 cache blocks were accessed by
the victim process during the whole encryption. In such a scenario, we can recover
partial information about the key bytes as follows.
Consider the case where 〈pi ⊕ ki〉4 is indeed present in the list of accesses of that
particular encryption. We can say that this key byte ki is a probable candidate for
the actual key. However, if 〈pi ⊕ ki〉4 is not present in the list of accesses, we can
say for sure that this particular value is definitely not the key byte: if it were, the
access corresponding to 〈pi ⊕ ki〉4 would have to be present in the list of accesses,
as that line must have been accessed in the first round itself.
In a real scenario, due to noise and inaccuracy of measurements, we will not
eliminate key values whose corresponding access is not found; rather, we give each
candidate a score of 1 every time it is found and a score of 0 when it is not.
At the end, when we plot the graph, the actual key values should show a peak,
because they must have been present in all the encryptions.
This algorithm specifies how one round attack can be implemented.
Note: For plaintext bytes 0, 4, 8, 12 we look at table T0 and so on[1].
Algorithm 2 One Round Attack
1: while true do
2:   for each plaintext pi do
3:     for each possible key value ki (0-255) do
4:       xi ← 〈pi ⊕ ki〉4
5:       if xi is present in list of accesses then
6:         graph[i][ki] ← graph[i][ki] + 1
7:       end if
8:     end for
9:   end for
10: end while
6.4 Second Round Attack
The one-round attack above has reduced the key search space from 128 bits to 64
bits, as for each key byte we are able to retrieve 4 bits. The second-round attack is
based on the same principle of cache accesses as the first round. The only difference
is that, unlike the first round, where the cache accesses are simply 〈pi ⊕ ki〉4, the
cache accesses in the second round depend on the outcome of the first round. Each
round scrambles the data in a non-linear fashion.
For the second round, we specifically exploit these 4 equations[16]:
Figure 6.3: Equations for second round attack
Here, the key bits which pass through the S-box affect the result of each equation
in a non-linear way. That means a change in the least significant 4 bits of a key
value can affect the most significant bits of the result. However, this is not the case
for the key bytes which are directly XORed. If we observe these equations, we notice
that for each equation we only have to find the lower bits of 4 key bytes.
For example, in the first equation, the lower bits of only k0, k5, k10, k15 affect the
most significant bits of the result, i.e. they affect the table access.
So now we have 16 possible values for each key byte, and each equation involves
4 of them. Thus we have a total of 16^4 = 65536 combinations. For each combina-
tion, we apply the same principle as before, i.e. giving a candidate score to each
combination whose predicted access appears in the list of accesses.
These attacks are based on the assumption that we accurately obtain the accesses
of the whole encryption. This requires proper synchronization between the victim
and attacker processes, which is not practical in most scenarios.
Chapter 7
Cache attacks by exploiting CFS
7.1 Overview
The synchronous attack explained in the previous section is an efficient way to
recover the key; however, it is limited to scenarios where the attacker obtains
known plaintexts and has some interaction with the encryption code which allows
him to execute code synchronously before and after encryption. In this section we
describe a class of attacks that eliminate these prerequisites. The attacker exe-
cutes his own program on the same processor as the victim program performing AES
encryption, but with no explicit interaction such as inter-process communication;
the only knowledge assumed is of a non-uniform distribution of the plaintexts
or ciphertexts.
This chapter describes an attack based on the assumption that the spy
process is able to observe every single memory access made by the victim. This high
granularity is achieved by exploiting the behaviour of the Completely Fair Scheduler
(CFS) used by the Linux kernel.
In the next section, we discuss how the CFS works and how it can be exploited to
allow the victim process so little time that it can make only one access in that
duration.
7.2 Completely Fair Scheduler
To gather table accesses in the shared-memory scenario, we need some kind of
synchronization mechanism so that the attacker can observe each and every victim
access.
For this, we as attackers require that, whenever we want, the operating system
preempt the victim process and allow the attacker to run, which in turn
gathers the required memory accesses. For this task of allotting the CPU to processes,
the scheduler comes into the picture. Preempting the victim process at will and
gathering the required accesses is not easy, as the scheduler has to maintain fairness
among all processes while achieving maximum throughput at the same time.
So, to achieve this, we need some kind of attack on the scheduling
capability of the operating system. This work exploits an implementation
of the scheduler known as the Completely Fair Scheduler (CFS).
Let us discuss briefly how it performs the task of scheduling. This scheduler
tries to behave like an ideal system while giving a fair share to each process. To
achieve this, it maintains a virtual runtime for each process, which denotes the time
the process has spent running. So, the virtual runtime of the running process
increases over time.
CFS maintains fairness by allowing a process to increase its virtual runtime only
up to a certain bound, after which it preempts the process and selects the
process with the least virtual runtime at that moment.
This is clearly explained with the help of the given diagram. Here, three processes
are running on a multitasking system. At the start, process 1 is activated because it
has the least virtual runtime. As the process runs, its virtual runtime increases,
and at the point where the maximum unfairness is reached, the next
process is scheduled.
Figure 7.1: Functioning of the Completely Fair Scheduler.[11]
7.3 Attacking CFS
This fairness property can be exploited by the attacker in the following way. The
basic idea is that the attacker process requests most of the available CPU while
leaving only very small intervals for the victim process. In this small time, the victim
accesses a memory location, thus bringing the table into the cache, and is scheduled
out. The attacker then regains control and can figure out the cache line accessed by
the victim. To achieve this, the attacker process launches several hundred identical
threads which initialize their virtual runtime to as low a value as possible by blocking
for a sufficient amount of time. The following steps are then performed in a
round-robin fashion:
• Upon getting activated, thread i first measures which memory accesses were
performed by V since the previous measurement.
• It then computes tsleep and twakeup, which designate the points in time when
thread i should block and thread i + 1 should unblock, respectively. It pro-
grams a timer to unblock thread i + 1 at twakeup.
• Finally, thread i enters a busy-wait loop until tsleep is reached, whereupon it
blocks to voluntarily yield the CPU.
Figure 7.2: Denial Of Service attack on CFS.[11]
Due to the large number of threads, their virtual runtimes increase very slowly,
and thus whenever the scheduler looks for a process to run, it will always choose an
attacker thread over the victim.
7.4 Retrieving Key
Once we get the cache accesses, we can use the following method[11] to retrieve the
key.
Each round of AES encryption can be described by this single relation:
Y = M • s(X̃) ⊕ K. (7.1)
where X and Y are the state matrices before and after a particular encryption round,
M is the matrix of the MixColumns step,
X̃ denotes the row-shifted version of X,
and K is the round key.
Also, any two consecutive rounds of the same encryption can be put together in
the form of this equation:
k̄i* = ȳi* ⊕ (M • s(x̃i))* (7.2)
where ā denotes a 4-byte column vector,
ã denotes that the row shifting has been applied,
and a* denotes the leaked bits from the cache accesses, which are 5 in the case of
the compressed table and 4 otherwise.
The basic steps of finding the key bits are:
1. We treat each of the N accesses as the beginning of a round.
2. For each such beginning, we calculate the potential key candidates from the
above equation.
3. Based on the different sets of potential candidates, we calculate the keys which
are most probable. This rests on the fact that, if the potential
beginning is correct, the possible keys generated from it are correct.
Chapter 8
Design & Implementation of
Espionage Infrastructure
8.1 Flush+Reload Technique
The Flush+Reload attack is a powerful access-driven cache-based side-channel
technique. It was proposed by Gullasch et al.[11] but was first named by Yarom et
al.[18]. It employs a spy process to check whether specific cache lines have been
accessed by the victim's code. The attack is carried out by a spy process
which works in 3 stages:
Flushing Stage :
In this stage, the attacker flushes the desired memory lines from the cache using
the clflush instruction, ensuring that they will have to be retrieved from the
main memory the next time they are accessed. The attack works even if the
attacker and victim reside on different CPU cores, as clflush flushes memory lines
from the caches of all cores.
Accessing the target :
The attacker waits until the victim process runs a fragment of code which might use
the memory lines that were flushed in the first stage.
Reloading Stage :
In the reload stage the attacker reloads the previously flushed memory lines and
measures the time each reload takes. Depending on the time taken to fetch the
memory lines, the attacker decides whether the victim accessed them or
not: if the victim accessed a memory line, it will be present in
the cache, and if not, it won't be. The following figure shows the
timing diagrams of various scenarios in which the victim and attacker access the same
cache line. Figures A and B show the timing diagram without and with the victim
accessing the cache line. While doing the experiments, we also need to consider cases
where the victim does not access the cache precisely at the time the attacker expects;
the remaining three diagrams, C, D and E, show the timing for such cases.
Figure 8.1: Flush+Reload attack timings [18]
The implementation of the attack is given in Figure 8.2. The code measures the time
to read the data at a memory address and then evicts the memory line from the
cache[18]. The implementation is given as inline assembly within the asm command.
The assembly code takes as input the address stored in %ecx (line 16) and
returns the time to read this address in register %eax, which is stored in the
variable time (line 15).
The threshold used in the attack is system dependent. For our Core i5 system,
we set it to 100 ticks; this choice is discussed in the next section.
Figure 8.2: Code for the Flush+Reload Technique [18]
8.2 Our espionage infrastructure
Our espionage infrastructure, shown in Figure 8.3, consists of three important parts: the Spy
Controller [SC], the Spy Ring and the Centre of Advanced Analytics [CAA].
The SC, residing on one CPU core, controls the spy threads running on another core.
The CAA, implemented with analytical abilities, is responsible for providing dynamic
delay instructions to the SC so that V can be restricted to fewer accesses to the memory
lines. The lower the number of memory-line accesses by the victim, the more accurate
the results.
Figure 8.3: The Espionage Network.
For a successful attack, our aim is to execute the spy threads and V as shown in
Figure 8.4. V runs on a core where the spy ring is also scheduled by the SC.
This makes the OS divide the CPU time quantum equally among the spies and the
victim. We call each instance of V (when V gets its turn to run) a run. In each
run, V performs AES encryptions and brings data into the cache. The default
time slice (or quantum) assigned by the OS to a process is large enough to make
thousands of cache accesses, so to stop the OS from providing this large quantum
to V, our espionage infrastructure restricts it to a very small time slice.
Figure 8.4: Timeline of victim and spy threads.
Scheduling is a central idea in a multitasking operating system, where CPU time
has to be multiplexed among different running processes, giving the illusion of parallel
execution. The Completely Fair Scheduler has been part of all Linux systems starting
from kernel version 2.6.23[11].
To ensure that fair time is allocated to all processes, the CFS introduces the
concept of a virtual runtime associated with each process. In an ideal scenario, if
the total number of processes running on a CPU core is n, then the fraction of CPU
time allocated to each process is 1/n. To approximate this on a real system, the
CFS maintains a virtual runtime τi for every process i. In Figure 8.4, the sum of
the CPU times allocated to V is equal to that given to each of the spy
threads running on that core.
In our attack implementation, each spy thread measures the access times of the
cache lines containing the AES lookup tables and then flushes the tables from all
levels of cache. After performing this work, each spy thread signals the SC through
a shared variable, finished, to wake the next thread of the ring. It then waits for
an amount of time δ1 before blocking on a condition variable. This is where the
victim comes into the picture: while all spy threads are blocked, the OS resumes
the execution of the victim (V).
Algorithm 3 Spy Threads
1: SpyThread Ti
2: while true do
3:   for each cacheLine containing AES tables do
4:     if accessTime[cacheLine] < THRESHOLD then
5:       isAccessed[cacheLine] ← true
6:       clflush(cacheLine)
7:     end if
8:   end for
9:   mutexLock(var)
10:  finished ← true
11:  mutexUnlock(var)
12:  delay loop by time = δ1
13: end while
The SC continuously checks the finished flag; once it is true, the SC delays for
time δ2 and signals the next spy in the ring to start its execution. The delay δ2
is optional, as it is only required when the number of accesses by the victim is more
than what is suitable for the attack. So, we can restrict the number of accesses to
the lookup tables by varying the value of δ2.
Our attack has been designed to work on multi-core systems. Before the
actual attack begins and the victim starts performing AES encryptions, the attacker
schedules its ring of spy threads onto the same CPU core where V resides. The SC has
to run alone on another CPU core so that it can send signals immediately, without
any battle for the CPU. The Centre of Advanced Analytics (CAA) can be employed
on any remaining cores, including the core on which the SC executes.
The delay loops δ1 and δ2 are used to fine-tune the whole setup so that the victim
Algorithm 4 Spy Controller
1: while true do
2:   while finished ≠ true do
3:   end while
4:   delay loop by time = δ2
5:   condSignal(nextThread)
6:   mutexLock(var)
7:   finished ← false
8:   mutexUnlock(var)
9: end while
could access the minimum number of cache lines in its time quantum. Increasing the
value of δ1 decreases the total number of accesses by the victim, as the delay consumes
some portion of the victim's time. In contrast, increasing the value of δ2 allows
V to execute for δ2 extra time, so the number of cache-line accesses by V is
increased.
The value of THRESHOLD in Algorithm 3 was decided on the basis of the time
taken to bring data back into the cache after flushing it from all levels of cache memory.
The distribution of times (in ticks) for cache hits and misses is clearly
presented in Figure 8.5. On this basis, we fixed the threshold at 100 ticks.
Figure 8.5: Frequency vs. cache access time (ticks).
8.3 Approach for Attack
The previous attack is based on the assumption that we can capture each individual
access made by the victim. A single table access generally takes less than 100 ns to
complete, which would mean the victim is scheduled for only around 100 ns each
time it gets the CPU; this seems quite unrealistic. To relax such constraints we
propose a combination of both attacks, in which we exploit the CFS to gather the
memory accesses and use the last-round table as a synchronization mechanism to
identify the table accesses of an encryption.
In this case, we assume the shared-table scenario, as the OpenSSL implemen-
tations we are targeting share the tables by default. Here, we assume that the victim
is continuously encrypting data and thus accessing the tables. This assumption
is realistic: consider a cloud service offering encrypted data storage as a service
under an unknown key. The user, in this case the attacker, can ask the cloud
service to encrypt data, at which point it starts its encryption sequence and
continues encrypting until the end.
We exploit the CFS to gather the memory accesses of the victim but, unlike the
previous case, we do not require each individual access; rather, we can allow the
victim a chunk of accesses. For example, here we give the results of experiments
based on chunks of fewer than 30 accesses.
After getting the accesses, we propose the following algorithm for achieving the
synchronization.
8.3.1 Algorithm
For each group of accesses, check the following:
If the group contains some last-table entries, consider that group and the next 2
groups as the table accesses of the next encryption.
This is because the group containing the last-table accesses can be in one of
several states, and for each case we justify our approach in Chapter 10.
After getting the accesses, we will use the first-round and second-round attacks to
recover the complete AES key, as described in Sections 6.3 and 6.4.
Chapter 9
Coding
9.1 Experimental Setup
Our experiments were performed on an Intel(R) Core i5-2540M machine running
Debian Kali Linux 1.1.0, 64-bit, kernel version 3.14.5/3.18, using the C
implementation of AES in OpenSSL 0.9.8a. This version of OpenSSL uses a separate
table for the last round of encryption. The Core i5 has a 3-level cache architecture:
the L1 cache is 32 KB (8-way associative), the L2 cache is 256 KB (8-way associative)
and the L3 cache is 3 MB (12-way associative). Each CPU core has private L1 and L2
caches, whereas L3 is shared among the CPU cores.
This chapter includes code snippets of the major components of our espionage
infrastructure. The source code for the attack has been written mainly in C,
the language in which the kernel itself is programmed. We have also written
various Python scripts to automate the attack and generate the results. The
major work here is performed by the SC and the spy threads; the victim performs
the AES encryptions. For the AES encryptions, we have used the aes_core file
that contains the tables and the AES_encrypt() function. In our victim code, AES
encryption is performed 100 times with different plaintexts.
The following are the major sections of the code in our attacker process:
9.2 Attacker source code
#define _GNU_SOURCE  // Assuming all header files included
#ifndef _POSIX_THREAD_PROCESS_SHARED
#error This system does not support process shared mutex
#endif

#define NUMTHREADS 15
#define MAXCOUNT 10000

int segmentId;
int segmentChildId;
int segmentcheckId;
int *currThread;
int *child;
int *checkstate;
pthread_cond_t *cvptr[NUMTHREADS + 1];
pthread_condattr_t cattr[NUMTHREADS + 1];

pthread_cond_t *cvptrChild;                // Condition variable pointers of child
pthread_condattr_t cattrChild;             // Condition variable attributes of child
pthread_mutex_t *mptr[NUMTHREADS + 1];     // Mutex pointers
pthread_mutexattr_t matr[NUMTHREADS + 1];  // Mutex attributes

pthread_mutex_t *mptrChild;                // Mutex pointer of child
pthread_mutexattr_t matrChild;             // Mutex attributes of child

int shared_mem_id;       // shared memory id
int *mp_shared_mem_ptr;  // shared memory ptr -- pointing to mutex
int *cv_shared_mem_ptr;  // shared memory ptr -- pointing to condition variable

static inline void clflush(volatile void *p)
{
    asm volatile ("clflush (%0)" :: "r" (p));
}

unsigned long probe(char *adrs)
{
    volatile unsigned long time;
    asm volatile (
        " mfence             \n"
        " lfence             \n"
        " rdtsc              \n"
        " lfence             \n"
        " movl %%eax, %%esi  \n"
        " movl (%1), %%eax   \n"
        " lfence             \n"
        " rdtsc              \n"
        " subl %%esi, %%eax  \n"
        " clflush 0(%1)      \n"
        : "=a" (time)
        : "c" (adrs)
        : "%esi", "%edx");
    return time;
}

struct shared_use_st
{
    unsigned long long access_count;
    int flag;
    int thread_state;
    unsigned long long check_count;
};

struct shared_use_st *shared_stuff;

void create_shared_memory()
{
    void *shared_memory = (void *)0;
    int shmid;
    shmid = shmget((key_t)1234, 4096, 0666 | IPC_CREAT);

    /* ... intervening portion of the listing omitted;
       the excerpt below resumes in the main setup code ... */

    checkstate = (int *)shmat(segmentcheckId, NULL, 0);
    *checkstate = 0;  // shared variable for thread among parent and child
    create_shared_memory();
    shared_stuff->thread_state = 2;
    shared_stuff->check_count = 0;

    pid_t pid, id;
    int i;

    pid = fork();
    id = getpid();

    if (pid > 0)
    {
        // In parent
        pthread_t threads[NUMTHREADS];
        int rc;
        long t;
        for (t = 0; t < NUMTHREADS; t++)
        {
            rc = pthread_create(&threads[t], NULL, parentThreads, (void *)t);
            if (rc)
            {
                printf("ERROR; return code from pthread_create() is %d\n", rc);
                exit(-1);
            }
        }
        pthread_exit(NULL);
    }
    else
    {
        // In child
        pthread_t thread;
        thread = pthread_self();
        cpu_set_t my_set;     /* Define your cpu_set bit mask. */
        CPU_ZERO(&my_set);    /* Initialize it all to 0, i.e. no CPUs selected. */
        CPU_SET(1, &my_set);  /* Set the bit that represents core 1. */
        sched_setaffinity(0, sizeof(cpu_set_t), &my_set);  /* Set affinity of this process */
        int check_counter = 0;
        long int counter = 0;
        while (counter++ < MAXCOUNT)
        {
            while (!*child);  // wait till set by thread
            volatile int wait_child = 0;
            // while (wait_child++ < 10);

            pthread_mutex_lock(mptrChild);
            *child = 0;
            pthread_mutex_unlock(mptrChild);
            pthread_cond_signal(cvptr[*currThread]);
        }
    }
    return 1;
}
Chapter 10
Results and Analysis
The first thing we have to decide is the time taken by the victim to access cache
lines that are already in the cache. From the earlier experiment shown in Figure 8.5,
the time taken by V to access memory lines from the cache is in the range of 32-68
ticks on our system. So, we chose a threshold value of 100, which clearly separates
accesses served from the cache from those served from main memory (which take
more than 200 ticks).
On the basis of the threshold value decided in Section 8.2, we performed
our experiment using our espionage network with different numbers of spy threads.
The number of accesses by the victim should decrease as we increase the number of
spy threads (Section 8.2). The numbers of distinct memory accesses are depicted in
Figure 10.1 and Figure 10.2. We are currently able to restrict V to only between
18-27 accesses in each run, which is enough for the success of our attack.
For a proper understanding of our results, we added code to our Spy
Controller (Chapter 9) to print to a file the exact cache lines accessed by the
victim, as detected by the spy in its turn. Our access results thus include the
number of cache-line accesses made by the victim in each run, as detected during
the spies' turns, together with the corresponding AES table information.
Figure 10.1: #Accesses per run(#spy threads=10).
In Figure 10.1, where the spy ring contains 10 threads, we can see that the accesses per run lie in the range of 22-37, while in Figure 10.2, where the spy ring contains 40 threads, the number of accesses falls into a range suitable for our attack.
Figure 10.2: #Accesses per run (#spy threads = 40).
For the same plaintext encrypted 100 times by the Victim, a few of the resulting accesses are shown in Figure 10.3. It clearly depicts the start of an encryption, where all the accesses fall in the AES look-up tables T0, T1, T2 and T3. Accesses to the fifth table, T4, mark the end of the encryption, as cache lines in the range 64-79 are then being accessed. Sometimes the table accesses of a new encryption can be noticed within the last round of the previous encryption; that is, the accesses recorded for the last round of one encryption may include a few table accesses from the beginning of the next.
Figure 10.3: Cache accesses detected by Spy threads.
These conflicting accesses can be resolved for our attack as explained in Table 10.1.
Here, we have taken ideal memory accesses from an actual AES encryption implementation, OpenSSL version 0.9.8a. This part serves to solidify our point that, if we apply the modifications discussed in the previous section, we will still be able to recover the keys.
For this set-up, we modified the OpenSSL AES code to print the table accesses to a file, such that the table offsets are recorded. We gathered this data for as many as 100,000 encryptions with a single key.
Results worth mentioning are explained here with graphs, following the approach of [4].
1. In the ideal scenario, where the attacker gets all the accesses of an encryption and in order, only one encryption is needed to obtain the key bytes from the first-round attack, and fewer than 5 encryptions suffice for the second-round attack to recover the complete key.
S.No. | Accesses contained in a group with last-table entries | Observations
1 | (A) Accesses from the current encryption (9th round). (B) Last-round accesses. (C) Accesses from the next encryption (1st round). | Since we are not concerned with entries beyond the first two rounds, the data from the current encryption is of no use to us, but the data containing the first round of the next encryption is important. It is therefore better to treat all accesses in the group as next-encryption accesses.
2 | (A) Accesses from the last table. (B) Accesses from the first and second rounds of the next encryption. | Here too we need the data of the next encryption, so we treat the accesses as those of the new encryption.
3 | (A) Accesses from the 8th and 9th rounds of the current encryption. (B) Last-round accesses. | Even though no data of the next encryption is present here, we cannot distinguish this situation from the others, so we simply treat these accesses, along with the next two groups of accesses, as belonging to the new encryption.
4 | (A) Accesses from the 8th and 9th rounds of the current encryption. (B) A few entries of the last round. | This scenario is similar to the previous one, and treating these accesses as new-encryption accesses causes no problem.
Table 10.1: Conflicting access resolution
2. We varied the number of accesses available to the attacker per encryption, both with and without pre-fetching, in a perfectly synchronized situation where the attacker knows exactly where each encryption starts.
To obtain these results, we ran our attack on various numbers of encryptions and, for each key byte under analysis, plotted the candidate values against the scores they received during the analysis. The complete algorithm is explained in Section 6.3.
Here we take the case of key byte 8 (k8). Below are two graphs showing the plots for 1100 encryptions (left) and 1300 encryptions (right), with hardware pre-fetch enabled and considering the whole 160-access chunk. In the left graph the peak, although visible, is not very clear, while in the right graph the peak begins to stand out. Increasing the number of encryptions further widens this difference.
Figure 10.4: Differences in the peak for 1100 encryptions
From the next graph we can clearly see that, as more accesses become available to the attacker as a single bunch, the number of encryptions needed increases. This is reasonable: with more accesses grouped together, more cache lines are active, which lowers our certainty about which accesses belong to the first round.
Figure 10.5: Differences in the peak for 1300 encryptions

With hardware pre-fetching enabled, the number of accesses is definitely higher than with no pre-fetching, but otherwise the results follow the same behaviour as that observed in the case of no pre-fetching.