Design and implementation of an improved and parallelized RSA algorithm on multicore CPUs and GPUs, Kennedy B.
  • Design and implementation of an improved and parallelized RSA algorithm on multicore CPUs and GPUs

    Kennedy B.

  • Outline

    Introduction

    Objective

    Problem statement

    Methodology

    Results and discussions

    Conclusions and recommendations

    References

  • Introduction

    Graphics Processing Units (GPUs) and their development tools have advanced and are increasingly in demand in industry.

    Among the several development frameworks for CPUs and GPUs, OpenCL provides a programming environment for writing portable code that runs in parallel across heterogeneous platforms made up of many different types of devices, for example CPUs, GPUs, DSPs, FPGAs and other types of processors.

    Sensitive messages are exchanged over the internet, so they must be kept secure on an insecure channel. This security is achieved through encryption. Cryptography is the art of securing data, here data sent over the internet. Symmetric-key cryptography is based on a shared secret; asymmetric-key cryptography is based on a personal secret.

  • Contd

    Asymmetric-key cryptography uses two separate keys: one private and one public.

    Fig: General idea of an asymmetric-key cryptosystem

  • Contd

    The most common public-key algorithm is the RSA cryptosystem, named for its inventors (Rivest, Shamir, and Adleman).

    Fig: Complexity of operations in RSA

  • Objective

    Main objective:

    The main objective of this project is to speed up the RSA encryption and decryption algorithm.

    Specific objective:

    The specific objective of this project is to parallelize the RSA algorithm and implement it on multicore CPUs and GPUs.

  • Problem statement

    Public-key algorithms (e.g., the RSA algorithm) rely on computationally hard mathematical operations:

    modular multiplication and modular exponentiation of very large integers, ranging from 128 to 2048 bits.

    These calculations are not easy to implement efficiently.

    With the rapid developments in hardware and software technologies, a sequential implementation is no longer secure enough or fast enough.

    Parallel algorithms on CPUs and GPUs play a significant role in keeping pace with this rapid growth.

    However, such parallelization is a challenging process.

    Motivated by this challenge, this project proposes a hybrid system that parallelizes RSA for multicore CPUs and many-core GPUs.

  • Methodology

    The RSA algorithm is suitable for both encryption and digital signatures.

    Internet security depends significantly on the security properties of the RSA cryptosystem.

    Its security depends on the intractability of the integer factorization problem.

    Modular arithmetic is used to implement the modular exponentiations in the RSA algorithm.

    The RSA algorithm consists of three steps: key generation, encryption and decryption.

  • Contd

    The RSA algorithm can be summarized in the following steps:

    Step 1: Randomly generate two large primes p and q of approximately the same size, but not too close together. Both are kept secret.

    Step 2: Calculate the modulus n = p·q and φ(n) = (p-1)(q-1), where φ(n) is the Euler totient function.

    Step 3: Choose a random encryption exponent e less than n such that gcd(φ(n), e) = 1, 1
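
    The steps above can be sketched in Python with toy numbers. This is a minimal textbook sketch, not the implementation from the slides: the function names, the tiny primes 61 and 53, and the exponent 17 are illustrative assumptions, and the private exponent d = e^(-1) mod φ(n) is the standard textbook continuation rather than something shown on this slide.

```python
from math import gcd

def make_keys(p, q, e):
    """Textbook RSA key generation from two given primes (toy sizes only)."""
    n = p * q                    # Step 2: modulus
    phi = (p - 1) * (q - 1)      # Step 2: Euler totient of n
    if gcd(phi, e) != 1:         # Step 3: e must be coprime to phi(n)
        raise ValueError("e is not coprime to phi(n)")
    d = pow(e, -1, phi)          # assumed next step: d = e^-1 mod phi(n) (Python 3.8+)
    return (e, n), (d, n)

def encrypt(m, public_key):
    e, n = public_key
    return pow(m, e, n)          # c = m^e mod n

def decrypt(c, private_key):
    d, n = private_key
    return pow(c, d, n)          # m = c^d mod n

# Illustrative values only; real keys use primes of 1024 bits or more.
public_key, private_key = make_keys(61, 53, 17)
cipher = encrypt(65, public_key)
plain = decrypt(cipher, private_key)
```

    Decrypting the ciphertext with the private key returns the original message 65, which is the round trip the three RSA steps are meant to provide.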

  • Contd

    Montgomery Algorithm

    Presented by Peter Montgomery in 1985, this algorithm is used in public-key cryptography.

    It serves as an efficient algorithm for modular multiplication and exponentiation.

    It gives efficient computation for large modular arithmetic (>= 1024 bits).

    The Montgomery algorithm consists of two parts: multiplication and reduction.

    Montgomery multiplication is a method for computing a·b mod n for positive integers a, b, and n.

    It is useful for computing a^e mod n for a large value of n.

    It eliminates the mod n reduction steps and, as a result, tends to reduce the size of the timing characteristics.

    In general, the Montgomery multiplication algorithm computes the Montgomery product, specified by:

    MonMul(a', b') = a'·b'·r^(-1) (mod n)

    where a and b are less than the modulus n. Another integer r is also needed; it must be greater than n and satisfy gcd(r, n) = 1.
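
    A minimal Python sketch of the Montgomery product may help here. It assumes r is a power of two (so an odd n gives gcd(r, n) = 1) and uses the usual precomputed constant n' = -n^(-1) mod r; the names mon_mul, r_bits and n_prime and the toy modulus 97 are illustrative assumptions, not taken from the slides.

```python
def mon_mul(a_bar, b_bar, n, r_bits, n_prime):
    """Return a_bar * b_bar * r^-1 mod n, where r = 2**r_bits."""
    mask = (1 << r_bits) - 1
    t = a_bar * b_bar
    m = ((t & mask) * n_prime) & mask   # m = t * n' mod r (cheap bit masking)
    u = (t + m * n) >> r_bits           # t + m*n is divisible by r, so the shift is exact
    return u - n if u >= n else u       # a single conditional subtraction replaces mod n

n = 97                                  # toy odd modulus
r_bits = n.bit_length()                 # r = 2**7 = 128, which is greater than n
r = 1 << r_bits
n_prime = (-pow(n, -1, r)) % r          # n' = -n^-1 mod r (Python 3.8+)

a, b = 42, 57
a_bar = (a * r) % n                     # a' = a*r mod n (Montgomery form)
b_bar = (b * r) % n                     # b' = b*r mod n
prod_bar = mon_mul(a_bar, b_bar, n, r_bits, n_prime)
product = mon_mul(prod_bar, 1, n, r_bits, n_prime)   # multiply by 1 to leave Montgomery form
```

    Inside mon_mul the only reductions are a mask and a shift by r, which is why the method avoids explicit mod n divisions.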

  • Contd

    Montgomery Reduction Algorithm

    Step 1: Input a, e, n.

    Step 2: Function: MonExp(a, e, n).

    Step 3: Calculate a' = a·r mod n.

    Step 4: Calculate x' = 1·r mod n.

    Step 5: for i = k - 1 down to 0 loop, where k is the number of bits in e

    x' = MonMul(x', x')

    if e_i = 1 then x' = MonMul(x', a')

    end loop.

    Step 6: x = MonMul(x', 1)

    Step 7: return x.

    Step 8: Output: a^e mod n.
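
    The loop above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: n is odd, r is the smallest power of two above n, and the helper mon_mul and the constant n_prime follow the common textbook formulation rather than anything shown on the slide.

```python
def mon_mul(x, y, n, r_bits, n_prime):
    """Montgomery product x * y * r^-1 mod n, with r = 2**r_bits."""
    mask = (1 << r_bits) - 1
    t = x * y
    m = ((t & mask) * n_prime) & mask
    u = (t + m * n) >> r_bits
    return u - n if u >= n else u

def mon_exp(a, e, n):
    """Compute a^e mod n by left-to-right square-and-multiply (n must be odd)."""
    r_bits = n.bit_length()
    r = 1 << r_bits
    n_prime = (-pow(n, -1, r)) % r                      # n' = -n^-1 mod r
    a_bar = (a * r) % n                                 # Step 3: a' = a*r mod n
    x_bar = r % n                                       # Step 4: x' = 1*r mod n
    for i in range(e.bit_length() - 1, -1, -1):         # Step 5: scan bits of e, high to low
        x_bar = mon_mul(x_bar, x_bar, n, r_bits, n_prime)       # square
        if (e >> i) & 1:                                        # bit e_i is set
            x_bar = mon_mul(x_bar, a_bar, n, r_bits, n_prime)   # multiply
    return mon_mul(x_bar, 1, n, r_bits, n_prime)        # Step 6: leave Montgomery form

result = mon_exp(65, 17, 3233)                          # toy RSA-sized example
```

    The result agrees with Python's built-in pow(65, 17, 3233), which is a convenient cross-check when experimenting with other inputs.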

  • Contd

    IMPLEMENTATION OF RSA ALGORITHM

    Sequential RSA implementation on the CPU algorithm

    Step 1: Generate the keys as mentioned earlier.

    Public key {e, n}: public struct RSA_Public_Key

    Private key {d, n}: public struct RSA_Secret_Key

    Step 2: Insert the text to be encrypted, either from a file or by typing it.

    Step 3: Send the data to a for loop to do the encryption.

    for (int i = 0; i < list_source.Count; i++)
    {
        // pair each block of the source data with its index
        var item = new { Id = i, Data = list_source[i] };
    }

    Step 4: The decryption process is done using public Encrypt (long int biPlain, RSA_Public_Key rpkKey)
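
    As a rough illustration of the sequential case, the Python sketch below encrypts a list of blocks one after another. The toy key pair (17, 3233) / (2753, 3233), the one-byte-per-block split, and the function name encrypt_sequential are assumptions made for the example; only the name list_source comes from the slide.

```python
def encrypt_sequential(blocks, public_key):
    """Encrypt every block in order on a single thread."""
    e, n = public_key
    return [pow(m, e, n) for m in blocks]   # one modular exponentiation per block

# Toy key pair for illustration; every block value must be smaller than n.
public_key = (17, 3233)
private_key = (2753, 3233)

list_source = list(b"RSA on CPU")           # one byte per block
cipher = encrypt_sequential(list_source, public_key)

# Decryption is the same loop with the private exponent.
d, n = private_key
recovered = bytes(pow(c, d, n) for c in cipher)
```

    Each block is independent of the others, which is the property the parallel versions rely on.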

  • Contd

    Parallel RSA implementation on the multicore CPU algorithm

    Step 1: Generate the keys as mentioned above.

    Public key {e, n}: public struct RSA_Public_Key

    Private key {d, n}: public struct RSA_Secret_Key

    Step 2: Insert the text to be encrypted, either from a file or by typing it.

    Step 3: Create a pool of threads.

    ThreadStartsList

    ParameterizedThreadStart

    Step 4: Each thread takes a portion of the data and encrypts it.

    for (int i = 0; i < list_source.Count; i++)
    {
        // one worker per portion; doEncrypt receives its ThreadParameters as the thread argument
        var start = new ParameterizedThreadStart(doEncrypt);
        // parameters[i] is a hypothetical name for the ThreadParameters of portion i
        new Thread(start).Start(parameters[i]);
    }

    Step 5: The decryption process is carried out in the same way.
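
    The thread-pool idea can be sketched in Python with concurrent.futures. This is an illustration of the structure only, under stated assumptions: the names do_encrypt and encrypt_parallel, the worker count of 4, and the toy key are hypothetical, and because CPython threads share the global interpreter lock, a process pool would likely be needed to see a real speedup on CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

def do_encrypt(chunk, public_key):
    """Worker: encrypt one portion of the data."""
    e, n = public_key
    return [pow(m, e, n) for m in chunk]

def encrypt_parallel(blocks, public_key, workers=4):
    """Split the blocks into contiguous portions and encrypt them on a thread pool."""
    size = max(1, -(-len(blocks) // workers))            # ceiling division
    chunks = [blocks[i:i + size] for i in range(0, len(blocks), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in submission order, so the output order matches the input
        parts = pool.map(do_encrypt, chunks, [public_key] * len(chunks))
        return [c for part in parts for c in part]

public_key = (17, 3233)                                  # toy key for illustration
list_source = list(range(2, 102))                        # 100 small blocks
cipher = encrypt_parallel(list_source, public_key)
```

    The result is identical to the sequential loop because no block depends on any other.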

  • Contd

    Parallel RSA implementation on the many-core GPU algorithm

    Step 1: Generate the keys as mentioned earlier.

    Public key {e, n}: public struct RSA_Public_Key

    Private key {d, n}: public struct RSA_Secret_Key

    Step 2: Insert the text to be encrypted, either from a file or by typing it.

    Step 3: Set the kernel launch parameters (grid and block size for GPU execution).

    Launcher.SetGridSize (512);
    Launcher.SetBlockSize (128);

    Step 4: Call the kernel method (GPU kernel).

    Reduce_GPU (A, n, m, mPrime);

    Step 5: Get the thread ID and the total number of threads.

    int ThreadId = BlockDimension.X * BlockIndex.X + ThreadIndex.X;
    int TotalThreads = BlockDimension.X * GridDimension.X;
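
    How the two index values in Step 5 might divide the work can be simulated on the CPU in plain Python. The sketch assumes a grid-stride loop, in which thread t handles blocks t, t + TotalThreads, t + 2·TotalThreads and so on; the slides show only the index formulas, so this access pattern, the function names kernel and launch, and the small grid and block sizes are assumptions for illustration. No GPU API is involved.

```python
def kernel(thread_id, total_threads, data, out, e, n):
    """Work done by one simulated GPU thread (assumed grid-stride loop)."""
    i = thread_id
    while i < len(data):
        out[i] = pow(data[i], e, n)      # each thread writes only its own slots
        i += total_threads               # jump ahead by the total thread count

def launch(grid_size, block_size, data, e, n):
    """Run every (block, thread) pair in turn to mimic a kernel launch."""
    out = [None] * len(data)
    total_threads = block_size * grid_size               # BlockDimension.X * GridDimension.X
    for block_index in range(grid_size):
        for thread_index in range(block_size):
            # BlockDimension.X * BlockIndex.X + ThreadIndex.X
            thread_id = block_size * block_index + thread_index
            kernel(thread_id, total_threads, data, out, e, n)
    return out

data = list(range(2, 52))                # 50 toy blocks
out = launch(4, 8, data, 17, 3233)       # 32 simulated threads, so some handle two blocks
```

    With fewer threads than blocks, the stride lets every block be covered exactly once without any coordination between threads.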

  • Results and Discussions

    The CPU carries out the key generation.

    The encryption and decryption process is handled in three cases:

    1) A sequential implementation of the RSA algorithm running on the CPU.

    2) A parallel RSA implementation executed on the multicore CPU.

    3) A parallel RSA implementation executed on the many-core GPU.

    The proposed implementation variants support variable key sizes on demand.

    The main bottleneck of the RSA encryption process is the large size of the data.

    A parallel implementation of RSA requires that there be no dependencies between the data.

    The data can therefore be divided into small portions, and each thread can process one portion.

    As a result, this data-parallel method increases the computing speed of RSA.

  • Contd

    To compare the speedup gained by parallelizing RSA in multicore CPU and GPU computing environments, the sequential and parallelized algorithms have been implemented and the elapsed time for the encryption/decryption process has been recorded.

    The GPU implementation becomes faster than the other two implementations as the key size increases.

    The experiments were conducted on a desktop with an Intel Core i7 3.23 GHz CPU and an Nvidia GeForce GT 630M GPU.

  • The execution time (latency) in milliseconds for encryption/decryption with varying key size

    Key size   Sequential CPU        Parallelized CPU      GPU
    (bits)     Enc       Dec         Enc       Dec         Enc       Dec
    768        0.110     5.03        0.87      5.46        1.08      2.42
    1024       0.130     8.89        0.94      7.89        0.92      2.78
    2048       0.49      76.294      1.28      38.66       0.91      9.27
    3072       0.85      250.034     1.4       73.98       0.8       23.59
    4096       1.54      411.453     1.9       140.38      0.99      41.26
    6144       4.01      1727.58     3.13      369.3       1.95      93.28
    8192       5.93      2664.31     3.9       724.96      1.980     201.07
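
    The decryption speedups implied by the table can be worked out with a few lines of Python. The timings are copied from the table; the speedup definition (sequential time divided by parallel time) and the variable names are the only additions.

```python
# Decryption latency in milliseconds: (sequential CPU, parallelized CPU, GPU)
dec_ms = {
    768:  (5.03, 5.46, 2.42),
    1024: (8.89, 7.89, 2.78),
    2048: (76.294, 38.66, 9.27),
    3072: (250.034, 73.98, 23.59),
    4096: (411.453, 140.38, 41.26),
    6144: (1727.58, 369.3, 93.28),
    8192: (2664.31, 724.96, 201.07),
}

# Speedup over the sequential CPU run: (parallelized CPU, GPU)
speedups = {
    bits: (seq / cpu, seq / gpu)
    for bits, (seq, cpu, gpu) in dec_ms.items()
}

for bits, (cpu_x, gpu_x) in speedups.items():
    print(f"{bits:>5} bits: CPU-parallel {cpu_x:5.2f}x, GPU {gpu_x:5.2f}x")
```

    At 768 bits the parallelized CPU run is slightly slower than the sequential one (a ratio below 1), while at 8192 bits the GPU decrypts more than 13 times faster than the sequential CPU.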

  • Fig: Chart comparing the Sequential, CPU-parallel and GPU-parallel implementations across key sizes 512, 1024, 2048, 3072, 4096, 6144 and 8192 bits

  • Conclusions and Recommendations for future work

    Due to its roots in modular arithmetic on very large numbers, RSA is considered a slow algorithm.

    This paper proposed several implementation variants for executing modular exponentiation on multicore CPUs and GPUs.

    The GPU implementation gained a higher speedup over the sequential CPU implementation, while the multithreaded CPU implementation gained only a moderate speedup over the sequential CPU implementation.

    Furthermore, additional speedup could be gained as far as throughput is concerned.

    The results reveal that the GPU is well suited to speeding up the RSA algorithm.

    For future work, implementing this algorithm on FPGAs is recommended.

  • References

    1) https://en.wikipedia.org/wiki/RSA_(cryptosystem)

    2) mathworld.wolfram.com

    3) OpenCL C/C++ Programming Guide, khronos.org

    4) Handbook of Applied Cryptography. CRC Press, Inc., Boca Raton, FL, USA.

    5) Diego Viot, Rodolfo Aurelio, Helano Castro and Jardel Silveria, Modular Multiplication Algorithm for PKC, Universidade Federal do Ceará, LESC

    6) Josef Pieprzyk and David Pointcheval, Parallel Authentication and Public key encryption, Springer-Verlag 2003

    7) Chandra, S. S. & Chandra, K. 2005. Cbigint class: an implementation of big integers in c++. J. Comput. Small Coll., 20(4)

    8) Bewick, G. 1994. Fast multiplication algorithms and implementation.
