
Research Article

Efficient Parallel Implementation of Matrix Multiplication for Lattice-Based Cryptography on Modern ARM Processor

Taehwan Park,1 Hwajeong Seo,2 Junsub Kim,3 Haeryong Park,3 and Howon Kim1

1 Pusan National University, School of Computer Science and Engineering, San-30, Jangjeon-Dong, Geumjeong-Gu, Busan 609-735, Republic of Korea
2 Hansung University, IT Engineering, 116 Samseong-Yoro-16-Gil, Seongbuk-gu, Seoul 136-792, Republic of Korea
3 Cryptographic Technical Team, Security Industry Division, Korea Internet Security Agency, 6F, 9 Jinheung-gil, Naju, Jeollanam-do 58324, Republic of Korea

Correspondence should be addressed to Howon Kim; howonkim@pusan.ac.kr

Received 6 April 2018; Accepted 5 September 2018; Published 24 September 2018

Guest Editor: Chong Hee Kim

Copyright © 2018 Taehwan Park et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Recently, various types of postquantum cryptography algorithms have been proposed for the National Institute of Standards and Technology's Postquantum Cryptography Standardization competition. Lattice-based cryptography, which is based on Learning with Errors, relies on matrix multiplication. A large matrix multiplication requires a long execution time for key generation, encryption, and decryption. In this paper, we propose an efficient parallel implementation of matrix multiplication and vector addition with matrix transpose using ARM NEON instructions on ARM Cortex-A platforms. The proposed method achieves performance enhancements of 36.93%, 6.95%, 32.92%, and 7.66%. When applied to the Lizard.CCA key generation step, the optimized method enhances the performance by 7.04%, 3.66%, 7.57%, and 9.32% over previous state-of-the-art implementations.

1. Introduction

With the development of quantum computing technologies, existing cryptosystems face security threats. Block ciphers are threatened by Grover's algorithm [1]. Public key cryptographic algorithms, such as RSA, which is based on the integer factorization problem, schemes based on the discrete logarithm problem, and ECC, which is based on the elliptic curve discrete logarithm problem, are threatened by Shor's algorithm [2]. For this reason, many cryptographers are designing new cryptographic algorithms that are safe in a quantum computing environment, such as lattice-based cryptography, multivariate-based cryptography, hash-based cryptography, code-based cryptography, and supersingular elliptic curve isogeny-based cryptography. At PQCrypto 2016, the National Institute of Standards and Technology (NIST) announced the Postquantum Cryptography Standardization competition. The submission deadline was November 30, 2017, and the first standardization workshop date was 11 April 2018. Many postquantum cryptographic algorithms have been proposed. Lattice-based cryptography, which is based on Learning with Errors (LWE) problems, uses matrix multiplication and vector addition operations for key generation, encryption, and decryption. However, matrix multiplication and vector addition for a large matrix take much time during key generation, encryption, and decryption. An efficient implementation of lattice-based cryptography therefore needs a speed-optimized implementation of matrix multiplication and vector addition. In this paper, we propose an efficient parallel implementation of matrix multiplication and vector addition for lattice-based cryptography based on LWE problems using ARM NEON SIMD intrinsic functions.

The remainder of this paper is organized as follows. Section 2 discusses the literature related to the LWE problems, NIST PQC Standardization, Lizard lattice-based cryptography, ARM NEON SIMD, and related studies on efficient implementation of lattice-based cryptography. We propose efficient ARM NEON optimized matrix multiplication and vector addition implementation methods in Section 3. Section 4 gives experimental and evaluation results on the proposed ARM NEON optimized matrix multiplication and vector addition implementation and on Lizard.CCA key generation with the proposed method. Section 5 provides some final conclusions.

Hindawi, Security and Communication Networks, Volume 2018, Article ID 7012056, 10 pages, https://doi.org/10.1155/2018/7012056

2. Related Studies

In this section, we describe related studies on LWE problems and NIST PQC standardization.

2.1. Learning with Errors (LWE) Problems. Regev introduced the Learning with Errors (LWE) problem [4]. For an n-dimensional vector $s \in \mathbb{Z}_q^n$ and an error distribution $\chi$ over $\mathbb{Z}$, the LWE distribution $A^{LWE}_{n,q,\chi}(s)$ over $\mathbb{Z}_q^n \times \mathbb{Z}_q$ is obtained by choosing a vector $a$ uniformly at random from $\mathbb{Z}_q^n$ and an error $e$ from $\chi$ and outputting

$$(a,\; b = \langle a, s \rangle + e) \in \mathbb{Z}_q^n \times \mathbb{Z}_q. \quad (1)$$

The search LWE problem is to find $s \in \mathbb{Z}_q^n$ given arbitrarily many independent samples $(a_i, b_i)$ from $A^{LWE}_{n,q,\chi}(s)$. The hardness of the decision LWE problem is guaranteed by the worst-case hardness of standard lattice problems such as the decision version of the shortest vector problem (GapSVP) and the shortest independent vectors problem (SIVP). Peikert et al. [5, 6] improved the classical version of the reduction. Brakerski et al. [6] proved that the LWE problem with a binary secret is at least as hard as the original LWE problem, and Cheon et al. [7] proved the hardness of the LWE problem with a sparse secret. Following these results, the LWE problem is now used as a hardness assumption for lattice-based postquantum cryptography. In lattice-based cryptography, errors (E) are used during the encryption and decryption procedures, and they are generated by random samplers such as the Gaussian sampler. These procedures multiply a matrix A by a secret matrix S and then add an error vector E. For example, Peikert [5] proposed a cryptosystem based on the LWE problem that is secure against any chosen-ciphertext attack, and Lin et al. [8] proposed a key exchange scheme based on the LWE problem. Many lattice-based cryptography systems provide security in a quantum computing environment based on LWE problems.
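As a small illustration of equation (1), the following C sketch computes the second component b of one LWE sample from a given a, s, and e. The function name `lwe_sample_b`, the 32-bit coefficient type, and the bound on q are assumptions made for this example only; sampling of a and e is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: computes b = <a, s> + e (mod q) for one LWE sample.
 * a and s hold n coefficients already reduced to [0, q); e is a small
 * signed error. q is assumed to be below 2^16 so products fit in 64 bits. */
static uint32_t lwe_sample_b(const uint32_t *a, const uint32_t *s,
                             int32_t e, size_t n, uint32_t q)
{
    assert(q > 0 && q < (1u << 16));
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc = (acc + (uint64_t)a[i] * s[i]) % q;   /* running inner product */
    }
    /* Map the signed error into [0, q) before adding it. */
    int64_t e_mod = e % (int64_t)q;
    if (e_mod < 0) {
        e_mod += q;
    }
    return (uint32_t)((acc + (uint64_t)e_mod) % q);
}
```

For instance, with a = (1, 2, 3), s = (4, 5, 6), e = -1, and q = 17, the inner product is 32 and the function returns (32 - 1) mod 17 = 14.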

2.2. NIST PQC Standardization. The United States National Institute of Standards and Technology (NIST) has been running its postquantum cryptography standardization process since 2016. The submission deadline was November 30, 2017. A total of 69 postquantum cryptographic algorithms were submitted to NIST PQC standardization Round 1. The submissions included 26 lattice-based cryptographic algorithms (5 signatures, 21 KEM (key encapsulation mechanism)/encryption), 19 code-based cryptographic algorithms (3 signatures, 16 KEM/encryption), 9 multivariate-based algorithms (7 signatures, 2 KEM/encryption), 3 hash-based signature schemes, and 8 others (2 signatures, 6 KEM/encryption). Four algorithms have been withdrawn. According to the Round 1 submissions, lattice-based cryptography is the most frequently proposed type of postquantum cryptography for NIST PQC standardization. Most lattice-based cryptographic algorithms are based on the LWE problem, both for security in a quantum computing environment and for efficiency of implementation. The first NIST PQC standardization conference was scheduled to take place on April 11-13, 2018. After the first NIST PQC standardization conference, it will take about five to six years until the final decision for NIST PQC standardization is made. During PQC standardization, efficient implementation of the submitted postquantum cryptographic algorithms is an important issue.

Figure 1: ARM NEON register bank (128-bit registers Q0 to Q15, each split into two of the 64-bit registers D0 to D31).

2.3. Lizard. Lizard [3] is a family of postquantum public key encryption (PKE) schemes and key encapsulation mechanisms (KEMs) that was submitted to NIST PQC standardization round 1. The security of Lizard is based on the sparse, small-secret versions of Learning with Errors (LWE) and learning with rounding (LWR). A sparse signed binary secret LWE problem is at least as hard as the original LWE problem. The public key for Lizard was chosen to be a set of LWE samples with signed binary secret information. Lizard supports IND-CPA PKE, IND-CCA2 KEM, and IND-CCA2 PKE, and there are two types of Lizard, namely, Lizard and RLizard, the latter being based on the Ring-LWE and Ring-LWR problems. The key generation step of Lizard first samples a secret vector $s \in \{-1, 0, 1\}^n$, a random matrix $A \in \mathbb{Z}_q^{m \times n}$, and an error vector $e \leftarrow DG_{\sigma}^{m}$ whose components are expected to be small. The secret key is written as $sk \leftarrow s$ and the public key is written as $pk \leftarrow (A \| b)$, where $b = As + e \in \mathbb{Z}_q^m$. Hence, the public key pk is an instance of LWE with the secret vector s. There are six parameter sets of Lizard.CCA: CCA_CATEGORY1_N536, CCA_CATEGORY1_N663, CCA_CATEGORY3_N816, CCA_CATEGORY3_N952, CCA_CATEGORY5_N1088, and CCA_CATEGORY5_N1300. The parameter sets of Lizard.KEM are similar to those of Lizard.CCA. However, RLizard.CCA and RLizard.KEM have four parameter sets: RING_CATEGORY1, RING_CATEGORY3_N1024, RING_CATEGORY3_N2048, and RING_CATEGORY5. In this study, we applied the proposed method for efficient matrix multiplication and vector addition using ARM NEON SIMD to the Lizard.CCA key generation step and evaluated the performance of the proposed method from the application point of view.
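The key generation relation b = As + e described above can be sketched in plain C as follows. This is a minimal illustration and not the Lizard reference code: the function name `keygen_b` is hypothetical, sampling of A, s, and e is omitted, and the modulus is assumed to be q = 2^16 purely so that `uint16_t` wrap-around performs the reduction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of b = A*s + e for a secret s in {-1, 0, 1}^n.
 * Assumption: q = 2^16, so uint16_t wrap-around performs the reduction.
 * A is m x n in row-major order; e and b have m entries. */
static void keygen_b(const uint16_t *A, const int8_t *s, const uint16_t *e,
                     uint16_t *b, size_t m, size_t n)
{
    for (size_t i = 0; i < m; i++) {
        uint16_t acc = e[i];
        for (size_t j = 0; j < n; j++) {
            assert(s[j] >= -1 && s[j] <= 1);
            /* A ternary secret turns each product into add, subtract, or skip. */
            if (s[j] == 1) {
                acc = (uint16_t)(acc + A[i * n + j]);
            } else if (s[j] == -1) {
                acc = (uint16_t)(acc - A[i * n + j]);
            }
        }
        b[i] = acc;
    }
}
```

With a sparse secret, most entries of s are zero, so many inner iterations do no arithmetic at all.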

2.4. ARM NEON. ARM NEON is an advanced single instruction multiple data (SIMD) engine for the ARM Cortex-A series and Cortex-R52 processors [9]. It was introduced with the ARMv7-A and ARMv7-R profiles and is now also available as an extension to the ARMv8-A and ARMv8-R profiles. ARM NEON provides 128-bit Q registers (Q0-Q15). A Q register can hold four 32-bit, eight 16-bit, or sixteen 8-bit data elements. Each Q register can be split into two 64-bit D registers, as in Figure 1.


The ARM Cortex-A series is used in smartphones and in some IoT devices such as the Raspberry Pi series. For this reason, ARM NEON SIMD is used for high-performance multimedia processing and big-data processing in the Cortex-A series environment.

There are two ways to use ARM NEON. The first uses ARM NEON intrinsic functions, which map one-to-one to ARM NEON assembly instructions. The other uses ARM NEON assembly code. In this study, we used ARM NEON intrinsic functions for efficient development of the proposed method.

In 2012, Bernstein introduced an implementation of a cryptographic algorithm using ARM NEON [10]. Since then, there have been many research studies on efficient implementation of cryptographic algorithms. The Streit method [11] is an efficient implementation of the NewHope postquantum key exchange scheme using NEON in an ARMv8-A environment. Seo [12] proposed a high-performance implementation of SGCM in an ARM environment using NEON. Liu et al. [13] proposed an efficient Number Theoretic Transform (NTT) implementation using NEON for efficient Ring-LWE software implementation in a Cortex-A series environment. Seo et al. [14] proposed a compact GCM implementation in a 32-bit ARMv7-A processor environment using NEON.

2.5. Related Studies on Efficient Implementation of Lattice-Based Cryptography. There are many research results on efficient implementation of lattice-based cryptography. Pöppelmann [15] proposed an efficient implementation of Ring-LWE encryption on reconfigurable hardware and in an 8-bit microcontroller environment, as well as software implementations of GLP on Intel/AMD CPUs and of BLISS in the Cortex-M4F environment. Nejatollahi et al. [16] introduced trends and challenges for lattice-based cryptography software implementation. In that paper, the time complexity of matrix-to-matrix/vector multiplication is given as $O(n^2)$, so matrix multiplication needs to be implemented efficiently. The Liu Zhe method [17] surveyed implementations of lattice-based cryptography on IoT devices and suggested that Ring-LWE-based cryptosystems would play an essential role in postquantum edge computing and the postquantum IoT environment. Liu et al. [18] proposed high-performance ideal lattice-based cryptography on an 8-bit AVR microcontroller, with an efficient implementation of Ring-LWE encryption in an 8-bit AVR environment that is secure against timing side-channel attacks. Bos et al. [19] proposed CRYSTALS-Kyber, a module-lattice-based KEM that provides CCA security, and presented its AVX2 implementation and performance. The McCarthy method [20] is a practical implementation of identity-based encryption over NTRU lattices on an Intel Core i7-6700 CPU, in which the DLP-IBE and the Gaussian sampler were optimized for efficiency. Yuan et al. [21] proposed a memory-constrained implementation of lattice-based encryption in a standard Java card environment. For efficiency, they optimized Montgomery Modular Multiplication (MMM) and the Fast Fourier Transform (FFT) for NTT. Oder et al. [22] proposed a practical CCA2-secure and masked Ring-LWE implementation in an ARM Cortex-M4F environment. They implemented a masked PRNG (SHAKE-128) as a countermeasure against side-channel attacks. The O'Sullivan method [23] reviewed the state of the art in efficient designs for hardware and software implementations of lattice-based cryptography.

3. Proposed Method

In this section, we describe our proposed method for efficient matrix multiplication and vector addition using ARM NEON SIMD.

3.1. Problem on Matrix Multiplication and Vector Addition Implementation. First, we describe the problem with matrix multiplication and vector addition for lattice-based cryptography based on the LWE problem. For example, consider Matrix A ($a_{ij}$, $0 \le i \le M$, $0 \le j \le N$), Matrix S ($s_{jk}$, $0 \le j \le N$, $0 \le k \le L$), and Matrix E ($e_{ik}$, $0 \le i \le M$, $0 \le k \le L$), as in Figure 2. To implement matrix multiplication and vector addition, we have to multiply each element of a row of Matrix A by the corresponding element of a column of Matrix S. After the matrix multiplication, we add each element of the product to the corresponding element of Matrix E. Because these procedures multiply and add one pair of elements at a time, the computation takes a long time.

To solve this problem, we propose efficient matrix multiplication and vector addition using NEON in an ARM Cortex-A environment.
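The element-by-element procedure described in Section 3.1 can be written as the following baseline sketch. The function name `matmul_add_naive`, the row-major layout, and the wrap-around modulo 2^16 are assumptions made for this illustration; it is not the Lizard reference code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Baseline sketch: Rst = A * S + E, one element at a time.
 * A is M x N, S is N x L, E and Rst are M x L, all row-major uint16_t.
 * Assumption: arithmetic wraps modulo 2^16, as with 16-bit lanes. */
static void matmul_add_naive(const uint16_t *A, const uint16_t *S,
                             const uint16_t *E, uint16_t *Rst,
                             size_t M, size_t N, size_t L)
{
    assert(A && S && E && Rst);
    for (size_t i = 0; i < M; i++) {
        for (size_t k = 0; k < L; k++) {
            uint16_t acc = E[i * L + k];
            for (size_t j = 0; j < N; j++) {
                /* S is read down a column: stride L between accesses. */
                acc = (uint16_t)(acc + (uint32_t)A[i * N + j] * S[j * L + k]);
            }
            Rst[i * L + k] = acc;
        }
    }
}
```

The inner loop walks S with a stride of L elements, so consecutive operands are not adjacent in memory. This is the access pattern that a transpose of S removes.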

3.2. Proposed Efficient Matrix Multiplication and Vector Addition. For efficient matrix multiplication and vector addition, we used the ARM NEON intrinsic functions shown in Table 1. Using ARM NEON SIMD, each instruction can process 128 bits of data. ARM NEON supports vector interleave, vector multiply accumulate, lane broadcast, and extraction of a lane from a vector into a register. For this reason, we propose performing the matrix multiplication after a matrix transpose, implemented with the ARM NEON intrinsic functions in Table 1. For the matrix transpose, we used the NEON vector interleave function. For the matrix multiplication, we used vector multiply accumulate, lane extraction, and lane broadcast.

The NEON data load intrinsic function loads 128 bits of data from an array of 8-, 16-, or 32-bit elements, that is, 16, 8, or 4 elements, respectively. Figure 3 describes a 128-bit NEON data load from an array of eight 16-bit elements using only the NEON data load intrinsic function.

The NEON data store intrinsic function stores 128 bits of data into an array of 8-, 16-, or 32-bit elements, that is, 16, 8, or 4 elements, respectively. Figure 4 describes a 128-bit NEON data store into an array of eight 16-bit elements using only the NEON data store intrinsic function.


Figure 2: Matrix multiplication and vector addition (existing method): Matrix A (M × N) × Matrix S (N × L) + Matrix E (M × L) = Matrix Rst (M × L).

Figure 3: NEON data load operation (uint16_t ptr[8] is loaded into lanes 0 to 7 of uint16x8_t r).

Figure 4: NEON data store operation (lanes 0 to 7 of uint16x8_t r are stored into uint16_t ptr[8]).

The NEON lane extraction intrinsic function extracts one element from a NEON vector into a register according to the lane number. Figure 5 describes the extraction of lane number 2 from NEON vector a (eight 16-bit elements) into an unsigned 16-bit register r. The same operation is available for 8-, 16-, and 32-bit elements of a NEON vector. This NEON intrinsic function is used to accumulate data and store it into a register during the matrix multiplication procedure. The details of its usage are described in Algorithm 2.

The NEON lane broadcast intrinsic function sets all lanes of a NEON vector to the same value, as in Figure 6. This NEON intrinsic function is used to initialize the accumulation NEON vector to zero during the matrix multiplication procedure. The details of its usage are described in Algorithm 2.
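The load, store, lane extraction, and lane broadcast operations described above can be tried with the short sketch below. It uses the intrinsics `vld1q_u16`, `vst1q_u16`, `vgetq_lane_u16`, and `vdupq_n_u16` from Table 1 when NEON is available. The portable stand-ins in the `#else` branch and the two demo functions are assumptions of this example, added only so that it also builds on machines without NEON.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#else
/* Portable stand-ins (an assumption of this sketch, not part of NEON) so the
 * example also builds on machines without NEON. */
typedef struct { uint16_t lane[8]; } uint16x8_t;
static uint16x8_t vld1q_u16(const uint16_t *ptr)
{
    uint16x8_t r;
    for (int i = 0; i < 8; i++) r.lane[i] = ptr[i];
    return r;
}
static void vst1q_u16(uint16_t *ptr, uint16x8_t val)
{
    for (int i = 0; i < 8; i++) ptr[i] = val.lane[i];
}
static uint16x8_t vdupq_n_u16(uint16_t value)
{
    uint16x8_t r;
    for (int i = 0; i < 8; i++) r.lane[i] = value;
    return r;
}
#define vgetq_lane_u16(vec, n) ((vec).lane[(n)])
#endif

/* Hypothetical demo: copy eight values through a vector, return lane 2. */
static uint16_t load_store_lane2(const uint16_t src[8], uint16_t dst[8])
{
    uint16x8_t r = vld1q_u16(src);      /* src[i] goes to lane i */
    vst1q_u16(dst, r);                  /* lane i goes to dst[i] */
    return vgetq_lane_u16(r, 2);        /* lane index must be a constant */
}

/* Hypothetical demo: broadcast one value to all lanes and store them. */
static void broadcast_store(uint16_t value, uint16_t dst[8])
{
    vst1q_u16(dst, vdupq_n_u16(value));
}
```

Calling `broadcast_store(0, dst)` corresponds to the zero initialization of the accumulation vector mentioned above.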

The NEON vector interleave function interleaves the elements of two NEON registers, as in Figure 7. The result of the vector interleave is stored in a NEON register array of size 2 (2 × 128-bit data). If we implemented the matrix transpose in plain C, we would have to exchange the matrix elements one by one. With the NEON vector interleave, however, we can rearrange 128 bits of data with each instruction. This NEON intrinsic function is used to transpose matrix elements during the matrix transpose procedure in Algorithm 1.
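To make the lane order of the interleave concrete, here is a scalar model of the 16-bit, 8-lane case (`vzipq_u16` in Table 1). The function `zip_u16x8_model` is a hypothetical stand-in written for this illustration; it is not a NEON intrinsic.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the 16-bit, 8-lane interleave (vzipq_u16 in Table 1).
 * lo receives the interleaved lower halves of a and b (val[0]),
 * hi receives the interleaved upper halves (val[1]). */
static void zip_u16x8_model(const uint16_t a[8], const uint16_t b[8],
                            uint16_t lo[8], uint16_t hi[8])
{
    assert(a && b && lo && hi);
    for (int i = 0; i < 4; i++) {
        lo[2 * i]     = a[i];
        lo[2 * i + 1] = b[i];
        hi[2 * i]     = a[i + 4];
        hi[2 * i + 1] = b[i + 4];
    }
}
```

If a and b are two rows of a matrix, each adjacent pair in the output holds two elements of the same column. Repeating the interleave on the intermediate results groups more rows per column, which is how the rounds of interleaves in Algorithm 1 build up the transposed block.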

Algorithm 1 describes the matrix transpose method using NEON for efficient matrix multiplication. Lines 2 to 6 of Algorithm 1 handle a block that would extend beyond the edge of the matrix: the index is moved back so that the whole block lies inside the matrix and can be processed by the NEON SIMD transpose. In that case, the row index is decreased by (BLOCK_TRANSPOSE - N % BLOCK_TRANSPOSE) and the column index is decreased by (BLOCK_TRANSPOSE - L % BLOCK_TRANSPOSE).

After calculating the matrix index, lines 7 to 56 repeat the data load into NEON registers and the vector interleave between NEON registers until the transpose of each BLOCK_TRANSPOSE block is done. In Algorithm 1, we assume that each matrix element is 16 bits wide, so BLOCK_TRANSPOSE is 8 because each NEON register holds 128 bits (8 × 16-bit data). After each block has been transposed, the NEON register data are stored to the transposed matrix array.
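The index adjustment for edge blocks can be illustrated with the scalar sketch below. It only models the blocking and the shift-back of the last block; the 8 × 8 tile is copied element by element rather than with NEON loads, interleaves, and stores, and the function name `transpose_blocked` is an assumption of this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_TRANSPOSE 8  /* eight 16-bit elements fill one 128-bit register */

/* Scalar sketch of a blocked transpose: S is N x L, St becomes L x N.
 * Assumption: N and L are at least BLOCK_TRANSPOSE. A block that would run
 * past an edge is shifted back inside, so a few elements are written twice
 * with the same value. */
static void transpose_blocked(const uint16_t *S, uint16_t *St,
                              size_t N, size_t L)
{
    assert(N >= BLOCK_TRANSPOSE && L >= BLOCK_TRANSPOSE);
    for (size_t i0 = 0; i0 < N; i0 += BLOCK_TRANSPOSE) {
        size_t i = (i0 + BLOCK_TRANSPOSE > N) ? N - BLOCK_TRANSPOSE : i0;
        for (size_t j0 = 0; j0 < L; j0 += BLOCK_TRANSPOSE) {
            size_t j = (j0 + BLOCK_TRANSPOSE > L) ? L - BLOCK_TRANSPOSE : j0;
            /* In a NEON version this 8 x 8 tile is handled with vector
             * loads, interleaves, and stores instead of scalar copies. */
            for (size_t r = 0; r < BLOCK_TRANSPOSE; r++) {
                for (size_t c = 0; c < BLOCK_TRANSPOSE; c++) {
                    St[(j + c) * N + (i + r)] = S[(i + r) * L + (j + c)];
                }
            }
        }
    }
}
```

Shifting the last block back to N - BLOCK_TRANSPOSE is the same as subtracting (BLOCK_TRANSPOSE - N % BLOCK_TRANSPOSE) from a block start that is a multiple of BLOCK_TRANSPOSE. The overlap costs a few redundant writes but avoids a separate scalar path for partial blocks.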

For matrix multiplication and vector addition in plain C, we have to multiply the matrices element by element and, after the matrix multiplication, add each element of the product to the corresponding element of the vector, which takes a long execution time that grows with the matrix size. However, if we use NEON vector multiplication and accumulation as in Figure 8, each NEON instruction processes 128 bits of data, which accelerates the matrix multiplication and vector addition.

Table 1: ARM NEON intrinsic functions for the proposed method.

Operation                                       ARM NEON intrinsic function
Load                                            uint16x8_t vld1q_u16(__transfersize(8) uint16_t const *ptr)
Store                                           void vst1q_u16(__transfersize(8) uint16_t *ptr, uint16x8_t val)
Extracting lanes from a vector into a register  uint16_t vgetq_lane_u16(uint16x8_t vec, __constrange(0,7) int lane)
Lane broadcast                                  uint16x8_t vdupq_n_u16(uint16_t value)
Vector interleave                               uint16x8x2_t vzipq_u16(uint16x8_t a, uint16x8_t b)
Vector multiply accumulate                      uint16x8_t vmlaq_u16(uint16x8_t a, uint16x8_t b, uint16x8_t c)

Figure 5: NEON extracting lane from a vector to a register (lane 2 of uint16x8_t a is extracted into uint16_t r).

We propose an efficient matrix multiplication and accumulation method based on ARM NEON SIMD, as in Algorithm 2. Algorithm 2 is conducted after the matrix transpose. In Algorithm 2, LANES_SHORT_NUM has the same value as BLOCK_TRANSPOSE in Algorithm 1. Line 3 sets every 16-bit lane of the NEON register sum_vect to 0 using the lane broadcast intrinsic function (vdupq). Lines 4 to 7 load data from matrix A and matrix S into NEON registers according to each matrix index and then multiply and accumulate the NEON registers over the first N / LANES_SHORT_NUM blocks. Lines 8 and 9 store the NEON register value into an array of eight 16-bit elements, sum the array elements, and add the result to the element of Matrix E selected by the matrix index. Lines 10 to 12 calculate the matrix multiplication and vector addition for the matrix elements that lie beyond the last whole block, that is, beyond (matrix size / NEON register lane size) blocks. If the matrix dimension is a multiple of the lane count, this part is not executed. In short, Algorithm 2 calculates the matrix multiplication and vector addition using NEON, and for matrix elements positioned beyond the last whole NEON register it uses normal matrix multiplication and vector addition in C.
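The structure of Algorithm 2 can be sketched in C as follows. This is an illustrative reading of the description above rather than the authors' code: the function name `matmul_acc_transposed` and the scalar fallback branch are assumptions, and the NEON branch uses only `vdupq_n_u16`, `vld1q_u16`, `vmlaq_u16`, and `vst1q_u16` from Table 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

#define LANES_SHORT_NUM 8  /* 16-bit lanes per 128-bit register */

/* Sketch: E += A * S, where St is the transposed S (L x N, row-major).
 * A is M x N and E is M x L. Arithmetic wraps modulo 2^16 (an assumption
 * matching 16-bit lanes). */
static void matmul_acc_transposed(const uint16_t *A, const uint16_t *St,
                                  uint16_t *E, size_t M, size_t N, size_t L)
{
    size_t full = N - (N % LANES_SHORT_NUM);  /* part covered by whole vectors */
    assert(A && St && E);
    for (size_t i = 0; i < M; i++) {
        for (size_t j = 0; j < L; j++) {
            const uint16_t *a = A + i * N;
            const uint16_t *s = St + j * N;   /* row j of St = column j of S */
            uint16_t acc = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
            uint16x8_t sum_vect = vdupq_n_u16(0);
            uint16_t sum[LANES_SHORT_NUM];
            for (size_t k = 0; k < full; k += LANES_SHORT_NUM) {
                sum_vect = vmlaq_u16(sum_vect, vld1q_u16(a + k), vld1q_u16(s + k));
            }
            vst1q_u16(sum, sum_vect);
            for (int l = 0; l < LANES_SHORT_NUM; l++) {
                acc = (uint16_t)(acc + sum[l]);
            }
#else
            /* Scalar fallback with the same result modulo 2^16. */
            for (size_t k = 0; k < full; k++) {
                acc = (uint16_t)(acc + (uint32_t)a[k] * s[k]);
            }
#endif
            /* Tail: the last N % LANES_SHORT_NUM elements. */
            for (size_t k = full; k < N; k++) {
                acc = (uint16_t)(acc + (uint32_t)a[k] * s[k]);
            }
            E[i * L + j] = (uint16_t)(E[i * L + j] + acc);
        }
    }
}
```

Because S has been transposed, both operands of every vector load are contiguous in memory, which is what makes the 8-lane multiply accumulate applicable to a whole row at a time.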

As described above, we propose an efficient matrix transpose, matrix multiplication, and vector addition. We now combine them into an efficient matrix transpose, multiplication, and vector addition for LWE in lattice-based cryptography, as in Algorithm 3. In Algorithm 3, we transpose matrix S using Algorithm 1 and calculate the matrix multiplication and vector (matrix E) addition using Algorithm 2.

Figure 9 describes Algorithm 3 as a block diagram. In Figure 9, the dark blue and dark red parts are calculated using NEON SIMD for matrix multiplication and vector addition based on NEON multiplication and accumulation. Matrix S is first transposed by the NEON-based matrix transpose operation in Algorithm 1. The light blue and light red parts lie beyond (matrix row size / NEON lane size) or (matrix column size / NEON lane size) whole blocks. These parts are calculated in C with the normal method for matrix multiplication and vector addition.

If a NEON SIMD register that holds the result of one operation is used as an operand of the very next operation, a data dependency arises. This dependency causes a Read After Write (RAW) data hazard (a pipeline stall), which costs some clock cycles until the result of the previous operation becomes available. To avoid the data hazard and enhance performance, we scheduled the order in which the NEON registers are used. For an efficient NEON SIMD implementation, we made full use of the NEON Q registers (Q0-Q15).
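One common way to avoid back-to-back dependencies is to keep several independent accumulators. The scalar sketch below shows the idea with four of them; it is an illustration of the general technique, not the register schedule used by the authors, and the function name `dot_u16_four_chains` is hypothetical. In a NEON version, each accumulator would be a separate Q register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch: dot product modulo 2^16 with four independent accumulators.
 * Each accumulator forms its own dependency chain, so a multiply-accumulate
 * does not have to wait for the result of the one issued just before it.
 * Assumption: n is a multiple of 4. */
static uint16_t dot_u16_four_chains(const uint16_t *a, const uint16_t *b,
                                    size_t n)
{
    assert(n % 4 == 0);
    uint16_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    for (size_t k = 0; k < n; k += 4) {
        acc0 = (uint16_t)(acc0 + (uint32_t)a[k]     * b[k]);
        acc1 = (uint16_t)(acc1 + (uint32_t)a[k + 1] * b[k + 1]);
        acc2 = (uint16_t)(acc2 + (uint32_t)a[k + 2] * b[k + 2]);
        acc3 = (uint16_t)(acc3 + (uint32_t)a[k + 3] * b[k + 3]);
    }
    /* The chains are combined once, after the loop. */
    return (uint16_t)(acc0 + acc1 + acc2 + acc3);
}
```

Since addition modulo 2^16 is associative and commutative, splitting the sum across accumulators does not change the result.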

4. Experiment & Evaluation

In this section, we describe the experimental environment, the performance measurement, and the evaluation of the proposed method. For an objective evaluation, we applied the proposed method to the Lizard.CCA key generation step, which uses the LWE problem for key generation.

4.1. Experiment. Our experimental environment was a Raspberry Pi 3 Model B, which has a Broadcom BCM2837 chipset (1.2 GHz quad-core ARM Cortex-A53) and 1 GB of LPDDR2 memory. The operating system is Raspbian GNU/Linux 8.0 (Jessie). We used GCC compiler version 4.9.2 with the compile options -O3 -mcpu=cortex-a53 -mfloat-abi=hard -mfpu=neon-fp-armv8 -mneon-for-64bits -mtune=cortex-a53 -std=c99 to use ARM NEON and to compile for the Cortex-A53 environment. For the C version code, we enabled NEON autovectorization with the compile options -O3 -mcpu=cortex-a53 -ftree-vectorize -mfloat-abi=hard -mfpu=neon-fp-armv8 -mneon-for-64bits -mtune=cortex-a53 -std=c99. GCC vectorization is enabled by the flags -ftree-vectorize and -O3. NEON is enabled by the flags -mfloat-abi=hard -mfpu=neon-fp-armv8 -mneon-for-64bits -mtune=cortex-a53. With GCC autovectorization for NEON, the GCC compiler translates the C source code into NEON code automatically.

Algorithm 1: Efficient matrix transpose.

Require: Matrix S (N × L matrix, $s_{ij}$, $0 \le i \le N$, $0 \le j \le L$)
Ensure: Matrix S' (L × N matrix, $s'_{ij}$, $0 \le j \le N$, $0 \le i \le L$)
 1: for i from 0 to N, i += BLOCK_TRANSPOSE do
 2:   if i + BLOCK_TRANSPOSE > N
 3:     let i -= BLOCK_TRANSPOSE - N % BLOCK_TRANSPOSE
 4:   for j from 0 to L, j += BLOCK_TRANSPOSE do
 5:     if j + BLOCK_TRANSPOSE > L
 6:       let j -= BLOCK_TRANSPOSE - L % BLOCK_TRANSPOSE
 7:     vec1_l = NEON Vector Load(S + i * L + j)
 8:     vec1_h = NEON Vector Load(S + i * L + j + 8)
 9:     vec2_l = NEON Vector Load(S + (i + 8) * L + j)
10:     vec2_h = NEON Vector Load(S + (i + 8) * L + j + 8)
11:     t2 = NEON Vector Interleave(vec1_l, vec2_l)
12:     t3 = NEON Vector Interleave(vec1_h, vec2_h)
13:     vec1_l = NEON Vector Load(S + (i + 2) * LWE_L + j)
14:     vec1_h = NEON Vector Load(S + (i + 2) * LWE_L + j + 8)
15:     vec2_l = NEON Vector Load(S + (i + 10) * LWE_L + j)
16:     vec2_h = NEON Vector Load(S + (i + 10) * LWE_L + j + 8)
17:     t4 = NEON Vector Interleave(vec1_l, vec2_l)
18:     t5 = NEON Vector Interleave(vec1_h, vec2_h)
19:     vec1_l = NEON Vector Load(S + (i + 3) * LWE_L + j)
20:     vec1_h = NEON Vector Load(S + (i + 3) * LWE_L + j + 8)
21:     vec2_l = NEON Vector Load(S + (i + 11) * LWE_L + j)
22:     vec2_h = NEON Vector Load(S + (i + 11) * LWE_L + j + 8)
23:     t6 = NEON Vector Interleave(vec1_l, vec2_l)
24:     t7 = NEON Vector Interleave(vec1_h, vec2_h)
25:     m0 = NEON Vector Interleave(t0.val[0], t4.val[0])
26:     m1 = NEON Vector Interleave(t0.val[1], t4.val[1])
27:     m2 = NEON Vector Interleave(t1.val[0], t5.val[0])
28:     m3 = NEON Vector Interleave(t1.val[1], t5.val[1])
29:     m4 = NEON Vector Interleave(t2.val[0], t6.val[0])
30:     m5 = NEON Vector Interleave(t2.val[1], t6.val[1])
31:     m6 = NEON Vector Interleave(t3.val[0], t7.val[0])
32:     m7 = NEON Vector Interleave(t3.val[1], t7.val[1])
33:     t0 = NEON Vector Interleave(m0.val[0], m4.val[0])
34:     t1 = NEON Vector Interleave(m0.val[1], m4.val[1])
35:     t2 = NEON Vector Interleave(m1.val[0], m5.val[0])
36:     t3 = NEON Vector Interleave(m1.val[1], m5.val[1])
37:     t4 = NEON Vector Interleave(m2.val[0], m6.val[0])
38:     t5 = NEON Vector Interleave(m2.val[1], m6.val[1])
39:     t6 = NEON Vector Interleave(m3.val[0], m7.val[0])
40:     t7 = NEON Vector Interleave(m3.val[1], m7.val[1])
41:     NEON Vector Store(S' + j * LWE_N + i, t0.val[0])
42:     NEON Vector Store(S' + j * LWE_N + i + 8, t0.val[1])
43:     NEON Vector Store(S' + (j + 1) * LWE_N + i, t1.val[0])
44:     NEON Vector Store(S' + (j + 1) * LWE_N + i + 8, t1.val[1])
45:     NEON Vector Store(S' + (j + 2) * LWE_N + i, t2.val[0])
46:     NEON Vector Store(S' + (j + 2) * LWE_N + i + 8, t2.val[1])
47:     NEON Vector Store(S' + (j + 3) * LWE_N + i, t3.val[0])
48:     NEON Vector Store(S' + (j + 3) * LWE_N + i + 8, t3.val[1])
49:     NEON Vector Store(S' + (j + 4) * LWE_N + i, t4.val[0])
50:     NEON Vector Store(S' + (j + 4) * LWE_N + i + 8, t4.val[1])
51:     NEON Vector Store(S' + (j + 5) * LWE_N + i, t5.val[0])
52:     NEON Vector Store(S' + (j + 5) * LWE_N + i + 8, t5.val[1])
53:     NEON Vector Store(S' + (j + 6) * LWE_N + i, t6.val[0])
54:     NEON Vector Store(S' + (j + 6) * LWE_N + i + 8, t6.val[1])
55:     NEON Vector Store(S' + (j + 7) * LWE_N + i, t7.val[0])
56:     NEON Vector Store(S' + (j + 7) * LWE_N + i + 8, t7.val[1])
57: Return S'

Algorithm 2: Efficient matrix multiplication and accumulation.

Require: Matrix A (M × N matrix, $a_{ij}$, $0 \le i \le M$, $0 \le j \le N$), Matrix S (N × L matrix, $s_{jk}$, $0 \le j \le N$, $0 \le k \le L$), Matrix E (M × L matrix, $e_{ik}$, $0 \le i \le M$, $0 \le k \le L$)
Ensure: Matrix E (M × L matrix, $r_{ik}$, $0 \le i \le M$, $0 \le k \le L$)
 1: for i from 0 to M do
 2:   for j from 0 to L do
 3:     sum_vect = NEON Lane Broadcast(0)
 4:     for k from 0 to iter_k do
 5:       a_vec = NEON Vector Load(A + i * N + k * LANES_SHORT_NUM)
 6:       s_vec = NEON Vector Load(S + j * N + k * LANES_SHORT_NUM)
 7:       sum_vect = NEON Multiply Accumulate(sum_vect, a_vec, s_vec)
 8:     NEON Vector Store(sum, sum_vect)
 9:     E[i * L + j] += sum[0] + sum[1] + sum[2] + sum[3] + sum[4] + sum[5] + sum[6] + sum[7]
10:     if (k == N / LANES_SHORT_NUM) && (N % LANES_SHORT_NUM)
11:       for k from N - (N % LANES_SHORT_NUM) to N do
12:         E[i * L + j] += A[i * N + k] * S[j * N + k]
13: Return E

Figure 6: NEON lane broadcast operation (a uint16_t value, here 0, is copied into every lane of uint16x8_t r).

Figure 7: VZIP ARM NEON interleave operation (two vectors 1 2 3 4 are interleaved into 1 1 2 2 3 3 4 4).

4.2. Evaluation. To evaluate our method, we measured the average execution time over 1000 runs of each operation for the Lizard.CCA parameter sets. For the Lizard.CCA CATEGORY5_N1088 and CATEGORY5_N1300 parameters, we could not measure the execution time. First, we measured and compared the performance of the proposed matrix transpose method and the normal C version, as in Table 2. Our proposed matrix transpose method performed better than the C version (with GCC autovectorization). The C version (with GCC autovectorization) had low performance because it contains conditional branches such as 'while' and 'if' statements.

Figure 8: VMLA ARM NEON multiply accumulation operation (two vectors are multiplied lane by lane and the products are added to an accumulator vector).

After measuring the proposed matrix transpose method, we measured the proposed matrix multiplication and vector addition. For an objective evaluation, we compared the performance of the proposed method with the C version of the matrix multiplication and vector addition part of the Lizard.CCA key generation step [3], according to the Lizard.CCA parameters. The C version from Lizard.CCA [3] was submitted to NIST PQC Standardization round 1; it is a plain C matrix multiplication using C pointers. The measurements of the proposed method for matrix multiplication and vector addition include the matrix transpose part. Table 3 describes the comparison results between the C version [3] (with GCC autovectorization) and the proposed method. The proposed method improved the performance at the four parameter sets by 36.93%, 6.95%, 32.92%, and 7.66%, respectively.

Figure 9: Proposed matrix multiplication and vector addition: Matrix A (M × N) × transposed Matrix S + Matrix E (M × L) = Matrix Rst (M × L).

Table 2: Matrix transpose performance (unit: ms).

N     M      L     C version (auto-vectorization)   Proposed (NEON)
536   1024   256   3.642304                         0.446443
663   1024   256   6.300066                         0.707373
816   1024   384   9.704782                         1.78282
952   1024   384   11.72607                         2.078113

Table 3: Matrix multiplication performance (unit: ms).

N     M      L     C version [3] (auto-vectorization)   Proposed (NEON)
536   1024   256   148.8991                             93.91285
663   1024   256   171.0976                             159.2069
816   1024   384   334.7499                             224.5633
952   1024   384   391.7564                             361.7326

Table 4: Lizard.CCA key generation performance (unit: ms).

N     M      L     Cheon et al. [3] (auto-vectorization)   Proposed (NEON)
536   1024   256   622.920                                 579.071
663   1024   256   736.950                                 709.942
816   1024   384   1075.164                                993.760
952   1024   384   1239.633                                1124.144

Our proposed methods performed better. Next, we applied the proposed methods to the Lizard.CCA key generation step [3] for an objective evaluation. Table 4 describes the performance comparison between the original Lizard.CCA key generation step [3] and the proposed method. With the proposed methods, the Lizard.CCA key generation step improved at the four parameter sets by 7.04%, 3.66%, 7.57%, and 9.32%, respectively, over the original Lizard.CCA key generation step [3].
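The reported percentages are consistent with a relative reduction in execution time, (baseline - proposed) / baseline. The helper below makes that calculation explicit. The function name is hypothetical, and the check values assume that the first rows of Tables 3 and 4 read 148.8991 ms versus 93.91285 ms and 622.920 ms versus 579.071 ms, where the decimal placement is an assumption.

```c
#include <assert.h>

/* Hypothetical helper: relative reduction in execution time, in percent. */
static double improvement_percent(double t_baseline, double t_proposed)
{
    assert(t_baseline > 0.0);
    return (t_baseline - t_proposed) / t_baseline * 100.0;
}
```

Under that reading, `improvement_percent(148.8991, 93.91285)` is about 36.93 and `improvement_percent(622.920, 579.071)` is about 7.04, matching the first entries of the two lists of improvements.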

According to Tables 3 and 4, the proposed methods for efficient matrix multiplication improved performance. However, for the Lizard.CCA CATEGORY1_N663 parameter, the performance gain was lower than for the others. Parameter N is 663, which leaves a remainder of 7 when divided by the lane count (663 = 8 × 82 + 7), so the matrix elements from index 656 up to the end of each row have to be multiplied using the normal method.
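The split between the vectorized part and the scalar tail follows directly from the lane count. The two helper functions below are hypothetical names introduced for this illustration, assuming eight 16-bit lanes per NEON register as in the rest of the paper.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8  /* 16-bit lanes per 128-bit NEON register */

/* Number of columns covered by whole 8-lane vectors. */
static size_t vector_part(size_t n) { return n - (n % LANES); }

/* Number of trailing columns left for the scalar loop. */
static size_t scalar_tail(size_t n) { return n % LANES; }
```

For N = 663 this gives a vector part of 656 and a tail of 7. The other measured sizes are multiples of 8 (536 = 8 × 67, 816 = 8 × 102, 952 = 8 × 119), so they have no scalar tail.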

5. Conclusions

Many postquantum cryptography systems are being developed to deal with quantum computing technologies and the security threats they pose to existing cryptosystems. NIST is working on postquantum cryptography standardization. A large part of the submissions to NIST's PQC standardization competition is lattice-based cryptography, and many lattice-based cryptographic algorithms are based on the LWE problem. LWE-based procedures need matrix multiplication between very large matrices. However, normal matrix multiplication computes the result element by element. For efficient matrix multiplication, we proposed matrix multiplication and vector addition with a matrix transpose using ARM NEON SIMD techniques. The proposed matrix multiplication and vector addition with matrix transpose improved performance for the four parameter sets by 36.93%, 6.95%, 32.92%, and 7.66%, respectively, and, applied to the LizardCCA key generation step, it improved performance by 7.04%, 3.66%, 7.57%, and 9.32%, respectively, over the original LizardCCA key generation step [3]. In the future, research is needed on efficient multiplication of the matrix elements that fall outside a whole number of NEON register lanes, for further efficiency and a fully NEON-based method. We will study efficient implementations of matrix multiplication and vector addition for lattice-based cryptography that use full NEON SIMD for any parameters, mixing ARM NEON and ARM assembly instructions, as well as AVX2 SIMD in an Intel x64 environment.

Require: Matrix A (M × N matrix, $a_{ij}$, $0 \le i < M$, $0 \le j < N$), Matrix S (N × L matrix, $s_{jk}$, $0 \le j < N$, $0 \le k < L$), Matrix E (M × L matrix, $e_{ik}$, $0 \le i < M$, $0 \le k < L$)
Ensure: Matrix Rst (M × L matrix, $r_{ik}$, $0 \le i < M$, $0 \le k < L$)
1: NEON Matrix Transpose (Matrix S) (Algorithm 1)
2: NEON Matrix Multiply Accumulate (Matrix A, Matrix S, Matrix E) (Algorithm 2)
3: Return E

Algorithm 3: Efficient matrix transpose, multiplication, and accumulation for LWE.
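The two steps of Algorithm 3 (transpose S, then multiply A by S and accumulate into E) can be sketched in portable C. This is a scalar illustration under stated assumptions, not the NEON implementation: matrices are assumed to be row-major `uint16_t` arrays with arithmetic modulo 2^16, and the function names `transpose_u16` and `mul_acc_u16` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Step 1: St (L x N) = transpose of S (N x L), both row-major. */
void transpose_u16(const uint16_t *S, uint16_t *St, size_t N, size_t L)
{
    for (size_t j = 0; j < N; j++)
        for (size_t k = 0; k < L; k++)
            St[k * N + j] = S[j * L + k];
}

/* Step 2: E (M x L) += A (M x N) * S, reading S through its transpose so
 * that both operands are walked along contiguous rows. Results wrap
 * modulo 2^16 when stored back into E. */
void mul_acc_u16(const uint16_t *A, const uint16_t *St, uint16_t *E,
                 size_t M, size_t N, size_t L)
{
    for (size_t i = 0; i < M; i++) {
        for (size_t k = 0; k < L; k++) {
            uint32_t acc = E[i * L + k];
            for (size_t j = 0; j < N; j++)
                acc += (uint32_t)A[i * N + j] * St[k * N + j];
            E[i * L + k] = (uint16_t)acc;
        }
    }
}
```

Reading S through its transpose means the innermost loop touches consecutive memory in both A and St, which is the access pattern that lets eight neighbouring 16-bit elements be loaded into one Q register at a time.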

Data Availability

The source code of the proposed matrix transpose, multiplication, and vector addition implementation is available in a GitHub repository (httpsgithubcompth5804MatTrans Mul NEONPQC).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work of Taehwan Park and Howon Kim was supported by the Ministry of Trade, Industry & Energy (MOTIE, Korea) under the Industrial Technology Innovation Program (no. 10073236). This work of Hwajeong Seo was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (no. NRF-2017R1C1B5075742). This work of Junsub Kim and Haeryong Park was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korean government (MSIP) (no. 2017-0-00616, development for lattice-based postquantum public key cryptographic scheme).

References

[1] L. K. Grover, "A fast quantum mechanical algorithm for database search," in Proceedings of the 28th Annual ACM Symposium on Theory of Computing, pp. 212-219, ACM, 1996.

[2] P. W. Shor, "Algorithms for quantum computation: discrete logarithms and factoring," in Proceedings of the 35th Annual Symposium on Foundations of Computer Science (SFCS '94), pp. 124-134, IEEE, 1994.

[3] J. H. Cheon, S. Park, J. Lee et al., "Post-Quantum Cryptography," National Institute of Standards and Technology, Tech. rep., 2017, https://csrc.nist.gov/projects/post-quantum-cryptography/round-1-submissions.

[4] O. Regev, "On lattices, learning with errors, random linear codes, and cryptography," Journal of the ACM, vol. 56, no. 6, article 34, 2009.

[5] C. Peikert, "Public-key cryptosystems from the worst-case shortest vector problem," in Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, ACM, 2009.

[6] Z. Brakerski, A. Langlois, C. Peikert, O. Regev, and D. Stehlé, "Classical hardness of learning with errors," in Proceedings of the 45th Annual ACM Symposium on Theory of Computing (STOC '13), pp. 575-584, ACM, June 2013.

[7] H. C. Jung, H. Kyoohyung, K. Jinsu, L. Changmin, and S. Yongha, "Practical postquantum public key cryptosystem based on LWE," in Proceedings of the 19th Annual International Conference on Information Security and Cryptology, 2016.

[8] D. Jintai, X. Xiang, and L. Xiaodong, "A Simple Provably Secure Key Exchange Scheme Based on the Learning with Errors Problem," in Cryptology ePrint Archive, p. 688, 2012.

[9] ARM, NEON Programmer's Guide, version 1.0, 2013.

[10] D. J. Bernstein and P. Schwabe, "NEON Crypto," in Cryptographic Hardware and Embedded Systems - CHES 2012, vol. 7428 of Lecture Notes in Computer Science, pp. 320-339, Springer, Berlin, Heidelberg, 2012.

[11] S. Streit and F. De Santis, "Post-Quantum Key Exchange on ARMv8-A: A New Hope for NEON Made Simple," IEEE Transactions on Computers, 2017.

[12] H. Seo, "High performance implementation of SGCM on high-end IoT devices," Journal of Information and Communication Convergence Engineering, vol. 15, no. 4, pp. 212-216, 2017.

[13] Z. Liu, R. Azarderakhsh, H. Kim, and H. Seo, "Efficient Software Implementation of Ring-LWE Encryption on IoT Processors," IEEE Transactions on Computers, 2017.

[14] H. Seo, G. Lee, T. Park, and H. Kim, "Compact GCM implementations on 32-bit ARMv7-A processors," in Proceedings of the 2017 International Conference on Information and Communication Technology Convergence (ICTC), pp. 704-707, Jeju, October 2017.

[15] T. Pöppelmann, "Efficient implementation of ideal lattice-based cryptography," it - Information Technology, vol. 59, no. 6, 2017.

[16] H. Nejatollahi, N. Dutt, and R. Cammarota, "Special session: Trends, challenges and needs for lattice-based cryptography implementations," in Proceedings of the 12th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, CODES 2017, Republic of Korea, October 2017.

[17] Z. Liu, K.-K. R. Choo, and J. Großschädl, "Securing Edge Devices in the Post-Quantum Internet of Things Using Lattice-Based Cryptography," IEEE Communications Magazine, vol. 56, no. 2, pp. 158-162, 2018.

[18] Z. Liu, T. Pöppelmann, T. Oder et al., "High-performance ideal lattice-based cryptography on 8-bit AVR microcontrollers," ACM Transactions on Embedded Computing Systems, vol. 16, no. 4, 2017.

[19] J. Bos, L. Ducas, E. Kiltz et al., "CRYSTALS - Kyber: A CCA-Secure Module-Lattice-Based KEM," in Proceedings of the 2018 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 353-367, London, April 2018.

[20] S. McCarthy, N. Smyth, and E. O'Sullivan, "A Practical Implementation of Identity-Based Encryption Over NTRU Lattices," Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Preface, vol. 10655, pp. 227-246, 2017.

[21] Y. Yuan, K. Fukushima, S. Kiyomoto, and T. Takagi, "Memory-constrained implementation of lattice-based encryption scheme on standard Java Card," in Proceedings of the 10th IEEE International Symposium on Hardware Oriented Security and Trust, HOST 2017, pp. 47-50, USA, May 2017.

[22] O. Tobias, T. Schneider, T. Pöppelmann, and T. Güneysu, "Practical cca2-secure and masked ring-lwe implementation," IACR Transactions on Cryptographic Hardware and Embedded Systems, vol. 1, pp. 142-174, 2018.

[23] E. O'Sullivan and F. Regazzoni, "Special session paper: Efficient arithmetic for lattice-based cryptography," in Proceedings of the 12th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, CODES 2017, Republic of Korea, October 2017.



with the proposed method. Section 5 provides some final conclusions.

2. Related Studies

In this section, we describe related studies on LWE problems and NIST PQC standardization.

2.1. Learning with Errors (LWE) Problems. Regev introduced the Learning with Errors (LWE) problem [4]. For an n-dimensional vector $s \in \mathbb{Z}_q^n$ and an error distribution $\chi$ over $\mathbb{Z}$, the LWE distribution $A^{\mathrm{LWE}}_{n,q,\chi}(s)$ over $\mathbb{Z}_q^n \times \mathbb{Z}_q$ is obtained by choosing a vector $a$ uniformly at random from $\mathbb{Z}_q^n$ and an error $e$ from $\chi$, and outputting

$$(a,\; b = \langle a, s \rangle + e) \in \mathbb{Z}_q^n \times \mathbb{Z}_q. \tag{1}$$

The search LWE problem is to find $s \in \mathbb{Z}_q^n$ given arbitrarily many independent samples $(a_i, b_i)$ from $A^{\mathrm{LWE}}_{n,q,\chi}(s)$. The hardness of the decision LWE problem is guaranteed by the worst-case hardness of standard lattice problems such as the decision version of the shortest vector problem (GapSVP) and the shortest independent vectors problem (SIVP). Peikert et al. [5, 6] improved the reduction to a classical version. Brakerski et al. [6] proved that the LWE problem with a binary secret is at least as hard as the original LWE problem, and Cheon et al. [7] proved the hardness of the LWE problem with a sparse secret. Following these results, the LWE problem is now used as a hardness assumption for lattice-based postquantum cryptography. In lattice-based cryptography, errors (E) are used during the encryption and decryption procedures, and they are generated by random samplers such as the Gaussian sampler. These procedures use matrix multiplication between matrix A and secret matrix S, followed by vector addition with the error vector E. For example, Peikert [5] proposed a cryptosystem based on the LWE problem that is secure against chosen-ciphertext attacks, and Lin et al. [8] proposed a key exchange scheme based on the LWE problem. Many lattice-based cryptography systems provide security in a quantum computing environment based on LWE problems.
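The second component of a single LWE sample, b = ⟨a, s⟩ + e reduced modulo q, can be sketched in C as follows. This is a toy illustration only: the function name `lwe_b` is hypothetical, the error e is passed in by the caller rather than drawn from an error distribution, and no claim is made that these types or sizes match any real scheme.

```c
#include <stddef.h>
#include <stdint.h>

/* Compute b = <a, s> + e (mod q) for one LWE sample.
 * a and s hold n coefficients; e is a small signed error chosen by the caller. */
uint32_t lwe_b(const uint32_t *a, const uint32_t *s, size_t n,
               int32_t e, uint32_t q)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = (acc + (uint64_t)(a[i] % q) * (s[i] % q)) % q;

    /* Add the signed error and bring the result back into [0, q). */
    int64_t b = ((int64_t)acc + e) % (int64_t)q;
    if (b < 0)
        b += (int64_t)q;
    return (uint32_t)b;
}
```

For example, with a = (1, 2, 3), s = (4, 5, 6), and q = 17, the inner product is 32, which reduces to 15; an error of -1 then gives b = 14.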

2.2. NIST PQC Standardization. The United States National Institute of Standards and Technology (NIST) initiated postquantum cryptography standardization in 2016. The submission deadline was November 30, 2017. A total of 69 postquantum cryptographic algorithms were submitted to NIST PQC standardization Round 1. The submissions included 26 lattice-based cryptographic algorithms (5 signatures, 21 KEM (key encapsulation mechanism)/encryption), 19 code-based cryptographic algorithms (3 signatures, 16 KEM/encryption), 9 multivariate-based algorithms (7 signatures, 2 KEM/encryption), 3 hash-based signature schemes, and 8 others (2 signatures, 6 KEM/encryption). Four algorithms have been withdrawn. Lattice-based cryptography is the most frequently proposed type of postquantum cryptography among the NIST PQC standardization Round 1 submissions.

Figure 1: ARM NEON register bank (registers Q0-Q15, each overlapping a pair of the D registers D0-D31).

Most lattice-based cryptographic algorithms are based on the LWE problem, both for security in a quantum computing environment and for efficiency of implementation. The first NIST PQC standardization conference was scheduled to take place on April 11-13, 2018. After this first conference, it will take about five to six years until the final decision for NIST PQC standardization is made. During PQC standardization, efficient implementation of the submitted postquantum cryptographic algorithms is an important issue.

2.3. Lizard. Lizard [3] is a family of postquantum public key encryption (PKE) schemes and key encapsulation mechanisms (KEMs) that was submitted to NIST PQC standardization Round 1. The security of Lizard is based on the sparse, small-secret versions of Learning with Errors (LWE) and Learning with Rounding (LWR). A sparse signed binary secret LWE problem is at least as hard as the original LWE problem. The public key for Lizard was chosen to be a set of LWE samples with signed binary secret information. Lizard supports IND-CPA PKE, IND-CCA2 KEM, and IND-CCA2 PKE, and there are two types: Lizard, and RLizard, which is based on the Ring-LWE and Ring-LWR problems. The key generation step of Lizard first samples a secret vector $s \in \{-1, 0, 1\}^n$, a random matrix $A \in \mathbb{Z}_q^{m \times n}$, and an error vector $e \leftarrow DG_\sigma^m$ whose components are expected to be small. The secret key is $sk \leftarrow s$ and the public key is $pk \leftarrow (A \,\|\, b)$, where $b = As + e \in \mathbb{Z}_q^m$. Hence, the public key is an instance of LWE with the secret vector $s$. There are six parameter sets for LizardCCA: CCA_CATEGORY1_N536, CCA_CATEGORY1_N663, CCA_CATEGORY3_N816, CCA_CATEGORY3_N952, CCA_CATEGORY5_N1088, and CCA_CATEGORY5_N1300. The parameter sets of LizardKEM are similar to those of LizardCCA. RLizardCCA and RLizardKEM, however, have four parameter sets: RING_CATEGORY1, RING_CATEGORY3_N1024, RING_CATEGORY3_N2048, and RING_CATEGORY5. In this study, we applied the proposed method for efficient matrix multiplication and vector addition using ARM NEON SIMD to the LizardCCA key generation step and evaluated its performance in that application.
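The public-key computation b = As + e described above can be sketched in plain C. This is a simplified illustration rather than the Lizard reference code: it assumes 16-bit coefficients with natural wrap-around modulo 2^16 in place of a general modulus q, takes the error vector from the caller instead of sampling it, and uses the hypothetical name `lwe_public_vector`. Because every entry of s is -1, 0, or 1, each product reduces to an addition, a subtraction, or nothing.

```c
#include <stddef.h>
#include <stdint.h>

/* b = A*s + e (mod 2^16) for a row-major m x n matrix A and a secret s
 * whose entries are restricted to -1, 0, or 1. */
void lwe_public_vector(const uint16_t *A, const int8_t *s, const int16_t *e,
                       uint16_t *b, size_t m, size_t n)
{
    for (size_t i = 0; i < m; i++) {
        uint16_t acc = (uint16_t)e[i];          /* start from the error term */
        for (size_t j = 0; j < n; j++) {
            if (s[j] == 1)
                acc = (uint16_t)(acc + A[i * n + j]);
            else if (s[j] == -1)
                acc = (uint16_t)(acc - A[i * n + j]);
            /* s[j] == 0 contributes nothing */
        }
        b[i] = acc;
    }
}
```

Negative intermediate values wrap around, so a row that sums to -19 is stored as 65517 under the assumed 2^16 modulus.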

2.4. ARM NEON. ARM NEON is an advanced single instruction, multiple data (SIMD) engine for the ARM Cortex-A series and Cortex-R52 processors [9]. It was introduced in the ARMv7-A and ARMv7-R profiles and is now also an extension to the ARMv8-A and ARMv8-R profiles. ARM NEON supports 128-bit Q registers (Q0-Q15). A Q register can hold four 32-bit, eight 16-bit, or sixteen 8-bit data elements. Each Q register can be split into two 64-bit D registers, as in Figure 1.
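The lane layouts described above can be modelled with a small portable C union. This is only a layout illustration, not the NEON register file or its intrinsics; the type name `q_reg_model` and the helper functions are hypothetical.

```c
#include <stdint.h>

/* Portable model of one 128-bit Q register and its element views. */
typedef union {
    uint8_t  u8[16];   /* sixteen 8-bit lanes  */
    uint16_t u16[8];   /* eight 16-bit lanes   */
    uint32_t u32[4];   /* four 32-bit lanes    */
    uint64_t d[2];     /* two 64-bit D halves  */
} q_reg_model;

/* Number of lanes a 128-bit register holds for a given element width in bits. */
unsigned lanes_per_q(unsigned element_bits)
{
    return 128u / element_bits;
}

/* Which D half (0 = low, 1 = high) contains a given 16-bit lane (0..7). */
unsigned d_half_of_u16_lane(unsigned lane)
{
    return lane / 4u;
}
```

With 16-bit matrix elements, `lanes_per_q(16)` gives 8, which is why the row length matters modulo 8 when data is processed one Q register at a time.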






