SFO17-314 Optimizing Golang for High Performance with ARM64 Assembly
Wei Xiao
Staff Software Engineer
September 27, 2017
Linaro Connect SFO17
© 2017 Arm Limited
Agenda
• Introduction
• Differences from GNU Assembly
• Integrate assembly into Golang
• Optimize CRC32 for arm64
• Optimize SHA256 for arm64
• Optimize IndexByte for arm64
• Work Summary and Next steps
Introduction
• Assembly optimization benefits
• Take advantage of ARMv8 capabilities
– Hardware-specific instructions (such as SVC, AES, SHA, etc.)
– Vector (Single Instruction Multiple Data) Instructions
• Others
– No need for CGo dependency
– Avoid runtime context switching overhead
– Optimized code (vs Go compiler)
– Faster compilation
Assembly Optimization Current Status
• Go Standard packages with assembly optimization
crypto/aes crypto/elliptic crypto/internal/cipherhw crypto/md5
crypto/rc4 crypto/sha1 crypto/sha256 crypto/sha512
hash/crc32 math math/big reflect
runtime runtime/cgo runtime/internal/atomic runtime/internal/sys
strings sync/atomic syscall ……
red – arm64 optimization ongoing
black – no arm64 optimization
Assembly Terminology
• Mnemonic
• CALL, MOVW, MOVD, …
• Register
• R1, F0, V3, …
• Immediate
• $1, $0x100, …
• Memory
• (R1), 8(R3), …
Registers in AArch64
Instruction Differences from GNU Assembly
• Semi-abstract instruction set (Plan 9 from Bell Labs)
• Architecture independent mnemonics like MOVD
• Some architecture aspects shine through
• Assembler may insert prologues, remove ‘unreachable’ instructions
• Instructions may be expanded by the assembler
• Not all instructions available
• BYTE/WORD/LONG directives to lay down opcodes into instruction stream directly
    // func Add(a, b int) int
    TEXT ·Add(SB),$0-24
        MOVD a+0(FP), R0    // first argument (name matches the Go declaration)
        MOVD b+8(FP), R1    // second argument
        ADD  R1, R0, R0     // R0 = R0 + R1
        MOVD R0, ret+16(FP) // store the result
        RET
Operand Differences from GNU Assembly
• Data flow from left to right
• ADD R1, R2 → R2 += R1
• SUBW R12<<29, R7, R8 → R8 = R7 - (R12<<29)
• Memory operands: base + offset
• MOVH (R1), R2 → R2 = *R1
• MOVBU 8(R3), R4 → R4 = *(8 + R3)
• MOVD mypackage·myvar(SB), R8 → R8 = myvar (loads the value stored at the symbol)
• Addresses
• MOVD $8(R1), R3 → R3 = R1 + 8
• MOVD $·myvar(SB), R4 → R4 = &myvar
    package mypackage
    var myvar int64
The middle dot (·) in symbol names is Unicode U+00B7.
Go Assembly Extension for arm64
• Extended register, e.g.: ADD Rm.<ext>[<<amount], Rn, Rd
• Arrangement for SIMD instructions, e.g.: VADDP Vm.<T>, Vn.<T>, Vd.<T>
• Width specifier and element index for SIMD instructions, e.g.: VMOV Vn.<T>[index], Rd
• Register List, e.g.: VLD1 (Rn), [Vt1.<T>, Vt2.<T>, Vt3.<T>]
• Register offset variant, e.g.: VLD1.P (Rn)(Rm), [Vt1.<T>, Vt2.<T>]
• Go assembly for ARM64 reference manual: src/cmd/internal/obj/arm64/doc.go
• Full details
• https://go-review.googlesource.com/c/go/+/41654
Assembly Build Rule
• Toolchain will select appropriate assembly files according to GOOS+GOARCH
• Selection is based on file name suffixes (_GOOS_GOARCH.s), e.g.
• sys_linux_arm64.s
• sys_darwin_arm64.s
• Example: assembly files for: hash/crc32
• crc32_amd64p32.s
• crc32_amd64.s
• crc32_arm64.s
• crc32_ppc64le.s crc32_table_ppc64le.s
• crc32_s390x.s
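The suffix rule above can be sketched in Go. This is only an approximation under stated assumptions: `matchesTarget` is a hypothetical helper, not a toolchain API, and the `knownOS` and `knownArch` sets are small illustrative subsets of what the real toolchain recognizes.

```go
package main

import (
	"fmt"
	"strings"
)

// Illustrative subsets only; the real toolchain knows many more values.
var knownOS = map[string]bool{"linux": true, "darwin": true, "windows": true}
var knownArch = map[string]bool{"amd64": true, "amd64p32": true, "arm64": true, "ppc64le": true, "s390x": true}

// matchesTarget is a hypothetical helper that approximates the
// *_GOOS.s, *_GOARCH.s and *_GOOS_GOARCH.s file name convention.
func matchesTarget(name, goos, goarch string) bool {
	parts := strings.Split(strings.TrimSuffix(name, ".s"), "_")
	n := len(parts)
	// name_GOOS_GOARCH: both components must match the target.
	if n >= 3 && knownOS[parts[n-2]] && knownArch[parts[n-1]] {
		return parts[n-2] == goos && parts[n-1] == goarch
	}
	// name_GOOS: only the operating system is constrained.
	if n >= 2 && knownOS[parts[n-1]] {
		return parts[n-1] == goos
	}
	// name_GOARCH: only the architecture is constrained.
	if n >= 2 && knownArch[parts[n-1]] {
		return parts[n-1] == goarch
	}
	// No recognized suffix: the file is not constrained by its name.
	return true
}

func main() {
	files := []string{"sys_linux_arm64.s", "sys_darwin_arm64.s", "crc32_amd64.s", "crc32_arm64.s", "crc32_table_ppc64le.s"}
	for _, f := range files {
		fmt.Printf("%-24s selected for linux/arm64: %v\n", f, matchesTarget(f, "linux", "arm64"))
	}
}
```

For a linux/arm64 build, only sys_linux_arm64.s and crc32_arm64.s from this list would be selected.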
Prototype
• Function call is the bridge between Go and assembly
• Function declaration
• src/runtime/timestub.go
• func walltime() (sec int64, nsec int32)
• Function assembly implementation
• runtime/sys_linux_arm64.s
TEXT directive layout:
    TEXT [package]·name(SB), [flag,] $framesize[-argsize]
• package (optional) and function name, joined by a middle dot (·)
• flag (optional)
• stack frame size
• arguments size (optional)
Pseudo-registers
• FP: Frame Pointer
• Points to the bottom of the argument list
• Offsets are positive
• Offsets must include a name, e.g. arg+0(FP)
• SP: Stack Pointer
• Points to the top of the space allocated for local variables
• Offsets are negative
• Offsets must include a name, e.g. ptr-8(SP)
• SB: Static Base
• Named offsets from a global base
[Figure: stack diagrams for the FP and SP pseudo-registers, each drawn from low address to high address]
Calling Convention
• All arguments are passed on the stack
• Offsets from FP
• Return arguments follow input arguments
• Start of return arguments aligned to pointer size
• All registers are caller saved, except:
• Stack pointer register (RSP)
• G context pointer register (R28)
• Frame pointer (R29)
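To make the offset rules concrete, here is a minimal Go sketch that assigns FP offsets under the stack-based convention described above. The `field` type and `layout` function are hypothetical helpers written for illustration, and the sizes and alignments are assumptions for a 64-bit target.

```go
package main

import "fmt"

// field describes one argument or result word (hypothetical type for this sketch).
type field struct {
	name        string
	size, align int // in bytes
}

const ptrSize = 8 // pointer size on arm64

// alignUp rounds off up to the next multiple of a.
func alignUp(off, a int) int { return (off + a - 1) / a * a }

// layout assigns an FP offset to every argument and result and returns the
// total argument size (the number after the dash in $framesize-argsize).
func layout(args, results []field) (map[string]int, int) {
	offsets := make(map[string]int)
	off := 0
	for _, f := range args {
		off = alignUp(off, f.align)
		offsets[f.name] = off
		off += f.size
	}
	if len(results) > 0 {
		// Return arguments start at a pointer-size aligned offset.
		off = alignUp(off, ptrSize)
	}
	for _, f := range results {
		off = alignUp(off, f.align)
		offsets[f.name] = off
		off += f.size
	}
	return offsets, off
}

func main() {
	// func castagnoliUpdate(crc uint32, p []byte) uint32
	offs, size := layout(
		[]field{{"crc", 4, 4}, {"p_base", 8, 8}, {"p_len", 8, 8}, {"p_cap", 8, 8}},
		[]field{{"ret", 4, 4}},
	)
	fmt.Println(offs["crc"], offs["p_base"], offs["p_len"], offs["p_cap"], offs["ret"], size)
}
```

For `func Add(a, b int) int` the same helper gives a+0, b+8, ret+16 and an argument size of 24, which agrees with the `$0-24` in the Add example.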
Optimize CRC32 for arm64 – Before
• Pure Go table-driven implementation
src/hash/crc32/crc32_generic.go
    func simpleUpdate(crc uint32, tab *Table, p []byte) uint32 {
        crc = ^crc
        for _, v := range p {
            crc = tab[byte(crc)^v] ^ (crc >> 8)
        }
        return ^crc
    }
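The listing relies on a precomputed 256-entry table. As a rough sketch of where that table comes from, the program below builds it bit by bit for the Castagnoli polynomial and runs the same update loop over it. The names `buildTable` and `castagnoliPoly` are chosen for this example; the standard library builds its tables with `crc32.MakeTable`.

```go
package main

import "fmt"

// castagnoliPoly is the bit-reversed Castagnoli polynomial
// (the same value as the hash/crc32.Castagnoli constant).
const castagnoliPoly = 0x82f63b78

// buildTable precomputes, for each possible byte value, the effect of
// shifting that byte through the CRC register one bit at a time.
func buildTable(poly uint32) *[256]uint32 {
	var tab [256]uint32
	for i := range tab {
		crc := uint32(i)
		for bit := 0; bit < 8; bit++ {
			if crc&1 == 1 {
				crc = (crc >> 1) ^ poly
			} else {
				crc >>= 1
			}
		}
		tab[i] = crc
	}
	return &tab
}

// simpleUpdate follows the table-driven loop from the listing above.
func simpleUpdate(crc uint32, tab *[256]uint32, p []byte) uint32 {
	crc = ^crc
	for _, v := range p {
		crc = tab[byte(crc)^v] ^ (crc >> 8)
	}
	return ^crc
}

func main() {
	tab := buildTable(castagnoliPoly)
	// "123456789" is the conventional CRC check input.
	fmt.Printf("%#x\n", simpleUpdate(0, tab, []byte("123456789")))
}
```

Each input byte costs one table load plus a dependent XOR and shift, which is the per-byte overhead a dedicated CRC instruction removes.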
Optimize CRC32 for arm64 – After
• Assembly for arm64
src/hash/crc32/crc32_arm64.s
    // func castagnoliUpdate(crc uint32, p []byte) uint32
    TEXT ·castagnoliUpdate(SB),NOSPLIT,$0-36
        MOVWU crc+0(FP), R9     // CRC value
        MOVD  p+8(FP), R13      // data pointer
        MOVD  p_len+16(FP), R11 // len(p)

        CMP $8, R11
        BLT less_than_8

    update:
        MOVD.P  8(R13), R10
        CRC32CX R10, R9
        SUB     $8, R11

        CMP $8, R11
        BLT less_than_8

        JMP update
    …
    done:
        MOVWU R9, ret+32(FP)
        RET
Argument frame layout:
    0(FP)   crc
    8(FP)   p.base
    16(FP)  p.len
    24(FP)  p.cap
    32(FP)  ret
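The control flow of the assembly can be mirrored in plain Go to show why consuming 8 bytes per iteration gives the same checksum. This is a sketch under assumptions: `crc32cb` and `crc32cx` are software stand-ins for the hardware instructions, the byte-wise tail replaces the elided part of the listing, and the bit inversion is assumed to be done by the caller.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// crc32cb is a software stand-in for a one-byte CRC32C step.
func crc32cb(crc uint32, b byte) uint32 {
	return castagnoli[byte(crc)^b] ^ (crc >> 8)
}

// crc32cx is a software stand-in for CRC32CX: it folds the 8 bytes of v
// into crc, least significant byte first.
func crc32cx(crc uint32, v uint64) uint32 {
	for i := 0; i < 8; i++ {
		crc = crc32cb(crc, byte(v>>(8*uint(i))))
	}
	return crc
}

// update mirrors the loop structure of the listing; it works on the raw
// (already inverted) CRC value.
func update(crc uint32, p []byte) uint32 {
	for len(p) >= 8 { // "update:" loop
		crc = crc32cx(crc, binary.LittleEndian.Uint64(p)) // MOVD.P + CRC32CX
		p = p[8:]                                         // SUB $8, R11
	}
	for _, b := range p { // "less_than_8": simplified byte-wise tail
		crc = crc32cb(crc, b)
	}
	return crc
}

func main() {
	data := []byte("The quick brown fox jumps over the lazy dog")
	got := ^update(^uint32(0), data)
	fmt.Println(got == crc32.Checksum(data, castagnoli))
}
```

The little-endian load matters: the first byte in memory ends up in the low bits of the register, which is the byte the CRC step consumes first.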
Optimize CRC32 for arm64 – Result
• Optimization with assembly
• 2X-7X speedup
Optimize SHA256 for arm64
• SHA256 introduction
Algorithm  Block size  Rounds  Word size  K size   Hash size
SHA-256    512 bits    64      32 bits    32 bits  256 bits
Optimize SHA256 for arm64 – Message schedule
src/crypto/sha256/sha256block.go
    for i := 0; i < 16; i++ {
        j := i * 4
        w[i] = uint32(p[j])<<24 | uint32(p[j+1])<<16 | uint32(p[j+2])<<8 | uint32(p[j+3])
    }
    for i := 16; i < 64; i++ {
        v1 := w[i-2]
        t1 := (v1>>17 | v1<<(32-17)) ^ (v1>>19 | v1<<(32-19)) ^ (v1 >> 10)
        v2 := w[i-15]
        t2 := (v2>>7 | v2<<(32-7)) ^ (v2>>18 | v2<<(32-18)) ^ (v2 >> 3)
        w[i] = t1 + w[i-7] + t2 + w[i-16]
    }
With the ARMv8 SHA-256 instructions (pseudocode):
    for i := 16; i < 64; i+=4 {
        SHA256SU0 Vn.S4, Vd.S4
        SHA256SU1 Vm.S4, Vn.S4, Vd.S4
    }
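A runnable Go version of the message schedule helps to check the expansion by hand. The sketch below groups the work four words at a time, the granularity at which one SHA256SU0/SHA256SU1 pair produces new schedule words. The function names and the "abc" test block are choices made for this example, not code from the Go tree.

```go
package main

import (
	"fmt"
	"math/bits"
)

// schedule expands one 64-byte block into the 64-word message schedule.
func schedule(p *[64]byte) [64]uint32 {
	var w [64]uint32
	for i := 0; i < 16; i++ {
		j := i * 4
		w[i] = uint32(p[j])<<24 | uint32(p[j+1])<<16 | uint32(p[j+2])<<8 | uint32(p[j+3])
	}
	// Outer loop: four new words per iteration, matching the step of the
	// instruction-based loop. Inner loop: the scalar equivalent.
	for i := 16; i < 64; i += 4 {
		for k := i; k < i+4; k++ {
			s1 := bits.RotateLeft32(w[k-2], -17) ^ bits.RotateLeft32(w[k-2], -19) ^ (w[k-2] >> 10)
			s0 := bits.RotateLeft32(w[k-15], -7) ^ bits.RotateLeft32(w[k-15], -18) ^ (w[k-15] >> 3)
			w[k] = s1 + w[k-7] + s0 + w[k-16]
		}
	}
	return w
}

// paddedABC returns the single padded SHA-256 block for the message "abc".
func paddedABC() *[64]byte {
	var p [64]byte
	copy(p[:], "abc")
	p[3] = 0x80 // padding marker right after the message
	p[63] = 24  // message length in bits, big-endian in the last 8 bytes
	return &p
}

func main() {
	w := schedule(paddedABC())
	fmt.Printf("%08x %08x\n", w[16], w[17])
}
```

A negative rotation count in `bits.RotateLeft32` rotates right, which replaces the explicit shift-and-or pairs of the original loop.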
© 2017 Arm Limited 19
Optimize SHA256 for arm64 – Hash Computation
src/crypto/sha256/sha256block.go
    for i := 0; i < 64; i++ {
        t1 := h + ((e>>6 | e<<(32-6)) ^ (e>>11 | e<<(32-11)) ^ (e>>25 | e<<(32-25))) + ((e & f) ^ (^e & g)) + _K[i] + w[i]

        t2 := ((a>>2 | a<<(32-2)) ^ (a>>13 | a<<(32-13)) ^ (a>>22 | a<<(32-22))) + ((a & b) ^ (a & c) ^ (b & c))

        h = g
        g = f
        f = e
        e = d + t1
        d = c
        c = b
        b = a
        a = t1 + t2
    }
With the ARMv8 SHA-256 instructions (pseudocode):
    for i := 0; i < 64; i+=4 {
        SHA256H  Vm, Vn, Vd.4S
        SHA256H2 Vm, Vn, Vd.4S
    }
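For reference, the round loop can be exercised end to end on a single block. The following is a minimal pure-Go sketch, not the package's implementation: it hashes only the one padded block for "abc", the names `compress`, `schedule` and `digestABC` are chosen for this example, and the rounds are grouped in fours to mirror the step of one SHA256H/SHA256H2 pair.

```go
package main

import (
	"fmt"
	"math/bits"
)

// k holds the 64 SHA-256 round constants (FIPS 180-4).
var k = [64]uint32{
	0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
	0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
	0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
	0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
	0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
	0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
	0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
	0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
}

// iv is the SHA-256 initial hash value.
var iv = [8]uint32{0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19}

// rotr rotates x right by n bits.
func rotr(x uint32, n int) uint32 { return bits.RotateLeft32(x, -n) }

// schedule expands one 64-byte block into the 64-word message schedule.
func schedule(p *[64]byte) *[64]uint32 {
	var w [64]uint32
	for i := 0; i < 16; i++ {
		w[i] = uint32(p[4*i])<<24 | uint32(p[4*i+1])<<16 | uint32(p[4*i+2])<<8 | uint32(p[4*i+3])
	}
	for i := 16; i < 64; i++ {
		s1 := rotr(w[i-2], 17) ^ rotr(w[i-2], 19) ^ (w[i-2] >> 10)
		s0 := rotr(w[i-15], 7) ^ rotr(w[i-15], 18) ^ (w[i-15] >> 3)
		w[i] = s1 + w[i-7] + s0 + w[i-16]
	}
	return &w
}

// compress runs the 64 rounds over schedule w, starting from state st.
func compress(st [8]uint32, w *[64]uint32) [8]uint32 {
	a, b, c, d, e, f, g, h := st[0], st[1], st[2], st[3], st[4], st[5], st[6], st[7]
	for i := 0; i < 64; i += 4 { // four rounds per outer iteration
		for r := i; r < i+4; r++ {
			t1 := h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)) + ((e & f) ^ (^e & g)) + k[r] + w[r]
			t2 := (rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)) + ((a & b) ^ (a & c) ^ (b & c))
			h, g, f, e, d, c, b, a = g, f, e, d+t1, c, b, a, t1+t2
		}
	}
	return [8]uint32{st[0] + a, st[1] + b, st[2] + c, st[3] + d, st[4] + e, st[5] + f, st[6] + g, st[7] + h}
}

// digestABC hashes the single padded block for the message "abc".
func digestABC() [8]uint32 {
	var p [64]byte
	copy(p[:], "abc")
	p[3] = 0x80 // padding marker
	p[63] = 24  // message length in bits
	return compress(iv, schedule(&p))
}

func main() {
	fmt.Printf("%08x\n", digestABC())
}
```

The result can be compared with `sha256.Sum256([]byte("abc"))` from crypto/sha256.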
© 2017 Arm Limited 20
Optimize SHA256 for arm64 – Implementation
src/crypto/sha256/sha256block_arm64.s
Optimize SHA256 for arm64 – Result
• Optimization with assembly
• 2X-16X speedup
Optimize IndexByte for arm64 – Before
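[Figure: scalar byte-by-byte search over the input "H E L L O W O R L D …"; registers R0 and R1 point into the input and R2 holds the byte being searched for ('D')]
src/runtime/asm_arm64.s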
Optimize IndexByte for arm64 – After
• Assembly implementation with SIMD
• SIMD instruction: CMEQ Vm.B16, Vn.B16, Vd.B16
• Compare 16 bytes in parallel
• More details:
• Input slice shorter than 16
• Input slice address not 16-byte aligned
• Input slice size not 16-byte aligned
• Count trailing zeros (not leading zeros)
• Implementation:
• https://go-review.googlesource.com/c/go/+/41654
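The idea can be modelled in portable Go. The sketch below is not the SIMD implementation: `matchMask` emulates the effect of a 16-lane byte compare with a scalar loop, and `indexByte` is a hypothetical name. It shows the chunking, the handling of a partial final chunk, and why trailing zeros (not leading zeros) locate the first match. Address alignment, which the real code must handle, is ignored here.

```go
package main

import (
	"fmt"
	"math/bits"
)

// matchMask emulates a 16-lane byte compare: bit i of the result is set
// when chunk[i] == c. chunk may be shorter than 16 bytes.
func matchMask(chunk []byte, c byte) uint16 {
	var m uint16
	for i, b := range chunk {
		if b == c {
			m |= 1 << uint(i)
		}
	}
	return m
}

// indexByte returns the index of the first occurrence of c in s, or -1.
func indexByte(s []byte, c byte) int {
	for base := 0; base < len(s); base += 16 {
		end := base + 16
		if end > len(s) {
			end = len(s) // short input or partial final chunk
		}
		if m := matchMask(s[base:end], c); m != 0 {
			// The lowest set bit belongs to the lowest address,
			// so the first match is found by counting trailing zeros.
			return base + bits.TrailingZeros16(m)
		}
	}
	return -1
}

func main() {
	fmt.Println(indexByte([]byte("HELLO WORLD"), 'D'))
	fmt.Println(indexByte([]byte("HELLO WORLD"), 'Z'))
}
```

With several matches in one chunk, for example 'L' in "HELLO WORLD", the mask has several bits set and the trailing-zero count still selects the earliest one.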
Optimize IndexByte for arm64 – Result
• Optimization with SIMD
• 1.5X-8X speedup
Work Summary
Disassembler (arm64):
• https://go-review.googlesource.com/c/arch/+/43651
• https://go-review.googlesource.com/c/arch/+/56810
• https://go-review.googlesource.com/c/go/+/58930
• https://go-review.googlesource.com/c/go/+/56331
• https://go-review.googlesource.com/c/go/+/49530
Assembler (arm64):
• https://go-review.googlesource.com/c/go/+/33594
• https://go-review.googlesource.com/c/go/+/33595
• https://go-review.googlesource.com/c/go/+/41511
• https://go-review.googlesource.com/c/go/+/41654
• https://go-review.googlesource.com/c/go/+/45850
• https://go-review.googlesource.com/c/go/+/54951
• https://go-review.googlesource.com/c/go/+/54990
• https://go-review.googlesource.com/c/go/+/57852
• https://go-review.googlesource.com/c/go/+/58350
• https://go-review.googlesource.com/c/go/+/56030
• https://go-review.googlesource.com/c/go/+/46438
• https://go-review.googlesource.com/c/go/+/41653
Optimizations:
• https://go-review.googlesource.com/c/go/+/40074
• https://go-review.googlesource.com/c/go/+/61550
• https://go-review.googlesource.com/c/go/+/61570
• https://go-review.googlesource.com/c/go/+/33597
• https://go-review.googlesource.com/c/go/+/64490
• https://go-review.googlesource.com/c/go/+/55610
Others:
• https://go-review.googlesource.com/c/go/+/61511
• https://go-review.googlesource.com/c/go/+/62850
• https://go-review.googlesource.com/c/go/+/45112
• https://go-review.googlesource.com/c/go/+/44390
• https://go-review.googlesource.com/c/go/+/42971
• https://go-review.googlesource.com/c/go/+/40511
• https://go-review.googlesource.com/c/arch/+/37172
Next Steps
• Crypto optimizations:
• aes, elliptic, …
• SIMD optimizations:
• strings, bytes, runtime, reflect, …
• Compiler SSA arm64 back-end optimizations
• Others
• Internal arm64 linker
• Tools for arm64: race detector, memory sanitizer, …
• New architecture features
• ...
CGo
[Figure: CGo bridges the Go ABI and the C ABI]
    package print

    // #include <stdio.h>
    // #include <stdlib.h>
    import "C"
    import "unsafe"

    func Print(s string) {
        cs := C.CString(s)
        C.fputs(cs, (*C.FILE)(C.stdout))
        C.free(unsafe.Pointer(cs))
    }
Branch Differences from GNU Assembly
• On arm64: B is an alias for JMP, BL is an alias for CALL
Jump to labels
        JMP L1
        NOP
    L1:
        NOP
    L2:
        NOP
        NOP
        B L2
Call and Indirect Jump
        BL   p·foo(SB)
        MOVD $p·foo(SB), R3
        CALL (R3)

        B    (R3)
        MOVD 0(R26), R4
        JMP  (R4)
Jump relative to PC (useful in macros!)
        JMP 2(PC)
        NOP
        NOP

        NOP
        NOP
        JMP -2(PC)