For a recent project, I needed to calculate the Hamming distance between a very large set of byte sequences.

As usual, I started by making a proof-of-concept in Python; however, it became clear very quickly that I would not be able to get the speeds I needed for this project. As long as we have the famous GIL in Python, it will be very difficult to make full use of all the system resources we have available.

My second language of choice is, of course, Golang. While it was much easier to parallelize all my calculations, I was still convinced that I could speed up the process and so started this experiment.

Background – Hamming Distance

Let’s start off with some background.

The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ. In the realm of binary data, it measures the number of bit changes required to transform one binary sequence into another. This concept is crucial in error detection and correction, coding theory, and information theory.

When working with byte objects—which are essentially sequences of bits grouped into bytes—calculating the Hamming distance involves a bitwise comparison of each bit in the two sequences.

Example

Consider two bytes:

  • Byte A: 0b10101010
  • Byte B: 0b10011010

Performing a bitwise XOR:

   1 0 1 0 1 0 1 0  (Byte A)
⊕  1 0 0 1 1 0 1 0  (Byte B)
-------------------------
   0 0 1 1 0 0 0 0  (Result)

In the result 0b00110000, there are two bits set to 1. Therefore, the Hamming distance between Byte A and Byte B is 2.
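If you want to verify this with code, the check is a one-liner with Go's standard library (a quick sanity check of the example above, separate from the benchmark code that follows):

package main

import (
	"fmt"
	"math/bits"
)

func main() {
	a := byte(0b10101010) // Byte A
	b := byte(0b10011010) // Byte B

	// XOR keeps only the differing bits; OnesCount8 counts them.
	fmt.Println(bits.OnesCount8(a ^ b)) // prints 2
}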

Background — SIMD Instructions

As I continued to optimize the Hamming distance calculation, I began exploring SIMD (Single Instruction, Multiple Data) instructions to push the performance even further. SIMD allows a single CPU instruction to perform the same operation on multiple data points simultaneously. This parallel processing is ideal for data-intensive tasks like ours.

My Dive into SIMD

Initially, the optimized Go implementation provided a decent speed boost, but I couldn’t shake the feeling that there was more performance to unlock. That’s when I decided to dive into SIMD instructions. I was curious to see how much faster the calculation could run by leveraging the capabilities of modern CPUs.

Understanding SIMD in Simple Terms

Think of SIMD as a conveyor belt that processes multiple items at once instead of one at a time. If you have a repetitive operation to perform on a large dataset, SIMD can handle multiple data points in parallel, significantly reducing computation time.

Different Instruction Sets for Different Architectures

One of the interesting challenges I encountered was that SIMD instruction sets vary depending on the CPU architecture:

  • x86 Architecture (Intel and AMD processors):

    • Uses instruction sets like SSE (Streaming SIMD Extensions) and AVX (Advanced Vector Extensions).
    • Common in traditional desktops and servers.
  • ARM Architecture:

    • Utilizes NEON instructions.
    • Found in mobile devices and, notably, in Apple’s M1 and M2 chips.

Since I’m working on a Mac equipped with an Apple Silicon chip (based on ARM architecture), I focused on NEON instructions. This meant tailoring my code to use NEON for SIMD operations, allowing me to tap into the full potential of my hardware.
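To keep such architecture-specific code manageable, one option is to hide it behind Go build constraints, so the NEON path only compiles on ARM64 and every other platform falls back to plain Go. Here's a rough sketch (the file and package names are placeholders, and the two functions refer to the implementations shown later in this post):

// hamming_arm64.go
//go:build arm64

package hamming

// On ARM64, dispatch to the NEON-backed implementation.
func HammingDistance(a, b []byte) int {
	return HammingDistanceNeon(a, b)
}

// hamming_other.go
//go:build !arm64

package hamming

// On every other architecture, fall back to the portable Go version.
func HammingDistance(a, b []byte) int {
	return HammingDistancePureGo(a, b)
}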

Implementation — Pure Go

Embarking on the journey to optimize the Hamming distance calculation, my first instinct was to implement the solution in pure Go. Go is renowned for its simplicity and efficiency, and I was eager to see how far I could push its performance without relying on external libraries or complex optimizations.

Starting with the Basics

I began by writing a straightforward function that iterates over each byte of the input sequences, computes the XOR of the corresponding bytes, and counts the number of set bits in the result. The math/bits package in Go provides the OnesCount8 function, which efficiently counts the number of one bits in an 8-bit unsigned integer. Here’s the initial implementation:

import "math/bits"

func HammingDistancePureGo(a, b []byte) int {
	hammingDistance := 0
	for i := 0; i < len(a); i++ {
		hammingDistance += bits.OnesCount8(a[i] ^ b[i])
	}
	return hammingDistance
}

This code is clean and easy to understand. It leverages Go’s standard library and doesn’t introduce any unnecessary complexity. I was optimistic that this implementation would offer reasonable performance for my needs.
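One thing to keep in mind: the function assumes a and b have equal length, as the definition of Hamming distance requires. A slightly more defensive variant (just a sketch; I kept the simpler version for benchmarking) would make that explicit:

import (
	"fmt"
	"math/bits"
)

// HammingDistanceChecked enforces the equal-length precondition
// instead of silently assuming it.
func HammingDistanceChecked(a, b []byte) (int, error) {
	if len(a) != len(b) {
		return 0, fmt.Errorf("length mismatch: %d vs %d", len(a), len(b))
	}
	hammingDistance := 0
	for i := 0; i < len(a); i++ {
		hammingDistance += bits.OnesCount8(a[i] ^ b[i])
	}
	return hammingDistance, nil
}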

The First Benchmark

To assess the performance, I set up a simple benchmark function. I generated two byte slices, a and b, each containing 1,024 bytes. For simplicity, I filled them with sequential data:

import (
	"fmt"
	"time"
)

func benchmarkHammingDistancePureGo() {
	// Generate input of length 1024 bytes
	a := make([]byte, 1024)
	b := make([]byte, 1024)

	// Fill a and b with sample data
	for i := 0; i < 1024; i++ {
		a[i] = byte(i)
		b[i] = byte(i + 1)
	}

	fmt.Println("Running Pure Go version...")
	start := time.Now()
	pureGoResult := 0
	for i := 0; i < 10000000; i++ {
		pureGoResult += HammingDistancePureGo(a, b)
	}
	durationPureGo := time.Since(start)

	fmt.Printf("Pure Go version took: %v, Result: %d\n", durationPureGo, pureGoResult)
}

When I ran the benchmark, the results were as follows:

Pure Go version took: 8.561100375s, Result: 20400000000

Over eight and a half seconds to complete 10 million iterations was acceptable, but I couldn’t help but wonder if I could do better.
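As an aside, the same measurement can also be done with Go's built-in benchmarking support instead of a hand-rolled timer. A minimal sketch (assuming the function lives in a package with a _test.go file; the sink variable keeps the compiler from optimizing the call away):

// hamming_test.go
package hamming

import "testing"

var sink int

func BenchmarkHammingDistancePureGo(b *testing.B) {
	x := make([]byte, 1024)
	y := make([]byte, 1024)
	for i := range x {
		x[i] = byte(i)
		y[i] = byte(i + 1)
	}

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sink += HammingDistancePureGo(x, y)
	}
}

Running it with go test -bench=. reports per-operation timings without having to pick an iteration count by hand. I stuck with the hand-rolled loop in this post so that every version is measured the same way.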

Seeking Optimization Opportunities

I revisited my code, looking for any bottlenecks or areas ripe for optimization. One idea that came to mind was loop unrolling. By processing multiple bytes per iteration, I could reduce the overhead of loop control and function calls.

I decided to unroll the loop and process four bytes in each iteration:

func HammingDistanceGo(a, b []byte) int {
	hammingDistance := 0
	length := len(a)

	// Process 4 bytes at a time
	for i := 0; i <= length-4; i += 4 {
		hammingDistance += bits.OnesCount8(a[i] ^ b[i])
		hammingDistance += bits.OnesCount8(a[i+1] ^ b[i+1])
		hammingDistance += bits.OnesCount8(a[i+2] ^ b[i+2])
		hammingDistance += bits.OnesCount8(a[i+3] ^ b[i+3])
	}

	// Handle any remaining bytes
	for i := length &^ 3; i < length; i++ {
		hammingDistance += bits.OnesCount8(a[i] ^ b[i])
	}

	return hammingDistance
}

This version does four times as much work per iteration, spreading the loop-control overhead across four byte comparisons.
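Another way to cut per-byte overhead in pure Go, shown here only as a sketch (I didn't benchmark it above), is to XOR eight bytes at a time as a uint64 and pay for a single 64-bit popcount per word:

import (
	"encoding/binary"
	"math/bits"
)

// HammingDistanceGo64 XORs 8 bytes at a time and counts the
// differing bits with one 64-bit popcount per word.
func HammingDistanceGo64(a, b []byte) int {
	hammingDistance := 0
	length := len(a)

	// Process 8 bytes at a time
	for i := 0; i+8 <= length; i += 8 {
		x := binary.LittleEndian.Uint64(a[i:]) ^ binary.LittleEndian.Uint64(b[i:])
		hammingDistance += bits.OnesCount64(x)
	}

	// Handle any remaining bytes
	for i := length &^ 7; i < length; i++ {
		hammingDistance += bits.OnesCount8(a[i] ^ b[i])
	}

	return hammingDistance
}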

Testing the Optimized Version

I updated the benchmark to include this new function:

func benchmarkHammingDistanceGo() {
	// Generate input of length 1024 bytes
	a := make([]byte, 1024)
	b := make([]byte, 1024)

	// Fill a and b with data
	for i := 0; i < 1024; i++ {
		a[i] = byte(i)
		b[i] = byte(i + 1)
	}

	fmt.Println("Running Optimized Go version...")
	start := time.Now()
	optimizedGoResult := 0
	for i := 0; i < 10000000; i++ {
		optimizedGoResult += HammingDistanceGo(a, b)
	}
	durationOptimizedGo := time.Since(start)

	fmt.Printf("Optimized Go version took: %v, Result: %d\n", durationOptimizedGo, optimizedGoResult)
}

The results were encouraging:

Optimized Go version took: 5.669980625s, Result: 20400000000

Now we were talking! This new implementation was approximately 34% faster than the original pure Go version.

Performance Comparison

Comparing the two versions:

  • Pure Go Version: 8.561100375 seconds
  • Optimized Go Version: 5.669980625 seconds

By eliminating unnecessary loops and processing multiple bytes per iteration, I achieved a significant performance improvement.

Reflecting on the Progress

This experience taught me that sometimes, even small changes can lead to significant performance improvements. By understanding how the Go compiler and runtime handle loops and function calls, I was able to fine-tune the code to execute more efficiently.

However, I also realized that there might be limits to how much I could optimize using pure Go. If I wanted to achieve even greater performance gains, I might need to look beyond Go’s standard capabilities.

With the optimized Go version in hand, I was pleased with the progress but still curious about how much faster I could make the calculation.

Implementation — CGO

To push the envelope even further, we resorted to using CGO to leverage SIMD (Single Instruction, Multiple Data) instructions available in modern CPUs. Specifically, we’ll use NEON instructions (since I’m running on an M2 Mac), which are the SIMD extensions for the ARM architecture.

C Implementation Using NEON Intrinsics

First, let’s look at the C code that uses NEON intrinsics to compute the Hamming distance efficiently:

// hamming_distance.h
#include <arm_neon.h>
#include <stdint.h>
#include <stddef.h>

int hamming_distance_neon(const uint8_t* a, const uint8_t* b, size_t length) {
    int hamming_distance = 0;
    size_t i;

    // Process in chunks of 16 bytes using NEON
    for (i = 0; i + 15 < length; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);

        uint8x16_t vxor = veorq_u8(va, vb);

        // Count set bits in each byte
        uint8x16_t vcnt = vcntq_u8(vxor);

        // Sum the counts across the vector
        uint16x8_t vsum1 = vpaddlq_u8(vcnt);
        uint32x4_t vsum2 = vpaddlq_u16(vsum1);
        uint64x2_t vsum3 = vpaddlq_u32(vsum2);

        // Extract the sum from the vector
        hamming_distance += vgetq_lane_u64(vsum3, 0);
        hamming_distance += vgetq_lane_u64(vsum3, 1);
    }

    // Handle remaining bytes
    for (; i < length; i++) {
        hamming_distance += __builtin_popcount(a[i] ^ b[i]);
    }

    return hamming_distance;
}

Explanation:

  • Loading Data:

    • vld1q_u8 loads 16 bytes from the arrays a and b into NEON registers va and vb.
  • Bitwise XOR:

    • veorq_u8 computes the bitwise XOR of va and vb, storing the result in vxor.
  • Counting Set Bits:

    • vcntq_u8 counts the number of set bits in each byte of vxor.
  • Summing Counts:

    • vpaddlq_u8, vpaddlq_u16, and vpaddlq_u32 perform pairwise additions, effectively summing the counts.
  • Extracting Results:

    • vgetq_lane_u64 extracts the sums from the NEON registers and adds them to hamming_distance.
  • Handling Remaining Bytes:

    • For any remaining bytes that don’t fit into a 16-byte chunk, we use the built-in __builtin_popcount function.

Go Wrapper Using CGO

Next, we’ll create a Go wrapper to call our NEON-optimized C function:

// #cgo CFLAGS: -march=armv8-a+simd
// #include "hamming_distance.h"
import "C"

// HammingDistanceNeon calls the NEON-optimized C function using CGO.
func HammingDistanceNeon(a, b []byte) int {
    return int(C.hamming_distance_neon(
        (*C.uint8_t)(&a[0]),
        (*C.uint8_t)(&b[0]),
        C.size_t(len(a)),
    ))
}

Note:

  • The // #cgo CFLAGS: -march=armv8-a+simd directive ensures that the C compiler uses the correct architecture flags to enable NEON instructions.
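Because the wrapper hands &a[0] straight to C, it will panic on an empty slice and silently trusts that both slices have the same length. A slightly more defensive sketch of the wrapper (it would live in the same file as the import "C" block above, with "fmt" imported as well):

// HammingDistanceNeonChecked guards against empty and mismatched
// inputs before handing pointers to the C function.
func HammingDistanceNeonChecked(a, b []byte) (int, error) {
    if len(a) != len(b) {
        return 0, fmt.Errorf("length mismatch: %d vs %d", len(a), len(b))
    }
    if len(a) == 0 {
        return 0, nil
    }
    return int(C.hamming_distance_neon(
        (*C.uint8_t)(&a[0]),
        (*C.uint8_t)(&b[0]),
        C.size_t(len(a)),
    )), nil
}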

Updating the Benchmark

Let’s update our benchmark function to include the NEON-optimized version:

func benchmarkHammingDistanceNeon() {
    // Generate input of length 1024 bytes
    a := make([]byte, 1024)
    b := make([]byte, 1024)

    // Fill a and b with data
    for i := 0; i < 1024; i++ {
        a[i] = byte(i)
        b[i] = byte(i + 1)
    }

    fmt.Println("Running NEON version...")
    start := time.Now()
    neonResult := 0
    for i := 0; i < 10000000; i++ {
        neonResult += HammingDistanceNeon(a, b)
    }
    durationNeon := time.Since(start)

    fmt.Printf("NEON version took: %v, Result: %d\n", durationNeon, neonResult)
}

Results

When we run the NEON-optimized code, we see a significant performance improvement:

NEON version took: 1.139869208s, Result: 20400000000

Performance Analysis

By leveraging NEON SIMD instructions, we reduced the execution time from approximately 5.7 seconds (optimized Go version) to around 1.1 seconds. That’s an impressive 80% reduction in execution time compared to the optimized Go implementation.

Why Such Improvement?

  • Data Parallelism: NEON instructions process multiple data points in a single instruction, allowing us to compute the Hamming distance of 16 bytes simultaneously.

  • Reduced Instruction Overhead: SIMD minimizes the number of instructions executed, reducing overhead and improving CPU cache utilization.

Considerations

  • Platform Dependency: NEON is specific to ARM architectures. If you’re running on x86 platforms, you would use SSE or AVX instructions instead.

  • CGO Overhead: While CGO introduces some overhead each time execution crosses the Go/C boundary, the performance gains from SIMD outweigh this overhead in computationally intensive tasks.

  • Complexity: Using SIMD and CGO increases code complexity. Ensure that the performance benefits justify the added complexity and potential maintenance challenges.

Implementation — Go Assembly

Determined to squeeze out every last drop of performance, I ventured into the realm of Go assembly. Writing assembly code in Go isn’t for the faint of heart, but I was eager to see how much faster I could make the Hamming distance calculation by coding at this low level.

Diving into Go Assembly with NEON Instructions

Before diving into the code, it’s important to note that Go’s assembly language is not the same as traditional ARM assembly. Go uses its own assembly syntax based on the Plan 9 assembly language, which can differ significantly from conventional assembly languages. While I’m familiar with x86 assembly, I’m not an expert in ARM assembly—let alone Go’s ARM assembly syntax. This made the process both challenging and educational as I navigated through unfamiliar territory.

Given that my Mac uses an ARM-based processor with NEON support, I decided to write the assembly code using NEON instructions directly within Go’s assembly syntax. This approach allows me to harness SIMD capabilities without the overhead of CGO.

Here’s the Go assembly code I crafted:

// hamming_distance_neon_arm64.s
// func HammingDistanceNEONASM(a, b []byte, length int, result *int)
TEXT ·HammingDistanceNEONASM(SB), 4, $0-64
    MOVD   length+48(FP), R2     // R2 = length (a and b each occupy 24 bytes of the frame)
    MOVD   result+56(FP), R11    // R11 = result

    MOVD   a+0(FP), R0           // R0 = a
    MOVD   b+24(FP), R1          // R1 = b

    // Initialize hamming_distance to 0 in R3
    MOVD   $0, R3                // R3 = hamming_distance

    // Initialize index i to 0 in R4
    MOVD   $0, R4                // R4 = i

loop_start:
    // Check if (i + 15) >= length
    ADD     $15, R4, R5          // R5 = R4 + 15
    CMP     R2, R5               // Compare R5 with R2 (length)
    BGE     loop_end             // If R5 >= length, exit loop

    // Load 16 bytes from a + i into V0
    ADD     R0, R4, R6           // R6 = a + i
    VLD1    (R6), [V0.B16]

    // Load 16 bytes from b + i into V1
    ADD     R1, R4, R7           // R7 = b + i
    VLD1    (R7), [V1.B16]

    // XOR V0 and V1 into V2
    VEOR    V0.B16, V1.B16, V2.B16

    // Count set bits in V2
    VCNT    V2.B16, V2.B16

    // Sum the counts across the vector
    VUADDLV V2.B16, V2

    // Move the sum from V2.B[0] to R8
    VMOV    V2.B[0], R8

    // Add R8 to hamming_distance (R3)
    ADD     R8, R3, R3

    // Increment i by 16
    ADD     $16, R4, R4

    // Loop back
    B       loop_start

loop_end:
    // Handle remaining bytes
    CMP     R2, R4               // Compare R4 (i) with R2 (length)
    BGE     end                  // If i >= length, no remainder to process

scalar_loop:
    // Load byte from a[i] into R5
    MOVBU   (R0)(R4), R5

    // Load byte from b[i] into R6
    MOVBU   (R1)(R4), R6

    // XOR R5 and R6 into R8
    EOR     R5, R6, R8

    // Broadcast the XOR result into every lane of V0
    VMOV    R8, V0.B16

    // Count set bits per lane, then read a single lane
    // (every lane holds the popcount of the same byte)
    VCNT    V0.B16, V0.B16
    VMOV    V0.B[0], R9

    ADD     R9, R3, R3

    // Increment i
    ADD     $1, R4, R4

    // Loop back if i < length
    CMP     R2, R4
    BLT     scalar_loop

end:
    // Store hamming_distance in result
    MOVD   R3, (R11)
    RET

Explanation:

  • Function Signature:
    • The function HammingDistanceNEONASM takes four parameters:
      • a and b: pointers to the byte slices.
      • length: the length of the slices.
      • result: a pointer to store the resulting Hamming distance.
  • Initialization:
    • Registers are set up to hold the parameters and initialize counters.
  • Main Loop (loop_start):
    • Processes 16 bytes at a time using NEON SIMD instructions.
    • Loading Data:
      • VLD1 loads 16 bytes from each input array into NEON registers V0 and V1.
    • Bitwise XOR:
      • VEOR computes the XOR of V0 and V1, storing the result in V2.
    • Counting Set Bits:
      • VCNT counts the number of set bits in each byte of V2.
    • Summing Counts:
      • VUADDLV horizontally adds the counts to produce a single sum.
    • Updating Hamming Distance:
      • The sum is added to the running total in R3.
    • Incrementing Index:
      • Increases the index i by 16 to process the next chunk.
  • Scalar Loop (scalar_loop):
    • Handles any remaining bytes that don’t fit into a 16-byte block.
    • Processes one byte at a time using scalar operations.
  • Completion:
    • Stores the final Hamming distance into the provided result pointer.

Integrating with Go

To call this assembly function from Go, I used the following Go code:

//go:noescape
func HammingDistanceNEONASM(a, b []byte, length int, result *int)

// HammingDistanceNEON calls the assembly function.
func HammingDistanceNEON(a, b []byte) int {
    var result int
    HammingDistanceNEONASM(a, b, len(a), &result)
    return result
}
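Before trusting the timings, it's worth confirming that the assembly version agrees with the pure Go reference. A minimal test sketch (hypothetical file name; it assumes all implementations live in the same package):

// hamming_asm_test.go
package hamming

import (
	"math/rand"
	"testing"
)

func TestHammingDistanceNEONMatchesPureGo(t *testing.T) {
	// Include lengths that are not multiples of 16 so the scalar
	// tail of the assembly gets exercised as well.
	for _, n := range []int{0, 1, 15, 16, 17, 1024, 1031} {
		a := make([]byte, n)
		b := make([]byte, n)
		rand.Read(a)
		rand.Read(b)

		want := HammingDistancePureGo(a, b)
		got := HammingDistanceNEON(a, b)
		if got != want {
			t.Fatalf("length %d: got %d, want %d", n, got, want)
		}
	}
}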

Benchmarking the Assembly Implementation

With everything set up, I ran the benchmark:

func benchmarkHammingDistanceNEONASM() {
    // Generate input of length 1024 bytes
    a := make([]byte, 1024)
    b := make([]byte, 1024)

    // Fill a and b with data
    for i := 0; i < 1024; i++ {
        a[i] = byte(i)
        b[i] = byte(i + 1)
    }

    fmt.Println("Running NEON ASM version...")
    start := time.Now()
    neonASMResult := 0
    for i := 0; i < 10000000; i++ {
        neonASMResult += HammingDistanceNEON(a, b)
    }
    durationNEONASM := time.Since(start)

    fmt.Printf("NEON ASM version took: %v, Result: %d\n", durationNEONASM, neonASMResult)
}

Results

The performance improvement was substantial:

NEON ASM version took: 527.420084ms, Result: 20400000000

Performance Analysis

Comparing all versions:

  • Pure Go Version: ~8.56 seconds
  • Optimized Go Version: ~5.67 seconds
  • CGO NEON Version: ~1.14 seconds
  • Go Assembly NEON Version: ~0.53 seconds

By writing the critical section in Go assembly using NEON instructions, I achieved the fastest execution time yet: roughly 94% faster than the original pure Go version, and about 91% faster than the optimized Go version.

Why Such a Significant Improvement?

  • Eliminating CGO Overhead: By writing assembly code directly in Go, I avoided the overhead associated with CGO calls.
  • Direct Control: Assembly allows for precise control over the CPU instructions, enabling optimal use of registers and instruction pipelines.
  • Full SIMD Utilization: Leveraging NEON instructions directly ensures maximum utilization of the SIMD capabilities.

Reflections on Writing Assembly in Go

Writing assembly code was both challenging and rewarding. It required a deep understanding of both the ARM architecture and Go’s calling conventions. Debugging was non-trivial, but the performance gains made it worthwhile.

Some takeaways:

  • Complexity vs. Performance: While assembly code offers the best performance, it comes at the cost of code readability and maintainability.
  • Portability Concerns: Assembly code is architecture-specific. The NEON instructions used here will only work on ARM64 processors supporting NEON.
  • Use Cases: Such low-level optimizations are best reserved for performance-critical sections where the benefits outweigh the drawbacks.

Conclusion

This exploration into optimizing the Hamming distance calculation was an enlightening journey through different layers of software optimization:

  1. Pure Go Implementation: A straightforward approach, easy to read and maintain, but with moderate performance.
  2. Optimized Go Version: Improved performance through loop unrolling and processing multiple bytes per iteration.
  3. CGO with NEON Intrinsics: Leveraged SIMD instructions via CGO, resulting in significant performance gains.
  4. Go Assembly with NEON Instructions: Achieved the best performance by writing critical code in assembly, fully utilizing hardware capabilities.

Final Performance Comparison

  Implementation        Time Taken        Relative Speedup
  Pure Go               ~8.56 seconds     Baseline
  Optimized Go          ~5.67 seconds     ~34% faster
  CGO NEON              ~1.14 seconds     ~87% faster
  Go Assembly NEON      ~0.53 seconds     ~94% faster

Key Learnings

  • Profiling Is Essential: Before optimizing, it’s crucial to profile your code to understand where the bottlenecks are.
  • Algorithmic Efficiency Matters: Sometimes, algorithmic changes can offer significant improvements without low-level optimizations.
  • Know Your Tools: Understanding the capabilities and limitations of your programming language and hardware can unlock substantial performance gains.
  • Balance Complexity and Maintainability: While low-level optimizations can offer performance benefits, they add complexity and reduce code readability.

Moving Forward

While I achieved remarkable performance improvements, it’s important to consider the maintainability and portability of the code. For most applications, the optimized Go version may offer a good balance between performance and maintainability.

However, in scenarios where performance is paramount and the target architecture is known, leveraging assembly and SIMD instructions can be justified.

Note on Running NEON Code on macOS

If you’re planning to run this code on a modern Mac with an Apple Silicon chip, you might encounter an issue where the process gets killed when you try to execute it. In my case this wasn’t about NEON itself: macOS on Apple Silicon enforces code-signing policies and refuses to run a binary it doesn’t consider validly signed. To get the program running reliably, you need to sign your binary.

Signing Your Binary

Signing the binary informs macOS that the code is trusted and allows it to execute with the necessary permissions. Here’s how you can do it:

  1. Create a Code Signing Certificate

    First, you’ll need a code signing certificate. For testing purposes, you can create a self-signed certificate using the Keychain Access app:

    • Open Keychain Access on your Mac.
    • Go to Keychain Access > Certificate Assistant > Create a Certificate.
    • In the dialog that appears:
      • Name: Enter a name for your certificate (e.g., “NEON Code Signing”).
      • Identity Type: Select Self Signed Root.
      • Certificate Type: Choose Code Signing.
    • Click Create to generate the certificate.
    • Your new certificate will appear in the login keychain under My Certificates.
  2. Sign the Binary During Build

    After creating the certificate, you can sign your binary using the codesign command. Here’s the command line I use:

    go build -o hamming_benchmark
    codesign -s "NEON Code Signing" hamming_benchmark
    

    Replace "NEON Code Signing" with the name of your certificate if it’s different.

    • go build -o hamming_benchmark: Builds your Go program and outputs the binary named hamming_benchmark.
    • codesign -s "NEON Code Signing" hamming_benchmark: Signs the binary with your code signing certificate.

Running the Signed Binary

With the binary signed, you should now be able to run your program without it being killed by the operating system:

./hamming_benchmark
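If the process still gets killed after signing, it's worth double-checking the signature with codesign --verify --verbose hamming_benchmark before digging deeper.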
