Improving parallel calls from C to Go performance

June 29, 2023
go golang cgo runtime


A few weeks ago, an ex-colleague of mine asked me about a performance problem in his Go/cgo code:

We all know that cgo is slow, but this quirk only shows up when adding more CPUs.

Here's a rough English translation of what he told me:

It's strange that when adding more CPUs, starting from 32, performance starts decreasing.

The problem

The Go runtime runs C code on a separate OS thread. If the C code calls back into Go, the runtime must ensure the callback runs on that same OS thread. Otherwise, the scheduler could move the goroutine running the callback to a different M (an OS thread), preventing the callback from returning to the C code once it finishes.

Further, every call from C to Go needs to check whether the Go runtime has finished initializing:

fmt.Fprintf(fgcc, "\tsize_t _cgo_ctxt = _cgo_wait_runtime_init_done();\n")


So a global mutex is used for synchronization. This often does not matter, because the lock is held only briefly. However, in code that makes a lot of parallel calls from C to Go, there can be heavy contention on that mutex.

From this pprof result:

[pprof result]

we can see that the code spends most of its time in lock/unlock/wait.

Here’s a simplified program to reproduce the problem:

package main

/*
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

extern void c_call_go_get_value(uint8_t *key32, uint8_t *value32);

static inline void get_value(uint8_t *key32, uint8_t *value32) {
    uint8_t value[32];
    c_call_go_get_value(key32, value);
    memcpy(value32, value, 32);
}
*/
import "C"

import (
	"fmt"
	"runtime"
	"sync"
	"time"
	"unsafe"
)

func main() {
	numWorkers := runtime.NumCPU()
	fmt.Printf("numWorkers = %v\n", numWorkers)
	numTasks := 10_000_000
	tasks := make(chan struct{}, numTasks)
	for i := 0; i < numTasks; i++ {
		tasks <- struct{}{}
	}
	close(tasks)
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range tasks {
				_ = getValue([32]byte{})
			}
		}()
	}
	wg.Wait()
	fmt.Printf("took %vms\n", time.Since(start).Milliseconds())
}

func getValue(key [32]byte) []byte {
	key32 := (*C.uint8_t)(C.CBytes(key[:]))
	defer C.free(unsafe.Pointer(key32))
	value32 := (*C.uint8_t)(C.malloc(32))
	defer C.free(unsafe.Pointer(value32))
	C.get_value(key32, value32)
	return C.GoBytes(unsafe.Pointer(value32), 32)
}

//export c_call_go_get_value
func c_call_go_get_value(key32 *C.uint8_t, value32 *C.uint8_t) {
	key := C.GoBytes(unsafe.Pointer(key32), 32)
	_ = key
	value := make([]byte, 32)
	for i := 0; i < len(value); i++ {
		*(*C.uint8_t)(unsafe.Pointer(uintptr(unsafe.Pointer(value32)) + uintptr(i))) = (C.uint8_t)(value[i])
	}
}
Running with go1.20.5:

$ go1.20.5 run main.go
numWorkers = 8
took 2687ms

The fix

Since this is an initialization guard, we can use an atomic operation to provide a fast path once initialization has completed. That reduces the number of threads taking the mutex, preventing heavy contention.

CL 505455 was sent to fix this issue, reducing the running time of the above program by ~1s:

$ go run main.go
numWorkers = 8
took 1706ms

My ex-colleague also experimented with the patch in his environment; it reduced the time spent in mutex unlock from ~45% to just ~17%.


cgo is not Go, and there's still a lot of room for improvement.

Thank you for reading so far!

Till next time!
