AsyncIO Performance
Monday, December 1, 2025
Factor has green threads and a long-standing feature request to be able to utilize native threads for more efficient concurrent tasks. In the meantime, the cooperative threading model allows for asynchronous tasks, which is particularly useful when waiting on I/O, such as sockets communicating over a network.
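To give a feel for that model, here is a minimal sketch of spawning a green thread using the spawn word from the threads vocabulary; the thread cooperatively yields to other threads whenever it sleeps or blocks on I/O (the "demo" name is arbitrary):

USING: calendar io kernel threads ;

! spawn a cooperative thread; sleeping yields to other threads
[ 1 seconds sleep "hello from a green thread!" print ] "demo" spawn drop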
And while it might be true that asynchrony is not concurrency, there is plenty more one could say about concurrency and multi-threaded or multi-process performance. Today I want to discuss an article that Will McGugan wrote about the overhead of Python asyncio tasks, and the good discussion that followed on Hacker News.
Let’s go over the benchmark in a few programming languages – including Factor!
Python
The article presents a benchmark in Python that does no real work, but measures the relative overhead of the asyncio task infrastructure by creating a large number of asynchronous tasks:
from asyncio import create_task, wait, run
from time import process_time as time

async def time_tasks(count=100) -> float:
    """Time creating and destroying tasks."""

    async def nop_task() -> None:
        """Do nothing task."""
        pass

    start = time()
    tasks = [create_task(nop_task()) for _ in range(count)]
    await wait(tasks)
    elapsed = time() - start
    return elapsed

for count in range(100_000, 1_000_000 + 1, 100_000):
    create_time = run(time_tasks(count))
    create_per_second = 1 / (create_time / count)
    print(f"{count:9,} tasks \t {create_per_second:0,.0f} tasks per/s")
Using the latest Python 3.14, this is reasonably fast on my laptop, taking about 13 seconds:
$ time python3.14 foo.py
  100,000 tasks 	 577,247 tasks per/s
  200,000 tasks 	 533,911 tasks per/s
  300,000 tasks 	 546,127 tasks per/s
  400,000 tasks 	 488,219 tasks per/s
  500,000 tasks 	 466,636 tasks per/s
  600,000 tasks 	 469,972 tasks per/s
  700,000 tasks 	 434,126 tasks per/s
  800,000 tasks 	 428,456 tasks per/s
  900,000 tasks 	 404,905 tasks per/s
1,000,000 tasks 	 376,167 tasks per/s
python3.14 foo.py 12.69s user 0.27s system 99% cpu 12.971 total
Factor
We could translate this directly to Factor using the concurrency.combinators vocabulary.
In particular, the parallel-map word spawns a thread to apply a quotation to each element of a sequence, then waits for all of the threads to finish:
USING: concurrency.combinators formatting io kernel math ranges sequences
tools.time ;

: time-tasks ( n -- )
    <iota> [ ] parallel-map drop ;

: run-tasks ( -- )
    100,000 1,000,000 100,000 <range> [
        ! benchmark returns nanoseconds; compute tasks per second
        dup [ time-tasks ] benchmark 1e9 / dupd /
        "%7d tasks \t %7d tasks per/s\n" printf flush
    ] each ;
After making an improvement to our parallel-map implementation, which now uses a count-down latch to wait more efficiently on a group of tasks (sketched below), this runs about 2.5x faster than Python:
IN: scratchpad gc [ run-tasks ] time
 100000 tasks 	 1246872 tasks per/s
 200000 tasks 	 1209500 tasks per/s
 300000 tasks 	 1141121 tasks per/s
 400000 tasks 	 1121304 tasks per/s
 500000 tasks 	 1119707 tasks per/s
 600000 tasks 	 1135459 tasks per/s
 700000 tasks 	  956541 tasks per/s
 800000 tasks 	 1091807 tasks per/s
 900000 tasks 	  944753 tasks per/s
1000000 tasks 	 1137681 tasks per/s
Running time: 5.142044833 seconds
That’s pretty good for a comparable dynamic language, especially since we are still running under Rosetta 2 on Apple macOS, translating Intel x86-64 to Apple Silicon aarch64 on the fly!
It also turns out that 75% of the benchmark time is spent in the garbage collector, so there are probably some big wins to be had if we look more closely into that.
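For the curious, here is a minimal sketch of that count-down latch pattern using the concurrency.count-downs vocabulary. The nop-tasks word is just for illustration; the real parallel-map implementation also has to collect a result from each thread:

USING: concurrency.count-downs kernel math threads ;

:: nop-tasks ( n -- )
    n <count-down> :> latch     ! latch that waits for n count-down calls
    n [ [ latch count-down ] "nop" spawn drop ] times
    latch await ;               ! parks until the count reaches zero

Each spawned thread counts the latch down as it finishes, and await simply parks the calling thread until the count reaches zero, avoiding polling or joining threads one at a time.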
Go
We could translate that benchmark into Go 1.25:
package main

import (
	"fmt"
	"sync"
	"time"
)

func timeTasks(count int) time.Duration {
	nopTask := func(done func()) {
		done()
	}
	start := time.Now()
	wg := &sync.WaitGroup{}
	wg.Add(count)
	for i := 0; i < count; i++ {
		go nopTask(wg.Done)
	}
	wg.Wait()
	return time.Now().Sub(start)
}

func main() {
	for n := 100_000; n <= 1_000_000; n += 100_000 {
		createTime := timeTasks(n)
		createPerSecond := (1.0 / (float64(createTime) / float64(n))) * float64(time.Second)
		// %d needs an integer, so truncate the floating-point rate
		fmt.Printf("%7d tasks \t %7d tasks per/s\n", n, int(createPerSecond))
	}
}
It runs about 11x faster than Python when using multiple CPUs:
$ time go run foo.go
 100000 tasks 	 3889083 tasks per/s
 200000 tasks 	 5748283 tasks per/s
 300000 tasks 	 6324955 tasks per/s
 400000 tasks 	 6265341 tasks per/s
 500000 tasks 	 6301852 tasks per/s
 600000 tasks 	 5572898 tasks per/s
 700000 tasks 	 6239860 tasks per/s
 800000 tasks 	 6276241 tasks per/s
 900000 tasks 	 6226128 tasks per/s
1000000 tasks 	 6243859 tasks per/s
go run foo.go 2.44s user 0.71s system 270% cpu 1.165 total
If we limit GOMAXPROCS to one CPU, it runs only about 7.5x faster than Python:
$ time GOMAXPROCS=1 go run foo.go
 100000 tasks 	 2240106 tasks per/s
 200000 tasks 	 2869379 tasks per/s
 300000 tasks 	 2745897 tasks per/s
 400000 tasks 	 3759142 tasks per/s
 500000 tasks 	 3090267 tasks per/s
 600000 tasks 	 3489138 tasks per/s
 700000 tasks 	 3608874 tasks per/s
 800000 tasks 	 3200636 tasks per/s
 900000 tasks 	 3682102 tasks per/s
1000000 tasks 	 3259778 tasks per/s
GOMAXPROCS=1 go run foo.go 1.65s user 0.08s system 99% cpu 1.735 total
JavaScript
We could build the same benchmark in JavaScript:
async function time_tasks(count=100) {
    async function nop_task() {
        return performance.now();
    }
    const start = performance.now()
    // use Array.from so nop_task is actually called for each element;
    // Array(count).map(...) would skip over the holes in a sparse array
    let tasks = Array.from({ length: count }, nop_task)
    await Promise.all(tasks)
    const elapsed = performance.now() - start
    return elapsed / 1e3
}

async function run_tasks() {
    for (let count = 100000; count < 1000000 + 1; count += 100000) {
        const ct = await time_tasks(count)
        console.log(`${count}: ${Math.round(1 / (ct / count))} tasks/sec`)
    }
}

run_tasks()
And it runs pretty fast on Node 25.2.1 – about 26x faster than Python!
$ time node foo.js
100000: 9448038 tasks/sec
200000: 11555322 tasks/sec
300000: 18286318 tasks/sec
400000: 10017217 tasks/sec
500000: 12587060 tasks/sec
600000: 14198956 tasks/sec
700000: 13294620 tasks/sec
800000: 12045403 tasks/sec
900000: 11135513 tasks/sec
1000000: 13577663 tasks/sec
node foo.js 0.82s user 0.10s system 185% cpu 0.496 total
But it runs even faster on Bun 1.3.3 – about 36x faster than Python!
$ time bun foo.js
100000: 9771222 tasks/sec
200000: 13388075 tasks/sec
300000: 13242548 tasks/sec
400000: 13130144 tasks/sec
500000: 16530496 tasks/sec
600000: 16979009 tasks/sec
700000: 16781272 tasks/sec
800000: 17098919 tasks/sec
900000: 17111784 tasks/sec
1000000: 18288515 tasks/sec
bun foo.js 0.37s user 0.02s system 111% cpu 0.353 total
I’m sure other languages perform both better and worse, but this gives us a nice idea of where we stand relative to some popular production programming languages. There is clearly room to grow, some potential low-hanging fruit, and known features, such as supporting native threads, that could be a big improvement over the status quo!
PRs welcome!