AsyncIO Performance
Monday, December 1, 2025
Factor has green threads and a long-standing feature request to be able to utilize native threads for more efficient concurrent tasks. In the meantime, the cooperative threading model allows for asynchronous tasks, which is particularly useful when waiting on I/O, such as sockets communicating over a network.
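To give a feel for that model, here is a minimal sketch of spawning a green thread using the spawn word from the threads vocabulary; the thread cooperatively yields to other threads whenever it sleeps or blocks on I/O (the "demo" name is arbitrary):

USING: calendar io kernel threads ;

! spawn a cooperative thread; sleeping yields to other threads
[ 1 seconds sleep "hello from a green thread!" print ] "demo" spawn drop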
And while it might be true that asynchrony is not concurrency, there is plenty more one could say about concurrency and multi-threaded or multi-process performance. Today I want to discuss an article that Will McGugan wrote about the overhead of Python asyncio tasks, and the good discussion that followed on Hacker News.
Let’s go over the benchmark in a few programming languages – including Factor!
Python
The article presents a benchmark in Python that does no real work, but measures the relative overhead of the asyncio task infrastructure by creating a large number of asynchronous tasks:
from asyncio import create_task, wait, run
from time import process_time as time

async def time_tasks(count=100) -> float:
    """Time creating and destroying tasks."""

    async def nop_task() -> None:
        """Do nothing task."""
        pass

    start = time()
    tasks = [create_task(nop_task()) for _ in range(count)]
    await wait(tasks)
    elapsed = time() - start
    return elapsed

for count in range(100_000, 1_000_000 + 1, 100_000):
    create_time = run(time_tasks(count))
    create_per_second = 1 / (create_time / count)
    print(f"{count:9,} tasks \t {create_per_second:0,.0f} tasks per/s")
Using the latest Python 3.14, this is reasonably fast on my laptop, taking about 13 seconds:
$ time python3.14 foo.py
  100,000 tasks 	 577,247 tasks per/s
  200,000 tasks 	 533,911 tasks per/s
  300,000 tasks 	 546,127 tasks per/s
  400,000 tasks 	 488,219 tasks per/s
  500,000 tasks 	 466,636 tasks per/s
  600,000 tasks 	 469,972 tasks per/s
  700,000 tasks 	 434,126 tasks per/s
  800,000 tasks 	 428,456 tasks per/s
  900,000 tasks 	 404,905 tasks per/s
1,000,000 tasks 	 376,167 tasks per/s
python3.14 foo.py 12.69s user 0.27s system 99% cpu 12.971 total
Factor
We could translate this directly to Factor using the concurrency.combinators vocabulary.
In particular, the parallel-map word spawns a thread to apply a quotation to each element of a sequence, then waits for all of the threads to finish:
USING: concurrency.combinators formatting io kernel math ranges sequences
tools.time ;

: time-tasks ( n -- )
    <iota> [ ] parallel-map drop ;

: run-tasks ( -- )
    100,000 1,000,000 100,000 <range> [
        ! benchmark returns nanoseconds; compute tasks per second
        dup [ time-tasks ] benchmark 1e9 / dupd /
        "%7d tasks \t %7d tasks per/s\n" printf flush
    ] each ;
After making an improvement to our parallel-map implementation, which now uses a count-down latch to wait more efficiently on a group of tasks (sketched below), this runs about 2.5x faster than Python:
IN: scratchpad gc [ run-tasks ] time
 100000 tasks 	 1246872 tasks per/s
 200000 tasks 	 1209500 tasks per/s
 300000 tasks 	 1141121 tasks per/s
 400000 tasks 	 1121304 tasks per/s
 500000 tasks 	 1119707 tasks per/s
 600000 tasks 	 1135459 tasks per/s
 700000 tasks 	  956541 tasks per/s
 800000 tasks 	 1091807 tasks per/s
 900000 tasks 	  944753 tasks per/s
1000000 tasks 	 1137681 tasks per/s
Running time: 5.142044833 seconds
That’s pretty good for a comparable dynamic language, especially since we are still running under Rosetta 2 on Apple macOS, translating Intel x86-64 to Apple Silicon aarch64 on the fly!
It also turns out that 75% of the benchmark time is spent in the garbage collector, so there are probably some big wins to be had if we look more closely into that.
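For the curious, here is a minimal sketch of that count-down latch pattern using the concurrency.count-downs vocabulary. The nop-tasks word is just for illustration; the real parallel-map implementation also has to collect a result from each thread:

USING: concurrency.count-downs kernel math threads ;

:: nop-tasks ( n -- )
    n <count-down> :> latch     ! latch that waits for n count-down calls
    n [ [ latch count-down ] "nop" spawn drop ] times
    latch await ;               ! parks until the count reaches zero

Each spawned thread counts the latch down as it finishes, and await simply parks the calling thread until the count reaches zero, avoiding polling or joining threads one at a time.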
Go
We could translate that benchmark into Go 1.25:
package main

import (
	"fmt"
	"sync"
	"time"
)

func timeTasks(count int) time.Duration {
	nopTask := func(done func()) {
		done()
	}
	start := time.Now()
	wg := &sync.WaitGroup{}
	wg.Add(count)
	for i := 0; i < count; i++ {
		go nopTask(wg.Done)
	}
	wg.Wait()
	return time.Now().Sub(start)
}

func main() {
	for n := 100_000; n <= 1_000_000; n += 100_000 {
		createTime := timeTasks(n)
		createPerSecond := (1.0 / (float64(createTime) / float64(n))) * float64(time.Second)
		// %d needs an integer, so truncate the floating-point rate
		fmt.Printf("%7d tasks \t %7d tasks per/s\n", n, int(createPerSecond))
	}
}
It runs about 11x faster than Python when using multiple CPUs:
$ time go run foo.go
 100000 tasks 	 3889083 tasks per/s
 200000 tasks 	 5748283 tasks per/s
 300000 tasks 	 6324955 tasks per/s
 400000 tasks 	 6265341 tasks per/s
 500000 tasks 	 6301852 tasks per/s
 600000 tasks 	 5572898 tasks per/s
 700000 tasks 	 6239860 tasks per/s
 800000 tasks 	 6276241 tasks per/s
 900000 tasks 	 6226128 tasks per/s
1000000 tasks 	 6243859 tasks per/s
go run foo.go 2.44s user 0.71s system 270% cpu 1.165 total
If we limit GOMAXPROCS to one CPU, it runs only about 7.5x faster than Python:
$ time GOMAXPROCS=1 go run foo.go
 100000 tasks 	 2240106 tasks per/s
 200000 tasks 	 2869379 tasks per/s
 300000 tasks 	 2745897 tasks per/s
 400000 tasks 	 3759142 tasks per/s
 500000 tasks 	 3090267 tasks per/s
 600000 tasks 	 3489138 tasks per/s
 700000 tasks 	 3608874 tasks per/s
 800000 tasks 	 3200636 tasks per/s
 900000 tasks 	 3682102 tasks per/s
1000000 tasks 	 3259778 tasks per/s
GOMAXPROCS=1 go run foo.go 1.65s user 0.08s system 99% cpu 1.735 total
JavaScript
We could build the same benchmark in JavaScript:
async function time_tasks(count=100) {
    async function nop_task() {
        return performance.now();
    }
    const start = performance.now()
    // use Array.from so nop_task is actually called for each element;
    // Array(count).map(...) would skip over the holes in a sparse array
    let tasks = Array.from({ length: count }, nop_task)
    await Promise.all(tasks)
    const elapsed = performance.now() - start
    return elapsed / 1e3
}

async function run_tasks() {
    for (let count = 100000; count < 1000000 + 1; count += 100000) {
        const ct = await time_tasks(count)
        console.log(`${count}: ${Math.round(1 / (ct / count))} tasks/sec`)
    }
}

run_tasks()
And it runs pretty fast on Node 25.2.1 – about 26x faster than Python!
$ time node foo.js
100000: 9448038 tasks/sec
200000: 11555322 tasks/sec
300000: 18286318 tasks/sec
400000: 10017217 tasks/sec
500000: 12587060 tasks/sec
600000: 14198956 tasks/sec
700000: 13294620 tasks/sec
800000: 12045403 tasks/sec
900000: 11135513 tasks/sec
1000000: 13577663 tasks/sec
node foo.js 0.82s user 0.10s system 185% cpu 0.496 total
But it runs even faster on Bun 1.3.3 – about 36x faster than Python!
$ time bun foo.js
100000: 9771222 tasks/sec
200000: 13388075 tasks/sec
300000: 13242548 tasks/sec
400000: 13130144 tasks/sec
500000: 16530496 tasks/sec
600000: 16979009 tasks/sec
700000: 16781272 tasks/sec
800000: 17098919 tasks/sec
900000: 17111784 tasks/sec
1000000: 18288515 tasks/sec
bun foo.js 0.37s user 0.02s system 111% cpu 0.353 total
I’m sure other languages perform both better and worse, but this gives us a nice idea of where we stand relative to some popular production programming languages. There is clearly room to grow, some potential low-hanging fruit, and known features, such as supporting native threads, that could be a big improvement over the status quo!
PRs welcome!