chunkloris: per-chunk http amplification

tl;dr

an http/1.1 request body sent as N one-byte chunked-transfer-encoding chunks is rfc-compliant. on almost every production http server we tested it forces one parser/dispatcher callback per chunk — and the per-chunk cpu cost is measurable in microseconds on a single core.

across the 54-server survey:

27 / 27 http/1 servers retain a measurable per-chunk cpu cost under the paced one-byte-per-chunk comparator (mode b, N=250,000, 1 vcpu). the range is 3.6 µs/chunk (kestrel on asp.net core, parser overhead only — application delivery is batched) to 113.6 µs/chunk (nginx as origin). median ≈ 12.4 µs.
http/2 and http/3 inherit the shape per-frame. of the 17 h2c servers, 11 expose the per-frame cpu cost, 3 batch, and 3 abort the probe with GOAWAY; 4 of 4 measured h3 servers expose it.
websocket is the one protocol where most implementations already batch correctly (node-ws, gorilla, rust-tungstenite, kestrel-ws). python asgi (uvicorn + websockets / wsproto) is the outlier.
the “deploy behind nginx” mitigation is empirically real for http/1: nginx's default, proxy_request_buffering on, collapses N chunks into a single content-length-framed upstream request. haproxy does not aggregate by default, and http-buffer-request only waits for one buffer’s worth of body; it is not a per-chunk accumulator.

draft pdf →

what “per-chunk” means

the http/1.1 chunked transfer encoding (rfc 9112) fixes the wire format:

<hex-size>\r\n<bytes>\r\n
<hex-size>\r\n<bytes>\r\n
...
0\r\n\r\n

it does not fix how a parser schedules callbacks into the application:

one callback per wire chunk?
one callback per recv() batch (whatever the kernel happens to deliver)?
one callback per fixed-byte threshold (e.g. 8 kib)?
one callback after the entire body is buffered?

every production server picks one of these as the default. the paper measures which one each server picked, and what that choice costs per chunk on a 1-vcpu container, under two probe modes:

mode a — bridge-coalesced. the prober writes the whole chunked body back-to-back with TCP_NODELAY off. the docker bridge coalesces many wire chunks into each server-side recv(). closest to pod-to-pod traffic.
mode b — paced 100 µs. between every chunk the prober busy-waits 100 µs and TCP_NODELAY is on. each chunk leaves the prober as its own segment. closest to slow-drip attacker pacing or trans-wan bursts.

mode b is the strict comparator. mode b minus mode a (when both are present) isolates the server overhead from the prober’s pacing budget.
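
the difference between the two probe modes can be sketched as a single write loop with an optional busy-wait. this is a minimal illustration and not the paper's prober: the names set_nodelay, busy_wait and write_chunks are hypothetical, and the send callable is injected so the loop can be exercised without a network.

```python
import socket
import time
from typing import Callable, Iterable


def set_nodelay(sock: socket.socket, enabled: bool) -> None:
    """Toggle TCP_NODELAY on an already-connected TCP socket.

    Per the description above: off for mode a, on for mode b.
    """
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1 if enabled else 0)


def busy_wait(seconds: float) -> None:
    """Spin until the deadline (the text describes a busy-wait, not a sleep)."""
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        pass


def write_chunks(send: Callable[[bytes], object],
                 chunks: Iterable[bytes],
                 gap_s: float = 0.0) -> int:
    """Write pre-framed chunks through `send` and return the write count.

    gap_s == 0 models mode a (back-to-back writes).
    gap_s > 0 models mode b (busy-wait between consecutive chunks).
    """
    writes = 0
    for chunk in chunks:
        if gap_s > 0 and writes > 0:
            busy_wait(gap_s)
        send(chunk)
        writes += 1
    return writes


if __name__ == "__main__":
    sink = []
    frames = [b"1\r\nA\r\n"] * 50
    start = time.perf_counter()
    write_chunks(sink.append, frames, gap_s=100e-6)
    print(f"50 paced writes took {time.perf_counter() - start:.4f} s")
```

whether each write actually arrives as its own recv() on the server is decided by the kernel and the network path, which is why the paper reports both modes rather than assuming one.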

the attack shape (“chunkloris”)

figure (diagram): the attack shape. N one-byte chunks travel through kernel TCP, the HTTP parser, and the application handler: N one-byte chunks → N parser events → N application wakeups on per-chunk servers, versus few events on batched servers. on batched servers (kestrel via System.IO.Pipelines and similar) parser cpu is still per-chunk but the application boundary is decoupled from the wire framing rate.

the paper names this measured shape chunkloris because it is a low-payload http availability attack in the slowloris tradition: the attacker sends valid wire bytes that force the server to do excessive per-event work, and the limiting resource on the server is parser/dispatcher cpu rather than idle connection state.
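
a toy model makes the event counts concrete. the sketch below is not taken from any surveyed server: ToyDecoder, encode_chunked and count_events are hypothetical names, the decoder ignores chunk extensions and trailers, and slicing the wire bytes into fixed-size pieces only approximates what recv() returns.

```python
from typing import Callable, List


def encode_chunked(data: bytes, chunk_size: int = 1) -> bytes:
    """Frame `data` as chunked transfer coding with fixed-size chunks."""
    out = bytearray()
    for i in range(0, len(data), chunk_size):
        piece = data[i:i + chunk_size]
        out += b"%x\r\n" % len(piece) + piece + b"\r\n"
    return bytes(out + b"0\r\n\r\n")  # last-chunk, no trailers


class ToyDecoder:
    """Incremental chunked decoder (toy: no extensions, no trailers)."""

    def __init__(self, on_data: Callable[[bytes], None], per_chunk: bool):
        self.buf = bytearray()
        self.on_data = on_data
        self.per_chunk = per_chunk  # True: one application event per wire chunk
        self.done = False

    def feed(self, received: bytes) -> None:
        """Consume one recv() worth of bytes and fire application events."""
        self.buf += received
        pos = 0
        batch = bytearray()
        while not self.done:
            eol = self.buf.find(b"\r\n", pos)
            if eol < 0:
                break  # size line incomplete
            size = int(self.buf[pos:eol], 16)
            if size == 0:
                if len(self.buf) < eol + 4:
                    break  # final CRLF not received yet
                pos, self.done = eol + 4, True
                break
            end = eol + 2 + size + 2
            if len(self.buf) < end:
                break  # chunk data incomplete
            payload = bytes(self.buf[eol + 2:end - 2])
            pos = end
            if self.per_chunk:
                self.on_data(payload)
            else:
                batch += payload
        del self.buf[:pos]
        if batch:
            self.on_data(bytes(batch))  # one event per recv() batch


def count_events(wire: bytes, recv_size: int, per_chunk: bool) -> int:
    """Deliver `wire` in recv_size slices and count application events."""
    events: List[bytes] = []
    decoder = ToyDecoder(events.append, per_chunk)
    for i in range(0, len(wire), recv_size):
        decoder.feed(wire[i:i + recv_size])
    return len(events)


if __name__ == "__main__":
    wire = encode_chunked(b"A" * 1000)
    print(count_events(wire, 4096, per_chunk=True))   # per-chunk dispatch
    print(count_events(wire, 4096, per_chunk=False))  # per-recv, coalesced input
    print(count_events(wire, 6, per_chunk=False))     # per-recv, one chunk per recv
```

in this model a per-recv policy only helps when the input is already coalesced. when every chunk arrives in its own recv(), as mode b is designed to arrange, the per-recv policy degenerates to one event per chunk.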

a representative request (headers plus chunked body):

POST /upload HTTP/1.1
Host: x
Transfer-Encoding: chunked
Content-Type: application/octet-stream
Connection: close

1\r\nA\r\n1\r\nA\r\n... (250,000 times) ...0\r\n\r\n

that body is 6 wire bytes per data byte (1\r\nA\r\n), ≈ 1.5 mb on the wire to deliver 250,000 chunks. on a server with a per-chunk cost of 50 µs, a single connection consumes 250,000 × 50 µs = 12.5 cpu-seconds of a single core. amplification is therefore best expressed as cpu time per kib of attack traffic (here ≈ 8.5 cpu-ms per kib), not “kib delivered per kib sent.”

this shape is rfc-compliant. there is no protocol violation, and the attacker does not need to hold the connection open or send slowly.
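
the arithmetic behind these figures is short enough to check directly. the helper names below are hypothetical, the 50 µs/chunk cost is the illustrative value from the text rather than a measurement, and request headers are left out of the byte count.

```python
def wire_bytes(n_chunks: int, payload: int = 1) -> int:
    """Bytes on the wire for n_chunks chunks of `payload` bytes each."""
    size_line = len(b"%x\r\n" % payload)       # hex size + CRLF
    per_chunk = size_line + payload + 2        # + data + CRLF
    return n_chunks * per_chunk + len(b"0\r\n\r\n")


def cpu_seconds(n_chunks: int, us_per_chunk: float) -> float:
    """Total server cpu time if every chunk costs us_per_chunk microseconds."""
    return n_chunks * us_per_chunk / 1_000_000


def cpu_ms_per_kib(n_chunks: int, us_per_chunk: float) -> float:
    """Cpu milliseconds spent by the server per KiB of attack traffic."""
    kib = wire_bytes(n_chunks) / 1024
    return 1000 * cpu_seconds(n_chunks, us_per_chunk) / kib


if __name__ == "__main__":
    n = 250_000
    print(wire_bytes(n))                    # body bytes, including the last-chunk
    print(cpu_seconds(n, 50))               # cpu-seconds at 50 µs/chunk
    print(round(cpu_ms_per_kib(n, 50), 1))  # cpu-ms per KiB sent
```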

why this is a “known issue” — and what the paper adds

the lineage is well-known in pieces:

slowloris (2009) — connection-state exhaustion.
read-rate / write-rate slow attacks documented for http servers since ~2012.
per-chunk parser/dispatcher cost has been raised in multiple upstream issues, most directly in hyper #4008 (“provide built-in chunked request limits and cpu-safe streaming helpers”) — closed not_planned / S-waiting-on-author.
nginx documents proxy_request_buffering and client_body_buffer_size as the mitigation; the body-buffering behavior is the reason this is not also a per-chunk problem behind a default nginx proxy.

what the paper contributes:

a like-for-like cross-ecosystem measurement matrix: 27 http/1 + 17 h2 + 4 h3 + 6 ws servers, every cell measured on the same 1-vcpu docker container with cgroup-derived cpu cost.
source citations for the parser callback boundary in every server (the per-framework deep-dives in section 4 of the pdf).
a mitigation taxonomy — server-side aggregation, application-side chunk limits, frontend buffering, transport-layer rate limiting — and an empirically-confirmed characterization of the “deploy behind nginx” advice (it works; behind haproxy by default it does not).
a counterexample at the application boundary: kestrel on asp.net core uses System.IO.Pipelines to batch application-delivery callbacks. parser cpu is still per-chunk, but application code wakes on coalesced reads.

headline figure

figure: ranked per-chunk server cpu cost across the 27-server canonical http/1 mode-b comparator (paced 100 µs gap, N = 250,000 one-byte chunks, 1 vcpu container). log-scale x-axis. kestrel batches application delivery but still has substantial parser cpu cost under this strict comparator.

figure: verdict distribution by protocol (VULNERABLE-PER-CHUNK, VULNERABLE-PER-FRAME, BATCHES-CORRECTLY, PROTECTED-H2-GOAWAY). http/1, http/2, http/3 are dominated by per-chunk / per-frame verdicts. websocket is the one protocol where batching correctly is the common case.

figure: wall time vs chunk count N under mode a (bridge-coalesced), restricted to the common N ∈ {50,000, 100,000, 250,000} cells. log-log axes with reference slopes at 1 (linear) and 2 (quadratic). the completed common-cell matrix is roughly linear. an earlier draft over-claimed a quadratic shape from incomplete cells; the rerun does not support that.

results, all 54 servers

http/1.1 chunked transfer encoding (27 servers)

server	ecosystem	mode a (µs/chunk)	mode b (µs/chunk)	verdict
uvicorn (httptools)	python	0.5	8.5	per-chunk
daphne	python	1.7	9.3	per-chunk
uvicorn (h11)	python	4.0	14.0	per-chunk
hypercorn (h11)	python	5.5	23.5	per-chunk
granian	python	18.6	23.4	per-chunk
gunicorn (sync)	python	2.7	105.6	per-chunk
waitress	python	18.4	108.0	per-chunk
tornado	python	18.4	104.3	per-chunk
go net/http	go	0.2	7.99	per-chunk
gin	go	0.2	11.6	per-chunk
axum	rust	0.5	4.8	per-chunk
actix-web	rust	0.2	3.8	per-chunk
node http	node	1.1	5.2	per-chunk
express	node	1.0	5.2	per-chunk
fastify	node	0.9	5.1	per-chunk
vertx	jvm	0.4	4.3	per-chunk
spring boot (tomcat)	jvm	0.5	12.4	per-chunk
cowboy	beam	1.7	14.0	per-chunk
phoenix (cowboy2)	beam	2.1	35.0	per-chunk
bandit	beam	67.1	82.2	per-chunk
nginx	c	0.2	113.6	per-chunk
apache httpd	c	0.2	15.3	per-chunk
haproxy	c	0.2	7.6	per-chunk
kestrel 9	dotnet	1.3	3.6	batches application delivery; parser still per-chunk
puma	ruby	1.2	28.9	per-chunk
unicorn	ruby	—	—	per-chunk (see page)
falcon	ruby	—	—	per-chunk (see page)
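
the summary statistics quoted in the tl;dr can be recomputed from the rows above. the sketch copies the numeric cells of this table verbatim; unicorn and falcon have no numeric cells here, so they are excluded and the median is taken over 25 rows. the variable names are illustrative only.

```python
from statistics import median

# (mode a, mode b) in µs/chunk, copied from the http/1 table.
HTTP1 = {
    "uvicorn (httptools)": (0.5, 8.5),
    "daphne": (1.7, 9.3),
    "uvicorn (h11)": (4.0, 14.0),
    "hypercorn (h11)": (5.5, 23.5),
    "granian": (18.6, 23.4),
    "gunicorn (sync)": (2.7, 105.6),
    "waitress": (18.4, 108.0),
    "tornado": (18.4, 104.3),
    "go net/http": (0.2, 7.99),
    "gin": (0.2, 11.6),
    "axum": (0.5, 4.8),
    "actix-web": (0.2, 3.8),
    "node http": (1.1, 5.2),
    "express": (1.0, 5.2),
    "fastify": (0.9, 5.1),
    "vertx": (0.4, 4.3),
    "spring boot (tomcat)": (0.5, 12.4),
    "cowboy": (1.7, 14.0),
    "phoenix (cowboy2)": (2.1, 35.0),
    "bandit": (67.1, 82.2),
    "nginx": (0.2, 113.6),
    "apache httpd": (0.2, 15.3),
    "haproxy": (0.2, 7.6),
    "kestrel 9": (1.3, 3.6),
    "puma": (1.2, 28.9),
}

mode_b = {name: b for name, (_, b) in HTTP1.items()}
# mode b minus mode a, the difference the probe-mode section describes
delta = {name: round(b - a, 2) for name, (a, b) in HTTP1.items()}

lowest = min(mode_b, key=mode_b.get)
highest = max(mode_b, key=mode_b.get)

if __name__ == "__main__":
    print("rows:", len(mode_b))
    print("median mode b:", median(mode_b.values()))
    print("lowest:", lowest, mode_b[lowest])
    print("highest:", highest, mode_b[highest])
```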

http/2 data frames, h2c (17 servers)

server	ecosystem	mode a (µs/frame)	mode b (µs/frame)	verdict
hypercorn h2	python	7.7	27.0	per-frame
granian h2	python	16.3	28.8	per-frame
node h2	node	0.8	5.2	per-frame
fastify h2	node	1.1	6.1	per-frame
kestrel h2	dotnet	2.0	58.6	per-frame
vertx h2	jvm	0.5	4.4	batches
rust hyper h2	rust	0.5	3.7	batches
actix h2	rust	0.5	4.1	batches
go h2c	go	4.8	19.4	per-frame
bandit h2	beam	1.0	17.3	per-frame
cowboy h2	beam	—	—	h2 GOAWAY 11 (ENHANCE_YOUR_CALM)
phoenix cowboy h2	beam	—	—	h2 GOAWAY 11 (ENHANCE_YOUR_CALM)
spring tomcat h2	jvm	—	—	h2 GOAWAY 11 (ENHANCE_YOUR_CALM)
nginx h2	c	0.2	103.5	per-frame
apache httpd h2	c	2.5	27.9	per-frame
haproxy h2	c	0.3	8.2	per-frame
falcon h2	ruby	3.1	11.4	per-frame

http/3 data frames over quic (4 servers)

server	ecosystem	mode a (µs/frame)	mode b (µs/frame)	verdict
hypercorn h3 (aioquic)	python	2.8	66.6	per-frame
aioquic h3	python	1.4	36.3	per-frame
quic-go h3	go	0.8	33.2	per-frame
kestrel h3 (msquic)	dotnet	2.3	67.4	per-frame

websocket text frames (6 servers)

server	ecosystem	mode a (µs/frame)	mode b (µs/frame)	verdict
uvicorn + websockets	python	4.7	5.0	per-frame
uvicorn + wsproto	python	10.2	26.6	per-frame
node-ws	node	0.23	5.1	batches
gorilla websocket	go	0.24	5.6	batches
rust tungstenite	rust	0.09	7.1	batches
kestrel websocket	dotnet	0.39	11.5	batches

see the http/1 vs /2 vs /3 protocol comparison for the cross-protocol summary, and any per-server page for the source citation and the full mode-a / mode-b measurement matrix for that server.

figure: three observed parser/dispatcher boundary classes across the 54-server matrix: per-chunk / per-frame (43 of 54), application-boundary batched (8 of 54, counting the batching verdicts in the tables above), and in-protocol abort via H2 GOAWAY (3 of 54).

mitigations available today

figure: measured behavior of three deployments (default nginx, nginx with proxy_request_buffering off, and haproxy default), showing whether N chunks reach the upstream as 1 recv() or N recv()s. default nginx collapses N chunks into a single content-length-framed upstream request; nginx with proxy_request_buffering off and haproxy’s default streaming both forward the per-chunk shape to the upstream.

the paper’s mitigation section ranks these by where the work happens.

server-side aggregation (preferred). kestrel’s Http1ChunkedEncodingMessageBody.PumpAsync drains the readable buffer into a System.IO.Pipelines pipe and wakes the application with a coalesced PipeReader.ReadAsync result. application code sees one event per buffer, not one per wire chunk. this is what every event-loop http server should expose as an opt-in primitive.
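
the aggregation idea can be sketched in a few lines of asyncio. this is a toy analogue and not kestrel's implementation: CoalescingBody and demo are hypothetical names, and the sketch omits backpressure and size limits that a real server would need.

```python
import asyncio


class CoalescingBody:
    """Toy request-body buffer: many parser writes, few application reads."""

    def __init__(self) -> None:
        self._buf = bytearray()
        self._eof = False
        self._ready = asyncio.Event()
        self.data_reads = 0  # application wakeups that carried data

    def feed(self, payload: bytes) -> None:
        """Called by the parser once per decoded wire chunk."""
        self._buf += payload
        self._ready.set()

    def feed_eof(self) -> None:
        """Called by the parser after the last-chunk."""
        self._eof = True
        self._ready.set()

    async def read(self) -> bytes:
        """Return everything buffered since the last read; b"" at end of body."""
        while not self._buf and not self._eof:
            self._ready.clear()
            await self._ready.wait()
        data = bytes(self._buf)
        self._buf.clear()
        if data:
            self.data_reads += 1
        return data


async def demo(n_chunks: int):
    body = CoalescingBody()
    for _ in range(n_chunks):  # parser side: still one call per chunk
        body.feed(b"A")
    body.feed_eof()
    total = 0
    while True:
        data = await body.read()
        if not data:
            break
        total += len(data)
    return total, body.data_reads


if __name__ == "__main__":
    print(asyncio.run(demo(1000)))
```

the parser still does work for every chunk in this model, which matches the caveat above: aggregation decouples application wakeups from wire framing but does not remove parser cost.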

application-side chunk limits. an explicit cap on the number of decoded chunks per request, applied before the application handler reads them. hyper issue #4008 proposed this as BodyExt::limit_chunks(N); that proposal is currently closed not_planned. without it, most rust / go / python event-loop servers can be capped only by total bytes, not by chunk count.
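
a chunk-count cap can be sketched as a wrapper around an async body iterator. this is an assumption-laden illustration, not hyper's api or any server's built-in: the limit_chunks name is borrowed from the proposal only for readability, TooManyChunks and the other helpers are hypothetical, and it assumes the server hands the application one item per decoded chunk.

```python
import asyncio
from typing import AsyncIterator, Tuple


class TooManyChunks(Exception):
    """Raised when a request body exceeds the allowed number of chunks."""


async def limit_chunks(body: AsyncIterator[bytes],
                       max_chunks: int) -> AsyncIterator[bytes]:
    """Yield body chunks, failing as soon as chunk max_chunks + 1 arrives."""
    seen = 0
    async for chunk in body:
        seen += 1
        if seen > max_chunks:
            raise TooManyChunks(f"more than {max_chunks} body chunks")
        yield chunk


async def one_byte_body(n_chunks: int) -> AsyncIterator[bytes]:
    """Stand-in for a server-provided body stream of one-byte chunks."""
    for _ in range(n_chunks):
        yield b"A"


async def handle(n_chunks: int, cap: int) -> Tuple[int, str]:
    """Hypothetical handler: read the body through the cap."""
    received = 0
    try:
        async for chunk in limit_chunks(one_byte_body(n_chunks), cap):
            received += len(chunk)
    except TooManyChunks:
        return received, "rejected"
    return received, "ok"


if __name__ == "__main__":
    print(asyncio.run(handle(10, 5)))
    print(asyncio.run(handle(3, 5)))
```

a cap like this bounds application work per request. the parser cost of the chunks received before the cap trips is still paid, so the connection would also need to be closed on rejection.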

frontend buffering. default nginx (proxy_request_buffering on) collapses N chunks into a single content-length-framed upstream request. apache mod_proxy_http is documented to behave the same way. haproxy streams by default, and http-buffer-request only waits for one buffer’s worth of body; it is not a per-chunk accumulator.

transport-layer rate limiting. capping requests per second per source ip or per connection bounds the throughput of any single attacker but does not remove the per-chunk cost.
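
a per-source request limiter is commonly built as a token bucket. the sketch below is a generic illustration and not something the paper measured: TokenBucket and PerSourceLimiter are hypothetical names, the rate and burst values are arbitrary, and the clock is injectable so the behavior can be checked deterministically.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Classic token bucket: refill at `rate` tokens/s up to `burst`."""

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; otherwise refuse."""
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerSourceLimiter:
    """One bucket per source address (no eviction in this sketch)."""

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate, self.burst, self.clock = rate, burst, clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, source_ip: str) -> bool:
        bucket = self.buckets.get(source_ip)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.burst, self.clock)
            self.buckets[source_ip] = bucket
        return bucket.allow()


if __name__ == "__main__":
    limiter = PerSourceLimiter(rate=2.0, burst=2)
    print([limiter.allow("198.51.100.7") for _ in range(3)])
```

as the paragraph above notes, this bounds how many requests one source can start. each admitted request can still carry the full per-chunk cost.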

resources

draft pdf — full paper, all sections.
http/1 vs /2 vs /3 protocol comparison
per-server pages — one for each of the 54 servers in the survey, with the parser path source citation and the full mode-a / mode-b measurement matrix.
hyper #4008 — upstream proposal for application-side chunk limits, closed not_planned.
rfc 9112 §7.1 — http/1.1 chunked transfer coding.
rfc 9113 §6.1 — http/2 data frames.
rfc 9114 §7.2.1 — http/3 data frames.

status

this is a working paper. the pdf linked above is the current draft; the measurement matrix is complete for the cells described in the paper, but the per-framework deep dives and the bibliography are still being expanded. the wiki pages on this site are kept in sync with the measurement matrix in data/all.json from the paper repo.