chunkloris: per-chunk http amplification
tl;dr
an http/1.1 request body sent as N one-byte chunked-transfer-encoding chunks is
rfc-compliant. on almost every production http server we tested it forces one
parser/dispatcher callback per chunk — and the per-chunk cpu cost is measurable
in microseconds on a single core.
across the 54-server survey:
- 27 / 27 http/1 servers retain a measurable per-chunk cpu cost under the paced one-byte-per-chunk comparator (mode b, N=250,000, 1 vcpu). the range is 3.6 µs/chunk (kestrel on asp.net core, parser overhead only — application delivery is batched) to 113.6 µs/chunk (nginx as origin). median ≈ 12.4 µs.
- http/2 and http/3 inherit the shape per-frame. 14 of 17 measured h2c servers expose the per-frame cpu cost; 4 of 4 measured h3 servers do.
- websocket is the one protocol where most implementations already batch correctly (node-ws, gorilla, rust-tungstenite, kestrel-ws). python asgi (uvicorn + websockets / wsproto) is the outlier.
- the “deploy behind nginx” mitigation is empirically real for http/1: the default `proxy_request_buffering on` collapses N chunks into a single content-length-framed upstream request. haproxy does not aggregate by default; `http-buffer-request` waits for one buffer’s worth, not a per-chunk accumulator.
what “per-chunk” means
the http/1.1 chunked transfer encoding (rfc 9112) fixes the wire format:
<hex-size>\r\n<bytes>\r\n
<hex-size>\r\n<bytes>\r\n
...
0\r\n\r\n
it does not fix how a parser schedules callbacks into the application:
- one callback per wire chunk?
- one callback per `recv()` batch (whatever the kernel happens to deliver)?
- one callback per fixed-byte threshold (e.g. 8 kib)?
- one callback after the entire body is buffered?
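the difference between the first two options can be made concrete with a small decoder. the sketch below is an illustration only, not any surveyed server's parser: `ChunkedDecoder` and `count_callbacks` are hypothetical names, and chunk extensions and trailers are deliberately not handled.

```python
from typing import Callable, List


class ChunkedDecoder:
    """Incremental chunked-body decoder (sketch: no chunk extensions, no trailers)."""

    def __init__(self, on_data: Callable[[bytes], None], per_chunk: bool) -> None:
        self.on_data = on_data
        self.per_chunk = per_chunk  # True: one callback per wire chunk
        self.buf = b""
        self.done = False

    def feed(self, data: bytes) -> None:
        """Consume one recv() worth of bytes and dispatch any decoded payload."""
        self.buf += data
        pos = 0
        batch: List[bytes] = []
        while not self.done:
            eol = self.buf.find(b"\r\n", pos)
            if eol < 0:
                break  # size line not complete yet
            size = int(self.buf[pos:eol].decode("ascii"), 16)
            if size == 0:
                self.done = True  # terminal chunk
                break
            end = eol + 2 + size
            if len(self.buf) < end + 2:
                break  # payload or its trailing CRLF not complete yet
            payload = self.buf[eol + 2:end]
            pos = end + 2
            if self.per_chunk:
                self.on_data(payload)
            else:
                batch.append(payload)
        self.buf = self.buf[pos:]
        if batch:
            self.on_data(b"".join(batch))  # one callback per recv() batch


def count_callbacks(body: bytes, recv_size: int, per_chunk: bool):
    """Feed body in recv_size slices; return (number of callbacks, payload)."""
    calls: List[bytes] = []
    decoder = ChunkedDecoder(calls.append, per_chunk)
    for i in range(0, len(body), recv_size):
        decoder.feed(body[i:i + recv_size])
    return len(calls), b"".join(calls)


body = b"1\r\nA\r\n" * 1000 + b"0\r\n\r\n"
print(count_callbacks(body, 1000, per_chunk=True)[0])
print(count_callbacks(body, 1000, per_chunk=False)[0])
```

for 1,000 one-byte chunks delivered in 1,000-byte reads, the per-chunk policy makes 1,000 application callbacks and the per-`recv()` policy makes 6, while both deliver the same 1,000 payload bytes. the parsing loop itself runs once per chunk under either policy, so only the application boundary changes.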
every production server picks one of these as the default. the paper is a measurement of which one each server picked, and what it costs per chunk on a 1-vcpu container under two probe modes:
- mode a: bridge-coalesced. the prober writes the whole chunked body back-to-back with `TCP_NODELAY` off. the docker bridge coalesces many wire chunks into each server-side `recv()`. closest to pod-to-pod traffic.
- mode b: paced 100 µs. between every chunk the prober busy-waits 100 µs and `TCP_NODELAY` is on. each chunk leaves the prober as its own segment. closest to slow-drip attacker pacing or trans-wan bursts.
mode b is the strict comparator. mode b minus mode a (when both are present) isolates the server overhead from the prober’s pacing budget.
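a minimal sketch of the two send loops, assuming the request headers have already been written to the socket. this is not the paper's prober: `chunk_frames`, `busy_wait`, `send_mode_a`, `send_mode_b` and `RecordingSocket` are hypothetical names, and mode a is assumed here to issue one write per chunk with no gap.

```python
import socket
import time
from typing import Iterator, List, Tuple


def chunk_frames(n: int) -> Iterator[bytes]:
    """Yield n one-byte chunks, then the terminal chunk."""
    for _ in range(n):
        yield b"1\r\nA\r\n"
    yield b"0\r\n\r\n"


def busy_wait(seconds: float) -> None:
    """Spin on the clock; time.sleep() is usually too coarse for a 100 µs gap."""
    deadline = time.perf_counter() + seconds
    while time.perf_counter() < deadline:
        pass


def send_mode_a(sock, n: int) -> None:
    """Mode a: TCP_NODELAY off, every chunk written back-to-back with no gap."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0)
    for frame in chunk_frames(n):
        sock.sendall(frame)


def send_mode_b(sock, n: int, gap_s: float = 100e-6) -> None:
    """Mode b: TCP_NODELAY on, busy-wait gap_s after every chunk."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    for frame in chunk_frames(n):
        sock.sendall(frame)
        busy_wait(gap_s)


class RecordingSocket:
    """Stand-in for socket.socket so the sketch can be dry-run without a network."""

    def __init__(self) -> None:
        self.options: List[Tuple[int, int, int]] = []
        self.sent: List[bytes] = []

    def setsockopt(self, level: int, name: int, value: int) -> None:
        self.options.append((level, name, value))

    def sendall(self, data: bytes) -> None:
        self.sent.append(data)
```

both loops put identical bytes on the wire; only the socket option and the pacing differ, which is why the two modes can be compared cell by cell. `RecordingSocket` exists only to exercise the loops offline and would be replaced by a connected tcp socket in a real probe.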
the attack shape (“chunkloris”)
the attack shape: N one-byte chunks → N parser events → N application wakeups on per-chunk servers. on batched servers (kestrel via System.IO.Pipelines and similar) parser cpu is still per-chunk but the application boundary is decoupled from the wire framing rate.
the paper names this measured shape chunkloris because it is a low-payload http availability attack in the slowloris tradition: the attacker sends valid wire bytes that force the server to do excessive per-event work, and the limiting resource on the server is parser/dispatcher cpu rather than idle connection state.
a representative request:
POST /upload HTTP/1.1
Host: x
Transfer-Encoding: chunked
Content-Type: application/octet-stream
Connection: close

1\r\nA\r\n1\r\nA\r\n... (250,000 times) ...0\r\n\r\n
the body is 6 wire bytes per data byte, ≈ 1.5 mb on the wire to deliver 250,000 chunks. on a server with a 50 µs/chunk cost, a single connection consumes ≈ 12.5 cpu-seconds of a single core. the amplification metric is cpu-seconds per kib of attack traffic (here ≈ 8.5 cpu-milliseconds per kib), not “kib delivered per kib sent.”
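the arithmetic can be checked with a few lines. `chunkloris_cost` is a hypothetical helper written for this page; it counts body bytes only (request headers excluded) and includes the 5-byte terminal chunk, which the rounded figures in the prose ignore.

```python
def chunkloris_cost(n_chunks: int, us_per_chunk: float, payload_bytes: int = 1):
    """Return (wire_bytes, cpu_seconds, cpu_seconds_per_kib) for one request body."""
    size_line = len(format(payload_bytes, "x")) + 2   # hex size digits + CRLF
    per_chunk = size_line + payload_bytes + 2         # payload + trailing CRLF
    wire_bytes = n_chunks * per_chunk + len(b"0\r\n\r\n")
    cpu_seconds = n_chunks * us_per_chunk / 1e6
    return wire_bytes, cpu_seconds, cpu_seconds / (wire_bytes / 1024)


wire, cpu, per_kib = chunkloris_cost(250_000, 50.0)
print(wire, cpu, round(per_kib * 1000, 1))
```

for N = 250,000 at 50 µs/chunk this gives 1,500,005 body bytes, 12.5 cpu-seconds, and about 8.5 cpu-milliseconds per kib sent. substituting a measured mode-b value from the tables for the 50 µs placeholder gives the per-server figure.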
this shape is rfc-compliant: there is no protocol violation, and unlike slowloris the attacker has no need to hold the connection open by sending slowly.
why this is a “known issue” — and what the paper adds
the lineage is well-known in pieces:
- slowloris (2009) — connection-state exhaustion.
- read-rate / write-rate slow attacks documented for http servers since ~2012.
- per-chunk parser/dispatcher cost has been raised in multiple upstream issues, most directly in hyper #4008 (“provide built-in chunked request limits and cpu-safe streaming helpers”), closed as `not_planned` / `S-waiting-on-author`.
- nginx documents `proxy_request_buffering` and `client_body_buffer_size` as the mitigation; the body-buffering behavior is the reason this is not also a per-chunk problem behind a default nginx proxy.
what the paper contributes:
- a like-for-like cross-ecosystem measurement matrix: 27 http/1 + 17 h2 + 4 h3 + 6 ws servers, every cell measured on the same 1-vcpu docker container with cgroup-derived cpu cost.
- source citations for the parser callback boundary in every server (the per-framework deep-dives in section 4 of the pdf).
- a mitigation taxonomy — server-side aggregation, application-side chunk limits, frontend buffering, transport-layer rate limiting — and an empirically-confirmed characterization of the “deploy behind nginx” advice (it works; behind haproxy by default it does not).
- a counterexample at the application boundary: kestrel on asp.net core uses `System.IO.Pipelines` to batch application-delivery callbacks. parser cpu is still per-chunk, but application code wakes on coalesced reads.
headline figure
per-chunk server cpu cost across the 27-server canonical http/1 mode-b comparator (paced 100 µs gap, N = 250,000 one-byte chunks, 1 vcpu container). log-scale x-axis. kestrel batches application delivery but still has substantial parser cpu cost under this strict comparator.
verdict distribution by protocol. http/1, http/2, http/3 are dominated by per-chunk / per-frame verdicts. websocket is the one protocol where batching correctly is the common case.
wall time vs chunk count N under mode a (bridge-coalesced), restricted to the common N ∈ {50,000, 100,000, 250,000} cells. log-log axes with reference slopes at 1 (linear) and 2 (quadratic). the completed common-cell matrix is roughly linear. an earlier draft over-claimed a quadratic shape from incomplete cells; the rerun does not support that.
results, all 54 servers
http/1.1 chunked transfer encoding (27 servers)
| server | ecosystem | mode a (µs/chunk) | mode b (µs/chunk) | verdict |
|---|---|---|---|---|
| uvicorn (httptools) | python | 0.5 | 8.5 | per-chunk |
| daphne | python | 1.7 | 9.3 | per-chunk |
| uvicorn (h11) | python | 4.0 | 14.0 | per-chunk |
| hypercorn (h11) | python | 5.5 | 23.5 | per-chunk |
| granian | python | 18.6 | 23.4 | per-chunk |
| gunicorn (sync) | python | 2.7 | 105.6 | per-chunk |
| waitress | python | 18.4 | 108.0 | per-chunk |
| tornado | python | 18.4 | 104.3 | per-chunk |
| go net/http | go | 0.2 | 7.99 | per-chunk |
| gin | go | 0.2 | 11.6 | per-chunk |
| axum | rust | 0.5 | 4.8 | per-chunk |
| actix-web | rust | 0.2 | 3.8 | per-chunk |
| node http | node | 1.1 | 5.2 | per-chunk |
| express | node | 1.0 | 5.2 | per-chunk |
| fastify | node | 0.9 | 5.1 | per-chunk |
| vertx | jvm | 0.4 | 4.3 | per-chunk |
| spring boot (tomcat) | jvm | 0.5 | 12.4 | per-chunk |
| cowboy | beam | 1.7 | 14.0 | per-chunk |
| phoenix (cowboy2) | beam | 2.1 | 35.0 | per-chunk |
| bandit | beam | 67.1 | 82.2 | per-chunk |
| nginx | c | 0.2 | 113.6 | per-chunk |
| apache httpd | c | 0.2 | 15.3 | per-chunk |
| haproxy | c | 0.2 | 7.6 | per-chunk |
| kestrel 9 | dotnet | 1.3 | 3.6 | batches application delivery; parser still per-chunk |
| puma | ruby | 1.2 | 28.9 | per-chunk |
| unicorn | ruby | — | — | per-chunk (see page) |
| falcon | ruby | — | — | per-chunk (see page) |
http/2 data frames, h2c (17 servers)
| server | ecosystem | mode a (µs/frame) | mode b (µs/frame) | verdict |
|---|---|---|---|---|
| hypercorn h2 | python | 7.7 | 27.0 | per-frame |
| granian h2 | python | 16.3 | 28.8 | per-frame |
| node h2 | node | 0.8 | 5.2 | per-frame |
| fastify h2 | node | 1.1 | 6.1 | per-frame |
| kestrel h2 | dotnet | 2.0 | 58.6 | per-frame |
| vertx h2 | jvm | 0.5 | 4.4 | batches |
| rust hyper h2 | rust | 0.5 | 3.7 | batches |
| actix h2 | rust | 0.5 | 4.1 | batches |
| go h2c | go | 4.8 | 19.4 | per-frame |
| bandit h2 | beam | 1.0 | 17.3 | per-frame |
| cowboy h2 | beam | — | — | h2 GOAWAY 11 |
| phoenix cowboy h2 | beam | — | — | h2 GOAWAY 11 |
| spring tomcat h2 | jvm | — | — | h2 GOAWAY 11 |
| nginx h2 | c | 0.2 | 103.5 | per-frame |
| apache httpd h2 | c | 2.5 | 27.9 | per-frame |
| haproxy h2 | c | 0.3 | 8.2 | per-frame |
| falcon h2 | ruby | 3.1 | 11.4 | per-frame |
http/3 data frames over quic (4 servers)
| server | ecosystem | mode a (µs/frame) | mode b (µs/frame) | verdict |
|---|---|---|---|---|
| hypercorn h3 (aioquic) | python | 2.8 | 66.6 | per-frame |
| aioquic h3 | python | 1.4 | 36.3 | per-frame |
| quic-go h3 | go | 0.8 | 33.2 | per-frame |
| kestrel h3 (msquic) | dotnet | 2.3 | 67.4 | per-frame |
websocket text frames (6 servers)
| server | ecosystem | mode a (µs/frame) | mode b (µs/frame) | verdict |
|---|---|---|---|---|
| uvicorn + websockets | python | 4.7 | 5.0 | per-frame |
| uvicorn + wsproto | python | 10.2 | 26.6 | per-frame |
| node-ws | node | 0.23 | 5.1 | batches |
| gorilla websocket | go | 0.24 | 5.6 | batches |
| rust tungstenite | rust | 0.09 | 7.1 | batches |
| kestrel websocket | dotnet | 0.39 | 11.5 | batches |
see the http/1 vs /2 vs /3 protocol comparison for the cross-protocol summary, and any per-server page for the source citation and the full mode-a / mode-b measurement matrix for that server.
three observed parser/dispatcher boundary classes across the 54-server matrix.
mitigations available today
measured behavior of the three deployments. default nginx collapses N chunks into a single content-length-framed upstream request; nginx with proxy_request_buffering off and haproxy’s default streaming both forward the per-chunk shape to the upstream.
the paper’s mitigation section ranks these by where the work happens.
server-side aggregation (preferred). kestrel’s Http1ChunkedEncodingMessageBody.PumpAsync
drains the readable buffer into a System.IO.Pipelines pipe and wakes the
application with a coalesced PipeReader.ReadAsync result. application code
sees one event per buffer, not one per wire chunk. this is what every
event-loop http server should expose as an opt-in primitive.
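as an illustration of what such an opt-in primitive could look like, here is a byte-threshold aggregator in python. it is a sketch of the general idea, not kestrel's implementation (which coalesces per readable buffer rather than per threshold); `CoalescingBody` and its 8 kib default are assumptions made for the example.

```python
from typing import Callable, List


class CoalescingBody:
    """Buffer decoded chunks and wake the application once per threshold."""

    def __init__(self, deliver: Callable[[bytes], None], threshold: int = 8192) -> None:
        self.deliver = deliver
        self.threshold = threshold
        self.parts: List[bytes] = []
        self.size = 0

    def on_chunk(self, payload: bytes) -> None:
        """Called by the parser for every decoded wire chunk."""
        self.parts.append(payload)
        self.size += len(payload)
        if self.size >= self.threshold:
            self.flush()

    def flush(self) -> None:
        """Deliver whatever is buffered; call this at end of body."""
        if self.parts:
            self.deliver(b"".join(self.parts))
            self.parts, self.size = [], 0


wakeups: List[bytes] = []
body = CoalescingBody(wakeups.append)
for _ in range(20_000):
    body.on_chunk(b"A")
body.flush()
print(len(wakeups))
```

with 20,000 one-byte chunks the application is woken 3 times (8,192 + 8,192 + 3,616 bytes) instead of 20,000. the parser still runs once per chunk, and a real server would also need a timer or end-of-read flush so that slow senders are not held back until the threshold fills.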
application-side chunk limits. an explicit cap on the number of decoded
chunks per request, applied before the application handler reads them. hyper
issue #4008 proposed this as BodyExt::limit_chunks(N); that proposal is
currently closed not_planned. without it, most rust / go / python event-loop
servers can be capped only by total bytes, not by chunk count.
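a chunk-count cap is small to express once the framework exposes decoded chunks as a stream. the sketch below is a hypothetical python analogue that borrows the name from the hyper proposal; it is not hyper's api, and `TooManyChunks` is an invented exception type.

```python
from typing import Iterable, Iterator


class TooManyChunks(Exception):
    """Raised when a request body exceeds its chunk-count budget."""


def limit_chunks(chunks: Iterable[bytes], max_chunks: int) -> Iterator[bytes]:
    """Yield chunks unchanged, aborting once more than max_chunks have arrived."""
    for count, chunk in enumerate(chunks, start=1):
        if count > max_chunks:
            raise TooManyChunks(f"body exceeded {max_chunks} chunks")
        yield chunk


accepted = list(limit_chunks([b"A"] * 3, max_chunks=3))
print(len(accepted))
```

a handler would wrap its body stream in `limit_chunks` and turn the exception into an error response plus connection close. the cap bounds the work per request at `max_chunks` parser events; it does not make the chunks below the cap any cheaper.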
frontend buffering. default nginx (proxy_request_buffering on) collapses
N chunks into a single content-length-framed upstream request. apache
mod_proxy_http is documented to behave the same way. haproxy by default
streams; http-buffer-request waits for one buffer’s worth, not a per-chunk
accumulator.
transport-layer rate limiting. capping requests per second per source ip or per connection bounds the throughput of any single attacker but does not remove the per-chunk cost.
resources
- draft pdf: full paper, all sections.
- http/1 vs /2 vs /3 protocol comparison
- per-server pages: one for each of the 54 servers in the survey, with the parser path source citation and the full mode-a / mode-b measurement matrix.
- hyper #4008: upstream proposal for application-side chunk limits, closed `not_planned`.
- rfc 9112 §7.1: http/1.1 chunked transfer coding.
- rfc 9113 §6.1: http/2 data frames.
- rfc 9114 §7.2.1: http/3 data frames.
status
this is a working paper. the pdf linked above is the current draft; the
measurement matrix is complete for the cells described in the paper, but the
per-framework deep dives and the bibliography are still being expanded. the
wiki pages on this site are kept in sync with the measurement matrix in
data/all.json from the paper repo.