chunkloris: per-chunk http amplification

tl;dr

an http/1.1 request body sent as N one-byte chunked-transfer-encoding chunks is rfc-compliant. on almost every production http server we tested it forces one parser/dispatcher callback per chunk — and the per-chunk cpu cost is measurable in microseconds on a single core.

across the 54-server survey:

  • 27 / 27 http/1 servers retain a measurable per-chunk cpu cost under the paced one-byte-per-chunk comparator (mode b, N=250,000, 1 vcpu). the range is 3.6 µs/chunk (kestrel on asp.net core, parser overhead only — application delivery is batched) to 113.6 µs/chunk (nginx as origin). median ≈ 12.4 µs.
  • http/2 and http/3 inherit the shape per-frame. 14 of 17 measured h2c servers expose the per-frame cpu cost; 4 of 4 measured h3 servers do.
  • websocket is the one protocol where most implementations already batch correctly (node-ws, gorilla, rust-tungstenite, kestrel-ws). python asgi (uvicorn + websockets / wsproto) is the outlier.
  • the “deploy behind nginx” mitigation is empirically real for http/1: default proxy_request_buffering on collapses N chunks into a single content-length-framed upstream request. haproxy does not aggregate by default, and its http-buffer-request option waits only for one buffer’s worth of body, so it does not act as a per-chunk accumulator.

draft pdf →

what “per-chunk” means

the http/1.1 chunked transfer encoding (rfc 9112) fixes the wire format:

<hex-size>\r\n<bytes>\r\n
<hex-size>\r\n<bytes>\r\n
...
0\r\n\r\n

it does not fix how a parser schedules callbacks into the application:

  • one callback per wire chunk?
  • one callback per recv() batch (whatever the kernel happens to deliver)?
  • one callback per fixed-byte threshold (e.g. 8 kib)?
  • one callback after the entire body is buffered?

every production server picks one of these as its default. the paper measures which one each server picked and what that choice costs per chunk on a 1-vcpu container, under two probe modes:

  • mode a — bridge-coalesced. the prober writes the whole chunked body back-to-back with TCP_NODELAY off. the docker bridge coalesces many wire chunks into each server-side recv(). closest to pod-to-pod traffic.
  • mode b — paced 100 µs. between every chunk the prober busy-waits 100 µs and TCP_NODELAY is on. each chunk leaves the prober as its own segment. closest to slow-drip attacker pacing or trans-wan bursts.

mode b is the strict comparator. mode b minus mode a (when both are present) isolates the server overhead from the prober’s pacing budget.
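
the four callback-scheduling options listed earlier in this section can be compared with a small counting model. this is a sketch, not any server's parser: count_events, the policy names, and the 1,000-chunks-per-recv() batch size are made up for illustration, and it assumes a chunk never straddles two recv() calls.

```python
def count_events(recv_batches, policy, threshold=8192):
    """Count application callbacks for one request body.

    recv_batches: one entry per recv(); each entry lists the data sizes
                  of the wire chunks that arrived in that recv().
    """
    if policy == "per_chunk":
        # one callback for every wire chunk
        return sum(len(batch) for batch in recv_batches)
    if policy == "per_recv":
        # one callback for every non-empty recv()
        return sum(1 for batch in recv_batches if batch)
    if policy == "per_threshold":
        # one callback per `threshold` buffered bytes, plus a final flush
        total = sum(sum(batch) for batch in recv_batches)
        return -(-total // threshold)  # ceiling division
    if policy == "whole_body":
        # one callback after the entire body is buffered
        return 1
    raise ValueError(f"unknown policy: {policy}")


# mode-b-like arrival: every one-byte chunk lands in its own recv()
paced = [[1]] * 250_000
# mode-a-like arrival: chunks coalesced, 1,000 per recv() (hypothetical batch size)
coalesced = [[1] * 1000] * 250

for name in ("per_chunk", "per_recv", "per_threshold", "whole_body"):
    print(name, count_events(paced, name), count_events(coalesced, name))
```

in this model only the per-recv policy changes between the two arrival patterns (250,000 events paced vs 250 coalesced). that is why mode a can hide a per-recv server's cost and mode b exposes it.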

the attack shape (“chunkloris”)

diagram: N one-byte chunks travel through kernel TCP, the HTTP parser, and the application handler, producing N events on per-chunk servers vs few events on batched servers

the attack shape: N one-byte chunks → N parser events → N application wakeups on per-chunk servers. on batched servers (kestrel via System.IO.Pipelines and similar) parser cpu is still per-chunk but the application boundary is decoupled from the wire framing rate.

the paper names this measured shape chunkloris because it is a low-payload http availability attack in the slowloris tradition: the attacker sends valid wire bytes that force the server to do excessive per-event work, and the limiting resource on the server is parser/dispatcher cpu rather than idle connection state.

a representative request (headers, then the chunked body):

POST /upload HTTP/1.1
Host: x
Transfer-Encoding: chunked
Content-Type: application/octet-stream
Connection: close

1\r\nA\r\n1\r\nA\r\n... (250,000 times) ...0\r\n\r\n

that body is 6 wire bytes per data byte (1\r\nA\r\n), ≈ 1.5 mb on the wire to deliver 250,000 chunks. on a server with a per-chunk cost of 50 µs, a single connection consumes ≈ 12.5 cpu-seconds of a single core, or ≈ 8.5 cpu-milliseconds per kib sent. amplification is measured in cpu-seconds per kib of attack traffic, not “kib delivered per kib sent.”

this shape is rfc-compliant: there is no protocol violation, and the attacker does not need to hold the connection open by sending slowly.
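
the wire-size and cpu figures above can be reproduced with a few lines of python. this is a sketch for checking the arithmetic only: encode_chunks and cpu_seconds are illustrative helper names, chunk extensions and trailers are ignored, and the 50 µs/chunk cost is the example figure from the text rather than a measurement of a specific server.

```python
def encode_chunks(payload: bytes, chunk_size: int = 1) -> bytes:
    """Frame payload as an http/1.1 chunked body with fixed-size chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    out = bytearray()
    for i in range(0, len(payload), chunk_size):
        piece = payload[i:i + chunk_size]
        out += b"%x\r\n" % len(piece)  # hex size line
        out += piece + b"\r\n"         # chunk data and its terminator
    out += b"0\r\n\r\n"                # last-chunk, no trailers
    return bytes(out)


def cpu_seconds(n_chunks: int, us_per_chunk: float) -> float:
    """Server cpu time for one request at a given per-chunk cost."""
    return n_chunks * us_per_chunk / 1_000_000


body = encode_chunks(b"A" * 250_000)
print(len(body))                   # 1500005: 6 bytes per chunk plus the 5-byte terminator
print(cpu_seconds(250_000, 50.0))  # 12.5
```

plugging in the measured extremes from the survey instead of 50 µs gives about 0.9 cpu-seconds per request at 3.6 µs/chunk and about 28.4 cpu-seconds at 113.6 µs/chunk, for the same ≈ 1.5 mb body.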

why this is a “known issue” — and what the paper adds

the lineage is well-known in pieces:

  • slowloris (2009) — connection-state exhaustion.
  • read-rate / write-rate slow attacks documented for http servers since ~2012.
  • per-chunk parser/dispatcher cost has been raised in multiple upstream issues, most directly in hyper #4008 (“provide built-in chunked request limits and cpu-safe streaming helpers”) — closed not_planned / S-waiting-on-author.
  • nginx documents proxy_request_buffering and client_body_buffer_size as the mitigation; the body-buffering behavior is the reason this is not also a per-chunk problem behind a default nginx proxy.

what the paper contributes:

  1. a like-for-like cross-ecosystem measurement matrix: 27 http/1 + 17 h2 + 4 h3 + 6 ws servers, every cell measured on the same 1-vcpu docker container with cgroup-derived cpu cost.
  2. source citations for the parser callback boundary in every server (the per-framework deep-dives in section 4 of the pdf).
  3. a mitigation taxonomy — server-side aggregation, application-side chunk limits, frontend buffering, transport-layer rate limiting — and an empirically-confirmed characterization of the “deploy behind nginx” advice (it works; behind haproxy by default it does not).
  4. a counterexample at the application boundary: kestrel on asp.net core uses System.IO.Pipelines to batch application-delivery callbacks. parser cpu is still per-chunk, but application code wakes on coalesced reads.

headline figure

figure: ranked per-chunk CPU cost across 27 HTTP/1 servers under the paced Mode B 250k comparator

per-chunk server cpu cost across the 27-server canonical http/1 mode-b comparator (paced 100 µs gap, N = 250,000 one-byte chunks, 1 vcpu container). log-scale x-axis. kestrel batches application delivery but still has substantial parser cpu cost under this strict comparator.

figure: verdict distribution (VULNERABLE-PER-CHUNK, VULNERABLE-PER-FRAME, BATCHES-CORRECTLY, PROTECTED-H2-GOAWAY) across protocols

verdict distribution by protocol. http/1, http/2, http/3 are dominated by per-chunk / per-frame verdicts. websocket is the one protocol where batching correctly is the common case.

figure: wall time vs chunk count N under Mode A bridge-coalesced; log-log axes with reference slopes 1 and 2

wall time vs chunk count N under mode a (bridge-coalesced), restricted to the common N ∈ {50,000, 100,000, 250,000} cells. log-log axes with reference slopes at 1 (linear) and 2 (quadratic). the completed common-cell matrix is roughly linear. an earlier draft over-claimed a quadratic shape from incomplete cells; the rerun does not support that.

results, all 54 servers

http/1.1 chunked transfer encoding (27 servers)

server                ecosystem  mode a (µs/chunk)  mode b (µs/chunk)  verdict
uvicorn (httptools)   python     0.5                8.5                per-chunk
daphne                python     1.7                9.3                per-chunk
uvicorn (h11)         python     4.0                14.0               per-chunk
hypercorn (h11)       python     5.5                23.5               per-chunk
granian               python     18.6               23.4               per-chunk
gunicorn (sync)       python     2.7                105.6              per-chunk
waitress              python     18.4               108.0              per-chunk
tornado               python     18.4               104.3              per-chunk
go net/http           go         0.2                7.99               per-chunk
gin                   go         0.2                11.6               per-chunk
axum                  rust       0.5                4.8                per-chunk
actix-web             rust       0.2                3.8                per-chunk
node http             node       1.1                5.2                per-chunk
express               node       1.0                5.2                per-chunk
fastify               node       0.9                5.1                per-chunk
vertx                 jvm        0.4                4.3                per-chunk
spring boot (tomcat)  jvm        0.5                12.4               per-chunk
cowboy                beam       1.7                14.0               per-chunk
phoenix (cowboy2)     beam       2.1                35.0               per-chunk
bandit                beam       67.1               82.2               per-chunk
nginx                 c          0.2                113.6              per-chunk
apache httpd          c          0.2                15.3               per-chunk
haproxy               c          0.2                7.6                per-chunk
kestrel 9             dotnet     1.3                3.6                batches application delivery; parser still per-chunk
puma                  ruby       1.2                28.9               per-chunk
unicorn               ruby       -                  -                  per-chunk (see page)
falcon                ruby       -                  -                  per-chunk (see page)
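
the range and median quoted in the tl;dr can be re-derived from the mode b column. a quick check in python, assuming the values are transcribed correctly from the table above and leaving out unicorn and falcon, which have no numbers on this page:

```python
from statistics import median

# mode b, µs/chunk, for the 25 http/1 servers with a value in the table above
MODE_B_US_PER_CHUNK = {
    "uvicorn (httptools)": 8.5, "daphne": 9.3, "uvicorn (h11)": 14.0,
    "hypercorn (h11)": 23.5, "granian": 23.4, "gunicorn (sync)": 105.6,
    "waitress": 108.0, "tornado": 104.3, "go net/http": 7.99, "gin": 11.6,
    "axum": 4.8, "actix-web": 3.8, "node http": 5.2, "express": 5.2,
    "fastify": 5.1, "vertx": 4.3, "spring boot (tomcat)": 12.4,
    "cowboy": 14.0, "phoenix (cowboy2)": 35.0, "bandit": 82.2,
    "nginx": 113.6, "apache httpd": 15.3, "haproxy": 7.6,
    "kestrel 9": 3.6, "puma": 28.9,
}

values = sorted(MODE_B_US_PER_CHUNK.values())
fastest = min(MODE_B_US_PER_CHUNK, key=MODE_B_US_PER_CHUNK.get)
slowest = max(MODE_B_US_PER_CHUNK, key=MODE_B_US_PER_CHUNK.get)

print(len(values), values[0], median(values), values[-1])  # 25 3.6 12.4 113.6
print(fastest, slowest)                                    # kestrel 9 nginx
```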

http/2 data frames, h2c (17 servers)

server             ecosystem  mode a (µs/frame)  mode b (µs/frame)  verdict
hypercorn h2       python     7.7                27.0               per-frame
granian h2         python     16.3               28.8               per-frame
node h2            node       0.8                5.2                per-frame
fastify h2         node       1.1                6.1                per-frame
kestrel h2         dotnet     2.0                58.6               per-frame
vertx h2           jvm        0.5                4.4                batches
rust hyper h2      rust       0.5                3.7                batches
actix h2           rust       0.5                4.1                batches
go h2c             go         4.8                19.4               per-frame
bandit h2          beam       1.0                17.3               per-frame
cowboy h2          beam       -                  -                  h2 GOAWAY 11
phoenix cowboy h2  beam       -                  -                  h2 GOAWAY 11
spring tomcat h2   jvm        -                  -                  h2 GOAWAY 11
nginx h2           c          0.2                103.5              per-frame
apache httpd h2    c          2.5                27.9               per-frame
haproxy h2         c          0.3                8.2                per-frame
falcon h2          ruby       3.1                11.4               per-frame

http/3 data frames over quic (4 servers)

server                  ecosystem  mode a (µs/frame)  mode b (µs/frame)  verdict
hypercorn h3 (aioquic)  python     2.8                66.6               per-frame
aioquic h3              python     1.4                36.3               per-frame
quic-go h3              go         0.8                33.2               per-frame
kestrel h3 (msquic)     dotnet     2.3                67.4               per-frame

websocket text frames (6 servers)

server                ecosystem  mode a (µs/frame)  mode b (µs/frame)  verdict
uvicorn + websockets  python     4.7                5.0                per-frame
uvicorn + wsproto     python     10.2               26.6               per-frame
node-ws               node       0.23               5.1                batches
gorilla websocket     go         0.24               5.6                batches
rust tungstenite      rust       0.09               7.1                batches
kestrel websocket     dotnet     0.39               11.5               batches

see the http/1 vs /2 vs /3 protocol comparison for the cross-protocol summary, and any per-server page for the source citation and the full mode-a / mode-b measurement matrix for that server.

figure: three observed parser/dispatcher boundary classes: per-chunk / per-frame (43 of 54), application-boundary batched (5 of 54), and in-protocol abort via H2 GOAWAY (3 of 54)

three observed parser/dispatcher boundary classes across the 54-server matrix.

mitigations available today

figure: probe paths through nginx (default), nginx with proxy_request_buffering off, and HAProxy default, showing whether N chunks reach the upstream as 1 recv() or N recv()s

measured behavior of the three deployments. default nginx collapses N chunks into a single content-length-framed upstream request; nginx with proxy_request_buffering off and haproxy’s default streaming both forward the per-chunk shape to the upstream.

the paper’s mitigation section ranks these by where the work happens.

server-side aggregation (preferred). kestrel’s Http1ChunkedEncodingMessageBody.PumpAsync drains the readable buffer into a System.IO.Pipelines pipe and wakes the application with a coalesced PipeReader.ReadAsync result. application code sees one event per buffer, not one per wire chunk. this is what every event-loop http server should expose as an opt-in primitive.
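
a python analogy for the coalescing behavior described above. this is not kestrel's code and does not use System.IO.Pipelines; the class name CoalescingBody and its methods are invented for illustration. the parser side still does one feed() per decoded chunk, but the application side wakes once per read and receives everything buffered so far.

```python
import asyncio


class CoalescingBody:
    """Parser appends decoded chunks; the application reads them coalesced."""

    def __init__(self) -> None:
        self._buf = bytearray()
        self._eof = False
        self._ready = asyncio.Event()

    def feed(self, chunk: bytes) -> None:
        # called by the parser once per wire chunk (still per-chunk work)
        self._buf += chunk
        self._ready.set()

    def feed_eof(self) -> None:
        # called by the parser when the terminating zero-size chunk arrives
        self._eof = True
        self._ready.set()

    async def read(self) -> bytes:
        # wait until there is data or the body has ended
        while not self._buf and not self._eof:
            self._ready.clear()
            await self._ready.wait()
        data = bytes(self._buf)
        self._buf.clear()
        return data  # b"" signals end of body


async def demo() -> tuple[int, int]:
    body = CoalescingBody()
    for _ in range(1000):  # 1,000 one-byte chunks from the parser
        body.feed(b"A")
    body.feed_eof()

    reads, total = 0, 0
    while True:
        data = await body.read()
        if not data:
            break
        reads += 1
        total += len(data)
    return reads, total


print(asyncio.run(demo()))  # (1, 1000): one application wakeup for 1,000 chunks
```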

application-side chunk limits. an explicit cap on the number of decoded chunks per request, applied before the application handler reads them. hyper issue #4008 proposed this as BodyExt::limit_chunks(N); that proposal is currently closed not_planned. without it, most rust / go / python event-loop servers can be capped only by total bytes, not by chunk count.
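
a chunk-count cap of this kind can be sketched as a wrapper around any iterable of decoded chunks. the name limit_chunks is borrowed from the hyper proposal, but this python generator is only an illustration of the idea, not hyper's api or any server's built-in feature; TooManyChunks is a hypothetical exception type.

```python
from typing import Iterable, Iterator


class TooManyChunks(Exception):
    """Raised when a request body exceeds the allowed number of chunks."""


def limit_chunks(chunks: Iterable[bytes], max_chunks: int) -> Iterator[bytes]:
    """Yield chunks unchanged, failing as soon as the count exceeds max_chunks."""
    for count, chunk in enumerate(chunks, start=1):
        if count > max_chunks:
            raise TooManyChunks(f"body exceeded {max_chunks} chunks")
        yield chunk


# a handler would wrap its body stream before reading from it
small_body = [b"A"] * 3
print(b"".join(limit_chunks(small_body, max_chunks=3)))  # b'AAA'
```

the cap bounds the number of application wakeups per request. the parser still spends cpu on every chunk it decodes before the limit trips, so the limit is most useful when the error also closes the connection.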

frontend buffering. default nginx (proxy_request_buffering on) collapses N chunks into a single content-length-framed upstream request. apache mod_proxy_http is documented to behave the same way. haproxy streams by default, and its http-buffer-request option waits only for one buffer’s worth of body, so it does not act as a per-chunk accumulator.

transport-layer rate limiting. capping requests per second per source ip or per connection bounds the throughput of any single attacker but does not remove the per-chunk cost.

status

this is a working paper. the pdf linked above is the current draft; the measurement matrix is complete for the cells described in the paper, but the per-framework deep dives and the bibliography are still being expanded. the wiki pages on this site are kept in sync with the measurement matrix in data/all.json from the paper repo.
