classifying packet payloads with mimelens
this page walks through the deployment regime that mimelens was actually built for: classifying file content from network packets — not whole files. background and full benchmarks are on the mimelens overview page; the paper pdf is at /papers/mimelens-2026-draft.pdf.
the punchline up front: mimelens-medium-byte hits 85.5% top-1 on the 125-class libmagic mime taxonomy from a single 1.4 kb udp datagram. magika v1.1 on the same entire-stream prefix scores 61.9%; libmagic 5.46 on the same 4 kb prefix scores 79.1%; trid 2.24 self-consistent scores 72.9%. on real tcpdump captures, the byte cell wins at every cumulative threshold.
what you need
- python 3.12, uv, and the binary-embedding-paper repo
- a checkpoint: mimelens-001-medium-byte-s1.safetensors (no tokenizer file needed for the byte cell)
- a probe head: a multinomial logistic regression trained on clean 4 kb-head embeddings of the same cell, with the test streams excluded by content sha256
- a pcap file (tcpdump, wireshark, packet broker, etc.) carrying the payloads you care about
- scapy for pcap parsing

uv add scapy safetensors torch numpy scikit-learn

the byte cell is a 1022-byte classifier
practically speaking, mimelens-medium-byte with seq_len=1024 consumes:
[CLS] b1 b2 ... b1022 [SEP]

so it reads exactly the first 1022 bytes of whatever you hand it. a single 1448-byte udp datagram fully fills the model’s input window after stripping our 12-byte stream header ({stream_id: u32, seq_no: u32, total_n: u32}). subsequent packets in the same stream add bytes the model never reads.
this is the property the encoder buys you: the model was pretrained on 1024-token windows sampled uniformly at random across files and 64 kb fragments, so it does not care whether those 1022 bytes are the start of a file, the middle of a container, or a packet payload chopped out of the wire.
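the single-datagram claim is plain arithmetic, and a few lines make it checkable. this is a sketch: the constant and function names are ours, and the sizes are the ones quoted on this page.

```python
SEQ_LEN = 1024       # model input length in tokens
BODY = SEQ_LEN - 2   # [CLS] and [SEP] take two slots, leaving 1022 byte tokens
DATAGRAM = 1448      # udp payload size used on this page
HEADER = 12          # stream_id, seq_no, total_n as three u32 fields


def bytes_read(packet_bodies: list[int]) -> int:
    """how many payload bytes the byte cell reads from a stream,
    given the body length of each packet in arrival order."""
    return min(sum(packet_bodies), BODY)


one_packet = DATAGRAM - HEADER        # 1436 body bytes after the header
print(bytes_read([one_packet]))       # 1022
print(bytes_read([one_packet] * 3))   # 1022, extra packets add nothing
print(bytes_read([600]))              # 600, a short packet leaves the window underfilled
```

the last case is the one to watch in practice: if your first packet body is shorter than 1022 bytes, later packets do still contribute.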
end-to-end: capture, parse, classify
1. capture
sudo tcpdump -i lo -U -s 0 -w capture.pcap "udp port 9999"

flags worth pointing at:
- -U flushes per-packet so you can read partial files
- -s 0 keeps the full payload (no snaplen truncation; the model needs the bytes)
- replace -i lo and the filter with whatever interface and bpf expression match your real traffic
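if you have nothing to capture yet, you can generate loopback traffic yourself. the sketch below is not the paper’s sender; it is a hypothetical helper that chunks a byte string into 1448-byte datagrams carrying the 12-byte header described earlier. the function names, the 0-based seq_no, and the network byte order are assumptions.

```python
import socket
import struct

HEADER_FMT = "!III"                        # stream_id, seq_no, total_n (byte order assumed)
HEADER_LEN = struct.calcsize(HEADER_FMT)   # 12 bytes
DATAGRAM = 1448                            # udp payload size used on this page
CHUNK = DATAGRAM - HEADER_LEN              # 1436 body bytes per datagram


def packetize(data: bytes, stream_id: int) -> list[bytes]:
    """split data into header + body datagrams; seq_no starts at 0 (assumption)."""
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)] or [b""]
    total = len(chunks)
    return [struct.pack(HEADER_FMT, stream_id, seq, total) + chunk
            for seq, chunk in enumerate(chunks)]


def send_stream(data: bytes, stream_id: int,
                host: str = "127.0.0.1", port: int = 9999) -> None:
    """send one stream over udp to the port the tcpdump filter is watching."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for datagram in packetize(data, stream_id):
            sock.sendto(datagram, (host, port))
```

with tcpdump running, something like send_stream(open("sample.png", "rb").read(), stream_id=1) should leave one stream in capture.pcap (sample.png is a placeholder).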
2. parse the pcap
we use scapy because it gives us a per-packet iterator that streams cleanly:
from collections import defaultdict
from scapy.all import IP, UDP, PcapReader

def stream_payloads(pcap_path: str, port: int = 9999) -> dict[tuple, list[bytes]]:
    """returns {(src_ip, src_port, dst_ip, dst_port): [payload_bytes, ...]} in arrival order."""
    streams: dict[tuple, list[bytes]] = defaultdict(list)
    with PcapReader(pcap_path) as pr:
        for pkt in pr:
            # ipv4 only; read addresses from the ip layer, since pkt.src on an
            # ethernet frame is the link-layer (mac) address
            if IP not in pkt or UDP not in pkt:
                continue
            ip, udp = pkt[IP], pkt[UDP]
            if udp.sport != port and udp.dport != port:
                continue
            key = (ip.src, udp.sport, ip.dst, udp.dport)
            streams[key].append(bytes(udp.payload))
    return streams

if your transport carries its own framing (quic-tunnelled http, custom ipfix, anything with a sequence number), strip that header per-packet so what arrives at the classifier is the payload body, not the encapsulation.
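for the 12-byte header used on this page, stripping the framing might look like the sketch below. Frame and parse_frame are hypothetical names, and the network byte order is an assumption; adapt the format string to whatever your transport actually uses.

```python
import struct
from typing import NamedTuple, Optional

HEADER_FMT = "!III"                        # stream_id, seq_no, total_n (byte order assumed)
HEADER_LEN = struct.calcsize(HEADER_FMT)   # 12 bytes


class Frame(NamedTuple):
    stream_id: int
    seq_no: int
    total_n: int
    body: bytes


def parse_frame(payload: bytes) -> Optional[Frame]:
    """split one udp payload into header fields and body; None if it is too short."""
    if len(payload) < HEADER_LEN:
        return None
    stream_id, seq_no, total_n = struct.unpack_from(HEADER_FMT, payload)
    return Frame(stream_id, seq_no, total_n, payload[HEADER_LEN:])
```

keeping the header fields around, rather than just slicing them off, lets you group by stream_id and order by seq_no later.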
3. build the 4 kb window
zero-pad if you have less than 4 kb; right-truncate if you have more. for the byte cell only the first 1022 bytes matter, so a single ~1.4 kb packet body already covers everything the model reads.
WINDOW = 4096

def cumulative_window(payloads: list[bytes], k: int) -> bytes:
    cat = b"".join(payloads[:k])
    if len(cat) >= WINDOW:
        return cat[:WINDOW]
    return cat + b"\x00" * (WINDOW - len(cat))

note: zero-padding mid-window does not hurt the byte cell because it stops reading at byte 1022 anyway. it does affect bpe cells: zero bytes tokenize differently than real content, so a 1.4 kb-of-real + 2.6 kb-of-zeros window yields a different token sequence than a fully-real 4 kb window. bpe cells take ~3 packets (~4.3 kb) to fill their effective window; byte is flat from packet 1. this is exactly why byte is the recommended cell for partial-data deployment.
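a quick way to convince yourself that the byte cell is flat from packet 1: compare what it would read at each cumulative threshold. the sketch repeats cumulative_window so it runs on its own; byte_cell_view is a hypothetical helper, and the packet bodies are synthetic.

```python
WINDOW = 4096
BODY = 1022  # bytes the byte cell reads (seq_len 1024 minus [CLS] and [SEP])


def cumulative_window(payloads: list[bytes], k: int) -> bytes:
    cat = b"".join(payloads[:k])
    if len(cat) >= WINDOW:
        return cat[:WINDOW]
    return cat + b"\x00" * (WINDOW - len(cat))


def byte_cell_view(window: bytes) -> bytes:
    """the slice of the window that the byte cell actually tokenizes."""
    return window[:BODY]


# three synthetic 1436-byte packet bodies with distinct non-zero fill values
bodies = [bytes([fill]) * 1436 for fill in (1, 2, 3)]
views = {k: byte_cell_view(cumulative_window(bodies, k)) for k in (1, 2, 3)}

# the model input is identical at every threshold
assert views[1] == views[2] == views[3]
# at k=1 the window still carries 4096 - 1436 = 2660 bytes of zero padding
assert cumulative_window(bodies, 1).count(b"\x00") == 2660
```

the equality only holds when the first packet body is at least 1022 bytes; with shorter first packets, padding does enter the view until later packets arrive.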
4. encode → mean-pool → probe
import numpy as np
import torch
from safetensors.torch import load_file
from binary_embedding import _native
from binary_embedding.constants import BYTE_OFFSET, CLS_ID, NUM_SPECIAL_TOKENS, PAD_ID, SEP_ID
from binary_embedding.models.encoder import BinaryEncoder, medium_encoder_config

SEQ_LEN = 1024
DEV = "cuda" if torch.cuda.is_available() else "cpu"

# load byte cell: vocab_size = 256 + 7 specials = 263 for byte variant in this repo
cfg = medium_encoder_config(vocab_size=256 + NUM_SPECIAL_TOKENS, max_seq_len=SEQ_LEN)
model = BinaryEncoder(cfg)
model.load_state_dict(load_file("mimelens-001-medium-byte-s1.safetensors"), strict=False)
model.to(DEV).to(torch.bfloat16).eval()

def encode_byte(window: bytes) -> tuple[list[int], list[int]]:
    """returns (input_ids, attention_mask) of length SEQ_LEN."""
    body = SEQ_LEN - 2
    ids = [b + BYTE_OFFSET for b in window[:body]]
    out = [CLS_ID, *ids, SEP_ID]
    pad = SEQ_LEN - len(out)
    return out + [PAD_ID] * pad, [1] * (len(ids) + 2) + [0] * pad

@torch.no_grad()
def embed(windows: list[bytes], batch: int = 64) -> np.ndarray:
    """4 kb windows in, body-mean-pooled embeddings out."""
    pooled = []
    for i in range(0, len(windows), batch):
        chunk = windows[i:i + batch]
        ids_rows, attn_rows = zip(*(encode_byte(w) for w in chunk))
        ids = torch.tensor(ids_rows, dtype=torch.long, device=DEV)
        attn = torch.tensor(attn_rows, dtype=torch.long, device=DEV)
        h = model(ids, attn, labels=None, return_mlm_logits=False).hidden_states
        S = attn.size(1)
        pos = torch.arange(S, device=DEV).unsqueeze(0)
        lens = attn.sum(dim=1, keepdim=True)
        # keep body positions only: drop [CLS] at 0, [SEP] at lens - 1, and padding
        body_mask = ((pos >= 1) & (pos < (lens - 1))).to(h.dtype).unsqueeze(-1)
        v = (h * body_mask).sum(dim=1) / body_mask.sum(dim=1).clamp(min=1)
        pooled.append(v.float().cpu().numpy())
    return np.concatenate(pooled, axis=0)

three things to call out from the code above:
- mean-pool over body tokens, not cls. the cls_pool layer in the released checkpoints is byte-identical to initialization: mlm-only pretraining never sends a gradient through it. using the cls token reads a random projection. this is verified across all 28 checkpoints in the paper (section 5).
- byte offset. the byte variant reserves the first NUM_SPECIAL_TOKENS ids for [CLS], [SEP], [PAD], [MASK], etc., so raw byte value b becomes token id b + BYTE_OFFSET.
- bfloat16 + cuda. the cube was trained bf16; cpu inference works but expect ~547 ms/sample. for production gateway-side classification, batch and gpu.
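the body-only pooling is easier to see on a toy example without tensors. the sketch below is a pure-python illustration of the same masking rule, not the repo’s implementation; body_mean_pool is a hypothetical name and the hidden states are made up.

```python
def body_mean_pool(hidden: list[list[float]], attn: list[int]) -> list[float]:
    """hidden: one vector per token position; attn: 1 for real tokens, 0 for padding.
    averages only the body tokens, skipping [CLS], [SEP], and padding."""
    n_real = sum(attn)                       # [CLS] + body + [SEP]
    body = hidden[1:max(n_real - 1, 1)]      # positions 1 .. n_real - 2
    dim = len(hidden[0])
    if not body:                             # no body tokens: return zeros
        return [0.0] * dim
    return [sum(row[d] for row in body) / len(body) for d in range(dim)]


# positions: [CLS], b1, b2, [SEP], [PAD]
hidden = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [7.0, 7.0]]
attn = [1, 1, 1, 1, 0]
print(body_mean_pool(hidden, attn))  # [2.0, 3.0]
```

the large values at the [CLS], [SEP], and [PAD] positions never reach the output, which is the point of the mask.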
5. load the probe and predict
the probe is a multinomial logistic regression with solver="saga" trained on clean 4 kb-head embeddings of the same cell. for the appendix-d experiment we excluded the 500 test-stream sha256s from the probe training set. for your own pcap classification you do the same — train the probe on labelled clean files, exclude any content you intend to test, then deploy:
import pickle
import numpy as np
# at training time (one-off)
# from sklearn.linear_model import LogisticRegression
# from sklearn.preprocessing import StandardScaler
# X, y, cats = ... # clean 4 kb-head embeddings + libmagic ground-truth
# sc = StandardScaler().fit(X)
# clf = LogisticRegression(solver="saga", max_iter=4000, random_state=0).fit(sc.transform(X), y)
# pickle.dump({"scaler": sc, "clf": clf, "cats": cats}, open("probe_byte_medium.pkl", "wb"))
# at deploy time
probe = pickle.load(open("probe_byte_medium.pkl", "rb"))
sc, clf, cats = probe["scaler"], probe["clf"], probe["cats"]
embeddings = embed(windows) # (n, hidden)
probs = clf.predict_proba(sc.transform(embeddings))
top1 = probs.argmax(axis=1)
labels = [cats[i] for i in top1]

cats is the ordered list of mime-125 labels. the probe_byte_medium.pkl shipped with the paper artifacts is fine to reuse if your traffic distribution roughly matches the training corpus (binary-30k + magic-bpe + windows-drivers + glaurung).
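if you do train your own probe, the exclusion step mentioned above (drop any training content that also appears in your test set) can be done by content hash. this is a minimal sketch with hypothetical function names, assuming training samples are (bytes, label) pairs held in memory; it is not the paper’s script.

```python
import hashlib


def content_sha256(data: bytes) -> str:
    """hex sha256 of the raw content, used as the identity of a sample."""
    return hashlib.sha256(data).hexdigest()


def exclude_test_content(
    train: list[tuple[bytes, str]],
    test_payloads: list[bytes],
) -> list[tuple[bytes, str]]:
    """drop every training sample whose content hash appears in the test set."""
    held_out = {content_sha256(p) for p in test_payloads}
    return [(data, label) for data, label in train
            if content_sha256(data) not in held_out]


train = [(b"\x89PNG...", "image/png"), (b"fLaC...", "audio/flac")]
test_payloads = [b"fLaC..."]
print(exclude_test_content(train, test_payloads))  # only the png sample remains
```

hashing the full content catches exact duplicates only; near-duplicates with different bytes still pass through.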
putting it together
a minimal pcap → predictions driver, classifying every stream at every cumulative packet threshold:
import pickle, struct
from pathlib import Path
HEADER_FMT = "!III" # stream_id, seq_no, total_n
HEADER_LEN = struct.calcsize(HEADER_FMT)
streams = stream_payloads("capture.pcap", port=9999)
probe = pickle.load(open("probe_byte_medium.pkl", "rb"))
sc, clf, cats = probe["scaler"], probe["clf"], probe["cats"]
results = []
for key, payloads in streams.items():
    # if your transport has framing, strip it here. example for our 12-byte header:
    bodies = [p[HEADER_LEN:] for p in payloads if len(p) >= HEADER_LEN]
    if not bodies:
        continue
    # clamp thresholds to the stream length and drop duplicates
    for k in sorted({min(t, len(bodies)) for t in (1, 2, 3, 5, 10, len(bodies))}):
        w = cumulative_window(bodies, k)
        emb = embed([w])
        probs = clf.predict_proba(sc.transform(emb))[0]
        top3 = probs.argsort()[::-1][:3]
        results.append({
            "stream": key,
            "k": k,
            "predictions": [(cats[i], float(probs[i])) for i in top3],
        })

for row in results[:5]:
    print(row)

what you should and shouldn’t expect
what works.
- magic-byte-anchored formats (png, flac, 7z, gzip, webp) converge at k=1 — a single packet is enough.
- office formats (excel, ole storage) need k=2–3 to locate header structures several hundred bytes into the stream.
- the classification is position-agnostic by construction: feed it a window from offset 0, offset 50 kb, or offset 1.2 gb of a container — the model was pretrained that way.
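to check convergence on your own captures, you can look for the smallest k after which the top-1 label stops changing. the sketch below assumes rows shaped like the driver’s output for a single stream (a "k" and a "predictions" list of (label, probability) pairs); convergence_k is a hypothetical helper, not something from the paper artifacts.

```python
from typing import Optional


def convergence_k(rows: list[dict]) -> Optional[int]:
    """smallest k from which the top-1 label equals the final top-1 label
    at every later threshold; None if there are no rows."""
    ordered = sorted(rows, key=lambda r: r["k"])
    if not ordered:
        return None
    final_label = ordered[-1]["predictions"][0][0]
    k_star = None
    # walk backwards from the last threshold until the label first differs
    for row in reversed(ordered):
        if row["predictions"][0][0] != final_label:
            break
        k_star = row["k"]
    return k_star


rows = [
    {"k": 1, "predictions": [("text/plain", 0.51)]},
    {"k": 2, "predictions": [("image/png", 0.88)]},
    {"k": 3, "predictions": [("image/png", 0.93)]},
]
print(convergence_k(rows))  # 2
```

this measures stability, not correctness: a stream can converge early on the wrong label, so compare against ground truth where you have it.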
what doesn’t.
- text/plain hovers around 55–65% across all k. genuine ambiguity in 1–4 kb of plain text: the same span could be source code, a config file, a log, a markdown doc.
- application/octet-stream is by definition the catch-all libmagic label; the model agrees with libmagic’s “i don’t know” by giving you octet-stream back. that’s not failure, it’s faithful reproduction of the taxonomy.
- encrypted transports trivially defeat byte-level classification. tls, quic, ssh: by design the payload bytes look uniform-random. mimelens is for cleartext-payload regimes.
- packet loss and reordering on real wans. the paper appendix runs on lo with deterministic in-order delivery. real conditions add loss, reordering, retransmission. transmit-order sorting by sequence number is the simplest fix and works under modest loss; aggressive reordering or loss past 5% degrades the byte cell faster than the bpe cells because the byte cell’s effective input is exactly bytes 1..1022 of the concatenation.
- adversarial header corruption. if you expect head-byte corruption (packed binaries, intentional obfuscation, truncation), use the bpe-64k cell instead. under directed perturbations of the first 4 / 16 / 64 bytes, bpe-64k loses 2–7 pp while byte and bpe-16k lose 2–16 pp. the worst clean-input cell is the most adversarially robust.
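the sequence-number fix mentioned in the packet-loss item can be sketched as follows. this assumes the 12-byte header from this page in network byte order, a 0-based seq_no, and one stream per input list; reorder is a hypothetical helper. duplicates keep the first copy that arrived.

```python
import struct

HEADER_FMT = "!III"                        # stream_id, seq_no, total_n (byte order assumed)
HEADER_LEN = struct.calcsize(HEADER_FMT)   # 12 bytes


def reorder(payloads: list[bytes]) -> tuple[list[bytes], list[int]]:
    """returns (bodies sorted by seq_no, missing seq_nos) for one stream."""
    by_seq: dict[int, bytes] = {}
    total = 0
    for payload in payloads:
        if len(payload) < HEADER_LEN:
            continue                                   # too short to carry a header
        _stream_id, seq_no, total_n = struct.unpack_from(HEADER_FMT, payload)
        total = max(total, total_n)
        by_seq.setdefault(seq_no, payload[HEADER_LEN:])  # first arrival wins
    bodies = [by_seq[s] for s in sorted(by_seq)]
    missing = [s for s in range(total) if s not in by_seq]
    return bodies, missing
```

if seq_no 0 shows up in the missing list, the byte cell is no longer looking at the head of the stream, so treat that prediction with more suspicion than one from a stream with only later packets lost.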
related
- mimelens overview: the family, the cube, the numbers, the caveats.
- draft paper (pdf): appendix d covers the udp-loopback experiment in full, with per-format convergence curves, libmagic and trid baselines, and reproducibility notes.
- binary-bpe: the bbpe rust crate that trains and runs the bpe tokenizers for the non-byte cells.
- magika: google’s whole-file classifier. ~348× faster than mimelens on cpu; the right tool for sub-millisecond broad-category whole-file triage. wrong tool for chunks.