classifying packet payloads with mimelens

this page walks through the deployment regime that mimelens was actually built for: classifying file content from network packets — not whole files. background and full benchmarks are on the mimelens overview page; the paper pdf is at /papers/mimelens-2026-draft.pdf.

the punchline up front: mimelens-medium-byte hits 85.5% top-1 on the 125-class libmagic mime taxonomy from a single 1.4 kb udp datagram. magika v1.1 on the same entire-stream prefix scores 61.9%; libmagic 5.46 on the same 4 kb prefix scores 79.1%; trid 2.24 self-consistent scores 72.9%. on real tcpdump captures, the byte cell wins at every cumulative packet threshold k (the first k packets of a stream).

what you need

  • python 3.12, uv, and the binary-embedding-paper repo
  • a checkpoint: mimelens-001-medium-byte-s1.safetensors (no tokenizer file needed for the byte cell)
  • a probe head: a multinomial logistic regression trained on clean 4 kb-head embeddings of the same cell, with the test streams excluded by content sha256
  • a pcap file (tcpdump, wireshark, packet broker, etc.) carrying the payloads you care about
  • scapy for pcap parsing

install the python dependencies:

uv add scapy safetensors torch numpy scikit-learn

the byte cell is a 1022-byte classifier

practically speaking, mimelens-medium-byte with seq_len=1024 consumes:

[CLS] b1 b2 ... b1022 [SEP]

so it reads exactly the first 1022 bytes of whatever you hand it. a single 1448-byte udp datagram still carries 1436 bytes after stripping our 12-byte stream header ({stream_id: u32, seq_no: u32, total_n: u32}), which more than fills the model's 1022-byte input window. subsequent packets in the same stream add bytes the model never reads.

this is the property the encoder buys you: the model was pretrained on 1024-token windows sampled uniformly at random across files and 64 kb fragments, so it does not care whether those 1022 bytes are the start of a file, the middle of a container, or a packet payload chopped out of the wire.
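
the byte budget above can be checked by hand. a minimal sketch of the arithmetic, assuming the 1448 bytes quoted above are the udp payload and that the header is three network-order u32 fields (the constant names are illustrative, not the repo's):

```python
import struct

SEQ_LEN = 1024                    # model input length in tokens
BODY_TOKENS = SEQ_LEN - 2         # [CLS] and [SEP] take one slot each
HEADER_FMT = "!III"               # stream_id, seq_no, total_n as network-order u32
HEADER_LEN = struct.calcsize(HEADER_FMT)
DATAGRAM_LEN = 1448               # assumed udp payload size, as quoted on this page

body_bytes = DATAGRAM_LEN - HEADER_LEN        # bytes left after the stream header
bytes_read = min(body_bytes, BODY_TOKENS)     # what the byte cell actually sees
bytes_ignored = body_bytes - bytes_read       # tail of packet 1 the model never reads

print(HEADER_LEN, body_bytes, bytes_read, bytes_ignored)  # 12 1436 1022 414
```

so roughly the last 400 bytes of the first packet, and everything in later packets, never reach the byte cell.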

end-to-end: capture, parse, classify

1. capture

sudo tcpdump -i lo -U -s 0 -w capture.pcap "udp port 9999"

flags worth pointing at:

  • -U flushes per-packet so you can read partial files
  • -s 0 keeps the full payload (no snaplen truncation — the model needs the bytes)
  • replace -i lo and the filter with whatever interface and bpf expression matches your real traffic

2. parse the pcap

we use scapy because it gives us a per-packet iterator that streams cleanly:

from collections import defaultdict
from scapy.all import IP, PcapReader, UDP

def stream_payloads(pcap_path: str, port: int = 9999) -> dict[tuple, list[bytes]]:
    """returns {(src_ip, src_port, dst_ip, dst_port): [payload_bytes, ...]} in arrival order."""
    streams: dict[tuple, list[bytes]] = defaultdict(list)
    with PcapReader(pcap_path) as pr:
        for pkt in pr:
            # ipv4 only; add an IPv6 branch if your traffic needs it
            if IP not in pkt or UDP not in pkt:
                continue
            ip, udp = pkt[IP], pkt[UDP]
            if udp.sport != port and udp.dport != port:
                continue
            # read addresses from the ip layer: pkt.src on an ethernet frame is the mac address
            key = (ip.src, udp.sport, ip.dst, udp.dport)
            streams[key].append(bytes(udp.payload))
    return streams

if your transport carries its own framing — quic-tunnelled http, custom ipfix, anything with a sequence number — strip that header per-packet so what arrives at the classifier is the payload body, not the encapsulation.
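
as one way to do that stripping, here is a minimal sketch for the 12-byte header used on this page. it assumes the three fields are network-order u32 values; split_frame and group_by_stream are hypothetical helper names, not part of the repo:

```python
import struct
from collections import defaultdict

HEADER_FMT = "!III"                        # stream_id, seq_no, total_n (network byte order)
HEADER_LEN = struct.calcsize(HEADER_FMT)   # 12 bytes

def split_frame(payload: bytes):
    """return (stream_id, seq_no, total_n, body), or None if the packet is too short."""
    if len(payload) < HEADER_LEN:
        return None
    stream_id, seq_no, total_n = struct.unpack_from(HEADER_FMT, payload)
    return stream_id, seq_no, total_n, payload[HEADER_LEN:]

def group_by_stream(payloads):
    """group packet bodies by the stream_id in the header, keeping seq_no alongside."""
    grouped = defaultdict(list)
    for p in payloads:
        frame = split_frame(p)
        if frame is None:
            continue                        # runt packet: no room for a header
        stream_id, seq_no, _total_n, body = frame
        grouped[stream_id].append((seq_no, body))
    return dict(grouped)

# tiny demo with hand-built packets (made-up contents)
pkts = [
    struct.pack(HEADER_FMT, 7, 0, 2) + b"\x89PNG",
    struct.pack(HEADER_FMT, 7, 1, 2) + b"rest",
    b"short",                               # dropped: shorter than the header
]
demo = group_by_stream(pkts)
```

grouping on the in-band stream_id rather than on the 4-tuple keeps two streams separate when they share the same source and destination ports.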

3. build the 4 kb window

zero-pad if you have less than 4 kb; right-truncate if you have more. for the byte cell only the first 1022 bytes of the window matter, because the model never reads past byte 1022, so a single ~1.4 kb packet body already covers everything it sees.

WINDOW = 4096

def cumulative_window(payloads: list[bytes], k: int) -> bytes:
    cat = b"".join(payloads[:k])
    if len(cat) >= WINDOW:
        return cat[:WINDOW]
    return cat + b"\x00" * (WINDOW - len(cat))

note: zero-padding the tail of the window does not hurt the byte cell as long as the first packet body carries at least 1022 bytes, because the model stops reading at byte 1022 anyway. it does affect bpe cells: zero bytes tokenize differently than real content, so a window of 1.4 kb real data plus 2.6 kb zeros yields a different token sequence than a fully real 4 kb window. bpe cells take ~3 packets (~4.3 kb) to fill their effective window; byte is flat from packet 1. this is exactly why byte is the recommended cell for partial-data deployment.
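
to see why the byte cell is flat in k, the sketch below compares the bytes it would read at k = 1, 2, 3. it uses synthetic packet bodies and a hypothetical byte_cell_view helper that mimics the 1022-byte cutoff; no model is involved:

```python
WINDOW = 4096
BODY_BYTES = 1022   # bytes the byte cell reads (seq_len 1024 minus [CLS] and [SEP])

def cumulative_window(payloads, k):
    """same padding/truncation rule as the function above."""
    cat = b"".join(payloads[:k])
    return cat[:WINDOW] if len(cat) >= WINDOW else cat + b"\x00" * (WINDOW - len(cat))

def byte_cell_view(window):
    """the slice of the window that actually reaches the byte cell."""
    return window[:BODY_BYTES]

# three synthetic 1436-byte packet bodies (1448 bytes minus the 12-byte header)
packets = [bytes([i + 1]) * 1436 for i in range(3)]

views = [byte_cell_view(cumulative_window(packets, k)) for k in (1, 2, 3)]
same_for_all_k = all(v == views[0] for v in views)      # True: later packets change nothing

# the exception is a short first packet: padding then lands inside the first 1022 bytes
short_view = byte_cell_view(cumulative_window([b"\x01" * 600], 1))
padding_seen = short_view.count(0)                       # 422 zero bytes reach the model
```

if your first packet bodies are routinely shorter than 1022 bytes, the byte cell's prediction can still change at k=2, so it is worth checking the payload size distribution before relying on k=1.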

4. encode → mean-pool → probe

import numpy as np
import torch
from safetensors.torch import load_file
from binary_embedding import _native
from binary_embedding.constants import BYTE_OFFSET, CLS_ID, NUM_SPECIAL_TOKENS, PAD_ID, SEP_ID
from binary_embedding.models.encoder import BinaryEncoder, medium_encoder_config

SEQ_LEN = 1024
DEV = "cuda" if torch.cuda.is_available() else "cpu"

# load byte cell — vocab_size = 256 + 7 specials = 263 for byte variant in this repo
cfg = medium_encoder_config(vocab_size=256 + NUM_SPECIAL_TOKENS, max_seq_len=SEQ_LEN)
model = BinaryEncoder(cfg)
model.load_state_dict(load_file("mimelens-001-medium-byte-s1.safetensors"), strict=False)
model.to(DEV).to(torch.bfloat16).eval()

def encode_byte(window: bytes) -> tuple[list[int], list[int]]:
    """returns (input_ids, attention_mask) of length SEQ_LEN."""
    body = SEQ_LEN - 2
    ids = [b + BYTE_OFFSET for b in window[:body]]
    out = [CLS_ID, *ids, SEP_ID]
    pad = SEQ_LEN - len(out)
    return out + [PAD_ID] * pad, [1] * (len(ids) + 2) + [0] * pad

@torch.no_grad()
def embed(windows: list[bytes], batch: int = 64) -> np.ndarray:
    """4 kb windows in, body-mean-pooled embeddings out."""
    pooled = []
    for i in range(0, len(windows), batch):
        chunk = windows[i:i + batch]
        ids_rows, attn_rows = zip(*(encode_byte(w) for w in chunk))
        ids = torch.tensor(ids_rows, dtype=torch.long, device=DEV)
        attn = torch.tensor(attn_rows, dtype=torch.long, device=DEV)
        h = model(ids, attn, labels=None, return_mlm_logits=False).hidden_states
        S = attn.size(1)
        pos = torch.arange(S, device=DEV).unsqueeze(0)
        lens = attn.sum(dim=1, keepdim=True)
        body_mask = ((pos >= 1) & (pos < (lens - 1))).to(h.dtype).unsqueeze(-1)
        v = (h * body_mask).sum(dim=1) / body_mask.sum(dim=1).clamp(min=1)
        pooled.append(v.float().cpu().numpy())
    return np.concatenate(pooled, axis=0)

three things to call out from the code above:

  1. mean-pool over body tokens, not cls. the cls_pool layer in the released checkpoints is byte-identical to initialization, because mlm-only pretraining never sends a gradient through it. pooling from the cls token therefore reads a random projection. this is verified across all 28 checkpoints in the paper (section 5).
  2. byte offset. the byte variant reserves the first NUM_SPECIAL_TOKENS ids for [CLS], [SEP], [PAD], [MASK], etc., so raw byte value b becomes token id b + BYTE_OFFSET.
  3. bfloat16 + cuda. the cube was trained in bf16; cpu inference works, but expect ~547 ms/sample. for production gateway-side classification, batch your inputs and run on a gpu.
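
the body-only pooling in point 1 is easy to sanity-check without a model. this is a numpy re-statement of the same masking logic on a toy tensor; body_mean_pool is an illustrative name, and the hidden values are made up:

```python
import numpy as np

def body_mean_pool(h: np.ndarray, attn: np.ndarray) -> np.ndarray:
    """h: (batch, seq, hidden); attn: (batch, seq) of 0/1. mean over body tokens only."""
    seq = attn.shape[1]
    pos = np.arange(seq)[None, :]                 # (1, seq)
    lens = attn.sum(axis=1, keepdims=True)        # real tokens per row, incl. [CLS]/[SEP]
    body = (pos >= 1) & (pos < lens - 1)          # drop [CLS] at 0 and [SEP] at lens-1
    mask = body[..., None].astype(h.dtype)        # (batch, seq, 1)
    return (h * mask).sum(axis=1) / np.maximum(mask.sum(axis=1), 1)

# toy row: [CLS] a b [SEP] [PAD] [PAD], hidden size 2
h = np.array([[[100, 100], [1, 2], [3, 4], [200, 200], [9, 9], [9, 9]]], dtype=np.float32)
attn = np.array([[1, 1, 1, 1, 0, 0]])
pooled = body_mean_pool(h, attn)                  # [[2., 3.]]
```

the large values at the [CLS] and [SEP] positions and the padding rows do not leak into the result, which is the property the embed function relies on.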

5. load the probe and predict

the probe is a multinomial logistic regression with solver="saga" trained on clean 4 kb-head embeddings of the same cell. for the appendix-d experiment we excluded the 500 test-stream sha256s from the probe training set. for your own pcap classification you do the same — train the probe on labelled clean files, exclude any content you intend to test, then deploy:

import pickle
import numpy as np

# at training time (one-off)
# from sklearn.linear_model import LogisticRegression
# from sklearn.preprocessing import StandardScaler
# X, y, cats = ...  # clean 4 kb-head embeddings + libmagic ground-truth
# sc = StandardScaler().fit(X)
# clf = LogisticRegression(solver="saga", max_iter=4000, random_state=0).fit(sc.transform(X), y)
# pickle.dump({"scaler": sc, "clf": clf, "cats": cats}, open("probe_byte_medium.pkl", "wb"))

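# at deploy time (only unpickle probes you trust: pickle can execute arbitrary code)
with open("probe_byte_medium.pkl", "rb") as f:
    probe = pickle.load(f)
sc, clf, cats = probe["scaler"], probe["clf"], probe["cats"]

# windows: the list of 4 kb windows built in step 3
embeddings = embed(windows)              # (n, hidden)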
probs = clf.predict_proba(sc.transform(embeddings))
top1 = probs.argmax(axis=1)
labels = [cats[i] for i in top1]

cats is the ordered list of mime-125 labels. the probe_byte_medium.pkl shipped with the paper artifacts is fine to reuse if your traffic distribution roughly matches the training corpus (binary-30k + magic-bpe + windows-drivers + glaurung).
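
the sha256 exclusion mentioned in this step could be sketched as follows. this is an assumption about the mechanics, not the paper's script: it hashes whatever bytes you pass in (whole files or stream bodies), and exclude_test_content is a hypothetical helper name:

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def exclude_test_content(train_items, test_blobs):
    """drop every (blob, label) pair whose content hash matches a held-out test blob."""
    held_out = {sha256_hex(b) for b in test_blobs}
    return [(blob, label) for blob, label in train_items
            if sha256_hex(blob) not in held_out]

# made-up miniature corpus
train = [
    (b"\x89PNG-example", "image/png"),
    (b"PK\x03\x04-example", "application/zip"),
    (b"%PDF-1.7-example", "application/pdf"),
]
test = [b"PK\x03\x04-example"]
kept = exclude_test_content(train, test)
kept_labels = [label for _blob, label in kept]
```

filtering on content hash rather than filename catches the same file appearing under two names in different corpora.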

putting it together

a minimal pcap → predictions driver, classifying every stream at every cumulative packet threshold:

import pickle
import struct

HEADER_FMT = "!III"      # stream_id, seq_no, total_n
HEADER_LEN = struct.calcsize(HEADER_FMT)

streams = stream_payloads("capture.pcap", port=9999)

probe = pickle.load(open("probe_byte_medium.pkl", "rb"))
sc, clf, cats = probe["scaler"], probe["clf"], probe["cats"]

results = []
for key, payloads in streams.items():
    # if your transport has framing, strip it here. example for our 12-byte header:
    bodies = [p[HEADER_LEN:] for p in payloads if len(p) >= HEADER_LEN]

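    if not bodies:
        continue
    # clamp thresholds to the packets we actually have and drop duplicates
    for k in sorted({min(k, len(bodies)) for k in (1, 2, 3, 5, 10, len(bodies))}):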
        w = cumulative_window(bodies, k)
        emb = embed([w])
        probs = clf.predict_proba(sc.transform(emb))[0]
        top3 = probs.argsort()[::-1][:3]
        results.append({
            "stream": key,
            "k": k,
            "predictions": [(cats[i], float(probs[i])) for i in top3],
        })

for row in results[:5]:
    print(row)
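
one way to summarize the driver's output is to ask, per stream, at which k the top-1 label settles. a minimal sketch, assuming rows shaped like the results list above; first_stable_k is a hypothetical helper and the demo rows are invented, not measured:

```python
def first_stable_k(rows):
    """for each stream, return (smallest k after which top-1 never changes, final label)."""
    by_stream = {}
    for row in rows:
        top1_label = row["predictions"][0][0]
        by_stream.setdefault(row["stream"], []).append((row["k"], top1_label))

    stable = {}
    for stream, seq in by_stream.items():
        seq.sort()                          # order by k
        final_label = seq[-1][1]
        k_stable = seq[-1][0]
        for k, label in reversed(seq):      # walk back while the label matches the final one
            if label != final_label:
                break
            k_stable = k
        stable[stream] = (k_stable, final_label)
    return stable

# invented rows, for illustration only
demo_rows = [
    {"stream": "a", "k": 1, "predictions": [("image/png", 0.98)]},
    {"stream": "a", "k": 2, "predictions": [("image/png", 0.99)]},
    {"stream": "b", "k": 1, "predictions": [("application/octet-stream", 0.41)]},
    {"stream": "b", "k": 2, "predictions": [("application/vnd.ms-excel", 0.77)]},
    {"stream": "b", "k": 3, "predictions": [("application/vnd.ms-excel", 0.91)]},
]
summary = first_stable_k(demo_rows)
```

a histogram of the resulting k values per label gives a rough per-format convergence picture for your own traffic.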

what you should and shouldn’t expect

what works.

  • magic-byte-anchored formats (png, flac, 7z, gzip, webp) converge at k=1 — a single packet is enough.
  • office formats (excel, ole storage) need k=2–3 to locate header structures several hundred bytes into the stream.
  • the classification is position-agnostic by construction: feed it a window from offset 0, offset 50 kb, or offset 1.2 gb of a container — the model was pretrained that way.

what doesn’t.

  • text/plain hovers around 55–65% across all k. genuine ambiguity in 1–4 kb of plain text — the same span could be source code, a config file, a log, a markdown doc.
  • application/octet-stream is by definition the catch-all libmagic label; the model agrees with libmagic’s “i don’t know” by giving you octet-stream back. that’s not failure, it’s faithful reproduction of the taxonomy.
  • encrypted transports trivially defeat byte-level classification. tls, quic, ssh: by design the payload bytes look uniform-random. mimelens is for cleartext-payload regimes.
  • packet loss and reordering on real wans. the paper appendix runs on lo with deterministic in-order delivery. real conditions add loss, reordering, retransmission. transmit-order sorting by sequence number is the simplest fix and works under modest loss; aggressive reordering or loss past 5% degrades the byte cell faster than the bpe cells because the byte cell’s effective input is exactly bytes 1..1022 of the concatenation.
  • adversarial header corruption. if you expect head-byte corruption — packed binaries, intentional obfuscation, truncation — use the bpe-64k cell instead. under directed perturbations of the first 4 / 16 / 64 bytes, bpe-64k loses 2–7 pp while byte and bpe-16k lose 2–16 pp. the worst clean-input cell is the most adversarially robust.
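
the transmit-order fix from the packet-loss point above can be sketched in a few lines. this assumes seq_no starts at 0 and total_n is the packet count announced in the header; reassemble is a hypothetical helper, not part of the repo:

```python
def reassemble(frames, total_n):
    """frames: iterable of (seq_no, body) in arrival order.

    returns (bodies_in_transmit_order, loss_fraction, head_intact).
    duplicates keep the first copy; bodies after a gap are still returned,
    so the caller decides whether to trust them.
    """
    first_seen = {}
    for seq_no, body in frames:
        if 0 <= seq_no < total_n and seq_no not in first_seen:
            first_seen[seq_no] = body
    ordered = [first_seen[s] for s in sorted(first_seen)]
    loss = 1.0 - len(first_seen) / total_n if total_n else 0.0
    head_intact = 0 in first_seen           # the byte cell depends on packet 0 above all
    return ordered, loss, head_intact

# made-up arrival pattern: packet 1 lost, packet 2 duplicated, packet 0 late
arrived = [(2, b"C"), (0, b"A"), (2, b"C-dup"), (3, b"D")]
bodies, loss, head_intact = reassemble(arrived, total_n=4)
```

because the byte cell reads only the first 1022 bytes of the concatenation, a missing packet 0 shifts its entire input, so head_intact is worth checking before classifying.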

see also

  • mimelens overview: the family, the cube, the numbers, the caveats.
  • draft paper (pdf): appendix d covers the udp-loopback experiment in full, with per-format convergence curves, libmagic and trid baselines, and reproducibility notes.
  • binary-bpe: the bbpe rust crate that trains and runs the bpe tokenizers for the non-byte cells.
  • magika: google's whole-file classifier. ~348× faster than mimelens on cpu; the right tool for sub-millisecond broad-category whole-file triage. wrong tool for chunks.