tokenizing raw executables for malware analysis with bbpe

pretrained models: mjbommar/binary-tokenizer-001-{4k,8k,16k,32k,64k} on HuggingFace
what: bpe tokenizers for raw binary executables — ELF, PE, Mach-O, APK
compression: 2.89 bytes/token (64K vocab), 3x context vs raw byte tokenization
use cases: malware classification, binary similarity, function identification
license: Apache-2.0 (code), CC-BY-4.0 (dataset)

overview

bbpe applies byte pair encoding to raw binary executables — ELF, PE, Mach-O, and APK files. the “binary” in the name refers to compiled binaries, firmware, and malware samples, not binary classification.

the pretrained tokenizers are available on HuggingFace in five vocabulary sizes (4K through 64K) and can be loaded directly with the tokenizers python library. the 64K tokenizer achieves 2.89 bytes/token compression, giving transformers roughly 3x more context per token compared to raw byte-level tokenization.

the tokenizer family was trained on 30 GB of cross-platform binaries spanning 10 architectures. all tokenizers form a nested prefix hierarchy — the 4K vocabulary is a strict prefix of the 8K, which is a prefix of the 16K, and so on — enabling seamless embedding transfer when scaling models.
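the prefix property means token id i denotes the same byte sequence in every vocabulary, so a model trained on the 4K vocab can be scaled up by copying its embedding rows into a larger table. a minimal numpy sketch of that transfer (the embedding dimension is illustrative, and the random matrices stand in for learned weights):

```python
import numpy as np

small_vocab, large_vocab, dim = 4096, 65536, 64

# embeddings learned with the 4K tokenizer (random stand-in here)
rng = np.random.default_rng(0)
small_emb = rng.normal(size=(small_vocab, dim))

# because the 4K vocab is a strict prefix of the 64K vocab, token id i
# means the same byte sequence in both -- so the first 4096 rows of the
# larger table can be initialized directly from the smaller model
large_emb = rng.normal(size=(large_vocab, dim)) * 0.02  # fresh init for new tokens
large_emb[:small_vocab] = small_emb

assert np.array_equal(large_emb[:small_vocab], small_emb)
```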

when to use

  • tokenizing executables for transformer-based malware classification
  • binary similarity search across platforms and architectures
  • function boundary or purpose identification in stripped binaries
  • any ML pipeline that needs fixed-vocabulary representations of compiled code

using the pretrained tokenizers

install

uv pip install tokenizers

load from huggingface

from tokenizers import Tokenizer

# load the 64K production tokenizer (recommended)
tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-64k")

# or any other size in the family
tokenizer_32k = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-32k")
tokenizer_16k = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-16k")
tokenizer_8k = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-8k")
tokenizer_4k = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-4k")

tokenize a binary

the tokenizer operates on latin-1 decoded strings, which preserves a 1:1 byte mapping:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-64k")

# read a binary file and decode as latin-1 (preserves all byte values)
with open("/usr/bin/ls", "rb") as f:
    binary_data = f.read()

text = binary_data.decode("latin-1")
encoding = tokenizer.encode(text)

print(f"Original size: {len(binary_data):,} bytes")
print(f"Token count: {len(encoding.ids):,} tokens")
print(f"Compression: {len(binary_data) / len(encoding.ids):.2f} bytes/token")
print(f"First 10 token IDs: {encoding.ids[:10]}")

decode back to bytes

# decode token ids back to latin-1 string, then to bytes
decoded_text = tokenizer.decode(encoding.ids)
decoded_bytes = decoded_text.encode("latin-1")

assert decoded_bytes == binary_data  # perfect reconstruction
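the round trip is lossless because latin-1 assigns each of the 256 byte values its own code point; a quick self-contained check of that property:

```python
# latin-1 maps every byte value 0x00-0xFF to a distinct code point,
# so decode/encode is a no-op for arbitrary binary data
all_bytes = bytes(range(256))
assert all_bytes.decode("latin-1").encode("latin-1") == all_bytes
```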

inspect individual tokens

# look up what each token represents
for i, token_id in enumerate(encoding.ids[:5]):
    token_str = tokenizer.id_to_token(token_id)
    token_bytes = token_str.encode("latin-1")
    print(f"Token {i}: ID={token_id}, hex={token_bytes.hex()}, len={len(token_bytes)}")

compare vocabulary sizes on a file

from pathlib import Path
from tokenizers import Tokenizer

sizes = ["4k", "8k", "16k", "32k", "64k"]
binary_data = Path("/usr/bin/grep").read_bytes()
text = binary_data.decode("latin-1")

print(f"File size: {len(binary_data):,} bytes\n")
for size in sizes:
    tok = Tokenizer.from_pretrained(f"mjbommar/binary-tokenizer-001-{size}")
    enc = tok.encode(text)
    bpt = len(binary_data) / len(enc.ids)
    print(f"  {size:>3s}: {bpt:.3f} bytes/token ({len(enc.ids):,} tokens)")

batch processing

from tokenizers import Tokenizer
from pathlib import Path

tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-64k")

results = []
for path in Path("/corpus").glob("**/*"):
    if not path.is_file():
        continue
    data = path.read_bytes()
    if not data:
        continue  # skip empty files to avoid division by zero below
    tokens = tokenizer.encode(data.decode("latin-1"))
    results.append({
        "file": path.name,
        "bytes": len(data),
        "tokens": len(tokens.ids),
        "bytes_per_token": len(data) / len(tokens.ids),
    })

# sort by compression ratio
for r in sorted(results, key=lambda x: x["bytes_per_token"], reverse=True):
    print(f"{r['file']:30s} {r['bytes_per_token']:.3f} bytes/token")

special tokens

the tokenizers include standard transformer special tokens for downstream model training:

token       ID (64K)   purpose
<|start|>   65529      sequence start
<|end|>     65530      sequence end
<|pad|>     65531      padding
<|unk|>     65532      unknown token
<|cls|>     65533      classification
<|sep|>     65534      separator
<|mask|>    65535      masking
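a sketch of how those IDs might be used when padding a tokenized sample to a fixed length for model training. the IDs are the 64K values from the table above; the helper function itself is illustrative, not part of the tokenizers library:

```python
START, END, PAD = 65529, 65530, 65531  # <|start|>, <|end|>, <|pad|> in the 64K vocab

def prepare_sequence(token_ids, max_len):
    """Wrap token ids with start/end markers and pad to max_len."""
    ids = [START] + token_ids[: max_len - 2] + [END]
    attention_mask = [1] * len(ids) + [0] * (max_len - len(ids))
    ids = ids + [PAD] * (max_len - len(ids))
    return ids, attention_mask

ids, mask = prepare_sequence([101, 202, 303], max_len=8)
assert ids == [START, 101, 202, 303, END, PAD, PAD, PAD]
assert mask == [1, 1, 1, 1, 1, 0, 0, 0]
```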

pretrained model details

tokenizer family

all five tokenizers form a nested prefix hierarchy trained on 30 GB of cross-platform binaries:

vocab size   bytes/token   when to use
4K           2.01          very constrained context windows
8K           2.21          limited compute resources
16K          2.41          moderate compression
32K          2.64          good balance of compression and vocab size
64K          2.89          production (recommended)
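those ratios translate directly into effective context: bytes covered = bytes/token × token budget. a quick calculation for an illustrative 4096-token window:

```python
# bytes/token figures from the table above
bytes_per_token = {"4k": 2.01, "8k": 2.21, "16k": 2.41, "32k": 2.64, "64k": 2.89}
budget = 4096  # model context length in tokens (illustrative)

for size, bpt in bytes_per_token.items():
    print(f"{size:>3s}: ~{int(budget * bpt):,} bytes of binary per {budget}-token window")

# vs. raw byte tokenization at 1 byte/token, the 64K tokenizer fits
# roughly 2.89x more of the file into the same window
```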

compression by platform

measured on the binary-30k dataset:

platform      4K      8K      16K     32K     64K
linux       1.625   1.769   1.937   2.139   2.385
windows     2.119   2.341   2.595   2.905   3.274
macos       2.183   2.333   2.542   2.806   3.058
android     1.074   1.102   1.152   1.242   1.411
overall     1.766   1.925   2.112   2.341   2.613

windows binaries compress more efficiently due to the regularity of PE format structures. android APKs compress less because they contain compressed data (dex, zip) that resists BPE merges.

what the tokenizer learns

the 64K vocabulary breaks down roughly as:

  • 27.5% instruction patterns (x86/ARM opcodes and common instruction sequences)
  • 12.5% readable strings (ASCII text from symbol tables, error messages)
  • 7.5% high-byte patterns (non-ASCII data structures)
  • 52% mixed format structures (headers, metadata, padding)
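a rough way to explore this breakdown yourself is to bucket each token's bytes by content. the heuristic below is a simple illustration of the idea, not the classification methodology from the paper:

```python
def classify_token(token_bytes: bytes) -> str:
    """Crude bucketing of a token by its byte content."""
    if all(0x20 <= b < 0x7F for b in token_bytes):
        return "readable string"       # printable ASCII only
    if all(b >= 0x80 for b in token_bytes):
        return "high-byte pattern"     # non-ASCII data
    return "mixed / structure"         # everything else, incl. most opcodes

assert classify_token(b"error: ") == "readable string"
assert classify_token(bytes([0xE8, 0x00, 0x00, 0x00, 0x00])) == "mixed / structure"
assert classify_token(bytes([0xFF, 0xD8, 0xFF])) == "high-byte pattern"
```

applied over a full vocabulary (via tokenizer.id_to_token and a latin-1 encode, as in the inspection example above), this yields a distribution comparable in spirit to the published breakdown.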

binary-30k dataset

mjbommar/binary-30k is a companion dataset of 29,793 cross-platform binaries (10.8 GB, CC-BY-4.0):

  • 73% benign, 27% malware
  • windows 57%, linux 28%, macos 2%, android 1%, other 12%
  • 10 architectures, including x86-64, x86, ARM64, ARMv7, MIPS64, MIPS32, PowerPC64, RISC-V64, and s390x
  • 70/15/15 train/validation/test split with 4-dimensional stratification

training your own tokenizer

if you need to train on a custom corpus (domain-specific firmware, IoT devices, etc.), the bbpe rust crate provides a CLI:

cargo install bbpe

train on binary files

bbpe train firmware.bin --vocab-size 4096 --min-frequency 4 \
  --preprocessor null-delimited -o tokenizer.json

train on JSONL corpora

bbpe train --jsonl corpus.jsonl:text --vocab-size 8192 \
  --preprocessor ascii-whitespace --preprocessor-probability 0.75 \
  -o tokenizer.json

key training features

  • entropy filtering: --min-entropy / --max-entropy to skip compressed or encrypted regions
  • hierarchical families: smaller vocabs are strict prefixes of larger ones
  • streaming trainer: bounded memory for large corpora
  • JSONL support: dot-notation field paths with gzip decompression
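the rationale for entropy filtering is easy to see with a quick Shannon-entropy calculation: compressed or encrypted regions sit near 8 bits/byte and yield few useful BPE merges, while zero padding sits near 0. a self-contained sketch (the sample buffers are illustrative):

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte."""
    counts = Counter(data)
    n = len(data)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

padding = bytes(4096)              # all zeros, like section padding
uniform = bytes(range(256)) * 16   # every byte value equally likely

print(f"padding: {byte_entropy(padding):.2f} bits/byte")   # 0.00
print(f"uniform: {byte_entropy(uniform):.2f} bits/byte")   # 8.00
```

a --min-entropy threshold skips near-constant padding, and a --max-entropy threshold skips near-uniform compressed or encrypted regions.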

references

  1. Bommarito, M.J. “Binary BPE: A Family of Cross-Platform Tokenizers for Binary Analysis.” arXiv:2511.17573, November 2025. https://arxiv.org/abs/2511.17573
  2. Bommarito, M.J. “Binary-30K: A Heterogeneous Dataset for Deep Learning in Binary Analysis and Malware Detection.” arXiv:2511.22095, November 2025. https://arxiv.org/abs/2511.22095
  3. bbpe crate on crates.io: https://crates.io/crates/bbpe
  4. Pretrained tokenizers on HuggingFace: https://huggingface.co/mjbommar
