tokenizing raw executables for malware analysis with bbpe

pretrained models: mjbommar/binary-tokenizer-001-{4k,8k,16k,32k,64k} on HuggingFace
what: bpe tokenizers for raw binary executables — ELF, PE, Mach-O, APK
compression: 2.89 bytes/token (64K vocab), 3x context vs raw byte tokenization
use cases: malware classification, binary similarity, function identification
license: Apache-2.0 (code), CC-BY-4.0 (dataset)

overview

bbpe applies byte pair encoding to raw binary executables — ELF, PE, Mach-O, and APK files. the “binary” in the name refers to compiled binaries, firmware, and malware samples, not binary classification.

the pretrained tokenizers are available on HuggingFace in five vocabulary sizes (4K through 64K) and can be loaded directly with the tokenizers python library. the 64K tokenizer achieves 2.89 bytes/token compression, giving transformers roughly 3x more context per token compared to raw byte-level tokenization.

the tokenizer family was trained on 30 GB of cross-platform binaries spanning 10 architectures. all tokenizers form a nested prefix hierarchy — the 4K vocabulary is a strict prefix of the 8K, which is a prefix of the 16K, and so on — enabling seamless embedding transfer when scaling models.
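the prefix property means token id i denotes the same byte sequence in every vocabulary, so a model trained on the 4K vocab can be scaled up by copying its embedding rows into a larger table. a minimal numpy sketch of that transfer (the embedding dimension is illustrative, and the random matrices stand in for learned weights):

```python
import numpy as np

small_vocab, large_vocab, dim = 4096, 65536, 64

# embeddings learned with the 4K tokenizer (random stand-in here)
rng = np.random.default_rng(0)
small_emb = rng.normal(size=(small_vocab, dim))

# because the 4K vocab is a strict prefix of the 64K vocab, token id i
# means the same byte sequence in both -- so the first 4096 rows of the
# larger table can be initialized directly from the smaller model
large_emb = rng.normal(size=(large_vocab, dim)) * 0.02  # fresh init for new tokens
large_emb[:small_vocab] = small_emb

assert np.array_equal(large_emb[:small_vocab], small_emb)
```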

when to use

  • tokenizing executables for transformer-based malware classification
  • binary similarity search across platforms and architectures
  • function boundary or purpose identification in stripped binaries
  • any ML pipeline that needs fixed-vocabulary representations of compiled code

using the pretrained tokenizers

install

uv pip install tokenizers

load from huggingface

from tokenizers import Tokenizer

# load the 64K production tokenizer (recommended)
tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-64k")

# or any other size in the family
tokenizer_32k = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-32k")
tokenizer_16k = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-16k")
tokenizer_8k = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-8k")
tokenizer_4k = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-4k")

tokenize a binary

the tokenizer operates on latin-1 decoded strings, which preserves a 1:1 byte mapping:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-64k")

# read a binary file and decode as latin-1 (preserves all byte values)
with open("/usr/bin/ls", "rb") as f:
    binary_data = f.read()

text = binary_data.decode("latin-1")
encoding = tokenizer.encode(text)

print(f"Original size: {len(binary_data):,} bytes")
print(f"Token count: {len(encoding.ids):,} tokens")
print(f"Compression: {len(binary_data) / len(encoding.ids):.2f} bytes/token")
print(f"First 10 token IDs: {encoding.ids[:10]}")

decode back to bytes

# decode token ids back to latin-1 string, then to bytes
decoded_text = tokenizer.decode(encoding.ids)
decoded_bytes = decoded_text.encode("latin-1")

assert decoded_bytes == binary_data  # perfect reconstruction
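the round trip is lossless because latin-1 assigns each of the 256 byte values its own code point; a quick self-contained check of that property:

```python
# latin-1 maps every byte value 0x00-0xFF to a distinct code point,
# so decode/encode is a no-op for arbitrary binary data
all_bytes = bytes(range(256))
assert all_bytes.decode("latin-1").encode("latin-1") == all_bytes
```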

inspect individual tokens

# look up what each token represents
for i, token_id in enumerate(encoding.ids[:5]):
    token_str = tokenizer.id_to_token(token_id)
    token_bytes = token_str.encode("latin-1")
    print(f"Token {i}: ID={token_id}, hex={token_bytes.hex()}, len={len(token_bytes)}")

compare vocabulary sizes on a file

from pathlib import Path
from tokenizers import Tokenizer

sizes = ["4k", "8k", "16k", "32k", "64k"]
binary_data = Path("/usr/bin/grep").read_bytes()
text = binary_data.decode("latin-1")

print(f"File size: {len(binary_data):,} bytes\n")
for size in sizes:
    tok = Tokenizer.from_pretrained(f"mjbommar/binary-tokenizer-001-{size}")
    enc = tok.encode(text)
    bpt = len(binary_data) / len(enc.ids)
    print(f"  {size:>3s}: {bpt:.3f} bytes/token ({len(enc.ids):,} tokens)")

batch processing

from tokenizers import Tokenizer
from pathlib import Path

tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-64k")

results = []
for path in Path("/corpus").glob("**/*"):
    if not path.is_file():
        continue
    data = path.read_bytes()
    if not data:
        continue  # skip empty files to avoid division by zero below
    tokens = tokenizer.encode(data.decode("latin-1"))
    results.append({
        "file": path.name,
        "bytes": len(data),
        "tokens": len(tokens.ids),
        "bytes_per_token": len(data) / len(tokens.ids),
    })

# sort by compression ratio
for r in sorted(results, key=lambda x: x["bytes_per_token"], reverse=True):
    print(f"{r['file']:30s} {r['bytes_per_token']:.3f} bytes/token")

special tokens

the tokenizers include standard transformer special tokens for downstream model training:

token       ID (64K)   purpose
<|start|>   65529      sequence start
<|end|>     65530      sequence end
<|pad|>     65531      padding
<|unk|>     65532      unknown token
<|cls|>     65533      classification
<|sep|>     65534      separator
<|mask|>    65535      masking
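a sketch of how those IDs might be used when padding a tokenized sample to a fixed length for model training. the IDs are the 64K values from the table above; the helper function itself is illustrative, not part of the tokenizers library:

```python
START, END, PAD = 65529, 65530, 65531  # <|start|>, <|end|>, <|pad|> in the 64K vocab

def prepare_sequence(token_ids, max_len):
    """Wrap token ids with start/end markers and pad to max_len."""
    ids = [START] + token_ids[: max_len - 2] + [END]
    attention_mask = [1] * len(ids) + [0] * (max_len - len(ids))
    ids = ids + [PAD] * (max_len - len(ids))
    return ids, attention_mask

ids, mask = prepare_sequence([101, 202, 303], max_len=8)
assert ids == [START, 101, 202, 303, END, PAD, PAD, PAD]
assert mask == [1, 1, 1, 1, 1, 0, 0, 0]
```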

pretrained model details

tokenizer family

all five tokenizers form a nested prefix hierarchy trained on 30 GB of cross-platform binaries:

vocab size   bytes/token   when to use
4K           2.01          very constrained context windows
8K           2.21          limited compute resources
16K          2.41          moderate compression
32K          2.64          good balance of compression and vocab size
64K          2.89          production (recommended)
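those ratios translate directly into effective context: bytes covered = bytes/token × token budget. a quick calculation for an illustrative 4096-token window:

```python
# bytes/token figures from the table above
bytes_per_token = {"4k": 2.01, "8k": 2.21, "16k": 2.41, "32k": 2.64, "64k": 2.89}
budget = 4096  # model context length in tokens (illustrative)

for size, bpt in bytes_per_token.items():
    print(f"{size:>3s}: ~{int(budget * bpt):,} bytes of binary per {budget}-token window")

# vs. raw byte tokenization at 1 byte/token, the 64K tokenizer fits
# roughly 2.89x more of the file into the same window
```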

compression by platform

measured on the binary-30k dataset:

platform      4K      8K      16K     32K     64K
linux       1.625   1.769   1.937   2.139   2.385
windows     2.119   2.341   2.595   2.905   3.274
macos       2.183   2.333   2.542   2.806   3.058
android     1.074   1.102   1.152   1.242   1.411
overall     1.766   1.925   2.112   2.341   2.613

windows binaries compress more efficiently due to the regularity of PE format structures. android APKs compress less because they contain compressed data (dex, zip) that resists BPE merges.

what the tokenizer learns

the 64K vocabulary breaks down roughly as:

  • 27.5% instruction patterns (x86/ARM opcodes and common instruction sequences)
  • 12.5% readable strings (ASCII text from symbol tables, error messages)
  • 7.5% high-byte patterns (non-ASCII data structures)
  • 52% mixed format structures (headers, metadata, padding)
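a rough way to explore this breakdown yourself is to bucket each token's bytes by content. the heuristic below is a simple illustration of the idea, not the classification methodology from the paper:

```python
def classify_token(token_bytes: bytes) -> str:
    """Crude bucketing of a token by its byte content."""
    if all(0x20 <= b < 0x7F for b in token_bytes):
        return "readable string"       # printable ASCII only
    if all(b >= 0x80 for b in token_bytes):
        return "high-byte pattern"     # non-ASCII data
    return "mixed / structure"         # everything else, incl. most opcodes

assert classify_token(b"error: ") == "readable string"
assert classify_token(bytes([0xE8, 0x00, 0x00, 0x00, 0x00])) == "mixed / structure"
assert classify_token(bytes([0xFF, 0xD8, 0xFF])) == "high-byte pattern"
```

applied over a full vocabulary (via tokenizer.id_to_token and a latin-1 encode, as in the inspection example above), this yields a distribution comparable in spirit to the published breakdown.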

binary-30k dataset

mjbommar/binary-30k is a companion dataset of 29,793 cross-platform binaries (10.8 GB, CC-BY-4.0):

  • 73% benign, 27% malware
  • windows 57%, linux 28%, macos 2%, android 1%, other 12%
  • 10 architectures, including x86-64, x86, ARM64, ARMv7, MIPS64, MIPS32, PowerPC64, RISC-V64, and s390x
  • 70/15/15 train/validation/test split with 4-dimensional stratification

training your own tokenizer

if you need to train on a custom corpus (domain-specific firmware, IoT devices, etc.), the bbpe rust crate provides a CLI:

cargo install bbpe

train on binary files

bbpe train firmware.bin --vocab-size 4096 --min-frequency 4 \
  --preprocessor null-delimited -o tokenizer.json

train on JSONL corpora

bbpe train --jsonl corpus.jsonl:text --vocab-size 8192 \
  --preprocessor ascii-whitespace --preprocessor-probability 0.75 \
  -o tokenizer.json

key training features

  • entropy filtering: --min-entropy / --max-entropy to skip compressed or encrypted regions
  • hierarchical families: smaller vocabs are strict prefixes of larger ones
  • streaming trainer: bounded memory for large corpora
  • JSONL support: dot-notation field paths with gzip decompression
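the rationale for entropy filtering is easy to see with a quick Shannon-entropy calculation: compressed or encrypted regions sit near 8 bits/byte and yield few useful BPE merges, while zero padding sits near 0. a self-contained sketch (the sample buffers are illustrative):

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte."""
    counts = Counter(data)
    n = len(data)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

padding = bytes(4096)              # all zeros, like section padding
uniform = bytes(range(256)) * 16   # every byte value equally likely

print(f"padding: {byte_entropy(padding):.2f} bits/byte")   # 0.00
print(f"uniform: {byte_entropy(uniform):.2f} bits/byte")   # 8.00
```

a --min-entropy threshold skips near-constant padding, and a --max-entropy threshold skips near-uniform compressed or encrypted regions.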

references

  1. Bommarito, M.J. “Binary BPE: A Family of Cross-Platform Tokenizers for Binary Analysis.” arXiv:2511.17573, November 2025. https://arxiv.org/abs/2511.17573
  2. Bommarito, M.J. “Binary-30K: A Heterogeneous Dataset for Deep Learning in Binary Analysis and Malware Detection.” arXiv:2511.22095, November 2025. https://arxiv.org/abs/2511.22095
  3. bbpe crate on crates.io: https://crates.io/crates/bbpe
  4. Pretrained tokenizers on HuggingFace: https://huggingface.co/mjbommar
