Binary-30K is the first comprehensive heterogeneous dataset designed specifically for deep learning research in binary analysis and malware detection, addressing a critical infrastructure gap in the field.

Problem Statement

Existing binary analysis datasets suffer from significant limitations:

Single-platform focus: Datasets target only Linux, Windows, or Android
Specialized tooling requirements: Many require complex preprocessing pipelines
Hand-engineered features: Incompatible with modern neural architectures that learn directly from raw data
Limited accessibility: No single dataset supports both research and pedagogy on realistic use cases

Binary-30K addresses these limitations with a unified, heterogeneous dataset that works out of the box with modern deep learning frameworks.

Publication

“Binary-30K: A Heterogeneous Dataset for Deep Learning in Binary Analysis and Malware Detection”

Author: Michael J. Bommarito II
Published: November 2025
Available on arXiv
35 pages, 7 figures, 11 tables, 4 appendices

Dataset Overview

Core Statistics

30,000 unique binaries totaling 24 GB of raw binary data
Multiple platforms: Linux (Alpine, Debian, Ubuntu), Windows (8/10/11), macOS, Android
Multiple architectures: x86-64, x86-32, ARM64, ARM32, MIPS, RISC-V
Multiple file formats: ELF, PE, Mach-O, APK
Balanced malware/benign split: Includes samples from SOREL-20M and Malware Bazaar

Pre-Tokenized Variants

Binary-30K is available in multiple tokenization formats to support different research workflows:

Raw bytes: Original binary executables for custom preprocessing
Binary BPE tokenized: Pre-processed with Binary BPE tokenizers (4K, 8K, 16K, 32K, 64K vocabularies)
Ready for transformers: Compatible with the HuggingFace transformers library out of the box
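
To make the idea of byte-level BPE concrete, the following toy sketch learns a few merges over raw bytes and round-trips a byte string. It is an illustration of the general technique only, not the Binary BPE tokenizer implementation; the function names (`merge_pair`, `train_bpe`, `encode`, `decode`) are hypothetical.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(data: bytes, num_merges: int):
    """Learn up to `num_merges` merges; ids 0-255 are the raw byte values."""
    ids = list(data)
    merges = []  # ((left_id, right_id), new_id), in the order learned
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not shorten anything
        new_id = 256 + len(merges)
        ids = merge_pair(ids, pair, new_id)
        merges.append((pair, new_id))
    return merges


def encode(data: bytes, merges):
    """Apply the learned merges, in training order, to new data."""
    ids = list(data)
    for pair, new_id in merges:
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand token ids back into the original bytes."""
    table = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges:
        table[new_id] = table[left] + table[right]
    return b"".join(table[i] for i in ids)
```

A larger vocabulary means more merges and therefore shorter sequences per binary, at the cost of a larger embedding table, which is the trade-off behind offering several vocabulary sizes.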

Key Features

Heterogeneous Platform Coverage

Binary-30K provides balanced representation across:

Operating Systems

Linux: Alpine, Debian, Ubuntu distributions
Windows: Windows 8, 10, and 11 executables
macOS: Intel and Apple Silicon binaries
Android: APK packages with native libraries

CPU Architectures

x86-64: Modern 64-bit Intel/AMD processors
x86-32: Legacy 32-bit x86 systems
ARM64: Modern mobile and server processors
ARM32: Embedded and older mobile devices
MIPS: Router firmware and embedded systems
RISC-V: Emerging open-source architecture
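
For ELF files, the target architecture can be read straight from the header. The sketch below maps the `e_machine` field to the architecture names used in this list. It covers ELF only (PE and Mach-O record the architecture in different header fields), and the function name `elf_architecture` is an illustrative choice rather than part of the dataset tooling.

```python
import struct

# e_machine constants from the ELF specification
ELF_MACHINES = {
    3: "x86-32",    # EM_386
    8: "MIPS",      # EM_MIPS
    40: "ARM32",    # EM_ARM
    62: "x86-64",   # EM_X86_64
    183: "ARM64",   # EM_AARCH64
    243: "RISC-V",  # EM_RISCV
}


def elf_architecture(header: bytes) -> str:
    """Return the architecture name for the first 20+ bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    # EI_DATA (byte 5) gives the byte order: 1 = little-endian, 2 = big-endian
    if header[5] == 1:
        fmt = "<H"
    elif header[5] == 2:
        fmt = ">H"
    else:
        raise ValueError("invalid EI_DATA byte")
    # e_machine is a 16-bit field at offset 18, after e_ident (16) and e_type (2)
    (machine,) = struct.unpack_from(fmt, header, 18)
    return ELF_MACHINES.get(machine, f"unknown ({machine})")
```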

File Formats

ELF (Executable and Linkable Format): Linux and Unix systems
PE (Portable Executable): Windows executables and DLLs
Mach-O: macOS and iOS binaries
APK: Android application packages
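
The four formats can be told apart by their leading magic bytes. The following is a minimal sketch of such a check, with `sniff_format` as a hypothetical helper name. It is deliberately coarse: `MZ` only identifies the DOS stub that precedes a PE header, an APK is a ZIP archive and cannot be distinguished from other ZIP files by magic alone, and the fat Mach-O magic `0xCAFEBABE` is shared with Java class files.

```python
# Mach-O magic numbers as they appear on disk
MACHO_MAGICS = {
    b"\xfe\xed\xfa\xce",  # 32-bit, big-endian
    b"\xce\xfa\xed\xfe",  # 32-bit, little-endian
    b"\xfe\xed\xfa\xcf",  # 64-bit, big-endian
    b"\xcf\xfa\xed\xfe",  # 64-bit, little-endian
    b"\xca\xfe\xba\xbe",  # fat (universal) binary
}


def sniff_format(head: bytes) -> str:
    """Guess the container format from the first few bytes of a file."""
    if head.startswith(b"\x7fELF"):
        return "ELF"
    if head.startswith(b"MZ"):
        return "PE"  # DOS header; a stricter check would verify the PE signature
    if head[:4] in MACHO_MAGICS:
        return "Mach-O"
    if head.startswith(b"PK\x03\x04"):
        return "ZIP/APK"  # APKs are ZIP archives
    return "unknown"
```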

Malware Samples

The dataset includes real-world malware for security research:

SOREL-20M samples: Diverse malware families from Sophos
Malware Bazaar samples: Contemporary threat samples
Multiple threat categories: Trojans, ransomware, adware, spyware
Ethical considerations: Curated for research with appropriate safeguards

Dataset Structure

Organization

binary-30k/
├── raw/                    # Original binary executables
│   ├── linux/             # Linux binaries (ELF)
│   ├── windows/           # Windows executables (PE)
│   ├── macos/             # macOS binaries (Mach-O)
│   └── android/           # Android packages (APK)
├── tokenized/             # Pre-tokenized variants
│   ├── bpe-4k/           # 4K vocabulary
│   ├── bpe-8k/           # 8K vocabulary
│   ├── bpe-16k/          # 16K vocabulary
│   ├── bpe-32k/          # 32K vocabulary
│   └── bpe-64k/          # 64K vocabulary
└── metadata.json          # Labels, file info, hashes

Metadata Fields

Each binary includes comprehensive metadata:

File hash: SHA256 for reproducibility
Platform information: OS, architecture, file format
Size metrics: File size, section counts, symbol counts
Label information: Benign/malware classification
Compilation details: Compiler, optimization flags (when available)
Library dependencies: Linked libraries and imports
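
As a rough illustration of what one metadata entry might look like in code, the sketch below builds a record with a SHA256 hash and a subset of the fields listed above. The class name `BinaryRecord`, the helper `make_record`, the field names, and the label strings are assumptions for illustration; the actual schema of metadata.json may differ.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryRecord:
    # Field names are hypothetical, not the dataset's real schema
    sha256: str
    platform: str
    architecture: str
    file_format: str
    size_bytes: int
    label: str  # assumed to be "benign" or "malware"


def make_record(data: bytes, platform: str, architecture: str,
                file_format: str, label: str) -> BinaryRecord:
    """Hash the raw bytes and bundle them with descriptive fields."""
    if label not in {"benign", "malware"}:
        raise ValueError(f"unexpected label: {label!r}")
    return BinaryRecord(
        sha256=hashlib.sha256(data).hexdigest(),
        platform=platform,
        architecture=architecture,
        file_format=file_format,
        size_bytes=len(data),
        label=label,
    )
```

Keying records by content hash lets two researchers confirm they are working with byte-identical samples without exchanging the files themselves.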

Research Applications

Binary-30K enables research across multiple domains:

Malware Detection

Binary classification: Benign vs. malicious detection
Family classification: Identifying specific malware families
Behavioral analysis: Understanding malware capabilities from static features
Zero-day detection: Identifying novel threats without signatures

Binary Analysis

Function-purpose identification: Understanding what code does
Similarity detection: Finding related or plagiarized binaries
Vulnerability discovery: Identifying potential security flaws
Reverse engineering: Automated analysis for security audits

Deep Learning Research

Transformer baselines: Pre-training models on binary sequences
Architecture evaluation: Comparing CNN, RNN, and transformer approaches
Transfer learning: Cross-platform model generalization
Few-shot learning: Adapting models to new malware families

Educational Use

Course materials: Ready-to-use dataset for cybersecurity courses
Project assignments: Realistic data for student projects
Reproducible research: Common benchmark for comparing approaches
Hands-on learning: Practical experience with diverse binaries

Technical Advantages

Ready for Deep Learning

No preprocessing required: Use directly with PyTorch, TensorFlow, JAX
HuggingFace integration: Compatible with the transformers library
Batch processing support: Efficient data loading for GPU training
Multiple tokenizations: Choose optimal vocabulary size for your model

Research Reproducibility

Fixed dataset splits: Predefined train/validation/test partitions
Deterministic sampling: Reproducible experiments with consistent data
Version control: Tagged releases for long-term reproducibility
Comprehensive documentation: Clear metadata and usage examples
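
One common way to obtain fixed, deterministic partitions is to derive the split from each file's content hash, so the assignment never depends on file order or a random seed. The sketch below shows that idea under assumed 80/10/10 proportions; it is not necessarily how the predefined Binary-30K splits were produced, and `assign_split` is a hypothetical helper.

```python
def assign_split(sha256_hex: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a SHA256 hex digest to 'train', 'validation', or 'test'."""
    if not 0 <= val_pct + test_pct <= 100:
        raise ValueError("percentages must sum to at most 100")
    # The first 8 hex digits give a 32-bit integer; reduce it to a bucket in 0..99
    bucket = int(sha256_hex[:8], 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"
```

Because the split is a pure function of the hash, adding new samples later never moves an existing sample from one partition to another.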

Diverse Challenges

The heterogeneous nature provides multiple research challenges:

Cross-platform generalization: Models must learn platform-agnostic patterns
Architecture diversity: Handle different CPU instruction sets
File format variation: Process ELF, PE, Mach-O, and APK formats
Size heterogeneity: From 10 KB embedded firmware to 100 MB+ applications
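
Size heterogeneity matters in practice because most sequence models accept a fixed maximum length. One possible way to handle this, sketched below, is to cut each token sequence into overlapping fixed-size windows and pad the last one. The helper name `chunk_tokens`, the window and stride values, and the use of 0 as a padding id are all assumptions; a real tokenizer may reserve a different padding id.

```python
def chunk_tokens(tokens, window=512, stride=256, pad_id=0):
    """Split `tokens` into windows of length `window`, advancing by `stride`.

    The final window is right-padded with `pad_id`. An empty input yields a
    single all-padding window. A stride larger than the window skips tokens.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    chunks = []
    start = 0
    while True:
        piece = list(tokens[start:start + window])
        piece += [pad_id] * (window - len(piece))
        chunks.append(piece)
        if start + window >= len(tokens):
            break  # this window reached the end of the sequence
        start += stride
    return chunks
```

With overlapping windows, a small firmware image produces a single chunk while a large application produces many, so per-file predictions typically need an aggregation step such as averaging over chunks.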

Usage Example

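from datasets import load_dataset
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Load pre-tokenized dataset (64K vocabulary)
dataset = load_dataset("mjbommar/binary-30k", "tokenized-64k")

MAX_LEN = 4096  # illustrative truncation length, not a dataset property

def custom_collate_fn(examples):
    # Truncate each token sequence, then zero-pad to the longest in the batch
    tokens = [torch.tensor(ex["tokens"][:MAX_LEN], dtype=torch.long) for ex in examples]
    labels = torch.tensor([ex["labels"] for ex in examples], dtype=torch.long)
    return {"tokens": pad_sequence(tokens, batch_first=True), "labels": labels}

# Create DataLoader for training
train_loader = DataLoader(
    dataset["train"],
    batch_size=32,
    shuffle=True,
    collate_fn=custom_collate_fn
)

# Placeholder classifier (mean-pooled token embeddings); swap in your own model.
# Assumes 65,536 token ids and integer benign/malware labels (0 or 1).
model = nn.Sequential(nn.EmbeddingBag(65536, 128), nn.Linear(128, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Training loop
for batch in train_loader:
    tokens = batch["tokens"]
    labels = batch["labels"]

    # Forward pass through your model
    outputs = model(tokens)
    loss = criterion(outputs, labels)

    # Backward pass and optimization
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss.backward()
    optimizer.step()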

Comparison to Existing Datasets

Dataset	Size (samples)	Platforms	Formats	Architectures	Malware	DL-Ready
SOREL-20M	20M	Windows	PE	x86	✓	✗
Drebin	5.5K	Android	APK	ARM	✓	✗
EMBER	1.1M	Windows	PE	x86	✓	Partial
Binary-30K	30K	Multi	Multi	Multi	✓	✓

Among the datasets compared above, Binary-30K is the only one that combines:

Heterogeneous platform support
Multiple file formats and architectures
Raw binary data
Pre-tokenized variants for immediate use
Comprehensive metadata

Ethical Considerations

Binary-30K includes malware samples for security research. Users must:

Use responsibly: Only for legitimate research and educational purposes
Ensure containment: Execute only in isolated sandboxed environments
Follow laws: Comply with local regulations on malware possession
Cite appropriately: Reference the dataset and paper in publications

Availability

All resources are freely available:

HuggingFace Dataset: mjbommar/binary-30k
arXiv Paper: 2511.22095
Binary BPE Tokenizers: mjbommar/binary-tokenizer-001-*

Related Work

Binary BPE Tokenizers: The tokenization method used for pre-processed variants
bbpe: Rust implementation for training custom binary tokenizers
Future work includes transformer baselines and pre-trained models

Impact on Binary Analysis Research

Binary-30K establishes:

Unified benchmark: A common evaluation dataset for comparing approaches
Accessibility: Lowers barriers to entry for binary analysis research
Heterogeneity: Encourages development of generalizable models
Reproducibility: Fixed splits and comprehensive metadata enable reproducible research

The dataset provides the foundation for the next generation of deep learning research in binary analysis and malware detection, enabling researchers and educators to work with realistic, diverse data without extensive preprocessing infrastructure.