Binary-30K Dataset
The first heterogeneous binary analysis dataset for deep learning research, featuring 30,000 diverse executables spanning multiple platforms, architectures, and file formats.
Binary-30K is the first comprehensive heterogeneous dataset designed specifically for deep learning research in binary analysis and malware detection, addressing a critical infrastructure gap in the field.
Problem Statement
Existing binary analysis datasets suffer from significant limitations:
- Single-platform focus: Datasets target only Linux, Windows, or Android
- Specialized tooling requirements: Many require complex preprocessing pipelines
- Hand-engineered features: Incompatible with modern neural architectures that learn directly from raw data
- Limited accessibility: No single dataset supports both research and pedagogy on realistic use cases
Binary-30K addresses these limitations by providing a unified, heterogeneous dataset that works out of the box with modern deep learning frameworks.
Publication
"Binary-30K: A Heterogeneous Dataset for Deep Learning in Binary Analysis and Malware Detection"
- Author: Michael J. Bommarito II
- Published: November 2025
- Available on arXiv
- 35 pages, 7 figures, 11 tables, 4 appendices
Dataset Overview
Core Statistics
- 30,000 unique binaries totaling 24 GB of raw binary data
- Multiple platforms: Linux (Alpine, Debian, Ubuntu), Windows (8/10/11), macOS, Android
- Multiple architectures: x86-64, x86-32, ARM64, ARM32, MIPS, RISC-V
- Multiple file formats: ELF, PE, Mach-O, APK
- Balanced malware/benign split: Includes samples from SOREL-20M and Malware Bazaar
Pre-Tokenized Variants
Binary-30K is available in multiple tokenization formats to support different research workflows:
- Raw bytes: Original binary executables for custom preprocessing
- Binary BPE tokenized: Pre-processed with Binary BPE tokenizers (4K, 8K, 16K, 32K, 64K vocabularies)
- Ready for transformers: Compatible with HuggingFace transformers library out-of-the-box
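For the raw-bytes variant, the simplest custom preprocessing treats every byte as one token. The sketch below illustrates that baseline only; it is not the Binary BPE scheme used for the pre-tokenized variants, and the function name `bytes_to_ids` and the reserved padding ID are assumptions made for this example.

```python
from typing import List

PAD_ID = 0        # hypothetical: ID 0 is reserved for padding
BYTE_OFFSET = 1   # byte value b maps to token ID b + 1, so IDs span 1..256


def bytes_to_ids(data: bytes, max_len: int) -> List[int]:
    """Map raw bytes to token IDs, truncating or right-padding to max_len."""
    ids = [b + BYTE_OFFSET for b in data[:max_len]]
    ids.extend([PAD_ID] * (max_len - len(ids)))
    return ids


if __name__ == "__main__":
    # The four-byte ELF magic, padded to six tokens.
    print(bytes_to_ids(b"\x7fELF", 6))  # [128, 70, 77, 71, 0, 0]
```

A byte-level vocabulary like this needs only 257 IDs but produces one token per byte, which is the sequence-length cost that larger BPE vocabularies are meant to reduce.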
Key Features
Heterogeneous Platform Coverage
Binary-30K provides balanced representation across:
Operating Systems
- Linux: Alpine, Debian, Ubuntu distributions
- Windows: Windows 8, 10, and 11 executables
- macOS: Intel and Apple Silicon binaries
- Android: APK packages with native libraries
CPU Architectures
- x86-64: Modern 64-bit Intel/AMD processors
- x86-32: Legacy 32-bit x86 systems
- ARM64: Modern mobile and server processors
- ARM32: Embedded and older mobile devices
- MIPS: Router firmware and embedded systems
- RISC-V: Emerging open-source architecture
File Formats
- ELF (Executable and Linkable Format): Linux and Unix systems
- PE (Portable Executable): Windows executables and DLLs
- Mach-O: macOS and iOS binaries
- APK: Android application packages
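The four formats above can usually be told apart from the first few bytes of a file. The following is a minimal sketch based on the well-known magic numbers for each container; the function name `detect_format` is hypothetical, and a magic-number check is only a first-pass heuristic, not a full parser.

```python
# Mach-O magic numbers as they appear on disk (both byte orders).
MACHO_MAGICS = {
    b"\xfe\xed\xfa\xce",  # 32-bit, big-endian
    b"\xce\xfa\xed\xfe",  # 32-bit, little-endian
    b"\xfe\xed\xfa\xcf",  # 64-bit, big-endian
    b"\xcf\xfa\xed\xfe",  # 64-bit, little-endian
    b"\xca\xfe\xba\xbe",  # universal (fat) binary; Java class files share this magic
}


def detect_format(header: bytes) -> str:
    """Guess the container format from the leading bytes of a file."""
    if header[:4] == b"\x7fELF":
        return "ELF"
    if header[:2] == b"MZ":
        return "PE"       # DOS stub signature that precedes the PE header
    if header[:4] in MACHO_MAGICS:
        return "Mach-O"
    if header[:4] == b"PK\x03\x04":
        return "APK"      # APKs are ZIP archives; any ZIP file matches here
    return "unknown"


if __name__ == "__main__":
    print(detect_format(b"\x7fELF\x02\x01\x01"))  # ELF
    print(detect_format(b"MZ\x90\x00"))           # PE
```

Because ZIP archives and Java class files collide with the APK and universal Mach-O signatures, a production pipeline would confirm the guess by parsing the headers that follow.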
Malware Samples
The dataset includes real-world malware for security research:
- SOREL-20M samples: Diverse malware families from Sophos
- Malware Bazaar samples: Contemporary threat samples
- Multiple threat categories: Trojans, ransomware, adware, spyware
- Ethical considerations: Curated for research with appropriate safeguards
Dataset Structure
Organization
binary-30k/
├── raw/                # Original binary executables
│   ├── linux/          # Linux binaries (ELF)
│   ├── windows/        # Windows executables (PE)
│   ├── macos/          # macOS binaries (Mach-O)
│   └── android/        # Android packages (APK)
├── tokenized/          # Pre-tokenized variants
│   ├── bpe-4k/         # 4K vocabulary
│   ├── bpe-8k/         # 8K vocabulary
│   ├── bpe-16k/        # 16K vocabulary
│   ├── bpe-32k/        # 32K vocabulary
│   └── bpe-64k/        # 64K vocabulary
└── metadata.json       # Labels, file info, hashes
Metadata Fields
Each binary includes comprehensive metadata:
- File hash: SHA256 for reproducibility
- Platform information: OS, architecture, file format
- Size metrics: File size, section counts, symbol counts
- Label information: Benign/malware classification
- Compilation details: Compiler, optimization flags (when available)
- Library dependencies: Linked libraries and imports
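To make the metadata fields concrete, the sketch below builds one record for a binary and computes its SHA256 with the standard library. The class `BinaryRecord`, the helper `make_record`, and every field name are assumptions chosen for illustration; the actual schema of `metadata.json` may differ.

```python
import hashlib
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BinaryRecord:
    # Hypothetical field names; the real metadata.json schema may differ.
    sha256: str
    platform: str
    architecture: str
    file_format: str
    size_bytes: int
    label: str  # "benign" or "malware"


def make_record(data: bytes, platform: str, architecture: str,
                file_format: str, label: str) -> BinaryRecord:
    """Build a metadata record, hashing the raw bytes for reproducibility."""
    if label not in ("benign", "malware"):
        raise ValueError(f"unexpected label: {label!r}")
    return BinaryRecord(
        sha256=hashlib.sha256(data).hexdigest(),
        platform=platform,
        architecture=architecture,
        file_format=file_format,
        size_bytes=len(data),
        label=label,
    )


if __name__ == "__main__":
    record = make_record(b"\x7fELF", "linux", "x86-64", "ELF", "benign")
    print(asdict(record))
```

Keying records by content hash means a sample can be verified against the metadata regardless of its filename or directory.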
Research Applications
Binary-30K enables research across multiple domains:
Malware Detection
- Binary classification: Benign vs. malicious detection
- Family classification: Identifying specific malware families
- Behavioral analysis: Understanding malware capabilities from static features
- Zero-day detection: Identifying novel threats without signatures
Binary Analysis
- Function-purpose identification: Understanding what code does
- Similarity detection: Finding related or plagiarized binaries
- Vulnerability discovery: Identifying potential security flaws
- Reverse engineering: Automated analysis for security audits
Deep Learning Research
- Transformer baselines: Pre-training models on binary sequences
- Architecture evaluation: Comparing CNN, RNN, and transformer approaches
- Transfer learning: Cross-platform model generalization
- Few-shot learning: Adapting models to new malware families
Educational Use
- Course materials: Ready-to-use dataset for cybersecurity courses
- Project assignments: Realistic data for student projects
- Reproducible research: Common benchmark for comparing approaches
- Hands-on learning: Practical experience with diverse binaries
Technical Advantages
Ready for Deep Learning
- No preprocessing required: Use directly with PyTorch, TensorFlow, JAX
- HuggingFace integration: Compatible with transformers library
- Batch processing support: Efficient data loading for GPU training
- Multiple tokenizations: Choose optimal vocabulary size for your model
Research Reproducibility
- Fixed dataset splits: Predefined train/validation/test partitions
- Deterministic sampling: Reproducible experiments with consistent data
- Version control: Tagged releases for long-term reproducibility
- Comprehensive documentation: Clear metadata and usage examples
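One common way to obtain partitions that stay fixed across runs and machines is to derive the split from each file's hash. The sketch below shows that general technique; it is not a description of how the predefined Binary-30K splits were produced, and the function name `assign_split` and the 80/10/10 proportions are assumptions.

```python
def assign_split(sha256_hex: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically map a SHA256 hex digest to train/validation/test."""
    # The first 8 hex digits give a 32-bit integer; reduce it to a bucket in 0..99.
    bucket = int(sha256_hex[:8], 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


if __name__ == "__main__":
    digest = "0000000a" + "0" * 56  # hypothetical digest whose bucket is 10
    print(assign_split(digest))     # validation
```

Since the assignment depends only on file content, adding or removing other samples never moves an existing sample to a different partition.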
Diverse Challenges
The heterogeneous nature provides multiple research challenges:
- Cross-platform generalization: Models must learn platform-agnostic patterns
- Architecture diversity: Handle different CPU instruction sets
- File format variation: Process ELF, PE, Mach-O, and APK formats
- Size heterogeneity: From 10 KB embedded firmware to 100 MB+ applications
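Size heterogeneity means many token sequences will exceed a model's context length. One way to handle this, sketched below under assumed window and stride values, is to cut each sequence into overlapping fixed-size windows; the function name `sliding_windows` is hypothetical and the dataset itself does not prescribe a windowing scheme.

```python
from typing import Iterator, List, Sequence


def sliding_windows(tokens: Sequence[int], window: int,
                    stride: int) -> Iterator[List[int]]:
    """Yield windows of at most `window` tokens that together cover the sequence."""
    if not 0 < stride <= window:
        raise ValueError("require 0 < stride <= window")
    if len(tokens) <= window:
        yield list(tokens)
        return
    # Emit a window whenever the previous one stopped short of the end;
    # the final window may be shorter than `window`.
    for start in range(0, len(tokens) - window + stride, stride):
        yield list(tokens[start:start + window])


if __name__ == "__main__":
    for chunk in sliding_windows(list(range(10)), window=4, stride=3):
        print(chunk)
    # [0, 1, 2, 3]
    # [3, 4, 5, 6]
    # [6, 7, 8, 9]
```

A stride smaller than the window gives neighbouring chunks shared context, at the cost of processing some tokens more than once.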
Usage Example
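from datasets import load_dataset
import torch
from torch.utils.data import DataLoader

# Load pre-tokenized dataset (64K vocabulary)
dataset = load_dataset("mjbommar/binary-30k", "tokenized-64k")

def custom_collate_fn(examples, max_len=512, pad_id=0):
    # Example collate: truncate and right-pad each sequence to max_len.
    # Assumes each example has "tokens" (list of ints) and "labels" (int);
    # max_len and pad_id are illustrative, so match them to your model.
    tokens = torch.full((len(examples), max_len), pad_id, dtype=torch.long)
    for i, ex in enumerate(examples):
        ids = ex["tokens"][:max_len]
        tokens[i, :len(ids)] = torch.tensor(ids, dtype=torch.long)
    labels = torch.tensor([ex["labels"] for ex in examples], dtype=torch.long)
    return {"tokens": tokens, "labels": labels}

# Create DataLoader for training
train_loader = DataLoader(
    dataset["train"],
    batch_size=32,
    shuffle=True,
    collate_fn=custom_collate_fn,
)

# model, criterion, and optimizer are placeholders that you define yourself,
# e.g. a transformer classifier, torch.nn.CrossEntropyLoss(), torch.optim.AdamW.

# Use with transformer model
for batch in train_loader:
    tokens = batch["tokens"]
    labels = batch["labels"]

    # Forward pass through your model
    optimizer.zero_grad()  # clear gradients left over from the previous step
    outputs = model(tokens)
    loss = criterion(outputs, labels)

    # Backward pass and optimization
    loss.backward()
    optimizer.step()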
Comparison to Existing Datasets
| Dataset | Size | Platforms | Formats | Architecture | Malware | DL-Ready |
|---|---|---|---|---|---|---|
| SOREL-20M | 20M | Windows | PE | x86 | ✓ | ✗ |
| Drebin | 5.5K | Android | APK | ARM | ✓ | ✗ |
| EMBER | 1.1M | Windows | PE | x86 | ✓ | Partial |
| Binary-30K | 30K | Multi | Multi | Multi | ✓ | ✓ |
Binary-30K is the only dataset that combines:
- Heterogeneous platform support
- Multiple file formats and architectures
- Raw binary data
- Pre-tokenized variants for immediate use
- Comprehensive metadata
Ethical Considerations
Binary-30K includes malware samples for security research. Users must:
- Use responsibly: Only for legitimate research and educational purposes
- Ensure containment: Execute only in isolated sandboxed environments
- Follow laws: Comply with local regulations on malware possession
- Cite appropriately: Reference the dataset and paper in publications
Availability
All resources are freely available:
- HuggingFace Dataset: mjbommar/binary-30k
- arXiv Paper: 2511.22095
- Binary BPE Tokenizers: mjbommar/binary-tokenizer-001-*
Related Projects
- Binary BPE Tokenizers: The tokenization method used for pre-processed variants
- bbpe: Rust implementation for training custom binary tokenizers
- Future work includes transformer baselines and pre-trained models
Impact on Binary Analysis Research
Binary-30K establishes:
- Unified benchmark: A common evaluation dataset for comparing approaches
- Accessibility: Lowers barriers to entry for binary analysis research
- Heterogeneity: Encourages development of generalizable models
- Reproducibility: Fixed splits and comprehensive metadata enable reproducible research
The dataset provides the foundation for the next generation of deep learning research in binary analysis and malware detection, enabling researchers and educators to work with realistic, diverse data without extensive preprocessing infrastructure.