
Binary-30K Dataset

dataset

The first heterogeneous binary analysis dataset for deep learning research, featuring 30,000 diverse executables spanning multiple platforms, architectures, and file formats

period: 2025-present
tech:
Machine Learning, Binary Analysis, Malware Detection, Deep Learning, Cybersecurity

Binary-30K is the first comprehensive heterogeneous dataset designed specifically for deep learning research in binary analysis and malware detection, addressing a critical infrastructure gap in the field.

Problem Statement

Existing binary analysis datasets suffer from significant limitations:

  • Single-platform focus: Datasets target only Linux, Windows, or Android
  • Specialized tooling requirements: Many require complex preprocessing pipelines
  • Hand-engineered features: Incompatible with modern neural architectures that learn directly from raw data
  • Limited accessibility: No single dataset supports both research and pedagogy on realistic use cases

Binary-30K addresses these problems by providing a unified, heterogeneous dataset that works out of the box with modern deep learning frameworks.

Publication

“Binary-30K: A Heterogeneous Dataset for Deep Learning in Binary Analysis and Malware Detection”

  • Author: Michael J. Bommarito II
  • Published: November 2025
  • Available on arXiv
  • 35 pages, 7 figures, 11 tables, 4 appendices

Dataset Overview

Core Statistics

  • 30,000 unique binaries totaling 24GB of raw binary data
  • Multiple platforms: Linux (Alpine, Debian, Ubuntu), Windows (8/10/11), macOS, Android
  • Multiple architectures: x86-64, x86-32, ARM64, ARM32, MIPS, RISC-V
  • Multiple file formats: ELF, PE, Mach-O, APK
  • Balanced malware/benign split: Malware samples come from SOREL-20M and Malware Bazaar

Pre-Tokenized Variants

Binary-30K is available in multiple tokenization formats to support different research workflows:

  • Raw bytes: Original binary executables for custom preprocessing
  • Binary BPE tokenized: Pre-processed with Binary BPE tokenizers (4K, 8K, 16K, 32K, 64K vocabularies)
  • Ready for transformers: Compatible with the Hugging Face transformers library out of the box
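
To make the idea of byte-level BPE concrete, here is a toy sketch of the general byte-pair encoding technique applied to raw bytes. It is not the Binary BPE tokenizer or the bbpe implementation. The function names and the sample input are illustrative assumptions, and real tokenizers learn thousands of merges rather than two.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of token IDs, or None if too short."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_toy_bpe(data: bytes, num_merges: int):
    """Learn up to `num_merges` merges; IDs 0-255 are the raw byte values."""
    ids = list(data)
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step  # new tokens are numbered after the 256 byte values
        merges[pair] = new_id
        ids = merge_pair(ids, pair, new_id)
    return ids, merges


# Made-up input: bytes resembling a common x86-64 function prologue, repeated
sample = bytes.fromhex("554889e5") * 3
ids, merges = train_toy_bpe(sample, num_merges=2)
print(len(sample), "bytes ->", len(ids), "tokens")
```

Each merge shortens the sequence by replacing a frequent byte pair with a single new token, which is why larger vocabularies generally yield shorter token sequences for the same binary.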

Key Features

Heterogeneous Platform Coverage

Binary-30K provides balanced representation across:

Operating Systems

  • Linux: Alpine, Debian, Ubuntu distributions
  • Windows: Windows 8, 10, and 11 executables
  • macOS: Intel and Apple Silicon binaries
  • Android: APK packages with native libraries

CPU Architectures

  • x86-64: Modern 64-bit Intel/AMD processors
  • x86-32: Legacy 32-bit x86 systems
  • ARM64: Modern mobile and server processors
  • ARM32: Embedded and older mobile devices
  • MIPS: Router firmware and embedded systems
  • RISC-V: Emerging open-source architecture

File Formats

  • ELF (Executable and Linkable Format): Linux and Unix systems
  • PE (Portable Executable): Windows executables and DLLs
  • Mach-O: macOS and iOS binaries
  • APK: Android application packages
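
Each of these formats can usually be recognized from the first few bytes of a file. The sketch below checks the well-known magic values; the function name `sniff_format` is an illustrative assumption and not part of the dataset's tooling, and a robust parser would validate far more than the leading bytes.

```python
import struct

# Mach-O magic values as read big-endian; each appears in both byte orders
MACHO_MAGICS = {
    0xFEEDFACE, 0xCEFAEDFE,  # 32-bit
    0xFEEDFACF, 0xCFFAEDFE,  # 64-bit
    0xCAFEBABE, 0xBEBAFECA,  # universal (fat) binary; also used by Java class files
}


def sniff_format(header: bytes) -> str:
    """Guess the container format from the first few bytes of a file."""
    if header.startswith(b"\x7fELF"):
        return "ELF"
    if header.startswith(b"MZ"):
        # DOS header; a full check would follow e_lfanew to the "PE\0\0" signature
        return "PE"
    if header.startswith(b"PK\x03\x04"):
        # Any ZIP archive matches here; an APK is a ZIP with Android-specific entries
        return "APK"
    if len(header) >= 4 and struct.unpack(">I", header[:4])[0] in MACHO_MAGICS:
        return "Mach-O"
    return "unknown"


print(sniff_format(b"\x7fELF\x02\x01\x01\x00"))
```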

Malware Samples

The dataset includes real-world malware for security research:

  • SOREL-20M samples: Diverse malware families from the Sophos and ReversingLabs corpus
  • Malware Bazaar samples: Contemporary threat samples
  • Multiple threat categories: Trojans, ransomware, adware, spyware
  • Ethical considerations: Curated for research with appropriate safeguards

Dataset Structure

Organization

binary-30k/
├── raw/                   # Original binary executables
│   ├── linux/             # Linux binaries (ELF)
│   ├── windows/           # Windows executables (PE)
│   ├── macos/             # macOS binaries (Mach-O)
│   └── android/           # Android packages (APK)
├── tokenized/             # Pre-tokenized variants
│   ├── bpe-4k/            # 4K vocabulary
│   ├── bpe-8k/            # 8K vocabulary
│   ├── bpe-16k/           # 16K vocabulary
│   ├── bpe-32k/           # 32K vocabulary
│   └── bpe-64k/           # 64K vocabulary
└── metadata.json          # Labels, file info, hashes

Metadata Fields

Each binary includes comprehensive metadata:

  • File hash: SHA256 for reproducibility
  • Platform information: OS, architecture, file format
  • Size metrics: File size, section counts, symbol counts
  • Label information: Benign/malware classification
  • Compilation details: Compiler, optimization flags (when available)
  • Library dependencies: Linked libraries and imports
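
The SHA256 field makes it possible to confirm that a downloaded file is the exact sample the metadata describes. A minimal sketch using only the standard library follows; the helper names are assumptions, and because the layout of metadata.json is not specified here, the expected hash is passed in directly rather than read from that file.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large binaries need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_metadata(path: Path, expected_sha256: str) -> bool:
    """Compare a file's digest with the recorded hash, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_sha256.strip().lower()


# Demo on a throwaway file standing in for a dataset sample
payload = b"\x7fELF" + bytes(60)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.bin"
    sample_path.write_bytes(payload)
    demo_ok = matches_metadata(sample_path, hashlib.sha256(payload).hexdigest().upper())
    demo_bad = matches_metadata(sample_path, "0" * 64)

print(demo_ok, demo_bad)
```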

Research Applications

Binary-30K enables research across multiple domains:

Malware Detection

  • Binary classification: Benign vs. malicious detection
  • Family classification: Identifying specific malware families
  • Behavioral analysis: Understanding malware capabilities from static features
  • Zero-day detection: Identifying novel threats without signatures

Binary Analysis

  • Function-purpose identification: Understanding what code does
  • Similarity detection: Finding related or plagiarized binaries
  • Vulnerability discovery: Identifying potential security flaws
  • Reverse engineering: Automated analysis for security audits

Deep Learning Research

  • Transformer baselines: Pre-training models on binary sequences
  • Architecture evaluation: Comparing CNN, RNN, and transformer approaches
  • Transfer learning: Cross-platform model generalization
  • Few-shot learning: Adapting models to new malware families

Educational Use

  • Course materials: Ready-to-use dataset for cybersecurity courses
  • Project assignments: Realistic data for student projects
  • Reproducible research: Common benchmark for comparing approaches
  • Hands-on learning: Practical experience with diverse binaries

Technical Advantages

Ready for Deep Learning

  • No preprocessing required: Use directly with PyTorch, TensorFlow, or JAX
  • HuggingFace integration: Compatible with the transformers library
  • Batch processing support: Efficient data loading for GPU training
  • Multiple tokenizations: Choose optimal vocabulary size for your model

Research Reproducibility

  • Fixed dataset splits: Predefined train/validation/test partitions
  • Deterministic sampling: Reproducible experiments with consistent data
  • Version control: Tagged releases for long-term reproducibility
  • Comprehensive documentation: Clear metadata and usage examples
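
How the predefined partitions were produced is not described here. One common way to get deterministic splits, shown below purely as an illustration, is to hash a stable identifier such as the file's SHA256 and bucket the result. The function name, the salt string, and the 80/10/10 ratios are all assumptions.

```python
import hashlib


def assign_split(sha256_hex: str, train: float = 0.8, validation: float = 0.1) -> str:
    """Map a sample's hash to a split; the same hash always lands in the same split."""
    # Re-hash with a fixed salt so changing the salt yields a fresh, independent split
    bucket_bytes = hashlib.sha256(("split-v1:" + sha256_hex).encode("ascii")).digest()
    # Interpret the first 8 bytes as a fraction between 0 and 1
    fraction = int.from_bytes(bucket_bytes[:8], "big") / 2**64
    if fraction < train:
        return "train"
    if fraction < train + validation:
        return "validation"
    return "test"


# Hypothetical hashes, used only to show that assignment is stable across calls
for fake_hash in ("ab" * 32, "cd" * 32):
    print(fake_hash[:8], assign_split(fake_hash))
```

Because the assignment depends only on the identifier, adding or removing other samples never moves an existing sample between splits.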

Diverse Challenges

The heterogeneous nature provides multiple research challenges:

  • Cross-platform generalization: Models must learn platform-agnostic patterns
  • Architecture diversity: Handle different CPU instruction sets
  • File format variation: Process ELF, PE, Mach-O, and APK formats
  • Size heterogeneity: From 10KB embedded firmware to 100MB+ applications
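
Size heterogeneity matters in practice because most sequence models accept a bounded input length. One generic way to handle very long token sequences is to cut them into fixed-size, optionally overlapping windows. This is a sketch of that general technique, not a method prescribed by the dataset, and the function name and parameters are assumptions.

```python
from typing import Iterator, List, Sequence


def sliding_windows(tokens: Sequence[int], window: int, stride: int) -> Iterator[List[int]]:
    """Yield windows of up to `window` tokens, advancing `stride` tokens each time."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(tokens) == 0:
        return
    start = 0
    while True:
        yield list(tokens[start:start + window])
        # Stop once the current window has reached the end of the sequence
        if start + window >= len(tokens):
            break
        start += stride


# Ten tokens, windows of four, overlapping by two
for chunk in sliding_windows(list(range(10)), window=4, stride=2):
    print(chunk)
```

A stride smaller than the window gives overlapping context between chunks; a stride equal to the window gives disjoint chunks, with a shorter final chunk when the length does not divide evenly.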

Usage Example

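from datasets import load_dataset
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Load pre-tokenized dataset (64K vocabulary)
dataset = load_dataset("mjbommar/binary-30k", "tokenized-64k")

def custom_collate_fn(examples):
    # Pad variable-length token sequences to the longest one in the batch.
    # Assumes each example has an integer list "tokens" and an integer "labels".
    tokens = pad_sequence(
        [torch.tensor(ex["tokens"], dtype=torch.long) for ex in examples],
        batch_first=True,
        padding_value=0,  # assumed padding ID; check the tokenizer's pad token
    )
    labels = torch.tensor([ex["labels"] for ex in examples], dtype=torch.long)
    return {"tokens": tokens, "labels": labels}

# Create DataLoader for training
train_loader = DataLoader(
    dataset["train"],
    batch_size=32,
    shuffle=True,
    collate_fn=custom_collate_fn,
)

# Placeholder classifier (mean-pooled embeddings); swap in your transformer model.
# 65536 assumes the 64K vocabulary means 2**16 token IDs.
model = nn.Sequential(nn.EmbeddingBag(65536, 128), nn.Linear(128, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Training loop
for batch in train_loader:
    tokens = batch["tokens"]
    labels = batch["labels"]

    # Forward pass through your model
    outputs = model(tokens)
    loss = criterion(outputs, labels)

    # Backward pass and optimization (clear stale gradients first)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()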

Comparison to Existing Datasets

Dataset      Size   Platforms  Formats  Architecture  Malware  DL-Ready
SOREL-20M    20M    Windows    PE       x86           ✓        ✗
Drebin       5.5K   Android    APK      ARM           ✓        ✗
EMBER        1.1M   Windows    PE       x86           ✓        Partial
Binary-30K   30K    Multi      Multi    Multi         ✓        ✓

Binary-30K is the only dataset that combines:

  • Heterogeneous platform support
  • Multiple file formats and architectures
  • Raw binary data
  • Pre-tokenized variants for immediate use
  • Comprehensive metadata

Ethical Considerations

Binary-30K includes malware samples for security research. Users must:

  • Use responsibly: Only for legitimate research and educational purposes
  • Ensure containment: Execute only in isolated sandboxed environments
  • Follow laws: Comply with local regulations on malware possession
  • Cite appropriately: Reference the dataset and paper in publications

Availability

All resources are freely available:

  • Binary BPE Tokenizers: The tokenization method used for pre-processed variants
  • bbpe: Rust implementation for training custom binary tokenizers
  • Future work includes transformer baselines and pre-trained models

Impact on Binary Analysis Research

Binary-30K establishes:

  1. Unified benchmark: A common evaluation dataset for comparing approaches
  2. Accessibility: Lowers barriers to entry for binary analysis research
  3. Heterogeneity: Encourages development of generalizable models
  4. Reproducibility: Fixed splits and comprehensive metadata enable reproducible research

The dataset provides the foundation for the next generation of deep learning research in binary analysis and malware detection, enabling researchers and educators to work with realistic, diverse data without extensive preprocessing infrastructure.
