USearch Molecules

April 7, 2026 · View on GitHub

USearch Molecules 7B dataset thumbnail

USearch Molecules is a large cheminformatics dataset of small molecules. It includes 7'131'914'291 molecules with up to 50 "heavy" (non-hydrogen) atoms, gathered from:

  • 115'034'339 molecules from the PubChem dataset.
  • 977'468'301 molecules from the GDB13 dataset.
  • 6'039'411'651 molecules from the Enamine REAL dataset.

All molecules have been encoded with RDKit and CDK to produce binary fingerprints (structural embeddings) of four kinds:

  • MACCS: Molecular ACCess System keys with 166 dimensions.
  • PubChem: Structure Fingerprints with 881 dimensions.
  • ECFP4: Extended Connectivity Fingerprint of diameter 4 with 2048 dimensions.
  • FCFP4: Functional Class Fingerprint of diameter 4 with 2048 dimensions.

Those fingerprints were then indexed with Unum's USearch to enable real-time search and clustering of molecular structures for drug discovery and broader chemistry. The dataset is part of the AWS Open Data program and is publicly available from the s3://usearch-molecules bucket, accessible entirely anonymously, without AWS credentials:

aws s3 ls --no-sign-request s3://usearch-molecules

Dataset Structure

.
├── data
│   ├── pubchem
│   │   ├── index-maccs.usearch # 18.6 GB
│   │   ├── index-maccs-ecfp4.usearch # 46.1 GB
│   │   └── parquet # 30 GB
│   │       ├── 0000000000-0001000000.parquet # 265 MB
│   │       ├── 0001000000-0002000000.parquet # 265 MB
│   │       ├── ...
│   │       └── 0115000000-0116000000.parquet # 177 MB
│   ├── gdb13
│   │   ├── index-maccs.usearch # 157.0 GB
│   │   ├── index-maccs-ecfp4.usearch # 390.1 GB
│   │   └── parquet # 189 GB
│   │       ├── 0000000000-0001000000.parquet # 198 MB
│   │       ├── 0001000000-0002000000.parquet # 198 MB
│   │       ├── ...
│   │       └── 0977000000-0978000000.parquet # 93 MB
│   └── real
│       └── parquet # 477 GB
│           ├── 0000000000-0001000000.parquet # 262 MB
│           ├── 0001000000-0002000000.parquet # 262 MB
│           ├── ...
│           └── 6039000000-6040000000.parquet # 108 MB
└── README.md

Pre-constructed search and clustering indexes for the Enamine REAL dataset are much harder to distribute and deploy. They are not yet available in the bucket, but can be shared on request. To view the dataset structure, one can use Python:

  $ pip install pyarrow
  $ python
>>> import pyarrow.parquet as pq
>>> pq.read_table('data/real/parquet/0000000000-0001000000.parquet')

pyarrow.Table
smiles: string not null
maccs: fixed_size_binary[21] not null
pubchem: fixed_size_binary[111] not null
ecfp4: fixed_size_binary[256] not null
fcfp4: fixed_size_binary[256] not null

In tabular form, that looks like:

| | smiles | maccs | pubchem | ecfp4 | fcfp4 |
|---|---|---|---|---|---|
| 0 | CNCC(C)NC(=O)C1(C(C)(C)OC)CC1 | 0x00000200000002002021227C488B9C02100615FFCC | 0x007330… | 0x400000… | 0xE00014… |
| 1 | CN(C(=O)C1=CC2=C(F)C=C(F)C=C2N1)C1CN(C(=O)CC2=CC=CN=C2O)C1 | 0x00900000002000004011172DAC534CE55EF3EB7FFC | 0x007BB1… | 0x000000… | 0xBE8000… |

(The pubchem, ecfp4, and fcfp4 values are truncated here for brevity.)
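The fingerprints are stored packed, eight keys per byte. A minimal sketch of unpacking the 21-byte maccs column into individual bits with NumPy, using the first row's value above:

```python
import numpy as np

# 21-byte packed MACCS fingerprint, as stored in the `maccs` column
packed = bytes.fromhex("00000200000002002021227C488B9C02100615FFCC")

# Unpack into one uint8 per bit: 21 bytes -> 168 bits (166 MACCS keys + padding)
bits = np.unpackbits(np.frombuffer(packed, dtype=np.uint8))
print(bits.size, int(bits.sum()))  # bit count, and how many keys are set
```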

I've also added a tiny sample dataset under the data/example directory: just 2 shards totaling 2 million entries, with pre-constructed indexes, to simplify onboarding. It comes in handy if you want to test your application without downloading the whole dataset, or visualize a few molecules using the StreamLit app.

.
└── data
    └── example # 1.8 GB
        ├── index-maccs.usearch # 329 MB
        ├── index-maccs-ecfp4.usearch # 817 MB
        ├── parquet
        │   ├── 0000000000-0001000000.parquet # 265 MB
        │   └── 0001000000-0002000000.parquet # 265 MB
        └── smiles
            ├── 0000000000-0001000000.smi # 58 MB
            └── 0001000000-0002000000.smi # 58 MB

Installation

The project supports multiple installation profiles for different use cases.

We recommend using uv for fast, reliable Python dependency management.

git clone https://github.com/ashvardanian/USearchMolecules.git
cd USearchMolecules

uv venv --python 3.12                   # or your preferred Python version
source .venv/bin/activate               # to activate the virtual environment
uv pip install setuptools wheel         # to pull the latest build tools
uv pip install -e . --force-reinstall   # to build locally from source
uv pip install -e ".[dev]"              # for fingerprinting & generation of indexes
uv pip install -e ".[gpu]"              # for GPU-accelerated processing with nvMolKit
uv pip install -e ".[viz]"              # for visualization with StreamLit
uv pip install -e ".[all]"              # for all features

Or install from PyPI:

uv pip install usearch-molecules
uv pip install "usearch-molecules[dev]" # for fingerprinting & generation of indexes
uv pip install "usearch-molecules[gpu]" # for GPU-accelerated processing with nvMolKit
uv pip install "usearch-molecules[viz]" # for visualization with StreamLit
uv pip install "usearch-molecules[all]" # for all features

For GPU acceleration with nvMolKit, we recommend using pixi which handles conda dependencies (RDKit, nvMolKit) seamlessly:

pixi install
pixi run python -m usearch_molecules.prep_conformers --datasets example --use-gpu --conformers 20 --batch-size 20

Usage

Exploring Dataset via Command Line Interface

Download the example dataset (2M molecules):

mkdir -p data/example
aws s3 sync --no-sign-request s3://usearch-molecules/data/example data/example/

If you need just one of the subsets:

aws s3 sync --no-sign-request s3://usearch-molecules/data/pubchem/ data/pubchem/
aws s3 sync --no-sign-request s3://usearch-molecules/data/gdb13/ data/gdb13/
aws s3 sync --no-sign-request s3://usearch-molecules/data/real/ data/real/

You can immediately check if the indexes are readable:

  $ python
>>> from usearch.index import Index
>>> Index.metadata("data/pubchem/index-maccs.usearch") # example of reading metadata

{'matrix_included': True,
 'matrix_uses_64_bit_dimensions': False,
 'version': '2.8.10',
 'kind_metric': <MetricKind.Tanimoto: 116>,
 'kind_scalar': <ScalarKind.B1: 1>,
 'kind_key': <ScalarKind.U64: 8>,
 'kind_compressed_slot': <ScalarKind.U32: 9>,
 'count_present': 115627267,
 'count_deleted': 0,
 'dimensions': 192}

>>> Index.restore("data/pubchem/index-maccs-ecfp4.usearch") # example of parsing it

usearch.Index
- config
-- data type: ScalarKind.B1
-- dimensions: 2240
-- metric: MetricKind.Tanimoto
-- connectivity: 16
-- expansion on addition: 128 candidates
-- expansion on search: 64 candidates
- binary
-- uses OpenMP: 1
-- uses SimSIMD: 1
-- uses hardware acceleration: avx512+popcnt
- state
-- size: 115,627,267 vectors
-- memory usage: 69,631,939,864 bytes
-- max level: 4
--- 0. 115,627,267 nodes
--- 1. 7,148,410 nodes
--- 2. 461,450 nodes
--- 3. 37,714 nodes
--- 4. 5,152 nodes

With those out of the way, you can now query the downloaded files:

from usearch_molecules.dataset import FingerprintedDataset, shape_mixed

data = FingerprintedDataset.open("data/example", shape=shape_mixed)

# No inspiration? Pick a random molecule with `data.random_smiles()`
results = data.search('CC(O)C(CN)=NNCC(C)(C)C', 100)

results_keys = [r[0] for r in results]
results_smiles = [r[1] for r in results]
results_scores = [r[2] for r in results]
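For intuition about the scores, here is a minimal NumPy sketch of the Tanimoto metric these indexes use; the real computation runs inside USearch/SimSIMD with hardware popcounts, and search results typically report Tanimoto distance, i.e. one minus this similarity:

```python
import numpy as np

def tanimoto(a: bytes, b: bytes) -> float:
    """Tanimoto similarity |A & B| / |A | B| over packed bit-vectors."""
    x = np.unpackbits(np.frombuffer(a, dtype=np.uint8))
    y = np.unpackbits(np.frombuffer(b, dtype=np.uint8))
    union = int(np.sum(x | y))
    return int(np.sum(x & y)) / union if union else 1.0

print(tanimoto(b"\x0f\xff", b"\xff\xff"))  # 12 shared bits / 16 total
```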

Exploring Dataset via Graphical Interface

The dataset also comes with a graphical sandbox implemented with StreamLit and 3DMol.js to help visualize similarities between molecules.

streamlit run streamlit_app.py

USearch Molecules StreamLit demo preview

Methodology

Dataset Sources

Original data came from:

  • The PubChem dataset.
  • The GDB13 dataset.
  • The Enamine REAL dataset.

Pipeline

The data processing pipeline consists of 5 steps, each implemented as a standalone script:

  1. prep_parquet.py: Convert raw datasets into standardized Parquet shards with SMILES strings.
  2. prep_encode.py: Add molecular fingerprints (MACCS, ECFP4, FCFP4, PubChem) to Parquet files.
  3. prep_index.py: Build USearch similarity indexes for fast nearest neighbor search.
  4. prep_conformers.py: Generate 3D conformers using ETKDG and optionally optimize with MMFF94.
  5. prep_smiles.py: Export SMILES strings to newline-delimited .smi files for StringZilla.

Every script is designed to work with bigger-than-memory data. In other words, processing 1 TB of molecules doesn't require 1 TB of RAM. Everything happens in a sliding-window fashion, with computationally intensive parts split between processes and threads.

uv run python -m usearch_molecules.prep_parquet --datasets example
uv run python -m usearch_molecules.prep_encode --datasets example
uv run python -m usearch_molecules.prep_index --datasets example
uv run python -m usearch_molecules.prep_smiles --datasets example
uv run python -m usearch_molecules.prep_conformers --datasets example

Once processing completed, the datasets were uploaded to S3:

aws s3 sync data/pubchem/parquet/ s3://usearch-molecules/data/pubchem/parquet/
aws s3 sync data/gdb13/parquet/ s3://usearch-molecules/data/gdb13/parquet/
aws s3 sync data/real/parquet/ s3://usearch-molecules/data/real/parquet/

What's Persisted

Each Parquet shard stores the columns below. The guiding principle is: cache what's expensive to recompute, skip what's cheap to reconstruct from SMILES.

Conformer generation (ETKDG + MMFF) costs 60-600 ms per molecule depending on size and conformer count. By contrast, parsing a SMILES string back into a full molecular graph with atom types, bond topology, formal charges, and stereochemistry takes under 0.2 ms. That 300-3000x cost gap is why we persist 3D coordinates but not the molecular graph.

| Column | Type | Typical Size | Description |
|---|---|---|---|
| smiles | utf8 | ~50 B | Canonical SMILES string; encodes the graph: atom types, bond orders, formal charges, stereochemistry. |
| maccs | binary(21) | 21 B | MACCS structural keys (166 bits). |
| ecfp4 | binary(256) | 256 B | Extended-connectivity fingerprint, radius 2 (2048 bits). |
| fcfp4 | binary(256) | 256 B | Functional-class fingerprint, radius 2 (2048 bits). |
| pubchem | binary(111) | 111 B | PubChem substructure fingerprint (881 bits). Optional, requires CDK. |
| n_heavy_atoms | uint16 | 2 B | Number of heavy (non-hydrogen) atoms. Avoids SMILES parsing for basic filtering. |
| n_atoms | uint16 | 2 B | Total atom count including hydrogens. Needed to reshape the coordinate blob. |
| n_bonds | uint16 | 2 B | Number of bonds. Useful for graph-based models without reparsing. |
| molecular_weight | float32 | 4 B | Exact molecular weight in Daltons. Universally needed for filtering. |
| n_conformers | uint8 | 1 B | Number of stored conformers (after RMSD-based deduplication). |
| conformer_coords | binary | K * N * 6 B | 3D coordinates as raw float16 bytes, shape (K, N, 3) where K = conformers, N = atoms. |
| conformer_energies | binary | K * 4 B | MMFF94 energies as raw float32 bytes, one per conformer (kcal/mol, lowest first). |

Coordinates are stored as IEEE 754 float16 (not bfloat16) because PyArrow and the Parquet specification natively support float16, while bfloat16 has no Parquet encoding. The quantization error from float64 to float16 is under 0.002 Angstroms - well below thermal noise at room temperature (~0.1 Angstroms).
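That error bound is easy to check numerically. A small sketch, assuming coordinates within a few Angstroms of the origin, as for centered drug-like molecules:

```python
import numpy as np

rng = np.random.default_rng(42)
# Simulated centered coordinates: 5 conformers x 40 atoms, in Angstroms
coords64 = rng.uniform(-4.0, 4.0, size=(5, 40, 3))
coords16 = coords64.astype(np.float16)

# Worst-case round-trip error of the float64 -> float16 quantization
err = float(np.abs(coords16.astype(np.float64) - coords64).max())
print(err)  # stays below 0.002 for coordinates of magnitude under 4
```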

What we intentionally don't store:

  • Bond topology (atom pairs + bond orders) - this is literally what SMILES encodes. C-C(=O)-O directly specifies which atoms connect and by what bond type.
  • Atom types (element per atom index) - every letter in the SMILES string IS the atom type. After AddHs, hydrogen placement is deterministic.
  • Formal charges (integer per atom) - encoded explicitly in SMILES brackets, e.g. [NH3+], [O-].
  • Stereochemistry (chirality, E/Z geometry) - encoded with @/@@ and / \ markers in SMILES, and also inferable from the 3D coordinates.

All four are losslessly recoverable from the smiles column in under 0.2 ms via Chem.MolFromSmiles + AddHs.

To read conformers back into NumPy arrays:

import numpy as np

# `row` is a single record from a Parquet shard, e.g. one entry of
# pq.read_table(...).to_pylist()
coords = np.frombuffer(row["conformer_coords"], dtype=np.float16)
coords = coords.reshape(row["n_conformers"], row["n_atoms"], 3)
energies = np.frombuffer(row["conformer_energies"], dtype=np.float32)

For a typical drug-like molecule (~40 atoms with H, 5 conformers), the coordinate column is ~1.2 KB per molecule in float16 versus ~15 KB for the previous SDF text format - a 12x reduction.
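That estimate follows directly from the per-column size formulas above; a quick sanity check, assuming 5 conformers of a 40-atom molecule:

```python
# K * N * 6 B for coordinates: float16 = 2 bytes per x/y/z component
conformers, atoms = 5, 40
coord_bytes = conformers * atoms * 3 * 2
energy_bytes = conformers * 4  # K * 4 B: one float32 energy per conformer
print(coord_bytes, energy_bytes)  # ~1.2 KB of coordinates per molecule
```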