BioTrove: A Large Curated Image Dataset Enabling AI for Biodiversity

November 24, 2025

Data Preprocessing

Before using this script, please download the metadata from Hugging Face and pre-process the data using the biotrove_process library. The library is located in the BioTrove-preprocess/biotrove_process directory. A detailed description can be found in the README file.

The library contains scripts to generate machine learning-ready image-text pairs from the downloaded metadata in four steps:

  1. Processing metadata files to obtain category and species distribution.
  2. Filtering metadata based on user-defined thresholds and generating shuffled chunks.
  3. Downloading images based on URLs in the metadata.
  4. Generating text labels for the images.
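
The four steps above can be illustrated with a minimal, standard-library sketch. It is not the biotrove_process API: the function names, the column names (species, category, url), and the caption template are all hypothetical and chosen only for illustration. Step 3 (downloading images) is omitted because it requires network access.

```python
import random
from collections import Counter

def species_distribution(rows):
    """Step 1: count how many metadata records each species has."""
    return Counter(row["species"] for row in rows)

def filter_and_chunk(rows, min_count, chunk_size, seed=0):
    """Step 2: keep species with at least `min_count` records,
    shuffle the survivors, and split them into fixed-size chunks."""
    counts = species_distribution(rows)
    kept = [row for row in rows if counts[row["species"]] >= min_count]
    random.Random(seed).shuffle(kept)  # seeded so the chunks are reproducible
    return [kept[i:i + chunk_size] for i in range(0, len(kept), chunk_size)]

def text_label(row):
    """Step 4: build a caption for one record (hypothetical template)."""
    return f"a photo of {row['species']}, a kind of {row['category']}"

# Toy metadata standing in for the downloaded files (hypothetical columns).
rows = [
    {"species": "Danaus plexippus", "category": "Insecta", "url": "https://example.org/1.jpg"},
    {"species": "Danaus plexippus", "category": "Insecta", "url": "https://example.org/2.jpg"},
    {"species": "Danaus plexippus", "category": "Insecta", "url": "https://example.org/3.jpg"},
    {"species": "Quercus alba", "category": "Plantae", "url": "https://example.org/4.jpg"},
    {"species": "Quercus alba", "category": "Plantae", "url": "https://example.org/5.jpg"},
    {"species": "Amanita muscaria", "category": "Fungi", "url": "https://example.org/6.jpg"},
]

chunks = filter_and_chunk(rows, min_count=2, chunk_size=2)
for index, chunk in enumerate(chunks):
    print(index, [text_label(row) for row in chunk])
```

In this toy run the single Amanita muscaria record falls below the threshold and is dropped, and the remaining five records are split into chunks of at most two. For the actual commands and options, refer to the biotrove_process README.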

Model Training

We train three models using a modified version of the BioCLIP/OpenCLIP codebase. Each model is trained for 40 epochs on BioTrove-40M, on 2 nodes, 8xH100 GPUs, on NYU's Greene high-performance compute cluster.

We optimize our hyperparameters prior to training with Ray. Our standard training parameters are as follows:

--dataset-type webdataset 
--pretrained openai 
--text_type random 
--dataset-resampled 
--warmup 5000 
--batch-size 4096 
--accum-freq 1 
--epochs 40
--workers 8 
--model ViT-B-16 
--lr 0.0005 
--wd 0.0004 
--precision bf16 
--beta1 0.98 
--beta2 0.99 
--eps 1.0e-6 
--local-loss 
--gather-with-grad 
--ddp-static-graph 
--grad-checkpointing
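
As a convenience, the flags above can be kept in one place and turned into a command line programmatically. The sketch below only builds and prints the argument list; it does not launch training. The entry-point path and the data path are placeholders, and the use of --train-data assumes the modified codebase keeps OpenCLIP's flag of that name.

```python
import shlex

# Flags copied from the list above; None marks a switch that takes no value.
TRAIN_FLAGS = {
    "--dataset-type": "webdataset",
    "--pretrained": "openai",
    "--text_type": "random",
    "--dataset-resampled": None,
    "--warmup": 5000,
    "--batch-size": 4096,
    "--accum-freq": 1,
    "--epochs": 40,
    "--workers": 8,
    "--model": "ViT-B-16",
    "--lr": 0.0005,
    "--wd": 0.0004,
    "--precision": "bf16",
    "--beta1": 0.98,
    "--beta2": 0.99,
    "--eps": "1.0e-6",
    "--local-loss": None,
    "--gather-with-grad": None,
    "--ddp-static-graph": None,
    "--grad-checkpointing": None,
}

def build_command(entry_point, flags, extra=()):
    """Flatten a flag dictionary into an argv-style list."""
    argv = list(entry_point)
    for name, value in flags.items():
        argv.append(name)
        if value is not None:
            argv.append(str(value))
    argv.extend(extra)
    return argv

# Placeholder entry point and data path (assumptions, not from this README).
cmd = build_command(
    ["python", "/PATH/TO/training/main.py"],
    TRAIN_FLAGS,
    extra=["--train-data", "/PATH/TO/SHARDS"],
)
print(shlex.join(cmd))
```

Multi-node launch settings (launcher, node count, GPUs per node) are cluster-specific and are deliberately left out of this sketch.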

For more extensive documentation of the training process and the significance of each hyperparameter, we recommend consulting the OpenCLIP and BioCLIP documentation.

Model weights

See the BioTrove-CLIP Model card on HuggingFace to download the trained model checkpoints.

We released three trained model checkpoints on the BioTrove-CLIP model card. These CLIP-style models were trained on BioTrove-Train (40M) in the following configurations:

  • BT-CLIP-O: a ViT-B/16 backbone initialized from the OpenCLIP checkpoint and trained for 40 epochs.
  • BT-CLIP-B: a ViT-B/16 backbone initialized from the BioCLIP checkpoint and trained for 8 epochs.
  • BT-CLIP-M: a ViT-L/14 backbone initialized from the MetaCLIP checkpoint and trained for 12 epochs.

These models were developed for the benefit of the AI community as an open-source product. We therefore request that any derivative products also be open-source.

Model Validation

To validate the zero-shot accuracy of our trained models and compare them against baseline models, we use the VLHub repository with slight modifications.

Pre-Run

After cloning this repository and navigating to the BioTrove/model_validation directory, we recommend installing the project requirements into a conda environment with pip install -r requirements.txt. Before executing a command in VLHub, also add BioTrove/model_validation/src to your PYTHONPATH.

export PYTHONPATH="$PYTHONPATH:$PWD/src";

Base Command

A basic BioTrove model evaluation command can be launched as follows. This example evaluates a CLIP-ResNet50 checkpoint, whose weights reside at the path given by the --resume flag, on the ImageNet validation set and reports the results to Weights and Biases.

python src/training/main.py --batch-size=32 --workers=8 --imagenet-val "/imagenet/val/" --model="resnet50" --zeroshot-frequency=1 --image-size=224 --resume "/PATH/TO/WEIGHTS.pth" --report-to wandb

Baseline Models

We compare our trained checkpoints to three strong baselines. We describe our baselines in the table below, including the required flags to evaluate them.

| Model Name | Origin | Path to checkpoint | Runtime Flags |
| --- | --- | --- | --- |
| BioCLIP | https://arxiv.org/abs/2311.18803 | https://huggingface.co/imageomics/bioclip | --model ViT-B-16 --resume "/PATH/TO/bioclip_ckpt.bin" |
| OpenAI CLIP | https://arxiv.org/abs/2103.00020 | Downloads automatically | --model ViT-B-16 --pretrained=openai |
| MetaCLIP-cc | https://github.com/facebookresearch/MetaCLIP | Downloads automatically | --model ViT-L-14-quickgelu --pretrained=metaclip_fullcc |
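
To evaluate all three baselines with the same settings, the runtime flags from the table can be combined with the base command shown earlier. The following is a sketch that only assembles and prints the commands; the helper name eval_command is hypothetical, and the shared settings are simply copied from the base command, so adjust them as needed.

```python
import shlex

# Shared settings copied from the base evaluation command.
BASE = [
    "python", "src/training/main.py",
    "--batch-size=32", "--workers=8",
    "--zeroshot-frequency=1", "--image-size=224",
]

# Runtime flags copied from the baseline table.
BASELINES = {
    "BioCLIP": ["--model", "ViT-B-16", "--resume", "/PATH/TO/bioclip_ckpt.bin"],
    "OpenAI CLIP": ["--model", "ViT-B-16", "--pretrained=openai"],
    "MetaCLIP-cc": ["--model", "ViT-L-14-quickgelu", "--pretrained=metaclip_fullcc"],
}

def eval_command(baseline, benchmark_flags):
    """Return the argv list for one baseline on one benchmark."""
    return BASE + BASELINES[baseline] + list(benchmark_flags)

for name in BASELINES:
    argv = eval_command(name, ["--imagenet-val", "/imagenet/val/"])
    print(f"{name}: {shlex.join(argv)}")
```

Each printed line can be pasted into a shell, or the argv list can be passed to subprocess.run once the checkpoint and dataset paths are filled in.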

Existing Benchmarks

In the BioTrove paper, we report results on the following established benchmarks from prior scientific literature: Birds525, BioCLIP-Rare, IP102 Insects, Fungi, Deepweeds, and Confounding Species. We also introduce three new benchmarks: BioTrove-Balanced, BioTrove-LifeStages, and BioTrove-Unseen.

Our package expects a valid path to each image to exist in its corresponding metadata file; therefore, metadata CSV paths must be updated before running each benchmark.

| Benchmark Name | Images URL | Metadata Path | Runtime Flag(s) |
| --- | --- | --- | --- |
| BioTrove-Balanced | https://huggingface.co/datasets/BGLab/BioTrove-Train | https://huggingface.co/datasets/BGLab/BioTrove/tree/main/BioTrove-benchmark/BioTrove-Balanced.csv | --arbor-val --taxon MY_TAXON |
| BioTrove-Lifestages | https://huggingface.co/datasets/BGLab/BioTrove-Train | https://huggingface.co/datasets/BGLab/BioTrove/tree/main/BioTrove-benchmark/BioTrove-LifeStages.csv | --lifestages --taxon MY_TAXON |
| BioTrove-Unseen | https://huggingface.co/datasets/BGLab/BioTrove-Train | https://huggingface.co/datasets/BGLab/BioTrove/tree/main/BioTrove-benchmark/BioTrove-Unseen.csv | --arbor-rare --taxon MY_TAXON |
| BioCLIP Rare | https://huggingface.co/datasets/imageomics/rare-species | model_validation/metadata/bioclip-rare-metadata.csv | --bioclip-rare --taxon MY_TAXON |
| Birds525 | https://www.kaggle.com/datasets/gpiosenka/100-bird-species | model_validation/metadata/birds525_metadata.csv | --birds /birds525 --ds-filter birds |
| Confounding Species | TBD | model_validation/metadata/confounding_species.csv | --confounding |
| Deepweeds | https://www.kaggle.com/datasets/imsparsh/deepweeds | model_validation/metadata/deepweeds_metadata.csv | --deepweeds |
| Fungi | http://ptak.felk.cvut.cz/plants/DanishFungiDataset/DF20M-images.tar.gz | model_validation/metadata/fungi_metadata.csv | --fungi |
| IP102 Insects | https://www.kaggle.com/datasets/rtlmhjbn/ip02-dataset | model_validation/metadata/ins2_metadata.csv | --insects2 |
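
Because each metadata CSV must contain valid image paths, one way to update them after downloading a benchmark is to swap the path prefix in the relevant column. The sketch below works on CSV text in memory and uses only the standard library. The column name filepath, the sample rows, and both root directories are hypothetical; check the header of the actual metadata file before adapting it.

```python
import csv
import io

def rebase_paths(csv_text, column, old_root, new_root):
    """Return CSV text in which paths in `column` that start with
    `old_root` are rewritten to start with `new_root` instead."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        path = row[column]
        if path.startswith(old_root):
            row[column] = new_root + path[len(old_root):]
        writer.writerow(row)
    return out.getvalue()

# Hypothetical two-row metadata file.
sample = (
    "filepath,label\n"
    "/old/root/img_001.jpg,daisy\n"
    "/old/root/img_002.jpg,thistle\n"
)
updated = rebase_paths(sample, "filepath", "/old/root", "/data/deepweeds")
print(updated)
```

To apply this to a file on disk, read it with open(...).read(), pass the text through rebase_paths, and write the result back. Rows whose path does not start with old_root are left unchanged.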

Acknowledgments

If you find this repository useful, please consider citing these related papers:

VLHub

@article{
  feuer2023distributionally,
  title={Distributionally Robust Classification on a Data Budget},
  author={Benjamin Feuer and Ameya Joshi and Minh Pham and Chinmay Hegde},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2023},
  url={https://openreview.net/forum?id=D5Z2E8CNsD},
  note={}
}

BioCLIP

@INPROCEEDINGS{Stevens2024-hc,
  title     = "{BioCLIP}: A Vision Foundation Model for the Tree of Life",
  author    = "Stevens, Samuel and Wu, Jiaman and Thompson, Matthew J and
               Campolongo, Elizabeth G and Song, Chan Hee and Carlyn, David
               Edward and Dong, Li and Dahdul, Wasila M and Stewart, Charles and
               Berger-Wolf, Tanya and Chao, Wei-Lun and Su, Yu",
  booktitle = "Proceedings of the IEEE/CVF Conference on Computer Vision and
               Pattern Recognition (CVPR)",
  pages     = "19412--19424",
  year      =  2024,
  url       = "https://openaccess.thecvf.com/content/CVPR2024/papers/Stevens_BioCLIP_A_Vision_Foundation_Model_for_the_Tree_of_Life_CVPR_2024_paper.pdf"
}

OpenCLIP

@software{ilharco_gabriel_2021_5143773,
  author       = {Ilharco, Gabriel and
                  Wortsman, Mitchell and
                  Wightman, Ross and
                  Gordon, Cade and
                  Carlini, Nicholas and
                  Taori, Rohan and
                  Dave, Achal and
                  Shankar, Vaishaal and
                  Namkoong, Hongseok and
                  Miller, John and
                  Hajishirzi, Hannaneh and
                  Farhadi, Ali and
                  Schmidt, Ludwig},
  title        = {OpenCLIP},
  month        = jul,
  year         = 2021,
  note         = {If you use this software, please cite it as below.},
  publisher    = {Zenodo},
  version      = {0.1},
  doi          = {10.5281/zenodo.5143773},
  url          = {https://doi.org/10.5281/zenodo.5143773}
}

Parts of this project page were adapted from the Nerfies page.

Website License

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Citation

@misc{yang2024arboretumlargemultimodaldataset,
        title={Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity}, 
        author={Chih-Hsuan Yang and Benjamin Feuer and Zaki Jubery and Zi K. Deng and Andre Nakkab and Md Zahid Hasan and Shivani Chiranjeevi and Kelly Marshall and Nirmal Baishnab and Asheesh K Singh and Arti Singh and Soumik Sarkar and Nirav Merchant and Chinmay Hegde and Baskar Ganapathysubramanian},
        year={2024},
        eprint={2406.17720},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2406.17720}, 
  }