NovelConcepts10 Dataset

July 8, 2025

NovelConcepts10 was created to study in-distribution (ID) and out-of-distribution (OD) data for generative models. In particular, we focus on the distribution of the LAION Aesthetic dataset and the generative prior of pretrained Stable Diffusion models.

NovelConcepts10 contains five ID concepts (BlackKeyboard, CautionSign, CokeCan, EuroCoin, MetalWrench) and five OD concepts (BlueElephant, ButterflyClip, CatPaw, FishDoll, MonkeySleigh). Each concept contains five images, for a total of 50 images in the dataset. Within each concept, every photo uses a different pose, orientation, and/or background to encourage semantic diversity between images. Each photo has a resolution of 4080x3072 and is stored in PNG format to avoid lossy compression. Each image filename is an image caption generated by BLIP (Li, 2022: https://github.com/salesforce/BLIP). We also include all images resized to 512x512, as well as binary concept masks generated using the Segment Anything Model (Kirillov, 2023: https://github.com/facebookresearch/segment-anything).
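The structure described above can be sanity-checked with a short script. The following is a minimal sketch, not part of the dataset tooling. The concept names and counts come from the description, but the `<root>/<Concept>/*.png` directory layout and the underscore handling in `caption_from_filename` are assumptions made for illustration and may not match the actual repository.

```python
from pathlib import Path

# Concept names and the per-concept image count are taken from the description.
ID_CONCEPTS = ["BlackKeyboard", "CautionSign", "CokeCan", "EuroCoin", "MetalWrench"]
OD_CONCEPTS = ["BlueElephant", "ButterflyClip", "CatPaw", "FishDoll", "MonkeySleigh"]
IMAGES_PER_CONCEPT = 5


def expected_image_count() -> int:
    """Total number of full-resolution images implied by the description."""
    return (len(ID_CONCEPTS) + len(OD_CONCEPTS)) * IMAGES_PER_CONCEPT


def split_of(concept: str) -> str:
    """Return "ID" or "OD" for a known concept name."""
    if concept in ID_CONCEPTS:
        return "ID"
    if concept in OD_CONCEPTS:
        return "OD"
    raise KeyError(f"unknown concept: {concept}")


def caption_from_filename(path: str) -> str:
    """Recover the BLIP caption from an image filename.

    Assumption: the caption is the filename stem, possibly with underscores
    standing in for spaces. The real naming scheme may differ.
    """
    return Path(path).stem.replace("_", " ").strip()


def count_pngs(root: str) -> dict:
    """Count PNG files per concept, assuming a <root>/<Concept>/*.png layout."""
    root_path = Path(root)
    counts = {}
    for concept in ID_CONCEPTS + OD_CONCEPTS:
        folder = root_path / concept
        counts[concept] = len(list(folder.glob("*.png"))) if folder.is_dir() else 0
    return counts
```

If the layout assumption holds, every value returned by `count_pngs` should equal `IMAGES_PER_CONCEPT`, and the values should sum to `expected_image_count()`.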

Evaluating Distribution Inclusion

For each concept, we investigated the following:

  • Presence of exact matches and similar concepts in LAION-5B using https://haveibeentrained.com/
  • The ability of Stable Diffusion 1.5 to generate similar images

Using text queries describing each concept, we checked whether each NovelConcepts10 concept is present in LAION-5B using https://haveibeentrained.com/. Human judgment was used to decide whether the returned images matched those in NovelConcepts10. In general, the search results show that OD concepts have zero or few exact matches in LAION-5B, whereas ID concepts have many. Additionally, we generated images with Stable Diffusion 1.5 text2image from text prompts describing each concept and compared them to the images in NovelConcepts10, again using human judgment to decide whether the generated images matched.
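The generation step of this check could be reproduced along the following lines. This is a hedged sketch rather than the procedure actually used: the prompts are not listed in this README, so `concept_to_prompt` is a hypothetical helper that derives a generic prompt from the concept name, and the Hugging Face model identifier, image count, and seed are assumptions. It relies on the `diffusers` and `torch` packages, which are imported lazily inside `generate_images`.

```python
import re


def concept_to_prompt(concept: str) -> str:
    """Turn a CamelCase concept name into a simple text prompt.

    Hypothetical helper: the prompts used for the original evaluation
    are not given, so this only illustrates the idea.
    """
    words = re.findall(r"[A-Z][a-z]*", concept)
    return "a photo of a " + " ".join(word.lower() for word in words)


def generate_images(
    prompts,
    model_id="stable-diffusion-v1-5/stable-diffusion-v1-5",  # assumed checkpoint id
    images_per_prompt=4,  # assumed; the README does not state a count
    seed=0,
):
    """Generate images for each prompt with a Stable Diffusion 1.5 pipeline."""
    # Imported lazily so the prompt helper works without these packages.
    import torch
    from diffusers import StableDiffusionPipeline

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32
    pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=dtype)
    pipe = pipe.to(device)

    # A fixed seed makes the generated set repeatable across runs.
    generator = torch.Generator(device=device).manual_seed(seed)
    results = {}
    for prompt in prompts:
        output = pipe(
            prompt,
            num_images_per_prompt=images_per_prompt,
            generator=generator,
        )
        results[prompt] = output.images  # list of PIL images
    return results
```

The returned images would then be inspected by a person next to the five reference photos for the concept, since the comparison described above relies on human judgment rather than an automated metric.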