dall-e-baby

February 19, 2021

OpenAI's Dall-E is a kick-ass model that takes a natural language prompt and generates an image based on it. I cannot recreate the complete Dall-E, so I made a baby version of it trained on the CIFAR10-100 dataset. If Dall-E is Picasso, this is, well... shit.

Results

First step is training the discreteVAE and you can see the results below:

In the next step we train on a dataset of 2.3M captions and images; you can see the results below [In Progress].

Stream

I am streaming the progress of this side project on YouTube; do check it out.

Datasets

Originally I was a fool who scraped images for the dataset, which is a very stupid process. I should have gone to academictorrents.com first. This is the list of datasets I will be using in v2 of this model (these are only for training the AutoEncoder model):

| name | size | image count | link | used for VAE | captions given | captions generated |
| --- | --- | --- | --- | --- | --- | --- |
| Downscale OpenImagesv4 | 16GB | 1.9M | torrent | | | |
| Stanford STL-10 | 2.64GB | 113K | torrent | | | |
| CVPR Indoor Scene Recognition | 2.59GB | 15620 | torrent | | | |
| The Visual Genome Dataset v1.0 + v1.2 Images | 15.20GB | 108K | torrent | | | |
| Food-101 | 5.69GB | 101K | torrent | | | |
| The Street View House Numbers (SVHN) Dataset | 2.64GB | 600K | torrent | | | |
| Downsampled ImageNet 64x64 | 12.59GB | 1.28M | torrent | | | |
| COCO 2017 | 52.44GB | 287K | torrent, website | | | |
| Flickr 30k Captions (bad data, downloads duplicates) | 8GB | 31K | kaggle | | | |

To download the files, follow the instructions in download.txt. Note that although it looks like a shell script, the commands in it are meant to be run in parallel rather than one after another.
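One way to run such a command list in parallel is a small thread pool. This is only a sketch: it assumes download.txt holds one shell command per line and that lines starting with # are comments, neither of which is confirmed here. `parse_commands` and `run_parallel` are hypothetical helper names, not part of this repo.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def parse_commands(text):
    """Return the non-empty, non-comment lines of a command list."""
    commands = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            commands.append(line)
    return commands


def run_parallel(commands, workers=4):
    """Run each shell command in a worker thread; return exit codes in order."""
    def run(cmd):
        # shell=True because each line is assumed to be a complete shell command.
        return subprocess.run(cmd, shell=True).returncode

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, commands))
```

To use it, read the file and pass its contents through both helpers, for example `run_parallel(parse_commands(open("download.txt").read()))`, then check that every returned exit code is 0.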

Caption Datasets

Of the datasets above, Visual Genome, COCO, and Flickr30K have captions associated with their images. The rest only have a class associated with each image. To generate captions for those datasets, run python3 generate_captions.py; you need to have the datasets mentioned above on your system. The script logs all the details and creates a JSON file that looks like this (ignore the doubled open_images :P):

{
  "open_images_open_images_0123e1f263cf714f": {
    "path": "../downsampled-open-images-v4/256px/validation/0123e1f263cf714f.jpg",
    "caption": "low res photo of human hand"
  },
  "indoor_15613": {
    "path": "../indoorCVPR/winecellar/wine_storage_42_07_altavista.jpg",
    "caption": "picture inside of winecellar"
  }
}
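Turning a class label into a caption can be as simple as filling in a per-dataset template. The sketch below is a guess at the idea, not the code in generate_captions.py: the two templates are inferred from the sample captions above, and `TEMPLATES`, `make_caption`, and `make_entry` are hypothetical names.

```python
import json

# Templates inferred from the two sample captions; the real script may differ.
TEMPLATES = {
    "open_images": "low res photo of {label}",
    "indoor": "picture inside of {label}",
}


def make_caption(dataset, label):
    """Turn a class label into a caption using the dataset's template."""
    return TEMPLATES[dataset].format(label=label.replace("_", " "))


def make_entry(path, dataset, label):
    """Build one record with the same fields as the JSON shown above."""
    return {"path": path, "caption": make_caption(dataset, label)}


entries = {
    "indoor_15613": make_entry(
        "../indoorCVPR/winecellar/wine_storage_42_07_altavista.jpg",
        "indoor",
        "winecellar",
    ),
}
print(json.dumps(entries, indent=2))
```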

Training

Variational AutoEncoder

The first step is to clean the data using the provided script: python3 clean_data.py. Note that you need to update the folder paths in it to match your setup. Then train the discrete VAE by running:

python3 discrete_vae.py

It turns out training a VAE is not an easy task. I first trained with SGD, but training was taking too long and kept collapsing. Adam with gradient clipping works best. After training 100 models (tracked with wandb), this configuration has the best size/performance trade-off:

in_channels:    3
embedding_dim:  300
num_embeddings: 3000
hidden_dims:    [200, 300, 400]
add_residual:   False
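For readers unfamiliar with gradient clipping, here is a minimal, framework-free sketch of clipping by global norm. Whether discrete_vae.py clips by norm or by value, and with what threshold, is not stated here, so `clip_by_global_norm` and `max_norm=1.0` are assumptions for illustration. In PyTorch, `torch.nn.utils.clip_grad_norm_` performs this kind of rescaling in place.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale gradient vectors so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm:
        return grads, total
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads], total


# Two parameter groups whose joint norm is sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)
```

Clipping keeps the direction of the update and only limits its size, which is one common way to stop a single bad batch from blowing up training.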

The dataset looks like this; note that the model finishes training after seeing only 2462400 samples (~66.44% of the training split):

:: Dataset: <BabyDallEDataset (train) openimages256:1910396|food-101:101000|svhn:248823|indoor:15614|imagenet_train64x64:1331148|stl10:13000|genome1:64346|genome2:43733|total:3728060|train:3705691>
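As a quick sanity check on these numbers, the per-dataset counts can be summed and compared with the reported total, and the 2462400 figure can be divided by the train split. The variable names below are illustrative; the counts are copied from the summary line above.

```python
# Per-dataset image counts copied from the dataset summary line.
counts = {
    "openimages256": 1910396,
    "food-101": 101000,
    "svhn": 248823,
    "indoor": 15614,
    "imagenet_train64x64": 1331148,
    "stl10": 13000,
    "genome1": 64346,
    "genome2": 43733,
}

total = sum(counts.values())   # should match total:3728060
train = 3705691                # train split size from the summary
held_out = total - train       # images not in the train split
samples_seen = 2462400
fraction = samples_seen / train

print(total, held_out, f"{fraction:.2%}")
```

The counts add up to the reported total, and the fraction comes out at roughly 66.4% of the train split, so the percentage is relative to the train split rather than to the full dataset.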

Once the model is trained you can actually visualise the learned embeddings (the codebook) to see the textures it picked up (cool, right?):

This model took >12 hours of training.
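To make the codebook idea concrete, here is a toy sketch of how a latent vector can be mapped to a discrete token by nearest-neighbour lookup. It assumes a VQ-VAE-style quantiser, which the embedding_dim and num_embeddings parameter names suggest but which is not confirmed here; `nearest_code` and `quantize` are hypothetical names, and the 2-dimensional codebook is made up for the example.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to vector (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(latents, codebook):
    """Map every latent vector to the index of its nearest codebook entry."""
    return [nearest_code(vector, codebook) for vector in latents]


# Toy codebook with 3 entries of dimension 2 (the config above uses 3000 x 300).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
latents = [[0.9, 1.2], [-0.8, 1.7], [0.1, -0.2]]
tokens = quantize(latents, codebook)
print(tokens)
```

Under that assumption, each image becomes a grid of such indices, and decoding a single codebook entry on its own is what produces the texture-like visualisations.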

Transformer

The code for the transformer model and its training is in dalle.py. The model is a straightforward dense transformer without any sparse attention (that is next in the pipeline). This part needs captions, which are generated with generate_captions.py; you can also use captions.ipynb for a more visual approach.
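"Dense" here means every position attends to all earlier positions, with nothing skipped. The single-head sketch below illustrates that in plain Python. It assumes causal (autoregressive) masking as in the original Dall-E, and it is not taken from dalle.py; `dense_causal_attention` is a hypothetical name.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def dense_causal_attention(q, k, v):
    """Single-head attention where position i attends to every position j <= i."""
    dim = len(q[0])
    out = []
    for i in range(len(q)):
        # Dense: scores are computed for all earlier positions, none skipped.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v[j][d] for j, w in enumerate(weights))
            for d in range(len(v[0]))
        ])
    return out


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = dense_causal_attention(x, x, x)
print(out)
```

The cost of this grows quadratically with sequence length, which is the motivation for the sparse attention variants mentioned as the next step.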

Credits

This work would not have been possible without the millions of people who helped create the datasets.

License

MIT License.