CoreML-Models


Converted Core ML Model Zoo.

Core ML is a machine learning framework by Apple. If you are an iOS developer, you can easily use machine learning models in your Xcode project.

Try the iOS sample-app collection (sample_apps/CoreMLModelsApp) on the App Store.

How to use

Take a look through this model zoo, and if you find the Core ML model you want, download it from the Google Drive link and bundle it in your project. If the model has a sample-project link, try the sample to see how the model is used.

If you like this repository, please give it a star so I can keep up the work.


How to get the model

You can download each model, already converted to Core ML format, from its Google Drive link. See the section below for how to use it in Xcode. The license for each model follows the license of the original project.

Image Classifier

Efficientnet

Google Drive Link | Size | Dataset | Original Project | License
Efficientnetb0 | 22.7 MB | ImageNet | TensorFlowHub | Apache 2.0

Efficientnetv2

Google Drive Link | Size | Dataset | Original Project | License | Year
Efficientnetv2 | 85.8 MB | ImageNet | Google/autoML | Apache 2.0 | 2021

VisionTransformer

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.

Google Drive Link | Size | Dataset | Original Project | License | Year
VisionTransformer-B16 | 347.5 MB | ImageNet | google-research/vision_transformer | Apache 2.0 | 2021

Conformer

Local Features Coupling Global Representations for Visual Recognition.

Google Drive Link | Size | Dataset | Original Project | License | Year
Conformer-tiny-p16 | 94.1 MB | ImageNet | pengzhiliang/Conformer | Apache 2.0 | 2021

DeiT

Data-efficient Image Transformers

Google Drive Link | Size | Dataset | Original Project | License | Year
DeiT-base384 | 350.5 MB | ImageNet | facebookresearch/deit | Apache 2.0 | 2021

RepVGG

Making VGG-style ConvNets Great Again

Google Drive Link | Size | Dataset | Original Project | License | Year
RepVGG-A0 | 33.3 MB | ImageNet | DingXiaoH/RepVGG | MIT | 2021

RegNet

Designing Network Design Spaces

Google Drive Link | Size | Dataset | Original Project | License | Year
regnet_y_400mf | 16.5 MB | ImageNet | TORCHVISION.MODELS | MIT | 2020

MobileViTv2

CVNets: A library for training computer vision networks

Google Drive Link | Size | Dataset | Original Project | License | Year | Conversion Script
MobileViTv2 | 18.8 MB | ImageNet | apple/ml-cvnets | apple | 2022 | Open In Colab

Object Detection

D-FINE

D-FINE iOS Demo
Download Link | Size | Output | Original Project | License | Note | Sample Project
dfine-n-coco | 13 MB | Confidence (MultiArray (Float32 300 × 80)), Coordinates (MultiArray (Float32 300 × 4)) | Peterande/D-FINE | Apache 2.0 | Input 640×640. Coordinates are normalized cxcywh. No NMS — filter by confidence threshold. | peaceofcake DFINEDemo
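
Below is a minimal decoding sketch for this detector family, which ships without NMS. The output feature names ("confidence", "coordinates") and the Detection type are illustrative assumptions; check the actual output names of the .mlpackage in Xcode.

import CoreML
import CoreGraphics

struct Detection {
    let classIndex: Int
    let score: Float
    let rect: CGRect   // normalized, top-left origin
}

func decodeDetections(_ output: MLFeatureProvider, threshold: Float = 0.5) -> [Detection] {
    // Assumed output names; verify against the model's interface in Xcode.
    guard let conf = output.featureValue(for: "confidence")?.multiArrayValue,
          let coords = output.featureValue(for: "coordinates")?.multiArrayValue else { return [] }
    let queries = conf.shape[0].intValue    // 300
    let classes = conf.shape[1].intValue    // 80
    var detections: [Detection] = []
    for q in 0..<queries {
        // Take the best class for this query.
        var bestClass = 0
        var bestScore: Float = 0
        for c in 0..<classes {
            let s = conf[[q, c] as [NSNumber]].floatValue
            if s > bestScore { bestScore = s; bestClass = c }
        }
        guard bestScore >= threshold else { continue }
        // Convert normalized cx/cy/w/h to a top-left rect.
        let cx = coords[[q, 0] as [NSNumber]].floatValue
        let cy = coords[[q, 1] as [NSNumber]].floatValue
        let w  = coords[[q, 2] as [NSNumber]].floatValue
        let h  = coords[[q, 3] as [NSNumber]].floatValue
        detections.append(Detection(classIndex: bestClass, score: bestScore,
                                    rect: CGRect(x: CGFloat(cx - w / 2), y: CGFloat(cy - h / 2),
                                                 width: CGFloat(w), height: CGFloat(h))))
    }
    return detections
}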

RF-DETR

RF-DETR iOS Demo
Download Link | Size | Output | Original Project | License | Note | Sample Project
rfdetr-n-coco | 95 MB | Confidence (MultiArray (Float32 300 × 91)), Coordinates (MultiArray (Float32 300 × 4)) | roboflow/rf-detr | Apache 2.0 | Input 384×384. 91 classes (index 0 = background, 1-90 = COCO category IDs). Coordinates are normalized cxcywh. No NMS. | peaceofcake DFINEDemo

YOLOv5s

Google Drive Link | Size | Output | Original Project | License | Note | Sample Project
YOLOv5s | 29.3 MB | Confidence (MultiArray (Double 0 × 80)), Coordinates (MultiArray (Double 0 × 4)) | ultralytics/yolov5 | GNU | Non Maximum Suppression has been added. | CoreML-YOLOv5
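
Because NMS is baked into these YOLO conversions, Vision can hand back ready-made VNRecognizedObjectObservation results. A minimal sketch, assuming the Xcode-generated model class is named YOLOv5s (the class name follows your bundled file name):

import CoreML
import Vision

func detect(in cgImage: CGImage) throws {
    let model = try VNCoreMLModel(for: YOLOv5s(configuration: MLModelConfiguration()).model)
    let request = VNCoreMLRequest(model: model) { request, _ in
        guard let results = request.results as? [VNRecognizedObjectObservation] else { return }
        for object in results {
            // boundingBox is normalized with a bottom-left origin (Vision convention).
            if let best = object.labels.first {
                print(best.identifier, best.confidence, object.boundingBox)
            }
        }
    }
    request.imageCropAndScaleOption = .scaleFill
    try VNImageRequestHandler(cgImage: cgImage, options: [:]).perform([request])
}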

YOLOv7

Google Drive Link | Size | Output | Original Project | License | Note | Sample Project | Conversion Script
YOLOv7 | 147.9 MB | Confidence (MultiArray (Double 0 × 80)), Coordinates (MultiArray (Double 0 × 4)) | WongKinYiu/yolov7 | GNU | Non Maximum Suppression has been added. | CoreML-YOLOv5 | Open In Colab

YOLOv8

Google Drive Link | Size | Output | Original Project | License | Note | Sample Project
YOLOv8s | 45.1 MB | Confidence (MultiArray (Double 0 × 80)), Coordinates (MultiArray (Double 0 × 4)) | ultralytics/ultralytics | GNU | Non Maximum Suppression has been added. | CoreML-YOLOv5

YOLOv9

YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. Uses PGI and GELAN architecture for efficient object detection.

Download Link | Size | Output | Original Project | License | Year | Note | Sample Project
yolov9s.mlpackage.zip | 14 MB | Confidence (MultiArray (Double 0 × 80)), Coordinates (MultiArray (Double 0 × 4)) | WongKinYiu/yolov9 | GPL-3.0 | 2024 | Non Maximum Suppression has been added. | YOLOv9Demo

YOLOv10

YOLOv10: Real-Time End-to-End Object Detection. NMS-free architecture using consistent dual assignments — no post-processing needed.

Download Link | Size | Output | Original Project | License | Year | Note | Sample Project
yolov10s.mlpackage.zip | 14 MB | MultiArray (1 × 300 × 6) | THU-MIG/yolov10 | AGPL-3.0 | 2024 | NMS-free end-to-end detection. | YOLO26Demo

YOLO11

YOLO11: Ultralytics' latest YOLO with an improved backbone and neck architecture. 22% fewer parameters than YOLOv8 with higher mAP.

Download Link | Size | Output | Original Project | License | Year | Note | Sample Project
yolo11s.mlpackage.zip | 18 MB | Confidence (MultiArray (Double 0 × 80)), Coordinates (MultiArray (Double 0 × 4)) | ultralytics/ultralytics | AGPL-3.0 | 2024 | Non Maximum Suppression has been added. | YOLOv9Demo

YOLO26

YOLO26: Edge-first vision AI with NMS-free end-to-end detection. Up to 43% faster CPU inference vs YOLO11 with DFL removal and ProgLoss.

Download Link | Size | Output | Original Project | License | Year | Note | Sample Project
yolo26s.mlpackage.zip | 18 MB | MultiArray (1 × 300 × 6) | ultralytics/ultralytics | AGPL-3.0 | 2026 | NMS-free end-to-end detection. | YOLO26Demo

YOLO-World

YOLO-World: Real-Time Open-Vocabulary Object Detection. Type any text query and detect it — no fixed class list. Uses CLIP text encoder for open-vocabulary matching.

Download Link | Size | Description | Original Project | License | Year | Sample Project
yoloworld_detector.mlpackage.zip | 25 MB | YOLO-World V2-S visual detector | AILab-CVC/YOLO-World | GPL-3.0 | 2024 | YOLOWorldDemo
clip_text_encoder.mlpackage.zip | 121 MB | CLIP ViT-B/32 text encoder | openai/CLIP | MIT | 2021
clip_vocab.json.zip | 1.6 MB | BPE vocabulary for tokenizer

Multi-Object Tracking

ByteTrack

ByteTrack: Multi-Object Tracking by Associating Every Detection Box. Pure-Swift on-device tracker that adds persistent IDs to any of the object detectors above. Pairs a per-track 8D constant-velocity Kalman filter with a two-stage IoU association — high-confidence detections are matched first, then low-confidence detections are used to rescue tracks about to be lost through motion blur and brief occlusions. No appearance / ReID network, so it runs for free on top of an existing detector.

Implementation | Source | Paper | License | Year | Note | Sample Project
Pure Swift (no download) | Tracker.swift | ByteTrack (arXiv 2110.06864) | MIT (this port) / Original | 2022 | 8D Kalman + two-stage IoU association, class-aware, greedy matching, lost-track buffer of 30 frames. Drop-in on top of any [Detection] stream. | YOLO26Demo
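
The sketch below illustrates the two-stage idea (it is not the actual Tracker.swift API): high-confidence detections are matched to live tracks first, and low-confidence detections then get a chance to rescue whatever is left unmatched.

import CoreGraphics

func iou(_ a: CGRect, _ b: CGRect) -> CGFloat {
    let inter = a.intersection(b)
    guard !inter.isNull else { return 0 }
    let interArea = inter.width * inter.height
    let unionArea = a.width * a.height + b.width * b.height - interArea
    return unionArea > 0 ? interArea / unionArea : 0
}

// Greedy IoU matching above a gate; returns matched (track, detection) pairs
// and the tracks that stayed unmatched for the next stage.
func greedyMatch(tracks: [CGRect], detections: [CGRect], gate: CGFloat)
    -> (matches: [(Int, Int)], unmatchedTracks: [Int]) {
    var pairs: [(t: Int, d: Int, score: CGFloat)] = []
    for (t, tb) in tracks.enumerated() {
        for (d, db) in detections.enumerated() {
            let s = iou(tb, db)
            if s >= gate { pairs.append((t, d, s)) }
        }
    }
    pairs.sort { $0.score > $1.score }   // best overlap first
    var usedT = Set<Int>(), usedD = Set<Int>()
    var matches: [(Int, Int)] = []
    for p in pairs where !usedT.contains(p.t) && !usedD.contains(p.d) {
        usedT.insert(p.t); usedD.insert(p.d)
        matches.append((p.t, p.d))
    }
    return (matches, tracks.indices.filter { !usedT.contains($0) })
}

// Stage 1: high-confidence detections vs. all live (Kalman-predicted) tracks.
// Stage 2: low-confidence detections vs. the stage-1 leftovers: this is what
// carries IDs through motion blur and brief occlusion.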

Segmentation

U2Net

Google Drive Link | Size | Output | Original Project | License
U2Net | 175.9 MB | Image (GRAYSCALE 320 × 320) | xuebinqin/U-2-Net | Apache
U2Netp | 4.6 MB | Image (GRAYSCALE 320 × 320) | xuebinqin/U-2-Net | Apache

IS-Net

Google Drive Link | Size | Output | Original Project | License | Year | Conversion Script
IS-Net | 176.1 MB | Image (GRAYSCALE 1024 × 1024) | xuebinqin/DIS | Apache | 2022 | Open In Colab
IS-Net-General-Use | 176.1 MB | Image (GRAYSCALE 1024 × 1024) | xuebinqin/DIS | Apache | 2022 | Open In Colab

RMBG1.4

RMBG1.4: the IS-Net architecture enhanced with BRIA's training scheme and proprietary dataset.

Download Link | Size | Output | Original Project | License | Year | Sample Project | Conversion Script
RMBG_1_4.mlpackage.zip | 42 MB (INT8) | Alpha mask 1024x1024 | briaai/RMBG-1.4 | Creative Commons | 2024 | RMBGDemo | convert_rmbg.py

face-Parsing

Google Drive Link | Size | Output | Original Project | License | Sample Project
face-Parsing | 53.2 MB | MultiArray (1 × 512 × 512) | zllrunning/face-parsing.PyTorch | MIT | CoreML-face-parsing

Segformer

Simple and Efficient Design for Semantic Segmentation with Transformers

Google Drive Link | Size | Output | Original Project | License | Year
SegFormer_mit-b0_1024x1024_cityscapes | 14.9 MB | MultiArray (512 × 1024) | NVlabs/SegFormer | NVIDIA | 2021

BiSeNetV2

Bilateral Network with Guided Aggregation for Real-time Semantic Segmentation

Google Drive Link | Size | Output | Original Project | License | Year
BiSeNetV2_1024x1024_cityscapes | 12.8 MB | MultiArray | ycszen/BiSeNet | Apache 2.0 | 2021

DNL

Disentangled Non-Local Neural Networks

Google Drive Link | Size | Output | Dataset | Original Project | License | Year
dnl_r50-d8_512x512_80k_ade20k | 190.8 MB | MultiArray [512x512] | ADE20K | yinmh17/DNL-Semantic-Segmentation | Apache 2.0 | 2020

ISANet

Interlaced Sparse Self-Attention for Semantic Segmentation

Google Drive Link | Size | Output | Dataset | Original Project | License | Year
isanet_r50-d8_512x512_80k_ade20k | 141.5 MB | MultiArray [512x512] | ADE20K | openseg-group/openseg.pytorch | MIT | ArXiv'2019/IJCV'2021

FastFCN

Rethinking Dilated Convolution in the Backbone for Semantic Segmentation

Google Drive Link | Size | Output | Dataset | Original Project | License | Year
fastfcn_r50-d32_jpu_aspp_512x512_80k_ade20k | 326.2 MB | MultiArray [512x512] | ADE20K | wuhuikai/FastFCN | MIT | ArXiv'2019

GCNet

Non-local Networks Meet Squeeze-Excitation Networks and Beyond

Google Drive Link | Size | Output | Dataset | Original Project | License | Year
gcnet_r50-d8_512x512_20k_voc12aug | 189 MB | MultiArray [512x512] | PascalVOC | xvjiarui/GCNet | Apache License 2.0 | ICCVW'2019/TPAMI'2020

DANet

Dual Attention Network for Scene Segmentation (CVPR 2019)

Google Drive Link | Size | Output | Dataset | Original Project | License | Year
danet_r50-d8_512x1024_40k_cityscapes | 189.7 MB | MultiArray [512x1024] | CityScapes | junfu1115/DANet | MIT | CVPR2019

Semantic-FPN

Panoptic Feature Pyramid Networks

Google Drive Link | Size | Output | Dataset | Original Project | License | Year
fpn_r50_512x1024_80k_cityscapes | 108.6 MB | MultiArray [512x1024] | CityScapes | facebookresearch/detectron2 | Apache License 2.0 | 2019

cloths_segmentation

Code for binary segmentation of various clothes.

Google Drive Link | Size | Output | Dataset | Original Project | License | Year
clothSegmentation | 50.1 MB | Image (GrayScale 640x960) | fashion-2019-FGVC6 | facebookresearch/detectron2 | MIT | 2020

easyportrait

EasyPortrait - Face Parsing and Portrait Segmentation Dataset.

Google Drive Link | Size | Output | Original Project | License | Year | Swift Sample | Conversion Script
easyportrait-segformer512-fp | 7.6 MB | Image (GrayScale 512x512) × 9 | hukenovs/easyportrait | Creative Commons | 2023 | easyportrait-coreml | Open In Colab

MobileSAM

Faster Segment Anything: Towards Lightweight SAM for Mobile Applications. MobileSAM replaces the heavy ViT-H image encoder with a lightweight ViT-Tiny encoder via decoupled knowledge distillation, making it ~60x smaller and ~40x faster than the original SAM.

Download Link | Size | Output | Original Project | License | Year | Sample Project
MobileSAM.zip | 23 MB (Encoder 13 MB + Decoder 9.8 MB) | Segmentation Mask | ChaoningZhang/MobileSAM | Apache 2.0 | 2023 | SamKit

SAM2-Tiny

SAM 2: Segment Anything in Images and Videos. SAM 2 extends promptable segmentation from images to videos using a streaming architecture with memory. The Tiny variant uses a Hiera-T backbone for efficient on-device inference.

Download Link | Size | Output | Original Project | License | Year | Sample Project
SAM2Tiny.zip | 76 MB (ImageEncoder 64 MB + PromptEncoder 2 MB + MaskDecoder 9.8 MB) | Segmentation Mask | facebookresearch/sam2 | Apache 2.0 | 2024 | SamKit

Video Matting

MatAnyone

pq-yang/MatAnyone (CVPR 2025) — temporally consistent video matting with object-level memory propagation. Given a first-frame mask, the network tracks and refines an alpha matte across the whole clip, holding sharp edges (hair, semi-transparent regions) much better than per-frame matting baselines. Built on the Cutie video object segmentation backbone with a dedicated mask decoder for matting.

The CoreML port splits the network into 5 stateless modules so the per-frame memory state machine can live in Swift while CoreML handles the heavy compute. End-to-end alpha matte parity vs the official PyTorch reference: MAE < 2e-4, correlation 0.9999+ across 18 frames including 3 memory cycles.

The sample app uses Vision's VNGeneratePersonSegmentationRequest to bootstrap the first-frame mask automatically — pick a video, tap "Remove BG", and it composites the foreground over the chosen background colour.

Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script
MatAnyone (5 mlpackages, ~111 MB FP16 total) | 111 MB | image [1,3,432,768] (per-frame state in Swift) | alpha matte [1,1,432,768] | pq-yang/MatAnyone | NTU S-Lab 1.0 | 2025 | MatAnyoneDemo | convert_matanyone.py

See sample_apps/MatAnyoneDemo/README.md for the per-frame state machine, the 5-module split, and conversion details.
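
The per-frame state machine can be sketched like this. All feature names here ("image", "memory_keys", "memory_values", "alpha") and the two-model simplification are hypothetical; the real app splits the network into 5 modules and accumulates memory in Swift, as described in the demo README.

import CoreML

func runMatting(frames: [MLFeatureValue], encoder: MLModel, decoder: MLModel) throws -> [MLMultiArray] {
    var mattes: [MLMultiArray] = []
    for frame in frames {
        // Encode the frame into key/value features for the memory bank.
        let enc = try encoder.prediction(from: try MLDictionaryFeatureProvider(
            dictionary: ["image": frame]))
        // In the real pipeline the new features are appended to the bank in
        // Swift; here we simply keep the latest ones to show the data flow.
        guard let keys = enc.featureValue(for: "memory_keys"),
              let values = enc.featureValue(for: "memory_values") else { continue }
        // Decode an alpha matte [1,1,432,768] conditioned on the memory.
        let dec = try decoder.prediction(from: try MLDictionaryFeatureProvider(
            dictionary: ["image": frame, "memory_keys": keys, "memory_values": values]))
        if let alpha = dec.featureValue(for: "alpha")?.multiArrayValue {
            mattes.append(alpha)
        }
    }
    return mattes
}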

Super Resolution

Real ESRGAN

Google Drive Link | Size | Output | Original Project | License | Year
Real ESRGAN4x | 66.9 MB | Image (RGB 2048x2048) | xinntao/Real-ESRGAN | BSD 3-Clause License | 2021
Real ESRGAN Anime4x | 66.9 MB | Image (RGB 2048x2048) | xinntao/Real-ESRGAN | BSD 3-Clause License | 2021

GFPGAN

Towards Real-World Blind Face Restoration with Generative Facial Prior

Google Drive Link | Size | Output | Original Project | License | Year
GFPGAN | 337.4 MB | Image (RGB 512x512) | TencentARC/GFPGAN | Apache 2.0 | 2021

BSRGAN

Google Drive Link | Size | Output | Original Project | License | Year
BSRGAN | 66.9 MB | Image (RGB 2048x2048) | cszn/BSRGAN | | 2021

A-ESRGAN

Google Drive Link | Size | Output | Original Project | License | Year | Conversion Script
A-ESRGAN | 63.8 MB | Image (RGB 1024x1024) | aesrgan/A-ESRGAN | BSD 3-Clause License | 2021 | Open In Colab

Beby-GAN

Best-Buddy GANs for Highly Detailed Image Super-Resolution

Google Drive Link | Size | Output | Original Project | License | Year
Beby-GAN | 66.9 MB | Image (RGB 2048x2048) | dvlab-research/Simple-SR | MIT | 2021

RRDN

The Residual in Residual Dense Network for image super-scaling.

Google Drive Link | Size | Output | Original Project | License | Year
RRDN | 16.8 MB | Image (RGB 2048x2048) | idealo/image-super-resolution | Apache 2.0 | 2018

Fast-SRGAN

Fast-SRGAN.

Google Drive Link | Size | Output | Original Project | License | Year
Fast-SRGAN | 628 KB | Image (RGB 1024x1024) | HasnainRaz/Fast-SRGAN | MIT | 2019

ESRGAN

Enhanced-SRGAN.

Google Drive Link | Size | Output | Original Project | License | Year
ESRGAN | 66.9 MB | Image (RGB 2048x2048) | xinntao/ESRGAN | Apache 2.0 | 2018

UltraSharp

Pretrained: 4xESRGAN

Google Drive Link | Size | Output | Original Project | License | Year
UltraSharp | 34 MB | Image (RGB 1024x1024) | Kim2019 | CC-BY-NC-SA-4.0 | 2021

SRGAN

Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network.

Google Drive Link | Size | Output | Original Project | License | Year
SRGAN | 6.1 MB | Image (RGB 2048x2048) | dongheehand/SRGAN-PyTorch | | 2017

SRResNet

Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network.

Google Drive Link | Size | Output | Original Project | License | Year
SRResNet | 6.1 MB | Image (RGB 2048x2048) | dongheehand/SRGAN-PyTorch | | 2017

LESRCNN

Lightweight Image Super-Resolution with Enhanced CNN.

Google Drive Link | Size | Output | Original Project | License | Year | Conversion Script
LESRCNN | 4.3 MB | Image (RGB 512x512) | hellloxiaotian/LESRCNN | | 2020 | Open In Colab

MMRealSR

Metric Learning based Interactive Modulation for Real-World Super-Resolution

Google Drive Link | Size | Output | Original Project | License | Year | Conversion Script
MMRealSRGAN | 104.6 MB | Image (RGB 1024x1024) | TencentARC/MM-RealSR | BSD 3-Clause | 2022 | Open In Colab
MMRealSRNet | 104.6 MB | Image (RGB 1024x1024) | TencentARC/MM-RealSR | BSD 3-Clause | 2022 | Open In Colab

DASR

Pytorch implementation of "Unsupervised Degradation Representation Learning for Blind Super-Resolution", CVPR 2021

Google Drive Link | Size | Output | Original Project | License | Year
DASR | 12.1 MB | Image (RGB 1024x1024) | The-Learning-And-Vision-Atelier-LAVA/DASR | MIT | 2022

SinSR

wyf0912/SinSR — single-step diffusion-based super-resolution (CVPR 2024, ~113M params). Distilled from ResShift for one-step 4x upscaling. Uses a Swin Transformer UNet with VQ-VAE latent space.

Left: bicubic 4x upscale, Right: SinSR single-step diffusion SR (128x128 → 512x512)

3 CoreML models: VQ-VAE encoder, Swin-UNet denoiser (single step), and VQ-VAE decoder with vector quantization.

Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script
SinSR_Encoder.mlpackage.zip | 39 MB | image [1,3,1024,1024] | latent [1,3,256,256] | wyf0912/SinSR | S-Lab | 2024 | SinSRDemo | convert_sinsr.py
SinSR_Denoiser.mlpackage.zip | 420 MB | input [1,6,256,256] | predicted_latent [1,3,256,256]
SinSR_Decoder.mlpackage.zip | 58 MB | latent [1,3,256,256] | image [1,3,1024,1024]

See sample_apps/SinSRDemo/README.md for the inference pipeline and conversion details.
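
Chaining the three models is straightforward. The sketch below uses the shapes from the table but assumed feature names ("image", "latent", "input", "predicted_latent"), plus a uniform stand-in where the real pipeline injects noise; see SinSRDemo for the exact pipeline.

import CoreML

func superResolve(image: MLMultiArray,   // [1,3,1024,1024]
                  encoder: MLModel, denoiser: MLModel, decoder: MLModel) throws -> MLMultiArray? {
    // 1. VQ-VAE encoder: image -> latent [1,3,256,256].
    let enc = try encoder.prediction(from: try MLDictionaryFeatureProvider(
        dictionary: ["image": MLFeatureValue(multiArray: image)]))
    guard let latent = enc.featureValue(for: "latent")?.multiArrayValue else { return nil }

    // 2. Pack the denoiser input [1,6,256,256]: latent in channels 0-2,
    //    noise in channels 3-5 (stand-in values here).
    let packed = try MLMultiArray(shape: [1, 6, 256, 256], dataType: .float32)
    let plane = 256 * 256
    for i in 0..<(3 * plane) {
        packed[i] = latent[i]
        packed[3 * plane + i] = NSNumber(value: Float.random(in: -1...1))
    }
    let den = try denoiser.prediction(from: try MLDictionaryFeatureProvider(
        dictionary: ["input": MLFeatureValue(multiArray: packed)]))
    guard let predicted = den.featureValue(for: "predicted_latent") else { return nil }

    // 3. VQ-VAE decoder: latent [1,3,256,256] -> image [1,3,1024,1024].
    let dec = try decoder.prediction(from: try MLDictionaryFeatureProvider(
        dictionary: ["latent": predicted]))
    return dec.featureValue(for: "image")?.multiArrayValue
}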

Low Light Enhancement

StableLLVE

Learning Temporal Consistency for Low Light Video Enhancement from Single Images.

Google Drive Link | Size | Output | Original Project | License | Year
StableLLVE | 17.3 MB | Image (RGB 512x512) | zkawfanx/StableLLVE | MIT | 2021

Zero-DCE

Zero-Reference Deep Curve Estimation for Low-Light Image Enhancement

Google Drive Link | Size | Output | Original Project | License | Year | Conversion Script
Zero-DCE | 320 KB | Image (RGB 512x512) | Li-Chongyi/Zero-DCE | See Repo | 2021 | Open In Colab

Retinexformer

Retinexformer: One-stage Retinex-based Transformer for Low-light Image Enhancement

Google Drive Link | Size | Output | Original Project | License | Year | Conversion Script
ZRetinexformer FiveK | 3.4 MB | Image (RGB 512x512) | caiyuanhao1998/Retinexformer | MIT | 2023 | Open In Colab
ZRetinexformer NTIRE | 3.4 MB | Image (RGB 512x512) | caiyuanhao1998/Retinexformer | MIT | 2023 | Open In Colab

Image Restoration

MPRNet

Multi-Stage Progressive Image Restoration.

Deblurring

Denoising

Deraining

Google Drive Link | Size | Output | Original Project | License | Year
MPRNetDebluring | 137.1 MB | Image (RGB 512x512) | swz30/MPRNet | MIT | 2021
MPRNetDeNoising | 108 MB | Image (RGB 512x512) | swz30/MPRNet | MIT | 2021
MPRNetDeraining | 24.5 MB | Image (RGB 512x512) | swz30/MPRNet | MIT | 2021

MIRNetv2

Learning Enriched Features for Fast Image Restoration and Enhancement.

Denoising

Super Resolution

Contrast Enhancement

Low Light Enhancement

Google Drive Link | Size | Output | Original Project | License | Year | Conversion Script
MIRNetv2Denoising | 42.5 MB | Image (RGB 512x512) | swz30/MIRNetv2 | ACADEMIC PUBLIC LICENSE | 2022 | Open In Colab
MIRNetv2SuperResolution | 42.5 MB | Image (RGB 512x512) | swz30/MIRNetv2 | ACADEMIC PUBLIC LICENSE | 2022 | Open In Colab
MIRNetv2ContrastEnhancement | 42.5 MB | Image (RGB 512x512) | swz30/MIRNetv2 | ACADEMIC PUBLIC LICENSE | 2022 | Open In Colab
MIRNetv2LowLightEnhancement | 42.5 MB | Image (RGB 512x512) | swz30/MIRNetv2 | ACADEMIC PUBLIC LICENSE | 2022 | Open In Colab

Image Generation

MobileStyleGAN

Google Drive Link | Size | Output | Original Project | License | Sample Project
MobileStyleGAN | 38.6 MB | Image (Color 1024 × 1024) | bes-dev/MobileStyleGAN.pytorch | Nvidia Source Code License-NC | CoreML-StyleGAN

DCGAN

Google Drive Link | Size | Output | Original Project
DCGAN | 9.2 MB | MultiArray | TensorFlowCore

Image2Image

Anime2Sketch

Google Drive Link | Size | Output | Original Project | License | Usage
Anime2Sketch | 217.7 MB | Image (Color 512 × 512) | Mukosame/Anime2Sketch | MIT | Drop an image to preview

AnimeGAN2Face_Paint_512_v2

Google Drive Link | Size | Output | Original Project | Conversion Script
AnimeGAN2Face_Paint_512_v2 | 8.6 MB | Image (Color 512 × 512) | bryandlee/animegan2-pytorch | Open In Colab

Photo2Cartoon

Google Drive Link | Size | Output | Original Project | License | Note
Photo2Cartoon | 15.2 MB | Image (Color 256 × 256) | minivision-ai/photo2cartoon | MIT | The output differs slightly from the original model because some operations were replaced manually during conversion.

AnimeGANv2_Hayao

Google Drive Link | Size | Output | Original Project | Sample
AnimeGANv2_Hayao | 8.7 MB | Image (256 × 256) | TachibanaYoshino/AnimeGANv2 | AnimeGANv2-iOS

AnimeGANv2_Paprika

Google Drive Link | Size | Output | Original Project
AnimeGANv2_Paprika | 8.7 MB | Image (256 × 256) | TachibanaYoshino/AnimeGANv2

WarpGAN Caricature

Google Drive Link | Size | Output | Original Project
WarpGAN Caricature | 35.5 MB | Image (256 × 256) | seasonSH/WarpGAN

UGATIT_selfie2anime


Google Drive Link | Size | Output | Original Project
UGATIT_selfie2anime | 266.2 MB (quantized) | Image (256 × 256) | taki0112/UGATIT

CartoonGAN

Google Drive Link | Size | Output | Original Project
CartoonGAN_Shinkai | 44.6 MB | MultiArray | mnicnc404/CartoonGan-tensorflow
CartoonGAN_Hayao | 44.6 MB | MultiArray | mnicnc404/CartoonGan-tensorflow
CartoonGAN_Hosoda | 44.6 MB | MultiArray | mnicnc404/CartoonGan-tensorflow
CartoonGAN_Paprika | 44.6 MB | MultiArray | mnicnc404/CartoonGan-tensorflow

Fast-Neural-Style-Transfer

Google Drive Link | Size | Output | Original Project | License | Year
fast-neural-style-transfer-cuphead | 6.4 MB | Image (RGB 960x640) | eriklindernoren/Fast-Neural-Style-Transfer | MIT | 2019
fast-neural-style-transfer-starry-night | 6.4 MB | Image (RGB 960x640) | eriklindernoren/Fast-Neural-Style-Transfer | MIT | 2019
fast-neural-style-transfer-mosaic | 6.4 MB | Image (RGB 960x640) | eriklindernoren/Fast-Neural-Style-Transfer | MIT | 2019

White_box_Cartoonization

Learning to Cartoonize Using White-box Cartoon Representations

Google Drive Link | Size | Output | Original Project | License | Year
White_box_Cartoonization | 5.9 MB | Image (1536x1536) | SystemErrorWang/White-box-Cartoonization | creativecommons | CVPR2020

FacialCartoonization

White-box facial image cartoonization

Google Drive Link | Size | Output | Original Project | License | Year
FacialCartoonization | 8.4 MB | Image (256x256) | SystemErrorWang/FacialCartoonization | creativecommons | 2020

Inpainting

AOT-GAN-for-Inpainting

Google Drive Link | Size | Output | Original Project | License | Note | Sample Project
AOT-GAN-for-Inpainting | 60.8 MB | MLMultiArray (3,512,512) | researchmm/AOT-GAN-for-Inpainting | Apache 2.0 | To use, see the sample. | john-rocky/Inpainting-CoreML

Lama

Google Drive Link | Size | Input | Output | Original Project | License | Note | Sample Project | Conversion Script
Lama | 216.6 MB | Image (Color 800 × 800), Image (GrayScale 800 × 800) | Image (Color 800 × 800) | advimman/lama | Apache 2.0 | To use, see the sample. | john-rocky/lama-cleaner-iOS | mallman/CoreMLaMa

Monocular Depth Estimation

Depth Anything 3

ByteDance-Seed/Depth-Anything-3 (ICLR 2026 Oral) — relative monocular depth from a single image. DA3 Main Series uses a plain DINOv2 ViT backbone plus a DualDPT head with a unified depth-ray representation; this Core ML port exposes only the monocular depth + confidence subgraph (camera / multi-view / sky / 3DGS branches are stripped). First public Core ML conversion of DA3.

Module | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script
DA3 Small 504×504 | ~44 MB FP16 | Image (RGB 504 × 504) | depth + confidence | ByteDance-Seed/Depth-Anything-3 | Apache 2.0 | 2025 | Hub App | convert_depth_anything_v3.py
DA3 Base 504×504 | ~173 MB FP16 | Image (RGB 504 × 504) | depth + confidence | ByteDance-Seed/Depth-Anything-3 | Apache 2.0 | 2025 | Hub App | convert_depth_anything_v3.py

MoGe-2

microsoft/MoGe (CVPR 2025 Oral) — open-domain monocular 3D geometry from a single image. Predicts a metric depth map, surface normals, and a confidence mask in one forward pass on a DINOv2 ViT-B backbone with three task heads. The successor to MiDaS-style relative depth: depth comes out in real meters.

Left: original photo, center: metric depth (turbo colormap), right: surface normals.

Module | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script
MoGe-2 ViT-B + normal | ~200 MB FP16 | Image (RGB 504 × 504) | depth + normal + mask + metric_scale | microsoft/MoGe | MIT | 2025 | MoGe2Demo | convert_moge2.py

MiDaS

Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer

Google Drive Link | Size | Output | Original Project | License | Year | Conversion Script
MiDaS_Small | 66.3 MB | MultiArray (1x256x256) | isl-org/MiDaS | MIT | 2022 | Open In Colab

Stable Diffusion

Nitro-E

amd/Nitro-E — AMD's 304M-parameter E-MMDiT text-to-image model released October 2025. 4-step distilled variant generates 512×512 images from a prompt in ~2–3 seconds on iPhone 15+. Uses Llama 3.2 1B as the text encoder, a DC-AE f32c32 VAE decoder, and an ASA-based (Alternating Subregion Attention) diffusion transformer. Full pipeline fits in ~1.04 GB after INT4 / INT8 palettization (TextEncoder 590 MB + E-MMDiT 295 MB + VAE 159 MB).

4-step generation on iPhone, 512×512. Prompt: "a hot air balloon in the shape of a heart, grand canyon".

3 CoreML models total:

Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script
NitroE_TextEncoder.mlpackage | 590 MB (INT4) / 2.3 GB (FP16) | input_ids [1,128], attention_mask [1,128] | last_hidden_state [1,128,2048] | meta-llama/Llama-3.2-1B | Llama 3.2 (gated) | 2024 | NitroEDemo | convert_nitro_e_text_encoder.py
NitroE_EMMDiT.mlpackage | 295 MB (INT8) / 578 MB (FP16) | latent [1,32,16,16], encoder_hs [1,128,2048], timestep [1] | noise_pred [1,32,16,16] | amd/Nitro-E | MIT | 2025 | | convert_nitro_e_emmdit.py
NitroE_VAEDecoder.mlpackage | 159 MB (INT8) / 608 MB (FP32) | latent [1,32,16,16] | image [1,3,512,512] | mit-han-lab/dc-ae-f32c32-sana-1.0-diffusers | MIT | 2024 | | convert_nitro_e_vae_decoder.py

See sample_apps/NitroEDemo/README.md for the Swift FlowMatchEulerScheduler port, tokenizer details, and iOS 18 palettization notes.
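
The 4-step flow-matching loop reduces to a few Euler updates. A sketch under assumptions: the sigma schedule below is illustrative (the demo ships a Swift FlowMatchEulerScheduler), and the feature names match the table above.

import CoreML

func generateLatent(promptEmbedding: MLFeatureValue,   // last_hidden_state [1,128,2048]
                    dit: MLModel) throws -> MLMultiArray {
    let latent = try MLMultiArray(shape: [1, 32, 16, 16], dataType: .float32)
    for i in 0..<latent.count { latent[i] = NSNumber(value: Float.random(in: -1...1)) } // stand-in noise init
    let sigmas: [Float] = [1.0, 0.75, 0.5, 0.25, 0.0]   // illustrative 4-step schedule
    for step in 0..<4 {
        let t = try MLMultiArray(shape: [1], dataType: .float32)
        t[0] = NSNumber(value: sigmas[step])
        let out = try dit.prediction(from: try MLDictionaryFeatureProvider(dictionary: [
            "latent": MLFeatureValue(multiArray: latent),
            "encoder_hs": promptEmbedding,
            "timestep": MLFeatureValue(multiArray: t)]))
        guard let v = out.featureValue(for: "noise_pred")?.multiArrayValue else { break }
        // Euler step: x <- x + (sigma_next - sigma_current) * v
        let dt = sigmas[step + 1] - sigmas[step]
        for i in 0..<latent.count {
            latent[i] = NSNumber(value: latent[i].floatValue + dt * v[i].floatValue)
        }
    }
    return latent   // decode with the VAE decoder to get the 512×512 image
}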

Hyper-SD

ByteDance/Hyper-SD — single-step text-to-image distilled from SD1.5 via Trajectory Segmented Consistency Distillation. ByteDance reports user preference 2x over SD-Turbo at 1 step. Combined with Apple's ml-stable-diffusion (Split-Einsum attention, chunked UNet, 6-bit palettization), runs at acceptable speed and quality on iPhone 15+.

1-step generations on iPhone, 512×512. Prompts: cat with sunglasses, cyberpunk city, japanese garden, astronaut on horse.

4 CoreML models (~947 MB total): CLIP text encoder, the UNet split into two chunks (6-bit palettized), and VAE decoder. Uses the TCD scheduler for single-step inference.

Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script
HyperSDTextEncoder.mlpackage.zip | 235 MB | input_ids [1,77] | encoder_hidden_states [1,77,768] | ByteDance/Hyper-SD | OpenRAIL++ | 2024 | HyperSDDemo | convert_hypersd.py
HyperSDUnetChunk1.mlpackage.zip | 318 MB | latent + encoder_hs + timestep | first half intermediates
HyperSDUnetChunk2.mlpackage.zip | 299 MB | first half outputs + skip connections | noise_pred [2,4,64,64]
HyperSDVAEDecoder.mlpackage.zip | 95 MB | latent [1,4,64,64] | image [1,3,512,512]

See sample_apps/HyperSDDemo/README.md for the LoRA fusion, chunked-UNet palettization, and TCD scheduler details.

stable-diffusion-v1-5

Google Drive Link | Original Model | Original Project | License | Run on Mac | Conversion Script | Year
stable-diffusion-v1-5 | runwayml/stable-diffusion-v1-5 | runwayml/stable-diffusion | Open RAIL M license | godly-devotion/MochiDiffusion | godly-devotion/MochiDiffusion | 2022

pastel-mix

Pastel Mix - a stylized latent diffusion model. This model is intended to produce high-quality, highly detailed anime-style images with just a few prompts.

Google Drive Link | Original Model | License | Run on Mac | Conversion Script | Year
pastelMixStylizedAnime_pastelMixPrunedFP16 | andite/pastel-mix | Fantasy.ai | godly-devotion/MochiDiffusion | godly-devotion/MochiDiffusion | 2023

Orange Mix

Google Drive Link | Original Model | License | Run on Mac | Conversion Script | Year
AOM3_orangemixs | WarriorMama777/OrangeMixs | CreativeML OpenRAIL-M | godly-devotion/MochiDiffusion | godly-devotion/MochiDiffusion | 2023

Counterfeit

Google Drive Link | Original Model | License | Run on Mac | Conversion Script | Year
Counterfeit-V2.5 | gsdf/Counterfeit-V2.5 | - | godly-devotion/MochiDiffusion | godly-devotion/MochiDiffusion | 2023

anything-v4

Google Drive Link | Original Model | License | Run on Mac | Conversion Script | Year
anything-v4.5 | andite/anything-v4.0 | Fantasy.ai | godly-devotion/MochiDiffusion | godly-devotion/MochiDiffusion | 2023

Openjourney

Google Drive Link | Original Model | License | Run on Mac | Conversion Script | Year
Openjourney | prompthero/openjourney | - | godly-devotion/MochiDiffusion | godly-devotion/MochiDiffusion | 2023

dreamlike-photoreal-2

Google Drive Link | Original Model | License | Run on Mac | Conversion Script | Year
dreamlike-photoreal-2.0 | dreamlike-art/dreamlike-photoreal-2.0 | CreativeML OpenRAIL-M | godly-devotion/MochiDiffusion | godly-devotion/MochiDiffusion | 2023

Image Colorization

DDColor Tiny

DDColor — AI image colorization for grayscale/B&W photos using dual decoders (ICCV 2023).

Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script
DDColor_Tiny.mlpackage.zip | 242 MB | 512×512 RGB | AB channels (LAB) | piddnad/DDColor | Apache-2.0 | 2023 | DDColorDemo | convert_ddcolor.py

Face Recognition

AdaFace IR-18

AdaFace — Quality-adaptive face recognition. Outputs 512-dim embedding for face verification and identification.

Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script
AdaFace_IR18.mlpackage.zip | 48 MB | Image (112×112 face) | 512-dim L2-normalized embedding | mk-minchul/AdaFace | MIT | 2022 | AdaFaceDemo | convert_adaface.py

3D Face Pose Estimation

3DDFA_V2

3DDFA_V2 — 3D face reconstruction and head pose estimation (yaw, pitch, roll) from a single face image.

Download Link | Size | Input | Output | Original Project | License | Year | Sample Project
3DDFA_V2.mlpackage.zip | 6.3 MB | Image (120×120 RGB) | 62 params (12 pose + 40 shape + 10 expression) | cleardusk/3DDFA_V2 | MIT | 2020 | Face3DDemo

Speaker Diarization

pyannote segmentation-3.0

pyannote segmentation — Speaker diarization with up to 3 simultaneous speakers. Identifies who speaks when, with overlap detection and per-speaker transcription.

Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script
SpeakerSegmentation.mlpackage.zip | 5.8 MB | 10 s mono 16 kHz [1,1,160000] | [1, 589, 7] speaker logits | pyannote/segmentation-3.0 | MIT | 2023 | DiarizationDemo | convert_diarization.py
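
Decoding the logits is a per-frame argmax. The powerset mapping below (silence, three single speakers, three overlap pairs) is my reading of pyannote's 7-class output; verify it against the model card. Each of the 589 frames covers roughly 17 ms of the 10-second window.

import CoreML

func activeSpeakers(logits: MLMultiArray) -> [[Int]] {   // logits [1, 589, 7]
    let frames = logits.shape[1].intValue
    let classes = logits.shape[2].intValue
    // Assumed class -> active-speaker-set mapping.
    let powerset: [[Int]] = [[], [0], [1], [2], [0, 1], [0, 2], [1, 2]]
    var result: [[Int]] = []
    for f in 0..<frames {
        var best = 0
        var bestLogit = -Float.infinity
        for c in 0..<classes {
            let l = logits[[0, f, c] as [NSNumber]].floatValue
            if l > bestLogit { bestLogit = l; best = c }
        }
        result.append(powerset[best])   // speakers active in this frame
    }
    return result
}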

Voice Conversion

OpenVoice V2

OpenVoice — Zero-shot voice conversion. Record source and target voice, convert on-device.

Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script
OpenVoice_SpeakerEncoder.mlpackage.zip | 1.7 MB | Spectrogram [1, T, 513] | 256-dim speaker embedding | myshell-ai/OpenVoice | MIT | 2024 | OpenVoiceDemo | convert_openvoice.py
OpenVoice_VoiceConverter.mlpackage.zip | 64 MB | Spectrogram + speaker embeddings | Waveform audio (22050 Hz)

Audio Source Separation

HTDemucs

Hybrid Transformer Demucs — separates music into 4 stems: drums, bass, vocals, and other instruments.

Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script
HTDemucs_SourceSeparation_F32.mlpackage.zip | 80 MB | Audio waveform [1, 2, 343980] at 44.1 kHz | 4 stems (drums, bass, other, vocals), stereo | facebookresearch/demucs | MIT | 2022 | DemucsDemo | convert_htdemucs.py

Vision-Language

Florence-2-base

Microsoft Florence-2 — a unified vision-language model supporting image captioning, OCR, and object detection from a single model. Converted as 3 CoreML models (INT8): Vision Encoder (DaViT), Text Encoder (BART), and Decoder with autoregressive generation.

Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script
Florence2VisionEncoder / TextEncoder / Decoder | 260 MB (INT8, 3 models total) | 768x768 RGB image + task prompt | Generated text (caption, OCR, etc.) | microsoft/Florence-2-base | MIT | 2024 | Florence2Demo | convert_florence2.py

Language Model

john-rocky/CoreML-LLM — Companion repository for running LLMs on the Apple Neural Engine. Unlike MLX Swift (GPU-only), CoreML-LLM targets ANE for ~10x lower power draw, making always-on on-device LLMs practical on iPhone. Current release v1.4.0 — Gemma 4 E2B 3-chunk decode (31.6 → 34.2 tok/s, +8.2%), chunk pipelining default ON, still-image vision encoder on ANE. All models below load via the same CoreMLLLM.load(...) Swift API and are available in-app through the Models Zoo hub.

Model | Size | Modalities | iPhone 17 Pro decode | HuggingFace
Gemma 4 E2B | 3.1 GB | Text + image + audio + video | 31–34 tok/s | mlboydaisuke/gemma-4-E2B-coreml
Gemma 4 E4B | 5.5 GB | Text | ~14 tok/s | mlboydaisuke/gemma-4-E4B-coreml
Qwen3.5 2B | 2.4 GB | Text | ~17 tok/s (~200 MB RSS) | mlboydaisuke/qwen3.5-2B-CoreML
Qwen3.5 0.8B | 754 MB | Text | ~20 tok/s | mlboydaisuke/qwen3.5-0.8B-CoreML
Qwen3-VL 2B | 4.7 GB | Text + image | ~7.5 tok/s | mlboydaisuke/qwen3-vl-2b-coreml
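
Loading is one call. A usage sketch, where CoreMLLLM.load(...) is the documented entry point and the streaming generate shape is an assumption; check the CoreML-LLM repository for the actual API:

import CoreMLLLM   // module name assumed from the CoreML-LLM Swift package

func chat() async throws {
    let llm = try await CoreMLLLM.load("mlboydaisuke/qwen3.5-0.8B-CoreML")
    // Hypothetical streaming call; the real signature may differ.
    for try await token in llm.generate(prompt: "Explain the Neural Engine in one sentence.") {
        print(token, terminator: "")
    }
}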

Gemma 4 E2B (CoreML-LLM)

Google Gemma 4 E2B (2.3B effective parameters with Per-Layer Embeddings) running fully on ANE. Multimodal: text, image (native 384x384 encoder, 196 tokens/image), audio (12-layer Conformer encoder), and video (64 tokens/frame). 2048 context length, Sliding Window Attention (28/35 layers are O(W)), PLE computed inside the graph. The default 4-chunk decode ships at 31.6 tok/s on iPhone 17 Pro; LLM_3CHUNK=1 bumps it to 34.2 tok/s by collapsing chunk 2+3.

Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Swift Package
mlboydaisuke/gemma-4-E2B-coreml | 3.1 GB (INT4, 4 chunks + vision + audio + video encoders) | Text + image + audio + video (≤2048 tokens) | Generated text (streaming) | google/gemma-3n-E2B-it | Gemma ToU | 2025 | CoreMLLLMChat | CoreML-LLM

Gemma 4 E4B

Larger text-only Gemma 4 variant — 42-layer decoder, ~4B effective parameters, 100% ANE-resident. Use when you want maximum text quality and have the storage budget. No vision / audio / video encoders.

Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Swift Package
mlboydaisuke/gemma-4-E4B-coreml | 5.5 GB (INT4, 4 chunks) | Text prompt (≤2048 tokens) | Generated text (streaming) | google/gemma-3n-E4B-it | Gemma ToU | 2025 | CoreMLLLMChat | CoreML-LLM

Qwen3.5 2B

Alibaba Qwen3.5 2B — hybrid Gated-DeltaNet SSM + attention. Shipped as 4 INT8 body chunks (6 layers each) + tail + mmap fp16 embed sidecar so a 2B-param model fits in ~200 MB phys_footprint. The 4-chunk split is required to stay ANE-resident — a 2-chunk variant at 2 GB fp16/chunk exceeds the single-mlprogram budget and falls to GPU.

Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Swift Package
mlboydaisuke/qwen3.5-2B-CoreML | 2.4 GB (INT8, 4 chunks + embed) | Text prompt | Generated text (streaming) | Qwen/Qwen3.5-2B | Apache-2.0 | 2025 | CoreMLLLMChat | CoreML-LLM

Qwen3.5 0.8B

Compact hybrid SSM+attention model, INT8 palettized — same semantic precision as fp16 (top-3 = 100% parity vs fp32 oracle), half the bundle size. Smallest and fastest option in the lineup at 754 MB / ~20 tok/s decode.

Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Swift Package
mlboydaisuke/qwen3.5-0.8B-CoreML | 754 MB (INT8 palettized) | Text prompt | Generated text (streaming) | Qwen/Qwen3.5-0.8B | Apache-2.0 | 2025 | CoreMLLLMChat | CoreML-LLM

Qwen3-VL 2B

Qwen3-VL multimodal — text + image input with DeepStack injection at L0/1/2 and interleaved mRoPE for the 196 image tokens. 28-layer GQA text backbone shipped as 6 INT8 body chunks + chunk_head + raw fp16 embed sidecar that Swift mmaps. Vision tower re-uses Qwen3-VL's native ViT.

Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Swift Package
mlboydaisuke/qwen3-vl-2b-coreml | 4.7 GB (INT8, 6 body chunks + head + embed) | Text + image | Generated text (streaming) | Qwen/Qwen3-VL-2B-Instruct | Apache-2.0 | 2025 | CoreMLLLMChat | CoreML-LLM

See CoreML-LLM for the full conversion pipeline, ANE optimization techniques (cat-trick RMSNorm, Conv2d Linear, pre-computed RoPE, stateless KV with explicit I/O), and the Swift sample app.

Zero-Shot Image Classification

SigLIP ViT-B/16

Google SigLIP — sigmoid-based contrastive image-text model for zero-shot classification. Type any labels (e.g. "cat, dog, car") and get per-label probabilities. Converted as 2 CoreML models (INT8): Image Encoder and Text Encoder.

Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script
SigLIP_ImageEncoder / TextEncoder | 386 MB (FP16, 2 models total) | 224x224 RGB image + text labels | Per-label similarity scores (softmax) | google/siglip-base-patch16-224 | Apache-2.0 | 2024 | SigLIPDemo | convert_siglip.py
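
With both encoders on-device, zero-shot scoring is a cosine similarity per label followed by a softmax. A sketch assuming the embeddings have already been extracted into Float arrays; the temperature of 100 is a typical CLIP-style choice, not a value from this conversion:

import Accelerate
import Foundation

func labelScores(imageEmbedding: [Float], textEmbeddings: [[Float]]) -> [Float] {
    func cosine(_ a: [Float], _ b: [Float]) -> Float {
        var dot: Float = 0, na: Float = 0, nb: Float = 0
        vDSP_dotpr(a, 1, b, 1, &dot, vDSP_Length(a.count))
        vDSP_svesq(a, 1, &na, vDSP_Length(a.count))
        vDSP_svesq(b, 1, &nb, vDSP_Length(b.count))
        return dot / (sqrt(na) * sqrt(nb) + 1e-8)
    }
    let sims = textEmbeddings.map { cosine(imageEmbedding, $0) }
    let exps = sims.map { exp($0 * 100) }   // temperature scaling before softmax
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }            // per-label probabilities
}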

Text-to-Speech

Kokoro-82M

hexgrad/Kokoro-82M — open-weight 82M-parameter TTS by hexgrad. Style-conditioned StyleTTS2 architecture (BERT + duration predictor + iSTFTNet vocoder) producing 24kHz speech in 9 languages from per-voice style embeddings. The first CoreML port with on-device bilingual (English + Japanese) free-text input — no MLX, no MeCab, no IPADic, no Python G2P at runtime.

2 CoreML models: a flexible-length Predictor (BERT + LSTM duration head + text encoder) and 3 fixed-shape Decoder buckets (128 / 256 / 512 frames). The Swift pipeline picks the smallest bucket that fits the predicted total duration, pads input features with zeros, and trims the output audio.

Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script
Kokoro_Predictor.mlpackage.zip | 75 MB | input_ids [1, T≤256] (int32) + ref_s_style [1, 128] | duration [1, T] + d_for_align [1, 640, T] + t_en [1, 512, T] | hexgrad/Kokoro-82M | Apache-2.0 | 2025 | KokoroDemo | convert_kokoro.py
Kokoro_Decoder_128.mlpackage.zip | 238 MB | en_aligned [1, 640, 128] + asr_aligned [1, 512, 128] + ref_s [1, 256] | audio [1, 76800] @ 24 kHz
Kokoro_Decoder_256.mlpackage.zip | 241 MB | en_aligned [1, 640, 256] + asr_aligned [1, 512, 256] + ref_s [1, 256] | audio [1, 153600] @ 24 kHz
Kokoro_Decoder_512.mlpackage.zip | 246 MB | en_aligned [1, 640, 512] + asr_aligned [1, 512, 512] + ref_s [1, 256] | audio [1, 307200] @ 24 kHz

See sample_apps/KokoroDemo/README.md for the on-device G2P (English + Japanese), bucketed decoder strategy, and conversion details.
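
The bucket selection itself is simple arithmetic: each decoder produces exactly 600 samples per frame (76,800 / 128; the larger buckets match). A sketch of the strategy described above:

// Decoder bucket capacities in frames, from the table above.
let buckets = [128, 256, 512]
let samplesPerFrame = 600   // 76_800 samples / 128 frames at 24 kHz

// Pick the smallest bucket that fits; nil means the text must be split.
func chooseBucket(predictedFrames: Int) -> Int? {
    buckets.first { $0 >= predictedFrames }
}

// After decoding, keep only the first predictedFrames * samplesPerFrame samples.
func trimmedSampleCount(predictedFrames: Int) -> Int {
    predictedFrames * samplesPerFrame
}

// Example: 180 predicted frames -> 256-frame bucket; keep 108_000 of 153_600 samples.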

Anomaly Detection

EfficientAD

EfficientAD (PDN-Small) — lightweight unsupervised anomaly detection for industrial inspection. Wraps teacher, student, and autoencoder networks into a single model that outputs a per-pixel anomaly heatmap and image-level anomaly score. Pretrained on MVTec AD bottle category.

Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script
EfficientAD_Bottle.mlpackage.zip | 15 MB (FP16) | 256x256 RGB image | anomaly_map [1,1,256,256] + anomaly_score [0-1] | nelson1425/EfficientAD | MIT | 2023 | EfficientADDemo | convert_efficientad.py

Music Transcription

Basic Pitch

spotify/basic-pitch — polyphonic Automatic Music Transcription. Converts any audio (any instrument, any voice) into MIDI notes with pitch bend detection. Just 17K parameters / 272 KB — runs in real time on iPhone with full ANE acceleration.

The first open-source iOS implementation. Loads any audio file, runs the CoreML model in 2-second sliding windows, then runs the full Python note_creation.py pipeline natively in Swift (onset inference, greedy backwards-in-time tracking, melodia trick, pitch bend extraction). Detected notes are visualized as a piano roll, exported as a Standard MIDI File, and played back through a built-in additive sine synth so you can A/B compare with the original audio.

Download Link | Size | Input | Output | Original Project | License | Year | Sample Project
BasicPitch_nmp.mlpackage.zip | 272 KB | audio waveform [1, 43844, 1] @ 22050 Hz mono | note [1,172,88] + onset [1,172,88] + contour [1,172,264] | spotify/basic-pitch | Apache-2.0 | 2022 | BasicPitchDemo

See sample_apps/BasicPitchDemo/README.md for the sliding-window inference, post-processing port, and iOS-specific gotchas.
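
Windowing is the only pre-processing the model needs. A sketch, assuming a 7,680-sample overlap (the value used by the Python reference) and zero-padding for the final window; the exact stitching lives in the demo:

import CoreML

func windows(of samples: [Float], windowLength: Int = 43_844, overlap: Int = 7_680) -> [MLMultiArray] {
    let hop = windowLength - overlap
    var result: [MLMultiArray] = []
    var start = 0
    while start < samples.count {
        let window = try! MLMultiArray(shape: [1, NSNumber(value: windowLength), 1], dataType: .float32)
        for i in 0..<windowLength {
            let idx = start + i
            window[i] = NSNumber(value: idx < samples.count ? samples[idx] : 0)   // zero-pad the tail
        }
        result.append(window)
        start += hop
    }
    return result
}
// Run the model on each window, then drop the overlapped frames at each
// boundary before stitching note/onset/contour and post-processing.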

Text-to-Music Generation

Stable Audio Open Small

stabilityai/stable-audio-open-small — text-to-music generation (497M params). Generates up to 11.9 seconds of stereo 44.1kHz audio from text prompts using rectified flow diffusion.

4 CoreML models: T5 text encoder, NumberEmbedder (seconds conditioning), DiT (diffusion transformer), and VAE decoder (Oobleck).

Download Link | Size | Input | Output | Original Project | License | Year | Sample Project | Conversion Script
StableAudioT5Encoder.mlpackage.zip | 105 MB | input_ids [1, 64] | text_embeddings [1, 64, 768] | stabilityai/stable-audio-open-small | Stability AI Community | 2024 | StableAudioDemo | convert_stable_audio.py
StableAudioNumberEmbedder.mlpackage.zip | 396 KB | normalized_seconds [1] | seconds_embedding [1, 768]
StableAudioDiT.mlpackage.zip | 326 MB | latent [1,64,256] + timestep + conditioning | velocity [1,64,256]
StableAudioDiT_FP32.mlpackage.zip | 1.3 GB | latent [1,64,256] + timestep + conditioning | velocity [1,64,256]
StableAudioVAEDecoder.mlpackage.zip | 149 MB | latent [1, 64, 256] | stereo audio [1, 2, 524288] at 44.1 kHz

See sample_apps/StableAudioDemo/README.md for INT8 vs FP32 DiT selection and conversion details.

Models converted by someone other than me.

Stable Diffusion

apple/ml-stable-diffusion

How to use in an Xcode project

Option 1: implement a Vision request.


import Vision

lazy var coreMLRequest: VNCoreMLRequest = {
    // Replace `modelname` with the auto-generated class of your bundled model.
    let model = try! VNCoreMLModel(for: modelname().model)
    let request = VNCoreMLRequest(model: model, completionHandler: self.coreMLCompletionHandler)
    return request
}()

let handler = VNImageRequestHandler(ciImage: ciimage, options: [:])
DispatchQueue.global(qos: .userInitiated).async {
    try? handler.perform([coreMLRequest])
}

If the model has an Image type output:

let result = request?.results?.first as! VNPixelBufferObservation
let uiimage = UIImage(ciImage: CIImage(cvPixelBuffer: result.pixelBuffer))

If the model has a MultiArray type output:

For visualizing a MultiArray as an image, hollance's CoreMLHelpers library is very convenient: CoreML Helpers

Converting from MultiArray to Image with CoreML Helpers.

func coreMLCompletionHandler(request: VNRequest, error: Error?) {
    let result = request.results?.first as! VNCoreMLFeatureValueObservation
    let multiArray = result.featureValue.multiArrayValue
    let cgimage = multiArray?.cgImage(min: -1, max: 1, channel: nil)
}

Option 2: use CoreGANContainer. You can use models by dragging and dropping them into the container project.

Make the model lighter

You can reduce the model size with quantization if you want: https://coremltools.readme.io/docs/quantization

The lower the number of bits, the higher the chance of degrading the model's accuracy. The loss in accuracy varies from model to model.

import coremltools as ct
from coremltools.models.neural_network import quantization_utils

# load full precision model
model_fp32 = ct.models.MLModel('model.mlmodel')

model_fp16 = quantization_utils.quantize_weights(model_fp32, nbits=16)
# nbits can be 16 (half-size model), 8 (1/4), 4 (1/8), 2, or 1

Quantized sample (U2Net): input image / nbits=32 (original) / nbits=16 / nbits=8 / nbits=4

Thanks

The cover image was taken from Ghibli's free images.

For the YOLOv5 conversion, dbsystel/yolov5-coreml-tools provided the very intelligent conversion script.

And thanks to all of the original projects.

Author

Daisuke Majima. Freelance engineer: iOS, machine learning, AR. I can work on mobile ML and AR projects. Feel free to contact: rockyshikoku@gmail.com

GitHub Twitter Medium