🖼️ Image Captioning

Upload a photo and generate a natural-language description with one of three trained models, or compare all three at once.

Model
BLIP + LoRA produces the best captions
Model details
Custom 5k / 100k — EfficientNet-V2-S + Transformer, trained from scratch on COCO subsets
BLIP + LoRA — Salesforce BLIP base fine-tuned with LoRA adapters on COCO 2014
Runtime device: cpu · First inference per model is slower (lazy loading)
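
The lazy loading mentioned in the status line can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: the registry `MODEL_LOADERS`, the helper `get_model`, the counter `load_counts`, and the stub loaders are all hypothetical, and the model keys are assumed from the names shown above. A real loader would read weights from disk, which is the cost that makes the first inference per model slower.

```python
from functools import lru_cache
from typing import Callable, Dict

# Hypothetical counter: records how often each loader actually runs.
load_counts: Dict[str, int] = {}

Captioner = Callable[[str], str]


def _make_stub_loader(name: str) -> Callable[[], Captioner]:
    """Build a stand-in loader; a real one would load model weights here."""
    def loader() -> Captioner:
        load_counts[name] = load_counts.get(name, 0) + 1
        # Stand-in for a captioning model: image path in, caption string out.
        return lambda image_path: f"[{name}] caption for {image_path}"
    return loader


# Hypothetical registry keyed by the model names assumed from the UI.
MODEL_LOADERS: Dict[str, Callable[[], Captioner]] = {
    name: _make_stub_loader(name)
    for name in ("Custom 5k", "Custom 100k", "BLIP + LoRA")
}


@lru_cache(maxsize=None)
def get_model(name: str) -> Captioner:
    """Load a model on first request and reuse the cached instance afterwards."""
    if name not in MODEL_LOADERS:
        raise KeyError(f"unknown model: {name}")
    return MODEL_LOADERS[name]()


if __name__ == "__main__":
    model = get_model("BLIP + LoRA")   # first call runs the loader
    model = get_model("BLIP + LoRA")   # second call hits the cache
    print(model("photo.jpg"))
    print(load_counts)
```

With this pattern nothing is loaded at startup, and a model that is never selected is never loaded. Comparing all models at once would simply call `get_model` for each name in turn.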