AI · MSc Deep Learning coursework

SmartKitchen — Vision-Language Cooking Assistant

A multimodal AI assistant for everyday food decisions.

Problem

Deciding what to cook or eat means juggling disconnected steps: identifying a dish, knowing its ingredients, finding a recipe, handling substitutions, then figuring out where to buy ingredients or eat out. No single tool ties vision, language, and location together for that workflow.

Approach

I built a full-stack system that combines computer vision and language models behind one interface. A CLIP/ResNet50 pipeline recognizes dishes and runs multi-label ingredient detection from a photo; a retriever searches a database of 2,000+ recipes; and a FLAN-T5/Qwen model powers a RAG-based assistant for cooking questions and ingredient substitutions. Location-aware recommendations surface nearby restaurants and grocery stores via OpenStreetMap.
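The retrieve-then-generate step described above can be illustrated with a minimal sketch. It is not the project's implementation: the `Recipe` type, the three sample recipes, and the Jaccard keyword-overlap scoring are all assumptions made for illustration, standing in for whatever retriever the real system uses over its 2,000+ recipes. The resulting prompt string is what would be passed to a generator such as FLAN-T5 or Qwen.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    # Hypothetical record shape; the real recipe schema is not given in the source.
    title: str
    ingredients: frozenset[str]


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Overlap between two ingredient sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(detected: set[str], recipes: list[Recipe], k: int = 2) -> list[Recipe]:
    """Return the k recipes whose ingredients best overlap the detected ones."""
    query = frozenset(detected)
    ranked = sorted(recipes, key=lambda r: jaccard(query, r.ingredients), reverse=True)
    return ranked[:k]


def build_prompt(question: str, hits: list[Recipe]) -> str:
    """Assemble retrieved recipes and the user question into one generator prompt."""
    context = "\n".join(
        f"- {r.title}: {', '.join(sorted(r.ingredients))}" for r in hits
    )
    return (
        "Use only these recipes as context:\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy data standing in for the recipe database (assumed, not from the project).
RECIPES = [
    Recipe("Margherita pizza", frozenset({"flour", "tomato", "mozzarella", "basil"})),
    Recipe("Caprese salad", frozenset({"tomato", "mozzarella", "basil", "olive oil"})),
    Recipe("Pancakes", frozenset({"flour", "egg", "milk", "butter"})),
]

# Ingredients as they might come out of the vision stage for one photo.
hits = retrieve({"tomato", "mozzarella", "basil"}, RECIPES)
prompt = build_prompt("What can replace mozzarella?", hits)
```

Keeping retrieval and prompt assembly as separate functions means the toy scorer could be swapped for an embedding-based one without touching the generation side.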

Outcome

A responsive web app that turns a single food image into dish recognition, detected ingredients, matching recipes, conversational cooking help, and nearby places — an end-to-end multimodal product rather than an isolated model.

Tech stack

Python, PyTorch, FastAPI, Next.js, CLIP/ResNet50, FLAN-T5/Qwen, OpenStreetMap, Tailwind CSS

PyTorch models are served by a FastAPI backend, with a Next.js + Tailwind frontend. CV models (CLIP/ResNet50) handle perception, FLAN-T5/Qwen handles generation and RAG, and OpenStreetMap handles geosearch. The frontend is deployed on Vercel and the API on DigitalOcean.
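As a rough illustration of the geosearch piece, the sketch below builds an Overpass QL query for restaurants and supermarkets within a radius of a point. The source only says OpenStreetMap is used, so querying through the Overpass API is an assumption here, as are the function name `build_overpass_query`, the chosen tags (`amenity=restaurant`, `shop=supermarket`), and the default radius. The code only constructs the query string and makes no network call.

```python
def build_overpass_query(lat: float, lon: float, radius_m: int = 1000) -> str:
    """Build an Overpass QL query for nearby restaurants and supermarkets.

    Hypothetical helper: the project's actual OpenStreetMap access path
    is not described in the source.
    """
    # The `around` filter keeps elements within radius_m metres of (lat, lon).
    around = f"(around:{radius_m},{lat:.5f},{lon:.5f})"
    filters = [
        'node["amenity"="restaurant"]',
        'node["shop"="supermarket"]',
    ]
    # A union block ( ...; ...; ) merges the results of both statements.
    body = "".join(f"{tag_filter}{around};" for tag_filter in filters)
    return f"[out:json][timeout:25];({body});out body;"


# Example coordinates are arbitrary and chosen only for illustration.
query = build_overpass_query(52.52, 13.405, radius_m=800)
```

A backend endpoint could send this string to an Overpass server and map the returned nodes to the place cards shown in the UI; restricting the query to `node` elements keeps the sketch short but would miss places mapped as ways or relations.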