
My 2025 AI Stack for Production RAG (and why)

July 25, 2025

AI product development has changed dramatically in the last two years.
In 2023, most AI engineers defaulted to Python-heavy stacks — FastAPI for APIs, LangChain (Python) for orchestration, HuggingFace for embeddings and fine-tunes. That made sense when models were new and everything happened in notebooks.

But here’s the problem:
When you’re building a real product, 90% of the work is not in Python. It’s in UI/UX, authentication, payments, user state, analytics — and the faster you can ship those, the faster you can get feedback and pivot.

That’s why in 2025, I’ve shifted to a TypeScript-first stack for nearly everything, and I only bring Python into the picture when it’s truly needed.
The result? Faster iteration, cleaner integration, and an architecture that’s observable, swappable, and built for change.

#The Core Philosophy

  1. TS-First for speed — Next.js, Convex, and modern TypeScript tooling let me ship features in hours instead of days.
  2. Python only where it’s irreplaceable — fine-tuning, LoRA, or specialized CV/NLP pipelines.
  3. Everything observable — if I can’t see what the model retrieved or why it hallucinated, it’s not a product, it’s a demo.
  4. Swappable components — vector DBs, embeddings, even the LLM can change without rewriting everything.

#1. Frontend & Application Layer

I keep the entire user-facing layer in TypeScript.

  • Framework: Next.js (App Router) — lets me mix server and client components, stream responses, and keep all API contracts in the same repo.
  • Database & State: Convex — real-time updates, serverless storage, and cron jobs without extra infra.
  • Auth: Clerk — OAuth, email magic links, and SSO in minutes.
  • UI/Styling: Tailwind CSS + shadcn/ui — fast, consistent, and themeable.
  • File Handling: UploadThing or Vercel Blob.

💡 Example: Updating my RAG index nightly is just a Convex cron job calling the vector DB — no extra servers, no DevOps overhead.
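Here is roughly what that looks like: a minimal sketch of a Convex cron definition, assuming a hypothetical internal action `internal.rag.refreshIndex` that re-embeds changed documents and upserts them into the vector DB.

```ts
// convex/crons.ts
import { cronJobs } from "convex/server";
import { internal } from "./_generated/api";

const crons = cronJobs();

// Rebuild the RAG index every night at 02:00 UTC.
// internal.rag.refreshIndex is a hypothetical internal action that
// re-embeds changed documents and upserts them into the vector DB.
crons.daily(
  "nightly RAG index refresh",
  { hourUTC: 2, minuteUTC: 0 },
  internal.rag.refreshIndex
);

export default crons;
```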

#2. LLM Orchestration & Retrieval

For most projects, I orchestrate entirely in LangChain.js or LangGraph.js:

  • Retrieval: Weaviate or Qdrant for managed ops; pgvector when I want Postgres-native queries.
  • Embeddings: OpenAI text-embedding-3-large for general use, Voyage AI for higher multilingual accuracy, or Cohere for budget-friendly scale.
  • LLMs: Mix and match — Groq (Llama 3) for low-latency responses, OpenAI for reasoning-heavy queries.
  • UI Integration: Vercel AI SDK for streaming chat and completions.

💡 Why not Pinecone? I prefer Weaviate/Qdrant for more control over index params and cost structure.
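To make the swappability concrete, here's a minimal retrieval sketch in LangChain.js. Package names assume the current @langchain/* split, and the Qdrant URL and collection name are placeholders; swapping in a Weaviate or pgvector store means changing one import and one constructor, because everything downstream only sees a retriever.

```ts
import { OpenAIEmbeddings } from "@langchain/openai";
import { QdrantVectorStore } from "@langchain/qdrant";

// text-embedding-3-large as the general-purpose embedding model (see above).
const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-large" });

// Connect to an existing Qdrant collection. Swap this class for a Weaviate
// or pgvector store without touching the rest of the pipeline.
const vectorStore = await QdrantVectorStore.fromExistingCollection(embeddings, {
  url: process.env.QDRANT_URL ?? "http://localhost:6333",
  collectionName: "docs", // placeholder collection name
});

// Everything downstream depends only on the retriever interface.
const retriever = vectorStore.asRetriever({ k: 5 });
const hits = await retriever.invoke("How do I rotate an API key?");
```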

#3. The Python “Island”

I don’t run my whole backend in Python anymore — but I keep a Python microservice for when it matters:

  • LoRA / Fine-Tuning — HuggingFace PEFT & Transformers.
  • Custom CV/Audio/NLP Pipelines — OpenFace for video feature extraction, librosa for audio, spaCy for NLP.
  • Self-Hosted Inference — GPU-heavy models running on Modal, Replicate, or Runpod.

It runs as a FastAPI service, deployed separately, and the TS backend calls it only when needed (sketched below).

💡 Benefit: I can scale the Python service independently — if training spikes GPU usage, the rest of the app isn’t affected.
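The TS-side call into that island is nothing fancy: a plain fetch from a Next.js route handler or Convex action to the FastAPI service. The endpoint name, payload shape, and PY_SERVICE_URL below are made up for illustration.

```ts
// Called from the TS backend only when a request actually needs the Python
// service (e.g. a custom video feature-extraction step).
// PY_SERVICE_URL and /extract-features are hypothetical.
export async function extractVideoFeatures(videoUrl: string) {
  const res = await fetch(`${process.env.PY_SERVICE_URL}/extract-features`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ video_url: videoUrl }),
  });
  if (!res.ok) {
    throw new Error(`Python service failed: ${res.status}`);
  }
  return res.json();
}
```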

#4. Evaluation & Observability

If you can’t debug model behavior, you’re flying blind.
I track:

  • Retrieval hits and metadata.
  • Latency per pipeline stage.
  • Hallucination scores (LLM-graded or heuristic).
  • User feedback loops.

Tools I use:

  • LangSmith for tracing, dataset runs, and A/B prompt testing.
  • Sentry for app-level error tracking.
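With LangChain.js in the loop, most LangSmith tracing is just environment configuration; anything outside LangChain can be wrapped explicitly. A small sketch, assuming the langsmith JS SDK's traceable helper (the rerank step itself is hypothetical):

```ts
import { traceable } from "langsmith/traceable";

// LangChain.js calls are traced automatically once tracing is enabled via
// env vars (LANGCHAIN_TRACING_V2=true plus a LangSmith API key).
// Custom steps outside LangChain get wrapped so they show up in the trace:
const rerank = traceable(
  async (query: string, candidates: string[]) => {
    // hypothetical rerank logic; the wrapper records inputs, outputs, latency
    return candidates.slice(0, 5);
  },
  { name: "rerank", run_type: "tool" }
);
```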

💡 Example: By logging retrieval hits with timestamps, I caught a timezone bug that was silently breaking a client’s nightly index refresh.
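The logging itself can be dead simple; the point is that every retrieval call writes a structured record you can query later, with timestamps always in UTC. A rough sketch (the log shape and storage target are up to you; Convex, Postgres, or a log drain all work), reusing the retriever from the earlier sketch:

```ts
// Minimal structured record for every retrieval call. Timestamps are always
// ISO 8601 in UTC; the timezone bug above came from mixing local time and
// UTC around the nightly refresh.
interface RetrievalLog {
  query: string;
  hitIds: string[];   // IDs (or sources) of the retrieved chunks
  latencyMs: number;  // retrieval stage only
  timestamp: string;  // new Date().toISOString()
}

async function logRetrieval(entry: RetrievalLog): Promise<void> {
  // Write to whatever store you can query later; console.log is a stand-in.
  console.log(JSON.stringify(entry));
}

// Usage around a retrieval call:
const question = "How do I rotate an API key?";
const started = Date.now();
const docs = await retriever.invoke(question);
await logRetrieval({
  query: question,
  hitIds: docs.map((d) => String(d.metadata?.source ?? "unknown")),
  latencyMs: Date.now() - started,
  timestamp: new Date().toISOString(),
});
```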

#5. Why This Stack Works

  • Speed: Building product features in TS is simply faster.
  • Flexibility: I can change vector DBs or LLM providers in hours.
  • Scalability: Python workloads don’t slow down the rest of the app.
  • Observability: Every stage is logged, traceable, and measurable.
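Concretely, swappability mostly comes down to a thin factory between the app and the providers. A minimal sketch using the Vercel AI SDK mentioned above; the provider packages (@ai-sdk/openai, @ai-sdk/groq) and model IDs are assumptions, not a prescription:

```ts
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";
import { groq } from "@ai-sdk/groq";

// One place decides which provider handles a request; the rest of the app
// only ever calls answer(). Model IDs here are illustrative.
function pickModel(task: "fast" | "reasoning") {
  return task === "fast" ? groq("llama-3.1-8b-instant") : openai("gpt-4o");
}

export async function answer(task: "fast" | "reasoning", prompt: string) {
  const result = streamText({ model: pickModel(task), prompt });
  return result.toTextStreamResponse(); // stream straight back to the client
}
```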

#When to Break the TS-First Rule

There are times when Python-first makes sense:

  • Research-heavy prototypes where speed to model iteration > product speed.
  • Internal tools where UI/UX isn’t a priority.
  • Deep integration with Python-only libraries.

But for production-grade AI products — especially those with real users — TS-first wins for me every time.

#TL;DR

Ship fast.
Track everything.
Keep it modular.

In AI, pivots are inevitable. This stack is built so I can change anything — the LLM, the embeddings, the vector DB — without starting over.