
My 2025 AI Stack for Production RAG (and why)

July 25, 2025

AI product development has changed dramatically in the last two years.
In 2023, most AI engineers defaulted to Python-heavy stacks — FastAPI for APIs, LangChain (Python) for orchestration, HuggingFace for embeddings and fine-tunes. That made sense when models were new and everything happened in notebooks.

But here’s the problem:
When you’re building a real product, 90% of the work is not in Python. It’s in UI/UX, authentication, payments, user state, analytics — and the faster you can ship those, the faster you can get feedback and pivot.

That’s why in 2025, I’ve shifted to a TypeScript-first stack for nearly everything, and I only bring Python into the picture when it’s truly needed.
The result? Faster iteration, cleaner integration, and an architecture that’s observable, swappable, and built for change.

#The Core Philosophy

  1. TS-First for speed — Next.js, Convex, and modern TypeScript tooling let me ship features in hours instead of days.
  2. Python only where it’s irreplaceable — fine-tuning, LoRA, or specialized CV/NLP pipelines.
  3. Everything observable — if I can’t see what the model retrieved or why it hallucinated, it’s not a product, it’s a demo.
  4. Swappable components — vector DBs, embeddings, even the LLM can change without rewriting everything.

#1. Frontend & Application Layer

I keep the entire user-facing layer in TypeScript.

  • Framework: Next.js (App Router) — lets me mix server and client components, stream responses, and keep all API contracts in the same repo.
  • Database & State: Convex — real-time updates, serverless storage, and cron jobs without extra infra.
  • Auth: Clerk — OAuth, email magic links, and SSO in minutes.
  • UI/Styling: Tailwind CSS + shadcn/ui — fast, consistent, and themeable.
  • File Handling: UploadThing or Vercel Blob.

💡 Example: Updating my RAG index nightly is just a Convex cron job calling the vector DB — no extra servers, no DevOps overhead.
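Here is roughly what that looks like: a minimal sketch of a Convex cron definition, assuming a hypothetical internal action `internal.rag.refreshIndex` that re-embeds changed documents and upserts them into the vector DB.

```ts
// convex/crons.ts
import { cronJobs } from "convex/server";
import { internal } from "./_generated/api";

const crons = cronJobs();

// Rebuild the RAG index every night at 02:00 UTC.
// internal.rag.refreshIndex is a hypothetical internal action that
// re-embeds changed documents and upserts them into the vector DB.
crons.daily(
  "nightly RAG index refresh",
  { hourUTC: 2, minuteUTC: 0 },
  internal.rag.refreshIndex
);

export default crons;
```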

#2. LLM Orchestration & Retrieval

For most projects, I orchestrate entirely in LangChain.js or LangGraph.js:

  • Retrieval: Weaviate or Qdrant for managed ops; pgvector when I want Postgres-native queries.
  • Embeddings: OpenAI text-embedding-3-large for general use, Voyage AI for higher multilingual accuracy, or Cohere for budget-friendly scale.
  • LLMs: Mix and match — Groq (Llama 3) for low-latency responses, OpenAI for reasoning-heavy queries.
  • UI Integration: Vercel AI SDK for streaming chat and completions.

💡 Why not Pinecone? I prefer Weaviate/Qdrant for more control over index params and cost structure.
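To make the swappability concrete, here's a minimal retrieval sketch in LangChain.js. Package names assume the current @langchain/* split, and the Qdrant URL and collection name are placeholders; swapping in a Weaviate or pgvector store means changing one import and one constructor, because everything downstream only sees a retriever.

```ts
import { OpenAIEmbeddings } from "@langchain/openai";
import { QdrantVectorStore } from "@langchain/qdrant";

// text-embedding-3-large as the general-purpose embedding model (see above).
const embeddings = new OpenAIEmbeddings({ model: "text-embedding-3-large" });

// Connect to an existing Qdrant collection. Swap this class for a Weaviate
// or pgvector store without touching the rest of the pipeline.
const vectorStore = await QdrantVectorStore.fromExistingCollection(embeddings, {
  url: process.env.QDRANT_URL ?? "http://localhost:6333",
  collectionName: "docs", // placeholder collection name
});

// Everything downstream depends only on the retriever interface.
const retriever = vectorStore.asRetriever({ k: 5 });
const hits = await retriever.invoke("How do I rotate an API key?");
```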

#3. The Python “Island”

I don’t run my whole backend in Python anymore — but I keep a Python microservice for when it matters:

  • LoRA / Fine-Tuning — HuggingFace PEFT & Transformers.
  • Custom CV/Audio/NLP Pipelines — OpenFace for video feature extraction, librosa for audio, spaCy for NLP.
  • Self-Hosted Inference — GPU-heavy models running on Modal, Replicate, or Runpod.

It runs as a FastAPI service, deployed separately, and the TS backend calls it only when needed (sketched below).

💡 Benefit: I can scale the Python service independently — if training spikes GPU usage, the rest of the app isn’t affected.
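The TS-side call into that island is nothing fancy: a plain fetch from a Next.js route handler or Convex action to the FastAPI service. The endpoint name, payload shape, and PY_SERVICE_URL below are made up for illustration.

```ts
// Called from the TS backend only when a request actually needs the Python
// service (e.g. a custom video feature-extraction step).
// PY_SERVICE_URL and /extract-features are hypothetical.
export async function extractVideoFeatures(videoUrl: string) {
  const res = await fetch(`${process.env.PY_SERVICE_URL}/extract-features`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ video_url: videoUrl }),
  });
  if (!res.ok) {
    throw new Error(`Python service failed: ${res.status}`);
  }
  return res.json();
}
```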

#4. Evaluation & Observability

If you can’t debug model behavior, you’re flying blind.
I track:

  • Retrieval hits and metadata.
  • Latency per pipeline stage.
  • Hallucination scores (LLM-graded or heuristic).
  • User feedback loops.

Tools I use:

  • LangSmith for tracing, dataset runs, and A/B prompt testing.
  • Sentry for app-level error tracking.
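With LangChain.js in the loop, most LangSmith tracing is just environment configuration; anything outside LangChain can be wrapped explicitly. A small sketch, assuming the langsmith JS SDK's traceable helper (the rerank step itself is hypothetical):

```ts
import { traceable } from "langsmith/traceable";

// LangChain.js calls are traced automatically once tracing is enabled via
// env vars (LANGCHAIN_TRACING_V2=true plus a LangSmith API key).
// Custom steps outside LangChain get wrapped so they show up in the trace:
const rerank = traceable(
  async (query: string, candidates: string[]) => {
    // hypothetical rerank logic; the wrapper records inputs, outputs, latency
    return candidates.slice(0, 5);
  },
  { name: "rerank", run_type: "tool" }
);
```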

💡 Example: By logging retrieval hits with timestamps, I caught a timezone bug that was silently breaking a client’s nightly index refresh.
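The logging itself can be dead simple; the point is that every retrieval call writes a structured record you can query later, with timestamps always in UTC. A rough sketch (the log shape and storage target are up to you; Convex, Postgres, or a log drain all work), reusing the retriever from the earlier sketch:

```ts
// Minimal structured record for every retrieval call. Timestamps are always
// ISO 8601 in UTC; the timezone bug above came from mixing local time and
// UTC around the nightly refresh.
interface RetrievalLog {
  query: string;
  hitIds: string[];   // IDs (or sources) of the retrieved chunks
  latencyMs: number;  // retrieval stage only
  timestamp: string;  // new Date().toISOString()
}

async function logRetrieval(entry: RetrievalLog): Promise<void> {
  // Write to whatever store you can query later; console.log is a stand-in.
  console.log(JSON.stringify(entry));
}

// Usage around a retrieval call:
const question = "How do I rotate an API key?";
const started = Date.now();
const docs = await retriever.invoke(question);
await logRetrieval({
  query: question,
  hitIds: docs.map((d) => String(d.metadata?.source ?? "unknown")),
  latencyMs: Date.now() - started,
  timestamp: new Date().toISOString(),
});
```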

#5. Why This Stack Works

  • Speed: Building product features in TS is simply faster.
  • Flexibility: I can change vector DBs or LLM providers in hours.
  • Scalability: Python workloads don’t slow down the rest of the app.
  • Observability: Every stage is logged, traceable, and measurable.
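Concretely, swappability mostly comes down to a thin factory between the app and the providers. A minimal sketch using the Vercel AI SDK mentioned above; the provider packages (@ai-sdk/openai, @ai-sdk/groq) and model IDs are assumptions, not a prescription:

```ts
import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";
import { groq } from "@ai-sdk/groq";

// One place decides which provider handles a request; the rest of the app
// only ever calls answer(). Model IDs here are illustrative.
function pickModel(task: "fast" | "reasoning") {
  return task === "fast" ? groq("llama-3.1-8b-instant") : openai("gpt-4o");
}

export async function answer(task: "fast" | "reasoning", prompt: string) {
  const result = streamText({ model: pickModel(task), prompt });
  return result.toTextStreamResponse(); // stream straight back to the client
}
```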

#When to Break the TS-First Rule

There are times when Python-first makes sense:

  • Research-heavy prototypes where speed to model iteration > product speed.
  • Internal tools where UI/UX isn’t a priority.
  • Deep integration with Python-only libraries.

But for production-grade AI products — especially those with real users — TS-first wins for me every time.

#TL;DR

Ship fast.
Track everything.
Keep it modular.

In AI, pivots are inevitable. This stack is built so I can change anything — the LLM, the embeddings, the vector DB — without starting over.