The graveyard of failed support chatbots is full of bots that were really just keyword-triggered FAQ lookups with a chat interface. Users ask in natural language, bots match on keywords, the answer misses the question, the user escalates. Everyone loses.
Retrieval-Augmented Generation, not a fine-tuned model
We didn’t fine-tune a model on historical tickets. Fine-tuning is expensive to maintain — every time the product changes, the training set goes stale. Instead we used RAG: a vector store of the current documentation, FAQs, and resolved tickets, with a retrieval step before every generation call.
The retrieval step is where most RAG implementations underperform. We use hybrid search: dense vector similarity plus BM25 keyword matching, with the combined candidates reranked by a cross-encoder. The extra latency (about 200 ms) is worth it for the gain in precision.
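A minimal sketch of that hybrid retrieval flow, under stated assumptions rather than our production code. How the dense and BM25 rankings get merged is not specified above, so reciprocal rank fusion is used here as one common choice. The embeddings are passed in as plain lists, and `rerank_score` is a hypothetical stand-in for a real cross-encoder model. All function names and the defaults (`k1`, `b`, `rrf_k`, `top_k`) are illustrative.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = (sum(len(t) for t in tokenized) / n) or 1.0
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank(scores):
    """Return document indices ordered best-first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def hybrid_search(query, query_vec, docs, doc_vecs, rerank_score,
                  top_k=3, rrf_k=60):
    """Fuse dense and BM25 rankings, then rerank the top candidates."""
    dense = rank([cosine(query_vec, v) for v in doc_vecs])
    sparse = rank(bm25_scores(query, docs))
    # Reciprocal rank fusion (an assumption): each ranking contributes
    # 1 / (rrf_k + position) for every document it contains.
    fused = Counter()
    for ranking in (dense, sparse):
        for position, idx in enumerate(ranking, start=1):
            fused[idx] += 1.0 / (rrf_k + position)
    candidates = [idx for idx, _ in fused.most_common(top_k)]
    # rerank_score is a placeholder for a cross-encoder that scores
    # each (query, document) pair jointly.
    return sorted(candidates,
                  key=lambda i: rerank_score(query, docs[i]),
                  reverse=True)
```

In a real deployment the reranker is typically the slowest stage, which is why it only sees the fused `top_k` candidates rather than the whole corpus.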
Confidence thresholds and graceful escalation
When the model isn’t confident — low similarity scores across the retrieved chunks, or a query that doesn’t match the domain — it escalates explicitly: “I’m not confident I can answer this accurately. Let me connect you with the support team.” Users prefer honest escalation to a confidently wrong answer.
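The escalation rule can be sketched as a simple gate in front of the generation call. This is an illustration under assumptions, not the exact logic we run: the `Chunk` type, the `should_escalate` and `respond` helpers, the `generate` callable, and the 0.75 threshold are all hypothetical, and the right threshold depends on the embedding model and has to be tuned on real traffic.

```python
from dataclasses import dataclass

ESCALATION_MESSAGE = (
    "I'm not confident I can answer this accurately. "
    "Let me connect you with the support team."
)


@dataclass
class Chunk:
    text: str
    similarity: float  # retrieval score, higher means a closer match


def should_escalate(chunks, min_similarity=0.75):
    """Escalate when nothing was retrieved (a likely out-of-domain
    query) or when every retrieved chunk scores below the threshold."""
    if not chunks:
        return True
    return max(c.similarity for c in chunks) < min_similarity


def respond(query, chunks, generate):
    """Call the model only when retrieval looks trustworthy."""
    if should_escalate(chunks):
        return ESCALATION_MESSAGE
    # generate is a placeholder for the actual generation call.
    return generate(query, chunks)
```

Checking before generation also means a low-confidence query never pays for a model call.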
What actually drives the 40% deflection
The biggest factor wasn’t model quality; it was documentation quality. The bot can only be as good as the content it retrieves. We spent two weeks rewriting the top 30 FAQ entries to be more specific and answer-first. That single change improved the deflection rate by 12 percentage points.
If you’re building a support bot, audit your documentation before you build anything else.