Blog post
April 9, 2026

How to build a RAG pipeline directly on the Anthropic API with Supabase

Step-by-step tutorial for building a production RAG system with Supabase pgvector, OpenAI embeddings, and Anthropic Claude - no frameworks or middleware, just direct API implementation.

[Diagram: RAG pipeline architecture showing the Supabase pgvector database, OpenAI embedding generation, cosine similarity search, and direct Anthropic Claude API integration with no middleware]

Most RAG tutorials use LangChain, LlamaIndex, or automation platforms. These add layers between you and the system you're building. This tutorial shows direct implementation - Supabase for vector storage, OpenAI for embeddings, Anthropic's Claude for generation. No middleware. You control every step.

We build RAG systems this way because clients own the entire pipeline with no platform dependencies.

What you're building

A production RAG system with four components: document ingestion (chunk text, generate embeddings, store in Supabase), vector storage (PostgreSQL with pgvector), retrieval (cosine similarity search via Supabase RPC), and generation (Claude with retrieved context injected into prompts).

Stack: Supabase (PostgreSQL plus pgvector), OpenAI embeddings API, Anthropic Claude API, Node.js for orchestration.

Why this stack: Supabase is PostgreSQL - standard, portable, self-hostable. The embedding model and LLM are swappable, so you can switch providers. No vendor lock-in beyond API dependencies.

Prerequisites

You need a Supabase project (free tier works), an OpenAI API key for embeddings, an Anthropic API key for Claude, Node.js installed locally, and basic understanding of async JavaScript and SQL.

Part 1: Supabase setup

Enable pgvector extension

In your Supabase dashboard, go to Database, SQL Editor, and enable the vector extension. This adds vector data types and similarity operators to PostgreSQL, allowing you to store and search embedding vectors efficiently.
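In SQL, enabling the extension is a one-liner:

```sql
-- Run once per project in the SQL Editor.
create extension if not exists vector;
```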

Create the documents table

Create a table with fields for document ID, text content, metadata as JSON, embedding as a vector with 1536 dimensions, and timestamp. The embedding column dimension matches OpenAI's text-embedding-3-small output. If you use a different embedding model, adjust the dimension to match its output size.
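A minimal schema matching that description might look like this (column names are illustrative; adjust to your conventions):

```sql
create table documents (
  id bigserial primary key,
  content text not null,
  metadata jsonb default '{}'::jsonb,
  embedding vector(1536),  -- matches text-embedding-3-small output
  created_at timestamptz default now()
);
```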

Create the similarity search function

Build a database function that takes three parameters: a query embedding vector, a similarity threshold, and a count of results to return. The function should calculate cosine similarity between the query vector and all stored document vectors, filter results above the threshold, order by similarity descending, and limit to the specified count.

The function returns document ID, content, metadata, and similarity score for each match. This RPC function becomes your main search interface - you call it from your application code with a query embedding and get back the most relevant documents.
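Here's one way to write that function, following the common pgvector pattern where cosine similarity is 1 minus the <=> distance (the name match_documents is our choice, not a requirement):

```sql
create or replace function match_documents (
  query_embedding vector(1536),
  match_threshold float,
  match_count int
)
returns table (id bigint, content text, metadata jsonb, similarity float)
language sql stable
as $$
  select
    documents.id,
    documents.content,
    documents.metadata,
    1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  where 1 - (documents.embedding <=> query_embedding) > match_threshold
  order by documents.embedding <=> query_embedding
  limit match_count;
$$;
```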

Add a vector index

Create an IVFFlat index on the embedding column using cosine distance operations. The lists parameter should be roughly your total row count divided by 1000. For 100,000 documents, use lists equals 100. For 1 million documents, use lists equals 1000.

Without this index, similarity searches are slow - potentially taking seconds for large datasets. With the index, searches complete in milliseconds even with millions of vectors.
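For example, for a table of around 100,000 rows:

```sql
-- Best created after bulk-loading data, so the index
-- clusters reflect the actual distribution of vectors.
create index on documents
  using ivfflat (embedding vector_cosine_ops)
  with (lists = 100);
```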

Get your Supabase credentials

From Settings, API, copy your project URL and service role key (the anon key won't work for server-side operations). You'll use these to connect from your application code.

Part 2: Document ingestion

This is the process that takes your source documents, chunks them into manageable pieces, generates embeddings, and stores everything in Supabase.

Chunking strategy

Break your documents into 400-600 word segments with 50-word overlap between chunks. This balances context preservation with retrieval precision.

Why this size: Smaller chunks (200-300 words) give more precise retrieval but can miss broader context. Larger chunks (800-1000 words) preserve context better but dilute relevance scores. The 400-600 word range works well for most business documents.

Why overlap: The 50-word overlap ensures sentences aren't split awkwardly across chunk boundaries and maintains context continuity. If a critical sentence appears near a chunk boundary, the overlap increases the chance it's fully captured in at least one chunk.

Adjust for your content: Legal documents might need larger chunks to preserve complete clauses. FAQ content might work better with smaller chunks per question-answer pair. Test with your specific documents.
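A minimal word-based chunker sketch; real documents may warrant sentence-aware splitting, but this captures the size and overlap guidance above:

```js
// Split text into overlapping chunks. Sizes are in words,
// with defaults matching the 400-600 word / 50-word-overlap guidance.
function chunkText(text, chunkSize = 500, overlap = 50) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(' '));
    if (start + chunkSize >= words.length) break; // last chunk reached
  }
  return chunks;
}
```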

Generate embeddings

Use OpenAI's embeddings API to convert text chunks into 1536-dimension vectors. The text-embedding-3-small model balances cost and quality well.

Batch processing: Don't generate embeddings one chunk at a time. Batch 50-100 chunks per API call. This dramatically reduces API overhead and speeds up ingestion. The embeddings API accepts arrays of text and returns arrays of vectors in the same order.

Cost consideration: At text-embedding-3-small's rate of $0.02 per 1M tokens, embedding 10,000 document chunks (roughly 4 million tokens) costs about $0.08. This is a one-time cost per document.
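A batched call might look like this (plain fetch against the documented endpoint; the openai npm package works equally well):

```js
// Embed up to ~100 chunks in one request. Vectors come back
// in the same order as the input texts.
async function embedBatch(texts) {
  const res = await fetch('https://api.openai.com/v1/embeddings', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({ model: 'text-embedding-3-small', input: texts }),
  });
  if (!res.ok) throw new Error(`Embeddings API error: ${res.status}`);
  const { data } = await res.json();
  return data.map((d) => d.embedding);
}
```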

Store in Supabase

For each chunk, insert a record containing the text content, the embedding vector, and metadata (source document, chunk index, category, date, whatever is relevant for filtering later).

The metadata field as JSON gives you flexibility. You might store source filename, document type, author, creation date, or any other attributes useful for filtering search results or understanding where information came from.
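With @supabase/supabase-js, this is a straightforward bulk insert (storeChunks and its startIndex parameter are our own naming):

```js
import { createClient } from '@supabase/supabase-js';

// Service role key, not the anon key, for server-side writes (see Part 1).
const supabase = createClient(
  process.env.SUPABASE_URL,
  process.env.SUPABASE_SERVICE_ROLE_KEY
);

async function storeChunks(chunks, embeddings, source, startIndex = 0) {
  const rows = chunks.map((content, i) => ({
    content,
    embedding: embeddings[i],
    metadata: { source, chunk_index: startIndex + i },
  }));
  const { error } = await supabase.from('documents').insert(rows);
  if (error) throw error;
}
```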

Ingestion workflow

The complete ingestion process: read source document, clean the text (remove headers, footers, formatting artifacts), chunk into segments with overlap, batch chunks and send to embeddings API, receive embedding vectors, insert chunks with embeddings and metadata into Supabase, log success or errors.

For production systems, add error handling at each step, retry logic for API failures, and progress tracking so you can resume interrupted ingestion jobs.
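Tying the helpers above together, a bare-bones ingestion loop might look like this (cleanText is assumed to strip your format's headers and footers; retries and resume logic are omitted for brevity):

```js
async function ingestDocument(rawText, source) {
  const chunks = chunkText(cleanText(rawText));
  const BATCH_SIZE = 100;
  for (let i = 0; i < chunks.length; i += BATCH_SIZE) {
    const batch = chunks.slice(i, i + BATCH_SIZE);
    const embeddings = await embedBatch(batch); // one API call per batch
    await storeChunks(batch, embeddings, source, i);
    console.log(`Ingested ${i + batch.length}/${chunks.length} chunks from ${source}`);
  }
}
```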

Part 3: Query and retrieval

This is the real-time component that responds to user questions by finding relevant documents and generating answers.

Embed the query

When a user asks a question, first convert their question into a vector using the same embedding model you used for documents. This ensures the query and document vectors exist in the same semantic space and can be meaningfully compared.

Search for similar documents

Call your Supabase similarity search function with the query embedding, a similarity threshold (start with 0.5), and how many results you want (typically 4-6). The function returns the most relevant document chunks ranked by similarity score.

What similarity scores mean: Scores range from 0 to 1. Above 0.7 is very relevant. 0.5-0.7 is moderately relevant. Below 0.5 is typically not useful. The threshold filters out low-relevance results before they reach your LLM.
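In application code, that's an embedding call followed by supabase.rpc, using the match_documents function from Part 1 and the embedBatch helper from Part 2:

```js
async function retrieve(question, threshold = 0.5, count = 5) {
  const [queryEmbedding] = await embedBatch([question]);
  const { data, error } = await supabase.rpc('match_documents', {
    query_embedding: queryEmbedding,
    match_threshold: threshold,
    match_count: count,
  });
  if (error) throw error;
  return data; // [{ id, content, metadata, similarity }, ...]
}
```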

Log what was retrieved

Before generating a response, log the retrieved documents, their similarity scores, and the original query. This debugging data is invaluable for improving your system. You'll identify gaps in your knowledge base, tune your similarity threshold, and understand which types of questions work well versus poorly.

In production, store these logs in a database table with fields for query text, retrieved document IDs, similarity scores, and timestamp. Review weekly to spot patterns and opportunities for improvement.
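A sketch of that logging step, assuming a retrieval_logs table with the fields described above (the table name and columns are our own):

```js
async function logRetrieval(question, results) {
  const { error } = await supabase.from('retrieval_logs').insert({
    query_text: question,
    document_ids: results.map((r) => r.id),
    similarity_scores: results.map((r) => r.similarity),
    // timestamp can default to now() in the table definition
  });
  if (error) console.error('Failed to log retrieval:', error);
}
```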

Format the context

Take the retrieved document chunks and format them into a context string. Separate chunks with clear delimiters so Claude can distinguish between different sources. Include enough context but not so much that you exceed token limits or dilute relevance.

Context window management: Claude Sonnet 4 handles 200K tokens, but you don't want to use it all for context. Limit retrieved context to 2000-3000 tokens (roughly 1500-2000 words). This leaves room for the question, system instructions, and a substantial response.
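One simple formatting approach, numbering sources and separating them with a visible delimiter:

```js
function formatContext(results) {
  return results
    .map((r, i) => `[Source ${i + 1}: ${r.metadata.source ?? 'unknown'}]\n${r.content}`)
    .join('\n\n---\n\n');
}
```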

Generate response with Claude

Construct a prompt that includes the retrieved context and the user's question. Instruct Claude to answer based on the provided context and to say so if the context doesn't contain relevant information.

Prompt structure: System message: "You are a helpful assistant. Answer questions based on the provided context. If the context doesn't contain relevant information, say so clearly. Always cite which parts of the context you're using." User message: "Context from knowledge base: [formatted context here]. Question: [user's question]. Answer based on the context provided."

Temperature setting: Use 0.2-0.3 for factual responses. Higher temperatures (0.7+) work for creative applications but introduce variability in factual retrieval scenarios.
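Using the official @anthropic-ai/sdk, the generation step could be sketched like this (the model ID and max_tokens are illustrative defaults):

```js
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const SYSTEM_PROMPT =
  'You are a helpful assistant. Answer questions based on the provided context. ' +
  "If the context doesn't contain relevant information, say so clearly. " +
  "Always cite which parts of the context you're using.";

async function generateAnswer(question, context) {
  const message = await anthropic.messages.create({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    temperature: 0.2, // low temperature for factual retrieval answers
    system: SYSTEM_PROMPT,
    messages: [{
      role: 'user',
      content: `Context from knowledge base:\n${context}\n\nQuestion: ${question}\n\nAnswer based on the context provided.`,
    }],
  });
  return message.content[0].text;
}
```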

Return the response

Send Claude's response back to the user. Optionally include citations showing which source documents were used. This transparency helps users trust the system and verify information.

Part 4: Tuning for production

Setting the similarity threshold

The threshold parameter critically affects result quality. Too low includes irrelevant documents. Too high misses relevant ones.

Start at 0.5: This catches moderately relevant results. Monitor the similarity scores actually being returned. If you consistently see irrelevant results with scores of 0.6, raise the threshold to 0.65. If you're missing good results, lower it to 0.45.

Adjust by use case: Customer support knowledge bases might use 0.6 (high precision). Research tools might use 0.4 (high recall). Test with real queries and tune accordingly.

Choosing result count

We retrieve 5 documents by default, but this isn't universal.

Too few (2-3): Might miss important context if information is distributed across multiple chunks. Good for focused, specific queries where you know relevant info is concentrated.

Too many (10+): Clutters the context window, increases API costs, and can confuse the LLM with contradictory or tangential information. Rarely beneficial.

Sweet spot (4-6): Works for most business applications. Provides enough context without overwhelming. Test with your specific content.

Logging for debugging

Always log retrieval results. Track which documents get retrieved for which queries, what similarity scores they had, and whether users found the answers helpful.

Essential logs: Query text, timestamp, retrieved document IDs, similarity scores, whether an answer was generated or "no information available" was returned, and user feedback if collected.

Use this data to: Identify gaps in the knowledge base (frequent queries with no good matches), tune similarity thresholds (distribution of scores for successful vs unsuccessful queries), improve chunking strategy (are relevant chunks scoring poorly?), and prioritize content additions (what are people asking about that you don't have?).

Handling no results

If similarity search returns nothing (all chunks below threshold), handle it gracefully. Return a clear message like "I don't have information about that in my knowledge base" rather than letting Claude generate a response without context (which leads to hallucination).

Implement fallback logic: broaden the search with a lower threshold, suggest related topics you do have information about, or collect the unanswered question for content team review.
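A minimal version of that fallback, retrying once at a looser threshold (0.35 is an arbitrary illustrative value) before giving up:

```js
async function answerQuestion(question) {
  let results = await retrieve(question, 0.5, 5);
  if (results.length === 0) {
    results = await retrieve(question, 0.35, 5); // broaden the search once
  }
  if (results.length === 0) {
    return "I don't have information about that in my knowledge base.";
  }
  return generateAnswer(question, formatContext(results));
}
```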

Caching embeddings

If users ask the same questions repeatedly, cache the query embeddings. Check the cache before calling the embeddings API. For a question you've embedded before, use the cached vector.

Implementation: Use an in-memory Map for simple cases. Use Redis for production systems with multiple servers. A generous TTL (time to live) is safe, since embeddings for the same text never change; the TTL just bounds memory use.

Cost savings: Significant for FAQ-style systems where 20% of questions account for 80% of traffic. Less impactful for research tools with diverse unique queries.
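The in-memory version is a few lines (swap the Map for Redis when you run multiple servers):

```js
// Simple cache keyed by normalized query text.
const embeddingCache = new Map();

async function embedQueryCached(question) {
  const key = question.trim().toLowerCase();
  if (embeddingCache.has(key)) return embeddingCache.get(key);
  const [embedding] = await embedBatch([question]);
  embeddingCache.set(key, embedding);
  return embedding;
}
```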

Streaming responses

For better user experience, stream Claude's response as it's generated rather than waiting for the complete answer.

The Anthropic API supports streaming. As each chunk of text is generated, send it to your frontend. Users see the response appear word-by-word in real time instead of waiting 5-10 seconds for a complete answer to arrive all at once.

This makes the system feel dramatically more responsive even though total time to completion is the same.
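With the @anthropic-ai/sdk streaming helper, forwarding text deltas as they arrive looks roughly like this (onText is whatever callback sends a chunk to your frontend):

```js
async function streamAnswer(question, context, onText) {
  const stream = anthropic.messages.stream({
    model: 'claude-sonnet-4-20250514',
    max_tokens: 1024,
    temperature: 0.2,
    system: SYSTEM_PROMPT,
    messages: [{
      role: 'user',
      content: `Context from knowledge base:\n${context}\n\nQuestion: ${question}`,
    }],
  });
  stream.on('text', onText); // fires for each text delta
  return stream.finalMessage(); // resolves with the complete message
}
```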

Part 5: Cost and performance

Realistic costs for production

For a system with 10,000 document chunks and 1,000 queries per month:

One-time ingestion cost: 10K chunks at 400 words average equals roughly 4M tokens. Embeddings cost about $0.08 total (at $0.02 per 1M tokens for text-embedding-3-small).

Monthly query embedding costs: 1K queries at 400 words each equals roughly 400K tokens for query embeddings. Cost: about $0.01 per month.

Claude API costs: 1K queries with 500 tokens of context plus 200 tokens of response each equals 500K input and 200K output tokens. At Claude Sonnet 4 pricing ($3 per 1M input tokens, $15 per 1M output tokens), that's $1.50 for input and $3.00 for output, roughly $4.50 per month.

Supabase costs: Free tier handles this volume easily. Pro tier ($25/month) if you exceed database limits.

Total monthly operating cost: Approximately $5-30 depending on Supabase tier. Scales linearly with query volume.

Performance optimization

Batch everything possible: Generate embeddings in batches of 50-100 texts per API call instead of individual calls. This reduces overhead dramatically.

Index your vectors: The IVFFlat index is absolutely essential. Without it, similarity search on 100K vectors takes seconds. With proper indexing, it takes 50-200 milliseconds.

Use connection pooling: If running in serverless functions, use Supabase's connection pooler to avoid exhausting database connections. Each function invocation opening a direct connection will hit limits quickly.

Monitor query performance: Add timing logs to each step (embedding generation, similarity search, LLM call). Identify bottlenecks. Usually it's either embedding API latency or similarity search without proper indexing.

Scaling considerations

This architecture handles up to roughly 1 million document chunks and 10,000 queries per day on modest infrastructure.

When to scale up: If vector searches consistently take over 500ms, increase the IVFFlat lists parameter or switch to HNSW indexing. If you exceed Supabase free tier limits, upgrade to Pro tier. If embedding API rate limits become an issue, implement queuing and batch processing.

When to rearchitect: If you need multi-tenancy with tenant isolation, consider separate tables per tenant. If you need real-time document updates with instant search, implement incremental indexing. If query volume exceeds 50K per day, consider a dedicated vector database like Qdrant.

Why we build this way

No LangChain. No LlamaIndex. No automation platforms. Direct API calls and SQL.

Full control: We see exactly what happens at every step. When something breaks, debugging is straightforward. No hunting through framework abstractions trying to understand what's happening behind the scenes.

No abstraction overhead: Every layer between you and your system adds complexity, failure points, and performance cost. Direct implementation removes this overhead.

Portable: This code runs anywhere Node.js runs. No framework lock-in. Move between hosting providers freely. Self-host if needed.

Maintainable: Standard SQL, standard REST APIs, standard JavaScript. Any developer can read and modify this. No framework-specific knowledge required.

Cost-effective: Pay only for what you use - embeddings and LLM calls. No platform markup. No per-execution fees. Predictable costs that scale linearly.

Client ownership: Clients can take this code and run it themselves completely independently. No dependency on ThinkSwift-specific infrastructure or tools. True system ownership.

When frameworks make sense

Frameworks like LangChain aren't universally bad. They're useful in specific situations.

Use frameworks when: You're prototyping multiple approaches quickly and want pre-built components. You're not deploying to production yet and speed of iteration matters more than control. Your team doesn't want to write orchestration code and prefers higher-level abstractions. You need specific features the framework provides, like prompt template libraries or agent scaffolding.

But for production systems where performance, costs, and long-term maintainability matter, direct implementation wins. You avoid framework complexity, breaking changes, and vendor lock-in. You gain full debugging visibility, cost control, and portability.

TL;DR Summary

What you built: A production RAG pipeline using Supabase for vector storage, OpenAI for embeddings, and Anthropic Claude for generation - no frameworks or middleware.

Stack components: Supabase PostgreSQL with the pgvector extension for vector storage and similarity search. OpenAI text-embedding-3-small for converting text to 1536-dimension vectors. Anthropic Claude Sonnet 4 for generating responses with retrieved context. Node.js with direct API calls for orchestration.

Key steps: Enable the pgvector extension in Supabase. Create a documents table with a vector column and a similarity search RPC function. Chunk documents into 400-600 word segments with 50-word overlap. Generate embeddings via OpenAI and store them in Supabase with metadata. At query time: embed the question, search via the RPC function, retrieve top matches by cosine similarity, format the context, and inject it into the Claude prompt. Log retrieved documents with similarity scores for debugging.

Tuning parameters: Chunk size of 400-600 words balances context preservation and retrieval precision. Similarity threshold starting at 0.5, adjusted based on result quality. Retrieve 4-6 documents for most queries to balance context and cost. Cache frequently asked question embeddings to reduce API costs.

Production optimization: Batch embedding generation (50-100 texts per call) to reduce overhead. Create an IVFFlat index on vectors for fast search (milliseconds vs seconds). Use connection pooling for serverless deployments. Log query performance to identify bottlenecks. Implement streaming responses for better user experience.

Cost at scale: For 10,000 chunks and 1,000 monthly queries: about $0.08 one-time ingestion cost, roughly $4.50 monthly for Claude API calls, and about $0.01 monthly for query embeddings. Total approximately $5-30 per month depending on Supabase tier. Scales linearly with volume.

Why direct implementation: Full control over every step with no framework abstraction, complete debugging visibility, portable code that runs anywhere, maintainable by any developer, cost-effective with no platform fees, and clients own the system outright with no vendor lock-in.

When to use frameworks: For rapid prototyping, when not deploying to production, if the team prefers higher-level abstractions over direct API calls, or when you need specific framework features like prompt templates and pre-built agent components.

Scaling limits: This architecture handles 1 million document chunks and 10,000 queries per day on modest infrastructure. Beyond that, consider dedicated vector databases or architectural changes for multi-tenancy and real-time updates.

Building RAG systems for production? We can implement this architecture for your business or help you build it yourself.

[Talk to us about RAG implementation]

About ThinkSwift

We're a creative software agency in Melbourne building AI-powered knowledge systems for Australian businesses. We implement RAG pipelines this way - direct API calls, no middleware - because clients get full system ownership with no platform dependencies. The code is portable, maintainable, and cost-effective at scale.
