Beyond the Wall: Building a Low-Cost, High-Efficiency Cloud RAG Application with Firestore Vector Search
RAG (Retrieval-Augmented Generation) is an AI framework that lets developers supply an LLM with external information without retraining it, improving the accuracy of its answers. As of 2026, it is a widely adopted technique.
The concept is roughly as follows: First, vectorize external information (the data you want the LLM to know) using an embedding model and store it. After a user enters a prompt, the prompt is also vectorized using the same embedding model. It is then compared against the previously stored vectors to retrieve the most similar pieces of data. These are then integrated by the LLM to generate a response for the user. This approach allows the LLM to answer using specific knowledge integrated by developers without the need for retraining.
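The steps above can be sketched end to end with a toy, stand-in embedder (a bag-of-words counter over a tiny vocabulary; a real embedding model produces dense semantic vectors, and the "store" here is an in-memory list rather than a database):

```python
import math

def embed(text: str) -> list[float]:
    # Stand-in embedder: word counts over a tiny fixed vocabulary.
    # A real embedding model maps text to a dense semantic vector.
    vocab = ["dragon", "fire", "gym", "tired", "strong", "north"]
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# 1. Ingest: vectorize the external knowledge once and store it.
quotes = ["fire and blood and dragon", "the north remembers"]
store = [(q, embed(q)) for q in quotes]

# 2. Query: vectorize the user prompt with the SAME embedder and
#    retrieve the most similar stored item.
prompt = "I feel the fire of the dragon today"
best = max(store, key=lambda item: cosine(embed(prompt), item[1]))

# 3. Inject the retrieved knowledge into the LLM prompt as context.
llm_input = f"Context: {best[0]}\nQuestion: {prompt}"
```

The essential constraint is visible in step 2: the query must pass through the same `embed` function as the stored data, or the similarity comparison is meaningless.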
Recently, I used RAG to develop an AI bot that answers whether I should skip my gym workout today, using quotes from “A Song of Ice and Fire” (the book series behind Game of Thrones). The project is called “Iron Counsel”. The project goals were:
- Use a Telegram Bot as the frontend for easy access.
- Deploy the RAG system to the cloud; I didn’t want to keep my own computer on all day or deal with complex networking and security issues.
- In line with goal 2, implement a whitelist for personal use or for a few friends, to prevent billing explosions from excessive traffic.
- Bilingual support (Chinese and English).
- Low cost.
The project code can be found on my GitHub. This post mainly explains the concepts and technical choices, operation, and implementation from an architectural perspective.
Concept Explanation
Vector Search
Unlike traditional keyword search, vector search doesn’t look for exact matches. Instead, it converts every piece of data into a multi-dimensional vector. During a search, it calculates the similarity between the search criteria and the stored vectors to find the closest matches.
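As a minimal illustration, with pre-computed vectors standing in for embedded documents, a vector search is just a ranking by similarity (toy 3-dimensional vectors here; real embeddings have hundreds of dimensions):

```python
import numpy as np

# Toy pre-computed 3-d vectors standing in for embedded documents.
docs = {
    "winter is coming": np.array([0.9, 0.1, 0.0]),
    "pay the iron price": np.array([0.0, 0.8, 0.6]),
}
query = np.array([0.8, 0.2, 0.1])  # the vectorized search criteria

def cosine_sim(a, b):
    # Cosine similarity: 1.0 means "pointing the same way" in
    # vector space, regardless of exact keyword overlap.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank stored vectors by similarity to the query vector.
ranked = sorted(docs, key=lambda d: cosine_sim(query, docs[d]), reverse=True)
```

Here `ranked[0]` is "winter is coming" even though the query shares no keywords with it; the match is by proximity in vector space, not by exact terms.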
Embedding Model
A model that converts data into multi-dimensional vectors. Different embedding models have varying levels of support for different data types—some are multilingual, some only support a single language; some support images, while others only support text. The dimensions of the generated vectors also vary. When choosing an embedding model, you must match it to your needs. Furthermore, the same model must be used for both initial data processing and subsequent user prompt vectorization; otherwise, the comparison in vector search will be biased.
Low-Cost Cloud RAG Architecture & Technical Choices
Phase 1: The Ingestion Pipeline
The primary goal of the Ingestion Pipeline is to vectorize and store data so it can be used later to match user prompts.
Before the data hits the cloud, I chose to perform the heavy lifting on my own computer. A MacBook with an Apple M1 chip or newer can handle vectorizing data with a lightweight embedding model with ease.
- Local Embedding: I developed an ingest.py script. Instead of calling expensive cloud APIs, it uses the MacBook’s CPU directly to convert over 2,000 Game of Thrones script lines into vectors via FastEmbed (ONNX).
- Decoupled Upload: After generating vectors.json, it is uploaded in batches to the Firestore vector database.
- Q&A: Why not do it in the cloud? Because local computation is free! This “offline pre-processing” ensures the cloud only handles the core storage and querying.
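A minimal sketch of the ingestion idea, with the embedder stubbed out (the real script calls FastEmbed’s TextEmbedding model; the JSON field names and batch size below are illustrative):

```python
import json

DIM = 384  # dimensionality of the real model's output vectors

def embed(text: str) -> list[float]:
    # Placeholder: the real ingest.py would call FastEmbed here, e.g.
    #   model = TextEmbedding("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
    #   vec = list(next(model.embed([text])))
    return [0.0] * DIM

def build_records(lines: list[str]) -> list[dict]:
    # One record per script line: the raw text plus its vector.
    return [{"text": line, "embedding": embed(line)} for line in lines]

def batches(records: list[dict], size: int = 500):
    # Firestore batched writes are capped at 500 operations,
    # so the upload step chunks the records accordingly.
    for i in range(0, len(records), size):
        yield records[i:i + size]

lines = ["Winter is coming.", "A Lannister always pays his debts."]
records = build_records(lines)
with open("vectors.json", "w") as f:
    json.dump(records, f)
```

The decoupling is the point: everything above runs locally and for free; only the finished vectors.json touches the cloud.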
Embedding Model Choice
Why choose paraphrase-multilingual-MiniLM-L12-v2?
- Multilingual Support: I wanted a bot that understands both Chinese and English, and this model performs excellently in semantic alignment for both languages.
- Small User Base: With so few users, there is no load to spread across third-party pay-per-token embedding APIs, which would only add cost; a local model is enough.
- Tiny but Mighty: It only has 384 dimensions.
- Lower dimensionality means lower storage costs in Firestore and faster queries.
- Compared to OpenAI’s 1,536-dimensional models, its computation speed on a CPU is extremely fast, making it ideal for running in Cloud Run containers without GPUs.
- ONNX Compatibility: Through FastEmbed, this model runs in ONNX format. This eliminates the need for the heavy PyTorch library, resulting in smaller container sizes and faster cold starts.
Firestore Vector and History Database
- Vector Store Implementation: Firestore supports KNN vector search. I converted script lines into 384-dimensional vectors locally and uploaded them to Firestore. When a user sends a prompt, executing a KNN vector comparison in Firestore retrieves the most relevant lines. For example, even if you don’t mention “dragons,” if your meaning relates to “powerful force,” Firestore might pull up a Daenerys quote.
- Chat History Persistence: Besides vectors, the project stores the chat history for each whitelisted user in Firestore. Using sub-collections, it can retrieve the last 10 messages with minimal latency and inject them into the LLM prompt. This gives the project long-term memory; it will remember you complained about your boss two minutes ago.
- Why not a dedicated vector database (like Pinecone)? At this scale, Firestore’s “one-stop shop” allows us to perform vector retrieval and history logging within the same ACID transaction space. This reduces a network hop, prioritizing low latency and easier maintenance.
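The runtime read path can be sketched with the Firestore calls stubbed out. The prompt-assembly and history-windowing helpers below are illustrative, and the comment shows roughly what the KNN query looks like in the google-cloud-firestore client (exact parameters depend on the client version):

```python
# In the real service, the nearest quotes come from Firestore's KNN
# query, roughly:
#   collection.find_nearest(vector_field="embedding",
#                           query_vector=Vector(prompt_vec),
#                           distance_measure=DistanceMeasure.COSINE,
#                           limit=k)
# and the history from a per-user sub-collection ordered by timestamp.

def last_n_messages(history: list[dict], n: int = 10) -> list[dict]:
    # Keep only the most recent n messages, oldest first, so the LLM
    # sees a bounded, chronological conversation window.
    return history[-n:]

def build_prompt(question: str, quotes: list[str], history: list[dict]) -> str:
    lines = ["You are Iron Counsel. Answer using the quotes below."]
    lines += [f"Quote: {q}" for q in quotes]
    for msg in last_n_messages(history):
        lines.append(f"{msg['role']}: {msg['text']}")
    lines.append(f"user: {question}")
    return "\n".join(lines)

history = [{"role": "user", "text": f"msg {i}"} for i in range(15)]
prompt = build_prompt("Should I skip the gym?", ["Winter is coming."], history)
```

Capping the injected history at 10 messages keeps the LLM context (and token cost) bounded while still providing short-term memory.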
Phase 2: The Runtime
IQ and Speed — LLM (Groq)
I considered and tested several LLM options before choosing Groq with Llama 3.3 70B.
- For a RAG project, the LLM doesn’t need to be “massive” because the core information is the data provided by the developer, not the LLM’s internal knowledge. In this scenario, the LLM just needs to answer quickly. Groq’s LPU (Language Processing Unit) architecture provides blazing-fast inference speeds. While typical GPUs are still loading model weights, Groq has already streamed hundreds of tokens.
- Model Selection: Llama 3.3 70B: I needed a model that supports both Chinese and English contexts and can maintain a specific persona. Llama 3.3 70B’s reasoning capabilities are close to GPT-4, and its response speed on Groq is remarkably fast.
- Pareto Efficiency of Cost and Performance: Groq is currently very, very generous to developers! It allows me to enjoy the logical reasoning power highly praised in the open-source world with minimal (or zero for low volume) API costs.
LangChain
LangChain is responsible for coordinating the dialogue between the user, the embedding model, and the LLM. I used the FastEmbed wrapper within LangChain’s Embeddings interface. When a user asks, “Should I drink this wine?”, the process looks like this:
- LangChain calls FastEmbed to convert the sentence into a 384-dimensional array like [0.12, -0.05, ...].
- LangChain passes this array to Firestore for vector comparison.
- The LLM then receives the relevant script lines retrieved by LangChain.
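The contract LangChain expects from an embeddings provider is small: embed_documents for ingestion and embed_query for prompts. The stub below mirrors that interface without importing langchain; in the project, the FastEmbed wrapper plays this role, and the placeholder zero vectors are obviously not real embeddings:

```python
class StubEmbeddings:
    """Mirrors LangChain's Embeddings interface shape."""

    dim = 384  # matches paraphrase-multilingual-MiniLM-L12-v2

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        # Used at ingestion time: one vector per document.
        return [self.embed_query(t) for t in texts]

    def embed_query(self, text: str) -> list[float]:
        # Used at query time: a single vector for the user prompt.
        # Placeholder values; a real model returns semantic floats.
        return [0.0] * self.dim

emb = StubEmbeddings()
vec = emb.embed_query("Should I drink this wine?")
```

Because the chain only depends on this small interface, swapping FastEmbed for another embedding backend later would not disturb the rest of the pipeline.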
GCP Cloud Run
- Scale-to-Zero: I chose Google Cloud Run. Since the bot might only be called a dozen times a day, Cloud Run only charges when a request is active. When idle, the bill stays at $0. (Cloud Run would not be cost-effective for a high-traffic bot).
- Low Maintenance: With a small user base, there’s no need for a persistent GKE cluster and its complex maintenance.
- GCP Ecosystem: GCP offers Firestore, Cloud Run, and Artifact Registry as integrated services. Having everything in one place makes checking logs and managing images very convenient.
CI/CD: Terraform & GitHub Actions
- Infrastructure as Code (IaC): Everything is defined with Terraform, from Firestore index configurations and Secret Manager keys to Cloud Run permissions. This ensures reproducibility and reduces operational complexity. If I ever need to move the bot to another project, a terraform apply can rebuild everything in five minutes. It also prevents the human errors associated with manual console configuration.
- CI/CD with GitHub Actions: Automated deployment pipelines ensure that every code push triggers a Docker build and push to Artifact Registry.
- Secure Environment Variables: Telegram tokens and Groq API keys are stored in GCP Secret Manager and authorized to the Cloud Run service account via Terraform. Sensitive info never appears in logs or source code, adhering to the Principle of Least Privilege.
Telegram Bot Webhook
- Low Overhead: Telegram bots can receive updates via long polling (the bot repeatedly asks Telegram’s servers whether new messages have arrived) or via Webhooks. Webhooks are passive triggers; the backend only reacts when a message actually arrives.
- Scale to Zero: When someone messages the bot, Telegram sends an HTTPS POST request to our FastAPI backend on Cloud Run. This perfectly fits the “Scale-to-Zero” requirement.
- Token Verification: To prevent malicious calls, I implemented a Secret Token verification mechanism checking the X-Telegram-Bot-Api-Secret-Token header.
- Access Control (The Gatekeeper): I implemented a Telegram User ID whitelist in the code. Only authorized users can interact with the bot, protecting my budget from random traffic.
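These two gate checks can be sketched as plain functions (the token value and user IDs are illustrative; in the project these checks run inside a FastAPI route on Cloud Run, with the secret pulled from Secret Manager):

```python
import hmac

SECRET_TOKEN = "my-webhook-secret"   # illustrative; really in Secret Manager
WHITELIST = {111111111, 222222222}   # allowed Telegram user IDs

def is_authentic(headers: dict) -> bool:
    # Telegram echoes back the secret set via setWebhook in this header;
    # compare_digest avoids timing side-channels on the comparison.
    supplied = headers.get("X-Telegram-Bot-Api-Secret-Token", "")
    return hmac.compare_digest(supplied, SECRET_TOKEN)

def is_whitelisted(update: dict) -> bool:
    # Telegram updates carry the sender's numeric ID at message.from.id.
    user_id = update.get("message", {}).get("from", {}).get("id")
    return user_id in WHITELIST

ok = is_authentic({"X-Telegram-Bot-Api-Secret-Token": "my-webhook-secret"})
blocked = is_whitelisted({"message": {"from": {"id": 999}}})
```

Rejecting unauthenticated or non-whitelisted requests before any embedding or LLM call is what actually protects the budget: bad traffic never reaches the expensive part of the pipeline.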
Architecture Blueprint
Conclusion
This project successfully created an extremely low-cost RAG application with several key features:
- Decoupled Dev/Prod: Ingestion pipeline is local; runtime is in the cloud.
- Robustness: GCP’s managed services make downtime unlikely, and the access controls keep the bot from being overwhelmed by abusive traffic.
- Security: Ensured by whitelisting, Telegram headers, and Secret Manager.
- Balanced Performance & Cost: Leveraging Groq’s speed and Cloud Run’s elasticity provides a responsive and nearly zero-cost implementation for a specific user scale.
The design principle: Use minimal resources to design the most rational, secure, and stable architecture.