Chat With Your Documents Using Garudust Agent — No Vector Da…
DEV Community: rust (Garudust)
Most RAG tutorials start the same way: "First, install a vector database…" Then come the embedding models, the chunking strategies, the similarity thresholds. By the time you can ask a question about a PDF, you've deployed three services and written 200 lines of boilerplate.
Garudust Agent takes a different path. RAG is built in — backed by SQLite FTS5 with a trigram tokenizer. No vector database. No embedding API calls. Drop a PDF (or TXT, CSV, Markdown, JSON) into the conversation and start asking questions in seconds.
How It Works
When you ingest a document, Garudust:
- Extracts text (native PDF parser, no external tools)
- Splits it into chunks (≤ 800 chars, paragraph-aware)
- Indexes chunks into an FTS5 virtual table with tokenize = 'trigram'
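The chunking step listed above can be sketched roughly as follows. This is a minimal Python illustration, not Garudust's actual implementation: the function name `chunk_text` and the choice to treat blank lines as paragraph boundaries are assumptions. Only the 800-character limit comes from the description above.

```python
from typing import List


def chunk_text(text: str, max_chars: int = 800) -> List[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters."""
    chunks: List[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # A single oversized paragraph is hard-split at the limit.
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        # Otherwise keep adding paragraphs while the chunk stays in bounds.
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph."
print(chunk_text(sample, max_chars=20))
# prints ['First paragraph.', 'Second paragraph.']
```

Keeping paragraphs whole where possible means a retrieved chunk is more likely to carry a complete thought into the LLM's context.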
When you ask a question, doc_search runs a full-text query against the index and feeds the top matching chunks to the LLM as context. That's the whole pipeline — one SQLite file at ~/.garudust/state.db.
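Because the trigram tokenizer indexes overlapping three-character sequences instead of whitespace-delimited words, it works on any language, including Thai, Chinese, and Japanese, without any tokenizer configuration. One consequence of that design is that a search term shorter than three characters cannot be matched.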
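To make the index-and-search steps concrete, here is a small sketch using Python's built-in `sqlite3` module. It assumes the bundled SQLite was compiled with FTS5 and is version 3.34 or newer, which is when the trigram tokenizer was added. The table name `chunks`, its columns, and the helper `search_chunks` are hypothetical; Garudust's real schema is not described here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per chunk, indexed with the trigram tokenizer.
conn.execute(
    "CREATE VIRTUAL TABLE chunks USING fts5("
    "path UNINDEXED, body, tokenize='trigram')"
)
conn.executemany(
    "INSERT INTO chunks(path, body) VALUES (?, ?)",
    [
        ("handbook.pdf", "Remote work is permitted up to 3 days per week."),
        ("handbook.pdf", "Annual leave accrues monthly."),
        ("handbook-th.txt", "พนักงานสามารถทำงานจากที่บ้านได้"),
    ],
)


def search_chunks(conn, query, limit=3):
    """Return the best-ranked chunks for an FTS5 query string."""
    sql = (
        "SELECT path, body FROM chunks "
        "WHERE chunks MATCH ? ORDER BY rank LIMIT ?"
    )
    return conn.execute(sql, (query, limit)).fetchall()


# Quoted strings are FTS5 phrases; matching ignores case by default.
print(search_chunks(conn, '"remote work"'))
# Thai has no spaces between words, yet substring matching still works.
print(search_chunks(conn, '"ทำงาน"'))
```

The rows returned by a query like this are what would be passed to the LLM as context.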
Setup
RAG is enabled by default. The only thing you need to configure is which directories the agent is allowed to read from:
# ~/.garudust/config.yaml
security:
  allowed_read_paths:
    - /home/you/documents
    - /data/company-docs
That's it. If you want to turn RAG off entirely:
disabled_toolsets: [rag]
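An allow-list such as allowed_read_paths implies a containment check before any file is read. The sketch below shows one common way to do that in Python; the function `is_read_allowed` is hypothetical, and Garudust's own check may work differently.

```python
from pathlib import Path
from typing import Iterable


def is_read_allowed(path: str, allowed_roots: Iterable[str]) -> bool:
    """Return True if path resolves to a location inside an allowed root."""
    # resolve() normalizes ".." segments and follows symlinks, so a path
    # cannot escape an allowed directory by traversal.
    target = Path(path).resolve()
    for root in allowed_roots:
        root_path = Path(root).resolve()
        if target == root_path or root_path in target.parents:
            return True
    return False


allowed = ["/home/you/documents", "/data/company-docs"]
print(is_read_allowed("/home/you/documents/employee-handbook.pdf", allowed))
print(is_read_allowed("/home/you/documents/../.ssh/id_rsa", allowed))
```

Comparing resolved path components, rather than raw string prefixes, also avoids treating /home/you/documents-old as if it were inside /home/you/documents.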
Your First Ingestion
Start the CLI:
garudust
Then tell the agent to ingest a file:
You: ingest /home/you/documents/employee-handbook.pdf
Agent: Indexed employee-handbook.pdf — 47 chunks ready for search.
Preview: "This handbook outlines the policies and procedures for all employees…"
Now ask anything:
You: What is the remote work policy?
Agent: According to the employee handbook, remote work is permitted up to 3 days per week
for roles that do not require on-site presence. Employees must notify their manager
at least 24 hours in advance and maintain availability during core hours (10am–4pm).
The Four RAG Tools
You never call these directly — the agent decides when to use them based on your question. But knowing they exist helps you understand what's happening.
Re-ingesting a file
If a document changes, just ingest it again. The old index for that path is replaced automatically.
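Replacing the index for a path can be pictured as a delete followed by an insert inside one transaction. This is only a sketch under assumptions: the `chunks` table and the `reingest` helper are hypothetical, and it needs an SQLite build with FTS5 and the trigram tokenizer (3.34 or newer).

```python
import sqlite3


def reingest(conn, path, chunks):
    """Replace every indexed chunk for path in a single transaction."""
    with conn:  # commits on success, rolls back if an error is raised
        conn.execute("DELETE FROM chunks WHERE path = ?", (path,))
        conn.executemany(
            "INSERT INTO chunks(path, body) VALUES (?, ?)",
            [(path, body) for body in chunks],
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE chunks USING fts5("
    "path UNINDEXED, body, tokenize='trigram')"
)
reingest(conn, "/docs/policy.pdf", ["old text one", "old text two"])
reingest(conn, "/docs/policy.pdf", ["new text"])
remaining = conn.execute("SELECT body FROM chunks").fetchall()
print(remaining)
```

Doing both statements in one transaction means a failed ingestion leaves the previous index for that path intact.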
Forgetting a document
You: Remove the Q1 report from the index.
Agent: Document removed from index.
Or clear everything:
You: Clear all indexed documents.
Agent: Removed 5 document(s) from index.
FTS5 Query Syntax
doc_search supports full FTS5 syntax (quoted phrases, the boolean operators AND, OR, and NOT, and NEAR groups), which the agent uses automatically when your question benefits from it.
You don't need to write FTS5 queries yourself — the agent figures this out. But if you want to guide it:
You: Search for "termination clause" in the contract documents.
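For readers curious what a quoted phrase looks like at the FTS5 level: a phrase is wrapped in double quotes, and any embedded double quote is escaped by doubling it. The helper name `fts5_phrase` below is hypothetical; the quoting rule itself is standard FTS5 syntax.

```python
def fts5_phrase(text: str) -> str:
    """Quote text as a single FTS5 phrase, doubling embedded quotes."""
    return '"' + text.replace('"', '""') + '"'


# Phrases can then be combined with FTS5 boolean operators.
query = fts5_phrase("termination clause") + " OR " + fts5_phrase("notice period")
print(query)
# prints "termination clause" OR "notice period"
```

Quoting user-supplied text this way also keeps stray punctuation from being read as FTS5 operators.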
Real-World Use Cases
1. Company Knowledge Base
Ingest your onboarding docs, SOPs, and internal wikis. New team members can ask questions in plain language instead of searching through Confluence.
You: ingest /docs/sop-release-process.md
You: What approvals are needed before a hotfix can go to production?
2. Contract and Legal Review
You: ingest /legal/vendor-agreement-2025.pdf
You: Does this contract include a limitation of liability clause? What is the cap?
3. Log Analysis
Ingest a log file and ask questions without writing grep patterns:
You: ingest /var/log/app/error.log
You: Which service caused the most errors in the last hour?
You: Are there any database connection timeouts?
4. Codebase Documentation
You: ingest /project/docs/api-reference.md
You: What parameters does the /auth/refresh endpoint accept?
Ingesting Files Sent via Telegram or LINE
If you're running garudust-server with a messaging platform, users can send files directly to the bot. Attachments are automatically saved to a temporary path and can be ingested on request:
User sends: quarterly-report.pdf (attached)
Agent: I received your file. Would you like me to index it for search?
User: yes
Agent: Indexed quarterly-report.pdf — 83 chunks ready.
Preview: "Q1 2025 Financial Summary — Total Revenue: $4.2M…"
User: What was the gross margin for Q1?
Agent: According to the report, gross margin for Q1 2025 was 61.3%,
up from 58.9% in Q4 2024.
Platform attachments (files from Telegram, LINE, Discord, etc.) are always allowed regardless of allowed_read_paths, since they're written to /tmp/garudust_* by the platform adapter.
Multiple Documents at Once
You can ingest multiple files and search across all of them in the same session:
You: ingest /docs/policy-2024.pdf
You: ingest /docs/policy-2025.pdf
You: What changed in the travel expense policy between 2024 and 2025?
The agent searches both documents and synthesizes the differences.
Check what's indexed at any time:
You: What documents have you indexed?
Agent: 2 documents indexed:
- policy-2024.pdf | 34 chunks | ingested 2025-05-21 09:14
- policy-2025.pdf | 38 chunks | ingested 2025-05-21 09:15
Limitations
- No semantic search: FTS5 does keyword/trigram matching, not embedding similarity. If the document says "annual leave" and you ask about "vacation days," the index itself will not connect the two. The agent can bridge the gap with its language understanding, but results then depend on the LLM's reasoning.
- Session-scoped by default: The index persists in state.db, but searches are scoped to the current conversation key. Starting a new session means re-ingesting if you want to query the same files.
- Text-only: Images, tables, and charts inside PDFs are not extracted.
Summary
Garudust's RAG won't replace a purpose-built vector search pipeline for large-scale production retrieval. But for a developer who wants to ask questions about their documents right now — without running a second service — it's the fastest path from PDF to answer.
Garudust Agent — GitHub · Releases