# Core Concepts
DocWeb is built on several key principles that differentiate it from standard AI models and traditional documentation tools.
## Real-Time Retrieval vs. Outdated Training
Most LLMs have a knowledge cut-off and can serve stale answers. DocWeb doesn't rely on a static memory bank: it crawls the live site the moment you request it, providing just-in-time intelligence.
## Grounded Context
By using Retrieval-Augmented Generation (RAG), DocWeb strictly limits the AI's context to the Markdown scraped from the specific website, which sharply reduces hallucinations. Every response from Dex includes source citations.
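A minimal sketch of what this grounding can look like in practice: the retrieved Markdown chunks are placed into the prompt with numbered source tags, and the model is told to answer only from that context. The `Chunk` shape, field names, and `buildPrompt` helper below are illustrative assumptions, not DocWeb's actual schema or API.

```typescript
// Hypothetical chunk shape; DocWeb's real schema may differ.
interface Chunk {
  sourceUrl: string;
  text: string;
}

// Assemble a grounded prompt: context-only answering plus [n] citations.
function buildPrompt(question: string, chunks: Chunk[]): string {
  const context = chunks
    .map((c, i) => `[${i + 1}] (${c.sourceUrl})\n${c.text}`)
    .join("\n\n");
  return [
    "Answer using ONLY the context below. Cite sources as [n].",
    "If the answer is not in the context, say you don't know.",
    `Context:\n${context}`,
    `Question: ${question}`,
  ].join("\n\n");
}
```

The numbered tags are what lets the model emit citations that the UI can map back to the original page URLs.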
## Five-Phase Data Flow
- Discovery Phase: User enters URL → Waterfall discovery (robots.txt → sitemaps → navigation crawl) → URLs saved to Firestore
- Visualization Phase: Frontend subscribes to Firestore → Real-time graph updates via Sigma.js → ForceAtlas2 layout
- Scraping Phase: On-demand or auto-scrape → Content extraction via Cheerio → Markdown conversion → Stored in Firestore
- Embedding Phase: Scraped content → Chunked text (4000 chars, 400 overlap) → Gemini embeddings (768-dim) → Global embeddings collection
- Chat Phase: User query → Hybrid search (vector + BM25) → Context retrieval → Gemini response → Source citations
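The Embedding Phase's chunking step (4000-character windows with 400-character overlap, per the numbers above) can be sketched as a sliding window. The function name `chunkText` is illustrative; this is one straightforward way to implement those parameters, not necessarily DocWeb's exact code.

```typescript
// Fixed-size chunking with overlap: each window shares its last 400
// characters with the start of the next, so no sentence is cut off
// without context on at least one side.
function chunkText(text: string, size = 4000, overlap = 400): string[] {
  const chunks: string[] = [];
  const step = size - overlap; // advance 3600 chars per window
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // final window reached the end
  }
  return chunks;
}
```

Each chunk is then embedded independently (768 dimensions via Gemini) and written to the global embeddings collection.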
## Core Algorithms
| Algorithm | Purpose |
|---|---|
| Waterfall Discovery | Multi-stage URL discovery |
| Priority Scoring | Rank URLs by importance (0-100) |
| Cluster Assignment | Overlapping topic groupings |
| Hybrid Search | Vector + keyword retrieval |
| ForceAtlas2 | Physics-based graph layout |
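To illustrate Hybrid Search: the vector and BM25 retrievers each produce a ranked list, and those lists must be merged. The document doesn't specify DocWeb's merging strategy, so the sketch below uses reciprocal rank fusion (RRF), a common rank-based technique; the function name and the conventional damping constant `k = 60` are assumptions.

```typescript
// Reciprocal rank fusion: each document scores 1 / (k + rank) per list
// it appears in, so items ranked well by BOTH retrievers rise to the top.
function rrfFuse(
  vectorRanked: string[],
  bm25Ranked: string[],
  k = 60
): string[] {
  const scores = new Map<string, number>();
  for (const list of [vectorRanked, bm25Ranked]) {
    list.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Rank-based fusion like this avoids having to normalize BM25 scores against cosine similarities, which live on incompatible scales.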
## Key Technical Decisions
- Global Caching: All embeddings and scraped content are shared across users to reduce redundant work
- Overlapping Clusters: URLs can belong to multiple topic clusters (e.g., `/api/webhooks` is in both `/api` and `/webhooks`)
- 768-Dimensional Embeddings: Using Google's text-embedding-004 model for semantic similarity
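A sketch of how overlapping cluster assignment could work, assuming clusters are keyed by URL path segments as in the `/api/webhooks` example above. The segment-per-cluster rule and the `assignClusters` helper are illustrative assumptions; DocWeb's actual assignment logic may weigh other signals.

```typescript
// Assumed rule: each path segment names a cluster, so a URL belongs to
// one cluster per segment rather than a single exclusive group.
function assignClusters(url: string): string[] {
  const path = new URL(url).pathname;
  const segments = path.split("/").filter(Boolean);
  // /api/webhooks -> clusters "/api" and "/webhooks"
  return segments.map((s) => "/" + s);
}
```

Allowing membership in multiple clusters means the graph view can surface a page under every topic it touches instead of forcing a single parent.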
Check out our Architecture for a deeper dive.