# Core Concepts
DocWeb is built on several key principles that differentiate it from standard AI models and traditional documentation tools.
## Real-Time Retrieval vs. Outdated Training
Most LLMs have a knowledge cut-off and can serve stale answers. DocWeb doesn't rely on a static memory bank: it crawls the live site the moment you request it, providing just-in-time intelligence.
## Grounded Context
By using Retrieval-Augmented Generation (RAG), DocWeb strictly limits the AI's context to the Markdown scraped from the specific website, which sharply reduces hallucinations. Every response from Dex includes source citations.
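A minimal sketch of what this grounding can look like in practice: the retrieved Markdown chunks are placed into the prompt with numbered source tags, and the model is told to answer only from that context. The `Chunk` shape, field names, and `buildPrompt` helper below are illustrative assumptions, not DocWeb's actual schema or API.

```typescript
// Hypothetical chunk shape; DocWeb's real schema may differ.
interface Chunk {
  sourceUrl: string;
  text: string;
}

// Assemble a grounded prompt: context-only answering plus [n] citations.
function buildPrompt(question: string, chunks: Chunk[]): string {
  const context = chunks
    .map((c, i) => `[${i + 1}] (${c.sourceUrl})\n${c.text}`)
    .join("\n\n");
  return [
    "Answer using ONLY the context below. Cite sources as [n].",
    "If the answer is not in the context, say you don't know.",
    `Context:\n${context}`,
    `Question: ${question}`,
  ].join("\n\n");
}
```

The numbered tags are what lets the model emit citations that the UI can map back to the original page URLs.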
## Five-Phase Data Flow
- Discovery Phase: User enters URL → Waterfall discovery (robots.txt → sitemaps → navigation crawl) → URLs saved to Firestore
- Visualization Phase: Frontend subscribes to Firestore → Real-time graph updates via Sigma.js → ForceAtlas2 layout
- Scraping Phase: On-demand or auto-scrape → Content extraction via Cheerio → Markdown conversion → Stored in Firestore
- Embedding Phase: Scraped content → Chunked text (4000 chars, 400 overlap) → Gemini embeddings (768-dim) → Global embeddings collection
- Chat Phase: User query → Hybrid search (vector + BM25) → Context retrieval → Gemini response → Source citations
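The Embedding Phase's chunking step (4000-character windows with 400-character overlap, per the numbers above) can be sketched as a sliding window. The function name `chunkText` is illustrative; this is one straightforward way to implement those parameters, not necessarily DocWeb's exact code.

```typescript
// Fixed-size chunking with overlap: each window shares its last 400
// characters with the start of the next, so no sentence is cut off
// without context on at least one side.
function chunkText(text: string, size = 4000, overlap = 400): string[] {
  const chunks: string[] = [];
  const step = size - overlap; // advance 3600 chars per window
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // final window reached the end
  }
  return chunks;
}
```

Each chunk is then embedded independently (768 dimensions via Gemini) and written to the global embeddings collection.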
## Core Algorithms
| Algorithm | Purpose |
|---|---|
| Waterfall Discovery | Multi-stage URL discovery |
| Priority Scoring | Rank URLs by importance (0-100) |
| Cluster Assignment | Overlapping topic groupings |
| Hybrid Search | Vector + keyword retrieval |
| ForceAtlas2 | Physics-based graph layout |
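To illustrate Hybrid Search: the vector and BM25 retrievers each produce a ranked list, and those lists must be merged. The document doesn't specify DocWeb's merging strategy, so the sketch below uses reciprocal rank fusion (RRF), a common rank-based technique; the function name and the conventional damping constant `k = 60` are assumptions.

```typescript
// Reciprocal rank fusion: each document scores 1 / (k + rank) per list
// it appears in, so items ranked well by BOTH retrievers rise to the top.
function rrfFuse(
  vectorRanked: string[],
  bm25Ranked: string[],
  k = 60
): string[] {
  const scores = new Map<string, number>();
  for (const list of [vectorRanked, bm25Ranked]) {
    list.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Rank-based fusion like this avoids having to normalize BM25 scores against cosine similarities, which live on incompatible scales.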
## Key Technical Decisions
- Global Caching: All embeddings and scraped content are shared across users to reduce redundant work
- Overlapping Clusters: URLs can belong to multiple topic clusters (e.g., `/api/webhooks` is in both `/api` and `/webhooks`)
- 768-Dimensional Embeddings: Using Google's text-embedding-004 model for semantic similarity
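A sketch of how overlapping cluster assignment could work, assuming clusters are keyed by URL path segments as in the `/api/webhooks` example above. The segment-per-cluster rule and the `assignClusters` helper are illustrative assumptions; DocWeb's actual assignment logic may weigh other signals.

```typescript
// Assumed rule: each path segment names a cluster, so a URL belongs to
// one cluster per segment rather than a single exclusive group.
function assignClusters(url: string): string[] {
  const path = new URL(url).pathname;
  const segments = path.split("/").filter(Boolean);
  // /api/webhooks -> clusters "/api" and "/webhooks"
  return segments.map((s) => "/" + s);
}
```

Allowing membership in multiple clusters means the graph view can surface a page under every topic it touches instead of forcing a single parent.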
Check out our Architecture for a deeper dive.