Architecture

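DocWeb pairs a Next.js frontend with Firebase Cloud Functions and Firestore. Firestore listeners keep the UI updated in real time, and the serverless backend is designed to scale with load.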

System Overview

                                  +------------------+
                                  |  Firebase Auth   |
                                  +--------+---------+
                                           |
+------------------+            +----------v-----------+
|                  |   HTTPS    |                      |
|  Next.js Frontend+----------->|  Firebase Functions  |
|  (React 19)      |            |  (Node.js 20)        |
|                  |<-----------+                      |
+--------+---------+  Realtime  +----------+-----------+
         |            Firestore            |
         |                                 |
+--------v---------+            +----------v-----------+
|                  |            |                      |
|  Sigma.js Graph  |            |  Google Gemini AI    |
|  Visualization   |            |  (2.5 Flash)         |
|                  |            |                      |
+------------------+            +----------+-----------+
                                           |
                                +----------v-----------+
                                |                      |
                                |  Firestore Database  |
                                |  + Global Embeddings |
                                |                      |
                                +----------------------+

Frontend Stack

| Technology | Version | Purpose |
| --- | --- | --- |
| Next.js | 16.1.1 | React framework with App Router |
| React | 19.2.3 | UI library |
| TypeScript | 5.x | Type safety |
| Tailwind CSS | 4.x | Styling |
| Sigma.js | 3.0.2 | WebGL graph rendering |
| @react-sigma/core | 5.0.6 | React bindings for Sigma |
| graphology | 0.26.0 | Graph data structure |
| d3-polygon | 3.0.1 | Cluster hull calculations |
| Firebase | 12.7.0 | Auth, Firestore, Functions client |

Backend Stack (Cloud Functions)

| Technology | Version | Purpose |
| --- | --- | --- |
| Node.js | 20 | Runtime |
| firebase-functions | 6.3.0 | Cloud Functions framework |
| firebase-admin | 13.0.2 | Admin SDK |
| @google/generative-ai | 0.24.1 | Gemini API client |
| Cheerio | 1.0.0 | HTML parsing |
| Turndown | 7.2.2 | HTML to Markdown conversion |
| Axios | 1.7.9 | HTTP client |
| robots-parser | 3.0.1 | robots.txt parsing |
| Stripe | 20.2.0 | Payment processing |
| p-limit | 3.1.0 | Concurrency control |

AI Models

| Model | Purpose | Dimensions |
| --- | --- | --- |
| gemini-2.5-flash | Chat responses, content analysis | N/A |
| text-embedding-004 | Vector embeddings | 768 |
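
Embedding vectors like these are commonly compared with cosine similarity. This page does not say which metric DocWeb uses, so the following is only a minimal sketch under that assumption. The names `EMBEDDING_DIM` and `cosineSimilarity` are hypothetical, not taken from the DocWeb codebase.

```typescript
// Dimension of text-embedding-004 vectors, per the AI Models table.
const EMBEDDING_DIM = 768;

// Hypothetical helper: cosine similarity between two embedding vectors.
// Assumption: DocWeb's actual similarity metric is not documented here.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== EMBEDDING_DIM || b.length !== EMBEDDING_DIM) {
    throw new Error(`expected ${EMBEDDING_DIM}-dimensional vectors`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < EMBEDDING_DIM; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat its similarity as 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

The dimension check fails fast if a vector from a different embedding model is mixed in, which would otherwise produce meaningless scores.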

Data Flow

  1. User Authentication: Firebase Auth (Email/Password or Google OAuth)
  2. Discovery: Waterfall algorithm crawls the site structure
  3. URL Classification: Pages are categorized and prioritized
  4. Real-time Sync: Firestore listeners push updates to the frontend
  5. Visualization: Sigma.js renders the graph with ForceAtlas2 layout
  6. Content Scraping: Cheerio extracts HTML, Turndown converts to Markdown
  7. Embedding Generation: Gemini text-embedding-004 creates 768-dim vectors
  8. RAG Chat: Hybrid search retrieves context, Gemini generates responses
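
Step 8 mentions hybrid search without describing how the keyword and vector results are merged. One common technique is reciprocal rank fusion (RRF), sketched below. It is shown as an illustration only, not as DocWeb's actual method. The function name, the constant `k = 60`, and the sample page paths are all assumptions.

```typescript
// Reciprocal rank fusion (RRF): merge several ranked lists of page IDs.
// Each list contributes 1 / (k + rank) per item, with rank starting at 1.
// Assumption: this is a generic technique, not DocWeb's documented algorithm.
function reciprocalRankFusion(rankings: string[][], k: number = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const contribution = 1 / (k + index + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((x, y) => y[1] - x[1])
    .map(([id]) => id);
}

// Hypothetical result lists from a keyword search and a vector search.
const byKeyword = ["/docs/auth", "/docs/pricing", "/docs/api"];
const byVector = ["/docs/api", "/docs/auth", "/docs/faq"];
const fused = reciprocalRankFusion([byKeyword, byVector]);
```

Pages that rank well in both lists rise to the top, so `/docs/auth` and `/docs/api` come ahead of pages that appear in only one list. The top few fused results would then be passed to the model as context.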

Global Caching Layer

All caches are global and shared across users rather than stored per account:

| Cache Type | Collection | TTL |
| --- | --- | --- |
| Domain Cache | artifacts/global/domainCache | 24 hours |
| Page Cache | artifacts/global/pageCache | 24 hours |
| Embeddings | artifacts/global/embeddings | Permanent |
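
The TTLs above could be enforced with a freshness check along these lines. This is a minimal sketch: the `CacheKind` type, the `TTL_MS` map, and the `isFresh` helper are hypothetical names, and it assumes each cache entry stores the time it was written as a millisecond timestamp.

```typescript
type CacheKind = "domainCache" | "pageCache" | "embeddings";

const DAY_MS = 24 * 60 * 60 * 1000;

// TTLs from the caching table; null means the entry never expires.
const TTL_MS: Record<CacheKind, number | null> = {
  domainCache: DAY_MS,
  pageCache: DAY_MS,
  embeddings: null,
};

// Hypothetical helper: is an entry written at `cachedAtMs` still usable at `nowMs`?
function isFresh(kind: CacheKind, cachedAtMs: number, nowMs: number = Date.now()): boolean {
  const ttl = TTL_MS[kind];
  return ttl === null || nowMs - cachedAtMs < ttl;
}
```

A caller would re-crawl a domain or page only when `isFresh` returns `false`, while embeddings are always reused.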

See Caching for more details.