# Content Extraction
DocWeb's ability to extract and understand content is what powers Dex, our AI chatbot.
## Technology Stack
| Library | Purpose |
|---|---|
| Cheerio 1.0.0 | HTML parsing and DOM traversal |
| Turndown 7.2.2 | HTML to Markdown conversion |
| Axios 1.7.9 | HTTP client for fetching pages |
## Extraction Pipeline
- Fetch: HTTP GET with a 15-second timeout, 3 retries
- Parse: Cheerio loads the HTML and extracts `<main>`, `<article>`, or body content
- Clean: Remove scripts, styles, navigation, and other non-content elements
- Convert: Turndown transforms the HTML to clean Markdown
- Store: Markdown saved to Firestore (max 50KB; see the sketch below)
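The steps map onto the stack above roughly as follows. This is a minimal sketch, not the production implementation: `extractPage` is a hypothetical name, the 3-retry loop is omitted, and the selector and removal lists are illustrative.

```ts
import axios from "axios";
import * as cheerio from "cheerio";
import TurndownService from "turndown";

const MAX_CONTENT_CHARS = 50 * 1024; // 50KB storage cap

async function extractPage(url: string): Promise<string> {
  // Fetch: HTTP GET with a 15-second timeout (retry loop omitted for brevity)
  const { data: html } = await axios.get<string>(url, { timeout: 15_000 });

  // Parse: prefer <main>, then <article>, then fall back to <body>
  const $ = cheerio.load(html);
  const root = ["main", "article", "body"]
    .map((sel) => $(sel).first())
    .find((el) => el.length > 0)!;

  // Clean: strip scripts, styles, navigation, and other non-content elements
  root.find("script, style, nav, header, footer, aside").remove();

  // Convert: HTML to Markdown via Turndown
  const markdown = new TurndownService().turndown(root.html() ?? "");

  // Store: truncate to the 50KB cap before saving to Firestore (write omitted)
  return markdown.slice(0, MAX_CONTENT_CHARS);
}
```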
## Scraping Modes
### Auto-Scrape
After discovery, DocWeb automatically scrapes structural nodes (found in navigation elements):
```ts
{
  action: "autoScrape",
  appId: string,
  searchId: string,
  maxNodes: 10
}
```
Prioritizes nodes with `isStructuralNode: true` and `importanceScore: 1.0`, as sketched below.
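That prioritization can be pictured as a sort over discovered nodes. A minimal sketch, where `DiscoveredNode` and `pickAutoScrapeTargets` are hypothetical names for illustration:

```ts
interface DiscoveredNode {
  id: string;
  isStructuralNode: boolean; // true for nodes found in navigation elements
  importanceScore: number;   // 0.0 to 1.0
}

// Structural nodes first, then by descending importance, capped at maxNodes.
function pickAutoScrapeTargets(nodes: DiscoveredNode[], maxNodes = 10): DiscoveredNode[] {
  return [...nodes]
    .sort(
      (a, b) =>
        Number(b.isStructuralNode) - Number(a.isStructuralNode) ||
        b.importanceScore - a.importanceScore,
    )
    .slice(0, maxNodes);
}
```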
### Cluster Scrape
Scrape all nodes in a specific cluster:
```ts
{
  action: "scrapeCluster",
  appId: string,
  searchId: string,
  clusterId: "/api",
  maxNodes: 20
}
```
### On-Demand Scrape
Scrape specific document IDs:
```ts
{
  action: "scrapeNodes",
  appId: string,
  searchId: string,
  docIds: ["abc123", "def456"]
}
```
### JIT Scraping
When Dex needs content from a page that hasn't been scraped yet during a chat, it automatically triggers a Just-in-Time (JIT) scrape.
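A minimal sketch of that trigger, assuming the chat flow can check a node's `scrapingStatus` before answering; `ensureScraped` and `scrapeNodesFn` are hypothetical stand-ins for DocWeb's internals:

```ts
// Stub for the real scrapeNodes invocation (see Cloud Function below).
declare function scrapeNodesFn(req: {
  action: "scrapeNodes";
  appId: string;
  searchId: string;
  docIds: string[];
}): Promise<void>;

async function ensureScraped(
  node: { id: string; scrapingStatus: string },
  appId: string,
  searchId: string,
): Promise<void> {
  // Dex only triggers a JIT scrape when the page has no stored Markdown yet.
  if (node.scrapingStatus === "complete") return;
  await scrapeNodesFn({ action: "scrapeNodes", appId, searchId, docIds: [node.id] });
}
```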
## Scraped Content Schema
```ts
interface ScrapedContent {
  title?: string;          // Page title
  description?: string;    // First 500 chars
  mainContent?: string;    // Markdown (max 50KB)
  scrapingStatus: "pending" | "scraping" | "complete" | "failed" | "skipped";
  lastScraped?: Date;
  scrapeError?: string;
}
```
## Scraping Status
| Status | Meaning |
|---|---|
| `pending` | Not yet scraped |
| `scraping` | Currently being scraped |
| `complete` | Successfully scraped |
| `failed` | Scrape attempt failed |
| `skipped` | Intentionally skipped (e.g., non-English) |
## Content Filtering
Using URL Classification, DocWeb filters content (see the sketch after this list):
- Skip non-English URLs: detected via path patterns (e.g., `/fr/`, `/zh-cn/`)
- Prioritize documentation: URLs with doc keywords are scraped first
- Limit utility pages: login, search, and legal pages get lower priority
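A rough classifier along these lines might look like the following; the locale, keyword, and utility patterns are assumptions for illustration, not DocWeb's actual rules:

```ts
const NON_ENGLISH = /\/(fr|de|es|ja|ko|zh-cn|zh-tw)\//i;
const DOC_KEYWORDS = /\/(docs?|guide|tutorial|reference|api)\b/i;
const UTILITY = /\/(login|signin|search|legal|terms|privacy)\b/i;

type ScrapePriority = "skip" | "high" | "low" | "normal";

function classifyUrl(url: string): ScrapePriority {
  const path = new URL(url).pathname;
  if (NON_ENGLISH.test(path)) return "skip";  // marked "skipped", never scraped
  if (DOC_KEYWORDS.test(path)) return "high"; // documentation scraped first
  if (UTILITY.test(path)) return "low";       // login/search/legal deprioritized
  return "normal";
}
```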
## Caching
Scraped content is cached globally:
| Cache | Collection | TTL |
|---|---|---|
| Page Cache | artifacts/global/pageCache | 24 hours |
A 12-character MD5 content hash tracks freshness, so unchanged pages aren't re-scraped.
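A minimal sketch of that freshness check, assuming Node's built-in `crypto` module; `contentHash` and `isUnchanged` are hypothetical names:

```ts
import { createHash } from "node:crypto";

// First 12 hex characters of the MD5 digest serve as the freshness key.
function contentHash(markdown: string): string {
  return createHash("md5").update(markdown).digest("hex").slice(0, 12);
}

// Skip re-scraping when the stored hash matches the freshly extracted content.
const isUnchanged = (cachedHash: string, markdown: string): boolean =>
  cachedHash === contentHash(markdown);
```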
## Cloud Function
- Endpoint: `scrapeNodes`
- Memory: 512 MiB
- Timeout: 120 seconds
Response:
```ts
{
  success: boolean;
  scraped: number;
  failed: number;
  skipped: number;
  results: ScrapeResult[];
}
```
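For illustration, an on-demand scrape could be invoked over HTTPS like this; the URL placeholder, the absence of auth, and the `scrapeByIds` helper are assumptions, not the documented client API:

```ts
const FUNCTION_URL = "https://<region>-<project>.cloudfunctions.net/scrapeNodes";

async function scrapeByIds(appId: string, searchId: string, docIds: string[]) {
  const res = await fetch(FUNCTION_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ action: "scrapeNodes", appId, searchId, docIds }),
  });
  if (!res.ok) throw new Error(`scrapeNodes failed: ${res.status}`);
  return res.json() as Promise<{
    success: boolean;
    scraped: number;
    failed: number;
    skipped: number;
    results: unknown[]; // ScrapeResult[] in the real schema
  }>;
}
```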