
Limitations

DocWeb is a powerful tool for visual intelligence and content extraction, but it operates within certain technical and ethical limits.

Discovery Limitations

DocWeb's Waterfall Discovery engine may encounter issues with:

| Limitation | Description |
| --- | --- |
| Dynamic Content (SPA) | Websites relying heavily on client-side JavaScript may not be fully indexed if content isn't rendered on initial load |
| Paywalls & Authentication | DocWeb cannot access content behind paywalls or requiring user login |
| Anti-Scraping Measures | Sites using bot detection (Cloudflare Turnstile, Akamai) may block requests |
| Non-English Content | URLs detected as non-English are skipped during scraping (discovery still captures them) |

Technical Limits

| Parameter | Limit | Reason |
| --- | --- | --- |
| Max Crawl Depth | 3 levels | Prevent infinite loops |
| Max Sitemap Depth | 5 levels | Limit recursion |
| Max URLs Per Source | 10,000 | Memory and performance |
| Request Timeout | 15 seconds | Prevent hanging requests |
| Main Content Size | 50 KB | Firestore document limits |
| Chunk Size | 4,000 characters | Embedding model limits |
| Max Sessions (Free) | 5 | Tier limitation |
| Max Sessions (Pro/Max) | 20 | Tier limitation |
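
To make the size limits in the table above concrete, here is a minimal sketch of how text could be capped at 50 KB and split into 4,000-character chunks. The function names `truncate_utf8` and `chunk_text` are hypothetical, the reading of "50 KB" as 50 × 1024 bytes is an assumption, and DocWeb's actual splitting strategy (for example, whether it splits on sentence boundaries or overlaps chunks) is not documented here.

```python
# Illustrative only: names and byte interpretation are assumptions, not DocWeb's API.
CHUNK_SIZE = 4_000               # characters per chunk, from the limits table
MAIN_CONTENT_LIMIT = 50 * 1024   # assumes "50 KB" means 50 * 1024 bytes


def truncate_utf8(text: str, limit: int = MAIN_CONTENT_LIMIT) -> str:
    """Cut text so its UTF-8 encoding fits in `limit` bytes.

    A multi-byte character that would be split by the cut is dropped whole.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= limit:
        return text
    # errors="ignore" discards the incomplete trailing byte sequence, if any.
    return encoded[:limit].decode("utf-8", errors="ignore")


def chunk_text(text: str, size: int = CHUNK_SIZE) -> list[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


if __name__ == "__main__":
    page = "x" * 9_000
    print([len(chunk) for chunk in chunk_text(page)])
```

A fixed-width split like this can cut a sentence or code block in half, which is one reason real pipelines often split on paragraph or heading boundaries instead.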

Content Limitations

  • Large Files: PDFs, images, and videos are not processed for RAG intelligence
  • Markdown Conversion: Complex HTML layouts may not convert cleanly to Markdown
  • Code Blocks: Syntax highlighting depends on proper language annotations in source HTML

Search Limitations

  • Real-Time Latency: Deep-diving into large sites takes time depending on page count and server response rates
  • Embedding Generation: Initial embedding generation for large sites can take several minutes
  • BM25 Approximation: Keyword search uses an estimated average document length (5,000 characters), so keyword relevance scores are approximate
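
The effect of a fixed average document length can be seen in the standard BM25 term score. The sketch below is not DocWeb's implementation: the function name `bm25_term_score` is hypothetical, `k1 = 1.2` and `b = 0.75` are common textbook defaults rather than confirmed DocWeb settings, and the IDF form is the widely used smoothed variant.

```python
import math

AVG_DOC_LEN = 5_000  # characters; the estimated average mentioned above


def bm25_term_score(
    tf: int,
    doc_len: int,
    n_docs: int,
    docs_with_term: int,
    k1: float = 1.2,      # assumed default, not a documented DocWeb value
    b: float = 0.75,      # assumed default, not a documented DocWeb value
    avg_len: float = AVG_DOC_LEN,
) -> float:
    """Score one query term against one document using BM25.

    tf: occurrences of the term in the document
    doc_len: document length in characters
    n_docs: total documents in the index
    docs_with_term: documents that contain the term
    """
    idf = math.log(1 + (n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Length normalization: documents longer than avg_len are penalized.
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return idf * tf * (k1 + 1) / (tf + norm)


if __name__ == "__main__":
    short = bm25_term_score(tf=3, doc_len=2_000, n_docs=100, docs_with_term=10)
    long_ = bm25_term_score(tf=3, doc_len=20_000, n_docs=100, docs_with_term=10)
    print(f"short page: {short:.3f}, long page: {long_:.3f}")
```

Because `avg_len` is a constant rather than a value measured from the indexed site, a corpus whose pages are typically much longer or shorter than 5,000 characters will see its length normalization skewed accordingly.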

Rate Limits

| Tier | Credits/Day | Actions |
| --- | --- | --- |
| Free | 3 | 3 discoveries or chat messages |
| Pro | 10 | 10 discoveries or chat messages |
| Max | 100 | 100 discoveries or chat messages |

Credits reset at midnight UTC.

Caching Behavior

  • 24-Hour TTL: Cached domains and pages expire after 24 hours
  • Shared Cache: All users share the same cache, so your discovery may use cached data from another user
  • Stale Content: If a site updates frequently, cached content may be outdated until TTL expires

Ethical Considerations

DocWeb is designed for research and intelligence purposes. Users are expected to:

  • Respect robots.txt: DocWeb strictly follows robots.txt rules and crawl-delay directives
  • Avoid Server Overload: Built-in rate limiting prevents excessive requests, but users should avoid repeatedly discovering the same site
  • Comply with Terms of Service: Users are responsible for ensuring their use complies with target site ToS
  • Data Privacy: Do not use DocWeb to access or index personal or sensitive information without authorization
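
To check in advance whether a site's robots.txt would permit crawling a given URL, Python's standard `urllib.robotparser` module can evaluate the rules. This is an illustration of how robots.txt rules and crawl-delay directives are interpreted, not DocWeb's own code; the user agent string `ExampleBot`, the helper `load_rules`, and the sample rules are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

if __name__ == "__main__":
    agent = "ExampleBot"  # hypothetical user agent, not DocWeb's real one
    for url in ("https://example.com/docs/intro", "https://example.com/private/report"):
        verdict = "allowed" if rules.can_fetch(agent, url) else "disallowed"
        print(f"{url}: {verdict}")
    print(f"crawl delay: {rules.crawl_delay(agent)} seconds")
```

For a live site, `RobotFileParser.set_url()` followed by `read()` fetches the file directly instead of parsing a string.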

See our Privacy Policy and Terms of Service for more details.