Modern web pages are cluttered with tracking scripts, analytics, styling, ads, and interactive elements that waste tokens and dilute semantic meaning when the content is fed to AI systems. This library strips away that noise and gives you clean, meaningful HTML:
- Reduces token count by 60-90% (lower API costs)
- Improves embedding quality (less noise = better semantic search)
- Speeds up processing (smaller payloads = faster inference)
- Preserves structure (headings, paragraphs, links stay intact)
- Zero dependencies (pure JavaScript, no bloat)
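To make the idea concrete, here is a rough sketch of what this kind of noise stripping can look like in dependency-free JavaScript. It is not the library's actual code: the function name `stripNoise`, the `NOISE_TAGS` list, and the set of attributes removed are all assumptions picked for illustration, and a regex-based approach like this is much cruder than a real HTML parser.

```javascript
// Illustrative sketch only, not the library's implementation.
// Elements whose whole content is treated as noise (hypothetical list).
const NOISE_TAGS = ['script', 'style', 'noscript', 'iframe', 'svg'];

function stripNoise(html) {
  let out = html;

  // Remove HTML comments.
  out = out.replace(/<!--[\s\S]*?-->/g, '');

  // Remove noisy elements together with everything inside them.
  for (const tag of NOISE_TAGS) {
    const block = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?<\\/${tag}\\s*>`, 'gi');
    out = out.replace(block, '');
  }

  // Drop presentational, tracking, and event-handler attributes while
  // leaving others such as href untouched. Caveat: a regex cannot tell
  // markup from text, so prose containing e.g. ` id=5` would also be hit.
  out = out.replace(
    /\s(?:class|id|style|data-[\w-]+|on\w+)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)/gi,
    ''
  );

  // Collapse runs of whitespace left behind by the removals.
  return out.replace(/\s+/g, ' ').trim();
}

// Usage: headings, paragraphs, and links survive; the rest is dropped.
const sample =
  '<div class="hero" onclick="track()"><script>ga("send")</script>' +
  '<h1 id="t">Title</h1><!-- ad --><p style="color:red">Hello ' +
  '<a href="/x" data-id="3">link</a></p></div>';
console.log(stripNoise(sample));
```

How much this shrinks a page depends entirely on how much script, style, and attribute weight the page carries in the first place.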
Basically, I render the page with headless Chromium via Puppeteer. Then, some logic extracts and cleans the HTML content. Finally, I call Gemini with a specific response schema to get back JSON.
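The three stages described above could be wired together roughly like this. This is a sketch under assumptions, not the actual project code: `analyzePage` and `renderWithPuppeteer` are hypothetical names, the `networkidle2` wait condition is a guess at a reasonable default, and the Gemini call is left as an injected `analyze` function because the real schema and client setup are not shown here.

```javascript
// Stage 1 (assumed shape): render the page in headless Chromium.
// Puppeteer is imported lazily so the rest of the sketch runs without it.
async function renderWithPuppeteer(url) {
  const { default: puppeteer } = await import('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2' });
    return await page.content(); // serialized HTML after scripts have run
  } finally {
    await browser.close();
  }
}

// Stages 1 to 3 chained: render, clean, then ask the model for JSON.
// `clean` and `analyze` are injected; `analyze` stands in for the
// Gemini call with its response schema (hypothetical here).
async function analyzePage(url, { render, clean, analyze }) {
  const rawHtml = await render(url);
  const cleanHtml = clean(rawHtml);
  const result = await analyze(cleanHtml);
  return {
    url,
    rawLength: rawHtml.length,
    cleanLength: cleanHtml.length,
    result,
  };
}
```

Passing the stages in as functions keeps the pipeline testable with stubs, and makes it easy to swap the renderer later (for example, for a headful browser) without touching the rest.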
So yeah, it needs a headful browser, and I’m not able to do that for now. It’s too much work for a POC, but it’s a really good case, and at least I learned something :) Thank you again!
> The page displays a Cloudflare challenge error, indicating a potential security or server issue that could impact SEO.
> Semantic Analysis
>
> The page currently shows a Cloudflare error and lacks essential SEO elements like a descriptive title, meta description, and proper header tags. The content is largely unavailable due to the error. Fixing the Cloudflare issue is paramount, followed by implementing appropriate title, meta description, and header tags for optimal search engine visibility.
I'm not sure how I can bypass the Cloudflare CAPTCHA, but it's an interesting use case. I'll try to fix it and keep you updated. Thank you again for giving it a try!
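Until the challenge itself can be handled, one option is to detect it and fail early, so the tool reports "blocked" instead of running an SEO analysis on an error page. The sketch below does not bypass anything. The function name `looksLikeCloudflareChallenge` is hypothetical, and the HTML markers are heuristics based on what challenge pages commonly contain, so they may miss some variants or change over time. The `cf-mitigated: challenge` response header is the signal Cloudflare documents for this purpose.

```javascript
// Heuristic check: is this response a Cloudflare challenge page rather
// than the real content? Illustrative sketch, not part of the library.
function looksLikeCloudflareChallenge(html, headers = {}) {
  // Documented signal: Cloudflare sets `cf-mitigated: challenge`
  // on challenge responses. Expects lower-case header names.
  const mitigated = String(headers['cf-mitigated'] || '').toLowerCase();
  if (mitigated === 'challenge') return true;

  // Fallback markers commonly seen in challenge pages (assumptions).
  const markers = [
    /<title>\s*Just a moment/i,
    /\/cdn-cgi\/challenge-platform\//i,
  ];
  return markers.some((re) => re.test(html));
}
```

With Puppeteer, the headers can be read from the response returned by `page.goto(url)` via `response.headers()`, which returns lower-case header names, and the HTML from `page.content()`.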