nirvanist · 2026-01-12T05:04:29

Extract Structured Data from Any Web Page https://page-replica.com/structured/live-demo

nirvanist · 2026-01-09T14:26:06

Thank you for the comment. Probably not in this module, but I'm definitely thinking about how to implement this.

nirvanist · 2026-01-09T13:00:23

Modern web pages are cluttered with tracking scripts, analytics, styling, ads, and interactive elements that waste tokens and dilute semantic meaning when processing content for AI systems. This library strips away the noise to give you clean, meaningful HTML that:

- Reduces token count by 60-90% (lower API costs)
- Improves embedding quality (less noise = better semantic search)
- Speeds up processing (smaller payloads = faster inference)
- Preserves structure (headings, paragraphs, links stay intact)
- Zero dependencies (pure JavaScript, no bloat)
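To make the idea concrete, here is a rough sketch of what this kind of cleanup could look like in plain JavaScript. It is not the library's actual code: `cleanHtml` and `NOISE_TAGS` are hypothetical names, and it uses regular expressions instead of a real HTML parser, so it will trip on edge cases such as a `>` inside a quoted attribute value.

```javascript
// Hypothetical sketch, not the library's real implementation.
// Tags whose whole content is treated as noise (an assumed list).
const NOISE_TAGS = ["script", "style", "noscript", "iframe", "svg"];

function cleanHtml(html) {
  // Drop HTML comments.
  let out = html.replace(/<!--[\s\S]*?-->/g, "");

  // Drop noisy elements together with everything inside them.
  for (const tag of NOISE_TAGS) {
    const block = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?<\\/${tag}\\s*>`, "gi");
    out = out.replace(block, "");
  }

  // Strip every attribute from the remaining opening tags, keeping only href
  // so that links stay usable. Closing tags are left untouched.
  out = out.replace(/<([a-zA-Z][a-zA-Z0-9]*)\b([^>]*)>/g, (match, name, attrs) => {
    const href = attrs.match(/(?:^|\s)href\s*=\s*("[^"]*"|'[^']*')/i);
    const selfClosing = /\/\s*$/.test(attrs) ? " /" : "";
    return href ? `<${name} href=${href[1]}${selfClosing}>` : `<${name}${selfClosing}>`;
  });

  // Collapse runs of whitespace into single spaces.
  return out.replace(/\s+/g, " ").trim();
}

console.log(
  cleanHtml('<div class="ad"><script>track()</script><p style="x">Hi <a href="/docs" onclick="spy()">docs</a></p></div>')
);
// prints: <div><p>Hi <a href="/docs">docs</a></p></div>
```

Headings, paragraphs, and links survive, while scripts, styles, and inline attributes are gone. How much smaller the result gets depends entirely on the page.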

nirvanist · 2025-04-29T23:26:30

At the moment, I use it in client projects to build agents for their chat systems by adding RAG to the models.

mahi_novice · 2025-04-30T17:50:03

nirvanist · 2025-04-29T23:25:27

Basically, I use a headless Chromium with Puppeteer to render the page. Then, some logic extracts and cleans the HTML content. Finally, I use Gemini with a specific schema to return a JSON response.
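The three steps above could be wired together roughly like this. It is only a sketch under assumptions: `extractStructured` and its option names are made up for illustration, and the rendering, cleaning, and model stages are passed in as functions so the flow can run without a browser or an API key. A real `render` would wrap Puppeteer (for example `page.goto(url)` followed by `page.content()`), and a real `generate` would call the model with a response schema.

```javascript
// Hypothetical pipeline sketch; all names here are illustrative assumptions.
async function extractStructured(url, { render, clean, generate, requiredKeys }) {
  // 1. Render the page and get the final HTML (headless browser in practice).
  const rawHtml = await render(url);

  // 2. Strip the noise before sending anything to the model.
  const cleaned = clean(rawHtml);

  // 3. Ask the model for JSON and parse its reply.
  const reply = await generate(cleaned);
  const data = JSON.parse(reply);

  // Minimal schema check: the reply must be an object with the expected keys.
  if (data === null || typeof data !== "object") {
    throw new Error("Model reply is not a JSON object");
  }
  const missing = requiredKeys.filter((key) => !(key in data));
  if (missing.length > 0) {
    throw new Error(`Model reply is missing keys: ${missing.join(", ")}`);
  }
  return data;
}
```

Keeping the stages injectable also makes it easy to swap the headless browser for something else later without touching the rest of the flow.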

mahi_novice · 2025-04-30T17:50:13

Okay, thanks!

nirvanist · on Dec 3, 2024

So yeah, it needs a headful browser, and I'm not able to do that for now. It's too much work for a POC, but it's a really good case, and at least I learned something :) Thank you again.

nirvanist · on Dec 2, 2024

Hey, can you give it another try? I made some updates.

Oras · on Dec 2, 2024

Same

> The page displays a Cloudflare challenge error, indicating a potential security or server issue that could impact SEO.

> Semantic Analysis: The page currently shows a Cloudflare error and lacks essential SEO elements like a descriptive title, meta description, and proper header tags. The content is largely unavailable due to the error. Fixing the Cloudflare issue is paramount, followed by implementing appropriate title, meta description, and header tags for optimal search engine visibility.

nirvanist · on Dec 2, 2024

Hmm, actually this can also be challenging for search engine bots. I'm definitely going to dig into this issue more. Thank you again.

Oras · on Dec 2, 2024

If it helps, I checked my website with https://www.seobility.net/en/seocheck/ and it looks fine.

I also checked my Cloudflare dashboard, and I can see crawls from Google and Bing.

nirvanist · on Dec 2, 2024

Really helpful. Do you have "Block AI Bots" enabled under Security/Bots on Cloudflare?

Oras · on Dec 2, 2024

no

Bot Fight Mode => Enabled. Block AI Bots => Disabled.

nirvanist · on Dec 2, 2024

Absolutely, I will tweak the wrapper to provide a better solution. Thank you again for this valuable feedback.

nirvanist · on Dec 2, 2024

I'm not sure how I can bypass the Cloudflare CAPTCHA, but it's an interesting use case. I'll try to fix it and keep you updated. Thank you again for giving it a try!

nirvanist · on Dec 1, 2024

Did you try with another URL? And can you share yours with me? It could be an interesting case for me. Thank you.

rishikeshs · on Dec 1, 2024

Not sure what caused the issue, but I tried it on incognito and it worked!

nirvanist · on Dec 1, 2024

Glad to hear that