Hacker News | nirvanist's comments

Extract Structured Data from Any Web Page https://page-replica.com/structured/live-demo


Thank you for the comment. Probably not in this module, but I'm definitely thinking about how to implement this.


Modern web pages are cluttered with tracking scripts, analytics, styling, ads, and interactive elements that waste tokens and dilute semantic meaning when processing content for AI systems. This library strips away the noise to give you clean, meaningful HTML that:

- Reduces token count by 60-90% (fewer API costs)
- Improves embedding quality (less noise = better semantic search)
- Speeds up processing (smaller payloads = faster inference)
- Preserves structure (headings, paragraphs, links stay intact)
- Zero dependencies (pure JavaScript, no bloat)
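
To make the idea concrete, here is a minimal sketch of this kind of cleanup. It is not the library's actual code or API: `cleanHtml`, `NOISE_TAGS`, and `KEEP_ATTRS` are hypothetical names, and the regex approach is an assumption that only holds for reasonably well-formed markup.

```typescript
// Hypothetical sketch, not the library's real implementation.
// Tags whose whole content is treated as noise (assumed list).
const NOISE_TAGS = ["script", "style", "noscript", "iframe", "svg"];
// Attributes worth keeping so links stay intact (assumed list).
const KEEP_ATTRS = new Set(["href"]);

function cleanHtml(html: string): string {
  // Drop HTML comments first.
  let out = html.replace(/<!--[\s\S]*?-->/g, "");

  // Remove noise elements together with everything inside them.
  for (const tag of NOISE_TAGS) {
    const block = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?<\\/${tag}\\s*>`, "gi");
    out = out.replace(block, "");
  }

  // Rewrite each opening tag, keeping only allow-listed double-quoted attributes.
  out = out.replace(
    /<([a-z][a-z0-9]*)\b([^>]*)>/gi,
    (_match: string, name: string, attrs: string) => {
      const kept: string[] = [];
      const attrRe = /([a-z-]+)\s*=\s*"([^"]*)"/gi;
      let m: RegExpExecArray | null;
      while ((m = attrRe.exec(attrs)) !== null) {
        const attrName = m[1].toLowerCase();
        if (KEEP_ATTRS.has(attrName)) kept.push(`${attrName}="${m[2]}"`);
      }
      return kept.length > 0 ? `<${name} ${kept.join(" ")}>` : `<${name}>`;
    }
  );

  // Collapse runs of whitespace to keep the payload small.
  return out.replace(/\s+/g, " ").trim();
}

const sample =
  '<h1 class="t">Title</h1><script>track()</script>' +
  '<p style="color:red">Hi <a href="/x" onclick="y()">link</a></p>';
console.log(cleanHtml(sample));
```

Headings, paragraphs, and link targets survive, while scripts, inline styles, and event handlers are dropped. A real implementation would likely need a proper parser to handle malformed HTML.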


At the moment, I use it in client projects to build agents for their chat systems by adding RAG to the models.


Okay


Basically, I use a headless Chromium with Puppeteer to render the page. Then, some logic extracts and cleans the HTML content. Finally, I use Gemini with a specific schema to return a JSON response.
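
Roughly, the three steps chain together like the sketch below. This is only an outline under assumptions: `Stages` and `extractStructured` are hypothetical names, and the actual Puppeteer rendering and Gemini call are passed in by the caller rather than shown, so no real library API is implied here.

```typescript
// Hypothetical pipeline shape: render -> clean -> schema-guided extraction.
// The concrete Puppeteer and Gemini calls are supplied by the caller.
interface Stages<T> {
  // e.g. headless Chromium rendering that returns the final page HTML
  render: (url: string) => Promise<string>;
  // e.g. stripping scripts, styles, and other noise from the HTML
  clean: (html: string) => string;
  // e.g. an LLM call that returns JSON matching the given schema
  extract: (cleanedHtml: string, schema: object) => Promise<T>;
}

async function extractStructured<T>(
  url: string,
  schema: object,
  stages: Stages<T>
): Promise<T> {
  const rawHtml = await stages.render(url);   // 1. render the page
  const cleaned = stages.clean(rawHtml);      // 2. extract and clean the HTML
  return stages.extract(cleaned, schema);     // 3. structured JSON from the model
}
```

Keeping the stages injectable makes it easy to swap the renderer (for example, for a headful browser) without touching the rest.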


Okay, thanks!


So yeah, it needs to use a headful browser, and I'm not able to do that for now. It's too much work for a POC, but it's a really good case, and at least I learned something :) Thank you again.


Hey, can you give it another try? I made some updates.


Same

> The page displays a Cloudflare challenge error, indicating a potential security or server issue that could impact SEO.

> Semantic Analysis The page currently shows a Cloudflare error and lacks essential SEO elements like a descriptive title, meta description, and proper header tags. The content is largely unavailable due to the error. Fixing the Cloudflare issue is paramount, followed by implementing appropriate title, meta description, and header tags for optimal search engine visibility.


Hmm, actually this can also be challenging for search engine bots. I'm definitely going to dig deeper into this issue. Thank you again.


If it helps, I checked my website with https://www.seobility.net/en/seocheck/ and it looks fine.

I also checked my Cloudflare dashboard, and I can see crawls from Google and Bing.


Really helpful. Are you enabling "Block AI Bots" under Security/Bots on Cloudflare?


no

Bot Fight Mode => Enabled. Block AI Bots => Disabled.


Absolutely, I will tweak the wrapper to provide a better solution. Thank you again for this valuable feedback.


I'm not sure how I can bypass the Cloudflare CAPTCHA, but it's an interesting use case. I'll try to fix it and keep you updated. Thank you again for giving it a try!


Did you try with another URL? And can you share yours with me? It could be an interesting case for me. Thank you.


Not sure what caused the issue, but I tried it in incognito and it worked!


Glad to hear that

