Hacker News | _bitliner's comments

I was wondering what your market is. I mean, who is going to use a service like this? What are the typical use cases for it?


Might be useful for search (not necessarily just web search, but e-commerce as well).

For example, instances of "aqua" should probably match the search query "blue". Google seems like it may already be that advanced, but other search engines perhaps not. Large-scale search engines would probably keep this in their own DB, though.
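
The "aqua" matching "blue" idea can be sketched as simple synonym expansion at query time. This is a minimal illustration with a hand-maintained synonym table; a real system would source the table from a thesaurus service like the one discussed here, or use a search engine's built-in synonym filter.

```python
# Minimal synonym-expansion sketch; the table below is invented for illustration.
SYNONYMS = {
    'blue': {'aqua', 'azure', 'navy'},
    'aqua': {'blue'},
}

def expand_query(term):
    """Return the query term plus any known synonyms."""
    return {term} | SYNONYMS.get(term, set())

def matches(query, document_terms):
    """True if any expansion of the query appears among the document's terms."""
    return bool(expand_query(query) & set(document_terms))

print(matches('blue', ['aqua', 'dress']))  # an "aqua" product matches "blue" -> True
```

The same expansion can also be done at index time instead, trading index size for faster queries.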


No idea yet. Seems to have gotten a good response from the Hacker News / Product Hunt crowd. I mainly built it because I needed it for another project.


I really like the flow/UX. Congratulations! Nice job!

What is the roadmap?

I do a lot of scraping; it's part of my daily job. I would consider integrating this into one of my architectures.


Also, what do you mean by `JavaScript pages supported`? Can I just specify where it has to click, or do I need to reverse engineer the AJAX calls?


http://demo.pyspider.org/debug/js_test_sciencedirect is a sample of this.

There is a phantomjs fetcher that renders the page with WebKit. Furthermore, you can have some JavaScript run before/after the page loads, e.g. to simulate a mouse click.
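
In a pyspider handler this is driven by the options passed to `self.crawl`. A sketch of the relevant fields, built as a plain dict so it runs standalone (the field names follow pyspider's documented crawl options; the URL and the `#load-more` button are invented for illustration):

```python
def js_crawl_task(url):
    """Build crawl options asking the phantomjs fetcher to render the page
    and run a script after load, e.g. to simulate a mouse click."""
    return {
        'url': url,
        'fetch_type': 'js',            # route the fetch through phantomjs
        'js_run_at': 'document-end',   # run the script after the page loads
        'js_script': (
            "function() {"
            "  document.querySelector('#load-more').click();"  # hypothetical button
            "}"
        ),
    }

task = js_crawl_task('http://demo.pyspider.org/')
print(task['fetch_type'])
```

In an actual handler you would pass these as keyword arguments to `self.crawl` inside a `BaseHandler` subclass.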


But won't it be slow, assuming it downloads CSS/images etc.?


Images are not downloaded by default. Both the fetcher and the phantomjs proxy are fully async.


To make it more flexible and easier to reuse? I have implemented most of the features I need for now.


Because I already have a powerful distributed architecture. I was curious about the architecture of pyspider.

For example, how is the queue handled? Is it centralized? Is there a server managing it?


The architecture of pyspider: http://blog.binux.me/assets/image/pyspider-arch.png

And yes, the queue is centralized, in the scheduler. It's designed to handle about 10-100 million URLs per project.

The scheduler, fetchers, and processors are connected via RabbitMQ (or an alternative message queue). Only one scheduler is allowed, but you can run multiple fetchers or processors as needed.
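
The topology described above — one centralized queue feeding several fetcher workers — can be sketched with an in-process queue standing in for the scheduler's RabbitMQ queue. This is a toy illustration of the pattern, not pyspider's actual code; the URLs and worker count are made up:

```python
import queue
import threading

task_queue = queue.Queue()          # the single, centralized queue
results = []
results_lock = threading.Lock()

def fetcher(worker_id):
    """Pull URLs off the shared queue until it is drained."""
    while True:
        try:
            url = task_queue.get_nowait()
        except queue.Empty:
            return
        with results_lock:
            results.append((worker_id, url))   # pretend we fetched the page
        task_queue.task_done()

for url in ['http://a.example/', 'http://b.example/', 'http://c.example/']:
    task_queue.put(url)

workers = [threading.Thread(target=fetcher, args=(i,)) for i in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(len(results))  # all URLs were fetched, by whichever worker was free -> 3
```

With RabbitMQ the queue lives in a broker process instead, so fetchers can run on separate machines, but the single-producer/many-consumer shape is the same.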


Will it be a good fit if I, running on a hundred servers, need to scrape just the home page of a million sites? No analysis of the pages; that is done later.


The fetcher alone would already fit your use case...


You are running

   phantomjs phantomjs_fetcher.js
and using it as a proxy? The setup instructions are a bit unclear on this.


I wanted to make it an HTTP proxy in the beginning, but I found that hard to do. So now everything is POSTed to it instead, though I haven't changed the name.

But it works like a proxy: any request with `fetch_type == 'js'` is fetched through phantomjs, and the response goes back to tornado_fetcher.
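
The routing just described can be sketched as a small dispatch function: the tornado fetcher hands any `fetch_type == 'js'` task to the phantomjs fetcher over HTTP POST, and fetches everything else itself. The endpoint address below is an assumption for illustration (the phantomjs fetcher listens on a configurable host:port):

```python
PHANTOMJS_ENDPOINT = 'http://localhost:25555'   # assumed address of phantomjs_fetcher.js

def route_fetch(task):
    """Decide where a crawl task should be fetched.

    Returns a (route, target) pair: 'phantomjs' tasks would be sent as an
    HTTP POST to the phantomjs fetcher, which renders the page and returns
    the response; 'direct' tasks are fetched by the async HTTP client.
    """
    if task.get('fetch', {}).get('fetch_type') == 'js':
        return ('phantomjs', PHANTOMJS_ENDPOINT)
    return ('direct', task['url'])

print(route_fetch({'url': 'http://example.com/', 'fetch': {'fetch_type': 'js'}}))
```

So it behaves like a proxy from the crawler's point of view, even though the transport is a plain POST rather than HTTP proxy semantics.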

