Scraping websites with Google Chrome console / addons

         

tpb101

9:17 am on Apr 10, 2015 (gmt 0)

10+ Year Member



I heard that it may be possible to scrape large websites with the Google Chrome console, or with scraper add-ons for Chrome or Firefox. I only need to scrape title tags, for sites with roughly 25 to 50 million pages. Is something like this possible?

Thanks.

phranque

6:07 pm on Apr 10, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Xenu's Link Sleuth or the paid edition of Screaming Frog SEO Spider can do this.

tpb101

7:04 am on Apr 11, 2015 (gmt 0)

10+ Year Member



I am familiar with both. They will not work on sites of this size.

lucy24

7:25 am on Apr 11, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



<tangent>
Is this even the most efficient way to extract the information? It seems like you could more easily write a one-time script to go into the database and pull out the table column that contains the page titles. Or adapt the function that generates the page title, depending on the type of site.

That's assuming, ahem, cough-cough, that this is your own site we're talking about.
</tangent>

tpb101

2:21 pm on Apr 11, 2015 (gmt 0)

10+ Year Member



No, these are not my sites. I want to scrape titles from up to 100 very large sites (they could be 50 million pages or more). I would say most of them don't have accessible sitemaps either.

I know this can be done in Java, PHP, or some other kind of script running on a fast Linux server (somebody recommended Amazon), among other options.

In general:

- the websites are not mine
- there are no sitemaps / URL lists for these sites
- this needs to be done fairly fast: I can wait two or three months for the scrape, but I would need to work on multiple websites at the same time. I am not sure how many; that could be factored in. If it would take months, I could go down to 3 sites.
- the cost of building this matters too, since I may / will be paying for it with my own money
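
For a sense of scale, the time budget above implies a sustained fetch rate. Here is a rough sketch of that arithmetic in Python, assuming a single site of 50 million pages and a window of 60 or 90 days (the function name is only illustrative):

```python
SECONDS_PER_DAY = 86_400


def required_rate(pages: int, days: int) -> float:
    """Average pages per second needed to fetch `pages` within `days`."""
    return pages / (days * SECONDS_PER_DAY)


if __name__ == "__main__":
    # Assumed figures: 50 million pages, two or three months of crawling.
    for days in (60, 90):
        rate = required_rate(50_000_000, days)
        print(f"{days} days: {rate:.1f} pages/sec per site")
```

That works out to roughly 6.4 pages per second per site over 90 days, or about 9.6 over 60 days, sustained around the clock and multiplied by however many sites run in parallel.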

Thanks.

adder

12:56 pm on Apr 15, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Are you sure Xenu can't help? It may be down to the entry point you give it. Mind you, the homepage is not always the best entry point. Depends on the site structure.

Other than that, I'd say a PHP script based on cURL and using a reliable list of proxy IP addresses would be your best bet.
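
To give an idea of what such a script involves, here is a minimal sketch of the same approach. It is written in Python with only the standard library rather than PHP and cURL, so treat it as an illustration of the idea, not the PHP version. The proxy address, user-agent string, and function names are all assumptions, and error handling and retries are left out.

```python
import urllib.request
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element only."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.done = True

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def extract_title(html: str) -> str:
    """Return the page title with whitespace collapsed, or '' if none."""
    parser = TitleParser()
    parser.feed(html)
    return " ".join("".join(parser.parts).split())


def fetch_title(url, proxy=None, timeout=10.0):
    """Fetch `url`, optionally through `proxy`, and return its title.

    `proxy` is a hypothetical address such as "http://203.0.113.5:8080".
    """
    handlers = []
    if proxy:
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)
    request = urllib.request.Request(
        url, headers={"User-Agent": "title-fetcher-sketch/0.1"}  # made-up UA
    )
    with opener.open(request, timeout=timeout) as response:
        # The title normally sits near the top, so read only the first 64 KB.
        head = response.read(65536)
    return extract_title(head.decode("utf-8", errors="replace"))
```

Reading only the start of each response keeps bandwidth down, which matters at tens of millions of pages. Rotating through a proxy list would mean passing a different `proxy` value per request.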

The main challenge will be figuring out how to crawl the sites: if a site has a poor internal linking structure, even a custom script will struggle to find all the pages.
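
As a sketch of why link structure matters, here is a small breadth-first crawler in Python that discovers pages only by following same-site links from the entry points it is given. The function names and the `fetch` callback are assumptions for illustration; `fetch` is whatever returns the HTML for a URL. Any page that nothing links to is simply never found.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(base_url, html):
    """Return absolute, fragment-free links that stay on base_url's host."""
    parser = LinkParser()
    parser.feed(html)
    host = urlparse(base_url).netloc
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host:
            links.append(absolute)
    return links


def crawl(seeds, fetch, max_pages=1000):
    """Breadth-first crawl from `seeds`; returns URLs in visit order."""
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        visited.append(url)
        for link in same_site_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing several entry points in `seeds` (category pages, archive pages, and so on) is one way to reach sections the homepage does not link to. At tens of millions of URLs the in-memory `seen` set would need to be replaced with something disk-backed.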

timchuma

5:45 am on Apr 16, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



Xenu can end up getting blocked if you are on a network with restrictions. Check how many connections it uses.
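
With a custom script the connection count is under your control. Here is a minimal Python sketch of capping simultaneous connections with a fixed-size thread pool; `fetch_all`, the `fetch` callback, and the default of 4 are assumptions for illustration, not a recommended setting.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_connections=4):
    """Call fetch(url) for every URL, at most `max_connections` at a time.

    Results come back in the same order as `urls`.
    """
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        return list(pool.map(fetch, urls))
```

Lowering `max_connections` trades speed for a smaller chance of tripping a firewall or rate limit on either end.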