Forum Moderators: phranque

Scraping websites with Google Chrome console / addons

     
9:17 am on Apr 10, 2015 (gmt 0)

Junior Member

5+ Year Member

joined:Apr 23, 2014
posts:46
votes: 0


I heard that it may be possible to scrape large websites with the Google Chrome console, or with scraper add-ons for Chrome or Firefox. I need to scrape title tags only, for sites with roughly 25 to 50 million pages each. Is something like this possible?

Thanks.
6:07 pm on Apr 10, 2015 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:11874
votes: 245


Xenu Linksleuth or Screaming Frog SEO paid edition can do this.
7:04 am on Apr 11, 2015 (gmt 0)

Junior Member

5+ Year Member

joined:Apr 23, 2014
posts:46
votes: 0


I am familiar with both. They will not work on sites of this size.
7:25 am on Apr 11, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15937
votes: 889


<tangent>
Is this even the most efficient way to extract the information? Seems like you could more easily write a one-time script to go into the database and pull out the line of the table that contains the page titles. Or adapt the function that generates the page title, depending on the type of site.

That's assuming, ahem, cough-cough, that this is your own site we're talking about.
</tangent>
2:21 pm on Apr 11, 2015 (gmt 0)

Junior Member

5+ Year Member

joined:Apr 23, 2014
posts:46
votes: 0


No, these are not my sites. I want to scrape titles from up to 100 very large sites (each could have 50 million pages or more). I would say that most of them don't have accessible sitemaps either.

I know this can be done in Java, PHP, or some kind of script on a fast Linux server (somebody recommended Amazon), among other options.

In general:

- the websites are not mine
- there are no sitemaps / URL lists for these sites
- this needs to be done fairly fast: I can wait two or three months for the scrape, but I would need to work on multiple websites at the same time. I am not sure how many; that could be factored in. If it would take months, I could go down to 3 sites.
- how much it would cost to build is important too, since I may be paying for it with my own money

Thanks.
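
To put the time budget above into numbers, here is a rough back-of-the-envelope sketch. It assumes "two or three months" means about 90 days of continuous crawling and uses the 50 million pages per site figure from the post; `required_rate` is just an illustrative helper name.

```python
def required_rate(pages: int, days: int) -> float:
    """Average pages per second needed to fetch `pages` within `days`."""
    seconds = days * 24 * 60 * 60
    return pages / seconds


# Assumed figures: 50 million pages per site, roughly 90 days available.
per_site = required_rate(50_000_000, 90)

print(f"one site:  {per_site:.1f} pages/s")        # about 6.4 pages/s
print(f"100 sites: {per_site * 100:.0f} pages/s")  # about 643 pages/s
```

So a single site needs a sustained rate of around six requests per second for three months, and 100 sites in parallel would need several hundred per second overall. Dropping to 3 sites brings the total down to roughly 19 pages per second.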
12:56 pm on Apr 15, 2015 (gmt 0)

Preferred Member from GB 

10+ Year Member Top Contributors Of The Month

joined:July 25, 2005
posts:406
votes: 17


Are you sure Xenu can't help? It may be down to the entry point you give it. Mind you, the homepage is not always the best entry point. Depends on the site structure.

Other than that, I'd say a PHP script based on cURL and using a reliable list of 'proxy' IP addresses would be your best bet.

The main challenge will be to figure out how to crawl the sites, because if a site has a dumb internal linking structure, even a custom script will struggle to find all of its pages.
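
As a minimal sketch of what such a custom script has to do, here is a link-following crawler that records only the title of each page. It is written in Python rather than PHP, uses only the standard library, and takes the download step as a `fetch` function you supply, so the example below runs against a small in-memory site instead of a real one. The names `TitleAndLinks`, `crawl_titles` and the `example.test` URLs are made up for illustration. Note that it can only find pages that are reachable by links from the entry point, which is exactly the limitation described above.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class TitleAndLinks(HTMLParser):
    """Collects the <title> text and every <a href> value from one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl_titles(start_url, fetch, max_pages=1000):
    """Breadth-first crawl of one host; returns {url: title}.

    `fetch(url)` must return the page HTML as a string, or None on failure.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    titles = {}
    while queue and len(titles) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = TitleAndLinks()
        parser.feed(html)
        titles[url] = parser.title.strip()
        for href in parser.links:
            # Resolve relative links and drop #fragments before deduplicating.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return titles


# Hypothetical three-page site held in memory, standing in for real downloads.
SITE = {
    "http://example.test/": '<title>Home</title><a href="/a">A</a>'
                            '<a href="http://other.test/x">offsite</a>',
    "http://example.test/a": '<title> Page A </title><a href="/b#top">B</a>',
    "http://example.test/b": "<title>Page B</title>",
}
titles = crawl_titles("http://example.test/", SITE.get)
```

For tens of millions of pages the `seen` set and the queue would not fit comfortably in memory on a small machine, so a real run would likely need to keep them on disk or in a database.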
5:45 am on Apr 16, 2015 (gmt 0)

Full Member from AU 

10+ Year Member

joined:Oct 20, 2003
posts:259
votes: 1


Xenu can end up getting blocked if you are on a network with restrictions. Check how many connections it uses.
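
If you go the custom script route, the same concern applies, and you can control the connection count yourself. A minimal sketch, assuming a thread pool is an acceptable way to cap it: `fetch_all`, `max_connections` and `delay` are illustrative names, and `fetch` is whatever download function you already have.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_connections=4, delay=0.0):
    """Fetch every URL with at most `max_connections` requests in flight.

    Returns {url: result}. `delay` is an optional pause, in seconds, that
    each worker takes after a request before starting its next one.
    """
    def polite_fetch(url):
        result = fetch(url)
        time.sleep(delay)
        return url, result

    # The pool size is the cap: no more than max_connections worker threads
    # exist, so no more than that many fetches can run at the same time.
    with ThreadPoolExecutor(max_workers=max_connections) as pool:
        return dict(pool.map(polite_fetch, urls))
```

Usage would look like `fetch_all(url_list, my_fetch, max_connections=2, delay=0.5)`. Lower numbers are slower but less likely to trip a block.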