
Webpages not being crawled by Screaming Frog app

3:38 pm on June 23, 2015 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 19, 2004
posts:953
votes: 12


Hi, I just made an archive page for my older articles. The articles are displayed across paginated pages so one can browse through the entire archive.

However, when I use the Screaming Frog SEO tool to crawl my site, the archive section is not being crawled at all.

What could be the reason? Thanks
4:17 pm on June 23, 2015 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2630
votes: 191


There could be a few reasons:
- pages blocked by robots.txt, with Screaming Frog honouring robots.txt
- something on the server blocking the Screaming Frog user agent

Have you checked your server log files? Can you see requests from Screaming Frog? If so, what does the response say?
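
For example, something like this should pull the relevant requests out of an Apache access log (the log path varies by server, so adjust it; Screaming Frog's default user agent string contains "Screaming Frog SEO Spider"):

grep -i "screaming frog" /var/log/apache2/access.log

If those lines show a 403 or another error status, the block is server-side rather than robots.txt.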
4:21 pm on June 23, 2015 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 19, 2004
posts:953
votes: 12


Are these the access-log files in Apache? What should I search for in that file? Thx
4:32 pm on June 23, 2015 (gmt 0)

Administrator from GB 

WebmasterWorld Administrator engine is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:May 9, 2000
posts:25913
votes: 880


robots.txt is the first file to look at - if you have one, it should be in the site root, and it will list any crawl directives.

For example, this blocks everything:

User-agent: *
Disallow: /
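
By contrast, a narrower rule along these lines (the /archive/ path here is only an example) would block just an archive section while leaving the rest of the site crawlable:

User-agent: *
Disallow: /archive/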
4:47 pm on June 23, 2015 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2630
votes: 191


Can you crawl other parts of your site successfully? If so, then it is either a robots.txt block, or the URLs for your archive section are not constructed in a way the crawler can understand.
4:53 pm on June 23, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15506
votes: 750


Emphasis mine:
The Screaming Frog SEO Spider is robots.txt compliant. It obeys robots.txt in the same way as Google.

It will check robots.txt of the (sub) domain and follow (allow/disallow) directives specifically for the Screaming Frog SEO Spider user-agent, if not Googlebot and then ALL robots. It will follow any directives for Googlebot currently as default. Hence, if certain pages or areas of the site are disallowed for Googlebot, the spider will not crawl them either. The tool supports URL matching of file values (wildcards * / $) just like Googlebot.

My, how familiar that sounds.
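
To illustrate that wildcard support: a single hypothetical rule like this (path and parameter invented for illustration) would disallow every paginated archive URL at once:

User-agent: *
Disallow: /archive/*?page=

Since * matches any run of characters, Screaming Frog - following the Googlebot rules quoted above - would skip all of those pages.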

Edit:
The user-agent switcher has inbuilt preset user agents for Googlebot, Bingbot, Yahoo! Slurp, various browsers and more. This feature also has a custom user-agent setting which allows you to specify your own user agent.

I don't perfectly understand why this is necessary or even desirable. Wouldn't misrepresenting your UA just make you more likely to be blocked outright?
7:05 pm on June 23, 2015 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Apr 30, 2008
posts:2630
votes: 191


There is an option in Screaming Frog to ignore robots.txt exclusions - perhaps worth trying?
2:09 am on June 24, 2015 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 19, 2004
posts:953
votes: 12


I shall check the robots.txt file today and report back here.
2:15 am on June 24, 2015 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 19, 2004
posts:953
votes: 12


Yeah, I checked - the robots.txt file looks OK.
3:12 am on June 24, 2015 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 19, 2004
posts:953
votes: 12


Yeah, this is puzzling: the SEO Spider only goes up to 6 pages within the archive (the articles are paginated). Beyond that it doesn't index any of the other articles. What could be the reason? Thanks!
3:16 am on June 24, 2015 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15506
votes: 750


Does "6 pages" mean that it has to follow five links? Or does the first page have direct links to each of the other pages? I'm thinking there might be a setting for recursion depth.
4:03 am on June 24, 2015 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 19, 2004
posts:953
votes: 12


Yeah, the first page has direct links to the other pages. Where is the recursion depth setting? Thanks
10:01 am on June 24, 2015 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 19, 2004
posts:953
votes: 12


Well, I included the wildcard path for that archive page and it crawled them fine. I guess it just takes time to reach 100% - I'll have to wait.
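
For anyone who hits the same thing: Screaming Frog's Include filter (under the Configuration menu) takes regular expressions, so a pattern roughly like this (the /archive/ path is just an example, not my actual URL) covers the whole paginated section:

.*/archive/.*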