Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Decreased Crawl Rate - and Many Crawl Errors

         

gzoo

6:00 am on May 16, 2012 (gmt 0)

10+ Year Member



I have been facing severe crawling issues on our website for the past week. Let me try to explain the situation:

On April 28th we modified the structure of our 3-year-old website and created over 250 sub-domains for different regions and industries. The content on the sub-domains is our old content, and we applied 301 redirects to move the old URLs to the new ones; about 85% of the pages now redirect. Within a week Google PR was also assigned to the sub-domains that are linked from the home page.
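For reference, a sub-directory-to-sub-domain move like this is usually done with mod_rewrite. A minimal sketch, assuming Apache; the directory and host names here are made up for illustration, not the site's actual ones:

```apache
# In the main site's .htaccess (hypothetical layout):
RewriteEngine On
# /regions/europe/anything -> http://europe.example.com/anything, permanently
RewriteRule ^regions/([a-z]+)/(.*)$ http://$1.example.com/$2 [R=301,L]
```

A pattern like this redirects every old sub-directory URL in one rule, which is also why a single mistake in it can produce thousands of bad redirects at once.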

Initially everything went fine, but after May 5th the crawl rate of our website gradually decreased by 80-90%. Not just the crawl rate: website traffic has also dropped about 50% compared to previous weeks.

Using "site:" I can see a good number of indexed pages from the new sub-domains. The website also looks fine when checked with "Fetch as Google", and Googlebot is allowed in robots.txt. As usual, the cache shows the current date, which means Google is still crawling the home page every day as before.

As I mentioned, with this mega change 85% of the pages redirect from sub-directories to sub-domains. Secondly, I am noticing many Not Found and Not Followed pages under Crawl Errors. Could these be the cause of the current problem, or does Google just not like such a massive change to our URL structure?

Please share your expert opinions on the possible causes of this problem!

tedster

6:07 pm on May 16, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It sounds like you may have introduced technical errors when you changed things last month. I'd say you have a lot of work ahead of you reviewing all your redirects, etc. There's no easy answer for this kind of mess, just hard work ;(

[edited by: tedster at 6:11 pm (utc) on May 16, 2012]

deadsea

6:08 pm on May 16, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



How are you measuring crawl rate? If you are using google webmaster tools, keep in mind that if you have example.com authorized, the crawl rate will be only for example.com and not for ANY of the subdomains. I authorize each and every subdomain separately in webmaster tools so that I can see crawl activity and problems on each of them.

Assuming that you are looking at your access logs and your crawl rate actually IS down, it sounds like a case of lost pagerank leading to lost rankings. Did links to your site get taken down recently? Maybe Google made an algo change that discounted many of the inbound links to your site.
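To illustrate measuring from access logs, here is a minimal sketch that counts Googlebot requests per day from combined-format log lines. The log format and sample lines are assumptions for the example, not the poster's actual logs:

```python
# Count daily Googlebot hits in a combined-format access log (a sketch).
import re
from collections import Counter

# Matches the date inside a timestamp like [16/May/2012:06:00:00 +0000]
LOG_DATE = re.compile(r'\[(\d{2}/\w{3}/\d{4})')

def crawl_hits_per_day(lines):
    """Return a Counter mapping date string -> number of Googlebot requests."""
    hits = Counter()
    for line in lines:
        if "Googlebot" not in line:   # match on the user-agent substring
            continue
        m = LOG_DATE.search(line)
        if m:
            hits[m.group(1)] += 1
    return hits

sample = [
    '66.249.66.1 - - [16/May/2012:06:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [16/May/2012:06:00:05 +0000] "GET /page HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '10.0.0.1 - - [16/May/2012:06:01:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
print(crawl_hits_per_day(sample))  # Counter({'16/May/2012': 2})
```

Running this per day over a few weeks of logs makes the "400K/day down to 20K/day" kind of drop directly visible, independent of what Webmaster Tools reports.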

lucy24

10:31 pm on May 16, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Secondly, I am noticing many Not Found and Not Followed pages under crawl errors.

Not Found means what it says. It applies equally to humans and robots. Humans will see the 404 page, so make sure you have a nice one while you work on fixing your redirects.

Not Followed applies strictly to robots. It means-- or should mean-- that all links to the page are flagged "nofollow". If only some links have "nofollow", the googlebot will find one of the others, even if you have 8000 links saying "nofollow" and just one where you forgot. Unfortunately your logs won't usually say where the robot came from, so you will have to do some hunting. Get hold of a text editor or other tool that lets you search the content of all your pages in a batch, and hunt for any occurrences of "nofollow".
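The batch hunt described above can be sketched as a short script; the directory layout here is made up for the demo:

```python
# Search all HTML files under a site root for "nofollow" (a sketch).
import os
import tempfile

def find_nofollow(root):
    """Yield (path, line_number) for every line containing 'nofollow'."""
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith((".html", ".htm")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                for lineno, line in enumerate(fh, 1):
                    if "nofollow" in line.lower():
                        yield path, lineno

# Demo on a throwaway directory:
root = tempfile.mkdtemp()
with open(os.path.join(root, "index.html"), "w") as fh:
    fh.write('<a href="/x" rel="nofollow">x</a>\n<a href="/y">y</a>\n')
print(list(find_nofollow(root)))  # one hit, on line 1 of index.html
```

Pointing `find_nofollow` at a local copy of the site lists every page and line to check by hand.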

In theory, the Crawl Errors will tell you where a problem page is linked from. But the link tends to be something dating from 2008*, so it is seldom very useful.


* I believe the current record is a link from a page that disappeared in 1996. Don't remember who posted it.

gzoo

7:19 am on May 19, 2012 (gmt 0)

10+ Year Member



Thank you for your contribution to this thread.

@tedster I have reviewed all the links and redirects on my website but could not find any errors. It is clear that the number of "Not Found (404)" and "Not Followed" pages increased around the 5th, and the problems with crawling and traffic started in the same period.

The rate of "Not Found (404)" errors is 7,000+ per day
The rate of "Not Followed" errors is about 90 per day

@deadsea I am measuring the crawl rate with server logs as well as Webmaster Tools, and yes, I created separate profiles for the sub-domains and am watching the top sub-domains individually.

As for a links issue, I don't think that's it, because Google has slowed the crawling of my website so dramatically. Earlier it was crawling more than 400K pages every day; that has dropped to fewer than 20K pages, including all sub-domains and the main website.

Is excessive use of 301 redirects problematic for Google?

gzoo

7:56 am on May 19, 2012 (gmt 0)

10+ Year Member



A few more things I want to share: my robots.txt is empty. As far as I know, an empty robots.txt tells search engines that everything is allowed. Is that correct, or do I need to explicitly define the following lines?

User-agent: *
Allow: /

I have observed that the number of blocked URLs in the "Blocked URLs" section is gradually increasing day by day, while the downloaded date of robots.txt, with a 200 (Success) status, has shown as April 28, 2012 ever since that date.

That was the date a robots.txt with a Disallow rule was accidentally moved from a test folder to the main website; we caught it within an hour and replaced it with a blank robots.txt. Although the blank robots.txt is still on the server, the downloaded date has not refreshed.

Is this normal, or is something wrong?

lucy24

9:30 am on May 19, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Try turning it around. Instead of "allow everything" say "disallow nothing".

User-Agent: Googlebot
Disallow:

User-Agent: *
Disallow:
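For what it's worth, the original robots.txt convention treats an empty file and an empty Disallow: line the same way: everything is allowed. A quick sketch using Python's stdlib parser, which may not behave exactly like Googlebot's own parser:

```python
# Compare how an empty robots.txt and an explicit "Disallow:" parse,
# using Python's stdlib robots.txt parser as a stand-in for a crawler.
from urllib.robotparser import RobotFileParser

def allows(lines, agent="Googlebot", path="/some/page"):
    rp = RobotFileParser()
    rp.modified()   # mark the file as fetched so can_fetch() gives answers
    rp.parse(lines)
    return rp.can_fetch(agent, path)

print(allows([]))                                # True: empty file = allow all
print(allows(["User-agent: *", "Disallow:"]))    # True: empty Disallow = allow all
print(allows(["User-agent: *", "Disallow: /"]))  # False: blocked
```

So by the standard both forms are equivalent, but an explicit "Disallow:" leaves no room for a crawler to second-guess a blank file.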

levo

10:08 am on May 19, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do some of your redirections end up in a 404? I can't find any information on whether redirections that end in 404s have side effects.

I used to have non-www-fixing and querystring-dropping redirections before the actual mod_rewrite rules. For example, for article pages with URLs like www.example.com/article/1234, if an article had been deleted:

example.com/article/1200?somequery > www.example.com/article/1200?somequery > www.example.com/article/1200 (404)


I suspect Google and Bing see those 301s as somewhere between a 200 and a 404: a page that doesn't work but isn't gone either. OK, it's normal behavior for 301 redirects, but if you're mass-redirecting sub-folders to sub-domains and some of the chains end in 404s, that might be the problem.
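To make the scenario concrete, here is a minimal sketch of following a redirect chain to its final status. The fetch function is injected so the logic runs without a network; in real use it would issue an HTTP request and return the status and Location header. The URLs below recreate the hypothetical deleted-article chain and are not real:

```python
# Walk a redirect chain and report where it ends (a sketch).

def follow_chain(url, fetch, max_hops=10):
    """Return (final_status, hops) following 301/302 Location headers."""
    hops = [url]
    for _ in range(max_hops):
        status, location = fetch(url)
        if status in (301, 302) and location:
            url = location
            hops.append(url)
        else:
            return status, hops
    return None, hops   # too many hops: likely a redirect loop

# Fake fetcher reproducing the deleted-article chain:
table = {
    "http://example.com/article/1200?q": (301, "http://www.example.com/article/1200?q"),
    "http://www.example.com/article/1200?q": (301, "http://www.example.com/article/1200"),
    "http://www.example.com/article/1200": (404, None),
}
fetch = lambda u: table[u]
print(follow_chain("http://example.com/article/1200?q", fetch))
# (404, [<three URLs>]) -- a 301 chain that ends in Not Found
```

Running this over a sample of old URLs would show what fraction of a mass redirect actually resolves to a 200 versus dead-ending in a 404.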

gzoo

11:09 am on May 19, 2012 (gmt 0)

10+ Year Member



@lucy24 I was thinking robots.txt is not the issue, since Googlebot has just slowed down rather than stopped crawling.


@levo Initially, due to an increase in soft 404s, we applied a 404 code to some pages, but we later removed it.

This was a massive redirection from sub-directories to sub-domains, and as I mentioned earlier, Webmaster Tools has been showing 7,000+ error pages per day for the last 8-10 days. But the sample of error pages under "Top 1,000 pages with errors" has not refreshed since the 10th, so it's difficult to analyze what the current 404 pages are.

What if I disallow the old sub-directories? That would stop the massive redirection, but we would lose the value of the old pages, which was supposed to transfer to the new sub-domain pages via the 301 redirects.

gzoo

12:21 pm on May 19, 2012 (gmt 0)

10+ Year Member



Can anyone tell me how to refresh the robots.txt in Google's cache?

lucy24

1:07 am on May 20, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do some of your redirections end up with 404? I can't find any information if there is a side effect of redirections ending up with 404s.

I should think they absolutely hate it, wouldn't you? Sometimes it's the only sanity-saving approach: say 99% of your old directory corresponds to a page in the new directory. Redirect as a block, and let the 1% fall by the wayside. But it's definitely preferable to serve the 404 in the first place. Or, still better, a 410 if the page is simply gone.

There's a place in WMT where you can check robots.txt. It only shows a download date, not an exact time, but it can't possibly be older than yesterday.* Use it to test randomly selected pages and see whether they can be crawled as intended.


* OK, technically it can. The robots.txt for my art studio's site, which is smaller than mine by orders of magnitude, is datestamped three days ago. But we're talking about normally sized sites.

gzoo

10:17 am on May 20, 2012 (gmt 0)

10+ Year Member



According to Webmaster Tools, Google last downloaded our robots.txt on April 28, 2012, even though Google usually refreshes it in its cache every 24 hours. According to the server logs, Google is accessing our robots.txt daily, and it allows Googlebot, but I can't understand why it is not being updated in Google's cache.

Can anyone tell me how to refresh robots.txt in Google's cache? Or is there any way to report this problem to Google?

My Webmaster Tools is showing the following information:

domain.com/robots.txt - 61,740 - Apr 28, 2012 - 200(Success) - Googlebot is blocked from domain.com

lucy24

12:30 pm on May 20, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Ouch, ouch!

On that same page in GWT, have you tried testing some random pages? Confirm whether Google really thinks they're blocked.

If googlebot meets a completely blank page under the name "robots.txt", it may wig out and think there is something wrong. Try changing it to say explicitly:

User-Agent: *
Disallow:

If it makes a difference in gwt, it should make a difference in real crawling.

gzoo

5:14 am on May 22, 2012 (gmt 0)

10+ Year Member



Last Saturday I made that same change to my robots.txt, and the next day Google refreshed the download date; even today it shows as downloaded 22 hours ago.

It's true that Google thought something was wrong with the blank robots.txt, so it didn't re-download it, and our website was effectively blocked for Googlebot.

Google has not sped up the crawl rate of our website so far; it seems it will take some time for crawling to pick back up.

Thanks lucy24 for your continued contributions :)

Sgt_Kickaxe

2:37 pm on Sep 7, 2012 (gmt 0)



There is a new thread discussing *possibly* the same issue - [webmasterworld.com...]

Crawl rates have plunged off a cliff for a few of us at the same time.