Forum Moderators: Robert Charlton & goodroi

"No URLs" parameter but crawl rate climbed rapidly?

jammy8891

9:09 am on Mar 26, 2014 (gmt 0)

10+ Year Member



I changed some URL parameter settings on my site and set them to crawl "No URLs", yet since then my crawl rate has jumped considerably (about 4x higher). Pretty much the exact opposite of what I wanted to happen.

Also, I got a message from Google saying it had crawled an unusually high number of URLs on my site, and some of the examples listed included parameters that I'd specifically blocked.

Has anyone seen anything like this in the past?

aakk9999

10:01 am on Mar 26, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Is this the only thing you changed?

Is there a chance Google didn't know about these URLs before, and that setting up the parameters in WMT caused it to suddenly "discover" many of them?

jammy8891

10:37 am on Mar 26, 2014 (gmt 0)

10+ Year Member



Yep, this is the only thing I changed.

It's possible, but Google had already indexed a large number of parameter-based URLs prior to these changes. Basically, it's the site's internal search function, which is driven by faceted navigation. Google had already indexed a large number of URLs containing /your-search-results? and so on, so surely it already knew about these URLs anyway?

aakk9999

11:44 pm on Mar 26, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I don't know why Google would suddenly crawl all these URLs just because you set up parameter handling in WMT, unless doing so somehow shifted Google's focus on what to crawl.

The message about crawling an unusually high number of URLs usually indicates that some kind of floodgate has been opened. Have you looked at your logs to see what Google is actually crawling?
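To illustrate the kind of log check suggested above, here is a minimal Python sketch. The log lines, paths, and query strings are hypothetical examples; real entries come from your server's access log.

```python
import re
from collections import Counter

# Hypothetical access-log lines in common log format; in practice,
# read these from your server's access log file.
LOG_LINES = [
    '66.249.66.1 - - [26/Mar/2014:09:00:01 +0000] "GET /your-search-results?q=red HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [26/Mar/2014:09:00:02 +0000] "GET /products/widget HTTP/1.1" 200 2048 "-" "Googlebot/2.1"',
    '10.0.0.5 - - [26/Mar/2014:09:00:03 +0000] "GET /your-search-results?q=blue HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

REQUEST_RE = re.compile(r'"GET ([^ ]+) HTTP')

def googlebot_paths(lines):
    """Count paths requested by user agents claiming to be Googlebot."""
    counts = Counter()
    for line in lines:
        if "Googlebot" not in line:
            continue
        m = REQUEST_RE.search(line)
        if m:
            # Count by path only, so parameterised URLs group together.
            counts[m.group(1).split("?")[0]] += 1
    return counts
```

Note that the user-agent string alone can be spoofed; a reverse DNS lookup on the requesting IP is the way to confirm genuine Googlebot traffic.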

I tend to use robots.txt, rather than the parameter settings, to block Google from internal search, and that has worked well for me.
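As a sketch of that approach (using the /your-search-results path mentioned earlier in the thread; substitute your own search URL), a robots.txt rule like this keeps compliant crawlers out of all internal search results:

```
User-agent: *
Disallow: /your-search-results
```

Keep in mind that robots.txt controls crawling, not indexing, so URLs that were already indexed may stay in the index for some time after being blocked.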

not2easy

1:41 am on Mar 27, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I have found that they pay little or no attention to the parameters in GWT. I do test them in there using 'Fetch as Google', but they still decide to go on a rampage crawl now and then.

Like jammy8891, I try to prevent them from crawling search results URLs. Mine are not actual pages; they are generated on the fly and cached for 24-48 hours with a session ID. They did one of these rampage crawls on one site last Thursday and Friday, all over the place, generating new cached files until I set a new crawl delay. They bumped up my disk space usage until I got a server notification, and less than 12 hours later they were bumping against the (low) ceiling I set for that site again with another 600 MB of new cached files.

Those URLs are also blocked in robots.txt, and the 'Fetch' results show this clearly. As in jammy8891's case, it is possible that they crawled and indexed these pages a long time ago (5-6 years), before I caught on to what they were doing. I don't see any way they can visit those results URLs and find anything but a 404. This is a very old niche site that was #1-#2 for 10+ years and then disappeared; it is now bouncing back. I see no clear reason for any part of their behavior.
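One way to make sure legacy search-results URLs never reach the application and regenerate cached files is to answer them at the server level. Here is a minimal Apache .htaccess sketch; the path is illustrative only, not taken from not2easy's actual setup:

```
# Hypothetical rule: respond 410 Gone to old internal-search URLs
# so the application never runs and no cache files are created.
RewriteEngine On
RewriteRule ^your-search-results - [G,L]
```

A 410 Gone is a stronger signal than a 404 that the URLs are permanently dead, which can help crawlers drop them sooner.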