Forum Moderators: Robert Charlton & goodroi


Google crawling permutation of search box results


aakk9999

4:46 am on Apr 3, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



As of 3 weeks ago I have noticed that Google permutes all the parameters in the product search box and attempts to crawl the search results pages generated this way. The search box has a calendar for entering to/from dates, and Google has been filling in dates and generating lots of "new" URLs.

This was not happening before. I only noticed it because URLs with date parameters are excluded via robots.txt, and GWT now shows lots of restricted URLs. The site search is executed via JavaScript, using location.href after building the URL from the search parameters entered.

We do not want to use rel=canonical because the search results pages are different from the listing pages we allow to be indexed, and I would not know which listing page to set the canonical to. I am now wondering whether crawl budget is being wasted unnecessarily, given MC's comment that URLs disallowed in robots.txt are still part of the crawl budget.

I am not sure what the best course of action is here.
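For anyone wanting to sanity-check a setup like this, the disallow rule can be verified programmatically with Python's standard-library robots.txt parser. This is only a sketch: the /search path and the from/to parameter names are hypothetical examples, not taken from this thread.

```python
from urllib import robotparser

# Parse a minimal robots.txt equivalent to blocking on-site search results.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /search",  # prefix match: blocks /search and /search?from=...&to=...
])

# A date-permuted search URL is blocked for Googlebot...
blocked = rp.can_fetch("Googlebot", "http://example.com/search?from=2010-04-01&to=2010-04-07")
print(blocked)  # False

# ...while a normal listing page remains crawlable.
allowed = rp.can_fetch("Googlebot", "http://example.com/widgets/blue-widget")
print(allowed)  # True
```

Matching is prefix-based, so a single Disallow line covers every parameter permutation Google generates under that path.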

tedster

6:03 am on Apr 3, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Google has been experimenting with that kind of crawling, off and on, since back in 2008. The best advice is usually not to let Google crawl your search results pages at all. They often make terrible landing pages for a searcher, at any rate.

aakk9999

11:17 am on Apr 3, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks for the reply, tedster. I am not letting Google crawl the search results pages - they are blocked via robots.txt.

But as I said, what I noticed is that Google is attempting to crawl them, which creates lots of "blocked by robots" entries in GWT. My question is: would this hurt crawl budget, and if so, what can I do about it?

tedster

4:39 pm on Apr 3, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Aha, now I understand better - you're not talking about something you noticed in your server logs here. In other words, googlebot hasn't actually asked your server for those pages.

Those URLs were simply sitting in googlebot's crawling queue, and the robots.txt disallow rule kicked them out before they were fetched. So you got some entries in your Webmaster Tools - just in case that isn't the effect you intended.

I'd say there's absolutely no problem for you.
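To confirm the point above - that these URLs were queued but never actually fetched - one can scan the server access log for Googlebot requests to the disallowed path. A rough sketch; the combined log format, the sample lines, and the /search path are all assumptions for illustration:

```python
# Tally Googlebot requests per path from combined-format access log lines.
# The sample lines below are fabricated for illustration only.
import re
from collections import Counter

LOG_LINES = [
    '66.249.66.1 - - [03/Apr/2010:04:00:01 +0000] "GET /widgets/ HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [03/Apr/2010:04:00:09 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

request_re = re.compile(r'"GET (\S+) HTTP')

hits = Counter()
for line in LOG_LINES:
    if "Googlebot" not in line:
        continue  # only count requests identifying as Googlebot
    m = request_re.search(line)
    if m:
        hits[m.group(1)] += 1

# If robots.txt is doing its job, no /search requests should appear at all.
search_hits = sum(n for path, n in hits.items() if path.startswith("/search"))
print(search_hits)  # 0 - the disallowed URLs were never actually fetched
```

If that count stays at zero while GWT keeps listing the URLs as restricted, the "blocked by robots" entries are purely informational, exactly as described above.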