| 2:49 pm on Sep 9, 2010 (gmt 0)|
Blocking indexing by robots.txt is not the same as 'blocking googlebot'.
Google has long maintained their ability to crawl content in directories that are blocked by robots.txt exclusions.
Your only recourse is to specifically ban googlebot via an htaccess denial line.
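A denial of that sort might look like the following (a sketch in Apache 2.2 `.htaccess` syntax; note that user-agent strings can be spoofed, so matching on "Googlebot" is illustrative rather than bulletproof):

```apache
# Set an env var when the User-Agent contains "Googlebot" (case-insensitive),
# then deny requests carrying that env var.
BrowserMatchNoCase Googlebot block_this_bot
Order Allow,Deny
Allow from all
Deny from env=block_this_bot
```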
| 7:54 pm on Sep 9, 2010 (gmt 0)|
|Google has long maintained their ability to crawl content in directories that are blocked by robots.txt exclusions. |
Hi Brett, thanks for replying. Do you have any reference for the above? I know they are 'able to', but they should not, should they? That is precisely why robots.txt exists: to tell bots "Do not go there, please". What I am wondering after reading your reply - are you actually saying that Google has said they are not honouring robots.txt?
With regards to stopping the access with "brute force" - the site is on IIS6, so no .htaccess (and no ISAPI either).
| 9:28 pm on Sep 9, 2010 (gmt 0)|
>Hi Brett, thanks on replying, do you have any reference to the above?
They did, here on WebmasterWorld. I looked for a bit but couldn't find it. It was a thread started by Toolman with input from GoogleGuy - circa 2003.
| 9:32 pm on Sep 9, 2010 (gmt 0)|
Try using a noindex meta tag. As long as Googlebot can fetch the page, it will see the noindex meta tag and keep that page out of the web index.
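For example, a tag along these lines in each page's head (the key caveat: the page must not also be disallowed in robots.txt, or Googlebot will never fetch it and never see the tag):

```html
<!-- Keeps this page out of the index, provided the crawler
     is allowed to fetch it and read this tag. -->
<meta name="robots" content="noindex">
```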
This may help.
| 10:51 pm on Sep 9, 2010 (gmt 0)|
Brett, thanks for the links. What I gathered from reading that thread is that Google should obey robots.txt (and they say they do), but some posts in that thread show otherwise.
Interestingly, I have other URLs blocked by robots.txt that are not crawled. I also have URLs matching the same (disallowed) pattern that are reported in the "Restricted by robots.txt" section of WMT, and yet a number of URLs matching that same disallowed pattern show up under duplicate titles/descriptions (so they must have been crawled for Google to pick up this info).
So it appears that Google sometimes follows the directive and sometimes does not.
| 1:38 am on Sep 10, 2010 (gmt 0)|
I've never seen google NOT crawl pages that were blocked by robots.txt. (they always crawl them)
I've never seen google INDEX and display pages that were properly blocked by robots exclusion standards. (they never list them in serps)
Just because it shows on WMT, doesn't mean it is going to get returned in serps.
| 3:51 am on Sep 10, 2010 (gmt 0)|
Possibly relevant thread for this discussion: Why Google Might "Ignore" a robots.txt Disallow Rule [webmasterworld.com]
| 5:49 am on Sep 10, 2010 (gmt 0)|
Thanks to both; the linked thread was useful. I do not have a Googlebot-specific entry - I have only one entry for all robots, which up to a few weeks ago seemed to work fine for Googlebot too.
Disallowed but crawled pages do not show in SERPs at all, so no problem here. No drop in traffic and rankings are as usual. I am more worried about crawl budget being spent on URLs with permuted dates and the impact it could (or not) have in the future.
Anyway, I will change robots.txt to add a separate section explicitly for Googlebot and see if that has any impact. If not, I will ask the developers to change the location.href into a doPostBack, which should solve it.
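A Googlebot-specific section might look like this (the paths here are placeholders, not the site's actual rules). One detail worth remembering: Googlebot obeys only the most specific matching group, so any rules from the `User-agent: *` section that should still apply must be repeated in the Googlebot section:

```
# Hypothetical robots.txt with an explicit Googlebot group.
User-agent: Googlebot
Disallow: /calendar/

User-agent: *
Disallow: /calendar/
```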
| 11:59 am on Sep 11, 2010 (gmt 0)|
If you are trying to conserve a "crawl budget", I think another option is to play around with the Parameter Handling settings in Google Webmaster Tools (Site configuration > Settings > Parameter handling).
I just had a similar problem on a very large dynamic site. Thousands of errors were showing in GWT, and Google ignored the robots.txt exclusions. I set up meta robots exclusions for some directories and X-Robots-Tag (exclusion via the HTTP response header) for others.
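The header-based exclusion is handy for files that cannot carry a meta tag. A sketch in Apache mod_headers syntax, as one example (the `.pdf` pattern is just an illustration, not what this site actually used):

```apache
# Send a noindex directive in the HTTP response header
# for files that have no <head> to put a meta tag in.
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>
```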
All this helped, but GWT still kept showing errors (dup titles, 404s, 500s, etc).
The last thing I changed was to block directories via the Parameter Handling. It took a while, but GWT is finally showing very few errors.
None of this is really helping the site, which took a huge hit from Mayday, but that's another story.
| 12:06 pm on Sep 11, 2010 (gmt 0)|
I also wanted to elaborate on Brett's comment above: I have always seen Google crawl a page that has a link pointing at it, regardless of any robots.txt disallow. Someone did that to my XML sitemap files and they all got indexed (they actually showed up in the SERPs).