A few weeks ago I noticed that Google WMT reports duplicate titles and descriptions for URLs that are blocked by robots.txt. I have verified via the "Test robots.txt" feature in WMT that the robots.txt directive is constructed correctly and that these URLs are disallowed.
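For illustration, the directive is along these lines (the path here is hypothetical; ours follows the same pattern for the search URLs):

```
User-agent: *
Disallow: /search
```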
However, it seems Google has crawled these pages anyway; otherwise, how could it know the contents of the title element and description meta tag of these URLs in order to report them as duplicates in WMT?
These URLs are product searches based on user-entered dates, so the permutations are effectively endless. I am now concerned that crawling all these URLs may affect our crawl budget and, ultimately, the site's ranking.
The URLs that are blocked by robots.txt (but still crawled) are the result of clicking the <Search> button; they are constructed algorithmically by on-page JavaScript, which then navigates via location.href. We never had an issue with these before: they started appearing in WMT about 4 weeks ago, and their number is steadily rising.
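Roughly, the on-page script does something like this (function, path, and parameter names here are illustrative, not our actual code):

```javascript
// Build a date-based search URL; every distinct pair of dates yields a
// distinct URL, which is why the permutations are effectively endless.
function buildSearchUrl(fromDate, toDate) {
  return '/search?from=' + encodeURIComponent(fromDate) +
         '&to=' + encodeURIComponent(toDate);
}

// In the page, the <Search> button handler then navigates directly:
//
//   searchButton.addEventListener('click', function () {
//     location.href = buildSearchUrl(fromInput.value, toInput.value);
//   });
```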
I am thinking of asking for the JavaScript to be moved to an external file and blocking that file with robots.txt. However, if Google is not honouring robots.txt already, is there any point in doing so? Or is there another solution?