
Google SEO News and Discussion Forum

    
Google not honouring robots.txt - according to WMT reports!
aakk9999
msg:4198867 - 10:28 pm on Sep 8, 2010 (gmt 0)

As of a few weeks ago I have noticed that Google WMT reports duplicate titles and descriptions for URLs that are blocked by robots.txt. I have verified via the "test robots.txt" feature in WMT that the robots.txt directive is constructed correctly and that these URLs are disallowed.
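[For reference, the check that WMT's "test robots.txt" feature performs can be reproduced offline with Python's stdlib urllib.robotparser. The /search path below is a hypothetical stand-in for the actual disallowed pattern, which isn't shown in the thread.]

```python
from urllib import robotparser

# Stand-in robots.txt rules; "/search" is a hypothetical pattern,
# not the site's actual disallowed path.
rules = [
    "User-agent: *",
    "Disallow: /search",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A date-permuted search URL falls under the Disallow rule,
# while an ordinary product page remains crawlable.
print(rp.can_fetch("Googlebot", "https://example.com/search?date=2010-09-08"))
print(rp.can_fetch("Googlebot", "https://example.com/products"))
```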

However, it seems Google has crawled these pages anyway; otherwise, how could it know the title elements and meta descriptions of these URLs in order to report duplicates in WMT?

These URLs are product searches based on entered dates, so obviously the permutations are endless. I am now concerned that crawling all these URLs may have an impact on the crawl budget and, ultimately, on site ranking.

The URLs that are blocked by robots.txt (but still crawled) are the result of clicking the <Search> button and are algorithmically constructed by on-page JavaScript, which then executes location.href. We never had an issue with these before - they started to appear in WMT about 4 weeks ago and the number is steadily rising.

I am thinking of asking for the JavaScript to be moved to an external file and blocking that file with robots.txt. However, if Google is not honouring robots.txt already, is there any point in doing so? Is there any other solution?
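[As a sketch of that idea, with hypothetical paths standing in for the site's actual URL patterns, the robots.txt might look like:]

```
User-agent: *
# Hypothetical pattern for the date-permuted search result URLs
Disallow: /search
# Hypothetical location of the externalised search JavaScript
Disallow: /js/search.js
```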

 

Brett_Tabke
msg:4199359 - 2:49 pm on Sep 9, 2010 (gmt 0)

Blocking indexing by robots.txt is not the same as 'blocking googlebot'.

Google has long maintained their ability to crawl content in directories that are blocked by robots.txt exclusions.

Your only recourse is to specifically ban googlebot via an htaccess denial line.
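[For Apache, such a denial line might look like the following sketch, using mod_setenvif with Apache 2.2-style access control. Note that this blocks by User-Agent string, which any client can spoof:]

```
# Flag requests whose User-Agent claims to be Googlebot
SetEnvIfNoCase User-Agent "Googlebot" block_bot
# Apache 2.2-style access control: allow everyone except flagged requests
Order Allow,Deny
Allow from all
Deny from env=block_bot
```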

Google is starting to crawl more and more javascript these days.

aakk9999
msg:4199635 - 7:54 pm on Sep 9, 2010 (gmt 0)

Google has long maintained their ability to crawl content in directories that are blocked by robots.txt exclusions.


Hi Brett, thanks for replying. Do you have any reference for the above? I know they are 'able to', but they should not, should they? That is precisely why robots.txt is there: to tell bots "Do not go there, please". What I am wondering after reading your reply: are you actually saying that Google has said they are not honouring robots.txt?

With regard to stopping the access by "brute force": the site is on IIS6, so there is no .htaccess (and no ISAPI either).

Brett_Tabke
msg:4199669 - 9:28 pm on Sep 9, 2010 (gmt 0)

>Hi Brett, thanks for replying. Do you have any reference for the above?

We did, on WebmasterWorld. I looked for a bit and couldn't find it; it was a thread started by Toolman with input from GoogleGuy, circa 2003.

Brett_Tabke
msg:4199670 - 9:32 pm on Sep 9, 2010 (gmt 0)

[webmasterworld.com...]
[google.com...]

bwnbwn
msg:4199671 - 9:32 pm on Sep 9, 2010 (gmt 0)

Try using a noindex meta tag. As long as Googlebot fetches the page, it will see the noindex meta tag, and that will prevent the page from showing up in the web index.
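[As a sketch, the tag goes in the head of each page to be excluded:]

```html
<head>
  <!-- Googlebot may still fetch the page, but will drop it from the index -->
  <meta name="robots" content="noindex">
</head>
```

[Note that for this to work, the page must NOT be disallowed in robots.txt: if Googlebot is barred from fetching the page, it never sees the tag.]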

This may help.

aakk9999
msg:4199728 - 10:51 pm on Sep 9, 2010 (gmt 0)

Brett, thanks for the links. What I gathered from reading that thread is that Google should obey robots.txt (and they say they do), but some posts in that thread show otherwise.

Interestingly, I have other URLs blocked by robots.txt that are not crawled. I also have URLs with the same (disallowed) pattern that do get reported in the "Restricted by robots.txt" section of WMT, and yet a number of URLs with that same disallowed pattern get reported under duplicate titles/descriptions (so they must have been crawled for Google to pick up this info).

So it appears that Google sometimes follows the directive and sometimes does not.

Brett_Tabke
msg:4199777 - 1:38 am on Sep 10, 2010 (gmt 0)

I've never seen Google NOT crawl pages that were blocked by robots.txt. (They always crawl them.)

I've never seen Google INDEX and display pages that were properly blocked by the robots exclusion standard. (They never list them in the SERPs.)

Just because it shows in WMT doesn't mean it is going to be returned in the SERPs.

tedster
msg:4199808 - 3:51 am on Sep 10, 2010 (gmt 0)

Possibly relevant thread for this discussion: Why Google Might "Ignore" a robots.txt Disallow Rule [webmasterworld.com]

aakk9999
msg:4199840 - 5:49 am on Sep 10, 2010 (gmt 0)

Thanks to you both; the linked thread was useful. I do not have a Googlebot-specific entry, only one entry for all robots, which up to a few weeks ago seemed to have worked fine for Googlebot too.

The disallowed-but-crawled pages do not show in the SERPs at all, so no problem there. No drop in traffic, and rankings are as usual. I am more worried about the crawl budget being spent on URLs with permuted dates and the impact it could (or could not) have in the future.

Anyway, I will change robots.txt to add a separate section explicitly for Googlebot to see if this has any impact. If not, I will ask the developers to change the location.href into a doPostBack, which should solve it.
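[One caveat worth noting when adding a Googlebot-specific section: Google obeys only the single most specific matching group, so a Googlebot group must repeat every Disallow rule rather than inherit from the * group. A sketch, with hypothetical paths:]

```
User-agent: *
Disallow: /search

# Googlebot ignores the * group once a group naming it exists,
# so the rules must be repeated here in full.
User-agent: Googlebot
Disallow: /search
```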

spiral
msg:4200477 - 11:59 am on Sep 11, 2010 (gmt 0)

If you are trying to conserve a "crawl budget", I think another option is to play around with the Parameter Handling settings in Google Webmaster Tools (Site configuration > Settings > Parameter handling).

I just had a similar problem on a very large dynamic site. Thousands of errors were showing in GWT, and Google ignored the robots.txt exclusions. I set up meta robots exclusions for some directories and X-Robots-Tag exclusions (in the HTTP response header) for others.
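[The X-Robots-Tag approach puts the noindex directive in the HTTP response header rather than in the page body, which also works for non-HTML files such as PDFs. A response using it looks like this sketch:]

```
HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex
```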

All this helped, but GWT still kept showing errors (dup titles, 404s, 500s, etc).

The last thing I changed was to block directories via the Parameter Handling. It took a while, but GWT is finally showing very few errors.

None of this is really helping the site, which took a huge hit from Mayday, but that's another story.

spiral
msg:4200478 - 12:06 pm on Sep 11, 2010 (gmt 0)

I also wanted to elaborate on Brett's comment above: I have always seen Google crawl a page that has a link pointing at it, regardless of any robots.txt disallow. Someone did that to my XML sitemap files and they all got indexed (they actually showed up in the SERPs).

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved