WMT reports allowed files "restricted by Robots.txt"

   
6:57 pm on Oct 18, 2010 (gmt 0)

5+ Year Member



Why is it that Google ignores changes to my robots.txt?

The file gets spidered successfully every single day, yet WMT still lists certain files as "restricted by robots.txt" even though they haven't appeared in robots.txt for many months. They were once disallowed, but those directives were removed long ago, and there is no pattern-matching rule left that could possibly block them. I have also run my entire robots.txt through WMT's testing tool, and that simulation never reports these files as denied.
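For reference, here's one way to cross-check those URLs against the live robots.txt with a parser that has nothing to do with WMT. It's only a rough sketch using Python 3's standard-library urllib.robotparser - its matching isn't guaranteed to be identical to Googlebot's, and the domain and file paths below are placeholders for whatever URLs WMT flags:

# Cross-check the live robots.txt against the URLs WMT reports as restricted.
# Placeholders: swap in your own domain and the flagged paths.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("http://www.example.com/robots.txt")
parser.read()  # fetches and parses the current live file

for url in ("http://www.example.com/page-a.html",
            "http://www.example.com/old-dir/page-b.html"):
    print(url, "->", "allowed" if parser.can_fetch("Googlebot", url) else "blocked")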

Anyone else have this problem? Is there a remedy here?
8:47 pm on Oct 18, 2010 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I've just been looking at a similar issue where the robots.txt file allows everything to be crawled, and yet WMT says 14 files are disallowed. In that case, those files are crawled and indexed anyway. WMT does serve up buggy data occasionally - or even frequently. What do we expect for "free" - dependability? [hint - yes, I do]
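If you want hard evidence on a site like that, the raw access logs settle it: when Googlebot keeps fetching the supposedly restricted URLs, the WMT report is simply stale or wrong. A rough sketch, assuming a combined-format log at a typical Apache location and hypothetical paths (adjust both to your own setup):

# Scan the access log for Googlebot hits on the URLs WMT claims are blocked.
# The log path and the entries in flagged_paths are assumptions for illustration.
flagged_paths = ("/page-a.html", "/old-dir/page-b.html")

with open("/var/log/apache2/access.log") as log:
    for line in log:
        if "Googlebot" in line and any(p in line for p in flagged_paths):
            print(line.rstrip())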
10:37 pm on Oct 18, 2010 (gmt 0)

5+ Year Member



Thanks for your feedback, Ted. As a former developer/manager of institutional trading systems, I always demanded perfection of myself and my staff. In my naive little world, I expect the same of others, particularly of a mega-billion-dollar company. Things like this irk me, though part of the problem is admittedly my compulsive personality and expectations. The bigger Google has gotten, the more problems we've seen - at least, IMO.
11:08 pm on Oct 18, 2010 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



It's provable that the GWMT robots.txt tester does not use the same code as the real Googlebot.

On several sites, I use multiple user-agent declarations per policy record - a construct that has been explicitly defined by the "Standard" since its adoption. However, GWMT's robots.txt tester has reported - ever since that tester was introduced - that those sites cannot be crawled at all.
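For anyone who hasn't seen the construct, a single record simply lists several User-agent lines that all share the same rules. The sketch below runs such a record through Python 3's urllib.robotparser - not Googlebot's code, of course, just an independent parser showing the record is valid under the original standard; the agent names and paths are arbitrary examples:

# One policy record, two user-agents, one shared Disallow rule.
from urllib.robotparser import RobotFileParser

record = [
    "User-agent: Googlebot",
    "User-agent: Slurp",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(record)

print(parser.can_fetch("Googlebot", "http://www.example.com/private/a.html"))  # False
print(parser.can_fetch("Slurp", "http://www.example.com/private/a.html"))      # False
print(parser.can_fetch("Googlebot", "http://www.example.com/open/a.html"))     # True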

It's also possible that the "Crawl Errors" you're seeing are generated by yet a third version of the code - one that matches neither the real Googlebot nor the robots.txt tester.

But regardless, there are plenty of bugs in the GWT, and it sounds like this is just another.

Jim
11:32 pm on Oct 18, 2010 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



I always demanded perfection of myself and my staff. In my naive little world, I expect the same of others, particularly of a mega-billion-dollar company.

It's the challenge that comes with scale. Petabytes of data in constant churn just don't allow for perfection - and that's something few of us have ever grappled with. Still, I'm sure Google can do better than what we currently see in Webmaster Tools.