The project led to some fascinating data:
- 5-6% of all robots.txt files had fundamental errors.
- 2% had serious errors.
- 1.5% had obvious syntax errors that were causing their sites to be deindexed by search engines (a sketch of that kind of mistake follows below).
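For illustration only (these are hypothetical examples, not files from the survey), the mistakes involved range from malformed records to rules that are syntactically valid but block everything:

# Malformed record: the missing colon and comma-separated paths are not
# part of the robots.txt format, so crawlers may ignore or misread these lines.
User-agent *
Disallow: /cgi-bin/, /tmp/

# Syntactically valid, but it tells every crawler to stay out of the whole
# site. On a live site this is the classic way to get deindexed.
User-agent: *
Disallow: /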
Given the above, we felt it was time to fire up the spiders and have a fresh look. We have a URL input list of about 3.5 million domains we are going to look at this time. The URLs come from the ODP Directory, the Yahoo Directory, and general web spidering data we have amassed over the years.
Look for it in your logs as an agent named:
( Robots.txt Validator [searchengineworld.com...] )
We will let you know how it is going in a few days to a week. It is a lot of data to crunch.
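For anyone curious about the mechanics of the fetch step, here is a minimal sketch in Python. It is an illustration under assumptions, not the validator's actual code: the user-agent string, the input file name, and the timeout are all made up for the example.

import urllib.request

# Hypothetical user-agent string modelled on the name quoted above.
USER_AGENT = "Robots.txt Validator (+http://www.searchengineworld.com/)"

def fetch_robots_txt(domain, timeout=10):
    """Fetch http://<domain>/robots.txt and return its text, or None on failure."""
    url = f"http://{domain}/robots.txt"
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except Exception:
        # A missing file, DNS failure, or timeout all count as "not found" here.
        return None

if __name__ == "__main__":
    # Hypothetical input: one domain per line in domains.txt.
    with open("domains.txt") as handle:
        for domain in (line.strip() for line in handle if line.strip()):
            body = fetch_robots_txt(domain)
            print(domain, "downloaded" if body is not None else "missing")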
So far, after 24 hours of light spidering:
150k robots.txt files requested.
29k found and downloaded.
Largest robots.txt seen: 6.7 MB [jbbs.livedoor.jp]
Big stat thus far?
257 robots.txt files reference WebmasterWorld or are direct copies of the webmasterworld/robots.txt [webmasterworld.com] file.
And 154 references to my favorite made-up bot name: the Flaming AttackBot [google.com].
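Counting those references is just a substring scan over the downloaded files. A rough sketch of that step, assuming the fetched files were saved locally one per domain (the directory name and marker strings are assumptions, not the project's actual tooling):

from pathlib import Path

# Hypothetical markers to look for in each saved robots.txt file.
MARKERS = ("webmasterworld", "flaming attackbot")

def count_references(directory="robots_files"):
    """Return how many saved robots.txt files mention each marker (case-insensitive)."""
    counts = {marker: 0 for marker in MARKERS}
    for path in Path(directory).glob("*.txt"):
        text = path.read_text(errors="replace").lower()
        for marker in MARKERS:
            if marker in text:
                counts[marker] += 1
    return counts

if __name__ == "__main__":
    for marker, total in count_references().items():
        print(marker, total)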
largest robots.txt seen: 6.7 MB
6.7 MB? That's almost defeating the purpose, is it not? I mean, trying to tell *every* robot to download 6.7 MB before they even start is going to waste a lot of bandwidth!
Anyway, having thought we don't use a robots.txt file, I find we do; I had just forgotten about it. It's 226 bytes long!
Matt