Forum Moderators: goodroi


The Great SearchEngineWorld robots.txt Survey #2 is Under Way

         

Brett_Tabke

9:27 pm on Apr 4, 2005 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



About 4 years ago, we downloaded every robots.txt [searchengineworld.com] file from every domain listed in the Open Directory Project.

The project led to some fascinating data:

- 5-6% of all robots.txt files had fundamental errors.
- 2% had serious errors.
- 1.5% had obvious syntax errors that were causing the site to be deindexed by search engines.

Given the above, we felt it was time to fire up the spiders and have a fresh look. We have an input list of about 3.5 million domains to examine this time. The URLs come from the ODP directory, the Yahoo directory, and general web spidering data we have amassed over the years.
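The survey's actual rule set wasn't posted here, but a minimal Python sketch illustrates the kinds of checks that catch the "fundamental errors" mentioned above. The field list and function are illustrative assumptions, not the survey's code:

```python
# Sketch of a robots.txt syntax checker (assumed checks, not the
# SearchEngineWorld validator itself).
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def check_robots_txt(text):
    """Return a list of (line_number, message) for suspect lines."""
    problems = []
    seen_user_agent = False
    for i, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append((i, "missing ':' separator"))
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field == "user-agent":
            seen_user_agent = True
        elif field not in KNOWN_FIELDS:
            problems.append((i, f"unknown field '{field}'"))
        elif field in ("disallow", "allow") and not seen_user_agent:
            problems.append((i, "rule appears before any User-agent line"))
    return problems
```

A misspelled directive or a rule outside a User-agent record is exactly the kind of error that can silently change what a crawler indexes.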

Look for it in your logs as an agent named:
( Robots.txt Validator [searchengineworld.com...] )

We will let you know how it is going in a few days to a week. It is a lot of data to crunch.

oddsod

11:49 am on Apr 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I can't wait :)

I'll tidy up my robots.txt files before you visit. There's a lot of junk in there that needs to be sorted.

pmkpmk

12:10 pm on Apr 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just looked at WebmasterWorld's own robots.txt. What happens if I request the QuickSand directory? Sounds like bot bait to me.

Brett_Tabke

9:14 pm on Apr 5, 2005 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



So far, after 24 hours of light spidering:

150k robots.txt files requested so far.
29k found and downloaded.
Largest robots.txt seen: 6.7 MB [jbbs.livedoor.jp]

Big stat thus far?

257 robots.txt files reference WebmasterWorld or are direct copies of the webmasterworld.com robots.txt [webmasterworld.com] file.
And 154 references to my favorite made-up bot name: the Flaming AttackBot [google.com].

Reid

9:57 am on Apr 15, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Looks like that 6.7 MB file could be a lot smaller just by saying

DISALLOW: /anime/
DISALLOW: /bbs/

Done.
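One caveat: under the robots.txt convention, Disallow rules only take effect inside a record that starts with a User-agent line. A quick check with Python's standard-library parser, using Reid's two rules completed with a wildcard User-agent (the bot name and URLs below are illustrative):

```python
from urllib.robotparser import RobotFileParser

# Reid's proposed rules, plus the User-agent line a record requires.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /anime/",
    "Disallow: /bbs/",
])

# Any bot falls under the "*" record: /anime/ and /bbs/ are blocked,
# everything else is allowed.
blocked = rp.can_fetch("AnyBot", "http://jbbs.livedoor.jp/anime/thread1.html")
allowed = rp.can_fetch("AnyBot", "http://jbbs.livedoor.jp/index.html")
```

Two path-prefix rules cover entire directory trees, which is why a multi-megabyte file enumerating individual URLs is usually unnecessary.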

Matt Probert

11:40 am on Apr 15, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



So far, after 24 hours of light spidering:
150k robots.txt files requested so far.
29k found and downloaded.
Largest robots.txt seen: 6.7 MB

6.7 MB? That's almost defeating the purpose, is it not? I mean, trying to tell *every* robot to download 6.7 MB before they even start is going to waste a lot of bandwidth!
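The usual defense against oversized files is for the crawler to cap how much of a robots.txt it reads and ignore the rest; a sketch of a capped fetch (the constant and function name are illustrative, not any particular crawler's behavior):

```python
import urllib.request

# Illustrative cap: many crawlers stop reading robots.txt after a few
# hundred KiB rather than download a multi-megabyte file in full.
MAX_ROBOTS_BYTES = 500 * 1024

def fetch_robots_capped(url, limit=MAX_ROBOTS_BYTES):
    """Read at most `limit` bytes of a robots.txt, discarding the rest."""
    with urllib.request.urlopen(url) as resp:
        return resp.read(limit)
```

With a cap like this, a 6.7 MB file costs the crawler no more bandwidth than the limit itself.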

Anyway, having thought we don't use a robots.txt file, I find that we do; I had forgotten about it. It's 226 bytes long!

Matt

oddsod

11:47 am on Apr 15, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Brett, what's the latest?

pmkpmk

9:54 am on Apr 18, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Your bot still hasn't paid our sites a visit. Are you sampling ALL of DMOZ?

Brett_Tabke

2:35 pm on Apr 29, 2005 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Whew - it is DONE!

2,493,402 domains spidered; over 200k robots.txt files downloaded. Stats are being accumulated.