Help me understand spiders and their behavior

From a purely academic standpoint... I'm just curious :)

ThatAdamGuy

10:32 pm on Sep 16, 2003 (gmt 0)

I've noticed that sometimes my site gets hit by several spiders from the same search engine simultaneously... sometimes even slurping the same page!

Other times, a search engine will request my robots.txt file eight times in the same day.

What's going on? Do they really think I will have changed my robots.txt file several times daily? ;) And why don't these spiders talk to each other? Wouldn't it save the search engines some bandwidth and computation time to have just one spider look at a page at a time, or even just once per day?

Mind you, I'm not complaining about the spiders. They're not eating up significant bandwidth or hampering my site operation in any way, and I'm happy to be indexed in the search engines!

But purely from a curiosity standpoint, I'm just wondering why they seem to behave so inefficiently.

Your thoughts?

jeremy goodrich

6:37 pm on Sep 17, 2003 (gmt 0)

For some engines, there will be clusters of computers that each maintain their own cache files. So even though *that engine* has a copy of your robots.txt file, the other servers in the cluster, or other clusters, might not have a copy.

Therefore, the other servers or clusters will need to come round and grab their own copy. Even though fetching your robots.txt file multiple times daily takes a bit more "effort", it is far preferable to having their spiders disregard your wishes entirely.
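To make the per-cluster cache idea concrete, here is a rough sketch in Python. It is only an illustration under assumptions: the `CrawlerNode` class, the three-hour cache lifetime, the `ExampleBot` user agent, and `fake_fetch` (a stand-in for a real HTTP request) are all made up for the example and do not describe how any particular engine works. Only `urllib.robotparser` is a real standard-library module.

```python
import time
from urllib.robotparser import RobotFileParser


class CrawlerNode:
    """One hypothetical crawler machine with its own private robots.txt cache."""

    def __init__(self, name, fetch, ttl_seconds=3 * 60 * 60):
        self.name = name
        self.fetch = fetch        # callable: host -> robots.txt text
        self.ttl = ttl_seconds    # assumed cache lifetime, not a documented value
        self.cache = {}           # host -> (fetched_at, parsed rules)

    def rules_for(self, host, now=None):
        now = time.time() if now is None else now
        entry = self.cache.get(host)
        # Fetch only when this node has no copy, or its copy is too old.
        if entry is None or now - entry[0] > self.ttl:
            parser = RobotFileParser()
            parser.parse(self.fetch(host).splitlines())
            entry = (now, parser)
            self.cache[host] = entry
        return entry[1]

    def may_crawl(self, host, path, agent="ExampleBot", now=None):
        return self.rules_for(host, now).can_fetch(agent, path)


robots_requests = []  # every robots.txt request the "site" receives


def fake_fetch(host):
    # Stand-in for an HTTP GET of the site's robots.txt file.
    robots_requests.append(host)
    return "User-agent: *\nDisallow: /private/\n"


# Four nodes that do not share their caches with each other.
nodes = [CrawlerNode(f"node-{i}", fake_fetch) for i in range(4)]
for node in nodes:
    node.may_crawl("example.com", "/index.html", now=0)
    node.may_crawl("example.com", "/about.html", now=60)

print(len(robots_requests))  # one robots.txt request per node
```

Each node reuses its own copy for the second page, but no node can see another node's copy, so the site still receives one robots.txt request per node. Add cache expiry on top of that and several requests per day from the same engine stop looking strange.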

In an imperfect world where search engine technology is still changing at a fast clip, I am glad that most major engines follow robots.txt "guidelines" and respect the wishes of those who create the stuff that their systems are built on: the content of the web.

ThatAdamGuy

9:08 pm on Sep 17, 2003 (gmt 0)

I agree that it's good (and proper) for all the search engines to respect the robots.txt file. I'm just a bit baffled as to why at least a "Directory" isn't mirrored among servers within a single search engine company, enabling them to hit a page just once per server and then mirror the data nightly or weekly or whatever :)
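For what it's worth, the shared "Directory" idea could look something like the sketch below. Everything in it is hypothetical: the `SharedRobotsDirectory` class, the 24-hour refresh interval (standing in for "mirror nightly"), and `fake_fetch`, which replaces a real HTTP request. It is not a description of any engine's actual design.

```python
class SharedRobotsDirectory:
    """Hypothetical store of robots.txt text shared by every crawler machine."""

    def __init__(self, fetch, ttl_seconds=24 * 60 * 60):
        self.fetch = fetch        # callable: host -> robots.txt text
        self.ttl = ttl_seconds    # assumed "mirror nightly" refresh interval
        self.entries = {}         # host -> (fetched_at, text)
        self.fetch_count = 0      # how often the site itself was contacted

    def get(self, host, now):
        entry = self.entries.get(host)
        # Contact the site only if nobody has a fresh copy yet.
        if entry is None or now - entry[0] >= self.ttl:
            self.fetch_count += 1
            entry = (now, self.fetch(host))
            self.entries[host] = entry
        return entry[1]


def fake_fetch(host):
    # Stand-in for an HTTP GET of the site's robots.txt file.
    return "User-agent: *\nDisallow: /private/\n"


directory = SharedRobotsDirectory(fake_fetch)

# Eight lookups spread over one day, as if made by different machines.
for hour in (0, 1, 3, 6, 9, 12, 18, 23):
    directory.get("example.com", now=hour * 3600)
fetches_first_day = directory.fetch_count

# A lookup the next day finds the copy expired and refreshes it.
directory.get("example.com", now=25 * 3600)
fetches_total = directory.fetch_count
```

With this setup the eight lookups cost the site a single request instead of eight. The catch is the one already raised in this thread: if I change my robots.txt, every machine keeps obeying the old copy until the shared entry expires, and the engine also has to keep that shared store in sync across all its servers.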