|Who are these spiders and what do they want?|
| 12:55 am on Jan 8, 2004 (gmt 0)|
I have discovered that the following spiders have accessed the webmail pages on our portal ... Wget, Alexa, MSIECrawler, Googlebot and Nomad. It is obvious who Googlebot is and I've read somewhere that MSIECrawler is just IE downloading to bookmark the page (is this true?)
Would this be the case with the others?
I've managed to track Nomad back to Colorado State University ... but why would they be accessing our webmail pages? And these are the only 5 bots recorded as having accessed that area since last July (2003), when we started recording this activity.
We don't have a robots.txt file.
| 1:46 am on Jan 8, 2004 (gmt 0)|
wget is a site-slurping (mirroring) program. Basically, you point it at a website and it downloads everything it can find a link to.
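As an illustration (the hostname is made up, and the exact flags someone used against your site could of course differ):

```
# Mirror everything reachable by links, following them recursively:
wget --mirror --convert-links --wait=1 http://www.example.com/
```

Anyone with the program and your URL can do this; there's no registration or special access involved.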
| 3:56 am on Jan 8, 2004 (gmt 0)|
Welcome to WebmasterWorld [webmasterworld.com]!
You've provided the answer yourself:
Q. > why would they be accessing our webmail pages?
A. > We don't have a robots.txt file.
If a spider finds a link, it will follow that link, unless the resource (page, image, etc.) that the link leads to is disallowed in robots.txt *and* the spider is a good one that obeys robots.txt.
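For what it's worth, the check a well-behaved spider makes can be sketched with Python's standard urllib.robotparser module. The /webmail/ path and the Googlebot user-agent here are just examples; substitute your own sensitive directories:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt that keeps compliant spiders out of /webmail/
# (the path is illustrative -- use your site's actual sensitive areas).
robots_txt = """\
User-agent: *
Disallow: /webmail/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A compliant crawler asks before fetching each URL:
print(rp.can_fetch("Googlebot", "http://example.com/webmail/inbox"))  # False
print(rp.can_fetch("Googlebot", "http://example.com/index.html"))     # True
```

A bad spider simply never asks, which is why robots.txt alone isn't a security measure.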
| 4:37 am on Jan 8, 2004 (gmt 0)|
OK ... I understand that we are wide open in that respect, but why have only 5 bots hit that section of our site when we constantly get hit by between 40 and 50 different bots every month? Why aren't the others accessing it?
Is there any way to tell if they can access the private information of our webmail subscribers? I'm concerned about security in this case. What sort of information is being gathered?
| 4:55 am on Jan 8, 2004 (gmt 0)|
What sort of information? Anything they can find! The spiders collect information without ascribing any meaning to it. If they find a link, they follow it. They download all of the 'allowed' pages of your site for later analysis. If you are marketing, then it's a good thing if they take all of your pages and list them in search results. If you are providing an online service, it may not be a good thing, since you may not want all of those pages showing in search results.
If one of your users has placed a link to the webmail directory somewhere on the 'net that a spider finds, it will crawl into the webmail directory. One link is all it takes.
As to why some spiders and not others, who can tell? Some spiders are more aggressive because their owners provide them with more bandwidth, processing power, and disk space so that they can dig deeper and retrieve more Web pages.
If your users' accounts are password-protected, and there are no "back-door" entries to bypass the password authorization, then you should be OK as far as their "personal" information being safe.
I'd strongly suggest you put up a robots.txt that disallows robots from your 'sensitive' areas, though. Being in control of the spiders, instead of at their mercy, is a good thing. Disallow the good spiders from sensitive areas of your site by using robots.txt, and block the bad spiders that don't read or obey robots.txt using other means (e.g. ISAPI filters on MS servers, mod_rewrite on Apache). (You define 'sensitive' - it varies from site to site.)
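On Apache, for example, a mod_rewrite rule along these lines would turn away spiders that ignore robots.txt. The user-agent strings and the webmail/ path are only examples; match them against what actually shows up in your logs:

```
# .htaccess -- refuse known bad bots by User-Agent (requires mod_rewrite)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Wget [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MSIECrawler [NC]
RewriteRule ^webmail/ - [F,L]
```

The [F] flag returns 403 Forbidden, so the bot gets nothing from that directory regardless of whether it ever read robots.txt.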
Out of your list of user-agents, the only one I'd allow without a lot more investigation would be GoogleBot.
| 5:25 am on Jan 8, 2004 (gmt 0)|
Thanks Jim ... I'll pass this on to my tech team.
| 4:51 pm on Jan 14, 2004 (gmt 0)|
If you use the Alexa toolbar beware!
Although Alexa provides a great service, their bot is one of the hungriest out there, and the toolbar feeds it every URL you visit through your browser.
If you use Alexa you definitely want to have a robots.txt file.
My only gripe about a robots.txt file is that it highlights your sensitive directories to potential hackers. So rather than list them in a robots.txt file, I prefer to use an .htaccess file to block the robots from them.
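For example, an .htaccess like this in the sensitive directory blocks matching user-agents without the directory ever being named in robots.txt. This is a sketch using Apache's SetEnvIf approach; the user-agent patterns are just examples, and legitimate users are unaffected:

```
# .htaccess in the protected directory -- nothing advertised in robots.txt
SetEnvIfNoCase User-Agent "Wget" bad_bot
SetEnvIfNoCase User-Agent "MSIECrawler" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

Requests whose User-Agent matches get a 403, and since robots.txt never mentions the directory, it isn't flagged to anyone reading that file.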