|Is it the robot? No, it's the programmer of the robot.|
Robot at IBM Almaden Research Center has a problem
I was watching my main access log file and noticed the robot from the IBM Almaden Research Center go for the robots.txt file. Good, but then it went for my home page.
Normally, that's perfectly all right. However, I've recently set up some redirects so that all my www.domainname.com pages get redirected to domainname.com without the "www".
The robot got the 301 redirect OK, but didn't go for the real robots.txt file after seeing the redirect. It just went ahead and grabbed the home page.
I'm not blocking the robot from my home page, but that's not the point here. It failed to follow the redirect to go get the robots.txt file and I think that's not right.
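For anyone writing a crawler, here is a rough sketch of what I would expect: keep following the redirect until you actually have the robots.txt file. This is only an illustration, not IBM's code. The `fetch` callable, the `fake_fetch` stand-in and the example.com hostnames are all assumptions made up for the example.

```python
from urllib.parse import urljoin


def fetch_robots_txt(start_url, fetch, max_redirects=5):
    """Return (final_url, body) for robots.txt, following 3xx redirects.

    `fetch` is a hypothetical callable: fetch(url) -> (status, headers, body).
    """
    url = start_url
    for _ in range(max_redirects + 1):
        status, headers, body = fetch(url)
        if status in (301, 302, 303, 307, 308):
            # Location may be relative, so resolve it against the current URL.
            url = urljoin(url, headers["Location"])
            continue
        if status == 200:
            return url, body
        return url, None  # e.g. 404: no robots.txt at the final location
    raise RuntimeError("too many redirects while fetching robots.txt")


# Stand-in for a real HTTP client, using placeholder hostnames.
def fake_fetch(url):
    if url == "http://www.example.com/robots.txt":
        return 301, {"Location": "http://example.com/robots.txt"}, ""
    if url == "http://example.com/robots.txt":
        return 200, {}, "User-agent: *\nDisallow: /private/\n"
    return 404, {}, ""


final_url, rules = fetch_robots_txt("http://www.example.com/robots.txt", fake_fetch)
print(final_url)
```

The point is simply that the rules come from wherever the redirect ends up, and they get read before any page is requested.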
The robot came from 188.8.131.52 and the referrer says "http://www.almaden.ibm.com/cs/crawler [fc10]" and a whois confirms the address.
I'm thinking that this robot will just start grabbing whatever it wants now. Perhaps I should just block it until they fix the problem.
What's the purpose of having IBM crawl my Commodore oriented site, anyway? They wouldn't be interested in my stuff.
Or are they? :)
You're right about programmers choosing to override (let alone fetch) robots.txt.
Coincidentally I just ranted a bit about this kind of bad robot/programmer behavior over in "Search Engine Spider Identification [webmasterworld.com]" -- where new bot sightings are as thrilling to us as are rare bird sightings to avid birdwatchers:)
(The thread's title is about a variation of Nutch called "Comrite/0.7.1 [webmasterworld.com]" -- and I vented a bit about "Rude Nutch Users" generally.)
Anyway, some bots say they'll grab pages but not make them publicly available if they encounter a robots.txt telling them not to. Others intentionally mine the directories and files in robots.txt. (boo-hiss)
So if you want to ask about the been-around-forever IBM crawler, about what it did robots.txt- and redirect-wise, about what they're doing spidering you, you'll find the e-mail address at the bottom of the crawler's info URL:
If you're inclined to inquire, I'll be interested in hearing what they tell you! I've long wondered what IBM does with the data it mines.
OK, I sent IBM an email about this.
[edited by: Receptional at 3:11 pm (utc) on April 4, 2006]
|(The thread's title is about a variation of Nutch called "Comrite/0.7.1" -- and I vented a bit about "Rude Nutch Users" generally.) |
Speaking of Nutch, I had to block it from part of my website because it doesn't know how to deal with some of the files on my site and was just filling up my error log file with a bunch of 400 errors. This is Nutch running from the University of Washington. It's still hitting my site and also trying to access the part that is blocked in the robots.txt file. So, I've also added a block in my .htaccess file in the directory where I don't want it to go. So, that puts the 403 errors in my error log file.
What it's doing wrong to begin with is that it's not handling the filenames properly in the directory I'm blocking it from.
Back when the GEnie Online service was still going, the Commodore area was very popular and it had a huge file library. Commodore users uploaded several thousand files to this library. Being the normal Commodore user, many of the uploaders would include space characters in the filenames. Anyway, this entire library is on my site now and the space characters were still in the filenames.
Nutch would request a file without inserting "%20" where the spaces are. So, the error would be "erroneous characters after the protocol string" because the filename was broken up and Apache saw extra stuff in there that didn't belong.
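To show what the request should have looked like, here is a minimal Python sketch of percent-encoding a path before requesting it. The filename is made up for illustration, and Nutch itself is written in Java, so this is only the idea, not its actual code.

```python
from urllib.parse import quote


def encode_path(path):
    # quote() leaves "/" alone by default and turns each space into %20.
    return quote(path)


print(encode_path("/library/my game v2.sda"))  # prints /library/my%20game%20v2.sda
```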
I've finally decided to replace all the spaces with dashes because many users were having trouble downloading files due to their browsers not handling it correctly either. Oh well.
Meanwhile, Nutch is still trying to grab files with space characters instead of refreshing the HTML pages where the files are indexed from. And so it continues to fill the error logs.
I'd hate to block U of W. I'll probably report the problem to the Apache group to see if they can fix it in the programming of Nutch, or whoever it is that bug reports go to.
This Seattle native and U.W. alum is inclined to urge you to touch base locally first:) I show the U.W.'s Nutch coming from multiple hosts but the UA string and @ddress are always the same:
NutchCVS/0.7.1 (Nutch running at UW; [nutch.org...] ; email@example.com)
And again, if you're inclined to inquire, I'll be interested in hearing what they tell you, too! (And thank you.)
Two of the servers at U.W. that I've described have come from zork.cs.washington.edu and pacman.cs.washington.edu.
Gee, those two server names sound like they could be interested in vintage computer stuff, don't you think?
I'm not placing blame on U.W. nor on the programmers of Nutch, but I think the problems I've seen with Nutch crawling my system should be looked into.
I'll see about making contact with someone and mention it. Of course, maybe they'll run across my message here anyway.
I'd say that a crawler should be programmed to keep track of how many failed links are on a page and if a certain number fail that it should toss the page and refetch it. If it hits a certain number of failures again, then it should ignore that page.
If Nutch would refetch a few pages, then it would have a successful crawl. In my case, those pages were changed in order to eliminate the space characters in the filenames of the files that are indexed on the page.
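Something like the following is what I have in mind. It is a rough sketch only: the class name, the threshold of 3 and the three action names are my own assumptions, not anything taken from Nutch.

```python
from collections import defaultdict


class PageFailureTracker:
    """Count failed links per index page and decide when to refetch or give up."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = defaultdict(int)  # page URL -> failed links since last fetch
        self.refetched = set()            # pages already refetched once
        self.ignored = set()              # pages the crawler has given up on

    def record_failure(self, page_url):
        """Return 'continue', 'refetch', or 'ignore' for the page."""
        if page_url in self.ignored:
            return "ignore"
        self.failures[page_url] += 1
        if self.failures[page_url] < self.threshold:
            return "continue"
        # Threshold reached: reset the counter and escalate.
        self.failures[page_url] = 0
        if page_url in self.refetched:
            self.ignored.add(page_url)
            return "ignore"
        self.refetched.add(page_url)
        return "refetch"
```

The first time a page's links fail too often, the crawler tosses its stale copy and refetches. If the fresh copy fails just as badly, the page is dropped from the crawl.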
I sent IBM an email a week ago and have not received any response. Also, it's obvious that they are either not reading their emails or they are ignoring requests to fix their problems. The same behaviour I described at the start of this thread is still going on just about every day.
I guess it's time to block the big blue IBM boys.
I noticed another robot behaving much like IBM's. This one is coming from looksmart.net.
184.108.40.206 - - [10/Apr/2006:07:16:21 -0400] "GET /robots.txt HTTP/1.1" 301 317 "-" "Mozilla/4.0 compatible ZyBorg/1.0 (firstname.lastname@example.org; [WISEnutbot.com)"...]
You can see they went for the robots.txt file but got a redirect because they went for www.mydomain.com. The redirect goes to mydomain.com without the www. They never followed the redirect to get the robots.txt file.
This is not the only instance of this; they've been doing it for quite some time.
AND... they are entering directories that are disallowed in the robots.txt file.
So, I had to manually block them from those directories using .htaccess. If looksmart.net continues this behaviour, they will get blocked from the entire site.
My site caters to a niche market and blocking looksmart.net won't affect it. So, if they don't fix the problem, they're done.
I sent off an email to looksmart. Hopefully, they will do better than IBM and fix their problem.
Next on my list is gigablast.com.
They hit my site every day with 4 different servers. Of the 4, only one is behaving, but that is probably because it is only accessing my site using the domain name without the www. It gets the robots.txt file just fine. The other three try to get the robots.txt file, get a redirect, and don't follow through on it.
And the result is they go into a directory that is disallowed. And just like with looksmart and their ZyBorg robot, I have to block gigablast.com and their Gigabot robot with a redirect from this directory using .htaccess. They end up grabbing numerous copies of the same file.
Here are gigablast.com's 4 servers that hit my site.
220.127.116.11 gets the robots.txt file OK and behaves.
18.104.22.168 doesn't follow the robots.txt redirect
22.214.171.124 doesn't follow the robots.txt redirect
And the really bad one:
126.96.36.199 NEVER attempts to get the robots.txt file at all!
That 4th one is good enough reason to block gigablast entirely. The server doesn't go for the robots.txt file and it obviously doesn't get the info from the other servers either. It just happily goes into disallowed directories and gets whatever it wants.
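If you want to check your own logs for the same thing, here is a small Python sketch that sorts clients by what they did with robots.txt. It assumes the usual Apache combined log format. The sample lines are invented for the example (192.0.2.x is a documentation-only address range), and "ExampleBot" is a made-up agent.

```python
import re
from collections import defaultdict

# Client IP, request method, path and status from a combined-format log line.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" (\d{3}) ')


def robots_behaviour(lines):
    """Map each client IP to the set of status codes it got for /robots.txt."""
    seen = defaultdict(set)
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue
        ip, _method, path, status = m.groups()
        statuses = seen[ip]  # every client gets an entry, even an empty one
        if path == "/robots.txt":
            statuses.add(int(status))
    return dict(seen)


# Invented sample lines, not taken from a real log.
sample = [
    '192.0.2.1 - - [10/Apr/2006:07:16:21 -0400] "GET /robots.txt HTTP/1.1" 200 120 "-" "ExampleBot/1.0"',
    '192.0.2.2 - - [10/Apr/2006:07:16:25 -0400] "GET /robots.txt HTTP/1.1" 301 317 "-" "ExampleBot/1.0"',
    '192.0.2.2 - - [10/Apr/2006:07:16:26 -0400] "GET /private/file.zip HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '192.0.2.3 - - [10/Apr/2006:07:17:02 -0400] "GET /private/file.zip HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
]

report = robots_behaviour(sample)
for ip, statuses in sorted(report.items()):
    print(ip, sorted(statuses))
```

An IP whose set contains 200 got the file. One with only a 301 saw the redirect and never followed it. An empty set means it never asked at all.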
What's the matter with these search engine companies? Are they all using the same programmer?
"Are they all using the same programmer?"
I'm not sure if that's a rhetorical question but in case it isn't, erm, nope. Not the same programmer. Not the same programs. Not the same algorithms. Oh sure, there's overlap, but just as you can take five robot hunters here and find out we all do different things at different times for different reasons, the same can be said of all the different companies running all the different robots, spiders, crawlers, and more.
Robots. Spiders. Crawlers... Oh, my!
You mentioned the relentless majors I've blocked by agent and/or host/IP for years. Years. I, too, am in a niche market so the money in my pockets doesn't depend on someone else spending my bandwidth. That said, many a bot hunter here evaluates the worth of a bot's marauding with its obvious return. For example, Google. Google is voracious. The black hole of bandwidth. But I get hundreds of hits from real people using G every day, so it's worth it.
Then there's Gigablast. Let's see. Their site offers my site -- what? I cannot recall even one visitor from that site. So for me, clearly their sucking outweighs others' searching. (For more details about gigabot's pros and cons, ditto new bots, and new-old bots, check this site's archives, and specifically the "Search Engine Spider Identification [webmasterworld.com]" and "robots.txt [webmasterworld.com]" areas.)
Then there are those search engines offering 'members' the ability to privately 'save' your page(s) on their servers and they'll re-serve your pages to their members. Not links, but entire pages. (Hello? Copyright infringement anyone? But that's a whole 'nuther forum [webmasterworld.com].) Ditto those search engines compiling your indexed content and selling it in aggregate to whomever, for use however.
Then there's IBM. And even my alma mater, the U.W., doing -- research? And then there are major SEs and data miners and 'shapers' (Google, Ask, Alexa; Cyveillance; etc.) cloaking themselves behind bare IPs. And the regular browser, toolbar and/or desktop search and/or check-for-page-updates (every hour!) 'features' AND the too-many anonymous 'people' hailing from server farms using the likes of NutchCVS and yacy and Jakarta Commons and JobSpider_BA and StackRambler and Snoopy and Vespa Crawler and EmailSiphon and BuildCMS crawler and NuSearch Spider and genieBot and ISC Systems iRc Search and Yahoo! Mindset and Octora Beta and ODP entries test and MJ12bot and Jyxobot and BOI_crawl_005 and lwp-request --
-- and that's not even half of the bots I've seen. In the last five days.
What can you do about Them?
Okay, finally:), to help you solve your problems with unrelenting and/or rogue sucker-uppers, I'd skip e-mailing and spend time studying the following marathon thread(s). Start at the end of Part 3 and work your way backwards as need be.
A Close to perfect .htaccess ban list: Part 1 [webmasterworld.com]
A Close to perfect .htaccess ban list: Part 2 [webmasterworld.com]
A Close to perfect .htaccess ban list: Part 3 [webmasterworld.com]
The entries are not all current -- Part 3 ended almost exactly two years ago(!) -- or even all code-correct, but they'll give you an idea of what you're up against, and which bots you're seeing now that others have been repelling for years, and why eyeballing logs can become its own sucker-upper -- of free time:)
|"Are they all using the same programmer?" |
I'm not sure if that's a rhetorical question but in case it isn't, erm, nope.
My whole point in this thread is that these search-engine wannabes don't know how to code properly. It's obvious from what I've discovered.
I'm sure this has been going on for a long time with my web site, but only recently have I spent any great deal of time studying the logs.
And I am amazed at how sloppy these search engines are, or should I say, the search engine programmers.
Being a programmer myself, I tend to get critical when it comes to poor programming. Sorry.
|And the really bad one: |
188.8.131.52 NEVER attempts to get the robots.txt file at all!
I've noticed that many bots with the same signature use multiple IP addresses - what it looked like to me was that one IP just read robots.txt, and different IP's were used to read the site itself. That would be consistent with the behavior you documented.
|I've noticed that many bots with the same signature use multiple IP addresses - what it looked like to me was that one IP just read robots.txt, and different IP's were used to read the site itself. |
And after looking through the logs some more, I found out that 184.108.40.206 was behaving because all it grabbed was the robots.txt file. So, it would seem that it should share the info it receives with the other servers, otherwise, what's the point in having that robot running around?
So, it looks like these robot servers at gigablast.com don't work properly.
About an hour ago, I made the decision to block Gigablast from my site entirely. They are stopped at the front door. They can't even send me an email.
To the folks at Gigablast: Post a response to this thread once you've fixed the problem and I'll reinstate your access.
Maurice, things look sloppy to you because so many SEs/programs/programmers are not following, say, robots.txt. But remember, that's an arguably voluntary standard.
And look at/follow the benefits. Ignoring your wishes means someone gets more stuff. And those with the most stuff win...
The ignore-robots.txt coding is clearly intentional when it comes to all too many SE companies, software developers, and CS students. From a university-level data mining project blog:
"# Crawl the URLs in the robots.txt. This would violate the robot exclusion standards but our goal is to collect statistics and analyse..."
Finally (no, really) --
Sites are kind of like celebrities to search engine paparazzi, and individual opportunists and snoops. If you're in public, or what they regard as public, every single thing about you is fair game. And if you want privacy? Well, good luck. Because for the most aggressive and intrusive, it's pretty much up to you to stop them.
|I've long wondered what IBM does with the data it mines. |
Maybe they just want to have their own search index so Google and Yahoo! don't know what their engineers, executives, marketers, etc. are searching for.
"Ask Jeeves" is another search engine with a problem. Their problem is similar to the others but maybe in a worse way that is only hurting themselves. I'll explain.
Just like the others, Ask Jeeves tries to get my robots.txt file at the www. address but gets redirected to the address without the www. It doesn't follow through.
But it's also trying to grab files from the www. address and getting redirects on all the files, and then not following those redirects either. Many of these are files that are OK to grab and some of them are in directories that are disallowed.
So, how does Ask Jeeves plan to be a good search engine if it doesn't know how to fetch the data in the first place? It's not even getting the files that are OK to grab.
I'll give the big players credit, such as Google, Yahoo, and MSN. Yahoo was hitting my site pretty steadily the other day when I was about ready to make a small change to my robots.txt file. Yahoo grabs the robots.txt file quite frequently, I've noticed. Following the change I made, the very next time Yahoo grabbed the robots.txt file, all of its servers backed right off from my site. It was like they packed up and left.
After a while, maybe an hour or two, they started coming back. Now, there's some good logic. Give the programmers some credit. They noticed through the robots.txt file that something was changing and, based on that, decided there was no point in grabbing files right then.
The big players stay out of the directories that I have disallowed in the robots.txt file. The wannabes (IBM, Gigablast, Looksmart, and now Ask Jeeves) don't know how to follow a simple redirect. So, they stay stupid and that's why they aren't the big players.
Actually, I shouldn't mention IBM in there. They're not a search engine company. But then again, why are they scouring the internet? They should forget it, though, because they don't know how to do it right, anyway. No wonder Commodore beat them up in the home computer wars. :)
This morning I received a reply from IBM about their crawler. It didn't say much, but I'll give them credit for responding. This was in the email:
"As of April 7th, I forwarded this report to our crawler developer. There might be a few issues here, one of which is an issue of multiple crawlers hitting your site."
I still say if multiple crawlers are going to hit a site, they should share information between themselves. It only makes sense if they want to gather information properly. Sharing information includes sharing the robots.txt content and abiding by it.
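Here is a rough Python sketch of what I mean by sharing: one crawler stores the rules and every crawler asks the same store before it fetches. The class, the hostnames and the "ExampleBot" agent are assumptions for illustration. I have no idea how IBM's crawlers are actually put together.

```python
from urllib.robotparser import RobotFileParser


class SharedRobotsCache:
    """One place where every crawler worker looks up robots.txt rules."""

    def __init__(self):
        self._parsers = {}  # host -> parsed robots.txt rules

    def store(self, host, robots_txt):
        # Called by whichever worker fetched robots.txt for this host.
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self._parsers[host] = parser

    def allowed(self, host, user_agent, path):
        parser = self._parsers.get(host)
        if parser is None:
            return False  # no rules fetched yet: hold off rather than guess
        return parser.can_fetch(user_agent, path)


cache = SharedRobotsCache()
cache.store("example.com", "User-agent: *\nDisallow: /private/\n")

print(cache.allowed("example.com", "ExampleBot", "/private/a.zip"))
print(cache.allowed("example.com", "ExampleBot", "/index.html"))
```

In a real setup the store would sit in a database or shared cache that all the crawler machines can reach, keyed on the host that the robots.txt redirect finally lands on.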
Maybe the IBM developers will correct their problems. Time will tell. So for now, I won't block them as they will likely be watching the results of hitting my site as a test if they are working on their programming.
If there is no change in behaviour after a month or so, then I will block them.
My thinking is this: If a robot doesn't behave here, it doesn't belong here.