|Google Image Bot with an Empty User Agent|
Msg#: 3661791 posted 12:54 pm on May 29, 2008 (gmt 0)
About 6 months ago, bad bots hit all my sites hard, crawling heavily and slowing one of my servers down markedly. I installed an ISAPI filter, which has been working very well. One of my sites has a gallery section that used to have quite a few images in Google image search, but I noticed yesterday that most of the images are gone, so I started going through log files to see if the bot had visited lately. I found that a Googlebot has been trying to crawl images, but the ISAPI filter is rejecting it because the bot sends an empty user agent. Since the filter blocks the request and logs the block itself, nothing in the normal log files shows the bot even visiting. The IP address for the bot is 220.127.116.11. The filter log shows the bot coming in, trying to crawl 5 or 6 images, getting rejected because of the empty user agent, and then leaving. I suppose I could exempt the whole class B range from the filter (all those IPs are owned by Google), but why would Google be sending out a robot with no user agent? Has anyone else ever seen anything like this?
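For anyone wanting to reason about the rule described above (reject empty user agents, but exempt one trusted range), here is a minimal Python sketch. It is not the ISAPI filter itself. The function name `should_block` and the `10.20.0.0/16` range are made-up placeholders; substitute whatever /16 you actually decide to trust.

```python
import ipaddress
from typing import Optional

# Hypothetical placeholder range (a /16, i.e. an old-style "class B").
# Replace it with the network you have verified and decided to exempt.
ALLOWED_NETWORKS = [ipaddress.ip_network("10.20.0.0/16")]


def should_block(remote_ip: str, user_agent: Optional[str]) -> bool:
    """Return True if a request with an empty user agent should be rejected."""
    if user_agent and user_agent.strip():
        return False  # non-empty user agent: this particular rule does not apply
    addr = ipaddress.ip_address(remote_ip)
    # An empty user agent is tolerated only from an exempted network.
    return not any(addr in net for net in ALLOWED_NETWORKS)
```

Keeping the exemption as a list of networks makes it easy to narrow the range later if the whole /16 turns out to be too generous.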
Msg#: 3661791 posted 2:20 pm on May 29, 2008 (gmt 0)
I've just checked logs for a few sites, and I can't find an example of this IP from Google so far.
Here's another oddity in what you're seeing. According to the official Google Blog post about Verifying Googlebot [googlewebmastercentral.blogspot.com] "the recommended technique would be to do a reverse DNS lookup, verify that the name is in the googlebot.com domain, and then do a corresponding forward DNS->IP lookup using that googlebot.com name."
But this IP address doesn't seem to have reverse DNS set up at all. So I'd say just forget about it, except that your image indexing is affected - so something's up here. Still, if you exempt that class B range from the filter, I suppose there's a chance that you would be more vulnerable to IP spoofing in this range from malicious bots. But then again, I guess malicious bots are even more likely to spoof a Googlebot user agent anyway.
If Google Image Search traffic is useful for your business, I guess it's worth the experiment to open up the filter.
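The reverse-then-forward DNS check quoted above can be sketched in Python roughly as follows. This is only an illustration using the standard `socket` module. The function name `is_verified_googlebot` and the injectable `reverse`/`forward` parameters are my own additions so the logic can be tried without live DNS.

```python
import socket
from typing import Callable, List


def _reverse(ip: str) -> str:
    # Reverse DNS (PTR) lookup: returns the primary host name for the IP.
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    # Forward DNS lookup: returns the IPv4 addresses for the host name.
    return socket.gethostbyname_ex(host)[2]


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], List[str]] = _forward,
) -> bool:
    """Reverse lookup, check the googlebot.com domain, then confirm with a forward lookup."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:
        return False  # no reverse DNS at all, as with the IP discussed here
    if not host.endswith(".googlebot.com"):
        return False
    try:
        return ip in forward(host)  # the name must resolve back to the same IP
    except OSError:
        return False
```

An IP with no reverse DNS simply fails this check, which is consistent with what was observed for the address in question: it cannot be confirmed as Googlebot this way, whoever owns the range.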
Msg#: 3661791 posted 3:52 pm on May 29, 2008 (gmt 0)
I just assumed this was the Google image bot. The files it was trying to access were either JPGs or GIFs. I did a search in the regular log files for googlebot-image and did find legitimate hits, albeit just a couple a day. This could be something altogether different. I did find some interesting referrals here...
Still, I am having trouble getting images in the index. The big problem is it takes soooooooo looooooong before you see any results when trying to optimize for this. About a year ago, I changed all the images from pic1, pic2, pic3 to small-red-widget, big-green-widget, medium-blue-widget, etc. along with appropriate alt tags. All the images disappeared and I am still waiting for any fruits to bloom off of the labor tree ;-)
Msg#: 3661791 posted 4:47 pm on May 29, 2008 (gmt 0)
Well, it is a Google IP. We have had similar issues with Google IPs either 1) not having a Google user agent, 2) not providing reverse DNS, or 3) behaving like a scraper while doing both of the previous.
We ended up having to do an IP whois just before blocking scrapers, just to ensure it wasn't one of these Google IPs. I am also curious whether Google uses IPs that it doesn't "own" as well, because every time we put our scraper-blocking scripts into effect we see a SERP drop very soon after, and when we take them off it goes back up. I was wondering if Google has a bot pretending to be human that we end up blocking, and that this causes the drop. We have gone through the log files to ensure we only block bots, and I whois them, and they are all potential scrapers in the traditional sense.
Msg#: 3661791 posted 5:40 pm on May 29, 2008 (gmt 0)
|I just assumed this was the google image bot |
I don't think so, this is something different.
I just checked and I'm seeing this IP but I didn't see it ask for robots.txt.
However, I did see this user agent also associated with the IP:
"Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:18.104.22.168) Gecko/20080404 Firefox/22.214.171.124"
I could be wrong, but so far every time you see a combination of images and Firefox on Linux it's typically someone making screenshots. However, I suspect Google would be making a lot more screenshots if that's what they're really doing, unless they're just experimenting, because images and HTML combined from this IP came to only 32 files.
FWIW, don't worry about blocking things from Google's IP ranges in general because they have other services that have been used by scrapers which I block all the time without repercussion.
Msg#: 3661791 posted 6:57 pm on Jun 2, 2008 (gmt 0)
You're right, this IP never requests the robots.txt file, nor sitemaps. Pretty scary, though, if they are taking screenshots of this site. Maybe they are putting together a PowerPoint on how to do great optimization, eh?
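The checks discussed in the last few posts (did an IP ever ask for robots.txt, and which user agents did it send) can be run over a log with a short script. The sketch below assumes Apache-style combined log format, which may not match your server. IIS logs in particular use a different layout and would need a different pattern. The `summarize` function and its field names are illustrative only.

```python
import re
from collections import defaultdict

# Assumed layout: combined log format
# ip ident user [time] "METHOD path proto" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?:\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)


def summarize(lines):
    """Group log lines by client IP: request count, robots.txt seen, user agents."""
    summary = defaultdict(
        lambda: {"requests": 0, "robots_txt": False, "user_agents": set()}
    )
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that do not fit the assumed format
        entry = summary[m["ip"]]
        entry["requests"] += 1
        if m["path"].split("?")[0] == "/robots.txt":
            entry["robots_txt"] = True
        entry["user_agents"].add(m["ua"])  # "-" or "" means no user agent was sent
    return dict(summary)
```

Feeding it an open log file, for example `summarize(open("access.log"))` with your own file name, gives one entry per IP, so a visitor that fetches images but never robots.txt stands out quickly.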