Googlebot acting strangely

Finding it in error logs

grandma genie

6:09 pm on Feb 3, 2011 (gmt 0)

Hello,

This Googlebot is coming from what is seemingly the correct IP range, but I keep finding it in my error logs with this designation:

66.249.71.50 - - [03/Feb/2011:04:37:43 -0500] "GET /mammals/ox/ox2.jpg%22%20width=%2258%22%20height=%2250%22%20alt=%22image%22/%3E%3C/a%3E%20%3C/div%3E%20%3C/div%3E%20%3C/div%3E%20%3Cdiv%20class=%22kowpfb%22%3E%3Ca%20href=%22/m/search?site=images HTTP/1.1" 404 8760 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Is this normal behavior for a Googlebot? I checked the IP via a Google search and found that there are two hostnames pointing to that IP. They are:

Crawl-66-249-71-50.googlebot.com

and

online.ezdataroom.com

I do not want Google indexing my images. I have indicated this in robots.txt and on the page itself with this: <meta name="robots" content="noimageindex">

Any ideas what is going on here?

Jeannie

incrediBILL

6:23 pm on Feb 3, 2011 (gmt 0)

Not sure how you got online.ezdataroom.com from 66.249.71.50, what tool are you using?

As for the goofy URL, Google often finds those on other websites. It probably shows up on some scraper site, and the bot will attempt to crawl it to verify the URL.

Crawling does not always imply indexing and bots may crawl pages blocked from indexing in robots.txt just to verify the URL if it's found linked externally.
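To see why the URL looks so goofy, it helps to percent-decode the requested path from the log entry quoted earlier in this thread. The following is only a minimal sketch: the `probable_target` helper is a hypothetical name, and cutting at the first quote or angle bracket is an assumed heuristic for guessing the URL the scraper meant to copy, not anything Google documents.

```python
from urllib.parse import unquote

# Request path copied from the log entry quoted earlier in this thread.
raw_path = (
    "/mammals/ox/ox2.jpg%22%20width=%2258%22%20height=%2250%22%20alt=%22image%22/"
    "%3E%3C/a%3E%20%3C/div%3E%20%3C/div%3E%20%3C/div%3E%20%3Cdiv%20class=%22kowpfb%22"
    "%3E%3Ca%20href=%22/m/search?site=images"
)


def probable_target(path: str) -> str:
    """Guess the intended URL (hypothetical heuristic).

    Decodes the path and cuts it at the first character that belongs to
    HTML markup rather than to a normal URL: a double quote or angle bracket.
    """
    decoded = unquote(path)
    for i, ch in enumerate(decoded):
        if ch in '"<>':
            return decoded[:i]
    return decoded


# The decoded path shows trailing HTML attributes and tags glued onto the file name.
print(unquote(raw_path))
print(probable_target(raw_path))
```

Decoded, the request turns out to be the image path followed by the rest of an img tag and some closing markup, which is consistent with a scraper that failed to stop at the end of the src attribute.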

grandma genie

7:51 pm on Feb 3, 2011 (gmt 0)

I Googled the IP and found the reference to online.ezdataroom.com at robtex.com. Don't know how accurate their data is. Bottom line is this is a Googlebot. From what you say scrapers are the problem, but it would seem they are everyone's problem. Thanks for the comments.

dstiles

10:59 pm on Feb 3, 2011 (gmt 0)

You are correct - robtex does resolve the IP to the subdomain online.ezdataroom.com with the comment that its rDNS is Google's crawler. A DNS lookup tool resolves the subdomain in the same way - i.e. to Google.

I think it's a DNS error somewhere - the root domain resolves to a UK server.
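Anyone can point a forward DNS record for their own hostname at any IP, so a name like online.ezdataroom.com resolving to a Google address says little about who controls that IP. The usual check runs the other way: look up the reverse (PTR) name for the IP, confirm it sits under a Google crawler domain, then resolve that name forward and confirm it returns the same IP. Below is a rough sketch of that check. The function name `is_verified_googlebot`, the injectable resolver arguments, and the exact suffix list are assumptions for illustration, and it handles IPv4 only.

```python
import socket
from typing import Callable, List

# Assumed list of acceptable reverse-DNS suffixes for Google's crawler.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def default_reverse(ip: str) -> str:
    """Reverse (PTR) lookup: IP address to primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def default_forward(host: str) -> List[str]:
    """Forward lookup: hostname to its list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = default_reverse,
    forward: Callable[[str], List[str]] = default_forward,
) -> bool:
    """Forward-confirmed reverse DNS check (hypothetical helper)."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    # The PTR name must belong to one of the assumed Google domains.
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        # The PTR name must resolve back to the IP we started with.
        return ip in forward(host)
    except OSError:
        return False
```

Calling `is_verified_googlebot("66.249.71.50")` performs live DNS lookups, so the answer depends on what DNS returns at the time. The resolver arguments exist so the logic can be tried with canned answers instead.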

incrediBILL

2:48 am on Feb 4, 2011 (gmt 0)

What happens with these odd page and file requests is this:

1. Some poorly written bot/spider/scraper reads your page and can't quite interpret all the HTML properly, so it writes out a broken link to some page or file on your site.

2. This broken link collected by the scraper is usually spewed out by an auto-content generator somewhere for Google to index.

3. Now Google crawls the site written by the poorly crafted scraper/auto-content generator and attempts to also crawl all of the mangled URLs it created that link back to your site.

4. Now your log file fills up thanks to the scrapers and auto-content generators giving Google all that garbage and you suffer wondering if your site is broken when you start digging through all the 404s.
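If you want to separate this kind of garbage from genuine broken links, one option is to scan the access log for 404s whose decoded path contains HTML markup. The sketch below assumes the Apache combined log format shown in the first post. The `mangled_404s` name and the "contains a quote or angle bracket" test are illustrative assumptions, not a standard tool.

```python
import re
from urllib.parse import unquote

# Assumed layout: Apache "combined" log format, as in the entry quoted in this thread.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def mangled_404s(lines):
    """Yield (ip, decoded_path) for 404 requests whose path looks like scraped HTML."""
    for line in lines:
        m = LOG_RE.match(line)
        if not m or m.group("status") != "404":
            continue
        decoded = unquote(m.group("path"))
        # Quotes and angle brackets rarely appear in legitimate URLs on a static site.
        if any(ch in decoded for ch in '"<>'):
            yield m.group("ip"), decoded
```

Typical use would be something like `for ip, path in mangled_404s(open("access.log")): print(ip, path)`, where `access.log` stands in for wherever your server writes its log.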

Follow what's happening so far?

Now let me show you how this can be put to pure evil use.

I ran across one scraper that was feeding Bing a bunch of pages with URLs designed to attempt to upload a file to your site, and it would notify the hacker when the file was executed upon upload.

Same basic situation you're reporting above except the URL path was crafted to make the search engine into a potential weapon against the very sites the SE indexed.

How's that for cool? :)

grandma genie

3:52 am on Feb 4, 2011 (gmt 0)

Don't encourage them, incrediBILL. They need to know that what goes around, comes around. They may get away with it for a while, but eventually they will reap what they sow.