Googlebot crawls certain content differently - am I seeing a penalty?
I have noticed in my logs that the Googlebot has recently been hitting certain pages on my site with "?iframe=true&width=100%&height=100%" appended. I don't use these query values and they are to be found nowhere on the Internet as far as my site is concerned (at least, I haven't found them, and I've looked pretty hard), yet they consistently show up in my logs, always with the Googlebot agent and their IP 188.8.131.52.
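For anyone who wants to check their own logs for the same thing, here is a rough sketch of how the hits could be pulled out. It assumes the Apache "combined" log format and matches on the user-agent string only, so it does not prove the request really came from Google. The names `find_suspect_hits`, `LOG_LINE`, and `SUSPECT_QUERY` are made up for this example.

```python
import re
from urllib.parse import unquote

# The query string described above, in its decoded form.
SUSPECT_QUERY = "iframe=true&width=100%&height=100%"

# Assumption: Apache "combined" log format. Adjust if your format differs.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def find_suspect_hits(lines):
    """Return (ip, path, status) for Googlebot-agent requests carrying the odd query."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # Not in the assumed format; skip it.
        if "Googlebot" not in match.group("agent"):
            continue
        path, _, query = match.group("url").partition("?")
        # unquote() also catches the percent-encoded variant (100%25).
        if unquote(query) == SUSPECT_QUERY:
            hits.append((match.group("ip"), path, int(match.group("status"))))
    return hits
```

Feeding it the lines of an access log (for example `find_suspect_hits(open("access.log"))`, where the file name is whatever your server uses) gives a list you can group by path to look for a pattern in which pages are being requested this way.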
It appears that the particular threads hit this way all have something in common: they contain a healthy portion of Amazon affiliate links (links to Amazon including my affiliate ID so we can try to get a referral from the info we're providing).
These threads aren't SOLELY links - I take pride in including a lot of supplemental information to go along with store links, but I'm wondering if I've passed some magic unknown Google threshold to get them tagged as spammy or not useful or at least requiring another look - and if the "?iframe=true&width=100%&height=100%" is a clue that this is what's happening.
So I guess I have two questions:
Is anyone else seeing the Googlebot append a query string when crawling your site and if so, do you notice any pattern to the links that are being crawled this way?
Has anyone noticed that some percentage of linked text (Amazon or otherwise) on a page will keep that page from showing up in results?
In the past, as my results have traveled up and down in the rankings, I've been content to ride it out and just wait for the next Google change, but now that I'm not appearing AT ALL and also seeing these pretty strange Googlebot hits, I figured it didn't hurt to ask around. :)
Googlebot does test servers with all manner of odd URLs from time to time. Given what the query string looks like, I'd guess it's some kind of a check to see if your pages have been hacked and are hosting a parasite iframe to do malware downloads.
It's true that pages with a lot of affiliate links can have a hard time ranking - I don't think it's a hard-coded percentage or anything like that. But Google has said they don't want to rank affiliate pages unless the site offers some significant extra value that can't be had by visiting the parent website. The guideline I read once was "a useful, quality website with some affiliate links is fine, but a website that is essentially affiliate links with just a bit of extra content thrown in is not."
My guess? Sounds like a manual review process.
Well, I knew as soon as I posted, something would come along that would break my pattern :)
Last night I got another hit from the Googlebot on a new page and from a different crawler IP (184.108.40.206), again with "?iframe=true&width=100%&height=100%" appended to the page link - only THIS page has ZERO affiliate links, as far as I can tell.
I think the other troubling thing about this is, if I hadn't manually coded in a 301 to disregard these variables, I don't know that the web server would have even returned the intended page, so I'm left baffled as to where Googlebot would get this syntax and why it would use it.
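To make the 301 workaround concrete, here is a minimal sketch of the logic: drop the parameters the site never uses and redirect to what is left. This is not the actual rule from my server, just an illustration; `redirect_target` and `IGNORED_PARAMS` are hypothetical names, and it assumes `iframe`, `width`, and `height` are never legitimate parameters on the site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: the site never uses these parameters, so they are safe to discard.
IGNORED_PARAMS = {"iframe", "width", "height"}


def redirect_target(url):
    """Return the URL to 301 to, or None if no ignored parameter is present."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key not in IGNORED_PARAMS]
    if len(kept) == len(pairs):
        return None  # Nothing to strip; serve the page normally.
    # Rebuild the URL with only the parameters that were kept.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

A request handler would call this first and, if it returns a URL, answer with a 301 and that URL in the `Location` header. Note that any remaining parameters are re-encoded by `urlencode`, so their escaping may differ slightly from the original request.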
Moving out from my own site, I notice that a Google search on "iframe=true&width=100%&height=100%" returns about 13,000 results - many of which are "page not found" errors. Maybe this is just some Google weirdness?
I'm hoping I'm not causing more trouble by configuring that 301 to work around this, but I DO want the page to get crawled, obviously - I'd like ALL our pages to be crawled and included in results as quickly as possible, of course!
It looks like I started seeing these queries back in May.
So let me ask again: Has anyone else seen Googlebot crawling pages with this extra substring appended to their request?
If so, in your case, are you returning a 404, are you coding around it, or does it turn out that you're returning the page you'd expect even with that extra query string, so no one's the wiser?
If it returns content, then you have a Duplicate Content problem on your hands.
Googlebot has visited this on my site.
If ?twitterfeed&utm_medium=twitter is appended to your URL, then somebody has used twitterfeed to post your RSS feed to their Twitter page.
If it's your own Twitter page, then you may want to do what I'm currently trying to do: remove that query string in .htaccess.
I'll post on that in the appropriate section: "Apache Web Server" (".htaccess, mod_rewrite, and other Apache specific topics").