Googlebot crawling strange "?ref=something.com" attached to my pages

Forum Moderators: Robert Charlton & goodroi

flicky

4:54 am on Feb 22, 2010 (gmt 0)

Anyone have any insight into what this could be? Maybe it has something to do with my penalty.

I was browsing my log file tonight and noticed Googlebot was busy. When I looked at the URLs it was spidering, I noticed a lot of these:

(keep in mind that I 301 redirect anything with a "?" attached to my pages so it does make it to the correct url)

02/21 19:46 /mypage1.html?ref=example.org 301 redirect -
02/21 19:46 /mypage1.html

02/21 19:52 /mypage2.html?ref=example.com 301 redirect -
02/21 19:53 /mypage2.html

02/21 20:22 /mypage3.html?ref=example-2.com 301 redirect -
02/21 20:22 /mypage3.html
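
To see how widespread these requests are, the log can be tallied by "ref" value. This is only a rough sketch: it assumes the abbreviated layout shown in the excerpt above (date, time, then the requested URL as the third whitespace-separated field), and the function name count_ref_values is made up for illustration. A real server log would need its own parsing.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit


def count_ref_values(lines):
    """Count how often each ?ref= value appears in the requested URLs.

    Assumes each line looks like: "02/21 19:46 /page.html?ref=site 301 ..."
    (hypothetical layout taken from the excerpt, not a standard log format).
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or malformed line
        query = urlsplit(fields[2]).query
        # parse_qs returns a dict of lists; a URL without "ref" yields nothing.
        for value in parse_qs(query).get("ref", []):
            counts[value] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "02/21 19:46 /mypage1.html?ref=example.org 301 redirect -",
        "02/21 19:46 /mypage1.html",
    ]
    for ref, hits in count_ref_values(sample).most_common():
        print(ref, hits)
```

A handful of distinct "ref" values repeated across many pages would point to a few external sources generating the links.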

Anyone know what to make of this? It couldn't be good to make googlebot have to 301 like that to find the right page. And how the heck did someone make this happen? Should I inform google of this in a reconsideration?

thanks,
marc

[edited by: tedster at 5:16 am (utc) on Feb 22, 2010]
[edit reason] make the domains involved anonymous [/edit]

flicky

5:29 am on Feb 22, 2010 (gmt 0)

ah, sorry... thanks Tedster

One other strange thing I just noticed in my log file. Even though I'm 301 redirecting domain.com to www.domain.com, googlebot continues to try to crawl the domain.com in "some" cases. I never thought I'd have to set the "preferred domain" setting in GWT since I have been 301 redirecting for years. But I just did it. When I went to do it, it said... "you need to verify ownership of "domain.com". So I added it to GWT. So I set that one to "www.domain.com" as well. Hopefully that does the trick.

The last anomaly I saw in my log file is that Googlebot tries to access many of my pages with a ? at the end like this...

page1.html?
page2.html?

Those aren't redirected... probably my mistake... I was only redirecting "?ref" ... they are crawled with no problem. Is Google smart enough to know that page1.html? and page1.html are the same page? Or is that a duplicate content situation? I suppose I should redirect anything with a "?"
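
The "redirect anything with a ?" rule can be sketched in Python as below. This is an illustration of the logic only, not real server configuration; the function name redirect_target is hypothetical, and the rule only makes sense on a site like this one that never uses query strings legitimately.

```python
from typing import Optional


def redirect_target(request_target: str) -> Optional[str]:
    """Return the clean path to 301 to, or None if the URL is already clean."""
    # partition splits on the first "?"; sep is "" when no "?" is present,
    # so a bare trailing "?" is caught even though its query is empty.
    path, sep, _query = request_target.partition("?")
    if sep:
        return path
    return None


if __name__ == "__main__":
    for target in ("/page1.html?", "/page1.html?ref=example.org", "/page1.html"):
        print(target, "->", redirect_target(target))
```

Testing for the "?" character itself, rather than for a non-empty query, is what makes page1.html? and page1.html?ref=... follow the same rule.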

thanks... so many issues!

marc

tedster

7:15 am on Feb 22, 2010 (gmt 0)

This usually means someone is linking to your pages with that query string - as a kind of log file spam or part of some other more aggressive spam scheme. Yes, it "might" be a sign of something that is hurting your rankings.

Do you have a Webmaster Tools Account? If so you may be able to see such a page referenced in your backlinks list.

I'd be very tempted to make these query string urls go 404, instead of trying to squeeze some link juice out of them with a 301 redirect. If there are backlink pages (and I'm pretty sure there are) you may be better off without their vote for your page.

I'd also check a couple other things:

- Make sure these requests are really googlebot - reference [webmasterworld.com...]

- Use the "Fetch as googlebot" utility in WMT to be sure that these pages aren't hosting some cloaked, parasite content
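
For the first check, a common approach is a reverse DNS lookup on the requesting IP followed by a forward lookup on the returned hostname. The sketch below assumes that approach; the linked reference is truncated here, so it may describe the steps differently. The function names are made up for illustration, and the forward lookup used is IPv4 only.

```python
import socket
from typing import Callable, List

# Hostname suffixes treated as Google crawlers in this sketch.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """PTR lookup: IP address -> hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host: str) -> List[str]:
    """A-record lookup: hostname -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_real_googlebot(
    ip: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """True only if reverse DNS names a Google host AND that host maps back to ip."""
    try:
        host = reverse(ip).lower().rstrip(".")
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # The forward check stops someone from faking the PTR record.
        return ip in forward(host)
    except OSError:
        # socket.herror and socket.gaierror are both OSError subclasses.
        return False
```

The user-agent string alone proves nothing, since anyone can send it; the resolvers are passed in as parameters only so the logic can be exercised without live DNS.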

flicky

7:47 am on Feb 22, 2010 (gmt 0)

Thanks for the reply...

Unfortunately I can't find any reference to these in GWT... I looked at some of the specific pages this was affecting and nothing was there related to those strange domains.

I worry about 404 because what if Google for whatever reason can't figure out where the original page is and decides to just "get rid of it" totally?

I just have a feeling this is but one of many reasons why I am penalized... this, possible google bowling from a huge backlink spike and a huge (growing) number of spam occurrences when querying "www.domain.com + forex".

So far google has ignored my requests, but this is something new I suppose.

marc

tedster

8:42 am on Feb 22, 2010 (gmt 0)

I worry about 404 because what if Google for whatever reason can't figure out where the original page is and decides to just "get rid of it" totally?

Google actually ranks URLs and not "pages", which is an imprecise, non-technical concept - good enough for everyday speech, but not for computer algorithms. So if the URL resolves when it doesn't include a query string, then a 404 for the added query string version should not be a problem.
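
As a rough sketch of that idea in Python (not server configuration; the function name status_for and the known_paths set are hypothetical), the query-string variant and the clean URL are simply answered independently:

```python
from typing import Set


def status_for(request_target: str, known_paths: Set[str]) -> int:
    """Return the HTTP status this policy would send for a request target."""
    path, sep, _query = request_target.partition("?")
    if sep:
        # Any URL containing "?" is a different URL that this site never published.
        return 404
    return 200 if path in known_paths else 404


if __name__ == "__main__":
    pages = {"/mypage1.html", "/mypage2.html"}
    print(status_for("/mypage1.html?ref=example.org", pages))
    print(status_for("/mypage1.html", pages))
```

The 404 applies only to the exact URL requested, so the clean /mypage1.html keeps resolving normally.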