Forum Moderators: Robert Charlton & goodroi
If you have an idea that Google could use to alleviate this problem, or that a webmaster could use to fix or avoid this problem, please post it here.
Each post should contain only one idea. Each idea should have only one post. There's no need for a long code example, just the mechanism.
Any followup discussion belongs in the Google's 302 Redirect Problem [webmasterworld.com] thread, not here.
I haven't checked all those, but it's common for large-ish sites, as you noted. A lot of news sites do this as well. They don't do this in order to redirect www to non-www (or the other way round) - some of them do have www and non-www duplicate issues anyway (a 302 does not help with this).
What they do is use the 302 redirect as it's supposed to be used. On their front page, they always redirect to the most recent version. Since that version's URL can change, they want the browser/user/spider to keep the main page URL for the next visit, but fetch the content from the newest URL.
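As a rough illustration of that setup (a sketch only; the path and the names `LATEST_PATH`, `response_for` and `FrontPageHandler` are made up for the example, not taken from any of the sites mentioned), a front page that 302s to its newest edition might look like this in Python:

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical path of the newest edition; a real site would look this up.
LATEST_PATH = "/editions/latest-story.html"


def response_for(path, latest=LATEST_PATH):
    """Return (status, headers) for a requested path."""
    if path == "/":
        # 302 is temporary: the client should keep requesting "/" next time
        # and only fetch this visit's content from the Location target.
        return 302, {"Location": latest}
    return 200, {"Content-Type": "text/html; charset=utf-8"}


class FrontPageHandler(BaseHTTPRequestHandler):
    """Minimal handler that applies response_for() to GET requests."""

    def do_GET(self):
        status, headers = response_for(self.path)
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        if status == 200:
            self.wfile.write(b"<p>placeholder article body</p>")
```

To try it locally, pass `FrontPageHandler` to `http.server.HTTPServer(("127.0.0.1", 8000), FrontPageHandler)` and call `serve_forever()`; a request for `/` should then come back as a 302 with a `Location` header, while the front page URL itself stays the one to remember.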
These include, CN++, M$, Or*cle, S*n, Newswe*k, IB*, and Genital Motors
What claus said, and those sites all enjoy substantial PageRank, which makes them extremely difficult to PageJack.
As GoogleGuy said:
PageRank is a pretty good proxy for reputation, and incorporating PageRank into the decision for the canonical url helps to choose the right url.
Interesting observation. But not all the sites I listed seem to have in mind what you are saying; that is, some just have a target of www.somedomain.com rather than a specific file that should be captured. So if you type in the www version there is no redirection.
One thing that confuses me with regard to their 302 status codes is that some will say, "Object Moved", while others say, "Moved Temporarily". What is the difference from a technical standpoint, or are these descriptors supplied by the coders for their own personal reasons?
Too bad these big boys can't fall prey to this hijacking. We'd get a lot more attention from Google if they could.
Question: If I notice that the hijacker's site is returning a 404, can I then use the removal tool just as if my site were delivering a 404? Or does it have to be my site that is showing a 404 to do this magic?
GuinnessGuy
Consider an internal page that has been hijacked:
hijacker.com/link.php?id=1234 --> 302 redirect --> mysite.com/somepage.html
Search for site:mysite.com and you get a bunch of pages from mysite.com, plus the hijacking URL from hijacker.com. Classic sign of the hijack.
Now search for site:mysite.com inurl:mysite.com/somepage.html and the results still include the hijacker.com URL, even though it (hijacker.com/link.php?id=1234) doesn't contain your site name or page name. This tells me that for each "page," Google stores at least two URLs in its index:
1) Display URL: The URL that is displayed and linked to in the SERPs
2) Content URL: The URL that is queried by "inurl" and "allinurl" searches
When determining which URL to display, there's no need for Google to even consider duplicate content. It sees that there is more than one "page" with the same Content URL, assumes duplicate content based on that fact alone, and then chooses a Display URL based on the factors GG mentioned in his Slashdot post. Game over.
(This makes sense when you consider that the cache and the search index are essentially two different systems. The search index only knows about Content URLs. The SERP display and cache systems only know about Display URLs, and some piece in between links Content URLs with Display URLs. This duality would appear to be pretty deeply ingrained in Google's multi-tiered search architecture, and hence may not be as easy a problem to sort out as we'd like to think. Though if (DisplayURL == ContentURL) { it's the canonical URL } would be a pretty obvious fix...)
In this context, the <base href="http://mysite.com/somepage.html"> suggestion makes a lot of sense, and is certainly a factor that Google could be considering as part of its canonicalization algo.
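To make the two-URL idea concrete, here is a small Python sketch of the model described above. It is only a toy version of the guess in this post, not anything known about Google's actual index; the names `IndexEntry`, `inurl` and `pick_canonical` are invented for the illustration, and the tie-breaking on PageRank that GG mentioned is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    display_url: str  # URL shown and linked in the SERPs
    content_url: str  # URL the content was actually fetched from


def inurl(entries, term):
    """Mimic the observation above: inurl: matches the content URL only."""
    return [e for e in entries if term in e.content_url]


def pick_canonical(entries):
    """Per content URL, prefer the entry whose display URL equals it."""
    chosen = {}
    for entry in entries:
        current = chosen.get(entry.content_url)
        is_direct = entry.display_url == entry.content_url
        # Keep the first entry seen, unless a direct (non-redirected)
        # entry turns up to replace an indirect one.
        if current is None or (
            is_direct and current.display_url != current.content_url
        ):
            chosen[entry.content_url] = entry
    return chosen


entries = [
    IndexEntry("hijacker.com/link.php?id=1234", "mysite.com/somepage.html"),
    IndexEntry("mysite.com/somepage.html", "mysite.com/somepage.html"),
]
```

With these two entries, `inurl(entries, "mysite.com/somepage.html")` returns both, including the hijacker's, which matches the search behaviour described; `pick_canonical(entries)` picks the mysite.com entry because its display URL and content URL agree.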
An informational site in a foreign country, on which I had tried to get some incorrect content corrected, updated its content recently. There are two URLs for the content, but they point to the same physical hard drive space.
Both URLs showed in the results for a while, but when you clicked on "cache" for www.domain.it/keyword/ the text above the cached copy said "this is Google's cache of keyword.otherkeyword.it/somefolder/" - Google knows that they are the same content, has only one cache copy, has identified the cached copy under one URL, and then redirects calls for the "other" URL back to this one.
othersite.com/links/index.php3?mode=update_link&link=http://www.mysite.com%2F - 30k - Supplemental Result - Cached - Similar pages
I sent a request to Google at help@google.com with the subject "canonicalpages". They replied:
"We'd like to assist you, but we only respond to messages submitted through our online contact form. Please visit [google.com...] to submit your message, and we'll get back to you soon."
I can't remove these URLs with the removal tool, as these pages don't exist.
What should I do now?