Forum Moderators: Robert Charlton & goodroi
Sometimes, an HTTP status 302 redirect or an HTML META refresh causes Google to replace the redirect's destination URL with the redirect URL. The word "hijack" is commonly used to describe this problem, but redirects and refreshes are often implemented for click counting, and in some cases lead to a webmaster "hijacking" his or her own URLs.
Normally in these cases, a search for cache:[destination URL] in Google shows "This is G o o g l e's cache of [redirect URL]" and oftentimes site:[destination domain] lists the redirect URL as one of the pages in the domain.
Also link:[redirect URL] will show links to the destination URL, but this can happen for reasons other than "hijacking".
Searching Google for the destination URL will show the title and description from the destination URL, but the title will normally link to the redirect URL.
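For reference, the two mechanisms being discussed look roughly like this; the script name and URLs below are made-up examples, not taken from any post here. A click-tracking script such as example.com/redir.php?id=123 typically answers with an HTTP 302 response:

HTTP/1.1 302 Found
Location: http://www.destination-example.com/page.html

The META refresh variant is an ordinary page (served with a 200 status) whose head contains:

<meta http-equiv="refresh" content="0;url=http://www.destination-example.com/page.html">

In both cases the spider requests the redirecting URL first, which is where the question of which URL gets treated as canonical arises.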
There has been much discussion on the topic, as can be seen from the links below.
How to Remove Hijacker Page Using Google Removal Tool [webmasterworld.com]
Google's response to 302 Hijacking [webmasterworld.com]
302 Redirects continues to be an issue [webmasterworld.com]
Hijackers & 302 Redirects [webmasterworld.com]
Solutions to 302 Hijacking [webmasterworld.com]
302 Redirects to/from Alexa? [webmasterworld.com]
The Redirect Problem - What Have You Tried? [webmasterworld.com]
I've been hijacked, what to do now? [webmasterworld.com]
The meta refresh bug and the URL removal tool [webmasterworld.com]
Dealing with hijacked sites [webmasterworld.com]
Are these two "bugs" related? [webmasterworld.com]
site:www.example.com Brings Up Other Domains [webmasterworld.com]
Incorrect URLs and Mirror URLs [webmasterworld.com]
302's - Page Jacking Revisited [webmasterworld.com]
Dupe content checker - 302's - Page Jacking - Meta Refreshes [webmasterworld.com]
Can site with a meta refresh hurt our ranking? [webmasterworld.com]
Google's response to: Redirected URL [webmasterworld.com]
Is there a new filter? [webmasterworld.com]
What about those redirects, copies and mirrors? [webmasterworld.com]
PR 7 - 0 and Address Nightmare [webmasterworld.com]
Meta Refresh leads to ... Replacement of the target URL! [webmasterworld.com]
302 redirects showing ultimate domain [webmasterworld.com]
Strange result in allinurl [webmasterworld.com]
Domain name mixup [webmasterworld.com]
Using redirects [webmasterworld.com]
redesigns, redirects, & google -- oh my [webmasterworld.com]
Not sure but I think it is Page Jacking [webmasterworld.com]
Duplicate content - a google bug? [webmasterworld.com]
How to nuke your opposition on Google? [webmasterworld.com] (January 2002 - when Google's treatment of redirects and META refreshes were worse than they are now)
Hijacked website [webmasterworld.com]
Serious help needed: Is there a rewrite solution to 302 hijackings? [webmasterworld.com]
How do you stop meta refresh hijackers? [webmasterworld.com]
Page hijacking: Beta can't handle simple redirects [webmasterworld.com] (MSN)
302 Hijacking solution [webmasterworld.com] (Supporters' Forum)
Location: versus hijacking [webmasterworld.com] (Supporters' Forum)
A way to end PageJacking? [webmasterworld.com] (Supporters' Forum)
Just got google-jacked [webmasterworld.com] (Supporters' Forum)
Our company Listing is being redirected [webmasterworld.com]
This thread is for further discussion of problems due to Google's 'canonicalisation' of URLs, when faced with HTTP redirects and HTML META refreshes. Note that each new idea for Google or webmasters to solve or help with this problem should be posted once to the Google 302 Redirect Ideas [webmasterworld.com] thread.
<Extra links added from the excellent post by Claus [webmasterworld.com]. Extra link added thanks to crobb305.>
[edited by: ciml at 11:45 am (utc) on Mar. 28, 2005]
You will probably have to re-enter them in your robots.txt (*), but it will be easier if they have a generic component, as there's a limit to the size of your robots.txt file when you request removal from Google with the URL console.
Also, you might have to enter them one-by-one in the URL console, which will take some time.
---
(*) That is, if you can't make them return HTML with the meta tag <meta name="robots" content="noindex">
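In context, that tag sits in the head of whatever page the redirect URL is made to return, along these lines (a minimal sketch, not from the original post):

<html>
<head>
<meta name="robots" content="noindex">
<title>redirect page</title>
</head>
<body></body>
</html>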
Once a URL is indexed it will not be removed by putting it in "robots.txt" - this will only keep the spider from revisiting the URL. In order to get it removed you must specifically request removal.
If you've got your redirect script in some folder, like, say:
example.com/redir/redir.php?id=1234567890
... then you can just put "/redir/" (or "/redir/redir.php") in your "robots.txt"; you don't need to put in every single redirect.
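To make that concrete, the whole robots.txt entry would be something like this (assuming the hypothetical /redir/ path above):

User-agent: *
Disallow: /redir/

That on its own only stops spiders from revisiting those URLs; as noted, already-indexed ones still have to be submitted through the removal console.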
(i have removed such URLs from my own sites a few times, so i know the process)
(and the additional reason being that i like to have control over what is indexed - i especially don't want "internal" things or "errors" to be indexed. All i want in the index is my real pages and nothing more - one URI per page. For that reason i do remove all kinds of different stuff that should not be there whenever i see it. I like to keep things clean, as this helps me avoid "surprises" of many different kinds.)
RE: Nofollow - we've been discussing that, and I favor adding it to most of our outbound links.
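For anyone unsure what that looks like in the markup, a nofollow'd outbound link is just the ordinary anchor with the rel attribute added (example.com is a placeholder):

<a href="http://www.example.com/" rel="nofollow">some outbound link</a>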
As I read all the discussion, everyone is barking at points 1 and 2, because of point 3.
If it wasn't for point 3 (duplicate content), who cares about redirects from other webpages, as long as your own page is still in the SERPs on its own merit.
If bark we must, then I suggest we bark at Google for their method of eliminating duplicate content (define fair/unfair or show certain duplicates).
Some suggest the current G choice is based on PageRank, arguing that the redirect badguys have higher PageRank and focus on webpages with lower PageRank, so that they get chosen in the duplicate content dilemma, and not the other guy. Maybe so.
The problem of criteria for eliminating duplicate content has always existed; maybe it is aggravated by the inventiveness of badguys using redirects, and maybe protocols need to be changed (point 2) to deal with that. But duplicate content is close to "similar" content, and before you know it we are talking about how SERP ranking choices are made.
A level playing field is impossible....except maybe for napoleon.
We can only watch as Google copes.
my 2cts
Reid, why are you saying that? We've had several excluded directories and have run AdSense for some time. To Google's credit (but our frustration) our AdSense reps have been nice talking about this, but unable to help with our problems because they are very separated from the search side of things.
I should not have said that. Last month I read in the AdSense guidelines 'do not use a robots.txt file', and then buried in the optimization tips there is a line:
If you have a robots.txt file, remove the file or add the following two lines to the top of the file:
User-agent: Mediapartners-Google*
Disallow:
This change will allow our bot to crawl the content of your site, so that we may provide you with the most relevant Google ads.
I wonder if Mediapartners-Google* includes 'googlebot'?
Claus, this could be a real problem for people running AdSense, because you're not allowed to exclude googlebot from any part of your site.
If you run a robots.txt file you have to allow Google full access in the first line.
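For illustration, a robots.txt along these lines would give the AdSense crawler blanket access while still excluding a directory from other spiders (the /redir/ path is the hypothetical example from earlier in the thread). This is only a sketch of how the quoted guideline reads; whether Google actually honours it this way is exactly the open question:

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /redir/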
I have started a new thread on this topic in the AdSense forum, so let's continue with regard to that issue over there:
How much should Google be allowed to spider WRT AdSense? [webmasterworld.com]
If it wasn't for point 3 (duplicate content), who cares about redirects from other webpages, as long as your own page is still in the SERPs on its own merit.
I think the problem is much deeper than that.
If your home page is the temporary location of the hijacking page, then the hijacking page takes the home page's place in the SERPs.
This could also happen aside from duplicate content issues, i.e. Googlebot sees that it has indexed the page AND the temporary location of the page. Aside from duplicate content, it says 'I have indexed the same page twice', so it removes the temp page (your home page) and leaves the hijack page, which points at the temp.
Here's what I did: I temporarily renamed files at the host forcing 404 errors.
I used the Google emergency removal tool and selected "remove all".
I changed the filenames back quickly before they might get spidered by anyone.
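(A quick sanity check before filing the removal request is to fetch the headers of one of the renamed pages and confirm it now returns a 404. For example, with curl, which isn't mentioned anywhere in this thread but does the job; the URL is a placeholder:

curl -I http://www.example.com/page-being-removed.html

The first line of the output should show a 404 status before the URL goes into the removal tool.)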
The next day, I checked on my requests, and all three reqs were "DENIED"!
I have no idea why denied, but it seems to have done some good anyhow!
All 3 pages fell to the bottom dead end of the listings when I asked for site:mysite.net. Dead last, all three, plus a 4th one I forgot about.
AND, all four jacked URLs are now marked 'supplemental result'.
None of my pages are so marked.
None of the jacks rate above me for my original test snippets any more.
Comments anyone? Do I have this under control at least? - Larry
On another site I had a similar URL from the same place appearing in site: results; this was before the URL removal tool became known to me.
I contacted the offending directory and asked them to remove my site. They removed it without a response. My site then came out of the sandbox and has been rising rapidly in Google traffic. On this one I can't remove the URL from Google though, because the go.cgi file is not returning a header for that particular id#.
On the previous site they are refusing to remove the site from their directory; they must have been hit hard by the 302 scare. They have banned my IP too.
I seriously think they are running tests and doing rollbacks, changing algos, trying to fix the 302 thing.
I bet they know exactly what the problem is (but they'll never tell) but it is not so easy to get rid of.
I've seen things like this a thousand times in the semiconductor industry. Somebody screws up royally due to an oversight. They will never admit it, but quietly, they make sure the problem gets fixed.
If that's the case, I don't need mea-culpas ... just some reassurance that the problem is indeed being addressed.
I can't help but feel that my "denied request" did have some beneficial effects.
One odd thing. Unlike so many pond scum scrapers, the particular one I dealt with actually had good taste! He only scraped the very best sites in my arcane field (UFOs), avoiding all the loonies and amateurish junk. If he put all his black-hat efforts into some honest research, he could actually be a positive influence in the field!
As it is, he has the most authoritative people in the field ready to cut his head off; those who can figure out what he's doing, that is.
Very ironic. - Larry
After those threads
[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]
I wonder whether Google will visit the pages again after 90 days (or maybe more) and the problem will start all over again.
For a different search query the same page might be returned in the results, but as a normal result, with a more up-to-date snippet.
There is no guarantee that the words in the snippet can still be found in the cached page or on the real live site.
site:www.dmoz.org brings up NOTHING AT ALL any more.
site:dmoz.org (no www.) yields 11.2 million pages, all real dmoz URLs as far as I looked (several pages' worth) and not a scraper in sight!
Did this thread embarrass somebody, or is it just coincidence?
I don't understand why the www should make any difference. -Larry