Forum Moderators: Robert Charlton & goodroi
Sometimes, an HTTP status 302 redirect or an HTML META refresh causes Google to replace the redirect's destination URL with the redirect URL. The word "hijack" is commonly used to describe this problem, but redirects and refreshes are often implemented for click counting, and in some cases lead to a webmaster "hijacking" his or her own URLs.
Normally in these cases, a search for cache:[destination URL] in Google shows "This is G o o g l e's cache of [redirect URL]" and oftentimes site:[destination domain] lists the redirect URL as one of the pages in the domain.
Also link:[redirect URL] will show links to the destination URL, but this can happen for reasons other than "hijacking".
Searching Google for the destination URL will show the title and description from the destination URL, but the title will normally link to the redirect URL.
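A quick way to see whether a suspect URL is redirecting, and where it points, is to inspect the HTTP status and Location header. A minimal sketch (the URLs are placeholders, and the function takes the status and headers rather than doing a live fetch):

```python
# Sketch: classify one HTTP response as a redirect or not, and resolve
# the Location header against the requesting URL. Hypothetical inputs.
from urllib.parse import urljoin

def classify_redirect(status, headers, base_url):
    """Return (is_redirect, absolute_target) for one HTTP response."""
    if status in (301, 302, 303, 307, 308):
        # Location may be relative; resolve it against the redirecting URL.
        target = headers.get("Location", "")
        return True, urljoin(base_url, target)
    return False, None
```

A 302 with a Location header pointing at your page is the pattern described in the threads above.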
There has been much discussion on the topic, as can be seen from the links below.
How to Remove Hijacker Page Using Google Removal Tool [webmasterworld.com]
Google's response to 302 Hijacking [webmasterworld.com]
302 Redirects continues to be an issue [webmasterworld.com]
Hijackers & 302 Redirects [webmasterworld.com]
Solutions to 302 Hijacking [webmasterworld.com]
302 Redirects to/from Alexa? [webmasterworld.com]
The Redirect Problem - What Have You Tried? [webmasterworld.com]
I've been hijacked, what to do now? [webmasterworld.com]
The meta refresh bug and the URL removal tool [webmasterworld.com]
Dealing with hijacked sites [webmasterworld.com]
Are these two "bugs" related? [webmasterworld.com]
site:www.example.com Brings Up Other Domains [webmasterworld.com]
Incorrect URLs and Mirror URLs [webmasterworld.com]
302's - Page Jacking Revisited [webmasterworld.com]
Dupe content checker - 302's - Page Jacking - Meta Refreshes [webmasterworld.com]
Can site with a meta refresh hurt our ranking? [webmasterworld.com]
Google's response to: Redirected URL [webmasterworld.com]
Is there a new filter? [webmasterworld.com]
What about those redirects, copies and mirrors? [webmasterworld.com]
PR 7 - 0 and Address Nightmare [webmasterworld.com]
Meta Refresh leads to ... Replacement of the target URL! [webmasterworld.com]
302 redirects showing ultimate domain [webmasterworld.com]
Strange result in allinurl [webmasterworld.com]
Domain name mixup [webmasterworld.com]
Using redirects [webmasterworld.com]
redesigns, redirects, & google -- oh my [webmasterworld.com]
Not sure but I think it is Page Jacking [webmasterworld.com]
Duplicate content - a google bug? [webmasterworld.com]
How to nuke your opposition on Google? [webmasterworld.com] (January 2002 - when Google's treatment of redirects and META refreshes were worse than they are now)
Hijacked website [webmasterworld.com]
Serious help needed: Is there a rewrite solution to 302 hijackings? [webmasterworld.com]
How do you stop meta refresh hijackers? [webmasterworld.com]
Page hijacking: Beta can't handle simple redirects [webmasterworld.com] (MSN)
302 Hijacking solution [webmasterworld.com] (Supporters' Forum)
Location: versus hijacking [webmasterworld.com] (Supporters' Forum)
A way to end PageJacking? [webmasterworld.com] (Supporters' Forum)
Just got google-jacked [webmasterworld.com] (Supporters' Forum)
Our company Lisiting is being redirected [webmasterworld.com]
This thread is for further discussion of problems due to Google's 'canonicalisation' of URLs, when faced with HTTP redirects and HTML META refreshes. Note that each new idea for Google or webmasters to solve or help with this problem should be posted once to the Google 302 Redirect Ideas [webmasterworld.com] thread.
<Extra links added from the excellent post by Claus [webmasterworld.com]. Extra link added thanks to crobb305.>
[edited by: ciml at 11:45 am (utc) on Mar. 28, 2005]
All I have is a bunch of sweat equity in my site and a wife that thinks I'm nuts. Did you ever try to explain this to an outsider?
I started this whole thing because people at work tell me that writing is a strength of mine. I'm asked to write on all kinds of topics, so why not write for myself?
When I was looking for my first job all those years ago, someone gave me a gift - The Psychology of Winning. A couple of things stuck with me all these years and one had to do with watching Television...
Paraphrasing:
When you watch TV, you are watching entertainers working. They are getting rich because so many people are willing to sit idly by and watch them. When you watch TV, you are watching entertainers working. Why invest in them, when you could invest in yourself?
I try to put all my time to good use, whether it be with my family or working on this project. I'd rather talk to a person than stare at the boob tube. I love to learn, and TV just doesn't do it for me.
Is it okay to use redirects for statistics purposes when the redirect link goes through your cgi-bin AND you block all robots from links to your cgi-bin in your robots.txt file?
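The setup described in that question is usually expressed with a robots.txt fragment like the following (the /cgi-bin/ path is an assumption; adjust it to wherever the counting scripts actually live):

```
User-agent: *
Disallow: /cgi-bin/
```

This only keeps compliant robots from crawling the redirect scripts; it does not change how a redirect that has already been crawled is treated.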
Google Removal Tool.
First you need an e-mail address to register; then reply to the auto-generated response.
When you log in, you get an 'options' page.
Please keep in mind that submitting via the automatic URL removal system will cause a temporary (six-month) removal of your site from the Google index. You may review the status of submitted requests in the column to the right.
There are 4 options:
1. "Remove pages, subdirectories or images using a robots.txt file"
2. "Remove a single page using META tags"
3. "Remove an outdated link"
4. "Remove your usenet post from Google Groups"
The first option is the one to use to clean the cgi-bin URLs out of the Google index. It links to a page with a box to type in the URL of your robots.txt file (example provided).
Before you do this it is essential to check your robots.txt file for any errors. Whatever is disallowed will be removed for six months, so if you have
Disallow: /
your site will be removed from Google for six months,
but if you have
Disallow: /cgi-bin/
all the 302s or "URL only" entries from your cgi-bin will be removed for six months, and they will never be indexed again if you leave that disallow in place.
So it is critical to understand what your robots.txt is allowing and disallowing before you submit it to Google.
After you submit, you will be given a 'success' page with a 'view options' link on it.
This takes you back to the original 'options' page, where you will see in the grey area what will be removed within 48 hours or so (when Googlebot visits).
The entries will show as pending, so if you messed up and something you don't want removed is in there pending, you can alter your robots.txt before the removal bot visits.
Let's say it finds a bunch of stuff from your cgi-bin that you didn't want removed: you could alter your robots.txt file so that the cgi-bin is allowed.
Then you will get 'request denied' and nothing is removed.
I would recommend running your robots.txt through a validator, like the one here at WebmasterWorld.
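Beyond a validator, you can sanity-check a robots.txt locally with Python's standard-library parser before submitting it to the removal tool. A sketch, using an illustrative rules file and hypothetical URLs:

```python
# Sketch: check which URLs a robots.txt blocks, using the stdlib parser.
# Anything can_fetch() rejects is what a removal request would act on.
from urllib.robotparser import RobotFileParser

rules = """User-agent: *
Disallow: /cgi-bin/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "http://example.com/cgi-bin/redir?url=x"))  # False: would be removed
print(rp.can_fetch("*", "http://example.com/index.html"))           # True: left alone
```

Running this over a list of your own URLs shows exactly what the disallow rules cover, with no surprises after submission.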
Another option is a tool called 'Poodle Predictor', a good diagnostic for crawling your own site. It does a good job of mimicking Googlebot.
One guy had a site that was doing fine in MSN and Yahoo, but Googlebot would just ask for robots.txt and /, then leave. All that was in the index was a 14-month-old cache of his homepage, 'under construction'.
Well, Poodle Predictor showed a '500' for that page because the server wasn't returning a 'Last-Modified' date. That was the problem with Googlebot. So it would be a good idea to use that tool and make sure everything looks OK.
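The missing-header condition described above is easy to test for yourself. A minimal sketch, where the headers dict stands in for a real HTTP response:

```python
# Sketch: flag a response that is missing a Last-Modified header,
# the condition blamed for the crawl problem above. Inputs are stand-ins.
def missing_last_modified(headers):
    # HTTP header names are case-insensitive; normalize before checking.
    names = {k.lower() for k in headers}
    return "last-modified" not in names
```

Feeding this the headers from your own server's responses (e.g. from a HEAD request) shows whether the date is being sent.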
BTW - I did use this method to clean out my cgi-bin and it worked fine - and google still crawls my site.
I did alter the robots.txt file and got 'request denied', so I had to re-submit the non-existent 301s that I had disallowed in my robots.txt file and get Google to remove them. I just didn't wait long enough (I waited 24 hours) before cleaning out my robots.txt file (removing the disallows for files that don't exist).
Let's say www.badguy.com/sites/site123 points to my page
www.mysite/mypage.html.
I check his header and sure enough, an unauthorized 302 redirect.
Suppose I ALTER my filename /mypage.html to something else, forcing
temporary 404 errors.
THEN I use the G Remove Tool option #3 to "KILL THE LINK"
Would that effectively nuke the 302 redirect for that one page at least?
It seems a lot safer, just wondering if it would work at all. -Larry
Hi Reid: regarding 3, "remove an outdated link" - would option #3 work?
I'm frankly scared of using robots.txt for these purposes
Option #2 works if you put the META tag
<meta name="googlebot" content="noindex,nofollow">
in the page header.
This option is for those who do not (or cannot) have a robots.txt file
Option #3 works for pages which no longer exist (they must return a 404).
As for removing 302 redirects pointing at your page from another site: what we were doing is fooling the removal tool by making the target of the 302 return a 404, or by using the META tag on the target page, and then submitting the other guy's URL (the 302 pointing at your page) to the removal tool. So for 302s you are stuck with option 2 or 3, since the removal bot probably won't see your robots.txt file when it follows the 302 to your site. And you can't disallow a URL from another site in your robots.txt (I would like to try this on the removal tool, though).
I did use the robots.txt method to remove some old non-existent files that reappeared after the last update.
I was using .htaccess to 301 these URLs to existing pages, and they came out of nowhere and appeared in the index.
I just disallowed these non-existent files in robots.txt and submitted it; worked like a charm.
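The .htaccess 301 mentioned above would typically look something like this, assuming Apache's mod_alias is available (both paths here are hypothetical):

```
# .htaccess sketch: permanently redirect a retired URL to its replacement
Redirect 301 /old-page.html http://www.example.com/new-page.html
```

The robots.txt disallow then handles removal from the index, while the 301 keeps human visitors landing on the right page.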
Not sure how it would take this, but I would like to try it:
disallow: ht*p://w*w.badguys302.php
Before submitting robots.txt to google it is critical that you know your robots.txt is flawless and you understand EXACTLY what it does.
It's a lot like updating firmware. Scary but exhilarating.
Option #1: submit your robots.txt URL
Option #2: submit the URL of the page (with the META tag)
Option #3: submit the URL of the page (404)
After you submit, you get 'pending removal' in the grey bar on the options page of the removal tool.
You must leave the page or robots.txt in the state it was in until you get 'complete' status in the grey area of the removal tool 'options' page. Otherwise you will get 'request denied'.
1. Submit
shows in the grey area as 'pending removal'
2. Within 5 days the removal bot visits the robots.txt or page (whatever you submitted); mine was done within 48 hours.
shows in grey area as 'complete'
If you remove the META tag, 404 or alter your robots.txt before removalbot visits, you will get 'request denied'.
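The pre-submission check implied by the steps above can be sketched in a few lines: confirm the page is in a removable state (a 404, or the googlebot noindex META tag) before filing the request. The status and HTML here stand in for a real fetch:

```python
# Sketch: verify a page is in a state the removal tool will accept,
# per options #2 and #3 above. Inputs stand in for a live request.
import re

# Matches the googlebot noindex META tag, with either quote style.
META_RE = re.compile(
    r'<meta\s+name=["\']googlebot["\']\s+content=["\']noindex', re.I)

def removable(status, html=""):
    """True if the page 404s or carries the googlebot noindex tag."""
    return status == 404 or bool(META_RE.search(html))
```

Running this against the page immediately before and after submitting helps avoid the 'request denied' outcome described above.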
The robots.txt method is by far superior, because I was able to leave the 301 in my .htaccess file for the non-existent pages and just use robots.txt to remove them.
You can just leave the robots.txt entry there as long as you like, but if I want to remove a 302 pointing at my index page, I don't want the META tag or the 404 condition on it for 5 days waiting for the removal bot. (What if the REAL Googlebot visits?)
That is why if
disallow: ht*tp://w*w.baguysURL.php
works on the removal tool then this would be the far better option.
Of course, you submit www.badsite.com/redir.php?url=www.yoursite.com to Google console. You know perfectly well that you must not send www.yoursite.com or you'll remove your own site.
But what if someone else submitted your site to url console during this time?
BTW, did anyone try Disallow: /?
Maybe I got delisted for doing it?
For the ODP they have stretched 650,000 categories, 650,000 category charters, 70,000 profiles, and 2,000 informational pages into more than 11,000,000 listings. Where did the additional 9,600,000 entries come from?
The site: command is now truly broken by Google trying to filter 302 redirects out of the results (rather than removing them from the database). You cannot get to see 1000 results for any search term, even those reporting millions of matches.
>> Since one of the heuristics to pick a canonical site was to take PageRank into account.. <<
Yes, but they should be comparing PR of real pages, not the PR of the entry point of a redirect, that entry point being just a URL. The redirect-start-URL is not a real page.
>> The problem is not consistent. The only consistent thing is Google calls links "pages". As long as they do that, problems of many kinds will occur. <<
Yes, you can also link to a page, add whatever dynamic strings you want, and totally rename the target page in the SERPs if the linking page has enough PR: www.yoursite.com/shiny.widgets.html?this-product-is-junk-do-not-buy-it works, and that is scary. Google doesn't ask the target server what the page is called; it lists the page as having whatever name was on the link that it followed to get to it.
I did that to a page on a site that had information that was four years out of date on it. The webmaster refused to admit that printing very old contact information, where nearly every telephone number and email address in it had an error, was wasting people's time. I replaced the URL in the SERPs with www.domain.com/contact.list.html?this-page-is-four-years-out-of-date and linked to it from two PR 6 pages, and within a week the URL was changed in the SERPs. After a further 6 months, the site owner eventually updated the page information with what had been emailed to him every 3 months for the last 3 years.
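The fix the posts above are arguing for amounts to canonicalizing decorated links before counting them as distinct pages. A minimal sketch of that idea: strip the query string so every decorated variant collapses to one URL (this is an illustration of the principle, not how Google actually does it):

```python
# Sketch: collapse query-string-decorated links to one canonical URL,
# so ...html?any-text-here and ...html count as the same page.
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    parts = urlsplit(url)
    # Drop query and fragment; keep scheme, host, and path.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
```

Any junk appended after a '?' disappears, so the renaming trick described above would stop working.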
Reid,
didn't GoogleGuy say not to try this ;)?
steveb: NOTE: Do not submit your own site to our url removal tool in an attempt to force a canonical url. I repeat, do not submit your own site to our url removal tool. Using the url removal tool was some idea that a WebmasterWorld member came up with and started talking about. I just talked with user support about a reinclusion request, and using the url removal tool on your own site will *not* help. All it will do is remove your site for six months.
Very few people used the url removal tool to take out their own sites, so I can try to gather some people into one group and ask someone if we can do anything on our end.
For the person who asked about the url removal tool: its removal for six months, not 90 days. I understand how someone thought it might help to try the url removal tool, but please don't use it on one's own site. arubicus, did you say you saw weird behavior with www vs. non-www or trailing slashes vs. without?
steveb:
"If you remove the META tag, 404 or alter your robots.txt before removalbot visits, you will get 'request denied'."
Definitely not true of the META tag. You can (and should) remove it immediately... so the tag would only be on the page for five seconds or so.
Yeah, I believe you are right about that: with options 2 and 3 (META or 404) you get instant results, but with option 1 (robots.txt) you've got to wait for the bot.
That's why I like option 1: it tells you what it is going to do (so you still have a chance to change robots.txt if you want).
Failsafe robots.txt (to cause all your removal requests to be denied):
User-agent: *
Disallow:
The other options, 2 and 3, merely tell you what you've done already.
shurik:
BTW, did anyone try Disallow: /?
Maybe I got delisted for doing it?
Funny, I didn't notice this before. Steveb, you're right: URLs like that are an open invitation for duplicate "page" creation.
I'm sure you can multiply the number of real dmoz pages by at least two due to different spellings of the URLs in links. Yet another case (pun not intended) where a URL does not equal a page.
If you put the line indicated (Disallow: /) in your robots.txt, you were probably deleted from the indexes. It tells the bots "do not index me", and if used with the robots exclusion tool at Google, it removes all pages of your site from the Google index in less than 48 hours.
Remove that line from your robots.txt!
Then submit a reinclusion request via google.com/support/
Every time it happens it gets my hopes up, but after a few minutes I remember: oh, I have seen this before.
When was that, Shurik? Did you make sure the site was clean (by G standards)?
Are you banned (as in NOT in the index), or do you just have bad rankings? Two very different things...
Regarding Google support telling us "no penalty":
We did NOT lose the home page, but we did appear to lose about 100k indexed pages of about 350k total (though frankly I'm increasingly skeptical about learning much from "site:oursite.com").
Some are back in index now but Google traffic remains at about 5% of pre Feb 2 level. Yahoo traffic fairly stable.
Shurik -
I misread what you meant with that question mark. You had the mark in your robots.txt as in this: "disallow /?"
I don't know how the bot would interpret that instruction. If your site is completely gone it appears it ignored the question mark. Search for "robots exclusion protocol" for details on syntax.
As for "Disallow: /?" - I read the specs before attempting it. Nothing special was mentioned about the "?" character, and Google's extended robots.txt syntax does not mention any special meaning of "?" either.
Actually yes, definitely. You should know by now that *indexed* means nothing to Google. They count URLs.
1.28 million "pages" in the Google index due to the report abuse link alone:
[google.com...]