Sometimes, an HTTP status 302 redirect or an HTML META refresh causes Google to replace the redirect's destination URL with the redirect URL. The word "hijack" is commonly used to describe this problem, but redirects and refreshes are often implemented for click counting, and in some cases lead to a webmaster "hijacking" his or her own URLs.
Normally in these cases, a search for cache:[destination URL] in Google shows "This is G o o g l e's cache of [redirect URL]" and oftentimes site:[destination domain] lists the redirect URL as one of the pages in the domain.
Also link:[redirect URL] will show links to the destination URL, but this can happen for reasons other than "hijacking".
Searching Google for the destination URL will show the title and description from the destination URL, but the title will normally link to the redirect URL.
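To illustrate the two mechanisms described above, here is a minimal Python sketch that reports whether a page redirects by HTTP status (a 301 or 302 with a Location header) or by META refresh, and whether the target sits on another domain. The function names and the example.com / example.net URLs are hypothetical, and nothing here reflects how Google itself processes redirects.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class MetaRefreshFinder(HTMLParser):
    """Collects the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attr = {k.lower(): (v or "") for k, v in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        # content looks like "0; url=http://www.example.com/page.html"
        _, _, rest = attr.get("content", "").partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url" and value:
            self.target = value.strip().strip("'\"")


def redirect_target(page_url, status, location=None, html=""):
    """Return (kind, absolute_target), or (None, None) if nothing redirects."""
    if status in (301, 302) and location:
        return str(status), urljoin(page_url, location)
    finder = MetaRefreshFinder()
    finder.feed(html)
    if finder.target:
        return "meta-refresh", urljoin(page_url, finder.target)
    return None, None


def is_cross_domain(page_url, target_url):
    """True when the redirect leaves the host it started on."""
    return urlsplit(page_url).hostname != urlsplit(target_url).hostname
```

A cross-domain "302" or "meta-refresh" result is the pattern this thread is about. A same-domain result is more likely a click counter or an ordinary site redirect.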
There has been much discussion on the topic, as can be seen from the links below.
How to Remove Hijacker Page Using Google Removal Tool [webmasterworld.com]
Google's response to 302 Hijacking [webmasterworld.com]
302 Redirects continues to be an issue [webmasterworld.com]
Hijackers & 302 Redirects [webmasterworld.com]
Solutions to 302 Hijacking [webmasterworld.com]
302 Redirects to/from Alexa? [webmasterworld.com]
The Redirect Problem - What Have You Tried? [webmasterworld.com]
I've been hijacked, what to do now? [webmasterworld.com]
The meta refresh bug and the URL removal tool [webmasterworld.com]
Dealing with hijacked sites [webmasterworld.com]
Are these two "bugs" related? [webmasterworld.com]
site:www.example.com Brings Up Other Domains [webmasterworld.com]
Incorrect URLs and Mirror URLs [webmasterworld.com]
302's - Page Jacking Revisited [webmasterworld.com]
Dupe content checker - 302's - Page Jacking - Meta Refreshes [webmasterworld.com]
Can site with a meta refresh hurt our ranking? [webmasterworld.com]
Google's response to: Redirected URL [webmasterworld.com]
Is there a new filter? [webmasterworld.com]
What about those redirects, copies and mirrors? [webmasterworld.com]
PR 7 - 0 and Address Nightmare [webmasterworld.com]
Meta Refresh leads to ... Replacement of the target URL! [webmasterworld.com]
302 redirects showing ultimate domain [webmasterworld.com]
Strange result in allinurl [webmasterworld.com]
Domain name mixup [webmasterworld.com]
Using redirects [webmasterworld.com]
redesigns, redirects, & google -- oh my [webmasterworld.com]
Not sure but I think it is Page Jacking [webmasterworld.com]
Duplicate content - a google bug? [webmasterworld.com]
How to nuke your opposition on Google? [webmasterworld.com] (January 2002 - when Google's treatment of redirects and META refreshes was worse than it is now)
Hijacked website [webmasterworld.com]
Serious help needed: Is there a rewrite solution to 302 hijackings? [webmasterworld.com]
How do you stop meta refresh hijackers? [webmasterworld.com]
Page hijacking: Beta can't handle simple redirects [webmasterworld.com] (MSN)
302 Hijacking solution [webmasterworld.com] (Supporters' Forum)
Location: versus hijacking [webmasterworld.com] (Supporters' Forum)
A way to end PageJacking? [webmasterworld.com] (Supporters' Forum)
Just got google-jacked [webmasterworld.com] (Supporters' Forum)
Our company Lisiting is being redirected [webmasterworld.com]
This thread is for further discussion of problems due to Google's 'canonicalisation' of URLs, when faced with HTTP redirects and HTML META refreshes. Note that each new idea for Google or webmasters to solve or help with this problem should be posted once to the Google 302 Redirect Ideas [webmasterworld.com] thread.
<Extra links added from the excellent post by Claus [webmasterworld.com]. Extra link added thanks to crobb305.>
[edited by: ciml at 11:45 am (utc) on Mar. 28, 2005]
USE EXTREME CAUTION with this - it's very powerful and *if you accidentally disallow* your entire site or directories it will delete them from Google's index in 1-2 days.
We used this successfully to remove pages listed under site:oursite.com that seemed to be created by our own 302 redirects to affiliate sites. To do this we excluded our CGI directory in robots.txt and then submitted the revised robots.txt to Google via the tool. We also deleted some entire domains we did not want indexed. But use this cautiously.
If only one could get reincluded as easily we'd have a happier thread here.
[edited by: joeduck at 5:46 am (utc) on April 23, 2005]
To do this we excluded our CGI directory in robots.txt and then submitting the revised robots.txt
So I use the removal tool to remove robots.txt, because Google has it cached and this will force Googlebot to refetch it?
Or are you talking about submitting robots.txt for inclusion into google?
To get this thread back on topic now, since I've been following it since the beginning of the MEGA thread about Google 302s started by Japanese:
By the way, Japanese wasn't 'kicked out' of WebmasterWorld for posting something wrong, as someone said earlier. He got mad and left because Brett snipped one of his posts (TOS).
If you read the 700+ thread, Brett did tell him he is still welcome here, but he hasn't come back.
Anyway, we were using the URL removal tool in this fashion:
1. Block the hijacked page (your page) from Googlebot in robots.txt (or with a META tag on the page).
2. Submit the hijacker's URL to the removal tool (which will show a 404 because Googlebot is blocked from seeing the page the 302 points at). The 302 URL gets removed from Google's index.
3. Remove the robots.txt entry (or the META tag) so that Googlebot can continue crawling as it should.
This would remove the badguy's 302 URL for 90 days, until Googlebot finds the 302 on the badguy's site again and the whole cycle starts over.
Google has extended the 90 days to 6 months. This is a good thing because it gives them lots of time before these URLs come back to haunt.
Some guys went beyond this good advice, tried to 'clean up' their SERPs and started working on their own sites. This is when the thread got out of hand.
Some ended up removing their entire site because they submitted the non-www version of their own site.
And a few other mistakes were made, one guy put
he submitted the hijackers URL and his own site went crashing down.
So to some the 6 months is really BAD news.
Googleguy shows up and helps some of these guys that messed it up get reincluded. But the 302 thing......
It seems that we did do the right thing - most of us did-- and google seems to have fixed something to do with 302's.
There are a few threads going about 301 problems since that fix. I even had some old long gone 301's re-appear. Some people have had old long gone DOMAINS reappear due to the 301 thing, no casualties yet but a lot of nervous webmasters.
So I assume the 302 thing is fixed but I want to nuke those old 301's with the removal tool. Along with a couple of my own stray cgi url's floating around in the SERP's.
[edited by: Reid at 6:01 am (utc) on April 23, 2005]
You could put a disallow in your robots.txt and then submit the robots.txt via the tool. Those pages will be deleted in 24-48 hours.
In our case it appeared that robots.txt was ignored during an update, leaving us with many odd pages, but then followed when we did the exclusion via robots.txt.
Again I urge caution to anybody using this. A simple typo in your robots.txt could blast your site out of the indexes entirely.
The one that does the indexing will probably see an URL somewhere on the internet and include that URL in the giant link pool, which means that it becomes part of the index. Later, another googlebot comes by and tries to spider that URL, but as it finds out that it's "robots.txt'ed" it gives up so the indexed URL becomes an "URL only" listing.
As always, the spider adds links, it doesn't remove them. So, to get the links that the spider bot found removed you have to ask the removal bot.
If your "robots.txt" file already includes the links that you want to remove you don't need to change it in any way. Just submit it to the removal bot (URL Console) and that one will go through all of your links and remove those that are in conflict with your "robots.txt" file.
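Since several posts warn that one typo in robots.txt can remove a whole site, here is a small Python sketch of a dry run before submitting anything. It uses the standard library's `urllib.robotparser` to list which of your known URLs a given robots.txt would disallow for Googlebot. This is only a rough model of the behaviour described above, not the removal tool's actual logic, and the function name and example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def urls_flagged_for_removal(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    known_urls = [
        "http://www.example.com/",
        "http://www.example.com/cgi-bin/go.cgi",
        "http://www.example.com/widgets.html",
    ]
    intended = "User-agent: *\nDisallow: /cgi-bin/\n"
    typo = "User-agent: *\nDisallow: /\n"  # one missing path: whole site blocked

    print("intended:", urls_flagged_for_removal(intended, known_urls))
    print("typo:    ", urls_flagged_for_removal(typo, known_urls))
```

If the list for your real robots.txt contains anything you want to keep in the index, fix the file before submitting it.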
Actually this is a pretty smart thing Google has, and it usually works like a charm. I often wish that the other SE's had an URL Console like it, as it's often close to impossible to get URL's removed from, say, Yahoo (several months of 404 or 410 seems to be needed).
I find that a lot of dead pages have a "%20", probably partly due to the reason mentioned above. I suppose hijackers and other ne'er-do-wells can exploit this phenomenon to make their URLs harder to remove. Can the removal tool be tweaked to accept a URL with a %20?
Thanks for a nice summary there...
Not sure if I understand the question.
Your use of it seems to be a very clever way to delete hijacker pages, but different from ours which was more conventional and based on g1smd's experiences (thanks g1smd!).
Claus' explanation jibes with some of our experiences: basically, you have two independent things here. Spidering adds pages in somewhat unpredictable ways, and robots exclusion very quickly deletes Google-indexed pages according to the robots.txt exclusion protocol.
No, not remove the robots.txt file! Simply spider the robots.txt file and remove from the index all the folders and pages mentioned in that file.
>> Or are you talking about submitting robots.txt for inclusion into google? <<
>> I don't understand why the removal tool causes a page to be removed for six months. The removal tool will not work at all if it finds the page. Following that logic, shouldn't the page be re-indexed as soon as Googlebot comes across it again? <<
Think carefully about this. Every URL that Google has ever seen in a link is recorded in a database. Some of those URLs show in the public index. Many do not. Pages that are spam, pages that no longer exist, malformed links, and stuff that has "noindex" assigned to it are still recorded as URLs along with their status. It has to be so, otherwise every time a bot came across your link again it would simply re-add it to the public index. So, such URLs are in a "secret store" and should not be visible in the public index. So that every page of the web isn't re-checked every 30 seconds, the data has an expiry date on it, for re-checking. If you remove pages using the URL controller, then they are flagged to not be included in public results for 6 months. After that time the robots.txt file is looked at again. If the URL is still in robots.txt then I assume it stays out of the public index, otherwise the URL itself is spidered for possible reinclusion.
For the last few days I have seen some of my site in the SERPs again, still only 5-10%, like before, so now I'm not sure whether it is just Google's old SERPs or my site is really back again. Today Googlebot came by 14 times. I don't think it would do that if the site had been removed completely with the removal tool.
At your and Google support's suggestion we have used 301 redirects to fix 302 and canonical problems.
However, our pages with the 301s still show up even after the new ones have been spidered - ie the index now shows even more unintentional duplicate content than before (20k+ pages across 3 geographic sections of our site).
Any suggestions how to avoid duplicate penalties here?
We don't want to kill the OLD 301 redirected pages because they have links and PR we want to pass to the new pages.
If your server is correctly set up, then anything that attempts to access yourdomain.com will be redirected to www.yourdomain.com with an HTTP status of "301 Moved Permanently". The content at the old location will not actually be accessed at all.
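One way to confirm that the server really behaves this way is to request the non-www URL without following the redirect and look at the status and Location header. The Python sketch below does that with the standard library's `http.client`. The helper names are hypothetical and yourdomain.com is the placeholder from the post above.

```python
import http.client
from urllib.parse import urlsplit


def fetch_status(url, timeout=10):
    """HEAD the URL without following redirects; return (status, Location)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def is_canonical_301(status, location, canonical_host):
    """True only for a 301 whose Location points at the canonical host."""
    if status != 301 or not location:
        return False
    return urlsplit(location).hostname == canonical_host.lower()
```

For example, `is_canonical_301(*fetch_status("http://yourdomain.com/"), "www.yourdomain.com")` should be true on a correctly configured server. A 302 or a 200 on the non-www URL means the redirect is not doing what the post describes.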
If Google is crawling only the new URLs then it will not have seen the redirect. You need to make a list of all the URLs in the index that should not be there and make a page of links pointing to those URLs. Put that big list on another site. Google will find the list, crawl the URLs, and see the redirects. Allow at least a few weeks for the old URLs to drop out of the index after that.
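The page of links suggested above can be generated rather than typed by hand. This is a minimal Python sketch that turns a list of old URLs into a plain HTML list of links. The function name, page title and sample URLs are made up for illustration, and whether a spider treats such a page kindly is a separate question.

```python
from html import escape


def build_link_page(urls, title="Old URLs"):
    """Return a bare HTML page with one link per URL."""
    items = "\n".join(
        '<li><a href="{0}">{0}</a></li>'.format(escape(url, quote=True))
        for url in urls
    )
    return (
        "<html><head><title>{}</title></head>\n"
        "<body><ul>\n{}\n</ul></body></html>\n"
    ).format(escape(title), items)


if __name__ == "__main__":
    old_urls = [
        "http://yourdomain.com/old.html",
        "http://yourdomain.com/list.cgi?cat=1&page=2",
    ]
    print(build_link_page(old_urls))
```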
Take a look at the cache of some of the old 301s, if it shows. Check the date.
What we have seen is that Google has spidered 301s and, instead of just showing the URL, it has been showing the full title, description and cached content of the new pages they redirect to. This cached content is of our newest design, which the 301'd URL has never served at all. Under a 301 this shouldn't happen. A 301 says page A is now gone and B is the new location: get rid of A, show B.
We have seen a bunch of non-www URLs fully listed (cached pages) that have always been under a 301 redirect. They have ALWAYS returned a 301.
Normally [1], a 301 redirect will take at least one full month to propagate to the SERPs. If they have changed the way they handle 301s, I do not believe that this time frame is now shorter than before.
So, a 301 takes time. This, of course, does not explain wrong listings for URL's that have always been 301's.
1) That is, at least until a few months ago. I haven't done any 301's in a few months.
I have never checked where the robot is going, and I'm not even sure how to do that. I use Urchin.
Take a look here for info about how to read it
you can just do a quick scan and see what files googlebot is requesting.
If it is continually requesting the same files you have a problem, but if it is requesting a lot of different files then you can break open the champagne.
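For anyone with access to raw server logs, the quick scan described above can be sketched in Python. This assumes the common Apache "combined" log format and identifies Googlebot only by the user-agent string, which can be faked, so treat the counts as a rough guide. The function name and the regular expression are illustrative, not taken from any log tool.

```python
import re
from collections import Counter

# Matches the request, status, size, referer and user-agent fields of a
# "combined" format log line (assumed format; adjust for your server).
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_requests(lines):
    """Count (path, status) pairs for requests whose agent mentions Googlebot."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "googlebot" in match.group("agent").lower():
            counts[(match.group("path"), int(match.group("status")))] += 1
    return counts


def summarize(log_path):
    """Print the most frequently requested paths from a log file."""
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for (path, status), hits in googlebot_requests(handle).most_common(20):
            print(hits, status, path)
```

Many distinct paths with status 200 suggest a normal crawl. The same handful of paths over and over, or mostly 301 and 404 responses, is the pattern to worry about.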
With the latest update these pages with the original cache were showing up in site:mysite.
I just commented out the .htaccess lines regarding these pages and nuked them with the removal tool.
Waiting to see what happens with fingers crossed.
Upon reflecting I think if I could do it over I would put a disallow in robots.txt for these URL's and submit the robots.txt instead.
I'm still worried this could bother Google since the index reflects a lot of dupes.
g1 sounds good except that we've got so many redirects (tens of thousands at server level) that I'd be worried the spider would see that ocean of new links as spammy.
I think we'll sit tight and hope the 301s shake out in the coming update.
So this time I inserted a disallow in robots.txt for these nonexistent files and submitted robots.txt. We'll see what happens. I left the .htaccess 301s in place this time, since the robots.txt will stop Googlebot from requesting them in the first place.
crossed fingers again.
It seems to me that apart from trying to remove a URL from another site, the robots.txt method is the best way to go with this removal tool.
Whatever is disallowed in robots.txt will be removed from the index - even if it does not exist.
It even found a few images related to these files that no longer exist and included them in the removal request.
After filing all the DMCA stuff and writing all the long letters explaining exactly what these SEOs are knowingly doing, the only thing I've seen is that site bounce around in nowhere land, without titles and descriptions, or buried under pages of crap. So a friend told me to come and look at this thread, and I filed my little reinclusion request, and this morning woke up to a new listing for my site.... one that has my title and description, but again, isn't associated with my URL.
I'm just so burnt on thieves, I could spit. You can't even sue them because they aren't actually breaking any copyright laws.... unfair business practices, maybe, but not worth pursuing in court. As far as I can tell, Google can't seem to figure out how to properly index a 302, so I would suggest they simply stop trying to. If someone moves their site, that's their problem. If Googlebot is so smart it should find the new site on its own and index it. Period. But if they can't do it without completely screwing up a site that has never even moved servers or IPs, I would suggest they just don't try.
The number of my non-www pages in the index, which was inflated by about 300%, has been going down except for a small blip on April 22:
It would appear that--for my site, at least--Google is crunching its data (albeit slowly) and removing obsolete or duplicate URLs from its index in response to the redirect in my .htaccess file.
and this morning woke up to a new listing for my site.... one that has my title and description, but again, isn't associated with my URL.
Is this the only thing you get when you search site:yoursite?
Is the cache a picture of your page but with the wrong URL?
Is it an old outdated cache of your page?
If so this is classic 302 hijack and can be fixed by you.
Exactly. Why bother if it does nothing but decrease quality and frustrate webmasters anyway? Personally I've suggested many things ("treat them as a straight link", "treat cross-domain 302s differently", etc.), but totally ignoring them so that they don't even count as a vote would also be fine with me.
I will welcome most things except treating them as pages, as this is about the last thing they are. IMHO, Google should not index stuff that is not pages.
Any luck with your site, traffic-wise?
My rankings are back to normal or near-normal for most of the keyphrases that I track, but Google referrals are still way down, so it would appear that I'm still doing poorly for many "inside pages" that used to generate little traffic individually but fairly high traffic in the aggregate.
To put it another way, my Google referrals dropped about 75% on March 23. Now they're down about 70%. And my Yahoo-to-Google ratio is up to maybe 2.4:1 after being at a low of 2:1 a few weeks ago. That isn't a dramatic improvement, but at least things seem to be moving in a positive direction.