

Google's 302 Redirect Problem

         

ciml

4:17 pm on Mar 25, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



(Continuing from Google's response to 302 Hijacking [webmasterworld.com] and 302 Redirects continues to be an issue [webmasterworld.com])

Sometimes, an HTTP status 302 redirect or an HTML META refresh causes Google to replace the redirect's destination URL with the redirect URL. The word "hijack" is commonly used to describe this problem, but redirects and refreshes are often implemented for click counting, and in some cases lead to a webmaster "hijacking" his or her own URLs.

Normally in these cases, a search for cache:[destination URL] in Google shows "This is G o o g l e's cache of [redirect URL]" and oftentimes site:[destination domain] lists the redirect URL as one of the pages in the domain.

Also link:[redirect URL] will show links to the destination URL, but this can happen for reasons other than "hijacking".

Searching Google for the destination URL will show the title and description from the destination URL, but the title will normally link to the redirect URL.
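
A minimal sketch of how the two mechanisms mentioned above (an HTTP 302 and an HTML META refresh) can be told apart when looking at a response. The function name `classify_redirect`, the regular expression, and the example URLs are illustrative assumptions, not anything Google or a particular tool uses, and the META pattern is deliberately simplified.

```python
import re

# Simplified pattern for <meta http-equiv="refresh" content="0; url=...">.
# Real-world markup varies more than this; treat it as an illustration only.
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*'
    r'content=["\']?\s*\d+\s*;\s*url=([^"\'>\s]+)',
    re.IGNORECASE,
)

def classify_redirect(status, headers, body=""):
    """Return (kind, target): kind is '301', '302', 'meta-refresh' or None.

    `headers` is assumed to be a plain dict keyed by 'Location'.
    """
    location = headers.get("Location")
    if status in (301, 302) and location:
        return str(status), location
    match = META_REFRESH.search(body)
    if match:
        return "meta-refresh", match.group(1)
    return None, None
```

For example, `classify_redirect(302, {"Location": "http://www.example.com/page.html"})` reports a 302 pointing at the destination URL, which is the case this thread is about.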

There has been much discussion on the topic, as can be seen from the links below.

How to Remove Hijacker Page Using Google Removal Tool [webmasterworld.com]
Google's response to 302 Hijacking [webmasterworld.com]
302 Redirects continues to be an issue [webmasterworld.com]
Hijackers & 302 Redirects [webmasterworld.com]
Solutions to 302 Hijacking [webmasterworld.com]
302 Redirects to/from Alexa? [webmasterworld.com]
The Redirect Problem - What Have You Tried? [webmasterworld.com]
I've been hijacked, what to do now? [webmasterworld.com]
The meta refresh bug and the URL removal tool [webmasterworld.com]
Dealing with hijacked sites [webmasterworld.com]
Are these two "bugs" related? [webmasterworld.com]
site:www.example.com Brings Up Other Domains [webmasterworld.com]
Incorrect URLs and Mirror URLs [webmasterworld.com]
302's - Page Jacking Revisited [webmasterworld.com]
Dupe content checker - 302's - Page Jacking - Meta Refreshes [webmasterworld.com]
Can site with a meta refresh hurt our ranking? [webmasterworld.com]
Google's response to: Redirected URL [webmasterworld.com]
Is there a new filter? [webmasterworld.com]
What about those redirects, copies and mirrors? [webmasterworld.com]
PR 7 - 0 and Address Nightmare [webmasterworld.com]
Meta Refresh leads to ... Replacement of the target URL! [webmasterworld.com]
302 redirects showing ultimate domain [webmasterworld.com]
Strange result in allinurl [webmasterworld.com]
Domain name mixup [webmasterworld.com]
Using redirects [webmasterworld.com]
redesigns, redirects, & google -- oh my [webmasterworld.com]
Not sure but I think it is Page Jacking [webmasterworld.com]
Duplicate content - a google bug? [webmasterworld.com]
How to nuke your opposition on Google? [webmasterworld.com] (January 2002 - when Google's treatment of redirects and META refreshes was worse than it is now)

Hijacked website [webmasterworld.com]
Serious help needed: Is there a rewrite solution to 302 hijackings? [webmasterworld.com]
How do you stop meta refresh hijackers? [webmasterworld.com]
Page hijacking: Beta can't handle simple redirects [webmasterworld.com] (MSN)

302 Hijacking solution [webmasterworld.com] (Supporters' Forum)
Location: versus hijacking [webmasterworld.com] (Supporters' Forum)
A way to end PageJacking? [webmasterworld.com] (Supporters' Forum)
Just got google-jacked [webmasterworld.com] (Supporters' Forum)
Our company Lisiting is being redirected [webmasterworld.com]

This thread is for further discussion of problems due to Google's 'canonicalisation' of URLs, when faced with HTTP redirects and HTML META refreshes. Note that each new idea for Google or webmasters to solve or help with this problem should be posted once to the Google 302 Redirect Ideas [webmasterworld.com] thread.

<Extra links added from the excellent post by Claus [webmasterworld.com]. Extra link added thanks to crobb305.>

[edited by: ciml at 11:45 am (utc) on Mar. 28, 2005]

Reid

2:03 am on Apr 23, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



then submit the URL of the robots.txt file to their URL console

You meant to the URL submission console?
Not the url removal console?

joeduck

4:55 am on Apr 23, 2005 (gmt 0)

10+ Year Member



Reid - I'm pretty sure g1 means this tool at Google:
[services.google.com:8882...]

USE EXTREME CAUTION with this - it's very powerful and *if you accidentally disallow* your entire site or directories it will delete them from Google's index in 1-2 days.

We used this successfully to remove pages listed under site:oursite.com that seemed to be created by our own 302 redirects to affiliate sites. To do this we excluded our CGI directory in robots.txt and then submitted the revised robots.txt to Google via the tool. We also deleted some entire domains we did not want indexed. But use this cautiously.

If only one could get reincluded as easily we'd have a happier thread here.

[edited by: joeduck at 5:46 am (utc) on April 23, 2005]

Reid

5:23 am on Apr 23, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



To do this we excluded our CGI directory in robots.txt and then submitting the revised robots.txt

so I use the removal tool to remove robots.txt because google has it cached and this will force googlebot to refetch it?

Or are you talking about submitting robots.txt for inclusion into google?

302 RECAP
To get this thread back on topic now: I've been following it since the beginning of the MEGA thread about Google 302s started by Japanese.

By the way, Japanese wasn't 'kicked out' of WebmasterWorld for posting something wrong, as someone said earlier; he got mad and left because Brett snipped one of his posts (TOS).
If you read the 700+ message thread, you'll see Brett did tell him he is still welcome here, but he hasn't come back.

Anyway, we were using the URL removal tool in this fashion:
1. Block the hijacked page (your page) from Googlebot in robots.txt (or with a META tag on the page).
2. Submit the hijacker's URL to the removal tool, which will show a 404 because Googlebot is blocked from seeing the page the 302 points at. The 302 URL gets removed from Google's index.
3. Remove the robots.txt entry (or the META tag) so that Googlebot will be able to continue crawling as it should.

This would remove the bad guy's 302 URL for 90 days, until Googlebot finds the 302 on the bad guy's site again and the whole cycle starts over.

Google then extended the 90 days to 6 months. This is a good thing because it gives them lots of time before these URLs come back to haunt.
Some guys went beyond this good advice, tried to 'clean up' their SERPs and started working on their own sites; this is when the thread got out of hand.
Some ended up removing their entire site because they submitted the non-www version of their own site.
A few other mistakes were made. One guy put
User-agent: Googlebot
Disallow: /
in his robots.txt, submitted the hijacker's URL, and his own site went crashing down.
So to some the 6 months is really BAD news.
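
To make the difference between the targeted block in step 1 and the blanket "Disallow: /" mistake concrete, here is a rough sketch using Python's standard `urllib.robotparser`. The helper name `allowed`, the two robots.txt bodies, and the example.com URLs are assumptions for illustration; this only models how a well-behaved parser reads the file, not how Google's removal tool actually behaves.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, url, agent="Googlebot"):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

# Step 1 done carefully: only the hijacked page is blocked.
TARGETED = """\
User-agent: Googlebot
Disallow: /hijacked-page.html
"""

# The mistake described above: the whole site is blocked.
BLANKET = """\
User-agent: Googlebot
Disallow: /
"""

if __name__ == "__main__":
    for name, rules in (("targeted", TARGETED), ("blanket", BLANKET)):
        for path in ("/hijacked-page.html", "/index.html"):
            url = "http://www.example.com" + path
            print(name, path, "allowed" if allowed(rules, url) else "BLOCKED")
```

With the targeted rules only the hijacked page is blocked; with the blanket rules every page on the site is blocked, which is why a removal request made at that moment can take the whole site with it.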

GoogleGuy shows up and helps some of the guys who messed it up get reincluded. But the 302 thing......
It seems that we did do the right thing - most of us did-- and google seems to have fixed something to do with 302's.
There are a few threads going about 301 problems since that fix. I even had some old long gone 301's re-appear. Some people have had old long gone DOMAINS reappear due to the 301 thing, no casualties yet but a lot of nervous webmasters.
So I assume the 302 thing is fixed but I want to nuke those old 301's with the removal tool. Along with a couple of my own stray cgi url's floating around in the SERP's.

[edited by: Reid at 6:01 am (utc) on April 23, 2005]

joeduck

5:55 am on Apr 23, 2005 (gmt 0)

10+ Year Member



Reid - I think no to both your questions. They have a good description at the tool, but my understanding is that the tool simply gets a bot to revisit your robots.txt and follow the instructions. For example: you have pages that are appearing in Google that you _don't want to appear_. They are located in a directory called www.yoursite/cgi/

You could put a disallow in your robots.txt and then submit the robots.txt via the tool. Those pages will be deleted in 24-48 hours.

In our case it appeared that robots.txt was ignored during an update, leaving us with many odd pages, but then followed when we did the exclusion via robots.txt.

Again I urge caution to anybody using this. A simple typo in your robots.txt could blast your site out of the indexes entirely.

Reid

6:10 am on Apr 23, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



joeduck - so I guess it's a good idea to wait and see if Googlebot is handling your new robots.txt OK before you blast it with the removal tool, because if Googlebot can't get past that robots.txt file, you end up blasting everything.

claus

10:22 am on Apr 23, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Afaik, the Googlebot that does the removals is a different Googlebot from the one that does the indexing.

The one that does the indexing will probably see a URL somewhere on the internet and include that URL in the giant link pool, which means that it becomes part of the index. Later, another Googlebot comes by and tries to spider that URL, but when it finds out that the URL is "robots.txt'ed" it gives up, so the indexed URL becomes a "URL only" listing.

As always, the spider adds links, it doesn't remove them. So, to get the links that the spider bot found removed you have to ask the removal bot.

If your "robots.txt" file already includes the links that you want to remove you don't need to change it in any way. Just submit it to the removal bot (URL Console) and that one will go through all of your links and remove those that are in conflict with your "robots.txt" file.
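
As a dry run of the idea in the paragraph above, the following sketch lists which already-indexed URLs conflict with a robots.txt file and would therefore be candidates for removal. It relies on Python's standard `urllib.robotparser`; the function name `urls_to_remove`, the sample rules, and the URL list are made up for illustration, and the real removal bot's matching rules are not known here.

```python
from urllib.robotparser import RobotFileParser

def urls_to_remove(robots_txt, indexed_urls, agent="Googlebot"):
    """Return the indexed URLs that the robots.txt text disallows for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in indexed_urls if not parser.can_fetch(agent, url)]

# Hypothetical robots.txt excluding a CGI directory, as in joeduck's case.
ROBOTS = """\
User-agent: *
Disallow: /cgi/
"""

# Hypothetical URLs currently listed under site:www.example.com.
INDEXED = [
    "http://www.example.com/",
    "http://www.example.com/cgi/out?id=1",
    "http://www.example.com/products.html",
]

if __name__ == "__main__":
    for url in urls_to_remove(ROBOTS, INDEXED):
        print("would be removed:", url)
```

Running a check like this against a list of pages that must stay in the index is one way to catch a typo before submitting the file.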

---
Actually this is a pretty smart thing Google has, and it usually works like a charm. I often wish that the other SEs had a URL Console like it, as it's often close to impossible to get URLs removed from, say, Yahoo (several months of 404 or 410 seems to be needed).

Vec_One

3:37 pm on Apr 23, 2005 (gmt 0)

10+ Year Member



I don't understand why the removal tool causes a page to be removed for six months. The removal tool will not work at all if it finds the page. Following that logic, shouldn't the page be re-indexed as soon as Googlebot comes across it again?

Vec_One

3:37 pm on Apr 23, 2005 (gmt 0)

10+ Year Member



GoogleGuy,
URLs with spaces are indexed with "%20" but the removal tool doesn't recognize these URLs, presumably because it cannot find the pages.

I find that a lot of dead pages have a "%20", probably partly due to the reason mentioned above. I suppose hijackers and other ne'er-do-wells can exploit this phenomenon to make their URLs harder to remove. Can the removal tool be tweaked to accept a URL with a %20?
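
For reference, a URL with a literal space and the same URL with "%20" are two spellings of one address. The sketch below shows the conversion with Python's standard `urllib.parse`; the helper names `encode_spaces` and `same_url` and the example URL are hypothetical, and nothing here says how the removal tool itself handles such input.

```python
from urllib.parse import quote, unquote

def encode_spaces(url):
    """Percent-encode spaces (and other unsafe characters) without
    double-encoding a URL that already contains %20."""
    # Decode first so an existing %20 is not turned into %2520.
    return quote(unquote(url), safe=":/?&=#")

def same_url(a, b):
    """True when two URLs differ only in percent-encoding."""
    return unquote(a) == unquote(b)

if __name__ == "__main__":
    raw = "http://www.example.com/my page.html"
    print(encode_spaces(raw))
    print(same_url(raw, "http://www.example.com/my%20page.html"))
```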

joeduck

5:49 pm on Apr 23, 2005 (gmt 0)

10+ Year Member



Reid -

Thanks for a nice summary there...
Not sure if I understand the question.
Your use of it seems to be a very clever way to delete hijacker pages, but different from ours which was more conventional and based on g1smd's experiences (thanks g1smd!).

Claus' explanation jibes with some of our experiences. Basically, you have two independent things here: spidering, which adds pages in somewhat unpredictable ways, and robots exclusion, which very quickly deletes Google-indexed pages according to the robots.txt exclusion protocol.

g1smd

6:30 pm on Apr 23, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>> so I use the removal tool to remove robots.txt because google has it cached and this will force googlebot to refetch it? <<

No, not remove the robots.txt file! Simply spider the robots.txt file and remove from the index all the folders and pages mentioned in that file.

>> Or are you talking about submitting robots.txt for inclusion into google? <<

See above.


>> I don't understand why the removal tool causes a page to be removed for six months. The removal tool will not work at all if it finds the page. Following that logic, shouldn't the page be re-indexed as soon as Googlebot comes across it again? <<

Think carefully about this. Every URL that Google has ever seen in a link is recorded in a database. Some of those URLs show in the public index. Many do not. Pages that are spam, pages that no longer exist, malformed links, and stuff that has "noindex" assigned to it are still recorded as URLs along with their status. It has to be so, otherwise every time a bot came across your link again it would simply re-add it to the public index. So, such URLs are in a "secret store" and should not be visible in the public index. In order that every page of the web isn't re-checked every 30 seconds, the data has an expiry date on it, for re-checking. If you remove pages using the URL console, then they are flagged to not be included in public results for 6 months. After that time the robots.txt file is looked at again. If the URL is still in robots.txt then I assume it stays out of the public index, otherwise the URL itself is spidered for possible reinclusion.
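
The model described above can be written down as a small data structure. This is purely a sketch of the poster's guess at how such a store might work, not a description of Google's systems; the class `UrlRecord`, the status strings, and the 180-day approximation of "6 months" are all assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta

REMOVAL_HOLD = timedelta(days=180)  # "6 months", approximated (assumption)

@dataclass
class UrlRecord:
    url: str
    status: str        # e.g. "public", "removed", "noindex", "dead" (hypothetical labels)
    recheck_on: date   # when the record is due to be looked at again

def remove(record, today):
    """Flag a URL as removed and schedule the recheck about six months out."""
    record.status = "removed"
    record.recheck_on = today + REMOVAL_HOLD

def visible(record):
    """Only records marked public would show in the public index."""
    return record.status == "public"

def recheck(record, today, still_disallowed):
    """Once the hold expires, keep the URL out if robots.txt still blocks it;
    otherwise mark it as due for re-spidering."""
    if today < record.recheck_on:
        return record.status
    record.status = "removed" if still_disallowed else "pending-spider"
    return record.status
```

The point of the model is that a removed URL is still remembered, which is why a bot seeing the link again during the hold period does not simply re-add it.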
