Google's 302 Redirect Problem

         

ciml

4:17 pm on Mar 25, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



(Continuing from Google's response to 302 Hijacking [webmasterworld.com] and 302 Redirects continues to be an issue [webmasterworld.com])

Sometimes, an HTTP status 302 redirect or an HTML META refresh causes Google to replace the redirect's destination URL with the redirect URL. The word "hijack" is commonly used to describe this problem, but redirects and refreshes are often implemented for click counting, and in some cases lead to a webmaster "hijacking" his or her own URLs.
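For readers who have not seen one, a click-counting redirect of the kind mentioned above might look roughly like this. It is a minimal sketch, assuming a Python WSGI script; the `click_tracker` name, the `url` query parameter and the in-memory counter are made up for illustration and are not taken from any particular site.

```python
from urllib.parse import parse_qs

# Hypothetical in-memory tally of clicks per destination URL.
click_counts = {}

def click_tracker(environ, start_response):
    """WSGI app: count a click on ?url=<destination>, then redirect to it."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    destination = query.get("url", [""])[0]
    if not destination:
        start_response("400 Bad Request", [("Content-Type", "text/plain")])
        return [b"missing url parameter"]
    click_counts[destination] = click_counts.get(destination, 0) + 1
    # 302 says the move is temporary; a 301 would say it is permanent.
    start_response("302 Found", [("Location", destination)])
    return [b""]
```

The only line that matters for this discussion is the `302 Found` status: it is the status code, not the counting, that a crawler reacts to.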

Normally in these cases, a search for cache:[destination URL] in Google shows "This is G o o g l e's cache of [redirect URL]" and oftentimes site:[destination domain] lists the redirect URL as one of the pages in the domain.

Also link:[redirect URL] will show links to the destination URL, but this can happen for reasons other than "hijacking".

Searching Google for the destination URL will show the title and description from the destination URL, but the title will normally link to the redirect URL.
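The checks above can be collected into a small helper that only builds the query strings to paste into Google. This is a sketch; the function name and the example URLs are hypothetical, and it does not contact Google.

```python
from urllib.parse import urlsplit

def hijack_check_queries(destination_url, redirect_url):
    """Build the cache:, site: and link: queries described above."""
    domain = urlsplit(destination_url).netloc
    return {
        "cache": f"cache:{destination_url}",  # whose cache does Google show?
        "site": f"site:{domain}",             # is the redirect URL listed?
        "link": f"link:{redirect_url}",       # links credited to the redirect
    }

queries = hijack_check_queries(
    "http://www.example.com/page.html",
    "http://redirector.example/go.php?id=7",
)
```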

There has been much discussion on the topic, as can be seen from the links below.

How to Remove Hijacker Page Using Google Removal Tool [webmasterworld.com]
Google's response to 302 Hijacking [webmasterworld.com]
302 Redirects continues to be an issue [webmasterworld.com]
Hijackers & 302 Redirects [webmasterworld.com]
Solutions to 302 Hijacking [webmasterworld.com]
302 Redirects to/from Alexa? [webmasterworld.com]
The Redirect Problem - What Have You Tried? [webmasterworld.com]
I've been hijacked, what to do now? [webmasterworld.com]
The meta refresh bug and the URL removal tool [webmasterworld.com]
Dealing with hijacked sites [webmasterworld.com]
Are these two "bugs" related? [webmasterworld.com]
site:www.example.com Brings Up Other Domains [webmasterworld.com]
Incorrect URLs and Mirror URLs [webmasterworld.com]
302's - Page Jacking Revisited [webmasterworld.com]
Dupe content checker - 302's - Page Jacking - Meta Refreshes [webmasterworld.com]
Can site with a meta refresh hurt our ranking? [webmasterworld.com]
Google's response to: Redirected URL [webmasterworld.com]
Is there a new filter? [webmasterworld.com]
What about those redirects, copies and mirrors? [webmasterworld.com]
PR 7 - 0 and Address Nightmare [webmasterworld.com]
Meta Refresh leads to ... Replacement of the target URL! [webmasterworld.com]
302 redirects showing ultimate domain [webmasterworld.com]
Strange result in allinurl [webmasterworld.com]
Domain name mixup [webmasterworld.com]
Using redirects [webmasterworld.com]
redesigns, redirects, & google -- oh my [webmasterworld.com]
Not sure but I think it is Page Jacking [webmasterworld.com]
Duplicate content - a google bug? [webmasterworld.com]
How to nuke your opposition on Google? [webmasterworld.com] (January 2002 - when Google's treatment of redirects and META refreshes was worse than it is now)

Hijacked website [webmasterworld.com]
Serious help needed: Is there a rewrite solution to 302 hijackings? [webmasterworld.com]
How do you stop meta refresh hijackers? [webmasterworld.com]
Page hijacking: Beta can't handle simple redirects [webmasterworld.com] (MSN)

302 Hijacking solution [webmasterworld.com] (Supporters' Forum)
Location: versus hijacking [webmasterworld.com] (Supporters' Forum)
A way to end PageJacking? [webmasterworld.com] (Supporters' Forum)
Just got google-jacked [webmasterworld.com] (Supporters' Forum)
Our company Lisiting is being redirected [webmasterworld.com]

This thread is for further discussion of problems due to Google's 'canonicalisation' of URLs, when faced with HTTP redirects and HTML META refreshes. Note that each new idea for Google or webmasters to solve or help with this problem should be posted once to the Google 302 Redirect Ideas [webmasterworld.com] thread.

<Extra links added from the excellent post by Claus [webmasterworld.com]. Extra link added thanks to crobb305.>

[edited by: ciml at 11:45 am (utc) on Mar. 28, 2005]

Reid

10:58 am on Mar 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



what to do to prevent 302 hijacks?
Apparently nothing you can do.

There have been some suggestions:
1. Put dynamic content on the page.
2. Use canonical URLs for internal links.

I don't know if any of these things can prevent the 302 problem from happening. After all when googlebot finds a 302 link on another website there is little you can do to change that fact.

Myself, I still use relative links for internal crosslinking, but I have a base href= tag in the head of every page. I wouldn't use relative linking without that tag. It seems that any page is vulnerable, especially if it is a deliberate hijack.
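To see why the base tag matters with relative links, here is a sketch of how a relative href resolves with and without it. It assumes, as the theory in this thread goes, that a crawler treats the redirect URL as the page's own address; `resolve_link` and the URLs are hypothetical.

```python
from urllib.parse import urljoin

def resolve_link(page_url, href, base_href=None):
    """Resolve href against the base href if given, else against page_url."""
    return urljoin(base_href or page_url, href)

# Page content attributed to a redirect script on another host, no base tag:
without_base = resolve_link("http://redirector.example/go.php?id=7", "about.html")

# Same page with a base href pointing at the real site:
with_base = resolve_link(
    "http://redirector.example/go.php?id=7",
    "about.html",
    base_href="http://www.example.com/",
)
```

Without the base tag the relative link lands on the redirecting host; with it, the link resolves to the real site regardless of which URL the page is filed under.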
I don't understand how canonical links can shield a page from hijacks.
Dynamic content- the hijacking links seem to keep the original cache and never update it. I had 2 of these to deal with myself where the current page is completely different than the cached hijacking page. It never gets re-cached so how would dynamic content help?

A custom 404 page? completely useless for preventing hijacks

DTD tag? Nothing to do with hijacks.

Before googlebot even fetches your page it 'already knows' that you are only a temporary location for the hijackers page. You could feed it a 301 if it was possible to know when googlebot is going to show up for this one time fetch but that is impossible.

claus

11:00 am on Mar 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Oh, i didn't mean that the hijacked sites came back. I'll try rephrasing:

Has anybody managed to remove a wrong URL on somebody else's site from Google (ie. a 302 redirect URL) and seen the wrong URL (the redirect script) come back in the serps?

accidentalGeek

2:37 pm on Mar 29, 2005 (gmt 0)

10+ Year Member



HTTP Filter Defense. Take 2.

After aleksl demonstrated why my first take [webmasterworld.com] failed to solve the problem and I came up with a couple of ways that a determined attacker could hijack a Google listing without using a 302, I left the problem alone for a couple of days and focused on other things. This morning, a variation of the defense popped into my head and I'd like to see if aleksl (or anyone else) can shoot this one down.

Remember that this deals with HTTP 302 hijackings only, some of which appear to be accidental. It's useless against more clever or more brutish attacks.

Set up a filter on the Web server which intercepts all inbound requests and does the following:

  1. If the client is definitely not a googlebot (or any other targeted robot), take no action. Allow the request to be processed normally.
  2. If the URL contains a special code that we provide for robots (see the next step), and if the code has not yet been used in a request, take internal steps to ensure that this code is not used again. Then allow the request to be processed normally.
  3. If the client might be a targeted robot, present it with a dynamic splash screen that contains an ordinary hyperlink. The hyperlink contains the URL it requested in absolute form along with a code as a GET parameter. The code will be generated by our filter. This splash screen may contain a bit of text that we don't mind being indexed under the attacker's page. It should not contain the name of our organization or anything else that we don't want indexed under the attacker's page.
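The three steps above can be sketched as a single decision function. This is only an illustration under stated assumptions: the robot test is a crude User-Agent substring check (easily spoofed), the codes live in an in-memory set (a real filter would need persistent storage), and the names `filter_request`, `issued_codes` and the `pass` parameter are invented here.

```python
import secrets

issued_codes = set()            # codes handed out on splash pages, not yet used
ROBOT_MARKERS = ("googlebot",)  # assumption: identify robots by User-Agent

def might_be_robot(user_agent):
    return any(marker in user_agent.lower() for marker in ROBOT_MARKERS)

def filter_request(user_agent, url, code=None):
    """Return ("pass", None) to serve normally, or ("splash", link)."""
    # Step 1: definitely not a targeted robot, so take no action.
    if not might_be_robot(user_agent):
        return ("pass", None)
    # Step 2: a valid, unused code is retired and the request goes through.
    if code is not None and code in issued_codes:
        issued_codes.discard(code)
        return ("pass", None)
    # Step 3: possible robot without a valid code gets the splash page,
    # whose hyperlink is the absolute URL plus a fresh one-time code.
    new_code = secrets.token_urlsafe(16)
    issued_codes.add(new_code)
    separator = "&" if "?" in url else "?"
    return ("splash", f"{url}{separator}pass={new_code}")
```

A robot's first visit gets the splash link; following that link once is allowed through; replaying the same code produces a new splash page.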

I dislike splash screens as much as the next guy, or maybe more than the next guy. The idea here is not to create a splash screen that everyone sees. It's to dynamically create one for a robot that might have been referred to this page by a 302 link. Because robots do not provide meaningful referrer headers, there is no direct way to tell how they arrived at this page. Because they hit a page, index it, and return for the hyperlinks at some later time, we cannot use a timing mechanism like the one I described in my first take.
Therefore, I think our best bet is to decorate the HTTP request. An unfortunate side effect of this will be that the decoration will appear in the Google listing for this page. This should have no technical effect because the filter will notice that users referred by google are not robots and will let the request straight through. If the Web site contains scripts that rely on GET parameters, it might be a good idea for the filter to strip its code from the parameter list before letting the request through -- just to be safe.
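Stripping the filter's code before the site's own scripts see the request could look something like this. It is a sketch using Python's standard URL helpers; the parameter name `pass` is an assumption, not something fixed by the proposal.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def strip_pass_code(url, param="pass"):
    """Return url with the filter's one-time code parameter removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))

clean = strip_pass_code("http://www.example.com/list.php?page=2&pass=abc123")
```

Other GET parameters are kept in their original order, so scripts that depend on them should see the same query they would have seen without the filter.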

This approach requires a slightly more sophisticated filter than the one I described in my first take because it will need to generate, track, and evaluate codes for one time passes. Because filters on most Web servers are necessarily stateless, the codes will need to be stored in a file, database, or some sort of session agent. There's a performance hit associated with this, but it should apply only to clients that might be robots. This should minimize its impact.
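One way to keep the one-time codes outside the stateless filter is a small database table. The sketch below assumes SQLite via Python's standard `sqlite3` module; the table name, column names and function names are made up for illustration.

```python
import secrets
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) the store of one-time pass codes."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS passes ("
        "code TEXT PRIMARY KEY, used INTEGER NOT NULL DEFAULT 0)"
    )
    db.commit()
    return db

def issue_code(db):
    """Generate a fresh code and record it as unused."""
    code = secrets.token_urlsafe(16)
    db.execute("INSERT INTO passes (code) VALUES (?)", (code,))
    db.commit()
    return code

def redeem_code(db, code):
    """Return True only the first time a known code is presented."""
    cursor = db.execute(
        "UPDATE passes SET used = 1 WHERE code = ? AND used = 0", (code,)
    )
    db.commit()
    return cursor.rowcount == 1
```

Marking the code as used in a single UPDATE means the check and the retirement happen together, and the lookup cost is only paid for requests that might come from robots.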

I believe that this defense will succeed where my first take failed because the robot now receives harmless content that it can associate with the 302 referrer. The dynamic splash screen gets indexed under the attacker's listing, but the content is a hyperlink that points to our site. As far as the robot is concerned, this is an ordinary static hyperlink to a completely different Web server, not part of the 302 redirect. If it follows this link, it should index the content it finds under our listing, not that of the attacker.

Does this approach hold more promise than my first take?

theBear

3:22 pm on Mar 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Reid asks:

"Dynamic content- the hijacking links seem to keep the original cache and never update it. I had 2 of these to deal with myself where the current page is completely different than the cached hijacking page. It never gets re-cached so how would dynamic content help?"

The dynamic content prevents the duplicate content filter from tripping in the first case.

At least that is one of the theories, (one that may account for why one of the sites I work on didn't totally tank).

There is also the theory set forth by others that a 302 hijack is an automatic permanent dup content problem because Google says the target pages are the same so the content must be the same so filter this sucker always.

Since we don't have access to the crown jewels of Google we will never know for sure.

Reid further asks:

"I don't understand how canonical links can shield a page from hijacks."

Once again only a theory here as is all of what anyone says on this site.

If the 302 injection causes a site split (plays with Google's canonical page determination subroutines, or inserts links for the bots to follow) then the 301 rewrite rules prevent the site split, thus preventing massive duplicate content problems. Use of relative links is implied here of course.
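The kind of 301 rule meant here sends every request for a non-canonical hostname to the same path on the canonical one. The sketch below expresses that logic in Python rather than in server rewrite syntax; the choice of `www.example.com` as the canonical host and the function name are assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.example.com"  # assumption: the www form is canonical

def canonical_redirect(url):
    """Return (301, canonical_url) for a non-canonical host, else None."""
    parts = urlsplit(url)
    if parts.netloc.lower() == CANONICAL_HOST:
        return None  # already canonical, serve the page normally
    target = urlunsplit(parts._replace(netloc=CANONICAL_HOST))
    return (301, target)

result = canonical_redirect("http://example.com/widgets.html?x=1")
```

Path and query string are preserved, so each page has exactly one address on the canonical host.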

Please note that any page of a site would be subject to replication even if nonrelative hrefs were used, but it would be on a page by page basis and would self correct in time (maybe not fast enough however).

Remember, this is all theory.

Marcia

4:07 pm on Mar 29, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's being compounded by meta refresh, sometimes being used on its own and sometimes together with 302.
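For anyone checking a page for this, a META refresh can be spotted in the HTML itself. The following is a rough sketch using Python's standard `html.parser`; the class and function names are hypothetical, and it only handles the common `content="0; url=..."` form.

```python
from html.parser import HTMLParser

class MetaRefreshFinder(HTMLParser):
    """Record the target URL of the first META refresh tag, if any."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "refresh":
            return
        # content looks like "0; url=http://www.example.com/"
        content = attributes.get("content") or ""
        _, _, rest = content.partition(";")
        rest = rest.strip()
        if rest.lower().startswith("url="):
            self.target = rest[4:].strip().strip("'\"")

def meta_refresh_target(html):
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.target
```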

grail

5:19 pm on Mar 29, 2005 (gmt 0)

10+ Year Member



FAGIN

(spoken) You see, Oliver...

(sung) In this life, one thing counts
first page serps, large amounts
I'm afraid these don't grow on trees,
You've got to link with three oh two

You've got to link with three oh two, boys,
You've got to link with three oh two.

BOYS

Large amounts don't grow on trees.
You've got to link with three oh two.

FAGIN

(spoken) Let's show Oliver how it's done, shall we, my dears?

(sung) Why should we break our backs
Stupidly paying tax?
Better get some adsense income
Better link with three oh two.

You've got to link with three oh two, boys
You've got to link with three oh two.

BOYS

Why should we all break our backs?
Better link with three oh two.

FAGIN

(spoken) Who says crime doesn't pay?

(sung) Widget Website, what a crook!
Gave away, what he took.
Charity's fine, subscribe to mine.
Get out and join adsense too

You've got to join adsense too, boys
You've got to join adsense too.

BOYS

Widget Website was far too good
He had to join adsense too.

FAGIN

Take a tip from scraper sites
they can rip what they likes.
I recall, they started small
then they link with three oh two.

You've got to link with three oh two, boys
You've got to link with three oh two.

BOYS

We can rank like scraper sites
If we link with three oh two.

FAGIN

(spoken) Stop thief!

Dear old gent passing by
Something nice takes his eye
Everything's clear, attack the rear
Get in and link with three oh two.

You've got to link with three oh two, boys
You've got to link with three oh two.

BOYS

Have no fear, attack the rear
Get in and link with three oh two.

FAGIN

When scraper see content rich,
adsense thumbs start to itch
now they rank some page of mine
they have link with three oh two.

You've got to link with three oh two, boys
You've got to link with three oh two.

BOYS

Just to find some page of mine

FAGIN AND BOYS

We have to link with three oh two!

Reid

7:40 am on Mar 30, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That's a funny poem, Grail, but although the 302 hijackers may enjoy stealing other people's hard work, a true SEO strategy is to build a good website.
Vultures come and go but the REAL website will steadily get better and better.
So have fun while it lasts but don't forget "what comes around goes around"

grail

9:53 am on Mar 30, 2005 (gmt 0)

10+ Year Member




Just to explain to those unfamiliar with 'fagin'. That was just a joke to be sung to the tune of "Pick a pocket or two" from "Oliver Twist".

It was meant to 'take the mickey' out of google/adsense/fagin not antagonise the victims of 302.

vincentg

2:58 pm on Mar 30, 2005 (gmt 0)

10+ Year Member



The concern on 302 is in my opinion being blown way out of proportion.

I am seeing posts proposing to create bots to try and defend against such a thing.

The web does not need more bots!
Bots written by non-professionals are only going to cause problems.

I have seen no hard facts to support this claim but I will not dismiss it as a possibility.

First a 302 by itself will not harm a website according to those that have brought this topic to life.

If you do not make it clear what the problem is, you will touch off a frenzy to remove 302 links everywhere.

Website owners that do this will in fact be hurting their PR rather than helping it.

All links are important and just removing links due to a scare based on a 302 Google problem is not an answer.

I run a website that does a redirect rather than a direct link. There is nothing wrong with this.
Yahoo does a Redirect as do PPC Search Engines and others.

I am listed in Yahoo and they have not hurt my PR, and I am listed in many PPC engines, which have not affected my PR either.

My Directory is listed in other Directories which use redirects and again I have no problem.

If there are websites that cause a problem then I say post them here or bring a Google Rep into the Forum to clear this up!

Vincent G. Click4choice

accidentalGeek

10:54 pm on Mar 30, 2005 (gmt 0)

10+ Year Member



Vincent, I get the sense from your post that, like nearly everyone who uses the World Wide Web today, you are unfamiliar with the HTTP protocol. I don't intend this as an insult. One sure sign that a technology has matured is that you don't need to be a geek to use it.

You can build a fine static web site without even knowing that HTTP exists. You can build a fine dynamic one with very little knowledge of the protocol. However, when it comes down to the expected behavior of clients (usually web browsers) when faced with various HTTP response codes, it's time to bring out the official protocol specification [webmasterworld.com] and get geeky.

HTTP Response Codes: a simple overview
At its most basic level, HTTP is a simple request-response protocol where the client sends one request and the server returns one response.(1) The HTTP specification details what constitutes a valid request and a valid response. A valid response will always contain exactly one numeric status code. The status code contains exactly three digits and the first digit places the response into one of five broad categories. The third category (status codes 300-307) covers various types of redirects which can be used to inform a client that the content is available in some other location or must be accessed using some other means.
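As a quick illustration of how the first digit classifies a response: the HTTP 1.1 specification (RFC 2616) lists five classes, 1xx through 5xx (HTTP 1.0 reserved 1xx but did not use it). The class names below follow the specification; the function name is just for this sketch.

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def status_class(code):
    """Return the class name for a three-digit HTTP status code."""
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    return STATUS_CLASSES[code // 100]

redirect_class = status_class(302)
```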

Different Types of Redirect (300-307)
The key here is that HTTP 1.1 specifies seven different kinds of redirect (Count seven rather than eight because 306 is unspecified). These redirect codes tell the client something about the nature of the redirect. But here's where it gets tricky. The specification does not dictate exactly what the client is expected to do with the response. In some places, the specification recommends an action (note the word "SHOULD" in the specification), but ultimately the client is free to do whatever it would like.
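For reference, the seven redirect codes and their names as given in the HTTP 1.1 specification (RFC 2616) can be written out as a small lookup table. The helper function is only a convenience for this sketch.

```python
# The seven 3xx codes defined by RFC 2616; 306 is marked "(Unused)" there.
REDIRECT_CODES = {
    300: "Multiple Choices",
    301: "Moved Permanently",
    302: "Found",
    303: "See Other",
    304: "Not Modified",
    305: "Use Proxy",
    307: "Temporary Redirect",
}

def describe_redirect(code):
    """Return the specification's name for a redirect code, or None."""
    return REDIRECT_CODES.get(code)

name = describe_redirect(302)
```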

From your post, it seems to me that you were confused into thinking that all HTTP redirects were the same. A quick review of the specification will show that they are not. The problem we're facing with "hijacked" Google listings results from the way that a particular HTTP client, a googlebot, handles a particular kind of redirect, HTTP 302:


302 Found
The requested resource resides temporarily under a different URI. Since the redirection might be altered on occasion, the client SHOULD continue to use the Request-URI for future requests. This response is only cacheable if indicated by a Cache-Control or Expires header field.

In other words, 302 means that the "resource" (generally a Web page) is temporarily unavailable at the requested URL, but can be found at some other URL.

Robots and HTTP 302
When confronted with a 302, googlebot uses the redirect to load the resource and then indexes it under the original URL. This behavior makes perfect sense if we make the assumption that the resource will eventually return home and be available at its original URL. After all, this is what the response code indicates.

However, this assumption introduces an element of trust. The client must trust that the server is telling the truth. That is, a) the resource normally lives at the requested URL and b) the resource is also available at some other location on a temporary basis.

This required element of trust creates an opening for a misbehaving or malicious Web server to "hijack" a google listing. It needs only to issue a 302 redirect to some other Web server. A googlebot will assume that the content it finds on the other end of the redirect really belongs on the first Web server. Note that this is not a stupid assumption on the part of the googlebot. The assumption is built into the specification of response code 302. My guess is that this code was specified in a more innocent era when systems generally trusted one another. I doubt that the architects of HTTP 1.0 had robots in mind and there's no noticeable consequence if an ordinary web browser follows a 302 redirect that really should have been a 301 ("moved permanently"). The only difference should be in the way that the browser maintains its cache.
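The difference between the two codes can be shown with a toy indexer. This is a simplified model of the behavior described above, not Googlebot's actual logic; the `WEB` table, the host names and the `index_url` function are all invented for illustration.

```python
# Toy "web": URL -> (status, payload). For a 3xx the payload is the Location.
WEB = {
    "http://attacker.example/go?id=1": (302, "http://victim.example/"),
    "http://old.example/": (301, "http://victim.example/"),
    "http://victim.example/": (200, "Victim's page content"),
}

def index_url(url, web, max_hops=5):
    """Return (url_to_list, content) for a crawl that starts at url."""
    listed = url   # the URL the content will be filed under
    current = url  # the URL being fetched right now
    for _ in range(max_hops):
        status, payload = web[current]
        if status == 200:
            return (listed, payload)
        if status == 301:
            listed = payload  # permanent move: file under the new URL
        elif status != 302:
            raise ValueError(f"status {status} not handled in this sketch")
        # 302 is temporary, so the original URL stays as the listing
        current = payload
    raise RuntimeError("too many redirects")

hijacked = index_url("http://attacker.example/go?id=1", WEB)
moved = index_url("http://old.example/", WEB)
```

In this model the 302 case files the victim's content under the attacker's URL, while the 301 case files it under the victim's own URL.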

If you're Google or the maintainer of some other crawler robot, this situation presents a problem that is difficult to solve. How do you avoid getting duped by an incorrect 302 without behaving in a way inconsistent with the specification? Put more simply, how do you fix the current problem without breaking all sorts of well-established systems, some of which you won't know about until the complaints come flooding in?

Scope of the Problem
From what I've read, people who study such things have known about this vulnerability for several years. However, it has recently become more widely known and the number of reported exploitations has been on the rise. The aspect about this vulnerability that bothers me the most is how easy it is to exploit. It takes virtually no expertise. An exploit can be achieved with a line or two of server-side script code. The effect of the exploit is that the target's listing in Google (and other search engines) will be replaced by one that contains the target's content and a hyperlink to any arbitrary URL that the attacker designates. If you run a site that helps small children learn to read, an attacker can make your Google listing point to a porn site. If you run a banking site, an attacker can make your Google listing point to a phishing site.
The push toward a solution may be tempered by the fact that there are other attacks that achieve the same result and are much more difficult to detect and likely impossible to defend against.

Possible Defenses
I've seen a number of proposals for defending a site against being indexed by robots that were referred by an HTTP 302. Most of them involve tweaking content or deploying meta-tags. In my view, these are unlikely to be effective because they operate at a higher level than the problem. It's like trying to deal with a flooded basement when your sump and hoses are stuck on the third floor. Because this is a protocol-level problem, I believe that effective solutions are to be found on the protocol level. I proposed a couple of solutions earlier. Aleksl demonstrated that my first solution was doomed from the start. I haven't seen any response to my second.

-----------
1. This statement is correct for HTTP 1.0. It's an oversimplification for HTTP 1.1 which introduces some flow control and allows multiple requests and responses per connection.
