|Dupe content checker - 302's - Page Jacking - Meta Refreshes|
You make the call.
My site, let's call it www.widget.com, has been in Google for over 5 years, steadily growing year by year to about 85,000 pages including forums and archived articles, with a PageRank of 6 and 8287 backlinks in Google. No spam, no funny stuff, no special SEO techniques, nothing.
Normally the site grows at a rate of 200 to 500 pages a month indexed by Google and others ... but about a week ago I noticed that my site was losing about 5,000 to 10,000 pages a week in the Google index.
At first I simply presumed that this was the unpredictable Google flux, until yesterday, when the main index page of www.widget.com disappeared completely out of the Google index.
The index-page was always in the top-3 position for our main topics, aka keywords.
I tried all the techniques to find my index page, such as allinurl:, site:, direct link, etc., but the index page has simply vanished from the Google index.
As a last resort I took a distinctive chunk of text that can only belong to my index page, "company name own name town postcode" (a sentence of 9 words), and searched for it in Google.
My index page did not show up; instead, 2 pages from other sites showed up as having this information on their pages.
Let's call them:
www.foo1.net and www.foo2.net
Wanting to know what my "company text" was doing on those pages I clicked on:
(with mykeyword being my site's main topic)
The page could not load and the message:
"The page cannot be displayed"
was displayed in my browser window
Still wanting to know what was going on, I clicked "Cached" on the Google SERPs ... AND YES ... there was my index page, as fresh as it could be, updated only yesterday by Google itself (I have a daily date on the page).
Thinking that foo was using a 301 or 302 redirect, I used the "Check Headers Tool" from WebmasterWorld, only to get a code 200 for my index page on this other site.
So foo must be using a meta redirect ... I quickly wrote a little robot in Perl using LWP, adding a little code that would recognize any kind of redirect.
Fetched the page, but again got a code 200 with no redirects at all.
Thinking foo's site was up again, I tried again to load foo's page with IE, Netscape and Opera but always got:
"The page cannot be displayed"
Tried it a couple of times with the same result: LWP can fetch the page but browsers can not load any of the pages from foo's site.
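The kind of robot described above can be sketched roughly as follows. This is a minimal Python sketch, not the original Perl/LWP script: the names `classify`, `fetch` and `MetaRefreshFinder` and the `redirect-check/0.1` user agent are made up for illustration. It fetches a page once without following redirects, then reports both HTTP-level redirects (3xx plus Location) and meta refreshes found in the HTML.

```python
import http.client
from html.parser import HTMLParser
from urllib.parse import urlsplit


class MetaRefreshFinder(HTMLParser):
    """Collects the content attribute of every <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.refreshes = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").lower() == "refresh":
            self.refreshes.append(attr.get("content", ""))


def classify(status, location, body):
    """Return a list of (mechanism, target) pairs seen in one HTTP response."""
    found = []
    if 300 <= status < 400 and location:
        found.append(("http-%d" % status, location))
    finder = MetaRefreshFinder()
    finder.feed(body)
    for content in finder.refreshes:
        # content usually looks like "0;url=http://www.example.com/"
        delay, _, target = content.partition(";")
        target = target.strip()
        if target.lower().startswith("url="):
            target = target[4:]
        found.append(("meta-refresh-%s" % delay.strip(), target))
    return found


def fetch(url, user_agent="redirect-check/0.1"):
    """Fetch url once; http.client does not follow redirects on its own."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    conn.request("GET", path, headers={"User-Agent": user_agent})
    resp = conn.getresponse()
    body = resp.read().decode("utf-8", errors="replace")
    status, location = resp.status, resp.getheader("Location")
    conn.close()
    return status, location, body
```

Calling `classify(*fetch(url))` on a suspect URL lists whatever redirect mechanisms that one response carries. A 200 with an empty list, as reported above, would mean neither mechanism was served to this particular client, which is consistent with (but does not prove) the page being served differently to different visitors.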
Wanting to know more I typed in Google:
to get a huge load of pages listed, all constructed in the same way, such as:
I also found some more of my own best-ranking pages in this list, and after checking the Google index, all of those pages from my site have disappeared from it.
None of the pages found using "site:www.foo1.com" can be loaded with a browser, but they can all be fetched with LWP, and all of them are cached in their original form in the Google cache under foo's cache link.
I have sent an email to Google about this and am still waiting for a response.
I am sort of in the same boat. My site was also removed, but the link remains and is redirected to the offending site's home page.
As for the cloak checker, most of the sites utilizing cloaks use IP-based cloaking, which a tool cannot check. You would have to spoof Googlebot's IP in order to do this. A good cloak is transparent to the user and very hard to find. This is what makes this so frustrating.
I know what you are saying, but I did find some differences using the cloaking checker I found and was wondering if any of the things I posted were relevant. I can send you a sticky of the URL of the cloaking checker site I used if you're interested.
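For what it's worth, a simple user-agent comparison can be sketched as below. This is only an illustration in Python: the two user-agent strings, the 0.9 threshold and the function names are assumptions, and, as the post above says, it cannot detect IP-based cloaking because the request still comes from your own address.

```python
import difflib
import urllib.request

# Assumed user-agent strings, for illustration only.
BROWSER_UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
CRAWLER_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"


def fetch_as(url, user_agent):
    """Fetch url while announcing the given User-Agent header."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def similarity(body_a, body_b):
    """Ratio between 0.0 (nothing in common) and 1.0 (identical text)."""
    return difflib.SequenceMatcher(None, body_a, body_b).ratio()


def looks_cloaked(body_a, body_b, threshold=0.9):
    """True when the two bodies differ more than the (arbitrary) threshold allows."""
    return similarity(body_a, body_b) < threshold
```

One would fetch the same URL with `fetch_as(url, BROWSER_UA)` and `fetch_as(url, CRAWLER_UA)` and pass both bodies to `looks_cloaked`. A difference only hints at user-agent cloaking; identical bodies say nothing about what the real Googlebot is served.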
PS-Still wondering if you had tried the email@example.com email address I suggested earlier in the thread. When I attempted to use the "remove URL" tool that was also listed as the contact email address to use if you had further questions. Although, at this point, it seems they have stopped responding to anybody.
Yes, sticky please and I will check it out.
Yes, I have emailed firstname.lastname@example.org . Still no response. Not even an acknowledgement :-(
I think I have a handle on the programming/interpretation error that *may* be causing this as it relates to 302 redirects but not meta refresh. It appears to be an artifact of the way specific dynamic page handlers are implemented on apache, on version 1.3x in particular. In other words, this appears to be *mostly* the result of the choice of tool rather than maliciousness on the part of external sites.
The number of samples that I have seen has been *extremely* small, but there is a definite commonality. I would really like to see more samples for confirmation purposes. If anyone has additional samples I would really like to see them via sticky.
The technique will not be published here, however email originating from google.com will be answered provided they appear genuinely interested in fixing the problem or at least verifying the correctness of their handling of the particular protocol interactions.
Just checked Copyscape and I have been hijacked AGAIN by a different directory. As of today.
The difference this time is that I see other sites that have been hijacked by them and they are using a meta-refresh-"1" instead of "0".
This is really fun!
Now I am getting really ticked off :-(
I came in this morning and noticed that I got crawled by googlebot. Not a full crawl, but about 1/3 of the site. I thought, okay, this is good. Maybe it will straighten the mess out.
Guess what! My money 3 word phrase popped up to #1! Yippee! But wait a minute. I better check that link out. Oh No! It couldn't be! Could it? Guess what. The link is the offending site! My cache! My Backlinks!
My other 3 word phrase is now at #5. And here is the kicker. It is another site's redirect! My cache! My backlinks!
Money 2 word phrase went from #50 to #51. It is STILL the same offending site with my cache and my backlinks.
So it seems that instead of improving, it got worse.
|It appears to be an artifact of the way specific dynamic page handlers are implemented on apache, on version 1.3x in particular. |
You are right sir. One site is using Apache/1.3.29 (Unix) and the other is using Apache/1.3.26 (Unix). Not sure what this means.
Thanks to the posters who sent me their responses from Google. What a joke. I think they have a bug but they won't or can't fix it. Lots of the standard doublespeak with nothing being done.
Oh, and by the way - You guessed it. No response from google. Not even an acknowledgement. Oh Well, maybe that is just as well.
Are you sick of me yet?
|You guessed it. No response from google. Not even an acknowledgement. |
You don't want the kind of response that I got.
If they have not responded, perhaps they are still looking at it and scratching their heads.
When I reported it to them, they said I should shut up and color. They told me that they are perfectly happy with the way it is working.
|I think I have a handle on the programming/interpretation error that *may* be causing this as it relates to 302 redirects but not meta refresh. |
I looked at two different samples on the same Apache (the same host).
Both scripts are written in PHP and use 302 redirects. One doesn't have its linked pages cached on Google, but the other one does.
Here is what wannabrowser with NetSpider shows:
1. the one that doesn't have links cached
HTTP/1.1 302 Found
Date: Thu, 23 Sep 2004 15:46:55 GMT
Server: Apache/1.3.31 (Unix) mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 PHP/4.3.8 FrontPage/126.96.36.19934a mod_ssl/2.8.18 OpenSSL/0.9.6b
Location: http:// www blahblahblah net
2. this one hijacked pages
HTTP/1.1 302 Found
Date: Thu, 23 Sep 2004 15:50:25 GMT
Server: Apache/1.3.31 (Unix) mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 PHP/4.3.8 FrontPage/188.8.131.5234a mod_ssl/2.8.19 OpenSSL/0.9.6b
P3P: CP="CAO DSP COR CURa ADMa DEVa OUR IND PHY ONL UNI COM NAV INT DEM PRE" policyref="www somesite com/w3c/p3p.xml"
Set-Cookie: hits=++929+; expires=Thu, 23-Sep-04 15:51:26 GMT
Set-Cookie: linksread=%5BEND%5D929%5B%2C%5D1095954626%5BEND%5D; expires=Fri, 23-Sep-05 15:50:26 GMT
Location: [www...] blahblahblah net
<meta http-equiv="refresh" content="0;url=http://www blahblahblah com">
I was also trying to find the changes in Apache 1.3 from the previous versions and didn't really find anything on 302. But in one of the discussions I read, they recommended using no-cache:
<META HTTP-EQUIV="EXPIRES" CONTENT="Tue, 04 Dec 1996 21:29:02 GMT"> <META HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE"> <META HTTP-EQUIV="CACHE-CONTROL" CONTENT="PRIVATE">
Is there any logic to that?
Here is the post - search on G for: apache 302 redirect + refresh + cache
It is the first link, on the redhat domain.
Here is some more, from the W3C:
search on G for: W3C "Checkpoint 7.5"
Edit: decided to quote:
7.5 Until user agents provide the ability to stop auto-redirect, do not use markup to redirect pages automatically. Instead, configure the server to perform redirects. [Priority 2]
Content developers sometimes create pages that refresh or change without the user requesting the refresh. This automatic refresh can be very disorienting to some users. Instead, in order of preference, authors should:
Configure the server to use the appropriate HTTP status code (301). Using HTTP headers is preferable because it reduces Internet traffic and download times, it may be applied to non-HTML documents, and it may be used by agents who requested only a HEAD request (e.g., link checkers). Also, status codes of the 30x type provide information such as "moved permanently" or "moved temporarily" that cannot be given with META refresh.
Replace the page that would be redirected with a static page containing a normal link to the new page.
It certainly says right there not to use meta refresh with a 302... now we need to figure out the rest :) lol. I don't know if we should write to Apache or whether it is actually an SE problem. G and Y! both show the same thing, so it must be something Apache could answer.
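To make the W3C advice concrete, here is a minimal sketch of a server-side permanent redirect using Python's standard http.server module. It is an illustration only: the sites discussed in this thread run Apache with PHP, and `NEW_LOCATION`, `PermanentRedirect` and `make_server` are hypothetical names. The point is that the redirect is carried entirely by the 301 status and the Location header, with no meta refresh in any body.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

NEW_LOCATION = "http://www.example.com/new-page.html"  # hypothetical target


class PermanentRedirect(BaseHTTPRequestHandler):
    """Answers every GET or HEAD with a 301 pointing at NEW_LOCATION."""

    def do_GET(self):
        self.send_response(301)  # "moved permanently"
        self.send_header("Location", NEW_LOCATION)
        self.send_header("Content-Length", "0")  # no body, so no meta refresh
        self.end_headers()

    do_HEAD = do_GET  # link checkers that only send HEAD get the same answer

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet


def make_server(port=0):
    """Bind to localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), PermanentRedirect)
```

Running `make_server(8000).serve_forever()` and requesting any path on port 8000 should return the 301 and the Location header. How a search engine then treats that response is a separate question, which is what this thread is about.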
|They made this decision because of the last email reply from G on this issue, basically saying that both URL's (the client URL, and the affiliate URL) have the same content, so G is listing the one with the higher 'rank'. |
This interests me because I am considering creating a website that would feature a large amount of copyright-free text, along with my own commentary and related data from other sources. The navigation and page design would be my own work.
The text explicitly says that everything produced by this organization is free of copyright and may be reproduced. In fact, other publishers sell the text (in paper). But the organization that produces it does display it on their website. So I am worried that Google may spot the similarities and ban my website because (presumably) it will have the lower page rank.
What exactly does Google look for? The exact same HTML code? (which would not be the case here) or large amounts of same text as it is displayed in a browser?
To avoid a ban for duplicate content, simply provide snippets in the main page and use an IFRAME to deliver the main content. Exclude robots from the directory in which the main content is stored.
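If you go that route, it is worth checking that the robots.txt rule really excludes the directory. Here is a small sketch using Python's standard urllib.robotparser; the `/framed/` directory, the example.com URLs and the `is_crawlable` helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps all robots out of the IFRAME content.
ROBOTS_TXT = """\
User-agent: *
Disallow: /framed/
"""


def is_crawlable(robots_txt, url, agent="Googlebot"):
    """True if the given robots.txt text lets agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

With the rules above, `is_crawlable(ROBOTS_TXT, "http://www.example.com/framed/text1.html")` should come back False while the main page stays crawlable. This only tests the rule itself; it says nothing about how a given search engine handles duplicate content.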
Of course, you may not rank as well this way but your users will be happy.
On the matter of this being an Apache problem, unless I am missing something, this sounds like nonsense. If Google indexed content pages instead of pages that redirect to content the problem simply would not exist. I can see no possible way that this can be explained as a server problem - it's a Google/Yahoo problem.
What is the legitimate use, if any, of a 302 redirect (Location: url) *with* an accompanying meta refresh? If the two URLs (the redirect and the refresh) were the same, the two would be redundant. If site 1 redirects to site 2 and then at the same time meta-refreshes to the target site... is this the method of the hijacks? Does site 2 get credit for the target site's pages and PR? If so, why is this different from site 1 just caching the target site's code to begin with and cloaking it to the bots? ...Keeping in mind there are no doubt several variations of this going on at any one time.
|On the matter of this being an Apache problem, unless I am missing something, this sounds like nonsense. |
i did not say that it was an apache *problem*, i said that the specific cases i know of are all from servers running apache 1.3x.
i also see, from network traces, not a header checker, that there is a subtle difference in the way that apache handles certain boundary cases. while *fully* conformant to rfc 2616, the difference still sort of stands out. so, it is not an *apache* problem, rather it is behaviour that is specifically seen in certain configurations of apache 1.3x. the possibility exists that it is being interpreted in an unexpected way by an external program under someone else's control.
as an experienced programmer, i think you will recognise that boundary cases are often where bugs lie because a programmer relied on behaviour that did not hold true at the boundary.
i will repeat again that i still need more samples. rather than guessing, i would like to be able to say that of X samples, there were Z commonalities.
All I know at this point is there definitely seems to be a google bug. After the first site that hijacked me removed the link to me, all of the links with my title, snippet and their URL and my cached page disappeared from the SERPS. Gone.
The only thing left is that when you search for www.[my-site].com the result shows:
No title, no snippet, just the link that used to link to me.
Google can show you the following information for this URL:
Find web pages that are similar to www.[my-site].com
Find web pages that link to www.[my-site].com
Find web pages that contain the term "www.[my-site].com"
When you hover over the "pages that are similar" link, it goes to pages that are similar to www.[other-site].com.
When you hover over the "pages that link to" it goes to the www.[other-site].com.
When you hover over the "pages that contain the term" it goes to www.[MY-SITE].com.
Now, if the page was cloaking, wouldn't it still be showing up in the SERPS for keyword searches like it did before they removed my listing from their site?
Why, then, is Google still associating me with this site's link that contains no title, no snippet, no cache and no longer seems to have anything to do with me?
My knowledge of http is about as thin as a cigarette paper so I am perfectly happy to accept that I am wrong. My logic is that servers simply chuck out text responses to requests in headers and that if the responses of two servers are essentially the same (as they should be) then Googlebot's response will be the same whether the server is running Apache, Windows or whatever.
However, I guess you are working on the "garbage in, garbage out" principle. If the responses of some servers are outside the norm, then Googlebot may become confused. However, I do not see this as a robot problem (though it might be I suppose) I see it as an indexation problem.
So why am I prepared to stick my head above the parapet and say it is an indexation bug? Well, first reports (and some email responses from Google) suggest that the redirection trick works best if you have a PR advantage. Other than for scheduling, I see no reason for robots to have any knowledge of PR - therefore the problem must lie in indexation. If it lies in indexation, in order for the problem to be caused by server responses, those responses must be recorded in detail and processed by the indexation service. Whilst this is possible, I would say it is more likely that the robots record only simplified versions of response headers.
You are definitely in the right neighbourhood.
Consider if you will that search engines are not one monolith, but rather a suite of processes implemented in multiple programs.
Consider also that these programs are implemented by individual teams or programmers.
Consider finally that these programmers depend upon informal and formal knowledge of the output of the programs upstream from their own program and the expectations with respect to the outputs from their own program.
For gigo to happen, requires only one mistake anywhere in the chain. The results of the observation points available so far suggest a boundary condition trap that I as a programmer could easily fall into. The reasoning then becomes: if one programmer could fall into the trap, then why not another?
On a more general note, I have received a few more urls by sticky, but as usual, could use more. I need both the victim and donor url's. If someone is reluctant to release the victim url, then the url of the external page is fine. That way, if it is a page containing multiple outbounds, I have no way of identifying the victim site.
I use xoops. After reading this thread I have understood the effect of the linking strategy that comes with it. It affects me more because it's usually my users who post the links, and usually to their own sites or sites they like. We have only 31 links in total, as I never was into linking games and strategy. The links that have been put there are there because of their value to the users. I hate the word reciprocal. (My policy was always: if I like what you show then you can have my link anytime, and vice versa.)
My site is in my profile. Just visit the links section. Any technical assistance on how to circumvent this problem is appreciated.
After doing a complete check I found lots of things. For example, a redirect page on my site has 16500 backlinks. But luckily the site it is linking to is a behemoth with a PR9 or 10. So good enough. And my page also has not been removed because of duplicate content.
Yes, I have experienced this problem with Google.
So not-my-page meta refreshes to my-page, and my-page is dropped, but not-my-page remains in Google search results, with my-page content but the link is not-my-page.
So I email Google and they say to do a 301 redirect, and that they do not manually adjust search results. But I say that I cannot 301 redirect: not-my-page is, well, not mine.
End result? Nothing.
"Are you sick of me yet?"
Not even! Hang tough, Pard!
|So I email Google and they say to do a 301 redirect, that they do not manually adjust search results, but I say that I cannot 301 redirect, not-my-page is well not mine. |
I am beginning to think that email@example.com needs to be outsourced to India.
The people answering it lately apparently have not the faintest bit of understanding of what kind of troubles they are causing.
The lack of systems engineering talent at that address is criminal.
Re: messages #206 and #164
First, I made a mistake in #206, the page with the meta refresh to the test page returns a 200 response. Next, I was wrong about cloaking in #164.
The test pages I redirected have all been spidered and cached. I redirected three existing pages listed in Google, using a 301, a 302, and a meta refresh, to newly created test pages. In ALL cases, the cache now contains the new test pages.
I expected the 301 redirect to update the cache. I expected the meta refresh to update the cache but I expected it to take longer. I did NOT expect the 302 to update the cache.
This testing was all done within one domain. The server header for this server displays "Server: Apache/1.3.27 (Unix) (Red-Hat/Linux) PHP/4.3.0." Also, the test site has a dedicated IP and this host uses VPS/VDS (Virtual Private/Dedicated Server) technology. All redirects used fully qualified domain names and paths, absolute addresses.
It looks to me as if any site can hijack any site with any kind of redirect, presently. Let the games begin! :) Page rank may still limit who can hijack who but I'm uncertain how to test that aspect.
webdude, are you seeing the same results I'm seeing? After you confirm my results, I'll try substituting a few pages at a different domain for these test pages.
[edited by: DaveAtIFG at 6:26 am (utc) on Sep. 26, 2004]
Thanks for the test
When you say:
"the cache now contains the new test pages"
are they listed under the "redirecting URL" or under the "redirected to URL"?
Secondly, did you add a "Robot Tag" such as:
<meta name="robots" content="follow, noindex">
See message #2 on page 1?
Good question Marcello, sorry! The 302 and the meta refresh pages are listed under the redirecting URL. I missed it in my earlier post but the URL for the 301 test page has replaced the redirecting URL. It appears a 301 is still NOT a potential hijack tool.
Both the redirecting pages and the target pages contain:
<meta name="robots" content="index, follow">
|It looks to me as if any site can hijack any site with a 302 or a meta refresh, presently. |
Did you catch that, rustybrick?
I was wondering if I jumped the gun when I asked the other sites who used a 302 (but no meta refresh) to remove my listings. It's a bit strange to go from asking for links to begging people to please, please remove them!
Both the sites that hijacked me used meta refresh, but the first site returns a 302 and the second site returns a 200.
The second site has Apache 1.3x in the server header, but the first site just says Apache. If that means anything to what you are researching, here.
My situation to date is: the first site still remains in the SERPS when you do a search on my domain name.
The second site did not reply to my request to remove me from their directory. Their directory has to do with travel accommodations. I have sent them a more stern email. I also found out that the guy who owns the site also owns a company that offers SEO and web design. He also designed the site of his hosting company, so I doubt I'm going to get anywhere by complaining to them about it.
That site now appears along with mine in the SERPS with my title and snippet, cached page and their URL. I'm just waiting for Google to drop my index page again.
Any new developments for you, Webdude?
point of clarification,
were all of these redirect cases interdomain?
Apparently, you overlooked this plumsauce.
|This testing was all done within one domain. |
I'm uncertain what you mean by "interdomain," to me that means between different domains...?
In any event, both the redirected pages and the targeted pages ALL reside within the same domain.
I'm now aware of two sites that are using 302 redirects and appear to be hijacking, one is using Apache 1.3.31, the other is 1.3.26.
I emailed GoogleGuy suggesting he review this thread and offering my test data. Based on past responses, I don't expect to learn anything specific from him. Things have changed at Google, IPO et al...
> It looks to me as if any site can hijack any site with any kind of redirect, presently. Let the games begin! :) Page rank may still limit who can hijack who but I'm uncertain how to test that aspect.
The Page Rank thing explains why I wasn't able to hijack google.com.
"Page rank may still limit who can hijack"
In my case a PR3 page with a meta refresh hijacked my PR6 index page, which is still nowhere to be found in the Google index.
Since this happened, the hijacking page has kept moving up in the SERPs, even though Google has deleted the cache of this page due to my DMCA complaint. (The only result is that the proof is gone; nothing else has changed.)
For any kind of text snippet from the content of my disappeared index page, and for my keywords, foo.com is still at the top of the results.
My Index-Page at widget.com contains 4 outgoing links to high-ranking sites about the same topic and field.
These links are now showing up as coming from foo.com/widget-com.html instead of from my page at widget.com
the result gives 4373 backlinks with foo.com/widget-com.html in the top position, no sign of my widget.com/ which is the real page that contains a link to other-site.com.
I admit to not being an expert on this but I think I've found a hijacking attempt against my sites. the one site has decent pr but it only gets about 200 visitors a day.. So it's not a big fat target..
the text below is what I sent to google's spam report.
The message to Google included the real site names. Here I substituted HIJACKER for the domain name and Widgets for the subject item.
I would like to report a page that is showing a PR of 3 on the google toolbar. I have reason to believe that they are using a method of pagerank hijacking.
I did a search in Yahoo to see who was linking to my site
The yahoo search was Link:http://www.mysite.com -site:http://www.mysite.com
I noticed a suspicious entry of “You Search : " WIDGETS", " widgets". Search more " widgets". Find about " widgets".” which went to a Russian site
Following that link took me to [HIJACKER.ru...]
The top of that page (which is unranked at this point as far as PR ) shows all sex links. However, closer to the bottom, you see links to my site in the format of
Clicking on the link they give just takes you to another of their results pages.
I realize that the initial search here was in Yahoo and I am writing to google.
1. I believe that the [HIJACKER.ru...] site is trying to hijack pagerank through redirects
2. This hijacking has been widely reported in webmasterworld.com and #*$!.com
3. I see no logical reason that a page with results about sex toys, etc. should have results about widgets mixed in.
I can provide a formatted copy of this in word if you want or you can contact me at
(i gave them my email and phone)
========== end message i sent
The search engines need to fix this real soon ..
|The search engines need to fix this real soon .. |
Webmaster@google.com has already responded to many of us and told us that it is working exactly as they want it to work and that we should shut up and leave them alone.
Yahoo responded to my note by banning my site from their index.
Have Dvorak, Cringely, Sullivan, or Bricklin written about this yet?