Normally the site grows at a tempo of 200 to 500 pages a month indexed by Google and others ... but about a week ago I noticed that my site was losing about 5,000 to 10,000 pages a week from the Google index.
At first I simply presumed that this was the unpredictable Google flux, until yesterday, when the main index page of www.widget.com disappeared completely out of the Google index.
The index-page was always in the top-3 position for our main topics, aka keywords.
I tried all the techniques to find my index page, such as allinurl:, site:, direct link, etc., but the index page has simply vanished from the Google index.
As a last resort I took a special chunk of text that can only belong to my index page, "company name own name town postcode" (a sentence of 9 words), and searched for it in Google.
My index page did not show up; instead, 2 pages from other sites showed up as having this information on their page.
Let's call them:
www.foo1.com and www.foo2.com
Wanting to know what my "company text" was doing on those pages I clicked on:
www.foo1.com/mykeyword/www-widget-com.html
(with mykeyword being my site's main topic)
The page would not load, and the message
"The page cannot be displayed"
was displayed in my browser window.
Still wanting to know what was going on, I clicked "Cached" on the Google SERPs ... AND YES ... there was my index page, as fresh as it could be, updated only yesterday by Google itself (I have a daily date on the page).
Thinking that foo was using a 301 or 302 redirect, I used the "Check Headers Tool" from WebmasterWorld, only to get a code 200 for my index page on this other site.
So foo must be using a meta redirect ... I quickly made a little robot in Perl using LWP, adding a little code that would recognize any kind of redirect.
Fetched the page, but again got a code 200 with no redirects at all.
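For anyone who wants to repeat this kind of check, here is a minimal sketch in Python rather than the Perl/LWP robot described above (that script was not posted, so this is not it). It fetches a page without following redirects and then looks for either an HTTP 3xx with a Location header or a meta refresh in the body. The names `classify` and `fetch` and the user-agent string are made up for the illustration, and `fetch` only handles plain http:// URLs.

```python
import http.client
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit


class _MetaRefreshFinder(HTMLParser):
    """Remembers the target of the first <meta http-equiv="refresh"> tag seen."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attr = {k: (v or "") for k, v in attrs}
        if attr.get("http-equiv", "").lower() != "refresh":
            return
        # content looks like "0;url=http://..."; the url part is optional
        m = re.search(r"url\s*=\s*['\"]?([^'\"]+)", attr.get("content", ""), re.I)
        self.target = m.group(1).strip() if m else ""


def classify(status, headers, body):
    """Return (kind, target); kind is 'http', 'meta' or 'none'."""
    names = {k.lower(): v for k, v in headers}
    if 300 <= status < 400 and "location" in names:
        return "http", names["location"]
    finder = _MetaRefreshFinder()
    finder.feed(body)
    if finder.target is not None:
        return "meta", finder.target
    return "none", None


def fetch(url, user_agent="redirect-check/0.1", timeout=10):
    """GET a plain-http URL without following redirects (http.client never does)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        conn.request("GET", path, headers={"User-Agent": user_agent})
        resp = conn.getresponse()
        return resp.status, resp.getheaders(), resp.read().decode("latin-1")
    finally:
        conn.close()
```

Usage would be `classify(*fetch(url))`. A result of `('none', None)` together with a 200 matches what is reported above, and it also means the server may simply be answering scripts and browsers differently.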
Thinking the site of foo was up again, I tried again to load the page and foo's page with IE, Netscape and Opera, but always got:
"The page cannot be displayed"
Tried it a couple of times with the same result: LWP can fetch the page, but browsers cannot load any of the pages from foo's site.
Wanting to know more I typed in Google:
"site:www.foo1.com"
to get a huge load of pages listed, all constructed in the same way, such as:
www.foo1.com/some-important-keyword/www-some-good-site-com.html
Also I found some more of my own best-ranking pages in this list, and after checking the Google index, all of those pages from my site have disappeared from it.
None of the pages found using "site:www.foo1.com" can be loaded with a browser, but they can all be fetched with LWP, and all of those pages are cached in their original form in the Google cache under foo's cache link.
I have sent an email to Google about this and am still waiting for a response.
I am sort of in the same boat. My site was also removed, but the link remains and is redirected to the offending site's home page.
As for the cloak checker, most of the sites utilizing cloaks are using IP-based cloaks, which a tool cannot check. You would have to spoof the IP of Googlebot in order to do this. A good cloak is transparent to the user and very hard to find. This is what is making this so frustrating.
I know what you are saying, but I did find some differences using the cloaking checker I found and was wondering if any of the things I posted were relevant. I can send you a sticky of the URL of the cloaking checker site I used if you're interested.
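To make the limitation concrete, here is a rough sketch of what a user-agent based cloak checker does: fetch the same URL twice with different User-Agent headers and compare the bodies. This is an assumption about how such tools work in general, not the code of any particular checker; the function names, the 0.9 threshold and the two user-agent strings are chosen for the illustration. Both requests leave from your own IP address, which is exactly why an IP-based cloak comes back looking clean.

```python
import difflib
import urllib.request

# Illustrative user-agent strings (assumptions, not taken from the thread)
GOOGLEBOT_UA = "Googlebot/2.1 (+http://www.google.com/bot.html)"
BROWSER_UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"


def fetch_as(url, user_agent, timeout=10):
    """Fetch url with the given User-Agent; note that urlopen follows redirects."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("latin-1")


def similarity(body_a, body_b):
    """Ratio between 0.0 (nothing in common) and 1.0 (identical)."""
    return difflib.SequenceMatcher(None, body_a, body_b).ratio()


def looks_cloaked(body_for_bot, body_for_browser, threshold=0.9):
    """True when the two bodies differ more than the (arbitrary) threshold allows."""
    return similarity(body_for_bot, body_for_browser) < threshold
```

A check would be `looks_cloaked(fetch_as(url, GOOGLEBOT_UA), fetch_as(url, BROWSER_UA))`. A False result only rules out cloaking keyed on the User-Agent header, and pages with rotating content can trip the threshold on their own.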
Maia
PS-Still wondering if you had tried the googlebot@google.com email address I suggested earlier in the thread. When I attempted to use the "remove URL" tool that was also listed as the contact email address to use if you had further questions. Although, at this point, it seems they have stopped responding to anybody.
The number of samples that I have seen has been *extremely* small, but there is a definite commonality. I would really like to see more samples for confirmation purposes. If anyone has additional samples I would really like to see them via sticky.
The technique will not be published here; however, email originating from google.com will be answered, provided the sender appears genuinely interested in fixing the problem or at least verifying the correctness of their handling of the particular protocol interactions.
I came in this morning and noticed that I got crawled by googlebot. Not a full crawl, but about 1/3 of the site. I thought, okay, this is good. Maybe it will straighten the mess out.
Guess what! My money 3 word phrase popped up to #1! Yippee! But wait a minute. I better check that link out. Oh No! It couldn't be! Could it? Guess what. The link is the offending site! My cache! My Backlinks!
My other 3 word phrase is now at #5. And here is the kicker. It is another site's redirect! My cache! My backlinks!
Money 2 word phrase went from #50 to #51. It is STILL the same offending site with my cache and my backlinks.
So it seems that instead of improving, it got worse.
It appears to be an artifact of the way specific dynamic page handlers are implemented on apache, on version 1.3x in particular.
Thanks to the posters who sent me their responses from Google. What a joke. I think they have a bug but they won't/can't fix it. Lots of the standard doublespeak with nothing being done.
Oh, and by the way - You guessed it. No response from google. Not even an acknowledgement. Oh Well, maybe that is just as well.
Are you sick of me yet?
WebDude
You guessed it. No response from google. Not even an acknowledgement.
You don't want the kind of response that I got.
If they have not responded, perhaps they are still looking at it and scratching their heads.
When I reported it to them, they said I should shut up and color. They told me that they are perfectly happy with the way it is working.
I think I have a handle on the programming/interpretation error that *may* be causing this as it relates to 302 redirects but not meta refresh.
I looked at two different samples on the same Apache (the same host).
Both scripts are written in PHP and use 302 redirects. One script doesn't have its linked pages cached on Google, but the other one does.
Here is what wannabrowser with NetSpider shows:
1. the one that doesn't have links cached
HTTP/1.1 302 Found
Date: Thu, 23 Sep 2004 15:46:55 GMT
Server: Apache/1.3.31 (Unix) mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 PHP/4.3.8 FrontPage/5.0.2.2634a mod_ssl/2.8.18 OpenSSL/0.9.6b
X-Powered-By: PHP/4.3.8
Location: http:// www blahblahblah net
Transfer-Encoding: chunked
Content-Type: text/html
2. this one hijacked pages
HTTP/1.1 302 Found
Date: Thu, 23 Sep 2004 15:50:25 GMT
Server: Apache/1.3.31 (Unix) mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 PHP/4.3.8 FrontPage/5.0.2.2634a mod_ssl/2.8.19 OpenSSL/0.9.6b
X-Powered-By: PHP/4.3.8
P3P: CP="CAO DSP COR CURa ADMa DEVa OUR IND PHY ONL UNI COM NAV INT DEM PRE" policyref="www somesite com/w3c/p3p.xml"
Set-Cookie: hits=++929+; expires=Thu, 23-Sep-04 15:51:26 GMT
Set-Cookie: linksread=%5BEND%5D929%5B%2C%5D1095954626%5BEND%5D; expires=Fri, 23-Sep-05 15:50:26 GMT
Location: [www...] blahblahblah net
Transfer-Encoding: chunked
Content-Type: text/html
<meta http-equiv="refresh" content="0;url=http://www blahblahblah com">
--------------------------
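With only two samples it is easy to miss a difference by eye, so here is a small sketch that compares two raw header dumps like the ones above and lists which headers appear in only one of them and which have different values. The function names are made up for this post, and the two sample strings are abbreviated from the dumps above (only the Date, X-Powered-By, Set-Cookie and Content-Type lines are kept).

```python
def parse_headers(raw):
    """Split a raw header dump into (status_line, {lowercased name: [values]})."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    status_line, fields = lines[0], {}
    for ln in lines[1:]:
        name, sep, value = ln.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        fields.setdefault(name.strip().lower(), []).append(value.strip())
    return status_line, fields


def compare_headers(raw_a, raw_b):
    """Report header names unique to each dump and names whose values differ."""
    _, a = parse_headers(raw_a)
    _, b = parse_headers(raw_b)
    return {
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
        "changed": sorted(k for k in set(a) & set(b) if a[k] != b[k]),
    }


# Abbreviated from the two dumps above
SAMPLE_1 = """HTTP/1.1 302 Found
Date: Thu, 23 Sep 2004 15:46:55 GMT
X-Powered-By: PHP/4.3.8
Content-Type: text/html"""

SAMPLE_2 = """HTTP/1.1 302 Found
Date: Thu, 23 Sep 2004 15:50:25 GMT
X-Powered-By: PHP/4.3.8
Set-Cookie: hits=++929+; expires=Thu, 23-Sep-04 15:51:26 GMT
Set-Cookie: linksread=%5BEND%5D929%5B%2C%5D1095954626%5BEND%5D; expires=Fri, 23-Sep-05 15:50:26 GMT
Content-Type: text/html"""

if __name__ == "__main__":
    print(compare_headers(SAMPLE_1, SAMPLE_2))
```

Run on the full dumps, this would also flag the P3P header and the differing Server strings. It only looks at headers, so the meta refresh in the body of the second sample has to be checked separately.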
Also I was trying to find the changes in Apache 1.3 from the previous versions, but didn't really find anything on 302. In one of the discussions I read, though, they mentioned using no-cache:
<META HTTP-EQUIV="EXPIRES" CONTENT="Tue, 04 Dec 1996 21:29:02 GMT"> <META HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE"> <META HTTP-EQUIV="CACHE-CONTROL" CONTENT="PRIVATE">
Is there any logic to that?
Here is the post: search on G for: apache 302 redirect + refresh + cache
It is the first link, on a redhat domain.
gemini
search on G for: W3C "Checkpoint 7.5"
Edit: decided to quote
7.5 Until user agents provide the ability to stop auto-redirect, do not use markup to redirect pages automatically. Instead, configure the server to perform redirects. [Priority 2]
Content developers sometimes create pages that refresh or change without the user requesting the refresh. This automatic refresh can be very disorienting to some users. Instead, in order of preference, authors should:
1. Configure the server to use the appropriate HTTP status code (301). Using HTTP headers is preferable because it reduces Internet traffic and download times, it may be applied to non-HTML documents, and it may be used by agents who requested only a HEAD request (e.g., link checkers). Also, status codes of the 30x type provide information such as "moved permanently" or "moved temporarily" that cannot be given with META refresh.
2. Replace the page that would be redirected with a static page containing a normal link to the new page.
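As a rough illustration of the first option in the checkpoint, here is a tiny stand-in server that answers every GET or HEAD with a 301 and a Location header instead of serving a page with a meta refresh. It uses Python's built-in http.server only to keep the sketch self-contained; on a real site this would normally be done in the web server configuration. The target URL and the names `NEW_LOCATION`, `PermanentRedirectHandler` and `make_server` are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

NEW_LOCATION = "http://www.example.com/new-page.html"  # hypothetical target


class PermanentRedirectHandler(BaseHTTPRequestHandler):
    """Answers every GET/HEAD with '301 Moved Permanently' and a Location header."""

    def do_GET(self):
        self.send_response(301)
        self.send_header("Location", NEW_LOCATION)
        self.send_header("Content-Length", "0")
        self.end_headers()

    # HEAD gets the same headers, so link checkers see the redirect too
    do_HEAD = do_GET

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet


def make_server(port=0):
    """Bind to localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), PermanentRedirectHandler)
```

Calling `make_server(8000).serve_forever()` and requesting any path returns the redirect in the headers alone, so a client that only sends HEAD still learns where the page went, which a meta refresh cannot offer.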
It certainly says right there not to use meta refresh with 302 ... now we need to figure out the rest :) lol. I don't know if we should write to Apache or if it is actually an SE problem. G and Y! have the same issue, so it must be something Apache could answer.
gemini
They made this decision because of the last email reply from G on this issue, basically saying that both URLs (the client URL and the affiliate URL) have the same content, so G is listing the one with the higher 'rank'.
This interests me because I am considering creating a website that would feature a large amount of copyright-free text, along with my own commentary and related data from other sources. The navigation and page design would be my own work.
The text explicitly says that everything produced by this organization is free of copyright and may be reproduced. In fact, other publishers sell the text (in paper). But the organization that produces it does display it on their website. So I am worried that Google may spot the similarities and ban my website because (presumably) it will have the lower page rank.
What exactly does Google look for? The exact same HTML code (which would not be the case here), or large amounts of the same text as it is displayed in a browser?
Of course, you may not rank as well this way but your users will be happy.
On the matter of this being an Apache problem, unless I am missing something, this sounds like nonsense. If Google indexed content pages instead of pages that redirect to content the problem simply would not exist. I can see no possible way that this can be explained as a server problem - it's a Google/Yahoo problem.
Kaled.
On the matter of this being an Apache problem, unless I am missing something, this sounds like nonsense.
i did not say that it was an apache *problem*, i said that the specific cases i know of are all from servers running apache 1.3x.
i also see, from network traces, not a header checker, that there is a subtle difference in the way that apache handles certain boundary cases. while *fully* conformant to rfc 2616, the difference still sort of stands out. so, it is not an *apache* problem, rather it is behaviour that is specifically seen in certain configurations of apache 1.3x. the possibility exists that it is being interpreted in an unexpected way by an external program under someone else's control.
as an experienced programmer, i think you will recognise that boundary cases are often where bugs lie because a programmer relied on behaviour that did not hold true at the boundary.
i will repeat again that i still need more samples. rather than guessing, i would like to be able to say that of X samples, there were Z commonalities.
The only thing left is that when you search for www.[my-site].com the result shows:
www.[other-site].com/dir/links/XXX
No title, no snippet, just the link that used to link to me.
Google can show you the following information for this URL:
Find web pages that are similar to www.[my-site].com
Find web pages that link to www.[my-site].com
Find web pages that contain the term "www.[my-site].com"
When you hover over "pages that are similar", the link goes to pages that are similar to www.[other-site].com.
When you hover over "pages that link to", it goes to www.[other-site].com.
When you hover over "pages that contain the term", it goes to www.[MY-SITE].com.
Now, if the page was cloaking, wouldn't it still be showing up in the SERPs for keyword searches like it did before they removed my listing from their site?
Why, then, is Google still associating me with this site's link that contains no title, no snippet, no cache and no longer seems to have anything to do with me?