Normally the site grows at a tempo of 200 to 500 pages a month indexed by Google and others ... but about a week ago I noticed that my site was losing about 5,000 to 10,000 pages a week in the Google index.
At first I simply presumed that this was the unpredictable Google flux, until yesterday, when the main index page of www.widget.com disappeared completely out of the Google index.
The index-page was always in the top-3 position for our main topics, aka keywords.
I tried all the techniques to find my index page, such as allinurl:, site:, direct link etc., but the index page has simply vanished from the Google index.
As a last resort I took a special chunk of text which can only belong to my index page, "company name own name town postcode" (a sentence of 9 words), and searched for it in Google.
My index page did not show up, but instead 2 pages from other sites showed up as having this information on their page.
Let's call them:
www.foo1.net and www.foo2.net
Wanting to know what my "company text" was doing on those pages I clicked on:
www.foo1.com/mykeyword/www-widget-com.html
(with mykeyword being my site's main topic)
The page could not load and the message:
"The page cannot be displayed"
was displayed in my browser window
Still wanting to know what was going on, I clicked "Cached" on the Google SERPs ... AND YES ... there was my index page as fresh as it could be, updated only yesterday by Google itself (I have a daily date on the page).
Thinking that foo was using a 301 or 302 redirect, I used the "Check Headers Tool" from WebmasterWorld, only to get a code 200 for my index page on this other site.
So, foo is using a meta redirect ... very quickly I wrote a little robot in Perl using LWP, adding a little code that would recognize any kind of redirect.
Fetched the page, but again got a code 200 with no redirects at all.
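For anyone who wants to repeat this kind of check, here is a minimal sketch of a redirect recognizer. It is written in Python rather than the Perl/LWP robot mentioned above, and it is not the original script. The function name classify_redirect and the example URLs are made up for illustration. It only looks for an HTTP 3xx status with a Location header and for a meta refresh tag. A JavaScript redirect would not be caught.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class _MetaRefreshFinder(HTMLParser):
    """Remembers the content attribute of the first <meta http-equiv="refresh">."""

    def __init__(self):
        super().__init__()
        self.content = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.content is not None:
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("http-equiv", "").strip().lower() == "refresh":
            self.content = attr.get("content", "")


def classify_redirect(url, status, headers, body):
    """Return (kind, target): kind is 'http-<code>', 'meta' or 'none'."""
    # Header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    if 300 <= status < 400 and "location" in lowered:
        return ("http-%d" % status, urljoin(url, lowered["location"]))

    finder = _MetaRefreshFinder()
    finder.feed(body)
    if finder.content is not None:
        # Typical form: "0; url=http://example.com/"
        _, _, rest = finder.content.partition(";")
        match = re.match(r"\s*url\s*=\s*(.+)", rest, re.IGNORECASE)
        if match:
            target = match.group(1).strip().strip("'\"")
            return ("meta", urljoin(url, target))
    return ("none", None)
```

The status, headers and body have to come from a fetch that does not follow redirects automatically, otherwise a 301 or 302 is swallowed and only the final 200 is seen.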
Thinking foo's site was up again, I tried once more to load the page with IE, Netscape and Opera, but always got:
"The page cannot be displayed"
Tried it a couple of times with the same result: LWP can fetch the page but browsers cannot load any of the pages from foo's site.
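One way to test whether a server answers differently depending on who is asking is to request the same URL with several User-Agent strings and compare the results. The sketch below assumes the difference is tied to the User-Agent header, which the observations above do not prove. A browser error can have other causes as well. The function names and the user agent strings are illustrative only.

```python
import hashlib
import urllib.error
import urllib.request


def fetch(url, user_agent, timeout=10):
    """Fetch url with the given User-Agent; return (status, body) or (None, b'')."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status.
        return err.code, err.read()
    except (urllib.error.URLError, OSError):
        # No usable answer at all (refused, timed out, DNS failure ...).
        return None, b""


def compare_user_agents(url, user_agents, fetcher=fetch):
    """Map each label to (status, short body digest) for the same URL."""
    results = {}
    for label, agent in user_agents.items():
        status, body = fetcher(url, agent)
        results[label] = (status, hashlib.sha1(body).hexdigest()[:12])
    return results


def responses_differ(results):
    """True when at least two user agents received different responses."""
    return len(set(results.values())) > 1
```

Calling compare_user_agents(url, {"robot": "libwww-perl/5.0", "browser": "Mozilla/4.0 (compatible; MSIE 6.0)"}) and then responses_differ on the result shows whether the two were treated alike. Pages with changing content (such as a daily date) will differ between fetches anyway, so the status codes are the more reliable signal.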
Wanting to know more I typed in Google:
"site:www.foo1.com"
to get a huge load of pages listed, all constructed in the same way, such as:
www.foo1.com/some-important-keyword/www-some-good-site-com.html
Also I found some more of my own best ranking pages in this list, and after checking the Google index, all of those pages from my site have disappeared from the Google index.
None of the pages found using "site:www.foo1.com" can be loaded with a browser, but they can all be fetched with LWP, and all of those pages are cached in their original form in the Google cache under the cache link of foo.
I have sent an email to Google about this and am still waiting for a response.
Of course, you may not rank as well this way but your users will be happy.
On the matter of this being an Apache problem, unless I am missing something, this sounds like nonsense. If Google indexed content pages instead of pages that redirect to content the problem simply would not exist. I can see no possible way that this can be explained as a server problem - it's a Google/Yahoo problem.
Kaled.
On the matter of this being an Apache problem, unless I am missing something, this sounds like nonsense.
i did not say that it was an apache *problem*, i said that the specific cases i know of are all from servers running apache 1.3x.
i also see, from network traces, not a header checker, that there is a subtle difference in the way that apache handles certain boundary cases. while *fully* conformant to rfc 2616, the difference still sort of stands out. so, it is not an *apache* problem, rather it is behaviour that is specifically seen in certain configurations of apache 1.3x. the possibility exists that it is being interpreted in an unexpected way by an external program under someone else's control.
as an experienced programmer, i think you will recognise that boundary cases are often where bugs lie because a programmer relied on behaviour that did not hold true at the boundary.
i will repeat again that i still need more samples. rather than guessing, i would like to be able to say that of X samples, there were Z commonalities.
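a rough sketch of the "of X samples, Z commonalities" tally, assuming each sample has been reduced to a dict of response headers. the function names are made up, and any header values fed in are whatever your own traces show, nothing here is measured data.

```python
from collections import Counter


def common_header_values(samples):
    """Count how many samples contain each (header, value) pair."""
    counts = Counter()
    for headers in samples:
        # A set makes sure one sample counts at most once per pair.
        pairs = {(name.lower(), value.strip()) for name, value in headers.items()}
        counts.update(pairs)
    return counts


def shared_by_all(samples):
    """Return the (header, value) pairs present in every sample, sorted."""
    counts = common_header_values(samples)
    return sorted(pair for pair, n in counts.items() if n == len(samples))
```

common_header_values gives the Z for each pair and len(samples) is the X. shared_by_all is just the case where Z equals X.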
The only thing left is that when you search for www.[my-site].com the result shows:
www.[other-site].com/dir/links/XXX
No title, no snippet, just the link that used to link to me.
Google can show you the following information for this URL:
Find web pages that are similar to www.[my-site].com
Find web pages that link to www.[my-site].com
Find web pages that contain the term "www.[my-site].com"
When you hover over "pages that are similar", the link goes to pages that are similar to www.[other-site].com.
When you hover over "pages that link to", it goes to www.[other-site].com.
When you hover over "pages that contain the term", it goes to www.[MY-SITE].com.
Now, if the page was cloaking, wouldn't it still be showing up in the SERPS for keyword searches like it did before they removed my listing from their site?
Why, then, is Google still associating me with this site's link, which contains no title, no snippet, no cache and no longer seems to have anything to do with me?
My knowledge of http is about as thin as a cigarette paper so I am perfectly happy to accept that I am wrong. My logic is that servers simply chuck out text responses to requests in headers and that if the responses of two servers are essentially the same (as they should be) then Googlebot's response will be the same whether the server is running Apache, Windows or whatever.
However, I guess you are working on the "garbage in, garbage out" principle. If the responses of some servers are outside the norm, then Googlebot may become confused. However, I do not see this as a robot problem (though it might be I suppose) I see it as an indexation problem.
So why am I prepared to stick my head above the parapet and say it is an indexation bug? Well, first reports (and some email responses from Google) suggest that the redirection trick works best if you have a PR advantage. Other than for scheduling, I see no reason for robots to have any knowledge of PR - therefore the problem must lie in indexation. If it lies in indexation, in order for the problem to be caused by server responses, those responses must be recorded in detail and processed by the indexation service. Whilst this is possible, I would say it is more likely that the robots record only simplified versions of response headers.
Kaled.
You are definitely in the right neighbourhood.
Consider if you will that search engines are not one monolith, but rather a suite of processes implemented in multiple programs.
Consider also that these programs are implemented by individual teams or programmers.
Consider finally that these programmers depend upon informal and formal knowledge of the output of the programs upstream from their own program and the expectations with respect to the outputs from their own program.
For GIGO to happen requires only one mistake anywhere in the chain. The results of the observation points available so far suggest a boundary condition trap that I as a programmer could easily fall into. The reasoning then becomes: if one programmer could fall into the trap, then why not another?
On a more general note, I have received a few more URLs by sticky, but as usual, could use more. I need both the victim and donor URLs. If someone is reluctant to release the victim URL, then the URL of the external page is fine. That way, if it is a page containing multiple outbounds, I have no way of identifying the victim site.
My site is in my profile. Just visit the links section. Any technical assistance on how to circumvent this problem is appreciated.
After doing a complete check I found lots of things. For example, a redirect page on my site has 16500 backlinks. But luckily the site it is linking to is a behemoth with a PR9 or 10. So good enough. And my page also has not been removed because of duplicate content.
So not-my-page meta refreshes to my-page, and my-page is dropped, but not-my-page remains in Google search results, with my-page content but with not-my-page as the link.
So I email Google and they say to do a 301 redirect, and that they do not manually adjust search results, but I say that I cannot 301 redirect: not-my-page is, well, not mine.
End result? Nothing.