The only possible reason for the banning I can see at the moment is the way they handle "Page Not Found" requests on the server. Because they get a lot of mistyped URLs and broken link referrals to old pages, they have the server serve the nearest page in the folder requested.
As this effectively presented duplicate content to spiders following inbound links to old pages, I recommended that they drop this strategy and serve a 404 page for requests to missing pages. However, they were reluctant to do this, so we compromised: the server still returns the nearest page's content for a missing URL, but with a 404 status code in the header.
Unfortunately, 2 weeks after this setup was implemented, Google dropped the site. What is the wisdom here? Do I hang in there till the next update to see if Google figures out the new 404 setup? Or should I get them to stop trying to be clever immediately?
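For what it's worth, the compromise described above (nearest page's content, but a 404 status) can be sketched roughly like this. The page table and the `difflib`-based "nearest" matching are my own placeholders, not the actual server setup:

```python
# Sketch of "serve the nearest page's content, but with a 404 status".
# PAGES and the similarity matching are hypothetical stand-ins.
import difflib

PAGES = {
    "/products.html": "<html>Products</html>",
    "/about.html": "<html>About</html>",
}

def handle_request(path):
    """Return an (http_status, body) pair for a requested path."""
    if path in PAGES:
        return 200, PAGES[path]
    # Missing page: pick the closest existing URL by string similarity.
    # cutoff=0.0 guarantees a match as long as PAGES is non-empty.
    nearest = difflib.get_close_matches(path, PAGES, n=1, cutoff=0.0)[0]
    # Serve that page's content, but tell spiders the URL itself is bad.
    return 404, PAGES[nearest]
```

The point of the 404 status is that a spider should drop the bad URL from its index instead of treating the served content as a duplicate page.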
But for the most part, as you may already know, Googlebot ignores or doesn't find the "404.html"...
So maybe changing the 404s somehow made it so that Googlebot couldn't see the pages... which leads me to believe that waiting for the next update won't change anything if you don't change the website first.
> Why not serve a 301 to the nearest page?
> Thus you will effectively "correct" misspelled URLs.
Yes killroy, I would think this should be OK to do, especially since Google recommends this approach in their webmaster guidelines. However, in my experience Google handles these badly, often indexing the source URL and not the target. I also suspect that if it encounters a lot of 301s (for instance by entering the site through old broken external links), it flags the site as spamming. This is just a hunch though...
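killroy's suggestion would look something like this in the same spirit; again the page set and "nearest" matching are hypothetical, and note that the redirect carries a Location header rather than a body:

```python
# Sketch of the 301 alternative: "correct" a misspelled URL by redirecting
# permanently to the nearest real page. PAGES and the matching are
# hypothetical stand-ins, not anyone's actual server config.
import difflib

PAGES = {
    "/products.html": "<html>Products</html>",
    "/about.html": "<html>About</html>",
}

def handle_request(path):
    """Return (http_status, headers, body) for a requested path."""
    if path in PAGES:
        return 200, {}, PAGES[path]
    nearest = difflib.get_close_matches(path, PAGES, n=1, cutoff=0.0)[0]
    # Permanent redirect: the spider should index the target, not the
    # source (though, as noted above, that doesn't always happen cleanly).
    return 301, {"Location": nearest}, ""
```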
> Well, I am pretty sure a grey PR bar doesn't mean you were penalized...Grey usually indicates the website has not been found by googlebot yet and a completely white bar indicates a penalty.
hobbnet, the site has had PR 6 for over a year, and has a dmoz listing, so this looks like a ban to me.
I think I need to look at Googlebot's behaviour in the log. Any experience on dealing with 404s and Google would be gratefully received.
I've considered serving redirects to an alternative page in place of a 404, but I've always resisted because I'm unsure of the impact on my servers from the thousands of requests per day for formmail.pl, default.ida and the many other bogus requests that have nothing to do with a human visitor...
Any thoughts?
I used 301s on qualified addresses. To avoid serving large 404 pages to great numbers of bogus requests such as formmail and the other exploits, I simply serve up a 0-byte blank HTML page for those.
I have also noticed that Google would pick up the 301, and then come back a few days later for the target of the 301. So if you do it near the end of a deep crawl you might miss that one.
I am happy to report, though, that Google has picked up all the new URLs this deep crawl that were changed only two weeks ago.
Also, by serving up (internal redirect) the blank pages for the exploit requests, I save bandwidth as well as avoiding clutter in the error logs.
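The blank-page trick for exploit probes could look something like this; the pattern list is just an illustrative sample of common worm/exploit requests, not the actual setup:

```python
# Serve a 0-byte blank response for known exploit/worm probes so they
# don't eat bandwidth on a full 404 page or clutter the error logs.
# The pattern list and page table here are only illustrative samples.
EXPLOIT_PATTERNS = ("formmail", "default.ida", "cmd.exe", "root.exe")

PAGES = {"/index.html": "<html>Home</html>"}
ERROR_PAGE = "<html>404 Not Found</html>"

def handle_request(path):
    """Return (http_status, body) for a requested path."""
    if path in PAGES:
        return 200, PAGES[path]
    if any(pattern in path.lower() for pattern in EXPLOIT_PATTERNS):
        return 200, ""  # blank page: nothing to transfer, nothing to log
    return 404, ERROR_PAGE  # real visitors and spiders still get a proper 404
```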
I regularly check the error logs and try to keep them empty by redirecting appropriately.
sticky me if you want details.
SN