|Search for discontinued product - 50% bad pages|
I'm looking for a discontinued product. If it matters, it's a particular brand of SCSI backplane in a particular size. I was hoping to find an online store that still has some in stock.
The product has probably been discontinued for a year or more.
A Google search (regular - not Froogle) returns plenty of results. However, at least 50% of the results return:
1. 404 errors
2. no website at domain
3. completely different product
4. different product (typically, a SATA backplane)
5. product no longer available
(So far, I haven't found the product available for sale. But I've only gone through the first few pages. Do I really want to go through several thousand of them?)
There's no excuse for the first 3 results. Sadly, they are the most common. #4 points out a common problem - Google apparently can't tell the difference between body text and menus. So, a page about a SATA backplane, with a site menu mentioning a bunch of SCSI products, will rank highly for a search for a SCSI product.
(I'll say it before anyone else does. It looks like the most effective place for keyword-stuffing is in your site menus.)
Usually, when I'm searching for a product, it's one that's currently available. Google works reasonably well in that case. Google works great for the easy cases...
Try searching for a discontinued product.
Welcome to Google Hell...
This works against both users and webmasters. Do you want visitors who are looking for something you don't have? Do you think you'll ever see those users again?
Anyone else notice that Google is WAY too slow to remove pages that are no longer there, or that are no longer relevant?
I don't understand why Google keeps pages that no longer exist. Are they still so concerned about beating other search engines with their count of listed pages?
|However, at least 50% of the results return <various useless results> |
Actually, 100% of the results (that I have examined) return useless results.
Google needs to clean out their refrigerator! It's full of rotten stuff.
Are they retaining data way past its prime in order to inflate "number of pages indexed"?
What is your experience with your own websites? How long after you delete a page does Google remove the page from the SERPs?
I believe that Google brings them back from its archives if there's still a link pointing at the 404 URL. Suppose you have /blah-blah.html and delete it. If Google sees that a site still links to it, I think Google will still show that as a supplemental.
If I delete the page from my site, I want it gone.
|I don't understand why Google keeps pages that no longer exist. |
Probably the same reason that DMOZ is very slow to actually delete a website (as opposed to moving it back to "unreviewed") -- they often reappear after days, weeks, or even months, and they may have been the only source of some particular piece of information.
With Google, the motivation is even higher to retain since, with any luck, they have a cached copy you can look at.
This brings up another point as far as copyrights go. Some lawyer with too much time on his hands, and/or looking for a payday, will probably sue Google for keeping deleted pages. There's no issue of consent either, since you didn't give Google permission to keep a certain page in perpetuity, even after you delete it.
|If I delete the page from my site, I want it gone. |
But how does any search engine know that? 404 means the server can't find the resource at that time. That's all it means, and as others have observed, 404 content often does re-appear. Not all servers are maintained with great discipline.
Better results should come from returning a 410 (Gone) -- however, a recent interview with the Google Sitemaps team said that they treat 404 and 410 identically. Still, I have heard stories of more rapid removals from the index using a 410 response.
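For anyone unsure what the 404 vs. 410 distinction looks like on the server side, here is a minimal sketch in Python. It assumes a hypothetical site that keeps a dict of live pages (`LIVE_PAGES`) and a set of URLs that were deleted on purpose (`REMOVED_PATHS`); those names and the product path are made up for illustration. It only shows which code the server sends. Whether a search engine acts on the difference is exactly what is in dispute above.

```python
from http import HTTPStatus

# Hypothetical data for illustration: pages that exist, and pages
# the site owner removed on purpose.
LIVE_PAGES = {"/": "Home page", "/scsi-backplane.html": "SCSI backplane"}
REMOVED_PATHS = {"/blah-blah.html"}


def status_for(path: str) -> HTTPStatus:
    """Pick the HTTP status code for a request path."""
    if path in LIVE_PAGES:
        return HTTPStatus.OK         # 200: the page exists
    if path in REMOVED_PATHS:
        return HTTPStatus.GONE       # 410: removed on purpose
    return HTTPStatus.NOT_FOUND      # 404: not found right now, no promise either way
```

A deleted page then answers `status_for("/blah-blah.html")` with 410, while a URL the site has never heard of still gets a plain 404.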
|If I delete the page from my site, I want it gone. |
|With Google, the motivation is even higher to retain since, with any luck, they have a cached copy you can look at. |
Well, now we get into very shady areas in regards to copyright. Let's say I design a really nifty *fill-in-the-blank* and decide to show my design on my website for 48 hours or something like that.
Three months later, I realize that my design is being copied all over the world! After investigating, I find that even though I used noindex, nofollow, somebody else "borrowed" my whole page (complete with schematics, photos, etc) ... put it on their website and it has been there for months!
I hire a lawyer, file a DMCA report, win my lawsuit and get a nice cash settlement as well as the other guy's page taken down ... but Google still has the cached page for everyone to see because the other guy didn't use noindex, nofollow.
If I delete a page from my site, I want it completely gone off the internet and like the original poster, if looking for something, I'd rather not get supplemental results which have little or no content or which are redirected to someone's home page.
The internet is still in its infancy. Imagine what the "supplemental" page results will be like in another 10 years! Will surfers have to wade through hundreds of irrelevant results to find one gem?
Deleted pages should not show up in the current search results. If Google is bent on saving everything ever written on the internet, then accessing non-existent pages should be made an option rather than mixed in with the real search results. It's very frustrating for searchers and website owners alike.
? I really need to read the whole post before I post. Sorry :(
[edited by: tantalus at 5:50 pm (utc) on April 4, 2006]
I'm now changing all my sites' URLs and links for Yahoo! and MSN SEO. I can see that both bots are crawling better now. Google, I hope you can sustain your competitive advantage before your competitors overtake you.
I see your point on the first 404; however, if Google doesn't find the link, say 2-3 times within a month or so, it's safe to say that the page has been deleted. If it's a mistake from the server owner, so be it; people suffer from their mistakes all the time. When he/she fixes it, Google can bring the page back.
I had originally thought that 410 was the sure thing to delete a page, only to learn that Google doesn't care about the site owner's wishes, and wants to provide whatever content. The problem is that this is not their content, and the copyright holder, by virtue of deleting it, does not want it to be displayed.
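The "missing 2-3 times within a month" rule suggested above can be sketched in a few lines of Python. This is only an illustration of the poster's proposal, not a description of what any search engine actually does. The function name `looks_deleted`, the crawl-history format, and the defaults (3 failures, 30 days) are assumptions made for the example.

```python
from datetime import date, timedelta


def looks_deleted(crawls: list[tuple[date, int]], today: date,
                  min_failures: int = 3, window_days: int = 30) -> bool:
    """Guess whether a URL was deleted on purpose.

    `crawls` is a list of (crawl date, HTTP status) pairs. The page counts
    as deleted only if it was fetched at least `min_failures` times in the
    last `window_days` days and every one of those fetches was a 404 or 410.
    """
    cutoff = today - timedelta(days=window_days)
    recent = [status for day, status in crawls if cutoff <= day <= today]
    return len(recent) >= min_failures and all(s in (404, 410) for s in recent)
```

Requiring every recent fetch to fail means a page that came back even once inside the window is not dropped, which covers the sloppy-server case raised earlier in the thread.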
Looks like the solution to deleting a page from your site and various SE caches is to first replace the page with something very empty, such as "page is obsolete". Leave that up for a month or two, or longer, so that it gets cached, then actually delete the page.
I see yahoo slurping on pages that I deleted years ago. I haven't bothered to check their results to see if they do anything with those long gone pages, but they keep coming back looking.
I have to say, though, even a month is not long enough for Google to assume that the site is really gone. I mean, it takes up to 4 months and sometimes longer just to see the result of a positive change; I would hate for the negative changes to take hold quicker than the positive ones.
I know this has happened to someone on here. I've read about it too often for it not to have. You have a hosting company, everything is great, then one day your site is offline.
You make calls, send emails, and finally you come to find that your hosting company has gone belly up. You scramble to get your source code, you look for another contract. You possibly have to have data migrated over, and transfer the DNS from a company that is gone to your new one (not an easy task). If you managed to get all this fixed in a month (I would say it is impossible), wouldn't you just die if you found that your site, which has been top 20 in the SERPs for years, is now totally dropped? It happens all the time. I know it sucks to get pages that return 404, but it would be worse if that site was really only gone for a couple of weeks and then you could never find it again.
>> Looks like the solution to deleting a page from your site and various SE caches is to first replace the page with something very empty, such as "page is obsolete". Leave that up for a month or two, or longer, so that it gets cached, then actually delete the page. <<
Google will simply bring back an older version of the page as a Supplemental Result.
I did that with a page in November, I cleared it of information and just left a "page gone" note on the page. In January the almost empty page was deleted, and returned a 404 status.
Within days Google reverted to showing a July 2005 cache for that page.
You have to leave the "page gone" page up for months, but even that just means that the "page gone" page will be the one that shows as supplemental. You gotta 301, and continue to link to the old URL. That's the only way to kill a healthy page. There is no way to kill a supplemental of course.
The "Page Gone" note had been indexed and cached several times per week for several months at the point that the page was finally removed from the site.
Google just reverted to an older cache, and marked it Supplemental - within days.
If the issue is getting old information off the internet (this works for your own site only, unless you can get a third party to do the same), just to reinforce: put up a blank page with a 404, or a 301 if you have related info. You can even 301 to a custom error page.
That does not work if the page is already Supplemental. Google will continue to show the old page for years.
If the page was not Supplemental, then they will simply create one using a cache from 4 to 12 months old to use in its place instead.
All these points are great; however, doesn't it all seem a bit technical just to tell a search engine that a page no longer exists?
Yes, you can 301, 410 etc etc - but let's be honest, how many website owners know what the heck that is all about?
It is Google's job (and the rest) to index the web - that's what they tell us - they therefore need to reflect the web. It is their job to handle all these types of issues, not ask every single website owner in the world to do a 301 etc for that old page so that their results will be relevant.
This is not a case of chicken and egg - websites come first, indexing search engines come second. The search engine that reflects this will win, surely?
A related problem that Google doesn't handle well is stale content on the sites themselves. I encountered this as well on the search that prompted me to write this post.
The problem is especially severe for computer parts vendors' product pages. A lot of these vendors run "virtual" warehouses, and drop-ship much, most, or all of their "inventory".
I've noticed vendors getting more and more lax about removing products that are no longer available. In many cases, they have no idea what is actually in stock at their supplier's warehouse, and/or the mechanism to update this information is broken.
In the case of one particular vendor, I noticed that every product from a certain manufacturer showed exactly 2 items in stock in their Texas warehouse. I thought this seemed odd, and called the company. They confirmed that they had nothing in stock, drop-ship from the manufacturer, had no idea what the manufacturer had in stock, and that the inventory information on their website was completely erroneous. They also expressed no interest in fixing the problem.
Is it Google's job to filter this?
Absolutely. It's their job to find what users are looking for. If websites return irrelevant results, they should go to the bottom of the heap. Google needs to develop specific algorithms for particular types of searches. When doing a product search, whether or not the product is in fact available, and the reliability of the vendor's product listings, should be criteria.
Regarding the probably-innocent "keyword stuffing" in product indices, a friend of mine suggested that I could have avoided irrelevant results in my search by putting "SCSI backplane" in quotes, which would have dropped the "SCSI this", "SCSI that" in the menus from being deemed relevant. (I got SATA backplanes in high positions when searching for SCSI backplanes, apparently because SCSI appeared in multiple places in the site product menus that appear on every page.)
Should a user have to worry about this and structure their queries to work around the problem? I don't think so. Google needs to work on understanding the structure of web pages, and recognizing when they see a site index. Site indices should contribute little if anything to the relevance of the page on which they appear.
Is there a way of preventing part of a page from being examined for relevance? There should be. There needs to be a meta tag that says "this part of the page is part of a site-wide index or menu, and has little or no relevance to this page's specific content."
If the code is site wide, Google can (should!) already see that it is site wide - 'cus they have looked at multiple pages from the site.
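A rough sketch of that idea in Python: if the same block of text shows up on nearly every page of a site, treat it as menu or index and leave it out when judging what a single page is about. This is a guess at one possible approach, not a claim about how Google works. The function names, the 80% threshold, and the input format (each page already split into text blocks) are all assumptions for the example.

```python
from collections import Counter


def sitewide_blocks(pages: dict[str, list[str]], threshold: float = 0.8) -> set[str]:
    """Return text blocks that appear on at least `threshold` of the pages."""
    counts: Counter = Counter()
    for blocks in pages.values():
        counts.update(set(blocks))  # count each block once per page
    needed = threshold * len(pages)
    return {block for block, n in counts.items() if n >= needed}


def page_specific_text(pages: dict[str, list[str]], url: str,
                       threshold: float = 0.8) -> list[str]:
    """Blocks on one page that are not part of the site-wide menu."""
    common = sitewide_blocks(pages, threshold)
    return [block for block in pages[url] if block not in common]
```

With a site menu reading "SCSI cables | SCSI drives" on every page, the SATA backplane page would be left with only its own SATA text, so the SCSI mentions in the menu would no longer count toward its relevance.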
I may just be uneducated in webmasterly ways ... but I just delete the page and all links to it on my site when I delete a page. I don't worry about 404's or 410's or what have you.
It may not be the accepted way to do things, but it seems to work ok for the most part.
Just deleting a page will certainly work. The problem is that if "important" pages disappear from your site you may be penalized, or so the rumor goes...
|If I delete a page from my site, I want it completely gone off the internet |
There is no problem getting what you require.
Simply use a robots.txt file properly and we will never cache any of your web pages.
We are very happy you call us "the internet", however we are Google only.
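For readers who have not written a robots.txt file, here is a small Python sketch that uses the standard library's `urllib.robotparser` to check what a given set of rules allows. The rules, the `/old-products/` directory, and the example.com URLs are hypothetical. Note that robots.txt only tells crawlers what they may fetch; the sketch says nothing about what happens to pages that were already indexed or cached.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep every crawler out of /old-products/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /old-products/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(url: str, user_agent: str = "Googlebot") -> bool:
    """True if the rules above let `user_agent` fetch `url`."""
    return parser.can_fetch(user_agent, url)
```

Under these rules, `may_crawl("http://www.example.com/old-products/backplane.html")` is refused, while the home page stays crawlable.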
|The problem is that if "important" pages disappear from your site you may be penalized, or so the rumor goes... |
Yikes ... I've been doing that for years and didn't know this was something my site could be penalized for! Honest to God, this is the first time I have ever heard that particular rumour!
Is this really true?
<added> If someone clicks on a search result for a page that has been taken down (usually a product which is no longer available) ... they end up back on my homepage. Is this a really bad thing to do? If so, why is it bad?
You just said two opposite things. Do you delete pages and not do anything else, or do you 301 everything back to your main page?
Forget it ... I removed my original message as there was nothing to say of value.
But to answer your question steveb ... I do nothing. I simply delete the page. I don't know what happens internally to feed my homepage to those clicking on search results for pages which no longer exist.
You'll have to ask those here at WebmasterWorld what was done when the site was being built in order to do this. It is beyond me and my abilities as a webmaster.
I've deleted pages and didn't notice my position in GG change for other product pages within the section of the deleted page. The deleted pages didn't last that long in the listings, and they would also return the home page if clicked on before GG scrapped them.
You can display a copy of your home page as a custom error message, as long as the original url gets a 404 (or 410) response in the server header.
I would not suggest using a 301 to your home page, as over time this results in many different urls all having the home page's content. Usually those urls just end up as Supplemental Results, but if the "duplicate pages" number gets very high, I have seen it start to impact the rank of other pages from the domain in the SERPs. It may have something to do with poisoning any links on the dupe content page -- but I'm not sure on that, nor where the threshold of safety may be.
In fact, one of the dirty tricks I've seen malicious competitors try is to aim a pile of bad links at a site with an incorrectly configured "custom 404" that doesn't really return a 404. Same thing with various canonical possibilities -- if they see a vulnerability, they sometimes aim links at those urls.
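To make the difference concrete, here is a minimal Python sketch of a correctly configured custom error page next to the broken kind described above. Everything in it is hypothetical (the `PAGES` dict, the handler names, the probe URL); it only shows the rule that the body may be a copy of the home page as long as the status code stays 404.

```python
from http import HTTPStatus
from typing import Callable, Tuple

Response = Tuple[int, str]  # (status code, HTML body)

HOME_HTML = "<html><body>Home page</body></html>"
PAGES = {"/": HOME_HTML}  # hypothetical site with a single live page


def respond(path: str) -> Response:
    """Correct setup: unknown URLs show the home page content with a 404 status."""
    if path in PAGES:
        return HTTPStatus.OK, PAGES[path]
    return HTTPStatus.NOT_FOUND, HOME_HTML


def broken_respond(path: str) -> Response:
    """Misconfigured setup: unknown URLs show the home page with a 200 status."""
    return HTTPStatus.OK, PAGES.get(path, HOME_HTML)


def answers_200_for_missing_page(handler: Callable[[str], Response],
                                 probe: str = "/no-such-page-for-testing") -> bool:
    """Request a URL that should not exist and report whether it came back 200."""
    status, _body = handler(probe)
    return status == HTTPStatus.OK
```

A check like `answers_200_for_missing_page` is the same test a competitor could run against your site, so it is worth running against your own server headers first.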
Good post Tedster!
I don't know if it's my Y! store platform that does that as a feature - but you know I've got about 11 pages supplemental.
Two of them are the home page with bizarre URLs combining my URL and an old one I used years ago, like "http://www.domain.com/olddomanname/", and a few other weird ones.
I've been trying to get an answer on how they're getting there but no one has had a clue. You may have hit on it?
Now, how would I get rid of those 11 duplicate pages with weird URLs?