
Home / Forums Index / Google / Google SEO News and Discussion
Forum Library, Charter, Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

This 40 message thread spans 2 pages.
Impressed with google's soft 404 detection
Shepherd




msg:4584880
 11:49 am on Jun 17, 2013 (gmt 0)

Been getting messages in gWMT regarding "soft" 404s on a couple of sites lately.
Looked into them and was fairly impressed. These sites (2 of them) are completely custom: no off-the-shelf code, built from scratch. NO structured data, NO google Highlighter.

Google is identifying pages on the site that are of little use to the visitor, for example, maybe a product out of stock, or an option selected for a product that does not exist.

To me this would not be as impressive if ours was off-the-shelf CMS or e-commerce software, something open-source that google could train the bot on. This is completely unique, custom-to-us stuff, and it changes a lot.

 

lucy24




msg:4584979
 4:01 pm on Jun 17, 2013 (gmt 0)

Google is identifying pages on the site that are of little use to the visitor, for example, maybe a product out of stock, or an option selected for a product that does not exist.

Well, it's not really. It's making up URLs using known queries and putting them together in new ways to see what comes up.

It's essentially the same process as making up a garbage path and seeing what happens when they ask for it. (In my case it seems to be triggered when I've added a batch of new redirects and g### wants to be sure they're bona fide redirects and not soft 404s.)

If it's your own hand-rolled code, the solution is simple. Go into your php-or-equivalent, teach it to set a flag when results lead to a real page, and otherwise send out a 404 combined with displaying an error page. It doesn't have to be the same page as if the user had requested a garbage URL; that part is strictly for humans. What matters is that a 404 gets sent out.
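In outline, the flag-plus-status idea looks like this. This is a generic Python sketch, not anyone's actual site code; the function name and markup are invented for illustration:

```python
def build_response(articles):
    """Return (status_code, html) for an article-listing request.

    The "flag" here is simply whether the lookup produced any content:
    real content gets a 200, an empty listing gets a 404 while still
    showing a human-friendly page. The status line is for machines only.
    """
    if articles:
        items = "".join(f"<li>{a}</li>" for a in articles)
        return 200, f"<ul>{items}</ul>"
    # No content: same friendly message a visitor would see, but the
    # status code tells crawlers this URL has nothing behind it.
    return 404, "<p>Sorry, no articles here yet.</p>"
```

The human-facing body can say whatever is friendliest; only the status code matters to the crawler.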

Shepherd




msg:4585004
 5:27 pm on Jun 17, 2013 (gmt 0)

Well, it's not really. It's making up URLs using known queries and putting them together in new ways to see what comes up.


Sorry Lucy24, I probably wasn't clear: they are finding "real" pages, not made-up URL strings. Seen that before; not the same thing here.

Here's the best example I can come up with without using specifics:

We sell "examples" at example.com. We have pages for "examples" in every state: example.com/alabama, example.com/alaska... etc. We write articles about "examples" for most states and list them here: example.com/alabama/articles. So let's say we haven't written any articles for "examples" in Alaska yet. The page is still there: example.com/alaska/articles but instead of listing the available articles (there are none) it says "no articles for Alaska yet". Google is telling us in WMT that example.com/alaska/articles is a "soft 404". And they're right; these are pages that are not adding value, that should not be there.

I don't usually give google much credit but they are nailing this! For whatever it's worth, I'm impressed that they have been able to figure it out.

lucy24




msg:4585081
 9:08 pm on Jun 17, 2013 (gmt 0)

Heh. Possibly they noticed that assorted queries all lead to the identical page content. If so, I agree that they deserve credit for recognizing that this is fundamentally a "soft 404" rather than Duplicate Content.

Maybe they would be happier if you tweaked your page-generating code to return a 404 response in situations that lead to the "Sorry, we don't have any articles" page. To the human user it will make no difference, since there's no inherent connection between the page they see and the response the browser receives.

brotherhood of LAN




msg:4585087
 9:18 pm on Jun 17, 2013 (gmt 0)

Agreed with Lucy: better to tell them it's not a valid page than to rely on them guessing and inferring something from some/lots of those guesses.

Ideally there wouldn't be any links to alaska/articles until there's something worth linking to.

Shepherd




msg:4585100
 10:13 pm on Jun 17, 2013 (gmt 0)

Ideally there wouldn't be any links to alaska/articles until there's something worth linking to.

Absolutely, we do this most of the time.

Possibly they noticed that assorted queries all lead to the identical page content.

That's probably a good assessment; the content is not identical, but it's definitely thin and similar.

This feels a lot like some pretty serious machine learning. To be able to correctly assess the on-page content of a one-off site like this is, to say the least, impressive. If this was a WordPress site or some other open-source CMS then no big deal; it'd be pretty easy to put together a footprint for soft 404s en masse.

Shepherd




msg:4585102
 10:26 pm on Jun 17, 2013 (gmt 0)

relying on them guessing it and inferring something from some/lots of those guesses.


oh, yeah, that's the other impressive part, 212,973 pages indexed, 575 soft 404 errors and every one of them is spot on. It's almost as if every page on the site was reviewed by a human.

Shepherd




msg:4585103
 10:29 pm on Jun 17, 2013 (gmt 0)

oh, and the first soft 404 was detected/reported on 5/9/2013. Some of the pages reported are years old so this is something new they are doing.

Awarn




msg:4585121
 11:53 pm on Jun 17, 2013 (gmt 0)

They have had those soft 404s for some time now. Commonly I see these on a product that we no longer carry. It hits the same product page, but the itemid parameter doesn't exist because that product has been removed. So the product page is valid and still renders, but since that specific item is deleted from the database, no data is retrieved.

Shepherd




msg:4585128
 12:12 am on Jun 18, 2013 (gmt 0)

So the product page is valid and still gives results but since that specific item is deleted from the database no data is being retrieved.


So, like what I've mentioned, you have real pages that probably shouldn't be there, and google is determining by the content (or lack thereof) alone that the page should no longer be there? Much in the same way a human would, maybe?

lucy24




msg:4585143
 1:07 am on Jun 18, 2013 (gmt 0)

"All the stuff around the edges is the same on every page. In the middle where I expected to find a bunch of unique content, there are only a few words-- and one of them is 'no'."

Yah. Your average robot should be able to manage that.

Remedy in simplest form:

--before starting to build page, turn on output buffer
--once you reach the point where your code determines that there is or is not material to make a "real" page, send out your response header, either 200 or 404
--now release the contents of the buffer and, if necessary, finish building the page

The sole purpose of the buffer is to buy your server some time before it has to send out a response header, since this can only be done before the user has seen even a single byte of the page.
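The same three steps can be sketched in Python, using an in-memory buffer to stand in for php's output buffer. The function names here are hypothetical, purely to illustrate the ordering (buffer first, status decided after the content lookup, then flush):

```python
import io

def render(find_content):
    """Sketch of the buffered approach: build the page into a buffer,
    decide the status once the content lookup has run, then flush."""
    buf = io.StringIO()                          # 1. turn on output buffer
    buf.write("<html><body><h1>Widgets</h1>")    # header/navigation chrome
    content = find_content()                     # may come up empty
    status = 200 if content else 404             # 2. status decided here,
                                                 #    before anything is sent
    buf.write(content or "<p>Nothing found.</p>")
    buf.write("</body></html>")
    return status, buf.getvalue()                # 3. release the buffer
```

In a real server the `status` would be written as the response header before the buffer is flushed to the client.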

Shepherd




msg:4585277
 8:58 am on Jun 18, 2013 (gmt 0)

"All the stuff around the edges is the same on every page. In the middle where I expected to find a bunch of unique content, there are only a few words-- and one of them is 'no'."

Yah. Your average robot should be able to manage that.


Great concept; that's all good until you start to get down to the nuts and bolts: 700,000,000+ websites, each with their own "around the edges", some with no edges, some with multiple edges. I don't think we're talking about an "average" bot any more when we're talking about a bot that can learn the structure, content, and intent of one website out of 700,000,000 and then accurately decide, in much the same way a human would, whether the page has value based solely on the contents of that page alone.

We spend a lot of time (myself included) bashing google for the things they apparently get wrong. We need to spend more time taking a look at what they get right and how they did it.

atlrus




msg:4585397
 3:48 pm on Jun 18, 2013 (gmt 0)

Here's the best example I can come up with without using specifics:

We sell "examples" at example.com. We have pages for "examples" in every state: example.com/alabama, example.com/alaska... etc. We write articles about "examples" for most states and list them here: example.com/alabama/articles. So let's say we haven't written any articles for "examples" in Alaska yet. The page is still there: example.com/alaska/articles but instead of listing the available articles (there are none) it says "no articles for Alaska yet". Google is telling us in WMT that example.com/alaska/articles is a "soft 404". And they're right; these are pages that are not adding value, that should not be there.

I don't usually give google much credit but they are nailing this! For whatever it's worth, I'm impressed that they have been able to figure it out.


Actually I am a bit worried about this, but not as much as you should be. After all, those "soft 404s" in your example are not really 404s, just very thin content pages; however, unlike a 404, they do exist. Soft 404 has nothing to do with actual content, but with errors implementing a "real" 404.

I'm not sure when the "soft 404" crept in, but if your page says "There are no articles at this time", it's a perfectly legit 200 OK page. Google considering those pages 404s is like claiming that a bakery doesn't sell cupcakes because they are out at this time.
Or, as someone else mentioned, Google soft-404-ing products that are out of stock. Just because the product is out of stock doesn't mean the page does not exist!?!

I just hate when Google doesn't follow simple protocol rules, but make up some arbitrary BS...

lucy24




msg:4585439
 6:37 pm on Jun 18, 2013 (gmt 0)

Google considering those pages 404 is like claiming that a bakery doesn't sell cupcakes because they are out at this time.

There's more to it, though.
#1 If customer calls the bakery and asks about cupcakes at a time when they happen to be out, is the customer told "Yes, we sell them" or "Yes, but we're sold out until tomorrow"?
#2 Does the bakery's display case include a currently empty section labeled "cupcakes", or do other products take up all available space?

There's a big difference between a complete product page with an "out of stock" label in the corner where you place your order, and a page with no product-specific content.

atlrus




msg:4585442
 7:24 pm on Jun 18, 2013 (gmt 0)

There's more to it, though.


no, there isn't. It doesn't matter how the bakery manages the counter space, as long as they sell cupcakes.

If a page is not returning 404 but 200, then the page exists (unless a technical error is involved). 404 is protocol-specific and there are no "buts" about it: it means "not found", not "out of stock", "thin content", or whatever else Google thinks fits the case.

I don't see how anyone could applaud Google twisting basic protocols just so they can tweak their own algo.

There's a big difference between a complete product page with an "out of stock" label in the corner where you place your order, and a page with no product-specific content.


No difference, as far as HTTP is concerned. If the page exists, it doesn't matter to HTTP if there is 1 word or 1,000. Now, there could be a difference as far as Google ranking that page, but that's a totally unrelated topic.

lucy24




msg:4585493
 11:09 pm on Jun 18, 2013 (gmt 0)

Uhm... Have you entirely missed several years of discussion of the "soft 404" concept?

atlrus




msg:4585503
 11:25 pm on Jun 18, 2013 (gmt 0)

Uhm... Have you entirely missed several years of discussion of the "soft 404" concept?


I must have. Where did this discussion take place? Here is what Google itself says about "soft 404", bold is mine:

Returning a code other than 404 or 410 for a non-existent page (or redirecting users to another page, such as the homepage, instead of returning a 404) can be problematic. Firstly, it tells search engines that there's a real page at that URL. As a result, that URL may be crawled and its content indexed. Because of the time Googlebot spends on non-existent pages, your unique URLs may not be discovered as quickly or visited as frequently and your site's crawl coverage may be impacted (also, you probably don't want your site to rank well for the search query [File not found]).

I admit I don't keep up with protocols as often as I should, but I am sure that I, Google, the W3C, and the rest would not have missed a change to 404. If there is a special WebmasterWorld meaning behind "soft 404", then no, I don't know about it.

Google knows very well what they are doing - sticking their nose again where it does not belong.

Shepherd




msg:4585507
 11:35 pm on Jun 18, 2013 (gmt 0)

I think google is changing their definition of a "soft 404" even though they have not updated their public information: [support.google.com...]

as Atlrus says:
After all, those "soft 404s" in your example are not really 404, just very thin content pages


google is reporting pages (and this is something new, afaik) that probably should be 404s; they probably should not exist. Good or bad, I don't know; impressive, yes.

brotherhood of LAN




msg:4585511
 11:42 pm on Jun 18, 2013 (gmt 0)

If they are calling those thin pages "soft 404's" they probably shouldn't as clearly it can cause some confusion.

I understand why they'd want to define it as such though, basically they're saying the pages are "as good as" 404.

If there aren't links to those pages then it shouldn't be an issue at all (and I feel Google would be in error inferring that the pages existed in the first place). Otherwise, 404 should be served as normal.

lucy24




msg:4585518
 11:53 pm on Jun 18, 2013 (gmt 0)

Shepherd, does your site return a 404 under other circumstances, such as when there's a garbage request? That is, "example.com/akjkvljcufoidrtujd.html" and the like. Google itself makes requests in this form; it's probably automated but the trigger is pretty low. (That is: I've seen it myself, typically after I've made major changes.)

If yes, then we are getting into something interesting. It used to be that "soft 404" was a site-wide issue: they were only interested in sites that never returned a 404 at all, ever. Now they're getting more narrowly focused.

Do you-- or can you-- add a meta "noindex" to pages of this kind? Seems like they shouldn't react as strongly to a soft 404 if the result isn't intended to be indexed anyway.

Shepherd




msg:4585522
 12:18 am on Jun 19, 2013 (gmt 0)

does your site return a 404 under other circumstances

yes, normal 404 returned for pages that do not exist.

Do you-- or can you-- add a meta "noindex" to pages of this kind?

Do not, but could. However, this would require knowing how thin is too thin, so as to know which pages should have the noindex. (In reality this is something we already know; just making a point.)

phranque




msg:4585547
 1:54 am on Jun 19, 2013 (gmt 0)

"low quality" != "soft 404"

if you are intentionally linking to a url and serving a 200 OK response, it's not a "soft 404".

if your ErrorDocument specifies a fully qualified url and causes a "404 response" to generate a 302/200 status chain, that's a "soft 404".
if a "junk" url request gets a response which very helpfully looks like a 404 page yet provides a 200 response that's a "soft 404".
if a large number of "junk" or legacy url requests get redirected to a single or small number of urls such as the home page or category pages that's a "soft 404".

410 Gone (or 404 Not Found) means the url is gone (or not found), not the product being sold on that url.


going with the bakery analogy, if i show up too late to buy a toasted coconut donut today - sure, it's a low quality experience to be told "come back in the morning when we have some freshly baked" (200 OK).
it would be an even worse experience if the baker told you "we've never sold toasted coconut" (404 Not Found) or "we've stopped selling toasted coconut" (410 Gone).
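The cases above can be expressed as a small classifier over the chain of status codes a site returns for a known-junk URL. This is a hedged sketch of the idea, not a complete detector; the function name and tuple format are invented here:

```python
def is_soft_404(status_chain):
    """Classify a response chain for a deliberately bad ("junk") URL.

    status_chain is the sequence of HTTP status codes observed, e.g.
    (302, 200) for an ErrorDocument that redirects to an error page.
    A clean site answers a junk URL with a bare (404,) or (410,).
    """
    final = status_chain[-1]
    if final in (404, 410):
        return False          # honest error status: not a soft 404
    if 200 in status_chain and any(300 <= s < 400 for s in status_chain):
        return True           # the 302/200 status chain phranque describes
    return final == 200       # a "404 page" served with 200 OK
```

The third case (mass redirects of junk URLs to the home page) would need the final URL as well as the status chain, which this sketch omits.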

lucy24




msg:4585569
 4:52 am on Jun 19, 2013 (gmt 0)

if your ErrorDocument specifies a fully qualified url and causes a "404 response" to generate a 302/200 status chain, that's a "soft 404".

If your site intentionally redirects all bad requests to the home page, that's a "soft 404". It's a widely used technique-- and one that drives some human users up the wall.

410 Gone (or 404 Not Found) means the url is gone (or not found), not the product being sold on that url.

The query is part of the URL. That's why I said earlier that there's a difference between an "out of stock" label and a page that's empty in the middle.

Now suppose the bakery has never sold toasted coconut anything-- but it thinks that it might like to some day, and doesn't want to burn its bridges or turn away customers by saying it doesn't carry them, so instead it says day after day "We don't have any right now".

Besides, why is a site search pointing to an empty page? It ought to go through the intermediate stage of a search-results page-- which is no-indexed, so it doesn't matter whether there are any articles about widgets in Gambia.

If someone who knows the site's URL structure types in
/widgets/articles/gambia/
and there's no such article, the request deserves a 404. And note again that this has absolutely nothing to do with what the human user sees. You can perfectly well make a custom 404 page that reads the query and says "I'm sorry, we don't have any articles about abc in xyz."
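As a toy example of that kind of custom 404 page (paths, wording, and the function name are all invented for illustration), the handler can read the request path, answer 404, and still render a personalised message:

```python
def page_for(path, articles):
    """Return (status, html) for a request.

    `articles` maps known paths to their content; anything else gets a
    404 status with a message built from the requested path, so the
    human sees something helpful while the bot sees an honest 404.
    """
    if path in articles:
        return 200, articles[path]
    topic = path.strip("/").split("/")[-1] or "that"
    return 404, f"<p>I'm sorry, we don't have any articles about {topic}.</p>"
```

The point is the decoupling: the same friendly body could accompany either status, and only the status tells the crawler whether the URL is real.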

atlrus




msg:4585576
 5:31 am on Jun 19, 2013 (gmt 0)

Besides, why is a site search pointing to an empty page?


I think you are confusing search with http.

404 has nothing to do with search, which is why my gears are grinding reading this. 404 is part of the HTTP protocol, and Google should stay away and not try to muscle itself into defining things like that on its own merit.

lucy24




msg:4585594
 6:30 am on Jun 19, 2013 (gmt 0)

It depends on the search structure.

Most humans are afraid to type directly into their browser's address bar-- if they even know what it is-- so instead they go to Site Search and say "gimme some articles about 19th-century widgets". And then site search, which may not really be a search at all, converts the request into /widgets/century/19/ without first checking whether such a page exists on the site. And then the cms does some clanking and churning and spits out a page whose sole content is "we can't tell you anything about widgets in the 19th century".

All those http responses were codified long before-- that is, ahem, "long" in Internet terms-- it became commonplace to generate pages first, put in the content second. So you could have a site where as far as the server is concerned, no request ever meets anything but a perfect 200. But a search engine wouldn't be doing its job if it didn't distinguish between real pages with real content, and pages that were created because the developer forgot to code for bad requests.

phranque




msg:4585620
 8:36 am on Jun 19, 2013 (gmt 0)

The query is part of the URL.

why is a site search pointing to an empty page?

these statements are irrelevant to the discussion.
before your post nobody had mentioned site search or query strings in this thread.

So you could have a site where as far as the server is concerned, no request ever meets anything but a perfect 200.

you are describing a CMS that was written by someone who didn't read the HTTP specification.
that's not the problem being discussed here.

lucy24




msg:4585824
 6:22 pm on Jun 19, 2013 (gmt 0)

these statements are irrelevant to the discussion.
before your post nobody had mentioned site search or query strings in this thread.

I'm extrapolating from the original post. A page whose content says "Sorry, we have no articles about X" can only arise from a request for articles about X. I seriously doubt the site was expressly coded to answer requests that can only come from the googlebot, never from a human.

you are describing a CMS that was written by someone who didn't read the HTTP specification. that's not the problem being discussed here.

Yes, it is. It's EXACTLY the problem being discussed. Maybe the issue is with google's use of the term "404". But once a term has come into regular use, you can't step back and say you refuse to use it because it really means something else. Try spelling "referrer" correctly and see where it gets you.

Similarly, there's not much use in complaining about CMS writers not reading specifications when there are thousands if not millions of www sites built around the lines
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
which mean nothing more nor less than "If you are even thinking about returning a 404-- don't."
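For reference, those two conditions usually sit in front of a catch-all rewrite, as in this generic sketch of the common front-controller pattern (the file name is illustrative):

```apache
# If the request doesn't match a real file or directory on disk...
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# ...hand everything to the CMS front controller, which answers 200
# for every request unless it has been taught to send its own 404s.
RewriteRule . /index.php [L]
```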

g1smd




msg:4585977
 6:32 am on Jun 20, 2013 (gmt 0)

it would be an even worse experience if the baker told you "we've never sold toasted coconut" (404 Not Found)

That's not quite right. 404 means the server can't find it right now, doesn't know if it ever existed, and has no idea if it might exist in the future.

phranque




msg:4586032
 10:05 am on Jun 20, 2013 (gmt 0)

404 = "we don't know anything about toasted coconut" or maybe "we aren't selling toasted coconut at the moment and i can't or won't tell you if we ever have or ever will."

not quite as brief in the telling, but a similar visitor experience...

lucy24




msg:4586225
 9:29 pm on Jun 20, 2013 (gmt 0)

That's where the cms and auto-generated pages come in. Once upon a time, pages either existed or they did not. But now we've got the potential for two entirely different and unrelated things:

--what the human sees, as in the posted example
instead of listing the available articles (there are none) it says "no articles for Alaska yet"

--what the machine receives, i.e. the bare number 200 or 404. (Or 301 or 206 or 403 or...)

My position is that a page which says, very nicely, "I'm devastated to have to tell you that we don't currently have any articles about the particular subject you're interested in", surrounded by helpful navigation links, is effectively a very high quality 404 page. But it's not a page that you would want anyone to index or link to or include in the site's page count or do any of the other things you do with pages.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved