|IIS URLs, rewrites, and 404 vs 200|
So we have had a huge problem on our site over the past few months, where Google has discarded the URLs we use throughout our internal navigation and has begun creating their own.
I say "creating their own" because the links now being indexed are truncated versions of our correct URLs. They are not linked to from anywhere within our site, and I haven't been able to find any evidence any external links point to these pages.
The breadth at which this is happening across the site also suggests a systemic change rather than Google picking up on a popular (if incorrect) external link.
So while this page may have ranked for several years:
Google has begun indexing something like this instead:
(nid here is a database variable, with 12345 being its value.)
We don't have canonicals in place, so that certainly isn't helping.
But what I'm looking for is help understanding how IIS treats URLs, URL rewrites, and status codes (particularly 404 vs 200).
My understanding of rewritten URLs in IIS is that it will always try to load something even if the URL doesn't exist.
So for http://www.example.com/gallery/nid-12345/complete-garbage-here = 200 OK
Is that true, and if so, is there any way to change that behavior?
We are several months away from launching a rewrite of all the URLs on the site, along with adding canonicals. I want to see if there's any way to make canonicals a safety net rather than the only thing preventing Google from taking liberties with our URLs.
Ideally, we would be able to force IIS to serve a 404 if a page doesn't exist instead of responding with 200 OK.
(I'm also curious if anyone can explain why this isn't the default behavior.)
Hang on there.
Obviously I'm not going to address the IIS-specific aspects, but one thing is universal: a 200 response can't occur in a vacuum. It has to be accompanied by some kind of content. The "content" may happen to be a completely blank page if the site has botched its PHP, JSP or similar. But neither the server nor the browser knows that.
If you request a garbage URL from your own site, do you end up somewhere? If so, where? What's in the address bar? What's in the body of the page? If the page is blank, is there any source code?
|So while this page may have ranked for several years: |
It's obvious that your ASP script doesn't check that the requested 'keyword-phrase' is valid for this ID. It should do so by looking up that text in the database, then either return a 404 if the ID doesn't exist, or redirect to the correct version of the URL if the text doesn't match what was requested.
The .aspx extension was never necessary. You could go extensionless.
While a URL might be requested as example.com/gallery/id-12345-page-name-here by someone "out there" on the web, that request will be mapped via an internal rewrite to a function like /gallery.aspx?id=12345&name=page-name-here or similar. The fact that the URL request has been fulfilled by the .aspx file means the server returns 200 OK. This happens even when there is no content in the database matching this request. It's therefore up to the ASPX script to return a 404 status code and error page in that case.
With correct design of the complete system, duplicate content and "soft 404" problems can be completely eliminated.
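To make the mapping concrete, here is a minimal sketch in Python of how a friendly URL is parsed into an ID and a name before being handed to the script. This is only an illustration of the rewrite step described above; the real site would use an IIS rewrite rule feeding gallery.aspx, and the URL shape is taken from the example in this post.

```python
import re

# Pattern for friendly URLs like /gallery/id-12345-page-name-here
# (illustrative; the real pattern lives in an IIS rewrite rule)
GALLERY_RE = re.compile(r"^/gallery/id-(\d+)-([a-z0-9-]*)$")

def map_gallery_url(path):
    """Extract (id, name) from a friendly URL, or None if it doesn't match."""
    m = GALLERY_RE.match(path)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)
```

A request that matches is rewritten to something like /gallery.aspx?id=12345&name=page-name-here; a request that doesn't match never reaches the script at all.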
Thanks so much for your responses!
|If you request a garbage URL from your own site, do you end up somewhere? If so, where? What's in the address bar? What's in the body of the page? If the page is blank, is there any source code? |
Good point; I should have specified. If you were to load http://www.example.com/gallery/nid-12345/ or http://www.example.com/gallery/nid-12345/ke, it would respond with 200 and load the content for http://www.example.com/gallery/nid-12345/keyword-phrase.aspx. So even though the page doesn't "exist", it exists.
|While a URL might be requested as example.com/gallery/id-12345-page-name-here by someone "out there" on the web, that request will be mapped via an internal rewrite to a function like /gallery.aspx?id=12345&name=page-name-here or similar. The fact that the URL request has been fulfilled by the .aspx file means the server returns 200 OK. This happens even when there is no content in the database matching this request. It's therefore up to the ASPX script to return a 404 status code and error page in that case. |
|With correct design of the complete system, duplicate content and "soft 404" problems can be completely eliminated. |
THIS! This is what I want. Duplicate content, soft 404s, and now crazy indexation has been killing us.
Can you point me to any resources that I can pass to our developers so that they can fix this issue?
If I understand correctly, it sounds like they need to adjust the aspx script, but I'm not sure if that's the extent of it or if there would be additional things they would need to address.
We also plan on dropping the aspx file extension during the rewrites... will that affect this behavior at all?
Thanks again for the responses!
I only use classic ASP but the situation should be similar.
An IIS server can be set to use default or non-default error scripts to respond to (eg) 404. All of my sites are set (by me) to run a parsing script if a 404 occurs. The result depends on who's asking.
If a bot asks for a non-existent page it gets a proper 404 (or in rare cases a 301 redirect to a similar page: eg .asp instead of .html).
If a human asks then they are redirected to (usually) the home page.
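That bot-vs-human branching could be sketched like this. This is a Python sketch of the logic only, with hypothetical helper names; the real version would be a custom ASP 404 handler configured in IIS.

```python
def handle_missing_page(user_agent, similar_page=None):
    """Sketch of a 404 handler that treats bots and humans differently.

    Returns an (HTTP status, redirect target) tuple standing in for the
    actual response. `similar_page` is a hypothetical close-match URL.
    """
    is_bot = any(token in user_agent.lower()
                 for token in ("bot", "spider", "crawler"))
    if is_bot:
        if similar_page:
            return (301, similar_page)  # rare case: redirect to a similar page
        return (404, None)              # bots get a proper 404
    return (302, "/")                   # humans are sent to the home page
```

Note that redirecting humans while serving bots a 404 is a design choice of this particular poster, not a general recommendation.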
If your developers or hoster has mis-set the 404 handler then that could be your problem.
Aside from that, if a subdirectory has a redirect script as its default (eg index.asp), that may be invoked by G and return a valid page. I always set the default page of a directory to either a proper page or a redirect back to the home page (but see above).
Finally, G has a very bad habit of trying URLs that do not exist. Other engines also do this but G seems to be the worst. They even try stupid file extensions that could never occur on an IIS server and probably not on most linux servers either.
If you drop file extensions after an SE has indexed them then you will probably lose those pages from the SE's index. Good SEs may think, "Ah! I will drop the old 404-returned page and index the new one." G is not that friendly in my experience. I would keep the old pages and (probably) provide a canonical redirection to the non-extension version. Or vice versa. You would need to sort this out in detail so that links from SERPs do not reject users.
thanks for your response dstiles.
It's really a relief to hear others have experienced googlebot's belligerence.
It's baffling to me. There are all these concerns over managing your site and crawlable pages in order to avoid taxing crawler bandwidth. Meanwhile, the crawler is taking extra work upon itself to create and test URLs which don't exist anywhere else on the web. So frustrating!
But thanks for your advice. I will ask our devs to look into how we handle 404s.
And that is a REALLY GOOD point about having to track down all of the malformed URLs google has indexed in order to 301 them to the appropriate page. Heavy testing of our 301s was going to happen during the course of the rewrites regardless, but I might have overlooked testing those URLs as well.
Canonicals will be universally implemented with the rewrites, but I don't even trust that that is enough to keep G from making up their own URLs.
Hopefully rewrites + canonicals + improved 404 handling = no more made up URLs.
However you manage it and however well you 404 unavailable pages, G (and others!) WILL try made-up URLs.
I don't know exactly WHY they do this but the original suggested reason of wanting to be sure how sites handled 404's does not hold water in view of the number of duff URLs they try.
Another idea I've seen is that they are looking for new content, but the potential URL list for even a small site would be prohibitively large.
Yet another suggestion is that they are using stupid links found on other web sites. That MAY be possible for some links but the number and obviously made-up state of the URLs suggests this is only a very small percentage of hits.
|Another idea I've seen was that they are looking for new content, but the potential URL list for that for even a small site is ludicrously prohibitive. |
... and new pages are not likely to hide behind URLs of the kind you'd get if your cat went to sleep on the keyboard.
For future reference:
I've assumed that even if the mechanics are different, rewriting to a dynamic page has the same effect in IIS as in Apache: The server itself returns a 200, because it has successfully handed off to the php/asp/jsp/whatever. It's up to the destination page to evaluate parameters and return a 404 when appropriate.
Right or wrong?
Google asks for non-existent URLs to see whether they get a "HTTP/1.1 404 Not Found" response code and an HTML page with an error message of some sort, or whether they get "200 OK" and a content page template with large blank bits because there's no content in the database for this request to populate the page with. The latter is a "soft 404". Altering the script that generates the HTML page fixes this.
I'm not sure what "resources" your developer needs. This is "HTTP 101", returning the right status code and content (or error) for each request, valid or not.
> returns a 200, because it has successfully handed off to the php/asp/jsp/
Lucy - not strictly true. The page calls the processor (at least in ASP and, I think, in PHP). If there are <% type tags anywhere in the page script then the parser is invoked; if not, it will be served directly as html/text. If the page script does not exist at all then there is nothing to parse, so the SERVER invokes the 404 handler (in IIS this is, by default, a short ASP script). Errors can be produced by the parser, but by then it's been established that 404 is not a possible error UNLESS the script cannot find database (or other) content, in which case the SCRIPT deliberately generates an artificial 404.
g1smd - as I said, that does not entirely stack up. They use so many bad URLs there must be another reason. One bad URL would do, half a dozen would be overkill. Several dozen is dumb.
The SERVER returns 200 IF it can find a page corresponding to the request, either by direct URL or by an internal redirect (eg if 404 then return home page with 200) but that is a deliberate decision by the site designer or hoster. If there is no database content there should be a section in the script that says: "Invoke 404".
|It should do so by looking up that text in the database, and then either return 404 if it doesn't match what was requested or else redirect to the correct version of the URL. |
g1smd, can you speak at all to how resource intensive these lookups would be on the database?
(Sorry if that question doesn't make sense--I'm an SEO who knows very little about how databases and servers actually work.)
But I know a pain point with IT is string lookups from rewritten URLs, and if I request a string lookup for every request of any URL we have--well, it probably won't go well.
We're an ecomm site with several hundred thousand pages. After an awful lot of work from our developers, our load times are awesome. If I ask for something that will require a noticeable performance hit, it could seem like a big step backwards to them.
This URL situation is killing me though from an SEO perspective, but I hope it doesn't come down to a decision between site performance and SEO performance.
[edit for words.]
|The SERVER returns 200 IF it can find a page corresponding to the request, either by direct URL or by an internal redirect (eg if 404 then return home page with 200) but that is a deliberate decision by the site designer or hoster. If there is no database content there should be a section in the script that says: "Invoke 404". |
OK, I think we're saying the same thing, barring technicalia including differing definitions of "internal". Bottom line is that the response the server sends may or may not be the response the user receives.
|can you speak at all to how resource intensive these lookups would be on the database? |
Minimal impact. Your script extracts, for example, the product ID from the requested URL and uses it to look up the matching text in the database. It's a very simple and quick request. The string comparison of "requested text" and "database text" is quick and leads to a simple "does it exactly match?" decision. If it does match, proceed to pull the entire page content from the database, build the HTML page, and send it to the browser. If it does not match, either return HTTP 404 or send an HTTP 301 redirect response pointing to the correct URL, built from the requested product ID plus the correct URL text as returned by the database. Of course, if no URL text was returned from the database because this product ID doesn't exist, send HTTP 404 and the associated HTML error message page.
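That decision tree can be sketched in a few lines. This is Python for brevity, not ASPX; `lookup_slug` is a hypothetical stand-in for the database query, and the (status, location) tuples stand in for the actual HTTP responses. The URL shape is illustrative.

```python
def respond(product_id, requested_slug, lookup_slug):
    """Decide the HTTP response for a friendly URL.

    lookup_slug(product_id) returns the canonical URL text for the
    product, or None if the product doesn't exist (hypothetical DB call).
    """
    canonical = lookup_slug(product_id)
    if canonical is None:
        return (404, None)          # no such product: hard 404
    if requested_slug == canonical:
        return (200, None)          # exact match: build and serve the page
    # wrong or truncated text: 301 to the correct URL
    return (301, f"/gallery/id-{product_id}-{canonical}")
```

With this in place, a truncated request like /gallery/id-12345-ke gets a 301 to the full URL rather than a duplicate 200, and a made-up ID gets a real 404.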
In many sites the URL text isn't stored as a separate database entry. It is simply built from the page title using a set of RegEx patterns. The algorithm is something like:
Get the page title from the database and store the original as the page title for later use. Additionally: change to lower case, remove leading and trailing spaces, change "&amp;" and "&" to a space, remove apostrophes, change a space or spaces to a single hyphen, de-dupe contiguous hyphens, etc, then store this URL text.
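That algorithm might look like this in Python. It is only a sketch of the steps listed above; the exact character rules will vary from site to site.

```python
import re

def slugify(title):
    """Build URL text from a page title, per the steps above."""
    s = title.lower().strip()
    # Replace "&amp;" before "&", since "&amp;" contains "&"
    s = s.replace("&amp;", " ").replace("&", " ")
    s = s.replace("'", "")           # remove apostrophes
    s = re.sub(r"\s+", "-", s)       # space or spaces to a single hyphen
    s = re.sub(r"-{2,}", "-", s)     # de-dupe contiguous hyphens
    return s.strip("-")
```

For example, slugify("Bob's Widgets & Gadgets") yields "bobs-widgets-gadgets".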
I tend to use URLs like
where c is category, p is product page, r is review page, etc.
If the URL is requested with either appended junk or with truncated text the site redirects to the correct URL. Optionally, it can also keep a log of all of these requests in a separate file for later analysis.
awesome, thank you so much for explaining that!
Indeed, our new URL structure will be similar to what you use. Instead of having the database value in a directory of its own with the SEO rewrite text contained in that directory (e.g. /nid-1234/rewritten-text.aspx), we will be putting it all in one directory (e.g. /nid-1234-rewritten-text/).
The idea there was to force Google to a 404 if they try /nid-1234-rewri, whereas now they can do whatever they like since anything contained within the id directory will return 200 (e.g. /nid-1234/rewritten-text.aspx = 200, /nid-1234/ = 200, /nid-1234/re = 200).
Thanks again for your insights (and lucy24 and dstiles as well)! This discussion has definitely helped me understand what I need to ask of IT.
Trailing slash indicates a folder, and that fact brings with it various other baggage. More so when there are rewrites involved.
If those URLs are actually "pages" don't add a trailing slash.
Use URLs with a trailing slash for real physical folders and URLs with an extension for CSS, JS and image files, robots.txt, etc.