how to 404 nonexistent pages

         

jake66

2:38 am on Apr 25, 2006 (gmt 0)

10+ Year Member



presently, i've been spotting yahoo requesting files that don't exist.

the problem: dynamic urls.
how can you 404 something like: mysite.com/?S=A

currently, this produces a 200/ok response, even though i do have a custom 404 page set up via htaccess:

ErrorDocument 404 /404.php

hakre

7:16 am on Apr 25, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



this is the default apache directory listing, and the query string part contains information about how to sort the listing.

i think you can disable this default behavior within your apache config:

Option -Indexes

give it a try. to "fix" the already existing problem, try robots.txt for that url and maybe something with mod_rewrite. or just put an index.html into that directory; that should do the job.

jake66

3:40 am on Apr 26, 2006 (gmt 0)

10+ Year Member



if i enter this into my htaccess and break something.. could i cause permanent damage?

i did a search on the web and didn't find much info other than how to turn the option on.... but what exactly does it do?

please forgive such a dumb question, but i'm relatively new to php and apache in general.

jake66

4:05 am on Apr 26, 2006 (gmt 0)

10+ Year Member



i bit the bullet and plopped:
Option -Indexes

into my htaccess and got an internal server error.. is there a specific way to use this, or do you have to be the server admin? i am on a shared hosting account w/ mod_rewrite, etc enabled

Tastatura

4:20 am on Apr 26, 2006 (gmt 0)

10+ Year Member



Options All -Indexes

jake66

4:33 am on Apr 26, 2006 (gmt 0)

10+ Year Member



i added this to the beginning of my htaccess in /
and it's still giving a 200/ok response to www.mysite.com/?S=A

Tastatura

4:50 am on Apr 26, 2006 (gmt 0)

10+ Year Member



I just re-read your post (more carefully). I don't think that adding "Options All -Indexes" to .htaccess will get rid of the ?S=A behavior (but you should not be getting the 500 error any more, and my first post was aimed at that.)
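
For anyone reading later, the 500 error reported above is most likely explained by the directive name: Apache's directive is Options (plural), and an unrecognized directive such as "Option" in .htaccess makes the server return an internal server error. A minimal sketch of the line, assuming the host allows Options to be overridden from .htaccess (AllowOverride Options, which not every shared host grants):

```apache
# .htaccess (sketch; assumes the host permits "AllowOverride Options")
# Turn off automatic directory listings and leave all other options untouched.
Options -Indexes
```

As far as I know, newer Apache versions (2.4 and later) reject a line that mixes signed and unsigned keywords, such as "Options All -Indexes", so the signed-only form is the safer one to copy. Either way, this only affects directory listings; it does not change the response for /?S=A when an index page exists.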

jake66

5:14 am on Apr 26, 2006 (gmt 0)

10+ Year Member



oh, ok... well i did not get a 500 the most recent try.

is there custom coding involved with preventing activity like www.webmasterworld.com/?testfor404 from producing a 200/ok response?

going on the fact i am able to get to the main page of webmasterworld by trying that, i would think it's safe to ignore this problem?

i have spotted yahoo looking for urls like this... but if everyone (with php) experiences the same type of response, why would yahoo even bother trying to see the type of response any page would give?


also... say portions of your site are dynamic in this scenario:

- 10 products per page
- 100 products in stock
= that translates into 10 different pages.

*these pages are dynamic: index.php?page=1 and index.php?page=2 and so on..
*you sell 50 products, you are down to 5 pages.
*google, msn and yahoo are still hitting index.php?page=10 and getting a 200/ok response even though you don't have enough products to list past page 5

...what can be done to thwart the 200/ok responses on these types of pages? is it possible to custom-code into your website backend, or are you doomed for having the blasted little '?' in the urls?

jdMorgan

1:50 pm on Apr 26, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In order to understand this problem, you need to take into account one of the 'rules' of the HTTP protocol.

A query string such as "?S=A" is not part of the URL path that Apache maps to a file, and does not by itself identify a specific resource in the context of HTTP and Apache server. Rather, it is a string of data attached to the URL, to be passed to the resource *at* the given path -- in your example, your script at "/" -- the default index page of your site.

As such, only the base URL "/" can be checked for existence or non-existence by the server. If you want to check whether the script-generated 'page' identified by "S=" exists, then the script itself must perform that checking function.

You *can* use mod_rewrite code in .htaccess to return a 410-Gone response based on the query string, but that is a high-maintenance and error-prone approach. If you elect to follow this path (not recommended), see mod_rewrite's RewriteCond directive, which can be used to check the requested %{QUERY_STRING} available as a server variable.
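
To make the mod_rewrite route concrete, here is a minimal sketch for the single URL from the first post. It assumes the rules live in the .htaccess file in the site root, that mod_rewrite is enabled, and that "S=A" is the entire query string; each removed page would need a condition of its own, which is where the maintenance burden comes from.

```apache
# .htaccess in the site root (sketch; assumes mod_rewrite is available)
RewriteEngine On

# Match only when the query string is exactly "S=A".
RewriteCond %{QUERY_STRING} ^S=A$

# In .htaccess context the path for "/" is the empty string, so ^$ matches it.
# "-" means no substitution; the G flag sends a 410 Gone response.
RewriteRule ^$ - [G]
```

If a 404 is preferred over a 410, the flag [R=404] should be usable in place of [G]; mod_rewrite accepts non-redirect status codes there and drops the substitution.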

The problem is that, from your description, it sounds like your query-string-based pages come and go frequently. Since it can often take a search engine many months (or even a couple of years) to finally recognize a 410-Gone or 404-Not Found response and stop asking for the obsolete page, your list of removed pages might grow very large, making maintenance difficult. (You would have a big problem deciding how long to leave each removed-page 410 in place; if you removed the 410 too soon and a search engine spider re-requested that page, then you'd have to add the code for that page back in, and start the clock for that removed entry all over again.)

I suggest that you modify the script to check the database to see if it can generate the requested page, and if not, return a 410-Gone response along with a custom error page that the visitor can use to find a similar product. (Explain that the product is no longer available and provide links on that 410 page pointing to your site map, home page, and product selector, for example.)

Jim