Forum Moderators: phranque


404 traps


proxyHunter

5:44 am on Oct 24, 2003 (gmt 0)

10+ Year Member



I have set up a site with a directory structure like dmoz.org. I'm using a 404 error PHP script to output content based on the URL entered, e.g. www.example.com/cars/ or www.example.com/bikes/ (no static HTML pages exist).

It works fine from a user standpoint, but I don't think bots will actually be able to spider the 404 pages.

The reason I say this is that I used a [spider simulator], and for each subdirectory it returned a 404 rather than the page and data based on the URL.

Can anyone suggest a step I've missed in setting this up?

[edited by: jdMorgan at 6:57 am (utc) on Oct. 24, 2003]
[edit reason] Delinked [/edit]

jdMorgan

6:51 am on Oct 24, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



proxyHunter,

Yes, check the headers your script is returning with the Server Header Checker [webmasterworld.com]. If you return a 404-Not Found status for valid page requests, then your pages won't be indexed.
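Since Apache sets the 404 status before it hands control to an ErrorDocument script, the script has to override that status itself whenever it does manage to build a page. A minimal sketch of what that looks like in the PHP script -- lookup_page() is a hypothetical function standing in for however you build content from the requested URL:

```php
<?php
// Sketch: override the 404 status Apache set before invoking this
// ErrorDocument script. lookup_page() is hypothetical -- it returns
// the page content for the requested URL, or false if there is none.
$content = lookup_page($_SERVER['REQUEST_URI']);

if ($content !== false) {
    header('HTTP/1.0 200 OK');        // tell spiders the page really exists
    echo $content;
} else {
    header('HTTP/1.0 404 Not Found'); // genuinely missing -- say so
    echo 'Page not found.';
}
?>
```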

Server response codes [w3.org] should mean what they say and say what they mean: The spiders base their behaviour on these server responses, so if the server says the page is 404-Not Found, the spider will understandably head off to find pages that return 200-OK elsewhere.

One solution: Do not use the server's 404 mechanism to transfer control to your script. Instead, use mod_rewrite [httpd.apache.org] to map your URLs directly to the script, and then have the script generate 404-Not Found only if it can't build a page to serve.
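As a rough sketch of that approach in .htaccess -- the script name /handler.php and the query-string parameter are assumptions, adjust to your own setup:

```apache
# Sketch: route category URLs straight to one PHP script instead of
# relying on the ErrorDocument 404 mechanism.
RewriteEngine On
# Don't rewrite requests for files or directories that really exist
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# Hand the original URL to the script as a query-string parameter
RewriteRule ^(.*)$ /handler.php?url=$1 [L,QSA]
```

Because the request reaches the script directly, Apache serves it with a normal 200-OK unless the script itself sends a 404 header.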

You may also be able to use mod_actions to pass all "page" requests to the script by requested MIME type. There may be even slicker ways to do it -- you might want to try asking over in the PHP forum. Just make sure the solution you use returns correct server headers!
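For the mod_actions route, something along these lines might work -- the script path is an assumption:

```apache
# Sketch: hand every request that would be served as text/html
# to one script via mod_actions.
Action text/html /cgi-bin/handler.php
```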

Note: Avoid mapping special files such as /robots.txt and /w3c/p3p.xml to the script, unless you *really* want to handle those internally. You can, of course, also exclude requests for images, CSS, or anything else from being passed to the script -- it just depends on what you want to do.
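If you go the mod_rewrite route, those exclusions can be sketched with extra RewriteCond lines -- the paths and extensions here are illustrative:

```apache
# Sketch: keep special files and static assets out of the script
RewriteEngine On
RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{REQUEST_URI} !^/w3c/p3p\.xml$
RewriteCond %{REQUEST_URI} !\.(gif|jpe?g|png|css|js)$ [NC]
RewriteRule ^(.*)$ /handler.php?url=$1 [L,QSA]
```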

Jim