It works fine from a user standpoint, but I don't think bots will actually be able to spider the 404.
The reason I say this is that I ran a [spider simulator], and for each subdirectory it returned a 404 rather than the page and data built from the URL.
Can anyone suggest a step I've missed in setting this up?
Yes, check the headers your script is returning with the Server Header Checker [webmasterworld.com]. If you return a 404-Not Found status for valid page requests, then your pages won't be indexed.
Server response codes [w3.org] should mean what they say and say what they mean: The spiders base their behaviour on these server responses, so if the server says the page is 404-Not Found, the spider will understandably head off to find pages that return 200-OK elsewhere.
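To illustrate (these are example headers, not what your server is necessarily sending), a spiderable page should come back starting with something like:

HTTP/1.1 200 OK
Content-Type: text/html

whereas only a genuinely missing page should begin with "HTTP/1.1 404 Not Found". The header checker above will show you which of these your script is actually returning.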
One solution: Do not use the server's 404 mechanism to transfer control to your script. Instead, use mod_rewrite [httpd.apache.org] to map your URLs directly to the script, and then have the script generate 404-Not Found only if it can't build a page to serve.
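A minimal sketch of that approach in .htaccess -- the script name /handler.php and the query-string parameter are placeholders for whatever your script actually uses:

RewriteEngine On
# Let requests for files that really exist (images, css, etc.) through untouched
RewriteCond %{REQUEST_FILENAME} !-f
# Hand everything else to the page-building script
RewriteRule ^(.+)$ /handler.php?path=$1 [L,QSA]

The script then returns 200-OK when it can build the requested page, and sends the 404 status itself only when it can't.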
You may also be able to use mod_actions to pass all "page" requests to the script by requested MIME-type. There may be even slicker ways to do it -- you might want to try asking over in the PHP forum. Just make sure the solution you use returns correct server headers!
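Untested here, but the mod_actions form would look roughly like this (again, /handler.php is just a placeholder):

# Send every request Apache would serve as text/html to the script instead
Action text/html /handler.php

With mod_actions, the originally-requested URL should reach the script via the standard CGI PATH_INFO / PATH_TRANSLATED variables, so the script can still build the page from the URL.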
Note: Avoid mapping special files such as /robots.txt and /w3c/p3p.xml to the script, unless you *really* want to handle those internally. You can, of course, also exclude requests for images, css, or anything else from being passed to the script - just depends on what you want to do.
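With the mod_rewrite approach above, those exclusions are just extra conditions placed in front of the rule (the patterns here are only examples):

RewriteCond %{REQUEST_URI} !^/robots\.txt$
RewriteCond %{REQUEST_URI} !^/w3c/
RewriteCond %{REQUEST_URI} !\.(gif|jpe?g|png|css|js)$ [NC]
RewriteRule ^(.+)$ /handler.php?path=$1 [L,QSA]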
Jim