Forum Moderators: DixonJones


Custom 404 pages do not return 404 - how can I detect?

crawling problem with custom 404 pages

         

AndrewStuart

8:08 pm on Feb 12, 2005 (gmt 0)

10+ Year Member



Hi all,

I've been developing and using a home-grown crawler for a number of years to populate a Google-style search engine for the biotech sector.

<Sorry, no personal URLs.
See Terms of Service [webmasterworld.com]>

I only include pages that return status 200 (OK, i.e. no error), among other criteria, but I've noticed that I can crawl a site and get back tens, even hundreds, of custom 404 pages which all say the same thing, and this clogs my data. I could search for the string "404", but this is not elegant. Because I'm hitting a variety of web servers, I need something in the HTTP headers or the GET response that actually tells me it's a custom error page and not an ordinary page.

Any ideas?

[edited by: tedster at 10:41 pm (utc) on Feb. 12, 2005]

hakre

9:21 pm on Feb 13, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Actually, you're indexing pages that return 200 but contain a "file not found" message, which should have been served as 404. I can see no solution for this, because your client should rely on the header. I would ban the domains using such a mechanism.
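One way a crawler might find the domains to ban is to request a URL that almost certainly does not exist and see whether the server still answers 200. This is only a sketch of that probing idea, not something described in the thread: `random_probe_url`, `serves_soft_404`, and the `fetch_status` callable are hypothetical names, and `fetch_status` stands in for whatever HTTP client the crawler already uses.

```python
import uuid
from typing import Callable
from urllib.parse import urljoin


def random_probe_url(base_url: str) -> str:
    """Build a URL on the same host that is very unlikely to exist."""
    return urljoin(base_url, "/" + uuid.uuid4().hex + ".html")


def serves_soft_404(base_url: str, fetch_status: Callable[[str], int]) -> bool:
    """Return True if the host answers 200 for a page that should not exist.

    fetch_status is an assumed helper supplied by the crawler: it takes a
    URL and returns the final HTTP status code as an int.
    """
    return fetch_status(random_probe_url(base_url)) == 200
```

A host for which `serves_soft_404` returns True cannot be trusted to signal missing pages through its status code, so its 200 responses would need extra checks or the domain could be skipped.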

arrowman

9:57 pm on Feb 18, 2005 (gmt 0)

10+ Year Member



When status codes are not properly used, text recognition is all that remains.
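A minimal sketch of what such text recognition could look like, assuming the crawler has already collected the bodies of the pages that returned 200 for one site. It combines a phrase match with the observation from the first post that the custom error pages "all say the same thing". The phrase list, the `min_repeats` threshold, and all function names are illustrative assumptions and would need tuning on real data.

```python
import hashlib
import re
from collections import Counter
from typing import Dict, Set

# Assumed phrases; real error pages vary, so treat this list as a starting point.
NOT_FOUND_PATTERNS = re.compile(
    r"\b404\b|page not found|file not found|does not exist",
    re.IGNORECASE,
)


def looks_like_error_text(body: str) -> bool:
    """True if the body contains a typical 'not found' phrase."""
    return NOT_FOUND_PATTERNS.search(body) is not None


def fingerprint(body: str) -> str:
    """Hash the body after lowercasing and collapsing whitespace."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def find_repeated_error_pages(pages: Dict[str, str], min_repeats: int = 3) -> Set[str]:
    """Return URLs whose body is repeated on the site and reads like an error page.

    pages maps URL -> body text for responses that came back with status 200.
    """
    counts = Counter(fingerprint(body) for body in pages.values())
    return {
        url
        for url, body in pages.items()
        if counts[fingerprint(body)] >= min_repeats and looks_like_error_text(body)
    }
```

Requiring both conditions is a deliberate choice: a genuine article that merely mentions "404" appears once and is kept, while an identical error page served under many URLs is flagged.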