Forum Moderators: DixonJones
I've been developing and using a home-grown crawler for a number of years to populate a Google-style search engine for the biotech sector.
<Sorry, no personal URLs.
See Terms of Service [webmasterworld.com]>
I only include pages that return HTTP status 200 (success, among other criteria), but I've noticed that I can crawl a site and get back tens, even hundreds of custom 404 pages that all say the same thing, and this clogs my data. I could search the body for the string "404", but that's not elegant. Because I'm hitting a variety of web servers, I need something in the HTTP headers or the GET response that actually tells me it's a custom error page and not an ordinary page.
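The only workaround I've come up with so far is a probe heuristic rather than anything in the headers: request a URL on the same host that almost certainly doesn't exist, and if the server still answers 200, whatever body it sends back is its custom "not found" template. Any crawled page whose body is near-identical to that template can then be dropped. Here's a minimal sketch in Python (the helper names `fetch_body`, `soft_404_template`, and `looks_like_soft_404` are my own, hypothetical ones, and the 0.9 similarity threshold is just a guess to tune):

```python
import difflib
import urllib.error
import urllib.request
import uuid

def fetch_body(url: str) -> str:
    # Hypothetical helper: fetch a URL and return its body as text.
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def soft_404_template(host: str) -> str:
    # Request a URL that almost certainly does not exist on this host.
    # If the server returns 200 anyway, its body is the site's custom
    # "not found" template; if it raises a real 4xx, there is no
    # soft-404 problem and we return an empty string.
    bogus = "http://%s/%s.html" % (host, uuid.uuid4().hex)
    try:
        return fetch_body(bogus)
    except urllib.error.HTTPError:
        return ""

def looks_like_soft_404(page_body: str, template: str,
                        threshold: float = 0.9) -> bool:
    # Compare a crawled page against the probe template.  Near-identical
    # bodies mean the "page" is almost certainly the 404 template served
    # with status 200, so the crawler should skip it.
    if not template:
        return False
    ratio = difflib.SequenceMatcher(None, page_body, template).ratio()
    return ratio >= threshold
```

In use, you'd call `soft_404_template(host)` once per site at the start of a crawl, then run every fetched page through `looks_like_soft_404` before indexing it. It costs one extra request per host, but it works regardless of which web server is on the other end.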
Any ideas?
[edited by: tedster at 10:41 pm (utc) on Feb. 12, 2005]