Having large numbers of 404 errors can hurt rankings. A 404 error caused by broken links or missing pages is a bad quality signal to Google.
404 errors can also happen when a user makes a typo while manually entering a URL into a browser, or through other random mistakes. You can ignore those random mistakes.
If you see a consistent pattern of 404 errors, then either build a page or redirect the requests to a pre-existing page. This will keep users happy and boost the quality signals going to Google.
It helps your internal link popularity score when you increase the number of pages on your site. When in doubt, just build out a new page and interlink it with your old content.
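One rough way to separate a consistent pattern of 404s from one-off typos is to tally 404 responses per requested path in the access log. The sketch below assumes Apache-style common/combined log lines; the function names and the `min_hits` threshold are hypothetical and only for illustration.

```python
import re
from collections import Counter

# Assumption: log lines contain a quoted request followed by the status code,
# e.g. ... "GET /ultra-widget.htm HTTP/1.1" 404 309 ...
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)

def count_404_paths(lines):
    """Return a Counter mapping each requested path to its number of 404s."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and match.group("status") == "404":
            counts[match.group("path")] += 1
    return counts

def repeated_404s(lines, min_hits=3):
    """Return (path, hits) pairs with at least min_hits 404s, most frequent first."""
    return [
        (path, hits)
        for path, hits in count_404_paths(lines).most_common()
        if hits >= min_hits
    ]
```

Paths that only show up once or twice are probably the random mistakes that can be ignored; paths that keep coming back are the candidates for a new page or a redirect.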
I'd suggest using the "Fetch as Googlebot" utility in the Labs section of Webmaster Tools to check out some of the suspect URLs. It's a definite concern that you find these URLs manually or with Xenu, but Googlebot requests show a 404 in your log. It could be a sign that your site was hacked.
I tried to find the broken links with Xenu, but it was of no use... the references are all correct, and I have also checked the individual code files.
I don't know what to do. Could you guide me more about the hacked issue?
If your server has been hacked, then there are all kinds of games that will serve googlebot something different than a regular user would get. In fact, sometimes a hacker might install a script that is buggy and doesn't do what they thought it would.
You need to explore those 404 URLs individually, at least some of them, and see what happens to a browser and what happens to googlebot. Xenu is probably too broad a brush. For checking individual URLs, I'd probably use Firefox with the LiveHTTPHeaders add-on.
Are you showing 404 errors for "pages" that exist on your site?
If so - that is a different problem.
If these pages aren't pages you intended to create - you can see the linking pages for some of these in Google Webmaster Tools.
You should run all the pages that are coming back 404 (from both your logs and GWT) through Xenu. You can feed them in as a text file instead of manually trying to find the links. You should also run a sample of those through the Fetch as Googlebot tool, as Ted mentioned.
It sounds to me like a server problem. A hack or an accidental malfunction.
Hi tedster, goodroi, chris, and aristotle,
I guess you are thinking that the problem is that the pages are giving 404 to bots but 200 to users... that is not the case.
I have checked with Fetch as Googlebot, Xenu, and LiveHTTPHeaders, and all legitimate links are returning status 200.
The problem is:
We can't seem to FIND the links on the site which are broken and are returning 404 (bots are accessing URLs which are not supposed to be on our site, for example: www.site.com/ultra-widget.htm... instead of www.site.com/folder/ultra-widget.htm...). Is there any way to find the origin...
Now I understand better - it sounds like you are seeing the same kind of thing we're discussing in this thread: Webmaster Tools - again with the anomalies [webmasterworld.com]
There are anomalies in WebmasterTools from time to time that we don't understand. If you've done due diligence and cannot see the evidence on your site, then there's little more you can do on that angle.
So keep your focus on the traffic loss itself - dig out which URLs have lost traffic and on what keywords. There should be some patterns emerging in that research. There is another thread that seems to apply to your case - October 2010 ranking drops [webmasterworld.com]
No tedster, this is not an issue with WMT. WMT is showing no 404s at all (weird WMT).
I found these issues when I used different server log readers like WebLogExpert and DeepLogAnalyzer.
We can't seem to FIND the links on the site which are broken and are returning 404 (bots are accessing URLs which are not supposed to be on our site, for example: www.site.com/ultra-widget.htm... instead of www.site.com/folder/ultra-widget.htm...).
Your opening post talked about "bots". Are you talking about crawler traffic requesting URLs that are 404, or search engine referrals being sent to 404 URLs?
I am talking about 80-90% of the crawler requests shown in different server log analyzers being for URLs that return 404 (and these are the URLs I can't find linked anywhere on the site).
Just off-hand, it sounds like you're using a CMS and some URL_rewrite configuration has been scrambled. But your problem is a precise technical issue, so precise use of technical vocabulary is important for anyone to give you useful feedback.
Do you mean you can't find those URLs on your site, or that you can't find any links on your site that point to those URLs?
can't find any links on the site that point to those URLs
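When no link to a 404 URL can be found on the site, the Referer field of the access log is one place to look for the origin. The sketch below assumes the Apache combined log format (request, status, bytes, Referer, User-Agent); the function name is hypothetical. Crawlers often send no Referer at all, in which case the field is "-" and the log alone cannot say where the URL was picked up.

```python
import re
from collections import defaultdict

# Assumption: combined log format, i.e.
# ... "GET /path HTTP/1.1" 404 309 "http://referring.page/" "User-Agent"
COMBINED = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def referers_for_404s(lines):
    """Map each path that returned 404 to the set of Referer values seen with it."""
    origins = defaultdict(set)
    for line in lines:
        match = COMBINED.search(line)
        if match and match.group("status") == "404":
            origins[match.group("path")].add(match.group("referer"))
    return dict(origins)
```

If a browser request for the same bad URL carries a Referer, that referring page is the first place to check for a broken or relative link.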
Keep in mind - as tedster rightly says, the terminology is important - that I keep reading things like "can't find the links on the site which are broken" and "requesting 404s". I am not trying to be nitpicky, but just trying to make sure you understand:
1) There doesn't have to be a link to something for you to get a 404 error. There could be a mod_rewrite issue like Tedster mentions, and that would probably be in some sort of .htaccess file. There aren't going to be any "links" per se in a case like this, just line(s) of code in a config file somewhere.
2) Pages don't return 404 errors (in general). A request for a nonexistent page does. The same goes for nonexistent files. It is possible to have a totally working website that tries to pull something from the client side (such as a missing CSS file) that will show up as a 404 every time. There can also be a call to the back end (like a database) that returns a 404. This can sometimes have no real ill effect on the visitor (but should be fixed anyway). Relative links can screw this up as well.
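To illustrate the point about relative links: the same file name resolves to different URLs depending on how the href is written and which page it sits on. This is only a sketch using Python's standard `urljoin`; the base page URL `http://www.site.com/folder/index.htm` and the helper name are assumptions made up for the example.

```python
from urllib.parse import urljoin

def resolve_links(base_url, hrefs):
    """Return a dict mapping each href to the absolute URL a client would request."""
    return {href: urljoin(base_url, href) for href in hrefs}

# Hypothetical page inside /folder/ with three ways of linking to the same file.
resolved = resolve_links(
    "http://www.site.com/folder/index.htm",
    ["ultra-widget.htm", "/ultra-widget.htm", "../ultra-widget.htm"],
)
for href, url in resolved.items():
    print(f"{href!r} -> {url}")
```

Only the first href stays inside /folder/; the root-relative and "../" forms both point at the site root, which is the kind of URL that would return a 404 if the file only exists in the folder.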
You seem to be able to see (based on your comments about the folders) some sort of pattern to the URLs. Are you seeing 200 OK for all the files that the page requests as well, using LiveHTTPHeaders (including the CSS, JS, ...)?
Yes, I am seeing 200 OK for all the files.
Today I explicitly 301 redirected a few of the top 404 URLs to relevant content pages, and I am noticing one particular thing.
In PuTTY, I can see Googlebot, Yahoo's bot, and Bingbot all making lots of connections now (up to 100), compared to the 1 or 2 connections they used to make.
Was it because they were being sent to 404s and used to stop crawling?
Also, it has been almost 24 hours since I made the changes, and I still don't see many pages indexed under Google search's LAST 24 HOURS filter.
Do I have to wait longer than that?
Hi epnaniac - we too are seeing Googlebot get 404 errors, as the bot is trying to access the rewritten URL in the log instead of the raw URL. Can I ask what type of 404 you are getting? Our error is 404 11 0, so I would be interested in what yours is. We too had a mass traffic loss, but more recently.
I think you have some redirect problems.
Not everyone links to your site the way you want them to - some will not use the www
Your site does not correctly do the redirect for this - when I try to go to:
I should be redirected to:
instead I am redirected to:
Which then attempts to pull the file:
which returns a 404 - you will need to use a tool that allows you to see the background traffic like live http headers and not xenu for this.
I have sent you the actual urls
I noticed that you asked the same question in the IIS forum [webmasterworld.com] - and that thread seems to have a definitive answer:
|Ocean 1000: |
The error is that the URL was double escaped, and by default IIS7 doesn't allow that to be processed. See the following KB article for more details.
HTTP Error 404.11 – URL_DOUBLE_ESCAPED [support.microsoft.com]
Hi Tedster, yes, sorry for double posting; I wanted to get perspective from both a crawling and a technical aspect. We believe the 404 11 errors are coming from Google image searches: when the image is displayed in the top frame, the second frame tries to render the page the image sits on. When that page has a %20 in its URL, Google double escapes it by encoding the %, which requests the page as /keyword%2520keyword/ instead of /keyword%20keyword/.
We have 150,000 sitemaps pages with the %20 inserted and it doesn't seem to have problems though, so we are still baffled as to the reasons of traffic loss.
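The double escaping described above can be reproduced with standard percent-encoding: escaping a space gives %20, and escaping that result again turns the % itself into %25. A minimal sketch follows; the `looks_double_escaped` helper and its regular expression are a hypothetical heuristic for scanning logs, not anything IIS itself uses.

```python
import re
from urllib.parse import quote, unquote

# Heuristic: an encoded percent sign (%25) followed by two hex digits
# suggests a sequence that was already escaped once before.
DOUBLE_ESCAPE = re.compile(r"%25[0-9A-Fa-f]{2}")

def looks_double_escaped(path):
    """Return True if the path appears to contain a double-escaped sequence."""
    return bool(DOUBLE_ESCAPE.search(path))

once = quote("/keyword keyword/")   # space becomes %20
twice = quote(once)                 # the % of %20 becomes %25

print(once)
print(twice)
print(looks_double_escaped(once), looks_double_escaped(twice))
print(unquote(twice) == once)       # decoding one level restores the single escape
```

Running the helper over the requested paths in the log would show whether the 404.11 responses line up with requests containing %25 sequences.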
Sorry, I was away for a few days.
I don't know much about 404 error types; all I can see in my access logs are 404 errors like http 1.1 404 309.
Let me know if that helped and whether you solved your problem :)
@tedster and chris
Thank you for helping me out. You guys are lifesavers, and your tips and pointers to people like us are worth their weight in GOLD. You don't know how much you are doing; your contribution to WW and to us is priceless!
With LiveHTTPHeaders I was able to detect a few file requests that were returning 404, and I also solved a few of the redirection issues in .htaccess.
I worked on page load time and made my site a lot faster. Googlebot crawling has shot up drastically, though I have yet to see traffic recover.
But the original problem remains...
I am still seeing links like
www.site.com/cement.htm in my access log
whereas the original link should be
Also, we are seeing a majority of links like
All these links should be
I have checked .htaccess for possible redirection issues, but haven't found anything.
These links are crawled most often by Googlebot, Yahoo, and Bing bots. I don't know where they are getting these links from :(
Edited to disable autolinking and fix url display
[edited by: Robert_Charlton at 6:20 am (utc) on Jan 17, 2011]
|i am still seeing links like |
www.site.com/cement.htm in my access log
whereas the original link should be
Having redirects/rewrites set up in .htaccess does not mean that you will not see the original (dynamic) URLs in your access logs. If they were previously exposed to search engines, they will still keep requesting them periodically.
The question is - after your .htaccess fixes, what is the server response to such requests? Do you still get a 404, or do you now get a 301 redirect response?
If you are still getting a 404, then I would check .htaccess again. If you are getting a 301 redirect to the correct URL, then this is the correct response and there is nothing else to do.