Msg#: 4638958 posted 3:06 pm on Jan 21, 2014 (gmt 0)
I have a page on my website that alphabetically lists links to around 100 directories. Late last year I started observing an excessive number of requests from Googlebot in my logs, appending a slash after the file extension and requesting non-existent URLs off this one page, such as:
To my horror, I discovered that these requests were all resolving, so I redirected example.com/a-zIndex.htm/ to example.com/a-zIndex.htm, thinking that would sort things out. Now I am seeing over 69,000 URLs in GWT returning 404s. When I click on the tab that shows where the page is linked from, the URLs displayed there are also 404 "not found", yet the date on some of them shows that the page was first discovered only four days ago, about three weeks after I set up the redirect.
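The post doesn't say how that redirect was implemented; purely as an assumption, it may have been a mod_alias rule along these lines. Note that Redirect matches by prefix and appends whatever follows the matched prefix to the target, which would produce exactly the kind of run-together URLs reported further down the thread:

    # assumed .htaccess rule, not confirmed by the post
    # Redirect matches /a-zIndex.htm/ as a prefix, so a request for
    # /a-zIndex.htm/Dir1/page.htm gets sent to /a-zIndex.htmDir1/page.htm
    Redirect 301 /a-zIndex.htm/ /a-zIndex.htm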
My alphabetical listing page has disappeared from the SERPs, yet I notice Google is happy to include one or two of the now non-existent pages in its results.
What would be the best way to handle this problem? I'm thinking of blocking Googlebot's access to the non-existent directory /a-zIndex.htm/.
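If crawling were blocked rather than the URLs redirected, the robots.txt entry would look something like this (a sketch only; the a-zIndex.htm name is taken from the posts above):

    # robots.txt - stop Googlebot requesting anything under the phantom "directory";
    # the trailing slash keeps /a-zIndex.htm itself crawlable
    User-agent: Googlebot
    Disallow: /a-zIndex.htm/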
Msg#: 4638958 posted 8:40 pm on Jan 21, 2014 (gmt 0)
Incorporating the link rel="canonical" tag into your pages should clean this up.
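For example, the head of a-zIndex.htm would carry something like this (a sketch, with example.com standing in for the real domain), so that whatever path-info variant Googlebot reaches, it is told the preferred URL:

    <link rel="canonical" href="http://example.com/a-zIndex.htm">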
I don't imagine this will stop Googlebot wasting time hammering the site for those non existent urls which is more important to me than the ranking of that one page.
You have a relative URL in an href or src on your soft 404 page. It's causing Google to crawl 404 pages endlessly.
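To illustrate the mechanism (hypothetical markup, not taken from the site): if the error page returns a 200 status and contains a relative link, each bogus URL spawns a deeper one.

    <!-- on a soft-404 page served for /a-zIndex.htm/ExampleDirectory1/ -->
    <a href="ExampleDirectory36/anotherPage.htm">...</a>
    <!-- a crawler resolves that against the current (bogus) path, giving
         /a-zIndex.htm/ExampleDirectory1/ExampleDirectory36/anotherPage.htm -->
    <!-- a root-relative href avoids the problem: -->
    <a href="/ExampleDirectory36/anotherPage.htm">...</a>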
No, the URLs now resolve to a generic hard 404 "page not found". The URLs listed in the "linked from" tab in WMT return a fetch status of "not found" when Fetch as Google is used. I've set up the redirect incorrectly. All the pages should be resolving to example.com/a-zIndex.htm. Instead, I am seeing something like example.com/a-zIndex.htmExampleDirectory1/ExampleDirectory36/ExampleDirectory8/ExampleDirectory65/anotherPage.htm, which returns the 404. I would expect them all to gradually disappear as the links to them are removed, but I am seeing the opposite; it's as if the redirect is not working for Googlebot.
Msg#: 4638958 posted 11:11 pm on Jan 21, 2014 (gmt 0)
A few months back there was a thread in the Apache subforum started by someone who wanted to screen out every possible type of bad request, whether or not it had ever happened. One category of "things you don't need unless you need them" is the AcceptPathInfo setting. By default, anything appended to a URL ending in .html is treated as path info and ignored, so everything resolves. This is not a problem ... until the day someone asks for such a bogus URL. At that point, you need to set up a global redirect from blahblah.html/anything to blahblah.html alone. The exact formulation will depend on whether you're on Apache or IIS. There's no need to constrain it to Googlebot; you want to redirect everyone, so checking a condition is needless work for the server.
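On Apache, a minimal sketch of such a global redirect might look like this (assuming mod_rewrite is available; the rule is an illustration, not a formulation given in the thread):

    RewriteEngine On
    # if anything follows a .htm/.html file name, 301 to the file name alone;
    # the non-greedy .+? stops at the first extension, so
    # /a-zIndex.htm/Dir1/page.htm redirects to /a-zIndex.htm
    RewriteCond %{REQUEST_URI} ^/(.+?\.html?)/.
    RewriteRule ^ /%1 [R=301,L]
    # alternatively, AcceptPathInfo Off makes such requests return 404 outright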
Msg#: 4638958 posted 7:39 pm on Jan 22, 2014 (gmt 0)
I converted our website to HTML5 (with pages ending in .html) and still see "Not found 404" in GWT for old pages that ended in .htm, despite the fact that I set up a redirect. So I understand that, no matter what we do, G reports a "Not found 404" error for pages that no longer exist.
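The actual rule used isn't shown in the post; a sketch of the sort of redirect described, assuming Apache with mod_rewrite and a .htaccess file at the document root, would be:

    RewriteEngine On
    # send any request for an old .htm URL to its new .html equivalent
    RewriteRule ^(.+)\.htm$ /$1.html [R=301,L]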