I have a page on my website that alphabetically lists links to around 100 directories. Late last year I started seeing an excessive number of requests from Googlebot in my logs: it was appending a slash after the file extension and requesting nonexistent URLs built off this one page, such as:
example.com/a-zIndex.htm/ExampleDirectory1/ExampleDirectory36/ExampleDirectory8/ExampleDirectory65/anotherPage.htm
example.com/a-zIndex.htm/ExampleDirectory16/ExampleDirectory26/ExampleDirectory81/YetAnotherPage.htm
example.com/a-zIndex.htm/ExampleDirectory18/ExampleDirectory16/ExampleDirectory84/ExampleDirectory94/ExampleDirectory4/YetAnotherPageAgain.htm
...
...
etc
To my horror, I discovered that these requests were all resolving, so I redirected example.com/a-zIndex.htm/ to example.com/a-zIndex.htm, thinking that would sort things out. Now I am seeing over 69,000 URLs in GWT returning 404s. When I click the tab showing where each page is linked from, the URLs listed there are also 404 Not Found, yet the date on some of them shows the page was first discovered only 4 days ago, which is about 3 weeks after I set up the redirect.
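For reference, the redirect I set up looks roughly like this (a minimal sketch, assuming an Apache server with mod_rewrite enabled in .htaccess; adjust for your setup):

```apache
# Sketch only: send anything under the phantom /a-zIndex.htm/ "directory"
# back to the real page with a permanent redirect.
RewriteEngine On
RewriteRule ^a-zIndex\.htm/.+$ /a-zIndex.htm [R=301,L]
```

The 301 should tell Google the phantom URLs have permanently moved, but as described above it seems to be taking a long time to process them.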
My alphabetical listing page has disappeared from the SERPs, yet I notice Google is happy to include one or two of the now nonexistent pages in its results.
What would be the best way to handle this problem? I'm thinking of blocking Googlebot's access to the nonexistent directory /a-zIndex.htm/.
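If I go the robots.txt route, I assume it would look something like this (a sketch; disallowing only the phantom subtree, since the trailing slash means the real page /a-zIndex.htm itself stays crawlable):

```
User-agent: Googlebot
Disallow: /a-zIndex.htm/
```

Is that the right approach, or is it better to let the 404s/redirect get processed naturally?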