WebmasterWorld: Home / Forums Index / Google / Google SEO News and Discussion
Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

    
Googlebot crawling non-existent URLs, 69,000 so far
OnlyOne - msg:4638960 - 3:06 pm on Jan 21, 2014 (gmt 0)

I have a page on my website that alphabetically lists links to around 100 directories. Late last year I started observing an excessive number of requests in my logs from Googlebot: it was adding a slash after the file extension and requesting non-existent URLs hanging off this one page, such as:

example.com/a-zIndex.htm/ExampleDirectory1/ExampleDirectory36/ExampleDirectory8/ExampleDirectory65/anotherPage.htm
example.com/a-zIndex.htm/ExampleDirectory16/ExampleDirectory26/ExampleDirectory81/YetAnotherPage.htm
example.com/a-zIndex.htm/ExampleDirectory18/ExampleDirectory16/ExampleDirectory84/ExampleDirectory94/ExampleDirectory4/YetAnotherPageAgain.htm
...
...
etc

To my horror, I discovered that these requests were all resolving, so I redirected example.com/a-zIndex.htm/ to example.com/a-zIndex.htm, thinking that would sort things out. Now I am seeing over 69,000 URLs in GWT returning 404s. When I click on the tab that shows where a page is linked from, the URLs displayed there are also 404 'not found', yet the date on some of them shows the page was first discovered only 4 days ago - about 3 weeks after I set up the redirect.

My alphabetical listing page has disappeared from the SERPs - yet I notice Google is happy to include one or two of the now non-existent pages in its results.

What would be the best way to handle this problem? I'm thinking of blocking Googlebot's access to the non-existent directory /a-zIndex.htm/
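If blocking crawler access to the phantom directory is the route taken, a robots.txt rule is the usual mechanism - a sketch using the path from the examples above:

```
User-agent: Googlebot
Disallow: /a-zIndex.htm/
```

One caveat worth weighing: a robots.txt block stops Googlebot from fetching those URLs, which also stops it from ever seeing a redirect or 404 on them, so already-discovered URLs may linger in GWT's reports longer rather than dropping out.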

 

bumpski - msg:4639013 - 5:29 pm on Jan 21, 2014 (gmt 0)

Incorporating the link rel="canonical" tag into your pages should clean this up.

Google's support and Matt Cutts on the topic
[support.google.com...]
[mattcutts.com...]
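For reference, the canonical tag goes in the <head> of the affected page and names the URL you want Google to treat as the real one - a sketch using the example domain from this thread:

```html
<head>
  <!-- Tells Google that every path-info variant of this page
       should consolidate to the clean URL: -->
  <link rel="canonical" href="http://example.com/a-zIndex.htm">
</head>
```

Because the bogus /a-zIndex.htm/Whatever/... requests were serving the same page content, each of them would have carried this same tag pointing back at the clean URL.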

evanhall - msg:4639018 - 6:09 pm on Jan 21, 2014 (gmt 0)

You have a relative URL in an href or src on your soft 404 page. It's causing Google to infinitely crawl 404 pages.

Make sure all of your relative URLs start with a '/' in hrefs and srcs, or add a <base> tag to <head>.
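To illustrate the point: when a page is served at a bogus path like /a-zIndex.htm/ExampleDirectory1/, document-relative links resolve against that bogus path and spawn ever-deeper URLs, while root-relative links do not. Hypothetical markup:

```html
<!-- Document-relative: from /a-zIndex.htm/ExampleDirectory1/ this
     resolves to /a-zIndex.htm/ExampleDirectory1/anotherPage.htm,
     handing Googlebot a brand-new phantom URL to crawl: -->
<a href="anotherPage.htm">Another page</a>

<!-- Root-relative: always resolves from the site root,
     regardless of what path the page was served under: -->
<a href="/anotherPage.htm">Another page</a>

<!-- Alternatively, fix every relative URL on the page at once: -->
<base href="http://example.com/">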

OnlyOne - msg:4639053 - 8:40 pm on Jan 21, 2014 (gmt 0)

Incorporating the link rel="canonical" tag into your pages should clean this up.


I don't imagine this will stop Googlebot from wasting time hammering the site for those non-existent URLs, which matters more to me than the ranking of that one page.


You have a relative URL in an href or src on your soft 404 page. It's causing Google to infinitely crawl 404 pages


No, the URLs now resolve to a generic hard 404 'page not found'. The URLs listed in the 'linked from' tab in WMT return a fetch status of "not found" when Fetch as Google is used. It turns out I set up the redirect incorrectly. All the pages should be resolving to example.com/a-zIndex.htm. Instead, I am seeing something like:
example.com/a-zIndex.htmExampleDirectory1/ExampleDirectory36/ExampleDirectory8/ExampleDirectory65/anotherPage.htm
which returns the 404. I would expect them all to gradually disappear as the links to them are removed, but I am seeing the opposite - it's as if the redirect is not working for Googlebot.
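That particular symptom - the slash between .htm and the extra path segments vanishing - is exactly what mod_alias's prefix-matching Redirect produces, since everything after the matched prefix is appended verbatim to the target. A hypothetical .htaccess sketch of the likely bug and one possible fix (assuming Apache; the OP hasn't said which server is in use):

```apache
# Likely bug: Redirect is a prefix match, and the remainder of the
# path is appended to the target, so /a-zIndex.htm/Dir1/page.htm
# becomes /a-zIndex.htmDir1/page.htm -- the 404 seen above:
# Redirect 301 /a-zIndex.htm/ /a-zIndex.htm

# One possible fix: RedirectMatch takes a regex, so the appended
# path info can be discarded entirely instead of re-attached:
RedirectMatch 301 ^/a-zIndex\.htm/.+ http://example.com/a-zIndex.htm
```

The clean URL /a-zIndex.htm itself doesn't match the regex (it requires a trailing slash plus at least one character), so the rule can't loop.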

lucy24 - msg:4639103 - 11:11 pm on Jan 21, 2014 (gmt 0)

A few months back there was a thread in the Apache subforum started by someone who wanted to screen out every possible type of bad request, whether or not they'd ever happened. One category of "things you don't need unless you need them" is the AcceptPathInfo setting. When the handler accepts path info, anything appended after the .html filename is simply ignored, so everything resolves. This is not a problem ... until the day someone asks for such a bogus URL. At that point, you need to set up a global redirect

from
(blahblah.html)more-stuff-here

to
blahblah.html

alone. The exact formulation will depend on whether you're on Apache or IIS. There's no need to constrain it to the Googlebot; you want to redirect everyone, so checking a condition is needless work for the server.
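On Apache, that global redirect could be sketched in .htaccess with mod_rewrite along these lines (a sketch, assuming the .htm/.html files live in the site root; IIS would need the equivalent URL Rewrite rule):

```apache
RewriteEngine On
# Strip bogus path info for everyone: /anything.htm/extra/stuff
# gets a 301 back to /anything.htm.
RewriteRule ^([^/]+\.html?)/ /$1 [R=301,L]

# Alternatively, refuse path info outright so such requests
# 404 instead of resolving (Apache core directive):
# AcceptPathInfo Off
```

The clean URL has no slash after the extension, so it never matches the pattern and the redirect can't loop.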

Josun - msg:4639400 - 7:39 pm on Jan 22, 2014 (gmt 0)

I converted our website to HTML5 format (with pages now ending in .html) and still see "Not found 404" in GWT for the old pages that ended in .htm, despite having set up a redirect. So I understand that no matter what we do, G gives a 404 'not found' error message for pages that no longer exist.
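If every old .htm URL maps one-to-one to a .html twin, the whole migration can be covered by a single rule - a sketch, assuming Apache (and note that persistent 404s in GWT suggest this rule, or its equivalent, isn't actually firing):

```apache
RewriteEngine On
# Send any request for the old .htm extension to its .html counterpart.
RewriteRule ^(.+)\.htm$ /$1.html [R=301,L]
```

The pattern requires the path to end in exactly ".htm", so the .html targets themselves don't match and no loop occurs.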

lucy24 - msg:4639450 - 10:54 pm on Jan 22, 2014 (gmt 0)

Do you mean that every single page that used to be .htm is now .html? If so, then any reported 404 means the redirect isn't taking place.

I suppose it's too late to point out that there was no earthly reason to change either your visible URLs or the physical file extensions :(

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved