Forum Moderators: phranque

URLs not followed - Webmaster Tools crawl error

Looking for the reason Googlebot can't follow my redirect links on a site

librarian

6:28 pm on Sep 26, 2008 (gmt 0)

10+ Year Member



A few days ago I found that Google had listed six similar URLs under the Web Crawl area of our Webmaster Tools account that Googlebot couldn't follow. Trying each link listed got me to the correct page. The problem seems to be that the links Google tried to follow were old, incomplete links that I had added to the site's .htaccess file with a redirect to the correct pages. Since the links worked from the Webmaster Tools area, I wondered why they didn't work for Googlebot.

Then I noticed that each of the links uses an 's. I believe the apostrophe might be the problem, so I've been researching ways to remove it using mod_rewrite to see if that would fix things for Googlebot. I haven't found any way to remove it from a link; what I've found is always about using it, or not, in coding, which is not what I need.

Does anyone have experience with the "URLs not followed" area? There was nothing in the help area that fit this situation. One thought I had was to remove the redirects and let the six links land on the 404 page, which Googlebot has not had a problem with yet, as far as I know.

Thanks.

jdMorgan

1:04 am on Sep 27, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You may need to cover multiply-encoded, encoded, and un-encoded renditions of the single-quote character, for example:

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /billy('|%(25)*27)s-stuff\.ht [NC]
RewriteRule ^billy http://www.example.com/billys-stuff.html [NC,R=301,L]

Here we use THE_REQUEST, which is the browser's (or robot's) HTTP request line, as seen in your raw server access logs. This saves us having to figure out whether or to what extent the URL-path 'seen' by RewriteRule has been un-encoded.

To explain the pattern:

The requested URL starts with "billy" followed by either an unencoded apostrophe "'", an encoded apostrophe "%27", or a multiply-encoded sequence in which the "%" itself has been encoded one or more times, followed by the code for an apostrophe. So this should catch "'" (unencoded), "%27" (singly-encoded), "%2527" (doubly-encoded), "%252527" (triply-encoded), and so on.
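To see which branch of the alternation catches which form, here is the same condition annotated (a sketch using the hypothetical "billy" URL-path from the example, written with a solid pipe as Apache requires):

```apache
# Branch-by-branch view of the alternation ('|%(25)*27):
#   '         matches the raw apostrophe:     /billy's-stuff.html
#   %27       matches it singly-encoded:      /billy%27s-stuff.html
#   %2527     matches it doubly-encoded:      /billy%2527s-stuff.html
#   %252527   triply-encoded, and so on -- each (25) absorbs one
#             extra layer of encoding of the "%" itself.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /billy('|%(25)*27)s-stuff\.ht [NC]
```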

Hopefully, that will catch all cases, so you don't have to worry about it.


I put the start of the URL-path, "billy", into the RewriteRule as well. This prevents the server from wasting time evaluating the RewriteCond when the URL doesn't start with a path-part indicating it might need to be corrected. Whatever your URLs might be, put the characters up to the first apostrophe into the RewriteRule pattern to make it as selective as possible, so your server won't waste time on unnecessary checking of the RewriteCond.

To prevent problems in the future, don't use any characters except a-z, A-Z, 0-9, hyphen, and underscore in URLs -- just don't do it. That's how this trouble gets started, because the HTTP spec does not give webmasters complete freedom to choose the URL character set. See RFC 2396 for more information; some characters are allowed in URLs, some are allowed only in query strings appended to those URLs, and some are not allowed at all and must be encoded if they are used.
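One way to enforce such a policy defensively (a hedged sketch, not something from this thread) is a single rule that refuses any request whose decoded URL-path contains a character outside a conservative set, so stray encoded punctuation never reaches the redirect logic:

```apache
# Sketch: return 403 Forbidden for any URL-path containing a character
# outside letters, digits, hyphen, underscore, dot, or slash.
# (mod_rewrite matches against the decoded path, so an encoded %27
# is seen here as a literal apostrophe.)
RewriteRule [^a-zA-Z0-9_/.-] - [F]
```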

Jim

g1smd

8:57 am on Sep 27, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Full stop is also allowed in a URL, and usually appears just before the extension. I never use an underscore in a URL - too many issues.

A lot of people also use comma, colon, and semi-colon in URLs, and that can add unnecessary complication. Never ever use a space.

librarian

11:27 pm on Oct 7, 2008 (gmt 0)

10+ Year Member



Hi,
Thank you for your help. I'm sorry I didn't reply sooner; unfortunately some offline events got in the way. Now I've had an opportunity to try your suggestion in my .htaccess file, but I couldn't get it to work other than to direct every matching request to the 404 page. I think I didn't provide enough information the first time. Here are the URLs as Google sees them:

[bbbbbbb-bbb.bbb...]
[bbbbbbb-bbb.bbb...]
[bbbbbbb-bbb.bbb...]
[bbbbbbb-bbb.bbb...]
[bbbbbbb-bbb.bbb...]
[bbbbbbb-bbb.bbb...]

They redirect to [bbbbbbb-bbb.bbb...]
Below are the original redirects in my .htaccess file. They are there because another engine used to come looking for these three variations. The last one seems to be what caused the URLs Google came up with:

RewriteRule ^vv/firstname_lastname's_sssssssss_sign_-_backwoods_ggggggg$ /vv/firstname_lastname_sssssssss_signs.htm [R=301,L]
RewriteRule ^vv/firstname_lastname's_sssssssss_sign_-_getaway_&_mmmmmm$ /vv/firstname_lastname_sssssssss_signs.htm [R=301,L]
RewriteRule ^vv/firstname_lastname's_sssssssss_sign_-_great_dddddddd$ /vv/firstname_lastname_sssssssss_signs.htm [R=301,L]

The redirects in the .htaccess file worked well until Google got the variations. Is it possible to use what you had without the .htm? I tried removing \.ht, but it didn't seem to go anywhere except the 404 page.

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /billy('|%(25)*27)s-stuff\.ht [NC]
RewriteRule ^billy http://www.example.com/billys-stuff.html [NC,R=301,L]

I hope I haven't confused you too much with all the variations. I have no idea how Google came up with them.

Yes, I know I should never have used the underscore in the site's URLs. But they date from 1998, when there was no clear answer as to which was better, the - or the _. It had never been a problem before, and now the site is too large to change. I've learned that search engines have a long memory: they will still look for URLs that haven't been there for years.


Thank you again for all the help.

Rhoda

jdMorgan

12:02 am on Oct 8, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I suggest that you don't try to explicitly match all possible variations, and instead, just look for anything except the exact, correct URL. If the requested URL "looks kinda like" a correct URL, but isn't exactly right, then redirect it.

Here's an example:


# If requested URL-path is not exactly as desired
RewriteCond %{REQUEST_URI} !^/vv/firstname_lastname_sssssssss_signs\.htm$
# then redirect to correct it
RewriteRule ^vv/firstname.+lastname.+s.+sssssssss.+sign http://www.example.com/vv/firstname_lastname_sssssssss_signs.htm [R=301,L]

I don't understand what all these "ssss" and "dddd" strings are, but that should be close to what you need, and it should handle all the variants you posted.

Jim

librarian

5:23 pm on Oct 9, 2008 (gmt 0)

10+ Year Member



Hi Jim,
The sss's and ddd's just replaced words. I know you're not allowed to put exact file names in postings. But I did want to indicate the variations googlebot was coming up with for a fairly short file name.

The second suggestion worked great. I found one other combination of file requests that also worked with this solution. Another couple looked like they would work but didn't, because the first word, which needed to be the same in all the variations, was missing from one of the requests, or appeared in two of them but not the third.

The timing was good, because overnight Googlebot came up with another variation using capitals. I have a section in the .htaccess file to remove capitals, so this new variation was also taken care of. But now it's becoming a question of where to place things. The capitals section is in the middle of the .htaccess file, but I don't know if that is where it really should be. Are the rules in the correct order to process quickly? So far, everything in the file works.

Processing speed of the .htaccess file is not an issue yet, but I often wonder if I've used the right order, or over-complicated a rule because I copied an example that worked but might not need as many steps as I've used. I have sections for file redirects, case reduction, your new suggestion, directory changes, www to non-www, fixing double slashes, space to underscores, removing characters after .htm, and finally the 404 redirect to the error page.

Thanks again for the help. I'll be watching to see how long it takes googlebot to remove this new error it found.

Rhoda

jdMorgan

5:51 pm on Oct 9, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Generally, the correct order for rules is, external redirects (R=30x) first, and in order from most-specific to least-specific patterns. This means that in almost all cases, a rule that redirects a single URL will be first, followed by rules which redirect well-defined groups of URLs, and the last external redirect is generally the one that redirects from the non-www to www domain (or vice-versa) -- because it redirects *all* requested URLs if the wrong hostname is requested. Note that any preceding rules will have corrected the hostname as well as the URL, if you use a fully-canonical address as the substitution URL to be redirected-to.

Following the external redirects, place your internal rewrites -- again in order from most-specific to least-specific.

There are many cases where similarly-specific rule patterns are mutually-exclusive. In that case, it doesn't matter what order you put those rules in. I'd recommend going with "most-likely-to-be-executed" first.

Finally, a review of the whole thing, with the thought in mind, "Does this order make sense?" is the best protection against unexpected/unexplainable results.
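The ordering described above can be written out as an .htaccess skeleton (a sketch; every URL and hostname here is a hypothetical placeholder):

```apache
# 1. Most-specific external redirects first: single exact URLs,
#    redirected to a fully-canonical (www) address so no later
#    hostname rule has to fire a second time.
RewriteRule ^old-page\.htm$ http://www.example.com/new-page.htm [R=301,L]

# 2. Then redirects for well-defined groups of URLs.
RewriteRule ^archive/(.+)\.htm$ http://www.example.com/articles/$1.htm [R=301,L]

# 3. Hostname canonicalization last among the redirects: it matches
#    *every* URL when the wrong hostname is requested.
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

# 4. Internal rewrites follow, again most-specific first.
RewriteRule ^articles/([0-9]+)$ /show-article.php?id=$1 [L]
```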

Jim

g1smd

5:58 pm on Oct 9, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The first redirect rule should be the most specific, dealing with perhaps one specific URL, or a small group. Within that rule you should also force www at the same time, so that you don't have another later rule doing that and causing a redirection chain.

Next up, slightly less specific rules, and again, force www at the same time, and so on.

Usually, about this time you'll drop in the rule for index file canonicalisation - and again force the www at the same time for those.

The non-www to www redirect is always the last one (in my experience).

After the redirects list out any rewrites. Again, most specific stuff should be first.
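The index-file canonicalisation rule mentioned above is commonly written along these lines (a sketch; the filename and hostname are placeholders). It tests THE_REQUEST so that only URLs the client actually requested are redirected, not internal rewrites that land on the index file:

```apache
# Sketch: externally redirect any direct request for /index.html (in any
# directory) to the bare directory URL, forcing www in the same step so
# no second redirect is needed.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^\ ]*/)?index\.html?\ HTTP/
RewriteRule ^(([^/]*/)*)index\.html?$ http://www.example.com/$1 [R=301,L]
```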

[seems that jd types quicker]