Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Googlebot crawling // pages?

Wondering where the double slashes are coming from

         

suzie250

6:34 am on Dec 31, 2005 (gmt 0)

10+ Year Member



The last few days I have noticed in my logs that some pages crawled by Google show up as mywebsite.com//main.php instead of mywebsite.com/main.php

Does anyone know what might be causing this?
Is this a problem?

Not all pages show up that way in the log and it only happens with google.

Thanks,
Suzie

tedster

1:23 am on Jan 1, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've seen these requests in logs at times, and sometimes I've seen those double slash urls show up in the search results. Theoretically this could cause duplicate content issues, but I haven't come across a real world instance so far where the site was vanishing from search results.

At first I thought a poorly formulated rewrite rule might be generating this oddity, but it's not so on the site where I saw it in the logs. Clearly, the first thing to do is search your pages for even one instance of an accidental // -- just one and the whole site could accumulate duplicate urls for the same content.
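One way to run that search, sketched in Python (the file extensions and the attribute pattern are assumptions -- adjust for your own templates; the [^:] keeps the legitimate :// in absolute URLs from matching):

```python
import pathlib
import re

# Flag href/src attributes whose path contains an accidental double slash.
# "://" in absolute URLs is excluded because [^:] must precede the "//".
pattern = re.compile(r'(href|src)="[^"]*[^:]//')

def find_double_slash_links(root):
    """Return (file, line number, line) for every suspicious link found."""
    hits = []
    for path in pathlib.Path(root).rglob('*'):
        if path.suffix in ('.html', '.php'):
            for n, line in enumerate(path.read_text(errors='ignore').splitlines(), 1):
                if pattern.search(line):
                    hits.append((str(path), n, line.strip()))
    return hits
```

Even one hit is worth fixing, since every page linked through it can be indexed twice.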

Rainie

3:51 am on Jan 1, 2006 (gmt 0)

10+ Year Member



I've been seeing Googlebot do this for months on several of my sites. It will crawl URLs with the extra slash added. Later on, it will come back and crawl them properly (with one slash).

I haven't figured out why and it's bugging me. When I see it, I'm left wondering if it's because of something I messed up. I've checked a few things and didn't find any problems. I couldn't find much info on this either. I've been looking...

So far, I haven't noticed URLs with two slashes appearing for those sites in the SERPs. Now wouldn't that be the kicker? Finally getting the canonicals properly indexed, only to be hit by something like this?

I should mention that I haven't noticed this type of crawling on my site with canonical problems -- and it looks good on the test DC. I've seen it on some of my other sites.

The URLs (with two slashes) are delivering a 200, so really, is it considered two pages? Anyone know?

cws3di

4:37 am on Jan 1, 2006 (gmt 0)

10+ Year Member



A couple of months ago, one of my sites was crawled with //, and about 10 of those URLs showed up under site:example.com almost immediately as Supplemental Results (i.e. duplicate content).

I VERY quickly went through the entire site looking for any error in my linking, but NADA. So, I made ALL of my internal links absolute instead of relative, in order to thwart any continued spidering and indexing of these darned // things! So far, no more have been indexed as Supplementals, but Googlebot continues to spider them, because somehow these bogus URLs are stuck in its memory.

i.e. all of my internal site links are now of the form:

<a href="http://www.example.com/page1.html">

instead of:

<a href="page1.html">

proboscis

12:43 am on Jan 3, 2006 (gmt 0)

10+ Year Member



Hi,

Same thing is happening to me. Another site copied a bunch of my content in September and linked back to each of my pages with a double //. Then in October my site started slowly choking to death in Google.

Now, if I search for an exact snippet in quotes, I can see some results showing the URL only with the double //, and sometimes I see two results: the correct one, and a second one with a double // shown as a Supplemental after I click "repeat the search with the omitted results included".

I have lost around 50% of my traffic; I don't know if this is part of my problem or not.

Someone on another thread suggested adding a base href tag to each page so I am trying that now.
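For reference, the base tag mentioned goes in the head of each page and gives the browser (and, in theory, a spider) an explicit root for resolving relative links -- the example.com URL here is just a placeholder:

```html
<head>
  <!-- All relative links on this page now resolve against this root -->
  <base href="http://www.example.com/">
</head>
```

It doesn't stop anyone from linking to you with a doubled slash, but it does keep your own relative links from compounding the problem on a //-crawled copy of the page.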

stinkfoot

7:48 pm on Jan 3, 2006 (gmt 0)

10+ Year Member



Possibly related:

[webmasterworld.com...]

rainborick

9:27 pm on Jan 6, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I encountered this phenomenon about mid-October and set up 301 redirects to try to head them off. I also ran Xenu Link Sleuth to see if I could find the source of the problem, but never did locate anything suspicious.

Well, I checked the logs on this site again today and Googlebot was still trying to crawl with the double slashes, but at least my rankings (such as they are) are still holding on this site.

sit2510

6:31 am on Jan 7, 2006 (gmt 0)

10+ Year Member



I also encountered this problem with Googlebot crawling //. First I changed relative links to absolute and added a base href, but those changes did not seem to help, since Googlebot continued to crawl the // URLs successfully. A week later I added a 301 redirect from // to /, but the bots still attempt to fetch // URLs that were redirected to / a few months back.

At first I thought it was my webmasters' mistake, putting // in relative links, but we can't find any evidence of that on the sites. Then I thought it could be someone (a competitor) doing evil deeds by pointing // links at our sites, but again we can't find any evidence.

Right now I believe it is purely Googlebot's mistake in crawling //, because no other bots (MSN, Yahoo Slurp, and the others) are crawling those URLs.

suzie250

9:14 pm on Jan 18, 2006 (gmt 0)

10+ Year Member



I thought I should come back and tell you how I resolved this problem (at least for now it's not a problem).

I was able to stop the crawling of all // URLs by placing a disallow in the robots.txt file. I did this to stop the crawling while I tried to figure out what was causing the problem.
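For anyone wanting the same stopgap: robots.txt Disallow rules are simple prefix matches against the request path, so a rule along these lines (a sketch -- it assumes the doubled slash appears right after the domain, as in the logs above) blocks the // paths without touching the normal ones:

```txt
User-agent: Googlebot
# Paths like //main.php begin with two slashes, so this prefix
# matches them; /main.php is unaffected.
Disallow: //
```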

For weeks, I could find nothing that was causing the problem and was about to just write it off as a Google bug.

Last week, I deleted a directory and placed a 301 redirect in my .htaccess file. When I checked the redirect, it went to the correct page, but with double slashes! The redirect itself was correct; the problem was that I pointed the old page at an index.php file. Once I removed the index.php and just used the folder, the double slashes went away.

So now my theory is that since my server is set to use index.php as the default document for all folders, the redirects were doing double duty.

I removed any and all instances of index.php wherever I could in my files and especially from any redirects.

So instead of:

Redirect 301 /oldfolder/oldfile.php http://www.mysite.com/newfolder/index.php

I now use:

Redirect 301 /oldfolder/oldfile.php http://www.mysite.com/newfolder/
I removed the disallow from the robots.txt, and so far there has been no crawling of any double slashes.

amythepoet

9:31 pm on Jan 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



How do I check to see if google has started to crawl my site?

travelin cat

9:43 pm on Jan 19, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



amy, you will have to check your server logs. Hopefully your ISP allows this and, better yet, has a stats program that will let you easily see which SEs are crawling your site.

amythepoet

9:53 pm on Jan 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks, I believe it's a crawlin now

I'm getting a new stats program, AWStats, soon, so that should help.

travelin cat

12:29 am on Jan 20, 2006 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



awstats is what we use.. it shows you how many times the spiders hit and how much bandwidth they each use... very useful stuff.

amythepoet

12:50 am on Jan 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Oh sounds excellent to me. I can't wait to get it.

jomaxx

9:46 pm on Jan 25, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I have suddenly started experiencing this same problem -- Googlebot is spidering a bunch of URLs like this: GET //index.html

AFAIK my server is configured correctly. My working theory is that some clone site or scraper site is linking to my pages using this format.

What's worse is that it's causing Googlebot to crawl a bunch of directories that are expressly forbidden in the robots.txt! You'd think a 'plex full of PhDs could figure out not to put two slashes after the domain, but apparently not.

Anyway what I really need is the .htaccess code for redirecting // pages to the / equivalent. I can't seem to get it to work or find an example online. Anyone?

rainborick

10:06 pm on Jan 25, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I used this:

RewriteEngine On
RewriteCond %{REQUEST_URI} ^//(.*)?$
RewriteRule ^/(.*)$ http://www.mysite.com/$1 [R=301]

jomaxx

12:28 am on Jan 26, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for the snippet, but I can't get it to do anything for me.

suzie250

7:00 pm on Jan 26, 2006 (gmt 0)

10+ Year Member



What's worse is that it's causing Googlebot to crawl a bunch of directories that are expressly forbidden in the robots.txt!

That's how I noticed the // in the first place and the bot had been crawling for two weeks before it started to crawl those pages. It's amazing what you don't see in your logs!

When I changed my robots.txt back to allowing those //, I did not put a redirect in place to send the bots to the correct page. Naturally, Googlebot came back to crawl and found those // again. At this point, I'm sure those are pages the bot already had, not links from my site. I'm going to try the snippet that rainborick suggested sometime today and see if it works for me.

What I want to point out is that the original directory indexed had gone Supplemental, and after a few weeks of disallowing //, it came back out. With the recent crawling of the // again, that directory has gone Supplemental again (for duplicate content). This is definitely just a Google problem, and my best guess is that it has something to do with their attempt to correct the canonical issues.

I'm guessing that not many webmasters use a redirect from // to / as general practice. Why would they even think of that, unless they have experienced it.

Is this not one of the easiest ways to sc$#w your competitor? You would only have to leave a link to thecompetitor.com// long enough for Googlebot to grab it and start crawling their site. Take the link down, and the webmaster has no clue what happened. (Just so you know, and so I don't get slammed: this is not something that I would do.)

suzie250

8:53 pm on Jan 26, 2006 (gmt 0)

10+ Year Member



rainborick's suggestion did not work for me either, but this did (testing on localhost):

# Remove multiple slashes anywhere in the URL
RewriteCond %{REQUEST_URI} ^(.*)//(.*)$
RewriteRule . http://www.example.com%1/%2 [R=301,L]

Got the above from - [webmasterworld.com...]

jomaxx

9:05 pm on Jan 26, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks. The redirect portion still didn't work for me because the entire server path to that page was being inserted in the new URL.

A minor change made it work on my server:

# Remove multiple slashes anywhere in the URL
RewriteCond %{REQUEST_URI} ^(.*)//(.*)$
RewriteRule . %1/%2 [R=301,L]
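For anyone who wants to sanity-check what that pair of lines ends up doing before deploying it, here's a rough Python sketch of the net effect (an illustration only, not Apache itself -- the rule fixes one // per 301 round trip, so a URL with several doubled slashes gets cleaned up over a couple of redirects):

```python
import re

def collapse_slashes(uri):
    """Net effect of the rewrite above: if the request URI contains '//',
    return the single-slash URI the 301 should end up pointing at;
    otherwise None (the RewriteCond fails and no redirect is issued)."""
    if '//' not in uri:
        return None
    return re.sub(r'/{2,}', '/', uri)
```

So GET //index.html ends up 301'd to /index.html, and a clean request passes through untouched.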