Forum Moderators: Robert Charlton & goodroi
Does anyone know what might be causing this?
Is this a problem?
Not all pages show up that way in the log, and it only happens with Google.
Thanks,
Suzie
At first I thought a poorly formulated rewrite rule might be generating this oddity, but that's not the case on the site where I saw it in the logs. Clearly, the first thing to do is search your pages for even one instance of an accidental // -- just one, and the whole site could accumulate duplicate URLs for the same content.
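If you want to automate that search, here's a rough sketch in Python -- the file extensions and the double-quoted href pattern are assumptions, so adjust for how your site is built. It flags any href value containing a // that isn't part of a scheme like http://:

```python
import os
import re

# Assumes double-quoted href attributes; adjust if your markup differs.
HREF = re.compile(r'href="([^"]*)"', re.IGNORECASE)
# A "//" not preceded by ":" -- so "http://" is fine, "/dir//page" is not.
DOUBLE_SLASH = re.compile(r'(?<!:)//')

def find_double_slash_links(root):
    """Walk the site root and report links containing an accidental //."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith((".html", ".htm", ".php")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                for lineno, line in enumerate(f, 1):
                    for href in HREF.findall(line):
                        if DOUBLE_SLASH.search(href):
                            hits.append((path, lineno, href))
    return hits
```

Each hit is a (file, line number, href) tuple, so you can go straight to the offending link.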
I haven't figured out why and it's bugging me. When I see it, I'm left wondering if it's because of something I messed up. I've checked a few things and didn't find any problems. I couldn't find much info on this either. I've been looking...
So far, I haven't noticed URLs with two slashes appearing for those sites in the SERPs. Now wouldn't that be the kicker? Finally getting the canonicals properly indexed, only to be hit by something like this?
I should mention that I haven't noticed this type of crawling on my site with canonical problems -- and it looks good on the test DC. I've seen it on some of my other sites.
The URLs (with two slashes) are delivering a 200, so really, are they considered two pages? Anyone know?
I VERY quickly went through the entire site looking for any error in my linking, but NADA. So, I made ALL of my internal links absolute instead of relative, in order to thwart any continued spidering and indexing of these darned // things! So far, no more have been indexed as Supplementals, but googlebot continues to spider them, because somehow these bogus urls are in googlebot's memory mode.
i.e., all of my internal site links are now of the form:
<a href="http://www.example.com/page1.html">
instead of <a href="page1.html">
The same thing is happening to me. Another site copied a bunch of my content in September and linked back to each of my pages with a double //, and in October my site started slowly choking to death in Google.
Now, if I search for an exact snippet in quotes, I can see some results show only the URL with the double //, and sometimes I see two results: the correct one, and a second one with a double // shown as a Supplemental after I click "repeat the search with the omitted results included".
I have lost around 50% of my traffic, I don't know if this is part of my problem or not.
Someone on another thread suggested adding a base href tag to each page, so I am trying that now.
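For reference, the base tag goes in the head of each page and makes relative links resolve against a fixed URL instead of whatever (possibly double-slashed) URL the page happened to be reached at -- something like this, with your own domain in place of example.com:

```html
<head>
  <base href="http://www.example.com/">
</head>
```

Note that it affects every relative URL on the page (images, scripts, and stylesheets too), so test carefully after adding it.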
[webmasterworld.com...]
Well, I checked the logs on this site again today and Googlebot was still trying to crawl with the double slashes, but at least my rankings (such as they are) are still holding on this site.
At first I thought my webmasters had mistakenly put // in relative links, but we can't find any evidence of that on the sites. Then I thought it could be someone (a competitor) doing evil deeds by pointing // links at our sites, but again we can't find any evidence.
Right now I believe it is solely Googlebot's mistake in crawling //, because no other bots (MSN, Yahoo Slurp, and the rest) are crawling it.
I was able to stop the crawling to all // by placing a disallow in the robots.txt file. I did this to stop the crawling while I tried to figure out what was causing the problem.
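In case anyone wants to try the same stopgap, the disallow presumably looked something like this -- note that a plain Disallow value is a path-prefix match, and the wildcard form is a Google extension, so this is a sketch rather than a guaranteed recipe:

```
User-agent: Googlebot
# paths that begin with a double slash
Disallow: //
# Google honors *, so this also catches a // deeper in the path
Disallow: /*//
```

Remember this only hides the problem from the crawler; it doesn't consolidate the duplicate URLs the way a 301 would.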
For weeks, I could find nothing that was causing the problem and was going to just write it off as a Google bug.
Last week, I deleted a directory and placed a 301 redirect in my .htaccess file. When I checked the redirect, it went to the correct page, but only with double slashes! The redirect itself was correct; the problem was that I had sent the old page to an index.php file. Once I removed the index.php and just used the folder, the double slashes went away.
So now my theory is that since my server is set to use index.php as the base for all folders, the redirects were doing double duty.
I removed any and all instances of index.php wherever I could in my files and especially from any redirects.
So instead of:
Redirect 301 /oldfolder/oldfile.php http://www.mysite.com/newfolder/index.php
I now use:
Redirect 301 /oldfolder/oldfile.php http://www.mysite.com/newfolder/
I removed the disallow from the robots.txt, and so far no crawling of any double slashes.
AFAIK my server is configured correctly. My working theory is that some clone site or scraper site is linking to my pages using this format.
What's worse is that it's causing Googlebot to crawl a bunch of directories that are expressly forbidden in the robots.txt! You'd think a plex full of PhDs could figure out not to put two slashes after the domain, but apparently not.
Anyway what I really need is the .htaccess code for redirecting // pages to the / equivalent. I can't seem to get it to work or find an example online. Anyone?
What's worse is that it's causing Googlebot to crawl a bunch of directories that are expressly forbidden in the robots.txt!
That's how I noticed the // in the first place and the bot had been crawling for two weeks before it started to crawl those pages. It's amazing what you don't see in your logs!
When I changed my robots.txt back to not disallowing those //, I did not put a redirect in place for the bots to go to the correct page. Naturally, the googlebot came back to crawl and found those // again. At this point, I'm sure those are pages that the bot already had and not a link from my site. I'm going to try the snippet that rainborik suggested sometime today and see if it works for me.
What I want to point out is that the original directory indexed had gone Supplemental, and after a few weeks of disallowing //, it came back out. With the recent crawling of the // again, that directory has gone Supplemental again (for duplicate content). This is definitely just a Google problem, and my best guess is that it has something to do with their attempt to correct the canonical issues.
I'm guessing that not many webmasters use a redirect from // to / as general practice. Why would they even think of it unless they had experienced this?
Is this not one of the easiest ways to sc$#w your competitor? You would only have to leave a link to thecompetitor.com// up just long enough for Googlebot to grab it and start crawling their site. Take the link down, and the webmaster has no clue what happened. (Just so you know, and so I don't get slammed: this is not something that I would do.)
# Remove multiple slashes anywhere in URL (less efficient than next rule)
RewriteCond %{REQUEST_URI} ^(.*)//(.*)$
RewriteRule . http://example.com%1/%2 [R=301,L]
Got the above from - [webmasterworld.com...]
A minor change made it work on my server:
# Remove multiple slashes anywhere in URL (less efficient than next rule)
RewriteCond %{REQUEST_URI} ^(.*)//(.*)$
RewriteRule . %1/%2 [R=301,L]
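A variant of the same rule for anyone who would rather not hard-code a domain -- this is a sketch that assumes %{HTTP_HOST} is the host you want in the redirect target, and that your server exposes the doubled slash in REQUEST_URI the way mine does:

```apache
# Same condition as above, but build the redirect target from the
# requested host instead of a hard-coded domain.
RewriteEngine On
RewriteCond %{REQUEST_URI} ^(.*)//(.*)$
RewriteRule . http://%{HTTP_HOST}%1/%2 [R=301,L]
```

Because the pattern only strips one doubled slash per pass, a URL with several // runs gets cleaned up over a short chain of 301s, which is fine for the bots.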