bad urls being crawled

server problem or fix in htaccess

         

proboscis

6:13 pm on Apr 23, 2011 (gmt 0)

10+ Year Member



Hi,

I noticed these types of urls are being crawled:

www.example.com/page.shtml
www.example.com/page.shtml/page1.shtml
www.example.com/page.shtml/page1.shtml/page2.shtml

The first one is correct, but they all show the same content, and it can end up with more duplicate urls than real urls.

Is this an error in the way the server is set up or...?

Thanks!

g1smd

6:49 pm on Apr 23, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Run Xenu LinkSleuth over the site and look very carefully at the report.

You'll need to fix any faulty internal linking and you'll need to also implement a fix such that all future requests for faulty URLs either return the "404 Not Found" status OR else return a 301 redirect to the right version of the URL.

Make sure that MultiViews is turned OFF.
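As a rough illustration of the server-side part of this advice, the lines below switch off MultiViews in .htaccess. The AcceptPathInfo line is an addition not mentioned in the thread: it asks Apache to answer 404 when extra path info follows a real file name. Both directives only work if the host allows the relevant overrides (Options and FileInfo), and a disallowed directive typically produces a 500 error, so treat this as a sketch to try carefully rather than a confirmed fix for this site.

```apache
# Stop content negotiation from guessing at file names
Options -MultiViews

# Assumption: the host permits this override in .htaccess.
# Rejects requests with trailing path info after a real file,
# e.g. /page.shtml/page1.shtml
AcceptPathInfo Off
```

How .shtml files are parsed on the server can affect whether path info is accepted by default, so it is worth re-checking the response headers for a bad URL after making the change.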

proboscis

12:38 am on Apr 25, 2011 (gmt 0)

10+ Year Member



Thanks g1smd!

I can't run Xenu because my host won't allow it, but I found Screaming Frog and ran that. I didn't find anything on my site causing the error, so I think someone is linking to a page incorrectly.

So when I spider a url with a trailing slash after the file name ( .html/ ), every relative url or directory on that page is appended after the file name as if the file were a directory. It does that for a while, and many duplicate urls are created or "found".

So I have some code in htaccess, but it's not entirely working.

Is there a better fix or should I work on making the htaccess code work correctly?

(don't know about MultiViews, I asked, they will probably tell me after the holiday)

g1smd

7:32 am on Apr 25, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, don't use relative linking!

Begin links with a slash and they then count from the root of the site.

You could also add a rule to .htaccess, something like this:

RewriteRule ^([^/]+/)*([^/.]+\.)html/ - [G]


The RewriteRule should be added after you fix your internal links. The rule is not a complete solution in and of itself.

Also, don't use MultiViews. Turn that off too.
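Two details of the rule above are worth noting: as written it matches only file names ending in .html, and the [G] flag returns "410 Gone" rather than "404 Not Found". A possible variant, offered as an untested sketch, extends the pattern to .htm, .shtm and .shtml (the original poster's pages are .shtml) and returns 404 instead. It assumes an Apache version that accepts a non-3xx status in the R flag.

```apache
RewriteEngine On

# Matches e.g. page.shtml/page1.shtml, but not page.shtml on its own.
# With a non-3xx status the substitution is ignored and rewriting stops.
RewriteRule ^([^/]+/)*[^/.]+\.s?html?/ - [R=404,L]
```

Whether 404 or 410 is the better answer for these URLs is a judgment call; either one tells a crawler that the URL should not be indexed.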

jdMorgan

5:15 pm on Apr 25, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Or

# Externally redirect requests to remove "extra" URL-path info appended after htm, html,
# shtm, or shtml filetypes due to bad links on other sites. This should prevent those bad
# links from appearing in search results, as long as the bad links are not on my own site.
RewriteRule ^(([^/.]+/)*[^.]+\.s?html?)/ http://www.example.com/$1 [R=301,L]

Jim

proboscis

8:55 pm on Apr 25, 2011 (gmt 0)

10+ Year Member



Yes, don't use relative linking!


Oh, I see that fixes a large part of the problem!

I have some code in htaccess already, but there are two problems with it that I know of. One, I've seen more than 10,000 301s in my logs. I'm not sure if that is bad or not, but maybe changing to absolute urls will reduce that number.

Two, a url with a trailing slash or double slash that does not exist redirects to my 404 page with a 301.

Options +FollowSymLinks
RewriteEngine on
# Remove multiple slashes anywhere in URL
RewriteCond %{REQUEST_URI} ^(.*)//(.*)$
RewriteRule . http://www.example.com%1/%2 [R=301,L]
#
# Remove trailing slash if filetype present in URL
RewriteRule ^(.+\.[^/]+)/$ http://www.example.com/$1 [R=301,L]

# Remove extra URL-path info if filetype present in URL
RewriteRule ^([^.]+\.[^/]+)/ http://www.example.com/$1 [R=301,L]

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.shtml\ HTTP/
RewriteRule ^(([^/]+/)*)index\.shtml$ http://www.example.com/$1 [R=301,L]

RewriteCond %{REQUEST_URI} !^/404\.html$
RewriteCond %{HTTP_HOST} ^example\.com [OR]
RewriteCond %{HTTP_HOST} ^www\.example\.com\. [OR]
RewriteCond %{HTTP_HOST} ^www\.example\.com\:[0-9]+
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]

#

g1smd

9:15 pm on Apr 25, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A URL with a trailing slash or double slash that does not exist redirects to my 404 page with a 301.

Whenever I see the words "redirect to ... 404 page" I can see that is SEO suicide.

Why does it redirect? The 404 error should be served at the originally requested URL.

After the redirect, what HTTP status code is returned? Is it 200? If it is, you have a major problem on your hands.

Use the Live HTTP Headers extension for Firefox to investigate this in detail.

g1smd

9:21 pm on Apr 25, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



RewriteCond %{REQUEST_URI} ^(.*)//(.*)$
RewriteRule . http://www.example.com%1/%2 [R=301,L]


might work more efficiently as:

RewriteCond %{REQUEST_URI} ^/+([^/]+/)/(.*)$
RewriteRule . http://www.example.com/%1%2 [R=301,L]
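One caveat on the second version: as posted, its pattern appears to match only when the doubled slash comes directly after the first path segment (for example /dir//page.shtml, but not /dir/sub//page.shtml). The sketch below is one way to avoid the leading (.*) while still matching at any depth, and it also collapses a run of three or more slashes in a single step. The pattern is an illustration, not something posted in the thread, and it has not been tested on a live server.

```apache
# %1 = everything before the run of slashes (may be empty)
# %2 = everything after it
RewriteCond %{REQUEST_URI} ^((?:/[^/]+)*)//+(.*)$
RewriteRule . http://www.example.com%1/%2 [R=301,L]
```

A URL with doubled slashes in two separate places would still need two redirects with this rule, just as with the original.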

proboscis

8:45 pm on Apr 26, 2011 (gmt 0)

10+ Year Member



Why does it redirect?


I think the code says to redirect any url with a trailing slash, so even a page that doesn't exist gets redirected. I'm not exactly sure.

I was using fetch as googlebot and it says pages that do not exist return a 301, they are being redirected to my custom 404 page.

But I used live headers as you suggested and that says that a non-existent page with a trailing slash first redirects to the page without the trailing slash with a 301, then the next line says that the page does not exist and returns a 404.

It's not the same as fetch as googlebot?

Now what should I do?

Thanks so much for your time and advice!

g1smd

9:47 pm on Apr 26, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



That's not too bad:

Request: URLa --301--> URLb.

Request: URLb = 404.

If there was more than one 301 step, or there was a 200 at any point then you would have a big problem.

proboscis

12:46 am on Apr 28, 2011 (gmt 0)

10+ Year Member



Oh, it is possible to have more than one 301 step, that is if someone linked to a url with both a double slash and a trailing slash.

But I don't know how likely that would be to happen.

Also, is fetch as googlebot reliable?

It's not showing me a 404, it says:

Request: URLa --301--> URLb
It doesn't say that it continues on to find a 404, only live headers says that.

Sorry this is taking me so long to understand!

g1smd

1:14 am on Apr 28, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yes, Google only makes one request. It does not immediately follow redirects. The redirected-to URL goes into a database and is requested later.

Live HTTP Headers follows the normal browser behaviour in immediately following a redirect instruction by making a new HTTP request.

proboscis

5:44 pm on Apr 29, 2011 (gmt 0)

10+ Year Member



Oh okay, so the 301 redirecting to the 404 page is not a big problem. Good.

But I can still create a situation where there are multiple 301 redirects to the right page, 200 or 404 being the final outcome.

But I think someone would have to purposefully create that link in order for that to happen.

Is that something you think needs to be fixed?
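If the chained redirects are a concern, one possible approach is an extra rule, placed before the general double-slash rule, that handles the combined case in a single 301. This is only a sketch built on the rules quoted earlier in the thread and is untested. It covers one doubled slash plus a trailing slash after a file name, so a URL with several doubled slashes would still take more than one step.

```apache
# Place before the general "remove multiple slashes" rule.
# One redirect for a doubled slash plus a trailing slash after a file name,
# e.g. /dir//page.shtml/ -> /dir/page.shtml
RewriteCond %{REQUEST_URI} ^(.*)//(.*\.[^/]+)/$
RewriteRule . http://www.example.com%1/%2 [R=301,L]
```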

andrew_o

6:52 pm on May 4, 2011 (gmt 0)

10+ Year Member



Hi,

I'm new to Apache and I have a problem like @proboscis, and I have no clue how to make a rule in the .htaccess file.

Can anybody please help me with the exact lines of code?

My correct link is:

http://mysite.com/category/page/4


The incorrect link that doesn't return a 301 redirection to the correct link is:

http://mysite.com/category.html/page/4


==============================================


My correct link is:

http://mysite.com/category-blue/page/4


The incorrect link that doesn't return a 301 redirection to the correct link is:

http://mysite.com/category_blue/page/4


==============================================

My correct link is:

http://mysite.com/view/blue/page/4


The incorrect link that doesn't return a 301 redirection to the correct link is:

http://mysite.com/view-blue/page/4


My links don't have "/" at the end and also don't have ".html" extensions.

Can anybody write the general rule for these 3 cases?

Thank you.
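A possible starting point for the three cases above, offered as an untested sketch. The three examples do not determine a single general rule (a hyphen is correct in the second case but wrong in the third), so the patterns below are assumptions inferred from those examples only: a ".html" after the first path segment is dropped, an underscore in the first segment becomes a hyphen, and the literal prefix "view-" becomes "view/". If the site is run by a CMS with its own rewrite rules, these lines would normally need to come before them.

```apache
RewriteEngine On

# Case 1: /category.html/page/4 -> /category/page/4
RewriteRule ^([^/.]+)\.html/(.*)$ http://mysite.com/$1/$2 [R=301,L]

# Case 2: /category_blue/page/4 -> /category-blue/page/4
RewriteRule ^([^/_]+)_([^/_]+)/(.*)$ http://mysite.com/$1-$2/$3 [R=301,L]

# Case 3: /view-blue/page/4 -> /view/blue/page/4
# "view" is taken literally from the example; other prefixes would need their own rule.
RewriteRule ^view-([^/]+)/(.*)$ http://mysite.com/view/$1/$2 [R=301,L]
```

Note that a request such as /view_blue/page/4 would be redirected twice (first by the second rule, then by the third), so the rules may need adjusting once the full set of valid URLs is known.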