

When to use noindex, 301 or robots.txt ?

         

realmaverick

9:11 pm on Mar 11, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Good evening guys, I'm working on clearing the tens of thousands of errors in Webmaster Tools, caused by various things, as well as weeding out "junk".

1. The forum software we use utilises tags and, in the latest version, started using friendly URLs. This means the tens of thousands of old tag URLs all now return 404. In this case I am going to do a 301, though I cannot get it to work. I've tried about a billion different things, including:

RewriteRule ^(.*)\-tag\.html&tagtype=contentType$ http://www.example.com/content/contentType/tag/$1/ [R=301,L]
*FIXED THIS PART*

The old tag URL is example.com/example-tag,html&tagtype=contentType and the new tag URL is http://www.example.com/content/contentType/tag/example/

2. We have thousands of 404s from a developer's error. The 404 is correct: the pages don't exist and never will. The bad URLs look like www.example.com/app=core/example... and I have Disallowed the entire directory in robots.txt. The error in the code that caused the links to those pages is also fixed.

3. 2 million profiles, many of which are identical. I have added noindex, follow to all profile pages and removed many of the links that lead to profiles.

4. We went from using a WordPress blog to the Invision blog, as it integrates properly with our forum and the rest of our CMS. However, the URLs don't match. They contain an ID, making a single rewrite rule impossible. My developer wrote a script that contains 1,000 301s for the .htaccess. I'm hoping 1,000 lines of code won't hinder performance.

Have I made the right choices?

[edited by: realmaverick at 10:07 pm (utc) on Mar 11, 2012]
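
For item 4, a one-to-one mapping in .htaccess might look like the sketch below. The old and new blog paths are invented for illustration (the thread does not show the real ones), and the rules are written with mod_rewrite to match the rest of the file rather than as a claim about what the developer's script emits.

```apache
# Hypothetical old WordPress permalinks, each mapped to its new entry.
# Every path and ID below is made up for this example.
RewriteEngine On
RewriteRule ^blog/2011/05/old-post-title/?$ http://www.example.com/blog/entry/123-old-post-title/ [R=301,L]
RewriteRule ^blog/2011/06/another-old-post/?$ http://www.example.com/blog/entry/124-another-old-post/ [R=301,L]
```

Each pattern is anchored at both ends and contains only literal text, so a non-matching request is typically rejected after a few characters. If server-level configuration is available, a RewriteMap lookup table is another option, but that directive is not permitted in .htaccess.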

g1smd

9:51 pm on Mar 11, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



example.com/example-tag,html&tagtype=contentType

Are you sure?

Comma before html?

Ampersand is not a valid character in the path part of a URL.

Where is the ? query string delimiter?

Since (.*) reads in the entire URL path to the very end, it should NEVER appear at the beginning or in the middle of a RegEx pattern.

realmaverick

10:06 pm on Mar 11, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I settled on this:

RewriteRule ^(.*)-tag(.*)\.html&tagtype=widget http://www.example.com/widget/widgets/tag/$1/ [R=301,L]
which works.

The reason I couldn't get anything to work was that I'd placed the rule in the wrong part of the .htaccess (smacks head). But the code did need a few refinements to get the result I needed, i.e. the comma rather than the period, etc.

Before moving the rule up the code, it wasn't doing anything at all. So I missed the obvious error.

[edited by: realmaverick at 10:30 pm (utc) on Mar 11, 2012]

g1smd

10:09 pm on Mar 11, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



One (.*) pattern in the rule was bad.

Two is a disaster.



Ampersand is not a valid character in the path part of a URL.

Where is the ? query string delimiter?

[edited by: g1smd at 10:10 pm (utc) on Mar 11, 2012]

realmaverick

10:10 pm on Mar 11, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



One last question, to add to the list.

5. My members periodically request removal of a piece of their content, which would have led to a 404. Instead we use a 410 gone page.

But will this prevent the flow of link juice if that page had acquired a number of links?

One thought is to 301 redirect to the homepage, but then the content hasn't moved to the homepage.

Another option is to leave the page as it is, but remove the screenshots and download button, with a small message informing users that the content has been removed.

g1smd

10:12 pm on Mar 11, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Never mass redirect URLs for multiple removed pages to a single page.

You'll have your site flagged for "soft 404 errors" and that's a bad thing.

realmaverick

10:13 pm on Mar 11, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



g1smd, why do you say that?

The tag is a variable, and on some tags there is also a number that comes after the word tag. I cannot see another option, other than to leave them 404'd?

I have tested 100 from the list in WMT and all of them successfully 301 redirected to the correct page.

I'm not very well versed with htaccess or modrewrite, so please let me know the danger of what I've done.

Thanks

realmaverick

10:17 pm on Mar 11, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Looks like I replied before you edited your post.

Never mass redirect URLs for multiple removed pages to a single page.

You'll have your site flagged for "soft 404 errors" and that's a bad thing.


That's not what I'm doing.

RewriteRule ^(.*)-tag(.*)\.html&tagtype=widget http://www.example.com/widget/widgets/tag/$1/ [R=301,L]
which works.

The first (.*) is to catch all tags, for example madonna-tag, and the second is to catch a number that sometimes appears after the tag, e.g. tag123.html

It then redirects to http://www.example.com/widget/widgets/tag/madonna/

There is only one Madonna tag, and it may or may not contain a number. Some of the tags do and others don't. There aren't multiple versions of each, though. The number is unique to the tag.

So ultimately, let's say there are 50,000 old tag pages: they will redirect to 50,000 different tag pages of the new format. Not 50,000 to 1.

I hope I'm making sense, I've been working on these for so long today, I feel like I'm losing my mind.

realmaverick

1:39 am on Mar 12, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Holy crap, just found a 404 on a blog post that has a link from a PR8 and 9 website!

lucy24

5:07 am on Mar 12, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



That's not what I'm doing.

You may have misunderstood what a "soft 404" is. In brief: it's when you don't return a 404, but instead redirect all requests for nonexistent URLs to some central page. Search engines hate this. (By weird coincidence, so does this user. This is not something you hear every day.) So if you're returning 404s you're doing the right thing. If you're talking about pages that used to exist but no longer do, then 410 is correct.
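
If the removal is handled at the server rather than inside the CMS, a 410 for a single removed page could be sketched like this. The path is hypothetical; the [G] flag is mod_rewrite's way of returning 410 Gone.

```apache
# Hypothetical URL of a piece of content that a member asked to have removed.
# "-" means no substitution; [G] sends a 410 Gone response.
RewriteRule ^files/removed-widget\.html$ - [G]
```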

RewriteRule ^(.*)-tag(.*)\.html&tagtype=widget http://www.example.com/widget/widgets/tag/$1/ [R=301,L]
which works.

The first (.*) is to catch all tags, for example madonna-tag, and the second is to catch a number that sometimes appears after the tag, e.g. tag123.html

It then redirects to http://www.example.com/widget/widgets/tag/madonna/

There is only one Madonna tag, and it may or may not contain a number. Some of the tags do and others don't. There aren't multiple versions of each, though. The number is unique to the tag.

Essential facts to remember:

.* means capture as much as you possibly can, from zero to infinity. Regular Expressions are greedy by default; the RegEx variety used in mod_rewrite is no exception.

Equally important: You and I are looking at the Rule in two dimensions, seeing the whole thing from beginning to end. The server operates in one dimension; it doesn't know what's looming ahead until it gets there.

So let's do a walkthrough of your server meeting the rule

(.*)-tag(.*)\.html&tagtype=widget

.* >> Server dutifully captures the entire request.

-tag >> Server says "Oh, oops, I have to see if there's the literal string '-tag' after my capture." Backtrack, backtrack, backtrack-- until it's back to the piece before "-tag". If there was no "-tag" in the request, it spits out the whole Capture and moves on to the next Rule.

.* again >> Server dutifully captures everything from "-tag" all the way to the end of the text.

\.html >> Server says "Oh, oops", et cetera as above, and again has to backtrack until it finds an .html. Again, if there was no html, it gives up on the Rule and moves on.

& >> Server says "WTF? There's no ampersand here! Those belong in query strings, and I can't see query strings." So, after two captures and two backtracks, the server ends up having to abandon the whole thing anyway.

Setting aside the query-string issue:

Going by your synopsis, the first .* is not "capture anything and everything up to the end". It's "capture some specific word". Best expressed as ^([A-Za-z]+) meaning "capture away-- so long as you meet nothing but letters". Use + rather than * because I assume you don't have null -tags.

The second .* is, again, not "capture anything and everything up to the end". It's "there may be some numbers here". Best expressed as ([0-9]*).

And now we can talk about a RewriteCond looking at %{QUERY_STRING}. But not here and not yet, because it's only a few days since I last posted the boilerplate.
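
A sketch of what that would look like, assuming the old URLs used a real query string, i.e. example-tag.html?tagtype=widget. That form is an assumption for illustration only; the URLs shown in this thread have the ampersand in the path.

```apache
# RewriteRule patterns see only the URL path, so the query string
# has to be tested separately in a RewriteCond.
# Assumed old URL: /example-tag.html?tagtype=widget (hypothetical form).
RewriteCond %{QUERY_STRING} ^tagtype=widget$
RewriteRule ^([A-Za-z]+)-tag[0-9]*\.html$ http://www.example.com/widget/widgets/tag/$1/? [R=301,L]
```

The trailing ? on the target stops the old query string from being appended to the new URL.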

realmaverick

5:39 am on Mar 12, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi Lucy, the redirect was not an attempt to create 301's for pages that haven't existed, but create 301's for pages that have moved.

Also as far as I have tested, the rule redirects from old to new. It doesn't redirect to a single page.

I've followed your suggestions and hopefully now it's more efficient?

I had to change ^([A-Za-z]+) to ^([A-Za-z0-9]+) as some tags are numbers; for example, somebody named a tag 007.

This rewrite stuff makes me so anxious.

Thanks a lot for your help. Please let me know if I've made any obvious mistakes with the implementation.

I have checked a ton of tags and they all appear to be 301 redirecting perfectly to the new URL's.

realmaverick

6:02 am on Mar 12, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here is how it looks now

RewriteRule ^([A-Za-z0-9]+)-tag([0-9]*)\.html&tagtype=widget http://www.example.com/widget/widgets/tag/$1/ [R=301,L]


The old URLs were

http://www.example.com/example-tag.html&tagtype=widget


Not through choice, but the CMS we use. They have now been changed to

http://www.example.com/widget/widgets/tag/example/


As I say, the redirects are successful. But if I've made more mistakes, I would very much appreciate it, if you can point them out.

Thanks a lot and g'nite. 6am here.

lucy24

6:39 am on Mar 12, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Wow. If it's working, then I guess your old URLs really and truly have ampersands in them. Surprised they didn't change in transit to %26. Or make the server explode, or something.

But wait. You're only using $1 in your RewriteRule. I think you said that for any given -tag, there's only one possible number, or no number at all. Then you don't need to capture the [0-9] piece.

Do there exist addresses that start out

blahblah-tag123

but don't continue through ".html&tagtype=widget" at the end? If not, you don't even need to look at anything after "-tag". Get that server outta there two nanoseconds sooner ;)
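
Both suggestions can be sketched as follows. The first variant keeps the full pattern but drops the second capture; the second stops at "-tag" and is only safe under the assumption stated above, that no other URL on the site begins with a word followed by "-tag". Only one of the two would be used.

```apache
# Variant 1: match the optional number without capturing it.
RewriteRule ^([A-Za-z0-9]+)-tag[0-9]*\.html&tagtype=widget http://www.example.com/widget/widgets/tag/$1/ [R=301,L]

# Variant 2: stop matching right after "-tag".
# Assumes every URL that starts this way is an old tag URL.
RewriteRule ^([A-Za-z0-9]+)-tag http://www.example.com/widget/widgets/tag/$1/ [R=301,L]
```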

rlange

2:54 pm on Mar 12, 2012 (gmt 0)

10+ Year Member



lucy24 wrote:
.* >> Server dutifully captures the entire request.

This is false. If you have a regex of the form "^(.*)-tag" and match it against the string "foo-tag bar-tag", the captured subpattern will contain "foo-tag bar", not "foo-tag bar-tag".

[regular-expressions.info...]

--
Ryan

realmaverick

4:02 pm on Mar 12, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I guess the main thing is that the redirect is working and hopefully isn't too much of a waste of resources. I do feel somewhat uneasy though, as it's such a complex thing to mess with.

I am glad the 404's will slowly go away though.

Lucy, I'm still interested in how I'd make this a RewriteCond, if you fancy sharing your know-how some more :)

lucy24

10:08 pm on Mar 12, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This is false. If you have a regex of the form "^(.*)-tag" and match it against the string "foo-tag bar-tag", the captured subpattern will contain "foo-tag bar", not "foo-tag bar-tag".

You've misunderstood. The Regular Expression first captures all the way to the end. It then has to backtrack, spitting out part of its original capture, until it finds the last occurrence of -tag. If the request happens to contain two occurrences of -tag, with the second specified piece [0-9]+ between them, then the Regular Expression has to backtrack again. Only if you express each piece as .*? -- or preferably constrain it even more narrowly -- will it stop as soon as it hits the first -tag, with no backtracking needed.

I'm still interested in how I'd make this a RewriteCond

The best RewriteRules have no conditions, because the format of the "pattern" creates its own condition.

Remember that mod_rewrite moves two steps forward, one back. It first looks at the rule. If and only if the rule matches the current request-- for example, if the Rule says \.html$ and the request is for widgets/foobar.html-- then it steps back and looks at the Conditions that belong to the Rule. So if your rule boils down to "redirect everything that looks like this" you don't need Conditions.

g1smd

10:13 pm on Mar 12, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This is false. If you have a regex of the form "^(.*)-tag" and match it against the string "foo-tag bar-tag", the captured subpattern will contain "foo-tag bar", not "foo-tag bar-tag".

Yes. This will be the "end" result, but only after hundreds of back off and retry "trial match" attempts. This is very slow and inefficient.

Use a more specific pattern, one that can be parsed from left to right in just one pass.

realmaverick

12:03 am on Mar 13, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hmm all a bit over my head. I followed guides to create the rule.

If anybody can give me a specific, more efficient rule, that'd be great.

Thanks

lucy24

12:21 am on Mar 13, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Your most recent rule looks perfect. The only remaining question is whether there's too much of it :)

realmaverick

9:55 am on Mar 13, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks Lucy. I've come across a small issue. It's not picking up tags that have + or - in them. So for example it picks up blue-tag.html but not blue+widget-tag.html or blue-widget-tag.html

What's the most efficient way to pick that up?

g1smd

10:02 am on Mar 13, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Add the + character to the "allowed characters" grouping in the pattern.

realmaverick

10:57 am on Mar 13, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi g1smd, I've added the plus with success, however I can't seem to pick up the hyphen in *blue-widget*-tag.html

rlange

1:01 pm on Mar 13, 2012 (gmt 0)

10+ Year Member



lucy24 wrote:
You've misunderstood. The Regular Expression first captures all the way to the end. It then has to backtrack, spitting out part of its original capture, until it finds the last occurrence of -tag.

Argh. I should have used a .* pattern when reading your post. My bad.

realmaverick wrote:
Hi g1smd, I've added the plus with success, however I can't seem to pick up the hyphen in *blue-widget*-tag.html

You'll have to add the hyphen to the character class, too. Since it's a metacharacter within character classes, you'll have to put it at either the beginning ("[-a-zA-Z0-9+]") or the end ("[a-zA-Z0-9+-]") to make sure it works as expected.

--
Ryan

g1smd

9:05 pm on Mar 13, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



[a-zA-Z0-9+-]+
will try to pick up "some-words-here-" with a trailing hyphen on the first pass. It's difficult to find a less greedy and less ambiguous pattern with the hyphenated parts so early in the URL path.

realmaverick

9:40 pm on Mar 13, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hmm, it won't grab the - for some reason, though it's managing to get its mitts on the +.

lucy24

11:55 pm on Mar 13, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Some RegEx dialects are persnickety and will only recognize the hyphen - if it comes at the very beginning of the bracketed group. Another thing to try is escaping it \-
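
The hyphen placements mentioned above, shown in the rule from earlier in the thread. This is a sketch rather than the exact rule that ended up in the .htaccess, which the thread does not show.

```apache
# Three ways to write a literal hyphen inside a character class:
#   [-A-Za-z0-9+]    hyphen first
#   [A-Za-z0-9+-]    hyphen last
#   [A-Za-z0-9+\-]   hyphen escaped
RewriteRule ^([-A-Za-z0-9+]+)-tag[0-9]*\.html&tagtype=widget http://www.example.com/widget/widgets/tag/$1/ [R=301,L]
```

For blue-widget-tag.html&tagtype=widget the class first consumes "blue-widget-tag", then backs off four characters so that the literal "-tag" can match, leaving "blue-widget" in $1.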

realmaverick

12:03 am on Mar 14, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Wicked, that worked a treat. Thanks Lucy :)

realmaverick

12:05 am on Mar 14, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I bloody love the ability to remove pages in GWT from the lists, as you fix them!

realmaverick

12:26 am on Mar 14, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Any idea how I specify a 301 redirect for one particular URL, index.php to /, without it affecting index.php?blah

g1smd

1:03 am on Mar 14, 2012 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



RewriteCond %{QUERY_STRING} !.


The above is true if there is no query string.

However you'll likely generate an internal loop when DirectoryIndex changes the internal pointer to point to "index.php" again.

Use a solution that tests THE_REQUEST for the correct path instead. There are hundreds of prior code examples here.
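
One common form of that THE_REQUEST test, as a sketch. It assumes the site lives at www.example.com and that only the root index.php should redirect.

```apache
# THE_REQUEST holds the original request line, e.g. "GET /index.php HTTP/1.1".
# It is not changed by internal rewrites or by DirectoryIndex, so the
# redirect cannot loop. Requiring a space straight after "index.php"
# means requests such as /index.php?blah do not match.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.php\ HTTP/
RewriteRule ^index\.php$ http://www.example.com/ [R=301,L]
```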