Moderators: Robert Charlton & aakk9999 & brotherhood of lan & goodroi

Google SEO News and Discussion Forum

Spam words in a query string - do these backlinks hurt rankings?
anand84




msg:4129184
 6:55 am on May 8, 2010 (gmt 0)

< This thread was split from another location [webmasterworld.com] >

to put it in a different perspective, if your URL can be dynamically changed to another URL which does not reflect the actual keyword you programmed for your URL and still resolves as a 200 header found, you've got a problem.

Informative post Dusky. Let me understand what you have posted here. If there is an additional keyword injected in the URL, you mean it should take you to a 404 error instead of displaying a page?

I have a Wordpress blog where I tested example.com/keyword-here/?q=spam-keyword and see that while the URL in the address remains the same (with the ?q= part included), the webpage has however resolved to example.com/keyword-here/

Do you mean this is a problem?

[edited by: tedster at 2:56 pm (utc) on May 8, 2010]

 

tessmac




msg:4129195
 7:40 am on May 8, 2010 (gmt 0)

This is my first post, so hello to everyone, glad to be here.

Anand84 from what I understand example.com/keyword-here/?q=spam-keyword should be
example.com/keyword-here?q=spam-keyword ( without the / after the first keyword-here )

Again if I understand Dusky correctly he is saying that if using that url the page resolves and doesn't throw up a 404 page not found error, then potentially you have duplicate content issues.

If that is wrong please correct me.

I have had a similar issue with hundreds of links being posted that resolve to www.example.com?keyword-keyword-keyword amongst others. When doing a search on G, this page then came up (very low down) for a legitimate keyword that we use.

The page that resolves is the home page, but I am presuming here that G will see this as duplicate content.

[edited by: tedster at 2:43 pm (utc) on May 8, 2010]
[edit reason] switch to example.com (just one spot) [/edit]

anand84




msg:4129210
 9:33 am on May 8, 2010 (gmt 0)

Aah...I see Dusky's point now. Thanks a lot Tessmac and welcome to WebmasterWorld :-)

I checked site:mysite.com and did notice a few phantom URLs with '?' suffixes indexed too.

lhw455




msg:4129218
 9:49 am on May 8, 2010 (gmt 0)

Dusky, the problem you described seems to fit exactly what's been happening to my blogs. Is there a fix we can use (I'm using WordPress) which will allow the legitimate keywords (i.e. ones which we have linked to on our site) to resolve, but stop URLs which we have not linked to from resolving? I'm using a similar query, i.e. www.example.com/?s=keyword

lhw455




msg:4129232
 10:40 am on May 8, 2010 (gmt 0)

Actually, now that I think about it, it's a silly question, sorry! I've had an idea: what I'm going to do is store all the long-tail keywords in a file or the DB and check if the term is one of them; if not, redirect to 404.

Thanks
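
For a short list of terms, the same whitelist idea can be sketched directly in .htaccess without a file or DB lookup. This is only an illustration under assumptions: the three terms are placeholders, the block is assumed to sit in the document root .htaccess ahead of the WordPress rewrite rules, and the mod_rewrite version is assumed to accept a non-redirect status code in the R flag. It returns the 404 directly rather than redirecting to an error page.

```apache
RewriteEngine On
# External request for the root URL with an s= search parameter;
# capture everything after "s=" up to the end of the query string
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /\?s=([^\ ]*)\ HTTP/ [NC]
# The captured value is NOT one of the approved terms (placeholders)
RewriteCond %1 !^(blue-widgets|red-widgets|green-widgets)$ [NC]
# With a non-3xx code in R, the substitution is dropped and the
# status is returned directly for the requested URL
RewriteRule ^$ - [R=404,L]
```

With more than a handful of terms this gets unwieldy. A RewriteMap lookup would be the usual next step, but RewriteMap can only be declared in the server or virtual host configuration, not in .htaccess, so a check in the CMS code itself may be the more practical route on shared hosting.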

BillyS




msg:4129242
 12:13 pm on May 8, 2010 (gmt 0)

Again if I understand Dusky correctly he is saying that if using that url the page resolves and doesn't throw up a 404 page not found error, then potentially you have duplicate content issues.

Back in late 2005, early 2006 duplicate content of this type was a big problem. Many of us took steps back then to eliminate these problems at that time.

If you're using a CMS like Joomla or Wordpress, they've solved this issue by now either in core or through an extension / module.

Google Webmaster Central has some useful information on creating custom 404 pages.

tessmac




msg:4129265
 1:21 pm on May 8, 2010 (gmt 0)

BillyS...I am using joomla, a version of 1.0, not 1.5, and we are getting these urls indexed through spam links, which I am sure must be causing a problem as they seem to be deliberately being placed.

Do you know how we may be able to fix this?

dusky




msg:4129297
 2:10 pm on May 8, 2010 (gmt 0)

The main thing to worry about is duplicate content that originates on your own site, such as posting a link on your own blog, forum or news page as yoursite.com/different-title-1234.html when the original link pulled from your database is yoursite.com/original-title-1234.html. This will be seen as the webmaster referencing one page with two or more different URLs, i.e. intentional content duplication, which falls under the assumption of "trying to manipulate or inflate SERPs and rank". This may attract a penalty or a heavy filter. If, on the other hand, a spammer / competitor posts yoursite.com/different-title-1234.html on external sites to draw attention to duplicate content and get you penalized, that is out of your control. Initially G* and other search engines will index that URL because it is a backlink leading to a page which resolves with a 200 header, and that may cause temporary rank drops for pages until G* and other SEs apply the correct filter and rank measures.

Furthermore, when one page can be accessed by many URLs, its PR is split and hence weakened. There are other problems besides the ones mentioned above: spiders stuck in endless loops on the same URL with 1000s of keywords, each and every one of them seen as a different page (if some idiot posted them all on thousands of blogs as comment links), and unnecessary Gbot overhead spent on indexing bogus URLs that all lead to the same page. In short, if you caused it deliberately on the site, it may bring penalties or heavy filters; if it is outside your control, there should not be any worry, as G* is known to ignore phantom URLs and index only one URL for the one page. Still, one needs to make sure all wrong URLs typed in the address bar or posted as backlinks are 301 redirected to the correct URL.

As said above, to prevent this, for Apache users (other webservers may have similar procedures) is not too difficult.

It differs from one CMS to another, but for Zikula, Joomla, PHPNuke and similar built CMSes where you have the original URL before the mod-rewrite applied as yoursite.com/index.php&name=blabla&whatever=blabla or yoursite.com/modules.php&name=blabla&filer=blabla
This should work if you place it at the bottom of your .htaccess file, but make sure it is working; you get a 500 error if it's not.

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /\?(.*)\ HTTP/ [NC]
RewriteRule ^/?$ /404\.shtml? [R=301,L]

RewriteCond %{THE_REQUEST} ^[A-Za-z]{3,9}\ /(.*)\.html\?(.*)\ HTTP/ [NC]
RewriteRule ^index\.php$ /%1\.html? [R=301,L]

See the bolded index.php, that's when your urls are as yoursite.com/index.php&name=blabla&whatever=blabla
If they are as yoursite.com/modules.php&name=blabla&whatever=blabla change it to modules.php.

The first two lines say redirect to a 404 custom error page 404.shtml (if no custom page exists, it still gives a 404 error anyway) if an idiot is asking for yoursite.com/?write-whatever-here, which does not exist but resolves to a 200 header found. That is another problem which I did not speak about, and you should test first whether you have it. The second pair says redirect with a 301 to the correct page anyone adding .....html?q=something or ?anything-here to the END of the URL. Note the .html; if your pages are .htm, change it accordingly.

Please note, the above may break your site if not applied properly and make sure to use a link checker such as X*nu or similar after you have applied it. Also, the above should be different for different CMSes, drupal 6+ seems to be immune from most of those problems.

[edited by: tedster at 3:04 pm (utc) on May 8, 2010]

dusky




msg:4129311
 2:37 pm on May 8, 2010 (gmt 0)

tessmac
Anand84 from what I understand example.com/keyword-here/?q=spam-keyword should be
example.com/keyword-here?q=spam-keyword ( without the / after the first keyword-here )


Not quite, example.com/keyword-here/?q=spam-keyword should be example.com/keyword-here. The problem is when someone pastes example.com/keyword-here?spam-here and they get the page example.com/keyword-here BUT with the url in the address bar still as example.com/keyword-here?spam-here

OR your original page as example.com/product-title-1234.html and you can access it also by example.com/product-title-1234.html?spam-words, you've got a problem, basically ? question mark enables any parameters added and still resolves to a correct page. Millions of sites have this problem, add ?whatever to the end of thread pages on many large forums and you'll see! As long as the injection (the different URL) is not posted on the site itself, G* knows how to deal with it I guess, it's out of your control.

tedster




msg:4129322
 3:04 pm on May 8, 2010 (gmt 0)

Also, if you are seeing bogus backlinks such as example.com/keyword-here?q=spam-keyword a question that should come up is "Why would someone do that?"

Especially if you are using a common CMS like Wordpress, Joomla, PHPNuke etc (but really, in any case) there is a chance that you've been hacked. In other words, your page may be hosting parasite links that are cloaked so only googlebot sees them. The hacker/spammer would be creating those backlinks trying to build some ranking power for your page - so that THEIR parasite links gain ranking power.

You can use the "fetch as googlebot" utility in WebmasterTools to quickly check if this is the case.

tessmac




msg:4129335
 3:45 pm on May 8, 2010 (gmt 0)

Dusky..thanks

Not quite, example.com/keyword-here/?q=spam-keyword should be example.com/keyword-here. The problem is when someone pastes example.com/keyword-here?spam-here and they get the page example.com/keyword-here BUT with the url in the address bar still as example.com/keyword-here?spam-here


This is the actual problem.

The url that is posted is www.example.com?word-word-word ( where all the 3 words are the same ) The words that are posted in the url are actually words that our site is themed on, but they have been repeated 3 times.

Other examples are similar, again with the same words repeated 2 or 3 times, but one is in capitals.

We have seen these links in our G WMT and account for about 150 backlinks all posted on various dubious forums. I have also seen some of our competitors listed with similar query strings on these same forums.

The url is in the address bar.

So we need to find a way to redirect this or it will cause problems?

tessmac




msg:4129337
 3:48 pm on May 8, 2010 (gmt 0)

Tedster

Also, if you are seeing bogus backlinks such as example.com/keyword-here?q=spam-keyword a question that should come up is "Why would someone do that?"

Especially if you are using a common CMS like Wordpress, Joomla, PHPNuke etc (but really, in any case) there is a chance that you've been hacked. In other words, your page may be hosting parasite links that are cloaked so only googlebot sees them. The hacker/spammer would be creating those backlinks trying to build some ranking power for your page - so that THEIR parasite links gain ranking power.

You can use the "fetch as googlebot" utility in WebmasterTools to quickly check if this is the case.


Thanks..I have just checked and this doesn't seem to be the case. But as you say there must be a reason.

Would it be to trigger some type of penalty with G?

tedster




msg:4129346
 4:10 pm on May 8, 2010 (gmt 0)

That is another possibility.

One of the focuses that some black hat practitioners take is lowering the websites that rank above theirs. Some people call this "disruptive webmastering" and it can get pretty dark. So if your site has this query string vulnerability, it may attract this kind of attack.

g1smd




msg:4129434
 8:57 pm on May 8, 2010 (gmt 0)

RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /\?(.*)\ HTTP/ [NC]
RewriteRule ^/?$ /404\.shtml? [R=301,L]

RewriteCond %{THE_REQUEST} ^[A-Za-z]{3,9}\ /(.*)\.html\?(.*)\ HTTP/ [NC]
RewriteRule ^index\.php$ /%1\.html? [R=301,L]

See the bolded index.php, that's when your urls are as example.com/index.php&name=blabla&whatever=blabla
If they are as example.com/modules.php&name=blabla&whatever=blabla change it to modules.php.

The first two lines say redirect to a 404 custom error page 404.shtml


Do not use the above code. It is quite dangerous and is absolutely the wrong way to handle this.

NEVER issue a redirect to an error page. A redirect means the request results in a 302 or 301 status being sent. For URLs that should not exist, the server must directly return a 404 status for the originally requested URL. There should be NO redirect.

If you want to send 404 status for all external URL requests with appended query strings use something like:

#Send 404 for any URL with appended query string
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^\?]+)?\?([^\ ]+)\ HTTP/ [NC]
RewriteRule .* /path-does-not-exist [L]


OR

# Send 404 for root URL with appended query string
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /\?([^\ ]+)\ HTTP/ [NC]
RewriteRule !. /path-does-not-exist [L]



The use of multiple (.*) patterns in the second rule in the "quote" above, will cause the pattern matching to attempt hundreds of "back off and retry" attempts for each URL request arriving at your server. This will run hundreds of times quicker:

# Redirect .html URL requests with appended query strings to
# www version of domain and strip the query string value.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*([^.]+)\.html\?([^\ ]+)\ HTTP/ [NC]
RewriteRule ^(([^/]+/)*)([^.]+)\.html$ http://www.example.com/$1$2.html? [R=301,L]


All other syntax changes are also intentional.

The original rule could never work. It required the filename in the RewriteRule pattern to match index.php but needed the filename in the RewriteCond pattern to match index.html and it could never match both requirements at the same time.

The code here is still a bit dangerous as URLs published to the web should really omit the "index.php" or "index.html" filename entirely. Even now, the code probably needs a little bit more customisation.

dusky




msg:4129466
 9:55 pm on May 8, 2010 (gmt 0)

g1smd, I get you, but the 404.shtml page is a custom 404 error page and returns a 404 error not found header if done right and checked with a header fetcher. However, I take your point: /path-does-not-exist may be wiser in this case, as a raw true error page is best. The index.php thing, yes I explained that and said if you have index.php as the URL constructor. What I posted above works OK with postnuke / Zikula and other similar CMSes because they have index.php or modules.php always as the start of any URL fetched from the database.

Using the wildcard approach may not be the best solution, but I had to do something, thanks for your input, I'll give your solution a try myself.

This is what I like about these forums, with constructive arguments like this, we achieve results, better results!
Anyone else with more hacker busting fixes related to this?

tedster




msg:4129476
 10:10 pm on May 8, 2010 (gmt 0)

Just a word of caution. Some people are tempted to preserve the link juice coming from such spam backlinks, and that's why they use a 301 to the same URL without the query string.

I don't think that's really a good idea - do you really WANT that kind of link juice, even if it seems to help in the short term? That's why the 404 is a much better choice. However, in a pinch, the canonical link tag on every page can also be a guard against this kind of query string duplicate URL problem.

In fact, I wonder if the canonical link might even help a bit more with Google than a 404 response. (Not tested, and don't want to test it, either.)

g1smd




msg:4129503
 10:35 pm on May 8, 2010 (gmt 0)

I get you, but the 404.shtml page is a custom 404 error page and returns a 404 error not found header if done right

Yes,
example.com/404.shtml may very well return a 404 status code, but the very important point here is that example.com/filename.html?duff-url-request does not return a 404 response. It returns a 301 redirect to a different URL. The 301 redirect is not a 404 status. The browser isn't told that the URL does not exist. Instead it is told to make a new request for a different URL. When it does so, it is then told that this new URL does not exist.

That is not the same thing and this can be a very dangerous situation, and is one that must be avoided.


The index.php thing, yes I explained that and said if you have index.php as the URL constructor. What I posted above works OK

As posted, the code cannot possibly work. It will only match if both "index.php" AND "index.html" are in the requested URL at the same time.

[edited by: g1smd at 10:42 pm (utc) on May 8, 2010]

dusky




msg:4129507
 10:41 pm on May 8, 2010 (gmt 0)

tedster,
Just a word of caution. Some people are tempted to preserve the link juice coming from such spam backlinks, and that's why they use a 301 to the same URL without the query string.


I implemented the solution having what you said in mind, and yes many spam links are redirected to the correct page as some of them already give healthy PR, I say spam links, sometimes they were only spelling mistakes of postings of a link to an interesting article or forum post / blog etc which would give a 200 correct header even though the URL is different but the page is correct, so redirecting to the correct page instead of 404 in that case is better and I have that as part of my solution. g1smd's note about not using wildcard is a good one though.

g1smd,
The original rule could never work. It required the filename in the RewriteRule pattern to match index.php but needed the filename in the RewriteCond pattern to match index.html and it could never match both requirements at the same time.

It'll work only if someone is using the corresponding CMS and already implementing / using the short URLs method. I have sites that use the old Postnuke .764 and Zikula, and the fix (my fix) works on both. For other CMSes, a slight tweak should be undertaken. Incidentally, your tweak above as it is (the second one) won't work on CMSes that need index.php or modules.php as the starting URL before any short URLs implementation; it needs a tweak, I guess, to work on any. I wonder if you or someone else can have a go and make it a cross-CMS fix. Note that /path-does-not-exist would redirect to the custom 404 error page if you have one anyway!

So the final tweaked (when using postnuke, Zikula type URLs and short URLs) fix with your added advice about not using wildcards is:

# Send 404 for root URL with appended query string
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /\?([^\ ]+)\ HTTP/ [NC]
RewriteRule !. /path-does-not-exist [L]

# Redirect .html URL requests with appended query strings to
# the original URL stripping what's added as spam / mistakes.
RewriteCond %{THE_REQUEST} ^[A-Za-z]{3,9}\ /([^\ ]+)\.html\?([^\ ]+)\ HTTP/ [NC]
RewriteRule ^index\.php$ /%1\.html? [R=301,L]

TheMadScientist




msg:4129508
 10:43 pm on May 8, 2010 (gmt 0)

Sometimes I feel like tedster's alter-ego... LOL :)

I strip all query_strings, not in an effort to retain any inbound link weight, but rather to make sure visitors who happen to click on a link somehow get the information they were looking for and also for security purposes. If I did not do it this way, I would probably personally go with the canonical link relationship on the pages with the query string. I would probably NOT serve a 404 (or a 410), because I try my best to get visitors to what they were looking for and don't like the idea of even one visitor thinking MY site is broken or not working as it should because SOMEONE ELSE did something...

IMO it's fairly easy to create a bad experience for some visitors in the name of search engine rankings, but for me even one unnecessary bad experience for a visitor is something I try to avoid, so sometimes I throw caution to the wind and take people to the right page... It's also much easier for them to get back there again if they don't have some goofy query_string on the URL they found containing the information, so I live on the edge and don't allow query_strings on any of my current sites.

g1smd




msg:4129512
 10:47 pm on May 8, 2010 (gmt 0)

Note that
[A-Z] when used with the [NC] flag is processed twice as fast as using the [A-Za-z] pattern.

Note that /path-does-not-exist would redirect to the custom 404 error page if you have one anyway!

No. There is no redirect here. If you rewrite an incoming external URL request to fetch content from an internal server filepath that does not exist, Apache's ErrorDocument handler is directly invoked for that request. This is not a redirect.
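
As a small illustration of that distinction (a sketch only, reusing the /404.shtml name from earlier in the thread and assuming it lives in the document root), the form of the ErrorDocument directive decides whether the visitor gets a true 404 or a redirect:

```apache
# Local path: Apache serves this page as the body of the 404 response
# for the URL that was originally requested. No redirect is involved.
ErrorDocument 404 /404.shtml

# Avoid the full-URL form shown below. With a remote URL, Apache sends
# the client a redirect to that URL instead of a 404 status for the
# URL originally requested.
# ErrorDocument 404 http://www.example.com/404.shtml
```

A server-headers checker run against a made-up URL should show a single 404 response with no Location header.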

tedster




msg:4129516
 10:52 pm on May 8, 2010 (gmt 0)

The 301 redirect is not a 404 status. The browser isn't told that the URL does not exist.

g1smd, I've seen this kind of custom 404 page handling on many websites, and it has not caused them trouble. As long as the redirect target page returns a 404 status, then this approach has worked well in my experience. It comes up more on IIS servers, rather than Apache.

Problems have come because Microsoft certification used to instruct using 302 to a 200 for custom error pages. But with a 301 or even a 302 redirect, Google does "get it" that the original URL is bogus.

Even more - and especially when the site is hosted on an IIS server - googlebot often tests for "soft 404" handling where the final page is a 200 OK status. Google both compensates for the poor webmastering and even creates a warning in Webmaster Tools.

The important thing, in my opinion, is NOT to serve the content of /keyword-title/ when /keyword-title/?q=spam-keyword is requested.

[edited by: tedster at 11:03 pm (utc) on May 8, 2010]

tedster




msg:4129522
 11:00 pm on May 8, 2010 (gmt 0)

There's an added challenge when someone's infrastructure (especially their tracking and analytics) intentionally uses query strings. Yes, I know that it's not a good idea, but WebTrends has done this for a long time and that means some enterprise servers are stuck with it until they can do a massive redevelopment project.

In those cases, the site needs to strip the intended query strings and redirect them to the correct base URL -- and also deal with the spam attack. The easiest way out is often a 301 redirect to the base URL with a 200 OK, and then just accept the spammy link juice if Google actually sends it through. Even better is stripping only the ?WT=[campaign tracking] query string to resolve with a 200 and sending the other spammy query string versions to a 404.
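
A rough sketch of the "even better" option might look like the following. The assumptions are loud ones: the tracking parameter is taken to be a single WT=value pair as written above (real tracking parameter names will differ per installation), the rules sit in the document root .htaccess, and the mod_rewrite version accepts a non-redirect status code in the R flag. Here the WT= request is simply let through untouched to resolve with a 200; every other externally requested query string gets a direct 404.

```apache
RewriteEngine On
# The external request carries a query string...
RewriteCond %{THE_REQUEST} \?
# ...and it is anything other than a single WT= tracking parameter
# (placeholder pattern, adjust to the real parameter names)
RewriteCond %{THE_REQUEST} !\?WT=[^&\ ]+\ HTTP/ [NC]
# Return 404 directly: no redirect, and no page content served
RewriteRule ^ - [R=404,L]
```

Testing THE_REQUEST rather than QUERY_STRING keeps the rule from firing on query strings that the site's own internal rewrites add later.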

TheMadScientist




msg:4129526
 11:07 pm on May 8, 2010 (gmt 0)

The easiest way out is often a 301 redirect to the base URL with a 200 OK, and then just accept the spammy link juice if Google actually sends it through.

Yeah, I've actually been stripping query_strings for years and have not had any issues I've noticed from doing it. With the way G tests everything, sometimes by adding random query_strings themselves, I would guess they 'get' the canonicalization technique in place and just discount the link weight to nothing, but it's not something I've tested other than in practice and not noticing any ill effects from having done it.

dusky




msg:4129530
 11:10 pm on May 8, 2010 (gmt 0)

g1smd, yes /path-does-not-exist [L] is a direct response without a redirect, nonetheless will produce a 404 error header, whether that is custom error or the raw webserver 404 response.
As to the A-Za-z instead of A-Z, I was intending to catch both lower and higher case spam / wrong keywords or parameters, but I guess I am not an expert on the subject and the [NC] is enough to cover both and only serve/ catch one instance.

The important thing, in my opinion, is NOT to serve the content of /keyword-title/ when /keyword-title/?q=spam-keyword is requested.


Exactly, as in most cases it's a hack attack / injection attempt or someone trying to frame you with SEs. It is different from /wrong-keyword-title/ posted as a backlink (or accessed as such) instead of /keyword-title/ which one should redirect it to the correct page /keyword-title/ instead of serving a 404 error response.

TheMadScientist




msg:4129541
 11:27 pm on May 8, 2010 (gmt 0)

hack attack / injection attempt

Exactly why I don't even allow them on my sites...

Besides, how much easier does it get?
RewriteCond %{THE_REQUEST} \?
RewriteRule .? http://example.com%{REQUEST_URI}? [R=301,L]

You can simply exclude the stripping from directories where they are necessary with:
RewriteCond %{REQUEST_URI} !^/the-path/to-the-directory
RewriteCond %{THE_REQUEST} \?
RewriteRule .? http://example.com%{REQUEST_URI}? [R=301,L]

BTW: Using either of the preceding you can rewrite to URLs containing a query_string, they just cannot be requested directly by a visitor. So, IOW it doesn't mess with dynamic sites using SE Friendly URLs.

g1smd




msg:4129543
 11:34 pm on May 8, 2010 (gmt 0)

you can rewrite to URLs containing a query_string

It is impossible to rewrite to a URL.

A rewrite targets an internal filepath inside the server, not a URL. The server's file system has no concept of "URLs". It works only with filepaths.

URLs and filepaths are not at all the same thing. They are merely associated by the actions of a server.

The target of a redirect is a URL. It causes the browser to start a new HTTP transaction requesting a different URL.

TheMadScientist




msg:4129546
 11:41 pm on May 8, 2010 (gmt 0)

I think my post was understandable and non-technical enough for the general public who might be reading this thread to be able to get what I meant since if they know there is a difference they could probably write the code themselves and if not then they will know it will work without a degree or any research. Sorry if I confused you or something.

URLs and filepaths are not at all the same thing.

And, you did forget to point out a query_string is not part of the file path, but rather information passed to the location... Query_Strings and file paths are not at all the same thing either.

blend27




msg:4129580
 1:37 am on May 9, 2010 (gmt 0)

A while back a friend of mine came under the attack from some(lets say not white-hat seo company) cucarachas. Does not have to be a SPAMY keyword. Page=1 will......

Then some one in supporters forum said "KEEP YOUR SHIP TIGHT". Ever since then, this is a big can of worms.

SAME as TMS(TheMadScientist) no QS allowed on my sites, at least to the visitors - BOTS GET 404 on incoming..

jdMorgan




msg:4129720
 2:29 pm on May 9, 2010 (gmt 0)

That was likely me with the advice about keeping good pitch between your planks...

While search engines' response-handling may indeed be sophisticated and flexible, there is a "higher authority," and that is the HTTP/1.1 protocol documentation.

When working on error-handling (or even out of curiosity), just fire up a server-headers checker and test your server's responses under various conditions. Requests for resources linked within your own site should directly return a 200-OK and content, a 304-Not-Modified, or a 206-Partial Content response. Requests for bogus URLs should return 404-Not Found. Requests for obsolete URLs should return 410-Gone. Requests from malicious clients should get a 403-Forbidden. All of these should occur directly, with no intervening 301/302/303 redirect responses.
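
As a sketch of how two of those direct responses can be produced with mod_rewrite (the path and the user-agent string are placeholders, not taken from any real site):

```apache
RewriteEngine On
# Obsolete URLs: answer 410 Gone directly
RewriteRule ^old-section/ - [G]

# Known-bad client: answer 403 Forbidden directly
RewriteCond %{HTTP_USER_AGENT} ^BadBot [NC]
RewriteRule ^ - [F]
```

Both the G and F flags end rule processing for the request and send the status for the URL as requested, with no intervening redirect.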

Requests for salvageable URLs due to minor link typos, type-in errors, or trailing punctuation added by mis-coded forum/blog auto-linking can be 301-redirected to a corrected URL. But maliciously-malformed URLs should be rejected with a 404.
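
For the trailing-punctuation case, a minimal sketch could be the rule below. It assumes pages end in .html, that the rule sits in the document root .htaccess, and that www.example.com is the canonical hostname; the set of punctuation characters is a guess at what auto-linkers typically append.

```apache
RewriteEngine On
# Strip punctuation appended after .html by faulty auto-linking,
# e.g. /page.html). or /page.html, and redirect to the clean URL
RewriteRule ^(.+\.html)[.,;:!)]+$ http://www.example.com/$1 [R=301,L]
```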

The method you choose to handle the 'grey area' between these depends on how sophisticated your error-handling can be: If you can use a database to look up 'bad requested URLs' and make a reasonable decision on whether they can and should be salvaged based on the actual words in the query string, then do so.

Otherwise you may want to assess just how "malicious" the bad URL requests you typically receive seem to be and how many bad requests you seem to be getting, and use that to decide whether to 301-redirect them to the correct URLs or to just 404 them on the spot.

On the one hand, if your site is being bombarded with hundreds of ppp-keyword-laden-URL requests per hour due to malicious link-building, then a 404 is indicated. On the other, if you get only a dozen of those a month, but several hundred typo-induced incorrect-URL requests, you may want to just 301 them to the corrected URL. The 404 method is "safer" with respect to fending off link-exploits, but the 301 method is much 'friendlier' to typo-prone visitors and those coming from 'imperfect' links posted on other sites.

No two Web sites are the same, and really, only you can decide what's appropriate for your own. But the HTTP protocol signaling should be done directly, unambiguously, and consistently by returning correct server response codes.

Jim

jdMorgan




msg:4129733
 3:16 pm on May 9, 2010 (gmt 0)

I noted some disagreement on this code snippet above:
RewriteCond %{THE_REQUEST} ^[A-Za-z]{3,9}\ /(.*)\.html\?(.*)\ HTTP/ [NC]
RewriteRule ^index\.php$ /%1\.html? [R=301,L]

This code can only work in the case where the requested URL has been previously rewritten --in the context of handling this current HTTP request-- from /index.html to /index.php.

THE_REQUEST reflects the original client request, while the RewriteRule examines the current URL-path, which may have been previously rewritten.

So if this code works, it implies that a prior rule in the current .htaccess file is missing an [L] flag, or that a prior internal rewrite has been invoked in a higher-level .htaccess or config file. This previous internal rewrite could have been done by mod_rewrite, mod_negotiation, mod_speling, or AcceptPathInfo, with the possible subsequent involvement of mod_dir.

The fact that it should not work but does indicates an additional problem.

In addition, best practices suggest that a full URL --protocol, domain, and URL-path-- should be used in any rule intended to invoke an external redirect.

Jim


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved