Forum Library, Charter, Moderators: Ocean10000 & incrediBILL & phranque

Apache Web Server Forum

    
duplicate content in GWT again!
helenp




msg:4632856
 12:20 pm on Dec 22, 2013 (gmt 0)

Hi, not sure if this should go here or in the Apache forum.
A year ago I had a lot of duplicate content.

I have made some changes to the site with JavaScript, and Google is now indexing duplicate pages created by the script and by the search form.

I have these two kinds of URL in GWT listed under duplicate titles:
/espanol/ventas/?document.body.scrollTop:document.documentElement.scrollTop
This looks to come from a jQuery slider I added to the sales pages. The page exists if I type it in the address bar, but it shouldn't.

/sales/?z=any&t=anytype&d=1&day=08&month=2014-06&day2=01&month2=2013-12&e=Search
This one I don't understand at all.
It should not exist, because when a search is made a separate search page is used.
However, I suppose it is because I have combined two forms into one that toggles between both.
The parameters z, t, d, day, month, day2, month2 and e are all set in GWT to "representative URL", so Google should not index URLs with those parameters anyway.

This is my .htaccess:

AddType application/x-httpd-php5 .htm .html
RewriteEngine On
RewriteRule ^(guestbook_[0-9]+\.htm) http://www.example.com/reviews/$1 [R=301,L]
RewriteRule ^espanol/(libro_de_visitas_[0-9]+\.htm) http://www.example.com/espanol/opiniones/$1 [R=301,L]
RewriteRule ^svenska/(gastbok_[0-9]+\.htm) http://www.example.com/svenska/kommentarer/$1 [R=301,L]
RewriteRule ^(z3originalguestbook_[0-9]+\.htm) http://www.example.com/reviews/$1 [R=301,L]
RewriteRule ^espanol/(z3original_[0-9]+\.htm) http://www.example.com/espanol/opiniones/$1 [R=301,L]
RewriteRule ^svenska/(z3originalgastbok_[0-9]+\.htm) http://www.example.com/svenska/kommentarer/$1 [R=301,L]
# REDIRECT htm INDEX PAGES to index/
RewriteCond %{SERVER_PORT} !^443$
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html?\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ http://www.example.com/$1 [R=301,L]
RewriteCond %{SERVER_PORT} ^443$
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /([^/]+/)*index\.html?\ HTTP/
RewriteRule ^(([^/]+/)*)index\.html?$ https://www.example.com/$1 [R=301,L]
# Get rid of extra path info such as example.com/pagina1.htm/maps/ etc
RewriteRule ^((?:[^./]+/)*[^./]+\.(?:html?|php))/ http://www.example.com/$1 [R=301,L]
# Redirect non-canonical to www
RewriteCond %{SERVER_PORT} !^443$
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) http://www.example.com/$1 [R=301,L]
RewriteCond %{SERVER_PORT} ^443$
RewriteCond %{HTTP_HOST} !^(www\.example\.com)?$
RewriteRule (.*) https://www.example.com/$1 [R=301,L]
AddType 'text/css; charset=UTF-8' css
<Files ~ "\.(log)$">
order allow,deny
deny from all
</Files>
<FilesMatch "\.(pl|txt|htm|html|[sf]?cgi|spl)$">
Header set Cache-Control: "max-age=7200"
<filesMatch "\.(htm|html|css|js)$">

What more should I do?

[edited by: aakk9999 at 12:48 pm (utc) on Dec 22, 2013]
[edit reason] Exemplified [/edit]

 

aakk9999




msg:4632862
 1:01 pm on Dec 22, 2013 (gmt 0)

How many of such URLs are you getting?

I think you have two issues:

1) find out where/how these URLs are created and make sure you fix the issue
2) For such URLs that have been indexed, either redirect them back to canonical URL or return 404/410, which is what should be done in your .htaccess

Currently there is no rule in your .htaccess that would be dealing with these URLs

helenp




msg:4632864
 1:23 pm on Dec 22, 2013 (gmt 0)

Thanks aakk,
at the moment it looks like the only pages with the bug are the sales index pages in the different languages.
I had a look at my 404 errors, and I have several links like this one that do return a 404:
/sales/properties_for_sale_marbella.htm?z=any&t=anytype&d=1&day=08&month=2014-06&day2=01&month2=2013-12&e=Search
the path is identical.


If I paste this onto the end of any page URL:
"?z=any&t=anytype&d=1&day=08&month=2014-06&day2=01&month2=2013-12&e=Search"
the page exists and does not return a 404,
but if I add the same string to a page that should have a parameter in the URL, such as the one above, it gives a 404.

Testing with a page that uses parameters,
this page exists but shouldn't:
/sales/properties_for_sale_marbella.htm?id=112?z=any&t=anytype&d=1&day=08&month=2014-06&day2=01&month2=2013-12&e=Searc

however this does not exist:
/sales/properties_for_sale_marbella.htm?/sales/properties_for_sale_marbella.htm?id=112?z=any&t=anytype&d=1&day=08&month=2014-06&day2=01&month2=2013-12&e=Searc

helenp




msg:4632865
 1:25 pm on Dec 22, 2013 (gmt 0)

How many of such URLs are you getting?

I think you have two issues:

1) find out where/how these URLs are created and make sure you fix the issue
2) For such URLs that have been indexed, either redirect them back to canonical URL or return 404/410, which is what should be done in your .htaccess

Currently there is no rule in your .htaccess that would be dealing with these URLs


I get many, many if I add it to the URLs.
Where should I start with those fixes?

helenp




msg:4632900
 5:54 pm on Dec 22, 2013 (gmt 0)

As far as I can see if I search inside my site,
there is only one page indexed like this:
?document.body.scrollTop:document.documentElement.scrollTop

And there are 2 pages indexed like this, but with different dates:
z=any&t=anytype&d=1&day=08&month=2014-06&day2=01&month2=2013-12&e=Search

Both pages are in the same section, and in that section there are also several 404 errors because the URL already has an id; the 404 errors are dated 4 and 5 December.
I think the links with parameters may have been a bug on my site that was later fixed while I was working on the site, no idea,
so I suppose I will have to redirect those 2 pages and wait.

The page indexed with this:
document.body.scrollTop:document.documentElement.scrollTop
must come from the JavaScript in the jQuery slider. I am not
sure whether it was a one-off error on Google's side, or whether more pages may be indexed.

I know the dates of the 404 errors, as I can see those in GWT.
I wonder, is there a way to see the date the indexed pages were indexed? I can't see any way in GWT.

helenp




msg:4632916
 9:43 pm on Dec 22, 2013 (gmt 0)

Puf, I have been trying to return a 404 but can't manage it, I suppose because of the parameters. So then I tried a 301 to the canonical page, but that does not work either...
What should I do with the parameters?

This is my latest attempt:
RewriteCond %{QUERY_STRING} z=any&t=anytype&d=1&day=08&month=2014-06&day2=01&month2=2013-12&e=Search
RewriteRule ^sales/index\.htm$ /sales/? [L,R=301]

lucy24




msg:4632933
 11:39 pm on Dec 22, 2013 (gmt 0)

<off topic>
<Files ~ "\.(log)$">
</off>
I thought Regular Expressions only worked in FilesMatch :(

RewriteCond %{QUERY_STRING} z=any&t=anytype&d=1&day=08&month=2014-06&day2=01&month2=2013-12&e=Search
RewriteRule ^sales/index\.htm$ /sales/? [L,R=301]

That seems very tightly constrained. Is that the exact, literal text of the only query string you ever get? Even if you replaced each value with \d+ or 20\d\d-\d\d and so on, how many matches would you get?

One of the standard bits of advice is: First explain in English what you want to do. Then we hammer out a rule to make it work. So think about which exact parameters are causing trouble, and what you want to do with them.

Are you also trying to get rid of all the "document.body.etcetera" garbage? What happens if you try

RewriteCond %{THE_REQUEST} document\.body
RewriteRule {URL-that-gets-this-parameter} - [R=404,L]

?

Yes, really "R=404". Little-known Apache quirk that also works with mod_alias. You could replace the whole flag with [G] if you don't mind making the search engine think that this page used to exist but you've taken it away.

helenp




msg:4632987
 9:50 am on Dec 23, 2013 (gmt 0)

<off topic>
<Files ~ "\.(log)$">
</off>
I thought Regular Expressions only worked in FilesMatch :(

I remember some problem with that. I think it is because my rules are written for Apache but the server actually runs LiteSpeed, so sometimes there are things that do not work as they would on Apache.



One of the standard bits of advice is: First explain in English what you want to do.

Am doing my very best, will try to do better ;)


RewriteCond %{QUERY_STRING} z=any&t=anytype&d=1&day=08&month=2014-06&day2=01&month2=2013-12&e=Search
RewriteRule ^sales/index\.htm$ /sales/? [L,R=301]

That seems very tightly constrained. Is that the exact, literal text of the only query string you ever get? Even if you replaced each value with \d+ or 20\d\d-\d\d and so on, how many matches would you get?


No,
I have 2 query strings indexed as far as I can see.
There can be many different matches, but as an emergency fix I just tried the easiest.
These are the 2 query strings I have indexed:

/sales/properties_marbella_east.htm?z=any&t=anytype&d=1&day=08&month=2014-01&day2=01&month2=2013-12&e=Search

/sales/?z=any&t=anytype&d=1&day=08&month=2014-06&day2=01&month2=2013-12&e=Search

Not all query strings can be blocked, as there are queries that need to be indexed, such as:
?id=112 for example


Are you also trying to get rid of all the "document.body.etcetera" garbage? What happens if you try

RewriteCond %{THE_REQUEST} document\.body
RewriteRule {URL-that-gets-this-parameter} - [R=404,L]


That does not work.
In place of {URL-that-gets-this-parameter}
I have tried:
/espanol/ventas/
espanol/ventas/
/espanol/ventas/index.htm
espanol/ventas/index.htm
/espanol/ventas/?document.body.scrollTop:document.documentElement.scrollTop
espanol/ventas/?document.body.scrollTop:document.documentElement.scrollTop

I tried it after the RewriteRules and before the RewriteConds in my .htaccess above; I suppose that is the right place.

And I get the page... no error.

Thanks

lucy24




msg:4632992
 10:25 am on Dec 23, 2013 (gmt 0)

I have tried:

If /espanol/ is your first directory, no leading slash in htaccess. If it's a deeper directory, either way will work if you leave off the opening anchor.

index.htm, present or absent, depends on the actual form of your URL. But, come to think of it, I don't know whether mod_rewrite with the unusual flag R=404 kicks in before or after mod_dir has done its stuff. That's assuming index.htm is a real file in a real, physical directory. There are some situations where RewriteRules only work if you specify "index.htm" --even if this is not part of the visible request. (I have personally seen this happen. It's confusing.)

The versions with ? will definitely not work, ever, because a RewriteRule only looks at the path.

espanol/ventas/
with no anchors should cover all possible forms. (Does it make your skin crawl to spell it without the tilde? It does to me!)

Try putting it at the very beginning of all RewriteRules, right after RewriteEngine On. It isn't the ideal location for a 404, but it means no other rule has a chance to get involved.

All of this is assuming that the "document.body" blahblah really is reaching your server as a request. I assume you've seen it in logs, not just on google's say-so.
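
Put together, the advice above might look like the following sketch. It assumes the rules live in the .htaccess in the document root, that the stray requests really do carry the literal text "document.body" in the request line, and that the server honours non-3xx R= codes the way Apache's mod_rewrite does (LiteSpeed may behave differently).

```apache
RewriteEngine On

# Assumption: this pair sits directly after RewriteEngine On,
# ahead of every other rule in the file.
# THE_REQUEST is the raw request line, for example
#   GET /espanol/ventas/?document.body.scrollTop:... HTTP/1.1
# so the condition can see the query string; the RewriteRule pattern cannot.
RewriteCond %{THE_REQUEST} document\.body
# The real path goes where the placeholder was: no leading slash, no anchors,
# so it covers espanol/ventas/ and espanol/ventas/index.htm alike.
# "-" means "no substitution".
RewriteRule espanol/ventas/ - [R=404,L]
```

Swapping [R=404,L] for [G] would answer 410 Gone instead of 404.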

helenp




msg:4632996
 10:42 am on Dec 23, 2013 (gmt 0)

espanol/ventas/
with no anchors should cover all possible forms. (Does it make your skin crawl to spell it without the tilde? It does to me!)

Try putting it at the very beginning of all RewriteRules, right after RewriteEngine On. It isn't the ideal location for a 404, but it means no other rule has a chance to get involved.

Yes, it makes me feel bad without the tilde lol.

Nope, no luck
right after RewriteEngine On either.
I tried:
RewriteEngine on
RewriteCond %{THE_REQUEST} document\.body
RewriteRule {espanol/ventas/index.htm} - [R=404,L]

RewriteEngine on
RewriteCond %{THE_REQUEST} document\.body
RewriteRule {espanol/ventas/} - [R=404,L]

and no error is displayed.

JD_Toims




msg:4633001
 11:27 am on Dec 23, 2013 (gmt 0)

RewriteEngine on
RewriteCond %{QUERY_STRING} !^(id=[0-9]*)?$
RewriteRule .? http://www.example.com%{REQUEST_URI}?%1 [R=301,L]

Should be close to what you need.

g1smd




msg:4633005
 11:49 am on Dec 23, 2013 (gmt 0)

Do not test QUERY_STRING for the match; it can lead to an infinite loop. Test THE_REQUEST instead.

You do not need to match the requested URL and query string in full. Identify one part that is common to all duff requests, and which never appears in valid requests, and just test for that.
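
As a rough sketch of that advice: it assumes `e=Search` is the part common to the unwanted requests, that it never appears in a valid request under /sales/ (an assumption; if the real search page lives there it would have to be excluded), and that the rules sit in the root .htaccess.

```apache
RewriteEngine On

# THE_REQUEST holds the original request line sent by the client and is not
# altered by internal rewrites, so the clean URL produced by the redirect
# can never match this condition again.
RewriteCond %{THE_REQUEST} [?&]e=Search
# Matches /sales/ and /sales/anything.htm (or .html).
# The trailing "?" on the target drops the whole query string.
RewriteRule ^(sales/(?:[^/]+\.html?)?)$ http://www.example.com/$1? [R=301,L]
```

Note that this drops every parameter, including a legitimate id, and covers only the English /sales/ section; the other language sections would need equivalent rules.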

lucy24




msg:4633085
 5:12 pm on Dec 23, 2013 (gmt 0)

Oh, ###, Helen, you're not using literal { } braces are you? I just used those as my {replacement text}. Oops! Don't use them in the actual rule.

helenp




msg:4633120
 5:31 pm on Dec 23, 2013 (gmt 0)

Oh, ###, Helen, you're not using literal { } braces are you? I just used those as my {replacement text}. Oops! Don't use them in the actual rule.

Jajajaj, yes actually I did, but no matter,
it doesn't work without them either :(

I've been testing with JD_Toims's rule as well,
but that doesn't work either :(

lucy24




msg:4633271
 9:27 pm on Dec 23, 2013 (gmt 0)

RewriteCond %{QUERY_STRING} !^(id=[0-9]*)?$

Means: the query string is exactly "id=some-number" or "id=" (no value), or "" exactly nothing. Is that what your query string is supposed to be?

But wait. All your pages are really php, aren't they? I seem to remember you parse everything for php even if it's got an htm(l) extension. Might it be easier for a single php script to read the query string and issue a 301 redirect if appropriate? I assume you've already got something that issues a 404 if a parameter value is wrong.

helenp




msg:4633283
 9:57 pm on Dec 23, 2013 (gmt 0)

RewriteCond %{QUERY_STRING} !^(id=[0-9]*)?$

Means: the query string is exactly "id=some-number" or "id=" (no value), or "" exactly nothing. Is that what your query string is supposed to be?

What I understood was that it redirects if the query string is not id=anynumber.
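
For reference, an annotated reading of that rule. The sample query strings in the comments are taken from this thread; the remark about %1 follows how mod_rewrite documents negated patterns, and behaviour on LiteSpeed is not confirmed here.

```apache
# Pattern without the "!":  ^(id=[0-9]*)?$
#   matches ""         (no query string at all)
#   matches "id="      (empty value)
#   matches "id=112"
#   does not match "z=any&t=anytype&d=1&...&e=Search"
#
# The leading "!" inverts the test, so the rule below fires only for
# query strings that are NOT one of the allowed forms.
RewriteCond %{QUERY_STRING} !^(id=[0-9]*)?$
# A negated condition captures nothing, so %1 is empty here and the
# target ends in a bare "?", which strips the query string.
RewriteRule .? http://www.example.com%{REQUEST_URI}?%1 [R=301,L]
```

Once the query string has been stripped it is empty, which is one of the allowed forms, so the redirected request no longer satisfies the condition.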

But wait. All your pages are really php, aren't they? I seem to remember you parse everything for php even if it's got an htm(l) extension. Might it be easier for a single php script to read the query string and issue a 301 redirect if appropriate? I assume you've already got something that issues a 404 if a parameter value is wrong.


Yes, they are parsed, but I only have a 404 script on the pages that pull their content from the database,
like this:
else {
header("HTTP/1.0 404 Not Found");
include("../404.shtml");
exit;
}

I suppose that is possible to add,
never seen it, but why not.

I also have this on all pages since the duplicate issue; even though that is resolved, it is good to serve the correct version:
if ($_SERVER['SERVER_PORT'] == 443)
{
    $host = $_SERVER['HTTP_HOST'];
    $request_uri = $_SERVER['REQUEST_URI'];
    $good_url = "http://" . $host . $request_uri;

    header( "HTTP/1.1 301 Moved Permanently" );
    header( "Location: $good_url" );
    exit;
}

I can do something similar, except for the port check.
Good idea, thanks, I didn't remember the on-page approach.

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved