
Google SEO News and Discussion Forum

    
Google and php duplicate problem
serenoo




msg:4288484
 1:29 pm on Mar 28, 2011 (gmt 0)

Using Google Webmaster Tools I found a URL where Google says I have a 404. That is correct, since I did not create that page and it is not on my server, but if I type the URL into a browser a page appears. I then found that this is a problem for many websites that run PHP.

My page is www.example.com/keyword.php
but if you type
www.example.com/keyword.php/whateveryouwant.something
the browser displays a duplicate instead of a 404.

Examples:
http://www.php.net/manual/en/function.is-file.php
http://www.php.net/manual/en/function.is-file.php/webmasterworld

http://www.phpmyadmin.net/home_page/downloads.php
http://www.phpmyadmin.net/home_page/downloads.php/test

Is there a way to solve this duplicate problem?

[edited by: tedster at 4:24 pm (utc) on Mar 28, 2011]
[edit reason] make examples visible [/edit]

 

tedster




msg:4288661
 7:24 pm on Mar 28, 2011 (gmt 0)

That's what a 404 response means - the URL was "not found", and you did not create any page. You need to be clear about what the word "page" means (some kind of content served with a 200 OK response) and what the word "URL" means (a web address that can receive any kind of server response).

In the case you described, as long as your server is responding to that URL request with a 404 Not Found status, there is no duplicate problem.

TheMadScientist




msg:4288666
 7:34 pm on Mar 28, 2011 (gmt 0)

Google says I have a 404.

...

the browser display a duplicate instead of 404.

Uh, which is it? I don't understand...

Oh, or are they separate 'things'? Google said you have a 404 and then you did more investigation and found another issue?

Yes, not serving a 404 for any request that's not properly formatted to generate a unique page can be an issue, even for non-php sites ... Try the same thing on almost any site: http://www.example.com/sompage.html?some=stuff-here

Most of the time you will not generate a 404, but rather a 200 OK duplicate of the page ... Solving the issue is a bit involved, and I don't have time to post on that right now, but it can be done ... I'll try to check back in later.
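For the first kind of URL in this thread (extra path text after the script name, as in keyword.php/whateveryouwant.something), one possible starting point is Apache's AcceptPathInfo directive. This is only a sketch and it makes assumptions: that the site runs on Apache, that the directive is allowed where you put it (in .htaccess it needs AllowOverride FileInfo), and that none of your scripts rely on the trailing path on purpose. How it behaves can also depend on how PHP is hooked into the server, so test it first.

```apache
# Assumption: Apache 2.x, placed in the server config, a <Directory> block,
# or an .htaccess file where "AllowOverride FileInfo" is in effect.
# With this set to Off, a request is only accepted if it maps to a path
# that really exists, so /keyword.php/whatever should return 404 Not Found
# instead of a 200 OK copy of /keyword.php.
AcceptPathInfo Off
```

After changing it, check the trailing-path URL with a server header checker to confirm it really returns a 404, and that the normal URL still returns 200.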

serenoo




msg:4288676
 8:03 pm on Mar 28, 2011 (gmt 0)

Tedster, my case is the same as php.net's.
How can I know if my server is responding with a 404 to that URL?

TheMadScientist: they are separate 'things'. Google said I have a 404 and then I did more investigation and found another issue.

tedster




msg:4288679
 8:11 pm on Mar 28, 2011 (gmt 0)

You use a server header checker utility. Many people use the LiveHTTPHeaders add-on for Firefox, but there are many other tools, too.

If Google is saying the URL gives them a 404 status, that is what is happening, at least for Googlebot. You could even verify that within WMT by using the "Fetch as Googlebot" tool.

TheMadScientist




msg:4288681
 8:14 pm on Mar 28, 2011 (gmt 0)

Okay, it's a bit 'complicated' to solve, and part of the solution depends on whether you need query_strings externally or not ... Personally, I get around the issue by only allowing query_strings internally ... To do this, you redirect all query_string URLs to 'friendly' URLs, then rewrite the friendly URLs back to the query_string URLs internally ... After doing this, you can strip the query_string from all external URLs. If you decide to do it, the best place I know of to learn the mod_rewrite details you're going to need is the Apache Forum [webmasterworld.com].

It's really a detailed conversation, and you're going to need a site-specific answer ... It can also be done in PHP by making sure your php pages serve a 404 for an erroneous query_string, and that's a conversation for the PHP Forum [webmasterworld.com], but personally, I prefer the mod_rewrite solution, because IMO it's 'cleaner'.
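To make the 'redirect to friendly, then rewrite back internally' idea a little more concrete, here is a rough sketch for a single script. The names are made up for illustration (product.php, the id parameter, and the /product/123 URL pattern are all assumptions, not anything from a real site), and it assumes the rules live in the .htaccess file in the site root. Treat it as a starting point to test, not a drop-in solution.

```apache
RewriteEngine on

# 1) External redirect: a direct client request for /product.php?id=123
#    is sent to the friendly URL /product/123 (hypothetical names).
#    THE_REQUEST holds the original request line from the client, so this
#    condition does not match the internal rewrite in step 2 and the two
#    rules should not loop.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /product\.php\?id=([0-9]+)\ HTTP/
RewriteRule ^product\.php$ http://www.example.com/product/%1? [R=301,L]

# 2) Internal rewrite: the friendly URL is quietly mapped back to the
#    script with its query string. The visitor never sees this URL.
RewriteRule ^product/([0-9]+)$ /product.php?id=$1 [L]
```

The trailing ? on the redirect target drops the original query string, and %1 is the id captured by the RewriteCond. Only requests that match the exact id=digits form are redirected here; anything else still needs its own handling.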

ADDED: If you don't feel like searching around for a server header checker, click: control panel at the top of the screen, then server headers (2nd link from the bottom on the left side of the page) and you'll find one of my all time favorites ... it's actually the only one I use unless I need something more detailed (like checking to see if I really get a 304 from custom etag headers), which isn't very often.

TyMax




msg:4289754
 8:41 pm on Mar 30, 2011 (gmt 0)

I have had this problem with htm pages as well, and I believe someone is trying to sabotage my ranking by producing duplicate content for some of my better-indexed pages.

I found this out when a page was no longer being indexed by Google, and I found the duplicate page, which I think has at least one canonical link to it from another website that is not mine. The person is putting this onto the end of the URL:
-
http://www.example.com/example.htm?ref=*****.com
-
http://www.example.com/example.htm
-
A 404 will not filter out the ?ref=*****.Com URL, and Google still picks it up. I set up Parameters in Google Webmaster Tools to disregard ?ref=*****.Com, but Google still seems to pick it up. Google now treats the main page as the primary and seems, for now, to disregard ?ref=*****.Com.

You might want to try the Parameters setting in Google Webmaster Tools to filter out the string. You might also want to read this page, which explains the problem.
-
[google.com...]
-
I did not want to go through all the 301 redirects. Try searching for ?ref=*****.Com in Google and see how many pages still come up. Put it on the end of one of your .htm and .php URLs and see what I mean. It seems it may be a new tool others are using to try to sabotage other websites' page rankings.

[edited by: tedster at 10:31 pm (utc) on Mar 30, 2011]
[edit reason] hide the domain name [/edit]

TheMadScientist




msg:4289781
 9:04 pm on Mar 30, 2011 (gmt 0)

I'm going to go ahead and post a 'simple fix it with mod_rewrite if you don't need query strings' piece of code, because I think it will probably help future readers ... Always double check and test mod_rewrite code before installing it on a live site, because if you get it wrong you can break your site in ways you might not notice until after it's done quite a bit of damage.

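RewriteEngine on
# Match any request that arrives with a non-empty query string
RewriteCond %{QUERY_STRING} .
# Redirect to the same path on the canonical host; the trailing ? drops the query string
RewriteRule .? http://www.example.com%{REQUEST_URI}? [R=301,L]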

The above should strip ALL query strings from all URLs on a site.

Make sure you use the canonical version of your domain on the right side of the rule to avoid 'stacked' redirects: one redirect to http://example.com/the-url.htm to remove the query string, followed by another to http://www.example.com/the-url.htm to get to the www version, is not a good thing.

If you don't understand something, it doesn't work, or you need more help, then you should be able to find it in the Apache Forum, linked above. If you use query strings, even only on internal URLs, remember: the above will remove ALL query strings, because it doesn't take into account any you actually need, so if you have a dynamic site, even with friendly URLs, the above MUST be modified prior to use or your site will not work right.
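As one example of the kind of modification meant here, the sketch below adds two extra conditions to the rule above. It is an illustration under assumptions, not a tested fix for any particular site: the page parameter is hypothetical, and the rules are assumed to sit in the root .htaccess file.

```apache
RewriteEngine on

# Only act when the "?" was in the client's original request line, so a
# query string added by an internal rewrite is left alone.
RewriteCond %{THE_REQUEST} \?
# Leave alone the one query string the site really needs
# (hypothetical example: page=<digits>).
RewriteCond %{QUERY_STRING} !^page=[0-9]+$
# Match any other non-empty query string.
RewriteCond %{QUERY_STRING} .
RewriteRule .? http://www.example.com%{REQUEST_URI}? [R=301,L]
```

All three conditions have to be true for the redirect to fire. If your site needs more than one parameter, the exclusion pattern has to be extended to cover each of them, which is where the site-specific work comes in.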

tedster




msg:4289805
 10:32 pm on Mar 30, 2011 (gmt 0)

This isn't something new - I remember seeing it for many years. It began as a kind of server log "spamming" so the webmaster would see who sent the traffic. This was long before canonical issues were understood very well.
