Forum Moderators: Robert Charlton & goodroi


Old php causing dup content problems?

mystery php I thought

         

texasville

8:48 pm on Aug 21, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I wasn't sure where to post this, but since it could ultimately cause a supplemental-index problem in Google due to duplicate content, I chose the Google search forum.

I took on a client a few months back and built a new site for him. He had a current site that was a mass template design provided by a company that caters to his niche market.

I built an all-original site, though I did use some contact forms to shortcut the process. The new site has not been linked to all that much yet, but before the rebuild he had a PR2. Now the PR is grayed out, even though Google Webmaster Tools shows that it has PR assigned.

A few days ago I was going through Yahoo and noticed a few strange URLs listed for the site. I assumed they were left over from the old site, which had been built in PHP; I built the new site in straight HTML. I clicked on one URL, http://www.example.com/?spg=something.php, and lo and behold it resolved to the index page of the new site. The browser displayed the URL I had input, but the page was the index page.

So I figured I had a real dup content problem: more than one URL leading to the same page. I scavenged through everything on the site to see what was making it do this. It was not set to parse HTML as PHP. There was nothing.

Then I thought that maybe the statcounter code was causing it. I tried the address on another site I control:
http://www.anothersite.com/?spg=something.php and it resolved to the index page of that site.

I thought I was onto something until I tried it on webmasterworld.com and it did the same thing. It will even do it with just the ? after the /. In fact it will do it on almost any site but msn.com: the server simply falls back to the default page.

So I figured, no problem then; it shouldn't be causing trouble for the site I was concerned about. Then it occurred to me that since these pages originally existed, it does cause a problem. Because those URLs once existed on this site, I do have multiple URLs for the same page.

This is really bending my mind. It is not something I understand at all. Does anyone know? Do I have a problem? And what 301 redirect can I drop in to resolve this?

[edited by: tedster at 9:25 pm (utc) on Aug. 21, 2007]
[edit reason] fix formatting [/edit]

FriskUK

10:45 pm on Aug 21, 2007 (gmt 0)

10+ Year Member



I wouldn't do anything or worry about it at all.

All apache hosted sites will work as domain.com/?query=whatever

That's probably why it doesn't work with msn.com, as I'm sure they'll be using IIS.

If no page is specified (e.g. domain.com and not domain.com/index.php), Apache will serve the default index file named by DirectoryIndex in httpd.conf, but not show the page name in the browser URL (I think IIS does).

And you can append a query string to any URL, e.g. www.domain.com/index.html?query=whatever; it doesn't mean it will be used.
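To make the point about ignored query strings concrete, here is a minimal Python sketch (not something from the original site) that splits a URL into path and query and shows several variants collapsing to one address. The helper name strip_query is hypothetical, and the example URLs are the ones used in this thread.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return the URL without its query string or fragment (hypothetical helper)."""
    parts = urlsplit(url)
    # An empty path is treated as "/" so that bare domains compare equal.
    return urlunsplit((parts.scheme, parts.netloc, parts.path or "/", "", ""))


# Three addresses a static site would typically answer with the same index page.
urls = [
    "http://www.example.com/",
    "http://www.example.com/?spg=something.php",
    "http://www.example.com/?",
]

# A set keeps one entry per distinct canonical address.
canonical = {strip_query(u) for u in urls}
print(canonical)
```

The set ends up with a single entry, which is the duplicate content worry in miniature: several indexed addresses, one page.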

I'm sure Google will understand that your site is a new website, and in time those links will fade out of the index. Maybe those URLs you saw are still linked from other websites that linked to the original site?

Either way, I wouldn't worry about it.

[edited by: FriskUK at 10:49 pm (utc) on Aug. 21, 2007]

tedster

11:26 pm on Aug 21, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, if Google has all these URLs in their index, but they now all resolve to the home page, then I would help Google to get things straightened out. Multiple indexed URLs for the same content can cause trouble, especially if there are a large number of them.

One way to handle this would be to add a line to your robots.txt file. You can make use of the pattern matching wildcard * that Googlebot does support, even though it's not currently part of the robots.txt standard protocol.

To block googlebot from all URLs that include a question mark (?):

User-agent: googlebot
Disallow: /*?*

If you still have some active URLs that use a query string, then you need to describe a more exact pattern, one that targets only the kind of query string you don't want indexed. For example:

User-agent: googlebot
Disallow: /*?spg=*

You can then use Google's URL removal tool to remove those duplicate URLs from the active index and reduce any side effects they may be creating.
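As a rough way to sanity-check which URLs the two Disallow patterns above would catch, the wildcard matching can be sketched in Python. This is a simplified approximation, not Googlebot's actual matcher: it handles only * and a trailing $, ignores Allow rules and user-agent grouping, and the function names are hypothetical.

```python
import re


def robots_pattern_to_regex(pattern: str) -> "re.Pattern":
    """Translate a Disallow pattern with * wildcards into a compiled regex."""
    # A trailing "$" anchors the pattern at the end of the URL.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Each "*" matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(path_and_query: str, disallow_patterns) -> bool:
    """True if any pattern matches from the start of the path-plus-query."""
    return any(
        robots_pattern_to_regex(p).match(path_and_query)
        for p in disallow_patterns
    )


# The broad rule blocks anything containing a question mark.
print(is_blocked("/?spg=something.php", ["/*?*"]))
print(is_blocked("/", ["/*?*"]))

# The narrower rule leaves other query strings crawlable.
print(is_blocked("/index.html?query=whatever", ["/*?spg=*"]))
```

Checking a handful of real URLs this way before editing robots.txt can help avoid blocking query strings that are still in use.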

[edited by: tedster at 12:51 am (utc) on Aug. 22, 2007]

texasville

11:30 pm on Aug 21, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks Tedster! I think that is the ticket. Like I said, I wouldn't be worried about it except for the fact that those old URLs did once exist.

jd01

11:47 pm on Aug 21, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



All apache hosted sites will work as domain.com/?query=whatever

Unless you use a little mod_rewrite to strip the query string (the ? and everything after it) from requested URLs. It's one of the cool things you can do with an Apache server that you cannot (easily) do with IIS.

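RewriteEngine on
# Apply only when the client's original request line contains a "?"
RewriteCond %{THE_REQUEST} \?
# Redirect (301) to the same path; the trailing "?" drops the query string
RewriteRule (.*) /$1? [R=301,L]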

(Works great unless you need to use query strings in an original request; then it gets a little more complicated.)
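For anyone who wants to reason about what that rule does before deploying it, here is a small Python sketch that mimics its effect on a raw request line. It is only an illustration of the logic, not how Apache evaluates rules, and the function name redirect_target is made up for this example.

```python
from typing import Optional


def redirect_target(request_line: str) -> Optional[str]:
    """Return the 301 target for a request line, or None if no redirect applies."""
    # Mirrors RewriteCond %{THE_REQUEST} \? : act only if "?" was actually sent.
    if "?" not in request_line:
        return None
    # A request line looks like: METHOD SP request-target SP protocol
    _method, target, _protocol = request_line.split(" ", 2)
    # Mirrors the substitution /$1? : keep the path, discard the query string.
    return target.split("?", 1)[0]


print(redirect_target("GET /?spg=something.php HTTP/1.1"))
print(redirect_target("GET /? HTTP/1.1"))
print(redirect_target("GET /index.html HTTP/1.1"))
```

Testing the original request line rather than the parsed query string means a bare trailing ? is caught as well, and the redirected request, which no longer contains a ?, does not trigger the rule again.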

Justin