Forum Moderators: Robert Charlton & goodroi


Another duplicate content issue?

Google seems dead set on creating duplicates!


AndyA

3:50 am on Oct 8, 2006 (gmt 0)

10+ Year Member



I just spotted Googlebot in my logs asking for pages that don't exist on my server and never have, yet my server is returning a 200 code. This is what is happening:

I have a page on my site http://example.com/page1.shtml
and another http://example.com/page8.shtml

Googlebot is asking for: http://example.com/page1.shtml/page8.shtml and the server is serving page1.shtml, with an HTTP 200 code.

So that means I now have a page1.shtml and a page1.shtml/page8.shtml with the same contents, right? And this didn't just happen once, but quite a few times, each time with different pages, e.g. page3.shtml/page14.shtml.

Where in the heck is Googlebot coming up with these crazy URLs, and any idea how to use mod_rewrite to put a stop to it?

[edited by: tedster at 4:15 am (utc) on Oct. 8, 2006]
[edit reason] use example.com, de-link [/edit]
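(For what it's worth, here is a minimal sketch of the kind of rule that might do it, assuming Apache with mod_rewrite enabled and the rules living in the site's .htaccess; the .shtml pattern just mirrors the example URLs above and would need adjusting for other page types:)

RewriteEngine On
# Requests like /page1.shtml/page8.shtml are the real page plus bogus
# trailing path info; 301 them back to the real page so only one URL
# can get indexed.
RewriteCond %{REQUEST_URI} ^/(.+\.shtml)/.
RewriteRule .* http://example.com/%1 [R=301,L]

On Apache 2.x, setting AcceptPathInfo Off should also stop the server from answering those requests with a 200 in the first place.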

gehrlekrona

4:33 am on Oct 8, 2006 (gmt 0)

10+ Year Member



This has also happened to me in June and July after the big crash. There were all kinds of weird paths and I don't know where they got them from at all.
The only thing I could come up with was to use fully qualified paths everywhere so there wouldn't be any mistakes. Where they got the paths from, I have no idea, but it seemed like they suddenly couldn't handle relative and fully qualified paths anymore.
Something is very wrong with Google, and has been for months, but of course they won't admit it.
In June my traffic disappeared and I, like you, discovered tons of wrong paths and God knows what else. The site was almost gone for 2 months, losing thousands of dollars, and then it suddenly came back for a couple of weeks in August, just to disappear again in the middle of September.
Like always, Google doesn't care about the little ones! We are just collateral damage in their stupid fight against spam sites, or whatever they are trying to do. And I say "stupid fight against spam sites" because the root of the spam sites is Google itself, since their AdSense is what invites all the spam sites.

CainIV

4:39 am on Oct 8, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Googlebot is asking for: http://example.com/page1.shtml/page8.shtml and the server is serving page1.shtml, with an HTTP 200 code.

Perhaps you have an incorrect relative URL on your site that points to page8.shtml (perhaps from page1.shtml).

Then that URL would get indexed as such. A base href tag would solve this, if that's the issue.

[edited by: CainIV at 4:39 am (utc) on Oct. 8, 2006]

netchicken1

4:46 am on Oct 8, 2006 (gmt 0)

10+ Year Member



I think I see something similar sometimes as well. I find it asking for URLs with some of the text from an article in the URL.

I was wondering if it was some sort of system to look for dupes, with the text being a random check for unique content.

It's left me wondering as well. At one point I went through my hyperlinks in case I had screwed one up and included some text in the link, but nope, the problem doesn't seem to be at my end :)

gehrlekrona

2:30 pm on Oct 8, 2006 (gmt 0)

10+ Year Member



Here's a path from my site they try to access:
/search/index.php/blog/livechat/phponline/garagesales/blog/index.php?catid=42&start=15
/search/index.php - the right path to a file
/blog/livechat/phponline/ - this is also a path, to a live chat
/garagesales/ - another path which exists
/blog/ - a path to a blog
/index.php?catid=42&start=15 - and finally, the right page

Now how wrong is that?
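(For what it's worth, the same server-side cure applies here: stop these concatenated paths from answering with a 200. A minimal sketch, assuming Apache with mod_rewrite in .htaccess and that nothing on the site legitimately relies on path info after a .php script:)

RewriteEngine On
# Anything after ".php/" in the path is bogus trailing path info, like the
# /search/index.php/blog/... example above; answer 410 Gone instead of 200.
RewriteCond %{REQUEST_URI} \.php/
RewriteRule .* - [G,L]

A 301 back to the real script would work just as well, if you can tell which script that should be.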

AndyA

3:46 pm on Oct 8, 2006 (gmt 0)

10+ Year Member



I've searched and searched: I have no links to anything like this on my site, and a Google search for the URL in quotes doesn't turn up anything.

I've disallowed that directory for now, so there's no reason for it to try to index anything there. I'm going to redo that section at some point anyway, so I might as well let it be dormant until then.

The pages in this section do use an include file; perhaps that is the problem. Is an iframe preferred over an include?

Staffa

3:56 pm on Oct 8, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



G just created 16 errors while crawling my site.

The correct URL = page.asp?iData=4&iCat=3&menID=2

Two days ago G re-arranged the query string to page.asp?iCat=4&iData=3&menID=2, which will create duplicate content.

Today it was crawling with the wrong string and 'invented' iCat numbers which don't exist, so each request returned a 500 error.
And no, those numbers never existed, and the pages with the correct query string have been crawled time and time again, so there is no reason for G to suddenly make up its own brand of string.

PS: G has now been crawling uninterrupted from 16:40:45 till 17:02:28 while creating 100 errors.

[edited by: Staffa at 4:08 pm (utc) on Oct. 8, 2006]
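(Staffa's page.asp suggests IIS rather than Apache, but for anyone on Apache, or using an Apache-style rewriter such as ISAPI_Rewrite, one way to keep the re-ordered string out of the index is to 301 it back to the canonical parameter order. A minimal sketch, with the parameter names taken from the example above:)

RewriteEngine On
# If the parameters arrive in the wrong order (iCat before iData), 301 to
# the canonical order, keeping each value with its own parameter name.
RewriteCond %{QUERY_STRING} ^iCat=([0-9]+)&iData=([0-9]+)&menID=([0-9]+)$
RewriteRule ^page\.asp$ http://example.com/page.asp?iData=%2&iCat=%1&menID=%3 [R=301,L]

The 500 errors for invented iCat numbers are a separate issue; having the script return a 404 for IDs that don't exist would keep those URLs out of the index too.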

g1smd

6:17 pm on Oct 8, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Try running Xenu LinkSleuth over the site and looking at what it finds as it finds it, as well as looking at the generated report at the very end...