I've discovered that in the Google index there is a second listing for my site:
[%20bluewidgets.com...]
Of course, if clicked it leads to a "The page cannot be displayed" error. I don't understand how it got there; my main page is .shtml and I don't have another. This might help to explain why Google hasn't indexed pages that I've had up since February. Has this happened to anyone else?
Thanks
It's easy enough to link to someone that way by accident using DTP-style HTML authoring software, but ideally the search engine will notice that the domain is wrong (in HTTP terms) and ignore it.
When I discovered this, I put in rewrite rules to do a 301 redirect from the junk URLs to the correct ones.
What is surprising is that they got put in the index at all, since they would have resulted in a 404 error prior to my fixup -- it looks like the Google crawler found a string starting with http:// and grabbed it, even though this was a fragment of another URL that would have worked. Worse, I think they also got a session id that is tacked onto the end of the URL so we have multiple instances of the same page showing up. Doh!
In our case, I believe these links all came from the "pseudo-directories" which were harvesting queries for popular keywords from Yahoo and other sources of paid listings. Because we add tracking codes for these, we can see where the URL originally came from.
For example, look at this search [google.com], which lists malformed links to PDF files (the final trailing "/" is an error).
Also, it looks like it is treating my home page as duplicate content for:
[mysite.com...]
www.mysite.com
www.mysite.com/index.html
www.mysite.com/?trackingurl
www.mysite.com/%1Fdynamicstringfromsomesearchengine
A 301 redirect from domain.com to www.domain.com will take care of one problem.
For links to the index page just use www.domain.com/ without mentioning the exact filename. That may take care of another problem (but you also need to get all other sites that link to you to change to that format).
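To make the two suggestions above concrete, here is a rough Python sketch of the mapping a 301 rule would need to implement. It is not anyone's actual server config. The host name `www.domain.com`, the index file name and the decision to drop every query string are assumptions for illustration (dropping queries only makes sense on a fully static site that does not rely on tracking codes).

```python
from urllib.parse import urlsplit, urlunsplit

# Assumptions for illustration only: the preferred host and the default document.
CANONICAL_HOST = "www.domain.com"
INDEX_FILES = {"/index.html"}


def canonical_url(url):
    """Return the single URL that a home-page variant should 301 to."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()

    # domain.com -> www.domain.com
    if host == CANONICAL_HOST[len("www."):]:
        host = CANONICAL_HOST

    # An empty path and an explicit index file both mean the root.
    path = parts.path or "/"
    if path in INDEX_FILES:
        path = "/"

    # Assumption: the site is static, so query strings such as tracking
    # codes carry no meaning and are dropped. Any port is dropped as well.
    return urlunsplit((parts.scheme, host, path, "", ""))


if __name__ == "__main__":
    for variant in (
        "http://domain.com",
        "http://www.domain.com/index.html",
        "http://www.domain.com/?trackingurl",
    ):
        print(variant, "->", canonical_url(variant))
```

A bogus path such as the `%1F...` one is left alone by this mapping. That kind of URL is better answered with a real 404 than with a redirect.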
I don't see how I can write redirects for all of the tracking URLs and dynamic strings that are being generated by other search engines. I pointed out a few obvious ones, and a few I can 301, but there are several more I don't know what to do with. By the time G indexes some dynamic URL it's too late for me, and it seems like I am getting a duplicate content penalty. The dynamic strings are not from our website; we are completely static. They look like results from another search engine or directory.
my3cents -- you say:
I have the same problem with incorrect URLs being indexed; when they result in a 404, Google is showing the title and description of my default 404 page.
Also, it looks like it is treating my home page as duplicate content for:
[mysite.com...]
www.mysite.com
www.mysite.com/index.html
www.mysite.com/?trackingurl
www.mysite.com/%1Fdynamicstringfromsomesearchengine
Is your server returning a 404 error for the 404 page (many sites send back error pages with a 200 status)?
If your page is correctly returning a 404 error, then this would be a bizarre situation, since it implies not only that Google is finding bogus URLs (which is natural), but that after it retrieves them it fails to notice the 404 (Not Found) response and goes ahead to index the page.
Make sure your bad pages are returning the 404 response! And if they are, please let us know, as this could explain a lot of things :-)
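If it helps, here is a small Python sketch for checking what status a bad URL really sends back. It uses only the standard library and does not follow redirects, so a 301 or 302 is reported as such. The function names and the example URL in the comment are made up for illustration.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so the first status is visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code the server sends for url."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 3xx (when not followed), 4xx and 5xx responses.
        return err.code


def classify(status):
    """Describe what a crawler sees when it requests a URL that should not exist."""
    if status in (404, 410):
        return "not-found"   # what a bad URL should return
    if 300 <= status < 400:
        return "redirect"    # the crawler is sent to another page
    if 200 <= status < 300:
        return "soft-404"    # error page served with a success status
    return "other"


# Hypothetical usage against a deliberately bogus path:
#   print(classify(fetch_status("http://www.mysite.com/no-such-page")))
```

One known cause of a "redirect" result on Apache: its documentation says that when ErrorDocument points at a full http:// URL instead of a local path, the server sends the client a redirect to that URL, so the client never receives the original 404 status.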
ErrorDocument 404 [mysite.com...]
If you go to [mysite.com...]
you get the errorpage.shtml page.
The incorrect URLs are being indexed with the title and description snippet from errorpage.shtml.
Also, I am calling my admin to see if there is anything he has done that could be causing a problem or if he has any suggestions.