How could this have happened?

Hexadecimal %20 in my URL

Snookered

12:52 am on Jun 23, 2004 (gmt 0)

Hi,

I've discovered that in the Google index there is a second listing for my site:

[%20bluewidgets.com...]

Of course, if clicked it leads to a "The page cannot be displayed" error. I don't understand how it got there; my main page is .shtml and I don't have another. This might help to explain why Google hasn't indexed pages that I've had up since February. Has this happened to anyone else?

Thanks

ciml

5:00 pm on Jun 23, 2004 (gmt 0)

I haven't seen that for a long time. This used to happen in Google if you had wildcard DNS and hosting that served the content regardless of the domain used to access it. Then, when someone links to that address (with the %20 in the domain) Google would be served the content and merge it with the valid URL for the page.

It's easy enough to link to someone that way by accident using DTP-style HTML authoring software, but ideally the search engine will notice that the domain is wrong (in HTTP terms) and ignore it.
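To illustrate what "wrong in HTTP terms" means here: with wildcard DNS the request still reaches your server, but the Host header carries the bogus name, so the server can check it instead of serving the page regardless. A rough sketch of that check, assuming a made-up canonical hostname (www.example.com) and a hypothetical helper name; a real setup would normally do this in the web server configuration rather than in application code:

```python
# Hypothetical canonical hostname, for illustration only.
CANONICAL_HOST = "www.example.com"


def redirect_target(host_header, path="/"):
    """Return None if the Host header is already canonical, otherwise
    the URL that a 301 response should point to."""
    # Drop an optional ":port" suffix and ignore case before comparing.
    host = host_header.strip().lower().split(":", 1)[0]
    if host == CANONICAL_HOST:
        return None
    # Any other name (including one with a stray %20) gets redirected.
    return "http://" + CANONICAL_HOST + path


print(redirect_target("www.example.com"))                      # None
print(redirect_target("%20www.example.com", "/widgets.shtml"))  # http://www.example.com/widgets.shtml
```

If the server answers every hostname with a 200 and the same content, a crawler has no signal that the odd hostname is a mistake.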

john_k

5:01 pm on Jun 23, 2004 (gmt 0)

I believe that %20 is the encoding for a space. So look for someone linking to your site with an erroneous space before the www.
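For anyone who wants to confirm this, a quick sketch using Python's standard library. The URL below is made up for illustration, and the helper name is hypothetical:

```python
from urllib.parse import unquote, urlsplit


def host_starts_with_space(url):
    """True if the host part of the URL begins with a space once
    percent-encoding is decoded."""
    host = urlsplit(url).netloc
    return unquote(host).startswith(" ")


print(repr(unquote("%20")))  # ' '
print(host_starts_with_space("http://%20www.example.com/"))  # True
print(host_starts_with_space("http://www.example.com/"))     # False
```

Running something like this over the links in your referrer logs would be one way to spot who is linking with the stray space.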

sublime1

7:15 pm on Jun 23, 2004 (gmt 0)

We have recently seen lots of incorrectly encoded URLs for our sites showing up in the index (that is, in the "site:mycompany.com" results). I find them alarming, and they also seem to be persistent.

When I discovered this, I put in rewrite rules to do a 301 redirect from the junk URLs to the correct ones.

What is surprising is that they got put in the index at all, since they would have resulted in a 404 error prior to my fixup -- it looks like the Google crawler found a string starting with http:// and grabbed it, even though this was a fragment of another URL that would have worked. Worse, I think they also got a session id that is tacked onto the end of the URL so we have multiple instances of the same page showing up. Doh!

In our case, I believe these links all came from the "pseudo-directories" which were harvesting queries for popular keywords from Yahoo and other sources of paid listings. Because we add tracking codes for these, we can see where the URL originally came from.

trimmer80

12:30 am on Jun 24, 2004 (gmt 0)

since they would have resulted in a 404 error prior to my fixup

So you don't have wildcard DNS enabled? This usually only occurs if you do. In fact, a 301 redirect would benefit you, as you can still obtain PR from sites that link to you with this type of misspelling.

g1smd

9:10 pm on Jun 24, 2004 (gmt 0)

Google indexes every link that it finds on every page that it crawls. It might not show those pages in the SERPs, but it does record that the link exists (note: the link itself, not the place that the link points to). There are a great many such typos indexed already.

Look at this search [google.com] that lists malformed links to PDF files for example (final trailing "/" is an error).

my3cents

5:52 am on Jun 25, 2004 (gmt 0)

I have the same problem with incorrect URLs being indexed. When they result in a 404, Google is showing the title and description of my default 404 page.

Also, it looks like it is treating my home page as duplicate content for:

[mysite.com...]
www.mysite.com
www.mysite.com/index.html
www.mysite.com/?trackingurl
www.mysite.com/%1Fdynamicstringfromsomesearchengine

g1smd

3:36 pm on Jun 25, 2004 (gmt 0)

Yep. Those are all different pages as far as Google is concerned.

A 301 redirect from domain.com to www.domain.com will take care of one problem.

For links to the index page just use www.domain.com/ without mentioning the exact filename. That may take care of another problem (but you also need to get all other sites that link to you to change to that format).
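As a sketch of those two fixes, here is the mapping that the 301 and the link format are meant to enforce. The hostnames and the list of index filenames are assumptions for illustration, and the function name is hypothetical. Note that it deliberately leaves query strings alone, so it does nothing about tracking URLs:

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical names, for illustration only.
BARE_HOST = "example.com"
CANONICAL_HOST = "www.example.com"
INDEX_NAMES = ("index.html", "index.shtml")  # assumed index filenames


def canonical_url(url):
    """Map the bare domain to the www host and drop an explicit
    index filename, leaving the query string untouched."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host == BARE_HOST:
        host = CANONICAL_HOST
    path = parts.path or "/"
    for name in INDEX_NAMES:
        if path.endswith("/" + name):
            path = path[: -len(name)]  # keep the trailing slash
            break
    return urlunsplit((parts.scheme, host, path, parts.query, ""))


print(canonical_url("http://example.com"))                 # http://www.example.com/
print(canonical_url("http://www.example.com/index.html"))  # http://www.example.com/
```

Whenever the result differs from the requested URL, that is a candidate for a 301 to the result.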

my3cents

3:59 pm on Jun 25, 2004 (gmt 0)

g1smd,

I don't see how I can write redirects for all of the tracking URLs and dynamic strings that are being generated from other search engines. I pointed out a few obvious ones, and a few I can 301, but there are several more I don't know what to do with. By the time G indexes some dynamic URL it's too late for me, and it seems like I am getting a duplicate content penalty. The dynamic strings are not from our website, which is completely static; they look like results from another search engine or directory.

sublime1

4:01 pm on Jun 25, 2004 (gmt 0)

g1smd -- thanks for the clarification on what shows up in the "site:" query. From that I feel better.

my3cents -- you say:

I have the same problem with incorrect urls being indexed, when they result in a 404, google is showing title and description of my default 404 page.
Also, it looks like it is treating my home page as duplicate content for:
[mysite.com...]
www.mysite.com
www.mysite.com/index.html
www.mysite.com/?trackingurl
www.mysite.com/%1Fdynamicstringfromsomesearchengine

Is your server returning a 404 error for the 404 page (many sites send back error pages with a 200 status)?

If your page is correctly returning a 404 error, then this would be a bizarre situation, since it implies not only that Google is finding bogus URLs (which is natural), but also that after it retrieves them it fails to notice the 404 (Not Found) response and goes ahead and indexes the page.

Make sure your bad pages are returning the 404 response! And if they are, please let us know, as this could explain a lot of things :-)
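If you don't have a header checker handy, something along these lines will show the status code that a crawler ends up with. It is only a sketch: the function names are my own, the URL at the bottom is a placeholder, and urlopen follows redirects, so the number reported is the status of the final response rather than of any intermediate redirect.

```python
import urllib.error
import urllib.request


def fetch_status(url):
    """Return the HTTP status code of the final response for url.
    Redirects are followed automatically by urlopen."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as HTTPError; the code is on it.
        return err.code


def diagnose(status):
    """Describe what the status of a known-bad URL suggests."""
    if status == 404:
        return "ok: the bad URL really returns 404"
    if status == 200:
        return "soft 404: the error page is served with a 200 status"
    return "unexpected status %d for a missing page" % status


if __name__ == "__main__":
    # Placeholder URL; substitute a URL on your own site that should not exist.
    print(diagnose(fetch_status("http://www.example.com/no-such-page")))
```

A "soft 404" result would mean the error page looks like an ordinary page to a crawler, which would be consistent with its title and description turning up in the index.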

my3cents

4:04 pm on Jun 25, 2004 (gmt 0)

I am going to check into this further

my3cents

4:14 pm on Jun 25, 2004 (gmt 0)

ok, I have a .htaccess file that says:

ErrorDocument 404 [mysite.com...]

if you go to [mysite.com...]

you get the errorpage.shtml page

the incorrect URLs are being indexed with the title and description snippet from the errorpage.shtml

Also, I am calling my admin to see if there is anything he has done that could be causing a problem or if he has any suggestions.