

Google indexing: is there anything they won't index?

6:46 am on Jul 10, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 19, 2002
posts:1945
votes: 0


I have a site that is indexed three times: once correctly, once with %20 added to the www prefix, and once with double slashes (//). How does Google continue to index complete sites incorrectly in this way, leaving them crucified with supplementals? I used to think anything this obvious would be sorted out in the long run, until I saw a Matt Cutts post on his blog. In response to a question about why so many sites are indexed with %20, he said he didn't know and would ask around. I never saw an answer to that, and was more than surprised that such issues seem to be unknown to the guys at the 'plex.

I wonder if anyone can beat my triple-indexed site? Four times, anyone?

8:34 am on July 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member quadrille is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Feb 22, 2002
posts:3455
votes: 0


Probably best to fix your site, before all three disappear.

The problem is almost certainly yours, not Google's.

8:47 am on July 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 19, 2002
posts:1945
votes: 0


Thanks for totally ignoring the point and telling me to fix what isn't broken. You can add a double slash to any website's inner directory URL and produce that result. The %20 appears because URLs are incorrectly indexed when a gap appears in the http:// www. prefix.

While there's a workaround (removing wildcard DNS), it's a bug in Google that indexes this way; no other engine does it.

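
For what it's worth, the wildcard-DNS variants can be folded into the canonical hostname with a redirect. A minimal .htaccess sketch, assuming Apache with mod_rewrite, with example.com standing in for the real domain:

```apache
# Redirect any hostname other than the canonical www host
# (including junk hosts caught by wildcard DNS) with a 301.
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```
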
9:12 am on July 15, 2006 (gmt 0)

New User

5+ Year Member

joined:July 7, 2006
posts:24
votes: 0


I have many pages with commas, ampersands and spaces that have been indexed by Google (and only Google) with their hexadecimal equivalents (%20, etc.). As a workaround, I have simply stopped creating new pages with "high" characters.

I can't understand why this hasn't been fixed yet.

9:27 am on July 15, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:May 21, 2004
posts:449
votes: 0


Google has more fake urls for my site than real ones...

Some urls are with the double slash but most are real urls with random directories and pages appended.

Would love to know where they come from...

9:52 am on July 15, 2006 (gmt 0)

New User

5+ Year Member

joined:July 7, 2006
posts:24
votes: 0


I should add, too, that the links from Google with the hex equivalents all return 404 errors. Totally insane, really.

10:10 am on July 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


If a URL thrown at your site results in your site returning a 200 response, then the content that is returned will be indexed.

It is up to you to set the status to 404 for any sort of URL that should not return content.
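
For the double-slash case, one way to do that (a sketch, assuming Apache 2.x with mod_rewrite; the R=404 flag needs a reasonably recent Apache) is to test the raw request line, since the path handed to the rewrite engine may already have the slashes collapsed:

```apache
RewriteEngine On
# If the requested path contains a double slash, answer 404.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s[^?\s]*//
RewriteRule ^ - [R=404,L]
```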

10:19 am on July 15, 2006 (gmt 0)

New User

5+ Year Member

joined:July 7, 2006
posts:24
votes: 0


Hi g1smd,

I am not sure what you mean. The correct URLs, with ampersands, etc. return 200 OK. The URLs Google is sending return 404 Not found. I can only assume Google found a page it wanted to index (with an ampersand), incorrectly changed the URL on its servers to hex, and then provided that incorrect link for the SERPs. User clicks the Google link, gets a 404 from my server (as it should). This has been going on for over a year. I simply don't use ampersands, spaces etc. anymore.
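
As background (an illustration, not from the thread): percent-encoding is reversible, so the hex spellings are the same characters in a different dress; whether the two spellings reach the same page depends on the server decoding them. A quick sketch in Python:

```python
from urllib.parse import quote, unquote

# A "correct" URL path containing a space and an ampersand.
raw = "/products/red widgets&blue.html"

# Percent-encoding replaces each special character with its hex form.
encoded = quote(raw, safe="/")
print(encoded)  # /products/red%20widgets%26blue.html

# Decoding restores the original path, so the two spellings name the
# same resource -- provided the server actually decodes them.
assert unquote(encoded) == raw
```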

10:42 am on July 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 19, 2002
posts:1945
votes: 0


Even Google's own site will return a 200 if you add extra slashes to a directory URL...

2:53 pm on July 15, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:June 9, 2005
posts:354
votes: 0


Google started doing similar things to me, for example,

example.com/folder/file.html/

The server responds with 200 OK, but the page that shows up looks terrible (the CSS stylesheet is the only file I link with a relative URL, so that URL breaks it).

Also, for the first time in the almost nine years this site has been alive, Google is now also starting to index it without the www, adding /index.html, etc.

A few days ago I started adding 301's to try to correct the errors. Gbot did follow one of the 301s, no idea if it will actually help correct anything or not. (I had tried fixing this before on another site, it never did recover but it's worth a shot.)

G1smd, how would I return a 404 for something like file.html/? Would that be a better way to deal with that than by using a 301?

3:11 pm on July 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


You could set up a 301 redirect for it, or send a 404 instead.

It is a couple of lines of code to bung into the .htaccess file.
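
For example, for the file.html/ case above (a sketch, assuming Apache with mod_rewrite and .html pages; adjust the extension to suit):

```apache
RewriteEngine On
# 301 a trailing slash after a filename back to the file itself...
RewriteRule ^(.+\.html)/$ /$1 [R=301,L]
# ...or, to answer 404 instead (Apache 2.x):
# RewriteRule ^(.+\.html)/$ - [R=404,L]
```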

5:53 pm on July 15, 2006 (gmt 0)

Junior Member

5+ Year Member

joined:Mar 23, 2006
posts:88
votes: 0


If you really want to stop Google indexing a certain page, you need to follow these 4 easy steps:

1. Put a "Disallow: /page" entry in robots.txt

2. Put a <meta name="robots" content="noindex,nofollow"> tag on the page

3. Put a "nofollow" tag on each link to the page

4. Shutdown your server whenever you think the Googlebot might be about to crawl the page.

Hope this helps!

PS: If you also want some of your pages to rank, you'll need to start several blogs that link "organically" to your site via a bunch of junky pseudo-articles. This is Web 0.2 Google style!

7:17 pm on July 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


A robots.txt "disallow" stops spidering of a page, but Google will still show the page as a URL-only entry in the SERPs if they ever see a link to that page from anywhere else. The rel="nofollow" attribute does not guarantee exclusion either.

The on-page "noindex" meta tag completely removes a page from the index.
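
To make the distinction concrete (a sketch; /private/ is a placeholder path):

```apache
# robots.txt -- blocks crawling, but the URL can still appear
# in Google as a URL-only entry if anyone links to it:
User-agent: *
Disallow: /private/
```

The noindex meta tag, by contrast, only works if the crawler is allowed to fetch the page and see the tag, so don't combine it with a robots.txt Disallow for the same URL.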

7:58 pm on July 15, 2006 (gmt 0)

Moderator

WebmasterWorld Administrator buckworks is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Dec 9, 2001
posts:5611
votes: 22


Something that annoys me is that Google will index something like http://example.com/%20%20 if another site forms the link with extra spaces after the end of the proper URL.

More than once I've had a dud URL like that knock the real URL out of the SERPs. It would be nice if they'd ignore extraneous trailing spaces instead.

Google Sitemaps shows these as errors, but ironically I have to go to Yahoo's link: command to track them down!
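
An .htaccess sketch for the trailing-space case (assuming Apache with mod_rewrite; by the time mod_rewrite sees the path, %20 has been decoded to a literal space, hence the \s in the pattern):

```apache
RewriteEngine On
# Redirect URLs ending in one or more (decoded %20) spaces
# to the same URL without them.
RewriteRule "^(.*?)\s+$" "/$1" [R=301,L]
```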

1:55 am on July 16, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:May 21, 2004
posts:449
votes: 0



How do you set up that 404 in .htaccess?
3:06 am on July 17, 2006 (gmt 0)

New User

5+ Year Member

joined:July 7, 2006
posts:24
votes: 0


I'd appreciate a .htaccess 404 example. My attempts so far haven't worked.

7:45 am on July 17, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 19, 2002
posts:1945
votes: 0


Go and see jdmorgan in the Apache forum.