

Google indexing: is there anything they won't index?

6:46 am on Jul 10, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 19, 2002
posts:1945
votes: 0


I have a site that is indexed three times: once correctly, once with %20 added to the www prefix, and once with double slashes (//). How does Google continue to index complete sites incorrectly in this way, leaving them crucified with supplementals? I used to think anything this obvious would be sorted out in the long run, until I saw a Matt Cutts post on his blog. In response to a question about why so many sites are indexed with %20, he said he didn't know and would ask around. I never saw an answer to that, and was more than surprised that such issues seem to be unknown to the guys at the 'plex.

I wonder if anyone can beat my triple-indexed site? Four times, anyone?

8:34 am on July 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member quadrille is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Feb 22, 2002
posts:3455
votes: 0


Probably best to fix your site, before all three disappear.

The problem is almost certainly yours, not Google's.

8:47 am on July 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 19, 2002
posts:1945
votes: 0


Thanks for totally ignoring the point and telling me to fix what isn't broken. You can add a double slash to any website's inner directory URL and produce that result. The %20 appears because URLs are incorrectly indexed when a gap appears in the http:// www. prefix.

While there's a workaround (removing wildcard DNS), it's a bug in Google that indexes this way; no other engine does it.

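
For what it's worth, the wildcard-DNS variants can be folded into the canonical hostname with a redirect. A minimal .htaccess sketch, assuming Apache with mod_rewrite, with example.com standing in for the real domain:

```apache
# Redirect any hostname other than the canonical www host
# (including junk hosts caught by wildcard DNS) with a 301.
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```
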
9:12 am on July 15, 2006 (gmt 0)

New User

5+ Year Member

joined:July 7, 2006
posts:24
votes: 0


I have many pages with commas, ampersands and spaces that have been indexed by Google (and only Google) with their hexadecimal equivalents (%20, etc.). As a workaround, I have simply stopped creating new pages with "high" characters.

I can't understand why this hasn't been fixed yet.

9:27 am on July 15, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:May 21, 2004
posts:449
votes: 0


Google has more fake urls for my site than real ones...

Some urls are with the double slash but most are real urls with random directories and pages appended.

Would love to know where they come from...

9:52 am on July 15, 2006 (gmt 0)

New User

5+ Year Member

joined:July 7, 2006
posts:24
votes: 0


I should add, too, that the links from Google with the hex equivalents all return 404 errors. Totally insane, really.

10:10 am on July 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


If a URL thrown at your site results in your site returning a 200 response, then the content that is returned will be indexed.

It is up to you to set the status to 404 for any sort of URL that should not return content.
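
For the double-slash case, one way to do that (a sketch, assuming Apache 2.x with mod_rewrite; the R=404 flag needs a reasonably recent Apache) is to test the raw request line, since the path handed to the rewrite engine may already have the slashes collapsed:

```apache
RewriteEngine On
# If the requested path contains a double slash, answer 404.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s[^?\s]*//
RewriteRule ^ - [R=404,L]
```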

10:19 am on July 15, 2006 (gmt 0)

New User

5+ Year Member

joined:July 7, 2006
posts:24
votes: 0


Hi g1smd,

I am not sure what you mean. The correct URLs, with ampersands, etc. return 200 OK. The URLs Google is sending return 404 Not found. I can only assume Google found a page it wanted to index (with an ampersand), incorrectly changed the URL on its servers to hex, and then provided that incorrect link for the SERPs. User clicks the Google link, gets a 404 from my server (as it should). This has been going on for over a year. I simply don't use ampersands, spaces etc. anymore.
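
As background (an illustration, not from the thread): percent-encoding is reversible, so the hex spellings are the same characters in a different dress; whether the two spellings reach the same page depends on the server decoding them. A quick sketch in Python:

```python
from urllib.parse import quote, unquote

# A "correct" URL path containing a space and an ampersand.
raw = "/products/red widgets&blue.html"

# Percent-encoding replaces each special character with its hex form.
encoded = quote(raw, safe="/")
print(encoded)  # /products/red%20widgets%26blue.html

# Decoding restores the original path, so the two spellings name the
# same resource -- provided the server actually decodes them.
assert unquote(encoded) == raw
```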

10:42 am on July 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 19, 2002
posts:1945
votes: 0


Even Google's own site will return a 200 if you add extra slashes to a directory URL...

2:53 pm on July 15, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:June 9, 2005
posts:354
votes: 0


Google started doing similar things to me, for example,

example.com/folder/file.html/

The server responds with 200 OK, but the page that shows up looks terrible (the CSS stylesheet is the only file I link with a relative URL, so that URL breaks it).

Also, for the first time in the almost nine years this site has been alive, Google is now also starting to index it without the www, adding /index.html, etc.

A few days ago I started adding 301's to try to correct the errors. Gbot did follow one of the 301s, no idea if it will actually help correct anything or not. (I had tried fixing this before on another site, it never did recover but it's worth a shot.)

G1smd, how would I return a 404 for something like file.html/? Would that be a better way to deal with that than by using a 301?

3:11 pm on July 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


You could set up a 301 redirect for it, or send a 404 instead.

It is a couple of lines of code to bung into the .htaccess file.
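
For example, for the file.html/ case above (a sketch, assuming Apache with mod_rewrite and .html pages; adjust the extension to suit):

```apache
RewriteEngine On
# 301 a trailing slash after a filename back to the file itself...
RewriteRule ^(.+\.html)/$ /$1 [R=301,L]
# ...or, to answer 404 instead (Apache 2.x):
# RewriteRule ^(.+\.html)/$ - [R=404,L]
```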

5:53 pm on July 15, 2006 (gmt 0)

Junior Member

5+ Year Member

joined:Mar 23, 2006
posts:88
votes: 0


If you really want to stop Google indexing a certain page, you need to follow these 4 easy steps:

1. Put a "Disallow: /page" entry in robots.txt

2. Put a <meta name="robots" content="noindex,nofollow"> tag on the page

3. Put a "nofollow" tag on each link to the page

4. Shutdown your server whenever you think the Googlebot might be about to crawl the page.

Hope this helps!

PS: If you also want some of your pages to rank, you'll need to start several blogs that link "organically" to your site via a bunch of junky pseudo-articles. This is Web 0.2 Google style!

7:17 pm on July 15, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:July 3, 2002
posts:18903
votes: 0


A robots.txt "disallow" stops spidering of a page, but Google will still show the page as a URL-only entry in the SERPs if they ever see a link to that page from anywhere else. The rel="nofollow" attribute does not guarantee exclusion either.

The on-page "noindex" meta tag completely removes a page from the index.
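
To make the distinction concrete (a sketch; /private/ is a placeholder path):

```apache
# robots.txt -- blocks crawling, but the URL can still appear
# in Google as a URL-only entry if anyone links to it:
User-agent: *
Disallow: /private/
```

The noindex meta tag, by contrast, only works if the crawler is allowed to fetch the page and see the tag, so don't combine it with a robots.txt Disallow for the same URL.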

7:58 pm on July 15, 2006 (gmt 0)

Moderator

WebmasterWorld Administrator buckworks is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Dec 9, 2001
posts:5611
votes: 22


Something that annoys me is that Google will index something like http://example.com/%20%20 if another site forms the link with extra spaces after the end of the proper URL.

More than once I've had a dud URL like that knock the real URL out of the SERPs. It would be nice if they'd ignore extraneous trailing spaces instead.

Google Sitemaps shows these as errors, but ironically I have to go to Yahoo's link: command to track them down!
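
An .htaccess sketch for the trailing-space case (assuming Apache with mod_rewrite; by the time mod_rewrite sees the path, %20 has been decoded to a literal space, hence the \s in the pattern):

```apache
RewriteEngine On
# Redirect URLs ending in one or more (decoded %20) spaces
# to the same URL without them.
RewriteRule "^(.*?)\s+$" "/$1" [R=301,L]
```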

1:55 am on July 16, 2006 (gmt 0)

Preferred Member

10+ Year Member

joined:May 21, 2004
posts:449
votes: 0



How do you set up that 404 in .htaccess?
3:06 am on July 17, 2006 (gmt 0)

New User

5+ Year Member

joined:July 7, 2006
posts:24
votes: 0


I'd appreciate a .htaccess 404 example. My attempts so far haven't worked.

7:45 am on July 17, 2006 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 19, 2002
posts:1945
votes: 0


Go and see jdmorgan in the Apache forum.