Why has Googlebot indexed Disallowed pages

         

MegaTraders

12:21 pm on Sep 5, 2007 (gmt 0)

10+ Year Member



Help! According to Google Webmaster Tools, over 150 of the 256 pages on my site got crawled at the end of last week. I have a robots.txt and a sitemap.xml in the root. However, when I search for site:example.com in Google I see 9 disallowed pages indexed and only 2 others out of 256.

By the way, Googlebot has been crawling my site since May.

Why has Googlebot indexed my disallowed pages?

Why has Google not indexed my other pages even though it appears to have crawled most of them?

Heeeeeeeeeeeeeeeeeelp!

[edited by: tedster at 8:23 pm (utc) on Sep. 5, 2007]
[edit reason] switch url to example.com [/edit]

Philosopher

12:34 pm on Sep 5, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



First, let me say Welcome to WebmasterWorld. :)

Second, you'll want to edit your original post and remove your URL, as posting actual URLs is against the TOS of the site.

How long have you had your robots.txt file up? If it was put up recently, it's possible that the pages were spidered prior to the robots.txt file being added to your site.

Finally, spidered and indexed are obviously two different things. Sometimes it takes a while from when a page is spidered until when it can be found in the index. Give it a bit of time.

Also, it can happen, especially for a new site, that G will crawl it but not add it to the index at first. If this ends up being the case, additional links will generally fix it.

orionsweb

4:16 pm on Sep 5, 2007 (gmt 0)

10+ Year Member



Another possible reason or two...

Unfortunately, not all SEs (especially second and third tier) will follow a robots.txt file's instructions.

When they don't, they index those pages, and Google (and other SEs) will sometimes pick up the links from third-party engines, directories, or individual sites and index the pages that way.

One way to help get your un-indexed pages listed would be to create a Google sitemap (and while you're at it, do a Yahoo sitemap too).

jimbeetle

5:58 pm on Sep 5, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Also go over to Google's Webmaster Tools and run your file through the robots.txt validator.

jdMorgan

6:15 pm on Sep 5, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Some relevant information:

robots.txt does not say not to index a page; it says not to fetch the page. Pages found through links may still be indexed, using the link text of those links as the result title. robots.txt was intended for server bandwidth conservation, and pre-dates the major search engines.
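
To make the fetch-versus-index distinction concrete, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt contents and the /private/ path are made up for illustration and are not the original poster's file. Note that the check only answers "may this robot fetch this URL?"; it says nothing about whether the URL can appear in an index.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, assumed for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A compliant robot will not fetch the disallowed URL...
blocked = can_fetch(ROBOTS_TXT, "Googlebot", "http://example.com/private/page.html")
# ...but is free to fetch anything outside the disallowed path.
allowed = can_fetch(ROBOTS_TXT, "Googlebot", "http://example.com/public/page.html")

print("fetch /private/page.html:", blocked)
print("fetch /public/page.html:", allowed)
```

A "not allowed to fetch" result here is all that robots.txt expresses. A search engine that finds a link to the blocked URL elsewhere can still list that URL without ever fetching it.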

The on-page <meta name="robots"> tag can be used to keep pages out of most search engines' indexes. However, in order to read this tag, the robot must be allowed to fetch the page, so a page should not be Disallowed in robots.txt if this tag is to be used on it.
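
As a rough sketch of what a robot has to do to honor that tag, the following uses Python's standard html.parser to pull the directives out of a page. The sample HTML and the helper names (RobotsMetaParser, robots_directives) are assumptions made for illustration. The point is that the page body must already have been fetched before any of this can run.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated, e.g. "noindex, follow".
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page, assumed for illustration only.
SAMPLE_PAGE = """<html><head>
<title>Example</title>
<meta name="robots" content="noindex, follow">
</head><body>Hello</body></html>"""

directives = robots_directives(SAMPLE_PAGE)
print("noindex requested:", "noindex" in directives)
```

If the same page were also Disallowed in robots.txt, a compliant robot would never download the HTML, so it would never see the noindex directive.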

While often presented and discussed as being equivalent alternatives, robots.txt and the on-page <meta name="robots"> tag are very different things for different purposes.

Jim