Googlebot

Forum Moderators: phranque

Why has Googlebot indexed Disallowed pages

MegaTraders

12:21 pm on Sep 5, 2007 (gmt 0)

Help! According to Google Webmaster Tools, over 150 of the 256 pages on my site got crawled at the end of last week. I have a robots.txt and a sitemap.xml in the root. However, when I search for site:example.com in Google I see 9 disallowed pages indexed and only 2 others out of 256.

By the way, Googlebot has been crawling my site since May.

Why has Googlebot indexed my disallowed pages?

Why has Google not indexed my other pages even though it appears to have crawled most of them?

Heeeeeeeeeeeeeeeeeelp!

[edited by: tedster at 8:23 pm (utc) on Sep. 5, 2007]
[edit reason] switch url to example.com [/edit]

Philosopher

12:34 pm on Sep 5, 2007 (gmt 0)

First, let me say Welcome to WebmasterWorld. :)

Second, you'll want to edit your original post and remove your URL as posting of actual URLs is against the TOS of the site.

How long have you had your robots.txt file up? If it was put up recently, it's possible that the pages were spidered prior to the robots.txt file being added to your site.

Finally, spidered and indexed are obviously two different things. Sometimes it takes a while from when a page is spidered until when it can be found in the index. Give it a bit of time.

Also, it can happen, especially for a new site, that G will crawl it but not add it to the index at first. If this ends up being the case, additional links will generally fix it.

orionsweb

4:16 pm on Sep 5, 2007 (gmt 0)

Another possible reason or two...

Unfortunately, not all SEs (especially second- and third-tier ones) will follow a robots.txt file's instructions.

When they don't, they index those pages. Sometimes Google (and other SEs) will then pick up on the links from those third-party engines, directories, or individual sites and index the pages that way.

One way to help get your un-indexed pages listed would be to create a Google Sitemap (and while you're at it, do a Yahoo sitemap too).

jimbeetle

5:58 pm on Sep 5, 2007 (gmt 0)

Also go over to Google's Webmaster Tools and run your file through the robots.txt validator.
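If you want a quick local sanity check as well, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the URLs below are made up for illustration (the real file was never posted in this thread), and Python's parser is not guaranteed to interpret every rule exactly the way Googlebot does, so treat it as a rough cross-check rather than a replacement for the validator.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; substitute the text of your own file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cart.html
"""


def blocked_urls(robots_txt, user_agent, urls):
    """Return the URLs that the given user agent may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(user_agent, u)]


# Hypothetical URLs on the placeholder domain used in this thread.
urls = [
    "http://example.com/index.html",
    "http://example.com/private/report.html",
    "http://example.com/cart.html",
]

for url in blocked_urls(ROBOTS_TXT, "Googlebot", urls):
    print("Disallowed for Googlebot:", url)
```

If a page you meant to block does not show up in that list, the rule is probably mistyped or sitting under the wrong User-agent group.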

jdMorgan

6:15 pm on Sep 5, 2007 (gmt 0)

Some relevant information:

robots.txt does not say not to index a page; it says not to fetch the page. Pages found through links may still be indexed, using the link-text of those links as the result title. robots.txt was intended for server bandwidth conservation, and pre-dates the major search engines.

The on-page <meta name="robots"> tag can be used to keep pages out of most search engines' indexes. However, in order to read this tag, the robot must be allowed to fetch the page, so a page should not be Disallowed in robots.txt if this tag is to be used on it.
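To make that interaction concrete, here is a rough Python sketch that flags the conflicting case: a page that carries a noindex tag but is also Disallowed, so a compliant robot never gets to read the tag. The function name, the status labels, and the sample robots.txt and HTML are all invented for this illustration, and it only models the logic described above, not how any particular search engine actually behaves.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def noindex_status(robots_txt, url, html, user_agent="Googlebot"):
    """Classify a page by how robots.txt and the meta tag interact."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    finder = RobotsMetaFinder()
    finder.feed(html)

    has_noindex = "noindex" in finder.directives
    fetchable = rules.can_fetch(user_agent, url)

    if has_noindex and not fetchable:
        return "conflict"         # tag exists, but the robot may not fetch it
    if has_noindex:
        return "noindex-visible"  # robot can fetch the page and read the tag
    if not fetchable:
        return "blocked-only"     # not fetched, but may be indexed via links
    return "indexable"


# Hypothetical inputs for illustration only.
robots = "User-agent: *\nDisallow: /private/\n"
page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'

print(noindex_status(robots, "http://example.com/private/a.html", page))
print(noindex_status(robots, "http://example.com/public/a.html", page))
```

Under these assumptions the first call reports a conflict and the second reports that the tag is visible. The practical fix for a conflict is to remove the Disallow line for any page you want kept out of the index via the meta tag.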

While often presented and discussed as being equivalent alternatives, robots.txt and the on-page <meta name="robots"> tag are very different things for different purposes.

Jim
