Forum Moderators: Robert Charlton & goodroi

Googlebot ignoring robots.txt

         

akreider

7:08 am on Mar 31, 2005 (gmt 0)

10+ Year Member



Googlebot is currently ignoring or misinterpreting my robots.txt file.

My robots file includes the following two lines:

User-agent: *
Disallow: apage.php

Googlebot has been visiting apage.php?id=1 and so on (mindlessly indexing tens of thousands of identical pages).

Do I need to use
Disallow: apage.php* instead?

Or is there a problem with Googlebot?

g1smd

10:16 pm on Mar 31, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Use the Google urlcontroller to remove the links. You need to register first.

It takes about 24 hours for them to disappear after filling in the form.

bmsd33

5:10 am on Apr 1, 2005 (gmt 0)

10+ Year Member



I recently removed all the "Disallow" instructions (which I had only just added) from my robots.txt because I noticed Googlebot was doing the exact opposite: it was only visiting the pages I had disallowed!

Reid

10:16 am on Apr 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Are you guys using AdSense?
[webmasterworld.com...]

Also, aren't php?id=1 etc. separate files from the .php file you disallowed?
Maybe you should disallow the directory that contains these files?

'Googlebot only crawling what you disallowed'
You got me there.

joeduck

10:57 pm on Apr 5, 2005 (gmt 0)

10+ Year Member



akreider:

We are coping with a similar problem. For over a year robots.txt has been disallowing our /cf/ directory, which we use to run a counting script for our advertiser links. This did not prevent the indexing (in Feb, I think) of over 50k links out to external sites. These now appear in site:oursite.com/cf/ as "pages" even though they resolve to those sites and not ours. About six weeks ago we changed robots.txt to *allow* that directory, but the bogus link pages remain.

After my inquiry Google help wrote that we should disallow the /cf/ directory.

We cannot remove these links with Google's removal tool because the links exist and are correct; they are just wrongly indexed as "pages". Therefore I get the message "page still exists".

g1smd

11:07 pm on Apr 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Put the Disallow instruction back into your robots.txt file. Use their removal tool with the option "remove pages using a robots.txt file", and the pages will be delisted within 24 hours.

joeduck

11:26 pm on Apr 5, 2005 (gmt 0)

10+ Year Member



Thx g1 - we'll try it that way. I should have read the Google removal instructions more thoroughly.

BigDave

11:47 pm on Apr 5, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Are they being crawled or just indexed?

Putting a page in robots.txt will not keep it from getting indexed; it will just keep them from crawling the page to find out what is on it.

Dave_T

1:42 am on Apr 6, 2005 (gmt 0)

10+ Year Member



I had a similar experience. I believe that your robots.txt has to be case sensitive.

akreider

4:37 am on Apr 6, 2005 (gmt 0)

10+ Year Member



I needed to have a / before the file name. So
/apage.php works.

I did the urlcontroller thing and it worked.
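
For anyone who wants to check a rule before waiting on the next crawl, here is a minimal sketch using Python's standard urllib.robotparser. It is not Googlebot's own parser, so treat it only as an illustration of the usual prefix matching against the URL path. The is_blocked helper and the example.com URL are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(disallow_rule, url, agent="Googlebot"):
    """Return True if a two-line robots.txt with this Disallow rule blocks url.

    Hypothetical helper: builds the same robots.txt as in the first post,
    with only the Disallow value changed.
    """
    parser = RobotFileParser()
    parser.parse(["User-agent: *", "Disallow: " + disallow_rule])
    return not parser.can_fetch(agent, url)


# Assumed example URL; the thread only gives apage.php?id=1.
URL = "http://www.example.com/apage.php?id=1"

# Rules are matched as prefixes of the URL path, and the path always
# starts with "/", so a rule without the leading slash never matches.
without_slash = is_blocked("apage.php", URL)
with_slash = is_blocked("/apage.php", URL)

print("Disallow: apage.php  blocks the URL:", without_slash)
print("Disallow: /apage.php blocks the URL:", with_slash)
```

Because the match is a prefix match, /apage.php also covers apage.php?id=1, apage.php?id=2 and so on in this parser, without any trailing *.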

g1smd

11:13 am on Apr 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I would recommend that everyone uses the site:yourdomain.com command to see what Google has indexed, then use the urlcontroller to remove everything that you do not want to be listed.

You should also set up a 301 redirect from non-www to www to avoid duplicate content. Additionally, make sure that all links that point to folders, or point to an index page inside a folder, do not include the actual filename. Make sure that the URL ends with a trailing / every time.

Be aware that Google does treat:

domain.com/folder
domain.com/folder/
domain.com/folder/index.html
www.domain.com/folder
www.domain.com/folder/
www.domain.com/folder/index.html

as six different pages.

You want www.domain.com/folder/ to be the one that they actually list (because your server should redirect a request for folder to folder/ automatically anyway). Never include the actual index file filename. This will allow you to change your technology in the future without having to change any of the links at all.
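
As a rough sketch of the rule described above, the following Python function maps all six variants onto the single www, trailing-slash form. It only illustrates the target of the redirects and the link clean-up; it is not server configuration. The canonical_url name, the list of index filenames, the http:// scheme, and the guess that a last path segment without a dot is a folder are all assumptions made for the example.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed set of index filenames; adjust to whatever your server uses.
INDEX_NAMES = {"index.html", "index.htm", "index.php"}


def canonical_url(url):
    """Return the www, trailing-slash form of a folder URL (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if not host.startswith("www."):
        host = "www." + host

    path = parts.path or "/"
    head, _, last = path.rpartition("/")
    if last in INDEX_NAMES:
        # Drop the index filename, keep the folder with its trailing slash.
        path = head + "/"
    elif last and "." not in last:
        # Assumption: a last segment with no dot is a folder name.
        path = path + "/"

    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


variants = [
    "http://domain.com/folder",
    "http://domain.com/folder/",
    "http://domain.com/folder/index.html",
    "http://www.domain.com/folder",
    "http://www.domain.com/folder/",
    "http://www.domain.com/folder/index.html",
]
canonical = {canonical_url(v) for v in variants}
print(canonical)
```

Running something like this over your own internal links is one way to spot links that still include the index filename or leave off the trailing slash.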

joeduck

8:09 am on Apr 7, 2005 (gmt 0)

10+ Year Member



Thx G1smd. We disallowed our script directories in robots.txt and then ran URLcontroller at Google, and all the offending links showing as pages were removed within hours. Not clear yet if this has had any effect on our traffic problem from Google.