My robots file includes the following two lines:
User-agent: *
Disallow: apage.php
Googlebot has been visiting apage.php?id=1 and so on (mindlessly indexing tens of thousands of identical pages).
Do I need to use
Disallow: apage.php* instead?
Or is there a problem with Googlebot?
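One way to see how a prefix-matching robots.txt parser reads the two forms of the rule is to feed them to Python's standard-library parser. This is only a sketch: urllib.robotparser is not Googlebot and does not implement Google's wildcard extensions, and www.example.com is a placeholder host rather than the poster's site. The helper name is_allowed is made up for this illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(disallow_value, url, agent="Googlebot"):
    """Return True if `url` may be fetched under a one-rule robots.txt."""
    parser = RobotFileParser()
    # Build the same two-line file as in the question, varying only the path.
    parser.parse(["User-agent: *", f"Disallow: {disallow_value}"])
    return parser.can_fetch(agent, url)


# Placeholder URL standing in for the pages Googlebot was visiting.
URL = "http://www.example.com/apage.php?id=1"

print(is_allowed("apage.php", URL))   # rule as posted, no leading slash
print(is_allowed("/apage.php", URL))  # same rule with a leading slash
```

Under this parser, the rule without the leading slash never matches, because request paths always start with /, while /apage.php matches apage.php?id=1 by plain prefix. That suggests checking the leading slash before reaching for a trailing *.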
Also, aren't the php?id=1 URLs and so on separate files from the .php file you disallowed?
Maybe you should disallow the directory that contains these files?
'googlebot only crawling what you disallowed'
you got me there.
We are coping with a similar problem. For over a year, robots.txt has disallowed our /cf/ directory, which we use to run a counting script for our advertiser links. This did not prevent the indexing (in Feb, I think) of over 50k links out to external sites. These now appear in site:oursite.com/cf/ as "pages", even though they resolve to those sites and not ours. About six weeks ago we changed robots.txt to *allow* that directory, but the bogus link pages remain.
After my inquiry, Google help wrote that we should disallow the /cf/ directory.
We cannot remove these links with Google's removal tool because the links exist and are correct; they are just wrongly indexed as "pages". Therefore I get the message "page still exists".
You should also set up a 301 redirect from non-www to www to avoid duplicate content. Additionally, make sure that all links that point to folders, or point to an index page inside a folder, do not include the actual filename. Make sure that the URL ends with a trailing / every time.
Be aware that Google treats:
domain.com/folder
domain.com/folder/
domain.com/folder/index.html
www.domain.com/folder
www.domain.com/folder/
www.domain.com/folder/index.html
as six different pages.
You want www.domain.com/folder/ to be the one that they actually list (because your server should redirect a request for folder to folder/ automatically anyway). Never include the actual index file filename. This will allow you to change your technology in the future without having to change any of the links at all.
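As a rough illustration of the advice above, the six variants can be collapsed to one form with a small normalizing function. This is a sketch under assumptions, not a server configuration: the function name canonical_url, the INDEX_FILES list, and the rule that a final path segment without a dot is a folder are all made up for the example. In practice the same mapping would be enforced by 301 redirects on the server.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of index filenames; adjust to match the server setup.
INDEX_FILES = {"index.html", "index.htm", "index.php"}


def canonical_url(url):
    """Map a URL variant to the www, trailing-slash, no-index-file form."""
    # Accept bare "domain.com/folder" input by assuming an http scheme.
    parts = urlsplit(url if "://" in url else "http://" + url)

    host = parts.netloc.lower()
    if not host.startswith("www."):
        host = "www." + host

    path = parts.path or "/"
    head, _, last = path.rpartition("/")
    if last in INDEX_FILES:
        # Drop the index filename and keep the folder with its slash.
        path = head + "/"
    elif last and "." not in last:
        # Assumption: a final segment without a dot names a folder.
        path = path + "/"

    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


variants = [
    "domain.com/folder",
    "domain.com/folder/",
    "domain.com/folder/index.html",
    "www.domain.com/folder",
    "www.domain.com/folder/",
    "www.domain.com/folder/index.html",
]
for v in variants:
    print(v, "->", canonical_url(v))
```

Running every internal link through a check like this before publishing is one way to catch stray index.html or non-www links that would otherwise be listed as separate pages.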