For example, if I disallow /anything/top.php, does that also disallow /anything/top.php?parameter=anything?
So far, search engines keep indexing parts of the site that I don't want them to!
I have multiple pages with query parameters and do NOT want them indexed.
How can I do that?
Thanks
Thing is, the info you've provided is a bit sketchy. So here's a sketchy answer (sorry)...
1.) If you want to keep SEs out of a directory containing "multiple pages with query parameters", disallow the entire directory. So your example...
disallow /anything/top.php
...should be written this way:
User-agent: *
Disallow: /anything
(Note: Disallow rules match by prefix, so if your directory name is also the start of other file or directory names on your site, add a trailing slash or you'll block those as well: /anything also blocks /anythinggoes.html, while /anything/ does not. See Search Engine World's excellent Robots.txt Tutorial [searchengineworld.com].)
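As a rough check of the trailing-slash point, the sketch below feeds both spellings of the rule to Python's standard-library urllib.robotparser. The is_allowed helper and the "AnyBot" user-agent name are made up for illustration, and real crawlers may differ from this parser in the details.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(rules, path, agent="AnyBot"):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


NO_SLASH = "User-agent: *\nDisallow: /anything\n"
WITH_SLASH = "User-agent: *\nDisallow: /anything/\n"

# Without the trailing slash, every path that begins with "/anything" matches.
print(is_allowed(NO_SLASH, "/anything/top.php"))    # False
print(is_allowed(NO_SLASH, "/anythinggoes.html"))   # False

# With the trailing slash, only paths inside the directory match.
print(is_allowed(WITH_SLASH, "/anything/top.php"))  # False
print(is_allowed(WITH_SLASH, "/anythinggoes.html")) # True
```

Under this parser the match is a plain "starts with" test on the path, which is why the unrelated file is caught by the first rule but not the second.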
2.) Some search engines require that you specify their UA by name. So you need to include more than one set of restrictions:
User-agent: *
Disallow: /anything
User-agent: ExampleBot
Disallow: /anything
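A related reason to repeat the rules under each named user agent: as the robots exclusion protocol is usually read, a crawler that finds a group addressed to it by name follows that group and ignores the * group. This is not necessarily the behaviour meant above, just a sketch of that one point using Python's urllib.robotparser. ExampleBot comes from the sample above; the /other/ directory and "SomeOtherBot" are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# ExampleBot gets its own group, which does NOT repeat the /anything rule.
RULES = """\
User-agent: *
Disallow: /anything

User-agent: ExampleBot
Disallow: /other/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# ExampleBot matches its named group, so the "*" rules no longer apply to it.
examplebot_blocked = not parser.can_fetch("ExampleBot", "/anything/top.php")

# A bot with no named group falls back to the "*" group.
otherbot_blocked = not parser.can_fetch("SomeOtherBot", "/anything/top.php")

print("ExampleBot blocked:", examplebot_blocked)
print("SomeOtherBot blocked:", otherbot_blocked)
```

In this sketch only SomeOtherBot is blocked, which is why the same Disallow line is listed under both groups in the example above.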
3.) Make sure your robots.txt document is up to snuff using Search Engine World's Robots.txt Validator [searchengineworld.com].
I am going to double-check my stuff again, but I have a hard time understanding why this happens.
In addition, it created a bunch of duplicates.
My robots.txt is valid.
>>pfui: I do not want to disallow the whole directory, just one file and its parameters, period.
And it does not seem to work.
followgreg, I can only suggest that you post part (not all) of your robots.txt file here, please. That way we'll be able to see precisely how you've written your instructions.
Please do NOT edit what you're using in your robots.txt other than to remove any info that will identify your site (per WW's TOS).
To make it easier to troubleshoot, please post the exact lines where you name the major crawlers and be sure to include some or all of the sample directories you want each one to ignore. (The directories should all be right under the lines where the SEs are named.) Thanks!
Here is the robots.txt
User-agent: *
Disallow: /cache/
Disallow: /editor/
Disallow: /media/
Disallow: /component/
Disallow: /components/com_any/go.php
Disallow: /forums_admin/
Disallow: /index2.php
All go.php and index2.php URLs are indexed with their parameters. Why? :(
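For what it's worth, the rules as posted do seem to cover those URLs with their parameters, at least as Python's standard-library urllib.robotparser reads them. In the sketch below the query strings and /index.php are invented for illustration (the thread never shows the real ones), and "Googlebot" is used only as a user-agent name; this says nothing certain about how Google's own parser behaves.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt exactly as posted in the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cache/
Disallow: /editor/
Disallow: /media/
Disallow: /component/
Disallow: /components/com_any/go.php
Disallow: /forums_admin/
Disallow: /index2.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URLs: the parameters are placeholders, not the poster's real ones.
SAMPLE_URLS = [
    "/components/com_any/go.php",
    "/components/com_any/go.php?id=1",
    "/index2.php?option=foo&page=2",
    "/index.php",
]

results = {url: parser.can_fetch("Googlebot", url) for url in SAMPLE_URLS}
for url, allowed in results.items():
    print(url, "->", "allowed" if allowed else "disallowed")
```

Only /index.php comes back as allowed here, which suggests the file itself is not the problem.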
Is there an .htaccess solution? I just don't want Google to crawl these links but it keeps on doing it!
Damn!
My robots.txt is in the root (/robots.txt), and on Google these are listed as URLs only. But that means that Google READS what I don't want it to read [...]
Check your log files to see if Google is actually fetching these disallowed files. If not, then it's simply collecting links (and maybe link text) to create those URL-only listings.
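A rough way to run that check, assuming the server writes Apache-style combined logs and that the crawler identifies itself with "Googlebot" in the user-agent field. The sample log lines, IP addresses, dates, and the find_disallowed_fetches helper are made up for illustration; the two prefixes are the Disallow lines in question.

```python
import re

# Path prefixes taken from the Disallow lines being investigated.
DISALLOWED_PREFIXES = ("/components/com_any/go.php", "/index2.php")

# Combined log format: "METHOD path PROTOCOL" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def find_disallowed_fetches(lines, bot="Googlebot"):
    """Return the disallowed paths that `bot` actually requested."""
    hits = []
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # not a combined-format line; skip it
        if bot.lower() not in match.group("agent").lower():
            continue  # request came from some other client
        if match.group("path").startswith(DISALLOWED_PREFIXES):
            hits.append(match.group("path"))
    return hits


# Invented sample lines; replace with lines read from the real access log.
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Oct/2005:13:55:36 +0000] "GET /index2.php?page=2 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.1 - - [10/Oct/2005:13:55:40 +0000] "GET /index.php HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '198.51.100.7 - - [10/Oct/2005:13:56:02 +0000] "GET /index2.php?page=3 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows; U)"',
]

print(find_disallowed_fetches(SAMPLE_LOG))
```

An empty list for the real log would point to the URL-only listings coming from links elsewhere rather than from actual fetches.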
Solution: Remove the Disallow rules for those pages from robots.txt so they can be fetched, then add an HTML robots meta tag with NOINDEX to each page.
Jim
However, there is a fuzzy boundary.
robots.txt tells spiders/crawlers not to GET a certain URL; however, the standard says nothing about whether they may still include that URL in their indexes.
The meta tags can tell spiders/crawlers not to index a page, not to follow its links, or not to keep a cached copy. However, to read these meta tags the spider/crawler has to be allowed to GET the page in the first place.
If you have the meta tags in the page but also disallow it in robots.txt, then the spider/crawler will never even read the meta tags, because you've told it not to request the page. Hence the situation can arise that you have a noindex tag in your page, but the URL still appears in their index.
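The interaction can be sketched in a few lines of Python. This models a well-behaved crawler only in outline: it consults robots.txt first (via the standard-library urllib.robotparser) and looks for a robots meta tag only if it is allowed to fetch the page. The page markup, the URL, the "AnyBot" name, and the function and class names are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives of any <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def directives_seen_by_crawler(robots_txt, agent, path, page_html):
    """Return the meta directives the crawler reads, or None if it may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(agent, path):
        return None  # page is never requested, so its meta tags are never read
    finder = RobotsMetaFinder()
    finder.feed(page_html)
    return finder.directives


PAGE = (
    '<html><head><meta name="robots" content="noindex,follow"></head>'
    "<body></body></html>"
)
BLOCKING = "User-agent: *\nDisallow: /index2.php\n"
PERMITTING = "User-agent: *\nDisallow: /cache/\n"

# Disallowed in robots.txt: the noindex tag is invisible to the crawler.
print(directives_seen_by_crawler(BLOCKING, "AnyBot", "/index2.php?page=2", PAGE))

# Fetch permitted: the crawler can read and honour the noindex tag.
print(directives_seen_by_crawler(PERMITTING, "AnyBot", "/index2.php?page=2", PAGE))
```

The first call returns None because the page is never fetched; the second returns the directives found in the tag.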
Another common situation is that a page gets indexed and is only afterwards disallowed in robots.txt.
This means that the search engine's index contains a page that is now disallowed. There is no standard that says the engine should then remove the page from its index.