Robots.txt help

Forum Moderators: goodroi

Ok, maybe I did not understand how it works

followgreg

1:06 am on Jan 24, 2006 (gmt 0)

I'm not sure I understand what's wrong with my robots.txt, please help:

For example, if I disallow /anything/top.php
does it also disallow /anything/top.php?parameter=anything ?

Because so far search engines keep on indexing parts of what I don't want them to!

I have multiple pages with query parameters and do NOT want them indexed.

How can I do this?

Thanks

followgreg

7:17 am on Jan 27, 2006 (gmt 0)

apparently I'm alone on this one :)

Dijkgraaf

9:33 pm on Jan 27, 2006 (gmt 0)

Your understanding is right: it should disallow any URI starting with the string given in the Disallow line.
Have you run your robots.txt through a validator?
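
A quick way to sanity-check the prefix rule locally is Python's standard urllib.robotparser. This is only a sketch of how one standards-following parser reads the rule; the paths are the placeholder examples from the question, the is_allowed helper is a made-up name, and real search engine crawlers may behave differently:

```python
from urllib.robotparser import RobotFileParser

# The rule from the question; the paths are the thread's placeholder examples.
ROBOTS_TXT = """\
User-agent: *
Disallow: /anything/top.php
"""


def is_allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/anything/top.php",
        "http://www.example.com/anything/top.php?parameter=anything",
        "http://www.example.com/anything/other.php",
    ):
        print(url, "allowed" if is_allowed(ROBOTS_TXT, url) else "disallowed")
```

With this parser, the URL with the query string is treated as starting with /anything/top.php, so it falls under the same Disallow line.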

Pfui

9:45 pm on Jan 27, 2006 (gmt 0)

Different search engines do things differently -- some don't touch files with a ? in the path, others didn't but now do, etc. And, of course, others totally ignore robots.txt.

Thing is, the info you've provided is a bit sketchy. So here's a sketchy answer (sorry)...

1.) If you want to keep SEs out of a directory containing "multiple pages with query parameters", disallow the entire directory. So your example...

disallow /anything/top.php

...should be written this way:

User-agent: *
Disallow: /anything

(Note: If your directory shares any word common to files located in other parts of your site, add a trailing slash or else you'll disallow the other files: /anything will also block /anythinggoes.html. See Search Engine World's excellent Robots.txt Tutorial [searchengineworld.com].)

2.) Some search engines require that you specify their UA by name. So you need to include more than one set of restrictions:

User-agent: *
Disallow: /anything

User-agent: ExampleBot
Disallow: /anything

3.) Make sure your robots.txt document is up to snuff using Search Engine World's Robots.txt Validator [searchengineworld.com].
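
The trailing-slash note in point 1 can be checked the same way. A minimal sketch using Python's standard urllib.robotparser, with the thread's placeholder paths; the blocked helper is a hypothetical name, and this shows only how that one parser matches prefixes:

```python
from urllib.robotparser import RobotFileParser


def blocked(disallow_path: str, url_path: str, agent: str = "*") -> bool:
    """Return True if a single 'Disallow: <disallow_path>' rule blocks url_path."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_path}"])
    return not parser.can_fetch(agent, url_path)


if __name__ == "__main__":
    # Without the trailing slash the rule is a bare string prefix,
    # so it also catches unrelated files that merely start the same way.
    print(blocked("/anything", "/anythinggoes.html"))
    # With the trailing slash only the directory's contents match.
    print(blocked("/anything/", "/anythinggoes.html"))
    print(blocked("/anything/", "/anything/top.php"))
```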

followgreg

11:31 pm on Jan 30, 2006 (gmt 0)

You know guys it drives me totally nuts...Google has indexed ALL pages that I disallowed :(

I am going to double check my stuff again, but I have a hard time understanding why.
In addition, it created a bunch of duplicates.

My robots.txt is valid.

>>pfui: I do not want to disallow the whole directory, but simply a file and its parameters, period.
And it does not seem to work.

Dijkgraaf

12:22 am on Jan 31, 2006 (gmt 0)

Well I'm currently setting up a series of pages that will test various behaviours of robots/crawlers in regard to parameters, redirects and disallows.
I'll post back in here sometime to let people know if I find anything interesting.

Pfui

1:24 am on Jan 31, 2006 (gmt 0)

Curiouser and curiouser.

followgreg, I can only suggest that you post part (not all) of your robots.txt file here, please. That way we'll be able to see precisely how you've written your instructions.

Please do NOT edit what you're using in your robots.txt other than to remove any info that will identify your site (per WW's TOS).

To make it easier to troubleshoot, please post the exact lines where you name the major crawlers and be sure to include some or all of the sample directories you want each one to ignore. (The directories should all be right under the lines where the SEs are named.) Thanks!

gregorym

1:08 am on Feb 9, 2006 (gmt 0)

I cannot believe it, Google keeps on indexing pages that I have forbidden :(

Here is the robots.txt

User-agent: *
Disallow: /cache/
Disallow: /editor/
Disallow: /media/
Disallow: /component/
Disallow: /components/com_any/go.php
Disallow: /forums_admin/
Disallow: /index2.php

All go.php and index2.php URLs are indexed with their parameters. Why? :(

Is there an .htaccess solution? I just don't want Google to crawl these links but it keeps on doing it!

Damn!

jdMorgan

1:12 am on Feb 9, 2006 (gmt 0)

Let me ask what may seem to be an odd question...

How are these pages listed in Google's search results?

Are they listed as pages with full titles and descriptions or a snippet, or are they listed as URL-only, perhaps with link-text as the title?

Jim

Dijkgraaf

8:44 am on Feb 9, 2006 (gmt 0)

Is your robots.txt file in the root folder of your website?
e.g. if you were to go to http://www.example.com/robots.txt where example.com is your domain, do you see the contents of your robots.txt?

Note: It can NOT be a URL like
http://www.example.com/~myid/robots.txt

followgreg

11:47 pm on Feb 9, 2006 (gmt 0)

My robots.txt is in the root (/robots.txt), and on Google these are listed as URLs only.

But that means that Google READS what I don't want it to read. I have no clue how to keep these pages private.

Is there any mod_rewrite solution, as I asked above?

Dijkgraaf

11:55 pm on Feb 9, 2006 (gmt 0)

If they are URLs only, then Google has NOT read those pages. What it has read is pages that have links to those pages, and hence it has recorded those URLs.

Those URLs will only come up in a site: or similar search and not in a standard search.

jdMorgan

12:03 am on Feb 10, 2006 (gmt 0)

> My robots.txt is in the root /robots.txt and on Google these are listed as URL's only.
> But that means that Google READS what I don't want it to read [...]

No, it doesn't. Google will list a page as URL-only if it finds a link anywhere pointing to that page. The link could be on one of your allowed pages, or even on another site. That's why I asked the question.

Check your log files to see if Google is actually fetching these disallowed files. If not, then it's simply collecting links -and maybe link text- to create those URL-only listings.
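
One rough way to run that log check is sketched below. It assumes the server writes Apache "combined" format access logs and that the crawler identifies itself with "Googlebot" in the User-Agent (which can be spoofed); the prefixes are the two files from the robots.txt posted above, and googlebot_hits is a made-up helper name:

```python
import re

# Disallowed prefixes taken from the robots.txt posted in this thread.
DISALLOWED = ("/components/com_any/go.php", "/index2.php")

# Assumes Apache "combined" log format; adjust the pattern for other formats.
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines, prefixes=DISALLOWED):
    """Return requested paths where a Googlebot user agent fetched a disallowed prefix."""
    hits = []
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # line is not in the assumed format
        if "Googlebot" in match.group("agent") and match.group("path").startswith(prefixes):
            hits.append(match.group("path"))
    return hits


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python check_log.py access.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for path in googlebot_hits(log):
            print(path)
```

An empty result over a reasonable time span would suggest the URL-only listings come from links found elsewhere rather than from fetching the disallowed files.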

Solution: Allow those pages in robots.txt, then add HTML robots NOINDEX tags to the pages.

Jim

marketingmagic

8:28 pm on Feb 10, 2006 (gmt 0)

I've seen similar problems in both Yahoo and MSN. They list pages, directories, and even development subdomains - which are not linked from anywhere public.

Bottom line? I guess none of the engines really listen to robots.txt or meta noindex tags, despite telling us they do.

Dijkgraaf

9:20 pm on Feb 10, 2006 (gmt 0)

marketingmagic
No, they do pay attention to robots.txt and meta tags (or at least the well-behaved ones do).

However, there is a fuzzy boundary.

robots.txt tells spiders/crawlers not to GET a certain URL; however, the standard says nothing about them not including that URL in their indexes.

The meta tags can tell spiders/crawlers not to index a page, follow its links, or keep a cached copy of it. However, to read these meta tags the spider/crawler has to be allowed to GET the page in the first place.

If you have the meta tags in the page but also disallow it in robots.txt, then the spider/crawler will never even read the meta tags, as you've told it not to request the page. Hence the situation can arise that you have a noindex tag in your page, but the URL will appear in their index.

Another situation that is common is that a page gets indexed, and then afterwards it is disallowed in robots.txt.
This means that the search engine's index contains a page that is now disallowed. There is no standard to say that they should then remove this page from their index.
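
The interaction described above can be modelled in a few lines. This is a simplified sketch of a well-behaved crawler, not how any particular search engine is implemented; it uses Python's standard urllib.robotparser and html.parser, and the names RobotsMetaParser and crawl_outcome are made up for the illustration:

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Records whether a <meta name="robots"> tag contains "noindex"."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
            if "noindex" in directives:
                self.noindex = True


def crawl_outcome(robots_txt: str, path: str, html: str, agent: str = "*") -> str:
    """Describe what a polite crawler would do with one page."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(agent, path):
        # The page is never requested, so its meta tags are never read.
        return "not fetched: meta tags never seen, URL-only listing still possible"
    meta = RobotsMetaParser()
    meta.feed(html)
    if meta.noindex:
        return "fetched: noindex seen, page kept out of the index"
    return "fetched: page may be indexed"


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
    print(crawl_outcome("User-agent: *\nDisallow: /index2.php\n", "/index2.php?id=1", page))
    print(crawl_outcome("User-agent: *\nDisallow: /cache/\n", "/index2.php?id=1", page))
```

The first call shows the trap: the noindex tag is in the page, but the Disallow line keeps the crawler from ever reading it.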
