Forum Moderators: goodroi

Robots.txt help

Ok, maybe I did not understand how it works

         

followgreg

1:06 am on Jan 24, 2006 (gmt 0)

10+ Year Member




Not sure I understand what's wrong with my robots.txt, please help:

For example, if I disallow /anything/top.php
does it also disallow /anything/top.php?parameter=anything

?

Because so far search engines keep on indexing parts of what I don't want them to!

I have multiple pages with query parameters and do NOT want them indexed.

How can I do this?

Thanks

followgreg

7:17 am on Jan 27, 2006 (gmt 0)

10+ Year Member



apparently I'm alone on this one :)

Dijkgraaf

9:33 pm on Jan 27, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Your understanding is right: it should disallow any URI that starts with the string given in the Disallow line.
Have you run your robots.txt through a validator?
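To make the prefix rule concrete, here is a minimal sketch using Python's standard-library urllib.robotparser. The rule comes from the question; the "ExampleBot" user agent, the www.example.com host and the other.php file are assumptions for illustration only. It shows how this one parser reads the rule, not how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# Rule taken from the question; everything else here is illustrative.
ROBOTS_TXT = """\
User-agent: *
Disallow: /anything/top.php
"""

def build_parser(robots_txt):
    """Parse robots.txt text directly, without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
base = "http://www.example.com"

# The bare file matches the Disallow prefix.
blocked_plain = not parser.can_fetch("ExampleBot", base + "/anything/top.php")
# The same file with a query string still starts with that prefix.
blocked_query = not parser.can_fetch("ExampleBot", base + "/anything/top.php?parameter=anything")
# A different file in the same directory is not covered by the rule.
blocked_other = not parser.can_fetch("ExampleBot", base + "/anything/other.php")

print("top.php blocked:", blocked_plain)
print("top.php?parameter=anything blocked:", blocked_query)
print("other.php blocked:", blocked_other)
```

Because the match is a plain "starts with" test, one Disallow line for the file is enough to cover every parameter variation of it.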

Pfui

9:45 pm on Jan 27, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Different search engines do things differently -- some don't touch files with ? in the path, others didn't but now do, etc. And, of course, others totally ignore robots.txt.

Thing is, the info you've provided is a bit sketchy. So here's a sketchy answer (sorry)...

1.) If you want to keep SEs out of a directory containing "multiple pages with query parameters", disallow the entire directory. So your example...

disallow /anything/top.php

...should be written this way:

User-agent: *
Disallow: /anything

(Note: If your directory shares any word common to files located in other parts of your site, add a trailing slash or else you'll disallow the other files: /anything will also block /anythinggoes.html. See Search Engine World's excellent Robots.txt Tutorial [searchengineworld.com].)

2.) Some search engines require that you specify their UA by name. So you need to include more than one set of restrictions:

User-agent: *
Disallow: /anything

User-agent: ExampleBot
Disallow: /anything

3.) Make sure your robots.txt document is up to snuff using Search Engine World's Robots.txt Validator [searchengineworld.com].
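Points 1 and 2 above can be checked with a rough sketch, again using Python's standard-library urllib.robotparser. The bot names "SomeOtherBot" and "ExampleBot" and the /other path are made up for the demonstration. The second part assumes the usual reading of the robots.txt convention, where a crawler that finds a group naming it follows that group instead of the * group; individual crawlers may differ.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, user_agent, path):
    """Return True if this parser would let user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

# Point 1: the trailing slash decides whether sibling files are caught too.
NO_SLASH = "User-agent: *\nDisallow: /anything\n"
WITH_SLASH = "User-agent: *\nDisallow: /anything/\n"

print(allowed(NO_SLASH, "SomeOtherBot", "/anythinggoes.html"))    # caught by the prefix
print(allowed(WITH_SLASH, "SomeOtherBot", "/anythinggoes.html"))  # no longer caught
print(allowed(WITH_SLASH, "SomeOtherBot", "/anything/top.php"))   # directory still blocked

# Point 2: a bot with its own group ignores the * group in this parser,
# so a rule that should apply to it has to be repeated in its group.
SPLIT_GROUPS = """\
User-agent: *
Disallow: /anything

User-agent: ExampleBot
Disallow: /other
"""

print(allowed(SPLIT_GROUPS, "SomeOtherBot", "/anything/top.php"))  # falls back to *
print(allowed(SPLIT_GROUPS, "ExampleBot", "/anything/top.php"))    # only its own group applies
```

That second result is why the example in point 2 repeats the same Disallow line under each User-agent.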

followgreg

11:31 pm on Jan 30, 2006 (gmt 0)

10+ Year Member



You know guys, it drives me totally nuts... Google has indexed ALL pages that I disallowed :(

I am going to double-check my stuff again, but I have a hard time understanding why.
In addition, it created a bunch of duplicates.

My robots.txt is valid.

>>pfui: I do not want to disallow the whole directory but simply a file and its parameters, period.
And it does not seem to work.

Dijkgraaf

12:22 am on Jan 31, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well I'm currently setting up a series of pages that will test various behaviours of robots/crawlers in regard to parameters, redirects and disallows.
I'll post back in here sometime to let people know if I find anything interesting.

Pfui

1:24 am on Jan 31, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Curiouser and curiouser.

followgreg, I can only suggest that you post part (not all) of your robots.txt file here, please. That way we'll be able to see precisely how you've written your instructions.

Please do NOT edit what you're using in your robots.txt other than to remove any info that will identify your site (per WW's TOS).

To make it easier to troubleshoot, please post the exact lines where you name the major crawlers and be sure to include some or all of the sample directories you want each one to ignore. (The directories should all be right under the lines where the SEs are named.) Thanks!

gregorym

1:08 am on Feb 9, 2006 (gmt 0)

10+ Year Member




I cannot believe it, Google keeps on indexing pages that I forbid :(

Here is the robots.txt

User-agent: *
Disallow: /cache/
Disallow: /editor/
Disallow: /media/
Disallow: /component/
Disallow: /components/com_any/go.php
Disallow: /forums_admin/
Disallow: /index2.php

All go.php and index2.php URLs are indexed with their parameters. Why? :(

Is there an .htaccess solution? I just don't want Google to crawl these links but it keeps on doing it!

Damn!

jdMorgan

1:12 am on Feb 9, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Let me ask what may seem to be an odd question...

How are these pages listed in Google's search results?

Are they listed as pages with full titles and descriptions or a snippet, or are they listed as URL-only, perhaps with link-text as the title?

Jim

Dijkgraaf

8:44 am on Feb 9, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Is your robots.txt file in the root folder of your website?
e.g. if you go to http://www.example.com/robots.txt, where example.com is your domain, do you see the contents of your robots.txt?

Note: It can NOT be a URL like
http://www.example.com/~myid/robots.txt
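As a small illustration of that location rule, the sketch below derives where a crawler would look for robots.txt given any page URL on the site. The function name robots_txt_url and the sample page URL are my own, not anything from a crawler.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Build the root-level robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only scheme and host; the path is always /robots.txt at the root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

# Even for a page under /~myid/, the file that counts is at the root of the host.
print(robots_txt_url("http://www.example.com/~myid/page.html"))
```

A robots.txt sitting inside /~myid/ is never requested under this rule, so its contents have no effect.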

followgreg

11:47 pm on Feb 9, 2006 (gmt 0)

10+ Year Member




My robots.txt is in the root (/robots.txt), and on Google these pages are listed as URLs only.

But that means that Google READS what I don't want it to read. I have no clue how to keep these pages private.

Is there any mod_rewrite solution, as I asked above?

Dijkgraaf

11:55 pm on Feb 9, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If they are URLs only, then Google has NOT read those pages. What it has read is pages that link to those pages, and hence it has recorded those URLs.

Those URLs will only come up in a site: or similar search and not in a standard search.

jdMorgan

12:03 am on Feb 10, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



My robots.txt is in the root /robots.txt and on Google these are listed as URLs only.

But that means that Google READS what I don't want it to read [...]


No, it doesn't. Google will list a page as URL-only if it finds a link anywhere pointing to that page. The link could be on one of your allowed pages, or even on another site. That's why I asked the question.

Check your log files to see if Google is actually fetching these disallowed files. If not, then it's simply collecting links (and maybe link text) to create those URL-only listings.
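A rough way to do that log check, assuming the server writes Apache-style combined log lines. The sample lines, IP addresses and the helper name googlebot_hits are invented for illustration; the two prefixes are the ones from the robots.txt posted earlier. User-agent strings can be faked, so treat a hit as a lead to look into rather than proof.

```python
import re

# Combined log format: quoted request line, status, size, referer, user agent.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Prefixes copied from the Disallow lines in question.
DISALLOWED_PREFIXES = ["/components/com_any/go.php", "/index2.php"]

def googlebot_hits(log_lines, prefixes=DISALLOWED_PREFIXES):
    """Return requested paths where a Googlebot user agent fetched a disallowed prefix."""
    hits = []
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue
        path = match.group("path")
        if any(path.startswith(prefix) for prefix in prefixes):
            hits.append(path)
    return hits

# Made-up sample lines standing in for a real access log.
SAMPLE = [
    '192.0.2.1 - - [09/Feb/2006:01:00:00 +0000] "GET /index.php HTTP/1.1" 200 5120 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.1 - - [09/Feb/2006:01:00:05 +0000] "GET /index2.php?option=x HTTP/1.1" 200 2048 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '198.51.100.7 - - [09/Feb/2006:01:00:09 +0000] "GET /index2.php?option=y HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]

print(googlebot_hits(SAMPLE))
```

An empty result over a real log would support the idea that the URL-only listings come from links alone.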

Solution: Allow those pages in robots.txt, then add HTML robots NOINDEX tags to the pages.
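To confirm that a page really carries the tag once it has been added, something like the following sketch can be run against the page's HTML. It uses Python's standard-library html.parser; the class name, the helper has_noindex and the sample markup are assumptions for illustration.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated and case-insensitive.
        for directive in content.split(","):
            if directive.strip():
                self.directives.add(directive.strip().lower())

def has_noindex(html):
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives

# Hypothetical page head with the tag in place.
PAGE = '<html><head><meta name="robots" content="NOINDEX, FOLLOW"><title>go</title></head><body></body></html>'

print(has_noindex(PAGE))
print(has_noindex("<html><head><title>go</title></head></html>"))
```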

Jim

marketingmagic

8:28 pm on Feb 10, 2006 (gmt 0)

10+ Year Member



I've seen similar problems in both Yahoo and MSN. They list pages, directories, and even development subdomains - which are not linked from anywhere public.

Bottom line? I guess none of the engines really listen to robots.txt or meta noindex tags, despite telling us they do.

Dijkgraaf

9:20 pm on Feb 10, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



marketingmagic,
No, they do pay attention to robots.txt and meta tags (or at least the well-behaved ones do).

However, there is a fuzzy boundary.

robots.txt tells spiders/crawlers not to GET a certain URL; however, the standard says nothing about then keeping that URL out of their indexes.

The meta tags can tell spiders/crawlers not to index a page, follow its links, or keep a cached copy of it. However, to read these meta tags the spider/crawler has to be allowed to GET the page in the first place.

If you have the meta tags in the page but also disallow it in robots.txt, then the spider/crawler will never read the meta tags, because you've told it not to request the page. Hence the situation can arise that you have a noindex tag in your page, but the URL still appears in their index.
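The interaction just described can be written down as a tiny decision function. This is only a simplified model of the behaviour discussed in this thread, assuming a well-behaved crawler; the function name and the outcome labels are made up, and real engines apply many more rules.

```python
def listing_outcome(linked_from_elsewhere, disallowed_in_robots, has_noindex_tag):
    """Model how a well-behaved crawler might end up listing a page.

    Returns one of: "not listed", "url-only", "indexed".
    """
    if not linked_from_elsewhere:
        # The crawler never learns that the URL exists.
        return "not listed"
    if disallowed_in_robots:
        # The page is never fetched, so a noindex tag on it is never seen;
        # the URL itself is still known from the links pointing to it.
        return "url-only"
    if has_noindex_tag:
        # The page is fetched, the tag is read, and the page is left out.
        return "not listed"
    return "indexed"

# Disallowed AND tagged noindex: the tag cannot help, the URL can still appear.
print(listing_outcome(True, True, True))
# Allowed and tagged noindex: the crawler reads the tag and drops the page.
print(listing_outcome(True, False, True))
```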

Another common situation is that a page gets indexed and is only afterwards disallowed in robots.txt.
This means that the search engine's index contains a page that is now disallowed. There is no standard that says they should then remove this page from their index.