robots.txt with ? in URL

Forum Moderators: goodroi

jozomannen

10:52 am on Jul 2, 2005 (gmt 0)

I use this in my robots.txt:

User-agent: *

Disallow: /page.php

But pages like page.php?id=123 are still getting indexed. What should I do?

kservik

4:12 am on Jul 3, 2005 (gmt 0)

I have the same issue. I tried using "*", but I understand this is not supported by every search engine.

jozomannen

9:45 am on Jul 3, 2005 (gmt 0)

Will it help if I change it to this?

Disallow: /page.php?

kservik

11:55 am on Jul 3, 2005 (gmt 0)

Try this:

Disallow: /page.php?*

This is supposed to work with Googlebot.

Kim

jozomannen

6:09 pm on Jul 3, 2005 (gmt 0)

Thank you, I'll try that out. How long do you think it will take until I see the pages deindexed?

topr8

7:55 pm on Jul 3, 2005 (gmt 0)

>>Disallow: /page.php?*

This is invalid markup. I don't know whether it works for Google as said above, but it may make the line completely invalid for other spiders.

kservik

9:07 pm on Jul 3, 2005 (gmt 0)

Yes, that is probably correct. I don't know if it works for Yahoo/MSN, but it is supposed to work for Google.

Dijkgraaf

10:57 pm on Jul 3, 2005 (gmt 0)

Disallow: /page.php should be disallowing page.php?id=123 for all bots.
1) Which search bot is not obeying it?
2) Have you validated your robots.txt file? See link at top of forum.
3) Is your page.php in your root folder?
If not, you need to use Disallow: /dir/page.php
4) Is your robots.txt file in your root folder?
5) Are the bots actually indexing the contents of those pages? I.e., are you seeing a title and snippet, or are you just seeing the URL listed?
Even if you disallow something in robots.txt, the URL can still get listed, as all robots.txt does is tell the search engine bot not to fetch it. However, that URL will only come up if you search for particular URLs.
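
One way to check the first point locally is to feed the same two lines to a robots.txt parser. The sketch below uses Python's standard-library urllib.robotparser, which is only one parser and not any search engine's own implementation; the bot name "ExampleBot" and the helper name allowed are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt from the first post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /page.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(path, bot="ExampleBot"):
    """Return True if this parser would let `bot` fetch `path`."""
    return parser.can_fetch(bot, path)


print(allowed("/page.php"))         # False
print(allowed("/page.php?id=123"))  # False: the rule matches as a prefix
print(allowed("/other.php"))        # True
```

If a check like this reports the URL as blocked and it still shows up in results, the listing is more likely the URL-only kind described in point 5 than a crawl of the page itself.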

jozomannen

8:13 am on Jul 4, 2005 (gmt 0)

>>5) Are the bots actually indexing the contents of those pages? I.e., are you seeing a title and snippet, or are you just seeing the URL listed?
>>Even if you disallow something in robots.txt, the URL can still get listed, as all robots.txt does is tell the search engine bot not to fetch it. However, that URL will only come up if you search for particular URLs.

I guess that's the problem. The robots.txt validates, and it's Google I have problems with. But it's not crawling the pages, just showing the URLs in a site: search.

Maybe I can remove them from the site: search by using this meta tag: <META NAME="robots" CONTENT="noindex, nofollow, noarchive" />?

Dijkgraaf

2:28 am on Jul 5, 2005 (gmt 0)

Yes, you could put those meta tags in, but then you would have to remove those files from your robots.txt, otherwise Googlebot will never fetch the page and read those meta tags.
It is all a matter of which is the lesser evil:
having those URLs show up if you do a search for all URLs on a site, or having a search bot fetch the page (and any issues this may cause).
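
The trade-off can be written down as a tiny model. This is a deliberately simplified sketch of the reasoning in this thread, not any search engine's documented logic; the names RobotsMetaParser and listing_outcome and the three outcome labels are invented for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of any <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <META NAME=...> works.
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def listing_outcome(blocked_by_robots_txt, html):
    """Simplified model: a blocked page is never fetched, so its meta tags are never read."""
    if blocked_by_robots_txt:
        return "url-only listing possible"
    meta = RobotsMetaParser()
    meta.feed(html)
    if "noindex" in meta.directives:
        return "not listed"
    return "indexed"


page = '<html><head><META NAME="robots" CONTENT="noindex, nofollow, noarchive" /></head></html>'
print(listing_outcome(True, page))   # url-only listing possible
print(listing_outcome(False, page))  # not listed
```

The point of the model is the first branch: with the robots.txt block in place, the noindex tag has no way to take effect.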

jozomannen

9:50 am on Jul 5, 2005 (gmt 0)

I'll try that with the meta tags. It's just a test site anyway.

Reid

4:23 am on Jul 6, 2005 (gmt 0)

user-agent: *
disallow: /page.php?*

is invalid. Never use * in the disallow field of a record whose user-agent field is *, because most bots will see the line as invalid.

googlebot DOES allow it though

user-agent: googlebot
disallow: /page.php?*
is perfectly valid but pointless, since disallow: /page.php? would have the EXACT same effect. A more useful application of the wildcard (for googlebot) would be something like disallow: /*? which would disallow any URL containing "?".

The original robots.txt should work for all bots, but it cannot prevent a listing when a bot finds an inbound link (IBL) to the URL in question. In that case the listing would be URL-only and never go beyond that.
If the IBL is removed, the URL-only listing should also disappear within a few crawls.
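
To see why the trailing * adds nothing, here is a minimal sketch of wildcard matching as described in this post: * stands for any run of characters and the pattern only has to match from the start of the path. The function name wildcard_disallows is hypothetical, and this is an approximation for illustration, not Googlebot's actual matcher.

```python
import re


def wildcard_disallows(pattern, path):
    """Return True if `path` matches `pattern`, treating * as 'any run of characters'."""
    # Escape each literal piece, then join the pieces with ".*" in place of every "*".
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # re.match anchors at the start only, which keeps the prefix behaviour of robots.txt.
    return re.match(regex, path) is not None


paths = ["/page.php", "/page.php?id=2", "/dir/list.html?sort=asc", "/index.html"]
for path in paths:
    print(
        path,
        wildcard_disallows("/page.php?", path),
        wildcard_disallows("/page.php?*", path),
        wildcard_disallows("/*?", path),
    )
```

For these sample paths the first two patterns give identical answers, while /*? also catches /dir/list.html?sort=asc.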

kservik

5:44 am on Jul 6, 2005 (gmt 0)

So what you are saying is that:

/page.php?

Will keep msnbot and inktomi out too?

Reid

7:05 am on Jul 6, 2005 (gmt 0)

/page.php?

will disallow any URL beginning with /page.php?

it will not disallow /page.php but it will disallow /page.php?id=2, /page.php?id=3, etc.

robots.txt is based on prefix matching, meaning that any URL that starts with the prefix /page.php? will be disallowed.

if you disallow /page.php
then it will disallow /page.php and /page.php?id=2, /page.php?id=3, etc. because they all start with the prefix /page.php
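
The difference between the two prefixes is easy to see with a plain string comparison. This is only a sketch of the prefix rule described above; the helper name is_disallowed and the sample paths are made up for illustration.

```python
def is_disallowed(prefix, path):
    """Prefix matching: a path is disallowed when it starts with the rule's value."""
    return path.startswith(prefix)


paths = ["/page.php", "/page.php?id=2", "/page.php?id=3", "/pages.html"]

for prefix in ("/page.php?", "/page.php"):
    blocked = [p for p in paths if is_disallowed(prefix, p)]
    print(prefix, "->", blocked)

# /page.php? -> ['/page.php?id=2', '/page.php?id=3']
# /page.php -> ['/page.php', '/page.php?id=2', '/page.php?id=3']
```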

googlebot can use the wildcard
/*.php will disallow all .php files in every directory because you are essentially saying
disallow: /(any text string).php
all files with a .php extension would be matched but files with a different extension such as .html would not match.

MSN and Inktomi do not allow the wildcard * in the disallow field but they do obey the user-agent: * (so does googlebot)

Reid

7:31 am on Jul 6, 2005 (gmt 0)

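The robots.txt standard treats user-agent: * as the default for any bot that has not matched another record, so a bot should obey the record that names it and fall back to the wildcard record only when nothing names it. Not every parser is that careful, though: a simple one may just take the first record that matches.

example:

user-agent: *
disallow: /page.php?

user-agent: googlebot
disallow: /otherpage.php

A first-match parser would apply the user-agent: * record to googlebot here because it comes first, while a parser that prefers the specific record would apply the googlebot one.
When you have multiple records in robots.txt naming multiple user-agents, user-agent: * should be the last entry, so that it only picks up bots that did not match a user-agent: string.
Remember that only googlebot can handle * in the disallow field.
In most cases a simple robots.txt with user-agent: * will give one command to all bots, but in some cases it is useful to give different bots different directives.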

user-agent: somebot
disallow: /

user-agent: someotherbot
disallow: /

user-agent: googlebot
disallow: /cgi-bin
disallow: /*?

user-agent: *
disallow: /cgi-bin

I don't have a problem with any search engine other than googlebot indexing URLs containing "?", so I gave it a special directive.
I want 'em all out of my cgi-bin, and those first 2 bots are wasting my bandwidth, but they do obey robots.txt.
(just a crude example)
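
How record order is handled depends on the parser, so it is worth testing with whatever tool is at hand. As one data point, the sketch below feeds the two-record example from earlier in this post to Python's standard-library urllib.robotparser. That module is just one parser and says nothing certain about how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# The two-record example, with the wildcard record placed first.
ROBOTS_TXT = """\
user-agent: *
disallow: /page.php?

user-agent: googlebot
disallow: /otherpage.php
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# This parser keeps the * record as a fallback and tries named records first.
print(parser.can_fetch("googlebot", "/otherpage.php"))  # False
print(parser.can_fetch("googlebot", "/page.php?id=2"))  # True
```

Here the googlebot record is applied even though user-agent: * comes first in the file. Putting user-agent: * last costs nothing and avoids depending on that behaviour.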

kservik

11:17 am on Jul 6, 2005 (gmt 0)

Thanks, Reid. I learned a lot from this thread!

:-)
Kim