
Forum Moderators: goodroi


robots.txt with ? in URL

10:52 am on Jul 2, 2005 (gmt 0)

10+ Year Member



I use this in my robots.txt:

User-agent: *

Disallow: /page.php

But pages like page.php?id=123 still get indexed. What should I do?

4:12 am on Jul 3, 2005 (gmt 0)

10+ Year Member



I have the same problem. I tried using "*", but I understand this is not supported by every search engine.
9:45 am on Jul 3, 2005 (gmt 0)

10+ Year Member



Will it help if I change it to

Disallow: /page.php?

?

11:55 am on Jul 3, 2005 (gmt 0)

10+ Year Member



Try this:

Disallow: /page.php?*

Supposedly this will work with Googlebot.

Kim

6:09 pm on Jul 3, 2005 (gmt 0)

10+ Year Member



Thank you, I'll try that out. How long do you think it will take until the pages are deindexed?
7:55 pm on Jul 3, 2005 (gmt 0)

WebmasterWorld Senior Member topr8 is a WebmasterWorld Top Contributor of All Time 10+ Year Member



>>Disallow: /page.php?*

this is invalid markup. i don't know if it works for google as said above, but it may make the line completely invalid for other spiders.

9:07 pm on Jul 3, 2005 (gmt 0)

10+ Year Member



Yes, that is probably correct. I don't know if it works for Yahoo/MSN, but it is supposed to work for Google.
10:57 pm on Jul 3, 2005 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Disallow: /page.php should already be disallowing page.php?id=123 for all bots.
1) Which search bot is not obeying it?
2) Have you validated your robots.txt file? See link at top of forum.
3) Is your page.php in your root folder?
If not, you need to Disallow: /dir/page.php
4) Is your robots.txt file in your root folder?
5) Are the bots actually indexing the contents of those pages? i.e. are you seeing a title and snippet? Or are you just seeing the URL listed?
Even if you disallow something in robots.txt, the URL can still get listed, as all robots.txt does is tell the search engine bot not to fetch it. However, that URL will only come up if you do a search for particular URLs.
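To double-check the claim that Disallow: /page.php also covers the query-string variants, here's a quick sketch using Python's standard-library robots.txt parser, which implements the same prefix matching as the original spec. The bot name and host are placeholders, not anything from this thread:

```python
# Sketch: the stdlib parser uses plain prefix matching, so a rule for
# /page.php also blocks /page.php?id=123. "SomeBot" and example.com
# are placeholder names for illustration.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /page.php",
])

print(rp.can_fetch("SomeBot", "http://example.com/page.php"))         # False (blocked)
print(rp.can_fetch("SomeBot", "http://example.com/page.php?id=123"))  # False (blocked)
print(rp.can_fetch("SomeBot", "http://example.com/index.html"))       # True (allowed)
```

If this returns True for the query-string URL on your own file, something else (a typo, the file's location) is the problem, not the rule itself.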
8:13 am on Jul 4, 2005 (gmt 0)

10+ Year Member



5) Are the bots actually indexing the contents of those pages? i.e. are you seeing a title and snippet? Or are you just seeing the URL listed?
Even if you disallow something in robots.txt, the URL can still get listed, as all robots.txt does is tell the search engine bot not to fetch it. However, that URL will only come up if you do a search for particular URLs.

I guess that's the problem: the robots.txt validates, and it's Google I have problems with. But it's not crawling the pages, just showing the URL in a site: search.

Maybe I can remove them from the site: search by using this meta tag: <META NAME="robots" CONTENT="noindex, nofollow, noarchive" />?

2:28 am on Jul 5, 2005 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Yes, you could put those meta tags in, but then you would have to remove those files from your robots.txt; otherwise Googlebot will never fetch the page and read those meta tags.
It is all a matter of which is the lesser evil:
having those URLs show up if you do a search for all URLs on a site, or having a search bot fetching the page (and any issues this may cause).
9:50 am on Jul 5, 2005 (gmt 0)

10+ Year Member



I'll try the meta tags; it's just a test site anyway.
4:23 am on Jul 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



user-agent: *
disallow: /page.php?*

is invalid. never use * in the disallow field when it is also in the user-agent field. Most bots will see it as invalid.

googlebot DOES allow it though:

user-agent: googlebot
disallow: /page.php?*

is perfectly valid but pointless, since disallow: /page.php? would have the EXACT same effect. a more useful application of the wildcard (for googlebot) would be something like disallow: /*? which would disallow any URL containing "?"

The original robots.txt should work for all bots, but not if they are following an inbound link to the URL in question. In that case the listing would be URL-only and never go beyond that.
If the inbound link is removed, the URL-only listing should also disappear within a few crawls.

5:44 am on Jul 6, 2005 (gmt 0)

10+ Year Member



So what you are saying is that:

/page.php?

Will keep msnbot and inktomi out too?

7:05 am on Jul 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



/page.php?

will disallow any URL beginning with /page.php?

it will not disallow /page.php, but it will disallow /page.php?id=2, /page.php?id=3, etc.

robots.txt is based on prefix matching, meaning that any URL that matches the prefix /page.php? will be disallowed.

if you disallow /page.php
then it will disallow /page.php AND /page.php?id=2, /page.php?id=3, etc., because they all contain the prefix /page.php

googlebot can use the wildcard
/*.php will disallow all .php files in every directory, because you are essentially saying
disallow: /(any text string).php
all files with a .php extension would be matched, but files with a different extension such as .html would not.

MSN and Inktomi do not allow the wildcard * in the disallow field but they do obey the user-agent: * (so does googlebot)
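The prefix-versus-wildcard distinction above can be sketched with a small matcher. This is purely an illustration of the matching logic, not Googlebot's actual implementation:

```python
# Illustration only: a plain rule is a prefix match; in Google's wildcard
# extension, '*' matches any run of characters. Not Googlebot's real code.
import re

def rule_matches(rule: str, path: str) -> bool:
    # Escape regex metacharacters, then turn the escaped '*' into '.*';
    # anchor at the start, since rules match from the beginning of the path.
    pattern = "^" + re.escape(rule).replace(r"\*", ".*")
    return re.match(pattern, path) is not None

print(rule_matches("/page.php", "/page.php?id=2"))   # True  (prefix match)
print(rule_matches("/page.php?", "/page.php"))       # False (rule is longer than the path)
print(rule_matches("/*.php", "/dir/page.php"))       # True  (wildcard match)
print(rule_matches("/*.php", "/page.html"))          # False
```

The last two lines only hold for bots that support the wildcard extension; for the others, /*.php would just be an opaque literal prefix that matches nothing.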

7:31 am on Jul 6, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



bots will obey either the user-agent section naming them specifically or the wildcard user-agent, whichever comes first

example:

user-agent: *
disallow: /page.php?

user-agent: googlebot
disallow: /otherpage.php

googlebot will obey the user-agent: * one because it comes first.
when you have multiple user-agent sections in robots.txt, user-agent: * should be the last entry, so it picks up all bots that did not match a user-agent: string.
remember, only googlebot can handle * in the disallow field.
In most cases a simple robots.txt with user-agent: * will give one command to all bots but in some cases it is useful to give different bots different directives.

user-agent: somebot
disallow: /

user-agent: someotherbot
disallow: /

user-agent: googlebot
disallow: /cgi-bin
disallow: /*?

user-agent: *
disallow: /cgi-bin

i don't have a problem with any search engine except googlebot indexing URLs containing "?", so I gave it a special directive.
I want 'em all out of my cgi-bin, and those first 2 bots are wasting my bandwidth, but they do obey robots.txt.
(just a crude example)
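The group-selection behaviour in the example above can be checked with Python's standard-library parser as a rough sketch. The bot names are placeholders, and I've dropped the /*? line since the stdlib parser doesn't support wildcards; note it matches a bot's own user-agent group first and falls back to the * group otherwise:

```python
# Sketch of the multi-group robots.txt above (wildcard line omitted,
# since the stdlib parser doesn't support it). Bot names are placeholders.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: somebot",
    "Disallow: /",
    "",
    "User-agent: googlebot",
    "Disallow: /cgi-bin",
    "",
    "User-agent: *",
    "Disallow: /cgi-bin",
])

print(rp.can_fetch("somebot", "http://example.com/index.html"))    # False (banned site-wide)
print(rp.can_fetch("googlebot", "http://example.com/cgi-bin/x"))   # False (its own group)
print(rp.can_fetch("googlebot", "http://example.com/index.html"))  # True
print(rp.can_fetch("otherbot", "http://example.com/cgi-bin/x"))    # False (falls to *)
print(rp.can_fetch("otherbot", "http://example.com/index.html"))   # True
```

Each bot ends up governed by exactly one group, which is the whole point of splitting the file this way.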

11:17 am on Jul 6, 2005 (gmt 0)

10+ Year Member



Thanks, Reid. I learned a lot from this thread!

:-)
Kim
