
Forum Moderators: goodroi


robots.txt with ? in URL

     
10:52 am on Jul 2, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Sept 8, 2004
posts:114
votes: 0


I use this in my robots.txt:

User-agent: *

Disallow: /page.php

But pages like page.php?id=123 still get indexed. What should I do?

4:12 am on July 3, 2005 (gmt 0)

New User

10+ Year Member

joined:Sept 22, 2004
posts:16
votes: 0


I'm having the same issue. I tried using "*", but I understand this is not supported by every search engine.
9:45 am on July 3, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Sept 8, 2004
posts:114
votes: 0


Will it help if I change it to

Disallow: /page.php?

?

11:55 am on July 3, 2005 (gmt 0)

New User

10+ Year Member

joined:Sept 22, 2004
posts:16
votes: 0


Try this:

Disallow: /page.php?*

Supposedly this will work with Googlebot.

Kim

6:09 pm on July 3, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Sept 8, 2004
posts:114
votes: 0


Thank you, I'll try that out. How long do you think it will take before I see pages getting deindexed?
7:55 pm on July 3, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member topr8 is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 19, 2002
posts:3254
votes: 19


>>Disallow: /page.php?*

this is invalid markup; i don't know if it works for google as said above, but it may make the line completely invalid for other spiders.

9:07 pm on July 3, 2005 (gmt 0)

New User

10+ Year Member

joined:Sept 22, 2004
posts:16
votes: 0


Yes, that is probably correct. I don't know if it works for Yahoo/MSN, but it is supposed to work for Google.
10:57 pm on July 3, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 31, 2005
posts:1108
votes: 0


Disallow: /page.php should be disallowing page.php?id=123 for all bots.
1) Which search bot is not obeying it?
2) Have you validated your robots.txt file? See link at top of forum.
3) Is your page.php in your root folder?
If not, you need to Disallow: /dir/page.php
4) Is your robots.txt file in your root folder?
5) Are the bots actually indexing the contents of those pages? i.e. are you seeing a title and snippet, or are you just seeing the URL listed?
Even if you disallow something in robots.txt, the URL can still get listed, as all robots.txt does is tell the search engine bot not to fetch it. However, that URL will only come up if you do a search for particular URLs.
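The prefix behaviour described here can be checked with Python's standard-library robots.txt parser (a later convenience, but it implements the same plain prefix matching; the hostname and bot name below are made up):

```python
from urllib import robotparser

# Check what "Disallow: /page.php" matches, using the stdlib parser.
# example.com and "somebot" are hypothetical.
rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /page.php
""".splitlines())

# Query-string variants share the /page.php prefix, so they are
# disallowed along with the bare page:
print(rp.can_fetch("somebot", "http://example.com/page.php"))         # False
print(rp.can_fetch("somebot", "http://example.com/page.php?id=123"))  # False
# Unrelated paths are unaffected:
print(rp.can_fetch("somebot", "http://example.com/other.html"))       # True
```

So if the rule validates and is in the root robots.txt, the query-string URLs should indeed be covered.
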
8:13 am on July 4, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Sept 8, 2004
posts:114
votes: 0


5) Are the bots actually indexing the contents of those pages? i.e. are you seeing a title and snippet, or are you just seeing the URL listed?
Even if you disallow something in robots.txt, the URL can still get listed, as all robots.txt does is tell the search engine bot not to fetch it. However, that URL will only come up if you do a search for particular URLs.

I guess that's the problem: the robots.txt validates, and it's Google I have problems with. But it's not crawling the pages, just showing the URL in a site: search.

Maybe I can remove them from the site: search by using this meta tag: <META NAME="robots" CONTENT="noindex, nofollow, noarchive" />
?

2:28 am on July 5, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:May 31, 2005
posts:1108
votes: 0


Yes, you could put those meta tags in, but then you would have to remove those files from your robots.txt; otherwise Googlebot will never fetch the page and read the meta tags.
It is all a matter of which is the lesser evil:
having those URLs show up if you do a search for all URLs on a site, or having a search bot fetching the page (and any issues this may cause).
9:50 am on July 5, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Sept 8, 2004
posts:114
votes: 0


I'll try the meta tags; it's just a test site anyway.
4:23 am on July 6, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 16, 2004
posts:693
votes: 0


user-agent: *
disallow: /page.php?*

is invalid. Never use * in the disallow field of a record whose user-agent is *; most bots will treat the line as invalid.

googlebot DOES allow it, though:

user-agent: googlebot
disallow: /page.php?*

is perfectly valid but pointless, since disallow: /page.php? would have the EXACT same effect. A more useful application of the wildcard (for googlebot) would be something like disallow: /*? which would disallow any URL containing "?".

The original robots.txt should work for all bots, but not if they are following an inbound link to the URL in question. In that case the listing would be URL-only and never go beyond that.
If the IBL is removed, the URL-only listing should also disappear within a few crawls.
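The equivalence of /page.php?* and /page.php? for Googlebot can be illustrated with a small sketch of Google's documented pattern semantics ('*' matches any run of characters, a trailing '$' anchors the end, everything else is a literal prefix match). The function names here are made up for the demo:

```python
import re

def to_regex(pattern):
    """Translate a Google-style robots.txt path pattern into a regex:
    '*' matches any run of characters, a trailing '$' anchors the end,
    and everything else is matched literally as a prefix."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def blocked(pattern, path):
    """True if the pattern matches the start of the URL path."""
    return to_regex(pattern).match(path) is not None

# The trailing '*' adds nothing: both patterns block the same URLs.
print(blocked("/page.php?",  "/page.php?id=123"))  # True
print(blocked("/page.php?*", "/page.php?id=123"))  # True
print(blocked("/page.php?*", "/page.php"))         # False

# "/*?" blocks any URL containing a "?":
print(blocked("/*?", "/page.php?id=2"))  # True
print(blocked("/*?", "/page.php"))       # False
```
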

5:44 am on July 6, 2005 (gmt 0)

New User

10+ Year Member

joined:Sept 22, 2004
posts:16
votes: 0


So what you are saying is that:

/page.php?

Will keep msnbot and inktomi out too?

7:05 am on July 6, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 16, 2004
posts:693
votes: 0


/page.php?

will disallow any URL beginning with /page.php?

It will not disallow /page.php, but it will disallow /page.php?id=2, /page.php?id=3, etc.

robots.txt is based on prefix matching, meaning that any URL that starts with the prefix /page.php? will be disallowed.

If you disallow /page.php
then it will disallow /page.php AND /page.php?id=2, /page.php?id=3, etc., because they all share the prefix /page.php.

googlebot can use the wildcard:
/*.php will disallow all .php files in every directory, because you are essentially saying
disallow: /(any text string).php
All files with a .php extension would be matched, but files with a different extension such as .html would not.

MSN and Inktomi do not allow the wildcard * in the disallow field, but they do obey user-agent: * (so does googlebot).
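Prefix matching as described above is literally a starts-with test. A minimal sketch (paths are made up; this models the original robots.txt convention, not Google's wildcard extension):

```python
# Plain prefix matching per the original robots.txt convention:
# a URL path is disallowed if it begins with the Disallow value.
def blocked(disallow_path, url_path):
    return url_path.startswith(disallow_path)

print(blocked("/page.php?", "/page.php"))        # False (no query string)
print(blocked("/page.php?", "/page.php?id=2"))   # True
print(blocked("/page.php",  "/page.php"))        # True
print(blocked("/page.php",  "/page.php?id=3"))   # True
# A literal '*' in the rule matches nothing under strict prefix matching:
print(blocked("/page.php?*", "/page.php?id=2"))  # False -- the rule is a no-op
```

The last line shows why /page.php?* is risky outside Googlebot: a strict prefix matcher treats the * as an ordinary character, and the rule blocks nothing.
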

7:31 am on July 6, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 16, 2004
posts:693
votes: 0


Bots will obey either the record for their specific user-agent or the wildcard user-agent record, whichever comes first.

example:

user-agent: *
disallow: /page.php?

user-agent: googlebot
disallow: /otherpage.php

googlebot will obey the user-agent: * record because it comes first.
When you have multiple records in robots.txt naming multiple user-agents, user-agent: * should be the last entry, to pick up all bots that did not match a user-agent: string.
Remember, only googlebot can handle * in the disallow field.
In most cases a simple robots.txt with user-agent: * will give one directive to all bots, but in some cases it is useful to give different bots different directives.

user-agent: somebot
disallow: /

user-agent: someotherbot
disallow: /

user-agent: googlebot
disallow: /cgi-bin
disallow: /*?

user-agent: *
disallow: /cgi-bin

I don't have a problem with any search engine except googlebot indexing URLs containing "?", so I gave it a special directive.
I want them all out of my cgi-bin, and those first two bots are wasting my bandwidth, but they do obey robots.txt.
(just a crude example)
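For what it's worth, group selection can be tested with Python's standard-library parser. One caveat: that parser, like Google's documented behaviour, applies the most specific matching user-agent group and uses user-agent: * only as a fallback, regardless of the order of the groups in the file. The bot names and paths below are made up:

```python
from urllib import robotparser

# Which group applies to which bot? (stdlib parser; note it picks the
# most specific matching group, falling back to 'User-agent: *' only
# when no named group matched.)
rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: somebot
Disallow: /

User-agent: googlebot
Disallow: /cgi-bin

User-agent: *
Disallow: /cgi-bin
""".splitlines())

print(rp.can_fetch("somebot", "http://example.com/index.html"))    # False: banned site-wide
print(rp.can_fetch("googlebot", "http://example.com/cgi-bin/x"))   # False
print(rp.can_fetch("googlebot", "http://example.com/index.html"))  # True
print(rp.can_fetch("otherbot", "http://example.com/cgi-bin/x"))    # False: falls back to *
print(rp.can_fetch("otherbot", "http://example.com/index.html"))   # True
```
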

11:17 am on July 6, 2005 (gmt 0)

New User

10+ Year Member

joined:Sept 22, 2004
posts:16
votes: 0


Thanks, Reid. I learned a lot from this thread!

:-)
Kim

 
