
Sitemaps, Meta Data, and robots.txt Forum

    
robots.txt with ? in URL
jozomannen

10+ Year Member



 
Msg#: 673 posted 10:52 am on Jul 2, 2005 (gmt 0)

I use this in my robots.txt:

User-agent: *

Disallow: /page.php

But pages like page.php?id=123 still get indexed. What should I do?

 

kservik

10+ Year Member



 
Msg#: 673 posted 4:12 am on Jul 3, 2005 (gmt 0)

I have the same problem. I tried using "*", but I understand this is not supported by every search engine.

jozomannen

10+ Year Member



 
Msg#: 673 posted 9:45 am on Jul 3, 2005 (gmt 0)

Will it help if I change it to

Disallow: /page.php?

?

kservik

10+ Year Member



 
Msg#: 673 posted 11:55 am on Jul 3, 2005 (gmt 0)

Try this:

Disallow: /page.php?*

This supposedly works with Googlebot.

Kim

jozomannen

10+ Year Member



 
Msg#: 673 posted 6:09 pm on Jul 3, 2005 (gmt 0)

Thank you, I'll try that out. How long do you think it will take until I see the pages deindexed?

topr8

WebmasterWorld Senior Member topr8 is a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 673 posted 7:55 pm on Jul 3, 2005 (gmt 0)

>>Disallow: /page.php?*

This is invalid markup. I don't know if it works for Google as said above, but it may make the line completely invalid for other spiders.

kservik

10+ Year Member



 
Msg#: 673 posted 9:07 pm on Jul 3, 2005 (gmt 0)

Yes, that is probably correct. I don't know if it works for Yahoo/MSN, but it is supposed to work for Google.

Dijkgraaf

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 673 posted 10:57 pm on Jul 3, 2005 (gmt 0)

Disallow: /page.php should be disallowing page.php?id=123 for all bots.
1) Which search bot is not obeying it?
2) Have you validated your robots.txt file? See link at top of forum.
3) Is your page.php in your root folder?
If not, you need to Disallow: /dir/page.php
4) Is your robots.txt file in your root folder?
5) Are the bots actually indexing the contents of those pages? i.e. are you seeing a title and snippet? Or are you just seeing the URL listed?
Even if you disallow something in robots.txt, the URL can still get listed, as all robots.txt does is tell the search engine bot not to fetch it. However, that URL will only come up if you do a search for particular URLs.
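This prefix behaviour is easy to check mechanically. A sketch using Python's standard-library urllib.robotparser, which applies the same literal prefix matching (the hostname is a placeholder):

```python
import urllib.robotparser

# The rules from the original post.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /page.php",
])

# /page.php and anything beginning with it is blocked, query string included.
print(rp.can_fetch("*", "http://example.com/page.php?id=123"))  # False
print(rp.can_fetch("*", "http://example.com/page.php"))         # False
print(rp.can_fetch("*", "http://example.com/other.html"))       # True
```

So the rule itself is fine; as point 5 says, a bare URL can still appear in results even though the page is never fetched.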

jozomannen

10+ Year Member



 
Msg#: 673 posted 8:13 am on Jul 4, 2005 (gmt 0)

>>5) Are the bots actually indexing the contents of those pages? i.e. are you seeing a title and snippet? Or are you just seeing the URL listed?
>>Even if you disallow something in robots.txt, the URL can still get listed, as all robots.txt does is tell the search engine bot not to fetch it. However, that URL will only come up if you do a search for particular URLs.

I guess that's the problem. The robots.txt validates, and it's Google I have problems with. It's not crawling the pages, just showing the URL in a site: search.

Maybe I can remove them from the site: search by using this meta tag: <META NAME="robots" CONTENT="noindex, nofollow, noarchive" />?

Dijkgraaf

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 673 posted 2:28 am on Jul 5, 2005 (gmt 0)

Yes, you could put those meta tags in, but then you would have to remove those pages from your robots.txt; otherwise Googlebot will never fetch the page and read the meta tags.
It is all a matter of which is the lesser evil:
having those URLs show up if you do a search for all URLs on a site, or having a search bot fetching the page (and any issues this may cause).

jozomannen

10+ Year Member



 
Msg#: 673 posted 9:50 am on Jul 5, 2005 (gmt 0)

I'll try that with the meta tags. It's just a test site anyway.

Reid

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 673 posted 4:23 am on Jul 6, 2005 (gmt 0)

user-agent: *
disallow: /page.php?*

is invalid. Never use * in the disallow field when the user-agent field is *; most bots will see the line as invalid.

googlebot DOES allow it though:

user-agent: googlebot
disallow: /page.php?*

is perfectly valid, but pointless, since disallow: /page.php? would have the EXACT same effect. A more useful application of the wildcard (for googlebot) would be something like disallow: /*? which would disallow any URL containing "?".
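Googlebot's wildcard matching as described above can be approximated by translating each rule into an anchored regular expression. A hypothetical helper, not Google's actual code (and it ignores the "$" end-anchor Googlebot also understands):

```python
import re

def googlebot_disallowed(rule: str, path: str) -> bool:
    """Approximate Googlebot-style matching: '*' matches any run of
    characters, and the rule as a whole is a prefix match on the path."""
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.match(pattern, path) is not None

print(googlebot_disallowed("/page.php?", "/page.php?id=2"))   # True
print(googlebot_disallowed("/page.php?*", "/page.php?id=2"))  # True - the trailing * adds nothing
print(googlebot_disallowed("/*?", "/page.php?id=2"))          # True - any URL containing "?"
print(googlebot_disallowed("/*?", "/page.php"))               # False
```

Note how /page.php? and /page.php?* match exactly the same URLs, which is why the trailing * is pointless.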

The original robots.txt should work for all bots, but not if they are following an inbound link to the URL in question. In that case the listing would be URL-only and never go beyond that.
If the IBL (inbound link) is removed, the URL-only listing should also disappear within a few crawls.

kservik

10+ Year Member



 
Msg#: 673 posted 5:44 am on Jul 6, 2005 (gmt 0)

So what you are saying is that:

/page.php?

Will keep msnbot and inktomi out too?

Reid

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 673 posted 7:05 am on Jul 6, 2005 (gmt 0)

/page.php?

will disallow any URL beginning with /page.php?

It will not disallow /page.php, but it will disallow /page.php?id=2, /page.php?id=3, etc.

robots.txt is based on prefix matching, meaning that any URL that matches the prefix /page.php? will be disallowed.

If you disallow /page.php
then it will disallow /page.php AND /page.php?id=2, /page.php?id=3, etc., because they all contain the prefix /page.php.
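The plain (non-wildcard) rule really is just a string prefix test. A one-line sketch (the helper name is made up for illustration):

```python
def disallowed(rule: str, path: str) -> bool:
    # Standard robots.txt semantics: a Disallow rule is a literal prefix of the path.
    return path.startswith(rule)

print(disallowed("/page.php?", "/page.php"))       # False - no "?" in the path
print(disallowed("/page.php?", "/page.php?id=2"))  # True
print(disallowed("/page.php", "/page.php"))        # True
print(disallowed("/page.php", "/page.php?id=3"))   # True
```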

googlebot can use the wildcard:
/*.php will disallow all .php files in every directory, because you are essentially saying
disallow: /(any text string).php
All files with a .php extension would be matched, but files with a different extension such as .html would not.

MSN and Inktomi do not allow the wildcard * in the disallow field, but they do obey user-agent: * (so does googlebot).

Reid

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 673 posted 7:31 am on Jul 6, 2005 (gmt 0)

Bots will obey the user-agent section specific to them or the wildcard user-agent, whichever comes first.

example:

user-agent: *
disallow: /page.php?

user-agent: googlebot
disallow: /otherpage.php

googlebot will obey the user-agent: * one because it comes first.
When you have multiple sections in robots.txt naming multiple user-agents, user-agent: * should be the last entry, to pick up all bots that did not match a user-agent: string.
Remember, only googlebot can handle * in the disallow field.
In most cases a simple robots.txt with user-agent: * will give one command to all bots, but in some cases it is useful to give different bots different directives.

user-agent: somebot
disallow: /

user-agent: someotherbot
disallow: /

user-agent: googlebot
disallow: /cgi-bin
disallow: /*?

user-agent: *
disallow: /cgi-bin

I don't have a problem with any search engine except googlebot indexing URLs containing "?", so I gave it a special directive.
I want them all out of my cgi-bin, and those first two bots are wasting my bandwidth, but they do obey robots.txt.
(just a crude example)
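Per-agent sections like this can be exercised with Python's standard-library parser. A sketch with placeholder bot names and hostname; the wildcard line is left out here because, like most non-Google bots of the era, this parser does not honour * inside a Disallow line:

```python
import urllib.robotparser

robots_txt = """\
User-agent: somebot
Disallow: /

User-agent: googlebot
Disallow: /cgi-bin

User-agent: *
Disallow: /cgi-bin
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("somebot", "http://example.com/page.php"))     # False - banned from everything
print(rp.can_fetch("googlebot", "http://example.com/cgi-bin/x"))  # False
print(rp.can_fetch("googlebot", "http://example.com/page.php"))   # True
print(rp.can_fetch("anybot", "http://example.com/cgi-bin/x"))     # False - caught by the * section
```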

kservik

10+ Year Member



 
Msg#: 673 posted 11:17 am on Jul 6, 2005 (gmt 0)

Thanks, Reid. I learned a lot from this thread!

:-)
Kim
