Sitemaps, Meta Data, and robots.txt Forum

wildcard
stevelibby
msg:1528814
8:10 am on May 18, 2006 (gmt 0)

Can you use a wildcard in a robots.txt file?
My web site is written in ASP. When results are returned, they come in batches of 25. To select the 2nd batch the href is:
page.asp?Page=2
Moving forward through the results is fine, but coming back, page.asp and page.asp?page=1 may be deemed duplicate content by search engines.
So can I block results above 25?

Dijkgraaf
msg:1528815
8:44 pm on May 18, 2006 (gmt 0)

No, wildcards aren't part of the standard and aren't supported by most bots.
However, in your example you could use
Disallow: /page.asp?page
or even
Disallow: /page.asp?
which would allow /page.asp but not /page.asp?page=1 etc., because a Disallow rule blocks any URL beginning with what you specify.
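
To illustrate the prefix matching against the URLs from the question (a sketch):

User-agent: *
Disallow: /page.asp?

Blocked: /page.asp?page=1, /page.asp?Page=2, and anything else beginning with /page.asp?
Still crawlable: /page.asp itself, since it has no query string for the prefix to match.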

Reid
msg:1528816
6:43 am on May 30, 2006 (gmt 0)

Wildcards are valid in User-agent, and for googlebot they are also valid in Disallow:

User-agent:googlebot
Disallow:/page.asp?*

User-agent:*
Disallow:

This will allow everything for all user agents except googlebot, which is disallowed from any page.asp URL with a ?.
Make sure you use the googlebot directive first, or else googlebot will follow the User-agent: * directive.

Pfui
msg:1528817
8:58 pm on May 30, 2006 (gmt 0)

1.) For any newcomers following along, be sure you include a space after any colon:

User-agent: Badbot
Disallow: /file.html
Disallow: /directory

2.) A blank Disallow is the same as saying Allow. Adding a forward slash will turn away all crawlers inclined to heed the most basic robots.txt:

User-agent: *
Disallow: /
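
For contrast, the blank form (a minimal sketch) lets every crawler in:

User-agent: *
Disallow: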

3.) Nowadays bots' preferences can be (head-bangingly) unique and specific, so it's always a good idea to go to the source:

[search.msn.com...]
See also: EXAMPLES [search.msn.com] (including wildcards)

[robotstxt.org...]

[google.com...]
See also: EXAMPLES [google.com] (including wildcards)

Pfui
msg:1528818
9:03 pm on May 30, 2006 (gmt 0)

4.) Googlebot's info doesn't have to come first. From the EXAMPLES page, above:

When creating your robots.txt file, please keep the following in mind: When deciding which pages to crawl on a particular host, Googlebot will obey the first record in the robots.txt file with a User-agent starting with "Googlebot." If no such entry exists, it will obey the first entry with a User-agent of "*".
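
In other words, a file laid out like this sketch (the disallowed paths are hypothetical, not from the quoted page) should still steer Googlebot to its own record even though the * record comes first:

User-agent: *
Disallow: /cgi-bin/

User-agent: Googlebot
Disallow: /page.asp?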

(Odd. I couldn't edit my just-made post to add that bit.)

Reid
msg:1528819
11:40 pm on May 30, 2006 (gmt 0)

I've seen other posts on WW from people having problems with googlebot obeying the wrong directive; it turned out they had User-agent: * before User-agent: googlebot. It's always safest to assume that any bot will follow the first valid directive it finds, so User-agent: * (meaning all bots that have not been named directly) should always be at the end of the robots.txt file. Just good code practice anyway.
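
Putting that advice into a layout (a sketch; the disallowed paths are hypothetical):

User-agent: Googlebot
Disallow: /page.asp?

User-agent: msnbot
Disallow: /page.asp?

User-agent: *
Disallow: /page.asp

The named records come first and the catch-all comes last, so even a bot that stops at the first valid record it finds ends up with the right rules.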

Reid
msg:1528820
12:05 am on May 31, 2006 (gmt 0)

I just took a closer look at the examples here from MSN and Google.
You have to be careful:
Googlebot will obey the first record in the robots.txt file with a User-agent starting with "Googlebot." If no such entry exists, it will obey the first entry with a User-agent of "*".

This statement, if read carefully, does not say what will happen when BOTH directives are used.
Bots do not double back and check the robots.txt file twice; a bot will likely follow the first valid directive it finds, either User-agent: * or User-agent: googlebot (whichever comes first).
Also, the MSN example does NOT say that psbot will recognize any *, and for MSNbot it ONLY shows the file-extension wildcard (note that the $ must be present); it does not say that MSNbot allows wildcards in any way other than with file extensions.
Example: Disallow: *.gif$ (the $ must be used)
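
A sketch of that extension-only form for MSNbot, following the example above (later wildcard documentation writes the pattern with a leading /*):

User-agent: msnbot
Disallow: /*.gif$

The $ anchors the match to the end of the URL, so /images/photo.gif would be blocked but /photo.gif.html would not.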

Pfui
msg:1528821
12:53 am on May 31, 2006 (gmt 0)

FWIW, my robots.txt files (multiple sites) begin with the generic block --

User-agent: *
Disallow: /

-- followed by all other entries in alphabetical order, some very detailed (msnbot, Googlebot, Slurp), some simple. All specifically identified major SE bots find 'their' instructions and follow them 99% of the time.
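
A sketch of that layout (the named records and paths are hypothetical placeholders, not the actual files):

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/

User-agent: msnbot
Disallow: /cgi-bin/

User-agent: Slurp
Disallow: /cgi-bin/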

Now if only they'd all follow the same conventions! (See: AdsBot-Google's robots.txt specs [webmasterworld.com].)
