
wildcard

     

stevelibby

8:10 am on May 18, 2006 (gmt 0)

10+ Year Member



Can you use a wildcard in a robots.txt?
My web site is written in ASP. When results are returned, they come in batches of 25. To select the 2nd batch the href is:
page.asp?Page=2
Moving forward through the results is fine, but coming back, page.asp and page.asp?page=1 may be deemed duplicate content by search engines.
So can I block results above 25?

Dijkgraaf

8:44 pm on May 18, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



No, wildcards aren't part of the standard and aren't supported by most bots.
However, in your example you could use
Disallow: /page.asp?page
or even
Disallow: /page.asp?
which would allow /page.asp but not /page.asp?page=1 etc., because a Disallow rule disallows any URL beginning with what you specify.
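A minimal sketch of how that prefix matching plays out (the URLs are just the ones from the question):

User-agent: *
# Prefix match: blocks every URL that starts with /page.asp?
Disallow: /page.asp?

# Under this rule:
#   /page.asp          allowed  (doesn't start with /page.asp?)
#   /page.asp?page=1   blocked
#   /page.asp?Page=2   blocked
#   /results.asp       allowed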

Reid

6:43 am on May 30, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Wildcards are valid in User-agent, and for Googlebot they are also valid in Disallow:

User-agent:googlebot
Disallow:/page.asp?*

User-agent:*
Disallow:

This will allow all for all user agents except Googlebot, which will be disallowed from any page.asp URL with a ?.
Make sure you put the Googlebot record first, or else Googlebot will follow the User-agent: * record.

Pfui

8:58 pm on May 30, 2006 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



1.) For any newcomers following along, be sure you include a space after any colon:

User-agent: Badbot
Disallow: /file.html
Disallow: /directory

2.) A blank Disallow is the same as saying Allow. Adding a forward slash will turn away all crawlers inclined to heed the most basic robots.txt:

User-agent: *
Disallow: /

3.) Nowadays bots' preferences can be (head-bangingly) unique and specific, so it's always a good idea to go to the source:

[search.msn.com...]
See also: EXAMPLES [search.msn.com] (including wildcards)

[robotstxt.org...]

[google.com...]
See also: EXAMPLES [google.com] (including wildcards)

Pfui

9:03 pm on May 30, 2006 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



4.) Googlebot's info doesn't have to come first. From the EXAMPLES page, above:

When creating your robots.txt file, please keep the following in mind: When deciding which pages to crawl on a particular host, Googlebot will obey the first record in the robots.txt file with a User-agent starting with "Googlebot." If no such entry exists, it will obey the first entry with a User-agent of "*".
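As a sketch of what that implies when both records are present (order per the quote shouldn't matter to Googlebot):

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /page.asp?

# Per the quoted documentation, Googlebot obeys its own record even
# though the "*" record comes first; other compliant bots, finding no
# record naming them, fall back to User-agent: * and stay out.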

(Odd. I couldn't edit my just-made post to add that bit.)

Reid

11:40 pm on May 30, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've seen other posts on WW of people having problems with Googlebot using the wrong directive; it turns out they had User-agent: * before User-agent: googlebot. It's always safe to assume that any bot will follow the first valid record it finds, so User-agent: * (meaning all bots that have not been named directly) should always be at the end of the robots.txt file. Just good code practice anyway.

Reid

12:05 am on May 31, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I just took a closer look at the examples here from MSN and Google.
You have to be careful:
Googlebot will obey the first record in the robots.txt file with a User-agent starting with "Googlebot." If no such entry exists, it will obey the first entry with a User-agent of "*".

This statement, if read carefully, does not say what will happen when BOTH records are present.
Bots do not double back and check the robots.txt file twice; a bot will likely follow the first valid record it finds, either User-agent: * or User-agent: googlebot, whichever comes first.
Also, the MSN example does NOT say that psbot will recognize any *, and for MSNbot it ONLY shows the file-extension wildcard (note that $ must be present); it does not say MSNbot allows wildcards in any way other than with file extensions.
Example: Disallow: *.gif$ ($ must be used)
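A minimal sketch of that extension form (written here with the leading /* that the engines' wildcard examples use, plus the trailing $ anchor):

User-agent: msnbot
# Blocks any URL ending in .gif; the $ anchors the match to the
# end of the URL, so /images/logo.gif matches but /logo.gif?v=2 does not.
Disallow: /*.gif$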

Pfui

12:53 am on May 31, 2006 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



FWIW, my robots.txt files (multiple sites) begin with the generic block --

User-agent: *
Disallow: /

-- followed by all other entries in alphabetical order, some very detailed (msnbot, Googlebot, Slurp), some simple. All specifically identified major SE bots find 'their' instructions and follow them 99% of the time.
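A sketch of that layout, with placeholder rules standing in for the per-bot details:

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /page.asp?

User-agent: msnbot
Disallow: /*.gif$

User-agent: Slurp
Disallow: /private/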

Now if only they'd all follow the same conventions! (See: AdsBot-Google's robots.txt specs [webmasterworld.com].)

 
