

Sitemaps, Meta Data, and robots.txt Forum

    
Robots.txt and regular expressions
Can they use them?
jqwan
Msg#: 4461252 posted 3:23 pm on Jun 4, 2012 (gmt 0)

I have read conflicting stuff on this all over the internet and even here in the forums. I am simply trying to get rid of a series of search pages with dynamic parameters and want to use robots.txt to do this. Can it accept any regex? On SEOmoz (http://www.seomoz.org/learn-seo/robotstxt) it says they will:

Pattern Matching

Google and Bing both honor two regular expressions that can be used to identify pages or sub-folders that an SEO wants excluded. These two characters are the asterisk (*) and the dollar sign ($).

* - which is a wildcard that represents any sequence of characters
$ - which matches the end of the URL

but most other sites I see say they will not.


Any ideas?
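For reference, here is a rough, hand-rolled Python illustration (my own sketch, not something from the SEOmoz page) of what those two characters mean when a crawler that supports them tests a URL path:

import re

# Hand-translated robots.txt patterns: the Disallow value is treated as a
# left-anchored prefix, * as "any run of characters", and a trailing $ as
# "end of the URL".  The URLs below are made-up examples.
print(bool(re.match(r"^/search/.*sort=", "/search/shoes?sort=price")))  # "/search/*sort=" -> True
print(bool(re.match(r"^/.*\.pdf$", "/files/report.pdf")))               # "/*.pdf$" -> True
print(bool(re.match(r"^/.*\.pdf$", "/files/report.pdf?download=1")))    # "/*.pdf$" -> False ($ requires the URL to end there)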

 

lucy24
Msg#: 4461252 posted 4:03 pm on Jun 4, 2012 (gmt 0)

The * used in this way is not a Regular Expression, so be careful how you talk about it.

If you want to include directives that are intended for specific search engines, you can use any syntax that they say they recognize. But if you want an all-purpose "Which part of 'disallow' didn't you understand?" then stick to the minimalist form.

I don't know about Bing, but Google ignores "crawl-delay" even though I'm sure it understands it perfectly well.

jqwan
Msg#: 4461252 posted 4:57 pm on Jun 4, 2012 (gmt 0)

Yeah, I was talking about a page like page.aspx where the page then had page.aspx?color=black and so on for more refinements. So I was going to add

Disallow: /page.aspx*
Does this sound correct?

Thanks!

g1smd
Msg#: 4461252 posted 5:27 pm on Jun 4, 2012 (gmt 0)

Never use * at the end of the pattern. The pattern "matches from the left" and is a "prefix match".

Use * only near the beginning or in the middle of the pattern.

Disallow: /this disallows anything beginning example.com/this so the * is not needed.

Disallow: /*that disallows URL requests like example.com/<something-or-anything>that as a prefix.

The $ ending is needed only when you need an exact match.
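To make the prefix-match behaviour concrete, here is a rough Python sketch (my own approximation of how Google and Bing document their wildcard handling, not an official parser) that turns a Disallow value into an equivalent regular expression and checks the shapes described above:

import re

def robots_pattern_to_regex(pattern):
    # Approximate the documented handling: the Disallow value is a
    # left-anchored prefix match, * stands for any run of characters,
    # and a trailing $ anchors the match to the end of the URL.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return "^" + regex + ("$" if anchored else "")

def is_disallowed(url_path, pattern):
    return re.match(robots_pattern_to_regex(pattern), url_path) is not None

print(is_disallowed("/thisandthat", "/this"))          # True: plain prefix match, no * needed
print(is_disallowed("/somethingthat", "/*that"))       # True: * stands in for the leading part
print(is_disallowed("/page.aspx?x=1", "/page.aspx$"))  # False: $ requires the URL to end right there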
jqwan
Msg#: 4461252 posted 5:41 pm on Jun 4, 2012 (gmt 0)

Thanks for the update, g1smd. However, with the example I put above I would need to match whatever is after the page.aspx, and there isn't a real end that I can put on it because it is dynamic. So the page could be

page.aspx?color=blue
page.aspx?size=120
page.aspx?size=120&types=internal
Etc

Is there a way to match what I am talking about above with the *? Thanks!

g1smd
Msg#: 4461252 posted 5:47 pm on Jun 4, 2012 (gmt 0)

Do you need to match some query strings and not others?
Is this for one particular aspx page or for all aspx pages?

If you need to block all query strings for one particular .aspx page, then the prefix match for disallowing example.com/page.aspx?<anything> is
Disallow: /page.aspx?
It's a prefix match. You don't need a * here.

If you want to block any .aspx page with any query string, e.g. block example.com/<anything>.aspx?<anything>, then use:
Disallow: /*.aspx?
The * is needed only in place of the page name.

Never use * at the end of the pattern.
Use * only near the beginning or in the middle of the pattern.

If you wanted to block requests for exactly example.com/page.aspx without query strings but allow the same page with query strings, you would use
Disallow: /page.aspx$
or
Disallow: /*.aspx$
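As a quick sanity check, here is a small, self-contained Python sketch (the regexes are my own hand-translations of the rules above, not output from any official tool) that runs jqwan's example URLs against each rule:

import re

# jqwan's example URLs plus one extra .aspx page for contrast.
urls = [
    "/page.aspx",
    "/page.aspx?color=blue",
    "/page.aspx?size=120&types=internal",
    "/other.aspx?sort=desc",
]

# Each Disallow value hand-translated to a regex: left-anchored prefix,
# * means "any run of characters", trailing $ means "end of the URL".
rules = {
    "Disallow: /page.aspx?": r"^/page\.aspx\?",   # this page, any query string
    "Disallow: /*.aspx?":    r"^/.*\.aspx\?",     # any .aspx page with a query string
    "Disallow: /page.aspx$": r"^/page\.aspx$",    # exactly this page, nothing after it
}

for rule, regex in rules.items():
    blocked = [u for u in urls if re.match(regex, u)]
    print(rule, "blocks", blocked)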
jqwan
Msg#: 4461252 posted 6:37 pm on Jun 4, 2012 (gmt 0)

Thanks g1smd, I think I understand now. Since it is all query strings that go with page.aspx, I will use Disallow: /page.aspx? and it will match all of the additional query strings added to it. Correct?

Thanks again I really appreciate your help!

g1smd
Msg#: 4461252 posted 6:42 pm on Jun 4, 2012 (gmt 0)

The pattern is a prefix match (matches from the left), so the rule
Disallow: /page.aspx?
matches any request that BEGINS example.com/page.aspx? with anything or nothing after the question mark.
jqwan
Msg#: 4461252 posted 6:52 pm on Jun 4, 2012 (gmt 0)

Thanks g1smd you have really helped a lot!

g1smd
Msg#: 4461252 posted 7:48 pm on Jun 4, 2012 (gmt 0)

The devil is in the details.

It's especially important to define "exactly" what you want to do in plain English before you even begin to think about any code.
