
Forum Moderators: goodroi

Robots.txt and regular expressions

Can they use them?

   
3:23 pm on Jun 4, 2012 (gmt 0)



I have read conflicting information on this all over the internet and even here in the forums. I am simply trying to get rid of a series of search pages with dynamic parameters and want to use robots.txt to do this. Can it accept any regex? SEOmoz (http://www.seomoz.org/learn-seo/robotstxt) says search engines will:

Pattern Matching

Google and Bing both honor two regular expressions that can be used to identify pages or sub-folders that a SEO wants excluded. These two characters are the asterisk (*) and the dollar sign ($).

* - which is a wildcard that represents any sequence of characters
$ - which matches the end of the URL

but most other sites I see say they will not.


Any ideas?
4:03 pm on Jun 4, 2012 (gmt 0)

lucy24 (WebmasterWorld Senior Member)



The * used in this way is not a Regular Expression, so be careful how you talk about it.

If you want to include directives that are intended for specific search engines, you can use any syntax that they say they recognize. But if you want an all-purpose "Which part of 'disallow' didn't you understand?" then stick to the minimalist form.

I don't know about Bing, but Google ignores "crawl-delay" even though I'm sure it understands it perfectly well.
4:57 pm on Jun 4, 2012 (gmt 0)



Yeah, I was talking about a page such as page.aspx that also appears as page.aspx/color=black and so on for further refinements. So I was going to add

Disallow: /page.aspx*

Does this sound correct?

Thanks!
5:27 pm on Jun 4, 2012 (gmt 0)

g1smd (WebmasterWorld Senior Member)



Never use * at the end of the pattern. The pattern "matches from the left" and is a "prefix match".

Use * only near the beginning or in the middle of the pattern.

Disallow: /this
disallows anything beginning example.com/this, so the * is not needed.

Disallow: /*that
disallows URL requests like example.com/<something-or-anything>that as a prefix.

The $ ending is needed only when you need an exact match.
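
For anyone who wants to experiment with these rules, here is a minimal Python sketch of the matching behaviour described in this post: a prefix match from the left, with * standing for any sequence of characters and a trailing $ forcing an exact end. The function names are made up for illustration, and this is only an approximation of the documented behaviour, not any search engine's actual code.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex (illustrative helper)."""
    # A trailing '$' means "the URL path must end here".
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        body += r"\Z"
    return re.compile(body)


def is_disallowed(pattern: str, path: str) -> bool:
    """Return True if `path` falls under a 'Disallow: <pattern>' rule."""
    # re.match only anchors at the start, which gives the prefix behaviour.
    return robots_pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    print(is_disallowed("/this", "/this-and-more/page.html"))   # prefix match
    print(is_disallowed("/this*", "/this-and-more/page.html"))  # same result, '*' adds nothing
    print(is_disallowed("/*that", "/something/that/else"))      # '*' fills the gap
    print(is_disallowed("/this$", "/this-and-more"))            # '$' demands an exact end
```

In this sketch a trailing * changes nothing because the match is already open-ended on the right, which is the reason for the advice never to end a pattern with *.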
5:41 pm on Jun 4, 2012 (gmt 0)



Thanks for the update, g1smd. However, with the example I gave above I would need to match whatever comes after page.aspx, and there isn't a real end that I can put on it because it is dynamic. So the page could be

page.aspx?color=blue
page.aspx?size=120
page.aspx?size=120&types=internal
Etc

Is there a way to match what I am talking about above with the *? Thanks!
5:47 pm on Jun 4, 2012 (gmt 0)

g1smd (WebmasterWorld Senior Member)



Do you need to match some query strings and not others?
Is this for one particular aspx page or for all aspx pages?

If you need to block all query strings for one particular .aspx page then the prefix match for disallowing example.com/page.aspx?<anything> is
Disallow: /page.aspx?

It's a prefix match. You don't need a * here.

If you want to block any .aspx page with any query string, e.g. block example.com/<anything>.aspx?<anything> then use:
Disallow: /*.aspx?

The * is needed only in place of the page name.

Never use * at the end of the pattern.
Use * only near the beginning or in the middle of the pattern.

If you wanted to block requests for exactly example.com/page.aspx without query strings but allow the same page with query strings you would use
Disallow: /page.aspx$

or, to do the same for every .aspx page,
Disallow: /*.aspx$
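
To check rules like these before deploying them, a small Python sketch along the following lines can list which sample paths each rule would block. It assumes the intended extension throughout is .aspx, uses the example URLs from this thread plus one made-up path (/other.aspx?size=120), and the helper names are hypothetical. It approximates the wildcard rules discussed here rather than reproducing any crawler's real matcher.

```python
import re


def matches(pattern: str, path: str) -> bool:
    """Hypothetical helper: wildcard-aware prefix match for a robots.txt pattern."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    # Literal pieces are escaped; each '*' becomes "any sequence of characters".
    regex = ".*".join(re.escape(piece) for piece in core.split("*"))
    if anchored:
        regex += r"\Z"  # trailing '$' requires the path to end here
    return re.match(regex, path) is not None


# Sample paths: three from the thread, one extra assumed for illustration.
PATHS = [
    "/page.aspx",
    "/page.aspx?color=blue",
    "/page.aspx?size=120&types=internal",
    "/other.aspx?size=120",
]

RULES = ["/page.aspx?", "/*.aspx?", "/page.aspx$", "/*.aspx$"]


def blocked_by(rule: str) -> list[str]:
    """Return the sample paths that a 'Disallow: <rule>' line would block."""
    return [path for path in PATHS if matches(rule, path)]


if __name__ == "__main__":
    for rule in RULES:
        print(f"Disallow: {rule}")
        for path in blocked_by(rule):
            print(f"    blocks {path}")
```

Running it against your own list of URLs is a cheap way to confirm that a rule catches the query-string variants without also catching the bare page.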
6:37 pm on Jun 4, 2012 (gmt 0)



Thanks g1smd, I think I understand now. Since all of the query strings go with page.aspx, I will use Disallow: /page.aspx? and it will match every query string added to it. Correct?

Thanks again I really appreciate your help!
6:42 pm on Jun 4, 2012 (gmt 0)

g1smd (WebmasterWorld Senior Member)



The pattern is a prefix match (matches from the left) so the rule
Disallow: /page.aspx?

matches any request that BEGINS example.com/page.aspx? with anything or nothing after the question mark.
6:52 pm on Jun 4, 2012 (gmt 0)



Thanks g1smd, you have really helped a lot!
7:48 pm on Jun 4, 2012 (gmt 0)

g1smd (WebmasterWorld Senior Member)



The devil is in the details.

It's especially important to define "exactly" what you want to do in plain English before you even begin to think about any code.
 
