Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Robots.txt and regular expressions
Can they use them?
jqwan




msg:4461254
 3:23 pm on Jun 4, 2012 (gmt 0)

I have read conflicting stuff on this all over the internet and even here in the forums. I am simply trying to get rid of a series of search pages with dynamic parameters and want to use robots.txt to do this. Can it accept any regex? On SEOmoz (http://www.seomoz.org/learn-seo/robotstxt) it says they will:

Pattern Matching

Google and Bing both honor two regular expressions that can be used to identify pages or sub-folders that a SEO wants excluded. These two characters are the asterisk (*) and the dollar sign ($).

* - which is a wildcard that represents any sequence of characters
$ - which matches the end of the URL

but most other sites I see say they will not.


Any ideas?

 

lucy24




msg:4461265
 4:03 pm on Jun 4, 2012 (gmt 0)

The * used in this way is not a Regular Expression, so be careful how you talk about it.

If you want to include directives that are intended for specific search engines, you can use any syntax that they say they recognize. But if you want an all-purpose "Which part of 'disallow' didn't you understand?" then stick to the minimalist form.

I don't know about Bing, but Google ignores "crawl-delay" even though I'm sure it understands it perfectly well.

jqwan




msg:4461291
 4:57 pm on Jun 4, 2012 (gmt 0)

Yeah, I was talking about a page like page.aspx, where the page then has page.aspx/color=black and so on for more refinements. So I was going to add

Disallow: /page.aspx*

Does this sound correct?

Thanks!

g1smd




msg:4461303
 5:27 pm on Jun 4, 2012 (gmt 0)

Never use * at the end of the pattern. The pattern "matches from the left" and is a "prefix match".

Use * only near the beginning or in the middle of the pattern.

Disallow: /this
disallows anything beginning example.com/this so the * is not needed.

Disallow: /*that
disallows URL requests like example.com/<something-or-anything>that as a prefix.

The $ ending is needed only when you need an exact match.
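
To make the prefix-match behaviour concrete, here is a minimal Python sketch that translates a pattern into a regular expression and tests it against a path. It assumes the Google/Bing-style reading described above (* is any run of characters, a final $ pins the end of the URL); the function names are made up for illustration, and this is not how any particular crawler is actually implemented.

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex (illustrative only).

    Assumed semantics:
      *  matches any run of characters, including none
      $  as the last character anchors the match to the end of the URL
    Everything else is treated as a literal character.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then rejoin with ".*" for every wildcard.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def is_blocked(pattern: str, path: str) -> bool:
    # re.match anchors at the start only, which gives the prefix behaviour.
    return robots_pattern_to_regex(pattern).match(path) is not None

print(is_blocked("/this", "/this-and-more"))        # plain prefix match
print(is_blocked("/this*", "/this-and-more"))       # trailing * changes nothing
print(is_blocked("/*that", "/anything/that.html"))  # * in the middle of the pattern
print(is_blocked("/this$", "/this-and-more"))       # $ demands an exact end
```

The first two calls give the same answer for any path, which is why the trailing * is redundant.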
jqwan




msg:4461306
 5:41 pm on Jun 4, 2012 (gmt 0)

Thanks for the update, g1smd. However, with the example I put above I would need to match whatever comes after page.aspx, and there isn't a real end that I can put on it because it is dynamic. So the page could be

page.aspx?color=blue
page.aspx?size=120
page.aspx?size=120&types=internal
Etc

Is there a way to match what I am talking about above with the *? Thanks!

g1smd




msg:4461310
 5:47 pm on Jun 4, 2012 (gmt 0)

Do you need to match some query strings and not others?
Is this for one particular aspx page or for all aspx pages?

If you need to block all query strings for one particular .aspx page, then the prefix match for disallowing example.com/page.aspx?<anything> is
Disallow: /page.aspx?
It's a prefix match. You don't need a * here.

If you want to block any .aspx page with any query string, e.g. block example.com/<anything>.aspx?<anything>, then use:
Disallow: /*.aspx?
The * is needed only in place of the page name.

Never use * at the end of the pattern.
Use * only near the beginning or in the middle of the pattern.

If you wanted to block requests for exactly example.com/page.aspx without query strings but allow the same page with query strings, you would use
Disallow: /page.aspx$
or, to apply the same rule to every .aspx page,
Disallow: /*.aspx$
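
As a rough check of the four rules above, the following Python sketch runs each one against a handful of sample paths. It assumes the same wildcard semantics discussed in this thread (* for any characters, a final $ for end of URL); the /shop/list.aspx paths are hypothetical additions used only to show the difference between a one-page rule and an every-page rule.

```python
import re

def blocked(rule: str, path: str) -> bool:
    """Prefix-match `path` against a robots.txt rule using * and a final $."""
    exact = rule.endswith("$")
    if exact:
        rule = rule[:-1]
    # Literal pieces are escaped; each * becomes ".*".
    regex = ".*".join(map(re.escape, rule.split("*"))) + (r"\Z" if exact else "")
    return re.match(regex, path) is not None

rules = ["/page.aspx?", "/*.aspx?", "/page.aspx$", "/*.aspx$"]
paths = [
    "/page.aspx",
    "/page.aspx?color=blue",
    "/page.aspx?size=120&types=internal",
    "/shop/list.aspx?size=120",  # hypothetical second page
    "/shop/list.aspx",           # hypothetical second page, no query string
]

for rule in rules:
    hits = [p for p in paths if blocked(rule, p)]
    print(f"Disallow: {rule:<12} blocks {hits}")
```

Under these assumptions, the ? rules catch only requests that carry a query string, while the $ rules catch only the bare page.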
jqwan




msg:4461328
 6:37 pm on Jun 4, 2012 (gmt 0)

Thanks g1smd, I think I understand now. Since it is all query strings that go with page.aspx, I will use Disallow: /page.aspx? and it will match all of the additional query strings added to it. Correct?

Thanks again I really appreciate your help!

g1smd




msg:4461330
 6:42 pm on Jun 4, 2012 (gmt 0)

The pattern is a prefix match (matches from the left), so the rule
Disallow: /page.aspx?
matches any request that BEGINS example.com/page.aspx? with anything or nothing after the question mark.
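
Because this rule has no wildcard, the check reduces to a plain string prefix test on the path plus query string. The short Python sketch below illustrates that; the helper name is invented for this example, and it assumes simple URLs with no fragment and an explicit path.

```python
def path_with_query(url: str) -> str:
    """Strip scheme and host, keeping the path and any '?query' verbatim."""
    rest = url.split("://", 1)[-1]            # drop "http://" if present
    _host, _slash, tail = rest.partition("/")
    return "/" + tail

RULE = "/page.aspx?"  # no wildcard, so a plain prefix test is enough

for url in [
    "http://example.com/page.aspx?",
    "http://example.com/page.aspx?color=blue",
    "http://example.com/page.aspx",
    "http://example.com/other.aspx?color=blue",
]:
    verdict = "blocked" if path_with_query(url).startswith(RULE) else "allowed"
    print(f"{verdict:8} {url}")
```

Only the first two URLs begin with the rule text, so only those are treated as blocked; the bare page.aspx request has no question mark and falls outside the rule.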
jqwan




msg:4461332
 6:52 pm on Jun 4, 2012 (gmt 0)

Thanks g1smd you have really helped a lot!

g1smd




msg:4461346
 7:48 pm on Jun 4, 2012 (gmt 0)

The devil is in the details.

It's especially important to define "exactly" what you want to do in plain English before you even begin to think about any code.

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved