
Tip: Watch your robots.txt - a small change can have a big impact


doritoz

4:31 pm on Jul 2, 2008 (gmt 0)

10+ Year Member



We recently cleaned up our site by moving our scripts into a directory simply labeled "w". To prevent Google (or the other engines) from indexing this directory, we added it to the robots.txt:
User-agent: *
Disallow: /w

About a week later, we noticed one of our pages (named widgets-blue.php) wasn't cached by Google. It was still showing up in the results, but no cache was available.

The next day, in Google Webmaster Tools there was an error in our XML sitemap. The error stated that widgets-blue.php couldn't be crawled because it was restricted by robots.txt.

After a little panicking, we discovered that all of our pages beginning with "w" had been dropped from the cache. Robots.txt became the prime suspect. We added a closing "/" to the entry, and a day later the errors were gone.
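The surprise here is that a Disallow path is a prefix match. A quick way to see the difference (a sketch using Python's urllib.robotparser, with a hypothetical domain and file names standing in for the real site):

```python
from urllib.robotparser import RobotFileParser

def blocked(rules, path):
    """Return True if the given path is disallowed for all user agents."""
    rp = RobotFileParser()
    rp.parse(rules.splitlines())
    return not rp.can_fetch("*", "https://example.com" + path)

# Disallow: /w is a prefix match, so it blocks more than the directory.
loose = "User-agent: *\nDisallow: /w"
print(blocked(loose, "/w/script.js"))       # True
print(blocked(loose, "/widgets-blue.php"))  # True -- the surprise

# The trailing slash restricts the rule to the directory itself.
strict = "User-agent: *\nDisallow: /w/"
print(blocked(strict, "/w/script.js"))       # True
print(blocked(strict, "/widgets-blue.php"))  # False
```

Running this against both versions of the file shows exactly why widgets-blue.php disappeared from the cache and why the closing slash fixed it.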

I mention this story because, in my decade-plus of doing this, I've always thought that paths listed in robots.txt had to be exact matches, and that wildcard entries were marked with an asterisk (*). This is wrong: Disallow paths are prefix matches. (Under the original standard, the asterisk is only allowed in the User-agent line; Google supports it in paths as an extension.)

The moral is be watchful and careful. This is a simple technology I thought I had mastered years ago. I don't know if I confused it with another language or I learned it wrong to begin with.

It doesn't help that robots.txt is so rarely altered. Sometime next year I'll have forgotten this episode and will have to re-research whether I should add a closing slash.

g1smd

8:16 pm on Jul 2, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The matching is from the left, so it matches anything that begins with /pattern.

A wildcard, as in /*pattern or /pattern1*pattern2, stands in for any run of characters, so the match allows any beginning (or a restricted beginning) that must then be followed by pattern.

The * is never needed on the right.

You can use $ to signify "must end with", though.

.

/123 matches /123 and /123/ and /1234 and /123/456

/123/ matches /123/ and /123/456

/*abc matches /123abc and /123/abc and /123abc456 and /123/abc/456

/123*xyz matches /123qwertyxyz and /123/qwerty/xyz/789

/123$ matches ONLY /123

/*abc$ matches /123abc and /123/abc but NOT /123/abc/x etc.

.
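Those rules can be sketched as a small Python matcher (an illustration of the semantics described above, not Google's actual implementation): * becomes "any run of characters", a trailing $ anchors the end, and everything else is a literal prefix match.

```python
import re

def robots_match(pattern, path):
    """Google-style robots.txt path matching: * matches any run of
    characters, a trailing $ means 'must end here', and a pattern
    without $ is a prefix match (anchored at the start only)."""
    regex = ""
    for ch in pattern:
        if ch == "*":
            regex += ".*"
        elif ch == "$":
            regex += "$"
        else:
            regex += re.escape(ch)
    return re.match(regex, path) is not None

# The examples above, as checks:
assert robots_match("/123", "/1234")
assert robots_match("/123/", "/123/456")
assert not robots_match("/123/", "/1234")
assert robots_match("/*abc", "/123/abc/456")
assert robots_match("/123*xyz", "/123/qwerty/xyz/789")
assert robots_match("/123$", "/123")
assert not robots_match("/123$", "/1234")
assert robots_match("/*abc$", "/123/abc")
assert not robots_match("/*abc$", "/123/abc/x")
```

Every assertion corresponds to one line in the list above, which makes it easy to test a rule before putting it in a live robots.txt.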

I have been caught out before.

This thread [webmasterworld.com...] contains some good advice direct from several Google staffers.

[edited by: g1smd at 8:30 pm (utc) on July 2, 2008]

tedster

8:20 pm on Jul 2, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for that story, doritoz. I've been called by site owners in a panic, only to find that they'd just done something odd with their robots.txt. Some hackers like to mess with it, too. The robots.txt file is now part of my standard health checkup for any website I work with.

I think the GWT robots.txt tool is an excellent offering. Not only does it validate the syntax, it helps you understand if your rules are actually doing what you intended them to do.

doritoz

8:57 pm on Jul 2, 2008 (gmt 0)

10+ Year Member



Excellent summary and thread, g1smd. Thanks!

I agree, tedster. Before this week I considered the robots.txt tool remedial, but it's proved invaluable.