WordPress Forum
Opinions About Content Of and Need For A WordPress Robots.txt File
Just how necessary or useful is a robots.txt file for a WP installation?
Webwork
msg:4243722 - 6:02 pm on Dec 17, 2010 (gmt 0)

Question 1: Here's the robots.txt file I'm currently using. To be perfectly honest I cannot explain the reason(s) behind every exclusion like a pro . . :(

Anyone see anything wrong with this list?

Robots.txt

User-agent: *
# Keep all crawlers out of WP core, admin, and script directories
Disallow: /cgi-bin/
Disallow: /cgibin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/
Disallow: /wp-login.php
Disallow: /wp-register.php
# Site-specific directories and boilerplate utility pages
Disallow: /images/
Disallow: /go/
Disallow: /privacy-policy/
Disallow: /comment-policy/
Disallow: /terms-of-service/
Disallow: /faq/
Disallow: /contact-form/
Disallow: /iframes/
# Any URL containing a query string (wildcard syntax; honored by the major engines, but not by all bots)
Disallow: /*?*

# Individual bots blocked from the entire site
User-agent: psbot
Disallow: /

User-agent: Xenu
Disallow: /

User-agent: ia_archiver
Disallow: /

User-agent: Baiduspider
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: Googlebot-Image
Disallow: /

Question 2: Is a robots.txt file, for a WordPress site, not really all that important for indexing or SEO purposes? I see so many versions of robots.txt, even amongst "the pros". (I also wonder if I'm actually seeing what the bots are seeing, or if the pros cloak their robots.txt files.)

Is it more a matter of "Don't be a dumbass by excluding bots from sections/content that OUGHT to be indexed"?

 

ergophobe
msg:4244607 - 9:22 pm on Dec 20, 2010 (gmt 0)

Hmmm... no takers?

So, a few comments:

1. Why disallow Googlebot-Image?
In the old days, lots of my visits came from image search. I think the new format, though, is cutting down on that a lot.

2. Disallow: /*?*

Why not let them in and then send them a 301 for those addresses?

3. Disallow: /faq/

Any particular reason you don't want your FAQ indexed?

bwnbwn
msg:4244623 - 9:47 pm on Dec 20, 2010 (gmt 0)

Disallow: /images/
Disallow: /privacy-policy/
Disallow: /comment-policy/
Disallow: /terms-of-service/

Why disallow these? To me, there's useful content on all of these pages.

ergophobe
msg:4244661 - 11:58 pm on Dec 20, 2010 (gmt 0)

Privacy policy and TOS are often just boilerplate and have nothing to do with the purpose of the site.

I'm assuming Webwork is doing this to avoid diluting the site purpose and keep things more tightly themed?

Images - I debated that one myself. What is that directory? Are they content images or theming images? Are they all part of a page somewhere?

Webwork
msg:4244682 - 12:54 am on Dec 21, 2010 (gmt 0)

I was having a "no one loves me" sort of moment there for a while :( :( :( . . . :P

I'm seeing evidence of leeching from images ~ hotlinking, scraping. I guess there may be some sites where having images inventoried provides more plus than minus; I'm just not certain for a few of mine. They're the type of images that I can imagine any number of sites would be happy to rip off and use. If they were a bit more specialized (not wide appeal), then I might see more on the plus side. (My kids were waaaay too cute . . before becoming teenagers . . :P)

FAQ, Privacy, Etc. -> Not really certain what they would add to a search engine's index. Pretty much the same on every site. Maybe I'm just being nice to the SEs? Not wasting bandwidth, too?

2. Disallow: /*?* -> Again, just a "nothing there for you to index from my POV" situation.

I'm not big on sculpting pagerank, link juice or whatever so none of the settings are for such purposes. Heck, you first have to have a bit of PR before you can even begin to worry about it . . . unless your thinking is to "gather all you can when starting out" . . . Hmmmm . . .

bwnbwn
msg:4244735 - 4:31 am on Dec 21, 2010 (gmt 0)

Webwork, I have a FAQ page that does very well in the SERPs. It's a really good page for landing the long-tail stuff, as there is so much content on the page you can go after. I would at least allow this page to be indexed.

Webwork
msg:4244847 - 2:02 pm on Dec 21, 2010 (gmt 0)

Interesting result bwnbwn. Food for thought.

Perhaps some FAQ pages are more "informative"?

As a whole, FAQ pages appear designed as an aid to doing business with website users -> more business procedure than content.

bwnbwn
msg:4244863 - 2:47 pm on Dec 21, 2010 (gmt 0)

I have over 100 questions linked to an answer page by footnote URL; it seems to work very well for this site.
Or, if it's a product page and you have the time and manpower, an FAQ for each product works wonders for providing a monster amount of content for the product, especially if it is a technical product.

ergophobe
msg:4244984 - 7:29 pm on Dec 21, 2010 (gmt 0)

>>I'm seeing evidence of leeching from images ~ hotlinking, scraping

Robots.txt in no way protects you from this.

>>Pretty much the same on every site

Then it's boilerplate? Probably safe to exclude. I have a "site" that is mostly just a FAQ - a New York Times article rewritten into FAQ form. It earns about $10/month in AdSense and has been sitting there with almost no change for a few years.

So it depends on what your FAQ is.

>>Disallow: /*?* -> Again, just a "nothing there for you to index from my POV" situation.

Sure, but if somehow, some user somewhere has you linked that way, and Google comes a-crawlin', why not let it in, and then 301 the request to your canonical page when it arrives? Actually, with WP, that should happen anyway.

Fire up Live HTTP Headers and request a page in the example.com/?p=123 form and see what happens. If it doesn't send a 301, fix it. If it does, let Google crawl it and serve a 301 or 404 as need be to force an index update.
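
You can script that same check, too. Here's a minimal sketch in Python using the requests library (example.com and the sample URLs are placeholders - substitute your own):

import requests

# Raw "?p=" URLs that ought to 301 to the pretty permalinks
test_urls = [
    "http://example.com/?p=123",
    "http://example.com/?page_id=456",
]

for url in test_urls:
    # allow_redirects=False shows the raw status instead of the final page
    resp = requests.get(url, allow_redirects=False, timeout=10)
    print(resp.status_code, url, "->", resp.headers.get("Location", "-"))
    # A 301 to the canonical permalink is the healthy case; a 200 means
    # the duplicate URL is served directly and could get indexed.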

I don't see the upside of stopping the Googlebot unless your 301s are faulty.

bwnbwn
msg:4244999 - 8:09 pm on Dec 21, 2010 (gmt 0)

ergophobe, I was finally able to fix this (the example.com/?p=123 form) on our CMS site; the rewrite rule was complex, but I have it done. I agree that doing it in robots.txt won't stop the link from throwing a 200, but then again, I have had the pleasure of diving into the WordPress mess. I am finding it very difficult to get this, or for that matter much of anything, done outside of their custom stuff.

I have a strange feeling Webwork is having this same issue, so he is using the robots.txt file as a possible fix. I have busted the site a couple of times already, so this may be my only option with WordPress.

That said, would Webwork's robots.txt work so that, even though the URL throws a 200, the link would not be indexed or cause duplicate content issues?

Webwork
msg:4245020 - 9:15 pm on Dec 21, 2010 (gmt 0)

>>Robots.txt in no way protects you from this

My thoughts are that some % of leeches use Google image search to find images "to reuse". If the images aren't "in the index", then that makes hunting for images a bit more of a challenge. It's apparently a trade-off of good (search images, visit site) versus bad (use image search to rip off images). I suspect it's an issue everyone with decent original images faces. It's not clear how the "solutions cluster": what % block image indexing, what % allow, what % allow and attempt to "hassle" image theft with watermarks, etc. Probably those with better data can better evaluate the risks vs. rewards.

For now, until I work out how to "otherwise protect" original "quality images", including filing for copyright on some, I'm taking the low-tech approach of simply making them harder to discover. May well take a traffic hit.
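
If the main goal is stopping hotlinking, a server-side referer check is another low-tech option. A minimal Apache .htaccess sketch - assuming Apache with mod_rewrite enabled, with example.com standing in for the real domain:

RewriteEngine On
# Let requests with no referer through (direct visits, some proxies)
RewriteCond %{HTTP_REFERER} !^$
# Let requests referred from this site itself through
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
# Refuse image requests arriving from any other referer
RewriteRule \.(jpe?g|png|gif)$ - [F,NC]

Note this only blocks hotlinking; scrapers that download images directly send no referer and pass straight through.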

Decisions. Decisions. Bleh. Argh. Ugh. ;)

ergophobe
msg:4245095 - 6:32 am on Dec 22, 2010 (gmt 0)

The thing is, won't Google image search index those images based on the fact that they appear on the page, without ever crawling the image directory? Hmm... maybe not, since I guess it wouldn't be able to see the image and wouldn't follow the src link.

On the other hand, if you watermark your images with your URL, maybe you'd like to get them hotlinked. It's worked wonders for the Church of the Flying Spaghetti Monster (it's an explicit strategy of theirs to encourage hotlinking).

Webwork
msg:4245174 - 2:03 pm on Dec 22, 2010 (gmt 0)

Ya, I really need to pick my answer/strategy . . without all my usual "too much research and analysis". ;)

Argh! "Not thinking" is HARD work for me. ;-/ :P & :(

Robert Charlton
msg:4272849 - 9:32 am on Feb 26, 2011 (gmt 0)

There was a time last July-Aug when there was a lot of discussion in Google SEO News about PR sculpting, and I made several long posts on these threads just to try to cover all bases. Not sure how well I did, but moments in these discussions might provide food for thought....

Robots.txt vs. meta robots noindex?
http://www.webmasterworld.com/google/4187554.htm [webmasterworld.com]

PR Sculpting Doesn't Work and Internal NoFollow Can Harm Your Site
http://www.webmasterworld.com/google/4161726.htm [webmasterworld.com]

Essentially, robots.txt would be sending your PR (and associated inbound link benefits) down a black hole... throwing it away. I'd perhaps use it to block a directory of search results, an https subdomain, or a directory of known dupe content on a site so large you need to conserve crawl budget, where you're not draining off link juice from too many pages.
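
For the search-results case, a minimal robots.txt fragment might look like the following (this assumes WordPress's default ?s= search URLs - adjust the paths to your own setup):

User-agent: *
# Internal search results: thin, near-duplicate pages
Disallow: /search/
Disallow: /*?s=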

Otherwise, if you must block a page, I think meta robots noindex,follow is the way to go.
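
For reference, that's a tag placed in the <head> of each page you want kept out of the index:

<meta name="robots" content="noindex,follow">

The page can still be crawled, and it passes link value on through its links (the "follow" part), but it won't appear in the results.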

I feel that theming is something that should be determined by your nav structure.

I'm more and more trusting that Google's reasonable surfer model keeps utility pages like privacy policy and terms of service, etc., from receiving too much link juice and from exerting a large amount of influence on the site.
