|Opinions About Content Of and Need For A Wordpress Robots.txt File|
Just how necessary or useful is a robots.txt file for a WP installation?
Question 1: Here's the robots.txt file I'm currently using. To be perfectly honest, I cannot explain the reason(s) behind every exclusion like a pro . . :(
Anyone see anything wrong with this list?
Question 2: Is a robots.txt file, for a WordPress site, not really all that important for indexing or SEO purposes? I see so many versions of robots.txt, even amongst "the pros". (I also wonder if I'm actually seeing what the bots are seeing or if the pros cloak their robots.txt file.)
Is it more a matter of "Don't be a dumbass by excluding bots from sections/content that OUGHT to be indexed"?
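For readers who don't have the file in front of them, a commonly seen WordPress robots.txt of this sort looks something like the sketch below. This is illustrative only, not Webwork's actual file, and example.com is a placeholder:

```
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /trackback/
Disallow: /*?*

Sitemap: http://example.com/sitemap.xml
```

The discussion that follows turns on which of these exclusions actually earn their keep.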
Hmmm... no takers?
So a few comments
1. Why disallow Googlebot-Image?
In the old days, lots of my visits came from image search. I think the new format, though, is cutting down on that a lot.
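For anyone following along, the directive under discussion is a per-bot block, which in robots.txt looks something like this (illustrative only):

```
User-agent: Googlebot-Image
Disallow: /
```

Because it names a specific user-agent, it keeps Google's image crawler out entirely while leaving the regular Googlebot rules untouched.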
2. Disallow: /*?*
Why not let them in and then send them a 301 for those addresses?
3. Disallow: /faq/
Any particular reason you don't want your FAQ indexed?
Why disallow these? To me, there's useful content on all these pages.
I'm assuming Webwork is doing this to avoid diluting the site purpose and keep things more tightly themed?
images - debated that one myself. What is that directory? Are they content images or theming images? Are they all part of a page somewhere?
I was having a "no one loves me" sort of moment there for a while :( :( :( . . . :P
I'm seeing evidence of leeching from images ~ hotlinking, scraping. I guess there may be some sites where having images inventoried provides more plus than minus. Just not certain for a few of mine. They're the type of images that I can imagine any number of sites would be happy to rip off and use. If they were a bit more specialized (not wide appeal) then I might see more on the plus side. (My kids were waaaay too cute . . before becoming teenagers . . :P)
FAQ, Privacy, Etc. -> Not really certain what they would add to a search engine's index. Pretty much the same on every site. Maybe I'm just being nice to the SEs? Not wasting bandwidth, too?
2. Disallow: /*?* -> Again, just a "nothing there for you to index from my POV" situation.
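Worth noting: the original robots.txt standard had no wildcards at all, so `Disallow: /*?*` relies on the Googlebot extension where `*` matches any character sequence and a trailing `$` anchors the end of the URL. A minimal sketch of that matching logic (my own illustration of the rule, not Google's actual code):

```python
import re

def robots_pattern_matches(pattern: str, path: str) -> bool:
    """Return True if a Google-style robots.txt pattern matches a URL path.

    '*' matches any sequence of characters; a trailing '$' anchors the
    end of the path; otherwise patterns are prefix matches.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ""
    for ch in body:
        regex += ".*" if ch == "*" else re.escape(ch)
    if anchored:
        regex += "$"
    # re.match anchors at the start of the path, giving prefix semantics
    return re.match(regex, path) is not None
```

So `/*?*` blocks any path containing a query string, e.g. `/page?p=123`, but leaves clean permalinks like `/faq/` alone.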
I'm not big on sculpting pagerank, link juice or whatever so none of the settings are for such purposes. Heck, you first have to have a bit of PR before you can even begin to worry about it . . . unless your thinking is to "gather all you can when starting out" . . . Hmmmm . . .
Webwork, I have a FAQ page that does very well in the SERPs. This is a really good page for landing the longtail stuff, as there is so much content on this page you can go after. I would at least allow this page to be indexed.
Interesting result bwnbwn. Food for thought.
Perhaps some FAQ pages are more "informative"?
As a whole, FAQ pages appear designed as an aid to doing business with website users -> more business procedure than content.
I have over 100 questions linked to an answer page by footnote URL; it seems to work very well for this site.
Or, if this is a product page and you have the time and manpower, an FAQ for each product works wonders in providing a monster amount of content for the product, especially if it is a tech product.
>>I'm seeing evidence of leeching from images ~ hotlinking, scraping
Robots.txt in no way protects you from this.
>>Pretty much the same on every site
Then it's boilerplate? Probably safe to exclude. I have a "site" that is mostly just a FAQ which is a rewritten New York Times article into FAQ form. It earns about $10/month in Adsense and has been sitting there with almost no change for a few years.
So it depends on what your FAQ is.
>>Disallow: /*?* -> Again, just a "nothing there for you to index from my POV" situation.
Sure, but if somehow, some user somewhere has you linked that way, and Google comes a crawlin, why not let it in, and then 301 the request when it arrives to your canonical page? Actually, with WP, that should happen anyway.
Fire up Live HTTP Headers and request a page in the example.com/?p=123 form and see what happens. If it doesn't send a 301, fix it. If it does, let Google crawl it and give it a 301 or 404 as needed to force an index update.
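The same check can be scripted instead of eyeballed in a browser extension. A sketch using only the Python standard library, with example.com/?p=123 as the thread's placeholder URL and `is_canonical_redirect` as a hypothetical helper name:

```python
import urllib.error
import urllib.request

class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Suppress automatic redirect following so we can inspect the 301."""
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None

def fetch_status(url: str):
    """Request url without following redirects; return (status, Location)."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        resp = opener.open(url)
        return resp.status, resp.headers.get("Location")
    except urllib.error.HTTPError as e:
        # 3xx responses surface as HTTPError because we refused to follow them
        return e.code, e.headers.get("Location")

def is_canonical_redirect(status, location, canonical: str) -> bool:
    """True if the response is a 301 pointing at the expected permalink."""
    return status == 301 and location == canonical
```

Usage would be along the lines of `fetch_status("http://example.com/?p=123")`; if the status comes back 200 instead of 301, that's the broken case being discussed here.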
I don't see the upside of stopping the Googlebot unless your 301s are faulty.
ergophobe, I was finally able to fix this on our CMS site for example.com/?p=123 URLs, despite the complex rewrite rule, and I have it done now. I agree doing it in robots.txt won't stop the link from throwing a 200, but then again I have had the pleasure of diving into the WordPress mess. I am finding it very difficult to get this, or for that matter much of anything, done outside of using their custom stuff.
I have a strange feeling Webwork is having this same issue, so he is using the robots.txt file as a possible fix. I have busted the site a couple of times already, so this may be my only option with WordPress.
That said, would Webwork's robots.txt work, so that even though the link threw a 200 it would not be indexed or cause duplicate content issues?
|Robots.txt in no way protects you from this|
My thoughts are that some % of leeches use Google image search to find images "to reuse". If the images aren't "in the index" then that makes hunting for images a bit more of a challenge. It's apparently a trade off of good (search images, visit site) versus bad (use image search to rip off images). I suspect it's an issue everyone with decent original images faces. It's not clear how the "solutions cluster": what % block image indexing, what % allow, what % allow and attempt to "hassle" image theft with watermarks, etc.. Probably those with better data can better evaluate the risks vs. rewards.
For now, until I work out how to "otherwise protect" original "quality images", including filing for copyright on some, I'm taking the low tech approach by simply making them harder to discover. May well take a traffic hit.
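One low-tech middle ground, if the server runs Apache with mod_rewrite, is .htaccess hotlink protection rather than hiding the images from the index entirely. A sketch, with example.com standing in for your own domain:

```
RewriteEngine On
# Allow empty referers (direct requests, some proxies, many feed readers)
RewriteCond %{HTTP_REFERER} !^$
# Forbid image requests whose referer is not your own site
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
RewriteRule \.(jpe?g|png|gif)$ - [F,NC]
```

This blocks hotlinking while still letting the images be crawled and found in image search; it does nothing against scrapers who download and re-host the files.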
Decisions. Decisions. Bleh. Argh. Ugh. ;)
The thing is, won't Google image search index those images based on the fact that they appear on the page, without ever crawling the image directory? Hmm... maybe not, since I guess it wouldn't be able to see the image and wouldn't follow the src link.
On the other hand, if you watermark your images with your URL, maybe you'd like to get them hotlinked. It's worked wonders for the Church of the Flying Spaghetti Monster (it's an explicit strategy of theirs to encourage hotlinking).
Ya, I really need to pick my answer/strategy . . without all my usual "too much research and analysis". ;)
Argh! "Not thinking" is HARD work for me. ;-/ :P & :(
There was a time last July-Aug when there was a lot of discussion in Google SEO News about PR sculpting, and I made several long posts on these threads just to try to cover all bases. Not sure how well I did, but moments in these discussions might provide food for thought....
Robots.txt vs. meta robots noindex?
PR Sculpting Doesn't Work and Internal NoFollow Can Harm Your Site
Essentially, robots.txt would be sending your PR (and associated inbound link benefits) down a black hole... throwing it away. I'd perhaps use it to block a directory of search results, an https subdomain, or a directory of known dupe content on a site so large you need to conserve crawl budget, where you're not draining off link juice from too many pages.
Otherwise, if you must block a page, I think meta robots noindex,follow is the way to go.
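For concreteness, that amounts to a single tag in the page's `<head>` (most WordPress SEO plugins can emit it per page for you):

```
<meta name="robots" content="noindex,follow">
```

Unlike a robots.txt block, the crawler still fetches the page and follows its links, so inbound link benefit keeps flowing through; the page just stays out of the index.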
I feel that theming is something that should be determined by your nav structure.