disallow: /*?*
Disallow: /wp-*
Disallow: *?replytocom
although googlebot and others support wildcards as an extension, the original robots exclusion protocol specifically does not support globbing or wildcarding in paths, so don't expect those *s to work everywhere.
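as a rough illustration of a parser that does not expand wildcards, here is a minimal sketch using python's standard library urllib.robotparser, which compares each rule as a literal path prefix. the "User-agent: *" line, the bot name "SomeBot" and the example.com urls are assumptions added for the sketch, not part of the file under discussion.

```python
from urllib.robotparser import RobotFileParser

# The three wildcard rules quoted above, placed in a catch-all group.
# The "User-agent: *" line is an assumption added for this sketch.
rules = """\
User-agent: *
disallow: /*?*
Disallow: /wp-*
Disallow: *?replytocom
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Hypothetical URLs that a wildcard-aware crawler would treat as excluded.
urls = (
    "http://example.com/page?x=1",
    "http://example.com/wp-admin/",
    "http://example.com/post?replytocom=5",
)

# This parser treats each rule as a literal prefix and does not expand
# the asterisks, so can_fetch() reports whether the URL is still crawlable.
for url in urls:
    print(url, parser.can_fetch("SomeBot", url))
```

with this parser none of the three rules should match, so every url is reported as fetchable. a crawler built on similar prefix-only logic would crawl paths you thought you had excluded.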
disallow: /*?*
although googlebot and others may allow exceptions, the robots exclusion protocol describes the directive as an upper-cased "Disallow:" and uses that exclusively in examples, so don't expect the lower-cased "disallow:" to work everywhere.
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/
disallow: /*?*
Disallow: /wp-*
Disallow: *?replytocom
assuming the globbing/wildcarding support offered by googlebot and others, the first three lines quoted above are redundant given the fifth line (Disallow: /wp-*), and the last line is redundant given the fourth line (disallow: /*?*).
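the redundancy can be checked with a small sketch of wildcard matching, assuming the commonly documented extension semantics: "*" matches any run of characters, a trailing "$" anchors the end of the url, and everything else is a literal prefix. the helper names rule_to_regex and matches and the sample paths are hypothetical, and this is not any crawler's actual implementation.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt rule path into a compiled regex.

    Assumed semantics: '*' matches any run of characters, a trailing '$'
    anchors the end of the path, everything else is matched literally.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' for each asterisk.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(rule: str, path: str) -> bool:
    # re.match anchors at the start of the path, mirroring prefix matching.
    return rule_to_regex(rule).match(path) is not None


# Paths covered by the three explicit directory rules are also covered
# by the single wildcard rule "/wp-*".
for path in ("/wp-admin/", "/wp-includes/js/x.js", "/wp-content/uploads/a.png"):
    print(path, matches("/wp-*", path))

# A path matched by "*?replytocom" contains a "?", so "/*?*" matches it too.
sample = "/post?replytocom=5"
print(sample, matches("*?replytocom", sample), matches("/*?*", sample))
```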
User-agent: *
Disallow: /wp-content/
User-agent: Googlebot-Image
Allow: /wp-content/uploads/
"Allow:" is an extension to the robots exclusion protocol supported by google, so you're ok in this specific case.
however, the crawler is going to find the group with the most specific matching user agent and respect only the rules in that group.
therefore if you want any directives from the general group (User-agent: *) to apply to a more specific group (e.g. User-agent: Googlebot-Image) you will have to repeat all those rules within the more specific group.
for example if you want Googlebot-Image crawling /wp-content/uploads/ but nothing else in /wp-content/, you will need something more like this:
User-agent: Googlebot-Image
Disallow: /wp-content/
Allow: /wp-content/uploads/
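the group selection described above can be observed with python's standard library parser, as a sketch only. it is fed the original two groups (without the repeated Disallow), and the bot name "SomeBot" and the theme url are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The original two groups: the Googlebot-Image group has only an Allow
# line and does not repeat the general Disallow.
original = """\
User-agent: *
Disallow: /wp-content/

User-agent: Googlebot-Image
Allow: /wp-content/uploads/
"""

parser = RobotFileParser()
parser.parse(original.splitlines())

# Hypothetical URL inside /wp-content/ but outside /wp-content/uploads/.
theme_url = "http://example.com/wp-content/themes/style.css"

# A bot with no group of its own falls back to the "*" group.
generic_allowed = parser.can_fetch("SomeBot", theme_url)

# Googlebot-Image uses only its own group, which has no Disallow lines.
image_bot_allowed = parser.can_fetch("Googlebot-Image", theme_url)

print("SomeBot:", generic_allowed)
print("Googlebot-Image:", image_bot_allowed)
```

the generic bot should be refused while Googlebot-Image is allowed, because nothing in its own group excludes the path. one caveat: this parser applies rules in file order, whereas google documents a longest-match precedence, so it is not a reliable way to evaluate a group that mixes Disallow and Allow on overlapping paths.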
User-agent: Mediapartners-Google*
Allow: /
User-agent: Adsbot-Google
Allow: /
User-agent: Googlebot-Mobile
Allow: /
continuing with the behavior described above about respecting the most specific group, if you really want these 3 user agent strings to exclude nothing (i.e. crawl everything), you might as well follow general protocol and do this:
User-agent: Mediapartners-Google*
Disallow:
User-agent: Adsbot-Google
Disallow:
User-agent: Googlebot-Mobile
Disallow:
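a quick sketch of the empty "Disallow:" convention, again using python's standard library parser. the general group, the bot name "SomeBot" and the url are assumptions added for illustration.

```python
from urllib.robotparser import RobotFileParser

# One general group that excludes a directory, plus one specific group
# whose empty Disallow excludes nothing.
rules = """\
User-agent: *
Disallow: /wp-content/

User-agent: Adsbot-Google
Disallow:
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Hypothetical URL under the directory excluded by the general group.
url = "http://example.com/wp-content/plugins/file.js"

ads_allowed = parser.can_fetch("Adsbot-Google", url)
other_allowed = parser.can_fetch("SomeBot", url)

print("Adsbot-Google:", ads_allowed)
print("SomeBot:", other_allowed)
```

the empty value is read as "nothing is excluded", so Adsbot-Google should be allowed everywhere while other bots still follow the general group.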
here's another thing to consider.
if your purpose is to preserve crawl budget and/or reduce bandwidth usage, then excluding crawlers from all those directories and wildcarded paths is the way to go.
however this won't keep those urls out of the index.
if a url is discovered and you have excluded it from being crawled, the url will appear in google SERPS with the following description in the snippet:
A description for this result is not available because of this site's robots.txt – learn more.
from google's help page (http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449):
To entirely prevent a page's contents from being listed in the Google web index even if other sites link to it, use a noindex meta tag or x-robots-tag.
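as one possible way to send the x-robots-tag mentioned above, here is a minimal WSGI sketch in python. the app function, the NOINDEX_PREFIXES tuple and the placeholder body are hypothetical; a real site would more likely set the header in its web server or CMS configuration. note that a crawler has to be able to fetch the url to see the header, so such urls should not also be excluded in robots.txt.

```python
# Hypothetical path prefixes whose responses should stay out of the index.
NOINDEX_PREFIXES = ("/wp-content/",)


def app(environ, start_response):
    """Minimal WSGI app that adds 'X-Robots-Tag: noindex' to selected paths."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    # str.startswith accepts a tuple and matches any of its prefixes.
    if path.startswith(NOINDEX_PREFIXES):
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"placeholder body\n"]


if __name__ == "__main__":
    # Serve locally for a manual check, e.g. with: curl -I localhost:8000/wp-content/x
    from wsgiref.simple_server import make_server

    with make_server("", 8000, app) as server:
        server.serve_forever()
```

for html pages, the equivalent in the document itself is a `<meta name="robots" content="noindex">` tag in the head.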
finally, for security reasons you should be sure that your server is properly configured for CGI, since you don't want anyone looking directly at scripts in your /cgi-bin/ directory.
robots.txt is for excluding well-behaved bots from crawling resources but it doesn't do anything about authenticating visitors or blocking requests.
you should also keep that in mind for other sensitive areas such as /wp-admin/.
http://example.com/robots.txt is publicly readable and acts as a "honeypot" for malicious probes, so listing sensitive paths in it may expose your vulnerabilities.