Here's another issue. Even though a person may have a valid robots.txt file that does not block Google or any other desired search engine (S/E), Google refuses to provide a list of the IP numbers that its googlebot robots operate from. Their feeble comment implies that they have too many IP numbers, or that the numbers change. That is no excuse: such a list could be the "currently used IP numbers", and it could be updated whenever they change the IP number that any particular googlebot robot is working from.
In the case of my own site and a few others that I manage, because bad bots don't always obey the robots.txt file, it was decided to use a better control against bad bots, email harvesters, code thieves, and other unwanted visitors: the .htaccess file (that's dot htaccess). With it, we block on unwanted "referrer" strings, on unwanted User Agents (bot names etc., including some bad plugins and website copiers), and on individual IP numbers, as well as ranges of IP numbers or entire CIDR groups.
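To make the three kinds of rule concrete, here is a minimal sketch of the decision logic in Python. It is not our actual .htaccess file, only an illustration of the same checks. Every bot name, referrer string, and address below is a made-up placeholder (the IP ranges are reserved documentation ranges), and the function name `is_blocked` is my own invention.

```python
import ipaddress

# Hypothetical example rules; none of these are real entries from our files.
BLOCKED_AGENT_SUBSTRINGS = ["emailharvest", "sitecopier"]
BLOCKED_REFERRER_SUBSTRINGS = ["spam-referrer.example"]
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.15/32"),    # a single IP number
    ipaddress.ip_network("198.51.100.0/24"),  # an entire CIDR group
]

def is_blocked(ip, user_agent="", referrer=""):
    """Return True if the User Agent, referrer, or IP rule matches."""
    agent = user_agent.lower()
    ref = referrer.lower()
    # Rule 1: unwanted User Agent (case-insensitive substring match).
    if any(s in agent for s in BLOCKED_AGENT_SUBSTRINGS):
        return True
    # Rule 2: unwanted referrer string.
    if any(s in ref for s in BLOCKED_REFERRER_SUBSTRINGS):
        return True
    # Rule 3: the IP number falls inside a blocked single address or range.
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETWORKS)

print(is_blocked("198.51.100.7"))                          # inside the /24
print(is_blocked("203.0.113.5", user_agent="SiteCopier"))  # agent rule
print(is_blocked("203.0.113.5", user_agent="Mozilla/5.0")) # no rule matches
```

Note that the IP rule fires no matter what the visitor calls itself, which is exactly why a well-behaved bot coming from a blocked range would be turned away too.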
As a result of using the dot htaccess control file, the bad bots are normally kept away from our sites, which also gives us a more accurate count of actual visitors. By blocking entire ranges or CIDR groups we keep out most "unwanteds". That includes some known servers or hosts that "knowingly" give hosting or dialup service to "known spammers" or the like, and it includes countries "known for" being major sources of spam (e.g. many countries in Asia; our content is not meant for that audience, as we do not do business with Asia).
The BAD PART is that we have blocked individual IP numbers, ranges, and entire CIDR groups without knowing what IP numbers "googlebot" (or any other critical indexing bot they may use) operates from, so it could be that we have blocked one of their indexing bots.
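If a crawler's IP number or range were ever known, checking it against a block list would be simple. The sketch below, again in Python rather than .htaccess syntax, lists which blocked entries would catch a given IP or CIDR. The helper name `blocked_rules_hitting` and all addresses are hypothetical placeholders; they are NOT real googlebot addresses.

```python
import ipaddress

def blocked_rules_hitting(candidate, blocked_cidrs):
    """List the blocked entries that overlap a candidate IP or CIDR range."""
    # strict=False lets a bare IP number be treated as a one-address network.
    target = ipaddress.ip_network(candidate, strict=False)
    hits = []
    for cidr in blocked_cidrs:
        net = ipaddress.ip_network(cidr, strict=False)
        # Only compare like with like (IPv4 vs IPv4, IPv6 vs IPv6).
        if net.version == target.version and net.overlaps(target):
            hits.append(cidr)
    return hits

# Placeholder block list using reserved documentation ranges.
blocked = ["198.51.100.0/24", "203.0.113.0/25", "192.0.2.15"]

print(blocked_rules_hitting("198.51.100.42", blocked))   # caught by the /24
print(blocked_rules_hitting("203.0.113.200", blocked))   # outside the /25
print(blocked_rules_hitting("203.0.113.0/24", blocked))  # range overlaps the /25
```

An empty list means nothing in the block list touches that address. Any other result names the entries that would need an exception.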
Google and others do specify what their bot names are, e.g. googlebot and Googlebot-Image (I think that's the name), but do they have any critical "unnamed bot" that they use for indexing? They DO mention having a double-check system to discover cloaking: googlebot might visit a site and look at a page, then their "other" bot, with no name or another name, will look at the same page to see if the content is different, thus discovering "cloaking" for different bots.
WHAT HAPPENS if one has accidentally blocked one of Google's unnamed bots (which might look like a harvester or some other unwanted bot, or might sit in an unwanted group)?
Google, or any other search engine with "unnamed" bots that do double checking, would not have to reveal the exact IP numbers of its cloak-checking bots, since that would kill its means of cloak-checking. But if they said that they have bots working in a CIDR or range that also contains dialups or other services, then at least we would know NOT to block that range.
I had also heard that if a person uses the "noarchive" meta tag in individual pages, a page or site could be dropped completely. Likewise, I heard that if one blocks some of the image-harvesting bots, a page or site could be dropped. I don't know whether either of these is true.
As for any bot looking for images, I don't feel that any images should ever be harvested. Often they are later infringed upon by the creation of thumbnails: a thumbnail is an unauthorized duplicate or miniaturization, created without specific authorization and without any statement of copyright, and with no control against anyone copying and using that thumbnail version. The original owner could not sue a user of that thumbnail, but technically the original owner could sue the thumbnail maker for infringement. THUS, no image should be grabbed and archived. Ideally there would be a submit page just for images, where such authorization could be given for cached copies and/or for thumbnail creation, but only for a particular S/E.
If anyone has more specific info about the IP numbers that are actually used by the indexing bots, it would be appreciated. If the indexing bot is ONLY named "googlebot" and does not involve another critical unnamed bot, and we also knew the IP number or range to allow, then proper indexing of sites/pages could be done. Otherwise it is possible that Google or other good bots could be blocked (unintentionally) just to keep out some bad ones.