I don't want any of my site's images listed on search engines, so I have a blanket block on my /images/ folder in my robots.txt (below), plus an additional Googlebot-Image/1.0 block in place. However, the Googlebot-Image bot still insists on spidering a few of my site's images, despite receiving a 403 error.
Robots:

User-agent: *
Disallow: /i

User-agent: Googlebot-Image/1.0
Disallow: /
Result:

403 GET 220.127.116.11 Googlebot-Image/1.0 /images/abc.gif
403 GET 18.104.22.168 Googlebot-Image/1.0 /robots.txt
I'm guessing the Googlebot-Image bot is receiving the 403 on robots.txt and is therefore unable to ascertain what it can and cannot spider, so it carries on regardless.
I cannot figure out why it is receiving a 403, though.
Can anyone shed any light on what may be happening, please? Many thanks in advance.
OK, Google has now revisited my robots.txt file, and it seems a previous edit of my robots.txt may have accidentally permitted Googlebot-Image to crawl the unwanted directories. The onslaught has now ceased at last.
I should think so, unless SetEnvIf has rules all its own. (I use the special form BrowserMatch, but it's the same thing.) You can put in an override saying:
<Files robots.txt>
    Order Allow,Deny
    Allow from all
</Files>
so they have no excuse.
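For context, a block along these lines is the kind of thing that produces those 403s on robots.txt in the first place. This is only a sketch, assuming Apache 2.2-style access control; the environment variable name block_imagebot is made up:

# Flag requests whose User-Agent contains Googlebot-Image
BrowserMatch "Googlebot-Image" block_imagebot

# Deny flagged requests everywhere...
Order Allow,Deny
Allow from all
Deny from env=block_imagebot

# ...except robots.txt, which stays readable even for blocked agents
<Files robots.txt>
    Order Allow,Deny
    Allow from all
</Files>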
Interesting to know that the imagebot pays its own separate visits to robots.txt. I thought a single robots.txt fetch did the work for everyone.
There is a special rule for the regular Googlebot which may also apply to the imagebot. The moment Googlebot is mentioned by name in your robots.txt, it looks only at the lines that use its name. That means that if you have any Google-specific rules but also want it to follow the general rules, you need to say everything twice.
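For example, "saying everything twice" looks something like this (a sketch; the /private/ path is made up, and the exact user-agent token is whatever you are targeting):

# Rules every crawler follows
User-agent: *
Disallow: /private/

# Once Googlebot-Image is named, it ignores the * record above,
# so repeat the general rules alongside the image-specific one
User-agent: Googlebot-Image
Disallow: /private/
Disallow: /images/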