Forum Moderators: Robert Charlton & goodroi
Seemed harmless enough at the time.
Then about a year ago I added a thumbnail screenshot generator to my site to generate thumbnail images of all the thousands of sites that I link to, giving the site a new, more visual look.
My site security is tight, right?
Nobody was going to be crawling and scraping my images, right?
Oops, I forgot I allowed those pesky search engines to index my images.
Anyway, I knew my bandwidth was going to increase drastically with all these images being downloaded and saw that progression happen as expected.
Didn't think any more about it and ignored it since everything was running smoothly.
Suddenly a couple of months ago I noticed the bandwidth was spiking even higher but the number of pages being served hadn't increased to account for that bandwidth spike.
Could someone have gotten past my spider traps and started downloading all my images?
Ran a quick security check, nope, nobody was downloading them in that manner EXCEPT...
... Google, Yahoo and Live.
So where is all this bandwidth leeching coming from?
Only one way to find out: I dropped the following anti-leech code into my .htaccess file.
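# Requests with no referer at all are let through
RewriteCond %{HTTP_REFERER} !^$
# Requests referred by my own pages are let through
RewriteCond %{HTTP_REFERER} !^http://(www\.)?example\.com/.*$ [NC]
RewriteCond %{REQUEST_URI} ^/images-path/.*$ [NC]
# Everything else asking for a gif or jpg gets a 403
RewriteRule \.(gif|jpg)$ - [NC,F]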
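For anyone wanting to check which requests these conditions would refuse before trusting them on a live server, here is a minimal Python sketch of the same logic. It is only a model of the rules, not how Apache evaluates them internally. The function name would_be_forbidden and the host mashup.example.net are made up for illustration; example.com and /images-path/ are the placeholders from the rules above.

```python
import re

# Assumption: these patterns mirror the RewriteCond/RewriteRule lines above.
OWN_SITE = re.compile(r"^http://(www\.)?example\.com/.*$", re.IGNORECASE)
IMAGE_PATH = re.compile(r"^/images-path/.*$", re.IGNORECASE)
IMAGE_EXT = re.compile(r"\.(gif|jpg)$", re.IGNORECASE)


def would_be_forbidden(referer: str, uri: str) -> bool:
    """Return True if this model of the rules would answer with a 403."""
    has_referer = referer != ""                        # RewriteCond !^$
    foreign = OWN_SITE.match(referer) is None          # not my own pages
    in_image_dir = IMAGE_PATH.match(uri) is not None   # REQUEST_URI check
    is_image = IMAGE_EXT.search(uri) is not None       # RewriteRule pattern
    return has_referer and foreign and in_image_dir and is_image


# Hypothetical requests to try:
checks = [
    ("", "/images-path/a.jpg"),
    ("http://www.example.com/page.html", "/images-path/a.jpg"),
    ("http://mashup.example.net/", "/images-path/a.jpg"),
    ("https://www.example.com/page.html", "/images-path/a.jpg"),
]
for referer, uri in checks:
    print(repr(referer), uri, would_be_forbidden(referer, uri))
```

One thing the sketch makes visible: as written, a referer starting with https://www.example.com/ is treated as foreign, because the second condition only matches http://.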
Wait a couple of hours...
Then I ran this grep on my access_log to see who was leeching:
grep "\.jpg" access_log | grep " 403 "
The results of all the image leeching, thousands of images, spewed down my screen.
How?
How was this happening?
Looking at the results, the images were primarily being culled from Google Images, with a few coming via other search engines.
What I found was a bunch of mash-up sites dynamically using Google to locate images and then hotlink those images on the fly.
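Reading thousands of raw 403 lines is slow going. As a rough sketch of how the same log could be tallied by referring host, the Python below counts refused image requests per referer. It assumes the access_log is in Apache's combined format (which records the referer); the function name refused_by_referer is made up for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumption: combined log format, i.e.
# ... "GET /path HTTP/1.1" 403 210 "http://referer/..." "user agent"
LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "(?P<referer>[^"]*)"'
)


def refused_by_referer(lines):
    """Count 403 responses for .jpg/.gif requests, grouped by referer host."""
    counts = Counter()
    for line in lines:
        m = LINE.search(line)
        if not m or m.group("status") != "403":
            continue
        path = m.group("path").split("?", 1)[0].lower()
        if not path.endswith((".jpg", ".gif")):
            continue
        try:
            host = urlsplit(m.group("referer")).hostname
        except ValueError:
            host = None  # malformed referer
        counts[host or "(none)"] += 1
    return counts
```

To use it on a real log, something like `refused_by_referer(open("access_log", errors="replace")).most_common(20)` would list the twenty busiest referring hosts.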
OK, a feeling of stunned naivety sets in: how could I have let this happen?
Let's rectify the situation quickly by adding "/images-path/" to my robots.txt file and then individually blocking the image bots from crawling, to make sure there is no doubt they got the message loud and clear.
Every couple of days I checked the various search engines to see if my images had been removed. Live was very prompt, soon followed by Yahoo, but the images persisted in Google.
The Google image scrapers continued to generate a high volume of 403 forbiddens in my access_log file, day after day, week after week.
I checked Google Images every few days, and my stuff was still there.
How can this be?
Their image bot has been individually blocked for a MONTH and the path to the images has also been explicitly blocked for a MONTH yet the count of images indexed keeps increasing!
Finally, after a month, enough was enough.
I used Google Webmaster Tools to request the removal of everything from that image subdirectory and a couple of days later it was removed, FINALLY.
Some 403s are still showing up in my log files from sites that haven't figured out I pulled the plug on their abuse but we're talking 50-100 a day now, not the daily thousands that were leeching before.
Do you have a large number of images?
Are you allowing your images to be indexed?
This could happen to you...
Probably already is!
I've used a number of approaches for image hotlinkers - including serving them a different image with the same name, and that image includes an advertisement. In some cases, the hot-linkers are sending my site real traffic two years later. It's type-in traffic, yes, but the url I chose for the image and the location of the hot-linking site are enough clues for me to see that this approach is helping.
I did just what they said to do, cut & paste straight off their own site:
User-agent: Googlebot-Image
Disallow: /
It did nothing.
The other search engines did what I told them from robots.txt alone.
Only Google failed.
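One way to rule out a problem in the file itself is to run the rules through a parser. The sketch below uses Python's standard urllib.robotparser module for that; the helper name allowed and the example.com URL are made up for illustration. It only shows how a compliant parser reads the file. It says nothing about how any particular crawler behaves, or about when already-indexed images get dropped.

```python
from urllib.robotparser import RobotFileParser

# The two lines quoted above, exactly as they would sit in robots.txt.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if a parser following robots.txt would let `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical image URL on the placeholder domain.
IMAGE_URL = "http://www.example.com/images-path/a.jpg"

for agent in ("Googlebot-Image", "Googlebot", "Slurp"):
    print(agent, allowed(ROBOTS_TXT, agent, IMAGE_URL))
```

With this file the image bot is refused everywhere, while agents that are not named (the regular Googlebot, Slurp) are still allowed, which is what the rule is meant to say.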
Googlebot-Image has been included in my robots.txt for 8-9 years, ever since a mass traffic infusion from people who had no interest in my sites and were merely interested in the image (or at least the image's name).
A couple of years ago, their bot (even though it was listed in robots.txt) began crawling every image on one of my sites.
I contacted Google and requested human intervention and the crawling stopped (at least for that request).
About a month later, the same extensive crawling began again.
I sent a second request for human intervention; however, on reflection, I decided NOT to wait for a possible 3rd instance of non-compliance.
They've been denied access ever since.
Don
The other search engines did what I told them from robots.txt alone. Only Google failed.
Did Google fail to stop spidering? That's all that robots.txt asks - don't spider. What they do with the index is a different story - and in my experience once Google has indexed a url, then it takes a removal request for that url to go away from the search results.
Google Image Search has had a bunch of technical tangles that have been discussed here at various times. Many of them seem to relate to Safe Search filtering, which is of course quite intensive in the area of images compared to text content. The reports we've seen seem to say that a hotlink from an adult content domain (even to a general audience image) can get a site filtered out.
Now that's the flip side of the issue you brought up - those webmasters were trying to get INTO the image search results, not get excluded. But it is a sign of the kinds of complications in the Google Image index.
What they do with the index is a different story
[google.com...]
To remove all the images on your site from our index, place the following robots.txt file in your server root:
User-agent: Googlebot-Image
Disallow: /
That's pretty clear: it says very plainly that blocking them in robots.txt will "remove all the images on your site from our index".
It's plainly obvious it doesn't.
Their image bot has been individually blocked for a MONTH and the path to the images has also been explicitly blocked for a MONTH yet the count of images indexed keeps increasing!
Google Image Search doesn't appear to be updated as frequently. I'm going to guess about four times a year there are major updates back there, kind of like PR Updates.
The URI Removal Tool is an excellent tool to have available in instances like this.
You seem to have more problems with bots than anyone else I know. They are attracted to you like flies on you know what. :)
What I found was a bunch of mash-up sites dynamically using Google to locate images and then hotlink those images on the fly.
This has been going on for years. I had trouble with a multi-million-page one that used Google's old SOAP Search API. Google just does not care. If you have any #1 ranked images, then you probably have been hotlinked by one or more of these sites.
blocked for a MONTH yet the count of images indexed keeps increasing!
That's one of the worst reports I've run into yet. Having the urls not dropped is one thing, but having them increase is pretty far out of line, especially if that represents continued spidering.
will "remove all the images on your site from our index"
Perhaps they should add the word "eventually". That would line up better with actual webmaster experiences we've been hearing about. Image Search is sometimes painfully slow to drop urls, and video search is even slower. Removal requests definitely speed things up.
it says very plainly that blocking them in robots.txt will "remove all the images on your site from our index" It's plainly obvious it doesn't.
So it's not just me!
Over a period of 8 months I have tried and failed to stop Google taking images from one site.
Starting with the simplest line of code in robots.txt to block all user-agents from the image folder, we then wrote instructions just for Google, adding specific folders and subfolders, and adding 'wildcard' blocking for all jpegs and gifs. Webmaster Tools has said all along that the robots.txt file is correctly configured. Nothing worked.
Most recently we blocked Google's image bot in .htaccess. Now that means that the page result in image search (shown in a frame under the image) is a 403, but Google are STILL caching the feckin' image.