
Excluding image URLs from being indexed by robots.

     
2:24 pm on Aug 26, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 19, 2004
posts:719
votes: 8


So I did a Google search for my website using site:example.com, and the search index contains thousands of my image files, each of which has its own URL alias (via Drupal). As an example:

www.example.com/examplejpg

The above is the exact URL being indexed by Googlebot (note there is no period before "jpg").

Should I just disallow search bots by placing the following in my robots.txt:

Disallow: /*jpg

Is this good practice and a sound implementation? Please advise. Thanks!
2:33 pm on Aug 26, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 19, 2004
posts:719
votes: 8


I was thinking of using this robots.txt directive:

User-agent: Googlebot
Disallow: /*jpg$

Would this help, and is it correct? Thanks!
2:35 pm on Aug 26, 2017 (gmt 0)

Moderator from GB 

WebmasterWorld Administrator mack is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:June 15, 2001
posts:7719
votes: 43


I can understand why some may wish to exclude their images from image search, but in some cases it can lead to real traffic via the "View image" / "Visit page" links.

I sometimes use image search for general searches because I can click on the image that looks closest to what I want without having to read result snippets. This doesn't work for all searches, but it works very well within certain niches.

One approach I have seen used is hosting all content you do not want indexed on a sub-domain (images.example.net) and using a disallow-all rule in that sub-domain's robots.txt.

Mack.
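
A minimal sketch of that sub-domain approach, assuming the images are served from a hypothetical images.example.net host: that host serves its own robots.txt with a blanket disallow, while the main site's robots.txt stays untouched.

# robots.txt served at images.example.net/robots.txt (sketch, hypothetical host)
# robots.txt is per-host, so this only affects the image sub-domain
User-agent: *
Disallow: /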
2:38 pm on Aug 26, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 19, 2004
posts:719
votes: 8


Thanks Mack, can you please check the following robots.txt directive:

User-agent: Googlebot
Disallow: /*jpg$

Is it worth doing, btw?
2:51 pm on Aug 26, 2017 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:3565
votes: 197


Those are good points, mack, though I know a good number of legitimate reasons not to want your images to show up in an image search. If you really don't want your images in search results, you can send a noindex header (X-Robots-Tag) with the image.

You do not need to disallow robots to prevent indexing. It is a topic that comes up here so often that you can use the site search for "X-Robots" and find hundreds of "how-to" threads. Disallowing robots can cause problems with Google's mobile-friendly checks, as any blocked resource can. Bear in mind that neither approach will prevent your images from showing up in search results via other sites that have copied them.
2:56 pm on Aug 26, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 19, 2004
posts:719
votes: 8


So not2easy, I did a Google search and found this directive, which can be added to .htaccess:

<IfModule mod_headers.c>
    <FilesMatch "images/*\.jpg$">
        Header append X-Robots-Tag "noindex"
    </FilesMatch>
</IfModule>

Would this be a good method? Thanks!
3:02 pm on Aug 26, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 19, 2004
posts:719
votes: 8


Actually, I don't want a file match; I just want the directive to match a URL like "www.example.com/namejpg".

That is an example of the URLs I want to block. Thanks!
4:55 pm on Aug 26, 2017 (gmt 0)

Administrator from US 

WebmasterWorld Administrator not2easy is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Dec 27, 2006
posts:3565
votes: 197


There is a very nice search right here on WebmasterWorld. It might return more results than you wanted, but at least you can see who is recommending what. If you use results you find on Google, good luck. Hint: you may not need <FilesMatch>.
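
One hedged reading of that hint: because the Drupal aliases are URL paths rather than real .jpg files on disk, <FilesMatch> (which matches file names) will not catch them, but an Apache 2.4+ <If> expression on the request URI can. The /jpg$ pattern below is only an assumption taken from the example URL earlier in the thread.

<IfModule mod_headers.c>
    # Sketch: send a noindex header for any request whose path ends in "jpg"
    # (Drupal path aliases such as /examplejpg, not actual image files).
    # Requires Apache 2.4+ for the <If> expression syntax.
    <If "%{REQUEST_URI} =~ /jpg$/">
        Header set X-Robots-Tag "noindex"
    </If>
</IfModule>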
6:06 pm on Aug 26, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14435
votes: 576


For image files, Disallow and noindex have the same end result. That is, I've never seen an image SERP containing blank boxes that say "the site's robots.txt would not allow us to show this picture". If you have huge numbers of image files, a robots.txt Disallow is probably a better approach, as it means the files won't get crawled in the first place. Less load on your server, and maybe a reallocation of crawl budget.
2:34 am on Aug 27, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 19, 2004
posts:719
votes: 8


So I added the following to my robots.txt:

User-agent: *
Disallow: /*jpg$

Do I have to specify Googlebot, or would the above work in general? Thanks!
3:59 am on Aug 27, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:14435
votes: 576


If you specify googlebot, the rule will only be heeded by googlebot. If you say *, it will apply to everyone.

In robots.txt, each user-agent can only be named once. A robot reads the first block that matches it and stops looking. Surely you've already got a block for "User-Agent: *" (the generic Disallow directives for directories you don't want anyone crawling). Any new Disallows go there.

Most robots will not understand the /*jpg$ syntax (with RegEx-style closing anchor) anyway.
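
A hedged sketch of that merged block, with /private/ standing in for a hypothetical directory already disallowed for everyone:

# One generic group: new Disallows are added to the existing User-agent: * block
# rather than starting a second block for the same agent.
User-agent: *
Disallow: /private/
# The * wildcard and $ end-anchor are extensions honored by Google and Bing,
# not by every crawler.
Disallow: /*jpg$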
5:46 am on Aug 27, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 19, 2004
posts:719
votes: 8


Well Lucy24, Bing and Google do understand the /*jpg$ syntax.
12:43 am on Sept 5, 2017 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 19, 2004
posts:719
votes: 8


Yep, you guys are right. Googlebot (via Google Search Console) complained about blocked resources for thousands of pages on my site when I added the following to robots.txt:

Disallow: /*jpg$

So I deleted the directive. Thanks guys!