
Forum Moderators: goodroi


which part of "disallow" did you not understand?

     
8:48 pm on Jul 23, 2011 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time Top Contributors Of The Month

joined:Apr 9, 2011
posts:12708
votes: 244


Back on 5 July I added this piece to robots.txt:
User-Agent: *
Disallow: /perez/images

I've found by experience that the pictures in this particular directory are very popular with hotlinkers-- the kind who never bother to check back, so they never see the garish "NO HOTLINKS" graphic-- so why make it easy for them?

Concurrently I removed the directory from Google's index, and added this piece to .htaccess to cover the time it takes them to catch up with the new robots.txt:
RewriteCond %{HTTP_USER_AGENT} Googlebot-Image
RewriteRule perez/images/ - [F]
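
For readers less used to mod_rewrite, the two directives above amount to the decision sketched below. This is only a Python illustration of the matching logic, not how Apache implements it. The function name `is_forbidden` is made up, and the sketch assumes the rule sits in the site root's .htaccess, where the leading slash is dropped before the RewriteRule pattern is tested.

```python
import re

# Patterns copied from the .htaccess lines above; both are unanchored regexes.
UA_PATTERN = re.compile(r"Googlebot-Image")
PATH_PATTERN = re.compile(r"perez/images/")


def is_forbidden(user_agent: str, url_path: str) -> bool:
    """Return True when a request would be answered with 403 under the rule above."""
    # Assumption: per-directory context, so the path is tested without its leading slash.
    relative_path = url_path[1:] if url_path.startswith("/") else url_path
    ua_matches = UA_PATTERN.search(user_agent) is not None
    path_matches = PATH_PATTERN.search(relative_path) is not None
    return ua_matches and path_matches
```

Because neither pattern is anchored, any user agent containing "Googlebot-Image" and any path containing "perez/images/" would match.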


I figured, leave it there for a week or so and then I can delete it when the robots.txt-bot gets caught up. (For ordinary html pages it takes five days or so.)

Wrong. Just today my logs were jam-packed with 403s from the imagebot trying to collect images that it was told way back on the fifth-- I make that eighteen days ago-- not to touch.

What gives? Does the imagebot not count as a robot, so closed doors make no difference?
5:15 am on July 24, 2011 (gmt 0)

Administrator

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Aug 10, 2004
posts:10542
votes: 8


do you have any other rules in robots.txt?
if so i would add a specific rule for the image bot:

User-Agent: Googlebot-Image
Disallow: /perez/images

have you checked your server access logs to see if Googlebot-Image has accessed robots.txt since july 5?
8:24 am on July 24, 2011 (gmt 0)

lucy24


> have you checked your server access logs to see if Googlebot-Image has accessed robots.txt since july 5?

Do you know, until you asked I hadn't realized that the imagebot makes its own visits to robots.txt. I went back and did a rough count. (Thank you, text editor.) The imagebot has picked up robots.txt about ten times since the restriction was added. The particularly interesting part was that on the day before its flurry of 403s it read robots.txt three separate times-- it never does this-- followed by one more time earlier on the day of the 403 binge.
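
The rough count described above could also be scripted. The following is a minimal sketch that assumes the server writes Apache's combined log format; the function name `summarize` and the sample log lines (timestamps, byte counts, the second IP) are invented for illustration and are not taken from the actual logs.

```python
import re
from collections import Counter

# Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def summarize(lines, agent_substring="Googlebot-Image"):
    """Count robots.txt fetches and 403 responses per day for one user agent."""
    robots_fetches = Counter()
    forbidden = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None or agent_substring not in match.group("agent"):
            continue
        day = match.group("day")
        if match.group("path") == "/robots.txt":
            robots_fetches[day] += 1
        if match.group("status") == "403":
            forbidden[day] += 1
    return robots_fetches, forbidden


# Invented sample lines, only to show the expected input shape.
SAMPLE_LOG = [
    '66.249.68.134 - - [22/Jul/2011:03:10:00 +0000] "GET /robots.txt HTTP/1.1" 200 512 "-" "Googlebot-Image/1.0"',
    '66.249.68.134 - - [23/Jul/2011:09:00:00 +0000] "GET /perez/images/a.jpg HTTP/1.1" 403 210 "-" "Googlebot-Image/1.0"',
    '192.0.2.10 - - [23/Jul/2011:09:05:00 +0000] "GET /perez/images/a.jpg HTTP/1.1" 200 9000 "-" "Mozilla/5.0"',
]
robots_fetches, forbidden = summarize(SAMPLE_LOG)
```

Grouping by day makes it easy to line up the robots.txt fetches against the days on which the 403s appear.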

> i would add a specific rule for the image bot:
> User-Agent: Googlebot-Image
> Disallow: /perez/images

And you know what happens then. You have to go back over all your other rules and repeat them: "googlebot, this means you too". Hm. Well, that's another use for Regular Expressions.
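
The "repeat all your other rules" effect can be seen with Python's standard `urllib.robotparser`, used here as a stand-in; Google's own parser may differ in details. The `/cgi-bin/` rule, the file names, and the `SomeOtherBot` agent are hypothetical, added only so there is a second rule to lose.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the /cgi-bin/ line is an invented "other rule".
ROBOTS_TXT = """\
User-Agent: *
Disallow: /cgi-bin/
Disallow: /perez/images

User-Agent: Googlebot-Image
Disallow: /perez/images
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = "http://www.example.com"
# The image directory is blocked for the image bot, as intended.
image_blocked = not parser.can_fetch("Googlebot-Image", base + "/perez/images/map.jpg")
# The image bot now follows only its own group, so the * rule no longer applies to it.
cgi_blocked_for_imagebot = not parser.can_fetch("Googlebot-Image", base + "/cgi-bin/form.pl")
# Bots without a group of their own still fall back to the * group.
cgi_blocked_for_others = not parser.can_fetch("SomeOtherBot", base + "/cgi-bin/form.pl")
```

Once a bot has a group addressed to it by name, the wildcard group is ignored for that bot, which is why every general rule has to be copied into the specific group.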
11:00 am on July 24, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Feb 12, 2006
posts:2493
votes: 22


i had a problem like this once, and found out that sticking disallow on a directory in robots.txt does not actually mean what we think it means.

search engines are not obliged to remove that directory from their index just because you have disallowed it. they usually do, of course, which is probably why we think it works that way, but all it really does is stop them from spidering it anymore. anything that is already in their index remains untouched.

what you need to do is remove the disallow in robots.txt and put a noindex header (X-Robots-Tag: noindex) on the images instead. you can get PHP to send a header like that. once they have been spidered and removed from their index, then you can put the disallow back in robots.txt

that might explain why you are getting 403s for the images. the imagebot is still trying to pick up images that it already has in its index.
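
The post above mentions PHP; the same noindex-header idea is sketched here in Python instead, as a tiny WSGI app. It assumes the header meant is `X-Robots-Tag: noindex`. The names `headers_for` and `app` are invented, and the empty body stands in for the real image bytes.

```python
def headers_for(path: str) -> list[tuple[str, str]]:
    """Build response headers, adding a noindex hint for the image directory."""
    headers = [("Content-Type", "image/jpeg")]
    if path.startswith("/perez/images/"):
        # Asks compliant search engines to drop the resource from their index.
        headers.append(("X-Robots-Tag", "noindex"))
    return headers


def app(environ, start_response):
    """Minimal WSGI app; a real one would stream the image file as the body."""
    start_response("200 OK", headers_for(environ.get("PATH_INFO", "")))
    return [b""]
```

The header only has an effect if the crawler is allowed to fetch the image, which is why the disallow has to come out of robots.txt while the header does its work.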
9:56 pm on July 24, 2011 (gmt 0)

lucy24


> search engines are not obliged to remove that directory from their index just because you have disallowed it.

But if you've explicitly removed the directory using the search engine's own tools, it should certainly be gone.

Using "test robots.txt" from GWT, with Googlebot-Image added from the "user-agent" popup list (domain name changed for quoting purposes):

Test results
Url: http://www.example.com/perez/images
Googlebot: Blocked by line 17: Disallow: /perez/images
Googlebot-Image: Blocked by line 17: Disallow: /perez/images


That doesn't really seem to leave a lot of wiggle room, does it?
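
A similar check can be run offline with Python's standard `urllib.robotparser`. This is only a rough stand-in for the GWT tester, and the two-line robots.txt below is just the fragment quoted at the top of the thread, not the full file (whose rule sits on line 17).

```python
from urllib.robotparser import RobotFileParser

# Only the fragment quoted in the thread; the real file is longer.
WILDCARD_ROBOTS_TXT = """\
User-Agent: *
Disallow: /perez/images
"""

wildcard_parser = RobotFileParser()
wildcard_parser.parse(WILDCARD_ROBOTS_TXT.splitlines())

url = "http://www.example.com/perez/images"
# True means "may fetch", False means "blocked".
results = {
    agent: wildcard_parser.can_fetch(agent, url)
    for agent in ("Googlebot", "Googlebot-Image")
}
```

With no group naming either bot, both fall under the wildcard group and both come back blocked, which agrees with the tester's output.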
10:30 pm on July 24, 2011 (gmt 0)

phranque


have you checked the IP address of the "Googlebot-Image" bot?
perhaps a non-G bot is spoofing the user agent.
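
One common way to run the IP check suggested above is a reverse DNS lookup followed by a forward lookup that has to return the same address. The sketch below assumes genuine crawler hostnames end in googlebot.com or google.com. The function name and the injectable lookup parameters are invented so the logic can be tried without network access.

```python
import socket

# Assumption: hostname suffixes that genuine Google crawlers resolve to.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve the IP, check the domain, then confirm with a forward lookup."""
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]
    try:
        host = reverse_lookup(ip)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False
    try:
        # The forward lookup must map the hostname back to the same IP.
        return ip in forward_lookup(host)
    except OSError:
        return False
```

Calling `is_verified_googlebot("66.249.68.134")` with the default lookups performs live DNS queries, so the answer depends on the network at the time it is run.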
1:07 am on July 25, 2011 (gmt 0)

lucy24


:o
That is one serious spoofer. They managed to forge the 66.249.68.134 IP-- not just once but more than a hundred times over a 13-hour period. Must have been Image Collection Day, because once posted, my images hardly ever change. Sometimes there's a redirect (provided solely for search engines' benefit, since humans get whatever is currently linked), but you can get that from the HEAD, can't you?
1:32 am on July 25, 2011 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Nov 5, 2005
posts: 2038
votes: 1


66.249.68.134 is Google.

(Forged IP hits to web-level pages are rare as hen's teeth.)
2:46 am on July 25, 2011 (gmt 0)

lucy24


> 66.249.68.134 is Google.

Whoops! Forgot the <fe>markup</fe>.