Sitemaps, Meta Data, and robots.txt Forum

which part of "disallow" did you not understand?
lucy24
8:48 pm on Jul 23, 2011 (gmt 0)

Back on 5 July I added this piece to robots.txt:
User-Agent: *
Disallow: /perez/images

I've found from experience that the pictures in this particular directory are very popular with hotlinkers-- the kind who never bother to check back, so they never see the garish "NO HOTLINKS" graphic-- so why make it easy for them?

Concurrently I removed the directory from Google's index, and added this piece to .htaccess to cover the lag until they re-read and acted on robots.txt:
RewriteCond %{HTTP_USER_AGENT} Googlebot-Image
RewriteRule perez/images/ - [F]
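
In case anyone copies the idea, a slightly tightened sketch of the same rule (assuming mod_rewrite is already switched on elsewhere in the .htaccess, and that the directory sits at the site root) anchors the pattern and makes the user-agent match case-insensitive:

# sketch only: anchored to the directory, case-insensitive on the UA
RewriteCond %{HTTP_USER_AGENT} Googlebot-Image [NC]
RewriteRule ^perez/images/ - [F]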


I figured, leave it there for a week or so and then I can delete it when the robots.txt-bot gets caught up. (For ordinary html pages it takes five days or so.)

Wrong. Just today my logs were jam-packed with 403's from the imagebot trying to collect images that it was told way back on the fifth-- I make that eighteen days ago-- not to touch.

What gives? Does the imagebot not count as a robot, so closed doors make no difference?

 

phranque
5:15 am on Jul 24, 2011 (gmt 0)

do you have any other rules in robots.txt?
if so i would add a specific rule for the image bot:

User-Agent: Googlebot-Image
Disallow: /perez/images

have you checked your server access logs to see if Googlebot-Image has accessed robots.txt since july 5?

lucy24
8:24 am on Jul 24, 2011 (gmt 0)

have you checked your server access logs to see if Googlebot-Image has accessed robots.txt since july 5?

Do you know, until you asked I hadn't realized that the imagebot makes its own visits to robots.txt. I went back and did a rough count. (Thank you, text editor.) The imagebot has picked up robots.txt about ten times since the restriction was added. The particularly interesting part was that on the day before its flurry of 403's it read robots.txt three separate times-- it never does this-- followed by one more time earlier on the day of the 403 binge.

i would add a specific rule for the image bot:
User-Agent: Googlebot-Image
Disallow: /perez/images

And you know what happens then. You have to go back over all your other rules and repeat them: "googlebot, this means you too". Hm. Well, that's another use for Regular Expressions.
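
For the record, that's because a crawler is only supposed to obey the single most specific User-Agent group that matches it, so once a Googlebot-Image group exists that bot ignores the * group entirely. A sketch of where that leads (with a made-up /cgi-bin/ line standing in for whatever the real rules are):

User-Agent: *
Disallow: /cgi-bin/
Disallow: /perez/images

User-Agent: Googlebot-Image
Disallow: /cgi-bin/
Disallow: /perez/images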

londrum
11:00 am on Jul 24, 2011 (gmt 0)

i had a problem like this once, and found out that sticking disallow on a directory in robots.txt does not actually mean what we think it means.

search engines are not obliged to remove that directory from their index just because you have disallowed it. they usually do, of course, which is probably why we think it works that way, but all it really does is stop them from spidering it anymore. anything that is already in their index remains untouched.

what you need to do is remove the disallow in robots.txt and send a noindex header (X-Robots-Tag) with the images instead. you can get PHP to send a header like that. once they have been spidered and dropped from the index, you can put the disallow back in robots.txt.
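
Since the images are static files they won't normally pass through PHP, so one way to send that header (a sketch only, assuming mod_headers is available) is straight from the .htaccess in that directory, using the X-Robots-Tag response header that google treats as a per-file noindex:

# sketch: attach a noindex header to every image served from this directory
<FilesMatch "\.(gif|jpe?g|png)$">
Header set X-Robots-Tag "noindex"
</FilesMatch>

once the bot has re-fetched the files and dropped them from the index, the disallow can go back in and the header becomes moot, since a disallowed file never gets crawled again anyway.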

that might explain why you are getting 403s for the images. the imagebot is still trying to pick up images that it already has in its index.

lucy24
9:56 pm on Jul 24, 2011 (gmt 0)

search engines are not obliged to remove that directory from their index just because you have disallowed it.

But if you've explicitly removed the directory using the search engine's own tools, it should certainly be gone.

Using "test robots.txt" from GWT, with Googlebot-Image added from the "user-agent" popup list (domain name changed for quoting purposes):

Test results
URL: http://www.example.com/perez/images
Googlebot: Blocked by line 17: Disallow: /perez/images
Googlebot-Image: Blocked by line 17: Disallow: /perez/images


That doesn't really seem to leave a lot of wiggle room, does it?

phranque
10:30 pm on Jul 24, 2011 (gmt 0)

have you checked the IP address of the "Googlebot-Image" bot?
perhaps a non-G bot is spoofing the user agent.

lucy24
1:07 am on Jul 25, 2011 (gmt 0)

:o
That is one serious spoofer. They managed to forge the 66.249.68.134 IP-- not just once but more than a hundred times over a 13-hour period. Must have been Image Collection Day, because once posted, my images hardly ever change. Sometimes there's a redirect (provided solely for search engines' benefit, since humans get whatever is currently linked), but you can get that from a HEAD request, can't you?

Pfui
1:32 am on Jul 25, 2011 (gmt 0)

66.249.68.134 is Google.

(Forged IP hits to web-level pages are rare as hen's teeth.)

lucy24
2:46 am on Jul 25, 2011 (gmt 0)

66.249.68.134 is Google.

Whoops! Forgot the <fe>markup</fe>.
