OK, why not. They're welcome.
But using an IP address with no reverse-DNS PTR record, and an empty USER_AGENT too -- is that nice?
I've seen some crappy scrapers do this before, but Ask, now, too?
Oh, where has all the netiquette gone?
Long time passing ... Long time ago ...
Kind regards,
R.
That's the good news. The bad news is that Ask is fetching robots.txt from that IP range with a blank user-agent. :(
65.214.39.180 - - [12/Aug/2007:11:01:33 -0500] "GET /robots.txt HTTP/1.1" 200 3422 "-" "-"
Jim
I generally block all unrecognized UAs because I don't have time to mess with site scrapers, and because my sites can 'afford it' ranking-wise, even if the misnamed or un-named spider is legit. But when recognized, legitimate SE companies screw up like this, it surprises me (yeah, I know, but I'd like to avoid the usual Ask-bashing and focus on the technicalities)... :)
I consider it progress, though: their Webmaster help page has been encouraging us to do rDNS checking for months, and they're finally getting their rDNS to resolve properly...
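If you ever wanted Apache itself to enforce that kind of check on some resource, here's a minimal sketch, assuming 2.x with mod_authz_host -- not Ask's documented method, just the general mechanism. Allowing by (partial) domain name makes Apache do a double-reverse lookup: a PTR query on the connecting IP, then a forward query to confirm the name resolves back to that IP. The filename and hostname below are placeholders:

# Hypothetical sketch -- "private-feed.xml" and ".crawler.example.com"
# are placeholders. Allow-by-hostname triggers the forward-confirmed
# (double-reverse) DNS lookup automatically.
<Files "private-feed.xml">
Order Deny,Allow
Deny from all
Allow from .crawler.example.com
</Files>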
Jim
But a blank UA with a blank referrer gets tossed, regardless of who the IP address resolves back to -- I have better things to do than spend time adding ad-hoc access-control exceptions for lazy search engine coders. The only exception in recent memory was when Googlebot-Mobile came around sending its UA string wrapped in literal quote marks; since I was trying to get a new mobile site indexed at the time, I did make an exception for that one for several days until they fixed it -- possibly due to the bug report I filed.
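For the record, one way to express that toss, as a sketch only, assuming mod_rewrite in a site-root .htaccess (the robots.txt carve-out is my own addition -- drop it if you want those fetches refused as well):

# Sketch: forbid requests arriving with both an empty User-Agent
# and an empty Referer. Unset headers expand to the empty string
# in mod_rewrite, so ^$ matches missing and blank values alike.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteCond %{HTTP_REFERER} ^$
RewriteRule !^robots\.txt$ - [F]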
Don and I have been discussing the access-control code that he's planning to roll out the next time his site gets scraped, and I've actually seen the code -- beautifully compact and efficient:
# Allow only robots.txt and the custom 403 page(s); everything
# else is denied, so a scraper gets nothing but 403 responses.
SetEnvIf Request_URI "(403[^.]*\.html|robots\.txt)$" allow_it
<Files *>
Order Deny,Allow
Deny from all
Allow from env=allow_it
Allow from localhost
</Files>
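Note that allowing the 403 page itself is what keeps Apache from chasing its tail -- if the custom ErrorDocument were denied too, every blocked request would trigger a second 403 while trying to serve the first.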
;) Jim
the next time his site gets scraped
Hey Jim,
I can't recall the last time a scraper/harvester got through one of my sites :)
'Course, it's a bad thing to brag about, as the next pest may be just around the corner ;)
Either I have all the "bad guys" IP ranges and UAs restricted, or else the "bad guys" have determined there's nothing on my sites worth having ;)
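(For anyone curious what those restrictions look like, a bare-bones sketch, assuming mod_setenvif and mod_authz_host -- the CIDR range and UA substring below are placeholders, not my real list:)

# Placeholder blocklist sketch -- 192.0.2.0/24 is a documentation
# range and "BadHarvester" an invented UA string:
SetEnvIfNoCase User-Agent "BadHarvester" bad_ua
<Files *>
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24
Deny from env=bad_ua
</Files>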
Don
206.80.1.253 - - [15/Aug/2007:07:10:16 -0500] "GET /robots.txt HTTP/1.1" 200 4550 "-" "-"
206.80.1.253 - - [15/Aug/2007:07:10:16 -0500] "GET /MyFolder/MySubFolder.htm HTTP/1.1" 403 - "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a1) Gecko/20070308 Minefield/3.0a1"