OK, why not. They're welcome.
But using an IP address with no reverse-DNS PTR record, and an empty USER_AGENT too -- is that nice?
I've seen some crappy scrapers do this before, but Ask, now, too?
Oh, where has all the netiquette gone?
Long time passing ... Long time ago ...
Kind regards,
R.
That's the good news. The bad news is that Ask is fetching robots.txt from that IP range with a blank user-agent. :(
65.214.39.180 - - [12/Aug/2007:11:01:33 -0500] "GET /robots.txt HTTP/1.1" 200 3422 "-" "-"
Jim
I generally block all unrecognized UAs because I don't have time to mess with site scrapers, and because my sites can 'afford it' ranking-wise, even if the misnamed or un-named spider is legit. But when recognized, legitimate SE companies screw up like this, it surprises me (yeah, I know, but I'd like to avoid the usual Ask-bashing and focus on the technicalities)... :)
I consider it progress, though: their Webmaster help page has been encouraging us to do rDNS checking for months, and they're finally getting their rDNS to resolve properly...
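If you ever wanted Apache itself to enforce that kind of check on some resource, here's a minimal sketch, assuming 2.x with mod_authz_host -- not Ask's documented method, just the general mechanism. Allowing by (partial) domain name makes Apache do a double-reverse lookup: a PTR query on the connecting IP, then a forward query to confirm the name resolves back to that IP. The filename and hostname below are placeholders:

# Hypothetical sketch -- "private-feed.xml" and ".crawler.example.com"
# are placeholders. Allow-by-hostname triggers the forward-confirmed
# (double-reverse) DNS lookup automatically.
<Files "private-feed.xml">
Order Deny,Allow
Deny from all
Allow from .crawler.example.com
</Files>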
Jim
But a blank UA with a blank referrer gets tossed, regardless of who the IP address resolves back to -- I have better things to do than spend time adding ad-hoc access-control exceptions for lazy search engine coders. The only exception in recent memory was when Googlebot-Mobile came around sending its UA string wrapped in literal quote marks; since I was trying to get a new mobile site indexed at the time, I did make an exception for that one for several days until they fixed it -- possibly due to the bug report I filed.
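For the record, one way to express that toss, as a sketch only, assuming mod_rewrite in a site-root .htaccess (the robots.txt carve-out is my own addition -- drop it if you want those fetches refused as well):

# Sketch: forbid requests arriving with both an empty User-Agent
# and an empty Referer. Unset headers expand to the empty string
# in mod_rewrite, so ^$ matches missing and blank values alike.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteCond %{HTTP_REFERER} ^$
RewriteRule !^robots\.txt$ - [F]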
Don and I have been discussing the access-control code that he's planning to roll out the next time his site gets scraped, and I've actually seen the code -- beautifully compact and efficient:
# Allow only robots.txt and the custom 403 page(s); everything
# else is denied, so a scraper gets nothing but 403 responses.
SetEnvIf Request_URI "(403[^.]*\.html|robots\.txt)$" allow_it
<Files *>
Order Deny,Allow
Deny from all
Allow from env=allow_it
Allow from localhost
</Files>
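Note that allowing the 403 page itself is what keeps Apache from chasing its tail -- if the custom ErrorDocument were denied too, every blocked request would trigger a second 403 while trying to serve the first.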
;) Jim
the next time his site gets scraped
Hey Jim,
I can't recall the last time a scraper/harvester got through one of my sites :)
'Course, it's a bad thing to brag about, as the next pest may be just around the corner ;)
Either I have all the "bad guys" IP ranges and UAs restricted, or else the "bad guys" have determined there's nothing on my sites worth having ;)
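(For anyone curious what those restrictions look like, a bare-bones sketch, assuming mod_setenvif and mod_authz_host -- the CIDR range and UA substring below are placeholders, not my real list:)

# Placeholder blocklist sketch -- 192.0.2.0/24 is a documentation
# range and "BadHarvester" an invented UA string:
SetEnvIfNoCase User-Agent "BadHarvester" bad_ua
<Files *>
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24
Deny from env=bad_ua
</Files>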
Don
206.80.1.253 - - [15/Aug/2007:07:10:16 -0500] "GET /robots.txt HTTP/1.1" 200 4550 "-" "-"
206.80.1.253 - - [15/Aug/2007:07:10:16 -0500] "GET /MyFolder/MySubFolder.htm HTTP/1.1" 403 - "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a1) Gecko/20070308 Minefield/3.0a1"