OrgName: HotJobs.com, Ltd.
OrgID: HOTJOB-6
Address: 701 First Ave
City: Sunnyvale
StateProv: CA
PostalCode: 94089
Country: US
CIDR: 216.109.112.0/20
This was interesting because one email account was spammed with 30-odd messages yesterday, all with the same "From:" Catrina.jones, and "Subject:" Jobs in Australia and New Zealand.
The other interesting thing about this is that I found the IP in the referrer.
The person doing the requesting was here in AU but they accessed the information via cache on the remote server. Here is an obfuscated line from our logs:
***.bigpond.net.au - - [04/Aug/2006:10:32:51 +1000] "GET /pic.gif HTTP/1.1" 200 64 "http://216.109.124.98/search/cache?p=green+widgets&prssweb=Search&ei=UTF-8&fr=fp&x=wrt&meta=vc%3DcountryAU&u=www.example.com/&w=green+widgets&d=VDdOuGP9NNUV&icp=1&.intl=au" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; CursorZone Grip Toolbar 2.08.548; .NET CLR 1.1.4322)"
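For anyone reading the log line above: it appears to follow Apache's standard "combined" log format. That is an assumption, since the server's actual LogFormat is not shown in this thread. In that format the second-to-last quoted field is the referrer and the last quoted field is the user agent, which would put the cache URL starting http://216.109.124.98/ in the referrer position.

```apache
# Apache's stock "combined" log format (server config only, not .htaccess).
# Assumption: the log line quoted above was written with this format.
#   %h   remote host         %t   request time     %r  request line
#   %>s  final status code   %b   response size in bytes
#   %{Referer}i and %{User-Agent}i are the two trailing quoted fields
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
```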
Regrettably, as our site is already in their cache, all I can do now is prevent requests for graphics and stylesheets, but maybe this warning will help someone else.
The only thing I'm not sure about is if they do their crawling from the server holding the cache.
Anyone have any feedback or ideas about this?
[edited by: volatilegx at 10:43 pm (utc) on Aug. 19, 2006]
[edit reason] obfuscated hostname [/edit]
hotjobs is a Yahoo! property, I believe.
Maybe Hotjobs are using Yahoo! cache then. If that is true it stinks and makes me even more likely to ban Slurp from our sites. They bring so little traffic they won't be missed.
wilderness wrote:
I have their entire range denied.
Thanks - I denied the CIDR as soon as I found what was happening:
deny from 216.109.112.0/20
Cache requests don't ask for robots.txt so I feel there is no point in padding out .htaccess with the much longer code for a mod_rewrite rule.
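For comparison, a mod_rewrite version of the same range block might look roughly like the sketch below. It is not taken from the thread: the regular expression is an illustrative rendering of 216.109.112.0/20 (third octet 112 to 127), and it assumes mod_rewrite is enabled and permitted in .htaccess.

```apache
RewriteEngine On
# 216.109.112.0/20 spans 216.109.112.0 - 216.109.127.255,
# so the third octet must be 112-119 or 120-127.
RewriteCond %{REMOTE_ADDR} ^216\.109\.(11[2-9]|12[0-7])\.
# Answer any request from that range with 403 Forbidden.
RewriteRule ^ - [F]
```

The one-line deny above does the same job with less to maintain, which is the point being made.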
BTW I checked back in the logs and found another couple of instances in July & August but none before that. One was from an IP in the UK, the other from the US.
With that said, our screen shot crazed friends at Snap use Firefox Linux to render pages with Javascript and everything, and I've seen a few others do the same.
Point is that anything you see in Google Analytics is more likely to be real humans than anything you would run across in your log file that didn't appear in Google Analytics, so you may be bouncing people.
Block with caution.
Unless they are using some super bots, you're probably seeing real humans
I'm well aware that it is humans, and I stated as much in my original post. I have never seen a bot request images and stylesheets via a remote site's cache.
... so you may be bouncing people. Block with caution.
I'm not bouncing anybody. I am only blocking requests from the Hotjobs IP range, which real people will only be using if they insist on viewing our site through Hotjobs' (Yahoo!'s?) cache. Plus, they will still be able to view our page in cache and will only miss out on the icing on the cake: the images and CSS, which they can always get if they visit directly.
That said, this particular site of ours is completely unrelated to employment in any shape or form, so there is no reason for someone to be searching Hotjobs for terms related to this niche business. The fact that three people have done so in the space of 2 months, but never previously, is rather suspicious.
Also, this business, besides being "very, very" niche, is a hire business and 99.9% of clients live/work in the same Australian city. The likelihood of a genuine potential client going to Hotjobs (in California) to search is minuscule.
[edited by: Mokita at 12:57 am (utc) on Aug. 19, 2006]
I have never seen a bot request images and stylesheets via a remote site's cache
I have seen it, so it's not uncommon.
I found a couple of yo-yos that look as human as you or me, except for the speed and duration of their page accesses; people just can't read that fast.
Luckily it's very uncommon or I'd be coming unglued about now...
BTW, I'm not positive but I'm pretty sure Yahoo has reallocated IP's from various properties for other purposes.
[edited by: incrediBILL at 1:48 am (utc) on Aug. 19, 2006]
but I'm pretty sure Yahoo has reallocated IP's from various properties for other purposes.
It still doesn't faze me. If people insist on viewing our site from cache rather than visiting openly and normally, they will just have to deal with not getting images and CSS. Our site was not down, or the successful requests for the supporting files would not have been logged.
but I'm pretty sure Yahoo has reallocated IP's from various properties for other purposes.
Sorry everyone, I'm tired and seem to be getting myself thoroughly confused with two separate issues.
1. I have blocked referrers containing the Hotjobs IP range using mod_rewrite.
RewriteCond %{HTTP_REFERER} ^http://216\.109\. [OR]
This will block requests for supporting files when made via cache from that IP range. It was this I was thinking about when I posted my reply to incrediBILL above.
2. I have also blocked the Hotjobs IP range thus: deny from 216.109.112.0/20
thinking that if Hotjobs are crawling in their own right rather than using Yahoo's cache this would stop them. But yes, incrediBILL is correct, this might have an unforeseen repercussion if Yahoo is using the IP range for other purposes. However, as Yahoo only send a tiny number of visitors our way, I doubt that the impact would be significant. YMMV.
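Put together, the two measures could be sketched in .htaccess roughly as follows. This is an illustration under stated assumptions, not the poster's actual file. It tests the referrer, because in the log line quoted earlier in the thread the cache URL sits in the referrer field. The list of file extensions is hypothetical. The access-control lines use the Apache 2.2-era Order/Allow/Deny syntax that matches the deny line in the thread; Apache 2.4 expresses the same thing with Require directives.

```apache
# 1. Refuse supporting files when the referring page is a cached copy
#    served from a 216.109.x.x address.
RewriteEngine On
RewriteCond %{HTTP_REFERER} ^http://216\.109\. [NC]
# Extension list is an assumption; adjust to the files you want to protect.
RewriteRule \.(gif|jpe?g|png|css)$ - [F]

# 2. Refuse direct requests from the HotJobs range itself.
#    With "Order Allow,Deny", a client matching both Allow and Deny is denied.
Order Allow,Deny
Allow from all
Deny from 216.109.112.0/20
```

Note that the first rule matches any referrer starting with 216.109., which is wider than the /20 in the second rule, so it may also catch cache servers outside the HotJobs allocation.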