Yahoo Java Crawler and Mod Security

frontpage

2:05 am on Jan 30, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Lately, we are getting hundreds of crawls by Yahoo using a Java crawler. However, Mod Security is 406'ing all these requests.

We do not want to leave our server open to Java scrapers but we do want Yahoo to crawl our site.

Does anyone have a solution to this issue?

Example log:

66.228.167.32 - - [28/Jan/2008:16:20:18 -0500] "GET /foo.html HTTP/1.0" 406 460 "-" "Java/1.5.0_11"

IP 66.228.167.32 resolves to fsdev1000.yst.corp.yahoo.com

[edited by: volatilegx at 9:35 pm (utc) on Feb. 12, 2008]

Brett_Tabke

7:02 pm on Jan 31, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



I am 95% sure that spidering is not going into the index. It has been my experience that anything coming from .corp is for other internal purposes.

wilderness

10:55 pm on Jan 31, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Most everybody has the Java UA denied (seem to recall that Jim has an exception for one SE) in every format.
See:
[webmasterworld.com...]
[webmasterworld.com...]
[webmasterworld.com...]

# keep-out, or whatever variable name you use
SetEnvIf User-Agent Java keep-out
Deny from env=keep-out

OR

RewriteCond %{HTTP_USER_AGENT} Java
RewriteRule .* - [F]

66.228.167.aa is an Overture (formerly GoTo.com) range.
I've had 66.228.166. denied for an eternity.
Goto has always been a pest.
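
For readers who want to see what a range deny like that looks like, here is a minimal sketch. It assumes Apache 2.2-style access control (mod_authz_host) in an .htaccess file or a Directory block; the exact lines in use on the poster's sites are not shown in this thread.

```apache
# Assumption: Apache 2.2 syntax, AllowOverride permits Limit directives.
# Allow is evaluated first, then Deny; a matching Deny wins.
Order Allow,Deny
Allow from all

# A partial address matches every host in 66.228.166.0 - 66.228.166.255
Deny from 66.228.166.
```

On Apache 2.4 the Order/Allow/Deny directives are only available through mod_access_compat; the native equivalent there is a Require not ip line inside a RequireAll block.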

Don

frontpage

1:31 pm on Feb 1, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



We handled it by whitelisting the IP but keeping the Mod_Security ban on Java.

What is strange is all these requests by Yahoo for the same page over and over. It has no ads, no links, and is very uninteresting. I consider this to be 'cloaking' by a search engine or obfuscation.
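
A rough sketch of that whitelist-plus-ban arrangement, assuming ModSecurity 2.x. The rule IDs are made up for illustration, the address is the one from the log line earlier in the thread, and the second rule is only a stand-in for whatever rule in the installed rule set actually returns the 406.

```apache
# Hypothetical rule IDs; pick ones that do not clash with your rule set.
# Must appear BEFORE the Java rule: a match lets this one address through
# without evaluating the rules that follow.
SecRule REMOTE_ADDR "^66\.228\.167\.32$" "phase:1,id:900001,nolog,allow"

# Assumed stand-in for the existing ban on Java user agents.
SecRule REQUEST_HEADERS:User-Agent "^Java/" "phase:1,id:900002,log,deny,status:406,msg:'Java user agent'"
```

How far allow reaches (current phase only, or all remaining phases) depends on the ModSecurity version, so it is worth checking the logs after a test request from the whitelisted address.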

wilderness

2:00 pm on Feb 1, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It's no different from Jeeves (or others) grabbing pages with a Linux browser and then (in my case, where Linux is denied) grabbing the same page instantly with a standard browser footprint.

It seems an accepted practice for SE's to cloak, yet webmasters are penalized for similar viewable practices.

All the major SE's have so many tools available to users, grabbing from so many different IP ranges, that it's almost a joke. In fact it would be if webmasters didn't have to deal with their farces.

As an aside, yesterday I saw a bunch of requests with blank referers and UAs from the 131.107. range (which I haven't seen in some time).
This IP range gets denied at my sites regardless of what UA it uses (and has been for quite a while; very OLD MSN thread here).

131.107.0.*** - - [31/Jan/2008:17:49:06 -0600] "GET /Myfolder/Mypage.html HTTP/1.1" 403 - "-" "-"

[edited by: volatilegx at 9:37 pm (utc) on Feb. 12, 2008]

Brett_Tabke

2:04 pm on Feb 1, 2008 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



It has been my experience that when you get hit by that bot, that means someone at .corp put the page or site on the internal watch list. The bot is from the press department. I would guess someone finds your site interesting and they are monitoring it for kw's of interest.

wilderness

2:45 pm on Feb 1, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Many thanks Brett, however I simply wish they'd go away ;)

My widgets derive visitors from the oddest of sources.

Recall one from "NYSE" that had a nasty habit of going through hundreds of pages (very slowly and time consuming) looking for materials, rather than learning properly how to utilize the search options and quotes for proper or multiple names ;)

Get regular visitors from the .MIL sites as well. Early on with my websites, I thought it was some government conspiracy ;)

All these and many more oddities would seem peculiar, however the referring searches are on topic and validate their interest.

Unfortunately the 131.107. range has never provided valid referers:
2003:
[webmasterworld.com...]
[webmasterworld.com...]

Additionally, there were a couple more MSN threads in Forum 11 around the same time, which I failed to bookmark.

Don

wilderness

2:54 pm on Feb 1, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



they are monitoring it for kw's of interest

That may well be spot-on.

Have about a hundred articles online from a talented widget writer who died in 1947.
His articles offer an amazing depth, which today's periodicals seem to lack given subscribers' attention spans.

I recall a series of articles (ran weekly for nearly three months; not online) by the writer that were very interesting. However, there was very little subject matter (at least after the initial article of rants) about my widgets.
Rather, the articles' topics took a very sharp turn in the direction of John Hunt Morgan, the Civil War guerrilla.

wilderness

9:50 pm on Feb 11, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There simply aren't enough explicit words in the dictionary!

67.195.44.108 - - [11/Feb/2008:12:05:33 -0600] "GET /robots.txt HTTP/1.0" 200 4549 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]

Followed by a subdirectory file read from a different Class D.

wilderness

3:12 pm on Feb 15, 2008 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The following was sent through their contact page.
It's been my experience that their response is rather slow.
-------------------
A Yahoo bot has been crawling pages (the same two pages; four times daily) for the past four days (see below for one example).

My inquiry is because the internet provider range is outside the normal ranges used by Yahoo, although I have no doubt as to the authenticity of Yahoo.
My question is why the additional and NEW (at least to me and other webmasters) IP range?
Is there some source or specific tool in use for Yahoo users which would result in Yahoo spidering from this IP range?

67.195.44.108 - - [11/Feb/2008:02:00:06 -0600] "GET /robots.txt HTTP/1.0" 200 4549 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]