Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
yahoo? Is that you?
lucy24

WebmasterWorld Senior Member; lucy24 is a WebmasterWorld Top Contributor of All Time; Top Contributors Of The Month



 
Msg#: 4640877 posted 12:09 am on Jan 29, 2014 (gmt 0)

Can anyone shed light on this set of referers? Not the visitor him/herself, just the information in the referer section.

aa.bb.cc.dd - - [28/Jan/2014:08:14:15 -0800] "GET /ebooks/paston/images/titlepageV.png HTTP/1.1" 200 2397 "http://72.30.186.176/search/srpcache?ei=UTF-8&p= {more-stuff-here} &u=http://cc.bingj.com/cache.aspx?q= {more-stuff-here}" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.76 Safari/537.36"
{same} - - [28/Jan/2014:08:14:15 -0800] "GET /ebooks/paston/pastonstyles.css HTTP/1.1" 200 10746 "http://72.30.186.176/search/srpcache?ei=UTF-8&p= {more-stuff-here} &u=http://cc.bingj.com/cache.aspx?q= {more-stuff-here}" "{same}"
{same} - - [28/Jan/2014:08:27:05 -0800] "GET /ebooks/paston/images/titlepageIV.png HTTP/1.1" 200 2397 "http://72.30.186.176/search/srpcache?ei=UTF-8&p= {more-stuff-here} &u=http://cc.bingj.com/cache.aspx?q= {more-stuff-here}" "{same}"

72.30 is, of course, Yahoo! slurp. Currently blocked.


The bits I left out: aa.bb.cc.dd is a community college in Pennsylvania. I've met them before.

The three requests represent all supporting files belonging to two ebooks. The 72.30. IP has never even requested, let alone received, any material from this subdirectory.

The parts I've rendered as "more-stuff-here" are details of the query, which is only funny if you know the context. Each is a verbatim quote from about dot com; each includes a date -- which is incompatible with the two specific volumes involved. Sounds like someone trying to cheat on an assignment ... but that's neither here nor there.
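For anyone who wants to pull these referers apart programmatically, here is a minimal Python sketch. It assumes the lines are in Apache's "combined" log format, which is what the excerpts above look like. The function names are made up for illustration, and the sample line is the first request quoted above with the elided parts left exactly as elided.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return the named fields of one log line, or None if it does not match."""
    m = COMBINED.match(line)
    return m.groupdict() if m else None

def referer_host(referer):
    """Host part of the referer URL (here: the Yahoo search IP)."""
    return urlsplit(referer).hostname

def cached_source(referer):
    """Host named in the referer's u= parameter (here: the Bing cache), if any."""
    params = parse_qs(urlsplit(referer).query)
    target = params.get("u", [None])[0]
    return urlsplit(target).hostname if target else None

SAMPLE = (
    'aa.bb.cc.dd - - [28/Jan/2014:08:14:15 -0800] '
    '"GET /ebooks/paston/images/titlepageV.png HTTP/1.1" 200 2397 '
    '"http://72.30.186.176/search/srpcache?ei=UTF-8&p= {more-stuff-here} '
    '&u=http://cc.bingj.com/cache.aspx?q= {more-stuff-here}" '
    '"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/32.0.1700.76 Safari/537.36"'
)

fields = parse_line(SAMPLE)
print(referer_host(fields["referer"]))   # the search front end
print(cached_source(fields["referer"]))  # where the cached copy lives
```

Splitting the referer this way separates the host that served the search page from the host that holds the cached copy, which is the distinction the log lines above turn on.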

 

dstiles

WebmasterWorld Senior Member; dstiles is a WebmasterWorld Top Contributor of All Time; 5+ Year Member



 
Msg#: 4640877 posted 8:01 pm on Jan 30, 2014 (gmt 0)

I think 72.30/16 is actually NOT slurp. I have slurp bots enabled but the whole of 72.30/16 blocked, and the specific IP you give is labelled in DNS as search not crawl. In general, anything with a bare IP gets stomped on anyway.
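As a rough illustration of the "bare IP" test, here is a small Python sketch that flags a referer whose host is a literal IP address and, optionally, checks it against a range such as 72.30/16. The function names are hypothetical, and this only shows the detection step, not any particular server's blocking mechanism.

```python
import ipaddress
from urllib.parse import urlsplit

def referer_ip(referer):
    """Return the referer host as an IP address object, or None if it is a name (or absent)."""
    host = urlsplit(referer).hostname
    if not host:
        return None
    try:
        return ipaddress.ip_address(host)
    except ValueError:
        return None  # an ordinary hostname, not a literal IP

def referer_is_bare_ip(referer):
    return referer_ip(referer) is not None

def referer_in_range(referer, cidr):
    """True if the referer host is a literal IP inside the given CIDR block."""
    ip = referer_ip(referer)
    return ip is not None and ip in ipaddress.ip_network(cidr)

print(referer_is_bare_ip("http://72.30.186.176/search/srpcache?ei=UTF-8"))
print(referer_in_range("http://72.30.186.176/search/srpcache?ei=UTF-8", "72.30.0.0/16"))
print(referer_is_bare_ip("http://example.com/directory"))
```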

Given the DNS reference I would suspect the hits to result from a yahoo search, except I've never seen an IP used in that way by yahoo. The specific IP has port 80 ONLY open, again suggesting a web site, and using the IP in a browser does in fact bring up a yahoo search page.

The reference to bing also suggests search, since yahoo uses bing as its source.

Philosopher

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 4640877 posted 8:12 pm on Jan 30, 2014 (gmt 0)

That's simply a request for a cached version of something that came up in the Yahoo search results.

lucy24

WebmasterWorld Senior Member; lucy24 is a WebmasterWorld Top Contributor of All Time; Top Contributors Of The Month



 
Msg#: 4640877 posted 9:55 pm on Jan 30, 2014 (gmt 0)

yahoo uses bing as its source

Ah, that does shed light. Couldn't work out how someone could get a yahoo cached page that yahoo had never asked for. (I mean, literally never: this particular page came into existence after I began keeping raw logs.)

I think 72.30/16 is actually NOT slurp.

Well, it claims to be, and that's good enough for me:

72.30.198.62 - - [16/Sep/2013:06:10:32 -0700] "GET /ebooks/christmas/MouseChristmas.html HTTP/1.1" 403 1600 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp) NOT Firefox/3.5"

I do not pretend to understand TextWrangler's alphabet. That's the first thing that came up in multi-file search, though all months are represented. Once in a blue moon they ask for robots.txt, but ordinarily it's sets of three: html file, errorstyles-- because of the 403-- and then piwik.js, also blocked. Fully humanoid, much like the plainclothes bingbot. I don't know if they would act on javascript if permitted to read it. (Bing does.) Never been curious enough to test.

dstiles

WebmasterWorld Senior Member; dstiles is a WebmasterWorld Top Contributor of All Time; 5+ Year Member



 
Msg#: 4640877 posted 7:46 pm on Jan 31, 2014 (gmt 0)

Claiming it's slurp when it's not is called "lying". :)

That UA mentions firefox: genuine slurp, in my experience, does not. In any case, firefox/3.5 has been obsolete for years. I'm not saying it's not genuine yahoo but it's not, as far as I know, genuine slurp. DNS does not even give it credit for being search. Is it possible it may be a proxy?

lucy24

WebmasterWorld Senior Member; lucy24 is a WebmasterWorld Top Contributor of All Time; Top Contributors Of The Month



 
Msg#: 4640877 posted 9:26 pm on Jan 31, 2014 (gmt 0)

In any case, firefox/3.5 has been obsolete for years.

Yes, if they were human, a page request would get them only a redirect to the Old Browsers page. (Because of my audience, I bend over backward to recognize old browsers. But robots are simply beyond everything plausible. Mozilla/3.0? Really?)

But, as noted elsewhere, they do say "NOT Firefox/3.5". That can't be considered "lying" :)

Free lookup says (cut-and-paste):

United States Novi Inktomi Corporation
United States AS26101 YAHOO-3 - Yahoo! (registered Jul 05, 2002)
h337.hlfs.bf1.yahoo.com

So if not yahoo, then definitely something yahoid. They don't have app-engine thingies like google do they?

bhukkel



 
Msg#: 4640877 posted 9:47 pm on Jan 31, 2014 (gmt 0)

They don't have app-engine thingies like google do they?


Yahoo - AS26101 has about 400k domains hosted on their network, so they offer some kind of hosted services. I don't know if it is an app-engine...but just php hosting is enough to crawl.

iomfan



 
Msg#: 4640877 posted 11:40 pm on Jan 31, 2014 (gmt 0)

Tangentially speaking:
I think 72.30/16 is actually NOT slurp.

Slurp doesn't show up any more, and I only accept the spiders "msnbot" and "bingbot" if they show up in the UA and come from a host with a domain name ending in ".search.msn.com" :)

dstiles

WebmasterWorld Senior Member; dstiles is a WebmasterWorld Top Contributor of All Time; 5+ Year Member



 
Msg#: 4640877 posted 8:58 pm on Feb 1, 2014 (gmt 0)

Lucy - since slurp is known to be not firefox why would they try to assure us it's not? Same as lying, in my book: trying to pretend it's firefox in case there is a throughway for a simple "firefox" UA. :(

Inktomi was a promising SE several years ago but was taken over by yahoo. Inktomi became a sort of paid-for listing service for them.

lucy24

WebmasterWorld Senior Member; lucy24 is a WebmasterWorld Top Contributor of All Time; Top Contributors Of The Month



 
Msg#: 4640877 posted 11:30 pm on Feb 1, 2014 (gmt 0)

Heh, I see your point. The RewriteRule or mod_security rule looks for "Firefox/3\.something", nets the robot, and the robot then says piously "But I clearly and distinctly said I'm NOT..."
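To make the point concrete, here is a small Python sketch of the same pattern logic (Python rather than an actual RewriteRule or mod_security rule). The naive pattern matches the Slurp UA quoted earlier in the thread because the token "Firefox/3.5" is present regardless of the "NOT" in front of it. The stricter variant with a negative lookbehind is only one possible workaround, and the "real Firefox" string is an illustrative example of the old Firefox 3.5 UA shape, not something taken from anyone's logs.

```python
import re

# Naive "old Firefox" test, roughly what a Firefox/3\.something rule would do
NAIVE = re.compile(r"Firefox/3\.\d")

# Stricter variant: ignore the token when it is immediately preceded by "NOT "
STRICT = re.compile(r"(?<!NOT )Firefox/3\.\d")

SLURP_UA = (
    "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; "
    "http://help.yahoo.com/help/us/ysearch/slurp) NOT Firefox/3.5"
)
# Illustrative example only
FIREFOX_UA = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1) Gecko/20090624 Firefox/3.5"

def looks_like_old_firefox(user_agent, pattern=NAIVE):
    return bool(pattern.search(user_agent))

print(looks_like_old_firefox(SLURP_UA))            # naive rule nets the robot
print(looks_like_old_firefox(SLURP_UA, STRICT))    # stricter rule lets it through
print(looks_like_old_firefox(FIREFOX_UA, STRICT))  # a plain Firefox 3.5 UA still matches
```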

I went to look up what I've currently got for Yahoo! Slurp. One range is China, so set that aside. Others:

67.195 This one's got a note: "67.195.115 ... the only Slurp that asks for robots.txt"

68.180.128.0/17 (blocked) includes YahooCacheSystem

72.30 the one that triggered this thread

74.6 (blocked)

98.136.0.0/14 (blocked) includes YahooCacheSystem.

I don't know when I made the notation about robots.txt so I re-checked. 67.195.115 is definitely the most common. But there's also a scattering from 72.30.161.222 (always that exact IP) and 68.180.224.abc.

I kinda think that at some time in the past YahooCacheSystem did something to offend me, since they seem to be blocked by IP wherever they occur. In fact-- I'd forgotten this detail-- I seem to have a universal block on "Yahoo" anywhere in the UA string.

Hm. I'm winding down the plainclothes bingbot experiment (on the grounds that I ended up learning absolutely nothing except that it executes javascript) so maybe I could let Slurp run wild for a few months instead.
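For reference, the ranges listed above can be kept as a small lookup table. This is a Python sketch only: the /16 prefix on the two-octet entries (67.195, 72.30, 74.6) is an assumption, since the post gives only the first two octets, and the notes are paraphrased from the list.

```python
import ipaddress

# Ranges as listed in the post; the /16 on two-octet entries is an assumption.
YAHOO_RANGES = {
    "67.195.0.0/16": "67.195.115 asks for robots.txt",
    "68.180.128.0/17": "blocked; includes YahooCacheSystem",
    "72.30.0.0/16": "the range that triggered this thread",
    "74.6.0.0/16": "blocked",
    "98.136.0.0/14": "blocked; includes YahooCacheSystem",
}

NETWORKS = [(ipaddress.ip_network(cidr), note) for cidr, note in YAHOO_RANGES.items()]

def lookup(ip):
    """Return (cidr, note) for the first listed range containing ip, else None."""
    addr = ipaddress.ip_address(ip)
    for net, note in NETWORKS:
        if addr in net:
            return str(net), note
    return None

for ip in ("72.30.186.176", "72.30.161.222", "68.180.224.228"):
    print(ip, lookup(ip))
```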

keyplyr

WebmasterWorld Senior Member; keyplyr is a WebmasterWorld Top Contributor of All Time; 10+ Year Member; Top Contributors Of The Month



 
Msg#: 4640877 posted 7:52 am on Feb 2, 2014 (gmt 0)

A lot has changed with Y! IMO, IPs have been shuffled around.

The *only* Slurp that has been showing up at my place lately is 68.180.128.0/17. No YahooCacheSystem AFAIK. In fact, they've been very busy on the several sites I watch. Gives me the impression that something is afoot.

keyplyr

WebmasterWorld Senior Member; keyplyr is a WebmasterWorld Top Contributor of All Time; 10+ Year Member; Top Contributors Of The Month



 
Msg#: 4640877 posted 6:39 am on Feb 3, 2014 (gmt 0)

98.139.204.44 - - [02/Feb/2014:22:30:06 -0800] "HEAD / HTTP/1.1" 200 278 "http://example.com/directory" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"

p9w12.geo.bf1.hostingprod.com
Yahoo/Inktomi
98.136.0.0 - 98.139.255.255
98.136.0.0/14

So what is HostingProd.com?

BTW - I've recently verified that a legit Slurp comes from 98.136.0.0/14 so that's why it got in.
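As a quick sanity check on the lookup output above, the start and end addresses can be collapsed into CIDR form with Python's standard `ipaddress` module. This sketch only confirms that 98.136.0.0 - 98.139.255.255 is the single block 98.136.0.0/14 and that the logged IP falls inside it.

```python
import ipaddress

first = ipaddress.ip_address("98.136.0.0")
last = ipaddress.ip_address("98.139.255.255")

# summarize_address_range yields the smallest set of networks covering first..last
blocks = list(ipaddress.summarize_address_range(first, last))
print(blocks)

visitor = ipaddress.ip_address("98.139.204.44")
print(any(visitor in block for block in blocks))
```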

Concerning YahooCacheSystem - there was a detailed discussion about this a couple years ago in which a couple WW members said how this service sells their collected data to various bidders. I forget the entire story, but it didn't look like something I wanted done with my site data, so I have blocked by UA ever since.

lucy24

WebmasterWorld Senior Member; lucy24 is a WebmasterWorld Top Contributor of All Time; Top Contributors Of The Month



 
Msg#: 4640877 posted 7:53 am on Feb 3, 2014 (gmt 0)

I re-checked YahooCacheSystem. (And also looked closer at TextWrangler prefs. You can tell it not to check zipped files, but it includes them by default. Shrug.) Looks like they haven't been around since November of 2012. So whatever they used to do, they're not doing it any more.

98.139.241.abc
209.131.38.abc

The second one is, if my notes can be trusted, Mobile Yahoo. I'd forgotten how little they did. All they ever asked for was the front page and favicon. And as soon as I blocked them they stopped asking for the favicon.

:: wandering off to swap current handling of plainclothes bingbot and Slurp ::

Angonasec

10+ Year Member



 
Msg#: 4640877 posted 3:25 pm on Feb 3, 2014 (gmt 0)

When feeds were (briefly) useful to site owners Yahoo's feed bot was worth tolerating.

It used 98.139.134.96 "YahooCacheSystem" and is still very active.

If Yahoo get serious about mobile search again we may have to consider allowing them in.

iomfan



 
Msg#: 4640877 posted 10:49 pm on Feb 5, 2014 (gmt 0)

Lucy24 had written:
72.30.198.62 - - [16/Sep/2013:06:10:32 -0700] "GET /ebooks/christmas/MouseChristmas.html HTTP/1.1" 403 1600 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; [help.yahoo.com...] NOT Firefox/3.5"

(72.30.198.62 = h337.hlfs.bf1.yahoo.com)

I had written:
Slurp doesn't show up any more


Yes it does (this is the first time I've seen Yahoo on any of our sites since it stopped):
b100104.yse.yahoo.net - - [05/Feb/2014:20:26:20 +0000] "GET /robots.txt" 200 2551 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; [help.yahoo.com...]


Confirmed that b100104.yse.yahoo.net = 68.180.224.228 and the other way around...

Slurp can from now on also have what Google and MSN get... ;)
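The "and the other way around" check is the usual forward-confirmed reverse DNS test: look up the PTR name for the IP, make sure it ends in a trusted suffix, then resolve that name and confirm it points back to the same IP. Here is a minimal Python sketch. The suffix list is an assumption drawn from hostnames mentioned in this thread, not an official list from Yahoo or Microsoft, and the function names are made up. The lookups are passed in as parameters so the logic can be tried out without touching the network.

```python
import socket

def reverse_lookup(ip):
    """PTR hostname for ip, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None

def forward_lookup(hostname):
    """IPv4 addresses the hostname resolves to (empty list on failure)."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []

def is_verified_crawler(ip, allowed_suffixes, reverse=reverse_lookup, forward=forward_lookup):
    """Forward-confirmed reverse DNS: IP -> name (trusted suffix) -> same IP."""
    host = reverse(ip)
    if not host:
        return False
    host = host.rstrip(".").lower()
    if not host.endswith(tuple(allowed_suffixes)):
        return False
    return ip in forward(host)

# Suffixes taken from hostnames seen in this thread; treat them as assumptions.
TRUSTED = [".yse.yahoo.net", ".search.msn.com"]

# Offline demonstration with canned answers instead of live DNS
ptr = {"68.180.224.228": "b100104.yse.yahoo.net", "72.30.198.62": "h337.hlfs.bf1.yahoo.com"}
a = {"b100104.yse.yahoo.net": ["68.180.224.228"]}
print(is_verified_crawler("68.180.224.228", TRUSTED, ptr.get, lambda h: a.get(h, [])))
print(is_verified_crawler("72.30.198.62", TRUSTED, ptr.get, lambda h: a.get(h, [])))
```

Calling `is_verified_crawler(ip, TRUSTED)` without the last two arguments uses live DNS. A PTR record alone proves little, because whoever controls the reverse zone can put any name there, so the forward step is what makes the check meaningful.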

lucy24

WebmasterWorld Senior Member; lucy24 is a WebmasterWorld Top Contributor of All Time; Top Contributors Of The Month



 
Msg#: 4640877 posted 12:35 am on Feb 6, 2014 (gmt 0)

I removed the universal UA block on Yahoo, but they don't seem to have noticed yet :)

keyplyr

WebmasterWorld Senior Member; keyplyr is a WebmasterWorld Top Contributor of All Time; 10+ Year Member; Top Contributors Of The Month



 
Msg#: 4640877 posted 1:20 am on Feb 6, 2014 (gmt 0)



To reiterate... anyone have any info on HostingProd.com? (see above msg:4642013)

iomfan



 
Msg#: 4640877 posted 4:44 am on Feb 6, 2014 (gmt 0)

HostingProd.com? It's a domain hosted within a Yahoo IP block

More info here: [whois.domaintools.com...]

About the entry in your logfile: my guess would be that someone tried out a spider... ;)

keyplyr

WebmasterWorld Senior Member; keyplyr is a WebmasterWorld Top Contributor of All Time; 10+ Year Member; Top Contributors Of The Month



 
Msg#: 4640877 posted 7:01 am on Feb 6, 2014 (gmt 0)



HostingProd.com? It's a domain hosted within a Yahoo IP block

More info here: [whois.domaintools.com...]
spider... ;)

Yes thanks, I was aware of all that before I posted. I asked for info from anyone who knows more about it.

iomfan



 
Msg#: 4640877 posted 7:50 am on Feb 6, 2014 (gmt 0)

My apologies - since you hadn't gotten a reply after 3 days I felt a bit helpful but was too cryptic: what I wanted to say is, "what Domaintools has is about all there is" - you will likely have noticed yourself that there is no web presence online under that domain name but that there exist plenty of sites with URLs in the format
hostingprod.com/@other_domain.com. ;) Unsavory in any way? Or just eccentric? ...
keyplyr

WebmasterWorld Senior Member; keyplyr is a WebmasterWorld Top Contributor of All Time; 10+ Year Member; Top Contributors Of The Month



 
Msg#: 4640877 posted 9:46 am on Feb 6, 2014 (gmt 0)

Well I should have listed all I knew first. Anyway, looking at my header log for that period, there wasn't anything malformed unless I missed it. Ping shows open ports like a server would. Guess I'll filter that range, maybe allowing only the Slurp UA through (like that'll do much good.)
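The filter described here (block the range, let only the Slurp UA through) could be sketched in Python like this. The function name and the exact UA substring are assumptions for illustration. As the parenthetical above already hints, the UA string is trivially spoofable, so this narrows the door rather than locking it.

```python
import ipaddress

FILTERED_RANGE = ipaddress.ip_network("98.136.0.0/14")
SLURP_TOKEN = "Yahoo! Slurp"  # substring seen in the Slurp UAs quoted in this thread

def allow_request(ip, user_agent):
    """Allow everything outside the range; inside it, only UAs claiming to be Slurp."""
    if ipaddress.ip_address(ip) not in FILTERED_RANGE:
        return True
    return SLURP_TOKEN in user_agent

# The HEAD request quoted earlier: inside the range, MSIE UA
print(allow_request(
    "98.139.204.44",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
))
```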


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved