Blekko/Scoutjet

     
8:34 am on Jul 30, 2012 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:6547
votes: 115




While the Blekkobot and ScoutJet are following the deny in robots.txt and requesting nothing further, something from Blekko *is* crawling and caching pages.

I put up the denies for these two bots of theirs when I read their public statement announcing they would *not* support the nocache tag and that they would continue to post cached versions of all web pages in their SERPs.

After a month or so I just checked and they are displaying fresh cached versions of my web pages, so something else is getting them. Anyone know what it is and what range it's coming from?

I just added this block:

199.87.248.0 - 199.87.255.255
199.87.248.0/21
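
For anyone converting a start-end range like the one above into CIDR notation for a deny rule, here is a minimal sketch using Python's standard ipaddress module. The helper name range_to_cidrs is made up for illustration and the code only restates the range quoted in this post; it is not an official Blekko list.

```python
import ipaddress

def range_to_cidrs(first, last):
    """Return the CIDR blocks that exactly cover first..last (inclusive)."""
    start = ipaddress.ip_address(first)
    end = ipaddress.ip_address(last)
    # summarize_address_range yields the smallest set of networks for the span
    return [str(net) for net in ipaddress.summarize_address_range(start, end)]

# Range as quoted in this post
print(range_to_cidrs("199.87.248.0", "199.87.255.255"))
```

A range that does not fall on a clean power-of-two boundary comes back as several blocks, which is why the function returns a list.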
4:48 pm on July 30, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


This is a cached page of http://www.example.com/index.html from Blekko's web crawl.

Error: No content
end of quote

Course I deny just about everything.
6:27 pm on July 30, 2012 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:6547
votes: 115



The question was... what range?
6:41 pm on July 30, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


199.87.252.51 - - [01/May/2012:08:26:26 +0100] "GET /robots.txt HTTP/1.1" 200 2642 "-" "Mozilla/5.0 (compatible; Blekkobot; ScoutJet; +http://blekko.com/about/blekkobot)"
199.187.122.98 - - [01/May/2012:08:30:44 +0100] "GET /robots.txt HTTP/1.1" 200 2642 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727)"
8:12 pm on July 30, 2012 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3121
votes: 3


I have ranges:

38.99.96.0 - 38.99.99.255
64.13.159.0 - 64.13.159.255
199.87.248.0 - 199.87.255.255

If you go to the link in the bot's UA, blekko gives you their crawling ranges. One of the better engines in that respect.

199.187.122.nn does not appear to be associated with blekko? Could be a coincidence or someone scraping blekko for links?
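
A quick way to check an address against those ranges, as a rough sketch: the list below is just the three ranges quoted in this post (not an authoritative Blekko list), and the names BLEKKO_RANGES, build_networks and is_listed are hypothetical.

```python
import ipaddress

# Ranges as quoted in this thread (start, end); an assumption, not an official list
BLEKKO_RANGES = [
    ("38.99.96.0", "38.99.99.255"),
    ("64.13.159.0", "64.13.159.255"),
    ("199.87.248.0", "199.87.255.255"),
]

def build_networks(ranges):
    """Convert (start, end) pairs into ipaddress network objects."""
    networks = []
    for first, last in ranges:
        networks.extend(ipaddress.summarize_address_range(
            ipaddress.ip_address(first), ipaddress.ip_address(last)))
    return networks

def is_listed(ip, networks):
    """True if ip falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

NETWORKS = build_networks(BLEKKO_RANGES)
for ip in ("199.87.252.51", "199.187.122.98"):
    print(ip, is_listed(ip, NETWORKS))
```

With these three ranges the Blekkobot hit from 199.87.252.51 is inside the list and 199.187.122.98 is not.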
8:22 pm on July 30, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


199.87.252.51 - - [01/May/2012:08:26:26 +0100]
199.187.122.98 - - [01/May/2012:08:30:44 +0100]

Note the times?
This is approximately 3AM EST, and a very slow time for my websites, thus the log entries were consecutive.

Most North American widget folks are sleeping at that time.

The mid and western Euro's that I allow access to are just beginning their days (these widget folks tend to be more active in the late-afternoon and evenings) with an approximate 5-6 hour difference.

Too much of a coincidence for me to dis-associate the two, however everybody knows I'm paranoid ;)
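
To put numbers on the timing, here is a small sketch that parses the two log timestamps above and works out the gap between them. The fixed UTC-5 offset used for EST is an assumption; in May the US Eastern zone is normally on daylight time (UTC-4), which would shift the local time one hour later. The name parse_log_time is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Timestamp layout used in the log lines quoted in this thread
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

def parse_log_time(stamp):
    """Parse an access-log timestamp into a timezone-aware datetime."""
    return datetime.strptime(stamp, LOG_TIME_FORMAT)

first = parse_log_time("01/May/2012:08:26:26 +0100")
second = parse_log_time("01/May/2012:08:30:44 +0100")

# Seconds between the two robots.txt requests
gap_seconds = (second - first).total_seconds()

# Assumption: fixed UTC-5 offset, ignoring daylight saving time
EST = timezone(timedelta(hours=-5), "EST")
first_est = first.astimezone(EST)

print(gap_seconds)
print(first_est.strftime("%H:%M:%S %Z"))
```

The two requests come out a little over four minutes apart.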
8:29 pm on July 30, 2012 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:6547
votes: 115


Thanks dstiles. Didn't have the 38.99.96.0 - 38.99.99.255

I had the Silicon Valley Colo as a larger block:

64.13.128.0 - 64.13.191.255
64.13.128.0/18

And wilderness, I'll keep an eye on 199.187.122.* Thanks
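
For comparison, a short sketch showing how the larger /18 block relates to the single 64.13.159.0 - 64.13.159.255 range quoted earlier in the thread. It uses Python's standard ipaddress module (subnet_of needs Python 3.7 or later); the variable names are only for illustration.

```python
import ipaddress

# The wider colo block versus the single /24 range quoted in the thread
wide = ipaddress.ip_network("64.13.128.0/18")
narrow = ipaddress.ip_network("64.13.159.0/24")

# True when every address of the narrow block is inside the wide block
print(narrow.subnet_of(wide))

# How many addresses each deny rule would cover
print(wide.num_addresses, narrow.num_addresses)
```

The wider block already covers the /24, at the cost of denying everything else hosted in that /18.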
8:09 pm on July 31, 2012 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member dstiles is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:May 14, 2008
posts:3121
votes: 3


wilderness - my point was that the 199.187.122.nn hit was probably not blekko itself but either someone using blekko as a search source (as is common with google and other SEs) or someone running an automated scrape, which could as easily come at that time as at any other.

Given your assertion re: access activity times, I would opt for the former: using blekko as a scraper search source.

Either way, I don't see blekko itself being the culprit, although I could be wrong. I'd be interested to see any other evidence of association between blekko and databasebydesign.
8:16 pm on July 31, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member wilderness is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2001
posts:5459
votes: 3


I'd be interested to see any other evidence of association between blekko and databasebydesign.


Unfortunately, in most instances, once I've denied a range I no longer continue accumulating references.
4:21 am on Aug 4, 2012 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:6547
votes: 115


Since denying Blekkobot/ScoutJet it now shows up every single day requesting robots.txt 20 to 30 times.

The good news is, while my pages are still listed there, all the cached copies are now gone.
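
If anyone wants to track that daily robots.txt count, here is a rough sketch that tallies requests per day from combined-format log lines. The regex assumes the same log layout as the entries quoted in this thread, the sample lines are the two from wilderness's post, and the names LOG_PATTERN and count_robots_hits are hypothetical.

```python
import re
from collections import Counter

# Assumes the combined log layout seen in the entries quoted in this thread
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^:]+):[^\]]+\] "(\S+) (\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "([^"]*)"'
)

def count_robots_hits(lines, ua_token="blekkobot"):
    """Count /robots.txt requests per day for user agents containing ua_token."""
    per_day = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed layout
        _ip, day, _method, path, user_agent = match.groups()
        if path == "/robots.txt" and ua_token in user_agent.lower():
            per_day[day] += 1
    return per_day

SAMPLE = [
    '199.87.252.51 - - [01/May/2012:08:26:26 +0100] "GET /robots.txt HTTP/1.1" '
    '200 2642 "-" "Mozilla/5.0 (compatible; Blekkobot; ScoutJet; '
    '+http://blekko.com/about/blekkobot)"',
    '199.187.122.98 - - [01/May/2012:08:30:44 +0100] "GET /robots.txt HTTP/1.1" '
    '200 2642 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; '
    '.NET CLR 2.0.50727)"',
]

print(dict(count_robots_hits(SAMPLE)))
```

Matching on the user agent alone will miss anything that fetches robots.txt under a browser UA, like the second sample line, so it is worth pairing with an IP range check.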