Scout

wilderness

5:30 pm on Mar 16, 2008 (gmt 0)

Has anybody allowed the spidering?
The IP range is a known colo, which I've had denied for some time.

The bot operator contacted me this AM from another IP range, via the submission form on one of my websites, and I responded.

In addition, the bot's creator did some crawling of my sites during January from a fixed IP.

Personally, I'm not willing to make colo exceptions.
Besides, the SE doesn't yet offer an active, functioning service for the general public.

Don

incrediBILL

9:13 pm on Mar 16, 2008 (gmt 0)

Don, I allow nothing; it's called whitelisting ;)

If you're talking about Sphere Scout, it won't darken my door until I find a use for Sphere.

wilderness

11:11 pm on Mar 16, 2008 (gmt 0)

Bill,
I chose not to explain whitelisting to the gent; after all, he may be the next Google ;)
He was the something-or-other behind DMOZ.

Here's the UA.

"Mozilla/5.0 (compatible; ScoutJet; +http://www.scoutjet.com/)"

Don

incrediBILL

12:43 am on Mar 17, 2008 (gmt 0)

Ah, I've seen ScoutJet as well, but I'm not willing to let them crawl until I see something of value start to form. A few so-called well-funded startups have been attempting to crawl for years with nothing to show for it.

Heck, some of them panic and start building niche search engines for customers that aren't even remotely related to my topic, which is why I'm quite the "wait and see" guy when it comes to giving someone access to hundreds of thousands of pages.

On the bright side, Searchme is in private beta now and the Flash demo on their home page is hot; you should take a peek.

wilderness

1:37 am on Mar 17, 2008 (gmt 0)

Ah, I've seen ScoutJet as well, but I'm not willing to let them crawl until I see something of value start to form.

Bill,
Exactly what I offered in my private reply, along with the log lines from the unidentified crawl from the DMOZ guy's other IP. ;)

BTW, the bot returned today in spite of their inquiry and my reply.

On the bright side, Searchme is in private beta now and the Flash demo on their home page is hot; you should take a peek.

You ever see the movie Rain Man, where the challenged brother is being returned to care and the gent asks:

"where's your favorite K-mart clothes?"
"Tell 'em Ray"
"K-mart sucks!"
end of quote

Flash and Java sucks ;)

Don

edited by wilderness.

incrediBILL

9:08 am on Mar 17, 2008 (gmt 0)

Getting off topic here, but say what you want, Don: Searchme is cool, and the visual way it works is very natural.

... and there's nothing wrong with Flash and JavaScript; stop being such a Luddite ;)

skrenta

4:10 pm on Mar 17, 2008 (gmt 0)

ScoutJet is me; it is a well-behaved robot. It enforces a 45-second minimum delay between fetches per IP address. Of course you are free not to let it in; it obeys robots.txt. :)
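A minimal sketch of how such a per-IP throttle might look (an illustration, not ScoutJet's actual code; all names here are hypothetical):

import time

class PolitenessGate:
    # Enforce a minimum delay between fetches to the same IP address.
    def __init__(self, min_delay=45.0):   # seconds, per the figure above
        self.min_delay = min_delay
        self.last_fetch = {}              # ip -> time of the previous fetch

    def wait_turn(self, ip):
        # Sleep just long enough that fetches to this IP are at least
        # min_delay seconds apart, then record the new fetch time.
        now = time.monotonic()
        last = self.last_fetch.get(ip)
        if last is not None and now - last < self.min_delay:
            time.sleep(self.min_delay - (now - last))
        self.last_fetch[ip] = time.monotonic()

gate = PolitenessGate()
gate.wait_turn("64.13.128.10")   # first fetch to this IP: no wait
gate.wait_turn("64.13.128.10")   # second fetch: sleeps ~45 seconds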

incrediBILL, totally agree on the poor value from niche search engines. Not our intent. Full-scale real web search is so much more interesting.

incrediBILL

5:13 pm on Mar 17, 2008 (gmt 0)

Skrenta, welcome to WebmasterWorld!

If you implemented the "Crawl-delay" parameter, you could even fetch pages faster on some sites; I have mine set to 2 seconds, assuming I'd let you crawl at this time. ;)
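For reference, Crawl-delay is a non-standard robots.txt extension, set per user-agent (Yahoo! and MSN honored it at the time; Google did not). A sketch of what Bill describes, assuming the bot matches on the "ScoutJet" token from the UA string quoted earlier:

User-agent: ScoutJet
Crawl-delay: 2
Disallow:

The empty Disallow line permits the whole site; the bot is simply asked to wait 2 seconds between requests.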

skrenta

6:08 pm on Mar 17, 2008 (gmt 0)

Thanks incrediBILL. :)

I hadn't thought of crawl-delay as a mechanism to crawl even faster... hmmm, we would have to take the minimum of the values across all hosts at that IP. But the thing is, if there are, say, 100 pages to get from a host, and the goal is coverage in something like 7 days, it just doesn't need to hurry: it can fetch a page every hour and a half and still get the job done. Even 10,000 pages can still get there at a page a minute. Not that we're crawling that much yet. Only lazy bot writers hit every second when they don't have to.
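The pacing arithmetic from that paragraph, as a quick sketch:

# Spread N pages over a coverage window instead of hammering the host.
def fetch_interval_seconds(pages, coverage_days=7):
    return coverage_days * 24 * 3600 / pages

print(fetch_interval_seconds(100))    # 6048.0 -> one page every ~100 minutes
print(fetch_interval_seconds(10000))  # 60.48  -> roughly a page a minute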

wilderness's comment made me realize that we're still fetching robots.txt once a day from sites that have banned us... I will fix that to use a longer negative cache. In fact, for now I will make it a permanent negative cache, so that it never comes back. :)
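A sketch of that negative cache (hypothetical names, not ScoutJet internals; the ttl knob is the difference between "longer" and "permanent"):

import time

class RobotsDenialCache:
    # Remember hosts whose robots.txt banned us, so we stop re-fetching daily.
    def __init__(self, ttl=float("inf")):   # inf = the "permanent" option
        self.ttl = ttl
        self.denied_at = {}                  # host -> time the ban was recorded

    def record_denial(self, host):
        self.denied_at[host] = time.time()

    def is_denied(self, host):
        seen = self.denied_at.get(host)
        return seen is not None and time.time() - seen < self.ttl

cache = RobotsDenialCache()          # permanent: a banning host is never re-visited
cache.record_denial("example.com")
assert cache.is_denied("example.com")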

incrediBILL

10:01 pm on Mar 17, 2008 (gmt 0)

Skrenta, the reason I suggested the crawl-delay is that some of us have sites with 100K+ pages; at a 45-second delay, a single pass over 100K pages would take nearly two months. I'm willing to let a bot crawl faster as long as it doesn't overload the server; otherwise you would never get my site crawled or updated, as hundreds of pages change daily.

Even at "Crawl-delay: 2" I've had the combination of Google, Yahoo and MSN cause an overload twice last year, for unknown reasons, since all three crawl my site all day, every day. Something converged at the wrong time; I never did figure it out.

[edited by: incrediBILL at 10:03 pm (utc) on Mar. 17, 2008]

Hobbs

10:25 am on Apr 30, 2008 (gmt 0)

"Mozilla/5.0 (compatible; ScoutJet; +http://www.scoutjet.com/)"

Seeing it trying at 2 pages per minute.
But I have their host "Silicon Valley Colocation" range 64.13.128.0/18 blocked; I agree with wilderness about not making colo exceptions. So no pages for you, ScoutJet.
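For anyone wanting to do the same, a minimal .htaccess sketch in the Apache 2.2-era Allow/Deny syntax (adapt to your own server; the range is the one cited above):

Order Allow,Deny
Allow from all
Deny from 64.13.128.0/18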

jdMorgan

1:56 am on Jun 16, 2008 (gmt 0)

In fact, for now I will make it a permanent negative cache, so that it never comes back.

Hopefully, that was a jest -- or a repartee to those who don't let you crawl. The Web changes too fast for anything to be considered "permanent" -- Why just the other day, AltaVista was King, and there was news of some little search start-up called "Goggle" or something...

Some Webmasters only allow the "big three" or perhaps the "big four" -- based strictly on referral ROI. Others allow a selected whitelist to crawl, while still others allow everybody and his brother.

But many here want to see some indication of a return for the bandwidth and log space consumed by crawlers -- a referral or two would be nice. Lacking that, many take a "wait and see" approach. For that reason, a permanent "no crawl" flag would be counterproductive.

Jim

caribguy

12:50 pm on Aug 17, 2008 (gmt 0)

So, I guess that ScoutJet and Sphere Scout from 64.40.118.zzz are two different animals?

wilderness

3:10 pm on Aug 17, 2008 (gmt 0)

I doubt the two are related.

Even the colos are different.