The bot operator contacted me this AM from another IP range via the submission form of one of my websites, and I responded.
In addition, the bot's creator did some crawling of my sites during January from a fixed IP.
Personally, I'm not willing to make colo exceptions.
In addition, the SE really doesn't offer an active, functioning service for the general public (at least not yet).
Don
Heck, some of them panic and start building niche search engines for customers that aren't even remotely related to my topic, which is why I'm quite the "wait and see" guy when it comes to giving someone access to hundreds of thousands of pages.
On the bright side, Searchme is in private beta now and the Flash demo on their home page is hot -- you should take a peek.
Ah, I've seen scoutjet as well, but I'm not willing to let them crawl until I see something of value start to form -- a few so-called well-funded startups have been attempting to crawl for years with nothing coming from it.
Bill,
Exactly what I offered in my private reply, as well as the log lines from their unidentified crawl from the other IP of the DMOZ Guy. ;)
BTW, the bot returned today in spite of their inquiry and my reply.
On the bright side, Searchme is in private beta now and the Flash demo on their home page is hot -- you should take a peek.
You ever see the movie Rain Man, where the challenged brother is being returned to care and the gent asks:
"where's your favorite K-mart clothes?"
"Tell 'em Ray"
"K-mart sucks!"
end of quote
Flash and Java sucks ;)
Don
[edited by: wilderness]
incrediBILL, totally agree on the poor value from niche search engines. Not our intent. Full scale real web search is so much more interesting.
I hadn't thought of crawl-delay as a mechanism to crawl even faster... hmmm, we'd have to take the min of the values across all hosts at that IP... But the thing is, if there are something like 100 pages to get from a host, and the goal is coverage in something like 7 days, it just doesn't need to hurry... it can fetch a page every hour and a half and still get the job done. Even 10,000 pages can still get there at a page a minute. Not that we're crawling that much yet. Only lazy bot writers hit every second when they don't have to.
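To put rough numbers on that pacing idea, something like the sketch below -- purely illustrative, not our actual scheduler; the function name and the 7-day window are assumptions for the example:

```python
# Rough sketch of the pacing arithmetic above -- illustrative only.
# Spread a host's pages across the coverage window, but never go
# faster than any robots.txt crawl-delay for that host.

def fetch_interval_seconds(pages, window_days=7, crawl_delay=0):
    """Seconds to wait between fetches so `pages` are covered in `window_days`."""
    if pages <= 0:
        return None  # nothing to fetch from this host
    spread = (window_days * 24 * 3600) / pages   # even spacing over the window
    return max(spread, crawl_delay)              # respect crawl-delay as a floor

print(fetch_interval_seconds(100))     # ~6048 s: a page every hour and a half
print(fetch_interval_seconds(10_000))  # ~60 s: about a page a minute
```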
wilderness's comment made me realize that we're still fetching robots.txt once a day from sites that have banned us... I will fix that to have a longer negative cache. In fact, for now I will make it a permanent negative cache, so that it never comes back. :)
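Something along these lines (a sketch of the idea only, not actual crawler code -- the class and names are made up for illustration):

```python
import time

# Sketch of a negative cache for robots.txt on hosts that have banned us:
# remember the ban and skip the daily re-fetch, either for a long TTL
# or (with ttl_seconds=None) permanently.

class RobotsNegativeCache:
    def __init__(self, ttl_seconds=None):
        self.ttl = ttl_seconds   # None means "permanent"
        self.banned = {}         # host -> time the ban was recorded

    def record_ban(self, host):
        self.banned[host] = time.time()

    def is_banned(self, host):
        ts = self.banned.get(host)
        if ts is None:
            return False
        if self.ttl is not None and time.time() - ts > self.ttl:
            del self.banned[host]   # entry expired; re-check robots.txt next time
            return False
        return True

cache = RobotsNegativeCache(ttl_seconds=None)   # permanent, as described above
cache.record_ban("example.com")
print(cache.is_banned("example.com"))           # True -> don't fetch robots.txt again
```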
Even at "crawl-delay: 2" I had a combination of Google, Yahoo and MSN cause an overload twice last year for unknown reasons, because all three crawl my site all day, every day. Something converged at the wrong time; I never did figure it out.
[edited by: incrediBILL at 10:03 pm (utc) on Mar. 17, 2008]
In fact, for now I will make it a permanent negative cache, so that it never comes back.
Hopefully, that was a jest -- or a repartee to those who don't let you crawl. The Web changes too fast for anything to be considered "permanent" -- Why just the other day, AltaVista was King, and there was news of some little search start-up called "Goggle" or something...
Some Webmasters only allow the "big three" or perhaps the "big four" -- based strictly on referral ROI. Others allow a selected whitelist to crawl, while still others allow everybody and his brother.
But many here want to see some indication of a return for the bandwidth and log space consumed by crawlers -- a referral or two would be nice. Lacking that, many take a "wait and see" approach. For that reason, a permanent "no crawl" flag would be counterproductive.
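For what it's worth, the whitelist approach usually boils down to something like this on the server side -- a rough sketch only; a real setup would verify crawlers (e.g. by reverse DNS) rather than trust the User-Agent header, which is trivially spoofed:

```python
# Rough sketch of a "big three" whitelist -- illustrative only.
# The tokens below were the common crawler UA strings of the day.

ALLOWED_BOTS = ("googlebot", "slurp", "msnbot")

def allow_crawler(user_agent: str) -> bool:
    """Return True if the request's User-Agent matches a whitelisted crawler."""
    ua = user_agent.lower()
    return any(bot in ua for bot in ALLOWED_BOTS)

print(allow_crawler("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # True
print(allow_crawler("SomeNewStartupBot/0.1"))                    # False -> block or 403
```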
Jim