@rb2k: Thanks for your reply.
1.) If you're referring to my mailbox being full, ty. Was unaware; will trim. Will also send a private Sticky Mail to you and your reply to same should get through.
Oh, also, on July 8, same AWS/IP (consistent for your crawls now?):
ec2-184-73-19-148.compute-1.amazonaws.com
acquia-crawler (detected bad behaviour? please tell us at it@acquia.com)
robots.txt? NO
2.) If your crawler retrieves robots.txt whereupon it meets with the standard Disallow, why would it proceed to fetch anything else?
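To illustrate the behavior point 2 asks about, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not a claim about how acquia-crawler is written. The `may_fetch` helper, the `BLANKET_DISALLOW` constant, and the example.com URL are hypothetical names chosen for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body: the standard "keep out" for every bot.
BLANKET_DISALLOW = "User-agent: *\nDisallow: /\n"


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True only if robots_txt permits user_agent to request url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def plan_crawl(robots_txt: str, user_agent: str, urls: list) -> list:
    """Drop every URL that robots.txt disallows before any request is sent."""
    return [u for u in urls if may_fetch(robots_txt, user_agent, u)]
```

Under a blanket Disallow, `plan_crawl` returns an empty list, so a compliant bot would have nothing left to request after robots.txt.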
3.) Similarly, if your bot is new to a site, wouldn't its first (other than robots.txt) request be a GET?
4.) It would be nice if your bot and its UA string followed standard 'good bot' conduct: meet the criteria already mentioned about reading and heeding robots.txt, and likewise:
- code your bot with a standard UA string, including a proper info URL; and
- run your bot from your domain thus assuring reverse-IP confirmation.
Doable?
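For reference, the reverse-IP confirmation in point 4 is usually done as a forward-confirmed reverse DNS check. Below is a rough sketch with Python's standard `socket` module. The function names are hypothetical, the resolvers are passed in so the logic can be tried without a network, and the IP 184.73.19.148 is only my reading of the EC2 hostname in the log entry above.

```python
import socket
from typing import Callable, List


def reverse_lookup(ip: str) -> str:
    """PTR lookup: IP address -> primary host name."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host: str) -> List[str]:
    """A lookup: host name -> list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def confirms_operator(
    ip: str,
    operator_domain: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """True if ip reverse-resolves into operator_domain and resolves back."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False
    domain = operator_domain.lower()
    # The host must be the domain itself or a subdomain of it.
    if host != domain and not host.endswith("." + domain):
        return False
    try:
        # Forward confirmation: the name must map back to the same IP.
        return ip in forward(host)
    except OSError:
        return False
```

A bot run from an EC2 address reverse-resolves to amazonaws.com, so a check against the operator's own domain fails. That is why I'm asking for the bot to run from your domain.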
5.) Last but not least... I'm confused about your bot's crawling all over.
Apparently your search product is a paid-for-monthly plug-in for Drupal sites. So I guess I don't see how a Drupal site wanting "[their] visitors find content on [their] site faster" benefits from your company crawling my sites. Neither do I see how I benefit.
Or is your plug-in's access to your company-crawled indices akin to people 'plugging-in' Google search on their sites? Or does the company have expansion/competition plans? Or--?
[edited by: Pfui at 4:36 pm (utc) on Jul 19, 2010]