
acquia


Pfui

7:31 pm on May 11, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Again from AWS, last seen as "Acquia Crawler" in Aug. 2009 here [webmasterworld.com] (AWS Bad Bots thread)

Now:

ec2-184-73-19-148.compute-1.amazonaws.com
acquia-crawler

robots.txt? NO

From the company's site: "Acquia Search helps your visitors find content on your site faster, so they stay on your site longer."

So their uninvited, rude (re robots.txt) bot is crawling my site why? Buh-bye.

tangor

5:02 am on Jun 29, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hit me today, too; however, read robots.txt and did a HEAD on / (3 times quickly) then went away.

Pfui

3:56 am on Jul 10, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yet another name change makes a better bot? Apparently not...

ec2-184-73-19-148.compute-1.amazonaws.com
acquia-crawler (detected bad behaviour? please tell us at it@acquia.com)

robots.txt? NO

So hey, here's my feedback about this bot's/company's bad conduct. Please:

- code your bot to read robots.txt; and
- code your bot to heed robots.txt; and
- code your bot with a standard UA string, including a proper info URL; and
- run your bot from your domain thus assuring reverse-IP confirmation.
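For that last point, "reverse-IP confirmation" means forward-confirmed reverse DNS: resolve the visiting IP to a hostname, check the hostname sits under the bot operator's own domain, then resolve that hostname forward again and make sure the original IP is among the results. A rough Python sketch (function and parameter names are mine, not any particular bot's code):

```python
import socket

def fcrdns_ok(ip, expected_suffix):
    """Forward-confirmed reverse DNS: IP -> hostname -> back to IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]  # reverse (PTR) lookup
        if not host.endswith(expected_suffix):
            return False  # hostname isn't under the operator's domain
        # forward lookup must give back the original IP
        return ip in socket.gethostbyname_ex(host)[2]
    except (socket.herror, socket.gaierror):
        return False
```

A generic EC2 hostname like ec2-184-73-19-148.compute-1.amazonaws.com can never pass this check for the operator's own domain, which is exactly the complaint.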

TIA!

rb2k

8:08 am on Jul 19, 2010 (gmt 0)

10+ Year Member



Hi,
My name is Marc and I'm the developer of the (new, post-2009) crawler mentioned here:
[webmasterworld.com...]

I'm sorry it apparently didn't fetch your robots.txt file first.
It actually SHOULD fetch and parse that file using [github.com...] (as mentioned by tangor).

In my current tests, a typical crawler session starts like this:

184.73.19.148 - - [19/Jul/2010:09:56:43 +0200] "GET /robots.txt HTTP/1.1" 200 73 "-" "acquia-crawler (detected bad behaviour? please tell us at it@acquia.com)"
184.73.19.148 - - [19/Jul/2010:09:56:43 +0200] "HEAD / HTTP/1.1" 302 0 "-" "acquia-crawler (detected bad behaviour? please tell us at it@acquia.com)"
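That fetch order (robots.txt first, then a HEAD on /) can be sketched like so; this is only an illustration using Python's stdlib, not the crawler's actual code:

```python
import urllib.request

UA = ("acquia-crawler (detected bad behaviour? "
      "please tell us at it@acquia.com)")

def session_requests(base):
    """Build the two requests of a polite session, in order:
    robots.txt first, then a HEAD on / to catch redirects
    before committing to a full GET."""
    robots = urllib.request.Request(base + "/robots.txt",
                                    headers={"User-Agent": UA})
    head = urllib.request.Request(base + "/", method="HEAD",
                                  headers={"User-Agent": UA})
    return [robots, head]
```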

Do you happen to know which URL the crawler was actually looking at?
This would make debugging the error a bit easier.

Thank you,
Marc


P.S. Your mailbox is full.

Pfui

3:54 pm on Jul 19, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@rb2k: Thanks for your reply.

1.) If you're referring to my mailbox being full, ty. Was unaware; will trim. Will also send a private Sticky Mail to you and your reply to same should get through.

Oh, also, on July 8, same AWS/IP (consistent for your crawls now?):

ec2-184-73-19-148.compute-1.amazonaws.com
acquia-crawler (detected bad behaviour? please tell us at it@acquia.com)
robots.txt? NO

2.) If your crawler retrieves robots.txt whereupon it meets with the standard Disallow, why would it proceed to fetch anything else?

3.) Similarly, if your bot is new to a site, wouldn't its first (other than robots.txt) request be a GET?

4.) It would be nice if your bot and its UA string followed standard, 'good bot' conduct and met the criteria mentioned about reading/heeding robots.txt, ditto:

- code your bot with a standard UA string, including a proper info URL; and
- run your bot from your domain thus assuring reverse-IP confirmation.

Doable?

5.) Last but not least... I'm confused about your bot's crawling all over.

Apparently your search product is a paid-for-monthly plug-in for Drupal sites. So I guess I don't see how a Drupal site wanting "[their] visitors find content on [their] site faster" benefits from your company crawling my sites. Neither do I see how I benefit.

Or is your plug-in's access to your company-crawled indices akin to people 'plugging-in' Google search on their sites? Or does the company have expansion/competition plans? Or--?

[edited by: Pfui at 4:36 pm (utc) on Jul 19, 2010]

rb2k

4:31 pm on Jul 19, 2010 (gmt 0)

10+ Year Member



hey, just a quick answer before the end of my workday :)

2. My guess would be that the robots.txt parser had a bug (this was one I fixed, e.g. [github.com...]) and that's why it kept going. That's why it would be nice to know which domains the crawler actually hit; it helps a lot with debugging.

3. Nah, I do a HEAD request first to check for redirects to other domains/URLs.

4. I wasn't aware that there was a standard for User Agents, will look into it, thanks for the feedback! I'll also see what I can do about the Info URL.

5. The bot is actually doing a "state of the web" kind of thing. It doesn't have anything to do with the usual Acquia offerings.
It analyzes which CMS a site is running (by looking at the frontpage) and generates pretty graphs about CMS distribution over continents etc. As soon as the data is statistically relevant, it will be made available to the general public (since Acquia is pretty much an open-source company). This also means that unless your site is in the Alexa Top x-thousand, you shouldn't be visited by the crawler more than once :)
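On point 2, the failure mode matters: a correctly parsed blanket Disallow must stop the crawl entirely, before any HEAD or GET. Sketched with Python's stdlib parser (a stand-in for whatever robots.txt library the crawler actually uses):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# the "standard Disallow" Pfui mentions: everything off-limits to everyone
rp.parse(["User-agent: *", "Disallow: /"])

# a compliant bot checks before every fetch, HEAD and GET alike
print(rp.can_fetch("acquia-crawler", "http://example.com/"))      # False
print(rp.can_fetch("acquia-crawler", "http://example.com/page"))  # False
```

If the parser ever answers True here, every later request is a robots.txt violation, which matches the behaviour reported upthread.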

Thanks for the feedback!
Marc

keyplyr

8:18 pm on Jul 19, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



This also means that unless your site is in the Alexa Top x-thousand, you shouldn't be visited by the crawler more than once

Great - since I block Alexa then I won't need to block Acquia.