bixolabs bot broken

1433 loads of the same URL over 21 hours

rowan194

11:07 am on Aug 14, 2010 (gmt 0)

Mozilla/5.0 (compatible; bixolabs/1.0; +http://bixolabs.com/crawler/general; crawler@bixolabs.com)

Fetches from Amazon AWS... over, and over, and over, and over, ... it's fetched the same URL an average of more than once per minute today.
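For anyone wanting to confirm this kind of behaviour in their own logs, here is a minimal sketch that counts requests per URL for one user agent. It assumes the server writes the common "combined" access log format; the sample lines, IP address and the `count_fetches` helper are made up for illustration and are not taken from rowan194's actual logs.

```python
import re
from collections import Counter

# Assumed combined log format:
# ip - - [time] "METHOD url PROTO" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "\S+ (?P<url>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def count_fetches(lines, agent_token):
    """Count requests per URL made by user agents containing agent_token."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and agent_token.lower() in match.group("agent").lower():
            counts[match.group("url")] += 1
    return counts

# Hypothetical sample data, not real log entries.
sample_lines = [
    '203.0.113.5 - - [14/Aug/2010:10:00:00 +0000] "GET /page HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; bixolabs/1.0)"',
    '203.0.113.5 - - [14/Aug/2010:10:00:50 +0000] "GET /page HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; bixolabs/1.0)"',
    '198.51.100.7 - - [14/Aug/2010:10:01:00 +0000] "GET /other HTTP/1.1" 200 99 "-" "SomeBrowser/1.0"',
]

print(count_fetches(sample_lines, "bixolabs").most_common(1))  # [('/page', 2)]

# The figure in the thread subtitle works out to roughly this many loads per minute:
print(round(1433 / (21 * 60), 2))  # 1.14
```

Passing an open log file instead of `sample_lines` should work the same way, since the function only iterates over lines.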

kkrugler

4:30 pm on Aug 14, 2010 (gmt 0)

Hi Rowan,

Just saw your note - which URL is it hitting? I'll monitor this thread for any reply. Or you can email me (ken@bixolabs.com) or call (530-210-6378), whichever you prefer.

And I'll get in touch with the ops guy running that crawl to kill it.

Thanks for reporting the problem, we'll get it resolved ASAP.

-- Ken

wilderness

6:22 pm on Aug 14, 2010 (gmt 0)

"crawler", "spider" and a dozen or so other long-abused terms are words in user-agent strings that most every webmaster should have blacklisted, IMO.
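As a rough illustration of that kind of blacklist, the sketch below does a case-insensitive substring match on the user-agent string. The `BLOCKED_TERMS` list and the `is_blocked` helper are hypothetical; only "crawler" and "spider" come from the post, and how the result is wired into a particular server is left out.

```python
# Hypothetical deny list; the two terms are the examples named in the post.
BLOCKED_TERMS = ("crawler", "spider")

def is_blocked(user_agent, blocked_terms=BLOCKED_TERMS):
    """Return True if the user-agent string contains any blocked term."""
    agent = user_agent.lower()
    return any(term in agent for term in blocked_terms)

bixolabs_agent = (
    "Mozilla/5.0 (compatible; bixolabs/1.0; "
    "+http://bixolabs.com/crawler/general; crawler@bixolabs.com)"
)
print(is_blocked(bixolabs_agent))  # True
```

Note that a plain substring match also catches agents that only mention a term in their contact URL or email address, which is how the bixolabs string matches here.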

dstiles

8:04 pm on Aug 14, 2010 (gmt 0)

It's banned here:

a) it's a data miner;

b) it runs from Amazon Cloud.

'Nuff said.

rowan194

4:16 am on Aug 15, 2010 (gmt 0)

Ken, I've emailed you.

FYI, emailing crawler@bixolabs.com doesn't work - I get a warning saying that connection is refused by lb.wordpress.com.

edit: brain fart, obviously emailing ken@<samedomain> isn't going to work either...

edit 2: checked my logs and there is no record of the bixolabs crawler ever fetching robots.txt. The repeated fetching of the same URL ended abruptly around 5 hours ago; at that point it was requesting the [same] URL every few seconds.
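Both checks described in that second edit (was robots.txt ever requested, and how far apart were the repeated fetches) can be sketched roughly as follows. This again assumes combined-format access logs with English month abbreviations in the timestamps; the helper names and sample lines are invented for illustration.

```python
import re
from datetime import datetime

# Assumed combined log format, capturing timestamp, URL and user agent.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "\S+ (?P<url>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def agent_requests(lines, agent_token):
    """Return (timestamp, url) pairs for requests whose agent contains agent_token."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match and agent_token.lower() in match.group("agent").lower():
            when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
            hits.append((when, match.group("url")))
    return hits

def fetched_robots(hits):
    """True if any request was for /robots.txt (ignoring a query string)."""
    return any(url.split("?", 1)[0] == "/robots.txt" for _, url in hits)

def gaps_in_seconds(hits, url):
    """Seconds between consecutive requests for one URL, in time order."""
    times = sorted(when for when, u in hits if u == url)
    return [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]

# Hypothetical sample data, not real log entries.
sample_lines = [
    '203.0.113.5 - - [14/Aug/2010:23:00:00 +0000] "GET /page HTTP/1.1" 200 512 "-" "bixolabs/1.0"',
    '203.0.113.5 - - [14/Aug/2010:23:00:03 +0000] "GET /page HTTP/1.1" 200 512 "-" "bixolabs/1.0"',
    '203.0.113.5 - - [14/Aug/2010:23:00:07 +0000] "GET /page HTTP/1.1" 200 512 "-" "bixolabs/1.0"',
]

hits = agent_requests(sample_lines, "bixolabs")
print(fetched_robots(hits))            # False
print(gaps_in_seconds(hits, "/page"))  # [3.0, 4.0]
```

A `False` from `fetched_robots` only says the request is absent from the lines that were scanned; rotated or truncated logs would need to be included for a firm conclusion.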

kkrugler

5:04 am on Aug 15, 2010 (gmt 0)

Hi Rowan,

1. Sorry about the problem with email - it's an issue with some email systems doing an MX lookup to resolve the destination domain. Please try ken@mail.bixolabs.com. And I'll update the user agent string to specify the full domain, to avoid this in the future.

2. I'm fairly certain robots.txt would have been fetched, but it would have been very early in the (15-hour) crawl cycle. What's the domain, and what's the URL being repeatedly fetched? Does it get decorated with different query parameters?

Thanks,

-- Ken
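Ken's question about query parameters can be checked from the requested URLs alone. A minimal sketch, assuming the URLs have already been pulled out of the access log: it groups them by path and collects the distinct query strings seen for each. The `group_by_path` helper and the sample URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def group_by_path(urls):
    """Map each URL path to the set of distinct query strings seen with it."""
    variants = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        variants[parts.path].add(parts.query)
    return dict(variants)

# Hypothetical request URLs, not taken from the real logs.
sample_urls = ["/page?sid=1", "/page?sid=2", "/page?sid=2", "/other"]

for path, queries in sorted(group_by_path(sample_urls).items()):
    print(path, len(queries))
# /other 1
# /page 2
```

A path with many distinct query strings would suggest the crawler is treating decorated variants as separate pages, while a single query string repeated many times points at a genuine refetch loop.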