Forum Moderators: Ocean10000 & incrediBILL

bixolabs bot broken

1433 loads of the same URL over 21 hours

   
11:07 am on Aug 14, 2010 (gmt 0)



Mozilla/5.0 (compatible; bixolabs/1.0; +http://bixolabs.com/crawler/general; crawler@bixolabs.com)

It fetches from Amazon AWS over, and over, and over. Today it has fetched the same URL more than once per minute on average.
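For anyone wanting to check their own logs for the same pattern, here is a minimal sketch that counts fetches per URL for one user agent. It assumes the server writes the common "combined" access log format; the regular expression, the function name `count_fetches`, and the sample log lines are illustrative assumptions, not anything taken from the affected server.

```python
import re
from collections import Counter

# Assumed layout (combined log format):
# ip ident user [time] "METHOD path PROTO" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] '
    r'"\S+ (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def count_fetches(lines, agent_substring):
    """Count requests per path for user agents containing agent_substring."""
    needle = agent_substring.lower()
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and needle in match.group("agent").lower():
            counts[match.group("path")] += 1
    return counts

BIXO_UA = ("Mozilla/5.0 (compatible; bixolabs/1.0; "
           "+http://bixolabs.com/crawler/general; crawler@bixolabs.com)")

# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    f'203.0.113.5 - - [14/Aug/2010:11:00:01 +0000] "GET /page.html HTTP/1.1" 200 512 "-" "{BIXO_UA}"',
    f'203.0.113.5 - - [14/Aug/2010:11:00:52 +0000] "GET /page.html HTTP/1.1" 200 512 "-" "{BIXO_UA}"',
    '198.51.100.7 - - [14/Aug/2010:11:01:10 +0000] "GET /other.html HTTP/1.1" 200 900 "-" "Mozilla/5.0 (Windows NT 6.1)"',
]

if __name__ == "__main__":
    # List the most-fetched paths for the bot first.
    for path, hits in count_fetches(SAMPLE_LOG, "bixolabs").most_common(5):
        print(hits, path)
```

Feeding it the lines of a real access log in place of `SAMPLE_LOG` should show quickly whether one URL dominates the bot's requests.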
4:30 pm on Aug 14, 2010 (gmt 0)




Hi Rowan,

Just saw your note - which URL is it hitting? I'll monitor this thread for any reply. Or you can email me (ken@bixolabs.com) or call (530-210-6378), whichever you prefer.

And I'll get in touch with the ops guy running that crawl to kill it.

Thanks for reporting the problem, we'll get it resolved ASAP.

-- Ken
6:22 pm on Aug 14, 2010 (gmt 0)

wilderness (WebmasterWorld Senior Member)



IMO, "crawler", "spider", and a dozen or so other long-abused terms that show up in user-agent strings should be blacklisted by nearly every webmaster.
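As a rough sketch of what that kind of blacklist looks like in code: a case-insensitive substring check against the user-agent string. Only "crawler" and "spider" come from this post; the names `BLOCKED_TERMS` and `is_blocked` are hypothetical, and the rest of the dozen or so terms is left for each webmaster to fill in.

```python
# Terms named in the post; extend the tuple with your own list.
BLOCKED_TERMS = ("crawler", "spider")

def is_blocked(user_agent, terms=BLOCKED_TERMS):
    """Return True if any blocked term appears anywhere in the user agent."""
    ua = user_agent.lower()
    return any(term in ua for term in terms)

if __name__ == "__main__":
    bixo = ("Mozilla/5.0 (compatible; bixolabs/1.0; "
            "+http://bixolabs.com/crawler/general; crawler@bixolabs.com)")
    print(is_blocked(bixo))
    print(is_blocked("Mozilla/5.0 (Windows NT 6.1) Firefox/3.6"))
```

Note that the bixolabs string matches only because "crawler" appears in its contact URL and address, so a substring test like this casts a wide net.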
8:04 pm on Aug 14, 2010 (gmt 0)

dstiles (WebmasterWorld Senior Member)



It's banned here:

a) it's a data miner;

b) it runs from Amazon Cloud.

'Nuff said.
4:16 am on Aug 15, 2010 (gmt 0)



Ken, I've emailed you.

FYI, emailing crawler@bixolabs.com doesn't work - I get a warning saying that the connection is refused by lb.wordpress.com.

edit: brain fart, obviously emailing ken@<samedomain> isn't going to work either...

edit 2: checked my logs and there is no record of the bixolabs crawler ever fetching robots.txt. The repeated fetching of the same URL ended abruptly around 5 hours ago; at that point it was requesting the [same] URL every few seconds.
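The robots.txt check can be done with something like the sketch below. It again assumes a combined-format access log; the pattern, the function name `fetched_robots_txt`, and the sample lines are my own illustrative choices.

```python
import re

# Assumed fragment of a combined-format line:
# "METHOD path PROTO" status size "referer" "user-agent"
REQUEST_AND_AGENT = re.compile(
    r'"\S+ (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def fetched_robots_txt(lines, agent_substring):
    """Return True if a matching user agent ever requested /robots.txt."""
    needle = agent_substring.lower()
    for line in lines:
        match = REQUEST_AND_AGENT.search(line)
        if not match or needle not in match.group("agent").lower():
            continue
        # Ignore any query string when comparing the path.
        if match.group("path").split("?", 1)[0] == "/robots.txt":
            return True
    return False

# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [14/Aug/2010:11:00:01 +0000] "GET /page.html HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; bixolabs/1.0)"',
    '198.51.100.9 - - [14/Aug/2010:11:00:05 +0000] "GET /robots.txt HTTP/1.1" 200 64 "-" "ExampleBot/1.0"',
]

if __name__ == "__main__":
    print(fetched_robots_txt(SAMPLE_LOG, "bixolabs"))
    print(fetched_robots_txt(SAMPLE_LOG, "examplebot"))
```

A negative result only covers the log files that were scanned, so rotated or archived logs from earlier in the crawl would need checking too.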
5:04 am on Aug 15, 2010 (gmt 0)




Hi Rowan,

1. Sorry about the problem with email - it's an issue with some email systems doing an MX lookup to resolve the destination domain. Please try ken@mail.bixolabs.com. And I'll update the user agent string to specify the full domain, to avoid this in the future.

2. I'm fairly certain robots.txt would have been fetched, but it would have been very early in the (15 hour) crawl cycle. What's the domain, and the URL being repeatedly fetched? Does it get decorated by different query parameters?
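One way to answer the query-parameter question from the server side is to group the requested URLs by path with the query string stripped, so differently decorated requests for the same page collapse into one entry. This is only a sketch: `group_by_path` and the sample URLs are hypothetical, and whether this crawl actually varied its parameters is not known from the thread.

```python
from collections import Counter
from urllib.parse import urlsplit

def group_by_path(urls):
    """Count requests per path, ignoring any query string or fragment."""
    return Counter(urlsplit(url).path for url in urls)

if __name__ == "__main__":
    # Made-up request targets, for illustration only.
    requested = [
        "/page.html?sid=1",
        "/page.html?sid=2",
        "/page.html",
        "/other.html",
    ]
    for path, hits in group_by_path(requested).most_common():
        print(hits, path)
```

If the grouped count for a path is much higher than the count for any single full URL, the crawler is probably treating each parameter variant as a new page.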

Thanks,

-- Ken
 
