bixolabs bot broken

1433 loads of the same URL over 21 hours

     
11:07 am on Aug 14, 2010 (gmt 0)

New User

5+ Year Member

joined:June 30, 2010
posts:36
votes: 0


Mozilla/5.0 (compatible; bixolabs/1.0; +http://bixolabs.com/crawler/general; crawler@bixolabs.com)

Fetches from Amazon AWS... over, and over, and over, and over, ... it's fetched the same URL an average of more than once per minute today.

4:30 pm on Aug 14, 2010 (gmt 0)

New User

5+ Year Member

joined:Dec 9, 2009
posts: 4
votes: 0


Hi Rowan,

Just saw your note - which URL is it hitting? I'll monitor this thread for any reply. Or you can email me (ken@bixolabs.com) or call (530-210-6378), whichever you prefer.

And I'll get in touch with the ops guy running that crawl to kill it.

Thanks for reporting the problem, we'll get it resolved ASAP.

-- Ken

6:22 pm on Aug 14, 2010 (gmt 0)

Senior Member

wilderness

joined:Nov 11, 2001
posts:5459
votes: 3


crawler, spider and a dozen or so other long-abused terms are words in user-agents that most every webmaster should have black-listed, IMO.
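
For readers who want to try the keyword approach described above, here is a minimal Python sketch. The term list is an assumption: the post names only "crawler" and "spider", and the "dozen or so" other terms are not given, so `BLOCKED_TERMS` and `is_blocked` are hypothetical names for illustration.

```python
import re

# Hypothetical keyword list; only "crawler" and "spider" come from the post.
BLOCKED_TERMS = ("crawler", "spider")

# One case-insensitive pattern that matches any listed term as a substring.
BLOCKED_RE = re.compile(
    "|".join(re.escape(term) for term in BLOCKED_TERMS),
    re.IGNORECASE,
)

def is_blocked(user_agent):
    """Return True if the User-Agent header contains a blacklisted term."""
    return bool(BLOCKED_RE.search(user_agent or ""))

bixolabs_ua = (
    "Mozilla/5.0 (compatible; bixolabs/1.0; "
    "+http://bixolabs.com/crawler/general; crawler@bixolabs.com)"
)
print(is_blocked(bixolabs_ua))
print(is_blocked("Mozilla/5.0 (Windows NT 6.1) Firefox/3.6"))
```

Substring matching is deliberately blunt: it also rejects any well-behaved bot that identifies itself with one of these words, which is the trade-off this kind of blacklist accepts.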

8:04 pm on Aug 14, 2010 (gmt 0)

Senior Member from GB 

dstiles

joined:May 14, 2008
posts:3125
votes: 4


It's banned here:

a) it's a data miner;

b) it runs from Amazon Cloud.

'Nuff said.

4:16 am on Aug 15, 2010 (gmt 0)

New User

5+ Year Member

joined:June 30, 2010
posts:36
votes: 0


Ken, I've emailed you.

FYI, emailing crawler@bixolabs.com doesn't work - I get a warning saying that connection is refused by lb.wordpress.com.

edit: brain fart, obviously emailing ken@<samedomain> isn't going to work either...

edit 2: checked my logs and there is no record of the bixolabs crawler ever fetching robots.txt. The repeated fetching of the same URL ended abruptly around 5 hours ago; at that point it was requesting the [same] URL every few seconds.
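
The log check described above can be sketched roughly as follows. This assumes the server writes the Apache/Nginx "combined" log format, which the thread does not state; the IP addresses, paths, and the `count_fetches` helper are made up for illustration.

```python
import re
from collections import Counter

# "Combined" log format (an assumption; adjust the pattern for your server).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def count_fetches(lines, agent_token):
    """Count requests per path from user agents containing agent_token."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and agent_token.lower() in m.group("agent").lower():
            counts[m.group("path")] += 1
    return counts

BIXO_UA = (
    "Mozilla/5.0 (compatible; bixolabs/1.0; "
    "+http://bixolabs.com/crawler/general; crawler@bixolabs.com)"
)

# Invented sample lines; a real run would iterate over the access log file.
sample = [
    f'203.0.113.5 - - [14/Aug/2010:10:00:01 +0000] "GET /page.html HTTP/1.1" 200 512 "-" "{BIXO_UA}"',
    f'203.0.113.5 - - [14/Aug/2010:10:00:45 +0000] "GET /page.html HTTP/1.1" 200 512 "-" "{BIXO_UA}"',
    '198.51.100.7 - - [14/Aug/2010:10:01:00 +0000] "GET /robots.txt HTTP/1.1" 200 64 "-" "ExampleBot/1.0"',
]

hits = count_fetches(sample, "bixolabs")
print(hits.most_common(5))            # most frequently fetched paths
print("/robots.txt" in hits)          # did this agent ever request robots.txt?
```

The same counter answers both questions raised here: which URL is being hammered, and whether the bot ever asked for robots.txt.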

5:04 am on Aug 15, 2010 (gmt 0)

New User

5+ Year Member

joined:Dec 9, 2009
posts: 4
votes: 0


Hi Rowan,

1. Sorry about the problem with email - it's an issue with some email systems doing an MX lookup to resolve the destination domain. Please try ken@mail.bixolabs.com. And I'll update the user agent string to specify the full domain, to avoid this in the future.

2. I'm fairly certain robots.txt would have been fetched, but it would have been very early in the (15 hour) crawl cycle. What's the domain, and the URL being repeatedly fetched? Does it get decorated by different query parameters?

Thanks,

-- Ken
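
One way to answer the query-parameter question from the log side is to group the requested URLs by path and look at how many distinct query strings each path received. A minimal Python sketch, with invented request URLs and a hypothetical `group_by_path` helper:

```python
from collections import defaultdict
from urllib.parse import urlsplit

def group_by_path(urls):
    """Map each path to the set of distinct query strings seen for it."""
    groups = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        groups[parts.path].add(parts.query)  # "" means no query string
    return dict(groups)

# Invented request URLs, as they would appear in an access log.
requests = [
    "/page.html?sid=1",
    "/page.html?sid=2",
    "/page.html",
    "/other.html",
]

for path, queries in group_by_path(requests).items():
    print(path, len(queries), sorted(queries))
```

A path with many distinct query strings suggests the crawler treats each decorated variant as a new URL; a single entry means it really is refetching the identical URL.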