Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

bixolabs bot broken
1433 loads of the same URL over 21 hours

 11:07 am on Aug 14, 2010 (gmt 0)

Mozilla/5.0 (compatible; bixolabs/1.0; +http://bixolabs.com/crawler/general; crawler@bixolabs.com)

It fetches from Amazon AWS over, and over, and over, and over. Today it has fetched the same URL more than once per minute on average.
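For anyone wanting to check their own logs for the same pattern, here is a minimal Python sketch. It assumes the access log is in Apache combined format, and the helper names `count_fetches` and `fetches_per_minute` are made up for illustration; adjust the regex if your log format differs.

```python
import re
from collections import Counter

# Assumed format (Apache combined):
# ip ident user [time] "METHOD url proto" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?:\S+) (\S+) [^"]*" \d{3} \S+ "[^"]*" "([^"]*)"'
)

def count_fetches(lines, agent_substring):
    """Count requests per URL for user agents containing agent_substring."""
    needle = agent_substring.lower()
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and needle in match.group(2).lower():
            counts[match.group(1)] += 1
    return counts

def fetches_per_minute(total_fetches, hours):
    """Average request rate over the given number of hours."""
    return total_fetches / (hours * 60)
```

With the numbers from this thread, `fetches_per_minute(1433, 21)` comes to roughly 1.14, which matches the "more than once per minute" figure.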



 4:30 pm on Aug 14, 2010 (gmt 0)

Hi Rowan,

Just saw your note - which URL is it hitting? I'll monitor this thread for any reply. Or you can email me (ken@bixolabs.com) or call (530-210-6378), whichever you prefer.

And I'll get in touch with the ops guy running that crawl to kill it.

Thanks for reporting the problem, we'll get it resolved ASAP.

-- Ken


 6:22 pm on Aug 14, 2010 (gmt 0)

Crawler, spider, and a dozen or so other long-abused terms are words in user agents that most every webmaster should have blacklisted, IMO.
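A rough sketch of that kind of check in Python, for illustration only. The term list here is an assumption: only "crawler" and "spider" are named above, and the rest of the "dozen or so" are not specified. The function name `is_blocked` is hypothetical.

```python
# Assumed term list; only these two terms are named in the post above.
BLOCKED_TERMS = ("crawler", "spider")

def is_blocked(user_agent, terms=BLOCKED_TERMS):
    """Return True if the user-agent string contains any blocked term."""
    ua = user_agent.lower()
    return any(term in ua for term in terms)
```

Note that a plain substring match like this is blunt: it would also reject any bot you do want that happens to use one of these words in its user agent.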


 8:04 pm on Aug 14, 2010 (gmt 0)

It's banned here:

a) it's a data miner;

b) it runs from Amazon Cloud.

'Nuff said.


 4:16 am on Aug 15, 2010 (gmt 0)

Ken, I've emailed you.

FYI, emailing crawler@bixolabs.com doesn't work - I get a warning saying that connection is refused by lb.wordpress.com.

edit: brain fart, obviously emailing ken@<samedomain> isn't going to work either...

edit 2: checked my logs and there is no record of the bixolabs crawler ever fetching robots.txt. The repeated fetching of the same URL ended abruptly around 5 hours ago; at that point it was requesting the [same] URL every few seconds.
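For reference, the robots.txt check can be sketched in a few lines of Python. This assumes a log format where the request appears as a quoted `"GET /path HTTP/1.x"` string and the user agent is somewhere on the same line; `fetched_robots_txt` is a made-up helper name.

```python
import re

# Matches the quoted request part of a log line, e.g. "GET /robots.txt HTTP/1.1"
REQUEST = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*"')

def fetched_robots_txt(lines, agent_substring):
    """Return True if any line for the given agent requests /robots.txt."""
    needle = agent_substring.lower()
    for line in lines:
        if needle not in line.lower():
            continue
        match = REQUEST.search(line)
        # Ignore any query string when comparing the path.
        if match and match.group(1).split("?", 1)[0] == "/robots.txt":
            return True
    return False
```

This only covers whatever log window you feed it, so a fetch from before the oldest retained log line would not show up.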


 5:04 am on Aug 15, 2010 (gmt 0)

Hi Rowan,

1. Sorry about the problem with email - it's an issue with some email systems doing an MX lookup to resolve the destination domain. Please try ken@mail.bixolabs.com. And I'll update the user agent string to specify the full domain, to avoid this in the future.

2. I'm fairly certain robots.txt would have been fetched, but it would have been very early in the (15 hour) crawl cycle. What's the domain, and the URL being repeatedly fetched? Does it get decorated by different query parameters?


-- Ken
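On the query parameter question: one way to tell whether the "same URL" is really one path requested with varying query strings is to group logged URLs by path. A minimal Python sketch, assuming you have already extracted the requested URLs from the log; `group_by_path` is a hypothetical helper name.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def group_by_path(urls):
    """Map each URL path to the set of distinct query strings seen for it."""
    groups = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        groups[parts.path].add(parts.query)
    return groups
```

A path with many distinct query strings suggests the crawler is treating decorated variants as separate pages; a single entry means it really is refetching the identical URL.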

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved