Jaxified Crawler

keyplyr

6:25 pm on Oct 10, 2010 (gmt 0)

UA: Jaxified Crawler 1.0a (+http://www.jaxified.com/)
rDNS: input1.jaxified.com.
IP: 68.233.225.**
CIDR: 68.233.224.0/19
robots.txt: no

Twitter parasite

rowan194

12:02 am on Oct 12, 2010 (gmt 0)

Not sure if it's the same thing, but I'm seeing the UA of "Jaxified RSS Check Bot (+http://www.jaxified.com/)" also from input1.jaxified.com

I have noticed it fetches objects rapidly: several requests in the same second (up to 13!), and some of them duplicated (it will fetch exactly the same URI more than once in the same second)

It's also fishing: it will load a URL which has no trailing '/', then the next request (in the same second, of course) is for the same URL but with a '/' added.

Website doesn't really say what they do. I've blocked it.

caribguy

12:20 am on Oct 12, 2010 (gmt 0)

From their FB page:

We're working on building a new way to find and engage in conversations about your site.

That's all just fine and dandy, but talk is cheap: I'm more interested in generating conversions on my site. - blocked.

terryjsmith

8:06 pm on Oct 17, 2010 (gmt 0)

Hi guys,

My name is Terry and I'm the founder/creator of the bot (and the company). What we're doing is tracking conversations around links on Twitter; so we follow RSS feeds for blogs and then try to find any conversations about those posts by following links on Twitter, Facebook, and other blogs.

If you guys could provide me with some access logs (here or e-mail in profile), I'd love to see which URLs it was hitting. This is my first serious bot (written in C/C++; I tried a bunch of more standard tools from Apache and the like but am not proficient enough in Java to make the changes I needed) so it's still a little rough around the edges.

I apologize for hitting your servers so hard. As for the fishing, I've been working to normalize the URLs it looks for regarding the trailing slash so it's likely there was some overlap in fetching pages.
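One way to avoid the slash/no-slash double fetch described above is to reduce each URL to a canonical key before queuing it, and fetch only the first URL seen for each key. The following is a minimal sketch in Python rather than the bot's C/C++; `normalize_url` and `dedupe` are hypothetical names, and treating `/post` and `/post/` as the same resource is an assumption that will not hold on every server.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Return a canonical key used only for de-duplicating fetches.

    Assumption (not from the thread): a path with and without a trailing
    slash refers to the same page, so only one of the two is fetched.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.netloc.lower()
    path = parts.path or "/"
    # Collapse "/post/" to "/post", but keep the root path "/" intact.
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/") or "/"
    # The fragment is never sent to the server, so drop it from the key.
    return urlunsplit((scheme, host, path, parts.query, ""))


def dedupe(urls):
    """Keep the first URL seen for each canonical key, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

The original URL, not the key, is what gets requested; if the server prefers the other form it can answer with a redirect, which costs one extra request instead of a blind second fetch every time.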

What is the normal rate at which a crawler accesses your site? I originally had it set to 5 requests/second max, but I usually only have 5 or so URLs per blog to check so it seemed useless; I can add it back in with a little more direction.
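There is no single agreed rate, but a per-host minimum delay between requests is a common way to keep bursts like the ones reported above from happening. The sketch below is a rough illustration in Python, not the bot's actual code; `HostThrottle` is a hypothetical name, and the one-second default is an assumed conservative value, not a figure from this thread.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        # min_delay is an assumed default; tune it per site if needed.
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time of the next request

    def wait(self, url):
        """Block until the host of `url` may be hit again; return seconds slept."""
        host = urlsplit(url).netloc.lower()
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule the next slot relative to when this request is released.
        self._next_allowed[host] = max(now, ready_at) + self.min_delay
        return delay
```

Calling `throttle.wait(url)` immediately before each fetch spaces out requests to one site while leaving requests to different hosts unaffected, so a handful of URLs per blog still finishes within a few seconds.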

Sorry about any confusion; I haven't had time to properly document the crawler with such heavy development. Anyways, any access logs I can compare to my own logs would be hugely appreciated.

Thanks!

T

tangor

8:12 pm on Oct 17, 2010 (gmt 0)

Welcome to WebmasterWorld, terryjsmith!

Bots should ALWAYS respect robots.txt. Many (myself included) whitelist (allow only some bots and disallow all others). Any bot that fails that simple request is generally frowned upon... and nuked from accessing the site altogether.

terryjsmith

8:13 pm on Oct 17, 2010 (gmt 0)

I should note that the RSS Check Bot, RSS Bot and Crawler are all different bots with different purposes (they should never hit the same pages or overlap). I thought it would be best to differentiate for debugging purposes, but I could consolidate them if that's a more standard practice.

RSS Check Bot - Checks your site for an RSS feed we can follow.
RSS Bot - Periodically checks your RSS feed for new posts.
Crawler - Crawls the new posts found in the feed.

I defer to your guidance.

Thanks again,

T

terryjsmith

8:16 pm on Oct 17, 2010 (gmt 0)

Thanks tangor,

I have been working very hard to get robots.txt working properly, caching the results (and its existence), making sure I don't hit that file repeatedly, etc. Since it's all written in custom C/C++, text parsing is a bitch; but I assure you that it is coming soon (I have not found a decent robots.txt parsing library).

terryjsmith

12:54 am on Oct 18, 2010 (gmt 0)

I have consolidated the bot user-agents (all now "Jaxified Bot") and added robots.txt checking to all except one bot which follows links from Twitter using HEAD requests and never downloads any page content; I am using libcurl to get pages and it simply follows any 301 and 302 redirects itself.

You can now add User-agent: Jaxified to your robots.txt file to disallow our bot specifically, though it will also follow rules for the wildcard user-agent.
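To see how such a rule is evaluated, here is a small sketch using Python's standard `urllib.robotparser` rather than the bot's own C/C++ parser. The robots.txt content, the `example.com` URLs and the `OtherBot` name are made up for illustration, and Python's matching rules may differ in detail from what the Jaxified bot implements.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the Jaxified bot entirely and keep
# every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: Jaxified
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded (no network)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "Jaxified Bot" matches the "Jaxified" group, which disallows everything.
jaxified_allowed = parser.can_fetch("Jaxified Bot", "http://example.com/post")

# A crawler with no group of its own falls back to the wildcard group.
other_allowed = parser.can_fetch("OtherBot", "http://example.com/post")
other_private = parser.can_fetch("OtherBot", "http://example.com/private/x")
```

In this parser a crawler that matches a named group is judged by that group alone, and the wildcard group applies only to crawlers without a match, so a site that wants both sets of rules applied to one bot should repeat them under the named group.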

I have taken the RSS Bot and Crawler daemons offline until I can gather more information on the duplicate requests. I appreciate any follow ups.

Thanks again for your help and feedback.

tangor

2:11 am on Oct 18, 2010 (gmt 0)

Most appreciated is swift response! Best luck on the project. I will be watching to see if there's benefit... and adding to the white list. :)