Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
Twitter for Testing Bot Blocking
incrediBILL

WebmasterWorld Administrator, Top Contributor of All Time, 5+ Year Member, Top Contributor of the Month



 
Msg#: 4475928 posted 11:03 pm on Jul 15, 2012 (gmt 0)

Here's a simple Twitter tip for bot blockers.

Thanks to the swarm of link-chasing bots that use Twitter's API, you can test your bot blocker instantly with a single tweet. Just tweet a full URL on the site where you want to test your bot blocking; within seconds bots will start knocking on your door, and they will continue to trickle in for a while. Many seem to cache the results, so getting them to come to your site repeatedly in a short period of time requires directing them to a different page with each tweet.

Hope this little trick helps some people when they're testing new blocking filters, because I've found it invaluable to be able to summon bots on demand.

Not only that, they've exposed some new hosts I didn't have blocked, a bonus! :)
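
Since many of these bots appear to cache results, each test tweet needs a URL they have not seen before. Here is a minimal Python sketch of one way to produce such URLs. The function name `unique_test_url` and the `bt` query parameter are made up for illustration, and it assumes a distinct query string is enough to count as a "different page" for the bots in question; if it is not, vary the path instead.

```python
import secrets
from urllib.parse import urlencode, urlsplit, urlunsplit


def unique_test_url(base_url: str, param: str = "bt") -> tuple[str, str]:
    """Return (url, token), where url is base_url plus a random query token.

    The parameter name "bt" is an arbitrary, hypothetical choice.
    """
    token = secrets.token_hex(4)  # 8 random hex characters
    parts = urlsplit(base_url)
    extra = urlencode({param: token})
    # Preserve any query string that base_url already carries.
    query = f"{parts.query}&{extra}" if parts.query else extra
    url = urlunsplit(
        (parts.scheme, parts.netloc, parts.path or "/", query, parts.fragment)
    )
    return url, token
```

Tweet the returned URL and keep the token; searching the access log for that token afterwards isolates the hits triggered by that one tweet.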

 

incrediBILL

WebmasterWorld Administrator, Top Contributor of All Time, 5+ Year Member, Top Contributor of the Month



 
Msg#: 4475928 posted 10:31 pm on Aug 2, 2012 (gmt 0)

Not that anyone seemed to be interested in the best tip for testing your bot blocker on demand, but Twitter also claims that Twitterbot honors robots.txt!

[dev.twitter.com...]
URL crawling

Twitter's crawler will respect robots.txt when scanning URLs. If a page with card markup is blocked, no card will be shown. If an image URL is blocked, no thumbnail or photo will be shown.

Twitter uses the User-Agent of Twitterbot/1.0, which can be used to create an exception in your robots.txt file. For example, here is a robots.txt which disallows crawling for all robots except Twitter's fetcher:

User-agent: Twitterbot
Disallow:

User-agent: *
Disallow: /
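
To check what a well-behaved crawler should conclude from the rules quoted above, they can be fed to Python's standard `urllib.robotparser`. This is only a sketch: the helper name `can_fetch` and the example.com URL are illustrative, and it shows how Python's parser reads the file, not how Twitterbot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt quoted above, verbatim.
ROBOTS_TXT = """\
User-agent: Twitterbot
Disallow:

User-agent: *
Disallow: /
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Twitterbot is allowed; any other agent falls under the catch-all rule.
print(can_fetch(ROBOTS_TXT, "Twitterbot/1.0", "http://example.com/page"))
print(can_fetch(ROBOTS_TXT, "TweetmemeBot", "http://example.com/page"))
```

Keep in mind that robots.txt is advisory: bots that ignore it still have to be blocked at the server.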


Side note: I also use Twitter Bootstrap [twitter.github.com], which I highly recommend for building responsive sites in validated HTML5. It was easy to learn and I deployed it the same day.

keyplyr

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month



 
Msg#: 4475928 posted 12:08 am on Aug 3, 2012 (gmt 0)


Never needed to do any testing. Just identified the parasites and blocked as needed. Human traffic from Twitter varies from double-digit to occasional triple-digit daily uniques. Twitter and Facebook have grown into very nice traffic sources yielding an increasing ROI.

rowan194



 
Msg#: 4475928 posted 9:32 pm on Aug 28, 2012 (gmt 0)

Cheers for the tip - just tweeted a unique URL on a very quiet domain and noted there were a few IPs that hit the server immediately, all within the space of 5 seconds. (agents: TweetmemeBot, UnwindFetchor, Twitterbot, Butterfly, "JS-Kit URL Resolver")

I presume these bots subscribe to the real-time Twitter stream, with the follow-on stragglers periodically querying the public API to find new URLs to munch on.
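
For anyone who wants to pull the same kind of list out of their own logs, here is a rough Python sketch. It assumes the server writes the common Apache/Nginx "combined" log format; the function name `agents_for_path`, the sample lines, the IP addresses, and the `/quiet-page` path are all made up for illustration.

```python
import re

# Matches one line of the "combined" access log format (an assumption
# about the server's log configuration).
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def agents_for_path(lines, path_fragment):
    """Map each user agent that requested a matching path to its status codes."""
    seen = {}
    for line in lines:
        match = LOG_RE.match(line)
        if match and path_fragment in match.group("path"):
            status = int(match.group("status"))
            seen.setdefault(match.group("agent"), set()).add(status)
    return seen


# Hypothetical log lines, not real traffic.
SAMPLE = [
    '192.0.2.10 - - [28/Aug/2012:21:32:01 +0000] "GET /quiet-page HTTP/1.1" 403 199 "-" "Twitterbot/1.0"',
    '192.0.2.11 - - [28/Aug/2012:21:32:03 +0000] "GET /quiet-page HTTP/1.1" 200 5120 "-" "TweetmemeBot"',
    '192.0.2.12 - - [28/Aug/2012:21:40:00 +0000] "GET /other HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

for agent, statuses in agents_for_path(SAMPLE, "/quiet-page").items():
    print(agent, sorted(statuses))
```

Any agent that shows a 200 for the tweeted URL got past the blocker and is a candidate for a new rule.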

I've never used Twitter before, was just interested to see the bot activity.

keyplyr

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributor of the Month



 
Msg#: 4475928 posted 6:04 am on Aug 29, 2012 (gmt 0)

Also, the more followers you have, the more retweets you'll get. Each retweet extends your reach to that person's followers, and so on, which in turn generates more parasite bot hits.

With my 21k+ followers, when I post a link I'll immediately see over two dozen non-human UAs, and another dozen or so within the next 20 minutes, all blocked.

Every once in a while I see a new one.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved