
Search Engine Spider and User Agent Identification Forum

    
Are there ANY legitimate visits from AWS?
1script




msg:4550487
 5:34 pm on Mar 2, 2013 (gmt 0)

I know there's a great thread somewhere here (though I can't find it in a hurry) with all AWS IP ranges listed just so people can ax accesses from AWS at firewall level.

I wanted to start doing just that, but noticed access from a crawler called "kraken-crawler/0.2.0; +http://www.kontera.com/kraken-crawler.html" (it comes from 50.16.167.51 and 107.22.0.216). I don't want to say that they are legitimate - I don't know that - but I have dealt with Kontera before and, although I no longer do, it got me thinking that there may be some good traffic coming from AWS. Knowing that would help in deciding whether to kill all of the AWS ranges or do it more selectively.

I've requested information from Kontera on that particular crawler and will definitely update here when they respond, but I also wanted to see if other people know of any examples of good traffic from AWS? Please post the IPs and user agents if you know any good ones. Thanks!

 

incrediBILL




msg:4550600
 7:32 am on Mar 3, 2013 (gmt 0)

The AWS ranges are also used by Kindle Fire, if I'm not mistaken, and I don't know what happens to their page loads if those stupid Kindle Fire acceleration servers can't connect to your site: whether it defaults to normal browser operation or just bombs the page load.

Other than that, there are many legit services running from that range but I whitelist them and block anything else by default.
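
A minimal sketch of the "whitelist a few, block the rest by default" approach described above, in Python. The networks below are reserved documentation ranges standing in for real AWS ranges, and the whitelisted address is hypothetical; none of these values come from this thread or from Amazon, so substitute ranges you have verified yourself.

```python
import ipaddress

# Placeholder networks (reserved documentation ranges), standing in for
# whatever hosting ranges you decide to block. NOT real AWS allocations.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]

# Hypothetical exceptions: addresses or subnets inside the blocked ranges
# that belong to services you have decided to let through.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("198.51.100.42/32"),
]


def is_blocked(remote_addr: str) -> bool:
    """Return True if the address is in a blocked range and not whitelisted."""
    ip = ipaddress.ip_address(remote_addr)
    # The whitelist is checked first, so an exception always wins.
    if any(ip in net for net in ALLOWED_NETWORKS):
        return False
    return any(ip in net for net in BLOCKED_NETWORKS)
```

Checking the whitelist first keeps the exceptions easy to audit: anything not listed there and inside a blocked range is refused, and everything outside the blocked ranges is untouched.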

motorhaven




msg:4550700
 6:23 pm on Mar 3, 2013 (gmt 0)

If you have Adsense on your site you want to allow Kontera and Proximic, especially Proximic. They are both Adsense network affiliates who do follow-up crawls after Google's Mediapartners crawl. It's stupid having a double crawl, but I've done some testing and found drops in $ when I didn't allow them through.

1script




msg:4550782
 1:49 am on Mar 4, 2013 (gmt 0)

@motorhaven: thank you for the tip! I did not know about it and I have to admit: I was blocking (403) Proximic for the longest time, at least for the last two years, since they were such a prolific crawler. They gobble up MORE URLs than Googlebot, which in itself is not a small portion of the server load. But if there's a good use for them, and it sounds like there is, I will have them unblocked and see what happens with the AdSense which is being used on some of the sites.

keyplyr




msg:4550787
 2:24 am on Mar 4, 2013 (gmt 0)

There's absolutely no way I'm allowing proximic access to my server.

wilderness




msg:4550789
 2:37 am on Mar 4, 2013 (gmt 0)

If their UA is the same as this [webmasterworld.com], they wouldn't get into my sites, even without my addressing the IP.

The UA contains three denied features.

1script




msg:4550791
 2:41 am on Mar 4, 2013 (gmt 0)

There's absolutely no way I'm allowing proximic access to my server.
Looks like they claim their crawls affect rating of your pages for CPM advertisers, presumably on networks that they then sell info about your site to. If you are not running any third-party ads like AdSense, then that's not really a concern. But if you do, perhaps it warrants additional investigation.

I have to admit though - Proximic was the very first bad UA I ever disallowed server-wide. I also just found out I had them in the Regex for bad UAs not just once but twice - I must have really-really wanted them gone 2 years ago (and of course they just kept pounding all these years). But if there's some use, I'll let them be: at this point Chinese/Russian fake referrer bots, Facebook after-crawls from 40 different IPs within the same second, Twitter URL crawlers, Web application vulnerability probe bots and other types of various illegitimate and borderline activities have deposed Proximic from *the* most prolific questionable bot title, so it sounds like I have other, more clearly devious bots to worry about.
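
As an illustration of keeping a bad-UA regex free of duplicate entries like the one mentioned above, here is a small Python sketch. The token list is hypothetical ("ExampleScraper" is made up, and "proximic" is repeated on purpose); it is not anyone's actual block list.

```python
import re

# Hypothetical token list; "proximic" appears twice on purpose to mirror
# the duplicate entry described in the post above.
BAD_UA_TOKENS = ["proximic", "ExampleScraper", "proximic"]


def build_ua_pattern(tokens):
    """Build one case-insensitive regex from tokens, dropping duplicates."""
    seen = set()
    unique = []
    for token in tokens:
        key = token.lower()
        if key not in seen:
            seen.add(key)
            # Escape each token so it is matched literally.
            unique.append(re.escape(token))
    if not unique:
        # An empty alternation would match every user agent.
        raise ValueError("at least one token is required")
    return re.compile("|".join(unique), re.IGNORECASE)


BAD_UA = build_ua_pattern(BAD_UA_TOKENS)


def is_bad_ua(user_agent: str) -> bool:
    """Return True if any blocked token occurs anywhere in the UA string."""
    return BAD_UA.search(user_agent) is not None
```

Building the pattern from a list rather than hand-editing one long regex makes repeated entries harmless and keeps each token escaped.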

wilderness




msg:4550792
 2:58 am on Mar 4, 2013 (gmt 0)

Seven magic words [webmasterworld.com]

motorhaven




msg:4550794
 3:09 am on Mar 4, 2013 (gmt 0)

If I weren't displaying Adsense ads I'd certainly block Proximic. They don't support crawl delay, serve no useful purpose other than targeting for advertisers, can be resource intensive during peak times, and follow up user page fetches a minute or so after the Adsense bot, so each HTML page results in 3 hits.

When I first started seeing them and noticing the pattern of their crawler fetching a page shortly after a legit user did, I knew it had to be connected to Adsense. I managed to get in touch with someone there on the phone.
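
One way to look for that kind of pattern in your own logs is sketched below in Python. It assumes the log has already been parsed into (time, path, user agent) records; the `Hit` type, the `find_follow_ups` helper and the two-minute window are all assumptions made for illustration, not anything described by the poster.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Hit(NamedTuple):
    """One parsed access-log record (hypothetical structure)."""
    when: datetime
    path: str
    user_agent: str


def find_follow_ups(hits, bot_token, window=timedelta(minutes=2)):
    """Return (earlier_hit, bot_hit) pairs where the bot fetched the same
    path within `window` after a request from any other user agent."""
    last_other_hit = {}  # path -> most recent hit whose UA lacks the token
    pairs = []
    for hit in sorted(hits, key=lambda h: h.when):
        if bot_token.lower() in hit.user_agent.lower():
            earlier = last_other_hit.get(hit.path)
            if earlier is not None and hit.when - earlier.when <= window:
                pairs.append((earlier, hit))
        else:
            # Any UA without the token counts as "other", including
            # other crawlers; filter those out beforehand if needed.
            last_other_hit[hit.path] = hit
    return pairs
```

A high share of the bot's requests showing up in the returned pairs would suggest follow-up crawling rather than independent discovery, though it proves nothing on its own.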

Based on the discussion I had, they crawl not only on behalf of direct clients but also for other ad networks that work with Adsense. On a test site I have, which has extremely steady Adsense income, I saw daily drops of 10-15%, whereas this site normally sees 2-5% differences. The drops occurred a few days after blocking the bot.

Of course this is going to vary by site; some sites are going to get different proportions of Google's AdWords ads versus ads from Google's affiliated ad networks.

[edited by: motorhaven at 3:12 am (utc) on Mar 4, 2013]

keyplyr




msg:4550795
 3:10 am on Mar 4, 2013 (gmt 0)



Besides a dozen other reasons, I object categorically to any company that scrapes my content then sells it.

I do publish Adsense on my sites. Despite Proximic hyperbole, they ain't getting my property, period.

motorhaven




msg:4550893
 12:08 pm on Mar 4, 2013 (gmt 0)

Besides a dozen other reasons, I object categorically to any company that scrapes my content then sells it.


Google ... Adwords ... Adsense

dougwilson




msg:4552275
 8:52 pm on Mar 7, 2013 (gmt 0)

Thinking about the "magic words" mentioned, I went and grabbed these examples from my robots.txt report:

176.9.139.112 Mozilla/5.0 (compatible; SISTRIX Crawler; http://crawler.sistrix.net/) /robots.txt

5.9.17.74 Mozilla/5.0 (compatible; SearchmetricsBot; http://www.searchmetrics.com/en/searchmetrics-bot/) /robots.txt


These bots belong to SEO sites. My question is: what's the consequence, server-work-wise, of blocking them or not blocking them?

Really I'm trying to find the logic in blocking a particular bot or ip (just because). Is it just bandwidth?

These particular examples just crawl, as far as I know. I don't need them or like them. One is housed at Hetzner.

Would these get blocked by you guys? Why, why not?

wilderness




msg:4552313
 9:52 pm on Mar 7, 2013 (gmt 0)

One is housed at Hetzner


Oh my, another bad word.

These bots belong to SEO sites. My question is, what's the consequence, server work wise, of blocking them or not blocking them.

Really I'm trying to find the logic in blocking a particular bot or ip (just because). Is it just bandwidth?


In years past there have been many explanations of these blocking practices in this forum.

Many times, those posing the logic were trolls promoting a bot. Are you a troll? ;)

What it comes down to is choice, and each webmaster must (?) decide what is beneficial or detrimental to their own websites.

I deny most visitors beyond the borders of Canada and the US, and my denials are a far broader topic than the one you're focusing on.

Many of the bots, crawlers, harvesters, or whatever else you choose to call them, simply offer no benefit that justifies leaving the door open.
They're almost like flies or ants. If you have one, you have more.

What are your choices when all these harvesters begin plagiarizing your content (text or images)?
Is it cheaper to hire an attorney and enter into the long process of litigation, or simply slam the door in their face because you're able to foresee their intent?

dougwilson




msg:4558898
 3:29 pm on Mar 27, 2013 (gmt 0)

I'm only wondering about server load when it comes to blocking nonsense (no-benefit) visits: does it cost the server more to process a 403 than a 404 or a 200?

There are no "my" images or products on my site, just my words, some vids and audio - All has been scraped and harvested...

wilderness




msg:4558972
 6:30 pm on Mar 27, 2013 (gmt 0)

Not as long as you deny by UA, IP or referer, rather than by domain name lookups.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved