incrediBILL

msg:4550600 | 7:32 am on Mar 3, 2013 (gmt 0) |
The AWS ranges are also used by Kindle Fire, if I'm not mistaken, and I don't know what happens to its page loads if those stupid Kindle Fire acceleration servers can't connect to your site: whether it defaults to normal browser operation or just bombs the page load. Other than that, there are many legit services running from that range, but I whitelist them and block anything else by default.
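A minimal sketch of the "whitelist a few, block the rest of the range" idea, in Python rather than any particular server config. The networks below are placeholder documentation ranges, not real AWS or service addresses, and the names `BLOCKED_RANGES`, `ALLOWED_EXCEPTIONS`, and `is_allowed` are made up for illustration.

```python
import ipaddress

# Hypothetical ranges for illustration only (reserved documentation networks).
BLOCKED_RANGES = [ipaddress.ip_network("203.0.113.0/24")]
# Hypothetical "legit service" carve-out inside the blocked range.
ALLOWED_EXCEPTIONS = [ipaddress.ip_network("203.0.113.16/28")]


def is_allowed(ip_text):
    """Allow whitelisted exceptions first, then deny the rest of the blocked ranges."""
    ip = ipaddress.ip_address(ip_text)
    if any(ip in net for net in ALLOWED_EXCEPTIONS):
        return True
    if any(ip in net for net in BLOCKED_RANGES):
        return False
    # Addresses outside the blocked ranges are not affected by this rule.
    return True
```

The order of the checks is the whole point: the exceptions have to be tested before the broad range, otherwise the whitelisted services get caught by the default block.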
|
motorhaven

msg:4550700 | 6:23 pm on Mar 3, 2013 (gmt 0) |
If you have Adsense on your site you want to allow Kontera and Proximic, especially Proximic. They both are Adsense network affiliates who do follow-up crawls after Google's Mediapartners crawl. It's stupid having a double crawl, but I've done some testing and found drops in $ by not allowing them through.
|
1script

msg:4550782 | 1:49 am on Mar 4, 2013 (gmt 0) |
@motorhaven: thank you for the tip! I did not know about it and I have to admit: I was blocking (403) proximic for the longest time, at least for the last two years, since they were such a prolific crawler. They gobble up MORE URLs than Googlebot, which in itself is not a small portion of the server load. But if there's a good use for them, and sounds like there is, I will have them un-blocked and see what happens with the AdSense which is being used on some of the sites.
|
keyplyr

msg:4550787 | 2:24 am on Mar 4, 2013 (gmt 0) |
There's absolutely no way I'm allowing proximic access to my server.
|
wilderness

msg:4550789 | 2:37 am on Mar 4, 2013 (gmt 0) |
If their UA is the same as this [webmasterworld.com], they wouldn't get in my sites without even addressing the IP. The UA contains three denied features.
|
1script

msg:4550791 | 2:41 am on Mar 4, 2013 (gmt 0) |
| There's absolutely no way I'm allowing proximic access to my server. |
| Looks like they claim their crawls affect rating of your pages for CPM advertisers, presumably on networks that they then sell info about your site to. If you are not running any third-party ads like AdSense, then that's not really a concern. But if you do, perhaps it warrants additional investigation. I have to admit though - Proximic was the very first bad UA I ever disallowed server-wide. I also just found out I had them in the Regex for bad UAs not just once but twice - I must have really-really wanted them gone 2 years ago (and of course they just kept pounding all these years). But if there's some use, I'll let them be: at this point Chinese/Russian fake referrer bots, Facebook after-crawls from 40 different IPs within the same second, Twitter URL crawlers, Web application vulnerability probe bots and other types of various illegitimate and borderline activities have deposed Proximic from *the* most prolific questionable bot title, so it sounds like I have other, more clearly devious bots to worry about.
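For anyone else who finds the same entry twice in their bad-UA regex, here is a rough Python sketch of building the pattern from a list so duplicates drop out. The token list is invented for the example (only "proximic" comes from this thread), and `build_ua_pattern` and `is_bad_ua` are hypothetical helper names, not part of any server software.

```python
import re

# Hypothetical token list; "proximic" is listed twice on purpose.
BAD_UA_TOKENS = ["proximic", "libwww", "proximic", "harvest"]


def build_ua_pattern(tokens):
    """Lower-case the tokens, drop duplicates (keeping order), and join them into one regex."""
    unique = list(dict.fromkeys(token.lower() for token in tokens))
    return re.compile("|".join(re.escape(token) for token in unique), re.IGNORECASE)


BAD_UA = build_ua_pattern(BAD_UA_TOKENS)


def is_bad_ua(user_agent):
    """Return True when any blocked token appears anywhere in the user agent."""
    return BAD_UA.search(user_agent) is not None
```

Un-blocking a bot later is then a one-line change to the list instead of a hunt through a long alternation.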
|
wilderness

msg:4550792 | 2:58 am on Mar 4, 2013 (gmt 0) |
Seven magic words [webmasterworld.com]
|
motorhaven

msg:4550794 | 3:09 am on Mar 4, 2013 (gmt 0) |
If I weren't displaying Adsense ads I'd certainly block Proximic. They don't support crawl delay, serve no useful purpose other than targeting for advertisers, can be resource intensive during peak times, and follow up user page fetches a minute or so after the Adsense bot, so each HTML page results in 3 hits. When I first started seeing them and noticing the pattern of their crawler fetching a page shortly after a legit user did, I knew it had to be connected to Adsense. I managed to get in touch with someone there on the phone. Based on the discussion I had, not only do they crawl on behalf of direct clients, but also for other ad networks who work with Adsense. On a test site I have which has extremely steady Adsense income I saw daily drops of 10-15%, and this site normally sees 2-5% differences. The drop occurred a few days after blocking the bot. Of course this is going to vary by site; some sites are going to get different proportions of Google's AdWords ads versus ads from Google's affiliated ad networks. [edited by: motorhaven at 3:12 am (utc) on Mar 4, 2013]
|
keyplyr

msg:4550795 | 3:10 am on Mar 4, 2013 (gmt 0) |
Besides a dozen other reasons, I object categorically to any company that scrapes my content then sells it. I do publish AdSense on my sites. Despite Proximic hyperbole, they ain't getting my property, period.
|
motorhaven

msg:4550893 | 12:08 pm on Mar 4, 2013 (gmt 0) |
| Besides a dozen other reasons, I object categorically to any company that scrapes my content then sells it. |
| Google ... Adwords ... Adsense
|
dougwilson

msg:4552275 | 8:52 pm on Mar 7, 2013 (gmt 0) |
Thinking about the "magic words" mentioned. I went and grabbed these examples from my robots.txt report:
176.9.139.112 Mozilla/5.0 (compatible; SISTRIX Crawler; http://crawler.sistrix.net/) /robots.txt
5.9.17.74 Mozilla/5.0 (compatible; SearchmetricsBot; http://www.searchmetrics.com/en/searchmetrics-bot/) /robots.txt
These bots belong to SEO sites. My question is, what's the consequence, server work wise, of blocking them or not blocking them. Really I'm trying to find the logic in blocking a particular bot or ip (just because). Is it just bandwidth? These particular examples just crawl as far as I know. I don't need them, or like them. One is housed at Hetzner. Would these get blocked by you guys? Why, why not?
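To put a number on "is it just bandwidth", one option is to count hits per user agent from a report like the one above. This is only a sketch: it assumes every report line has the layout "IP, user agent, path" seen in the two quoted examples, and the names `LINE_RE`, `parse_report_line`, and `hits_per_agent` are invented here.

```python
import re
from collections import Counter

# The two report lines quoted in the post.
REPORT_LINES = [
    "176.9.139.112 Mozilla/5.0 (compatible; SISTRIX Crawler; http://crawler.sistrix.net/) /robots.txt",
    "5.9.17.74 Mozilla/5.0 (compatible; SearchmetricsBot; http://www.searchmetrics.com/en/searchmetrics-bot/) /robots.txt",
]

# Assumed layout: IP first, requested path last, user agent in between.
LINE_RE = re.compile(r"^(?P<ip>\S+) (?P<ua>.+) (?P<path>/\S*)$")


def parse_report_line(line):
    """Split one report line into (ip, user_agent, path), or return None if it does not fit."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match.group("ip"), match.group("ua"), match.group("path")


def hits_per_agent(lines):
    """Count how many parsed lines each user agent accounts for."""
    counts = Counter()
    for line in lines:
        parsed = parse_report_line(line)
        if parsed is not None:
            counts[parsed[1]] += 1
    return counts
```

Run over a full day of log lines instead of two, the counter gives a rough picture of which crawlers are worth the trouble of a deny rule.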
|
wilderness

msg:4552313 | 9:52 pm on Mar 7, 2013 (gmt 0) |
Oh my, another bad word.
| These bots belong to SEO sites. My question is, what's the consequence, server work wise, of blocking them or not blocking them. Really I'm trying to find the logic in blocking a particular bot or ip (just because). Is it just bandwidth? |
| In years past there have been many explanations of these blocking practices in this forum. Many times, those posing the logic were trolls promoting a bot. Are you a troll ;) What it comes down to is choice, and each webmaster must (?) decide what is beneficial or detrimental to their own websites. I deny most visitors beyond the borders of Canada and US, and my denials are a far broader topic than the one you're focusing upon. Many of the bots, crawlers, harvesters, or whatever else you choose to call them, simply offer no benefit to leaving the door open. They're almost like flies or ants. If you have one, you have more. What are your choices when all these harvesters begin plagiarizing your content (text or images)? Is it cheaper to hire an attorney and enter into the long process of litigation, or simply slam the door in their face because you're able to foresee their intent?
|
dougwilson

msg:4558898 | 3:29 pm on Mar 27, 2013 (gmt 0) |
Only wondering about server load when it comes to blocking nonsense (no benefit) visits - (process) 403 vs 404 or 200. There are no "my" images or products on my site, just my words, some vids and audio - All has been scraped and harvested...
|
wilderness

msg:4558972 | 6:30 pm on Mar 27, 2013 (gmt 0) |
Not as long as you deny by UA, IP or referer, rather than by domain name lookups.
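A rough illustration of why those three checks are cheap: each one is a string or address comparison against data already in the request, with no DNS round trip. This is a Python sketch rather than actual server configuration, and the deny lists and the `response_status` name are placeholders.

```python
import ipaddress

# Hypothetical deny lists, for illustration only.
DENIED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
DENIED_UA_TOKENS = ("examplebot",)
DENIED_REFERER_TOKENS = ("spam.example",)


def response_status(ip_text, user_agent, referer):
    """Return 403 for a denied request, otherwise 200. No DNS lookups are made."""
    ua = user_agent.lower()
    ref = referer.lower()
    # Plain substring checks on the request headers.
    if any(token in ua for token in DENIED_UA_TOKENS):
        return 403
    if any(token in ref for token in DENIED_REFERER_TOKENS):
        return 403
    # Address-in-network test on the client IP, done in memory.
    ip = ipaddress.ip_address(ip_text)
    if any(ip in net for net in DENIED_NETWORKS):
        return 403
    return 200
```

A rule keyed on a hostname would instead need a reverse lookup for each visiting IP, which is the extra work being warned against here.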
|