
Search Engine Spider and User Agent Identification Forum

Fake BingBot
1000s of hits from a Slicehost IP
Gaia
msg:4556010 - 9:10 am on Mar 18, 2013 (gmt 0)

I got 1000s of hits from an agent identifying itself as BingBot, but coming from a SliceHost IP. It is also ignoring robots.txt.

User-agent: *
Disallow: /wp-


Status: 200
Request: /wp-login.php
Host: mysite.com
Referer: -
RemoteIP: 50.57.148.171
Time: 2013-03-17T13:29:50+0000
UserAgent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Query: ?redirect_to=http%3A%2F%mysite.com%2Fwp-admin%2F&reauth=1
Method: GET

Status: 302
Request: /wp-admin/index.php
Host: mysite.com
Referer: -
RemoteIP: 50.57.148.171
Time: 2013-03-17T13:29:44+0000
UserAgent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Query:
Method: GET


It followed through this loop over and over, at an aggressive rate. I spotted it thanks to NewRelic/Loggly and its handy Chrome extension.

The IP belongs to youngshand.com, which is a marketing agency, so I wonder if they are running some kind of "tests". Has anyone seen this fake bot before?
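For what it's worth, those two robots.txt lines really do cover the requested path. A minimal check with Python's stdlib robotparser (the rules and the path are quoted from above) shows that a compliant bingbot would have skipped /wp-login.php entirely:

from urllib import robotparser

# Parse the quoted robots.txt rules and test the path the fake bot requested.
rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /wp-"])
print(rp.can_fetch("bingbot", "/wp-login.php"))  # False: disallowed for all agents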

[edited by: incrediBILL at 3:14 am (utc) on Mar 19, 2013]
[edit reason] unlinked URL [/edit]

 

wilderness
msg:4556309 - 4:05 am on Mar 19, 2013 (gmt 0)

FWIW, it doesn't matter what UA they present themselves as (especially when claiming to be Bing and NOT coming from a legitimate MS IP with valid DNS).

In this instance, the IP allocation goes all the way up to the backbone:
Rackspace Hosting RACKS-8-NET-4 50.56.0.0 - 50.57.255.255

There are multiple mentions of this server farm in the archives and there should even be a comprehensive listing of all their IPs.

lucy24
msg:4556334 - 6:35 am on Mar 19, 2013 (gmt 0)

especially when Bing and NOT coming from a legitimate MS IP offering valid DNS

Or, in bing's case, vice versa: bing/MSN IP but not bing/msnbot UA ;)

Depending on your site setup, you may also find it convenient to block external requests for .php files. Just make sure you keep any redirect:rewrite pairs intact.

Gaia
msg:4556404 - 12:41 pm on Mar 19, 2013 (gmt 0)

Everything is being blocked at firewall level (via IP). Thanks

Gaia
msg:4560486 - 6:11 am on Apr 2, 2013 (gmt 0)

More Fake BingBots.

212.90.148.61
182.50.154.9
108.178.69.60

I'm inclined to believe there is some malware infecting servers and, while masquerading as bingbot, using them to try to penetrate other systems.

keyplyr
msg:4560492 - 6:35 am on Apr 2, 2013 (gmt 0)

Gaia, all these spoofed Bingbots are coming from hosting companies that should be blocked by default:

Goneo
212.90.148.0 - 212.90.148.255
212.90.148.0/24

Godaddy, Singapore
182.50.128.0 - 182.50.159.255
182.50.128.0/19

Singlehop
108.178.0.0 - 108.178.63.255
108.178.0.0/18

And as Wilderness says, the IP in your OP belongs to:

RackSpace
50.56.0.0 - 50.57.255.255
50.56.0.0/15

Another Rackspace range is:

64.49.192.0 - 64.49.255.255
64.49.192.0/18
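As a quick sanity check on listings like these, Python's stdlib ipaddress module will confirm that the rogue IP from the opening post falls inside that first Rackspace allocation (a minimal sketch; the IP and range are quoted from this thread):

from ipaddress import ip_address, ip_network

# Does the fake-bingbot IP from the opening post fall in the quoted /15?
print(ip_address("50.57.148.171") in ip_network("50.56.0.0/15"))  # True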

Gaia
msg:4560671 - 5:11 pm on Apr 2, 2013 (gmt 0)

Should be blocked by default? What does that mean? Is it good practice to add them to the DENY table?

I generally avoid blocking an entire range because of one rogue IP. Singlehop, for example, is not a bad neighborhood. And Rackspace definitely isn't.

I already have another measure in place: modsecurity does a reverse DNS lookup (using a local DNS server) to verify that a client presenting a bot UA is actually coming from a valid IP for that bot. Now I can stop manually blocking those guys...
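What Gaia describes is a forward-confirmed reverse DNS (FCrDNS) check. A minimal sketch of the same idea in Python, assuming (per Microsoft's published guidance) that genuine bingbot hosts reverse-resolve under .search.msn.com; the function name is illustrative, not Gaia's actual modsecurity rule:

import socket

def is_real_bingbot(ip):
    try:
        host = socket.gethostbyaddr(ip)[0]             # reverse (PTR) lookup
        if not host.endswith(".search.msn.com"):
            return False
        return ip in socket.gethostbyname_ex(host)[2]  # forward confirmation
    except OSError:
        return False

print(is_real_bingbot("50.57.148.171"))  # the Slicehost IP from the OP: False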

g1smd
msg:4560678 - 5:48 pm on Apr 2, 2013 (gmt 0)

You should allow search engine UAs only from the real IP addresses for those UAs.

There's a whole bunch of IPs you should block irrespective of UA presented.

Gaia
msg:4560683 - 6:18 pm on Apr 2, 2013 (gmt 0)

> You should allow search engine UAs only from the real IP addresses for those UAs.

That is what modsec is for, since you cannot keep track of the real IPs yourself; it is not a static list.

> There's a whole bunch of IPs you should block irrespective of UA presented.

This varies from server to server. If there is a list of IPs that should be blocked on every server, please post a link to it.

keyplyr
msg:4560696 - 6:31 pm on Apr 2, 2013 (gmt 0)

Should be blocked by default? What does that mean? It is good practice to add them to the DENY table?

I generally avoid blocking an entire range because of one rogue IP. Singlehop, for example, is not a bad neighborhood. And Rackspace definitely isn't.

Many webmasters here block hosting companies, colos, data centers, cloud servers, etc. by default, because normal human traffic does not come from these ranges. What does hit your site from these ranges is scrapers, data miners, and botnets.

There are several threads in this forum, all listing IP ranges assigned to these agents.

Gaia
msg:4560718 - 7:09 pm on Apr 2, 2013 (gmt 0)

What if you block a data center and then an important search engine starts crawling from an IP in that data center? By the time you find out what is going on, it could be too late.

I wonder if the folks at webhostingtalk also recommend using those lists. I surely won't blanket-block anything but certain countries.

dstiles
msg:4560736 - 7:51 pm on Apr 2, 2013 (gmt 0)

I agree with keyplyr: block ALL hosting/server farms EXCEPT those IPs you KNOW carry legitimate AND USEFUL bots. If you keep tabs on things you will not be surprised by a new "important search engine" - which in any case would take years to become prominent and useful. There is almost no "real" traffic from server farms: what would a server do with one of your web pages other than scrape or otherwise abuse its contents?

I probably go further than most here by blocking MS, G and several other IP ranges EXCEPT for their bots and even then, only "real" bots: for example, I reject image bots.

I have a relatively small web server - a couple of dozen small-scale web sites. Already this month, not yet two days old, I have about 5500 unwanted hits from pretend-SE bots, hackers, scrapers, virus-implanters... Killing server farms is a good way to reduce the damage: none of those got more than a 403.

By the way: it isn't only servers that send out fake bing/google bots. Several of those I'm currently seeing (and rejecting) are from compromised broadband IPs. There are millions of those - far more than compromised servers. And hits from servers are very often deliberate attacks or scrapes anyway.

wilderness
msg:4560754 - 8:26 pm on Apr 2, 2013 (gmt 0)

This varies from server to server. If there is a list of IPs that should be blocked on every server please post a link to it.


It appears that unless a copy-and-paste list is provided to you, the practice of accumulating such a list makes its existence invalid.

What if you block a data center then an important search engine starts coming from an IP in that data center?


The last major search engine to appear on the WWW was MSN (aka Bing) in 2003, thus your "theory" that another will appear is as likely as hell freezing over. Allowing anybody and everybody under the guise that they may be the new second coming is a lame practice.

[edited by: Ocean10000 at 2:02 pm (utc) on Apr 3, 2013]
[edit reason] Removed flame [/edit]

jlnaman
msg:4560787 - 10:43 pm on Apr 2, 2013 (gmt 0)

Some people advise trying to stop bad robots by testing User-Agent strings. I have noticed that some scumbots rapidly mutate their User-Agent strings. For example, IP addr 198.27.74.10 attacks 3 times in a row, about 8 seconds apart, every 12 hours. No two user agent strings are ever the same. Here are 10 recent examples (over a day and a half):
Mozilla/5.0 (Windows NT 6.2; rv:8.0) Gecko/20050108
Mozilla/5.0 (Linux i686; rv:11.0) Gecko/20000505 Firefox/11.0
Mozilla/5.0 (68K) AppleWebKit/587.0 (KHTML, live Gecko)
Mozilla/5.0 (Linux x86_64; rv:5.0) Gecko/20090507 Firefox/5.0
Mozilla/5.0 (Linux x86_64; rv:9.0) Gecko/20000927 Firefox/9.0
Mozilla/5.0 (Linux x86_64; rv:12.0) Gecko/20000621 Firefox/12.0
Mozilla/5.0 (compatible; MSIE 4.0; 68K; Win64;
Mozilla/5.0 (compatible; MSIE 10.0; Windows NT
Mozilla/5.0 (Windows NT 6.2) AppleWebKit/587.0 (KHTML,
Mozilla/5.0 (68K; rv:11.0) Gecko/20020906 Firefox/11.0
#=========
Others show more diversity:
Same IP address, all within 90 seconds:
66.249.73.112 = DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1;+http://www.google.com/bot.html)
66.249.73.112 = SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0
66.249.73.112 = Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Same IP address, six minutes apart:
50.30.34.47 = SEOstats 2.1.0 https://github.com/eyecatchup/SEOstats
50.30.34.47 = wscheck.com/1.0.0 (+http://wscheck.com/)
50.30.34.47 = bot.wsowner.com/1.0.0 (+http://wsowner.com/)
#=========
All data from my Apache logs, over the last 24 days: 158 unique user agents.

Gaia
msg:4560796 - 11:23 pm on Apr 2, 2013 (gmt 0)

"important search engine" doesn't mean a new engine. it refers to the possibility that one of the majors could setup shop in a datacenter that has been blocked in the sweep blocks mentioned here.

I already have the fake searchbot problem solved (via modsecurity). What is left to deal with are the other threats, and I have enough measures in place to detect when a rogue dedicated IP is persistent as opposed to one that is doing a drive by script attack. I rather manually block those IPs (or C class at the most) than block entire ranges just because they belong to a hosting company.

[edited by: Ocean10000 at 3:07 pm (utc) on Apr 3, 2013]
[edit reason] Removed Flame [/edit]

wilderness
msg:4560802 - 11:57 pm on Apr 2, 2013 (gmt 0)

Some people advise trying to stop bad robots by testing User Agent strings.


I certainly hope you're referring to another forum?
I'm not aware of any participant here who focuses solely upon UA; rather, most everybody uses multiple methods and/or conditions that are available in order to determine their own priorities.

Others white-list (denying all), and then allow acceptable IPs, UAs, and/or various combinations of both.


#=========
Others show more diversity:
Same IP address, all within 90 seconds:
66.249.73.112 = DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1;+http://www.google.com/bot.html)
66.249.73.112 = SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0
66.249.73.112 = Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)


The above belongs in one of the numerous FAKE GOOGLE threads.


Same IP address, six minutes apart:
50.30.34.47 = SEOstats 2.1.0 https://github.com/eyecatchup/SEOstats
50.30.34.47 = wscheck.com/1.0.0 (+http://wscheck.com/)
50.30.34.47 = bot.wsowner.com/1.0.0 (+http://wsowner.com/)
#=========


The above belongs in one of the server farm threads, although it appears that's what this thread has been reinvented as.

blend27
msg:4560811 - 12:44 am on Apr 3, 2013 (gmt 0)

Most in this neck of the woods of WebmasterWorld shoot first (block invalid requests) and investigate later, me included, when I can...

We come to "Search Engine Spider and User Agent Identification" on a daily basis, sometimes more often for access-control freaks like me.

One thing I have to state is that it works. I've been here since one of my domain addresses got 'SNIPPED' while I was describing (spilling) my issue, and later I was flying under the cloak of "Scraper Slayer"... Man, it is almost addictive...

I surely won't blanket-block anything but certain countries.

In my book, lately, I'd go against an SF (server farm and such) list first, coupled with request headers (except CN, UA, KR, BR, NL ...), before I'd block the user.

Blend27

[edited by: Ocean10000 at 3:20 pm (utc) on Apr 3, 2013]
[edit reason] Removed quote [/edit]

wilderness
msg:4560818 - 1:09 am on Apr 3, 2013 (gmt 0)

Back on topic. . .

For the benefit of the noobs and the copy-and-paste aficionados:

(Please note: these lines deny everything from bing/msn IP space except their bots; to instead restrict the bing/msn bots to these IPs only, you'll need to reverse the procedure - see the next reply)

# Deny any request from Microsoft/Bing address space whose UA is not bingbot/msnbot
RewriteCond %{REMOTE_ADDR} ^131\.253\.(3[0-9]|4[0-7])\. [OR]
RewriteCond %{REMOTE_ADDR} ^65\.5[2-5]\. [OR]
RewriteCond %{REMOTE_ADDR} ^70\.37\. [OR]
RewriteCond %{REMOTE_ADDR} ^157\.[45][0-9]\. [OR]
RewriteCond %{REMOTE_ADDR} ^207\.46\. [OR]
RewriteCond %{REMOTE_ADDR} ^207\.[67][0-9]\.
RewriteCond %{HTTP_USER_AGENT} !(bingbot|msnbot)
# Everything except robots.txt gets a 403
RewriteRule !^robots\.txt$ - [F]

There are numerous postings of these lines (or variations); all one needs to do is search the archives.

[edited by: wilderness at 1:20 am (utc) on Apr 3, 2013]

keyplyr
msg:4560819 - 1:15 am on Apr 3, 2013 (gmt 0)

YMMV

# Reverse approach: any request presenting a bing/msn UA is denied
# unless it comes from known Microsoft address space
RewriteCond %{HTTP_USER_AGENT} (Bingbot|Bing\ Mobile\ |msnbot|MSRBOT) [NC]
RewriteCond %{REMOTE_ADDR} !^65\.5[2-5]\.
RewriteCond %{REMOTE_ADDR} !^70\.37\.
RewriteCond %{REMOTE_ADDR} !^131\.253\.[2-4][0-9]\.
RewriteCond %{REMOTE_ADDR} !^131\.107\.
RewriteCond %{REMOTE_ADDR} !^157\.[45][0-9]\.
RewriteCond %{REMOTE_ADDR} !^199\.30\.[1-3][0-9]\.
RewriteCond %{REMOTE_ADDR} !^207\.46\.
RewriteCond %{REMOTE_ADDR} !^207\.[67][0-9]\.
# 403 everything except the error page itself and robots.txt
RewriteRule !^(forbidden\.html|robots\.txt)$ - [F]

wilderness
msg:4560820 - 1:24 am on Apr 3, 2013 (gmt 0)

Hey keyplyr,
". . .minds think alike" ;)

Given the extent of my own IP denials, I use these lines in a different capacity than most: limiting requests from bing/msn IPs and allowing only their SE bots.

The FAKES are generally from IPs already denied.

jlnaman
msg:4560822 - 1:58 am on Apr 3, 2013 (gmt 0)

1) wilderness is god-like and lucy24 is a genuine goddess.
2) My only point was to demonstrate to people who block UAs that scumbots mutate the UAs within seconds of receiving a 403. Actual (current) examples, not reporting a new threat.
3) I actually use a multi-method strategy, as learned from this godly forum.

keyplyr
msg:4560829 - 3:08 am on Apr 3, 2013 (gmt 0)

...scumbots mutate the UAs within seconds of receiving a 403.

Some do, not all... not even most. But that also seems to be a popular setting in some out-of-the-box scraper software. You can easily write a script to catch it if it's a consistent occurrence at your site(s).
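For example, a minimal sketch of such a script, assuming Apache's combined log format (the log path and the distinct-UA threshold are placeholders, not anyone's actual setup):

import re
from collections import defaultdict

LOG = "/var/log/apache2/access.log"   # placeholder path
UA_RE = re.compile(r'^(\S+).*"[A-Z]+ \S+ \S+" \d+ \S+ "[^"]*" "([^"]*)"')

# Collect the set of distinct User-Agents seen per client IP
agents = defaultdict(set)
with open(LOG) as fh:
    for line in fh:
        m = UA_RE.match(line)
        if m:
            ip, ua = m.groups()
            agents[ip].add(ua)

# Flag IPs that rotate their UA; 5 is an arbitrary cutoff for "mutating"
for ip, uas in sorted(agents.items(), key=lambda kv: -len(kv[1])):
    if len(uas) >= 5:
        print(ip, len(uas), "distinct user agents")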

Personally, I look through my logs manually each day (if not more often) and catch these things by eye, since these agents usually get caught in other unacceptable behavior as well; then I give their IP address a look-see and, if needed, block it for the future.

jlnaman
msg:4560846 - 5:19 am on Apr 3, 2013 (gmt 0)

Me too. The mutators were already IP-blocked, but I thought the evidence of the changing UAs would help convince some people to think outside of the UA "box".
Let's end this thread and wait for something important to come along.

jojy
msg:4560929 - 12:09 pm on Apr 3, 2013 (gmt 0)

What about the Yahoo and Google bots? Can anyone suggest htaccess rules for them too?

jojy
msg:4560939 - 12:38 pm on Apr 3, 2013 (gmt 0)

Gaia, all these spoofed Bingbots are coming from hosting companies that should be blocked by default


99.9% of scrapers come from shared hosting sites such as Hostgator, Bluehost, etc. They use fake user agents such as Yahoo, Bing, Google, Baidu, etc. Scrapers rank at the top with my content, and my site is nowhere on the SERPs!

My dedicated firewall is full of individual IPs. I talked to my hosting company and asked them to block scrapers by checking their IP's host, but they said it's not possible via the firewall.

I wonder if there is any real solution available for this problem. I just don't want scrapers to hit my server.

wilderness
msg:4560962 - 1:12 pm on Apr 3, 2013 (gmt 0)

What about Yahoo and Google bot?


Just search the archives for fake google; there are more of those threads than there are for msn/bing.

wilderness
msg:4560965 - 1:15 pm on Apr 3, 2013 (gmt 0)

My dedicated firewall is full with individual ips.


I'm assuming you're referring to the precise Class D (i.e., single IPs)?
This is a bad practice and leads to infinity.

When denying IP ranges you need to cover a wider range.
In the case of a server farm, the entire range.
In the case of a private IP, the regional range, possibly with an additional condition of other criteria.
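When WHOIS reports an allocation as a start and end address, the range can be collapsed into CIDR blocks for a deny rule. A minimal sketch using Python's stdlib ipaddress module, run against the GoDaddy Singapore range quoted earlier in this thread (the /19 it prints matches keyplyr's figure):

from ipaddress import ip_address, summarize_address_range

# Collapse a WHOIS start/end pair into CIDR notation for firewall/htaccess use
start, end = ip_address("182.50.128.0"), ip_address("182.50.159.255")
print([str(net) for net in summarize_address_range(start, end)])
# ['182.50.128.0/19']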

jojy
msg:4561006 - 2:39 pm on Apr 3, 2013 (gmt 0)

@Wilderness

How would you block the following IPs? I am not sure how to decide on the IP ranges.

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) - 31.170.160.101 - srv37.000webhost.com

Googlebot/2.1( [googlebot.com...]

Googlebot/2.1( [googlebot.com...]

wilderness
msg:4561016 - 3:05 pm on Apr 3, 2013 (gmt 0)

fake+google [google.com]


This 2010 reply from Jim [webmasterworld.com] seems the most valid.

There are simpler versions that don't do a reverse DNS lookup; however, you'll need to go through the results of the first link.

jlnaman
msg:4561078 - 5:08 pm on Apr 3, 2013 (gmt 0)

I wonder if there is any real solution available for this problem. I just don't want scrapers to hit my server.


We all want scumbots to go away. But there is a price, a real cost, to solving the problem: reverse lookups, subscription services, etc. A "real" solution will cost time and money. Captchas helped for a while, then scumbots adapted. Consider offering a small amount of "free" content to scrapers and requiring "registration" for more extensive content. Look around: we all have the problem.
