Search Engine Spider and User Agent Identification Forum

Fake BingBot
1000s of hits from a Slicehost IP
Gaia




 
Msg#: 4556008 posted 9:10 am on Mar 18, 2013 (gmt 0)

I got 1000s of hits from an agent identifying itself as BingBot, but coming from a SliceHost IP. It is also ignoring robots.txt.

User-agent: *
Disallow: /wp-


(please excuse the lack of punctuation)

Status:    200
Request:   /wp-login.php
Host:      mysite.com
Referer:   -
RemoteIP:  50.57.148.171
Time:      2013-03-17T13:29:50+0000
UserAgent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Query:     ?redirect_to=http%3A%2F%mysite.com%2Fwp-admin%2F&reauth=1
Method:    GET

Status:    302
Request:   /wp-admin/index.php
Host:      mysite.com
Referer:   -
RemoteIP:  50.57.148.171
Time:      2013-03-17T13:29:44+0000
UserAgent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Query:
Method:    GET


It followed this loop over and over, and at an aggressive rate. I spotted it thanks to NewRelic/Loggly and its handy Chrome extension.

The IP belongs to youngshand.com, which is a marketing agency, so I wonder if they are running some "tests". Has anyone seen this fake bot before?

[edited by: incrediBILL at 3:14 am (utc) on Mar 19, 2013]
[edit reason] unlinked URL [/edit]

 

wilderness




 
Msg#: 4556008 posted 4:05 am on Mar 19, 2013 (gmt 0)

FWIW, it doesn't matter what UA they represent themselves as (especially when it claims to be Bing and is NOT coming from a legitimate MS IP with valid reverse DNS).

In this instance, the IP allocation goes all the way up to the backbone:
Rackspace Hosting RACKS-8-NET-4 50.56.0.0 - 50.57.255.255

There are multiple mentions of this server farm in the archives and there should even be a comprehensive listing of all their IPs.

lucy24




 
Msg#: 4556008 posted 6:35 am on Mar 19, 2013 (gmt 0)

especially when it claims to be Bing and is NOT coming from a legitimate MS IP with valid reverse DNS

Or, in bing's case, vice versa: bing/MSN IP but not bing/msnbot UA ;)

Depending on your site setup, you may also find it convenient to block external requests for files ending in .php. Just make sure you keep any redirect:rewrite pairs working.

Gaia




 
Msg#: 4556008 posted 12:41 pm on Mar 19, 2013 (gmt 0)

Everything is being blocked at the firewall level (via IP). Thanks

Gaia




 
Msg#: 4556008 posted 6:11 am on Apr 2, 2013 (gmt 0)

More Fake BingBots.

212.90.148.61
182.50.154.9
108.178.69.60
212.90.148.61

I'm inclined to believe there is some malware infecting servers and using them to try to penetrate other systems while masquerading as bingbot.

keyplyr




 
Msg#: 4556008 posted 6:35 am on Apr 2, 2013 (gmt 0)



Gaia, all these spoofed Bingbots are coming from hosting companies that should be blocked by default:

Goneo
212.90.148.0 - 212.90.148.255
212.90.148.0/24

Godaddy, Singapore
182.50.128.0 - 182.50.159.255
182.50.128.0/19

Singlehop
108.178.0.0 - 108.178.63.255
108.178.0.0/18

And as Wilderness says, the IP in your OP belongs to:

RackSpace
50.56.0.0 - 50.57.255.255
50.56.0.0/15

Another Rackspace range is:

64.49.192.0 - 64.49.255.255
64.49.192.0/18
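
If you want to turn start-end ranges like these into CIDR deny rules without doing the binary math by hand, a few lines of Python will do it. A rough sketch, standard library only, using the ranges quoted above (re-check them against current whois data before blocking anything):

import ipaddress

# (start, end) pairs as quoted above; verify against current whois data
ranges = [
    ("212.90.148.0", "212.90.148.255"),  # Goneo
    ("182.50.128.0", "182.50.159.255"),  # Godaddy, Singapore
    ("108.178.0.0", "108.178.63.255"),   # Singlehop
    ("50.56.0.0", "50.57.255.255"),      # Rackspace
    ("64.49.192.0", "64.49.255.255"),    # Rackspace
]

for start, end in ranges:
    for net in ipaddress.summarize_address_range(
            ipaddress.ip_address(start), ipaddress.ip_address(end)):
        # one CIDR per line, ready for a firewall table or a deny rule
        print(net)

Run against the list above it prints exactly the CIDR blocks shown (212.90.148.0/24, 182.50.128.0/19, and so on).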

Gaia




 
Msg#: 4556008 posted 5:11 pm on Apr 2, 2013 (gmt 0)

Should be blocked by default? What does that mean? Is it good practice to add them to the DENY table?

I generally avoid blocking an entire range because of one rogue IP. Singlehop, for example, is not a bad neighborhood. And Rackspace definitely isn't.

I already have another measure in place. modsecurity does a reverse DNS lookup (using a local DNS server) to verify whether a client presenting a bot UA is actually coming from a valid IP for that bot. Now I can stop manually blocking those guys...
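
For anyone curious what that check amounts to, here is a minimal stand-alone sketch of the same forward-confirmed reverse DNS idea in Python (the function name is mine; Bing documents that genuine bingbot IPs reverse-resolve under search.msn.com):

import socket

def is_real_bingbot(ip):
    """Reverse-resolve the IP, check the domain, then confirm it resolves back."""
    try:
        host = socket.gethostbyaddr(ip)[0]              # reverse (PTR) lookup
    except (socket.herror, socket.gaierror):
        return False                                    # no PTR record at all
    if not host.endswith(".search.msn.com"):
        return False                                    # wrong domain = fake
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward lookup
    except socket.gaierror:
        return False
    return ip in forward_ips                            # must map back to the same IP

# the IP from the opening post should fail this check
print(is_real_bingbot("50.57.148.171"))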

g1smd




 
Msg#: 4556008 posted 5:48 pm on Apr 2, 2013 (gmt 0)

You should allow search engine UAs only from the real IP addresses for those UAs.

There's a whole bunch of IPs you should block irrespective of UA presented.

Gaia




 
Msg#: 4556008 posted 6:18 pm on Apr 2, 2013 (gmt 0)

> You should allow search engine UAs only from the real IP addresses for those UAs.

That is what modsec is for, since you cannot keep track of the real IPs by hand; it is not a static list.

> There's a whole bunch of IPs you should block irrespective of UA presented.
This varies from server to server. If there is a list of IPs that should be blocked on every server please post a link to it.

keyplyr




 
Msg#: 4556008 posted 6:31 pm on Apr 2, 2013 (gmt 0)



Should be blocked by default? What does that mean? Is it good practice to add them to the DENY table?

I generally avoid blocking an entire range because of one rogue IP. Singlehop, for example, is not a bad neighborhood. And Rackspace definitely isn't.

Many webmasters here block hosting companies, colos, data centers, cloud servers, etc. by default because normal human traffic does not come from these ranges. What does hit your site from these ranges is scrapers, data miners, and botnets.

There are several threads in this forum, all listing IP ranges assigned to these agents.

Gaia




 
Msg#: 4556008 posted 7:09 pm on Apr 2, 2013 (gmt 0)

What if you block a data center and then an important search engine starts coming from an IP in that data center? By the time you find out what is going on, it could be too late.

I wonder if folks at webhostingtalk also recommend using those lists. I surely won't blanket-block anything but certain countries.

dstiles




 
Msg#: 4556008 posted 7:51 pm on Apr 2, 2013 (gmt 0)

I agree with keyplyr: block ALL hosting/server farms EXCEPT those IPs you KNOW carry legitimate AND USEFUL bots. If you keep tabs on things you will not be surprised by a new "important search engine" - which in any case will take years to become prominent and useful. There is almost no "real" traffic from server farms: what would a server do with one of your web pages other than scrape or otherwise abuse its contents?

I probably go further than most here by blocking MS, G and several other IP ranges EXCEPT for their bots and even then, only "real" bots: for example, I reject image bots.

I have a relatively small web server - a couple of dozen small-scale web sites. Already this month, not yet two days old, I have had about 5500 unwanted hits from pretend-SE bots, hackers, scrapers, virus-implanters... Killing server farms is a good way to reduce the damage: none of those got more than a 403.

By the way: it isn't only servers that send out fake bing/google bots. Several of those I'm currently seeing (and rejecting) are from compromised broadband IPs. There are millions of those - far more than compromised servers. And hits from servers are very often deliberate attacks or scrapes anyway.

wilderness




 
Msg#: 4556008 posted 8:26 pm on Apr 2, 2013 (gmt 0)

This varies from server to server. If there is a list of IPs that should be blocked on every server please post a link to it.


It appears that unless a copy-and-paste list is provided to you, the practice of accumulating such a list is somehow invalid.

What if you block a data center and then an important search engine starts coming from an IP in that data center?


The last major search engine to appear on the WWW was MSN (aka Bing) in 2003, thus your "theory" that another will appear is as likely as hell freezing over. Allowing anybody and everybody under the guise that they may be the new second coming is a lame practice.

[edited by: Ocean10000 at 2:02 pm (utc) on Apr 3, 2013]
[edit reason] Removed flame [/edit]

jlnaman



 
Msg#: 4556008 posted 10:43 pm on Apr 2, 2013 (gmt 0)

Some people advise trying to stop bad robots by testing User Agent strings. I have noticed how some scumbots rapidly mutate their User Agent strings. For example, IP addr 198.27.74.10 attacks 3 times in a row, about 8 seconds apart, every 12 hours. No two user agent strings are ever the same. Here are 10 recent examples (over a day and a half):
Mozilla/5.0 (Windows NT 6.2; rv:8.0) Gecko/20050108
Mozilla/5.0 (Linux i686; rv:11.0) Gecko/20000505 Firefox/11.0
Mozilla/5.0 (68K) AppleWebKit/587.0 (KHTML, live Gecko)
Mozilla/5.0 (Linux x86_64; rv:5.0) Gecko/20090507 Firefox/5.0
Mozilla/5.0 (Linux x86_64; rv:9.0) Gecko/20000927 Firefox/9.0
Mozilla/5.0 (Linux x86_64; rv:12.0) Gecko/20000621 Firefox/12.0
Mozilla/5.0 (compatible; MSIE 4.0; 68K; Win64;
Mozilla/5.0 (compatible; MSIE 10.0; Windows NT
Mozilla/5.0 (Windows NT 6.2) AppleWebKit/587.0 (KHTML,
Mozilla/5.0 (68K; rv:11.0) Gecko/20020906 Firefox/11.0
#=========
Others show more diversity:
Same IP address, all within 90 seconds:
66.249.73.112 = DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1;+http://www.google.com/bot.html)
66.249.73.112 = SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0
66.249.73.112 = Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Same IP address, six minutes apart:
50.30.34.47 = SEOstats 2.1.0 https://github.com/eyecatchup/SEOstats
50.30.34.47 = wscheck.com/1.0.0 (+http://wscheck.com/)
50.30.34.47 = bot.wsowner.com/1.0.0 (+http://wsowner.com/)
#=========
All data from my Apache Logs, in the last 24 days. 158 unique User agents.
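
If anyone wants to pull the same pattern out of their own logs automatically, a short script that counts distinct UA strings per IP in a combined-format Apache log will surface it. A rough sketch only; the log path and threshold are placeholders to adjust for your setup:

from collections import defaultdict

LOGFILE = "/var/log/apache2/access.log"  # placeholder path - adjust
THRESHOLD = 5                            # distinct UAs per IP before flagging

agents = defaultdict(set)
with open(LOGFILE, encoding="utf-8", errors="replace") as fh:
    for line in fh:
        parts = line.split('"')
        # combined format: host ... "request" status bytes "referer" "user-agent"
        if len(parts) >= 6:
            ip = line.split(None, 1)[0]
            agents[ip].add(parts[5])     # parts[5] is the user-agent field

for ip, uas in sorted(agents.items(), key=lambda kv: -len(kv[1])):
    if len(uas) >= THRESHOLD:
        print(f"{ip} presented {len(uas)} different user agents")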

Gaia




 
Msg#: 4556008 posted 11:23 pm on Apr 2, 2013 (gmt 0)

"important search engine" doesn't mean a new engine. it refers to the possibility that one of the majors could setup shop in a datacenter that has been blocked in the sweep blocks mentioned here.

I already have the fake searchbot problem solved (via modsecurity). What is left to deal with are the other threats, and I have enough measures in place to detect when a rogue dedicated IP is persistent as opposed to one that is doing a drive-by script attack. I'd rather manually block those IPs (or the Class C at most) than block entire ranges just because they belong to a hosting company.

[edited by: Ocean10000 at 3:07 pm (utc) on Apr 3, 2013]
[edit reason] Removed Flame [/edit]

wilderness




 
Msg#: 4556008 posted 11:57 pm on Apr 2, 2013 (gmt 0)

Some people advise trying to stop bad robots by testing User Agent strings.


I certainly hope you're referring to another forum?
I'm not aware of any participant here who focuses solely upon the UA; rather, most everybody uses the multiple methods and/or conditions that are available, according to their own priorities.

Others white-list (denying all by default), then allow acceptable IPs, UAs, and/or various combinations of both.


#=========
Others show more diversity:
Same IP address, all within 90 seconds:
66.249.73.112 = DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1;+http://www.google.com/bot.html)
66.249.73.112 = SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0
66.249.73.112 = Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)


The above belongs in one of the numerous FAKE GOOGLE threads.


Same IP address, six minutes apart:
50.30.34.47 = SEOstats 2.1.0 https://github.com/eyecatchup/SEOstats
50.30.34.47 = wscheck.com/1.0.0 (+http://wscheck.com/)
50.30.34.47 = bot.wsowner.com/1.0.0 (+http://wsowner.com/)
#=========


The above belongs in one of the server farm threads, although it appears that's what this thread has been reinvented as.

blend27




 
Msg#: 4556008 posted 12:44 am on Apr 3, 2013 (gmt 0)

Most in this neck of the woods of WebmasterWorld shoot first (block invalid requests) and investigate later, me included...

We come to "Search Engine Spider and User Agent Identification" on a daily basis, sometimes more often for access-control freaks like me.

One thing I have to state is that it works. I've been here since I got one of my domain addresses 'SNIPPED' describing (spilling) my issue, and later was flying under the cloak of "Scraper Slayer"... Man, it is almost addictive...

I surely won't blanket-block anything but certain countries.

In my book, lately, I'd go against the SF (server farm and such) lists first, coupled with request headers (except CN, UA, KR, BR, NL ...), before I'd block the user.

Blend27

[edited by: Ocean10000 at 3:20 pm (utc) on Apr 3, 2013]
[edit reason] Removed quote [/edit]

wilderness




 
Msg#: 4556008 posted 1:09 am on Apr 3, 2013 (gmt 0)

Back on topic...

For the benefit of the noobs and the copy-and-paste aficionados:

(Please note: as written, these lines forbid requests from bing/msn's own IP ranges unless the UA is bingbot/msnbot; to instead restrict the bingbot/msnbot UAs to these IPs only, you'll need to reverse the conditions.)

RewriteCond %{REMOTE_ADDR} ^131\.253\.(3[0-9]|4[0-7])\. [OR]
RewriteCond %{REMOTE_ADDR} ^65\.5[2-5]\. [OR]
RewriteCond %{REMOTE_ADDR} ^70\.37\. [OR]
RewriteCond %{REMOTE_ADDR} ^157\.[45][0-9]\. [OR]
RewriteCond %{REMOTE_ADDR} ^207\.46\. [OR]
RewriteCond %{REMOTE_ADDR} ^207\.[67][0-9]\.
RewriteCond %{HTTP_USER_AGENT} !(bingbot|msnbot)
RewriteRule !^robots\.txt$ - [F]

There are numerous postings of these lines (or variations); all one needs to do is search the archives.

[edited by: wilderness at 1:20 am (utc) on Apr 3, 2013]

keyplyr




 
Msg#: 4556008 posted 1:15 am on Apr 3, 2013 (gmt 0)

YMMV

RewriteCond %{HTTP_USER_AGENT} (Bingbot|Bing\ Mobile\ |msnbot|MSRBOT) [NC]
RewriteCond %{REMOTE_ADDR} !^65\.5[2-5]\.
RewriteCond %{REMOTE_ADDR} !^70\.37\.
RewriteCond %{REMOTE_ADDR} !^131\.253\.[2-4][0-9]\.
RewriteCond %{REMOTE_ADDR} !^131\.107\.
RewriteCond %{REMOTE_ADDR} !^157\.[45][0-9]\.
RewriteCond %{REMOTE_ADDR} !^199\.30\.[1-3][0-9]\.
RewriteCond %{REMOTE_ADDR} !^207\.46\.
RewriteCond %{REMOTE_ADDR} !^207\.[67][0-9]\.
RewriteRule !^(forbidden\.html|robots\.txt)$ - [F]

wilderness




 
Msg#: 4556008 posted 1:24 am on Apr 3, 2013 (gmt 0)

Hey keyplyr,
". . . minds think alike" ;)

Given the extent of my own IP denials, I use these lines in a different capacity than most, limiting the requests from bing/msn IP's and only allowing their SE Bots.

The FAKES are generally from IP's already denied.

jlnaman



 
Msg#: 4556008 posted 1:58 am on Apr 3, 2013 (gmt 0)

1) wilderness is god-like and lucy24 is a genuine goddess.
2) My only point was to demonstrate to people who block UAs that scumbots mutate the UAs within seconds of receiving a 403. Actual (current) examples, not reporting a new threat.
3) I actually use a multi-method strategy, as learned from this godly forum.

keyplyr




 
Msg#: 4556008 posted 3:08 am on Apr 3, 2013 (gmt 0)

...scumbots mutate the UAs within seconds of receiving a 403.

Some do, not all... not even most. But that also seems to be a popular setting in some out-of-the-box scraper software. You can easily write a script to catch it if it's a consistent occurrence at your site(s).

Personally, I manually look through my logs each day (if not more often) and visually catch these things, since these agents usually get caught doing other unacceptable behavior. Then I give their IP address a look-see and, if needed, block it for the future.

jlnaman



 
Msg#: 4556008 posted 5:19 am on Apr 3, 2013 (gmt 0)

Me too. The mutators were already IP-blocked, but I thought the evidence of the changing UAs would help convince some people to think outside of the UA "box".
Let's end this thread and wait for something important to come along.

jojy




 
Msg#: 4556008 posted 12:09 pm on Apr 3, 2013 (gmt 0)

What about the Yahoo and Google bots? Can anyone suggest an htaccess rule for them too?

jojy




 
Msg#: 4556008 posted 12:38 pm on Apr 3, 2013 (gmt 0)

Gaia, all these spoofed Bingbots are coming from hosting companies that should be blocked by default


99.9% of scrapers are coming from shared hosting sites such as HostGator, Bluehost, etc. They use fake user agents such as Yahoo, Bing, Google, Baidu, etc. Scrapers rank at the top with my content and my site is nowhere in the SERPs!

My dedicated firewall is full of individual IPs. I talked to my hosting company and asked them to block scrapers by checking their IP host, but they said it's not possible via the firewall.

I wonder if there is any real solution available for this problem. I just don't want scrapers to hit my server.

wilderness




 
Msg#: 4556008 posted 1:12 pm on Apr 3, 2013 (gmt 0)

What about the Yahoo and Google bots?


Just search the archives for fake google; there are more of those threads than there are for msn/bing.

wilderness




 
Msg#: 4556008 posted 1:15 pm on Apr 3, 2013 (gmt 0)

My dedicated firewall is full of individual IPs.


I'm assuming you're referring to the precise Class D?
This is a bad practice and leads to infinity.

When denying IP ranges you need to cover a wider range.
In the case of a server farm the entire range.
In the case of a private IP, the regional range, possibly combined with an additional condition based on other criteria.
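
One way to keep that discipline, sketched below with Python's ipaddress module, is to hold the deny list as whole netblocks and test each new offender against it before adding yet another single-address rule (the block list here is only illustrative, taken from ranges quoted earlier in the thread):

import ipaddress

# deny list kept as whole netblocks rather than individual addresses
DENIED = [ipaddress.ip_network(cidr) for cidr in (
    "50.56.0.0/15",     # Rackspace, quoted earlier in the thread
    "212.90.148.0/24",  # Goneo
    "182.50.128.0/19",  # Godaddy, Singapore
)]

def already_covered(ip):
    """True if this address already falls inside a denied netblock."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in DENIED)

# a single-IP firewall entry is only worth adding when nothing wider covers it
print(already_covered("50.57.148.171"))   # True: inside 50.56.0.0/15
print(already_covered("31.170.160.101"))  # False: would need a new, wider range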

jojy




 
Msg#: 4556008 posted 2:39 pm on Apr 3, 2013 (gmt 0)

@Wilderness

How would you block the following IPs? I am not sure how to decide on IP ranges.

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) - 31.170.160.101 - srv37.000webhost.com

Googlebot/2.1( [googlebot.com...]

Googlebot/2.1( [googlebot.com...]

wilderness




 
Msg#: 4556008 posted 3:05 pm on Apr 3, 2013 (gmt 0)

fake+google [google.com]


This 2010 reply from Jim [webmasterworld.com] seems the most valid.

There are simpler versions that don't do reverse DNS; however, you'll need to go through the results of the first link.

jlnaman



 
Msg#: 4556008 posted 5:08 pm on Apr 3, 2013 (gmt 0)

I wonder if there is any real solution available for this problem. I just don't want scrapers to hit my server.


We all want scumbots to go away. But there is a price, a real cost to the problem. Reverse lookups, subscription services, etc. A "real" solution will cost time and money. Captchas helped for a while, then scumbots adapted. Consider a small amount of "free" content for scrapers and "registering" for more extensive content. Look around, we all have the problem.
