

Search Engine Spider and User Agent Identification Forum

This 34 message thread spans 2 pages.
Comcast Business Communications
keyplyr




msg:4388017
 12:00 pm on Nov 17, 2011 (gmt 0)

Getting ready to close the door. Anything of value coming from this range?

173.160.0.0 - 173.167.255.255
173.160.0.0/13
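
The two notations above describe the same block. As a minimal sketch, assuming Python's standard `ipaddress` module is an acceptable way to check it, the CIDR form can be derived from the start and end addresses:

```python
import ipaddress

# Start and end of the range quoted in the post above.
first = ipaddress.IPv4Address("173.160.0.0")
last = ipaddress.IPv4Address("173.167.255.255")

# summarize_address_range yields the smallest set of CIDR blocks
# that exactly covers the inclusive range first..last.
blocks = list(ipaddress.summarize_address_range(first, last))

for block in blocks:
    print(block, block.num_addresses)
```

A range that is not aligned on a power-of-two boundary would come back as several blocks rather than one.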

 

wilderness




msg:4388145
 4:20 pm on Nov 17, 2011 (gmt 0)

Here's some more:

66.208.192.0 - 66.208.255.255
70.88.0.0 - 70.91.255.255
74.92.0.0 - 74.95.255.255
75.144.0.0 - 75.151.255.255
173.8.0.0 - 173.15.255.255
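
For anyone who wants to test a visitor address against the ranges listed so far in this thread, here is a rough sketch. The names `RANGES`, `NETWORKS` and `in_listed_ranges` are illustrative only, not from any poster's setup:

```python
import ipaddress

# Inclusive start/end pairs copied from the posts above.
RANGES = [
    ("66.208.192.0", "66.208.255.255"),
    ("70.88.0.0", "70.91.255.255"),
    ("74.92.0.0", "74.95.255.255"),
    ("75.144.0.0", "75.151.255.255"),
    ("173.8.0.0", "173.15.255.255"),
    ("173.160.0.0", "173.167.255.255"),
]

# Collapse each pair into CIDR networks once, up front.
NETWORKS = [
    net
    for start, end in RANGES
    for net in ipaddress.summarize_address_range(
        ipaddress.IPv4Address(start), ipaddress.IPv4Address(end)
    )
]


def in_listed_ranges(ip: str) -> bool:
    """Return True if ip falls inside any of the listed ranges."""
    addr = ipaddress.IPv4Address(ip)
    return any(addr in net for net in NETWORKS)


print([str(net) for net in NETWORKS])
```

Matching a range only says where the request came from, not whether it should be blocked.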

dstiles




msg:4388254
 8:41 pm on Nov 17, 2011 (gmt 0)

I've logged 750 baddies from comcast over the past couple of years.

Most are one or two hits then gone, which may be just a badly set-up browser or over-secured proxy with no headers, but one got to 102 hits over a fortnight before giving up. Most are only blocked for a few days and then released again, but it certainly seems to be a hotbed of hackers.

Unfortunately I can't block the service. One of my clients gets traffic from USA and I'm not entirely sure which ISPs without a major investigation, but I suspect some are from comcast.

keyplyr




msg:4388285
 9:45 pm on Nov 17, 2011 (gmt 0)


Unfortunately I can't block the service. One of my clients gets traffic from USA and I'm not entirely sure which ISPs without a major investigation, but I suspect some are from comcast.

Comcast is the largest high-speed internet cable provider in the USA with the most individual users - however [as described in OP] this is the "business" IP range; typically for company web sites and services - no different than any other server farm IMO.

wilderness




msg:4388342
 12:23 am on Nov 18, 2011 (gmt 0)

no different than any other server farm IMO.


I agree and do have some older sub-net searches on some of these ranges (before ARIN pulled the plug).

incrediBILL




msg:4388347
 12:41 am on Nov 18, 2011 (gmt 0)

this is the "business" IP range; typically for company web sites and services - no different than any other server farm IMO.



I would not block wholesale comcast ranges unless you have better intel than I do; I would block abusive IPs only.

If you block all of the business range you'll possibly also be blocking comcast ISP services to office spaces, which could be a huge number of potential customers.

The reason I say this is because comcastbusiness.net is listed in the Broadband/DSL reports: [dslreports.com...]

[edited by: incrediBILL at 3:47 am (utc) on Nov 18, 2011]

keyplyr




msg:4388385
 2:27 am on Nov 18, 2011 (gmt 0)

Thanks incrediBILL

wilderness




msg:4388419
 5:14 am on Nov 18, 2011 (gmt 0)

I would not block wholesale comcast


That's why "they" offer so many different flavors of candy and other things ;)

dstiles




msg:4388706
 10:36 pm on Nov 18, 2011 (gmt 0)

keyplyr - how can you (easily) tell the difference between a server farm and static business DSL in this context? I'm fairly sure I've seen people run web sites from their own business statics.

The major problem I see is not so much deliberate bot-ing as compromised computers/nets being used as botnets.

keyplyr




msg:4388716
 11:29 pm on Nov 18, 2011 (gmt 0)

dstiles - well that's just it, the behavior: crawling, scraping, hacking, etc.

No human browsing on ADSL asks for admin files or shell access from a web site they're visiting.
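
The behaviour test described above could be sketched roughly as below. The path patterns are hypothetical examples, since the thread names no specific paths, and `looks_like_probe` is an illustrative name:

```python
import re

# Hypothetical probe patterns (assumptions, not from the thread);
# tune these to whatever your own logs actually show.
PROBE_PATTERNS = re.compile(
    r"/wp-admin|/phpmyadmin|/admin(?:/|\.php|$)|\.(?:sh|cgi)$|/shell",
    flags=re.IGNORECASE,
)


def looks_like_probe(request_path: str) -> bool:
    """Return True if the requested path matches a probe pattern."""
    return PROBE_PATTERNS.search(request_path) is not None


for path in ("/admin/login.php", "/articles/comcast.html"):
    print(path, looks_like_probe(path))
```

A single match is only a hint; combining it with request rate and headers would give fewer false positives.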

incrediBILL




msg:4388902
 5:59 pm on Nov 19, 2011 (gmt 0)

FWIW, a lot of startup sites that actually get office space would probably tend to have a comcast business account. I've noticed a fair number of crawlers initially coming from comcast that eventually transition to server farms.

Not sure if you can get comcast business in a residential unit, but a scraper/aggregator might do that to get the increased speed and bandwidth if it's available to home users.

dstiles




msg:4388935
 8:51 pm on Nov 19, 2011 (gmt 0)

I'm still wary. keyplyr's point is good but I wonder how much is botnet on compromised machines, how much some techie playing etc. I've seen both human and bot-like on a single static office IP so, given possible customer traffic, I have to take a magnanimous line and block only if activity is obviously bot (based on the usual criteria, of course).

Pfui




msg:4388999
 2:42 am on Nov 20, 2011 (gmt 0)

I don't block whole-hog yet, via IP or Hostname, because comcastbusiness.net's such a mixed bag:

--- Real kiddies

107-0-86-218-ip-static.hfc.comcastbusiness.net
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; Foxborough Public Schools - Report unlawful access by calling (508) 543-1660; .NET CLR 1.1.4322)
REF: [courses.govhs.org...] [...etc]

--- Too many peas in too many pods

74-94-153-141-newengland.hfc.comcastbusiness.net
PostPost/1.0 (+http://postpost.com/crawlers)
robots.txt? Yes

173-9-17-1-newengland.hfc.comcastbusiness.net
PostPost/1.0 (+http://postpost.com/crawlers)
robots.txt? Yes

ec2-174-129-171-219.compute-1.amazonaws.com
PostPost/1.0 (+http://postpost.com/crawlers)
robots.txt? Yes

--- Probably another PostPost pea

74-94-153-29-newengland.hfc.comcastbusiness.net
Apache-HttpClient/4.1 (java 1.5)
robots.txt? NO

--- Frequent followers

70-91-204-161-busname-sfba.hfc.comcastbusiness.net
70-91-204-162-busname-sfba.hfc.comcastbusiness.net
Mozilla/5.0 (compatible; TweetedTimes Bot/1.0; +http://tweetedtimes.com)
robots.txt? NO

--- Zero traffic-referrer

173-164-221-73-sfba.hfc.comcastbusiness.net
Mozilla/5.0 (Thriceler-0.1 http://www.kuill.com/robots/spider.html)
robots.txt? Yes
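
The hostnames listed above all follow the same shape: four dash-separated octets, then a label, then `.hfc.comcastbusiness.net`. Below is a minimal sketch of pulling the IP and label back out, assuming that pattern holds for other names in the domain. `parse_comcast_business` is an illustrative name:

```python
import re
from typing import Optional, Tuple

# Matches names such as 74-94-153-141-newengland.hfc.comcastbusiness.net
# and captures the four octets plus the label that follows them.
HOSTNAME_RE = re.compile(
    r"^(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})-([a-z0-9-]+)"
    r"\.hfc\.comcastbusiness\.net$"
)


def parse_comcast_business(hostname: str) -> Optional[Tuple[str, str]]:
    """Return (ip, label) for a matching hostname, else None."""
    match = HOSTNAME_RE.match(hostname.lower())
    if match is None:
        return None
    ip = ".".join(match.group(1, 2, 3, 4))
    return ip, match.group(5)


print(parse_comcast_business("74-94-153-141-newengland.hfc.comcastbusiness.net"))
```

Reverse DNS names are set by whoever controls the address block, so a forward lookup to confirm the name resolves back to the same IP is a sensible extra step before trusting the label.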

incrediBILL




msg:4389021
 7:43 am on Nov 20, 2011 (gmt 0)

<slightly off topic>
pop quiz: who knows what the HFC means in the comcast reverse DNS?

no cheating by looking it up in Google...
</slightly off topic>

wilderness




msg:4389027
 8:54 am on Nov 20, 2011 (gmt 0)

Hybrid Fiber Coax (just a network or relay designation).

sfba is the relay station San Francisco Bay Area

lucy24




msg:4389031
 10:34 am on Nov 20, 2011 (gmt 0)

Getting ready to close the door. Anything of value coming from this range?

173.160.0.0 - 173.167.255.255

Weird. I hardly see this range at all, but the ones I do see-- just checked available logs-- are definitely human. Perfectly normal behavior.

Staffa




msg:4389035
 11:28 am on Nov 20, 2011 (gmt 0)

Just in today from Comcast Business Communications :

173.13.143.78 Mozilla/5.0 (compatible; YioopBot~~+http://www.yioop.com/bot.php)

(~~ = two white spaces)
took robots.txt, tried default page and ran out of luck ;o)

lucy24




msg:4389046
 1:00 pm on Nov 20, 2011 (gmt 0)

Huh. With me they call themselves

Mozilla/5.0 (compatible; YioopBot +http://www.yioop.com/bot.php)

from the identical IP as your visitors.

Honestly now, I can't go around banning robots just because they have a silly name ;) even if-- or especially if-- that's the only reason I remember them at all. Two visits, two months apart, each consisting of robots.txt plus a single directory-index page, doesn't quite bring them up to I Don't Like Your Face level.

On the other hand...

If the IP Address was also 173.11.90.73 to 78, then you have come to the right place to find out about who was probably crawling your site. If it was a different IP address then someone else is hijacking my crawler's name.


:: counting on fingers ::

Hm, we may have a loophole here :)
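
The finger-counting can be done in a few lines. This sketch compares the address reported earlier in the thread with the range quoted from the bot page, using only the standard `ipaddress` module:

```python
import ipaddress

# Range published on the bot page, as quoted above.
low = ipaddress.IPv4Address("173.11.90.73")
high = ipaddress.IPv4Address("173.11.90.78")

# Address reported earlier in the thread for the YioopBot visit.
seen = ipaddress.IPv4Address("173.13.143.78")

# IPv4Address objects are ordered, so a chained comparison works.
in_published_range = low <= seen <= high
print(in_published_range)
```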

Pfui




msg:4390092
 1:53 am on Nov 23, 2011 (gmt 0)

Okay, now these kinds of guys can go jump:

74-94-156-210-newengland.hfc.comcastbusiness.net
Wget/1.12 (linux-gnu)

13:22:52 /dir/filename.html
13:40:00 /dir/filename.html

robots.txt? NO

Hmm. Am beginning to take an increasingly dim view of:

.newengland.hfc.comcastbusiness.net

74.94.15n may need some special handling...

Staffa




msg:4390209
 10:26 am on Nov 23, 2011 (gmt 0)

I have had Wget blocked for years.
I pronounce it as 'we get' as in "we get .... nothing" :o)

Pfui




msg:4390257
 1:33 pm on Nov 23, 2011 (gmt 0)

Agreed re Wget. And anyone running it knows more than the average anybody and that makes me wary of that particular Comcast-addressed 'business.'

keyplyr




msg:4390449
 8:40 pm on Nov 23, 2011 (gmt 0)

Regrettably, it seems the simple fix isn't going to work, as comcastbusiness.net just has too many different clients. I'll be blocking on a case-by-case basis.

dstiles




msg:4394948
 11:22 pm on Dec 6, 2011 (gmt 0)

Just arriving at 800 comcast IPs blocked for (in general) one bad hit each over a period of about 18 months. That's on a server hosting about 50 small-ish sites.

I have just added a per-site ban capability for comcast, although it is easy to extend to other ISPs, again on a per-site basis.
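
The post gives no detail on how the per-site ban is built, so the following is only a guess at the general idea: a lookup from site name to the networks banned on that site. The site names and the `SITE_BANS` table are hypothetical:

```python
import ipaddress

# Hypothetical per-site policy: site name -> networks banned on that site.
SITE_BANS = {
    "site-a.example": [ipaddress.ip_network("173.160.0.0/13")],
    "site-b.example": [],  # this site still wants the traffic
}


def is_banned(site: str, ip: str) -> bool:
    """Return True if ip is inside a network banned for this site."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in SITE_BANS.get(site, []))


print(is_banned("site-a.example", "173.164.221.73"))
print(is_banned("site-b.example", "173.164.221.73"))
```

Extending this to other ISPs would just mean adding their networks to the relevant site's list.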

cpollett




msg:4409646
 5:17 pm on Jan 22, 2012 (gmt 0)

I was doing a quick search on my robot's name to make sure it wasn't misbehaving and noticed this discussion. I noticed in the above that the observed ip was 173.11.143.78 but the block I said I used on my bot page was 173.11.90.73-78. This was a typo on my part; 173.11.143.78 is my robot. I'd be happy to answer any questions about my bot. The source code is GPLv3 and downloadable from my other site, seekquarry.com. As the bot page indicates, it does respect robots.txt and in particular crawl-delay. Also, if you don't want me to crawl a site, just drop me an e-mail at chris@pollett.org and I can block my crawler from hitting it.

I tend to run my crawler about once a month right now on three mac minis, and it does an open web crawl of about 20-30 million pages. I will probably increase this to 100 million shortly. The crawl order is by an estimate of page importance (based on the links to that page it has seen so far and their quality), so it will tend to grab the index page and it might be a while before grabbing any interior pages.

Samizdata




msg:4409655
 5:39 pm on Jan 22, 2012 (gmt 0)

I'd be happy to answer any questions about my bot

Why is it creating a 100,000,000 page index?

Who will benefit from the data?

Why should a webmaster allow it any access?

it will tend to grab the index page

Will it do that even if disallowed in the robots.txt file?

If so, why?

...

Pfui




msg:4409665
 6:20 pm on Jan 22, 2012 (gmt 0)

Thanks for chiming in, cpollett. In addition to Samizdata's crucial Qs, here's one more: Its official name is --?

Mozilla/5.0 (compatible; YioopBot~~+http://www.yioop.com/bot.php)
Mozilla/5.0 (compatible; YioopBot +http://www.yioop.com/bot.php)

cpollett




msg:4409667
 6:29 pm on Jan 22, 2012 (gmt 0)

My goal is to make it possible for anyone to create a large scale crawl of the web if they want. That's why the code is open-sourced. I am trying to make it easier to use than things like Nutch -- although at this point Nutch is still more capable. I chose PHP as the language mainly because it seemed to be the easiest to deploy anywhere and I have been trying to minimize the number of dependencies beyond a default install of PHP.

100,000,000 pages is my current estimate of what my three machines given my current software are capable of in about a month. At this point I am not close to maxing out my bandwidth from comcast. Ideally, I want to make the best index I can.

You can search the data now at
[yioop.com...]
It is a little slow as the results are served from a single mac mini -- the two other mac minis are only used for downloading pages at this point. I am in the process of adding mirroring and the ability to distribute queries amongst machines.

It is completely up to a webmaster if they want to allow it access, just as it is up to a webmaster if they want the googlebot to access their site. Google probably sucks way more bandwidth. On the other hand, given that right now I get about 100-200 queries per day on yioop.com, Google probably gets you more traffic.

As my bot page explains if you disallow my robot in the robots.txt file it won't crawl your site beyond grabbing the robot page.
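
A webmaster who wants to see how a disallow rule for this bot would be read can try it locally. This sketch uses Python's standard `urllib.robotparser`. The robots.txt content and `example.com` URL are made up for illustration, and it shows how the standard library interprets the rules, which is not necessarily how the crawler itself does:

```python
from urllib import robotparser

# Hypothetical robots.txt content a webmaster might serve.
ROBOTS_TXT = """\
User-agent: YioopBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "http://example.com/index.html"
yioop_allowed = parser.can_fetch("YioopBot", url)
other_allowed = parser.can_fetch("SomeOtherBot", url)
other_delay = parser.crawl_delay("SomeOtherBot")

print(yioop_allowed, other_allowed, other_delay)
```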

cpollett




msg:4409670
 6:48 pm on Jan 22, 2012 (gmt 0)

Right now it identifies itself as
Mozilla/5.0 (compatible; YioopBot~~+http://www.yioop.com/bot.php)
I hadn't really thought about that, is one space better?

Pfui




msg:4409689
 9:32 pm on Jan 22, 2012 (gmt 0)

1.) I'd vote for single spaces because the format's more in keeping with majors' current conventions --

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

-- and also because two spaces do not compute in html as two spaces.

2.) At some point did you opt to swap out the 'standard' semi-colon with a second space? (Just curious.)

3.) Do you really have sequential tildes in your UA? Or are those space placeholders, or--?

-----
As my bot page explains...

Alas, this forum's replete with threads proving that, for whatever reasons, most bots don't jibe with their own info pages (...including yours, vis-a-vis that tpyo'd IP:)

...if you disallow my robot in the robots.txt file it won't crawl your site beyond grabbing the robot page.

Excellent. Thank you!

Last but not least -- please tell me that can't be overridden by your users? (holds breath)

cpollett




msg:4409702
 10:52 pm on Jan 22, 2012 (gmt 0)

Okay. Here is the line out of my code that now specifies this:

define('USER_AGENT',
    'Mozilla/5.0 (compatible; '.USER_AGENT_SHORT.'; +'.NAME_SERVER.'bot.php)');

So it has a second semi-colon now and uses a single space. (It had used two spaces before.) The next time I push stuff to my server and do a crawl it should use this. It's not crawling right now. I think when I wrote that line before I just hadn't noticed the semi-colon.

On yioop.com, the only user who can stop and start crawls as well as configure what USER_AGENT_SHORT and NAME_SERVER are is me. The code is open-source and available on seekquarry.com, so if someone downloads it and runs it on their own machines they could modify things however they want. But that would be from a different ip range.

Thanks for the suggestions and pointing out the typo in my ip address.
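
Since the user-agent has now appeared in three forms (two spaces, one space, and semi-colon plus one space), anyone matching on it may want a pattern that tolerates all of them. In this rough sketch the values of `USER_AGENT_SHORT` and `NAME_SERVER` are assumptions inferred from the strings quoted earlier in the thread, not taken from the crawler's actual configuration:

```python
import re

# Assumed values, inferred from the user-agent strings quoted earlier
# in the thread; the real constants live in the crawler's own config.
USER_AGENT_SHORT = "YioopBot"
NAME_SERVER = "http://www.yioop.com/"

# Python equivalent of the PHP string concatenation shown above.
USER_AGENT = (
    f"Mozilla/5.0 (compatible; {USER_AGENT_SHORT}; +{NAME_SERVER}bot.php)"
)

# Tolerant pattern: optional semi-colon, then one or more spaces.
YIOOP_RE = re.compile(
    r"\(compatible; YioopBot;?\s+\+http://www\.yioop\.com/bot\.php\)"
)

variants = [
    "Mozilla/5.0 (compatible; YioopBot  +http://www.yioop.com/bot.php)",
    "Mozilla/5.0 (compatible; YioopBot +http://www.yioop.com/bot.php)",
    USER_AGENT,
]
matches = [YIOOP_RE.search(ua) is not None for ua in variants]
print(USER_AGENT)
print(matches)
```

As noted in the post, anyone running their own copy of the code can change these strings, so the user-agent alone does not prove which operator is crawling.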


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved