Forum Moderators: open


80legs


GaryK

5:48 pm on Jul 19, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member


Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/spider.html;) Gecko/2008032620
76.105.253.nn
c-76-105-253-nn.hsd1.or.comcast.net
-----
OrgName: Comcast Cable Communications, Inc.
NetRange: 76.96.0.0 - 76.127.255.255
-----

ROBOTS.TXT? Yes

Seems to be a distributed bot that charges clients to crawl for them. Claims to respect robots.txt and that seems to be true.

dstiles

2:31 am on Feb 21, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



How do you mean, "IM client"?

As far as I can tell it's just a load of computers contributed to the grid by punters for some reason - probably monetary but perhaps idealism, curiosity or cupidity: the site doesn't actually make that clear.

What I do see is that many people can pay to run multiple IDENTICAL scrapes of MY content using MY bandwidth. The purpose to which this is put can, it seems, be anything.

Hence, the people who "allow" the bot to run on their computers get blocked by me. I'm aware they mostly run on dynamic public ISP blocks, as I said, so after a while, if they don't re-offend, they may get unblocked so others can use the IP.
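
A rough sketch of that expiring-block policy, in Python; the 30-day window and names are illustrative, since the post gives no figures:

import time

# Hypothetical expiring blocklist: an offending IP is blocked, then aged
# out if it doesn't re-offend, freeing the (dynamic) address for others.
BLOCK_FOR = 30 * 24 * 3600   # assumed 30-day window; the post names none
_blocked = {}                # ip -> timestamp of last offense

def record_offense(ip):
    _blocked[ip] = time.time()   # re-offending resets the clock

def is_blocked(ip):
    ts = _blocked.get(ip)
    if ts is None:
        return False
    if time.time() - ts > BLOCK_FOR:
        del _blocked[ip]         # aged out without re-offending: unblock
        return False
    return True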

incrediBILL

9:36 am on Feb 21, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



How do you mean, "IM client"?


Because 80legs uses the Digsby network of IM clients:
[webmasterworld.com...]

It was a WebmasterWorld front page story I ran back in September.

Here's Digsby's page about the bot net option:
[wiki.digsby.com...]

dstiles

10:24 pm on Feb 21, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Ah, thanks. Yes, I remember it now. :)

montclairguy

12:06 am on May 5, 2010 (gmt 0)

10+ Year Member



I'm glad I found these threads. This thing hammers my sites all day and all night.

@shiondev

For Pete's sake, man -- can't your programmers learn the meaning of 403 after a few hits to a site and stop banging it?

Further, what your data is used for is really unclear. How, in any way, does allowing your (very aggressive) spider access to my sites generate more customers and revenue for me?

Specifically, what publicly available search engines are powered by any of your data? If they aren't Google, Bing, Yahoo or comparison shopping engines, I'm skeptical of any value whatsoever.
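
Crawler-side, the 403 back-off asked for above is only a few lines; a minimal Python sketch, with an illustrative threshold, since the post doesn't specify one:

import collections

# Stop requesting a host after a few consecutive 403s; the threshold
# of three is an assumption for illustration.
MAX_403S = 3
_strikes = collections.Counter()

def should_crawl(host):
    return _strikes[host] < MAX_403S

def record_response(host, status):
    if status == 403:
        _strikes[host] += 1   # another strike against this host
    else:
        _strikes[host] = 0    # any other response resets the count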

dstiles

12:39 am on May 5, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I will say one thing in their favour. I added them to robots.txt and haven't seen them since.

Ditto MJ12, and by that I mean even the ones Majestic said must be "unlicensed" or fakes. :)

Pfui

4:58 am on May 5, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Approx. 80 days later:

- shiondev continues to ignore posts here.

- shiondev continues to ignore sticky mail with log data showing his "do not crawl" list is still bupkis because...

- 80legs bots continue to hit wastefully and annoyingly often.

80Jenn

10:33 pm on May 7, 2010 (gmt 0)

10+ Year Member



Hi Pfui,

I work with shiondev on 80legs--I'm sorry that you haven't gotten a reply here before now. I just found this thread, and I'm sure that he's just not aware of your efforts to contact him.

I can investigate why we are crawling your site(s) if you send me your domains and/or IPs.

Our bot always honors robots.txt disallows, so if you add this to your robots file(s), we will not crawl you anymore:

User-agent: 008
Disallow: /

Since our crawler is distributed, it may take up to 3 hours for this change to take effect. After you add the disallow for our bot, we may check the robots file occasionally--at most every 3 hours, but it's highly unlikely that rate would be sustained for long periods of time.
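
For illustration, the re-check behavior described above could be modeled as a per-host cache with a three-hour TTL. This Python sketch is an assumption about the mechanism, not 80legs' actual code:

import time
from urllib import robotparser

# Assumed mechanism: each node caches a host's robots.txt for three
# hours, so a fresh disallow can take up to that long to propagate.
TTL = 3 * 60 * 60    # three hours, in seconds
_cache = {}          # host -> (fetched_at, parser)

def robots_for(host):
    entry = _cache.get(host)
    if entry is None or time.time() - entry[0] > TTL:
        rp = robotparser.RobotFileParser("http://%s/robots.txt" % host)
        rp.read()    # fetch and parse the live file
        _cache[host] = (time.time(), rp)
    return _cache[host][1]

def may_crawl(host, url, agent="008"):
    return robots_for(host).can_fetch(agent, url)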

We've recently had an issue that I want to bring up as an FYI--in one particular case, someone was blocking our crawler IP addresses. The effect of this was that when we tried to read the robots.txt file, it returned an empty file, which we interpreted as being allowed to hit their domain. And this, in turn, caused us to crawl their domain, even though they didn't want us to.

So, really the most effective way to block us is via robots. If the bot isn't behaving properly after that, you can contact us on our website to let us know.
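
The failure mode Jenn describes is easy to reproduce: to a standard parser, an empty robots.txt means "no rules", i.e. allow everything. Using Python's urllib.robotparser as a stand-in:

from urllib import robotparser

# An empty robots.txt parses to "no rules", which a crawler reads as
# permission to fetch anything -- exactly the trap described above.
rp = robotparser.RobotFileParser()
rp.parse([])   # simulate the empty body served to a blocked fetch
print(rp.can_fetch("008", "http://www.example.com/any/page"))   # True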

Jenn
80legs

tangor

11:49 pm on May 7, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



80Jenn, I hope the following in robots.txt is also respected:

User-agent: *
Disallow: /

80Jenn

2:18 am on May 8, 2010 (gmt 0)

10+ Year Member



tangor--yes, we do also respect user-agent: * robots disallows.

jdMorgan

2:31 am on May 8, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Welcome to WebmasterWorld, 80Jenn!

A more robust example for robot testing would be

User-agent: Googlebot
Disallow: /images/
Allow: /images/adwords-bkgrnd
Disallow: /cgi-bin

User-agent: Slurp
User-agent: msnbot
User-agent: Teoma
Disallow: /cgi-bin
Disallow: /images/
Crawl-delay: 15

Sitemap: http://www.example.com/sitemap.xml

User-agent: *
Disallow: /


This can be tested with the wild-card disallow at either end (top or bottom) or even in the middle.

Some robots get wonky if they encounter a directive that they don't understand, such as the semi-proprietary "Allow", "Crawl-delay" and "Sitemap" directives above.

Some primitive robots don't process "User-agent: *" records unless they're the only record in the robots.txt file.

Many robots cannot handle multiple-user-agent policy records. That's a pity, since it's part of the original Standard for Robot Exclusion.
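
For reference, a conforming parser groups consecutive User-agent lines into a single record; Python's urllib.robotparser, for example, handles the multiple-user-agent form:

from urllib import robotparser

# A multiple-user-agent record per the original 1994 standard: both
# agents named below share the single Disallow rule that follows.
rules = """\
User-agent: Slurp
User-agent: msnbot
Disallow: /cgi-bin
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)
print(rp.can_fetch("msnbot", "http://www.example.com/cgi-bin/x"))  # False
print(rp.can_fetch("Teoma", "http://www.example.com/cgi-bin/x"))   # True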

---

I'd also like to expand on 80Jenn's warning and remind everyone to be sure that your access control code does not deny any request for either your robots.txt file or your custom 403 error document. Denying access to either of these resources often causes trouble.

If you don't like "exposing" your robots.txt file to all requestors, then selectively serve different robots.txt files based on user-agent or IP address, etc., serving a simple "deny-from-all" file like the one tangor posted above to unknown or "suspicious" requestors.
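
A minimal sketch of that selective serving, written as a Python WSGI app; the whitelist and file bodies are illustrative, and in practice you would verify by IP or rDNS as well, since user-agents are easily spoofed:

# Serve the normal robots.txt to recognized crawlers and a deny-from-all
# file to everyone else.
KNOWN_BOTS = ("googlebot", "slurp", "msnbot", "teoma")

OPEN_ROBOTS = b"User-agent: *\nDisallow: /cgi-bin\n"
DENY_ROBOTS = b"User-agent: *\nDisallow: /\n"

def robots_app(environ, start_response):
    agent = environ.get("HTTP_USER_AGENT", "").lower()
    body = OPEN_ROBOTS if any(bot in agent for bot in KNOWN_BOTS) else DENY_ROBOTS
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]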


Jim

montclairguy

2:48 pm on May 8, 2010 (gmt 0)

10+ Year Member



As 80Jenn replied without addressing what the 80legs index is actually used for, I cannot see any benefit in allowing this spider on my e-commerce sites. Now's your chance, Jenn, to tell us how your index is beneficial to any e-commerce website. What publicly available search engines are powered by any of your data?

If I'm Joe Consumer, how does anything 80legs does get me to buy that widget I've always wanted from xyz_store.com?

80Jenn

4:11 pm on May 8, 2010 (gmt 0)

10+ Year Member



Before I launch into this, I want to point out that if we are crawling your site too fast, you can tell us how fast you'd like us to be crawling and we will modify that internally. And I also want to reiterate that we always honor robots disallows, for our bot and for all bots (*).

80legs does not actually have a content index, and we strongly discourage our customers from indexing the full content that they crawl--in writing (all over our sites), verbally, and through our pricing models. What 80legs is designed for is to do computation on web-scale data. So, taking an e-commerce site as an example, these would be some primary customer use-cases:

Search engine companies: They are one of the major users of 80legs. The search engines that use us are looking for very specific kinds of content. For example: PDF files, mobile phones available in the UK, sporting events, camping equipment, job listings. So, if a search engine was using 80legs to crawl an e-commerce site, they would check the content-type of each document, and possibly also search for keywords in each doc. The data that they'd get back from this 80legs crawl is the list of URLs crawled with the relevant keywords/content-type for each one. This use-case has a direct benefit for the e-commerce site in the form of a link back to the site.

People checking on content: For example, Ad companies use us to ensure that their ads are placed properly on their network of websites. If an e-commerce site uses advertising as a revenue stream, this benefits them indirectly by benefiting Ad agencies (keeping rates high). Another example of this 80legs use-case is people searching for IP violations (pirated images/songs). The data returned from this type of crawl would simply be a list of URLs, with a yes/no marker for compliance.

People who want to analyze lots of data: One of our customers, Extractiv, does very broad crawls across the web to gather information about how people feel about products, events, etc. For example, they might search for mentions of the word "iPhone" (perhaps in customer reviews on an e-commerce site), and then they process the data surrounding that word to determine if it is being mentioned favorably or unfavorably. This kind of analysis benefits everyone because it leads to Apple making better iPhones :) And it's also just really cool to see the results of some of these crawls, especially to us geeks :)

We go to a lot of trouble to ensure that our customers are using 80legs for legitimate purposes, and we know for a fact that a lot of them are doing really cool things using our platform. But, ultimately, if you want to disallow us, we'll respect your decision and not crawl you.

tangor

5:13 pm on May 8, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Sounds like everyone gets benefit except the website which has been crawled. I'll continue my disallow, and thank you for honoring it.

80Jenn

5:40 pm on May 8, 2010 (gmt 0)

10+ Year Member



Just wanted to clarify:

In my last post, the site being crawled is the e-commerce site. This site benefits because the search engines that use 80legs link back to it. And it is also indirectly benefited by Ad agencies' increased profitability. And the e-commerce site owner gets the new and improved iPhone, so that's a good benefit, too :)

montclairguy

7:34 am on May 9, 2010 (gmt 0)

10+ Year Member



"The data that they'd get back from this 80legs crawl is the list of URLs crawled with the relevant keywords/content-type for each one. This use-case has a direct benefit for the e-commerce site in the form of a link back to the site."

How can you possibly know that? Sorry, but I don't buy the search engine explanation. Any search engine anybody uses has built its end-user-accessible index (and likely continues to do so) without 80legs' involvement, especially since you discourage content indexing.

"...80legs is designed for is to do computation on web-scale data."

That pretty much says it all, and it has nothing to do with attracting business to my websites.

"People checking on content:"

Well, that's the nail in the coffin, and an automatic ban from my sites. That puts you among the ranks of Cyveillance, NetEnforcers, and the rest who earn their living by (often incorrectly) harassing legitimate business owners with automated takedown notices and ridiculous threats. As a legitimate, well-established, authorized reseller of so-and-so's products, we've been incorrectly harassed by those snakes over the years, wasting valuable time and resources defending our legitimate and authorized use of content, and I have no interest in furthering that type of research whatsoever.

I have disallowed your crawler in robots.txt, and it seems to be obeying that instruction. In case it gets unruly, I've banned it in other ways as well.

Thank you for your response.

aeronautic

9:21 pm on May 11, 2010 (gmt 0)

10+ Year Member



Since our crawler is distributed, it may take up to 3 hours for this change to take effect. After you add the disallow for our bot, we may check the robots file occasionally--at most every 3 hours, but it's highly unlikely that rate would be sustained for long periods of time.


Untrue in my experience. Over 24 hours have passed since I banned this parasite via robots.txt and they keep coming at the very same rate, not slowing at all, despite what she and their website claim.

I've seen it hit the robots file time after time too, so I know it has seen it. What would their botnet users think if their ISPs got complaint after complaint about this abusive conduct?

Their claims of benefits for webmasters fail on the merits. They and their botnet make money using our (well at least my) resources, without my permission and against my express wishes. This is highly unethical.

Since I originally posted the message above, these two hits came in:

Host: 206.193.198.90

/robots.txt
Http Code: 200 Date: May 11 14:25:58 Http Version: HTTP/1.1 Size in Bytes: 2592
Referer: -
Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 ( .NET CLR 3.5.30729)


/favicon.ico
Http Code: 200 Date: May 11 14:25:59 Http Version: HTTP/1.1 Size in Bytes: 3638
Referer: -
Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 ( .NET CLR 3.5.30729)


Searching for info on IP 206.193.198.90 [google.com] I see results associated with 80legs, but note the lack of any reference to it in the log entries above: no 008 user-agent.


The IP belongs to:

OrgName: BigCity Networks, Inc.
OrgID: BIGCIT-1
Address: 405 Main Street, Suite 700
City: Houston
StateProv: TX
PostalCode: 77002
Country: US

The crawler hits keep coming too.

80Jenn

10:05 pm on May 11, 2010 (gmt 0)

10+ Year Member



aeronautic-

We will reply privately to the email that you just sent us, but I also wanted to explain what happened here:

Your robots.txt file lists several User-agents per disallow statement. We don't currently support that format (we will add it to our parser), but we have internally blocked your site from being crawled. For now, if you want to block us in your robots.txt, you can add these lines:

User-agent: 008
Disallow: /

The IP address you put in your post is mine (we are located in Houston)--I went to check your robots.txt file in Firefox after receiving your email. We always use the 008 user-agent when we crawl, as well as when we check robots files.

aeronautic

10:36 pm on May 11, 2010 (gmt 0)

10+ Year Member



Jenn,

Thank you for your reply and actions.

However, you seem to have started a business whose crawler fails to support a standard honored by all the others on that very same disallow statement. A consolidated user-agent list is considered acceptable. Clearly you are operating a service that fails to meet acceptable standards of conduct. Or did.

The very existence of this nearly year-long, 3-page thread speaks to the concerns raised by your firm's conduct.

I consider your means of distributed crawling to be a highly unethical botnet. I find your exploitation of my efforts to be parasitic, not symbiotic. The effort I've been forced to make to stop you speaks to the roadblocks placed by your firm's poor ethical choices.

Do you pay my general business, hosting and bandwidth costs? No you don't.

Do you send me valuable traffic? No, you don't. Don't tell me you might, maybe. I've been publishing since '97. As others have said in this thread, for me at least, unless you are Google, Bing or Yahoo, you are not sending me income in exchange for any access to my content or use of my bandwidth. Your FAQ and other statements are grossly misleading to the gullible.

Again, thank you for taking the belated steps you have taken but I strongly encourage your firm to revise the ethics and choices that lead to our encounter.

jdMorgan

1:20 am on May 12, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Your robots.txt file lists several User-agents per disallow statement. We don't currently support that format (we will add it to our parser)

Many robots cannot handle multiple-user-agent policy records. That's a pity, since it's part of the original Standard for Robot Exclusion [robotstxt.org].

... since 1994 ...

Jim

tangor

1:41 am on May 12, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Jim... beat me by two years. Yet again I testify that 80legs (I'm not in their camp, though I also live in Houston) has OBEYED my general disallow following the first blowup in the middle of last year. I whitelist 5 bots; everything else is excluded. And the vast majority of bots obey... AWS bots don't, but that's another thread!

80Jenn, please do hang in there. We webmasters need to see what tangible benefit, as in traffic driven to our sites, your bot produces. Show that and you'll have a passel of supporters. Fail, there's no love, and some of these folks can get downright cantankerous. Seriously, show the benefit to us, the webmaster, first and not your clients who do nothing but scrape our content and run up our bandwidth. Do that and you'll have friends in the biz.

aeronautic

2:07 am on May 12, 2010 (gmt 0)

10+ Year Member



As you can see, they made the same claim in a direct e-mail sent to me:

FYI, the reason the robots.txt block didn't work is because we currently don't support the format you provided in your robots.txt. The standard format is single entry per user-agent, rather than grouping them, as you had done.


Yet their bot info and FAQ pages cite the standard directly, do they not? Have they read it?

Years ago I had single entries but it was actually harder for some bots to parse (perhaps due to the number of entries?) so I was advised to consolidate as I have.

FYI, here is the string that baffled 80legs:

User-agent: 008
User-agent: AboutUsBot
User-agent: Baiduspider+
User-agent: Becomebot
User-agent: FollowSite Bot
User-agent: sitecheck.internetseer.com
User-agent: OmniExplorer_Bot
User-agent: RB2B-bot
User-agent: SBIder
User-agent: StackRambler
User-agent: TurnitinBot
User-agent: Yandex
Disallow: /


As stated before, it works for everyone else but 80legs. Furthermore, I ran it through Google's webmaster tools just today and it did not even blink.

jdMorgan

2:14 am on May 12, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



One problem with distributed 'bots is that they have their own camp followers -- people who spoof the user-agent so that they can scrape sites, and all the blame goes to the (presumably legitimate) distributed robot.

That could be the case here for Webmasters reporting non-compliance with robots.txt. Or it could just be that as already established above, 80legs did not fully implement the Standard, and therefore appears to disregard robots.txt files with multiple-user-agent policy records (as defined by the Standard).

The only way out of the sack for distributed crawlers is validation using the 'secret key' method as published first here by LordMajestic for the Majestic 12 crawler (MJ12). But that requires an explicit opt-in by Webmasters -- who must sign up and specify their secret key. Since 80legs is based on the premise that bandwidth and server resources are free, and since many here disagree, I suspect that 80legs' mission is incompatible with an opt-in approach. (Not intending to kick anyone or burst any bubbles here, but rather just state the facts.)
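
Such opt-in validation might look like the following generic HMAC sketch; this illustrates the idea only, and is not necessarily MJ12's exact published scheme:

import hmac, hashlib

# Generic opt-in validation: the webmaster receives a shared secret at
# sign-up; each crawler request carries a token the site re-derives and
# compares. All names and values here are illustrative.
SHARED_SECRET = b"issued-to-webmaster-at-signup"

def token_for(url):
    return hmac.new(SHARED_SECRET, url.encode(), hashlib.sha256).hexdigest()

def is_registered_crawler(url, presented_token):
    # constant-time compare avoids leaking the token byte-by-byte
    return hmac.compare_digest(token_for(url), presented_token)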

Jim

shiondev

3:06 am on May 12, 2010 (gmt 0)

10+ Year Member



We webmasters need to see what tangible benefit, as in traffic driven to our sites, your bot produces. Show that and you'll have a passel of supporters. Fail, there's no love, and some of these folks can get downright cantankerous. Seriously, show the benefit to us, the webmaster, first and not your clients who do nothing but scrape our content and run up our bandwidth. Do that and you'll have friends in the biz.


While it's admittedly infeasible to list every single 80legs user's intent, we know what some of our larger users use our service for. These include:

1. Search engines: many of these folks are startups themselves and are trying to develop new search technologies.

2. Ad networks/platforms: some of these guys are trying to identify websites to include in their ad channels. Others are trying to protect their clients from posting ads on sites with undesired (which can mean many things) content.

3. Social media monitoring / sentiment analysis: these guys are tracking mentions of products and brands on websites and trying to determine what people are saying about these. Any interesting mentions will show up in their users' data feeds (typically with links back to source for full content if desired).

The top 2 have direct benefits to webmasters, while the 3rd one has an indirect benefit. We have other users actually trying to prevent content theft by running fingerprinting algorithms on our platform.

Of our paying customers, we know that all of them are using 80legs for legitimate purposes. The non-paying customers are severely limited in the amount of crawling they can do with us.

tangor

3:32 am on May 12, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



80Jenn, bless you for continuing this dialogue, but, dear heart, that's a load of hooey. None of the above send TRAFFIC to my site...NONE. I give about that much for "startup Search Engine Wannabees". Several of my sites have NO advertising (direct sales to the public). Social media means nothing to me.

Your clients are legal, I'm sure, but your clients send no benefit to the webmasters they troll, and your distributed method (piggybacked on user machines when "apparently idle") is an insult to intelligence. I do wish your company well but I do---most gently---wish you'd just go away.

caribguy

3:52 am on May 12, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Welcome back shiondev!

The non-paying customers are severely limited in the amount of crawling they can do with us.


I guess that statement echoes the sentiment of many webmasters precisely ;) Our "what's in it for me" question, exemplified by some of the comments here, deserves a much better answer.

The environment has changed: niche and startup search engines used to get the benefit of the doubt... Lately, it's become a shoot first and ask questions later situation. You only need to have a cursory look at some of the threads here to realize that webmasters are getting a "bit" jaded and cynical when they see others misappropriating their content and/or providing paid services that are based on free access to said content.

Hope that gives you some food for thought.

shiondev

4:15 am on May 12, 2010 (gmt 0)

10+ Year Member



@caribguy Like I said, it's unfeasible for us to go through our thousands of users and determine what each one is using 80legs for.

Based on just spot investigations, the few complaints we've received have come from websites that were part of large general crawls from 80legs. In other words, the 80legs user had no specific or real interest in the website that submitted the complaint. It just happened to be part of a large crawl.

As far as yours and others' comments on startups... what can I say, that's too bad, I guess.

80Jenn

4:25 am on May 12, 2010 (gmt 0)

10+ Year Member



tangor, that last 80legs post that you replied to wasn't from me. My goals here are to address specific questions and issues and to listen so we can make technical adjustments if needed (like with the robots issue identified today). I agree that the dialog probably isn't constructive :)

tangor

5:28 am on May 12, 2010 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



80Jenn, my apologies. That's why I said hang in there; we want the skinny. Your partner in crime shiondev seems to speak with a forked tongue. What we webmasters REQUIRE (to allow your crawls) is benefit. So far we see none. No links, no biz, no traffic. And worse, your underlying engine steals (er... uses) a distributed method cranking up (dare I say it?) carbon usage (others' electricity) to get your job done.

Just show me the money. That's all I'm asking.

shiondev

6:05 am on May 12, 2010 (gmt 0)

10+ Year Member



It seems our efforts to provide this community with insight into our business have been met with less than positive reception.

I was hopeful that engaging directly and openly with this community would help us create a better relationship, but this doesn't seem to be happening.

None of our competitors have done anything similar to this kind of effort (MJ12 is not a competitor). Most of them make it very easy to ignore robots.txt, spoof IPs and so on.

Since our efforts have been a disappointment, we'll be disengaging from this community. We're a small team and we can't afford to dedicate time here anymore. We are still interested in making our crawler the best it can be, but we recommend any future communication be done directly through our contact form at [80legs.com...] We are happy to respond to inquiries and suggestions that way.

Good bye, and thanks for all the fish.

aeronautic

6:48 am on May 12, 2010 (gmt 0)

10+ Year Member



Since our efforts have been a disappointment, we'll be disengaging from this community.


What an asinine statement. The offender blames the offended and then feigns offense as they exit with a version of "I'm taking my ball and going home."

If you want to achieve positive and productive results, truly implement ethical and responsible policies and dump your parasitic business model along with your "offend first, ask forgiveness later" values.

Failing to participate here with legitimately responsive replies, instead of the doublespeak and obfuscation offered so far, will only hurt your efforts. Without the cooperation of the webmaster community at large you have no business model. Parasites die without hosts.

You would do well, 80legs, to start your efforts to court webmasters anew instead of crying over what is, by web standards, very polite and completely legitimate criticism.

Your decision to make this statement speaks volumes:

Most of them make it very easy to ignore robots.txt, spoof IPs and so on.


Why would the misdeeds of others matter to your claimed legitimate operation or be useful as a reference for appropriate behavior? You are okay because others are worse? (shakes head)