
Home / Forums Index / Search Engines / Search Engine Spider and User Agent Identification
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

This 78 message thread spans 3 pages: < < 78 ( 1 [2] 3 > >     
Digsby IM Enables Web Crawlers Control of Your PC & Bandwidth
Plura Processing and 80Legs to Leverage Digsby Network
incrediBILL




msg:3986024
 3:24 pm on Sep 8, 2009 (gmt 0)

This is slightly complicated so follow along and read the entire post to grasp the full impact of this situation.

Digsby

Let's start at the beginning with this company called Digsby that creates this cutesy IM tool that is so cute many will just have to install it.

The problem is that Digsby has something built in that allows your computer to become part of an idle-CPU distributed computing network.

Do you read all that fine print?

Most people don't; they click NEXT, NEXT, NEXT and just install the thing.

Here's the fun part of the "Digsby Research Module":

[wiki.digsby.com...]
The module turns on after your computer has been completely idle for 5 minutes (no mouse or keyboard movement). It then turns off the instant you move your mouse or press a key on the keyboard.

Basically, if you install Digsby, they can hijack your CPU idle time for fun and profit including WEB CRAWLING!

Here's what they say right in their TOS:

[digsby.com...]
15. USAGE OF COMPUTER RESOURCES.

You agree to permit the Software to use the processing power of your computer when it is idle to run downloaded algorithms (mathematical equations) and code within a process. You understand that when the Software uses your computer, it likewise uses your CPU, bandwidth, and electrical power. The Software will use your computer to solve distributed computing problems, such as but not limited to, accelerating medical research projects, analyzing the stock market, searching the web, and finding the largest known prime number. This functionality is completely optional and you may disable it at any time.

Of course they like to wrap themselves in charitable terms such as cancer research; that must be a good thing, no?

The emphasis on stock market analysis and web search is mine; a far cry from cancer research, huh?

Some people really don't like Digsby:

[lifehacker.com...]
It Gets Even Worse: Your PC is Being Used Without Your Knowledge

You can debate the merits of bundled crapware, and brush away the despicable nature of preying on those lacking adequate tech skills, but did you realize that Digsby is also using your processor to make money?

Plura Processing

These guys are building out monetization methods for the Digsby network.

[pluraprocessing.wordpress.com...]
80legs is a good customer to talk about as an example because they’ve taken the compute power we give them, and they’ve built something pretty cool on top. 80legs is itself a startup, and they provide a Web-scale crawling and processing service.

Disclosure: Plura and 80legs share an investor, and 80legs has been of great help to us as a guinea pig

80legs

Lets you crawl up to 2 billion pages a day using the PCs of less-than-savvy computer owners.

[80legs.com...]
80legs runs on a 50,000-node grid computer. This means we have a whole lot of bandwidth and compute power for you to use. The system as a whole can crawl up to 2 billion pages per day. Our unique architecture gives us (and our users) inherent advantages when it comes to crawling the web.

Do the math here:

2B pages per day / 50K computers = 40K pages per computer per day!

Assuming average web pages are about 20K these days, that's 800MB downloaded per PC per day, and if you include images, Flash files, and PDFs in the crawl, going way over 1GB per PC per day is trivial.
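The arithmetic above can be sanity-checked in a few lines of Python. The 50,000-node and 2-billion-pages/day figures are 80legs' own claims; the 20K average page size is the post's assumption:

```python
# Back-of-the-envelope check of the per-PC crawl load.
# 50,000 nodes and 2B pages/day are 80legs' published figures;
# the 20 KB average page size is the post's assumption.
PAGES_PER_DAY = 2_000_000_000
NODES = 50_000
AVG_PAGE_KB = 20

pages_per_node = PAGES_PER_DAY // NODES            # pages per PC per day
mb_per_node = pages_per_node * AVG_PAGE_KB / 1000  # decimal MB per PC per day

print(pages_per_node, mb_per_node)  # 40000 800.0
```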

Potential Consumer Impact

Considering that most cable companies now have a fixed cap on your usage, and wireless broadband cards often carry a 5GB cap with no unlimited option, people are going to end up paying for this usage.

Rogers Cable in Canada, for instance, has a 60GB cap, but you can order lower-bandwidth plans for Grandma, such as the "Ultra Lite" with a 2GB monthly cap and $5.00 per additional GB. Imagine when Grandma, someone who probably has a very idle computer, installs Digsby and gets a potential $150 excess-bandwidth bill the next month! Grandma will definitely need her blood pressure medicine increased.
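A quick sketch of that billing scenario, using the post's hypothetical figures (2 GB cap, $5.00 per extra GB, and roughly 1 GB/day of idle-time crawl traffic; none of these are measured values):

```python
# Hypothetical overage bill on the "Ultra Lite" plan described above.
# The ~1 GB/day crawl figure is the post's estimate, not a measurement.
CAP_GB = 2
PRICE_PER_EXTRA_GB = 5.00
DAYS = 30
crawl_gb = 1.0 * DAYS  # ~1 GB/day of idle-time crawling over a month

overage_bill = max(0.0, crawl_gb - CAP_GB) * PRICE_PER_EXTRA_GB
print(f"${overage_bill:.2f}")  # $140.00, in the ballpark of the $150 above
```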

Potential Webmaster Concerns

Home computer users with Digsby installed may suddenly find their access to many websites restricted. Bot-blocking software may already be temporarily suspending site access for these hapless users' PCs, and if 80legs is successful, the bot-blocking battles will shift from data centers to actual home PCs, a massive transition in mind share in the bot-blocking world.

This isn't just theory; it's already happened on some of my own sites. A couple of visitors wrote wanting to know why they were being restricted. I sent them a log file of a high-speed crawl of hundreds of pages, and they denied any knowledge of the activity. While we don't know the source of this crawl yet, it's an example of what you can potentially expect moving forward if you have any anti-DoS software running on your site and 80legs comes knocking.

More importantly, stealth crawling will have reached a new pinnacle of unlimited penetration never before thought possible thanks to 80legs and Digsby's software.

If the experience with Amazon Web Services [webmasterworld.com] can be used as a guideline, I foresee that collecting and distributing lists of Digsby and 80legs customers for webmasters to block may be in the near future.

Guess we'll have to wait and see what happens.

 

IanKelley




msg:3986768
 5:59 pm on Sep 9, 2009 (gmt 0)

IanKelley - stuff the users - what about the web site owners? It's their bandwidth being abused as much as anyone's. We also have paid-for bandwidth limits, and distributed bots can make significant dents in them.

I'm not sure I understand this line of thought. The amount of bandwidth used by conventional spiders (Google, Bing, etc...) is the same or higher. The web needs spiders.

A large third party, affordable, distributed spidering service removes one of the major entry barriers to search and is also a fantastic research tool.

Imagine being able to sandbox test your amazing new search algo without having to pay for a server farm until after you know it works.

Or grab trend numbers from a sample of 1 billion websites to support your dissertation.

wilderness




msg:3986813
 6:50 pm on Sep 9, 2009 (gmt 0)

I'm not sure I understand this line of thought. The amount of bandwidth used by conventional spiders (Google, Bing, etc...) is the same or higher. The web needs spiders.

A large third party, affordable, distributed spidering service removes one of the major entry barriers to search and is also a fantastic research tool.

Imagine being able to sandbox test your amazing new search algo without having to pay for a server farm until after you know it works.

Or grab trend numbers from a sample of 1 billion websites to support your dissertation.

Research and/or third-party use is funded by grants and/or an organization directly benefiting from the results of the research (generally via future revenue).

Using webmasters' materials (pages) without their "formal" consent is outside the long-established parameters of research.

Using innocent web users' machines (most of whom have no clue what's taking place in their machine's and/or network's background) is an even worse abuse.

incrediBILL




msg:3986815
 6:51 pm on Sep 9, 2009 (gmt 0)

A large third party, affordable, distributed spidering service removes one of the major entry barriers to search and is also a fantastic research tool.

The problem is they don't appear to share the cached pages among multiple customers.

20 customer crawls = 20 downloads of the same site using the same user agent.

Makes ZERO sense to me, and it's the single item that will definitely keep them off my site until the day it's fixed.

shiondev




msg:3986884
 8:25 pm on Sep 9, 2009 (gmt 0)

frontpage,

If you contact us directly through our website (http://www.80legs.com/contact.html), we can work with you to figure out why this might be happening.

We have not had any trouble with 008 not respecting robots.txt, so I'm a bit surprised this is happening.

[edited by: shiondev at 8:45 pm (utc) on Sep. 9, 2009]

shiondev




msg:3986890
 8:31 pm on Sep 9, 2009 (gmt 0)

The problem is they don't appear to share the cache pages with multiple customers.

20 customer crawls = 20 downloads of the same site using the same user agent.

But we do a global, system-wide throttle. Without us, the alternative is these 20 companies pulling down your content at the same time and most likely not respecting robots.txt. It seems to me that that's a worse alternative.
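For illustration, a global per-host throttle of the kind described could look like the toy sketch below. This is an assumption about the general technique (one shared rate limit per host across all customers), not 80legs' actual implementation, which isn't documented in the thread:

```python
import time

# Toy model of a global per-host throttle: every customer crawl of a
# host shares one rate limit, so adding customers does not multiply
# the request rate against that site. Illustrative only.
class HostThrottle:
    def __init__(self, min_interval_s):
        self.min_interval = min_interval_s
        self.next_ok = {}  # host -> earliest allowed fetch time

    def try_acquire(self, host, now=None):
        """Return True if a fetch of this host is allowed right now."""
        now = time.monotonic() if now is None else now
        if now >= self.next_ok.get(host, 0.0):
            self.next_ok[host] = now + self.min_interval
            return True
        return False

t = HostThrottle(min_interval_s=10.0)
print(t.try_acquire("example.com", now=0.0))   # True: first fetch allowed
print(t.try_acquire("example.com", now=5.0))   # False: within the interval
print(t.try_acquire("example.com", now=12.0))  # True: interval elapsed
```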

incrediBILL




msg:3986947
 10:08 pm on Sep 9, 2009 (gmt 0)

But we do a global, system-wide throttle.

That's still not the point!

One page download is enough even if 20 companies want my site crawled.

Throttle schmottle, you're not downloading 200K pages per customer!

Not happening!

If you create a master cache of everything you download and only download a page a second time after a day or a week (or whatever the caching rules for the site are), you save the webmasters a lot of bandwidth, as well as the PCs doing the downloading.

Work smart, not hard, and don't strain resources more than absolutely necessary.

Can you say shared CACHE? I knew you could... ;)
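The shared-cache idea can be sketched as a toy model: one real download serves every customer until the entry expires. The `SharedCrawlCache` class and its TTL below are illustrative, not anything 80legs ships:

```python
import time

# Toy model of a shared crawl cache: the first customer request for a
# URL triggers one real download; later requests within the TTL are
# served from the cache, so 20 customer crawls cost the site one fetch.
class SharedCrawlCache:
    def __init__(self, ttl_seconds, fetch):
        self.ttl = ttl_seconds
        self.fetch = fetch          # real downloader, injected for testing
        self.store = {}             # url -> (fetched_at, body)
        self.downloads = 0          # how many real fetches hit the site

    def get(self, url):
        now = time.time()
        hit = self.store.get(url)
        if hit and now - hit[0] < self.ttl:
            return hit[1]           # cache hit: no load on the webmaster
        body = self.fetch(url)
        self.downloads += 1
        self.store[url] = (now, body)
        return body

# 20 customers crawl the same page; the site sees a single download.
cache = SharedCrawlCache(ttl_seconds=86400, fetch=lambda u: f"<html>{u}</html>")
for _ in range(20):
    cache.get("http://example.com/page")
print(cache.downloads)  # 1
```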

dstiles




msg:3986951
 10:12 pm on Sep 9, 2009 (gmt 0)

What 20 companies? Earlier implications were it was private users, not companies. Are you now saying that actual companies benefit from your crawling? If so, in what way?

As I understood it, you were saying the pages were for your own benefit. Now I read into your latest posting that the bot users also view the pages. Which, as far as I (again) understand it, they probably didn't a) want and b) know were being grabbed in the first place, at least from a specific web site.

Leosghost




msg:3986959
 10:33 pm on Sep 9, 2009 (gmt 0)

Madame Leosghost says "it sounds like what you all call scumware" ( parasitic programs ) ..and "it's got 80 legs" ..Madame Leosghost ( average internet user ..in spite of my efforts ) says anything parasitic with 80 legs gets stomped on ..and should have poison put in its lair(s) ..then when she realised that it costs bandwidth ( she knows what bandwidth is ..it's like electricity ..someone has to pay for it ) to users and sites ..she said "ca va pas non!" ( which in this context means wt*! ) and "he's actually trying to defend this?" "It should be illegal.. this thing 80legs" ( here ..it would be ) ..

maybe even in the USA it is ?..and if not ..it should be ..

thanks ...incrediBILL for your time and your posts :)

shiondev




msg:3986984
 11:24 pm on Sep 9, 2009 (gmt 0)

As I understood it, you were saying the pages were for your own benefit.

I'm sorry if you interpreted what I said this way, but this was never intended by my posts. We are a business and we have customers. It's fairly clear from our website who we are and what we do.

To recap:

- We are a web-crawling service. We offer our users the ability to customize their crawls and process content on the web without needing to buy and setup their own data center.

- We use Plura, a grid computing system, which was created by our sister company.

- Plura utilizes the idle CPU time of many PCs for a variety of purposes. One of those purposes is web crawling done by us (for our customers).

- Plura requires all of their affiliates to notify their users that Plura is running. It's true that Digsby wasn't doing the best job of that, but that will be changing soon.

- We try very hard to be a good, properly-behaving web crawler. Webmasters are perfectly capable of controlling access to our user-agent, 008. If 008 is causing problems with a site, we are more than happy to work with the webmaster to fix the problem.

I think what upsets a lot of people is that our technology, if left unchecked, could be very harmful. But it isn't unchecked. In fact, there are several layers of checks and security measures at each level in the stack.

mack




msg:3986991
 11:43 pm on Sep 9, 2009 (gmt 0)

One suggestion...

If-Modified-Since would mean you only need to crawl pages that have actually changed since the last crawl. Cache them. Everyone wins.

Mack.
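A minimal sketch of mack's If-Modified-Since idea, modeling only the decision logic. There is no real HTTP here, and the helper name is made up for illustration:

```python
from email.utils import formatdate, parsedate_to_datetime

# Sketch of the If-Modified-Since suggestion: a crawler remembers each
# page's Last-Modified stamp and sends it back on the next visit. A
# "304 Not Modified" reply means the cached copy is reused and the body
# is never re-downloaded. should_redownload() is a toy server-side
# view of that decision; the function name is invented for this sketch.
def should_redownload(last_modified, if_modified_since):
    if if_modified_since is None:
        return True  # first crawl: no stamp yet, full fetch
    return parsedate_to_datetime(last_modified) > parsedate_to_datetime(if_modified_since)

stamp = formatdate(1252368000, usegmt=True)     # e.g. the previous crawl time
print(should_redownload(stamp, None))            # True: first crawl
print(should_redownload(stamp, stamp))           # False: unchanged, reuse cache
```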

IanKelley




msg:3986995
 12:03 am on Sep 10, 2009 (gmt 0)

If you create a master cache of everything you download and then only download it a 2nd time after a day or week or whatever the caching rules were for the site, you save the webmasters a lot of bandwidth as well as the PCs doing the downloading.

Good idea. It might not be justified, though, until they're actually handling those 20 simultaneous orders for general web spidering. I'm guessing it's not at that point right now.

IanTurner




msg:3987022
 1:08 am on Sep 10, 2009 (gmt 0)

Smells like a botnet trying to be legal to me.

swa66 - we need to know why there is any problem with this?

[edited by: incrediBILL at 2:05 am (utc) on Sep. 10, 2009]
[edit reason] formatting [/edit]

frontpage




msg:3987282
 12:58 pm on Sep 10, 2009 (gmt 0)

I am confused now. Is the user-agent

"008"

or is it this

Mozilla/5.0 (compatible; 80bot/0.71; [80legs.com...]
?

also found this one.

Mozilla/5.0 (compatible; 008/0.83; [80legs.com...] Gecko/2008032620

[edited by: incrediBILL at 3:36 pm (utc) on Sep. 10, 2009]
[edit reason] disabled smileys [/edit]
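Whatever the exact string, both variants quoted above carry an 80legs marker that a log filter can key on. A sketch follows; the full URLs in the sample strings are reconstructed for illustration, since the thread shows them truncated:

```python
import re

# Both user-agent variants seen in the thread contain an 80legs marker
# ("008/", "80bot/", or an 80legs.com URL), so one pattern catches them.
# Sample strings below are illustrative reconstructions; real-world UA
# strings may differ.
UA_80LEGS = re.compile(r"(?:\b008/|\b80bot/|80legs\.com)")

samples = [
    "Mozilla/5.0 (compatible; 80bot/0.71; http://www.80legs.com/spider.html)",
    "Mozilla/5.0 (compatible; 008/0.83; http://www.80legs.com/crawler.html) Gecko/2008032620",
    "Mozilla/5.0 (Windows NT 6.0) Gecko/20090824 Firefox/3.5.3",
]
flags = [bool(UA_80LEGS.search(ua)) for ua in samples]
print(flags)  # [True, True, False]
```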

blend27




msg:3987321
 1:52 pm on Sep 10, 2009 (gmt 0)

@shiondev

1. If I use Digsby, are the pages I visit added to a "crawl to index" list on your end?
2. Every time I STARTED the program, it downloaded UPDATES to my computer without asking whether I wanted them, and only then let me sign in to my account(s). Sometimes it takes 2 minutes. I want to know what "WENT IN" and what "WENT OUT".
3. If you use my bandwidth, I want to know what pages your 80-legged spider visited while I was "reading the paper".

shiondev




msg:3987353
 2:52 pm on Sep 10, 2009 (gmt 0)

@frontpage: We originally used user-agent 80bot, but changed to 008. We respect robots.txt directives for both.

@blend27: Pages you visit are not stored/cached/recorded by us. In fact, our software works completely in memory and actually cannot see or touch the hard drive.

The updates Digsby runs are something to do with Digsby, not us. On occasion, we have asked them to update something related to how to run our technology, but that's only happened once or twice.

If you're interested in seeing what pages are being crawled through Digsby, it may be possible with a network monitoring tool. I don't think it's possible for us to provide that information. We only check IP addresses for the bandwidth monitoring during the crawl; that information isn't recorded anywhere (for security/anonymization purposes).
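For webmasters who would rather not rely on good behavior alone, here is how a robots.txt block of the 008 user-agent would be evaluated by Python's stdlib parser. The robots.txt body is illustrative:

```python
import urllib.robotparser

# Evaluating a robots.txt that blocks 80legs' "008" user-agent (the
# token shiondev names above) while allowing everyone else outside
# /private/. The robots.txt content here is an illustrative example.
ROBOTS_TXT = """\
User-agent: 008
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("008", "http://example.com/index.html"))          # False
print(rp.can_fetch("Mozilla/5.0", "http://example.com/index.html"))  # True
```

Of course this only helps against a crawler that actually honors robots.txt, which is exactly the behavior under discussion.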

incrediBILL




msg:3987393
 3:38 pm on Sep 10, 2009 (gmt 0)

We only check IP addresses for the bandwidth monitoring during the crawl; that information isn't recorded anywhere (for security/anonymization purposes).

In other words my initial assumption was correct.

You don't have a clue how much bandwidth the customer has used, and you can cause overuse charges.

shiondev




msg:3987400
 3:45 pm on Sep 10, 2009 (gmt 0)

Sorry, I misspoke: our node servers keep track of IP addresses for each month, but that information isn't connected to what pages are crawled.

So we do have a clue :) But just enough to monitor bandwidth.. nothing else.

incrediBILL




msg:3987471
 5:44 pm on Sep 10, 2009 (gmt 0)

You have a clue what YOU crawled, but no clue what the user has done, nor the bandwidth cap of their plan if lower plans are offered, as Rogers Cable does.

shiondev




msg:3987475
 5:50 pm on Sep 10, 2009 (gmt 0)

We know what we have crawled, how much each IP address has crawled, and what the bandwidth cap is for each IP address we use.

We don't know what each IP address has crawled.

mack




msg:3987518
 7:19 pm on Sep 10, 2009 (gmt 0)

OK, right now I am using my HSDPA connection. I have an 8-gig fair-use cap. What if I downgraded tomorrow to the lowest package?

I would then only have a gig per month, but nothing in my network identity would change. I know you said earlier you only crawl on networks where you have information about the packages, but my scenario could apply to any wireless or wired network.

Going over to the dark side: do you monitor users' internet usage to work out how much bandwidth they use in an average month, then base your crawling usage on this data?

I ask this because I can see no other way to have any idea at all about how much bandwidth a user has available.

Mack.

shiondev




msg:3987524
 7:32 pm on Sep 10, 2009 (gmt 0)

I believe we only use IP addresses that are using an ISP with no bandwidth caps or whose plans all have the same cap (i.e., no differences in caps between plans).

We do not track all Internet usage per IP (in fact, we have no way of doing this). We assume we can use a fairly low % and go with that. As I said before, this is a risk we are taking. We do have a few things in place to mitigate it.

shiondev




msg:3987530
 7:38 pm on Sep 10, 2009 (gmt 0)

I just remembered that Plura has some plans to allow affiliates to control bandwidth usage on their end, so hopefully PC users will be able to set their Plura bandwidth usage in the near future.

Leosghost




msg:3987557
 8:13 pm on Sep 10, 2009 (gmt 0)

this is a risk we are taking.

yeah, with other people's money ..they, not you, pay their bandwidth charges ..

I just remembered

you are.. I believe ..trying to extract the yellow liquid ..'bout par from someone whose business model is based on tricking people into downloading and installing scumware .

shiondev




msg:3987572
 8:37 pm on Sep 10, 2009 (gmt 0)

I'm not sure what the yellow liquid is, but perhaps I haven't clarified well enough..

yeah with other peoples money ( they ) ,not you ...pay their bandwidth charges

Yes, they would end up paying extra bandwidth charges, but the risk we are taking is that if this happens, we will face significant media and PR pressure that would affect our business. Obviously we want to avoid this, not just for business purposes but also because we don't want to harm anyone, and are trying our hardest to never let this happen.

someone whose business model is based on tricking people into downloading and installing scumware

1. We don't operate the grid computing network, our sister company, Plura Processing, does. 80legs could migrate to another infrastructure platform if necessary - it isn't bound to Plura.

2. Plura doesn't trick anyone into installing anything. They require full disclosure of the use of their technology and there is no separate installation. It's just a Java process that gets kicked off and runs in memory.

Leosghost




msg:3987657
 10:24 pm on Sep 10, 2009 (gmt 0)

1. We don't operate the grid computing network, our sister company, Plura Processing, does. 80legs could migrate to another infrastructure platform if necessary - it isn't bound to Plura.

In which case Plura would do what? ..you create a left hand ..to facilitate a right hand ..via the same financing body ..( thus the same owners ) ..and then try to claim they are separate, unconnected entities who could exist independently ..that's implausible deniability :) ..despite what your lawyers may have told you ( and after all, if someone sues your corporate asses off... you still have to pay ( or have already paid ) your lawyers ) ..that is why they are lawyers

They require full disclosure of the use of their technology

Which apparently they don't police ..hence your statement
Plura requires all of their affiliates to notify their users that Plura is running. It's true that Digsby wasn't doing the best job of that, but that will be changing soon

My bold :) ..

"Will do better ..soon" ..doesn't cut it ..

"Please judge ..I promise I won't allow my affs to scam again ..so ignore what they ( with my connivance ..because how else could I run my business model ) ..have been doing up till now" ..won't fly ..here ..and hopefully elsewhere ..like in a court :)

From Digsby today ( apparently you aren't keeping them informed of PR matters ) ..
Plura: They have not launched yet but once they do, you can contact them via their website for full details about the projects that are running on Plura
..so ..you can't actually get a clear answer today from Plura about what it ( and you ) are doing ..( but you have been running for how long? ) ..ah yes ..being honest, up front and clear about what you are doing isn't a priority ..till incrediBILL or someone notices you, and then it's all hands to damage control and let's wheel out "misremember" and "misspeak" and other slipperiness ..wrong venue here ..:)

Apart from this "tinsel" ..
revenue model that conducts research similar to the projects mentioned above

the Nobel prize isn't in the post

Ps ..you ever work for the gator folks ?..or claria ? the "perfume" seems familiar ;)

shiondev




msg:3987976
 2:54 pm on Sep 11, 2009 (gmt 0)

In which case Plura would do what ?

Plura is pursuing other customers as well. :)

They require full disclosure of the use of their technology

Which apparently they don't police

Actually, as a result of the Digsby issue, Plura instituted an affiliate auditing process. Taken directly from their blog:

Additionally, we are instituting a new auditing process to ensure that our Terms of Use are being met. We pay our affiliates every month based on the number of cycles that they send our way, and each month, every affiliate who receives a check of $50 or more from Plura will be audited for compliance by a member of our team before being paid. If the Terms of Use are not met, the affiliate agreement will be voided and the relationship will be ended.

"Will do better ..soon"..doesnt cut it

I'm sure we can all agree that if someone realizes they made a mistake, it's nice to give them the chance to correct it.

From Digsby today ( apparently you are n't keeping them informed of PR matters ..

Plura They have not launched yet but once they do, you can contact them via their website for full details about the projects that are running on Plura

Plura has been stable and running for over a year now. I'm actually not sure why they said this.

Apart from this "tinsel" ..
revenue model that conducts research similar to the projects mentioned above

the Nobel prize isn't in the post

I believe Plura is attending the Supercomputing 2009 conference. One of their goals is to get research groups and projects signed up for their system. There are definitely some potential applications. I personally think the BLAST [en.wikipedia.org ] algorithm could be a fit for Plura.

Ps ..you ever work for the gator folks ?..or claria ? the "perfume" seems familiar ;)

Like the yellow liquid, I am not sure what you're talking about here. :)

wilderness




msg:3987986
 3:12 pm on Sep 11, 2009 (gmt 0)

Ps ..you ever work for the gator folks ?..or claria ? the "perfume" seems familiar

They are euphemisms.

for ?ator or ?laria, just go to archive.org, insert a leading WWW, then add a .com, and go back to 2002.

"online behavioral marketing. ?ator enables consumers to download and use"

"online behavioral marketing".

blend27




msg:3987996
 3:34 pm on Sep 11, 2009 (gmt 0)

--- Ps ..you ever work for the gator folks ?..or claria ? the "perfume" seems familiar ;) ---

I did some work for them indirectly, and this is very, very similar to "Let's Run Till We Get Caught."

I don't believe for a second that a company that hires software developers to design software for such a "scale of use" would make a mistake like that. It's not 1999 anymore. Use your favorite search engine for "Gator Claria" if you're not sure what that "was".

Leosghost




msg:3987999
 3:37 pm on Sep 11, 2009 (gmt 0)

Additionally, we are instituting a new auditing process to ensure that our Terms of Use are being met. We pay our affiliates every month based on the number of cycles that they send our way, and each month, every affiliate who receives a check of $50 or more from Plura will be audited for compliance by a member of our team before being paid. If the Terms of Use are not met, the affiliate agreement will be voided and the relationship will be ended.

It isn't working, then, because Digsby is clearly in breach of your TOS ..I downloaded their package, which still includes your/Plura's payload, less than 2 minutes before my previous post ..
so ..
we are instituting
..is doublespeak for "we haven't done it yet" .."but will at some time in the future" ..and so for now we'll let our affs continue as before ..because actually that suits our model better ..

I'm sure we can all agree that if someone realizes they made a mistake, it's nice to give them the chance to correct it.

There is a world of difference between a mistake and scumware ..and as yet you have corrected nothing ..because Digsby is still touting your wares ..and the actual initiation of your policy hasn't happened yet ..

and the rest is squirming and pious words ..while using consumers' bandwidth to line your pockets under the guise of hinting at cancer research ..

I think swa66 had it right ..you do sound remarkably like a botnet trying to pretend to be legit now that it's been found out ..

with enough exposure to the light ( again, thank you incrediBILL ) you may just make all the best malware lists, and then your installations will trip all the AVs with a "warning: this software you are installing may cost you considerably in bandwidth, and therefore you may incur considerably higher charges from your internet service provider ..are you sure you wish to continue with the installation?" ..

Btw ..my ISP imposes no caps on me ( I'm regularly into the tens of terabytes per month and not a peep out of them ) ..so that end of your operation doesn't impinge upon me ..but since this thread began you are banned from my sites ..not because of what you might indirectly cost me there with your parasitic software ..my hosting has very generous plans ..you are banned just on principle ...

Leosghost




msg:3988004
 3:47 pm on Sep 11, 2009 (gmt 0)

@blend27 ..came into the light ..feel the sun on your back .. :))
@wilderness ..I typed those names ( in full ) using my washable elastomer keyboard ( luna blue ) that sysygy suggested to me ..prevents "coffee moments" :) from disrupting workflow ..and one can safely type the names of nasty software ..and some politicians ..without sullying one's fingers :)

wilderness




msg:3988009
 3:50 pm on Sep 11, 2009 (gmt 0)

and one can safely type the names of nasty software

My reason for adding the obscurity was that one of the domains has been re-assigned and is currently active.


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved