Digsby IM Enables Web Crawlers Control of Your PC & Bandwidth
Plura Processing and 80Legs to Leverage Digsby Network
incrediBILL




msg:3986024
 3:24 pm on Sep 8, 2009 (gmt 0)

This is slightly complicated so follow along and read the entire post to grasp the full impact of this situation.

Digsby

Let's start at the beginning with a company called Digsby, which makes a cutesy IM tool so cute that many people will just have to install it.

The problem is that Digsby has something built in that turns your computer into part of an idle-CPU distributed computing network.

Do you read all that fine print?

Most people don't; they skip through it, NEXT, NEXT, NEXT, just install this thing.

Here's the fun part of the "Digsby Research Module":

[wiki.digsby.com...]
The module turns on after your computer has been completely idle for 5 minutes (no mouse or keyboard movement). It then turns off the instant you move your mouse or press a key on the keyboard.

Basically, if you install Digsby, they can hijack your CPU idle time for fun and profit including WEB CRAWLING!

Here's what they say right in their TOS:

[digsby.com...]
15. USAGE OF COMPUTER RESOURCES.

You agree to permit the Software to use the processing power of your computer when it is idle to run downloaded algorithms (mathematical equations) and code within a process. You understand that when the Software uses your computer, it likewise uses your CPU, bandwidth, and electrical power. The Software will use your computer to solve distributed computing problems, such as but not limited to, accelerating medical research projects, analyzing the stock market, searching the web, and finding the largest known prime number. This functionality is completely optional and you may disable it at any time.

Of course they like to wrap themselves in charitable terms such as cancer research; that must be a good thing, no?

The emphasis on stock market analysis and web search is mine; a far cry from cancer research, huh?

Some people really don't like Digsby:

[lifehacker.com...]
It Gets Even Worse: Your PC is Being Used Without Your Knowledge

You can debate the merits of bundled crapware, and brush away the despicable nature of preying on those lacking adequate tech skills, but did you realize that Digsby is also using your processor to make money?

Plura Processing

These guys are building out monetization methods for the Digsby network.

[pluraprocessing.wordpress.com...]
80legs is a good customer to talk about as an example because they’ve taken the compute power we give them, and they’ve built something pretty cool on top. 80legs is itself a startup, and they provide a Web-scale crawling and processing service.

Disclosure: Plura and 80legs share an investor, and 80legs has been of great help to us as a guinea pig

80legs

Lets you crawl up to 2 billion pages a day using the PCs of less than savvy computer owners.

[80legs.com...]
80legs runs on a 50,000-node grid computer. This means we have a whole lot of bandwidth and compute power for you to use. The system as a whole can crawl up to 2 billion pages per day. Our unique architecture gives us (and our users) inherent advantages when it comes to crawling the web.

Do the math here:

2B pages per day / 50K computers = 40K pages per computer per day!

Assuming the average web page is about 20K these days, that's 800MB downloaded per PC per day, and if you include images, Flash files and PDFs in the crawl, using way over 1GB per PC per day is trivial.
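
To make that arithmetic easy to check, here is a quick back-of-the-envelope script in Python; the 20K page size and the 50K/200K node counts are simply the figures quoted in this thread, not measured values.

# Back-of-the-envelope: per-PC crawl volume for a 2B-pages/day grid.
# All figures are the ones quoted in this thread, not measurements.
PAGES_PER_DAY = 2_000_000_000
AVG_PAGE_KB = 20  # assumed average HTML page size

for nodes in (50_000, 200_000):
    pages_per_node = PAGES_PER_DAY / nodes
    mb_per_node = pages_per_node * AVG_PAGE_KB / 1000
    print(f"{nodes:>7} nodes: {pages_per_node:,.0f} pages/day, ~{mb_per_node:,.0f} MB/day per PC")

# 50,000 nodes -> 40,000 pages/day, ~800 MB/day per PC
# 200,000 nodes -> 10,000 pages/day, ~200 MB/day per PC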

Potential Consumer Impact

Considering that most cable companies now put a fixed cap on your usage, and that wireless broadband cards typically come with a 5GB cap and no unlimited option any more, people are going to end up paying for this usage.

Rogers Cable in Canada, for instance, has a 60GB cap, but you can order lower-bandwidth plans for Grandma, such as the "Ultra Lite" with a 2GB monthly cap and $5.00 per additional GB. Imagine when Grandma, someone who probably has a very idle computer, installs Digsby and gets a potential $150 excess bandwidth bill the next month! Grandma will definitely need her blood pressure medicine increased.
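
Putting the Grandma scenario in numbers (a rough sketch: the ~1GB/day crawl rate is the estimate from above, and the plan terms are the quoted Rogers "Ultra Lite" figures):

# Hypothetical monthly bill on a 2GB "Ultra Lite" style plan, using the
# ~1 GB/day crawl estimate from above. Illustrative figures only.
CAP_GB = 2
PER_GB_FEE = 5.00
crawl_gb = 1.0 * 30  # roughly 1 GB/day for a month
excess_gb = max(0, crawl_gb - CAP_GB)
print(f"Excess: {excess_gb:.0f} GB -> ${excess_gb * PER_GB_FEE:.2f} in overage charges")
# Excess: 28 GB -> $140.00 in overage charges ("way over 1GB/day" pushes it past $150)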

Potential Webmaster Concerns

Home computer users with Digsby installed may suddenly find their access restricted to many websites. The problem here is that bot-blocking software may already be temporarily suspending access to sites for the PCs of these hapless users, and if 80legs is successful, the bot-blocking battles will shift from data centers to actual home PCs, a massive transition in mind share in the bot-blocking world.

This isn't just theory; it's already happened on some of my own sites. A couple of visitors wrote wanting to know why they were being restricted, and I sent them a log file of a high-speed crawl of hundreds of pages; they denied any knowledge of this activity. While we don't know the source of this crawl yet, this is an example of what you can potentially expect moving forward if you have any anti-DOS software running on your site and 80legs comes knocking.
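
For webmasters wondering what that kind of automatic suspension looks like, here is a minimal sketch in Python of the per-IP rate check most anti-DOS / bot-blocking tools apply; the 60-second window, 30-request threshold and Apache-style log format are assumptions for illustration, not anyone's actual configuration.

# Minimal sketch: flag IPs making an unusually fast run of requests.
from collections import defaultdict
from datetime import datetime

WINDOW_SECONDS = 60   # sliding window (arbitrary example value)
THRESHOLD = 30        # requests per window that trigger a flag (arbitrary)
hits = defaultdict(list)  # ip -> list of request timestamps

with open("access.log") as log:
    for line in log:
        parts = line.split()
        if len(parts) < 4:
            continue
        ip = parts[0]
        ts = datetime.strptime(parts[3].lstrip("["), "%d/%b/%Y:%H:%M:%S")
        hits[ip].append(ts)

for ip, times in hits.items():
    times.sort()
    for i, start in enumerate(times):
        in_window = [t for t in times[i:] if (t - start).total_seconds() <= WINDOW_SECONDS]
        if len(in_window) >= THRESHOLD:
            print(f"{ip}: {len(in_window)} requests in {WINDOW_SECONDS}s - candidate for a temporary block")
            break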

More importantly, stealth crawling will have reached a new pinnacle of unlimited penetration never before thought possible thanks to 80legs and Digsby's software.

If the experience with Amazon Web Services [webmasterworld.com] can be used as a guideline, I foresee that collecting and distributing lists of Digsby and 80legs customers for webmasters to block may be in the near future.

Guess we'll have to wait and see what happens.

 

steve40




msg:3986043
 3:43 pm on Sep 8, 2009 (gmt 0)

incrediBILL,

Your knowledge of bots never ceases to amaze me. The things you pick up and share on WebmasterWorld, helping us mere mortals understand how the net, and more importantly the use of bots, is changing, add great information to this forum.

Thanks
Steve

shiondev




msg:3986100
 5:15 pm on Sep 8, 2009 (gmt 0)

incrediBill,

I work on 80legs. I thought I should point out a few things that affect your analysis.

1. The number of computers in the distributed grid actually fluctuates. 50,000 is the average number we have seen connected at any given time, but it can be as high as 200,000 during certain points in the day.

2. We have built bandwidth-monitoring technology into our crawler. 80legs will never use more bandwidth than what a given computer's bandwidth cap is. We keep up-to-date records on current ISP bandwidth plans and caps and only use computers that are using an ISP for which we know the plan. We will never use a computer in a way that risks going over the cap.

3. Digsby has changed their install process since the Lifehacker article was posted. It is much easier to disable Plura now. They are also working on an entirely new installer that will show Plura during the install process, making it immediately apparent. Granted, it would be good if this is done asap, but from what I understand they have to work through several business and technical issues to make this happen.

4. Our user-agent, 008, obeys robots.txt, so webmasters can control access to their sites.
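
For reference, if 008 does honor robots.txt as described, shutting it out should only take the usual two lines (using the user-agent token given above):

User-agent: 008
Disallow: /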

Zamboni




msg:3986114
 5:35 pm on Sep 8, 2009 (gmt 0)

If you've already used 50% of a user's monthly bandwidth, how can the user still use 60 or 70% of their own bandwidth?

mack




msg:3986118
 5:41 pm on Sep 8, 2009 (gmt 0)

We have built bandwidth-monitoring technology into our crawler. 80legs will never use more bandwidth than what a given computer's bandwidth cap is. We keep up-to-date records on current ISP bandwidth plans and caps and only use computers that are using an ISP for which we know the plan. We will never use a computer in a way that risks going over the cap.

How can you be sure of this? How can you tell how close a user is to reaching their limit if they have been surfing, downloading MP3s, etc.? I can see what you mean about not being the only reason a user will go over their limit, but you can't say with certainty that you won't push them over.

Mack.

Webwork




msg:3986123
 5:47 pm on Sep 8, 2009 (gmt 0)

it likewise uses your CPU, bandwidth, and electrical power

It never ceases to amaze me how many new and profitable business models are being based upon externalizing costs, i.e., getting something for "free".

UCE - Unsolicited Commercial Email. UGC - User Generated Content. Now there's UCE3: User Contributed Electricity, Electronics and Electronic Data Exchange.

Time to coin a new phrase having to do with meeting the growing costs of feeding and funding the expenses of the "free Web".

Something other than bankruptcy. :(

shiondev




msg:3986125
 5:55 pm on Sep 8, 2009 (gmt 0)

We use very, very little (far below 50%) of each computer's bandwidth, which is why we are fairly sure we haven't gone over any caps. I agree that we can't say with 100% certainty, but we are very conservative with our usage.

incrediBILL




msg:3986132
 6:09 pm on Sep 8, 2009 (gmt 0)

Hi shiondev, and welcome to the spider forum, but I think there are still some oversights in your explanations.

I thought I should point out a few things that affect your analysis.

I took those stats directly off your website so you might want to update the information if you don't want to give people the wrong idea.

Even if you can access up to 200K machines per day, that's still an average of 10K pages per machine to reach your stated goal of 2B pages per day, and any home computer crawling 10K pages per day runs a good chance of being blocked by many of our webmasters here.

You might want to rethink the liability of causing someone's computer to be banned by various services for abuse.

80legs will never use more bandwidth than what a given computer's bandwidth cap is

How do you know if my bandwidth cap is 2GB/mo, 5GB/mo or 250GB/mo?

You can't possibly know which plan I purchased just because of the network I'm using.

If you start costing someone bandwidth charges I would expect trouble to quickly follow.

Here's a quick FOR INSTANCE of where you'll get burned unless you test the network being used each time: a home computer using Comcast (250GB cap) until there's a network outage and it switches to Sprint Broadband (5GB cap), or likewise a laptop that anchors at home on Comcast, travels with Sprint, and uses various pay-to-play wifi spots.

Our user-agent 008

Here's a link for other webmasters to get the FAQ specifics [80legs.pbworks.com] of 80legs:
[80legs.pbworks.com...]

So let me get this straight, no matter what customer requests the crawl you only show "008" so we don't know who's requesting you to crawl on their behalf?

Bad idea, that will get you blocked on principle alone by many webmasters.


BTW, what happens if someone installs Digsby on their office machine and then 80legs crawls all sorts of NSFW sites, or so many sites that the company thinks the employee is doing nothing but surfing the web and fires them for goofing off?

Also, could you imagine someone trying to explain to the police sting operation why their computer was attempting to access illegal sites (nude kids?) that your software inadvertently crawled?

Another consideration is whether you're the only company using the Digsby network. If someone else uses the same network to crawl that you're using, and isn't nearly as nice about it, they could easily hobble your business unless you're the exclusive crawling agent on the Digsby network.


Last but not least, after reading your wiki it sounds like you crawl a domain per customer request and don't share the cached pages among multiple customers, correct?

If you had 20 customers all requesting the same site to be crawled at the same time, would 80legs actually crawl the site 20 individual times?

If so, there will be some major screaming from webmasters.

incrediBILL




msg:3986140
 6:25 pm on Sep 8, 2009 (gmt 0)

very little (far below 50%) of each computer's bandwidth

My bandwidth currently goes up to 16Mbps and will hit 50-100Mbps within a year due to the network upgrades being deployed.

At those speeds you can exceed my monthly bandwidth cap in as little as 55 hours (110 hours at 50% usage) of downloading in a month or after the upgrade in as little as 10 hours (or 20 hours at 50%).

That means that the breadth of your available network as claimed is a quickly diminishing resource even if you only use 50% of the bandwidth per machine and try to stay below bandwidth caps.
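
The "diminishing resource" point is easy to sanity-check with a rough formula; the 250GB cap below is the Comcast figure mentioned earlier in the thread (a placeholder, not necessarily anyone's actual plan), and the speeds are the ones quoted above.

# Hours of continuous downloading needed to exhaust a monthly cap.
def hours_to_cap(cap_gb, mbps, utilization=1.0):
    gb_per_hour = mbps / 8 * 3600 / 1000 * utilization  # Mbps -> GB per hour
    return cap_gb / gb_per_hour

for mbps in (16, 50, 100):
    print(f"{mbps:>3} Mbps: {hours_to_cap(250, mbps):5.1f} h at full rate, "
          f"{hours_to_cap(250, mbps, 0.5):5.1f} h at 50% usage")
# 16 Mbps -> ~35 h (~69 h at 50%); 100 Mbps -> ~5.6 h (~11 h at 50%)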

Now explain how you're doing anything good for the planet by keeping those 50K-200K idle PCs awake?

Those machines should be in low power mode, possibly hibernation or shutdown, certainly not doing your work.

It would be easier to be a GREEN company by using a traditional data center and a few racks of servers vs. 200K machines you're keeping awake crawling the web for "free".

[edited by: incrediBILL at 6:28 pm (utc) on Sep. 8, 2009]

mack




msg:3986142
 6:27 pm on Sep 8, 2009 (gmt 0)

We use very, very little (far below 50%) of each computer's bandwidth

I still don't see how you have any idea how much bandwidth a customer uses, and at the end of the day you have probably cost people money!

Even if it was 50% you are still using half of an allowance a user has paid for.

They are also working on an entirely new installer that will show Plura during the install process, making it immediately apparent. Granted, it would be good if this is done asap, but from what I understand they have to work through several business and technical issues to make this happen

Business and technical issues? More likely purely a business issue; the more hidden Plura is, the better. There's no technical issue: you could build a new installer in an afternoon... if you wanted to.

Mack.

shiondev




msg:3986161
 7:00 pm on Sep 8, 2009 (gmt 0)

I took those stats directly off your website so you might want to update the information if you don't want to give people the wrong idea.

Please keep in mind that we have to paraphrase many things for the purposes of marketing. "50,000 computers" is easier to understand than "A range between XX and YY computers, depending on time of day..." :)

your stated goal of 2B pages per day

That is not the number of pages we crawl every day. That is what the system is capable of if used at 100% utilization and with a "perfect" distribution of domains being crawled.

Here's a quick FOR INSTANCE of where you'll get burned unless you test the network being used each time such as a home computer using Comcast (250GB cap) until there's a network outage and it switches to Sprint Broadband (5GB cap) or likewise a laptop that anchors at home on Comcast, travels with Sprint, and uses various pay to play wifi spots.

The crawls won't happen while the computer is using the Sprint card. We only accept a very small # of ISPs that we know how to analyze.

So let me get this straight, no matter what customer requests the crawl you only show "008" so we don't know who's requesting you to crawl on their behalf?

Bad idea, that will get you blocked on principle alone by many webmasters.

Is it better to let our customers specify the user-agent? It seems that would require multiple entries in your robots.txt file then. We might be able to append a unique ID or something, but we don't have any plans to specifically say the name of the customer. If webmasters feel the need to block us for this, that's just something we'll have to live with, unfortunately.

BTW, what happens if someone installs Digsby on their office machine and then 80legs crawls all sorts of NSFW sites, or so many sites that the company thinks the employee is doing nothing but surfing the web and fires them for goofing off?

While this is a risk for us, it is fairly unlikely. Most corporate PCs are behind one or a few IPs. Because of this, not many crawl requests would be sent. Additionally, the employee would have to be running one of Plura's affiliates' programs, which are either IM clients, games, or charityware applications. This also mitigates the risk somewhat.

Also, could you imagine someone trying to explain to the police sting operation why their computer was attempting to access illegal sites (nude kids?) that your software inadvertently crawled?

We of course hope that 80legs is never used for such purposes and will comply with any legal authorities if it was. I hope that forensics would see the actual crawl requests being made were identified with our user-agent, which would be a pretty good tip-off as to what was happening.

unless you're the exclusive crawling agent on the Digsby network.

We are.

Last but not least, after reading your wiki it sounds like you crawl a domain per customer request and don't share the cached pages among multiple customers, correct?

If you had 20 customers all requesting the same site to be crawled at the same time, would 80legs actually crawl the site 20 individual times?

If so, there will be some major screaming from webmasters.

The requests from each customer will not happen all at once. We throttle requests per domain on a system-wide basis. So if there are 20 customers requesting 1 domain, each of their crawls will go very slowly.
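
shiondev doesn't describe the implementation, but a system-wide, per-domain throttle of the kind he's describing can be sketched roughly like this; the one-request-every-N-seconds interval and the function names are made up for illustration.

# Illustrative sketch of a system-wide per-domain throttle: however many
# customer jobs want a domain, requests to it are spaced out globally.
import time
from urllib.parse import urlparse

MIN_INTERVAL = 10.0   # seconds between requests to any one domain (arbitrary)
last_fetch = {}       # domain -> time of the last request to it

def throttled_fetch(url, fetch):
    domain = urlparse(url).netloc
    wait = last_fetch.get(domain, 0) + MIN_INTERVAL - time.time()
    if wait > 0:
        time.sleep(wait)  # every job queues behind the same per-domain clock
    last_fetch[domain] = time.time()
    return fetch(url)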

Now explain how you're doing anything good for the planet by keeping those 50K-200K idle PCs awake?

Plura does not keep these PCs awake. If the system is "idle", it is still running normally - there just isn't any user input. If the computer actually enters sleep mode, then Plura does not run.

I still don't see how you have any idea how much bandwidth a customer uses

We've built some interesting technology around this, and unfortunately I'm not the one that worked on it, so I don't know all the details. At the end of the day, I can't overstate how little each computer is used for a crawl. It really is a tiny, tiny amount.

business and technical issues? more likely purely a business issue, the more hidden Plura is the better, no technichal issue, you could build a new installer in an afternoon... if you wanted to.

It's not as simple as you may think. Digsby uses a third-party installer product, so they'd have to re-program and re-configure their software to use a new third-party installer product. And in any case, the business issues of switching between installer products are relevant.

Hopefully I'm doing a decent job of providing information on how 80legs works. I completely understand the controversial nature of what we're doing, but I feel many people jump to conclusions because of that. We are trying very hard to keep things secure, safe and free of negative impact. There are some risks involved, but we have implemented many safeguards to mitigate them.

bwnbwn




msg:3986173
 7:20 pm on Sep 8, 2009 (gmt 0)

shiondev, to me it is a complete invasion of my privacy. I wonder if there are some little holes in the software that can be exploited?

A 3rd-party installer? Jeeze, enough is enough. I can see this is a bad thing that has the ability to get real nasty.

Just another reason I won't add this kind of stuff to any machine I own.

blend27




msg:3986220
 8:11 pm on Sep 8, 2009 (gmt 0)

OK, I'm a guilty one here too; I had Digsby up until 2 minutes ago.

whoisgregg




msg:3986240
 8:42 pm on Sep 8, 2009 (gmt 0)

accelerating medical research projects, analyzing the stock market, ..., and finding the largest known prime number

So what's the largest known prime number that you've found so far? Which medical research projects have you accelerated?

incrediBILL




msg:3986251
 9:10 pm on Sep 8, 2009 (gmt 0)

The crawls won't happen while the computer is using the Sprint card.

I have a Sprint USB device always online and I switch over the minute Comcast goes down.

How would you even know that quickly?

Are you checking per each crawl request?

Most corporate PCs are behind 1 or few IPs. Because of this, not many crawl requests would be sent.

So you're telling me that if only a single person in the company had Digsby installed that you still wouldn't crawl as much?

Somehow I suspect you're only discovering that information when 2 people have it installed on the same IP address.

Besides, in some companies I've worked for it would only take a couple of hits to an NSFW site being logged and the employee would be reprimanded or possibly fired.

What about teens installing this and zapping the family bandwidth, or the parents getting network reports that the teen had been all over a bunch of explicit websites?

I see a lot of potential trouble here.

FWIW, I don't allow unauthorized software on company machines so installing Digsby alone would be grounds for termination ;)

If the computer actually enters sleep mode, then Plura does not run.

Then how do you plan to keep your network active with all these new PCs coming out with defaults set to 'sleep' in about 5-10 minutes of idle time unless you do something to keep them active?

Wouldn't be much of a network if it was all slumbering.

shiondev




msg:3986256
 9:21 pm on Sep 8, 2009 (gmt 0)

Then how do you plan to keep your network active with all these new PCs coming out with defaults set to 'sleep' in about 5-10 minutes of idle time unless you do something to keep them active?

Wouldn't be much of a network if it was all slumbering.

From what I hear from the Plura folks, they get at least 1-2 requests from potential affiliates to join the network. This would certainly help mitigate the risk you bring up here.

What's interesting is that Plura has been very explicit on their TOU with the people/companies requesting to become affiliates. I.e., they've been making sure the affiliates make Plura obvious to end-users. You might think that this would turn affiliates away, but they don't seem to mind. So the network is growing, with end-users that are explicitly aware of Plura's use.

ken_b




msg:3986257
 9:27 pm on Sep 8, 2009 (gmt 0)

shiondev

"50,000 computers" is easier to understand than "A range betweeen XX and YY computers, depending on time of day.."

Do you really think your users are that stupid?

incrediBILL




msg:3986267
 9:51 pm on Sep 8, 2009 (gmt 0)

From what I hear from the Plura folks, they get at least 1-2 requests from potential affiliates to join the network. This would certainly help mitigate the risk you bring up here.

My question was about how you are keeping IDLE machines from going to SLEEP after 5-10 minutes of inactivity when you need to be using them.

IanKelley




msg:3986289
 10:12 pm on Sep 8, 2009 (gmt 0)

I can't help but say a few words in support of the witch before you all light the bonfire.

From a tech perspective it sounds to me as if they are doing a very conscientious job of making their application low-impact. The code that shiondev is referring to represents a lot of thought and work.

We've built some interesting technology around this, and unfortunately I'm not the one that worked on it, so I don't know all the details. At the end of the day, I can't overstate how little each computer is used for a crawl. It really is a tiny, tiny amount.

I believe this is true, I have no reason not to. Does someone have evidence to the contrary?

Hopefully I'm doing a decent job of providing information on how 80legs works. I completely understand the controversial nature of what we're doing, but I feel many people jump to conclusions because of that. We are trying very hard to keep things secure, safe and free of negative impact. There are some risks involved, but we have implemented many safeguards to mitigate them.

Again does someone have any actual evidence that this isn't true? It sounds to me as if it is.

And finally the most important thing... Digsby is OPT IN. It sounds like they were a bit sneaky in terms of hiding it in the fine print, but now that they have stopped doing that I don't see the problem.

As long as nothing is hidden then there is absolutely nothing wrong with asking for users to provide something in return for free software, they are free to uninstall, or never install in the first place.

dstiles




msg:3986291
 10:13 pm on Sep 8, 2009 (gmt 0)

What is the actual UA for this? Reading the above it seems to be simply 008 (is that really zeros or oo?) - or is it digsby?

I already block 80legs. The following is a single sequential log access from a couple of days ago for 80legs:

08:17:50 76.114.39.nnn
08:17:57 98.215.232.nnn
08:17:58 24.91.234.nnn
08:18:00 76.19.25.nnn
08:18:01 173.64.76.nnn
08:18:03 69.136.73.nnn

All hits to the home page of a single site for which it had received 403's two days previously. All IPs were blocked automatically so none of the hitters would subsequently be able to view any site on our server. Which may or may not upset the fools running the scrape-bot but at least I saved them some bandwidth. :)

So the bots co-operate to the extent that one IP receiving a 403 is relieved by another IP etc.

There was a similar pattern for another site and singles or doubles on a few more. Apart from the one above, these sites all shared a single IP.
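
Not dstiles' actual setup, but the general idea (harvest the IPs behind a distributed crawl from the access log and feed them to a denylist) can be sketched like this; the combined log format, the "008/" user-agent test and the file names are assumptions.

# Sketch: collect IPs whose requests carried the 008 user-agent or drew a 403,
# mirroring the "blocked automatically" behaviour described above.
import re

LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"')

blocked = set()
with open("access.log") as log:
    for line in log:
        m = LOG_PATTERN.match(line)
        if not m:
            continue
        ip, status, user_agent = m.groups()
        if re.search(r"\b008/", user_agent) or status == "403":
            blocked.add(ip)

with open("denylist.txt", "w") as out:
    out.writelines(ip + "\n" for ip in sorted(blocked))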

dstiles




msg:3986292
 10:16 pm on Sep 8, 2009 (gmt 0)

IanKelley - stuff the users - what about the website owners? It's their bandwidth being abused as much as anyone's. We also have paid-for bandwidth limits, and distributed bots can make significant dents in that.

incrediBILL




msg:3986299
 10:35 pm on Sep 8, 2009 (gmt 0)

Now to throw another wrinkle into the fray...

Does 80legs plan to offer a way for webmasters to validate that it's actually 80legs making the crawl and not someone faking or spoofing their user agent?

Majestic12, another distributed crawler, recently announced a mechanism to allow webmasters to validate MJ12bot [webmasterworld.com], so what say you 80legs?

And finally the most important thing... Digsby is OPT IN. It sounds like they were a bit sneaky in terms of hiding it in the fine print, but now that they have stopped doing that I don't see the problem.

Yahoo's toolbar is opt-in too and I've accidentally installed it a few times skipping through an install in a hurry.

Besides, most users aren't savvy enough to know what's going on half the time; the computer is an appliance, and all that techno-babble just confuses them, so they tend to click OK just to make the box go away.

shiondev




msg:3986332
 12:35 am on Sep 9, 2009 (gmt 0)

We saw the Majestic12 post and were really intrigued by it. We're pretty busy getting our product to post-beta, but once we do, those validation techniques are on the list of things to do.

And thanks to Ian Kelley for some support :)

swa66




msg:3986342
 12:50 am on Sep 9, 2009 (gmt 0)

Smells like a botnet trying to be legal to me.

To get that many subscribers, it just seems too good to be true.

brotherhood of LAN




msg:3986500
 7:40 am on Sep 9, 2009 (gmt 0)

Interesting thoughts. I wouldn't be quick to condemn 'the ability to spider the web through a distributed crawling network', but there are some important issues that have already been brought up in the thread.

The ability to verify real versus spoofed UAs would appear to be fairly important.

Regarding bandwidth issues, it would make sense and be transparent for the installer/software to provide the user with an option to cap how much bandwidth is dedicated to the spidering cause.

slartythefirst




msg:3986529
 9:40 am on Sep 9, 2009 (gmt 0)

This is fascinating. I know of research projects (well, one research project) that use volunteers' PCs for data crunching, but this is a whole new ball game.

It's the equivalent of a valet parking service using your car for its car hire and taxi business without telling you (or contributing to fuel and wear & tear).

Hiding such important information in the T&Cs and claiming that the user was told, would not stand up in a UK Court. Such important information would have to be made primary and pointed out in a way that could not be misunderstood.

Drew




msg:3986667
 3:04 pm on Sep 9, 2009 (gmt 0)

I tried other IM clients and actually prefer Digsby. Disabling the use of your computer by Digsby is very easy. Simply go to Help > Support Digsby and disable the "Help Digsby conduct research" option.
(you can also access this via the preferences screen by clicking the bottom button labeled "support Digsby")

The folks at Digsby should be able to make money, just not by straining the resources of website owners/caretakers.

I used to use Trillian and they had a free & pro version. Digsby should use a similar model. The free version of Digsby could be ad supported and/or have limited functionality and the pro version would cost a bit of dough.

Long term this (imo) is a much better business model, one that will be less likely to end up in civil or criminal proceedings.

pageoneresults




msg:3986669
 3:14 pm on Sep 9, 2009 (gmt 0)

It is much easier to disable Plura now. They are also working on an entirely new installer that will show Plura during the install process, making it immediately apparent. Granted, it would be good if this is done asap, but from what I understand they have to work through several business and technical issues to make this happen.

I'm sure this entire topic would just cease to exist if the above read...

It is much easier to enable Plura now. They are also working on an entirely new installer that will show Plura during the install process, making it immediately apparent, and allowing you to enable it.

Enabling anything by default during an install is a sneaky underhanded way to get things installed, activated, etc. I've buzzed through installs before only to uninstall immediately thereafter. Guess what I've uninstalled in the process? It starts with a d. ;)

IncrediBILL, I think you're sleeping with the bots.

mack




msg:3986677
 3:26 pm on Sep 9, 2009 (gmt 0)

I agree that opt in would be a much better solution than opting out. Most people who use your application will simply have no idea what it is doing.

I don't have a lot of knowledge of distributed crawlers, but not only does the app need to spider the content, it needs to return the data to you. Does this not mean you need quite a lot of bandwidth at your end? In theory, all the combined spidered data needs to be returned to you (again using someone's internet connection). If you already have large-scale bandwidth available, why not spider?

Incidentally shiondev, welcome to WebmasterWorld

Bet you're glad you joined us :)

Mack.

frontpage




msg:3986755
 5:20 pm on Sep 9, 2009 (gmt 0)

After reading this thread, I added the following filter to our mod security 2.x rules.

SecRule REQUEST_HEADERS:User-Agent "008" "phase:2,deny,status:406"

And sure enough our mod security logs started filling up with 406 errors due to this bot attempting to grab data without a single request for robots.txt.

Access denied with code 406 (phase 2). Pattern match "008" at REQUEST_HEADERS:User-Agent. [file "/usr/local/apache/conf/modsec2.user.conf"] [line "163"]
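
One caution about matching a bare "008" anywhere in the User-Agent: 2009-era Firefox builds carry a Gecko/2008xxxxxxxx date stamp, so an unanchored substring match risks catching ordinary browsers. A quick illustration in Python (the exact 80legs user-agent string is something to verify against your own logs):

import re
firefox_ua = ("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.3) "
              "Gecko/2008092417 Firefox/3.0.3")
print("008" in firefox_ua)                       # True  - a plain substring match hits a real browser
print(bool(re.search(r"\b008/\d", firefox_ua)))  # False - requiring the "008/version" token does not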
