
Linux, Unix, and *nix like Operating Systems Forum

    
Crawler Speed Problem
Download speed falls sharply after 1-2 hours
Gustaf
8:04 pm on Mar 19, 2007 (gmt 0)

I have a crawler that downloads web pages from different domains, usually from about 100 domains at any one time. It runs on two different servers (both with Fedora Core 5), each with a very good link to the internet (100 Mbit).

The problem is that after running for about two hours, the download speed falls from 2-5 Mbit/s to 10 Kbit/s! After the crawler has been stopped for about an hour, the speed picks up again.

The server isn't loaded and there are no visible reasons for the speed to drop.

The channel itself is stable: if I download files some other way at the same time, the speed is normal.

There is no such problem on Windows machines, which suggests the problem lies with Linux. I have tried turning the firewall off, but it doesn't help.

Any ideas as to what may be the reason / cure?
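A quick way to pin down when the fall-off starts is to log per-request throughput over time. A minimal sketch, assuming Python (the thread never says what the crawler is written in) and a placeholder URL:

import time
import urllib.request

TEST_URL = "http://example.com/"  # placeholder; use one of the crawled domains

def measure_once(url):
    """Fetch one page and return (bytes received, seconds taken)."""
    start = time.time()
    with urllib.request.urlopen(url, timeout=30) as resp:
        data = resp.read()
    return len(data), time.time() - start

# Sample once a minute; the log shows when speed collapses relative to start time.
while True:
    nbytes, secs = measure_once(TEST_URL)
    print(f"{time.strftime('%H:%M:%S')}  {nbytes * 8 / 1000 / secs:,.0f} kbit/s")
    time.sleep(60)

Run alongside the crawler, this shows whether the slowdown tracks the crawler's runtime (a client-side resource problem) or the clock (something external).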

 

DamonHD
9:06 pm on Mar 19, 2007 (gmt 0)

If they're not *your* domains, then it's probably the web servers protecting themselves against content theft.

Are you doing this with permission?

Rgds

Damon

plumsauce
9:24 pm on Mar 19, 2007 (gmt 0)

If they're not *your* domains, then it's probably the web servers protecting themselves against content theft.

Um... how can this be, given the following?

There is no such problem on windows machines.

Advice to the OP: one thing to look for, in your code and in the way it interacts with Linux, is resource exhaustion. Are you handling resources in the same way, and do those resources behave the same way under both OSes?
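On Linux, the two easiest exhaustion signals to check are open file descriptors and sockets piling up in TIME_WAIT. A sketch that reads both straight from /proc (standard Linux paths; the script itself is not from the thread):

import os

def open_fd_count(pid="self"):
    """File descriptors currently held by the given process."""
    return len(os.listdir(f"/proc/{pid}/fd"))

def time_wait_count():
    """Sockets in TIME_WAIT: state 06 in the fourth column of /proc/net/tcp."""
    count = 0
    with open("/proc/net/tcp") as f:
        next(f)  # skip the header line
        for line in f:
            if line.split()[3] == "06":
                count += 1
    return count

print("open fds:", open_fd_count())  # pass the crawler's PID instead of "self"
print("TIME_WAIT sockets:", time_wait_count())

A crawler opening hundreds of short-lived connections can exhaust either resource; if the counts climb steadily toward the file-descriptor ulimit or the ephemeral port range while the crawler runs, that is the smoking gun.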

The alternative, of course, is to use what works (Windows) and get on with life. :)

jtara
11:59 pm on Mar 19, 2007 (gmt 0)

To clarify, do you mean that if you run the crawler from a Windows box, it works OK, but if you run it from either of the two Linux machines, you have the problem? (As opposed to: there is no problem if the server you are crawling is running Windows.)

If you only occasionally do this from a Windows machine, the problem could still be that the websites are protecting themselves from an unwanted crawler. Do your machines have distinct, permanent IP addresses on the Internet? (At least your Linux machines?)

If so, perhaps the web sites (or some of them) have identified those addresses as housing a "problem bot". There's no problem with the Windows machine, which you seldom use, because they haven't caught you coming from that box yet.
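One way to test this hypothesis from the Linux box (a sketch; the 403/429 heuristics are assumptions, since sites can also throttle silently):

import urllib.error
import urllib.request

url = "http://example.com/some-page"  # placeholder for a page on a crawled domain

try:
    with urllib.request.urlopen(url, timeout=30) as resp:
        # 200 OK but a slow transfer points back at the client side.
        print(resp.status, resp.headers.get("Retry-After"))
except urllib.error.HTTPError as e:
    # An explicit 403 or 429 (or a Retry-After header) suggests the site
    # has flagged this IP as a problem bot.
    print("refused:", e.code, e.headers.get("Retry-After"))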

Gustaf
8:33 pm on Mar 20, 2007 (gmt 0)

if you run the crawler from a Windows box, it works OK, but if you run it from either of the two Linux machines, you have the problem?

Yes. The code is exactly the same, so the problem seems to be somewhere in the Linux socket layer. Maybe someone who knows Linux well can give some advice.

If you only occasionally do this from a Windows machine, the problem could still be that the websites are protecting themselves from an unwanted crawler. Do your machines have distinct, permanent IP addresses on the Internet?

The crawler downloads just a few pages from each domain, and the windows box is used at least as much as the linux boxes, if not more, so I don't think this is the problem.

Besides, very few web servers have advanced crawler protection.
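If the Linux socket layer really is the suspect, the classic cause of exactly this kind of fall-off is ephemeral-port / TIME_WAIT pressure. A sketch for reading the relevant kernel settings (standard /proc/sys paths; the parameter list is a suggestion, not from the thread):

def sysctl(name):
    """Read a kernel parameter from /proc/sys."""
    with open("/proc/sys/" + name.replace(".", "/")) as f:
        return f.read().strip()

# How many ephemeral ports outgoing connections can use, how long
# sockets linger after close, and whether TIME_WAIT slots are reused.
for key in ("net.ipv4.ip_local_port_range",
            "net.ipv4.tcp_fin_timeout",
            "net.ipv4.tcp_tw_reuse"):
    print(key, "=", sysctl(key))

If the port range is small and the crawler opens and closes connections rapidly, outgoing connects end up stalling while they wait for free ports, which from the application's side looks exactly like a throughput collapse that clears once the crawler rests.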
