Crawler Speed Problem

Download speed falls sharply after 1-2 hours

Gustaf

8:04 pm on Mar 19, 2007 (gmt 0)

I have a crawler that downloads web pages from different domains, usually about 100 domains at any one time. It runs on two different servers (both with Fedora Core 5) with very good links to the internet (100 Mbit/s).

The problem is that after running for about two hours, the download speed falls from 2-5 Mbit/s to 10 kbit/s! After the crawler has been stopped for about an hour, the speed picks up again.

The servers aren't under load and there is no visible reason for the speed to drop.

The connection is stable: when I download files at the same time, the speed is normal.

There is no such problem on Windows machines. This means that the problem is most likely with Linux. I have tried turning the firewall off, but it doesn't help.

Any ideas as to what may be the reason / cure?

DamonHD

9:06 pm on Mar 19, 2007 (gmt 0)

If they're not *your* domains then it's probably the Webservers protecting themselves against content theft.

Are you doing this with permission?

Rgds

Damon

plumsauce

9:24 pm on Mar 19, 2007 (gmt 0)

"If they're not *your* domains then it's probably the Webservers protecting themselves against content theft."

Um, how can this be, given the following?

"There is no such problem on windows machines."

Advice to the OP: one thing to look for in your code, and in the way it interacts with Linux, is resource exhaustion. Are you handling resources in the same way, and do the resources behave the same way under both OSes?
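As a rough sketch of how one might check for this on Linux: every open file, socket, and pipe of a process shows up under /proc/<pid>/fd, so counting those entries over time shows whether the crawler is leaking descriptors. The function names and the `proc_root` parameter below are made up for illustration; nothing here comes from the OP's crawler.

```python
import os


def count_open_fds(pid="self", proc_root="/proc"):
    """Count open descriptors (files, sockets, pipes) of a process.

    On Linux each open descriptor is an entry in /proc/<pid>/fd.
    When pid is "self", the count may include the descriptor that
    is opened briefly to read the directory itself.
    proc_root is a hypothetical parameter, mainly useful for testing.
    """
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def soft_fd_limit():
    """Return the soft per-process descriptor limit (Unix only)."""
    import resource  # not available on Windows

    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def fd_usage_ratio(open_count, limit):
    """Fraction of the descriptor limit currently in use."""
    return open_count / limit


if __name__ == "__main__":
    used = count_open_fds()
    limit = soft_fd_limit()
    print(f"{used} of {limit} descriptors in use "
          f"({fd_usage_ratio(used, limit):.1%})")
```

Logging this number every few minutes from inside the crawler (or for the crawler's pid from outside) should show fairly quickly whether it keeps climbing until the slowdown starts. If it stays flat, descriptor exhaustion is probably not the cause.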

The alternative of course is to use what works, Windows, and get on with life. :)

jtara

11:59 pm on Mar 19, 2007 (gmt 0)

To clarify, do you mean that if you run the crawler from a Windows box, it works OK, but if you run the crawler from either of the two Linux machines, you have the problem? (As opposed to: there is no problem if the server you are crawling is using Windows.)

If you only occasionally do this from a Windows machine, the problem could still be that the websites are protecting themselves from an unwanted crawler. Do your machines have distinct, permanent IP addresses on the Internet? (At least your Linux machines?)

If so, perhaps the web sites (or some of them) have identified these addresses as housing a "problem bot". But no problem with the Windows machine, which you seldom use, because they haven't caught you coming from that box yet.

Gustaf

8:33 pm on Mar 20, 2007 (gmt 0)

"you run the crawler from a Windows box, it works OK, but if you run the crawler from either of the two Linux machines, you have the problem?"

Yes. The code is exactly the same, so the problem seems to be somewhere in the Linux socket layer. Maybe someone who knows Linux well can give some advice.
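One way to look at the socket layer directly is to count TCP connections by state. The sketch below parses /proc/net/tcp, where the fourth column is the connection state as a hex code. The function name is made up for illustration, and the file covers IPv4 only (/proc/net/tcp6 has the same layout for IPv6).

```python
from collections import Counter

# Hex state codes used by the Linux kernel in /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED",
    "02": "SYN_SENT",
    "03": "SYN_RECV",
    "04": "FIN_WAIT1",
    "05": "FIN_WAIT2",
    "06": "TIME_WAIT",
    "07": "CLOSE",
    "08": "CLOSE_WAIT",
    "09": "LAST_ACK",
    "0A": "LISTEN",
    "0B": "CLOSING",
}


def count_tcp_states(lines):
    """Count connections per TCP state from /proc/net/tcp lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Data rows start with a slot number such as "0:";
        # this skips the header row and any blank lines.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        counts[TCP_STATES.get(fields[3], "UNKNOWN")] += 1
    return counts


if __name__ == "__main__":
    with open("/proc/net/tcp") as f:
        for state, n in count_tcp_states(f).most_common():
            print(f"{state:12} {n}")
```

If the CLOSE_WAIT count keeps growing while the crawler runs, that usually points to sockets the program never closes. A very large TIME_WAIT count would instead suggest that connections are being opened and closed faster than the kernel recycles them. Neither is certain to be the cause here, but both are cheap to rule out.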

"If you only occasionally do this from a Windows machine, the problem could still be that the websites are protecting themselves from an unwanted crawler. Do your machines have distinct, permanent IP addresses on the Internet?"

The crawler downloads just a few pages from each domain, and the Windows box is used at least as much as the Linux boxes, if not more, so I don't think this is the problem.

Besides, only a few web servers have advanced crawler protection.