The problem is that after the crawler has been running for about two hours, the download speed falls from 2-5 Mbit/s to 10 kbit/s! After the crawler has been stopped for about an hour, the speed picks up again.
The server isn't loaded and there are no visible reasons for the speed to drop.
The channel is stable: when I download files at the same time, the speed is normal.
There is no such problem on Windows machines, which means the problem is most likely with Linux. I have tried turning the firewall off, but it doesn't help.
Any ideas as to what may be the reason / cure?
If they're not *your* domains then it's probably the web servers protecting themselves against content theft.
Um... how can this be, given the following?
There is no such problem on Windows machines.
Advice to the OP: one thing to look for in your code, and in the way it interacts with Linux, is resource exhaustion. Are you handling resources in the same way, and do the resources behave the same way under both OSes?
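To make the resource-exhaustion check concrete, here is a minimal sketch that could be run on the Linux box while the crawler is slow. It assumes the standard Linux /proc layout; the function names are made up for illustration and have nothing to do with the OP's crawler. It counts TCP sockets per state from /proc/net/tcp (IPv4 only, and system-wide rather than per process) and compares the open file descriptors of the calling process against its soft limit. A large pile of TIME_WAIT or CLOSE_WAIT sockets, or a descriptor count close to the limit, would point toward a leak rather than the network.

```python
import os
from collections import Counter

# State codes used in the "st" column of /proc/net/tcp on Linux.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state, given the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or malformed lines
        counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts


def fd_usage():
    """Return (open_fds, soft_limit) for the calling process (Linux only)."""
    import resource  # Unix-only module, imported here on purpose

    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return len(os.listdir("/proc/self/fd")), soft


if __name__ == "__main__":
    with open("/proc/net/tcp") as f:
        print(dict(count_tcp_states(f.read())))
    print("open fds / soft limit:", fd_usage())
```

Note that fd_usage() reports on whichever process calls it, so to be useful it would have to be called from inside the crawler itself (or adapted to read /proc/<pid>/fd for the crawler's PID).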
The alternative of course is to use what works, Windows, and get on with life. :)
If you only occasionally do this from a Windows machine, the problem could still be that the websites are protecting themselves from an unwanted crawler. Do your machines have distinct, permanent IP addresses on the Internet? (At least your Linux machines?)
If so, perhaps the web sites (or some of them) have identified these addresses as housing a "problem bot". But no problem with the Windows machine, which you seldom use, because they haven't caught you coming from that box yet.
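One way to tell remote throttling apart from a local problem is to log throughput per host and see how widespread the slowdown is. The sketch below is only an illustration: it assumes the crawler can record (host, bytes received, seconds taken) for each fetch, and the function names and the threshold are hypothetical. If nearly every host is slow at once, the cause is more likely local; if only some hosts are slow, it is more likely those sites reacting to the crawler.

```python
from collections import defaultdict


def per_host_throughput(samples):
    """samples: iterable of (host, bytes_received, seconds).

    Returns {host: average bytes per second} over all samples for that host.
    """
    total_bytes = defaultdict(int)
    total_secs = defaultdict(float)
    for host, nbytes, secs in samples:
        total_bytes[host] += nbytes
        total_secs[host] += secs
    # Skip hosts with no measured time to avoid dividing by zero.
    return {h: total_bytes[h] / total_secs[h]
            for h in total_bytes if total_secs[h] > 0}


def slow_host_fraction(samples, threshold_bps):
    """Fraction of hosts whose average throughput is below threshold_bps."""
    rates = per_host_throughput(samples)
    if not rates:
        return 0.0
    slow = sum(1 for rate in rates.values() if rate < threshold_bps)
    return slow / len(rates)
```

Comparing this fraction from the first hour of a run with the fraction after the slowdown sets in would show whether the drop is uniform across hosts.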
If you run the crawler from a Windows box, it works OK, but if you run the crawler from either of the two Linux machines, you have the problem?
Yes. The code is exactly the same, so the problem seems to be somewhere in the Linux socket layer. Maybe someone who is good at Linux can give some advice.
If you only occasionally do this from a Windows machine, the problem could still be that the websites are protecting themselves from an unwanted crawler. Do your machines have distinct, permanent IP addresses on the Internet?
The crawler downloads just a few pages from each domain, and the Windows box is used at least as much as the Linux boxes, if not more, so I don't think this is the problem.
Besides, very few web servers have advanced crawler protection.