So far, I've written a couple of test spiders, which work great (the results all show up fine), but they take too long to execute.
Now, sure, if there were just 1 or 2 spiders, a 5-second wait would be okay, but since there could be as many as 20-30, I need to cut that down somehow.
Is there a way to optimize them or at least run them all at the same time? Should I switch from ASP to another programming language?
//ZS
What is actually happening is that:
1) You request the page from your server.
2) This in turn makes several requests to various other web pages, presumably sequentially.
3) Your server then processes and combines the fetched pages into a single web page.
4) The final page is then returned to your browser.
I would say it's not that ASP is slow (it isn't that slow). The delay comes from stage 2 - accessing several other pages, each of which takes at least a second. Because you access the sites one after the other, those delays add up.
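In classic ASP terms, the usual pattern looks something like the snippet below (just an illustration with placeholder URLs, using MSXML2.ServerXMLHTTP). Because each request is synchronous, the script waits for one site before it even starts on the next, so five sites at a second each means five seconds before anything comes back:

<%
' Sequential fetching: each Send blocks until that site has answered.
Dim urls, i, xhr
urls = Array("http://site-a.example/page", "http://site-b.example/page", "http://site-c.example/page")
ReDim results(UBound(urls))

For i = 0 To UBound(urls)
    Set xhr = Server.CreateObject("MSXML2.ServerXMLHTTP")
    xhr.Open "GET", urls(i), False   ' False = synchronous, the script waits here
    xhr.Send
    results(i) = xhr.responseText
    Set xhr = Nothing
Next
%>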
What I would suggest you do is cache the data you retrieve, and only update this cache once per hour, for instance. Visitors to your site will only experience a delay the first time the data is retrieved. After that, the page should be delivered almost instantaneously.
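In classic ASP the Application object is the simplest place to hold that cache in memory, something along these lines (FetchAllSites() is just a stand-in for whatever routine runs your spiders):

<%
' Keep the combined results in the Application object so that only the
' first visitor pays the price of running the spiders.
Dim cached
cached = Application("spiderResults")

If IsEmpty(cached) Then
    cached = FetchAllSites()              ' placeholder: runs the spiders, returns HTML
    Application.Lock
    Application("spiderResults") = cached
    Application("spiderResultsTime") = Now()
    Application.UnLock
End If

Response.Write cached
%>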
HTH,
JP
I'm currently building a comparison site, but most of the data is provided by the suppliers as CSV files, or I use a crawler that fetches all the prices each night. If the prices in the market you are targeting don't change during the day, then this may be a solution to improve access times.
Ah, thanks, but it'll still be very slow at displaying the uncached results. I mean, if nobody waits 2 min, no results will get cached :)
Well what I would do is:
1) Hold results in cache
2) Every time the page is accessed,
a) display the results from cache.
b) check to see if the cache is still "current", e.g. the last cache update was less than 60 minutes ago.
c) If not, then update cache time to current time.
d) Run procedure to fetch results from external sites and update cache.
Doing it like the above means that subsequent accesses during the two-minute update window deliver the old results, but don't fire off the fetch-and-update procedure again. Once the procedure has finished getting the results, the cache contents are updated. Just remember to set the script timeout to a high value!
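Roughly like this in classic ASP (again, FetchAllSites() is a placeholder for whatever routine actually runs the spiders, and the cache lives in the Application object):

<%
Server.ScriptTimeout = 300                 ' give the slow refresh room to finish

' a) always serve whatever is in the cache straight away
'    (the very first visitor will see an empty page until the cache is filled)
Response.Write Application("spiderResults")
Response.Flush

' b) is the cache missing or older than 60 minutes?
Dim lastUpdate
lastUpdate = Application("spiderResultsTime")
If IsEmpty(lastUpdate) Then lastUpdate = CDate(0)   ' treat "never updated" as very old

If DateDiff("n", lastUpdate, Now()) >= 60 Then
    ' c) stamp the cache time first so overlapping requests don't all start a refresh
    Application.Lock
    Application("spiderResultsTime") = Now()
    Application.UnLock

    ' d) now do the slow fetch and store the new results
    Dim fresh
    fresh = FetchAllSites()                ' placeholder spider routine
    Application.Lock
    Application("spiderResults") = fresh
    Application.UnLock
End If
%>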
JP
Thanks for the help tho, this got me thinking :)
//ZS
CURL is made for this sort of thing. You could probably use a counter to keep track of how frequently you access certain pages, and if that count passes a threshold, tell CURL to save the page into a file, or maybe put it into a database.
One handy thing about it is that you can tell it to retrieve bytes X to Y of a page, so if you wanted the footer of this page, say, you would tell it to grab the last 1000 bytes or whatever.
Might speed up the fetching process a little; it's a good tool to have anyway :)
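For what it's worth, curl's byte-range option (curl -r) just sets the HTTP Range header, so you can do the same sort of thing from ASP with ServerXMLHTTP, assuming the remote server honours Range requests (not all do). The URL is a placeholder:

<%
' Ask for only the last 1000 bytes of the page instead of downloading all of it.
Dim xhr
Set xhr = Server.CreateObject("MSXML2.ServerXMLHTTP")
xhr.Open "GET", "http://www.example.com/somepage.html", False
xhr.setRequestHeader "Range", "bytes=-1000"      ' last 1000 bytes
xhr.Send

' A 206 status means the server honoured the range; 200 means you got the whole page anyway.
Response.Write xhr.status & "<br>" & Server.HTMLEncode(xhr.responseText)
%>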
As a start, think about calling one script which then calls individual scripts for each site. Then have the first script monitor these child scripts to see when they complete and then return the results.
There may (hopefully) be someone else on the board that can give more ideas.
As far as I'm aware it's impossible, though I'd love for someone to tell me I'm wrong!
ASP runs things only sequentially. It executes line by line, waiting for the previous "command" to finish before starting the next one.
The only thing that might side-step this is an extra server component which allows multi-threaded calls. I've not seen one yet though :( .NET supports multithreading as standard, from what I can gather.
The alternative (if you are running your own server, or your hosting company is willing to install things for you) is to roll your own component which should be able to work multi-threaded.
JP
First off, you request the central script, which then calls the sub-scripts by making an HTTP request as if it were a client. This is where the parallel part comes in, at the webserver level rather than the script level. With PHP you could use fopen() for this, which I believe returns a file pointer as soon as the GET request is sent. I've not used ASP, but there should be a similar way of requesting a URL as a filestream.
Once you've sent requests to each child, which in turn have sent requests to the sub-sites, you can read the output of each request in turn with a much smaller delay between them.
This is just theory, I haven't tried it out myself, but it sounds like it should be a way of absorbing the delay while the remote sites send their data.
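For what it's worth, here's a rough, untested sketch of that idea in classic ASP: ServerXMLHTTP can be opened in async mode, so you can fire off all the requests first and only wait on them afterwards, which lets the remote fetches overlap instead of queueing up. The URLs are placeholders:

<%
Dim urls, i
urls = Array("http://site-a.example/prices", "http://site-b.example/prices", "http://site-c.example/prices")
ReDim reqs(UBound(urls))

' 1) send every request without waiting - True means asynchronous,
'    so Send returns as soon as the request has gone out
For i = 0 To UBound(urls)
    Set reqs(i) = Server.CreateObject("MSXML2.ServerXMLHTTP")
    reqs(i).Open "GET", urls(i), True
    reqs(i).Send
Next

' 2) now collect the responses; while you wait on one, the others keep downloading
For i = 0 To UBound(urls)
    If reqs(i).waitForResponse(30) Then          ' wait up to 30 seconds
        Response.Write reqs(i).responseText
    Else
        Response.Write "Timed out: " & urls(i) & "<br>"
    End If
Next
%>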