| 1:52 pm on Jan 29, 2009 (gmt 0)|
take your pick ...
| 2:57 pm on Jan 29, 2009 (gmt 0)|
Some of them I have tested already (plus many other free scripts that can be run on our server (Apache/PHP)).
Our main concern is the crawler's performance when spidering almost 1 million pages, and the file size of the resulting XML sitemap.
The question is: have you used any such crawler for a very large site? How did the script perform under constant updates across many hundreds of pages?
We don't mind the cost of the script, as long as we know it will work.
| 2:33 pm on Jan 31, 2009 (gmt 0)|
Why the concern for performance? This is not something you need to run in real time. It is not even something you need to update every night; you could set up a weekly process that runs in the middle of the night.
You may want to ask yourself what you hope to accomplish by generating a sitemap for a million pages. Simply creating a sitemap does not guarantee the search engines will index the pages, and it definitely does not guarantee any ranking.
| 5:46 am on Feb 28, 2009 (gmt 0)|
Google might like sitemaps, though; they exist for a purpose. omoutop, try breaking your site up into smaller pieces to create a sitemap; it has helped me.
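Splitting lines up with the sitemap protocol's own limits: per sitemaps.org, each sitemap file may hold at most 50,000 URLs, and a sitemap index file can point at the individual pieces. A rough sketch of the chunking in Python (the URL list and example.com domain are placeholders, not anything from a specific script):

```python
# Sketch: split a large URL list into 50,000-URL sitemap files plus a
# sitemap index, following the sitemaps.org protocol limits.
# The example.com base URL and file names are placeholders.

MAX_URLS = 50000  # protocol limit per sitemap file

def write_sitemaps(urls, base="http://www.example.com"):
    # Break the flat URL list into protocol-sized chunks.
    chunks = [urls[i:i + MAX_URLS] for i in range(0, len(urls), MAX_URLS)]
    names = []
    for n, chunk in enumerate(chunks, 1):
        name = "sitemap-%d.xml" % n
        with open(name, "w") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for u in chunk:
                f.write("  <url><loc>%s</loc></url>\n" % u)
            f.write('</urlset>\n')
        names.append(name)
    # One index file pointing at each chunk; you submit only this file.
    with open("sitemap-index.xml", "w") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for name in names:
            f.write("  <sitemap><loc>%s/%s</loc></sitemap>\n" % (base, name))
        f.write('</sitemapindex>\n')
    return names
```

A million URLs would come out as 20 sitemap files plus the index, which keeps each file well under the size limit.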
| 12:30 am on Mar 27, 2009 (gmt 0)|
|have you used any such crawler for a very large site? |
I nightly run one with over 5,000,000 pages.
|How did the script perform under constant updates across many hundreds of pages? |
Poorly. It's written in Python and does some very stupid things. You can't even exclude entire directories from the crawl (although you can exclude them from the output).
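Filtering at crawl time rather than at output time is what saves the wasted fetches on a multi-million-page run. A minimal sketch of a prefix-based exclude check you could bolt onto a Python crawler (the directory names are hypothetical examples, not from any particular script):

```python
from urllib.parse import urlparse

# Hypothetical exclude list; adjust to the directories you want skipped.
EXCLUDED_DIRS = ("/admin/", "/tmp/", "/print/")

def should_crawl(url):
    """Return False for URLs under an excluded directory, so the crawler
    never fetches them instead of dropping them from the output later."""
    path = urlparse(url).path
    # str.startswith accepts a tuple, so one call covers every prefix.
    return not path.startswith(EXCLUDED_DIRS)
```

Checking the URL before it goes into the fetch queue means excluded trees cost nothing, instead of being downloaded and then discarded.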
| 12:31 am on Mar 27, 2009 (gmt 0)|
|it is not even something you need to update every night |
That depends on how many new pages get created in a day.