Take, for instance, what is hitting this system right now:
AllTheWeb - full crawl.
WiseNut - full crawl.
Google - partial crawl.
Ink - sporadic spidering.
Northern Light - significant spidering.
Alta - significant spidering.
Excite - sporadic spidering.
3rd party spiders - much spidering.
That's close to 95k pages in the last 48 hours (probably 25k already this morning during peak hours).
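To put that figure in perspective, here is a quick back-of-the-envelope conversion of the 95k pages in 48 hours into an average request rate. The variable names are just for illustration, and this is only an average; the peak-hour rate would be higher.

```python
# Figures taken from the post: roughly 95k pages fetched over 48 hours.
pages = 95_000
hours = 48

per_hour = pages / hours        # average spider requests per hour
per_second = per_hour / 3600    # average spider requests per second

print(round(per_hour), round(per_second, 2))  # prints: 1979 0.55
```

So on average a spider is requesting a page from this system about once every two seconds, around the clock.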
Q: Is there an answer? All this duplicate spidering is a massive waste of internet resources and bandwidth.
A: A common, centralized spider system paid for by the search engines. The system spiders the sites, then the search engines just pull from the common database.
It would result in fresher pages and significantly reduced internet load. It is foolish for all these companies to be duplicating the same function.
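The load reduction can be illustrated with a rough sketch. Everything here is hypothetical: `SharedCrawlStore`, `fake_fetch`, and the example URLs are made-up names for illustration, not an existing system, and the fetch is a stand-in so the sketch runs offline.

```python
from dataclasses import dataclass, field


@dataclass
class SharedCrawlStore:
    """Hypothetical common database: one stored copy per URL."""
    pages: dict = field(default_factory=dict)
    origin_fetches: int = 0  # requests that actually hit the webmaster's server

    def crawl(self, url, fetch):
        """Fetch a URL from the origin site once and store the result."""
        self.pages[url] = fetch(url)
        self.origin_fetches += 1

    def pull(self, url):
        """An engine reads the stored copy; the origin site is not contacted."""
        return self.pages[url]


def fake_fetch(url):
    # Stand-in for a real HTTP request, so the sketch runs offline.
    return f"<html>content of {url}</html>"


urls = [f"http://example.com/page{i}" for i in range(5)]
engines = ["AllTheWeb", "WiseNut", "Google", "Ink"]

# Today: every engine spiders every page itself.
independent_fetches = len(engines) * len(urls)

# Proposed: one central crawl, then each engine pulls from the store.
store = SharedCrawlStore()
for url in urls:
    store.crawl(url, fake_fetch)
indexes = {name: [store.pull(u) for u in urls] for name in engines}

print(independent_fetches, store.origin_fetches)  # prints: 20 5
```

With four engines and five pages, the site serves 5 requests instead of 20. The saving grows with the number of engines sharing the store, but it does nothing about spiders that stay outside the shared system.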
Seems like wherever we see a dominant service, we see either predatory behavior or slow adoption of new technologies. The search engines would become dependent on the outsourced spidering, and the entire industry would be badly hurt by a disruption of the central spider or by pricing pressure from a monopolistic spider.
Also, in my industry, a central spidering system would not stop the 3rd party spiders that are specifically set up to spider our site's dynamic content. It takes engineers to develop scripts that will work specifically with our site's content. Because of the resources required, those specialty spiders will always need to remain independent.
Given that many engines already use the same database (I'm thinking of Ink), and assuming that 3rd party spidering would continue, the gains seem to be worth less than the losses.