In Autumn 2008 we released a series of web sites for our client. The idea is that there are hundreds of smaller sites (currently around 600), each with its own subdomain, for example <http://widget.example.com>.
Each of these sites should be considered independent by Google, as each has a separate subdomain, a separate Google Sitemap, and so on. This seems to be the case. The idea is that each site targets and benefits a small section of people, and because the content is specific to each site, it should eventually rank quite well in search engines; i.e. a <widget> web site should rank well for such searches and, when visited, should deliver only relevant listings to the user.
There are around 40 million pages that could be listed (according to our sitemaps), yet nowhere near that many are indexed. We were typing "site:example.com" into Google and watching the count rise, but several weeks ago it froze at 501,000 pages. However, if we run a similar query for each sub-site and total the results, we get over 1.1 million pages indexed, which seems more accurate.
That total is growing at around 22,000 pages per day, with some sites falling and some rising. At this rate it will take a year or two to reach even 10 million pages. That is too long.
We're getting, on average, 1 unique visitor per 1,000 indexed pages, so 10 million pages would get us 10,000 visitors per day. That's the plan, anyway.
So, we need to find a way of speeding up indexing. I have logged into Google's control panel and can adjust the crawl rate, which I will do gradually.
However, as each site is considered separate, and there are 600 of them with so many pages, the Google Sitemap files are large and, more importantly, the queries that produce them take a lot of processing power. This slows the web site down so much that it crashes. Google seems to fetch them constantly (URLs such as <http://widget.example.com/sitemap.php>), so I set up a cache so that the sitemaps are served directly by the web server on each request, which greatly reduces database load. However, even with a 14-day cache we have issues. I could set it to 28 days, or even 56 days; that would help, but it would still cost a lot of processing power.
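For anyone curious what that caching layer looks like, here is a minimal sketch in Python. It assumes a hypothetical `generate_sitemap()` standing in for the expensive MySQL-backed build; the cache directory, TTL, and function names are all illustrative, not the poster's actual setup.

```python
import os
import tempfile
import time

CACHE_DIR = os.path.join(tempfile.gettempdir(), "sitemap_cache")  # hypothetical location
CACHE_TTL = 14 * 24 * 3600  # 14 days, in seconds

def generate_sitemap(subdomain):
    # Placeholder for the expensive MySQL-backed generation step.
    return "<urlset>...%s...</urlset>" % subdomain

def cached_sitemap(subdomain, now=None):
    """Serve the cached sitemap if it is still fresh, regenerating otherwise."""
    now = time.time() if now is None else now
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, subdomain + ".xml")
    if os.path.exists(path) and now - os.path.getmtime(path) < CACHE_TTL:
        with open(path) as f:
            return f.read()            # cache hit: no database work at all
    xml = generate_sitemap(subdomain)  # cache miss: hit the database once
    with open(path, "w") as f:
        f.write(xml)
    return xml
```

With this shape, lengthening the TTL is a one-line change, and each of the 600 subdomains only pays the database cost once per TTL window no matter how often googlebot asks.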
Bear in mind that the pages on the sites themselves are very navigable by search engines: find one page and you can reach any other in a few clicks, especially when starting from the home page. As for sold items, which we are keeping, these are all indexed through an archive page.
So my question: are there any adverse effects in caching the sitemap files for 2 months? Google can use the old files to locate all sold items, and crawl the web site itself for all daily updates. In fact, let's look at it another way: are sitemaps necessary at all? If I could get rid of these sitemaps and still get pages indexed, it would be a great relief. The vast majority of our MySQL processing time comes from Google, and sitemaps are the bulk of that.
Apart from adjusting the crawl rate, is there any other way of speeding up indexing? Can we tell Google, for example, to spend less time revisiting pages and more time looking for new ones?
Any other ideas on how to get these sites listed quickly?
Any help appreciated - thanks for reading
[edited by: tedster at 6:36 pm (utc) on Feb. 16, 2009]
[edit reason] No specific URLs or keywords, please - see charter [/edit]
I'll address some of what you asked, and others may also have input. Some of the challenges you face may be technical database questions that are better addressed in our Databases Forum [webmasterworld.com].
If your pages tend not to have much "churn" after they are first published, you may want to consider how well your server is responding to the If-Modified-Since request header.
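The idea is that a crawler sends the Last-Modified date it saw last time, and if nothing has changed your server can answer 304 Not Modified with an empty body instead of rebuilding the page. A rough sketch, assuming a hypothetical `render_page()` for the expensive page build (a real server framework usually handles the header parsing for you):

```python
from email.utils import formatdate, parsedate_to_datetime

def render_page():
    # Placeholder for the expensive database-backed page build.
    return "<html>...</html>"

def conditional_get(page_mtime, if_modified_since=None):
    """Return (status, headers, body) for a GET, honouring If-Modified-Since.

    page_mtime: unix timestamp of the page's last change.
    if_modified_since: the raw request header value, if the crawler sent one.
    """
    headers = {"Last-Modified": formatdate(page_mtime, usegmt=True)}
    if if_modified_since:
        since = parsedate_to_datetime(if_modified_since).timestamp()
        if page_mtime <= since:
            return 304, headers, ""  # Not Modified: no body, no page build
    return 200, headers, render_page()
```

On a site where most of 40 million pages rarely change, answering 304 for the unchanged ones frees googlebot's crawl budget for new pages and spares the database.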
The XML Sitemap format (see [google.com...]) does allow for <lastmod>, <changefreq> and <priority> tags, which may help your crawling issues significantly if you are not already using them.
The question of whether to go without a Sitemap altogether and just allow normal crawling is an interesting one to consider. There have been some tests on smaller sites that show a quicker googlebot response with a Sitemap, but we're talking about small time differences that would only be important in the kind of site where freshness means a lot.
Two other ideas also come to mind:
1. You can experiment with a single subdomain and discover how a new approach works before you roll it out completely.
2. You might consider submitting only a partial sitemap, one that focuses on new and recently changed URLs.
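To make idea 2 concrete, here is a sketch of generating a partial sitemap that includes only recently changed URLs, with the <lastmod> and <changefreq> tags mentioned above. The `pages` list of (loc, lastmod) pairs is a stand-in for a MySQL query result; all names are illustrative.

```python
from datetime import date, timedelta

def partial_sitemap(pages, since_days=14, today=None):
    """Emit <url> entries only for pages changed in the last `since_days` days.

    `pages` is a list of (loc, lastmod_date) tuples -- a stand-in for the
    result of a MySQL query. A query filtered by date is also far cheaper
    than one that walks all 40 million rows.
    """
    today = today or date.today()
    cutoff = today - timedelta(days=since_days)
    entries = []
    for loc, lastmod in pages:
        if lastmod >= cutoff:
            entries.append(
                "  <url>\n"
                "    <loc>%s</loc>\n"
                "    <lastmod>%s</lastmod>\n"
                "    <changefreq>daily</changefreq>\n"
                "  </url>" % (loc, lastmod.isoformat())
            )
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            + "\n".join(entries) + "\n</urlset>")
```

The old, rarely-changing pages would then be discovered through normal crawling of the site's own navigation, which the poster says is already good.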
40m pages = no onsite visitors for 39,999,971 pages. User stickiness is not likely with that many pages! Attention spans are not that long...