
Problems listing very large site on Google

11:01 am on Feb 16, 2009 (gmt 0)

New User

5+ Year Member

joined:Feb 13, 2009
posts: 1
votes: 0


In Autumn 2008 we released a series of web sites for our client; the main site is

The idea is that there are hundreds of smaller sites (currently around 600), each with its own subdomain. For example, <http://widget.example.com>

Each of these sites should be considered independent by Google, as they have separate subdomains, a separate Google Sitemap, etc. This seems to be the case. The idea is that each site may target and benefit a small section of people, and as the content is specific to each site, it should eventually rank quite well in search engines; i.e. a <widget> web site should rank well for such searches, and when visited, should deliver only relevant listings to the user.

There are around 40 million pages that can be listed (according to our sitemaps), yet nowhere near that many are listed. We were typing "site:example.com" into Google and watching the count rise, but several weeks ago it froze at 501,000 pages. However, if we run a similar query for each sub-site and total the results, we get over 1.1 million pages listed, which seems more accurate. This total is growing at around 22,000 pages per day, with some sites falling and some rising. At that rate it will take a year or two to reach even 10 million pages. This is too long.

We're getting, on average, 1 unique visitor per day per 1,000 pages indexed, so 10 million pages would get us 10,000 visitors per day. That's the plan, anyway.

So, we need to find a way of speeding up indexing. I have logged into Google's control panel, where I can adjust the crawl rate, and I will do this gradually.

However, as each site is considered separate, and there are 600 of them with so many pages, the Google Sitemap files are large and, more importantly, the queries that produce them take a lot of processing power. This slows the web site down so much that it crashes. Google seems to check the sitemaps all the time (URLs such as <http://widget.example.com/sitemap.php>), so I set up a cache so that the sitemaps are simply sent out by the web server on each request, which greatly reduces database load. However, even with a 14-day cache we have issues. I could set it to 28 days, or even 56 days; this would make it better, but it's still a lot of processing power.
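The caching approach described here can be sketched as a pre-generation step that only touches the database when the cached file has expired. This is a hypothetical Python sketch (the poster's stack is PHP/MySQL, and `fetch_urls()` below stands in for the real database query), not the actual implementation:

```python
import os
import time
import xml.etree.ElementTree as ET

CACHE_SECONDS = 14 * 24 * 3600  # regenerate at most every 14 days


def fetch_urls():
    # Stand-in for the expensive database query; assumed to
    # return (loc, lastmod) pairs for one subdomain's sitemap.
    return [("http://widget.example.com/item/1", "2009-02-01"),
            ("http://widget.example.com/item/2", "2009-02-10")]


def build_sitemap(path):
    # Write a sitemaps.org-format urlset to a static file.
    urlset = ET.Element(
        "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod in fetch_urls():
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    ET.ElementTree(urlset).write(path, encoding="utf-8",
                                 xml_declaration=True)


def ensure_fresh(path):
    # Only hit the database when the cached file is missing or stale,
    # so Googlebot's frequent sitemap requests cost no database time.
    if (not os.path.exists(path)
            or time.time() - os.path.getmtime(path) > CACHE_SECONDS):
        build_sitemap(path)
    return path
```

With this shape, the web server can serve the cached file directly and the expensive query runs at most once per cache period per subdomain.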

Bear in mind that the pages on the sites themselves are very navigable by search engines: find one page and you can reach any other in a few clicks, especially when navigating from the home page. As for sold items, which we are keeping, these are all indexed through an archive page.


So my question: are there any adverse effects in caching the sitemap files for two months? Google can use the old files to locate all sold items, and the web site itself for all daily updates. In fact, let's look at it another way: are sitemaps necessary at all? If I could get rid of these sitemaps and still get pages indexed, it would be a great relief. The vast majority of MySQL processing time is from Google, and sitemaps are the bulk of this.


Apart from adjusting the crawl rate is there any other way of speeding up indexing? Can we tell Google, for example, to spend less time revisiting pages and more time looking for new ones?


Any other idea how to get these sites listed quickly?

Any help appreciated - thanks for reading

[edited by: tedster at 6:36 pm (utc) on Feb. 16, 2009]
[edit reason] No specific URLs or keywords, please - see charter [/edit]

7:07 pm on Feb 16, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
votes: 0

Hello webbedfeet, and welcome to the forums.

I'll address some of what you asked, and others may also have input. Some of the challenges you face may be technical database questions that are better addressed in our Databases Forum [webmasterworld.com].

If your pages tend not to have much "churn" after they are first published, you may want to check how well your server responds to the "If-Modified-Since" request header.
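The If-Modified-Since logic amounts to one date comparison: if the client's cached copy is at least as new as the page, reply 304 Not Modified and send no body. A minimal sketch (hypothetical helper name, not from the thread), assuming the handler already knows the page's true last-modified time as an aware datetime:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def should_send_304(if_modified_since, last_modified):
    """Return True if the client's cached copy (per the
    If-Modified-Since header value) is still current, so the
    server can reply 304 Not Modified instead of a full page."""
    if not if_modified_since:
        return False
    try:
        cached = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        # Unparseable header: fall back to a normal 200 response.
        return False
    return last_modified <= cached
```

For a crawler hitting millions of rarely-changing pages, answering 304 early saves both bandwidth and the page-rendering database queries.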

The XML Sitemap format ( see [google.com...] ) does allow for <lastmod>, <changefreq> and <priority> tags, which may help your crawling issues significantly if you are not already using them.
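For reference, a single <url> entry in the sitemaps.org format using all three optional tags looks like this (the values here are illustrative only):

```xml
<url>
  <loc>http://widget.example.com/item/12345</loc>
  <lastmod>2009-02-10</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.5</priority>
</url>
```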

The question of whether to go without a Sitemap altogether and just allow normal crawling is an interesting one to consider. There have been some tests on smaller sites that show a quicker googlebot response with a Sitemap, but we're talking about small time differences that would only be important in the kind of site where freshness means a lot.

Two other ideas also come to mind:

1. You can experiment with a single subdomain and discover how a new approach works before you roll it out completely.

2. You might consider submitting only a partial sitemap, one that focuses on new and recently changed URLs.
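A partial sitemap like that could be produced by filtering the URL list on modification date before writing the file; a hypothetical sketch (names are illustrative, `rows` stands in for the database result):

```python
from datetime import date, timedelta


def recent_urls(rows, days=7, today=None):
    """Keep only URLs changed within the last `days` days.
    `rows` is assumed to be (loc, lastmod_date) pairs."""
    today = today or date.today()
    cutoff = today - timedelta(days=days)
    return [(loc, lm) for loc, lm in rows if lm >= cutoff]
```

Because the filter runs on a small, indexed date column rather than regenerating 40 million entries, the query behind each sitemap stays cheap while still surfacing fresh pages to the crawler.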

12:28 am on Feb 17, 2009 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
votes: 284

Why maintain "sold" pages? Unless these are "selling" pages they offer nothing but a dead end for the user, unless there is compelling CONTENT that might be of value.

40M pages = no onsite visitors for 39,999,971 of those pages. User stickiness is not likely with that many pages! Attention spans are not that long...

3:43 am on Feb 17, 2009 (gmt 0)

Senior Member

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:May 26, 2000
votes: 0

I agree that the decision to keep pages live that are no longer about available items may not be wise. It may also be part of your indexing problem.

5:35 am on Feb 17, 2009 (gmt 0)

Moderator from AU 

WebmasterWorld Administrator anallawalla is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Mar 3, 2003
votes: 3

I presume by "Autumn 2008" you are in the Northern hemisphere, i.e. 4-5 months have elapsed. In that case without a sitemap you should have 3-4M pages indexed unless your pages are heavy and the server is slow. Our main site has 2.5M pages and it took three months to index them all (without a sitemap). Until then we noticed what you are seeing -- if we did a selective site: command, we got 400k pages, which was also the count if we did a sitewide site: command. But once the pages were fully indexed, the site: operator worked as intended.
