
Google SEO News and Discussion Forum

    
Problems listing very large site on Google
webbedfeet
msg:3850536 - 11:01 am on Feb 16, 2009 (gmt 0)

Hello,

In Autumn 2008 we released a series of web sites for our client; the main site is <http://www.example.com>.

The idea is that there are hundreds of smaller sites (currently around 600), each with its own subdomain, for example <http://widget.example.com>.

Each of these sites should be considered independent by Google, as they have separate subdomains, a separate Google Sitemap, etc. This seems to be the case. The idea is that each site may target and benefit a small section of people, and as the content is specific to each site, it should eventually rank quite well in search engines. I.e. a <widget> web site should rank well for such searches and, when visited, should deliver only relevant listings to the user.

There are around 40 million pages that could be listed (according to our sitemaps), yet nowhere near that many are listed. We were typing "site:example.com" into Google and watching the count rise, but several weeks ago it froze at 501,000 pages. However, if we run a similar query for each sub site and total the results, we get over 1.1 million pages listed, which seems more accurate. That total is growing at around 22,000 pages per day, with some sites falling and some rising. At this rate it will take a year or two to reach even 10 million pages. This is too long.

We're getting, on average, 1 unique visitor per 1,000 pages, so 10 million would get us 10,000 visitors per day. That's the plan anyway.

So, we need to find a way of speeding up indexing. I have logged into Google's control panel (Webmaster Tools) and can adjust the crawl rate, which I will do gradually.

However, as each site is considered separate, and there are 600 of them with so many pages, the Google Sitemap files are large and, more importantly, the queries to produce them take up a lot of processing power. This really slows the web site down, so much that it crashes. Google seems to request the sitemaps all the time (URLs such as <http://widget.example.com/sitemap.php>), so I set up a cache so that the sitemaps are just sent out by the web server each time there is a request, which really reduces database load. However, even with a 14-day cache we have issues. I could set it to 28 days, or even 56 days ... that would make it better, but it's still a lot of processing power.
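To illustrate the kind of caching I mean: the sitemaps could be pre-generated on a schedule (e.g. a nightly cron job) and served as flat files, so a googlebot request never touches MySQL at all. The sketch below is Python rather than our actual PHP, and the pages table, columns and output paths are made up, but it shows the shape of it.

# Rough sketch (hypothetical schema): pre-generate static sitemap files
# on a schedule so the web server serves flat XML and googlebot requests
# never hit the database.
import xml.etree.ElementTree as ET

MAX_URLS_PER_FILE = 50000  # the sitemap protocol's per-file limit

def write_sitemap(rows, path):
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod in rows:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = str(lastmod)  # YYYY-MM-DD
    ET.ElementTree(urlset).write(path, encoding="utf-8", xml_declaration=True)

def generate(conn, subdomain):
    # conn is a DB-API connection; table and column names are made up
    cur = conn.execute(
        "SELECT url, updated_at FROM pages WHERE subdomain = ? ORDER BY id",
        (subdomain,))
    batch, index = [], 0
    for row in cur:
        batch.append(row)
        if len(batch) == MAX_URLS_PER_FILE:
            write_sitemap(batch, "/var/www/%s/sitemap-%d.xml" % (subdomain, index))
            batch, index = [], index + 1
    if batch:
        write_sitemap(batch, "/var/www/%s/sitemap-%d.xml" % (subdomain, index))

With files like that on disk, sitemap.php could simply read them out (or the server could point straight at the static files), and the 14/28/56-day question just becomes a question of how often the job runs.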

Bear in mind that the pages on the sites themselves are very navigable by search engines: find one page and you can reach any other in a few clicks, especially when navigating from the home page. As for sold items, which we are keeping, these are all reachable for indexing through an archive page.

QUESTION 1

So my question: are there any adverse effects in caching the sitemap files for 2 months? Google can use the old files to locate all sold items, and the web site itself for all daily updates. In fact, let's look at it another way ... are sitemaps necessary? If I could get rid of these sitemaps and still get pages indexed, it would be a great relief. The vast majority of MySQL processing time is from Google, and sitemaps are the bulk of this.

QUESTION 2

Apart from adjusting the crawl rate, is there any other way of speeding up indexing? Can we tell Google, for example, to spend less time revisiting pages and more time looking for new ones?

QUESTION 3

Any other ideas on how to get these sites listed quickly?

Any help appreciated - thanks for reading

[edited by: tedster at 6:36 pm (utc) on Feb. 16, 2009]
[edit reason] No specific URLs or keywords, please - see charter [/edit]

 

tedster
msg:3850840 - 7:07 pm on Feb 16, 2009 (gmt 0)

Hello webbedfeet, and welcome to the forums.

I'll address some of what you asked, and others may also have input. Some of the challenges you face may be technical database questions that are better addressed in our Databases Forum [webmasterworld.com].

If your pages tend not to have much "churn" after they are first published, you may want to consider how well your server is responding to the If-Modified-Since request header.
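For example (just a minimal sketch, not your actual setup), a handler for the sitemap URL could answer a conditional GET with 304 Not Modified whenever the cached file hasn't changed, so a repeat googlebot visit costs neither database work nor bandwidth:

# Minimal WSGI sketch of If-Modified-Since handling for a cached sitemap.
# The file path is made up; the point is the 304 short-circuit.
import os
from email.utils import parsedate_to_datetime, formatdate

def sitemap_app(environ, start_response):
    path = "/var/www/widget.example.com/sitemap.xml"  # pre-generated file
    mtime = int(os.path.getmtime(path))

    ims = environ.get("HTTP_IF_MODIFIED_SINCE")
    if ims:
        try:
            if int(parsedate_to_datetime(ims).timestamp()) >= mtime:
                start_response("304 Not Modified", [])
                return [b""]
        except (TypeError, ValueError, AttributeError):
            pass  # malformed header: fall through and send the full file

    headers = [("Content-Type", "application/xml"),
               ("Last-Modified", formatdate(mtime, usegmt=True))]
    start_response("200 OK", headers)
    with open(path, "rb") as f:
        return [f.read()]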

The XML Sitemap format (see https://www.google.com/webmasters/tools/docs/en/protocol.html#sitemapXMLFormat) does allow for <lastmod>, <changefreq> and <priority> tags that may help your crawling issues significantly if you are not already using them.
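For reference, a <url> entry using all three optional tags looks like this (the URL and values are only examples):

<url>
  <loc>http://widget.example.com/item/12345</loc>
  <lastmod>2009-02-10</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.5</priority>
</url>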

The question of whether to go without a Sitemap altogether and just allow normal crawling is an interesting one to consider. There have been some tests on smaller sites that show a quicker googlebot response with a Sitemap, but we're talking about small time differences that would only be important in the kind of site where freshness means a lot.

Two other ideas also come to mind:

1. You can experiment with a single subdomain and discover how a new approach works before you roll it out completely.

2. You might consider submitting only a partial sitemap, one that focuses on new and recently changed URLs; a rough sketch of that idea follows.
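As a sketch of that second idea (Python and a made-up schema, just to show the shape): build the submitted sitemap only from rows added or changed in the last couple of weeks, and let normal crawling cover the archive.

# Sketch: a "recent changes only" sitemap, so the query touches a small
# slice of the table and the submitted file stays tiny. Hypothetical schema.
import datetime
import xml.etree.ElementTree as ET

def write_recent_sitemap(conn, subdomain, days=14):
    since = (datetime.date.today() - datetime.timedelta(days=days)).isoformat()
    cur = conn.execute(
        "SELECT url, updated_at FROM pages "
        "WHERE subdomain = ? AND updated_at >= ?",
        (subdomain, since))
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod in cur:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = str(lastmod)
    ET.ElementTree(urlset).write("/var/www/%s/sitemap-recent.xml" % subdomain,
                                 encoding="utf-8", xml_declaration=True)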

tangor
msg:3851034 - 12:28 am on Feb 17, 2009 (gmt 0)

Why maintain "sold" pages? Unless these are "selling" pages they offer nothing but a dead end for the user, unless there is compelling CONTENT that might be of value.

40m pages = no onsite visitors for 39,999,971 of those pages. User stickiness is not likely with that many pages! Attention spans are not that long...

tedster
msg:3851197 - 3:43 am on Feb 17, 2009 (gmt 0)

I agree that the decision to keep pages live that are no longer about available items may not be wise. It may also be part of your indexing problem.

anallawalla
msg:3851246 - 5:35 am on Feb 17, 2009 (gmt 0)

I presume by "Autumn 2008" you are in the Northern hemisphere, i.e. 4-5 months have elapsed. In that case without a sitemap you should have 3-4M pages indexed unless your pages are heavy and the server is slow. Our main site has 2.5M pages and it took three months to index them all (without a sitemap). Until then we noticed what you are seeing -- if we did a selective site: command, we got 400k pages, which was also the count if we did a sitewide site: command. But once the pages were fully indexed, the site: operator worked as intended.
