Forum Moderators: Robert Charlton & goodroi
Quick question here,
The situation is like this: we have a site that is around six weeks old. The site was indexed during the first week because we’ve got some quality links. The bots have been coming in on a daily basis ever since.
Two weeks ago we had around 3,000 pages indexed. We wanted to get another 20,000 pages spidered, but Googlebot didn’t want to go through them, so we introduced changes in the sitemap to encourage the bot to come over and pick up those pages. The changes were:
1. Introduced a LastModified parameter for every URL (in the Google sitemap).
2. Set the ChangeFrequency to daily for every URL in the Google sitemap.
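For anyone wanting to try the same two changes, here is a minimal sketch of a script that writes a sitemap with <lastmod> and <changefreq> set for every URL. It follows the public sitemaps.org format; the URLs and dates are just placeholders, not the poster's actual site.

```python
# Sketch only: build a Google sitemap with <lastmod> and <changefreq>
# on every URL, per the sitemaps.org 0.9 schema.
from datetime import date
from xml.sax.saxutils import escape

def build_sitemap(urls, change_freq="daily", last_mod=None):
    """Return sitemap XML with lastmod/changefreq entries for each URL."""
    last_mod = last_mod or date.today().isoformat()  # W3C YYYY-MM-DD date
    entries = []
    for url in urls:
        entries.append(
            "  <url>\n"
            f"    <loc>{escape(url)}</loc>\n"
            f"    <lastmod>{last_mod}</lastmod>\n"
            f"    <changefreq>{change_freq}</changefreq>\n"
            "  </url>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + "\n".join(entries)
        + "\n</urlset>"
    )

print(build_sitemap(["http://www.example.com/page1.html"],
                    last_mod="2006-12-15"))
```

Whether setting changefreq to daily actually pulls the bot in faster is another question; the protocol treats it as a hint, not a command.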
The outcome was that Googlebot did in fact come to the site and spider those pages the day after, but our indexed page count (site: operator), instead of increasing, dropped to zero in one day.
Now the Webmaster Tools Console shows “No pages are included in the index, indexing can take time, …., bla bla bla”.
The weird part is that we keep getting referred visitors from Google (about a thousand per day), and if we conduct a search in Google using our domain name as the keyword, we can actually see that the first results are pages from our site (but they don’t appear if we use the site: operator).
We also used a tool out there to check how many pages we have indexed in the 43 most commonly used Google data centers; the result is consistent among them (count = 0).
Google keeps a constant presence on our site; their bot is always there, retrieving 1 page per minute for a total of about 1,400 pages/day, more or less. We tried resubmitting the sitemap without those parameters, but the situation is the same…
The question is, does anybody have any idea what could be going on here? Could it be Google rebuilding our pages in the index, taking about 10 days? Could this be some kind of penalty? Any other thoughts out there?
The weird part is that we keep having referred visitors from Google (about a thousand per day)
Your server logs should tell you what search terms are bringing you Google traffic. Sometimes the site: operator gives buggy results. Even more often right now, Webmaster Tools gives buggy results. But real traffic is where the rubber meets the road.
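A minimal sketch of digging those search terms out of the logs, assuming the common Google referrer format (the "q" query parameter); the example referrer URLs are hypothetical, and you would feed in the referrer field from your own access log:

```python
# Sketch: extract Google search terms from referrer URLs found in a log.
# Assumes the classic /search?q=... referrer shape; adjust to taste.
from urllib.parse import urlparse, parse_qs

def google_search_terms(referrer_urls):
    """Return the q= query for each Google referrer, skipping the rest."""
    terms = []
    for ref in referrer_urls:
        parts = urlparse(ref)
        if "google." not in parts.netloc:
            continue  # not a Google referrer
        q = parse_qs(parts.query).get("q")
        if q:
            terms.append(q[0])
    return terms

refs = [
    "http://www.google.com/search?q=widget+reviews&hl=en",
    "http://www.example.com/links.html",  # ordinary link, ignored
]
print(google_search_terms(refs))  # ['widget reviews']
```

Tallying the output with a counter shows at a glance which keywords are actually delivering that thousand visitors a day.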
We have a site that is around 6 weeks old.
Then you're off to a very decent start. 1,000 visitors a day from Google in that short a time is quite a good achievement. If you haven't already seen the thread I linked to below, it's worth a read for any relatively new site:
Filters exist - the Sandbox doesn't. How to build Trust. [webmasterworld.com]
Now the Webmaster Tools Console shows “No pages are included in the index, indexing can take time, …., bla bla bla”.
I see it say this periodically. Webmaster Tools is not the most stable thing in the world; I have grown to totally disregard this warning.
The first time I saw it I freaked out, but I did a quick search for a keyword I ranked number one for and yep, still there... the next day Webmaster Tools went back to saying the pages are IN.
I see it off and on saying pages are out, pages are in, but traffic is always stable and the keywords are always intact, leading me to believe there are glitches in Webmaster Tools itself.
I wouldn't go all postal over what Webmaster Tools says. Traffic is your best indicator, that and punching in a keyword or two that you rank for and seeing first hand if it's still there.
I am assuming it's different Google data centers, but who knows.
What I do know is that if you get a result set different from mine on exactly the same query at the same time, and I email you the result page URL from my query, you will see the results that I see when you browse it.
Does anyone know how queries get routed to a particular Google data center? Is it a fixed route depending on your ISP, or is it dependent on traffic?
Also why are data center indexes different?
I found a tool a while ago that is helpful, if it is accurate. It shows you the PageRank for a given URL in all the different data centers.
One of our homepages is consistently a 4 in all centers.
One is a 4 in some and a 5 in others.
One of our competitors' home pages is a 4 in some, a 5 in others, and a ZERO in one data center.
[edited by: tedster at 2:35 am (utc) on Dec. 30, 2006]
Now about the sport of data center watching. It began back in simpler Google days, when there was a regular "google dance" that accompanied an update of the algorithm. Sometimes you could catch a glimpse of what might be coming soon to google results by watching the data centers, and to a degree that is still what people are hoping for.
But life is no longer that simple. Going to a specific IP to check results will very often give you something that will never see the light of day on google.com -- or google-dot-anything. Between the data at a given IP address (even a "live" one) and the final search results are a few more mysterious steps -- filters, geo-targeting, and apparently other kinds of customization.
You can use a Firefox extension like LiveHTTPHeaders and discover which Google IP address is supplying the data at that moment. But if you go directly to that IP address, you may not see the same results. Some IP addresses may even hold experiments that stand no chance of ever going live.
It's all a bit much sometimes, and today I find data center watching to be more of an interesting hobby than a way to discover any actionable information.
Well, to be honest, more than getting all 50K pages indexed, what I am trying to do is understand Google's behavior.
I saw a video made by Matt Cutts in which he explained that if a site submits large amounts of pages in a short period of time, the site could get flagged and held back from ranking well.
He explains that “large amounts of pages” is relative to the number of pages the site already has in the index and to the time window in which the new pages are being submitted. For example, an authority site with 100k indexed pages could easily submit 5k pages in a week without having any problems at all; on the other hand, a new site with only a few hundred indexed pages would get flagged if it submitted the same amount in the same time.
So it sounded like this could be the problem I am facing, so I deleted the sitemap and created a new one submitting only 1,000 pages (in addition to the 3k we already had indexed). The outcome was that after 12 hours the Webmaster console changed its message; now it says “Googlebot last successfully accessed your home page on Dec 28, 2006. Pages from your site are included in Google's index”.
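The pacing idea can be sketched in a few lines: instead of exposing all 50K URLs at once, split the list into fixed-size batches and release one batch per sitemap at a time. The batch size of 1,000 matches what was tried above, but there is nothing official about that number; it is just an illustration.

```python
# Sketch of staged sitemap submission: split a big URL list into
# consecutive batches so each resubmitted sitemap stays small.
def batch_urls(urls, batch_size=1000):
    """Split a URL list into consecutive batches of at most batch_size."""
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]

all_urls = [f"http://www.example.com/page{n}.html" for n in range(2500)]
batches = batch_urls(all_urls)
print(len(batches), [len(b) for b in batches])  # 3 [1000, 1000, 500]
```

Each batch would then go into its own sitemap file, submitted only after the previous batch has been picked up.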
When we check the indexed page count by DC, it shows 127 supplemental pages for all DCs; however, our index page is not among them. But if we conduct a Google search using our domain name as the keyword, it does show up in the results.
What do you think?
Our discussion of Matt's post on too many new urls [webmasterworld.com]
Matt's blog page on too many new urls [mattcutts.com]
Interesting to note the question of scale here. Matt says that a few thousand new urls at a time shouldn't be a problem (as of last summer), but millions at once can be. So I'm not totally clear where the trigger point might lie, and whether it's related to the percentage of all urls on the domain or is just a "hard" number.
We may be learning more right now - a client with 250,000 urls indexed is adding a couple million more. We released about 8,000 to the bots a few weeks ago with no trouble. Some of the new urls really don't need to be indexed ever, but probably half of them do make sense as a search result. We want to get those all released, but do intend to continue pacing ourselves. I will definitely post about it if we run into any trouble as we go along.
We do look to create excellent organic search landing pages, and make sure they get good PR circulation plus offer the visitor access to what they would most need. It's more of a global approach that seems to work best, rather than hoping to force feed Google everything. I know they say "organize all the world's information", but really that's an ideal. So we aim to create a solid web business, one that continually improves, and we leave perfection to the perfectionists.