|Pages dropped out of Google index, still getting visitors referred|
Problem with indexed pages after submitting a sitemap
Quick question here,
The situation is like this: we have a site that is around 6 weeks old. The site was indexed during the first week because we’ve got some quality links. The bots have been coming in on a daily basis ever since.
Two weeks ago we had around 3,000 pages indexed. We wanted to get another 20,000 pages spidered, but googlebot didn’t want to go through them, so we introduced changes in the sitemap to encourage the bot to come over and pick up those pages. The changes were:
1. Introduced the LastModified parameter for every url (in the Google sitemap).
2. Set the ChangeFrequency to daily for every url in the Google sitemap.
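For reference, those two changes correspond to the optional `<lastmod>` and `<changefreq>` elements of the sitemaps.org protocol. A minimal sketch of how such a file could be generated with the Python standard library (the domain, page, and date are placeholder values, not from the actual site):

```python
# Sketch of a sitemap with <lastmod> and <changefreq> set on every URL,
# mirroring changes 1 and 2 above. Uses only the standard library.
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """entries: list of (url, last_modified_date) tuples."""
    ET.register_namespace("", NS)  # emit tags without a namespace prefix
    urlset = ET.Element("{%s}urlset" % NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "{%s}url" % NS)
        ET.SubElement(url, "{%s}loc" % NS).text = loc
        ET.SubElement(url, "{%s}lastmod" % NS).text = lastmod     # change 1
        ET.SubElement(url, "{%s}changefreq" % NS).text = "daily"  # change 2
    return ET.tostring(urlset, encoding="unicode")

sitemap_xml = build_sitemap([("http://www.example.com/page1.html", "2006-12-28")])
```

Note that per the protocol both elements are hints, not commands; as this thread shows, the bot may react to them in unexpected ways.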
The outcome was that googlebot did in fact come to the site and spider those pages the day after, but our indexed pages count (site: operator), instead of increasing, decreased to zero in one day.
Now the Webmaster Tools Console shows “No pages are included in the index, indexing can take time, …., bla bla bla”.
The weird part is that we keep getting referred visitors from Google (about a thousand per day), and if we conduct a search in Google using our domain name as the keyword, we can actually see that the first results are pages from our site (but they don’t appear if we use the site: operator).
We also used a tool out there to check how many pages we have indexed in all 43 of the most commonly used Google data centers; the result is consistent among them (count = 0).
Google keeps a constant presence on our site; their bot is always there, retrieving 1 page per minute for a total of about 1,400 pages/day, more or less. We tried resubmitting the sitemap without those parameters, but the situation is the same…
The question is: does anybody have any idea what could be going on here? Could it be Google rebuilding our pages in the index over about 10 days? Could this be some kind of penalty? Any other thoughts out there?
Hello aguiarrj, and welcome to the forums.
|The weird part is that we keep having referred visitors from Google (about a thousand per day) |
Your server logs should tell you what search terms are bringing you Google traffic. Sometimes the site: operator gives buggy results. Even more often right now, Webmaster Tools gives buggy results. But real traffic is where the rubber meets the road.
|We have a site that is around 6 weeks old. |
Then you're off to a very decent start. 1,000 visitors a day from Google in that short a time is quite a good achievement. If you haven't already seen the thread I linked to below, it's worth a read for any relatively new site:
Filters exist - the Sandbox doesn't. How to build Trust. [webmasterworld.com]
|Now the Webmaster Tools Console shows “No pages are included in the index, indexing can take time, …., bla bla bla”. |
I see it say this periodically. Webmaster Tools is not the most stable thing in the world; I have grown to totally disregard this warning.
The first time I saw it I freaked out, but I did a quick search for a keyword I ranked number one for and yep, still there... the next day Webmaster Tools went back to saying the pages are IN.
I see it off and on saying pages are out, pages are in, but traffic is always stable and the keywords are always intact, leading me to believe there are glitches in Webmaster Tools itself.
I wouldn't go all postal over what Webmaster Tools says. Traffic is your best indicator, that and punching in a keyword or two that you rank for and seeing first hand whether it's still there.
In my hunt for our ranking hijackers, and now our competitors' ranking hijackers as well (surely the same folks), I have been on the phone with people in different parts of the country, with different ISPs, running the same queries in Google at the same time, both with the site: operator and without. We are all getting different results.
I am assuming it's different Google data centers, but who knows.
What I do know is that if you get a result set different from mine for exactly the same query at the same time, and I email you the result page URL from my query, when you browse it you will see the results that I see.
Does anyone know how queries get routed to a particular Google data center? Is it a fixed route depending on your ISP, or is it dependent on traffic?
Also why are data center indexes different?
I found a tool a while ago that is helpful if it is accurate. It shows you Page Rank for a given URL in all the different data centers.
One of our homepages is consistently a 4 in all centers.
One is a 4 in some and a 5 in others.
One of our competitors' home pages is a 4 in some, a 5 in others, and a ZERO in one data center.
[edited by: tedster at 2:35 am (utc) on Dec. 30, 2006]
|using the site operator in Google as follows: … all three yield quite different results|
If you put a space after the colon, then you are no longer using the operator, you are doing a regular search on two character strings, one of which is your domain name (which can easily appear on domains other than yours.) I don't see any differences with or without the http:// unless the domain has https listings as well.
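For illustration, the three variants being compared presumably look something like this (example.com stands in for the actual domain, which was removed from the quote):

```
site:example.com          the operator, restricted to one domain
site: example.com         space after the colon: an ordinary two-term search
site:http://example.com   same as the first, unless https listings differ
```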
Now about the sport of data center watching. It began back in simpler Google days, when there was a regular "google dance" that accompanied an update of the algorithm. Sometimes you could catch a glimpse of what might be coming soon to google results by watching the data centers, and to a degree that is still what people are hoping for.
But life is no longer that simple. Going to a specific IP to check results very often will give you something that will never see the light of day on google.com -- or google-dot-anything. Between the data at a given IP address (even a "live" one) and final search results are a few more mysterious steps -- filters, geo-targeting, and apparently other kinds of customization.
You can use a Firefox extension like LiveHTTPHeaders and discover which Google IP address is supplying the data at that moment. But if you go directly to that IP address, you may not see the same results. Some IP addresses may even hold experiments that stand no chance of ever going live.
It's all a bit much sometimes, and today I find data center watching to be more of an interesting hobby than a way to discover any actionable information.
Just before Big Daddy, or whatever the new system is called now (the names change every third second)... anyhow, about 10 months ago many of us had the great joy of using something which actually DID WORK, I would "guesstimate", about 80% of the time. It was a site known for LIVE PR which was run in Sweden. I am not posting the URI as that is not allowed. The site is no longer able to offer more than the PRESENT PR (which is actually nothing much). I post this because there was, until recently, a means to view the LIVE PR.
Hi again, and thanks for replying!
Well, to be honest, more than getting all 50K pages indexed, what I am trying to do is understand Google’s behavior.
I saw a video by Matt Cutts in which he explained that if a site submits large amounts of pages in a short period of time, the site could get flagged and held back from ranking well.
He explains that “large amounts of pages” is relative to the number of pages the site already has in the index and to the time window in which the new pages are being submitted. For example, an authority site with 100k indexed pages could easily submit 5k pages in a week without having any problems at all; on the other hand, a new site with a few hundred indexed pages would get flagged if it submitted the same amount in the same time.
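Matt's rule of thumb, as described here, is a ratio rather than a fixed number. As a caricature only (the 0.5 ratio and the weekly window are invented for illustration; Google has published no such formula), the idea might look like this:

```python
def looks_like_bulk_dump(new_urls, indexed_pages, window_days, ratio=0.5):
    """Toy version of the 'too many new urls' idea: flag a site whose
    weekly rate of new URLs is large relative to what it already has
    indexed. The ratio and window are invented for illustration."""
    weekly_rate = new_urls * 7 / window_days
    return weekly_rate > ratio * indexed_pages

# Authority site: 5k new pages in a week against 100k already indexed
print(looks_like_bulk_dump(5000, 100_000, 7))  # False: well under the ratio
# New site: the same 5k against only 300 indexed pages
print(looks_like_bulk_dump(5000, 300, 7))      # True: flagged
```

The point of the caricature is just that the same absolute number of new pages can be harmless for one site and a red flag for another.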
So it sounded like this could be the problem I am facing, so I deleted the sitemap and created a new one submitting only 1,000 pages (in addition to the 3k we already had indexed). The outcome was that after 12 hours the webmaster console changed the message; now it says “Googlebot last successfully accessed your home page on Dec 28, 2006. Pages from your site are included in Google's index”.
When we check the indexed pages count by DC, it shows 127 supplemental pages for all DCs; however, our index page is not among them. But if we conduct a Google search using our domain name as the keyword, it does show up in the results.
What do you think?
Can you give me a link to the Matt Cutts video that has the information about adding too many items to your website at a time. I looked through the various videos, but I could not find it.
I haven't seen Matt mention this on a video, but I know he has a regular blog post on the topic.
Our discussion of Matt's post on too many new urls [webmasterworld.com]
Matt's blog page on too many new urls [mattcutts.com]
I am not sure right now, but I think I saw it here:
I just watched it... you were correct... THANKS
Yes, I definitely appreciated the link. Turns out I had watched it before.
Interesting to note the question of scale here. Matt says that a few thousand new urls at a time shouldn't be a problem (as of last summer), but millions at once can be. So I'm not totally clear where the trigger point might lie, and whether it's related to the percentage of all urls on the domain, or it's just a "hard" number.
We may be learning more right now - a client with 250,000 urls indexed is adding a couple million more. We released about 8,000 to the bots a few weeks ago with no trouble. Some of the new urls really don't need to be indexed ever, but probably half of them do make sense as a search result. We want to get those all released, but do intend on continuing to pace ourselves. I will definitely post about it if we bang into any troubles as we go along.
I HAVE been using sitemaps, but every time I submit a sitemap my site gets penalized. I know people say that this is not true, but every time I add my sitemap my pages disappear, so I am not submitting sitemaps anymore. If I use the tool, I just submit my site without a sitemap. As far as I am concerned, sitemaps are just crap.
The other day I told them that my preferred domain is WWW, and what happened? Yes, you are right! Lots of non-www pages show up in their serps. GEEEESSSHHHHhh..... what next? Block Googlebot entirely to get indexed?
Would you let us know the outcome of the submission of your client's new urls? We would love to know what happened; did you have any problems?
Thanks in advance,
Sure -- but just so you understand, I never submit urls to a search engine. I just block the robots from crawling or indexing them by using either robots.txt or the robots meta tag noindex, then I release the blocks in batches so they can be spidered. Will let you know.
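To make the batching concrete, here is one way the robots.txt side of that scheme could look (the directory names are invented for illustration; the real site's structure isn't shown in the thread):

```
# Hypothetical robots.txt during a staged release.
# Batch 1 has been removed from the block list, so it can now be crawled;
# batches 2 and 3 stay blocked until we are ready to release them.
User-agent: *
Disallow: /batch-2/
Disallow: /batch-3/
```

The same effect can be had per page with `<meta name="robots" content="noindex">`, which is handy when the batches don't map cleanly onto directories.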
Thanks again for replying to this post. I am curious now: why don’t you use Google’s sitemap app? I am pretty sure you have very good reasons; would you share them with us?
Some of our clients do use a Google sitemap and some do not. Especially for larger sites, I often prefer to see the results of Google's "natural" crawling -- it can illuminate areas of the site structure that we need to work on. If we were using an xml sitemap, I think that level of intelligence might be obscured.
Do they get all urls crawled naturally? If you had to estimate the percentage of urls that Google naturally crawls, on average, what would it be?
It seems to be very dependent on PR - as the Googlers have been telling us. Rough guess on a 250,000 url domain, if it has a PR 8 home page with a healthy profile of internal backlinks, it can see maybe 80% of its URLs crawled and indexed. But we don't obsess about that percentage, either.
We do look to create excellent organic search landing pages, and make sure they get good PR circulation plus offer the visitor access to what they would most need. It's more of a global approach that seems to work best, rather than hoping to force feed Google everything. I know they say "organize all the world's information", but really that's an ideal. So we aim to create a solid web business, one that continually improves, and we leave perfection to the perfectionists.