|Google not indexing new sites as it used to|
All my previous sites have been indexed within a couple of days by building backlinks from various older sites, social bookmarking, submitting a sitemap, etc., you know the drill.
The last site of mine, created two weeks ago, still doesn't come up with any result in Google when doing site:example.com. When I log into Webmaster Tools, my sitemap info shows how many URLs have been submitted, almost all of them (25), but next to indexed URLs is one big 0.
Anyone else experiencing the same thing?
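For what it's worth, resubmitting a sitemap is easy to script. A minimal sketch, where example.com and the sitemap path are placeholders (the ping endpoint is Google's documented sitemap ping URL):

```python
# Build the Google sitemap "ping" URL that asks Google to re-fetch a sitemap.
# example.com/sitemap.xml is a placeholder, not a URL from this thread.
from urllib.parse import urlencode

def sitemap_ping_url(sitemap_url: str) -> str:
    """Return the ping URL; fetching it requests a re-read of the sitemap."""
    return "https://www.google.com/ping?" + urlencode({"sitemap": sitemap_url})

print(sitemap_ping_url("https://example.com/sitemap.xml"))
```

Fetching the printed URL (e.g. with urllib.request.urlopen) is equivalent to resubmitting the sitemap in Webmaster Tools, though it doesn't guarantee any faster indexing.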
I have verified other new sites in my niche, and it's the same situation. I think it's quite possible this is not a bug.
Can good quality links resolve the indexing situation?
Yeah, I mean there was talk of a "sandbox" a while back, is this the new sandbox? New sites just get hosed for six months? My site has been online since late September, getting maybe 50-100 visitors per day, but they've only indexed a fraction of my content.
And I'm only getting traffic for stuff that was mostly indexed right when the site launched.
I launched a new site at Xmas with two decent links and it's ranking fine already.
But we just rebuilt the main site (PR5, and index/ is generally recached once a week or so); it's been 19 days and the old site is still showing in the cache.
However, there were immediate (next-day) ranking moves of exactly what we were looking for with the changes.
They're ranking us on the new version, but still showing the old version.
I don't think G is very healthy at the moment either
New pages are also taking a while to get picked up and indexed this week. Two weeks ago I'd find a page indexed within a matter of hours of launch; recent pages are still not showing in the index a week later.
Fantastic - the headline page of a section launched 20th January is finally indexed (it can be found by unique phrases). It's a normal site, pays a living, and the new section is linked from the home page. It's not exactly real-time search.
Didn't notice any crawling at all from G for the first couple of weeks this month.
Presumably this is all about Caffeine being late, but I had to try very hard to resist a sarcastic comment on the recent thread that announced real time indexing of Myspace of all things.
Interesting find, I see JohnMu posting on the Google indexing forum, highly recommended.
|...real time indexing of Myspace of all things. |
And Facebook, too. [pcmag.com...]
It's also interesting to check whether the "Pages crawled per day" chart in WMT shows less crawling for a few weeks - as a result, traffic for a whole long-tail site is lower.
This even happens if no new URLs have been added to the site. Therefore I think steady re-crawling and re-indexing is necessary (even for existing URLs) to maintain traffic levels.
If traffic depends on a steady flow of new pages getting indexed and ranked, I can see that a crawling slow down would have a negative effect on traffic. But how would slower crawling affect traffic to existing URLs that are already indexed?
I see URLs with old cache dates (2-3 months old) being pushed back in the SERPs. Also, URLs which have not been crawled and indexed recently (x-y months old cache date, or no cache date at all) are dropped from the SERPs completely or don't rank at all.
I'm still testing, but this is what I see at the moment. Probably a more intuitive information architecture is necessary than before or more PR or more recent trusted links ...
G-bot is back for my newest site, came poking around late last night, & is still crawling.
& it crawled around for about 12 hours and left.
(Too late to edit my last post)
After hours of checking cache dates it's clear they are having problems with indexing deeper paginated results.
All URLs have some sort of fresh cache date, but almost all paginated results have older cache dates (2-3 months old). Deeper pagination URLs seem not to be favored and refreshed at the moment, so making the pagination more prominent or sending more PR into paginated results should be the key.
It works this way:
fresh cache date: always in the index
last indexed 1-2 months ago: still in the index
last indexed 3+ months ago: URLs show no cache link and, within a few days or weeks, drop from the index
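As a sketch, that rule of thumb could be written down like this (my own reading of the pattern described above, nothing official from Google):

```python
# Encode the observed cache-date pattern as a small classifier.
# The thresholds come straight from the post above; they are an
# observation about early-2010 behavior, not a documented rule.
def cache_status(months_since_last_index: int) -> str:
    """Rough status of a URL given months since Google last indexed it."""
    if months_since_last_index < 1:
        return "fresh: always in the index"
    if months_since_last_index <= 2:
        return "still in the index"
    return "no cache link; likely to drop from the index within weeks"
```

Running `cache_status(3)` on a URL's cache age would flag it as a drop candidate under this reading.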
I've noticed this problem ("new content not ranking") for a few months now with new sites, and with new sections on established, well-ranking sites. None of the pages in question were deep in the URL structure, at most level 2.
Also, there's a related problem with sites posting news: [webmasterworld.com...]
I've also noticed that site structure changes are a bit slow to be recognized. I'm not sure that issue is related, but it seems so.
I'm not sure if this is a bug, or an extra level of "caution", or probation time. Could be a "Sandbox version 2".
It seems odd that on one hand Google wants to add a whole lot of "fresh" content from a few specific sites like Twitter, Facebook, and whatnot - and on the other hand they stop fresh content from other sites from ranking anywhere.
I'm wondering if different topics/keywords/taxonomies are a factor here. In other words, at certain times, having fresh results for certain topics is more important to Google than other topical areas or taxonomies.
This might especially play out during periods of limitation to processing resources - I'm guessing that Caffeine has created a short term challenge in that area and processing resources may be limited. We've seen similar periods in the past, when Google's resources were sort of "pulled back" from intensive crawling and indexing, and then other periods where more resources were being devoted.
Enough cache date checking for today; it's funny, they blocked me because of possible automated queries ;)
But now I'm sure: if you don't see a cache link next to your listing, that URL is in trouble; it's a 99% candidate to be dropped from their index.
Check the cache dates for important sections of your site regularly; if they are 2+ months old, the cache link will disappear within a few weeks and your URL will be dropped.
I think it could be the same thing as I posted in the SERP & Update thread... If you were getting ready to switch from one type of database to another and the two are not compatible with each other, what would you do?
Keep updating the old one at 'hi-speed' or slow things down for a bit (possibly on both) so you can change over to the new one you can update at 'hyper-speed'?
How does what you're seeing look from this type of perspective... They're phasing out one type of dataset for another, and the two aren't compatible. I would think they would 'put the brakes on' for a bit until they've made the change.
MadScientist - yup, that makes some sense. Whenever we do big changes here, we shut everything down and do our work when no one is around; or worse, if it is an emergency and it has to be done during the day, the call goes out "we're getting kicked out of the database", everyone loses access, the fix is made, and then we're all allowed back in. Then, one by one, various processes are restarted as things look stable and OK.
So, on a larger scale at google, it is feasible they are going through the same process. Stop all the processing, toss everyone over to a temp database/dataset, do what has to be done, then switch everyone over to the new set. During this process, not much updating would go on because it simply is not the focus. "Gotta get everyone moved over to the new database, once we do that, then we'll worry about updating the data - the temp data we're using for now will be fine for a bit. At least we will be live and functional even if the data isn't perfect." Add in a snafu or a few problems during the switch, and the whole process gets prolonged and everyone stays with the temp data longer than had been anticipated. C'est la vie.
This month I see the lowest crawl rate ever with all of my sites across 7+ industries. This is unbelievable.
Are you noticing any filtering (domains suppressed in the -40 to -60 range) on any of those sites since mid January? I see some, but I can't tell if they are related to the ongoing changes/data issues or to a real site penalty. I have never seen such a slow crawl rate, and old cache dates.
crobb, I don't see any filtering at this time of year with my sites, I had domains (-40 -60) which were filtered in 2008 but after six months I moved them and they are fine now. Same content on new domains saved my business.
Back on topic: if you really would like to see how slow they are, use this in your virtual host:
SetEnvIfNoCase User-Agent Googlebot dolog
CustomLog /yourpath/yourdomain_googlebot_log common env=dolog
It has a nice side effect: you will see how often scammers spoof Googlebot.
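To separate the real Googlebot from the spoofers in that log, the usual check is a reverse DNS lookup followed by a forward lookup to confirm (genuine Googlebot hosts resolve under googlebot.com or google.com). A sketch, assuming Python is available on the server (the full check needs network access):

```python
# Reverse-and-forward DNS verification of a claimed Googlebot IP.
# This is the standard two-step check; a spoofer can fake the
# User-Agent string but not the DNS records for its IP.
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def looks_like_google_host(hostname: str) -> bool:
    """True if the reverse-DNS hostname sits under a Google crawl domain."""
    return hostname.rstrip(".").endswith(GOOGLE_SUFFIXES)

def verify_googlebot(ip: str) -> bool:
    """Reverse-resolve the IP, check the domain, then forward-confirm."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]               # reverse DNS
        if not looks_like_google_host(hostname):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]    # forward confirm
    except socket.error:
        return False
```

Run `verify_googlebot()` over the IPs in the Googlebot log to see how many entries are fakes.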
I'm thinking along the lines of a capacity restraint too. But then, "my" capacity restraint is a bit different in nature...
I'm not sure that the sites that are getting the "slow treatment" ATM will suddenly be updated at "hi-speed" once some DB transfer or other operation is done. It could well be that what we are seeing now is just the normal new Caffeine state of affairs. That is, until they pull some knob or twist some lever, or whatever it is they say they do when they do what they do at Google HQ. With all big things a few adjustments are needed, so whatever takes place at the moment may not be exactly the same in a few weeks.
What I'm thinking is that "hi-speed" isn't really dangerous for twitter updates, facebook statuses, and the like. These results will be replaced by something else in a matter of minutes, so it's no big deal if spam detection fails a little every now and then.
However, for real sites - especially good real sites - a good rank may well last for (well, a bit longer), so there's more reason for going through "the full algo" (or screening/ranking system) with these page types.
The capacity restraint is there because if you want to rank "fresh" (news) and "superfresh" (tweets, etc.) content types, you simply can't put it all through the complete Google algo. This - with its many bells and whistles, PageRank, filters, and so on and so forth - simply takes too long.
So, it seems to me that Google now has a "content-diversified" algo. Or several algos, if you want - some "light", some "heavy" in terms of computation requirements. And it depends on some kind of page classification scheme (like you've seen with the "show options" option on the SERPs).
For this to work, a necessary first step is that all pages go through some sort of classification of content type. I.e., "should this be fast-tracked or not?"
However, it seems that with the "inclusion" model they're using a whitelist instead of an automated system, at least for the "superfresh" category. Which may be a wise choice.
For "fresh", i.e. news sites, it seems that either you must qualify to get on that whitelist somehow, or they have implemented automated learning/classification... which doesn't seem to be quite up to speed yet. But at least, if it's the traditional Bayesian thing, it will get better eventually.
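Purely to illustrate the kind of triage being described here - everything in this sketch is invented, including the whitelist, and nothing like it has been published by Google:

```python
# Hypothetical "fast-track or full algo?" triage, matching the
# three tiers speculated about above. The whitelist entries are the
# sites named in this thread, used only as illustrative placeholders.
SUPERFRESH_WHITELIST = {"twitter.com", "facebook.com"}

def triage(domain: str, is_news_site: bool) -> str:
    """Decide which (speculative) pipeline a page would go through."""
    if domain in SUPERFRESH_WHITELIST:
        return "superfresh: light, near-real-time pipeline"
    if is_news_site:
        return "fresh: fast-tracked, reduced screening"
    return "standard: full algo (PageRank, filters, and the rest)"
```

The point of the sketch is just that a cheap upfront classification decides how much of the expensive ranking machinery each page gets.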
Well, that's the general direction of my thinking, ATM. I might be right and I might be wrong. And perhaps time will show :)
|it seems to me that Google now has a "content-diversified" algo. Or several algos, if you want |
I'd say you're on to something, claus. And getting a sense of which algo or algos are in play for any given query might be quite a useful skill at times.
I think we were saying something similar, only yours is much more readable and well thought out...
|I'm thinking along the lines of a capacity restraint too. |
But then, "my" capacity restraint is a bit different in nature...
I'm not sure that the sites that are getting the "slow treatment" ATM will suddenly be updated at "hi-speed" ...
Let me clarify a bit, because what I'm thinking is being seen goes along with what you and tedster are saying, I think...
Anyway, in my mind it's the same only a bit different:
The 'capacity restraint' (to borrow from claus) I was thinking about was not necessarily 'site specific', but rather specific to 'data increase & update by necessity'. As a byproduct of decreasing the speed of the data being updated in the 'mostly seen index' (Big Daddy), some sites and pages seem to be left out of the index, which may or may not be the case, and this seems in some ways to fall in line with what the two of you are saying:
|In other words, at certain times, having fresh results for certain topics is more important to Google than other topical areas or taxonomies. |
|And it depends on some kind of page classification scheme (like you've seen with the "show options" option on the SERPS). |
Here's another version of my thoughts: News, Tweets, and other 'extremely time sensitive' (er, politely 'stuff'), or anything 'very fresh + important' (stale v. fresh, PR, TR, etc.), would IMO be the emphasis (priority) for crawling, scoring and updating on the Big Daddy index (AFAIK the results most people currently see). The pages 'nearby' those (for lack of a better phrase) 'super fresh + important' pages would IMO see some benefit from the 'freshness priority crawling', because they would be closer to the 'current crawl priority' and therefore 'fresher', since in some ways freshness 'cascades' like PR.
I'm thinking they are trying to keep new data insertion and 'obsolete index' (Big Daddy) updates down during the change over, since the storage method is going to the recycle bin soon anyway...
Why would they keep updating Big Daddy, even at the old crawl rate when it's not going to be used? It could even be they are storing the new crawl data for all pages, including 'non priority pages' (not 'super fresh' results) in the Caffeine data structure and 'pushing' it (on a 'fresh dependent basis') to Big Daddy's index after a period of time... Or, only updating the Big Daddy Index from certain crawl cycles. (Or something to the same effect, go with the point.)
IMO, if they are doing one of the preceding and using the crawl data to update Caffeine directly rather than Big Daddy, it would explain quite a bit of the 'crawl to seen index' slowdown. Pages and new sites are still being spidered, even if not as fast, because they will probably be using some of the crawling resources during the changeover, which IMO would slow down Caffeine's indexing a bit too...
I guess my thoughts are: The slowdown of 'fresh is not critical for a period of N weeks or so' sites might only be in the 'seen by most people' (Big Daddy) indexing and updating of pages, but not the Caffeine index...
IOW, they could be indexing sites and pages in the new infrastructure most can't see yet (Caffeine), and the slowdown is 'more relative' to the updating of the old index (Big Daddy), which happens to be what most people see most of the time.
I think I've rambled my point out somewhere in the preceding, and the short version might best be put as a question and answer:
Would you update non-critical (doesn't need to be 'super timely' within a 7 to 21 day period of time, because no one except webmasters cares or notices) pages (sites) on a data storage mechanism you are replacing, or would you use your resources to change over and update the new data storage system you are replacing it with? Personally, I would work on getting the new one in place and keep the old, soon to be replaced storage system updated on an 'as needed' basis.
Here's another question in basic terms, to see my point about the slowdown or perceived slowdown in indexing, putting it in an 'everyday webmaster' situation... Would you update the ASP version of a page you were about to convert to PHP, or the HTML 4 version of a page you're converting to HTML 5, or would you concentrate on getting the new version in place? (IOW, they're doing relatively the same thing on a much larger scale.)
If I search on different TLDs for the indexed pages of a .com site with the site:www. command, I get different results.
For example, search for this forum on google.com and on google.de (or another country). On .com, 30% of the pages are reported. Any explanation or ideas for this?
Maybe they don't want to show all pages, like they did with the link: command?
Also, I have a Google search box on my site, and a lot of results are omitted now with this situation.
You probably have a free search box, I believe with the paid version you can choose which sections or URLs should be crawled.
Ok about the search box, but what do you think about the difference between pages indexed on .com and .de (my previous post)?
TheMadScientist - your theory matches what I'm seeing with a new section launched on an established site about 2 months ago.
In the UK results, pages will only show up for unique phrases in quotes.
Using a US proxy (it doesn't work without one of those) to look at the latest known Caffeine datacenters, the pages are showing on the first page for some decent search terms.
It's as if new pages are indexed in Big Daddy, but the processing hasn't been done to figure out if they are trusted or to credit them with links.
A section launched in November last year is doing slightly better, but is still very slow in Big Daddy and not completely indexed, so there may be a few months of data missing from the SERPs right now.
My forums are indexed freshly. I expect them to lose traffic when competing pages start getting indexed.
Lowest crawling rate ever for the last few days, more links, more PR, less crawling.
< moved from another location >
Anyone else having this issue?
Whenever I post, I like to do a Google search for "site:myurl.com" (Options = Past 24 hours)
and see if and how it indexes;
usually it takes about 5 minutes.
The last 2 days, NOTHING! All websites are dead....
I don't want to post any more until I find out what's going on... this is frustrating! Yahoo traffic is up though! GO YAHOO/BING!
[edited by: Robert_Charlton at 8:18 am (utc) on Mar 29, 2010]
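For anyone scripting that check: the "Past 24 hours" option corresponds to Google's tbs=qdr:d URL parameter. A sketch (myurl.com is the poster's placeholder domain):

```python
# Build the search URL for a "site:myurl.com, past 24 hours" query.
# tbs=qdr:d is the "past day" time restrict in Google's search URLs.
from urllib.parse import urlencode

def recent_site_query(domain: str) -> str:
    """Return the Google search URL restricted to the last 24 hours."""
    params = urlencode({"q": "site:" + domain, "tbs": "qdr:d"})
    return "https://www.google.com/search?" + params

print(recent_site_query("myurl.com"))
```

That URL opened in a browser reproduces the manual check; note that fetching it programmatically at volume tends to trip the same "automated queries" block mentioned earlier in the thread.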
SEOPTI: Yes, us too - crawl rates at the moment are about 5-10% of Dec 2009 numbers.