It's the third time this year that this has happened and I don't know how it got fixed the first two times.
It's indexing old pages that have a link to the new post, but it does not want to index the new post.
I wish I could post my website URL here to get an opinion on what's wrong.
Please help. Is this some kind of penalty?
I'm beginning to lose my mind over this, literally.
A penalty that delays indexing for part of a day? That's hardly likely, IMO. Penalties drive your URLs down the rankings or remove them altogether.
Spidering and indexing are two separate steps - they must be in a data set as large as Google's. So we just can't think of Google the way we would think of a MySQL, Access or Oracle database, where a record is findable as soon as it is added.
Your situation sounds to me like one of these:
1. An infrastructure change on Google's back end, possibly a temporary re-allocation of resources.
2. A different classification of your blog, so its "freshness" in the search results is now a second-tier priority, not top tier.
From your report, your new urls show up in less than a day, even though they may not migrate to all data centers for a while. Many people would envy that situation! And even a close inspection of your website is not likely to add any further insight.
So I'm not sure you've got a problem here. Do your server logs show that Googlebot still comes by an hour or so after the Feedburner ping?
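One rough way to check that in the logs is sketched below. It is only a sketch, assuming the server writes the common "combined" access log format; the function name `googlebot_hits`, the two-hour window, and the sample path and timestamps are made up for illustration. It matches on the User-Agent string only, which anyone can fake.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed layout (combined log format):
# ip - - [time] "METHOD path proto" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] "\S+ (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(lines, path, ping_time, window=timedelta(hours=2)):
    """Return the times at which a Googlebot user agent requested `path`
    between `ping_time` and `ping_time + window`."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or m.group("path") != path:
            continue
        if "Googlebot" not in m.group("agent"):
            continue
        when = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        if ping_time <= when <= ping_time + window:
            hits.append(when)
    return hits

if __name__ == "__main__":
    # Hypothetical log line and ping time, for illustration only.
    sample = [
        '66.249.66.1 - - [10/Oct/2008:13:55:36 +0000] "GET /new-post.html HTTP/1.1" '
        '200 2326 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    ]
    ping = datetime(2008, 10, 10, 13, 0, 0, tzinfo=timezone.utc)
    for when in googlebot_hits(sample, "/new-post.html", ping):
        print("Googlebot fetched the post at", when.isoformat())
```

If this returns hits for a post that still isn't in the results, the page was spidered and the delay is on the indexing side.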
Yes, it does. Sometimes the bot fetches the new URL 2 or 3 times in the hour following the Feedburner ping. The URL just does not appear in the results.
I'm not so worried about ranking because I always post original content. I know this because before I post anything I do a search and it returns no results. So, theoretically, my post should be the only result, or at least on the first page.
It is a problem because I post original content. Scraper sites copy everything and they get indexed faster and appear in the search results and get all the traffic. So practically I'm working in vain.
Another thing I don't understand:
When a new URL eventually turns up in the results, it says it was indexed 7 hours ago, although it started appearing in the results just 5 minutes ago. Could this be a geolocation issue? (The new page gets indexed in a distant datacenter and needs more time to get into the main index.)
Something I forgot to mention, which started a couple of weeks ago: on any given day, a number of posts were indexed as usual, and a few of them did not get indexed until the next day.
Since last Thursday, none of the new posts gets indexed until the next day.
[quote]it says it was indexed 7 hours ago[/quote]
That's when the spider recorded the page. As I said before, spidering and actually showing up in the index are different stages, but the timestamp is for when Googlebot got the source code from your server.
Yes, there's a change in your pattern, and I can sympathize with your concern about scraper sites - although I doubt that there's much you can do to change it. Do you have Webmaster Tools set up - and do you watch it for feedback from Google?
Yes, I have Webmaster Tools set up. No feedback from Google.
As for the timestamp... Googlebot gets the source code much earlier than the timestamp says, so I have some doubts that the timestamp shows when the bot gets the source code. (It usually downloads a new post within about an hour of the Feedburner ping.)
How come I never see a timestamp that's less than 7 hours?
Until this situation I was able to see any timestamp from a few seconds to 22 hours. (From what you're saying, a new post was indexed as soon as it was spidered.)
This is the third time this situation has happened. I'm beginning to believe that there's a time penalty of some sort, but I can't figure out the reason (I can't figure out how it got solved the first two times either) because I'm playing by the rules.
|time penalty of some sort |
I don't think so. Google's spidering and indexing behavior is, as far as we know, algorithmically driven, based largely on PageRank. The settings sometimes slip and slide a notch or two. When they do, it's logical that some sites will see changes in the frequency of spidering and the speed of indexing.
I actually don't think you have a "problem" here; this is just a slight modification to G's behavior.
I've seen blogs with less PageRank that get indexed with no problems, so I don't think PageRank is a factor in this matter.
If there's no penalty, then some settings are certainly changing, and I think those settings concern which datacenter the bot assigns my domain to. I say this because I've observed that the datacenters update far more slowly than usual.
|I've seen blogs with less PageRank that get indexed with no problems, so I don't think PageRank is a factor in this matter. |
I didn't say it was *all* PageRank. Obviously there are other factors, and it can be as simple or as complicated as the folks at Google would like it to be.
|I know this because before I post anything I do a search and it returns no results. So, theoretically, my post should be the only result, or at least on the first page. |
Just a quick question: are you blogging for Google or for the visitors of your site?
It seems to me you are over-obsessed with your Google Rankings... just do what you do - write original content - and Google will follow (eventually).
|It seems to me you are over-obsessed with your Google Rankings |
That's just my point. I don't care about ranking because I'm usually the first to post on a particular subject. But if scraper sites get indexed faster than me, there's no point in posting at all.
Here's just a wild thought, assuming that speed of getting indexed is what actually makes the difference when fighting scraper sites. How about a small script that checks who's asking for a page and gives a 404 or a blank page until Googlebot requests it - after which the script shuts down and the page is published?
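Something like this is what I have in mind, as a very rough sketch. The class name `FirstCrawlGate` and its method are made up for illustration, and it assumes the bot can be recognized by the word "Googlebot" in the User-Agent header, which any scraper could fake.

```python
class FirstCrawlGate:
    """Hold a page back until a request claiming to be Googlebot has fetched it.

    Hypothetical helper: it only decides which HTTP status to send and
    would have to be wired into whatever serves the blog pages.
    """

    def __init__(self):
        # Paths that have already been fetched by a Googlebot user agent.
        self.released = set()

    def status_for(self, path, user_agent):
        """Return the HTTP status code to send for this request."""
        if path in self.released:
            return 200  # already published to everyone
        if "Googlebot" in user_agent:
            # Assumption: the User-Agent header is trusted as-is.
            self.released.add(path)
            return 200
        return 404  # not published yet for other visitors
```

The first request with a Googlebot user agent gets the page and releases it; everyone asking before that gets a 404, and everyone after gets the page as normal.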