Forum Moderators: Robert Charlton & goodroi

Message Too Old, No Replies

Why is updating slower than crawl?

         

graeme_p

12:24 pm on Aug 5, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



According to the number of pages crawled per day in Google webmaster tools, Google crawls enough pages to reindex my entire site every two days at the "low" rate, and the whole site daily at the "high" rate.

So why do changes to the site take so long to be reflected in the SERPS? Changes to the site navigation are still apparently not affecting the serps - I did a search like

site:example.com -"phrase in top navigation"


and got a number of results approximately equal to 10% of the site - and I added the phrase to the navigation weeks ago.

I also expected the change to the navigation to affect site links (because several of the previous ones have been removed from the navigation), and they are still the same in both the full and one line versions.

What is the point of regular re-crawling if major changes do not affect the SERPS? Is it an indicator of any kind of problem?

JD_Toims

8:40 pm on Aug 5, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I've noticed the same type of behavior, especially with the cache, and I'm guessing it's due to a couple of things:

1.) They have a *huge* system to process the data through and it just keeps getting bigger.

2.) This is pure speculation, because I don't have "proof" or a reference for it, but I personally think they may have started "waiting to trust changes" rather than "reacting instantly" to them, much the same way as they started "waiting to trust 301 redirects" a while ago. (So, basically I think they're spidering, noting the changes, re-spidering and if the changes remain in place for [some period of time] or [some number of spidering runs] the changes are considered "permanent" and the results are updated to reflect them.)

aakk9999

10:41 pm on Aug 5, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



3.) Google is crawling certain URLs many times over, and other URLs are not being crawled often (i.e. not crawled daily/bi-daily on your site). E.g. the home page may be crawled several times per day, whilst some internal page may be crawled once every two weeks. (E.g. whilst the number of pages crawled may be 50/day, these 50 pages may in fact be only 7 different pages being requested, some of them multiple times.)

4.) Google is crawling old URLs that do not exist any more and also URLs that return the same content owing to addition of query string parameters. Googlebot even tries to add spurious parameters to URLs itself, just to see if something is returned under that URL.

Check your server logs to see exactly what (and how many pages) Googlebot is crawling.
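One way to get that breakdown from a standard Apache/Nginx combined-format access log is a short script like the sketch below. The function name and the sample log lines are invented for illustration; in practice you'd read your real log file (and ideally verify the client IPs actually belong to Google).

```python
from collections import Counter

def googlebot_url_counts(log_lines):
    """Count requests per URL for log lines whose user-agent mentions Googlebot.

    Assumes the combined log format: the request ("METHOD /path HTTP/1.x")
    is the first quoted field and the user-agent is the third quoted field.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split('"')
        if len(parts) < 6:
            continue  # malformed or non-combined-format line
        request, user_agent = parts[1], parts[5]
        if "Googlebot" not in user_agent:
            continue
        fields = request.split()
        if len(fields) == 3:  # METHOD, path, protocol
            counts[fields[1]] += 1
    return counts

# Hypothetical sample entries standing in for a real access log file
sample = [
    '66.249.66.1 - - [05/Aug/2013:12:00:01 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [05/Aug/2013:12:03:11 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [05/Aug/2013:12:05:42 +0000] "GET /about.html HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '10.0.0.7 - - [05/Aug/2013:12:06:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows NT 6.1)"',
]

for url, n in googlebot_url_counts(sample).most_common():
    print(url, n)
```

Sorting by count makes it obvious when a handful of URLs account for most of the "pages crawled per day" figure.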

lucy24

6:01 am on Aug 6, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Further speculation: they've got a real index and a temporary "holding" index, and it takes time for information to move from one to the other.

Two things I've seen in gwt:

When I added a slew of new pages all at once-- by pagecount alone, close to 25% of total site-- there was a one-day upward hiccup in "pages indexed" in gwt, even though all those new pages were no-indexed. Then the graph returned to normal.

Elsewhere, "HTML improvements" (or errors or whatever it calls them) made complaints about one specific page that was no-indexed from the moment of its creation. The complaints presently disappeared, but it took more than a day. Maybe as much as a week.

Tentative reading: g### has become so vast that sometimes it can take several days for the right hand to find out what the left hand is doing.

I took a quick look at Googlebot crawls over a 4-day block, and found that two specific directory-index pages accounted for 40-45% of page crawls. EACH. They are, in my own opinion, the most important directories-- but neither traffic nor rate of change would warrant that kind of attention. Huh.

brotherhood of LAN

6:33 am on Aug 6, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



In line with the above, you can fairly safely assume that crawling & subsequent updates reflected in SERPs are not synonymous.

The same goes for any large index, really; look at the likes of Majestic and all you're really seeing is a snapshot in time, not real time.

Some stuff will get pushed into the index quicker than other pages. A prominent site like this one can see pages indexed & visible within a few hours... remember a good few years back it was all the rage to have "minty fresh [mattcutts.com]" pages in the index.

graeme_p

7:51 am on Aug 6, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



There may be an element of some pages not being crawled, but that is not the whole answer, so time to process is definitely a major element.

graeme_p

8:05 am on Aug 6, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



All the pages that have not been reindexed have no toolbar PR, whereas almost all other inner pages on the site are 2 or 3.

I also wonder whether changes to structure and navigation take longer to be reflected in the SERPS than changes to content.

Robert Charlton

8:21 am on Aug 6, 2013 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



A top-of-my head, perhaps simplistic view, not guaranteed to be accurate...

- Crawling is simply indexing or re-indexing your site.

- Reranking involves... in addition to crawling your site... crawling and reindexing the rest of the web, and re-computing algo factors that affect comparative rankings.

As I understand it, the Google databases (there are many of them) are so large and so complex that there are databases just to keep track of the rules for the order in which tables are modified or queried.

Some computations and operations are necessarily done cyclically... there's a recursive aspect to them. They can't all be done simultaneously in real time. So, how the cycles happen to mesh, and where your changes happen to fall within those interdependent cycles, will also affect where you are in the chain. ;)

Additionally, yes to #s 1, 2, 3, 4, as well as to "further speculation" and "in line with the above"... as mentioned above.

PS: Link-related factors (including navigation) probably involve more recursion and computation than content-related factors do.

graeme_p

12:26 pm on Aug 6, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



In my case I think I can eliminate 4 on the basis of server logs and Webmaster Tools.

3 is interesting - 900+ URLs regularly crawled, 100 not crawled after weeks.

JS_Harris

12:50 pm on Aug 6, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



According to the number of pages crawled per day in Google webmaster tools, Google crawls enough pages to reindex my entire site every two days at the "low" rate, and the whole site daily at the "high" rate.

These are not full page crawls. Google often just asks for a header response to see the last-updated date and file size. If nothing has changed in the header response, they may not crawl the rest of the page at all, but since they checked, it's still reported as a crawl.

You can verify that by checking your server logs. If GWT says they crawled 200 times yesterday, you will probably not have 200 hits in your server logs for Googlebot unless you record HEAD requests in your logs.
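To check how many of those hits are HEAD versus GET requests, something like the following works against a combined-format log. This is a sketch; the function name and sample entries are made up for illustration.

```python
from collections import Counter

def googlebot_method_counts(log_lines):
    """Tally HTTP methods (GET, HEAD, ...) used by requests whose
    user-agent string mentions Googlebot.

    Assumes the Apache/Nginx combined log format: the request line is
    the first quoted field and the user-agent is the third quoted field.
    """
    methods = Counter()
    for line in log_lines:
        parts = line.split('"')
        if len(parts) < 6 or "Googlebot" not in parts[5]:
            continue
        request = parts[1].split()
        if request:  # guard against malformed request lines
            methods[request[0]] += 1
    return methods

# Hypothetical sample entries standing in for a real access log file
sample = [
    '66.249.66.1 - - [06/Aug/2013:08:00:01 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [06/Aug/2013:08:01:30 +0000] "HEAD /index.html HTTP/1.1" 200 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [06/Aug/2013:08:02:15 +0000] "GET /page2.html HTTP/1.1" 304 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

print(googlebot_method_counts(sample))
```

One caveat: a conditional GET (with If-Modified-Since) that gets a 304 response is still logged as a GET, so lightweight freshness checks can show up in the logs without any HEAD requests at all.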

graeme_p

1:15 pm on Aug 6, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I am talking about a sitewide change, so every page has changed.

My server logs show no HEAD requests, and the number of GET requests roughly matches those in GWT.

Looking at the access logs for July, it looks as though it is a combination of infrequent crawling of some pages and slow processing of the site-wide and home page changes.

I am not surprised that Google takes some time to process changes, but I am surprised by how long it takes. Surely the crawl rate should adapt to the fact that there has been a site wide change? And if a crawl finds significant changes then recalculations should also happen sooner?

JD_Toims

5:53 pm on Aug 6, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My server logs show no HEAD requests...

No actual Googlebot will make a HEAD request. If you are or were seeing them, they were spoofed.

jimbeetle

8:05 pm on Aug 6, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



All the pages that have not been reindexed have no toolbar PR, whereas almost all other inner pages on the site are 2 or 3.

That makes sense. In the distant past, PR was part of the crawl budget algo (we haven't talked about that in some time; I have no idea if it still works the same way), so it would be probable/possible that PR is part of indexing priority.