
Forum Moderators: Robert Charlton & goodroi

Googlebot rendering old pages without re-crawling them

     
1:00 am on Jul 1, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 12, 2004
posts: 656
votes: 13


I was just going to ask what the H* is going on with the Google cache. The problem is not the 'cache not found' error, nor the old cache dates. My problem is that pages deleted years ago still show up in the SERPs, and they have fresh cache dates.

I have pages that were deleted more than a year ago, and after reading about Yoast's 'attachment pages' issue & their recommendation to create a sitemap for URLs that are gone, I created a sitemap for the deleted pages.

After a couple of days, to my surprise, Search Console reported thousands of them (more than half of the deleted pages) as valid, and the site:url operator also showed them as indexed.

The number of deleted (but still indexed) pages started to drop, and it is still dropping slowly, but the ones that remain indexed are somehow showing recent 'snapshot' dates. For example, a page deleted 2 years ago is cached with an April 2019 snapshot date (even though the copyright date on the cached page is 2017).

I checked all the headers, caches, etc. The pages return correct 404 or 410 responses, and Search Console's live tests also confirm the 404 error.
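For reference, this is roughly the kind of spot-check I mean; a minimal sketch with placeholder URLs, using Python's requests library:

# Rough spot-check: confirm deleted URLs return 404/410.
# The URLs below are placeholders, not my real pages.
import requests

DELETED_URLS = [
    "https://example.com/article/00001",
    "https://example.com/article/00002",
]

for url in DELETED_URLS:
    resp = requests.get(url, allow_redirects=False, timeout=10)
    ok = resp.status_code in (404, 410)
    print(f"{resp.status_code}  {'gone as expected' if ok else 'UNEXPECTED'}  {url}")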

When I inspect the URL, the Google Index section shows April 2019 as the 'Last crawl,' and the 'Page fetch' status is 'Successful.' (Once again, the page was deleted 2 years ago.) Then I noticed, in the rendered page's header, a link tag for a CSS file that was only recently added and is injected via JS.

I figured Google is not re-crawling the deleted URL; instead it's rendering the page source code it has from 2 years ago and calling it a successful 'Page fetch.' The 'Crawled page' code in Search Console is also the 2-year-old source code, rendered with recent/updated JS files.
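If anyone wants to check their own pages for this mixing, a rough diagnostic (a sketch only; the saved file name and URL are placeholders) is to save the 'Crawled page' HTML from GSC locally, pull out the CSS/JS references, and ask the live server when each asset was last modified. Anything newer than the page itself was added at render time, not crawl time:

# Sketch: list the CSS/JS assets referenced by the "Crawled page" HTML
# (saved locally as crawled_page.html, a placeholder name) and print
# each asset's Last-Modified header from the live server.
import re
from urllib.parse import urljoin
import requests

PAGE_URL = "https://example.com/article/00001"  # placeholder for the inspected URL
html = open("crawled_page.html", encoding="utf-8").read()

refs = re.findall(r'(?:href|src)="([^"]+)"', html)
assets = sorted({r for r in refs if r.split("?")[0].endswith((".css", ".js"))})

for asset in assets:
    asset_url = urljoin(PAGE_URL, asset)
    head = requests.head(asset_url, allow_redirects=True, timeout=10)
    print(head.headers.get("Last-Modified", "n/a"), asset_url)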

I've no idea if this is a bug, or if they're trying to catch up on rendering their historical source code data.
5:38 am on July 1, 2019 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2000
posts:12311
votes: 396


Levo, did you ever use a header checker to confirm that these urls were actually gone? What's their header response now?

6:01 am on July 1, 2019 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:9914
votes: 972


Regardless, g never forgets (and probably never deletes) any url/page they have encountered. Tales to the contrary are not to be believed.
6:24 am on July 1, 2019 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2000
posts:12311
votes: 396


g never forgets

tangor, while that's very true, such pages usually disappear from Search Console in less than a year.

9:33 am on July 1, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 12, 2004
posts: 656
votes: 13


Levo, did you ever use a header checker to confirm that these urls were actually gone? What's their header response now?


Yes, the response is 'HTTP/1.1 404 Not Found,' and Search Console's live test also reports 'Failed: Not found (404).' Also, I'm using a custom CMS and the content of the page was deleted from the database, so if the server somehow returned 200, the cached content would be blank & the copyright would be 2019. BTW, thanks to the sitemap for deleted pages, the number of 'ghost' pages is dropping, albeit slowly.

The issue is that, at least for my website, Google is rendering crawled source code from 2 years ago and calling it a successful page fetch, with a recent 'Last crawl' date in Search Console & a recent 'snapshot' date on the cache page. The rendered/cached pages are also full of broken links and missing images, and look awful due to mixing two-year-old source code with recent CSS/JS files.
12:36 am on July 3, 2019 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2000
posts:12311
votes: 396


The issue is that, at least for my website, Google is rendering crawled source code from 2 years ago and calling it a successful page fetch, with a recent 'Last crawl' date in Search Console & a recent 'snapshot' date on the cache page. The rendered/cached pages are also full of broken links and missing images, and look awful due to mixing two-year-old source code with recent CSS/JS files.

levo, somewhere, I think in a Yoast forum, I remember reading about some garbled page fragments that sound pretty much like what you are describing. Your issue doesn't sound so much like a Google problem as a problem that Google is having figuring out some beat-up page fragments that were once a Yoast problem, but have since gone through the washing machine (i.e., you've made a lot of changes). In what I remember reading, the messed-up pages related to WordPress "attachment pages", and were reported as first appearing sometime after the Yoast upgrade bug appeared, roughly in May 2018.

Yoast describes WP media attachment pages this way...
When you upload an image in WordPress, WordPress does not only store the image, it also creates a separate so-called attachment URL for every image. These attachment URLs are very "thin": they have little to no content outside of the image.

Here's the Yoast blog article that's the source of the above description, and is also about fixing the bug issue. I don't know whether you could apply it to media attachment pages left over from a site on which you've made a lot of navigation changes. Possibly, the changes you made shouldn't matter, though I don't know.

Media / attachment URL: What to do with them?
30 May 2018
[yoast.com...]

I'll quote only the intro to the article....
In our major Yoast SEO 7.0 update, there was a bug concerning attachment URLs. We quickly resolved the bug, but some people have suffered anyhow (because they updated before our patch). This post serves both as a warning and an apology....

The article describes dealing with the pages/urls under a variety of circumstances, focusing mainly on Yoast plug-in settings.

---

The temporary sitemap you refer to, the one Yoast suggested, reminds me of this SEL article about John Mueller's suggestion of using them to remove a lot of pages, but you should check it carefully to make sure it applies to your issue. Do not use Google's bulk page removal tool in GSC.

Need to expedite page removal in Google’s search index? Try a temporary sitemap file
Do you need to remove a lot of content from Google quickly? Here is how to do it if you own the site.
Barry Schwartz on January 8, 2019 at 9:29 am

[searchengineland.com...]

...But if you have hundreds or thousands of pages, removing URLs one by one can be time-consuming. John Mueller, a Google Webmaster Trends Analyst, said you can use temporary sitemaps. First you 404 or set the pages to noindex. Then you upload a temporary sitemap file with the URLs you want removed, but make sure to list them with the last modification date as of the date you set them to 404. It can help speed things up by giving Google a hint to look at these pages because they have changed. When they figure out the pages have changed and are 404ed, Google may remove them faster.

Because of my time limitations, all of the above is pretty sketchy, but I'm hoping it helps. Please let us know.
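For reference, a minimal sketch of what such a temporary sitemap might look like when generated by a short script; the URLs, dates, and file name are placeholders, and following Mueller's suggestion the lastmod is the date each URL started returning 404/410, not the original publish date:

# Minimal sketch: write a temporary sitemap for removed URLs, with
# lastmod set to the date each URL started returning 404/410.
# The URLs, dates, and output file name below are placeholders.
from xml.sax.saxutils import escape

removed = {
    "https://example.com/article/00001": "2017-06-15",
    "https://example.com/article/00002": "2018-02-03",
}

lines = ['<?xml version="1.0" encoding="UTF-8"?>',
         '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
for url, gone_date in removed.items():
    lines.append("  <url><loc>%s</loc><lastmod>%s</lastmod></url>" % (escape(url), gone_date))
lines.append("</urlset>")

with open("sitemap-removed.xml", "w", encoding="utf-8") as f:
    f.write("\n".join(lines))

You'd then submit the file in GSC like any other sitemap and, per Mueller's note, remove it again after a few months.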

1:11 am on July 3, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 12, 2004
posts: 656
votes: 13


Thanks Robert, but I never used WordPress or Yoast. I just used their suggestion of creating a temporary sitemap to expedite the removal of indexed pages from Google. Adding that temporary sitemap to GSC led me to more information on the deleted URLs on my website.

I'm still not sure if this is a bug or a sneak peek at things to come. The pages in question were thin, so I've been combing them out. Some of them have been returning a 404 error for the last 2 years, yet they're still in Google's index.

Instead of dropping them from the index, Googlebot is rendering the last successfully crawled code it has from 2 years ago, and updating the cache, snapshot date, and last crawl date as if the crawl was successful.

If it's a bug, it could be related to the recent cache date problem: [seroundtable.com...]
If not, they may be rendering old code to look for past shenanigans.
6:18 am on July 3, 2019 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator robert_charlton is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 11, 2000
posts:12311
votes: 396


I never used WordPress or Yoast. I just used their suggestion of creating a temporary sitemap to expedite the removal of indexed pages from Google

levo, that's good to know. ;) I thought that your comments about Yoast and the attachment pages suggested that your extra pages were connected to the Yoast plug-in update bug.

Somewhere, there is a description of garbage pages, as you sort of describe, which started appearing after the plug-in upgrade.

---

Anyway, John Mueller also suggested temporary sitemaps, and the specificity of his comments... regarding the last modification date, and why... is what stood out for me...
...a temporary sitemap file listing these URLs with the last modification date (eg, when you changed them to 404 or added a noindex), so that we know to recrawl & reprocess them....This is something you’d just want to do for a limited time (maybe a few months), and then remove...

Worth noting that Google suggests disregarding the cache date, as it's not always about what you think it is. Take a look at my post in this thread, which ultimately also references the same seroundtable thread that you do....

Recent copy of cache error 404
May, 2018
https://www.webmasterworld.com/google/4902307.htm [webmasterworld.com]

8:43 am on July 3, 2019 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Dec 12, 2004
posts: 656
votes: 13


This is different; this time the cache date tells me that there are months and even years of gap between the 'first wave of indexing' (crawling) and the 'second wave of indexing' (rendering). [searchenginejournal.com...]

Let's say I have an old and thin article page, https://example.com/article/00001, which I deleted around June 2017. The URL has been returning a 410 error since then. Recently I created a temporary sitemap and included a list of deleted articles in it.

Once Google processed the sitemap, it told me that https://example.com/article/00001 is:

1- Valid: Submitted and indexed (GSC URL Inspection)
2- Last crawled Apr 12, 2019; Page fetch: Successful (GSC URL Inspection)
3- In the GSC Inspection section > Crawled page, the HTTP response is:

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Server: Apache
Cache-Control: max-age=90
Last-Modified: Wed, 19 Apr 2017 00:14:37 GMT
Content-Language: en-US
Content-Encoding: gzip
Content-Length: 4058
Date: Wed, 19 Apr 2017 00:14:37 GMT
Connection: keep-alive
Vary: Accept-Encoding
X-Google-Crawl-Date: Thu, 11 Apr 2019 23:49:15 GMT


4- In the Inspection section > Crawled page, the HTML shown is the 2-year-old source code, recently 'second-wave' rendered.
5- According to the server logs, this URL was last crawled by Googlebot on June 25, 2019, and the response status was 410 (see the log-parsing sketch after this list).
6- The page is still in Google's index; it shows up when I search with site:url or even with the title.
7- Google's cache has an Apr 11, 2019 23:49:15 GMT snapshot date.
8- Live test results in: 'URL is not available to Google,' Page fetch: 'Failed: Not found (404)'
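(Re item 5, this is roughly how I pull Googlebot hits for a URL out of the access log; a sketch only, assuming a standard combined-format log, with the log path and article path as placeholders.)

# Sketch: print the date and status of every Googlebot request for a
# given path, from a combined-format access log. The log path and
# target path are placeholders; adjust for your own server.
import re

LOG_PATH = "/var/log/apache2/access.log"
TARGET_PATH = "/article/00001"

line_re = re.compile(r'\[(?P<date>[^\]]+)\] "(?:GET|HEAD) (?P<path>\S+) \S+" (?P<status>\d{3})')

with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if "Googlebot" not in line:
            continue
        m = line_re.search(line)
        if m and m.group("path") == TARGET_PATH:
            print(m.group("date"), m.group("status"))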

Instead of removing the URL, for some reason, Google is ignoring the result of its 'first wave of indexing' (which was, and is, a 410) and instead does a 'second wave of indexing' using years-old code. As a result, Google thinks the URL still exists, updates the last crawl date and its cache, and keeps it in its index. This is not a simple date error; the crawled code in GSC has hints that it's been recently rendered.

PS. I wondered whether creating the temporary sitemap had triggered a bug and brought the URL back into Google's index, but the Performance report shows that it has been receiving impressions for the last 2 years.