I just want to do what I can to let google know these pages are gone - no need to recrawl.
Google knows that these pages are gone, but it has its own reasons for recrawling.
By reporting the 404s, Google is just telling you that they requested the url for the page, and that your server didn't find anything and returned a "404 Not Found" response to Googlebot.
If you think that your server should have found something... i.e., if you believe the pages are still around and that Google should not have gotten a 404 Not Found response when it requested the url... then Google's message is useful because it alerts you to a possible problem. Otherwise, 404s are the expected response and are perfectly normal.
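If you want to confirm for yourself what status code your server is actually returning for one of the reported urls, a quick sketch like this works (Python standard library only; the url you pass in would be one of your own reported 404s, not anything assumed here):

```python
# Check what HTTP status code a server returns for a given url, roughly
# what Googlebot sees. Uses only the Python standard library.
from urllib.request import Request, urlopen
from urllib.error import HTTPError

def status_of(url):
    """Return the HTTP status code for a HEAD request to url."""
    try:
        return urlopen(Request(url, method="HEAD")).status
    except HTTPError as err:
        # urlopen raises on 4xx/5xx, but the response code is on the error
        return err.code
```

A 404 here confirms the pages really are gone and Google's report is just informational; a 200 would mean the page is still being served and the report points at a real problem.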
As to why Google recrawls urls that you think are gone or non-existent, there are numerous reasons. One is that links to the urls may persist somewhere on the web. You can't do anything about some of the external links, but by recrawling periodically over time, Google will keep track of the responses and recrawl these old urls less often.
It might be, though, that a site will still have internal nav links to the urls of pages that have been removed. This is unlikely in your case because you hadn't gotten requests for these urls for a long while. It can be worth checking a site with Xenu or Screaming Frog, though, to make sure that these urls aren't in the site's code.
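At its core, what Xenu or Screaming Frog does is pull the hrefs out of each page's HTML and check each one's response code. A minimal sketch of the extraction step, using only the Python standard library (the sample HTML below is illustrative; in practice you'd feed it your own fetched pages):

```python
# Extract internal (root-relative) links from a page's HTML, so each can
# then be checked for a 404. Standard library only.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects every href found on <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def internal_links(html, prefix="/"):
    """Return hrefs that point within the site (root-relative here)."""
    parser = LinkCollector()
    parser.feed(html)
    return [href for href in parser.links if href.startswith(prefix)]
```

Running this over your pages and checking each collected url's status is enough to verify that none of the removed urls are still linked in the site's own code.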
I've observed that in addition to periodically rechecking the lists of 404s it keeps, Google also often recrawls these lists when there's a refresh of the index, as might occur at a large update of the type we just had.
This observation from a 2006 interview with the Google Sitemaps Team is helpful... [smart-it-consulting.com
...] My emphasis added...
When Googlebot receives either (a 404 or 410) response when trying to crawl a page, that page doesn't get included in the refresh of the index. So, over time, as the Googlebot recrawls your site, pages that no longer exist should fall out of our index naturally.
My sense of the above is that by recrawling the old lists at updates or refreshes, Google is able to generate "clean" reference points of sorts, with currently 404ed urls removed from the visible index. The above interview was in 2006, though, and the index has gotten much more complex, so it's hard to say whether the 404ed pages are removed from the index in one pass, or after many.
There is a separate crawl list, and your observation suggests that the old urls are recrawled. I note from your report that the number of 404s peaked at just about the time of the update, and that the number is trending down gradually.
Re robots.txt, etc, in situations like this, I'll quote John Mueller's comments, cited above, for reference here...
For large-scale site changes like this, I'd recommend:
- don't use the robots.txt
- use a 301 redirect for content that moved
- use a 410 (or 404 if you need to) for URLs that were removed
- make sure that the crawl rate setting is set to "let Google decide" (automatic), so that you don't limit crawling
- use the URL removal tool only for urgent or highly-visible issues.
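To make the 301 / 410 / 404 distinction in John's list concrete, here's a minimal sketch of how a server might route old urls. The paths and the WSGI framing are purely illustrative (your own setup would more likely do this in the web server config or your framework's routing), but the status codes are the ones his recommendations call for:

```python
# Illustrative WSGI app: 301 for content that moved, 410 for content
# deliberately removed, 404 for everything else. Paths are hypothetical.
MOVED = {"/old-page": "/new-page"}   # old url -> new location
REMOVED = {"/retired-page"}          # urls intentionally taken down

def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    if path in MOVED:
        # Content moved: permanent redirect passes signals to the new url.
        start_response("301 Moved Permanently", [("Location", MOVED[path])])
        return [b""]
    if path in REMOVED:
        # Content removed on purpose: 410 tells crawlers it's gone for good.
        start_response("410 Gone", [("Content-Type", "text/plain")])
        return [b"Gone"]
    # Everything else: a plain 404 is the normal, expected response.
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not Found"]
```

The practical difference is small: Google treats a 410 as a slightly stronger "gone" signal than a 404, which is why John suggests it for urls that were removed intentionally, but a 404 works fine too.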