Forum Moderators: Robert Charlton & goodroi

Does Googlebot ever forget? (Still hitting long dead pages)

samwest

8:22 pm on Apr 4, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



As I watch live bot traffic, I keep seeing Googlebot (and msnbot) trying to hit long-dead pages. Even stuff I 410'd for a year (then deleted the 410 rule recently) is still being requested. Even test pages I was editing, which may have been live for a day or two at best (then deleted), are still being requested.

Seems like a big waste of processing power. A 410 should be seen once and honored. I can understand 404s lingering for a few months, but do they EVER forget them? They should.

On a slightly unrelated topic, I filed a few DMCAs back in 2012. They removed the offending listing, but since then Google has refused to index any of my images. Will they ever forget about that? And did I suffer just because of the DMCA report? Sounds like a daisy-cutter approach to preventing anyone from stealing my images again. Thanks for the favor, G.

Kratos

1:34 pm on Apr 5, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



There's a video out there of Matt Cutts saying they treat 404s and 410s as practically the same. The reasoning behind this was that it's very easy for a webmaster to make an unintentional mistake and 404 important parts of a site, whereas delivering a 410 takes actual technical knowledge, so Google interprets a 410 as meaning you haven't messed up (i.e. you're intentionally telling them the content is no longer there). However, they will still come back even years after a page was 410'd, although not as frequently as with a 404'd page. This is especially so if the 410'd page had some PR from internal or external links.
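For anyone wondering what "delivering a 410" looks like in practice, here is a minimal sketch for an Apache server (the paths are made up for illustration):

    # mod_alias: tell crawlers these URLs are gone on purpose (410), not merely missing (404)
    Redirect gone /old-widget-page.html
    RedirectMatch gone ^/discontinued/

A plain 404 needs no configuration at all, which is exactly why Google reads it as weaker evidence of intent.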


I have seen it even with pages that never existed but had badly written external links pointing at them.

[edited by: Kratos at 1:47 pm (utc) on Apr 5, 2015]

samwest

1:44 pm on Apr 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



So this makes me wonder what the best practices would be for maintaining redirects. I'm sure this has been covered ad nauseam in this forum, but you know how things change. Do you keep them forever, and at what point do you just stop caring? I would think the sitemap would be sufficient to keep the bots from trying to find pages that were gone years ago, especially those with no link on some other site. I always 301 those anyway. It just seems like I am seeing an excessive amount of bot traffic looking for what's no longer there... or was never there.
One example: www.mydomain.com/tom-snyder, which came up yesterday (and prompted this topic). I never even had a page about him, not even a tag. Seems like random probing. Makes me wonder if it's a symptom of a hack or a naughty WP plugin.

Kratos

1:51 pm on Apr 5, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



I get those links too. I believe they're referral spam links that Google has found and followed. I would say a good 80% of my 404s are referral spam links pointing to made-up pages (this is done automatically with a script). The spammers remove the referral link after a while, which is why they don't appear in your WMT (or in the 404 report), and you're left wondering why the heck you have so many 404s for weird URLs on your site that never existed.

samwest

2:23 pm on Apr 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm having one other strange issue on a WP site that continually gets bot hits that appear to be coming from my own IP. A little investigation by hosting support shows it's coming from wp-cron, which runs on every visit! Seems other plugins can hook into it and cause some real problems. My task for the coming week is to check out an article titled "why wp-cron sucks" and ways to make it suck less. Then I need to figure out which plugin is misbehaving... if any.

A temporary workaround was to disable wp-cron and set up an actual cron job that runs every 15 minutes rather than on every visit. Not sure if it's a coincidence or not, but suddenly traffic seems more varied and robust. Bounce seems to have dropped too. This is on a dedicated server with about 30 sites, none of which are particularly busy.
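For anyone wanting to try the same workaround, the usual recipe is a one-line constant in wp-config.php plus a real cron entry; the domain below is a placeholder and the 15-minute interval is just what was used here:

    // In wp-config.php: stop WordPress from triggering wp-cron on every page view
    define('DISABLE_WP_CRON', true);

    # In the server's crontab: run WordPress's scheduled tasks every 15 minutes instead
    */15 * * * * wget -q -O /dev/null "https://www.example.com/wp-cron.php?doing_wp_cron"

Anything a plugin has scheduled through WP-Cron (backups, feed pulls, etc.) still runs, just on the cron schedule instead of being tied to visitor traffic.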

Kratos

3:01 pm on Apr 5, 2015 (gmt 0)

10+ Year Member Top Contributors Of The Month



I had a similar issue on one WP site. The problem with disabling the cron job was that a plugin I was using to make daily backups stopped working, and so did the feed I was pulling from another site of mine for the front page.

Have you found that anything stopped working in your case when you disabled wp-cron?

lucy24

8:35 pm on Apr 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do you keep them forever, and at what time do you just not care?

Keep them forever unless you genuinely don't care if the robot hits the occasional 404. But on a WP (or similar CMS) site there's one more consideration: Returning an explicitly coded 301 or 410 is far less work for the server than having to delve into the CMS and get the information (whether 301 or 410) that way.
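To put that in concrete terms, here is a sketch assuming Apache and the stock WordPress rewrite block (the URLs are hypothetical): anything matched by the explicit rules at the top is answered by the web server alone, and PHP and the database are never loaded for those requests.

    # Explicitly coded responses -- WordPress is never invoked for these URLs
    RewriteEngine On
    RewriteRule ^old-section/widgets\.html$ /widgets/ [R=301,L]
    RewriteRule ^retired-page\.html$ - [G]

    # Standard WordPress front controller, left below the explicit rules
    # BEGIN WordPress
    <IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteBase /
    RewriteRule ^index\.php$ - [L]
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule . /index.php [L]
    </IfModule>
    # END WordPress

The same redirects handled inside the CMS would still work, but every hit would have to boot WordPress first.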

In the case of 404 vs 410 you have to consider the difference between crawling and indexing. Both are treated exactly the same for indexing: if they can't find the page, they can't crawl it, and hence can't index its content. But a 410 is intentional (a server doesn't keep track of files that used to exist and have since been removed, so a 410 has to be configured deliberately), so Google will assume you took the page away on purpose, and Googlebot will stop crawling it sooner. (This is the Google subforum. Bing doesn't seem to care. I've recently had a flurry of visits from the msnbot requesting URLs that haven't existed in years, exactly as if it wandered out of the retirement home thinking it's 1998.)

I would think the sitemap would be sufficient to keep the bots from trying to find pages that were gone years ago, especially those with no link on some other site.

Sitemaps should be considered inclusive, not exclusive: "my pages include the following" rather than "look only at these pages". If you look at the tab in GWT that tells you how Google learned about a given URL, it will sometimes say "in sitemap". But that doesn't necessarily mean your current sitemap; it might mean one that you had up for five minutes one day in 2007.
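For context, a sitemap is just a list of URLs you're vouching for; the protocol has no way to mark a URL as gone or off-limits. A minimal sketch with a placeholder domain:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <!-- "my pages include the following" -- nothing here tells a bot to skip other URLs -->
      <url>
        <loc>https://www.example.com/widgets/</loc>
        <lastmod>2015-04-01</lastmod>
      </url>
    </urlset>

Keeping dead URLs out of it helps, but it won't stop a bot from recrawling URLs it learned about elsewhere, or from an older sitemap.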

netmeg

9:31 pm on Apr 5, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Do you keep them forever


If I go to the bother of doing redirects, then I usually keep 'em forever. I still have G trying to hit pages from like three site versions ago - 2007 and earlier - because there are old scraper links out there. The important ones I wrote redirects for. The others, meh.

samwest

2:56 pm on Apr 7, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



exactly as if it wandered out of the retirement home thinking it's 1998.

That's funny, but so true, and it's exactly what prompted the OP.

This also raises the question (in my mind at least): if these bots know when my pages were created, and they know when the scrapers' pages were created, couldn't that data be used by the algo to determine the original author and trash the scraped content? Thx for all the great replies.

Robert Charlton

8:25 pm on Apr 7, 2015 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Regarding the long-dead pages, here's a discussion from almost two years ago that covers some of the reasons why Googlebot doesn't forget...

17 May 2013 - GWT Sudden Surge in Crawl Errors for Pages Removed 2 Years Ago?
http://www.webmasterworld.com/google/4575982.htm [webmasterworld.com]

At the end of the thread, I've cited an old interview with members of the Google Sitemaps team... and the thread also offers some useful comments from Google's John Mueller, etc.