Is there any pattern to the URLs of nonexistent pages? Google does tend to understand a 410. So if it's practical-- and if you don't mind "taking responsibility" for pages that never existed in the first place-- returning an explicit 410 might work.
But even with a 404, the number of requests should drop off. If they're requesting the pages just as often now as they did at the beginning, it implies that the request is being reinforced somewhere. Pick some random 404s in WMT and see where Google claims to have heard about them. Make sure they're not linked from anything currently in your site, and make sure they're not on an auto-generated sitemap. (The words "in sitemap" by themselves don't necessarily mean you did anything wrong. It just means that some sitemap at some time in the historical past had this URL on it.)
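One way to see whether the requests really are dropping off is to tally them from your raw access logs. Here is a rough Python sketch, assuming the usual Apache "combined" log format. The function name is made up for illustration, and matching on the string "Googlebot" is only an approximation, since anyone can send that user-agent.

```python
import re
from collections import Counter

# Pulls the date, request path and status out of an Apache "combined" log line.
LOG_LINE = re.compile(
    r'\[\d{2}/(?P<month>\w{3})/(?P<year>\d{4}):[^\]]*\] '
    r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)

def googlebot_404s_per_month(lines):
    """Count 404/410 responses served to Googlebot, keyed by 'YYYY-Mon'."""
    counts = Counter()
    for line in lines:
        # Assumption: user-agent substring match is good enough for a rough trend
        if "Googlebot" not in line:
            continue
        m = LOG_LINE.search(line)
        if m and m.group("status") in ("404", "410"):
            counts[f"{m.group('year')}-{m.group('month')}"] += 1
    return counts
```

Feed it an open log file (it iterates line by line) and compare the monthly totals. If the numbers stay flat instead of shrinking, something is still reinforcing those URLs.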
There is no pattern really except that all the URLs are derived from the sample data. For example http://www.example.com/66-sample-data-articles/joomla/extensions/modules/display-modules/19-footer-module - note the 'sample-data' string in the URL.
So, pages like this existed for a few days when I downloaded the sample data (big mistake), then after I deleted the site and re-built it, they became 404s. The thing is that the 'Linked from' tab in WMT for the above 404 URL shows that the page is linked from other pages but they are 404 pages also?! How can it state they are linked from other pages when the pages they are linked from either don't exist or they are not linked from at all? For example, there are a few 404 pages that WMT states are linked from my homepage, but that is not true because I have rebuilt the site.
Also, when I hover over the URL in WMT to see a preview I get a really strange/basic text version of a site with half the page showing my content and half showing the sample data content. So, I thought the cache must be out of date.
[edited by: aakk9999 at 8:55 pm (utc) on Dec 28, 2013]
[edit reason] Exemplified - No URLs as per Charter [/edit]
|How can it state they are linked from other pages when the pages they are linked from either don't exist or they are not linked from at all? |
It's WMT -- Notoriously slow to update and "glitchy" on a good day.
|For example, there are a few 404 pages that WMT states are linked from my homepage, but that is not true because I have rebuilt the site. |
Nothing you can do except remove them [already done], wait, and quit worrying about WMT telling you things you know aren't accurate.
-- The pages and links are removed. That's all you can really do except "get back to business" and keep building your site.
BTW: Welcome to WebmasterWorld!
|There is no pattern really except that all the URLs are derived from the sample data. For example http://www.example.com/66-sample-data-articles/joomla/extensions/modules/display-modules/19-footer-module - note the 'sample-data' string in the URL. |
So there are other URLs containing the string "sample-data" that are legitimate pages? No recurring theme to the numbers (here "66-"), or some other part of the URL? Eeuw.
The term "linked from" doesn't necessarily mean the link is present right now. Like "in sitemap", it simply means that's how the search engine first learned about the page.
Welcome to WebmasterWorld, Drag1!
Google search for inurl:/66-sample-data-articles/joomla/extensions/modules/display-modules/ returns over 3 million results from many websites, so I would guess you are not the only one with this problem of random URLs being created.
Unfortunately, whilst it is so easy to leak URLs to Google, it can take a long time for Google to drop URLs from its 404/410 graph. Google will drop URLs a bit faster if you return 410 instead of 404.
As Lucy says, you should investigate whether you have legitimate pages with the 66-sample-data-articles pattern in the URL and, if you don't, use this pattern to return 410 Gone.
Even then it can take more than a year for Google to drop these pages - depending on the size of the site, the number of URLs that return 404/410 and also depending on how often the site is crawled.
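If the responses come from application code rather than from the server config, the rule has the same shape: match the pattern, answer 410, and leave everything else alone. Below is a minimal sketch as a Python WSGI app. It assumes no legitimate page starts with a "<digits>-sample-data-articles" path segment, and the names GONE_PATTERN, is_gone and app are invented for illustration. Joomla itself runs on PHP, so treat this as the logic only, not something to drop in.

```python
import re

# Assumption: every unwanted URL starts with "/<digits>-sample-data-articles"
GONE_PATTERN = re.compile(r"^/\d+-sample-data-articles(/|$)")

def is_gone(path):
    """True if the path matches the leftover sample-data pattern."""
    return bool(GONE_PATTERN.match(path))

def app(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    if is_gone(path):
        # Explicit 410: the page is gone on purpose
        start_response("410 Gone", [("Content-Type", "text/plain")])
        return [b"410 Gone\n"]
    # Placeholder: a real site would hand off to its normal page handling here
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"404 Not Found\n"]
```

Anchoring the pattern at the start of the path keeps it from catching a legitimate URL that merely mentions "sample-data" further along.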
All good advice, yet do know that none of the search engines, G, B, and Y in particular, ever "forget" a url they have met. Expect at some time in the distant future that url request will come back again... so keep those 410s in place.
I still have pages deleted (properly) over 10 years ago still being requested on occasion.
I honestly never knew that Google would act like this. Thanks so much everybody, you have been a real help.
From what you all say, the first step is to return a 410 error code for all those pages. Does anybody know a place on the web where I can find out how to do that? I haven't done this before so I want to make sure I do it correctly.
|From what you all say, the first step is to return a 410 error code for all those pages. Does anybody know a place on the web where I can find out how to do that? |
I suggest that you search our Apache Web Server [webmasterworld.com] forum.
There are many examples of how to serve 410 Gone using .htaccess directives. You may also post a thread there with your question - but try to give your best shot at creating the .htaccess directives yourself first, then post what you have come up with, replacing your domain name with example.com.
I will. Thanks again for everyone's help. I really appreciate it.
I have the same problem, trying to get rid of some URLs. I added the 410 header this week, but the URLs show up with a 404 response code in Webmaster Tools. Shouldn't I see 410, or does Google not display this?
Double check, please: In wmt there's a category for "page not found". You then have to look closer to see which response code is returned, 404 or 410. I think Bing goes into even finer detail.
What exactly do you mean by "410 header"? Where servers are concerned, "exactly" is key. That's assuming the response is being returned by the server as such; if so, you can also check your access logs. If the response header is generated by php or similar, you won't learn anything from logs. But it's worth trying some random URLs with Live Headers or equivalent to make sure you're getting the intended response.
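If you'd rather script the check than use a browser add-on, something like this reports the status code a URL actually returns. It is a minimal Python sketch using only the standard library. fetch_status is a made-up name, and any URL you pass is your own choice.

```python
import urllib.error
import urllib.request

def fetch_status(url, timeout=10):
    """Return the HTTP status code the server sends for url."""
    # HEAD skips the body; note some setups answer HEAD differently from GET
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the status code is on the exception
        return err.code
```

Keep in mind that urlopen follows redirects, so a URL that redirects to a 404 page will report 404 rather than the 301/302 that came first. If the number you get back is 404 when you expected 410, the "410 header" isn't reaching the client.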
|Currently Google treats 410s (Gone) the same as 404s (Not found). |