Msg#: 4635714 posted 9:03 am on Jan 7, 2014 (gmt 0)
I am looking at my Webmaster Tools crawl errors and almost every day I see new pages listed. Is that normal?
For information, I added a 410 in my .htaccess to remove some pages I don't want, which were created due to an issue with my CMS. I thought they had all been removed, but they don't seem to be yet. Is there a way to know how many Google still has to remove?
It says 525 not found as of today, and the maximum I ever had indexed, in April 2013, was 765. Does that mean Google still has to remove 765 - 525 + the exact number of pages my website really has (38), which means that Google still has to remove 278 pages?
Is that correct?
Then when I type site:mywebsite.com I see new URLs blocked by robots.txt, and when I click "repeat the search with omitted search results" I see pages listed.
As soon as I remove those pages with the URL removal tool, some new (and different) ones appear.
Why is that? And why are the web addresses listed there different from the ones in the not found URL errors?
My Google index status says I have 75 pages (I should only have 38, because my website is only 38 pages). Does that mean that what is left to appear when I type site:mywebsite.com with omitted search results is 75 - 38 = 37?
Msg#: 4635714 posted 2:36 pm on Jan 7, 2014 (gmt 0)
There is no real way of knowing how many pages Google has left to remove unless you can narrow the pages down in search using operators such as intitle: and inurl:.
Any pages returning the 410 header will eventually fall out of the index.
However, check the site data tab on the error to see where a link to the page is coming from (if any links are found). If a link is found on your web site, remove it so the page won't be crawled again. If the link exists on another web site, there is not a lot that can be done other than returning the 410 header, maybe getting the link nofollowed, and marking the error as fixed.
I have found that if a link still exists to a 410 page then Google will try coming to the page again even once the 410 header has been returned.
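If it helps, here is a rough Python sketch of how you might look for leftover internal links to the removed pages. It is only an illustration: the LinkCollector class, the links_to_removed_pages function, and the sample URLs are all made up for this example, and it assumes you already have a page's HTML and your own list of URLs that return 410.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute target of every <a href> on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.add(urljoin(self.base_url, value))


def links_to_removed_pages(page_url, html, removed_urls):
    """Return the links on this page that point at URLs meant to be gone."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    return sorted(collector.links & set(removed_urls))


# Hypothetical sample data; replace with your own pages and 410 list.
sample_html = '<a href="/old-page.html">old</a> <a href="/contact.html">contact</a>'
removed = {"http://www.mywebsite.com/old-page.html"}
found = links_to_removed_pages(
    "http://www.mywebsite.com/index.html", sample_html, removed
)
print(found)
```

Any URL it prints is a link you would want to remove from that page so the old address stops being rediscovered.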
Are you using robots.txt to block the pages you want removed from Google? If you are, remove the block and let Google crawl them as usual, but return the 410 header.
From time to time Google checks the links it has in its index and will find your 410 headers and remove the pages.
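Before waiting on Google, it may be worth confirming that the old URLs really do answer with a 410. The following is a minimal sketch using only Python's standard library; fetch_status and describe are names invented for this example, and the labels returned by describe are just my own shorthand, not anything Google reports.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code the server sends for url (HEAD request)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code is still what we want.
        return err.code


def describe(status):
    """Map a status code to a short label (labels are illustrative only)."""
    if status == 410:
        return "gone"
    if status == 404:
        return "not found"
    if 200 <= status < 300:
        return "still live"
    return "other"


# No request is made here; this only shows the labels.
for code in (200, 404, 410):
    print(code, describe(code))
```

To check a real address you would call something like describe(fetch_status("http://www.mywebsite.com/old-page.html")), where the URL is a placeholder for one of your removed pages.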
I've had to remove 8000 pages from the index and I have an average of 10 turn up each day even though they were returning the 410 header.
I've used the URL removal tool to remove a lot of pages, it removes them, but they will still turn up in crawl errors.
Msg#: 4635714 posted 5:03 pm on Jan 7, 2014 (gmt 0)
How long does it take for a page with a 410 header to fall out of the index?
I am using robots.txt to block the pages I want removed from Google (the only issue with that is that the pages that appear when I type site:mywebsite.com are pages like www.mywebsite/administrator or www.mywebsite.com/myfolder/modules etc…).
I don't want to give a 410 error to those pages; they are needed for my website (at least I think so). I just don't want Google to index them (my guess is that it did index them because of the issue I had months ago with my CMS), and I don't want to give it the opportunity to index all my /module or /component folder…
So what do you recommend to stop those from showing when I type site:mywebsite.com?
Using the URL removal tool and removing them one by one as they appear?
"I've used the URL removal tool to remove a lot of pages, it removes them, but they will still turn up in crawl errors."
What do you mean by that ?
In conclusion, what is the correct number of indexed pages that Google has for my website? Is it the index status figure where it says total indexed, or should I look at the crawl error not found number versus the maximum number of pages ever indexed for my website?
Google says :
"The number of indexed URLs is almost always significantly smaller than the number of crawled URLs, because it does not include URLs that have been identified as duplicates or non-canonical, or less useful, or that contain a meta noindex tag."
That is true in my case: indexed URLs are fewer than crawled / not found URLs. The only question is whether those not found URLs that Google detects every day or so are still listed in its index, even though it doesn't count them in its index status.
In other words, do those URLs that Google detects and marks with a certain priority hurt my rankings, given that they are duplicates of my pages?
Msg#: 4635714 posted 6:59 pm on Jan 7, 2014 (gmt 0)
I'm not sure exactly how long it takes. When I used 410 headers there were still pages left in the index after 3 months, but the majority had been removed by then (2000 left in the index of the original 8000). At that point I used the URL removal tool on the remaining 2000 to speed things up, and they were removed within 24 hours.
What I meant was some of the URLs I removed via the URL removal tool are still being picked up as 410 crawl errors. No links exist to them on my web site or on external web sites and they are no longer in the index but they still get picked up. I just mark them as fixed and cross my fingers they won't turn up again.
I would leave your robots.txt as it is then and give Google a chance to catch up with the changes.
As for your other questions, I don't have enough experience to answer.
Msg#: 4635714 posted 7:29 pm on Jan 7, 2014 (gmt 0)
"I am using robots.txt to block the pages I want removed from google"
Remember, you have to choose: EITHER let search engines crawl, so they can see the noindex meta tag (or X-Robots-Tag header if necessary), OR don't let them crawl, running the risk that the page will show up in SERPs if someone searches for carefully chosen linking text.
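To make the either/or concrete, here is a small Python sketch using the standard library's robots.txt parser. The robots.txt content, the URLs, and the function names (is_crawlable, header_says_noindex, noindex_outcome) are assumptions for illustration, and the headers argument is assumed to be a plain dict keyed exactly as "X-Robots-Tag". It only models the logic described above; it does not claim to reproduce how any search engine actually decides.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt, url, user_agent="Googlebot"):
    """True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def header_says_noindex(headers):
    """True if the (assumed) response headers carry an X-Robots-Tag noindex."""
    return "noindex" in headers.get("X-Robots-Tag", "").lower()


def noindex_outcome(robots_txt, url, headers):
    """Summarise which side of the either/or a URL falls on."""
    if not is_crawlable(robots_txt, url):
        # The crawler never fetches the page, so it cannot see any noindex.
        return "blocked"
    if header_says_noindex(headers):
        return "noindex-visible"
    return "indexable"


# Hypothetical robots.txt and URLs, loosely based on the thread.
robots_txt = """\
User-agent: *
Disallow: /administrator/
"""
admin_url = "http://www.mywebsite.com/administrator/index.php"
page_url = "http://www.mywebsite.com/about.html"

print(noindex_outcome(robots_txt, admin_url, {"X-Robots-Tag": "noindex"}))
print(noindex_outcome(robots_txt, page_url, {"X-Robots-Tag": "noindex"}))
print(noindex_outcome(robots_txt, page_url, {}))
```

The first case is the trap: the page sends noindex, but because robots.txt blocks it, a crawler that obeys robots.txt never gets to read that header.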