

Why Did Google Index Pages With NoFollow?

     
2:19 pm on Aug 3, 2011 (gmt 0)

5+ Year Member



A month ago I added the nofollow attribute to the meta tags of many pages, most of them new, but some of these new pages with the nofollow attribute have been indexed.

Also, some pages that are in a subdirectory that has always been blocked by robots.txt are indexed...

20 days ago I deleted 25% of my site's pages (the ones with thin content), and Google still shows those pages today...

Any idea why this happens?
4:44 pm on Aug 3, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just because you nofollowed links into those pages doesn't mean that Google won't index them. A nofollow is just a signal that you don't trust the pages you're linking to, and any pagerank that would go to them is just wasted. You can try blocking the pages with NOINDEX / NOCACHE in the HEAD section, but even that doesn't work every time. If a page is in an excluded directory, or has HEAD section exclusions but has external links pointing to it, Google will still index it and even rank it. They'll follow your robots.txt instructions up to a point, but if they think a document is popular enough they'll ignore your instructions.
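
For reference, those exclusions go in the HEAD section of the page and look something like this (noarchive is the usual spelling of the no-cache directive; use whichever combination you actually need):

<meta name="robots" content="noindex, nofollow">
<meta name="robots" content="noarchive">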
5:40 pm on Aug 3, 2011 (gmt 0)

5+ Year Member



Thanks, SEOMike.

I made a mistake when I wrote the post - I should have said NoIndex. The pages now have the noindex attribute in the HEAD, and some of them are still indexed...
6:33 pm on Aug 3, 2011 (gmt 0)

WebmasterWorld Senior Member netmeg is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



First things first.

The stuff in the directory that's blocked by robots.txt - is it possible it got in before you blocked it? Because once you wall off that directory, Google goes no further. And it still won't necessarily keep them out of the index - they'll just show up as URLs only, without titles or meta descriptions.

Next - the URLs that you NOINDEXed - has Googlebot been back to pick up the new tag? Check the cache date on the ones you see in Google.

And finally - just because you delete pages on your site doesn't mean Google will necessarily drop them from the index. I've had old pages hang around for a year or more.

If you really really really don't want something in Google, you gotta password protect it (ultimate security) or NOINDEX it (pretty good security) or block it by robots.txt (maybe 50% security) and if it's already there, you gotta remove it yourself (via GWT) or wait till Google notices the NOINDEX or the 404. Which could be tomorrow or infinity or any time in between.

(And as an aside, I'd use the robots.txt testing tool inside GWT to make sure it's set up correctly. I can write robots.txt files in my sleep and I still check it periodically - just in case)
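
(If you do go the password route, a bare-bones .htaccess setup looks roughly like this - the .htpasswd path is just a placeholder, and the real file should live outside your web root:

AuthType Basic
AuthName "Restricted"
AuthUserFile /path/to/.htpasswd
Require valid-user

Then create the .htpasswd file with the htpasswd utility.)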
2:38 am on Aug 4, 2011 (gmt 0)

5+ Year Member



@netmeg, thank you.

About the directory: it has been blocked by robots.txt ever since I created it.
5:34 am on Aug 4, 2011 (gmt 0)

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month



And finally - just because you delete pages on your site doesn't mean Google will necessarily drop them from the index. I've had old pages hang around for a year or more.

I've had pages dead for nearly 10 years still appear, even after G dropped them... each time they change the algos (Panda, for example) all the old stuff comes back. Google (Bing too, and Yahoo before that) never forgets a URL it has met... and keeps testing it over and freakin' over.

Once on the web, always on the web (indexers). And as Walter Cronkite used to say "And that's the way it is..."
7:38 am on Aug 4, 2011 (gmt 0)

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



have you checked your server access logs to see if the noindexed urls have been crawled?
you can also check the cached version of your content to see if it shows the meta robots noindex unless you also use noarchive.

if you add a meta robots tag to a document in a directory that is excluded by robots.txt the url won't get crawled and the SE won't see the meta robots tag, so the url may become or remain indexed.
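
for example, with

User-agent: *
Disallow: /subdirectory/

in robots.txt, a meta robots noindex on the pages inside /subdirectory/ can never be fetched, so it never takes effect - you would have to let the directory be crawled first for the noindex to be seen.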

a 410 Gone status code response actually means "gone" as opposed to "not found" (404) and usually works better for removing content.
4:54 pm on Aug 4, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



And finally - just because you delete pages on your site doesn't mean Google will necessarily drop them from the index. I've had old pages hang around for a year or more.

a 410 Gone status code response actually means "gone" as opposed to "not found" (404) and usually works better for removing content.

Agreed.

I've had good success getting hundreds of pages removed from Google by simply adding some lines to .htaccess to make the server return a 410 for the removed pages. Google will eventually remove them. They'll count them as errors for a while in Webmaster Tools, but I saw no ill effect on ranking from those "errors." In my tests 410 responses got the pages removed much faster than 404s.

You could move the pages you want blocked to a directory blocked by robots.txt and serve 410s for their previous location.
6:12 pm on Aug 4, 2011 (gmt 0)

5+ Year Member



Thanks for the comments!

Watching GWMT

Can someone provide an example of how to force the server to return a 410 using .htaccess?

Thank you
6:15 pm on Aug 4, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's real easy for single pages:

Redirect gone /DIRECTORY/PAGE.htm
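
If you've got a lot of them, or a whole directory, RedirectMatch takes a regular expression so you don't have to list every page - something like this (the pattern is just an example, adjust it to your own URLs):

RedirectMatch gone ^/DIRECTORY/

That sends a 410 for everything under /DIRECTORY/.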
6:16 pm on Aug 4, 2011 (gmt 0)

5+ Year Member



Also, another comment...

Using the site: command, sometimes I see that I have, for example, 570 pages indexed and sometimes 427...

It seems that when I have 427 pages indexed my rankings improve (remember that the pages I deleted were thin content).
6:57 pm on Aug 4, 2011 (gmt 0)

WebmasterWorld Senior Member netmeg is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I wouldn't go by the site command; it's severely messed up.
7:01 pm on Aug 4, 2011 (gmt 0)

5+ Year Member



@netmeg, what command or tool do you recommend for finding out which pages are indexed?
4:03 pm on Aug 9, 2011 (gmt 0)

5+ Year Member



Now Google sometimes shows, via the site: command, more pages from the directory that has always been blocked by robots.txt.

One question: is this syntax correct:

User-agent: *
Disallow:/subdirectory/

or do I need to put a space, like this:

Disallow: /subdirectory/


Thanks
4:37 pm on Aug 9, 2011 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



The syntax that Google publishes includes the space - see [google.com...]

It probably works without the space, too - but why push it? If in doubt, test your robots.txt file with the tool they offer inside Webmaster Tools.
3:52 am on Sep 22, 2011 (gmt 0)



It's a nice share, but not clear. I need to know more details.
4:11 am on Sep 22, 2011 (gmt 0)

WebmasterWorld Senior Member tedster is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Mike - ask a more specific question and we'll do our best.
6:50 am on Sep 22, 2011 (gmt 0)

WebmasterWorld Senior Member g1smd is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



I deleted a number of pages in April and returned 410 for some and a 301 redirect for others. It has taken Google 4 months to remove all of the URLs from their index [site:example.com].

In WMT, so far about half of the URLs are gone. The number is dropping by a few dozen every few days. Only a very few URLs in the "internal links" list show as having internal links pointing at them; most URLs show "not available" for the link data. Once a URL is seen as "Gone" AND all the URLs that linked to it are also seen as "Gone", Google continues looking for a month or two and then deletes the URL from the list.
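
If anyone wants to do the same in .htaccess, the two kinds of rules look roughly like this (the page names and example.com are placeholders):

Redirect gone /old-thin-page.html
Redirect 301 /old-page.html http://www.example.com/new-page.html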
 
