Forum Moderators: Robert Charlton & goodroi


Blocked Pages indexed


member22

1:47 pm on Jul 18, 2014 (gmt 0)

10+ Year Member



I uploaded a new website, and when I did I forgot to block in robots.txt some pages that I don't want indexed, including components such as the download modules.

When I realized it, I blocked them in robots.txt, and now when I type site:mywebsite.com I still see those pages indexed.

How can I remove those pages from Google's index now, so that they don't show up in "repeat the search with the omitted results included"?

netmeg

2:26 pm on Jul 18, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



First of all, remember that robots.txt controls crawling and the NOINDEX directive controls indexing.

If you want to make sure that pages stay out of Google, slap a NOINDEX on them. Don't block them in robots.txt.

Ok so for your existing issue, what I would do is (for the moment) leave them blocked in robots.txt.

Then I would remove them using the REMOVE URLS option in GWT. Unless there's a directory or category in the URL, you're going to have to do it one at a time; there's no bulk upload.

The URLs will be gone in an hour or so. NEXT make sure that you have the noindex directive in the <head> statement of those URLs.

<meta name="robots" content="noindex">

NOW you can take them out of robots.txt. You want to do that because otherwise, search engines won't see the noindex, and you risk them going back into Google.

Time consuming yes, but that's how you do it.
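One possible extension of that step (my own aside, not from netmeg's post): when the URLs aren't editable HTML pages, the same noindex directive can be sent as an HTTP response header instead of a meta tag. A hedged Apache sketch, assuming mod_headers and mod_setenvif are available, with /modules/ and /component/ as example paths:

```apache
<IfModule mod_headers.c>
    # Flag requests for the example paths we don't want indexed
    SetEnvIf Request_URI "^/(modules|component)/" NOINDEX_URL
    # Send the noindex directive as a header, equivalent to the meta tag
    Header set X-Robots-Tag "noindex" env=NOINDEX_URL
</IfModule>
```

Same caveat applies: the URLs must not be blocked in robots.txt, or the crawler never sees the header.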

Robert Charlton

8:46 pm on Jul 18, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Here's an extensive discussion of the issues involved. It's also time consuming (to read), but that's how you understand it....

Pages are indexed even after blocking in robots.txt
http://www.webmasterworld.com/google/4490125.htm [webmasterworld.com]

member22

9:05 pm on Jul 18, 2014 (gmt 0)

10+ Year Member



Thank you for your replies, but how can I add a noindex to a page that looks like this: www.myurl.com/modules/ajax… (if it were a real page that would be easy, but a /module or /component is not a real page?)

By the way, my /module already returns a 404?

JD_Toims

9:23 pm on Jul 18, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If it's something I really don't want indexed, I'll forget the robots.txt and the noindex on the page(s) and just serve GBot [and all GBot spoofers] a very understandable 403.

RewriteEngine on
# If the visitor claims to be Googlebot (case-insensitive match)...
RewriteCond %{HTTP_USER_AGENT} GoogleBot [NC]
# ...answer any URL starting with modules or component with 403 Forbidden
RewriteRule ^(?:modules|component) - [F]

Robert Charlton

10:56 pm on Jul 18, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Are you by any chance using a combination of relative paths and the base element in your CMS?

lucy24

1:00 am on Jul 19, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



RewriteCond %{HTTP_USER_AGENT} GoogleBot [NC] 

Oddly enough, my htaccess has the line
BrowserMatch GoogleBot keep_out

in the mod_setenvif section, paired with
RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteCond %{REMOTE_ADDR} !^(66\.249|74\.125)\.

et cetera in the mod_rewrite section. No [NC] ;)
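Those paired conditions only let the "keep out" rule fire for visitors claiming to be Googlebot from outside known Google IP blocks. A hedged sketch of the same spoofer check in Python (simplified: Google's documented method is a reverse DNS lookup on the visitor's IP, confirmed with a forward lookup, rather than a hard-coded IP list; the function names here are my own):

```python
import socket

def is_google_hostname(hostname: str) -> bool:
    """True if a reverse-DNS name belongs to Google's crawler domains."""
    return hostname.endswith(".googlebot.com") or hostname.endswith(".google.com")

def verify_googlebot(remote_addr: str) -> bool:
    """Reverse-DNS the visitor's address and confirm it is a Google host.

    A full verification would also forward-resolve the hostname and check
    that it points back at remote_addr; that step is omitted in this sketch.
    """
    try:
        hostname, _, _ = socket.gethostbyaddr(remote_addr)
    except OSError:
        return False
    return is_google_hostname(hostname)
```

A visitor sending a Googlebot user agent that fails this check is one of the "GBot spoofers" mentioned above.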

incrediBILL

1:06 am on Jul 19, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



First of all, remember that robots.txt controls crawl and the NOINDEX directive controls indexing


That's not exactly correct.

Sorry, but I've been around since robots.txt was the only way to control a spider and the index, before all the new fancy additions like NOINDEX.

See: Block URLs with robots.txt
[support.google.com...]

When you block a page in robots.txt, Google also doesn't index that blocked page.

For instance, if you don't want your whole site indexed by some crawler that requests robots.txt, just use "Disallow: /" and it's gone.
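For reference, that whole-site block is just a two-line robots.txt:

```
User-agent: *
Disallow: /
```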

Things like NOINDEX were created for people running hosted blogs that didn't have access to robots.txt.

Anyway, if you block it in robots.txt today, the results won't be instant in Google; it may take weeks or even months depending on the number of pages being blocked. If you want to speed the process along, after updating your robots.txt go into Google WMT and select "Google Index -> Remove URLs" and you can wipe them out of the index immediately, but the URLs must also be in the robots.txt file for this to work, so do that first.

Another simple strategy is to 301 redirect the old page URLs to the new pages, if it's a 1-to-1 relationship. Also, make a 404 page that has a menu or site index to the new pages, so users are basically in the site and not just staring at a "Sorry, this page is missing" page, which is stupid.

I would consider the 301 or 404 first and leverage your current indexed pages; while Google slowly removes them, they'll still have actual value.

JD_Toims

1:22 am on Jul 19, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



When you block a page in robots.txt, Google also doesn't index that blocked page.

True, but it can be confusing to many, since although the page itself and its contents are not technically indexed:

As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site can appear in Google search results.

And of course they make preventing the issue as clear as mud in the same paragraph by saying:

Emphasis Added
You can prevent this by combining this robots.txt with other URL blocking methods, such as password-protecting the files on your server, or inserting meta tags into your HTML.

Which links to: [support.google.com...]

Which states:

To entirely prevent a page's contents from being listed in the Google web index even if other sites link to it, use a noindex meta tag. As long as Googlebot fetches the page, it will see the noindex meta tag and prevent that page from showing up in the web index.

--

When we see the noindex meta tag on a page, Google will completely drop the page from our search results, even if other pages link to it.

Which says, "As long as Googlebot fetches the page", which totally contradicts the robots.txt info that says a page will not be indexed when robots.txt is combined with noindex or password protection. If Googlebot follows the robots.txt directive and doesn't access the page, the noindex or password protection on the page won't ever be seen, meaning their own support information is absolutely, unquestionably contradictory and wrong, because the initial robots.txt support page says:

Emphasis Added
A robots.txt file is a text file that stops web crawler software, such as Googlebot, from crawling certain pages of your site.

If they don't crawl the page, they don't see the "verification required" or meta tag, so the two can't be combined to keep a page out of the index!

[JD_Toims does face-palm for not realizing sooner why sooo many people are totally confused about why combining robots.txt with <meta name=robots content=noindex> just plain won't work -- Should have realized Google's support says it does days ago!]

[edited by: JD_Toims at 1:42 am (utc) on Jul 19, 2014]

incrediBILL

1:42 am on Jul 19, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Which is ridiculous, since if Googlebot follows the robots.txt directive and doesn't access the page, the noindex on the page won't ever be seen, meaning their own support information is absolutely, unquestionably wrong, because the initial robots.txt support page says:


Your first problem is believing that Google won't access those pages just because they're blocked in robots.txt, when it clearly states it will, which has been confusing webmasters for years.

If you really want to piss off webmasters, just build a site linking to every page blocked by their robots.txt, submit it to Google, and then sit back and watch the fireworks fly as the webmaster being crawled goes on a tirade screaming that Googlebot is broken, doesn't honor robots.txt, etc. You get the idea.

OK, here's where your conundrum happens:

1) if the page or folder is blocked by robots.txt Google claims it honors robots.txt and won't crawl it

2) HOWEVER, if the page is linked from an external 3rd party site, Google crawls the page anyway in a link checking capacity, to verify that the page content, the page it's linked from, and the anchor text are all relevant. It's running a quality test.

Google still hasn't broken their word here, as they're accessing the page under the guise of a link checker, not a spider. The fact that they still call it Googlebot, without any further information about why it's in a blocked area, is what sets webmasters off screaming.

3) If that page blocked in robots.txt, linked from a 3rd party, is link checked by Googlebot and it encounters the META NOINDEX, then there is no reference made whatsoever. It's blacked out for good.

Got all that?

Twisted but that's what it appears to do.

JD_Toims

1:46 am on Jul 19, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Got all that?

Huh, what? lol -- I'll opt for my initial GTFO solution of a simple, unquestionable, not subject to interpretation, there's nothing for you to see or index here:

403 Forbidden!



Added:

If they really try to argue a "link checker" called Googlebot is not web crawler software such as Googlebot, they're essentially saying 6 isn't the same as a half-dozen, which is laughable and means their support info surrounding the subject is still === FALSE.

incrediBILL

2:51 am on Jul 19, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I simply added a script that blocks anything that ever asks for robots.txt and then comes back asking for denied pages, so on some sites I do feed Googlebot a 403 when they're in 'link checker' mode.

It was easy: just take a PHP library that reads robots.txt (the kind primarily used by crawler code) and change how it's used so it validates inbound requests instead. Feed the robots.txt PHP function the user agent asking for the page and see what happens.

VOILA! a robots.txt enforcer, it's that easy.
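A hedged sketch of that enforcer idea in Python instead of PHP, using the stdlib urllib.robotparser (the blocked paths are just examples from this thread, and status_for is my own name for the check):

```python
# Robots.txt "enforcer": reuse a robots.txt parser -- normally used by
# well-behaved crawlers -- to validate inbound requests instead, and
# serve a 403 to any bot requesting a path it was told not to fetch.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /modules/
Disallow: /component/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def status_for(user_agent: str, path: str) -> int:
    """403 if the robots.txt rules disallow this fetch, else 200."""
    return 200 if parser.can_fetch(user_agent, path) else 403
```

Wire something like status_for into the request handler before any page logic; a crawler that asks for robots.txt and then requests a denied page gets the 403 instead of the content.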

JD_Toims

2:59 am on Jul 19, 2014 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Cool Idea!