Pages are indexed even after blocking in robots.txt

Forum Moderators: Robert Charlton & goodroi

shaunm

11:04 am on Aug 31, 2012 (gmt 0)

Hi all,

I have blocked some of my website pages through robots.txt. Now I see that some of them are indexed while some are not. I am confused, as I don't get the point of indexing pages that have been blocked in robots.txt.

Can you please advise? Or will adding noindex HTML tags to the pages that I don't want indexed do the trick?

Thanks a lot!

Robert Charlton

7:39 am on Sep 6, 2012 (gmt 0)

Shaddows - Thanks for your extremely precise and clear restatement of how this works, including this general principle, which is key to this ongoing discussion....

This is one of the areas where precise terminology is key. However, the vast majority of casual conversations (and indeed some official resources) tend to be quite lax.

lucy24 asks...

Would not a person of ordinary intelligence interpret this to mean that a file in a roboted-out directory will stay out of the index, once removed?

The dilemma, lucy, is that there's a difference between a page/file and a reference (url or link) to that page/file. Here's a very complete discussion, from back in May, 2010....

robots.txt - Google's JohnMu Tweets a tip
http://www.webmasterworld.com/google/4143083.htm

As you'll see, robots.txt has its detractors, and vocabulary needed to be clarified.

In the discussion, I link to a comment from GoogleGuy (Matt Cutts), worth noting again here. It's from back in 2003, when I first encountered the problem...

GoogleGuy, with my emphasis added...

If we have evidence that a page is good, we can return that reference even though we haven't crawled the page.

I was as outraged then as you are now, lucy, and I've seen occasional flare-ups by others over the years.

PS....

Could you please tell me WHEN/WHY DO I NEED A ROBOTS.TXT, THEN? I beg, could anyone please explain it to me precisely?

robots.txt will keep the contents of the pages out of the index.

But if you absolutely want to keep the references/urls/links out of the serps, then robots.txt may not suffice. If you don't mind the urls which might (or might not) show in the serps, then tedster's suggestion of how to use robots.txt is just fine. Whether the links will show depends on whether there are unblocked links to these pages existing somewhere on the web.

shaunm

7:47 am on Sep 6, 2012 (gmt 0)

@Shaddows

You use robots.txt to keep Google off your page. It stops them knowing stuff. That's it.

Real-world reasons for employing it include, but are not limited to
- Preserving Crawl budget (CSS files might not need crawling)
- Blocking file directories (/images/)
- Creating bad spider lists (block a directory, link to it in a hidden link, ban anything that finds its way there)

And why do I need to keep them off my pages/files when they can simply ignore the robots.txt and index those pages/files in their SERPs through external, internal links to those pages/files?

I know I got it wrong, but why don't I get the context yet?

Thanks

tedster

8:51 am on Sep 6, 2012 (gmt 0)

Yes, they can "index" the URL - but they won't "crawl" its content and insert that content into the search results, or even be able to shard the content data and rank the page based on those factors.

This means that a robots.txt blocked URL is HIGHLY unlikely to get much if any search traffic. In most cases, the URL will not even appear in the index. It takes links for that to happen, and even then the relevance data Google can access is very limited. For this reason alone, ranking for anything except a site: operator query is rare.

That's one reason I like robots.txt for a quick control on query string "sort" parameters and the like. Sorted product URLs are very easily inserted into social media links by well-intentioned fans. The robots.txt file is a down and dirty way to stop crawling from generating a mess of duplicate content as well as messing up the quality of your site's googlebot crawl altogether.

shaunm

10:00 am on Sep 6, 2012 (gmt 0)

Thanks Ted!

That's one reason I like robots.txt for a quick control on query string "sort" parameters and the like. Sorted product URLs are very easily inserted into social media links by well-intentioned fans

I was wondering why you did not use a Canonical instead for all the pages with query string(sort)?

tedster

10:06 am on Sep 6, 2012 (gmt 0)

Some of my clients prefer the canonical link. But I don't, because execution of the canonical tag is up to the search engine. If I don't even let them crawl the content in the first place, then the whole thing is a non-issue and Google can't mess it up.

I always prefer controlling what I can on my own server rather than throwing it into that immense pile of data that Google has to play with. I know very well that "stuff happens" when you've got a huge database, and Google's Caffeine infrastructure is far beyond what I can even think about very well.

I also don't want to see different "sort" variations crawled at all. I don't want to use up the bandwidth or the crawl budget.

shaunm

10:18 am on Sep 6, 2012 (gmt 0)

Thank you all for your insightful answers! I very much appreciate all your help. I like this forum much more than any other forum out there.

To put an end to this ever-growing thread, I now understand the purpose of Sitemaps and robots.txt. I have set out below my understanding of these two. If what I think is right is actually wrong, please tell me :) I would appreciate you saying so.

1. Primary reason of any robots.txt file is to 'stop the crawlers from crawling the page CONTENT/HEADERS'

2. Through a robots.txt, I can prevent a page/directory/file from getting crawled - No content, no header will be crawled. Because of this, the particular page/file will not rank for any search queries, but will appear only as a SNIPPET ONLY PAGE, that too not for any query but only when I use site: command. Also the SNIPPET only version shows up because this particular page has internal links, anchor texts pointing to it from somewhere else. If no reference URLs exists, even the SNIPPET won't come up in the SERPs.

3. If I put disallow rules for a page in robots.txt, but at the same time that page is in the SITEMAP, Google and other crawlers will prefer the SITEMAP details over the ROBOTS.TXT file. Thus, they will crawl that particular page and will index it in the SERPs - like they display any page (Title, Desc, URL)

4. If I use the NOINDEX meta tag and have that particular page in the SITEMAP, it will be indexed in the end, ignoring the NOINDEX command, because the URL is in the SITEMAP.

5. Finally, a robots.txt is USELESS and a WASTE OF TIME when you put it in place for blocking a particular page/portion of your website from appearing in the SERPs. What you should do instead is make sure all those URLs are not in your SITEMAP and put a NOINDEX meta tag in EACH and EVERY page?

Many thanks guys!

Best,

lucy24

11:09 am on Sep 6, 2012 (gmt 0)

Oh, hey, I just remembered something. I posted about it last year some time.

:: shuffling papers ::

The relevant bit is buried among a lot of other blather, so I'll just quote.

I once described someone as the world's leading authority on such-and-such obscure subject. I didn't and still don't know if he really is, but I haven't seen any serious competition. Months later it occurred to me that it should be possible to look it up.

::search, search::

Oh, now this sounds promising: an article by the person I named. I've read the article; it's damn good. Maybe I glossed over an introduction by some equally knowledgeable person, describing him as the world's leading et cetera.

No luck. Maybe in some older, cached version. This comes with the g### boilerplate, informing me that my search terms only appear in pages that link to this page. Let's stop right there.

Bingo. If the page I'm talking about had happened to be roboted-out, it would still have come up in my search, thanks to that "pages that link to this page". And some searchers would presumably have been curious enough to click.

shaunm

11:29 am on Sep 6, 2012 (gmt 0)

@lucy24
Seriously, what does all that mean? :D

I found your post here and have no idea what you were talking about lol :D:D:D
[webmasterworld.com...]

lucy24

8:59 pm on Sep 6, 2012 (gmt 0)

That's why I thought it was better to quote the relevant bit rather than link to the whole thread :) The key point is that I got a search-engine hit based purely on the text that linked to a page. It happened to be my own link, and the page happened to be fully indexed in its own right-- but both of those are tangential. The significant part is that it was a search that could have occurred in real life.

MikeNoLastName

12:11 pm on Sep 9, 2012 (gmt 0)

>Yes, they can "index" the URL - but they won't "crawl" its content and insert that content into the search results<

My experience seems to seriously contradict this; see my earlier post. They obviously DO crawl its content and DO keep it in the internal database, since they were able to show my blocked page (without displaying the actual title and description) first in the results based primarily on its title.

We can't control who links to a page from off our site, which leads to another glaring example of how a competitor COULD affect your rankings despite G's disclaimers. IF they know you have a duplicate of pages of the site somewhere which is robots.txt disallowed, and IF G keeps all that disallowed content locked away in its database (as it appears it does), but accidentally forgot to exclude it from their algorithm to judge duplicate content, then all a competitor needs to do is set some links to your disallowed duplicate content. It apparently doesn't even matter if they put any anchor text.

Leosghost

12:54 pm on Sep 9, 2012 (gmt 0)

The key point is that I got a search-engine hit based purely on the text that linked to a page. It happened to be my own link, and the page happened to be fully indexed in its own right-- but both of those are tangential. The significant part is that it was a search that could have occurred in real life.

Actually it happens very often, with real life search..I must see it with around 10% of searches that I make..

Robert Charlton

8:20 pm on Sep 9, 2012 (gmt 0)

That's one reason I like robots.txt for a quick control on query string "sort" parameters and the like. Sorted product URLs are very easily inserted into social media links by well-intentioned fans. The robots.txt file is a down and dirty way to stop crawling from generating a mess of duplicate content as well as messing up the quality of your site's googlebot crawl altogether.

I completely agree that robots.txt is a good way to keep such pages from being crawled, particularly on a large site where crawl budget is an issue. And yes, generally "a robots.txt blocked URL is HIGHLY unlikely to get much if any search traffic."

Where I've encountered problems, though, are in very different areas, with different concerns... eg, syndicated co-branded mirrors of an entire site placed in its own subdirectory on large daily newspapers. In this kind of situation, the pages did attract links, and we found that urls were being returned in the serps for competitive searches.

I've also encountered situations where development pages or information pages that clients wanted to keep out of the index, away from the eyes of competitors, were showing up on site:domain searches... not ranking competitively, but definitely not private.

These are areas, I feel, where you should not use robots.txt. I think it's helpful to understand both the situation and what you're trying to do... whether you want to prevent crawling a page's contents, or to prevent urls or "references" from appearing in the index... and to choose your methods accordingly.

There's no easy way to do both at once, because the references/links to a page can occur anywhere on the web.

lucy24

9:15 pm on Sep 9, 2012 (gmt 0)

Now, if there were a header directive (HTTP header, not HTML document head) that said both Don't crawl and Don't index, and the search engines obeyed that directive...
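For the "Don't index" half of that wish, an HTTP response header does exist: X-Robots-Tag. Here is a minimal sketch of sending it, assuming a Python WSGI application (the app and the page body are hypothetical, purely for illustration). Note that it does not cover the "Don't crawl" half: a crawler only sees the header if it is allowed to fetch the URL.

```python
def app(environ, start_response):
    """Hypothetical WSGI app that asks engines not to index this response."""
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        # Header-level equivalent of <meta name="robots" content="noindex">.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [b"<p>Internal page</p>"]
```

The same header can be set in the web server configuration instead, which is handy for non-HTML files that cannot carry a meta tag.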

Robert Charlton

9:29 pm on Sep 9, 2012 (gmt 0)

The problem is, lucy, that the engines want to index references to things. If you follow my threads above back to my first reported exchange with GoogleGuy, you'll see that I was similarly frustrated years ago.

Here's a fragment I started and then cut out of an earlier post...

My response in 2003...

...I'd suggest that less aggressive indexing here would be helpful. I can't imagine why Google would want to return a link to a blocked page.

My thoughts in July, 2012, in the thread I cited earlier in this discussion... [webmasterworld.com...]

I have very mixed thoughts about Google's aggressive indexing, btw. As a web professional who knows what I want indexed and what I don't, Google's aggressiveness in indexing has been a PITA. As a searcher looking for important information where webmasters have been too inept to make it visible, I can understand what Google's doing, and occasionally I've been glad they've done it.

Bottom line, if you don't want something indexed, use noindex and/or password protection.

"Indexed" here being used in the sloppy sense of being displayed in the serps.

Robert Charlton

9:33 pm on Sep 9, 2012 (gmt 0)

PS - I see your point about the header directive. That would be seen by the engines to be intentional enough that they wouldn't treat it as a possible accident (the way they do with 404s)... but you've seen how long it's taken them to believe that a 410 is a 410 is a 410.

Sgt_Kickaxe

2:16 am on Sep 12, 2012 (gmt 0)

< moved from another location >

Mods note: original post title:
Phantom pages as a result of Google ignoring robots.txt

Perplexed as to why one of my 500 page mini-sites suddenly began listing 30,000 pages indexed when performing a site:example.com search, I did some digging. Here's what I found. Hope it helps others, especially if you run WordPress.

- Though Google reports 30,000 pages indexed you cannot see them all in google. If you click to the last visible page of the results there are all of a sudden only a handful of pages worth of content indexed, not the original 30,000 google reported. Bug?

- By playing around with the site command, and adding some parameters I managed to get Google to reveal that the 29,500 EXTRA pages indexed are in fact comment edit pages which are supposed to be blocked by robots.txt

The entry for all 29,500 of these is as follows...

www.example.com/wp-admin/comment.php?action=editcomment&c=COMMENT NUMBER
A description for this result is not available because of this site's robots.txt – learn more

Notice how Google is indexing content that it says is restricted by robots.txt right on their results page?!?

- The webmaster tools new "index status" feature says that my site now has 30,000 known pages of which 2,000+ are indexed, 8,000 are not chosen and 28,000 are blocked by robots. None of that is accurate, the site has 500 articles. The site command was accurate 3 months ago and I've changed nothing since.

Questions: Should I remove the /wp-admin/ entry from robots.txt since Google is ignoring it completely? How can I remove these phantom pages from serps since Google is ignoring my directives? Should I be in touch with a lawyer since they are crawling where explicitly banned from? Other ideas?
[edited by: Robert_Charlton at 3:26 am (utc) on Sep 12, 2012]

tedster

4:14 am on Sep 12, 2012 (gmt 0)

Sgt_Kickaxe - I appreciate that your post did not begin its life in this thread. I think you'll find a lot of answers and directions already posted here. For the sake of completeness, I've added a few direct answers below.

since they are crawling where explicitly banned from?

They are not - they are INDEXING pages you have told them not to CRAWL. These are two different functions - crawling and indexing.

Should I be in touch with a lawyer

No, because a robots.txt file does not have the force of any law behind it.

How can I remove these phantom pages from serps since Google is ignoring my directives?

Add a noindex robots meta tag and ALLOW crawling. Google doesn't want to index meaningless URLs like this either, it's just crawling run amok technically. However, I'll bet it hasn't caused any actual search traffic problems for you. Am I right?
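To see why crawling has to be ALLOWED for the noindex tag to work, here is a rough sketch using Python's standard library. The function and class names are made up for illustration, and Python's urllib.robotparser is only a stand-in for how a real crawler reads robots.txt, so treat it as an approximation rather than Googlebot's actual behaviour.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives of any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))


def noindex_is_effective(robots_txt, url, html, agent="Googlebot"):
    """True only if the crawler may fetch the URL AND the page says noindex."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(agent, url):
        return False  # blocked: the crawler never gets to read the meta tag
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives
```

With "Disallow: /wp-admin/" in place, the function returns False for a /wp-admin/ URL even when the page carries the tag; with the rule removed it returns True.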

g1smd

9:44 am on Oct 5, 2012 (gmt 0)

Someone asked this in another thread,

This page has been blocked by robots.txt but is still indexed?

and I thought the answer to be important enough to copy over to this thread.

What do you mean by "indexed"?

Google records the fact that a URL exists as soon as it sees a link to it. It immediately adds the URL to its database, for later crawling.

A URL "exists" as soon as a link is created pointing to a web resource - even if it is subsequently found that the hostname doesn't respond, or there's no page by that name on that hostname, or that page crawling is blocked by a robots.txt rule. The URL itself still "exists" for all of the time that there's a link with that URL in, found somewhere on the web.

If the hostname responds but the resource is blocked by an entry in the robots.txt file Googlebot will not fetch it (but page preview might) but Google will still keep a note that the URL "exists".

In order to determine the HTTP status for the URL and index the content on the page, Googlebot has to fetch it and will only do so if it is not blocked by robots.txt.

The page might return 301, 404, 403 or other non-content status codes. If the page returns 200 OK, only then is the on-page content indexed. However, if the page itself contains a meta robots noindex directive, the content will not appear in any search results.
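The sequence above can be sketched as a small decision function. This is only a summary of the reasoning in this post, not a description of Google's internals; the function name and the outcome labels are invented for illustration.

```python
def listing_for(url_linked, blocked_by_robots, status=None, has_noindex=False):
    """Rough model of what a search engine can know about a URL.

    The returned labels are hypothetical shorthand, not official terms.
    """
    if not url_linked:
        return "unknown"          # no link seen, so the URL is never recorded
    if blocked_by_robots:
        return "url-only"         # URL is noted, but the page is never fetched
    if status != 200:
        return "no-content"       # 301, 404, 403 etc.: nothing on-page to index
    if has_noindex:
        return "excluded"         # fetched, and the page asks to stay out
    return "content-indexed"      # fetched, 200 OK, no noindex
```

For example, a linked URL that is disallowed in robots.txt comes out as "url-only" regardless of any noindex tag on the page, because the fetch never happens.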

Robert Charlton

7:24 pm on Oct 8, 2012 (gmt 0)

g1smd - Thanks for re-posting. The thread you refer to is...

Google: 65 Search Quality Changes For August and September
http://www.webmasterworld.com/google/4504193.htm

One of the search "quality updates" announced by Google involves this one...

#82407. [project "Other Search Features"] For pages that we do not crawl because of robots.txt, we are usually unable to generate a snippet for users to preview what's on the page. This change added a replacement snippet that explains that there's no description available because of robots.txt.

That's led to several questions by members here, and we're encouraging those to be posted in this thread, which has already covered much of the ground.

lucy24

10:36 pm on Oct 8, 2012 (gmt 0)

This change added a replacement snippet that explains that there's no description available because of robots.txt

Yup, seen that in real life. Has anyone picked up any information-- it would have to be anecdotal at this point-- about how-or-whether this phrase affects ordinary human searchers?

g1smd

7:22 pm on Oct 9, 2012 (gmt 0)

Human nature being what it is, I'd expect a number of clicks from people wondering what's on the 'uber-sekrit' page.

Hollywood

8:05 pm on Oct 9, 2012 (gmt 0)

I blocked Google from files in my Robots.txt before even putting them live and guess what, Google decided to send the robots to check the pages anyway, really PISSES me off.

g1smd

8:28 pm on Oct 9, 2012 (gmt 0)

Which UA and IP was snooping about?

For your blocking rules, which UA did you target for those specific URLs? Does your robots.txt file have separate sections for different UAs?

Hollywood

9:47 pm on Oct 9, 2012 (gmt 0)

g1smd, I could find the bots that checked again but not at that location. My robots.txt is super simple

Disallow: /red-bot.html

And they went to those files /red-bot.html anyway... a few times.

g1smd

9:57 pm on Oct 9, 2012 (gmt 0)

Is that the whole file? Is there a

User-agent: ...

line?
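Why the User-agent line matters can be shown with a quick check. This is a sketch using Python's built-in urllib.robotparser; how it treats a rule with no User-agent line above it is that library's behaviour, and other parsers (Googlebot included) may handle a malformed file differently. The helper name is made up.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt, path, agent="Googlebot"):
    """Return True if this parser considers the path disallowed for the agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, path)


# A Disallow rule on its own, with no User-agent line above it.
bare_rule = "Disallow: /red-bot.html\n"

# The same rule inside a group that applies to every crawler.
with_group = "User-agent: *\nDisallow: /red-bot.html\n"

print(is_blocked(bare_rule, "/red-bot.html"))
print(is_blocked(with_group, "/red-bot.html"))
```

In this parser the bare rule belongs to no group and is ignored, so only the second file actually blocks the page.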

Hollywood

10:45 pm on Oct 9, 2012 (gmt 0)

Yes has user agent, but there are not many lines.... it's truly simple. No newbie, ran a site that took in 50 million in 2 weeks having to do with the NFL.
