Page indexed in Google despite robots.txt entry

How did they find it in the first place?

pmkpmk

11:13 am on Apr 28, 2005 (gmt 0)

I have a page at www.company.com/testpage.html. This page is NOT accessible via navigation!

In my robots.txt I have:


User-agent: *
Disallow: testpage.html

As of yesterday, it was nevertheless indexed in Google.

Any ideas? And how did they find it in the first place?

larryhatch

11:18 am on Apr 28, 2005 (gmt 0)

Hi pmk:

I'm the wrong guy to ask, but should the disallow read something like

Disallow: /testpage.html with a leading slash? -Larry

Leosghost

11:26 am on Apr 28, 2005 (gmt 0)

Who looks at it using a machine with a "G" toolbar installed ..?

Lord Majestic

12:02 pm on Apr 28, 2005 (gmt 0)

Adding the slash is necessary: the robots.txt exclusion standard states that "any URL that starts with this value will not be retrieved", and since every URL path requested from a server starts with a slash, a value without one will match nothing.

pmkpmk

1:10 pm on Apr 28, 2005 (gmt 0)

WebmasterWorld's own robots.txt evaluator found no error in it :-)

Lord Majestic

1:20 am on Apr 29, 2005 (gmt 0)

I just played a bit with the robots.txt validator, and it considered the following robots.txt valid:

User-agent: *
Disallow: forum/
Disallow: forum
Disallow: /forum/
Disallow: /forum

Now, syntax-wise it is indeed a valid robots.txt; then again, syntax-wise 1/0 is a correct operation, but it won't work well at run-time. From a run-time point of view it contains 3 incorrect disallow statements that should have been picked up by the validator -- they are all mistakes. Here are the details:

> Disallow: forum/
> Disallow: forum

These are both incorrect because they have no / before the actual path. The robots.txt standard is pretty clear about disallow statements: "This can be a full path, or a partial path; any URL that starts with this value will not be retrieved."

The mistake is to think that the path part of a URL does not start with /. Say in [example.com...] the part of the URL that will be checked against disallow statements is /forum, and clearly /forum starts with neither "forum" nor "forum/" -- it has a / before the actual value.

Don't believe me? Check the robotstxt.org site for yourself -- all their examples have a / before the actual path, and for a reason: the / is an integral component of the path, and a disallow statement with no / is meaningless.

> Disallow: /forum/

You think this is correct? Think again: a URL can be [example.com...] without a slash at the end and still be perfectly valid, and it will not be blocked because /forum does not start with /forum/

I had to modify my robots.txt code to take this error into account -- robots.txt files often use /'s at the end, written without realising that the original URL may lack one and still be valid.

> Disallow: /forum

Now this is THE way to do it. All the other errors should have been highlighted by the validator, and I reckon the only reason they were not is that the validator is not a crawler; had it been one, it would have received enough hate mail to analyse the common errors people make and explain them here.

IMO validator should be changed to be more than just syntax checker, if WebmasterWorld agrees with my argument then I will send the invoice via post ;)
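
The prefix rule described above can be tried out with a few lines of Python. This is only a minimal sketch of the "starts with this value" wording from the standard, not the code of any actual crawler or of the validator; the function name `is_disallowed` and the sample paths are made up for illustration.

```python
def is_disallowed(url_path: str, disallow_values: list[str]) -> bool:
    """Return True if url_path starts with any non-empty Disallow value.

    This mirrors the plain prefix rule quoted from the standard; it does
    not model wildcards or any robot-specific extensions.
    """
    return any(value and url_path.startswith(value) for value in disallow_values)


# URL paths as a server receives them: always with a leading slash.
paths = ["/forum", "/forum/index.html"]

for value in ["forum/", "forum", "/forum/", "/forum"]:
    blocked = [p for p in paths if is_disallowed(p, [value])]
    print(f"Disallow: {value:<8} blocks {blocked}")
```

Under this reading the two slashless values block nothing, "/forum/" blocks only the path below the directory, and "/forum" blocks both sample paths.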

Leosghost

9:11 am on Apr 29, 2005 (gmt 0)

Concur with "his lordship" ..I always use this form

Disallow: /forum

for robots.txt ... In the world of disobedient and badly coded (sometimes deliberately) or "rip it all" bots, this is the only one that will stop the first 2.

Brett_Tabke

2:33 pm on Apr 29, 2005 (gmt 0)

User-agent: *
Disallow: testpage.html

That should block indexing on every page named testpage.html in all directories throughout the site. Using a leading slash would only specify the file in the root.

pmkpmk

2:36 pm on Apr 29, 2005 (gmt 0)

That should block indexing on every page named testpage.html in all directories throughout the site.

Which would have been fine. But G has still indexed the page -- and still has it in the index, although robots.txt has been requested a few times since then and I changed the robots.txt to include all possible variations:

User-agent: *
Disallow: testpage.html
Disallow: /testpage.html
Disallow: html/testpage.html
Disallow: /html/testpage.html

(The last two being an older name for the same page)

Brett_Tabke

2:40 pm on Apr 29, 2005 (gmt 0)

Is it a case of the robots.txt being a bit ahead of G's downloading of it? Google can fall behind in robots.txt fetching. Sometimes it is behind 45 days (e.g. a change in a robots.txt can take 30-60 days to be realized).

pmkpmk

2:43 pm on Apr 29, 2005 (gmt 0)

Hmmm.... need to dig into the logfiles, but I'm pretty sure it was in place when G first fetched the page. Which leads to the question WHY the page was fetched in the first place because it is not connected in the navigation.

Lord Majestic

2:45 pm on Apr 29, 2005 (gmt 0)

That should block indexing on every page named testpage.html in all directories throughout the site. Using a leading slash would only specify the file in the root.

It may work in Google's implementation (they appear to use pattern matching), but it may not do so for other robots that follow the robots.txt standard to the letter. Here is the quote:

"Disallow

The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html."

"Starts" is the keyword -- if you pick a filename from the middle of the URL, or from its end, then don't expect it to be matched by robots that follow the standard to the letter.
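
The /help example from the quote can be checked against Python's standard-library parser, `urllib.robotparser`, which is one implementation of prefix matching. This is a small sketch for illustration only: the helper `build_parser`, the agent name "ExampleBot" and the example.com URLs are assumptions, and other robots may behave differently.

```python
from urllib.robotparser import RobotFileParser


def build_parser(disallow_value: str) -> RobotFileParser:
    """Parse a one-rule robots.txt that applies to every user agent."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_value}"])
    return parser


help_prefix = build_parser("/help")   # no trailing slash
help_dir = build_parser("/help/")     # trailing slash

for url in ("http://example.com/help.html", "http://example.com/help/index.html"):
    print(
        url,
        "| /help allows fetch:", help_prefix.can_fetch("ExampleBot", url),
        "| /help/ allows fetch:", help_dir.can_fetch("ExampleBot", url),
    )
```

With this parser, "Disallow: /help" refuses both URLs, while "Disallow: /help/" refuses only the one under the directory, which agrees with the wording of the standard.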

pageoneresults

3:03 pm on Apr 29, 2005 (gmt 0)

Based on my personal experience, it almost seems as though Google will index the URI only for those items that are specifically disallowed in your robots.txt file. I sometimes wonder whether adding the path to the robots.txt file alerts Googlebot, which will then spider the page but not index its contents. It will index the URI only, whether it is linked to or not. Again, this has been my experience.

I'll have to concur with Brett on what happens when the path is not preceded with a forward slash. For example...

User-agent: *
Disallow: testpage.html

Blocks all pages named testpage.html

Disallow: /testpage.html

Blocks only the page at the root level.

Disallow: html/testpage.html

Blocks all pages with html/testpage.html in path.

Disallow: /html/testpage.html

Blocks only the testpage.html in the /html/ directory.

[edited by: pageoneresults at 3:06 pm (utc) on April 29, 2005]
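
To make the disagreement in this thread concrete, here is a rough Python sketch that compares two readings of the same four Disallow values. `blocked_prefix` is the literal "starts with" rule from the standard. `blocked_loose` is only a guess at the behaviour described in the post above (a value without a leading slash matches anywhere in the path); it is a hypothetical model, not a documented algorithm of Google or any other robot, and both function names are invented here.

```python
def blocked_prefix(url_path: str, value: str) -> bool:
    """Literal reading of the standard: the path must start with the value."""
    return url_path.startswith(value)


def blocked_loose(url_path: str, value: str) -> bool:
    """Hypothetical reading: slashless values match anywhere in the path."""
    if value.startswith("/"):
        return url_path.startswith(value)
    return value in url_path


paths = ["/testpage.html", "/html/testpage.html"]
values = ["testpage.html", "/testpage.html", "html/testpage.html", "/html/testpage.html"]

for value in values:
    strict = [p for p in paths if blocked_prefix(p, value)]
    loose = [p for p in paths if blocked_loose(p, value)]
    print(f"Disallow: {value}")
    print(f"  prefix reading blocks: {strict}")
    print(f"  loose reading blocks:  {loose}")
```

The two readings agree whenever the value starts with a slash and differ only for the slashless values, which block nothing under the prefix reading.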

pmkpmk

3:04 pm on Apr 29, 2005 (gmt 0)

The page in question shows up with its Metatag-description.

pageoneresults

3:16 pm on Apr 29, 2005 (gmt 0)

Yikes, I've not seen that before.

There have been many topics concerning Googlebot indexing disallowed content. I believe it was jdMorgan who suggested that if you have content and don't want it indexed, then don't use a Disallow: in the robots.txt file. What he suggested was to use a Robots META Tag on the page not to be indexed...

<meta name="robots" content="none">

I've experimented with this and it does work. I removed a line from my robots.txt file disallowing a specific page from my site. It was showing with a URI only listing. I dropped the Robots META Tag in that page and within 60 days the URI only listing was gone. Just checked again and it is definitely not indexed. Pages that I have disallowed in the robots.txt file are indexed with URI only.

Lord Majestic

3:18 pm on Apr 29, 2005 (gmt 0)

Pages that I have disallowed in the robots.txt file are indexed with URI only.

Well, the existence of a URI does not mean the page is indexed -- some OTHER content (anchor text) pointing to your page was indexed, not the page itself. IMO that's fair play even if the site's robots.txt says not to index it, and if META tags say not to index it, as the page is not indexed as requested.

pmkpmk

3:41 pm on Apr 29, 2005 (gmt 0)

I just saw that with a different query it comes on SERP #3 with text from the page, NOT the metatags.

jdMorgan

4:13 pm on Apr 29, 2005 (gmt 0)

Since I've been quoted here, I'd like to chime in and say that I interpret -- and have observed all robots to interpret -- the Robots Exclusion Standard text that LordMajestic cited above to mean that robots use prefix-matching.

Therefore, a leading slash is required on all URL-paths to be disallowed, unless the record is specific to a particular robot that has been demonstrated to use pattern-matching as opposed to prefix-matching.

IOW, I would treat "slashless URL-path" usage in the same way as other non-standard robots protocol extensions: use them only in records directed to a specific robot that is known to support the extension.

There are some subtleties of terminology that may catch the unwary here: Robots.txt tells compliant robots not to fetch (read, request, or GET) a resource (page, file, etc.), while the on-page robots meta-tag tells compliant robots not to index the resource (list as a search result, include in their index or search results).

In order for an on-page robots meta-tag to have any effect, that page must *not* be Disallowed in robots.txt, otherwise the page and the tag will never be fetched by a compliant robot.
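
The interaction between the two mechanisms can be sketched in Python as a quick self-check. This is a minimal illustration under stated assumptions, not a tool any search engine provides: the class `RobotsMetaParser`, the helpers `robots_meta_directives` and `meta_tag_is_reachable`, and the sample page are all invented here, and the robots.txt check uses the plain prefix rule only.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="...">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def robots_meta_directives(html: str) -> set[str]:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def meta_tag_is_reachable(url_path: str, disallow_values: list[str]) -> bool:
    """A compliant robot only sees the tag if it may fetch the page at all."""
    return not any(v and url_path.startswith(v) for v in disallow_values)


page = '<html><head><meta name="robots" content="noindex"></head><body>Test</body></html>'
directives = robots_meta_directives(page)

for disallow_values in (["/testpage.html"], []):
    reachable = meta_tag_is_reachable("/testpage.html", disallow_values)
    effective = reachable and "noindex" in directives
    print(f"Disallow values {disallow_values}: noindex can take effect = {effective}")
```

With the page disallowed, the noindex tag is never fetched and so cannot take effect; with the Disallow line removed, a compliant robot can read the tag.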

If Googlebot finds a link to a page and that page is Disallowed in robots.txt, then Google will show that page in search results as a URL-only listing -- No title and no description. It will show up in searches based on the link-text used in the link(s) that Googlebot found on other pages. MSN and Ask Jeeves appear to behave in a similar manner.

Yahoo does almost the same thing, except that the listing in search results is not just the page's URL -- Yahoo uses the link text it found as the title of the page. (This is a relatively new behaviour, and may change).

Jim

pmkpmk

8:05 pm on Apr 29, 2005 (gmt 0)

So my first (and maybe only) error was to assume that robots.txt excludes a page from being listed/indexed at all. OK, it makes sense the way you describe it, but I guess there's a huge, huge number of people who fall into the same trap as I did.

Still a few mysteries remain.

#1: For one query, the page appears with its META description as SERP result #1; for a second query it appears with on-page text excerpts as SERP result #30something. In no case did I find a URL-only listing.

#2: Why has it been indexed in the first place? It is linked from nowhere - you need to KNOW the URL in order to get to the page.

pmkpmk

8:13 pm on Apr 29, 2005 (gmt 0)

Mea culpa, mea maxima culpa!

While setting up the page, I had some technical problems which I discussed in a public forum. Somebody asked me if he could take a look and I dropped the URL. The forum is nicely indexed in Google, but the backlink could only be found by the link: query.

OK, that solved mystery #2.

P.S. Darn... I missed my 1111th post...

pmkpmk

12:44 pm on Apr 30, 2005 (gmt 0)

As of today, the page is not indexed in Google anymore. All I changed is the robots.txt according to what has been proposed here.

pmkpmk

5:14 pm on May 1, 2005 (gmt 0)

It's back in again. Guess I need to put these on-page META tags in.

jdMorgan

9:17 pm on May 1, 2005 (gmt 0)

Remember that Google is not "one big computer." It is thousands of machines all over the world. They are updated over a period of time, and not all at once. Your page may have been removed from some of them but not others. Best to wait a few weeks before you decide to change something.

The key is that a page may be listed as a URL-only listing if it is disallowed in robots.txt. You can remove it completely (at the cost of it being spidered repeatedly) by "allowing" it in robots.txt, and placing the <meta name="robots" content="noindex"> tag on the page itself.

Jim