Google "URL exclusion" not following our robots.txt!

Forum Moderators: goodroi

robots.txt in subdirectory is not getting followed

joeduck

4:54 am on Sep 18, 2005 (gmt 0)

Any advice would be appreciated!

We have a subdirectory of form subdirectory.oursite.com

Google has indexed many dynamic URLs of the form subdir.oursite.com and we want to get rid of all of these from the index.

We put a Googlebot disallow in robots.txt and placed it in the subdirectory.

Then we submitted it to Google's URL exclusion tool.

We got a message "completed" by Google, but the pages are still all there!

MarkHutch

5:18 am on Sep 18, 2005 (gmt 0)

This will not be instant. It will take time to get the pages removed, just like it took time to get them in there in the first place. If you have a properly formatted robots.txt file in the root of that subdirectory, the pages will disappear from the index over the next few months. However, you will still be able to see the pages if you type the exact URL into Google. They will just be text links with no description. This is how the removal normally works.

joeduck

5:32 am on Sep 18, 2005 (gmt 0)

Hi Mark -

But in the past I've had good results with the URL exclusion tool. It takes a few days and the pages disappear, as long as I've excluded them in robots.txt. Changing robots.txt and *waiting* would take months, whereas the URL exclusion tool acts quickly. In this case, though, it did not follow the instructions in robots.txt.

Dijkgraaf

3:08 am on Sep 19, 2005 (gmt 0)

Robots.txt should always be in the root folder of the website, as per the standard. robots.txt files in other directories will not be requested by bots/spiders.
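
To make the "root folder" rule concrete, here is a minimal Python sketch of how a crawler works out which robots.txt applies to a page: it keeps only the scheme and host and always asks for /robots.txt. The helper name robots_url and the sample URLs are assumptions for illustration, not anything taken from Google.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Only the scheme and host matter; the path is always /robots.txt,
    # no matter how deep the page sits in the directory tree.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# A page on a subdomain is governed by that subdomain's own root file.
print(robots_url("http://subdirectory.example.com/calendar/show?id=7"))
# A robots.txt placed inside a directory is never the one consulted.
print(robots_url("http://www.example.com/subdirectory/robots.txt"))
```

Note that each hostname gets its own file: a subdomain does not inherit the robots.txt of the main site.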

Also, you haven't given an example of the exclusions you have in your robots.txt file, but I suspect that you haven't done them correctly.

They should be in the format of
disallow: /directory/startofthingtoexclude

I suspect you probably have something like
disallow: startofthingtoexclude
which is not valid, as all items have to start with /
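
If you want to check a file for this mistake, something like the following Python sketch would do it. It only tests the leading-slash rule described above; the function name check_disallow_lines and the sample file contents are made up for the example.

```python
def check_disallow_lines(robots_txt: str) -> list:
    """Report Disallow values that do not start with '/' (hypothetical helper)."""
    problems = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() != "disallow":
            continue
        value = value.strip()
        # An empty Disallow is valid and means "nothing is excluded".
        if value and not value.startswith("/"):
            problems.append(f"line {number}: {value!r} does not start with '/'")
    return problems


sample = """\
User-agent: Googlebot
Disallow: startofthingtoexclude
Disallow: /directory/startofthingtoexclude
"""

for problem in check_disallow_lines(sample):
    print(problem)
```

Only the second line of the sample is flagged; the rule that starts with / passes.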

Also, you seem to be confusing subdomain and subdirectory. They are not the same thing, and this could be part of your problem.
Do you see the robots.txt file when you enter
[subdirectory.example.com...]

[edited by: ThomasB at 10:09 pm (utc) on Sep. 20, 2005]
[edit reason] examplified [/edit]

joeduck

4:39 pm on Sep 20, 2005 (gmt 0)

Hi D -

We are trying to delete all content indexed in this subdir:

events.oursite.com

We placed the robots.txt in the subdomain here:

[events.example.com...]

The robots.txt has only these two lines of text:

User-agent: Googlebot
Disallow: /
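
For anyone who wants to sanity-check those two lines locally, here is a rough sketch using Python's standard urllib.robotparser. It shows how a parser that follows the basic standard reads the rules; it is not a claim about how Googlebot itself processes them. The page URL and the "SomeOtherBot" agent name are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# The exact two-line file quoted above.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""

parser = RobotFileParser()
# parse() takes the file as a list of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical dynamic URL on the subdomain.
page = "http://events.example.com/listing?id=42"

googlebot_allowed = parser.can_fetch("Googlebot", page)
other_allowed = parser.can_fetch("SomeOtherBot", page)

print("Googlebot may fetch:", googlebot_allowed)    # False
print("SomeOtherBot may fetch:", other_allowed)     # True
```

So the file blocks Googlebot from every path on that host and leaves other crawlers unaffected, since there is no "User-agent: *" record.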

[edited by: ThomasB at 10:10 pm (utc) on Sep. 20, 2005]
[edit reason] examplified [/edit]

Dijkgraaf

1:37 am on Sep 21, 2005 (gmt 0)

Your robots.txt is correct.

Maybe it just takes a little bit of time for it to take effect.
Give it a few days, and if your pages are still listed by Google, then contact them.

joeduck

2:36 am on Sep 21, 2005 (gmt 0)

The problem is that with URL exclusion it should take only a few days: Google follows the new robots.txt instructions, you get a removal "complete" message, and the pages are gone. I've used it many times for subdirectories, but never on a subdomain as here.

I got the "complete" message and they are not gone, which means the robots.txt is not getting followed correctly.

jdMorgan

2:51 am on Sep 21, 2005 (gmt 0)

I'm sure you're aware that Google is not one big computer. I think the last I heard, it was 170,000 computers distributed all over the world. As a result, it takes time to 'roll out' updates to all of these machines.

This is the same cause behind the hundreds of posts we see when an update is occurring (or even rumored) that say "My site is in one minute and out the next -- I'm worried!" The reason is that with load-sharing and round-robin DNS, you never know just what server a Google domain name will resolve to; one minute you connect to one machine, the next minute to an entirely different one. And updating them takes time.

The fact that this is a subdomain hosted in a subdirectory should have nothing to do with it. After all, "www.example.com" is a subdomain of example.com, and there are many "www" sites that seem to work just fine... :)

I'd give this a few more days, and then see where you stand.

Jim

joeduck

9:00 pm on Sep 21, 2005 (gmt 0)

Thx Jim and everybody...

The listings are gone today so I'm happy.

I still think it was odd that it took two requests and about 2 weeks for the URLs to disappear, but I'm happy now.

Also, it appears some aspects of the "duplicate content" filter are now gone, though our Google traffic has not come back noticeably.