
Home / Forums Index / Search Engines / Sitemaps, Meta Data, and robots.txt
Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
Google not obeying robots.txt
Disallow: /*? not going through
smallcompany




msg:3995467
 7:27 am on Sep 25, 2009 (gmt 0)

I run several sites and use Google AdWords. On all of them I use the same tracking technique: appending variables to the URLs that are valuable to me. Nothing special, just keywords and referring URLs.

Long ago, in webmaster tools, I noticed that a page with variables appended got listed under "Links to your site".

Sometimes it is a spammy site that scrapes links and ads from AdWords (I have no clue why), and sometimes it is people who write blog content by hand and link to my site for reference, copying the link from an AdWords ad.

Anyhow, it's duplicate content when index.html, index.html?1, and index.html?2 are all listed.

To fix that, I used robots.txt by simply adding this:

Disallow: /*?

There is even a reference from Google about it:
[google.com...] under Pattern matching.
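
For anyone wondering what that rule actually covers, here is a rough Python sketch of Google-style pattern matching as I understand it: `*` matches any run of characters and a trailing `$` anchors the end of the URL. The function name `rule_matches` is made up for this illustration, and this is not Google's own code.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path (query included).

    Assumed semantics: '*' matches any run of characters, a trailing '$'
    anchors the end of the URL, and everything else is a literal prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start of the path, like a prefix match.
    return re.match(regex, path) is not None

for path in ["/index.html", "/index.html?1", "/index.html?2"]:
    print(path, rule_matches("/*?", path))
```

Under those assumptions, `Disallow: /*?` leaves /index.html fetchable and blocks fetching of any URL that contains a question mark.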

Now, does just seeing those URLs under Links in Webmaster Tools really mean that the page gets indexed as a separate one? Or does it simply mean "this is how somebody links to you"?

Thanks

 

smallcompany




msg:3995816
 6:31 pm on Sep 25, 2009 (gmt 0)

I got confused here.

When I take a closer look at all the pages under Webmaster Tools, I see that the stuff I block in robots.txt is listed as blocked.

The URLs listed under external links simply show how other sites link to my site.

Finally, when I query Google with site:example.com, I get pages as per my sitemap. Still, if I pick "include omitted results", the URLs with trailing stuff after the question mark show up, and so do other pages banned through robots.txt.
Is that right?
I would think that if something is banned through robots.txt, it should be excluded 100%.

jdMorgan




msg:3995936
 10:41 pm on Sep 25, 2009 (gmt 0)

A robots.txt Disallow says "Don't fetch this page," and that's all it says. Google does not have to fetch the page to show it as a URL-only result in search, or to show it as a URL with link-text from the link on a site that links to it.

I don't like this behaviour either, but it's in full accordance with the purpose and scope of the Standard for Robot Exclusion, which explicitly states that it is a 'fetch control' mechanism.

If you don't want the page to show up in the "show omitted" search results, then don't Disallow it in robots.txt. Instead, permit Google to fetch it, but only after adding a <meta name="robots" content="noindex,nofollow"> tag to the page.
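
To check that a page really carries that tag once you add it, something like the following Python sketch could be used. It only reads the robots meta tag out of an HTML string with the standard library; the class and function names are invented for this example, and the sample page is made up.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Split "noindex,nofollow" into individual lowercase directives.
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )

def robots_directives(html: str) -> set:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives

# Hypothetical page used only for illustration.
page = '<html><head><meta name="robots" content="noindex,nofollow"></head><body></body></html>'
print(sorted(robots_directives(page)))
```

Remember that the tag only works if the robot is allowed to fetch the page, so the URL must not also be Disallowed in robots.txt.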

Jim

smallcompany




msg:3996077
 4:48 am on Sep 26, 2009 (gmt 0)

If you don't want the page to show up in the "show omitted" search results, then don't Disallow it in robots.txt. Instead, permit Google to fetch it, but only after adding a <meta name="robots" content="noindex,nofollow"> tag to the page.

Thanks very much for the thorough explanation, Jim.

But... here I'm coming from the same spot as in the Apache forum, where you replied as well. It is about:

page.html
page.html?v=something

I cannot change the meta tags in those pages. It's the same page, which has to be indexed, just with variables applied (a query string).

Vicious circle...

There is that "parameter exclusion" in Webmaster Tools that Tedster mentioned under "Google Search" (yeah, I posted about the same problem there, too), which I hope will help.
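
Just to illustrate what I mean by "same page with variables applied", here is a small Python sketch that groups URLs by what is left after dropping the query string. It is only an illustration of the idea, not how the Webmaster Tools setting works internally; the helper names and the example.com URLs are made up.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url: str) -> str:
    """Return the URL without its query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def group_by_base(urls):
    """Group URL variants under the base URL they share."""
    groups = {}
    for url in urls:
        groups.setdefault(strip_query(url), []).append(url)
    return groups

# Hypothetical URLs for illustration only.
urls = [
    "http://example.com/page.html",
    "http://example.com/page.html?v=something",
    "http://example.com/other.html?v=something",
]
for base, variants in group_by_base(urls).items():
    print(base, "<-", variants)
```

Every variant in a group is the same document, so ideally only the base URL would be indexed.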


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
Terms of Service ¦ Privacy Policy ¦ Report Problem ¦ About
© Webmaster World 1996-2014 all rights reserved