|Google not obeying robots.txt|
Disallow: /*? not going through
| 7:27 am on Sep 25, 2009 (gmt 0)|
I run several sites and use Google AdWords. On all of them I use the same tracking technique: appending the variables that are valuable to me to the URLs. Nothing special, just keywords and referring URLs.
Long ago, in Webmaster Tools, I noticed that a page with variables appended got listed under "Links to your site".
Sometimes the link comes from a spammy site that picks up links and ads from AdWords (I have no clue why), and sometimes it comes from people who write blog content by hand and link to my site for reference, copying the link from an AdWords ad.
Either way, it's duplicate content when index.html, index.html?1, and index.html?2 are all listed.
To fix that, I added this rule to robots.txt:
Disallow: /*?
There is even a reference from Google about it:
[google.com...] under Pattern matching.
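For anyone who wants to check which URLs a wildcard rule like this would cover, here is a minimal Python sketch of Google-style pattern matching. It assumes only the two documented extensions ('*' for any run of characters, a trailing '$' for end of URL); the helper names rule_to_regex and is_blocked are made up for illustration, and the sketch ignores Allow lines and rule precedence entirely.

```python
import re
from typing import Iterable, Pattern


def rule_to_regex(rule: str) -> Pattern[str]:
    """Translate one robots.txt path rule into a compiled regex.

    Assumption: only '*' (any run of characters) and a trailing '$'
    (end of URL) are special; everything else is matched literally.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with ".*" for each '*'.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path: str, disallow_rules: Iterable[str]) -> bool:
    """Return True if any Disallow rule matches the path from its start."""
    return any(rule_to_regex(rule).match(path) for rule in disallow_rules)


if __name__ == "__main__":
    rules = ["/*?"]  # the rule discussed in this thread
    for path in ["/index.html", "/index.html?1", "/index.html?2"]:
        print(path, "blocked" if is_blocked(path, rules) else "allowed")
```

Under these assumptions, the rule blocks any path that contains a question mark and leaves the plain index.html fetchable.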
Now, does seeing those URLs under Links in Webmaster Tools really mean that each one gets indexed as a separate page? Or does it simply mean "this is how somebody links to you"?
| 6:31 pm on Sep 25, 2009 (gmt 0)|
I got confused here.
When I take a closer look at all the pages in Webmaster Tools, I see that the stuff I block in robots.txt is listed as blocked.
The URLs listed under external links simply show how other sites link to my site.
Finally, when I query Google with site:example.com, I get the pages in my sitemap. Still, if I pick "include omitted results", URLs with the trailing stuff after the question mark show up, and so do other pages banned through robots.txt.
Is that right?
I would think that anything banned through robots.txt should be excluded 100%.
| 10:41 pm on Sep 25, 2009 (gmt 0)|
A robots.txt Disallow says "Don't fetch this page," and that's all it says. Google does not have to fetch the page to show it as a URL-only result in search, or to show it as a URL with link-text from the link on a site that links to it.
I don't like this behaviour either, but it's in full accordance with the purpose and scope of the Standard for Robot Exclusion, which explicitly states that it is a 'fetch control' mechanism.
If you don't want the page to show up in the "show omitted" search results, then don't Disallow it in robots.txt. Instead, permit Google to fetch it, but only after adding a <meta name="robots" content="noindex,nofollow"> tag to the page.
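As a rough illustration of that approach, here is a small Python sketch that emits the tag only for requests carrying a query string. It assumes the pages are generated by server-side code that can see the requested URL, which may not hold for static HTML; robots_meta_for and NOINDEX_TAG are hypothetical names, not part of any library.

```python
from urllib.parse import urlsplit

# The tag quoted above, kept as a constant for the template to insert.
NOINDEX_TAG = '<meta name="robots" content="noindex,nofollow">'


def robots_meta_for(url: str) -> str:
    """Return the noindex tag for URLs with a query string, else ''.

    Hypothetical helper: the page template would place the returned
    string inside <head> when rendering the response.
    """
    return NOINDEX_TAG if urlsplit(url).query else ""


if __name__ == "__main__":
    for url in ["/index.html", "/index.html?1", "/index.html?kw=widgets"]:
        print(url, "->", robots_meta_for(url) or "(no tag)")
```

The plain URL stays indexable while its query-string variants carry the tag, but this only works if those variants are not also Disallowed in robots.txt, since Google has to fetch a page to see its meta tags.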
| 4:48 am on Sep 26, 2009 (gmt 0)|
|If you don't want the page to show up in the "show omitted" search results, then don't Disallow it in robots.txt. Instead, permit Google to fetch it, but only after adding a <meta name="robots" content="noindex,nofollow"> tag to the page. |
Thanks very much for the thorough explanation, Jim.
But... here I'm coming from the same spot as in the Apache forum, where you replied as well. The issue is:
I cannot change the meta tags on those pages. It's the same page that has to be indexed, just with variables (a query string) appended.
There is that "parameter exclusion" feature in Webmaster Tools that Tedster mentioned under "Google Search" (yeah, I posted about the same problem there, too), which I hope will help.