|Robots.txt for canonical preference|
having dup content issues
A few of my sites are having duplicate content issues with Bing. bingdude told me my robots.txt is not formatted correctly and that I need to set up some rules to establish that http://www.example.com/ is the canonical version, and to ignore the https and non-www versions as well as any other added parameters, for the entire site. I think it is important to note that I have lots of folders. I did some digging on here and online and can't find a specific answer/example for my issue. Can someone give me some input and/or point me in the right direction?
Not in robots.txt.
I add a canonical tag to my sites where relevant but usually force the issue in the site's domain setup (I use IIS and set up a 301 redirect from, e.g., non-www to www). The canonical tag solution is really for duplicate pages but can also apply to querystring variations on a page.
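For anyone following along, that tag is actually a link element in the head of each page; the URL and filename here are just placeholders:

```html
<link rel="canonical" href="http://www.example.com/some-page.html">
```

Bing and Google both treat it as a hint, not a directive, which is why the 301 redirect is the stronger fix.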
In the case of https I tend to check whether the visitor is a bot or person and prohibit https for bots - they never even know it's a possibility. This works because I only use SSL on "cart" type pages; the real meat of the site is always http. As far as I'm concerned a bot should not need to know about carts and forms such as local search and feedback, which I also block.
I am specifically having this issue with BING. I got this email from them: The crawlers don't follow htaccess files. They reference robot.txt files. The htaccess file is a server-level file to tell the server how to handle certain requests made of it - like 301 redirects, custom 404 pages, etc.
If you're seeing that message suggesting you have duplication issues, it's because our crawler thinks you have duplicate content.
Are those variables, when placed into a URL string in a web browser, returning a valid webpage with a 200 OK code?
If yes, then it's a dupe issue.
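A quick way to check is to request the parameterized URL and look only at the status line (domain and parameter here are placeholders for your own):

```shell
# -s silent, -I fetch headers only; look for "200 OK" vs. "301 Moved Permanently"
curl -sI "http://www.example.com/page.html?tracking=123" | head -n 1
```

If the parameterized, https, and non-www variants all answer 200 instead of redirecting, the crawler sees each one as a separate page.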
I have fine-tooth-combed Bing's WMT and I'm ### if I can find a place to tell them which version of the site name, with or without www, you want used in their index. And I don't suppose you can remove the "wrong" pages, because they'll only look at the path. (I can't test this myself because I can't find any duplicates.)
#1 I just did a search for a keyword constrained to my site, and every single hit was in the www. form. (This happens to be what I want, and what everyone gets 301-redirected to.) *
#2 Conversely, if I look at a site in Index Explorer, all URLs are given in without-www form. This is independent of what form I give in the box at the top of the list where you enter a site name.
#3 When the bingbot does its stuff, there are always a great many redirects. Logs don't say, of course, but if they're asking for a viable URL and get a 301, you have to assume it's a with-or-without-www issue. They even ask for robots.txt in the "wrong" place.
It all makes me wonder if it's deliberate on their part: if a page name exists in two forms, they'll index both, and then maybe hold it against you for not redirecting at your end?
* Psst! bingdude! Got a bit of a file-encoding problem on your Results screen, and it isn't from anything on my site ;)
whoever wrote that email is technically inaccurate.
|The crawlers don't follow htaccess files. |
the crawlers request a url which resolves to your server.
your server checks the requested hostname and redirects a non-canonical request to the canonical hostname.
the crawler gets a 301 status code and a Location: header with the canonical url in the response.
the crawler can either make a subsequent request for the canonical url or ignore the response.
there's no "follow" involved.
the crawler cannot possibly know where in your server configuration the redirect response was generated - .htaccess or otherwise.
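to make that concrete, a typical .htaccess rule set for this - assuming apache with mod_rewrite enabled, and example.com standing in for the real domain - looks something like:

```apache
RewriteEngine On
# redirect any non-www host, or any https request, to the canonical http://www.example.com/
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC,OR]
RewriteCond %{HTTPS} on
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

this is only a sketch - if parts of the site (cart pages, say) legitimately need https, you'd add a condition to exempt those paths from the https rule.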
|They reference robot.txt files. |
a robots.txt file can be used to exclude a crawler from requesting a url but it can't do anything about canonicalization.
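for example, all a robots.txt can say is "don't request these urls" - the paths here are made up:

```
User-agent: *
Disallow: /cart/
Disallow: /search/
```

there is no syntax in it for "treat www as canonical" or "ignore the https version"; that has to come from your server's redirects or the canonical tag.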