This 64 message thread spans 3 pages.
|Canonical Tag vs. Block in Robots.txt|
Hi there, Everyone:
The product pages on my ecommerce web site are (by default) available via multiple versions of the URL (namely, a long query string version, and a short version).
For years, I have simply blocked the long query string URLs via the robots.txt file (The long query string URLs have a "virtual" directory in the URL, so I just block that virtual directory).
But with "trust" being such an important issue after the Panda updates, I wonder if it might be better to unblock those URLs in robots.txt and just let the canonical tag take care of it.
In Webmaster Tools, under crawl diagnostics, it lists something like 700 URLs blocked by robots.txt, and if it is something that is being measured by Google, I can't help but think that they are somehow using that information for something.
It's called the canonical link element. 99.9999% of the time, when people talk about "tags" they are referring to the element, not the tag. A tag is the opening or closing text of an element. Ultimately, the only proper time to refer to a tag is when inquiring whether an element is self-closing, i.e. whether you should use the closing tag for the element or not.
|aakk9999: 301 would be a long job, i am not too much into coding and as you said there might be some other spelling or some other kind of mistakes and such URLs can pop out. |
If this is a scripted page (PHP or whatever) then the fix for the entire site might boil down to half a dozen lines of code.
This is especially true if there is a simple and consistent all-lower-case or Camel-Caps-With-Hyphen format supposed to be in use for all HTML page URLs.
BEFORE the point in the PHP script where the DOCTYPE is sent out to the browser, sniff the requested URL path part and store it in a variable (say $uriPathRequested) and then use your "URL formatting rules" to generate another internal variable (say $uriPathCanonical) with the capitalisation and hyphenation (and any other URL path problems) fixed. Next, compare the two variables. If the values are different, immediately send the server 301 redirect header and the redirected-to URL and quit.
If the value of the $Requested and $Canonical variables are the same, proceed to requesting content from the database. If there is no matching record in the database (because of a misspelling in the requested URL) then immediately send the HTTP 404 Not Found header and the HTML content for the 404 error page.
If content is found in the database, build the HTML page and send it.
This self-fixing system is incredibly easy to implement if your site has been built in a modular fashion and the designer has separated function from presentation.
It is even easier to implement if the canonical URL has an ID number at the beginning of it, such as
www.example.com/1428383-this-fantastic-product as then you can then deliberately post links like
example.com/1428383 to Twitter and other places knowing that your site will automatically redirect the request to the correct URL.
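A minimal sketch of this self-fixing scheme in Python. The product table, `slugify` rules, and return convention are hypothetical stand-ins for the site's own database and URL formatting rules:

```python
import re

# Hypothetical product table standing in for the database lookup.
PRODUCTS = {1428383: "This Fantastic Product"}

def slugify(title):
    """Assumed URL formatting rules: lower-case, punctuation
    stripped, words joined with hyphens."""
    return re.sub(r'[^a-z0-9]+', '-', title.lower()).strip('-')

def handle(path_requested):
    """Return (status, location_or_body) for a requested URL path."""
    m = re.match(r'/(\d+)(?:-(.*))?$', path_requested)
    if not m:
        return (404, None)                    # no leading ID: 404 page
    title = PRODUCTS.get(int(m.group(1)))
    if title is None:
        return (404, None)                    # no matching record: 404
    path_canonical = '/%s-%s' % (m.group(1), slugify(title))
    if path_requested != path_canonical:
        return (301, path_canonical)          # self-fixing 301 redirect
    return (200, '<html>...product page...</html>')
```

A bare request like `/1428383` redirects to `/1428383-this-fantastic-product`, and any mis-capitalised or truncated slug after the ID gets the same treatment.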
|In webmastertools, under crawl diagnostics, it lists something like 700 URLs blocked by Robots.txt, and if it is something that is being measured by google, I can't help but think that they are somehow using that information for something. |
FYI, I forgot to mention that until recently my WMT always said upwards of 35K blocked, and Google didn't seem to care.
I revamped the site somewhat, changed what could be crawled, switched some things to NOINDEX vs. robots.txt for various reasons, got it down to 8.5K blocked now ;)
|if it is something that is being measured by google, I can't help but think that they are somehow using that information for something. |
Google tracks all kinds of data that they don't actively use in the ranking formulas. Reading their collection of patents makes that pretty clear to me.
I think that message is a classic case of an FYI message: "Hey buddy, you're blocking this many URLs that we know about. Just in case it's a technical error, we thought you should know what we see."
Talk about dancing to Google's ever-changing guidelines.
As I see it, the whole point of robots.txt is to prevent bots from crawling content that doesn't need to be crawled or indexed.
If you have an ecommerce site with, say, a section on blue widgets with 400 blue widgets available and you have 10 pages of 40 items, I would list your first page with the 40 on it, nofollow/noindex pages 2-10, and exclude in the robots.txt the individual pages to the buying cart, as they are duplications.
Using robots.txt is the simplest way to do this and is what it's designed for. I can't believe for one minute that Google would give some kind of negative signal to a site that has thousands of blocked pages.
|For years, I have simply blocked the long query string URLs via the robots.txt file |
I would continue doing just that. If it were found that Google is for some strange reason using it as a negative signal, the net result would be webmasters not blocking pages and letting Googlebot crawl zillions of extra pages on the net that it doesn't need to. Not a bright idea, or likely, imo.
|the net result would be webmasters not blocking pages and letting google bot crawl zillions of extra pages. |
Googlebot crawls them now when listed in robots.txt. They don't index, but they crawl. I've never been fond of robots.txt because Google interprets the guidelines literally. I've seen sites show thousands, hundreds of thousands of URI only entries due to this crap.
Me, I just noindex those items that do not belong in the indexing pool. We serve it dynamically based on the request. Been doing it that way for years with no ill effects. And, it keeps documents OUT of the index. Unlike robots.txt entries which get a URI only listing and are available for all to see via the site: command.
I've seen robots.txt files give away information that I don't think the general public should have access to. There are too many prying eyes these days that are up to no good. I don't need no stinkin' robots.txt file to provide them with a map of everything I don't want indexed. And, I don't care that Google "crawls and indexes" noindex pages. It doesn't display them in the SERPs, ever, and that is the intended goal.
Also, WTF is the canonical element good for? What happens when another crawler grabs those documents and doesn't understand the canonical? There are still lots of those out there. Googlebot is not the only one you need to be concerned about. What's going to happen is there will be all sorts of URI configurations that are scraped, repurposed, etc. Now you have all those incoming redirects and/or fragmented signals to contend with. Not me, never used the canonical and probably never will - it's a hack.
I am not sure how the canonical issue of links has become so popular lately.
Do you people really have serious problems with your web applications? Typically the URL-generation part of a web application sets up the parameters in the same way:
You are saying that somehow you also end up generating
If so, you need to check the application and fix the problems.
One more thing about robots.txt files. Googlebot will discover URIs via this method. Yes, they will get in there and start crawling URIs that it can find via robots.txt. One look in your GWT and you can see what they are crawling. In some instances, you may discover technical issues you were not aware of. In others, you'll find the bot has crawled what you "thought" you told them not to crawl. Just keep in mind, we're discussing a crawl and not an indexing, they are different.
Do this, take an item from your robots.txt file that is capable of generating thousands of pages. Now, do a site: search for that specific path. What did you find? One URI only entry with a link to show omitted results? Okay, how many omitted results were there? Now tell me, why would you want those thousands of URI only entries available for someone to scrape? It's a road map for everything you don't want indexed.
enigma1: parameter order is just one possible way that multiple URLs get generated with the same content. Other issues that I've seen:
1) Extra tracking parameters for ads, affiliates, marketing, or sessions
2) Extra parameters that subtly change the page, such as adjusting the breadcrumbs based on where you have been on the site, turning off particular teasers, or adjusting the color of a product
3) URL capitalization: /hello.html and /Hello.html are not the same URL but often return the same page
4) www.example.com vs example.com
5) Merged data, such as two product IDs that are for the same product and handled via an internal forward rather than a redirect
6) Extra slashes in the URL: Apache has a habit of serving the same page for /hello.html and //////hello.html
7) Default documents: /index.html vs /
8) A beta or development version of the site that accidentally goes live on a different host or subdomain
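Several of the variants above can be collapsed with one normalisation pass before routing. A rough Python sketch, where the tracking-parameter list and the www preference are assumptions a real site would replace with its own rules:

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of tracking/session parameters to strip.
TRACKING_PARAMS = {'utm_source', 'utm_medium', 'utm_campaign',
                   'sessionid', 'affid'}

def normalize(url):
    """Collapse common duplicate-URL variants into one canonical form."""
    scheme, host, path, query, _ = urlsplit(url)
    host = host.lower()
    if not host.startswith('www.'):
        host = 'www.' + host                    # 4) pick one hostname
    path = re.sub(r'/{2,}', '/', path).lower()  # 3) case, 6) extra slashes
    if path.endswith('/index.html'):
        path = path[:-len('index.html')]        # 7) default documents
    params = [(k, v) for k, v in parse_qsl(query)
              if k not in TRACKING_PARAMS]      # 1) tracking params
    params.sort()                               # fixed parameter order
    return urlencode and urlunsplit((scheme, host, path,
                                     urlencode(params), ''))
```

If the normalised form differs from what was requested, the application would issue a 301 to it.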
@deadsea I still fail to see how this will cause problems, because you're supposed to have a single function that takes parameters and generates the final URL.
>> extra tracking parameters
You mean internal links that somehow need to be generated and point to the same page? If these are internal links, why not use cookies for this?
>>turning off particular teasers, or adjusting the color of a product.
Transparent JS or POST methods could be used instead.
>> Url capitalization.
Who generates that? The URL generator should always generate the same link.
>> www.example.com vs example.com
Application-dependent: you should always check $_SERVER['HTTP_HOST'], as someone may also enter using the domain's IP directly.
>> Merged data such as two product ids
Again, use JS or, in general, a non-URL mechanism.
>> Extra slashes in the url. Apache has a habit of serving the same page for /hello.html and //////hello.html
Application-dependent: the URL generator should never append extra slashes.
>> default documents: /index.html vs /
Application-dependent: somewhere you generate a link with the index.html instead of the plain /.
If you are certain the application does not generate abnormal links, then you're OK, because you don't care what others are injecting or whether external sites manipulate your site's links.
The only way to get dup content is if your domain somehow generates or recreates the duplicated links, or if it's prone to URL poisoning.
Parameter order can be affected by the order of elements in forms. Googlebot is now submitting and crawling simple forms. Even short of that, if users can get to the URL, Google may crawl it based on toolbar, analytics, or AdSense visibility.
Extra tracking parameters are often external, however sessionids in particular are often a problem on internal links. Some forum software is configured to use params when cookies are disabled and does so for googlebot.
One of my websites has mixed case canonical urls. A bunch of people on the web assume that all urls look better lower case. I've seen inbound links that lowercase entire urls. Some crawlers (not googlebot) try to crawl sites only with lower case urls.
For extra slashes, I've seen malformed links like href=".//page.html" that end up making a spider trap. Every time that link is clicked it gives back the same page with an extra slash in the url.
Once googlebot finds a non-canonical version of a url, it will continue to crawl it forever. Even if you fix the bug on your site that caused the link to become visible to start with. And unfortunately, there are a large number of possible ways to create non-canonical urls. Even savvy well meaning developers create them from time to time.
|Once googlebot finds a non-canonical version of a url, it will continue to crawl it forever |
Only if it finds it within your site. And that will point to the part of the application where it somehow gets generated.
There should be no problem with incoming urls even if they include sessions, trackers or any other parameters simply because you never regenerate these parameters and expose them with the HTML. That's my point.
If a user wants to rearrange the parameters it makes no difference because nothing is exposed from your site's pages that constitutes a duplicated link or dup content. You should always parse only the parameters your scripts are aware of and ignore the rest.
If parts of malformed links propagate somehow with the normal links the application generates, then again it's a problem with the application and you need to get to the root of the problem. The robots.txt and rels adjustments won't fix it.
A duplicate problem may exist because the domain itself creates it.
External factors in this case won't change it. If you make a mistake, say you now want to change one page's name to another, issue a 301 on requests for that particular page. It will rectify the problem after a bit, because again the old link is no longer present on your site.
|Googlebot crawls them now when listed in robots.txt. They don't index, but they crawl. I've never been fond of robots.txt because Google interprets the guidelines literally. I've seen sites show thousands, hundreds of thousands of URI only entries due to this crap. |
I see no evidence that Google crawls these URLs (i.e. fetches these from the server). They simply add the URL to their database and list it as a URL-only entry in the SERPs.
|If you are certain the application does not generate abnormal links then you're ok because you don't care what others are injecting or if external sites manipulate your site's links. The only way to get dupe content is if your domain somehow generates or recreates the duplicated links or if it's prone to URL poisoning. |
You have made this bold statement in several recent threads. It isn't true. If Google requests a URL and it returns "200 OK" then it is fair game for indexing.
I'd use a 301 or my second choice would be a canonical link element in the head + noindex.
See this thread if you think you have your robots.txt correct and GoogleBot is crawling the URL(s) ... There could be an 'interpretation issue' [webmasterworld.com...]
Important note from vanessafox:
|Googlebot follows the line directed at it, rather than the line directed at everyone. |
It makes sense to me from the examples given in the thread.
ADDED: noindex to 'what I would do'.
Robots.txt would be something I used as a 'last resort' ... You lose any and all link juice in all search engines with it, but with a 301 or canonical you retain it in some way, whether all engines interpret and use the canonical link info or not.
|You have made this bold statement in several recent threads. It isn't true. If Google requests a URL and it returns "200 OK" then it is fair game for indexing. |
I totally disagree. If that were the case, anybody could artificially generate duplicate content for any site he wanted, and there would be absolutely nothing you could do to filter it. Where are you going to apply a redirect? To what parameter?
I 301 redirect to remove *every* unknown or un-needed parameter. Each page template has a list of parameters they accept. Each parameter is either required to display the page correctly, optional to display a section of the page (product color), known but only for tracking, or unknown. We generally allow only the required parameters to remain on the url. Others we stash into cookies (if needed) and redirect to remove.
It shouldn't be possible to link into our site with non-canonical urls that don't 404 or redirect appropriately.
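That per-template whitelist might be sketched like this in Python. The template table, parameter names, and the (status, url, cookies) return convention are hypothetical illustrations of the approach, not the poster's actual code:

```python
from urllib.parse import parse_qsl, urlencode

# Hypothetical per-template parameter policy.
TEMPLATE_PARAMS = {
    '/products/detail.cgi': {
        'required': ['productid'],
        'cookie':   ['color', 'tracking'],  # stashed, then removed
    },
}

def canonicalize(path, querystring):
    """Return (status, redirect_url, cookies_to_set)."""
    policy = TEMPLATE_PARAMS.get(path)
    if policy is None:
        return ('404', None, {})
    params = dict(parse_qsl(querystring))
    if any(k not in params for k in policy['required']):
        return ('404', None, {})            # required param missing
    # Stash optional/tracking params before stripping them from the URL.
    cookies = {k: params[k] for k in policy['cookie'] if k in params}
    keep = urlencode([(k, params[k]) for k in policy['required']])
    if querystring != keep:
        return ('301', path + '?' + keep, cookies)
    return ('200', None, {})
```

Any unknown or un-needed parameter, in any order, collapses to a single canonical query string via one 301.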
|Canonical Tag suits Dynamic website very much, especially e-commerce websites. Robots.txt suits static websites, especially company websites. |
|I 301 redirect to remove *every* unknown or un-needed parameter. |
Ok how are you going to parse a url like this?
With PHP, for example, they won't even show up in the global $_GET array. Now you're going to start parsing the request for an infinite number of combinations. It can get so complicated that your scripts can introduce errors and significant latency. You can inject arrays as parameter names, various empty fields, and so forth; just use your imagination.
This is what JohnMu posted a year ago:
|As long as you remove the problematic content on your site, you can ignore the external, spammy links to that content. Our algorithms understand that you can't control everything outside of your site, but we do expect that you could do that within your site |
This is what they posted about the parameter handling feature in the WMT that was introduced long ago.
|Google lists the parameters they’ve found in the URLs on your site |
"in the URLs on your site" - not external URLs, but "on your site". Why would they bother to do all this if external URL poisoning weren't handled by the search engine?
The query string is usually available as a variable on its own. The logic is usually something like this:
myurl = expectedpath + expectedscript + '?' + expectedparam + '=' + expectedvalue;
if myurl != request.path + request.script + '?' + request.querystring: request.redirect(myurl, 301);
From what you posted I don't understand, how you can setup the expected parameters or predict them if you don't process the request first against a db or something.
Here is pseudo code for a hypothetical product detail page that has exactly one required parameter (productid) but may accept a color param and a tracking param. The canonical url will only ever have the productid parameter:
# Required product id param
productid = request.getParam("productid")
product = db.lookup(productid)
if not product: return response.404("product not found")
# Optional product color param, stash it in a cookie
# so that it is saved after a canonicalization redirect
color = request.getParam("color")
if color: response.setCookie("color", color)
# Tracking params, stash them in cookies
tracking = request.getParam("tracking")
if tracking: response.setCookie("tracking", tracking)
# The expected canonical url
expectedurl = '/products/detail.cgi?productid=' + urlencode(productid)
# The url the server actually got
requesturl = request.getPath() + request.getScript()
if (request.getQueryString()): requesturl += '?' + request.getQueryString()
# issue a 301 redirect for url canonicalization if needed
if (requesturl != expectedurl): response.redirect301(expectedurl)
# Now display the product page.....
Yes, I understand: set up the parameters, then validate them against the query. So say the page uses two parameters in its URLs: if you ever change, say, the parameter order, or add/remove a parameter in the future, your site disappears from the SEs, right?
As for the cookie headers, it's bad to send cookies without validating the parameters, as you don't know where the request is coming from.
You are absolutely right about cookie validation. I omitted it for brevity.
If you use multiple parameters, you have to specify a canonical order. The redirect issued for the actual url not matching the expected url would take care of that canonicalization.
If you are adding or removing parameters, you are by definition changing your urls. That can be risky and you need to think about it carefully. It can be done with proper 301s and such.
I tend not to use parameters much anyway. I prefer short friendly urls. When choosing the url for your product page you have lots of choices:
I'd prefer the last one, but then you have extra headaches such as supporting product name changes. Even in the last case you probably want to canonicalize
I prefer short URLs sans parameters, with some sort of record number at the start of the path.
It is then easy to redirect all of these to the right place:
even the bare URL like this:
that someone posted to twitter.
There is no argument about having short and meaningful URLs. Assume also that the web application always generates the right URLs, without parameter-order problems.
The difference of opinion is because if I get requests
in both cases I will return 200 OK, while you would return a 301 redirect in one case. That's the difference, although the web engine will always generate one version of the URLs.
Yes, and the URLs I prefer to generate are 100% content-related and without identifiers.
I would use "without identifier" too, but you then have to have additional logic in the CMS to ensure you never repeat a page title, and the error handling is just that little bit more complex.
Additionally, is a request for
example.com/this-coo a typo for
example.com/this-cool-gadget? Without the record number, we'll never know.
Finally, if the page title were to be changed, the old title would need to be stored as a database record with redirect information enclosed. If it were to be changed again, the oldest record would need amending as well as creating another redirect entry.
For a system with record numbers in the URL, the title can be changed at any time, and there is no need to keep a record of old page titles because any request other than
www.example.com/23456-exact-text-match is redirected to the canonical URL.
You must be referring to the application you use. In my case, page titles and links are kept separately. What you have as a page title doesn't necessarily represent the URL 100%. By default the link may be generated from the title and have the exact keywords in it, but you can edit the link in case you need to modify it. Or, if you change the title, the link will stay as it was. In other words, they're independent.
I use an approximation-redirect mechanism, so if there is ever a request for example.com/this-coo, the code will pick up the closest or most popular link, or whatever conditions I put in it, to find a similar link and then do a redirect. This has pretty much the same effect as having an identifier.
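An approximation redirect like that can be sketched with Python's difflib. The path list and the 0.6 cutoff are purely illustrative; a production version would weight candidates by popularity or other conditions as described:

```python
import difflib

# Hypothetical table of the site's known canonical paths.
KNOWN_PATHS = [
    '/this-cool-gadget',
    '/this-other-widget',
    '/blue-widget-deluxe',
]

def approximate_redirect(requested_path):
    """Return a 301 target for a near-miss URL, or None (404/home page)."""
    matches = difflib.get_close_matches(requested_path, KNOWN_PATHS,
                                        n=1, cutoff=0.6)
    return matches[0] if matches else None
```

A truncated request like /this-coo maps back to /this-cool-gadget without any record number in the URL.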
The problem with identifiers is that they may detract from the URL: when you start having multiple IDs for products, categories, brands, articles, topics, etc. and you want to combine them into one link, you need to have prefixes in place, and the IDs can be longer than the actual text. And there are database dependencies.
If, say, you delete a product, the identifier won't be there, so you need to hard-code some other URL if you want a redirect to a specific page. In general, because of the identifier dependency, database management may carry quite some overhead compared to using plain links.
By "title", I was loosely referring to the part of the URL after the ID number. In many cases this will be a filtered version of the page title from the title element (filtered to remove punctuation, add hyphens between words, and make all lower case), but there's also a place in the admin section to alter that text to be anything you want it to be.
Products that are "deleted" aren't actually deleted immediately. The pages are first marked as "out of stock" or "discontinued" with links to newer products prominently added. After some time, usually a few months, the status is changed so they are served with 404 status instead of 200. So, people can still revisit them, but search engines are encouraged to drop them from their index. At some point the pages may be completely dropped from the database but often aren't.
Yes filtering of title -> url is implied along with editing urls and meta tags from the admin.
But one other issue to keep in mind with DB IDs in the URLs is exposure. I prefer not to expose the database IDs, at least in the URLs generated for search engines, to improve security.
What's the reason for your preference for IDs over totally static URLs?
It may look simpler in terms of the URL decoder to have the ID in the URL, but if you think about it, it is pretty much the same as with totally static links, as you use some sort of signature. The only problem I faced was with configuration in some cases where combinations of categories, products, and manufacturers were using short names, e.g.:
then if I don't configure the URLs by combination of brand/category, the URL generator will create cars and cars2 to distinguish them, while with IDs it won't be a problem, but the URL ends up longer.
I don't use category in the URL for pages. A page might be linked from multiple categories, but they will all link to the exact same categoryless URL.
The product page links out to 3 or 4 cross-sell and up-sell items. It also links back to the categories that the page is listed in, by way of "Find more [gadgets] [widgets] [doodads]" links.
The product ID is the one canonical item that is unchanging, even when the title or category changes. However, with your security concerns, I think perhaps there's an easy way to hash that ID and make the actual db record number different, but related to, the publicly identifiable number. Food for thought. Is it any more secure to use "Red Widget FK252" as the database key though?
One of the first things the script does after receiving a request is to look to see if the ID, perhaps 12345, is a valid ID. If not, the 404 page is sent. If the ID is valid, the title is pulled from the db and compared to the title text found within the URL. If they don't match, then a 301 redirect is issued. If they do match, then the content is pulled from the db and served. This content includes the meta data and the on-page content such as links to images, prices, page text, as well as links to related categories and cross-sell and up-sell items and so on.
For the initial request I will always do a redirect on a mismatch; it's just a question of where to redirect, and that depends on the query. I think a 404 passes no juice in any case, and for retired items I would prefer the visitor to go to a similar one if possible. If it's not possible, redirect to the home page.
Linking products into multiple categories, or articles into topics, etc., I found harder to manage in various cases. That is, if the requirement is to expose both categories and products, or brands and products, in the product URL, I would have to pick the first brand or category and prefix the URL one way or another to get around duplicate generation. Using IDs, the challenge is going to be the same.
And yes, it is way simpler to generate the URLs for a single entity vs. combining parameters.
Adding a hash would mean another encoder/decoder; although tiny, I try to simplify the code as much as possible. You also need a prefix for the IDs in order to differentiate the entity type.
Back to the OP's post a bit: I rarely used robots.txt to filter URLs, because it doesn't work well. Google will still access them, since it won't read the robots.txt before every single request. If for some reason you need to modify the URL structure or add/remove parameters, make sure the old structure is not regenerated in some way.
I think a lot of the canonical problems happen because of this. In some cases hard-coded links are forgotten, hidden within content, and the store owner tries to block the old URL, but Google still sees it. There is no detailed information about it in WMT - in other words, all the steps the search engine followed to get to the duplicate page. That would be really helpful.
Years ago, when shared hosting was popular, I remember seeing cases where URLs would be inserted with the content and would include the session ID. Needless to say what the consequences are. I am sure a similar thing happens today with tracking IDs, referrers, etc., and just one mistake in the content can cause a great number of duplicated pages and security issues.