|Multilingual site based on different file (web page) extensions?|
Sorry if this issue is stale, but I really haven't found an answer to my question.
Suppose you want to create a multilingual site. No more than 3 languages would ever be available.
Is it stupid or not to assign pages to languages based on their file extensions? Suppose all .html pages would be in Russian, .htm in Ukrainian, and .shtml in English. What if I submit the site in Russia using the URL http://example.com/index.html, and in the UK using http://example.com/index.shtml, etc.?
Would Uncle Google be unhappy? Why?
[edited by: tedster at 2:36 pm (utc) on Mar 28, 2010]
|Is it stupid or not to assign pages to languages based on their file extensions? |
Makes no sense to me at all. File extensions relate to how your web server handles your pages... not to the nationality of the language on the pages. I can see you creating unnecessary server complications by using extensions to do something they weren't intended to do.
You're much better off using subdirectories or subdomains to sort out your files... or even some kind of indicator in your filenames, but not the file extensions.
Eg, if you used subdirectories...
http://example.com/ru/ would be the root page for your Russian language pages.
You don't want index.html, index.shtml, etc to appear in your filepaths at all. http://example.com/index.html will be treated as a duplicate of http://example.com/
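To make the duplicate issue concrete, here is a minimal sketch of how a site might normalize its own links so the index filename never appears. The list of index filenames and the function name `canonical_url` are assumptions for illustration, not something any search engine or server defines.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical set of default-document names; adjust to match the server's setup.
INDEX_FILE = re.compile(r"(?<=/)index\.(?:html?|shtml|php)$", re.IGNORECASE)

def canonical_url(url: str) -> str:
    """Drop a trailing index.* filename so the directory URL is the only form used."""
    parts = urlsplit(url)
    # Remove the index filename only when it is the whole last path segment.
    path = INDEX_FILE.sub("", parts.path) or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

print(canonical_url("http://example.com/index.html"))
print(canonical_url("http://example.com/ru/index.shtml"))
```

On a live site the same mapping would normally be enforced with a server-side redirect as well, so that only one of the two URLs ever returns content.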
Take a look at the Hot Topics [webmasterworld.com] section, pinned to the top of the Google Search forum home page, and look at the Canonical Issues discussions under Duplicate Content... and in fact read through all of the discussions.
Start by looking at...
Canonical URL Issues - including some new ones [webmasterworld.com]
Domain Root vs. index.html [webmasterworld.com] - yet another kind of duplicate
Thank you, Robert
I have to agree with Robert on the three extensions suggested, but to throw a 'wrench in the works' or a 'unique solution in the mix' that will probably get me some 'huh?' reactions... If you were good with Mod_Rewrite* you could set the extension as the country code...
IMO you would definitely have to use document relationship markup in the pages and links, but it would be a 'cool', unique way to structure the language sections of a site, because all of your paths and pages could be exactly the same except for the extension. There is an on-page way to tell browsers and SEs what language the page is in, so they know what's going on, and people would easily 'get it', so I wouldn't think there's too much concern there...
You could then use a single index.htm page for the home page and overview of the site and have a link to the /home.ru and /home.en etc. making them the 'home for each specific language' and you would not need to worry about duplicate content any more than on any other site.
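As a rough sketch of the idea, this is how a dynamic site might pull the language out of the extension before dispatching the request. The set of codes (ru, uk, en) and the function name `split_language` are assumptions chosen for illustration; this is not the mod_rewrite setup itself.

```python
import posixpath

# Hypothetical set of language codes used as URL extensions.
SUPPORTED_LANGS = {"ru", "uk", "en"}

def split_language(path: str):
    """Split '/about/team.ru' into ('/about/team', 'ru').

    Returns (path, None) when the extension is not a known language code.
    """
    base, ext = posixpath.splitext(path)
    lang = ext.lstrip(".").lower()
    if lang in SUPPORTED_LANGS:
        return base, lang
    return path, None

print(split_language("/home.ru"))
print(split_language("/index.htm"))
```

Every language version then shares one base path, which is what lets the paths stay identical apart from the extension.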
I guess to me it was the wrong extensions to ask about doing it with, but IMO not an 'all bad' idea in any way. Honestly, I actually kind of like the idea the more I think about it, so thanks for sharing... I'd have to see it in a URL for a bit before I would use it, but IMO it can't look any worse than /en-us/ looks at the beginning of the path for every page on a site.
Sorry for getting long-winded.
IMO: Not stupid at all if you use the country code for the extensions.
* In fact, if the site were dynamic (e.g. PHP) and you were on an Apache box, you could simply have files with the country-code extension parsed as PHP (or whatever dynamic language you chose) and not even have to use mod_rewrite... Hmmm... I keep adding to this, because you could actually use the country code as the page extension fairly easily...
From a management perspective, which is easier and more reliable? To make sure you (and anyone else ever working on the site) don't ever accidentally upload page.html from inside the /en-us/ folder to the /ru/ folder (or sub-domain) on the site, or to make sure you don't save a page written in English with a .ru extension? The pages for all 3 languages could go in the same directory and it would be very easy to identify which page went with which language at any level of the site...
You wouldn't ever have one of those copy and paste issues where you accidentally got page.html (the Russian version) and thought you had page.html (the English version) and accidentally saved or uploaded one and accidentally deleted the other from the site, because each would be called the same thing with a different extension... I'm thinking this idea is absolutely NOT STUPID personally.
Let me be straight. Don't do this. This is not what extensions are for, and Google probably won't understand your intentions properly. Here are some possibilities for you to explore:
- Use different TLDs (e.g. widgets.ru , widgets.com etc.)
- Use subdomains (e.g. ru.widgets.net, en.widgets.net etc.)
- Use subdirectories (e.g. widgets.net/ru/, widgets.net/en/ etc.).
An interesting fact about the last approach is that you can add subdirectories to Webmaster Tools separately. Hence, you can geotarget them independently. While subdomains remain the preferred course of action if you lack the resources to grab the domains, remain aware that this relatively simple option can work out quite well.
Also, don't forget the Content-Language header, which you can set using meta http-equiv. Even though you might be able to use this header to geotarget based on extensions, there is no way to differentiate the pages in Webmaster Tools, which provides a more robust option for geotargeting.
You can actually use way more than a meta http-equiv to communicate the purpose to search engines, and while you may want to geo target in WMT, you may also want to have a 'worldwide language based target' rather than limiting your target for a language to a country...
Links and Search Engines [w3.org]
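For anyone wondering what that markup could look like, here is a minimal sketch that generates it, assuming the linked W3C page refers to HTML 4 style `link rel="alternate"` elements with `hreflang`. The function name `head_markup` and the extension-per-language URL pattern are assumptions for illustration only.

```python
from html import escape

def head_markup(page: str, current_lang: str, all_langs) -> str:
    """Build <head> lines declaring this page's language and its translations.

    `page` is the shared URL without extension, e.g. 'http://example.com/about'.
    """
    lines = ['<meta http-equiv="Content-Language" content="%s">' % escape(current_lang)]
    for lang in all_langs:
        if lang == current_lang:
            continue  # a page does not list itself as an alternate
        href = "%s.%s" % (page, lang)
        lines.append(
            '<link rel="alternate" hreflang="%s" href="%s">' % (escape(lang), escape(href))
        )
    return "\n".join(lines)

print(head_markup("http://example.com/about", "ru", ["ru", "uk", "en"]))
```

The same generator would work for subdirectories or subdomains by changing how `href` is built; the markup does not depend on which URL scheme the site uses.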
I actually don't see any issue with the first approach other than being confusing to manage and to visitors... If you can use all three extensions on your server, then it's not going to cause your server any issues for you to use them. It's no different than using js, php and html (or asp, php, htm, js or php, cfm, js, css, htm, inc, ico, gif, jpg, jpeg, fla, swf, txt, png) extensions on the same server, and with the number of sites and software installations using page.php.something.else or page.cfm.stuff-here?otherstuff I can't see how 3 different extensions would cause any more issue for a server.
As far as duplicate content with the index.ext goes, you run into the same duplicate content issue with a single extension if both the root domain and the index.ext are accessible, so you have to redirect one to the other anyway, and it's easily overcome, but you don't have 'triplicate content' because each index.ext is treated as a different location. index.htm is not the same page as index.html and if they have different content (even language interpretations) they are different pages and not duplicates.
There are also sites redirecting to http://www.example.com/index.ext rather than serving the content at http://www.example.com/ and I've seen one in a very competitive niche doing very well redirecting the root domain to /page-name, so you DO NOT need to redirect from /page-name to the root domain for ranking (or any other) purposes, you just cannot have the same content available at both locations. (It bothers me when people redirect the root domain to a page name, but it's not an issue with search engines at all.)
Robert Charleton even said you could designate the language in the page name. A different extension is a different URL, so it's essentially the same thing: each different extension is treated as an independent location, and whether the location changes in the part before the . or after the . is really irrelevant. Any difference in location is a different location. You could actually have three different languages on your site by different capitalization: Page.html, page.html, pagE.html. Those are 3 different URLs and would be treated as 3 unique locations by search engines. They would be very confusing to visitors and to manage, but you could still communicate the same point, with the exception of not being able to designate a geotarget in WMT.
If you can designate a language difference between two pages ending in html because they're in a different directory, then you can certainly designate a language difference for two pages that don't have the same extension and have search engines figure it out...
If you want a conservative approach, or geo targeting in WMT (personally I wouldn't use it), use one of the other ones suggested, but if you want something different and don't need geo targeting in WMT there are plenty of ways to communicate your point and 'silo' your link structure and navigation so search engines understand what you're doing.
I'm sure there will probably be more people saying you can't use extensions to do what you want, and we'll have to just agree to disagree there, because I think you probably can and actually like the idea myself. But I do go by TheMadScientist here, so I'm probably a bit nutty or something and it might be better for you to 'follow the crowd' than to do anything different or new you thought of by yourself...
Thank you all, guys. I know about the third-level domain and subdirectory approaches, of course, but just wanted to explore other possibilities. This question is one of my oldest "problems" ;) and I asked it a few years ago, but (as far as I remember) only one guy replied, and I was unsure till now. Yeah, if a client does not want to buy more than 1 domain name, a third-level domain is the best choice imho...
I was always so curious about all this that I finally created a working (not submitted to SEs yet) "extension-based solution". The ru (.html) version has been live since February, and the en and uk parts aren't ready yet (the owner doesn't have much time to work on content now). The site is implemented in LAMP (+CodeIgniter +jQuery stuff). If anybody is interested and wants to see how it works, I can sticky a link to my laboratory subdomain where the "full" 3-lang ext-based version is located (the en and uk contents are dummies).
And Robert's KILLED me with his "You don't want index.html, index.shtml, etc to appear in your filepaths at all. http://example.com/index.html will be treated as a duplicate of http://example.com/" -- the most serious argument not to do this (or to use home.lang pages like TheMadScientist advised) Think first... lol
Thank you all again :) Will use subdomains-based solution. But if anybody wants to discuss this subject here, you are welcome, of course.
P.S. Special thanks to TheMadScientist... Seems you've inspired me to resume the half-read but forgotten "Apache mod_rewrite" by Rich Bowen ;) nice book btw... guess the best manual on the subject I've ever seen
|If you were good with Mod_Rewrite* you could set the extension as the country code. |
That's quite creative - and yes, you could. However, I'd say the idea falls under my personal motto "Not everything that can be coded should be coded."
In this case, I certainly don't have any concrete experience to apply. But here's what I do see. Google has evolved a VERY complex set of back end structures. Those structures have embedded assumptions, often based on "standard practices" that they see in their huge pile of crawling data.
Within this back end complexity, Google seems to trip up even themselves on a regular basis. It's all too easy in a large project for one team to code something that breaks another team's assumptions.
Such pitfalls happen in complex team programming of any type, and beta testing usually uncovers the worst of it. However, Google does their beta tests live. We see bizarre changes in search results and we call them "bugs". Eventually they get worked out - and when they do it's because more than one edge case surfaces. The affected group gets discovered statistically, and Google pulls apart their logic and assumptions to patch up the logic.
Now what happens if your site is using a one-of-a-kind solution? It can either run into a Google bug down the line somewhere, or it might even smack immediately into some hidden assumption in the current algo logic.
Sorry, your site is one-of-a-kind and you've got no company. You're out there as one single exception out of billions and billions of pages. It's not very likely you could ever come back from that kind of ranking problem. Statistically, you're just invisible!
And that's why I avoid clever, one-of-a-kind solutions whenever there's a more standard equivalent hanging around.
I understand your point tedster, but, for the sake of discussion, personally I would not rely on the extensions to provide any SE relevant information, but I would rather rely on them to provide a 'clear difference' in the URLs for both bots and people, which they would.
Then I would rely on document and link relationship markup to make a designation for the search engines, which I would guess is one of the 'more widely accepted' and 'more widely used' ways to point out what a specific page is about and how each page relates to the rest of the pages. (Maybe it would even be 'a more inclusive' standard than something 'Google specific', like a setting in WMT?)
Personally I think those who rely on a subdomain or directory actually leave quite a bit on the table, because they think, 'How do I do this for Google?', rather than thinking, 'Hmmmmmm... If they mis-interpret this or not all SEs accommodate the same things Google does in the same ways they do, what can I do more definitively?' which would probably lead to finding the w3c standard for defining language and link relationships within documents and sites, so to me 'What's Google going to do if I don't use a subdomain or directory specifically for each language?' is not the question I would ask...
I would rather wonder how a site with a .com TLD and .html extensions, but no language-specific subdomain or directory, could possibly rank in Russian if there were no other way to define the text on the page as Russian. If Google can interpret a .html page on such a site as Russian and not English, I could reasonably assume there must be a way for them to do so without being told via subdomain or directory. That leads me to believe that if I follow the W3C standard for language and link markup, I would probably be building a more inclusive site, rather than a site 'built for Google and their webmaster programs.'
So, unlike those who have some dependence on a subdomain or directory designation to be interpreted correctly by Google and other search engines (or any dependence on a setting in WMT which is completely Google specific), I would not depend on either and would also not depend on the extensions, but would rather depend on following the w3c standards for document language and link markup...
From there, inbound links would probably also help tell the story, but I do see your point for those who don't want to do the work of making sure they follow certain standards and adding all that stinking markup to their pages because it's so much easier to just create a subdomain or directory and set it in WMT for Google. (I'm not in any way trying to say you don't follow the standards, because I know from your posts you're actually more 'old school standards' (and conservative) than me ;) but I do think many people who are not you don't bother to read the standards (or apply them) because they build sites for Google and rely on 'G Tools' or 'G Info' more 'exclusively' than they should...)
In reading through this I did notice a late-night typing / communication issue. Where I wrote:
page.php.something.else or page.cfm.stuff-here?otherstuff
I meant:
something.else.page.php or page.cfm?stuff-here.otherstuff