This post is NOT supposed to be talking ‘at’ you; rather, its aim is to start a discussion (who says I got it right, and what is the “right way” anyway :) )
To clarify any potential confusion when talking about URLs, URIs, URNs, etc., especially for our newer webmasters, take a look at these definitions and clarifications from W3C:
URIs, URLs, and URNs: Clarifications and Recommendations 1.0 [w3.org]
Axioms of Web Architecture [w3.org]
URI Model Consequences [w3.org]
Why do we care about filepaths and filenames – their structure and what keywords are in them?
First off, getting filepaths and file names to work well for you, your visitors, and bots is not the ‘be all and end all’; it is just one part of the puzzle.
In a nutshell, both human visitors and bots will ‘see’ your filepaths and file names and make some kind of decision about your site and/or page based on them. For humans it comes down to usability, and for bots it provides algorithmic guidance. The reasons and methods for getting this right for both humans and bots are intertwined.
Stepping back for a moment: for the most part, and for different reasons, we all want a lot of first-time and repeat visitors to our site(s). A good chunk of first-time visitors comes to us from search engines (SEs), and those visitors will usually only go through a couple of pages of SERPs (search engine results pages), so we want to rank well, the higher the better :). But it’s not enough just to rank well; we also want searchers to click on our listing among all the choices. The file names we choose can entice searchers to do so. Did you notice that search keywords are bolded in the SERPs? (It only means that SEs are being helpful to searchers by emphasizing the search terms entered, but we can take advantage of that.) To illustrate, say you searched for red widgets, and the SERPs came back with results containing URLs such as:
http://example.com/page-34.html?ID=1234567&c=rw
http://example.com/red-widgets.html
Which one is more likely to draw your eye and, hopefully, get you to click on it? Just by looking at the file name of the second URL you could get an idea of what the page is about. It gives the searcher a reinforcing signal that they might find what they are looking for on that page. Again, this is only part of the equation; other elements play a role as well (title, snippet, etc.), but we are just talking about files now.
Perhaps by similar logic, SEs’ algos come to the same conclusion: people usually name things for what they are, so SEs might (and I think they do) take this into consideration. There were reports that an established site was able to rank for a keyword found only in the file path (no instances of it in the page content). See Keywords in url - still useless right? [webmasterworld.com] in the Supporters Forum. Don’t conclude anything just by looking at the title of that thread.
With human visitors in mind, you want to choose file names that are easily comprehensible and memorable. Short of using a bookmark, most repeat visitors will come back either by typing your domain name into the browser’s address bar or by using an SE with your domain name as a search term, sometimes coupled with other search keywords. When they land on any page of your site, you should provide them with an intuitive, easy, and well-structured way to navigate it. You don’t want to frustrate your visitors, but rather make it very easy to find what they are looking for, making them feel like they are “masters of the internet and know what they are doing”. Intuitive navigation is a beautiful thing when done right, but it’s hard work to get there.
And this touches on another important topic: site architecture. In part, file paths are expressions of your site architecture. Different approaches work for different sites, but the most common ones are flat (where everything is under the root) and vertical. Theme Pyramid [webmasterworld.com] is a good example of vertical (you could also have an inverted pyramid, etc.). So with file paths, or more to the point with subfolders/directories, you can structure and organize a site in a meaningful way, and hence you should name them appropriately.
However, this doesn’t mean that a page three directory levels down will be buried or unseen, especially by SEs. As part of the site architecture, you could (and should) also have a link structure as a parallel and complementary method to the organizational (directory/subfolder) structure. This means that a page residing in the third directory from the root can be only one click away from the root if you so choose (but this is a huge topic just by itself).
Another thing to decide when naming files and directories is capitalization, such as for example
Generally you should be consistent with your choice across the site. The effect of this on SEs’ algorithms is unknown to me, although it wouldn’t surprise me if it is taken into consideration (at times).
Another interesting, and sometimes contested, topic is what kind of word separators should be used in a file path. The choice of separators, as with most things, should be approached from both the human and the bot perspective. “-” vs. “_” gets the most attention (so let’s not turn this discussion into another “dashes vs. underscores” thread); other common separators are “&” and the space (you see it as %20 in a URL). I am a firm believer in not using the space as a separator in a URL, mainly from a usability standpoint, but I am not sure of its effect on ranking since I haven’t experimented with it. Although most of my examples depict static pages, the same applies to dynamic pages, where the most common field separator is “&”. As a side note, it’s generally agreed that an ID field is useless for the topic of our discussion, and might actually hinder your rankings depending on where and how it’s implemented in the filepath.
I am inclined to believe that the number of separators in a file name does not trigger a filter (or “penalty”) just in itself. That is, I think that a file name such as /my-house-on-the-beach-during-renovation.html would not trigger any adverse actions from SEs’ algos just due to its six separators (remember, we are talking about files and filepaths, not domain names).
And that leads into the question: what would cause “some dial on an algo filter” to go down?
(Excessive) keyword stuffing in the body content, title, etc. has received a lot of attention as something that might (and does) adversely affect ranking. The same could be said for ‘spamming’ filepaths.
That doesn’t mean that you should use only one occurrence of the keyword in the file path (you do want to provide logical emphasis); however, going overboard will generally do you more harm than good. Where that line is, for your particular site, is the ‘money’ question.
There are many ways to organize your site, but just as an example of a logical and structured way to do so, take a look at the service manual for your car (or appliance, etc.). It will usually start off with a general section and then be broken down into main areas, which in turn are broken down further (just like the pyramid structure mentioned earlier). If you follow that logic and naming convention (main area => folder name, etc.), you are setting yourself up for a good start.
Although I used “.html” in some of the examples above, I did so just to make it easier to illustrate the point. If you can, try to make your file names ‘extensionless’. At some point you might want to change the technology you use to serve your pages, and having a file extension might complicate matters greatly and potentially hurt your rankings. A link to your /MySuperPage.html will not work if you change it to /MySuperPage.asp. There are ways to address that, but with a little forethought and a simple server-side rewrite you can have your pages served without an extension, as in /MySuperPage.
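To make the extensionless idea a bit more concrete, here is a minimal sketch in Python of the lookup a server-side rewrite performs. It is only an illustration, not any particular server's implementation: the function name, the extension list, and the set standing in for the files on disk are all assumptions.

```python
from typing import Optional

# Hypothetical list of extensions to try, in priority order.
CANDIDATE_EXTENSIONS = (".html", ".php", ".asp")


def resolve_extensionless(url_path: str, files_on_disk: set) -> Optional[str]:
    """Map a public, extensionless URL path to the file that serves it."""
    # A path that names an existing file directly is served as-is.
    if url_path in files_on_disk:
        return url_path
    # Otherwise try each known extension until one matches.
    for ext in CANDIDATE_EXTENSIONS:
        candidate = url_path + ext
        if candidate in files_on_disk:
            return candidate
    # Nothing matched: the caller would answer with a 404.
    return None


# The public URL stays /MySuperPage whichever technology is behind it.
print(resolve_extensionless("/MySuperPage", {"/MySuperPage.html"}))
print(resolve_extensionless("/MySuperPage", {"/MySuperPage.asp"}))
```

The point of the design is that links to /MySuperPage keep working when the file behind them changes from .html to .asp.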
Take a look at Cool URIs don't change [w3.org]
What do you take into consideration when naming files in URL?
references and additional readings
Is it time to kill the dashes / hyphens in my domain name? [webmasterworld.com]
Suggestion on Hierarchy for Spidering [webmasterworld.com]
101 Signals of Quality : Keywords in File Path [webmasterworld.com]
100 variables [webmasterworld.com]
Is this not SE friendly? [webmasterworld.com]
Do Subdomains Help with SEO? [webmasterworld.com]
Treatment of a Subdomain Compared to a Domain [webmasterworld.com]
1. lowercase only in file names and directory names
2. use one or two keywords in the file or directory name as a general rule, with a maximum of three
3. separate those keywords with a dash (or possibly a period/dot if you can convince me there's no other way) and nothing else -- that includes no spaces
4. if a site's technology requires query strings, then rewrite those query string urls and disallow all urls with a "?" in robots.txt
5. verify that the rewritten urls are unique, and not just keying off a record number: /581/keyword/ and /581/any-old-garbage/ should not both resolve to the same "page"
6. keep tracking parameters of any kind out of the url
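Rules 1 to 3 are easy to enforce in code if file names are generated rather than typed by hand. The following is a rough Python sketch under my own assumptions: the function name and the choice to raise an error on a fourth keyword are illustrative, and non-ASCII letters are simply dropped.

```python
import re

MAX_KEYWORDS = 3  # rule 2: one or two as a general rule, three at most


def make_slug(keywords):
    """Build a lowercase, dash-separated name from a list of keywords."""
    cleaned = []
    for word in keywords:
        # Lowercase (rule 1), then drop everything except a-z and 0-9,
        # which also removes spaces and underscores (rule 3).
        token = re.sub(r"[^a-z0-9]", "", word.lower())
        if token:
            cleaned.append(token)
    if len(cleaned) > MAX_KEYWORDS:
        raise ValueError("more than three keywords in one name")
    # The dash is the only separator ever emitted.
    return "-".join(cleaned)


print(make_slug(["Red", "Widgets"]))  # red-widgets
```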
I do that, too. But it wasn't too long ago that the NYT used a format for its articles of four letters all in caps. I understood the goal of brevity, but not all browsers accept caps, do they?
> use one or two keywords in the file or directory name as a general rule, with a maximum of three
I do this, too. I usually choose one word from the title tag. I'll avoid file names that duplicate part of the domain name or directory name. (Spammy and somewhat redundant.)
> separate those keywords with a dash (or possibly a period/dot if you can convince me there's no other way) and nothing else -- that includes no spaces
I don't do that, as a personal preference. I just don't like hyphenation. I avoid the issue often just by choosing a single word for the file name. I do this for the cleaner look, to encourage linking, and make it not look spammy. I don't think hyphenation or spaces, or commas, or _ characters look great.
When possible, I'll choose file names in a specific directory which substantiate a theme.
Or if there's no theme, just a string of successive numbers, that shows Google natural file name choices:
I like .html; I used it from the beginning and never saw anything better.
I now try to choose file names which I'll never have to change. I tend to think the more stable they are, the better. It's one thing that it just seems best to get right the first time.
> I think that file name such as /my-house-on-the-beach-during-renovation.html would not trigger and adverse actions from SEs algos just due to six separators (remember we are talking files and filepaths not domain names).
I'd tend to agree but suspect Google is more lenient, if it were to get suspicious, with blogs, i.e., if file names are chosen by the software. Blogs seem to spit out file names that are the entire page title tag with hyphens separating each word?
> but not all browsers accept caps, do they?
Yes, all browsers accept caps. The HTTP spec (IMHO, required reading) says file paths are case-sensitive; e.g. "Page.htm" is different from "page.htm". Whether they resolve to the same "page" or not is determined by the server (Apache, IIS, etc.), and oddly is also affected by the OS that the server is running on (UNIX, Win, etc.). Some hosts are case-sensitive, some are not. This sometimes causes unexpected canonicalization issues for unwary webmasters, so it's worth checking on your own sites.
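One way to do that check, assuming you already have a list of your URLs from a crawl or a log file, is to group them by a case-folded key and look for groups with more than one spelling. This Python sketch is just an illustration; the function name is made up.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def find_case_variants(urls):
    """Group URLs that differ only by letter case in the host or path."""
    groups = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        # Fold the case of host and path; keep the query string exact.
        key = (parts.netloc.lower(), parts.path.lower(), parts.query)
        groups[key].add(url)
    # Only groups with more than one distinct spelling are of interest.
    return [sorted(variants) for variants in groups.values() if len(variants) > 1]


crawl = [
    "http://example.com/Page.htm",
    "http://example.com/page.htm",
    "http://example.com/other.htm",
]
print(find_case_variants(crawl))
```

Each group it returns is a candidate for a duplicate: on a case-insensitive host both spellings serve the same page.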
> /581/keyword/ and /581/any-old-garbage/ should not both resolve to the same "page"
Wholeheartedly agree. In fact this is one of my pet peeves, and lots of A-list sites do it badly. I'm guilty of letting it slip sometimes with a loose Regex, and I constantly deal with the guilt and self-loathing that results.
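For what it's worth, here is a small Python sketch of a stricter check than a loose regex: the record number finds the record, but the page is only served when the slug also matches. The RECORDS table and the route function are hypothetical, and answering a wrong slug with a 301 (rather than a 404) is one possible choice, not the only one.

```python
import re

# Hypothetical record store: record number -> the one slug that is allowed.
RECORDS = {581: "keyword"}


def route(path):
    """Return (status, location) for a /<id>/<slug>/ style path."""
    match = re.fullmatch(r"/([0-9]+)/([^/]+)/?", path)
    if match is None:
        return 404, None
    record_id = int(match.group(1))
    canonical_slug = RECORDS.get(record_id)
    if canonical_slug is None:
        return 404, None
    canonical_path = f"/{record_id}/{canonical_slug}/"
    if path == canonical_path:
        # Exactly one URL serves the page.
        return 200, canonical_path
    # Right record, wrong spelling: point permanently at the real URL.
    return 301, canonical_path


print(route("/581/keyword/"))          # (200, '/581/keyword/')
print(route("/581/any-old-garbage/"))  # (301, '/581/keyword/')
```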
> Most common field separator there is “&”.
> What do you take into consideration when naming files in URL?
If it is an existing site going through a rewrite, the first thing I take into consideration is the current taxonomy. In "my" perfect world, the URI would follow the taxonomy to the letter. But, that doesn't always make sense and there are times where creative naming conventions come into play.
I prefer brevity too. I always wondered why certain blog platforms chose the multiple-hyphenated-keyword-file-names, as they are one of the most unfriendly formats. I would never do that, but that's me. I like single word sub-directory naming. I will use two words separated by one hyphen when required. If I've got to put two hyphens in there, that's a flag for me. Time to rethink the naming convention for that particular scenario.
I prefer not to show file extensions.
I prefer lower case too but have been using Pascal Casing more and more in advertising. We are on Windows so case sensitivity is not an issue. But, if you don't have a rewrite in place to force lower case, there may be duplication issues to contend with, as URIs are case sensitive.
We don't pass queries in URI strings. We only use hyphens and slashes as separators, nothing else.
And, a little OT, we always make sure that there are no references within the site that are generating a redirect. No need to have any additional redirect layers. For example, we redirect http to https where applicable. Sometimes I've hardcoded http when I shouldn't have and I'll find the redirects when running my QC tools.
> With human visitors in mind you want to choose file names that are easily comprehensible and memorable.
I don't really worry about humans and paths past the domain name because people either click on a link or bookmark to get there, or use search to find it.
People remember domain names but I'm pretty sure nobody tries to remember entire paths.
Maybe someone will remember "http://example.com/bobspage" but who ever types in "http://example.com/forum/topic/myfavforum/etc..." ?
You'll get about as far as "http://example.com/forum/" from memory and the rest is either a click within the site or a bookmark, so paths longer than that short starting path are kind of moot other than for the search engines.
[edited by: tedster at 5:29 pm (utc) on Mar. 3, 2008]
[edit reason] switch to example.com - it can never be owned [/edit]
> People remember domain names but I'm pretty sure nobody tries to remember entire paths.
Ask my clients if they remember the paths to content on their sites. Most will tell you, yes, we can type in this and pretty much "guess" that this is what follows. I like to refer to it as "Intuitive URI Naming Conventions". Let's put the visitor aside for a moment and look at this from a marketing and maintenance standpoint. As mentioned above "Cool URIs Don't Change". Once you've established the URI naming conventions, that's it moving forward. If you've chosen an "Intuitive" path, they become learned over time by those intimately involved with the website. This includes website administrators, webmasters, marketers, etc. It also includes returning users who have become accustomed to your navigation. Those are going to be few and far between for many but for some, the audience is there.
It's much easier to be on a telecon with a client or developer and tell them to go to example.com/sub/sub/. If you've followed the brevity rule, it makes maintenance a breeze, and those involved with the day to day grind of the site truly appreciate the naming conventions; it makes their life that much easier.
If your URI naming conventions have followed the navigational structure of your site, the taxonomy, that all becomes part of the "package". Everything has meaning. URI naming conventions are an important part of the "package".
Users often use the browser history displayed in the address bar to get back to a page, so it helps if they can easily decode the URL.
This isn't just for static pages but for generated pages like search results.
For example say you have a web site that searches by location and keyword and date for events of some kind:
1. The user inputs search terms:
keywords : happy hour
date: 10 March 2008 to 21 March 2008
location: London, UK
2. User POSTs search, server sends back a 302 response with the query parameters rewritten to a human readable URL aka slug e.g.:
Store the "londonuk-10Mar2008to21Mar2008-happyhour" slug in a criteria table that logs your search parameters. This is preferable to coming up with a url format that you can decode back into search criteria: you can now look up the slug in your table to repeat any search.
3. Page 2 of the results could be:
Using these URLs allows a user to bookmark/digg/stumbleupon search results and makes it easier for people who use their address bar history.
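The slug-and-table idea from the steps above can be sketched in a few lines of Python. This is only an illustration: the function names are made up, a plain dict stands in for the criteria table, and the URL that the slug would be appended to is left out.

```python
from datetime import date

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Stand-in for the criteria table; a real site would use a database table.
criteria_table = {}


def squash(text):
    """Lowercase the text and keep only letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def format_day(d):
    """Format a date as day, abbreviated month, year, e.g. 10Mar2008."""
    return f"{d.day}{MONTHS[d.month - 1]}{d.year}"


def save_search(location, start, end, keywords):
    """Build the slug, log the criteria under it, and return the slug."""
    slug = (f"{squash(location)}-{format_day(start)}"
            f"to{format_day(end)}-{squash(keywords)}")
    criteria_table[slug] = {
        "location": location, "start": start, "end": end, "keywords": keywords,
    }
    return slug


def load_search(slug):
    """Look the slug up to repeat a search; returns None if unknown."""
    return criteria_table.get(slug)


slug = save_search("London, UK", date(2008, 3, 10), date(2008, 3, 21), "happy hour")
print(slug)  # londonuk-10Mar2008to21Mar2008-happyhour
print(load_search(slug)["keywords"])  # happy hour
```

Because the slug is only a key into the table, it never has to be parsed back into criteria. Two different searches could squash to the same slug, though, so a real table would need to handle that collision.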
> Ask my clients if they remember the paths to content on their sites. Most will tell you, yes, we can type in this and pretty much "guess" that this is what follows.
OK, you have clients on the opposite extreme of the spectrum.
I always assumed titles were more important so they could easily spot the page in a bookmark or history, go figure.
In regards to lowercase folders, I would say use lower case if-you-hyphenate-the-keywords. If you don't use a space or any kind of separation for Keyword1 & Keyword2, I would use capital letters to distinguish the keywords from each other.
Which one looks better?
I'd like to agree with potentialgeek in using .html for my file names. I have my htaccess rewrite .php to .html since it's harder for someone to guess which language you're really using and exploit its security flaws.
Which really is just a cop out on me not adequately fixing my security flaws but you get the idea :)
Unless you also set "expose_php = Off" in the php.ini file, your Apache server will continue to send the "X-Powered-By: PHP" HTTP header, which means anyone can detect PHP in use regardless of your rewrite rules.
3. I use dots.between.words.in.URLs. Works for me. I mean, there are dots in the host name, and another before the extension, so why not use them elsewhere in the path?
4. I don't disallow query strings in robots.txt, I explicitly 301 redirect them to the static URL that I do want to be indexed.
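As a rough illustration of point 4, here is a Python sketch of the decision behind such a redirect. The mapping table is hypothetical; pairing ID=1234567 with /red-widgets.html is purely an assumption borrowed from the example URLs earlier in the thread.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical lookup from the ID query value to the static path to index.
STATIC_PATHS = {"1234567": "/red-widgets.html"}


def redirect_for(url):
    """Return (301, static_path) for a known query-string URL, else None."""
    parts = urlsplit(url)
    if not parts.query:
        return None  # already a static URL, nothing to do
    ids = parse_qs(parts.query).get("ID")
    if not ids or ids[0] not in STATIC_PATHS:
        return None  # unknown record: leave it to normal handling
    # Permanent redirect, so only the static URL ends up indexed.
    return 301, STATIC_PATHS[ids[0]]


print(redirect_for("http://example.com/page-34.html?ID=1234567&c=rw"))
```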
I'm not liking the sound of a 302 redirect. They rapidly cause trouble in so many situations.
Once you have had the indexing of a site completely screwed by a silly error, you'll soon revert to the all-lower-case option.
I for whatever reason have files ending with .htm by design.
Currently I am looking at going extensionless for both HTML and PHP.
I try to keep page names along the lines of proper sentence case.
If I'm writing about Brett, the page is example.com/Brett_Tabke.htm
I DON'T make enough use of directories.
Everything is dumped into www, which can be frustrating at times.
Using a period (.) seems odd to me, first thing I think of is del.icio.us which is a clever usage of subdomains, but ehhh...
Mostly I'm aiming for a url that communicates well, with safety and confidence, to the person who sees it in the SERP. My thinking is not just rankings, but also traffic generation.
Traditionally, most search engines don't see the words as separate words when you use an underscore.
You also lose the underscore visually in underlined links.
Those are good enough reasons for me, for why I haven't used underscores in URLs for many years.
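I can't say how any search engine actually tokenizes URLs, but a common text-processing rule shows why the underscore can behave this way. In regular expressions the "word character" class \w includes the underscore, so a splitter built on it treats a hyphen as a word boundary and an underscore as part of the word. A tiny Python illustration (the function name is mine):

```python
import re


def split_words(name):
    """Split a file name into 'words' using the regex word-character class."""
    # \w matches letters, digits and the underscore, so an underscore
    # never ends a word under this rule, while a hyphen does.
    return re.findall(r"\w+", name)


print(split_words("red-widgets"))  # ['red', 'widgets']
print(split_words("red_widgets"))  # ['red_widgets']
```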
Has anyone ever done a study on public perception of Search Results? I'd guess 99% of users simply go from top to bottom of SERPs, with little attention to page title, domain name, or file name.
I think there was one study on landing pages, after clicking on those Search Results, which indicated you have only a few seconds to convince users to stay on your site. The decision-making process is very fast on the binary decision to stay/leave.
I realize Cool URIs don't change but I had to do it ;-) So far so good, it's only been a week or two now. Let's see what happens!
I don't like periods or commas. Look weird. Some major sites use them still. Probably started using them for no good reason, and then kept using them just for consistency. Time.com is one example.
I don't like the 'underlines'; news.com uses them, among others. Both time.com and news.com use numbers only in file names. I lean towards doing that now more. It's pretty innocuous, but also boring.
[Hope you don't mind mention of specific sites, "real-world" examples.]