|When is a URI "the same URI"? |
Two URIs are the same if (and only if) they are the same character for character.
Two URIs which are different may in fact be equivalent, in that they may refer to the same thing, and give the same result in all operations. In some cases any agent looking at two URIs can deduce, from knowledge of the various web standards, that they must be equivalent, in that they must refer to the same thing. For example, HTTP URIs contain domain names, and the Domain Name System is case-insensitive. Therefore, while it is normal practice to use lower case for domain names, any agent which comes across two URIs which differ only in the case of the domain name can conclude that they must refer to the same thing. In another case, a client agent may use out-of-band information about a web site to know that its URI paths are case-invariant, or that URIs ending in "/" and "/index.html" are equivalent. It is bad engineering practice to make new protocols require such processing.
There are a long series of such algorithms. Which ones an agent can apply depends on what information it has to hand, and depend on what knowledge of which protocols has been programmed into it. New schemes may be defined in the future, for which different forms of canonicalization can be done. There is, therefore, no definitive canonicalization algorithm for URIs. Generic URI handling code should handle URIs as case-sensitive character strings.