"Multilingual Internet" has many dimensions
I received an email from someone in Paraguay today, inquiring about VLAN support on 3Com switches. He obviously tried very hard to communicate his needs in English, and his effort was no worse than I imagine my frequent efforts in Spanish to be, so I replied in English and Spanish. He immediately thanked me profusely, entirely in Spanish. Perhaps my Spanish isn't as bad as I think.
Fortunately, English and Spanish mostly share a common script. With the exception of the occasional tilde or umlaut, most characters (glyphs) of Romance (New Latin) languages can be represented in basic ASCII. However, if I were to receive email in both a native language and English from someone in Beijing, Seoul, Bangkok, Tel Aviv, or Mosul, I'd need the ability to type and display characters from the Chinese, Korean, Thai, Hebrew, or Arabic scripts in my email client or browser (and of course understand these languages). In some countries, many scripts are used, so I might even need to understand how to read and type the scripts of local dialects.
The problem of native and local language character representation has been solved in many operating systems and applications. Email and web, for example, use Multipurpose Internet Mail Extensions (MIME) to provide multilingual support for many languages and scripts. Unfortunately, the Domain Name System cannot easily accommodate national and local character sets for several reasons. Principal among these is that DNS labels must follow composition rules for ARPANET hostnames, and can only contain letters, digits, and hyphens (known as "LDH"). With this restriction, a tilde or umlaut used in certain New Latin scripts cannot be used in domain names, nor can any character or glyph of other known languages. Such languages require that domain names be represented to users in the Unicode Character Set.
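To make the LDH restriction concrete, here is a minimal Python sketch of a label check. The function name `is_ldh_name` is my own invention for illustration, and the sketch assumes the usual hostname limits (labels of 1 to 63 characters that neither begin nor end with a hyphen) in addition to the letters-digits-hyphens rule described above.

```python
import re

# One LDH label: letters, digits, and hyphens only. This sketch also assumes
# the usual hostname limits: 1-63 characters, no leading or trailing hyphen.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_name(name: str) -> bool:
    """Return True if every dot-separated label in `name` is an LDH label."""
    if name.endswith("."):  # tolerate a single trailing root dot
        name = name[:-1]
    return all(LDH_LABEL.fullmatch(label) is not None for label in name.split("."))


if __name__ == "__main__":
    # The second name contains U+00F1 (n with tilde), which is not LDH.
    for candidate in ("example.com", "cuartodeba\u00f1o.com", "-bad.com"):
        print(candidate, is_ldh_name(candidate))
```

A name that fails this check is exactly the kind of name that cannot be placed in the DNS as typed and needs some other encoding.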
Several factors exacerbate the problem. First, most Internet applications present, and expect users to submit, a domain name in the same representation (presentation syntax) that the DNS protocol uses "over the wire" (its *transfer* syntax). This is true even in MIME-enabled email: while you may be able to use the Spanish word baño or the Thai character ko kai (ก) in a message body or certain headers, you cannot do so in any host name part of an email address (you can probably confirm this by examining mail headers in certain spam messages you receive). Similarly, while you can place Unicode-encoded national language characters in a web page, you can't use them in a hyperlink.
A partial solution to this problem, generally referred to as Internationalized Domain Names (IDN), currently provides the ability to use national and local characters in all DNS labels other than the Top Level Domain (TLD) label (e.g., com, net, org, biz, info, and the "country code" TLDs). This solution is specified in RFC 3490, Internationalizing Domain Names in Applications (IDNA). IDNA describes how Internet applications can convert labels presented to users in Unicode characters to an "ASCII-compatible encoding" that name servers can process (LDH), and how applications can convert a label from an ASCII-compatible encoding back into a native or local character string for presentation to an application user. (Interested readers should also consider two companion standards: RFC 3491, Nameprep, and RFC 3492, Punycode.)
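The two conversions IDNA describes can be tried with Python's built-in `idna` codec, which the Python documentation describes as implementing RFC 3490. The following is a rough sketch rather than a reference implementation; the helper names `to_ascii` and `to_unicode` are mine, chosen only to mirror the two directions of the conversion.

```python
def to_ascii(name: str) -> str:
    """Convert a Unicode domain name to its ASCII-compatible (LDH) form."""
    # The "idna" codec applies Nameprep and Punycode label by label.
    return name.encode("idna").decode("ascii")


def to_unicode(name: str) -> str:
    """Convert an ASCII-compatible domain name back to Unicode for display."""
    return name.encode("ascii").decode("idna")


if __name__ == "__main__":
    native = "cuartodeba\u00f1o.com"  # \u00f1 is n with tilde
    wire = to_ascii(native)
    print(wire)              # the LDH form a name server sees
    print(to_unicode(wire))  # the form shown to the user
```

Only the label containing non-ASCII characters is rewritten (it gains the `xn--` prefix); the `com` label passes through unchanged, which is consistent with the TLD remaining LDH.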
IDNA is currently available for use in 2nd level labels in domain names registered through IDN-capable registries, under LDH-encoded top level domains. This means that a registrant can register cuartodebaño.com today; more generally, anyone can use any Unicode-encoded national or local character set in a 2nd level label, register that name under a generic TLD (com, biz, net...), and have it resolved correctly by the authoritative DNS. ICANN publishes guidelines governing the composition of such Internationalized Domain Names. These guidelines provide general label composition rules for registries and, in particular, try to thwart the deceptive use of visually confusable characters by would-be pharmers (at last, something security-related!). To appreciate this form of deception, consider the two domain names paypal.com and pаypаl.com. Visually, they appear identical; the first, however, uses the Roman small "a" and the second uses a Cyrillic (Russian) small "a" (Unicode hexadecimal 0430). ICANN's IDN guidelines prohibit the mixing of scripts in labels to prevent misuse by pharmers. (Readers seeking a more detailed treatise on such practices should see Unicode Security Considerations.)
Currently, ICANN's IDN-capable registries do not support national and local character sets in top level labels. This means that while a party can register cuartodebaño.com, the country of Spain could not use "españa" as a friendlier alternative to the two letter country code "es", nor can other countries use national and local scripts. This may seem a trivial matter until you consider that LDH glyphs are not universally recognized and not easily typed on certain keyboards, that some scripts are not written left-to-right, and that everyone should have the opportunity to use the Internet to communicate in languages they share, using the characters normally used to write and print those languages (see RFC 4185).
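The mixed-script trick is easy to detect in principle. The Python sketch below guesses each letter's script from the first word of its Unicode character name (for example "LATIN" or "CYRILLIC"). That is a rough heuristic of my own for illustration, not the procedure in ICANN's guidelines or in Unicode Security Considerations, and the function names are hypothetical.

```python
import unicodedata


def scripts_in_label(label: str) -> set:
    """Return the set of scripts used by the letters in `label`.

    Heuristic: the script is taken to be the first word of the character's
    Unicode name, e.g. "LATIN SMALL LETTER A" -> "LATIN".
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():  # ignore digits and hyphens
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(label: str) -> bool:
    """Return True if the label's letters come from more than one script."""
    return len(scripts_in_label(label)) > 1


if __name__ == "__main__":
    genuine = "paypal"
    spoofed = "p\u0430yp\u0430l"  # U+0430 is CYRILLIC SMALL LETTER A
    print(genuine, scripts_in_label(genuine), is_mixed_script(genuine))
    print(spoofed, scripts_in_label(spoofed), is_mixed_script(spoofed))
```

A check like this flags the spoofed label but says nothing about a label written entirely in one confusable script, which is a harder problem.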
ICANN and the Internet community at large will consider two technical solutions for incorporating national and local characters in TLDs. The first attempts to apply the IDNA standards in the TLD; the second attempts to provide DNAME equivalence mappings for TLD strings.
Variants of the IDNA technique are currently used by "breakaway" root name services and registries to support TLD labels in several national characters and scripts. While these initiatives are providing native language TLDs today for the constituencies that subscribe to them, they have the undesirable effect of fracturing the single authoritative root name service: TLD labels registered in these languages are resolved not by the authoritative root name service but by the "local root name service" operated by (or on behalf of) a country or constituency.
"Breakaway" root name services solve an immediate and localized need by adopting and deploying IDN technology in advance of international guidelines developed through a consensus-building process. Anyone who's built one of anything will agree that this is a much smaller set of problems to solve than the set facing ICANN and the Internet community at large. We must ensure that domain names can be resolved consistently and correctly irrespective of the characters used in name composition and of geographic location. Breakaway root name services sidestep the challenging problems. To date, they don't attempt to solve multinational, multilingual issues but instead spin off multiple root name services. The minor matters they choose not to solve include how to establish globally acceptable guidelines by maintaining cooperation
among nations (North and South Korea are .kp and .kr respectively, but who decides what characters are used to globally represent "Korea" in the TLD label, who decides which scripts are used for generic TLDs, and how are duplicates handled?),
among constituencies within nations (Sunnis, Shiites and Muslim tribes must agree on language preferences at *some* level), and
between nations and private interests (coordinating and preserving intellectual property and protecting branding through registration processes).
Some attempt has been made to recognize multiple official languages and scripts, but the "policy" test case for this problem will be a country like India, which has twenty-two official languages, 325 local dialects and numerous scripts.
It's somewhat promising that most interested parties attending the IDN workshops prior to and during the Vancouver ICANN meeting appear to be moving past whining and posturing over the injustice of the "IDN.ascii" (the pejorative acronym used to illustrate that TLDs are still restricted to LDH encoding) and appear to be looking for global answers.
While boning up on IDNs, I tried to surf the web using one of the IDN-capable breakaway root name services. (I had to learn some Cyrillic to complete one of my first programming assignments, a Cyrillic keyboard interface to a Burroughs L-series minicomputer.) It wasn't long before I became frustrated at how hard it was to surf using unfamiliar languages and scripts. I can imagine how trying my Internet experience would be if this were a 24x7 circumstance. I can see how tempting a quick fix like a breakaway root name service might seem. Hopefully, these are interim solutions that do no permanent harm, and a globally palatable alternative will be identified soon.
Archived at http://www.securityskeptic.com/arc20051201.htm#BlogID482
by Dave Piscitello