Carsten Strotmann

Internationalized domain names (IDNs)

How does the internationalized domain name system work? How does it affect the existing domain name system (DNS)?

Apr 14th, 2010

It's been a long while since I first heard about internationalized domain names (IDNs). For the ordinary internet user, they meant that visiting websites whose names contain non-ASCII characters would no longer result in "page not found" or some erratic browser behavior. I expected that most registered domains that had originally sacrificed their native characters for ASCII counterparts would swiftly re-register in their proper language script.

However, the transition to IDNs has been much slower than I anticipated, at least for languages based on the Latin script, such as Icelandic. Countries that use non-Latin scripts, such as Cyrillic, Chinese or Arabic, are much more interested in adopting internationalized domain names and, more recently, internationalized TLDs as well.
This is understandable, since users in these countries must diverge much further from their usual writing style to use the original domain name system than users of Latin-based scripts.

But how does the internationalized domain name system work? How does it affect the existing domain name system (DNS)?
Well, it doesn't. IDNs are implemented at the application level and don't require any changes to the DNS protocol. The application (e.g. a browser) converts a Unicode IDN string to an ASCII domain name, which the existing DNS infrastructure treats like any other domain name. The encoding of Unicode characters as an ASCII string is called punycode and is described in RFC 3492.
In principle, each non-ASCII character in the string is removed and converted to an ASCII character sequence which includes both its Unicode identifier and its location in the string. This encoded sequence appears after the remaining ASCII characters, separated by a hyphen (-). The resulting domain name will look something like:

xn--[ASCII characters of the original string]-[punycode-encoded non-ASCII characters].TLD
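As a rough illustration of that layout, the following Python sketch uses the standard library's "punycode" codec to encode a single label and then splits the result at the last hyphen. The function name split_punycode is made up for this example, and the sketch skips the normalization steps that a real IDN implementation performs before encoding.

```python
def split_punycode(label: str) -> tuple[str, str]:
    """Encode one label and return (ASCII remainder, encoded suffix).

    Hypothetical helper for illustration only: it applies the raw
    punycode algorithm and does no case folding or normalization.
    """
    encoded = label.encode("punycode").decode("ascii")
    # The last hyphen separates the untouched ASCII characters from the
    # sequence that encodes the non-ASCII characters and their positions.
    basic, _, suffix = encoded.rpartition("-")
    return basic, suffix


basic, suffix = split_punycode("bücher")
print(basic)                             # bcher
print(suffix)                            # kva
print("xn--" + basic + "-" + suffix)     # xn--bcher-kva
```

The "ü" has been removed from the visible part of the label, and "kva" carries both which character it was and where it belongs.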

The prefix "xn--" simply distinguishes IDNs from ordinary domain names. The punycode encoding is applied to each label individually (including a potential internationalized TLD), and every label that contains non-ASCII characters gets its own "xn--" prefix. The resulting domain name string must obey the normal DNS rules, such as the 63-character limit for each label and the 253-character limit for the whole domain name.
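The per-label handling and the length limits could be sketched roughly as follows. The function to_ascii_name is a hypothetical helper, not part of any library, and lowercasing stands in for the fuller string preparation that the IDN specifications require.

```python
MAX_LABEL = 63   # DNS limit for a single label
MAX_NAME = 253   # DNS limit for the whole domain name


def to_ascii_name(name: str) -> str:
    """Convert a domain name to its ASCII form, label by label (sketch)."""
    ascii_labels = []
    for label in name.split("."):
        if not label.isascii():
            # Only labels with non-ASCII characters get the "xn--" prefix.
            label = "xn--" + label.lower().encode("punycode").decode("ascii")
        if not 1 <= len(label) <= MAX_LABEL:
            raise ValueError(f"label length out of range: {label!r}")
        ascii_labels.append(label)
    ascii_name = ".".join(ascii_labels)
    if len(ascii_name) > MAX_NAME:
        raise ValueError("domain name exceeds 253 characters")
    return ascii_name


print(to_ascii_name("www.bücher.example"))  # www.xn--bcher-kva.example
```

Note that the limits apply to the encoded form, so an IDN label can hit the 63-character ceiling with far fewer visible characters than an ASCII label.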

So apparently, the mechanism for incorporating IDNs in the domain name system is quite simple.

It is the deployment of IDNs that has been problematic. The diversity of writing systems and symbols in the Unicode character set presents a potential security problem, namely domain name spoofing, if the registration of IDNs is not carefully monitored.
By replacing characters in well-known domain names, such as paypal.com or microsoft.com, with non-ASCII characters that look almost identical to the ASCII originals, an attacker could send a user to the wrong site without the user ever noticing. This has been demonstrated for these two domains using Cyrillic characters, which shows that it can indeed be a security threat.
To prevent this, the registries should not accept IDNs with the whole of the Unicode character set, but only a subset including the standard ASCII characters and the characters needed to write words and numbers in the language of the TLD.
Cyrillic or Arabic characters should not be allowed in .is domain names, for instance. Applications deal with name spoofing in various ways, such as displaying the punycoded domain name instead of the IDN, or keeping track of and filtering out malicious domain names.
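To make the two defenses concrete, here is a small Python sketch. Both helpers and the character set are assumptions made for this example: the Icelandic letters listed are illustrative rather than an official registry table, and guessing the script from the first word of the Unicode character name is a rough heuristic, not a proper script lookup.

```python
import unicodedata

# Illustrative whitelist for a registry such as .is (an assumption).
ALLOWED_IS = set("abcdefghijklmnopqrstuvwxyz0123456789-") | set("áéíóúýþæöð")


def allowed_in_registry(label: str, allowed: set = ALLOWED_IS) -> bool:
    """Registry-side check: accept only characters from the whitelist."""
    return all(ch in allowed for ch in label.lower())


def scripts_in(label: str) -> set:
    """Application-side heuristic: which scripts do the letters come from?"""
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}


spoof = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A
print(scripts_in(spoof))                 # contains both LATIN and CYRILLIC
print(allowed_in_registry(spoof))        # False
print(allowed_in_registry("þjóðskrá"))   # True
```

A label that mixes scripts, like the spoofed one here, is a natural candidate for being shown in its punycode form rather than as Unicode.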

Another obstacle to the widespread deployment of IDNs is the fact that IDNs were not initially deployed at the top level.
This means that non-Latin script users, after entering an internationalized second-level domain name, still need to switch their keyboard to Latin script to enter the TLD.
Even more awkwardly, Arabic script users also need to change the typing direction in the middle of entering the fully qualified domain name. Internationalized TLDs are only now in the first stages of deployment, the IDN ccTLD fast track process having been launched in November 2009. Once internationalized TLDs have become widely available, users will finally be able to enter whole domain names in their native scripts, which should make it far more appealing to register IDNs.
Still, users won't be able to enter non-ASCII characters in the local part (left of the @) of e-mail addresses. Although that issue is separate from IDNs and the current discussion, the work done by the EAI working group indicates that it should not be too long before internationalized e-mail addresses become a reality alongside IDNs.

Since IDNs are implemented on the application level, application support for IDNs is crucial. Although the major browsers have supported IDNs since about 2005, the support amongst other basic applications is still poor.
For instance, it may be particularly frustrating for network administrators and advanced users that basic command-line utilities like ping, ssh, or even the DNS lookup utility dig don't yet support IDNs. The user still has to obtain and enter the punycode representation of an IDN manually to get these applications to work. The punycode representation is generally quite cryptic and difficult to remember and work with, which defeats the very purpose and spirit of DNS.

The Men&Mice Suite is used by customers all over the world who use a variety of scripts. Currently, only the punycode representation of IDNs is supported in the application, i.e. domain names must be entered as an ASCII character sequence in the punycode format "xn--..." and will only be displayed in that form.
The simplest way to generate the punycode representation of a given IDN is to enter the string in a browser such as Mozilla Firefox or Google Chrome, which converts any Unicode input to punycode.
The result can then be used as a domain name in the Men&Mice Suite. It would nevertheless be difficult to keep track of or validate many IDNs in the application. It is therefore clear that future versions of the Men&Mice Suite must provide proper support for IDNs by converting to and from the punycode representation, that is, if the usage of IDNs ever becomes as widespread as expected.
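For readers who would rather not go through a browser, the conversion in both directions can be sketched with the "idna" codec that ships with Python's standard library. The two function names below are invented for this example and say nothing about how the Men&Mice Suite might implement it; the codec itself follows the original IDNA specification (RFC 3490), so results for some characters may differ from newer revisions of the standard.

```python
def to_wire(name: str) -> str:
    """Unicode domain name -> ASCII (punycode) form, as stored in DNS."""
    return name.encode("idna").decode("ascii")


def to_display(name: str) -> str:
    """ASCII (punycode) form -> Unicode domain name, for showing to users."""
    return name.encode("ascii").decode("idna")


wire = to_wire("bücher.example")
print(wire)               # xn--bcher-kva.example
print(to_display(wire))   # bücher.example
```

The ASCII form printed by to_wire is what can be pasted into tools that only accept punycode.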
