February 2008 – Emil's Lost & Found Archive

Ever wondered how you can translate European national characters with diacritics, such as å, ä, ö and ó, into their base characters (a and o)? This can be useful, for example, when constructing filenames from user-given strings.

This is one way to do it:

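using System;
using System.Text; // NormalizationForm

string s = "ö";
// FormD (canonical decomposition) splits "ö" into "o" plus a combining diaeresis.
string normalizedString = s.Normalize(NormalizationForm.FormD);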

Console.WriteLine("Composite character: " + s[0]);
if (normalizedString.Length > 1)
{
   Console.WriteLine("Base character: " + normalizedString[0]);
   Console.WriteLine("Diacritic character: " + normalizedString[1]);
}

The result will be:

Composite character: ö
Base character: o
Diacritic character: ¨

The key is the String.Normalize method. When we pass it NormalizationForm.FormD (canonical decomposition) as the requested Unicode normalization form, it separates every composite character into its constituents: a base character followed by one or more combining marks. Characters that have no decomposition are left unchanged.

Note that the resulting string will be longer than the original if any characters were separated. If needed, it’s easy to iterate over the characters and filter out non-letters using conditions such as

if (CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.LowercaseLetter ||
    CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.UppercaseLetter)
{
   ...
}
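
Putting the two pieces together, here is a minimal sketch of how the normalization and the filter might be combined into one helper. The class name DiacriticDemo and the method name ToBaseLetters are made up for this illustration and are not part of .NET. The sketch assumes you really do want to keep only upper- and lowercase letters, so digits, spaces and punctuation are dropped along with the diacritics.

```csharp
using System;
using System.Globalization;
using System.Text;

static class DiacriticDemo
{
    // Hypothetical helper (name chosen for this example): decomposes the
    // input with FormD and keeps only upper- and lowercase letters, using
    // the same category check as the condition shown above.
    public static string ToBaseLetters(string input)
    {
        string decomposed = input.Normalize(NormalizationForm.FormD);
        var result = new StringBuilder(decomposed.Length);

        foreach (char c in decomposed)
        {
            UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);
            if (category == UnicodeCategory.LowercaseLetter ||
                category == UnicodeCategory.UppercaseLetter)
            {
                result.Append(c); // base letter: keep it
            }
            // Everything else (combining marks, digits, spaces, ...) is skipped.
        }

        return result.ToString();
    }

    static void Main()
    {
        Console.WriteLine(ToBaseLetters("Smörgåsbord")); // prints "Smorgasbord"
    }
}
```

If you would rather keep digits and punctuation, one possible variation is to skip only characters whose category is UnicodeCategory.NonSpacingMark. Also keep in mind that letters such as ø have no decomposition, so they pass through unchanged.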

Happy character-mangling!

Extracting the base character from a character marked with a diacritic