Emil’s Blog

Programming Windows, .Net, EPiServer and whatnot…


February 5, 2008

Extracting the base character from a character marked with a diacritic

by @ 8:59. Filed under .Net programming

Ever wondered how you can translate European national characters with diacritics, such as å, ä, ö and ó, into their base characters (a and o)? This can be useful, for example, when constructing filenames from user-given strings.

This is one way to do it:

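using System;
using System.Text;   // NormalizationForm lives here

string s = "ö";
// FormD (canonical decomposition) splits a composite character
// into its base character followed by its combining marks
string normalizedString = s.Normalize(NormalizationForm.FormD);

Console.WriteLine("Composite character: " + s[0]);
if (normalizedString.Length > 1)
{
   Console.WriteLine("Base character: " + normalizedString[0]);
   Console.WriteLine("Diacritic character: " + normalizedString[1]);
}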

The result will be:

Composite character: ö
Base character: o
Diacritic character: ¨

The key is the String.Normalize method. When we pass it NormalizationForm.FormD (canonical decomposition) as the requested Unicode normalization form, it separates all composite characters into their constituents: a base character followed by one or more combining marks. For ö, that is o followed by the combining diaeresis (U+0308). If a character is not a composite, nothing happens to it.
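To make the decomposition easier to see than it is on a console, here is a minimal sketch that prints the code points before and after normalization. The class name DecompositionDemo and the helper CodePoints are made up for this illustration; only String.Normalize and NormalizationForm.FormD come from the framework.

```csharp
using System;
using System.Linq;
using System.Text;

static class DecompositionDemo
{
    // Hypothetical helper: formats each UTF-16 code unit of the text as U+XXXX.
    public static string CodePoints(string text)
    {
        return string.Join(" ", text.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    static void Main()
    {
        string composite = "\u00F6"; // ö as a single precomposed character
        string decomposed = composite.Normalize(NormalizationForm.FormD);

        Console.WriteLine(CodePoints(composite));  // U+00F6
        Console.WriteLine(CodePoints(decomposed)); // U+006F U+0308

        // A string without composite characters comes back unchanged.
        Console.WriteLine("abc".Normalize(NormalizationForm.FormD) == "abc"); // True
    }
}
```

Printing code points avoids relying on how a particular console font renders a lone combining mark.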

Note that the resulting string will be longer than the original if any characters were separated. If needed, it's easy to iterate over the characters and filter out non-letters (including the separated diacritic marks) using conditions such as

// CharUnicodeInfo and UnicodeCategory are in the System.Globalization namespace
UnicodeCategory category = CharUnicodeInfo.GetUnicodeCategory(c);
if (category == UnicodeCategory.LowercaseLetter ||
    category == UnicodeCategory.UppercaseLetter)
{
   ...
}
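Putting the two steps together, a complete helper might look like the sketch below. Note that it is a variation on the condition above, not the same filter: instead of keeping only upper- and lowercase letters, it drops only characters in the NonSpacingMark category, so digits, spaces and punctuation survive. The names DiacriticStripper and RemoveDiacritics are invented for this example.

```csharp
using System;
using System.Globalization;
using System.Text;

static class DiacriticStripper
{
    // Hypothetical helper: returns the input with combining marks removed.
    public static string RemoveDiacritics(string input)
    {
        // Step 1: split composite characters into base character + combining marks.
        string decomposed = input.Normalize(NormalizationForm.FormD);

        // Step 2: copy everything except the combining (non-spacing) marks.
        var result = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                result.Append(c);
            }
        }

        // Step 3: recompose whatever is left, so the result is in the usual composed form.
        return result.ToString().Normalize(NormalizationForm.FormC);
    }

    static void Main()
    {
        Console.WriteLine(RemoveDiacritics("Åström över ó")); // Astrom over o
    }
}
```

One limitation to keep in mind: letters that Unicode does not define as base plus diacritic, such as ø, have no decomposition and pass through unchanged, so a filename routine may still need its own handling for those.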

Happy character-mangling!
