MARC8 and UTF8 – what does it mean?

Tuesday, August 25th, 2009

Last week we looked at the Anatomy of an Authority Record, but what if we look even deeper? Both bibliographic and authority records are essentially text, made up of characters encoded in either MARC-8 or UTF-8. But what does that mean, and what's the difference?

MARC-8

The MARC-8 character set uses 8-bit characters, meaning it natively displays ASCII and ANSEL text. Because 8 bits allow only a limited number of characters, MARC-8 includes methods to extend the set of displayable characters. One method is to include both spacing base characters and nonspacing modifier characters (diacritics).

Spacing or nonspacing refers to cursor movement: a spacing character moves the cursor, a nonspacing character does not. A nonspacing character is always associated with a single spacing character, but multiple nonspacing characters may be associated with the same spacing character.

In MARC-8, when there is a nonspacing character, it precedes the associated spacing character: any cursor movement occurs after displaying the character. This method allows basic and extended Latin characters to be displayed using the default character set.

Another method MARC-8 uses to extend the displayable characters is to switch to alternate character sets. This is done with escape sequences: special character sequences containing codes that indicate which character set is being selected for display. Possible alternate sets include subscripts, superscripts, Hebrew, Cyrillic, Arabic, and Greek. Chinese, Japanese, and Korean are also possible by this method, using the EACC (East Asian Character Code) encoding for these characters. While this method allows many additional characters to be used, it is still limited and somewhat burdensome.
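As a rough illustration of how escape sequences switch character sets, the sketch below splits a byte string into runs and labels each run with the set that was active. The function name `split_by_character_set` is hypothetical, and the small table of escape codes is an assumption based on the published MARC-8 code tables; it covers only a few single-byte sets and should be checked against the specification before any real use.

```python
ESC = 0x1B  # the escape byte that introduces a character set switch

# Assumed subset of escape codes (the two bytes that follow ESC).
# Illustrative only: verify against the MARC-8 specification.
G0_SETS = {
    b"(B": "Basic Latin (ASCII)",
    b"(N": "Basic Cyrillic",
    b"(S": "Basic Greek",
    b"(2": "Basic Hebrew",
    b"(3": "Basic Arabic",
}


def split_by_character_set(data: bytes) -> list[tuple[str, bytes]]:
    """Return (character set name, raw bytes) runs in reading order."""
    runs = []
    current = "Basic Latin (ASCII)"  # default set before any escape
    buffer = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESC:
            code = bytes(data[i + 1:i + 3])
            if code not in G0_SETS:
                raise ValueError(f"escape code {code!r} is not in this sketch's table")
            if buffer:
                runs.append((current, bytes(buffer)))
                buffer = bytearray()
            current = G0_SETS[code]
            i += 3  # skip ESC plus the two code bytes
        else:
            buffer.append(data[i])
            i += 1
    if buffer:
        runs.append((current, bytes(buffer)))
    return runs


for name, chunk in split_by_character_set(b"abc\x1b(Nde\x1b(Bf"):
    print(name, chunk)
```

The sketch only labels the runs; turning the bytes of each run into characters would need the full code table for that set.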

UTF-8

As computers needed to support a wider character set, many computer-related companies formed a group to define the Unicode Standard. The standard was originally designed around 16-bit characters, and its code space has since been extended beyond that range. UTF-8 is a method of encoding these characters into sequences of 1 to 4 bytes; characters in the original 16-bit range take 1 to 3 bytes. Unicode, using the UTF-8 encoding, was accepted as an alternative character set for use in MARC records, with an initial limitation to using only the Unicode characters that have corresponding characters in the MARC-8 character set.
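To make the variable-length encoding concrete, here is a small sketch that counts the bytes UTF-8 uses for a few sample characters. The helper name `utf8_length` and the choice of samples are illustrative. The last sample lies outside the original 16-bit range and takes four bytes.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


samples = [
    ("n", "Latin small letter n (ASCII)"),
    ("\u00f1", "Latin small letter n with tilde"),
    ("\u05d0", "Hebrew letter alef"),
    ("\u4e2d", "CJK ideograph U+4E2D"),
    ("\U0001d11e", "Musical symbol G clef (beyond 16 bits)"),
]

for ch, label in samples:
    print(f"U+{ord(ch):04X}  {utf8_length(ch)} byte(s)  {label}")
```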

Decomposed

Unicode has definitions for nonspacing characters like MARC-8, except that the nonspacing character follows the character it modifies: cursor movement occurs before the nonspacing character is displayed. Decomposed UTF-8 characters are similar to MARC-8 diacritics, in that a base character is modified by one or more nonspacing characters. For example, a base character ‘n’ followed by a nonspacing ‘~’ would combine to display ‘ñ’. Decomposed form is also the current LC (Library of Congress) standard.
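The difference in ordering can be sketched as a small conversion: in MARC-8 the nonspacing marks come before the base character, so a converter has to hold them back and emit them after it. The function name `marc8_latin_to_unicode` is hypothetical, and the three byte values in the table are an assumption based on the ANSEL code table; this is a tiny subset for illustration, not a full converter.

```python
# Assumed subset of ANSEL nonspacing marks mapped to Unicode combining marks.
# Illustrative only: verify the byte values against the MARC-8 code tables.
ANSEL_COMBINING = {
    0xE1: "\u0300",  # grave
    0xE2: "\u0301",  # acute
    0xE4: "\u0303",  # tilde
}


def marc8_latin_to_unicode(data: bytes) -> str:
    """Convert ASCII plus the marks above to decomposed Unicode text."""
    out = []
    pending = []  # marks seen so far, waiting for their base character
    for byte in data:
        if byte in ANSEL_COMBINING:
            pending.append(ANSEL_COMBINING[byte])
        elif byte < 0x80:
            out.append(chr(byte))  # base character first...
            out.extend(pending)    # ...then the marks that preceded it
            pending.clear()
        else:
            raise ValueError(f"byte 0x{byte:02X} is outside this sketch's table")
    if pending:
        raise ValueError("nonspacing mark without a following base character")
    return "".join(out)


text = marc8_latin_to_unicode(b"Espa\xe4nol")
print([f"U+{ord(ch):04X}" for ch in text])
```

In the output, the combining tilde (U+0303) appears after the ‘n’ (U+006E), even though its MARC-8 byte came before it.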

Precomposed

Unicode also includes many precomposed characters. These are spacing characters that are the equivalent of one or more nonspacing characters and a spacing character. A precomposed ‘ñ’, instead of having a base character and an additional nonspacing diacritic mark, combines all those elements into one code that represents the character with the diacritic as a whole. Because the same displayed character can then be encoded in more than one way, comparing and matching text requires a more involved normalization routine.
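A short sketch of why this matters: the precomposed and decomposed forms of ‘ñ’ look identical on screen but are different sequences of code points, so a plain string comparison treats them as different. The helper name `describe` is illustrative; `unicodedata` is part of the Python standard library.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return the Unicode name of each code point in the text."""
    return [unicodedata.name(ch) for ch in text]


precomposed = "\u00f1"  # one code point for the whole character
decomposed = "n\u0303"  # base character followed by a combining mark

print(describe(precomposed))
print(describe(decomposed))
print("equal as raw strings:", precomposed == decomposed)
```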

Normalization

To handle the various ways a composite character could be encoded, standardized normalization forms have been defined. These include NFD (Normalization Form D, decomposed) and NFC (Normalization Form C, composed). In NFD, every character that can be decomposed is converted to its most decomposed form following the rules for canonical decomposition. In NFC, the characters are first decomposed as in NFD, then recombined into precomposed (composite) forms following canonical rules. This may result in the sequence of code points for a given character changing to an alternate, equivalent form.
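The two forms can be tried out with Python's standard `unicodedata.normalize` function. This is a minimal sketch; the wrapper names `to_nfd`, `to_nfc`, and `same_text` are illustrative, and comparing in NFD is just one reasonable choice.

```python
import unicodedata


def to_nfd(text: str) -> str:
    """Fully decompose: base characters followed by combining marks."""
    return unicodedata.normalize("NFD", text)


def to_nfc(text: str) -> str:
    """Decompose, then recombine into precomposed characters where possible."""
    return unicodedata.normalize("NFC", text)


def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normalization form."""
    return to_nfd(a) == to_nfd(b)


precomposed = "\u00f1"
decomposed = "n\u0303"

print(to_nfd(precomposed) == decomposed)     # True
print(to_nfc(decomposed) == precomposed)     # True
print(same_text(precomposed, decomposed))    # True

# NFC can also replace a character with an equivalent one:
# the Angstrom sign becomes Latin capital letter A with ring above.
print(f"U+{ord(to_nfc(chr(0x212B))):04X}")   # U+00C5
```

Normalizing both sides to the same form before comparing is what lets records that differ only in composition match each other.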

Conclusion

Many library systems are moving from MARC-8 to UTF-8 character encodings. This is a good move because it gives you the ability to accurately reflect the data, while lessening the possibility of error. Backstage Library Works can return data in MARC-8 or UTF-8 (decomposed or composed) form.