UTF-8 « MARS Authority Control

Posts Tagged ‘UTF-8’

MARS Authority Control Updates

Thursday, March 31st, 2011

This quarter has been a busy one for us at Backstage in the Authority Control / Automation service dept. We always get together to discuss upcoming goals and prioritize them throughout the year, just as everyone on this list does in their own goal-setting meetings. And, just like everyone else, we sometimes find that our goals change, or that new goals come into play that cause us to drop everything and focus on those instead. (more…)

Tags:Announcements, Authority Control, RDA, UTF-8
Posted in Authority Control, Automated Authority Control, RDA | Comments Closed

Russian Ligatures in MARC

Monday, September 13th, 2010

We have had a few other libraries ask about the Russian ligatures, since there seem to be several inconsistencies in how these diacritics are coded in LC’s authority records. A couple of months ago, one of our programmers looked into this issue. Here is the information that he found: (more…)

Tags:Bibliographic Records, UTF-8
Posted in Authority Records, Automated Authority Control, Bibliographic Records, MARC | 4 Comments »

MARC8 and UTF8 – what does it mean?

Tuesday, August 25th, 2009

Last week we looked at the Anatomy of an Authority Record, but what if we look even deeper? Both Bibliographic and Authority records are essentially text, made up of characters encoded in either MARC8 or UTF8. But what does that mean, and what’s the difference?

MARC-8

The MARC-8 character set uses 8-bit characters, so by default it can represent only ASCII and ANSEL (extended Latin) text. Because this limits the number of available characters, MARC-8 includes methods to extend the set of displayable characters. One method is to include both spacing base characters and nonspacing modifier characters (diacritics).

Spacing or nonspacing refers to cursor movement: a spacing character moves the cursor, a nonspacing character does not. A nonspacing character is always associated with a single spacing character, but multiple nonspacing characters may be associated with the same spacing character.

In MARC-8, a nonspacing character precedes its associated spacing character, so the cursor moves only after the complete character has been displayed. An ‘ñ’, for example, is stored as the nonspacing tilde followed by the base character ‘n’. This method allows basic and extended Latin characters to be displayed using the default character set.

Another method MARC-8 uses to extend the displayable characters is to use alternate character sets. This is done by using escape sequences, special character sequences containing codes to indicate which character set is being selected for display. Possible alternate sets include subscripts, superscripts, Hebrew, Cyrillic, Arabic, and Greek. Chinese, Japanese and Korean are also possible by this method using EACC character encodings for these characters. While this method allows for many additional characters to be used, it is still limited and somewhat burdensome.
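As a rough illustration of how escape sequences switch character sets, here is a minimal Python sketch that splits a byte string into runs by the set currently selected. The function name split_by_charset is made up for this example, and the short table of escape codes is an illustrative subset written from memory, so please check it against the LC MARC-8 specification before relying on it. The sketch only handles single-byte sets selected into the primary (G0) position.

```python
ESC = 0x1B  # the escape byte that introduces a character set switch

# Illustrative subset only (an assumption to verify against the LC
# MARC-8 specification): the two bytes after ESC select a G0 set.
G0_SETS = {
    b"(B": "Basic Latin (ASCII)",
    b"(N": "Basic Cyrillic",
    b"(S": "Basic Greek",
    b"(2": "Basic Hebrew",
    b"(3": "Basic Arabic",
}

def split_by_charset(data: bytes):
    """Return a list of (character set name, raw bytes) runs."""
    segments = []
    current = "Basic Latin (ASCII)"  # default set at the start of a field
    buf = bytearray()
    i = 0
    while i < len(data):
        code = data[i + 1:i + 3]
        if data[i] == ESC and code in G0_SETS:
            # Close the run collected so far, then switch sets.
            if buf:
                segments.append((current, bytes(buf)))
                buf = bytearray()
            current = G0_SETS[code]
            i += 3
        else:
            buf.append(data[i])
            i += 1
    if buf:
        segments.append((current, bytes(buf)))
    return segments

# Placeholder payload bytes; a real record would hold encoded Cyrillic here.
print(split_by_charset(b"abc\x1b(Nxyz\x1b(Bdef"))
```

Every run still has to be decoded with the table for its own set, which is part of what makes this approach burdensome.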

UTF-8

As computers needed to support a wider range of characters, a number of computer companies formed a group, the Unicode Consortium, to define the Unicode Standard. The standard was originally based on 16-bit characters and has since been extended beyond that range. UTF-8 is a method of encoding these characters as sequences of 1 to 4 bytes (1 to 3 bytes for characters within the original 16-bit range). Unicode, using the UTF-8 encoding, was accepted as an alternative character set for use in MARC records, with an initial limitation to using only the Unicode characters that have corresponding characters in the MARC-8 character set.
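To make the variable-length encoding concrete, here is a small Python sketch that counts the UTF-8 bytes for a few sample characters. The helper name utf8_length and the sample characters are our own choices for illustration. Characters within the 16-bit range take 1 to 3 bytes; characters added to Unicode beyond that range take 4.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes used to store one character in UTF-8."""
    return len(ch.encode("utf-8"))

samples = {
    "n": "Latin small letter n (ASCII)",
    "\u00f1": "Latin small letter n with tilde",
    "\u0434": "Cyrillic small letter de",
    "\u4e2d": "CJK ideograph",
    "\U0001d11e": "musical G clef (outside the 16-bit range)",
}

for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {utf8_length(ch)} byte(s)")
```

Plain ASCII text is byte-for-byte identical in UTF-8, which is one reason the encoding was practical to adopt.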

Decomposed

Unicode has definitions for nonspacing characters like MARC-8, except that the nonspacing character follows the character it modifies: cursor movement occurs before the nonspacing character is displayed. Decomposed UTF-8 characters are similar to MARC-8 diacritics, in that a base character is modified by one or more nonspacing characters. For example, a base character ‘n’ followed by a nonspacing ‘~’ would combine to display ‘ñ’. Decomposed is also the current LC standard.
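The difference in ordering can be sketched in Python. The function below is a hypothetical helper, not part of any library: it takes default-set MARC-8 bytes, holds each nonspacing mark until its base character arrives, and then writes the mark after the base, as decomposed Unicode expects. The mapping table is a small illustrative subset of the ANSEL nonspacing marks, and the sketch assumes the base characters are plain ASCII.

```python
# Illustrative subset (verify against the LC code tables before use):
# ANSEL nonspacing byte -> Unicode combining character.
ANSEL_COMBINING = {
    0xE1: "\u0300",  # grave
    0xE2: "\u0301",  # acute
    0xE3: "\u0302",  # circumflex
    0xE4: "\u0303",  # tilde
    0xE8: "\u0308",  # diaeresis (umlaut)
}

def marc8_latin_to_decomposed(data: bytes) -> str:
    """Move each MARC-8 diacritic from before its base to after it."""
    out = []
    pending = []  # marks seen but not yet attached to a base character
    for b in data:
        if b in ANSEL_COMBINING:
            pending.append(ANSEL_COMBINING[b])
        elif b < 0x80:
            out.append(chr(b))   # ASCII base character
            out.extend(pending)  # marks now follow the base
            pending = []
        else:
            raise ValueError(f"byte 0x{b:02X} is outside this sketch's table")
    out.extend(pending)  # stray marks with no base are kept as-is
    return "".join(out)

text = marc8_latin_to_decomposed(b"Espa\xe4nol")
print([f"U+{ord(c):04X}" for c in text])
```

In the result, U+006E (n) is immediately followed by U+0303 (combining tilde), the reverse of the order in the MARC-8 bytes.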

Precomposed

Unicode also includes many precomposed characters. These are spacing characters that are the equivalent of one or more nonspacing characters and a spacing character. A precomposed ‘ñ’, instead of having a base character and an additional nonspacing diacritic mark, combines all those elements into one code which represents the character with the diacritic as a whole. Because the same character can then be stored in more than one way, text has to be normalized before it can be reliably compared or matched, which makes the normalization routine more involved.
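A short Python sketch shows why this matters: the precomposed and decomposed forms of ‘ñ’ look identical on screen, but they are different sequences of codes and different bytes, so a plain comparison treats them as different strings. The variable names are ours, chosen for illustration.

```python
precomposed = "\u00f1"   # one code: n with tilde
decomposed = "n\u0303"   # two codes: n, then combining tilde

print(len(precomposed), len(decomposed))  # 1 2
print(precomposed.encode("utf-8"))        # b'\xc3\xb1'
print(decomposed.encode("utf-8"))         # b'n\xcc\x83'
print(precomposed == decomposed)          # False
```

A heading stored one way will not match the same heading stored the other way unless both are brought to a common form first.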

Normalization

To handle the various ways a composite character could be represented, standardized normalization forms have been defined. These include NFD (Normalization Form D, decomposed) and NFC (Normalization Form C, composed). In NFD, every character that can be decomposed is converted to its most decomposed form following rules for canonical decomposition. In NFC, the characters are first decomposed as in NFD, then composed into precomposed (composite) forms following canonical rules. This may result in the sequence of characters for a given character changing to an alternate, equivalent form.
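Python's standard unicodedata module implements both forms, so the behavior described above can be tried directly. This is a minimal sketch; the helper name same_text is our own, and comparing in NFC rather than NFD is an arbitrary choice for the example.

```python
import unicodedata

precomposed = "\u00f1"   # n with tilde as a single code
decomposed = "n\u0303"   # n followed by combining tilde

# NFD splits the precomposed form; NFC joins the decomposed form.
nfd = unicodedata.normalize("NFD", precomposed)
nfc = unicodedata.normalize("NFC", decomposed)
print(nfd == decomposed, nfc == precomposed)  # True True

def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_text(precomposed, decomposed))  # True

# A character can change to an alternate, equivalent form: the angstrom
# sign (U+212B) becomes the letter A with ring above (U+00C5) under NFC.
angstrom_sign = "\u212b"
print(f"U+{ord(unicodedata.normalize('NFC', angstrom_sign)):04X}")  # U+00C5
```

Whichever form is chosen, the important part is applying the same one to both sides before matching.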

Conclusion

Many library systems are moving from MARC-8 to UTF-8 character encodings. This is a good move because it gives you the ability to accurately reflect the data, while lessening the possibility of error. Backstage Library Works can return data in MARC-8 or UTF-8 (decomposed or composed) form.

Tags:MARC, MARC-8, Normalization, UTF-8
Posted in MARC | Comments Closed
