RDA 1.4
STEP 1.4 : FILE DOWNLOAD AND FILE FORMAT
1.1 - Records Delivered by Backstage | | |
---|---|---|
☐ Website | ☐ FTP | |
☐ MARC-8 | ☐ UTF-8 | |
UTF-8 vs MARC-8 format
The MARC-8 character set uses 8-bit characters. Because 8 bits allow only a limited number of characters, MARC-8 distinguishes two kinds of characters to extend what can be displayed: spacing characters, which advance the cursor, and non-spacing characters (diacritics), which do not advance it and are combined with a spacing character.
MARC-8 also uses alternate character sets to extend its repertoire further. These are selected with escape sequences, which are special codes that indicate which character set is being selected for display: subscripts, superscripts, CJK characters, etc.
While these methods allow many additional characters to be used, MARC-8 is still limited and somewhat burdensome to process.
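The handling of non-spacing characters can be sketched in a few lines of Python. This is a toy illustration, not a MARC-8 decoder: the function name `marc8_ascii_to_unicode` and the three-entry table are assumptions made for the example, and escape sequences are not handled at all. It relies on the fact that in MARC-8 a non-spacing diacritic byte comes before the spacing character it modifies, whereas Unicode places the combining character after it.

```python
# Toy sketch only: a real MARC-8 decoder must also handle escape
# sequences and the complete set of non-spacing characters.

# Partial mapping from MARC-8 non-spacing bytes to Unicode combining characters.
MARC8_COMBINING = {
    0xE1: "\u0300",  # grave accent
    0xE2: "\u0301",  # acute accent
    0xE8: "\u0308",  # umlaut / diaeresis
}


def marc8_ascii_to_unicode(data: bytes) -> str:
    """Convert ASCII plus a few diacritics, moving each diacritic after its base."""
    out = []
    pending = []  # diacritics seen since the last spacing character
    for byte in data:
        if byte in MARC8_COMBINING:
            pending.append(MARC8_COMBINING[byte])
        elif byte < 0x80:
            out.append(chr(byte))  # spacing character first ...
            out.extend(pending)    # ... then its diacritics, as Unicode expects
            pending.clear()
        else:
            raise ValueError(f"byte 0x{byte:02X} is not covered by this sketch")
    if pending:
        raise ValueError("diacritic with no spacing character after it")
    return "".join(out)


# "caf" + acute (0xE2) + "e": the diacritic precedes the letter it modifies.
text = marc8_ascii_to_unicode(b"caf\xe2e")
print(text)  # displays as "café" where combining characters are rendered
```

The result is a decomposed Unicode string (base letter followed by a combining accent), which is why the choice of normalization form matters once records are delivered as UTF-8.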
UTF-8 is an encoding of the Unicode standard. It encodes each character as a sequence of 1 to 4 bytes; characters in the Basic Multilingual Plane need at most 3. Like MARC-8, Unicode defines non-spacing (combining) characters, but they are handled differently: in Unicode a combining character follows the character it modifies, whereas in MARC-8 it precedes it.
Unicode also includes many precomposed characters. Each is a single spacing character that is equivalent to a spacing character combined with one or more diacritics; for example, é is equivalent to e followed by a combining acute accent. Because the same composite character can therefore be represented in more than one way, normalization forms have been defined.
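As a quick check of how many bytes UTF-8 uses per character, the following sketch encodes four sample characters with Python's built-in `str.encode`. The sample characters and the helper name `utf8_length` are chosen for illustration only; characters outside the Basic Multilingual Plane, such as the last sample, take 4 bytes.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


samples = [
    "A",           # U+0041, ASCII letter
    "\u00e9",      # U+00E9, precomposed e with acute
    "\u4e2d",      # U+4E2D, a CJK ideograph
    "\U0001d11e",  # U+1D11E, musical G clef (outside the BMP)
]

lengths = {ch: utf8_length(ch) for ch in samples}

for ch, n in lengths.items():
    print(f"U+{ord(ch):04X}: {n} byte(s) -> {ch.encode('utf-8').hex(' ')}")
```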
Normalization Form Decomposed (NFD) and Normalization Form Composed (NFC) are standardized forms for handling composite characters.
In NFD, every character that can be decomposed is converted to its fully decomposed form following the rules for canonical decomposition.
In NFC, the characters are first decomposed as in NFD, then recomposed into precomposed (composite) forms following canonical rules. As a result, the sequence of code points for a given character may change to a different but canonically equivalent sequence.
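The two forms can be compared with Python's standard `unicodedata.normalize`. This is a minimal sketch; the sample characters are chosen for illustration and are not taken from any particular record.

```python
import unicodedata

composed = "\u00e9"     # é as one precomposed character
decomposed = "e\u0301"  # e followed by a combining acute accent

# The two strings display the same but are different code point sequences.
print(composed == decomposed)  # False

# NFD splits the precomposed character; NFC recombines the sequence.
nfd = unicodedata.normalize("NFD", composed)
nfc = unicodedata.normalize("NFC", decomposed)
print([f"U+{ord(c):04X}" for c in nfd])  # ['U+0065', 'U+0301']
print([f"U+{ord(c):04X}" for c in nfc])  # ['U+00E9']

# NFC can also replace a character with an equivalent one:
# ANGSTROM SIGN (U+212B) becomes LATIN CAPITAL LETTER A WITH RING ABOVE (U+00C5).
angstrom_nfc = unicodedata.normalize("NFC", "\u212b")
print(f"U+{ord(angstrom_nfc):04X}")  # U+00C5
```

Normalizing both sides to the same form before comparing or matching strings avoids treating equivalent sequences as different.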