Difference between revisions of "RDA 1.2"
Latest revision as of 16:51, 28 March 2013
RDA 1.2: Records Delivered by Backstage
UTF-8 vs MARC-8 format
MARC-8 has been the standard character encoding for MARC records since 1968, and nearly every system that can export records in MARC format can do so in MARC-8. The MARC-8 character set uses 8-bit characters. Because this limits the number of characters available, MARC-8 also includes methods to extend the displayable characters: spacing characters (which advance the cursor) and non-spacing characters (diacritics, which do not).
MARC-8 also uses alternate character sets to address the diacritic display issue. These are selected with escape sequences, which are special codes that indicate which character set applies to the characters that follow: subscripts, superscripts, CJK characters, etc.
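As a rough illustration of how escape sequences show up in raw data, the sketch below scans a byte string for the ESC control byte (0x1B), which introduces an escape sequence. The function name and the sample bytes are made up for this example and do not come from any real record.

```python
ESC = 0x1B  # the ESC control byte that introduces an escape sequence


def escape_positions(data: bytes) -> list[int]:
    """Return the offsets of every ESC byte in raw MARC-8 data."""
    return [offset for offset, byte in enumerate(data) if byte == ESC]


# Hypothetical sample: two escape sequences embedded in ordinary text.
sample = b"H\x1bb2\x1bsO"
positions = escape_positions(sample)
print(positions)
```

A record with no ESC bytes stays in its default character sets throughout, so an empty result is a quick sign that no alternate sets are in play.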
While these methods allow many additional characters to be used, MARC-8 is still limited and somewhat burdensome. The MARC-21 format itself also imposes size limits: no record can exceed 99,999 bytes and no field can exceed 9,999 bytes, because the record length is stored in five digits and each field length in four. If a record exceeds the field or record size limits, data may be truncated or lost.
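A minimal sketch of a pre-export size check follows. It assumes the limits are measured on the encoded bytes of each field, and it ignores the leader, directory, and terminator overhead of a real record, so treat it as an approximation. The function names and the `(tag, value)` input shape are assumptions for illustration.

```python
MAX_RECORD_BYTES = 99_999  # record size limit
MAX_FIELD_BYTES = 9_999    # field size limit


def oversize_fields(fields: list[tuple[str, str]], encoding: str = "utf-8") -> list[str]:
    """Return the tags of fields whose encoded length exceeds the field limit."""
    return [tag for tag, value in fields
            if len(value.encode(encoding)) > MAX_FIELD_BYTES]


def record_too_large(fields: list[tuple[str, str]], encoding: str = "utf-8") -> bool:
    """Approximate check: sum of encoded field lengths against the record limit."""
    total = sum(len(value.encode(encoding)) for _, value in fields)
    return total > MAX_RECORD_BYTES


# Hypothetical record: a short title and one overlong note.
record = [("245", "Title"), ("520", "x" * 10_000)]
print(oversize_fields(record))
print(record_too_large(record))
```

Measuring the encoded bytes rather than the character count matters for UTF-8 data, where a single character may take several bytes.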
UTF-8 has been in use since early 1993. It is a variable-width encoding of Unicode built on 8-bit units: each character occupies between 1 and 4 bytes, whereas most MARC-8 characters occupy a single byte. The main difference between MARC-8 and UTF-8 is that UTF-8 allows far more characters to be used within the records. Because characters outside basic ASCII take more than one byte in UTF-8, the files tend to be larger.
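The variable byte length of UTF-8 can be seen with a short Python sketch. The helper name `utf8_length` and the sample characters are chosen only for illustration.

```python
def utf8_length(character: str) -> int:
    """Return how many bytes a single character occupies in UTF-8."""
    return len(character.encode("utf-8"))


# One sample character for each possible UTF-8 length.
for character in ["A", "\u00f1", "\u4e2d", "\U0001F600"]:
    print(f"U+{ord(character):04X} takes {utf8_length(character)} byte(s)")
```

Plain ASCII letters stay at one byte, so a mostly English file grows very little when delivered in UTF-8.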
If your system uses UTF-8, please also let us know whether the characters are in precomposed or decomposed format. Precomposed format stores a letter and its diacritic as a single combined character (e.g., n and ~ are combined to form ñ). Decomposed format stores them separately, as the base letter followed by a combining diacritic.
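One way to see which format a piece of text uses is to look for combining marks, as in the sketch below. It relies on Python's standard `unicodedata` module. The helper name `has_combining_marks` is an assumption for this example, and the check is only a heuristic, since some characters have no precomposed form at all.

```python
import unicodedata

precomposed = "\u00f1"   # ñ as one code point
decomposed = "n\u0303"   # n followed by a combining tilde


def has_combining_marks(text: str) -> bool:
    """True if any character in the text is a combining mark."""
    return any(unicodedata.combining(ch) != 0 for ch in text)


print(len(precomposed), has_combining_marks(precomposed))
print(len(decomposed), has_combining_marks(decomposed))
```

Both strings display the same way, which is why the format usually has to be checked in the data rather than on screen.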
Additionally, to handle the various ways a composite character can be represented, normalization forms have been defined. Normalization Form Decomposed (NFD) and Normalization Form Composed (NFC) are the standardized forms for handling composite characters. In NFD, every character that can be decomposed is converted to its most decomposed form, following the rules for canonical decomposition. In NFC, the characters are first decomposed as in NFD, then recombined into precomposed (composite) forms following canonical rules. As a result, the sequence of code points that represents a given character may change into an alternate, equivalent sequence.
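The two normalization forms can be tried out with `unicodedata.normalize` from the Python standard library. This is a small sketch using ñ as the sample character, and the helper names `to_nfd` and `to_nfc` are illustrative.

```python
import unicodedata


def to_nfd(text: str) -> str:
    """Convert text to Normalization Form Decomposed."""
    return unicodedata.normalize("NFD", text)


def to_nfc(text: str) -> str:
    """Convert text to Normalization Form Composed."""
    return unicodedata.normalize("NFC", text)


composed = "\u00f1"  # ñ as one code point
separate = to_nfd(composed)

print([f"U+{ord(ch):04X}" for ch in separate])
print(to_nfc(separate) == composed)
```

Normalizing both sides to the same form before comparing strings avoids treating equivalent records as different.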
Default
Files are delivered in UTF-8 format through the website.