Latest revision as of 16:51, 28 March 2013

RDA 1.2: Records Delivered by Backstage


UTF-8 vs MARC-8 format

MARC-8 has been the standard character encoding for MARC records since 1968. Nearly every system that can export records in MARC format can do so in MARC-8. The MARC-8 character set uses 8-bit characters. Because this limits the number of characters available, MARC-8 also includes methods to extend the displayable characters: spacing characters, which advance the cursor, and non-spacing characters (diacritics), which do not advance it and combine with the base character that follows them.

MARC-8 also uses alternate character sets to extend the range of characters that can be displayed. These are selected with escape sequences, which are special codes that indicate which character set is being used: subscripts, superscripts, CJK characters, etc.

While these methods allow for many additional characters to be used, they are still limited and somewhat burdensome. In addition, the MARC-21 format has built-in size limits: no record can exceed 99,999 characters, and no field can exceed 9,999 characters. These lengths are counted in bytes, so characters that take more than one byte use up the limit faster. If a record exceeds the field or record size limits, there may be truncation or loss of data.

UTF-8 has been in use since early 1993. It is a variable-length encoding of the Unicode character set built from 8-bit units, not 16-bit characters. The main difference between MARC-8 and UTF-8 is that UTF-8 allows many more characters to be used within the records. Since most characters outside the basic Latin (ASCII) range take more than one byte in UTF-8, the files tend to be larger in size. Each character in UTF-8 is between 1 and 4 bytes long, whereas most MARC-8 characters are 1 byte long (CJK characters use 3 bytes).
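The variable length of UTF-8 characters is easy to see in practice. The sketch below uses only Python's built-in string encoding. The helper name `utf8_length` is illustrative, and the sample characters are arbitrary choices, one for each possible length.

```python
def utf8_length(char):
    """Return the number of bytes one character occupies in UTF-8."""
    return len(char.encode("utf-8"))


if __name__ == "__main__":
    samples = [
        ("a", "Latin small letter a"),
        ("\u00f1", "Latin small letter n with tilde"),
        ("\u20ac", "euro sign"),
        ("\U0001F600", "emoji outside the Basic Multilingual Plane"),
    ]
    for char, label in samples:
        print(f"U+{ord(char):04X} {label}: {utf8_length(char)} byte(s)")
```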

If your system uses UTF-8, please also let us know whether the characters are in precomposed or decomposed format. Precomposed format stores a letter and its diacritic as a single character (e.g., ñ). Decomposed format stores them as separate characters: the base letter n followed by a combining tilde.
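The two formats look identical on screen but are different sequences in the file. As a small illustration, using ñ as the example character and nothing beyond the Python standard library:

```python
import unicodedata

precomposed = "\u00f1"    # single code point: n with tilde
decomposed = "n\u0303"    # base letter n followed by combining tilde


def code_points(text):
    """List the Unicode names of the code points that make up a string."""
    return [unicodedata.name(ch) for ch in text]


if __name__ == "__main__":
    print(code_points(precomposed))
    print(code_points(decomposed))
    # The strings render the same but do not compare equal.
    print(precomposed == decomposed)
    print(len(precomposed.encode("utf-8")), len(decomposed.encode("utf-8")))
```

Because the two strings are not equal byte for byte, a search or match on one form can miss records stored in the other, which is why the format of the incoming file matters.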

Additionally, to handle the various ways a composite character could be encoded, normalization forms have been defined. Normalization Form Decomposed (NFD) and Normalization Form Composed (NFC) are standardized forms for handling composite characters. In NFD, every character that can be decomposed is converted to its most decomposed form following the rules for canonical decomposition. In NFC, the characters are first decomposed as in NFD, then composed into precomposed (composite) forms following canonical rules. Either form may replace the sequence of characters used for a given character with an alternate, equivalent sequence.
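One way to find out which form a file uses is to normalize its text both ways and compare the results with the original. The sketch below does this with `unicodedata.normalize` from the Python standard library. The function name `describe_form` and its return labels are hypothetical, chosen only for this illustration.

```python
import unicodedata


def describe_form(text):
    """Report whether text is already in NFC, NFD, both, or neither."""
    is_nfc = unicodedata.normalize("NFC", text) == text
    is_nfd = unicodedata.normalize("NFD", text) == text
    if is_nfc and is_nfd:
        return "unaffected"          # nothing to compose or decompose
    if is_nfc:
        return "precomposed (NFC)"
    if is_nfd:
        return "decomposed (NFD)"
    return "mixed"                   # contains both kinds of sequence


if __name__ == "__main__":
    composed = "Espa\u00f1a"
    print(describe_form(composed))
    print(describe_form(unicodedata.normalize("NFD", composed)))
    print(describe_form("Spain"))
```

Text that contains only unaccented characters is the same in both forms, so the check reports it as unaffected instead of picking one form arbitrarily.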

Default

Files are delivered in UTF-8 format through the website.
