Difference between revisions of "Dedupe 2.0"

From AC Wiki

Latest revision as of 11:52, 2 April 2013

Dedupe 2.0: Group 1

Section 2.0 of the dedupe profile covers hits on the 010, 020, and 022 fields and the subsequent verification criteria. These fields are typically considered more reliable within the bibliographic record than other fields such as the 245 (Title). Because of this, you may consider relaxing some of the verification points, depending on how much trust you put in the cataloging integrity of these particular fields within your records.

Group 1 Hit Fields

Grouping allows the user to have different parameters for different potential matches. Since numeric fields such as the LCCN, ISBN, and ISSN are fairly reliable in most cases, they are grouped together and will have the same verification parameters.

010 - LCCN

The Library of Congress Control Number (LCCN) assigned to a catalogued item is recorded in the 010 tag. This number is used to distinguish each record from every other record in the Library of Congress database.

The LCCN has three parts: the prefix, a year (represented by two or four digits), and a serial number (six digits), followed by a trailing space in the case of pre-2001 LCCNs. Suffixes and revision dates following some printed LCCNs may or may not be keyed into the MARC record.

The following subfields are valid in the 010 tag:

  • a - LC control number
  • b - NUCMC control number -- This subfield is used only in archival/manuscripts format
  • z - Cancelled/invalid LC control number
 
 010 $a   91001938 
 010 $a  2001012884
 010 $a   08000123 $z   80000123 
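The three-part structure described above can be sketched in Python. This is only an illustration, not the dedupe program's own parser: the function name `split_lccn` is hypothetical, and the sketch assumes the prefix is either blank or made of lowercase letters.

```python
import re

# Optional alphabetic prefix, a 2- or 4-digit year, then a 6-digit serial number.
LCCN_RE = re.compile(r"^(?P<prefix>[a-z]*)(?P<year>\d{4}|\d{2})(?P<serial>\d{6})$")

def split_lccn(subfield_a):
    """Split the contents of 010 $a into (prefix, year, serial), or return None."""
    compact = subfield_a.replace(" ", "")  # blanks only pad the fixed-width layout
    match = LCCN_RE.match(compact)
    if match is None:
        return None
    return match.group("prefix"), match.group("year"), match.group("serial")

print(split_lccn("   91001938 "))   # ('', '91', '001938')
print(split_lccn("  2001012884"))   # ('', '2001', '012884')
```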

020 - ISBN

This field records the International Standard Book Number(s) (ISBN) assigned to a catalogued item. Each valid ISBN is entered in a separate 020 tag; two or more invalid or cancelled ISBNs may be recorded in a single 020 tag.

The following subfields are valid in the 020 tag:

  • a - International Standard Book Number
  • c - Terms of availability
  • z - Cancelled/invalid ISBN

Valid ISBNs are always 10 or 13 digits long; all ISBNs are assumed valid unless they have too many or too few digits, or unless a shelflist card specifically identifies an ISBN as cancelled or invalid:

 
 020 $a 0049812187
 020 $a 9780049853217

The only letter that is ever part of an ISBN is X (roman numeral 10); it must always be capitalized:

 
 020 $a 012817409X
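The length and character rules above can be expressed as a short Python check. This is a sketch of those two rules only (it does not compute the ISBN check digit), and the function name `looks_like_valid_isbn` is hypothetical.

```python
import re

# Ten characters (nine digits plus a final digit or capital X), or thirteen digits.
ISBN_SHAPE = re.compile(r"^(?:\d{9}[\dX]|\d{13})$")

def looks_like_valid_isbn(value):
    """Return True if the value has the length and characters of a valid ISBN."""
    return ISBN_SHAPE.match(value) is not None

print(looks_like_valid_isbn("012817409X"))  # True
print(looks_like_valid_isbn("012817409x"))  # False: X must be capitalized
```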

Subfield $a may contain qualifying information (publisher, binding, format, volume numbers). This information is usually entered within parentheses; separate pieces of information with space-colon-space:

 
 020 $a 001281947X (pbk.)
 020 $a 0018942113 (Bally Bros. : pbk.)
 020 $a 0137183911 (large print)

Prices appearing after ISBNs are catalogued in subfield $c:

 
 020 $a 0174620684 :$c $21.95
 020 $a 0049812187 (pbk.) :$c $17.40
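As an illustration of how the pieces of an 020 field fit together, the following Python sketch separates the ISBN, the parenthesized qualifiers, and the price. The function name `parse_020` and the returned dictionary keys are assumptions made for this example, and the sketch handles only the layouts shown above.

```python
import re

# $a <isbn> [(<qualifiers>)] [:$c <price>]
FIELD_020 = re.compile(
    r"^\$a\s+(?P<isbn>\S+)"
    r"(?:\s+\((?P<qual>[^)]*)\))?"
    r"(?:\s*:\s*\$c\s+(?P<price>.+))?$"
)

def parse_020(field):
    """Split an 020 field string into ISBN, qualifier list, and price."""
    match = FIELD_020.match(field.strip())
    if match is None:
        return None
    qual = match.group("qual")
    # Qualifiers are separated by space-colon-space inside the parentheses.
    qualifiers = [q.strip() for q in qual.split(" : ")] if qual else []
    return {
        "isbn": match.group("isbn"),
        "qualifiers": qualifiers,
        "price": match.group("price"),
    }

print(parse_020("$a 0018942113 (Bally Bros. : pbk.)"))
print(parse_020("$a 0049812187 (pbk.) :$c $17.40"))
```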

022 - ISSN

An International Standard Serial Number (ISSN) is an identification number assigned to a serial (the entire serial, not just a particular issue). The ISSN is similar in function to the ISBN assigned to books. The 022 tag is repeated whenever a serial has two or more valid ISSNs. This sometimes happens when a serial changes its title and a new ISSN is assigned; when a record is created for the new title, both the new and the old ISSNs (both still valid) are entered in the new record.

The following subfields are valid in the 022 tag:

  • a - International Standard Serial Number
  • l - ISSN-L
  • m - Cancelled ISSN-L
  • y - Incorrect ISSN
  • z - Cancelled ISSN

This field contains only digits, except that the last character may be X (roman numeral 10):

 
 022 $a 1234-5678
 022 $a 9876-123X

Subfield $l links together various media versions of a continuing resource:

 
 022 $a 1234-5678 $l 1234-1231
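A minimal Python sketch of reading an 022 field follows. It splits the field into its subfields and checks the shape described above (four digits, a hyphen, three digits, then a digit or X). The function names are hypothetical, and repeated subfields are not handled.

```python
import re

ISSN_SHAPE = re.compile(r"^\d{4}-\d{3}[\dX]$")

def split_subfields(field):
    """Return a dict mapping each subfield code to its value."""
    pairs = re.findall(r"\$(\w)\s*([^$]*)", field)
    return {code: value.strip() for code, value in pairs}

def is_issn_shaped(value):
    """Return True if the value has the digits-hyphen-digits shape of an ISSN."""
    return ISSN_SHAPE.match(value) is not None

subfields = split_subfields("$a 1234-5678 $l 1234-1231")
print(subfields)                       # {'a': '1234-5678', 'l': '1234-1231'}
print(is_issn_shaped(subfields["a"]))  # True
```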

Group 1 - Verification Criteria and Terms

The Verify criteria are used to reduce the number of initial matches found using the Hit criteria. This allows you to define more precisely what constitutes a good match, based on the other fields present in your records. Select fields on the basis of the match rate you expect from the deduplication: selecting fewer fields results in more matches; selecting more fields results in fewer but better matches. Verify criteria should also be selected in conjunction with the Hit criteria.

Method

There are three methods for verifying: FULL, PARTIAL, and WITHIN. Each method compares the data found in a specific field against the same field in a potential match record.

Full

Full compares the entire verify string up to the verify length.

Partial

Partial truncates both compare strings to the length of the shorter string, then does a full compare:

 
 Record A:
   The fox and the hound.
 
 Record B:
   The fox.
 
 Both are truncated to "The fox" and compared.
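The Full and Partial methods can be sketched in Python as follows. This is an illustration under stated assumptions, not the program's actual code: the function names are hypothetical, both strings are assumed to be normalized already, and the sketch assumes the verify length is applied before the Partial truncation.

```python
def full_match(a, b, verify_length=None):
    """Full: compare both strings in full, up to the verify length."""
    return a[:verify_length] == b[:verify_length]

def partial_match(a, b, verify_length=None):
    """Partial: cut both strings to the shorter one's length, then compare."""
    a, b = a[:verify_length], b[:verify_length]
    shortest = min(len(a), len(b))
    return a[:shortest] == b[:shortest]

print(full_match("THE FOX AND THE HOUND", "THE FOX"))     # False
print(partial_match("THE FOX AND THE HOUND", "THE FOX"))  # True
```

Note that under this reading an empty string partially matches anything, which is one reason to pair Partial with other verify fields.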

Within

Within searches each compare string truncated at verify length against the full un-truncated string of the other field:

 
 Record A:
   Cat.
 
 Record B:
   The cat in the hat.
 
 "Cat" in Record A will verify against "The cat in the hat" in Record B.
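A minimal Python sketch of the Within method, assuming both strings have already been normalized (uppercased, punctuation removed). The function name `within_match` is hypothetical.

```python
def within_match(a, b, verify_length=None):
    """Within: look for each string, cut at the verify length,
    anywhere inside the full, un-truncated other string."""
    return a[:verify_length] in b or b[:verify_length] in a

print(within_match("CAT", "THE CAT IN THE HAT"))  # True
print(within_match("DOG", "THE CAT IN THE HAT"))  # False
```

Because the search runs in both directions, it does not matter which record holds the shorter string.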

Normalization

Normalization refers to how the string will be presented when compared to another string. Note that any normalization will not change anything in the record, but is only used when the program compares the strings.

Types of normalization are:

  • NACO/CJK - retains spaces and subfield delimiters
  • FULL - all spaces and subfield delimiters removed
 
 original field:
   $a Daniel Boone :$b a pioneer.
 
 normalized (naco/cjk):
   $ DANIEL BOONE $ A PIONEER
 
 normalized (full):
   DANIELBOONEAPIONEER
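The two normalization types can be approximated in Python as shown below. This is a rough sketch that reproduces the example above; real NACO normalization has further rules (diacritics, special characters) that are not modeled here, and both function names are hypothetical.

```python
import re

def normalize_naco(field):
    """Uppercase, drop punctuation, keep spaces and a bare $ per subfield."""
    text = re.sub(r"\$[a-z0-9]", " $ ", field)  # $a, $b, ... become a bare delimiter
    text = re.sub(r"[^\w\s$]", " ", text)       # remove punctuation
    return " ".join(text.upper().split())       # collapse runs of spaces

def normalize_full(field):
    """Same as above, with all spaces and subfield delimiters removed."""
    return normalize_naco(field).replace("$", "").replace(" ", "")

field = "$a Daniel Boone :$b a pioneer."
print(normalize_naco(field))  # $ DANIEL BOONE $ A PIONEER
print(normalize_full(field))  # DANIELBOONEAPIONEER
```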

Length & Words

This refers to how much of a given string the program will present for potential matches:

  • Length - Refers to the number of characters for the verify field. The number of characters to be used is 1-2048, or all. Using a length of 10 gives us this example:
 
 original heading:
   $a Daniel Boone :$b a pioneer.
 
 normalized (full), length = 10:
   1-------10
   DANIELBOONEAPIONEER
   (only the first 10 characters, DANIELBOON, are compared)
  • Words - Refers to a count of words to match within a given string:
 
 original heading:
   $a Daniel Boone :$b a pioneer.
 
 words = 2:
   Daniel, Boone, A, or Pioneer are all possibilities for keyword matching.

NOTE: Non-filers are excluded from Words.
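The Length setting can be sketched in Python as a simple truncation of the normalized string. The Words function below is only one possible reading of the setting (at least N words shared between the two strings); that interpretation, the function names, and the omission of non-filer handling are all assumptions made for illustration.

```python
def apply_length(normalized, length=None):
    """Keep the first `length` characters; None stands for 'all'."""
    return normalized if length is None else normalized[:length]

def shared_words(a, b):
    """Return the set of words that appear in both strings, ignoring case."""
    return set(a.upper().split()) & set(b.upper().split())

def words_match(a, b, words):
    """Assumed reading of Words: at least `words` words are shared."""
    return len(shared_words(a, b)) >= words

print(apply_length("DANIELBOONEAPIONEER", 10))                   # DANIELBOON
print(words_match("Daniel Boone a pioneer", "Boone Daniel", 2))  # True
```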

Must Verify

This option requires the given field to match between the two records. It also means that the verify field in question must exist in both records (and must match). This option is commonly used for the 245 title verification, though it can be useful for other fields as well.

Only if Both

This option performs a verify comparison only if both records have the specified field; the verify evaluates as true if only one of the records has the field. If this option were used on the 1XX field, the following would be true:

Example 1:

 
 Record A has:
   100 $a Twain, Mark.
   245 $a Adventures of Huckleberry Finn.
 
 Record B has no 100
   245 $a Adventures of Huckleberry Finn.

RESULT: This would be a match because the 100 exists in one but not the other.

However, when two records each have their own 1XX field and they differ, we have this scenario:

 
 Record A has:
   100 $a Twain, Mark.
   245 $a Adventures of Huckleberry Finn.
 
 Record B has:
   100 $a Clemens, Samuel.
   245 $a Adventures of Huckleberry Finn.

RESULT: This would not be a match because the 100 differs.
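The difference between Must Verify and Only if Both can be sketched in Python. This is an illustration rather than the program's logic: the function names are hypothetical, a missing field is represented by None, a plain equality test stands in for whichever method and normalization are configured, and treating a field missing from both records as true under Only if Both is an assumption.

```python
def verify_must(a, b):
    """Must Verify: the field must exist in both records and must match."""
    if a is None or b is None:
        return False
    return a == b

def verify_only_if_both(a, b):
    """Only if Both: compare only when both records have the field."""
    if a is None or b is None:
        return True  # nothing to compare, so the verify passes
    return a == b

record_a = {"100": "Twain, Mark.", "245": "Adventures of Huckleberry Finn."}
record_b = {"245": "Adventures of Huckleberry Finn."}
record_c = {"100": "Clemens, Samuel.", "245": "Adventures of Huckleberry Finn."}

print(verify_only_if_both(record_a.get("100"), record_b.get("100")))  # True
print(verify_only_if_both(record_a.get("100"), record_c.get("100")))  # False
print(verify_must(record_a.get("100"), record_b.get("100")))          # False
```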

Topics

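Group 1 - Hit Criteria

  • Step 2.1 - 010, 020, 022 Field Hits
  • Step 2.2 - Additional Information

Group 1 - Verify Criteria

  • Step 2.3 - Leader Bytes 06 & 07
  • Step 2.4 - 008 Dates
  • Step 2.5 - 008 [23] Form of Item
  • Step 2.6 - 1XX $a Main Entry
  • Step 2.7 - 245 $a, $b Title
  • Step 2.8 - 245 $n, $p Title Parts
  • Step 2.9 - 245 $h GMD
  • Step 2.10 - 260 $b Publisher
  • Step 2.11 - 260 $c Publication Date
  • Step 2.12 - Other Fields
  • Step 2.13 - Additional Information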
