Latest revision as of 11:13, 2 April 2013
Dedupe 4.0: Group 3
Section 4.0 of the dedupe profile guides you through the verification criteria and parameters used for hits on fields other than the Group 1 (010, 020, 022) and Group 2 (245) fields.
Group 3 Title Text Hits
Grouping allows the user to set different parameters for different kinds of potential matches. Because the title field (245) can yield very different potential matches than a numeric hit, it is separated into its own group and has different verification parameters from the other groups.
Title (245)
This field contains the title of the book, including main title, subtitle, statement(s) of responsibility, and occasionally other information appearing on the title page.
Every record must have a 245 tag. If some of the records in the catalog do not have a title, please consult your Backstage project manager for possible suggestions.
The 245 will be used as a default for searching, although it will be searched a little differently from the other default hit fields.
Group 3 - Verification Criteria and Terms
The Verify criteria are used to reduce the number of initial matches found using the Hit criteria. This allows you to further define what constitutes a good match based on the existence of other fields within your records. Fields should be selected on the basis of the match rate you expect to see with the deduplication. Selecting fewer fields will result in more matches; selecting more fields will result in fewer but better matches. Verify criteria should also be selected in conjunction with Hit criteria.
Verification Terms - Method
There are a few methods for verifying: FULL, PARTIAL, and WITHIN. These methods will be used for comparing data found in a specific field against the same field in a potential match record.
- FULL - Compares the full verify string up to the verify length.
- PARTIAL - Truncates both compare strings to the length of the shorter string, then does a full compare:
"The fox and the hound" in one record, "The fox" in the other record: both are truncated to "The fox" and then compared.
- WITHIN - Searches each compare string truncated at verify length against the full un-truncated string of the other field:
"Cat" will verify against "The cat in the hat."
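The three methods can be illustrated with a minimal Python sketch. This is not the program's actual code: the function name `verify`, its parameters, and the rule that an empty string never verifies are assumptions made for illustration, and the strings are assumed to be normalized already.

```python
from typing import Optional


def verify(a: str, b: str, method: str = "FULL", length: Optional[int] = None) -> bool:
    """Compare two field strings that are assumed to be normalized already."""

    def cut(s: str) -> str:
        # length=None stands in for the "all" setting (no truncation).
        return s if length is None else s[:length]

    ca, cb = cut(a), cut(b)
    if not ca or not cb:
        # Assumption: an empty string never verifies.
        return False
    if method == "FULL":
        # Compare the strings up to the verify length.
        return ca == cb
    if method == "PARTIAL":
        # Truncate both strings to the shorter one, then compare.
        n = min(len(ca), len(cb))
        return ca[:n] == cb[:n]
    if method == "WITHIN":
        # Search each truncated string inside the other full string.
        return ca in b or cb in a
    raise ValueError(f"unknown method: {method}")
```

With these assumptions, `verify("the fox and the hound", "the fox", "PARTIAL")` is true while the same pair fails under FULL, and `verify("cat", "the cat in the hat", "WITHIN")` is true.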
Verification Terms - Normalization
Normalization refers to how the string will be presented when compared to another string. Note that any normalization will not change anything in the record, but is only used when the program compares the strings.
Types of normalization are:
- NACO/CJK - retains spaces and subfield delimiters:
245 $a Daniel Boone :$b a pioneer. would be normalized as $a daniel boone :$b a pioneer
- FULL - NACO normalization with all spaces and subfield delimiters removed:
245 $a Daniel Boone :$b a pioneer. would be normalized as danielbooneapioneer
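As a rough illustration of FULL normalization, the sketch below lowercases the field text and drops subfield delimiters, spaces, and punctuation. It is a simplified stand-in, not the real NACO rules (which also cover diacritics and special characters), and the function name `normalize_full` is hypothetical. The tag number (245) is assumed not to be part of the input.

```python
import re


def normalize_full(field: str) -> str:
    """Rough stand-in for FULL normalization (not the real NACO rules)."""
    # Drop subfield delimiters such as "$a" or "$b".
    text = re.sub(r"\$[a-z0-9]", "", field)
    # Lowercase and keep only letters and digits, so spaces and punctuation go.
    return "".join(ch for ch in text.lower() if ch.isalnum())


print(normalize_full("$a Daniel Boone :$b a pioneer."))  # danielbooneapioneer
```

The record itself is never modified; the normalized string exists only for the comparison.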
Verification Terms - LENGTH and WORDS
This refers to how much of a given string the program will present for potential matches:
- LENGTH - Refers to the number of characters for the verify field. The number of characters to be used is 1-2048, or all:
Using 10 for LENGTH would truncate 245 $a Daniel Boone :$b a pioneer. to Daniel Boon if the FULL method were used.
- WORDS - Refers to a count of words to match within a given string:
Using 2 words for 245 $a Daniel Boone :$b a pioneer returns any of the words as keywords for match possibilities.
NOTE: using WORDS will not include non-filers.
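The LENGTH setting can be sketched as a simple truncation. The text does not say exactly how characters are counted, so this sketch assumes the limit is applied to a string that has already been normalized; under that assumption, ten characters of the FULL-normalized title correspond to "Daniel Boon". The function name `apply_length` is hypothetical.

```python
from typing import Union


def apply_length(normalized: str, length: Union[int, str] = "all") -> str:
    """Truncate an already-normalized string to the verify LENGTH."""
    if length == "all":
        # "all" means the whole string is used.
        return normalized
    if not isinstance(length, int) or not 1 <= length <= 2048:
        raise ValueError("LENGTH must be 1-2048 or 'all'")
    return normalized[:length]


print(apply_length("danielbooneapioneer", 10))  # danielboon
```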
Verification Terms - Must Verify
If this option is used for a given field, that field must verify or the records are not considered a match. This is almost always used for the 245 verification and is common for other fields as well.
Verification Terms - Only if Both
This option performs the verify comparison only if both records contain the specified field; if only one of the records has the field, the verify is treated as true. If this option were used on the 1XX field, the following would be true:
Example 1:
Record A has:
100 $a Twain, Mark.
245 $a Adventures of Huckleberry Finn.
Record B has no 100:
245 $a Adventures of Huckleberry Finn.
RESULT: Example 1 would be a match because the 100 exists in one record but not the other, so no comparison is made.
Example 2:
Record A has:
100 $a Twain, Mark.
245 $a Adventures of Huckleberry Finn.
Record B has:
100 $a Clemens, Samuel.
245 $a Adventures of Huckleberry Finn.
RESULT: Example 2 would not be a match because both records have a 100 and the values differ.
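The two examples can be expressed as a small Python sketch. The function name `verify_field` is hypothetical, `None` stands for a missing field, and the values are assumed to be normalized already. Two behaviors are assumptions that the text does not state: when the option is on and neither record has the field, the verify is treated as true; when the option is off, a missing field on either side fails the verify.

```python
from typing import Optional


def verify_field(a: Optional[str], b: Optional[str], only_if_both: bool = False) -> bool:
    """Verify one field of two records; None means the field is absent."""
    if a is None or b is None:
        # With "Only if Both", the comparison is skipped and verifies as true.
        # Without it, a missing field fails the verify (assumption).
        return only_if_both
    return a == b


# Example 1: Record B has no 100, so the comparison is skipped.
print(verify_field("twain, mark", None, only_if_both=True))
# Example 2: both records have a 100 and the values differ.
print(verify_field("twain, mark", "clemens, samuel", only_if_both=True))
```

The first call prints True and the second prints False, matching the results given for Examples 1 and 2.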
Defaults and Options
The default for finding a potential match for this field is as follows:
- Use subfield a and b
- Use first 64 characters
- Use NACO normalization
Other options include:
- Include other subfields, e.g., n or p
- Use more or less than 64 characters: minimum of 1, maximum of 1024
- Use other normalization methods
- CJK - also retains spaces and subfield delimiters
- FULL - This is the same as NACO normalization, except that it removes spaces and subfield delimiters
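The defaults and their allowed ranges can be pictured as a small settings object. This is only an illustrative sketch: the class name `TitleHitOptions` and its field names are hypothetical and do not reflect how the profile is actually stored.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TitleHitOptions:
    """Hypothetical container for the 245 hit settings described above."""

    subfields: Tuple[str, ...] = ("a", "b")  # default: subfields a and b
    length: int = 64                         # default: first 64 characters
    normalization: str = "NACO"              # default: NACO normalization

    def __post_init__(self) -> None:
        # The text gives a minimum of 1 and a maximum of 1024 characters.
        if not 1 <= self.length <= 1024:
            raise ValueError("length must be between 1 and 1024")
        if self.normalization not in ("NACO", "CJK", "FULL"):
            raise ValueError("normalization must be NACO, CJK, or FULL")


defaults = TitleHitOptions()
custom = TitleHitOptions(subfields=("a", "b", "n", "p"), length=128, normalization="FULL")
```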
Topics
Group 3 - Hit Criteria
- Step 4.1 - Other Field Hits
- Step 4.2 - Normalization Mode
- Step 4.3 - Match Tag Like Tag
- Step 4.4 - Additional Information
Group 3 - Verify Criteria
- Step 4.5 - Leader Bytes 06 & 07
- Step 4.6 - 008 Dates
- Step 4.7 - 008 [23] Form of Item
- Step 4.8 - 010 $a LCCN
- Step 4.9 - 020 $a ISBN
- Step 4.10 - 1XX $a Main Entry
- Step 4.11 - 245 $a, $b Title
- Step 4.12 - 245 $n, $p Title Parts
- Step 4.13 - 245 $h GMD
- Step 4.14 - 260 $b Publisher
- Step 4.15 - 260 $c Publication Date
- Step 4.16 - Other Fields
- Step 4.17 - Additional Information