Latest revision as of 10:08, 8 July 2009

Statistical Summary

A MARS 2.0 Statistical Summary is generated for every project that involves processing bibliographic records for authority control. The Statistical Summary includes both high-level and detailed statistical information about the records processed. It also includes the number of times selected actions were taken and the number of headings that met certain criteria.

Five Sections

The statistical information is divided into five sections:

Section 1: Record Overview

A high-level view of the processed files. This section includes the number of bibliographic records by type (books, serials, etc.) and how many records were changed during MARS 2.0 processing.

Section 2: Field Distribution

A statistical analysis of the distribution of fields (by tag) within the bibliographic file. Included are how many records had none, one, or two or more instances of each field, and how many fields changed (by tag). Changes listed in this section correspond with Step 2 of the Planning Guide.

Section 3: Authority Control

Provides match-rate statistics for fields under authority control examined during Step 3 of the Planning Guide.

Section 4: Authority Control Processing

Counts of specific changes made, and conditions found, during Step 3 of the Planning Guide.

Section 5: MARC Update Processing

Counts of specific changes made, and conditions found, during Step 2 of the Planning Guide.

Sections 4 and 5 also serve as a list of the reports available. Those reports marked with an asterisk (following the report number) are available for all MARS 2.0 authority control projects at no additional cost.

Section 1 : Record Overview

The first section lists a breakdown of the bibliographic formats processed. These are broken down according to the value in the bibliographic record's leader, bytes 6 and 7. The Record Format Table (see below) lets you know what our system considers a Book or a Computer File, etc.

Format                     # of Records  % of File  Changed  % Changed
Books (BK)                          536       98.9      519       95.8
Continuing Resources (CR)             2        0.4        2        0.4
Mixed Materials (MX)                  0        0.0        0        0.0
Music (MU)                            2        0.4        1        0.2
Maps (MP)                             0        0.0        0        0.0
Sound recording (MU)                  0        0.0        0        0.0
Visual Materials (VM)                 2        0.4        2        0.4
Computer Files (CF)                   0        0.0        0        0.0
Other                                 0        0.0        0        0.0
Totals                              542      100.0      524       96.7

The # of Records column gives the number of records of each format in your bibliographic file. The next column to the right, % of File, expresses that count as a percentage of all records in the file. For example, 536 of the 542 records (98.9%) are Books.
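The percentages in this table can be reproduced from the counts. The following is a minimal sketch, not the system's actual code: the function name pct is hypothetical, and it assumes (as the sample figures suggest) that both % of File and % Changed are calculated against the total number of records in the file.

```python
def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1) if whole else 0.0


total_records = 542               # Totals row, # of Records
books, books_changed = 536, 519   # Books (BK) row

print(pct(books, total_records))          # % of File for Books
print(pct(books_changed, total_records))  # % Changed for Books
print(pct(524, total_records))            # % Changed for the whole file
```

Note that under this assumption the Books % Changed (95.8) is 519 out of all 542 records, not 519 out of the 536 Books.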

Record Format Table

Type  LDR 06            LDR 07
BK    t or a            a or c or d or m
CF    m
MP    e or f
MU    c or d or i or j
CR    a                 b or i or j
VM    g or k or o or r
MX    p
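As an illustration of how the Record Format Table might be applied, here is a small sketch that classifies a record from its leader. The function name record_format is hypothetical and this is not the system's own code. It assumes the leader is available as a 24-character string, treats any combination not covered by the table as Other, and maps Leader/06 value p to Mixed Materials (MX), which is how MARC 21 defines that code.

```python
# Leader/06 values that decide the format on their own (per the table above).
BY_TYPE = {
    "m": "CF",
    "e": "MP", "f": "MP",
    "c": "MU", "d": "MU", "i": "MU", "j": "MU",
    "g": "VM", "k": "VM", "o": "VM", "r": "VM",
    "p": "MX",  # assumption: p is Mixed Materials, as in MARC 21
}


def record_format(leader):
    """Classify a record by leader bytes 06 and 07 (0-based positions)."""
    type_code, bib_level = leader[6], leader[7]
    # Language material (a) is CR or BK depending on byte 07.
    if type_code == "a" and bib_level in ("b", "i", "j"):
        return "CR"
    if type_code in ("a", "t") and bib_level in ("a", "c", "d", "m"):
        return "BK"
    return BY_TYPE.get(type_code, "Other")


print(record_format("00714cam a2200205 a 4500"))  # 06 = a, 07 = m
```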

Section 2 : Field Distribution

This second section deals entirely with Step 2 of the Planning Guide, Bibliographic Validation. It lists all of the possible tags that could be affected by any changes made during the implementation of Step 2's processing.

At first glance, it may be a little confusing to work out what each of the columns represents. Let's look at a sample from this section:

Tag  0 Fields  1 Field  2+ Fields  Total Fields  Avg Fields  Max Fields  Changed Fields  % Changed
010       269      273          0           273        0.50           1             272       99.6
020        84       11        447          1089        2.01          12               0        0.0
...       ...      ...        ...           ...         ...         ...             ...        ...
100       170      372          0           372        0.69           1             192       51.6

The first tag in this sample part is the LCCN (010). That next column, 0 Fields, lets us know that there are 269 bibs with no 010 fields. Then there are 273 bibs with only one 010 field. And there are zero bibs with more than one 010 field. So right away we can see that out of our sample of 542 records, there are 273 records which have one and only one 010 field for processing.

The Max Fields column gives the largest number of times the field appears in any single record. Since we already know there are no records with more than one 010 field, the Max Fields value can only be 1.

The Changed Fields column tells us exactly how many of the 273 LCCN fields were changed in some way during Step 2 processing. In this case, 272 of the 273 LCCN fields were modified, which gives us a Percent Changed of 99.6%. Most of these changes are likely due to correcting prefix spacing or incorrectly formatted LCCNs (hyphens, etc.).

When we next look at the entry for the ISBN (020), we immediately notice that nearly all of the columns contain data. While 84 records do not have an 020 field and just 11 records have exactly one, a substantially larger number (447) have at least two. In fact, Max Fields shows that at least one record has 12 ISBN fields. Since Changed Fields is zero, no changes were made to any of the 1,089 total 020 fields.
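To make the column definitions concrete, the sketch below derives one Section 2 row from a list holding the number of times a tag occurs in each record. It is only an illustration under stated assumptions: the function name field_distribution is hypothetical, Avg Fields is taken to be total fields divided by the number of records, and % Changed is taken to be changed fields divided by total fields. Both assumptions agree with the sample rows above.

```python
def field_distribution(counts_per_record, changed_fields):
    """Build one Section 2 row from per-record occurrence counts of a tag."""
    records = len(counts_per_record)
    total = sum(counts_per_record)
    return {
        "0 Fields": sum(1 for c in counts_per_record if c == 0),
        "1 Field": sum(1 for c in counts_per_record if c == 1),
        "2+ Fields": sum(1 for c in counts_per_record if c >= 2),
        "Total Fields": total,
        "Avg Fields": round(total / records, 2) if records else 0.0,
        "Max Fields": max(counts_per_record, default=0),
        "Changed Fields": changed_fields,
        "% Changed": round(100.0 * changed_fields / total, 1) if total else 0.0,
    }


# The 010 row: 269 records without an 010 field and 273 records with exactly one.
counts_010 = [0] * 269 + [1] * 273
for column, value in field_distribution(counts_010, changed_fields=272).items():
    print(column, value)
```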

Field Distribution

Section 2 of the Statistical Summary is further subdivided according to the type of fields found in the bibliographic records. Here are the sections and some of the more significant fields listed within each section:

  • Control Fields - 0XX
    • 001, 005, 006, 007, 008
    • 010, 020, 022
    • 040, 041, 050, 090
  • Descriptive Fields - 2XX, 3XX
    • 240, 245, 246
    • 260, 300
  • Main & Added Entries - 1XX, 4XX, 7XX, 8XX
    • 100, 110, 111, 130
    • 440, 490
    • 700, 710, 711, 730, 740
    • 780, 785
    • 800, 810, 811, 830
    • 880
  • Subject Access Fields - 6XX
    • 600, 610, 611
    • 630
    • 650, 651
    • 655
  • Notes & Local Fields - 5XX, 9XX
    • 500, 501, 502, 504, 505, 510, 520, 521, 533
    • 910, 949, 987


Again, it is worth noting that even though a particular field may be listed (e.g., a 9XX field) within Section 2 of the Statistical Summary, that does not necessarily mean any changes were made to that field during processing. One of the purposes of Section 2 is to list every possible bibliographic field as well as any actual changes made to that field.

Section 3 : Authority Control

This section lists all of the tags that were changed during the Authority Control processing. Where Section 2 dealt with any changes made due to bib validation or cleanup (indicators, punctuation, etc.), Section 3 lists each field that was either fully matched or partially matched during processing.

Before we dive into how the report for this section appears, take a quick look back up at Section 2's last entry. It is for the 100 field. You'll notice right away that there are 372 total 100 fields in the bib file. This is important to remember when we look at Section 3's listing for the 100 field:

Tag      Total  Full  Partial  Matched  % Matched
100        359   355        0      355      98.9%
100/240     13    11        2       13     100.0%

When we look at the Total column for Section 3, we notice that it now says 359, instead of the 372 in Section 2. Why is this? It is as if Section 3 lost track of 13 of the original 372 fields tagged 100.

Looking one row down, we can see that Section 3 grouped the 100 and 240 fields together 13 times. This is where our missing 13 fields went: 359 + 13 = 372. If a 1XX exists in the record alongside a 240 field, the two are grouped together in Section 3 and reported on their own row with their own statistics.
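The grouping rule can be sketched in a few lines. This is an illustration only: the function name section3_row is hypothetical, each record is reduced to the set of tags it contains, and the 542 sample records are synthesized from the counts in the two tables rather than taken from real data.

```python
from collections import Counter


def section3_row(record_tags):
    """Return the Section 3 row a record's 100 field is counted under, or None."""
    if "100" not in record_tags:
        return None
    # A 100 that sits alongside a 240 is reported on the grouped row.
    return "100/240" if "240" in record_tags else "100"


# Synthetic sample: 359 records with a 100 only, 13 with 100 and 240, 170 with no 100.
records = [{"100", "245"}] * 359 + [{"100", "240", "245"}] * 13 + [{"245"}] * 170
rows = Counter(section3_row(tags) for tags in records)

print(rows["100"], rows["100/240"], rows["100"] + rows["100/240"])
```

The two Section 3 rows add back up to the 372 total 100 fields reported in Section 2.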

Full vs Partial Matches

A heading is fully matched if the entire string found a match in LC:
sh2008117652
650 _0$aCats$xBehavior.

Since all of the subfields (in this case, both $a and $x) matched an LC authority record, this is considered a full match: the entire string, $aCats$xBehavior, was matched.

Here is an example of a partial match:
sh 85021262
650 _0$aCats$xBehavior.

Since only one of the subfields (in this case, $a) matched an LC authority record and the remaining subfields (in this case, $x) did not match, this is considered a partial match. Partial matches can also be viewed in R06-Partially Matched Headings, which is an optional report that bolds the part of the heading that was matched. In this example, the matched part is $aCats and the unmatched part is $xBehavior.
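The distinction between full and partial matches can be sketched as a search for the longest leading run of subfields found in the authority file. This is a simplified illustration, not the actual matching logic, which this page does not describe. The function name match_heading and the representation of an authority file as a set of subfield tuples are assumptions made for the example.

```python
def match_heading(subfields, authority):
    """Classify a heading as 'full', 'partial' or 'none'.

    subfields: list of (code, value) pairs, e.g. [("a", "Cats"), ("x", "Behavior")]
    authority: set of tuples of (code, value) pairs standing in for authority records
    Returns (kind, matched_subfields, unmatched_subfields).
    """
    # Try the whole string first, then drop trailing subfields one at a time.
    for end in range(len(subfields), 0, -1):
        if tuple(subfields[:end]) in authority:
            kind = "full" if end == len(subfields) else "partial"
            return kind, subfields[:end], subfields[end:]
    return "none", [], list(subfields)


heading = [("a", "Cats"), ("x", "Behavior")]
cats_only = {(("a", "Cats"),)}
cats_and_behavior = cats_only | {(("a", "Cats"), ("x", "Behavior"))}

print(match_heading(heading, cats_and_behavior)[0])
print(match_heading(heading, cats_only)[0])
```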

Field Distribution

Section 3 lists all of the bibliographic fields under Authority Control that have the potential to be matched in some way. Any fields that are separated by a slash '/' are considered grouped:

  • Main & Added Entries
    • 100, 110, 111
    • 100/240, 110/240
    • 130, 730
    • 700, 710, 711
  • Subject Access Fields
    • 600, 610, 611
    • 630
    • 650, 651
    • 655
  • Series Fields
    • 440, 490
    • 800, 810, 811, 830
  • Field Types
    • 1XX, 6XX, 7XX, 440/8XX
    • X00, X10, X11, X30

You'll notice that the final section of Section 3 lists Field Types. This is a section where the Statistical Summary groups entire ranges of authorized fields together for statistical information.

So 1XX would correspond with all of these authorized fields:

  • 100, 100/240, 110, 110/240, 111, 130

And X00 would correspond with all of these authorized fields:

  • 100, 100/240, 600, 700, 800
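One way to read these Field Types is as wildcard patterns in which X stands for any digit. The sketch below is a hypothetical illustration (the function name in_field_type is not from the system). It assumes a grouped tag such as 100/240 is classified by its first tag, which is consistent with the two lists above. The combined 440/8XX type is not handled here.

```python
def in_field_type(tag, pattern):
    """Return True if a tag (or grouped tag) falls under a pattern such as 1XX or X00."""
    first = tag.split("/")[0]  # a grouped tag like 100/240 is classified by its first tag
    return len(first) == len(pattern) and all(
        p == "X" or p == t for p, t in zip(pattern, first)
    )


tags = ["100", "100/240", "110", "110/240", "111", "130", "600", "700", "800"]

print([t for t in tags if in_field_type(t, "1XX")])
print([t for t in tags if in_field_type(t, "X00")])
```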

Section 4 : Authority Control Processing

This section lists all of the reports that were generated during the Authority Control processing (Step 3). Most of these are not delivered to the client unless the client requests them, because the majority of the reports listed in Section 4 are considered Optional Reports and are provided at extra cost.

However, the first two reports listed in Section 4, R05-Matched Headings & R06-Partially Matched Headings, are Optional Reports available at no extra cost. Both of these reports can be rather unwieldy, so we do not typically return them unless a client requests them.

Our system contains nearly 100 reports, and Section 4 may list reports that have counts of zero. A zero count simply lets the client know that the report found nothing to report for this file.

This section is broken down by report, with each row giving the number of changes of that kind made during Step 3, Authority Control processing.

Report  Report Title                         Count
30      Updated Headings                       162
31*     Split Headings                           0
32*     Tags Flipped                             0
33      Subdivisions Flipped                     0
35      Minor Heading Changes                 3181
36      Leading Article Deleted                  0
37      Filing Indicator Changed                 2
38      Changed O and l to 0 and 1 in Dates      0
39      Subfield Code Changed From $x to $v     12

Report numbers with asterisks (*) after them are Standard Reports, which Backstage provides at no extra cost to the client.
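If the rows of Sections 4 and 5 need to be read by a script, a sketch along these lines could separate the report number, the asterisk, the title, and the count. The function name parse_report_row and the output keys are hypothetical, and the sketch assumes each row is a single line of text in the order shown in the table.

```python
import re

# Report number, optional asterisk, title, then the count at the end of the line.
ROW = re.compile(r"^(\d+)(\*?)\s+(.+?)\s+(\d+)$")


def parse_report_row(line):
    """Parse one report row; return None for headers or anything that does not fit."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    number, star, title, count = m.groups()
    return {
        "report": int(number),
        "standard": star == "*",  # asterisk marks a Standard Report
        "title": title,
        "count": int(count),
    }


print(parse_report_row("31* Split Headings 0"))
print(parse_report_row("38 Changed O and l to 0 and 1 in Dates 0"))
```

The title is matched lazily and the count is anchored to the end of the line, so digits inside a title (as in report 38) are not mistaken for the count.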

Section 5 : MARC Update Processing

This section lists all of the reports that were generated during the Bibliographic Validation processing (Step 2). Since all of the reports listed in this section are Optional Reports at extra cost to the client, these are not typically returned unless specifically requested.

This section is broken down by report, with each row giving the number of changes of that kind made during Step 2, Bibliographic Validation.

Report  Report Title                       Count
60      Obsolete Tags Flipped                  0
61      Obsolete Subfield Codes Updated        0
62      Obsolete Fields Removed                0
63      Obsolete Subfields Removed             2
64      Empty Field Deleted                    0
65      Leader Fixed Field Values Updated      0
67      New 007 Field Added                    0
68      008 Fixed Field Values Updated         0
69      Data Moved to New Field                0
70      LCCN Format Corrected                272