An Investigation of Data Quality and Collaboration

Michael B. Twidale
Paul F. Marty

Graduate School of Library and Information Science
University of Illinois, Urbana-Champaign
twidale@uiuc.edu

Abstract

This paper explores the possibilities for improving the quality of a system’s data by taking advantage of the usage of that same data. These ideas are informed and illustrated by a study of a museum database currently under construction and by explorations of collaborative technologies that could further enhance quality control while meshing with existing work practices. Deeper analysis of the case study enables us to uncover issues applicable to other museums, to related organizations such as libraries and archives, and to databases in general.

Introduction

All organizations collect data and then use, analyze, manipulate, add to and modify that data. Increasing automation offers both potential cost savings and opportunities for new forms of data manipulation hitherto impossible or infeasibly expensive. However, as many failed attempts to automate white-collar work have shown, it is all too easy for an impressively efficient system to be unable to cope with exceptions, and particularly with erroneous or unexpected data. An example is the threatening letter for non-payment of a fraction of a penny. Although some problems were clearly also present in older paper-based systems, and are only accelerated by automation, others appear as a product of automation itself. Where a paper form (such as an invoice) passes through many hands, there is at least the possibility of errors being detected and remedied by alert members of staff. Where much of the same work is partially automated, there will be fewer eyes looking at the data, and equally significantly, those spotting a problem may not be able to report or correct it easily. This paper explores ways of reintroducing such informal error checking into advanced systems, while maintaining the cost and functionality advantages of those systems. We explore the possibilities for improving the quality of a system’s data by taking advantage of the usage of that same data by people both inside and outside the organization. We use the results of a study of data management in a particular museum to provide exemplars of good practice and raise wider issues for consideration.

Our approach is to examine and understand existing practice and then consider how activities may be extended by suitable changes involving the use of new systems functionalities. In particular, we consider how the insights and techniques from Computer Supported Cooperative Work (CSCW) may be applied in this area to support the goal of improving data quality. By such an analysis, we are able to formulate a number of research questions, and, we hope, encourage a debate on this method of analyzing the problem. Our aim is to uncover issues applicable not merely to other museums, but also to related organizations such as libraries and archives, and ultimately to databases in general.

In essence, we are proposing that databases should make it easy for users to indicate the existence of an error in a given field so that this error can later be corrected by a suitably authorized person. This active error correction relies on the altruistic behavior of the users and so raises questions of why and whether people will bother, as well as issues of how to minimize the effort required. We provide evidence for the plausibility of such activity occurring in certain circumstances. Also discussed is a complementary approach: passive error detection where metadata (including usage information) is employed to predict data elements that have a greater than average probability of being erroneous and so prioritize the data checking activity.

A museum database serves as a useful source of examples of types of database errors: inaccurate descriptors are typed into fields, data becomes obsolete over time, accurate data is difficult to obtain, and quality of data is often compromised by constraints of time and money. Although these problems are not unique to museums, museum databases display many problems of data quality that can serve as useful exemplars for addressing analogous problems in other databases that might otherwise be ignored. The inherent nature of cataloguing museum artifacts yields a rich variety of errors. Unlike cataloguing books in libraries, the uniqueness of museum artifacts inevitably leads to greater ambiguity in the process of description: the data can be difficult to obtain, requires specialized knowledge, and may need updating in the light of evolving scholarship. In addition, the public nature of museum research means that this information is generally more accessible for study than the conventional commercial database. Thus, a study of data quality issues in museum databases provides us with a wide variety of examples that can easily generalize to other kinds of database systems.

Literature review

We are not attempting to provide a definitive survey of the data quality field, nor do we claim that the collaborative approach to be outlined is a panacea; rather, we aim to show how an examination of a particular case can lead to explorations of interesting possibilities. The issue of data quality has been of interest to researchers from a number of disciplines including computer science, library and information science, and management information systems. Naturally, there is a strong commercial interest in this last area, emphasizing the costs to commercial organizations of poor quality data (Redman 1998). Redman gives typical error rates of one to five percent for the value of a database field. Various researchers have attempted to clarify the issue of data quality by proposing definitions and distinguishing different dimensions of data quality (Fox et al. 1994; Redman 1996; Wand & Wang 1996). Medawar (1995) reviews the literature on database quality from a library and information science perspective, relating it to work on Total Quality Management (as does Wang, 1998) and a focus on user satisfaction. Ballou & Tayi (1999) note the importance of assigning priorities to efforts to enhance data quality and propose models for determining those priorities based on cost-benefit analysis.

Jasco (1993a&b) considers issues of data quality from the perspective of an end user of a database. He notes "the appalling amount of sloppiness in bibliographic and numeric databases." He considers (Jasco 1993a) the problems of missing values in certain fields and how that can lead to misleading results. As with Fox et al. (1994), he notes the confusion caused when some fields are legitimately blank whereas for others a searcher is inclined or even entitled to assume that there should be a value for all records in the database, or all records of a certain type. His description of search techniques to assess the extent of the problem in a given database contains useful insights into the development of functionality that should be provided for supporting data quality management. His second paper (Jasco 1993b) notes the contribution that effective use of controlled vocabularies can make to the problem. Nevertheless, problems remain with existing commercial databases, including typographic and spelling errors and the use of some fields as a dumping ground for values that don't fit into the database’s current field structure. Further problems are caused by legitimate variant spellings, especially in cases where names change over time. In such cases, cross-references are a powerful solution, but problems occur when they are not used universally in the database.

Mintz (1990) notes the commercial imperatives militating against 100% accuracy in database publishing. She advocates a consideration of errors in those fields or segments that are likely to be most damaging to the customer. Wang et al. (1995) explore the metadata requirements of a focus on data quality. They note that it is insufficient to have a single quality measure for a record. Rather, each element of a record may need different quality information. Another approach to the problem is some form of quality labelling of the whole database (Armstrong 1995).

In an example of end user feedback, Davis (1989) describes work by OCLC (The Online Computer Library Center) to elicit responses from the users of their Online Union Catalog about its quality. Besides impressions of overall quality (8% of records were perceived to have errors), users were asked about their preferences for the focus of quality control. OCLC had been taking advantage of user feedback about errors for some time. The process for the users (librarians at affiliated libraries) was laborious, involving the postal mailing of Change Request forms along with photocopied supporting documentation. Davis notes this as one of the reasons that Change Requests generated less than half of the 125,000 records replaced per year: 31% of respondents said they never reported errors, and 42% reported only a few errors. What to us is remarkable is that anyone bothered to report errors. We use this as evidence that the feedback approach is feasible, provided that ease of use issues are addressed. Of the librarians surveyed, 70% said that they would increase their reporting of errors if a more accessible online error reporting system were available.

Orr (1998) also advocates user feedback as a way to maintain data quality (this is the closest in approach to the mechanisms outlined in this paper). He notes the impossibility of perfect data quality and rather focuses concern on quality that is good enough. Where the database is considered part of a feedback control system, there is the possibility of usage statistics leading to the detection and correction of errors. This analysis leads to six data quality rules reproduced here:

The context of the case study

The Spurlock Museum at the University of Illinois is a cultural heritage museum with a collection of over 45,000 ethnographic artifacts from around the world. These artifacts represent a broad spectrum of history and culture ranging from ancient Sumeria to modern day Ecuador, from Paleolithic chipped-stone tools to tiles from the Space Shuttle.

The museum is currently in the middle of a five-year process begun in 1996 to complete a 100% re-inventory of the museum’s collections, as part of a move to a new building. As part of this process, each artifact is being retrieved from storage, analyzed and evaluated, weighed, measured, and photographed, identified and catalogued; the resulting data is stored in specially developed relational databases. Each record in the primary Artifacts database features over one hundred fields tracking such detailed specifications as nomenclature classifications, physical dimensions, material analyses, geographical, cultural, and temporal designations, accession records, artifact histories, exhibit information, scholarly remarks, condition and conservation records, research notes, etc.

This effort represents a project of enormous scope, requiring the cooperation of museum staff members, various external experts, and dozens of part-time undergraduate student employees. Since the people involved in this project have varying levels of expertise, there is the potential for many different errors of various types. The challenge (as for many other databases) is to maintain high data quality when it is known beforehand that the work of some participants will necessarily be of relatively low quality. The museum registrar has overall responsibility for monitoring quality and organizing and modifying work practices to maximize this quality under the constraints of budget and time. Clearly re-inventorying is an example of workflow, with an interaction between the physical artifacts being processed and the use of a combination of paper and electronic records. The detection of errors may occur in various ways.

Example. While browsing a particular data record, the museum registrar notices that an artifact has no weight entered in the database. This problem must be noted, the artifact retrieved and weighed, and the value entered into the database.

Example. A museum staff member notices that the length of a Merovingian brooch is entered in the database as 3000 cm. Clearly this is an error; brooches are never 30m long. The artifact must be remeasured and the correct value entered into the database.

Example. An undergraduate senior in anthropology working at the museum reads an artifact record for an Egyptian earthenware bowl. She notes that the time period designation is entered as "predynastic." However, the manufacturing process field notes that the artifact was wheel-thrown. Because of her knowledge of Egyptology, she is aware that this represents an inconsistency in the data - predynastic Egyptian pots were not wheel-thrown. However, not being an expert, she doesn’t know which field is incorrect, nor (because of her junior status) is she allowed to change either of them. She reports the problem, but a specialist in Egyptian archaeology must examine the pot to resolve the contradiction.
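The inconsistency in the last example can be expressed as a cross-field rule. The following is a minimal sketch only, assuming hypothetical field names (`culture`, `period`, `manufacturing_process`) and a hypothetical rule format; it is not the Spurlock schema. Note that, like the student, the rule can only report that the fields disagree, not which one is wrong.

```python
# Hypothetical field names and rule format, for illustration only.

def find_inconsistencies(record, rules):
    """Return the description of every cross-field rule the record violates."""
    problems = []
    for description, fields, is_consistent in rules:
        values = [record.get(name) for name in fields]
        # A blank field is a different kind of problem (a missing value),
        # so rules are only evaluated when all their fields are filled in.
        if any(value is None for value in values):
            continue
        if not is_consistent(*values):
            problems.append(description)
    return problems


RULES = [
    (
        "predynastic Egyptian pottery should not be wheel-thrown",
        ("culture", "period", "manufacturing_process"),
        lambda culture, period, process: not (
            culture == "Egyptian"
            and period == "predynastic"
            and process == "wheel-thrown"
        ),
    ),
]

bowl = {
    "culture": "Egyptian",
    "period": "predynastic",
    "manufacturing_process": "wheel-thrown",
}
flags = find_inconsistencies(bowl, RULES)
```

Here `flags` contains the single rule description, which would still need to be passed to a specialist for resolution.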

Kinds of error

The above examples illustrate very different kinds of error. As noted in the literature review, various authors have explored the kinds of error that may occur. Here we consider two broad distinctions and the implications that they have for the approach to data quality management.

Input Error

The most obvious kind of error is a data input error, where the correct value was known (or knowable) at the time of input, but that value is not what was entered. We may further distinguish between slips and misconceptions:

Slip: the person entering the data knows the correct value but accidentally enters another one. This is frequently due to typing errors, but can also be caused by interruptions, distractions, etc. (Rouncefield et al. 1994).

Misconception: the person entering the data believes that she knows the correct value and actually enters that value, but it is not correct. The likelihood of a misconception varies with the expertise of the person making the decision. Even an expert may have misconceptions, but a novice is likely to have more.

In general terms, appropriate ways of dealing with slips include the use of controlled vocabularies (Lancaster 1986), which can detect some, but not all, typing errors. Selection from menus eliminates the problem of an invalid entry, but not the problem of an incorrect entry: an accidental tremor with the mouse can lead to the selection of the wrong value. Data dictionaries can be used to forbid or flag values that fall outside expected norms and are therefore suspicious. Careful checking will also help to reduce the problem, but will not eliminate it.
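As an illustration of the data dictionary idea, the following minimal sketch flags values outside expected norms or outside a controlled vocabulary. The field names, bounds, and vocabulary are assumptions invented for the example, and the sketch treats every blank field as suspicious, which would need refining for fields that are legitimately blank.

```python
# Hypothetical data dictionary: names, bounds and vocabulary are illustrative.
DATA_DICTIONARY = {
    "length_cm": {"min": 0.1, "max": 1000.0},
    "weight_g": {"min": 0.1, "max": 500000.0},
    "time_period": {"allowed": {"predynastic", "Old Kingdom", "Merovingian"}},
}


def check_field(name, value):
    """Return a list of warnings for one field; an empty list means no warning."""
    spec = DATA_DICTIONARY.get(name, {})
    if value is None or value == "":
        return [f"{name}: no value entered"]
    warnings = []
    allowed = spec.get("allowed")
    if allowed is not None and value not in allowed:
        warnings.append(f"{name}: {value!r} is not in the controlled vocabulary")
    if "min" in spec and value < spec["min"]:
        warnings.append(f"{name}: {value} is below the expected minimum {spec['min']}")
    if "max" in spec and value > spec["max"]:
        warnings.append(f"{name}: {value} is above the expected maximum {spec['max']}")
    return warnings


def check_record(record):
    """Check every field listed in the data dictionary against the record."""
    warnings = []
    for name in DATA_DICTIONARY:
        warnings.extend(check_field(name, record.get(name)))
    return warnings


# The brooch entered as 3000 cm long, with no weight recorded.
brooch = {"length_cm": 3000, "weight_g": None, "time_period": "Merovingian"}
report = check_record(brooch)
```

With these assumed bounds the 3000 cm length and the missing weight are both flagged, but a slip that stays within the plausible range (a length typed as 30 rather than 3) passes unnoticed, which is why such checks reduce the problem without eliminating it.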

Another approach is to have the data entered independently by two people and to check for discrepancies. If errors are rare and randomly distributed, this is effective. It is also, however, expensive, which is the reason it is not done for every entry at the Spurlock, and we would presume, at many other locations. Similarly, one way of tackling misconception errors is to ensure that the entry (or assignment of a value - the value could be specified by one person, say on a paper form, and actually entered into the system by someone else) is done by the most expert people available. Again, the issues of cost and the scarcity of expertise arise. It would be nice if all attribution of all fields were done by an expert, but that is not always possible. If it were done that way, the projected completion time would be greatly extended, especially in the case of a rapidly growing collection; hence the use of undergraduates at the Spurlock. One compromise is to split the task into relative layers of expertise. For example, a novice can be assigned the task of entering already extant data from paper forms into the database system. More advanced students trained in artifact handling procedures can measure the museum’s artifacts and record this data on paper forms for later data entry. Finally, the museum’s most experienced students are set to such advanced tasks as assigning nomenclature classifications, identifying the provenance, or analyzing the material composition of a given artifact.
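The double-entry check described at the start of the preceding paragraph can be sketched as a comparison of two independently keyed versions of a record. The reasoning behind it is that if each person independently gets a given field wrong with probability p, both get it wrong with probability at most p squared; taking the one to five percent field error rates quoted from Redman, that is at most 0.01% to 0.25%. This holds only under the independence assumption, which shared misconceptions would violate. The function and field names below are hypothetical, and the values are invented for illustration.

```python
def compare_entries(first, second):
    """Return {field: (first_value, second_value)} wherever two entries differ.

    A field present in only one entry is reported with None for the other.
    """
    discrepancies = {}
    for name in sorted(set(first) | set(second)):
        a, b = first.get(name), second.get(name)
        if a != b:
            discrepancies[name] = (a, b)
    return discrepancies


# Two people key in the same paper form; the values are invented.
first_entry = {"length_cm": 3.0, "width_cm": 2.1, "material": "bronze"}
second_entry = {"length_cm": 3000.0, "width_cm": 2.1, "material": "bronze"}

to_recheck = compare_entries(first_entry, second_entry)
```

The comparison only reports that the two entries disagree on `length_cm`; deciding which value is right still requires going back to the form or the artifact.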

However, even apparently relatively simple activities can lead to confusion. For example, at the Spurlock, the length, width, and height of each artifact are entered into the database. While measuring these values is not difficult, deciding which value goes with which field is often confusing because this presupposes knowledge of the artifact’s correct physical orientation. An error here could lead to confusion for exhibit designers about the intended orientation of the piece, resulting in incorrectly shaped exhibit cases or display mounts.

This sort of error could not be solved by the implementation of controlled vocabularies. Moreover, although ideally all such data should be entered by an expert, this is not always feasible due to financial constraints. In the following section we shall consider how collaborative approaches can make the best use of the limited resource of expertise.

Emergent Error: Data Entropy

Even if a value is known to be correct and entered correctly into the database, it can become incorrect over time. The problem of keeping data up to date is widespread in commerce. It varies from extreme cases, where it is clear that data is constantly changing, such as cash balances in accounts, through values that frequently change such as the location of stock items, and on to values that change, but infrequently, such as the addresses of customers. Thus, over time, if nothing is done to maintain accuracy a database will degrade in quality. In short, data exhibits entropy. Where that is acknowledged, policies can be put in place to address it. Problems arise where the change is so slow that it is considered imperceptible or not worth the bother of tracking. Thus, in the case of libraries, the checkout status of a book is constantly changing and is necessarily dealt with. Furthermore, the problems that can arise due to that change, such as a book being temporarily or permanently lost, are acknowledged and allowed for. By contrast, the bibliographic details of a book are usually considered fixed. Assuming that they were entered correctly, they are unlikely to go out of date.

Museum informatics is strongly influenced by library informatics due to the similarity of the activity and the degree of professional overlap. Also, computerization activities have generally been more advanced in the library world than the museum world. However, drawing on a model of bibliographic data for museum artifact data is problematic. First, there is less standardization of fields overall, as the museum community, unlike the library community, finds it more difficult to settle on a common metadata format for describing museum artifacts (Taylor 1999 p97). Second, museum artifacts generally need many more descriptive fields than a book does (Taylor 1999 p10); the Spurlock uses 126 fields to describe each artifact versus fifteen in the Dublin Core, a widely supported metadata standard for documents of all types including museum artifacts (CIMI 1999). In part this is because every artifact is unique. Books from the same print run can usually be considered identical and often contain within themselves most, if not all, the cataloguing information required. With an artifact, the acquisition of that information is more difficult and hence liable to greater error. Furthermore, complicating the problem and leading to what we are calling emergent errors is the phenomenon that scholarship and knowledge evolve. What was once considered a correct attribution may become incorrect in the future.

Example. The museum owns many artifacts from the former Soviet Union. The "country" designation for these artifacts was originally entered as "USSR." As the museum staff re-inventories the collection, these artifacts are being re-classified as coming from "Russia," "Georgia," etc.

Example. The Spurlock has a sizeable collection of Paleolithic chipped stone tools. In the museum’s older ledgers, these are all classified as weapons. Subsequent scholarship has revealed that many of these so-called weapons were actually multi-purpose tools, also used for cooking, scraping, cleaning hides, etc. Therefore, many of the artifacts originally classified as weapons may now need to be re-classified.

Thus the data in the database has to refer not just to the artifact within the walls of the museum but also in its larger context; whether in the world at large or in the scholarly community. As external events change, whether they are geopolitical in nature or represent an evolving state of knowledge, the database needs to be updated.

Proposed collaborative solutions

Given the problems outlined above, we believe that automated and semi-automated solutions, such as the use of controlled vocabularies, although important, and to be strongly advocated, will not solve all the problems. Careful manual checking of every item is infeasibly expensive, not least because data entropy means that it is an ongoing commitment. We need to consider ways to improve quality that are not prohibitive and which are capable of fitting with existing work practices. The proposed solutions must also acknowledge the wide variation in expertise of the people involved in the process. Our approach has been informed by consideration of several observed cases such as the following illustrative example.

Example. A student entering cultural data on a particular Mesopotamian ceramic vase did not know whether this artifact was Sumerian or Akkadian in origin. Rather than guessing at random or failing to enter anything, she entered the data as Sumerian but appended a note to the record (using a specially designed problems field) indicating her uncertainty in this matter, the nature of the problem, and her recommendation that a specialist in Mesopotamian history double-check the record. This example illustrates that even relatively low status employees, with presumably less institutional commitment, may be willing or can be encouraged to flag problems, provided that we give them suitable mechanisms to do so easily (see below).

Error checking and creative volunteering of information have been seen in various different ways throughout the museum’s system: it is almost commonplace to see phrases such as "I don’t know," "Needs further research," or even simple question marks scattered throughout the records. Thus, even though this process of recording errors has not been formalized as a required duty of the student employees, these examples serve as evidence that people are not only willing but will find their own ways to indicate quality problems of their own accord. Consequently, it is our hope that if we encourage this activity, both managerially and through appropriate systems re-design, we can produce usage based data quality improvement.
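One way to formalize such ad hoc notes is a structured problem flag that anyone may raise but only an authorized person may resolve. The sketch below is an assumption-laden illustration: the class names, field names, record identifier, and user names are all hypothetical and do not describe the museum's actual problems field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProblemFlag:
    """A note that a field may be wrong; it does not change the field itself."""
    record_id: str
    field_name: str
    note: str
    reported_by: str
    suggested_value: Optional[str] = None
    resolved: bool = False


class FlagQueue:
    """Collects flags from any user; only authorized users may resolve them."""

    def __init__(self, authorized):
        self.authorized = set(authorized)
        self.flags = []

    def report(self, flag):
        self.flags.append(flag)

    def open_flags(self):
        return [flag for flag in self.flags if not flag.resolved]

    def resolve(self, flag, resolver):
        if resolver not in self.authorized:
            raise PermissionError(f"{resolver} may flag problems but not resolve them")
        flag.resolved = True


queue = FlagQueue(authorized={"registrar"})
vase_flag = ProblemFlag(
    record_id="vase-001",  # hypothetical identifier
    field_name="culture",
    note="Entered as Sumerian; could be Akkadian. Needs a Mesopotamian specialist.",
    reported_by="student",
)
queue.report(vase_flag)
```

Keeping the flag separate from the value lets a novice record uncertainty in one step, while the correction itself remains with someone who has the authority and expertise to make it.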

Usage based data quality improvement

Examples of this form of altruistic error correction behavior date back to the initial stages of the museum’s re-inventory process. In late 1997, when digital records for the museum’s collections were first made available for review over the Internet, a specialist in African art located in Phoenix, Arizona, examined over 200 records and sent detailed comments on each to the museum’s registrar via email. These comments were evaluated by the museum staff for accuracy and subsequently entered into the museum’s database systems.

This type of feedback has continued throughout the re-inventory process. For example, recently, a University of Illinois professor examined from his office a number of on-line records of cuneiform tablets and emailed the registrar with his suggestions and corrections for the records. On occasion, he noted that although the textual data within the record was accurate, the digital photograph which accompanied the record was upside-down!

These examples show how data quality may be improved by its usage - that as people look at the data for their own purposes, they may spot errors (Orr 1998). This is a valuable resource but it is difficult to exploit. Note that the information can come from different kinds of people and involve different degrees of knowledge and detail. It may involve users both within and external to the organization. It may involve experts who can identify the problem and its correct solution, or people who just have a hunch that there may be a problem. The complicating factor is that the error correction and remediation may itself involve errors: some people will believe there is an error when the data is in fact correct, and others will suggest corrections to real errors that are themselves erroneous.

In general terms, this leads to the following research questions:

In particular, one very important question is whether people will actually bother to volunteer this information. This deserves much more detailed study. We can provide some evidence in the form of observed examples of it occurring.

This approach has precedent elsewhere. Gasser (1986) noted the various workarounds that occurred as workers devised mechanisms to cope with the inflexibility of a computer system, including dealing with erroneous information. In a study of air traffic control work, Hughes et al. (1992) observed that the less experienced assistants picking up flight strips from a computer and moving them to a display sometimes spotted errors in the printouts and reported them. This was an important part of the informal error checking that was integral to the safety culture of air traffic control. The study emphasized the social mechanisms employed to create a total system that was far more reliable than its very fallible components – exactly the situation needed for data quality. An existing system with several of the features required for data quality management is Talkback from Full Circle Software, which is used in Netscape Communicator. When Communicator crashes, Talkback offers the user the option of sending a bug report to Netscape. The user may choose to volunteer additional information. We need to provide similar mechanisms for a user observing a data error.

Mintz (1990) proposed the FIXIT feature to enable quality control by online searchers of a fee-based database. This attempted to address the barriers to giving feedback to database producers, and to provide a credit for the accessing of erroneous data. FIXIT would switch the user to a free database (avoiding further online charges) and then ask the user for details about the faulty record that had been detected. She further proposed that the search services should publish details about how many corrections were sent to each provider and how many linked records were added to correct the FIXITs sent. Rather than just giving a credit for each erroneous record the user had retrieved, Mintz suggested that the database providers pay double for error reports in order to encourage them.

Ojala (1993) describes the creation of a databases bulletin board in Finland that allows searchers to complain when they find a problem record in a database, and vendors to respond. This is much more laborious than the FIXIT option and requires a much greater switching of context. However, it allows for a more public discussion of error problems, perhaps encouraging greater participation through solidarity and the observation of feedback to individuals and others' identified problems (similar to the software bulletin board discussions of computer scientists discussed below).

It seems reasonable to assume that the more difficult it is to volunteer the information, the less likely it is that a person would bother. Thus a design challenge is to develop mechanisms that are as low cost as possible while also exploring the potential benefits of use. What are the incentives for people to volunteer their help? These can include a commitment to scholarship, irritation with an observed error, and internal altruistic motives. For some, the intrinsic reward is to see that their actions have an impact. Thus it is important that reports be rapidly acted on since the volunteers of information may check on the results of their suggestions. If they choose to provide contact details they can be thanked and told the consequences of their report.

Raymond (1998a&b, 1999) considers the culture of computer scientists, particularly hackers, in the collaborative construction, error detection, debugging and enhancement of open source software. As Raymond notes: "Given enough eyeballs, all bugs are shallow" (Raymond 1998a). The continuing development of the Linux operating system serves as an existence proof of how overall quality can be created out of submissions (bug patches) of variable quality. In the case of open source software (OSS), users are actively involved in both the detection and remediation of problems: people report bugs and submit code patches to correct those bugs. The data quality equivalent is the detection of errors in the data and the suggestion of more accurate values. However, OSS goes one step further in that the bug reports are themselves public (often on a newsgroup or mailing list), so that the identifier of the error and the proposer of a potential solution are not necessarily the same person. Another consequence is that there can be multiple solutions proposed for the same problem, and it is the system maintainer’s job to make decisions about which patches to accept as part of the next release. However, with a policy of very frequent release, it is clear to users which problems have been ‘solved’ and which remain to be addressed by whoever wishes to take on the task.

Raymond proposes the idea of hacker culture as a gift economy to account for the apparently altruistic behavior of the participants (Raymond 1998b). Although seemingly very remote from the world of museums and even museum informatics, we believe that it can shed light on the mechanisms that are necessary to encourage collaborative error detection and recovery, as well as serving as an existence proof for the plausibility of the idea. The gift economy analysis emphasizes that the altruistic behavior needs to be acknowledged, probably publicly. Furthermore, the community of testers and developers needs to be acknowledged, along with continuing feedback that overall improvement is resulting from the collective efforts of that community.

Another recent example of collaborative effort is the SETI@Home project (Hayes 1998), where participants world-wide download screensaver software that conducts analysis of radio astronomy data during a PC’s idle time. The SETI page includes statistics of the relative contributions made by different universities and commercial organizations, presumably as an incentive (using a subtle form of implicit advertising) for giving away processor time (or rather, for going to the trouble of downloading the software).

It is also possible to offer extrinsic rewards, as in the case of the FIXIT proposal. This is important in a pay-per-item data environment. It is also possible to offer credits not just for the undesired erroneous item but for future use to encourage response. A database such as a museum's that is free to use could offer other material incentives, such as discounts on museum shop merchandise. This would require a budget for error detection, but maybe that is no bad thing.

The above remains labor intensive and has additional costs in response activity. It remains to be seen if it is justifiable in cost compared with alternative error checking strategies. That would require the development of a prototype and measurement of its operational costs and effectiveness. This paper is concerned with considering the features that such a prototype should have to maximize its chances of success.

The importance of error metadata: beyond the amnesiac database

Once we move from a desire solely to get data right the first time towards a consideration of continual change in data quality, it becomes important to consider the issues of error metadata. When considering any use of advanced technology it is useful to look at existing practice as a source of powerful ideas and metaphors. Some activities one may wish to replicate directly, and some to keep in spirit but modify heavily in practice in the light of the new opportunities afforded by that technology. In this case, we draw our inspiration from the use of an earlier data recording technology, the index card. There are many instances where index cards reveal not just the information that was first recorded on them, but how that information has been modified. The information on the cards is annotated and added to over time. Often this information includes details of the date and who made the change or addition, and even why. The resultant information, although often created for reasons of expediency, can in certain cases be more useful than that conventionally obtained by just entering the final values on the index card into a database and discarding the older, incorrect ones. All this error metadata is potentially of great scholarly interest. It is also of great practical interest in managing error checking. It helps to know who has considered the data, when, and why, and what they did. This metadata can help in deciding how much credit to give to the resultant data and how much checking time to devote to it (Baker 1994).

We can contrast this with a conventional (computer) database. Certain people have permission to make changes to the data, but this is often by overwriting, obliterating the old information, often with no record of who did it, when, and why. It would seem that such a database ironically suffers from amnesia about itself. We propose that each field in the database should itself have associated with it data about its creation, checking, and modification. This would include date and time information, who was involved and a continuing record of the previous values that have subsequently been updated. Rothenberg (1996) outlines a range of metadata fields that can be used in improving data quality. Our proposals, focusing on aspects of usage and evolution, form a subpart of his wider analysis.

Based on some of our preliminary suggestions, the Spurlock has implemented two systems for tracking error metadata. First, a "modification history" is maintained for each record in the computer system that tracks the digital evolution of the data contained within the record. Every time a modification is made to any field in any record, this fact is recorded in the modification history for that record. Each entry in the modification history logs the nature of the change, who did it, when and why. By tracking the contents of this history throughout the lifespan of a record, the museum’s Registrar, among others, is able to gain an understanding of the interaction over time between museum staff members and the computer system. Second, a "problems field" in each record exists for the recording of any inconsistencies, difficulties, or uncertainties that a user of the system might encounter. When a museum employee finds a problem of any type with a given record, he or she records this in the problem field and sets a flag that alerts the museum’s Registrar to the existence of a new problem with this record.
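As an illustration only, the two mechanisms described above might be sketched as follows. This is a minimal sketch and not the Spurlock's actual implementation: the class names, field names, and sample values are all hypothetical, and a real system would persist this information in the database rather than in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class Modification:
    """One entry in a record's modification history (hypothetical schema)."""
    field_name: str
    old_value: Any
    new_value: Any
    who: str
    when: datetime
    why: str


@dataclass
class Record:
    """A catalogue record that remembers its own changes."""
    values: dict = field(default_factory=dict)
    history: list = field(default_factory=list)
    problems: list = field(default_factory=list)
    problem_flag: bool = False  # set when a new problem awaits the Registrar

    def modify(self, field_name, new_value, who, why):
        """Change a field, logging the old value instead of obliterating it."""
        self.history.append(Modification(
            field_name, self.values.get(field_name), new_value,
            who, datetime.now(timezone.utc), why))
        self.values[field_name] = new_value

    def report_problem(self, who, note):
        """Record an inconsistency or uncertainty and raise the flag."""
        self.problems.append((who, datetime.now(timezone.utc), note))
        self.problem_flag = True

    def previous_values(self, field_name):
        """Earlier values of one field, oldest first (None = no prior value)."""
        return [m.old_value for m in self.history if m.field_name == field_name]


# Illustrative usage with invented data.
pot = Record()
pot.modify("culture", "Sumerian", who="student_a", why="initial entry")
pot.modify("culture", "Akkadian", who="registrar", why="reclassified after review")
pot.report_problem("student_b", "date range looks too early")
```

Because every change is appended rather than overwritten, the earlier value "Sumerian" and the identity of the person who entered it remain available for later analysis.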

Looking to future developments, we would advocate extending this error metadata provision from the record level down to the field level. Furthermore, extending the index card idea into the digital world, we could add information about the usage of the database, such as incrementing a counter every time a user (or a certain category of user) looked at a given item. This is much harder to do in the physical world, although its inspiration is the different physical condition of documents: papers and index cards that are used a lot become dirtier and dog-eared. This idea, termed "edit wear and read wear" by Hill & Hollan (1992), can contribute to improving data quality. As a working hypothesis we propose that heavily used data is of higher quality, because the chances of error detection, reporting and correction are higher than for rarely used data. If this is so, then usage information, although unlikely to be definitive in identifying errors, can contribute to a probabilistic approach to data quality management as outlined later.

Once the usage data is recorded (provided it is inexpensive and acceptable to do so) it is possible to test the hypothesis by asking questions such as "are records that have been looked at frequently now of higher quality than those that are looked at rarely?" (N.B. 'looked at' is itself a problematic phrase, perhaps defined as 'has appeared on the screen of a user').

There are many different ways in which such metadata might be recorded and considerable variation in the quantity that might be retained. For example, for each field one might provide an ever growing list of its earlier values, when they were entered and who entered them, along with an optional text field for annotations about the updating process. A separate data table could contain details about all the people involved in data entry, their status and evolving experience over time, thus allowing a judgement about someone’s skills at the time of working on any given field. Immediately we see how issues of privacy can arise. Under this proposed arrangement, all of a person’s mistakes in entering information are recorded in perpetuity. In order to be acceptable, the approach of recording and preserving all changes forever may need to be modified. In addition, controls will be needed on who has access to this information, which may be perceived as sensitive.

Similarly there are many different kinds of usage data that might be collected. The simplest would be a counter of the number of data accesses of this field since its last modification. Alternatively, the timestamp of every data access could be permanently recorded and used in data quality and other usage based analyses, including analyses of the evolution of usage over time. Finally, if the use of a database requires logging in, it would be possible to record a user identity (or anonymized unique user identifier) along with each usage incident. (N.B. This last violates privacy too much to be acceptable in all but the most security-conscious applications.)
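The simplest of these options, a counter of accesses since the last modification, might look like the following minimal sketch. The class and attribute names are hypothetical, and "read" here stands in for whatever definition of "looked at" is eventually adopted.

```python
class TrackedField:
    """A field value carrying 'read wear': views since it was last changed."""

    def __init__(self, value):
        self._value = value
        self.reads_since_modified = 0  # reset on every modification
        self.total_reads = 0           # never reset

    def read(self):
        """Return the value and count this as one viewing."""
        self.reads_since_modified += 1
        self.total_reads += 1
        return self._value

    def write(self, value):
        """Replace the value; earlier viewings say nothing about the new one."""
        self._value = value
        self.reads_since_modified = 0


# Illustrative usage with invented data.
region = TrackedField("East Africa")
region.read()
region.read()
region.write("South Kenya")
current = region.read()
```

No user identity is stored in this version, which sidesteps the privacy concern raised above at the cost of being unable to distinguish many viewings by one person from viewings by many people.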

All the above ideas need further research to explore their computational and social consequences as well as the degree to which they support cost effective data quality improvement. Of all the costs, that of storing what appears to be substantially more information than is conventionally recorded for a database is perhaps of the least concern. The costs of computer storage continue to fall rapidly and compared to image data (such as one or more high quality photographs of an artifact), this proposed additional textual information will likely be only a marginal increase. The ongoing costs of collection should also be relatively low, since most of the data described can be collected automatically by a suitably designed system. The main additional costs are in adapting the systems in place to include the collection of this information, and the development and use of the systems to make use of the data, including ensuring agreed privacy safeguards.

Data Drilling in the error probability field

Data mining is a well established technique for discovering useful unexpected features in existing data. We can consider the search for errors in our data as an example of data mining. If we extend the mining metaphor, the spontaneous altruistic error reporting described above can be considered as analogous to the happenstance of a natural outcropping. This can then be used as an indicator of a cluster of linked errors. Drilling for oil is an expensive endeavor; one attempts to improve one's probabilities of a strike by accumulating multiple pieces of contributing evidence. In the same way, once an error is brought to one’s notice it can be used to adjust the probability field of likely error clusters.

For example, if one Mesopotamian pot is mis-classified as Sumerian instead of Akkadian, one begins to suspect other Mesopotamian pots. In particular, one suspects those classified at about the same time and by the same person. One is also especially suspicious of data items that have been rarely looked at. The linkages between the data that adjust the probabilities can be both aspects of the data itself (pot, Sumerian) and about its metadata (who entered it, when, if it has ever been checked for this or any other error). The probabilistic nature of this approach is important to consider. One could in theory check everything, but usually resources do not permit this. It becomes useful to maximize the potential of identifying errors even when this means that not all errors are detected. Multiple factors can contribute to this information.

For example, in recording details about the geographical provenance of a given artifact, students are asked to enter the artifact’s continent, country, region, city, etc. When undertaking a random check of artifacts from Africa, it was noticed that one artifact had in the fields continent, country and region, the values Africa, Kenya and East Africa, respectively. This is incorrect. The region field should be a sub-part of a country (e.g. South Kenya), not a sub-part of a continent. Subsequent analysis led to the discovery that the same individual was responsible for this misinterpretation of the region field in other instances. This fact led the museum’s registrar to suspect all geographical entries by this individual and allowed her to search the system for all such entries by this individual and correct them. It is easy to see the cause of the misconception, and consequently worth checking (if time permits) whether the same error occurs elsewhere.

Another example can be found in the analysis of records pertaining to maps. It was discovered that some students cataloguing these items incorrectly recorded in these same geographic provenance fields the details of the country depicted on the map rather than the geographical information about where the map was created (i.e. a map of West Africa made in Spain would have a value of Africa in the continent field rather than Europe). As in the previous example, once one such error was detected, suspicions were raised about other records. Those map records entered by the same student seem most likely to be in error, followed by (to a lesser extent) all records about maps and other artifacts that depict countries, which could cause the same kind of confusion. The example of the misclassified Paleolithic chipped stone tools discussed above shows the same issue. Once it was realized that one given record of this type was misclassified as a weapon, it was necessary to reassess all similar artifacts.

The examples serve to show how the data can be regarded as a probability field, with each data item having a certain chance of being incorrect. It is not the case that each item has the same error probability. We can draw on information about prior experience to ascribe higher probabilities based on a range of factors, such as how difficult it is to obtain the correct value for a certain category of artifact, and how expert the person making those decisions is. As new information is obtained (such as discovering the map misconception), the probability field shifts. We can use the information in the probability field to help decide where to allocate scarce expert error-checking resources, that is, where to drill for errors.
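To make the idea of a shifting probability field concrete, here is a minimal sketch in which each record carries a numeric suspicion score that is raised when a linked record is found to be wrong. Everything specific in it is an assumption made for illustration: the record layout, the two linkage rules (same person, same artifact type), the additive update, and the weights 0.3 and 0.1.

```python
def raise_suspicion(records, scores, bad_id, same_person=0.3, same_type=0.1):
    """Raise the scores of records linked to a record found to be wrong.

    records: dict of record id -> {"type": ..., "entered_by": ...}
    scores:  dict of record id -> suspicion score in [0, 1], updated in place
    """
    bad = records[bad_id]
    for rec_id, rec in records.items():
        if rec_id == bad_id:
            continue
        bump = 0.0
        if rec["entered_by"] == bad["entered_by"]:
            bump += same_person  # same cataloguer may share the misconception
        if rec["type"] == bad["type"]:
            bump += same_type    # same kind of artifact invites the same error
        scores[rec_id] = min(1.0, scores[rec_id] + bump)
    return scores


# Illustrative usage with invented data: map m1 was found to be wrong.
records = {
    "m1": {"type": "map", "entered_by": "student_a"},
    "m2": {"type": "map", "entered_by": "student_a"},
    "m3": {"type": "map", "entered_by": "student_b"},
    "p1": {"type": "pot", "entered_by": "student_b"},
}
scores = {rec_id: 0.1 for rec_id in records}
raise_suspicion(records, scores, "m1")

# Where to "drill" next: the remaining records, most suspicious first.
to_check = sorted((r for r in scores if r != "m1"), key=scores.get, reverse=True)
```

In this toy case the other map entered by the same student comes first, followed by the map entered by someone else, with the unrelated pot last.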

The fact that an item has been checked and found to be correct is also useful information to record. It reduces the error probability for that item (but not to zero, since even experts can make mistakes). It means that subsequently, if it becomes desirable and possible to do a more thorough check, one can choose to ignore certain records that have been checked recently and by a respected person. In addition, the usage information outlined above can be applied to reduce the error probability for items that have been looked at by many people.
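One hedged way to express both reductions is to scale an item's numeric suspicion score rather than reset it. The multiplicative form, the function names, and the parameter values below are assumptions chosen for illustration; the text requires only that a clean check lowers the value without reaching zero and that heavy usage lowers it further.

```python
def after_clean_check(score, checker_reliability=0.9):
    """Lower a suspicion score after a check that found nothing wrong.

    A reliability below 1.0 keeps the result above zero for any positive
    score, reflecting that even experts can make mistakes.
    """
    return score * (1.0 - checker_reliability)


def usage_discount(score, views, per_view=0.05):
    """Lower a suspicion score slightly for each viewing without a report."""
    return score * (1.0 - per_view) ** views


# Illustrative usage with invented numbers.
checked = after_clean_check(0.5)            # checked by a trusted person
well_used = usage_discount(0.5, views=10)   # seen ten times, no complaints
untouched = usage_discount(0.5, views=0)    # never looked at
```

A more respected checker would be modeled by a reliability closer to 1.0, and a less informative notion of "viewing" by a smaller per-view discount.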

Implications for design

This analysis of the Spurlock Museum’s data quality has raised two interlinked collaborative approaches.

Both these approaches have strong parallels with the work on collaborative recommender systems, which can use both explicit and implicit rating systems (Nichols 1997). As with recommender systems, explicit ratings (explicit error reports) are more valuable, being more targeted, but require greater effort from the person doing the reporting. Implicit recommendations have lower costs: the user need do nothing, other than consent to her usage data being employed in this manner. As with recommender systems, some similar design implications follow. Explicit reporting of errors should be as easy as possible, since it relies on the altruism of the user. Ideally, we should make the mechanism involve a single button click. If the user wishes to say more than "there’s a problem here," there should be the option of explaining the nature of the problem, and suggesting solutions. However, it should always be easy to cut the interaction short. Moreover, we need to explore ways of rewarding the user for this effort. Once an error report has been sent, it needs to be forwarded to the appropriate person for assessment and possible action. This links to the use of quality metadata and implicit usage information to inform adjustments to the error probability field, and consequently additional error checking activities.

The probabilistic approach to error checking requires a set of rules for linking related data values so that a change in information about the data quality of a single value can affect others. It will require further investigation to develop sets of these rules, some of which will be widely applicable, some of which can be easily tailored to a given database, and some of which must be hand crafted or even automatically derived from usage information using data mining techniques for an individual database. Mechanisms will need to be developed to accumulate the multiple sources of evidence of error. We have used the term ‘probability’ to describe the resulting single numeric value, but that is only a fiction for ease of use, a very loose description of the purpose of the value, not a specification of its nature. We have for simplicity ignored issues of conditional probabilities.

In the simplest case, these rules need not be codified. Instead they can remain implicit in the actions of the skilled checker of the quality metadata. That is, after spotting an instance of the ‘map-of versus map-from’ misconception, the checker decides to re-examine other maps with values entered by the same person, rather than the system suggesting this. Even after the identification of some of these rules and their direct incorporation into the system, we do not believe that we should aspire to a totally automated process for identifying the most promising error sites. The expertise of the checker must still play a role involving the opportunistic and creative use of ‘likeness’ measures to generate candidates for subsequent error checking.

The system needs to provide useful and usable visualizations of the adjusting probability field (raising the research question of whether a simple textual listing of the currently most suspicious values is useful or even sufficient). It also needs to support the creation, adaptation and testing of the linkage rules themselves. This will need to be easy to use, preferably with an interface that clearly indicates the consequences of the proposed rule, in the same way that a spreadsheet interface indicates how the calculation rules produce values in any given cell (Nardi & Miller 1991).

In addition to supporting the identification and prioritization of subsequent data items to check, the system needs to support the management of the overall checking activity. This includes

Finally, the whole process of probabilistic error management imposes an additional cost on data quality control. Time spent on using such a system is time that could have been spent checking every data item in turn. Thus, operation costs must be minimized and evidence provided that the use of such a system is itself cost effective in improving overall data quality.

Wider implications

Although our examples derive from a study of a particular museum, we believe that they have much wider applications. Many of the data problems are analogous to problems that arise with any database. We believe that the solutions are also potentially transferable. Museums do enjoy considerable goodwill and altruistic help from the scholarly community. This of course will not apply to commercial databases, and commercial incentives will be required rather than relying solely on altruism. However, there is much to be made of taking advantage of the especial interest that individuals have in their own data. It is a classic anecdote that one of the first things many authors do when encountering a new bibliographic database is to look up their own publications. We may be confident that they will be motivated to report any errors or omissions encountered, provided that it is easy to do. In the same way, it should be easy for customers to report errors in their own data to the company, even providing the updated data. This is increasingly being done, especially with changing addresses (Orr 1998). Within the organization, the users and manipulators of the data can serve as an important part of improving overall data quality. This requires a combination of technological and organizational innovation. Not only must it be easy to flag errors and possible remediations, perhaps as outlined above, but the organization must acknowledge this as valuable work by the employee with a definite (albeit small) time cost. Unless the work that goes into error checking is recognized as part of an employee’s job description, there is little incentive for that employee to bother, or to let it detract from their more visible and valued efforts. As a way of supporting such change, it would, for example, be relatively easy to generate monthly reports of errors correctly identified and remedied at least in part by the efforts of a given employee.
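Such a monthly report could be little more than a tally over the stored error reports. The sketch below assumes a hypothetical report layout with "reporter", "confirmed" and "resolved_on" keys; none of these names come from an existing system.

```python
from collections import Counter
from datetime import date


def monthly_error_credit(reports, year, month):
    """Count confirmed, resolved error reports per reporter for one month."""
    return Counter(
        r["reporter"]
        for r in reports
        if r["confirmed"]
        and r["resolved_on"] is not None
        and (r["resolved_on"].year, r["resolved_on"].month) == (year, month)
    )


# Illustrative usage with invented data.
reports = [
    {"reporter": "alice", "confirmed": True, "resolved_on": date(1999, 3, 2)},
    {"reporter": "alice", "confirmed": True, "resolved_on": date(1999, 3, 20)},
    {"reporter": "bob", "confirmed": False, "resolved_on": date(1999, 3, 5)},
    {"reporter": "bob", "confirmed": True, "resolved_on": date(1999, 4, 1)},
    {"reporter": "carol", "confirmed": True, "resolved_on": None},
]
march_credit = monthly_error_credit(reports, 1999, 3)
```

Counting only confirmed and resolved reports credits the work that actually improved the data, rather than rewarding the volume of reports submitted.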

As with many new technologies, if this approach proves to be successful, it is likely to change the way that people in the field work. As noted above, the recording of error metadata raises issues of privacy: the person or persons responsible for the entry of a value of a field are not merely recorded, but if that field is subsequently found to be erroneous, the incorrect value and its author are stored in the metadata record in perpetuity. We have advocated this as a powerful resource for future data analysis, but it remains to be seen whether, regardless of its utility, it turns out to be acceptable. A naïve justification of the approach is that it is not substantially different from an old-fashioned index card in a small scholarly collection which may contain a number of typewritten and hand-written revisions and additions. Often these are signed and dated or are otherwise identifiable by correlating with other information about who was involved in cataloguing work in different years. However, appeals to non-computerized social conventions are not necessarily sufficient. People are understandably suspicious and fearful of the rapid searching possibilities of a computerized system that may be used against them, and therefore more concerned about privacy issues in a computerized environment than in its physical analogue. In the case of permanent recording of identified data entry we can envisage pressures for the obliteration of old information, particularly where that has the potential to be embarrassing for people in power.

The approach offers the possibility of making data available more quickly than is traditional, certainly in areas influenced by library and information science (LIS). The recognized importance of high data quality conventionally leads to a careful but necessarily slow approach to public availability of scholarly information. For example, cataloguing a newly acquired collection is slow and the catalogue is conventionally only released when complete and there is a general level of confidence in its quality. This is considered an important part of scholarly and professional practice. The disadvantage is the time it takes to get to that state, during which period the collection may be in various ways inaccessible. By contrast, a re-orientation towards continual data quality improvement opens up new possibilities (they need not be taken, but they are now possible). For example, a rapid, low quality first pass may be done (by hiring less skilled and cheaper labor, such as undergraduates) and the information made public, with the clear understanding, perhaps embodied in the system’s interface, that its quality is very far from ideal. This may sound anathema to a traditionalist scholar with a background in LIS or museum work, but it draws on the pragmatics of engineering and the release of alpha and beta versions of software to solicit feedback. It is an open, but testable, question whether scholars would prefer more immediate access to the data at the cost of far more inaccuracy. The approach also carries a risk: having produced the rapid ‘alpha release catalogue,’ an institution may be tempted to cut corners and subsequently fail to institute data quality improvements.

Conclusion

Based on an ongoing study of the re-inventorying practices of a particular museum, we have attempted to identify existing collaborative activities that can serve as the inspiration for future technical innovations. By linking those findings with current CSCW research, we have outlined a new collaborative approach to data quality management. We believe that this approach offers great potential not only for museum and bibliographic databases, but also for commercial ones. We intend to continue this work by a combination of observational studies and the development and testing of prototype systems embodying the features outlined in this article.

References

Armstrong, C. J. (1995): Database information quality. Library & Information Briefings, vol. 1995, no. 62, pp. 1-14.

Baker, Nicholson (1994): Discards. The New Yorker, vol. 70, no. 7, pp. 64-86.

Ballou, Donald P. and Tayi, Giri Kumar (1999): Enhancing data quality in data warehouse environments. Communications of the ACM, vol. 42, no. 1, pp. 73-78.

CIMI (Consortium for the Computer Interchange of Museum Information) (1999): Guide to Best Practice: Dublin Core. http://www.cimi.org/documents/meta_bestprac_final_ann.html

Davis, Carol C. (1989): Results of a survey on record quality in the OCLC database. Technical Services Quarterly, vol. 7, no. 2, pp. 43-53.

Fox, Christopher, Levitin, Anany, and Redman, Thomas (1994): The notion of data and its quality dimensions. Information Processing & Management, vol. 30, no. 1, pp. 9-19.

Gasser, Les (1986): The integration of computing and routine work. ACM Transactions on Information Systems, vol. 4, no. 3, pp. 205-225.

Hayes, Brian (1998): Collective Wisdom. The American Scientist, vol. 86, no. 2, pp. 118-22. http://www.amsci.org/amsci/issues/Comsci98/compsci1998-03.html

Hill, William C. and Hollan, James D. (1992): Edit wear and read wear. CHI '92. Proceedings of the Conference on Human Factors in Computing Systems, Monterey, CA. ACM Press, pp. 3-9.

Hughes, John A., Randall, David, and Shapiro, Dan (1992): Faltering from ethnography to design. CSCW '92. Proceedings of the Conference on Computer-Supported Cooperative Work, Toronto. ACM Press, pp. 115-122.

Jacso, Peter (1993a): Searching for skeletons in the database cupboard Part I: errors of omission. Database, vol. 1993, no. February, pp. 38-49.

Jacso, Peter (1993b): Searching for skeletons in the database cupboard Part II: errors of commission. Database, vol. 1993, no. April, pp. 30-36.

Lancaster, F.W. (1986): Vocabulary control for information retrieval. Arlington, VA: Information Resources Press.

Medawar, Katia (1995): Database quality: a literature review of the past and a plan for the future. Program, vol. 29, no. 3, pp. 257-272.

Mintz, Anne P. (1990): Quality control and the zen of database production. Online, vol. 1990, no. November, pp. 15-23.

Nardi, Bonnie A. and Miller, J.R. (1991): Twinkling lights and nested loops: distributed problem solving and spreadsheet development. International Journal of Man-Machine Studies, vol. 34, no. 2, pp. 161-84.

Nichols, David M. (1997): Implicit rating and filtering. Proceedings of the Fifth DELOS Workshop on Filtering and Collaborative Filtering, Budapest, Hungary, 10-12 November. ERCIM, pp. 31-36.

Ojala, Marydee (1993): The Finns again wake: the wake up call to information quality. Information Today, vol. 10, no. 3, pp. 41-42.

Orr, Ken (1998): Data quality and systems theory. Communications of the ACM, vol. 41, no. 2, pp. 66-71.

Raymond, Eric S. (1998a): The Cathedral and the Bazaar. First Monday, vol. 3, no. 3. http://www.firstmonday.dk/issues/issue3_3/raymond/index.html

Raymond, Eric S. (1998b): Homesteading the Noosphere. First Monday, vol. 3, no. 10. http://www.firstmonday.dk/issues/issue3_10/raymond/

Raymond, Eric S. (1999): The magic cauldron.

Redman, Thomas C. (1996): Data quality for the information age. Norwood, MA: Artech House.

Redman, Thomas C. (1998): The impact of poor data quality on the typical enterprise. Communications of the ACM, vol. 41, no. 2, pp. 79-82.

Rothenberg, J (1996): Metadata to support data quality and longevity. Proceedings of the 1st IEEE Metadata Conference, Silver Spring, MD, 16-18 April.

Rouncefield, Mark, Hughes, John A., Rodden, Tom, and Viller, Stephen (1994): Working with "constant interruption": CSCW and the small office. In R. Furuta and C. Neuwirth (eds.): CSCW '94. Proceedings of the Conference on Computer-Supported Cooperative Work, Chapel Hill, NC. ACM Press, pp. 275-286.

Taylor, A.G. (1999): The Organization of Information. Englewood, CO: Libraries Unlimited.

Wand, Yair and Wang, Richard Y. (1996): Anchoring data quality dimensions in ontological foundations. Communications of the ACM, vol. 39, no. 11, pp. 86-95.

Wang, Richard Y. (1998): A product perspective on total data quality management. Communications of the ACM, vol. 41, no. 2, pp. 58-65.

Wang, Richard Y., Reddy, M. P., and Kon, Henry B. (1995): Toward quality data: an attribute-based approach. Decision Support Systems, vol. 13, no. 3,4, pp. 349-372.