In-Depth

Data Warehousing Special Report: Data quality and the bottom line

During the past 50 years, the developed world has moved from an industrial economy to an information economy. Companies now compete on their ability to absorb and respond to information, not just manufacture and distribute products. Intellectual capital and know-how are more important assets than physical infrastructure and equipment.

If information is the currency of the new economy, then data is a critical raw material needed for success. Just as a refinery takes crude oil and transforms it into numerous petroleum products, companies use data to generate a multiplicity of information assets. These assets form the basis of the strategic plans and actions that determine a firm's success.

Consequently, poor quality data can have a negative impact on the health of a company. If not identified and corrected early on, defective data can contaminate all downstream systems and information assets.

The problem with data is that its quality quickly degrades over time. Experts estimate that 2% of the records in a customer file become obsolete in a month because customers die, divorce, marry and move. In addition, data-entry errors, systems migrations and changes to source systems, among other things, generate bucketloads of errors. And as organizations fragment into different divisions and units, interpretations of data elements mutate to meet local business needs. A data element that one individual finds valuable may be nonsense to an individual in a different group.
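
As a rough illustration (not a figure from the survey), the short calculation below assumes the 2%-per-month rate holds steady; under that assumption, roughly a fifth of a customer file goes stale within a year.

```python
# Rough illustration: if 2% of customer records become obsolete each month
# (an assumed, steady rate), the share still accurate after a year is 0.98**12.
monthly_decay = 0.02
months = 12

still_accurate = (1 - monthly_decay) ** months
print(f"Still accurate after a year: {still_accurate:.1%}")      # ~78.5%
print(f"Gone stale after a year:     {1 - still_accurate:.1%}")   # ~21.5%
```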

TDWI estimates that poor quality customer data costs U.S. businesses a staggering $611 billion a year in postage, printing and staff overhead (TDWI estimates based on cost savings cited by survey respondents and others who have cleaned up name and address data, combined with Dun & Bradstreet counts of U.S. businesses by number of employees). Frighteningly, the real cost of poor quality data is much higher. Organizations can frustrate and alienate loyal customers by incorrectly addressing letters or failing to recognize them when they call, or visit a store or Web site. Once a company loses its loyal customers, it loses its base of sales and referrals, as well as future revenue potential.

Given the business impact of poor quality data, it is bewildering to see the casual way in which most companies manage this critical resource. Most companies do not fund programs designed to build quality into their data in a proactive, systematic and sustained manner. According to TDWI's Data Quality Survey, almost half of all firms have no plan for managing data quality.

Part of the problem is that most organizations overestimate the quality of their data and underestimate the impact that errors and inconsistencies can have on their bottom line. Almost half of the companies that responded to our survey rate the quality of their data as "excellent" or "good." Yet more than one-third of respondents admit that the quality of their data is "worse than the organization thinks."

Although some firms understand the importance of high-quality data, most are oblivious to the true business impact of defective or substandard data. Thanks to a raft of new information-intensive strategic business initiatives, executives are beginning to wake up to the real cost of poor quality data. Many have bankrolled high-profile IT projects in recent years -- data warehousing, CRM and e-business projects -- that have failed or been delayed due to unanticipated data-quality problems.

For example, in 1996, FleetBoston Financial Corp. (then Fleet Bank) in New England undertook a much-publicized $38 million CRM project to pull together customer information from 66 source systems. Within three years, the project was drastically downsized and the lead sponsors and technical staff were let go. A major reason the project came unraveled was the team's failure to anticipate how difficult and time-consuming it would be to understand, reconcile and integrate data from 66 different systems.

According to TDWI's Industry Study 2000 survey, the top two technical challenges firms face when implementing CRM solutions are "managing data quality and consistency" (46%) and "reconciling customer records" (40%). Considering that 41% of CRM projects were "experiencing difficulties" or "a potential flop," according to the same study, it is clear that the impact of poor data quality in CRM is far-reaching ("Harnessing Customer Information for Strategic Advantage: Technical Challenges and Business Solutions").

Data warehousing, CRM and e-business projects often expose poor quality data because they require companies to extract and integrate data from multiple operational systems. Data that is sufficient to run payroll, shipping or accounts receivable is often peppered with errors, missing values and integrity problems that do not show up until someone tries to summarize or aggregate the data.

Also, since operating groups often use different rules to define and calculate identical elements, reconciling data from diverse systems can be a huge, and sometimes insurmountable, obstacle. Sometimes the direct intervention of the CEO is the only way to resolve conflicting business practices, or political and cultural differences.

Every firm, if it looks hard enough, can uncover a host of costs and missed opportunities caused by inaccurate or incomplete data. Consider the following:

* A telecommunications firm lost $8 million a month because data-entry errors led to incorrectly coded accounts, preventing bills from being sent out.

* An insurance company lost hundreds of thousands of dollars annually in mailing costs due to duplicate customer records.

* An information services firm lost $500,000 annually and alienated customers because it repeatedly recalled reports sent to subscribers due to inaccurate data.

* A large bank discovered that 62% of its home-equity loans were being calculated incorrectly, with the principal getting larger each month.

* A health insurance company in the Midwest delayed a decision support system for two years because the quality of its data was "suspect."

* A global chemical company discovered it was losing millions of dollars in volume discounts in procuring supplies because it could not correctly identify and reconcile suppliers on a global basis.

* A regional bank was unable to calculate customer and product profitability due to missing and inaccurate cost data.

In addition, new industry and government regulations, such as the Health Insurance Portability and Accountability Act (HIPAA) and Bank Secrecy Act, are upping the ante. Organizations are now required to carefully manage customer data and privacy or face penalties, unfavorable publicity and loss of credibility.

What can go wrong?
The sources of poor quality data are myriad. Leading the pack are data-entry processes, which produce the most frequent data quality problems, followed closely by systems interfaces.

Not surprisingly, survey respondents cite data-entry errors by employees as the most common source of data defects. Examples of errors include misspellings, transposition of numerals, incorrect or missing codes, data placed in the wrong fields and unrecognizable names, nicknames, abbreviations or acronyms. These types of errors are increasing as companies move their businesses to the Web and allow customers and suppliers to enter data about themselves directly into operational systems.

Lack of validation routines. Interestingly, many data-entry errors can be prevented through the use of validation routines that check data as it is entered into Web, client/server or terminal-host systems. Respondents to the TDWI survey mentioned a "lack of adequate validation" as a source of data defects, noting this grievance in the "Other" category.
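
As a minimal sketch of the kind of validation routine respondents had in mind, the Python fragment below checks a handful of fields at entry time. The field names, patterns and messages are illustrative assumptions, not taken from any system mentioned in the survey.

```python
import re

# Illustrative entry-time validation -- field names and rules are assumptions.
ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")
PHONE_PATTERN = re.compile(r"^\d{3}-\d{3}-\d{4}$")

def validate_record(record: dict) -> list:
    """Return a list of human-readable problems found in one entry form."""
    problems = []
    if not record.get("last_name", "").strip():
        problems.append("last_name is missing")
    if not ZIP_PATTERN.match(record.get("zip", "")):
        problems.append(f"zip '{record.get('zip', '')}' is not a valid ZIP code")
    if not PHONE_PATTERN.match(record.get("phone", "")):
        problems.append(f"phone '{record.get('phone', '')}' is not in NNN-NNN-NNNN form")
    return problems

# A record with a missing name and a four-digit ZIP is rejected
# before it ever reaches the operational system.
print(validate_record({"last_name": "", "zip": "0210", "phone": "617-555-0199"}))
```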

Valid, but not correct. But even validation routines cannot catch typos where the data represents a valid value. Although a person may mistype a telephone number, the number recorded is still valid -- it is just not the right one. The same holds true for social security numbers, vehicle identification numbers, part numbers and last names. Database integrity rules can catch some of these errors, but firms need to create complex business rules to catch the rest.

Mismatched syntax, formats and structures. Data-entry errors are compounded when organizations try to integrate data from multiple systems. For example, corresponding fields in each system may use different syntax (first-middle-last name vs. last-first-middle name), data formats (a 6-byte vs. a 4-byte date field) or code structures (male-female vs. m-f vs. 1-2). In these cases, either a data-cleansing or an ETL tool needs to map these differences to a standard format before serious data cleanup can begin.
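
The following is a minimal sketch of the kind of mapping step such a tool performs, assuming a source system that stores last-first-middle names, six-character YYMMDD dates and 1/2 gender codes; the field names and conventions are invented for illustration.

```python
from datetime import datetime

# Illustrative standardization step: map one source system's conventions onto
# a common target format before cleansing begins. The source formats here are
# assumptions for the sketch, not a description of any particular system.
GENDER_MAP = {"1": "M", "2": "F", "m": "M", "f": "F", "male": "M", "female": "F"}

def standardize(source_row: dict) -> dict:
    last, first, *middle = source_row["name"].split(",")  # "Doe,Jane,Q" style
    return {
        "first_name": first.strip(),
        "middle_name": middle[0].strip() if middle else "",
        "last_name": last.strip(),
        # six-character YYMMDD date widened to an unambiguous ISO date
        "birth_date": datetime.strptime(source_row["dob"], "%y%m%d").date().isoformat(),
        "gender": GENDER_MAP.get(source_row["gender"].lower(), "U"),
    }

print(standardize({"name": "Doe,Jane,Q", "dob": "751203", "gender": "2"}))
```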

Unexpected changes in source systems. Perhaps a more pernicious problem is structural changes that occur in source systems. Sometimes these changes are deliberate, such as when an administrator adds a new field or code value and then neglects to notify the managers of connecting systems about the changes. In other cases, front-line people reuse existing fields to capture new types of information that were not anticipated by the application designers.

Spiderweb of interfaces. Because of the complexity of systems architectures today, changes to source systems are easily and quickly replicated to many other systems, both internal and external. Most systems are connected through a spiderweb of interfaces to other systems. Updating these interfaces is time-consuming and expensive, and many changes slip through the cracks and "infect" other systems. Thus, changes in source systems can wreak havoc on downstream systems if adequate change management processes are not in place.

Lack of referential integrity checks. It is also true that target systems do not adequately check the integrity of the data they load. For example, data warehouse administrators often turn off referential integrity when loading the data warehouse for performance reasons. If source administrators change or update tables, this can create integrity problems that are not detected.
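
A minimal sketch of a post-load integrity audit appears below, assuming a simple fact table and customer dimension; the table and column names are invented for illustration, and a real warehouse would run the equivalent check in SQL against far larger volumes.

```python
# Post-load referential-integrity audit for a warehouse that loads with
# database constraints switched off. Names are assumptions for illustration.
def orphaned_keys(fact_rows, dimension_rows, fact_key, dim_key):
    """Return foreign-key values in the fact table with no matching dimension row."""
    valid_keys = {row[dim_key] for row in dimension_rows}
    return {row[fact_key] for row in fact_rows if row[fact_key] not in valid_keys}

sales = [{"customer_id": 101, "amount": 40.0},
         {"customer_id": 999, "amount": 12.5}]   # 999 has no dimension record
customers = [{"customer_id": 101, "name": "Acme Corp"}]

print(orphaned_keys(sales, customers, "customer_id", "customer_id"))  # {999}
```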

Poor system design. Source or target systems that are poorly designed can create data errors. As companies rush to deploy new systems, developers often skirt fundamental design and modeling principles, which leads to data integrity problems down the road.

Data conversion errors. In the same vein, data migration or conversion projects can generate defects, as can the ETL tools that pull data from one system and load it into another. Although systems integrators may convert databases, they often fail to migrate the business processes that govern the use of data. In addition, programmers may not take the time to understand source or target data models, and may therefore write code that introduces errors. One change in a data migration program or system interface can generate errors in tens of thousands of records.

The fragmentation of definitions and rules. A much bigger problem comes from the fragmentation of our organizations into a multitude of departments, divisions and operating groups, each with its own business processes supported by distinct data management systems. Slowly and inexorably, each group begins to use slightly different definitions for common data entities -- such as "customer" or "supplier" -- and apply different rules for calculating values, such as "net sales" and "gross profits." Add mergers, acquisitions and global expansion into countries with different languages and customs, and you have a recipe for a data-quality nightmare.

The problems that occur in this scenario have less to do with accuracy, completeness, validity or consistency, than with interpretation and protecting one's "turf." That is, people or groups often have vested interests in preserving data in a certain way even though it is inconsistent with the way the rest of the company defines data.

For example, many global companies squabble over a standard for currency conversions. Each division in a different part of the world wants the best conversion rate possible. And even when a standard is established, many groups skirt the spirit of the standard by converting their currencies at the most opportune times, such as when a sale is posted rather than when the money is received. This type of maneuvering wreaks havoc on a data warehouse that tries to accurately measure values over time.

Slowly changing dimensions. Similarly, slowly changing dimensions can result in data-quality issues, depending on the expectations of the user viewing the data. For example, an analyst at a chemical company wants to calculate the total value of goods purchased from Dow Chemical for the past year. But Dow recently merged with Union Carbide, from which the chemical company also purchases materials.

In this situation, the data warehousing manager needs to decide whether to roll up purchases made to Dow and Union Carbide separately, combine the purchases from both firms throughout the entire database, or combine them only after the date the two companies merged. Whatever approach the manager takes, it will work for some business analysts and alienate others.
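
One way to keep that decision visible is to encode the roll-up policy explicitly rather than bury it in load scripts. The sketch below is illustrative only: the supplier hierarchy, merger date, purchase dates and amounts are all assumptions made for the example.

```python
from datetime import date

# Illustrative roll-up policy for a supplier dimension after a merger.
# The hierarchy, dates and amounts are invented for the sketch.
MERGER_DATE = date(2001, 2, 6)               # assumed effective date
PARENT = {"Union Carbide": "Dow Chemical"}   # post-merger hierarchy

def supplier_for_rollup(supplier, purchase_date, policy="merge_after_date"):
    if policy == "keep_separate":
        return supplier
    if policy == "merge_all_history":
        return PARENT.get(supplier, supplier)
    # default: attribute purchases to the parent only after the merger date
    if purchase_date >= MERGER_DATE:
        return PARENT.get(supplier, supplier)
    return supplier

purchases = [("Union Carbide", date(2000, 11, 1), 500_000),
             ("Union Carbide", date(2001, 6, 1), 750_000),
             ("Dow Chemical", date(2001, 6, 1), 900_000)]

totals = {}
for supplier, when, amount in purchases:
    key = supplier_for_rollup(supplier, when)
    totals[key] = totals.get(key, 0) + amount
print(totals)  # pre-merger Union Carbide spend stays separate; later spend rolls up to Dow
```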

In these cases, data quality is a subjective issue. Users' perception of data quality is often colored by the range of available data resources they can access. Where there is "competition" -- another data warehouse or data mart that covers the same subject area -- knowledge workers tend to be pickier about data quality, said Michael Masciandaro, director of decision support at Rohm & Haas.

Delivering high-quality data
Given the ease with which data defects can creep into systems, especially data warehouses, maintaining data quality at acceptable levels takes considerable effort and coordination throughout an organization. "Data quality is not a project, it's a lifestyle," said David Wells, enterprise systems manager at the University of Washington and the developer of TDWI's full-day course on data cleansing ("TDWI Data Cleansing: Delivering High Quality Warehouse Data").

And progress is not always steady or easy. Improving data quality often involves exposing shoddy processes, changing business practices, gaining support for common data definitions and business rules, and delivering education and training. In short, fixing data quality often touches a tender nerve on the underbelly of an organization.

One top executive leading a data-quality initiative said, "Improving data quality and consistency involves change, pain and compromise. The key is to be persistent and get buy-in from the top. Tackle high ROI projects first, and use them as leverage to bring along other groups that may be resistant to change."

The University of Washington's Wells emphasizes that managing data quality is a never-ending process. Even if a company gets all the pieces in place to handle today's data-quality problems, there will be new challenges tomorrow. That is because business processes, customer expectations, source systems and business rules all change continuously.

To ensure high-quality data, firms need to gain broad commitment to data-quality management principles and develop processes and programs that reduce data defects over time. To lay the foundation for high-quality data, firms need to adhere to the methodology outlined below.

Step 1. Launch a data quality program. The first step to delivering high-quality data is to get top managers to admit there is a problem and take responsibility for it.

The best way to kickstart a data-quality initiative is to fold it into a corporate data stewardship or data administration program. These programs are typically chartered to establish and maintain consistent data definitions and business rules so the firm can achieve a "single version of the truth" and save time on developing new apps and looking for data.

Step 2. Develop a project plan. The next step is to develop a data-quality project plan or series of plans. A project plan should define the scope of activity, set goals, estimate ROI, perform a gap analysis, identify actions, and measure and monitor success. To perform these tasks, the team will need to dig into the data to assess its current state, define corrective actions and establish metrics for monitoring conformance to goals.

Step 3. Build a data-quality team. Organizations must assign or hire individuals to create the plan, perform initial assessment, scrub the data and set up monitoring systems to maintain adequate levels of data quality.

Steps 4 and 5. Review business processes and data architecture. Once there is corporate backing for a data-quality plan, the stewardship committee -- or a representative group of senior managers throughout the organization -- needs to review the company's business processes for collecting, recording and using data in the subject areas defined by the scope document. With help from outside consultants, the team also needs to evaluate the underlying systems architecture that supports the business practices and information flows.

Step 6. Assess data quality. After reviewing information processes and architectures, an organization needs to undertake a thorough assessment of data quality in key subject areas. The purpose of the assessment is to identify common data defects; create metrics to detect defects as they enter the data warehouse or other systems; and create rules or recommend actions for fixing the data. This can be long, arduous and labor-intensive work, depending on the scale and scope of the project, as well as the age and cleanliness of the source files.
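
A minimal profiling sketch along these lines is shown below: for each column it counts rows, missing values and distinct values so the team can spot gaps and suspicious code proliferation. The column names and sample rows are invented for illustration.

```python
from collections import Counter

# Illustrative column-level profiling for a data-quality assessment.
def profile(rows):
    report = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        values = [r.get(col) for r in rows]
        non_blank = [v for v in values if v not in (None, "", "N/A")]
        report[col] = {
            "rows": len(values),
            "missing": len(values) - len(non_blank),
            "distinct": len(set(non_blank)),
            "top_values": Counter(non_blank).most_common(3),
        }
    return report

sample = [{"state": "MA", "gender": "F"},
          {"state": "", "gender": "female"},   # blank state, inconsistent gender code
          {"state": "MA", "gender": "F"}]
for column, stats in profile(sample).items():
    print(column, stats)
```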

Step 7. Clean the data. Once the audit is complete, the job of cleaning the data begins. A fundamental principle of quality management is to detect and fix defects as close as possible to the source to minimize costs.

Prevention is the least costly response to defects, followed by correction and repair. Correction involves fixing defects in-house, while repair involves fixing defects that affect customers directly. Examples of defects that require repair include direct-mail pieces delivered to a deceased spouse or bugs in a commercially released software product.
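
As a minimal sketch of one such cleanup task, the fragment below collapses near-duplicate customer records of the kind that inflate mailing costs. The matching rule (identical normalized name plus ZIP code) is a deliberately simple assumption; commercial cleansing tools use far richer matching logic.

```python
# Illustrative de-duplication pass over a mailing list.
def normalize(name):
    # Uppercase, drop punctuation and collapse whitespace before comparing.
    return " ".join(name.upper().replace(".", "").replace(",", " ").split())

def dedupe(records):
    seen, unique = set(), []
    for rec in records:
        key = (normalize(rec["name"]), rec["zip"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

mailing_list = [{"name": "J. Smith", "zip": "02139"},
                {"name": "J Smith",  "zip": "02139"},    # duplicate after normalization
                {"name": "Jane Smith", "zip": "02139"}]
print(dedupe(mailing_list))   # the second record is dropped
```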

Step 8. Improve business practices. As mentioned earlier, preventing data defects involves changing attitudes and optimizing business processes. "A data quality problem is a symptom of the need for change in the current process," said Brad Bergh, a veteran database designer with Double Star Inc. Improving established processes often stokes political and cultural fires, but the payoff for overcoming these challenges is great.

Having a corporate data stewardship program and an enterprise-wide commitment to data quality is critical to making progress. Under the auspices of the CEO and the direction of corporate data stewards, a company can begin to make fundamental changes in the way it does business to improve data quality.

Step 9. Monitor data continuously. Organizations can quickly lose the benefits of their data preparation efforts if they fail to monitor data quality continuously. To do this, companies need to build a program that audits data at regular intervals, or just before or after data is loaded into another system such as a data warehouse. Companies then use the audit reports to measure their progress in achieving data-quality goals and complying with service-level agreements negotiated with business groups.

Service-level agreements should specify tolerances for critical data elements and penalties for exceeding those tolerances.
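
A minimal sketch of such a check is shown below, assuming a nightly audit that produces a defect rate per critical element; the element names and tolerances are illustrative assumptions, not recommended values.

```python
# Illustrative SLA check: compare measured defect rates for critical data
# elements against agreed tolerances. Names and thresholds are assumptions.
TOLERANCES = {
    "customer_address": 0.02,    # maximum acceptable defect rate
    "account_balance": 0.001,
}

def check_sla(measured_defect_rates):
    """Return the elements whose defect rate exceeds the agreed tolerance."""
    return [element
            for element, tolerance in TOLERANCES.items()
            if measured_defect_rates.get(element, 0.0) > tolerance]

# Output from a nightly audit, fed into the SLA report.
tonight = {"customer_address": 0.035, "account_balance": 0.0004}
print(check_sla(tonight))   # ['customer_address'] breaches its tolerance
```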

The above techniques are not easy to implement in every case, but they can bring a company closer to a strong foundation on which to build an information-based business. The key is to recognize that managing data quality is a perpetual endeavor. Companies must make a commitment to build data quality into all information management processes if they are going to reap the rewards of high-quality data -- and avoid the pitfalls caused by data defects.