Measuring Data Quality for Ongoing Improvement

OCDQ Radio is an audio podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

Listen to Laura Sebastian-Coleman, author of the book Measuring Data Quality for Ongoing Improvement: A Data Quality Assessment Framework, and I discuss bringing together a better understanding of what is represented in data, and how it is represented, with the expectations for use in order to improve the overall quality of data.  Our discussion also includes avoiding two common mistakes made when starting a data quality project, and defining five dimensions of data quality.

Laura Sebastian-Coleman has worked on data quality in large health care data warehouses since 2003.  She has implemented data quality metrics and reporting, launched and facilitated a data quality community, contributed to data consumer training programs, and has led efforts to establish data standards and to manage metadata.  In 2009, she led a group of analysts in developing the original Data Quality Assessment Framework (DQAF), which is the basis for her book.

Laura Sebastian-Coleman has delivered papers at MIT’s Information Quality Conferences and at conferences sponsored by the International Association for Information and Data Quality (IAIDQ) and the Data Governance Organization (DGO).  She holds IQCP (Information Quality Certified Professional) designation from IAIDQ, a Certificate in Information Quality from MIT, a B.A. in English and History from Franklin & Marshall College, and a Ph.D. in English Literature from the University of Rochester.

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Demystifying Data Science — Guest Melinda Thielbar, a Ph.D. Statistician, discusses what a data scientist does and provides a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, and correlation.
  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Data Profiling Early and Often — Guest James Standen discusses data profiling concepts and practices, and how bad data is often misunderstood and can be coaxed away from the dark side if you know how to approach it.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

Data In, Decision Out

This recent blog post by Seth Godin made me think about the data quality adage garbage in, garbage out (aka GIGO).

Since we live in the era of data deluge and information overload, Godin’s question about how much time and effort should be spent on absorbing data and how much time and effort should be invested in producing output is an important one, especially for enterprise data management, where it boils down to how much data should be taken in before a business decision can come out.

In other words, it’s about how much time and effort is invested in the organization’s data in, decision out (i.e., DIDO) process.

And, of course, quality is an important aspect of the DIDO process—both data quality and decision quality.  But, oftentimes, it is an organization’s overwhelming concerns about its GIGO that lead to inefficiencies and ineffectiveness around its DIDO.

How much data is necessary to make an effective business decision?  Having complete (i.e., all available) data seems obviously preferable to incomplete data.  However, with data volumes always burgeoning, the unavoidable fact is that sometimes having more data only adds confusion instead of clarity, thereby becoming a distraction instead of helping you make a better decision.

Although accurate data is obviously preferable to inaccurate data, less than perfect data quality can not be used as an excuse to delay making a business decision.  Even large amounts of high quality data will not guarantee high quality business decisions, just as high quality business decisions will not guarantee high quality business results.

In other words, overcoming GIGO will not guarantee DIDO success.

When it comes to the amount and quality of the data used to make business decisions, you can’t always get the data you want, and while you should always be data-driven, never only intuition-driven, eventually it has to become: Time to start deciding.

 

Related Posts

The Data-Decision Symphony

The Real Data Value is Business Insight

Is your data complete and accurate, but useless to your business?

DQ-View: From Data to Decision

TDWI World Conference Orlando 2010

The Real Data Value is Business Insight

Data Values for COUNTRY Understanding your data usage is essential to improving its quality, and therefore, you must perform data analysis on a regular basis.

A data profiling tool can help you by automating some of the grunt work needed to begin your data analysis, such as generating levels of statistical summaries supported by drill-down details, including data value frequency distributions (like the ones shown to the left).

However, a common mistake is to hyper-focus on the data values.

Narrowing your focus to the values of individual fields is a mistake when it causes you to lose sight of the wider context of the data, which can cause other errors like mistaking validity for accuracy.

Understanding data usage is about analyzing its most important context—how your data is being used to make business decisions.

 

“Begin with the decision in mind”

In his excellent recent blog post It’s time to industrialize analytics, James Taylor wrote that “organizations need to be much more focused on directing analysts towards business problems.”  Although Taylor was writing about how, in advanced analytics (e.g., data mining, predictive analytics), “there is a tendency to let analysts explore the data, see what can be discovered,” I think this tendency is applicable to all data analysis, including less advanced analytics like data profiling and data quality assessments.

Please don’t misunderstand—Taylor and I are not saying that there is no value in data exploration, because, without question, it can definitely lead to meaningful discoveries.  And I continue to advocate that the goal of data profiling is not to find answers, but instead, to discover the right questions.

However, as Taylor explained, it is because “the only results that matter are business results” that data analysis should always “begin with the decision in mind.  Find the decisions that are going to make a difference to business results—to the metrics that drive the organization.  Then ask the analysts to look into those decisions and see what they might be able to predict that would help make better decisions.”

Once again, although Taylor is discussing predictive analytics, this cogent advice should guide all of your data analysis.

 

The Real Data Value is Business Insight

The Real Data Value is Business Insight

Returning to data quality assessments, which create and monitor metrics based on summary statistics provided by data profiling tools (like the ones shown in the mockup to the left), elevating what are low-level technical metrics up to the level of business relevance will often establish their correlation with business performance, but will not establish metrics that drive—or should drive—the organization.

Although built from the bottom-up by using, for the most part, the data value frequency distributions, these metrics lose sight of the top-down fact that business insight is where the real data value lies.

However, data quality metrics such as completeness, validity, accuracy, and uniqueness, which are just a few common examples, should definitely be created and monitored—unfortunately, a single straightforward metric called Business Insight doesn’t exist.

But let’s pretend that my other mockup metrics were real—50% of the data is inaccurate and there is an 11% duplicate rate.

Oh, no!  The organization must be teetering on the edge of oblivion, right?  Well, 50% accuracy does sound really bad, basically like your data’s accuracy is no better than flipping a coin.  However, which data is inaccurate, and far more important, is the inaccurate data actually being used to make a business decision?

As for the duplicate rate, I am often surprised by the visceral reaction it can trigger, such as: “how can we possibly claim to truly understand who our most valuable customers are if we have an 11% duplicate rate?”

So, would reducing your duplicate rate to only 1% automatically result in better customer insight?  Or would it simply mean that the data matching criteria was too conservative (e.g., requiring an exact match on all “critical” data fields), preventing you from discovering how many duplicate customers you have?  (Or maybe the 11% indicates the matching criteria was too aggressive).

My point is that accuracy and duplicate rates are just numbers—what determines if they are a good number or a bad number?

The fundamental question that every data quality metric you create must answer is: How does this provide business insight?

If a data quality (or any other data) metric can not answer this question, then it is meaningless.  Meaningful metrics always represent business insight because they were created by beginning with the business decisions in mind.  Otherwise, your metrics could provide the comforting, but false, impression that all is well, or you could raise red flags that are really red herrings.

Instead of beginning data analysis with the business decisions in mind, many organizations begin with only the data in mind, which results in creating and monitoring data quality metrics that provide little, if any, business insight and decision support.

Although analyzing your data values is important, you must always remember that the real data value is business insight.

 

Related Posts

The First Law of Data Quality

Adventures in Data Profiling

Data Quality and the Cupertino Effect

Is your data complete and accurate, but useless to your business?

The Idea of Order in Data

You Can’t Always Get the Data You Want

Red Flag or Red Herring? 

DQ-Tip: “There is no point in monitoring data quality…”

Which came first, the Data Quality Tool or the Business Need?

Selling the Business Benefits of Data Quality

Is your data complete and accurate, but useless to your business?

Ensuring that complete and accurate data is being used to make critical daily business decisions is perhaps the primary reason why data quality is so vitally important to the success of your organization. 

However, this effort can sometimes take on a life of its own, where achieving complete and accurate data is allowed to become the raison d'être of your data management strategy—in other words, you start managing data for the sake of managing data.

When this phantom menace clouds your judgment, your data might be complete and accurate—but useless to your business.

Completeness and Accuracy

How much data is necessary to make an effective business decision?  Having complete (i.e., all available) data seems obviously preferable to incomplete data.  However, with data volumes always burgeoning, the unavoidable fact is that sometimes having more data only adds confusion instead of clarity, thereby becoming a distraction instead of helping you make a better decision.

Returning to my original question, how much data is really necessary to make an effective business decision? 

Accuracy, which, thanks to substantial assistance from my readers, was defined in a previous post as both the correctness of a data value within a limited context such as verification by an authoritative reference (i.e., validity) combined with the correctness of a valid data value within an extensive context including other data as well as business processes (i.e., accuracy). 

Although accurate data is obviously preferable to inaccurate data, less than perfect data quality can not be used as an excuse to delay making a critical business decision.  When it comes to the quality of the data being used to make these business decisions, you can’t always get the data you want, but if you try sometimes, you just might find, you get the business insight you need.

Data-driven Solutions for Business Problems

Obviously, there are even more dimensions of data quality beyond completeness and accuracy. 

However, although it’s about more than just improving your data, data quality can be misperceived to be an activity performed just for the sake of the data.  When, in fact, data quality is an enterprise-wide initiative performed for the sake of implementing data-driven solutions for business problems, enabling better business decisions, and delivering optimal business performance.

In order to accomplish these objectives, data has to be not only complete and accurate, as well as whatever other dimensions you wish to add to your complete and accurate definition of data quality, but most important, data has to be useful to the business.

Perhaps the most common definition for data quality is “fitness for the purpose of use.” 

The missing word, which makes this definition both incomplete and inaccurate, puns intended, is “business.”  In other words, data quality is “fitness for the purpose of business use.”  How complete and how accurate (and however else) the data needs to be is determined by its business use—or uses since, in the vast majority of cases, data has multiple business uses.

Data, data everywhere

With silos replicating data as well as new data being created daily, managing all of the data is not only becoming impractical, but because we are too busy with the activity of trying to manage all of it, no one is stopping to evaluate usage or business relevance.

The fifth of the Five New Ideas From 2010 MIT Information Quality Industry Symposium, which is a recent blog post written by Mark Goloboy, was that “60-90% of operational data is valueless.”

“I won’t say worthless,” Goloboy clarified, “since there is some operational necessity to the transactional systems that created it, but valueless from an analytic perspective.  Data only has value, and is only worth passing through to the Data Warehouse if it can be directly used for analysis and reporting.  No news on that front, but it’s been more of the focus since the proliferation of data has started an increasing trend in storage spend.”

In his recent blog post Are You Afraid to Say Goodbye to Your Data?, Dylan Jones discussed the critical importance of designing an archive strategy for data, as opposed to the default position many organizations take, where burgeoning data volumes are allowed to proliferate because, in large part, no one wants to delete (or, at the very least, archive) any of the existing data. 

This often results in the data that the organization truly needs for continued success getting stuck in the long line of data waiting to be managed, and in many cases, behind data for which the organization no longer has any business use (and perhaps never even had the chance to use when the data was actually needed to make critical business decisions).

“When identifying data in scope for a migration,” Dylan advised, “I typically start from the premise that ALL data is out of scope unless someone can justify its existence.  This forces the emphasis back on the business to justify their use of the data.”

Data Memorioso

Funes el memorioso is a short story by Jorge Luis Borges, which describes a young man named Ireneo Funes who, as a result of a horseback riding accident, has lost his ability to forget.  Although Funes has a tremendous memory, he is so lost in the details of everything he knows that he is unable to convert the information into knowledge and unable, as a result, to grow in wisdom.

In Spanish, the word memorioso means “having a vast memory.”  When Data Memorioso is your data management strategy, your organization becomes so lost in all of the data it manages that it is unable to convert data into business insight and unable, as a result, to survive and thrive in today’s highly competitive and rapidly evolving marketplace.

In their great book Made to Stick: Why Some Ideas Survive and Others Die, Chip Heath and Dan Heath explained that “an accurate but useless idea is still useless.  If a message can’t be used to make predictions or decisions, it is without value, no matter how accurate or comprehensive it is.”  I believe that this is also true for your data and your organization’s business uses for it.

Is your data complete and accurate, but useless to your business?