A roundup of the Best OCDQ Blog posts published during 2014.
An example of the challenge of data accuracy and the possible misinformation provided by key performance metrics inspired by the investigative reporting of the HBO satirical news show Last Week Tonight with John Oliver.
While the use of a postal validation service is a highly recommended best practice for ensuring valid addresses are entered when data is created, just because you have valid data doesn’t guarantee that you have accurate data.
Listen as Laura Sebastian-Coleman, author of the book Measuring Data Quality for Ongoing Improvement: A Data Quality Assessment Framework, and I discuss bringing together a better understanding of what is represented in data, and how it is represented, with the expectations for its use in order to improve the overall quality of data. Our discussion also covers how to avoid two common mistakes made when starting a data quality project, and defines five dimensions of data quality.
Laura Sebastian-Coleman has worked on data quality in large health care data warehouses since 2003. She has implemented data quality metrics and reporting, launched and facilitated a data quality community, contributed to data consumer training programs, and has led efforts to establish data standards and to manage metadata. In 2009, she led a group of analysts in developing the original Data Quality Assessment Framework (DQAF), which is the basis for her book.
Laura Sebastian-Coleman has delivered papers at MIT’s Information Quality Conferences and at conferences sponsored by the International Association for Information and Data Quality (IAIDQ) and the Data Governance Organization (DGO). She holds the IQCP (Information Quality Certified Professional) designation from the IAIDQ, a Certificate in Information Quality from MIT, a B.A. in English and History from Franklin & Marshall College, and a Ph.D. in English Literature from the University of Rochester.
Additional Listening Options:
Popular OCDQ Radio Episodes
Clicking on the link will take you to the episode’s blog post:
- Demystifying Data Science — Guest Melinda Thielbar, a Ph.D. Statistician, discusses what a data scientist does and provides a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, and correlation.
- Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.
- Doing Data Governance — Guest John Ladley discusses his book How to Design, Deploy and Sustain Data Governance and how to understand the difference and relationship between data governance and enterprise information management.
- Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
- The Blue Box of Information Quality — Guest Daragh O Brien on why Information Quality is bigger on the inside, using stories as an analytical tool and change management technique, and why we must never forget that “people are cool.”
- Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.
- Good-Enough Data for Fast-Enough Decisions — Guest Julie Hunt discusses Data Quality and Business Intelligence, including the speed versus quality debate of near-real-time decision making, and the future of predictive analytics.
- The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
- The Art of Data Matching — Guest Henrik Liliendahl Sørensen discusses data matching concepts and practices, including different match techniques, candidate selection, presentation of match results, and business applications of data matching.
- Data Profiling Early and Often — Guest James Standen discusses data profiling concepts and practices, and how bad data is often misunderstood and can be coaxed away from the dark side if you know how to approach it.
Last week, when I published my blog post Lightning Strikes the Cloud, I unintentionally demonstrated three important things about data quality.
The first thing I demonstrated was that even an obsessive-compulsive data quality geek is capable of data defects: I initially published the post with the title Lightening Strikes the Cloud. This is an excellent example of the difference between validity and accuracy caused by the Cupertino Effect, since although lightening is valid (i.e., a correctly spelled word), it isn’t contextually accurate.
The second thing I demonstrated was the value of shining a social light on data quality — the value of using collaborative tools like social media to crowd-source data quality improvements. Thankfully, Julian Schwarzenbach quickly noticed my error on Twitter. “Did you mean lightning? The concept of lightening clouds could be worth exploring further,” Julian humorously tweeted. “Might be interesting to consider what happens if the cloud gets so light that it floats away.” To which I replied that if the cloud gets so light that it floats away, it could become Interstellar Computing or, as Julian suggested, the start of the Intergalactic Net, which I suppose is where we will eventually have to store all of that big data we keep hearing so much about these days.
The third thing I demonstrated was the potential dark side of data cleansing, since the only remaining trace of my data defect is a broken URL. This is an example of not providing a well-documented audit trail, which is necessary within an organization to communicate data quality issues and resolutions.
Communication and collaboration are essential to finding our way with data quality. And social media can help us by providing more immediate and expanded access to our collective knowledge, experience, and wisdom, and by shining a social light that illuminates the shadows cast upon data quality issues when a perception filter or bystander effect gets the better of our individual attention or undermines our collective best intentions — which, as I recently demonstrated, occasionally happens to all of us.
In psychology, the Stroop Effect demonstrates interference in the reaction time of a task. The most commonly used example is the Stroop Test, which compares the time needed to name colors when they are printed in an ink color that matches their name (e.g., green, yellow, red, blue, brown, purple) with the time needed to name the same colors when they are printed in an ink color that does not match their name (e.g., blue, red, purple, green, brown, yellow). Naming the color of the word takes longer, and is more prone to errors, when the ink color does not match the name of the color.
The Stroop Test, where colors do not match their names, reminds me of the relationship between metadata and data quality if I view the ink color as the metadata and the name of the color as the data, given that understanding data takes longer, and is more prone to errors, when the metadata does not match the data, or when the metadata is ambiguous.
Unlike the Stroop Test, where poor metadata (ink color) obfuscates good data (name of the color), data quality issues can also be caused when good metadata is undermined by poor data (e.g., data entry errors like an email address being entered into a postal address field). And, of course, even when the entered data matches the metadata (or automatic data-to-metadata matching is enabled by drop-down boxes), more insidious data quality issues can be caused by the complex challenge of data accuracy.
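The validity-versus-accuracy distinction above can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical US ZIP code field; the function name and sample values are invented for the example:

```python
import re

def is_valid_us_zip(value: str) -> bool:
    """Validity check: does the value match the expected ZIP code format?"""
    return bool(re.fullmatch(r"\d{5}(-\d{4})?", value))

# An email address entered into a postal code field fails the validity check:
print(is_valid_us_zip("jane@example.com"))  # False -> caught by validation

# But a correctly formatted ZIP code passes validity even if it is not where
# the customer actually lives -- validity alone cannot prove accuracy:
print(is_valid_us_zip("90210"))  # True -> valid, yet possibly inaccurate
```

The sketch makes the asymmetry plain: field-level validation reliably rejects the email-in-a-postal-field class of error, but a value that is merely well-formed still requires the wider context (other data, business processes) before it can be called accurate.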
Additionally, the point of view paradox can turn data quality debates about fitness for the purpose of use even more colorful than the Stroop Test, such as when data that one user sees as red and green, another user sees as crimson and chartreuse.
But hopefully we can all agree that good data quality begins with good metadata, because better metadata makes data better.
Welcome to the highly anticipated debut episode of the Obsessive-Compulsive Data Quality (OCDQ) podcast—OCDQ Radio!
In this episode, I discuss how data, data quality, data-driven decision making, and metadata quality no longer reside exclusively within the esoteric realm of data management. Data has now so thoroughly pervaded mainstream culture that we hardly seem to notice that we are quite literally swimming in data on a daily basis.
The growing challenge is whether we can extract meaningful insights from these vast oceans of unrelenting data volumes, and use those insights to make better decisions in near real time in order to positively impact the various aspects of our lives.
We are now living in a brave new data world where everyone is a data geek—and data quality affects us all.
Or to paraphrase William Shakespeare:
“How many goodly data are there here! How beauteous data geeks are!
O brave new world!
That is so dependent on the quality of the data in it!”
A Brave New Data World
Additional listening options:
“The behavioral revolution in economics began in 1981 when Richard Thaler published a seven-page letter in a somewhat obscure economics journal, which posed a pretty simple choice about apples.
Which would you prefer:
(A) One apple in one year, or
(B) Two apples in one year plus one day?
This is a strange hypothetical—why would you have to wait a year to receive an apple? But choosing is not very difficult; most people would choose to wait an extra day to double the size of their gift.
Thaler went on, however, to pose a second apple choice.
Which would you prefer:
(C) One apple today, or
(D) Two apples tomorrow?
What’s interesting is that many people give a different, seemingly inconsistent answer to this second question. Many of the same people who are patient when asked to consider this choice a year in advance turn around and become impatient when the choice has immediate consequences—they prefer C over D.
What was revolutionary about his apple example is that it illustrated the plausibility of what behavioral economists call ‘time-inconsistent’ preferences. Richard was centrally interested in the people who chose both B and C. These people, who preferred two apples in the future but one apple today, flipped their preferences as the delivery date got closer.”
What does this have to do with data quality? Give me a moment to finish eating my second apple, and then I will explain . . .
Data Quality Oranges
Let’s imagine that an orange represents a unit of measurement for data quality, somewhat analogous to data accuracy, such that the more data quality oranges you have, the better the quality of data is for your needs—let’s say for making a business decision.
Which would you prefer:
(A) One data quality orange in one month, or
(B) Two data quality oranges in one month plus one day?
(Please Note: Due to the strange uncertainties of fruit-based mathematics, two data quality oranges do not necessarily equate to a doubling of data accuracy, but two data quality oranges are certainly an improvement over one data quality orange).
Now, of course, on those rare occasions when you can afford to wait a month or so before making a critical business decision, most people would choose to wait an extra day in order to improve their data quality before making their data-driven decision.
However, let’s imagine you are feeling squeezed by a more pressing business decision—now which would you prefer:
(C) One data quality orange today, or
(D) Two data quality oranges tomorrow?
In my experience with data quality and business intelligence, most people prefer B over A—and C over D.
This “time-inconsistent” data quality preference within business intelligence reflects the reality that with the speed at which things change these days, more real-time business decisions are required—perhaps making speed more important than quality.
In a recent Data Knights Tweet Jam, Mark Lorion pondered speed versus quality within business intelligence, asking: “Is it better to be perfect in 30 days or 70% today? Good enough may often be good enough.”
To which Henrik Liliendahl Sørensen responded with the perfectly pithy wisdom: “Good, Fast, Decision—Pick any two.”
However, Steve Dine cautioned that speed versus quality is decision dependent: “70% is good when deciding how many pencils to order, but maybe not for a one billion dollar acquisition.”
Mark’s follow-up captured the speed versus quality tradeoff succinctly with “Good Now versus Great Later.” And Henrik added the excellent cautionary note: “Good decision now, great decision too late—especially if data quality is not a mature discipline.”
What Say You?
How many data quality oranges do you think it takes? Or for those who prefer a less fruitful phrasing, where do you stand on the speed versus quality debate? How good does data quality have to be in order to make a good data-driven business decision?
Since we live in the era of data deluge and information overload, Godin’s question about how much time and effort should be spent on absorbing data and how much time and effort should be invested in producing output is an important one, especially for enterprise data management, where it boils down to how much data should be taken in before a business decision can come out.
In other words, it’s about how much time and effort is invested in the organization’s data in, decision out (i.e., DIDO) process.
And, of course, quality is an important aspect of the DIDO process—both data quality and decision quality. But, oftentimes, it is an organization’s overwhelming concerns about its GIGO that lead to inefficiencies and ineffectiveness around its DIDO.
How much data is necessary to make an effective business decision? Having complete (i.e., all available) data seems obviously preferable to incomplete data. However, with data volumes always burgeoning, the unavoidable fact is that sometimes having more data only adds confusion instead of clarity, thereby becoming a distraction instead of helping you make a better decision.
Although accurate data is obviously preferable to inaccurate data, less-than-perfect data quality cannot be used as an excuse to delay making a business decision. Even large amounts of high-quality data will not guarantee high-quality business decisions, just as high-quality business decisions will not guarantee high-quality business results.
In other words, overcoming GIGO will not guarantee DIDO success.
When it comes to the amount and quality of the data used to make business decisions, you can’t always get the data you want, and while you should always be data-driven, never only intuition-driven, eventually it has to become: Time to start deciding.
“Good morning sir!” said the smiling gentleman behind the counter—and a little too cheerily for 5 o’clock in the morning. “Welcome to the check-in counter for Data Quality Airlines. My name is Edward. How may I help you today?”
“Good morning Edward,” I replied. “My name is John Smith. I am traveling to Boston today on flight number 221.”
“Thank you for choosing Data Quality Airlines!” responded Edward. “May I please see your driver’s license, passport, or other government-issued photo identification so that I can verify your data accuracy?”
As I handed Edward my driver’s license, I explained, “It’s an old photograph in which I was clean-shaven, wearing contact lenses, and ten pounds lighter,” since I now had a full beard, was wearing glasses, and, to be honest, was actually thirty pounds heavier.
“Oh,” said Edward, his plastic smile morphing into a more believable and stern frown. “I am afraid you are on the No Fly List.”
“Oh, that’s right—because of my name being so common!” I replied while fumbling through my backpack, frantically searching for the piece of paper, which I then handed to Edward. “I’m supposed to give you my Redress Control Number.”
“Actually, you’re supposed to use your Redress Control Number when making your reservation,” Edward retorted.
“In other words,” I replied, while sporting my best plastic smile, “although you couldn’t verify the accuracy of my customer data when I made my reservation on-line last month, you were able to verify the authorization to immediately charge my credit card for the full price of purchasing a non-refundable plane ticket to fly on Data Quality Airlines.”
“I don’t appreciate your sense of humor,” replied Edward. “Everyone at Data Quality Airlines takes accuracy very seriously.”
Edward printed my boarding pass, wrote BCS on it in big letters, handed it to me, and with an even more plastic smile cheerily returning to his face, said: “Please proceed to the security checkpoint. Thank you again for choosing Data Quality Airlines!”
“Boarding pass?” asked the not-at-all smiling woman at the security checkpoint. After I handed her my boarding pass, she said, “And your driver’s license, passport, or other government-issued photo identification so that I can verify your data accuracy.”
“I guess my verified data accuracy at the Data Quality Airlines check-in counter must have already expired,” I joked as I handed her my driver’s license. “It’s an old photograph in which I was clean-shaven, wearing contact lenses, and ten pounds lighter.”
The woman silently examined my boarding pass and driver’s license, circled BCS with a magic marker, and then shouted over her shoulder to a group of not-at-all smiling security personnel standing behind her: “Randomly selected security screening!”
One of them, a very large man, stepped toward me as the sound from the snap of the fresh latex glove he had just placed on his very large hand echoed down the long hallway that he was now pointing me toward. “Right this way sir,” he said with a smile.
Ten minutes later, as I slowly walked to the gate for Data Quality Airlines Flight Number 221 to Boston, the thought echoing through my mind was that there is no such thing as data accuracy—there are only verifiable assertions of data accuracy . . .
Data Quality (DQ) Tips is an OCDQ regular segment. Each DQ-Tip is a clear and concise data quality pearl of wisdom.
“There is no such thing as data accuracy — there are only assertions of data accuracy.”
You can download (.pdf file) quotes from the webinar by clicking on this link: Data Quality Pro Webinar Quotes - Peter Benson
ISO 8000 is the international standard for data quality. You can get more information by clicking on this link: ISO 8000
Accuracy was defined in a previous post, thanks to substantial assistance from my readers, as both the correctness of a data value within a limited context, such as verification by an authoritative reference (i.e., validity), and the correctness of a valid data value within an extensive context that includes other data as well as business processes (i.e., accuracy).
“The definition of data quality,” according to Peter and the ISO 8000 standards, “is the ability of the data to meet requirements.”
Although accuracy is only one of many dimensions of data quality, whenever we refer to data as accurate, we are referring to the ability of the data to meet specific requirements, and quite often it’s the ability to support making a critical business decision.
I agree with Peter and the ISO 8000 standards because we can’t simply take an accuracy metric on a data quality dashboard (or however else the assertion is presented to us) at face value without understanding how the metric is both defined and measured.
However, even when well defined and properly measured, data accuracy is still only an assertion. Oftentimes, the only way to verify the assertion is by putting the data to its intended use.
If by using it you discover that the data is inaccurate, then by having established what the assertion of accuracy was based on, you have a head start on performing root cause analysis, enabling faster resolution of the issues—not only with the data, but also with the business and technical processes used to define and measure data accuracy.
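To make that head start concrete, here is a hedged sketch of what recording the basis of an accuracy assertion might look like; the field names and values are illustrative assumptions, not a standard schema or an ISO 8000 artifact:

```python
# Hypothetical record of an assertion of data accuracy: capture not just the
# metric value, but how it was defined and measured, so a later failure in
# use has a documented starting point for root cause analysis.
accuracy_assertion = {
    "metric": "customer_address_accuracy",
    "value": 0.80,
    "definition": "share of addresses matching an authoritative postal reference",
    "reference": "postal validation service, June snapshot",
    "measured_on": "2011-07-01",
}

def describe(assertion: dict) -> str:
    """Summarize an accuracy assertion for a data quality dashboard."""
    return (f"{assertion['metric']} = {assertion['value']:.0%}, "
            f"verified against: {assertion['reference']}")

print(describe(accuracy_assertion))
# customer_address_accuracy = 80%, verified against: postal validation service, June snapshot
```

The point of carrying the definition and reference alongside the number is exactly the one made above: a metric presented at face value is only an assertion, but an assertion with its basis documented can actually be investigated when use proves it wrong.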
In psychology, the term negativity bias is used to explain how bad evokes a stronger reaction than good in the human mind. Don’t believe that theory? Compare receiving an insult with receiving a compliment—which one do you remember more often?
Now, this doesn’t mean the dark side of the Force is stronger, it simply means that we all have a natural tendency to focus more on the negative aspects, rather than on the positive aspects, of most situations, including data quality.
In the aftermath of poor data quality negatively impacting decision-critical enterprise information, the natural tendency is for a data quality initiative to begin by focusing on the now painfully obvious need for improvement, essentially asking the question:
Why isn’t our data quality better?
Although this type of question is a common reaction to failure, it is also indicative of the problem-seeking mindset caused by our negativity bias. However, Chip and Dan Heath, authors of the great book Switch, explain that even in failure, there are flashes of success, and following these “bright spots” can illuminate a road map for action, encouraging a solution-seeking mindset.
“To pursue bright spots is to ask the question:
What’s working, and how can we do more of it?
Sounds simple, doesn’t it?
Yet, in the real-world, this obvious question is almost never asked.
Instead, the question we ask is more problem focused:
What’s broken, and how do we fix it?”
Applying this solution-seeking mindset to data quality means asking a different question: Why isn’t our data quality worse?
For example, let’s pretend that a data quality assessment is performed on a data source used to make critical business decisions. With the help of business analysts and subject matter experts, it’s verified that this critical source has an 80% data accuracy rate.
The common approach is to ask the following questions (using a problem-seeking mindset):
- Why isn’t our data quality better?
- What is the root cause of the 20% inaccurate data?
- What process (business or technical, or both) is broken, and how do we fix it?
- What people are responsible, and how do we correct their bad behavior?
But why don’t we ask the following questions (using a solution-seeking mindset):
- Why isn’t our data quality worse?
- What is the root cause of the 80% accurate data?
- What process (business or technical, or both) is working, and how do we re-use it?
- What people are responsible, and how do we encourage their good behavior?
I am not suggesting that we abandon the first set of questions, especially since there are times when a problem-seeking mindset might be a better approach (after all, it does also incorporate a solution-seeking mindset—albeit after a problem is identified).
I am simply wondering why we almost never even consider asking the second set of questions.
Most data quality initiatives focus on developing new solutions—and not re-using existing solutions.
Most data quality initiatives focus on creating new best practices—and not leveraging existing best practices.
Perhaps you can be the chosen one who will bring balance to the data quality initiative by asking both questions:
Why isn’t our data quality better? Why isn’t our data quality worse?
Understanding your data usage is essential to improving its quality, and therefore, you must perform data analysis on a regular basis.
A data profiling tool can help you by automating some of the grunt work needed to begin your data analysis, such as generating levels of statistical summaries supported by drill-down details, including data value frequency distributions (like the ones shown to the left).
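That grunt work can be sketched with nothing more than the standard library. This is a minimal illustration, with invented sample values for a hypothetical state-code field; note how the frequency distribution immediately surfaces inconsistent representations of the same value:

```python
from collections import Counter

# Invented sample records for one field of a data source being profiled.
records = ["MA", "MA", "ma", "Massachusetts", None, "NH", "MA", ""]

total = len(records)
missing = sum(1 for v in records if v in (None, ""))
frequencies = Counter(v for v in records if v not in (None, ""))

# Summary statistic: completeness of the field.
print(f"completeness: {(total - missing) / total:.0%}")  # completeness: 75%

# Drill-down detail: the data value frequency distribution.
for value, count in frequencies.most_common():
    print(f"{value!r}: {count} ({count / total:.0%})")
```

Even this toy distribution hints at why the next paragraph's warning matters: the values alone show "MA", "ma", and "Massachusetts" as three distinct entries, but only the wider context of how the field is used can say which representation is correct, or whether the difference matters to a business decision.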
However, a common mistake is to hyper-focus on the data values.
Narrowing your focus to the values of individual fields is a mistake when it causes you to lose sight of the wider context of the data, which can cause other errors like mistaking validity for accuracy.
Understanding data usage is about analyzing its most important context—how your data is being used to make business decisions.
“Begin with the decision in mind”
In his excellent recent blog post It’s time to industrialize analytics, James Taylor wrote that “organizations need to be much more focused on directing analysts towards business problems.” Although Taylor was writing about how, in advanced analytics (e.g., data mining, predictive analytics), “there is a tendency to let analysts explore the data, see what can be discovered,” I think this tendency is applicable to all data analysis, including less advanced analytics like data profiling and data quality assessments.
Please don’t misunderstand—Taylor and I are not saying that there is no value in data exploration, because, without question, it can definitely lead to meaningful discoveries. And I continue to advocate that the goal of data profiling is not to find answers, but instead, to discover the right questions.
However, as Taylor explained, it is because “the only results that matter are business results” that data analysis should always “begin with the decision in mind. Find the decisions that are going to make a difference to business results—to the metrics that drive the organization. Then ask the analysts to look into those decisions and see what they might be able to predict that would help make better decisions.”
Once again, although Taylor is discussing predictive analytics, this cogent advice should guide all of your data analysis.
The Real Data Value is Business Insight
Returning to data quality assessments, which create and monitor metrics based on summary statistics provided by data profiling tools (like the ones shown in the mockup to the left): elevating low-level technical metrics to the level of business relevance will often establish their correlation with business performance, but will not establish metrics that drive—or should drive—the organization.
Although built from the bottom-up by using, for the most part, the data value frequency distributions, these metrics lose sight of the top-down fact that business insight is where the real data value lies.
However, data quality metrics such as completeness, validity, accuracy, and uniqueness, which are just a few common examples, should definitely be created and monitored—unfortunately, a single straightforward metric called Business Insight doesn’t exist.
But let’s pretend that my other mockup metrics were real—50% of the data is inaccurate and there is an 11% duplicate rate.
Oh, no! The organization must be teetering on the edge of oblivion, right? Well, 50% accuracy does sound really bad, basically like your data’s accuracy is no better than flipping a coin. However, which data is inaccurate, and far more important, is the inaccurate data actually being used to make a business decision?
As for the duplicate rate, I am often surprised by the visceral reaction it can trigger, such as: “how can we possibly claim to truly understand who our most valuable customers are if we have an 11% duplicate rate?”
So, would reducing your duplicate rate to only 1% automatically result in better customer insight? Or would it simply mean that the data matching criteria were too conservative (e.g., requiring an exact match on all “critical” data fields), preventing you from discovering how many duplicate customers you have? (Or maybe the 11% indicates the matching criteria were too aggressive.)
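How strongly the duplicate rate depends on the matching criteria can be shown with a toy example; the records and the normalization rule are invented for the sketch, not an actual data matching technique:

```python
# Invented customer records: the first two plausibly describe the same person.
customers = [
    {"name": "Jon Smith",  "city": "Boston"},
    {"name": "John Smith", "city": "BOSTON"},
    {"name": "Jane Doe",   "city": "Salem"},
]

def exact_key(rec):
    """Conservative criteria: exact match on every field."""
    return (rec["name"], rec["city"])

def loose_key(rec):
    """Aggressive criteria: ignore case and collapse a Jon/John-style variant."""
    return (rec["name"].lower().replace("john", "jon"), rec["city"].lower())

def duplicate_rate(records, key):
    keys = [key(r) for r in records]
    return (len(keys) - len(set(keys))) / len(keys)

print(duplicate_rate(customers, exact_key))  # 0.0 -> "no duplicates" found
print(duplicate_rate(customers, loose_key))  # one pair matched (~33%)
```

The same three records yield a 0% or a 33% duplicate rate depending solely on the match criteria, which is the point: the number by itself cannot tell you whether your customer insight got better.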
My point is that accuracy and duplicate rates are just numbers—what determines if they are a good number or a bad number?
The fundamental question that every data quality metric you create must answer is: How does this provide business insight?
If a data quality (or any other data) metric cannot answer this question, then it is meaningless. Meaningful metrics always represent business insight because they were created by beginning with the business decisions in mind. Otherwise, your metrics could provide the comforting, but false, impression that all is well, or they could raise red flags that are really red herrings.
Instead of beginning data analysis with the business decisions in mind, many organizations begin with only the data in mind, which results in creating and monitoring data quality metrics that provide little, if any, business insight and decision support.
Although analyzing your data values is important, you must always remember that the real data value is business insight.