Data Quality and the Bystander Effect

In his recent Harvard Business Review blog post Break the Bad Data Habit, Tom Redman cautioned against correcting data quality issues without providing feedback to where the data originated.  “At a minimum,” Redman explained, “others using the erred data may not spot the error.  There is no telling where it might turn up or who might be victimized.”  And correcting bad data without providing feedback to its source also denies the organization an opportunity to get to the bottom of the problem.

“And failure to provide feedback,” Redman continued, “is but the proximate cause.  The deeper root issue is misplaced accountability — or failure to recognize that accountability for data is needed at all.  People and departments must continue to seek out and correct errors.  They must also provide feedback and communicate requirements to their data sources.”

In his blog post The Secret to an Effective Data Quality Feedback Loop, Dylan Jones responded to Redman’s blog post with some excellent insights regarding data quality feedback loops and how they can help improve your data quality initiatives.

I definitely agree with Redman and Jones about the need for feedback loops, but I have found, more often than not, that no feedback at all is provided on data quality issues because of the assumption that data quality is someone else’s responsibility.

This general lack of accountability for data quality issues is similar to what is known in psychology as the Bystander Effect, which refers to people often not offering assistance to the victim in an emergency situation when other people are present.  Apparently, the mere presence of other bystanders greatly decreases intervention, and the greater the number of bystanders, the less likely it is that any one of them will help.  Psychologists believe that the reason this happens is that as the number of bystanders increases, any given bystander is less likely to interpret the incident as a problem, and less likely to assume responsibility for taking action.

In my experience, the most common reason that data quality issues are neither reported nor corrected is that most people throughout the enterprise act like data quality bystanders, which makes them less likely to interpret bad data as a problem or, at the very least, to see it as their responsibility.  But this bystander effect is perhaps what most negatively affects the enterprise’s data quality, which may make it the worst bad data habit that the enterprise needs to break.

Data Quality and Big Data

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

This is Part 2 of 2 from my recent discussion with Tom Redman.  In this episode, Tom and I discuss data quality and big data, including whether data quality matters less in larger data sets, whether statistical outliers represent business insights or data quality issues, statistical sampling errors versus measurement calibration errors, mistaking signal for noise (i.e., good data for bad data), and whether the principles and practices of true “data scientists” will be embraced by an organization’s business leaders.

Dr. Thomas C. Redman (the “Data Doc”) is an innovator, advisor, and teacher.  He was first to extend quality principles to data and information in the late 80s.  Since then he has crystallized a body of tools, techniques, roadmaps and organizational insights that help organizations make order-of-magnitude improvements.

More recently Tom has developed keen insights into the nature of data and formulated the first comprehensive approach to “putting data to work.”  Taken together, these enable organizations to treat data as assets of virtually unlimited potential.

Tom has personally helped dozens of leaders and organizations better understand data and data quality and start their data programs.  He is a sought-after lecturer and the author of dozens of papers and four books.  The most recent, Data Driven: Profiting from Your Most Important Business Asset (Harvard Business Press, 2008), was a Library Journal best buy of 2008.

Prior to forming Navesink Consulting Group in 1996, Tom conceived the Data Quality Lab at AT&T Bell Laboratories in 1987 and led it until 1995.  Tom holds a Ph.D. in statistics from Florida State University. He holds two patents.

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Demystifying Data Science — Guest Melinda Thielbar, a Ph.D. Statistician, discusses what a data scientist does and provides a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, and correlation.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

Data Driven

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

This is Part 1 of 2 from my recent discussion with Tom Redman.  In this episode, Tom and I discuss concepts from one of my favorite data quality books, which is his most recent book: Data Driven: Profiting from Your Most Important Business Asset.

Our discussion includes viewing data as an asset, an organization’s hierarchy of data needs, a simple model for culture change, and the attempt to achieve the “single version of the truth” marketed as a goal of master data management (MDM).

Dr. Thomas C. Redman (the “Data Doc”) is an innovator, advisor, and teacher.  He was first to extend quality principles to data and information in the late 80s.  Since then he has crystallized a body of tools, techniques, roadmaps and organizational insights that help organizations make order-of-magnitude improvements.

More recently Tom has developed keen insights into the nature of data and formulated the first comprehensive approach to “putting data to work.”  Taken together, these enable organizations to treat data as assets of virtually unlimited potential.

Tom has personally helped dozens of leaders and organizations better understand data and data quality and start their data programs.  He is a sought-after lecturer and the author of dozens of papers and four books.

Prior to forming Navesink Consulting Group in 1996, Tom conceived the Data Quality Lab at AT&T Bell Laboratories in 1987 and led it until 1995. Tom holds a Ph.D. in statistics from Florida State University.  He holds two patents.

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Demystifying Data Science — Guest Melinda Thielbar, a Ph.D. Statistician, discusses what a data scientist does and provides a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, and correlation.
  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

Bayesian Data-Driven Decision Making

In his book Data Driven: Profiting from Your Most Important Business Asset, Thomas Redman recounts the story of economist John Maynard Keynes, who, when asked what he does when new data is presented that does not support his earlier decision, responded: “I change my opinion.  What do you do?”

“This is the way good decision makers behave,” Redman explained.  “They know that a newly made decision is but the first step in its execution.  They regularly and systematically evaluate how well a decision is proving itself in practice by acquiring new data.  They are not afraid to modify their decisions, even admitting they are wrong and reversing course if the facts demand it.”

Since he has a PhD in statistics, it’s not surprising that Redman explained effective data-driven decision making using Bayesian statistics, which is “an important branch of statistics that differs from classic statistics in the way it makes inferences based on data.  One of its advantages is that it provides an explicit means to quantify uncertainty, both a priori, that is, in advance of the data, and a posteriori, in light of the data.”

Good decision makers, Redman explained, follow at least three Bayesian principles:

  1. They bring as much of their prior experience as possible to bear in formulating their initial decision spaces and determining the sorts of data they will consider in making the decision.
  2. For big, important decisions, they adopt decision criteria that minimize the maximum risk.
  3. They constantly evaluate new data to determine how well a decision is working out, and they do not hesitate to modify the decision as needed.
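
To make the a priori and a posteriori distinction concrete, here is a minimal sketch of Bayesian updating, assuming a hypothetical Beta-Binomial model of a data defect rate (the numbers are mine for illustration, not from Redman’s book):

```python
# A minimal sketch of Bayesian updating, assuming a hypothetical Beta-Binomial
# model of a data defect rate: a prior belief is revised, not defended,
# when new data arrives. The numbers are purely illustrative.

def update_defect_belief(prior_alpha, prior_beta, defects, records):
    """Return the Beta(alpha, beta) posterior after observing new data."""
    return prior_alpha + defects, prior_beta + (records - defects)

# A priori: we believe roughly 2% of records are defective (Beta(2, 98)).
alpha, beta = 2.0, 98.0

# New evidence: a profiling run finds 150 defects in 2,000 records.
alpha, beta = update_defect_belief(alpha, beta, defects=150, records=2000)

# A posteriori: the revised estimate of the defect rate (about 7.2%).
print(f"Revised defect rate estimate: {alpha / (alpha + beta):.1%}")
```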

A key concept of statistical process control and continuous improvement is the importance of closing the feedback loop that allows a process to monitor itself, learn from its mistakes, and adjust when necessary.

The importance of building feedback loops into data-driven decision making is too often ignored.
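
As a rough sketch of what closing that loop can look like (a generic Shewhart-style control check I am using purely for illustration; it is not from the white paper mentioned below), a process can monitor its own output against control limits derived from its own history and flag when it drifts out of control:

```python
# A rough sketch of a statistical process control feedback loop: a data
# quality metric is monitored against control limits derived from its own
# history, and out-of-control points trigger investigation and adjustment.
from statistics import mean, stdev

def control_limits(history, sigmas=3.0):
    """Compute lower/upper control limits from historical measurements."""
    center, spread = mean(history), stdev(history)
    return center - sigmas * spread, center + sigmas * spread

# Illustrative daily defect rates (percent) reported by a data quality monitor.
history = [1.8, 2.1, 1.9, 2.2, 2.0, 1.7, 2.3, 2.1, 1.9, 2.0]
lower, upper = control_limits(history)

for day, rate in enumerate([2.0, 2.2, 4.9], start=1):
    if lower <= rate <= upper:
        print(f"Day {day}: defect rate {rate}% is within control limits")
    else:
        print(f"Day {day}: defect rate {rate}% is out of control -- investigate the process")
```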

I discuss this, and other aspects of data-driven decision making, in my DataFlux white paper, which is available for download (registration required) using the following link: Decision-Driven Data Management

 

Related Posts

Decision-Driven Data Management

The Speed of Decision

The Big Data Collider

A Decision Needle in a Data Haystack

The Data-Decision Symphony

Thaler’s Apples and Data Quality Oranges

Satisficing Data Quality

Data Confabulation in Business Intelligence

The Data that Supported the Decision

Data Psychedelicatessen

OCDQ Radio - Big Data and Big Analytics

OCDQ Radio - Good-Enough Data for Fast-Enough Decisions

The Circle of Quality

A Farscape Analogy for Data Quality

OCDQ Radio - Organizing for Data Quality

A Farscape Analogy for Data Quality

Farscape was one of my all-time favorite science fiction television shows.  In the weird way my mind works, Tom Redman’s recent blog post Four Steps to Fixing Your Bad Data (which has received great comments) triggered a Farscape analogy.

“The notion that data are assets sounds simple and is anything but,” Redman wrote.  “Everyone touches data in one way or another, so the tendrils of a data program will affect everyone — the things they do, the way they think, their relationships with one another, your relationships with customers.”

The key word for me was tendrils — like I said, my mind works in a weird way.

 

Moya and Pilot

On Farscape, the central characters of the show travel through space aboard Moya, a Leviathan, which is a species of living, sentient spaceships.  Pilot is a sentient creature (of a species also known as Pilots) with the vast capacity for multitasking that is necessary for the simultaneous handling of the many systems aboard a Leviathan.  The tendrils of a Pilot’s lower body are biologically bonded with the living systems of a Leviathan, creating a permanent symbiotic connection, meaning that, once bonded, a Pilot and a Leviathan can no longer exist independently for more than an hour or so, or both of them will die.

Leviathans were one of the many laudably original concepts of Farscape.  The role of the spaceship in most science fiction is analogous to the role of a boat.  In other words, traveling through space is most often imagined like traveling on water.  However, seafaring vessels and spaceships are usually seen as technological objects providing transportation and life support, but not as actually alive in their own right (despite the fact that both types of ship are usually anthropomorphized, and usually as female).

Because Moya was alive, when she was damaged, she felt pain and needed time to heal.  And because she was sentient, highly intelligent, and capable of communicating with the crew through Pilot (who was the only one who could understand the complexity of the Leviathan language, which was beyond the capability of a universal translator), Moya was much more than just a means of transportation.  In other words, there truly was a symbiotic relationship not only between Moya and Pilot, but also between Moya and Pilot and their crew and passengers.

 

Enterprise and Data

(Sorry, my fellow science fiction geeks, but it’s not that Enterprise and that Data.  Perfectly understandable mistake, though.)

Although technically not alive in the biological sense, in many respects an organization is like a living, sentient organism, and, like space and seafaring ships, it is often anthropomorphized.  An enterprise is much more than just a large organization providing a means of employment and offering products and/or services (and, in a sense, life support to its employees and customers).

As Redman explains in his book Data Driven: Profiting from Your Most Important Business Asset, data is not just the lifeblood of the Information Age, data is essential to everything the enterprise does, from helping it better understand its customers, to guiding its development of better products and/or services, to setting a strategic direction toward achieving its business goals.

So the symbiotic relationship between Enterprise and Data is analogous to the symbiotic relationship between Moya and Pilot.

Data is the Pilot of the Enterprise Leviathan.  The enterprise cannot survive without its data.  A healthy enterprise requires healthy data — data of sufficient quality capable of supporting the operational, tactical, and strategic functions of the enterprise.

Returning to Redman’s words, “Everyone touches data in one way or another, so the tendrils of a data program will affect everyone — the things they do, the way they think, their relationships with one another, your relationships with customers.”

So the relationship between an enterprise and its data, and its people, business processes, and technology, is analogous to the relationship between Moya and Pilot, and their crew and passengers.  It is the enterprise’s people, its crew (i.e., employees), who, empowered by high quality data and enabled by technology, optimize business processes for superior corporate performance, thereby delivering superior products and/or services to the enterprise’s passengers (i.e., customers).

 

So why isn’t data viewed as an asset?

So if this deep symbiosis exists, if these intertwined and symbiotic relationships exist, if the tendrils of data are biologically bonded with the complex enterprise ecosystem — then why isn’t data viewed as an asset?

In Data Driven, Redman references the book The Social Life of Information by John Seely Brown and Paul Duguid, who explained that “a technology is never fully accepted until it becomes invisible to those who use it.”  The term informationalization describes the process of building data and information into a product or service.  “When products and services are fully informationalized,” Redman noted, then data “blends into the background and people do not even think about it anymore.”

Perhaps that is why data isn’t viewed as an asset.  Perhaps data has so thoroughly pervaded the enterprise that it has become invisible to those who use it.  Perhaps data is not viewed as an asset because it is invisible to those who are so dependent upon its quality.

 

Perhaps we only see Moya, but not her Pilot.

 

Related Posts

Organizing For Data Quality

Data, data everywhere, but where is data quality?

Finding Data Quality

The Data Quality Wager

Beyond a “Single Version of the Truth”

Poor Data Quality is a Virus

DQ-Tip: “Don't pass bad data on to the next person...”

Retroactive Data Quality

Hyperactive Data Quality (Second Edition)

A Brave New Data World

Organizing for Data Quality

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

Dr. Thomas C. Redman (the “Data Doc”) is an innovator, advisor and teacher.  He was first to extend quality principles to data and information in the late 80s.  Since then he has crystallized a body of tools, techniques, roadmaps and organizational insights that help organizations make order-of-magnitude improvements.

More recently Tom has developed keen insights into the nature of data and formulated the first comprehensive approach to “putting data to work.”  Taken together, these enable organizations to treat data as assets of virtually unlimited potential.

Tom has personally helped dozens of leaders and organizations better understand data and data quality and start their data programs.  He is a sought-after lecturer and the author of dozens of papers and four books.  The most recent, Data Driven: Profiting from Your Most Important Business Asset (Harvard Business Press, 2008), was a Library Journal best buy of 2008.

Prior to forming Navesink Consulting Group in 1996, Tom conceived the Data Quality Lab at AT&T Bell Laboratories in 1987 and led it until 1995.  Tom holds a Ph.D. in statistics from Florida State University.  He holds two patents.

On this episode of OCDQ Radio, Tom Redman and I discuss concepts from his Data Governance and Information Quality 2011 post-conference tutorial about organizing for data quality, which includes his call to action for your role in the data revolution.

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Demystifying Data Science — Guest Melinda Thielbar, a Ph.D. Statistician, discusses what a data scientist does and provides a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, and correlation.
  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

Beyond a “Single Version of the Truth”

This post is involved in a good-natured contest (i.e., a blog-bout) with two additional bloggers: Henrik Liliendahl Sørensen and Charles Blyth.  Our contest is a Blogging Olympics of sorts, with the United States, Denmark, and England competing for the Gold, Silver, and Bronze medals in an event we are calling “Three Single Versions of a Shared Version of the Truth.” 

Please take the time to read all three posts and then vote for who you think has won the debate (see poll below).  Thanks!

 

The “Point of View” Paradox

In the early 20th century, within his Special Theory of Relativity, Albert Einstein introduced the concept that space and time are interrelated entities forming a single continuum, and that therefore the passage of time is a variable that can change for each individual observer.

One of the many brilliant insights of special relativity was that it could explain why different observers can make validly different observations – it was a scientifically justifiable matter of perspective. 

It was Einstein's apprentice, Obi-Wan Kenobi (to whom Albert explained “Gravity will be with you, always”), who stated:

“You're going to find that many of the truths we cling to depend greatly on our own point of view.”

The Data-Information Continuum

In the early 21st century, within his popular blog post The Data-Information Continuum, Jim Harris introduced the concept that data and information are interrelated entities forming a single continuum, and that speaking of oneself in the third person is the path to the dark side.

I use the Dragnet definition for data – it is “just the facts” collected as an abstract description of the real-world entities that the enterprise does business with (e.g., customers, vendors, suppliers).

Although a common definition for data quality is fitness for the purpose of use, the common challenge is that data has multiple uses – each with its own fitness requirements.  Viewing each intended use as the information that is derived from data, I define information as data in use or data in action.

Quality within the Data-Information Continuum has both objective and subjective dimensions.  Data's quality is objectively measured separate from its many uses, while information's quality is subjectively measured according to its specific use.

 

Objective Data Quality

Data quality standards provide a highest common denominator to be used by all business units throughout the enterprise as an objective data foundation for their operational, tactical, and strategic initiatives. 

In order to lay this foundation, raw data is extracted directly from its sources, profiled, analyzed, transformed, cleansed, documented and monitored by data quality processes designed to provide and maintain universal data sources for the enterprise's information needs. 

At this phase of the architecture, the manipulations of raw data must be limited to objective standards and not be customized for any subjective use.  From this perspective, data is now fit to serve (as at least the basis for) each and every purpose.
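
As a minimal sketch of what such objective standards might look like in practice (the field names and rules below are hypothetical illustrations, not an enterprise standard), the checks measure the data itself, independent of any particular use:

```python
# A minimal sketch of objective data quality checks: rules that measure the
# data itself (completeness, validity, conformity) without reference to any
# specific business use. The field names and rules are hypothetical.
import re

OBJECTIVE_RULES = {
    "customer_id": lambda v: bool(v),                      # completeness
    "email": lambda v: bool(re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v or "")),  # validity
    "country": lambda v: v in {"US", "DK", "GB"},          # conformity to a reference list
}

def profile(records):
    """Return the number of rule failures per field across all records."""
    failures = {field: 0 for field in OBJECTIVE_RULES}
    for record in records:
        for field, rule in OBJECTIVE_RULES.items():
            if not rule(record.get(field)):
                failures[field] += 1
    return failures

records = [
    {"customer_id": "C001", "email": "ann@example.com", "country": "US"},
    {"customer_id": "", "email": "not-an-email", "country": "Narnia"},
]
print(profile(records))  # {'customer_id': 1, 'email': 1, 'country': 1}
```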

 

Subjective Information Quality

Information quality standards (starting from the objective data foundation) are customized to meet the subjective needs of each business unit and initiative.  This approach leverages a consistent enterprise understanding of data while also providing the information necessary for day-to-day operations.

But please understand: customization should not be performed simply for the sake of it.  You must always define your information quality standards by using the enterprise-wide data quality standards as your initial framework. 

Whenever possible, enterprise-wide standards should be enforced without customization.  The key word within the phrase “subjective information quality standards” is standards — as opposed to subjective, which can quite often be misinterpreted as “you can do whatever you want.”  Yes you can – just as long as you have justifiable business reasons for doing so.

This approach to implementing information quality standards has three primary advantages.  First, it reinforces a consistent understanding and usage of data throughout the enterprise.  Second, it requires each business unit and initiative to clearly explain exactly how they are using data differently from the rest of your organization, and more important, justify why.  Finally, all deviations from enterprise-wide data quality standards will be fully documented. 
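
In the same hypothetical vein, here is a sketch of how a business unit might customize the enterprise foundation for one subjective use (again, the names and rules are mine for illustration):

```python
# A business unit starts from the enterprise-wide objective rules and layers
# on documented, use-specific rules rather than replacing the foundation.
# Names and rules are illustrative only.

ENTERPRISE_RULES = {
    "customer_id": lambda v: bool(v),                    # objective: completeness
    "country": lambda v: v in {"US", "DK", "GB"},        # objective: conformity
}

def information_rules_for(business_unit):
    """Customize the objective enterprise foundation for one subjective use."""
    rules = dict(ENTERPRISE_RULES)  # the objective foundation stays intact
    if business_unit == "marketing":
        # Documented deviation: marketing campaigns also require an explicit opt-in.
        rules["email_opt_in"] = lambda v: v is True
    return rules

record = {"customer_id": "C001", "country": "US", "email_opt_in": False}
failed = [field for field, rule in information_rules_for("marketing").items()
          if not rule(record.get(field))]
print(failed)  # ['email_opt_in'] -- fit for enterprise use, not yet fit for marketing
```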

 

The “One Lie Strategy”

A common objection to separating quality standards into objective data quality and subjective information quality is the enterprise's significant interest in creating what is commonly referred to as a “Single Version of the Truth.”

However, in his excellent book Data Driven: Profiting from Your Most Important Business Asset, Thomas Redman explains:

“A fiendishly attractive concept is...'a single version of the truth'...the logic is compelling...unfortunately, there is no single version of the truth. 

For all important data, there are...too many uses, too many viewpoints, and too much nuance for a single version to have any hope of success. 

This does not imply malfeasance on anyone's part; it is simply a fact of life. 

Getting everyone to work from a single version of the truth may be a noble goal, but it is better to call this the 'one lie strategy' than anything resembling truth.”

Beyond a “Single Version of the Truth”

In the classic 1985 film Mad Max Beyond Thunderdome, the title character arrives in Bartertown, ruled by the evil Auntie Entity, where people living in the post-apocalyptic Australian outback go to trade for food, water, weapons, and supplies.  Auntie Entity forces Mad Max to fight her rival Master Blaster to the death within a gladiator-like arena known as Thunderdome, which is governed by one simple rule:

“Two men enter, one man leaves.”

I have always struggled with the concept of creating a “Single Version of the Truth.”  I imagine all of the key stakeholders from throughout the enterprise arriving in Corporatetown, ruled by the Machiavellian CEO known only as Veritas, where all business units and initiatives must go to request funding, staffing, and continued employment.  Veritas forces all of them to fight their Master Data Management rivals within a gladiator-like arena known as Meetingdome, which is governed by one simple rule:

“Many versions of the truth enter, a Single Version of the Truth leaves.”

For any attempted “version of the truth” to truly be successfully implemented within your organization, it must take into account both the objective and subjective dimensions of quality within the Data-Information Continuum. 

Both aspects of this shared perspective of quality must be incorporated into a “Shared Version of the Truth” that enforces a consistent enterprise understanding of data, but that also provides the information necessary to support day-to-day operations.

The Data-Information Continuum is governed by one simple rule:

“All validly different points of view must be allowed to enter,

In order for an all-encompassing Shared Version of the Truth to be achieved.”

 

You are the Judge

This post is involved in a good-natured contest (i.e., a blog-bout) with two additional bloggers: Henrik Liliendahl Sørensen and Charles Blyth.  Our contest is a Blogging Olympics of sorts, with the United States, Denmark, and England competing for the Gold, Silver, and Bronze medals in an event we are calling “Three Single Versions of a Shared Version of the Truth.” 

Please take the time to read all three posts and then vote for who you think has won the debate.  A link to the same poll is provided on all three blogs.  Therefore, wherever you choose to cast your vote, you will be able to view an accurate tally of the current totals. 

The poll will remain open for one week, closing at midnight on November 19 so that the “medal ceremony” can be conducted via Twitter on Friday, November 20.  Additionally, please share your thoughts and perspectives on this debate by posting a comment below.  Your comment may be copied (with full attribution) into the comments section of all of the blogs involved in this debate.

 

Related Posts

Poor Data Quality is a Virus

The General Theory of Data Quality

The Data-Information Continuum

Poor Data Quality is a Virus

“A storm is brewing—a perfect storm of viral data, disinformation, and misinformation.” 

These cautionary words (written by Timothy G. Davis, an Executive Director within the IBM Software Group) are from the foreword of the remarkable new book Viral Data in SOA: An Enterprise Pandemic by Neal A. Fishman.

“Viral data,” explains Fishman, “is a metaphor used to indicate that business-oriented data can exhibit qualities of a specific type of human pathogen: the virus.  Like a virus, data by itself is inert.  Data requires software (or people) for the data to appear alive (or actionable) and cause a positive, neutral, or negative effect.”

“Viral data is a perfect storm,” because as Fishman explains, it is “a perfect opportunity to miscommunicate with ubiquity and simultaneity—a service-oriented pandemic reaching all corners of the enterprise.”

“The antonym of viral data is trusted information.”

Data Quality

“Quality is a subjective term,” explains Fishman, “for which each person has his or her own definition.”  Fishman goes on to quote from many of the published definitions of data quality, including a few of my personal favorites:

  • David Loshin: “Fitness for use—the level of data quality determined by data consumers in terms of meeting or beating expectations.”
  • Danette McGilvray: “The degree to which information and data can be a trusted source for any and/or all required uses.  It is having the right set of correct information, at the right time, in the right place, for the right people to use to make decisions, to run the business, to serve customers, and to achieve company goals.”
  • Thomas Redman: “Data are of high quality if those who use them say so.  Usually, high-quality data must be both free of defects and possess features that customers desire.”

Data quality standards provide a highest common denominator to be used by all business units throughout the enterprise as an objective data foundation for their operational, tactical, and strategic initiatives.  Starting from this foundation, information quality standards are customized to meet the subjective needs of each business unit and initiative.  This approach leverages a consistent enterprise understanding of data while also providing the information necessary for day-to-day operations.

However, the enterprise-wide data quality standards must be understood as dynamic.  Therefore, enforcing strict conformance to data quality standards can be self-defeating.  On this point, Fishman quotes Joseph Juran: “conformance by its nature relates to static standards and specification, whereas quality is a moving target.”

Defining data quality is both an essential and challenging exercise for every enterprise.  “While a succinct and holistic single-sentence definition of data quality may be difficult to craft,” explains Fishman, “an axiom that appears to be generally forgotten when establishing a definition is that in business, data is about things that transpire during the course of conducting business.  Business data is data about the business, and any data about the business is metadata.  First and foremost, the definition as to the quality of data must reflect the real-world object, concept, or event to which the data is supposed to be directly associated.”

 

Data Governance

“Data governance can be used as an overloaded term,” explains Fishman, and he quotes Jill Dyché and Evan Levy to explain that “many people confuse data quality, data governance, and master data management.” 

“The function of data governance,” explains Fishman, “should be distinct and distinguishable from normal work activities.” 

For example, although knowledge workers and subject matter experts are necessary to define the business rules for preventing viral data, according to Fishman, these are data quality tasks and not acts of data governance. 

However,  these data quality tasks must “subsequently be governed to make sure that all the requisite outcomes comply with the appropriate controls.”

Therefore, according to Fishman, “data governance is a function that can act as an oversight mechanism and can be used to enforce controls over data quality and master data management, but also over data privacy, data security, identity management, risk management, or be accepted in the interpretation and adoption of regulatory requirements.”

 

Conclusion

“There is a line between trustworthy information and viral data,” explains Fishman, “and that line is very fine.”

Poor data quality is a viral contaminant that will undermine the operational, tactical, and strategic initiatives essential to the enterprise's mission to survive and thrive in today's highly competitive and rapidly evolving marketplace. 

Left untreated or unchecked, this infectious agent will negatively impact the quality of business decisions.  As the pathogen replicates, more and more decision-critical enterprise information will be compromised.

According to Fishman, enterprise data quality requires a multidisciplinary effort and a lifetime commitment to:

“Prevent viral data and preserve trusted information.”

Books Referenced in this Post

Viral Data in SOA: An Enterprise Pandemic by Neal A. Fishman

Enterprise Knowledge Management: The Data Quality Approach by David Loshin

Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information by Danette McGilvray

Data Quality: The Field Guide by Thomas Redman

Juran on Quality by Design: The New Steps for Planning Quality into Goods and Services by Joseph Juran

Customer Data Integration: Reaching a Single Version of the Truth by Jill Dyché and Evan Levy

 

Related Posts

DQ-Tip: “Don't pass bad data on to the next person...”

The Only Thing Necessary for Poor Data Quality

Hyperactive Data Quality (Second Edition)

The General Theory of Data Quality

Data Governance and Data Quality

DQ-Tip: “Don't pass bad data on to the next person...”

Data Quality (DQ) Tips is a new regular segment.  Each DQ-Tip is a clear and concise data quality pearl of wisdom.

“Don't pass bad data on to the next person.  And don't accept bad data from the previous person.”

This DQ-Tip is from Thomas Redman's excellent book Data Driven: Profiting from Your Most Important Business Asset.

In the book, Redman explains that this advice is a rewording of his favorite data quality policy of all time.

Assuming that data quality is someone else’s responsibility is a fundamental root cause of enterprise data quality problems.  One of the primary goals of a data quality initiative must be to define the roles and responsibilities for data ownership and data quality.

In sports, it is common for inspirational phrases to be posted above every locker room exit door.  Players acknowledge and internalize the inspirational phrase by reaching up and touching it as they head out onto the playing field.

Perhaps you should post this DQ-Tip above every break room exit door throughout your organization?

 

Related Posts

The Only Thing Necessary for Poor Data Quality

Hyperactive Data Quality (Second Edition)

Data Governance and Data Quality

 

Additional Resources

Who is responsible for data quality?

DQ Problems? Start a Data Quality Recognition Program!

Starting Your Own Personal Data Quality Crusade

Hyperactive Data Quality (Second Edition)

In the first edition of Hyperactive Data Quality, I discussed reactive and proactive approaches using the data quality lake analogy from Thomas Redman's excellent book Data Driven: Profiting from Your Most Important Business Asset:

“...a lake represents a database and the water therein the data.  The stream, which adds new water, is akin to a business process that creates new data and adds them to the database.  The lake...is polluted, just as the data are dirty.  Two factories pollute the lake.  Likewise, flaws in the business process are creating errors...

One way to address the dirty lake water is to clean it up...by running the water through filters, passing it through specially designed settling tanks, and using chemicals to kill bacteria and adjust pH.

The alternative is to reduce the pollutant at the point source – the factories.

The contrast between the two approaches is stark.  In the first, the focus is on the lake; in the second, it is on the stream.  So too with data.  Finding and fixing errors focuses on the database and data that have already been created.  Preventing errors focuses on the business processes and future data.”

Reactive Data Quality

Reactive Data Quality (i.e. “cleaning the lake” in Redman's analogy) focuses entirely on finding and fixing the problems with existing data after it has been extracted from its sources. 

An obsessive-compulsive quest to find and fix every data quality problem is a laudable but ultimately unachievable pursuit (even for expert “lake cleaners”).  Data quality problems can be very insidious and even the best “lake cleaning” process will still produce exceptions.  Your process should be designed to identify and report exceptions when they occur.  In fact, as a best practice, you should also include the ability to suspend incoming data that contain exceptions for manual review and correction.
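
Here is a minimal sketch of that suspend-for-review pattern (the validation rule and record layout are hypothetical; a real implementation would live inside your data quality tooling):

```python
# A minimal sketch of the suspend-for-review pattern: records that fail an
# objective check are neither silently corrected nor dropped, but routed to
# a suspense queue for manual review. The rule and fields are hypothetical.

def is_valid(record):
    """Hypothetical check: a non-empty customer_id and postal_code."""
    return bool(record.get("customer_id")) and bool(record.get("postal_code"))

def load_with_suspense(incoming):
    """Split incoming records into loadable rows and suspended exceptions."""
    loaded, suspended = [], []
    for record in incoming:
        (loaded if is_valid(record) else suspended).append(record)
    return loaded, suspended

incoming = [
    {"customer_id": "C001", "postal_code": "02134"},
    {"customer_id": "", "postal_code": "99999"},  # exception: missing customer_id
]
loaded, suspended = load_with_suspense(incoming)
print(f"loaded={len(loaded)}, suspended for manual review={len(suspended)}")
```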

 

Proactive Data Quality

Proactive Data Quality focuses on preventing errors at the sources where data is entered or received (i.e., before it “enters the lake” in Redman’s analogy) and before it is extracted for use by downstream applications.

Redman describes the benefits of proactive data quality with what he calls the Rule of Ten:

“It costs ten times as much to complete a unit of work when the input data are defective (i.e. late, incorrect, missing, etc.) as it does when the input data are perfect.”

Proactive data quality advocates reevaluating business processes that create data, implementing improved controls on data entry screens and web forms, enforcing the data quality clause (you have one, right?) of your service level agreements with external data providers, and understanding the information needs of your consumers before delivering enterprise data for their use.
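
For example, a rough sketch of enforcing a control at the point of entry, rather than cleansing downstream, might look like the following (the form fields and reference list are illustrative assumptions):

```python
# A rough sketch of a proactive control at the point of entry: defective input
# is rejected back to its source with a reason, instead of being accepted now
# and cleansed downstream later. The form fields and reference list are illustrative.

VALID_COUNTRIES = {"US", "DK", "GB"}

def validate_at_entry(form):
    """Reject defective input before it ever 'enters the lake'."""
    errors = []
    if not form.get("customer_name", "").strip():
        errors.append("customer_name is required")
    if form.get("country") not in VALID_COUNTRIES:
        errors.append("country must come from the approved reference list")
    return errors

submission = {"customer_name": "  ", "country": "XX"}
errors = validate_at_entry(submission)
if errors:
    # Per Redman's Rule of Ten, fixing these defects downstream would cost
    # roughly ten times more than rejecting them back to the source right now.
    print("Rejected at source:", "; ".join(errors))
```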

 

Proactive Data Quality > Reactive Data Quality

Proactive data quality is clearly the superior approach.  Although it is impossible to truly prevent every problem before it happens, the more control that can be enforced where data originates, the better the overall quality will be for enterprise information. 

Reactive data quality essentially treats the symptoms without curing the disease.  As Redman explains: “...the problem with being a good lake cleaner is that life never gets better...it gets worse as more data...conspire to mean there is more work every day.”

So why do the vast majority of data quality initiatives use a reactive approach?

 

An Arrow Thickly Smeared With Poison

In Buddhism, there is a famous parable:

A man was shot with an arrow thickly smeared with poison.  His friends wanted to get a doctor to heal him, but the man objected by saying:

“I will neither allow this arrow to be pulled out nor accept any medical treatment until I know the name of the man who wounded me, whether he was a nobleman or a soldier or a merchant or a farmer or a lowly peasant, whether he was tall or short or of average height, whether he used a long bow or a crossbow, and whether the arrow that wounded me was hoof-tipped or curved or barbed.” 

While his friends went off in a frantic search for these answers, the man slowly, and painfully, died.

 

“Flight to Data Quality”

In economics, the term “flight to quality” describes the aftermath of a financial crisis (e.g. a stock market crash) when people become highly risk-averse and move their money into safer, more reliable investments.

A similar “flight to data quality” can occur in the aftermath of an event when poor data quality negatively impacted decision-critical enterprise information.  Some examples include a customer service nightmare, a regulatory compliance failure, or a financial reporting scandal. 

Driven by a business triage for critical data problems, reactive data cleansing is purposefully chosen over proactive defect prevention.  The priority is finding and fixing the near-term problems rather than worrying about the long-term consequences of not identifying the root cause and implementing process improvements that would prevent it from happening again.

The enterprise has been shot with an arrow thickly smeared with poison – poor data quality.  Now is not the time to point out that the enterprise has actually shot itself by failing to have proactive measures in place. 

Reactive data quality only treats the symptoms.  However, during triage, the priority is to stabilize the patient.  A cure for the underlying condition is worthless if the patient dies before it can be administered.

 

Hyperactive Data Quality

Proactive data quality is the best practice.  Root cause analysis, business process improvement, and defect prevention will always be more effective than the endlessly vicious cycle of reactive data cleansing. 

A data governance framework is necessary for proactive data quality to be successful.  Patience and understanding are also necessary.  Proactive data quality requires a strategic organizational transformation that will not happen easily or quickly. 

Even when not facing an immediate crisis, the reality is that reactive data quality will occasionally be a necessary evil that is used to correct today's problems while proactive data quality is busy trying to prevent tomorrow's problems.

Just like any complex problem, data quality has no fast and easy solution.  Fundamentally, a hybrid discipline is required that combines proactive and reactive aspects into an approach that I refer to as Hyperactive Data Quality, which will make the responsibility for managing data quality a daily activity for everyone in your organization.

 

Please share your thoughts and experiences.

 

Related Posts

Hyperactive Data Quality (First Edition)

The General Theory of Data Quality

The Data-Information Continuum

The Data-Information Continuum

Data is one of the enterprise's most important assets.  Data quality is a fundamental success factor for the decision-critical information that drives the tactical and strategic initiatives essential to the enterprise's mission to survive and thrive in today's highly competitive and rapidly evolving marketplace.

When the results of these initiatives don't meet expectations, analysis often reveals that poor data quality is a root cause.  Projects are launched to understand and remediate this problem by establishing enterprise-wide data quality standards.

However, a common issue is a lack of understanding about what I refer to as the Data-Information Continuum.

 

The Data-Information Continuum

In physics, the Space-Time Continuum explains that space and time are interrelated entities forming a single continuum.  In classical mechanics, the passage of time can be considered a constant for all observers of spatial objects in motion.  In relativistic contexts, the passage of time is a variable changing for each specific observer of spatial objects in motion.

Data and information are also interrelated entities forming a single continuum.  It is crucial to understand how they are different and how they relate.  I like using the Dragnet definition for data – it is “just the facts” collected as an abstract description of the real-world entities that the enterprise does business with (e.g. customers, vendors, suppliers). 

A common data quality definition is fitness for the purpose of use.  A common challenge is data has multiple uses, each with its own fitness requirements.  I like to view each intended use as the information that is derived from data, defining information as data in use or data in action.

Data could be considered a constant while information is a variable that redefines data for each specific use.  Data is not truly a constant since it is constantly changing.  However, information is still derived from data and many different derivations can be performed while data is in the same state (i.e. before it changes again). 
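
As a small illustration (a sketch with made-up order data, not a prescribed design), the same data in the same state can be derived into several different information products, each judged by its own fitness requirements:

```python
# A small sketch of the Data-Information Continuum: the same data, in the same
# state, derived into different information for different uses. The order data
# below is made up purely for illustration.

orders = [  # data: "just the facts" about real-world events
    {"customer": "Acme", "region": "US", "amount": 120.0},
    {"customer": "Birk", "region": "DK", "amount": 80.0},
    {"customer": "Acme", "region": "US", "amount": 45.0},
]

# Information for finance: revenue by region (fitness: amounts must reconcile exactly).
revenue_by_region = {}
for order in orders:
    revenue_by_region[order["region"]] = revenue_by_region.get(order["region"], 0.0) + order["amount"]

# Information for marketing: distinct active customers (fitness: correct deduplication).
active_customers = {order["customer"] for order in orders}

print(revenue_by_region)  # {'US': 165.0, 'DK': 80.0}
print(active_customers)   # e.g., {'Acme', 'Birk'}
```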

Quality within the Data-Information Continuum has both objective and subjective dimensions.

 

Objective Data Quality

Data's quality must be objectively measured separate from its many uses.  Enterprise-wide data quality standards must provide a highest common denominator for all business units to use as an objective data foundation for their specific tactical and strategic initiatives.  Raw data extracted directly from its sources must be profiled, analyzed, transformed, cleansed, documented and monitored by data quality processes designed to provide and maintain universal data sources for the enterprise's information needs.  At this phase, the manipulations of raw data by these processes must be limited to objective standards and not be customized for any subjective use.

 

Subjective Information Quality

Information's quality can only be subjectively measured according to its specific use.  Information quality standards are not enterprise-wide, they are customized to a specific business unit or initiative.  However, all business units and initiatives must begin defining their information quality standards by using the enterprise-wide data quality standards as a foundation.  This approach allows leveraging a consistent enterprise understanding of data while also deriving the information necessary for the day-to-day operation of each business unit and initiative.

 

A “Single Version of the Truth” or the “One Lie Strategy”

A common objection to separating quality standards into objective data quality and subjective information quality is the enterprise's significant interest in creating what is commonly referred to as a single version of the truth.

However, in his excellent book Data Driven: Profiting from Your Most Important Business Asset, Thomas Redman explains:

“A fiendishly attractive concept is...'a single version of the truth'...the logic is compelling...unfortunately, there is no single version of the truth. 

For all important data, there are...too many uses, too many viewpoints, and too much nuance for a single version to have any hope of success. 

This does not imply malfeasance on anyone's part; it is simply a fact of life. 

Getting everyone to work from a single version of the truth may be a noble goal, but it is better to call this the 'one lie strategy' than anything resembling truth.”

Conclusion

There is a significant difference between data and information and therefore a significant difference between data quality and information quality.  Many data quality projects are in fact implementations of information quality customized to the specific business unit or initiative that is funding the project.  Although these projects can achieve some initial success, they encounter failures in later iterations and phases when information quality standards try to act as enterprise-wide data quality standards. 

Significant time and money can be wasted by not understanding the Data-Information Continuum.

Hyperactive Data Quality

In economics, the term "flight to quality" describes the aftermath of a financial crisis (e.g. a stock market crash) when people become highly risk-averse and move their money into safer, more reliable investments. 

A similar "flight to data quality" can occur in the aftermath of an event when poor data quality negatively impacted decision-critical enterprise information.  Some examples include a customer service nightmare, a regulatory compliance failure or a financial reporting scandal.  Whatever the triggering event, a common response is data quality suddenly becomes prioritized as a critical issue and an enterprise information initiative is launched.

Congratulations!  You've realized (albeit the hard way) that this "data quality thing" is really important.

Now what are you going to do about it?  How are you going to attempt to actually solve the problem?

In his excellent book Data Driven: Profiting from Your Most Important Business Asset, Thomas Redman uses an excellent analogy called the data quality lake:

"...a lake represents a database and the water therein the data.  The stream, which adds new water, is akin to a business process that creates new data and adds them to the database.  The lake...is polluted, just as the data are dirty.  Two factories pollute the lake.  Likewise, flaws in the business process are creating errors...

One way to address the dirty lake water is to clean it up...by running the water through filters, passing it through specially designed settling tanks, and using chemicals to kill bacteria and adjust pH. 

The alternative is to reduce the pollutant at the point source - the factories. 

The contrast between the two approaches is stark.  In the first, the focus is on the lake; in the second, it is on the stream.  So too with data.  Finding and fixing errors focuses on the database and data that have already been created.  Preventing errors focuses on the business processes and future data."

 

Reactive Data Quality

A "flight to data quality" usually prompts an approach commonly referred to as Reactive Data Quality (i.e. "cleaning the lake" to use Redman's excellent analogy).  The  majority of enterprise information initiatives are reactive.  The focus is typically on finding and fixing the problems with existing data in an operational data store (ODS), enterprise data warehouse (EDW) or other enterprise information repository.  In other words, the focus is on fixing data after it has been extracted from its sources.

An obsessive-compulsive quest to find and fix every data quality problem is a laudable but ultimately unachievable pursuit (even for expert "lake cleaners").  Data quality problems can be very insidious and even the best "lake cleaning" process will still produce exceptions.  Your process should be designed to identify and report exceptions when they occur.  In fact, as a best practice, you should also include the ability to suspend incoming data that contain exceptions for manual review and correction.

However, as Redman cautions: "...the problem with being a good lake cleaner is that life never gets better.  Indeed, it gets worse as more data...conspire to mean there is more work every day."  I tell my clients the only way to guarantee that reactive data quality will be successful is to unplug all the computers so that no one can add new data or modify existing data.

 

Proactive Data Quality

Attempting to prevent data quality problems before they happen is commonly referred to as Proactive Data Quality.  The focus is on preventing errors at the sources where data is entered or received and before it is extracted for use by downstream applications (i.e. "enters the lake").  Redman describes the benefits of proactive data quality with what he calls the Rule of Ten:

"It costs ten times as much to complete a unit of work when the input data are defective (i.e. late, incorrect, missing, etc.) as it does when the input data are perfect."

Proactive data quality advocates implementing improved edit controls on data entry screens, enforcing the data quality clause (you have one, right?) of your service level agreements with external data providers, and understanding the business needs of your enterprise information consumers before you deliver data to them.

Obviously, it is impossible to truly prevent every problem before it happens.  However, the more control that can be enforced where data originates, the better the overall quality will be for enterprise information.

 

Hyperactive Data Quality

Too many enterprise information initiatives fail because they are launched based on a "flight to data quality" response and have the unrealistic perspective that data quality problems can be quickly and easily resolved.  However, just like any complex problem, there is no fast and easy solution for data quality.

In order to be successful, you must combine aspects of both reactive and proactive data quality in order to create an enterprise-wide best practice that I call Hyperactive Data Quality, which will make the responsibility for managing data quality a daily activity for everyone in your organization.

 

Please share your thoughts and experiences.  Is your data quality Reactive, Proactive or Hyperactive?