Data Governance Frameworks are like Jigsaw Puzzles


In a recent interview, Jill Dyché tackled a common misconception, namely the mistaken belief that a data governance framework is a strategy.  “Unlike other strategic initiatives that involve IT,” Jill explained, “data governance needs to be designed.  The cultural factors, the workflow factors, the organizational structure, the ownership, the political factors, all need to be accounted for when you are designing a data governance roadmap.”

“People need a mental model, that is why everybody loves frameworks,” Jill continued.  “But they are not enough and I think the mistake that people make is that once they see a framework, rather than understanding its relevance to their organization, they will just adapt it and plaster it up on the whiteboard and show executives without any kind of context.  So they are already defeating the purpose of data governance, which is to make it work within the context of your business problems, not just have some kind of mental model that everybody can agree on, but is not really the basis for execution.”

“So it’s a really, really dangerous trend,” Jill cautioned, “that we see where people equate strategy with framework because strategy is really a series of collected actions that result in some execution — and that is exactly what data governance is.”

And in her excellent article Data Governance Next Practices: The 5 + 2 Model, Jill explained that data governance requires a deliberate design so that the entire organization can buy into a realistic execution plan, not just a sound bite.  As usual, I agree with Jill, since, in my experience, many people expect a data governance framework to provide eureka-like moments of insight.

In The Myths of Innovation, Scott Berkun debunked the myth of the eureka moment using the metaphor of a jigsaw puzzle.

“When you put the last piece into place, is there anything special about that last piece or what you were wearing when you put it in?” Berkun asked.  “The only reason that last piece is significant is because of the other pieces you’d already put into place.  If you jumbled up the pieces a second time, any one of them could turn out to be the last, magical piece.”

“The magic feeling at the moment of insight, when the last piece falls into place,” Berkun explained, “is the reward for many hours (or years) of investment coming together.  In comparison to the simple action of fitting the puzzle piece into place, we feel the larger collective payoff of hundreds of pieces’ worth of work.”

Perhaps the myth of the data governance framework could also be debunked using the metaphor of a jigsaw puzzle.

Data governance requires the coordination of a myriad of factors, including executive sponsorship, funding, decision rights, arbitration of conflicting priorities, policy definition, policy implementation, data quality remediation, data stewardship, business process optimization, technology enablement, change management — and many other puzzle pieces.

How could a data governance framework possibly predict how you will assemble the puzzle pieces?  Or how the puzzle pieces will fit together within your unique corporate culture?  Or which of the many aspects of data governance will turn out to be the last (or even the first) piece of the puzzle to fall into place in your organization?  And, of course, there is truly no last piece of the puzzle, since data governance is an ongoing program because the business world constantly gets jumbled up by change.

So, data governance frameworks are useful, but only if you realize that data governance frameworks are like jigsaw puzzles.

Data Quality and the Bystander Effect

In his recent Harvard Business Review blog post Break the Bad Data Habit, Tom Redman cautioned against correcting data quality issues without providing feedback to where the data originated.  “At a minimum,” Redman explained, “others using the erred data may not spot the error.  There is no telling where it might turn up or who might be victimized.”  And correcting bad data without providing feedback to its source also denies the organization an opportunity to get to the bottom of the problem.

“And failure to provide feedback,” Redman continued, “is but the proximate cause.  The deeper root issue is misplaced accountability — or failure to recognize that accountability for data is needed at all.  People and departments must continue to seek out and correct errors.  They must also provide feedback and communicate requirements to their data sources.”

In his blog post The Secret to an Effective Data Quality Feedback Loop, Dylan Jones responded to Redman’s blog post with some excellent insights regarding data quality feedback loops and how they can help improve your data quality initiatives.

I definitely agree with Redman and Jones about the need for feedback loops, but I have found, more often than not, that no feedback at all is provided on data quality issues because of the assumption that data quality is someone else’s responsibility.
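The pattern Redman and Jones describe can be sketched in code: when an error is corrected downstream, an issue report is also captured and routed back to the originating system, instead of the fix disappearing silently. This is only an illustrative sketch; all class and field names here are hypothetical, not from either author.

```python
from dataclasses import dataclass, field

@dataclass
class IssueReport:
    record_id: str
    field_name: str
    bad_value: str
    corrected_value: str
    source_system: str

@dataclass
class FeedbackLoop:
    """Corrects errors for downstream consumers while also
    collecting feedback to send back to the data source."""
    reports: list = field(default_factory=list)

    def correct(self, record: dict, field_name: str, new_value: str) -> dict:
        # Capture the issue before overwriting the bad value...
        self.reports.append(IssueReport(record["id"], field_name,
                                        record[field_name], new_value,
                                        record["source"]))
        # ...then apply the fix for the downstream consumer.
        return {**record, field_name: new_value}

    def feedback_for(self, source_system: str) -> list:
        # Everything the source system needs to find the root cause.
        return [r for r in self.reports if r.source_system == source_system]

loop = FeedbackLoop()
rec = {"id": "C042", "source": "CRM", "zip": "0210"}
rec = loop.correct(rec, "zip", "02101")
print(len(loop.feedback_for("CRM")))  # 1 issue ready to send upstream
```

The key design point is that the correction and the feedback happen in the same step, so fixing bad data downstream can never silently bypass the source system.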

This general lack of accountability for data quality issues is similar to what is known in psychology as the Bystander Effect, which refers to people often not offering assistance to the victim in an emergency situation when other people are present.  Apparently, the mere presence of other bystanders greatly decreases intervention, and the greater the number of bystanders, the less likely it is that any one of them will help.  Psychologists believe that the reason this happens is that as the number of bystanders increases, any given bystander is less likely to interpret the incident as a problem, and less likely to assume responsibility for taking action.

In my experience, the most common reason that data quality issues are often neither reported nor corrected is that most people throughout the enterprise act like data quality bystanders, less likely to interpret bad data as a problem or, at the very least, less likely to see it as their responsibility.  But the enterprise’s data quality is perhaps most negatively affected by this bystander effect, which may make it the worst bad data habit that the enterprise needs to break.

Data Quality Pro

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

On this episode, I am joined by special guest Dylan Jones, the community leader of Data Quality Pro, the largest membership resource dedicated entirely to the data quality profession.

Dylan is currently overseeing the re-build and re-launch of Data Quality Pro into a next generation membership platform, and during our podcast discussion, Dylan describes some of the great new features that will be coming soon to Data Quality Pro.

Links for Data Quality Pro and Dylan Jones:

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Demystifying Data Science — Guest Melinda Thielbar, a Ph.D. Statistician, discusses what a data scientist does and provides a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, and correlation.
  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

#FollowFriday Spotlight: @DataQualityPro

FollowFriday Spotlight is an OCDQ regular segment highlighting someone you should follow—and not just Fridays on Twitter.

Links for Data Quality Pro and Dylan Jones:

Data Quality Pro, founded and maintained by Dylan Jones, is a free and independent community resource dedicated to helping data quality professionals take their career or business to the next level.  Data Quality Pro is your free expert resource providing data quality articles, webinars, forums and tutorials from the world’s leading experts, every day.

With the mission to create the most beneficial data quality resource that is freely available to members around the world, the goal of Data Quality Pro is “winning-by-sharing”: they believe that if members contribute a small amount of their experience, skill, or time to support other members, then truly great things can be achieved.

Membership is 100% free and provides a broad range of additional content for professionals of all backgrounds and skill levels.

Check out the Best of Data Quality Pro, which includes the following great blog posts written by Dylan Jones in 2010:

 

Related Posts

#FollowFriday and Re-Tweet-Worthiness

#FollowFriday and The Three Tweets

Dilbert, Data Quality, Rabbits, and #FollowFriday

Twitter, Meaningful Conversations, and #FollowFriday

The Fellowship of #FollowFriday

Social Karma (Part 7) – Twitter

DQ-Tip: “There is no such thing as data accuracy...”

Data Quality (DQ) Tips is an OCDQ regular segment.  Each DQ-Tip is a clear and concise data quality pearl of wisdom.

“There is no such thing as data accuracy — There are only assertions of data accuracy.”

This DQ-Tip came from the Data Quality Pro webinar ISO 8000 Master Data Quality featuring Peter Benson of ECCMA.

You can download (.pdf file) quotes from this webinar by clicking on this link: Data Quality Pro Webinar Quotes - Peter Benson

ISO 8000 is the international standard for data quality.  You can get more information by clicking on this link: ISO 8000

Data Accuracy

Accuracy, thanks to substantial assistance from my readers, was defined in a previous post as the correctness of a data value within a limited context, such as verification by an authoritative reference (i.e., validity), combined with the correctness of a valid data value within an extensive context that includes other data as well as business processes (i.e., accuracy).

“The definition of data quality,” according to Peter and the ISO 8000 standards, “is the ability of the data to meet requirements.”

Although accuracy is only one of many dimensions of data quality, whenever we refer to data as accurate, we are referring to the ability of the data to meet specific requirements, and quite often it’s the ability to support making a critical business decision.

I agree with Peter and the ISO 8000 standards because we can’t simply take an accuracy metric on a data quality dashboard (or however else the assertion is presented to us) at face value without understanding how the metric is both defined and measured.

However, even when well defined and properly measured, data accuracy is still only an assertion.  Oftentimes, the only way to verify the assertion is by putting the data to its intended use.

If by using it you discover that the data is inaccurate, then by having established what the assertion of accuracy was based on, you have a head start on performing root cause analysis, enabling faster resolution of the issues—not only with the data, but also with the business and technical processes used to define and measure data accuracy.
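One way to keep an accuracy metric honest, in the spirit of Peter Benson’s point, is to carry the basis of the assertion along with the number, so anyone reading the dashboard can see how the metric was defined and measured. The following is a hypothetical sketch (the class, field names, and reference set are illustrative only, not from ISO 8000):

```python
from dataclasses import dataclass

@dataclass
class AccuracyAssertion:
    """An accuracy number plus the basis of the assertion:
    how it was defined and how it was measured."""
    metric: float          # e.g., 0.75 means 75% of values passed the check
    definition: str        # what "accurate" meant for this measurement
    measured_against: str  # the authoritative reference that was used
    sample_size: int       # how many records were actually checked

def assert_accuracy(values, reference):
    """Verify values against an authoritative reference (validity),
    and package the result as an assertion, not a bare number."""
    passed = sum(1 for v in values if v in reference)
    return AccuracyAssertion(
        metric=passed / len(values),
        definition="value appears in the reference list",
        measured_against="country code reference set (hypothetical)",
        sample_size=len(values),
    )

countries = ["US", "DE", "XX", "FR"]   # "XX" is not in the reference
result = assert_accuracy(countries, {"US", "DE", "FR", "GB"})
print(result.metric)                   # 0.75 -- and its basis travels with it
```

When the data is later put to its intended use and found wanting, the recorded definition and reference give root cause analysis its head start.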

DQ-View: Is Data Quality the Sun?

Data Quality (DQ) View is an OCDQ regular segment.  Each DQ-View is a brief video discussion of a data quality key concept.


This recent tweet by Dylan Jones of Data Quality Pro succinctly expresses a vitally important truth about the data quality profession.

Although few would debate that skill is a necessary requirement, some might doubt the need for passion.  Therefore, in this new DQ-View segment, I want to discuss why data quality initiatives require passionate data professionals.

 


 

If you are having trouble viewing this video, then you can watch it on Vimeo by clicking on this link: DQ-View on Vimeo

 

Related Posts

Data Gazers

Finding Data Quality

Oh, the Data You’ll Show!

Data Rock Stars: The Rolling Forecasts

The Second Law of Data Quality

The General Theory of Data Quality

DQ-Tip: “Start where you are...”

Sneezing Data Quality

The 2010 Data Quality Blogging All-Stars

The 2010 Major League Baseball (MLB) All-Star Game is being held tonight (July 13) at Angel Stadium in Anaheim, California.

For those readers who are not baseball fans, the All-Star Game is an annual exhibition held in mid-July that showcases the players with (for the most part) the best statistical performances during the first half of the MLB season.

Last summer, I began my own annual exhibition of showcasing the bloggers whose posts I have personally most enjoyed reading during the first half of the data quality blogging season. 

Therefore, this post provides links to stellar data quality blog posts that were published between January 1 and June 30 of 2010.  My definition of a “data quality blog post” also includes Data Governance, Master Data Management, and Business Intelligence. 

Please Note: There is no implied ranking in the order that bloggers or blogs are listed, other than that Individual Blog All-Stars are listed first, followed by Vendor Blog All-Stars, and the blog posts are listed in reverse chronological order by publication date.

 

Henrik Liliendahl Sørensen

From Liliendahl on Data Quality:

 

Dylan Jones

From Data Quality Pro:

 

Julian Schwarzenbach

From Data and Process Advantage Blog:

 

Rich Murnane

From Rich Murnane's Blog:

 

Phil Wright

From Data Factotum:

 

Initiate – an IBM Company

From Mastering Data Management:

 

Baseline Consulting

From their three blogs: Inside the Biz with Jill Dyché, Inside IT with Evan Levy, and In the Field with our Experts:

 

DataFlux – a SAS Company

From Community of Experts:

 

Related Posts

Recently Read: May 15, 2010

Recently Read: March 22, 2010

Recently Read: March 6, 2010

Recently Read: January 23, 2010

The 2009 Data Quality Blogging All-Stars

 

Additional Resources

From the IAIDQ, read the 2010 issues of the Blog Carnival for Information/Data Quality:

Microwavable Data Quality

Data quality is definitely not a one-time project, but instead requires a sustained program of enterprise-wide best practices that are best implemented within a data governance framework that “bakes in” defect prevention, data quality monitoring, and near real-time standardization and matching services—all ensuring high quality data is available to support daily business decisions.

However, implementing a data governance program is an evolutionary process requiring time and patience.

Baking and cooking also require time and patience.  Microwavable meals can be an occasional welcome convenience, and if you are anything like me (my condolences) and you can’t bake or cook, then microwavable meals can be an absolute necessity.

Data cleansing can also be an occasional (not necessarily welcome) convenience, or a relative necessity (i.e., a “necessary evil”).

Last year on Data Quality Pro, Dylan Jones hosted a great debate on the necessity of data cleansing, which is well worth reading, especially since the over 25 (and continuing) comments it received prove it is a polarizing topic for the data quality profession.

I reheated this debate (using the Data Quality Microwave, of course) earlier this year with my A Tale of Two Q’s blog post, which also received many commendable comments (but far fewer than Dylan’s blog post—not that I am counting or anything).

Similarly, a heated debate can be had over the health implications of the microwave.  Eating too many microwavable meals is certainly not healthy, but I have many friends and family who would argue quite strongly for either side of this “food fight.”

Both of these great debates can be as deeply polarizing as Pepsi vs. Coke and Soccer vs. Football.  Just for the official record, I am firmly for both Pepsi and Football—and by Football, I mean NFL Football—and firmly against both Coke and Soccer. 

Just as I advocate that everyone (myself included) should learn how to cook, but still accept the eternal reality of the microwave, I definitely advocate the implementation of a data governance program, but I also accept the eternal reality of data cleansing.   

However, my lawyers have advised me to report that beta testing for an actual Data Quality Microwave has not been promising.

 

Related Posts

A Tale of Two Q’s

Hyperactive Data Quality (Second Edition)

The General Theory of Data Quality

 

Follow OCDQ

If you enjoyed this blog post, then please subscribe to OCDQ via my RSS feed, my E-mail updates, or Google Reader.

You can also follow OCDQ on Twitter, fan the Facebook page for OCDQ, and connect with me on LinkedIn.


Customer Incognita

Many enterprise information initiatives are launched in order to unravel that riddle, wrapped in a mystery, inside an enigma, that great unknown, also known as...Customer.

Centuries ago, cartographers used the Latin phrase terra incognita (meaning “unknown land”) to mark regions on a map not yet fully explored.  In this century, companies simply cannot afford to use the phrase customer incognita to indicate what information about their existing (and prospective) customers they don't currently have or don't properly understand.

 

What is a Customer?

First things first, what exactly is a customer?  Those happy people who give you money?  Those angry people who yell at you on the phone or say really mean things about your company on Twitter and Facebook?  Why do they have to be so mean? 

Mean people suck.  However, companies who don't understand their customers also suck.  And surely you don't want to be one of those companies, do you?  I didn't think so.

Getting back to the question, here are some insights from the Data Quality Pro discussion forum topic What is a customer?:

  • Someone who purchases products or services from you.  The word “someone” is key because it’s not the role of a “customer” that forms the real problem, but the precision of the term “someone” that causes challenges when we try to link other and more specific roles to that “someone.”  These other roles could be contract partner, payer, receiver, user, owner, etc.
  • Customer is a role assigned to a legal entity in a complete and precise picture of the real world.  The role is established when the first purchase is accepted from this real-world entity.  Of course, the main challenge is whether or not the company can establish and maintain a complete and precise picture of the real world.

These working definitions were provided by fellow blogger and data quality expert Henrik Liliendahl Sørensen, who recently posted 360° Business Partner View, which further examines the many different ways a real-world entity can be represented, including when, instead of a customer, the real-world entity represents a citizen, patient, member, etc.
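Sørensen’s second definition, customer as a role assigned to a real-world legal entity, can be sketched as a simple party-role model. The class and role names below are illustrative only, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class Party:
    """One real-world legal entity; roles are assigned to it,
    never duplicated as separate 'customer' records."""
    party_id: str
    name: str
    roles: set = field(default_factory=set)  # e.g., customer, supplier, payer

    def assign_role(self, role: str):
        self.roles.add(role)

acme = Party("P001", "ACME Corp")
acme.assign_role("supplier")
# The customer role is established when the first purchase is accepted:
acme.assign_role("customer")
print(sorted(acme.roles))  # ['customer', 'supplier'] -- one entity, two roles
```

The design choice worth noting is that “customer” is never a table of its own here: it is one of many roles linked to a single picture of the real-world entity, which is exactly where the “precision of the term someone” challenge lives.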

A critical first step for your company is to develop your definition of a customer.  Don't underestimate either the importance or the difficulty of this process.  And don't assume it is simply a matter of semantics.

Some of my consulting clients have indignantly told me: “We don't need to define it, everyone in our company knows exactly what a customer is.”  I usually respond: “I have no doubt that everyone in your company uses the word customer, however I will work for free if everyone defines the word customer in exactly the same way.”  So far, I haven't had to work for free.  

 

How Many Customers Do You Have?

You have done the due diligence and developed your definition of a customer.  Excellent!  Nice work.  Your next challenge is determining how many customers you have.  Hopefully, you are not going to try using any of these techniques:

  • SELECT COUNT(*) AS "We have this many customers" FROM Customers
  • SELECT COUNT(DISTINCT Name) AS "No wait, we really have this many customers" FROM Customers
  • Middle-Square or Blum Blum Shub methods (i.e. random number generation)
  • Magic 8-Ball says: “Ask again later”

One of the most common and challenging data quality problems is the identification of duplicate records, especially redundant representations of the same customer information within and across systems throughout the enterprise.  The need for a solution to this specific problem is one of the primary reasons that companies invest in data quality software and services.
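To make the problem concrete, here is a deliberately simplified sketch of why the naive COUNT queries above fail and why rule-based matching is needed. Real solutions use far richer business rules and match algorithms; the standardization rules below are toy examples:

```python
import re

def normalize(record):
    """Apply simple standardization rules before comparing records."""
    name = re.sub(r"[^a-z ]", "", record["name"].lower())
    name = name.replace(" incorporated", " inc").strip()
    return (name, record["zip"])

customers = [
    {"name": "ACME Incorporated", "zip": "02101"},
    {"name": "Acme, Inc.",        "zip": "02101"},
    {"name": "Widgets LLC",       "zip": "90210"},
]

print(len(customers))                          # naive COUNT(*): 3 "customers"
print(len({normalize(c) for c in customers}))  # after matching rules: 2
```

Even this trivial example shows why the business rules matter more than the technology: whether “ACME Incorporated” and “Acme, Inc.” are the same customer is a business decision first, and only then a matching rule.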

Earlier this year on Data Quality Pro, I published a five part series of articles on identifying duplicate customers, which focused on the methodology for defining your business rules and illustrated some of the common data matching challenges.

Topics covered in the series:

  • Why a symbiosis of technology and methodology is necessary when approaching this challenge
  • How performing a preliminary analysis on a representative sample of real data prepares effective examples for discussion
  • Why using a detailed, interrogative analysis of those examples is imperative for defining your business rules
  • How both false negatives and false positives illustrate the highly subjective nature of this problem
  • How to document your business rules for identifying duplicate customers
  • How to set realistic expectations about application development
  • How to foster a collaboration of the business and technical teams throughout the entire project
  • How to consolidate identified duplicates by creating a “best of breed” representative record

To read the series, please follow these links:

To download the associated presentation (no registration required), please follow this link: OCDQ Downloads

 

Conclusion

“Knowing the characteristics of your customers,” stated Jill Dyché and Evan Levy in the opening chapter of their excellent book, Customer Data Integration: Reaching a Single Version of the Truth, “who they are, where they are, how they interact with your company, and how to support them, can shape every aspect of your company's strategy and operations.  In the information age, there are fewer excuses for ignorance.”

For companies of every size and within every industry, customer incognita is a crippling condition that must be replaced with customer cognizance in order for the company to continue to remain competitive in a rapidly changing marketplace.

Do you know your customers?  If not, then they likely aren't your customers anymore.

The Only Thing Necessary for Poor Data Quality

“Demonstrate projected defects and business impacts if the business fails to act,” explains Dylan Jones of Data Quality Pro in his recent and remarkable post How To Deliver A Compelling Data Quality Business Case.

“Presenting a future without data quality management...leaves a simple take-away message – do nothing and the situation will deteriorate.”

I cannot help but be reminded of the famous quote often attributed to the 18th century philosopher Edmund Burke:

“The only thing necessary for the triumph of evil, is for good men to do nothing.”

Or the even more famous quote often attributed to the long time ago Jedi Master Yoda:

“Poor data quality is the path to the dark side.  Poor data quality leads to bad business decisions. 

Bad business decisions lead to lost revenue.  Lost revenue leads to suffering.”

When you present the business case for your data quality initiative to executive management and other corporate stakeholders, demonstrate that poor data quality is not a theoretical problem – it is a real business problem that negatively impacts the quality of decision-critical enterprise information.

Preventing poor data quality is mission-critical.  Poor data quality will undermine the tactical and strategic initiatives essential to the enterprise's mission to survive and thrive in today's highly competitive and rapidly evolving marketplace.

“The only thing necessary for Poor Data Quality – is for good businesses to Do Nothing.”

Related Posts

Hyperactive Data Quality (Second Edition)

Data Quality: The Reality Show?

Data Governance and Data Quality

Data Quality Blogging All-Stars

The 2009 Major League Baseball (MLB) All-Star Game is being held tonight at Busch Stadium in St. Louis, Missouri. 

For those readers who are not baseball fans, the All-Star Game is an annual exhibition held in mid-July that showcases the players with the best statistical performances from the first half of the MLB season.

As I watch the 80th Midsummer Classic, I offer this exhibition that showcases the bloggers with the posts I have most enjoyed reading from the first half of the 2009 data quality blogging season.

 

Dylan Jones

From Data Quality Pro:

 

Daragh O Brien

From The DOBlog:

 

Steve Sarsfield

From Data Governance and Data Quality Insider:

 

Daniel Gent

From Data Quality Edge:

 

Henrik Liliendahl Sørensen

From Liliendahl on Data Quality:

 

Stefanos Damianakis

From Netrics HD:

 

Vish Agashe

From Business Intelligence: Process, People and Products:

 

Mark Goloboy

From Boston Data, Technology & Analytics:

 

Additional Resources

Over on Data Quality Pro, read the data quality blog roundups from the first half of 2009:

From the IAIDQ, read the 2009 issues of the IAIDQ Blog Carnival:

The Two Headed Monster of Data Matching

Data matching is commonly defined as the comparison of two or more records in order to evaluate if they correspond to the same real world entity (i.e. are duplicates) or represent some other data relationship (e.g. a family household).

Data matching is commonly plagued by what I refer to as The Two Headed Monster:

  • False Negatives - records that did not match, but should have been matched
  • False Positives - records that matched, but should not have been matched
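The two heads pull in opposite directions: raising a match threshold reduces false positives but creates false negatives, and lowering it does the reverse. A toy sketch, using a simple character-based similarity as a stand-in for real deterministic or probabilistic match algorithms:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Crude character-level similarity between two names (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# (name_1, name_2, are_they_really_the_same_person)
pairs = [
    ("Jon Smith",  "John Smith", True),   # same person, should match
    ("Jane Smith", "John Smith", False),  # different people, should not
]

def evaluate(threshold):
    """Count each head of the monster at a given match threshold."""
    false_neg = sum(1 for a, b, same in pairs
                    if same and similarity(a, b) < threshold)
    false_pos = sum(1 for a, b, same in pairs
                    if not same and similarity(a, b) >= threshold)
    return false_neg, false_pos

# similarity("Jon Smith", "John Smith") is about 0.95;
# similarity("Jane Smith", "John Smith") is about 0.70.
print(evaluate(0.90))  # (0, 0): this threshold happens to defeat both heads
print(evaluate(0.99))  # (1, 0): stricter matching turns the true
                       # duplicate into a false negative
```

On toy data a threshold can be tuned to defeat both heads at once; on real data the two error populations overlap, which is exactly why false negatives and false positives can be reduced but never eliminated.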

 

I Fought The Two Headed Monster...

On a recent (mostly) business trip to Las Vegas, I scheduled a face-to-face meeting with a potential business partner that I had previously communicated with via phone and email only.  We agreed to a dinner meeting at a restaurant in the hotel/casino where I was staying. 

I would be meeting with the President/CEO and the Vice President of Business Development, a man and a woman respectively.

I was facing a real world data matching problem.

I knew their names, but I had no idea what they looked like.  Checking their company website and LinkedIn profiles didn't help - no photos.  I neglected to get their mobile phone numbers, however they had mine.

The restaurant was inside the casino and the only entrance was adjacent to a Starbucks that had tables and chairs facing the casino floor.  I decided to arrive at the restaurant 15 minutes early and camp out at Starbucks since anyone going near the restaurant would have to walk right past me.

I was more concerned about avoiding false positives.  I didn't want to walk up to every potential match and introduce myself since casino security would soon intervene (and I have seen enough movies to know that scene always ends badly). 

I decided to apply some probabilistic data matching principles to evaluate the mass of humanity flowing past me. 

If some of my matching criteria seems odd, please remember I was in a Las Vegas casino. 

I excluded from consideration all:

  • Individuals wearing a uniform or a costume
  • Groups consisting of more than two people
  • Groups consisting of two men or two women
  • Couples carrying shopping bags or souvenirs
  • Couples demonstrating a public display of affection
  • Couples where one or both were noticeably intoxicated
  • Couples where one or both were scantily clad
  • Couples where one or both seemed too young or too old

I carefully considered any:

  • Couples dressed in business attire or business casual attire
  • Couples pausing to wait at the restaurant entrance
  • Couples arriving close to the scheduled meeting time

I was quite pleased with myself for applying probabilistic data matching principles to a real world situation.

However, the scheduled meeting time passed.  At first, I simply assumed they might be running a little late or were delayed by traffic.  As the minutes continued to pass, I started questioning my matching criteria.

 

...And The Two Headed Monster Won

When the clock reached 30 minutes past the scheduled meeting time, my mobile phone rang.  My dinner companions were calling to ask if I was running late.  They had arrived on time, were inside the restaurant, and had already ordered.

Confused, I entered the restaurant.  Sure enough, there sat a man and a woman that had walked right past me.  I excluded them from consideration because of how they were dressed.  The Vice President of Business Development was dressed in jeans, sneakers and a casual shirt.  The President/CEO was wearing shorts, sneakers and a casual shirt.

I had dismissed them as a vacationing couple.

I had been defeated by a false negative.

 

The Harsh Reality is that Monsters are Real

My data quality expertise could not guarantee victory in this particular battle with The Two Headed Monster. 

Monsters are real and the hero of the story doesn't always win.

And it doesn’t matter if the match algorithms I use are deterministic, probabilistic, or even supercalifragilistic. 

The harsh reality is that false negatives and false positives can be reduced, but never eliminated.

 

Are You Fighting The Two Headed Monster?

Are you more concerned about false negatives or false positives?  Please share your battles with The Two Headed Monster.

 

Related Articles

Back in February and March, I published a five part series of articles on data matching methodology on Data Quality Pro.

Parts 2 and 3 of the series provided data examples to illustrate the challenge of false negatives and false positives within the context of identifying duplicate customers:

Identifying Duplicate Customers

I just finished publishing a five part series of articles on data matching methodology for dealing with the common data quality problem of identifying duplicate customers. 

The article series was published on Data Quality Pro, which is the leading data quality online magazine and free independent community resource dedicated to helping data quality professionals take their career or business to the next level.

Topics covered in the series:

  • Why a symbiosis of technology and methodology is necessary when approaching the common data quality problem of identifying duplicate customers
  • How performing a preliminary analysis on a representative sample of real project data prepares effective examples for discussion
  • Why using a detailed, interrogative analysis of those examples is imperative for defining your business rules
  • How both false negatives and false positives illustrate the highly subjective nature of this problem
  • How to document your business rules for identifying duplicate customers
  • How to set realistic expectations about application development
  • How to foster a collaboration of the business and technical teams throughout the entire project
  • How to consolidate identified duplicates by creating a “best of breed” representative record

To read the series, please follow these links:

Do you have obsessive-compulsive data quality (OCDQ)?

Obsessive-compulsive data quality (OCDQ) affects millions of people worldwide.

The most common symptoms of OCDQ are:

  • Obsessively verifying data used in critical business decisions
  • Compulsively seeking an understanding of data in business terms
  • Repeatedly checking that data is complete and accurate before sharing it
  • Habitually attempting to calculate the cost of poor data quality
  • Constantly muttering a mantra that data quality must be taken seriously

While the good folks at Prescott Pharmaceuticals are busy working on a treatment, I am dedicating this independent blog as group therapy to all those who (like me) have dealt with OCDQ their entire professional lives.

Over the years, the work of many individuals and organizations has been immensely helpful to those of us with OCDQ.

Some of these heroes deserve special recognition:

Data Quality Pro – Founded and maintained by Dylan Jones, Data Quality Pro is a free independent community resource dedicated to helping data quality professionals take their career or business to the next level. With the mission to create the most beneficial data quality resource that is freely available to members around the world, Data Quality Pro provides free software, job listings, advice, tutorials, news, views and forums. Their goal is “winning-by-sharing”: they believe that if members contribute a small amount of their experience, skill or time to support other members, then truly great things can be achieved. With the new Member Service Register, consultants, service providers and technology vendors can promote their services and include links to their websites and blogs.

 

International Association for Information and Data Quality (IAIDQ) – Chartered in January 2004, IAIDQ is a not-for-profit, vendor-neutral professional association whose purpose is to create a world-wide community of people who desire to reduce the high costs of low quality information and data by applying sound quality management principles to the processes that create, maintain and deliver data and information. IAIDQ was co-founded by Larry English and Tom Redman, who are two of the most respected and well-known thought and practice leaders in the field of information and data quality. IAIDQ also provides two excellent blogs: IQ Trainwrecks and Certified Information Quality Professional (CIQP).

 

Beth Breidenbach – her blog Confessions of a database geek is fantastic in and of itself, but she has also compiled an excellent list of data quality blogs and provides them via aggregated feeds in both Feedburner and Google Reader formats.

 

Vincent McBurney – his blog Tooling Around in the IBM InfoSphere is an entertaining and informative look at data integration in the IBM InfoSphere covering many IBM Information Server products such as DataStage, QualityStage and Information Analyzer.

 

Daragh O Brien – is a leading writer, presenter and researcher in the field of information quality management, with a particular interest in legal aspects of information quality. His blog The DOBlog is a popular and entertaining source of great material.

 

Steve Sarsfield – his blog Data Governance and Data Quality Insider covers the world of data integration, data governance, and data quality from the perspective of an industry insider. Also, check out his new book: The Data Governance Imperative.