Data Quality and the Cupertino Effect

The Cupertino Effect occurs when you accept the suggestion of a spellchecker program that is attempting to assist you with a misspelled word (or with what it “thinks” is a misspelling because it cannot find an exact match for the word in its dictionary).

Although the suggestion (or, in most cases, one word from a list of suggestions) is indeed spelled correctly, it might not be the word you were trying to spell, and in some cases, by accepting the suggestion, you create a contextually inappropriate result.

It’s called the “Cupertino” Effect because older spellchecking dictionaries listed the word “cooperation” only in its hyphenated form (i.e., “co-operation”), which made the spellchecker suggest “Cupertino” (i.e., the California city that is home to the worldwide headquarters of Apple, Inc., thereby essentially guaranteeing its presence in every spellchecking dictionary).

When we accept the suggestion of a spellchecker program (and if there’s only one suggested word listed, don’t we always accept it?), a sentence we intended to read:

“Cooperation is vital to our mutual success.”

becomes instead:

“Cupertino is vital to our mutual success.”

And then confusion ensues (or hilarity—or both).

Beyond being a data quality issue for unstructured data (e.g., documents, e-mail messages, blog posts, etc.), the Cupertino Effect reminded me of the accuracy versus context debate.

 

“Data quality is primarily about context not accuracy...”

This Data Quality (DQ) Tip from last September sparked a nice little debate in the comments section.  The complete DQ-Tip was:

“Data quality is primarily about context not accuracy. 

Accuracy is part of the equation, but only a very small portion.”

Therefore, the key point wasn’t that accuracy is unimportant, but simply that context is more important.

In her fantastic book Executing Data Quality Projects, Danette McGilvray defines accuracy as “a measure of the correctness of the content of the data (which requires an authoritative source of reference to be identified and accessible).”

Returning to the Cupertino Effect for a moment, the spellchecking dictionary provides an identified, accessible, and somewhat authoritative source of reference—and “Cupertino” is correct data content for representing the name of a city in California. 

However, absent a context within which to evaluate accuracy, how can we determine the correctness of the content of the data?

 

The Free-Form Effect

Let’s use a different example.  A common root cause of poor quality for structured data is the free-form text field.

Regardless of how well the metadata description is written or how well the user interface is designed, if a free-form text field is provided, then you are essentially allowed to enter whatever you want for the content of the data (i.e., the data value).

For example, a free-form text field is provided for entering the Country associated with your postal address.

Therefore, you could enter data values such as:

Brazil
United States of America
Portugal
United States
República Federativa do Brasil
USA
Canada
Federative Republic of Brazil
Mexico
República Portuguesa
U.S.A.
Portuguese Republic

However, you could also enter data values such as:

Gondor
Gnarnia
Rohan
Citizen of the World
The Land of Oz
The Island of Sodor
Berzerkistan
Lilliput
Brobdingnag
Teletubbyland
Poketopia
Florin

The first list contains real countries, but a lack of standard values introduces needless variations. The second list contains fictional countries, which people like me enter into free-form fields to either prove a point or simply to amuse myself (well okay—both).
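To make that variation problem concrete in code, here is a minimal sketch (in Python, with a synonym map of my own invention that is obviously incomplete) of the downstream standardization work those needless variations create; the best it can do with the second list is reject it:

```python
# A minimal sketch of standardizing free-form country values.
# The synonym map is illustrative and far from complete.
COUNTRY_SYNONYMS = {
    "brazil": "BR", "federative republic of brazil": "BR",
    "república federativa do brasil": "BR",
    "united states": "US", "united states of america": "US",
    "usa": "US", "u.s.a.": "US",
    "portugal": "PT", "portuguese republic": "PT", "república portuguesa": "PT",
    "canada": "CA", "mexico": "MX",
}

def standardize_country(raw):
    """Return an ISO 3166 alpha-2 code, or None if the value is unrecognized."""
    return COUNTRY_SYNONYMS.get(raw.strip().lower())

print(standardize_country("U.S.A."))         # US
print(standardize_country("Teletubbyland"))  # None -- flag for manual review
```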

The most common solution is to provide a drop-down box of standard values, such as those provided by an identified, accessible, and authoritative source of reference—the ISO 3166 standard country codes.

Problem solved—right?  Maybe—but maybe not. 

Yes, I could now choose BR, US, PT, CA, MX (the ISO 3166 alpha-2 codes for Brazil, United States, Portugal, Canada, Mexico), which are the valid and standardized country code values for the countries from my first list above—and I would not be able to find any of my fictional countries listed in the new drop-down box.

However, I could also choose DO, RE, ME, FI, SO, LA, TT, DE (Dominican Republic, Réunion, Montenegro, Finland, Somalia, Lao People’s Democratic Republic, Trinidad and Tobago, Germany), all of which are valid and standardized country code values, yet all of them are contextually invalid for my postal address.
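To put that distinction in code, here is a minimal sketch (using only a small hand-picked subset of the ISO 3166 alpha-2 codes and an invented postal code lookup) of how a validity check happily accepts DE for my address while a context check rejects it:

```python
# A minimal sketch: a value can be valid against the reference standard
# yet still be wrong in the context of this particular record.
ISO_3166_SUBSET = {"BR", "US", "PT", "CA", "MX", "DO", "RE", "ME", "FI", "SO", "LA", "TT", "DE"}

def is_valid_country_code(code):
    """Validity: the code appears in the authoritative reference (subset shown here)."""
    return code in ISO_3166_SUBSET

def is_contextually_valid(address):
    """Context: the country code must also agree with the rest of the postal address."""
    postal_code_country = {"90210": "US", "20000-000": "BR"}  # toy lookup for this sketch
    expected = postal_code_country.get(address["postal_code"])
    return is_valid_country_code(address["country"]) and address["country"] == expected

address = {"city": "Beverly Hills", "postal_code": "90210", "country": "DE"}
print(is_valid_country_code(address["country"]))  # True  -- DE is a real ISO 3166 code
print(is_contextually_valid(address))             # False -- but not for this address
```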

 

Accuracy: With or Without Context?

Accuracy is only one of the many dimensions of data quality—and you may have a completely different definition for it. 

Paraphrasing Danette McGilvray, accuracy is a measure of the validity of data values, as verified by an authoritative reference. 

My question is what about context?  Or more specifically, should accuracy be defined as a measure of the validity of data values, as verified by an authoritative reference, and within a specific context?

Please note that I am only trying to define the accuracy dimension of data quality, and not data quality itself.

Therefore, please resist the urge to respond with “fitness for the purpose of use,” because even if you want to argue that “context” is just another word for “use,” then next we will have to argue over the meaning of the word “fitness,” and before you know it, we will be arguing over the meaning of the word “meaning.”

Please accurately share your thoughts (with or without context) about accuracy and context—by posting a comment below.

DQ-Tip: “Start where you are...”

Data Quality (DQ) Tips is an OCDQ regular segment.  Each DQ-Tip is a clear and concise data quality pearl of wisdom.

“Start where you are

Use what you have

Do what you can.”

This DQ-Tip is actually a wonderful quote from Arthur Ashe, which serves as the opening of the final chapter of the fantastic data quality book: Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information by Danette McGilvray.

“I truly believe,” explains McGilvray, “that no matter where you are, there is something you can do to help your organization.  I also recognize the fact that true sustainability of any data quality effort requires management support.  But don't be discouraged if you don't have the ear of the CEO (of course that would be nice, but don't let it stop you if you don't).”

McGilvray then suggests the following excellent list of dos and don'ts:

  • You DON'T have to have the CEO's support to begin, but . . .
  • You DO have to have the appropriate level of management support to get started while continuing to obtain additional support from as high up the chain as possible.

     

  • You DON'T have to have all the answers, but . . .
  • You DO need to do your homework and be willing to ask questions.

     

  • You DON'T need to do everything all at once, but . . .
  • You DO need to have a plan of action and get started!

“So what are you waiting for?” asks McGilvray. 

“Get going: build on your experience, continue to learn, bring value to your organization, have fun, and enjoy the journey!”

 

Related Posts

DQ-Tip: “Data quality is about more than just improving your data...”

DQ-Tip: “...Go talk with the people using the data”

DQ-Tip: “Data quality is primarily about context not accuracy...”

DQ-Tip: “Don't pass bad data on to the next person...”

 



Poor Data Quality is a Virus

“A storm is brewing—a perfect storm of viral data, disinformation, and misinformation.” 

These cautionary words (written by Timothy G. Davis, an Executive Director within the IBM Software Group) are from the foreword of the remarkable new book Viral Data in SOA: An Enterprise Pandemic by Neal A. Fishman.

“Viral data,” explains Fishman, “is a metaphor used to indicate that business-oriented data can exhibit qualities of a specific type of human pathogen: the virus.  Like a virus, data by itself is inert.  Data requires software (or people) for the data to appear alive (or actionable) and cause a positive, neutral, or negative effect.”

“Viral data is a perfect storm,” because as Fishman explains, it is “a perfect opportunity to miscommunicate with ubiquity and simultaneity—a service-oriented pandemic reaching all corners of the enterprise.”

“The antonym of viral data is trusted information.”

Data Quality

“Quality is a subjective term,” explains Fishman, “for which each person has his or her own definition.”  Fishman goes on to quote from many of the published definitions of data quality, including a few of my personal favorites:

  • David Loshin: “Fitness for use—the level of data quality determined by data consumers in terms of meeting or beating expectations.”
  • Danette McGilvray: “The degree to which information and data can be a trusted source for any and/or all required uses.  It is having the right set of correct information, at the right time, in the right place, for the right people to use to make decisions, to run the business, to serve customers, and to achieve company goals.”
  • Thomas Redman: “Data are of high quality if those who use them say so.  Usually, high-quality data must be both free of defects and possess features that customers desire.”

Data quality standards provide a highest common denominator to be used by all business units throughout the enterprise as an objective data foundation for their operational, tactical, and strategic initiatives.  Starting from this foundation, information quality standards are customized to meet the subjective needs of each business unit and initiative.  This approach leverages a consistent enterprise understanding of data while also providing the information necessary for day-to-day operations.

However, the enterprise-wide data quality standards must be understood as dynamic.  Therefore, enforcing strict conformance to data quality standards can be self-defeating.  On this point, Fishman quotes Joseph Juran: “conformance by its nature relates to static standards and specification, whereas quality is a moving target.”

Defining data quality is both an essential and challenging exercise for every enterprise.  “While a succinct and holistic single-sentence definition of data quality may be difficult to craft,” explains Fishman, “an axiom that appears to be generally forgotten when establishing a definition is that in business, data is about things that transpire during the course of conducting business.  Business data is data about the business, and any data about the business is metadata.  First and foremost, the definition as to the quality of data must reflect the real-world object, concept, or event to which the data is supposed to be directly associated.”

 

Data Governance

“Data governance can be used as an overloaded term,” explains Fishman, and he quotes Jill Dyché and Evan Levy to explain that “many people confuse data quality, data governance, and master data management.” 

“The function of data governance,” explains Fishman, “should be distinct and distinguishable from normal work activities.” 

For example, although knowledge workers and subject matter experts are necessary to define the business rules for preventing viral data, according to Fishman, these are data quality tasks and not acts of data governance. 

However,  these data quality tasks must “subsequently be governed to make sure that all the requisite outcomes comply with the appropriate controls.”

Therefore, according to Fishman, “data governance is a function that can act as an oversight mechanism and can be used to enforce controls over data quality and master data management, but also over data privacy, data security, identity management, risk management, or be accepted in the interpretation and adoption of regulatory requirements.”

 

Conclusion

“There is a line between trustworthy information and viral data,” explains Fishman, “and that line is very fine.”

Poor data quality is a viral contaminant that will undermine the operational, tactical, and strategic initiatives essential to the enterprise's mission to survive and thrive in today's highly competitive and rapidly evolving marketplace. 

Left untreated or unchecked, this infectious agent will negatively impact the quality of business decisions.  As the pathogen replicates, more and more decision-critical enterprise information will be compromised.

According to Fishman, enterprise data quality requires a multidisciplinary effort and a lifetime commitment to:

“Prevent viral data and preserve trusted information.”

Books Referenced in this Post

Viral Data in SOA: An Enterprise Pandemic by Neal A. Fishman

Enterprise Knowledge Management: The Data Quality Approach by David Loshin

Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information by Danette McGilvray

Data Quality: The Field Guide by Thomas Redman

Juran on Quality by Design: The New Steps for Planning Quality into Goods and Services by Joseph Juran

Customer Data Integration: Reaching a Single Version of the Truth by Jill Dyché and Evan Levy

 

Related Posts

DQ-Tip: “Don't pass bad data on to the next person...”

The Only Thing Necessary for Poor Data Quality

Hyperactive Data Quality (Second Edition)

The General Theory of Data Quality

Data Governance and Data Quality

Fantasy League Data Quality

For over 25 years, I have been playing fantasy league baseball and football.  For those readers who are not familiar with fantasy sports, they simulate ownership of a professional sports team.  Participants “draft” individual real-world professional athletes to “play” for their fantasy team, which competes with other teams using a scoring system based on real-world game statistics.

What does any of this have to do with data quality?

 

Master Data Management

In Worthy Data Quality Whitepapers (Part 1), Peter Benson of the ECCMA explained that “data is intrinsically simple and can be divided into data that identifies and describes things, master data, and data that describes events, transaction data.”

In fantasy sports, this distinction is very easy to make:

  • Master Data – data describing the real-world players on the roster of each fantasy team.

  • Transaction Data – data describing the statistical events of the real-world games played.

In his magnificent book Master Data Management, David Loshin explained that “master data objects are those core business objects used in the different applications across the organization, along with their associated metadata, attributes, definitions, roles, connections and taxonomies.”

In fantasy sports, Players and Teams are the master data objects with many characteristics, including the following (a brief code sketch follows the list):

  • Attributes – Player attributes include first name, last name, birth date, professional experience in years, and their uniform number.  Team attributes include name, owner, home city, and the name and seating capacity of their stadium.

  • Definitions – Player and Team have both Professional and Fantasy definitions.  Professional teams and players are real-world objects managed independent of fantasy sports.  Fundamentally, Professional Team and Professional Player are reference data objects from external content providers (Major League Baseball and the National Football League).  Therefore, Fantasy Team and Fantasy Player are the true master data objects.  The distinction between professional and fantasy teams is simpler than between professional and fantasy players.  Not every professional player will be used in fantasy sports (e.g. offensive linemen in football) and the same professional player can simultaneously play for multiple fantasy teams in different fantasy leagues (or sometimes even within the same league – e.g. fantasy tournament formats).

  • Roles – In baseball, the player roles are Batter, Pitcher, and Fielder.  In football, the player roles are Offense, Defense and Special Teams.  In both sports, the same player can have multiple or changing roles (e.g. in National League baseball, a pitcher is also a batter as well as a fielder).

  • Connections – Fantasy Players are connected to Fantasy Teams via a roster.  On the fantasy team roster, fantasy players are connected to real-world statistical events via a lineup, which indicates the players active for a given scoring period (typically a week in fantasy football and either a week or a day in fantasy baseball).  These connections change throughout the season.  Lineups change as players can go from active to inactive (i.e. on the bench) and rosters change as players can be traded, released, and signed (i.e. free agents added to the roster after the draft).

  • Taxonomies – Positions played are defined individually and organized into taxonomies.  In baseball, first base and third base are individual positions, but both are infield positions and more specifically corner infield.  Second base and short stop are also infield positions, and more specifically middle infield.  And not all baseball positions are associated with fielding (e.g. a pinch runner can accrue statistics such as stolen bases and runs scored without either fielding or batting).
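For readers who prefer code to prose, here is a minimal sketch (in Python, with invented field names rather than any real provider’s schema) of how the distinction between professional reference data and fantasy master data objects described in the list above might be modeled:

```python
from dataclasses import dataclass, field
from typing import List

# A minimal sketch of the master data objects described above; field names are invented.
@dataclass
class ProfessionalPlayer:
    """Reference data sourced from an external content provider (e.g., MLB or the NFL)."""
    player_id: str
    first_name: str
    last_name: str
    birth_date: str
    uniform_number: int
    positions: List[str] = field(default_factory=list)  # taxonomy of positions played

@dataclass
class FantasyPlayer:
    """The true master data object: a league-specific view of a professional player."""
    fantasy_player_id: str
    professional_player_id: str  # connection back to the reference data
    fantasy_team_id: str         # connection to the Fantasy Team roster
    roles: List[str] = field(default_factory=list)  # e.g., Batter, Pitcher, Fielder
    active: bool = False         # lineup status for the current scoring period
```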

 

Data Warehousing

Combining a personal hobby with professional development, I built a fantasy baseball data warehouse.  I downloaded master, reference, and transaction data from my fantasy league's website.  I prepared these sources in a flat file staging area, from which I applied inserts and updates to the relational database tables in my data warehouse, where I used dimensional modeling.

My dimension tables were Date, Professional Team, Player, Position, Fantasy League, and Fantasy Team.  All of these tables (except for Date) were Type 2 slowly changing dimensions to support full history and rollbacks.
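To illustrate what Type 2 means in practice, here is a minimal sketch (plain Python over a list of dictionaries, with invented column names rather than my actual schema) of how a changed attribute expires the current dimension row and inserts a new one, preserving full history:

```python
from datetime import date

# A minimal sketch of a Type 2 slowly changing dimension update;
# column names are invented for illustration.
player_dim = [
    {"player_key": 1, "player_id": "P100", "professional_team": "Red Sox",
     "effective_date": date(2009, 4, 1), "expiration_date": None, "is_current": True},
]

def apply_type2_update(dim, player_id, changes, as_of):
    """Expire the current row and insert a new row carrying the changed attributes."""
    current = next(r for r in dim if r["player_id"] == player_id and r["is_current"])
    current["expiration_date"] = as_of
    current["is_current"] = False
    new_row = {**current, **changes,
               "player_key": max(r["player_key"] for r in dim) + 1,
               "effective_date": as_of, "expiration_date": None, "is_current": True}
    dim.append(new_row)

# A mid-season trade: both team assignments remain queryable for history and rollbacks.
apply_type2_update(player_dim, "P100", {"professional_team": "Yankees"}, date(2009, 7, 31))
print([(r["professional_team"], r["is_current"]) for r in player_dim])
# [('Red Sox', False), ('Yankees', True)]
```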

For simplicity, the Date dimension was calendar days with supporting attributes for all aggregate levels (e.g. monthly aggregate fact tables used the last day of the month as opposed to a separate Month dimension).

Professional and fantasy team rosters, as well as fantasy team lineups and fantasy league team membership, were all tracked using factless fact tables.  For example, the Professional Team Roster factless fact table used the Date, Professional Team, and Player dimensions, and the Fantasy Team Lineup factless fact table used the Date, Fantasy League, Fantasy Team, Player, and Position dimensions. 

The factless fact tables also allowed Player to be used as a conformed dimension for both professional and fantasy players, since a separate Fantasy Player dimension would have redundantly stored multiple instances of the same professional player (one for each fantasy team he played for) and would have forced Fantasy League and Fantasy Team to become snowflaked dimensions.
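Here is a minimal sketch (again plain Python, with invented key values) of what a factless fact table like Fantasy Team Lineup boils down to, namely rows of dimension keys with no measures that still answer membership questions such as who was active for a given team on a given date:

```python
from datetime import date

# A minimal sketch of a factless fact table: rows of dimension keys, no measures.
# Key values are invented for illustration.
fantasy_team_lineup = [
    # (date_key, fantasy_league_key, fantasy_team_key, player_key, position_key)
    (date(2009, 7, 31), 1, 10, 100, "1B"),
    (date(2009, 7, 31), 1, 10, 101, "2B"),
    (date(2009, 7, 31), 2, 20, 100, "1B"),  # same professional player, different league
]

def active_players(lineup, on_date, fantasy_team_key):
    """Which players were in this fantasy team's lineup on this date?"""
    return [player for (d, _league, team, player, _pos) in lineup
            if d == on_date and team == fantasy_team_key]

print(active_players(fantasy_team_lineup, date(2009, 7, 31), 10))  # [100, 101]
```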

My base fact tables were daily transactions for Batting Statistics and Pitching Statistics.  These base fact tables used only the Date, Professional Team, Player, and Position dimensions to provide the lowest level of granularity for daily real-world statistical performances independent of fantasy baseball. 

The Fantasy League and Fantasy Team dimensions replaced the Professional Team dimension in a separate family of base fact tables for daily fantasy transactions for Batting Statistics and Pitching Statistics.  This was necessary to accommodate the same professional player simultaneously playing for multiple fantasy teams in different fantasy leagues.  Alternatively, I could have stored each fantasy league in a separate data mart.

Aggregate fact tables accumulated month-to-date and year-to-date batting and pitching statistical totals for fantasy players and teams.  Additional aggregate fact tables incremented current rolling snapshots of batting and pitching statistical totals for the previous 7, 14 and 21 days for players only.  Since the aggregate fact tables were created to optimize fantasy league query performance, only the base tables with daily fantasy transactions were aggregated.

Conformed facts were used in both the base and aggregate fact tables.  In baseball, this is relatively easy to achieve since most statistics have been consistently defined and used for decades (and some for more than a century). 

For example, batting average is defined as the ratio of hits to at bats and has been used consistently since the late 19th century.  However, there are still statistics with multiple meanings.  For example, walks and strikeouts are recorded for both batters and pitchers, with very different connotations for each.

Additionally, in the late 20th century, new baseball statistics such as secondary average and runs created have been defined with widely varying formulas.  Metadata tables with definitions (including formulas where applicable) were included in the baseball data warehouse to avoid confusion.
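Here is a minimal sketch (in Python, using one illustrative variant of the secondary average formula among the several that have been published) of the kind of metadata table I mean, where each statistic carries its definition and, where applicable, its formula:

```python
# A minimal sketch of a metadata table for conformed facts: each statistic carries
# its definition and, where applicable, its formula, so every fact table computes
# it the same way.  The secondary average formula below is one published variant.
STAT_METADATA = {
    "batting_average": {
        "definition": "Hits divided by at bats.",
        "formula": lambda s: s["H"] / s["AB"] if s["AB"] else 0.0,
    },
    "secondary_average": {
        "definition": "Extra bases, walks, and steals per at bat (formulas vary).",
        "formula": lambda s: (s["BB"] + s["TB"] - s["H"] + s["SB"]) / s["AB"] if s["AB"] else 0.0,
    },
}

season = {"AB": 500, "H": 150, "BB": 60, "TB": 250, "SB": 20}
for name, meta in STAT_METADATA.items():
    print(name, round(meta["formula"](season), 3))
# batting_average 0.3
# secondary_average 0.36
```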

For remarkable reference material containing clear-cut guidelines and real-world case studies for both dimensional modeling and data warehousing, I highly recommend all three books in the collection: Ralph Kimball's Data Warehouse Toolkit Classics.

 

Business Intelligence

In his Information Management special report BI: Only as Good as its Data Quality, William Giovinazzo explained that “the chief promise of business intelligence is the delivery to decision-makers the information necessary to make informed choices.”

As a reminder for the uninitiated, fantasy sports simulate the ownership of a professional sports team.  Business intelligence techniques are used for pre-draft preparation and for tracking your fantasy team's statistical performance during the season in order to make management decisions regarding your roster and lineup.

The aggregate fact tables that I created in my baseball data warehouse delivered the same information available as standard reports from my fantasy league's website.  This allowed me to use the website as an external data source to validate my results, which is commonly referred to as using a “surrogate source of the truth.”  However, since I also used the website as the original source of my master, reference, and transaction data, I double-checked my results using other websites. 

This is a significant advantage for fantasy sports: there are numerous external data sources freely available online that can be used for validation.  Of course, this wasn’t always the case.

Over 25 years ago when I first started playing fantasy sports, my friends and I had to manually tabulate statistics from newspapers.  We migrated to customized computer spreadsheet programs (this was in the days before everyone had PCs with Microsoft Excel – which we eventually used) before the Internet revolution and cloud computing brought the wonderful world of fantasy sports websites that we enjoy today.

Now with just a few mouse clicks, I can run regression analysis to determine whether my next draft pick should be a first baseman predicted to hit 30 home runs or a second baseman predicted to have a .300 batting average and score 100 runs. 
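Here is a minimal sketch (with invented historical numbers and a simplified scoring assumption of my own, not my league’s actual rules) of the kind of quick regression I mean: fit a trend to each player’s past seasons, project one season ahead, and convert the projections into comparable fantasy points:

```python
import numpy as np

# A minimal sketch: project next season from a linear trend over past seasons,
# then compare the two picks under a simplified (invented) scoring rule.
def project_next(seasons, values):
    """Fit a straight line to past seasons and extrapolate one season ahead."""
    slope, intercept = np.polyfit(seasons, values, 1)
    return slope * (max(seasons) + 1) + intercept

seasons = [2006, 2007, 2008, 2009]
first_baseman_hr = [24, 27, 29, 31]      # invented history
second_baseman_runs = [88, 93, 97, 102]  # invented history

# Simplified scoring assumption: 4 points per home run, 1 point per run scored.
# (Ratio categories like batting average need a different, league-relative valuation.)
print(round(project_next(seasons, first_baseman_hr) * 4, 1))
print(round(project_next(seasons, second_baseman_runs) * 1, 1))
```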

I can check my roster for weaknesses in statistics difficult to predict, such as stolen bases and saves.  I can track the performances of players I didn't draft to decide if I want to make a trade, as well as accurately evaluate a potential trade from another owner who claims to be offering players who are having a great year and could help my team be competitive.

 

Data Quality

In her fantastic book Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information, Danette McGilvray comprehensively defines all of the data quality dimensions, which include the following most applicable to fantasy sports:

  • Accuracy – A measure of the correctness of the content of the data, which requires an authoritative source of reference to be identified and accessible.

  • Timeliness and Availability – A measure of the degree to which data are current and available for use as specified and in the time frame in which they are expected.

  • Data Coverage – A measure of the availability and comprehensiveness of data compared to the total data universe or population of interest.

  • Presentation Quality – A measure of how information is presented to and collected from those who utilize it.  Format and appearance support appropriate use of the information.

  • Perception, Relevance, and Trust – A measure of the perception of and confidence in the data quality; the importance, value, and relevance of the data to business needs.

 

Conclusion

I highly doubt that you will see Fantasy League Data Quality coming soon to a fantasy sports website near you.  It is just as unlikely that my future blog posts will conclude with “The Mountain Dew Post Game Show” or that I will rename my blog to “OCDQ – The Worldwide Leader in Data Quality” (duh-nuh-nuh, duh-nuh-nuh).

However, fantasy sports are more than just a hobby.  They're a thriving real-world business providing many excellent examples of best practices in action for master data management, data warehousing, and business intelligence – all implemented upon a solid data quality foundation.

So who knows, maybe some Monday night this winter we'll hear Hank Williams Jr. sing:

“Are you ready for some data quality?”