August 09, 2009

Adventures in Data Profiling (Part 3)

August 09, 2009/ Jim Harris

In Part 2 of this series: The adventures continued with a detailed analysis of the Customer ID field and the preliminary analysis of the Gender Code and Customer Name fields. This provided you with an opportunity to become familiar with the features of the fictional data profiling tool that you are using throughout this series to assist with performing your analysis.

Additionally, some of your fellow Data Gazers have provided excellent insights and suggestions via the comments they have left, including my time traveling alter ego who has left you some clues from what the future might hold when you reach the end of these adventures in data profiling.

In Part 3, you will continue your adventures by using a combination of field values and field formats to begin your analysis of the following fields: Birth Date, Telephone Number and E-mail Address.

Birth Date

The field summary for Birth Date includes input metadata along with the summary and additional statistics provided by the data profiling tool. Let's assume that drill-downs revealed the single profiled field data type was DATE and the single profiled field format was MM-DD-CCYY (i.e. Month-Day-Year).

Combined with the profiled minimum/maximum field lengths and minimum/maximum field values, the good news appears to be that when Birth Date is populated it does contain a date value.

However, the not so good news is that the profiled maximum field value (December 21, 2012) appears to indicate that some of the customers are either time travelers or the marketing department has a divinely inspired prospect list.

This is a good example of a common data quality challenge – a field value can have a valid data type and a valid format – but an invalid context. Although 12-21-2012 is a valid date in a valid format, in the context of a birth date, it can't be valid.

We can use drill-downs on the field summary “screen” to get more details about Birth Date provided by the data profiling tool.

The cardinality of Birth Date is not only relatively high, but it also has a very low Distinctness (i.e. the same field value frequently occurs on more than one record). Therefore, we will limit the review to only the top ten most frequently occurring values.

Additional analysis can be performed by extracting the birth year and reviewing only its top ten most frequently occurring values. One aspect of this analysis is that it can be used as an easier method for examining the customer age range.

Here we also see two contextually invalid birth years: 2011 and 2012. Any thoughts on a possible explanation for this data anomaly?

Telephone Number

The field summary for Telephone Number includes input metadata along with the summary and additional statistics provided by the data profiling tool.

The presence of both multiple profiled field data types and multiple profiled field formats would appear to indicate inconsistencies in the way that telephone numbers are represented.

The profiled minimum/maximum field lengths show additional inconsistencies, but perhaps more concerning is the profiled minimum/maximum field values, which show obviously invalid telephone numbers.

Telephone Number is a good example of how you should not mistake Completeness (which as a data profiling statistic indicates the field is populated with an Actual value) for an indication that the field is complete in the sense that its value contains all of the sub-values required to be considered valid.

This summary information points to the need to use drill-downs in order to review more detailed information.

The count of the number of distinct data types is explained by the data profiling tool observing field values that could be represented by three different data types based on content and numeric precision.

With only ten profiled field formats, we can easily review them all. Most formats appear to be representative of potentially valid telephone numbers. However, there are two formats for 7 digit numbers appearing to indicate local dialing syntax (i.e. missing the area code in the United States). Additionally, there are two formats that appear invalid based on North American standards.

However, a common data quality challenge is that valid field formats can conceal invalid field values.

Since the cardinality of Telephone Number is very high, we will limit the review to only the top ten most frequently occurring values. In this case, more obviously invalid telephone numbers are discovered.

E-mail Address

The field summary for E-mail Address includes input metadata along with the summary statistics provided by the data profiling tool. In order to save some space, I have intentionally omitted the additional profiling statistics for this field.

E-mail Address represents a greater challenge that really requires more than just summary statistics in order to perform effective analysis.

Most data profiling tools will provide the capability to analyze fields using formats that are constructed by parsing and classifying the individual values within the field.

In the case of the E-mail Address field, potentially valid field values should be comprised of the sub-values User, Domain and Top Level Domain (TLD). These sub-values also have expected delimiters such as User and Domain being separated by an at symbol (@) and Domain and TLD being separated by a dot symbol(.).

Reviewing the top ten most frequently occurring field formats shows several common potentially valid structures. However, some formats are missing one of the three required sub-values. The formats missing User could be an indication that the field sometimes contains a Website Address.

Extracting the top five most frequently occurring Domain and TLD sub-values provides additional alternative analysis for a high cardinality field.

What other questions can you think of for these fields? Additional analysis could be done using drill-downs to perform a more detailed review of records of interest. What other analysis do you think should be performed for these fields?

In Part 4 of this series: We will continue the adventures by shifting our focus to postal address by first analyzing the following fields: City Name, State Abbreviation, Zip Code and Country Code.

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

August 05, 2009

Adventures in Data Profiling (Part 2)

August 05, 2009/ Jim Harris

In Part 1 of this series: The adventures began with the following scenario – You are an external consultant on a new data quality initiative. You have got 3,338,190 customer records to analyze, a robust data profiling tool, half a case of Mountain Dew, it's dark, and you're wearing sunglasses...ok, maybe not those last two or three things – but the rest is true.

You have no prior knowledge of the data or its expected characteristics. You are performing this analysis without the aid of either business requirements or subject matter experts. Your goal is to learn us much as you can about the data and then prepare meaningful questions and reports to share with the rest of your team.

The customer data source was processed by the data profiling tool, which provided the following statistical summaries:

The Adventures Continue...

In Part 1, we asked if Customer ID was the primary key for this data source. In an attempt to answer this question, let's “click” on it and drill-down to a field summary provided by the data profiling tool:

Please remember that my data profiling tool is fictional (i.e. not modeled after any real product) and therefore all of my “screen shots” are customized to illustrate series concepts. This “screen” would not only look differently in a real data profiling tool, but it would also contain additional information.

This field summary for Customer ID includes some input metadata, identifying the expected data type and field length. Verifying data matches the metadata that describes it is one essential analytical task that data profiling can help us with, providing a much needed reality check for the perceptions and assumptions that we may have about our data.

The data profiling summary statistics for Customer ID are listed, followed by some useful additional statistics: the count of the number of distinct data types (based on analyzing the values, not the metadata), minimum/maximum field lengths, minimum/maximum field values, and the count of the number of distinct field formats.

We can use drill-downs on the field summary “screen” to get more details about Customer ID provided by the data profiling tool.

The count of the number of distinct data types is explained by the data profiling tool observing field values that could be represented by three different integer data types based on precision (which can vary by RDBMS). Different tools would represent this in different ways (including the option to automatically collapse the list into the data type of the highest precision that could store all of the values).

Drilling down on the field data types shows the field values (in this example, limited to the 5 most frequently occurring values). Please note, I have intentionally customized these lists to reveal hints about the precision breakdown used by my fictional RDBMS.

The count of the number of distinct field formats shows the frequency distribution of the seven numeric patterns observed by the data profiling tool for Customer ID: 7 digits, 6 digits, 5 digits, 4 digits, 3 digits, 2 digits, and 1 digit. We could also continue drilling down to see the actual field values behind the field formats.

Based on analyzing all of the information provided to you by the data profiling tool, can you safely assume that Customer ID is an integer surrogate key that can be used as the primary key for this data source?

In Part 1, we asked why the Gender Code field has 8 distinct values. Cardinality can play a major role in deciding whether or not you want to drill-down to field values or field formats since it is much easier to review all of the field values when there are not very many of them. Alternatively, the review of high cardinality fields can also be limited to the most frequently occurring values (we will see several examples of this alternative later in the series when analyzing some of the other fields).

We will drill-down to this “screen” to view the frequency distribution of the field values for Gender Code provided by the data profiling tool.

It is probably not much of a stretch to assume that F is an abbreviation for Female and M is an abbreviation for Male. Also, you may ask if Unknown is any better of a value than NULL or Missing (which are not listed because the list was intentionally filtered to include only Actual values).

However, it is dangerous to assume anything and what about those numeric values? Additionally, you may wonder if Gender Code can tell us anything about the characteristics of the Customer Name fields. For example, do the records with a NULL or Missing value in Gender Code indicate the presence of an organization name and do the records with an Actual Gender Code value indicate the presence of a personal name?

To attempt to answer these questions, it may be helpful to review records with each of these field values. Therefore, let's assume that we have performed drill-down analysis using the data profiling tool and have selected the following records of interest:

As is so often the case, data rarely conforms to our assumptions about it. Although we will perform more detailed analysis later in the series, what are your thoughts at this point regarding the Gender Code and Customer Name fields?

In Part 3 of this series: We will continue the adventures by using a combination of field values and field formats to begin our analysis of the following fields: Birth Date, Telephone Number and E-mail Address.

Adventures in Data Profiling (Part 1)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

August 03, 2009

Adventures in Data Profiling (Part 1)

August 03, 2009/ Jim Harris

In my popular post Getting Your Data Freq On, I explained that understanding your data is essential to using it effectively and improving its quality – and to achieve these goals, there is simply no substitute for data analysis.

I explained the benefits of using a data profiling tool to help automate some of the grunt work, but that you need to perform the actual analysis and then prepare meaningful questions and reports to share with the rest of your team.

Series Overview

This post is the beginning of a vendor-neutral series on the methodology of data profiling.

In order to narrow the scope of the series, the scenario used will be that a customer data source for a new data quality initiative has been made available to an external consultant who has no prior knowledge of the data or its expected characteristics. Also, the business requirements have not yet been documented, and the subject matter experts are not currently available.

The series will not attempt to cover every possible feature of a data profiling tool or even every possible use of the features that are covered. Both the data profiling tool and the data used throughout the series will be fictional. The “screen shots” have been customized to illustrate concepts and are not modeled after any particular data profiling tool.

The Adventures Begin...

The customer data source has been processed by a data profiling tool, which has provided the above counts and percentages that summarize the following field content characteristics:

NULL – count of the number of records with a NULL value
Missing – count of the number of records with a missing value (i.e. non-NULL absence of data e.g. character spaces)
Actual – count of the number of records with an actual value (i.e. non-NULL and non-missing)
Completeness – percentage calculated as Actual divided by the total number of records
Cardinality – count of the number of distinct actual values
Uniqueness – percentage calculated as Cardinality divided by the total number of records
Distinctness – percentage calculated as Cardinality divided by Actual

Some initial questions based on your analysis of these statistical summaries might include the following:

Is Customer ID the primary key for this data source?
Is Customer Name 1 the primary name on the account? If so, why isn't it always populated?
Do the statistics for Account Number and/or Tax ID indicate the presence of potential duplicate records?
Why does the Gender Code field have 8 distinct values?
Do the 5 distinct values in Country Code indicate international postal addresses?

Please remember the series scenario – You are an external consultant with no prior knowledge of the data or its expected characteristics, who is performing this analysis without the aid of either business requirements or subject matter experts.

What other questions can you think of based on analyzing the statistical summaries provided by the data profiling tool?

In Part 2 of this series: We will continue the adventures by attempting to answer these questions (and more) by beginning our analysis of the frequency distributions of the unique values and formats found within the fields. Additionally, we will begin using drill-down analysis in order to perform a more detailed review of records of interest.

Adventures in Data Profiling (Part 2)

Adventures in Data Profiling (Part 3)

Adventures in Data Profiling (Part 4)

Adventures in Data Profiling (Part 5)

Adventures in Data Profiling (Part 6)

Adventures in Data Profiling (Part 7)

Getting Your Data Freq On

March 26, 2009

Identifying Duplicate Customers

March 26, 2009/ Jim Harris

I just finished publishing a five part series of articles on data matching methodology for dealing with the common data quality problem of identifying duplicate customers.

The article series was published on Data Quality Pro, which is the leading data quality online magazine and free independent community resource dedicated to helping data quality professionals take their career or business to the next level.

Topics covered in the series:

Why a symbiosis of technology and methodology is necessary when approaching the common data quality problem of identifying duplicate customers
How performing a preliminary analysis on a representative sample of real project data prepares effective examples for discussion
Why using a detailed, interrogative analysis of those examples is imperative for defining your business rules
How both false negatives and false positives illustrate the highly subjective nature of this problem
How to document your business rules for identifying duplicate customers
How to set realistic expectations about application development
How to foster a collaboration of the business and technical teams throughout the entire project
How to consolidate identified duplicates by creating a “best of breed” representative record

To read the series, please follow these links:

OCDQ Blog

OCDQ Blog

OCDQ Blog

OCDQ Blog

Adventures in Data Profiling (Part 3)

Birth Date

Telephone Number

E-mail Address

Related Posts

Adventures in Data Profiling (Part 2)

The Adventures Continue...

Related Posts

Adventures in Data Profiling (Part 1)

Series Overview

The Adventures Begin...

Related Posts

Identifying Duplicate Customers

OCDQ Blog