On Philosophy, Science, and Data

Ever since Melinda Thielbar helped me demystify data science on OCDQ Radio, I have been pondering my paraphrasing of an old idea: Science without philosophy is blind; Philosophy without science is empty; Data needs both science and philosophy.

“A philosopher’s job is to find out things about the world by thinking rather than observing,” the philosopher Bertrand Russell once said.  One could say a scientist’s job is to find out things about the world by observing and experimenting.  In fact, Russell observed that “the most essential characteristic of scientific technique is that it proceeds from experiment, not from tradition.”

Russell also said that “science is what we know, and philosophy is what we don’t know.”  However, Stuart Firestein, in his book Ignorance: How It Drives Science, explained “there is no surer way to screw up an experiment than to be certain of its outcome.”

Although it seems it would make more sense for science to be driven by what we know, by facts, “working scientists,” according to Firestein, “don’t get bogged down in the factual swamp because they don’t care that much for facts.  It’s not that they discount or ignore them, but rather that they don’t see them as an end in themselves.  They don’t stop at the facts; they begin there, right beyond the facts, where the facts run out.  Facts are selected for the questions they create, for the ignorance they point to.”

In this sense, philosophy and science work together to help us think about and experiment with what we do and don’t know.

Some might argue that while anyone can be a philosopher, being a scientist requires more rigorous training.  A commonly stated requirement in the era of big data is to hire data scientists, but this raises the question: Is data science only for data scientists?

“Clearly what we need,” Firestein explained, “is a crash course in citizen science—a way to humanize science so that it can be both appreciated and judged by an informed citizenry.  Aggregating facts is useless if you don’t have a context to interpret them.”

I would argue that clearly what organizations need is a crash course in data science—a way to humanize data science so that it can be both appreciated and judged by an informed business community.  Big data is useless if you don’t have a business context to interpret it.  Firestein also made great points about science not being exclusionary (i.e., not just for scientists).  Just as you can enjoy watching sports without being a professional athlete and you can appreciate music without being a professional musician, you can—and should—learn the basics of data science (especially statistics) without being a professional data scientist.

In order to truly deliver business value to organizations, data science cannot be exclusionary.  This doesn’t mean you shouldn’t hire data scientists.  In many cases, you will need the expertise of professional data scientists.  However, you will not be able to direct them or interpret their findings without understanding the basics, what could be called the philosophy of data science.

Some might argue that philosophy only reigns in the absence of data, while science reigns in the analysis of data.  Although in the era of big data there seem to be fewer areas truly absent of data, a conceptual bridge still remains between analysis and insight, the crossing of which is itself a philosophical exercise.  So, an endless oscillation persists between science and philosophy, which is why science without philosophy is blind, and philosophy without science is empty.  Data needs both science and philosophy.

Doing Data Governance

OCDQ Radio is an audio podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

During this episode, I discuss the practical aspects of doing data governance with John Ladley, the author of the excellent book Data Governance: How to Design, Deploy and Sustain an Effective Data Governance Program.  Our discussion includes understanding the difference and relationship between data governance and information management, the importance of establishing principles before creating policies, data stewardship, and three critical success factors for data governance.

John Ladley is a business technology thought leader with 30 years of experience in improving organizations through the successful implementation of information systems.  He is a recognized authority in the use and implementation of business intelligence and enterprise information management (EIM).

John Ladley is the author of Making EIM Work for Business, and frequently writes and speaks on a variety of technology and enterprise information management topics.  His information management experience is balanced between strategic technology planning, project management, and, most important, the practical application of technology to business problems.

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Demystifying Data Science — Guest Melinda Thielbar, a Ph.D. Statistician, discusses what a data scientist does and provides a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, and correlation.
  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Data Profiling Early and Often — Guest James Standen discusses data profiling concepts and practices, and how bad data is often misunderstood and can be coaxed away from the dark side if you know how to approach it.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

Data Governance needs Searchers, not Planners

In his book Everything Is Obvious: How Common Sense Fails Us, Duncan Watts explained that “plans fail, not because planners ignore common sense, but rather because they rely on their own common sense to reason about the behavior of people who are different from them.”

As development economist William Easterly explained, “A Planner thinks he already knows the answer; A Searcher admits he doesn’t know the answers in advance.  A Planner believes outsiders know enough to impose solutions; A Searcher believes only insiders have enough knowledge to find solutions, and that most solutions must be homegrown.”

I made a similar point in my post Data Governance and the Adjacent Possible.  Change management efforts are resisted when they impose new methods by emphasizing only the bad business and technical processes, as well as the bad data-related employee behaviors, while ignoring the unheralded processes and employees whose existing methods are quietly preventing other problems from happening.

Demonstrating that some data governance policies reflect existing best practices reduces resistance to change by showing that the search for improvement was not limited to only searching for what is currently going wrong.

This is why data governance needs Searchers, not Planners.  A Planner thinks a framework provides all the answers; A Searcher knows a data governance framework is like a jigsaw puzzle.  A Planner believes outsiders (authorized by executive management) know enough to impose data governance solutions; A Searcher believes only insiders (united by collaboration) have enough knowledge to find the ingredients for data governance solutions, and a true commitment to change always comes from within.

 

Related Posts

The Hawthorne Effect, Helter Skelter, and Data Governance

Cooks, Chefs, and Data Governance

Data Governance Frameworks are like Jigsaw Puzzles

Data Governance and the Buttered Cat Paradox

Data Governance Star Wars: Bureaucracy versus Agility

Beware the Data Governance Ides of March

Aristotle, Data Governance, and Lead Rulers

Data Governance and the Adjacent Possible

The Three Most Important Letters in Data Governance

The Data Governance Oratorio

An Unsettling Truth about Data Governance

The Godfather of Data Governance

Over the Data Governance Rainbow

Getting Your Data Governance Stuff Together

Datenvergnügen

Council Data Governance

A Tale of Two G’s

Declaration of Data Governance

The Role Of Data Quality Monitoring In Data Governance

The Collaborative Culture of Data Governance

Open MIKE Podcast — Episode 12

Method for an Integrated Knowledge Environment (MIKE2.0) is an open source delivery framework for Enterprise Information Management, which provides a comprehensive methodology that can be applied across a number of different projects within the Information Management space.  For more information, click on this link: openmethodology.org/wiki/What_is_MIKE2.0

The Open MIKE Podcast is a video podcast show, hosted by Jim Harris, which discusses aspects of the MIKE2.0 framework, and features content contributed to MIKE2.0 Wiki Articles, Blog Posts, and Discussion Forums.

 

Episode 12: Information Development Book

If you’re having trouble viewing this video, you can watch it on Vimeo by clicking on this link: Open MIKE Podcast on Vimeo

 

MIKE2.0 Content Featured in or Related to this Podcast

Information Development Book: openmethodology.org/wiki/Information_Development_Book

Information Development: openmethodology.org/wiki/Information_Development

 

Previous Episodes of the Open MIKE Podcast

Clicking on the link will take you to the episode’s blog post:

Episode 01: Information Management Principles

Episode 02: Information Governance and Distributing Power

Episode 03: Data Quality Improvement and Data Investigation

Episode 04: Metadata Management

Episode 05: Defining Big Data

Episode 06: Getting to Know NoSQL

Episode 07: Guiding Principles for Open Semantic Enterprise

Episode 08: Information Lifecycle Management

Episode 09: Enterprise Data Management Strategy

Episode 10: Information Maturity QuickScan

Episode 11: Information Maturity Model

You can also find the videos and blog post summaries for every episode of the Open MIKE Podcast at: ocdqblog.com/MIKE

Demystifying Data Science

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

During this episode, special guest, and actual data scientist, Dr. Melinda Thielbar, a Ph.D. Statistician, and I attempt to demystify data science by explaining what a data scientist does, including the requisite skills involved, bridging the communication gap between data scientists and business leaders, delivering data products business users can use on their own, and providing a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, experimentation, and correlation.

Melinda Thielbar is the Senior Mathematician for IAVO Research and Scientific.  Her work there focuses on power system optimization using real-time prediction models.  She has worked as a software developer, an analytic lead for big data implementations, and a statistics and programming teacher.

Melinda Thielbar is a co-founder of Research Triangle Analysts, a professional group for analysts and data scientists located in the Research Triangle of North Carolina.

While Melinda Thielbar doesn’t specialize in a single field, she is particularly interested in power systems because, as she puts it, “A power systems optimizer has to work every time.”

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

The Hawthorne Effect, Helter Skelter, and Data Governance

In his book The Half-life of Facts: Why Everything We Know Has an Expiration Date, Samuel Arbesman introduced me to the Hawthorne Effect, which is “when subjects behave differently if they know they are being studied.  The effect was named after what happened in a factory called Hawthorne Works outside Chicago in the 1920s and 1930s.”

“Scientists wished to measure,” Arbesman explained, “the effects of environmental changes on the productivity of workers.  They discovered whatever they did to change the workers’ behaviors — whether they increased the lighting or altered any other aspect of the environment — resulted in increased productivity.  However, as soon as the study was completed, productivity dropped.  The researchers concluded that the observations themselves were affecting productivity and not the experimental changes.”

I couldn’t help but wonder how the Hawthorne Effect could affect a data governance program.  When data governance policies are first defined, and their associated procedures and processes are initially implemented, then after a little while, and usually after a little resistance, productivity often increases and the organization begins to advance its data governance maturity level.

Perhaps during these early stages employees are well aware that they’re being observed to make sure they’re complying with the new data governance policies, and this observation itself accounts for advancing to the next maturity level.  This seems especially likely since, after progress stops being studied so closely, it’s not uncommon for an organization to backslide to an earlier maturity level.

You might be tempted to conclude that continuous monitoring, especially of the Orwellian Big Brother variety, might be able to prevent this from happening, but I doubt it.  Data governance maturity is often misperceived in the same way that expertise is misperceived — as a static state that once achieved signifies a comforting conclusion to all the grueling effort that was required, either to become an expert, or reach a particular data governance maturity level.

However, just like the five stages of data quality, oscillating between different levels of data governance maturity, and perhaps even occasionally coming full circle, may be an inevitable part of the ongoing evolution of a data governance program, which can often feel like a top-down/bottom-up amusement park ride of the Beatles “Helter Skelter” variety:

When you get to the bottom, you go back to the top, where you stop and you turn, and you go for a ride until you get to the bottom — and then you do it again.

Come On Tell Me Your Answers

Do you, don’t you . . . think the Hawthorne Effect affects data governance?

Do you, don’t you . . . think data governance is Helter Skelter?

Tell me, tell me, come on tell me your answers — by posting a comment below.

Big Data and the Infinite Inbox

Occasionally it’s necessary to temper the unchecked enthusiasm accompanying the peak of inflated expectations associated with any hype cycle.  This may be especially true for big data, and especially now since, as Svetlana Sicular of Gartner recently blogged, big data is falling into the trough of disillusionment and “to minimize the depth of the fall, companies must be at a high enough level of analytical and enterprise information management maturity combined with organizational support of innovation.”

I fear the fall may feel bottomless for those who fell hard for the hype and believe the Big Data Psychic capable of making better, if not clairvoyant, predictions.  In fact, “our predictions may be more prone to failure in the era of big data,” explained Nate Silver in his book The Signal and the Noise: Why So Many Predictions Fail but Some Don’t.  “There isn’t any more truth in the world than there was before the Internet.  Most of the data is just noise, as most of the universe is filled with empty space.”

Proposing the 3Ss (Small, Slow, Sure) as a counterpoint to the 3Vs (Volume, Velocity, Variety), Stephen Few recently blogged about the slow data movement.  “Data is growing in volume, as it always has, but only a small amount of it is useful.  Data is being generated and transmitted at an increasing velocity, but the race is not necessarily for the swift; slow and steady will win the information race.  Data is branching out in ever-greater variety, but only a few of these new choices are sure.”

Big data requires us to revisit information overload, a term that was originally about not the increasing amount of information, but the increasing access to information.  As Clay Shirky stated, “It’s not information overload, it’s filter failure.”

As Silver noted, the Internet (like the printing press before it) was a watershed moment in our increased access to information, but its data deluge didn’t increase the amount of truth in the world.  And in today’s world, where many of us strive on a daily basis to prevent email filter failure and achieve what Merlin Mann called Inbox Zero, I find unfiltered enthusiasm about big data to be rather ironic, since big data is essentially enabling the data-driven decision making equivalent of the Infinite Inbox.

Imagine logging into your email every morning and discovering: You currently have (∞) Unread Messages.

However, I’m sure most of it probably would be spam, which you obviously wouldn’t have any trouble quickly filtering (after all, infinity minus spam must be a back-of-the-napkin calculation), allowing you to read only the truly useful messages.  Right?

 

Related Posts

HoardaBytes and the Big Data Lebowski

OCDQ Radio - Data Quality and Big Data

Open MIKE Podcast — Episode 05: Defining Big Data

Will Big Data be Blinded by Data Science?

Data Silence

Magic Elephants, Data Psychics, and Invisible Gorillas

The Graystone Effects of Big Data

Information Overload Revisited

Exercise Better Data Management

A Tale of Two Datas

A Statistically Significant Resolution for 2013

It’s Not about being Data-Driven

Big Data, Sporks, and Decision Frames

Big Data: Structure and Quality

Darth Vader, Big Data, and Predictive Analytics

Big Data, Predictive Analytics, and the Ideal Chronicler

The Big Data Theory

Swimming in Big Data

What Magic Tricks teach us about Data Science

What Mozart for Babies teaches us about Data Science

Open MIKE Podcast — Episode 11

Method for an Integrated Knowledge Environment (MIKE2.0) is an open source delivery framework for Enterprise Information Management, which provides a comprehensive methodology that can be applied across a number of different projects within the Information Management space.  For more information, click on this link: openmethodology.org/wiki/What_is_MIKE2.0

The Open MIKE Podcast is a video podcast show, hosted by Jim Harris, which discusses aspects of the MIKE2.0 framework, and features content contributed to MIKE2.0 Wiki Articles, Blog Posts, and Discussion Forums.

 

Episode 11: Information Maturity Model

If you’re having trouble viewing this video, you can watch it on Vimeo by clicking on this link: Open MIKE Podcast on Vimeo

 

MIKE2.0 Content Featured in or Related to this Podcast

Information Maturity Model: openmethodology.org/wiki/Information_Maturity_Model

Reactive Data Governance: openmethodology.org/wiki/Reactive_Data_Governance_Organisation

Proactive Data Governance: openmethodology.org/wiki/Proactive_Data_Governance_Organisation

Managed Data Governance: openmethodology.org/wiki/Managed_Data_Governance_Organisation

Optimal Data Governance: openmethodology.org/wiki/Optimal_Data_Governance_Organisation

 

Previous Episodes of the Open MIKE Podcast

Clicking on the link will take you to the episode’s blog post:

Episode 01: Information Management Principles

Episode 02: Information Governance and Distributing Power

Episode 03: Data Quality Improvement and Data Investigation

Episode 04: Metadata Management

Episode 05: Defining Big Data

Episode 06: Getting to Know NoSQL

Episode 07: Guiding Principles for Open Semantic Enterprise

Episode 08: Information Lifecycle Management

Episode 09: Enterprise Data Management Strategy

Episode 10: Information Maturity QuickScan

You can also find the videos and blog post summaries for every episode of the Open MIKE Podcast at: ocdqblog.com/MIKE

MDM, Assets, Locations, and the TARDIS

Henrik Liliendahl Sørensen, as usual, is facilitating excellent discussion around master data management (MDM) concepts via his blog.  Two of his recent posts, Multi-Entity MDM vs. Multi-Domain MDM and The Real Estate Domain, have both received great commentary.  So, in case you missed them, be sure to read those posts, and join in their comment discussions/debates.

A few of the concepts discussed and debated reminded me of the OCDQ Radio episode Demystifying Master Data Management, during which guest John Owens explained the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and, perhaps the most important concept of all, the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).

Henrik’s second post touched on Location and Asset, which come up far less often in MDM discussions than Party and Product do, and arguably with understandably good reason.  This reminded me of the science fiction metaphor I used during my podcast with John, a metaphor I made in an attempt to help explain the difference and relationship between an Asset and a Location.

Location is often over-identified with postal address, which is actually just one means of referring to a location.  A location can also be referred to by its geographic coordinates, either absolute (e.g., latitude and longitude) or relative (e.g., 7 miles northeast of the intersection of Route 66 and Route 54).

Asset refers to a resource owned or controlled by an enterprise and capable of producing business value.  Assets are often over-identified with their location, especially real estate assets such as a manufacturing plant or an office building, since they are essentially immovable assets always at a particular location.

However, many assets are movable, such as the equipment used to manufacture products, or the technology used to support employee activities.  These assets are not always at a particular location (e.g., laptops and smartphones used by employees) and can also be dependent on other, non-co-located, sub-assets (e.g., replacement parts needed to repair broken equipment).

In Doctor Who, a brilliant British science fiction television program celebrating its 50th anniversary this year, the TARDIS, which stands for Time and Relative Dimension in Space, is the time machine and spaceship the Doctor and his companions travel in.

The TARDIS is arguably the Doctor’s most important asset, but its location changes frequently, both during and across episodes.

So, in MDM, we could say that Location is a time and relative dimension in space where we would currently find an Asset.
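
To make the distinction more concrete, below is a minimal sketch in Python of how an Asset master entity might carry a time-stamped reference to its current Location, where a Location can be referred to by a postal address, by absolute coordinates, or by a relative description.  The class names and fields are hypothetical illustrations for this post, not part of any MDM standard or product.

from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Union

@dataclass
class PostalAddress:
    # A postal address is just one means of referring to a Location.
    street: str
    city: str
    postal_code: str

@dataclass
class GeoCoordinate:
    # An absolute reference to a Location.
    latitude: float
    longitude: float

@dataclass
class RelativeLocation:
    # A relative reference, e.g., "7 miles northeast of Route 66 and Route 54".
    description: str

# A Location can be referred to by any of these means.
LocationReference = Union[PostalAddress, GeoCoordinate, RelativeLocation]

@dataclass
class LocationObservation:
    # Where an Asset was observed, and when: its time and relative dimension in space.
    location: LocationReference
    observed_at: datetime

@dataclass
class Asset:
    # A resource owned or controlled by the enterprise; movable assets change location.
    asset_id: str
    name: str
    location_history: List[LocationObservation] = field(default_factory=list)

    def current_location(self) -> Optional[LocationReference]:
        # The most recently observed Location, if any.
        return self.location_history[-1].location if self.location_history else None

# Example: the TARDIS is an Asset whose Location changes frequently.
tardis = Asset(asset_id="TARDIS-001", name="TARDIS")
tardis.location_history.append(
    LocationObservation(RelativeLocation("Totter's Lane, London"), datetime(1963, 11, 23)))
print(tardis.current_location())

Under this sketch, the Asset is mastered independently of any single Location, and its Location is simply the most recent entry in an observation history that changes over time.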

 

Related Posts

OCDQ Radio - Demystifying Master Data Management

OCDQ Radio - Master Data Management in Practice

OCDQ Radio - The Art of Data Matching

Plato’s Data

Once Upon a Time in the Data

The Data Cold War

DQ-BE: Single Version of the Time

The Data Outhouse

Fantasy League Data Quality

OCDQ Radio - The Blue Box of Information Quality

Choosing Your First Master Data Domain

Lycanthropy, Silver Bullets, and Master Data Management

Voyage of the Golden Records

The Quest for the Golden Copy

How Social can MDM get?

Will Social MDM be the New Spam?

More Thoughts about Social MDM

Is Social MDM going the Wrong Way?

The Semantic Future of MDM

Small Data and VRM

Popeye, Spinach, and Data Quality

As a kid, one of my favorite cartoons was Popeye the Sailor, who was empowered by eating spinach to take on many daunting challenges, such as battling his brawny nemesis Bluto for the affections of his love interest Olive Oyl, who was often kidnapped by Bluto.

I am reading the book The Half-life of Facts: Why Everything We Know Has an Expiration Date by Samuel Arbesman, who explained, while examining how a novel fact, even a wrong one, spreads and persists, that one of the strangest examples of the spread of an error is related to Popeye the Sailor.  “Popeye, with his odd accent and improbable forearms, used spinach to great effect, a sort of anti-Kryptonite.  It gave him his strength, and perhaps his distinctive speaking style.  But why did Popeye eat so much spinach?  What was the reason for his obsession with such a strange food?”

The truth begins over fifty years before the comic strip made its debut.  “Back in 1870,” Arbesman explained, “Erich von Wolf, a German chemist, examined the amount of iron within spinach, among many other green vegetables.  In recording his findings, von Wolf accidentally misplaced a decimal point when transcribing data from his notebook, changing the iron content in spinach by an order of magnitude.  While there are actually only 3.5 milligrams of iron in a 100-gram serving of spinach, the accepted fact became 35 milligrams.  Once this incorrect number was printed, spinach’s nutritional value became legendary.  So when Popeye was created, studio executives recommended he eat spinach for his strength, due to its vaunted health properties, and apparently Popeye helped increase American consumption of spinach by a third!”

“This error was eventually corrected in 1937,” Arbesman continued, “when someone rechecked the numbers.  But the damage had been done.  It spread and spread, and only recently has gone by the wayside, no doubt helped by Popeye’s relative obscurity today.  But the error was so widespread, that the British Medical Journal published an article discussing this spinach incident in 1981, trying its best to finally debunk the issue.”

“Ultimately, the reason these errors spread,” Arbesman concluded, “is because it’s a lot easier to spread the first thing you find, or the fact that sounds correct, than to delve deeply into the literature in search of the correct fact.”
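
As an aside on catching such an error before it spreads, a simple order-of-magnitude check against peer values is the kind of data quality control that would have flagged von Wolf’s misplaced decimal point.  Below is a minimal sketch in Python; apart from the 35 and 3.5 milligram figures quoted above, the iron content values for the other vegetables are purely illustrative placeholders.

from statistics import median

# Iron content in milligrams per 100-gram serving; the values for vegetables
# other than spinach are illustrative placeholders, not authoritative measurements.
iron_mg_per_100g = {
    "kale": 1.5,
    "broccoli": 0.7,
    "lettuce": 0.9,
    "cabbage": 0.5,
    "spinach": 35.0,  # von Wolf's transcription error (the actual value is about 3.5)
}

def flag_order_of_magnitude_outliers(values, tolerance=10.0):
    # Flag any value more than `tolerance` times, or less than 1/tolerance of, the median.
    m = median(values.values())
    return {name: v for name, v in values.items() if v > tolerance * m or v < m / tolerance}

print(flag_order_of_magnitude_outliers(iron_mg_per_100g))
# Prints {'spinach': 35.0}, a candidate for rechecking against the original notebook.

Such a check does not prove the flagged value wrong, of course; it only points to where rechecking the numbers, as someone finally did in 1937, is worth the effort.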

What “spinach” has your organization been falsely consuming because of a data quality issue that was not immediately obvious, and which may have led to a long, and perhaps ongoing, history of data-driven decisions based on poor quality data?

Popeye said “I yam what I yam!”  Your organization yams what your data yams, so you had better make damn sure it’s correct.

 

Related Posts

The Family Circus and Data Quality

Can Data Quality avoid the Dustbin of History?

Retroactive Data Quality

Spartan Data Quality

Pirates of the Computer: The Curse of the Poor Data Quality

The Tooth Fairy of Data Quality

The Dumb and Dumber Guide to Data Quality

Darth Data

Occurred, a data defect has . . .

The Data Quality Placebo

Data Quality is People!

DQ-View: The Five Stages of Data Quality

DQ-BE: Data Quality Airlines

Wednesday Word: Quality-ish

The Five Worst Elevator Pitches for Data Quality

Shining a Social Light on Data Quality

The Poor Data Quality Jar

Data Quality and #FollowFriday the 13th

Dilbert, Data Quality, Rabbits, and #FollowFriday

Data Love Song Mashup

Open Source Business Intelligence

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

During this episode, I discuss open source business intelligence (OSBI) with Lyndsay Wise, author of the insightful new book Using Open Source Platforms for Business Intelligence: Avoid Pitfalls and Maximize ROI.

Lyndsay Wise is the President and Founder of WiseAnalytics, an independent analyst firm and consultancy specializing in business intelligence for small and mid-sized organizations.  For more than ten years, Lyndsay Wise has assisted clients in business systems analysis, software selection, and implementation of enterprise applications.

Lyndsay Wise conducts regular research studies, consults, writes articles, and speaks about how to implement a successful business intelligence approach and improve the value of business intelligence within organizations.

Related OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.

Data Quality and Anton’s Syndrome

In his book Incognito: The Secret Lives of the Brain, David Eagleman discussed aspects of a bizarre, and rare, brain disorder called Anton’s Syndrome in which a stroke renders a person blind — but the person denies their blindness.

“Those with Anton’s Syndrome truly believe they are not blind,” Eagleman explained.  “It is only after bumping into enough furniture and walls that they begin to feel that something is amiss.  They are experiencing what they take to be vision, but it is all internally generated.  The external data is not getting to the right places because of the stroke, and so their reality is simply that which is generated by the brain, with little attachment to the real world.  In this sense, what they experience is no different from dreaming, drug trips, or hallucinations.”

Data quality practitioners often complain that business leaders are blind to the importance of data quality to business success, or that they deny data quality issues exist in their organization.  As much as we wish it wasn’t so, often it isn’t until business leaders bump into enough of the negative effects of poor data quality that they begin to feel that something is amiss.  However, as much as we would like to, we can’t really attribute their denial to drug-induced hallucinations.

Sometimes an illusion-of-quality effect is caused when data is excessively filtered and cleansed before it reaches business leaders.  This can happen as the result of a perception filter for data quality issues, created as a natural self-defense mechanism by the people responsible for the business processes and technology surrounding data, since no one wants to be blamed for causing, or failing to fix, data quality issues.  Unfortunately, this might really leave the organization’s data with little attachment to the real world.

In fairness, sometimes it’s also the blind leading the blind because data quality practitioners often suffer from business blindness by presenting data quality issues without providing business context, without relating data quality metrics in a tangible manner to how the business uses data to support a business process, accomplish a business objective, or make a business decision.
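
To make that tangible, below is a minimal sketch in Python of the difference between reporting a bare data quality metric and relating the same measurement to a business process.  The customer records, the validity rule, and the email campaign framing are all hypothetical.

import re

# Hypothetical customer records feeding an email marketing campaign.
customers = [
    {"id": 1, "email": "alice@example.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": "not-an-email"},
    {"id": 4, "email": "dave@example.com"},
]

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
valid = [c for c in customers if EMAIL_PATTERN.match(c["email"])]

# Without business context: a bare data quality metric.
print(f"Email validity: {len(valid) / len(customers):.0%}")

# With business context: what the same measurement means for a business process.
unreachable = len(customers) - len(valid)
print(f"{unreachable} of {len(customers)} customers cannot be reached by the upcoming "
      f"email campaign until their contact data is corrected.")

The first number invites a shrug; the second relates the same data quality measurement to a business process that business leaders can appreciate and judge.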

A lot of the disconnect between business leaders, who believe they are not blind to data quality, and data quality practitioners, who believe they are not blind to business context, comes from a crisis of perception.  Each side in this debate believes they have a complete vision, but it’s only after bumping into each other enough times that they begin to envision the organizational blindness caused when data quality is not properly measured within a business context and continually monitored.

 

Related Posts

Data Quality and Chicken Little Syndrome

Data Quality and Miracle Exceptions

Data Quality: Quo Vadimus?

Availability Bias and Data Quality Improvement

Finding Data Quality

“Some is not a number and soon is not a time”

The Data Quality of Dorian Gray

The Data Quality Wager

DQ-View: The Five Stages of Data Quality

Data Quality and the Bystander Effect

Data Quality and the Q Test

Why isn’t our data quality worse?

The Illusion-of-Quality Effect

Perception Filters and Data Quality

WYSIWYG and WYSIATI

Predictably Poor Data Quality

Data Psychedelicatessen

Data Geeks and Business Blindness

The Real Data Value is Business Insight

Is your data accurate, but useless to your business?

Data Quality Measurement Matters

Data Myopia and Business Relativity

Data and its Relationships with Quality

Plato’s Data

Open MIKE Podcast — Episode 10

Method for an Integrated Knowledge Environment (MIKE2.0) is an open source delivery framework for Enterprise Information Management, which provides a comprehensive methodology that can be applied across a number of different projects within the Information Management space.  For more information, click on this link: openmethodology.org/wiki/What_is_MIKE2.0

The Open MIKE Podcast is a video podcast show, hosted by Jim Harris, which discusses aspects of the MIKE2.0 framework, and features content contributed to MIKE2.0 Wiki Articles, Blog Posts, and Discussion Forums.

 

Episode 10: Information Maturity QuickScan

If you’re having trouble viewing this video, you can watch it on Vimeo by clicking on this link: Open MIKE Podcast on Vimeo

 

MIKE2.0 Content Featured in or Related to this Podcast

Information Maturity (IM) QuickScan: openmethodology.org/wiki/Information_Maturity_QuickScan

IM QuickScan Template Documents: openmethodology.org/wiki/QuickScan_MS_Office_survey

Information Maturity Model: openmethodology.org/wiki/Information_Maturity_Model

 

Previous Episodes of the Open MIKE Podcast

Clicking on the link will take you to the episode’s blog post:

Episode 01: Information Management Principles

Episode 02: Information Governance and Distributing Power

Episode 03: Data Quality Improvement and Data Investigation

Episode 04: Metadata Management

Episode 05: Defining Big Data

Episode 06: Getting to Know NoSQL

Episode 07: Guiding Principles for Open Semantic Enterprise

Episode 08: Information Lifecycle Management

Episode 09: Enterprise Data Management Strategy

You can also find the videos and blog post summaries for every episode of the Open MIKE Podcast at: ocdqblog.com/MIKE

An Enterprise Carol

This blog post is sponsored by the Enterprise CIO Forum and HP.

Since ‘tis the season for reflecting on the past year and predicting the year ahead, while pondering this post my mind wandered to the reflections and predictions provided by the ghosts of A Christmas Carol by Charles Dickens.  So, I decided to let the spirit of Jacob Marley revisit my previous Enterprise CIO Forum posts to bring you the Ghosts of Enterprise Past, Present, and Future.

 

The Ghost of Enterprise Past

Legacy applications have a way of haunting the enterprise long after they should have been sunset.  The reason that most of them do not go gentle into that good night, but instead rage against the dying of their light, is that some users continue using some of the functionality they provide, as well as the data trapped in those applications, to support the enterprise’s daily business activities.

This freaky feature fracture (i.e., technology supporting business needs being splintered across new and legacy applications) leaves many IT departments overburdened with maintaining a lot of technology and data that’s not being used all that much.

The Ghost of Enterprise Past warns us that IT can’t enable the enterprise’s future if it’s stuck still supporting its past.

 

The Ghost of Enterprise Present

While IT was busy battling the Ghost of Enterprise Past, a familiar, but fainter, specter suddenly became empowered by the diffusion of the consumerization of IT.  The rapid ascent of the cloud and mobility, spirited by service-oriented solutions that were more focused on the user experience, promised to quickly deliver only the functionality required right now to support the speed and agility requirements driving the enterprise’s business needs in the present moment.

Gifted by this New Prometheus, Shadow IT emerged from the shadows as the Ghost of Enterprise Present, with business-driven and decentralized IT solutions becoming more commonplace, as well as begrudgingly accepted by IT leaders.

All of which creates quite the IT Conundrum, forming yet another front in the war against Business-IT collaboration.  Although, in the short-term, the consumerization of IT usually better services the technology needs of the enterprise, in the long-term, if it’s not integrated into a cohesive strategy, it creates a complex web of IT that entangles the enterprise much more than it enables it.

And with the enterprise becoming much more of a conceptual, rather than a physical, entity due to the cloud and mobile devices enabling us to take the enterprise with us wherever we go, the evolution of enterprise security is now facing far more daunting challenges than the external security threats we focused on in the past.  This more open business environment is here to stay, and it requires a modern data security model, despite the fact that such a model could become the weakest link in enterprise security.

The Ghost of Enterprise Present asks many questions, but none more frightening than: Can the enterprise really be secured?

 

The Ghost of Enterprise Future

Of course, the T in IT wasn’t the only apparition previously invisible outside of the IT department to recently break through the veil in a big way.  The I in IT had its own coming-out party this year also since, as many predicted, 2012 was the year of Big Data.

Although neither the I nor the T is magic, instead of sugar plums, Data Psychics and Magic Elephants appear to be dancing in everyone’s heads this holiday season.  In other words, the predictive power of big data and the technological wizardry of Hadoop (as well as other NoSQL techniques) seem to be on the wish list of every enterprise for the foreseeable future.

However, despite its unquestionable potential, as its hype starts to settle down, the sobering realities of big data analytics will begin to sink in.  Data’s value comes from data’s usefulness.  If all we do is hoard data, then we’ll become so lost in the details that we’ll be unable to connect enough of the dots to discover meaningful patterns and convert big data into useful information that enables the enterprise to take action, make better decisions, or otherwise support its business activities.

Big data will force us to revisit information overload as we are occasionally confronted with the limitations of historical analysis, and blindsided by how our biases and preconceptions could silence the signal and amplify the noise, which will also force us to realize that data quality still matters in big data and that bigger data needs better data management.

As the Ghost of Enterprise Future, big data may haunt us with more questions than the many answers it will no doubt provide.

 

“Bah, Humbug!”

I realize that this post lacks the happy ending of A Christmas Carol.  To paraphrase Dickens, I endeavored in this ghostly little post to raise the ghosts of a few ideas, not to put my readers out of humor with themselves, with each other, or with the season, but simply to give them thoughts to consider about how to keep the Enterprise well in the new year.  Happy Holidays Everyone!

This blog post is sponsored by the Enterprise CIO Forum and HP.

 

Related Posts

Why does the sun never set on legacy applications?

Are Applications the La Brea Tar Pits for Data?

The Diffusion of the Consumerization of IT

The Cloud is shifting our Center of Gravity

More Tethered by the Untethered Enterprise?

A Swift Kick in the AAS

The UX Factor

Sometimes all you Need is a Hammer

Shadow IT and the New Prometheus

The IT Consumerization Conundrum

OCDQ Radio - The Evolution of Enterprise Security

The Cloud Security Paradox

The Good, the Bad, and the Secure

The Weakest Link in Enterprise Security

Can the Enterprise really be Secured?

Magic Elephants, Data Psychics, and Invisible Gorillas

Big Data el Memorioso

Information Overload Revisited

The Limitations of Historical Analysis

Data Silence