Baseball Data Analysis Challenge

Calling all data analysts, machine learning engineers, and data scientists!

I am working on building some demos and tutorials for machine learning. Of course, I will be sharing everything I do on GitHub. I thought it would be fun to share my input data with all of you before I start and make a little challenge out of this. While this is not as exciting or lucrative as a Kaggle competition, please feel free to have at it and use whatever techniques and tools you would like to discover insights and/or make predictions (even if you do not know anything about baseball).

The input data for this challenge represents six seasons (2016-2021) of Boston Red Sox Major League Baseball (MLB) regular-season game results, including a Game_Result column labeled either 0 or 1, where 0 = Loss and 1 = Win.

The input data for this challenge is available as a CSV file here: https://github.com/ocdqblog/Vertica/blob/main/csv/BRS_2016_2021_Batting_input.csv

The data profiling results for the input data are available as a CSV file here: https://github.com/ocdqblog/Vertica/blob/main/csv/BRS_2016_2021_Batting_profile.csv

The raw data used in this challenge was collected via a paid subscription to: https://stathead.com/baseball/
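
If you want to follow along in a database, here is a minimal sketch for loading the input data. The table layout and every column name other than Game_Result are my assumptions; check the profiling CSV above for the actual columns.

```sql
-- Table layout is an assumption; the profiling CSV lists the actual columns.
CREATE TABLE brs_batting_input (
    game_date    DATE,
    opponent     VARCHAR(5),
    runs_scored  INTEGER,
    hits         INTEGER,
    home_runs    INTEGER,
    walks        INTEGER,
    strikeouts   INTEGER,
    game_result  INTEGER   -- 0 = Loss, 1 = Win
);

-- Vertica-style bulk load; SKIP 1 drops the CSV header row.
COPY brs_batting_input
FROM LOCAL 'BRS_2016_2021_Batting_input.csv'
DELIMITER ',' SKIP 1;
```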

Update for 2022 MLB Opening Day

I completed my initial work in time for the opening day of the 2022 MLB season, the results of which you can find in this Microsoft Excel file: Baseball Data Analysis Challenge 2022-04-05.xlsx. My baseball data analysis was performed using my employer’s (Vertica) in-database machine learning capabilities, and you can find my SQL scripts on GitHub.

I used logistic regression classification models to calculate win probabilities for the Red Sox across nine (9) game metrics: opponent, opponent’s division, month of year, day of week, runs scored, hits, extra base hits, home runs, and walks versus strikeouts. I also used the input data to train a Naïve Bayes classification model to predict wins and losses with an associated probability based on the runs scored, hits, extra base hits, home runs, and walks versus strikeouts game metrics (all of which are binned ranges of input data values). Its initial accuracy is only 77%, but I plan on making some adjustments. I also plan on using the 2022 baseball season as my test data. So not only will I be watching how many games the Red Sox win or lose this season, but I will also be watching how many games my machine learning model predicts correctly.
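
For readers curious what in-database machine learning looks like, here is the general shape of Vertica's training and prediction functions. This is only a sketch, not my actual scripts (those are on GitHub), and the table and column names below (brs_batting_binned, runs_bin, and so on) are placeholders.

```sql
-- Train a logistic regression model on a single game metric:
SELECT LOGISTIC_REG('win_prob_runs', 'brs_batting_input',
                    'game_result', 'runs_scored');

-- Train a Naive Bayes classifier on the binned game metrics:
SELECT NAIVE_BAYES('win_loss_nb', 'brs_batting_binned',
                   'game_result',
                   'runs_bin, hits_bin, xbh_bin, hr_bin, bb_vs_so_bin');

-- Score each game, returning the predicted class (0 = Loss, 1 = Win):
SELECT game_date,
       game_result,
       PREDICT_NAIVE_BAYES(runs_bin, hits_bin, xbh_bin, hr_bin, bb_vs_so_bin
                           USING PARAMETERS model_name = 'win_loss_nb',
                                            type = 'response') AS predicted
FROM brs_batting_binned;
```

Comparing the predicted column against game_result is how I measured that initial 77% accuracy.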

Think you can best my model? Game on! The baseball data analysis challenge continues. Play ball!

Pitching Perfect Data Quality

In my previous post, I used a baseball metaphor to explain why we should strive for a quality start to our business activities by starting them off with good data quality, thereby giving our organization a better chance to succeed.

Since it’s a beautiful week for baseball metaphors, let’s post two!  (My apologies to Ernie Banks.)

If good data quality gives our organization a better chance to succeed, then it seems logical to assume that perfect data quality would give our organization the best chance to succeed.  However, as Yogi Berra said: “If the world were perfect, it wouldn’t be.”

My previous baseball metaphor was based on a statistic that measured how well a starting pitcher performs during a game.  The best possible performance of a starting pitcher is called a perfect game, when nine innings are perfectly completed by retiring the minimum of 27 opposing batters without allowing any hits, walks, hit batsmen, or batters reaching base due to a fielding error.

Although a lot of buzz is generated when a pitcher gets close to pitching a perfect game (e.g., after five perfect innings, it's usually all the game's announcers will talk about), there have been only 20 perfect games in the approximately 200,000 games played during the 143 years of Major League Baseball history.  That works out to roughly 1 in 10,000 games (0.01%), making the perfect game one of the rarest statistical events in baseball.

When a pitcher loses the chance of pitching a perfect game, does his team forfeit the game?  No, of course not.  Because the pitcher’s goal is not pitching perfectly.  The pitcher’s (and every other player’s) goal is helping the team win the game.

This is why I have never been a fan of anyone who is pitching perfect data quality, i.e., anyone advocating data perfection as the organization’s goal.  The organization’s goal is business success.  Data quality has a role to play, but claiming business success is impossible without having perfect data quality is like claiming winning in baseball is impossible without pitching a perfect game.

 


Quality Starts and Data Quality

This past week was the beginning of the 2012 Major League Baseball (MLB) season.  Baseball has long been a sport obsessed with statistics, since its data is mostly transaction data describing the statistical events of games played.  Baseball statisticians slice and dice every aspect of past games attempting to discover trends that could predict what is likely to happen in future games.

There are too many variables involved in determining which team will win a particular game to be able to choose a single variable that predicts game results.  But a few key statistics are cited by baseball analysts as general guidelines of a team’s potential to win.

One such statistic is a quality start, which is defined as a game in which a team’s starting pitcher completes at least six innings and permits no more than three earned runs.  Of course, a so-called quality start is no guarantee that the starting pitcher’s team will win the game.  But the relative reliability of the statistic to predict a game’s result causes some baseball analysts to refer to a loss suffered by a pitcher in a quality start as a tough loss and a win earned by a pitcher in a non-quality start as a cheap win.
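
Because the quality start is just a simple rule, it is easy to compute.  Here is a hedged SQL sketch that classifies starts, tough losses, and cheap wins; the table and column names (starting_pitcher_games, innings_pitched, earned_runs, decision) are my assumptions.

```sql
-- Classify each start by the quality-start rule, then flag the oddities.
SELECT game_id,
       CASE WHEN innings_pitched >= 6 AND earned_runs <= 3
            THEN 'quality start'
            ELSE 'non-quality start'
       END AS start_type,
       CASE WHEN innings_pitched >= 6 AND earned_runs <= 3 AND decision = 'L'
            THEN 'tough loss'
            WHEN (innings_pitched < 6 OR earned_runs > 3) AND decision = 'W'
            THEN 'cheap win'
       END AS oddity
FROM starting_pitcher_games;
```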

There are too many variables involved in determining if a particular business activity will succeed to be able to choose a single variable that predicts business results.  But data quality is one of the general guidelines of an organization’s potential to succeed.

As Henrik Liliendahl Sørensen blogged, organizations are capable of achieving success with their business activities despite bad data quality, which we could call the business equivalent of cheap wins.  And organizations are also capable of suffering failure with their business activities despite good data quality, which we could call the business equivalent of tough losses.

So just like a quality start is no guarantee of a win in baseball, good data quality is no guarantee of success in business.

But perhaps the relative reliability of data quality to predict business results should influence us to at least strive for a quality start to our business activities by starting them off with good data quality, thereby giving our organization a better chance to succeed.

 


Poor Quality Data Sucks

[Photo: Fenway Park 2008 home opener]

Over the last few months on his Information Management blog, Steve Miller has been writing posts inspired by a great 2008 book that we both highly recommend: The Drunkard's Walk: How Randomness Rules Our Lives by Leonard Mlodinow.

In his most recent post The Demise of the 2009 Boston Red Sox: Super-Crunching Takes a Drunkard's Walk, Miller takes on my beloved Boston Red Sox and the less than glorious conclusion to their 2009 season. 

For those readers who are not baseball fans, the Los Angeles Angels of Anaheim swept the Red Sox out of the playoffs.  I will let Miller's words describe their demise: “Down two to none in the best of five series, the Red Sox took a 6-4 lead into the ninth inning, turning control over to impenetrable closer Jonathan Papelbon, who hadn't allowed a run in 26 postseason innings.  The Angels, within one strike of defeat on three occasions, somehow managed a miracle rally, scoring 3 runs to take the lead 7-6, then holding off the Red Sox in the bottom of the ninth for the victory to complete the shocking sweep.”

 

Baseball and Data Quality

What, you may be asking, does baseball have to do with data quality?  Beyond simply being two of my all-time favorite topics, quite a lot actually.  Baseball data is mostly transaction data describing the statistical events of games played.

Statistical analysis has been a beloved pastime even longer than baseball has been America's Pastime.  Number-crunching is far more than just a quantitative exercise in counting.  The qualitative component of statistics – discerning what the numbers mean, analyzing them to discover predictive patterns and trends – is the very basis of data-driven decision making.

“The Red Sox,” as Miller explained, “are certainly exemplars of the data and analytic team-building methodology” chronicled in Moneyball: The Art of Winning an Unfair Game, the 2003 book by Michael Lewis.  Red Sox General Manager Theo Epstein has always been an advocate of so-called evidence-based baseball, or baseball analytics, pioneered by Bill James, the baseball writer, historian, statistician, current Red Sox consultant, and founder of Sabermetrics.

In another book that Miller and I both highly recommend, Super Crunchers, author Ian Ayres explained that “Bill James challenged the notion that baseball experts could judge talent simply by watching a player.  James's simple but powerful thesis was that data-based analysis in baseball was superior to observational expertise.  James's number-crunching approach was particular anathema to scouts.” 

“James was baseball's herald,” continues Ayres, “of data-driven decision making.”

 

The Drunkard's Walk

As Mlodinow explains in the prologue: “The title The Drunkard's Walk comes from a mathematical term describing random motion, such as the paths molecules follow as they fly through space, incessantly bumping, and being bumped by, their sister molecules.  The surprise is that the tools used to understand the drunkard's walk can also be employed to help understand the events of everyday life.”

Later in the book, Mlodinow describes the hidden effects of randomness by discussing how to build a mathematical model for the probability that a baseball player will hit a home run: “The result of any particular at bat depends on the player's ability, of course.  But it also depends on the interplay of many other factors: his health, the wind, the sun or the stadium lights, the quality of the pitches he receives, the game situation, whether he correctly guesses how the pitcher will throw, whether his hand-eye coordination works just perfectly as he takes his swing, whether that brunette he met at the bar kept him up too late, or the chili-cheese dog with garlic fries he had for breakfast soured his stomach.”

“If not for all the unpredictable factors,” continues Mlodinow, “a player would either hit a home run on every at bat or fail to do so.  Instead, for each at bat all you can say is that he has a certain probability of hitting a home run and a certain probability of failing to hit one.  Over the hundreds of at bats he has each year, those random factors usually average out and result in some typical home run production that increases as the player becomes more skillful and then eventually decreases owing to the same process that etches wrinkles in his handsome face.  But sometimes the random factors don't average out.  How often does that happen, and how large is the aberration?”

 

Conclusion

I have heard some (not Mlodinow or anyone else mentioned in this post) argue that data quality is an irrelevant issue.  The basis of their argument is that poor quality data are simply random factors that, in any data set of statistically significant size, will usually average out and therefore have a negligible effect on any data-based decisions. 

However, the random factors don't always average out.  It is important not only to measure exactly how often poor quality data occur, but also to acknowledge how large an aberration poor quality data can be, especially in data-driven decision making.

As every citizen of Red Sox Nation is taught from birth, the only acceptable opinion of our American League East Division rivals, the New York Yankees, is encapsulated in the chant heard throughout the baseball season (and not just at Fenway Park):

“Yankees Suck!”

From its inception, every organization bases its day-to-day business decisions on its data.  This decision-critical information drives the operational, tactical, and strategic initiatives essential to the enterprise's mission to survive and thrive in today's highly competitive and rapidly evolving marketplace.

It doesn't quite roll off the tongue as easily, but a chant heard throughout these enterprise information initiatives is:

“Poor Quality Data Sucks!”

Books Recommended by Red Sox Nation

Mind Game: How the Boston Red Sox Got Smart, Won a World Series, and Created a New Blueprint for Winning

Feeding the Monster: How Money, Smarts, and Nerve Took a Team to the Top

Theology: How a Boy Wonder Led the Red Sox to the Promised Land

Now I Can Die in Peace: How The Sports Guy Found Salvation Thanks to the World Champion (Twice!) Red Sox

Fantasy League Data Quality

For over 25 years, I have been playing fantasy league baseball and football.  For those readers who are not familiar with fantasy sports, they simulate ownership of a professional sports team.  Participants “draft” individual real-world professional athletes to “play” for their fantasy team, which competes with other teams using a scoring system based on real-world game statistics.

What does any of this have to do with data quality?

 

Master Data Management

In Worthy Data Quality Whitepapers (Part 1), Peter Benson of the ECCMA explained that “data is intrinsically simple and can be divided into data that identifies and describes things, master data, and data that describes events, transaction data.”

In fantasy sports, this distinction is very easy to make:

  • Master Data – data describing the real-world players on the roster of each fantasy team.

  • Transaction Data – data describing the statistical events of the real-world games played.

In his magnificent book Master Data Management, David Loshin explained that “master data objects are those core business objects used in the different applications across the organization, along with their associated metadata, attributes, definitions, roles, connections and taxonomies.”

In fantasy sports, Players and Teams are the master data objects with many characteristics, including the following (a simple schema sketch follows the list):

  • Attributes – Player attributes include first name, last name, birth date, professional experience in years, and their uniform number.  Team attributes include name, owner, home city, and the name and seating capacity of their stadium.

  • Definitions – Player and Team have both Professional and Fantasy definitions.  Professional teams and players are real-world objects managed independently of fantasy sports.  Fundamentally, Professional Team and Professional Player are reference data objects from external content providers (Major League Baseball and the National Football League).  Therefore, Fantasy Team and Fantasy Player are the true master data objects.  The distinction between professional and fantasy teams is simpler than between professional and fantasy players.  Not every professional player will be used in fantasy sports (e.g. offensive linemen in football) and the same professional player can simultaneously play for multiple fantasy teams in different fantasy leagues (or sometimes even within the same league – e.g. fantasy tournament formats).

  • Roles – In baseball, the player roles are Batter, Pitcher, and Fielder.  In football, the player roles are Offense, Defense and Special Teams.  In both sports, the same player can have multiple or changing roles (e.g. in National League baseball, a pitcher is also a batter as well as a fielder).

  • Connections – Fantasy Players are connected to Fantasy Teams via a roster.  On the fantasy team roster, fantasy players are connected to real-world statistical events via a lineup, which indicates the players active for a given scoring period (typically a week in fantasy football and either a week or a day in fantasy baseball).  These connections change throughout the season.  Lineups change as players can go from active to inactive (i.e. on the bench) and rosters change as players can be traded, released, and signed (i.e. free agents added to the roster after the draft).

  • Taxonomies – Positions played are defined individually and organized into taxonomies.  In baseball, first base and third base are individual positions, but both are infield positions and, more specifically, corner infield positions.  Second base and shortstop are also infield positions and, more specifically, middle infield positions.  And not all baseball positions are associated with fielding (e.g. a pinch runner can accrue statistics such as stolen bases and runs scored without either fielding or batting).
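
To make the master data objects and one of their connections concrete, here is that simple schema sketch.  All names and types are illustrative assumptions, not a production model.

```sql
-- Master data objects (all names and types are illustrative):
CREATE TABLE fantasy_team (
    fantasy_team_id   INTEGER PRIMARY KEY,
    fantasy_league_id INTEGER NOT NULL,
    team_name         VARCHAR(50),
    owner_name        VARCHAR(50)
);

CREATE TABLE fantasy_player (
    fantasy_player_id INTEGER PRIMARY KEY,
    pro_player_id     INTEGER NOT NULL,  -- reference data from the MLB/NFL feed
    first_name        VARCHAR(30),
    last_name         VARCHAR(30)
);

-- Connection: the roster links fantasy players to fantasy teams over time,
-- and the status column captures the lineup (active versus bench).
CREATE TABLE roster (
    fantasy_team_id   INTEGER REFERENCES fantasy_team (fantasy_team_id),
    fantasy_player_id INTEGER REFERENCES fantasy_player (fantasy_player_id),
    effective_date    DATE,
    status            CHAR(1)  -- 'A' = active (in lineup), 'B' = bench
);
```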

 

Data Warehousing

Combining a personal hobby with professional development, I built a fantasy baseball data warehouse.  I downloaded master, reference, and transaction data from my fantasy league's website.  I prepared these sources in a flat file staging area, from which I applied inserts and updates to the relational database tables in my data warehouse, where I used dimensional modeling.

My dimension tables were Date, Professional Team, Player, Position, Fantasy League, and Fantasy Team.  All of these tables (except for Date) were Type 2 slowly changing dimensions to support full history and rollbacks.

For simplicity, the Date dimension was calendar days with supporting attributes for all aggregate levels (e.g. monthly aggregate fact tables used the last day of the month as opposed to a separate Month dimension).

Professional and fantasy team rosters, as well as fantasy team lineups and fantasy league team membership, were all tracked using factless fact tables.  For example, the Professional Team Roster factless fact table used the Date, Professional Team, and Player dimensions, and the Fantasy Team Lineup factless fact table used the Date, Fantasy League, Fantasy Team, Player, and Position dimensions. 
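
As a concrete illustration, here is a minimal sketch of the Professional Team Roster factless fact table just described; the surrogate key and table names are my assumptions.

```sql
-- A factless fact table has no measures: the existence of a row is
-- itself the fact (this player was on this team's roster on this date).
CREATE TABLE fact_pro_team_roster (
    date_key     INTEGER NOT NULL REFERENCES dim_date (date_key),
    pro_team_key INTEGER NOT NULL REFERENCES dim_pro_team (pro_team_key),
    player_key   INTEGER NOT NULL REFERENCES dim_player (player_key),
    PRIMARY KEY (date_key, pro_team_key, player_key)
);
```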

The factless fact tables also allowed Player to be used as a conformed dimension for both professional and fantasy players.  A separate Fantasy Player dimension would have redundantly stored multiple instances of the same professional player, one for each fantasy team he played for, and would have required using Fantasy League and Fantasy Team as snowflaked dimensions.

My base fact tables were daily transactions for Batting Statistics and Pitching Statistics.  These base fact tables used only the Date, Professional Team, Player, and Position dimensions to provide the lowest level of granularity for daily real-world statistical performances independent of fantasy baseball. 

The Fantasy League and Fantasy Team dimensions replaced the Professional Team dimension in a separate family of base fact tables for daily fantasy transactions for Batting Statistics and Pitching Statistics.  This was necessary to accommodate the same professional player simultaneously playing for multiple fantasy teams in different fantasy leagues.  Alternatively, I could have stored each fantasy league in a separate data mart.

Aggregate fact tables accumulated month-to-date and year-to-date batting and pitching statistical totals for fantasy players and teams.  Additional aggregate fact tables incremented current rolling snapshots of batting and pitching statistical totals for the previous 7, 14 and 21 days for players only.  Since the aggregate fact tables were created to optimize fantasy league query performance, only the base tables with daily fantasy transactions were aggregated.
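
For example, a month-to-date aggregate can be rebuilt from the daily fantasy fact table with a simple rollup; the table and column names below are assumptions consistent with the sketches above.

```sql
-- Roll the daily fantasy batting facts up to month-to-date totals.
INSERT INTO agg_fantasy_batting_mtd
SELECT d.month_end_date_key,  -- last day of month stands in for a Month dimension
       f.fantasy_league_key,
       f.fantasy_team_key,
       f.player_key,
       SUM(f.at_bats)   AS at_bats,
       SUM(f.hits)      AS hits,
       SUM(f.home_runs) AS home_runs
FROM fact_fantasy_batting_daily f
JOIN dim_date d ON d.date_key = f.date_key
WHERE d.year_number = 2009 AND d.month_number = 6   -- placeholder period
GROUP BY d.month_end_date_key, f.fantasy_league_key,
         f.fantasy_team_key, f.player_key;
```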

Conformed facts were used in both the base and aggregate fact tables.  In baseball, this is relatively easy to achieve since most statistics have been consistently defined and used for decades (and some for more than a century). 

For example, batting average is defined as the ratio of hits to at bats and has been used consistently since the late 19th century.  However, there are still statistics with multiple meanings.  For example, walks and strikeouts are recorded for both batters and pitchers, with very different connotations for each.

Additionally, in the late 20th century, new baseball statistics such as secondary average and runs created have been defined with widely varying formulas.  Metadata tables with definitions (including formulas where applicable) were included in the baseball data warehouse to avoid confusion.
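
For example, here is batting average computed as a conformed fact, plus a metadata table for the statistics whose formulas vary; everything here is an illustrative sketch.

```sql
-- Batting average = hits / at bats, the same ratio wherever it appears.
-- NULLIF avoids division by zero for players without an at bat.
SELECT player_key,
       ROUND(SUM(hits)::NUMERIC / NULLIF(SUM(at_bats), 0), 3) AS batting_avg
FROM fact_batting_daily
GROUP BY player_key;

-- Metadata keeps statistics with varying formulas unambiguous.
CREATE TABLE metadata_statistic (
    statistic_code VARCHAR(10) PRIMARY KEY,  -- e.g. 'AVG', 'RC' (runs created)
    statistic_name VARCHAR(50),
    definition     VARCHAR(500),
    formula        VARCHAR(200)              -- e.g. 'H / AB'
);
```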

For remarkable reference material containing clear-cut guidelines and real-world case studies for both dimensional modeling and data warehousing, I highly recommend all three books in the collection: Ralph Kimball's Data Warehouse Toolkit Classics.

 

Business Intelligence

In his Information Management special report BI: Only as Good as its Data Quality, William Giovinazzo explained that “the chief promise of business intelligence is the delivery to decision-makers the information necessary to make informed choices.”

As a reminder for the uninitiated, fantasy sports simulate the ownership of a professional sports team.  Business intelligence techniques are used for pre-draft preparation and for tracking your fantasy team's statistical performance during the season in order to make management decisions regarding your roster and lineup.

The aggregate fact tables that I created in my baseball data warehouse delivered the same information available as standard reports from my fantasy league's website.  This allowed me to use the website as an external data source to validate my results, which is commonly referred to as using a “surrogate source of the truth.”  However, since I also used the website as the original source of my master, reference, and transaction data, I double-checked my results using other websites. 
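
That validation boils down to a reconciliation query: compare the warehouse aggregates to totals staged from the website, then investigate any rows that differ.  A hedged sketch, with all table and column names assumed:

```sql
-- Any rows returned are discrepancies between my warehouse and the site.
SELECT w.player_key,
       w.home_runs AS warehouse_hr,
       s.home_runs AS site_hr
FROM agg_player_batting_ytd w
JOIN stage_site_batting_ytd s ON s.player_key = w.player_key
WHERE w.home_runs <> s.home_runs;
```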

This is a significant advantage for fantasy sports – there are numerous external data sources freely available online that can be used for validation.  Of course, this wasn't always the case. 

Over 25 years ago when I first started playing fantasy sports, my friends and I had to manually tabulate statistics from newspapers.  We migrated to customized computer spreadsheet programs (this was in the days before everyone had PCs with Microsoft Excel – which we eventually used) before the Internet revolution and cloud computing brought the wonderful world of fantasy sports websites that we enjoy today.

Now with just a few mouse clicks, I can run regression analysis to determine whether my next draft pick should be a first baseman predicted to hit 30 home runs or a second baseman predicted to have a .300 batting average and score 100 runs. 

I can check my roster for weaknesses in statistics difficult to predict, such as stolen bases and saves.  I can track the performances of players I didn't draft to decide if I want to make a trade, as well as accurately evaluate a potential trade from another owner who claims to be offering players who are having a great year and could help my team be competitive.

 

Data Quality

In her fantastic book Executing Data Quality Projects: Ten Steps to Quality Data and Trusted Information, Danette McGilvray comprehensively defines the data quality dimensions, including the following, which are the most applicable to fantasy sports:

  • Accuracy – A measure of the correctness of the content of the data, which requires an authoritative source of reference to be identified and accessible.

  • Timeliness and Availability – A measure of the degree to which data are current and available for use as specified and in the time frame in which they are expected.

  • Data Coverage – A measure of the availability and comprehensiveness of data compared to the total data universe or population of interest.

  • Presentation Quality – A measure of how information is presented to and collected from those who utilize it.  Format and appearance support appropriate use of the information.

  • Perception, Relevance, and Trust – A measure of the perception of and confidence in the data quality; the importance, value, and relevance of the data to business needs.

 

Conclusion

I highly doubt that you will see Fantasy League Data Quality coming soon to a fantasy sports website near you.  It is just as unlikely that my future blog posts will conclude with “The Mountain Dew Post Game Show” or that I will rename my blog to “OCDQ – The Worldwide Leader in Data Quality” (duh-nuh-nuh, duh-nuh-nuh).

However, fantasy sports are more than just a hobby.  They're a thriving real-world business providing many excellent examples of best practices in action for master data management, data warehousing, and business intelligence – all implemented upon a solid data quality foundation.

So who knows, maybe some Monday night this winter we'll hear Hank Williams Jr. sing:

“Are you ready for some data quality?”