Baseball Data Analysis Challenge

Calling all data analysts, machine learning engineers, and data scientists!

I am working on building some demos and tutorials for machine learning. Of course, I will be sharing everything I do on GitHub. I thought it would be fun to share my input data with all of you before I start and make a little challenge out of it. While it's not as exciting or lucrative as a Kaggle competition, feel free to have at it, using whatever techniques and tools you would like to discover insights and/or make predictions (even if you do not know anything about baseball).

The input data for this challenge represents six years (2016-2021) of Boston Red Sox Major League Baseball (MLB) regular-season game results, including a Game_Result column labeled either 0 or 1, where 0 = Loss and 1 = Win.

The input data for this challenge is available as a CSV file here: https://github.com/ocdqblog/Vertica/blob/main/csv/BRS_2016_2021_Batting_input.csv

The data profiling results for the input data are available as a CSV file here: https://github.com/ocdqblog/Vertica/blob/main/csv/BRS_2016_2021_Batting_profile.csv

The raw data used in this challenge was collected via a paid subscription to: https://stathead.com/baseball/
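If you want a quick starting point before downloading the file, here is a minimal sketch (using only Python's standard library) of summarizing the Game_Result column into an overall win percentage. The sample rows below are made up to match the shape of the data described above; they are not actual Red Sox results.

```python
import csv
import io

def win_percentage(results):
    """Compute the win percentage from a list of Game_Result labels (0 = Loss, 1 = Win)."""
    if not results:
        return 0.0
    return sum(results) / len(results)

# Illustrative sample in the same shape as the challenge data
# (these rows are invented, not real values from the dataset).
sample_csv = """Game_Date,Opponent,Game_Result
2016-04-04,CLE,0
2016-04-05,CLE,1
2016-04-06,CLE,1
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
results = [int(row["Game_Result"]) for row in rows]
print(f"Win percentage: {win_percentage(results):.3f}")
```

Swapping the inline string for the downloaded CSV file is all it takes to run the same summary on the real input data.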

Update for 2022 MLB Opening Day

I completed my initial work in time for the opening day of the 2022 MLB season, the results of which you can find in this Microsoft Excel file: Baseball Data Analysis Challenge 2022-04-05.xlsx. My baseball data analysis was performed using my employer’s (Vertica) in-database machine learning capabilities, and you can find my SQL scripts on GitHub.

I used logistic regression classification models to calculate win probabilities for the Red Sox across nine game metrics: opponent, opponent’s division, month of year, day of week, runs scored, hits, extra base hits, home runs, and walks versus strikeouts. I also used the input data to train a Naïve Bayes classification model to predict wins and losses with an associated probability based on the runs scored, hits, extra base hits, home runs, and walks versus strikeouts game metrics (all of which are binned ranges of input data values). Its initial accuracy is only 77%, but I plan on making some adjustments. I also plan on using the 2022 baseball season as my test data. So not only will I be watching how many games the Red Sox win or lose this season, but I will also be watching how many games my machine learning model predicts correctly.
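My models were built with Vertica's in-database machine learning (see the SQL scripts on GitHub). As a language-agnostic illustration of the Naïve Bayes approach described above, here is a minimal from-scratch sketch in Python of a categorical Naïve Bayes classifier over binned game metrics, with Laplace smoothing. The feature bins and training rows are hypothetical, not taken from the actual dataset.

```python
from collections import defaultdict
import math

class CategoricalNaiveBayes:
    """Naive Bayes over categorical (binned) features, with Laplace smoothing."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.priors = {c: y.count(c) / len(y) for c in self.classes}
        self.class_totals = {c: y.count(c) for c in self.classes}
        # counts[class][feature_index][value] = number of occurrences
        self.counts = {c: defaultdict(lambda: defaultdict(int)) for c in self.classes}
        self.values = defaultdict(set)  # distinct values seen per feature
        for row, label in zip(X, y):
            for i, v in enumerate(row):
                self.counts[label][i][v] += 1
                self.values[i].add(v)
        return self

    def predict_proba(self, row):
        """Return a dict of class -> probability for one feature row."""
        log_probs = {}
        for c in self.classes:
            lp = math.log(self.priors[c])
            for i, v in enumerate(row):
                num = self.counts[c][i][v] + 1  # Laplace smoothing
                den = self.class_totals[c] + len(self.values[i])
                lp += math.log(num / den)
            log_probs[c] = lp
        # Normalize the log-probabilities into probabilities
        m = max(log_probs.values())
        exp = {c: math.exp(lp - m) for c, lp in log_probs.items()}
        total = sum(exp.values())
        return {c: e / total for c, e in exp.items()}

# Hypothetical binned game metrics: (runs-scored bin, hits bin, home-runs bin)
X = [("0-2", "0-5", "0"), ("3-5", "6-9", "1+"), ("6+", "10+", "1+"),
     ("0-2", "0-5", "0"), ("3-5", "6-9", "0"), ("6+", "10+", "1+")]
y = [0, 1, 1, 0, 1, 1]  # 0 = Loss, 1 = Win

model = CategoricalNaiveBayes().fit(X, y)
print(model.predict_proba(("6+", "10+", "1+")))
```

On this toy training set, a high-run, high-hit game row unsurprisingly yields a win probability well above the loss probability; the real models were trained on the full six seasons of binned input data.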

Think you can best my model? Game on! The baseball data analysis challenge continues. Play ball!

OCDQ Radio on Big Data and Data Science

OCDQ Radio is an audio podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

This podcast is no longer an active project, meaning I rarely publish a new episode, and its episodes are now available only on this website, no longer distributed on platforms such as Apple Podcasts and Google Podcasts.

I have been enjoying listening to many of the old episodes, and I was happy to hear how evergreen they are, meaning their content is still applicable today. This post is part of my Best of OCDQ Radio series, which organizes groups of episodes by topic.

Podcast Episodes on Big Data and Data Science

i blog of Data glad and big

I recently blogged about the need to balance the hype of big data with some anti-hype.  My hope was, like a collision of matter and anti-matter, the hype and anti-hype would cancel each other out, transitioning our energy into a more productive discussion about big data.  But, of course, few things in human discourse ever reach such an equilibrium, or can maintain it for very long.

For example, Quentin Hardy recently blogged about six big data myths based on a conference presentation by Kate Crawford, who herself also recently blogged about the hidden biases in big data.  “I call B.S. on all of it,” Derrick Harris blogged in his response to the backlash against big data.  “It might be provocative to call into question one of the hottest tech movements in generations, but it’s not really fair.  That’s because how companies and people benefit from big data, data science or whatever else they choose to call the movement toward a data-centric world is directly related to what they expect going in.  Arguing that big data isn’t all it’s cracked up to be is a strawman, pure and simple — because no one should think it’s magic to begin with.”

In their new book Big Data: A Revolution That Will Transform How We Live, Work, and Think, Viktor Mayer-Schonberger and Kenneth Cukier explained that “like so many new technologies, big data will surely become a victim of Silicon Valley’s notorious hype cycle: after being feted on the cover of magazines and at industry conferences, the trend will be dismissed and many of the data-smitten startups will flounder.  But both the infatuation and the damnation profoundly misunderstand the importance of what is taking place.  Just as the telescope enabled us to comprehend the universe and the microscope allowed us to understand germs, the new techniques for collecting and analyzing huge bodies of data will help us make sense of our world in ways we are just starting to appreciate.  The real revolution is not in the machines that calculate data, but in data itself and how we use it.”

Although there have been numerous critical technology factors making the era of big data possible, such as increases in the amount of computing power, decreases in the cost of data storage, increased network bandwidth, parallel processing frameworks (e.g., Hadoop), scalable and distributed models (e.g., cloud computing), and other techniques (e.g., in-memory computing), Mayer-Schonberger and Cukier argued that “something more important changed too, something subtle.  There was a shift in mindset about how data could be used.  Data was no longer regarded as static and stale, whose usefulness was finished once the purpose for which it was collected was achieved.  Rather, data became a raw material of business, a vital economic input, used to create a new form of economic value.”

“In fact, with the right mindset, data can be cleverly used to become a fountain of innovation and new services.  The data can reveal secrets to those with the humility, the willingness, and the tools to listen.”

Pondering this big data war of words reminded me of the E. E. Cummings poem i sing of Olaf glad and big, which sings of Olaf, a conscientious objector forced into military service, who passively endures brutal torture inflicted upon him by training officers, while calmly responding (pardon the profanity): “I will not kiss your fucking flag” and “there is some shit I will not eat.”

Without question, big data has both positive and negative aspects, but the seeming unwillingness of either side in the big data war of words to “kiss each other’s flag,” so to speak, is not as concerning to me as the conscientious objection to big data and data science expanding into realms where people and businesses were not used to enduring its influence.  For example, some will feel that data-driven audits of their decision-making are like brutal torture inflicted upon their less-than-data-driven intuition.

E. E. Cummings sang the praises of Olaf “because unless statistics lie, he was more brave than me.”  i blog of Data glad and big, but I fear that, regardless of how big it is, “there is some data I will not believe” will be a common refrain by people who will lack the humility and willingness to listen to data, and who will not be brave enough to admit that statistics don’t always lie.


Related Posts

The Need for Data Philosophers

On Philosophy, Science, and Data

OCDQ Radio - Demystifying Data Science

OCDQ Radio - Data Quality and Big Data

Big Data and the Infinite Inbox

The Laugh-In Effect of Big Data

HoardaBytes and the Big Data Lebowski

Magic Elephants, Data Psychics, and Invisible Gorillas

Will Big Data be Blinded by Data Science?

The Graystone Effects of Big Data

Information Overload Revisited

Exercise Better Data Management

A Tale of Two Datas

Our Increasingly Data-Constructed World

The Wisdom of Crowds, Friends, and Experts

Data Separates Science from Superstition

Headaches, Data Analysis, and Negativity Bias

Why Data Science Storytelling Needs a Good Editor

Predictive Analytics, the Data Effect, and Jed Clampett

Rage against the Machines Learning

The Flying Monkeys of Big Data

Cargo Cult Data Science

Speed Up Your Data to Slow Down Your Decisions

It’s Not about being Data-Driven

Big Data, Sporks, and Decision Frames

Big Data: Structure and Quality

Darth Vader, Big Data, and Predictive Analytics

Big Data, Predictive Analytics, and the Ideal Chronicler

What Magic Tricks teach us about Data Science

What Mozart for Babies teaches us about Data Science

The Need for Data Philosophers

In my post On Philosophy, Science, and Data, I explained that although some argue philosophy only reigns in the absence of data while science reigns in the analysis of data, a conceptual bridge still remains between analysis and insight, the crossing of which is itself a philosophical exercise.  Therefore, I argued that an endless oscillation persists between science and philosophy, which is why, despite the fact that all we hear about is the need for data scientists, there’s also a need for data philosophers.

Of course, the debate between science and philosophy is a very old one, as is the argument we need both.  In my previous post, I slightly paraphrased Immanuel Kant (“perception without conception is blind and conception without perception is empty”) by saying that science without philosophy is blind and philosophy without science is empty.

In his book Cosmic Apprentice: Dispatches from the Edges of Science, Dorion Sagan explained that science and philosophy hang “in a kind of odd balance, watching each other, holding hands.  Science’s eye for detail, buttressed by philosophy’s broad view, makes for a kind of alembic, an antidote to both.  Although philosophy isn’t fiction, it can be more personal, creative and open, a kind of counterbalance for science even as it argues that science, with its emphasis on a kind of impersonal materialism, provides a crucial reality check for philosophy and a tendency to over-theorize that’s inimical to the scientific spirit.  Ideally, in the search for truth, science and philosophy, the impersonal and autobiographical, can keep each other honest in a kind of open circuit.”

“Science’s spirit is philosophical,” Sagan concluded.  “It is the spirit of questioning, of curiosity, of critical inquiry combined with fact-checking.  It is the spirit of being able to admit you’re wrong, of appealing to data, not authority.”

“Science,” as his father Carl Sagan said, “is a way of thinking much more than it is a body of knowledge.”  By extension, we could say that data science is about a way of thinking much more than it is about big data or about being data-driven.

I have previously blogged that science has always been about bigger questions, not bigger data.  As Claude Lévi-Strauss said, “the scientist is not a person who gives the right answers, but one who asks the right questions.”  As far as data science goes, what are the right questions?  Data scientist Melinda Thielbar proposes three key questions (Actionable? Verifiable? Repeatable?).

Here again we see the interdependence of science and philosophy.  “Philosophy,” Marilyn McCord Adams said, “is thinking really hard about the most important questions and trying to bring analytic clarity both to the questions and the answers.”

“Philosophy is critical thinking,” Don Cupitt said. “Trying to become aware of how one’s own thinking works, of all the things one takes for granted, of the way in which one’s own thinking shapes the things one’s thinking about.”  Yes, even a data scientist’s own thinking could shape the things they are thinking scientifically about.  James Kobielus has blogged about five biases that may crop up in a data scientist’s work (Cognitive, Selection, Sampling, Modeling, Funding).

“Data science has a bright future ahead,” explained Hilary Mason in a recent interview.  “There will only be more data, and more of a need for people who can find meaning and value in that data.  We’re also starting to see a greater need for data engineers, people to build infrastructure around data and algorithms, and data artists, people who can visualize the data.”

I agree with Mason, and I would add that we are also starting to see a greater need for data philosophers, people who can, borrowing the words that Anthony Kenny used to define philosophy, “think as clearly as possible about the most fundamental concepts that reach through all the disciplines.”

On Philosophy, Science, and Data

Ever since Melinda Thielbar helped me demystify data science on OCDQ Radio, I have been pondering my paraphrasing of an old idea: Science without philosophy is blind; Philosophy without science is empty; Data needs both science and philosophy.

“A philosopher’s job is to find out things about the world by thinking rather than observing,” the philosopher Bertrand Russell once said.  One could say a scientist’s job is to find out things about the world by observing and experimenting.  In fact, Russell observed that “the most essential characteristic of scientific technique is that it proceeds from experiment, not from tradition.”

Russell also said that “science is what we know, and philosophy is what we don’t know.”  However, Stuart Firestein, in his book Ignorance: How It Drives Science, explained “there is no surer way to screw up an experiment than to be certain of its outcome.”

Although it seems it would make more sense for science to be driven by what we know, by facts, “working scientists,” according to Firestein, “don’t get bogged down in the factual swamp because they don’t care that much for facts.  It’s not that they discount or ignore them, but rather that they don’t see them as an end in themselves.  They don’t stop at the facts; they begin there, right beyond the facts, where the facts run out.  Facts are selected for the questions they create, for the ignorance they point to.”

In this sense, philosophy and science work together to help us think about and experiment with what we do and don’t know.

Some might argue that while anyone can be a philosopher, being a scientist requires more rigorous training.  A commonly stated requirement in the era of big data is to hire data scientists, but this raises the question: Is data science only for data scientists?

“Clearly what we need,” Firestein explained, “is a crash course in citizen science—a way to humanize science so that it can be both appreciated and judged by an informed citizenry.  Aggregating facts is useless if you don’t have a context to interpret them.”

I would argue that clearly what organizations need is a crash course in data science—a way to humanize data science so that it can be both appreciated and judged by an informed business community.  Big data is useless if you don’t have a business context to interpret it.  Firestein also made great points about science not being exclusionary (i.e., not just for scientists).  Just as you can enjoy watching sports without being a professional athlete and you can appreciate music without being a professional musician, you can—and should—learn the basics of data science (especially statistics) without being a professional data scientist.

In order to truly deliver business value to organizations, data science cannot be exclusionary.  This doesn’t mean you shouldn’t hire data scientists.  In many cases, you will need the expertise of professional data scientists.  However, you will not be able to direct them or interpret their findings without understanding the basics, what could be called the philosophy of data science.

Some might argue that philosophy only reigns in the absence of data, while science reigns in the analysis of data.  Although in the era of big data there seems to be fewer areas truly absent of data, a conceptual bridge still remains between analysis and insight, the crossing of which is itself a philosophical exercise.  So, an endless oscillation persists between science and philosophy, which is why science without philosophy is blind, and philosophy without science is empty.  Data needs both science and philosophy.

Demystifying Data Science

OCDQ Radio is a vendor-neutral podcast about data quality and its related disciplines, produced and hosted by Jim Harris.

During this episode, special guest, and actual data scientist, Dr. Melinda Thielbar, a Ph.D. Statistician, and I attempt to demystify data science by explaining what a data scientist does, including the requisite skills involved, bridging the communication gap between data scientists and business leaders, delivering data products business users can use on their own, and providing a straightforward explanation of key concepts such as signal-to-noise ratio, uncertainty, experimentation, and correlation.

Melinda Thielbar is the Senior Mathematician for IAVO Research and Scientific.  Her work there focuses on power system optimization using real-time prediction models.  She has worked as a software developer, an analytic lead for big data implementations, and a statistics and programming teacher.

Melinda Thielbar is a co-founder of Research Triangle Analysts, a professional group for analysts and data scientists located in the Research Triangle of North Carolina.

While Melinda Thielbar doesn’t specialize in a single field, she is particularly interested in power systems because, as she puts it, “A power systems optimizer has to work every time.”

Popular OCDQ Radio Episodes

Clicking on the link will take you to the episode’s blog post:

  • Data Quality and Big Data — Guest Tom Redman (aka the “Data Doc”) discusses Data Quality and Big Data, including if data quality matters less in larger data sets, and if statistical outliers represent business insights or data quality issues.
  • Demystifying Master Data Management — Guest John Owens explains the three types of data (Transaction, Domain, Master), the four master data entities (Party, Product, Location, Asset), and the Party-Role Relationship, which is where we find many of the terms commonly used to describe the Party master data entity (e.g., Customer, Supplier, Employee).
  • Data Governance Star Wars — Special Guests Rob Karel and Gwen Thomas joined this extended, and Star Wars themed, discussion about how to balance bureaucracy and business agility during the execution of data governance programs.
  • The Johari Window of Data Quality — Guest Martin Doyle discusses helping people better understand their data and assess its business impacts, not just the negative impacts of bad data quality, but also the positive impacts of good data quality.
  • Studying Data Quality — Guest Gordon Hamilton discusses the key concepts from recommended data quality books, including those which he has implemented in his career as a data quality practitioner.