Predicting NBA Playoffs Using Machine Learning

February 2021

DOI:10.47611/harp.26

License
CC BY-NC-ND 4.0

Predicting NBA Playoffs using Machine Learning

Sean Liu ∗

February 18, 2021

Abstract

This project attempts to predict the NBA playoff bracket using machine learning methods. It will consider one self-constructed model and one machine learning model built from various machine learning algorithms. The project will also determine the most efficient model for predicting NBA results and which way of selecting data gives an accurate and consistent prediction. Finally, the project will investigate the effect of home and away variables on the teams' performance and the model's accuracy.

1 Introduction

The National Basketball Association (NBA) is considered the premier basketball league for professional male basketball players in the USA. It is made up of 30 teams, split into the Eastern and Western conferences [Aut01].

During the playoffs, the top 8 teams from each conference (Eastern and Western) compete for the championship. The rankings are decided based on the teams' performances during the regular season. The teams then play against each other, with the 1st place playing against the 8th place, the 2nd place playing against the 7th place, etc. Each matchup is a best-of-seven series, and the teams alternate between home and away games.
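The first-round pairing rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `first_round_pairs` and the placeholder seed labels are assumptions and do not come from the paper's program.

```python
def first_round_pairs(seeds):
    """Pair the 1st seed with the 8th, the 2nd with the 7th, and so on.

    `seeds` lists one conference's playoff teams, ordered from the best
    to the worst regular-season ranking.
    """
    if len(seeds) % 2 != 0:
        raise ValueError("need an even number of seeded teams")
    return [(seeds[i], seeds[-1 - i]) for i in range(len(seeds) // 2)]


# Placeholder labels (hypothetical) for the eight teams of one conference.
east = [f"E{rank}" for rank in range(1, 9)]
pairs = first_round_pairs(east)
print(pairs)  # [('E1', 'E8'), ('E2', 'E7'), ('E3', 'E6'), ('E4', 'E5')]
```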

As AlphaGo demonstrated in the game of Go [Sil02], machine learning is a well-known prediction tool for complex processes. This research asks three questions. Can machine learning be used to predict the NBA playoff bracket? How accurate is such a prediction compared with the real results? And which machine learning method is the best solution for NBA playoff bracket prediction?

To better analyze and compare the performance of machine learning in predicting the NBA playoff bracket, we first create a self-made prediction model based on several key game variables that, to the best of our knowledge of NBA games, have the greatest impact on the game result. These variables are 1) effective field goal percentage, 2) free throw percentage, 3) turnover percentage, 4) offensive rebound percentage, and 5) defensive rebound percentage. We focused on working out the probability of Team A winning against Team B, then applying this to every game in the playoffs. Based on historic game data from 2014 to 2018, playoff predictions for 2018, 2017 and 2016 are carried out, and comparisons with the real playoff brackets are presented.

∗Advised by: Derek Sorensen

For the machine learning approach to NBA playoff bracket prediction, we focus on 5 different machine learning models that are already implemented in the Python machine learning library Scikit-learn: Logistic Regression (LR), Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), K-Nearest Neighbors (KNN) and Classification and Regression Tree (CART) [Dhi03] - [TA11]. For comparison with the self-made model, the same playoff predictions for 2018, 2017 and 2016 are carried out, and the different machine learning models are compared accordingly.

2 Result

2.1 Exposition of self-made prediction model

Testing our self-made prediction model gave the prediction results in Table 1. The predicted playoff bracket and the original one are shown in Figures 1 and 2, with the prediction differences highlighted in red.

Figure 1: 2018 NBA Playoff (Prediction)

In summary, the self-built prediction model performed best when predicting the 2018 playoffs, with an accuracy of 80%. The second-best prediction was for 2017, with an accuracy of 66.7%. The worst prediction was for the 2016 bracket, with an accuracy of only 53.3%. For the 2018 playoff prediction, we used teams' statistics over 4 years, from 2014 to 2018. For the 2017 playoff prediction, we used statistics over 3 years, from 2014 to 2017. Finally, for the 2016 playoff prediction, we used statistics over only two years, from 2014 to 2016.

Figure 2: 2018 NBA Playoff (Original)

Figure 3: Playoff Prediction Accuracy of Self-made Prediction Model

2.2 Exposition of the Machine Learning Models

To evaluate the performance of machine learning in NBA playoff bracket prediction, 5 different machine learning models implemented in the Python machine learning library Scikit-learn are employed: 1) Logistic Regression (LR), 2) Linear Discriminant Analysis (LDA), 3) Support Vector Machine (SVM), 4) K-Nearest Neighbors (KNN) and 5) Classification and Regression Tree (CART) [Dhi03] - [TA11]. After comparing two different training data selection methods and investigating the home/away variable, we present the most accurate machine learning model for NBA playoff bracket prediction.

There are two ways in which we chose to select the data to train the algorithm.

1. The first method was to consider each team's performance when playing against all other teams.

2. The second method was to evaluate a team's performance against one other specific team to determine its win rate against that specific team.

2.2.1 Method 1 – Selecting all data

In this method, we trained the different machine learning algorithms with all the statistics of every team. As with the self-constructed model, the 2018 prediction used data over four years, the 2017 prediction used data over three years, and the 2016 prediction used data over only two years.

From the above data, the algorithms generally performed relatively well and consistently in 2018 and 2017, except for the SVM model and the LR model (Figure 3). The SVM model had a consistently low prediction accuracy in all three years, and the LR model did significantly better in 2017 than in 2018.

Overall, LDA had the highest mean accuracy of 71.2%, followed by the CART model with a mean accuracy of 69.2%. The worst performing model was the SVM algorithm, with an accuracy of only 43.6%.

Figure 4: Playoff Prediction Accuracy of Different Machine Learning Models with all data selected, averaged over 30 trial runs

2.2.2 Method 2 – Selecting Partial Data

In this method, we trained the machine learning models with only part of the data: one team's performance against a specific opponent. In other words, the training data consists of the games between the two teams whose matchup we are trying to predict.

Based on the data collected (Figure 4), there are no clear trends or patterns. All the model predictions vary considerably from year to year, and the accuracy has no clear correlation with data size. Although 2016 was again the worst year of prediction, the 2017 prediction did significantly better than the 2018 prediction, which suggests that there is no consistent trend.

In summary, selecting only the data where two teams played against each other resulted in inaccurate and inconsistent predictions. Because of this inconsistency and inaccuracy, the models' accuracy under this data selection method will not be considered when determining the best performing model. Therefore, it can be concluded that selecting all data is the better data selection method.

Figure 5: Playoff Prediction Accuracy of Different Machine Learning Models with partial data selected, averaged over 30 trial runs

2.2.3 Home and Away Investigation

Whether a team plays at home or away is widely considered an essential variable for the performance of the team and its players. The home court will have more fans, and the positive atmosphere will give the home team a morale boost, which may result in a better performance.

An experiment is conducted by training and running the program three times with different data. The first run uses all data from home and away games, the second uses only the data from home games, and the third uses only the data from away games. The differences between the predicted results are then analyzed and evaluated.

Based on the above data (Table 2), teams generally performed much better and had a higher win percentage when playing as the home team.

To summarize, the home and away variable greatly influenced teams' performance level and win percentage in general. In theory, the variable should not affect model accuracy to a significant extent. In this case, however, it did affect the model accuracy, which may be caused by other real-life factors.

Figure 6: 2018 LR Prediction Results

2.2.4 Most Accurate Machine Learning Model for NBA Playoff Prediction

To conclude, the most accurate machine learning model for predicting the NBA playoffs is LDA, which reached an accuracy of 71.2%. The models' performance with partial data is disregarded, since that data selection method did not give useful information.

3 Discussion

3.1 Self-made prediction model

Comparing the model's accuracy with the size of the data, we see a trend between the two variables: 2018 had the largest dataset and the highest model accuracy, and 2016 had the smallest dataset and the lowest model accuracy (Table 1). One possible explanation for the model's changing performance is that it works well with larger datasets and performs worse with smaller ones.

Another possible reason is that the model is simply not consistent in its predictions. The correlation between data size and model accuracy might be a coincidence, since we only have three years of predictions. Further investigation can be carried out to confirm the effect of the data size on the model's accuracy, by running the prediction model for more years with different data sizes to better understand the correlation between the two variables.

3.2 Machine learning prediction model

As with the self-made prediction model, the prediction accuracy for 2018 is the highest and that for 2016 is the lowest when all data are selected to train the machine learning models (Figure 3). All algorithms except the SVM model performed significantly worse in 2016. One hypothesis is that the models reached a tipping point in 2016, where the data size was not big enough to support accurate predictions.

Figure 7: 2016 Team Performance Prediction with All Data Selected

Another hypothesis is that an error in the program itself is causing the 2016 prediction to deviate. This is suggested by the models' accuracy in 2016 (Table 3): all models except SVM had an accuracy of 53.3%, which shows that each model except SVM made the same playoff prediction for every round. The teams' win percentages shift slightly between models, which may indicate that there is no error and that the models are independent of each other, but it remains suspicious that every model arrived at the same accuracy and the same prediction. Future investigations should confirm whether an error in the program is causing the deviation in 2016, or whether the models reached a tipping point in data size that is causing the variation.

For model training with partial data selection (Figure 4), it can be concluded that the partial data selection method gives inaccurate and inconsistent predictions. This is likely because there is very little data for a given pair of teams. Specifically, two teams only play against each other ten times a year, with Team 1 playing as the home team for five games and Team 2 playing as the home team for the other five. Additionally, only 80% of the data is used for training, meaning that only eight data points per year are available for training. Such an inadequate dataset resulted in inaccurate predictions. It also means that the prediction models are more likely to give the two teams a 50 percent win rate each, due to the small amount of data for testing and training. The program then effectively selects a winner between the two teams at random, making the prediction model inconsistent.

Additionally, the two sets of predictions, home and away, should in theory have similar accuracy. Teams typically have a higher win percentage when playing as the home team and a lower win percentage when playing as the away team. If all teams perform better when playing as the home team, they should get roughly the same increase in performance level, so it should not affect the accuracy to a significant extent. The same holds when teams are playing as the away team: they should all perform relatively worse, so the models' accuracy should not shift by a significant amount.

However, in this case (Table 2), the model accuracy did shift significantly, by 13%. This is partly due to outliers like MIN, which performed better when playing as the away team than as the home team. It is also because different teams gained different amounts when playing as the home team. For example, GSW had a 30% increase in win percentage when playing as the home team, whereas PHI only had a 9% increase. One hypothesis is that GSW has more fans than other teams, so they have a better atmosphere when playing as the home team. However, many other factors can affect a team's performance at home and away. These factors can be further investigated in the future.

4 Methods

To test the effectiveness of our self-made NBA playoff prediction model and all the related machine learning algorithms, historic NBA statistics from 2014 to 2018 are needed, which can be accessed from many open sources of NBA statistics. These historic NBA statistics are usually saved in .csv file format, so we can use the read_csv function of the Python pandas library to load the dataset from the corresponding CSV URL. The format and header of the dataset are shown in Figure 5.

Figure 8: NBA historic statistics dataset format and headers

4.1 Self-made prediction model

To start, we created our own prediction model to predict the NBA bracket. We focused on working out the probability of Team A winning against Team B, then applying this to every game in the playoffs.

We had to narrow our focus to specific game variables that significantly impact the game result. After some research, we decided to use the following variables:

1) EFG%, effective field goal percentage [Aut12], combines 2-point and 3-point field goals in one variable, weighting a three-pointer as worth 1.5 times a two-pointer.

2) FT%, free throw percentage [Aut12], is the percentage of free throws made by a specific team.

3) TOV%, turnover percentage [Aut12], is an estimate of turnovers by a team per 100 possessions.

4) ORB%, offensive rebound percentage [Aut12], is an estimate of the percentage of available offensive rebounds that a team gets.

5) DRB%, defensive rebound percentage [Aut12], is an estimate of the percentage of available defensive rebounds taken by a team.

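$$\mathrm{EFG\%} = \frac{(\text{2-point field goals made} + 1.5 \times \text{3-point field goals made}) \times 100}{\text{total field goals attempted}}$$

$$\mathrm{FT\%} = \frac{\text{free throws made} \times 100}{\text{free throws attempted}}$$

$$\mathrm{TOV\%} = \frac{\text{number of turnovers} \times 100}{\text{field goals attempted} + 0.44 \times \text{free throws attempted} + \text{number of turnovers}}$$

$$\mathrm{ORB\%} = \frac{\text{offensive rebounds} \times 100}{\text{offensive rebounds} + \text{opponent defensive rebounds}}$$

$$\mathrm{DRB\%} = \frac{\text{defensive rebounds} \times 100}{\text{defensive rebounds} + \text{opponent offensive rebounds}}$$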
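As a rough illustration, the five percentages can be computed from box-score totals as below. The function name, the argument names and the sample numbers are all hypothetical. EFG% is computed here with field goal attempts in the denominator, following the Basketball Reference glossary definition that the text cites as [Aut12].

```python
def team_percentages(fg2_made, fg3_made, fg_att, ft_made, ft_att,
                     turnovers, off_reb, def_reb, opp_off_reb, opp_def_reb):
    """Return the five model variables, each on a 0-100 scale."""
    return {
        # A made three-pointer counts 1.5 times a made two-pointer.
        "EFG%": 100 * (fg2_made + 1.5 * fg3_made) / fg_att,
        "FT%": 100 * ft_made / ft_att,
        "TOV%": 100 * turnovers / (fg_att + 0.44 * ft_att + turnovers),
        "ORB%": 100 * off_reb / (off_reb + opp_def_reb),
        "DRB%": 100 * def_reb / (def_reb + opp_off_reb),
    }


# Made-up single-game totals, for illustration only.
stats = team_percentages(fg2_made=30, fg3_made=10, fg_att=85,
                         ft_made=18, ft_att=24, turnovers=14,
                         off_reb=10, def_reb=34,
                         opp_off_reb=9, opp_def_reb=33)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```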

4.1.1 Algorithms

The five variables chosen are considered the most impactful factors in the game. The second step of our model is to decide how to calculate the probability of Team A beating Team B. The chosen formula is:

$$P_{\text{win}} = c_1 P_1 + c_2 P_2 + c_3 P_3 + c_4 P_4 + c_5 P_5$$

Here, $c_i$ is the proportional correlation of the variable $v_i$ with winning. In other words, the larger the value of $c_i$, the more the variable $v_i$ contributes to winning a game. $P_i$ is the probability that Team A will have a higher score than Team B on variable $v_i$. By multiplying the two factors together for each variable and adding up the products for all five variables, we obtain the predicted probability of Team A beating Team B in a match.
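The weighted sum can be written directly in Python. The sketch below assumes the five weights and the five per-variable probabilities are already known; the numbers used are invented for illustration and are not results from the paper.

```python
def win_probability(weights, probs):
    """Weighted sum P_win = c_1*P_1 + ... + c_n*P_n."""
    if len(weights) != len(probs):
        raise ValueError("weights and probs must have the same length")
    return sum(c * p for c, p in zip(weights, probs))


# Invented example values: weights c_i that sum to 1, and P_i for Team A.
weights = [0.40, 0.10, 0.20, 0.15, 0.15]
probs = [0.95, 0.50, 0.30, 0.60, 0.05]
p_win = win_probability(weights, probs)
print(f"P(Team A beats Team B) = {p_win:.4f}")  # 0.5875
```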

4.1.2 $c_i$ calculation

The formula for $c_i$ is:

$$c_i = \frac{r_i}{r_1 + r_2 + r_3 + r_4 + r_5}$$

Here, $r_i$ represents the Pearson correlation coefficient of the variable $v_i$ with winning. However, winning is a categorical value that cannot be used in the Pearson correlation. Therefore, we decided to represent winning with the point difference between the two teams.
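A minimal sketch of this weighting step follows, using a hand-written Pearson correlation so that it needs only the standard library. The helper names and the sample columns are assumptions. The sketch uses only two variables for brevity, and it does not address how a negatively correlated variable would be handled, which the text does not specify.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def normalized_weights(columns, point_diff):
    """c_i = r_i / (r_1 + ... + r_n), with r_i taken against point difference."""
    r = [pearson(col, point_diff) for col in columns]
    total = sum(r)
    return [r_i / total for r_i in r]


# Made-up per-game values for two variables and the point difference.
efg = [48, 52, 55, 50, 57]
ft = [70, 78, 75, 72, 80]
point_diff = [-6, 3, 8, -1, 10]
c = normalized_weights([efg, ft], point_diff)
print([round(c_i, 3) for c_i in c])
```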

4.1.3 $P_i$ calculation

To calculate the value of $P_i$, we used confidence intervals. A confidence interval is a range of values that is expected to contain a parameter with a specific confidence level [Wil13].

1 - Calculate a 95% confidence interval of the variable for both teams.

2 - We define the confidence interval for Team A as $[x_A, y_A]$ and for Team B as $[x_B, y_B]$.

3 - The first case is when the intervals don't overlap. In this situation, the team with the higher interval has a 95% chance of scoring higher. (Note: the percentage might be slightly higher than 95%, but in this case we consider it to be 95%.)

4 - The second case is when the two intervals overlap and Team A has a higher upper limit ($y_A > y_B$). Here, the formula to calculate $P_i$ is:

$$P_i = 0.95 \, \frac{y_A - y_B}{y_A - x_B}$$

5 - The third case is when Team B has a higher upper limit ($y_B > y_A$). Here, the formula to calculate $P_i$ is:

$$P_i = 1 - 0.95 \, \frac{y_B - y_A}{y_A - x_B}$$

6 - The last case is when the two intervals have the same upper limit ($y_B = y_A$). Then $P_i = 0.5$.
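The case analysis above can be transcribed into a small function. This is a literal sketch of the rules as printed, not the paper's program: the function name is hypothetical, intervals that only touch are treated as non-overlapping, the value 0.05 for the case where Team B's interval lies entirely higher is an assumed complement of the stated 95%, and the result is not checked to lie between 0 and 1.

```python
def p_higher(interval_a, interval_b, level=0.95):
    """Probability that Team A scores higher than Team B on one variable.

    Each interval is a (lower, upper) pair: (x_A, y_A) and (x_B, y_B).
    """
    x_a, y_a = interval_a
    x_b, y_b = interval_b
    if x_a >= y_b:            # no overlap, A's interval is higher
        return level
    if x_b >= y_a:            # no overlap, B's interval is higher (assumed complement)
        return 1 - level
    if y_a > y_b:             # overlap, A has the higher upper limit
        return level * (y_a - y_b) / (y_a - x_b)
    if y_b > y_a:             # overlap, B has the higher upper limit
        return 1 - level * (y_b - y_a) / (y_a - x_b)
    return 0.5                # overlap with equal upper limits


# Made-up intervals for illustration.
print(p_higher((60, 66), (48, 54)))   # intervals do not overlap
print(p_higher((50, 56), (48, 54)))   # overlap, y_A > y_B
print(p_higher((48, 54), (50, 56)))   # overlap, y_B > y_A
print(p_higher((50, 56), (52, 56)))   # equal upper limits
```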

4.2 Predicting the playoﬀ bracket

To predict the playoff bracket, we wrote a Python function to calculate the probability of Team A defeating Team B, and we applied it to predict the playoff bracket for 2018.

Figure 9: Python Code Snippet for Self-made Prediction Model

In the Python code snippet (Figure 6), the function select_team(), which predicts the winner between Team A and Team B, is called many times. These calls determine the winners of the quarter-finals, the semi-finals and the finals, and in the end the champion of the year. The predicted playoff results are appended to a list and compared to the actual results of the 2018 playoffs to calculate a prediction accuracy. The original playoff results are pre-loaded into the program beforehand.
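Since the original snippet survives only as a figure, the following is a hedged reconstruction of the idea rather than the paper's code. The signature of `select_team`, the helper names, the toy probability function and the "actual" results are all assumptions made for illustration.

```python
def select_team(team_a, team_b, win_prob):
    """Pick team_a when its estimated win probability is at least 0.5."""
    return team_a if win_prob(team_a, team_b) >= 0.5 else team_b


def predict_bracket(east, west, win_prob):
    """Return the predicted winner of every series, round by round."""
    picks, conference_champions = [], []
    for seeds in (east, west):
        teams = list(seeds)
        while len(teams) > 1:
            # Pair first with last, so 1 v 8, 2 v 7, ... and winners re-pair.
            teams = [select_team(teams[i], teams[-1 - i], win_prob)
                     for i in range(len(teams) // 2)]
            picks.extend(teams)
        conference_champions.append(teams[0])
    picks.append(select_team(*conference_champions, win_prob))
    return picks


def bracket_accuracy(predicted, actual):
    """Fraction of series whose predicted winner matches the actual one."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def toy_prob(team_a, team_b):
    """Toy stand-in for the model: the alphabetically smaller label is favoured."""
    return 0.6 if team_a < team_b else 0.4


east = [f"E{i}" for i in range(1, 9)]
west = [f"W{i}" for i in range(1, 9)]
predicted = predict_bracket(east, west, toy_prob)

# Invented "actual" results that differ from the prediction in 3 of 15 series.
actual = list(predicted)
actual[2], actual[9], actual[14] = "E6", "W6", "W1"
print(len(predicted), bracket_accuracy(predicted, actual))
```

With 15 series in a full bracket, 12 correct picks correspond to 80% accuracy.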

4.3 Machine learning prediction models

To test whether machine learning algorithms can be used to predict the NBA playoff bracket, and to evaluate which machine learning model has the best prediction accuracy, 5 different machine learning models implemented in the Python Scikit-learn machine learning library are employed: two linear models (LR and LDA) and three nonlinear models (KNN, CART and SVM) (Figure 6).

After the dataset is loaded from the historic NBA statistics CSV file, the data for the 2016, 2017 and 2018 playoff predictions is split into different data arrays according to the data selection method (all data or partial data) and the home or away analysis, so that the prediction accuracy of the different machine learning models can be analyzed accordingly.

To test the prediction accuracy of each machine learning model, the dataset needs to be split into two sets, one for model training and one for model validation. The split is random with an 8:2 ratio, which means that 80% of the data is used as training data and 20% is used to evaluate the prediction accuracy.
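The random 8:2 split can be illustrated with the standard library alone. This hand-rolled helper is a stand-in written for this example; its name and the `seed` parameter are assumptions, and it is not the function used in the paper's program.

```python
import random


def split_80_20(rows, seed=None):
    """Shuffle a copy of `rows` and return (training 80%, validation 20%)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder game records, e.g. one pair of teams in one season.
games = list(range(10))
train, validation = split_80_20(games, seed=1)
print(len(train), len(validation))  # 8 2
```

With only ten records, the training set shrinks to eight rows, which is the situation described for the partial data selection method.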

After the dataset is split for training and validation, the fit function of each machine learning model is called to train the model. After training, the predict function of each model is called to make the final prediction on the validation dataset, and the prediction accuracy of each model is calculated with the accuracy_score function (Figure 7).
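The workflow described above might look roughly like the following with Scikit-learn. The data here is synthetic, the models use default hyperparameters (the text does not state which settings were used), and re-splitting the data on each of the 30 trials is an assumption about how the averaged accuracies were obtained.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 "games", 5 features, label 1 = win.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] - X[:, 2] + 0.5 * rng.normal(size=200) > 0).astype(int)

models = {
    "LR": LogisticRegression(),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(),
    "SVM": SVC(),
}


def mean_accuracy(model, X, y, trials=30):
    """Average validation accuracy over repeated random 80/20 splits."""
    scores = []
    for trial in range(trials):
        X_train, X_val, y_train, y_val = train_test_split(
            X, y, test_size=0.2, random_state=trial)
        model.fit(X_train, y_train)
        scores.append(accuracy_score(y_val, model.predict(X_val)))
    return float(np.mean(scores))


results = {name: mean_accuracy(model, X, y) for name, model in models.items()}
for name, accuracy in results.items():
    print(f"{name}: {accuracy:.3f}")
```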

Note that loc, from the pandas library, is frequently used to retrieve the part of the dataset that matches a certain variable value, such as a certain year or a certain team.
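As a small illustration of this kind of selection, the snippet below filters a toy table with `loc`. The column names (`Year`, `Team`, `Home`, `EFG%`) and all values are hypothetical, since the real dataset header is only shown in a figure.

```python
import pandas as pd

# Toy table with hypothetical column names and made-up values.
games = pd.DataFrame({
    "Year": [2016, 2017, 2018, 2018],
    "Team": ["GSW", "GSW", "GSW", "PHI"],
    "Home": [1, 0, 1, 1],
    "EFG%": [56.3, 55.1, 57.0, 52.4],
})

# Rows for one team in one year.
gsw_2018 = games.loc[(games["Year"] == 2018) & (games["Team"] == "GSW")]

# Rows for home games only, as in the home/away experiment.
home_only = games.loc[games["Home"] == 1]

print(gsw_2018)
print(len(home_only))  # 3
```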

Figure 10: Python code snippet for importing modules, functions and models

5 Acknowledgement

I would like to thank Mr. Derek Sorensen of Horizon Academic Research Program (HARP), together with my parents and my sister, for their support.

Figure 11: Python code snippet for dataset split and model cross-evaluation

References

[Anu06] Mehta Anukrati. A beginner's guide to classification and regression trees. https://www.digitalvidya.com/blog/classification-and-regression-trees/

[Aut01] No Author. National basketball association. https://en.wikipedia.org/wiki/NationalBasketballAssociation

[Aut09] No Author. Machine learning - logistic regression. https://www.tutorialspoint.com/machinelearningwithpython/machinelearningwithpythonclassificationalgorithmslogisticregression.htm

[Aut10] No Author. Advantages and disadvantages of logistic regression. https://iq.opengenus.org/advantages-and-disadvantages-of-logistic-regression/

[Aut12] No Author. Glossary. Basketball Reference, No Date. https://www.basketballreference.com/about/glossary.htmltovpct

[Cor05] Maklin Cory. Linear discriminant analysis in python. https://towardsdatascience.com/linear-discriminant-analysis-in-python-76b8b17817c2

[Dhi03] K Dhiraj. Top 5 advantages and disadvantages of decision tree algorithm. https://medium.com/@dhiraj8899/top-5-advantages-and-disadvantages-of-decision-tree-algorithm-428ebd199d9a

[Jos08] Starmer Josh. Linear discriminant analysis (lda) clearly explained. https://www.youtube.com/watch?v=azXCzI57Yfct=516s

[Nar04] Kumar Naresh. Advantages and disadvantages of knn algorithm in machine learning. http://theprofessionalspoint.blogspot.com/2019/02/advantages-and-disadvantages-of-knn.html

[Rus07] Pupale Rushikesh. Support vector machine (svm) – an overview. https://towardsdatascience.com/https-medium-com-pupalerushikesh-svmf4b42800e989

[Sil02] David Silver. Mastering the game of go without human knowledge. Nature, doi:10.1038/nature24270

[TA11] Ibrahem Abdlehameed Hassanien Aboul Ella Tharwatt Alaa, Gaber Tarek. Linear discriminant analysis: A detailed tutorial. https://www.researchgate.net/publication/316994943LineardiscriminantanalysisAdetailedtutorial

[Wil13] Kenton Wil. Confidence interval. https://www.investopedia.com/terms/c/confidenceinterval.asp
