Our dataset merges two distinct components: game-specific features and team statistics. Each game features two opponents, both of which can be joined to the team statistics described below; one of them is designated the “primary” team in each game (i.e., we predict a win or loss for that team). We also record game outcomes and other ancillary features of each game: the home team, whether the teams are conference opponents, whether the game is part of March Madness, etc.
Pomeroy’s derived team statistics are the main draw of this dataset; he describes his methodology in more depth in his ratings glossary and four factors discussion. Pomeroy produces “tempo-free” statistics about college basketball teams, adjusting their reported statistics by the “pace” (roughly the number of possessions) of their games. This methodology recognizes that teams’ playing styles often affect the speed of each game, and that raw statistics might favor quick (and potentially inefficient) playing styles while failing to accurately describe slower, efficient ones. The calculations that ultimately proved most consequential to our predictions are Pomeroy’s measurements of offensive and defensive efficiency, derived from each team’s points scored (or, for defense, points allowed) per possession.
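To make the tempo-free idea concrete, here is a minimal sketch of an efficiency calculation. The function name and the box-score numbers are hypothetical, and the scaling to points per 100 possessions is an assumption (it is consistent with the magnitude of the OE and DE values in our data); Pomeroy's adjusted figures involve further corrections that this sketch does not attempt.

```python
def efficiency(points, possessions):
    """Points per 100 possessions (the 100x scaling is an assumption)."""
    return 100.0 * points / possessions


# Hypothetical teams: the fast one scores more raw points,
# but the slow one gets more out of each possession.
fast_team = efficiency(points=80, possessions=75)
slow_team = efficiency(points=70, possessions=60)

print(f"fast: {fast_team:.1f}, slow: {slow_team:.1f}")
# prints "fast: 106.7, slow: 116.7"
```

Raw points per game would rank the fast team higher; per-possession efficiency reverses that ordering, which is the distortion tempo-free statistics are meant to remove.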
If the reader is interested, Pomeroy’s team statistics dataset for the latest season is available free of charge at KenPom.com.
Season-by-season team statistics (in CSV format) are available back to 2002 with the purchase of a one-year subscription to Ken Pomeroy’s website. Those were trivial to retrieve for our time range, as they are aggregated at the year level and can be easily downloaded by hand. Game records proved more complex to retrieve, as they are displayed on the website but not made available for easy download. To retrieve game outcomes, we used Python’s “requests” library to request raw HTML pages that contain team information. As this data is only available to premium subscribers, we used “spoofed” browser cookies (from a logged-in session) to make our requests, ultimately scraping more than 2,000 team information pages for analysis.
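The cookie-based request can be sketched roughly as follows. This is not our scraper: the URL, the query parameter, and the cookie name are all placeholders, and the sketch uses the standard library's `urllib` rather than “requests” so that it runs without extra dependencies. It only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder URL: the real page address is not reproduced here.
BASE_URL = "https://example.com/team"


def build_team_request(team, cookies):
    """Build (but do not send) a GET request for one team page.

    `cookies` maps cookie names to values copied from a logged-in
    browser session; the names used below are hypothetical.
    """
    url = f"{BASE_URL}?{urlencode({'team': team})}"
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return Request(url, headers={"Cookie": cookie_header})


request = build_team_request("Duke", {"session_id": "PLACEHOLDER"})
print(request.full_url)
print(request.get_header("Cookie"))
```

Passing the resulting object to `urllib.request.urlopen` would perform the fetch; looping over team names and saving each response body to disk gives the raw HTML used in the next step.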
With raw HTML data in hand, we opted to use Beautiful Soup to turn HTML tables with game outcomes into a CSV dataset. This process created one obvious complication: every game appeared twice in our dataset, with teams swapping roles as the “primary” team and the opponent. For example, since Duke and Wisconsin played one another, each of their team pages shows a game against the other. To avoid potentially compromising our analysis with these duplicate observations, we identified all pairs of games and randomly sampled one game from each pair (thus designating one team the “primary” team for which we predict a win or loss).
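The de-duplication step can be illustrated with a short sketch in plain Python. The helper names, the dictionary-based rows, and the fixed seed are assumptions made for the example; the key idea is simply that both copies of a game share the same date and the same unordered pair of teams.

```python
import random
from collections import defaultdict


def game_key(row):
    """Identify a game by its date and the unordered pair of teams."""
    first, second = sorted([row["team"], row["opponent"]])
    return (row["date"], first, second)


def sample_one_per_game(rows, seed=0):
    """Group duplicate rows by game and keep one at random from each group."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[game_key(row)].append(row)
    return [rng.choice(group) for group in groups.values()]


rows = [
    {"date": "2015-04-06", "team": "Duke", "opponent": "Wisconsin", "win": 1},
    {"date": "2015-04-06", "team": "Wisconsin", "opponent": "Duke", "win": 0},
    {"date": "2015-04-04", "team": "Duke", "opponent": "Michigan St.", "win": 1},
    {"date": "2015-04-04", "team": "Michigan St.", "opponent": "Duke", "win": 0},
]
deduped = sample_one_per_game(rows)
print(len(deduped))  # prints 2
```

Sampling at random, rather than always keeping (say) the alphabetically first team, avoids systematically tying the “primary” role to any property of the team.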
Finally, we merged game data with team statistics, augmenting each game with statistics for both the primary and secondary teams. This created one additional challenge: every game featuring a given team carries the same team statistics. Though a typical season has more than 5,000 games, the set of distinct team statistics is much smaller. In response, we also generated pairwise differences and ratios of team statistics between the primary and secondary teams, adding variety to the dataset and creating features that compare both teams rather than simply describing each one individually. A sample of the resultant dataset (the 2015 Final Four games) appears below; scroll to the right to see all fields:
game_id | game_group | year | date | team | opponent | conference | conference_tournament | ncaa_tournament | other_tournament | location_Away | location_Home | location_Neutral | location_SemiAway | location_SemiHome | team_Tempo | team_RankTempo | team_AdjTempo | team_RankAdjTempo | team_OE | team_RankOE | team_AdjOE | team_RankAdjOE | team_DE | team_RankDE | team_AdjDE | team_RankAdjDE | team_Pythag | team_RankPythag | opponent_Tempo | opponent_RankTempo | opponent_AdjTempo | opponent_RankAdjTempo | opponent_OE | opponent_RankOE | opponent_AdjOE | opponent_RankAdjOE | opponent_DE | opponent_RankDE | opponent_AdjDE | opponent_RankAdjDE | opponent_Pythag | opponent_RankPythag | diff_Tempo | diff_RankTempo | diff_AdjTempo | diff_RankAdjTempo | diff_OE | diff_RankOE | diff_AdjOE | diff_RankAdjOE | diff_DE | diff_RankDE | diff_AdjDE | diff_RankAdjDE | diff_Pythag | diff_RankPythag | ratio_Tempo | ratio_RankTempo | ratio_AdjTempo | ratio_RankAdjTempo | ratio_OE | ratio_RankOE | ratio_AdjOE | ratio_RankAdjOE | ratio_DE | ratio_RankDE | ratio_AdjDE | ratio_RankAdjDE | ratio_Pythag | ratio_RankPythag | points_for | points_against | win |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
20150404-duke-michiganst | 1 | 2015 | 2015-04-04 | Duke | Michigan St. | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 65.9330 | 120 | 65.9619 | 114 | 119.6530 | 3 | 121.5639 | 3 | 96.6318 | 51 | 92.3456 | 12 | 0.959355 | 4 | 63.0403 | 271 | 63.5905 | 245 | 109.8856 | 33 | 114.5700 | 15 | 98.4175 | 86 | 95.5169 | 47 | 0.890088 | 15 | 2.8927 | -151 | 2.3714 | -131 | 9.7674 | -30 | 6.9939 | -12 | -1.7857 | -35 | -3.1713 | -35 | 0.069267 | -11 | 1.045887 | 0.442804 | 1.037292 | 0.465306 | 1.088887 | 0.090909 | 1.061045 | 0.2 | 0.981856 | 0.593023 | 0.966799 | 0.255319 | 1.077820 | 0.266667 | 81 | 61 | 1 |
20150404-kentucky-wisconsin | 1 | 2015 | 2015-04-04 | Wisconsin | Kentucky | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 59.4998 | 344 | 59.0178 | 346 | 121.1329 | 1 | 127.8751 | 1 | 97.5034 | 72 | 96.3347 | 54 | 0.962927 | 3 | 63.7584 | 241 | 63.4728 | 251 | 115.4396 | 9 | 119.2883 | 5 | 84.6510 | 1 | 86.5378 | 2 | 0.975662 | 1 | -4.2586 | 103 | -4.4550 | 95 | 5.6933 | -8 | 8.5868 | -4 | 12.8524 | 71 | 9.7969 | 52 | -0.012735 | 2 | 0.933207 | 1.427386 | 0.929812 | 1.378486 | 1.049318 | 0.111111 | 1.071984 | 0.2 | 1.151828 | 72.000000 | 1.113209 | 27.000000 | 0.986947 | 3.000000 | 71 | 64 | 1 |
20150406-duke-wisconsin | 1 | 2015 | 2015-04-06 | Duke | Wisconsin | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 65.9330 | 120 | 65.9619 | 114 | 119.6530 | 3 | 121.5639 | 3 | 96.6318 | 51 | 92.3456 | 12 | 0.959355 | 4 | 59.4998 | 344 | 59.0178 | 346 | 121.1329 | 1 | 127.8751 | 1 | 97.5034 | 72 | 96.3347 | 54 | 0.962927 | 3 | 6.4332 | -224 | 6.9441 | -232 | -1.4799 | 2 | -6.3112 | 2 | -0.8716 | -21 | -3.9891 | -42 | -0.003572 | 1 | 1.108121 | 0.348837 | 1.117661 | 0.329480 | 0.987783 | 3.000000 | 0.950646 | 3.0 | 0.991061 | 0.708333 | 0.958591 | 0.222222 | 0.996290 | 1.333333 | 68 | 63 | 1 |
Columns starting with team_, opponent_, diff_, and ratio_ represent KenPom-calculated and -derived features. Game outcome information appears in the final three columns. Columns ending with “OE” and “DE” represent the efficiency metrics described above.
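As a rough illustration of how the diff_ and ratio_ columns relate to the team_ and opponent_ columns, the sketch below recomputes two of them for the Duke vs. Michigan St. row shown above. The function name and the dictionary representation are assumptions for the example, not the code in our repository.

```python
def add_comparison_features(row, stats):
    """Return a copy of `row` with diff_<stat> and ratio_<stat> added.

    Both are taken as primary team relative to opponent:
    diff = team - opponent, ratio = team / opponent.
    """
    out = dict(row)
    for stat in stats:
        team_value = row[f"team_{stat}"]
        opponent_value = row[f"opponent_{stat}"]
        out[f"diff_{stat}"] = team_value - opponent_value
        out[f"ratio_{stat}"] = team_value / opponent_value
    return out


# Values copied from the Duke vs. Michigan St. row of the sample table.
game = {
    "team": "Duke",
    "opponent": "Michigan St.",
    "team_Tempo": 65.9330,
    "opponent_Tempo": 63.0403,
    "team_AdjOE": 121.5639,
    "opponent_AdjOE": 114.5700,
}
features = add_comparison_features(game, ["Tempo", "AdjOE"])
print(round(features["diff_Tempo"], 4), round(features["ratio_AdjOE"], 6))
```

The same subtraction and division apply to the rank columns, which is why a ratio such as ratio_RankDE can be as large as 72 when the opponent is ranked first.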
Code for data retrieval and processing is spread across three files in our repository: