Data

Our dataset merges two distinct components: game-specific features and team statistics. Each game involves two opponents, both of which can be joined to the team statistics described below, and one of which is designated the “primary” team in each game (i.e., we predict a win or loss for that team). Naturally, we record game outcomes, as well as other ancillary features of each game: which team is at home, whether the teams are conference opponents, whether the game is part of March Madness, etc.

Pomeroy’s derived team statistics are the main draw of this dataset; he describes his methodology in more depth in his ratings glossary and four factors discussion. Pomeroy produces “tempo-free” statistics about college basketball teams, adjusting their reported statistics by the “pace” (roughly the number of possessions) of their games. This methodology recognizes that teams’ playing styles often affect the speed of each game, and that raw statistics might favor quick (and potentially inefficient) playing styles while failing to accurately describe slower, efficient ones. The calculations that ultimately proved most consequential to our predictions are Pomeroy’s measurements of offensive and defensive efficiency, derived from each team’s points scored (or, for defense, points allowed) per possession.
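To make the tempo-free idea concrete, here is a minimal sketch of raw efficiency expressed as points per 100 possessions, the scale suggested by the OE and DE values in the sample data. The possession estimate is a common box-score approximation and its 0.475 free-throw weight is an assumption; Pomeroy's exact calculation, and the opponent adjustments behind his "Adj" figures, are not reproduced here. The function names and example numbers are hypothetical.

```python
def estimate_possessions(fga, orb, tov, fta, ft_weight=0.475):
    """Approximate possessions from box-score totals (assumed formula).

    fga: field goal attempts, orb: offensive rebounds,
    tov: turnovers, fta: free throw attempts.
    """
    return fga - orb + tov + ft_weight * fta


def efficiency(points, possessions):
    """Points scored (or allowed) per 100 possessions."""
    if possessions <= 0:
        raise ValueError("possessions must be positive")
    return 100.0 * points / possessions


# Hypothetical teams: the fast one scores more raw points,
# but the slow one gets more out of each possession.
fast = efficiency(points=80, possessions=75)
slow = efficiency(points=66, possessions=60)
print(f"fast-paced team: {fast:.1f}, slow-paced team: {slow:.1f}")
```

Ranked by raw points, the fast-paced team looks better; ranked by efficiency, the slow-paced team does, which is the distortion that tempo-free statistics are meant to remove.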

If the reader is interested, Pomeroy’s team statistics dataset for the latest season is available free of charge at KenPom.com.

Retrieval & Processing

Season-by-season team statistics (in CSV format) are available back to 2002 with the purchase of a one-year subscription to Ken Pomeroy’s website. Those were trivial to retrieve for our time range, as they are aggregated at the year level and can be easily downloaded by hand. Game records proved more complex to retrieve, as they are displayed on the website but not made available for easy download. To retrieve game outcomes, we used Python’s “requests” library to request raw HTML pages that contain team information. As this data is only available to premium subscribers, we used “spoofed” browser cookies (from a logged-in session) to make our requests, ultimately scraping more than 2,000 team information pages for analysis.
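A minimal sketch of this kind of cookie-authenticated request with requests follows. The endpoint, query parameters, and cookie name are assumptions made for illustration, not details taken from the project; the cookie value would be copied from a logged-in browser session.

```python
import requests

# Assumed values for illustration only.
BASE_URL = "https://kenpom.com/team.php"  # assumed team-page endpoint
COOKIE_NAME = "PHPSESSID"                 # assumed session cookie name


def make_session(cookie_value):
    """Create a session that sends the logged-in browser's cookie."""
    session = requests.Session()
    session.cookies.set(COOKIE_NAME, cookie_value)
    return session


def fetch_team_page(session, team, year, timeout=30):
    """Return the raw HTML of one team's page for one season."""
    response = session.get(
        BASE_URL,
        params={"team": team, "y": year},  # assumed parameter names
        timeout=timeout,
    )
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text
```

Calling fetch_team_page once per team and season, and saving each result to disk before parsing, keeps the scraping and parsing steps separate so that pages never need to be requested twice.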

With raw HTML data in hand, we used Beautiful Soup to turn the HTML tables of game outcomes into a CSV dataset. This process created one obvious complication: every game appeared twice in our dataset, with teams swapping roles as the “primary” team and the opponent. For example, since Duke and Wisconsin played one another, each of their team pages shows a game against the other. To avoid compromising our analysis with these duplicate observations, we identified all pairs of games and randomly sampled one game from each pair (thus designating one team the “primary” team for which we predict a win or loss).
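The deduplication step can be sketched with the standard library alone. The sketch assumes each scraped row is a dict with the date, team, and opponent fields shown in the sample dataset, and that two teams never meet twice on the same date; the function names and the fixed seed are illustrative and are not the code in the repository.

```python
import random
from collections import defaultdict


def game_key(row):
    """Order-independent key: same date and same two teams."""
    return (row["date"], frozenset((row["team"], row["opponent"])))


def sample_one_per_game(rows, seed=0):
    """Keep one randomly chosen row from each group of mirrored rows."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[game_key(row)].append(row)
    return [rng.choice(group) for group in groups.values()]


# Each game appears once from each team's point of view.
scraped = [
    {"date": "2015-04-06", "team": "Duke", "opponent": "Wisconsin", "win": 1},
    {"date": "2015-04-06", "team": "Wisconsin", "opponent": "Duke", "win": 0},
    {"date": "2015-04-04", "team": "Duke", "opponent": "Michigan St.", "win": 1},
    {"date": "2015-04-04", "team": "Michigan St.", "opponent": "Duke", "win": 0},
]
games = sample_one_per_game(scraped)
```

Whichever row survives determines the “primary” team, so the win label is taken from that team's perspective.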

Finally, we merged game data with team statistics, augmenting each game with statistics for both the primary and secondary teams. This created one additional challenge: every game involving a given team in a given season carries the same statistics for that team. Though a typical season has more than 5,000 games, the set of distinct team-statistics rows is much smaller (one per team per season). In response, we also generated pairwise differences and ratios of team statistics between the primary and secondary team, thus ensuring diversity in our dataset and creating features that compare both teams, rather than simply describing each team individually. A sample of the resultant dataset (the 2015 Final Four games) appears below; scroll to the right to see all fields:

game_id,game_group,year,date,team,opponent,conference,conference_tournament,ncaa_tournament,other_tournament,location_Away,location_Home,location_Neutral,location_SemiAway,location_SemiHome,team_Tempo,team_RankTempo,team_AdjTempo,team_RankAdjTempo,team_OE,team_RankOE,team_AdjOE,team_RankAdjOE,team_DE,team_RankDE,team_AdjDE,team_RankAdjDE,team_Pythag,team_RankPythag,opponent_Tempo,opponent_RankTempo,opponent_AdjTempo,opponent_RankAdjTempo,opponent_OE,opponent_RankOE,opponent_AdjOE,opponent_RankAdjOE,opponent_DE,opponent_RankDE,opponent_AdjDE,opponent_RankAdjDE,opponent_Pythag,opponent_RankPythag,diff_Tempo,diff_RankTempo,diff_AdjTempo,diff_RankAdjTempo,diff_OE,diff_RankOE,diff_AdjOE,diff_RankAdjOE,diff_DE,diff_RankDE,diff_AdjDE,diff_RankAdjDE,diff_Pythag,diff_RankPythag,ratio_Tempo,ratio_RankTempo,ratio_AdjTempo,ratio_RankAdjTempo,ratio_OE,ratio_RankOE,ratio_AdjOE,ratio_RankAdjOE,ratio_DE,ratio_RankDE,ratio_AdjDE,ratio_RankAdjDE,ratio_Pythag,ratio_RankPythag,points_for,points_against,win
20150404-duke-michiganst,1,2015,2015-04-04,Duke,Michigan St.,0,0,1,0,0,0,1,0,0,65.9330,120,65.9619,114,119.6530,3,121.5639,3,96.6318,51,92.3456,12,0.959355,4,63.0403,271,63.5905,245,109.8856,33,114.5700,15,98.4175,86,95.5169,47,0.890088,15,2.8927,-151,2.3714,-131,9.7674,-30,6.9939,-12,-1.7857,-35,-3.1713,-35,0.069267,-11,1.045887,0.442804,1.037292,0.465306,1.088887,0.090909,1.061045,0.2,0.981856,0.593023,0.966799,0.255319,1.077820,0.266667,81,61,1
20150404-kentucky-wisconsin,1,2015,2015-04-04,Wisconsin,Kentucky,0,0,1,0,0,0,1,0,0,59.4998,344,59.0178,346,121.1329,1,127.8751,1,97.5034,72,96.3347,54,0.962927,3,63.7584,241,63.4728,251,115.4396,9,119.2883,5,84.6510,1,86.5378,2,0.975662,1,-4.2586,103,-4.4550,95,5.6933,-8,8.5868,-4,12.8524,71,9.7969,52,-0.012735,2,0.933207,1.427386,0.929812,1.378486,1.049318,0.111111,1.071984,0.2,1.151828,72.000000,1.113209,27.000000,0.986947,3.000000,71,64,1
20150406-duke-wisconsin,1,2015,2015-04-06,Duke,Wisconsin,0,0,1,0,0,0,1,0,0,65.9330,120,65.9619,114,119.6530,3,121.5639,3,96.6318,51,92.3456,12,0.959355,4,59.4998,344,59.0178,346,121.1329,1,127.8751,1,97.5034,72,96.3347,54,0.962927,3,6.4332,-224,6.9441,-232,-1.4799,2,-6.3112,2,-0.8716,-21,-3.9891,-42,-0.003572,1,1.108121,0.348837,1.117661,0.329480,0.987783,3.000000,0.950646,3.0,0.991061,0.708333,0.958591,0.222222,0.996290,1.333333,68,63,1

Columns starting with team_, opponent_, diff_, and ratio_ represent KenPom-calculated and -derived features. Game outcome information appears in the final three columns. Columns ending with “OE” and “DE” represent the efficiency metrics described above.
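The diff_ and ratio_ columns can be reproduced with a short sketch. The sample rows above are consistent with diff_ being the primary team's value minus the opponent's, and ratio_ being the primary team's value divided by the opponent's, so the sketch assumes those definitions; the function name and the three-statistic subset are chosen only for illustration.

```python
def add_comparisons(game, stats=("AdjOE", "AdjDE", "Pythag")):
    """Return a copy of `game` with diff_<stat> and ratio_<stat> added."""
    out = dict(game)
    for stat in stats:
        team_value = game[f"team_{stat}"]
        opponent_value = game[f"opponent_{stat}"]
        out[f"diff_{stat}"] = team_value - opponent_value
        out[f"ratio_{stat}"] = team_value / opponent_value
    return out


# Values copied from the Duke vs. Michigan St. row of the sample.
duke_msu = {
    "team_AdjOE": 121.5639, "opponent_AdjOE": 114.5700,
    "team_AdjDE": 92.3456, "opponent_AdjDE": 95.5169,
    "team_Pythag": 0.959355, "opponent_Pythag": 0.890088,
}
features = add_comparisons(duke_msu)
print(round(features["diff_AdjOE"], 4), round(features["ratio_AdjOE"], 6))
```

Because both comparisons are computed from the primary team's perspective, swapping the two teams negates every diff_ value and inverts every ratio_ value.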

Code

Code for data retrieval and processing is spread across three files in our repository: