AM 207 Final Project Website

Data

Our dataset features two distinct components that we merged together: game-specific features and team statistics. Each game features two opponents, both of which can be joined to the team statistics described below, and one of which is designated the “primary” team in each game (i.e., we predict a win or loss for that team). Naturally, we note game outcomes, as well as other ancillary features of games: the home team, whether or not the teams are conference opponents, March Madness games, etc.

Pomeroy’s derived team statistics are the main draw of this dataset; he describes his methodology in more depth in his ratings glossary and four factors discussion. Pomeroy produces “tempo-free” statistics about college basketball teams, adjusting their reported statistics by the “pace” (roughly the number of possessions) of their games. This methodology recognizes that teams’ playing styles often affect the speed of each game, and that raw statistics might favor quick (and potentially inefficient) playing styles while failing to accurately describe slower, efficient ones. The statistics that ultimately proved most consequential to our predictions are Pomeroy’s offensive and defensive efficiency: the points a team scores (or, for defense, allows) per possession, which KenPom reports on a per-100-possessions scale.
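To make the tempo-free idea concrete, here is a minimal sketch of an efficiency calculation. The function name and the two example teams are hypothetical, and the sketch takes the possession count as given rather than reproducing Pomeroy's possession estimate or his opponent adjustments.

```python
def efficiency(points, possessions):
    """Points per 100 possessions (a tempo-free scoring rate)."""
    if possessions <= 0:
        raise ValueError("possessions must be positive")
    return 100.0 * points / possessions


# Hypothetical teams: the fast team scores more raw points,
# but the slow team scores more per possession.
fast_team = efficiency(points=80, possessions=80)
slow_team = efficiency(points=66, possessions=60)

print(f"fast team: {fast_team:.1f}, slow team: {slow_team:.1f}")
```

Raw points per game would rank the fast team first, while the per-possession rate ranks the slow team first, which is the distortion that tempo-free statistics are meant to remove.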

Pomeroy’s team statistics dataset for the latest season is available free of charge at KenPom.com.

Retrieval & Processing

Season-by-season team statistics (in CSV format) are available back to 2002 with the purchase of a one-year subscription to Ken Pomeroy’s website. Those were trivial to retrieve for our time range, as they are aggregated at the year level and can be easily downloaded by hand. Game records proved more complex to retrieve, as they are displayed on the website but not made available for easy download. To retrieve game outcomes, we used Python’s “requests” library to request raw HTML pages that contain team information. As this data is only available to premium subscribers, we used “spoofed” browser cookies (from a logged-in session) to make our requests, ultimately scraping more than 2,000 team information pages for analysis.
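The general shape of a cookie-authenticated page request can be sketched as follows. This is an illustration rather than the code in our repository: to stay dependency-free it swaps in the standard library's urllib for the requests library, and the URL, cookie name, and cookie value are placeholders, not real KenPom values.

```python
import urllib.request


def build_request(url, cookies, user_agent="Mozilla/5.0"):
    """Build a GET request that carries cookies copied from a logged-in browser session."""
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    headers = {"Cookie": cookie_header, "User-Agent": user_agent}
    return urllib.request.Request(url, headers=headers)


def fetch_html(url, cookies, timeout=30):
    """Send the request and return the page body as text (performs network I/O)."""
    with urllib.request.urlopen(build_request(url, cookies), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


# Placeholder URL and cookie; nothing is sent until fetch_html() is called.
request = build_request(
    "https://example.com/team-page",
    {"SESSIONID": "value-copied-from-browser"},
)
print(request.get_header("Cookie"))
```

In practice the raw HTML returned by a function like fetch_html would be written to disk so that parsing can be rerun without repeating the requests.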

With raw HTML data in hand, we used Beautiful Soup to turn HTML tables with game outcomes into a CSV dataset. This process created one obvious complication: every game appeared twice in our dataset, with teams swapping roles as the “primary” team and the opponent. For example, since Duke and Wisconsin played one another, each of their team pages shows a game against the other. To avoid compromising our analysis with these duplicate observations, we identified all pairs of games and randomly sampled one game from each pair (thus designating one team the “primary” team for which we predict a win or loss).
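The de-duplication step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in our repository: the function names are hypothetical, each row is a plain dictionary, and a game is identified by its date plus the two team names in sorted order. It assumes two teams do not meet twice on the same date.

```python
import random
from collections import defaultdict


def game_key(row):
    """Order-independent identifier: the date plus both team names, sorted."""
    first, second = sorted([row["team"], row["opponent"]])
    return (row["date"], first, second)


def deduplicate(rows, seed=0):
    """Keep one randomly chosen perspective of each game."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[game_key(row)].append(row)
    return [rng.choice(group) for group in groups.values()]


# The 2015 title game as it would appear on each team's page, plus one semifinal.
games = [
    {"date": "2015-04-06", "team": "Duke", "opponent": "Wisconsin", "win": 1},
    {"date": "2015-04-06", "team": "Wisconsin", "opponent": "Duke", "win": 0},
    {"date": "2015-04-04", "team": "Duke", "opponent": "Michigan St.", "win": 1},
]
unique_games = deduplicate(games)
print(len(unique_games))
```

Sampling the kept perspective at random, rather than always keeping (say) the winner's page, prevents the choice of "primary" team from leaking the outcome into the dataset.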

Finally, we merged game data with team statistics, augmenting each game with statistics for both the primary and secondary teams. This created one additional challenge: because team statistics are aggregated at the season level, every game featuring a given team in a given season carries the same statistics for that team. Though a typical season has more than 5,000 games, the set of distinct team statistics is much smaller. In response, we also generated pairwise differences and ratios of team statistics between the primary and secondary teams, which adds variation to our dataset and creates features that compare the two teams, rather than simply describing each team individually. A sample of the resultant dataset (the 2015 Final Four games) appears below; scroll to the right to see all fields:

game_id	game_group	year	date	team	opponent	ncaa_tournament	location_Neutral	team_Tempo	team_RankTempo	team_AdjTempo	team_RankAdjTempo	team_OE	team_RankOE	team_AdjOE	team_RankAdjOE	team_DE	team_RankDE	team_AdjDE	team_RankAdjDE	team_Pythag	team_RankPythag	opponent_Tempo	opponent_RankTempo	opponent_AdjTempo	opponent_RankAdjTempo	opponent_OE	opponent_RankOE	opponent_AdjOE	opponent_RankAdjOE	opponent_DE	opponent_RankDE	opponent_AdjDE	opponent_RankAdjDE	opponent_Pythag	opponent_RankPythag	diff_Tempo	diff_RankTempo	diff_AdjTempo	diff_RankAdjTempo	diff_OE	diff_RankOE	diff_AdjOE	diff_RankAdjOE	diff_DE	diff_RankDE	diff_AdjDE	diff_RankAdjDE	diff_Pythag	diff_RankPythag	ratio_Tempo	ratio_RankTempo	ratio_AdjTempo	ratio_RankAdjTempo	ratio_OE	ratio_RankOE	ratio_AdjOE	ratio_RankAdjOE	ratio_DE	ratio_RankDE	ratio_AdjDE	ratio_RankAdjDE	ratio_Pythag	ratio_RankPythag	points_for	points_against	win
20150404-duke-michiganst	1	2015	2015-04-04	Duke	Michigan St.	1	1	65.9330	120	65.9619	114	119.6530	3	121.5639	3	96.6318	51	92.3456	12	0.959355	4	63.0403	271	63.5905	245	109.8856	33	114.5700	15	98.4175	86	95.5169	47	0.890088	15	2.8927	-151	2.3714	-131	9.7674	-30	6.9939	-12	-1.7857	-35	-3.1713	-35	0.069267	-11	1.045887	0.442804	1.037292	0.465306	1.088887	0.090909	1.061045	0.2	0.981856	0.593023	0.966799	0.255319	1.077820	0.266667	81	61	1
20150404-kentucky-wisconsin	1	2015	2015-04-04	Wisconsin	Kentucky	1	1	59.4998	344	59.0178	346	121.1329	1	127.8751	1	97.5034	72	96.3347	54	0.962927	3	63.7584	241	63.4728	251	115.4396	9	119.2883	5	84.6510	1	86.5378	2	0.975662	1	-4.2586	103	-4.4550	95	5.6933	-8	8.5868	-4	12.8524	71	9.7969	52	-0.012735	2	0.933207	1.427386	0.929812	1.378486	1.049318	0.111111	1.071984	0.2	1.151828	72.000000	1.113209	27.000000	0.986947	3.000000	71	64	1
20150406-duke-wisconsin	1	2015	2015-04-06	Duke	Wisconsin	1	1	65.9330	120	65.9619	114	119.6530	3	121.5639	3	96.6318	51	92.3456	12	0.959355	4	59.4998	344	59.0178	346	121.1329	1	127.8751	1	97.5034	72	96.3347	54	0.962927	3	6.4332	-224	6.9441	-232	-1.4799	2	-6.3112	2	-0.8716	-21	-3.9891	-42	-0.003572	1	1.108121	0.348837	1.117661	0.329480	0.987783	3.000000	0.950646	3.0	0.991061	0.708333	0.958591	0.222222	0.996290	1.333333	68	63	1

Columns starting with team_, opponent_, diff_, and ratio_ represent KenPom-calculated and -derived features. Game outcome information appears in the final three columns. Columns ending with “OE” and “DE” represent the efficiency metrics described above.
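The diff_ and ratio_ columns can be derived with a short sketch like the one below. The function name and the choice of statistics are illustrative. The direction of each comparison (team minus opponent, team divided by opponent) is inferred from the sample rows above, and the example values are taken from the Duke vs. Michigan St. row.

```python
STATS = ["AdjTempo", "AdjOE", "AdjDE"]  # illustrative subset of the statistics


def add_comparisons(game, stats=STATS):
    """Return a copy of the game with diff_* and ratio_* columns added."""
    out = dict(game)
    for stat in stats:
        team_value = game[f"team_{stat}"]
        opponent_value = game[f"opponent_{stat}"]
        out[f"diff_{stat}"] = team_value - opponent_value
        out[f"ratio_{stat}"] = team_value / opponent_value
    return out


# Values from the Duke vs. Michigan St. sample row.
duke_michigan_st = {
    "team_AdjTempo": 65.9619, "opponent_AdjTempo": 63.5905,
    "team_AdjOE": 121.5639, "opponent_AdjOE": 114.5700,
    "team_AdjDE": 92.3456, "opponent_AdjDE": 95.5169,
}
augmented = add_comparisons(duke_michigan_st)
print(round(augmented["diff_AdjOE"], 4), round(augmented["ratio_AdjOE"], 6))
```

Note that ratios of rank columns can be extreme when the opponent is ranked near the top (for example, 72 divided by 1 in the Wisconsin vs. Kentucky row), so these features may deserve scaling or care before modeling.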

Code

Code for data retrieval and processing is spread across three files in our repository:

/src/data-processing/scrape_team_snapshots.py: Reads Ken Pomeroy's statistical summary files and uses team entries therein to scrape all game records from kenpom.com. Raw HTML is written to disk.
/src/data-processing/process_snapshots.py: Reads game records from above and summarizes them as well-formed CSV files.
/src/data-processing/generate_game_data.py: Merges CSV game records and team statistics, and derives several fields for later modeling and analysis.