Figure. Minute-by-minute scores for a single NBA basketball game between the Boston Celtics and the Houston Rockets.
I am very lucky to have met the student members of the Harvard College Sports Analysis Collective (HSAC) and their excellent faculty advisor, Professor Carl Morris. This unique organization impressed upon me that there are many people out there who are passionate about data and statistics. They pursue really cool quantitative questions related to sports, and share many ideas on their awesome blog. The world of sports data is fascinating and very fun because lots of people really care about this data, and a ton of high-quality sports data is freely available online.
Play-by-play (PBP) data is particularly interesting because it describes the trajectory of a game at a fairly fine level of granularity. Basketball PBP data is especially exciting because the game is so fast-paced and many more points are scored compared to, say, a soccer game. In addition to describing the trajectory of the two teams’ scores, basketball PBP data also describes who has possession of the ball, shot attempts, possession turnover, etc. In other words, basketball PBP data contains a lot of information and is complex. A lot of very interesting analysis can be done with this data, but I think that many of the people who would like to do cool stuff with PBP and other sports data don’t know how to deal with the available data. Manipulating this data and extracting specific data of interest can be very challenging in an environment like Excel. Basic programming skills in a language like Python can be the difference between a data enthusiast who has cool data analysis ideas and a data enthusiast who produces cool data analysis results.
A friend of mine in HSAC showed me some NBA play-by-play data that he wanted to analyze for a project. I won’t describe everything that we did, but our first-order goal was to extract the scores at whole-minute intervals. Aggregating the data in this fashion gives you a clean way to align any number of games, enabling a number of statistical analyses.
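To make the whole-minute idea concrete, here is a minimal sketch of the aggregation step. It is not the code in basketball.py. The function name `scores_at_whole_minutes` and the `(seconds_remaining, score1, score2)` event layout are assumptions made for illustration, and it assumes a 48-minute regulation game with no overtime.

```python
def scores_at_whole_minutes(events, total_minutes=48):
    """Return [(minutes_remaining, score1, score2), ...] from total_minutes down to 0.

    `events` is a hypothetical list of (seconds_remaining, score1, score2)
    tuples in game order, so seconds_remaining never increases.
    """
    events = list(events)
    result = []
    score1 = score2 = 0
    i = 0
    for minute in range(total_minutes, -1, -1):
        cutoff = minute * 60
        # Consume every play that happened at or before this minute mark,
        # i.e. with at least `cutoff` seconds still on the clock.
        while i < len(events) and events[i][0] >= cutoff:
            _, score1, score2 = events[i]
            i += 1
        # The score carried forward is the most recent one seen so far.
        result.append((minute, score1, score2))
    return result


if __name__ == "__main__":
    plays = [(2860, 0, 2), (2830, 0, 4), (2790, 2, 4), (2700, 2, 8)]
    for row in scores_at_whole_minutes(plays)[:4]:
        print(row)
```

Because every game yields the same 49 rows (48 minutes remaining down to 0), games aggregated this way line up row for row.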
You can download lots of basketball data at www.basketballvalue.com. For our example, we are going to use the 2009-2010 season NBA play-by-plays, which you can download as a zip file.
Download basketball.py and the play-by-plays. Put them in the same directory.
From the command line, produce a minute-by-minute summary of the play-by-play data:
$ python basketball.py
From the Python shell, produce the minute-by-minute data, generate the above figure, and start exploring the generated data using tabular:
>>> from basketball import *
>>> aggregate_on_minutes(fin='playbyplay20092010reg20100418.txt', fout='pbp20092010.tsv')
>>> plot_scores(fin='pbp20092010.tsv', game_id='20100319BOSHOU', fout='scores.pdf')
>>> import tabular as tb
>>> x = tb.tabarray(SVfile='pbp20092010.tsv')
>>> x
tabarray([('20091027BOSCLE', 2009, 10, 27, 'BOS', 'CLE', 48, 0, 0),
('20091027BOSCLE', 2009, 10, 27, 'BOS', 'CLE', 47, 0, 4),
('20091027BOSCLE', 2009, 10, 27, 'BOS', 'CLE', 46, 2, 8), ...,
('20100414SASDAL', 2010, 4, 14, 'SAS', 'DAL', 3, 81, 87),
('20100414SASDAL', 2010, 4, 14, 'SAS', 'DAL', 2, 81, 87),
('20100414SASDAL', 2010, 4, 14, 'SAS', 'DAL', 1, 83, 92)],
dtype=[('GameID', '|S14'), ('Year', '<i8'), ('Month', '<i8'), ('Day', '<i8'), ('Team1', '|S3'), ('Team2', '|S3'), ('MinutesRemaining', '<i8'), ('Score1', '<i8'), ('Score2', '<i8')])
>>> boston_games = tb.utils.uniqify(x[(x['Team1']=='BOS') | (x['Team2']=='BOS')]['GameID'])
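If you don't have tabular installed, the same "which games did Boston play?" question can be answered with the standard library. This is only a sketch: `games_for_team` is a hypothetical helper, and it assumes the first line of the tab-delimited file is a plain header row with the column names shown in the dtype above. Check the file first, since the header that tabular writes may carry extra metadata.

```python
import csv
import io


def games_for_team(tsv_file, team):
    """Return the sorted, de-duplicated GameIDs in which `team` played.

    `tsv_file` is any open text file (or file-like object) whose first
    line is assumed to be a header row naming GameID, Team1 and Team2.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    game_ids = {
        row["GameID"]
        for row in reader
        if team in (row["Team1"], row["Team2"])
    }
    return sorted(game_ids)


if __name__ == "__main__":
    sample = io.StringIO(
        "GameID\tTeam1\tTeam2\tMinutesRemaining\tScore1\tScore2\n"
        "20091027BOSCLE\tBOS\tCLE\t48\t0\t0\n"
        "20091027BOSCLE\tBOS\tCLE\t47\t0\t4\n"
        "20100414SASDAL\tSAS\tDAL\t1\t83\t92\n"
    )
    print(games_for_team(sample, "BOS"))
```

With a real file you would pass `open('pbp20092010.tsv', newline='')` instead of the in-memory sample.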
Requirements: NumPy and tabular, plus IPython and matplotlib if you want to make figures.
Input file: 2009-2010 season NBA play-by-plays (.zip from www.basketballvalue.com)
Python code: basketball.py
Output file: Minute-by-minute scores extracted from the play-by-plays (tab-delimited text file)
Script to extract per-minute data from http://basketballvalue.com play-by-plays.
Usage: python basketball.py (input_file output_file)
Author: Elaine Angelino <elaine at eecs dot harvard dot edu>
Copyright 2011
Extract per-minute data from http://basketballvalue.com play-by-plays.
Parameters
fin : str
Name of input file, i.e., an unzipped play-by-play file downloaded from http://basketballvalue.com. These are tab-delimited and contain four columns:
GameID : str
Contains the game date and identifiers for the two teams, e.g., 20091027BOSCLE.
LineNumber : int
Numbers the play-by-play records for each game.
TimeRemaining : str
Hours, minutes and seconds remaining, e.g., 00:45:34.
Entry : str
Description in the format '[ABC X-Y] Play-by-play description', where 'ABC' is the identifier of the scoring team, 'X' is their current score, and 'Y' is the score of the other team, e.g., '[BOS 89-83] Pierce Jump Shot: Made (17 PTS)'.
fout : str
Name of output file. Use ‘.tsv’ (‘.csv’) to create a tab- (comma-) delimited text file.
The output file contains the following columns:
GameID : str
Contains the game date and identifiers for the two teams, e.g., 20091027BOSCLE.
Year : int
Year extracted from the date in the GameID.
Month : int
Month extracted from the date in the GameID.
Day : int
Day extracted from the date in the GameID.
Team1 : str
Identifier of the first team in the GameID, e.g., BOS for the Boston Celtics.
Team2 : str
Identifier of the second team in the GameID, e.g., HOU for the Houston Rockets.
MinutesRemaining : int
Number of minutes remaining (48, 47, ..., 1, 0).
Score1 : int
Score of Team1.
Score2 : int
Score of Team2.
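The field formats described above are regular enough to parse with a few small helpers. The following is a sketch under stated assumptions, not the parsing code in basketball.py: the helper names are hypothetical, the GameID is assumed to be exactly 8 date digits followed by two 3-letter team codes, and entries without a '[ABC X-Y]' prefix are assumed to carry no score.

```python
import re

# '[BOS 89-83] Pierce Jump Shot: Made (17 PTS)' -> team, two scores, text
ENTRY_RE = re.compile(r"^\[([A-Z]{3}) (\d+)-(\d+)\]\s*(.*)$")


def split_game_id(game_id):
    """'20091027BOSCLE' -> (2009, 10, 27, 'BOS', 'CLE')."""
    return (
        int(game_id[0:4]),   # Year
        int(game_id[4:6]),   # Month
        int(game_id[6:8]),   # Day
        game_id[8:11],       # Team1
        game_id[11:14],      # Team2
    )


def seconds_remaining(time_remaining):
    """Convert an 'HH:MM:SS' TimeRemaining string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in time_remaining.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_entry(entry):
    """Return (team, team_score, other_score), or None if no score prefix."""
    match = ENTRY_RE.match(entry)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), int(match.group(3))


if __name__ == "__main__":
    print(split_game_id("20091027BOSCLE"))
    print(seconds_remaining("00:45:34"))
    print(parse_entry("[BOS 89-83] Pierce Jump Shot: Made (17 PTS)"))
```

Note that the score in an Entry is ordered from the point of view of the team named in the brackets, so it has to be compared against Team1 and Team2 from the GameID before it can be written out as Score1 and Score2.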
Plot scores as a function of time for a particular game.
Parameters
fin : str
Name of input file, i.e., a minute-by-minute scores file produced by aggregate_on_minutes, such as a ‘.tsv’ (‘.csv’) tab- (comma-) delimited text file.
game_id : str
GameID as in the original play-by-play file.
fout : str
Name of output image file. Use an extension like ‘.pdf’ or ‘.png’.
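For readers who want to see roughly what such a plotting function involves, here is a minimal sketch. It is not the plot_scores implementation. The helper names and the simplified `(GameID, MinutesRemaining, Score1, Score2)` row layout are assumptions, and matplotlib is imported only inside the plotting helper so that the data-preparation step runs without it.

```python
def game_series(rows, game_id, total_minutes=48):
    """Return (elapsed, score1, score2) lists for one game.

    `rows` is a hypothetical iterable of
    (GameID, MinutesRemaining, Score1, Score2) tuples.
    """
    selected = sorted(
        (r for r in rows if r[0] == game_id),
        key=lambda r: r[1],
        reverse=True,  # most minutes remaining first, i.e. game order
    )
    elapsed = [total_minutes - r[1] for r in selected]
    score1 = [r[2] for r in selected]
    score2 = [r[3] for r in selected]
    return elapsed, score1, score2


def plot_series(elapsed, score1, score2, labels, fout):
    """Save a scores-versus-time line plot to `fout` (e.g. 'scores.pdf')."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without opening a window
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(elapsed, score1, label=labels[0])
    ax.plot(elapsed, score2, label=labels[1])
    ax.set_xlabel("Minutes elapsed")
    ax.set_ylabel("Score")
    ax.legend()
    fig.savefig(fout)
    plt.close(fig)


if __name__ == "__main__":
    demo = [("G1", 47, 0, 4), ("G1", 48, 0, 0), ("G1", 46, 2, 8)]
    print(game_series(demo, "G1"))
```

The file extension passed as `fout` determines the image format, which matches the ‘.pdf’ or ‘.png’ advice above.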