Figure. Minute-by-minute scores for a single NBA basketball game between the Boston Celtics and the Houston Rockets.

I am very lucky to have met the student members of the Harvard College Sports Analysis Collective (HSAC) and their excellent faculty advisor, Professor Carl Morris. This unique organization impressed upon me how many people out there are passionate about data and statistics. Its members pursue really cool quantitative questions related to sports and share many ideas on their awesome blog. The world of sports data is fascinating and very fun because lots of people really care about it, and a ton of high-quality sports data is freely available online.

Play-by-play (PBP) data is particularly interesting because it describes the trajectory of a game at a fine level of granularity. Basketball PBP data is especially exciting because the game is so fast-paced and many more points are scored than in, say, a soccer game. In addition to tracking the two teams’ scores, basketball PBP data records who has possession of the ball, shot attempts, turnovers, etc. In other words, it is rich and complex. A lot of very interesting analysis can be done with this data, but I think that many of the people who would like to do cool stuff with PBP and other sports data don’t know how to work with the data that is available. Manipulating it and extracting the specific pieces of interest can be very challenging in an environment like Excel. Basic programming skills in a language like Python can be the difference between a data enthusiast who has cool data analysis ideas and a data enthusiast who produces cool data analysis results.

A friend of mine in HSAC showed me some NBA play-by-play data that he wanted to analyze for a project. I won’t describe everything that we did, but our first-order goal was to extract the scores at whole-minute intervals. Aggregating the data in this fashion gives you a clean way to align any number of games, enabling a number of statistical analyses.

## Example usage

From the command line, produce a minute-by-minute aggregation of the play-by-play data:

From the Python shell, produce the minute-by-minute data, generate the above figure, and start exploring the generated data using tabular:

>>> aggregate_on_minutes(fin='playbyplay20092010reg20100418.txt', fout='pbp20092010.tsv')
>>> plot_scores(fin='pbp20092010.tsv', game_id='20100319BOSHOU', fout='scores.pdf')
>>> import tabular as tb
>>> x = tb.tabarray(SVfile='pbp20092010.tsv')
>>> x
tabarray([('20091027BOSCLE', 2009, 10, 27, 'BOS', 'CLE', 48, 0, 0),
('20091027BOSCLE', 2009, 10, 27, 'BOS', 'CLE', 47, 0, 4),
('20091027BOSCLE', 2009, 10, 27, 'BOS', 'CLE', 46, 2, 8), ...,
('20100414SASDAL', 2010, 4, 14, 'SAS', 'DAL', 3, 81, 87),
('20100414SASDAL', 2010, 4, 14, 'SAS', 'DAL', 2, 81, 87),
('20100414SASDAL', 2010, 4, 14, 'SAS', 'DAL', 1, 83, 92)],
dtype=[('GameID', '|S14'), ('Year', '<i8'), ('Month', '<i8'), ('Day', '<i8'), ('Team1', '|S3'), ('Team2', '|S3'), ('MinutesRemaining', '<i8'), ('Score1', '<i8'), ('Score2', '<i8')])
>>> boston_games = tb.utils.uniqify(x[(x['Team1']=='BOS') | (x['Team2']=='BOS')]['GameID'])
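For readers who do not have tabular installed, the same kind of selection can be sketched with the standard library's csv module. This is only an illustration: it assumes the output file starts with a header row naming the columns shown in the dtype above, which is not stated here, and the helper name `games_for_team` is hypothetical.

```python
import csv
import io

def games_for_team(lines, team):
    """Return the sorted unique GameIDs in which `team` is Team1 or Team2.

    `lines` is any iterable of tab-delimited text lines whose first line is
    a header naming the columns (an assumption about the file layout).
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return sorted({row["GameID"] for row in reader
                   if team in (row["Team1"], row["Team2"])})

# Tiny in-memory sample built from rows shown in the session above.
sample = io.StringIO(
    "GameID\tYear\tMonth\tDay\tTeam1\tTeam2\tMinutesRemaining\tScore1\tScore2\n"
    "20091027BOSCLE\t2009\t10\t27\tBOS\tCLE\t48\t0\t0\n"
    "20091027BOSCLE\t2009\t10\t27\tBOS\tCLE\t47\t0\t4\n"
    "20100414SASDAL\t2010\t4\t14\tSAS\tDAL\t1\t83\t92\n"
)
boston_games = games_for_team(sample, "BOS")
print(boston_games)
```

With a real file, an open file handle (for example `open('pbp20092010.tsv', newline='')`) could be passed in place of the in-memory sample.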

## Python package dependencies

NumPy and tabular, plus IPython and matplotlib if you want to make figures.

Input file: 2009-2010 season NBA play-by-plays (.zip from www.basketballvalue.com)

Output file: Minute-by-minute scores extracted from the play-by-plays (tab-delimited text file)

Script to extract per-minute data from http://basketballvalue.com play-by-plays.

Author: Elaine Angelino <elaine at eecs dot harvard dot edu>

aggregate_on_minutes: Extract per-minute data from http://basketballvalue.com play-by-plays.

Parameters

fin : str

Name of input file, i.e. an unzipped play-by-play file downloaded from http://basketballvalue.com. These are tab-delimited and contain four columns:

GameID : str

Contains the game date and identifiers for the two teams, e.g., 20091027BOSCLE.

LineNumber : int

Numbers the play-by-play records for each game.

TimeRemaining : str

Hours, minutes and seconds remaining, e.g., 00:45:34.

Entry : str

Description in the format '[ABC X-Y] Play-by-play description', where 'ABC' identifies the scoring team, 'X' is that team's current score, and 'Y' is the other team's score, e.g., '[BOS 89-83] Pierce Jump Shot: Made (17 PTS)'.
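The four input columns described above could be parsed along the following lines. This is a minimal sketch, not the script's actual code: the helper names are hypothetical, the regular expression assumes the team identifier is written in uppercase letters, and the line number and time in the sample record are made up for illustration.

```python
import re

# Matches the leading "[ABC X-Y]" tag of an Entry (assumed uppercase team code).
SCORE_TAG = re.compile(r"^\[([A-Z]+) (\d+)-(\d+)\]")

def seconds_remaining(time_remaining):
    """Convert an 'HH:MM:SS' string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in time_remaining.split(":"))
    return 3600 * hours + 60 * minutes + seconds

def parse_entry(entry):
    """Return (team, team_score, other_score), or None if there is no score tag."""
    match = SCORE_TAG.match(entry)
    if match is None:
        return None
    team, team_score, other_score = match.groups()
    return team, int(team_score), int(other_score)

def parse_line(line):
    """Split one tab-delimited record into its four typed fields."""
    game_id, line_number, time_remaining, entry = line.rstrip("\n").split("\t", 3)
    return (game_id, int(line_number),
            seconds_remaining(time_remaining), parse_entry(entry))

# Illustrative record: only the Entry text is taken from the description above.
record = parse_line(
    "20091027BOSCLE\t412\t00:02:10\t[BOS 89-83] Pierce Jump Shot: Made (17 PTS)"
)
print(record)
```

Converting TimeRemaining to seconds up front keeps the later whole-minute comparison a matter of integer arithmetic.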

fout : str

Name of output file. Use ‘.tsv’ (‘.csv’) to create a tab- (comma-) delimited text file. The file contains nine columns:

GameID : str

Contains the game date and identifiers for the two teams, e.g., 20091027BOSCLE.

Year : int

Year extracted from the date in the GameID.

Month : int

Month extracted from the date in the GameID.

Day : int

Day extracted from the date in the GameID.

Team1 : str

Identifier of the first team named in the GameID, e.g., BOS for the Boston Celtics.

Team2 : str

Identifier of the second team named in the GameID, e.g., HOU for the Houston Rockets.

MinutesRemaining : int

Number of minutes remaining (48, 47, ..., 1, 0).

Score1 : int

Score of Team1.

Score2 : int

Score of Team2.
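One way to derive the output columns above is to split the GameID by position and then carry the most recent score forward to each whole-minute mark. The sketch below illustrates that idea under stated assumptions and is not the script's own implementation: the function names are hypothetical, the events are invented, only a 48-minute regulation game is handled, and "score at minute m" is taken to mean the score after the last event with at least m minutes remaining.

```python
def split_game_id(game_id):
    """Split e.g. '20091027BOSCLE' into (year, month, day, team1, team2)."""
    return (int(game_id[0:4]), int(game_id[4:6]), int(game_id[6:8]),
            game_id[8:11], game_id[11:14])

def scores_by_minute(events, total_minutes=48):
    """Sample the running score at each whole minute of one game.

    `events` is a list of (seconds_remaining, score1, score2) tuples ordered
    from the start of the game to the end. Returns a list of
    (minutes_remaining, score1, score2) for total_minutes, ..., 1, 0.
    """
    rows = []
    score1 = score2 = 0
    i = 0
    for minute in range(total_minutes, -1, -1):
        # Apply every event that happened at or before this minute mark.
        while i < len(events) and events[i][0] >= 60 * minute:
            _, score1, score2 = events[i]
            i += 1
        rows.append((minute, score1, score2))
    return rows

# Invented events for illustration: three scoring plays early in a game.
events = [(2860, 0, 2), (2835, 0, 4), (2790, 2, 4)]
rows = scores_by_minute(events)
print(split_game_id("20091027BOSCLE"))
print(rows[:3])
```

Because every game yields the same sequence of MinutesRemaining values, rows from different games line up directly, which is what makes the aggregated file convenient for cross-game analysis.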

plot_scores: Plot scores as a function of time for a particular game.

Parameters

fin : str

Name of input file, i.e., the minute-by-minute file produced by aggregate_on_minutes (a tab- or comma-delimited text file ending in ‘.tsv’ or ‘.csv’).

game_id : str

GameID as in the original play-by-play file, e.g., 20100319BOSHOU.

fout : str

Name of output image file. Use an extension like ‘.pdf’ or ‘.png’.
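A plot of this kind could be produced roughly as follows. This is a sketch rather than the actual plot_scores code: `score_series` and `plot_game` are hypothetical names, the choice of elapsed minutes for the horizontal axis is an assumption about the figure, and the plotting helper imports matplotlib only when it is called.

```python
def score_series(rows, game_id):
    """Return (minutes_elapsed, score1, score2) lists for one game.

    `rows` is an iterable of (GameID, MinutesRemaining, Score1, Score2) tuples.
    """
    selected = sorted((r for r in rows if r[0] == game_id),
                      key=lambda r: r[1], reverse=True)
    if not selected:
        raise ValueError("no rows for game %r" % game_id)
    start = selected[0][1]  # largest MinutesRemaining value present
    elapsed = [start - r[1] for r in selected]
    return elapsed, [r[2] for r in selected], [r[3] for r in selected]

def plot_game(rows, game_id, fout):
    """Draw both teams' scores against elapsed minutes and save to `fout`."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    elapsed, score1, score2 = score_series(rows, game_id)
    fig, ax = plt.subplots()
    ax.plot(elapsed, score1, label=game_id[8:11])   # Team1 identifier
    ax.plot(elapsed, score2, label=game_id[11:14])  # Team2 identifier
    ax.set_xlabel("Minutes elapsed")
    ax.set_ylabel("Score")
    ax.set_title(game_id)
    ax.legend()
    fig.savefig(fout)
    plt.close(fig)

# Rows taken from the example session: the first three minutes of one game.
rows = [
    ("20091027BOSCLE", 48, 0, 0),
    ("20091027BOSCLE", 47, 0, 4),
    ("20091027BOSCLE", 46, 2, 8),
    ("20100414SASDAL", 1, 83, 92),
]
elapsed, score1, score2 = score_series(rows, "20091027BOSCLE")
print(elapsed, score1, score2)
```

With matplotlib installed, calling `plot_game(rows, '20091027BOSCLE', 'scores.png')` would write the figure to disk; the file extension selects the image format.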