Baseball - Predicting players' salary¶
In this notebook, we will benchmark several of getML's feature learning algorithms against featuretools using a dataset related to baseball players' salary.
Summary:
- Prediction type: Regression model
- Domain: Sports
- Prediction target: Salaries
- Population size: 23111
Background¶
In the late 1990s, the Oakland Athletics began focusing on the idea of sabermetrics, using statistical methods to identify undervalued baseball players. This was done to compensate for the fact that the team had a significantly smaller budget than most other teams in its league. Under its general manager Billy Beane, the Oakland Athletics became the first team in over 100 years to win 20 consecutive games in a row, despite still being significantly disadvantaged in terms of its budget. After this remarkable success, the use of sabermetrics quickly became the norm in baseball. These events have been documented in a bestselling book and a movie, both called Moneyball.
In this notebook, we will demonstrate that relational learning can be used for sabermetrics. Specifically, we will develop a model to predict players' salary using getML's statistical relational learning algorithms. Such predictions can then be used to identify undervalued players.
The dataset has been downloaded from the CTU Prague relational learning repository (Motl and Schulte, 2015) (Now residing at relational-data.org.).
We will benchmark getML 's feature learning algorithms against featuretools, an open-source implementation of the propositionalization algorithm, similar to getML's FastProp.
Analysis¶
Let's get started with the analysis and set up your session:
import copy
import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
from pathlib import Path
from urllib import request
import numpy as np
import pandas as pd
from IPython.display import Image
import matplotlib.pyplot as plt
%matplotlib inline
import featuretools
import woodwork as ww
import getml
getml.engine.launch(home_directory=Path.home(), allow_remote_ips=True, token='token')
getml.engine.set_project('baseball')
getML engine is already running. Connected to project 'baseball'
1. Loading data¶
1.1 Download from source¶
We begin by downloading the data:
conn = getml.database.connect_mysql(
host="db.relational-data.org",
dbname="lahman_2014",
port=3306,
user="guest",
password="relational"
)
conn
Connection(dbname='lahman_2014', dialect='mysql', host='db.relational-data.org', port=3306)
def load_if_needed(name):
"""
Loads the data from the relational learning
repository, if the data frame has not already
been loaded.
"""
if not getml.data.exists(name):
data_frame = getml.data.DataFrame.from_db(
name=name,
table_name=name,
conn=conn
)
data_frame.save()
else:
data_frame = getml.data.load_data_frame(name)
return data_frame
allstarfull = load_if_needed("allstarfull")
awardsplayers = load_if_needed("awardsplayers")
awardsshareplayers = load_if_needed("awardsshareplayers")
batting = load_if_needed("batting")
battingpost = load_if_needed("battingpost")
fielding = load_if_needed("fielding")
fieldingpost = load_if_needed("fieldingpost")
pitching = load_if_needed("pitching")
pitchingpost = load_if_needed("pitchingpost")
salaries = load_if_needed("salaries")
allstarfull
name | yearID | gameNum | GP | startingPos | playerID | gameID | teamID | lgID |
---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string |
0 | 1955 | 0 | 1 | nan | aaronha01 | NLS195507120 | ML1 | NL |
1 | 1956 | 0 | 1 | nan | aaronha01 | ALS195607100 | ML1 | NL |
2 | 1957 | 0 | 1 | 9 | aaronha01 | NLS195707090 | ML1 | NL |
3 | 1958 | 0 | 1 | 9 | aaronha01 | ALS195807080 | ML1 | NL |
4 | 1959 | 1 | 1 | 9 | aaronha01 | NLS195907070 | ML1 | NL |
... | ... | ... | ... | ... | ... | ... | ... | |
4826 | 1978 | 0 | 1 | 9 | ziskri01 | NLS197807110 | TEX | AL |
4827 | 2002 | 0 | 1 | nan | zitoba01 | NLS200207090 | OAK | AL |
4828 | 2003 | 0 | 0 | nan | zitoba01 | ALS200307150 | OAK | AL |
4829 | 2006 | 0 | 1 | nan | zitoba01 | NLS200607110 | OAK | AL |
4830 | 2009 | 0 | 1 | nan | zobribe01 | NLS200907140 | TBA | AL |
4831 rows x 8 columns
memory usage: 0.45 MB
name: allstarfull
type: getml.DataFrame
awardsplayers
name | yearID | playerID | awardID | lgID | tie | notes |
---|---|---|---|---|---|---|
role | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string |
0 | 1877 | bondto01 | Pitching Triple Crown | NL | NULL | NULL |
1 | 1878 | hinespa01 | Triple Crown | NL | NULL | NULL |
2 | 1884 | heckegu01 | Pitching Triple Crown | AA | NULL | NULL |
3 | 1884 | radboch01 | Pitching Triple Crown | NL | NULL | NULL |
4 | 1887 | oneilti01 | Triple Crown | AA | NULL | NULL |
... | ... | ... | ... | ... | ... | |
5790 | 2012 | larocad01 | Silver Slugger | NL | NULL | 1B |
5791 | 2012 | cabremi01 | Triple Crown | NL | NULL | NULL |
5792 | 2012 | cabremi01 | TSN Major League Player of the Y... | ML | NULL | NULL |
5793 | 2012 | verlaju01 | TSN Pitcher of the Year | AL | Y | NULL |
5794 | 2012 | dickera01 | TSN Pitcher of the Year | NL | NULL | NULL |
5795 rows x 6 columns
memory usage: 0.48 MB
name: awardsplayers
type: getml.DataFrame
awardsshareplayers
name | yearID | pointsWon | pointsMax | votesFirst | awardID | lgID | playerID |
---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string |
0 | 1956 | 1 | 16 | 1 | Cy Young | ML | fordwh01 |
1 | 1956 | 4 | 16 | 4 | Cy Young | ML | maglisa01 |
2 | 1956 | 10 | 16 | 10 | Cy Young | ML | newcodo01 |
3 | 1956 | 1 | 16 | 1 | Cy Young | ML | spahnwa01 |
4 | 1957 | 1 | 16 | 1 | Cy Young | ML | donovdi01 |
... | ... | ... | ... | ... | ... | ... | |
6284 | 2006 | 1 | 160 | 0 | Rookie of the Year | NL | willijo03 |
6285 | 2006 | 101 | 160 | 10 | Rookie of the Year | NL | zimmery01 |
6286 | 2008 | 3 | 140 | 0 | Rookie of the Year | AL | devinjo01 |
6287 | 2008 | 158 | 160 | 31 | Rookie of the Year | NL | sotoge01 |
6288 | 2008 | 9 | 160 | 0 | Rookie of the Year | NL | volqued01 |
6289 rows x 7 columns
memory usage: 0.47 MB
name: awardsshareplayers
type: getml.DataFrame
batting
name | yearID | stint | G | G_batting | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | G_old | playerID | teamID | lgID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string |
0 | 2004 | 1 | 11 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | aardsda01 | SFN | NL |
1 | 2006 | 1 | 45 | 43 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 45 | aardsda01 | CHN | NL |
2 | 2007 | 1 | 25 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | aardsda01 | CHA | AL |
3 | 2008 | 1 | 47 | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 5 | aardsda01 | BOS | AL |
4 | 2009 | 1 | 73 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | aardsda01 | SEA | AL |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
92348 | 1959 | 1 | 6 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | zuverge01 | BAL | AL |
92349 | 1910 | 1 | 27 | 27 | 87 | 7 | 16 | 5 | 0 | 0 | 5 | 1 | nan | 11 | nan | nan | 1 | 1 | nan | nan | 27 | zwilldu01 | CHA | AL |
92350 | 1914 | 1 | 154 | 154 | 592 | 91 | 185 | 38 | 8 | 16 | 95 | 21 | nan | 46 | 68 | nan | 1 | 10 | nan | nan | 154 | zwilldu01 | CHF | FL |
92351 | 1915 | 1 | 150 | 150 | 548 | 65 | 157 | 32 | 7 | 13 | 94 | 24 | nan | 67 | 65 | nan | 2 | 18 | nan | nan | 150 | zwilldu01 | CHF | FL |
92352 | 1916 | 1 | 35 | 35 | 53 | 4 | 6 | 1 | 0 | 1 | 8 | 0 | nan | 4 | 6 | nan | 0 | 2 | nan | nan | 35 | zwilldu01 | CHN | NL |
92353 rows x 24 columns
memory usage: 19.29 MB
name: batting
type: getml.DataFrame
battingpost
name | yearID | G | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | round | playerID | teamID | lgID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string |
0 | 1884 | 1 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | nan | 0 | 0 | 0 | nan | nan | nan | nan | WS | becanbu01 | NY4 | AA |
1 | 1884 | 3 | 10 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | nan | 0 | 1 | 0 | nan | nan | nan | nan | WS | bradyst01 | NY4 | AA |
2 | 1884 | 3 | 10 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | nan | 1 | 1 | 0 | nan | nan | nan | nan | WS | carrocl01 | PRO | NL |
3 | 1884 | 3 | 9 | 3 | 4 | 0 | 1 | 1 | 2 | 0 | nan | 0 | 3 | 0 | nan | nan | nan | nan | WS | dennyje01 | PRO | NL |
4 | 1884 | 3 | 10 | 0 | 3 | 1 | 0 | 0 | 0 | 1 | nan | 0 | 3 | 0 | nan | nan | nan | nan | WS | esterdu01 | NY4 | AA |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
9793 | 2012 | 2 | 5 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | nan | nan | nan | nan | nan | WS | theriry01 | SFN | NL |
9794 | 2012 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | WS | valvejo01 | DET | AL |
9795 | 2012 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | WS | verlaju01 | DET | AL |
9796 | 2012 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | WS | vogelry01 | SFN | NL |
9797 | 2012 | 1 | 2 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | nan | nan | nan | nan | nan | WS | zitoba01 | SFN | NL |
9798 rows x 22 columns
memory usage: 1.93 MB
name: battingpost
type: getml.DataFrame
fielding
name | yearID | stint | G | GS | InnOuts | PO | A | E | DP | PB | WP | SB | CS | ZR | playerID | teamID | lgID | POS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string |
0 | 2004 | 1 | 11 | 0 | 32 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | aardsda01 | SFN | NL | P |
1 | 2006 | 1 | 45 | 0 | 159 | 1 | 5 | 0 | 1 | nan | nan | nan | nan | nan | aardsda01 | CHN | NL | P |
2 | 2007 | 1 | 25 | 0 | 97 | 2 | 4 | 1 | 0 | nan | nan | nan | nan | nan | aardsda01 | CHA | AL | P |
3 | 2008 | 1 | 47 | 0 | 146 | 3 | 6 | 0 | 0 | nan | nan | nan | nan | nan | aardsda01 | BOS | AL | P |
4 | 2009 | 1 | 73 | 0 | 214 | 2 | 5 | 0 | 1 | 0 | nan | 0 | 0 | nan | aardsda01 | SEA | AL | P |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
137970 | 1910 | 1 | 27 | nan | nan | 45 | 2 | 3 | 1 | nan | nan | nan | nan | nan | zwilldu01 | CHA | AL | OF |
137971 | 1914 | 1 | 154 | nan | nan | 340 | 15 | 14 | 3 | nan | nan | nan | nan | nan | zwilldu01 | CHF | FL | OF |
137972 | 1915 | 1 | 3 | nan | nan | 3 | 0 | 0 | 0 | nan | nan | nan | nan | nan | zwilldu01 | CHF | FL | 1B |
137973 | 1915 | 1 | 148 | nan | nan | 356 | 20 | 8 | 6 | nan | nan | nan | nan | nan | zwilldu01 | CHF | FL | OF |
137974 | 1916 | 1 | 10 | nan | nan | 11 | 0 | 0 | 0 | nan | nan | nan | nan | nan | zwilldu01 | CHN | NL | OF |
137975 rows x 18 columns
memory usage: 22.57 MB
name: fielding
type: getml.DataFrame
fieldingpost
name | yearID | G | GS | InnOuts | PO | A | E | DP | TP | PB | SB | CS | playerID | teamID | lgID | round | POS |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string |
0 | 1957 | 7 | 7 | 186 | 11 | 0 | 0 | 0 | 0 | nan | nan | nan | aaronha01 | ML1 | NL | WS | CF |
1 | 1958 | 1 | 1 | 21 | 2 | 0 | 0 | 0 | 0 | nan | nan | nan | aaronha01 | ML1 | NL | WS | CF |
2 | 1958 | 7 | 6 | 168 | 13 | 0 | 0 | 0 | 0 | nan | nan | nan | aaronha01 | ML1 | NL | WS | RF |
3 | 1969 | 3 | 3 | 78 | 5 | 1 | 0 | 0 | 0 | nan | nan | nan | aaronha01 | ATL | NL | NLCS | RF |
4 | 1979 | 2 | 0 | 15 | 0 | 1 | 0 | 0 | 0 | nan | 0 | 0 | aasedo01 | CAL | AL | ALCS | P |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
10341 | 2006 | 1 | 1 | 24 | 0 | 1 | 0 | 0 | 0 | nan | 0 | 1 | zitoba01 | OAK | AL | ALDS2 | P |
10342 | 2012 | 1 | 1 | 23 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | zitoba01 | SFN | NL | NLCS | P |
10343 | 2012 | 1 | 1 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | zitoba01 | SFN | NL | NLDS2 | P |
10344 | 2012 | 1 | 1 | 17 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | zitoba01 | SFN | NL | WS | P |
10345 | 1946 | 1 | 0 | 6 | 0 | 0 | 0 | 0 | 0 | nan | 0 | 0 | zuberbi01 | BOS | AL | WS | P |
10346 rows x 17 columns
memory usage: 1.65 MB
name: fieldingpost
type: getml.DataFrame
pitching
name | yearID | stint | W | L | G | GS | CG | SHO | SV | IPouts | H | ER | HR | BB | SO | BAOpp | ERA | IBB | WP | HBP | BK | BFP | GF | R | SH | SF | GIDP | playerID | teamID | lgID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string |
0 | 2004 | 1 | 1 | 0 | 11 | 0 | 0 | 0 | 0 | 32 | 20 | 8 | 1 | 10 | 5 | 0 | 6 | 0 | 0 | 2 | 0 | 61 | 5 | 8 | nan | nan | nan | aardsda01 | SFN | NL |
1 | 2006 | 1 | 3 | 0 | 45 | 0 | 0 | 0 | 0 | 159 | 41 | 24 | 9 | 28 | 49 | nan | 4 | 0 | 1 | 1 | 0 | 225 | 9 | 25 | nan | nan | nan | aardsda01 | CHN | NL |
2 | 2007 | 1 | 2 | 1 | 25 | 0 | 0 | 0 | 0 | 97 | 39 | 23 | 4 | 17 | 36 | nan | 6 | 3 | 2 | 1 | 0 | 151 | 7 | 24 | nan | nan | nan | aardsda01 | CHA | AL |
3 | 2008 | 1 | 4 | 2 | 47 | 0 | 0 | 0 | 0 | 146 | 49 | 30 | 4 | 35 | 49 | nan | 5 | 2 | 3 | 5 | 0 | 228 | 7 | 32 | nan | nan | nan | aardsda01 | BOS | AL |
4 | 2009 | 1 | 3 | 6 | 73 | 0 | 0 | 0 | 38 | 214 | 49 | 20 | 4 | 34 | 80 | nan | 2 | 3 | 2 | 0 | 0 | 296 | 53 | 23 | nan | nan | nan | aardsda01 | SEA | AL |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
39356 | 1955 | 2 | 4 | 3 | 28 | 5 | 0 | 0 | 4 | 259 | 80 | 21 | 5 | 17 | 31 | 0 | 2 | 1 | 2 | 4 | 0 | 333 | 16 | 28 | nan | nan | nan | zuverge01 | BAL | AL |
39357 | 1956 | 1 | 7 | 6 | 62 | 0 | 0 | 0 | 16 | 292 | 112 | 45 | 6 | 34 | 33 | 0 | 4 | 9 | 1 | 3 | 1 | 432 | 40 | 52 | nan | nan | nan | zuverge01 | BAL | AL |
39358 | 1957 | 1 | 10 | 6 | 56 | 0 | 0 | 0 | 9 | 338 | 105 | 31 | 9 | 39 | 36 | 0 | 2 | 13 | 1 | 4 | 0 | 475 | 37 | 37 | nan | nan | nan | zuverge01 | BAL | AL |
39359 | 1958 | 1 | 2 | 2 | 45 | 0 | 0 | 0 | 7 | 207 | 74 | 26 | 4 | 17 | 22 | 0 | 3 | 3 | 2 | 6 | 0 | 294 | 23 | 29 | nan | nan | nan | zuverge01 | BAL | AL |
39360 | 1959 | 1 | 0 | 1 | 6 | 0 | 0 | 0 | 0 | 39 | 15 | 6 | 1 | 6 | 1 | 0 | 4 | 0 | 1 | 0 | 0 | 55 | 1 | 7 | nan | nan | nan | zuverge01 | BAL | AL |
39361 rows x 30 columns
memory usage: 10.11 MB
name: pitching
type: getml.DataFrame
pitchingpost
name | yearID | W | L | G | GS | CG | SHO | SV | IPouts | H | ER | HR | BB | SO | BAOpp | ERA | IBB | WP | HBP | BK | BFP | GF | R | SH | SF | GIDP | playerID | round | teamID | lgID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string |
0 | 1979 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 15 | 4 | 1 | 0 | 2 | 6 | 0 | 1 | 1 | 0 | 0 | 0 | 20 | 2 | 1 | 0 | 1 | 0 | aasedo01 | ALCS | CAL | AL |
1 | 1975 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 0 | 0 | 0 | 0 | abbotgl01 | ALCS | OAK | AL |
2 | 2000 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 15 | 3 | 3 | 1 | 3 | 3 | 0 | 5 | 0 | 0 | 0 | 0 | 21 | 0 | 3 | 0 | 0 | 0 | abbotpa01 | ALCS | SEA | AL |
3 | 2000 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 17 | 5 | 1 | 0 | 3 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 25 | 0 | 2 | 0 | 1 | 1 | abbotpa01 | ALDS2 | SEA | AL |
4 | 2001 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 15 | 0 | 0 | 0 | 8 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 21 | 0 | 0 | 1 | 0 | 0 | abbotpa01 | ALCS | SEA | AL |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
4192 | 2006 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 24 | 4 | 1 | 1 | 3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 30 | 0 | 1 | 1 | 0 | 0 | zitoba01 | ALDS2 | OAK | AL |
4193 | 2012 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 23 | 6 | 0 | 0 | 1 | 6 | 0 | 0 | 1 | 0 | 0 | nan | 29 | 0 | 0 | nan | nan | 1 | zitoba01 | NLCS | SFN | NL |
4194 | 2012 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 7 | 4 | 2 | 1 | 4 | 4 | 0 | 6 | 0 | 0 | 0 | nan | 16 | 0 | 2 | nan | nan | 0 | zitoba01 | NLDS2 | SFN | NL |
4195 | 2012 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 17 | 6 | 1 | 0 | 1 | 3 | 0 | 1 | 0 | 0 | 0 | nan | 23 | 0 | 1 | nan | nan | 1 | zitoba01 | WS | SFN | NL |
4196 | 1946 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 6 | 3 | 1 | 0 | 1 | 1 | 0 | 4 | 1 | 0 | 0 | 0 | 9 | 0 | 1 | 0 | 0 | 1 | zuberbi01 | WS | BOS | AL |
4197 rows x 30 columns
memory usage: 1.10 MB
name: pitchingpost
type: getml.DataFrame
salaries
name | yearID | salary | teamID | lgID | playerID |
---|---|---|---|---|---|
role | unused_float | unused_float | unused_string | unused_string | unused_string |
0 | 1985 | 870000 | ATL | NL | barkele01 |
1 | 1985 | 550000 | ATL | NL | bedrost01 |
2 | 1985 | 545000 | ATL | NL | benedbr01 |
3 | 1985 | 633333 | ATL | NL | campri01 |
4 | 1985 | 625000 | ATL | NL | ceronri01 |
... | ... | ... | ... | ... | |
23106 | 2012 | 750000 | WAS | NL | tracych01 |
23107 | 2012 | 4000000 | WAS | NL | wangch01 |
23108 | 2012 | 13571428 | WAS | NL | werthja01 |
23109 | 2012 | 2300000 | WAS | NL | zimmejo02 |
23110 | 2012 | 12000000 | WAS | NL | zimmery01 |
23111 rows x 5 columns
memory usage: 1.31 MB
name: salaries
type: getml.DataFrame
1.2 Prepare data for getML¶
getML requires that we define roles for each of the columns.
allstarfull["year"] = allstarfull["yearID"].as_str().as_ts(["%Y"])
allstarfull.set_role(["playerID"], getml.data.roles.join_key)
allstarfull.set_role(allstarfull.roles.unused_string, getml.data.roles.categorical)
allstarfull.set_role(allstarfull.roles.unused_float, getml.data.roles.numerical)
allstarfull.set_role("year", getml.data.roles.time_stamp)
allstarfull.set_role("yearID", getml.data.roles.unused_float)
allstarfull
name | year | playerID | gameID | teamID | lgID | gameNum | GP | startingPos | yearID |
---|---|---|---|---|---|---|---|---|---|
role | time_stamp | join_key | categorical | categorical | categorical | numerical | numerical | numerical | unused_float |
unit | time stamp, comparison only | ||||||||
0 | 1955-01-01 | aaronha01 | NLS195507120 | ML1 | NL | 0 | 1 | nan | 1955 |
1 | 1956-01-01 | aaronha01 | ALS195607100 | ML1 | NL | 0 | 1 | nan | 1956 |
2 | 1957-01-01 | aaronha01 | NLS195707090 | ML1 | NL | 0 | 1 | 9 | 1957 |
3 | 1958-01-01 | aaronha01 | ALS195807080 | ML1 | NL | 0 | 1 | 9 | 1958 |
4 | 1959-01-01 | aaronha01 | NLS195907070 | ML1 | NL | 1 | 1 | 9 | 1959 |
... | ... | ... | ... | ... | ... | ... | ... | ... | |
4826 | 1978-01-01 | ziskri01 | NLS197807110 | TEX | AL | 0 | 1 | 9 | 1978 |
4827 | 2002-01-01 | zitoba01 | NLS200207090 | OAK | AL | 0 | 1 | nan | 2002 |
4828 | 2003-01-01 | zitoba01 | ALS200307150 | OAK | AL | 0 | 0 | nan | 2003 |
4829 | 2006-01-01 | zitoba01 | NLS200607110 | OAK | AL | 0 | 1 | nan | 2006 |
4830 | 2009-01-01 | zobribe01 | NLS200907140 | TBA | AL | 0 | 1 | nan | 2009 |
4831 rows x 9 columns
memory usage: 0.27 MB
name: allstarfull
type: getml.DataFrame
awardsplayers["year"] = awardsplayers["yearID"].as_str().as_ts(["%Y"])
awardsplayers.set_role(["playerID"], getml.data.roles.join_key)
awardsplayers.set_role(["awardID", "lgID", "notes"], getml.data.roles.categorical)
awardsplayers.set_role("year", getml.data.roles.time_stamp)
awardsplayers
name | year | playerID | awardID | lgID | notes | yearID | tie |
---|---|---|---|---|---|---|---|
role | time_stamp | join_key | categorical | categorical | categorical | unused_float | unused_string |
unit | time stamp, comparison only | ||||||
0 | 1877-01-01 | bondto01 | Pitching Triple Crown | NL | NULL | 1877 | NULL |
1 | 1878-01-01 | hinespa01 | Triple Crown | NL | NULL | 1878 | NULL |
2 | 1884-01-01 | heckegu01 | Pitching Triple Crown | AA | NULL | 1884 | NULL |
3 | 1884-01-01 | radboch01 | Pitching Triple Crown | NL | NULL | 1884 | NULL |
4 | 1887-01-01 | oneilti01 | Triple Crown | AA | NULL | 1887 | NULL |
... | ... | ... | ... | ... | ... | ... | |
5790 | 2012-01-01 | larocad01 | Silver Slugger | NL | 1B | 2012 | NULL |
5791 | 2012-01-01 | cabremi01 | Triple Crown | NL | NULL | 2012 | NULL |
5792 | 2012-01-01 | cabremi01 | TSN Major League Player of the Y... | ML | NULL | 2012 | NULL |
5793 | 2012-01-01 | verlaju01 | TSN Pitcher of the Year | AL | NULL | 2012 | Y |
5794 | 2012-01-01 | dickera01 | TSN Pitcher of the Year | NL | NULL | 2012 | NULL |
5795 rows x 7 columns
memory usage: 0.24 MB
name: awardsplayers
type: getml.DataFrame
awardsshareplayers["year"] = awardsshareplayers["yearID"].as_str().as_ts(["%Y"])
awardsshareplayers.set_role(["playerID"], getml.data.roles.join_key)
awardsshareplayers.set_role(awardsshareplayers.roles.unused_float, getml.data.roles.numerical)
awardsshareplayers.set_role(awardsshareplayers.roles.unused_string, getml.data.roles.categorical)
awardsshareplayers.set_role("yearID", getml.data.roles.unused_float)
awardsshareplayers.set_role("year", getml.data.roles.time_stamp)
awardsshareplayers
name | year | playerID | awardID | lgID | pointsWon | pointsMax | votesFirst | yearID |
---|---|---|---|---|---|---|---|---|
role | time_stamp | join_key | categorical | categorical | numerical | numerical | numerical | unused_float |
unit | time stamp, comparison only | |||||||
0 | 1956-01-01 | fordwh01 | Cy Young | ML | 1 | 16 | 1 | 1956 |
1 | 1956-01-01 | maglisa01 | Cy Young | ML | 4 | 16 | 4 | 1956 |
2 | 1956-01-01 | newcodo01 | Cy Young | ML | 10 | 16 | 10 | 1956 |
3 | 1956-01-01 | spahnwa01 | Cy Young | ML | 1 | 16 | 1 | 1956 |
4 | 1957-01-01 | donovdi01 | Cy Young | ML | 1 | 16 | 1 | 1957 |
... | ... | ... | ... | ... | ... | ... | ... | |
6284 | 2006-01-01 | willijo03 | Rookie of the Year | NL | 1 | 160 | 0 | 2006 |
6285 | 2006-01-01 | zimmery01 | Rookie of the Year | NL | 101 | 160 | 10 | 2006 |
6286 | 2008-01-01 | devinjo01 | Rookie of the Year | AL | 3 | 140 | 0 | 2008 |
6287 | 2008-01-01 | sotoge01 | Rookie of the Year | NL | 158 | 160 | 31 | 2008 |
6288 | 2008-01-01 | volqued01 | Rookie of the Year | NL | 9 | 160 | 0 | 2008 |
6289 rows x 8 columns
memory usage: 0.33 MB
name: awardsshareplayers
type: getml.DataFrame
batting["year"] = batting["yearID"].as_str().as_ts(["%Y"])
batting.set_role(["playerID", "teamID"], getml.data.roles.join_key)
batting.set_role(batting.roles.unused_float, getml.data.roles.numerical)
batting.set_role(batting.roles.unused_string, getml.data.roles.categorical)
batting.set_role("yearID", getml.data.roles.unused_float)
batting.set_role("year", getml.data.roles.time_stamp)
batting
name | year | playerID | teamID | lgID | stint | G | G_batting | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | G_old | yearID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | join_key | join_key | categorical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | unused_float |
unit | time stamp, comparison only | ||||||||||||||||||||||||
0 | 2004-01-01 | aardsda01 | SFN | NL | 1 | 11 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 11 | 2004 |
1 | 2006-01-01 | aardsda01 | CHN | NL | 1 | 45 | 43 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 45 | 2006 |
2 | 2007-01-01 | aardsda01 | CHA | AL | 1 | 25 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 2007 |
3 | 2008-01-01 | aardsda01 | BOS | AL | 1 | 47 | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 5 | 2008 |
4 | 2009-01-01 | aardsda01 | SEA | AL | 1 | 73 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | 2009 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
92348 | 1959-01-01 | zuverge01 | BAL | AL | 1 | 6 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 1959 |
92349 | 1910-01-01 | zwilldu01 | CHA | AL | 1 | 27 | 27 | 87 | 7 | 16 | 5 | 0 | 0 | 5 | 1 | nan | 11 | nan | nan | 1 | 1 | nan | nan | 27 | 1910 |
92350 | 1914-01-01 | zwilldu01 | CHF | FL | 1 | 154 | 154 | 592 | 91 | 185 | 38 | 8 | 16 | 95 | 21 | nan | 46 | 68 | nan | 1 | 10 | nan | nan | 154 | 1914 |
92351 | 1915-01-01 | zwilldu01 | CHF | FL | 1 | 150 | 150 | 548 | 65 | 157 | 32 | 7 | 13 | 94 | 24 | nan | 67 | 65 | nan | 2 | 18 | nan | nan | 150 | 1915 |
92352 | 1916-01-01 | zwilldu01 | CHN | NL | 1 | 35 | 35 | 53 | 4 | 6 | 1 | 0 | 1 | 8 | 0 | nan | 4 | 6 | nan | 0 | 2 | nan | nan | 35 | 1916 |
92353 rows x 25 columns
memory usage: 17.36 MB
name: batting
type: getml.DataFrame
battingpost["year"] = battingpost["yearID"].as_str().as_ts(["%Y"])
battingpost.set_role(["playerID", "teamID"], getml.data.roles.join_key)
battingpost.set_role(battingpost.roles.unused_float, getml.data.roles.numerical)
battingpost.set_role(battingpost.roles.unused_string, getml.data.roles.categorical)
battingpost.set_role("yearID", getml.data.roles.unused_float)
battingpost.set_role("year", getml.data.roles.time_stamp)
battingpost
name | year | playerID | teamID | round | lgID | G | AB | R | H | 2B | 3B | HR | RBI | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | yearID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | join_key | join_key | categorical | categorical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | unused_float |
unit | time stamp, comparison only | ||||||||||||||||||||||
0 | 1884-01-01 | becanbu01 | NY4 | WS | AA | 1 | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | nan | 0 | 0 | 0 | nan | nan | nan | nan | 1884 |
1 | 1884-01-01 | bradyst01 | NY4 | WS | AA | 3 | 10 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | nan | 0 | 1 | 0 | nan | nan | nan | nan | 1884 |
2 | 1884-01-01 | carrocl01 | PRO | WS | NL | 3 | 10 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | nan | 1 | 1 | 0 | nan | nan | nan | nan | 1884 |
3 | 1884-01-01 | dennyje01 | PRO | WS | NL | 3 | 9 | 3 | 4 | 0 | 1 | 1 | 2 | 0 | nan | 0 | 3 | 0 | nan | nan | nan | nan | 1884 |
4 | 1884-01-01 | esterdu01 | NY4 | WS | AA | 3 | 10 | 0 | 3 | 1 | 0 | 0 | 0 | 1 | nan | 0 | 3 | 0 | nan | nan | nan | nan | 1884 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
9793 | 2012-01-01 | theriry01 | SFN | WS | NL | 2 | 5 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | nan | nan | nan | nan | nan | 2012 |
9794 | 2012-01-01 | valvejo01 | DET | WS | AL | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | 2012 |
9795 | 2012-01-01 | verlaju01 | DET | WS | AL | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | 2012 |
9796 | 2012-01-01 | vogelry01 | SFN | WS | NL | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | 2012 |
9797 | 2012-01-01 | zitoba01 | SFN | WS | NL | 1 | 2 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | nan | nan | nan | nan | nan | 2012 |
9798 rows x 23 columns
memory usage: 1.65 MB
name: battingpost
type: getml.DataFrame
fielding["year"] = fielding["yearID"].as_str().as_ts(["%Y"])
fielding.set_role(["playerID", "teamID"], getml.data.roles.join_key)
fielding.set_role(["stint", "G","GS","InnOuts", "PO", "A", "E", "DP"], getml.data.roles.numerical)
fielding.set_role(fielding.roles.unused_string, getml.data.roles.categorical)
fielding.set_role("year", getml.data.roles.time_stamp)
fielding
name | year | playerID | teamID | lgID | POS | stint | G | GS | InnOuts | PO | A | E | DP | yearID | PB | WP | SB | CS | ZR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | join_key | join_key | categorical | categorical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float |
unit | time stamp, comparison only | ||||||||||||||||||
0 | 2004-01-01 | aardsda01 | SFN | NL | P | 1 | 11 | 0 | 32 | 0 | 0 | 0 | 0 | 2004 | nan | nan | nan | nan | nan |
1 | 2006-01-01 | aardsda01 | CHN | NL | P | 1 | 45 | 0 | 159 | 1 | 5 | 0 | 1 |