SFScores - Predicting health inspection scores of restaurants¶
In this notebook, we benchmark getML's FastProp feature learning algorithm against featuretools using a dataset of eateries in San Francisco.
Summary:
- Prediction type: Regression model
- Domain: Health
- Prediction target: Health inspection score
- Population size: 12887
Background¶
This notebook is based on the San Francisco Department of Public Health's database of eateries in the city. These eateries are inspected regularly, and the inspections often result in a numeric score.
The challenge is to predict the score resulting from an inspection.
The dataset has been downloaded from the CTU Prague Relational Learning Repository (Motl and Schulte, 2015), which now resides at relational-data.org.
We will benchmark getML's feature learning algorithms against featuretools, an open-source implementation of the propositionalization algorithm, similar to getML's FastProp.
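To build intuition for what propositionalization means, here is a minimal, purely illustrative pandas sketch (the table contents are made up and it is not part of the benchmark): it collapses a one-to-many table such as violations into per-business aggregate columns that could then be joined onto the population table.
import pandas as pd
# Illustrative toy data: several violation records per business.
violations_example = pd.DataFrame({
    "business_id": [10, 10, 10, 24],
    "risk_category": ["Low Risk", "Moderate Risk", "Low Risk", "Low Risk"],
})
# Propositionalization in a nutshell: aggregate the many rows per business
# into a single row of candidate features.
features_example = violations_example.groupby("business_id").agg(
    num_violations=("risk_category", "size"),
    num_distinct_risk_categories=("risk_category", "nunique"),
)
print(features_example)
FastProp and featuretools generate large numbers of such aggregations automatically, across all joined tables and within the time windows defined by the data model.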
Analysis¶
Let's get started with the analysis and set up our session:
%pip install -q "getml==1.5.0" "featuretools==1.31.0" "ipywidgets==8.1.5"
import os
import warnings
import pandas as pd
import featuretools
import woodwork as ww
import getml
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
warnings.simplefilter(action='ignore', category=FutureWarning)
print(f"getML API version: {getml.__version__}\n")
getML API version: 1.5.0
getml.engine.launch(allow_remote_ips=True, token='token')
getml.engine.set_project('sfscores')
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-community-edition-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912213838.log.
Connected to project 'sfscores'.
1. Loading data¶
1.1 Download from source¶
We begin by connecting to the repository's MySQL database and loading the tables:
conn = getml.database.connect_mysql(
host="db.relational-data.org",
dbname="SFScores",
port=3306,
user="guest",
password="relational"
)
conn
Connection(dbname='SFScores', dialect='mysql', host='db.relational-data.org', port=3306)
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if not getml.data.exists(name):
        data_frame = getml.data.DataFrame.from_db(
            name=name,
            table_name=name,
            conn=conn
        )
        data_frame.save()
    else:
        data_frame = getml.data.load_data_frame(name)
    return data_frame
businesses = load_if_needed("businesses")
inspections = load_if_needed("inspections")
violations = load_if_needed("violations")
businesses
name | business_id | latitude | longitude | phone_number | business_certificate | name | address | city | postal_code | tax_code | application_date | owner_name | owner_address | owner_city | owner_state | owner_zip |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
0 | 10 | 37.7911 | -122.404 | nan | 779059 | Tiramisu Kitchen | 033 Belden Pl | San Francisco | 94104 | H24 | NULL | Tiramisu LLC | 33 Belden St | San Francisco | CA | 94104 |
1 | 24 | 37.7929 | -122.403 | nan | 352312 | OMNI S.F. Hotel - 2nd Floor Pant... | 500 California St, 2nd Floor | San Francisco | 94104 | H24 | NULL | OMNI San Francisco Hotel Corp | 500 California St, 2nd Floor | San Francisco | CA | 94104 |
2 | 31 | 37.8072 | -122.419 | nan | 346882 | Norman's Ice Cream and Freezes | 2801 Leavenworth St | San Francisco | 94133 | H24 | NULL | Norman Antiforda | 2801 Leavenworth St | San Francisco | CA | 94133 |
3 | 45 | 37.7471 | -122.414 | nan | 340024 | CHARLIE'S DELI CAFE | 3202 FOLSOM St | S.F. | 94110 | H24 | 2001-10-10 | HARB, CHARLES AND KRISTIN | 1150 SANCHEZ | S.F. | CA | 94114 |
4 | 48 | 37.764 | -122.466 | nan | 318022 | ART'S CAFE | 747 IRVING St | SAN FRANCISCO | 94122 | H24 | NULL | YOON HAE RYONG | 1567 FUNSTON AVE | SAN FRANCISCO | CA | 94122 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
6353 | 89335 | nan | nan | nan | 1057025 | Breaking Bad Sandwiches | 154 McAllister St | NULL | 94102 | H25 | 2016-09-23 | JPMD, LLC | 662 Bellhurst Lane | Castro Valley | CA | 94102 |
6354 | 89336 | nan | nan | nan | 1057746 | Miller's Rest | 1085 Sutter St | NULL | 94109 | H26 | 2016-09-23 | Miller's Rest, LLC | 2906 Bush Street | San Francisco | CA | 94109 |
6355 | 89393 | nan | nan | nan | 1042408 | Panuchos | 620 Broadway St | NULL | 94133 | H24 | 2016-09-28 | Los Aluxes, LLC | 1032 Irving Street, #421 | San Francisco | CA | 94122 |
6356 | 89416 | nan | nan | nan | 1051081 | Nobhill Pizza & Shawerma | 1534 California St | NULL | 94109 | H24 | 2016-09-29 | BBA Foods, Inc. | 840 Post Street, #218 | San Francisco | CA | 94109 |
6357 | 89453 | nan | nan | nan | 459309 | Burger King #4668 | 1690 Valencia St | San Francisco | 94110 | H29 | 2016-10-03 | Golden Gate Restaurant Group, In... | P.O Box 21 | Lafeyette | CA | 94549 |
6358 rows x 16 columns
memory usage: 1.57 MB
name: businesses
type: getml.DataFrame
inspections
name | business_id | score | date | type |
---|---|---|---|---|
role | unused_float | unused_float | unused_string | unused_string |
0 | 10 | 92 | 2014-01-14 | Routine - Unscheduled |
1 | 10 | nan | 2014-01-24 | Reinspection/Followup |
2 | 10 | 94 | 2014-07-29 | Routine - Unscheduled |
3 | 10 | nan | 2014-08-07 | Reinspection/Followup |
4 | 10 | 82 | 2016-05-03 | Routine - Unscheduled |
... | ... | ... | ... | |
23759 | 89199 | 100 | 2016-09-12 | Routine - Unscheduled |
23760 | 89200 | 100 | 2016-09-12 | Routine - Unscheduled |
23761 | 89201 | nan | 2016-09-12 | New Ownership |
23762 | 89204 | 100 | 2016-09-12 | Routine - Unscheduled |
23763 | 89296 | nan | 2016-09-30 | New Ownership |
23764 rows x 4 columns
memory usage: 1.51 MB
name: inspections
type: getml.DataFrame
violations
name | business_id | date | violation_type_id | risk_category | description |
---|---|---|---|---|---|
role | unused_float | unused_string | unused_string | unused_string | unused_string |
0 | 10 | 2014-07-29 | 103129 | Moderate Risk | Insufficient hot water or runnin... |
1 | 10 | 2014-07-29 | 103144 | Low Risk | Unapproved or unmaintained equip... |
2 | 10 | 2014-01-14 | 103119 | Moderate Risk | Inadequate and inaccessible hand... |
3 | 10 | 2014-01-14 | 103145 | Low Risk | Improper storage of equipment ut... |
4 | 10 | 2014-01-14 | 103154 | Low Risk | Unclean or degraded floors walls... |
... | ... | ... | ... | ... | |
36045 | 88878 | 2016-08-19 | 103144 | Low Risk | Unapproved or unmaintained equip... |
36046 | 88878 | 2016-08-19 | 103124 | Moderate Risk | Inadequately cleaned or sanitize... |
36047 | 89072 | 2016-09-22 | 103120 | Moderate Risk | Moderate risk food holding tempe... |
36048 | 89072 | 2016-09-22 | 103131 | Moderate Risk | Moderate risk vermin infestation |
36049 | 89072 | 2016-09-22 | 103149 | Low Risk | Wiping cloths not clean or prope... |
36050 rows x 5 columns
memory usage: 4.06 MB
name: violations
type: getml.DataFrame
1.2 Prepare data for getML¶
getML requires that we assign a role to each column. Roles such as join_key, target, time_stamp, categorical, and text tell the Engine how a column may be used during feature learning; columns left with an unused role are ignored.
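As a quick, purely illustrative sketch of the role system (the actual assignments for this dataset follow below), the role names are constants in getml.data.roles, and the current assignment of a data frame can be inspected through its roles attribute:
# Illustrative only: role constants and the roles accessor.
print(getml.data.roles.join_key, getml.data.roles.target, getml.data.roles.time_stamp)
# Columns that have not been assigned a role yet are listed as unused.
print(businesses.roles.unused)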
businesses.set_role("business_id", getml.data.roles.join_key)
businesses.set_role("name", getml.data.roles.text)
businesses.set_role(["postal_code", "tax_code", "owner_zip"], getml.data.roles.categorical)
businesses
name | business_id | postal_code | tax_code | owner_zip | name | latitude | longitude | phone_number | business_certificate | address | city | application_date | owner_name | owner_address | owner_city | owner_state |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | join_key | categorical | categorical | categorical | text | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
0 | 10 | 94104 | H24 | 94104 | Tiramisu Kitchen | 37.7911 | -122.404 | nan | 779059 | 033 Belden Pl | San Francisco | NULL | Tiramisu LLC | 33 Belden St | San Francisco | CA |
1 | 24 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pant... | 37.7929 | -122.403 | nan | 352312 | 500 California St, 2nd Floor | San Francisco | NULL | OMNI San Francisco Hotel Corp | 500 California St, 2nd Floor | San Francisco | CA |
2 | 31 | 94133 | H24 | 94133 | Norman's Ice Cream and Freezes | 37.8072 | -122.419 | nan | 346882 | 2801 Leavenworth St | San Francisco | NULL | Norman Antiforda | 2801 Leavenworth St | San Francisco | CA |
3 | 45 | 94110 | H24 | 94114 | CHARLIE'S DELI CAFE | 37.7471 | -122.414 | nan | 340024 | 3202 FOLSOM St | S.F. | 2001-10-10 | HARB, CHARLES AND KRISTIN | 1150 SANCHEZ | S.F. | CA |
4 | 48 | 94122 | H24 | 94122 | ART'S CAFE | 37.764 | -122.466 | nan | 318022 | 747 IRVING St | SAN FRANCISCO | NULL | YOON HAE RYONG | 1567 FUNSTON AVE | SAN FRANCISCO | CA |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
6353 | 89335 | 94102 | H25 | 94102 | Breaking Bad Sandwiches | nan | nan | nan | 1057025 | 154 McAllister St | NULL | 2016-09-23 | JPMD, LLC | 662 Bellhurst Lane | Castro Valley | CA |
6354 | 89336 | 94109 | H26 | 94109 | Miller's Rest | nan | nan | nan | 1057746 | 1085 Sutter St | NULL | 2016-09-23 | Miller's Rest, LLC | 2906 Bush Street | San Francisco | CA |
6355 | 89393 | 94133 | H24 | 94122 | Panuchos | nan | nan | nan | 1042408 | 620 Broadway St | NULL | 2016-09-28 | Los Aluxes, LLC | 1032 Irving Street, #421 | San Francisco | CA |
6356 | 89416 | 94109 | H24 | 94109 | Nobhill Pizza & Shawerma | nan | nan | nan | 1051081 | 1534 California St | NULL | 2016-09-29 | BBA Foods, Inc. | 840 Post Street, #218 | San Francisco | CA |
6357 | 89453 | 94110 | H29 | 94549 | Burger King #4668 | nan | nan | nan | 459309 | 1690 Valencia St | San Francisco | 2016-10-03 | Golden Gate Restaurant Group, In... | P.O Box 21 | Lafeyette | CA |
6358 rows x 16 columns
memory usage: 1.36 MB
name: businesses
type: getml.DataFrame
inspections = inspections[~inspections.score.is_nan()].to_df("inspections")
inspections.set_role("business_id", getml.data.roles.join_key)
inspections.set_role("score", getml.data.roles.target)
inspections.set_role("date", getml.data.roles.time_stamp)
inspections
name | date | business_id | score | type |
---|---|---|---|---|
role | time_stamp | join_key | target | unused_string |
unit | time stamp, comparison only | |||
0 | 2014-01-14 | 10 | 92 | Routine - Unscheduled |
1 | 2014-07-29 | 10 | 94 | Routine - Unscheduled |
2 | 2016-05-03 | 10 | 82 | Routine - Unscheduled |
3 | 2013-11-18 | 24 | 100 | Routine - Unscheduled |
4 | 2014-06-12 | 24 | 96 | Routine - Unscheduled |
... | ... | ... | ... | |
12882 | 2016-09-22 | 89072 | 90 | Routine - Unscheduled |
12883 | 2016-09-12 | 89198 | 100 | Routine - Unscheduled |
12884 | 2016-09-12 | 89199 | 100 | Routine - Unscheduled |
12885 | 2016-09-12 | 89200 | 100 | Routine - Unscheduled |
12886 | 2016-09-12 | 89204 | 100 | Routine - Unscheduled |
12887 rows x 4 columns
memory usage: 0.64 MB
name: inspections
type: getml.DataFrame
violations.set_role("business_id", getml.data.roles.join_key)
violations.set_role("date", getml.data.roles.time_stamp)
violations.set_role(["violation_type_id", "risk_category"], getml.data.roles.categorical)
violations.set_role("description", getml.data.roles.text)
violations
name | date | business_id | violation_type_id | risk_category | description |
---|---|---|---|---|---|
role | time_stamp | join_key | categorical | categorical | text |
unit | time stamp, comparison only | ||||
0 | 2014-07-29 | 10 | 103129 | Moderate Risk | Insufficient hot water or runnin... |
1 | 2014-07-29 | 10 | 103144 | Low Risk | Unapproved or unmaintained equip... |
2 | 2014-01-14 | 10 | 103119 | Moderate Risk | Inadequate and inaccessible hand... |
3 | 2014-01-14 | 10 | 103145 | Low Risk | Improper storage of equipment ut... |
4 | 2014-01-14 | 10 | 103154 | Low Risk | Unclean or degraded floors walls... |
... | ... | ... | ... | ... | |
36045 | 2016-08-19 | 88878 | 103144 | Low Risk | Unapproved or unmaintained equip... |
36046 | 2016-08-19 | 88878 | 103124 | Moderate Risk | Inadequately cleaned or sanitize... |
36047 | 2016-09-22 | 89072 | 103120 | Moderate Risk | Moderate risk food holding tempe... |
36048 | 2016-09-22 | 89072 | 103131 | Moderate Risk | Moderate risk vermin infestation |
36049 | 2016-09-22 | 89072 | 103149 | Low Risk | Wiping cloths not clean or prope... |
36050 rows x 5 columns
memory usage: 2.59 MB
name: violations
type: getml.DataFrame
2. Predictive modeling¶
We have loaded the data and assigned the roles. Next, we split the data into a training and a test set and build a getML pipeline for relational learning.
split = getml.data.split.random(train=0.8, test=0.2)
2.1 Define relational model¶
star_schema = getml.data.StarSchema(population=inspections, alias="population", split=split)

# Static attributes of the business being inspected (many-to-one).
star_schema.join(
    businesses,
    on="business_id",
    relationship=getml.data.relationship.many_to_one,
)

# Violations recorded at least one day before the inspection
# (the one-day horizon prevents leakage from the inspection itself).
star_schema.join(
    violations,
    on="business_id",
    time_stamps="date",
    horizon=getml.data.time.days(1),
)

# Earlier inspections of the same business; lagged_targets=True allows
# their scores to be used as features, again shifted by one day.
star_schema.join(
    inspections,
    on="business_id",
    time_stamps="date",
    lagged_targets=True,
    horizon=getml.data.time.days(1),
)
star_schema
data frames | staging table | |
---|---|---|
0 | population, businesses | POPULATION__STAGING_TABLE_1 |
1 | inspections | INSPECTIONS__STAGING_TABLE_2 |
2 | violations | VIOLATIONS__STAGING_TABLE_3 |
subset | name | rows | type | |
---|---|---|---|---|
0 | test | inspections | 2492 | View |
1 | train | inspections | 10395 | View |
name | rows | type | |
---|---|---|---|
0 | businesses | 6358 | DataFrame |
1 | violations | 36050 | DataFrame |
2 | inspections | 12887 | DataFrame |
2.2 getML pipeline¶
Set up the feature learner & predictor
We use getML's FastProp feature learner with a squared loss for this regression problem, combined with an untuned XGBoostRegressor. We also include a Mapping preprocessor; note that, as the check below shows, Mapping is only supported in the enterprise edition of getML.
mapping = getml.preprocessors.Mapping()
fast_prop = getml.feature_learning.FastProp(
loss_function=getml.feature_learning.loss_functions.SquareLoss,
num_threads=1,
)
predictor = getml.predictors.XGBoostRegressor(n_jobs=1)
Build the pipeline
pipe1 = getml.pipeline.Pipeline(
tags=['fast_prop'],
data_model=star_schema.data_model,
preprocessors=[mapping],
feature_learners=[fast_prop],
predictors=[predictor]
)
pipe1
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=['businesses', 'inspections', 'violations'], predictors=['XGBoostRegressor'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['fast_prop'])
2.3 Model training¶
pipe1.check(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[17], line 1
----> 1 pipe1.check(star_schema.train)

File ~/Documents/github/getml-demo/.venv/lib/python3.11/site-packages/getml/pipeline/pipeline.py:1106, in Pipeline.check(self, population_table, peripheral_tables)
   1104 msg = comm.log(sock)
   1105 if msg != "Success!":
-> 1106     comm.handle_engine_exception(msg)
   1107 issues = Issues(comm.recv_issues(sock))
   1108 if len(issues) == 0:

File ~/Documents/github/getml-demo/.venv/lib/python3.11/site-packages/getml/exceptions.py:124, in handle_engine_exception(msg, extra)
    121 for handler in EngineExceptionHandlerRegistry.handlers:
    122     handler(msg, extra=extra)
--> 124 raise OSError(msg)

OSError: The Mapping preprocessor is not supported in the community edition. Please upgrade to getML enterprise to use this. An overview of what is supported in the community edition can be found in the official getML documentation.
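The Mapping preprocessor is an enterprise-only feature. If you are running the community edition, you could simply drop it and keep everything else unchanged; a minimal sketch of such a variant (hypothetical, not used for the benchmark numbers below) would be:
# Hypothetical community-edition variant: same data model, feature learner
# and predictor, but without the enterprise-only Mapping preprocessor.
pipe1_community = getml.pipeline.Pipeline(
    tags=["fast_prop", "community"],
    data_model=star_schema.data_model,
    feature_learners=[fast_prop],
    predictors=[predictor],
)
pipe1_community.check(star_schema.train)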
pipe1.fit(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 1 issues labeled INFO and 0 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 ⠼ Indexing text fields... 0% • 00:00
Indexing text fields... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Trying 104 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02
Trained pipeline.
Time taken: 0:00:03.109211.
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=['businesses', 'inspections', 'violations'], predictors=['XGBoostRegressor'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['fast_prop', 'container-4IQSR5'])
2.4 Model evaluation¶
fastprop_score = pipe1.score(star_schema.test)
fastprop_score
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | mae | rmse | rsquared | |
---|---|---|---|---|---|---|
0 | 2024-09-12 11:53:31 | train | score | 4.8865 | 6.5247 | 0.3608 |
1 | 2024-09-12 11:53:32 | test | score | 5.3218 | 7.0532 | 0.2889 |
2.5 featuretools¶
population_train_pd = star_schema.train.population.to_pandas()
population_test_pd = star_schema.test.population.to_pandas()
inspections_pd = inspections.drop(inspections.roles.unused).to_pandas()
violations_pd = violations.drop(violations.roles.unused).to_pandas()
businesses_pd = businesses.drop(businesses.roles.unused).to_pandas()
population_train_pd["id"] = population_train_pd.index
population_train_pd = population_train_pd.merge(
businesses_pd,
on="business_id"
)
population_train_pd
business_id | score | date | id | postal_code | tax_code | owner_zip | name | |
---|---|---|---|---|---|---|---|---|
0 | 10 | 92.0 | 2014-01-14 | 0 | 94104 | H24 | 94104 | Tiramisu Kitchen |
1 | 10 | 94.0 | 2014-07-29 | 1 | 94104 | H24 | 94104 | Tiramisu Kitchen |
2 | 10 | 82.0 | 2016-05-03 | 2 | 94104 | H24 | 94104 | Tiramisu Kitchen |
3 | 24 | 96.0 | 2014-06-12 | 3 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pantry |
4 | 24 | 96.0 | 2014-11-24 | 4 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pantry |
... | ... | ... | ... | ... | ... | ... | ... | ... |
10390 | 88878 | 94.0 | 2016-08-19 | 10390 | 94102 | H24 | 94566 | Jamba Juice |
10391 | 89072 | 90.0 | 2016-09-22 | 10391 | 94109 | H91 | 94109 | Epicurean at Sacred Heart Catholic Prep School |
10392 | 89198 | 100.0 | 2016-09-12 | 10392 | 94107 | H36 | 29615 | AT&T Park - Beer Cart/View Level, Sec. 333 |
10393 | 89199 | 100.0 | 2016-09-12 | 10393 | 94107 | H36 | 29615 | AT&T Park - Beer Cart/Lower CF, Sec. 140 |
10394 | 89200 | 100.0 | 2016-09-12 | 10394 | 94107 | H36 | 29615 | AT&T Park - Beer Cart/Lower CF, Sec. 142 |
10395 rows × 8 columns
population_test_pd["id"] = population_test_pd.index
population_test_pd = population_test_pd.merge(
businesses_pd,
on="business_id"
)
population_test_pd
business_id | score | date | id | postal_code | tax_code | owner_zip | name | |
---|---|---|---|---|---|---|---|---|
0 | 24 | 100.0 | 2013-11-18 | 0 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pantry |
1 | 24 | 96.0 | 2016-03-11 | 1 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pantry |
2 | 45 | 94.0 | 2013-12-09 | 2 | 94110 | H24 | 94114 | CHARLIE'S DELI CAFE |
3 | 58 | 78.0 | 2014-07-25 | 3 | 94111 | H24 | 94111 | Oasis Grill |
4 | 66 | 91.0 | 2014-05-19 | 4 | 94122 | H24 | 94122 | STARBUCKS |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2487 | 87802 | 91.0 | 2016-06-07 | 2487 | 94110 | H25 | 94110 | Bernal Heights Pizzeria |
2488 | 88082 | 84.0 | 2016-08-30 | 2488 | 94133 | H24 | 94133 | Chongqing Xiaomian |
2489 | 88447 | 96.0 | 2016-08-17 | 2489 | None | H91 | 94107 | Fare Resources |
2490 | 88702 | 96.0 | 2016-08-15 | 2490 | 94118 | H25 | 94118 | Dancing Bull |
2491 | 89204 | 100.0 | 2016-09-12 | 2491 | 94107 | H36 | 94107 | AT&T - Hol n Jam Cart/Upper CF, Sec. 142 |
2492 rows × 8 columns
def prepare_peripheral(violations_pd, train_or_test):
    """
    Helper function that imitates the behavior of
    the data model defined above. It is applied to
    both the violations and the inspections table.
    """
    violations_new = violations_pd.merge(
        train_or_test[["id", "business_id", "date"]],
        on="business_id"
    )
    # Keep only peripheral records that occurred strictly before the
    # inspection date, so no information from the future can leak in.
    violations_new = violations_new[
        violations_new["date_x"] < violations_new["date_y"]
    ]
    del violations_new["date_y"]
    del violations_new["business_id"]
    return violations_new.rename(columns={"date_x": "date"})
violations_train_pd = prepare_peripheral(violations_pd, population_train_pd)
violations_test_pd = prepare_peripheral(violations_pd, population_test_pd)
violations_train_pd
violation_type_id | risk_category | description | date | id | |
---|---|---|---|---|---|
2 | 103129 | Moderate Risk | Insufficient hot water or running water | 2014-07-29 | 2 |
5 | 103144 | Low Risk | Unapproved or unmaintained equipment or utensils | 2014-07-29 | 2 |
7 | 103119 | Moderate Risk | Inadequate and inaccessible handwashing facili... | 2014-01-14 | 1 |
8 | 103119 | Moderate Risk | Inadequate and inaccessible handwashing facili... | 2014-01-14 | 2 |
10 | 103145 | Low Risk | Improper storage of equipment utensils or linens | 2014-01-14 | 1 |
... | ... | ... | ... | ... | ... |
89220 | 103119 | Moderate Risk | Inadequate and inaccessible handwashing facili... | 2016-02-16 | 10290 |
89256 | 103131 | Moderate Risk | Moderate risk vermin infestation | 2016-04-04 | 10308 |
89336 | 103154 | Low Risk | Unclean or degraded floors walls or ceilings | 2016-04-11 | 10331 |
89338 | 103148 | Low Risk | No thermometers or uncalibrated thermometers | 2016-04-11 | 10331 |
89340 | 103144 | Low Risk | Unapproved or unmaintained equipment or utensils | 2016-04-11 | 10331 |
29004 rows × 5 columns
inspections_train_pd = prepare_peripheral(inspections_pd, population_train_pd)
inspections_test_pd = prepare_peripheral(inspections_pd, population_test_pd)
inspections_train_pd
score | date | id | |
---|---|---|---|
1 | 92.0 | 2014-01-14 | 1 |
2 | 92.0 | 2014-01-14 | 2 |
5 | 94.0 | 2014-07-29 | 2 |
9 | 100.0 | 2013-11-18 | 3 |
10 | 100.0 | 2013-11-18 | 4 |
... | ... | ... | ... |
32628 | 92.0 | 2016-02-16 | 10290 |
32648 | 96.0 | 2016-04-04 | 10308 |
32673 | 94.0 | 2016-04-11 | 10331 |
32707 | 100.0 | 2016-05-23 | 10360 |
32738 | 100.0 | 2016-08-17 | 10389 |
11190 rows × 3 columns
del population_train_pd["business_id"]
del population_test_pd["business_id"]
population_train_pd
score | date | id | postal_code | tax_code | owner_zip | name | |
---|---|---|---|---|---|---|---|
0 | 92.0 | 2014-01-14 | 0 | 94104 | H24 | 94104 | Tiramisu Kitchen |
1 | 94.0 | 2014-07-29 | 1 | 94104 | H24 | 94104 | Tiramisu Kitchen |
2 | 82.0 | 2016-05-03 | 2 | 94104 | H24 | 94104 | Tiramisu Kitchen |
3 | 96.0 | 2014-06-12 | 3 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pantry |
4 | 96.0 | 2014-11-24 | 4 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pantry |
... | ... | ... | ... | ... | ... | ... | ... |
10390 | 94.0 | 2016-08-19 | 10390 | 94102 | H24 | 94566 | Jamba Juice |
10391 | 90.0 | 2016-09-22 | 10391 | 94109 | H91 | 94109 | Epicurean at Sacred Heart Catholic Prep School |
10392 | 100.0 | 2016-09-12 | 10392 | 94107 | H36 | 29615 | AT&T Park - Beer Cart/View Level, Sec. 333 |
10393 | 100.0 | 2016-09-12 | 10393 | 94107 | H36 | 29615 | AT&T Park - Beer Cart/Lower CF, Sec. 140 |
10394 | 100.0 | 2016-09-12 | 10394 | 94107 | H36 | 29615 | AT&T Park - Beer Cart/Lower CF, Sec. 142 |
10395 rows × 7 columns
def add_index(df):
    # featuretools (via woodwork) requires a unique index column for every dataframe.
    df.insert(0, "index", range(len(df)))
population_pd_logical_types = {
"id": ww.logical_types.Integer,
"score": ww.logical_types.Integer,
"date": ww.logical_types.Datetime,
"postal_code": ww.logical_types.Categorical,
"tax_code": ww.logical_types.Categorical,
"owner_zip": ww.logical_types.Categorical,
"name": ww.logical_types.Categorical
}
population_train_pd.ww.init(logical_types=population_pd_logical_types, index="id", name="population")
population_test_pd.ww.init(logical_types=population_pd_logical_types, index="id", name="population")
add_index(inspections_train_pd)
add_index(inspections_test_pd)
inspections_pd_logical_types = {
"index": ww.logical_types.Integer,
"score": ww.logical_types.Integer,
"date": ww.logical_types.Datetime,
"id": ww.logical_types.Integer
}
inspections_train_pd.ww.init(logical_types=inspections_pd_logical_types, index="index", name="inspections")
inspections_test_pd.ww.init(logical_types=inspections_pd_logical_types, index="index", name="inspections")
add_index(violations_train_pd)
add_index(violations_test_pd)
violations_pd_logical_types = {
"index": ww.logical_types.Integer,
"violation_type_id": ww.logical_types.Categorical,
"risk_category": ww.logical_types.Categorical,
"description": ww.logical_types.Categorical,
"date": ww.logical_types.Datetime,
"id": ww.logical_types.Integer
}
violations_train_pd.ww.init(logical_types=violations_pd_logical_types, index="index", name="violations")
violations_test_pd.ww.init(logical_types=violations_pd_logical_types, index="index", name="violations")
dataframes_train = {
"population" : (population_train_pd, ),
"inspections" : (inspections_train_pd, ),
"violations" : (violations_train_pd, )
}
dataframes_test = {
"population" : (population_test_pd, ),
"inspections" : (inspections_test_pd, ),
"violations" : (violations_test_pd, )
}
relationships = [
("population", "id", "inspections", "id"),
("population", "id", "violations", "id")
]
featuretools_train_pd = featuretools.dfs(
dataframes=dataframes_train,
relationships=relationships,
target_dataframe_name="population")[0]
featuretools_test_pd = featuretools.dfs(
dataframes=dataframes_test,
relationships=relationships,
target_dataframe_name="population")[0]
featuretools_train = getml.data.DataFrame.from_pandas(featuretools_train_pd, "featuretools_train")
featuretools_test = getml.data.DataFrame.from_pandas(featuretools_test_pd, "featuretools_test")
featuretools_train.set_role("score", getml.data.roles.target)
featuretools_train.set_role(featuretools_train.roles.unused_float, getml.data.roles.numerical)
featuretools_train.set_role(featuretools_train.roles.unused_string, getml.data.roles.categorical)
featuretools_train
name | score | postal_code | tax_code | owner_zip | name | COUNT(inspections) | COUNT(violations) | MODE(violations.description) | MODE(violations.risk_category) | MODE(violations.violation_type_id) | NUM_UNIQUE(violations.description) | NUM_UNIQUE(violations.risk_category) | NUM_UNIQUE(violations.violation_type_id) | DAY(date) | MONTH(date) | WEEKDAY(date) | YEAR(date) | MODE(inspections.DAY(date)) | MODE(inspections.MONTH(date)) | MODE(inspections.WEEKDAY(date)) | MODE(inspections.YEAR(date)) | NUM_UNIQUE(inspections.DAY(date)) | NUM_UNIQUE(inspections.MONTH(date)) | NUM_UNIQUE(inspections.WEEKDAY(date)) | NUM_UNIQUE(inspections.YEAR(date)) | MODE(violations.DAY(date)) | MODE(violations.MONTH(date)) | MODE(violations.WEEKDAY(date)) | MODE(violations.YEAR(date)) | NUM_UNIQUE(violations.DAY(date)) | NUM_UNIQUE(violations.MONTH(date)) | NUM_UNIQUE(violations.WEEKDAY(date)) | NUM_UNIQUE(violations.YEAR(date)) | MAX(inspections.score) | MEAN(inspections.score) | MIN(inspections.score) | SKEW(inspections.score) | STD(inspections.score) | SUM(inspections.score) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | target | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | numerical | numerical |
0 | 92 | 94104 | H24 | 94104 | Tiramisu Kitchen | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 14 | 1 | 1 | 2014 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
1 | 94 | 94104 | H24 | 94104 | Tiramisu Kitchen | 1 | 3 | Improper storage of equipment ut... | Low Risk | 103119 | 3 | 2 | 3 | 29 | 7 | 1 | 2014 | 14 | 1 | 1 | 2014 | 1 | 1 | 1 | 1 | 14 | 1 | 1 | 2014 | 1 | 1 | 1 | 1 | 92 | 92 | 92 | nan | nan | 92 |
2 | 82 | 94104 | H24 | 94104 | Tiramisu Kitchen | 2 | 5 | Improper storage of equipment ut... | Low Risk | 103119 | 5 | 2 | 5 | 3 | 5 | 1 | 2016 | 14 | 1 | 1 | 2014 | 2 | 2 | 1 | 1 | 14 | 1 | 1 | 2014 | 2 | 2 | 1 | 1 | 94 | 93 | 92 | nan | 1.4142 | 186 |
3 | 96 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pant... | 1 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 12 | 6 | 3 | 2014 | 18 | 11 | 0 | 2013 | 1 | 1 | 1 | 1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 100 | 100 | 100 | nan | nan | 100 |
4 | 96 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pant... | 2 | 2 | Improper storage of equipment ut... | Low Risk | 103145 | 2 | 1 | 2 | 24 | 11 | 0 | 2014 | 12 | 6 | 0 | 2013 | 2 | 2 | 2 | 2 | 12 | 6 | 3 | 2014 | 1 | 1 | 1 | 1 | 100 | 98 | 96 | nan | 2.8284 | 196 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
10390 | 94 | 94102 | H24 | 94566 | Jamba Juice | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 19 | 8 | 4 | 2016 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
10391 | 90 | 94109 | H91 | 94109 | Epicurean at Sacred Heart Cathol... | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 22 | 9 | 3 | 2016 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
10392 | 100 | 94107 | H36 | 29615 | AT&T Park - Beer Cart/View Level... | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 12 | 9 | 0 | 2016 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
10393 | 100 | 94107 | H36 | 29615 | AT&T Park - Beer Cart/Lower CF, ... | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 12 | 9 | 0 | 2016 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
10394 | 100 | 94107 | H36 | 29615 | AT&T Park - Beer Cart/Lower CF, ... | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 12 | 9 | 0 | 2016 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
10395 rows x 39 columns
memory usage: 1.91 MB
name: featuretools_train
type: getml.DataFrame
featuretools_test.set_role("score", getml.data.roles.target)
featuretools_test.set_role(featuretools_test.roles.unused_float, getml.data.roles.numerical)
featuretools_test.set_role(featuretools_test.roles.unused_string, getml.data.roles.categorical)
featuretools_test
name | score | postal_code | tax_code | owner_zip | name | COUNT(inspections) | COUNT(violations) | MODE(violations.description) | MODE(violations.risk_category) | MODE(violations.violation_type_id) | NUM_UNIQUE(violations.description) | NUM_UNIQUE(violations.risk_category) | NUM_UNIQUE(violations.violation_type_id) | DAY(date) | MONTH(date) | WEEKDAY(date) | YEAR(date) | MODE(inspections.DAY(date)) | MODE(inspections.MONTH(date)) | MODE(inspections.WEEKDAY(date)) | MODE(inspections.YEAR(date)) | NUM_UNIQUE(inspections.DAY(date)) | NUM_UNIQUE(inspections.MONTH(date)) | NUM_UNIQUE(inspections.WEEKDAY(date)) | NUM_UNIQUE(inspections.YEAR(date)) | MODE(violations.DAY(date)) | MODE(violations.MONTH(date)) | MODE(violations.WEEKDAY(date)) | MODE(violations.YEAR(date)) | NUM_UNIQUE(violations.DAY(date)) | NUM_UNIQUE(violations.MONTH(date)) | NUM_UNIQUE(violations.WEEKDAY(date)) | NUM_UNIQUE(violations.YEAR(date)) | MAX(inspections.score) | MEAN(inspections.score) | MIN(inspections.score) | SKEW(inspections.score) | STD(inspections.score) | SUM(inspections.score) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | target | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | numerical | numerical |
0 | 100 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pant... | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 18 | 11 | 0 | 2013 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
1 | 96 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pant... | 3 | 3 | Improper storage of equipment ut... | Low Risk | 103119 | 3 | 2 | 3 | 11 | 3 | 4 | 2016 | 12 | 11 | 0 | 2014 | 3 | 2 | 2 | 2 | 12 | 6 | 3 | 2014 | 2 | 2 | 2 | 1 | 100 | 97.3333 | 96 | 1.7321 | 2.3094 | 292 |
2 | 94 | 94110 | H24 | 94114 | CHARLIE'S DELI CAFE | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 9 | 12 | 0 | 2013 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
3 | 78 | 94111 | H24 | 94111 | Oasis Grill | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 25 | 7 | 4 | 2014 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
4 | 91 | 94122 | H24 | 94122 | STARBUCKS | 1 | 1 | Wiping cloths not clean or prope... | Low Risk | 103149 | 1 | 1 | 1 | 19 | 5 | 0 | 2014 | 10 | 2 | 0 | 2014 | 1 | 1 | 1 | 1 | 10 | 2 | 0 | 2014 | 1 | 1 | 1 | 1 | 98 | 98 | 98 | nan | nan | 98 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
2487 | 91 | 94110 | H25 | 94110 | Bernal Heights Pizzeria | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 7 | 6 | 1 | 2016 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
2488 | 84 | 94133 | H24 | 94133 | Chongqing Xiaomian | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 30 | 8 | 1 | 2016 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
2489 | 96 | NULL | H91 | 94107 | Fare Resources | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 17 | 8 | 2 | 2016 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
2490 | 96 | 94118 | H25 | 94118 | Dancing Bull | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 15 | 8 | 0 | 2016 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
2491 | 100 | 94107 | H36 | 94107 | AT&T - Hol n Jam Cart/Upper CF, ... | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 12 | 9 | 0 | 2016 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
2492 rows x 39 columns
memory usage: 0.46 MB
name: featuretools_test
type: getml.DataFrame
We train an untuned XGBoostRegressor on top of featuretools' features, just as we did for getML's features.
Since some of featuretools' features are categorical, we allow the pipeline to include them as well. Other features contain NaN values, which is why we also apply getML's Imputation preprocessor.
imputation = getml.preprocessors.Imputation()
predictor = getml.predictors.XGBoostRegressor(n_jobs=1)
pipe2 = getml.pipeline.Pipeline(
tags=['featuretools'],
preprocessors=[imputation],
predictors=[predictor],
include_categorical=True,
)
pipe2
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=True, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=['Imputation'], share_selected_features=0.5, tags=['featuretools'])
pipe2.fit(featuretools_train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 1 issues labeled INFO and 1 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:12
Trained pipeline.
Time taken: 0:00:12.224273.
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=True, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=['Imputation'], share_selected_features=0.5, tags=['featuretools'])
featuretools_score = pipe2.score(featuretools_test)
featuretools_score
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | mae | rmse | rsquared | |
---|---|---|---|---|---|---|
0 | 2024-09-12 11:53:55 | featuretools_train | score | 5.1359 | 6.7501 | 0.321 |
1 | 2024-09-12 11:53:55 | featuretools_test | score | 5.4491 | 7.1941 | 0.2626 |
2.6 Features¶
The most important feature looks as follows:
pipe1.features.to_sql()[pipe1.features.sort(by="importances")[0].name]
DROP TABLE IF EXISTS "FEATURE_1_46";
CREATE TABLE "FEATURE_1_46" AS
SELECT COUNT( t1."date" - t2."date" ) - COUNT( DISTINCT t1."date" - t2."date" ) AS "feature_1_46",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "VIOLATIONS__STAGING_TABLE_3" t2
ON t1."business_id" = t2."business_id"
WHERE t2."date__1_000000_days" <= t1."date"
GROUP BY t1.rowid;
2.7 Productionization¶
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3 and spark modules.
# Creates a folder named sfscores_pipeline containing
# the SQL code.
pipe1.features.to_sql().save("sfscores_pipeline")
pipe1.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("sfscores_spark")
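As a rough sketch of how the transpiled features could be consumed downstream, the saved scripts can be executed against any SQLite database that already contains the staging tables (the file layout of the sfscores_pipeline folder and the database name sfscores.db are assumptions for illustration):
# Hedged sketch: run the transpiled feature scripts with Python's built-in
# sqlite3 module. Assumes the staging tables already exist in sfscores.db
# and that the saved folder contains one .sql script per feature.
import pathlib
import sqlite3

db = sqlite3.connect("sfscores.db")
for script in sorted(pathlib.Path("sfscores_pipeline").glob("*.sql")):
    db.executescript(script.read_text())
db.commit()
db.close()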
2.8 Discussion¶
For a more convenient overview, we summarize our results in a table:
scores = [fastprop_score, featuretools_score]
pd.DataFrame(data={
'Name': ['getML: FastProp', 'featuretools'],
'R-squared': [f'{score.rsquared:.1%}' for score in scores],
'RMSE': [f'{score.rmse:,.2f}' for score in scores],
'MAE': [f'{score.mae:,.2f}' for score in scores]
})
Name | R-squared | RMSE | MAE | |
---|---|---|---|---|
0 | getML: FastProp | 28.9% | 7.05 | 5.32 |
1 | featuretools | 26.3% | 7.19 | 5.45 |
getml.engine.shutdown()
As we can see, getML's FastProp outperforms featuretools according to all three measures.
3. Conclusion¶
We have benchmarked getML against featuretools on a dataset of health inspections of eateries in San Francisco. getML's FastProp outperformed featuretools on all three error measures.
References¶
Motl, Jan, and Oliver Schulte. "The CTU Prague Relational Learning Repository." arXiv preprint arXiv:1511.03086 (2015).