SFScores - Predicting health inspection scores of restaurants¶
In this notebook, we benchmark getML's FastProp feature learning algorithm against featuretools using a dataset of eateries in San Francisco.
Summary:
- Prediction type: Regression model
- Domain: Health
- Prediction target: Health inspection score
- Population size: 12887
Background¶
This notebook is based on the San Francisco Department of Public Health's database of eateries in the city. These eateries are inspected regularly, and the inspections often result in a numeric score.
The challenge is to predict the score resulting from an inspection.
The dataset has been downloaded from the CTU Prague Relational Learning Repository (Motl and Schulte, 2015), which now resides at relational-data.org.
We will benchmark getML's feature learning algorithms against featuretools, an open-source implementation of the propositionalization algorithm, similar to getML's FastProp.
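To build intuition for what propositionalization means, here is a minimal, purely illustrative pandas sketch (the table contents are made up and it is not part of the benchmark): it collapses a one-to-many table such as violations into per-business aggregate columns that could then be joined onto the population table.
import pandas as pd
# Illustrative toy data: several violation records per business.
violations_example = pd.DataFrame({
    "business_id": [10, 10, 10, 24],
    "risk_category": ["Low Risk", "Moderate Risk", "Low Risk", "Low Risk"],
})
# Propositionalization in a nutshell: aggregate the many rows per business
# into a single row of candidate features.
features_example = violations_example.groupby("business_id").agg(
    num_violations=("risk_category", "size"),
    num_distinct_risk_categories=("risk_category", "nunique"),
)
print(features_example)
FastProp and featuretools generate large numbers of such aggregations automatically, across all joined tables and within the time windows defined by the data model.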
Analysis¶
Let's get started with the analysis and set up our session:
%pip install -q "getml==1.5.0" "featuretools==1.31.0" "ipywidgets==8.1.5"
import os
import warnings
import pandas as pd
import featuretools
import woodwork as ww
import getml
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
warnings.simplefilter(action='ignore', category=FutureWarning)
print(f"getML API version: {getml.__version__}\n")
getML API version: 1.5.0
getml.engine.launch(allow_remote_ips=True, token='token')
getml.engine.set_project('sfscores')
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-community-edition-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912213838.log.
Connected to project 'sfscores'.
1. Loading data¶
1.1 Download from source¶
We begin by connecting to the repository's MySQL database and loading the tables:
conn = getml.database.connect_mysql(
host="db.relational-data.org",
dbname="SFScores",
port=3306,
user="guest",
password="relational"
)
conn
Connection(dbname='SFScores', dialect='mysql', host='db.relational-data.org', port=3306)
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if not getml.data.exists(name):
        data_frame = getml.data.DataFrame.from_db(
            name=name,
            table_name=name,
            conn=conn
        )
        data_frame.save()
    else:
        data_frame = getml.data.load_data_frame(name)
    return data_frame
businesses = load_if_needed("businesses")
inspections = load_if_needed("inspections")
violations = load_if_needed("violations")
businesses
name | business_id | latitude | longitude | phone_number | business_certificate | name | address | city | postal_code | tax_code | application_date | owner_name | owner_address | owner_city | owner_state | owner_zip |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
0 | 10 | 37.7911 | -122.404 | nan | 779059 | Tiramisu Kitchen | 033 Belden Pl | San Francisco | 94104 | H24 | NULL | Tiramisu LLC | 33 Belden St | San Francisco | CA | 94104 |
1 | 24 | 37.7929 | -122.403 | nan | 352312 | OMNI S.F. Hotel - 2nd Floor Pant... | 500 California St, 2nd Floor | San Francisco | 94104 | H24 | NULL | OMNI San Francisco Hotel Corp | 500 California St, 2nd Floor | San Francisco | CA | 94104 |
2 | 31 | 37.8072 | -122.419 | nan | 346882 | Norman's Ice Cream and Freezes | 2801 Leavenworth St | San Francisco | 94133 | H24 | NULL | Norman Antiforda | 2801 Leavenworth St | San Francisco | CA | 94133 |
3 | 45 | 37.7471 | -122.414 | nan | 340024 | CHARLIE'S DELI CAFE | 3202 FOLSOM St | S.F. | 94110 | H24 | 2001-10-10 | HARB, CHARLES AND KRISTIN | 1150 SANCHEZ | S.F. | CA | 94114 |
4 | 48 | 37.764 | -122.466 | nan | 318022 | ART'S CAFE | 747 IRVING St | SAN FRANCISCO | 94122 | H24 | NULL | YOON HAE RYONG | 1567 FUNSTON AVE | SAN FRANCISCO | CA | 94122 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
6353 | 89335 | nan | nan | nan | 1057025 | Breaking Bad Sandwiches | 154 McAllister St | NULL | 94102 | H25 | 2016-09-23 | JPMD, LLC | 662 Bellhurst Lane | Castro Valley | CA | 94102 |
6354 | 89336 | nan | nan | nan | 1057746 | Miller's Rest | 1085 Sutter St | NULL | 94109 | H26 | 2016-09-23 | Miller's Rest, LLC | 2906 Bush Street | San Francisco | CA | 94109 |
6355 | 89393 | nan | nan | nan | 1042408 | Panuchos | 620 Broadway St | NULL | 94133 | H24 | 2016-09-28 | Los Aluxes, LLC | 1032 Irving Street, #421 | San Francisco | CA | 94122 |
6356 | 89416 | nan | nan | nan | 1051081 | Nobhill Pizza & Shawerma | 1534 California St | NULL | 94109 | H24 | 2016-09-29 | BBA Foods, Inc. | 840 Post Street, #218 | San Francisco | CA | 94109 |
6357 | 89453 | nan | nan | nan | 459309 | Burger King #4668 | 1690 Valencia St | San Francisco | 94110 | H29 | 2016-10-03 | Golden Gate Restaurant Group, In... | P.O Box 21 | Lafeyette | CA | 94549 |
6358 rows x 16 columns
memory usage: 1.57 MB
name: businesses
type: getml.DataFrame
inspections
name | business_id | score | date | type |
---|---|---|---|---|
role | unused_float | unused_float | unused_string | unused_string |
0 | 10 | 92 | 2014-01-14 | Routine - Unscheduled |
1 | 10 | nan | 2014-01-24 | Reinspection/Followup |
2 | 10 | 94 | 2014-07-29 | Routine - Unscheduled |
3 | 10 | nan | 2014-08-07 | Reinspection/Followup |
4 | 10 | 82 | 2016-05-03 | Routine - Unscheduled |
... | ... | ... | ... | |
23759 | 89199 | 100 | 2016-09-12 | Routine - Unscheduled |
23760 | 89200 | 100 | 2016-09-12 | Routine - Unscheduled |
23761 | 89201 | nan | 2016-09-12 | New Ownership |
23762 | 89204 | 100 | 2016-09-12 | Routine - Unscheduled |
23763 | 89296 | nan | 2016-09-30 | New Ownership |
23764 rows x 4 columns
memory usage: 1.51 MB
name: inspections
type: getml.DataFrame
violations
name | business_id | date | violation_type_id | risk_category | description |
---|---|---|---|---|---|
role | unused_float | unused_string | unused_string | unused_string | unused_string |
0 | 10 | 2014-07-29 | 103129 | Moderate Risk | Insufficient hot water or runnin... |
1 | 10 | 2014-07-29 | 103144 | Low Risk | Unapproved or unmaintained equip... |
2 | 10 | 2014-01-14 | 103119 | Moderate Risk | Inadequate and inaccessible hand... |
3 | 10 | 2014-01-14 | 103145 | Low Risk | Improper storage of equipment ut... |
4 | 10 | 2014-01-14 | 103154 | Low Risk | Unclean or degraded floors walls... |
... | ... | ... | ... | ... | |
36045 | 88878 | 2016-08-19 | 103144 | Low Risk | Unapproved or unmaintained equip... |
36046 | 88878 | 2016-08-19 | 103124 | Moderate Risk | Inadequately cleaned or sanitize... |
36047 | 89072 | 2016-09-22 | 103120 | Moderate Risk | Moderate risk food holding tempe... |
36048 | 89072 | 2016-09-22 | 103131 | Moderate Risk | Moderate risk vermin infestation |
36049 | 89072 | 2016-09-22 | 103149 | Low Risk | Wiping cloths not clean or prope... |
36050 rows x 5 columns
memory usage: 4.06 MB
name: violations
type: getml.DataFrame
1.2 Prepare data for getML¶
getML requires that we assign a role to each column. Roles such as join_key, target, time_stamp, categorical, and text tell the Engine how a column may be used during feature learning; columns left with an unused role are ignored.
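As a quick, purely illustrative sketch of the role system (the actual assignments for this dataset follow below), the role names are constants in getml.data.roles, and the current assignment of a data frame can be inspected through its roles attribute:
# Illustrative only: role constants and the roles accessor.
print(getml.data.roles.join_key, getml.data.roles.target, getml.data.roles.time_stamp)
# Columns that have not been assigned a role yet are listed as unused.
print(businesses.roles.unused)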
businesses.set_role("business_id", getml.data.roles.join_key)
businesses.set_role("name", getml.data.roles.text)
businesses.set_role(["postal_code", "tax_code", "owner_zip"], getml.data.roles.categorical)
businesses
name | business_id | postal_code | tax_code | owner_zip | name | latitude | longitude | phone_number | business_certificate | address | city | application_date | owner_name | owner_address | owner_city | owner_state |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | join_key | categorical | categorical | categorical | text | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
0 | 10 | 94104 | H24 | 94104 | Tiramisu Kitchen | 37.7911 | -122.404 | nan | 779059 | 033 Belden Pl | San Francisco | NULL | Tiramisu LLC | 33 Belden St | San Francisco | CA |
1 | 24 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pant... | 37.7929 | -122.403 | nan | 352312 | 500 California St, 2nd Floor | San Francisco | NULL | OMNI San Francisco Hotel Corp | 500 California St, 2nd Floor | San Francisco | CA |
2 | 31 | 94133 | H24 | 94133 | Norman's Ice Cream and Freezes | 37.8072 | -122.419 | nan | 346882 | 2801 Leavenworth St | San Francisco | NULL | Norman Antiforda | 2801 Leavenworth St | San Francisco | CA |
3 | 45 | 94110 | H24 | 94114 | CHARLIE'S DELI CAFE | 37.7471 | -122.414 | nan | 340024 | 3202 FOLSOM St | S.F. | 2001-10-10 | HARB, CHARLES AND KRISTIN | 1150 SANCHEZ | S.F. | CA |
4 | 48 | 94122 | H24 | 94122 | ART'S CAFE | 37.764 | -122.466 | nan | 318022 | 747 IRVING St | SAN FRANCISCO | NULL | YOON HAE RYONG | 1567 FUNSTON AVE | SAN FRANCISCO | CA |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
6353 | 89335 | 94102 | H25 | 94102 | Breaking Bad Sandwiches | nan | nan | nan | 1057025 | 154 McAllister St | NULL | 2016-09-23 | JPMD, LLC | 662 Bellhurst Lane | Castro Valley | CA |
6354 | 89336 | 94109 | H26 | 94109 | Miller's Rest | nan | nan | nan | 1057746 | 1085 Sutter St | NULL | 2016-09-23 | Miller's Rest, LLC | 2906 Bush Street | San Francisco | CA |
6355 | 89393 | 94133 | H24 | 94122 | Panuchos | nan | nan | nan | 1042408 | 620 Broadway St | NULL | 2016-09-28 | Los Aluxes, LLC | 1032 Irving Street, #421 | San Francisco | CA |
6356 | 89416 | 94109 | H24 | 94109 | Nobhill Pizza & Shawerma | nan | nan | nan | 1051081 | 1534 California St | NULL | 2016-09-29 | BBA Foods, Inc. | 840 Post Street, #218 | San Francisco | CA |
6357 | 89453 | 94110 | H29 | 94549 | Burger King #4668 | nan | nan | nan | 459309 | 1690 Valencia St | San Francisco | 2016-10-03 | Golden Gate Restaurant Group, In... | P.O Box 21 | Lafeyette | CA |
6358 rows x 16 columns
memory usage: 1.36 MB
name: businesses
type: getml.DataFrame
inspections = inspections[~inspections.score.is_nan()].to_df("inspections")
inspections.set_role("business_id", getml.data.roles.join_key)
inspections.set_role("score", getml.data.roles.target)
inspections.set_role("date", getml.data.roles.time_stamp)
inspections
name | date | business_id | score | type |
---|---|---|---|---|
role | time_stamp | join_key | target | unused_string |
unit | time stamp, comparison only | |||
0 | 2014-01-14 | 10 | 92 | Routine - Unscheduled |
1 | 2014-07-29 | 10 | 94 | Routine - Unscheduled |
2 | 2016-05-03 | 10 | 82 | Routine - Unscheduled |
3 | 2013-11-18 | 24 | 100 | Routine - Unscheduled |
4 | 2014-06-12 | 24 | 96 | Routine - Unscheduled |
... | ... | ... | ... | |
12882 | 2016-09-22 | 89072 | 90 | Routine - Unscheduled |
12883 | 2016-09-12 | 89198 | 100 | Routine - Unscheduled |
12884 | 2016-09-12 | 89199 | 100 | Routine - Unscheduled |
12885 | 2016-09-12 | 89200 | 100 | Routine - Unscheduled |
12886 | 2016-09-12 | 89204 | 100 | Routine - Unscheduled |
12887 rows x 4 columns
memory usage: 0.64 MB
name: inspections
type: getml.DataFrame
violations.set_role("business_id", getml.data.roles.join_key)
violations.set_role("date", getml.data.roles.time_stamp)
violations.set_role(["violation_type_id", "risk_category"], getml.data.roles.categorical)
violations.set_role("description", getml.data.roles.text)
violations
name | date | business_id | violation_type_id | risk_category | description |
---|---|---|---|---|---|
role | time_stamp | join_key | categorical | categorical | text |
unit | time stamp, comparison only | ||||
0 | 2014-07-29 | 10 | 103129 | Moderate Risk | Insufficient hot water or runnin... |
1 | 2014-07-29 | 10 | 103144 | Low Risk | Unapproved or unmaintained equip... |
2 | 2014-01-14 | 10 | 103119 | Moderate Risk | Inadequate and inaccessible hand... |
3 | 2014-01-14 | 10 | 103145 | Low Risk | Improper storage of equipment ut... |
4 | 2014-01-14 | 10 | 103154 | Low Risk | Unclean or degraded floors walls... |
... | ... | ... | ... | ... | |
36045 | 2016-08-19 | 88878 | 103144 | Low Risk | Unapproved or unmaintained equip... |
36046 | 2016-08-19 | 88878 | 103124 | Moderate Risk | Inadequately cleaned or sanitize... |
36047 | 2016-09-22 | 89072 | 103120 | Moderate Risk | Moderate risk food holding tempe... |
36048 | 2016-09-22 | 89072 | 103131 | Moderate Risk | Moderate risk vermin infestation |
36049 | 2016-09-22 | 89072 | 103149 | Low Risk | Wiping cloths not clean or prope... |
36050 rows x 5 columns
memory usage: 2.59 MB
name: violations
type: getml.DataFrame
2. Predictive modeling¶
We have loaded the data and assigned the roles. Next, we split the data into a training and a test set and build a getML pipeline for relational learning.
split = getml.data.split.random(train=0.8, test=0.2)
2.1 Define relational model¶
star_schema = getml.data.StarSchema(population=inspections, alias="population", split=split)

# Static attributes of the business being inspected (many-to-one).
star_schema.join(
    businesses,
    on="business_id",
    relationship=getml.data.relationship.many_to_one,
)

# Violations recorded at least one day before the inspection
# (the one-day horizon prevents leakage from the inspection itself).
star_schema.join(
    violations,
    on="business_id",
    time_stamps="date",
    horizon=getml.data.time.days(1),
)

# Earlier inspections of the same business; lagged_targets=True allows
# their scores to be used as features, again shifted by one day.
star_schema.join(
    inspections,
    on="business_id",
    time_stamps="date",
    lagged_targets=True,
    horizon=getml.data.time.days(1),
)
star_schema
data frames | staging table | |
---|---|---|
0 | population, businesses | POPULATION__STAGING_TABLE_1 |
1 | inspections | INSPECTIONS__STAGING_TABLE_2 |
2 | violations | VIOLATIONS__STAGING_TABLE_3 |
subset | name | rows | type | |
---|---|---|---|---|
0 | test | inspections | 2492 | View |
1 | train | inspections | 10395 | View |
name | rows | type | |
---|---|---|---|
0 | businesses | 6358 | DataFrame |
1 | violations | 36050 | DataFrame |
2 | inspections | 12887 | DataFrame |
2.2 getML pipeline¶
Set up the feature learner & predictor
We use getML's FastProp feature learner with a squared loss for this regression problem, combined with an untuned XGBoostRegressor. We also include a Mapping preprocessor; note that, as the check below shows, Mapping is only supported in the enterprise edition of getML.
mapping = getml.preprocessors.Mapping()
fast_prop = getml.feature_learning.FastProp(
loss_function=getml.feature_learning.loss_functions.SquareLoss,
num_threads=1,
)
predictor = getml.predictors.XGBoostRegressor(n_jobs=1)
Build the pipeline
pipe1 = getml.pipeline.Pipeline(
tags=['fast_prop'],
data_model=star_schema.data_model,
preprocessors=[mapping],
feature_learners=[fast_prop],
predictors=[predictor]
)
pipe1
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=['businesses', 'inspections', 'violations'], predictors=['XGBoostRegressor'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['fast_prop'])
2.3 Model training¶
pipe1.check(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[17], line 1
----> 1 pipe1.check(star_schema.train)

File ~/Documents/github/getml-demo/.venv/lib/python3.11/site-packages/getml/pipeline/pipeline.py:1106, in Pipeline.check(self, population_table, peripheral_tables)
   1104 msg = comm.log(sock)
   1105 if msg != "Success!":
-> 1106     comm.handle_engine_exception(msg)
   1107 issues = Issues(comm.recv_issues(sock))
   1108 if len(issues) == 0:

File ~/Documents/github/getml-demo/.venv/lib/python3.11/site-packages/getml/exceptions.py:124, in handle_engine_exception(msg, extra)
    121 for handler in EngineExceptionHandlerRegistry.handlers:
    122     handler(msg, extra=extra)
--> 124 raise OSError(msg)

OSError: The Mapping preprocessor is not supported in the community edition. Please upgrade to getML enterprise to use this. An overview of what is supported in the community edition can be found in the official getML documentation.
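The Mapping preprocessor is an enterprise-only feature. If you are running the community edition, you could simply drop it and keep everything else unchanged; a minimal sketch of such a variant (hypothetical, not used for the benchmark numbers below) would be:
# Hypothetical community-edition variant: same data model, feature learner
# and predictor, but without the enterprise-only Mapping preprocessor.
pipe1_community = getml.pipeline.Pipeline(
    tags=["fast_prop", "community"],
    data_model=star_schema.data_model,
    feature_learners=[fast_prop],
    predictors=[predictor],
)
pipe1_community.check(star_schema.train)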
pipe1.fit(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 1 issues labeled INFO and 0 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 ⠼ Indexing text fields... 0% • 00:00
Indexing text fields... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Trying 104 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02
Trained pipeline.
Time taken: 0:00:03.109211.
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=['businesses', 'inspections', 'violations'], predictors=['XGBoostRegressor'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['fast_prop', 'container-4IQSR5'])
2.4 Model evaluation¶
fastprop_score = pipe1.score(star_schema.test)
fastprop_score
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | mae | rmse | rsquared | |
---|---|---|---|---|---|---|
0 | 2024-09-12 11:53:31 | train | score | 4.8865 | 6.5247 | 0.3608 |
1 | 2024-09-12 11:53:32 | test | score | 5.3218 | 7.0532 | 0.2889 |
2.5 featuretools¶
population_train_pd = star_schema.train.population.to_pandas()
population_test_pd = star_schema.test.population.to_pandas()
inspections_pd = inspections.drop(inspections.roles.unused).to_pandas()
violations_pd = violations.drop(violations.roles.unused).to_pandas()
businesses_pd = businesses.drop(businesses.roles.unused).to_pandas()
population_train_pd["id"] = population_train_pd.index
population_train_pd = population_train_pd.merge(
businesses_pd,
on="business_id"
)
population_train_pd
business_id | score | date | id | postal_code | tax_code | owner_zip | name | |
---|---|---|---|---|---|---|---|---|
0 | 10 | 92.0 | 2014-01-14 | 0 | 94104 | H24 | 94104 | Tiramisu Kitchen |
1 | 10 | 94.0 | 2014-07-29 | 1 | 94104 | H24 | 94104 | Tiramisu Kitchen |
2 | 10 | 82.0 | 2016-05-03 | 2 | 94104 | H24 | 94104 | Tiramisu Kitchen |
3 | 24 | 96.0 | 2014-06-12 | 3 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pantry |
4 | 24 | 96.0 | 2014-11-24 | 4 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pantry |
... | ... | ... | ... | ... | ... | ... | ... | ... |
10390 | 88878 | 94.0 | 2016-08-19 | 10390 | 94102 | H24 | 94566 | Jamba Juice |
10391 | 89072 | 90.0 | 2016-09-22 | 10391 | 94109 | H91 | 94109 | Epicurean at Sacred Heart Catholic Prep School |
10392 | 89198 | 100.0 | 2016-09-12 | 10392 | 94107 | H36 | 29615 | AT&T Park - Beer Cart/View Level, Sec. 333 |
10393 | 89199 | 100.0 | 2016-09-12 | 10393 | 94107 | H36 | 29615 | AT&T Park - Beer Cart/Lower CF, Sec. 140 |
10394 | 89200 | 100.0 | 2016-09-12 | 10394 | 94107 | H36 | 29615 | AT&T Park - Beer Cart/Lower CF, Sec. 142 |
10395 rows × 8 columns
population_test_pd["id"] = population_test_pd.index
population_test_pd = population_test_pd.merge(
businesses_pd,
on="business_id"
)
population_test_pd
business_id | score | date | id | postal_code | tax_code | owner_zip | name | |
---|---|---|---|---|---|---|---|---|
0 | 24 | 100.0 | 2013-11-18 | 0 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pantry |
1 | 24 | 96.0 | 2016-03-11 | 1 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pantry |
2 | 45 | 94.0 | 2013-12-09 | 2 | 94110 | H24 | 94114 | CHARLIE'S DELI CAFE |
3 | 58 | 78.0 | 2014-07-25 | 3 | 94111 | H24 | 94111 | Oasis Grill |
4 | 66 | 91.0 | 2014-05-19 | 4 | 94122 | H24 | 94122 | STARBUCKS |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2487 | 87802 | 91.0 | 2016-06-07 | 2487 | 94110 | H25 | 94110 | Bernal Heights Pizzeria |
2488 | 88082 | 84.0 | 2016-08-30 | 2488 | 94133 | H24 | 94133 | Chongqing Xiaomian |
2489 | 88447 | 96.0 | 2016-08-17 | 2489 | None | H91 | 94107 | Fare Resources |
2490 | 88702 | 96.0 | 2016-08-15 | 2490 | 94118 | H25 | 94118 | Dancing Bull |
2491 | 89204 | 100.0 | 2016-09-12 | 2491 | 94107 | H36 | 94107 | AT&T - Hol n Jam Cart/Upper CF, Sec. 142 |
2492 rows × 8 columns
def prepare_peripheral(violations_pd, train_or_test):
    """
    Helper function that imitates the behavior of
    the data model defined above. It is applied to
    both the violations and the inspections table.
    """
    violations_new = violations_pd.merge(
        train_or_test[["id", "business_id", "date"]],
        on="business_id"
    )
    # Keep only peripheral records that occurred strictly before the
    # inspection date, so no information from the future can leak in.
    violations_new = violations_new[
        violations_new["date_x"] < violations_new["date_y"]
    ]
    del violations_new["date_y"]
    del violations_new["business_id"]
    return violations_new.rename(columns={"date_x": "date"})
violations_train_pd = prepare_peripheral(violations_pd, population_train_pd)
violations_test_pd = prepare_peripheral(violations_pd, population_test_pd)
violations_train_pd
violation_type_id | risk_category | description | date | id | |
---|---|---|---|---|---|
2 | 103129 | Moderate Risk | Insufficient hot water or running water | 2014-07-29 | 2 |
5 | 103144 | Low Risk | Unapproved or unmaintained equipment or utensils | 2014-07-29 | 2 |
7 | 103119 | Moderate Risk | Inadequate and inaccessible handwashing facili... | 2014-01-14 | 1 |
8 | 103119 | Moderate Risk | Inadequate and inaccessible handwashing facili... | 2014-01-14 | 2 |
10 | 103145 | Low Risk | Improper storage of equipment utensils or linens | 2014-01-14 | 1 |
... | ... | ... | ... | ... | ... |
89220 | 103119 | Moderate Risk | Inadequate and inaccessible handwashing facili... | 2016-02-16 | 10290 |
89256 | 103131 | Moderate Risk | Moderate risk vermin infestation | 2016-04-04 | 10308 |
89336 | 103154 | Low Risk | Unclean or degraded floors walls or ceilings | 2016-04-11 | 10331 |
89338 | 103148 | Low Risk | No thermometers or uncalibrated thermometers | 2016-04-11 | 10331 |
89340 | 103144 | Low Risk | Unapproved or unmaintained equipment or utensils | 2016-04-11 | 10331 |
29004 rows × 5 columns
inspections_train_pd = prepare_peripheral(inspections_pd, population_train_pd)
inspections_test_pd = prepare_peripheral(inspections_pd, population_test_pd)
inspections_train_pd
score | date | id | |
---|---|---|---|
1 | 92.0 | 2014-01-14 | 1 |
2 | 92.0 | 2014-01-14 | 2 |
5 | 94.0 | 2014-07-29 | 2 |
9 | 100.0 | 2013-11-18 | 3 |
10 | 100.0 | 2013-11-18 | 4 |
... | ... | ... | ... |
32628 | 92.0 | 2016-02-16 | 10290 |
32648 | 96.0 | 2016-04-04 | 10308 |
32673 | 94.0 | 2016-04-11 | 10331 |
32707 | 100.0 | 2016-05-23 | 10360 |
32738 | 100.0 | 2016-08-17 | 10389 |
11190 rows × 3 columns
del population_train_pd["business_id"]
del population_test_pd["business_id"]
population_train_pd
score | date | id | postal_code | tax_code | owner_zip | name | |
---|---|---|---|---|---|---|---|
0 | 92.0 | 2014-01-14 | 0 | 94104 | H24 | 94104 | Tiramisu Kitchen |
1 | 94.0 | 2014-07-29 | 1 | 94104 | H24 | 94104 | Tiramisu Kitchen |
2 | 82.0 | 2016-05-03 | 2 | 94104 | H24 | 94104 | Tiramisu Kitchen |
3 | 96.0 | 2014-06-12 | 3 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pantry |
4 | 96.0 | 2014-11-24 | 4 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pantry |
... | ... | ... | ... | ... | ... | ... | ... |
10390 | 94.0 | 2016-08-19 | 10390 | 94102 | H24 | 94566 | Jamba Juice |
10391 | 90.0 | 2016-09-22 | 10391 | 94109 | H91 | 94109 | Epicurean at Sacred Heart Catholic Prep School |
10392 | 100.0 | 2016-09-12 | 10392 | 94107 | H36 | 29615 | AT&T Park - Beer Cart/View Level, Sec. 333 |
10393 | 100.0 | 2016-09-12 | 10393 | 94107 | H36 | 29615 | AT&T Park - Beer Cart/Lower CF, Sec. 140 |
10394 | 100.0 | 2016-09-12 | 10394 | 94107 | H36 | 29615 | AT&T Park - Beer Cart/Lower CF, Sec. 142 |
10395 rows × 7 columns
def add_index(df):
    # featuretools (via woodwork) requires a unique index column for every dataframe.
    df.insert(0, "index", range(len(df)))
population_pd_logical_types = {
"id": ww.logical_types.Integer,
"score": ww.logical_types.Integer,
"date": ww.logical_types.Datetime,
"postal_code": ww.logical_types.Categorical,
"tax_code": ww.logical_types.Categorical,
"owner_zip": ww.logical_types.Categorical,
"name": ww.logical_types.Categorical
}
population_train_pd.ww.init(logical_types=population_pd_logical_types, index="id", name="population")
population_test_pd.ww.init(logical_types=population_pd_logical_types, index="id", name="population")
add_index(inspections_train_pd)
add_index(inspections_test_pd)
inspections_pd_logical_types = {
"index": ww.logical_types.Integer,
"score": ww.logical_types.Integer,
"date": ww.logical_types.Datetime,
"id": ww.logical_types.Integer
}
inspections_train_pd.ww.init(logical_types=inspections_pd_logical_types, index="index", name="inspections")
inspections_test_pd.ww.init(logical_types=inspections_pd_logical_types, index="index", name="inspections")
add_index(violations_train_pd)
add_index(violations_test_pd)
violations_pd_logical_types = {
"index": ww.logical_types.Integer,
"violation_type_id": ww.logical_types.Categorical,
"risk_category": ww.logical_types.Categorical,
"description": ww.logical_types.Categorical,
"date": ww.logical_types.Datetime,
"id": ww.logical_types.Integer
}
violations_train_pd.ww.init(logical_types=violations_pd_logical_types, index="index", name="violations")
violations_test_pd.ww.init(logical_types=violations_pd_logical_types, index="index", name="violations")
dataframes_train = {
"population" : (population_train_pd, ),
"inspections" : (inspections_train_pd, ),
"violations" : (violations_train_pd, )
}
dataframes_test = {
"population" : (population_test_pd, ),
"inspections" : (inspections_test_pd, ),
"violations" : (violations_test_pd, )
}
relationships = [
("population", "id", "inspections", "id"),
("population", "id", "violations", "id")
]
featuretools_train_pd = featuretools.dfs(
dataframes=dataframes_train,
relationships=relationships,
target_dataframe_name="population")[0]
featuretools_test_pd = featuretools.dfs(
dataframes=dataframes_test,
relationships=relationships,
target_dataframe_name="population")[0]
featuretools_train = getml.data.DataFrame.from_pandas(featuretools_train_pd, "featuretools_train")
featuretools_test = getml.data.DataFrame.from_pandas(featuretools_test_pd, "featuretools_test")
featuretools_train.set_role("score", getml.data.roles.target)
featuretools_train.set_role(featuretools_train.roles.unused_float, getml.data.roles.numerical)
featuretools_train.set_role(featuretools_train.roles.unused_string, getml.data.roles.categorical)
featuretools_train
name | score | postal_code | tax_code | owner_zip | name | COUNT(inspections) | COUNT(violations) | MODE(violations.description) | MODE(violations.risk_category) | MODE(violations.violation_type_id) | NUM_UNIQUE(violations.description) | NUM_UNIQUE(violations.risk_category) | NUM_UNIQUE(violations.violation_type_id) | DAY(date) | MONTH(date) | WEEKDAY(date) | YEAR(date) | MODE(inspections.DAY(date)) | MODE(inspections.MONTH(date)) | MODE(inspections.WEEKDAY(date)) | MODE(inspections.YEAR(date)) | NUM_UNIQUE(inspections.DAY(date)) | NUM_UNIQUE(inspections.MONTH(date)) | NUM_UNIQUE(inspections.WEEKDAY(date)) | NUM_UNIQUE(inspections.YEAR(date)) | MODE(violations.DAY(date)) | MODE(violations.MONTH(date)) | MODE(violations.WEEKDAY(date)) | MODE(violations.YEAR(date)) | NUM_UNIQUE(violations.DAY(date)) | NUM_UNIQUE(violations.MONTH(date)) | NUM_UNIQUE(violations.WEEKDAY(date)) | NUM_UNIQUE(violations.YEAR(date)) | MAX(inspections.score) | MEAN(inspections.score) | MIN(inspections.score) | SKEW(inspections.score) | STD(inspections.score) | SUM(inspections.score) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | target | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | numerical | numerical |
0 | 92 | 94104 | H24 | 94104 | Tiramisu Kitchen | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 14 | 1 | 1 | 2014 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
1 | 94 | 94104 | H24 | 94104 | Tiramisu Kitchen | 1 | 3 | Improper storage of equipment ut... | Low Risk | 103119 | 3 | 2 | 3 | 29 | 7 | 1 | 2014 | 14 | 1 | 1 | 2014 | 1 | 1 | 1 | 1 | 14 | 1 | 1 | 2014 | 1 | 1 | 1 | 1 | 92 | 92 | 92 | nan | nan | 92 |
2 | 82 | 94104 | H24 | 94104 | Tiramisu Kitchen | 2 | 5 | Improper storage of equipment ut... | Low Risk | 103119 | 5 | 2 | 5 | 3 | 5 | 1 | 2016 | 14 | 1 | 1 | 2014 | 2 | 2 | 1 | 1 | 14 | 1 | 1 | 2014 | 2 | 2 | 1 | 1 | 94 | 93 | 92 | nan | 1.4142 | 186 |
3 | 96 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pant... | 1 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 12 | 6 | 3 | 2014 | 18 | 11 | 0 | 2013 | 1 | 1 | 1 | 1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 100 | 100 | 100 | nan | nan | 100 |
4 | 96 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pant... | 2 | 2 | Improper storage of equipment ut... | Low Risk | 103145 | 2 | 1 | 2 | 24 | 11 | 0 | 2014 | 12 | 6 | 0 | 2013 | 2 | 2 | 2 | 2 | 12 | 6 | 3 | 2014 | 1 | 1 | 1 | 1 | 100 | 98 | 96 | nan | 2.8284 | 196 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
10390 | 94 | 94102 | H24 | 94566 | Jamba Juice | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 19 | 8 | 4 | 2016 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
10391 | 90 | 94109 | H91 | 94109 | Epicurean at Sacred Heart Cathol... | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 22 | 9 | 3 | 2016 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
10392 | 100 | 94107 | H36 | 29615 | AT&T Park - Beer Cart/View Level... | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 12 | 9 | 0 | 2016 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
10393 | 100 | 94107 | H36 | 29615 | AT&T Park - Beer Cart/Lower CF, ... | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 12 | 9 | 0 | 2016 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
10394 | 100 | 94107 | H36 | 29615 | AT&T Park - Beer Cart/Lower CF, ... | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 12 | 9 | 0 | 2016 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
10395 rows x 39 columns
memory usage: 1.91 MB
name: featuretools_train
type: getml.DataFrame
featuretools_test.set_role("score", getml.data.roles.target)
featuretools_test.set_role(featuretools_test.roles.unused_float, getml.data.roles.numerical)
featuretools_test.set_role(featuretools_test.roles.unused_string, getml.data.roles.categorical)
featuretools_test
name | score | postal_code | tax_code | owner_zip | name | COUNT(inspections) | COUNT(violations) | MODE(violations.description) | MODE(violations.risk_category) | MODE(violations.violation_type_id) | NUM_UNIQUE(violations.description) | NUM_UNIQUE(violations.risk_category) | NUM_UNIQUE(violations.violation_type_id) | DAY(date) | MONTH(date) | WEEKDAY(date) | YEAR(date) | MODE(inspections.DAY(date)) | MODE(inspections.MONTH(date)) | MODE(inspections.WEEKDAY(date)) | MODE(inspections.YEAR(date)) | NUM_UNIQUE(inspections.DAY(date)) | NUM_UNIQUE(inspections.MONTH(date)) | NUM_UNIQUE(inspections.WEEKDAY(date)) | NUM_UNIQUE(inspections.YEAR(date)) | MODE(violations.DAY(date)) | MODE(violations.MONTH(date)) | MODE(violations.WEEKDAY(date)) | MODE(violations.YEAR(date)) | NUM_UNIQUE(violations.DAY(date)) | NUM_UNIQUE(violations.MONTH(date)) | NUM_UNIQUE(violations.WEEKDAY(date)) | NUM_UNIQUE(violations.YEAR(date)) | MAX(inspections.score) | MEAN(inspections.score) | MIN(inspections.score) | SKEW(inspections.score) | STD(inspections.score) | SUM(inspections.score) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | target | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | numerical | numerical |
0 | 100 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pant... | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 18 | 11 | 0 | 2013 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
1 | 96 | 94104 | H24 | 94104 | OMNI S.F. Hotel - 2nd Floor Pant... | 3 | 3 | Improper storage of equipment ut... | Low Risk | 103119 | 3 | 2 | 3 | 11 | 3 | 4 | 2016 | 12 | 11 | 0 | 2014 | 3 | 2 | 2 | 2 | 12 | 6 | 3 | 2014 | 2 | 2 | 2 | 1 | 100 | 97.3333 | 96 | 1.7321 | 2.3094 | 292 |
2 | 94 | 94110 | H24 | 94114 | CHARLIE'S DELI CAFE | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 9 | 12 | 0 | 2013 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
3 | 78 | 94111 | H24 | 94111 | Oasis Grill | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 25 | 7 | 4 | 2014 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
4 | 91 | 94122 | H24 | 94122 | STARBUCKS | 1 | 1 | Wiping cloths not clean or prope... | Low Risk | 103149 | 1 | 1 | 1 | 19 | 5 | 0 | 2014 | 10 | 2 | 0 | 2014 | 1 | 1 | 1 | 1 | 10 | 2 | 0 | 2014 | 1 | 1 | 1 | 1 | 98 | 98 | 98 | nan | nan | 98 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
2487 | 91 | 94110 | H25 | 94110 | Bernal Heights Pizzeria | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 7 | 6 | 1 | 2016 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
2488 | 84 | 94133 | H24 | 94133 | Chongqing Xiaomian | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 30 | 8 | 1 | 2016 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
2489 | 96 | NULL | H91 | 94107 | Fare Resources | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 17 | 8 | 2 | 2016 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
2490 | 96 | 94118 | H25 | 94118 | Dancing Bull | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 15 | 8 | 0 | 2016 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
2491 | 100 | 94107 | H36 | 94107 | AT&T - Hol n Jam Cart/Upper CF, ... | 0 | 0 | NULL | NULL | NULL | NULL | NULL | NULL | 12 | 9 | 0 | 2016 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 |
2492 rows x 39 columns
memory usage: 0.46 MB
name: featuretools_test
type: getml.DataFrame
We train an untuned XGBoostRegressor on top of featuretools' features, just as we did for getML's features.
Since some of featuretools' features are categorical, we allow the pipeline to include them as well. Other features contain NaN values, which is why we also apply getML's Imputation preprocessor.
imputation = getml.preprocessors.Imputation()
predictor = getml.predictors.XGBoostRegressor(n_jobs=1)
pipe2 = getml.pipeline.Pipeline(
tags=['featuretools'],
preprocessors=[imputation],
predictors=[predictor],
include_categorical=True,
)
pipe2
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=True, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=['Imputation'], share_selected_features=0.5, tags=['featuretools'])
pipe2.fit(featuretools_train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 1 issues labeled INFO and 1 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:12
Trained pipeline.
Time taken: 0:00:12.224273.
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=True, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=['Imputation'], share_selected_features=0.5, tags=['featuretools'])
featuretools_score = pipe2.score(featuretools_test)
featuretools_score
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | mae | rmse | rsquared | |
---|---|---|---|---|---|---|
0 | 2024-09-12 11:53:55 | featuretools_train | score | 5.1359 | 6.7501 | 0.321 |
1 | 2024-09-12 11:53:55 | featuretools_test | score | 5.4491 | 7.1941 | 0.2626 |
2.6 Features¶
The most important feature looks as follows:
pipe1.features.to_sql()[pipe1.features.sort(by="importances")[0].name]
DROP TABLE IF EXISTS "FEATURE_1_46";
CREATE TABLE "FEATURE_1_46" AS
SELECT COUNT( t1."date" - t2."date" ) - COUNT( DISTINCT t1."date" - t2."date" ) AS "feature_1_46",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "VIOLATIONS__STAGING_TABLE_3" t2
ON t1."business_id" = t2."business_id"
WHERE t2."date__1_000000_days" <= t1."date"
GROUP BY t1.rowid;
2.7 Productionization¶
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3 and spark modules.
# Creates a folder named sfscores_pipeline containing
# the SQL code.
pipe1.features.to_sql().save("sfscores_pipeline")
pipe1.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("sfscores_spark")
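As a rough sketch of how the transpiled features could be consumed downstream, the saved scripts can be executed against any SQLite database that already contains the staging tables (the file layout of the sfscores_pipeline folder and the database name sfscores.db are assumptions for illustration):
# Hedged sketch: run the transpiled feature scripts with Python's built-in
# sqlite3 module. Assumes the staging tables already exist in sfscores.db
# and that the saved folder contains one .sql script per feature.
import pathlib
import sqlite3

db = sqlite3.connect("sfscores.db")
for script in sorted(pathlib.Path("sfscores_pipeline").glob("*.sql")):
    db.executescript(script.read_text())
db.commit()
db.close()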
2.8 Discussion¶
For a more convenient overview, we summarize our results in a table:
scores = [fastprop_score, featuretools_score]
pd.DataFrame(data={
'Name': ['getML: FastProp', 'featuretools'],
'R-squared': [f'{score.rsquared:.1%}' for score in scores],
'RMSE': [f'{score.rmse:,.2f}' for score in scores],
'MAE': [f'{score.mae:,.2f}' for score in scores]
})
Name | R-squared | RMSE | MAE | |
---|---|---|---|---|
0 | getML: FastProp | 28.9% | 7.05 | 5.32 |
1 | featuretools | 26.3% | 7.19 | 5.45 |
getml.engine.shutdown()
As we can see, getML's FastProp outperforms featuretools according to all three measures.
3. Conclusion¶
We have benchmarked getML against featuretools on a dataset of health inspections of eateries in San Francisco. getML's FastProp outperformed featuretools on all three error measures.
References¶
Motl, Jan, and Oliver Schulte. "The CTU Prague Relational Learning Repository." arXiv preprint arXiv:1511.03086 (2015).