StatsExchange - Predicting users' reputations¶
In this notebook, we will use relational learning techniques to predict users' reputation on StackExchange
Summary:
- Prediction type: Regression model
- Domain: Internet
- Prediction target: Reputation
- Population size: 41793
Background¶
StatsExchange is a website similar to StackOverflow, but based on statistics and machine learning. Much like StackOverflow, it has a complicated system of calculating users' reputation. In this notebook, we will benchmark relational learning techniques based on there ability to reverse-engineer said system.
The dataset has been downloaded from the CTU Prague relational learning repository (Motl and Schulte, 2015)(Now residing at relational-data.org.).
We will benchmark getML 's feature learning algorithms against featuretools, an open-source implementation of the propositionalization algorithm, similar to getML's FastProp.
Analysis¶
Let's get started with the analysis and set up your session:
%pip install -q "getml==1.5.0" "featuretools==1.31.0" "ipywidgets==8.1.5"
import os
import warnings
import pandas as pd
import featuretools
import woodwork as ww
import getml
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
warnings.simplefilter(action='ignore', category=FutureWarning)
print(f"getML API version: {getml.__version__}\n")
getML API version: 1.5.0
getml.engine.launch(allow_remote_ips=True, token='token')
getml.engine.set_project('stats')
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912151014.log. Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Connected to project 'stats'.
1. Loading data¶
1.1 Download from source¶
We begin by downloading the data:
conn = getml.database.connect_mysql(
host="db.relational-data.org",
dbname="stats",
port=3306,
user="guest",
password="relational"
)
conn
Connection(dbname='stats', dialect='mysql', host='db.relational-data.org', port=3306)
def load_if_needed(name):
"""
Loads the data from the relational learning
repository, if the data frame has not already
been loaded.
"""
if not getml.data.exists(name):
data_frame = getml.data.DataFrame.from_db(
name=name,
table_name=name,
conn=conn
)
data_frame.save()
else:
data_frame = getml.data.load_data_frame(name)
return data_frame
badges = load_if_needed("badges")
posts = load_if_needed("posts")
users = load_if_needed("users")
votes = load_if_needed("votes")
badges
name | Id | UserId | Name | Date |
---|---|---|---|---|
role | unused_float | unused_float | unused_string | unused_string |
0 | 1 | 5 | Teacher | 2010-07-19 19:39:07 |
1 | 2 | 6 | Teacher | 2010-07-19 19:39:07 |
2 | 3 | 8 | Teacher | 2010-07-19 19:39:07 |
3 | 4 | 23 | Teacher | 2010-07-19 19:39:07 |
4 | 5 | 36 | Teacher | 2010-07-19 19:39:07 |
... | ... | ... | ... | |
79846 | 92236 | 55744 | Student | 2014-09-13 23:25:21 |
79847 | 92237 | 1118 | Nice Answer | 2014-09-14 00:09:35 |
79848 | 92238 | 1118 | Enlightened | 2014-09-14 01:18:29 |
79849 | 92239 | 55746 | Student | 2014-09-14 01:41:18 |
79850 | 92240 | 12597 | Autobiographer | 2014-09-14 02:31:28 |
79851 rows x 4 columns
memory usage: 4.96 MB
name: badges
type: getml.DataFrame
posts
name | Id | PostTypeId | AcceptedAnswerId | Score | ViewCount | OwnerUserId | AnswerCount | CommentCount | FavoriteCount | LastEditorUserId | ParentId | CreaionDate | Body | LasActivityDate | Title | Tags | LastEditDate | CommunityOwnedDate | ClosedDate | OwnerDisplayName | LastEditorDisplayName |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
0 | 1 | 1 | 15 | 23 | 1278 | 8 | 5 | 1 | 14 | nan | nan | 2010-07-19 19:12:12 | <p>How should I elicit prior dis... | 2010-09-15 21:08:26 | Eliciting priors from experts | <bayesian><prior><elicitation> | NULL | NULL | NULL | NULL | NULL |
1 | 2 | 1 | 59 | 22 | 8198 | 24 | 7 | 1 | 8 | 88 | nan | 2010-07-19 19:12:57 | <p>In many different statistical... | 2012-11-12 09:21:54 | What is normality? | <distributions><normality> | 2010-08-07 17:56:44 | NULL | NULL | NULL | NULL |
2 | 3 | 1 | 5 | 54 | 3613 | 18 | 19 | 4 | 36 | 183 | nan | 2010-07-19 19:13:28 | <p>What are some valuable Statis... | 2013-05-27 14:48:36 | What are some valuable Statistic... | <software><open-source> | 2011-02-12 05:50:03 | 2010-07-19 19:13:28 | NULL | NULL | NULL |
3 | 4 | 1 | 135 | 13 | 5224 | 23 | 5 | 2 | 2 | nan | nan | 2010-07-19 19:13:31 | <p>I have two groups of data. E... | 2010-09-08 03:00:19 | Assessing the significance of di... | <distributions><statistical-sign... | NULL | NULL | NULL | NULL | NULL |
4 | 5 | 2 | nan | 81 | nan | 23 | nan | 3 | nan | 23 | 3 | 2010-07-19 19:14:43 | <p>The R-project</p> <p><a href... | 2010-07-19 19:21:15 | NULL | NULL | 2010-07-19 19:21:15 | 2010-07-19 19:14:43 | NULL | NULL | NULL |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
91971 | 115374 | 2 | nan | 2 | nan | 805 | nan | 2 | nan | 805 | 115367 | 2014-09-13 23:45:39 | <p>This grew too long for a comm... | 2014-09-14 02:05:41 | NULL | NULL | 2014-09-14 02:05:41 | NULL | NULL | NULL | NULL |
91972 | 115375 | 1 | nan | 0 | 9 | 49365 | 1 | 0 | nan | nan | nan | 2014-09-13 23:46:05 | <p>Assume a classification probl... | 2014-09-14 02:09:23 | Detecting a consistent pattern i... | <classification><cross-validatio... | NULL | NULL | NULL | NULL | NULL |
91973 | 115376 | 1 | nan | 1 | 5 | 55746 | 0 | 2 | nan | 7290 | nan | 2014-09-14 01:27:54 | <p>My goal is to create a formul... | 2014-09-14 01:40:55 | How to project video viewcount b... | <summary-statistics><median><evi... | 2014-09-14 01:40:55 | NULL | NULL | NULL | NULL |
91974 | 115377 | 2 | nan | 0 | nan | 805 | nan | 0 | nan | 805 | 115358 | 2014-09-14 02:03:28 | <p>As a practical answer to the ... | 2014-09-14 02:54:13 | NULL | NULL | 2014-09-14 02:54:13 | NULL | NULL | NULL | NULL |
91975 | 115378 | 2 | nan | 0 | nan | 7250 | nan | 0 | nan | nan | 115375 | 2014-09-14 02:09:23 | <p>Decision trees are notoriousl... | 2014-09-14 02:09:23 | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
91976 rows x 21 columns
memory usage: 128.41 MB
name: posts
type: getml.DataFrame
users
name | Id | Reputation | Views | UpVotes | DownVotes | AccountId | Age | CreationDate | DisplayName | LastAccessDate | WebsiteUrl | Location | AboutMe | ProfileImageUrl |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
0 | -1 | 1 | 0 | 5007 | 1920 | -1 | nan | 2010-07-19 06:55:26 | Community | 2010-07-19 06:55:26 | http://meta.stackexchange.com/ | on the server farm | <p>Hi, I'm not really a person.<... | NULL |
1 | 2 | 101 | 25 | 3 | 0 | 2 | 37 | 2010-07-19 14:01:36 | Geoff Dalgas | 2013-11-12 22:07:23 | http://stackoverflow.com | Corvallis, OR | <p>Developer on the StackOverflo... | NULL |
2 | 3 | 101 | 22 | 19 | 0 | 3 | 35 | 2010-07-19 15:34:50 | Jarrod Dixon | 2014-08-08 06:42:58 | http://stackoverflow.com | New York, NY | <p><a href="http://blog.stackove... | NULL |
3 | 4 | 101 | 11 | 0 | 0 | 1998 | 28 | 2010-07-19 19:03:27 | Emmett | 2014-01-02 09:31:02 | http://minesweeperonline.com | San Francisco, CA | <p>currently at a startup in SF<... | http://i.stack.imgur.com/d1oHX.j... |
4 | 5 | 6792 | 1145 | 662 | 5 | 54503 | 35 | 2010-07-19 19:03:57 | Shane | 2014-08-13 00:23:47 | http://www.statalgo.com | New York, NY | <p>Quantitative researcher focus... | NULL |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
40320 | 55743 | 1 | 0 | 0 | 0 | 5026902 | nan | 2014-09-13 21:03:50 | AussieMeg | 2014-09-13 21:18:52 | NULL | NULL | NULL | http://graph.facebook.com/665821... |
40321 | 55744 | 6 | 1 | 0 | 0 | 5026998 | nan | 2014-09-13 21:39:30 | Mia Maria | 2014-09-13 21:39:30 | NULL | NULL | NULL | NULL |
40322 | 55745 | 101 | 0 | 0 | 0 | 481766 | nan | 2014-09-13 23:45:27 | tronbabylove | 2014-09-13 23:45:27 | NULL | United States | NULL | https://www.gravatar.com/avatar/... |
40323 | 55746 | 106 | 1 | 0 | 0 | 976289 | nan | 2014-09-14 00:29:41 | GPP | 2014-09-14 02:05:17 | NULL | NULL | <p>Stats noobie, product, market... | https://www.gravatar.com/avatar/... |
40324 | 55747 | 1 | 0 | 0 | 0 | 5027354 | nan | 2014-09-14 01:01:44 | Shivam Agrawal | 2014-09-14 01:19:04 | NULL | India | <p>Maths Enthusiast </p> | https://lh4.googleusercontent.co... |
40325 rows x 14 columns
memory usage: 10.32 MB
name: users
type: getml.DataFrame
votes
name | Id | PostId | VoteTypeId | UserId | BountyAmount | CreationDate |
---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string |
0 | 1 | 3 | 2 | nan | nan | 2010-07-19 |
1 | 2 | 2 | 2 | nan | nan | 2010-07-19 |
2 | 3 | 5 | 2 | nan | nan | 2010-07-19 |
3 | 4 | 5 | 2 | nan | nan | 2010-07-19 |
4 | 5 | 3 | 2 | nan | nan | 2010-07-19 |
... | ... | ... | ... | ... | ... | |
328059 | 386254 | 26088 | 2 | nan | nan | 2014-09-14 |
328060 | 386255 | 26088 | 5 | 31466 | nan | 2014-09-14 |
328061 | 386256 | 115374 | 2 | nan | nan | 2014-09-14 |
328062 | 386257 | 115368 | 2 | nan | nan | 2014-09-14 |
328063 | 386258 | 115369 | 2 | nan | nan | 2014-09-14 |
328064 rows x 6 columns
memory usage: 19.36 MB
name: votes
type: getml.DataFrame
1.2 Prepare data for getML¶
getML requires that we define roles for each of the columns.
badges.set_role(["Id", "UserId"], getml.data.roles.join_key)
badges.set_role("Date", getml.data.roles.time_stamp)
badges.set_role("Name", getml.data.roles.categorical)
badges
name | Date | Id | UserId | Name |
---|---|---|---|---|
role | time_stamp | join_key | join_key | categorical |
unit | time stamp, comparison only | |||
0 | 2010-07-19 19:39:07 | 1 | 5 | Teacher |
1 | 2010-07-19 19:39:07 | 2 | 6 | Teacher |
2 | 2010-07-19 19:39:07 | 3 | 8 | Teacher |
3 | 2010-07-19 19:39:07 | 4 | 23 | Teacher |
4 | 2010-07-19 19:39:07 | 5 | 36 | Teacher |
... | ... | ... | ... | |
79846 | 2014-09-13 23:25:21 | 92236 | 55744 | Student |
79847 | 2014-09-14 00:09:35 | 92237 | 1118 | Nice Answer |
79848 | 2014-09-14 01:18:29 | 92238 | 1118 | Enlightened |
79849 | 2014-09-14 01:41:18 | 92239 | 55746 | Student |
79850 | 2014-09-14 02:31:28 | 92240 | 12597 | Autobiographer |
79851 rows x 4 columns
memory usage: 1.60 MB
name: badges
type: getml.DataFrame
posts.set_role(["Id", "AcceptedAnswerId", "OwnerUserId"], getml.data.roles.join_key)
posts.set_role(["Score", "ViewCount", "AnswerCount", "CommentCount", "FavoriteCount", "LastEditorUserId"], getml.data.roles.numerical)
posts.set_role(["PostTypeId"], getml.data.roles.categorical)
posts.drop(posts.roles.unused)
name | Id | AcceptedAnswerId | OwnerUserId | PostTypeId | Score | ViewCount | AnswerCount | CommentCount | FavoriteCount | LastEditorUserId |
---|---|---|---|---|---|---|---|---|---|---|
role | join_key | join_key | join_key | categorical | numerical | numerical | numerical | numerical | numerical | numerical |
0 | 1 | 15 | 8 | 1 | 23 | 1278 | 5 | 1 | 14 | nan |
1 | 2 | 59 | 24 | 1 | 22 | 8198 | 7 | 1 | 8 | 88 |
2 | 3 | 5 | 18 | 1 | 54 | 3613 | 19 | 4 | 36 | 183 |
3 | 4 | 135 | 23 | 1 | 13 | 5224 | 5 | 2 | 2 | nan |
4 | 5 | nan | 23 | 2 | 81 | nan | nan | 3 | nan | 23 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
91976 rows
type: getml.data.View
users.set_role("Reputation", getml.data.roles.target)
users.set_role(["CreationDate", "LastAccessDate"], getml.data.roles.time_stamp)
users.set_role(["Id"], getml.data.roles.join_key)
users.set_role(["Views", "UpVotes", "DownVotes"], getml.data.roles.numerical)
users.drop(users.roles.unused)
name | CreationDate | LastAccessDate | Id | Reputation | Views | UpVotes | DownVotes |
---|---|---|---|---|---|---|---|
role | time_stamp | time_stamp | join_key | target | numerical | numerical | numerical |
unit | time stamp, comparison only | time stamp, comparison only | |||||
0 | 2010-07-19 06:55:26 | 2010-07-19 06:55:26 | -1 | 1 | 0 | 5007 | 1920 |
1 | 2010-07-19 14:01:36 | 2013-11-12 22:07:23 | 2 | 101 | 25 | 3 | 0 |
2 | 2010-07-19 15:34:50 | 2014-08-08 06:42:58 | 3 | 101 | 22 | 19 | 0 |
3 | 2010-07-19 19:03:27 | 2014-01-02 09:31:02 | 4 | 101 | 11 | 0 | 0 |
4 | 2010-07-19 19:03:57 | 2014-08-13 00:23:47 | 5 | 6792 | 1145 | 662 | 5 |
... | ... | ... | ... | ... | ... | ... | ... |
40325 rows
type: getml.data.View
votes.set_role(["CreationDate"], getml.data.roles.time_stamp)
votes.set_role(["Id", "PostId"], getml.data.roles.join_key)
votes.set_role(["VoteTypeId"], getml.data.roles.categorical)
votes.set_role(["BountyAmount"], getml.data.roles.numerical)
votes.drop(users.roles.unused)
name | CreationDate | Id | PostId | VoteTypeId | BountyAmount | UserId |
---|---|---|---|---|---|---|
role | time_stamp | join_key | join_key | categorical | numerical | unused_float |
unit | time stamp, comparison only | |||||
0 | 2010-07-19 | 1 | 3 | 2 | nan | nan |
1 | 2010-07-19 | 2 | 2 | 2 | nan | nan |
2 | 2010-07-19 | 3 | 5 | 2 | nan | nan |
3 | 2010-07-19 | 4 | 5 | 2 | nan | nan |
4 | 2010-07-19 | 5 | 3 | 2 | nan | nan |
... | ... | ... | ... | ... | ... | ... |
328064 rows
type: getml.data.View
split = getml.data.split.random(train=0.8, test=0.2)
container = getml.data.Container(population=users, split=split)
container.add(badges=badges, posts=posts, votes=votes)
container
subset | name | rows | type | |
---|---|---|---|---|
0 | test | users | 7992 | View |
1 | train | users | 32333 | View |
name | rows | type | |
---|---|---|---|
0 | badges | 79851 | DataFrame |
1 | posts | 91976 | DataFrame |
2 | votes | 328064 | DataFrame |
2. Predictive modeling¶
We loaded the data and defined the roles and units. Next, we create a getML pipeline for relational learning.
2.1 Define relational model¶
dm = getml.data.DataModel(population=users.to_placeholder())
dm.add(getml.data.to_placeholder(badges=badges, posts=[posts]*2, votes=votes))
dm.population.join(
dm.badges,
on=("Id", "UserId"),
)
dm.population.join(
dm.posts[0],
on=("Id", "OwnerUserId"),
)
dm.posts[0].join(
dm.posts[1],
on=("Id", "AcceptedAnswerId"),
relationship=getml.data.relationship.one_to_one,
)
dm.posts[0].join(
dm.votes,
on=("Id", "PostId"),
)
dm
data frames | staging table | |
---|---|---|
0 | users | USERS__STAGING_TABLE_1 |
1 | badges | BADGES__STAGING_TABLE_2 |
2 | posts, posts | POSTS__STAGING_TABLE_3 |
3 | votes | VOTES__STAGING_TABLE_4 |
2.2 getML pipeline¶
Set-up the feature learner & predictor
We use the relboost algorithms for this problem. Because of the large number of keywords, we regularize the model a bit by requiring a minimum support for the keywords (min_num_samples
).
mapping = getml.preprocessors.Mapping()
fast_prop = getml.feature_learning.FastProp(
aggregation=["AVG", "SUM"],
loss_function=getml.feature_learning.loss_functions.SquareLoss,
num_threads=1,
n_most_frequent=5,
)
relboost = getml.feature_learning.Relboost(
loss_function=getml.feature_learning.loss_functions.SquareLoss,
num_threads=1,
num_subfeatures=10,
)
predictor = getml.predictors.XGBoostRegressor(n_jobs=1)
Build the pipeline
pipe1 = getml.Pipeline(
tags=['fast_prop'],
data_model=dm,
feature_learners=[fast_prop],
predictors=[predictor],
include_categorical=True,
)
pipe1
Pipeline(data_model='users', feature_learners=['FastProp'], feature_selectors=[], include_categorical=True, loss_function='SquareLoss', peripheral=['badges', 'posts', 'votes'], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['fast_prop'])
pipe2 = getml.Pipeline(
tags=['relboost'],
data_model=dm,
preprocessors=[mapping],
feature_learners=[relboost],
predictors=[predictor],
include_categorical=True,
)
pipe2
Pipeline(data_model='users', feature_learners=['Relboost'], feature_selectors=[], include_categorical=True, loss_function='SquareLoss', peripheral=['badges', 'posts', 'votes'], predictors=['XGBoostRegressor'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['relboost'])
2.3 Model training¶
pipe1.check(container.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 ⠋ Checking... ━━━━━━━━━━━━━━╸ 37% • 00:01
Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 3 issues labeled INFO and 3 issues labeled WARNING.
type | label | message | |
---|---|---|---|
0 | WARNING | COLUMN SHOULD BE UNUSED | All non-NULL entries in column 't3__posttypeid' in POSTS__STAGING_TABLE_3 are equal to each other. You should consider setting its role to unused_string or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only'). |
1 | WARNING | COLUMN SHOULD BE UNUSED | 93.147125% of all entries in column 't3__favoritecount' in POSTS__STAGING_TABLE_3 are NULL values. You should consider setting its role to unused_float. |
2 | WARNING | COLUMN SHOULD BE UNUSED | 99.468396% of all entries in column 'bountyamount' in VOTES__STAGING_TABLE_4 are NULL values. You should consider setting its role to unused_float. |
3 | INFO | FOREIGN KEYS NOT FOUND | When joining USERS__STAGING_TABLE_1 and BADGES__STAGING_TABLE_2 over 'Id' and 'UserId', there are no corresponding entries for 37.794204% of entries in 'Id' in 'USERS__STAGING_TABLE_1'. You might want to double-check your join keys. |
4 | INFO | FOREIGN KEYS NOT FOUND | When joining USERS__STAGING_TABLE_1 and POSTS__STAGING_TABLE_3 over 'Id' and 'OwnerUserId', there are no corresponding entries for 45.600470% of entries in 'Id' in 'USERS__STAGING_TABLE_1'. You might want to double-check your join keys. |
5 | INFO | FOREIGN KEYS NOT FOUND | When joining POSTS__STAGING_TABLE_3 and VOTES__STAGING_TABLE_4 over 'Id' and 'PostId', there are no corresponding entries for 16.464078% of entries in 'Id' in 'POSTS__STAGING_TABLE_3'. You might want to double-check your join keys. |
pipe1.fit(container.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 3 issues labeled INFO and 3 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Trying 522 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:08 FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:09
Trained pipeline.
Time taken: 0:00:19.835015.
Pipeline(data_model='users', feature_learners=['FastProp'], feature_selectors=[], include_categorical=True, loss_function='SquareLoss', peripheral=['badges', 'posts', 'votes'], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['fast_prop', 'container-1mZfCR'])
pipe2.check(container.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 3 issues labeled INFO and 3 issues labeled WARNING.
type | label | message | |
---|---|---|---|
0 | WARNING | COLUMN SHOULD BE UNUSED | All non-NULL entries in column 't3__posttypeid' in POSTS__STAGING_TABLE_3 are equal to each other. You should consider setting its role to unused_string or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only'). |
1 | WARNING | COLUMN SHOULD BE UNUSED | 93.147125% of all entries in column 't3__favoritecount' in POSTS__STAGING_TABLE_3 are NULL values. You should consider setting its role to unused_float. |
2 | WARNING | COLUMN SHOULD BE UNUSED | 99.468396% of all entries in column 'bountyamount' in VOTES__STAGING_TABLE_4 are NULL values. You should consider setting its role to unused_float. |
3 | INFO | FOREIGN KEYS NOT FOUND | When joining USERS__STAGING_TABLE_1 and BADGES__STAGING_TABLE_2 over 'Id' and 'UserId', there are no corresponding entries for 37.794204% of entries in 'Id' in 'USERS__STAGING_TABLE_1'. You might want to double-check your join keys. |
4 | INFO | FOREIGN KEYS NOT FOUND | When joining USERS__STAGING_TABLE_1 and POSTS__STAGING_TABLE_3 over 'Id' and 'OwnerUserId', there are no corresponding entries for 45.600470% of entries in 'Id' in 'USERS__STAGING_TABLE_1'. You might want to double-check your join keys. |
5 | INFO | FOREIGN KEYS NOT FOUND | When joining POSTS__STAGING_TABLE_3 and VOTES__STAGING_TABLE_4 over 'Id' and 'PostId', there are no corresponding entries for 16.464078% of entries in 'Id' in 'POSTS__STAGING_TABLE_3'. You might want to double-check your join keys. |
pipe2.fit(container.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 3 issues labeled INFO and 3 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:19 Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:22 Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Relboost: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 01:10 Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:04 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:04
Trained pipeline.
Time taken: 0:02:05.355604.
Pipeline(data_model='users', feature_learners=['Relboost'], feature_selectors=[], include_categorical=True, loss_function='SquareLoss', peripheral=['badges', 'posts', 'votes'], predictors=['XGBoostRegressor'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['relboost', 'container-1mZfCR'])
2.4 Model evaluation¶
fastprop_score = pipe1.score(container.test)
fastprop_score
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | mae | rmse | rsquared | |
---|---|---|---|---|---|---|
0 | 2024-09-12 12:57:46 | train | Reputation | 31.8897 | 43.6771 | 0.9974 |
1 | 2024-09-12 12:59:53 | test | Reputation | 33.6064 | 65.3332 | 0.9777 |
relboost_score = pipe2.score(container.test)
relboost_score
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01
date time | set used | target | mae | rmse | rsquared | |
---|---|---|---|---|---|---|
0 | 2024-09-12 12:59:52 | train | Reputation | 29.7673 | 42.9683 | 0.9975 |
1 | 2024-09-12 12:59:55 | test | Reputation | 32.2748 | 75.6576 | 0.9728 |
2.5 featuretools¶
population_train_pd = container.train.population.to_pandas()
population_test_pd = container.test.population.to_pandas()
badges_pd = badges.drop(badges.roles.unused).to_pandas()
posts_pd = posts.drop(posts.roles.unused).to_pandas()
votes_pd = votes.drop(votes.roles.unused).to_pandas()
population_pd_logical_types = {
"Id": ww.logical_types.IntegerNullable,
"Views": ww.logical_types.Integer,
"UpVotes": ww.logical_types.Integer,
"DownVotes": ww.logical_types.Integer,
"Reputation": ww.logical_types.Integer,
"CreationDate": ww.logical_types.Datetime,
"LastAccessDate": ww.logical_types.Datetime
}
population_train_pd.ww.init(logical_types=population_pd_logical_types, index="Id", name="population")
population_test_pd.ww.init(logical_types=population_pd_logical_types, index="Id", name="population")
badges_pd_logical_types = {
"Name": ww.logical_types.Categorical,
"Id": ww.logical_types.Integer,
"UserId": ww.logical_types.IntegerNullable,
"Date": ww.logical_types.Datetime
}
badges_pd.ww.init(logical_types=badges_pd_logical_types, index="Id", name="badges")
posts_pd_logical_types = {
"PostTypeId": ww.logical_types.Integer,
"Id": ww.logical_types.Integer,
"AcceptedAnswerId": ww.logical_types.IntegerNullable,
"OwnerUserId": ww.logical_types.IntegerNullable,
"Score": ww.logical_types.Integer,
"ViewCount": ww.logical_types.IntegerNullable,
"AnswerCount": ww.logical_types.IntegerNullable,
"CommentCount": ww.logical_types.Integer,
"FavoriteCount": ww.logical_types.IntegerNullable,
"LastEditorUserId": ww.logical_types.IntegerNullable
}
posts_pd.ww.init(logical_types=posts_pd_logical_types, index="Id", name="posts")
votes_pd_logical_types = {
"VoteTypeId": ww.logical_types.Categorical,
"Id": ww.logical_types.Integer,
"PostId": ww.logical_types.Integer,
"BountyAmount": ww.logical_types.IntegerNullable,
"CreationDate": ww.logical_types.Datetime
}
votes_pd.ww.init(logical_types=votes_pd_logical_types, index="Id", name="votes")
dataframes_train = {
"population" : (population_train_pd, ),
"badges": (badges_pd, ),
"posts": (posts_pd, ),
"votes": (votes_pd, ),
}
dataframes_test = {
"population" : (population_test_pd, ),
"badges": (badges_pd, ),
"posts": (posts_pd, ),
"votes": (votes_pd, ),
}
relationships = [
("population", "Id", "badges", "UserId"),
("population", "Id", "posts", "OwnerUserId"),
("posts", "Id", "votes", "PostId"),
]
featuretools_train_pd = featuretools.dfs(
dataframes=dataframes_train,
relationships=relationships,
target_dataframe_name="population")[0]
featuretools_test_pd = featuretools.dfs(
dataframes=dataframes_test,
relationships=relationships,
target_dataframe_name="population")[0]
featuretools_train = getml.data.DataFrame.from_pandas(featuretools_train_pd, "featuretools_train")
featuretools_test = getml.data.DataFrame.from_pandas(featuretools_test_pd, "featuretools_test")
featuretools_train.set_role("Reputation", getml.data.roles.target)
featuretools_train.set_role(featuretools_train.roles.unused_float, getml.data.roles.numerical)
featuretools_train.set_role(featuretools_train.roles.unused_string, getml.data.roles.categorical)
featuretools_train.set_role(["SKEW(posts.STD(votes.BountyAmount))", "STD(posts.SKEW(votes.BountyAmount))"], getml.data.roles.unused_float)
featuretools_train
name | Reputation | COUNT(badges) | MODE(badges.Name) | NUM_UNIQUE(badges.Name) | COUNT(posts) | COUNT(votes) | MODE(votes.VoteTypeId) | NUM_UNIQUE(votes.VoteTypeId) | DAY(CreationDate) | DAY(LastAccessDate) | MONTH(CreationDate) | MONTH(LastAccessDate) | WEEKDAY(CreationDate) | WEEKDAY(LastAccessDate) | YEAR(CreationDate) | YEAR(LastAccessDate) | MODE(badges.DAY(Date)) | MODE(badges.MONTH(Date)) | MODE(badges.WEEKDAY(Date)) | MODE(badges.YEAR(Date)) | NUM_UNIQUE(badges.DAY(Date)) | NUM_UNIQUE(badges.MONTH(Date)) | NUM_UNIQUE(badges.WEEKDAY(Date)) | NUM_UNIQUE(badges.YEAR(Date)) | MODE(posts.MODE(votes.VoteTypeId)) | NUM_UNIQUE(posts.MODE(votes.VoteTypeId)) | Views | UpVotes | DownVotes | MAX(posts.AcceptedAnswerId) | MAX(posts.AnswerCount) | MAX(posts.CommentCount) | MAX(posts.FavoriteCount) | MAX(posts.LastEditorUserId) | MAX(posts.PostTypeId) | MAX(posts.Score) | MAX(posts.ViewCount) | MEAN(posts.AcceptedAnswerId) | MEAN(posts.AnswerCount) | MEAN(posts.CommentCount) | MEAN(posts.FavoriteCount) | MEAN(posts.LastEditorUserId) | MEAN(posts.PostTypeId) | MEAN(posts.Score) | MEAN(posts.ViewCount) | MIN(posts.AcceptedAnswerId) | MIN(posts.AnswerCount) | MIN(posts.CommentCount) | MIN(posts.FavoriteCount) | MIN(posts.LastEditorUserId) | MIN(posts.PostTypeId) | MIN(posts.Score) | MIN(posts.ViewCount) | SKEW(posts.AcceptedAnswerId) | SKEW(posts.AnswerCount) | SKEW(posts.CommentCount) | SKEW(posts.FavoriteCount) | SKEW(posts.LastEditorUserId) | SKEW(posts.PostTypeId) | SKEW(posts.Score) | SKEW(posts.ViewCount) | STD(posts.AcceptedAnswerId) | STD(posts.AnswerCount) | STD(posts.CommentCount) | STD(posts.FavoriteCount) | STD(posts.LastEditorUserId) | STD(posts.PostTypeId) | STD(posts.Score) | STD(posts.ViewCount) | SUM(posts.AcceptedAnswerId) | SUM(posts.AnswerCount) | SUM(posts.CommentCount) | SUM(posts.FavoriteCount) | SUM(posts.LastEditorUserId) | SUM(posts.PostTypeId) | SUM(posts.Score) | SUM(posts.ViewCount) | MAX(votes.BountyAmount) | MEAN(votes.BountyAmount) | MIN(votes.BountyAmount) | SKEW(votes.BountyAmount) | STD(votes.BountyAmount) | SUM(votes.BountyAmount) | MAX(posts.COUNT(votes)) | MAX(posts.MEAN(votes.BountyAmount)) | MAX(posts.MIN(votes.BountyAmount)) | MAX(posts.NUM_UNIQUE(votes.VoteTypeId)) | MAX(posts.SKEW(votes.BountyAmount)) | MAX(posts.STD(votes.BountyAmount)) | MAX(posts.SUM(votes.BountyAmount)) | MEAN(posts.COUNT(votes)) | MEAN(posts.MAX(votes.BountyAmount)) | MEAN(posts.MEAN(votes.BountyAmount)) | MEAN(posts.MIN(votes.BountyAmount)) | MEAN(posts.NUM_UNIQUE(votes.VoteTypeId)) | MEAN(posts.SKEW(votes.BountyAmount)) | MEAN(posts.STD(votes.BountyAmount)) | MEAN(posts.SUM(votes.BountyAmount)) | MIN(posts.COUNT(votes)) | MIN(posts.MAX(votes.BountyAmount)) | MIN(posts.MEAN(votes.BountyAmount)) | MIN(posts.NUM_UNIQUE(votes.VoteTypeId)) | MIN(posts.SKEW(votes.BountyAmount)) | MIN(posts.STD(votes.BountyAmount)) | MIN(posts.SUM(votes.BountyAmount)) | SKEW(posts.COUNT(votes)) | SKEW(posts.MAX(votes.BountyAmount)) | SKEW(posts.MEAN(votes.BountyAmount)) | SKEW(posts.MIN(votes.BountyAmount)) | SKEW(posts.NUM_UNIQUE(votes.VoteTypeId)) | SKEW(posts.SUM(votes.BountyAmount)) | STD(posts.COUNT(votes)) | STD(posts.MAX(votes.BountyAmount)) | STD(posts.MEAN(votes.BountyAmount)) | STD(posts.MIN(votes.BountyAmount)) | STD(posts.NUM_UNIQUE(votes.VoteTypeId)) | STD(posts.SUM(votes.BountyAmount)) | SUM(posts.MAX(votes.BountyAmount)) | SUM(posts.MEAN(votes.BountyAmount)) | SUM(posts.MIN(votes.BountyAmount)) | SUM(posts.NUM_UNIQUE(votes.VoteTypeId)) | SUM(posts.SKEW(votes.BountyAmount)) | SUM(posts.STD(votes.BountyAmount)) | MAX(votes.posts.AcceptedAnswerId) | MAX(votes.posts.AnswerCount) | MAX(votes.posts.CommentCount) | MAX(votes.posts.FavoriteCount) | MAX(votes.posts.LastEditorUserId) | MAX(votes.posts.PostTypeId) | MAX(votes.posts.Score) | MAX(votes.posts.ViewCount) | MEAN(votes.posts.AcceptedAnswerId) | MEAN(votes.posts.AnswerCount) | MEAN(votes.posts.CommentCount) | MEAN(votes.posts.FavoriteCount) | MEAN(votes.posts.LastEditorUserId) | MEAN(votes.posts.PostTypeId) | MEAN(votes.posts.Score) | MEAN(votes.posts.ViewCount) | MIN(votes.posts.AcceptedAnswerId) | MIN(votes.posts.AnswerCount) | MIN(votes.posts.CommentCount) | MIN(votes.posts.FavoriteCount) | MIN(votes.posts.LastEditorUserId) | MIN(votes.posts.PostTypeId) | MIN(votes.posts.Score) | MIN(votes.posts.ViewCount) | SKEW(votes.posts.AcceptedAnswerId) | SKEW(votes.posts.AnswerCount) | SKEW(votes.posts.CommentCount) | SKEW(votes.posts.FavoriteCount) | SKEW(votes.posts.LastEditorUserId) | SKEW(votes.posts.PostTypeId) | SKEW(votes.posts.Score) | SKEW(votes.posts.ViewCount) | STD(votes.posts.AcceptedAnswerId) | STD(votes.posts.AnswerCount) | STD(votes.posts.CommentCount) | STD(votes.posts.FavoriteCount) | STD(votes.posts.LastEditorUserId) | STD(votes.posts.PostTypeId) | STD(votes.posts.Score) | STD(votes.posts.ViewCount) | SUM(votes.posts.AcceptedAnswerId) | SUM(votes.posts.AnswerCount) | SUM(votes.posts.CommentCount) | SUM(votes.posts.FavoriteCount) | SUM(votes.posts.LastEditorUserId) | SUM(votes.posts.PostTypeId) | SUM(votes.posts.Score) | SUM(votes.posts.ViewCount) | SKEW(posts.STD(votes.BountyAmount)) | STD(posts.SKEW(votes.BountyAmount)) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | target | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | unused_float | unused_float |
0 | 1 | 0 | NULL | NULL | 211 | 10 | 16 | 1 | 19 | 19 | 7 | 7 | 0 | 0 | 2010 | 2010 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 16 | 1 | 0 | 5007 | 1920 | nan | nan | 0 | nan | 8588 | 7 | 0 | nan | nan | nan | 0 | nan | 306.1634 | 4.9005 | 0 | nan | nan | nan | 0 | nan | -1 | 3 | 0 | nan | nan | nan | 0 | nan | 4.9461 | -0.2648 | 0 | nan | nan | nan | 0 | nan | 1339.1554 | 0.5386 | 0 | nan | 0 | 0 | 0 | 0 | 61845 | 1034 | 0 | 0 | nan | nan | nan | nan | nan | 0 | 1 | nan | nan | 1 | nan | nan | 0 | 0.04739 | nan | nan | nan | 1 | nan | nan | 0 | 0 | nan | nan | 1 | nan | nan | 0 | 4.2908 | nan | nan | nan | 0 | 0 | 0.213 | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 10 | 0 | 0 | nan | nan | 0 | nan | 2116 | 5 | 0 | nan | nan | nan | 0 | nan | 2116 | 4.4 | 0 | nan | nan | nan | 0 | nan | 2116 | 3 | 0 | nan | nan | nan | 0 | nan | nan | -0.7801 | 0 | nan | nan | nan | 0 | nan | nan | 0.6992 | 0 | nan | 0 | 0 | 0 | 0 | 2116 | 44 | 0 | 0 | nan | nan |
1 | 101 | 3 | Autobiographer | 3 | 0 | 0 | NULL | NULL | 19 | 12 | 7 | 11 | 0 | 1 | 2010 | 2013 | 11 | 7 | 5 | 2010 | 3 | 3 | 2 | 1 | NULL | NULL | 25 | 3 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan |
2 | 101 | 3 | Autobiographer | 3 | 0 | 0 | NULL | NULL | 19 | 8 | 7 | 8 | 0 | 4 | 2010 | 2014 | 2 | 4 | 0 | 2010 | 3 | 3 | 3 | 2 | NULL | NULL | 22 | 19 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan |
3 | 6792 | 131 | Nice Answer | 45 | 117 | 2016 | 2 | 7 | 19 | 13 | 7 | 8 | 0 | 2 | 2010 | 2014 | 19 | 7 | 3 | 2010 | 30 | 12 | 7 | 5 | 2 | 3 | 1145 | 662 | 5 | 2251 | 56 | 7 | 137 | 22047 | 2 | 156 | 64481 | 1793.6667 | 10.7059 | 1.5812 | 29.8125 | 1004.902 | 1.8547 | 12.5043 | 8898.2941 | 566 | 1 | 0 | 2 | -1 | 1 | 1 | 286 | -1.5227 | 2.6791 | 1.2464 | 1.8953 | 4.868 | -2.0393 | 4.9894 | 2.5979 | 616.7514 | 13.4618 | 1.7628 | 45.5869 | 3514.2488 | 0.3539 | 22.4828 | 17078.287 | 16143 | 182 | 185 | 477 | 51250 | 217 | 1463 | 151271 | 50 | 50 | 50 | nan | nan | 50 | 303 | 50 | 50 | 4 | nan | nan | 50 | 17.2308 | 50 | 50 | 50 | 1.5299 | nan | nan | 0.4274 | 1 | 50 | 50 | 1 | nan | nan | 0 | 5.7832 | nan | nan | nan | 0.9689 | 10.8167 | 41.2949 | nan | nan | nan | 0.6238 | 4.6225 | 50 | 50 | 50 | 179 | 0 | 0 | 2251 | 56 | 7 | 137 | 22047 | 2 | 156 | 64481 | 1862.9954 | 25.5357 | 3.0511 | 90.5613 | 5358.5862 | 1.4648 | 64.5496 | 30912.241 | 566 | 1 | 0 | 2 | -1 | 1 | 1 | 286 | -1.6226 | 0.6212 | 0.3534 | -0.561 | 1.3063 | 0.1413 | 0.5721 | 0.2586 | 554.8324 | 20.2126 | 2.4432 | 54.3408 | 8826.2062 | 0.4989 | 62.094 | 23812.9928 | 402407 | 27553 | 6151 | 97444 | 7276960 | 2953 | 130132 | 33354308 | nan | nan |
4 | 457 | 19 | Announcer | 16 | 12 | 94 | 2 | 3 | 19 | 7 | 7 | 8 | 0 | 3 | 2010 | 2014 | 19 | 7 | 2 | 2010 | 11 | 8 | 5 | 5 | 2 | 2 | 114 | 47 | 0 | 39226 | 2 | 5 | 4 | 930 | 2 | 40 | 981 | 22179 | 1 | 1.8333 | 2.3333 | 187.3333 | 1.5 | 7.0833 | 507.6667 | 3832 | 0 | 0 | 1 | 6 | 1 | 0 | 122 | -0.267 | 0 | 0.613 | 0.9352 | 2.3842 | 0 | 2.814 | 0.2603 | 14488.6332 | 0.6325 | 1.6422 | 1.5275 | 366.0419 | 0.5222 | 11.0327 | 399.5571 | 88716 | 6 | 22 | 7 | 1124 | 18 | 85 | 3046 | nan | nan | nan | nan | nan | 0 | 40 | nan | nan | 2 | nan | nan | 0 | 7.8333 | nan | nan | nan | 1.3333 | nan | nan | 0 | 1 | nan | nan | 1 | nan | nan | 0 | 2.6072 | nan | nan | nan | 0.8124 | 0 | 11.044 | nan | nan | nan | 0.4924 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 39226 | 2 | 5 | 4 | 930 | 2 | 40 | 981 | 23376.875 | 1.1 | 1.4681 | 2.9545 | 96 | 1.6809 | 21.2553 | 706.7 | 3832 | 0 | 0 | 1 | 6 | 1 | 0 | 122 | -0.2995 | 0.8834 | 0.9674 | -0.5825 | 3.1415 | -0.7886 | 0.1943 | -0.8643 | 13635.7689 | 0.4026 | 1.3496 | 1.3266 | 245.0523 | 0.4686 | 16.5282 | 334.6668 | 374030 | 33 | 138 | 65 | 6240 | 158 | 1998 | 21201 | nan | nan |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
32328 | 1 | 0 | NULL | NULL | 0 | 0 | NULL | NULL | 13 | 13 | 9 | 9 | 5 | 5 | 2014 | 2014 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 0 | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan |
32329 | 101 | 1 | Supporter | 1 | 0 | 0 | NULL | NULL | 13 | 13 | 9 | 9 | 5 | 5 | 2014 | 2014 | 13 | 9 | 5 | 2014 | 1 | 1 | 1 | 1 | NULL | NULL | 0 | 1 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan |
32330 | 6 | 1 | Student | 1 | 1 | 2 | 2 | 2 | 13 | 13 | 9 | 9 | 5 | 5 | 2014 | 2014 | 13 | 9 | 5 | 2014 | 1 | 1 | 1 | 1 | 2 | 1 | 0 | 0 | 0 | nan | 1 | 0 | 1 | nan | 1 | 1 | 17 | nan | 1 | 0 | 1 | nan | 1 | 1 | 17 | nan | 1 | 0 | 1 | nan | 1 | 1 | 17 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 17 | nan | nan | nan | nan | nan | 0 | 2 | nan | nan | 2 | nan | nan | 0 | 2 | nan | nan | nan | 2 | nan | nan | 0 | 2 | nan | nan | 2 | nan | nan | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 2 | 0 | 0 | nan | 1 | 0 | 1 | nan | 1 | 1 | 17 | nan | 1 | 0 | 1 | nan | 1 | 1 | 17 | nan | 1 | 0 | 1 | nan | 1 | 1 | 17 | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | nan | 0 | 0 | 0 | 0 | 2 | 0 | 2 | 0 | 2 | 2 | 34 | nan | nan |
32331 | 101 | 0 | NULL | NULL | 0 | 0 | NULL | NULL | 13 | 13 | 9 | 9 | 5 | 5 | 2014 | 2014 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 0 | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan |
32332 | 1 | 0 | NULL | NULL | 0 | 0 | NULL | NULL | 14 | 14 | 9 | 9 | 6 | 6 | 2014 | 2014 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 0 | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan |
32333 rows x 173 columns
memory usage: 41.52 MB
name: featuretools_train
type: getml.DataFrame
featuretools_test.set_role("Reputation", getml.data.roles.target)
featuretools_test.set_role(featuretools_test.roles.unused_float, getml.data.roles.numerical)
featuretools_test.set_role(featuretools_test.roles.unused_string, getml.data.roles.categorical)
featuretools_test.set_role(["SKEW(posts.STD(votes.BountyAmount))", "STD(posts.SKEW(votes.BountyAmount))"], getml.data.roles.unused_float)
featuretools_test
name | Reputation | COUNT(badges) | MODE(badges.Name) | NUM_UNIQUE(badges.Name) | COUNT(posts) | COUNT(votes) | MODE(votes.VoteTypeId) | NUM_UNIQUE(votes.VoteTypeId) | DAY(CreationDate) | DAY(LastAccessDate) | MONTH(CreationDate) | MONTH(LastAccessDate) | WEEKDAY(CreationDate) | WEEKDAY(LastAccessDate) | YEAR(CreationDate) | YEAR(LastAccessDate) | MODE(badges.DAY(Date)) | MODE(badges.MONTH(Date)) | MODE(badges.WEEKDAY(Date)) | MODE(badges.YEAR(Date)) | NUM_UNIQUE(badges.DAY(Date)) | NUM_UNIQUE(badges.MONTH(Date)) | NUM_UNIQUE(badges.WEEKDAY(Date)) | NUM_UNIQUE(badges.YEAR(Date)) | MODE(posts.MODE(votes.VoteTypeId)) | NUM_UNIQUE(posts.MODE(votes.VoteTypeId)) | Views | UpVotes | DownVotes | MAX(posts.AcceptedAnswerId) | MAX(posts.AnswerCount) | MAX(posts.CommentCount) | MAX(posts.FavoriteCount) | MAX(posts.LastEditorUserId) | MAX(posts.PostTypeId) | MAX(posts.Score) | MAX(posts.ViewCount) | MEAN(posts.AcceptedAnswerId) | MEAN(posts.AnswerCount) | MEAN(posts.CommentCount) | MEAN(posts.FavoriteCount) | MEAN(posts.LastEditorUserId) | MEAN(posts.PostTypeId) | MEAN(posts.Score) | MEAN(posts.ViewCount) | MIN(posts.AcceptedAnswerId) | MIN(posts.AnswerCount) | MIN(posts.CommentCount) | MIN(posts.FavoriteCount) | MIN(posts.LastEditorUserId) | MIN(posts.PostTypeId) | MIN(posts.Score) | MIN(posts.ViewCount) | SKEW(posts.AcceptedAnswerId) | SKEW(posts.AnswerCount) | SKEW(posts.CommentCount) | SKEW(posts.FavoriteCount) | SKEW(posts.LastEditorUserId) | SKEW(posts.PostTypeId) | SKEW(posts.Score) | SKEW(posts.ViewCount) | STD(posts.AcceptedAnswerId) | STD(posts.AnswerCount) | STD(posts.CommentCount) | STD(posts.FavoriteCount) | STD(posts.LastEditorUserId) | STD(posts.PostTypeId) | STD(posts.Score) | STD(posts.ViewCount) | SUM(posts.AcceptedAnswerId) | SUM(posts.AnswerCount) | SUM(posts.CommentCount) | SUM(posts.FavoriteCount) | SUM(posts.LastEditorUserId) | SUM(posts.PostTypeId) | SUM(posts.Score) | SUM(posts.ViewCount) | MAX(votes.BountyAmount) | MEAN(votes.BountyAmount) | MIN(votes.BountyAmount) | SKEW(votes.BountyAmount) | STD(votes.BountyAmount) | SUM(votes.BountyAmount) | MAX(posts.COUNT(votes)) | MAX(posts.MEAN(votes.BountyAmount)) | MAX(posts.MIN(votes.BountyAmount)) | MAX(posts.NUM_UNIQUE(votes.VoteTypeId)) | MAX(posts.SKEW(votes.BountyAmount)) | MAX(posts.STD(votes.BountyAmount)) | MAX(posts.SUM(votes.BountyAmount)) | MEAN(posts.COUNT(votes)) | MEAN(posts.MAX(votes.BountyAmount)) | MEAN(posts.MEAN(votes.BountyAmount)) | MEAN(posts.MIN(votes.BountyAmount)) | MEAN(posts.NUM_UNIQUE(votes.VoteTypeId)) | MEAN(posts.SKEW(votes.BountyAmount)) | MEAN(posts.STD(votes.BountyAmount)) | MEAN(posts.SUM(votes.BountyAmount)) | MIN(posts.COUNT(votes)) | MIN(posts.MAX(votes.BountyAmount)) | MIN(posts.MEAN(votes.BountyAmount)) | MIN(posts.NUM_UNIQUE(votes.VoteTypeId)) | MIN(posts.SKEW(votes.BountyAmount)) | MIN(posts.STD(votes.BountyAmount)) | MIN(posts.SUM(votes.BountyAmount)) | SKEW(posts.COUNT(votes)) | SKEW(posts.MAX(votes.BountyAmount)) | SKEW(posts.MEAN(votes.BountyAmount)) | SKEW(posts.MIN(votes.BountyAmount)) | SKEW(posts.NUM_UNIQUE(votes.VoteTypeId)) | SKEW(posts.SUM(votes.BountyAmount)) | STD(posts.COUNT(votes)) | STD(posts.MAX(votes.BountyAmount)) | STD(posts.MEAN(votes.BountyAmount)) | STD(posts.MIN(votes.BountyAmount)) | STD(posts.NUM_UNIQUE(votes.VoteTypeId)) | STD(posts.SUM(votes.BountyAmount)) | SUM(posts.MAX(votes.BountyAmount)) | SUM(posts.MEAN(votes.BountyAmount)) | SUM(posts.MIN(votes.BountyAmount)) | SUM(posts.NUM_UNIQUE(votes.VoteTypeId)) | SUM(posts.SKEW(votes.BountyAmount)) | SUM(posts.STD(votes.BountyAmount)) | MAX(votes.posts.AcceptedAnswerId) | MAX(votes.posts.AnswerCount) | MAX(votes.posts.CommentCount) | MAX(votes.posts.FavoriteCount) | MAX(votes.posts.LastEditorUserId) | MAX(votes.posts.PostTypeId) | MAX(votes.posts.Score) | MAX(votes.posts.ViewCount) | MEAN(votes.posts.AcceptedAnswerId) | MEAN(votes.posts.AnswerCount) | MEAN(votes.posts.CommentCount) | MEAN(votes.posts.FavoriteCount) | MEAN(votes.posts.LastEditorUserId) | MEAN(votes.posts.PostTypeId) | MEAN(votes.posts.Score) | MEAN(votes.posts.ViewCount) | MIN(votes.posts.AcceptedAnswerId) | MIN(votes.posts.AnswerCount) | MIN(votes.posts.CommentCount) | MIN(votes.posts.FavoriteCount) | MIN(votes.posts.LastEditorUserId) | MIN(votes.posts.PostTypeId) | MIN(votes.posts.Score) | MIN(votes.posts.ViewCount) | SKEW(votes.posts.AcceptedAnswerId) | SKEW(votes.posts.AnswerCount) | SKEW(votes.posts.CommentCount) | SKEW(votes.posts.FavoriteCount) | SKEW(votes.posts.LastEditorUserId) | SKEW(votes.posts.PostTypeId) | SKEW(votes.posts.Score) | SKEW(votes.posts.ViewCount) | STD(votes.posts.AcceptedAnswerId) | STD(votes.posts.AnswerCount) | STD(votes.posts.CommentCount) | STD(votes.posts.FavoriteCount) | STD(votes.posts.LastEditorUserId) | STD(votes.posts.PostTypeId) | STD(votes.posts.Score) | STD(votes.posts.ViewCount) | SUM(votes.posts.AcceptedAnswerId) | SUM(votes.posts.AnswerCount) | SUM(votes.posts.CommentCount) | SUM(votes.posts.FavoriteCount) | SUM(votes.posts.LastEditorUserId) | SUM(votes.posts.PostTypeId) | SUM(votes.posts.Score) | SUM(votes.posts.ViewCount) | SKEW(posts.STD(votes.BountyAmount)) | STD(posts.SKEW(votes.BountyAmount)) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | target | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | unused_float | unused_float |
0 | 101 | 2 | Autobiographer | 2 | 0 | 0 | NULL | NULL | 19 | 2 | 7 | 1 | 0 | 3 | 2010 | 2014 | 8 | 6 | 5 | 2011 | 2 | 2 | 1 | 2 | NULL | NULL | 11 | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan |
1 | 429 | 10 | Yearling | 9 | 2 | 59 | 2 | 4 | 19 | 10 | 7 | 9 | 0 | 2 | 2010 | 2014 | 19 | 7 | 2 | 2010 | 8 | 6 | 5 | 4 | 2 | 1 | 56 | 20 | 0 | nan | 7 | 4 | 7 | 17230 | 2 | 26 | 1582 | nan | 7 | 3 | 7 | 8659 | 1.5 | 23.5 | 1582 | nan | 7 | 2 | 7 | 88 | 1 | 21 | 1582 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1.4142 | nan | 12121.2244 | 0.7071 | 3.5355 | nan | 0 | 7 | 6 | 7 | 17318 | 3 | 47 | 1582 | nan | nan | nan | nan | nan | 0 | 30 | nan | nan | 3 | nan | nan | 0 | 29.5 | nan | nan | nan | 3 | nan | nan | 0 | 29 | nan | nan | 3 | nan | nan | 0 | nan | nan | nan | nan | nan | nan | 0.7071 | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | nan | 7 | 4 | 7 | 17230 | 2 | 26 | 1582 | nan | 7 | 2.9831 | 7 | 8513.7288 | 1.4915 | 23.4576 | 1582 | nan | 7 | 2 | 7 | 88 | 1 | 21 | 1582 | nan | 0 | 0.03479 | 0 | 0.03479 | 0.03479 | 0.03479 | 0 | nan | 0 | 1.0084 | 0 | 8643.3304 | 0.5042 | 2.5211 | 0 | 0 | 210 | 176 | 210 | 502310 | 88 | 1384 | 47460 | nan | nan |
2 | 101 | 5 | Autobiographer | 5 | 1 | 2 | 2 | 1 | 19 | 9 | 7 | 9 | 0 | 1 | 2010 | 2014 | 19 | 7 | 0 | 2010 | 4 | 3 | 4 | 2 | 2 | 1 | 10 | 5 | 0 | nan | nan | 0 | nan | 615 | 2 | 2 | nan | nan | nan | 0 | nan | 615 | 2 | 2 | nan | nan | nan | 0 | nan | 615 | 2 | 2 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 615 | 2 | 2 | 0 | nan | nan | nan | nan | nan | 0 | 2 | nan | nan | 1 | nan | nan | 0 | 2 | nan | nan | nan | 1 | nan | nan | 0 | 2 | nan | nan | 1 | nan | nan | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 1 | 0 | 0 | nan | nan | 0 | nan | 615 | 2 | 2 | nan | nan | nan | 0 | nan | 615 | 2 | 2 | nan | nan | nan | 0 | nan | 615 | 2 | 2 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | nan | 0 | 0 | 0 | nan | 0 | 0 | 0 | 0 | 1230 | 4 | 4 | 0 | nan | nan |
3 | 2185 | 44 | Announcer | 20 | 39 | 499 | 2 | 5 | 19 | 27 | 7 | 8 | 0 | 2 | 2010 | 2014 | 10 | 7 | 1 | 2012 | 23 | 12 | 7 | 5 | 2 | 2 | 312 | 40 | 3 | 15396 | 11 | 10 | 21 | 28666 | 2 | 74 | 1926 | 7502.75 | 2.4 | 2.1282 | 5.7692 | 3279.1053 | 1.4872 | 10.5641 | 655.75 | 844 | 0 | 0 | 1 | 30 | 1 | 0 | 73 | 0.3969 | 2.2973 | 1.2348 | 1.4673 | 2.7866 | 0.05337 | 3.4381 | 1.0282 | 4550.3906 | 2.5215 | 2.4299 | 7.0018 | 7949.6098 | 0.5064 | 15.5306 | 588.8215 | 60022 | 48 | 83 | 75 | 62303 | 58 | 412 | 13115 | nan | nan | nan | nan | nan | 0 | 74 | nan | nan | 4 | nan | nan | 0 | 12.7949 | nan | nan | nan | 1.5789 | nan | nan | 0 | 0 | nan | nan | 1 | nan | nan | 0 | 2.6117 | nan | nan | nan | 1.3039 | 0 | 17.2095 | nan | nan | nan | 0.7215 | 0 | 0 | 0 | 0 | 60 | 0 | 0 | 15396 | 11 | 10 | 21 | 28666 | 2 | 74 | 1926 | 6143.6699 | 4.4669 | 2.6934 | 11.1415 | 8242.8584 | 1.515 | 30.1603 | 1191.1488 | 844 | 0 | 0 | 1 | 30 | 1 | 1 | 73 | 0.5685 | 0.9218 | 0.7248 | -0.0435 | 0.8061 | -0.06033 | 0.7466 | -0.3801 | 3575.4252 | 3.6326 | 2.4585 | 8.0147 | 11448.7075 | 0.5003 | 27.7176 | 636.2425 | 632798 | 1081 | 1344 | 2362 | 2736629 | 756 | 15050 | 288258 | nan | nan |
4 | 1651 | 22 | Nice Answer | 14 | 14 | 172 | 2 | 4 | 19 | 7 | 7 | 8 | 0 | 3 | 2010 | 2014 | 19 | 7 | 0 | 2010 | 12 | 9 | 7 | 5 | 2 | 1 | 255 | 12 | 0 | nan | 4 | 5 | 6 | 805 | 2 | 37 | 413 | nan | 2.5 | 1.7857 | 4.5 | 102.8333 | 1.8571 | 11.1429 | 376 | nan | 1 | 0 | 3 | 39 | 1 | 1 | 339 | nan | nan | 0.8296 | nan | 3.4641 | -2.2948 | 1.703 | nan | nan | 2.1213 | 1.8051 | 2.1213 | 221.1252 | 0.3631 | 9.5905 | 52.3259 | 0 | 5 | 25 | 9 | 1234 | 26 | 156 | 752 | nan | nan | nan | nan | nan | 0 | 37 | nan | nan | 3 | nan | nan | 0 | 12.2857 | nan | nan | nan | 1.6429 | nan | nan | 0 | 1 | nan | nan | 1 | nan | nan | 0 | 1.3957 | nan | nan | nan | 0.4326 | 0 | 9.7067 | nan | nan | nan | 0.6333 | 0 | 0 | 0 | 0 | 23 | 0 | 0 | nan | 4 | 5 | 6 | 805 | 2 | 37 | 413 | nan | 2.4483 | 1.9419 | 4.5517 | 177.3056 | 1.8314 | 18.0756 | 377.2759 | nan | 1 | 0 | 3 | 39 | 1 | 1 | 339 | nan | 0.07283 | 0.7614 | -0.07283 | 1.6785 | -1.7859 | 0.6603 | -0.07283 | nan | 1.5256 | 1.6533 | 1.5256 | 295.6699 | 0.3755 | 11.4253 | 37.6325 | 0 | 71 | 334 | 132 | 25532 | 315 | 3109 | 10941 | nan | nan |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
7987 | 1 | 0 | NULL | NULL | 0 | 0 | NULL | NULL | 13 | 13 | 9 | 9 | 5 | 5 | 2014 | 2014 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 0 | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan |
7988 | 11 | 1 | Student | 1 | 1 | 2 | 2 | 1 | 13 | 14 | 9 | 9 | 5 | 6 | 2014 | 2014 | 13 | 9 | 5 | 2014 | 1 | 1 | 1 | 1 | 2 | 1 | 0 | 0 | 0 | nan | 0 | 4 | nan | nan | 1 | 2 | 40 | nan | 0 | 4 | nan | nan | 1 | 2 | 40 | nan | 0 | 4 | nan | nan | 1 | 2 | 40 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 4 | 0 | 0 | 1 | 2 | 40 | nan | nan | nan | nan | nan | 0 | 2 | nan | nan | 1 | nan | nan | 0 | 2 | nan | nan | nan | 1 | nan | nan | 0 | 2 | nan | nan | 1 | nan | nan | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 1 | 0 | 0 | nan | 0 | 4 | nan | nan | 1 | 2 | 40 | nan | 0 | 4 | nan | nan | 1 | 2 | 40 | nan | 0 | 4 | nan | nan | 1 | 2 | 40 | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | nan | nan | 0 | 0 | 0 | 0 | 0 | 8 | 0 | 0 | 2 | 4 | 80 | nan | nan |
7989 | 1 | 0 | NULL | NULL | 0 | 0 | NULL | NULL | 13 | 13 | 9 | 9 | 5 | 5 | 2014 | 2014 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 0 | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | nan | nan |
7990 | 6 | 1 | Student | 1 | 1 | 1 | 2 | 1 | 13 | 13 | 9 | 9 | 5 | 5 | 2014 | 2014 | 13 | 9 | 5 | 2014 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 0 | 0 | nan | 0 | 2 | nan | nan | 1 | 1 | 13 | nan | 0 | 2 | nan | nan | 1 | 1 | 13 | nan | 0 | 2 | nan | nan | 1 | 1 | 13 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 2 | 0 | 0 | 1 | 1 | 13 | nan | nan | nan | nan | nan | 0 | 1 | nan | nan | 1 | nan | nan | 0 | 1 | nan | nan | nan | 1 | nan | nan | 0 | 1 | nan | nan | 1 | nan | nan | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 1 | 0 | 0 | nan | 0 | 2 | nan | nan | 1 | 1 | 13 | nan | 0 | 2 | nan | nan | 1 | 1 | 13 | nan | 0 | 2 | nan | nan | 1 | 1 | 13 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 2 | 0 | 0 | 1 | 1 | 13 | nan | nan |
7991 | 106 | 1 | Student | 1 | 1 | 1 | 2 | 1 | 14 | 14 | 9 | 9 | 6 | 6 | 2014 | 2014 | 14 | 9 | 6 | 2014 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 0 | 0 | nan | 0 | 2 | nan | 7290 | 1 | 1 | 5 | nan | 0 | 2 | nan | 7290 | 1 | 1 | 5 | nan | 0 | 2 | nan | 7290 | 1 | 1 | 5 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 2 | 0 | 7290 | 1 | 1 | 5 | nan | nan | nan | nan | nan | 0 | 1 | nan | nan | 1 | nan | nan | 0 | 1 | nan | nan | nan | 1 | nan | nan | 0 | 1 | nan | nan | 1 | nan | nan | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 1 | 0 | 0 | nan | 0 | 2 | nan | 7290 | 1 | 1 | 5 | nan | 0 | 2 | nan | 7290 | 1 | 1 | 5 | nan | 0 | 2 | nan | 7290 | 1 | 1 | 5 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 2 | 0 | 7290 | 1 | 1 | 5 | nan | nan |
7992 rows x 173 columns
memory usage: 10.26 MB
name: featuretools_test
type: getml.DataFrame
We train an untuned XGBoostRegressor on top of featuretools' features, just like we have done for getML's features.
Since some of featuretools features are categorical, we allow the pipeline to include these features as well. Other features contain NaN values, which is why we also apply getML's Imputation preprocessor.
imputation = getml.preprocessors.Imputation()
predictor = getml.predictors.XGBoostRegressor(n_jobs=1)
pipe3 = getml.Pipeline(
tags=['featuretools'],
preprocessors=[imputation],
predictors=[predictor],
include_categorical=True,
)
pipe3
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=True, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=['Imputation'], share_selected_features=0.5, tags=['featuretools'])
pipe3.fit(featuretools_train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:08
Trained pipeline.
Time taken: 0:00:08.607237.
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=True, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=['Imputation'], share_selected_features=0.5, tags=['featuretools'])
featuretools_score = pipe3.score(featuretools_test)
featuretools_score
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | mae | rmse | rsquared | |
---|---|---|---|---|---|---|
0 | 2024-09-12 13:00:56 | featuretools_train | Reputation | 32.3818 | 48.2617 | 0.9969 |
1 | 2024-09-12 13:00:57 | featuretools_test | Reputation | 34.219 | 78.0558 | 0.9684 |
2.6 Features¶
The most important features look as follows:
pipe1.features.to_sql()[pipe1.features.sort(by="importances")[0].name]
DROP TABLE IF EXISTS "FEATURE_1_87";
CREATE TABLE "FEATURE_1_87" AS
SELECT SUM( t2."score" ) AS "feature_1_87",
t1.rowid AS rownum
FROM "USERS__STAGING_TABLE_1" t1
INNER JOIN "POSTS__STAGING_TABLE_3" t2
ON t1."id" = t2."owneruserid"
WHERE t2."posttypeid" = '2'
GROUP BY t1.rowid;
pipe2.features.to_sql()[pipe2.features.sort(by="importances")[0].name]
DROP TABLE IF EXISTS "FEATURE_1_1";
CREATE TABLE "FEATURE_1_1" AS
SELECT SUM(
CASE
WHEN ( f_1_2."feature_1_2_13" > 44.522517 ) AND ( f_1_2."feature_1_2_12" > 1114.704894 ) THEN 883.3731803463131
WHEN ( f_1_2."feature_1_2_13" > 44.522517 ) AND ( f_1_2."feature_1_2_12" <= 1114.704894 ) AND ( t2."t3__answercount" > 1.000000 ) THEN 180.6530491037182
WHEN ( f_1_2."feature_1_2_13" > 44.522517 ) AND ( f_1_2."feature_1_2_12" <= 1114.704894 ) AND ( t2."t3__answercount" <= 1.000000 OR t2."t3__answercount" IS NULL ) THEN 117.0240683994715
WHEN ( f_1_2."feature_1_2_13" <= 44.522517 ) AND ( f_1_2."feature_1_2_16" > 6.908843 ) AND ( t1."upvotes" > 2258.000000 ) THEN 12.25445123010827
WHEN ( f_1_2."feature_1_2_13" <= 44.522517 ) AND ( f_1_2."feature_1_2_16" > 6.908843 ) AND ( t1."upvotes" <= 2258.000000 OR t1."upvotes" IS NULL ) THEN 14.16101412804239
WHEN ( f_1_2."feature_1_2_13" <= 44.522517 ) AND ( f_1_2."feature_1_2_16" <= 6.908843 ) AND ( t1."upvotes__mapping_target_1_avg" > 90.840760 ) THEN 6.041694975904914
WHEN ( f_1_2."feature_1_2_13" <= 44.522517 ) AND ( f_1_2."feature_1_2_16" <= 6.908843 ) AND ( t1."upvotes__mapping_target_1_avg" <= 90.840760 OR t1."upvotes__mapping_target_1_avg" IS NULL ) THEN -1.015010571805766
ELSE NULL
END
) AS "feature_1_1",
t1.rowid AS rownum
FROM "USERS__STAGING_TABLE_1" t1
INNER JOIN "POSTS__STAGING_TABLE_3" t2
ON t1."id" = t2."owneruserid"
LEFT JOIN "FEATURES_1_2" f_1_2
ON t2.rowid = f_1_2."rownum"
GROUP BY t1.rowid;
2.7 Productionization¶
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3
and spark
modules.
# Creates a folder named stats_pipeline containing
# the SQL code.
pipe2.features.to_sql().save("stats_pipeline")
pipe2.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("stats_spark")
2.8 Discussion¶
For a more convenient overview, we summarize our results into a table.
scores = [fastprop_score, relboost_score, featuretools_score]
pd.DataFrame(data={
'Name': ['getML: FastProp', 'getML: Relboost', 'featuretools'],
'R-squared': [f'{score.rsquared:.2%}' for score in scores],
'RMSE': [f'{score.rmse:,.2f}' for score in scores],
'MAE': [f'{score.mae:,.2f}' for score in scores]
})
Name | R-squared | RMSE | MAE | |
---|---|---|---|---|
0 | getML: FastProp | 97.77% | 65.33 | 33.61 |
1 | getML: Relboost | 97.28% | 75.66 | 32.27 |
2 | featuretools | 96.84% | 78.06 | 34.22 |
getml.engine.shutdown()
3. Conclusion¶
In this notebook, we have benchmarked several relational learning algorithms based on their ability to reverse-engineer StatsExchange's system of calculating users' reputation. We have found that all relational learning algorithm can reverse-engineer the system with high levels of accuracy, but we have also found that getML nonetheless outperforms featuretools.
References¶
Motl, Jan, and Oliver Schulte. "The CTU prague relational learning repository." arXiv preprint arXiv:1511.03086 (2015).