Robot - Feature engineering on sensor data¶
The purpose of this notebook is to illustrate how we can overcome the feature explosion problem, using an example dataset of robot sensor data.
Summary:
- Prediction type: Regression
- Domain: Robotics
- Prediction target: The force vector on the robot's arm
- Population size: 15001
Feature explosion¶
The problem¶
The feature explosion problem is one of the most important issues in automated feature engineering. In fact, it is probably the main reason why automated feature engineering is not already the norm in data science projects involving business data.
To illustrate the problem, consider how data scientists write features for a simple time series problem:
SELECT SOME_AGGREGATION(t2.some_column)  -- one column is aggregated
FROM some_table t1
LEFT JOIN some_table t2
ON t1.join_key = t2.join_key
WHERE t2.some_other_column >= some_value  -- another column is used in a condition
AND t2.rowid <= t1.rowid  -- only look at past rows
AND t2.rowid + some_other_value > t1.rowid  -- within a limited window
GROUP BY t1.rowid;
Think about that for a second.
Every column we have can either be aggregated (some_column) or used in a condition (some_other_column). That means that for each of the $n$ columns we could aggregate, we can potentially build conditions on $n$ other columns. In other words, the size of the search space grows as $n^2$ in the number of columns.
Note that this problem occurs regardless of whether you automate feature engineering or write the features by hand. The size of the search space is $n^2$ in the number of columns either way, unless you can rule something out a priori.
This problem is known as feature explosion.
The solution¶
So when we have relational data or time series with many columns, what do we do? The answer is to write different features. Specifically, suppose we had features like this:
SELECT SOME_AGGREGATION(
CASE
WHEN t2.some_column > some_value THEN weight1  -- learnable weight
WHEN t2.some_column <= some_value THEN weight2  -- learnable weight
END
)
FROM some_table t1
LEFT JOIN some_table t2
ON t1.join_key = t2.join_key
WHERE t2.rowid <= t1.rowid  -- only look at past rows
AND t2.rowid + some_other_value > t1.rowid  -- within a limited window
GROUP BY t1.rowid;
weight1 and weight2 are learnable weights. An algorithm that generates features like this only uses columns in conditions; it is not allowed to aggregate columns – and it does not need to.
That means the computational complexity is linear instead of quadratic in the number of columns. For data sets with many columns, this can make all the difference in the world. For instance, if you have 100 columns, the search space of the second approach is only 1% of the size of the search space of the first one.
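To make this concrete, here is a quick back-of-the-envelope calculation (plain Python, independent of getML):
# Compare the sizes of the two search spaces for 100 columns.
n_columns = 100
# Traditional features: any column can be aggregated AND any column can
# appear in a condition, so the search space grows quadratically.
traditional = n_columns ** 2  # 10,000
# Weight-based features: columns are only used in conditions, so the
# search space grows linearly.
weight_based = n_columns  # 100
print(f"{weight_based / traditional:.0%}")  # 1%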
getML features an algorithm called relboost, which generates features according to this principle and is therefore well suited for data sets with many columns.
The data set¶
To illustrate the problem, we use a data set related to robotics. When robots interact with humans, the most important thing is that they do not hurt people. To prevent such accidents, the force vector on the robot's arm is measured. However, measuring the force vector directly is expensive.
We therefore want to consider an alternative approach: predicting the force vector from other sensor data that are less costly to measure. To do so, we use machine learning.
However, the data set contains measurements from almost 100 different sensors and we do not know which and how many sensors are relevant for predicting the force vector.
The data set has been generously provided by Erik Berger who originally collected it for his dissertation:
Berger, E. (2018). Behavior-Specific Proprioception Models for Robotic Force Estimation: A Machine Learning Approach. Freiberg, Germany: Technische Universitaet Bergakademie Freiberg.
1. Loading data¶
We begin by importing the libraries and setting the project.
%pip install -q "getml==1.5.0" "matplotlib==3.9.2" "ipywidgets==8.1.5"
import matplotlib.pyplot as plt
import getml
%matplotlib inline
print(f"getML API version: {getml.__version__}\n")
getML API version: 1.5.0
getml.engine.launch(allow_remote_ips=True, token='token')
getml.engine.set_project('robot')
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912154610.log. Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Connected to project 'robot'.
1.1 Download from source¶
data_all = getml.data.DataFrame.from_csv(
"https://static.getml.com/datasets/robotarm/robot-demo.csv",
"data_all"
)
data_all
name | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | f_x | f_y | f_z |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float |
0 | 3.4098 | -0.3274 | 0.9604 | -3.7436 | -1.0191 | -6.0205 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.38e-17 | -4.8116 | -1.4033 | -0.1369 | 0.002472 | 0 | 9.803e-16 | -55.642 | -16.312 | -1.2042 | 0.02167 | 0 | 3.4098 | -0.3274 | 0.9605 | -3.7437 | -1.0191 | -6.0205 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1233 | -6.5483 | -2.8045 | -0.8296 | 0.07625 | -0.1906 | 0.1211 | -6.5483 | -2.8157 | -0.8281 | 0.07015 | -0.1983 | 0.7699 | 0.41 | 0.08279 | -1.4094 | 0.786 | -0.3682 | 0 | 0 | 0 | 0 | 0 | 0 | -22.654 | -11.503 | -18.673 | -3.5155 | 5.8354 | -2.05 | 0.7699 | 0.41 | 0.08278 | -1.4094 | 0.786 | -0.3681 | 0 | 0 | 0 | 0 | 0 | 0 | 48.069 | 48.009 | 0.9668 | 47.834 | 47.925 | 47.818 | 47.834 | 47.955 | 47.971 | -11.03 | 6.9 | -7.33 |
1 | 3.4098 | -0.3274 | 0.9604 | -3.7436 | -1.0191 | -6.0205 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.38e-17 | -4.8116 | -1.4033 | -0.1369 | 0.002472 | 0 | 9.803e-16 | -55.642 | -16.312 | -1.2042 | 0.02167 | 0 | 3.4098 | -0.3274 | 0.9604 | -3.7437 | -1.0191 | -6.0205 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1188 | -6.5506 | -2.8404 | -0.8281 | 0.06405 | -0.1998 | 0.1211 | -6.5483 | -2.8157 | -0.8281 | 0.07015 | -0.1983 | 0.7699 | 0.41 | 0.0828 | -1.4094 | 0.7859 | -0.3682 | 0 | 0 | 0 | 0 | 0 | 0 | -21.627 | -11.046 | -18.66 | -3.5395 | 5.7577 | -1.9805 | 0.7699 | 0.41 | 0.08278 | -1.4094 | 0.786 | -0.3681 | 0 | 0 | 0 | 0 | 0 | 0 | 48.009 | 48.009 | 0.8594 | 47.834 | 47.925 | 47.818 | 47.834 | 47.955 | 47.971 | -10.848 | 6.7218 | -7.4427 |
2 | 3.4098 | -0.3274 | 0.9604 | -3.7436 | -1.0191 | -6.0205 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.38e-17 | -4.8116 | -1.4033 | -0.1369 | 0.002472 | 0 | 9.803e-16 | -55.642 | -16.312 | -1.2042 | 0.02167 | 0 | 3.4098 | -0.3274 | 0.9605 | -3.7437 | -1.0191 | -6.0205 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1099 | -6.5438 | -2.8 | -0.8205 | 0.07473 | -0.183 | 0.1211 | -6.5483 | -2.8157 | -0.8281 | 0.07015 | -0.1922 | 0.7699 | 0.41 | 0.08279 | -1.4094 | 0.7859 | -0.3682 | 0 | 0 | 0 | 0 | 0 | 0 | -23.843 | -12.127 | -18.393 | -3.6453 | 5.978 | -1.9978 | 0.7699 | 0.41 | 0.08278 | -1.4094 | 0.786 | -0.3681 | 0 | 0 | 0 | 0 | 0 | 0 | 48.009 | 48.069 | 0.931 | 47.879 | 47.925 | 47.818 | 47.834 | 47.955 | 47.971 | -10.666 | 6.5436 | -7.5555 |
3 | 3.4098 | -0.3274 | 0.9604 | -3.7436 | -1.0191 | -6.0205 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.38e-17 | -4.8116 | -1.4033 | -0.1369 | 0.002472 | 0 | 9.803e-16 | -55.642 | -16.312 | -1.2042 | 0.02167 | 0 | 3.4098 | -0.3273 | 0.9604 | -3.7437 | -1.0191 | -6.0205 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1233 | -6.5483 | -2.8224 | -0.8266 | 0.07168 | -0.1998 | 0.1211 | -6.5483 | -2.8157 | -0.8281 | 0.07015 | -0.1967 | 0.7699 | 0.41 | 0.08275 | -1.4094 | 0.786 | -0.3681 | 0 | 0 | 0 | 0 | 0 | 0 | -21.772 | -10.872 | -18.691 | -3.5512 | 5.6648 | -1.9976 | 0.7699 | 0.41 | 0.08278 | -1.4094 | 0.786 | -0.3681 | 0 | 0 | 0 | 0 | 0 | 0 | 48.069 | 48.069 | 0.931 | 47.879 | 47.925 | 47.818 | 47.834 | 47.955 | 47.971 | -10.507 | 6.4533 | -7.65 |
4 | 3.4098 | -0.3274 | 0.9604 | -3.7436 | -1.0191 | -6.0205 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.38e-17 | -4.8116 | -1.4033 | -0.1369 | 0.002472 | 0 | 9.803e-16 | -55.642 | -16.312 | -1.2042 | 0.02167 | 0 | 3.4098 | -0.3274 | 0.9604 | -3.7437 | -1.0191 | -6.0205 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1255 | -6.5394 | -2.8 | -0.8327 | 0.07473 | -0.1952 | 0.1211 | -6.5483 | -2.8157 | -0.8327 | 0.07015 | -0.1922 | 0.7699 | 0.41 | 0.08278 | -1.4094 | 0.786 | -0.3681 | 0 | 0 | 0 | 0 | 0 | 0 | -22.823 | -11.645 | -18.524 | -3.5305 | 5.8712 | -2.0096 | 0.7699 | 0.41 | 0.08278 | -1.4094 | 0.786 | -0.3681 | 0 | 0 | 0 | 0 | 0 | 0 | 48.069 | 48.069 | 0.8952 | 47.879 | 47.925 | 47.818 | 47.834 | 47.955 | 47.971 | -10.413 | 6.6267 | -7.69 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
14996 | 3.0837 | -0.8836 | 1.4501 | -2.2102 | -1.559 | -5.3265 | -0.03151 | -0.05375 | 0.04732 | 0.1482 | -0.05218 | 0.06706 | 0.2969 | 0.5065 | -0.4459 | -1.3963 | 0.4916 | -0.6319 | -0.3694 | -4.1879 | -1.1847 | -0.09441 | -0.1568 | 0.1898 | 1.1605 | -42.951 | -19.023 | -2.6343 | 0.1551 | -0.1338 | 3.0836 | -0.8836 | 1.4503 | -2.2101 | -1.5591 | -5.3263 | -0.03347 | -0.05585 | 0.04805 | 0.151 | -0.05513 | 0.07114 | -0.3564 | -6.0394 | -2.3001 | -0.2181 | -0.1159 | 0.09608 | -0.3632 | -6.0394 | -2.3023 | -0.212 | -0.125 | 0.1113 | 0.7116 | 0.06957 | 0.06036 | -0.8506 | 2.9515 | -0.03352 | -0.03558 | -0.03029 | 0.002444 | -0.04208 | 0.1458 | -0.1098 | -0.8784 | -0.07291 | -37.584 | 0.0001132 | -2.1031 | 0.03318 | 0.7117 | 0.0697 | 0.06044 | -0.8511 | 2.951 | -0.03356 | -0.03508 | -0.02849 | 0.001571 | -0.03951 | 0.1442 | -0.1036 | 48.069 | 48.009 | 0.8952 | 47.818 | 47.834 | 47.818 | 47.803 | 47.94 | 47.94 | 10.84 | -1.41 | 16.14 |
14997 | 3.0835 | -0.884 | 1.4505 | -2.2091 | -1.5594 | -5.326 | -0.02913 | -0.0497 | 0.04376 | 0.137 | -0.04825 | 0.062 | 0.2969 | 0.5065 | -0.4459 | -1.3963 | 0.4916 | -0.6319 | -0.3677 | -4.1837 | -1.1874 | -0.09682 | -0.1562 | 0.189 | 1.1592 | -42.937 | -19.023 | -2.6331 | 0.1545 | -0.1338 | 3.0833 | -0.8841 | 1.4507 | -2.209 | -1.5596 | -5.3258 | -0.02909 | -0.04989 | 0.04198 | 0.1481 | -0.05465 | 0.06249 | -0.3161 | -6.1179 | -2.253 | -0.3752 | -0.03965 | 0.08693 | -0.3273 | -6.1022 | -2.2597 | -0.366 | -0.05033 | 0.0915 | 0.7114 | 0.06932 | 0.06039 | -0.8497 | 2.953 | -0.03359 | -0.0335 | -0.02723 | 0.001208 | -0.04242 | 0.1428 | -0.0967 | -2.7137 | 0.8552 | -38.514 | -0.6088 | -3.2383 | -0.9666 | 0.7114 | 0.06948 | 0.06045 | -0.8503 | 2.9525 | -0.03359 | -0.03246 | -0.02633 | 0.001469 | -0.03657 | 0.1333 | -0.09571 | 48.009 | 48.009 | 0.8594 | 47.818 | 47.834 | 47.818 | 47.803 | 47.94 | 47.94 | 10.857 | -1.52 | 15.943 |
14998 | 3.0833 | -0.8844 | 1.4508 | -2.208 | -1.5598 | -5.3256 | -0.02676 | -0.04565 | 0.04019 | 0.1258 | -0.04431 | 0.05695 | 0.2969 | 0.5065 | -0.4459 | -1.3963 | 0.4916 | -0.6319 | -0.3659 | -4.1797 | -1.1901 | -0.09922 | -0.1555 | 0.1881 | 1.1579 | -42.924 | -19.023 | -2.6321 | 0.154 | -0.1338 | 3.0831 | -0.8844 | 1.451 | -2.2078 | -1.56 | -5.3253 | -0.02776 | -0.04382 | 0.03652 | 0.1295 | -0.05064 | 0.04818 | -0.343 | -6.2569 | -2.1566 | -0.3035 | 0.00305 | 0.1434 | -0.3385 | -6.2322 | -2.1589 | -0.302 | -0.00915 | 0.1571 | 0.7111 | 0.06912 | 0.06039 | -0.849 | 2.9544 | -0.0337 | -0.02911 | -0.02589 | 0.001292 | -0.04046 | 0.1246 | -0.08058 | 4.2749 | 1.0128 | -36.412 | -1.2811 | -0.4296 | -1.1013 | 0.7112 | 0.06928 | 0.06046 | -0.8495 | 2.9538 | -0.03362 | -0.02984 | -0.02417 | 0.001364 | -0.03362 | 0.1224 | -0.08786 | 48.009 | 48.009 | 0.931 | 47.818 | 47.834 | 47.818 | 47.803 | 47.94 | 47.94 | 10.89 | -1.74 | 15.55 |
14999 | 3.0831 | -0.8847 | 1.4511 | -2.2071 | -1.5602 | -5.3251 | -0.02438 | -0.0416 | 0.03662 | 0.1147 | -0.04038 | 0.0519 | 0.2969 | 0.5065 | -0.4459 | -1.3963 | 0.4916 | -0.6319 | -0.3642 | -4.1758 | -1.1928 | -0.1016 | -0.1548 | 0.1873 | 1.1568 | -42.912 | -19.023 | -2.6311 | 0.1535 | -0.1338 | 3.0829 | -0.8848 | 1.4513 | -2.2068 | -1.5604 | -5.3249 | -0.02149 | -0.04059 | 0.03417 | 0.1202 | -0.0395 | 0.04178 | -0.4237 | -6.2703 | -2.0939 | -0.302 | -0.01372 | 0.1739 | -0.4125 | -6.2569 | -2.0916 | -0.2943 | -0.02898 | 0.1891 | 0.7109 | 0.06894 | 0.06039 | -0.8484 | 2.9557 | -0.03384 | -0.02738 | -0.01982 | 0.001031 | -0.03028 | 0.1157 | -0.06702 | 11.518 | 1.5002 | -39.314 | -1.8671 | -0.3734 | -0.5733 | 0.7109 | 0.06909 | 0.06047 | -0.8488 | 2.955 | -0.03364 | -0.02721 | -0.02201 | 0.001255 | -0.03067 | 0.1115 | -0.08003 | 48.009 | 48.009 | 0.931 | 47.818 | 47.834 | 47.818 | 47.803 | 47.94 | 47.94 | 11.29 | -1.4601 | 15.743 |
15000 | 3.0829 | -0.885 | 1.4514 | -2.2062 | -1.5605 | -5.3247 | -0.02201 | -0.03755 | 0.03305 | 0.1035 | -0.03645 | 0.04684 | 0.2969 | 0.5065 | -0.4459 | -1.3963 | 0.4916 | -0.6319 | -0.3624 | -4.172 | -1.1955 | -0.1041 | -0.1542 | 0.1864 | 1.1558 | -42.901 | -19.023 | -2.6302 | 0.1531 | -0.1338 | 3.0827 | -0.8851 | 1.4516 | -2.2059 | -1.5607 | -5.3246 | -0.02096 | -0.03808 | 0.02958 | 0.1171 | -0.03289 | 0.03883 | -0.417 | -6.2434 | -2.058 | -0.4102 | -0.04728 | 0.1967 | -0.4237 | -6.2367 | -2.0714 | -0.4163 | -0.0671 | 0.2059 | 0.7107 | 0.06878 | 0.06041 | -0.8478 | 2.9567 | -0.03382 | -0.02535 | -0.01854 | 0.001614 | -0.02421 | 0.11 | -0.06304 | 15.099 | 2.936 | -39.068 | -1.9402 | 0.139 | -0.2674 | 0.7107 | 0.06893 | 0.06048 | -0.8482 | 2.9561 | -0.03367 | -0.02458 | -0.01986 | 0.001142 | -0.0277 | 0.1007 | -0.07221 | 48.009 | 48.069 | 0.8952 | 47.818 | 47.834 | 47.818 | 47.803 | 47.94 | 47.955 | 11.69 | -1.1801 | 15.937 |
15001 rows x 96 columns
memory usage: 11.52 MB
name: data_all
type: getml.DataFrame
1.2 Prepare data for getML¶
The force vector consists of three components (f_x, f_y and f_z), meaning that we have three targets.
data_all.set_role(["f_x", "f_y", "f_z"], getml.data.roles.target)
data_all.set_role(data_all.roles.unused, getml.data.roles.numerical)
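As a quick sanity check, we can inspect the role assignment. This assumes that the roles container exposes the column names per role, as the use of data_all.roles.unused above suggests:
# The three force components should now be targets; all other columns numerical.
print(data_all.roles.target)          # expected: ['f_x', 'f_y', 'f_z']
print(len(data_all.roles.numerical))  # expected: 93 (96 columns minus 3 targets)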
This is what the data set looks like:
data_all
name | f_x | f_y | f_z | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | target | target | target | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical |
0 | -11.03 | 6.9 | -7.33 | 3.4098 | -0.3274 | 0.9604 | -3.7436 | -1.0191 | -6.0205 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.38e-17 | -4.8116 | -1.4033 | -0.1369 | 0.002472 | 0 | 9.803e-16 | -55.642 | -16.312 | -1.2042 | 0.02167 | 0 | 3.4098 | -0.3274 | 0.9605 | -3.7437 | -1.0191 | -6.0205 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1233 | -6.5483 | -2.8045 | -0.8296 | 0.07625 | -0.1906 | 0.1211 | -6.5483 | -2.8157 | -0.8281 | 0.07015 | -0.1983 | 0.7699 | 0.41 | 0.08279 | -1.4094 | 0.786 | -0.3682 | 0 | 0 | 0 | 0 | 0 | 0 | -22.654 | -11.503 | -18.673 | -3.5155 | 5.8354 | -2.05 | 0.7699 | 0.41 | 0.08278 | -1.4094 | 0.786 | -0.3681 | 0 | 0 | 0 | 0 | 0 | 0 | 48.069 | 48.009 | 0.9668 | 47.834 | 47.925 | 47.818 | 47.834 | 47.955 | 47.971 |
1 | -10.848 | 6.7218 | -7.4427 | 3.4098 | -0.3274 | 0.9604 | -3.7436 | -1.0191 | -6.0205 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.38e-17 | -4.8116 | -1.4033 | -0.1369 | 0.002472 | 0 | 9.803e-16 | -55.642 | -16.312 | -1.2042 | 0.02167 | 0 | 3.4098 | -0.3274 | 0.9604 | -3.7437 | -1.0191 | -6.0205 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1188 | -6.5506 | -2.8404 | -0.8281 | 0.06405 | -0.1998 | 0.1211 | -6.5483 | -2.8157 | -0.8281 | 0.07015 | -0.1983 | 0.7699 | 0.41 | 0.0828 | -1.4094 | 0.7859 | -0.3682 | 0 | 0 | 0 | 0 | 0 | 0 | -21.627 | -11.046 | -18.66 | -3.5395 | 5.7577 | -1.9805 | 0.7699 | 0.41 | 0.08278 | -1.4094 | 0.786 | -0.3681 | 0 | 0 | 0 | 0 | 0 | 0 | 48.009 | 48.009 | 0.8594 | 47.834 | 47.925 | 47.818 | 47.834 | 47.955 | 47.971 |
2 | -10.666 | 6.5436 | -7.5555 | 3.4098 | -0.3274 | 0.9604 | -3.7436 | -1.0191 | -6.0205 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.38e-17 | -4.8116 | -1.4033 | -0.1369 | 0.002472 | 0 | 9.803e-16 | -55.642 | -16.312 | -1.2042 | 0.02167 | 0 | 3.4098 | -0.3274 | 0.9605 | -3.7437 | -1.0191 | -6.0205 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1099 | -6.5438 | -2.8 | -0.8205 | 0.07473 | -0.183 | 0.1211 | -6.5483 | -2.8157 | -0.8281 | 0.07015 | -0.1922 | 0.7699 | 0.41 | 0.08279 | -1.4094 | 0.7859 | -0.3682 | 0 | 0 | 0 | 0 | 0 | 0 | -23.843 | -12.127 | -18.393 | -3.6453 | 5.978 | -1.9978 | 0.7699 | 0.41 | 0.08278 | -1.4094 | 0.786 | -0.3681 | 0 | 0 | 0 | 0 | 0 | 0 | 48.009 | 48.069 | 0.931 | 47.879 | 47.925 | 47.818 | 47.834 | 47.955 | 47.971 |
3 | -10.507 | 6.4533 | -7.65 | 3.4098 | -0.3274 | 0.9604 | -3.7436 | -1.0191 | -6.0205 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.38e-17 | -4.8116 | -1.4033 | -0.1369 | 0.002472 | 0 | 9.803e-16 | -55.642 | -16.312 | -1.2042 | 0.02167 | 0 | 3.4098 | -0.3273 | 0.9604 | -3.7437 | -1.0191 | -6.0205 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1233 | -6.5483 | -2.8224 | -0.8266 | 0.07168 | -0.1998 | 0.1211 | -6.5483 | -2.8157 | -0.8281 | 0.07015 | -0.1967 | 0.7699 | 0.41 | 0.08275 | -1.4094 | 0.786 | -0.3681 | 0 | 0 | 0 | 0 | 0 | 0 | -21.772 | -10.872 | -18.691 | -3.5512 | 5.6648 | -1.9976 | 0.7699 | 0.41 | 0.08278 | -1.4094 | 0.786 | -0.3681 | 0 | 0 | 0 | 0 | 0 | 0 | 48.069 | 48.069 | 0.931 | 47.879 | 47.925 | 47.818 | 47.834 | 47.955 | 47.971 |
4 | -10.413 | 6.6267 | -7.69 | 3.4098 | -0.3274 | 0.9604 | -3.7436 | -1.0191 | -6.0205 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8.38e-17 | -4.8116 | -1.4033 | -0.1369 | 0.002472 | 0 | 9.803e-16 | -55.642 | -16.312 | -1.2042 | 0.02167 | 0 | 3.4098 | -0.3274 | 0.9604 | -3.7437 | -1.0191 | -6.0205 | 0 | 0 | 0 | 0 | 0 | 0 | 0.1255 | -6.5394 | -2.8 | -0.8327 | 0.07473 | -0.1952 | 0.1211 | -6.5483 | -2.8157 | -0.8327 | 0.07015 | -0.1922 | 0.7699 | 0.41 | 0.08278 | -1.4094 | 0.786 | -0.3681 | 0 | 0 | 0 | 0 | 0 | 0 | -22.823 | -11.645 | -18.524 | -3.5305 | 5.8712 | -2.0096 | 0.7699 | 0.41 | 0.08278 | -1.4094 | 0.786 | -0.3681 | 0 | 0 | 0 | 0 | 0 | 0 | 48.069 | 48.069 | 0.8952 | 47.879 | 47.925 | 47.818 | 47.834 | 47.955 | 47.971 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
14996 | 10.84 | -1.41 | 16.14 | 3.0837 | -0.8836 | 1.4501 | -2.2102 | -1.559 | -5.3265 | -0.03151 | -0.05375 | 0.04732 | 0.1482 | -0.05218 | 0.06706 | 0.2969 | 0.5065 | -0.4459 | -1.3963 | 0.4916 | -0.6319 | -0.3694 | -4.1879 | -1.1847 | -0.09441 | -0.1568 | 0.1898 | 1.1605 | -42.951 | -19.023 | -2.6343 | 0.1551 | -0.1338 | 3.0836 | -0.8836 | 1.4503 | -2.2101 | -1.5591 | -5.3263 | -0.03347 | -0.05585 | 0.04805 | 0.151 | -0.05513 | 0.07114 | -0.3564 | -6.0394 | -2.3001 | -0.2181 | -0.1159 | 0.09608 | -0.3632 | -6.0394 | -2.3023 | -0.212 | -0.125 | 0.1113 | 0.7116 | 0.06957 | 0.06036 | -0.8506 | 2.9515 | -0.03352 | -0.03558 | -0.03029 | 0.002444 | -0.04208 | 0.1458 | -0.1098 | -0.8784 | -0.07291 | -37.584 | 0.0001132 | -2.1031 | 0.03318 | 0.7117 | 0.0697 | 0.06044 | -0.8511 | 2.951 | -0.03356 | -0.03508 | -0.02849 | 0.001571 | -0.03951 | 0.1442 | -0.1036 | 48.069 | 48.009 | 0.8952 | 47.818 | 47.834 | 47.818 | 47.803 | 47.94 | 47.94 |
14997 | 10.857 | -1.52 | 15.943 | 3.0835 | -0.884 | 1.4505 | -2.2091 | -1.5594 | -5.326 | -0.02913 | -0.0497 | 0.04376 | 0.137 | -0.04825 | 0.062 | 0.2969 | 0.5065 | -0.4459 | -1.3963 | 0.4916 | -0.6319 | -0.3677 | -4.1837 | -1.1874 | -0.09682 | -0.1562 | 0.189 | 1.1592 | -42.937 | -19.023 | -2.6331 | 0.1545 | -0.1338 | 3.0833 | -0.8841 | 1.4507 | -2.209 | -1.5596 | -5.3258 | -0.02909 | -0.04989 | 0.04198 | 0.1481 | -0.05465 | 0.06249 | -0.3161 | -6.1179 | -2.253 | -0.3752 | -0.03965 | 0.08693 | -0.3273 | -6.1022 | -2.2597 | -0.366 | -0.05033 | 0.0915 | 0.7114 | 0.06932 | 0.06039 | -0.8497 | 2.953 | -0.03359 | -0.0335 | -0.02723 | 0.001208 | -0.04242 | 0.1428 | -0.0967 | -2.7137 | 0.8552 | -38.514 | -0.6088 | -3.2383 | -0.9666 | 0.7114 | 0.06948 | 0.06045 | -0.8503 | 2.9525 | -0.03359 | -0.03246 | -0.02633 | 0.001469 | -0.03657 | 0.1333 | -0.09571 | 48.009 | 48.009 | 0.8594 | 47.818 | 47.834 | 47.818 | 47.803 | 47.94 | 47.94 |
14998 | 10.89 | -1.74 | 15.55 | 3.0833 | -0.8844 | 1.4508 | -2.208 | -1.5598 | -5.3256 | -0.02676 | -0.04565 | 0.04019 | 0.1258 | -0.04431 | 0.05695 | 0.2969 | 0.5065 | -0.4459 | -1.3963 | 0.4916 | -0.6319 | -0.3659 | -4.1797 | -1.1901 | -0.09922 | -0.1555 | 0.1881 | 1.1579 | -42.924 | -19.023 | -2.6321 | 0.154 | -0.1338 | 3.0831 | -0.8844 | 1.451 | -2.2078 | -1.56 | -5.3253 | -0.02776 | -0.04382 | 0.03652 | 0.1295 | -0.05064 | 0.04818 | -0.343 | -6.2569 | -2.1566 | -0.3035 | 0.00305 | 0.1434 | -0.3385 | -6.2322 | -2.1589 | -0.302 | -0.00915 | 0.1571 | 0.7111 | 0.06912 | 0.06039 | -0.849 | 2.9544 | -0.0337 | -0.02911 | -0.02589 | 0.001292 | -0.04046 | 0.1246 | -0.08058 | 4.2749 | 1.0128 | -36.412 | -1.2811 | -0.4296 | -1.1013 | 0.7112 | 0.06928 | 0.06046 | -0.8495 | 2.9538 | -0.03362 | -0.02984 | -0.02417 | 0.001364 | -0.03362 | 0.1224 | -0.08786 | 48.009 | 48.009 | 0.931 | 47.818 | 47.834 | 47.818 | 47.803 | 47.94 | 47.94 |
14999 | 11.29 | -1.4601 | 15.743 | 3.0831 | -0.8847 | 1.4511 | -2.2071 | -1.5602 | -5.3251 | -0.02438 | -0.0416 | 0.03662 | 0.1147 | -0.04038 | 0.0519 | 0.2969 | 0.5065 | -0.4459 | -1.3963 | 0.4916 | -0.6319 | -0.3642 | -4.1758 | -1.1928 | -0.1016 | -0.1548 | 0.1873 | 1.1568 | -42.912 | -19.023 | -2.6311 | 0.1535 | -0.1338 | 3.0829 | -0.8848 | 1.4513 | -2.2068 | -1.5604 | -5.3249 | -0.02149 | -0.04059 | 0.03417 | 0.1202 | -0.0395 | 0.04178 | -0.4237 | -6.2703 | -2.0939 | -0.302 | -0.01372 | 0.1739 | -0.4125 | -6.2569 | -2.0916 | -0.2943 | -0.02898 | 0.1891 | 0.7109 | 0.06894 | 0.06039 | -0.8484 | 2.9557 | -0.03384 | -0.02738 | -0.01982 | 0.001031 | -0.03028 | 0.1157 | -0.06702 | 11.518 | 1.5002 | -39.314 | -1.8671 | -0.3734 | -0.5733 | 0.7109 | 0.06909 | 0.06047 | -0.8488 | 2.955 | -0.03364 | -0.02721 | -0.02201 | 0.001255 | -0.03067 | 0.1115 | -0.08003 | 48.009 | 48.009 | 0.931 | 47.818 | 47.834 | 47.818 | 47.803 | 47.94 | 47.94 |
15000 | 11.69 | -1.1801 | 15.937 | 3.0829 | -0.885 | 1.4514 | -2.2062 | -1.5605 | -5.3247 | -0.02201 | -0.03755 | 0.03305 | 0.1035 | -0.03645 | 0.04684 | 0.2969 | 0.5065 | -0.4459 | -1.3963 | 0.4916 | -0.6319 | -0.3624 | -4.172 | -1.1955 | -0.1041 | -0.1542 | 0.1864 | 1.1558 | -42.901 | -19.023 | -2.6302 | 0.1531 | -0.1338 | 3.0827 | -0.8851 | 1.4516 | -2.2059 | -1.5607 | -5.3246 | -0.02096 | -0.03808 | 0.02958 | 0.1171 | -0.03289 | 0.03883 | -0.417 | -6.2434 | -2.058 | -0.4102 | -0.04728 | 0.1967 | -0.4237 | -6.2367 | -2.0714 | -0.4163 | -0.0671 | 0.2059 | 0.7107 | 0.06878 | 0.06041 | -0.8478 | 2.9567 | -0.03382 | -0.02535 | -0.01854 | 0.001614 | -0.02421 | 0.11 | -0.06304 | 15.099 | 2.936 | -39.068 | -1.9402 | 0.139 | -0.2674 | 0.7107 | 0.06893 | 0.06048 | -0.8482 | 2.9561 | -0.03367 | -0.02458 | -0.01986 | 0.001142 | -0.0277 | 0.1007 | -0.07221 | 48.009 | 48.069 | 0.8952 | 47.818 | 47.834 | 47.818 | 47.803 | 47.94 | 47.955 |
15001 rows x 96 columns
memory usage: 11.52 MB
name: data_all
type: getml.DataFrame
1.3 Separate data into a training and testing set¶
We also want to separate the data set into a training set and a testing set, using the first 10,500 measurements for training and the remainder for testing.
split = getml.data.split.time(data_all, "rowid", test=10500)
split
0 | train |
---|---|
1 | train |
2 | train |
3 | train |
4 | train |
... |
unknown number of rows
type: StringColumnView
time_series = getml.data.TimeSeries(
population=data_all,
split=split,
time_stamps="rowid",
lagged_targets=False,
memory=30,
)
time_series
data frames | staging table | |
---|---|---|
0 | population | POPULATION__STAGING_TABLE_1 |
1 | data_all | DATA_ALL__STAGING_TABLE_2 |
subset | name | rows | type | |
---|---|---|---|---|
0 | test | data_all | 4501 | View |
1 | train | data_all | 10500 | View |
name | rows | type | |
---|---|---|---|
0 | data_all | 15001 | View |
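A note on the memory parameter: memory=30 means that, for each measurement, the feature learner may only aggregate over the 30 most recent measurements. You can see this window in the transpiled SQL in section 2.9, where the join is restricted to t2.rowid <= t1.rowid and t2.rowid + 30 > t1.rowid. Conceptually, this corresponds to a rolling window in row order; here is a minimal pandas sketch of the idea (pandas is not required by getML, this is purely illustrative):
# Illustration only: with memory=30, an aggregation for row i may use
# rows j with i - 30 < j <= i, i.e. a rolling window of length 30.
import pandas as pd

series = pd.Series(range(100))  # stand-in for a single sensor column
rolling_mean = series.rolling(window=30, min_periods=1).mean()
print(rolling_mean.tail())
2. Predictive modeling¶
2.1 Defining the pipeline¶
We use relboost to learn features and an XGBoost regressor to make predictions based on those features: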
relboost = getml.feature_learning.Relboost(
loss_function=getml.feature_learning.loss_functions.SquareLoss,
num_features=10,
)
xgboost = getml.predictors.XGBoostRegressor()
pipe1 = getml.pipeline.Pipeline(
data_model=time_series.data_model,
feature_learners=[relboost],
predictors=xgboost
)
It is always a good idea to check the pipeline for any potential issues.
pipe1.check(time_series.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
2.2 Fitting the pipeline¶
pipe1.fit(time_series.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Relboost: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:32 Relboost: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:34 Relboost: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:34 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:08 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:07 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:07 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:03 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:03 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:04
Trained pipeline.
Time taken: 0:02:16.080596.
Pipeline(data_model='population', feature_learners=['Relboost'], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=['data_all'], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['container-cETtgf'])
2.3 Evaluating the pipeline¶
pipe1.score(time_series.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:03 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:03 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:03
date time | set used | target | mae | rmse | rsquared | |
---|---|---|---|---|---|---|
0 | 2024-09-12 15:44:28 | train | f_x | 0.4467 | 0.5882 | 0.9961 |
1 | 2024-09-12 15:44:28 | train | f_y | 0.511 | 0.675 | 0.9895 |
2 | 2024-09-12 15:44:28 | train | f_z | 0.2724 | 0.3544 | 0.9988 |
3 | 2024-09-12 15:44:40 | test | f_x | 0.5649 | 0.7368 | 0.995 |
4 | 2024-09-12 15:44:40 | test | f_y | 0.564 | 0.7535 | 0.9871 |
5 | 2024-09-12 15:44:40 | test | f_z | 0.301 | 0.3931 | 0.9985 |
2.4 Feature importances¶
It is always a good idea to study the features the relational learning algorithm has extracted.
The feature importance is calculated by XGBoost based on the improvement of the optimization criterion at each split in the decision trees and is normalized to 100%.
Also note that we have three different targets (f_x, f_y and f_z) and that different features are relevant for different targets.
plt.subplots(figsize=(20, 10))
names, importances = pipe1.features.importances(target_num=0)
plt.bar(names[0:30], importances[0:30])
plt.title("feature importances for the x-component", size=20)
plt.grid(True)
plt.xlabel("features")
plt.ylabel("importance")
plt.xticks(rotation='vertical')
plt.show()
plt.subplots(figsize=(20, 10))
names, importances = pipe1.features.importances(target_num=1)
plt.bar(names[0:30], importances[0:30])
plt.title("feature importances for the y-component", size=20)
plt.grid(True)
plt.xlabel("features")
plt.ylabel("importance")
plt.xticks(rotation='vertical')
plt.show()
plt.subplots(figsize=(20, 10))
names, importances = pipe1.features.importances(target_num=2)
plt.bar(names[0:30], importances[0:30])
plt.title("feature importances for the z-component", size=20)
plt.grid(True)
plt.xlabel("features")
plt.ylabel("importance")
plt.xticks(rotation='vertical')
plt.show()
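Since the three cells above differ only in the target index and the plot title, the same plots can be produced more compactly with a loop (an equivalent refactoring; the same pattern also works for the column importances in the next section by swapping pipe1.features for pipe1.columns):
# Equivalent, more compact version of the three cells above.
for target_num, component in enumerate(["x", "y", "z"]):
    plt.subplots(figsize=(20, 10))
    names, importances = pipe1.features.importances(target_num=target_num)
    plt.bar(names[0:30], importances[0:30])
    plt.title(f"feature importances for the {component}-component", size=20)
    plt.grid(True)
    plt.xlabel("features")
    plt.ylabel("importance")
    plt.xticks(rotation="vertical")
    plt.show()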
2.5 Column importances¶
Because getML is a tool for relational learning, we can also calculate importances for the original columns, using methods similar to those we used for the feature importances.
plt.subplots(figsize=(20, 10))
names, importances = pipe1.columns.importances(target_num=0)
plt.bar(names[0:30], importances[0:30])
plt.title("column importances for the x-component", size=20)
plt.grid(True)
plt.xlabel("column")
plt.ylabel("importance")
plt.xticks(rotation='vertical')
plt.show()
plt.subplots(figsize=(20, 10))
names, importances = pipe1.columns.importances(target_num=1)
plt.bar(names[0:30], importances[0:30])
plt.title("column importances for the y-component", size=20)
plt.grid(True)
plt.xlabel("column")
plt.ylabel("importance")
plt.xticks(rotation='vertical')
plt.show()
plt.subplots(figsize=(20, 10))
names, importances = pipe1.columns.importances(target_num=2)
plt.bar(names[0:30], importances[0:30])
plt.title("column importances for the z-component", size=20)
plt.grid(True)
plt.xlabel("column")
plt.ylabel("importance")
plt.xticks(rotation='vertical')
plt.show()
2.6 Column selection¶
When we study the plots of the column importances, we find some good news: we actually don't need that many columns. About 80% of the columns carry very little predictive value.
This means that we can also apply other algorithms that do not scale as well as relboost. All we have to do is select the most relevant columns:
The .select(...)
method returns a new container in which the unimportant columns have been dropped:
time_series2 = pipe1.columns.select(time_series, share_selected_columns=0.35)
time_series2
data frames | staging table | |
---|---|---|
0 | population | POPULATION__STAGING_TABLE_1 |
1 | data_all | DATA_ALL__STAGING_TABLE_2 |
subset | name | rows | type | |
---|---|---|---|---|
0 | train | data_all | 10500 | View |
1 | test | data_all | 4501 | View |
name | rows | type | |
---|---|---|---|
0 | data_all | 15001 | View |
2.7 Fitting a second pipeline¶
The multirel algorithm does not scale well to data sets with many columns. As we have discussed in the introduction, its computational complexity is $n^2$ in the number of columns. But now we only use 35% of the original columns, so it is fine to use multirel.
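Keeping only 35% of the columns also shrinks the quadratic search space dramatically (another quick calculation):
share_selected_columns = 0.35
# (0.35 * n)^2 = 0.1225 * n^2, i.e. roughly 12% of the original search space.
print(f"{share_selected_columns ** 2:.0%}")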
multirel = getml.feature_learning.Multirel(
loss_function=getml.feature_learning.loss_functions.SquareLoss,
num_features=10,
)
relboost = getml.feature_learning.Relboost(
loss_function=getml.feature_learning.loss_functions.SquareLoss,
num_features=10,
)
xgboost = getml.predictors.XGBoostRegressor(n_jobs=7)
pipe2 = getml.pipeline.Pipeline(
data_model=time_series2.data_model,
feature_learners=[multirel, relboost],
predictors=xgboost
)
pipe2.check(time_series2.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 1 issues labeled INFO and 0 issues labeled WARNING.
type | label | message | |
---|---|---|---|
0 | INFO | MIGHT TAKE LONG | DATA_ALL__STAGING_TABLE_2 contains 23 categorical and numerical columns. Please note that columns created by the preprocessors are also part of this count. The multirel algorithm does not scale very well to data frames with many columns. This pipeline might take a very long time to fit. You should consider removing some columns or preprocessors. You could use a column selection to pick the right columns. You could also replace Multirel with Relboost or Fastboost. Both algorithms have been designed to scale well to data frames with many columns. |
pipe2.fit(time_series2.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 1 issues labeled INFO and 0 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Multirel: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:08 Relboost: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:11 Relboost: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:08 Relboost: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:08 Multirel: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Trained pipeline.
Time taken: 0:00:48.628798.
Pipeline(data_model='population', feature_learners=['Multirel', 'Relboost'], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=['data_all'], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['container-5qzYt6'])
pipe2.score(time_series2.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Multirel: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01
date time | set used | target | mae | rmse | rsquared | |
---|---|---|---|---|---|---|
0 | 2024-09-12 15:45:31 | train | f_x | 0.4525 | 0.5984 | 0.996 |
1 | 2024-09-12 15:45:31 | train | f_y | 0.5236 | 0.6882 | 0.9891 |
2 | 2024-09-12 15:45:31 | train | f_z | 0.2664 | 0.3486 | 0.9988 |
3 | 2024-09-12 15:45:37 | test | f_x | 0.5587 | 0.7319 | 0.995 |
4 | 2024-09-12 15:45:37 | test | f_y | 0.5667 | 0.7534 | 0.9871 |
5 | 2024-09-12 15:45:37 | test | f_z | 0.2914 | 0.3789 | 0.9986 |
2.8 Visualizing the predictions¶
Sometimes a picture says more than a thousand words. We therefore want to visualize our predictions on the testing set.
f_x = time_series2.test.population["f_x"].to_numpy()
f_y = time_series2.test.population["f_y"].to_numpy()
f_z = time_series2.test.population["f_z"].to_numpy()
predictions = pipe2.predict(time_series2.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Multirel: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01
plt.subplots(figsize=(20, 10))
plt.title("x-component of the force vector", size=20)
plt.plot(f_x, label="ground truth")
plt.plot(predictions[:,0], label="prediction")
plt.legend(loc="upper right", fontsize=16)
<matplotlib.legend.Legend at 0x7f3d17173650>
plt.subplots(figsize=(20, 10))
plt.title("y-component of the force vector", size=20)
plt.plot(f_y, label="ground truth")
plt.plot(predictions[:,1], label="prediction")
plt.legend(loc="upper right", fontsize=16)
<matplotlib.legend.Legend at 0x7f3d1173da50>
plt.subplots(figsize=(20, 10))
plt.title("z-component of the force vector", size=20)
plt.plot(f_z, label="ground truth")
plt.plot(predictions[:,2], label="prediction")
plt.legend(loc="upper right", fontsize=16)
<matplotlib.legend.Legend at 0x7f3d117b3b10>
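As a quick numerical cross-check, we can recompute the test-set MAE directly from the arrays. The column order of predictions follows the order of the targets (f_x, f_y, f_z), as already used in the plots above; the results should closely match the test rows of the score table above:
import numpy as np

# Recompute the mean absolute error per target from the raw arrays.
for i, (name, truth) in enumerate([("f_x", f_x), ("f_y", f_y), ("f_z", f_z)]):
    mae = np.abs(predictions[:, i] - truth).mean()
    print(f"MAE {name}: {mae:.4f}")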
2.9 Features¶
The most important feature looks as follows:
pipe1.features.to_sql()[pipe1.features.sort(by="importances")[0].name]
DROP TABLE IF EXISTS "FEATURE_1_3";
CREATE TABLE "FEATURE_1_3" AS
SELECT AVG(
CASE
WHEN ( t2."4" > -0.563815 ) AND ( t2."8" > -5.915106 ) AND ( t2."4" > -0.560007 ) THEN -7.035334308806516
WHEN ( t2."4" > -0.563815 ) AND ( t2."8" > -5.915106 ) AND ( t2."4" <= -0.560007 OR t2."4" IS NULL ) THEN -62.74602640372516
WHEN ( t2."4" > -0.563815 ) AND ( t2."8" <= -5.915106 OR t2."8" IS NULL ) AND ( t1."10" > 0.262721 ) THEN -16.94725742676829
WHEN ( t2."4" > -0.563815 ) AND ( t2."8" <= -5.915106 OR t2."8" IS NULL ) AND ( t1."10" <= 0.262721 OR t1."10" IS NULL ) THEN -12.11164189597631
WHEN ( t2."4" <= -0.563815 OR t2."4" IS NULL ) AND ( t2."38" > -5.460974 ) AND ( t2."36" > -3.092528 ) THEN 5.036006845911339
WHEN ( t2."4" <= -0.563815 OR t2."4" IS NULL ) AND ( t2."38" > -5.460974 ) AND ( t2."36" <= -3.092528 OR t2."36" IS NULL ) THEN -52.99309740580681
WHEN ( t2."4" <= -0.563815 OR t2."4" IS NULL ) AND ( t2."38" <= -5.460974 OR t2."38" IS NULL ) AND ( t2."62" > -0.143896 ) THEN 0.473437318248295
WHEN ( t2."4" <= -0.563815 OR t2."4" IS NULL ) AND ( t2."38" <= -5.460974 OR t2."38" IS NULL ) AND ( t2."62" <= -0.143896 OR t2."62" IS NULL ) THEN 77.8375239280841
ELSE NULL
END
) AS "feature_1_3",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "DATA_ALL__STAGING_TABLE_2" t2
ON 1 = 1
WHERE t2."rowid" <= t1."rowid"
AND ( t2."rowid__30_000000" > t1."rowid" OR t2."rowid__30_000000" IS NULL )
GROUP BY t1.rowid;
As the plots in section 2.8 show, the predictions are very accurate. This suggests that it is indeed feasible to predict the force vector from the other sensor data.
2.10 Productionization¶
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3 and spark modules.
# Creates a folder named robot_pipeline containing
# the SQL code.
pipe1.features.to_sql().save("robot_pipeline", remove=True)
# Creates a folder named robot_pipeline_spark containing
# the SQL code for Apache Spark.
pipe1.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("robot_pipeline_spark", remove=True)
getml.engine.shutdown()
3. Conclusion¶
The purpose of this notebook has been to illustrate the feature explosion problem, a curse of dimensionality that arises when engineering features from datasets with many columns.
The most important thing to remember is that this problem exists regardless of whether you engineer your features manually or algorithmically. Whether you like it or not: if you write your features in the traditional way, your search space grows quadratically with the number of columns.