AdventureWorks - Predicting customer churn¶
In this notebook, we will demonstrate how getML can be used for a customer churn project using a synthetic dataset of a fictional company. We will also benchmark getML against featuretools.
Summary:
- Prediction type: Classification model
- Domain: Customer loyalty
- Prediction target: churn
- Population size: 19704
Background¶
AdventureWorks is a fictional company that sells bicycles. It is used by Microsoft to showcase how its MS SQL Server can be used to manage business data. Since this dataset resembles a real-world customer database and it is open-source, we will use it to showcase how getML can be used for a classic customer churn project (real customer databases are not easily available for the purposes of showcasing and benchmarking, for reasons of data privacy).
The dataset has been downloaded from the CTU Prague relational learning repository (Motl and Schulte, 2015) (Now residing at relational-data.org.).
We will benchmark getML 's feature learning algorithms against featuretools, an open-source implementation of the propositionalization algorithm, similar to getML's FastProp.
Analysis¶
Let's get started with the analysis and set up your session:
import copy
import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
from pathlib import Path
from urllib import request
import numpy as np
import pandas as pd
from IPython.display import Image
import matplotlib.pyplot as plt
%matplotlib inline
import featuretools
import woodwork as ww
import getml
getml.engine.launch(home_directory=Path.home(), allow_remote_ips=True, token='token')
getml.set_project('adventure_works')
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/getml --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/getml/.getML/getml-1.4.0-x64-linux... Launched the getML engine. The log output will be stored in /home/getml/.getML/logs/20240221145013.log. Loading pipelines... 100% |██████████| [elapsed: 00:01, remaining: 00:00] Connected to project 'adventure_works'
1. Loading data¶
1.1 Download from source¶
We begin by downloading the data:
conn = getml.database.connect_mysql(
host="db.relational-data.org",
dbname="AdventureWorks2014",
port=3306,
user="guest",
password="relational"
)
conn
Connection(dbname='AdventureWorks2014', dialect='mysql', host='db.relational-data.org', port=3306)
def load_if_needed(name):
"""
Loads the data from the relational learning
repository, if the data frame has not already
been loaded.
"""
if not getml.data.exists(name):
data_frame = getml.DataFrame.from_db(
name=name,
table_name=name,
conn=conn
)
data_frame.save()
else:
data_frame = getml.data.load_data_frame(name)
return data_frame
product = load_if_needed("Product")
sales_order_detail = load_if_needed("SalesOrderDetail")
sales_order_header = load_if_needed("SalesOrderHeader")
sales_order_reason = load_if_needed("SalesOrderHeaderSalesReason")
special_offer = load_if_needed("SpecialOffer")
store = load_if_needed("Store")
product
name | ProductID | MakeFlag | FinishedGoodsFlag | SafetyStockLevel | ReorderPoint | DaysToManufacture | ProductSubcategoryID | ProductModelID | Name | ProductNumber | Color | StandardCost | ListPrice | Size | SizeUnitMeasureCode | WeightUnitMeasureCode | Weight | ProductLine | Class | Style | SellStartDate | SellEndDate | DiscontinuedDate | rowguid | ModifiedDate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
0 | 1 | 0 | 0 | 1000 | 750 | 0 | nan | nan | Adjustable Race | AR-5381 | NULL | 0.0000 | 0.0000 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2008-04-30 00:00:00 | NULL | NULL | 694215B7-08F7-4C0D-ACB1-D734BA44... | 2014-02-08 09:01:36 |
1 | 2 | 0 | 0 | 1000 | 750 | 0 | nan | nan | Bearing Ball | BA-8327 | NULL | 0.0000 | 0.0000 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2008-04-30 00:00:00 | NULL | NULL | 58AE3C20-4F3A-4749-A7D4-D568806C... | 2014-02-08 09:01:36 |
2 | 3 | 1 | 0 | 800 | 600 | 1 | nan | nan | BB Ball Bearing | BE-2349 | NULL | 0.0000 | 0.0000 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2008-04-30 00:00:00 | NULL | NULL | 9C21AED2-5BFA-4F18-BCB8-F11638DC... | 2014-02-08 09:01:36 |
3 | 4 | 0 | 0 | 800 | 600 | 0 | nan | nan | Headset Ball Bearings | BE-2908 | NULL | 0.0000 | 0.0000 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2008-04-30 00:00:00 | NULL | NULL | ECFED6CB-51FF-49B5-B06C-7D8AC834... | 2014-02-08 09:01:36 |
4 | 316 | 1 | 0 | 800 | 600 | 1 | nan | nan | Blade | BL-2036 | NULL | 0.0000 | 0.0000 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2008-04-30 00:00:00 | NULL | NULL | E73E9750-603B-4131-89F5-3DD15ED5... | 2014-02-08 09:01:36 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
499 | 995 | 1 | 1 | 500 | 375 | 1 | 5 | 96 | ML Bottom Bracket | BB-8107 | NULL | 44.9506 | 101.2400 | NULL | NULL | G | 168.00 | NULL | M | NULL | 2013-05-30 00:00:00 | NULL | NULL | 71AB847F-D091-42D6-B735-7B0C2D82... | 2014-02-08 09:01:36 |
500 | 996 | 1 | 1 | 500 | 375 | 1 | 5 | 97 | HL Bottom Bracket | BB-9108 | NULL | 53.9416 | 121.4900 | NULL | NULL | G | 170.00 | NULL | H | NULL | 2013-05-30 00:00:00 | NULL | NULL | 230C47C5-08B2-4CE3-B706-69C0BDD6... | 2014-02-08 09:01:36 |
501 | 997 | 1 | 1 | 100 | 75 | 4 | 2 | 31 | Road-750 Black, 44 | BK-R19B-44 | Black | 343.6496 | 539.9900 | 44 | CM | LB | 19.77 | R | L | U | 2013-05-30 00:00:00 | NULL | NULL | 44CE4802-409F-43AB-9B27-CA534218... | 2014-02-08 09:01:36 |
502 | 998 | 1 | 1 | 100 | 75 | 4 | 2 | 31 | Road-750 Black, 48 | BK-R19B-48 | Black | 343.6496 | 539.9900 | 48 | CM | LB | 20.13 | R | L | U | 2013-05-30 00:00:00 | NULL | NULL | 3DE9A212-1D49-40B6-B10A-F564D981... | 2014-02-08 09:01:36 |
503 | 999 | 1 | 1 | 100 | 75 | 4 | 2 | 31 | Road-750 Black, 52 | BK-R19B-52 | Black | 343.6496 | 539.9900 | 52 | CM | LB | 20.42 | R | L | U | 2013-05-30 00:00:00 | NULL | NULL | AE638923-2B67-4679-B90E-ABBAB17D... | 2014-02-08 09:01:36 |
504 rows x 25 columns
memory usage: 0.17 MB
name: Product
type: getml.DataFrame
sales_order_detail
name | SalesOrderID | SalesOrderDetailID | OrderQty | ProductID | SpecialOfferID | CarrierTrackingNumber | UnitPrice | UnitPriceDiscount | LineTotal | rowguid | ModifiedDate |
---|---|---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
0 | 43659 | 1 | 1 | 776 | 1 | 4911-403C-98 | 2024.9940 | 0.0000 | 2024.994000 | B207C96D-D9E6-402B-8470-2CC176C4... | 2011-05-30 22:00:00 |
1 | 43659 | 2 | 3 | 777 | 1 | 4911-403C-98 | 2024.9940 | 0.0000 | 6074.982000 | 7ABB600D-1E77-41BE-9FE5-B9142CFC... | 2011-05-30 22:00:00 |
2 | 43659 | 3 | 1 | 778 | 1 | 4911-403C-98 | 2024.9940 | 0.0000 | 2024.994000 | 475CF8C6-49F6-486E-B0AD-AFC6A50C... | 2011-05-30 22:00:00 |
3 | 43659 | 4 | 1 | 771 | 1 | 4911-403C-98 | 2039.9940 | 0.0000 | 2039.994000 | 04C4DE91-5815-45D6-8670-F462719F... | 2011-05-30 22:00:00 |
4 | 43659 | 5 | 1 | 772 | 1 | 4911-403C-98 | 2039.9940 | 0.0000 | 2039.994000 | 5A74C7D2-E641-438E-A7AC-37BF2328... | 2011-05-30 22:00:00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
121312 | 75122 | 121313 | 1 | 878 | 1 | NULL | 21.9800 | 0.0000 | 21.980000 | 8CAD6675-18CC-4F47-8287-97B41A8E... | 2014-06-29 22:00:00 |
121313 | 75122 | 121314 | 1 | 712 | 1 | NULL | 8.9900 | 0.0000 | 8.990000 | 84F1C363-1C50-4442-BE16-541C59B6... | 2014-06-29 22:00:00 |
121314 | 75123 | 121315 | 1 | 878 | 1 | NULL | 21.9800 | 0.0000 | 21.980000 | C18B6476-429F-4BB1-828E-2BE5F82A... | 2014-06-29 22:00:00 |
121315 | 75123 | 121316 | 1 | 879 | 1 | NULL | 159.0000 | 0.0000 | 159.000000 | 75A89C6A-C60A-47EA-8A52-B52A9C43... | 2014-06-29 22:00:00 |
121316 | 75123 | 121317 | 1 | 712 | 1 | NULL | 8.9900 | 0.0000 | 8.990000 | 73646D26-0461-450D-8019-2C6C8586... | 2014-06-29 22:00:00 |
121317 rows x 11 columns
memory usage: 21.60 MB
name: SalesOrderDetail
type: getml.DataFrame
sales_order_header
name | SalesOrderID | RevisionNumber | Status | OnlineOrderFlag | CustomerID | SalesPersonID | TerritoryID | BillToAddressID | ShipToAddressID | ShipMethodID | CreditCardID | CurrencyRateID | OrderDate | DueDate | ShipDate | SalesOrderNumber | PurchaseOrderNumber | AccountNumber | CreditCardApprovalCode | SubTotal | TaxAmt | Freight | TotalDue | Comment | rowguid | ModifiedDate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
0 | 43659 | 8 | 5 | 0 | 29825 | 279 | 5 | 985 | 985 | 5 | 16281 | nan | 2011-05-30 22:00:00 | 2011-06-12 00:00:00 | 2011-06-07 00:00:00 | SO43659 | PO522145787 | 10-4020-000676 | 105041Vi84182 | 20565.6206 | 1971.5149 | 616.0984 | 23153.2339 | NULL | 79B65321-39CA-4115-9CBA-8FE0903E... | 2011-06-06 22:00:00 |
1 | 43660 | 8 | 5 | 0 | 29672 | 279 | 5 | 921 | 921 | 5 | 5618 | nan | 2011-05-30 22:00:00 | 2011-06-12 00:00:00 | 2011-06-07 00:00:00 | SO43660 | PO18850127500 | 10-4020-000117 | 115213Vi29411 | 1294.2529 | 124.2483 | 38.8276 | 1457.3288 | NULL | 738DC42D-D03B-48A1-9822-F95A67EA... | 2011-06-06 22:00:00 |
2 | 43661 | 8 | 5 | 0 | 29734 | 282 | 6 | 517 | 517 | 5 | 1346 | 4 | 2011-05-30 22:00:00 | 2011-06-12 00:00:00 | 2011-06-07 00:00:00 | SO43661 | PO18473189620 | 10-4020-000442 | 85274Vi6854 | 32726.4786 | 3153.7696 | 985.5530 | 36865.8012 | NULL | D91B9131-18A4-4A11-BC3A-90B6F53E... | 2011-06-06 22:00:00 |
3 | 43662 | 8 | 5 | 0 | 29994 | 282 | 6 | 482 | 482 | 5 | 10456 | 4 | 2011-05-30 22:00:00 | 2011-06-12 00:00:00 | 2011-06-07 00:00:00 | SO43662 | PO18444174044 | 10-4020-000227 | 125295Vi53935 | 28832.5289 | 2775.1646 | 867.2389 | 32474.9324 | NULL | 4A1ECFC0-CC3A-4740-B028-1C50BB48... | 2011-06-06 22:00:00 |
4 | 43663 | 8 | 5 | 0 | 29565 | 276 | 4 | 1073 | 1073 | 5 | 4322 | nan | 2011-05-30 22:00:00 | 2011-06-12 00:00:00 | 2011-06-07 00:00:00 | SO43663 | PO18009186470 | 10-4020-000510 | 45303Vi22691 | 419.4589 | 40.2681 | 12.5838 | 472.3108 | NULL | 9B1E7A40-6AE0-4AD3-811C-A6495185... | 2011-06-06 22:00:00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
31460 | 75119 | 8 | 5 | 1 | 11981 | nan | 1 | 17649 | 17649 | 1 | 6761 | nan | 2014-06-29 22:00:00 | 2014-07-12 00:00:00 | 2014-07-07 00:00:00 | SO75119 | NULL | 10-4030-011981 | 429826Vi35166 | 42.2800 | 3.3824 | 1.0570 | 46.7194 | NULL | 9382F1C9-383A-435F-9449-0EECEA21... | 2014-07-06 22:00:00 |
31461 | 75120 | 8 | 5 | 1 | 18749 | nan | 6 | 28374 | 28374 | 1 | 8925 | nan | 2014-06-29 22:00:00 | 2014-07-12 00:00:00 | 2014-07-07 00:00:00 | SO75120 | NULL | 10-4030-018749 | 929849Vi46003 | 84.9600 | 6.7968 | 2.1240 | 93.8808 | NULL | AE6A4FCF-FF73-4CD4-AF2C-5993D00D... | 2014-07-06 22:00:00 |
31462 | 75121 | 8 | 5 | 1 | 15251 | nan | 6 | 26553 | 26553 | 1 | 14220 | nan | 2014-06-29 22:00:00 | 2014-07-12 00:00:00 | 2014-07-07 00:00:00 | SO75121 | NULL | 10-4030-015251 | 529864Vi73738 | 74.9800 | 5.9984 | 1.8745 | 82.8529 | NULL | D7395C0E-00CB-4BFA-A238-0D6A9F49... | 2014-07-06 22:00:00 |
31463 | 75122 | 8 | 5 | 1 | 15868 | nan | 6 | 14616 | 14616 | 1 | 18719 | nan | 2014-06-29 22:00:00 | 2014-07-12 00:00:00 | 2014-07-07 00:00:00 | SO75122 | NULL | 10-4030-015868 | 330022Vi97312 | 30.9700 | 2.4776 | 0.7743 | 34.2219 | NULL | 4221035A-4159-492F-AF40-4363A64F... | 2014-07-06 22:00:00 |
31464 | 75123 | 8 | 5 | 1 | 18759 | nan | 6 | 14024 | 14024 | 1 | 10084 | nan | 2014-06-29 22:00:00 | 2014-07-12 00:00:00 | 2014-07-07 00:00:00 | SO75123 | NULL | 10-4030-018759 | 230370Vi51970 | 189.9700 | 15.1976 | 4.7493 | 209.9169 | NULL | D54752FF-2B54-4BE5-95EA-3B72289C... | 2014-07-06 22:00:00 |
31465 rows x 26 columns
memory usage: 12.56 MB
name: SalesOrderHeader
type: getml.DataFrame
sales_order_reason
name | SalesOrderID | SalesReasonID | ModifiedDate |
---|---|---|---|
role | unused_float | unused_float | unused_string |
0 | 43697 | 5 | 2011-05-30 22:00:00 |
1 | 43697 | 9 | 2011-05-30 22:00:00 |
2 | 43702 | 5 | 2011-05-31 22:00:00 |
3 | 43702 | 9 | 2011-05-31 22:00:00 |
4 | 43703 | 5 | 2011-05-31 22:00:00 |
... | ... | ... | |
27642 | 75119 | 1 | 2014-06-29 22:00:00 |
27643 | 75120 | 1 | 2014-06-29 22:00:00 |
27644 | 75121 | 1 | 2014-06-29 22:00:00 |
27645 | 75122 | 1 | 2014-06-29 22:00:00 |
27646 | 75123 | 1 | 2014-06-29 22:00:00 |
27647 rows x 3 columns
memory usage: 1.22 MB
name: SalesOrderHeaderSalesReason
type: getml.DataFrame
special_offer
name | SpecialOfferID | MinQty | MaxQty | Description | DiscountPct | Type | Category | StartDate | EndDate | rowguid | ModifiedDate |
---|---|---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
0 | 1 | 0 | nan | No Discount | 0.0000 | No Discount | No Discount | 2011-05-01 00:00:00 | 2014-11-30 00:00:00 | 0290C4F5-191F-4337-AB6B-0A2DDE03... | 2011-03-31 22:00:00 |
1 | 2 | 11 | 14 | Volume Discount 11 to 14 | 0.0200 | Volume Discount | Reseller | 2011-05-31 00:00:00 | 2014-05-30 00:00:00 | D7542EE7-15DB-4541-985C-5CC27AEF... | 2011-04-30 22:00:00 |
2 | 3 | 15 | 24 | Volume Discount 15 to 24 | 0.0500 | Volume Discount | Reseller | 2011-05-31 00:00:00 | 2014-05-30 00:00:00 | 4BDBCC01-8CF7-40A9-B643-40EC5B71... | 2011-04-30 22:00:00 |
3 | 4 | 25 | 40 | Volume Discount 25 to 40 | 0.1000 | Volume Discount | Reseller | 2011-05-31 00:00:00 | 2014-05-30 00:00:00 | 504B5E85-8F3F-4EBC-9E1D-C1BC5DEA... | 2011-04-30 22:00:00 |
4 | 5 | 41 | 60 | Volume Discount 41 to 60 | 0.1500 | Volume Discount | Reseller | 2011-05-31 00:00:00 | 2014-05-30 00:00:00 | 677E1D9D-944F-4E81-90E8-47EB0A82... | 2011-04-30 22:00:00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
11 | 12 | 0 | nan | LL Road Frame Sale | 0.3500 | Excess Inventory | Reseller | 2013-05-30 00:00:00 | 2013-07-14 00:00:00 | C0AF1C89-9722-4235-9248-3FBA4D9E... | 2013-04-29 22:00:00 |
12 | 13 | 0 | nan | Touring-3000 Promotion | 0.1500 | New Product | Reseller | 2013-05-30 00:00:00 | 2013-08-29 00:00:00 | 5061CCE4-E021-45A8-9A75-DFB36CBB... | 2013-04-29 22:00:00 |
13 | 14 | 0 | nan | Touring-1000 Promotion | 0.2000 | New Product | Reseller | 2013-05-30 00:00:00 | 2013-08-29 00:00:00 | 1AF84A9E-A98C-4BD9-B48F-DC2B8B6B... | 2013-04-29 22:00:00 |
14 | 15 | 0 | nan | Half-Price Pedal Sale | 0.5000 | Seasonal Discount | Customer | 2013-07-14 00:00:00 | 2013-08-14 00:00:00 | 03E3594D-6EBB-46A6-B8EE-A9289C0C... | 2013-06-13 22:00:00 |
15 | 16 | 0 | nan | Mountain-500 Silver Clearance Sa... | 0.4000 | Discontinued Product | Reseller | 2014-03-31 00:00:00 | 2014-05-30 00:00:00 | EB7CB484-BCCF-4D2D-BF73-521B2001... | 2014-02-28 23:00:00 |
16 rows x 11 columns
memory usage: 0.00 MB
name: SpecialOffer
type: getml.DataFrame
store
name | BusinessEntityID | SalesPersonID | Name | Demographics | rowguid | ModifiedDate |
---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string |
0 | 292 | 279 | Next-Door Bike Store | <StoreSurvey xmlns="http://schem... | A22517E3-848D-4EBE-B9D9-7437F343... | 2014-09-12 09:15:07 |
1 | 294 | 276 | Professional Sales and Service | <StoreSurvey xmlns="http://schem... | B50CA50B-C601-4A13-B07E-2C63862D... | 2014-09-12 09:15:07 |
2 | 296 | 277 | Riders Company | <StoreSurvey xmlns="http://schem... | 337C3688-1339-4E1A-A08A-B54B2356... | 2014-09-12 09:15:07 |
3 | 298 | 275 | The Bike Mechanics | <StoreSurvey xmlns="http://schem... | 7894F278-F0C8-4D16-BD75-213FDBF1... | 2014-09-12 09:15:07 |
4 | 300 | 286 | Nationwide Supply | <StoreSurvey xmlns="http://schem... | C3FC9705-A8C4-4F3A-9550-EB2FA4B7... | 2014-09-12 09:15:07 |
... | ... | ... | ... | ... | ... | |
696 | 1988 | 282 | Retreat Inn | <StoreSurvey xmlns="http://schem... | EA21EC81-1BFA-4A07-9B4D-73D9852A... | 2014-09-12 09:15:07 |
697 | 1990 | 281 | Technical Parts Manufacturing | <StoreSurvey xmlns="http://schem... | C8E3C4ED-8F58-4DB2-B600-E0CD11D9... | 2014-09-12 09:15:07 |
698 | 1992 | 277 | Totes & Baskets Company | <StoreSurvey xmlns="http://schem... | CE860B58-643C-4567-BFD8-06E97969... | 2014-09-12 09:15:07 |
699 | 1994 | 277 | World of Bikes | <StoreSurvey xmlns="http://schem... | 0C10F2B6-A13A-440C-9C25-5B28D482... | 2014-09-12 09:15:07 |
700 | 2051 | 275 | A Bicycle Association | <StoreSurvey xmlns="http://schem... | 82237172-D3FE-4A95-82EF-636F6552... | 2014-09-12 09:15:07 |
701 rows x 6 columns
memory usage: 0.38 MB
name: Store
type: getml.DataFrame
1.2 Prepare data for getML¶
getML requires that we define roles for each of the columns.
product.set_role("ProductID", getml.data.roles.join_key)
product.set_role(["MakeFlag", "ProductSubcategoryID", "ProductModelID"], getml.data.roles.categorical)
product.set_role(["SafetyStockLevel", "ReorderPoint", "StandardCost", "ListPrice"], getml.data.roles.numerical)
product
name | ProductID | MakeFlag | ProductSubcategoryID | ProductModelID | SafetyStockLevel | ReorderPoint | StandardCost | ListPrice | FinishedGoodsFlag | DaysToManufacture | Name | ProductNumber | Color | Size | SizeUnitMeasureCode | WeightUnitMeasureCode | Weight | ProductLine | Class | Style | SellStartDate | SellEndDate | DiscontinuedDate | rowguid | ModifiedDate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | join_key | categorical | categorical | categorical | numerical | numerical | numerical | numerical | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
0 | 1 | 0 | nan | nan | 1000 | 750 | 0 | 0 | 0 | 0 | Adjustable Race | AR-5381 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2008-04-30 00:00:00 | NULL | NULL | 694215B7-08F7-4C0D-ACB1-D734BA44... | 2014-02-08 09:01:36 |
1 | 2 | 0 | nan | nan | 1000 | 750 | 0 | 0 | 0 | 0 | Bearing Ball | BA-8327 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2008-04-30 00:00:00 | NULL | NULL | 58AE3C20-4F3A-4749-A7D4-D568806C... | 2014-02-08 09:01:36 |
2 | 3 | 1 | nan | nan | 800 | 600 | 0 | 0 | 0 | 1 | BB Ball Bearing | BE-2349 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2008-04-30 00:00:00 | NULL | NULL | 9C21AED2-5BFA-4F18-BCB8-F11638DC... | 2014-02-08 09:01:36 |
3 | 4 | 0 | nan | nan | 800 | 600 | 0 | 0 | 0 | 0 | Headset Ball Bearings | BE-2908 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2008-04-30 00:00:00 | NULL | NULL | ECFED6CB-51FF-49B5-B06C-7D8AC834... | 2014-02-08 09:01:36 |
4 | 316 | 1 | nan | nan | 800 | 600 | 0 | 0 | 0 | 1 | Blade | BL-2036 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2008-04-30 00:00:00 | NULL | NULL | E73E9750-603B-4131-89F5-3DD15ED5... | 2014-02-08 09:01:36 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
499 | 995 | 1 | 5 | 96 | 500 | 375 | 44.9506 | 101.24 | 1 | 1 | ML Bottom Bracket | BB-8107 | NULL | NULL | NULL | G | 168.00 | NULL | M | NULL | 2013-05-30 00:00:00 | NULL | NULL | 71AB847F-D091-42D6-B735-7B0C2D82... | 2014-02-08 09:01:36 |
500 | 996 | 1 | 5 | 97 | 500 | 375 | 53.9416 | 121.49 | 1 | 1 | HL Bottom Bracket | BB-9108 | NULL | NULL | NULL | G | 170.00 | NULL | H | NULL | 2013-05-30 00:00:00 | NULL | NULL | 230C47C5-08B2-4CE3-B706-69C0BDD6... | 2014-02-08 09:01:36 |
501 | 997 | 1 | 2 | 31 | 100 | 75 | 343.6496 | 539.99 | 1 | 4 | Road-750 Black, 44 | BK-R19B-44 | Black | 44 | CM | LB | 19.77 | R | L | U | 2013-05-30 00:00:00 | NULL | NULL | 44CE4802-409F-43AB-9B27-CA534218... | 2014-02-08 09:01:36 |
502 | 998 | 1 | 2 | 31 | 100 | 75 | 343.6496 | 539.99 | 1 | 4 | Road-750 Black, 48 | BK-R19B-48 | Black | 48 | CM | LB | 20.13 | R | L | U | 2013-05-30 00:00:00 | NULL | NULL | 3DE9A212-1D49-40B6-B10A-F564D981... | 2014-02-08 09:01:36 |
503 | 999 | 1 | 2 | 31 | 100 | 75 | 343.6496 | 539.99 | 1 | 4 | Road-750 Black, 52 | BK-R19B-52 | Black | 52 | CM | LB | 20.42 | R | L | U | 2013-05-30 00:00:00 | NULL | NULL | AE638923-2B67-4679-B90E-ABBAB17D... | 2014-02-08 09:01:36 |
504 rows x 25 columns
memory usage: 0.16 MB
name: Product
type: getml.DataFrame
sales_order_detail.set_role(["SalesOrderID", "SalesOrderDetailID", "ProductID", "SpecialOfferID"], getml.data.roles.join_key)
sales_order_detail.set_role(["OrderQty", "UnitPrice", "UnitPriceDiscount", "LineTotal"], getml.data.roles.numerical)
sales_order_detail.set_role("ModifiedDate", getml.data.roles.time_stamp)
sales_order_detail
name | ModifiedDate | SalesOrderID | SalesOrderDetailID | ProductID | SpecialOfferID | OrderQty | UnitPrice | UnitPriceDiscount | LineTotal | CarrierTrackingNumber | rowguid |
---|---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | join_key | join_key | join_key | join_key | numerical | numerical | numerical | numerical | unused_string | unused_string |
unit | time stamp, comparison only | ||||||||||
0 | 2011-05-30 22:00:00 | 43659 | 1 | 776 | 1 | 1 | 2024.994 | 0 | 2024.994 | 4911-403C-98 | B207C96D-D9E6-402B-8470-2CC176C4... |
1 | 2011-05-30 22:00:00 | 43659 | 2 | 777 | 1 | 3 | 2024.994 | 0 | 6074.982 | 4911-403C-98 | 7ABB600D-1E77-41BE-9FE5-B9142CFC... |
2 | 2011-05-30 22:00:00 | 43659 | 3 | 778 | 1 | 1 | 2024.994 | 0 | 2024.994 | 4911-403C-98 | 475CF8C6-49F6-486E-B0AD-AFC6A50C... |
3 | 2011-05-30 22:00:00 | 43659 | 4 | 771 | 1 | 1 | 2039.994 | 0 | 2039.994 | 4911-403C-98 | 04C4DE91-5815-45D6-8670-F462719F... |
4 | 2011-05-30 22:00:00 | 43659 | 5 | 772 | 1 | 1 | 2039.994 | 0 | 2039.994 | 4911-403C-98 | 5A74C7D2-E641-438E-A7AC-37BF2328... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
121312 | 2014-06-29 22:00:00 | 75122 | 121313 | 878 | 1 | 1 | 21.98 | 0 | 21.98 | NULL | 8CAD6675-18CC-4F47-8287-97B41A8E... |
121313 | 2014-06-29 22:00:00 | 75122 | 121314 | 712 | 1 | 1 | 8.99 | 0 | 8.99 | NULL | 84F1C363-1C50-4442-BE16-541C59B6... |
121314 | 2014-06-29 22:00:00 | 75123 | 121315 | 878 | 1 | 1 | 21.98 | 0 | 21.98 | NULL | C18B6476-429F-4BB1-828E-2BE5F82A... |
121315 | 2014-06-29 22:00:00 | 75123 | 121316 | 879 | 1 | 1 | 159 | 0 | 159 | NULL | 75A89C6A-C60A-47EA-8A52-B52A9C43... |
121316 | 2014-06-29 22:00:00 | 75123 | 121317 | 712 | 1 | 1 | 8.99 | 0 | 8.99 | NULL | 73646D26-0461-450D-8019-2C6C8586... |
121317 rows x 11 columns
memory usage: 14.08 MB
name: SalesOrderDetail
type: getml.DataFrame
sales_order_reason.set_role("SalesOrderID", getml.data.roles.join_key)
sales_order_reason.set_role("SalesReasonID", getml.data.roles.categorical)
sales_order_reason
name | SalesOrderID | SalesReasonID | ModifiedDate |
---|---|---|---|
role | join_key | categorical | unused_string |
0 | 43697 | 5 | 2011-05-30 22:00:00 |
1 | 43697 | 9 | 2011-05-30 22:00:00 |
2 | 43702 | 5 | 2011-05-31 22:00:00 |
3 | 43702 | 9 | 2011-05-31 22:00:00 |
4 | 43703 | 5 | 2011-05-31 22:00:00 |
... | ... | ... | |
27642 | 75119 | 1 | 2014-06-29 22:00:00 |
27643 | 75120 | 1 | 2014-06-29 22:00:00 |
27644 | 75121 | 1 | 2014-06-29 22:00:00 |
27645 | 75122 | 1 | 2014-06-29 22:00:00 |
27646 | 75123 | 1 | 2014-06-29 22:00:00 |
27647 rows x 3 columns
memory usage: 1.00 MB
name: SalesOrderHeaderSalesReason
type: getml.DataFrame
special_offer.set_role(["SpecialOfferID"], getml.data.roles.join_key)
special_offer.set_role(["MinQty", "DiscountPct"], getml.data.roles.numerical)
special_offer.set_role(["Category", "Description", "Type"], getml.data.roles.categorical)
special_offer.set_role(["StartDate", "EndDate"], getml.data.roles.time_stamp)
special_offer
name | StartDate | EndDate | SpecialOfferID | Category | Description | Type | MinQty | DiscountPct | MaxQty | rowguid | ModifiedDate |
---|---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | time_stamp | join_key | categorical | categorical | categorical | numerical | numerical | unused_float | unused_string | unused_string |
unit | time stamp, comparison only | time stamp, comparison only | |||||||||
0 | 2011-05-01 | 2014-11-30 | 1 | No Discount | No Discount | No Discount | 0 | 0 | nan | 0290C4F5-191F-4337-AB6B-0A2DDE03... | 2011-03-31 22:00:00 |
1 | 2011-05-31 | 2014-05-30 | 2 | Reseller | Volume Discount 11 to 14 | Volume Discount | 11 | 0.02 | 14 | D7542EE7-15DB-4541-985C-5CC27AEF... | 2011-04-30 22:00:00 |
2 | 2011-05-31 | 2014-05-30 | 3 | Reseller | Volume Discount 15 to 24 | Volume Discount | 15 | 0.05 | 24 | 4BDBCC01-8CF7-40A9-B643-40EC5B71... | 2011-04-30 22:00:00 |
3 | 2011-05-31 | 2014-05-30 | 4 | Reseller | Volume Discount 25 to 40 | Volume Discount | 25 | 0.1 | 40 | 504B5E85-8F3F-4EBC-9E1D-C1BC5DEA... | 2011-04-30 22:00:00 |
4 | 2011-05-31 | 2014-05-30 | 5 | Reseller | Volume Discount 41 to 60 | Volume Discount | 41 | 0.15 | 60 | 677E1D9D-944F-4E81-90E8-47EB0A82... | 2011-04-30 22:00:00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
11 | 2013-05-30 | 2013-07-14 | 12 | Reseller | LL Road Frame Sale | Excess Inventory | 0 | 0.35 | nan | C0AF1C89-9722-4235-9248-3FBA4D9E... | 2013-04-29 22:00:00 |
12 | 2013-05-30 | 2013-08-29 | 13 | Reseller | Touring-3000 Promotion | New Product | 0 | 0.15 | nan | 5061CCE4-E021-45A8-9A75-DFB36CBB... | 2013-04-29 22:00:00 |
13 | 2013-05-30 | 2013-08-29 | 14 | Reseller | Touring-1000 Promotion | New Product | 0 | 0.2 | nan | 1AF84A9E-A98C-4BD9-B48F-DC2B8B6B... | 2013-04-29 22:00:00 |
14 | 2013-07-14 | 2013-08-14 | 15 | Customer | Half-Price Pedal Sale | Seasonal Discount | 0 | 0.5 | nan | 03E3594D-6EBB-46A6-B8EE-A9289C0C... | 2013-06-13 22:00:00 |
15 | 2014-03-31 | 2014-05-30 | 16 | Reseller | Mountain-500 Silver Clearance Sa... | Discontinued Product | 0 | 0.4 | nan | EB7CB484-BCCF-4D2D-BF73-521B2001... | 2014-02-28 23:00:00 |
16 rows x 11 columns
memory usage: 0.00 MB
name: SpecialOffer
type: getml.DataFrame
store.set_role(["SalesPersonID"], getml.data.roles.join_key)
store["test"] = store["ModifiedDate"].update(getml.data.random() > 0.5, "NULL")
store.set_role(["SalesPersonID"], getml.data.roles.join_key)
store.set_role(["test"], getml.data.roles.time_stamp)
store
name | test | SalesPersonID | BusinessEntityID | Name | Demographics | rowguid | ModifiedDate |
---|---|---|---|---|---|---|---|
role | time_stamp | join_key | unused_float | unused_string | unused_string | unused_string | unused_string |
unit | time stamp, comparison only | ||||||
0 | 2014-09-12 09:15:07 | 279 | 292 | Next-Door Bike Store | <StoreSurvey xmlns="http://schem... | A22517E3-848D-4EBE-B9D9-7437F343... | 2014-09-12 09:15:07 |
1 | NULL | 276 | 294 | Professional Sales and Service | <StoreSurvey xmlns="http://schem... | B50CA50B-C601-4A13-B07E-2C63862D... | 2014-09-12 09:15:07 |
2 | 2014-09-12 09:15:07 | 277 | 296 | Riders Company | <StoreSurvey xmlns="http://schem... | 337C3688-1339-4E1A-A08A-B54B2356... | 2014-09-12 09:15:07 |
3 | NULL | 275 | 298 | The Bike Mechanics | <StoreSurvey xmlns="http://schem... | 7894F278-F0C8-4D16-BD75-213FDBF1... | 2014-09-12 09:15:07 |
4 | NULL | 286 | 300 | Nationwide Supply | <StoreSurvey xmlns="http://schem... | C3FC9705-A8C4-4F3A-9550-EB2FA4B7... | 2014-09-12 09:15:07 |
... | ... | ... | ... | ... | ... | ... | |
696 | NULL | 282 | 1988 | Retreat Inn | <StoreSurvey xmlns="http://schem... | EA21EC81-1BFA-4A07-9B4D-73D9852A... | 2014-09-12 09:15:07 |
697 | 2014-09-12 09:15:07 | 281 | 1990 | Technical Parts Manufacturing | <StoreSurvey xmlns="http://schem... | C8E3C4ED-8F58-4DB2-B600-E0CD11D9... | 2014-09-12 09:15:07 |
698 | NULL | 277 | 1992 | Totes & Baskets Company | <StoreSurvey xmlns="http://schem... | CE860B58-643C-4567-BFD8-06E97969... | 2014-09-12 09:15:07 |
699 | NULL | 277 | 1994 | World of Bikes | <StoreSurvey xmlns="http://schem... | 0C10F2B6-A13A-440C-9C25-5B28D482... | 2014-09-12 09:15:07 |
700 | 2014-09-12 09:15:07 | 275 | 2051 | A Bicycle Association | <StoreSurvey xmlns="http://schem... | 82237172-D3FE-4A95-82EF-636F6552... | 2014-09-12 09:15:07 |
701 rows x 7 columns
memory usage: 0.38 MB
name: Store
type: getml.DataFrame
sales_order_header["SalesPersonIDCat"] = sales_order_header["SalesPersonID"]
sales_order_header["TerritoryIDCat"] = sales_order_header["TerritoryID"]
sales_order_header.set_role(["CustomerID", "SalesOrderID", "SalesPersonID", "TerritoryID"], getml.data.roles.join_key)
sales_order_header.set_role(
["RevisionNumber", "OnlineOrderFlag", "SalesPersonIDCat", "TerritoryIDCat", "ShipMethodID"],
getml.data.roles.categorical)
sales_order_header.set_role(["SubTotal", "TaxAmt", "Freight", "TotalDue"], getml.data.roles.numerical)
sales_order_header.set_role(["OrderDate", "DueDate", "ShipDate", "ModifiedDate"], getml.data.roles.time_stamp)
sales_order_header
name | OrderDate | DueDate | ShipDate | ModifiedDate | CustomerID | SalesOrderID | SalesPersonID | TerritoryID | RevisionNumber | OnlineOrderFlag | SalesPersonIDCat | TerritoryIDCat | ShipMethodID | SubTotal | TaxAmt | Freight | TotalDue | Status | BillToAddressID | ShipToAddressID | CreditCardID | CurrencyRateID | SalesOrderNumber | PurchaseOrderNumber | AccountNumber | CreditCardApprovalCode | Comment | rowguid |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | time_stamp | time_stamp | time_stamp | join_key | join_key | join_key | join_key | categorical | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
unit | time stamp, comparison only | time stamp, comparison only | time stamp, comparison only | time stamp, comparison only | ||||||||||||||||||||||||
0 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 29825 | 43659 | 279 | 5 | 8 | 0 | 279 | 5 | 5 | 20565.6206 | 1971.5149 | 616.0984 | 23153.2339 | 5 | 985 | 985 | 16281 | nan | SO43659 | PO522145787 | 10-4020-000676 | 105041Vi84182 | NULL | 79B65321-39CA-4115-9CBA-8FE0903E... |
1 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 29672 | 43660 | 279 | 5 | 8 | 0 | 279 | 5 | 5 | 1294.2529 | 124.2483 | 38.8276 | 1457.3288 | 5 | 921 | 921 | 5618 | nan | SO43660 | PO18850127500 | 10-4020-000117 | 115213Vi29411 | NULL | 738DC42D-D03B-48A1-9822-F95A67EA... |
2 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 29734 | 43661 | 282 | 6 | 8 | 0 | 282 | 6 | 5 | 32726.4786 | 3153.7696 | 985.553 | 36865.8012 | 5 | 517 | 517 | 1346 | 4 | SO43661 | PO18473189620 | 10-4020-000442 | 85274Vi6854 | NULL | D91B9131-18A4-4A11-BC3A-90B6F53E... |
3 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 29994 | 43662 | 282 | 6 | 8 | 0 | 282 | 6 | 5 | 28832.5289 | 2775.1646 | 867.2389 | 32474.9324 | 5 | 482 | 482 | 10456 | 4 | SO43662 | PO18444174044 | 10-4020-000227 | 125295Vi53935 | NULL | 4A1ECFC0-CC3A-4740-B028-1C50BB48... |
4 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 29565 | 43663 | 276 | 4 | 8 | 0 | 276 | 4 | 5 | 419.4589 | 40.2681 | 12.5838 | 472.3108 | 5 | 1073 | 1073 | 4322 | nan | SO43663 | PO18009186470 | 10-4020-000510 | 45303Vi22691 | NULL | 9B1E7A40-6AE0-4AD3-811C-A6495185... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
31460 | 2014-06-29 22:00:00 | 2014-07-12 | 2014-07-07 | 2014-07-06 22:00:00 | 11981 | 75119 | nan | 1 | 8 | 1 | nan | 1 | 1 | 42.28 | 3.3824 | 1.057 | 46.7194 | 5 | 17649 | 17649 | 6761 | nan | SO75119 | NULL | 10-4030-011981 | 429826Vi35166 | NULL | 9382F1C9-383A-435F-9449-0EECEA21... |
31461 | 2014-06-29 22:00:00 | 2014-07-12 | 2014-07-07 | 2014-07-06 22:00:00 | 18749 | 75120 | nan | 6 | 8 | 1 | nan | 6 | 1 | 84.96 | 6.7968 | 2.124 | 93.8808 | 5 | 28374 | 28374 | 8925 | nan | SO75120 | NULL | 10-4030-018749 | 929849Vi46003 | NULL | AE6A4FCF-FF73-4CD4-AF2C-5993D00D... |
31462 | 2014-06-29 22:00:00 | 2014-07-12 | 2014-07-07 | 2014-07-06 22:00:00 | 15251 | 75121 | nan | 6 | 8 | 1 | nan | 6 | 1 | 74.98 | 5.9984 | 1.8745 | 82.8529 | 5 | 26553 | 26553 | 14220 | nan | SO75121 | NULL | 10-4030-015251 | 529864Vi73738 | NULL | D7395C0E-00CB-4BFA-A238-0D6A9F49... |
31463 | 2014-06-29 22:00:00 | 2014-07-12 | 2014-07-07 | 2014-07-06 22:00:00 | 15868 | 75122 | nan | 6 | 8 | 1 | nan | 6 | 1 | 30.97 | 2.4776 | 0.7743 | 34.2219 | 5 | 14616 | 14616 | 18719 | nan | SO75122 | NULL | 10-4030-015868 | 330022Vi97312 | NULL | 4221035A-4159-492F-AF40-4363A64F... |
31464 | 2014-06-29 22:00:00 | 2014-07-12 | 2014-07-07 | 2014-07-06 22:00:00 | 18759 | 75123 | nan | 6 | 8 | 1 | nan | 6 | 1 | 189.97 | 15.1976 | 4.7493 | 209.9169 | 5 | 14024 | 14024 | 10084 | nan | SO75123 | NULL | 10-4030-018759 | 230370Vi51970 | NULL | D54752FF-2B54-4BE5-95EA-3B72289C... |
31465 rows x 28 columns
memory usage: 8.34 MB
name: SalesOrderHeader
type: getml.DataFrame
We must also define customer churn. In this case, we define customer churn as a customer not making another purchase within 180 days of his or her last purchase.
Thus, the churn variable is defined as follows:
- 0, if another purchase by the same customer has been made within 180 days after
OrderDate
- 1, if no purchase by the same customer has been made within 180 days after
OrderDate
- NULL, if
max(OrderDate) - OrderDate <= 180 days
NULL targets can not be used in our analysis.
sales_order_header_pd = sales_order_header[["OrderDate", "CustomerID", "SalesOrderID"]].to_pandas()
repeat_purchases = sales_order_header_pd.merge(
sales_order_header_pd[["OrderDate", "CustomerID"]],
on="CustomerID",
how="left",
)
repeat_purchases = repeat_purchases[
repeat_purchases["OrderDate_y"] > repeat_purchases["OrderDate_x"]
]
repeat_purchases = repeat_purchases[
repeat_purchases["OrderDate_y"] - repeat_purchases["OrderDate_x"] > pd.Timedelta('180 days')
]
repeat_purchases.groupby("SalesOrderID", as_index=False).aggregate({"CustomerID": "max"})
repeat_purchase_ids = {sid: True for sid in repeat_purchases["SalesOrderID"]}
cut_off_date = max(sales_order_header_pd["OrderDate"]) - pd.Timedelta('180 days')
churn = np.asarray([
np.nan if order_date >= cut_off_date else 0 if order_id in repeat_purchase_ids else 1
for (order_date, order_id) in zip(sales_order_header_pd["OrderDate"], sales_order_header_pd["SalesOrderID"])
])
sales_order_header["churn"] = churn
sales_order_header = sales_order_header[~sales_order_header.churn.is_nan()].to_df("SalesOrderHeaderRefined")
sales_order_header.set_role("churn", getml.data.roles.target)
sales_order_header
name | OrderDate | DueDate | ShipDate | ModifiedDate | CustomerID | SalesOrderID | SalesPersonID | TerritoryID | churn | RevisionNumber | OnlineOrderFlag | SalesPersonIDCat | TerritoryIDCat | ShipMethodID | SubTotal | TaxAmt | Freight | TotalDue | Status | BillToAddressID | ShipToAddressID | CreditCardID | CurrencyRateID | SalesOrderNumber | PurchaseOrderNumber | AccountNumber | CreditCardApprovalCode | Comment | rowguid |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | time_stamp | time_stamp | time_stamp | join_key | join_key | join_key | join_key | target | categorical | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
unit | time stamp, comparison only | time stamp, comparison only | time stamp, comparison only | time stamp, comparison only | |||||||||||||||||||||||||
0 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 29825 | 43659 | 279 | 5 | 0 | 8 | 0 | 279 | 5 | 5 | 20565.6206 | 1971.5149 | 616.0984 | 23153.2339 | 5 | 985 | 985 | 16281 | nan | SO43659 | PO522145787 | 10-4020-000676 | 105041Vi84182 | NULL | 79B65321-39CA-4115-9CBA-8FE0903E... |
1 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 29672 | 43660 | 279 | 5 | 0 | 8 | 0 | 279 | 5 | 5 | 1294.2529 | 124.2483 | 38.8276 | 1457.3288 | 5 | 921 | 921 | 5618 | nan | SO43660 | PO18850127500 | 10-4020-000117 | 115213Vi29411 | NULL | 738DC42D-D03B-48A1-9822-F95A67EA... |
2 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 29734 | 43661 | 282 | 6 | 0 | 8 | 0 | 282 | 6 | 5 | 32726.4786 | 3153.7696 | 985.553 | 36865.8012 | 5 | 517 | 517 | 1346 | 4 | SO43661 | PO18473189620 | 10-4020-000442 | 85274Vi6854 | NULL | D91B9131-18A4-4A11-BC3A-90B6F53E... |
3 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 29994 | 43662 | 282 | 6 | 0 | 8 | 0 | 282 | 6 | 5 | 28832.5289 | 2775.1646 | 867.2389 | 32474.9324 | 5 | 482 | 482 | 10456 | 4 | SO43662 | PO18444174044 | 10-4020-000227 | 125295Vi53935 | NULL | 4A1ECFC0-CC3A-4740-B028-1C50BB48... |
4 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 29565 | 43663 | 276 | 4 | 0 | 8 | 0 | 276 | 4 | 5 | 419.4589 | 40.2681 | 12.5838 | 472.3108 | 5 | 1073 | 1073 | 4322 | nan | SO43663 | PO18009186470 | 10-4020-000510 | 45303Vi22691 | NULL | 9B1E7A40-6AE0-4AD3-811C-A6495185... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
19699 | 2013-12-30 23:00:00 | 2014-01-12 | 2014-01-07 | 2014-01-06 23:00:00 | 20826 | 63358 | nan | 7 | 1 | 8 | 1 | nan | 7 | 1 | 1173.96 | 93.9168 | 29.349 | 1297.2258 | 5 | 24387 | 24387 | 3239 | nan | SO63358 | NULL | 10-4030-020826 | 1142084Vi17039 | NULL | 41278FBB-3DD8-488B-AEA5-8BF4A6F1... |
19700 | 2013-12-30 23:00:00 | 2014-01-12 | 2014-01-07 | 2014-01-06 23:00:00 | 24114 | 63359 | nan | 10 | 1 | 8 | 1 | nan | 10 | 1 | 1179.47 | 94.3576 | 29.4868 | 1303.3144 | 5 | 29682 | 29682 | nan | 10770 | SO63359 | NULL | 10-4030-024114 | NULL | NULL | 749532D4-BFF7-4FEC-9F77-396C4A96... |
19701 | 2013-12-30 23:00:00 | 2014-01-12 | 2014-01-07 | 2014-01-06 23:00:00 | 11417 | 63360 | nan | 7 | 1 | 8 | 1 | nan | 7 | 1 | 548.98 | 43.9184 | 13.7245 | 606.6229 | 5 | 21465 | 21465 | 6582 | nan | SO63360 | NULL | 10-4030-011417 | 242387Vi34223 | NULL | F92FA2A3-73E1-4DD0-987C-99D1C87E... |
19702 | 2013-12-30 23:00:00 | 2014-01-12 | 2014-01-07 | 2014-01-06 23:00:00 | 18125 | 63361 | nan | 8 | 1 | 8 | 1 | nan | 8 | 1 | 2384.07 | 190.7256 | 59.6018 | 2634.3974 | 5 | 26562 | 26562 | 11994 | nan | SO63361 | NULL | 10-4030-018125 | 1242859Vi61993 | NULL | 3A3758BF-CDFA-4740-9104-87BD5A08... |
19703 | 2013-12-30 23:00:00 | 2014-01-12 | 2014-01-07 | 2014-01-06 23:00:00 | 15692 | 63362 | nan | 10 | 1 | 8 | 1 | nan | 10 | 1 | 2419.06 | 193.5248 | 60.4765 | 2673.0613 | 5 | 27090 | 27090 | 16083 | 10770 | SO63362 | NULL | 10-4030-015692 | 242864Vi83167 | NULL | 3151D31F-4020-41B2-87F3-B6732869... |
19704 rows x 29 columns
memory usage: 5.39 MB
name: SalesOrderHeaderRefined
type: getml.DataFrame
split = getml.data.split.random(train=0.8, test=0.2)
split
0 | train |
---|---|
1 | train |
2 | train |
3 | test |
4 | train |
... |
infinite number of rows
type: StringColumnView
container = getml.data.Container(population=sales_order_header, split=split)
container.add(
product=product,
sales_order_detail=sales_order_detail,
sales_order_header=sales_order_header,
sales_order_reason=sales_order_reason,
special_offer=special_offer,
store=store,
)
container
subset | name | rows | type | |
---|---|---|---|---|
0 | test | SalesOrderHeaderRefined | 3879 | View |
1 | train | SalesOrderHeaderRefined | 15825 | View |
alias | name | rows | type | |
---|---|---|---|---|
0 | product | Product | 504 | DataFrame |
1 | sales_order_detail | SalesOrderDetail | 121317 | DataFrame |
2 | sales_order_header | SalesOrderHeaderRefined | 19704 | DataFrame |
3 | sales_order_reason | SalesOrderHeaderSalesReason | 27647 | DataFrame |
4 | special_offer | SpecialOffer | 16 | DataFrame |
5 | store | Store | 701 | DataFrame |
2. Predictive modeling¶
We loaded the data and defined the roles and units. Next, we create a getML pipeline for relational learning.
2.1 Define relational model¶
dm = getml.data.DataModel(sales_order_header.to_placeholder("population"))
dm.add(getml.data.to_placeholder(
product=product,
sales_order_detail=sales_order_detail,
sales_order_header=sales_order_header,
sales_order_reason=sales_order_reason,
special_offer=special_offer,
store=store,
))
dm.population.join(
dm.sales_order_header,
on="CustomerID",
time_stamps="OrderDate",
lagged_targets=True,
horizon=getml.data.time.days(1),
)
dm.population.join(
dm.sales_order_detail,
on="SalesOrderID",
)
dm.population.join(
dm.sales_order_reason,
on="SalesOrderID",
)
dm.population.join(
dm.store,
on="SalesPersonID",
)
dm.sales_order_detail.join(
dm.product,
on="ProductID",
relationship=getml.data.relationship.many_to_one,
)
dm.sales_order_detail.join(
dm.special_offer,
on="SpecialOfferID",
relationship=getml.data.relationship.many_to_one,
)
dm
data frames | staging table | |
---|---|---|
0 | population | POPULATION__STAGING_TABLE_1 |
1 | sales_order_detail, product, special_offer | SALES_ORDER_DETAIL__STAGING_TABLE_2 |
2 | sales_order_header | SALES_ORDER_HEADER__STAGING_TABLE_3 |
3 | sales_order_reason | SALES_ORDER_REASON__STAGING_TABLE_4 |
4 | store | STORE__STAGING_TABLE_5 |
2.2 getML pipeline¶
Set-up the feature learner & predictor
seasonal = getml.preprocessors.Seasonal()
mapping = getml.preprocessors.Mapping()
fast_prop = getml.feature_learning.FastProp(
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
num_threads=1,
num_features=400,
)
relboost = getml.feature_learning.Relboost(
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
num_threads=1,
)
predictor = getml.predictors.XGBoostClassifier(n_jobs=1)
Build the pipeline
pipe1 = getml.Pipeline(
tags=['fast_prop'],
data_model=dm,
preprocessors=[mapping],
feature_learners=[fast_prop],
predictors=[predictor],
include_categorical=True,
)
pipe1
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=[], include_categorical=True, loss_function='CrossEntropyLoss', peripheral=['product', 'sales_order_detail', 'sales_order_header', 'sales_order_reason', 'special_offer', 'store'], predictors=['XGBoostClassifier'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['fast_prop'])
pipe2 = getml.Pipeline(
tags=['relboost'],
data_model=dm,
preprocessors=[seasonal, mapping],
feature_learners=[relboost],
predictors=[predictor],
include_categorical=True,
)
pipe2
Pipeline(data_model='population', feature_learners=['Relboost'], feature_selectors=[], include_categorical=True, loss_function='CrossEntropyLoss', peripheral=['product', 'sales_order_detail', 'sales_order_header', 'sales_order_reason', 'special_offer', 'store'], predictors=['XGBoostClassifier'], preprocessors=['Seasonal', 'Mapping'], share_selected_features=0.5, tags=['relboost'])
2.3 Model training¶
pipe1.check(container.train)
Checking data model... Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Checking... 100% |██████████| [elapsed: 00:00, remaining: 00:00] The pipeline check generated 2 issues labeled INFO and 0 issues labeled WARNING.
type | label | message | |
---|---|---|---|
0 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and SALES_ORDER_REASON__STAGING_TABLE_4 over 'SalesOrderID' and 'SalesOrderID', there are no corresponding entries for 33.769352% of entries in 'SalesOrderID' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
1 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and STORE__STAGING_TABLE_5 over 'SalesPersonID' and 'SalesPersonID', there are no corresponding entries for 84.941548% of entries in 'SalesPersonID' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
pipe1.fit(container.train)
Checking data model... Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00] The pipeline check generated 2 issues labeled INFO and 0 issues labeled WARNING. To see the issues in full, run .check() on the pipeline. Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00] FastProp: Trying 710 features... 100% |██████████| [elapsed: 00:02, remaining: 00:00] FastProp: Building features... 100% |██████████| [elapsed: 00:01, remaining: 00:00] XGBoost: Training as predictor... 100% |██████████| [elapsed: 00:10, remaining: 00:00] Trained pipeline. Time taken: 0h:0m:13.940875
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=[], include_categorical=True, loss_function='CrossEntropyLoss', peripheral=['product', 'sales_order_detail', 'sales_order_header', 'sales_order_reason', 'special_offer', 'store'], predictors=['XGBoostClassifier'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['fast_prop', 'container-lYi3CR'])
pipe2.check(container.train)
Checking data model... Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Preprocessing... 100% |██████████| [elapsed: 00:01, remaining: 00:00] Checking... 100% |██████████| [elapsed: 00:00, remaining: 00:00] The pipeline check generated 2 issues labeled INFO and 0 issues labeled WARNING.
type | label | message | |
---|---|---|---|
0 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and SALES_ORDER_REASON__STAGING_TABLE_4 over 'SalesOrderID' and 'SalesOrderID', there are no corresponding entries for 33.769352% of entries in 'SalesOrderID' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
1 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and STORE__STAGING_TABLE_5 over 'SalesPersonID' and 'SalesPersonID', there are no corresponding entries for 84.941548% of entries in 'SalesPersonID' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
pipe2.fit(container.train)
Checking data model... Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00] The pipeline check generated 2 issues labeled INFO and 0 issues labeled WARNING. To see the issues in full, run .check() on the pipeline. Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Relboost: Training features... 100% |██████████| [elapsed: 00:24, remaining: 00:00] Relboost: Building features... 100% |██████████| [elapsed: 00:03, remaining: 00:00] XGBoost: Training as predictor... 100% |██████████| [elapsed: 00:03, remaining: 00:00] Trained pipeline. Time taken: 0h:0m:30.068949
Pipeline(data_model='population', feature_learners=['Relboost'], feature_selectors=[], include_categorical=True, loss_function='CrossEntropyLoss', peripheral=['product', 'sales_order_detail', 'sales_order_header', 'sales_order_reason', 'special_offer', 'store'], predictors=['XGBoostClassifier'], preprocessors=['Seasonal', 'Mapping'], share_selected_features=0.5, tags=['relboost', 'container-lYi3CR'])
2.4 Model evaluation¶
fastprop_score = pipe1.score(container.test)
fastprop_score
Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00] FastProp: Building features... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-02-21 14:50:35 | train | churn | 0.9187 | 0.975 | 0.2106 |
1 | 2024-02-21 14:51:07 | test | churn | 0.9142 | 0.9723 | 0.2199 |
relboost_score = pipe2.score(container.test)
relboost_score
Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Relboost: Building features... 100% |██████████| [elapsed: 00:01, remaining: 00:00]
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-02-21 14:51:07 | train | churn | 0.9329 | 0.9835 | 0.1664 |
1 | 2024-02-21 14:51:09 | test | churn | 0.9255 | 0.9788 | 0.188 |
2.5 featuretools¶
population_train_pd = container.train.population.drop(container.train.population.roles.unused).to_pandas()
population_test_pd = container.test.population.drop(container.train.population.roles.unused).to_pandas()
product_pd = product.drop(product.roles.unused).to_pandas()
sales_order_detail_pd = sales_order_detail.drop(sales_order_detail.roles.unused).to_pandas()
sales_order_header_pd = sales_order_header.drop(sales_order_header.roles.unused).to_pandas()
sales_order_reason_pd = sales_order_reason.drop(sales_order_reason.roles.unused).to_pandas()
special_offer_pd = special_offer.drop(special_offer.roles.unused).to_pandas()
store_pd = store.drop(store.roles.unused).to_pandas()
featuretools does not support many-to-one joins. Therefore, we must manually merge sales_order_detail_pd
, product_pd
and special_offer_pd
.
sales_order_detail_pd = sales_order_detail_pd.merge(
product_pd,
on="ProductID",
how="left",
)
sales_order_detail_pd = sales_order_detail_pd.merge(
special_offer_pd,
on="SpecialOfferID",
how="left",
)
del sales_order_detail_pd["SalesOrderDetailID"]
del sales_order_detail_pd["ProductID"]
del sales_order_detail_pd["SpecialOfferID"]
sales_order_detail_pd
SalesOrderID | OrderQty | UnitPrice | UnitPriceDiscount | LineTotal | ModifiedDate | MakeFlag | ProductSubcategoryID | ProductModelID | SafetyStockLevel | ReorderPoint | StandardCost | ListPrice | Category | Description | Type | MinQty | DiscountPct | StartDate | EndDate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 43659 | 1.0 | 2024.994 | 0.0 | 2024.994 | 2011-05-30 22:00:00 | 1 | 1 | 19 | 100.0 | 75.0 | 1898.0944 | 3374.99 | No Discount | No Discount | No Discount | 0.0 | 0.0 | 2011-05-01 | 2014-11-30 |
1 | 43659 | 3.0 | 2024.994 | 0.0 | 6074.982 | 2011-05-30 22:00:00 | 1 | 1 | 19 | 100.0 | 75.0 | 1898.0944 | 3374.99 | No Discount | No Discount | No Discount | 0.0 | 0.0 | 2011-05-01 | 2014-11-30 |
2 | 43659 | 1.0 | 2024.994 | 0.0 | 2024.994 | 2011-05-30 22:00:00 | 1 | 1 | 19 | 100.0 | 75.0 | 1898.0944 | 3374.99 | No Discount | No Discount | No Discount | 0.0 | 0.0 | 2011-05-01 | 2014-11-30 |
3 | 43659 | 1.0 | 2039.994 | 0.0 | 2039.994 | 2011-05-30 22:00:00 | 1 | 1 | 19 | 100.0 | 75.0 | 1912.1544 | 3399.99 | No Discount | No Discount | No Discount | 0.0 | 0.0 | 2011-05-01 | 2014-11-30 |
4 | 43659 | 1.0 | 2039.994 | 0.0 | 2039.994 | 2011-05-30 22:00:00 | 1 | 1 | 19 | 100.0 | 75.0 | 1912.1544 | 3399.99 | No Discount | No Discount | No Discount | 0.0 | 0.0 | 2011-05-01 | 2014-11-30 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
121312 | 75122 | 1.0 | 21.980 | 0.0 | 21.980 | 2014-06-29 22:00:00 | 0 | 30 | 121 | 4.0 | 3.0 | 8.2205 | 21.98 | No Discount | No Discount | No Discount | 0.0 | 0.0 | 2011-05-01 | 2014-11-30 |
121313 | 75122 | 1.0 | 8.990 | 0.0 | 8.990 | 2014-06-29 22:00:00 | 0 | 19 | 2 | 4.0 | 3.0 | 6.9223 | 8.99 | No Discount | No Discount | No Discount | 0.0 | 0.0 | 2011-05-01 | 2014-11-30 |
121314 | 75123 | 1.0 | 21.980 | 0.0 | 21.980 | 2014-06-29 22:00:00 | 0 | 30 | 121 | 4.0 | 3.0 | 8.2205 | 21.98 | No Discount | No Discount | No Discount | 0.0 | 0.0 | 2011-05-01 | 2014-11-30 |
121315 | 75123 | 1.0 | 159.000 | 0.0 | 159.000 | 2014-06-29 22:00:00 | 0 | 27 | 122 | 4.0 | 3.0 | 59.4660 | 159.00 | No Discount | No Discount | No Discount | 0.0 | 0.0 | 2011-05-01 | 2014-11-30 |
121316 | 75123 | 1.0 | 8.990 | 0.0 | 8.990 | 2014-06-29 22:00:00 | 0 | 19 | 2 | 4.0 | 3.0 | 6.9223 | 8.99 | No Discount | No Discount | No Discount | 0.0 | 0.0 | 2011-05-01 | 2014-11-30 |
121317 rows × 20 columns
def prepare_sales_order_header(peripheral_pd, train_or_test):
"""
Helper function that imitates the behavior of
the data model defined above.
"""
peripheral_new = peripheral_pd.merge(
train_or_test[["CustomerID", "OrderDate", "SalesOrderID"]],
on="CustomerID"
)
peripheral_new = peripheral_new[
peripheral_new["OrderDate_x"] < peripheral_new["OrderDate_y"]
]
del peripheral_new["SalesOrderID_x"]
del peripheral_new["OrderDate_y"]
del peripheral_new["CustomerID"]
del peripheral_new["SalesPersonIDCat"]
del peripheral_new["TerritoryIDCat"]
return peripheral_new.rename(columns={"OrderDate_x": "OrderDate", "SalesOrderID_y": "SalesOrderID"})
def prepare_store(peripheral_pd, train_or_test):
"""
Helper function that imitates the behavior of
the data model defined above.
"""
peripheral_new = peripheral_pd.merge(
train_or_test[["SalesPersonID", "SalesOrderID"]],
on="SalesPersonID"
)
return peripheral_new
store_train_pd = prepare_store(store_pd, population_train_pd)
store_test_pd = prepare_store(store_pd, population_test_pd)
store_train_pd
SalesPersonID | test | SalesOrderID | |
---|---|---|---|
0 | 279 | 2014-09-12 09:15:07 | 43659 |
1 | 279 | 2014-09-12 09:15:07 | 43660 |
2 | 279 | 2014-09-12 09:15:07 | 43681 |
3 | 279 | 2014-09-12 09:15:07 | 43685 |
4 | 279 | 2014-09-12 09:15:07 | 43695 |
... | ... | ... | ... |
142427 | 290 | NaT | 63216 |
142428 | 290 | NaT | 63217 |
142429 | 290 | NaT | 63223 |
142430 | 290 | NaT | 63282 |
142431 | 290 | NaT | 63284 |
142432 rows × 3 columns
sales_order_header_train_pd = prepare_sales_order_header(sales_order_header_pd, population_train_pd)
sales_order_header_test_pd = prepare_sales_order_header(sales_order_header_pd, population_test_pd)
sales_order_header_train_pd
RevisionNumber | OnlineOrderFlag | ShipMethodID | SalesPersonID | TerritoryID | SubTotal | TaxAmt | Freight | TotalDue | churn | OrderDate | DueDate | ShipDate | ModifiedDate | SalesOrderID | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 8 | 0 | 5 | 279 | 5 | 20565.6206 | 1971.5149 | 616.0984 | 23153.2339 | 0.0 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 44305 |
2 | 8 | 0 | 5 | 279 | 5 | 20565.6206 | 1971.5149 | 616.0984 | 23153.2339 | 0.0 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 45061 |
3 | 8 | 0 | 5 | 279 | 5 | 20565.6206 | 1971.5149 | 616.0984 | 23153.2339 | 0.0 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 45779 |
4 | 8 | 0 | 5 | 279 | 5 | 20565.6206 | 1971.5149 | 616.0984 | 23153.2339 | 0.0 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 46604 |
5 | 8 | 0 | 5 | 279 | 5 | 20565.6206 | 1971.5149 | 616.0984 | 23153.2339 | 0.0 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 47693 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
39894 | 8 | 1 | 1 | None | 6 | 32.6000 | 2.6080 | 0.8150 | 36.0230 | 1.0 | 2013-12-14 23:00:00 | 2013-12-27 | 2013-12-22 | 2013-12-21 23:00:00 | 62799 |
39908 | 8 | 1 | 1 | None | 4 | 42.2800 | 3.3824 | 1.0570 | 46.7194 | 1.0 | 2013-12-12 23:00:00 | 2013-12-25 | 2013-12-20 | 2013-12-19 23:00:00 | 62770 |
39952 | 8 | 1 | 1 | None | 1 | 53.9900 | 4.3192 | 1.3498 | 59.6590 | 1.0 | 2013-12-14 23:00:00 | 2013-12-27 | 2013-12-22 | 2013-12-21 23:00:00 | 62796 |
40080 | 8 | 1 | 1 | None | 4 | 35.0000 | 2.8000 | 0.8750 | 38.6750 | 1.0 | 2013-12-18 23:00:00 | 2013-12-31 | 2013-12-26 | 2013-12-25 23:00:00 | 62643 |
40101 | 8 | 1 | 1 | None | 10 | 12.9400 | 1.0352 | 0.3235 | 14.2987 | 1.0 | 2013-12-19 23:00:00 | 2014-01-01 | 2013-12-27 | 2013-12-26 23:00:00 | 62914 |
12271 rows × 15 columns
del population_train_pd["CustomerID"]
del population_train_pd["SalesPersonIDCat"]
del population_test_pd["CustomerID"]
del population_test_pd["SalesPersonIDCat"]
featuretools's dataframes are similar for getML's container class. featuretools uses woodwork for typing information, which will be guessed automatically, but can also be provided manually.
def add_index(df):
df.insert(0, "index", range(len(df)))
population_pd_logical_types = {
'RevisionNumber': ww.logical_types.Categorical,
'OnlineOrderFlag': ww.logical_types.Categorical,
'ShipMethodID': ww.logical_types.Integer,
'TerritoryIDCat': ww.logical_types.Integer,
'SalesOrderID': ww.logical_types.Integer,
'SalesPersonID': ww.logical_types.IntegerNullable,
'TerritoryID': ww.logical_types.IntegerNullable,
'SubTotal': ww.logical_types.Double,
'TaxAmt': ww.logical_types.Double,
'Freight': ww.logical_types.Double,
'TotalDue': ww.logical_types.Double,
'churn': ww.logical_types.Categorical,
'OrderDate': ww.logical_types.Datetime,
'DueDate': ww.logical_types.Datetime,
'ShipDate': ww.logical_types.Datetime,
'ModifiedDate': ww.logical_types.Datetime
}
population_train_pd.ww.init(logical_types=population_pd_logical_types, index="SalesOrderID", name="population")
population_test_pd.ww.init(logical_types=population_pd_logical_types, index="SalesOrderID", name="population")
sales_order_detail_pd_logical_types = {
'index': ww.logical_types.Integer,
'SalesOrderID': ww.logical_types.Integer,
'OrderQty': ww.logical_types.Integer,
'UnitPrice': ww.logical_types.Double,
'UnitPriceDiscount': ww.logical_types.Double,
'LineTotal': ww.logical_types.Double,
'ModifiedDate': ww.logical_types.Datetime,
# merged with product_pd
'MakeFlag': ww.logical_types.Categorical,
'ProductSubcategoryID': ww.logical_types.Categorical,
'ProductModelID': ww.logical_types.Integer,
'ReorderPoint': ww.logical_types.Integer,
'StandardCost': ww.logical_types.Double,
'ListPrice': ww.logical_types.Double,
# merged with special_offer_pd
'Category': ww.logical_types.Categorical,
'Description': ww.logical_types.Categorical,
'Type': ww.logical_types.Categorical,
'MinQty': ww.logical_types.Integer,
'DiscountPct': ww.logical_types.Double,
'StartDate': ww.logical_types.Datetime,
'EndDate': ww.logical_types.Datetime
}
add_index(sales_order_detail_pd)
sales_order_detail_pd.ww.init(logical_types=sales_order_detail_pd_logical_types, index="index", name="sales_order_detail")
sales_order_header_pd_logical_types = {
'index': ww.logical_types.Integer,
'RevisionNumber': ww.logical_types.Categorical,
'OnlineOrderFlag': ww.logical_types.Categorical,
'ShipMethodID': ww.logical_types.Integer,
'SalesOrderID': ww.logical_types.Integer,
'SalesPersonID': ww.logical_types.IntegerNullable,
'TerritoryID': ww.logical_types.IntegerNullable,
'SubTotal': ww.logical_types.Double,
'TaxAmt': ww.logical_types.Double,
'Freight': ww.logical_types.Double,
'TotalDue': ww.logical_types.Double,
'churn': ww.logical_types.Categorical,
'OrderDate': ww.logical_types.Datetime,
'DueDate': ww.logical_types.Datetime,
'ShipDate': ww.logical_types.Datetime,
'ModifiedDate': ww.logical_types.Datetime
}
add_index(sales_order_header_train_pd)
sales_order_header_train_pd.ww.init(logical_types=sales_order_header_pd_logical_types, index="index", name="sales_order_header")
add_index(sales_order_header_test_pd)
sales_order_header_test_pd.ww.init(logical_types=sales_order_header_pd_logical_types, index="index", name="sales_order_header")
sales_order_reason_pd_logical_types = {
'index': ww.logical_types.Integer,
'SalesReasonID': ww.logical_types.Categorical,
'SalesOrderID': ww.logical_types.Integer
}
add_index(sales_order_reason_pd)
sales_order_reason_pd.ww.init(logical_types=sales_order_reason_pd_logical_types, index="index", name="sales_order_reason")
store_pd_logical_types = {
'index': ww.logical_types.Integer,
'SalesPersonID': ww.logical_types.IntegerNullable,
'test': ww.logical_types.Datetime,
'SalesOrderID': ww.logical_types.Integer
}
add_index(store_train_pd)
store_train_pd.ww.init(logical_types=store_pd_logical_types,