AdventureWorks - Predicting customer churn¶
In this notebook, we will demonstrate how getML can be used for a customer churn project using a synthetic dataset of a fictional company.
Summary:
- Prediction type: Classification model
- Domain: Customer loyalty
- Prediction target: churn
- Population size: 19704
Author: Dr. Patrick Urbanke
Background¶
AdventureWorks is a fictional company that sells bicycles. It is used by Microsoft to showcase how its MS SQL Server can be used to manage business data. Since this dataset resembles a real-world customer database and it is open-source, we will use it to showcase how getML can be used for a classic customer churn project (real customer databases are not easily available for the purposes of showcasing and benchmarking, for reasons of data privacy).
The dataset has been downloaded from the CTU Prague relational learning repository (Motl and Schulte, 2015).
Analysis¶
Let's get started with the analysis and set up your session:
%pip install -q "getml==1.4.0" "numpy<2.0.0" "pandas~=2.2"
import numpy as np
import pandas as pd
import getml
getml.engine.launch()
getml.set_project('adventure_works')
Note: you may need to restart the kernel to use updated packages. Launching ./getML --allow-push-notifications=true --allow-remote-ips=false --home-directory=/home/alex/.local/lib/python3.10/site-packages/getml --in-memory=true --install=false --launch-browser=true --log=false in /home/alex/.local/lib/python3.10/site-packages/getml/.getML/getml-1.4.0-x64-community-edition-linux... Launched the getML engine. The log output will be stored in /home/alex/.getML/logs/20240807162542.log. Connected to project 'adventure_works'
1. Loading data¶
1.1 Download from source¶
We begin by downloading the data:
conn = getml.database.connect_mariadb(
host="db.relational-data.org",
dbname="AdventureWorks2014",
port=3306,
user="guest",
password="relational"
)
conn
Connection(dbname='AdventureWorks2014', dialect='mysql', host='db.relational-data.org', port=3306)
def load_if_needed(name):
"""
Loads the data from the relational learning
repository, if the data frame has not already
been loaded.
"""
if not getml.data.exists(name):
data_frame = getml.DataFrame.from_db(
name=name,
table_name=name,
conn=conn
)
data_frame.save()
else:
data_frame = getml.data.load_data_frame(name)
return data_frame
product = load_if_needed("Product")
sales_order_detail = load_if_needed("SalesOrderDetail")
sales_order_header = load_if_needed("SalesOrderHeader")
sales_order_reason = load_if_needed("SalesOrderHeaderSalesReason")
special_offer = load_if_needed("SpecialOffer")
store = load_if_needed("Store")
product
name | ProductID | MakeFlag | FinishedGoodsFlag | SafetyStockLevel | ReorderPoint | DaysToManufacture | ProductSubcategoryID | ProductModelID | Name | ProductNumber | Color | StandardCost | ListPrice | Size | SizeUnitMeasureCode | WeightUnitMeasureCode | Weight | ProductLine | Class | Style | SellStartDate | SellEndDate | DiscontinuedDate | rowguid | ModifiedDate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
0 | 1 | 0 | 0 | 1000 | 750 | 0 | nan | nan | Adjustable Race | AR-5381 | NULL | 0.0000 | 0.0000 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2008-04-30 00:00:00 | NULL | NULL | 694215B7-08F7-4C0D-ACB1-D734BA44... | 2014-02-08 09:01:36 |
1 | 2 | 0 | 0 | 1000 | 750 | 0 | nan | nan | Bearing Ball | BA-8327 | NULL | 0.0000 | 0.0000 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2008-04-30 00:00:00 | NULL | NULL | 58AE3C20-4F3A-4749-A7D4-D568806C... | 2014-02-08 09:01:36 |
2 | 3 | 1 | 0 | 800 | 600 | 1 | nan | nan | BB Ball Bearing | BE-2349 | NULL | 0.0000 | 0.0000 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2008-04-30 00:00:00 | NULL | NULL | 9C21AED2-5BFA-4F18-BCB8-F11638DC... | 2014-02-08 09:01:36 |
3 | 4 | 0 | 0 | 800 | 600 | 0 | nan | nan | Headset Ball Bearings | BE-2908 | NULL | 0.0000 | 0.0000 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2008-04-30 00:00:00 | NULL | NULL | ECFED6CB-51FF-49B5-B06C-7D8AC834... | 2014-02-08 09:01:36 |
4 | 316 | 1 | 0 | 800 | 600 | 1 | nan | nan | Blade | BL-2036 | NULL | 0.0000 | 0.0000 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2008-04-30 00:00:00 | NULL | NULL | E73E9750-603B-4131-89F5-3DD15ED5... | 2014-02-08 09:01:36 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
499 | 995 | 1 | 1 | 500 | 375 | 1 | 5 | 96 | ML Bottom Bracket | BB-8107 | NULL | 44.9506 | 101.2400 | NULL | NULL | G | 168.00 | NULL | M | NULL | 2013-05-30 00:00:00 | NULL | NULL | 71AB847F-D091-42D6-B735-7B0C2D82... | 2014-02-08 09:01:36 |
500 | 996 | 1 | 1 | 500 | 375 | 1 | 5 | 97 | HL Bottom Bracket | BB-9108 | NULL | 53.9416 | 121.4900 | NULL | NULL | G | 170.00 | NULL | H | NULL | 2013-05-30 00:00:00 | NULL | NULL | 230C47C5-08B2-4CE3-B706-69C0BDD6... | 2014-02-08 09:01:36 |
501 | 997 | 1 | 1 | 100 | 75 | 4 | 2 | 31 | Road-750 Black, 44 | BK-R19B-44 | Black | 343.6496 | 539.9900 | 44 | CM | LB | 19.77 | R | L | U | 2013-05-30 00:00:00 | NULL | NULL | 44CE4802-409F-43AB-9B27-CA534218... | 2014-02-08 09:01:36 |
502 | 998 | 1 | 1 | 100 | 75 | 4 | 2 | 31 | Road-750 Black, 48 | BK-R19B-48 | Black | 343.6496 | 539.9900 | 48 | CM | LB | 20.13 | R | L | U | 2013-05-30 00:00:00 | NULL | NULL | 3DE9A212-1D49-40B6-B10A-F564D981... | 2014-02-08 09:01:36 |
503 | 999 | 1 | 1 | 100 | 75 | 4 | 2 | 31 | Road-750 Black, 52 | BK-R19B-52 | Black | 343.6496 | 539.9900 | 52 | CM | LB | 20.42 | R | L | U | 2013-05-30 00:00:00 | NULL | NULL | AE638923-2B67-4679-B90E-ABBAB17D... | 2014-02-08 09:01:36 |
504 rows x 25 columns
memory usage: 0.17 MB
name: Product
type: getml.DataFrame
sales_order_detail
name | SalesOrderID | SalesOrderDetailID | OrderQty | ProductID | SpecialOfferID | CarrierTrackingNumber | UnitPrice | UnitPriceDiscount | LineTotal | rowguid | ModifiedDate |
---|---|---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
0 | 43659 | 1 | 1 | 776 | 1 | 4911-403C-98 | 2024.9940 | 0.0000 | 2024.994000 | B207C96D-D9E6-402B-8470-2CC176C4... | 2011-05-30 22:00:00 |
1 | 43659 | 2 | 3 | 777 | 1 | 4911-403C-98 | 2024.9940 | 0.0000 | 6074.982000 | 7ABB600D-1E77-41BE-9FE5-B9142CFC... | 2011-05-30 22:00:00 |
2 | 43659 | 3 | 1 | 778 | 1 | 4911-403C-98 | 2024.9940 | 0.0000 | 2024.994000 | 475CF8C6-49F6-486E-B0AD-AFC6A50C... | 2011-05-30 22:00:00 |
3 | 43659 | 4 | 1 | 771 | 1 | 4911-403C-98 | 2039.9940 | 0.0000 | 2039.994000 | 04C4DE91-5815-45D6-8670-F462719F... | 2011-05-30 22:00:00 |
4 | 43659 | 5 | 1 | 772 | 1 | 4911-403C-98 | 2039.9940 | 0.0000 | 2039.994000 | 5A74C7D2-E641-438E-A7AC-37BF2328... | 2011-05-30 22:00:00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
121312 | 75122 | 121313 | 1 | 878 | 1 | NULL | 21.9800 | 0.0000 | 21.980000 | 8CAD6675-18CC-4F47-8287-97B41A8E... | 2014-06-29 22:00:00 |
121313 | 75122 | 121314 | 1 | 712 | 1 | NULL | 8.9900 | 0.0000 | 8.990000 | 84F1C363-1C50-4442-BE16-541C59B6... | 2014-06-29 22:00:00 |
121314 | 75123 | 121315 | 1 | 878 | 1 | NULL | 21.9800 | 0.0000 | 21.980000 | C18B6476-429F-4BB1-828E-2BE5F82A... | 2014-06-29 22:00:00 |
121315 | 75123 | 121316 | 1 | 879 | 1 | NULL | 159.0000 | 0.0000 | 159.000000 | 75A89C6A-C60A-47EA-8A52-B52A9C43... | 2014-06-29 22:00:00 |
121316 | 75123 | 121317 | 1 | 712 | 1 | NULL | 8.9900 | 0.0000 | 8.990000 | 73646D26-0461-450D-8019-2C6C8586... | 2014-06-29 22:00:00 |
121317 rows x 11 columns
memory usage: 21.60 MB
name: SalesOrderDetail
type: getml.DataFrame
sales_order_header
name | SalesOrderID | RevisionNumber | Status | OnlineOrderFlag | CustomerID | SalesPersonID | TerritoryID | BillToAddressID | ShipToAddressID | ShipMethodID | CreditCardID | CurrencyRateID | OrderDate | DueDate | ShipDate | SalesOrderNumber | PurchaseOrderNumber | AccountNumber | CreditCardApprovalCode | SubTotal | TaxAmt | Freight | TotalDue | Comment | rowguid | ModifiedDate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
0 | 43659 | 8 | 5 | 0 | 29825 | 279 | 5 | 985 | 985 | 5 | 16281 | nan | 2011-05-30 22:00:00 | 2011-06-12 00:00:00 | 2011-06-07 00:00:00 | SO43659 | PO522145787 | 10-4020-000676 | 105041Vi84182 | 20565.6206 | 1971.5149 | 616.0984 | 23153.2339 | NULL | 79B65321-39CA-4115-9CBA-8FE0903E... | 2011-06-06 22:00:00 |
1 | 43660 | 8 | 5 | 0 | 29672 | 279 | 5 | 921 | 921 | 5 | 5618 | nan | 2011-05-30 22:00:00 | 2011-06-12 00:00:00 | 2011-06-07 00:00:00 | SO43660 | PO18850127500 | 10-4020-000117 | 115213Vi29411 | 1294.2529 | 124.2483 | 38.8276 | 1457.3288 | NULL | 738DC42D-D03B-48A1-9822-F95A67EA... | 2011-06-06 22:00:00 |
2 | 43661 | 8 | 5 | 0 | 29734 | 282 | 6 | 517 | 517 | 5 | 1346 | 4 | 2011-05-30 22:00:00 | 2011-06-12 00:00:00 | 2011-06-07 00:00:00 | SO43661 | PO18473189620 | 10-4020-000442 | 85274Vi6854 | 32726.4786 | 3153.7696 | 985.5530 | 36865.8012 | NULL | D91B9131-18A4-4A11-BC3A-90B6F53E... | 2011-06-06 22:00:00 |
3 | 43662 | 8 | 5 | 0 | 29994 | 282 | 6 | 482 | 482 | 5 | 10456 | 4 | 2011-05-30 22:00:00 | 2011-06-12 00:00:00 | 2011-06-07 00:00:00 | SO43662 | PO18444174044 | 10-4020-000227 | 125295Vi53935 | 28832.5289 | 2775.1646 | 867.2389 | 32474.9324 | NULL | 4A1ECFC0-CC3A-4740-B028-1C50BB48... | 2011-06-06 22:00:00 |
4 | 43663 | 8 | 5 | 0 | 29565 | 276 | 4 | 1073 | 1073 | 5 | 4322 | nan | 2011-05-30 22:00:00 | 2011-06-12 00:00:00 | 2011-06-07 00:00:00 | SO43663 | PO18009186470 | 10-4020-000510 | 45303Vi22691 | 419.4589 | 40.2681 | 12.5838 | 472.3108 | NULL | 9B1E7A40-6AE0-4AD3-811C-A6495185... | 2011-06-06 22:00:00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
31460 | 75119 | 8 | 5 | 1 | 11981 | nan | 1 | 17649 | 17649 | 1 | 6761 | nan | 2014-06-29 22:00:00 | 2014-07-12 00:00:00 | 2014-07-07 00:00:00 | SO75119 | NULL | 10-4030-011981 | 429826Vi35166 | 42.2800 | 3.3824 | 1.0570 | 46.7194 | NULL | 9382F1C9-383A-435F-9449-0EECEA21... | 2014-07-06 22:00:00 |
31461 | 75120 | 8 | 5 | 1 | 18749 | nan | 6 | 28374 | 28374 | 1 | 8925 | nan | 2014-06-29 22:00:00 | 2014-07-12 00:00:00 | 2014-07-07 00:00:00 | SO75120 | NULL | 10-4030-018749 | 929849Vi46003 | 84.9600 | 6.7968 | 2.1240 | 93.8808 | NULL | AE6A4FCF-FF73-4CD4-AF2C-5993D00D... | 2014-07-06 22:00:00 |
31462 | 75121 | 8 | 5 | 1 | 15251 | nan | 6 | 26553 | 26553 | 1 | 14220 | nan | 2014-06-29 22:00:00 | 2014-07-12 00:00:00 | 2014-07-07 00:00:00 | SO75121 | NULL | 10-4030-015251 | 529864Vi73738 | 74.9800 | 5.9984 | 1.8745 | 82.8529 | NULL | D7395C0E-00CB-4BFA-A238-0D6A9F49... | 2014-07-06 22:00:00 |
31463 | 75122 | 8 | 5 | 1 | 15868 | nan | 6 | 14616 | 14616 | 1 | 18719 | nan | 2014-06-29 22:00:00 | 2014-07-12 00:00:00 | 2014-07-07 00:00:00 | SO75122 | NULL | 10-4030-015868 | 330022Vi97312 | 30.9700 | 2.4776 | 0.7743 | 34.2219 | NULL | 4221035A-4159-492F-AF40-4363A64F... | 2014-07-06 22:00:00 |
31464 | 75123 | 8 | 5 | 1 | 18759 | nan | 6 | 14024 | 14024 | 1 | 10084 | nan | 2014-06-29 22:00:00 | 2014-07-12 00:00:00 | 2014-07-07 00:00:00 | SO75123 | NULL | 10-4030-018759 | 230370Vi51970 | 189.9700 | 15.1976 | 4.7493 | 209.9169 | NULL | D54752FF-2B54-4BE5-95EA-3B72289C... | 2014-07-06 22:00:00 |
31465 rows x 26 columns
memory usage: 12.56 MB
name: SalesOrderHeader
type: getml.DataFrame
sales_order_reason
name | SalesOrderID | SalesReasonID | ModifiedDate |
---|---|---|---|
role | unused_float | unused_float | unused_string |
0 | 43697 | 5 | 2011-05-30 22:00:00 |
1 | 43697 | 9 | 2011-05-30 22:00:00 |
2 | 43702 | 5 | 2011-05-31 22:00:00 |
3 | 43702 | 9 | 2011-05-31 22:00:00 |
4 | 43703 | 5 | 2011-05-31 22:00:00 |
... | ... | ... | |
27642 | 75119 | 1 | 2014-06-29 22:00:00 |
27643 | 75120 | 1 | 2014-06-29 22:00:00 |
27644 | 75121 | 1 | 2014-06-29 22:00:00 |
27645 | 75122 | 1 | 2014-06-29 22:00:00 |
27646 | 75123 | 1 | 2014-06-29 22:00:00 |
27647 rows x 3 columns
memory usage: 1.22 MB
name: SalesOrderHeaderSalesReason
type: getml.DataFrame
special_offer
name | SpecialOfferID | MinQty | MaxQty | Description | DiscountPct | Type | Category | StartDate | EndDate | rowguid | ModifiedDate |
---|---|---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
0 | 1 | 0 | nan | No Discount | 0.0000 | No Discount | No Discount | 2011-05-01 00:00:00 | 2014-11-30 00:00:00 | 0290C4F5-191F-4337-AB6B-0A2DDE03... | 2011-03-31 22:00:00 |
1 | 2 | 11 | 14 | Volume Discount 11 to 14 | 0.0200 | Volume Discount | Reseller | 2011-05-31 00:00:00 | 2014-05-30 00:00:00 | D7542EE7-15DB-4541-985C-5CC27AEF... | 2011-04-30 22:00:00 |
2 | 3 | 15 | 24 | Volume Discount 15 to 24 | 0.0500 | Volume Discount | Reseller | 2011-05-31 00:00:00 | 2014-05-30 00:00:00 | 4BDBCC01-8CF7-40A9-B643-40EC5B71... | 2011-04-30 22:00:00 |
3 | 4 | 25 | 40 | Volume Discount 25 to 40 | 0.1000 | Volume Discount | Reseller | 2011-05-31 00:00:00 | 2014-05-30 00:00:00 | 504B5E85-8F3F-4EBC-9E1D-C1BC5DEA... | 2011-04-30 22:00:00 |
4 | 5 | 41 | 60 | Volume Discount 41 to 60 | 0.1500 | Volume Discount | Reseller | 2011-05-31 00:00:00 | 2014-05-30 00:00:00 | 677E1D9D-944F-4E81-90E8-47EB0A82... | 2011-04-30 22:00:00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
11 | 12 | 0 | nan | LL Road Frame Sale | 0.3500 | Excess Inventory | Reseller | 2013-05-30 00:00:00 | 2013-07-14 00:00:00 | C0AF1C89-9722-4235-9248-3FBA4D9E... | 2013-04-29 22:00:00 |
12 | 13 | 0 | nan | Touring-3000 Promotion | 0.1500 | New Product | Reseller | 2013-05-30 00:00:00 | 2013-08-29 00:00:00 | 5061CCE4-E021-45A8-9A75-DFB36CBB... | 2013-04-29 22:00:00 |
13 | 14 | 0 | nan | Touring-1000 Promotion | 0.2000 | New Product | Reseller | 2013-05-30 00:00:00 | 2013-08-29 00:00:00 | 1AF84A9E-A98C-4BD9-B48F-DC2B8B6B... | 2013-04-29 22:00:00 |
14 | 15 | 0 | nan | Half-Price Pedal Sale | 0.5000 | Seasonal Discount | Customer | 2013-07-14 00:00:00 | 2013-08-14 00:00:00 | 03E3594D-6EBB-46A6-B8EE-A9289C0C... | 2013-06-13 22:00:00 |
15 | 16 | 0 | nan | Mountain-500 Silver Clearance Sa... | 0.4000 | Discontinued Product | Reseller | 2014-03-31 00:00:00 | 2014-05-30 00:00:00 | EB7CB484-BCCF-4D2D-BF73-521B2001... | 2014-02-28 23:00:00 |
16 rows x 11 columns
memory usage: 0.00 MB
name: SpecialOffer
type: getml.DataFrame
store
name | BusinessEntityID | SalesPersonID | Name | Demographics | rowguid | ModifiedDate |
---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string |
0 | 292 | 279 | Next-Door Bike Store | <StoreSurvey xmlns="http://schem... | A22517E3-848D-4EBE-B9D9-7437F343... | 2014-09-12 09:15:07 |
1 | 294 | 276 | Professional Sales and Service | <StoreSurvey xmlns="http://schem... | B50CA50B-C601-4A13-B07E-2C63862D... | 2014-09-12 09:15:07 |
2 | 296 | 277 | Riders Company | <StoreSurvey xmlns="http://schem... | 337C3688-1339-4E1A-A08A-B54B2356... | 2014-09-12 09:15:07 |
3 | 298 | 275 | The Bike Mechanics | <StoreSurvey xmlns="http://schem... | 7894F278-F0C8-4D16-BD75-213FDBF1... | 2014-09-12 09:15:07 |
4 | 300 | 286 | Nationwide Supply | <StoreSurvey xmlns="http://schem... | C3FC9705-A8C4-4F3A-9550-EB2FA4B7... | 2014-09-12 09:15:07 |
... | ... | ... | ... | ... | ... | |
696 | 1988 | 282 | Retreat Inn | <StoreSurvey xmlns="http://schem... | EA21EC81-1BFA-4A07-9B4D-73D9852A... | 2014-09-12 09:15:07 |
697 | 1990 | 281 | Technical Parts Manufacturing | <StoreSurvey xmlns="http://schem... | C8E3C4ED-8F58-4DB2-B600-E0CD11D9... | 2014-09-12 09:15:07 |
698 | 1992 | 277 | Totes & Baskets Company | <StoreSurvey xmlns="http://schem... | CE860B58-643C-4567-BFD8-06E97969... | 2014-09-12 09:15:07 |
699 | 1994 | 277 | World of Bikes | <StoreSurvey xmlns="http://schem... | 0C10F2B6-A13A-440C-9C25-5B28D482... | 2014-09-12 09:15:07 |
700 | 2051 | 275 | A Bicycle Association | <StoreSurvey xmlns="http://schem... | 82237172-D3FE-4A95-82EF-636F6552... | 2014-09-12 09:15:07 |
701 rows x 6 columns
memory usage: 0.38 MB
name: Store
type: getml.DataFrame
1.2 Prepare data for getML¶
getML requires that we define roles for each of the columns.
product.set_role("ProductID", getml.data.roles.join_key)
product.set_role(["MakeFlag", "ProductSubcategoryID", "ProductModelID"], getml.data.roles.categorical)
product.set_role(["SafetyStockLevel", "ReorderPoint", "StandardCost", "ListPrice"], getml.data.roles.numerical)
product
name | ProductID | MakeFlag | ProductSubcategoryID | ProductModelID | SafetyStockLevel | ReorderPoint | StandardCost | ListPrice | FinishedGoodsFlag | DaysToManufacture | Name | ProductNumber | Color | Size | SizeUnitMeasureCode | WeightUnitMeasureCode | Weight | ProductLine | Class | Style | SellStartDate | SellEndDate | DiscontinuedDate | rowguid | ModifiedDate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | join_key | categorical | categorical | categorical | numerical | numerical | numerical | numerical | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
0 | 1 | 0 | nan | nan | 1000 | 750 | 0 | 0 | 0 | 0 | Adjustable Race | AR-5381 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2008-04-30 00:00:00 | NULL | NULL | 694215B7-08F7-4C0D-ACB1-D734BA44... | 2014-02-08 09:01:36 |
1 | 2 | 0 | nan | nan | 1000 | 750 | 0 | 0 | 0 | 0 | Bearing Ball | BA-8327 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2008-04-30 00:00:00 | NULL | NULL | 58AE3C20-4F3A-4749-A7D4-D568806C... | 2014-02-08 09:01:36 |
2 | 3 | 1 | nan | nan | 800 | 600 | 0 | 0 | 0 | 1 | BB Ball Bearing | BE-2349 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2008-04-30 00:00:00 | NULL | NULL | 9C21AED2-5BFA-4F18-BCB8-F11638DC... | 2014-02-08 09:01:36 |
3 | 4 | 0 | nan | nan | 800 | 600 | 0 | 0 | 0 | 0 | Headset Ball Bearings | BE-2908 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2008-04-30 00:00:00 | NULL | NULL | ECFED6CB-51FF-49B5-B06C-7D8AC834... | 2014-02-08 09:01:36 |
4 | 316 | 1 | nan | nan | 800 | 600 | 0 | 0 | 0 | 1 | Blade | BL-2036 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2008-04-30 00:00:00 | NULL | NULL | E73E9750-603B-4131-89F5-3DD15ED5... | 2014-02-08 09:01:36 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
499 | 995 | 1 | 5 | 96 | 500 | 375 | 44.9506 | 101.24 | 1 | 1 | ML Bottom Bracket | BB-8107 | NULL | NULL | NULL | G | 168.00 | NULL | M | NULL | 2013-05-30 00:00:00 | NULL | NULL | 71AB847F-D091-42D6-B735-7B0C2D82... | 2014-02-08 09:01:36 |
500 | 996 | 1 | 5 | 97 | 500 | 375 | 53.9416 | 121.49 | 1 | 1 | HL Bottom Bracket | BB-9108 | NULL | NULL | NULL | G | 170.00 | NULL | H | NULL | 2013-05-30 00:00:00 | NULL | NULL | 230C47C5-08B2-4CE3-B706-69C0BDD6... | 2014-02-08 09:01:36 |
501 | 997 | 1 | 2 | 31 | 100 | 75 | 343.6496 | 539.99 | 1 | 4 | Road-750 Black, 44 | BK-R19B-44 | Black | 44 | CM | LB | 19.77 | R | L | U | 2013-05-30 00:00:00 | NULL | NULL | 44CE4802-409F-43AB-9B27-CA534218... | 2014-02-08 09:01:36 |
502 | 998 | 1 | 2 | 31 | 100 | 75 | 343.6496 | 539.99 | 1 | 4 | Road-750 Black, 48 | BK-R19B-48 | Black | 48 | CM | LB | 20.13 | R | L | U | 2013-05-30 00:00:00 | NULL | NULL | 3DE9A212-1D49-40B6-B10A-F564D981... | 2014-02-08 09:01:36 |
503 | 999 | 1 | 2 | 31 | 100 | 75 | 343.6496 | 539.99 | 1 | 4 | Road-750 Black, 52 | BK-R19B-52 | Black | 52 | CM | LB | 20.42 | R | L | U | 2013-05-30 00:00:00 | NULL | NULL | AE638923-2B67-4679-B90E-ABBAB17D... | 2014-02-08 09:01:36 |
504 rows x 25 columns
memory usage: 0.16 MB
name: Product
type: getml.DataFrame
sales_order_detail.set_role(["SalesOrderID", "SalesOrderDetailID", "ProductID", "SpecialOfferID"], getml.data.roles.join_key)
sales_order_detail.set_role(["OrderQty", "UnitPrice", "UnitPriceDiscount", "LineTotal"], getml.data.roles.numerical)
sales_order_detail.set_role("ModifiedDate", getml.data.roles.time_stamp)
sales_order_detail
name | ModifiedDate | SalesOrderID | SalesOrderDetailID | ProductID | SpecialOfferID | OrderQty | UnitPrice | UnitPriceDiscount | LineTotal | CarrierTrackingNumber | rowguid |
---|---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | join_key | join_key | join_key | join_key | numerical | numerical | numerical | numerical | unused_string | unused_string |
unit | time stamp, comparison only | ||||||||||
0 | 2011-05-30 22:00:00 | 43659 | 1 | 776 | 1 | 1 | 2024.994 | 0 | 2024.994 | 4911-403C-98 | B207C96D-D9E6-402B-8470-2CC176C4... |
1 | 2011-05-30 22:00:00 | 43659 | 2 | 777 | 1 | 3 | 2024.994 | 0 | 6074.982 | 4911-403C-98 | 7ABB600D-1E77-41BE-9FE5-B9142CFC... |
2 | 2011-05-30 22:00:00 | 43659 | 3 | 778 | 1 | 1 | 2024.994 | 0 | 2024.994 | 4911-403C-98 | 475CF8C6-49F6-486E-B0AD-AFC6A50C... |
3 | 2011-05-30 22:00:00 | 43659 | 4 | 771 | 1 | 1 | 2039.994 | 0 | 2039.994 | 4911-403C-98 | 04C4DE91-5815-45D6-8670-F462719F... |
4 | 2011-05-30 22:00:00 | 43659 | 5 | 772 | 1 | 1 | 2039.994 | 0 | 2039.994 | 4911-403C-98 | 5A74C7D2-E641-438E-A7AC-37BF2328... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
121312 | 2014-06-29 22:00:00 | 75122 | 121313 | 878 | 1 | 1 | 21.98 | 0 | 21.98 | NULL | 8CAD6675-18CC-4F47-8287-97B41A8E... |
121313 | 2014-06-29 22:00:00 | 75122 | 121314 | 712 | 1 | 1 | 8.99 | 0 | 8.99 | NULL | 84F1C363-1C50-4442-BE16-541C59B6... |
121314 | 2014-06-29 22:00:00 | 75123 | 121315 | 878 | 1 | 1 | 21.98 | 0 | 21.98 | NULL | C18B6476-429F-4BB1-828E-2BE5F82A... |
121315 | 2014-06-29 22:00:00 | 75123 | 121316 | 879 | 1 | 1 | 159 | 0 | 159 | NULL | 75A89C6A-C60A-47EA-8A52-B52A9C43... |
121316 | 2014-06-29 22:00:00 | 75123 | 121317 | 712 | 1 | 1 | 8.99 | 0 | 8.99 | NULL | 73646D26-0461-450D-8019-2C6C8586... |
121317 rows x 11 columns
memory usage: 14.08 MB
name: SalesOrderDetail
type: getml.DataFrame
sales_order_reason.set_role("SalesOrderID", getml.data.roles.join_key)
sales_order_reason.set_role("SalesReasonID", getml.data.roles.categorical)
sales_order_reason
name | SalesOrderID | SalesReasonID | ModifiedDate |
---|---|---|---|
role | join_key | categorical | unused_string |
0 | 43697 | 5 | 2011-05-30 22:00:00 |
1 | 43697 | 9 | 2011-05-30 22:00:00 |
2 | 43702 | 5 | 2011-05-31 22:00:00 |
3 | 43702 | 9 | 2011-05-31 22:00:00 |
4 | 43703 | 5 | 2011-05-31 22:00:00 |
... | ... | ... | |
27642 | 75119 | 1 | 2014-06-29 22:00:00 |
27643 | 75120 | 1 | 2014-06-29 22:00:00 |
27644 | 75121 | 1 | 2014-06-29 22:00:00 |
27645 | 75122 | 1 | 2014-06-29 22:00:00 |
27646 | 75123 | 1 | 2014-06-29 22:00:00 |
27647 rows x 3 columns
memory usage: 1.00 MB
name: SalesOrderHeaderSalesReason
type: getml.DataFrame
special_offer.set_role(["SpecialOfferID"], getml.data.roles.join_key)
special_offer.set_role(["MinQty", "DiscountPct"], getml.data.roles.numerical)
special_offer.set_role(["Category", "Description", "Type"], getml.data.roles.categorical)
special_offer.set_role(["StartDate", "EndDate"], getml.data.roles.time_stamp)
special_offer
name | StartDate | EndDate | SpecialOfferID | Category | Description | Type | MinQty | DiscountPct | MaxQty | rowguid | ModifiedDate |
---|---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | time_stamp | join_key | categorical | categorical | categorical | numerical | numerical | unused_float | unused_string | unused_string |
unit | time stamp, comparison only | time stamp, comparison only | |||||||||
0 | 2011-05-01 | 2014-11-30 | 1 | No Discount | No Discount | No Discount | 0 | 0 | nan | 0290C4F5-191F-4337-AB6B-0A2DDE03... | 2011-03-31 22:00:00 |
1 | 2011-05-31 | 2014-05-30 | 2 | Reseller | Volume Discount 11 to 14 | Volume Discount | 11 | 0.02 | 14 | D7542EE7-15DB-4541-985C-5CC27AEF... | 2011-04-30 22:00:00 |
2 | 2011-05-31 | 2014-05-30 | 3 | Reseller | Volume Discount 15 to 24 | Volume Discount | 15 | 0.05 | 24 | 4BDBCC01-8CF7-40A9-B643-40EC5B71... | 2011-04-30 22:00:00 |
3 | 2011-05-31 | 2014-05-30 | 4 | Reseller | Volume Discount 25 to 40 | Volume Discount | 25 | 0.1 | 40 | 504B5E85-8F3F-4EBC-9E1D-C1BC5DEA... | 2011-04-30 22:00:00 |
4 | 2011-05-31 | 2014-05-30 | 5 | Reseller | Volume Discount 41 to 60 | Volume Discount | 41 | 0.15 | 60 | 677E1D9D-944F-4E81-90E8-47EB0A82... | 2011-04-30 22:00:00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
11 | 2013-05-30 | 2013-07-14 | 12 | Reseller | LL Road Frame Sale | Excess Inventory | 0 | 0.35 | nan | C0AF1C89-9722-4235-9248-3FBA4D9E... | 2013-04-29 22:00:00 |
12 | 2013-05-30 | 2013-08-29 | 13 | Reseller | Touring-3000 Promotion | New Product | 0 | 0.15 | nan | 5061CCE4-E021-45A8-9A75-DFB36CBB... | 2013-04-29 22:00:00 |
13 | 2013-05-30 | 2013-08-29 | 14 | Reseller | Touring-1000 Promotion | New Product | 0 | 0.2 | nan | 1AF84A9E-A98C-4BD9-B48F-DC2B8B6B... | 2013-04-29 22:00:00 |
14 | 2013-07-14 | 2013-08-14 | 15 | Customer | Half-Price Pedal Sale | Seasonal Discount | 0 | 0.5 | nan | 03E3594D-6EBB-46A6-B8EE-A9289C0C... | 2013-06-13 22:00:00 |
15 | 2014-03-31 | 2014-05-30 | 16 | Reseller | Mountain-500 Silver Clearance Sa... | Discontinued Product | 0 | 0.4 | nan | EB7CB484-BCCF-4D2D-BF73-521B2001... | 2014-02-28 23:00:00 |
16 rows x 11 columns
memory usage: 0.00 MB
name: SpecialOffer
type: getml.DataFrame
store.set_role(["SalesPersonID"], getml.data.roles.join_key)
store["test"] = store["ModifiedDate"].update(getml.data.random() > 0.5, "NULL")
store.set_role(["SalesPersonID"], getml.data.roles.join_key)
store.set_role(["test"], getml.data.roles.time_stamp)
store
name | test | SalesPersonID | BusinessEntityID | Name | Demographics | rowguid | ModifiedDate |
---|---|---|---|---|---|---|---|
role | time_stamp | join_key | unused_float | unused_string | unused_string | unused_string | unused_string |
unit | time stamp, comparison only | ||||||
0 | 2014-09-12 09:15:07 | 279 | 292 | Next-Door Bike Store | <StoreSurvey xmlns="http://schem... | A22517E3-848D-4EBE-B9D9-7437F343... | 2014-09-12 09:15:07 |
1 | NULL | 276 | 294 | Professional Sales and Service | <StoreSurvey xmlns="http://schem... | B50CA50B-C601-4A13-B07E-2C63862D... | 2014-09-12 09:15:07 |
2 | 2014-09-12 09:15:07 | 277 | 296 | Riders Company | <StoreSurvey xmlns="http://schem... | 337C3688-1339-4E1A-A08A-B54B2356... | 2014-09-12 09:15:07 |
3 | NULL | 275 | 298 | The Bike Mechanics | <StoreSurvey xmlns="http://schem... | 7894F278-F0C8-4D16-BD75-213FDBF1... | 2014-09-12 09:15:07 |
4 | NULL | 286 | 300 | Nationwide Supply | <StoreSurvey xmlns="http://schem... | C3FC9705-A8C4-4F3A-9550-EB2FA4B7... | 2014-09-12 09:15:07 |
... | ... | ... | ... | ... | ... | ... | |
696 | NULL | 282 | 1988 | Retreat Inn | <StoreSurvey xmlns="http://schem... | EA21EC81-1BFA-4A07-9B4D-73D9852A... | 2014-09-12 09:15:07 |
697 | 2014-09-12 09:15:07 | 281 | 1990 | Technical Parts Manufacturing | <StoreSurvey xmlns="http://schem... | C8E3C4ED-8F58-4DB2-B600-E0CD11D9... | 2014-09-12 09:15:07 |
698 | NULL | 277 | 1992 | Totes & Baskets Company | <StoreSurvey xmlns="http://schem... | CE860B58-643C-4567-BFD8-06E97969... | 2014-09-12 09:15:07 |
699 | NULL | 277 | 1994 | World of Bikes | <StoreSurvey xmlns="http://schem... | 0C10F2B6-A13A-440C-9C25-5B28D482... | 2014-09-12 09:15:07 |
700 | 2014-09-12 09:15:07 | 275 | 2051 | A Bicycle Association | <StoreSurvey xmlns="http://schem... | 82237172-D3FE-4A95-82EF-636F6552... | 2014-09-12 09:15:07 |
701 rows x 7 columns
memory usage: 0.38 MB
name: Store
type: getml.DataFrame
sales_order_header["SalesPersonIDCat"] = sales_order_header["SalesPersonID"]
sales_order_header["TerritoryIDCat"] = sales_order_header["TerritoryID"]
sales_order_header.set_role(["CustomerID", "SalesOrderID", "SalesPersonID", "TerritoryID"], getml.data.roles.join_key)
sales_order_header.set_role(
["RevisionNumber", "OnlineOrderFlag", "SalesPersonIDCat", "TerritoryIDCat", "ShipMethodID"],
getml.data.roles.categorical)
sales_order_header.set_role(["SubTotal", "TaxAmt", "Freight", "TotalDue"], getml.data.roles.numerical)
sales_order_header.set_role(["OrderDate", "DueDate", "ShipDate", "ModifiedDate"], getml.data.roles.time_stamp)
sales_order_header
name | OrderDate | DueDate | ShipDate | ModifiedDate | CustomerID | SalesOrderID | SalesPersonID | TerritoryID | RevisionNumber | OnlineOrderFlag | SalesPersonIDCat | TerritoryIDCat | ShipMethodID | SubTotal | TaxAmt | Freight | TotalDue | Status | BillToAddressID | ShipToAddressID | CreditCardID | CurrencyRateID | SalesOrderNumber | PurchaseOrderNumber | AccountNumber | CreditCardApprovalCode | Comment | rowguid |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | time_stamp | time_stamp | time_stamp | join_key | join_key | join_key | join_key | categorical | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
unit | time stamp, comparison only | time stamp, comparison only | time stamp, comparison only | time stamp, comparison only | ||||||||||||||||||||||||
0 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 29825 | 43659 | 279 | 5 | 8 | 0 | 279 | 5 | 5 | 20565.6206 | 1971.5149 | 616.0984 | 23153.2339 | 5 | 985 | 985 | 16281 | nan | SO43659 | PO522145787 | 10-4020-000676 | 105041Vi84182 | NULL | 79B65321-39CA-4115-9CBA-8FE0903E... |
1 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 29672 | 43660 | 279 | 5 | 8 | 0 | 279 | 5 | 5 | 1294.2529 | 124.2483 | 38.8276 | 1457.3288 | 5 | 921 | 921 | 5618 | nan | SO43660 | PO18850127500 | 10-4020-000117 | 115213Vi29411 | NULL | 738DC42D-D03B-48A1-9822-F95A67EA... |
2 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 29734 | 43661 | 282 | 6 | 8 | 0 | 282 | 6 | 5 | 32726.4786 | 3153.7696 | 985.553 | 36865.8012 | 5 | 517 | 517 | 1346 | 4 | SO43661 | PO18473189620 | 10-4020-000442 | 85274Vi6854 | NULL | D91B9131-18A4-4A11-BC3A-90B6F53E... |
3 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 29994 | 43662 | 282 | 6 | 8 | 0 | 282 | 6 | 5 | 28832.5289 | 2775.1646 | 867.2389 | 32474.9324 | 5 | 482 | 482 | 10456 | 4 | SO43662 | PO18444174044 | 10-4020-000227 | 125295Vi53935 | NULL | 4A1ECFC0-CC3A-4740-B028-1C50BB48... |
4 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 29565 | 43663 | 276 | 4 | 8 | 0 | 276 | 4 | 5 | 419.4589 | 40.2681 | 12.5838 | 472.3108 | 5 | 1073 | 1073 | 4322 | nan | SO43663 | PO18009186470 | 10-4020-000510 | 45303Vi22691 | NULL | 9B1E7A40-6AE0-4AD3-811C-A6495185... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
31460 | 2014-06-29 22:00:00 | 2014-07-12 | 2014-07-07 | 2014-07-06 22:00:00 | 11981 | 75119 | nan | 1 | 8 | 1 | nan | 1 | 1 | 42.28 | 3.3824 | 1.057 | 46.7194 | 5 | 17649 | 17649 | 6761 | nan | SO75119 | NULL | 10-4030-011981 | 429826Vi35166 | NULL | 9382F1C9-383A-435F-9449-0EECEA21... |
31461 | 2014-06-29 22:00:00 | 2014-07-12 | 2014-07-07 | 2014-07-06 22:00:00 | 18749 | 75120 | nan | 6 | 8 | 1 | nan | 6 | 1 | 84.96 | 6.7968 | 2.124 | 93.8808 | 5 | 28374 | 28374 | 8925 | nan | SO75120 | NULL | 10-4030-018749 | 929849Vi46003 | NULL | AE6A4FCF-FF73-4CD4-AF2C-5993D00D... |
31462 | 2014-06-29 22:00:00 | 2014-07-12 | 2014-07-07 | 2014-07-06 22:00:00 | 15251 | 75121 | nan | 6 | 8 | 1 | nan | 6 | 1 | 74.98 | 5.9984 | 1.8745 | 82.8529 | 5 | 26553 | 26553 | 14220 | nan | SO75121 | NULL | 10-4030-015251 | 529864Vi73738 | NULL | D7395C0E-00CB-4BFA-A238-0D6A9F49... |
31463 | 2014-06-29 22:00:00 | 2014-07-12 | 2014-07-07 | 2014-07-06 22:00:00 | 15868 | 75122 | nan | 6 | 8 | 1 | nan | 6 | 1 | 30.97 | 2.4776 | 0.7743 | 34.2219 | 5 | 14616 | 14616 | 18719 | nan | SO75122 | NULL | 10-4030-015868 | 330022Vi97312 | NULL | 4221035A-4159-492F-AF40-4363A64F... |
31464 | 2014-06-29 22:00:00 | 2014-07-12 | 2014-07-07 | 2014-07-06 22:00:00 | 18759 | 75123 | nan | 6 | 8 | 1 | nan | 6 | 1 | 189.97 | 15.1976 | 4.7493 | 209.9169 | 5 | 14024 | 14024 | 10084 | nan | SO75123 | NULL | 10-4030-018759 | 230370Vi51970 | NULL | D54752FF-2B54-4BE5-95EA-3B72289C... |
31465 rows x 28 columns
memory usage: 8.34 MB
name: SalesOrderHeader
type: getml.DataFrame
We must also define customer churn. In this case, we define customer churn as a customer not making another purchase within 180 days of his or her last purchase.
Thus, the churn variable is defined as follows:
- 0, if another purchase by the same customer has been made within 180 days after
OrderDate
- 1, if no purchase by the same customer has been made within 180 days after
OrderDate
- NULL, if
max(OrderDate) - OrderDate <= 180 days
NULL targets can not be used in our analysis.
sales_order_header_pd = sales_order_header[["OrderDate", "CustomerID", "SalesOrderID"]].to_pandas()
repeat_purchases = sales_order_header_pd.merge(
sales_order_header_pd[["OrderDate", "CustomerID"]],
on="CustomerID",
how="left",
)
repeat_purchases = repeat_purchases[
repeat_purchases["OrderDate_y"] > repeat_purchases["OrderDate_x"]
]
repeat_purchases = repeat_purchases[
repeat_purchases["OrderDate_y"] - repeat_purchases["OrderDate_x"] > pd.Timedelta('180 days')
]
repeat_purchases = repeat_purchases.groupby("SalesOrderID", as_index=False).aggregate({"CustomerID": "max"})
repeat_purchase_ids = {sid: True for sid in repeat_purchases["SalesOrderID"]}
cut_off_date = max(sales_order_header_pd["OrderDate"]) - pd.Timedelta('180 days')
churn = np.asarray([
np.nan if order_date >= cut_off_date else 0 if order_id in repeat_purchase_ids else 1
for (order_date, order_id) in zip(sales_order_header_pd["OrderDate"], sales_order_header_pd["SalesOrderID"])
])
sales_order_header["churn"] = churn
sales_order_header = sales_order_header[~sales_order_header.churn.is_nan()].to_df("SalesOrderHeaderRefined")
sales_order_header.set_role("churn", getml.data.roles.target)
sales_order_header
/home/alex/.local/lib/python3.10/site-packages/getml/data/data_frame.py:463: FutureWarning: Series.__getitem__ treating keys as positions is deprecated. In a future version, integer keys will always be treated as labels (consistent with DataFrame behavior). To access a value by position, use `ser.iloc[pos]` if _is_numerical_type(temp_df.dtypes[0]):
name | OrderDate | DueDate | ShipDate | ModifiedDate | CustomerID | SalesOrderID | SalesPersonID | TerritoryID | churn | RevisionNumber | OnlineOrderFlag | SalesPersonIDCat | TerritoryIDCat | ShipMethodID | SubTotal | TaxAmt | Freight | TotalDue | Status | BillToAddressID | ShipToAddressID | CreditCardID | CurrencyRateID | SalesOrderNumber | PurchaseOrderNumber | AccountNumber | CreditCardApprovalCode | Comment | rowguid |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | time_stamp | time_stamp | time_stamp | join_key | join_key | join_key | join_key | target | categorical | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
unit | time stamp, comparison only | time stamp, comparison only | time stamp, comparison only | time stamp, comparison only | |||||||||||||||||||||||||
0 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 29825 | 43659 | 279 | 5 | 0 | 8 | 0 | 279 | 5 | 5 | 20565.6206 | 1971.5149 | 616.0984 | 23153.2339 | 5 | 985 | 985 | 16281 | nan | SO43659 | PO522145787 | 10-4020-000676 | 105041Vi84182 | NULL | 79B65321-39CA-4115-9CBA-8FE0903E... |
1 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 29672 | 43660 | 279 | 5 | 0 | 8 | 0 | 279 | 5 | 5 | 1294.2529 | 124.2483 | 38.8276 | 1457.3288 | 5 | 921 | 921 | 5618 | nan | SO43660 | PO18850127500 | 10-4020-000117 | 115213Vi29411 | NULL | 738DC42D-D03B-48A1-9822-F95A67EA... |
2 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 29734 | 43661 | 282 | 6 | 0 | 8 | 0 | 282 | 6 | 5 | 32726.4786 | 3153.7696 | 985.553 | 36865.8012 | 5 | 517 | 517 | 1346 | 4 | SO43661 | PO18473189620 | 10-4020-000442 | 85274Vi6854 | NULL | D91B9131-18A4-4A11-BC3A-90B6F53E... |
3 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 29994 | 43662 | 282 | 6 | 0 | 8 | 0 | 282 | 6 | 5 | 28832.5289 | 2775.1646 | 867.2389 | 32474.9324 | 5 | 482 | 482 | 10456 | 4 | SO43662 | PO18444174044 | 10-4020-000227 | 125295Vi53935 | NULL | 4A1ECFC0-CC3A-4740-B028-1C50BB48... |
4 | 2011-05-30 22:00:00 | 2011-06-12 | 2011-06-07 | 2011-06-06 22:00:00 | 29565 | 43663 | 276 | 4 | 0 | 8 | 0 | 276 | 4 | 5 | 419.4589 | 40.2681 | 12.5838 | 472.3108 | 5 | 1073 | 1073 | 4322 | nan | SO43663 | PO18009186470 | 10-4020-000510 | 45303Vi22691 | NULL | 9B1E7A40-6AE0-4AD3-811C-A6495185... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
19699 | 2013-12-30 23:00:00 | 2014-01-12 | 2014-01-07 | 2014-01-06 23:00:00 | 20826 | 63358 | nan | 7 | 1 | 8 | 1 | nan | 7 | 1 | 1173.96 | 93.9168 | 29.349 | 1297.2258 | 5 | 24387 | 24387 | 3239 | nan | SO63358 | NULL | 10-4030-020826 | 1142084Vi17039 | NULL | 41278FBB-3DD8-488B-AEA5-8BF4A6F1... |
19700 | 2013-12-30 23:00:00 | 2014-01-12 | 2014-01-07 | 2014-01-06 23:00:00 | 24114 | 63359 | nan | 10 | 1 | 8 | 1 | nan | 10 | 1 | 1179.47 | 94.3576 | 29.4868 | 1303.3144 | 5 | 29682 | 29682 | nan | 10770 | SO63359 | NULL | 10-4030-024114 | NULL | NULL | 749532D4-BFF7-4FEC-9F77-396C4A96... |
19701 | 2013-12-30 23:00:00 | 2014-01-12 | 2014-01-07 | 2014-01-06 23:00:00 | 11417 | 63360 | nan | 7 | 1 | 8 | 1 | nan | 7 | 1 | 548.98 | 43.9184 | 13.7245 | 606.6229 | 5 | 21465 | 21465 | 6582 | nan | SO63360 | NULL | 10-4030-011417 | 242387Vi34223 | NULL | F92FA2A3-73E1-4DD0-987C-99D1C87E... |
19702 | 2013-12-30 23:00:00 | 2014-01-12 | 2014-01-07 | 2014-01-06 23:00:00 | 18125 | 63361 | nan | 8 | 1 | 8 | 1 | nan | 8 | 1 | 2384.07 | 190.7256 | 59.6018 | 2634.3974 | 5 | 26562 | 26562 | 11994 | nan | SO63361 | NULL | 10-4030-018125 | 1242859Vi61993 | NULL | 3A3758BF-CDFA-4740-9104-87BD5A08... |
19703 | 2013-12-30 23:00:00 | 2014-01-12 | 2014-01-07 | 2014-01-06 23:00:00 | 15692 | 63362 | nan | 10 | 1 | 8 | 1 | nan | 10 | 1 | 2419.06 | 193.5248 | 60.4765 | 2673.0613 | 5 | 27090 | 27090 | 16083 | 10770 | SO63362 | NULL | 10-4030-015692 | 242864Vi83167 | NULL | 3151D31F-4020-41B2-87F3-B6732869... |
19704 rows x 29 columns
memory usage: 5.39 MB
name: SalesOrderHeaderRefined
type: getml.DataFrame
split = getml.data.split.random(train=0.8, test=0.2)
split
0 | train |
---|---|
1 | train |
2 | train |
3 | test |
4 | train |
... |
infinite number of rows
type: StringColumnView
container = getml.data.Container(population=sales_order_header, split=split)
container.add(
product=product,
sales_order_detail=sales_order_detail,
sales_order_header=sales_order_header,
sales_order_reason=sales_order_reason,
special_offer=special_offer,
store=store,
)
container
subset | name | rows | type | |
---|---|---|---|---|
0 | test | SalesOrderHeaderRefined | 3879 | View |
1 | train | SalesOrderHeaderRefined | 15825 | View |
alias | name | rows | type | |
---|---|---|---|---|
0 | product | Product | 504 | DataFrame |
1 | sales_order_detail | SalesOrderDetail | 121317 | DataFrame |
2 | sales_order_header | SalesOrderHeaderRefined | 19704 | DataFrame |
3 | sales_order_reason | SalesOrderHeaderSalesReason | 27647 | DataFrame |
4 | special_offer | SpecialOffer | 16 | DataFrame |
5 | store | Store | 701 | DataFrame |
2. Predictive modeling¶
We loaded the data and defined the roles and units. Next, we create a getML pipeline for relational learning.
2.1 Define relational model¶
dm = getml.data.DataModel(sales_order_header.to_placeholder("population"))
dm.add(getml.data.to_placeholder(
product=product,
sales_order_detail=sales_order_detail,
sales_order_header=sales_order_header,
sales_order_reason=sales_order_reason,
special_offer=special_offer,
store=store,
))
dm.population.join(
dm.sales_order_header,
on="CustomerID",
time_stamps="OrderDate",
lagged_targets=True,
horizon=getml.data.time.days(1),
)
dm.population.join(
dm.sales_order_detail,
on="SalesOrderID",
)
dm.population.join(
dm.sales_order_reason,
on="SalesOrderID",
)
dm.population.join(
dm.store,
on="SalesPersonID",
)
dm.sales_order_detail.join(
dm.product,
on="ProductID",
relationship=getml.data.relationship.many_to_one,
)
dm.sales_order_detail.join(
dm.special_offer,
on="SpecialOfferID",
relationship=getml.data.relationship.many_to_one,
)
dm
data frames | staging table | |
---|---|---|
0 | population | POPULATION__STAGING_TABLE_1 |
1 | sales_order_detail, product, special_offer | SALES_ORDER_DETAIL__STAGING_TABLE_2 |
2 | sales_order_header | SALES_ORDER_HEADER__STAGING_TABLE_3 |
3 | sales_order_reason | SALES_ORDER_REASON__STAGING_TABLE_4 |
4 | store | STORE__STAGING_TABLE_5 |
2.2 getML pipeline¶
Set-up the feature learner & predictor
fast_prop = getml.feature_learning.FastProp(
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
num_threads=1,
num_features=400,
)
predictor = getml.predictors.XGBoostClassifier(n_jobs=1)
Build the pipeline
pipe1 = getml.Pipeline(
tags=['fast_prop'],
data_model=dm,
feature_learners=[fast_prop],
predictors=[predictor],
include_categorical=True,
)
pipe1
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=[], include_categorical=True, loss_function='CrossEntropyLoss', peripheral=['product', 'sales_order_detail', 'sales_order_header', 'sales_order_reason', 'special_offer', 'store'], predictors=['XGBoostClassifier'], preprocessors=[], share_selected_features=0.5, tags=['fast_prop'])
2.3 Model training¶
pipe1.check(container.train)
Checking data model... Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Checking... 100% |██████████| [elapsed: 00:00, remaining: 00:00] The pipeline check generated 2 issues labeled INFO and 0 issues labeled WARNING.
type | label | message | |
---|---|---|---|
0 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and SALES_ORDER_REASON__STAGING_TABLE_4 over 'SalesOrderID' and 'SalesOrderID', there are no corresponding entries for 33.769352% of entries in 'SalesOrderID' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
1 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and STORE__STAGING_TABLE_5 over 'SalesPersonID' and 'SalesPersonID', there are no corresponding entries for 84.941548% of entries in 'SalesPersonID' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
pipe1.fit(container.train)
Checking data model... Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] The pipeline check generated 2 issues labeled INFO and 0 issues labeled WARNING. To see the issues in full, run .check() on the pipeline. Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] FastProp: Trying 563 features... 100% |██████████| [elapsed: 00:02, remaining: 00:00] FastProp: Building features... 100% |██████████| [elapsed: 00:01, remaining: 00:00] XGBoost: Training as predictor... 100% |██████████| [elapsed: 00:14, remaining: 00:00] Trained pipeline. Time taken: 0h:0m:19.952771
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=[], include_categorical=True, loss_function='CrossEntropyLoss', peripheral=['product', 'sales_order_detail', 'sales_order_header', 'sales_order_reason', 'special_offer', 'store'], predictors=['XGBoostClassifier'], preprocessors=[], share_selected_features=0.5, tags=['fast_prop', 'container-1l2C0q'])
2.4 Model evaluation¶
pipe1.score(container.test)
Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00] FastProp: Building features... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-08-07 16:26:10 | train | churn | 0.9152 | 0.9723 | 0.2187 |
1 | 2024-08-07 16:26:11 | test | churn | 0.9162 | 0.9699 | 0.2265 |
2.5 Features¶
The most important features look as follows:
pipe1.features.to_sql()[pipe1.features.sort(by="importances")[0].name]
DROP TABLE IF EXISTS "FEATURE_1_307";
CREATE TABLE "FEATURE_1_307" AS
SELECT AVG( t1."orderdate" - t2."t4__startdate" ) AS "feature_1_307",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "SALES_ORDER_DETAIL__STAGING_TABLE_2" t2
ON t1."salesorderid" = t2."salesorderid"
GROUP BY t1.rowid;
getml.engine.shutdown()
3. Conclusion¶
In this notebook, we successfully demonstrated the process of predicting customer churn for AdventureWorks, a fictional company that sells bicycles, using the getML library. The key steps we covered include:
Background and Data Preparation:
- Introduced AdventureWorks and the purpose of the analysis.
- Loaded and prepared the data from the AdventureWorks database, including various tables such as Product, SalesOrderDetail, SalesOrderHeader, and more.
Data Visualization and Preparation:
- Defined roles for the dataset columns to prepare them for modeling.
- Ensured the data was in the correct format for getML to process.
Predictive Modeling:
- Created a relational model using getML to capture the relationships between different tables.
- Built a getML pipeline for feature learning and prediction.
Model Training and Evaluation:
- Trained the model on the prepared data.
- Evaluated the model's performance to predict customer churn.
By leveraging getML's capabilities, we efficiently handled the complexities of relational data and built a robust classification model to predict customer churn. This approach can be extended to other customer databases and prediction tasks, providing valuable insights and accurate forecasts.
References¶
Motl, Jan, and Oliver Schulte. "The CTU prague relational learning repository." arXiv preprint arXiv:1511.03086 (2015).