Online Retail - Predicting order cancellations¶
In this tutorial, we demonstrate how getML can be applied in an e-commerce context. Using a dataset of about 400,000 orders, our goal is to predict whether an order will be cancelled.
Summary:
- Prediction type: Classification model
- Domain: E-commerce
- Prediction target: Whether an order will be cancelled
- Population size: 397,925
Background¶
The data set contains about 400,000 orders from a British online retailer. Each order consists of a product and the quantity ordered. Several orders can be grouped on a single invoice. The goal is to predict whether an order will be cancelled.
Because the company mainly sells to other businesses, the cancellation rate is relatively low, namely 1.83%.
The data set was originally collected for this study:
Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197-208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).
It has been downloaded from the UCI Machine Learning Repository:
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
Analysis¶
Let's get started with the analysis and set up your session:
%pip install -q "getml==1.5.0" "pyspark==3.5.2" "ipywidgets==8.1.5"
import os
from urllib import request
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
import getml
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
print(f"getML API version: {getml.__version__}\n")
getML API version: 1.5.0
getml.engine.launch(allow_remote_ips=True, token='token')
getml.engine.set_project('online_retail')
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912145332.log. Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Connected to project 'online_retail'.
RUN_SPARK = False  # set to True to run the Spark productionization steps in section 2.6
1. Loading data¶
1.1 Download from source¶
We begin by downloading the data from the source file:
fname = "online_retail.csv"
if not os.path.exists(fname):
fname, res = request.urlretrieve(
"https://static.getml.com/datasets/online_retail/" + fname,
fname
)
full_data_pandas = pd.read_csv(fname, sep="|")
1.2 Data preparation¶
The invoice dates are in a somewhat unusual format, so we need to rectify that.
def add_zero(string):
    # Left-pad single-digit date components with a zero.
    if len(string) == 1:
        return "0" + string
    return string

def format_date(string):
    # Split off the time, zero-pad the "/"-separated date components,
    # and rejoin them with "-".
    datetime = string.split(" ")
    assert len(datetime) == 2, "Expected date and time"
    date_components = datetime[0].split("/")
    assert len(date_components) == 3, "Expected three date components"
    date_components = [add_zero(x) for x in date_components]
    return "-".join(date_components) + " " + datetime[1]
full_data_pandas["InvoiceDate"] = [
format_date(string) for string in np.asarray(full_data_pandas["InvoiceDate"])
]
full_data_pandas
Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country | |
---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26 | 2.55 | 17850.0 | United Kingdom |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26 | 3.39 | 17850.0 | United Kingdom |
2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26 | 2.75 | 17850.0 | United Kingdom |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26 | 3.39 | 17850.0 | United Kingdom |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26 | 3.39 | 17850.0 | United Kingdom |
... | ... | ... | ... | ... | ... | ... | ... | ... |
541905 | 581587 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 2011-12-09 12:50 | 2.10 | 12680.0 | France |
541906 | 581587 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4 | 2011-12-09 12:50 | 4.15 | 12680.0 | France |
541907 | 581587 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4 | 2011-12-09 12:50 | 4.15 | 12680.0 | France |
541908 | 581587 | 22138 | BAKING SET 9 PIECE RETROSPOT | 3 | 2011-12-09 12:50 | 4.95 | 12680.0 | France |
541909 | 581587 | POST | POSTAGE | 1 | 2011-12-09 12:50 | 18.00 | 12680.0 | France |
541910 rows × 8 columns
In this data set, the targets aren't as clearly defined as we would like, so we have to define them ourselves.
def add_target(df):
    # Sort so that each customer's orders appear in chronological order.
    df = df.sort_values(by=["Customer ID", "InvoiceDate"])
    cancelled = np.zeros(df.shape[0])
    invoice = np.asarray(df["Invoice"])
    stock_code = np.asarray(df["StockCode"])
    customer_id = np.asarray(df["Customer ID"])
    for i in range(len(invoice)):
        # Skip rows that are themselves cancellations (invoice numbers
        # beginning with 'C') as well as the very last row.
        if (invoice[i][0] == 'C') or (i == len(invoice) - 1):
            continue
        # Scan the same customer's subsequent rows: a later cancellation
        # of the same product marks this order as cancelled; a later
        # regular order of the same product ends the search.
        j = i + 1
        while j < len(invoice) and customer_id[j] == customer_id[i]:
            if (invoice[j][0] == 'C') and (stock_code[i] == stock_code[j]):
                cancelled[i] = 1.0
                break
            if stock_code[i] == stock_code[j]:
                break
            j += 1
    df["cancelled"] = cancelled
    return df
Also, we want to remove any orders in the data set that are actually cancellations.
def remove_cancellations(df):
    # Cancellations are marked by invoice numbers starting with 'C'.
    invoice = np.asarray(df["Invoice"])
    is_order = [inv[0] != 'C' for inv in invoice]
    df = df[is_order]
    return df
full_data_pandas = add_target(full_data_pandas)
full_data_pandas = remove_cancellations(full_data_pandas)
Finally, there are some orders for which we do not have a customer ID. We want to remove those.
full_data_pandas = full_data_pandas[~np.isnan(full_data_pandas["Customer ID"])]
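As a quick sanity check, we can confirm the cancellation rate mentioned in the background section:

print(f"Cancellation rate: {full_data_pandas['cancelled'].mean():.2%}")  # roughly 1.83%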
Now we can upload the data to getML.
full_data = getml.data.DataFrame.from_pandas(full_data_pandas, "full_data")
full_data
name | Quantity | Price | Customer ID | cancelled | Invoice | StockCode | Description | InvoiceDate | Country |
---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string |
0 | 74215 | 1.04 | 12346 | 1 | 541431 | 23166 | MEDIUM CERAMIC TOP STORAGE JAR | 2011-01-18 10:01 | United Kingdom |
1 | 12 | 2.1 | 12347 | 0 | 537626 | 85116 | BLACK CANDELABRA T-LIGHT HOLDER | 2010-12-07 14:57 | Iceland |
2 | 4 | 4.25 | 12347 | 0 | 537626 | 22375 | AIRLINE BAG VINTAGE JET SET BROW... | 2010-12-07 14:57 | Iceland |
3 | 12 | 3.25 | 12347 | 0 | 537626 | 71477 | COLOUR GLASS. STAR T-LIGHT HOLDE... | 2010-12-07 14:57 | Iceland |
4 | 36 | 0.65 | 12347 | 0 | 537626 | 22492 | MINI PAINT SET VINTAGE | 2010-12-07 14:57 | Iceland |
... | ... | ... | ... | ... | ... | ... | ... | ... | |
397920 | 12 | 0.42 | 18287 | 0 | 570715 | 22419 | LIPSTICK PEN RED | 2011-10-12 10:23 | United Kingdom |
397921 | 12 | 2.1 | 18287 | 0 | 570715 | 22866 | HAND WARMER SCOTTY DOG DESIGN | 2011-10-12 10:23 | United Kingdom |
397922 | 36 | 1.25 | 18287 | 0 | 573167 | 23264 | SET OF 3 WOODEN SLEIGH DECORATIO... | 2011-10-28 09:29 | United Kingdom |
397923 | 48 | 0.39 | 18287 | 0 | 573167 | 21824 | PAINTED METAL STAR WITH HOLLY BE... | 2011-10-28 09:29 | United Kingdom |
397924 | 24 | 0.29 | 18287 | 0 | 573167 | 21014 | SWISS CHALET TREE DECORATION | 2011-10-28 09:29 | United Kingdom |
397925 rows x 9 columns
memory usage: 57.28 MB
name: full_data
type: getml.DataFrame
1.3 Prepare data for getML¶
getML requires that we define roles for each of the columns.
full_data.set_role("InvoiceDate", getml.data.roles.time_stamp, time_formats=['%Y-%m-%d %H:%M'])
full_data.set_role(["Customer ID", "Invoice"], getml.data.roles.join_key)
full_data.set_role(["cancelled"], getml.data.roles.target)
full_data.set_role(["Quantity", "Price"], getml.data.roles.numerical)
full_data.set_role("Country", getml.data.roles.categorical)
full_data.set_role("Description", getml.data.roles.text)
The StockCode is a 5-digit code that uniquely identifies a product. It is hierarchical, meaning that each digit carries meaning. We want to make use of that, so we assign a unit to the stock code, which we can reference in our preprocessors.
full_data.set_unit("StockCode", "code")
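To see what the substring preprocessors defined in section 2.2 will extract, consider plain Python slicing on a sample code from the table above:

code = "85123A"
print(code[:1], code[:2], code[:3])  # 8 85 851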
split = getml.data.split.random(train=0.7, validation=0.15, test=0.15)
split
0 | train |
---|---|
1 | validation |
2 | train |
3 | validation |
4 | validation |
... |
infinite number of rows
type: StringColumnView
2. Predictive modeling¶
We loaded the data and defined the roles and units. Next, we create a getML pipeline for relational learning.
2.1 Define relational model¶
To get started with relational learning, we need to specify the data model.
In our case, there are two joins we are interested in:
- We want to look at all of the other orders on the same invoice.
- We want to check how often a given customer has cancelled orders in the past. Here, we limit ourselves to the last 90 days. To avoid data leakage, we set a horizon of one day.
star_schema = getml.data.StarSchema(
population=full_data,
alias="population",
split=split,
)
star_schema.join(
full_data.drop("Description"),
alias="full_data",
on='Invoice',
)
star_schema.join(
full_data.drop("Description"),
alias="full_data",
on='Customer ID',
time_stamps='InvoiceDate',
horizon=getml.data.time.days(1),
memory=getml.data.time.days(90),
lagged_targets=True,
)
star_schema
data frames | staging table | |
---|---|---|
0 | population | POPULATION__STAGING_TABLE_1 |
1 | full_data | FULL_DATA__STAGING_TABLE_2 |
subset | name | rows | type | |
---|---|---|---|---|
0 | test | full_data | 60013 | View |
1 | train | full_data | 278171 | View |
2 | validation | full_data | 59741 | View |
name | rows | type | |
---|---|---|---|
0 | full_data | 397925 | View |
2.2 getML pipeline¶
Set up the feature learner & predictor

We have mentioned that the StockCode is a hierarchical code. To make use of that fact, we use getML's Substring preprocessor to extract the first digit, the first two digits, and the first three digits. Since we have assigned the unit code to the StockCode, the preprocessors know which column they should be applied to.
substr1 = getml.preprocessors.Substring(0, 1, "code")
substr2 = getml.preprocessors.Substring(0, 2, "code")
substr3 = getml.preprocessors.Substring(0, 3, "code")
mapping = getml.preprocessors.Mapping()
text_field_splitter = getml.preprocessors.TextFieldSplitter()
fast_prop = getml.feature_learning.FastProp(
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
num_threads=1,
sampling_factor=0.1,
)
feature_selector = getml.predictors.XGBoostClassifier()
predictor = getml.predictors.XGBoostClassifier()
Build the pipeline
pipe = getml.pipeline.Pipeline(
tags=['fast_prop'],
data_model=star_schema.data_model,
preprocessors=[substr1, substr2, substr3, mapping, text_field_splitter],
feature_learners=[fast_prop],
feature_selectors=[feature_selector],
predictors=[predictor],
share_selected_features=0.2,
)
pipe
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=['XGBoostClassifier'], include_categorical=False, loss_function='CrossEntropyLoss', peripheral=['full_data'], predictors=['XGBoostClassifier'], preprocessors=['Substring', 'Substring', 'Substring', 'Mapping', 'TextFieldSplitter'], share_selected_features=0.2, tags=['fast_prop'])
2.3 Model training¶
pipe.check(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:13 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
pipe.fit(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01 Indexing text fields... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Trying 206 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:06 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:54 XGBoost: Training as feature selector... ━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 01:40 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:21
Trained pipeline.
Time taken: 0:03:05.120829.
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=['XGBoostClassifier'], include_categorical=False, loss_function='CrossEntropyLoss', peripheral=['full_data'], predictors=['XGBoostClassifier'], preprocessors=['Substring', 'Substring', 'Substring', 'Mapping', 'TextFieldSplitter'], share_selected_features=0.2, tags=['fast_prop', 'container-TWm7IL'])
2.4 Model evaluation¶
pipe.score(star_schema.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 11:51:44 | train | cancelled | 0.9825 | 0.8446 | 0.0736 |
1 | 2024-09-12 11:51:47 | test | cancelled | 0.9825 | 0.8119 | 0.07529 |
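Beyond these aggregate scores, we can retrieve the predicted cancellation probabilities themselves; a minimal sketch (for a single-target classification pipeline, predict returns a numpy array with one probability per order):

probabilities = pipe.predict(star_schema.test)
print(probabilities[:5])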
2.5 Features¶
The most important feature looks as follows:
pipe.features.to_sql()[pipe.features.sort(by="importances")[0].name]
DROP TABLE IF EXISTS "FEATURE_1_194";
CREATE TABLE "FEATURE_1_194" AS
SELECT AVG( t2."description__mapping_3_target_1_avg" ) AS "feature_1_194",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "POPULATION__STAGING_TABLE_1__DESCRIPTION" t2
ON t1."rowid" = t2."rownum"
GROUP BY t1.rowid;
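Using the same sort call, we can also list, say, the ten most important features along with their importance values (a small sketch; it assumes the Feature objects expose name and importance attributes):

for feature in pipe.features.sort(by="importances")[:10]:
    print(feature.name, feature.importance)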
2.6 Productionization¶
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code.
pipe.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("online_retail_spark")
if RUN_SPARK:
spark = SparkSession.builder.appName(
"online_retail"
).config(
"spark.driver.maxResultSize","5g"
).config(
"spark.driver.memory", "5g"
).config(
"spark.executor.memory", "5g"
).config(
"spark.sql.execution.arrow.pyspark.enabled", "true"
).config(
"spark.sql.session.timeZone", "UTC"
).enableHiveSupport().getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
if RUN_SPARK:
population_spark = star_schema.train.population.to_pyspark(spark, name="population")
peripheral_spark = star_schema.full_data.to_pyspark(spark, name="full_data")
if RUN_SPARK:
getml.spark.execute(spark, "online_retail_spark")
The resulting features are stored in a table called FEATURES. Here is how you can retrieve them:
if RUN_SPARK:
spark.sql("SELECT * FROM `FEATURES` LIMIT 20").toPandas()
getml.engine.shutdown()
3. Conclusion¶
In this notebook, we have demonstrated how getML can be applied in an e-commerce setting. Starting from a single table of raw orders, we defined a relational data model, let FastProp learn features automatically, and reached a test AUC of about 0.81 when predicting order cancellations.