project

Consumer 1/3 (nb intro)

EXCERPT (BLOG ITEM + OG, TWITTER)

Consumer expenditure categorization

Why relational learning matters

This example demonstrates how powerful a real relational learning algorithm can be. Based on a public-domain dataset on consumer behavior, we use a propostionalization algorithm to predict whether purchases were made as a gift. We show that with relational learning, we can get an AUC of over 90%. The generated features would have been impossible to build by hand or by using brute-force approaches.

Summary:

  • Prediction type: Classification model
  • Domain: Retail
  • Prediction target: If a purchase is a gift
  • Source data: Relational data set, 4 tables
  • Population size: 398.895

Author: Patrick Urbanke

Background

Relational learning is one of the most underappreciated fields of machine learning. Even though relational learning is very relevant to many real world data science projects, many data scientists don't even know what relational learning is.

There are many subdomains of relational learning, but the most important one is extracting features from relational data: Most business data is relational, meaning that it is spread out over several relational tables. However, most machine learning algorithms require that the data be presented in the form of a single flat table. So we need to extract features from our relational data. Some people also call this data wrangling.

Most data scientists we know extract features from relational data manually or by using crude, brute-force approaches (randomly generate thousands of features and then do a feature selection). This is very time-consuming and does not produce good features.

The challenge

The Consumer Expenditure Data Set is a public domain data set provided by the American Bureau of Labor Statistics (https://www.bls.gov/cex/pumd.htm). It includes the diary entries, where American consumers are asked to keep diaries of the products they have purchased each month.

These consumer goods are categorized using a six-digit classification system the UCC. This system is hierarchical, meaning that every digit represents an increasingly granular category.

For instance, all UCC codes beginning with ‘200’ represent beverages. UCC codes beginning with ‘20011’ represents beer and ‘200111’ represents ‘beer and ale’ and ‘200112’ represents ‘nonalcoholic beer’ (https://www.bls.gov/cex/pumd/ce_pumd_interview_diary_dictionary.xlsx).

The diaries also contain a flag that indicates whether the product was purchased as a gift. The challenge is to predict that flag using other information in the diary entries.

This can be done based on the following considerations:

  1. Some items are less likely to be purchased as gifts than others (for instance, it is unlikely that toilet paper is ever purchased as a gift).
  2. Items that diverge from the usual consumption patterns are more likely to be gifts.

In total, there are three tables which we find interesting:

  1. EXPD, which contains information on the consumer expenditures, including the target variable GIFT.
  2. FMLD, which contains socio-demographic information on the households.
  3. MEMD, which contains socio-demographic information on each member of the households.

Related code example

Notebook:
Open in nbviewer
Open in mybinder