The relational nature of multivariate time series and external regressors

Automate Machine Learning on time series with getML

The relational nature of time series

Most of the material and information provided about getML is focusing on relational data. But what if one has to deal with time series? Will the software be applicable and of any help?

Short answer: Yes, it is applicable and suites your use case too! Regardless of whether you are using getML as an all-in-one solution or only deploy its generated features into your custom analysis workflow.

For those interested in the long answer the remainder of this document will first explain why a time series is indeed a subset of relational data. Then, we will have a look at how to make getML support your time series and how the most important aspects and operations, like causality and time windows, are achieved internally and within relation databases in general.

Time series as a subset of relational data

Let's assume we do have a set of files containing measurements of one or more quantities generated in their individual context, like daily temperature measurements at different observation stations. Classically we would represent the data in a table like this:

time_stamp temperature
01.01.2019 2.3
02.01.2019 1.8
03.01.2019 -0.4

Being a table already, we can insert the data into a relational database system as is. But we can do more. Way more.

By adding an additional column specifying the particular measurement station and appending the data of all other files - which could very well be of different length - to our table, we can integrate all data at hand into a single object. This enables us to harness the power of modern database technology and allows our analysis to scale to vast amounts of data as well as to be both more flexible and powerful. The individual series can be easily be retrieved by filtering a particular station.

time_stamp temperature station
01.01.2019 2.3 Berlin
02.01.2019 1.8 Berlin
03.01.2019 -0.4 Berlin
01.01.2019 3.8 Paris
02.01.2019 4.2 Paris
03.01.2019 1.2 Paris
01.01.2019 1.3 Brussels
02.01.2019 1.2 Brussels
03.01.2019 1.4 Brussels

But we still have just one table. Using getML you can easily add further information using additional tables in order to produce better features and more accurate predictions. Be it metadata about the measurement stations, costumer profiles, or third-party measurements. You name it! Imagine it as a many-to-one AI connector retrieving data from various sources and handing only the most promising combinations to your well-known machine learning models or time series analysis tools.

Handling time series with getML

getML fully respects your time stamps and makes sure causality and temporal ordering won't be violated while generating features. The only thing you have to do is to tell the engine which column should be handled as time stamps using the time_stamps argument while constructing a DataFrame and to add its basic time unit. That's all!

Handling time series in relational databases

In case you are wondering how this can be achieved internally or in relation database systems in general, the key idea is the so-called self-join. It allows a table to reference to itself. Let's have a look at our example above to understand what's going on. Disclaimer: you will find some SQL code in this section. This is only for illustration purposes. You won't need to write a single line yourself since getML will bear this burden for you!

Let's start with causality: Under no circumstances any information about future events should be leaked by the generated features. Therefore, regardless which time stamp to look at, only events of present and past are allowed to be aggregated. This will be illustrated with the following lines of code calculating the mean temperature at a certain point in time at a particular measurement station.

SELECT t1.time_stamp, AVG( t1.temperature ), t1.station
FROM tableAll t1
LEFT JOIN tableAll t2
ON t1.station = t2.station
WHERE t2.time_stamp <= t1.time_stamp

We'll not cover every single detail of the SQL syntax but literally this statement says: give me the time stamps, station ids, and the average temperatures of our table tableAll over all temperatures at the particular station having a time stamp equal or smaller than the current one. t1 and t2 are used as aliases referencing the same table. It's a little bit much, so let's have a look at the individual lines in our table. The first line deals with Berlin at the 01.01.2019. Since there are no time stamps prior to this date and measurements done at other sites do not match our ON condition, the average will be the measured temperature itself. In the second row there are two time stamps matching the WHERE condition and yielding an average of (2.3+1.8)/2 = 2.05. The result will look like this.

time_stamp temperature station
01.01.2019 2.3 Berlin
02.01.2019 2.05 Berlin
03.01.2019 1.233 Berlin
01.01.2019 3.8 Paris
02.01.2019 4 Paris
03.01.2019 3.067 Paris
01.01.2019 1.3 Brussels
02.01.2019 1.25 Brussels
03.01.2019 1.3 Brussels

Performing an average over a time window, e.g. just taking the last week of our daily data set into account, is a little bit more complicated (in terms of syntax).

SELECT t1.time_stamp, AVG( t1.temperature ), t1.station
FROM tableAll t1
LEFT JOIN tableAll t2
ON t1.station = t2.station
WHERE t2.time_stamp <= t1.time_stamp AND t2.time_stamp + 7 > t1.time_stamp

Here all rows considered in the average must, again, have a time stamp equal or smaller than the current one. In addition the time stamp must not older than 7 days with respect to the former one.

But remember, all of this will be done automatically by getML so you won't have to care about a single line of SQL (if you not intend to).