getml.data.TimeSeries
TimeSeries(
population: Union[DataFrame, View],
time_stamps: str,
alias: Optional[str] = None,
peripheral: Optional[
Dict[str, Union[DataFrame, View]]
] = None,
split: Optional[
Union[StringColumn, StringColumnView]
] = None,
deep_copy: Optional[bool] = False,
on: OnType = None,
memory: Optional[float] = None,
horizon: Optional[float] = None,
lagged_targets: bool = False,
upper_time_stamp: Optional[str] = None,
)
Bases: StarSchema
A TimeSeries is a simplifying abstraction for machine learning problems on time series data.
It unifies Container and DataModel, thus abstracting away the need to differentiate between the concrete data and the abstract data model.
It also abstracts away the need for self joins.
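To make the self-join idea concrete, here is a minimal pure-Python sketch. It is not getML's implementation: the helper `self_join` and the sample rows are hypothetical, and the convention that a row may see every row at or before its own time stamp is an assumption made for illustration.

```python
from typing import Dict, List, Tuple

Row = Dict[str, float]


def self_join(rows: List[Row], time_stamp: str) -> List[Tuple[Row, List[Row]]]:
    """Pair each row with all rows of the same table that are not later in time."""
    joined = []
    for current in rows:
        # Assumed convention: a row sees itself and everything before it.
        history = [r for r in rows if r[time_stamp] <= current[time_stamp]]
        joined.append((current, history))
    return joined


# Hypothetical sample data: one table, ordered by "rowid".
rows = [
    {"rowid": 0, "value": 1.0},
    {"rowid": 1, "value": 3.0},
    {"rowid": 2, "value": 5.0},
]

# One hand-written feature per row: the mean of all values seen so far.
features = [
    sum(r["value"] for r in history) / len(history)
    for _, history in self_join(rows, "rowid")
]
print(features)
```

Writing such joins by hand for every feature is the bookkeeping that a TimeSeries is meant to take off your hands.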
ATTRIBUTE | DESCRIPTION |
---|---|
time_stamps | The time stamps used to limit the self-join. |
population | The population table defines the statistical population of the machine learning problem and contains the target variables. |
alias | The alias to be used for the population table. If it isn't set, 'population' will be used as the alias. To explicitly set an alias for the peripheral table, use |
peripheral | The peripheral tables are joined onto population or other peripheral tables. Note that you can also pass them using |
split | Contains information on how you want to split population into different |
deep_copy | Whether you want to create deep copies of your tables. |
on | The join keys to use. If none is passed, then everything will be joined to everything else. |
memory | The difference between the time stamps until data is 'forgotten'. Limiting your joins using memory can significantly speed up training time. Provide the value in seconds, alternatively use the convenience functions from |
horizon | The prediction horizon to apply to this join. Provide the value in seconds, alternatively use the convenience functions from |
lagged_targets | Whether you want to allow lagged targets. If this is set to True, you must also pass a positive, non-zero horizon. |
upper_time_stamp | Name of a time stamp in right_df that serves as an upper limit on the join. |
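The interplay of memory, horizon and lagged_targets can be sketched in plain Python. This is an illustration under stated assumptions, not getML's code: the helpers `in_window` and `check_lagged_targets` are hypothetical, and the boundary convention (inclusive lower bound, exclusive upper bound) is assumed rather than taken from the library.

```python
from typing import List, Optional


def in_window(
    t_population: float,
    t_peripheral: float,
    horizon: float = 0.0,
    memory: Optional[float] = None,
) -> bool:
    """Return True if the peripheral row may be joined to the population row.

    Assumed convention: a row becomes visible `horizon` seconds after its
    time stamp and is forgotten `memory` seconds after that.
    """
    visible_from = t_peripheral + horizon
    if t_population < visible_from:
        return False  # still inside the prediction horizon
    if memory is not None and t_population >= visible_from + memory:
        return False  # already forgotten
    return True


def check_lagged_targets(lagged_targets: bool, horizon: Optional[float]) -> None:
    """Mirror the documented rule: lagged targets need a positive horizon."""
    if lagged_targets and (horizon is None or horizon <= 0.0):
        raise ValueError("lagged_targets=True requires a positive, non-zero horizon.")


stamps: List[float] = [0.0, 10.0, 20.0, 30.0, 40.0]

# Rows that the row at t=40 may use with a 10 s horizon and a 30 s memory.
eligible = [ts for ts in stamps if in_window(40.0, ts, horizon=10.0, memory=30.0)]
print(eligible)
```

Under these assumptions, the row at t=0 has been forgotten and the row at t=40 is still hidden by the horizon, so only the three rows in between remain. A smaller memory shrinks this window, which is why it can reduce training time.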
Example
import getml

# All rows before row 10500 will be used for training.
split = getml.data.split.time(data_all, "rowid", test=10500)
time_series = getml.data.TimeSeries(
population=data_all,
time_stamps="rowid",
split=split,
lagged_targets=False,
memory=30,
)
pipe = getml.Pipeline(
data_model=time_series.data_model,
feature_learners=[...],
predictors=...
)
pipe.check(time_series.train)
pipe.fit(time_series.train)
pipe.score(time_series.test)
# To generate predictions on new data,
# it is sufficient to use a Container.
# You don't have to recreate the entire
# TimeSeries, because the abstract data model
# is stored in the pipeline.
container = getml.data.Container(
population=population_new,
)
# Add the data as a peripheral table, for the
# self-join.
container.add(population=population_new)
predictions = pipe.predict(container.full)
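The time-based split at the top of the example can be pictured as labelling each row by comparing its time stamp against a cut-off. The sketch below is a pure-Python stand-in with a hypothetical helper `split_by_time`. It is not the implementation of getml.data.split.time, and the sample row ids are made up.

```python
from typing import List


def split_by_time(time_stamps: List[float], test: float) -> List[str]:
    """Label rows before `test` as 'train' and all remaining rows as 'test'."""
    return ["train" if ts < test else "test" for ts in time_stamps]


# Hypothetical row ids around the cut-off used in the example.
rowids = [10498, 10499, 10500, 10501]
labels = split_by_time(rowids, test=10500)
print(labels)
```

Splitting on the time stamp rather than at random keeps every training row earlier than every test row, which matters when features are built from past values.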
Source code in getml/data/time_series.py