getml.data.split
Splits data into training, testing, validation or other sets.
concat
concat(
    name: str, **kwargs: DataFrame
) -> Tuple[DataFrame, StringColumnView]
Concatenates several data frames into one and produces a split column that keeps track of their origin.
PARAMETER | DESCRIPTION |
---|---|
name | The name of the data frame you would like to create. TYPE: str |
kwargs | The data frames you would like to concatenate, each passed under the name by which it should appear in the split column. TYPE: DataFrame |
RETURNS | DESCRIPTION |
---|---|
Tuple[DataFrame, StringColumnView] | A tuple containing the concatenated data frame and the split column. |
Example
A common use case for this functionality is TimeSeries:
data_train = getml.DataFrame.from_pandas(
    datatraining_pandas, name='data_train')
data_validate = getml.DataFrame.from_pandas(
    datatest_pandas, name='data_validate')
data_test = getml.DataFrame.from_pandas(
    datatest2_pandas, name='data_test')
population, split = getml.data.split.concat(
    "population", train=data_train, validate=data_validate, test=data_test)
...
time_series = getml.data.TimeSeries(
    population=population, split=split)
my_pipeline.fit(time_series.train)
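
To make the relationship between the concatenated data frame and the split column concrete, here is a minimal sketch of the same idea in plain pandas. It is an analogy only, not getML's implementation: the helper concat_with_split is a hypothetical name, and it returns a pandas Series where getML returns a StringColumnView.

```python
from typing import Tuple

import pandas as pd


def concat_with_split(**frames: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
    """Hypothetical helper: stack the frames and record each row's origin."""
    if not frames:
        raise ValueError("Pass at least one data frame as a keyword argument.")
    # Stack all frames on top of each other with a fresh 0..n-1 index.
    combined = pd.concat(list(frames.values()), ignore_index=True)
    # One label per row: the keyword under which the row's frame was passed.
    labels = [name for name, df in frames.items() for _ in range(len(df))]
    split = pd.Series(labels, name="split", dtype="object")
    return combined, split


train = pd.DataFrame({"x": [1, 2, 3]})
test = pd.DataFrame({"x": [4, 5]})
population, split = concat_with_split(train=train, test=test)

print(split.tolist())  # ['train', 'train', 'train', 'test', 'test']
print(population[split == "train"]["x"].tolist())  # [1, 2, 3]
```

Because the split column has one entry per row of the concatenated frame, comparing it to a label yields a row filter that recovers the original subset.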
Source code in getml/data/split/concat.py
random
random(
    seed: int = 5849,
    train: float = 0.8,
    test: float = 0.2,
    validation: float = 0,
    **kwargs: float
) -> StringColumnView
Returns a StringColumnView that can be used to randomly divide data into training, testing, validation or other sets.
PARAMETER | DESCRIPTION |
---|---|
seed | Seed used for the random number generator. TYPE: int |
train | The share of random samples assigned to the training set. TYPE: float |
validation | The share of random samples assigned to the validation set. TYPE: float |
test | The share of random samples assigned to the test set. TYPE: float |
kwargs | Any other sets you would like to assign. You can name these sets whatever you want to (in our example, we called it 'other'). TYPE: float |
Example
split = getml.data.split.random(
    train=0.8, test=0.1, validation=0.05, other=0.05
)
train_set = data_frame[split=='train']
validation_set = data_frame[split=='validation']
test_set = data_frame[split=='test']
other_set = data_frame[split=='other']
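
As a rough illustration of how shares can translate into per-row labels, the sketch below draws one seeded uniform number per row with NumPy and maps it onto cumulative share boundaries. This is an assumption about the general technique, not getML's engine: random_split is a hypothetical helper, it will not reproduce the assignments getML makes for the same seed, and the requirement that shares sum to 1 is a choice made for this sketch.

```python
import numpy as np


def random_split(n_rows: int, seed: int = 5849, **shares: float) -> np.ndarray:
    """Hypothetical helper: assign each row a set name with the given shares."""
    if not shares:
        raise ValueError("Pass at least one share, e.g. train=0.8.")
    names = list(shares)
    # Upper boundary of each set on the unit interval, e.g. [0.8, 0.9, 0.95, 1.0].
    bounds = np.cumsum([shares[name] for name in names])
    if not np.isclose(bounds[-1], 1.0):
        raise ValueError("Shares are expected to sum to 1 in this sketch.")
    # One uniform draw in [0, 1) per row; the seed makes the split repeatable.
    draws = np.random.default_rng(seed).random(n_rows)
    idx = np.searchsorted(bounds, draws, side="right")
    # Guard against floating-point rounding in the cumulative sum.
    idx = np.minimum(idx, len(names) - 1)
    return np.array(names)[idx]


split = random_split(10_000, train=0.8, test=0.1, validation=0.05, other=0.05)
names, counts = np.unique(split, return_counts=True)
# Observed shares are close to, but not exactly, the requested ones.
print(dict(zip(names.tolist(), (counts / len(split)).tolist())))
```

Since every row is assigned independently, the realized set sizes only approximate the requested shares, and the approximation improves with more rows.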
Source code in getml/data/split/random.py
time
time(
    population: DataFrame,
    time_stamp: Union[str, FloatColumn, FloatColumnView],
    validation: Optional[
        Union[float, int, datetime64]
    ] = None,
    test: Optional[Union[float, int, datetime64]] = None,
    **kwargs: Union[float, int, datetime64]
) -> StringColumnView
Returns a StringColumnView that can be used to divide data into training, testing, validation or other sets.
The arguments are key=value pairs of names (key) and starting points (value). The starting point defines the left endpoint of the subset. Intervals are left-closed and right-open, i.e. \([\text{value}, \text{next value})\). The (unnamed) subset to the left of the first named starting point, i.e. \([0, \text{first value})\), is always considered to be the training set.
PARAMETER | DESCRIPTION |
---|---|
population | The population table you would like to split. TYPE: DataFrame |
time_stamp | The name of the time stamp column in the population table you want to use. Ideally, the role of said column would be time_stamp. TYPE: Union[str, FloatColumn, FloatColumnView] |
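validation | The start date of the validation set. TYPE: Optional[Union[float, int, datetime64]] |
test | The start date of the test set. TYPE: Optional[Union[float, int, datetime64]] |
kwargs | Any other sets you would like to assign. You can name these sets whatever you want to (in our example, we called it 'other'). TYPE: Union[float, int, datetime64] |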
Example
validation_begin = getml.data.time.datetime(2010, 1, 1)
test_begin = getml.data.time.datetime(2011, 1, 1)
other_begin = getml.data.time.datetime(2012, 1, 1)
split = getml.data.split.time(
    population=data_frame,
    time_stamp="ds",
    test=test_begin,
    validation=validation_begin,
    other=other_begin
)
# Contains all data before 2010-01-01 (not included)
train_set = data_frame[split=='train']
# Contains all data from 2010-01-01 (included) up to 2011-01-01 (not included)
validation_set = data_frame[split=='validation']
# Contains all data from 2011-01-01 (included) up to 2012-01-01 (not included)
test_set = data_frame[split=='test']
# Contains all data from 2012-01-01 (included) onward
other_set = data_frame[split=='other']
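
The left-closed, right-open interval rule described above can be sketched with NumPy alone. The helper time_split is hypothetical and is not how getML computes the split; it assumes that the named starting points are ordered by value and that everything before the earliest one is labeled 'train'.

```python
import numpy as np


def time_split(time_stamps, **starts) -> np.ndarray:
    """Hypothetical helper: label each time stamp by the interval it falls into."""
    if not starts:
        raise ValueError("Pass at least one named starting point.")
    # Order the named starting points chronologically (an assumption of this sketch).
    ordered = sorted(starts.items(), key=lambda item: item[1])
    # Index 0 is the unnamed interval before the first starting point.
    names = np.array(["train"] + [name for name, _ in ordered])
    bounds = np.array([start for _, start in ordered])
    # side="right" puts a value equal to a starting point into the interval
    # that begins there, which makes intervals left-closed and right-open.
    idx = np.searchsorted(bounds, np.asarray(time_stamps), side="right")
    return names[idx]


ts = np.array(
    ["2009-06-01", "2010-01-01", "2010-12-31",
     "2011-01-01", "2012-01-01", "2013-03-15"],
    dtype="datetime64[D]",
)
split = time_split(
    ts,
    validation=np.datetime64("2010-01-01"),
    test=np.datetime64("2011-01-01"),
    other=np.datetime64("2012-01-01"),
)
print(split.tolist())
# ['train', 'validation', 'validation', 'test', 'other', 'other']
```

Note how 2010-01-01 lands in 'validation' and 2011-01-01 in 'test': each starting point belongs to the set it opens, not to the one it closes.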
Source code in getml/data/split/time.py