getml.data.roles
categorical
module-attribute
categorical: Final[Categorical] = 'categorical'
Marks categorical columns.
This role tells the getML Engine to include the associated StringColumn during feature learning.
It should be used for all data with no inherent ordering, even if the categories are encoded as integers instead of strings in your data set.
join_key
module-attribute
Marks join keys.
This role is required to establish a relation between two Placeholder objects (the abstract representation of a DataFrame) by using the join method. Please refer to the Data Model chapter for details.
Columns with this role are allowed to contain NULL values. Beware, however, that rows with NULL in their join keys will not be matched to anything, not even to NULL in other join keys.
Columns of this role will not be handled by the feature learning algorithm.
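The NULL behavior described above can be illustrated with a minimal pure-Python sketch. It does not call getML; the function `join_on_key` and the sample rows are hypothetical, and `None` stands in for NULL.

```python
def join_on_key(left, right):
    """Inner-join two lists of (key, value) rows; None plays the role of NULL."""
    matches = []
    for left_key, left_value in left:
        if left_key is None:
            # A NULL join key never matches anything, not even another NULL.
            continue
        for right_key, right_value in right:
            if right_key is not None and left_key == right_key:
                matches.append((left_key, left_value, right_value))
    return matches


# Hypothetical sample data: both tables contain one row with a NULL key.
population = [("a", 1), (None, 2), ("b", 3)]
peripheral = [("a", 10), (None, 20), ("a", 30)]

joined = join_on_key(population, peripheral)
print(joined)
```

Only the rows keyed `"a"` are paired; the two NULL-keyed rows are dropped rather than matched to each other.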
numerical
module-attribute
Marks numerical columns.
This role tells the getML Engine to include the associated FloatColumn during feature learning.
It should be used for all data with an inherent ordering, regardless of whether it is sampled from a continuous quantity, like passed time or the total amount of rainfall, or a discrete one, like the number of sugary mulberries one has eaten since lunch.
target
module-attribute
Marks the column(s) we would like to predict.
The associated columns contain the variables we want to predict. They are not used by the feature learning algorithm unless we explicitly tell it to do so (refer to lagged_target in join). However, they are such an important part of the analysis that the population table is required to contain at least one of them (refer to Data Model Tables).
The content of the target columns needs to be numerical. For classification problems, target variables can only assume the values 0 or 1. Target variables can never be NULL.
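As a rough illustration of these constraints, the following sketch checks a list of target values before it is handed to getML. The helper `check_targets` is hypothetical and not part of the getML API; treating both `None` and NaN as NULL is an assumption made for this example.

```python
import math


def check_targets(values, classification=False):
    """Return a list of problems found in a target column (empty if none)."""
    problems = []
    for index, value in enumerate(values):
        # Assumption: None and NaN both represent NULL.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            problems.append(f"row {index}: target is NULL")
        elif not isinstance(value, (int, float)):
            problems.append(f"row {index}: {value!r} is not numerical")
        elif classification and value not in (0, 1):
            problems.append(f"row {index}: {value!r} is not 0 or 1")
    return problems


print(check_targets([0, 1, 1], classification=True))
print(check_targets([0, 2, None], classification=True))
```

Running a check like this before assigning the role makes violations visible row by row.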
text
module-attribute
Marks text columns.
This role tells the getML Engine to include the associated StringColumn during feature learning.
It should be used for all data with no inherent ordering. Unlike categorical columns, text columns cannot be used as a whole. Instead, the feature learners have to apply basic text mining techniques before they are able to use them.
time_stamp
module-attribute
Marks time stamps.
This role is used to prevent data leaks. When you join one table onto another, you usually want to make sure that no data from the future is used. Time stamps can be used to limit your joins.
In addition, the feature learning algorithm can aggregate time stamps or use them for conditions. However, they will not be compared to fixed values unless you explicitly change their units. This means that conditions like these are not possible by default:
WHERE time_stamp > some_fixed_date
WHERE time_stamp1 - time_stamp2 > some_value
This is because it is unlikely that comparing time stamps to a fixed date performs well out-of-sample.
When assigning the role time stamp to a column that is currently a StringColumn, you need to specify the format of this string. You can do so by using the time_formats argument of set_role. You can pass a list of time formats, which are tried in turn to interpret the input strings. Possible format options are:
- %w - abbreviated weekday (Mon, Tue, ...)
- %W - full weekday (Monday, Tuesday, ...)
- %b - abbreviated month (Jan, Feb, ...)
- %B - full month (January, February, ...)
- %d - zero-padded day of month (01 .. 31)
- %e - day of month (1 .. 31)
- %f - space-padded day of month ( 1 .. 31)
- %m - zero-padded month (01 .. 12)
- %n - month (1 .. 12)
- %o - space-padded month ( 1 .. 12)
- %y - year without century (70)
- %Y - year with century (1970)
- %H - hour (00 .. 23)
- %h - hour (00 .. 12)
- %a - am/pm
- %A - AM/PM
- %M - minute (00 .. 59)
- %S - second (00 .. 59)
- %s - seconds and microseconds (equivalent to %S.%F)
- %i - millisecond (000 .. 999)
- %c - centisecond (0 .. 9)
- %F - fractional seconds/microseconds (000000 - 999999)
- %z - time zone differential in ISO 8601 format (Z or +NN.NN)
- %Z - time zone differential in RFC format (GMT or +NNNN)
- %% - percent sign
If none of the formats works, the getML Engine will try to interpret the time stamps as numerical values. If this fails, the time stamp will be set to NULL.
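The fallback order described above (listed formats first, then a numerical interpretation, then NULL) can be sketched in plain Python. This is not the getML Engine's implementation: it uses Python's own `datetime.strptime` directives, which differ from the format options listed above (for example, Python has no `%n` or `%e` for parsing), and it assumes the strings are in UTC. The function name `parse_time_stamp` is hypothetical.

```python
from datetime import datetime, timezone


def parse_time_stamp(raw, time_formats):
    """Try each format in order, then a plain number, else return None (NULL)."""
    for fmt in time_formats:
        try:
            # Assumption: input strings without a time zone are interpreted as UTC.
            parsed = datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
            return parsed.timestamp()
        except ValueError:
            continue  # this format did not fit, try the next one
    try:
        return float(raw)  # fall back to a numerical interpretation
    except ValueError:
        return None  # nothing worked, so the value becomes NULL


# Python directives, not getML ones: '%m|%d|%y' stands in for '%n|%e|%y'.
formats = ["%Y-%m-%d", "%m|%d|%y"]

for raw in ["1971-01-01", "1|2|71", "86400", "not a date"]:
    print(raw, "->", parse_time_stamp(raw, formats))
```

The order of the list matters in this sketch: the first format that parses a string wins, so more specific formats are best listed first.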
Example
import getml

# date1 holds numerical time stamps; date2 and date3 hold strings in two different formats.
data_df = dict(
    date1=[getml.data.time.days(365), getml.data.time.days(366), getml.data.time.days(367)],
    date2=['1971-01-01', '1971-01-02', '1971-01-03'],
    date3=['1|1|71', '1|2|71', '1|3|71'],
)
df = getml.DataFrame.from_dict(data_df, name='dates')
df.set_role(['date1', 'date2', 'date3'], getml.data.roles.time_stamp, time_formats=['%Y-%m-%d', '%n|%e|%y'])
df
| date1                       | date2                       | date3                       |
| time stamp                  | time stamp                  | time stamp                  |
-------------------------------------------------------------------------------------------
| 1971-01-01T00:00:00.000000Z | 1971-01-01T00:00:00.000000Z | 1971-01-01T00:00:00.000000Z |
| 1971-01-02T00:00:00.000000Z | 1971-01-02T00:00:00.000000Z | 1971-01-02T00:00:00.000000Z |
| 1971-01-03T00:00:00.000000Z | 1971-01-03T00:00:00.000000Z | 1971-01-03T00:00:00.000000Z |
Note
getML time stamps are actually floats expressing the number of seconds since the UNIX epoch (1970-01-01T00:00:00).
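To make the float representation concrete, the following standard-library sketch converts a number of seconds since the epoch into the ISO-style string shown in the table above. It does not use getML; the helper `to_iso` is hypothetical, and UTC is assumed because the displayed values end in `Z`.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 24 * 60 * 60


def to_iso(seconds):
    """Render seconds since 1970-01-01T00:00:00 (UTC assumed) as an ISO-style string."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%S.%fZ")


# 365 days after the epoch is 1971-01-01, because 1970 was not a leap year.
print(to_iso(365 * SECONDS_PER_DAY))
```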
unused_float
module-attribute
unused_float: Final[UnusedFloat] = 'unused_float'
Marks a FloatColumn as unused.
The associated column will be used neither in the data model nor during feature learning or prediction.
unused_string
module-attribute
unused_string: Final[UnusedString] = 'unused_string'
Marks a StringColumn as unused.
The associated column will be used neither in the data model nor during feature learning or prediction.
types
CategoricalLike
module-attribute
CategoricalLike = Literal[
    Categorical, JoinKey, Text, UnusedString
]
Role
module-attribute
Role = Literal[
    Categorical,
    JoinKey,
    Numerical,
    Target,
    Text,
    TimeStamp,
    UnusedFloat,
    UnusedString,
]
sets
categorical
module-attribute
categorical: FrozenSet[CategoricalLike] = frozenset(
    {categorical, join_key, text, unused_string}
)
Set of roles that are interpreted as categorical.
numerical
module-attribute
numerical: FrozenSet[NumericalLike] = frozenset(
    {numerical, target, time_stamp, unused_float}
)
Set of roles that are interpreted as numerical.
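One way to picture how these two sets can be used is a simple membership check. The sketch below mirrors the sets with plain strings instead of importing getML. Only the values 'categorical', 'unused_float' and 'unused_string' appear in the signatures above; the remaining string values are assumed to follow the same naming pattern, and the helper `storage_kind` is hypothetical.

```python
# Stand-ins for the two role sets; string values partly assumed (see lead-in).
CATEGORICAL_LIKE = frozenset({"categorical", "join_key", "text", "unused_string"})
NUMERICAL_LIKE = frozenset({"numerical", "target", "time_stamp", "unused_float"})


def storage_kind(role):
    """Classify a role as 'categorical' or 'numerical' by set membership."""
    if role in CATEGORICAL_LIKE:
        return "categorical"
    if role in NUMERICAL_LIKE:
        return "numerical"
    raise ValueError(f"unknown role: {role!r}")


for role in ["join_key", "time_stamp", "text"]:
    print(role, "->", storage_kind(role))
```

Because the two sets are disjoint and together cover all eight roles, every valid role falls into exactly one branch.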