getml.data.Container
Container(
    population: Optional[Union[DataFrame, View]] = None,
    peripheral: Optional[Dict[str, Union[DataFrame, View]]] = None,
    split: Optional[Union[StringColumn, StringColumnView]] = None,
    deep_copy: Optional[bool] = False,
    train: Optional[Union[DataFrame, View]] = None,
    validation: Optional[Union[DataFrame, View]] = None,
    test: Optional[Union[DataFrame, View]] = None,
    **kwargs: Optional[Union[DataFrame, View]]
)
A container holds the actual data in the form of a DataFrame or a View.
The purpose of a container is twofold:
- Assigning concrete data to an abstract DataModel.
- Storing data and allowing you to reproduce previous results.
ATTRIBUTE | DESCRIPTION |
---|---|
population | The population table defines the statistical population of the machine learning problem and contains the target variables. |
peripheral | The peripheral tables are joined onto population or other peripheral tables. Note that you can also pass them using add. |
split | Contains information on how you want to split population into different subsets. Also refer to the split module. |
deep_copy | Whether you want to create deep copies of your tables. |
train | The population table used in the train subset. |
validation | The population table used in the validation subset. |
test | The population table used in the test subset. |
kwargs | The population tables used in any subsets other than the predefined train, validation, and test subsets. |
Example
A DataModel only contains abstract data. When we fit a pipeline, we need to assign concrete data.
This example is taken from the loans notebook. Note that the notebook uses the high-level StarSchema implementation; for demonstration purposes, we proceed here with the low-level implementation.
# The abstract data model is constructed
# using the DataModel class. A data model
# does not contain any actual data. It just
# defines the abstract relational structure.
dm = getml.data.DataModel(
population_train.to_placeholder("population")
)
dm.add(getml.data.to_placeholder(
meta=meta,
order=order,
trans=trans)
)
dm.population.join(
dm.trans,
on="account_id",
time_stamps=("date_loan", "date")
)
dm.population.join(
dm.order,
on="account_id",
)
dm.population.join(
dm.meta,
on="account_id",
)
# We now have abstract placeholders on something
# called "population", "meta", "order" and "trans".
# But how do we assign concrete data? By using
# a container.
container = getml.data.Container(
train=population_train,
test=population_test
)
# meta, order and trans are either
# DataFrames or Views. Their aliases need
# to match the names of the placeholders in the
# data model.
container.add(
meta=meta,
order=order,
trans=trans
)
# Freezing makes the container immutable.
# This is not required, but often a good idea.
container.freeze()
# When we call 'train', the container
# will return the train set and the
# peripheral tables.
my_pipeline.fit(container.train)
# Same for 'test'
my_pipeline.score(container.test)
If you do not already have a train and test set, you can use a function from the split module.
split = getml.data.split.random(
train=0.8, test=0.2)
container = getml.data.Container(
population=population_all,
split=split,
)
# The remaining code is the same as in
# the example above. In particular,
# container.train and container.test
# work just like above.
Containers can also be used for storage and reproducing your results. A recommended pattern is to assign 'baseline roles' to your data frames and then use a View to tweak them:
# Assign baseline roles
data_frame.set_role(["jk"], getml.data.roles.join_key)
data_frame.set_role(["col1", "col2"], getml.data.roles.categorical)
data_frame.set_role(["col3", "col4"], getml.data.roles.numerical)
data_frame.set_role(["col5"], getml.data.roles.target)
# Make the data frame immutable, so in-place operations are
# no longer possible.
data_frame.freeze()
# Save the data frame.
data_frame.save()
# I suspect that col1 leads to overfitting, so I will drop it.
view = data_frame.drop(["col1"])
# Insert the view into a container.
container = getml.data.Container(...)
container.add(some_alias=view)
container.save()
The advantage of this pattern is that you can always completely retrace your entire pipeline without creating deep copies of the data frames every time you make a small change like the one in our example. Note that the pipeline will record which container you have used.
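To reproduce a previous result, a saved container can be reloaded in a later session and passed to the pipeline as before. The following is a minimal sketch, assuming getml.data.load_container and a container id attribute; check the API of your getML version:
# A sketch of reloading a saved container in a later session.
# ASSUMPTION: getml.data.load_container and container.id are
# available in your getML version.
container_id = container.id
# Later, after connecting to the same project:
restored = getml.data.load_container(container_id)
# The restored container yields the same subsets as before,
# so previous results can be reproduced exactly.
my_pipeline.score(restored.test)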
Source code in getml/data/container.py
add
add(*args, **kwargs)
Adds new peripheral data frames or views.
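A minimal usage sketch, reusing the names from the example above; the keyword names serve as aliases and must match the placeholder names in the data model:
container = getml.data.Container(
    train=population_train,
    test=population_test
)
# Each keyword argument adds one peripheral table
# under the given alias.
container.add(meta=meta, order=order, trans=trans)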
Source code in getml/data/container.py
freeze
freeze()
Freezes the container, so that changes are no longer possible.
This is required before you can extract data when deep_copy=True. The idea of deep_copy is to ensure that you can always retrace and reproduce your results. That is why the container needs to be immutable before it can be used.
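A short sketch of that workflow, reusing the names from the examples above and assuming deep_copy=True:
# With deep_copy=True, the container must be frozen
# before any subset can be extracted.
container = getml.data.Container(
    population=population_all,
    split=split,
    deep_copy=True,
)
container.add(meta=meta, order=order, trans=trans)
container.freeze()
# Only now can we access the subsets.
my_pipeline.fit(container.train)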
Source code in getml/data/container.py
save
save()
Saves the Container to disk.
Source code in getml/data/container.py
sync
sync()
Synchronizes the last change with the data to avoid warnings that the data has been changed.
This is only a problem when deep_copy=False.
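A sketch of when sync is useful, assuming a container built with deep_copy=False (the default):
# With deep_copy=False, the container only references the
# underlying data frames rather than copying them.
container = getml.data.Container(
    train=population_train,
    test=population_test
)
# An in-place change to a referenced data frame:
population_train.set_role(["col3"], getml.data.roles.numerical)
# Acknowledge the change so that subsequent accesses do not
# warn that the data has been changed.
container.sync()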
Source code in getml/data/container.py
to_pandas
to_pandas()
Returns the Container's contents as a dictionary of pandas.DataFrames, in which each key holds a data frame's name and each value holds the corresponding data converted to a pandas.DataFrame.
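A minimal usage sketch; the exact keys depend on the data frames stored in the container:
# Convert every stored data frame to pandas.
dfs = container.to_pandas()
# Keys are the data frames' names, values are
# pandas.DataFrames.
for name, pandas_df in dfs.items():
    print(name, pandas_df.shape)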
Source code in getml/data/container.py