getml.data.Container
Container(
    population: Optional[Union[DataFrame, View]] = None,
    peripheral: Optional[Dict[str, Union[DataFrame, View]]] = None,
    split: Optional[Union[StringColumn, StringColumnView]] = None,
    deep_copy: Optional[bool] = False,
    train: Optional[Union[DataFrame, View]] = None,
    validation: Optional[Union[DataFrame, View]] = None,
    test: Optional[Union[DataFrame, View]] = None,
    **kwargs: Optional[Union[DataFrame, View]]
)
A container holds the actual data in the form of a DataFrame or a View.
The purpose of a container is twofold:
- Assigning concrete data to an abstract DataModel.
- Storing data and allowing you to reproduce previous results.
ATTRIBUTE | DESCRIPTION
---|---
population | The population table defines the statistical population of the machine learning problem and contains the target variables.
peripheral | The peripheral tables are joined onto population or other peripheral tables. Note that you can also pass them using add.
split | Contains information on how you want to split population into different subsets.
deep_copy | Whether you want to create deep copies of your tables.
train | The population table used in the train subset.
validation | The population table used in the validation subset.
test | The population table used in the test subset.
kwargs | Any other subset of the population table, passed as a keyword argument under the subset's name.
Example
A DataModel only contains abstract data. When we fit a pipeline, we need to assign concrete data.
This example is taken from the loans notebook. Note that the notebook uses the high-level StarSchema implementation; for demonstration purposes, we proceed with the low-level implementation here.
# The abstract data model is constructed
# using the DataModel class. A data model
# does not contain any actual data. It just
# defines the abstract relational structure.
dm = getml.data.DataModel(
population_train.to_placeholder("population")
)
dm.add(getml.data.to_placeholder(
meta=meta,
order=order,
trans=trans)
)
dm.population.join(
dm.trans,
on="account_id",
time_stamps=("date_loan", "date")
)
dm.population.join(
dm.order,
on="account_id",
)
dm.population.join(
dm.meta,
on="account_id",
)
# We now have abstract placeholders on something
# called "population", "meta", "order" and "trans".
# But how do we assign concrete data? By using
# a container.
container = getml.data.Container(
train=population_train,
test=population_test
)
# meta, order and trans are either
# DataFrames or Views. Their aliases need
# to match the names of the placeholders in the
# data model.
container.add(
meta=meta,
order=order,
trans=trans
)
# Freezing makes the container immutable.
# This is not required, but often a good idea.
container.freeze()
# When we call 'train', the container
# will return the train set and the
# peripheral tables.
my_pipeline.fit(container.train)
# Same for 'test'
my_pipeline.score(container.test)
Instead of passing train and test explicitly, you can also pass population together with a split column generated by the split module:
split = getml.data.split.random(
    train=0.8, test=0.2)
container = getml.data.Container(
    population=population_all,
    split=split,
)
# The remaining code is the same as in
# the example above. In particular,
# container.train and container.test
# work just like above.
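The semantics of such a random split can be sketched in plain Python. This is a toy stand-in, not the getml implementation (getml computes the split column on its engine); the function name and seed are assumptions for illustration only:

```python
import random

def random_split(n_rows, train=0.8, test=0.2, seed=42):
    """Toy sketch: assign each row to 'train' or 'test' with the given probabilities."""
    assert abs(train + test - 1.0) < 1e-9, "probabilities must sum to 1"
    rng = random.Random(seed)
    return ["train" if rng.random() < train else "test" for _ in range(n_rows)]

split = random_split(1000)
print(split[:3])
print(split.count("train"), split.count("test"))
```

The container then groups rows by the value in the split column, which is why container.train and container.test work exactly as in the explicit train=/test= example.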
Containers can also be used for storage and for reproducing your results. A recommended pattern is to assign 'baseline roles' to your data frames and then use a View to tweak them:
# Assign baseline roles
data_frame.set_role(["jk"], getml.data.roles.join_key)
data_frame.set_role(["col1", "col2"], getml.data.roles.categorical)
data_frame.set_role(["col3", "col4"], getml.data.roles.numerical)
data_frame.set_role(["col5"], getml.data.roles.target)
# Make the data frame immutable, so in-place operations are
# no longer possible.
data_frame.freeze()
# Save the data frame.
data_frame.save()
# I suspect that col1 leads to overfitting, so I will drop it.
view = data_frame.drop(["col1"])
# Insert the view into a container.
container = getml.data.Container(...)
container.add(some_alias=view)
container.save()
The advantage of this pattern is that it lets you fully retrace your pipeline without creating deep copies of the data frames every time you make a small change like the one in this example. Note that the pipeline records which container you have used.
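Why a view is cheaper than a deep copy can be sketched with a toy stand-in (pure Python, not the getml API; the Frame and View classes here are hypothetical): a view records only the transformation, so the underlying columns are stored once and every tweak remains retraceable.

```python
class Frame:
    """Toy data frame: columns are stored once, in a dict."""
    def __init__(self, columns):
        self.columns = dict(columns)  # column name -> data

class View:
    """Toy view: records a column drop instead of copying the base frame."""
    def __init__(self, base, dropped):
        self.base = base
        self.dropped = set(dropped)

    @property
    def columns(self):
        # Computed on access; the base data is never duplicated.
        return {k: v for k, v in self.base.columns.items()
                if k not in self.dropped}

df = Frame({"col1": [1, 2], "col2": [3, 4]})
view = View(df, ["col1"])
print(sorted(view.columns))             # the view hides col1
print(view.base.columns is df.columns)  # the underlying data is shared, not copied
```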
Source code in getml/data/container.py
add
add(*args, **kwargs)
Adds new peripheral data frames or views.
Source code in getml/data/container.py
freeze
freeze()
Freezes the container, so that changes are no longer possible.
This is required before you can extract data when deep_copy=True. The idea of deep_copy is to ensure that you can always retrace and reproduce your results. That is why the container needs to be immutable before it can be used.
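The role of freezing can be sketched with a minimal pure-Python stand-in (a hypothetical class, not the getml implementation): once frozen, mutating calls raise, so whatever results you produce from the container can later be traced back to exactly this state.

```python
class FrozenError(Exception):
    pass

class MiniContainer:
    """Toy container that becomes immutable after freeze()."""
    def __init__(self):
        self._frozen = False
        self._tables = {}

    def add(self, **tables):
        if self._frozen:
            raise FrozenError("container is frozen; no further changes allowed")
        self._tables.update(tables)

    def freeze(self):
        self._frozen = True

c = MiniContainer()
c.add(meta=[1, 2, 3])
c.freeze()
try:
    c.add(order=[4, 5])      # any mutation after freeze() is rejected
except FrozenError as e:
    print("blocked:", e)
```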
Source code in getml/data/container.py
save
save()
Saves the Container to disk.
Source code in getml/data/container.py
sync
sync()
Synchronizes the last change with the data to avoid warnings that the data has been changed. This is only a problem when deep_copy=False.
Source code in getml/data/container.py
to_pandas
to_pandas()
Returns the Container's contents as a dictionary of pandas.DataFrames: each key holds a data frame's name, each value the corresponding data converted to a pandas.DataFrame.
Source code in getml/data/container.py