getml.data.View

View(
    base: Union[DataFrame, View],
    name: Optional[str] = None,
    subselection: Optional[
        Union[
            BooleanColumnView, FloatColumn, FloatColumnView
        ]
    ] = None,
    added: Optional[Dict] = None,
    dropped: Optional[List[str]] = None,
)

A view is a lazily evaluated, immutable representation of a DataFrame.

There are important differences between a DataFrame and a view:

  • Views are lazily evaluated. That means that views do not contain any data themselves. Instead, they just refer to an underlying data frame. If the underlying data frame changes, so will the view (but such behavior will result in a warning).

  • Views are immutable. In-place operations on a view are not possible. Any operation on a view will result in a new view (see the sketch below).

  • Views have no direct representation on the getML Engine, and therefore they do not need to have an identifying name.
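
To illustrate the first two points, here is a minimal sketch (assuming a data frame named data_frame that contains a column "col1"): every operation returns a new view and leaves its input untouched, and no data is copied until it is actually requested.

view1 = data_frame[:100]            # a view on the first 100 rows
view2 = view1.drop(["col1"])        # a new view; view1 is unchanged
view3 = view2.with_name("my_view")  # again a new view; view2 is unchanged

# None of the lines above copied any data. The views are only
# evaluated when their data is requested, e.g. via to_pandas().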

ATTRIBUTE DESCRIPTION
base

A data frame or view used as the basis for this view.

TYPE: Union[DataFrame, View]

name

The name assigned to this view.

TYPE: str

subselection

Indicates which rows we would like to keep.

TYPE: Union[BooleanColumnView, FloatColumn, FloatColumnView]

added

A dictionary that describes a new column that has been added to the view.

TYPE: Dict

dropped

A list of columns that have been dropped.

TYPE: List[str]

Example

You will rarely create views directly. Instead, you are more likely to encounter them as the result of some operation on a DataFrame:

# Creates a view on the first 100 lines
view1 = data_frame[:100]

# Creates a view without some columns.
view2 = data_frame.drop(["col1", "col2"])

# Creates a view in which some roles are reassigned.
view3 = data_frame.with_role(["col1", "col2"], getml.data.roles.categorical)

A recommended pattern is to assign 'baseline roles' to your data frames and then use views to tweak them:

# Assign baseline roles
data_frame.set_role(["jk"], getml.data.roles.join_key)
data_frame.set_role(["col1", "col2"], getml.data.roles.categorical)
data_frame.set_role(["col3", "col4"], getml.data.roles.numerical)
data_frame.set_role(["col5"], getml.data.roles.target)

# Make the data frame immutable, so in-place operations are
# no longer possible.
data_frame.freeze()

# Save the data frame.
data_frame.save()

# I suspect that col1 leads to overfitting, so I will drop it.
view = data_frame.drop(["col1"])

# Insert the view into a container.
container = getml.data.Container(...)
container.add(some_alias=view)
container.save()

The advantage of this pattern is that it lets you fully retrace your entire pipeline without creating deep copies of the data frames whenever you make a small change like the one in this example. Note that the pipeline will record which Container you have used.

Source code in getml/data/view.py
def __init__(
    self,
    base: Union[DataFrame, View],
    name: Optional[str] = None,
    subselection: Optional[
        Union[BooleanColumnView, FloatColumn, FloatColumnView]
    ] = None,
    added: Optional[Dict] = None,
    dropped: Optional[List[str]] = None,
):
    self._added = added
    self._base = deepcopy(base)
    self._dropped = dropped or []
    self._name = name
    self._subselection = subselection

    self._initial_timestamp: str = (
        self._base._initial_timestamp
        if isinstance(self._base, View)
        else self._base.last_change
    )

    self._base.refresh()

added property

added: Dict

The column that has been added to the view.

RETURNS DESCRIPTION
Dict

The column that has been added to the view.

base property

base: Union[DataFrame, View]

The basis on which the view is created. Must be a DataFrame or a View.

RETURNS DESCRIPTION
Union[DataFrame, View]

The basis on which the view is created.

colnames property

colnames: List[str]

List of the names of all columns.

RETURNS DESCRIPTION
List[str]

List of the names of all columns.

columns property

columns: List[str]

Alias for colnames.

RETURNS DESCRIPTION
List[str]

List of the names of all columns.

dropped property

dropped: List[str]

The names of the columns that have been dropped.

RETURNS DESCRIPTION
List[str]

The names of the columns that have been dropped.

last_change property

last_change: str

A string describing the last time this data frame has been changed.

RETURNS DESCRIPTION
str

A string describing the last time this data frame has been changed.

name property

name: str

The name of the view. If no name is explicitly set, the name will be identical to the name of the base.

RETURNS DESCRIPTION
str

The name of the view.

roles property

roles: Roles

The roles of the columns included in this View.

RETURNS DESCRIPTION
Roles

The roles of the columns included in this View.

rowid property

rowid: List[int]

The rowids for this view.

RETURNS DESCRIPTION
List[int]

The rowids for this view.

subselection property

subselection: Union[BooleanColumnView, FloatColumn, FloatColumnView]

The subselection that is applied to this view.

RETURNS DESCRIPTION
Union[BooleanColumnView, FloatColumn, FloatColumnView]

The subselection that is applied to this view.

shape property

shape: Tuple[Union[int, str], int]

A tuple containing the number of rows and columns of the View.

check

check()

Checks whether the underlying data frame has been changed after the creation of the view.
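
A minimal usage sketch (assuming a view derived from a data frame named data_frame): if the underlying data frame is modified after the view was created, check() logs a warning.

view = data_frame.drop(["col1"])

# ... data_frame is modified here ...

view.check()  # warns if data_frame was changed after the view was created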

Source code in getml/data/view.py
def check(self):
    """
    Checks whether the underlying data frame has been changed
    after the creation of the view.
    """
    last_change = self.last_change
    if last_change != self.__dict__["_initial_timestamp"]:
        logger.warning(
            "The data frame underlying view '"
            + self.name
            + "' was last changed at "
            + last_change
            + ", which was after the creation of the view. "
            + "This might lead to unexpected results. You might "
            + "want to recreate the view. (Views are lazily "
            + "evaluated, so recreating them is a very "
            + "inexpensive operation)."
        )

drop

drop(cols: Union[str, List[str]]) -> View

Returns a new View that has one or several columns removed.

PARAMETER DESCRIPTION
cols

The names of the columns to be dropped.

TYPE: Union[str, List[str]]

RETURNS DESCRIPTION
View

A new view with the specified columns removed.
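
For example (column names are hypothetical), drop accepts a single name or a list of names and can be chained on an existing view:

view2 = view.drop("col1")
view3 = view.drop(["col1", "col2"])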

Source code in getml/data/view.py
def drop(self, cols: Union[str, List[str]]) -> View:
    """Returns a new [`View`][getml.data.View] that has one or several columns removed.

    Args:
        cols:
            The names of the columns to be dropped.

    Returns:
            A new view with the specified columns removed.
    """
    if isinstance(cols, str):
        cols = [cols]

    if not _is_typed_list(cols, str):
        raise TypeError("'cols' must be a string or a list of strings.")

    return View(base=self, dropped=cols)

ncols

ncols() -> int

Number of columns in the current instance.

RETURNS DESCRIPTION
int

Overall number of columns

Source code in getml/data/view.py
def ncols(self) -> int:
    """
    Number of columns in the current instance.

    Returns:
            Overall number of columns
    """
    return len(self.colnames)

nrows

nrows(force: bool = False) -> Union[int, str]

Returns the number of rows in the current instance.

PARAMETER DESCRIPTION
force

If the number of rows is unknown, whether to force the Engine to calculate it anyway. This is a relatively expensive operation, so you might not always want to do so.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
Union[int, str]

The number of rows in the current instance.
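
A short sketch of what the force flag does (assuming a view whose row count the Engine has not determined yet):

view.nrows()            # may return "unknown" if the row count is not known
view.nrows(force=True)  # forces the Engine to calculate the exact row count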

Source code in getml/data/view.py
def nrows(self, force: bool = False) -> Union[int, str]:
    """
    Returns the number of rows in the current instance.

    Args:
        force:
            If the number of rows is unknown,
            do you want to force the Engine to calculate it anyway?
            This is a relatively expensive operation, therefore
            you might not necessarily want this.

    Returns:
            The number of rows in the current instance.
    """

    self.refresh()

    cmd: Dict[str, Any] = {}

    cmd["type_"] = "View.get_nrows"
    cmd["name_"] = ""

    cmd["cols_"] = [self[cname].cmd for cname in self.colnames]
    cmd["force_"] = force

    with comm.send_and_get_socket(cmd) as sock:
        json_str = comm.recv_string(sock)

    if json_str[0] != "{":
        comm.handle_engine_exception(json_str)

    response = json.loads(json_str)

    if "recordsTotal" in response:
        return response["recordsTotal"]

    # ensure that we do not display "unknown" if the number of rows is
    # less than or equal to the maximum number of displayed rows
    nrows_to_display = len(self[: _ViewFormatter.max_rows + 1])
    if nrows_to_display <= _ViewFormatter.max_rows:
        return nrows_to_display

    return "unknown"

refresh

refresh() -> View

Aligns meta-information of the current instance with the corresponding data frame in the getML Engine.

RETURNS DESCRIPTION
View

Updated handle to the underlying data frame in the getML Engine.

Source code in getml/data/view.py
def refresh(self) -> View:
    """Aligns meta-information of the current instance with the
    corresponding data frame in the getML Engine.

    Returns:
            Updated handle to the underlying data frame in the getML
            Engine.

    """
    self._base = self.__dict__["_base"].refresh()
    return self

to_arrow

to_arrow() -> Table

Creates a pyarrow.Table from the view.

Loads the underlying data from the getML Engine and constructs a pyarrow.Table.

RETURNS DESCRIPTION
Table

Pyarrow equivalent of the current instance including its underlying data.

Source code in getml/data/view.py
def to_arrow(self) -> pyarrow.Table:
    """Creates a `pyarrow.Table` from the view.

    Loads the underlying data from the getML Engine and constructs
    a `pyarrow.Table`.

    Returns:
            Pyarrow equivalent of the current instance including
            its underlying data.
    """
    return to_arrow(self)

to_json

to_json() -> str

Creates a JSON string from the current instance.

Loads the underlying data from the getML Engine and constructs a JSON string.

RETURNS DESCRIPTION
str

JSON string of the current instance including its underlying data.

Source code in getml/data/view.py
def to_json(self) -> str:
    """Creates a JSON string from the current instance.

    Loads the underlying data from the getML Engine and constructs
    a JSON string.

    Returns:
            JSON string of the current instance including its
            underlying data.
    """
    return self.to_pandas().to_json()

to_csv

to_csv(
    fname: str,
    quotechar: str = '"',
    sep: str = ",",
    batch_size: int = 0,
    quoting_style: str = "needed",
)

Writes the underlying data into a newly created CSV file.

PARAMETER DESCRIPTION
fname

The name of the CSV file. The ending ".csv" and an optional batch number will be added automatically.

TYPE: str

quotechar

The character used to wrap strings.

TYPE: str DEFAULT: '"'

sep

The character used for separating fields.

TYPE: str DEFAULT: ','

batch_size

Maximum number of lines per file. Set to 0 to write the entire data frame into a single file.

TYPE: int DEFAULT: 0

quoting_style

The quoting style to use. Delegated to pyarrow.

The following values are accepted:

  • "needed" (default): only enclose values in quotes when needed.
  • "all_valid": enclose all valid values in quotes; nulls are not quoted.
  • "none": do not enclose any values in quotes; values containing special characters (such as quotes, cell delimiters or line endings) will raise an error.

TYPE: str DEFAULT: 'needed'

Deprecated

1.5: The quotechar parameter is deprecated.
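
A minimal sketch (the file name is hypothetical); since quotechar is deprecated, prefer quoting_style:

view.to_csv(
    fname="my_view",       # ".csv" is appended automatically
    sep=",",
    batch_size=0,          # write everything into a single file
    quoting_style="needed",
)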

Source code in getml/data/view.py
def to_csv(
    self,
    fname: str,
    quotechar: str = '"',
    sep: str = ",",
    batch_size: int = 0,
    quoting_style: str = "needed",
):
    """
    Writes the underlying data into a newly created CSV file.

    Args:
        fname:
            The name of the CSV file.
            The ending ".csv" and an optional batch number will
            be added automatically.

        quotechar:
            The character used to wrap strings.

        sep:
            The character used for separating fields.

        batch_size:
            Maximum number of lines per file. Set to 0 to write
            the entire data frame into a single file.

        quoting_style (str):
            The quoting style to use. Delegated to pyarrow.

            The following values are accepted:
            - `"needed"` (default): only enclose values in quotes when needed.
            - `"all_valid"`: enclose all valid values in quotes; nulls are not
              quoted.
            - `"none"`: do not enclose any values in quotes; values containing
              special characters (such as quotes, cell delimiters or line
              endings) will raise an error.

    Deprecated:
       1.5: The `quotechar` parameter is deprecated.
    """

    if quotechar != '"':
        warnings.warn(
            "'quotechar' is deprecated, use 'quoting_style' instead.",
            DeprecationWarning,
        )

    to_csv(self, fname, sep, batch_size, quoting_style)

to_db

to_db(table_name: str, conn: Optional[Connection] = None)

Writes the underlying data into a newly created table in the database.

PARAMETER DESCRIPTION
table_name

Name of the table to be created.

If a table of that name already exists, it will be replaced.

TYPE: str

conn

The database connection to be used. If you don't explicitly pass a connection, the Engine will use the default connection.

TYPE: Optional[Connection] DEFAULT: None
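
A minimal sketch (the table name is hypothetical); if no conn is passed, the Engine uses its default database connection:

view.to_db(table_name="my_table")

# Alternatively, pass an explicit getml.database.Connection:
# view.to_db(table_name="my_table", conn=my_connection)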

Source code in getml/data/view.py
def to_db(self, table_name: str, conn: Optional[Connection] = None):
    """Writes the underlying data into a newly created table in the
    database.

    Args:
        table_name:
            Name of the table to be created.

            If a table of that name already exists, it will be
            replaced.

        conn:
            The database connection to be used.
            If you don't explicitly pass a connection,
            the Engine will use the default connection.
    """

    conn = conn or Connection()

    self.refresh()

    if not isinstance(table_name, str):
        raise TypeError("'table_name' must be of type str")

    if not isinstance(conn, Connection):
        raise TypeError("'conn' must be a getml.database.Connection object or None")

    cmd: Dict[str, Any] = {}

    cmd["type_"] = "View.to_db"
    cmd["name_"] = ""

    cmd["view_"] = self._getml_deserialize()
    cmd["table_name_"] = table_name
    cmd["conn_id_"] = conn.conn_id

    comm.send(cmd)

to_df

to_df(name) -> DataFrame

Creates a DataFrame from the view.
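
For example (the name is hypothetical), this materializes the view as a new data frame on the Engine:

df = view.to_df("my_new_data_frame")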

Source code in getml/data/view.py
def to_df(self, name) -> DataFrame:
    """Creates a [`DataFrame`][getml.DataFrame] from the view."""
    self.check()
    self = self.refresh()
    df = data.DataFrame(name)
    return df.read_view(self)

to_pandas

to_pandas() -> DataFrame

Creates a pandas.DataFrame from the view.

Loads the underlying data from the getML Engine and constructs a pandas.DataFrame.

RETURNS DESCRIPTION
DataFrame

Pandas equivalent of the current instance including its underlying data.

Source code in getml/data/view.py
def to_pandas(self) -> pd.DataFrame:
    """Creates a `pandas.DataFrame` from the view.

    Loads the underlying data from the getML Engine and constructs
    a `pandas.DataFrame`.

    Returns:
            Pandas equivalent of the current instance including
            its underlying data.
    """
    return to_arrow(self).to_pandas()

to_placeholder

to_placeholder(name: Optional[str] = None) -> Placeholder

Generates a Placeholder from the current View.

PARAMETER DESCRIPTION
name

The name of the placeholder. If no name is passed, then the name of the placeholder will be identical to the name of the current view.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
Placeholder

A placeholder with the same name as this view.
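
A minimal sketch (the name is hypothetical):

placeholder = view.to_placeholder("population")

# If no name is passed, the placeholder takes the name of the view:
placeholder = view.to_placeholder()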

Source code in getml/data/view.py
def to_placeholder(self, name: Optional[str] = None) -> Placeholder:
    """Generates a [`Placeholder`][getml.data.Placeholder] from the
    current [`View`][getml.data.View].

    Args:
        name:
            The name of the placeholder. If no
            name is passed, then the name of the placeholder will
            be identical to the name of the current view.

    Returns:
            A placeholder with the same name as this data frame.


    """
    self.refresh()
    return Placeholder(name=name or self.name, roles=self.roles)

to_parquet

to_parquet(
    fname: str,
    compression: Literal[
        "brotli", "gzip", "lz4", "snappy", "zstd"
    ] = "snappy",
    coerce_timestamps: Optional[bool] = None,
)

Writes the underlying data into a newly created parquet file.

PARAMETER DESCRIPTION
fname

The name of the parquet file. The ending ".parquet" will be added automatically.

TYPE: str

compression

The compression format to use. Supported values are "brotli", "gzip", "lz4", "snappy", "zstd"

TYPE: Literal['brotli', 'gzip', 'lz4', 'snappy', 'zstd'] DEFAULT: 'snappy'

coerce_timestamps

Cast time stamps to a particular resolution. For details, refer to pyarrow.parquet.ParquetWriter.

TYPE: Optional[bool] DEFAULT: None
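
A minimal sketch (the file name is hypothetical):

view.to_parquet(fname="my_view", compression="zstd")  # ".parquet" is appended automatically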

Source code in getml/data/view.py
def to_parquet(
    self,
    fname: str,
    compression: Literal["brotli", "gzip", "lz4", "snappy", "zstd"] = "snappy",
    coerce_timestamps: Optional[bool] = None,
):
    """
    Writes the underlying data into a newly created parquet file.

    Args:
        fname:
            The name of the parquet file.
            The ending ".parquet" will be added automatically.

        compression:
            The compression format to use.
            Supported values are "brotli", "gzip", "lz4", "snappy", "zstd"
        coerce_timestamps:
            Cast time stamps to a particular resolution.
            For details, refer to [pyarrow.parquet.ParquetWriter][pyarrow.parquet.ParquetWriter].
    """
    to_parquet(self, fname, compression, coerce_timestamps=coerce_timestamps)

to_pyspark

to_pyspark(
    spark: SparkSession, name: Optional[str] = None
) -> DataFrame

Creates a pyspark.sql.DataFrame from the current instance.

Loads the underlying data from the getML Engine and constructs a pyspark.sql.DataFrame.

PARAMETER DESCRIPTION
spark

The pyspark session in which you want to create the data frame.

TYPE: SparkSession

name

The name of the temporary view to be created on top of the pyspark.sql.DataFrame, with which it can be referred to in Spark SQL (refer to pyspark.sql.DataFrame.createOrReplaceTempView). If none is passed, then the name of this View will be used.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION
DataFrame

Pyspark equivalent of the current instance including its underlying data.
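
A minimal sketch, assuming a running Spark session (the view name is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark_df = view.to_pyspark(spark, name="my_view")
spark.sql("SELECT * FROM my_view LIMIT 10").show()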

Source code in getml/data/view.py
def to_pyspark(
    self, spark: pyspark.sql.SparkSession, name: Optional[str] = None
) -> pyspark.sql.DataFrame:
    """Creates a `pyspark.sql.DataFrame` from the current instance.

    Loads the underlying data from the getML Engine and constructs
    a `pyspark.sql.DataFrame`.

    Args:
        spark:
            The pyspark session in which you want to
            create the data frame.

        name:
            The name of the temporary view to be created on top
            of the `pyspark.sql.DataFrame`,
            with which it can be referred to
            in Spark SQL (refer to
            `pyspark.sql.DataFrame.createOrReplaceTempView`).
            If none is passed, then the name of this
            [`DataFrame`][getml.DataFrame] will be used.

    Returns:
            Pyspark equivalent of the current instance including
            its underlying data.

    """
    return _to_pyspark(self, name, spark)

to_s3

to_s3(
    bucket: str,
    key: str,
    region: str,
    sep: str = ",",
    batch_size: int = 50000,
)

Writes the underlying data into a newly created CSV file located in an S3 bucket.

Note

S3 is not supported on Windows.

PARAMETER DESCRIPTION
bucket

The bucket from which to read the files.

TYPE: str

key

The key in the S3 bucket in which you want to write the output. The ending ".csv" and an optional batch number will be added automatically.

TYPE: str

region

The region in which the bucket is located.

TYPE: str

sep

The character used for separating fields.

TYPE: str DEFAULT: ','

batch_size

Maximum number of lines per file. Set to 0 to write the entire data frame into a single file.

TYPE: int DEFAULT: 50000

Example
getml.engine.set_s3_access_key_id("YOUR-ACCESS-KEY-ID")
getml.engine.set_s3_secret_access_key("YOUR-SECRET-ACCESS-KEY")

your_view.to_s3(
    bucket="your-bucket-name",
    key="filename-on-s3",
    region="us-east-2",
    sep=';'
)
Source code in getml/data/view.py
def to_s3(
    self,
    bucket: str,
    key: str,
    region: str,
    sep: str = ",",
    batch_size: int = 50000,
):
    """
    Writes the underlying data into a newly created CSV file
    located in an S3 bucket.

    Note:
        S3 is not supported on Windows.

    Args:
        bucket:
            The bucket from which to read the files.

        key:
            The key in the S3 bucket in which you want to
            write the output. The ending ".csv" and an optional
            batch number will be added automatically.

        region:
            The region in which the bucket is located.

        sep:
            The character used for separating fields.

        batch_size:
            Maximum number of lines per file. Set to 0 to write
            the entire data frame into a single file.

    ??? example
        ```python
        getml.engine.set_s3_access_key_id("YOUR-ACCESS-KEY-ID")
        getml.engine.set_s3_secret_access_key("YOUR-SECRET-ACCESS-KEY")

        your_view.to_s3(
            bucket="your-bucket-name",
            key="filename-on-s3",
            region="us-east-2",
            sep=';'
        )
        ```
    """

    self.refresh()

    if not isinstance(bucket, str):
        raise TypeError("'bucket' must be of type str")

    if not isinstance(key, str):
        raise TypeError("'fname' must be of type str")

    if not isinstance(region, str):
        raise TypeError("'region' must be of type str")

    if not isinstance(sep, str):
        raise TypeError("'sep' must be of type str")

    if not isinstance(batch_size, numbers.Real):
        raise TypeError("'batch_size' must be a real number")

    cmd: Dict[str, Any] = {}

    cmd["type_"] = "View.to_s3"
    cmd["name_"] = self.name

    cmd["view_"] = self._getml_deserialize()
    cmd["bucket_"] = bucket
    cmd["key_"] = key
    cmd["region_"] = region
    cmd["sep_"] = sep
    cmd["batch_size_"] = batch_size

    comm.send(cmd)

where

Extract a subset of rows.

Creates a new View as a subselection of the current instance.

PARAMETER DESCRIPTION
index

Boolean column indicating the rows you want to select.

TYPE: Optional[Union[BooleanColumnView, FloatColumn, FloatColumnView]]

RETURNS DESCRIPTION
View

A new view containing only the rows that satisfy the condition.

Example

Generate example data:

data = dict(
    fruit=["banana", "apple", "cherry", "cherry", "melon", "pineapple"],
    price=[2.4, 3.0, 1.2, 1.4, 3.4, 3.4],
    join_key=["0", "1", "2", "2", "3", "3"])

fruits = getml.DataFrame.from_dict(data, name="fruits",
roles={"categorical": ["fruit"], "join_key": ["join_key"], "numerical": ["price"]})

fruits
| join_key | fruit       | price     |
| join key | categorical | numerical |
--------------------------------------
| 0        | banana      | 2.4       |
| 1        | apple       | 3         |
| 2        | cherry      | 1.2       |
| 2        | cherry      | 1.4       |
| 3        | melon       | 3.4       |
| 3        | pineapple   | 3.4       |
Apply the where condition. This creates a new view called "cherries":

cherries = fruits.where(
    fruits["fruit"] == "cherry")

cherries
| join_key | fruit       | price     |
| join key | categorical | numerical |
--------------------------------------
| 2        | cherry      | 1.2       |
| 2        | cherry      | 1.4       |

Source code in getml/data/view.py
def where(
    self, index: Optional[Union[BooleanColumnView, FloatColumn, FloatColumnView]]
) -> View:
    """Extract a subset of rows.

    Creates a new [`View`][getml.data.View] as a
    subselection of the current instance.

    Args:
        index:
            Boolean column indicating the rows you want to select.

    Returns:
        A new view containing only the rows that satisfy the condition.

    ??? example
        Generate example data:
        ```python
        data = dict(
            fruit=["banana", "apple", "cherry", "cherry", "melon", "pineapple"],
            price=[2.4, 3.0, 1.2, 1.4, 3.4, 3.4],
            join_key=["0", "1", "2", "2", "3", "3"])

        fruits = getml.DataFrame.from_dict(data, name="fruits",
        roles={"categorical": ["fruit"], "join_key": ["join_key"], "numerical": ["price"]})

        fruits
        ```
        ```
        | join_key | fruit       | price     |
        | join key | categorical | numerical |
        --------------------------------------
        | 0        | banana      | 2.4       |
        | 1        | apple       | 3         |
        | 2        | cherry      | 1.2       |
        | 2        | cherry      | 1.4       |
        | 3        | melon       | 3.4       |
        | 3        | pineapple   | 3.4       |
        ```
        Apply the where condition. This creates a new view called "cherries":

        ```python

        cherries = fruits.where(
            fruits["fruit"] == "cherry")

        cherries
        ```
        ```
        | join_key | fruit       | price     |
        | join key | categorical | numerical |
        --------------------------------------
        | 2        | cherry      | 1.2       |
        | 2        | cherry      | 1.4       |
        ```
    """
    return _where(self, index)

with_column

with_column(
    col: Union[
        bool,
        str,
        float,
        int,
        datetime64,
        FloatColumn,
        FloatColumnView,
        StringColumn,
        StringColumnView,
        BooleanColumnView,
    ],
    name: str,
    role: Optional[Role] = None,
    unit: str = "",
    subroles: Optional[
        Union[Subrole, Iterable[str]]
    ] = None,
    time_formats: Optional[List[str]] = None,
) -> View

Returns a new View that contains an additional column.

PARAMETER DESCRIPTION
col

The column to be added.

TYPE: Union[bool, str, float, int, datetime64, FloatColumn, FloatColumnView, StringColumn, StringColumnView, BooleanColumnView]

name

Name of the new column.

TYPE: str

role

Role of the new column. Must be from roles.

TYPE: Optional[Role] DEFAULT: None

subroles

Subroles of the new column. Must be from subroles.

TYPE: Optional[Union[Subrole, Iterable[str]]] DEFAULT: None

unit

Unit of the column.

TYPE: str DEFAULT: ''

time_formats

Formats to be used to parse the time stamps.

This is only necessary if an implicit conversion from a StringColumn to a time stamp takes place.

The formats are allowed to contain the following special characters:

  • %w - abbreviated weekday (Mon, Tue, ...)
  • %W - full weekday (Monday, Tuesday, ...)
  • %b - abbreviated month (Jan, Feb, ...)
  • %B - full month (January, February, ...)
  • %d - zero-padded day of month (01 .. 31)
  • %e - day of month (1 .. 31)
  • %f - space-padded day of month ( 1 .. 31)
  • %m - zero-padded month (01 .. 12)
  • %n - month (1 .. 12)
  • %o - space-padded month ( 1 .. 12)
  • %y - year without century (70)
  • %Y - year with century (1970)
  • %H - hour (00 .. 23)
  • %h - hour (00 .. 12)
  • %a - am/pm
  • %A - AM/PM
  • %M - minute (00 .. 59)
  • %S - second (00 .. 59)
  • %s - seconds and microseconds (equivalent to %S.%F)
  • %i - millisecond (000 .. 999)
  • %c - centisecond (0 .. 9)
  • %F - fractional seconds/microseconds (000000 - 999999)
  • %z - time zone differential in ISO 8601 format (Z or +NN.NN)
  • %Z - time zone differential in RFC format (GMT or +NNNN)
  • %% - percent sign

TYPE: Optional[List[str]] DEFAULT: None

RETURNS DESCRIPTION
View

A new view containing the additional column.
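
A minimal sketch (column names are hypothetical); arithmetic on columns produces a column view that can be attached under a new name and role:

new_col = view["col3"] + view["col4"]

view2 = view.with_column(
    col=new_col,
    name="col3_plus_col4",
    role=getml.data.roles.numerical,
)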

Source code in getml/data/view.py
def with_column(
    self,
    col: Union[
        bool,
        str,
        float,
        int,
        np.datetime64,
        FloatColumn,
        FloatColumnView,
        StringColumn,
        StringColumnView,
        BooleanColumnView,
    ],
    name: str,
    role: Optional[Role] = None,
    unit: str = "",
    subroles: Optional[Union[Subrole, Iterable[str]]] = None,
    time_formats: Optional[List[str]] = None,
) -> View:
    """Returns a new [`View`][getml.data.View] that contains an additional column.

    Args:
        col:
            The column to be added.

        name:
            Name of the new column.

        role:
            Role of the new column. Must be from [`roles`][getml.data.roles].

        subroles:
            Subroles of the new column. Must be from [`subroles`][getml.data.subroles].

        unit:
            Unit of the column.

        time_formats:
            Formats to be used to parse the time stamps.

            This is only necessary, if an implicit conversion from
            a [`StringColumn`][getml.data.columns.StringColumn] to a time
            stamp is taking place.

            The formats are allowed to contain the following
            special characters:

            * %w - abbreviated weekday (Mon, Tue, ...)
            * %W - full weekday (Monday, Tuesday, ...)
            * %b - abbreviated month (Jan, Feb, ...)
            * %B - full month (January, February, ...)
            * %d - zero-padded day of month (01 .. 31)
            * %e - day of month (1 .. 31)
            * %f - space-padded day of month ( 1 .. 31)
            * %m - zero-padded month (01 .. 12)
            * %n - month (1 .. 12)
            * %o - space-padded month ( 1 .. 12)
            * %y - year without century (70)
            * %Y - year with century (1970)
            * %H - hour (00 .. 23)
            * %h - hour (00 .. 12)
            * %a - am/pm
            * %A - AM/PM
            * %M - minute (00 .. 59)
            * %S - second (00 .. 59)
            * %s - seconds and microseconds (equivalent to %S.%F)
            * %i - millisecond (000 .. 999)
            * %c - centisecond (0 .. 9)
            * %F - fractional seconds/microseconds (000000 - 999999)
            * %z - time zone differential in ISO 8601 format (Z or +NN.NN)
            * %Z - time zone differential in RFC format (GMT or +NNNN)
            * %% - percent sign

    Returns:
        A new view containing the additional column.
    """
    col, role, subroles = _with_column(
        col, name, role, subroles, unit, time_formats
    )
    return View(
        base=self,
        added={
            "col_": col,
            "name_": name,
            "role_": role,
            "subroles_": subroles,
            "unit_": unit,
        },
    )

with_name

with_name(name: str) -> View

Returns a new View with a new name.

PARAMETER DESCRIPTION
name

The name of the new view.

TYPE: str

RETURNS DESCRIPTION
View

A new view with the new name.

Source code in getml/data/view.py
def with_name(self, name: str) -> View:
    """Returns a new [`View`][getml.data.View] with a new name.

    Args:
        name (str):
            The name of the new view.

    Returns:
        A new view with the new name.
    """
    return View(base=self, name=name)

with_role

with_role(
    names: Union[str, List[str]],
    role: str,
    time_formats: Optional[List[str]] = None,
) -> View

Returns a new View with modified roles.

When switching from a role based on type float to a role based on type string or vice versa, an implicit type conversion will be conducted. The time_formats argument is used to parse time stamps during such a conversion. For more information on roles, please refer to the User Guide.

PARAMETER DESCRIPTION
names

The name or names of the column.

TYPE: Union[str, List[str]]

role

The role to be assigned.

TYPE: str

time_formats

Formats to be used to parse the time stamps. This is only necessary if an implicit conversion from a StringColumn to a time stamp takes place.

TYPE: Optional[List[str]] DEFAULT: None

RETURNS DESCRIPTION
View

A new view with the modified roles.
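
A minimal sketch (column names and the time format are hypothetical):

view2 = view.with_role(["col1", "col2"], getml.data.roles.categorical)

# Parsing a string column into a time stamp requires time_formats:
view3 = view.with_role(
    ["date_col"],
    getml.data.roles.time_stamp,
    time_formats=["%Y-%m-%dT%H:%M:%S"],
)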

Source code in getml/data/view.py
def with_role(
    self,
    names: Union[str, List[str]],
    role: str,
    time_formats: Optional[List[str]] = None,
) -> View:
    """Returns a new [`View`][getml.data.View] with modified roles.

    When switching from a role based on type float to a role based on type
    string or vice versa, an implicit type conversion will be conducted.
    The `time_formats` argument is used to parse time stamps during
    such a conversion. For more information on
    roles, please refer to the [User Guide][annotating-data].

    Args:
        names:
            The name or names of the column.

        role:
            The role to be assigned.

        time_formats:
            Formats to be used to parse the time stamps.
            This is only necessary, if an implicit conversion from a StringColumn to
            a time stamp is taking place.

    Returns:
        A new view with the modified roles.
    """
    return _with_role(self, names, role, time_formats)

with_subroles

with_subroles(
    names: Union[str, List[str]],
    subroles: Union[Subrole, Iterable[str]],
    append: bool = True,
) -> View

Returns a new view with one or several new subroles on one or more columns.

PARAMETER DESCRIPTION
names

The name or names of the column.

TYPE: Union[str, List[str]]

subroles

The subroles to be assigned.

TYPE: Union[Subrole, Iterable[str]]

append

Whether you want to append the new subroles to the existing subroles.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
View

A new view with the modified subroles.

Source code in getml/data/view.py
def with_subroles(
    self,
    names: Union[str, List[str]],
    subroles: Union[Subrole, Iterable[str]],
    append: bool = True,
) -> View:
    """Returns a new view with one or several new subroles on one or more columns.

    Args:
        names:
            The name or names of the column.

        subroles:
            The subroles to be assigned.

        append:
            Whether you want to append the
            new subroles to the existing subroles.

    Returns:
        A new view with the modified subroles.
    """
    return _with_subroles(self, names, subroles, append)

with_unit

with_unit(
    names: Union[str, List[str]],
    unit: str,
    comparison_only: bool = False,
) -> View

Returns a view that contains a new unit on one or more columns.

PARAMETER DESCRIPTION
names

The name or names of the column.

TYPE: Union[str, List[str]]

unit

The unit to be assigned.

TYPE: str

comparison_only

Whether you want the column to be used for comparison only. This means that the column can only be used in comparison to other columns of the same unit.

An example might be a bank account number: The number in itself is hardly interesting, but it might be useful to know how often we have seen that same bank account number in another table.

If True, this will append ", comparison only" to the unit. The feature learning algorithms and the feature selectors will interpret this accordingly.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
View

A new view with the modified unit.
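
A minimal sketch (column names and units are hypothetical):

view2 = view.with_unit(["col3"], "dollars")

# Restrict a column to comparisons with columns of the same unit,
# e.g. an account number that is only interesting when matched:
view3 = view.with_unit(["account_id"], "account id", comparison_only=True)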

Source code in getml/data/view.py
def with_unit(
    self, names: Union[str, List[str]], unit: str, comparison_only: bool = False
) -> View:
    """Returns a view that contains a new unit on one or more columns.

    Args:
        names:
            The name or names of the column.

        unit:
            The unit to be assigned.

        comparison_only:
            Whether you want the column to
            be used for comparison only. This means that the column can
            only be used in comparison to other columns of the same unit.

            An example might be a bank account number: The number in itself
            is hardly interesting, but it might be useful to know how often
            we have seen that same bank account number in another table.

            If True, this will append ", comparison only" to the unit.
            The feature learning algorithms and the feature selectors will
            interpret this accordingly.

    Returns:
        A new view with the modified unit.
    """
    return _with_unit(self, names, unit, comparison_only)