cyto_dl.dataframe.transforms.split module

cyto_dl.dataframe.transforms.split.sample_n_each(dataframe: DataFrame, column: str, number: int = 1, force: bool = False, seed: int = 42)

Transform a dataframe to have an equal number of rows per value of column.

If a given value of column has fewer than number corresponding rows:

  • if force is True, the rows for that value are sampled with replacement

  • if force is False, all rows for that value are returned

Parameters:
  • dataframe (pd.DataFrame) – Input dataframe

  • column (str) – The column to be used for selection

  • number (int = 1) – Number of rows to include per unique value of column

  • force (bool = False) – Whether to upsample values of column that have fewer than number rows

  • seed (int = 42) – Random seed used for sampling
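
The behaviour described above can be illustrated with a small pandas-only sketch. This is not the cyto_dl source: the function name sample_n_each_sketch and the toy column names label and x are hypothetical, and the real implementation may differ in row order and in how the seed is applied.

```python
import pandas as pd


def sample_n_each_sketch(dataframe, column, number=1, force=False, seed=42):
    """Illustrative stand-in for sample_n_each (hypothetical, not the library code)."""
    parts = []
    for _, group in dataframe.groupby(column):
        if len(group) >= number:
            # Enough rows: draw `number` of them without replacement.
            parts.append(group.sample(n=number, random_state=seed))
        elif force:
            # Too few rows and force=True: upsample with replacement.
            parts.append(group.sample(n=number, replace=True, random_state=seed))
        else:
            # Too few rows and force=False: keep every row for this value.
            parts.append(group)
    return pd.concat(parts)


# Toy data: value "a" has 5 rows, value "b" has only 2.
df = pd.DataFrame({"label": ["a"] * 5 + ["b"] * 2, "x": range(7)})

kept = sample_n_each_sketch(df, "label", number=3)
upsampled = sample_n_each_sketch(df, "label", number=3, force=True)
```

With force=False the under-represented value keeps its 2 rows, so the result is not perfectly balanced; with force=True both values end up with 3 rows, some of them duplicates.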

cyto_dl.dataframe.transforms.split.split_dataframe(dataframe: DataFrame, train_frac: float, val_frac: float | None = None, return_splits: bool = True, seed: int = 42)

Given a pandas dataframe, perform a train-val-test split and either return three different dataframes, or append a column identifying the split each row belongs to.

TODO: extend this to enable balanced / stratified splitting

Parameters:
  • dataframe (pd.DataFrame) – Input dataframe

  • train_frac (float) – Fraction of data to use for training. Must be <= 1

  • val_frac (Optional[float]) – Fraction of data to use for validation. By default, the data not used for training is split in half between validation and test

  • return_splits (bool = True) – Whether to return the three splits separately, or to append a column to the existing dataframe and return the modified dataframe

  • seed (int = 42) – Random seed for reproducibility
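
The splitting logic described above might look roughly like the following sketch. It is an approximation under stated assumptions, not the library's code: the name split_dataframe_sketch, the column name split, the labels "train", "val" and "test", the tuple return type, and the rounding of split sizes are all hypothetical choices made for illustration.

```python
import numpy as np
import pandas as pd


def split_dataframe_sketch(dataframe, train_frac, val_frac=None,
                           return_splits=True, seed=42):
    """Illustrative stand-in for split_dataframe (hypothetical, not the library code)."""
    if val_frac is None:
        # Default: split what is left after training evenly between val and test.
        val_frac = (1 - train_frac) / 2

    n_rows = len(dataframe)
    n_train = int(round(train_frac * n_rows))
    n_val = int(round(val_frac * n_rows))

    # Shuffle row positions reproducibly, then carve out contiguous chunks.
    order = np.random.default_rng(seed).permutation(n_rows)
    labels = np.empty(n_rows, dtype=object)
    labels[order[:n_train]] = "train"
    labels[order[n_train:n_train + n_val]] = "val"
    labels[order[n_train + n_val:]] = "test"

    if return_splits:
        # Assumed return shape: a (train, val, test) tuple of dataframes.
        return tuple(dataframe[labels == name] for name in ("train", "val", "test"))
    # Otherwise return a copy with an extra (hypothetically named) "split" column.
    return dataframe.assign(split=labels)


df = pd.DataFrame({"x": range(100)})

train, val, test = split_dataframe_sketch(df, train_frac=0.8)
labelled = split_dataframe_sketch(df, train_frac=0.8, return_splits=False)
```

Because the shuffle depends only on seed, calling the sketch twice with the same arguments assigns every row to the same split, which is the reproducibility the seed parameter is meant to provide.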