cyto_dl.dataframe.transforms.split module

cyto_dl.dataframe.transforms.split.sample_n_each(dataframe: DataFrame, column: str, number: int = 1, force: bool = False, seed: int = 42)

Transform a dataframe so that it has an equal number of rows per unique value of column.

If a given value of column has fewer than number corresponding rows:

- if force is True, the rows for that value are sampled with replacement
- if force is False, all the rows for that value are returned

Parameters:

dataframe (pd.DataFrame) – Input dataframe
column (str) – The column to be used for selection
number (int = 1) – Number of rows to include per unique value of column
force (bool = False) – Whether to upsample (sample with replacement) values of column that have fewer than number rows
seed (int = 42) – Random seed used for sampling
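
The behaviour described above can be illustrated with a minimal pandas sketch. This is not the cyto_dl source: the function name sample_n_each_sketch and the example dataframe are hypothetical, and the sketch only assumes the semantics stated in the description (downsample large groups, and either keep or upsample small ones depending on force).

```python
import pandas as pd


def sample_n_each_sketch(dataframe, column, number=1, force=False, seed=42):
    """Illustrative re-implementation of the documented behaviour (assumption)."""
    parts = []
    for _, group in dataframe.groupby(column):
        if len(group) >= number:
            # Enough rows: draw `number` rows without replacement.
            parts.append(group.sample(n=number, replace=False, random_state=seed))
        elif force:
            # Too few rows and force=True: upsample with replacement.
            parts.append(group.sample(n=number, replace=True, random_state=seed))
        else:
            # Too few rows and force=False: keep every row for this value.
            parts.append(group)
    return pd.concat(parts)


# Hypothetical data: five rows labelled "a", two rows labelled "b".
df = pd.DataFrame({"label": ["a"] * 5 + ["b"] * 2, "value": range(7)})

kept = sample_n_each_sketch(df, "label", number=3)
forced = sample_n_each_sketch(df, "label", number=3, force=True)

counts_kept = kept["label"].value_counts().to_dict()
counts_forced = forced["label"].value_counts().to_dict()
```

With force left at False, the under-represented value "b" keeps its two rows, so the result is not fully balanced. With force=True, both values end up with three rows, and some "b" rows are repeated.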

cyto_dl.dataframe.transforms.split.split_dataframe(dataframe: DataFrame, train_frac: float, val_frac: float | None = None, return_splits: bool = True, seed: int = 42)

Given a pandas dataframe, perform a train-val-test split and either return three different dataframes, or append a column identifying the split each row belongs to.

TODO: extend this to enable balanced / stratified splitting

Parameters:

dataframe (pd.DataFrame) – Input dataframe
train_frac (float) – Fraction of data to use for training. Must be <= 1
val_frac (Optional[float]) – Fraction of data to use for validation. Defaults to None, in which case the data not used for training is split evenly between validation and test
return_splits (bool = True) – Whether to return the three splits separately, or to append a column to the existing dataframe and return the modified dataframe
seed (int = 42) – Random seed for reproducibility
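
As a rough illustration of the two return modes, the following sketch mimics the documented semantics with NumPy and pandas. It is not the library's implementation: the function name split_dataframe_sketch, the rounding of split sizes, and the column name "split" with labels "train", "val" and "test" are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def split_dataframe_sketch(dataframe, train_frac, val_frac=None,
                           return_splits=True, seed=42):
    """Illustrative random train-val-test split (assumption, not cyto_dl code)."""
    if val_frac is None:
        # Documented default: halve the non-training data between val and test.
        val_frac = (1 - train_frac) / 2

    n = len(dataframe)
    n_train = int(round(train_frac * n))
    n_val = int(round(val_frac * n))

    # Shuffle row positions reproducibly, then cut into three contiguous chunks.
    order = np.random.default_rng(seed).permutation(n)
    train_idx = order[:n_train]
    val_idx = order[n_train:n_train + n_val]
    test_idx = order[n_train + n_val:]

    if return_splits:
        return (dataframe.iloc[train_idx],
                dataframe.iloc[val_idx],
                dataframe.iloc[test_idx])

    # Otherwise label each row; the column name "split" is hypothetical.
    labels = np.empty(n, dtype=object)
    labels[train_idx] = "train"
    labels[val_idx] = "val"
    labels[test_idx] = "test"
    out = dataframe.copy()  # the sketch copies rather than modifying the input
    out["split"] = labels
    return out


df = pd.DataFrame({"value": range(10)})

train, val, test = split_dataframe_sketch(df, train_frac=0.6)
labeled = split_dataframe_sketch(df, train_frac=0.6, return_splits=False)
split_counts = labeled["split"].value_counts().to_dict()
```

With ten rows and train_frac=0.6, the sketch assigns six rows to training and two each to validation and test. Because both calls use the same seed, the labelled dataframe describes the same partition as the three separate dataframes.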