quilt3distribute package

Submodules

quilt3distribute.dataset module

class quilt3distribute.dataset.Dataset(dataset: Union[str, pathlib.Path, pandas.core.frame.DataFrame], name: str, package_owner: str, readme_path: Union[str, pathlib.Path])

Bases: object

Initialize a dataset object.

Parameters
  • dataset – Filepath or preloaded pandas dataframe.

  • name – A name for the dataset. May only contain alphabetic and underscore characters.

  • package_owner – The name of the dataset owner, which is attached to the dataset name.

  • readme_path – A path to a markdown README file.

add_license(doc_or_link: Union[str, pathlib.Path])

Add a document’s content, or a link to a publicly accessible resource, for license details.

Parameters

doc_or_link – A filepath or string URI to a resource for license details.

Wrapper around quilt3distribute.documentation.README.append_readme_standards.

add_usage_doc(doc_or_link: Union[str, pathlib.Path])

Add a document’s content, or a link to a publicly accessible resource, for documentation and usage examples.

Parameters

doc_or_link – A filepath or string URI to a resource detailing usage of this dataset.

Wrapper around quilt3distribute.documentation.README.append_readme_standards.

property data

distribute(push_uri: Optional[str] = None, message: Optional[str] = None, attach_associates: bool = True) → quilt3.packages.Package

Push a package to a specific S3 bucket. If no bucket is provided, the un-built, un-pushed package is returned. You can push a dataset with the same name to the same bucket multiple times: instead of overwriting a prior dataset, Quilt creates a new dataset version. Please refer to the Quilt documentation for more details: https://docs.quiltdata.com

Parameters
  • push_uri – The S3 bucket URI to push to. Example: “s3://quilt-jacksonb”

  • message – An optional message to attach to that version of the dataset.

  • attach_associates – Boolean option to attach associates as metadata to each file. Associates are used to retain quick navigation between related files.

Returns

The built and optionally pushed quilt3.Package.

property readme

static return_or_raise_approved_name(name: str) → str

Attempt to clean a string to match the pattern expected by Quilt 3. If the string still doesn’t match the approved pattern after cleaning, a ValueError is raised.

Parameters

name – String name to clean.

Returns

Cleaned name.
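
The clean-then-validate behavior described above can be sketched roughly as follows. This is not the library's implementation: the function name clean_or_raise, the cleaning rule (runs of whitespace and hyphens become underscores), and the approved pattern (alphabetic and underscore characters, taken from the name parameter description of Dataset) are all assumptions made for illustration.

```python
import re

# Assumed pattern: one or more alphabetic or underscore characters.
APPROVED_PATTERN = re.compile(r"^[A-Za-z_]+$")


def clean_or_raise(name: str) -> str:
    """Hypothetical stand-in for a clean-then-validate name check."""
    # Assumed cleaning step: trim, then replace runs of whitespace or hyphens
    # with a single underscore.
    cleaned = re.sub(r"[\s\-]+", "_", name.strip())
    if not APPROVED_PATTERN.match(cleaned):
        raise ValueError(
            f"Name {name!r} could not be cleaned to match the approved pattern."
        )
    return cleaned


print(clean_or_raise("my dataset"))  # my_dataset
```

Under these assumed rules, a name such as "data2020" cannot be repaired by cleaning, so the sketch raises ValueError for it.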

set_column_names_map(columns: Dict[str, str])

Explicit override for the labeling of column names on file distribution. For example, if a column (“2dReadPath”) is detected to contain files, those files will be placed in a package directory called “2dReadPath”. Using this function, those directory names can be explicitly overridden.

Parameters

columns – A mapping from the current name of each column that contains files to the desired directory name.
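
The override described above amounts to a lookup with a fallback to the column name. A minimal sketch, assuming a hypothetical helper directory_for_column and a made-up target directory name "images_2d" (neither comes from the library):

```python
from typing import Dict, Optional


def directory_for_column(column: str, names_map: Optional[Dict[str, str]] = None) -> str:
    """Hypothetical helper: pick the directory label for a file column."""
    # Fall back to the column name itself when no override is given.
    return (names_map or {}).get(column, column)


print(directory_for_column("2dReadPath"))  # 2dReadPath
print(directory_for_column("2dReadPath", {"2dReadPath": "images_2d"}))  # images_2d
```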

set_extra_files(files: Union[List[Union[str, pathlib.Path]], Dict[str, List[Union[str, pathlib.Path]]]])

Datasets commonly have extra or supporting files. Any file passed to this function will be added to the requested directory.

Parameters

files – When provided a list of string or Path objects, all paths in the list will be sent to the same logical key, “supporting_files”. When provided a dictionary mapping strings to lists of string or Path objects, the paths will be placed in logical keys labeled by their dictionary key.
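
The two accepted input shapes can be illustrated with a small normalization step. This is a sketch only: normalize_extra_files is a hypothetical helper, not a function of the library, and the file names are placeholders. The “supporting_files” key is the one named in the parameter description.

```python
from pathlib import Path
from typing import Dict, List, Union

PathLike = Union[str, Path]


def normalize_extra_files(
    files: Union[List[PathLike], Dict[str, List[PathLike]]]
) -> Dict[str, List[Path]]:
    """Hypothetical helper: map each logical key to its list of paths."""
    if isinstance(files, dict):
        # Dictionary input: each key becomes its own logical key.
        return {key: [Path(p) for p in paths] for key, paths in files.items()}
    # List input: everything goes under the single "supporting_files" key.
    return {"supporting_files": [Path(p) for p in files]}


print(normalize_extra_files(["notes.txt", Path("channels.csv")]))
print(normalize_extra_files({"scripts": ["run.py"], "figures": ["plot.png"]}))
```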

set_metadata_columns(columns: List[str])

Use the manifest contents to attach metadata to the files found in the dataset.

Parameters

columns – A list of columns to use for metadata attachment.

Example row:
    {"CellId": 1, "Structure": "lysosome", "2dReadPath": "/allen…", "3dReadPath": "/allen…"}

Attach structure metadata:
    dataset.set_metadata_columns(["Structure"])

Result: the files found at the 2dReadPath and the 3dReadPath both have {"Structure": "lysosome"} attached.

In short: the values in each column provided will be used for metadata attachment for every file found.
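
The effect on a single manifest row can be sketched without the library. The helper metadata_per_file is hypothetical, and the file paths are placeholders rather than the truncated paths in the example row above.

```python
from typing import Any, Dict, List


def metadata_per_file(
    row: Dict[str, Any], path_columns: List[str], metadata_columns: List[str]
) -> Dict[str, Dict[str, Any]]:
    """Hypothetical helper: map each file path in a row to its metadata."""
    metadata = {column: row[column] for column in metadata_columns}
    # Copy the dictionary so each file carries its own metadata object.
    return {row[column]: dict(metadata) for column in path_columns}


row = {
    "CellId": 1,
    "Structure": "lysosome",
    "2dReadPath": "/data/cell_1_2d.tiff",  # placeholder path
    "3dReadPath": "/data/cell_1_3d.tiff",  # placeholder path
}

attached = metadata_per_file(row, ["2dReadPath", "3dReadPath"], ["Structure"])
for path, metadata in attached.items():
    print(path, metadata)
```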

set_path_columns(columns: List[str])

Explicit override for which columns will be used for file distribution.

Parameters

columns – A list of columns to use for file distribution.

quilt3distribute.documentation module

class quilt3distribute.documentation.README(fp: Union[str, pathlib.Path])

Bases: object

Initialize a README object.

Parameters

fp – Filepath to a markdown readme document.

append_readme_standards(usage_doc_or_link: Union[str, pathlib.Path, None] = None, license_doc_or_link: Union[str, pathlib.Path, None] = None) → str

Attach a standard document or link to the readme. If the provided value is an external resource, a default message is attached before linking to the external resource. Additionally, updates the underlying text attribute for this object to retain prior document attachments.

Parameters
  • usage_doc_or_link – A document or link to external resource with details on dataset usage.

  • license_doc_or_link – A document or link to external resource with details on licensing.

Returns

The entire contents of the readme returned as a string.
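
The document-or-link behavior described above might look roughly like the sketch below. Everything here is an assumption for illustration: the helper name append_standard, the section heading format, the default link message, and the example URL are not taken from the library.

```python
from pathlib import Path
from typing import Optional, Union


def append_standard(
    text: str, heading: str, doc_or_link: Optional[Union[str, Path]]
) -> str:
    """Hypothetical helper: append a document's content or a link under a heading."""
    if doc_or_link is None:
        return text
    candidate = Path(str(doc_or_link))
    if candidate.is_file():
        # A local document: attach its content directly.
        body = candidate.read_text()
    else:
        # An external resource: attach an assumed default message plus the link.
        body = f"For details, see: {doc_or_link}"
    return f"{text.rstrip()}\n\n## {heading}\n\n{body.strip()}\n"


readme = "# My Dataset\n"
readme = append_standard(readme, "License", "https://example.com/license")
print(readme)
```

Returning the updated text and feeding it back in on the next call mirrors the note that prior attachments are retained.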

property fp

property referenced_files

property text

class quilt3distribute.documentation.ReferencedFiles(target, resolved)

Bases: tuple

Create new instance of ReferencedFiles(target, resolved)

property resolved

Alias for field number 1

property target

Alias for field number 0
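
The tuple base class and the "Alias for field number" entries are what Python generates for a named tuple. The stand-in below recreates an object of the same shape to show how the fields can be read; it is not imported from the library, and the example values are made up.

```python
from collections import namedtuple

# Stand-in with the same field order as the documented class.
ReferencedFiles = namedtuple("ReferencedFiles", ["target", "resolved"])

ref = ReferencedFiles(target="images/overview.png", resolved="/tmp/images/overview.png")

# Fields are available by name, by position, and by unpacking.
print(ref.target, ref[0])
print(ref.resolved, ref[1])
target, resolved = ref
```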

quilt3distribute.file_utils module

quilt3distribute.file_utils.create_unique_logical_key(physical_key: Union[str, pathlib.Path]) → str

quilt3distribute.validation module

class quilt3distribute.validation.FeatureDefinition(dtype: Type, validation_functions: Union[List[Callable], Tuple[Callable], None] = None, cast_values: bool = False, display_name: Optional[str] = None, description: Optional[str] = None, units: Optional[str] = None)

Bases: object

Initialize a new feature definition. A feature definition can be as simple as a data type (dtype), or it can be highly specific, including validation and cleaning operations or metadata. If a dtype of pathlib.Path is provided, cast_values is automatically set to True.

Parameters
  • dtype – The data type for the feature.

  • validation_functions – A list or tuple of callable functions to validate each instance of the feature.

  • cast_values – Whether to attempt to cast an instance of the feature to the provided dtype when it doesn’t already match that dtype.

  • display_name – Metadata attachment for a display name to be given to the feature.

  • description – Metadata attachment for a description for the feature.

  • units – Metadata attachment for unit details for the feature.
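
The interaction between dtype and cast_values can be sketched as a single check. This is an illustration under assumptions, not the library's code: check_or_cast is a hypothetical helper, and raising TypeError when casting is disabled is an assumed behavior.

```python
from pathlib import Path
from typing import Any, Type


def check_or_cast(value: Any, dtype: Type, cast_values: bool = False) -> Any:
    """Hypothetical helper: return value as dtype, casting only when allowed."""
    # Per the description above, a pathlib.Path dtype always enables casting.
    cast_values = cast_values or dtype is Path
    if isinstance(value, dtype):
        return value
    if cast_values:
        # The cast itself may still fail, e.g. int("abc") raises ValueError.
        return dtype(value)
    raise TypeError(f"{value!r} is not of type {dtype.__name__} and casting is disabled.")


print(check_or_cast("3", int, cast_values=True))  # 3
print(check_or_cast("data/image.tiff", Path))
```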

exception quilt3distribute.validation.PlannedDelayedDropError(message: str, **kwargs)

Bases: Exception

class quilt3distribute.validation.PlannedDelayedDropResult(index, error)

Bases: tuple

Create new instance of PlannedDelayedDropResult(index, error)

property error

Alias for field number 1

property index

Alias for field number 0

class quilt3distribute.validation.Schema(features: Dict[str, quilt3distribute.validation.ValidatedFeature])

Bases: object

A schema is the summation of multiple validated and unvalidated features for a Dataset. It provides helpful methods for viewing which features have and have not been validated and with which data types, functions, and metadata.

Parameters

features – A dictionary mapping the dataset manifest column names to ValidatedFeatures.

property df

property features

property unvalidated

property validated

class quilt3distribute.validation.ValidatedDataset(data, schema)

Bases: tuple

Create new instance of ValidatedDataset(data, schema)

property data

Alias for field number 0

property schema

Alias for field number 1

class quilt3distribute.validation.ValidatedFeature(name: str, dtype: Type, display_name: Optional[str] = None, description: Optional[str] = None, units: Optional[str] = None, validation_functions: Optional[Tuple[Callable]] = None, errored_results: Optional[Set[quilt3distribute.validation.PlannedDelayedDropResult]] = None)

Bases: object

A feature that has its core validation attributes locked but its metadata freely mutable.

Parameters
  • name – The name for the feature in the dataset (usually this is the column).

  • dtype – A single data type for the feature.

  • display_name – A display name for the feature.

  • description – A description for the feature.

  • units – Units for the feature.

  • validation_functions – The tuple of validation functions run against the feature values.

  • errored_results – An optional set of PlannedDelayedDropResults that errored out during validation.

property dtype

property errored_results

property name

to_dict() → Dict[str, Union[str, Type, Tuple[Callable]]]

property validation_functions

class quilt3distribute.validation.ValidationReturn(name, feature)

Bases: tuple

Create new instance of ValidationReturn(name, feature)

property feature

Alias for field number 1

property name

Alias for field number 0

class quilt3distribute.validation.Validator(name: str, values: numpy.ndarray, definition: quilt3distribute.validation.FeatureDefinition, drop_on_error: bool = False)

Bases: object

A container to manage feature values and feature definition that can actually process (validate) the feature.

Parameters
  • name – The name of the feature (usually this is the column name).

  • values – The np.ndarray of feature values.

  • definition – The feature definition to validate against.

  • drop_on_error – Whether to drop the row and continue validation when an error occurs during validation.
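
The drop_on_error option can be pictured as the difference between re-raising on the first failure and recording the failure for later. The sketch below is a stand-in, not the Validator implementation: DropResult only mirrors the (index, error) fields of PlannedDelayedDropResult, and validate_values and must_be_positive are hypothetical.

```python
from typing import Any, Callable, List, NamedTuple, Sequence, Tuple


class DropResult(NamedTuple):
    """Hypothetical stand-in with the same fields as PlannedDelayedDropResult."""

    index: int
    error: str


def validate_values(
    values: Sequence[Any], check: Callable[[Any], None], drop_on_error: bool = False
) -> Tuple[List[Any], List[DropResult]]:
    """Run check on every value; collect failures instead of raising if requested."""
    kept: List[Any] = []
    dropped: List[DropResult] = []
    for index, value in enumerate(values):
        try:
            check(value)
            kept.append(value)
        except Exception as error:
            if not drop_on_error:
                raise
            # Remember which row failed and why, then keep going.
            dropped.append(DropResult(index, str(error)))
    return kept, dropped


def must_be_positive(value: int) -> None:
    if value <= 0:
        raise ValueError(f"{value} is not positive")


kept, dropped = validate_values([1, -2, 3], must_be_positive, drop_on_error=True)
print(kept, dropped)
```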

process(progress_bar: Optional[tqdm.std.tqdm] = None) → quilt3distribute.validation.ValidatedFeature

Use the feature definition stored on this object to attempt to validate the feature.

Parameters

progress_bar – An optional tqdm progress bar to update as the values are processed.

Returns

A ValidatedFeature object representing that this feature has been checked.

quilt3distribute.validation.validate(data: pandas.core.frame.DataFrame, schema: Optional[Dict[str, quilt3distribute.validation.FeatureDefinition]] = None, drop_on_error: bool = False, n_workers: Optional[int] = None, show_progress: bool = True) → quilt3distribute.validation.ValidatedDataset

A function that validates a dataset against the proposed schema.

Parameters
  • data – A pandas dataframe to validate.

  • schema – The proposed schema to validate the dataset against: a dictionary mapping dataframe column names to FeatureDefinitions. If no schema is provided, _generate_schema_template is used to generate one for the data provided.

  • drop_on_error – Whether to drop the row and continue validation when an error occurs during validation.

  • n_workers – The number of threads to use during validation.

  • show_progress – Boolean option to show or hide progress bar.

Returns

A ValidatedDataset object that stores the cleaned copy of the data as well as the validated schema.

Validation isn’t a CPU-intensive task, so a thread pool is used rather than a process pool. The most intensive task is checking that files exist.
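
File existence checks spend their time waiting on the file system rather than computing, which is why threads suit them. A minimal sketch of that idea using the standard library follows; check_files_exist is a hypothetical helper and does not reflect how validate is implemented internally.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, Iterable, Optional


def check_files_exist(
    paths: Iterable[str], n_workers: Optional[int] = None
) -> Dict[str, bool]:
    """Hypothetical helper: check many paths for existence using a thread pool."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        # executor.map yields results in the same order as the inputs.
        results = executor.map(lambda p: Path(p).exists(), paths)
        return dict(zip(paths, results))


existing = str(Path.cwd())
missing = "/no/such/path/quilt3distribute_example"  # placeholder path
print(check_files_exist([existing, missing], n_workers=2))
```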

Module contents

Top-level package for quilt3distribute.

quilt3distribute.get_module_version()