quilt3distribute package¶

Subpackages¶

quilt3distribute.bin package

Submodules¶

quilt3distribute.dataset module¶

class quilt3distribute.dataset.Dataset(dataset: Union[str, pathlib.Path, pandas.core.frame.DataFrame], name: str, package_owner: str, readme_path: Union[str, pathlib.Path])[source]¶

Bases: object

Initialize a dataset object.

Parameters

dataset – Filepath or preloaded pandas dataframe.
name – A name for the dataset. May only contain alphabetic and underscore characters.
package_owner – The name of the dataset owner. To be attached to the dataset name.
readme_path – A path to a markdown README file.

add_license(doc_or_link: Union[str, pathlib.Path])[source]¶

Add a document’s content or add a link to a publicly accessible resource for license details.

Parameters: doc_or_link – A filepath or string uri to a resource for license details.

Wrapper around quilt3distribute.documentation.README.append_readme_standards.

add_usage_doc(doc_or_link: Union[str, pathlib.Path])[source]¶

Add a document’s content or add a link to a publicly accessible resource for documentation and usage examples.

Parameters: doc_or_link – A filepath or string uri to a resource detailing usage of this dataset.

Wrapper around quilt3distribute.documentation.README.append_readme_standards.

property data¶

distribute(push_uri: Optional[str] = None, message: Optional[str] = None, attach_associates: bool = True) → quilt3.packages.Package[source]¶

Push a package to a specific S3 bucket. If no bucket is provided, the un-built, un-pushed package is returned. You can push a dataset with the same name to the same bucket multiple times: instead of overwriting a prior dataset, Quilt creates a new dataset version. Please refer to the Quilt documentation for more details: https://docs.quiltdata.com

Parameters

push_uri – The S3 bucket uri to push to. Example: “s3://quilt-jacksonb”
message – An optional message to attach to that version of the dataset.
attach_associates – Boolean option to attach associates as metadata to each file. Associates are used to retain quick navigation between related files.

Returns

The built and optionally pushed quilt3.Package.
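A minimal usage sketch of the workflow described above, assuming quilt3distribute is installed. The helper name build_package, its arguments, and the example name and owner strings are hypothetical; the Dataset and distribute signatures follow this page. The import sits inside the function so that the definition itself loads without the package.

```python
from pathlib import Path
from typing import Optional, Union


def build_package(
    manifest: Union[str, Path],
    readme: Union[str, Path],
    push_uri: Optional[str] = None,
):
    """Hypothetical helper: build (and optionally push) a dataset package."""
    # Imported lazily; calling this function requires quilt3distribute.
    from quilt3distribute.dataset import Dataset

    ds = Dataset(
        dataset=manifest,
        name="example_dataset",         # alphabetic and underscore characters only
        package_owner="example_owner",  # hypothetical owner name
        readme_path=readme,
    )
    # With push_uri=None nothing is pushed and the un-built package is returned.
    return ds.distribute(push_uri=push_uri, message="Initial version")
```

Passing a bucket URI as push_uri would push a new version instead of only returning the package.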

property readme¶

static return_or_raise_approved_name(name: str) → str[source]¶

Attempt to clean a string to match the pattern expected by Quilt 3. If the string still doesn’t match the approved pattern after cleaning, a ValueError is raised.

Parameters: name – String name to clean.
Returns: Cleaned name.
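The clean-then-check behavior can be sketched as follows. This is not the library's implementation: the function name clean_or_raise, the cleaning rule (spaces and hyphens become underscores), and the exact pattern are assumptions based on the description of the name parameter on this page.

```python
import re

# Assumed pattern: alphabetic and underscore characters only, per the
# description of the `name` parameter on this page.
APPROVED_NAME = re.compile(r"^[A-Za-z_]+$")


def clean_or_raise(name: str) -> str:
    """Hypothetical stand-in for return_or_raise_approved_name."""
    # Assumed cleaning step: runs of spaces and hyphens become one underscore.
    cleaned = re.sub(r"[\s\-]+", "_", name.strip())
    if not APPROVED_NAME.match(cleaned):
        raise ValueError(f"Name {name!r} could not be cleaned to an approved pattern.")
    return cleaned
```

Under these assumptions a name containing digits cannot be repaired by the cleaning step, so it raises ValueError.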

set_column_names_map(columns: Dict[str, str])[source]¶

Explicit override for the labeling of column names on file distribution. For example, if a column (“2dReadPath”) is detected to contain files, those files are placed in a package directory called “2dReadPath”. Using this function, those directory names can be explicitly overridden.

Parameters: columns – A mapping of each current column name that contains files to the desired directory name.
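The effect of the mapping can be illustrated with a small standalone sketch. The helper directory_for_column and the directory name "images_2d" are hypothetical and only mirror the behavior described above, not the library's internals.

```python
from typing import Dict, Optional


def directory_for_column(column: str, names_map: Optional[Dict[str, str]] = None) -> str:
    """Return the directory label for a file column (hypothetical helper)."""
    # Columns without an override keep their own name as the directory name.
    return (names_map or {}).get(column, column)
```

With a map of {"2dReadPath": "images_2d"}, files from the "2dReadPath" column would land under "images_2d", while unmapped columns keep their own names.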

set_extra_files(files: Union[List[Union[str, pathlib.Path]], Dict[str, List[Union[str, pathlib.Path]]]])[source]¶

Datasets commonly have extra or supporting files. Any file passed to this function will be added to the requested directory.

Parameters: files – When provided a list of string or Path objects, all paths in the list are sent to the same logical key, “supporting_files”. When provided a dictionary mapping strings to lists of string or Path objects, the paths are placed in logical keys labeled by their dictionary keys.
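The two accepted input shapes can be reduced to one, as in the sketch below. The helper normalize_extra_files is hypothetical; only the "supporting_files" key comes from the description above.

```python
from pathlib import Path
from typing import Dict, List, Union

PathLike = Union[str, Path]


def normalize_extra_files(
    files: Union[List[PathLike], Dict[str, List[PathLike]]]
) -> Dict[str, List[Path]]:
    """Hypothetical helper: map logical keys to lists of Path objects."""
    if isinstance(files, dict):
        grouped = files
    else:
        # A bare list is grouped under the single key "supporting_files".
        grouped = {"supporting_files": list(files)}
    return {key: [Path(p) for p in paths] for key, paths in grouped.items()}
```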

set_metadata_columns(columns: List[str])[source]¶

Use the manifest contents to attach metadata to the files found in the dataset.

Parameters: columns – A list of columns to use for metadata attachment.

Example row: {“CellId”: 1, “Structure”: “lysosome”, “2dReadPath”: “/allen…”, “3dReadPath”: “/allen…”}

Attach structure metadata: dataset.set_metadata_columns([“Structure”])

Results in the files found at the 2dReadPath and the 3dReadPath both having {“Structure”: “lysosome”} attached.

In short: the values in each column provided will be used for metadata attachment for every file found.
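The attachment rule can be sketched with plain dictionaries. The helper attach_metadata is hypothetical, and the file paths are placeholders because the real manifest paths are elided in the example above.

```python
from typing import Any, Dict, List, Tuple


def attach_metadata(
    row: Dict[str, Any], path_columns: List[str], metadata_columns: List[str]
) -> List[Tuple[str, Dict[str, Any]]]:
    """Hypothetical helper: pair each file in a row with the row's metadata."""
    meta = {col: row[col] for col in metadata_columns}
    # Every file column in the row receives its own copy of the same metadata.
    return [(row[col], dict(meta)) for col in path_columns]


# Placeholder paths; they are not the elided paths from the example row.
row = {
    "CellId": 1,
    "Structure": "lysosome",
    "2dReadPath": "placeholder_2d.tiff",
    "3dReadPath": "placeholder_3d.tiff",
}
attached = attach_metadata(row, ["2dReadPath", "3dReadPath"], ["Structure"])
```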

set_path_columns(columns: List[str])[source]¶

Explicit override for which columns will be used for file distribution.

Parameters: columns – A list of columns to use for file distribution.

quilt3distribute.documentation module¶

class quilt3distribute.documentation.README(fp: Union[str, pathlib.Path])[source]¶

Bases: object

Initialize a README object.

Parameters: fp – Filepath to a markdown readme document.

append_readme_standards(usage_doc_or_link: Union[str, pathlib.Path, None] = None, license_doc_or_link: Union[str, pathlib.Path, None] = None) → str[source]¶

Attach a standard document or link to the readme. If the provided value is an external resource, a default message is attached before linking to the external resource. Additionally, updates the underlying text attribute for this object to retain prior document attachments.

Parameters

usage_doc_or_link – A document or link to external resource with details on dataset usage.
license_doc_or_link – A document or link to external resource with details on licensing.

Returns

The entire contents of the readme returned as a string.

property fp¶

property referenced_files¶

property text¶

class quilt3distribute.documentation.ReferencedFiles(target, resolved)[source]¶

Bases: tuple

Create new instance of ReferencedFiles(target, resolved)

property resolved¶: Alias for field number 1

property target¶: Alias for field number 0
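The tuple base and the "Alias for field number" properties suggest a named tuple. The sketch below re-creates an equivalent type for illustration only; the field values are hypothetical.

```python
from collections import namedtuple

# Re-created here for illustration; the real class lives in
# quilt3distribute.documentation.
ReferencedFiles = namedtuple("ReferencedFiles", ["target", "resolved"])

# Hypothetical values for the two fields.
ref = ReferencedFiles(target="images/fig.png", resolved="/abs/images/fig.png")
target, resolved = ref  # unpacks in field order: target (0), resolved (1)
```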

quilt3distribute.file_utils module¶

quilt3distribute.file_utils.create_unique_logical_key(physical_key: Union[str, pathlib.Path]) → str[source]¶

quilt3distribute.validation module¶

class quilt3distribute.validation.FeatureDefinition(dtype: Type, validation_functions: Union[List[Callable], Tuple[Callable], None] = None, cast_values: bool = False, display_name: Optional[str] = None, description: Optional[str] = None, units: Optional[str] = None)[source]¶

Bases: object

Initialize a new feature definition. A feature definition can be as simple as providing a data type (dtype) or can be incredibly specific by including validation and cleaning operations or providing metadata. If dtype of pathlib.Path is provided, cast_values is automatically set to True.

Parameters

dtype – The data type for the feature.
validation_functions – A list or tuple of callable functions to validate each instance of the feature.
cast_values – Whether to attempt to cast an instance of the feature to the provided dtype when it doesn’t match that dtype.
display_name – Metadata attachment for a display name to be given to the feature.
description – Metadata attachment for a description for the feature.
units – Metadata attachment for unit details for the feature.
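The rule that a pathlib.Path dtype forces cast_values to True can be sketched with a dataclass. FeatureDefinitionSketch is a hypothetical, simplified stand-in and not the library's class; it only mirrors the parameters listed above.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional, Tuple, Type


@dataclass
class FeatureDefinitionSketch:
    """Hypothetical, simplified stand-in for FeatureDefinition."""

    dtype: Type
    validation_functions: Tuple[Callable, ...] = ()
    cast_values: bool = False
    display_name: Optional[str] = None
    description: Optional[str] = None
    units: Optional[str] = None

    def __post_init__(self) -> None:
        # Mirrors the documented rule: Path features are always cast.
        if self.dtype is Path:
            self.cast_values = True
```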

exception quilt3distribute.validation.PlannedDelayedDropError(message: str, **kwargs)[source]¶: Bases: Exception

class quilt3distribute.validation.PlannedDelayedDropResult(index, error)[source]¶

Bases: tuple

Create new instance of PlannedDelayedDropResult(index, error)

property error¶: Alias for field number 1

property index¶: Alias for field number 0

class quilt3distribute.validation.Schema(features: Dict[str, quilt3distribute.validation.ValidatedFeature])[source]¶

Bases: object

A schema is the summation of multiple validated and unvalidated features for a Dataset. It provides helpful methods for viewing which features have and have not been validated and with which data types, functions, and metadata.

Parameters: features – A dictionary mapping the dataset manifest column names to ValidatedFeatures.

property df¶

property features¶

property unvalidated¶

property validated¶

class quilt3distribute.validation.ValidatedDataset(data, schema)[source]¶

Bases: tuple

Create new instance of ValidatedDataset(data, schema)

property data¶: Alias for field number 0

property schema¶: Alias for field number 1

class quilt3distribute.validation.ValidatedFeature(name: str, dtype: Type, display_name: Optional[str] = None, description: Optional[str] = None, units: Optional[str] = None, validation_functions: Optional[Tuple[Callable]] = None, errored_results: Optional[Set[quilt3distribute.validation.PlannedDelayedDropResult]] = None)[source]¶

Bases: object

A feature that has its core validation attributes locked but its metadata freely mutable.

Parameters

name – The name for the feature in the dataset (usually this is the column).
dtype – A single data type for the feature.
display_name – A display name for the feature.
description – A description for the feature.
units – Units for the feature.
validation_functions – The tuple of validation functions run against the feature values.
errored_results – An optional set of PlannedDelayedDropResults that errored out during validation.

property dtype¶

property errored_results¶

property name¶

to_dict() → Dict[str, Union[str, Type, Tuple[Callable]]][source]¶

property validation_functions¶

class quilt3distribute.validation.ValidationReturn(name, feature)[source]¶

Bases: tuple

Create new instance of ValidationReturn(name, feature)

property feature¶: Alias for field number 1

property name¶: Alias for field number 0

class quilt3distribute.validation.Validator(name: str, values: numpy.ndarray, definition: quilt3distribute.validation.FeatureDefinition, drop_on_error: bool = False)[source]¶

Bases: object

A container to manage feature values and feature definition that can actually process (validate) the feature.

Parameters

name – The name of the feature (usually this is the column name).
values – The np.ndarray of feature values.
definition – The feature definition to validate against.
drop_on_error – Whether to drop the row and continue validation when an error occurs during validation.

process(progress_bar: Optional[tqdm.std.tqdm] = None) → quilt3distribute.validation.ValidatedFeature[source]¶

Use the feature definition stored on this object to attempt to validate the feature.

Parameters: progress_bar – An optional tqdm progress bar to update as the values are processed.
Returns: A ValidatedFeature object representing that this feature has been checked.
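How dtype checks, casting, validation functions, and drop_on_error might interact is sketched below. This is an illustration under stated assumptions, not the library's code: process_values is hypothetical, validation functions are assumed to return a boolean, and errors are recorded as (index, message) pairs in the spirit of PlannedDelayedDropResult.

```python
from typing import Any, Callable, List, Sequence, Tuple, Type


def process_values(
    values: Sequence[Any],
    dtype: Type,
    validation_functions: Sequence[Callable[[Any], bool]] = (),
    cast_values: bool = False,
    drop_on_error: bool = False,
) -> Tuple[List[Any], List[Tuple[int, str]]]:
    """Hypothetical sketch: return (kept values, [(index, error message), ...])."""
    kept: List[Any] = []
    errored: List[Tuple[int, str]] = []
    for index, value in enumerate(values):
        try:
            if not isinstance(value, dtype):
                if not cast_values:
                    raise TypeError(f"{value!r} is not of type {dtype.__name__}")
                value = dtype(value)  # attempt the cast
            for func in validation_functions:
                if not func(value):
                    raise ValueError(f"{value!r} failed {func.__name__}")
            kept.append(value)
        except (TypeError, ValueError) as err:
            if not drop_on_error:
                raise
            # Record the failure and continue with the next value.
            errored.append((index, str(err)))
    return kept, errored
```

With drop_on_error left at False, the first failing value raises immediately instead of being recorded.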

quilt3distribute.validation.validate(data: pandas.core.frame.DataFrame, schema: Optional[Dict[str, quilt3distribute.validation.FeatureDefinition]] = None, drop_on_error: bool = False, n_workers: Optional[int] = None, show_progress: bool = True) → quilt3distribute.validation.ValidatedDataset[source]¶

A function that validates a dataset against the proposed schema.

Parameters

data – A pandas dataframe to validate.
schema – The proposed schema to validate the dataset against. A dictionary mapping dataframe column names to FeatureDefinitions. If no schema is provided, _generate_schema_template is used to generate one for the data provided.
drop_on_error – Whether to drop the row and continue validation when an error occurs during validation.
n_workers – The number of threads to use during validation.
show_progress – Boolean option to show or hide progress bar.

Returns

A ValidatedDataset object that stores the cleaned copy of the data as well as the validated schema.

Validation isn’t a CPU-intensive task, so a thread pool is used rather than a process pool. The most intensive task is checking file existence.
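The reasoning for threads can be illustrated with the standard library. The sketch below is not the library's implementation: check_paths_exist is a hypothetical helper, and n_workers is borrowed from the parameter name above.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, Iterable, Optional


def check_paths_exist(paths: Iterable[str], n_workers: Optional[int] = None) -> Dict[str, bool]:
    """Hypothetical helper: check file existence concurrently with threads."""
    paths = list(paths)
    # Existence checks are I/O bound, so threads overlap the waiting time.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(lambda p: Path(p).exists(), paths)
        return dict(zip(paths, results))
```

A process pool would add pickling and start-up overhead without speeding up work that mostly waits on the filesystem.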

Module contents¶

Top-level package for quilt3distribute.

quilt3distribute.get_module_version()[source]¶