quilt3distribute package¶
Subpackages¶
Submodules¶
quilt3distribute.dataset module¶
-
class
quilt3distribute.dataset.
Dataset
(dataset: Union[str, pathlib.Path, pandas.core.frame.DataFrame], name: str, package_owner: str, readme_path: Union[str, pathlib.Path])[source]¶ Bases:
object
Initialize a dataset object.
- Parameters
dataset – Filepath or preloaded pandas dataframe.
name – A name for the dataset. May only contain alphabetic and underscore characters.
package_owner – The name of the dataset owner. To be attached to the dataset name.
readme_path – A path to a markdown README file.
-
add_license
(doc_or_link: Union[str, pathlib.Path])[source]¶ Add a document’s content, or a link to a publicly accessible resource, for license details.
- Parameters
doc_or_link – A filepath or string uri to a resource for license details.
Wrapper around quilt3distribute.documentation.README.append_readme_standards.
-
add_usage_doc
(doc_or_link: Union[str, pathlib.Path])[source]¶ Add a document’s content, or a link to a publicly accessible resource, for documentation and usage examples.
- Parameters
doc_or_link – A filepath or string uri to a resource detailing usage of this dataset.
Wrapper around quilt3distribute.documentation.README.append_readme_standards.
-
property
data
¶
-
distribute
(push_uri: Optional[str] = None, message: Optional[str] = None, attach_associates: bool = True) → quilt3.packages.Package[source]¶ Push a package to a specific S3 bucket. If no bucket is provided, the un-built, un-pushed package is returned. You can push a dataset with the same name to the same bucket multiple times: instead of overwriting a prior dataset, Quilt creates a new dataset version. Refer to the Quilt documentation for more details: https://docs.quiltdata.com
- Parameters
push_uri – The S3 bucket uri to push to. Example: “s3://quilt-jacksonb”
message – An optional message to attach to that version of the dataset.
attach_associates – Boolean option to attach associates as metadata to each file. Associates are used to retain quick navigation between related files.
- Returns
The built and optionally pushed quilt3.Package.
-
property
readme
¶
-
static
return_or_raise_approved_name
(name: str) → str[source]¶ Attempt to clean a string to match the pattern expected by Quilt 3. If the string still does not match the approved pattern after cleaning, a ValueError is raised.
- Parameters
name – String name to clean.
- Returns
Cleaned name.
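The clean-then-validate behavior described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name clean_name, the cleaning step (spaces and hyphens become underscores), and the exact pattern are all assumptions, with the pattern loosely based on the "alphabetic and underscore characters" rule stated for the name parameter.

```python
import re

# Assumed pattern: alphabetic and underscore characters only (illustrative).
APPROVED_NAME = re.compile(r"[A-Za-z_]+")


def clean_name(name: str) -> str:
    """Hypothetical helper: clean a name, then return it or raise."""
    # Assumed cleaning step: runs of whitespace or hyphens become one underscore.
    cleaned = re.sub(r"[\s\-]+", "_", name.strip())
    # If the cleaned name still does not fit the pattern, give up loudly.
    if not APPROVED_NAME.fullmatch(cleaned):
        raise ValueError(f"Name {name!r} could not be cleaned to an approved form.")
    return cleaned


print(clean_name("my test-dataset"))  # prints "my_test_dataset"
```

Raising instead of silently returning a partially cleaned name keeps invalid names from reaching the push step.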
-
set_column_names_map
(columns: Dict[str, str])[source]¶ Explicitly override the directory labels derived from column names during file distribution. For example, if a column (“2dReadPath”) is detected to contain files, those files are placed in a package directory called “2dReadPath”. This function lets you override those directory names.
- Parameters
columns – A mapping from the current names of columns that contain files to the desired directory labels.
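As a rough sketch of the relabeling idea, the function below resolves the directory label for a file from a column-name map. The helper name logical_key, the "label/filename" layout, and the example path are assumptions made for illustration; the library may organize package contents differently.

```python
from pathlib import Path
from typing import Dict


def logical_key(column: str, file_path: str, names_map: Dict[str, str]) -> str:
    """Hypothetical helper: place a file under its (possibly relabeled) column directory."""
    # Fall back to the column name itself when no override is given.
    label = names_map.get(column, column)
    return f"{label}/{Path(file_path).name}"


names_map = {"2dReadPath": "images_2d"}
print(logical_key("2dReadPath", "example/cell_1.png", names_map))  # images_2d/cell_1.png
print(logical_key("3dReadPath", "example/cell_1.tiff", names_map))  # 3dReadPath/cell_1.tiff
```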
-
set_extra_files
(files: Union[List[Union[str, pathlib.Path]], Dict[str, List[Union[str, pathlib.Path]]]])[source]¶ Datasets commonly have extra or supporting files. Any file passed to this function will be added to the requested directory.
- Parameters
files – When provided a list of string or Path objects, all paths in the list are sent to the same logical key, “supporting_files”. When provided a dictionary mapping strings to lists of string or Path objects, the paths are placed in logical keys labeled by their dictionary keys.
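The two accepted input shapes can be reduced to one internal form, roughly as sketched here. The helper name normalize_extra_files is hypothetical and the code is an assumption about how such normalization could work; only the default key "supporting_files" comes from the description above.

```python
from pathlib import Path
from typing import Dict, List, Union

PathLike = Union[str, Path]


def normalize_extra_files(
    files: Union[List[PathLike], Dict[str, List[PathLike]]]
) -> Dict[str, List[Path]]:
    """Hypothetical helper: group extra files by logical key."""
    if isinstance(files, dict):
        # Dictionary input: each key becomes its own logical key.
        return {key: [Path(p) for p in paths] for key, paths in files.items()}
    # List input: everything goes under the single default key.
    return {"supporting_files": [Path(p) for p in files]}


print(normalize_extra_files(["notes.txt"]))
print(normalize_extra_files({"scripts": ["run.py"], "docs": ["guide.md"]}))
```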
-
set_metadata_columns
(columns: List[str])[source]¶ Use the manifest contents to attach metadata to the files found in the dataset.
- Parameters
columns – A list of columns to use for metadata attachment.
Example row: {"CellId": 1, "Structure": "lysosome", "2dReadPath": "/allen…", "3dReadPath": "/allen…"}
Attach structure metadata: dataset.set_metadata_columns(["Structure"])
Result: the files found at 2dReadPath and 3dReadPath both have {"Structure": "lysosome"} attached.
In short: the values in each column provided will be used for metadata attachment for every file found.
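The per-row behavior in the example can be mimicked with plain dictionaries. This is only a sketch: attach_metadata is a hypothetical helper, the file paths are placeholders, and the real package attaches metadata to package entries rather than returning a dictionary.

```python
from typing import Any, Dict, List


def attach_metadata(
    row: Dict[str, Any], file_columns: List[str], metadata_columns: List[str]
) -> Dict[str, Dict[str, Any]]:
    """Hypothetical helper: map each file in the row to the selected metadata."""
    meta = {col: row[col] for col in metadata_columns}
    # Every file found in the row receives its own copy of the same metadata.
    return {row[col]: dict(meta) for col in file_columns}


row = {
    "CellId": 1,
    "Structure": "lysosome",
    "2dReadPath": "example/2d.png",   # placeholder path
    "3dReadPath": "example/3d.tiff",  # placeholder path
}
result = attach_metadata(row, ["2dReadPath", "3dReadPath"], ["Structure"])
print(result)
```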
quilt3distribute.documentation module¶
-
class
quilt3distribute.documentation.
README
(fp: Union[str, pathlib.Path])[source]¶ Bases:
object
Initialize a README object.
- Parameters
fp – Filepath to a markdown readme document.
-
append_readme_standards
(usage_doc_or_link: Union[str, pathlib.Path, None] = None, license_doc_or_link: Union[str, pathlib.Path, None] = None) → str[source]¶ Attach a standard document or link to the readme. If the provided value is an external resource, a default message is attached before linking to the external resource. Additionally, updates the underlying text attribute for this object to retain prior document attachments.
- Parameters
usage_doc_or_link – A document or link to external resource with details on dataset usage.
license_doc_or_link – A document or link to external resource with details on licensing.
- Returns
The entire contents of the readme returned as a string.
-
property
fp
¶
-
property
referenced_files
¶
-
property
text
¶
quilt3distribute.file_utils module¶
quilt3distribute.validation module¶
-
class
quilt3distribute.validation.
FeatureDefinition
(dtype: Type, validation_functions: Union[List[Callable], Tuple[Callable], None] = None, cast_values: bool = False, display_name: Optional[str] = None, description: Optional[str] = None, units: Optional[str] = None)[source]¶ Bases:
object
Initialize a new feature definition. A feature definition can be as simple as a data type (dtype), or it can be highly specific, including validation and cleaning operations or metadata. If a dtype of pathlib.Path is provided, cast_values is automatically set to True.
- Parameters
dtype – The data type for the feature.
validation_functions – A list or tuple of callable functions to validate each instance of the feature.
cast_values – Whether to attempt casting an instance of the feature to the provided dtype when it does not already match that dtype.
display_name – Metadata attachment for a display name to be given to the feature.
description – Metadata attachment for a description for the feature.
units – Metadata attachment for unit details for the feature.
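To make the interplay of dtype, cast_values, and validation_functions concrete, here is a small stand-alone sketch. It does not use the library: check_value is a hypothetical helper, and the convention that a validation function returns a truthy value on success is an assumption.

```python
from typing import Any, Callable, Sequence, Type


def check_value(
    value: Any,
    dtype: Type,
    validation_functions: Sequence[Callable[[Any], bool]] = (),
    cast_values: bool = False,
) -> Any:
    """Hypothetical helper: type-check, optionally cast, then validate one value."""
    if not isinstance(value, dtype):
        if not cast_values:
            raise TypeError(f"{value!r} is not of type {dtype.__name__}")
        # Casting may itself raise if the value cannot be converted.
        value = dtype(value)
    for func in validation_functions:
        # Assumed convention: a falsy return value means validation failed.
        if not func(value):
            raise ValueError(f"{value!r} failed validation")
    return value


print(check_value("3", int, cast_values=True))         # 3
print(check_value(3, int, [lambda v: v > 0]))          # 3
```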
-
exception
quilt3distribute.validation.
PlannedDelayedDropError
(message: str, **kwargs)[source]¶ Bases:
Exception
-
class
quilt3distribute.validation.
PlannedDelayedDropResult
(index, error)[source]¶ Bases:
tuple
Create new instance of PlannedDelayedDropResult(index, error)
-
property
error
¶ Alias for field number 1
-
property
index
¶ Alias for field number 0
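The tuple base class and the "Alias for field number" descriptions are what collections.namedtuple generates, so a result of this shape can presumably be used by field name, by position, or by unpacking. The stand-in below uses a hypothetical name (DropResult) with the same documented field order; it is not imported from the library.

```python
from collections import namedtuple

# Same field order as documented: index is field 0, error is field 1.
DropResult = namedtuple("DropResult", ["index", "error"])

result = DropResult(index=3, error=ValueError("missing file"))

# Access by name and by position refer to the same values.
print(result.index == result[0])  # True
print(result.error is result[1])  # True

# Tuple unpacking works as well.
index, error = result
print(index)  # 3
```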
-
class
quilt3distribute.validation.
Schema
(features: Dict[str, quilt3distribute.validation.ValidatedFeature])[source]¶ Bases:
object
A schema is the summation of multiple validated and unvalidated features for a Dataset. It provides helpful methods for viewing which features have and have not been validated and with which data types, functions, and metadata.
- Parameters
features – A dictionary mapping the dataset manifest column names to ValidatedFeatures.
-
property
df
¶
-
property
features
¶
-
property
unvalidated
¶
-
property
validated
¶
-
class
quilt3distribute.validation.
ValidatedDataset
(data, schema)[source]¶ Bases:
tuple
Create new instance of ValidatedDataset(data, schema)
-
property
data
¶ Alias for field number 0
-
property
schema
¶ Alias for field number 1
-
class
quilt3distribute.validation.
ValidatedFeature
(name: str, dtype: Type, display_name: Optional[str] = None, description: Optional[str] = None, units: Optional[str] = None, validation_functions: Optional[Tuple[Callable]] = None, errored_results: Optional[Set[quilt3distribute.validation.PlannedDelayedDropResult]] = None)[source]¶ Bases:
object
A feature that has its core validation attributes locked but its metadata freely mutable.
- Parameters
name – The name for the feature in the dataset (usually this is the column).
dtype – A single data type for the feature.
display_name – A display name for the feature.
description – A description for the feature.
units – Units for the feature.
validation_functions – The tuple of validation functions run against the feature values.
errored_results – An optional set of PlannedDelayedDropResults that errored out during validation.
-
property
dtype
¶
-
property
errored_results
¶
-
property
name
¶
-
property
validation_functions
¶
-
class
quilt3distribute.validation.
ValidationReturn
(name, feature)[source]¶ Bases:
tuple
Create new instance of ValidationReturn(name, feature)
-
property
feature
¶ Alias for field number 1
-
property
name
¶ Alias for field number 0
-
class
quilt3distribute.validation.
Validator
(name: str, values: numpy.ndarray, definition: quilt3distribute.validation.FeatureDefinition, drop_on_error: bool = False)[source]¶ Bases:
object
A container to manage feature values and feature definition that can actually process (validate) the feature.
- Parameters
name – The name of the feature (usually this is the column name).
values – The np.ndarray of feature values.
definition – The feature definition to validate against.
drop_on_error – Whether to drop the row and continue validation when an error occurs during validation.
-
process
(progress_bar: Optional[tqdm.std.tqdm] = None) → quilt3distribute.validation.ValidatedFeature[source]¶ Use the feature definition stored on this object to attempt to validate the feature.
- Parameters
progress_bar – An optional tqdm progress bar to update as the values are processed.
- Returns
A ValidatedFeature object representing that this feature has been checked.
-
quilt3distribute.validation.
validate
(data: pandas.core.frame.DataFrame, schema: Optional[Dict[str, quilt3distribute.validation.FeatureDefinition]] = None, drop_on_error: bool = False, n_workers: Optional[int] = None, show_progress: bool = True) → quilt3distribute.validation.ValidatedDataset[source]¶ A function that validates a dataset against the proposed schema.
- Parameters
data – A pandas dataframe to validate.
schema – The proposed schema to validate the dataset against: a dictionary mapping dataframe column names to FeatureDefinitions. If no schema is provided, one is generated for the data with _generate_schema_template.
drop_on_error – Whether to drop the row and continue validation when an error occurs during validation.
n_workers – The number of threads to use during validation.
show_progress – Boolean option to show or hide progress bar.
- Returns
A ValidatedDataset object that stores the cleaned copy of the data as well as the validated schema.
Validation isn’t a CPU-intensive task, so a thread pool is used rather than a process pool. The most intensive task is checking file existence.
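The threading choice can be illustrated with a stand-alone sketch that runs file existence checks in a thread pool. This is not the library's code: check_files_exist is a hypothetical helper, and n_workers is simply passed through as the pool size (None lets Python pick a default).

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, Iterable, Optional


def check_files_exist(
    paths: Iterable[str], n_workers: Optional[int] = None
) -> Dict[str, bool]:
    """Hypothetical helper: check file existence concurrently with threads."""
    path_list = list(paths)
    # Existence checks wait on the filesystem, so threads can overlap the waiting.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        exists = list(pool.map(lambda p: Path(p).exists(), path_list))
    return dict(zip(path_list, exists))
```

Threads suit this workload because each check spends its time waiting on I/O rather than computing, so a process pool would add start-up and serialization cost without much benefit.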