pugh_torch.datasets package

Submodules

pugh_torch.datasets.base module

Design philosophies/rules:
  • All datasets in this repo are subclasses of Dataset.

  • All paths are pathlib.Path objects.
    • If something cannot handle it as a Path object, cast it to a string as late as possible.

  • Whenever possible, require the least amount of effort on the dev’s part to get a dataset downloaded and properly formatted.

  • Dataset directories are automatically parsed/derived, so no need to prompt the developer on where they want their dataset files.

  • self.transform is ONLY ever used in the dev’s implementation of self.__getitem__. However, the albumentations package does a great job, so when in doubt, assume this is an albumentations.Compose.
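The "cast as late as possible" rule can be sketched as follows. This is a minimal illustration, not code from the package: payload_path, legacy_open, and the "datasets"/"example" directory names are hypothetical stand-ins.

```python
from pathlib import Path


def payload_path(root: Path, name: str) -> Path:
    """Build a file location while staying in pathlib.Path objects."""
    # "example" is a hypothetical dataset folder name.
    return root / "example" / name


def legacy_open(filename: str) -> str:
    """Hypothetical stand-in for a third-party API that only accepts str."""
    if not isinstance(filename, str):
        raise TypeError("expected a str path")
    return filename


root = Path("datasets")  # hypothetical root; the package derives the real one
payload = payload_path(root, "nyu_depth_v2_labeled.mat")

# Cast to str only at the call site of the string-only API.
opened = legacy_open(str(payload))
```

Everything upstream of the call keeps the Path object, so joins and attributes such as payload.name stay available until the last moment.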

To implement your own dataset:
  1. Subclass the pugh_torch.datasets.Dataset class. This class itself is a subclass of torch.utils.data.Dataset.

  2. Implement the download method:
    def download(self):
        # the local folder (guaranteed to exist) is self.path
        ...

    This will only be called if the downloaded data isn’t available. Whether the download is available is determined by a sentinel “downloaded” file.

  3. Implement the unpack method:
    def unpack(self):
        # the local folder (guaranteed to exist) is self.path
        ...

    This will only be called if the data hasn’t been unpacked yet. Whether the data has been unpacked is determined by a sentinel “unpacked” file.

  4. Follow the remaining instructions at:

    https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset

  5. Registration, path-handling, and all of that other stuff is automatically handled.
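The steps above can be sketched with a self-contained stand-in. SentinelDataset below is a hypothetical imitation of the sentinel-file behaviour described here, not the real pugh_torch.datasets.Dataset: it takes the path explicitly (the real class derives it automatically), and ToyDataset with its payload.txt and items.txt files is invented for illustration.

```python
import tempfile
from pathlib import Path


class SentinelDataset:
    """Hypothetical stand-in for pugh_torch.datasets.Dataset (not the real class)."""

    def __init__(self, path):
        self.path = Path(path)
        self.path.mkdir(parents=True, exist_ok=True)  # self.path always exists
        if not self.downloaded_file.exists():
            self.download()
            self.downloaded_file.touch()  # mark the download as complete
        if not self.unpacked_file.exists():
            self.unpack()
            self.unpacked_file.touch()  # mark the unpacking as complete

    @property
    def downloaded_file(self):
        return self.path / "downloaded"

    @property
    def unpacked_file(self):
        return self.path / "unpacked"


class ToyDataset(SentinelDataset):
    calls = []  # records which hooks actually ran

    def download(self):
        # A real dataset would fetch a payload from the network here.
        (self.path / "payload.txt").write_text("1\n2\n3\n")
        self.calls.append("download")

    def unpack(self):
        lines = (self.path / "payload.txt").read_text().split()
        (self.path / "items.txt").write_text(",".join(lines))
        self.calls.append("unpack")

    def __len__(self):
        return len((self.path / "items.txt").read_text().split(","))

    def __getitem__(self, index):
        return int((self.path / "items.txt").read_text().split(",")[index])


with tempfile.TemporaryDirectory() as tmp:
    first = ToyDataset(tmp)
    second = ToyDataset(tmp)  # sentinels exist, so nothing is re-run
    n_items, last_item = len(first), first[2]
    calls = list(ToyDataset.calls)
```

Constructing the dataset a second time against the same folder skips both hooks, which is the point of the sentinel files.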

class pugh_torch.datasets.base.Dataset(split='train', *, transform=None, **kwargs)[source]

Bases: torch.utils.data.dataset.Dataset

Attempts to download data.

Parameters
  • split (str) – One of {“train”, “val”, “test”}. Which data partition to use. Case insensitive.

  • transform (obj) – Whatever format you want. Depends on the dataset’s __getitem__ implementation. Defaults to just a ToTensor transform. This attribute is NOT used anywhere except in the dataset-specific __getitem__ implementation, or in other parent classes of the dataset.

download()[source]

Function to download data to self.path.

The directories up to self.path have already been created.

Will only be called if data has not been downloaded.

property downloaded

We detect if the data has been fully downloaded by a “downloaded” file in the root of the data directory.

property downloaded_file
property path

pathlib.Path to the root of the stored data

unpack()[source]

Post-process the downloaded payload.

Typically this will be something like unpacking a tar file, or possibly re-arranging files.

property unpacked

We detect if the data has been fully unpacked by an “unpacked” file in the root of the data directory.

property unpacked_file

pugh_torch.datasets.nyuv2 module

class pugh_torch.datasets.nyuv2.NYUv2(*args, raw_depth=False, types=['rgb', 'depth'], transform=None, **kwargs)[source]

Bases: pugh_torch.datasets.base.Dataset

rgb (np.array, uint8)

Images in RGB order

depth (np.array, float32)

Depth in meters

Parameters
  • raw_depth (bool) – Return the depth data before invalid areas were infilled. Defaults to False.

  • types (list of str) – Data types to return.

DOWNLOAD_URL = 'http://horatio.cs.nyu.edu/mit/silberman/nyu_depth_v2/nyu_depth_v2_labeled.mat'
K = array([[518.85790117,   0.        , 325.58244941],
           [  0.        , 519.46961112, 253.73616633],
           [  0.        ,   0.        ,   1.        ]])
K4 = array([518.85790117, 519.46961112, 325.58244941, 253.73616633])
PAYLOAD_NAME = 'nyu_depth_v2_labeled.mat'
available_types = {'depth', 'instances', 'labels', 'rgb'}
cx = 325.58244941119034
cy = 253.73616633400465
download()[source]

Function to download data to self.path.

The directories up to self.path have already been created.

Will only be called if data has not been downloaded.

fx = 518.8579011745019
fy = 519.4696111212749
unpack()[source]

No unpacking necessary
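As a sketch of how the intrinsics listed above might be used, the following back-projects a pixel with a metric depth into a camera-frame 3D point. It assumes the standard pinhole camera model, with fx, fy as focal lengths and cx, cy as the principal point; the backproject and project helpers are hypothetical and not part of the NYUv2 class.

```python
# Intrinsics copied from the NYUv2 class attributes.
fx, fy = 518.8579011745019, 519.4696111212749
cx, cy = 325.58244941119034, 253.73616633400465


def backproject(u, v, depth_m):
    """Map pixel (u, v) with depth in meters to a camera-frame (x, y, z) point."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return x, y, depth_m


def project(x, y, z):
    """Inverse of backproject: map a camera-frame point back to pixel coordinates."""
    u = fx * x / z + cx
    v = fy * y / z + cy
    return u, v


point = backproject(400.0, 300.0, 2.0)
pixel = project(*point)  # round-trips to approximately (400.0, 300.0)
center = backproject(cx, cy, 1.0)  # the principal point lies on the optical axis
```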

pugh_torch.datasets.torchvision module

Lightly wraps torchvision datasets.

This just allows us greater customization without modifying another repo.

Most notably, this:
  • Automatically gets the torchvision dataset constructor based on name

  • Moves the transform responsibility to us

  • Applies our automatic opinionated pathing rules.

class pugh_torch.datasets.torchvision.TorchVisionDataset(*args, **kwargs)[source]

Bases: pugh_torch.datasets.base.Dataset

Attempts to download data.

Parameters
  • split (str) – One of {“train”, “val”, “test”}. Which data partition to use. Case insensitive.

  • transform (obj) – Whatever format you want. Depends on the dataset’s __getitem__ implementation. Defaults to just a ToTensor transform. This attribute is NOT used anywhere except in the dataset-specific __getitem__ implementation, or in other parent classes of the dataset.

auto_construct = True
property class_to_idx
property classes
download()[source]

Handled by the torchvision dataset

unpack()[source]

Handled by the torchvision dataset

Module contents

pugh_torch.datasets.__init__

The root dataset path can be set via the environment variable PUGH_TORCH_DATASETS_PATH.

I don’t expose this in code because I think it just clutters the code.
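A minimal sketch of setting the variable from Python follows. The directory name is hypothetical, and the assumption that the variable must be set before pugh_torch is imported or first used is mine, not something stated here; exporting it in the shell before launching Python avoids the question entirely.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical location; any writable directory would do.
dataset_root = Path(tempfile.gettempdir()) / "pugh_torch_datasets"

# Environment variables are strings, so this is the one place the Path is cast.
# Assumption: set this before importing or using pugh_torch.
os.environ["PUGH_TORCH_DATASETS_PATH"] = str(dataset_root)

# Reading it back the way a consumer might.
root = Path(os.environ["PUGH_TORCH_DATASETS_PATH"])
```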

pugh_torch.datasets.get(*args)[source]

Gets dataset constructor from string identifiers

Example:

constructor = get("classification", "imagenet")

Parameters

*args (str) – Case-insensitive strings that lead to a dataset, typically in the form (genre, name), where genre is the type of dataset, e.g. “classification”.
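The case-insensitive lookup can be illustrated with a toy registry. This is a sketch of the idea only: _REGISTRY, register, and the placeholder ImageNet class are hypothetical, and the package's actual registration mechanism (which is automatic) may work differently.

```python
_REGISTRY = {}  # hypothetical: maps lower-cased identifier tuples to constructors


def register(genre, name, constructor):
    """Store a constructor under a lower-cased (genre, name) key."""
    _REGISTRY[(genre.lower(), name.lower())] = constructor


def get(*args):
    """Look up a constructor from case-insensitive string identifiers."""
    key = tuple(arg.lower() for arg in args)
    try:
        return _REGISTRY[key]
    except KeyError:
        raise KeyError(f"No dataset registered under {key}") from None


class ImageNet:
    """Placeholder dataset class for the example."""


register("classification", "imagenet", ImageNet)

# Any capitalisation resolves to the same constructor.
constructor = get("Classification", "ImageNet")
```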

pugh_torch.datasets.get_dataset(*args)

Gets dataset constructor from string identifiers

Example:

constructor = get("classification", "imagenet")

Parameters

*args (str) – Case-insensitive strings that lead to a dataset, typically in the form (genre, name), where genre is the type of dataset, e.g. “classification”.