Datasets

Download and store datasets

MedMNIST Datasets


download_medmnist

 download_medmnist (dataset:str, output_dir:str='.',
                    download_only:bool=False, save_images:bool=True)

Downloads the specified MedMNIST dataset and saves the training, validation, and test datasets into the specified output directory. Images are saved as .png for 2D data and multi-page .tiff for 3D data, organized into folders named after their labels.

Returns None. If save_images is True, the images are written to the specified output directory.

| Parameter | Type | Default | Details |
|---|---|---|---|
| dataset | str | | The name of the MedMNIST dataset (e.g., 'pathmnist', 'bloodmnist'). |
| output_dir | str | . | The path to the directory where the datasets will be saved. |
| download_only | bool | False | If True, only download the dataset into the output directory without processing. |
| save_images | bool | True | If True, save the images into the output directory as .png (2D datasets) or multipage .tiff (3D datasets) files. |
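
The label-based folder layout described above can be sketched with the standard library alone. This is only an illustration: the helper name `image_target_path`, the per-split subfolder, and the index-based file names are assumptions, not the layout that download_medmnist is guaranteed to produce.

```python
from pathlib import Path

def image_target_path(output_dir, split, label, index, is_3d=False):
    """Return a file path for one image, grouped by split and label (hypothetical layout)."""
    # 2D images are stored as .png, 3D volumes as multi-page .tiff
    suffix = ".tiff" if is_3d else ".png"
    return Path(output_dir) / split / str(label) / f"{index}{suffix}"

path_2d = image_target_path("data/bloodmnist", "train", 3, 17)
path_3d = image_target_path("data/organmnist3d", "test", 0, 5, is_3d=True)
print(path_2d.as_posix())
print(path_3d.as_posix())
```

Grouping files by label folder keeps the saved images compatible with loaders that infer the class from the parent directory name.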

medmnist2df

 medmnist2df (train_dataset, val_dataset=None, test_dataset=None,
              mode='RGB')

Convert MedMNIST datasets to DataFrames, with images as PIL Image objects and labels as DataFrame columns.

A dataset passed as None is returned as None in the corresponding position of the return tuple.

| Parameter | Type | Default | Details |
|---|---|---|---|
| train_dataset | | | MedMNIST training dataset with images and labels |
| val_dataset | NoneType | None | (Optional) MedMNIST validation dataset with images and labels |
| test_dataset | NoneType | None | (Optional) MedMNIST test dataset with images and labels |
| mode | str | RGB | Mode for PIL Image conversion, e.g., 'RGB', 'L' |
| Returns | (pandas.DataFrame, pandas.DataFrame, pandas.DataFrame) | | (df_train, df_val, df_test): DataFrames with columns 'image' and 'label' |

Download data via Pooch


download_file

 download_file (url, output_dir='data', extract=True, hash=None,
                extract_dir=None)

Download and optionally decompress a single file using Pooch.

| Parameter | Type | Default | Details |
|---|---|---|---|
| url | | | The URL of the file to be downloaded |
| output_dir | str | data | The directory where the downloaded file will be saved |
| extract | bool | True | If True, decompresses the file if it's in a compressed format |
| hash | NoneType | None | Optional checksum for integrity verification |
| extract_dir | NoneType | None | Directory to extract the files to |
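
Pooch accepts checksums written as an "algorithm:hexdigest" string. The sketch below shows one way to compute such a string for a local file with hashlib. The helper name `md5_known_hash` is hypothetical, and whether download_file forwards its hash argument to Pooch unchanged is an assumption.

```python
import hashlib
import tempfile
from pathlib import Path

def md5_known_hash(path, chunk_size=1 << 20):
    """Return the MD5 checksum of a file as an 'md5:<hexdigest>' string."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so large files do not have to fit in memory
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return f"md5:{digest.hexdigest()}"

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello")
    known_hash = md5_known_hash(sample)
print(known_hash)  # md5:5d41402abc4b2a76b9719d911017c592
```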

download_dataset

 download_dataset (base_url, expected_checksums, file_names, output_dir,
                   processor=None)

Download a dataset using Pooch and save it to the specified output directory.

| Parameter | Type | Default | Details |
|---|---|---|---|
| base_url | | | The base URL from which the files will be downloaded. |
| expected_checksums | | | A dictionary mapping file names to their expected checksums. |
| file_names | | | A dictionary mapping task identifiers to file names. |
| output_dir | | | The directory where the downloaded files will be saved. |
| processor | NoneType | None | A function to process the downloaded data. |
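
To make the shapes of the two dictionaries concrete, here is a minimal sketch of how they could be combined into one download plan. The helper `build_download_plan`, the example URL, the task names, and the checksum values are all hypothetical. The sketch illustrates the inputs only and does not reproduce the internals of download_dataset.

```python
def build_download_plan(base_url, expected_checksums, file_names):
    """Combine task -> file name and file name -> checksum into one plan."""
    plan = {}
    for task, name in file_names.items():
        plan[task] = {
            "url": base_url.rstrip("/") + "/" + name,
            "checksum": expected_checksums[name],  # KeyError if a file has no checksum
        }
    return plan

# Hypothetical inputs, for illustration only
file_names = {"train": "train.zip", "test": "test.zip"}
expected_checksums = {"train.zip": "md5:aaa111", "test.zip": "md5:bbb222"}
plan = build_download_plan("https://example.org/data/", expected_checksums, file_names)
print(plan["train"])
```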

download_dataset_from_csv

 download_dataset_from_csv (csv_file, base_url, output_dir,
                            processor=None, rows=None, prepend_mdf5=True)

Download a dataset using Pooch and save it to the specified output directory, reading file names and checksums from a CSV file.

| Parameter | Type | Default | Details |
|---|---|---|---|
| csv_file | | | Path to the CSV file containing file names and checksums. |
| base_url | | | The base URL from which the files will be downloaded. |
| output_dir | | | The directory where the downloaded files will be saved. |
| processor | NoneType | None | A function to process the downloaded data. |
| rows | NoneType | None | Specific row indices to download. If None, download all rows. |
| prepend_mdf5 | bool | True | If True, prepend 'md5:' to the checksums. |

# Specify the directory where you want to save the downloaded files
output_directory = "./_test_folder"
# Define the base URL for the FMD dataset
base_url = 'https://s3.ap-northeast-1.wasabisys.com/gigadb-datasets/live/pub/10.5524/100001_101000/100888/'

download_dataset_from_csv('./data_examples/FMD_dataset_info.csv', base_url, output_directory, rows=[6])
The dataset has been successfully downloaded and saved to: ./_test_folder
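
The row selection and the checksum prefix can be illustrated without any network access. In this sketch the column names `file_name` and `checksum`, the zero-based row indexing, and the helper `read_registry` are assumptions, since the layout of the CSV file is not shown here.

```python
import csv
import io

def read_registry(csv_text, rows=None, prepend_md5=True):
    """Map file names to checksums, optionally keeping only the given row indices."""
    reader = csv.DictReader(io.StringIO(csv_text))
    registry = {}
    for index, record in enumerate(reader):
        if rows is not None and index not in rows:
            continue  # skip rows that were not requested
        checksum = record["checksum"]
        if prepend_md5:
            checksum = "md5:" + checksum
        registry[record["file_name"]] = checksum
    return registry

# Hypothetical CSV content, for illustration only
csv_text = "file_name,checksum\na.zip,aaa111\nb.zip,bbb222\nc.zip,ccc333\n"
registry = read_registry(csv_text, rows=[1])
print(registry)  # {'b.zip': 'md5:bbb222'}
```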

Download data via Quilt/T4

Allen Institute Cell Science (AICS)


aics_pipeline

 aics_pipeline (n_images_to_download=40, image_save_dir=None,
                col='SourceReadPath')

| Parameter | Type | Default | Details |
|---|---|---|---|
| n_images_to_download | int | 40 | Number of images to download |
| image_save_dir | NoneType | None | Directory to save the images |
| col | str | SourceReadPath | Column name for image paths in the data manifest |

image_target_paths, data_manifest = aics_pipeline(1, "../_data/aics")
Loading manifest: 100%|██████████| 77165/77165 [00:01<00:00, 45.2k/s]
print(image_target_paths)
data_manifest.to_csv('../_data/aics/aics_dataset.csv')
['../_data/aics/9e5d8f2e_3500001004_100X_20170623_5-Scene-1-P24-E06.czi_nucWholeIndexImageScale.tiff', '../_data/aics/77a69ff1_3500001004_100X_20170623_5-Scene-3-P26-F05.czi_nucWholeIndexImageScale.tiff']
image_target_paths, data_manifest = aics_pipeline(1, "../_data/aics", col="NucleusSegmentationReadPath")
Loading manifest: 100%|██████████| 77165/77165 [00:01<00:00, 46.5k/s]
100%|██████████| 491k/491k [00:02<00:00, 171kB/s] 

Dataset Manifest

Utilities for listing the files of the train and test datasets in CSV form.


manifest2csv

 manifest2csv (signal, target, paths=None, train_fraction=0.8,
               data_save_path='./', train='train.csv', test='test.csv',
               identifier=None)

| Parameter | Type | Default | Details |
|---|---|---|---|
| signal | | | List of paths to signal images |
| target | | | List of paths to target images |
| paths | NoneType | None | List of paths to images |
| train_fraction | float | 0.8 | Fraction of data to use for training |
| data_save_path | str | ./ | Path to save the CSV files |
| train | str | train.csv | Name of the training CSV file |
| test | str | test.csv | Name of the test CSV file |
| identifier | NoneType | None | Identifier to add to the paths |

manifest2csv(data_manifest["ChannelNumberBrightfield"], data_manifest["ChannelNumber405"], image_target_paths, data_save_path='./data_examples/')
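
The effect of train_fraction can be sketched with the standard library. The helper `write_split_csvs`, the random shuffle, the fixed seed, and the column names `signal` and `target` are assumptions made for illustration. manifest2csv may order rows and name columns differently.

```python
import csv
import random
import tempfile
from pathlib import Path

def write_split_csvs(signal, target, train_fraction=0.8, save_dir=".", seed=0):
    """Pair signal/target entries, shuffle them, and write train.csv and test.csv."""
    pairs = list(zip(signal, target))
    random.Random(seed).shuffle(pairs)  # assumption: the split is random
    n_train = int(len(pairs) * train_fraction)
    splits = {"train.csv": pairs[:n_train], "test.csv": pairs[n_train:]}
    for name, rows in splits.items():
        with open(Path(save_dir) / name, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(["signal", "target"])  # hypothetical column names
            writer.writerows(rows)
    return n_train, len(pairs) - n_train

signal_paths = [f"signal_{i}.tiff" for i in range(10)]
target_paths = [f"target_{i}.tiff" for i in range(10)]
with tempfile.TemporaryDirectory() as tmp:
    n_train, n_test = write_split_csvs(signal_paths, target_paths, save_dir=tmp)
    train_lines = (Path(tmp) / "train.csv").read_text().splitlines()
print(n_train, n_test)  # 8 2
```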

split_dataframe

 split_dataframe (input_data, train_fraction=0.7, valid_fraction=0.1,
                  split_column=None, stratify=False, add_is_valid=False,
                  train_path='train.csv', test_path='test.csv',
                  valid_path='valid.csv', data_save_path=None)

Splits a DataFrame or CSV file into train, test, and optional validation sets.

| Parameter | Type | Default | Details |
|---|---|---|---|
| input_data | | | Path to CSV file or DataFrame |
| train_fraction | float | 0.7 | Proportion of data to use for the training set |
| valid_fraction | float | 0.1 | Proportion of data to use for the validation set |
| split_column | NoneType | None | Column name that indicates pre-defined split |
| stratify | bool | False | If True, stratify by split_column during random split |
| add_is_valid | bool | False | If True, adds 'is_valid' column in the train set to mark validation samples |
| train_path | str | train.csv | Path to save the training CSV file |
| test_path | str | test.csv | Path to save the test CSV file |
| valid_path | str | valid.csv | Path to save the validation CSV file |
| data_save_path | NoneType | None | Path to save the data files |
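
With the default fractions, 70% of the rows go to training and 10% to validation. The sketch below assumes that the remaining 20% forms the test set and that the split is a plain random shuffle. The helper `split_rows` is hypothetical and ignores split_column and stratify.

```python
import random

def split_rows(rows, train_fraction=0.7, valid_fraction=0.1, seed=0):
    """Shuffle rows and cut them into train, validation, and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    n_valid = round(len(shuffled) * valid_fraction)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # assumption: test gets the remainder
    return train, valid, test

train, valid, test = split_rows(range(100))
print(len(train), len(valid), len(test))  # 70 10 20

# One way to mimic add_is_valid: keep validation rows in the train table, flagged
train_with_flag = (
    [{"id": r, "is_valid": False} for r in train]
    + [{"id": r, "is_valid": True} for r in valid]
)
```

Keeping the validation rows in the training table with an is_valid flag is convenient for loaders that expect a single file with a split column.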

add_columns_to_csv

 add_columns_to_csv (csv_path, column_data, output_path=None)

Adds one or more new columns to an existing CSV file.

| Parameter | Type | Default | Details |
|---|---|---|---|
| csv_path | | | Path to the input CSV file |
| column_data | | | Dictionary of column names and values to add. Each value can be a scalar (single value for all rows) or a list matching the number of rows. |
| output_path | NoneType | None | Path to save the updated CSV file. If None, it overwrites the input CSV file. |
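
The scalar-versus-list behaviour of column_data can be sketched on an in-memory CSV. The helper `add_columns` is hypothetical and works on strings rather than files. Raising an error when a list has the wrong length is an assumption about reasonable behaviour, not a documented guarantee of add_columns_to_csv.

```python
import csv
import io

def add_columns(csv_text, column_data):
    """Return CSV text with new columns; scalars are repeated, lists are matched per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    fieldnames = list(reader.fieldnames or [])
    for name, value in column_data.items():
        is_list = isinstance(value, (list, tuple))
        if is_list and len(value) != len(rows):
            raise ValueError(f"column {name!r} has {len(value)} values for {len(rows)} rows")
        for i, row in enumerate(rows):
            row[name] = value[i] if is_list else value
        if name not in fieldnames:
            fieldnames.append(name)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

updated = add_columns("path\na.tif\nb.tif\n", {"source": "aics", "fold": [0, 1]})
print(updated)
```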