Datasets

Download and store datasets

MedMNIST Datasets


source

download_medmnist


def download_medmnist(
    dataset:str, # The name of the MedMNIST dataset (e.g., 'pathmnist', 'bloodmnist', etc.).
    output_dir:str='.', # The path to the directory where the datasets will be saved.
    download_only:bool=False, # If True, only download the dataset into the output directory without processing.
    save_images:bool=True, # If True, save the images into the output directory as .png (2D datasets) or multipage .tiff (3D datasets) files.
):

Downloads the specified MedMNIST dataset and saves the training, validation, and test datasets into the specified output directory. Images are saved as .png for 2D data and multi-page .tiff for 3D data, organized into folders named after their labels.

Returns: None. Images are saved to the specified output directory when save_images is True.
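The label-folder layout described above can be sketched in a few lines. This is a simplified illustration, not the library's implementation: `save_by_label` is a hypothetical helper, and it writes `.npy` files to keep the sketch dependency-free where the real function writes `.png`/`.tiff`.

```python
import numpy as np
from pathlib import Path

def save_by_label(images, labels, output_dir):
    """Sketch: group images into folders named after their labels,
    mirroring how download_medmnist organizes its output."""
    out = Path(output_dir)
    paths = []
    for i, (img, lab) in enumerate(zip(images, labels)):
        folder = out / str(int(lab))          # one folder per label
        folder.mkdir(parents=True, exist_ok=True)
        p = folder / f"img_{i}.npy"           # .png in the real function
        np.save(p, img)
        paths.append(p)
    return paths

imgs = np.zeros((4, 28, 28), dtype=np.uint8)  # stand-in for MedMNIST images
labs = [0, 1, 0, 2]
paths = save_by_label(imgs, labs, "./_sketch_medmnist")
print(sorted({p.parent.name for p in paths}))  # ['0', '1', '2']
```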


source

medmnist2df


def medmnist2df(
    train_dataset, # MedMNIST training dataset with images and labels
    val_dataset:NoneType=None, # (Optional) MedMNIST validation dataset with images and labels
    test_dataset:NoneType=None, # (Optional) MedMNIST test dataset with images and labels
    mode:str='RGB', # Mode for PIL Image conversion, e.g., 'RGB', 'L'
)->(pd.DataFrame, pd.DataFrame, pd.DataFrame): # (df_train, df_val, df_test): DataFrames with columns 'image' and 'label'

Convert MedMNIST datasets to DataFrames, with images as PIL Image objects and labels as DataFrame columns.

Missing datasets (if None) are represented by None in the return tuple.
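The resulting DataFrame shape can be sketched as follows. `arrays_to_df` is a hypothetical stand-in: it keeps the raw arrays in the 'image' column, whereas medmnist2df converts each array to a PIL Image using the given mode.

```python
import numpy as np
import pandas as pd

def arrays_to_df(images, labels):
    """Sketch of the medmnist2df layout: one row per sample, with an
    'image' column and a flattened 'label' column."""
    return pd.DataFrame({
        "image": list(images),                 # PIL Images in the real function
        "label": np.asarray(labels).ravel(),   # MedMNIST labels arrive as (n, 1)
    })

df_train = arrays_to_df(np.zeros((3, 28, 28), np.uint8), [[0], [1], [2]])
print(df_train.columns.tolist())  # ['image', 'label']
```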

Download data via Pooch


source

download_file


def download_file(
    url, # The URL of the file to be downloaded
    output_dir:str='data', # The directory where the downloaded file will be saved
    extract:bool=True, # If True, decompresses the file if it's in a compressed format
    hash:NoneType=None, # Optional: You can add a checksum for integrity verification
    extract_dir:NoneType=None, # Directory to extract the files to
):

Download and optionally decompress a single file using Pooch.
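The optional hash guards against corrupted downloads: Pooch compares the file's checksum against the one you supply. The check itself can be sketched with the standard library; `verify_checksum` is a hypothetical helper, not part of this package or of Pooch's public API.

```python
import hashlib

def verify_checksum(path, expected):
    """Sketch: compare a file's digest to an '<algo>:<hexdigest>'
    string such as 'md5:...', the format Pooch accepts."""
    algo, _, digest = expected.partition(":")
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest() == digest

with open("_sketch.bin", "wb") as f:
    f.write(b"hello")
print(verify_checksum("_sketch.bin", "md5:5d41402abc4b2a76b9719d911017c592"))  # True
```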


source

download_files


def download_files(
    urls, # A list of URLs to download
    output_dir:str='data', # The directory or list of directories where the downloaded files will be saved
    extract:bool=True, # If True, decompresses the files if they are in a compressed format
    hash_list:NoneType=None, # Optional: A list of checksums for integrity verification corresponding to each URL
    extract_dir:NoneType=None, # Directory to extract the files to
):

Download and optionally decompress multiple files using Pooch.
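Because output_dir may be a single directory or a list, and hash_list may be omitted, the per-file bookkeeping can be sketched as below. `pair_downloads` is a hypothetical helper illustrating the argument normalization, not the function's actual implementation.

```python
from itertools import repeat

def pair_downloads(urls, output_dir, hash_list=None):
    """Sketch: normalize download_files' arguments into one
    (url, directory, checksum) job per file."""
    dirs = output_dir if isinstance(output_dir, list) else repeat(output_dir)
    hashes = hash_list if hash_list is not None else repeat(None)
    return list(zip(urls, dirs, hashes))

jobs = pair_downloads(["u1", "u2"], "data")
print(jobs)  # [('u1', 'data', None), ('u2', 'data', None)]
```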


source

download_dataset


def download_dataset(
    base_url, # The base URL from which the files will be downloaded.
    expected_checksums, # A dictionary mapping file names to their expected checksums.
    file_names, # A dictionary mapping task identifiers to file names.
    output_dir, # The directory where the downloaded files will be saved.
    processor:NoneType=None, # A function to process the downloaded data.
):

Download a dataset using Pooch and save it to the specified output directory.
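The relationship between the two dictionaries can be sketched as follows: file_names resolves a task identifier to a file name, and expected_checksums resolves that file name to its checksum. `build_registry` and the URL/checksum values are hypothetical, for illustration only.

```python
def build_registry(base_url, expected_checksums, file_names):
    """Sketch: derive the (url, checksum) pair fetched for each task."""
    return {
        task: (base_url + fname, expected_checksums[fname])
        for task, fname in file_names.items()
    }

reg = build_registry(
    "https://example.org/data/",       # hypothetical base URL
    {"task01.tar": "md5:abc123"},      # hypothetical checksum
    {"Task01": "task01.tar"},
)
print(reg)  # {'Task01': ('https://example.org/data/task01.tar', 'md5:abc123')}
```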


source

download_dataset_from_csv


def download_dataset_from_csv(
    csv_file, # Path to the CSV file containing file names and checksums.
    base_url, # The base URL from which the files will be downloaded.
    output_dir, # The directory where the downloaded files will be saved.
    processor:NoneType=None, # A function to process the downloaded data.
    rows:NoneType=None, # Specific row indices to download. If None, download all rows.
    prepend_mdf5:bool=True, # If True, prepend 'md5:' to the checksums.
):

Download a dataset using Pooch and save it to the specified output directory, reading file names and checksums from a CSV file.

# Specify the directory where you want to save the downloaded files
output_directory = "./_test_folder"
# Define the base URL for the MSD dataset
base_url = 'https://s3.ap-northeast-1.wasabisys.com/gigadb-datasets/live/pub/10.5524/100001_101000/100888/'

download_dataset_from_csv('./data_examples/FMD_dataset_info.csv', base_url, output_directory, rows=[6])
The dataset has been successfully downloaded and saved to: ./_test_folder
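The CSV parsing behind this call, including the prepend_mdf5 behavior, can be sketched as below. `read_manifest` is a hypothetical helper, and the assumption that the first two CSV columns hold the file name and checksum is mine, not the library's documented contract.

```python
import pandas as pd

def read_manifest(csv_file, prepend_md5=True):
    """Sketch: read (file name, checksum) rows and optionally prefix
    each checksum with 'md5:', the format Pooch expects."""
    df = pd.read_csv(csv_file)
    names = df.iloc[:, 0].tolist()
    sums = df.iloc[:, 1].astype(str).tolist()
    if prepend_md5:
        sums = [s if s.startswith("md5:") else f"md5:{s}" for s in sums]
    return dict(zip(names, sums))

with open("_sketch_manifest.csv", "w") as f:
    f.write("file_name,checksum\na.tar,abc\nb.tar,md5:def\n")
checks = read_manifest("_sketch_manifest.csv")
print(checks)  # {'a.tar': 'md5:abc', 'b.tar': 'md5:def'}
```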

Download data via Quilt/T4

Allen Institute Cell Science (AICS)


source

aics_pipeline


def aics_pipeline(
    n_images_to_download:int=40, # Number of images to download
    image_save_dir:NoneType=None, # Directory to save the images
    col:str='SourceReadPath', # Column name for image paths in the data manifest
):
image_target_paths, data_manifest = aics_pipeline(1, "../_data/aics")
Loading manifest: 100%|██████████| 77165/77165 [00:01<00:00, 56.4k/s]
print(image_target_paths)
['../_data/aics/6677e50c_3500001004_100X_20170623_5-Scene-1-P24-E06.ome.tiff']
data_manifest.to_csv('../_data/aics/aics_dataset.csv')
image_target_paths, data_manifest = aics_pipeline(1, "../_data/aics", col="NucleusSegmentationReadPath")
Loading manifest: 100%|██████████| 77165/77165 [00:01<00:00, 60.2k/s]

Dataset Manifest

Utilities to list all files of the training and test datasets in CSV form.


source

manifest2csv


def manifest2csv(
    signal, # List of paths to signal images
    target, # List of paths to target images
    paths:NoneType=None, # List of paths to images
    train_fraction:float=0.8, # Fraction of data to use for training
    data_save_path:str='./', # Path to save the CSV files
    train:str='train.csv', # Name of the training CSV file
    test:str='test.csv', # Name of the test CSV file
    identifier:NoneType=None, # Identifier to add to the paths
):
manifest2csv(data_manifest["ChannelNumberBrightfield"], data_manifest["ChannelNumber405"], image_target_paths, data_save_path='./data_examples/')
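The core of this pairing-and-splitting step can be sketched as follows. `split_manifest` is a hypothetical simplification that returns DataFrames instead of writing train.csv/test.csv, and its shuffling strategy is an assumption, not the library's exact behavior.

```python
import random
import pandas as pd

def split_manifest(signal, target, train_fraction=0.8, seed=0):
    """Sketch: pair signal/target paths row-wise, shuffle, and split
    into train/test portions by train_fraction."""
    df = pd.DataFrame({"signal": list(signal), "target": list(target)})
    idx = list(df.index)
    random.Random(seed).shuffle(idx)
    n_train = int(len(idx) * train_fraction)
    return df.loc[idx[:n_train]], df.loc[idx[n_train:]]

train, test = split_manifest([f"s{i}.tiff" for i in range(10)],
                             [f"t{i}.tiff" for i in range(10)])
print(len(train), len(test))  # 8 2
```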

source

split_dataframe


def split_dataframe(
    input_data, # Path to CSV file or DataFrame
    train_fraction:float=0.7, # Proportion of data to use for the training set
    valid_fraction:float=0.1, # Proportion of data to use for the validation set
    split_column:NoneType=None, # Column name that indicates pre-defined split
    stratify:bool=False, # If True, stratify by split_column during random split
    add_is_valid:bool=False, # If True, adds 'is_valid' column in the train set to mark validation samples
    train_path:str='train.csv', # Path to save the training CSV file
    test_path:str='test.csv', # Path to save the test CSV file
    valid_path:str='valid.csv', # Path to save the validation CSV file
    data_save_path:NoneType=None, # Path to save the data files
    random_seed:NoneType=None, # Random state for reproducibility
):

Splits a DataFrame or CSV file into train, test, and optional validation sets.
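The random-split path can be sketched as below. `three_way_split` is a hypothetical simplification: it returns the three DataFrames rather than writing CSV files, and it omits the split_column, stratify, and add_is_valid options.

```python
import pandas as pd

def three_way_split(df, train_fraction=0.7, valid_fraction=0.1, seed=42):
    """Sketch: shuffle, then slice into train/valid/test by fraction.
    The remainder after train and valid becomes the test set."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    n = len(shuffled)
    n_train = int(n * train_fraction)
    n_valid = int(n * valid_fraction)
    train = shuffled.iloc[:n_train]
    valid = shuffled.iloc[n_train:n_train + n_valid]
    test = shuffled.iloc[n_train + n_valid:]
    return train, valid, test

df = pd.DataFrame({"image": range(100)})
tr, va, te = three_way_split(df)
print(len(tr), len(va), len(te))  # 70 10 20
```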


source

add_columns_to_csv


def add_columns_to_csv(
    csv_path, # Path to the input CSV file
    column_data, # Dictionary of column names and values to add. Each value can be a scalar (single value for all rows) or a list matching the number of rows.
    output_path:NoneType=None, # Path to save the updated CSV file. If None, it overwrites the input CSV file.
):

Adds one or more new columns to an existing CSV file.
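The scalar-vs-list behavior of column_data can be sketched with pandas, which broadcasts a scalar to every row and aligns a list element-wise. `add_columns` is a hypothetical stand-in for the documented function.

```python
import pandas as pd

def add_columns(csv_path, column_data, output_path=None):
    """Sketch: add scalar or per-row columns to an existing CSV,
    overwriting it when output_path is None."""
    df = pd.read_csv(csv_path)
    for name, values in column_data.items():
        df[name] = values  # pandas broadcasts scalars, aligns lists
    df.to_csv(output_path or csv_path, index=False)

pd.DataFrame({"image": ["a.png", "b.png"]}).to_csv("_sketch_cols.csv", index=False)
add_columns("_sketch_cols.csv", {"label": [0, 1], "split": "train"})
result = pd.read_csv("_sketch_cols.csv")
print(result.columns.tolist())  # ['image', 'label', 'split']
```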


source

build_csv


def build_csv(
    filenames:Union, # List of file names to process
    functions:Callable, # One or more functions that take a filename and return a string (e.g., for generating target paths).
    function_names:Union=None, # Optional column names for the function outputs. If None, function.__name__ is used.
    output_csv:Union=None, # If provided, saves the full dataframe to this CSV path.
    split:bool=False, # If True, applies split_dataframe to the generated dataframe.
    split_kwargs:Optional=None, # Keyword arguments passed to split_dataframe.
)->Optional: # Returns the dataframe if split=False. If split=True, returns None (files are saved by split_dataframe).

Create a DataFrame from filenames and one or more transformation functions.
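The filename-plus-functions idea can be sketched as below. `build_table` and `target_path` are hypothetical: the sketch shows only the core of build_csv (one column per function, named after the function when no names are given), without the output_csv or split options.

```python
import pandas as pd

def build_table(filenames, functions, function_names=None):
    """Sketch: one row per filename, one derived column per
    transformation function applied to that filename."""
    if function_names is None:
        function_names = [f.__name__ for f in functions]
    data = {"filename": list(filenames)}
    for name, fn in zip(function_names, functions):
        data[name] = [fn(f) for f in filenames]
    return pd.DataFrame(data)

def target_path(f):  # hypothetical transformation: map raw paths to label paths
    return f.replace("raw", "labels")

df = build_table(["raw/a.tiff", "raw/b.tiff"], [target_path])
print(df["target_path"].tolist())  # ['labels/a.tiff', 'labels/b.tiff']
```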