Tensorflow Dataset Interleave Image File Paths for Reading - tensorflow-datasets

I am trying to use the tensorflow Dataset API to load tif images that can be on network storage. The map_func I pass to the tensorflow.Dataset.interleave function is not passed the string filename I expect, but instead a tensor with dtype string.
I have tried evaluating this tensor using sess.run and tensor.eval() (also passing in the current session as the session parameter), but TensorFlow raises a ValueError:
ValueError: Fetch argument cannot be interpreted as a Tensor. (Tensor Tensor("arg0:0", shape=(), dtype=string) is not an element of this graph.)
An example of my TensorFlow data pipeline:
class DataLoader:
    ...
    def setup(self):
        ...
        tf.data.Dataset.from_tensor_slices(
            (
                self.training_filenames,  # a python list of strings
                self.training_label_filenames  # a python list of strings
            )
        )
        .apply(tf.data.experimental.filter_for_shard(
            self.shard_count,
            self.shard_index))
        .repeat()
        .shuffle(buffer_size=self.training_data_shuffle_buffer_size)
        .interleave(
            lambda data_filepath, label_filepath: (
                self.preprocess_training_data(data_filepath, label_filepath)
            ),
            cycle_length=tf.data.experimental.AUTOTUNE,
            num_parallel_calls=tf.data.experimental.AUTOTUNE
        )
        .batch(self.training_data_batch_size)
        .prefetch(self.training_data_batch_size)
        ...
    def preprocess_training_data(self, data_filepath, label_filepath):
        data = tifffile.imread(self.session.run(data_filepath).decode())
        data_resize = (self.training_data_shape[0], self.training_data_shape[1])
        data_transpose = (1, 0, 2)
        data_scale = 255.0
        data_dtype = self.training_data_type.as_numpy_dtype()
        data = numpy.transpose(
            cv2.resize(data, data_resize), data_transpose
        ).astype(data_dtype) / data_scale
        label = tifffile.imread(self.session.run(label_filepath))
        label = numpy.transpose(
            numpy.expand_dims(
                cv2.resize(
                    label,
                    data_resize),
                2
            ),
            data_transpose
        )
        weights = [
            self.vec_class_weights(current_label)
            .astype(self.label_data_type.as_numpy_dtype())
            for current_label
            in label
        ]
        return data, label, weights
I expected preprocess_training_data to be passed Python strings, or at least to be able to evaluate the tensors it receives from the interleave transformation to obtain strings.
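One possible way around this, sketched under assumptions: functions passed to tf.data transformations are traced into a graph, so they receive symbolic tensors rather than Python strings, and interleave additionally expects its map_func to return a Dataset. Wrapping the Python-side loading in tf.py_function and using map instead hands the wrapped function eager tensors, whose .numpy() value can be decoded to a string. The sketch assumes TensorFlow 2.x with eager execution; read_pair, load_pair, the file names, and the 4x4 array shapes are hypothetical stand-ins for the tifffile/cv2 preprocessing.

```python
import numpy as np
import tensorflow as tf


def read_pair(data_path, label_path):
    """Plain Python loader; inside tf.py_function the arguments are eager tensors."""
    data_name = data_path.numpy().decode()
    label_name = label_path.numpy().decode()
    # Stand-in for tifffile.imread + cv2.resize: arrays filled with the name length
    data = np.full((4, 4, 3), len(data_name), dtype=np.float32)
    label = np.full((4, 4, 1), len(label_name), dtype=np.uint8)
    return data, label


def load_pair(data_path, label_path):
    """Graph-side wrapper that hands the string tensors to read_pair."""
    data, label = tf.py_function(
        func=read_pair,
        inp=[data_path, label_path],
        Tout=[tf.float32, tf.uint8],
    )
    # py_function loses static shape information, so restore it for batching
    data.set_shape([4, 4, 3])
    label.set_shape([4, 4, 1])
    return data, label


filenames = ["a.tif", "bb.tif"]  # hypothetical paths
label_filenames = ["a_label.tif", "bb_label.tif"]  # hypothetical paths

dataset = (
    tf.data.Dataset.from_tensor_slices((filenames, label_filenames))
    .map(load_pair)
    .batch(2)
)
first_data, first_label = next(iter(dataset))
```

In the original pipeline this would mean replacing the interleave call with a map call over such a wrapper and dropping the session.run calls from the preprocessing function.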

Related

Problem with following along with notebook on kaggle "max() received an invalid combination of arguments" issue

For study purposes I am following along with a very popular notebook for sentiment classification with BERT:
Kaggle notebook for sentiment classification with BERT
But instead of training the model as in the notebook, I just load another model:
MODEL_NAME = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
and want to test it on my data, to get a heatmap and accuracy score like the ones at the end of the notebook.
But when I reach the evaluation step I get
TypeError: max() received an invalid combination of arguments - got (SequenceClassifierOutput, dim=int), but expected one of:
* (Tensor input)
* (Tensor input, Tensor other, *, Tensor out)
* (Tensor input, int dim, bool keepdim, *, tuple of Tensors out)
* (Tensor input, name dim, bool keepdim, *, tuple of Tensors out)
in the evaluation function, at the line
_, preds = torch.max(outputs, dim=1)
I tried changing this to
_, preds = torch.max(torch.tensor(outputs), dim=1)
but then I got another error:
RuntimeError: Could not infer dtype of SequenceClassifierOutput
The evaluation function looks like this:
def eval_model(model, data_loader, loss_fn, device, n_examples):
    model = model.eval()
    losses = []
    correct_predictions = 0
    with torch.no_grad():
        for d in data_loader:
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            targets = d["targets"].to(device)
            # Get model outputs
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
            )
            _, preds = torch.max(outputs, dim=1)
            loss = loss_fn(outputs, targets)
            correct_predictions += torch.sum(preds == targets)
            losses.append(loss.item())
    return correct_predictions.double() / n_examples, np.mean(losses)
And outputs itself in the code above looks like this:
SequenceClassifierOutput(loss=None, logits=tensor([[ 2.2241, 1.2025, 0.1638, -1.4620, -1.6424],
[ 3.1578, 1.3957, -0.1131, -1.8141, -1.9536],
[ 0.7273, 1.7851, 1.1237, -0.9063, -2.3822],
[ 0.9843, 0.9711, 0.5067, -0.7553, -1.4547],
[-0.4127, -0.8895, 0.0572, 0.3550, 0.7377],
[-0.4885, 0.6933, 0.8272, -0.3176, -0.7546],
[ 1.3953, 1.4224, 0.7842, -0.9143, -2.2898],
[-2.4618, -1.2675, 0.5480, 1.4326, 1.2893],
[ 2.5044, 0.9191, -0.1483, -1.4413, -1.4156],
[ 1.3901, 1.0331, 0.4259, -0.8006, -1.6999],
[ 4.2252, 2.6539, -0.0392, -2.6362, -3.3261],
[ 1.9750, 1.8845, 0.6779, -1.3163, -2.5570],
[ 5.1688, 2.2360, -0.6230, -2.9657, -2.9031],
[ 1.1857, 0.4277, -0.1837, -0.7163, -0.6682],
[ 2.1133, 1.3829, 0.5750, -1.3095, -2.2234],
[ 2.3258, 0.9406, -0.0115, -1.1673, -1.6775]], device='cuda:0'), hidden_states=None, attentions=None)
How can I make it work?
Kind regards
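A possible fix, offered as a sketch: the error message shows that the model returns a SequenceClassifierOutput rather than a plain tensor, and the printed object has a logits field holding the score tensor, so torch.max and the loss function can be given outputs.logits. The snippet uses a small hypothetical stand-in class (FakeOutput) and made-up values instead of loading the real model.

```python
from dataclasses import dataclass

import torch
from torch import nn


@dataclass
class FakeOutput:
    """Hypothetical stand-in for SequenceClassifierOutput; only `logits` is modelled."""
    logits: torch.Tensor


# Two examples, five classes (values are made up for illustration)
outputs = FakeOutput(
    logits=torch.tensor([[2.2, 1.2, 0.2, -1.5, -1.6],
                         [-2.5, -1.3, 0.5, 1.4, 1.3]])
)
targets = torch.tensor([0, 3])
loss_fn = nn.CrossEntropyLoss()

# Pass the logits tensor, not the wrapper object
_, preds = torch.max(outputs.logits, dim=1)
loss = loss_fn(outputs.logits, targets)
correct_predictions = torch.sum(preds == targets)
```

In eval_model this would amount to replacing outputs with outputs.logits in the torch.max and loss_fn calls.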

Change all images in training set

I have a convolutional neural network, and I want to train it on images from the training set, but first they should be passed through my function change(tensor, float), which takes a tensor/image of shape [height, width, 3] and a float.
Batch size = 4
Loading data:
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=2)
CNN architecture
for epoch in range(2):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        # size of inputs [4,3,32,32]
        # size of labels [4]
        inputs = change(inputs, 0.1)  # <-- the call in question
        # zero the parameter gradients
        optimizer.zero_grad()
        # forward + backward + optimize
        outputs = net(inputs)  # [4, 10]
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:  # print every 2000 mini-batches
            print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}')
            running_loss = 0.0
print('Finished Training')
I am trying to apply the function change to the images, but it raises an object error.
Is there a quick way to fix it?
I am using a Julia function, but it works completely fine with other objects. Error message:
JULIA: MethodError: no method matching copy(::PyObject)
Closest candidates are:
copy(!Matched::T) where T<:SHA.SHA3_CTX at /opt/julia-1.7.2/share/julia/stdlib/v1.7/SHA/src/types.jl:213
copy(!Matched::T) where T<:SHA.SHA2_CTX at /opt/julia-1.7.2/share/julia/stdlib/v1.7/SHA/src/types.jl:212
copy(!Matched::Number) at /opt/julia-1.7.2/share/julia/base/number.jl:113
I would recommend putting the change function into the transforms list, so the data is changed at the transformation stage.
partial from functools lets you fix the number of arguments, like this:
from functools import partial
from torchvision.transforms import Compose, RandomResizedCrop, ToTensor

def change(input, float):
    pass

# Use partial to fix the float argument, so that change_partial accepts only input
change_partial = partial(change, float=pass_float_value_here)
# Add change_partial to the list of transforms, before or after converting to tensors
transforms = Compose([
    RandomResizedCrop(img_size),  # example
    # Add change_partial here if it operates on PIL Image
    change_partial,
    ToTensor(),  # convert to tensor
    # Add change_partial here if it operates on torch tensors
    change_partial,
])
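To make the role of partial concrete, here is a small self-contained sketch that needs neither torchvision nor Julia. The change function, the compose helper, and the 0.5 value are hypothetical stand-ins: change simply adds the float to every pixel of a nested list, and compose mimics how a transform list calls each step with a single argument.

```python
from functools import partial


def change(image, amount):
    """Hypothetical stand-in for the real function: add `amount` to every pixel."""
    return [[pixel + amount for pixel in row] for row in image]


def compose(steps):
    """Minimal imitation of a transform pipeline: each step receives one argument."""
    def run(image):
        for step in steps:
            image = step(image)
        return image
    return run


# Fix the second argument so the resulting callable accepts only the image
change_partial = partial(change, amount=0.5)

pipeline = compose([change_partial])
result = pipeline([[1.0, 2.0], [3.0, 4.0]])
print(result)
```

Because the pipeline only ever passes one argument to each step, a two-argument function has to be wrapped this way (or with a lambda) before it can be used as a transform.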

Properly evaluate a test dataset

I trained a machine translation model using huggingface library:
def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Some simple post-processing
    decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)
    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    result = {"bleu": result["score"]}
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
    result["gen_len"] = np.mean(prediction_lens)
    result = {k: round(v, 4) for k, v in result.items()}
    return result
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
trainer.train()
model_dir = './models/'
trainer.save_model(model_dir)
The code above is taken from this Google Colab notebook. After the training, I can see the trained model is saved to the folder models and the metric is calculated. Now I want to load the trained model and do the prediction on a new dataset, here is what I tried:
dataset = load_dataset('csv', data_files='data/training_data.csv')
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
# Tokenize the test dataset
tokenized_datasets = train_test.map(preprocess_function_v2, batched=True)
test_dataset = tokenized_datasets['test']
model = AutoModelForSeq2SeqLM.from_pretrained('models')
model(test_dataset)
It threw the following error:
*** AttributeError: 'Dataset' object has no attribute 'size'
I tried the evaluate() function as well, but it said:
*** torch.nn.modules.module.ModuleAttributeError: 'MarianMTModel' object has no attribute 'evaluate'
And the function eval only prints the configuration of the model.
What is the proper way to evaluate the performance of the trained model on a new dataset?
It turned out that the predictions can be produced using the following code:
inputs = tokenizer(
    questions,
    max_length=max_input_length,
    truncation=True,
    return_tensors='pt',
    padding=True).to('cuda')
translation = model.generate(**inputs)

Keras custom dataset with images as label/ground truth

I'm making a model for image denoising and use ImageDataGenerator.flow_from_directory to load the dataset. It is structured in two folders, one with noisy input images and one with the corresponding clean images. I want the generator to use the images in the first folder as inputs and the other folder as "labels"/ground truth.
With the method I'm using right now all images in both folders are treated as inputs with the folder name as label. I can extract the images manually by selecting specific batches and train on that, but it's inconvenient and probably wasn't intended to be used that way.
What is the proper way of doing this? There probably is a function for this but I can't find it.
Had similar problem. Found it necessary to create a custom generator to feed the images into model.fit. Code (rather lengthy) is posted below.
import os
import pandas as pd
import numpy as np
import glob
import cv2
from sklearn.model_selection import train_test_split
def create_df(image_dir, label_dir, shuffle=True):
    path=image_dir + '/**/*'
    image_file_paths=glob.glob(path,recursive=True)
    path=label_dir + '/**/*'
    label_file_paths=glob.glob(path,recursive=True)
    # run a check and make sure filename without extensions match
    df=pd.DataFrame({'image': image_file_paths, 'label':label_file_paths}).astype(str)
    if shuffle:
        df=df.sample(frac=1.0, replace=False, weights=None, random_state=123, axis=0).reset_index(drop=True)
    return df
class jpgen():
    batch_index=0 # tracks the number of batches generated
    def __init__(self, df, train_split=None, test_split=None):
        self.train_split=train_split # float between 0 and 1 indicating the fraction of images to use for training
        self.test_split=test_split
        self.df=df.copy() # create a copy of the data frame
        if self.train_split != None: # split the df to create a training df
            self.train_df, dummy_df=train_test_split(self.df, train_size=self.train_split, shuffle=False)
            if self.test_split !=None: # create a test set and a validation set
                t_split=self.test_split/(1.0-self.train_split)
                self.test_df, self.valid_df=train_test_split(dummy_df, train_size=t_split, shuffle=False)
                self.valid_gen_len=len(self.valid_df['image'].unique()) # create var to return no of samples in valid generator
                self.valid_gen_filenames=list(self.valid_df['image']) # create list of jpg file names in valid generator
            else: self.test_df=dummy_df
            self.test_gen_len=len(self.test_df['image'].unique()) # create var to return no of test samples
            self.test_gen_filenames=list(self.test_df['image']) # create list to return jpg file paths in test_gen
        else:
            self.train_df=self.df
        self.tr_gen_len=len(self.train_df['image'].unique()) # create variable to return no of samples in train generator
    def flow(self, batch_size=32, image_shape=None,rescale=None,shuffle=True, subset=None ):
        # flows batches of jpg images and png masks to model.fit
        self.batch_size=batch_size
        self.image_shape=image_shape
        self.shuffle=shuffle
        self.subset=subset
        self.rescale=rescale
        image_batch_list=[] # initialize list to hold a batch of jpg images
        label_batch_list=[] # initialize list to hold batches of png masks
        if self.subset=='training' or self.train_split ==None:
            op_df=self.train_df
        elif self.subset=='test':
            op_df=self.test_df
        else:
            op_df=self.valid_df
        if self.shuffle : # shuffle the op_df then reset the index
            op_df=op_df.sample(frac=1.0, replace=False, weights=None, random_state=123, axis=0).reset_index(drop=True)
        # op_df will be either train, test or valid depending on subset
        # develop the batch of data
        while True:
            label_batch_list=[]
            image_batch_list=[]
            start=jpgen.batch_index * self.batch_size # set start value of iteration
            end=start + self.batch_size # set end value of iteration to yield 1 batch of data of length batch_size
            sample_count=len(op_df['image'])
            for i in range(start, end): # iterate over one batch size of data
                j=i % sample_count # used to roll the images back to the front if the end is reached
                k=j % self.batch_size
                path_to_image= op_df.iloc[j]['image']
                path_to_label= op_df.iloc[j] ['label']
                label_image=cv2.imread(path_to_label, -1) # read unchanged to preserve the 4th channel
                label_image= cv2.cvtColor(label_image, cv2.COLOR_BGR2RGB)
                image_image=cv2.imread(path_to_image)
                image_image= cv2.cvtColor(image_image, cv2.COLOR_BGR2RGB)
                label_image=cv2.resize(label_image, self.image_shape)
                image_image=cv2.resize(image_image, self.image_shape )
                if rescale !=None:
                    label_image=label_image * self.rescale
                    image_image=image_image * self.rescale
                label_batch_list.append(label_image)
                image_batch_list.append(image_image)
            image_array=np.array(image_batch_list)
            label_array=np.array(label_batch_list)
            jpgen.batch_index +=1
            yield (image_array, label_array)
Code below shows how to use the functions above to make generators for model.fit
image_dir=r'C:\Temp\gen_test\images' # directory with the noisy input images
label_dir=r'C:\Temp\gen_test\labels' # directory with the clean target images, file names same as in the input dir
shuffle=False # if True shuffles the dataframe
df=create_df(image_dir, label_dir ,shuffle) # create a dataframe with columns 'image' , 'label'
# where the labels are the clean images
train_split=.8 # use 80% of files for training
test_split=.1 # use 10% for test, automatically sets validation split at 1-train_split-test_split
batch_size=32 # set batch_size
height=224 # set image height for generator output images and labels
width=224 # set image width for generator output images and labels
channels=3 # set number of channel in images
image_shape=(height, width)
rescale=1/255 # set value to rescale image pixels
gen=jpgen(df, train_split=train_split, test_split=test_split) # create instance of generator class
tr_gen_len=gen.tr_gen_len
test_gen_len= gen.test_gen_len
valid_gen_len=gen.valid_gen_len
test_filenames=gen.test_gen_filenames # file paths of the images used for testing
train_steps=tr_gen_len//batch_size # use this value in for steps_per_epoch in model.fit
valid_steps=valid_gen_len//batch_size # use this value for validation_steps in model.fit
test_steps=test_gen_len//batch_size # use this value for steps in model.predict
# instantiate generators
train_gen=gen.flow(batch_size=batch_size, image_shape=image_shape, rescale=rescale, shuffle=False, subset='training')
valid_gen=gen.flow(batch_size=batch_size, image_shape=image_shape, rescale=rescale, shuffle=False, subset='valid')
test_gen=gen.flow(batch_size=batch_size, image_shape=image_shape, rescale=rescale, shuffle=False, subset='test')
Build your model then use
history=model.fit(train_gen, epochs=epochs, steps_per_epoch=train_steps,validation_data=valid_gen,
                  validation_steps=valid_steps, verbose=1, shuffle=True)
predictions=model.predict(test_gen, steps=test_steps)
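The comment in create_df mentions checking that file names match, but the code relies on the two glob results lining up, and glob does not guarantee any particular order. A minimal sketch of pairing files by stem, using only the standard library, might look like this. The function name pair_by_stem and the demo file names are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def pair_by_stem(image_dir, label_dir):
    """Return sorted (image_path, label_path) pairs whose file stems match."""
    images = {p.stem: p for p in Path(image_dir).rglob("*") if p.is_file()}
    labels = {p.stem: p for p in Path(label_dir).rglob("*") if p.is_file()}
    # Stems present on only one side indicate a missing counterpart
    unmatched = set(images) ^ set(labels)
    if unmatched:
        raise ValueError(f"files without a counterpart: {sorted(unmatched)}")
    return [(str(images[s]), str(labels[s])) for s in sorted(images)]


# Small demonstration with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as root:
    image_dir = Path(root, "images")
    label_dir = Path(root, "labels")
    image_dir.mkdir()
    label_dir.mkdir()
    for name in ("b.jpg", "a.jpg"):
        (image_dir / name).touch()
    for name in ("a.png", "b.png"):
        (label_dir / name).touch()
    pairs = pair_by_stem(image_dir, label_dir)
    stems = [(Path(i).stem, Path(l).stem) for i, l in pairs]
```

The resulting pairs could then be used to fill the 'image' and 'label' columns of the DataFrame, so that each input is guaranteed to sit next to its own target.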

PyTorch custom dataset dataloader returns strings (of keys) not tensors

I am trying to load my own dataset with a custom Dataloader that reads in images and labels and converts them to PyTorch Tensors. However, when I iterate over the Dataloader, it returns the strings "image" for x and "labels" for y instead of the actual values or tensors.
print(self.train_loader) # shows a Tensor object
tic = time.time()
with tqdm(total=self.num_train) as pbar:
    for i, (x, y) in enumerate(self.train_loader): # x and y are returned as string (where it fails)
        if self.use_gpu:
            x, y = x.cuda(), y.cuda()
        x, y = Variable(x), Variable(y)
This is what dataloader.py looks like:
from __future__ import print_function, division #ds
import numpy as np
from utils import plot_images
import os #ds
import pandas as pd #ds
from skimage import io, transform #ds
import torch
from torchvision import datasets
from torch.utils.data import Dataset, DataLoader #ds
from torchvision import transforms
from torchvision import utils #ds
from torch.utils.data.sampler import SubsetRandomSampler
class CDataset(Dataset):
    def __init__(self, csv_file, root_dir, transform=None):
        """
        Args:
            csv_file (string): Path to the csv file with annotations.
            root_dir (string): Directory with all the images.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        self.frame = pd.read_csv(csv_file)
        self.root_dir = root_dir
        self.transform = transform
    def __len__(self):
        return len(self.frame)
    def __getitem__(self, idx):
        img_name = os.path.join(self.root_dir,
                                self.frame.iloc[idx, 0]+'.jpg')
        image = io.imread(img_name)
        # image = image.transpose((2, 0, 1))
        labels = np.array(self.frame.iloc[idx, 1])#.as_matrix() #ds
        #landmarks = landmarks.astype('float').reshape(-1, 2)
        #print(image.shape)
        #print(img_name,labels)
        sample = {'image': image, 'labels': labels}
        if self.transform:
            sample = self.transform(sample)
        return sample
class ToTensor(object):
    """Convert ndarrays in sample to Tensors."""
    def __call__(self, sample):
        image, labels = sample['image'], sample['labels']
        #print(image)
        #print(labels)
        # swap color axis because
        # numpy image: H x W x C
        # torch image: C X H X W
        image = image.transpose((2, 0, 1))
        #print(image.shape)
        #print((torch.from_numpy(image)))
        #print((torch.from_numpy(labels)))
        return {'image': torch.from_numpy(image),
                'labels': torch.from_numpy(labels)}
def get_train_valid_loader(data_dir,
                           batch_size,
                           random_seed,
                           #valid_size=0.1, #ds
                           #shuffle=True,
                           show_sample=False,
                           num_workers=4,
                           pin_memory=False):
    """
    Utility function for loading and returning train and valid
    multi-process iterators over the MNIST dataset. A sample
    9x9 grid of the images can be optionally displayed.
    If using CUDA, num_workers should be set to 1 and pin_memory to True.
    Args
    ----
    - data_dir: path directory to the dataset.
    - batch_size: how many samples per batch to load.
    - random_seed: fix seed for reproducibility.
    - #ds valid_size: percentage split of the training set used for
      the validation set. Should be a float in the range [0, 1].
      In the paper, this number is set to 0.1.
    - shuffle: whether to shuffle the train/validation indices.
    - show_sample: plot 9x9 sample grid of the dataset.
    - num_workers: number of subprocesses to use when loading the dataset.
    - pin_memory: whether to copy tensors into CUDA pinned memory. Set it to
      True if using GPU.
    Returns
    -------
    - train_loader: training set iterator.
    - valid_loader: validation set iterator.
    """
    #ds
    #error_msg = "[!] valid_size should be in the range [0, 1]."
    #assert ((valid_size >= 0) and (valid_size <= 1)), error_msg
    #ds
    # define transforms
    #normalize = transforms.Normalize((0.1307,), (0.3081,))
    trans = transforms.Compose([
        ToTensor(), #normalize,
    ])
    # load train dataset
    #train_dataset = datasets.MNIST(
    #    data_dir, train=True, download=True, transform=trans
    #)
    train_dataset = CDataset(csv_file='/home/Desktop/6June17/util/train.csv',
                             root_dir='/home/caffe/data/images/',transform=trans)
    # load validation dataset
    #valid_dataset = datasets.MNIST( #ds
    #    data_dir, train=True, download=True, transform=trans #ds
    #)
    valid_dataset = CDataset(csv_file='/home/Desktop/6June17/util/eval.csv',
                             root_dir='/home/caffe/data/images/',transform=trans)
    num_train = len(train_dataset)
    train_indices = list(range(num_train))
    #ds split = int(np.floor(valid_size * num_train))
    num_valid = len(valid_dataset) #ds
    valid_indices = list(range(num_valid)) #ds
    #if shuffle:
    #    np.random.seed(random_seed)
    #    np.random.shuffle(indices)
    #ds train_idx, valid_idx = indices[split:], indices[:split]
    train_idx = train_indices #ds
    valid_idx = valid_indices #ds
    train_sampler = SubsetRandomSampler(train_idx)
    valid_sampler = SubsetRandomSampler(valid_idx)
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size, sampler=train_sampler,
        num_workers=num_workers, pin_memory=pin_memory,
    )
    print(train_loader)
    valid_loader = torch.utils.data.DataLoader(
        valid_dataset, batch_size=batch_size, sampler=valid_sampler,
        num_workers=num_workers, pin_memory=pin_memory,
    )
    # visualize some images
    if show_sample:
        sample_loader = torch.utils.data.DataLoader(
            dataset, batch_size=9, #shuffle=shuffle,
            num_workers=num_workers, pin_memory=pin_memory
        )
        data_iter = iter(sample_loader)
        images, labels = data_iter.next()
        X = images.numpy()
        X = np.transpose(X, [0, 2, 3, 1])
        plot_images(X, labels)
    return (train_loader, valid_loader)
def get_test_loader(data_dir,
                    batch_size,
                    num_workers=4,
                    pin_memory=False):
    """
    Utility function for loading and returning a multi-process
    test iterator over the MNIST dataset.
    If using CUDA, num_workers should be set to 1 and pin_memory to True.
    Args
    ----
    - data_dir: path directory to the dataset.
    - batch_size: how many samples per batch to load.
    - num_workers: number of subprocesses to use when loading the dataset.
    - pin_memory: whether to copy tensors into CUDA pinned memory. Set it to
      True if using GPU.
    Returns
    -------
    - data_loader: test set iterator.
    """
    # define transforms
    #normalize = transforms.Normalize((0.1307,), (0.3081,))
    trans = transforms.Compose([
        ToTensor(), #normalize,
    ])
    # load dataset
    #dataset = datasets.MNIST(
    #    data_dir, train=False, download=True, transform=trans
    #)
    test_dataset = CDataset(csv_file='/home/Desktop/6June17/util/test.csv',
                            root_dir='/home/caffe/data/images/',transform=trans)
    test_loader = torch.utils.data.DataLoader(
        test_dataset, batch_size=batch_size, shuffle=False,
        num_workers=num_workers, pin_memory=pin_memory,
    )
    return test_loader
#for i_batch, sample_batched in enumerate(dataloader):
# print(i_batch, sample_batched['image'].size(),
# sample_batched['landmarks'].size())
# # observe 4th batch and stop.
# if i_batch == 3:
# plt.figure()
# show_landmarks_batch(sample_batched)
# plt.axis('off')
# plt.ioff()
# plt.show()
# break
A minimal working sample will be difficult to post here but basically I am trying to modify this project http://torch.ch/blog/2015/09/21/rmva.html which works smoothly with MNIST. I am just trying to run it with my own dataset with the custom dataloader.py I use above.
It instantiates a Dataloader like this:
in trainer.py:
if config.is_train:
    self.train_loader = data_loader[0]
    self.valid_loader = data_loader[1]
    self.num_train = len(self.train_loader.sampler.indices)
    self.num_valid = len(self.valid_loader.sampler.indices)
This is run from main.py:
if config.is_train:
    data_loader = get_train_valid_loader(
        config.data_dir, config.batch_size,
        config.random_seed, #config.valid_size,
        #config.shuffle,
        config.show_sample, **kwargs
    )
You are not unpacking the result of Python's enumerate() correctly. Each batch is a dictionary, so (x, y) are currently assigned its two keys, i.e. the strings "image" and "labels". This should solve your problem:
for i, batch in enumerate(self.train_loader):
    x, y = batch["image"], batch["labels"]
    # ...
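The behaviour can be reproduced without PyTorch. In this sketch a plain dictionary with made-up values stands in for one batch from the loader; unpacking a dict iterates over its keys, which is why x and y came out as strings.

```python
# A plain dict standing in for one batch produced by the DataLoader (values are made up)
batch = {"image": [[0.1, 0.2], [0.3, 0.4]], "labels": [1, 0]}

# Unpacking a dict iterates over its keys, not its values
key_x, key_y = batch
print(key_x, key_y)

# Indexing by key retrieves the actual data
x, y = batch["image"], batch["labels"]
print(y)
```

The same applies to the batches built from the dictionaries that CDataset returns: the default collation keeps the dictionary structure, so the values have to be looked up by key.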

Resources