What problem does a reinitializable iterator solve? - tensorflow-datasets

From the tf.data documentation:
A reinitializable iterator can be initialized from multiple different
Dataset objects. For example, you might have a training input pipeline
that uses random perturbations to the input images to improve
generalization, and a validation input pipeline that evaluates
predictions on unmodified data. These pipelines will typically use
different Dataset objects that have the same structure (i.e. the same
types and compatible shapes for each component).
The following example was given:
# Define training and validation datasets with the same structure.
training_dataset = tf.data.Dataset.range(100).map(
    lambda x: x + tf.random_uniform([], -10, 10, tf.int64))
validation_dataset = tf.data.Dataset.range(50)

# A reinitializable iterator is defined by its structure. We could use the
# `output_types` and `output_shapes` properties of either `training_dataset`
# or `validation_dataset` here, because they are compatible.
iterator = tf.data.Iterator.from_structure(training_dataset.output_types,
                                           training_dataset.output_shapes)
next_element = iterator.get_next()

training_init_op = iterator.make_initializer(training_dataset)
validation_init_op = iterator.make_initializer(validation_dataset)

# Run 20 epochs in which the training dataset is traversed, followed by the
# validation dataset.
for _ in range(20):
  # Initialize an iterator over the training dataset.
  sess.run(training_init_op)
  for _ in range(100):
    sess.run(next_element)

  # Initialize an iterator over the validation dataset.
  sess.run(validation_init_op)
  for _ in range(50):
    sess.run(next_element)
It is unclear what the benefit of this complexity is.
Why not simply create 2 different iterators?

The original motivation for reinitializable iterators was as follows:
The user's input data is in two or more tf.data.Dataset objects with the same structure but different pipeline definitions.
For example, you might have a training data pipeline with augmentations in a Dataset.map(), and an evaluation data pipeline that produces raw examples, but both would produce batches with the same structure (the same number of tensors, with matching element types and compatible shapes).
The user would define a single training graph that took input from a tf.data.Iterator, created using Iterator.from_structure().
The user could then switch between the different input data sources by reinitializing the iterator from one of the datasets.
In hindsight, reinitializable iterators have turned out to be quite hard to use for their intended purpose. In TensorFlow 2.0 (or 1.x with eager execution enabled), it is much easier to create iterators over different datasets using idiomatic Python for loops and high-level training APIs:
tf.enable_eager_execution()

model = ...       # A `tf.keras.Model`, or some other class exposing `fit()` and `evaluate()` methods.
train_data = ...  # A `tf.data.Dataset`.
eval_data = ...   # A `tf.data.Dataset`.

for i in range(NUM_EPOCHS):
  model.fit(train_data, ...)

  # Evaluate every 5 epochs.
  if i % 5 == 0:
    model.evaluate(eval_data, ...)
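If you prefer to write the loops yourself rather than rely on fit() and evaluate(), the same pattern works by iterating over each dataset directly with a Python for loop. Below is a minimal sketch assuming eager execution; train_step and eval_step are hypothetical stand-ins for your own per-batch logic:
for epoch in range(NUM_EPOCHS):
  # Iterate over the (augmented) training pipeline.
  for batch in train_data:
    train_step(batch)

  # Every 5 epochs, iterate over the unmodified evaluation pipeline.
  if epoch % 5 == 0:
    for batch in eval_data:
      eval_step(batch)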

Related

How can I use more than one additional regressor with DeepAREstimator in gluon-ts?

When creating training or test data in gluon-ts we can specify an additional real-valued regressor in the DeepAREstimator by specifying a feat_dynamic_real. Is there support for multiple real-valued regressors?
There is a one_dim_target flag in gluonts.dataset.common.ListDataset which is used to create the training/test data objects. This seems like it could be needed to support multiple additional regressors, however I couldn't find a good example on the intended usage.
Here is the set up for creating training data with one additional regressor:
training_data = ListDataset(
    [{"start": df.index[0],
      "target": df.values,
      "feat_dynamic_real": df['randomColumn'].values}],
    freq="5min", one_dim_target=False
)
and the Estimator:
from gluonts.model.deepar import DeepAREstimator
from gluonts.trainer import Trainer
estimator = DeepAREstimator(freq="5min", prediction_length=12, trainer=Trainer(epochs=10))
predictor = estimator.train(training_data=training_data)
I'm looking for the syntax/configuration needed for multiple regressors.
Yes, there is support for that. First off, Gluon TS refers to regressors as features and to the signal we are trying to predict as the target. Thus, the one_dim_target flag you mention relates to the dimension of the output, not the input.
Below is the code I use to associate a multi-dimensional feature (input) with each target signal (I use a one-dimensional target):
train_ds = ListDataset(
    [{FieldName.TARGET: target,
      FieldName.START: start,
      FieldName.FEAT_DYNAMIC_REAL: fdr}
     for (target, start, fdr) in zip(
         target,
         custom_ds_metadata['start'],
         feat_dynamic_real)],
    freq="5min")  # freq is required; "5min" here is a placeholder matching the question
In the zip function above:
target: a 1-dimensional numpy array containing the target signal, i.e., target has shape (1, #of time steps)
custom_ds_metadata['start']: a pandas date variable indicating the beginning of the data
feat_dynamic_real: a 2-dimensional numpy array containing the two feature signals, i.e., feat_dynamic_real has shape (#of features, #of time steps)
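To make the shapes concrete, here is a minimal sketch of how a two-regressor feat_dynamic_real could be assembled from the question's DataFrame; the column names ('randomColumn', 'otherColumn') are placeholders, not part of any GluonTS API:
import numpy as np

# Stack two regressor columns into an array of shape (#of features, #of time steps).
feat_dynamic_real = np.vstack([
    df['randomColumn'].values,   # first regressor
    df['otherColumn'].values,    # second regressor (placeholder column name)
])
# The target of the corresponding series must cover the same time steps.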

How to inject a zero-noise signal compact binary coalescence signal

Is it possible to inject a signal by itself with no coloured Gaussian noise?
Question asked by Arunava Mukherjee via email
Yes. There are two easy ways to do this.
1) Use the existing helper functions
When generating an interferometer object, bilby provides several helper routines denoted by bilby.gw.detector.get_interferometer_with.... In this case, you'll want to use this function (I've truncated the docstring):
bilby.gw.detector.get_interferometer_with_fake_noise_and_injection(
    name, injection_parameters, injection_polarizations=None,
    waveform_generator=None, sampling_frequency=4096, duration=4,
    start_time=None, outdir='outdir', label=None, plot=True, save=True,
    zero_noise=False)
Docstring:
Helper function to obtain an Interferometer instance with appropriate
power spectral density and data, given a center_time.
Note: by default this generates an Interferometer with a power spectral
density based on advanced LIGO.
Parameters
----------
name: str
Detector name, e.g., 'H1'.
...
zero_noise: bool
If true, set noise to zero.
So you just pass the flag in, and it will create an interferometer containing only the injection signal (you'll then need to make one for each interferometer you want in the list of interferometers passed to the likelihood).
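As a minimal sketch (assuming injection_parameters and waveform_generator are already defined, and keeping the default sampling settings), building that list of interferometers might look like:
import bilby

interferometers = [
    bilby.gw.detector.get_interferometer_with_fake_noise_and_injection(
        name=name,
        injection_parameters=injection_parameters,
        waveform_generator=waveform_generator,
        zero_noise=True,  # inject the signal into zero noise
    )
    for name in ["H1", "L1"]  # one entry per detector you want
]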
2) Use the low level set strain data methods
Alternatively, you may wish to use the low-level methods themselves. As a general rule of thumb, you can always look at the source code of the generic helper functions to figure out how this should be done. Here, we create an H1 interferometer, set its strain data with zero noise, and inject a signal:
interferometer = get_empty_interferometer("H1")
interferometer.power_spectral_density = PowerSpectralDensity.from_aligo()
interferometer.set_strain_data_from_zero_noise(
    sampling_frequency=sampling_frequency, duration=duration,
    start_time=start_time)
injection_polarizations = interferometer.inject_signal(
    parameters=injection_parameters,
    waveform_generator=waveform_generator)
Information correct as of v.0.3.5

(Using Julia) How can I reduce my data matrix by averaging values from the same hour?

I am trying to reduce the size of my data and I cannot make it work. I have data points taken every minute over one month, and I want to reduce this to one sample for every hour. The problem is that some of my rows have "NA" values, so I delete those rows, which means there are not exactly 60 points for every hour - it varies.
I have a 'Timestamp' column. I have used this to make a 'datehour' column, which has the same value for all rows that share the same date and hour. I want to average all the values with the same 'datehour' value.
How can I do this? I have tried using the if and for loop below, but it takes so long to run.
Thanks for all your help! I am new to Julia and come from a Matlab background.
======= CODE ==========
uniquedatehour = unique(datehour, 1)
index = []
avedata = reshape([], 0, length(alldata[1, :]))
for j in uniquedatehour
    for i in 1:length(datehour)
        if datehour[i] == j
            index = vcat(index, i)
        else
            rows = alldata[index, :]
            rows = convert(Array{Float64,2}, rows)
            avehour = mean(rows, 1)
            avedata = vcat(avedata, avehour)
            index = []
            continue
        end
    end
end
There are several layers to optimizing this code. I am assuming that your data is sorted on datehour (your code assumes this).
Layer one: general recommendation
Wrap your code in a function. Executing code in global scope in Julia is much slower than running it inside a function. When wrapping it, make sure to either pass the data to your function as arguments or, if the data remains in global scope, qualify it with const;
Layer two: recommendations for your algorithm
A statement like [] creates an array of element type Any, which is slow; use a typed constructor like index=Int[] to make it fast;
Using vcat as in index=vcat(index,i) is inefficient; it is better to do push!(index, i) in place;
It is better to preallocate avedata with e.g. fill(NA, length(uniquedatehour), size(alldata, 2)) and assign values to an existing matrix than to do vcat on it;
If I am not mistaken, your code will also produce incorrect results, as it never processes the last entry of the uniquedatehour vector (assume it has only one element and check what happens: avedata will have zero rows);
The line rows=convert(Array{Float64,2},rows) is probably not needed at all; if alldata is not a Matrix{Float64}, it is better to convert it once at the beginning with Matrix{Float64}(alldata);
You can change the line rows=alldata[index,:] to a view, view(alldata, index, :), to avoid an allocation;
In general, you can avoid creating the index vector entirely: it is enough to remember the start position s and end position e of each run of equal values and then use the range s:e to select the rows you want.
If you correct those things, please post your updated code and maybe I can help further; there is still room for improvement, but it requires a somewhat different algorithmic approach (though you may prefer the option below for simplicity).
Layer three: how I would do it
I would use the DataFrames package to handle this problem, like this:
using DataFrames
df = DataFrame(alldata) # assuming alldata is Matrix{Float64}, otherwise convert it here
df[:grouping] = datehour
agg = aggregate(df, :grouping, mean) # maybe this is all you need, if a DataFrame is OK for you
Matrix(agg[2:end]) # here is how you can convert DataFrame back to a matrix
This is not the fastest solution (as it converts to a DataFrame and back), but it is much simpler for me.

Spark RDD Persistence and Partitions

When a certain RDD is created in Spark for example:
lines = sc.textFile("README.md")
And then a transformation is called on this RDD:
pythonLines = lines.filter(lambda line: "Python" in line)
If you call an action on this transformed filter RDD (such as pythonLines.first()), what does it mean when they say an RDD will be recomputed once again each time you run an action on it? I thought the original RDD that you created using the textFile method is not persisted after you call the filter transformation on it. So will it just recompute the most recent transformed RDD, which in this case is the RDD I made using the filter transformation? I don't really see why that would be necessary if my assumption is correct.
In Spark, RDDs are lazily evaluated. This means if you simply write
lines = sc.textFile("README.md").map(xxx)
Your program will exit without reading the file since you never used the result. If you write something like:
linesLength = sc.textFile("README.md").map(line => line.split(" ").length)
sumLinesLength = linesLength.reduce(_ + _) // <-- scala way
maxLineLength = linesLength.max()
The computations needed to produce linesLength will be performed twice, since you are reusing it in two different places. To avoid that, you should persist the resulting RDD before using it in two different ways:
linesLength = sc.textFile("README.md").map(line => line.split(" ").length)
linesLength.persist()
// ...
You can also take a look at https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence. Hope my explanation isn't too confusing!
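Since the question's snippets use the PySpark API, here is a rough Python equivalent of the same idea (a sketch, with illustrative variable names; sc is the SparkContext as in the snippets above):
# Persist an RDD that will be reused by more than one action.
lines_length = sc.textFile("README.md").map(lambda line: len(line.split(" ")))
lines_length.persist()  # or lines_length.cache()

# Both actions below reuse the persisted RDD instead of re-reading the file.
sum_lines_length = lines_length.reduce(lambda a, b: a + b)
max_line_length = lines_length.max()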

Caffe Multiple Input Images

I'm looking at implementing a Caffe CNN which accepts two input images and a label (later perhaps other data) and was wondering if anyone was aware of the correct syntax in the prototxt file for doing this? Is it simply an IMAGE_DATA layer with additional tops? Or should I use separate IMAGE_DATA layers for each?
Thanks,
James
Edit: I have been using the HDF5_DATA layer lately for this and it is definitely the way to go.
HDF5 is a key value store, where each key is a string, and each value is a multi-dimensional array. Thus, to use the HDF5_DATA layer, just add a new key for each top you want to use, and set the value for that key to store the image you want to use. Writing these HDF5 files from python is easy:
import h5py
import numpy as np

filelist = []
for i in range(100):
    image1 = get_some_image(i)
    image2 = get_another_image(i)
    filename = '/tmp/my_hdf5%d.h5' % i
    with h5py.File(filename, 'w') as f:
        # Caffe expects channel-first (C, H, W) data, hence the transpose.
        f['data1'] = np.transpose(image1, (2, 0, 1))
        f['data2'] = np.transpose(image2, (2, 0, 1))
    filelist.append(filename)

with open('/tmp/filelist.txt', 'w') as f:
    for filename in filelist:
        f.write(filename + '\n')
Then simply set the source of the HDF5_DATA param to be '/tmp/filelist.txt', and set the tops to be "data1" and "data2".
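To sanity-check the generated files before training, you can read one back with h5py; a quick sketch using the first file written above:
import h5py

with h5py.File('/tmp/my_hdf50.h5', 'r') as f:
    print(list(f.keys()))    # expect ['data1', 'data2']
    print(f['data1'].shape)  # expect (channels, height, width)
    print(f['data2'].shape)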
I'm leaving the original response below:
====================================================
There are two good ways of doing this. The easiest is probably to use two separate IMAGE_DATA layers, one with the first image and label, and a second with the second image. Caffe retrieves images from LMDB or LEVELDB, which are key value stores, and assuming you create your two databases with corresponding images having the same integer id key, Caffe will in fact load the images correctly, and you can proceed to construct your net with the data/labels of both layers.
The problem with this approach is that having two data layers is not really very satisfying, and it doesn't scale very well if you want to do more advanced things like having non-integer labels for things like bounding boxes, etc. If you're prepared to make a time investment in this, you can do a better job by modifying the tools/convert_imageset.cpp file to stack images or other data across channels. For example you could create a datum with 6 channels - the first 3 for your first image's RGB, and the second 3 for your second image's RGB. After reading this in using the IMAGE_DATA layer, you can split the stream into two images using a SLICE layer with a slice_point at index 3 along the slice_dim = 1 dimension. If further down the road, you decide that you want to load even more complex assortments of data, you'll understand the encoding scheme and can write your own decoding layer based off of src/caffe/layers/data_layer.cpp to gain full control of the pipeline.
You may also consider using an HDF5_DATA layer with multiple "top"s.

Resources