Transformer-XL: Input and labels for Language Modeling - huggingface-transformers

I'm trying to finetune the pretrained Transformer-XL model transfo-xl-wt103 for a language modeling task. Therfore, I use the model class TransfoXLLMHeadModel.
To iterate over my dataset I use the LMOrderedIterator from the file tokenization_transfo_xl.py which yields a tensor with the data and its target for each batch (and the sequence length).
Let's assume the following data with batch_size = 1 and bptt = 8:
data = tensor([[1,2,3,4,5,6,7,8]])
target = tensor([[2,3,4,5,6,7,8,9]])
mems # from the previous output
My question is: I currently pass this data into the model like this:
output = model(input_ids=data, labels=target, mems=mems)
Is this correct?
I am wondering because the documentation says for the labels parameter:
labels (:obj:torch.LongTensor of shape :obj:(batch_size, sequence_length), optional, defaults to :obj:None):
Labels for language modeling.
Note that the labels are shifted inside the model, i.e. you can set lm_labels = input_ids
So what is it about the parameter lm_labels? I only see labels defined in the forward method.
And when the labels "are shifted" inside the model, does this mean I have to pass data twice (additionally instead of targets) because its shifted inside? But how does the model then know the next token to predict?
I also read through this bug and the fix in this pull request but I don't quite understand how to treat the model now (before vs. after fix)
Thanks in advance for some help!
Edit: Link to issue on Github

That does sound like a typo from another model's convention. You do have to pass data twice, once to input_ids and once to labels (in your case, [1, ... , 8] for both). The model will then attempt to predict [2, ... , 8] from [1, ... , 7]). I am not sure adding something at the beginning of the target tensor would work as that would probably cause size mismatches later down the line.
Passing twice is the default way to do this in transformers; before the aforementioned PR, TransfoXL did not shift labels internally and you had to shift the labels yourself. The PR changed it to be consistent with the library and the documentation, where you have to pass the same data twice.

Related

Doc2Vec How to find most similar document

I am using Gensim's Doc2Vec, and was wondering if there is a way to get the most similar document to another document that is outside the list of TaggedDocuments used to train the Doc2Vec model.
Right now I can infer a vector from a document not in the training set:
# 'model' here is a instance of Doc2Vec class that has been trained
# Inferring a vector
doc_not_in_training_set = "Foo Foo Foo Foo Foo Foo Fie"
v1 = model.infer_vector(word_tokenize(doc_not_in_training_set.lower()))
print("V1_infer", v1)
This prints out a vector representation of the 'doc_not_in_training_set' string. However, is there a way to use this vector to find the n most similar documents to the 'doc_not_in_training_set' string (in the TaggedDocuments training set for this word2vec model)?
Looking under the documentation, the closest I could find was the model.docvec.most_similar() method:
# Finding most similar to first
similar_doc = model.docvecs.most_similar('0')
This returns the document in the training set most similar to the document in the training set with tag '0'.
In the documentation of this method, it looks like there is not yet the functionality I am looking for:
TODO: Accept vectors of out-of-training-set docs, as if from inference.
Is there another method I can use to find documents similar to a document not in the training set?
The .most_similar() method will also take a raw vectors as the target position.
It helps to explicitly name the positive parameter, to prevent other logic of that method, which tries to intuit what other strings/etc supplied as arguments might mean, from misinterpreting a single raw vector.
So try:
similar_docs = model.docvecs.most_similar(positive=[v1])
You should get back a list of nearest-neighbors to the v1 vector that you'd previously inferred.

Change lgbm internal parameter (threshold) by hand

I have trained a model with lgbm. I can dump its interval values with
booster.dump_model()
and see all the internal parameters that has been optimized during the training (leaf values, threshold, index of the variables for each split, ...). For testing purpose I would like to change some. Is there a way? I guess that changing just the output of dump_model will do nothing.
You can save your model to a human-understandable format using
booster.save_model('model.txt'), do your modifications on model.txt, and load back the modified model using modified_booster = lightgbm.Booster(model_file='model.txt').
I hope it helps!

(Using Julia) How can I reduce my data matrix by averaging values from the same hour?

I am trying to reduce the size of my data and I cannot make it work. I have data points taken every minute over 1 month. I want to reduce this data to have one sample for every hour. The problem is: Some of my runs have "NA" value, so I delete these rows. There is not exactly 60 points for every hour - it varies.
I have a 'Timestamp' column. I have used this to make a 'datehour' column which has the same value if the data set has the same date and hour. I want to average all the values with the same 'datehour' value.
How can I do this? I have tried using the if and for loop below, but it takes so long to run.
Thanks for all your help! I am new to Julia and come from a Matlab background.
======= CODE ==========
uniquedatehour=unique(datehour,1)
index=[]
avedata=reshape([],0,length(alldata[1,:]))
for j in uniquedatehour
for i in 1:length(datehour)
if datehour[i]==j
index=vcat(index,i)
else
rows=alldata[index,:]
rows=convert(Array{Float64,2},rows)
avehour=mean(rows,1)
avedata=vcat(avedata,avehour)
index=[]
continue
end
end
end
There are several layers to optimizing this code. I am assuming that your data is sorted on datehour (your code assumes this).
Layer one: general recommendation
Wrap your code in a function. Executing code in global scope in Julia is much slower than within a function. By wrapping it make sure to either pass data to your function as arguments or if data is in global scope it should be qualified with const;
Layer two: recommendations to your algorithm
Statement like [] creates an array of type Any which is slow, you should use type qualifier like index=Int[] to make it fast;
Using vcat like index=vcat(index,i) is inefficient, it is better to do push!(index, i) in place;
It is better to preallocate avedata with e.g. fill(NA, length(uniquedatehour), size(alldata, 2)) and assign values to an existing matrix than to do vcat on it;
Your code will produce incorrect results if I am not mistaken as it will not catch the last entry of uniquedatehour vector (assume it has only one element and check what happens - avedata will have zero rows)
Line rows=convert(Array{Float64,2},rows) is probably not needed at all. If alldata is not Matrix{Float64} it is better to convert it at the beginning with Matrix{Float64}(alldata);
You can change line rows=alldata[index,:] to a view like view(alldata, index, :) to avoid allocation;
In general you can avoid creation of index vector as it is enough that you remember start s and end e position of the range of the same values and then use range s:e to select rows you want.
If you correct those things please post your updated code and maybe I can help further as there is still room for improvement but requires a bit different algorithmic approach (but maybe you will prefer option below for simplicity).
Layer three: how I would do it
I would use DataFrames package to handle this problem like this:
using DataFrames
df = DataFrame(alldata) # assuming alldata is Matrix{Float64}, otherwise convert it here
df[:grouping] = datehour
agg = aggregate(df, :grouping, mean) # maybe this is all what you need if DataFrame is OK for you
Matrix(agg[2:end]) # here is how you can convert DataFrame back to a matrix
This is not the fastest solution (as it converts to a DataFrame and back but it is much simpler for me).

MATLAB ConnectedComponentLabeler does not work in for loop

I am trying to get a set of binary images' eccentricity and solidity values using the regionprops function. I obtain the label matrix using the vision.ConnectedComponentLabeler function.
This is the code I have so far:
files = getFiles('images');
ecc = zeros(length(files)); %eccentricity values
sol = zeros(length(files)); %solidity values
ccl = vision.ConnectedComponentLabeler;
for i=1:length(files)
I = imread(files{i});
[L NUM] = step(ccl, I);
for j=1:NUM
L = changem(L==j, 1, j); %*
end
stats = regionprops(L, 'all');
ecc(i) = stats.Eccentricity;
sol(i) = stats.Solidity;
end
However, when I run this, I get an error says indicating the line marked with *:
Error using ConnectedComponentLabeler/step
Variable-size input signals are not supported when the OutputDataType property is set to 'Automatic'.'
I do not understand what MATLAB is talking about and I do not have any idea about how to get rid of it.
Edit
I have returned back to bwlabel function and have no problems now.
The error is a bit hard to understand, but I can explain what exactly it means. When you use the CVST Connected Components Labeller, it assumes that all of your images that you're going to use with the function are all the same size. That error happens because it looks like the images aren't... hence the notion about "Variable-size input signals".
The "Automatic" property means that the output data type of the images are automatic, meaning that you don't have to worry about whether the data type of the output is uint8, uint16, etc. If you want to remove this error, you need to manually set the output data type of the images produced by this labeller, or the OutputDataType property to be static. Hopefully, the images in the directory you're reading are all the same data type, so override this field to be a data type that this function accepts. The available types are uint8, uint16 and uint32. Therefore, assuming your images were uint8 for example, do this before you run your loop:
ccl = vision.ConnectedComponentLabeler;
ccl.OutputDataType = 'uint8';
Now run your code, and it should work. Bear in mind that the input needs to be logical for this to have any meaningful output.
Minor comment
Why are you using the CVST Connected Component Labeller when the Image Processing Toolbox bwlabel function works exactly the same way? As you are using regionprops, you have access to the Image Processing Toolbox, so this should be available to you. It's much simpler to use and requires no setup: http://www.mathworks.com/help/images/ref/bwlabel.html

Caffe Multiple Input Images

I'm looking at implementing a Caffe CNN which accepts two input images and a label (later perhaps other data) and was wondering if anyone was aware of the correct syntax in the prototxt file for doing this? Is it simply an IMAGE_DATA layer with additional tops? Or should I use separate IMAGE_DATA layers for each?
Thanks,
James
Edit: I have been using the HDF5_DATA layer lately for this and it is definitely the way to go.
HDF5 is a key value store, where each key is a string, and each value is a multi-dimensional array. Thus, to use the HDF5_DATA layer, just add a new key for each top you want to use, and set the value for that key to store the image you want to use. Writing these HDF5 files from python is easy:
import h5py
import numpy as np
filelist = []
for i in range(100):
image1 = get_some_image(i)
image2 = get_another_image(i)
filename = '/tmp/my_hdf5%d.h5' % i
with hypy.File(filename, 'w') as f:
f['data1'] = np.transpose(image1, (2, 0, 1))
f['data2'] = np.transpose(image2, (2, 0, 1))
filelist.append(filename)
with open('/tmp/filelist.txt', 'w') as f:
for filename in filelist:
f.write(filename + '\n')
Then simply set the source of the HDF5_DATA param to be '/tmp/filelist.txt', and set the tops to be "data1" and "data2".
I'm leaving the original response below:
====================================================
There are two good ways of doing this. The easiest is probably to use two separate IMAGE_DATA layers, one with the first image and label, and a second with the second image. Caffe retrieves images from LMDB or LEVELDB, which are key value stores, and assuming you create your two databases with corresponding images having the same integer id key, Caffe will in fact load the images correctly, and you can proceed to construct your net with the data/labels of both layers.
The problem with this approach is that having two data layers is not really very satisfying, and it doesn't scale very well if you want to do more advanced things like having non-integer labels for things like bounding boxes, etc. If you're prepared to make a time investment in this, you can do a better job by modifying the tools/convert_imageset.cpp file to stack images or other data across channels. For example you could create a datum with 6 channels - the first 3 for your first image's RGB, and the second 3 for your second image's RGB. After reading this in using the IMAGE_DATA layer, you can split the stream into two images using a SLICE layer with a slice_point at index 3 along the slice_dim = 1 dimension. If further down the road, you decide that you want to load even more complex assortments of data, you'll understand the encoding scheme and can write your own decoding layer based off of src/caffe/layers/data_layer.cpp to gain full control of the pipeline.
You may also consider using HDF5_DATA layer with multiple "top"s

Resources