Problem converting a Matrix to Data Frame in R (R thinks all numeric types are factors) - data-structures

I am passing data from C# to R over a COM interface. When the data arrives in R it is housed in a 'Matrix'. Some of the functions that I use require that the data be inside a 'DataFrame' instead. I convert the data structure using
newDataFrame <- as.data.frame(oldMatrix)
The table of data reaches R just fine, once I make the conversion to the DataFrame however, it assumes all of my numeric data are factors!
So it turns: {34, 46, 90, 54, 69, 54} into {1, 2, 3, 4, 5, 4}
My data table DOES have factors in it though, so I just can't force the whole thing to be numeric. Is there any way around this? Note: I can't export the data as a CSV onto the filesystem and read it into R manually.
On a side note, the function I am using that requires a DataFrame is the 'Hmisc' package using
hist.data.frame(dataFrame)
this produces a frequency histogram for every column of data in the DataFram and arranges them in all in a grid pattern (quite nifty)!
Thanks!
-Dave

I think you have mis-diagnosed the problem - all columns in a matrix must be of the same type, so this is likely to be where the problem arises, not the conversion to a data frame.

I've had this problem before. You need to set stringsAsFactors=F when you read the data.
Now, you can convert individual variables/columns to factors (ie, with as.numeric() and the like), without worrying about how the numbers are treated.

Related

Transformer-XL: Input and labels for Language Modeling

I'm trying to finetune the pretrained Transformer-XL model transfo-xl-wt103 for a language modeling task. Therfore, I use the model class TransfoXLLMHeadModel.
To iterate over my dataset I use the LMOrderedIterator from the file tokenization_transfo_xl.py which yields a tensor with the data and its target for each batch (and the sequence length).
Let's assume the following data with batch_size = 1 and bptt = 8:
data = tensor([[1,2,3,4,5,6,7,8]])
target = tensor([[2,3,4,5,6,7,8,9]])
mems # from the previous output
My question is: I currently pass this data into the model like this:
output = model(input_ids=data, labels=target, mems=mems)
Is this correct?
I am wondering because the documentation says for the labels parameter:
labels (:obj:torch.LongTensor of shape :obj:(batch_size, sequence_length), optional, defaults to :obj:None):
Labels for language modeling.
Note that the labels are shifted inside the model, i.e. you can set lm_labels = input_ids
So what is it about the parameter lm_labels? I only see labels defined in the forward method.
And when the labels "are shifted" inside the model, does this mean I have to pass data twice (additionally instead of targets) because its shifted inside? But how does the model then know the next token to predict?
I also read through this bug and the fix in this pull request but I don't quite understand how to treat the model now (before vs. after fix)
Thanks in advance for some help!
Edit: Link to issue on Github
That does sound like a typo from another model's convention. You do have to pass data twice, once to input_ids and once to labels (in your case, [1, ... , 8] for both). The model will then attempt to predict [2, ... , 8] from [1, ... , 7]). I am not sure adding something at the beginning of the target tensor would work as that would probably cause size mismatches later down the line.
Passing twice is the default way to do this in transformers; before the aforementioned PR, TransfoXL did not shift labels internally and you had to shift the labels yourself. The PR changed it to be consistent with the library and the documentation, where you have to pass the same data twice.

Extracting data from text file in AMPL without adding indexes

I'm new to AMPL and I have data in a text file in matrix form from which I need to use certain values. However, I don't know how to use the matrices directly without having to manually add column and row indexes to them. Is there a way around this?
So the data I need to use looks something like this, with hundreds of rows and columns (and several more matrices like this), and I would like to use it as a parameter with index i for rows and j for columns.
t=1
0.0 40.95 40.36 38.14 44.87 29.7 26.85 28.61 29.73 39.15 41.49 32.37 33.13 59.63 38.72 42.34 40.59 33.77 44.69 38.14 33.45 47.27 38.93 56.43 44.74 35.38 58.27 31.57 55.76 35.83 51.01 59.29 39.11 30.91 58.24 52.83 42.65 32.25 41.13 41.88 46.94 30.72 46.69 55.5 45.15 42.28 47.86 54.6 42.25 48.57 32.83 37.52 58.18 46.27 43.98 33.43 39.41 34.0 57.23 32.98 33.4 47.8 40.36 53.84 51.66 47.76 30.95 50.34 ...
I'm not aware of an easy way to do this. The closest thing is probably the table format given in section 9.3 of the AMPL Book. This avoids needing to give indices for every term individually, but it still requires explicitly stating row and column indices.
AMPL doesn't seem to do a lot with position-based input formats, probably because it defaults to treating index sets as unordered so the concept of "first row" etc. isn't meaningful.
If you really wanted to do it within AMPL, you could probably put together a work-around along these lines:
declare a single-index param with length equal to the total size of your matrix (e.g. if your matrix is 10 x 100, this param has length 1000)
edit the beginning and end of your "matrix" data file to turn it into appropriate format for a single-index parameter indexed from 1 to n
then define your matrix something like this:
param m{i in 1..nrows,j in 1..ncols} := x[j+i*(ncols-1)];
(not tested, I won't promise that I have rows and columns the right way around there!)
But you're probably better off editing the input file into one of the standard AMPL matrix formats. AMPL isn't really designed for data wrangling - you can do it in a pinch but if you're doing this kind of thing repeatedly it may be less trouble to code it in a general-purpose language e.g. Python.

How to use DBSCAN algorithm for a list of points in python

I am new to image processing and python coding.
I have detected a number of features in an image and have their respective pixel locations placed in a list format.
My_list = [(x1,y1),(x2,y2),......,(xn,yn)]
I would like to use DBSCAN algorithm to form clusters from the following points.
Currently using sklearn.cluster to import the build in DBSCAN function for python.
If the current format for the points is not compatible would like to know which is?
Error currently facing with the current format:
C:\Python\python.exe "F:/opencv_files/dbscan.py"
**Traceback (most recent call last):**
**File "**F:/opencv_files/dbscan.py**", line 83, in <module>
db = DBSCAN(eps=0.5, min_samples=5).fit(X) # metric=X)**
**File "**C:\Python\lib\site-packages\sklearn\cluster\dbscan_.py**", line 282, in fit
X = check_array(X, accept_sparse='csr')
File "**C:\Python\lib\site-packages\sklearn\utils\validation.py**", line 441, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.**
Your data is a list of tuple. There is nothing in this structure that prevents you from doing crazy things with that, such as having different lengths in there. Plus, this is a very slow and memory inefficient way of keeping the data because everything is boxed as a Python object.
Just call data = numpy.array(data) to convert your data into an efficient multidimensional numeric array. This array will then have a shape.

Iteratively populate dataframes using a for loop in Julia

I am looking to find a way to iteratively populate a dataframe in Julia.
I have a working function that creates multiple points along a line:
#function to draw QMD lines
using DataFrames
function make_lines(qmd)
BA=Float64[]
TPA=Float64[]
QMD=Int[]
for i in stk_percent
tpa= 1*(i*10)/(a[1]+a[2]*(-0.259+0.973*qmd)+a[3]*qmd^2)
ba=pi*(qmd/24)^2*tpa
push!(TPA,tpa)
push!(BA,ba)
push!(QMD,qmd)
end
return DataFrame(TPA=TPA,BA=BA,QMD=QMD)
end
The next step I am trying to accomplish is to run the make_lines function in a loop using a pre-defined set of inputs with all the outputs in one single dataframe but I cannot get it to work.
dia = [7, 8, 10, 12, 14, 16, 18, 20, 22]
# can't get for loop to append all the data frames?
for i in dia
df=DataFrame(TPA=Float64[],BA=Float64[],QMD=Int[])
append!(df,make_lines(i))
return df
end
At first I thought it was how I was using Dataframes, I have never used Push! etc before but I got this code chunk to work
#this works to combine dataframe
test=make_lines(22)
test2=make_lines(8)
test[:]
append!(test,test2)
So why when I run the for loop, do I end up with only the last dataframe it produces?
Am I misinterpreting something? From what I have read Dataframes in Julia work differently than dataframes in R, but I cannot wrap my head around how to get this working.
You are pretty close, but there are a couple of places where you are getting tripped up in your code. You currently have:
dia = [7, 8, 10, 12, 14, 16, 18, 20, 22]
# can't get for loop to append all the data frames?
for i in dia
df=DataFrame(TPA=Float64[],BA=Float64[],QMD=Int[])
append!(df,make_lines(i))
return df
end
This isn't quite what you want for two reasons:
One: This snippet isn't a function. It thus doesn't make sense, and will cause problems, to have return in it.
Two: At each step in your loop, you are re-creating your dataframe df from scratch, erasing everything that you put before it. This is why, as you say, you only end up with the last data frame that it produces. Instead, you would want something like:
dia = [7, 8, 10, 12, 14, 16, 18, 20, 22]
df=DataFrame(TPA=Float64[],BA=Float64[],QMD=Int[])
for i in dia
append!(df,make_lines(i))
end
Note: I couldn't get a completely working version of your code going - the objects stk_percent and a in your main function never get defined, so I didn't really know what to put in for those. But, I believe that if you fix these issues you'll likely be in a better spot (I made up some values for them and it worked fine).
Performance Tip: When you do fix those, my recommendation would be to make them as explicit arguments that you pass to your function. Although it will still work if they are just variables in the global space, this will lead to suboptimal performance of your code, both now and in the future, and potentially worse things, like confusing the scope of variables, having their values change when you don't want, etc. Best to start off from the beginning of your journey with Julia adopting as many best practices in writing your code as is practicable.
I managed to create a blank dataframe by providing the type of variable and the column names
df = DataFrame([DateTime;fill(Float64, 2);String;fill(Float64, 2)],
["Date","A","B","Letter","C","D"])
Then I can append the results to populate the new dataframe by using rename! and then append! functions inside the for loop.
This is very useful for large datasets with numerous columns.

Is there a drawback in using rxjs for readonly collection manipulation

I need to do a Min and Max operation on a array getting from server side.
I am new to rxjs extensions but those library is actually mean to observe changes on a collection, but in my case its just a ONE time calculation on a collection which is no further changed then until I do a server side refresh of the data.
I just want to use the right tool for the right job, thus I ask is it correct to use rxjs here or is that shooting with bombs on flys?
Or should I rather use a library like https://github.com/ENikS/LINQ
to get the Min/Max value of a collection?
There is a LINQ implementation IxJS that is developed and maintained by the same team that is developing RxJS. This might be the right tool for you.
However, you could go with RxJS as well. When using Rx.Observable.from([1, 2, ...]) the execution is synchronous on subscription.
I would use IxJS however:
// An array of values.. (just creating some random ones here)
const values = [2, 4, 23, 1, 0, 34, 56, 2, 3, 45, 98, 6, 3];
// Create an enumerable from the array
const valEnum = Ix.Enumerable.fromArray(values);
const min = valEnum.min();
const max = valEnum.max();
Working example on jsfiddle.
https://github.com/ENikS/LINQ uses all the latest language features and theoretically much faster than IxJS. Last edit on IxJS is 3 years old. (ECMA-262/6.0/) introduced few very important advancements and speed improvements.
It also has better compliance with standard LINQ API and can operate on any collection implementing iterables, including strings, maps, typed arrays, and etc. IxJS can only query array types.

Resources