R: transpose many separate .csv files (2 columns into 1 row each) and merge into one .csv

I have ~217 identical .csv files in a folder, each represents one individual, and has 2 columns (header: x, y) and 180 rows of data.
I need to transpose each of these into a single row (new headers: x1:x180, continued into y1:y180), create an ID column with an abbreviated file name, and merge the separate files into one data frame of 217 rows, with an ID column and 360 columns of data.
Here's example data from separate .csv files in the same folder, truncated to the first 6 rows:
#dataA_observer_date
x y
1 -2.100343 -0.2601952
2 -2.128320 -0.2805480
3 -2.152010 -0.3000733
4 -2.168258 -0.3170724
5 -2.174368 -0.3305717
6 -2.168887 -0.3403942
#dataB_observer_date
x y
1 0.7577988 -0.1212715
2 0.7256039 -0.1344822
3 0.6933261 -0.1496408
4 0.6638619 -0.1657460
5 0.6409363 -0.1815894
6 0.6281463 -0.1960087
I need the data to look like this, in one file:
head(dataA)
ID [x1] [x2] [x3] [x4] [x5] [x6] [y1] [y2] [y3] [y4] [y5] [y6]
dataA -2.100343 -2.12832 -2.15201 -2.168258 -2.174368 -2.168887 -0.2601952 -0.280548 -0.3000733 -0.3170724 -0.3305717 -0.3403942
dataB
dataC...
...data217
For transposing, I tried the following, which results in a different column order, since it works row by row through the 180 rows:
t_Image1 <- matrix(t(Image1Coords), nrow = 1)
x1 y1 x2 y2...
I have the file names from the folder in a list using other help from https://stackoverflow.com/questions/31039269/combine-and-transpose-many-fixed-format-dataset-files-quickly
filenames <- list.files(path = "C:/Users/path_to_folder", pattern = "*.csv", full.names = FALSE)
require(data.table)
data_list <- lapply(filenames,read.csv)
But I can't get it to come together. So far, with help from https://stackoverflow.com/questions/21530672/in-r-loop-through-matrix-files-transpose-and-save-with-new-name and several other places, I've tried to just transpose the files and re-save them, to be combined in another step. But the exported file is hideous: the matrix transpose into one row retains quotes and puts two data points in one cell, and I'm not sure what it's doing with headers, but it's all in the first cell:
for (i in filenames) {
  mat <- as.data.frame(matrix(t(read.csv(i)), nrow = 1))
  mat$ID <- tools::file_path_sans_ext(basename(i))
  filename <- paste0("transposed_", i)
  write.table(mat, file = filename, row.names = FALSE)
}
I have not addressed shortening the file names and making an ID column yet.
Any help/advice would be greatly appreciated.
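For what it's worth, here is a minimal sketch of the whole pipeline on fake in-memory data: `to_wide_row` is a hypothetical helper name, the 180 rows are shortened to 3 for the demo, and the folder path in the commented part is assumed from the question:

```r
# Hypothetical helper: flatten one n x 2 (x, y) data frame into a single wide row,
# ordered x1..xn then y1..yn, with an ID column in front
to_wide_row <- function(d, id) {
  row <- as.data.frame(t(c(d$x, d$y)))
  names(row) <- c(paste0("x", seq_len(nrow(d))), paste0("y", seq_len(nrow(d))))
  cbind(ID = id, row)
}

# Demo on the truncated example data (3 rows instead of 180)
d <- data.frame(x = c(-2.100343, -2.128320, -2.152010),
                y = c(-0.2601952, -0.2805480, -0.3000733))
wide <- to_wide_row(d, "dataA")

# For the real folder (untested sketch; adjust the path):
# filenames <- list.files("C:/Users/path_to_folder", pattern = "\\.csv$", full.names = TRUE)
# result <- do.call(rbind, lapply(filenames, function(f)
#   to_wide_row(read.csv(f), tools::file_path_sans_ext(basename(f)))))
# write.csv(result, "combined.csv", row.names = FALSE)
```

Using c(d$x, d$y) instead of t() on the whole matrix is what keeps all the x values ahead of all the y values, avoiding the x1 y1 x2 y2 interleaving you saw.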

Related

COUNTIF over a moving window

I have a column wherein datapoints have been assigned a "1" or "2". I would like to use a function similar to COUNTIF in Excel, but over a moving window, e.g. =COUNTIF(G2:G31, 2), to determine how many "2"s exist in that given window.
You might be able to use tibbletime.
1) Since you are interested in state being 1 or 2, we can recode it into a logical (boolean). Assuming your data.frame is named df,
df$state <- df$state == 2
2) Logicals are cool, because we can simply sum them, and get the number of TRUE values:
# total number of rows with state == 2:
sum(df$state)
3) Make a rollify function, cf. the link:
library(tibbletime)
rolling_sum <- rollify(sum, window = 30)
df$countif = rolling_sum(df$state)
This approach does, however, leave the leading 29 rows as NA, since the window is incomplete there. For those you can in your case use:
df$countif[1:29] <- cumsum(df$state[1:29])
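If you would rather not add the tibbletime dependency, the same rolling count (including the cumulative fallback for the leading rows) can be sketched in base R with cumsum; the window size of 30 and the fake sample data are assumptions for illustration:

```r
# Rolling count of 2s over a trailing window of 30, base R only
set.seed(42)
state <- sample(1:2, 100, replace = TRUE)  # fake data standing in for your column
is2 <- as.integer(state == 2)
cs <- cumsum(is2)
# rows 1..30 get the cumulative count so far;
# row i > 30 gets cs[i] - cs[i - 30], the count in the trailing 30-row window
countif <- cs - c(rep(0, 30), head(cs, -30))
```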

Eigenvalues for matrices in a for loop

I need to calculate the eigenvalues of a series of matrices and then save them in a separate file. My data has 5 columns and 10,000 rows. I use the following code:
R <- NULL
setwd("c:/location of the file on this computer")
for (i in 0:1) {
  X <- read.table(file = "Example.prn", skip = i * 5, nrow = 5)
  M <- as.matrix(X)
  E <- eigen(M, only.values = TRUE)
  R <- rbind(R, E)
  print(E)
}
As an example I have used a data set with 10 rows and 5 columns. This gives me the following results:
$`values`
[1] 1.350000e+02+0.000e+00i -4.000000e+00+0.000e+00i 4.365884e-15+2.395e-15i 4.365884e-15-2.395e-15i
[5] 8.643810e-16+0.000e+00i
$vectors
NULL
$`values`
[1] 2.362320e+02+0.000000e+00i -4.960046e+01+1.258757e+01i -4.960046e+01-1.258757e+01i 9.689475e-01+0.000000e+00i
[5] 1.104994e-14+0.000000e+00i
$vectors
NULL
I have three questions and I would really appreciate any help:
I want to save the results in consecutive rows, such as:
Eigenvalue(1) Eigenvalue(3) Eigenvalue(5) Eigenvalue(7) Eigenvalue(9)
Eigenvalue(2) Eigenvalue(4) Eigenvalue(6) Eigenvalue(8) Eigenvalue(10)
any thoughts?
Also, I don't understand the eigenvalues in the output. They are not plain real numbers. For example, one of them is 2.362320e+02+0.000000e+00i. My first thought was that this is the sum of five determinants for a 5x5 matrix. However, "2.362320e+02+0.000000e+00i" seems to only have four numbers in it. Any thoughts? Doesn't the eigen() function calculate the final values of the eigenvalues?
how can I save my outcome to an Excel file? I have used the following code; however, the result I get from it is:
> class(R)
[1] "matrix"
> print(R)
values vectors
E Complex,5 NULL
E Complex,5 NULL
I think you can easily get the values with the following code:
R <- NULL
setwd("c:/location of the file on this computer")
for (i in 0:1) {
  X <- read.table(file = "Example.prn", skip = i * 5, nrow = 5)
  M <- as.matrix(X)
  E <- eigen(M, only.values = TRUE)
  R <- rbind(R, E$values)
}
and then use the answer to this question to save R into a file.
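For the Excel question, one hedged sketch: since the eigenvalues are complex, split them into real and imaginary columns before writing, and Excel can then open the csv. Random 5x5 matrices stand in here for the blocks of your Example.prn:

```r
# Build R as above, but from fake random 5x5 blocks for demonstration
set.seed(1)
R <- NULL
for (i in 1:2) {
  M <- matrix(rnorm(25), 5, 5)                 # stand-in for one 5x5 block
  R <- rbind(R, eigen(M, only.values = TRUE)$values)
}

# Complex values are awkward in a spreadsheet, so write Re and Im as separate columns
out <- data.frame(Re(R), Im(R))
names(out) <- c(paste0("Re", 1:5), paste0("Im", 1:5))
write.csv(out, "eigenvalues.csv", row.names = FALSE)
```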

RStudio Beginner: Joining tables

So I am doing a project on trip start and end points for a bike sharing program. I have two .csv files - one with the trips, which shows a start and end station ID (e.g. Start at 1, end at 5). I then have another .csv file which contains the lat/lon coordinates for each station number.
How do I join these together? I basically just want to create a lat and lon column alongside my trip data so it's one .csv file ready to be mapped.
I am completely new to R and programming/data in general, so go easy! I realize it's probably super simple. I could do it by hand in Excel, but I have over 100,000 trips, so it might take a while...
Thanks in advance!
You should be able to achieve this using just Excel and the VLOOKUP function.
You would need your two CSV files in the same spreadsheet but on different tabs. Your stations would need to be in order of ID (you can order it in Excel if you need to) and then follow the instructions in the video below.
Example use of VLOOKUP.
Hope that helps!
Here is a step-by-step on how to use start and end station ids from one csv, and get the corresponding latitude and longitudes from another.
In technical terms, this shows you how to make use of merge() to find commonalities between two data frames:
Files
Firstly, simple fake data for demonstration purposes:
coordinates.csv:
station_id,lat,lon
1,lat1,lon1
2,lat2,lon2
3,lat3,lon3
4,lat4,lon4
trips.csv:
start,end
1,3
2,4
Import
Start R or RStudio in the same directory containing the csvs.
Then import the csvs into two new data frames trips and coords. In R console:
> trips = read.csv('trips.csv')
> coords = read.csv('coordinates.csv')
Merges
A first merge can then be used to get start station's coordinates:
> trip_coords = merge(trips, coords, by.x = "start", by.y = "station_id")
by.x = "start" tells R that in the first data set trips, the unique id variable is named start
by.y = "station_id" tells R that in the second data set coords, the unique id variable is named station_id
this is an example of how to merge data frames when the same id variable is named differently in each data set, and you have to explicitly tell R
We check and see trip_coords indeed has combined data, having start, end but also latitude and longitude for the station specified by start:
> head(trip_coords)
start end lat lon
1 1 3 lat1 lon1
2 2 4 lat2 lon2
Next, we want the latitude and longitude for end. We don't need to make a separate data frame, we can use merge() again, and build upon our trip_coords:
> trip_coords = merge(trip_coords, coords, by.x = "end", by.y = "station_id")
Check again:
> head(trip_coords)
end start lat.x lon.x lat.y lon.y
1 3 1 lat1 lon1 lat3 lon3
2 4 2 lat2 lon2 lat4 lon4
the .x and .y suffixes appear because merge() combines two data frames: data frame 1 was our trip_coords, which already had a lat and lon, and data frame 2, coords, also has lat and lon. merge() needs to tell them apart after the merge, so
for data frame 1, aka the original trip_coords, lat and lon are automatically renamed to lat.x and lon.x
for data frame 2, aka coords, lat and lon are automatically renamed to lat.y and lon.y
But now, the default result puts variable end first. We may prefer to see the order start followed by end, so to fix this:
> trip_coords = trip_coords[c(2, 1, 3, 4, 5, 6)]
we re-order and then save the result back into trip_coords
We can check the results:
> head(trip_coords)
start end lat.x lon.x lat.y lon.y
1 1 3 lat1 lon1 lat3 lon3
2 2 4 lat2 lon2 lat4 lon4
Export
> write.csv(trip_coords, file = "trip_coordinates.csv", row.names = FALSE)
saves the csv
where file = sets the file path to save to. In this case just trip_coordinates.csv, so it will appear in the current working dir, alongside the other csvs
row.names = FALSE: otherwise, by default, the first column is filled with automatic row numbers
You can check the results, for example on Linux, on your command prompt:
$ cat trip_coordinates.csv
"start","end","lat.x","lon.x","lat.y","lon.y"
1,3,"lat1","lon1","lat3","lon3"
2,4,"lat2","lon2","lat4","lon4"
So now you have a method for taking trips.csv, getting lat/lon for each of start and end, and outputting a csv again.
Automation
Remember that with R you can automate: write the exact commands you want to run and save them in a myscript.R. Then, if your source data changes and you wish to re-generate the latest trip_coordinates.csv without having to type all those commands again, you have at least two options to run the script.
Within R or the R console you see in rstudio:
> source('myscript.R')
Or, if on the Linux command prompt, use Rscript command:
$ Rscript myscript.R
and the trip_coordinates.csv would be automatically generated.
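As an aside, merge() also accepts a suffixes argument, so you can pick clearer names than the default .x/.y up front. A small sketch on the same fake data:

```r
# Same two merges as above, but with explicit suffixes for the duplicated columns
trips <- data.frame(start = c(1, 2), end = c(3, 4))
coords <- data.frame(station_id = 1:4,
                     lat = paste0("lat", 1:4),
                     lon = paste0("lon", 1:4))

trip_coords <- merge(trips, coords, by.x = "start", by.y = "station_id")
trip_coords <- merge(trip_coords, coords, by.x = "end", by.y = "station_id",
                     suffixes = c(".start", ".end"))
# columns are now lat.start/lon.start and lat.end/lon.end instead of .x/.y
```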
Further resources
How to Use the merge() Function...: Good VENN diagrams of the different joins

write cell array into text file as two column data

I have two different variables which are stored as cell arrays. I am trying to open a text file and store these variables as two-column data. Below is my code; I used \t to separate the x and y data, but in the output file, all the x data is written first, followed by the y data. How can I obtain a two-column array in the text file?
for j=1:size(data1,2)
file1=['dir\' file(j,1).name];
f1{j}=fopen(file1,'a+')
fprintf(f1{j},'%7.3f\t%20.10f\n',x{1,j}',y{1,j});
fclose(f1{j});
end
Thanks in advance!
You can use dlmwrite as well to accomplish this for numeric data:
x = [1;2;3]; y = [4;5;6]; % two column vectors
dlmwrite('foo.dat',[x y],'Delimiter','\t')
This produces the output:
1 4
2 5
3 6
Use a MATLAB table if you have R2013b or beyond:
data1 = {'a','b','c'}'
data2 = {1, 2, 3}'
t = table(data1, data2)
writetable(t, 'data.csv')
More info here.

Why is this R code so slow?

I am trying to create a dataframe based on information in another dataframe.
The first dataframe (base_mar_bop) has data like:
201301|ABC|4
201302|DEF|12
My wish is to create a data frame from this with 16 rows in it:
4 times: 201301|ABC|1
12 times: 201302|DEF|1
I have written a script that takes ages to run. To give an idea, the final dataframe has around 2 million rows and the source dataframe has about 10k rows. I cannot post source files for the dataframes due to the confidentiality of the data.
Since it took ages to run this code, I decided to do this in PHP instead; it ran in under a minute and got the job done, writing the result to a txt file and then importing that txt file into R.
I have no clue why R takes so long. Is it the calling of the function? Is it the nested for loop? From my point of view, there are not that many computationally intensive steps in there.
# first create an empty dataframe called base_eop that will hold each subscriber on a row,
# identified by CED, RATEPLAN and 1
# where 1 is the count and the sum of 1 should end up with the base
base_eop <-base_mar_bop[1,]
# let's give some logical names to the columns in the df
names(base_eop) <- c('CED','RATEPLAN','BASE')
# define the function that enables us to insert a row at the bottom of the dataframe
insertRow <- function(existingDF, newrow, r) {
existingDF[seq(r+1,nrow(existingDF)+1),] <- existingDF[seq(r,nrow(existingDF)),]
existingDF[r,] <- newrow
existingDF
}
# now loop through the bop base for march, each row contains the ced, rateplan and number of subs
# we need to insert a row for each individual sub
for (i in 1:nrow(base_mar_bop)) {
# we go through every row in the dataframe
for (j in 1:base_mar_bop[i,3]) {
# we insert a row for each CED, rateplan combination and set the base value to 1
base_eop <- insertRow(base_eop,c(base_mar_bop[i,1:2],1),nrow(base_eop))
}
}
# since the dataframe was created using the first row of base_mar_bop we need to remove this first row
base_eop <- base_eop[-1,]
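The short answer to "why so slow": insertRow() copies the entire data frame on every call, so building n rows costs on the order of n^2/2 copied rows. A tiny demonstration of row-by-row growth versus building the result in one vectorized call (a small n is used so it finishes quickly; the function names are just for illustration):

```r
n <- 1000

# row-by-row growth: each rbind copies everything built so far (quadratic work)
grow <- function(n) {
  df <- data.frame(x = integer(0))
  for (i in 1:n) df <- rbind(df, data.frame(x = i))
  df
}

# vectorized: the whole column is allocated once (linear work)
build <- function(n) data.frame(x = 1:n)

t_grow  <- system.time(a <- grow(n))["elapsed"]
t_build <- system.time(b <- build(n))["elapsed"]
# a and b hold identical data; t_build is typically far smaller than t_grow
```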
Here is one approach with data.table, though #BenBolker's timings are already awesome.
library(data.table)
DT <- data.table(d2) ## d2 from #BenBolker's answer
out <- DT[, ID:=1:.N][rep(ID, BASE)][, `:=`(BASE=1, ID=NULL)]
out
# CED RATEPLAN BASE
# 1: 1 A 1
# 2: 1 A 1
# 3: 1 A 1
# 4: 1 A 1
# 5: 1 A 1
# ---
# 1999996: 10000 Y 1
# 1999997: 10000 Y 1
# 1999998: 10000 Y 1
# 1999999: 10000 Y 1
# 2000000: 10000 Y 1
Here, I've used compound queries to do the following:
Create an ID variable that is really just 1 to the number of rows in the data.table.
Use rep to repeat the ID variable by the corresponding BASE value.
Replaced all BASE values with "1" and dropped the ID variable we created earlier.
Perhaps there is a more efficient way to do this though. For example, dropping one of the compound queries should make it a little faster. Perhaps something like:
out <- DT[rep(1:nrow(DT), BASE)][, BASE:=1]
I haven't tried any benchmarking yet, but this approach (illustrated on your mini-example) should be much faster:
d <- data.frame(x1=c(201301,201302),x2=c("ABC","DEF"),rep=c(4,12))
with(d,data.frame(x1=rep(x1,rep),x2=rep(x2,rep),rep=1))
A slightly more realistic example, with timing:
d2 <- data.frame(CED=1:10000,RATEPLAN=rep(LETTERS[1:25],
length.out=10000),BASE=200)
nrow(d2) ## 10000
sum(d2$BASE) ## 2e+06
system.time(d3 <- with(d2,
data.frame(CED=rep(CED,BASE),RATEPLAN=rep(RATEPLAN,BASE),
BASE=1)))
## user system elapsed
## 0.244 0.860 1.117
nrow(d3) ## 2000000 (== 2e+06)
