RStudio Beginner: Joining tables - rstudio

So I am doing a project on trip start and end points for a bike sharing program. I have two .csv files - one with the trips, which shows a start and end station ID (e.g. Start at 1, end at 5). I then have another .csv file which contains the lat/lon coordinates for each station number.
How do I join these together? I basically just want to create a lat and lon column alongside my trip data so it's one .csv file ready to be mapped.
I am completely new to R and programming/data in general so go easy! I realize it's probably super simple. I could do it by hand in excel but I have over 100,000+ trips so it might take a while...
Thanks in advance!

You should be able to achieve this using just Excel and the VLOOKUP function.
You would need your two CSV files in the same spreadsheet but on different tabs. Your stations would need to be in order of ID (you can order it in Excel if you need to) and then follow the instructions in the video below.
Example use of VLOOKUP.
Hope that helps!

Here is a step-by-step on how to use start and end station ids from one csv, and get the corresponding latitude and longitudes from another.
In technical terms, this shows you how to make use of merge() to find commonalities between two data frames:
Files
Firstly, simple fake data for demonstration purposes:
coordinates.csv:
station_id,lat,lon
1,lat1,lon1
2,lat2,lon2
3,lat3,lon3
4,lat4,lon4
trips.csv:
start,end
1,3
2,4
Import
Start R or rstudio in the same directory containing the csvs.
Then import the csvs into two new data frames trips and coords. In R console:
> trips = read.csv('trips.csv')
> coords = read.csv('coordinates.csv')
Merges
A first merge can then be used to get start station's coordinates:
> trip_coords = merge(trips, coords, by.x = "start", by.y = "station_id")
by.x = "start" tells R that in the first data set trips, the unique id variable is named start
by.y = "station_id" tells R that in the second data set coords, the unique id variable is named station_id
this is an example of how to merge data frames when the same id variable is named differently in each data set, and you have to explicitly tell R
We check and see trip_coords indeed has combined data, having start, end but also latitude and longitude for the station specified by start:
> head(trip_coords)
start end lat lon
1 1 3 lat1 lon1
2 2 4 lat2 lon2
Next, we want the latitude and longitude for end. We don't need to make a separate data frame, we can use merge() again, and build upon our trip_coords:
> trip_coords = merge(trip_coords, coords, by.x = "end", by.y = "station_id")
Check again:
> head(trip_coords)
end start lat.x lon.x lat.y lon.y
1 3 1 lat1 lon1 lat3 lon3
2 4 2 lat2 lon2 lat4 lon4
the .x and .y suffixes appear because merge combines two data frames, and our data frame 1 was trip_coords which already had a lat and lon, and data frame 2 coords also has lat and lon. So the merge() function needed to help us tell them apart after merge, so
for data frame 1, aka original trip_coords, lat and lon is automatically renamed to lat.x and lon.x
for data frame 2, aka coords, has lat and lon is automatically renamed to lat.y and lon.y
But now, the default result puts variable end first. We may prefer to see the order start followed by end, so to fix this:
> trip_coords = trip_coords[c(2, 1, 3, 4, 5, 6)]
we re-order and then save the result back into trip_coords
We can check the results:
> head(trip_coords)
start end lat.x lon.x lat.y lon.y
1 1 3 lat1 lon1 lat3 lon3
2 2 4 lat2 lon2 lat4 lon4
Export
> write.csv(trip_coords, file = "trip_coordinates.csv", row.names = FALSE)
saves csv
where file = to set the file path to save to. In this case just trip_coordinates.csv so this will appear in the current working dir, where you have the other csvs
row.names = FALSE otherwise by default, the first column is filled with automatic row numbers
You can check the results, for example on Linux, on your command prompt:
$ cat trip_coordinates.csv
"","start","end","lat.x","lon.x","lat.y","lon.y"
"1",1,3,"lat1","lon1","lat3","lon3"
"2",2,4,"lat2","lon2","lat4","lon4"
So now you have a method for taking trips.csv, getting lat/lon for each of start and end, and outputting a csv again.
Automation
Remember that with R you can automate, write the exact commands you want to run, save it in a myscript.R, so if your source data changes and you wish to re-generate the latest trip_coordinates.csv without having to type all those commands again, you have at least two options to run the script
Within R or the R console you see in rstudio:
> source('myscript.R')
Or, if on the Linux command prompt, use Rscript command:
$ Rscript myscript.R
and the trip_coordinates.csv would be automatically generated.
Further resources
How to Use the merge() Function...: Good VENN diagrams of the different joins

Related

"Different row counts implied by arguments" in attempt to plot BAM file data

I'm attempting to use this tutorial to manipulate and plot ATAC-sequencing data. I have all the libraries listed in that tutorial installed and loaded, except while they use biocLite(BSgenome.Hsapiens.UCSC.hg19) for the human genome, I'm using biocLite(TxDb.Mmusculus.UCSC.mm10.knownGene) for the mouse genome.
Here I have loaded in my BAM file
sorted_AL1.1BAM <-"Sorted_1_S1_L001_R1_001.fastq.gz.subread.bam"
And created an object called TSS, which is transcription start site regions from the mouse genome. I want to ultimately plot the average signal in my read data across mouse transcription start sites.
TSSs <- resize(genes(TxDb.Mmusculus.UCSC.mm10.knownGene), fix = "start", 1)
The problem occurs with the following code:
nucFree <- regionPlot(bamFile = sorted_AL1.1BAM, testRanges = TSSs, style = "point",
format = "bam", paired = TRUE, minFragmentLength = 0, maxFragmentLength = 100,
forceFragment = 50)
The error is as follows:
Reading Bam header information.....Done
Filtering regions which extend outside of genome boundaries.....Done
Filtered 24528 of 24528 regions
Splitting regions by Watson and Crick strand..Error in DataFrame(..., check.names = FALSE) :
different row counts implied by arguments
I assume my BAM file contains empty values that need to be changed to NAs. My issue is that I'm not sure how to visualize and manipulate BAM files in R in order to do this. Any help would be appreciated.
I tried the following:
data.frame(sorted_AL1.1BAM)
sorted_AL1.1BAM[sorted_AL1.1BAM == ''] <- NA
I expected this to resolve the issue of different row counts, but I get the same error message.

Input f into play3d() and movie3d() in the rgl package in R

I don't understand the input f expected by play3d and movie3d in the rgl package.
library(rgl)
nobs<-10
x<-runif(nobs)
y<-runif(nobs)
z<-runif(nobs)
n<-rep(1:nobs)
df<-as.data.frame(cbind(x,y,z,n))
listofobs<-split(df,n)
plot3d(df[,1],df[,2],df[,3], type = "n", radius = .2 )
myplotfunction<-function(x) {
rgl.spheres(x=x$x,y=x$y,z=x$z, type="s", r=0.025)
}
When executing the 2 lines below, the animation does play but both lines (play3d() and movie3d()) trigger the error displayed below:
play3d(f=lapply(listofobs,myplotfunction), fps=1 )
movie3d(f=lapply(listofobs,myplotfunction), fps=1 , duration=20)
I am hoping someone can correct my code and help me understand the f input to play3d and movie3d.
Question 1: Why is the play3d line above correct enough that the animation does display correctly?
Question 2: Why is the play3d line above incorrect enough that it triggers the error?
Question 3: What is wrong with the movie3d line that it does not produce a video output?
As the docs say, f is "A function returning a list that may be passed to par3d". It's not a list, which is what your usage passes.
To answer the questions:
R evaluates the lapply call which does the animation, then play3d looks at the result and dies because it's not a function.
f needs to be a function, as described in the help page.
It dies when it looks at f, because it's not a function.
This looks like it will do what you want:
library(rgl)
nobs<-10
x<-runif(nobs)
y<-runif(nobs)
z<-runif(nobs)
df<-data.frame(x,y,z)
plot3d(df, type = "n" )
id <- NA
myplotfunction<-function(time) {
index <- round(time)
# For a 3x faster display, use index <- round(3*time)
# To cycle through the points several times, use
# index <- round(3*time) %% nobs + 1
if (!is.na(id))
pop3d(id = id) # Delete previous item
id <<- spheres3d(df[index,], r=0.025)
list()
}
play3d(myplotfunction, startTime = 1, duration = nobs - 1)
movie3d(myplotfunction, startTime = 1, duration = nobs - 1, fps = 1)
This will leave a GIF in file.path(tempdir(), "movie.gif").
Some other notes:
don't call rgl.spheres. It will cause you immense pain later. Use spheres3d, or never call any *3d function, and never upgrade rgl: you're living in the past using the rgl.* functions. The *3d functions and the rgl.* functions don't play nicely together.
to construct a dataframe, just use the data.frame() function, don't convert
a matrix.
you don't need all those contortions to extract points from the dataframe.
Most rgl functions can handle a dataframe with x, y, and z columns.
You might notice the plot3d frame move a little: spheres are bigger than points, so it will adjust to accommodate them. You could use xlim, ylim and zlim to set the original frame a little bigger if you don't like this.

gnuplot : variable paths to data file in a for loop

I would like to plot multiple curve on the same graph using a for loop. Each data file (named stat_coupe) is located in a different folder (fwal055wal055/rep16/ and fwal055wal055_c2/rep20/). fwal055wal055 and fwal055wal055_c2 correspond to names of simulation. First, I need to get a previous result, a single number (Utau), in other files (named file_fwal055wal055 and file_fwal055wal055_c2). This is successfully done thanks to the command awk. The result depend on the file: Utaufwal055wal055=10.5 and Utaufwal055wal055_c2=12.2.
Then I need to divid the 1st column of the file stat_coupe corresponding to the path fwal055wal055/rep16/ by the value of Utaufwal055wal055 and do the same thing for the file stat_coupe corresponding to the path fwal055wal055_c2/rep20/ with the value of Utaufwal055wal055_c2. Moreover, each plot should have a specific format which depend on the type of simulation run (fwal055wal055 or fwal055wal055_c2).
The presented problem is reduced to 2 simulations fwal055wal055 and fwal055wal055_c2 and 1 plot but I have about 20 simulations and 15 various graphs to plot that is why I would like to use the for loop.
To summary at each iteration I have:
a specific format,
a specific path,
a specific value of Utau
I want to indicate the wright format, path and value of Utau at each iteration of the for loop. The solution I propose below successfully permits to obtain the value of Utau for each simulation but the code #path_.i and #format_.i does not work.
#!/bin/bash
for elem in fwal055wal055 fwal055wal055_c2;
do
Utau[${elem}]=$(awk 'FNR==5{print $1}' file_$elem)
done
gnuplot -persist <<-EOFMarker
format_fwal055wal055='pt 1 ps 1.0 lc 0 title "WALE"'
format_fwal055wal055_c2='pt 2 ps 1.0 lc 0 title "WALE c2"'
path_fwal055wal055='"fwal055wal055/rep16/stat_coupe"'
path_fwal055wal055_c2='"fwal055wal055_c2/rep20/stat_coupe"'
list="fwal055wal055 fwal055wal055_c2"
plot for [i in list] #path_.i u 1:(\$2/${Utau[${i}]}) #format_.i
EOFMarker
I would like to obtain something equivalent to:
plot #path_fwal055wal055 u 1:(\$2/${Utau[${i}]}) #format_fwal055wal055,\
#path_fwal055wal055_c2 u 1:(\$2/${Utau[${i}]}) #format_fwal055wal055_c2
Can someone help me to solve this issue ?
Thank you very much,
Martin
Check help sprintf, help words and help word.
I would create two strings with the same number of items and then combine them with sprintf(). From gnuplot 5.2 on you could also do it with arrays.
# Version 1
PATHS = '"fwal055wal055/rep16/stat_coupe" "fwal055wal055_c2/rep20/stat_coupe"'
FILES = "fwal055wal055 fwal055wal055_c2"
plot for [i=1:words(FILES)] sprintf("%s_%s",word(PATHS,i),word(FILES,i)) u 1:2
or you could define a function for your filenames to keep the plot command short and readable.
# Version 2
PATHS = '"rep16/stat_coupe" "rep20/stat_coupe"'
FILES = "fwal055wal055 fwal055wal055_c2"
myFilename(i) = sprintf("%s/%s_%s",word(FILES,i),word(PATHS,i),word(FILES,i))
plot for [i=1:words(FILES)] myFilename(i) u 1:2
Addition (after some clarifications...)
If I understand your question now correctly, the following code should do the job.
For the extraction of the UTAUS you do a separate loop before plotting and store the extracted values in a string. During plotting you get these values back via word(UTAUS,i). Since you do the mathematical operation column(2)/word(UTAUS,i), gnuplot will interpret them as number. Check help words, help word, help sprintf, help every.
Code:
### extract and normalize in a loop with individual files and directories
reset session
FILES = 'fwal055wal055 fwal055wal055_c2'
DIRS = 'rep16 rep20'
TITLES = '"WALE" "WALE c2"' # if you have spaces you need to put it into double quotes
UTAUS = ''
# define functions for better readability
myExtractionFile(i) = sprintf("file_%s",word(FILES,i))
myDataFile(i) = sprintf("%s/%s/stat_coupe",word(FILES,i),word(DIRS,i))
myTitle(i) = word(TITLES,i)
# define point or line appearance. Add more if you have more files
set style line 1 pt 1 ps 1.0 lc 0
set style line 2 pt 2 ps 1.0 lc 1
# extract the UTAUs
do for [i=1:words(FILES)] {
set table $Dummy
plot myExtractionFile(i) u (utau=$1) every ::4::4 w table # extract value row 5, column 1 (not counting header lines)
unset table
UTAUS = UTAUS.sprintf(" %g",utau) # append the extracted value as string
}
plot for [i=1:words(FILES)] myDataFile(i) u 1:(column(2)/word(UTAUS,i)) ls i title myTitle(i)
### end of code

How to export from data using esttab or estout

Say that I have these data:
sysuse auto2, clear
gen name = substr(make, 1,3)
drop if missing(rep78)
gen n = 1
collapse (mean) mpg (sum) (n), by(name)
replace name = "a b: c" if _n==1
I would like to export them to an .rtf (.tex, etc.) file directly from the data using esttab or estout. Is this possible? The key reason I want to do this is that I want to be able to preserve the spaces in the row names. And it would be nice to able to have the option to have commas after for 1,000's.
One partial approach is to save the data to a matrix, then export the matrix using esttab, but can I need this extra step?
mkmat mpg n, matrix(mat) rownames(name)
esttab matrix(mat)
A problem with this is that it replaces the spaces in the row names with _'s. Another problem is that if any of the rownames (from the variable name) are :, then this creates the category in the output. Is there another solution? Either to directly export from the data or possibly to somehow save the data in an estimation?
Instead of using collapse, you can calculate means and counts directly with estpost tabstat, statistics(mean count) by(). You can then use esttab to export the results.
If you really want to create a dataset first, you can still use estpost tabstat. This appears to work for your dataset:
estpost tabstat mpg n, by(name) nototal
esttab, cells("mpg n") varlabels(`e(labels)') noobs nonumber nomtitle
If you want to have "a b: c" on top again you can use the order option of esttab.

Julia: How to modify a column of a matrix that has been saved as a binary file?

I am working with large matrices of data (Nrow x Ncol) that are too large to be stored in memory. Instead, it is standard in my field of work to save the data into a binary file. Due to the nature of the work, I only need to access 1 column of the matrix at a time. I also need to be able to modify a column and then save the updated column back into the binary file. So far I have managed to figure out how to save a matrix as a binary file and how to read 1 'column' of the matrix from the binary file into memory. However, after I edit the contents of a column I cannot figure out how to save that column back into the binary file.
As an example, suppose the data file is a 32-bit identity matrix that has been saved to disk.
Nrow = 500
Ncol = 325
data = eye(Float32,Nrow,Ncol)
stream_data = open("data","w")
write(stream_data,data[:])
close(stream_data)
Reading the entire file from disk and then reshaping back into the matrix is straightforward:
stream_data = open("data","r")
data_matrix = read(stream_data,Float32,Nrow*Ncol)
data_matrix = reshape(data_matrix,Nrow,Ncol)
close(stream_data)
As I said before, the data-matrices I am working with are too large to read into memory and as a result the code written above would normally not be possible to execute. Instead, I need to work with 1 column at a time. The following is a solution to read 1 column (e.g. the 7th column) of the matrix into memory:
icol = 7
stream_data = open("data","r")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
data_col = read(stream_data,Float32,Nrow)
close(stream_data)
Note that the coefficient '4' in the 'position_data' variable is because I am working with Float32. Also, I don't fully understand what the seek command is doing here, but it seems to be giving me the correct output based on the following tests:
data == data_matrix # true
data[:,7] == data_col # true
For the sake of this problem, lets say I have determined that the column I loaded (i.e. the 7th column) needs to be replaced with zeros:
data_col = zeros(Float32,size(data_col))
The problem now, is to figure out how to save this column back into the binary file without affecting any of the other data. Naturally I intend to use 'write' to perform this task. However, I am not entirely sure how to proceed. I know I need to start by opening up a stream to the data; however I am not sure what 'mode' I need to use: "w", "w+", "a", or "a+"? Here is a failed attempt using "w":
icol = 7
stream_data = open("data","w")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
write(stream_data,data_col)
close(stream_data)
The original binary file (before my failed attempt to edit the binary file) occupied 650000 bytes on disk. This is consistent with the fact that the matrix is size 500x325 and Float32 numbers occupy 4 bytes (i.e. 4*500*325 = 650000). However, after my attempt to edit the binary file I have observed that the binary file now occupies only 14000 bytes of space. Some quick mental math shows that 14000 bytes corresponds to 7 columns of data (4*500*7 = 14000). A quick check confirms that the binary file has replaced all of the original data with a new matrix with size 500x7, and whose elements are all zeros.
stream_data = open("data","r")
data_new_matrix = read(stream_data,Float32,Nrow*7)
data_new_matrix = reshape(data_new_matrix,Nrow,7)
sum(abs(data_new_matrix)) # 0.0f0
What do I need to do/change in order to only modify only the 7th 'column' in the binary file?
Instead of
icol = 7
stream_data = open("data","w")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
write(stream_data,data_col)
close(stream_data)
in the OP, write
icol = 7
stream_data = open("data","r+")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
write(stream_data,data_col)
close(stream_data)
i.e. replace "w" with "r+" and everything works.
The reference to open is http://docs.julialang.org/en/release-0.4/stdlib/io-network/#Base.open and it explains the various modes. Preferably open shouldn't be used with the original somewhat confusing but definitely slower string parameter.
You can use SharedArrays for the need you describe:
data=SharedArray("/some/absolute/path/to/a/file", Float32,(Nrow,Ncols))
# do something with data
data[:,1]=a[:,1].+1
exit()
# restart julia
data=SharedArray("/some/absolute/path/to/a/file", Float32,(Nrow,Ncols))
#show data[1,1]
# prints 1
Now, be mindful that you're supposed to handle synchronisation to read/write from/to this file (if you have async workers) and that you're not supposed to change the size of the array (unless you know what you're doing).

Resources