How to export from data using esttab or estout - format

Say that I have these data:
sysuse auto2, clear
gen name = substr(make, 1,3)
drop if missing(rep78)
gen n = 1
collapse (mean) mpg (sum) n, by(name)
replace name = "a b: c" if _n==1
I would like to export them to an .rtf (.tex, etc.) file directly from the data using esttab or estout. Is this possible? The key reason I want to do this is that I want to be able to preserve the spaces in the row names. It would also be nice to have the option of commas as thousands separators.
One partial approach is to save the data to a matrix and then export the matrix using esttab, but can I avoid this extra step?
mkmat mpg n, matrix(mat) rownames(name)
esttab matrix(mat)
A problem with this is that it replaces the spaces in the row names with underscores. Another problem is that if any of the row names (from the variable name) contain a colon, this creates a category in the output. Is there another solution, either to export directly from the data or to somehow save the data in an estimation?

Instead of using collapse, you can calculate means and counts directly with estpost tabstat, statistics(mean count) by(). You can then use esttab to export the results.
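For example, a sketch of that route on the question's raw data (no collapse needed):
sysuse auto2, clear
gen name = substr(make, 1, 3)
drop if missing(rep78)
estpost tabstat mpg, statistics(mean count) by(name) nototal
esttab, cells("mean count") noobs nonumber nomtitle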
If you really want to create a dataset first, you can still use estpost tabstat. This appears to work for your dataset:
estpost tabstat mpg n, by(name) nototal
esttab, cells("mpg n") varlabels(`e(labels)') noobs nonumber nomtitle
If you want to have "a b: c" on top again you can use the order option of esttab.
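For example, something along these lines might work (an untested sketch; the row name contains spaces and a colon, so it may need compound quotes):
esttab, cells("mpg n") varlabels(`e(labels)') noobs nonumber nomtitle order(`"a b: c"')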

Shuffle One Variable Within Group

This question is an extension of the excellent answer provided by Robert Picard here: How to Randomly Assign to Groups of Different Sizes
We have this dataset, which is the same as in the previous question, but adds the year variable:
sysuse census, clear
keep state region pop
order state pop region
decode region, gen(reg)
replace reg="NCntrl" if reg=="N Cntrl"
drop region
gen year=20
replace year=30 if _n>15
replace year=40 if _n>35
If I just wanted to re-randomly assign reg's across all observations (without regard to group), I could implement the answer to the previous post:
tempfile orig
save `orig'
keep reg
rename reg reg_new
set seed 234
gen double u = runiform()
sort u reg_new
merge 1:1 _n using `orig', nogen
How would the code be modified so that reg is shuffled, but only within year? For example, there are 15 observations where year==20; these observations should be shuffled separately from those in the other years.
Shuffling one variable doesn't require any file choreography. This can probably be shortened:
sysuse auto, clear
set seed 2803
gen double shuffle = runiform()
* example 1
sort shuffle
gen long which = _n
sort mpg
gen mpg_new = mpg[which]
list which mpg*
* example 2
bysort foreign (shuffle) : gen long which2 = _n
bysort foreign (mpg) : gen mpg2 = mpg[which2]
list which2 mpg mpg2, sepby(foreign)
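Applied to the question's data, example 2 becomes a within-year shuffle of reg, along these lines (a sketch):
set seed 234
gen double shuffle = runiform()
bysort year (shuffle) : gen long which = _n
bysort year (reg) : gen reg_new = reg[which]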
All that said, I think sample does this, so long as you specify a sample size equal to the number of observations in the dataset. It's overkill because you get all the variables.

Exporting data into excel using iterative loop

I am doing an iterative calculation in Maple and I want to store the resulting data (which comes in a column matrix) from each iteration in a specific column of an Excel file. For example, my data are
mydat||1:= <<11,12,13,14>>:
mydat||2:= <<21,22,23,24>>:
mydat||3:= <<31,32,33,34>>:
and so on.
I am trying to export each of them into an Excel file, and I want each one to be stored in a consecutive column of the same file. For example, mydat||1 goes to column A, mydat||2 goes to column B, and so on. I tried something like the following.
with(ExcelTools):
for k from 1 to 3 do
Export(mydat||k, "data.xlsx", "Sheet1", "A:C"): #The problem is selecting the range.
end do:
How do I select the range appropriately here? Is there any other method to export the data and store in the way that I explained above?
There are a couple of ways to do this. The easiest is certainly to put all of your data into one data structure and then export that. For example:
mydat1:= <<11,12,13,14>>:
mydat2:= <<21,22,23,24>>:
mydat3:= <<31,32,33,34>>:
mydata := Matrix( < mydat1 | mydat2 | mydat3 > );
This stores your data in a Matrix where mydat1 is the first column, mydat2 is the second column, etc. With the data in this form, either ExcelTools:-Export or the more generic Export command will work:
ExcelTools:-Export( mydata, "data.xlsx" );
Export( "data.xlsx", mydata );
Now, since you mention that you are doing an iterative calculation, you may want to write the results out column by column. Here's another method that doesn't involve creating another data structure to house the results. This does assume that each of the mydat columns has been created before the loop.
for i to 3 do
ExcelTools:-Export( cat(`mydat`,i), "data.xlsx", 1, ["A1","B1","C1"][i] );
end do;
If you want to write the data out to a file as you are building it, then just do the Export call after the creation of each of the columns, i.e.
ExcelTools:-Export( mydat1, "data.xlsx", 1, "A1" );
Note that I removed the "||" characters. These are used in Maple for concatenation and caused some issues with the second method.
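If the number of columns is not fixed, you can also build the cell reference from the loop index instead of hard-coding a list. A sketch, assuming at most 26 columns (StringTools:-Char(65) is "A"):
for i to 3 do
    # build "A1", "B1", "C1", ... from the loop index
    ExcelTools:-Export( cat( `mydat`, i ), "data.xlsx", 1,
                        cat( StringTools:-Char( 64 + i ), "1" ) );
end do;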

RStudio Beginner: Joining tables

So I am doing a project on trip start and end points for a bike sharing program. I have two .csv files - one with the trips, which shows a start and end station ID (e.g. Start at 1, end at 5). I then have another .csv file which contains the lat/lon coordinates for each station number.
How do I join these together? I basically just want to create a lat and lon column alongside my trip data so it's one .csv file ready to be mapped.
I am completely new to R and programming/data in general so go easy! I realize it's probably super simple. I could do it by hand in Excel but I have over 100,000 trips so it might take a while...
Thanks in advance!
You should be able to achieve this using just Excel and the VLOOKUP function.
You would need your two CSV files in the same spreadsheet but on different tabs. Your stations would need to be in order of ID (you can sort them in Excel if you need to); then follow the instructions in the video below.
Example use of VLOOKUP.
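For reference, the lookup formulas would look something like this (sheet and range names here are illustrative: station IDs in column A of a Stations tab, lat in B, lon in C):
=VLOOKUP(A2, Stations!$A$2:$C$100, 2, FALSE)
=VLOOKUP(A2, Stations!$A$2:$C$100, 3, FALSE)
The first pulls the latitude and the second the longitude for the station ID in A2; FALSE requests an exact match.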
Hope that helps!
Here is a step-by-step on how to use start and end station ids from one csv, and get the corresponding latitude and longitudes from another.
In technical terms, this shows you how to make use of merge() to find commonalities between two data frames:
Files
Firstly, simple fake data for demonstration purposes:
coordinates.csv:
station_id,lat,lon
1,lat1,lon1
2,lat2,lon2
3,lat3,lon3
4,lat4,lon4
trips.csv:
start,end
1,3
2,4
Import
Start R or RStudio in the same directory containing the CSVs.
Then import the CSVs into two new data frames, trips and coords. In the R console:
> trips = read.csv('trips.csv')
> coords = read.csv('coordinates.csv')
Merges
A first merge can then be used to get start station's coordinates:
> trip_coords = merge(trips, coords, by.x = "start", by.y = "station_id")
by.x = "start" tells R that in the first data set trips, the unique id variable is named start
by.y = "station_id" tells R that in the second data set coords, the unique id variable is named station_id
this is an example of how to merge data frames when the same id variable is named differently in each data set, and you have to explicitly tell R
We check and see that trip_coords indeed has the combined data: start and end, but also the latitude and longitude for the station specified by start:
> head(trip_coords)
start end lat lon
1 1 3 lat1 lon1
2 2 4 lat2 lon2
Next, we want the latitude and longitude for end. We don't need to make a separate data frame, we can use merge() again, and build upon our trip_coords:
> trip_coords = merge(trip_coords, coords, by.x = "end", by.y = "station_id")
Check again:
> head(trip_coords)
end start lat.x lon.x lat.y lon.y
1 3 1 lat1 lon1 lat3 lon3
2 4 2 lat2 lon2 lat4 lon4
the .x and .y suffixes appear because merge() combined two data frames that both contain lat and lon columns: data frame 1 was trip_coords, which already had lat and lon from the first merge, and data frame 2, coords, also has lat and lon. merge() renames the columns so we can tell them apart:
for data frame 1, the original trip_coords, lat and lon are automatically renamed to lat.x and lon.x
for data frame 2, coords, lat and lon are automatically renamed to lat.y and lon.y
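As an aside, merge() lets you pick clearer suffixes yourself via its suffixes argument, e.g. for the second merge:
> trip_coords = merge(trip_coords, coords, by.x = "end", by.y = "station_id", suffixes = c(".start", ".end"))
which would label the columns lat.start/lon.start and lat.end/lon.end instead.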
But now, the default result puts variable end first. We may prefer to see the order start followed by end, so to fix this:
> trip_coords = trip_coords[c(2, 1, 3, 4, 5, 6)]
we re-order and then save the result back into trip_coords
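If you prefer not to count column positions, the same reordering can be done by name:
> trip_coords = trip_coords[c("start", "end", "lat.x", "lon.x", "lat.y", "lon.y")]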
We can check the results:
> head(trip_coords)
start end lat.x lon.x lat.y lon.y
1 1 3 lat1 lon1 lat3 lon3
2 2 4 lat2 lon2 lat4 lon4
Export
> write.csv(trip_coords, file = "trip_coordinates.csv", row.names = FALSE)
saves csv
file = sets the file path to save to. In this case it is just trip_coordinates.csv, so the file will appear in the current working directory, alongside the other CSVs
row.names = FALSE drops the automatic row numbers that would otherwise fill the first column
You can check the results, for example on Linux, on your command prompt:
$ cat trip_coordinates.csv
"","start","end","lat.x","lon.x","lat.y","lon.y"
"1",1,3,"lat1","lon1","lat3","lon3"
"2",2,4,"lat2","lon2","lat4","lon4"
So now you have a method for taking trips.csv, getting lat/lon for each of start and end, and outputting a csv again.
Automation
Remember that with R you can automate: save the exact commands you want to run in a script, say myscript.R. Then, if your source data changes and you wish to re-generate the latest trip_coordinates.csv without typing all those commands again, you have at least two options to run the script.
Within R, or the R console you see in RStudio:
> source('myscript.R')
Or, if on the Linux command prompt, use Rscript command:
$ Rscript myscript.R
and the trip_coordinates.csv would be automatically generated.
Further resources
How to Use the merge() Function...: good Venn diagrams of the different joins

Highlighting minimum row value in Pander

I am trying to display a data frame in an R Markdown document using the pander package.
I would like to highlight the minimum value in each row of values. Here's what I have tried:
df <- replicate(4, rnorm(5))
df <- as.data.frame(df)
df$min <- apply(df, 1, min)
emphasize.strong.cells(which(df == df$min, arr.ind = T))
pander(df[1:4])
When I do this I get the error:
Error in check.highlight.parameters(emphasize.strong.cells, nrow(t), ncol(t)) :
Too high number passed for column indexes that should be kept below 6
I can print out the whole table (with the min column) without any trouble or I can print out a partial table without emphasis, but neither of these is ideal. I want the highlighting, but I do not wish to include the 'min' column.
I imagine the fact that I am leaving some highlighted cells out of the pander command is causing the error.
Is there a way around this? Or a better way to do this?
Thanks.
Subquestion: What if I wanted to highlight the minimum in the first few rows and the maximum in the next few. Is that possible in a single table?
Instead of the which() lookup, which risks matching a row minimum in the wrong row, you can easily construct those array indices from a simple sequence (1:N) and a call to which.min on each row, e.g. with apply:
> df <- replicate(4, rnorm(5))
> df <- as.data.frame(df)
> emphasize.strong.cells(cbind(1:nrow(df), apply(df, 1, which.min)))
> pander(df)
----------------------------------------------
V1 V2 V3 V4
----------- ----------- ----------- ----------
0.6802 0.1409 **-0.7992** 0.1997
0.6797 **-0.2212** 1.016 0.6874
2.031 -0.009855 0.3881 **-1.275**
1.376 0.2619 **-2.337** -0.1066
**-0.4541** 1.135 -0.1566 0.2912
----------------------------------------------
About your follow-up question: you could of course do that in a single table, e.g. by rbind-ing two index matrices created similarly to the above with which.min and which.max.
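A sketch of that idea, highlighting the minimum in the first three rows and the maximum in the remaining two (the row split is arbitrary here):
> df <- as.data.frame(replicate(4, rnorm(5)))
> mins <- cbind(1:3, apply(df[1:3, ], 1, which.min))
> maxs <- cbind(4:5, apply(df[4:5, ], 1, which.max))
> emphasize.strong.cells(rbind(mins, maxs))
> pander(df)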

Stata - esttab and tabstat formatting

I am using esttab + tabstat to generate a .tex file to be opened in LaTeX. I am close to getting what I want, but there is one issue:
How do I get my standard deviations on the same line as the means? They currently show up on the line following each mean.
An MWE follows. Note that I am actually creating two tables and appending one to the other. This shows up as two separate tables in Stata, but it works in LaTeX after I modify the code slightly to save to a file rather than output to the screen. If there is a way to avoid appending and do everything at once, that would be super, but I am not aware of one. Note also that I am following the code on this site to go between the two programs.
sysuse auto, clear
*create new categorical variable
quietly gen mod= ""
quietly replace mod="odd" if mod(_n, 2) == 1
quietly replace mod="even" if mod(_n, 2) == 0
*create table - by foreign
quietly eststo clear
quietly estpost tabstat price, by(foreign) statistics(mean sd) listwise nototal
quietly est store A
quietly estpost tabstat mpg, by(foreign) statistics(mean sd) listwise nototal
quietly est store B
esttab A B, main(mean 2) aux(sd 2) label noobs parentheses ///
varlabels(`e(labels)') mtitle("Mean price" "Mean mpg") nostar ///
unstack nonote nonumber collabels(none) refcat(Domestic "Origin", nolabel)
*append to table - by mod
quietly estpost tabstat price, by(mod) statistics(mean sd) listwise nototal
quietly est store A
quietly estpost tabstat mpg, by(mod) statistics(mean sd) listwise nototal
quietly est store B
esttab A B, append main(mean 2) aux(sd 2) label noobs parentheses ///
varlabels(`e(labels)') mtitle("Mean price" "Mean mpg") nostar ///
unstack nonote nonumber collabels(none) refcat(even "Type", nolabel)
Update 1: I solved a problem previously included in this question, which had to do with decimal points not showing up in my LaTeX output. I was doing something wrong in LaTeX relating to the implementation of a package (I simply had to put in the correct number of columns).
Update 2: I figured out how to get the standard deviations in parentheses: remove plain from the code. I think it is the default, but include the parentheses option. I have updated the code and text to reflect this change.
Simply add onecell to the esttab call. See the documentation.
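That is, the first esttab call becomes (the appending call is analogous):
esttab A B, onecell main(mean 2) aux(sd 2) label noobs parentheses ///
    varlabels(`e(labels)') mtitle("Mean price" "Mean mpg") nostar ///
    unstack nonote nonumber collabels(none) refcat(Domestic "Origin", nolabel)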
