Stata - esttab and tabstat formatting

I am using esttab + tabstat to generate a .tex file to be opened in LaTeX. I am close to getting what I want, but there is one issue:
How do I get my standard deviations on the same line, after the means? They currently show up on the line following the mean.
A MWE follows. Note that I am actually creating two tables and appending them onto each other. This just shows up as two separate tables in Stata, but it works in LaTeX after I modify the code slightly so as to save to a file rather than output to the screen. If there is a way to avoid appending as I do and just do everything at once, that would be super, but I am not aware of one. Note also that I am following the code of this site to go between the two programs.
sysuse auto, replace
*create new categorical variable
quietly gen mod = ""
quietly replace mod="odd" if mod(_n, 2) == 1
quietly replace mod="even" if mod(_n, 2) == 0
*create table - by foreign
quietly eststo clear
quietly estpost tabstat price, by(foreign) statistics(mean sd) listwise nototal
quietly est store A
quietly estpost tabstat mpg, by(foreign) statistics(mean sd) listwise nototal
quietly est store B
esttab A B, main(mean 2) aux(sd 2) label noobs parentheses ///
varlabels(`e(labels)') mtitle("Mean price" "Mean mpg") nostar ///
unstack nonote nonumber collabels(none) refcat(Domestic "Origin", nolabel)
*append to table - by mod
quietly estpost tabstat price, by(mod) statistics(mean sd) listwise nototal
quietly est store A
quietly estpost tabstat mpg, by(mod) statistics(mean sd) listwise nototal
quietly est store B
esttab A B, append main(mean 2) aux(sd 2) label noobs parentheses ///
varlabels(`e(labels)') mtitle("Mean price" "Mean mpg") nostar ///
unstack nonote nonumber collabels(none) refcat(even "Type", nolabel)
Update 1: I solved a problem that I had previously included in this question. That problem had to do with decimal points not showing up in my LaTeX output, but I was doing something wrong in LaTeX relating to the implementation of a package. (I simply had to put in the correct number of columns.)
Update 2: I figured out how to get the standard deviations in parentheses: remove plain from the code. I think it is the default, but include the parentheses option anyway. I have updated the code and text to reflect this change.

Simply add onecell to the esttab call. See the documentation.
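For example, applied to the first table above (only the onecell option is new; the rest is the question's own code):
esttab A B, onecell main(mean 2) aux(sd 2) label noobs parentheses ///
varlabels(`e(labels)') mtitle("Mean price" "Mean mpg") nostar ///
unstack nonote nonumber collabels(none) refcat(Domestic "Origin", nolabel)
This combines each mean and its standard deviation into a single table cell, so they no longer land on separate rows.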


xtdpdml: how to draw interaction plots

I have one IV, one DV, one moderator, and two control variables.
The panel data is strongly balanced with 5 years.
Is there any way to draw an interaction plot of the xtdpdml results?
I've tried the usual methods (iv: independent variable with values from 1 to 6; mo: moderator with values from 1 to 6):
egen iv = rowmean(q1 q2 q3 q4 q5 q6)
egen dv = rowmean(qd1 qd2 qd3 qd4 qd5 qd6)
egen mo = rowmean(qm1 qm2 qm3 qm4 qm5 qm6)
gen im = iv * mo
xtdpdml dv, inv(gender birthyear) pre(iv mo im) fiml
summarize mo
global moa = round(r(mean) + r(sd),0.1)
global mo = round(r(mean),0.1)
global mob = round(r(mean) - r(sd),0.1)
margins, at(iv=(1(1)6) mo=($moa $mo $mob))
and an error message showed up saying 'iv ambiguous abbreviation'.
EDIT: This answer was in response to a previous version of the question using quite different variable names. It addresses only a side issue: how to present numeric results rounded to a desired number of decimal places.
Your code doesn't mention pnv so it is hard to comment.
I see what you want to do, but that is an awkward way to do it, and fallible. The problem is that Stata works in binary and can only hold a few of the decimal fractions 0.01(0.01)0.99 exactly, specifically 0.25 0.50 0.75. That's 3 out of 99; the rest must be held as approximations. Other Stata code means that you won't always see the approximations, but it's not guaranteed. I would try this, which makes the rounding into the formatting question it really is:
summarize sns
local snsa : di %2.1f r(mean) + r(sd)
local sns: di %2.1f r(mean)
local snsb: di %2.1f r(mean) - r(sd)
margins, at(priv=(0(1)6) sns=(`snsa' `sns' `snsb'))
I can't rule out the real problem being in some other part of the code.
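Translated to the current question's variable names, the same idea would look as follows. This is only a sketch of the rounding fix; it does not address the 'ambiguous abbreviation' error itself:
summarize mo
* format the cutpoints rather than rounding them, per the explanation above
local moa : di %2.1f r(mean) + r(sd)
local mo : di %2.1f r(mean)
local mob : di %2.1f r(mean) - r(sd)
margins, at(iv=(1(1)6) mo=(`moa' `mo' `mob'))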

How to export from data using esttab or estout

Say that I have these data:
sysuse auto2, clear
gen name = substr(make, 1,3)
drop if missing(rep78)
gen n = 1
collapse (mean) mpg (sum) n, by(name)
replace name = "a b: c" if _n==1
I would like to export them to an .rtf (.tex, etc.) file directly from the data using esttab or estout. Is this possible? The key reason I want to do this is that I want to be able to preserve the spaces in the row names. It would also be nice to have the option of commas as thousands separators.
One partial approach is to save the data to a matrix, then export the matrix using esttab, but can I avoid this extra step?
mkmat mpg n, matrix(mat) rownames(name)
esttab matrix(mat)
A problem with this is that it replaces the spaces in the row names with underscores. Another problem is that if any of the row names (from the variable name) contain a :, this creates a category in the output. Is there another solution, either to directly export from the data or to somehow save the data in an estimation?
Instead of using collapse, you can calculate means and counts directly with estpost tabstat using statistics(mean count) and by(). You can then use esttab to export the results, as sketched below.
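A minimal sketch of that direct route (the filename mytable.rtf is just an example):
sysuse auto2, clear
gen name = substr(make, 1, 3)
drop if missing(rep78)
* post mean and count of mpg by name, then export; mytable.rtf is a placeholder filename
estpost tabstat mpg, by(name) statistics(mean count) nototal
esttab using mytable.rtf, cells("mean count") noobs nonumber nomtitle replace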
If you really want to create a dataset first, you can still use estpost tabstat. This appears to work for your dataset:
estpost tabstat mpg n, by(name) nototal
esttab, cells("mpg n") varlabels(`e(labels)') noobs nonumber nomtitle
If you want to have "a b: c" on top again, you can use the order option of esttab.

Stanford CRFClassifier performance evaluation output

I'm following this FAQ https://nlp.stanford.edu/software/crf-faq.shtml to train my own classifier, and I noticed that the performance evaluation output does not match the results (or at least not in the way I expect).
Specifically, this section:
CRFClassifier tagged 16119 words in 1 documents at 13824.19 words per second.
Entity P R F1 TP FP FN
MYLABEL 1.0000 0.9961 0.9980 255 0 1
Totals 1.0000 0.9961 0.9980 255 0 1
I expect TP to be all instances where the predicted label matched the golden label, FP to be all instances where MYLABEL was predicted but the golden label was O, FN to be all instances where O was predicted but the golden was MYLABEL.
If I calculate those numbers myself from the output of the program, I get completely different numbers with no relation to what the program prints. I've tried this with various test files.
I'm using Stanford NER - v3.7.0 - 2016-10-31
Am I missing something?
The F1 scores are over entities, not labels.
Example:
(Joe, PERSON) (Smith, PERSON) (went, O) (to, O) (Hawaii, LOCATION) (., O).
In this example there are two possible entities:
Joe Smith PERSON
Hawaii LOCATION
Entities are created by taking all adjacent tokens with the same label (unless you use a more complicated BIO labeling scheme; BIO schemes have tags like I-PERSON and B-PERSON to indicate whether a token begins a new entity or continues one, etc.).
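As a cross-check, the printed scores follow directly from the entity-level counts in the output above:
P = TP / (TP + FP) = 255 / (255 + 0) = 1.0000
R = TP / (TP + FN) = 255 / (255 + 1) ≈ 0.9961
F1 = 2PR / (P + R) ≈ 0.9980
Recomputing these per token rather than per entity will give different numbers, which is what you are seeing.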

RStudio Beginner: Joining tables

So I am doing a project on trip start and end points for a bike sharing program. I have two .csv files: one with the trips, which shows a start and end station ID (e.g. start at 1, end at 5), and another which contains the lat/lon coordinates for each station number.
How do I join these together? I basically just want to create lat and lon columns alongside my trip data so it's one .csv file ready to be mapped.
I am completely new to R and programming/data in general, so go easy! I realize it's probably super simple. I could do it by hand in Excel, but I have over 100,000 trips so it might take a while...
Thanks in advance!
You should be able to achieve this using just Excel and the VLOOKUP function.
You would need your two CSV files in the same spreadsheet but on different tabs. Your stations would need to be in order of ID (you can sort them in Excel if you need to), and then follow the instructions in the video below.
Example use of VLOOKUP.
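For instance, if the stations tab is named Stations and holds ID, lat, and lon in columns A:C, a formula along these lines in the trips tab would pull the latitude for the start-station ID in A2 (the cell and sheet references here are illustrative):
=VLOOKUP(A2, Stations!$A:$C, 2, FALSE)
Using 3 as the column index returns the longitude, and the same pattern applied to the end-station column covers the end coordinates (FALSE requests an exact match).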
Hope that helps!
Here is a step-by-step guide on how to use the start and end station IDs from one CSV and get the corresponding latitudes and longitudes from another.
In technical terms, this shows you how to make use of merge() to find commonalities between two data frames:
Files
Firstly, simple fake data for demonstration purposes:
coordinates.csv:
station_id,lat,lon
1,lat1,lon1
2,lat2,lon2
3,lat3,lon3
4,lat4,lon4
trips.csv:
start,end
1,3
2,4
Import
Start R or RStudio in the same directory containing the CSVs.
Then import the CSVs into two new data frames, trips and coords. In the R console:
> trips = read.csv('trips.csv')
> coords = read.csv('coordinates.csv')
Merges
A first merge can then be used to get start station's coordinates:
> trip_coords = merge(trips, coords, by.x = "start", by.y = "station_id")
by.x = "start" tells R that in the first data set trips, the unique id variable is named start
by.y = "station_id" tells R that in the second data set coords, the unique id variable is named station_id
this is an example of how to merge data frames when the same ID variable is named differently in each data set and you have to tell R explicitly
We check and see that trip_coords has indeed combined the data, having start and end but also the latitude and longitude for the station specified by start:
> head(trip_coords)
start end lat lon
1 1 3 lat1 lon1
2 2 4 lat2 lon2
Next, we want the latitude and longitude for end. We don't need to make a separate data frame; we can use merge() again and build upon our trip_coords:
> trip_coords = merge(trip_coords, coords, by.x = "end", by.y = "station_id")
Check again:
> head(trip_coords)
end start lat.x lon.x lat.y lon.y
1 3 1 lat1 lon1 lat3 lon3
2 4 2 lat2 lon2 lat4 lon4
The .x and .y suffixes appear because merge() combined two data frames that both contain lat and lon columns: data frame 1 was trip_coords, which already had a lat and lon, and data frame 2, coords, also has lat and lon. So merge() renames them to tell them apart:
for data frame 1, the original trip_coords, lat and lon are automatically renamed to lat.x and lon.x
for data frame 2, coords, lat and lon are automatically renamed to lat.y and lon.y
But now the default result puts the variable end first. We may prefer to see start followed by end, so to fix this:
> trip_coords = trip_coords[c(2, 1, 3, 4, 5, 6)]
we re-order and then save the result back into trip_coords
We can check the results:
> head(trip_coords)
start end lat.x lon.x lat.y lon.y
1 1 3 lat1 lon1 lat3 lon3
2 2 4 lat2 lon2 lat4 lon4
Export
> write.csv(trip_coords, file = "trip_coordinates.csv", row.names = FALSE)
This saves the CSV, where:
file = sets the file path to save to; in this case just trip_coordinates.csv, so it will appear in the current working directory alongside the other CSVs
row.names = FALSE suppresses the automatic row numbers that would otherwise fill the first column
You can check the results, for example on Linux, on your command prompt:
$ cat trip_coordinates.csv
"","start","end","lat.x","lon.x","lat.y","lon.y"
"1",1,3,"lat1","lon1","lat3","lon3"
"2",2,4,"lat2","lon2","lat4","lon4"
So now you have a method for taking trips.csv, getting lat/lon for each of start and end, and outputting a csv again.
Automation
Remember that with R you can automate: write the exact commands you want to run and save them in a myscript.R (a sketch of its contents follows below). Then, if your source data changes and you wish to re-generate the latest trip_coordinates.csv without typing all those commands again, you have at least two options for running the script:
Within R or the R console you see in rstudio:
> source('myscript.R')
Or, if on the Linux command prompt, use Rscript command:
$ Rscript myscript.R
and the trip_coordinates.csv would be automatically generated.
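For reference, myscript.R would simply collect the commands from above into one file, for example:
# myscript.R: join trip start/end stations to their coordinates
trips = read.csv('trips.csv')
coords = read.csv('coordinates.csv')
# attach coordinates for the start station, then for the end station
trip_coords = merge(trips, coords, by.x = "start", by.y = "station_id")
trip_coords = merge(trip_coords, coords, by.x = "end", by.y = "station_id")
# restore start-before-end column order and export
trip_coords = trip_coords[c(2, 1, 3, 4, 5, 6)]
write.csv(trip_coords, file = "trip_coordinates.csv", row.names = FALSE)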
Further resources
How to Use the merge() Function...: good Venn diagrams of the different joins

Removing an entire entry if a value is less than a desired amount

I have a long list made up of text like this:
Email: example@example.com
Language Spoken: Sample
Points: 52600
Lifetime points: 100000
Country: US
Number: 1234
Gender: Male
Status: Activated
=============================================
I need a way of filtering this list so that only students with more than 52600 points get shown. I am currently looking at solutions for this; I thought maybe Excel would be a start, but I am not too sure and wanted input.
Here's a solution in Excel:
1) Copy Text into Column A
2) In B1 enter "1", then in B2 enter the formula: =IF(LEFT(A1,1)="=",B1+1,B1), then copy that down to the end.
(This splits the text into groups divided by the equal signs)
3) In C1 enter the formula: =IF(LEFT(A1,8)="Points: ",VALUE(RIGHT(A1,LEN(A1)-8)),0), then copy that down to the end.
(Basically this is populating the points in column C)
4) In D1 enter the formula: =SUMIF(B:B,B1,C:C), then copy that down to the end.
(This sums the points in column C by the group numbers in column B)
5) Finally put a filter on Column D, and filter by greater than or equal to the amount desired.
