StanfordNLP Training Steps Verification and loadClassifier check

I need help verifying the training steps below. Also, can I add my own classifier to the -loadClassifier list, like this?
-loadClassifier sample-ner-model.ser.gz, classifiers/english.all.3class.distsim.crf.ser.gz,classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz \
The fate of Lehman Brothers, the beleaguered investment bank, hung in the balance on Sunday as Federal Reserve officials and the leaders of major financial institutions continued to gather in emergency meetings trying to complete a plan to rescue the stricken bank. Several possible plans emerged from the talks, held at the Federal Reserve Bank of New York and led by Timothy R. Geithner, the president of the New York Fed, and Treasury Secretary Henry M. Paulson Jr.
Step 1 Tokenize
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer sample.txt > sample.tok
. . .
Step 2 Classify
Need a better command to replace each EOL "\n" with "\tO\n". The Perl chomp one-liner below is not working, so I edited sample.tsv manually.
perl -ne 'chomp; print "$_\tO"' sample.tok > sample.tsv
The 0
fate 0
of 0
Lehman 0
Brothers 0
, 0
the 0
beleaguered 0
investment 0
bank 0
, 0
hung 0
in 0
the 0
balance 0
. . .
president 0
of 0
the 0
New 0
York 0
Fed 0
, 0
and 0
Treasury 0
Secretary 0
Henry 0
M. 0
Paulson 0
Jr. 0
. 0
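As an alternative to the Perl one-liner in Step 2, the same tab-and-label step can be sketched in Python. This is only a minimal sketch: the function names are made up for illustration, it assumes one token per line as in sample.tok, and it uses the letter O as the background label, as the Perl attempt does.

```python
def add_background_label(token_lines, label="O"):
    """Append a tab and a label to every token line; keep blank lines blank."""
    tagged = []
    for line in token_lines:
        token = line.rstrip("\r\n")
        # Blank lines are passed through unchanged, without a label.
        tagged.append(f"{token}\t{label}\n" if token else "\n")
    return tagged


def convert(tok_path, tsv_path, label="O"):
    """Read a one-token-per-line file and write the labelled TSV file."""
    with open(tok_path, encoding="utf-8") as src, \
         open(tsv_path, "w", encoding="utf-8") as dst:
        dst.writelines(add_background_label(src, label))
```

Calling convert("sample.tok", "sample.tsv") would then produce the file that is edited by hand in Step 4.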
Step 3 Adjust Properties (sample.prop)
# location of the training file
trainFile = sample.tsv
# location where you would like to save (serialize) your
# classifier; adding .gz at the end automatically gzips the file,
# making it smaller, and faster to load
serializeTo = sample-ner-model.ser.gz
. . .
Step 4 Modify Gold Standard (sample.tsv)
The 0
fate 0
of 0
Lehman ORG
Brothers ORG
, 0
the 0
beleaguered 0
investment 0
bank 0
, 0
hung 0
in 0
the 0
balance 0
. . .
president 0
of 0
the 0
York ORG
, 0
and 0
Treasury PERS
Secretary PERS
Henry PERS
Paulson PERS
. 0
Step 5 Train
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop sample.prop
Step 6 Test and Verify
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier sample-ner-model.ser.gz -testFile sample.tsv
Production Maybe:
java -mx1g -textFile sample.txt -ner.model \
-loadClassifier classifiers/english.all.3class.distsim.crf.ser.gz,classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz \
-outputFormat tabbedEntities -textFile sample.txt > sampleNew.tsv

This seems correct to me.
Yes, if you build a new model with Stanford CoreNLP you can just add it into the list.
Note that the models run in order: earlier NER taggers in the list tag first, and later models cannot overwrite the tags (e.g. ORG, PER) written by previous ones (except O, of course). So the position of a model in the list matters: the closer to the front, the higher its priority.
Also ner.combinationMode = HIGH_RECALL will allow every classifier in the list to apply all of their tags. ner.combinationMode = NORMAL means only the first classifier that applies a tag (e.g. ORG, PER) can apply it. You can set ner.combinationMode in the .prop file.
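To make the ordering rule and the two combination modes concrete, here is a simplified token-level model of the behaviour described above. It is not CoreNLP code: the function name and the example tags are made up, and it assumes that "applies a tag" means the model produced that tag somewhere in the text.

```python
def combine_tags(model_outputs, mode="NORMAL", background="O"):
    """Merge per-token tags from several models, earliest model first."""
    if mode not in ("NORMAL", "HIGH_RECALL"):
        raise ValueError("mode must be NORMAL or HIGH_RECALL")
    merged = list(model_outputs[0])
    # Tags already applied by earlier models.
    claimed = {tag for tag in merged if tag != background}
    for tags in model_outputs[1:]:
        applied_here = {tag for tag in tags if tag != background}
        for i, tag in enumerate(tags):
            if tag == background or merged[i] != background:
                continue  # nothing to add, or an earlier model already tagged this token
            if mode == "NORMAL" and tag in claimed:
                continue  # in NORMAL mode only the first model to use a tag may apply it
            merged[i] = tag
        claimed |= applied_here
    return merged


custom = ["O", "ORG", "ORG", "O", "O"]      # hypothetical output of sample-ner-model
stock = ["O", "LOC", "O", "PERSON", "ORG"]  # hypothetical output of a stock model
print(combine_tags([custom, stock], "NORMAL"))
print(combine_tags([custom, stock], "HIGH_RECALL"))
```

In both modes the second token keeps the ORG tag from the first model. Only HIGH_RECALL lets the second model add its own ORG tag on the last token.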


Invalid syntax loop in Stata

I'm trying to run a for loop to make a balance table in Stata (comparing the demographics of my dataset with national-level statistics)
For this, I'm prepping my dataset and attempting to calculate the percentages/averages for some key demographics.
rename unearnedinc_wins95 unearninc_wins95
foreach var of varlist fem age nonwhite hhsize parent employed savings_wins95 debt_wins95 earnedinc_wins95 unearninc_wins95 underfpl2019 { //continuous or binary; to put categorical vars use kwallis test
dis "for variable `var':"
tabstat `var'
summ `var'
local `var'_samplemean=r(mean)
}
set obs 11
gen var=""
gen sample=.
gen F=.
gen pvalue=.
replace var="% Female" if _n==1
replace var="Age" if _n==2
replace var="% Non-white" if _n==3
replace var="HH size" if _n==4
replace var="% Parent" if _n==5
replace var="% Employed" if _n==6
replace var="Savings stock ($)" if _n==7
replace var="Debt stock ($)" if _n==8
replace var="Earned income last mo. ($)" if _n==9
replace var="Unearned income last mo. ($)" if _n==10
replace var="% Under FPL 2019" if _n==11
foreach col of varlist sample {
replace `col'=100*round(`fem_`col'mean', 0.01) if _n==1
replace `col'=round(`age_`col'mean') if _n==2
replace `col'=100*round(`nonwhite_`col'mean', 0.01) if _n==3
replace `col'=round(`hhsize_`col'mean', 0.1) if _n==4
replace `col'=100*round(`parent_`col'mean', 0.01) if _n==5
replace `col'=100*round(`employed_`col'mean', 0.01) if _n==6
replace `col'=round(`savings_wins95_`col'mean') if _n==7
replace `col'=round(`debt_wins95_`col'mean') if _n==8
replace `col'=round(`earnedinc_wins95_`col'mean') if _n==9
replace `col'=round(`unearninc_wins95_`col'mean') if _n==10
replace `col'=100*round(`underfpl2019_`col'mean', 0.01) if _n==11
}
I'm trying to run the following loop, but in the second half of the loop, I keep getting an 'invalid syntax' error. For context, in the first half of the loop (before clearing the dataset), the code stores the average values of the variables as a macro (`var'_samplemean). Can someone help me out and mend this loop?
My sample data:
input byte fem float(age nonwhite) byte(hhsize parent) float employed double(savings_wins95 debt_wins95 earnedinc_wins95 unearninc_wins95) float underfpl2019
1 35 1 6 1 1 0 2500 0 0 0
0 40 0 4 1 1 0 10000 1043 0 0
0 40 0 4 1 1 0 20000 2400 0 0
0 40 0 4 1 1 .24 20000 2000 0 0
0 40 0 4 1 1 10 . 2600 0 0
end
Thanks for sharing the snippet of data. Apart from the fact the variable unearninc_wins95 has already been renamed in your sample data, the code runs fine for me without returning an error.
That being said, the columns for your F-statistics and p-values are empty once the loop at the bottom of your code completes. As far as I can see there is no local/varlist called sample which you're attempting to call with the line foreach col of varlist sample{. This could be because you haven't included it in your code, in which case please do, or it could be because you haven't created the local/varlist sample, in which case this could well be the source of your error message.
Taking a step back, there are more efficient ways of achieving what I think you're after. For example, you can get (part of) what you want using the package stat2data (if you don't have it installed already, run ssc install stat2data from the command prompt). You can then run the following code:
stat2data fem age nonwhite hhsize parent employed savings_wins95 debt_wins95 earnedinc_wins95 unearninc_wins95 underfpl2019, saving("~/yourstats.dta") stat(count mean)
*which returns:
use "~/yourstats.dta", clear
. list, sep(11)
| _name sN smean |
1. | fem 5 .2 |
2. | age 5 39 |
3. | nonwhite 5 .2 |
4. | hhsize 5 4.4 |
5. | parent 5 1 |
6. | employed 5 1 |
7. | savings_wins 5 2.048 |
8. | debt_wins95 4 13125 |
9. | earnedinc_wi 5 1608.6 |
10. | unearninc_wi 5 0 |
11. | underfpl2019 5 0 |
This is missing the empty F-statistic and p-value variables you created in your code above, but you can always add them in the same way you have with gen F=. and gen pvalue=.. The presence of these variables though indicates you want to run some tests at some point and then fill the cells with values from them. I'd offer advice on how to do this but it's not obvious to me from your code what you want to test. If you can clarify this I will try and edit this answer to include that.
This doesn't answer your question directly; as others gently point out the question is hard to answer without a reproducible example. But I have several small comments on your code which are better presented in this form.
Assuming that all the variables needed are indeed present in the dataset, I would recommend something more like this:
local myvarlist fem age nonwhite hhsize parent employed savings_wins95 debt_wins95 earnedinc_wins95 unearninc_wins95 underfpl2019
local desc `" "% Female" "Age" "% Non-white" "HH size" "% Parent" "% Employed" "Savings stock ($)" "Debt stock ($)" "Earned income last mo. ($)" "Unearned income last mo. ($)" "% Under FPL 2019" "'
gen variable = ""
gen mean = ""
local i = 1
foreach var of local myvarlist {
    summ `var', meanonly
    local this : word `i' of `desc'
    replace variable = "`this'" in `i'
    * proportions (rows 1, 3, 5, 6, 11) are shown as percentages
    if inlist(`i', 1, 3, 5, 6, 11) {
        replace mean = strofreal(100 * r(mean), "%2.0f") in `i'
    }
    else if `i' == 4 {
        replace mean = strofreal(r(mean), "%2.1f") in `i'
    }
    else replace mean = strofreal(r(mean), "%2.0f") in `i'
    local ++i
}
This has not been tested.
Points arising include:
Using in is preferable for what you want over testing the observation number with if.
round() is treacherous for rounding to so many decimal places. Most of the time you will get what you want, but occasionally you will get bizarre results arising from the fact that Stata works in binary, like any equivalent program. It is safer to treat rounding as a problem in string manipulation and use display formats as offering precisely what you want.
If the text you want to show is just the variable label for each variable, this code could be simplified further.
The code hints at intent to show other stuff, which is easily done compatibly with this design.
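The point about round() is not specific to Stata. As a small illustration in Python (the helper name show_mean is made up for this sketch), a value rounded to two decimals is still a binary floating-point number and is usually not exactly the decimal you asked for, whereas formatting to a string gives exactly the digits requested:

```python
from decimal import Decimal


def show_mean(value, decimals):
    """Return the value as text with a fixed number of decimal places."""
    return format(value, f".{decimals}f")


mean = 0.286
rounded = round(mean, 2)

# Decimal(float) shows the exact binary value that is actually stored.
print(Decimal(rounded))            # close to, but not exactly, 0.29
print(show_mean(mean, 2))          # the text "0.29"
print(show_mean(100 * mean, 0))    # the percentage as text, "29"
```

This mirrors the use of strofreal() with a display format in the Stata code above.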

How to avoid row names in further analysis in R?

I'm running the following example from the GGEBiplotGUI package and, of course, it works properly.
GGEBiplot(Data = Ontario)
But when I download the "Ontario" data and run the same script on my PC from a CSV file, it does not behave the same way. See the example below.
Ontario <- read.csv("Book.csv")
GGEBiplot(Data = Ontario)
The result is the following table (columns 0 to 10), in which the row numbers (1 to 17) are taken as genotypes and "X" as another location.
See the result below.
X BH93 EA93 HW93 ID93 KE93 NN93 OA93 RN93 WP93
1 ann 4.460 4.150 2.849 3.084 5.940 4.450 4.351 4.039 2.672
2 ari 4.417 4.771 2.912 3.506 5.699 5.152 4.956 4.386 2.938
3 aug 4.669 4.578 3.098 3.460 6.070 5.025 4.730 3.900 2.621
4 cas 4.732 4.745 3.375 3.904 6.224 5.340 4.226 4.893 3.451
5 del 4.390 4.603 3.511 3.848 5.773 5.421 5.147 4.098 2.832
6 dia 5.178 4.475 2.990 3.774 6.583 5.045 3.985 4.271 2.776
7 ena 3.375 4.175 2.741 3.157 5.342 4.267 4.162 4.063 2.032
8 fun 4.852 4.664 4.425 3.952 5.536 5.832 4.168 5.060 3.574
9 ham 5.038 4.741 3.508 3.437 5.960 4.859 4.977 4.514 2.859
10 har 5.195 4.662 3.596 3.759 5.937 5.345 3.895 4.450 3.300
11 kar 4.293 4.530 2.760 3.422 6.142 5.250 4.856 4.137 3.149
12 kat 3.151 3.040 2.388 2.350 4.229 4.257 3.384 4.071 2.103
13 luc 4.104 3.878 2.302 3.718 4.555 5.149 2.596 4.956 2.886
14 m12 3.340 3.854 2.419 2.783 4.629 5.090 3.281 3.918 2.561
15 reb 4.375 4.701 3.655 3.592 6.189 5.141 3.933 4.208 2.925
16 ron 4.940 4.698 2.950 3.898 6.063 5.326 4.302 4.299 3.031
17 rub 3.786 4.969 3.379 3.353 4.774 5.304 4.322 4.858 3.382
How can I fix this problem? That is, how do I keep the row names and "X" from being treated as variables in the GGEBiplotGUI analysis?
I have also tried the following code, which didn't work:
attributes(Ontario)$row.names <- NULL
print(Ontario, row.names = F)
row.names(Ontario) <- NULL
Ontario[, -1] ## It deletes the first column not the 0 one.
Many thanks in advance!
This code worked properly. Reading the first column in as row names keeps it out of the data:
Ontario <- read.csv("Libro.csv", row.names = 1)
GGEBiplot(Data = Ontario)

Sorting data with gnuplot

Sometimes it might be required to sort data. Unfortunately, gnuplot (as far as I know) doesn't offer this possibility. Of course, you can use external tools like awk, Perl, Python, etc. However, for maximum platform independence, to avoid installing additional programs and the related complications, and also out of curiosity, I was interested in whether gnuplot can nevertheless sort somehow.
I will be grateful for comments on improvements and limitations.
Does anybody have ideas how to sort alphanumerical data with gnuplot only?
### Sorting with gnuplot
reset session
# generate some random example data
N = 10
set samples N
RandomNo(n) = sprintf("%.02f",rand(0)*n)
set table $Data
plot '+' u (RandomNo(10)):(RandomNo(10)):(RandomNo(10)) w table
unset table
print $Data
# Settings for sorting
ColNo = 2 # ColumnNo for sorting
stats $Data nooutput # get the number of rows if data is from file
RowCount = STATS_records # with the example data above, of course RowCount=N
# create the sortkey and put it into an array
array SortKey[RowCount]
set table $Dummy
plot $Data u (SortKey[$0+1] = sprintf("%.06f%02d",column(ColNo),$0+1)) w table
unset table
# print $Dummy
# get lines as whole into array
set datafile separator "\n"
array DataSeq[RowCount]
set table $Dummy2
plot $Data u (SortKey[$0+1]):(DataSeq[$0+1] = stringcolumn(1)) with table
unset table
print $Dummy2
set datafile separator whitespace
# do the actual sorting with 'smooth unique'
set table $Dummy3
plot $Dummy2 u 1:0 smooth unique
unset table
# print $Dummy3
# extract the sorted sortkeys
set table $Dummy4
plot $Dummy3 u (SortKey[$0+1]=$2) with table
unset table
# print $Dummy4
# create the table with sorted lines
set table $DataSorted
plot $Data u (DataSeq[SortKey[$0+1]+1]) with table
unset table
print $DataSorted
### end of code
The output consists of three datablocks: first the unsorted data, second the intermediate block with the sort keys, and third the data sorted by the second column.
5.24 6.68 3.09
1.64 1.27 9.82
6.44 9.23 7.03
8.14 8.87 3.82
4.27 5.98 0.93
7.96 3.64 6.15
6.21 6.28 6.17
1.52 3.17 3.58
4.24 2.16 8.99
8.73 6.54 1.13
6.68000001 5.24 6.68 3.09
1.27000002 1.64 1.27 9.82
9.23000003 6.44 9.23 7.03
8.87000004 8.14 8.87 3.82
5.98000005 4.27 5.98 0.93
3.64000006 7.96 3.64 6.15
6.28000007 6.21 6.28 6.17
3.17000008 1.52 3.17 3.58
2.16000009 4.24 2.16 8.99
6.54000010 8.73 6.54 1.13
1.64 1.27 9.82
4.24 2.16 8.99
1.52 3.17 3.58
7.96 3.64 6.15
4.27 5.98 0.93
6.21 6.28 6.17
8.73 6.54 1.13
5.24 6.68 3.09
8.14 8.87 3.82
6.44 9.23 7.03
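For comparison, the idea behind the sort key above (the value to sort on combined with the row number, so that every key is unique and points back to its original line) can be sketched in Python. This is only an illustration of the technique, not a replacement for the gnuplot-only approach. The function name is made up, and a tuple takes the place of the formatted key string.

```python
def sort_lines_by_column(lines, col_no):
    """Return whole lines ordered by the numeric value in column col_no (1-based)."""
    keys = []
    for row, line in enumerate(lines):
        value = float(line.split()[col_no - 1])
        # The row number makes each key unique and remembers the original line.
        keys.append((value, row))
    keys.sort()
    return [lines[row] for _, row in keys]


data = [
    "5.24 6.68 3.09", "1.64 1.27 9.82", "6.44 9.23 7.03", "8.14 8.87 3.82",
    "4.27 5.98 0.93", "7.96 3.64 6.15", "6.21 6.28 6.17", "1.52 3.17 3.58",
    "4.24 2.16 8.99", "8.73 6.54 1.13",
]
for line in sort_lines_by_column(data, 2):
    print(line)
```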
For curiosity, I wanted to know whether an alphanumerical sort could be implemented with gnuplot code only.
This avoids the need for external tools and ensures maximum platform compatibility.
I haven't heard yet about an external tool which could assist gnuplot and which works under Windows and Linux and MacOS.
I am happy to take comments and suggestions about bugs, simplifications, improvements, performance comparisons, and limits.
For alphanumerical sort, the first stage is alphanumerical string comparison, which to my knowledge does not exist in gnuplot directly. So, the first part Compare.plt is about comparison of strings.
### compare function for strings
# Compare.plt
# function cmp(a,b,cs) returns a<b:-1, a==b:0, a>b:+1
# cs=0: case-insensitive, cs=1: case-sensitive
reset session
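# printable ASCII characters 32..126 in order, so that ord(c) below returns the character code
ASCII = ' !"' . "#$%&'()*+,-./0123456789:;<=>?" . \
        '@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`' . \
        'abcdefghijklmnopqrstuvwxyz{|}~'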
ord(c) = strstrt(ASCII,c)>0 ? strstrt(ASCII,c)+31 : 0
# comparing char: case-sensitive
cmpcharcs(c1,c2) = sgn(ord(c1)-ord(c2))
# comparing char: case-insensitive
cmpcharci(c1,c2) = sgn(( cmpcharci_o1=ord(c1), ((cmpcharci_o1>96) && (cmpcharci_o1<123)) ?\
cmpcharci_o1-32 : cmpcharci_o1) - \
( cmpcharci_o2=ord(c2), ((cmpcharci_o2>96) && (cmpcharci_o2<123)) ?\
cmpcharci_o2-32 : cmpcharci_o2) )
# function cmp returns a<b:-1, a==b:0, a>b:+1
# cs=0: case-insensitive, cs=1: case-sensitive
cmp(a,b,cs) = ((cmp_r=0, cmp_flag=0, cmp_maxlen=strlen(a)>strlen(b) ? strlen(a) : strlen(b)),\
(sum[cmp_i=1:cmp_maxlen] \
((cmp_flag==0 && (cmp_c1 = substr(a,cmp_i,cmp_i), cmp_c2 = substr(b,cmp_i,cmp_i), \
(cmp_r = (cs==0 ? cmpcharci(cmp_c1,cmp_c2) : cmpcharcs(cmp_c1,cmp_c2) ) )!=0 ? \
(cmp_flag=1, cmp_r) : 0)), 1 )), cmp_r)
cmpsymb(a,b,cs) = (cmpsymb_r = cmp(a,b,cs))<0 ? "<" : cmpsymb_r>0 ? ">" : "="
### end of code
### example compare strings
load "Compare.plt"
a = "Alligator"; b = "Tiger"
print sprintf("% 2d: % 9s% 2s% 6s", cmp(a,b,0), a, cmpsymb(a,b,0), b)
a = "Tiger"; b = "Tiger"
print sprintf("% 2d: % 9s% 2s% 6s", cmp(a,b,0), a, cmpsymb(a,b,0), b)
a = "Zebra"; b = "Tiger"
print sprintf("% 2d: % 9s% 2s% 6s", cmp(a,b,0), a, cmpsymb(a,b,0), b)
### end of code
-1: Alligator < Tiger
0: Tiger = Tiger
1: Zebra > Tiger
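For readers who find the nested gnuplot expressions hard to follow, the logic of cmp(a,b,cs) can be sketched in Python. It is a sketch under assumptions: the name compare_strings is made up, the input is assumed to be plain ASCII, and a string that is a prefix of the other is treated as the smaller one.

```python
def compare_strings(a, b, cs):
    """Return -1 if a<b, 0 if a==b, +1 if a>b; cs=0 ignores case, cs=1 respects it."""
    if not cs:
        # Map lowercase letters to uppercase, like the "-32" step in cmpcharci.
        a, b = a.upper(), b.upper()
    for char_a, char_b in zip(a, b):
        if char_a != char_b:
            # The first differing character decides the result.
            return -1 if ord(char_a) < ord(char_b) else 1
    # All shared positions are equal: the shorter string sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))


for left, right in [("Alligator", "Tiger"), ("Tiger", "Tiger"), ("Zebra", "Tiger")]:
    print(compare_strings(left, right, 0), left, right)
```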
The second part makes use of the comparison for sorting.
### alpha-numerical sort with gnuplot
reset session
load "Compare.plt"
$Data <<EOD
1 0.123 Orange
2 0.456 Apple
3 0.789 Peach
4 0.987 Pineapple
5 0.654 Banana
6 0.321 Raspberry
7 0.111 Lemon
EOD
stats $Data u 0 nooutput
RowCount = STATS_records
ColSort = 3
array Key[RowCount]
array Index[RowCount]
set table $Dummy
plot $Data u (Key[$0+1]=stringcolumn(ColSort),Index[$0+1]=$0+1) w table
unset table
# Bubblesort
do for [n=RowCount:2:-1] {
    do for [i=1:n-1] {
        if ( cmp(Key[i],Key[i+1],0) > 0) {
            tmp=Key[i]; Key[i]=Key[i+1]; Key[i+1]=tmp
            tmp2=Index[i]; Index[i]=Index[i+1]; Index[i+1]=tmp2
        }
    }
}
set datafile separator "\n"
set table $Dummy # and reuse Key-array
plot $Data u (Key[$0+1]=stringcolumn(1)) with table
unset table
set datafile separator whitespace
set table $DataSorted
plot $Data u (Key[Index[$0+1]]) with table
unset table
print $DataSorted
set grid xtics,ytics
plot [-0.5:RowCount-0.5][0:1.1] $DataSorted u 0:2:xtic(3) w lp lt 7 lc rgb "red"
### end of code
1 0.123 Orange
2 0.456 Apple
3 0.789 Peach
4 0.987 Pineapple
5 0.654 Banana
6 0.321 Raspberry
7 0.111 Lemon
2 0.456 Apple
5 0.654 Banana
7 0.111 Lemon
1 0.123 Orange
3 0.789 Peach
4 0.987 Pineapple
6 0.321 Raspberry
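and the output graph (image not included): the sorted values plotted with lines and points, with the names from column 3 as x-axis tick labels.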
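The bubble sort with a parallel index array can also be sketched in Python, for reference only. The function name is made up, and comparing uppercase keys stands in for the case-insensitive cmp(a,b,0) call. The keys and the index array are swapped together, so the index array ends up holding the original row numbers in sorted order.

```python
def bubble_sort_index(keys):
    """Return the original positions (0-based) of keys in case-insensitive sorted order."""
    keys = list(keys)                  # work on a copy
    index = list(range(len(keys)))     # index[i] = original row of keys[i]
    for n in range(len(keys) - 1, 0, -1):
        for i in range(n):
            if keys[i].upper() > keys[i + 1].upper():
                # Swap the key and its row number together.
                keys[i], keys[i + 1] = keys[i + 1], keys[i]
                index[i], index[i + 1] = index[i + 1], index[i]
    return index


rows = [
    "1 0.123 Orange", "2 0.456 Apple", "3 0.789 Peach", "4 0.987 Pineapple",
    "5 0.654 Banana", "6 0.321 Raspberry", "7 0.111 Lemon",
]
order = bubble_sort_index([row.split()[2] for row in rows])
for position in order:
    print(rows[position])
```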

Apache PIG - How to get the Flop 10 data records?

I have data records like this:
Name customerID revenue(Mio) premium
Michael James 078932832 2.7 y
Susan Miller 024383490 3.9 n
John Cooper 021023023 2.1 y
How do I get, for each value of the premium flag, the 10 records with the lowest revenue (the "Flop 10")?
The result should be given as:
Nr Name customerID revenue(Mio) premium
1 John Cooper 021023023 2.1 y
2 Michael James 078932832 2.7 y
3 Andrew Murs 044834399 3.0 y
. ... ..... ... .
10 th entry with flag y
1 Susan Miller 024383490 3.9 n
. ... ..... ... .
10 th entry with flag n
As you see the list is ordered ascending (beginning with the lowest revenue).
I guess you should use split
Considering A is your load statement
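A = load 'data' as (Name:chararray, customerID:chararray, revenue:double, premium:chararray);
-- split is a statement of its own; it is not assigned to an alias
split A into PRE if premium == 'y', NONPRE if premium == 'n';
C = order PRE by revenue asc;
D = order NONPRE by revenue asc;
-- keep only the 10 lowest revenues per flag
E = limit C 10;
F = limit D 10;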
Disclaimer: Be careful while using split as null records get dropped. I have not compiled this code.
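To show the intended result independently of Pig, here is a small Python sketch of the same split, order and limit logic. The function name and the record layout (dictionaries with the keys name, customerID, revenue and premium) are assumptions made for this example. As with split, records whose flag is neither 'y' nor 'n' end up in no group.

```python
def flop_n(records, n=10):
    """Return, per premium flag, the n records with the lowest revenue, numbered from 1."""
    result = {}
    for flag in ("y", "n"):
        group = [rec for rec in records if rec["premium"] == flag]
        group.sort(key=lambda rec: rec["revenue"])   # ascending revenue
        result[flag] = list(enumerate(group[:n], start=1))
    return result


records = [
    {"name": "Michael James", "customerID": "078932832", "revenue": 2.7, "premium": "y"},
    {"name": "Susan Miller", "customerID": "024383490", "revenue": 3.9, "premium": "n"},
    {"name": "John Cooper", "customerID": "021023023", "revenue": 2.1, "premium": "y"},
    {"name": "Andrew Murs", "customerID": "044834399", "revenue": 3.0, "premium": "y"},
]
for flag, ranked in flop_n(records).items():
    for nr, rec in ranked:
        print(nr, rec["name"], rec["customerID"], rec["revenue"], rec["premium"])
```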

Stata: foreach creates too many variables

I created a toy example of my code below.
In this toy example I would like to create a measure of all higher prices minus lower prices within a self-created reference group. So within each reference group, I would like to take each individual and subtract its price value from all higher price values from other individuals in the same group. I do not want to have negative differences. Then I would like to sum all these differences. In creating this code I found some help here:
However, the code didn't work perfectly for me, because my dataset is quite large (several 100K obs) and the examples on the website and my code only work until the numlist maximum of 1600 in Stata. (I am using version 12). The toy example with the auto dataset works, due to small size of the dataset.
I would like to ask if someone has an idea how to code this more efficiently, so that I can get around the numlist restriction. I thought about summing the differences directly without saving them in intermediate variables, but that also runs into the numlist restriction.
clear all
sysuse auto
ren headroom refgroup
bysort refgroup : egen pricerank = rank(price)
qui: su pricerank, meanonly
gen test = `r(max)'
su test
foreach i of num 1/`r(max)' {
    qui: bys refgroup: gen intermediate`i' = price[_n+`i'] -price if price[_n+`i'] > price
}
egen price_diff = rowmax(intermediate*)
drop intermediate*
If I understand this correctly, this isn't even a problem that requires explicit loops. The sum of all higher prices is just the difference between two cumulative sums. You might need to think through what you want to do if prices are tied.
. clear
. set obs 10
obs was 0, now 10
. gen group = _n > 5
. set seed 2803
. gen price = ceil(1000 * runiform())
. bysort group (price) : gen sumhigherprices = sum(price)
. by group : replace sumhigherprices = sumhigherprices[_N] - sumhigherprices
(10 real changes made)
. list
| group price sumhig~s |
1. | 0 218 1448 |
2. | 0 264 1184 |
3. | 0 301 883 |
4. | 0 335 548 |
5. | 0 548 0 |
6. | 1 125 3027 |
7. | 1 213 2814 |
8. | 1 828 1986 |
9. | 1 988 998 |
10. | 1 998 0 |
Edit: For what the OP needs, there is an extra line
. by group : replace sumhigherprices = sumhigherprices - (_N - _n) * price
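The arithmetic behind these two replace lines can be checked with a short Python sketch (the function name is made up for illustration). Within each group the prices are sorted. The sum of the prices ranked after an observation is a running total, and subtracting the number of such prices times the own price leaves the sum of the differences. Tied prices contribute a difference of zero, so they need no special handling here.

```python
from collections import defaultdict


def sum_higher_differences(rows):
    """rows: (group, price) pairs. Returns (group, price, sum of (higher price - price))."""
    by_group = defaultdict(list)
    for group, price in rows:
        by_group[group].append(price)
    out = []
    for group in sorted(by_group):
        prices = sorted(by_group[group])
        remaining = sum(prices)
        for pos, price in enumerate(prices):
            remaining -= price                    # sum of the prices ranked after this one
            n_after = len(prices) - pos - 1       # corresponds to (_N - _n)
            out.append((group, price, remaining - n_after * price))
    return out


rows = [(0, 218), (0, 264), (0, 301), (0, 335), (0, 548)]
for row in sum_higher_differences(rows):
    print(row)
```

For the first observation, the listing above gives 1448 as the sum of the higher prices, and 1448 - 4 * 218 = 576 is the sum of the four differences.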
If I understand the wording of the problem correctly, maybe this can help. It uses joinby (new observations are created and depending on the size of the original database, you may or not hit the Stata hard-limit on number of observations). The code reproduces the results that would follow from the code of the original post. This is a second attempt. The code before this final edit did not provide the sought-after results. The wording of the problem was somewhat difficult for me to understand.
clear all
set more off
* Load data
sysuse auto
* Delete unnecessary vars
ren headroom refgroup
keep refgroup price
* Generate ids based on rankings (sort)
bysort refgroup (price): gen id = _n
* Pretty list
order refgroup id
sort refgroup id price
list, sepby(refgroup)
* joinby procedure
tempfile main
save "`main'"
rename (price id) =0
joinby refgroup using "`main'"
list, sepby(refgroup)
* Do not compare with itself and drop duplicates
drop if id0 >= id
* Compute differences and max
gen dif = abs(price0 - price)
collapse (max) dif, by(refgroup id0)
list, sepby(refgroup)
