Stanford CRFClassifier performance evaluation output - stanford-nlp

I'm following this FAQ https://nlp.stanford.edu/software/crf-faq.shtml for training my own classifier and I noticed that the performance evaluation output does not match the results (or at least not in the way I expect).
Specifically this section
CRFClassifier tagged 16119 words in 1 documents at 13824.19 words per second.
Entity P R F1 TP FP FN
MYLABEL 1.0000 0.9961 0.9980 255 0 1
Totals 1.0000 0.9961 0.9980 255 0 1
I expect TP to be all instances where the predicted label matched the golden label, FP to be all instances where MYLABEL was predicted but the golden label was O, FN to be all instances where O was predicted but the golden was MYLABEL.
If I calculate those numbers myself from the output of the program, I get completely different numbers with no relation to what the program prints. I've tried this with various test files.
I'm using Stanford NER - v3.7.0 - 2016-10-31
Am I missing something?

The F1 scores are over entities not labels.
Example:
(Joe, PERSON) (Smith, PERSON) (went, O) (to, O) (Hawaii, LOCATION) (., O).
In this example there are two possible entities:
Joe Smith PERSON
Hawaii LOCATION
Entities are created by taking all adjacent tokens with the same label. (Unless you use a more complicated BIO labeling scheme ; BIO schemes have tags like I-PERSON and B-PERSON to indicate whether a token is the beginning of an entity, etc...).

Related

Relational Learning: What does this Prolog Output say?

I need to find relations in a prolog dataset. I have different kinds of trains with different features, for example Train1:
has_car(rel_east1,car_11).
has_car(rel_east1,car_12).
has_car(rel_east1,car_13).
infront(car_11,car_12).
infront(car_12,car_13).
size(car_11,long).
size(car_12,long).
size(car_13,short).
shape(car_11,hexagon).
shape(car_12,rectangle).
shape(car_13,hexagon).
load(car_11,rectangle).
load(car_12,circle).
load(car_13,triangle).
I have ten different trains. Now I used the Metagol algorithm which shall learn the different relations within the different trains. As a result I get a list with different clauses. And here is my problem: I don't understand the inductive steps in between the clauses. For example:
relational(A):-has_car(A,B),relational_1(B).
relational_1(A):-relational_2(A,rectangle).
relational_2(A,B):-relational_3(A),shape(A,B).
relational_3(A):-infront(A,B),relational_4(B).
relational_4(A):-load(A,triangle).
The only thing I know is, that the whole clause says: "There is a Train which contains a car which is in the shape rectangle. This car is in front of another car which contains a triangle."
But can anybody explain the code to me? Line for line?
For example, I don't understand how to read the 2nd line: "If there is a relation 1 with A, then there is also a relation 2 between A and rectangle"?
I am not 100% sure, but I think the relational_x predicates, are relations (predicates) that are 'invented' by metagol for your learning task.
As naming invented predicates is a hard task (for which there is not a good solution for ) you get these type of names.
For instance if you used the example `kinship1' here : https://github.com/metagol/metagol
?- [kinship1].
Warning: /media/sam/9bb6ab40-5f17-481e-aba8-7bd9e4e05d66/home/sam/Documents/Prolog_practise/metagol/metagol.pl:250:
Local definition of metagol:list_to_set/2 overrides weak import from lists
true.
?- a.
% learning grandparent/2
% clauses: 1
% clauses: 2
% clauses: 3
grandparent_1(A,B):-father(A,B).
grandparent_1(A,B):-mother(A,B).
grandparent(A,B):-grandparent_1(A,C),grandparent_1(C,B).
true .
It learns the grandparent/2 relation by also learning grandparent_1/2. Which we as humans would call parent/2.
So relational_4(A):-load(A,triangle). you might call 'car carrying a load which is triangle shaped' . and relational_3(A):-infront(A,B),relational_4(B). would then be 'car in front of a car carrying a load which is triangle shaped' etc
relational_4(A):-load(A,triangle).
If an object A is loaded with a triangle then A is relational_4
or
There is a car which has a load which is a triangle.
relational_3(A):-infront(A,B),relational_4(B).
If an object A is infront of object B and B is relational_4 then A is relational_3
or
Car A is infront of a car B which has a load which is a triangle
relational_2(A,B):-relational_3(A),shape(A,B).
If A is relational_3 and has shape B then A relational_2 B.
or
Car A is a car infront of another car which is loaded with a triangle, car A has an unspecified shape.
relational_1(A):-relational_2(A,rectangle).
If an object A is relational_2 rectangle then A is relational_1
or
There is a car which has a rectangle shape, it is infront of a car which caries a triangle load.
relational(A):-has_car(A,B),relational_1(B).
If an object A has a car B and B is relational_1 then relational A.
There is a train which has a car, that car is a rectangle in shape and it is infront of a car that has a triangle load

Poor h2o GBM Classification Performance in a balanced binomial response

In a fairly balanced binomial classification response problem, I am observing unusual level of error in h2o.gbm classification for determining class 0, on train set itself. It is from a competition which is over, so interest is only towards understanding what is going wrong.
Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
0 1 Error Rate
0 147857 234035 0.612830 =234035/381892
1 44782 271661 0.141517 =44782/316443
Totals 192639 505696 0.399260 =278817/698335
Any expert suggestions to treat the data and reduce the error is welcome.
Following approaches are tried and error is not found decreasing.
Approach 1: Selecting top 5 important variables via h2o.varimp(gbm)
Approach 2: Converting the negative normalized variable as zero and possitive as 1.
#Data Definition
# Variable Definition
#Independent Variables
# ID Unique ID for each observation
# Timestamp Unique value representing one day
# Stock_ID Unique ID representing one stock
# Volume Normalized values of volume traded of given stock ID on that timestamp
# Three_Day_Moving_Average Normalized values of three days moving average of Closing price for given stock ID (Including Current day)
# Five_Day_Moving_Average Normalized values of five days moving average of Closing price for given stock ID (Including Current day)
# Ten_Day_Moving_Average Normalized values of ten days moving average of Closing price for given stock ID (Including Current day)
# Twenty_Day_Moving_Average Normalized values of twenty days moving average of Closing price for given stock ID (Including Current day)
# True_Range Normalized values of true range for given stock ID
# Average_True_Range Normalized values of average true range for given stock ID
# Positive_Directional_Movement Normalized values of positive directional movement for given stock ID
# Negative_Directional_Movement Normalized values of negative directional movement for given stock ID
#Dependent Response Variable
# Outcome Binary outcome variable representing whether price for one particular stock at the tomorrow’s market close is higher(1) or lower(0) compared to the price at today’s market close
temp <- tempfile()
download.file('https://github.com/meethariprasad/trikaal/raw/master/Competetions/AnalyticsVidhya/Stock_Closure/test_6lvBXoI.zip',temp)
test <- read.csv(unz(temp, "test.csv"))
unlink(temp)
temp <- tempfile()
download.file('https://github.com/meethariprasad/trikaal/raw/master/Competetions/AnalyticsVidhya/Stock_Closure/train_xup5Mf8.zip',temp)
#Please wait for 60 Mb file to load.
train <- read.csv(unz(temp, "train.csv"))
unlink(temp)
summary(train)
#We don't want the ID
train<-train[,2:ncol(train)]
# Preserving Test ID if needed
ID<-test$ID
#Remove ID from test
test<-test[,2:ncol(test)]
#Create Empty Response SalePrice
test$Outcome<-NA
#Original
combi.imp<-rbind(train,test)
rm(train,test)
summary(combi.imp)
#Creating Factor Variable
combi.imp$Outcome<-as.factor(combi.imp$Outcome)
combi.imp$Stock_ID<-as.factor(combi.imp$Stock_ID)
combi.imp$timestamp<-as.factor(combi.imp$timestamp)
summary(combi.imp)
#Brute Force NA treatment by taking only complete cases without NA.
train.complete<-combi.imp[1:702739,]
train.complete<-train.complete[complete.cases(train.complete),]
test.complete<-combi.imp[702740:804685,]
library(h2o)
y<-c("Outcome")
features=names(train.complete)[!names(train.complete) %in% c("Outcome")]
h2o.shutdown(prompt=F)
#Adjust memory size based on your system.
h2o.init(nthreads = -1,max_mem_size = "5g")
train.hex<-as.h2o(train.complete)
test.hex<-as.h2o(test.complete[,features])
#Models
gbmF_model_1 = h2o.gbm( x=features,
y = y,
training_frame =train.hex,
seed=1234
)
h2o.performance(gbmF_model_1)
You've only trained a single GBM with the default parameters, so it doesn't look like you've put enough effort into tuning your model. I'd recommend a random grid search on GBM using the h2o.grid() function. Here is an H2O R code example you can follow.

What is the Gensim word2vec output

I want to use gensim word2vec as input for neural network. I have 2 questions:
1) gensim.models.Word2Vec get as parameter the size. How this parameter is used? and size of what?
2) Once trained what is the output of gensim word2vec? As i could see this is not a probability values (not between 0 and 1). It seems to me for each word vector we get a distance (cosinus) between this word and some other words (but which words exactly?)
Thanks for your response.
Ans to 1 -> The size parameter is the dimension of the word vectors i.e. each vector will be having 100 dimensions if size=100
Ans to 2 -> You can save the word vectors using this function save_word2vec_format(fname="vectors.txt", fvocab=None, binary=False) ref. This will save a file "vectors.txt" which will have first line as <size of the vocabulary> <dimensions> and rest of the lines will be of the form <word> <vector of size dimension>.
Sample for "vectors.txt":
297820 100
the -0.18542234751 0.138813291635 0.0392148854213 0.0238721499736 -0.0443151295365 0.03226302388 -0.168626211895 -0.17397777844 -0.0547546409461 0.166621666046 0.0534506882806 0.0774947957067 -0.180520283779 -0.0938140452702 -0.0354599008902 -0.0533488133527 -0.0667684564816 -0.0210904306995 -0.103069115604 -0.138712344952 -0.035142440978 -0.125067138202 0.0514192233164 -0.142052171747 0.0795726729387 0.0310433094806 -0.00666224898992 0.047268806263 0.0339849190176 -0.181107631029 0.0477396587205 0.0483130822899 -0.090229393762 0.0224528628225 0.190814060668 -0.179506639849 0.00034066604609 0.0639057478 0.156444383949 -0.0366888977431 -0.170674385275 -0.053907152935 0.106572313582 0.0724497821903 -0.00848717936216 0.124053494271 -0.0420715605081 0.0460277422205 -0.0514693485657 0.132215091575 -0.0429308836475 -0.111784875385 -0.0543172053216 0.0849476776796 -0.015301892652 0.00992711997251 -0.00566113637219 0.00136359242972 -0.0382116842516 0.0681229985191 0.0685156463052 0.0759072640845 -0.0238136705161 0.168710450161 0.00879930186352 -0.179756801973 -0.210286559709 -0.161832152064 -0.0212640125813 -0.0115905356526 -0.0949562511822 0.126493155131 0.0215821686774 -0.164276918273 -0.0573806470616 -0.0147266125919 0.0566350339785 -0.0276969849679 0.0178970346094 0.0599163813161 0.0919867942845 0.172071394538 0.0714226787026 0.109037733251 0.00403647493576 0.044853743905 -0.0915639785243 -0.0242494817113 0.0705554654776 0.255584701079 0.001309754199 0.0872413719572 -0.0376289782286 0.158184379871 0.109245196088 -0.0727554069742 0.168820215174 0.0454895919746 0.0741726055733 -0.134467710995
...
...
Dimension Size in word2vec**
Word2vec is used to create a vector space that represents words based on the trained corpus.
The vector is a mathematical representation of the word compared to other words in the given corpus. The dimensions size is the vector length.
Performing mathematical operations on the vectors represent the relationship between words.
as the vector of "man" and "king" will be close and same for the vector of "Faris" and "France".
If the size is too small like two or three dimensions, the information representation will be very limited.
The dimensions can be simplified as a linkage between different words. Words can be linked to each other in different dimensions based on how the words are positioned to each other in the corpus.
How to use the vectors
The vector by itself is useless and the numbers represent the position of the word with relations to all other words in the corpus
The vector can be meaningful when measured against another vector
cosine similarity is one of the common methods to measure the similarity between different words.
Good luck

Include only complete groups in panel regression using Stata

I have a panel set of data but not all individuals are present for all periods. I see when I run my xtreg that there are between 1-4 observations per group with a mean of 1.9. I'd like to only include those with 4 observations. Is there any way I can do this easily?
I understand that you want to include in your regression only those groups for which there are exactly 4 observations. If this is the case, then one solution is to count the number of observations per group and condition the regression using if:
clear all
set more off
webuse nlswork
xtset idcode
list idcode year in 1/50, sepby(idcode)
bysort idcode: gen counter = _N
xtreg ln_w grade age c.age#c.age ttl_exp c.ttl_exp#c.ttl_exp tenure ///
c.tenure#c.tenure 2.race not_smsa south if counter == 12, be
In this example the regression is conditioned to groups with 12 observations. The xtreg command gives (among other things):
Number of obs = 1881
Number of groups = 158
which you can compare with the result of running the regression without the if:
Number of obs = 28091
Number of groups = 4697
As commented by #NickCox, if you don't mind losing observations you can drop or keep (un)desired groups:
bysort idcode: drop if _N != 4
or
bysort idcode: keep if _N == 4
followed by an unconditional xtreg (i.e. with no if).
Notice that both approaches count missings, so you may need to account for that.
On the other hand, you might want to think about why you want to discard that data in your analysis.

Algorithm to create unique random concatenation of items

I'm thinking about an algorithm that will create X most unique concatenations of Y parts, where each part can be one of several items. For example 3 parts:
part #1: 0,1,2
part #2: a,b,c
part #3: x,y,z
And the (random, one case of some possibilities) result of 5 concatenations:
0ax
1by
2cz
0bz (note that '0by' would be "less unique " than '0bz' because 'by' already was)
2ay (note that 'a' didn't after '2' jet, and 'y' didn't after 'a' jet)
Simple BAD results for next concatenation:
1cy ('c' wasn't after 1, 'y' wasn't after 'c', BUT '1'-'y' already was as first-last
Simple GOOD next result would be:
0cy ('c' wasn't after '0', 'y' wasn't after 'c', and '0'-'y' wasn't as first-last part)
1az
1cx
I know that this solution limit possible results, but when all full unique possibilities will gone, algorithm should continue and try to keep most avaible uniqueness (repeating as few as possible).
Consider real example:
Boy/Girl/Martin
bought/stole/get
bottle/milk/water
And I want results like:
Boy get milk
Martin stole bottle
Girl bought water
Boy bought bottle (not water, because of 'bought+water' and not milk, because of 'Boy+milk')
Maybe start with a tree of all combinations, but how to select most unique trees first?
Edit: According to this sample data, we can see, that creation of fully unique results for 4 words * 3 possibilities, provide us only 3 results:
Martin stole a bootle
Boy bought an milk
He get hard water
But, there can be more results requested. So, 4. result should be most-available-uniqueness like Martin bought hard milk, not Martin stole a water
Edit: Some start for a solution ?
Imagine each part as a barrel, wich can be rotated, and last item goes as first when rotates down, first goes as last when rotating up. Now, set barells like this:
Martin|stole |a |bootle
Boy |bought|an |milk
He |get |hard|water
Now, write sentences as We see, and rotate first barell UP once, second twice, third three and so on. We get sentences (note that third barell did one full rotation):
Boy |get |a |milk
He |stole |an |water
Martin|bought|hard|bootle
And we get next solutions. We can do process one more time to get more solutions:
He |bought|a |water
Martin|get |an |bootle
Boy |stole |hard|milk
The problem is that first barrel will be connected with last, because rotating parallel.
I'm wondering if that will be more uniqe if i rotate last barrel one more time in last solution (but the i provide other connections like an-water - but this will be repeated only 2 times, not 3 times like now). Don't know that "barrels" are good way ofthinking here.
I think that we should first found a definition for uniqueness
For example, what is changing uniqueness to drop ? If we use word that was already used ? Do repeating 2 words close to each other is less uniqe that repeating a word in some gap of other words ? So, this problem can be subjective.
But I think that in lot of sequences, each word should be used similar times (like selecting word randomly and removing from a set, and after getting all words refresh all options that they can be obtained next time) - this is easy to do.
But, even if we get each words similar number od times, we should do something to do-not-repeat-connections between words. I think, that more uniqe is repeating words far from each other, not next to each other.
Anytime you need a new concatenation, just generate a completely random one, calculate it's fitness, and then either accept that concatenation or reject it (probabilistically, that is).
const C = 1.0
function CreateGoodConcatenation()
{
for (rejectionCount = 0; ; rejectionCount++)
{
candidate = CreateRandomConcatination()
fitness = CalculateFitness(candidate) // returns 0 < fitness <= 1
r = GetRand(zero to one)
adjusted_r = Math.pow(r, C * rejectionCount + 1) // bias toward acceptability as rejectionCount increases
if (adjusted_r < fitness)
{
return candidate
}
}
}
CalculateFitness should never return zero. If it does, you might find yourself in an infinite loop.
As you increase C, less ideal concatenations are accepted more readily.
As you decrease C, you face increased iterations for each call to CreateGoodConcatenation (plus less entropy in the result)

Resources