Mallet: probabilities of labels

I'm trying to use the Mallet Simple Tagger (http://mallet.cs.umass.edu/sequences.php) for detecting specific kinds of words in texts (specifically, prominent words). I'm running it with the following standard commands:
for training
java -cp "C:\mallet\class;C:\mallet\lib\mallet-deps.jar" cc.mallet.fst.SimpleTagger --train true --model-file "ab.crf" "data/train.txt" --threads 3
for testing
java -cp "C:\mallet\class;C:\mallet\lib\mallet-deps.jar" cc.mallet.fst.SimpleTagger --model-file "ab.crf" "data\test.txt" >> "result.txt"
After testing I get a list of labels as a result (0 or 1 for each word in the test set). I would like to know: is it somehow possible to output the probability of a label (not just the label itself)?
Thank you
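For reference: the SimpleTagger command line only prints the best label sequence, but Mallet's Java API exposes per-token marginal probabilities through the forward-backward lattice. Below is a rough, untested sketch; the class name is made up, and the gamma indexing convention is my assumption and worth double-checking against the Mallet source.
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.ObjectInputStream;
import java.util.regex.Pattern;
import cc.mallet.fst.CRF;
import cc.mallet.fst.SumLatticeDefault;
import cc.mallet.pipe.iterator.LineGroupIterator;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;
import cc.mallet.types.Sequence;

public class LabelMarginals { // hypothetical helper class
    public static void main(String[] args) throws Exception {
        // Load the CRF that SimpleTagger serialized with --model-file.
        ObjectInputStream in = new ObjectInputStream(new FileInputStream("ab.crf"));
        CRF crf = (CRF) in.readObject();
        in.close();

        // Read the test file the same way SimpleTagger does (blank-line separated).
        crf.getInputPipe().setTargetProcessing(false); // test data carries no labels
        InstanceList data = new InstanceList(crf.getInputPipe());
        data.addThruPipe(new LineGroupIterator(new FileReader("data\\test.txt"),
                Pattern.compile("^\\s*$"), true));

        for (Instance inst : data) {
            Sequence<?> input = (Sequence<?>) inst.getData();
            // Forward-backward over the sequence; gammas are per-position state marginals.
            SumLatticeDefault lattice = new SumLatticeDefault(crf, input);
            for (int ip = 0; ip < input.size(); ip++) {
                for (int si = 0; si < crf.numStates(); si++) {
                    // State names match the label strings ("0"/"1"); gamma index
                    // ip+1 should be the state after consuming token ip.
                    System.out.printf("token %d label %s p=%.4f%n", ip,
                            crf.getState(si).getName(),
                            lattice.getGammaProbability(ip + 1, crf.getState(si)));
                }
            }
        }
    }
}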

Related

Questions about creating Stanford CoreNLP training models

I've been working with Stanford's CoreNLP to perform sentiment analysis on some data I have, and I'm now creating a training model. I know we can train a model with the following command:
java -mx8g edu.stanford.nlp.sentiment.SentimentTraining -numHid 25 -trainPath train.txt -devPath dev.txt -train -model model.ser.gz
I know what goes in the train.txt file. You score sentences and put them in train.txt, something like this:
(0 (2 Today) (0 (0 (2 is) (0 (2 a) (0 (0 bad) (2 day)))) (..)))
But I don't understand what goes in the dev.txt file.
I read through this question several times to try to understand what goes in dev.txt, but it's still unclear to me. Also, scoring these sentences manually has become a pain; is there a tool available that makes it easier? I'm worried that I've been using the wrong number of parentheses or making some other stupid mistake like that.
Also, any suggestions on how long my train.txt file should be? I'm thinking of scoring 1000 sentences. Is that number too small or too large?
All your help is appreciated :)
dev.txt should have the same format as train.txt, just with a different set of sentences. Note that the same sentence should not appear in both dev.txt and train.txt. The development set is used to evaluate the quality of the model you train on the training data.
We don't distribute a tool for tagging sentiment data. This class could be helpful in building data: http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/sentiment/BuildBinarizedDataset.html
Here are the sizes of the train, dev, and test sets used for the sentiment model: train=8544, dev=1101, test=2210
Here is some sample code for evaluating a model:
// imports needed for the snippet below
import java.util.List;
import edu.stanford.nlp.sentiment.Evaluate;
import edu.stanford.nlp.sentiment.SentimentModel;
import edu.stanford.nlp.sentiment.SentimentUtils;
import edu.stanford.nlp.trees.Tree;

// load a model
SentimentModel model = SentimentModel.loadSerialized(modelPath);
// load the dev trees (gold labels come from the scores in dev.txt)
List<Tree> devTrees = SentimentUtils.readTreesWithGoldLabels(devPath);
// evaluate on devTrees
Evaluate eval = new Evaluate(model);
eval.eval(devTrees);
eval.printSummary();
You can find anything else you need (other imports, etc.) by looking at:
edu/stanford/nlp/sentiment/SentimentTraining.java
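Once you have a trained model, a quick way to sanity-check it on raw text is the bundled SentimentPipeline class (invocation as on the Stanford sentiment page; input.txt here is a plain-text file of your own):
java -cp "*" edu.stanford.nlp.sentiment.SentimentPipeline -sentimentModel model.ser.gz -file input.txt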

How can I use HOG/HOF output as input to the SS-US-ELM algorithm?

I have the MATLAB code for the SS-US-ELM algorithm. I want to run HOG/HOF on a data set like KTH and use its output as input to the SS-ELM algorithm, but I don't know how to save the output so that the SS-ELM algorithm can read it correctly.
There is a "g50c.mat" file in the demo folder that contains 4 variables: X, y, idxLabs, and idxUnls. How can I make a ".mat" file of the KTH data set and use it as input instead of "g50c.mat"? Any help would be greatly appreciated.
Here is the code that loads "g50c.mat" and uses its variables:
% Semi-supervised ELM (US-ELM) for semi-supervised classification.
% Ref: Huang Gao, Song Shiji, Gupta JND, Wu Cheng, Semi-supervised and
% unsupervised extreme learning machines, IEEE Transactions on Cybernetics, 2014
format compact;
clear;
addpath(genpath('functions'))
% load data
trial=1;
load g50c;
l=size(idxLabs,2);
u=ceil(size(y,1)*3/4)-2*l;
Xl=X(idxLabs(trial,:),:);
Yl=y(idxLabs(trial,:),:);
% Create validation set
labels=unique(y);
idx_V=[];
for i=1:length(labels)
    idx_V=[idx_V;find(y(idxUnls(trial,:))==labels(i),l/length(labels),'first')];
end
Xv=X(idxUnls(trial,idx_V),:);
Yv=y(idxUnls(trial,idx_V));
% Create unlabeled and testing set
idxSet=1:size(idxUnls,2);
idx_UT=setdiff(idxSet,idx_V);
idx_rand=randperm(size(idx_UT,2));
Xu=X(idxUnls(trial,idx_UT(idx_rand(1:u))),:);
Yu=y(idxUnls(trial,idx_UT(idx_rand(1:u))),:);
Xt=X(idxUnls(trial,idx_UT(idx_rand(u+1:end))),:);
Yt=y(idxUnls(trial,idx_UT(idx_rand(u+1:end))),:);
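For what it's worth, a .mat file with the same four variables could be assembled along these lines (untested sketch; features and classes are hypothetical arrays holding one HOG/HOF descriptor row and one action label per KTH clip):
% Untested sketch: package HOG/HOF output in the g50c.mat layout.
X = features;                    % N x d matrix, one descriptor per clip
y = classes;                     % N x 1 numeric action labels
nLab = 50;                       % how many examples get labels (your choice)
perm = randperm(size(X,1));
idxLabs = perm(1:nLab);          % one row = one trial of labelled indices
idxUnls = perm(nLab+1:end);      % indices treated as unlabelled
save('kth.mat','X','y','idxLabs','idxUnls');
Loading it then only needs load kth; in place of load g50c; in the demo script.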

SPSS syntax for naming individual analyses in output file outline

I have created syntax in SPSS that gives me 90 separate iterations of a general linear model, each with slightly different variations of fixed factors and covariates. In the output file, they are all just named "General Linear Model." I then have to manually rename each analysis in the output, and I want to find syntax that will add a more specific name to each result to help me identify it among the other 89 results (e.g. "General Linear Model - Males Only: Mean by Gender w/ Weight covariate").
This is an example of one analysis from the syntax:
USE ALL.
COMPUTE filter_$=(Muscle = "BICEPS" & Subj = "S1" & SMU = 1 ).
VARIABLE LABELS filter_$ 'Muscle = "BICEPS" & Subj = "S1" & SMU = 1 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.
GLM Frequency_Wk6 Frequency_Wk9
Frequency_Wk12 Frequency_Wk16
Frequency_Wk20
/WSFACTOR=Time 5 Polynomial
/METHOD=SSTYPE(3)
/PLOT=PROFILE(Time)
/EMMEANS=TABLES(Time)
/CRITERIA=ALPHA(.05)
/WSDESIGN=Time.
I am looking for syntax to add to this that will name this analysis "S1, SMU1 BICEPS, GLM". The goal is not to name the whole output file but each analysis within it, so I don't have to do it one by one. I at times have over 200 iterations that come out in a single output file, and renaming them individually within the output file is taking too much time.
I'm making the assumption that you are exporting the models to Excel (please clarify otherwise).
There is an undocumented command (OUTPUT COMMENT TEXT) that you can utilize here. There is also a custom extension, TEXT, designed to achieve the same thing, but it would need to be explicitly downloaded via:
Utilities --> Extension Bundles --> Download And Install Extension Bundles --> TEXT
You can use OUTPUT COMMENT TEXT to assign a title/descriptive text just before the output of the GLM model (in the example below I have used FREQUENCIES as an example).
get file="C:\Program Files\IBM\SPSS\Statistics\23\Samples\English\Employee data.sav".
oms /select all /if commands=['output comment' 'frequencies'] subtypes=['comment' 'frequencies']
/destination format=xlsx outfile='C:\Temp\ExportOutput.xlsx' /tag='ExportOutput'.
output comment text="##Model##: This is a long/descriptive title to help me identify the next model that is to be run - jobcat".
freq jobcat.
output comment text="##Model##: This is a long/descriptive title to help me identify the next model that is to be run - gender".
freq gender.
output comment text="##Model##: This is a long/descriptive title to help me identify the next model that is to be run - minority".
freq minority.
omsend tag=['ExportOutput'].
You could use the TITLE command here also, but it is limited to 60 characters.
You would have to change the OMS tags appropriately if using TITLE or TEXT.
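Applied to one of the GLM iterations from the question, the comment would sit directly before the model (a sketch reusing the syntax and title text the OP proposed):
output comment text="##Model##: S1, SMU1 BICEPS, GLM".
GLM Frequency_Wk6 Frequency_Wk9 Frequency_Wk12 Frequency_Wk16 Frequency_Wk20
/WSFACTOR=Time 5 Polynomial
/METHOD=SSTYPE(3)
/PLOT=PROFILE(Time)
/EMMEANS=TABLES(Time)
/CRITERIA=ALPHA(.05)
/WSDESIGN=Time.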
Edit:
Given that the OP wants to actually add a title to the left hand pane in the output viewer, a solution for this is as follows (credit to Albert-Jan Roskam for the Python code):
First, save the Python file "editTitles.py" to a valid Python search path (for example, in my case: "C:\ProgramData\IBM\SPSS\Statistics\23\extensions"):
#editTitles.py
import tempfile, os, sys
import SpssClient

def _titleToPane():
    """See titleToPane(). This function does the actual job"""
    outputDoc = SpssClient.GetDesignatedOutputDoc()
    outputItemList = outputDoc.GetOutputItems()
    textFormat = SpssClient.DocExportFormat.SpssFormatText
    filename = tempfile.mktemp() + ".txt"
    for index in range(outputItemList.Size()):
        outputItem = outputItemList.GetItemAt(index)
        if outputItem.GetDescription() == u"Page Title":
            outputItem.ExportToDocument(filename, textFormat)
            with open(filename) as f:
                outputItem.SetDescription(f.read().rstrip())
            os.remove(filename)
    return outputDoc

def titleToPane(spv=None):
    """Copy the contents of the TITLE command of the designated output document
    to the left output viewer pane"""
    try:
        outputDoc = None
        SpssClient.StartClient()
        if spv:
            SpssClient.OpenOutputDoc(spv)
        outputDoc = _titleToPane()
        if spv and outputDoc:
            outputDoc.SaveAs(spv)
    except:
        print "Error filling TITLE in Output Viewer [%s]" % sys.exc_info()[1]
    finally:
        SpssClient.StopClient()
Restart SPSS Statistics and run the following as a test:
get file="C:\Program Files\IBM\SPSS\Statistics\23\Samples\English\Employee data.sav".
title="##Model##: jobcat".
freq jobcat.
title="##Model##: gender".
freq gender.
title="##Model##: minority".
freq minority.
begin program.
import editTitles
editTitles.titleToPane()
end program.
The TITLE command will initially add a title to the main output viewer (right hand side), but the Python code then transfers that text to the left hand pane's output tree structure. As mentioned already, TITLE is capped to 60 characters; a warning will be triggered to highlight this.
This editTitles.py approach is the closest you are going to get to including a descriptive title that identifies each model. Replacing the actual title "General Linear Model" with a custom title would require scripting knowledge and a lot more code; this is a simpler alternative approach. Python integration is required for this to work.
Also consider using:
SPLIT FILE SEPARATE BY <list of filter variables>.
This will automatically produce filter labels in the left hand pane.
This is easy to use for mutually exclusive filters, but even if you have overlapping filters you can re-run it multiple times with different filters applied, to get as close as possible to your desired set of results.
For example:
get file="C:\Program Files\IBM\SPSS\Statistics\23\Samples\English\Employee data.sav".
sort cases by jobcat minority.
split file separate by jobcat minority.
freq educ.
split file off.

Cross-validation in Stanford NER

I'm trying to use cross-validation in Stanford NER. The feature factory lists 3 properties:
numFolds (int, default 1) - The number of folds to use for cross-validation.
startFold (int, default 1) - The starting fold to run.
numFoldsToRun (int, default 1) - The number of folds to run.
which I think should be used for cross-validation, but I don't think they actually work. Setting numFolds to 1 or 10 doesn't change the running time for training at all. And strangely, using numFoldsToRun gives the following warning:
Unknown property: |numFoldsToRun|
You're right. These options haven't been implemented. If you want to run cross-validation experiments, you'll have to do it completely manually by preparing the data sets yourself. (Sorry!)
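If you do go the manual route, the splitting itself is easy to script. A minimal sketch (assuming CoNLL-style TSV training data where sentences are separated by blank lines; all file names here are made up):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class MakeFolds {
    public static void main(String[] args) throws IOException {
        int k = 10;
        // Split the corpus into sentences at blank lines.
        List<String> sentences = new ArrayList<>(Arrays.asList(
                new String(Files.readAllBytes(Paths.get("all.tsv")),
                        StandardCharsets.UTF_8).split("\n\\s*\n")));
        Collections.shuffle(sentences, new Random(42)); // fixed seed for reproducibility
        for (int fold = 0; fold < k; fold++) {
            StringBuilder train = new StringBuilder(), test = new StringBuilder();
            // Every k-th sentence goes to this fold's test set, the rest to train.
            for (int i = 0; i < sentences.size(); i++)
                (i % k == fold ? test : train).append(sentences.get(i)).append("\n\n");
            Files.write(Paths.get("fold" + fold + "-train.tsv"),
                    train.toString().getBytes(StandardCharsets.UTF_8));
            Files.write(Paths.get("fold" + fold + "-test.tsv"),
                    test.toString().getBytes(StandardCharsets.UTF_8));
        }
    }
}
Each fold pair can then be trained and scored with CRFClassifier in the usual way, averaging the results over the k folds.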

Stanford NER: How do I create a new training set that I can use and test out?

From my understanding, to create a training file, you put your words in a text file. Then after each word, add a space or tab along with the tag (such as PERS, LOC, etc.).
I also copied text from a sample properties file into WordPad. How do I get these into a gz file that I can input into the classifier and use?
Please guide me through. I'm a newbie and am fairly inept with technology.
Your training file (say training-data.tsv) should look like this:
I O
drove O
to O
Vancouver LOCATION
BC LOCATION
yesterday O
where O means "Outside", as in not a named entity, and the separator between the columns is a tab.
You don't put them in a ser.gz file. The ser.gz file is the classifier model that is created by the training process.
To train the classifier run:
java -cp ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop my-classifier.properties
where my-classifier.properties would look like this:
trainFile = training-data.tsv
serializeTo = my-classification-model.ser.gz
map = word=0,answer=1
...
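Once training finishes, the serialized model can be tested on a held-out file in the same two-column format (flags as documented in the CRF FAQ linked below; test-data.tsv is a file name of my choosing):
java -cp ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier my-classification-model.ser.gz -testFile test-data.tsv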
I'd advise you to take a look at the NLTK documentation to learn more about training a parser: http://nltk.googlecode.com/svn/trunk/doc/howto/tag.html. Now, it seems that you want to train the CRFClassifier (not the parser!); for that you may want to check this FAQ: http://nlp.stanford.edu/software/crf-faq.shtml#a
