CoreNLP feature ranking for sentiment analysis - stanford-nlp

I am using the CoreNLP library for sentiment analysis. I have an Amazon data set, and when I pass it through the command mentioned below, it only returns values like 0 (negative or neutral) or 1 (positive or very positive). I want to modify the existing sentiment code to return multiple values, for example 0 = negative, 1 = neutral, 2 = positive, 3 = very positive,
then count the positive and very positive labels I receive and decide whether the overall sentiment is negative, positive, or very positive. How can I achieve this? Most of the files are in jar format, and I am not sure how I can change the code to achieve this result.
I downloaded the CoreNLP package code from here.
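For reference, CoreNLP's sentiment model actually predicts five classes (0 = very negative through 4 = very positive); in the Java API the per-sentence class is available via RNNCoreAnnotations.getPredictedClass on the sentiment-annotated tree. Once you have per-sentence labels, the aggregation you describe could look like this minimal Python sketch (the labels list is hypothetical):

from collections import Counter

# hypothetical per-sentence labels on CoreNLP's 5-class scale:
# 0=very negative, 1=negative, 2=neutral, 3=positive, 4=very positive
sentence_labels = [3, 2, 3, 4, 1]

counts = Counter(sentence_labels)
doc_label, _ = counts.most_common(1)[0]  # most frequent fine-grained class
names = ["very negative", "negative", "neutral", "positive", "very positive"]
print(names[doc_label])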

Related

Similarity measure using vectors in gensim

I have a pair of words and the semantic types of those words. I am trying to compute a relatedness measure between the two words using their semantic types, for example: word1 = king, type1 = man, word2 = queen, type2 = woman.
We can use gensim's word_vectors.most_similar to get 'queen' from 'king - man + woman'. However, I am looking for a similarity measure between the vector represented by 'king - man + woman' and the vector for 'queen'.
I am looking for a solution to the above, or:
a way to calculate the vector that represents 'king - man + woman', and
a way to calculate the similarity between two vectors using their vector values in gensim, or
a way to calculate the simple mean of the projection weight vectors (i.e. king - man + woman)
You should look at the source code for the gensim most_similar() method, which is used to propose answers to such analogy questions. Specifically, when you try...
sims = wv_model.most_similar(positive=['king', 'woman'], negative=['man'])
...the top result will (in a sufficiently trained model) often be 'queen' or similar. So, you can look at the source code to see exactly how it calculates the target combination wv('king') - wv('man') + wv('woman') before searching all known vectors for those closest to that target. See...
https://github.com/RaRe-Technologies/gensim/blob/5f6b28c538d7509138eb090c41917cb59e4709af/gensim/models/keyedvectors.py#L486
...and note that the local variable mean is the combination of the positive and negative values provided.
You might also find other methods there useful, either directly or as models for your own code, such as distances()...
https://github.com/RaRe-Technologies/gensim/blob/5f6b28c538d7509138eb090c41917cb59e4709af/gensim/models/keyedvectors.py#L934
...or n_similarity()...
https://github.com/RaRe-Technologies/gensim/blob/5f6b28c538d7509138eb090c41917cb59e4709af/gensim/models/keyedvectors.py#L1005
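As a concrete illustration, here is a minimal sketch of computing that combination and its similarity to 'queen' yourself, assuming wv_model is the same gensim word-vector model as above (most_similar combines unit-normalized vectors, which the sketch mimics):

import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

# combine the unit-normalized positive and negative vectors, as
# most_similar's local variable `mean` does
target = unit(unit(wv_model['king']) - unit(wv_model['man']) + unit(wv_model['woman']))

# cosine similarity between the combined vector and 'queen'
print(float(np.dot(target, unit(wv_model['queen']))))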

Stanford Classifier: What are non-ngram activeFeatures used to determine scoreOf Datum?

I have a number of classifiers to determine whether event descriptions fall into certain categories (e.g. a rock concert, a jazz evening, classical music) or not. I have created a servlet which uses the LinearClassifier scoresOf function to return a score for the event description's datum.
In order to look at cases which return unexpected results, I adapted the scoresOf function (public Counter scoresOf(Datum example)) to get an array of the individual features and their scores, so I could understand how the final score was arrived at. This works for the most part, i.e. I mostly have lines like:
1-#-jazz -0.6317620789568879
1-#-saxo -0.2449097451977173
as I'd expect. However, I also have a couple that I don't understand:
CLASS 1.4064007882810108
1-Len-31-Inf 0.4569598446321162
Can anybody please help by explaining what these are and how these scores are determined? (I really thought I was just working with a score built up from the weighted components of my description string.)
(I appreciate that "CLASS" and "Len-xx" are set as properties for the classifier; I just don't understand why they then show up as scored elements in their own right.)
For seeing feature weights, you might also look at LinearClassifier's justificationOf(); I think it does the same as what you've been writing.
For the questions:
The CLASS feature acts as a class prior or bias term. It will have a more positive weight to the extent that the class is more common in the data overall. You get this feature iff you use the useClassFeature property, but it's generally a good idea to have it.
The 1-Len feature looks at the length of the String in column 1; 31-Inf means a length of over 30. This will again have weights according to whether such a length is or is not indicative of a particular class. It is employed iff you use the binnedLengths property, and is useful only if there is some general correlation between field length and the target class.
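For context, both features are switched on in the classifier's properties; a minimal properties-file sketch (values illustrative) that would produce CLASS and 1-Len-31-Inf features for column 1:

useClassFeature=true
1.useNGrams=true
1.binnedLengths=10,20,30

With binnedLengths=10,20,30, any column-1 string longer than 30 characters falls into the 31-Inf bin, which is exactly the 1-Len-31-Inf feature above.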

Mahout: How to split into equally distributed training sets

I'm using Mahout's Naive Bayes algorithm to classify Amazon reviews as positive or negative.
The data set isn't equally distributed: there are far more positive than negative reviews. A test and training set randomly picked with the Mahout split leads to good positive classification results, but the false positive rate is also very high; negative reviews are rarely classified as negative.
I guess an equally distributed training set with equal numbers of positive and negative tuples might solve the problem.
I've tried using mahout split with these options and then just switching training and test, but this seems to produce tuples for only one class.
--testSplitSize (-ss) testSplitSize    The number of documents held back as test data for each category
--testSplitPct (-sp) testSplitPct      The % of documents held back as test data for each category
--splitLocation (-sl) splitLocation    Location for start of test data expressed as a percentage of the input file size (0=start, 50=middle, 100=end)
Is there a way, with mahout split or another tool, to get a proper training set?
I would say the training and test sets should reflect the underlying population; I would not create a test set with equal positive and negative reviews.
A better solution might be to create multiple sets via bootstrapping and let a committee vote improve your results.
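To illustrate the committee idea, a minimal sketch in plain Python (not Mahout's API; the classifiers' predict interface is a hypothetical stand-in):

import random

def bootstrap_sets(reviews, k):
    # each bootstrap set draws len(reviews) tuples with replacement
    return [[random.choice(reviews) for _ in reviews] for _ in range(k)]

def committee_vote(classifiers, review):
    # majority vote over classifiers trained on different bootstrap sets
    votes = [clf.predict(review) for clf in classifiers]
    return max(set(votes), key=votes.count)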

mapreduce way to calculate user similarity matrix

I have a list of many users (over 10 million), each of which is represented by a userid followed by 10 floating-point numbers indicating their preferences. I would like to efficiently calculate the user similarity matrix using cosine similarity based on mapreduce. However, since the values are floating-point numbers, it is hard to determine a key in the mapreduce framework. Any suggestions?
I think the easiest solution would be the Mahout library. There are a couple of map-reduce similarity matrix jobs in Mahout that might work for your use case.
The first is Mahout's ItemSimilarityJob, which is part of its recommender system libraries. The specific info for that job can be found here. You would simply need to provide the input data in the required format and choose your VectorSimilarityMeasure (which for your case would be SIMILARITY_COSINE) along with any additional optimizations. Since you are looking to calculate user-user similarity based on a preference vector of ten floating-point values, what you could do is assign a simple numeric index (1 to 10) to each position of the vector and generate a simple .csv file of vectorIndex,userID,decimalValue as input for the Mahout item-similarity job (the userID being a numeric Int or Long value). The resulting output should be a tab-separated text file of userID,userID,similarity.
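For instance, a minimal sketch of that input conversion (file names hypothetical), turning rows of userID plus ten floats into the vectorIndex,userID,decimalValue triples described above:

# each line of users.txt: userID,f1,f2,...,f10 (hypothetical layout)
with open("users.txt") as src, open("mahout_input.csv", "w") as dst:
    for line in src:
        fields = line.strip().split(",")
        user_id, prefs = fields[0], fields[1:]
        for index, value in enumerate(prefs, start=1):
            dst.write(f"{index},{user_id},{value}\n")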
A second solution might be Mahout's RowSimilarityJob, included in its math library. I've never used it myself, but some info can be found here and in this previous Stack Overflow thread. Rather than a .csv as input, you would need to translate your input data into a DistributedRowMatrix, the userIDs being the rows of the matrix. The output, I believe, will also be a DistributedRowMatrix sequence file containing the user-user similarity data you are seeking.
I suppose which solution is better depends on what input/output format you prefer. All the best.

Confusion Matrix of Bayesian Network

I'm trying to understand Bayesian networks. I have a data file which has 10 attributes, and I want to obtain the confusion matrix for this data. I thought I needed to calculate the TP, FP, FN, and TN counts of all fields. Is that right? If so, what do I need to do for a Bayesian network?
I really need some guidance; I'm lost.
The process usually goes like this:
You have some labeled data instances which you want to use to train a classifier, so that it can predict the class of new unlabeled instances.
Using your classifier of choice (neural networks, Bayes net, SVM, etc.), we build a model with your training data as input.
At this point, you usually would like to evaluate the performance of the model before deploying it. So, using a previously unused subset of the data (the test set), we compare the model's classification for these instances against the actual classes. A good way to summarize these results is a confusion matrix, which shows how each class of instances is predicted.
For binary classification tasks, the convention is to assign one class as positive and the other as negative. Thus, from the confusion matrix, the percentage of positive instances that are correctly classified as positive is known as the True Positive (TP) rate. The other definitions follow the same convention...
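As a minimal illustration (the labels here are made up), computing the binary confusion-matrix counts and the TP rate with scikit-learn:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1]  # actual classes (1 = positive)
y_pred = [1, 0, 0, 1, 1, 1]  # classifier predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn))  # True Positive rate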
A confusion matrix is used to evaluate the performance of a classifier, any classifier.
What you are asking about is a confusion matrix with more than two classes.
Here are the steps:
Build a classifier for each class, where the training set consists of the set of documents in the class (positive labels) and its complement (negative labels).
Given the test document, apply each classifier separately.
Assign the document to the class with the maximum score, the maximum confidence value, or the maximum probability.
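A minimal sketch of that one-vs-rest decision rule (the per-class classifiers and their score method are hypothetical stand-ins):

def classify(document, classifiers):
    # classifiers: dict mapping class label -> binary model whose
    # score(document) returns the confidence for the positive label
    return max(classifiers, key=lambda label: classifiers[label].score(document))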
Here is a reference to a paper where you can find more information:
Picca, Davide, Benoît Curdy, and François Bavaud. 2006. Non-linear correspondence analysis in text retrieval: A kernel view. In Proc. JADT.
