Question about how the sentimentr lexicon dictionary was built - sentiment-analysis

I have used the Sentimentr package to do some sentiment analysis because it includes valence shifters. However I cannot find how this lexicon lexicon::hash_sentiment_jockers_rinker was built, how individual words were scored. From what I understand, the lexicon was originally exported by the syuzhet and is a combination of AFFIN, bing, nrc and syuzhet. Could someone help me to understand how individual words in the lexicon were calculated?
Thanks!

The NEWS file on github says this for version "2.1.0 - 2.2.3":
The default sentiment sentiment lookup table used within sentimentr is now
lexicon::hash_sentiment_jockers_rinker, a combined and augmented version of
lexicon::hash_sentiment_jockers (Jockers, 2017) & Rinker's augmented
lexicon::hash_sentiment_huliu (Hu & Liu, 2004) sentiment lookup tables.
It doesn't state how it was "combined and augmented", so if that was the essence of your question this might not be much help. In that case I'd suggest studying git history to see where the change was introduced, as the commit message, or source code comments, might explain the details.

Related

next release of Stanza

I'm interested in the Stanza constituency parser for Italian.
In https://stanfordnlp.github.io/stanza/constituency.html it is said that a new release with updated models (including an Italian model trained on the Turin treebank) should have been available in mid-November.
Any idea about when the next release of Stanza will appear?
Thanks
alberto
Technically you can already get it! If you install the dev branch of stanza, you should be able to download an IT parser.
pip install git+git://github.com/stanfordnlp/stanza.git#704d90df2418ee199d83c92c16de180aacccf5c0
stanza.download("it")
It's trained on the Turin treebank, which has about 4000 trees. If you download the Bert version of the model, it gets over 91 F1 on the Evalita test set (but has a length limit of about 200 words per sentence).
We might splurge on getting the VIT treebank or something. I've been agitating that we use that budget on Danish or PT or some other language where we have very few users, but it's a hard sell...
Edit: there's also some scripts included for converting the publicly available Turin trees into brackets. Their MWT annotation style was to repeat the MWT twice in a row, which doesn't doesn't work too well for a task like parsing raw text.
It is still very much a live task ... either December or January, I would say.
p.s. This isn't really a great SO question....

Google cloud natural language API adding own context classifier

I have been searching how to create a new entity in google natural language API, and found nothing. Can anybody help how to create a new classifier such that if I pass a sentence and I want to detect suppose 'python' as programming language then how would I get that. Current the API is giving 'python' as 'other'.
I have also looked into cloud auto ml api for my solution and tried to create and train a model but It was only able to do sentiment analysis not entity detection.It was giving me the score rather than telling me that Java is programming language.
Thanks in advance.Your help will be appreciated.
Automl content classification classifies your data into the labels specified in the training set. It does not do entity detection. But it seems like what you need to do is closer to content classification than entity detection. My understanding from the description you provided is that you have content (may be words or phrases or short sentences) and you want to classify them into some labels (e.g. programmingLanguage). If you put together a good training set, the automl model should be able to do this.
The number it provides in eval is not sentiment, it's the probability of the predicted label. As you can see in the eval page you posted, it's telling you that java is a programmingLanguage with probability of 1 (so, it's very certain about it).

How can I do "related tags"?

I have tags on my website, and I input them one by one when I create a blog post. I love gmail's new feature, that ask you if you want to include X in a mail, if you type Y's name and that you often include both of them in the same messages.
I'd like to do something similar on my website, but I don't know how to represent the tags "related-ness" in an object or database ... thoughts ?
It all boils down to create associations between certain characteristics of your posts and certain tags, and then - when you press the "publish" button - to analyse the new post and propose all tags matched with your post characteristics.
This can be done in several ways from a "totally hard-coded" association to some sort of "learning AI"... and everything in-between.
Hard-coded solutions
This are the simplest algorithms to implement. You should first decide what characteristics of your post are relevant for tagging (e.g.: it's length if you tag them "short" or "long", the presence of photos or videos if you tag them "multimedia-content", etc...). The most obvious is however to focus on which words are used in posts. For example you could build a mapping like this:
tag_hint_words = {'code-development' : ['programming',
'language', 'python', 'function',
'object', 'method'],
'family' : ['Theresa', 'kids',
'uncle Ben', 'holidays']}
Then you would check your post for the presence of the words in the list (the code between [ and ] ) and propose the tag (the word before :) as a possible candidate.
A common approach is to give "scores", or in other word to put a number that indicates the probability a given tag is the right one. For example: if your post would contain the sentence...
After months of programming, we finally left for the summer holidays at uncle Ben's cottage. Theresa and the kids were ecstatic!
...despite the presence of the word "programming" the program should indicate family as the most likely tag to use, as there are many more words hinting.
Learning AI's
One of the obvious limitations of the above method is that - say one day you pick up java beside python - you would probably need to change your code and include words like "java" or "oracle" too. The same applies if you create new tags.
To circumvent this limitation (and have some fun!!) you could try to implement a learning algorithm. Learning algorithms are those who refine their outcome the more you use them (so they indeed... learn!). Some algorithm requires initial training (many spam filters and voice recognition programs need this initial "primer"). Some don't.
I am absolutely no expert on the subject, but two common AI's are: the Naive Bayes Classifier and some flavour of Neural network.
Although the WP pages might look scary, they are surprisingly easy to implement (at least in Python). Here's the recording of a lecture at PyCon 2009 on the subject "Easy AI with Python". I found it very informative and even somehow inspiring! :)
HTH!
You should have a look at this post :
Any suggestions for a db schema for storing related keywords?
If you're looking for a schema for storing related tags it will help.
Relevancy searches where multiple agents play a part are usually done using Collaborative filtering. You might want to give that a look see.
Look up Clustering (Machine Learning algorithm). Don't be intimidated by math, it's a pretty straightforward algorithm. Check out Machine Learning for Hackers for simpler explanations of many Machine Learning algorithms and methods.

Algorithms to recognize misspelled names in texts

I need to develop an application that will index several texts and I need to search for people’s names inside these texts. The problem is that, while a person’s correct name is “Gregory Jackson Junior”, inside the text, the name might me written as:
- Greg Jackson Jr
- Gegory Jackson Jr
- Gregory Jackson
- Gregory J. Junior
I plan to index the texts on a nightly bases and build a database index to speed up the search. I would like recommendation for good books and/or good articles on the subject.
Thanks
Check these related questions.
Algorithm to find articles with similar text
How to search for a person's name in a text? (heuristic)
Your question is incorrectly phrased. The examples do not indicate misspelling but change in the form of writing a full name.
And,
would your search expect to match on words like son with reference to the example?
would it expect to match bob when looking for a name called Robert?
Are you looking for things like this and this?
Ok, reading your comment suggests you do not want to venture into that.
For the record. Use a Bayesian filter. You may use mechanical truck for initializing your algorithm.

Searching a datastore for related topics by keyword

For example, how does StackOverflow decide other questions are similar?
When I typed in the question above and then tabbed to this memo control I saw a list of existing questions which might be the same as the one I am asking.
What technique is used to find similar questions?
I got an email from team#stackoverflow.com on Mar 20 that mentions how it works:
the "ask a question" search is
exclusively on title and will not
match anything in the body. It is a
mystery to me why people think it's
better.
The last sentence refers to the search bar, which I've found is less useful when I'm trying to find a specific question I've already seen.
I think it's plain old word matching. However, I might add that this feature does not work as well as I would like it to. It's much better to do google search with site:stackoverflow.com prefix than to rely on SO to provide the relevant suggestions.
Poorly -- using MS SQL Full Text Search, I believe. You'll have better luck using Lucene, IMO. For more background on the topic see the Wikipedia article on Lucene or the general topic of information retrieval.
The matching program would store an index of all questions. When you ask a question, all keywords in your question are matched against the index. This is similar to Google Search. Lucene open source search can be (and with high probability has been) used for this. Since the results are not quite accurate, I presume they index just the headlines of the questions, as an approximation.
The other related keyword is collaborative filtering, the algorithm popularized by Amazon to recommend products based on behavior of other similar customers. In the current case, an alternative algorithm based on collaborative filtering is: keywords are extracted from the question, then tags associated (in the history) with the keywords are found. Questions which have those tags are returned. Well, experiments are needed to see whether it works well at all.

Resources