Is there a smaller version of Stanford CoreNLP? - stanford-nlp

I am trying to reduce the size of a plugin (written in java) I am working on and we use CoreNLP to parse some text. We don't use too many features of CoreNLP, but it is by far the largest component of the plugin and makes downloading it more burdensome than desired for the end user. I am aware of the Simple CoreNLP API as well as the Client/Server functionality that is featured with CoreNLP, but it still seems to require downloading the entire package along with the models. Is there another version that is smaller, or has someone else made something smaller with a little less functionality? The only annotators we use are the tokenizer, ssplit, parse, sentiment analysis, the part of speech tagger, and the lemmatizer.

I will just add in that if you make a jar with just the models for the annotators you are using you can cut the size down a lot. The models are what take up most of the space, and you only need the ones you are using.

Related

Train a non-english Stanford NER models

I'm seeing several posts about training the Stanford NER for other languages.
eg: https://blog.sicara.com/train-ner-model-with-nltk-stanford-tagger-english-french-german-6d90573a9486
However, the Stanford CRF-Classifier uses some language dependent features (such as: Part Of Speechs tags).
Can we really train non-English models using the same Jar file?
https://nlp.stanford.edu/software/crf-faq.html
Training a NER classifier is language independent. You have to provide high quality training data and create meaningful features. The point is, that not all features are equally useful for every languages. Capitalization for instance, is a good indicator for a named entity in english. But in German all nouns are capitalized, which makes this features less useful.
In Stanford NER you can decide which features the classifier has to use and therefore you can disable POS tags (in fact, they are disabled by default). Of course, you could also provide your own POS tags in your desired language.
I hope I could clarify some things.
I agree with previous comment that NER classification model is language independent.
If you have issue with training data I could suggest you this link with a huge amount of labeled datasets for different languages.
If you would like to try another model, I suggest ESTNLTK - library for Estonian language, but it could fit language independent ner models (documentation).
Also, here you could find example how to train ner model using spaCy.
I hope it helps. Good luck!

How should I figure out the POS tag of "last" in this sentence?

I'm trying to use Stanford CoreNLP to tag a sentence.
"How long does a soccer game last?"
It seems on CoreNLP demo the token "last" is tagged as JJ instead of VB. Is there a way to fix this?
The short answer is "no." CoreNLP provides part of speech tags at a certain high but not perfect accuracy, and it will occasionally make mistakes. Beyond tweaking the tags yourself, there's no easy automatic way to have its accuracy go up.
The longer answer is that you can always re-train the POS tagger on a custom tagged corpus, and then performance will be better on that corpus. This, however, involves a fairly substantial annotation effort tagging a large corpus of text with part of speech tags.

How compare two images and check whether both images are having same object or not in OpenCV python or JavaCV

I am working on a feature matching project and i am using OpenCV Python as the tool for developed the application.
According to the project requirement, my database have images of some objects like glass, ball,etc ....with their descriptions. User can send images to the back end of the application and back end is responsible for matching the sent image with images which are exist in the database and send the image description to the user.
I had done some research on the above scenario. Unfortunately still i could not find a algorithm for matching two images and identifying both are matching or not.
If any body have that kind of algorithm please send me.(I have to use OpenCV python or JavaCV)
Thank you
This is a very common problem in Computer Vision nowadays. A simple solution is really simple. But there are many, many variants for more sophisticated solutions.
Simple Solution
Feature Detector and Descriptor based.
The idea here being that you get a bunch of keypoints and their descriptors (search for SIFT/SURF/ORB). You can then find matches easily with tools provided in OpenCV. You would match the keypoints in your query image against all keypoints in the training dataset. Because of typical outliers, you would like to add a robust matching technique, like RanSaC. All of this is part of OpenCV.
Bag-of-Word model
If you want just the image that is as much the same as your query image, you can use Nearest-Neighbour search. Be aware that OpenCV comes with the much faster Approximated-Nearest-Neighbour (ANN) algorithm. Or you can use the BruteForceMatcher.
Advanced Solution
If you have many images (many==1 Million), you can look at Locality-Sensitive-Hashing (see Dean et al, 100,000 Object Categories).
If you do use Bag-of-Visual-Words, then you should probably build an Inverted Index.
Have a look at Fisher Vectors for improved accuracy as compared to BOW.
Suggestion
Start by using Bag-Of-Visual-Words. There are tutorials on how to train the dictionary for this
model.
Training:
Extract Local features (just pick SIFT, you can easily change this as OpenCV is very modular) from a subset of your training images. First detect features and then extract them. There are many tutorials on the web about this.
Train Dictionary. Helpful documentation with a reference to a sample implementation in Python (opencv_source_code/samples/python2/find_obj.py)!
Compute Histogram for each training image. (Also in the BOW documentation from previous step)
Put your image descriptors from the step above into a FLANN-Based-matcher.
Querying:
Compute features on your query image.
Use the dictionary from training to build a BOW histogram for your query image.
Use that feature to find the nearest neighbor(s).
I think you are talking about Content Based Image Retrieval
There are many research paper available on Internet.Get any one of them and Implement Best out of them according to your needs.Select Criteria according to your application like Texture based,color based,shape based image retrieval (This is best when you are working with image retrieval on internet for speed).
So you Need python Implementation, I would like to suggest you to go through Chapter 7, 8 of book Computer Vision Book . It Contains Working Example with code of what you are looking for
One question you may found useful : Are there any API's that'll let me search by image?

Unsupervised automatic tagging algorithms?

I want to build a web application that lets users upload documents, videos, images, music, and then give them an ability to search them. Think of it as Dropbox + Semantic Search.
When user uploads a new file, e.g. Document1.docx, how could I automatically generate tags based on the content of the file? In other words no user input is needed to determine what the file is about. If suppose that Document1.docx is a research paper on data mining, then when user searches for data mining, or research paper, or document1, that file should be returned in search results, since data mining and research paper will most likely be potential auto-generated tags for that given document.
1. Which algorithms would you recommend for this problem?
2. Is there an natural language library that could do this for me?
3. Which machine learning techniques should I look into to improve tagging precision?
4. How could I extend this to video and image automatic tagging?
Thanks in advance!
The most common unsupervised machine learning model for this type of task is Latent Dirichlet Allocation (LDA). This model automatically infers a collection of topics over a corpus of documents based on the words in those documents. Running LDA on your set of documents would assign words with probability to certain topics when you search for them, and then you could retrieve the documents with the highest probabilities to be relevant to that word.
There have been some extensions to images and music as well, see http://cseweb.ucsd.edu/~dhu/docs/research_exam09.pdf.
LDA has several efficient implementations in several languages:
many implementations from the original researchers
http://mallet.cs.umass.edu/, written in Java and recommended by others on SO
PLDA: a fast, parallelized C++ implementation
These guys propose an alternative to LDA.
Automatic Tag Recommendation Algorithms for
Social Recommender Systems
http://research.microsoft.com/pubs/79896/tagging.pdf
Haven't read thru the whole paper but they have two algorithms:
Supervised learning version. This isn't that bad. You can use Wikipedia to train the algorithm
"Prototype" version. Haven't had a chance to go thru this but this is what they recommend
UPDATE: I've researched this some more and I've found another approach. Basically, it's a two-stage approach that's very simple to understand and implement. While too slow for 100,000s of documents, it (probably) has good performance for 1000s of docs (so it's perfect for tagging a single user's documents). I'm going to try this approach and will report back on performance/usability.
In the mean time, here's the approach:
Use TextRank as per http://qr.ae/36RAP to generate a tag list for a single document. This generates a tag list for a single document independent of other documents.
Use the algorithm from "Using Machine Learning to Support Continuous
Ontology Development" (https://www.researchgate.net/publication/221630712_Using_Machine_Learning_to_Support_Continuous_Ontology_Development) to integrate the tag list (from step 1) into the existing tag list.
Text documents can be tagged using this keyphrase extraction algorithm/package.
http://www.nzdl.org/Kea/
Currently it supports limited type of documents (Agricultural and medical I guess) but you can train it according to your requirements.
I'm not sure how would the image/video part work out, unless you're doing very accurate object detection (which has it's own shortcomings). How are you planning to do it ?
You want Doc-Tags (https://www.Doc-Tags.com) which is a commercial product that automatically and Unsupervised - generates Contextually Accurate Document Tags. The built-in Reporting functionality makes the product a light-weight document management system.
For Developers wanting to customize their own approach - the source code is available (very cheap) and the back-end service xAIgent (https://xAIgent.com) is very inexpensive to use.
I posted a blog article today to answer your question.
http://scottge.net/2015/06/30/automatic-image-and-video-tagging/
There are basically two approaches to automatically extract keywords from images and videos.
Multiple Instance Learning (MIL)
Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), and the variants
In the above blog article, I list the latest research papers to illustrate the solutions. Some of them even include demo site and source code.
Thanks, Scott

Sentiment analysis

while performing sentiment analysis, how can I make the machine understand that I'm referring apple (the iphone), instead of apple (the fruit)?
Thanks for the advise !
Well, there are several methods,
I would start with checking Capital letter, usually, when referring to a name, first letter is capitalized.
Before doing sentiment analysis, I would use some Part-of-speech and Named Entity Recognition to tag the relevant words.
Stanford CoreNLP is a good text analysis project to start with, it will teach
you the basic concepts.
Example from CoreNLP:
You can see how the tags can help you.
And check out more info
As described by Ofiris, NER is only one way to do solve your problem. I feel it's more effective to use word embedding to represent your words. In that way machine automatically recognize the context of the word. As an example "Apple" is mostly coming together with "eat" and But if the given input "Apple" is present with "mobile" or any other word in that domain, Machine will understand it's "iPhone apple" instead of "apple fruit". There are 2 popular ways to generate word embeddings such as word2vec and fasttext.
Gensim provides more reliable implementations for both word2vec and fasttext.
https://radimrehurek.com/gensim/models/word2vec.html
https://radimrehurek.com/gensim/models/fasttext.html
In presence of dates, famous brands, vip or historical figures you can use a NER (named entity recognition) algorithm; in such case, as suggested by Ofiris, the Stanford CoreNLP offers a good Named entity recognizer.
For a more general disambiguation of polysemous words (i.e., words having more than one sense, such as "good") you could use a POS tagger coupled with a Word Sense Disambiguation (WSD) algorithm. An example of the latter can be found HERE, but I do not know any freely downloadable library for this purpose.
This problem has already been solved by many open source pre-trained NER models. Anyways you can try retraining an existing NER models to finetune them to solve this issue.
You can find an demo of NER results as done by Spacy NER here.

Resources