Finding Collocation using Apache OpenNLP - opennlp

I would like to find collocated words using Apache OpenNLP framework. By looking at the API, it seems there seems to no API for Collocation Finder. How to find the collocated words in a given sentences using OpenNLP. For example, in the below given sentence "Learn to create Machine Learning Algorithms in Python and R from two Data Science experts. Code templates included at San Jose", I want to find Machine Learning Algorthims and Data Science as one word.

Related

What are the Machine Learning Algorithms to use when identifying given words

I'm new to the machine learning and I want to identify the given words using an algorithm.
As an example,
construct the triangle ABC such that AB=7cm, BAC=60 and AC=5.5cm
construct the square that is 7cm long each side.
in this example I need to identify the words triangle and square.
So it seems like you want to be the algorithm to be intelligence rather than just identifying couple of words. So in order to that you should go for Natural language processing. There you can identifying nouns of different geometrical objects and in order to gather those information, I mean if you want to list that AB=7, BAC=60 and AC=505 then learn recurrent neural networks (RNN). RNN models can remember what you have said at the beginning of the sentence and identify what details are belongs to that.
ex - Anna lives in Paris and she's fluent in French.
RNN can identify the word French.
So just by using a machine learning algorithm it is not possible to identify those words and gather details when provide such a sentence.
You can read this article for further understanding.
Understanding LSTM Networks

How to build a knowledge graph?

I prototyped a tiny search engine with PageRank that worked on my computer. I am interested in building a Knowledge Graph on top of it, and it should return only queried webpages that are within the right context, similarly to how Google found relevant answers to search questions. I saw a lot of publicity around Knowledge Graphs, but not a lot of literature and almost no pseudocode like guideline of building one. Does anyone know good references on how such Knowledge Graphs work internally, so that there will be no need to create models about a KG?
Knowledge graph is a buzzword. It is a sum of models and technologies put together to achieve a result.
The first stop on your journey starts with Natural language processing, Ontologies and Text mining. It is a wide field of artificial intelligence, go here for a research survey on the field.
Before building your own models, I suggest you try different standard algorithms using dedicated toolboxes such as gensim. You will learn about tf-idf, LDA, document feature vectors, etc.
I am assuming you want to work with text data, if you want to do image search using other images it is different. Same for the audio part.
Building models is only the first step, the most difficult part of Google's knowledge graph is to actually scale to billions of requests each day ...
A good processing pipeline can be built "easily" on top of Apache Spark, "the current-gen Hadoop". It provides a resilient distributed datastore which is mandatory if you want to scale.
If you want to keep your data as a graph, as in graph theory (like pagerank), for live querying, I suggest you use Bulbs which is a framework which is "Like an ORM for graphs, but instead of SQL, you use the graph-traversal language Gremlin to query the database". You can switch the backend from Neo4j to OpenRDF (useful if you do ontologies) for instance.
For graph analytics you can use Spark, GraphX module or GraphLab.
Hope it helps.
I know I'm really late but first to clarify some terminology: Knowledge Graph and Ontology are similar (I'm talking in the Semantic Web paradigm). In the semantic web stack the foundation is RDF which is a language for defining graphs as triples (Subject, Predicate, Object). RDFS is a layer on top of RDF. It defines a meta-model, e.g., predicates such as rdf:type and nodes such as rdfs:Class. Although RDFS provides a meta-model there is no logical foundation for it so there are no reasoners that can validate the model or do further reasoning on it. The layer on top of RDFS is OWL (Web Ontology Language). That has a formal semantics defined by Description Logic which is a decidable subset of First Order Logic. It has more predefined nodes and links such as owl:Class, owl:ObjectProperty, etc. So when people use the term ontology they typically mean an OWL model. When they use the term Knowledge Graph it may refer to an ontology defined in OWL (because OWL is still ultimately an RDF graph) or it may mean just a graph in RDF/RDFS.
I said that because IMO the best way to build a knowledge graph is to define an ontology and then use various semantic web tools to load data (e.g., from spreadsheets) into the ontology. The best tool to start with IMO is the Protege ontology editor from Stanford. It's free and for a free open source tool very reliable and intuitive. And there is a good tutorial for how to use Protege and learn OWL as well as other Semantic Web tools such as SPARQL and SHACL. That tutorial can be found here: New Protege Pizza Tutorial (disclosure: that links to my site, I wrote the tutorial). If you want to get into the lower levels of the graph you probably want to check out a triplestore. It is a graph database designed for OWL and RDF models. The free version of Franz Inc's AllegroGraph triplestore is easy to use and supports 5M triples. Another good triplestore that is free and open source is part of the Apache Jena framework.

Unsupervised automatic tagging algorithms?

I want to build a web application that lets users upload documents, videos, images, music, and then give them an ability to search them. Think of it as Dropbox + Semantic Search.
When user uploads a new file, e.g. Document1.docx, how could I automatically generate tags based on the content of the file? In other words no user input is needed to determine what the file is about. If suppose that Document1.docx is a research paper on data mining, then when user searches for data mining, or research paper, or document1, that file should be returned in search results, since data mining and research paper will most likely be potential auto-generated tags for that given document.
1. Which algorithms would you recommend for this problem?
2. Is there an natural language library that could do this for me?
3. Which machine learning techniques should I look into to improve tagging precision?
4. How could I extend this to video and image automatic tagging?
Thanks in advance!
The most common unsupervised machine learning model for this type of task is Latent Dirichlet Allocation (LDA). This model automatically infers a collection of topics over a corpus of documents based on the words in those documents. Running LDA on your set of documents would assign words with probability to certain topics when you search for them, and then you could retrieve the documents with the highest probabilities to be relevant to that word.
There have been some extensions to images and music as well, see http://cseweb.ucsd.edu/~dhu/docs/research_exam09.pdf.
LDA has several efficient implementations in several languages:
many implementations from the original researchers
http://mallet.cs.umass.edu/, written in Java and recommended by others on SO
PLDA: a fast, parallelized C++ implementation
These guys propose an alternative to LDA.
Automatic Tag Recommendation Algorithms for
Social Recommender Systems
http://research.microsoft.com/pubs/79896/tagging.pdf
Haven't read thru the whole paper but they have two algorithms:
Supervised learning version. This isn't that bad. You can use Wikipedia to train the algorithm
"Prototype" version. Haven't had a chance to go thru this but this is what they recommend
UPDATE: I've researched this some more and I've found another approach. Basically, it's a two-stage approach that's very simple to understand and implement. While too slow for 100,000s of documents, it (probably) has good performance for 1000s of docs (so it's perfect for tagging a single user's documents). I'm going to try this approach and will report back on performance/usability.
In the mean time, here's the approach:
Use TextRank as per http://qr.ae/36RAP to generate a tag list for a single document. This generates a tag list for a single document independent of other documents.
Use the algorithm from "Using Machine Learning to Support Continuous
Ontology Development" (https://www.researchgate.net/publication/221630712_Using_Machine_Learning_to_Support_Continuous_Ontology_Development) to integrate the tag list (from step 1) into the existing tag list.
Text documents can be tagged using this keyphrase extraction algorithm/package.
http://www.nzdl.org/Kea/
Currently it supports limited type of documents (Agricultural and medical I guess) but you can train it according to your requirements.
I'm not sure how would the image/video part work out, unless you're doing very accurate object detection (which has it's own shortcomings). How are you planning to do it ?
You want Doc-Tags (https://www.Doc-Tags.com) which is a commercial product that automatically and Unsupervised - generates Contextually Accurate Document Tags. The built-in Reporting functionality makes the product a light-weight document management system.
For Developers wanting to customize their own approach - the source code is available (very cheap) and the back-end service xAIgent (https://xAIgent.com) is very inexpensive to use.
I posted a blog article today to answer your question.
http://scottge.net/2015/06/30/automatic-image-and-video-tagging/
There are basically two approaches to automatically extract keywords from images and videos.
Multiple Instance Learning (MIL)
Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), and the variants
In the above blog article, I list the latest research papers to illustrate the solutions. Some of them even include demo site and source code.
Thanks, Scott

Sentiment analysis

while performing sentiment analysis, how can I make the machine understand that I'm referring apple (the iphone), instead of apple (the fruit)?
Thanks for the advise !
Well, there are several methods,
I would start with checking Capital letter, usually, when referring to a name, first letter is capitalized.
Before doing sentiment analysis, I would use some Part-of-speech and Named Entity Recognition to tag the relevant words.
Stanford CoreNLP is a good text analysis project to start with, it will teach
you the basic concepts.
Example from CoreNLP:
You can see how the tags can help you.
And check out more info
As described by Ofiris, NER is only one way to do solve your problem. I feel it's more effective to use word embedding to represent your words. In that way machine automatically recognize the context of the word. As an example "Apple" is mostly coming together with "eat" and But if the given input "Apple" is present with "mobile" or any other word in that domain, Machine will understand it's "iPhone apple" instead of "apple fruit". There are 2 popular ways to generate word embeddings such as word2vec and fasttext.
Gensim provides more reliable implementations for both word2vec and fasttext.
https://radimrehurek.com/gensim/models/word2vec.html
https://radimrehurek.com/gensim/models/fasttext.html
In presence of dates, famous brands, vip or historical figures you can use a NER (named entity recognition) algorithm; in such case, as suggested by Ofiris, the Stanford CoreNLP offers a good Named entity recognizer.
For a more general disambiguation of polysemous words (i.e., words having more than one sense, such as "good") you could use a POS tagger coupled with a Word Sense Disambiguation (WSD) algorithm. An example of the latter can be found HERE, but I do not know any freely downloadable library for this purpose.
This problem has already been solved by many open source pre-trained NER models. Anyways you can try retraining an existing NER models to finetune them to solve this issue.
You can find an demo of NER results as done by Spacy NER here.

How to tackle twitter sentiment analysis?

I'd like you to give me some advice in order to tackle this problem. At college I've been solving opinion mining tasks but with Twitter the approach is quite different. For example, I used an ensemble learning approach to classify users opinions about a certain Hotel in Spain. Of course, I was given a training set with positive and negative opinions and then I tested with the test set. But now, with twitter, I've found this kind of categorization very difficult.
Do I need to have a training set? and if the answer to this question is positive, don't you think twitter is so temporal so if I have that set, my performance on future topics will be very poor?
I was thinking in getting a dictionary (mainly adjectives) and cross my tweets with it and obtain a term-document matrix but I have no class assigned to any twitter. Also, positive adjectives and negative adjectives could vary depending on the topic and time. So, how to deal with this?
How to deal with the problem of languages? For instance, I'd like to study tweets written in English and those in Spanish, but separately.
Which programming languages do you suggest to do something like this? I've been trying with R packages like tm, twitteR.
Sure, I think the way sentiment is used will stay constant for a few months. worst case you relabel and retrain. Unsupervised learning has a shitty track record for industrial applications in my experience.
You'll need some emotion/adj dictionary for sentiment stuff- there are some datasets out there but I forget where they are. I may have answered previous questions with better info.
Just do English tweets, it's fairly easy to build a language classifier, but you want to start small, so take it easy on yourself
Python (NLTK) if you want to do it easily in a small amount of code. Java has good NLP stuff, but Python and it's libraries are way more user friendly
This site: https://sites.google.com/site/miningtwitter/questions/sentiment provides 3 ways to do sentiment analysis using R.
The twitter package is now updated to work with the new twitter API. I'd you download the source version of the package to avoid getting duplicated tweets.
I'm working on a spanish dictionary for opinion mining, and would publish somewhere accesible.
cheers!
Sentiment Analysis will give only 3 results as said above - positive, negative and neutral. I found a tutorial on Twitter Sentiment analysis and it's quiet easy.
I found it here - https://www.ai-ml.tech/twitter-sentiment-analysis/
Only 3 dependencies, i downloaded and lesser code, done. Just go through it, you will get the solution.

Resources