I have been looking around for Sentiment and text analysis services but most of them seem to analyse the whole text and provide one result for it.
Is there a way of analysing the same piece of text against two different keywords? For example, the same article could be talking about two entities, positively towards one and negatively towards the other.
How could one get these two sentiments within the same text? Is there a service or API already for that?
I have found IBM's AlchemyAPI, but it doesn't seem to return accurate results...
What you want is aspect-based sentiment analysis. There are many algorithms for it, with varying precision and recall.
You can use Aylien's Text Analysis API.
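If you just need a rough baseline rather than true aspect-based sentiment analysis, you can split the text into sentences and score only the sentences that mention each entity. A minimal sketch using NLTK's VADER analyzer (the entity names and example text below are placeholders):

```python
# Rough per-entity sentiment: score only sentences that mention the entity.
# Requires: pip install nltk; then nltk.download('punkt') and nltk.download('vader_lexicon')
from nltk.tokenize import sent_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer

def entity_sentiment(text, entities):
    sia = SentimentIntensityAnalyzer()
    scores = {}
    for entity in entities:
        # Keep only the sentences that mention this entity.
        sentences = [s for s in sent_tokenize(text) if entity.lower() in s.lower()]
        if not sentences:
            scores[entity] = None
            continue
        # Average the compound score (-1 = very negative, +1 = very positive).
        scores[entity] = sum(sia.polarity_scores(s)["compound"] for s in sentences) / len(sentences)
    return scores

text = "Acme's new phone is excellent. BetaCorp's model, however, is a disappointment."
print(entity_sentiment(text, ["Acme", "BetaCorp"]))
```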
I need to perform sentiment analysis on news articles about a specific topic using the Stanford NLP tool.
That tool only allows sentence-based sentiment analysis, while I would like to extract a sentiment evaluation of whole articles with respect to my topic.
For instance, if my topic is Apple, I would like to know the sentiment of a news article with respect to Apple.
Just computing the average of the sentences in my articles won't do. For instance, I might have an article saying something along the lines of "Apple is very good at this, and this and that. While Google products are very bad for these reasons". Such an article would result in a Neutral classification using the average score of sentences, while it is actually a Very positive article about Apple.
On the other hand filtering my sentences to include only the ones containing the word Apple would miss articles along the lines of "Apple's product A is pretty good. However, it lacks the following crucial features: ...". In this case the effect of the second sentence would be lost if I were to use only the sentences containing the word Apple.
Is there a standard way of addressing this kind of problem? Is Stanford NLP the wrong tool to accomplish my goal?
Update: you might want to look into http://blog.getprismatic.com/deeper-content-analysis-with-aspects/
This is a very active area of research, so it would be hard to find an off-the-shelf tool to do this (at least nothing built into Stanford CoreNLP). Some pointers: look into aspect-based sentiment analysis. In this case, Apple would be an "aspect" (not exactly, but it can be modeled that way). Andrew McCallum's group at UMass, Bing Liu's group at UIC, and Cornell's NLP group, among others, have worked on this problem.
If you want a quick fix, I would suggest extracting sentiment from sentences that refer to Apple and its products, and using coreference resolution (check out the dcoref annotator in Stanford CoreNLP), which will increase the recall of relevant sentences and handle cases like "However, it lacks...".
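Here is a rough sketch of that idea against a running CoreNLP server started with the sentiment and dcoref annotators (pip install pycorenlp). The JSON field names below are what the server returns in my experience, but they may vary slightly between CoreNLP versions:

```python
# Sketch: average CoreNLP sentence-level sentiment over sentences that mention
# the target entity, either literally or via a coreference chain.
# Assumes a CoreNLP server is running on localhost:9000.
from pycorenlp import StanfordCoreNLP

def entity_sentiment(text, target):
    nlp = StanfordCoreNLP("http://localhost:9000")
    ann = nlp.annotate(text, properties={
        "annotators": "tokenize,ssplit,pos,lemma,ner,parse,sentiment,dcoref",
        "outputFormat": "json",
    })

    # Sentences that mention the target directly (1-based sentence numbers).
    relevant = {i + 1 for i, s in enumerate(ann["sentences"])
                if target.lower() in " ".join(t["word"] for t in s["tokens"]).lower()}

    # Add sentences linked to the target through a coreference chain.
    for chain in ann.get("corefs", {}).values():
        if any(target.lower() in mention["text"].lower() for mention in chain):
            relevant.update(mention["sentNum"] for mention in chain)

    # sentimentValue is a string from "0" (very negative) to "4" (very positive).
    values = [int(s["sentimentValue"]) for s in ann["sentences"]
              if s["index"] + 1 in relevant]
    return sum(values) / len(values) if values else None

print(entity_sentiment("Apple's product is great. However, it lacks key features.", "Apple"))
```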
I want to build a web application that lets users upload documents, videos, images, music, and then give them an ability to search them. Think of it as Dropbox + Semantic Search.
When a user uploads a new file, e.g. Document1.docx, how could I automatically generate tags based on the content of the file? In other words, no user input should be needed to determine what the file is about. Suppose that Document1.docx is a research paper on data mining; then when a user searches for data mining, research paper, or document1, that file should be returned in the search results, since data mining and research paper will most likely be auto-generated tags for that document.
1. Which algorithms would you recommend for this problem?
2. Is there a natural language processing library that could do this for me?
3. Which machine learning techniques should I look into to improve tagging precision?
4. How could I extend this to video and image automatic tagging?
Thanks in advance!
The most common unsupervised machine learning model for this type of task is Latent Dirichlet Allocation (LDA). This model automatically infers a collection of topics over a corpus of documents based on the words in those documents. Running LDA on your set of documents assigns each word a probability of belonging to each topic, so when a user searches for a word you can retrieve the documents most strongly associated with that word's topics.
There have been some extensions to images and music as well, see http://cseweb.ucsd.edu/~dhu/docs/research_exam09.pdf.
LDA has several efficient implementations in several languages:
many implementations from the original researchers
http://mallet.cs.umass.edu/, written in Java and recommended by others on SO
PLDA: a fast, parallelized C++ implementation
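gensim (not in the list above) also has a solid Python implementation. A minimal sketch, with placeholder documents standing in for the text you would extract from each uploaded file:

```python
# Sketch: infer topics over a small corpus with gensim's LDA implementation.
# Requires: pip install gensim
from gensim import corpora, models

docs = [
    "data mining research paper on clustering algorithms",
    "guitar chords and music theory for beginners",
    "mining large data sets with scalable clustering",
]
texts = [d.lower().split() for d in docs]

dictionary = corpora.Dictionary(texts)           # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words vectors
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# Top words per topic, and the topic mixture of the first document.
for topic_id in range(lda.num_topics):
    print(topic_id, lda.show_topic(topic_id, topn=5))
print(lda.get_document_topics(corpus[0]))
```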
These guys propose an alternative to LDA: "Automatic Tag Recommendation Algorithms for Social Recommender Systems" (http://research.microsoft.com/pubs/79896/tagging.pdf).
I haven't read through the whole paper, but they propose two algorithms:
A supervised learning version. This isn't that bad; you can use Wikipedia to train the algorithm.
A "prototype" version. I haven't had a chance to go through this yet, but it is the one they recommend.
UPDATE: I've researched this some more and I've found another approach. Basically, it's a two-stage approach that's very simple to understand and implement. While too slow for 100,000s of documents, it (probably) has good performance for 1000s of docs (so it's perfect for tagging a single user's documents). I'm going to try this approach and will report back on performance/usability.
In the meantime, here's the approach:
Use TextRank (as per http://qr.ae/36RAP) to generate a tag list for a single document, independently of the other documents (a rough sketch follows this list).
Use the algorithm from "Using Machine Learning to Support Continuous Ontology Development" (https://www.researchgate.net/publication/221630712_Using_Machine_Learning_to_Support_Continuous_Ontology_Development) to integrate the tag list from step 1 into the existing tag list.
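Here is a rough sketch of the TextRank keyword step, building a word co-occurrence graph and ranking nodes with PageRank; the window size and the tiny stop-word list are arbitrary choices for illustration, not tuned values:

```python
# Sketch: TextRank-style keyword extraction via PageRank on a co-occurrence graph.
# Requires: pip install networkx
import re
import networkx as nx

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "for", "on", "with", "at"}

def textrank_keywords(text, top_n=10, window=4):
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    graph = nx.Graph()
    # Connect words that co-occur within a sliding window.
    for i in range(len(words)):
        for j in range(i + 1, min(i + window, len(words))):
            if words[i] != words[j]:
                graph.add_edge(words[i], words[j])
    scores = nx.pagerank(graph)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(textrank_keywords("Data mining is the process of discovering patterns in large data sets "
                        "involving methods at the intersection of machine learning and statistics."))
```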
Text documents can be tagged using this keyphrase extraction algorithm/package.
http://www.nzdl.org/Kea/
Currently it supports a limited set of document types (agricultural and medical, I believe), but you can train it for your own requirements.
I'm not sure how the image/video part would work out, unless you're doing very accurate object detection (which has its own shortcomings). How are you planning to do that?
You want Doc-Tags (https://www.Doc-Tags.com), a commercial product that automatically, and without supervision, generates contextually accurate document tags. The built-in reporting functionality makes the product a lightweight document management system.
For developers wanting to customize their own approach, the source code is available (very cheaply), and the back-end service xAIgent (https://xAIgent.com) is very inexpensive to use.
I posted a blog article today to answer your question.
http://scottge.net/2015/06/30/automatic-image-and-video-tagging/
There are basically two approaches to automatically extracting keywords from images and videos:
Multiple Instance Learning (MIL)
Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), and their variants
In the blog article above, I list the latest research papers that illustrate these solutions. Some of them even include demo sites and source code.
Thanks, Scott
Given a few words of input, I want to have a utility that will return a diverse set of relevant terms, phrases, or concepts. A caveat is that it would need to have a large graph of terms to begin with, or else the feature would not be very useful.
For example, submitting "baseball" would return
["shortstop", "Babe Ruth", "foul ball", "steroids", ... ]
Google Sets is the best example I can find of this kind of feature, but I can't use it since they have no public API (and I won't go against their TOS). Also, single-word input doesn't produce a very diverse set of results. I'm looking for a solution that goes off on tangents.
The closest I've experimented with is using Wikipedia's API to search categories and backlinks, but there's no way to sort those results directly by "relevance" or "popularity". Without that, the suggestion list is massive and all over the place, which is not immediately useful and very hard to whittle down.
Using a thesaurus could also work, minimally, but that would leave out proper nouns and tangentially relevant terms (like any of the results listed above).
I would happily reuse an open service, if one exists, but I haven't found anything sufficient.
I'm looking either for a way to implement this in-house with a decently populated starting set, or for a free service that offers this.
Have a solution? Thanks ahead of time!
UPDATE: Thank you for the incredibly dense & informative answers. I'll choose a winning answer in 6 to 12 months, when I'll hopefully understand what you've all suggested =)
You might be interested in WordNet. It takes a bit of linguistic knowledge to understand the API, but basically the system is a database of meaning-based links between English words, which is more or less what you're searching for. I'm sure I can dig up more information if you want it.
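For instance, with NLTK's WordNet interface you can walk synonym, hypernym, hyponym, and meronym links from a seed word. A sketch (note that WordNet will give you common nouns like "shortstop" but not proper nouns like "Babe Ruth"):

```python
# Sketch: collect terms linked to a seed word through WordNet relations.
# Requires: pip install nltk; then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def related_terms(word):
    terms = set()
    for synset in wn.synsets(word):
        terms.update(l.name().replace("_", " ") for l in synset.lemmas())
        for rel in synset.hypernyms() + synset.hyponyms() + synset.part_meronyms():
            terms.update(l.name().replace("_", " ") for l in rel.lemmas())
    terms.discard(word)
    return sorted(terms)

print(related_terms("baseball"))
```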
Peter Norvig (director of research at Google) spoke about how they do this at Google (specifically mentioning Google Sets) in a Facebook Tech Talk. The idea is that a relatively simple algorithm on a huge dataset (e.g. the entire web) is much better than a complicated algorithm on a small data set.
You could look at Google's n-gram collection as a starting point. You'd start to see what concepts are grouped together. Norvig hinted that internally Google has up to 7-grams for use in things like Google Translate.
If you're more ambitious, you could download all of Wikipedia's articles in the language you desire and create your own n-gram database.
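A toy sketch of that idea: count which words co-occur with a query word inside a fixed window over a corpus, and return the most frequent neighbours. The two-line corpus here is a placeholder; you would feed it Wikipedia or web text instead:

```python
# Sketch: "related terms" as the most frequent co-occurring words in a corpus.
import re
from collections import Counter, defaultdict

def build_cooccurrence(corpus_lines, window=5):
    cooc = defaultdict(Counter)
    for line in corpus_lines:
        words = re.findall(r"[a-z]+", line.lower())
        for i, w in enumerate(words):
            for neighbour in words[max(0, i - window): i + window + 1]:
                if neighbour != w:
                    cooc[w][neighbour] += 1
    return cooc

corpus = ["the shortstop fielded the ground ball during the baseball game",
          "babe ruth remains the most famous baseball player in history"]
cooc = build_cooccurrence(corpus)
print(cooc["baseball"].most_common(5))
```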
The problem is even more complicated if you just have a single word; check out this recent thesis for more details on word sense disambiguation.
It's not an easy problem, but it is useful as you mentioned. In the end, I think you'll find that a really successful implementation will have a relatively simple algorithm and a whole lot of data.
Take a look at the following two papers:
Clustering User Queries of a Search Engine [pdf]
Topic Detection by Clustering Keywords [pdf]
Here is my attempt at a very simplified explanation:
If we have a database of past user queries, we can define a similarity function between two queries. For example: number of words in common. Now for each query in our database, we compute its similarity with each other query, and remember the k most similar queries. The non-overlapping words from these can be returned as "related terms".
We can also take this approach with a database of documents containing information users might be searching for. We can define the similarity between two search terms as the number of documents containing both divided by the number of documents containing either. To decide which terms to test, we can scan the documents and throw out words that are either too common ('and', 'the', etc.) or that are too obscure.
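A minimal sketch of that second idea, scoring how related two terms are by how many documents contain both relative to how many contain either (the toy documents are placeholders):

```python
# Sketch: term relatedness = |docs containing both| / |docs containing either|.
def term_similarity(term_a, term_b, documents):
    docs_a = {i for i, d in enumerate(documents) if term_a.lower() in d.lower()}
    docs_b = {i for i, d in enumerate(documents) if term_b.lower() in d.lower()}
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0

documents = ["The Celtics beat the Lakers last night",
             "Lakers and Celtics meet again in the finals",
             "New gardening tips for spring"]
print(term_similarity("Celtics", "Lakers", documents))
```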
If our data permits, we could instead look at which queries led users to choose which results, rather than comparing documents by content. For example, if we had data showing that users searching for "Celtics" and "Lakers" both ended up clicking on espn.com, then we could call these related terms.
If you're starting from scratch with no data about past user queries, then you can try Wikipedia, or the Bag of Words dataset as a database of documents. If you are looking for a database of user search terms and results, and if you are feeling adventurous, then you can take a look at the AOL Search Data.
When developing a database of articles in a Knowledge Base (for example) - what are the best ways to sort and display the most relevant answers to a users' question?
Would you use additional data, such as keyword weighting based on whether previous users found the article helpful, or do you find a simple keyword-matching algorithm to be sufficient?
Perhaps the easiest and most naive approach that will give immediately useful results would be to implement tf-idf:
Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.
In a recent related question of mine here I learned of an excellent free book on this topic which you can download or read online:
An Introduction to Information Retrieval
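For a concrete starting point, here is a minimal tf-idf ranking sketch with scikit-learn; the articles and query are placeholders, and a real knowledge base would also want stemming, synonym handling, and so on:

```python
# Sketch: rank knowledge-base articles against a query by tf-idf cosine similarity.
# Requires: pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

articles = [
    "How to reset your password and recover your account",
    "Configuring email notifications for new tickets",
    "Troubleshooting login problems and account lockouts",
]
query = "I cannot log in to my account"

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(articles)
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, doc_vectors).ravel()
for idx in scores.argsort()[::-1]:
    print(round(float(scores[idx]), 3), articles[idx])
```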
That's a hard question, and companies like Google are putting a lot of effort into addressing it. Have a look at the Google Enterprise Search Appliance or Exalead Enterprise Search.
As a personal opinion, I don't think any naive approach is going to improve the results much compared to plain keyword search plus ordering by the number of views on the documents.
If you have the possibility of exposing your knowledge base to the web, then just do it and let your favorite search engine handle the search for you.
I think the angle here is not the retrieval itself... it's about scoring the relevance of the information retrieved (a more reactive and passive approach), which can later be used to improve the search engine.
I guess you can try:
kNN on tf-idf for retrieving information
Hand-tagging the retrieved results with a relevance score
Regressing that score to predict the score for an unknown search result, and sorting by it
Just a thought...
The third point is actually based on the Rocchio algorithm; you can see it here.
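Here is a rough sketch of steps 1-3, regressing hand-tagged relevance scores from tf-idf features of the query/result text; the tiny training set is a placeholder, and in practice you would collect many judged query/result pairs:

```python
# Sketch: learn to predict a relevance score for (query, document) text pairs
# from tf-idf features of the concatenated pair, then sort new results by it.
# Requires: pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsRegressor

# Hand-tagged (query + retrieved snippet, relevance score) examples.
pairs = [
    ("reset password how to reset your password", 1.0),
    ("reset password configuring email notifications", 0.1),
    ("login error troubleshooting login problems", 0.9),
    ("login error gardening tips for spring", 0.0),
]
texts, scores = zip(*pairs)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = KNeighborsRegressor(n_neighbors=2).fit(X, scores)

# Score unseen results for a new query and sort them.
candidates = ["account lockouts and login problems", "spring gardening schedule"]
query = "cannot log in"
X_new = vectorizer.transform([query + " " + c for c in candidates])
ranked = sorted(zip(model.predict(X_new), candidates), reverse=True)
print(ranked)
```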
A little more specificity about your exact problem would be good. There are a lot of different techniques you can use, and many of them are driven by other pieces of data. You can of course use Lucene and build your own indexes; there are Lucene bindings for many languages. Moving up from there is the Solr project, which is Lucene with a lot of tools and extra functionality around it. That may be more along the lines of what you are looking for.
Intent is tricky, and most modern search engines rely on statistical intent to aid the ordering of results. You can always have a "Was this article useful?" button and store the query text that leads to useful documents. You could then add a layer of information to the index to boost specific words or phrases and help them point to certain documents.
Some things to think about: How many documents are there? What is their average length? Are they updated frequently? What do users do with the documents? What does the spread of unique words to documents look like? (More simply: is it easy to match a query to a specific document based on distinctive words?)
If it is on the web, you can always make a Google Custom Search Engine that searches just your site, although you may find this sub-optimal for a variety of reasons.
You can always start with a simple index and gradually make it more sophisticated by talking with users and capturing data.
Keyword matching is not enough when dealing with questions; you need to understand intent, which, as joannes says, is a very hot topic in search.
I would like to extract a reduced collection of "meaningful" tags (10 max) out of an English text of any size.
http://tagcrowd.com/ is quite interesting, but the algorithm seems very basic (just word counting).
Is there any other existing algorithm to do this?
There are existing web services for this. Three examples:
Yahoo's Term Extraction API
Topicalizer
OpenCalais
When you subtract the human element (tagging), all that is left is frequency. "Ignore common English words" is the next best filter, since it deals with exclusion instead of inclusion. I tested a few sites, and it is very accurate. There really is no other way to derive "meaning", which is why the Semantic Web gets so much attention these days. It is a way to imply meaning with HTML... of course, that has a human element to it as well.
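In code, that baseline is just a counter plus an exclusion list (a toy sketch; the stop list here is deliberately tiny):

```python
# Sketch: tags = the most frequent words after excluding common English words.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "that", "this", "with", "at"}

def frequency_tags(text, top_n=10):
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    return [word for word, _ in Counter(words).most_common(top_n)]

print(frequency_tags("The quick brown fox jumps over the lazy dog. The dog barks at the fox."))
```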
Basically, this is a text categorization (document classification) problem. If you have access to a number of already-tagged documents, you could analyze which (content) words trigger which tags, and then use this information for tagging new documents.
If you don't want to use a machine-learning approach and you still have a document collection, then you can use metrics like tf.idf to pick out interesting words.
Going one step further, you can use WordNet to find synonyms and replace a word by its synonym if the synonym's frequency is higher.
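A sketch of that last step with NLTK's WordNet: replace a word by one of its synonyms when the synonym occurs more often in the document, so the counts concentrate on a single surface form (the example sentence is a placeholder):

```python
# Sketch: merge words onto their most frequent WordNet synonym before counting.
# Requires: pip install nltk; then nltk.download('wordnet')
import re
from collections import Counter
from nltk.corpus import wordnet as wn

def merge_synonyms(text):
    words = re.findall(r"[a-z]+", text.lower())
    freq = Counter(words)
    merged = []
    for w in words:
        synonyms = {l.name().lower() for s in wn.synsets(w) for l in s.lemmas()}
        # Pick the synonym (possibly the word itself) that is most frequent in the text.
        best = max(synonyms | {w}, key=lambda cand: freq.get(cand, 0))
        merged.append(best)
    return Counter(merged)

print(merge_synonyms("The car dealership sold one automobile today, and the automobile was red."))
```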
The Manning & Schütze book contains a much more thorough introduction to text categorization.
In text classification, this problem is known as dimensionality reduction. There are many useful algorithms in the literature on this subject.
You want to do semantic analysis of a text.
Word frequency analysis is one of the easiest ways to do semantic analysis. Unfortunately (and obviously) it is also the least accurate. It can be improved by using special dictionaries (such as dictionaries of synonyms or word forms), "stop lists" of common words, other texts (to find those "common" words and exclude them), and so on.
As for other algorithms they could be based on:
Syntax analysis (like trying to find the main subject and/or verb in a sentence)
Format analysis (analyzing headers, bold text, italic... where applicable)
Reference analysis (if the text is on the Internet, for example, then a link pointing to it may describe it in a few words... used by some search engines)
BUT... you should understand that these algorithms are merely heuristics for semantic analysis, not strict algorithms guaranteed to achieve the goal.
The problem of semantic analysis has been one of the main problems in artificial intelligence/machine learning research since the first computers appeared.
Perhaps "Term Frequency - Inverse Document Frequency" TF-IDF would be useful...
You can do this in two steps:
1 - Try topic modeling algorithms:
Latent Dirichlet Allocation
Latent word Embeddings
2 - After that, select the most representative word of each topic as a tag (a sketch follows below)
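A sketch of those two steps with scikit-learn's LDA implementation, taking the single highest-weighted word of each topic as a tag; the number of topics and the toy corpus are arbitrary choices for illustration:

```python
# Sketch: topic modeling with LDA, then the top word of each topic becomes a tag.
# Requires: pip install scikit-learn
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "stock market prices fell as investors worried about inflation",
    "the championship game went to overtime after a late goal",
    "central bank raises interest rates to fight inflation",
    "star striker scores twice in the cup final",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
vocab = vectorizer.get_feature_names_out()

# Step 2: the most representative word per topic.
tags = [vocab[row.argmax()] for row in lda.components_]
print(tags)
```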