Aspect extraction and Vector Space Model - sentiment-analysis

I have a dataset of reviews like:
"Teacher","Subject","Feedback"
"Dr.Reddy","DSP","He has very good subject knowledge. He didn't take all the lectures. He teaches and explains concepts very well."
"Ms. Vibha","OOPS","She is very regular with classes. But she does not teach in an easy way."
Is it possible to extract aspects (features) from this data, along with opinion words, using any vector space model (word2vec, tf-idf, etc.)?
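One common baseline for this kind of aspect/opinion extraction (not something the question prescribes) is to POS-tag each review, treat nouns as candidate aspects, and take nearby adjectives as opinion words; a vector space model can then be used afterwards to group the extracted aspects. A minimal sketch with NLTK; the file name feedback.csv is a placeholder for the dataset above:

```python
import csv
import nltk  # assumes the punkt and averaged_perceptron_tagger data are downloaded

def aspect_opinion_pairs(text):
    """Rough baseline: nouns as candidate aspects, adjectives as opinion words."""
    pairs = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        nouns = [w for w, t in tagged if t.startswith("NN")]
        adjectives = [w for w, t in tagged if t.startswith("JJ")]
        if nouns and adjectives:
            pairs.append((nouns, adjectives))
    return pairs

# feedback.csv is assumed to have the "Teacher","Subject","Feedback" header shown above
with open("feedback.csv", newline="") as f:
    for row in csv.DictReader(f):
        print(row["Teacher"], aspect_opinion_pairs(row["Feedback"]))
```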

Related

Given a list of words, how to develop an algorithmic way to semantically group them?

I am working with the Google Places API, which defines a list of 97 different location types. I want to reduce this list to a smaller number of categories, since many of them are groupable. For example, atm and bank into financial; temple, church, mosque, synagogue into worship; school, university into education; subway_station, train_station, transit_station, gas_station into transportation.
But it should not overgeneralize either; for example, pet_store, city_hall, courthouse, and restaurant should not be lumped into something as broad as buildings.
I tried quite a few methods. First I downloaded synonyms of each of the 97 words in the list from multiple dictionaries. Then I computed the similarity between two words as the fraction of unique synonyms they share in common (Jaccard similarity).
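The Jaccard step described above fits in a few lines; the synonym lists here are made-up examples, not the actual dictionary data:

```python
def jaccard_similarity(synonyms_a, synonyms_b):
    """Fraction of unique synonyms two words share: |A ∩ B| / |A ∪ B|."""
    a, b = set(synonyms_a), set(synonyms_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Hypothetical synonym lists for two of the 97 location types
print(jaccard_similarity(["cash machine", "cashpoint", "bank"],
                         ["bank", "depository", "financial institution"]))
```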
But after that, how do I group the words into clusters? With traditional clustering methods (k-means, k-medoids, hierarchical clustering, and FCM) I am not getting any good clustering; I identified several misclassifications by scanning the results manually.
I even tried the word2vec model trained on Google News data (where each word is expressed as a vector of 300 features), and I do not get good clusters from that either.
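For reference, the word2vec-plus-clustering attempt described above typically looks something like the following sketch (the model file name and the six sample words are placeholders):

```python
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

# Pre-trained Google News vectors (300 dimensions); the path is a placeholder.
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

location_types = ["atm", "bank", "church", "mosque", "school", "university"]
words = [w for w in location_types if w in vectors]
X = [vectors[w] for w in words]

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(words, labels)))
```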
You are probably looking for something related to vector space dimensionality reduction. For these techniques, you'll need a corpus of text that uses the locations as words in the text. Dimensionality reduction will then group the terms together. You can do some reading on Latent Dirichlet Allocation and Latent Semantic Indexing. A good reference is "Introduction to Information Retrieval" by Manning et al., chapter 18. Note that this book is from 2009, so a lot of newer advances are not captured; as you noted, there has been much work since, such as word2vec. Another good reference is "Speech and Language Processing" by Jurafsky and Martin, chapter 16.
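As a flavour of what LSI looks like in practice, here is a minimal gensim sketch; the four toy sentences stand in for the kind of corpus that actually uses the location words in context:

```python
from gensim import corpora, models

documents = [
    "withdrew cash from the atm at the bank",
    "the bank approved the loan",
    "prayers at the temple and the mosque",
    "the church holds a service every sunday",
]
texts = [doc.split() for doc in documents]
dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

# Latent Semantic Indexing: project the term-document matrix onto a few latent topics
lsi = models.LsiModel(bow, id2word=dictionary, num_topics=2)
for topic_id, topic in lsi.print_topics():
    print(topic_id, topic)
```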
You need much more data.
No algorithm, without additional data, will ever relate atm and bank to financial, because that requires knowledge about those terms.
Jaccard similarity has no access to such knowledge; it can only work on the words themselves. And by that measure, "river bank" and "bank branch" are very similar.
So don't expect the magic to come from the algorithm. The magic needs to be in the data...

Algorithm to compare similarity of ideas (as strings)

Consider an arbitrary text box that records answers to the question: what do you want to do before you die?
Using a collection of response strings (max length 240), I'd like to somehow sort and group them and count them by idea (which may be just string similarity as described in this question).
Is there another or better way to do something like this?
Is this any different than string similarity?
Is this the right question to be asking?
The idea here is that people write into the text box over and over again, and I provide a number that says, generally speaking, that 802 people wrote approximately the same thing.
It is much more difficult than string similarity. This is what you need to do at a minimum:
Perform some text formatting/cleaning tasks, like removing punctuation characters and common "stop words".
Construct a corpus (a collection of words with their usage statistics) from the terms that occur in the answers.
Calculate a weight for every term.
Construct a document vector from every answer (each term corresponds to a dimension in a very high-dimensional Euclidean space).
Run a clustering algorithm on document vectors.
Read a good statistical natural language processing book, or search Google for good introductions/tutorials (likely terms: statistical NLP, text categorization, clustering). You can probably find some libraries (Weka or NLTK come to mind) depending on the language of your choice, but you need to understand the concepts to use the library anyway.
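Those steps map almost directly onto scikit-learn; a minimal sketch with invented answers (cleaning, corpus construction, term weighting, and vectorisation are all handled by TfidfVectorizer):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

answers = [
    "travel the world and see every continent",
    "see the whole world",
    "write a novel",
    "publish a book of my own",
]  # placeholder responses

# Steps 1-4: clean, build the corpus, weight terms, and build document vectors
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(answers)

# Step 5: cluster the document vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(list(zip(answers, labels)))
```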
Latent Semantic Analysis (LSA) might interest you. Here is a nice introduction.
Latent semantic analysis (LSA) is a technique in natural language processing, in particular in vectorial semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.
[...]
What you want is very much an open problem in NLP. @Ali's answer describes the idea at a high level, but the part "construct a document vector for every answer" is the really hard one. There are a few obvious ways of building a document vector from the vectors of the words it contains. Addition, multiplication, and averaging are fast, but they effectively ignore the syntax: "Man bites dog" and "Dog bites man" will have the same representation, but clearly not the same meaning. Google "compositional distributional semantics"; as far as I know, there are people at the Universities of Texas, Trento, Oxford, and Sussex, and at Google, working in this area.
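To make the "averaging loses syntax" point concrete, here is a toy illustration (the three-dimensional word vectors are invented; real systems would use word2vec or similar):

```python
import numpy as np

# Toy word vectors just for illustration
vectors = {
    "man":   np.array([0.2, 0.9, 0.1]),
    "bites": np.array([0.7, 0.1, 0.5]),
    "dog":   np.array([0.4, 0.3, 0.8]),
}

def average_vector(sentence):
    """Mean of the word vectors: fast, but word order is thrown away."""
    return np.mean([vectors[w] for w in sentence.lower().split()], axis=0)

print(np.allclose(average_vector("man bites dog"),
                  average_vector("dog bites man")))  # True: identical representations
```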

Comparing two English strings for similarities

So here is my problem. I have two paragraphs of text and I need to see if they are similar. Not in the sense of string metrics but in meaning. The following two paragraphs are related but I need to find out if they cover the 'same' topic. Any help or direction to solving this problem would be greatly appreciated.
Fossil fuels are fuels formed by natural processes such as anaerobic
decomposition of buried dead organisms. The age of the organisms and
their resulting fossil fuels is typically millions of years, and
sometimes exceeds 650 million years. The fossil fuels, which contain
high percentages of carbon, include coal, petroleum, and natural gas.
Fossil fuels range from volatile materials with low carbon:hydrogen
ratios like methane, to liquid petroleum to nonvolatile materials
composed of almost pure carbon, like anthracite coal. Methane can be
found in hydrocarbon fields, alone, associated with oil, or in the
form of methane clathrates. It is generally accepted that they formed
from the fossilized remains of dead plants by exposure to heat and
pressure in the Earth's crust over millions of years. This biogenic
theory was first introduced by Georg Agricola in 1556 and later by
Mikhail Lomonosov in the 18th century.
Second:
Fossil fuel reforming is a method of producing hydrogen or other
useful products from fossil fuels such as natural gas. This is
achieved in a processing device called a reformer which reacts steam
at high temperature with the fossil fuel. The steam methane reformer
is widely used in industry to make hydrogen. There is also interest in
the development of much smaller units based on similar technology to
produce hydrogen as a feedstock for fuel cells. Small-scale steam
reforming units to supply fuel cells are currently the subject of
research and development, typically involving the reforming of
methanol or natural gas but other fuels are also being considered such
as propane, gasoline, autogas, diesel fuel, and ethanol.
That's a tall order. If I were you, I'd start reading up on Natural Language Processing. NLP is a fairly large field -- I would recommend looking specifically at the things mentioned in the Wikipedia Text Analytics article's "Processes" section.
I think if you make use of information retrieval, named entity recognition, and sentiment analysis, you should be well on your way.
In general, I believe that this is still an open problem. Natural language processing is still a nascent field and while we can do a few things really well, it's still extremely difficult to do this sort of classification and categorization.
I'm not an expert in NLP, but you might want to check out these lecture slides that discuss sentiment analysis and authorship detection. The techniques you might use to do the sort of text comparison you've suggested are related to the techniques you would use for the aforementioned analyses, and you might find this to be a good starting point.
Hope this helps!
You could also have a look at the Latent Dirichlet Allocation (LDA) model from machine learning. The idea there is to find a low-dimensional representation of each document (or paragraph), simply as a distribution over some 'topics'. The model is trained in an unsupervised fashion using a collection of documents/paragraphs.
If you run LDA on your collection of paragraphs, then by comparing the hidden topic vectors, you can tell whether two given paragraphs are related or not.
Of course, the baseline is not to use LDA at all, and instead use term frequencies (weighted with tf-idf) to measure similarity (the vector space model).
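The tf-idf baseline mentioned in the last sentence is only a few lines with scikit-learn; the paragraph strings are abbreviated here, so paste in the full texts from the question:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

paragraph_1 = "Fossil fuels are fuels formed by natural processes ..."       # first text above
paragraph_2 = "Fossil fuel reforming is a method of producing hydrogen ..."  # second text above

# Vector space model: tf-idf weighted term vectors, compared with cosine similarity
X = TfidfVectorizer(stop_words="english").fit_transform([paragraph_1, paragraph_2])
print(cosine_similarity(X[0], X[1]))  # shared vocabulary ("fossil", "fuel", ...) raises the score
```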

Looking for algorithms to generate realistic planets

I'd like to collect a list of algorithms and other resources to generate realistic and interesting visuals of planets. The visual should look like something which you'd expect to find on the NASA homepage. Key attributes would be:
a nice colorful atmosphere for gas giants
rings (optional)
impact craters for solid rocks without atmosphere
inhabitable planets could have features like oceans, mountains, rivers, forests
inhabitables could even have a realistic distribution of civilization across the surface
The final goal is to give science fiction (sci-fi) writers a tool to generate a world that helps them spark ideas, create locations for scenes, or serve as a basis for rendering nice images for their books.
Note: This is a wiki, so no single "correct" answer.
Fractal terrain generation works wonders for creating realistic landscapes. I imagine you could scale the process up to generate landmasses on a planetary scale. This site has a detailed description of the process used for landscapes.
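To give a flavour of the technique, here is a one-dimensional midpoint-displacement sketch; the two-dimensional diamond-square variant used for terrain follows the same idea:

```python
import random

def midpoint_displacement(n_iterations=8, roughness=0.5):
    """Repeatedly insert midpoints, each offset by a shrinking random amount."""
    heights = [0.0, 0.0]
    scale = 1.0
    for _ in range(n_iterations):
        new = []
        for left, right in zip(heights, heights[1:]):
            mid = (left + right) / 2 + random.uniform(-scale, scale)
            new.extend([left, mid])
        new.append(heights[-1])
        heights = new
        scale *= roughness  # each pass adds finer, smaller detail
    return heights

profile = midpoint_displacement()
print(len(profile), min(profile), max(profile))
```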
If you want high-level descriptions of a very mature procedural planet renderer, Infinity is perhaps the most venerable. The development blog covers many of the concepts used to create some very nice procedural planets and some other very nice space phenomena.
Check out conworlding links. There is actually commercial software out there (ProFantasy comes to mind), but if you want to do something from scratch, I have a link you may be interested in:
Magical World Builder
Finally, Guy Lecky-Thompson has written some interesting books on using procedural content in game design. I have both of his books and they are very inspiring. Many algorithms are listed, including a few RNG implementations, name generators (HINT: pick a list of name parts, then how many parts each name should have, then randomise), two whole chapters on terrain and landscape generation, a dungeon chapter...
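The name-generator hint above boils down to a few lines (the syllable list is invented):

```python
import random

# Pick a list of name parts, pick how many parts each name gets, then randomise.
name_parts = ["ka", "lor", "an", "zeth", "mir", "dus", "ae", "von"]

def random_name(min_parts=2, max_parts=4):
    n = random.randint(min_parts, max_parts)
    return "".join(random.choice(name_parts) for _ in range(n)).capitalize()

print([random_name() for _ in range(5)])
```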
Oooh! Speaking of dungeons, I don't know if you have heard of Roguelikes, but I have recently been looking into them. I imagine that many of the same general principles they use for dungeons can be applied, and there are wilderness algorithms they share, besides. Try:
Temple of The Roguelike - possibly the largest Roguelike dev forum
Wilderness Generation using Voronoi Diagrams - this blog is run by a developer of Unangband, a very popular Rogue variant. Many people in the Roguelike dev community share sources.
Markov Chain - this article is about how to put together randomised names using Markov Chains. The wiki where this is hosted has quite a few algorithms of interest to anyone generating procedural content of any sort.
Roguebasin - many useful algorithms and code examples here.
Have fun!
I'm no astronomer, but you might consider some sort of decision tree for a preliminary classification of the planet:
1. Main composition (methane/rock/etc.)
2. Mass
3. Additional atmosphere (how much, what of, etc.)
4. Temperature (alternately, specify distance from star, model the star, and write an algorithm based on the above)
5. Age
6. Asteroid/meteor activity
Things like craters would be indirectly determined by 1, 3, and 6. Radius could be calculated from 1 and 2. And higher elements on the list might put boundaries on lower elements.
You still have many algorithms to research, but having an order to the information might help structure your calculations and the variables you use.
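As a rough sketch of how the numbered attributes could feed into derived features like crater density, here is one way to wire it up; all ranges and formulas are invented placeholders:

```python
import random
from dataclasses import dataclass

@dataclass
class Planet:
    composition: str        # 1. main composition
    mass: float             # 2. Earth masses
    atmosphere: float       # 3. surface pressure in bar
    temperature: float      # 4. mean surface temperature in K
    age: float              # 5. billions of years
    impact_activity: float  # 6. asteroid/meteor activity, 0..1

def generate_planet():
    composition = random.choice(["rock", "ice", "gas"])
    mass = random.uniform(0.1, 300.0)
    # Small rocky bodies keep little atmosphere; everything else gets a random one.
    atmosphere = 0.0 if composition == "rock" and mass < 0.3 else random.uniform(0.0, 100.0)
    return Planet(composition, mass, atmosphere,
                  temperature=random.uniform(50.0, 700.0),
                  age=random.uniform(0.5, 10.0),
                  impact_activity=random.uniform(0.0, 1.0))

def crater_density(p):
    # Craters follow indirectly from composition (1), atmosphere (3), and impacts (6).
    if p.composition != "rock":
        return 0.0
    return p.impact_activity * max(0.0, 1.0 - p.atmosphere / 10.0)

p = generate_planet()
print(p, crater_density(p))
```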

Methods for Geotagging or Geolabelling Text Content

What are some good algorithms for automatically labeling text with the city / region of origin? That is, if a blog is about New York, how can I tell programmatically? Are there packages / papers that claim to do this with any degree of certainty?
I have looked at some tf-idf based approaches and proper-noun intersections, but so far no spectacular successes, and I'd appreciate ideas!
The more general question is about assigning texts to topics, given some list of topics.
Simple / naive approaches preferred to full on Bayesian approaches, but I'm open.
You're looking for a named entity recognition system, or NER for short. There are several good toolkits available to help you out. LingPipe in particular has a very decent tutorial. CAGEclass seems to be oriented around NER on geographical place names, but I haven't used it yet.
Here's a nice blog entry about the difficulties of NER with geographical place names.
If you're going with Java, I'd recommend using the LingPipe NER classes. OpenNLP also has some, but the former has a better documentation.
If you're looking for some theoretical background, Chavez et al. (2005) have constructed an interesting system and documented it.
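LingPipe and OpenNLP are Java libraries; purely as an illustration of what NER output looks like, here is the equivalent step in Python with NLTK (the sentence is invented):

```python
import nltk  # assumes punkt, averaged_perceptron_tagger, maxent_ne_chunker, and words are downloaded

text = "The best bagels in New York are still found on the Lower East Side."

tokens = nltk.word_tokenize(text)
tree = nltk.ne_chunk(nltk.pos_tag(tokens))

# Keep the chunks NLTK labels as geo-political entities (GPE)
places = [" ".join(word for word, tag in subtree.leaves())
          for subtree in tree.subtrees()
          if subtree.label() == "GPE"]
print(places)  # likely ['New York'], plus whatever else the tagger picks up
```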
Latent Semantic Mapping seems like potentially a good fit. That's about as naive an algorithm as you're likely to find.
