Extract only English sentences - algorithm

I need to extract posts and tweets from Facebook and Twitter into our database for analysis. My problem is that the system can process only English sentences (phrases). So how can I remove non-English posts and tweets from my database?
If you know of any NLP algorithm that can do this, please tell me.
Thanks and regards

Avoiding automatic language identification where possible is usually preferable - for instance, https://dev.twitter.com/docs/api/1/get/search shows that returned tweets contain a field iso_language_code which might be helpful.
If that's not good enough, you'll have to either
look for existing language identification libraries in whatever language you're using; or
get your hands on a sufficient amount of English text (dumps of English Wikipedia, say, or any of the Google n-gram models) and implement something like http://www.cavar.me/damir/LID/.
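For a sense of how little machinery the simple end of this can take, here is a toy heuristic (not the LID approach from the link above): score a text by how many very common English function words it contains. The stopword list and threshold here are illustrative only.

```python
# Minimal heuristic: treat a text as English if enough of its words
# are common English function words. Thresholds are illustrative.
ENGLISH_STOPWORDS = {
    "the", "a", "an", "and", "or", "is", "are", "was", "to", "of",
    "in", "on", "for", "it", "that", "this", "with", "you", "i",
}

def looks_english(text, threshold=0.2):
    words = [w.strip(".,!?\"'").lower() for w in text.split()]
    if not words:
        return False
    hits = sum(1 for w in words if w in ENGLISH_STOPWORDS)
    return hits / len(words) >= threshold

print(looks_english("I went to the store on Sunday"))  # True
print(looks_english("Fui a la tienda el domingo"))     # False
```

This is crude, but it fails in the safe direction for a filtering task: non-English text rarely contains many English function words.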

Get an English dictionary and see if the majority of the words in your text are in it. Since you are looking at online text, be sure to include common slang and abbreviations.
This can run very quickly if you store the dictionary in a trie data structure.
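A minimal sketch of that idea, assuming a stand-in word list (a real deployment would load a full dictionary plus slang and abbreviations):

```python
# A minimal trie for fast dictionary membership tests.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

trie = Trie(["the", "lol", "hello", "tweet"])  # stand-in dictionary
tokens = ["hello", "mundo", "lol"]
in_dict = sum(trie.contains(t) for t in tokens)
print(in_dict / len(tokens) > 0.5)  # True: 2 of 3 tokens are in the dictionary
```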
I think fancy NLP is a bit overkill for this task. You don't need to identify which language it is if it's not English, so all you have to do is test your text against some simple characteristics of the English language.

I have tried using standard libraries for language detection on tweets. You will get a lot of false negatives, because there are a lot of non-standard characters in names, smileys, etc. The problem is more severe in shorter posts, where the signal-to-noise ratio is lower.
The main problem is not the algorithm but the outdated data sources. I would suggest crawling/streaming a fresh corpus from Twitter. The language flag in Twitter is based on geographical information, so it will not work in all cases (a Chinese person can still write Chinese posts in the USA). I would suggest using a whitelist of known English-speaking users and collecting their posts.

I wrote a little tweet language classifier (English or not) that was 95+% accurate, if I'm remembering right. I think it was just naive Bayes + 1000 training instances. Combine that with location information and you can do even better.
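The answerer's code isn't shown, but a toy reconstruction of that kind of setup with scikit-learn might look like the following; the two short training lists stand in for the ~1000 labelled tweets mentioned above.

```python
# Character n-grams + naive Bayes: a classic, compact language classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

english = ["good morning everyone", "loving this weather today"]
other = ["buenos dias a todos", "quel beau temps aujourd'hui"]

clf = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    MultinomialNB(),
)
clf.fit(english + other, ["en"] * len(english) + ["other"] * len(other))
print(clf.predict(["what a great day"]))  # expected: ['en']
```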

I found this project; the source code is very clear. I have tested it and it runs pretty well:
http://code.google.com/p/guess-language/

Have you tried SVD (Singular Value Decomposition) for LSI (Latent Semantic Indexing) & LSA (Latent Semantic Analysis)? See: http://alias-i.com/lingpipe/demos/tutorial/svd/read-me.html

Related

Are language translation algorithms non-deterministic in nature?

A few days ago, I got a Thai translation of the string "Reward points issues" as "คะแนนสะสม".
But when I checked it today, Google Translate gave a different Thai translation: "ประเด็นคะแนนรางวัล".
So, I am guessing the algorithm might be non-deterministic.
But there is one thing I am not able to understand: in any language, new words are added every day, not new characters or new ways to form a predefined word. So why did Google Translate give a different output?
Also, is my assumption of non-deterministic nature correct?
NOTE: I observed the same behaviour with other languages like Russian, Dutch, Chinese, and Polish.
I don't think the algorithms used by Google are non-deterministic; there is no reason for them to be.
In any case, Google translates by reference to a huge corpus of known translations. This corpus is constantly updated, and that influences day-to-day translations. It is made of complete sentences rather than isolated words.
In fact, Google Translate... learns.

Natural Language Processing for Smart Homes

I'm writing Smart Home software for my bachelor's degree that will only simulate the actual house, but I'm stuck on the NLP part of the project. The idea is to have the client listen for voice input (already done), transform it into text (done) and send it to the server, which does all the heavy lifting / decision making.
So all my inputs will be fairly short (like "please turn on the porch light"). Based on this, I want to decide which object to act on, and how. So I came up with a few things to do in order to write something somewhat efficient.
Get rid of unnecessary words (in the previous example, "please" and "the" are words that don't change the meaning of what needs to be done; but if I say "turn off my lights", "my" does carry fairly important meaning).
Deal with synonyms ("turn on lights" should do the same as "enable lights" -- I know it's a stupid example). I'm guessing the only option is to have some kind of dictionary (XML maybe) with a list of possible words for each particular object in the house (see the sketch after this list).
Detecting the verb and its object: in "turn on the lights", "turn on" is the verb and "lights" is the object. I need a good way to detect this.
General implementation. How are these things usually developed in terms of algorithms? I only managed to find one article about NLP in Smart Homes, which was very vague (and had bad English). Any links welcome.
I hope the question is unique enough (I've seen NLP questions on SO, none really helped), that it won't get closed.
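To make points 1 and 2 concrete, here is a minimal normalization sketch; the filler-word and synonym lists are illustrative stand-ins, not a complete grammar.

```python
# Strip filler words and map synonym phrases onto canonical commands.
FILLER = {"please", "the"}  # deliberately keeps "my"
CANONICAL = {"enable": "turn on", "switch on": "turn on",
             "disable": "turn off", "switch off": "turn off"}

def normalize(utterance):
    text = " ".join(w for w in utterance.lower().split()
                    if w not in FILLER)
    for phrase, canon in CANONICAL.items():
        text = text.replace(phrase, canon)
    return text

print(normalize("please enable the porch light"))  # 'turn on porch light'
```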
If you don't have a lot of time to spend with the NLP problem, you may use the Wit API (http://wit.ai) which maps natural language sentences to JSON:
It's based on machine learning, so you need to provide examples of sentences + JSON output to configure it for your needs. It should be much more robust than grammar-based approaches, especially because the speech-to-text engine might make mistakes that would break your grammar (but the machine-learning module can still recover the meaning of the sentence).
I am in no way a pioneer in NLP (I love it, though), but let me try my hand at this one. For your project I would suggest you look at the Stanford Parser.
From your problem definition I guess you don't need anything other than verbs and nouns. The parser generates POS (part-of-speech) tags that you can use to prune the words you don't require.
For the synonym problem I can't think of any better option than what you have in mind right now.
For detecting the verb and object you can again use the grammatical dependency structure from the parser, and I am pretty sure it is good enough to tackle this problem.
The general implementation is where your research part lies. I guess you can find enough patterns using the dependency structure and POS tags to come up with an algorithm for your problem. I highly doubt that any algorithm will handle every set of input sentences (structured + unstructured), but something more than 85% accurate should be good enough for you.
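As a lighter-weight stand-in for the Stanford Parser pipeline sketched above, here is roughly what the POS-based pruning could look like with NLTK (an assumption on my part, not the answerer's code):

```python
# Prune an utterance to verbs (plus particles like "off") and nouns.
# Requires nltk.download('punkt') and
# nltk.download('averaged_perceptron_tagger').
import nltk

def extract_action(utterance):
    tags = nltk.pos_tag(nltk.word_tokenize(utterance))
    verbs = [w for w, t in tags if t.startswith("VB") or t == "RP"]
    nouns = [w for w, t in tags if t.startswith("NN")]
    return verbs, nouns

print(extract_action("turn off my lights"))
# expected roughly: (['turn', 'off'], ['lights'])
```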
First, I would construct a list of all possible commands (not every possible way to say a command, just the actual function itself: "kitchen light on" and "turn on the light in the kitchen" are the same command) based on the actual functionality the smart house has available. I assume there is a discrete number of these in the order of no more than hundreds. Assign each some sort of identifier code.
Your job then becomes to map an input of:
a sentence of English text
location of speaker
time of day, day of week
any other input data
to an output of a confidence level (0.0 to 1.0) for each command.
The system will then execute the best match command if the confidence is over some tunable threshold (say over 0.70).
From here it becomes a machine learning application. There are a number of different approaches (and furthermore, approaches can be combined together by having them compete based on features of the input).
To start with I would work through the NLP book from Jurafsky/Manning from Stanford. It is a good survey of current NLP algorithms.
From there you will get some ideas about how the mapping can be machine learned. More importantly how natural language can be broken down into a mathematical structure for machine learning.
Once the text is semantically analyzed, the simplest ML algorithms to try first would be supervised ones. To generate training data, use a normal GUI: speak your command, then press the corresponding command manually. This forms a single supervised training case. Make a large number of these and set some aside for testing. It is also unskilled work, so other people can help. You can then use these as the training set for your ML algorithm.
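A toy sketch of that sentence-to-command mapping with a confidence threshold, using scikit-learn; the four training sentences stand in for the GUI-collected cases described above.

```python
# Sentences in, per-command confidence out; execute only above a threshold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = ["turn on the kitchen light", "kitchen light on",
             "turn off the kitchen light", "kitchen light off"]
commands = ["KITCHEN_LIGHT_ON", "KITCHEN_LIGHT_ON",
            "KITCHEN_LIGHT_OFF", "KITCHEN_LIGHT_OFF"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(sentences, commands)

probs = model.predict_proba(["please put the kitchen light on"])[0]
best = max(zip(model.classes_, probs), key=lambda p: p[1])
if best[1] > 0.70:  # the tunable threshold from the answer
    print("execute", best[0])
else:
    print("not confident enough:", best)
```

With a toy training set this size the confidences will hover near 0.5, so the threshold branch matters; with realistic training volume the best match separates cleanly.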

How to tackle twitter sentiment analysis?

I'd like some advice on how to tackle this problem. At college I've been solving opinion-mining tasks, but with Twitter the approach is quite different. For example, I used an ensemble-learning approach to classify users' opinions about a certain hotel in Spain. Of course, I was given a training set with positive and negative opinions and then tested on the test set. But now, with Twitter, I find this kind of categorization very difficult.
Do I need to have a training set? And if so, don't you think Twitter is so temporal that if I have that set, my performance on future topics will be very poor?
I was thinking of getting a dictionary (mainly adjectives), crossing my tweets with it, and obtaining a term-document matrix, but I have no class assigned to any tweet. Also, positive and negative adjectives can vary depending on the topic and time. So how do I deal with this?
How to deal with the problem of languages? For instance, I'd like to study tweets written in English and those in Spanish, but separately.
Which programming languages do you suggest to do something like this? I've been trying with R packages like tm, twitteR.
Sure, I think the way sentiment is used will stay constant for a few months; worst case, you relabel and retrain. Unsupervised learning has a shitty track record for industrial applications, in my experience.
You'll need some emotion/adj dictionary for sentiment stuff- there are some datasets out there but I forget where they are. I may have answered previous questions with better info.
Just do English tweets, it's fairly easy to build a language classifier, but you want to start small, so take it easy on yourself
Python (NLTK) if you want to do it easily in a small amount of code. Java has good NLP stuff, but Python and its libraries are way more user-friendly.
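As one concrete starting point with NLTK (my suggestion, not the answerer's): the VADER analyzer is a lexicon-based scorer tuned for social-media text, which matches the emotion/adjective-dictionary idea mentioned earlier.

```python
# Lexicon-based sentiment scoring with NLTK's VADER analyzer.
# Requires nltk.download('vader_lexicon').
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
for tweet in ["I love this hotel!", "worst service ever :("]:
    scores = sia.polarity_scores(tweet)
    label = ("positive" if scores["compound"] > 0.05 else
             "negative" if scores["compound"] < -0.05 else "neutral")
    print(label, scores["compound"], tweet)
```

Note that VADER is English-only, which fits the advice above to start with English tweets.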
This site: https://sites.google.com/site/miningtwitter/questions/sentiment provides 3 ways to do sentiment analysis using R.
The twitteR package is now updated to work with the new Twitter API. I'd suggest you download the source version of the package to avoid getting duplicated tweets.
I'm working on a Spanish dictionary for opinion mining and will publish it somewhere accessible.
cheers!
Sentiment analysis will give only 3 results, as said above: positive, negative and neutral. I found a tutorial on Twitter sentiment analysis and it's quite easy.
I found it here - https://www.ai-ml.tech/twitter-sentiment-analysis/
Only 3 dependencies and not much code; I downloaded it and was done. Just go through it and you will get the solution.

What's the best way to generate keywords from a given Text?

I want to generate keywords for my CMS.
Does someone know a good PHP script (or something else) that generates keywords?
I have an HTML site like this: http://pastebin.com/ZU8vdyeP
This is a very hard problem for a computer to solve. It would be much easier to get somebody (else?) to do it manually, or simply not do it at all.
If you really need a computer to do it, I'd head over to the excellent Python library NLTK, which has many tools for this sort of thing (= natural language processing), and it's a lot of fun to work with.
For example, you could compute a frequency distribution of the words, then search for the most common hypernyms of the longer words (above, say, 5 characters) that appear most frequently, and use those as a hint of what the keywords could be.
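A toy version of that frequency-plus-hypernym idea with NLTK (illustrative only; the sample sentence is made up):

```python
# Frequency-count the longer words, then look up WordNet hypernyms
# as keyword hints. Requires nltk.download('punkt') and
# nltk.download('wordnet').
import nltk
from nltk.corpus import wordnet

text = "The lighthouse keeper repaired the lighthouse lamp at dusk."
words = [w.lower() for w in nltk.word_tokenize(text)
         if w.isalpha() and len(w) > 5]
freq = nltk.FreqDist(words)

for word, count in freq.most_common(3):
    synsets = wordnet.synsets(word)
    hypernyms = synsets[0].hypernyms() if synsets else []
    print(word, count, [h.name() for h in hypernyms])
```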
Again, it is much easier to get it done by a human, however.
To automate: get the words from the article, match them against a blacklist, and don't include words under 4 characters.
Additionally, let the user edit manually, so only automate if no keywords are present.
This can be done by a trigger or in the application layer.
regards,
/t
If I understand the problem, you have text and you want to determine keywords that are most relevant to the text.
Three approaches:
1) Have user enter keywords
2) Statistical analysis of the text, for example determining the words that are far more common in the text than they are in the language overall (a sketch follows below). Any good text on information retrieval will have some algorithms.
3) If you have a set of documents that are already classified (perhaps previously classified by humans) then you can use a machine learning algorithm (perhaps a Bayesian classifier) to train the system to classify the new documents. If you let the users override/correct the suggested keywords, the system can learn over time.
Personally, I'd do #3, since it is more adaptive.
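For approach 2, here is a rough sketch of ranking words by how much more frequent they are in the document than in a background corpus; the Brown corpus is used here only as a convenient baseline.

```python
# Rank document words by their frequency "lift" over a background corpus.
# Requires nltk.download('brown') and nltk.download('punkt').
import nltk
from nltk.corpus import brown

background = nltk.FreqDist(w.lower() for w in brown.words())

def keywords(text, top_n=5):
    doc = nltk.FreqDist(w.lower() for w in nltk.word_tokenize(text)
                        if w.isalpha())
    def lift(word):
        p_doc = doc[word] / doc.N()
        p_bg = (background[word] + 1) / background.N()  # +1 smoothing
        return p_doc / p_bg
    return sorted(doc, key=lift, reverse=True)[:top_n]

print(keywords("The CMS stores articles. Each CMS article has keywords."))
```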

How do I data mine text?

Here's the problem. I have a bunch of large text files with paragraphs and paragraphs of written matter. Each para contains references to a few people (names), and documents a few topics (places, objects).
How do I data mine this pile to assemble some categorised library? In general, two things:
I don't know what I'm looking for, so I need a program to find the most-used words and multi-word phrases ("Jacob Smith" or "bluewater inn" or "arrow").
Then knowing the keywords, I need a program to help me search for related paras, then sort and refine results (manually by hand).
Your question is a tiny bit open-ended :)
Chances are, you will find modules for whatever analysis you want to do in the UIMA framework:
Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.
UIMA is made up of many parts:
UIMA enables applications to be decomposed into components, for example "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity detection (person/place names etc.)". Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages.
You may also find Open Calais a useful API for text analysis; depending on how big your heap of documents is, it may be more or less appropriate.
If you want it quick and dirty: create an inverted index that stores all locations of words (basically a big map from words to all file IDs in which they occur, paragraphs in those files, lines in the paragraphs, etc.). Also index tuples so that given a file ID and paragraph you can look up all the neighbors. This will do what you describe, but it takes quite a bit of tweaking to get it to pull up meaningful correlations (some keywords to start you off on your search: information retrieval, TF-IDF, Pearson correlation coefficient).
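A quick-and-dirty version of that index in Python (the document dict below is a made-up stand-in):

```python
# Inverted index: word -> list of (file_id, paragraph_no, line_no).
from collections import defaultdict

def build_index(files):
    """files: dict of file_id -> text with blank-line-separated paragraphs."""
    index = defaultdict(list)
    for file_id, text in files.items():
        for p, para in enumerate(text.split("\n\n")):
            for l, line in enumerate(para.splitlines()):
                for word in line.lower().split():
                    index[word.strip(".,!?\"'")].append((file_id, p, l))
    return index

docs = {"doc1": "Jacob Smith stayed at the bluewater inn.\n\nArrow was found."}
index = build_index(docs)
print(index["bluewater"])  # [('doc1', 0, 0)]
```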
Looks like you're trying to create an index?
I think Learning Perl has information on finding the frequency of words in a text file, so that's not a particularly hard problem.
But do you really want to know that "the" or "a" is the most common word?
If you're looking for some kind of topical index, the words you actually care about are probably down the list a bit, intermixed with more words you don't care about.
You could start by getting rid of "stop words" at the front of the list to filter your results a bit, but nothing would beat associating keywords that actually reflect the topic of the paragraphs, and that requires context.
Anyway, I could be off base, but there you go. ;)
The problem with what you ask is that you don't know what you're looking for. If you had some sort of weighted list of terms that you cared about, then you'd be in good shape.
Semantically, the problem is twofold:
Generally the most-used words are the least relevant. Even if you use a stop-words file, a lot of chaff remains
Generally, the least-used words are the most relevant. For example, "bluewater inn" is probably infrequent.
Let's suppose that you had something that did what you ask, and produced a clean list of all the keywords that appear in your texts. There would be thousands of such keywords. Finding "bluewater inn" in a list of 1000s of terms is actually harder than finding it in the paragraph (assuming you don't know what you're looking for) because you can skim the texts and you'll find the paragraph that contains "bluewater inn" because of its context, but you can't find it in a list because the list has no context.
Why don't you tell us more about your application and process, and then perhaps we can help you better?
I think what you want to do is called "entity extraction". This Wikipedia article has a good overview and a list of apps, including open source ones. I used to work on one of the commercial tools in the list, but not in a programming capacity, so I can't help you there.
Ned Batchelder gave a great talk at DevDays Boston about Python.
He presented a spell-corrector written in Python that does pretty much exactly what you want.
You can find the slides and source code here:
http://nedbatchelder.com/text/devdays.html
I recommend that you have a look at R. In particular, look at the tm package. Here are some relevant links:
Paper about the package in the Journal of Statistical Software: http://www.jstatsoft.org/v25/i05/paper. The paper includes a nice example analysis of postings to the R-devel mailing list (https://stat.ethz.ch/pipermail/r-devel/) from 2006.
Package homepage: http://cran.r-project.org/web/packages/tm/index.html
Look at the introductory vignette: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
More generally, there are a large number of text mining packages on the Natural Language Processing view on CRAN.
