Sentiment Analysis of given text - algorithm

This topic has many threads already, but I am posting another one anyway. Each of those posts suggests a possible way to do sentiment analysis, but I have not found one I can actually follow.
I want to implement sentiment analysis myself, so I would ask you to show me a way. During my research I found that a Bayesian algorithm is commonly used: count positive and negative words and compute the probability of a sentence being positive or negative using a bag-of-words model.
That only covers individual words, though; I guess we have to do some language processing too. So, is there anyone with more knowledge of this? If so, can you guide me to some algorithms, with links for reference, so that I can implement them? Anything in particular that may help me in my analysis is welcome.
Also, can you recommend a language to work with? Some say Java is comparatively slow, so they don't recommend working with it.
Any type of help is much appreciated.

First of all, sentiment analysis is done at various levels, such as the document, sentence, phrase, and feature level. Which one are you working on? There are many different approaches to each of them. You can find a very good intro to this topic here. For machine-learning approaches, the most important element is feature engineering, and it's not limited to bag of words. You can find many other useful features used in different applications in the tutorial I linked. What language processing you need to do depends on what features you want to use; for example, you may need POS tagging if POS information is needed for your features.
For classifiers, you can try Support Vector Machines, Maximum Entropy, and Naive Bayes (probably as a baseline); these are frequently used in the literature, and the link above also has a pretty comprehensive list of them. The Mallet toolkit contains ME and NB, and if you use SVMlight, you can easily convert the feature format to the Mallet format with a function. Of course there are many other implementations of these classifiers.
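To make the baseline concrete, here is a minimal bag-of-words Naive Bayes sketch. I'm using scikit-learn here purely for illustration (the answer above talks about Mallet and SVMlight, which would work analogously), and the tiny training set and labels are invented:

    # Minimal bag-of-words Naive Bayes sentiment baseline (illustrative only).
    # The training sentences and labels below are made up for this sketch.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "I love this movie, it was great",
        "What a fantastic, enjoyable film",
        "Terrible plot and awful acting",
        "I hated every minute of it",
    ]
    train_labels = ["pos", "pos", "neg", "neg"]

    # CountVectorizer turns each sentence into a bag-of-words feature vector;
    # MultinomialNB is the Naive Bayes classifier used as the baseline.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)

    print(model.predict(["what a great film"]))      # expected: ['pos']
    print(model.predict(["the acting was awful"]))   # expected: ['neg']

With a real corpus you would swap in richer features (n-grams, POS-based features, etc.) at the vectorization step and keep the classifier interface the same.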
For rule-based methods, Pointwise Mutual Information is frequently used, along with various kinds of scoring-based methods.
Hope this helps.

For text analysis there is no language stronger than SNOBOL. In SNOBOL-4, for example, a Fortran interpreter takes only 60 lines.

NLTK offers really good algorithms for sentiment analysis. It is open source, so you can look at the source code and check out the algorithms used. You can even download the NLTK book, which is free and has some good material on sentiment analysis.
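For instance, a very small sketch of training NLTK's Naive Bayes classifier on word-presence features (the toy sentences below are made up for the example):

    # Toy NLTK Naive Bayes example; the training sentences are invented.
    import nltk

    def word_features(sentence):
        # Simple word-presence ("bag of words") features.
        return {word.lower(): True for word in sentence.split()}

    train = [
        (word_features("I love this phone"), "pos"),
        (word_features("what a wonderful day"), "pos"),
        (word_features("this is a horrible product"), "neg"),
        (word_features("I am very disappointed"), "neg"),
    ]

    classifier = nltk.NaiveBayesClassifier.train(train)
    print(classifier.classify(word_features("what a horrible day")))  # likely 'neg'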
Coming to your second point, I don't think Java is that slow. I have been coding in C++ for years myself but lately also started with Java; if you look around, a lot of very popular open-source software such as Lucene, Solr, Hadoop, and Neo4j is written in Java.

Related

Learning efficient algorithms

Up until now I've mostly concentrated on how to properly design code and make it as readable and as maintainable as possible. So I always chose to learn about the higher-level details of programming, such as class interactions, API design, etc.
Algorithms I never really found particularly interesting. As a result, even though I can come up with a good design for my programs, and even if I can come up with a solution to a given problem it rarely is the most efficient.
Is there a particular way of thinking about problems that helps you come up with as efficient a solution as possible, or is it simply a matter of practice and/or memorization?
Also, what online resources can you recommend that teach you various efficient algorithms for different problems?
Data dominates. If you design your program around the right abstract data structures (ADTs), you often get a clean design, the algorithms follow quite naturally and when performance is lacking, you should be able to "plug in" more efficient ones.
A strong background in maths and logic helps here, as it allows you to visualize your program at a high level as the interaction between functions, sets, graphs, sequences, etc. You then decide whether the sets need to be ordered (balanced BST, O(lg n) operations) or not (hash tables, O(1) operations), what operations need to be supported on sequences (vector-like or list-like), etc.
If you want to learn some algorithms, get a good book such as Cormen et al. and try to implement the main data structures (a small sketch of one of them follows the list below):
binary search trees
generic binary search trees (that work on more than just int or strings)
hash tables
priority queues/heaps
dynamic arrays
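For example, here is a bare-bones binary search tree in Python with insert and lookup only; a minimal sketch to get started with, not a balanced or production-ready implementation:

    # Minimal binary search tree sketch: insert and search only,
    # no balancing and no deletion.
    class Node:
        def __init__(self, key):
            self.key = key
            self.left = None
            self.right = None

    class BST:
        def __init__(self):
            self.root = None

        def insert(self, key):
            if self.root is None:
                self.root = Node(key)
                return
            cur = self.root
            while True:
                if key < cur.key:
                    if cur.left is None:
                        cur.left = Node(key)
                        return
                    cur = cur.left
                else:
                    if cur.right is None:
                        cur.right = Node(key)
                        return
                    cur = cur.right

        def contains(self, key):
            cur = self.root
            while cur is not None:
                if key == cur.key:
                    return True
                cur = cur.left if key < cur.key else cur.right
            return False

    tree = BST()
    for k in [5, 2, 8, 1, 3]:
        tree.insert(k)
    print(tree.contains(3), tree.contains(7))  # True False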
Introduction To Algorithms is a great book to get you thinking about efficiency of different algorithms/data structures.
The authors of the book also teach an algorithms course at MIT. You can find most of the lectures here.
I would say that in coming up with good algorithms (which is actually part of good design IMHO), you have to develop a way of thinking. This is best done by studying algorithm design. By study I don't mean just knowing all the common algorithms covered in a textbook, but actually understanding how and why they work, and being able to apply the basic idea contained in them to actual problems you are trying to solve.
I would suggest reading a good book on algorithms (my favourite is CLRS). For an online resource I would recommend the series of articles in the TopCoder Algorithm Tutorials.
I do not understand why you would mention practice and memorization in the same breath. Memorization won't help you at all (you probably already know this), but practice is essential. If you cannot apply what you learned, it's not really learning. You can practice at various online programming contest/puzzle sites like SPOJ, Project Euler and PythonChallenge.
Recommendations:
First of all, I recommend the book Introduction to Algorithms, Second Edition, by Cormen et al.; it's a great book that has most (if not all) of the algorithms you will need. (Some of the more important topics are sorting algorithms, shortest paths, dynamic programming, and many data structures like BSTs, hash maps, and heaps.)
Another great way to learn algorithms is http://ace.delos.com/usacogate; it is great practice once you are past the beginning.
As for your question: you will simply get used to writing good, fast-running code; after a little practice you just won't want to write inefficient code anymore.
While I think #larsmans is correct inasmuch as understanding logic and maths is a fast way to understanding how to choose useful ADTs for solving a given problem, studying existing solutions may be more instructive for those of us who struggle with those topics. In particular, review the code of established open-source software (OSS) that solves a problem similar to the one you're interested in.
I find that a particularly good way to go about this kind of study is to review the unit tests of such a project. Apache Lucene, for example, has a source control repository containing numerous examples. While it doesn't reveal the underlying algorithms, it helps you trace to the particular functionality that solves a certain problem. This leads to an opportunity to study its innards - i.e. an interesting algorithm. In Lucene's case, inverted indices come to mind.
While this does not guarantee the algorithm you discover is the best, it's likely one that has received a lot of scrutiny and probably comes from a project with an active mailing list that may answer your questions. So it's a good resource for finding a solution that is probably better than what most of us would come up with on our own.

state-of-the-art of classification algorithms

We know there are something like a thousand classifiers; recently I was told that some people say AdaBoost is the "off-the-shelf" one.
Are there better algorithms (with that voting idea)?
What is the state of the art in classifiers? Do you have an example?
First, AdaBoost is a meta-algorithm which is used in conjunction with (on top of) your favorite classifier. Second, classifiers which work well in one problem domain often don't work well in another. See the No Free Lunch Wikipedia page. So, there is not going to be AN answer to your question. Still, it might be interesting to know what people are using in practice.
Weka and Mahout aren't algorithms... they're machine learning libraries. They include implementations of a wide range of algorithms. So, your best bet is to pick a library and try a few different algorithms to see which one works best for your particular problem (where "works best" is going to be a function of training cost, classification cost, and classification accuracy).
If it were me, I'd start with naive Bayes, k-nearest neighbors, and support vector machines. They represent well-established, well-understood methods with very different tradeoffs. Naive Bayes is cheap, but not especially accurate. K-NN is cheap during training but (can be) expensive during classification, and while it's usually very accurate it can be susceptible to overtraining. SVMs are expensive to train and have lots of meta-parameters to tweak, but they are cheap to apply and generally at least as accurate as k-NN.
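To make that concrete, here is a minimal sketch trying those three classifiers side by side. It uses scikit-learn and one of its bundled toy datasets purely for illustration; the dataset and parameter choices are not a benchmark:

    # Compare Naive Bayes, k-NN and an SVM on a small toy dataset.
    # Dataset and parameters are illustrative, not a benchmark.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)

    classifiers = {
        "Naive Bayes": GaussianNB(),
        "k-NN (k=5)": KNeighborsClassifier(n_neighbors=5),
        "SVM (RBF)": SVC(kernel="rbf", C=1.0, gamma="scale"),
    }

    # Five-fold cross-validation gives a rough accuracy estimate for each method.
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X, y, cv=5)
        print(f"{name}: mean accuracy {scores.mean():.3f}")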
If you tell us more about the problem you're trying to solve, we may be able to give more focused advice. But if you're just looking for the One True Algorithm, there isn't one -- the No Free Lunch theorem guarantees that.
Apache Mahout (open source, Java) seems to be picking up a lot of steam.
Weka is a very popular and stable machine learning library. It has been around for quite a while and is written in Java.
Hastie et al. (2013, The Elements of Statistical Learning) conclude that the Gradient Boosting Machine is the best "off-the-shelf" method, independent of the problem you have.
Definition (see page 352):
An “off-the-shelf” method is one that can be directly applied to the data without requiring a great deal of time-consuming data preprocessing or careful tuning of the learning procedure.
And a somewhat older assessment in the same spirit:
In fact, Breiman (NIPS Workshop, 1996) referred to AdaBoost with trees as the “best off-the-shelf classifier in the world” (see also Breiman (1998)).
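If you want to try gradient boosting for yourself, a minimal sketch with scikit-learn's implementation might look like the following; the synthetic dataset and parameters are chosen only for illustration:

    # Minimal gradient boosting sketch; dataset and settings are illustrative.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    # Synthetic binary classification data, just to have something to fit.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
    gbm.fit(X_train, y_train)
    print("test accuracy:", gbm.score(X_test, y_test))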

Computational Linguistics project idea using Hadoop MapReduce

I need to do a project for a Computational Linguistics course. Is there any interesting "linguistic" problem that is data-intensive enough to be worked on using Hadoop MapReduce? The solution or algorithm should try to analyse and provide some insight into the "linguistic" domain, but it should also be applicable to large datasets so that I can use Hadoop for it. I know there is a Python natural language processing toolkit that works with Hadoop.
If you have large corpora in some "unusual" languages (in the sense of "ones for which limited amounts of computational linguistics have been performed"), repeating some existing computational linguistics work already performed for very popular languages (such as English, Chinese, Arabic, ...) is a perfectly appropriate project. This is especially true in an academic setting, but it might be quite suitable for industry, too -- back when I was in computational linguistics with IBM Research, I got interesting mileage from putting together a corpus for Italian and repeating (in the relatively new IBM Scientific Center in Rome) very similar work to what the IBM Research team in Yorktown Heights (of which I had been a part) had already done for English.
The hard work is usually finding / preparing such corpora (it was definitely the greatest part of my work back then, despite wholehearted help from IBM Italy to put me in touch with publishing firms who owned relevant data).
So, the question looms large, and only you can answer it: what corpora do you have access to, or can you procure access to (and clean up, etc.), especially in "unusual" languages? If all you can do is, e.g., English, using already popular corpora, the chances of doing work that's novel and interesting are of course slimmer, though there may of course be some.
BTW, I assume you're thinking strictly about processing "written" text, right? If you had a corpus of spoken material (ideally with good transcripts), the opportunities would be endless (there has been much less work on processing spoken text, e.g. to parameterize pronunciation variants by different native speakers on the same written text -- indeed, such issues are often not even mentioned in undergrad CL courses!).
One computation-intensive problem in CL is inferring semantics from large corpora. The basic idea is to take a big collection of text and infer the semantic relationships between words (synonyms, antonyms, hyponyms, hypernyms, etc) from their distributions, i.e. what words they occur with or close to.
This involves a lot of data pre-processing and then can involve many nearest neighbor searches and N x N comparisons, which are well-suited for MapReduce-style parallelization.
Have a look at this tutorial:
http://wordspace.collocations.de/doku.php/course:acl2010:start
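To give a flavour of the kind of job involved, here is a toy map/reduce-style sketch that counts word co-occurrences within a small window, written as plain Python functions. In practice you would plug the same mapper/reducer logic into Hadoop Streaming, mrjob, or Dumbo; the window size and tiny corpus are invented for illustration:

    # Toy co-occurrence counting in map/reduce style; illustrative only.
    from collections import defaultdict

    WINDOW = 2  # how many words to the right count as "co-occurring"

    def mapper(line):
        """Emit ((word, neighbour), 1) pairs for words within the window."""
        tokens = line.lower().split()
        for i, word in enumerate(tokens):
            for neighbour in tokens[i + 1:i + 1 + WINDOW]:
                yield (word, neighbour), 1

    def reducer(pairs):
        """Sum the counts for each (word, neighbour) key."""
        counts = defaultdict(int)
        for key, value in pairs:
            counts[key] += value
        return counts

    corpus = [
        "the cat sat on the mat",
        "the dog sat on the log",
    ]

    all_pairs = (pair for line in corpus for pair in mapper(line))
    for (w1, w2), count in sorted(reducer(all_pairs).items()):
        print(w1, w2, count)

The resulting co-occurrence counts are the raw material from which distributional similarity between words is then computed.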
Download 300M words from 60K OA papers published by BioMed Central. Try to discover propositional attitudes and related sentiment constructions. Point being that the biomed literature is chock full of hedging and related constructions, because of the difficulty of making flat declarative statements about the living world and its creatures - their form and function and genetics and biochemistry.
My feelings about Hadoop is that it's a tool to consider, but to consider after you have done the important tasks of setting goals. Your goals, strategies, and data should dictate how you proceed computationally. Beware the hammer in search of a nail approach to research.
This is part of what my lab is hard at work on.
Bob Futrelle
BioNLP.org
Northeastern University
As you mention, there is a Python toolkit called NLTK which can be used with Dumbo to make use of Hadoop.
PyCon 2010 had a good talk on just this subject. You can access the slides from the talk using the link below.
The Python and the Elephant: Large Scale Natural Language Processing with NLTK and Dumbo

Where can I find sample algorithms for analyzing historical stock prices?

Can anyone point me in the right direction?
Basically, I'm trying to analyze stock prices and see if I can spot any patterns. I'm using PHP and MySQL to do this. Where can I find sample algorithms like the ones used in MetaStock or thinkorswim? I know they are closed source, but are there any tutorials available for beginners?
Thank you,
P.S. I don't even know what to search for in google :(
A basic, educational algorithm to start with is a dual-crossover moving average. Simply chart fast (say, 5-day) and slow (say, 10-day) moving averages of a stock's closing price, and you have a weak predictor of when to buy long (fast line goes above slow) and sell short (slow line goes above the fast). After getting this working, you could implement exponential smoothing (see previously linked wiki article).
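A minimal sketch of that dual-crossover idea, written in Python for brevity (the closing prices are made up; the same logic translates directly to PHP reading from your MySQL price table):

    # Dual-crossover moving average sketch; the prices below are invented.
    def moving_average(prices, window):
        """Simple moving average; returns None until enough data points exist."""
        return [
            sum(prices[i - window + 1:i + 1]) / window if i >= window - 1 else None
            for i in range(len(prices))
        ]

    closes = [10, 11, 12, 11, 13, 14, 13, 12, 11, 10, 9, 10, 12, 13, 15]
    fast = moving_average(closes, 5)    # fast (5-day) average
    slow = moving_average(closes, 10)   # slow (10-day) average

    for i in range(1, len(closes)):
        if None in (fast[i - 1], slow[i - 1]):
            continue  # not enough history yet
        if fast[i - 1] <= slow[i - 1] and fast[i] > slow[i]:
            print(f"day {i}: fast crossed above slow -> buy signal")
        elif fast[i - 1] >= slow[i - 1] and fast[i] < slow[i]:
            print(f"day {i}: fast crossed below slow -> sell signal")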
That would be a decent start. Take a look at other technical analysis techniques, but do keep in mind that this is quite a perilous method of trading.
Update: As for actually implementing this? You're a PHP programmer, so here is a charting library for PHP. This is the one I used a few years ago for this very project, and it worked out swimmingly. Maybe someone else can recommend a better one. If you need a free source of data, take a look at Yahoo! Finance's historical data. They dispense CSV files containing daily opening prices, closing prices, trading volume, etc. of virtually every indexed corporation.
Check out algorithms at investopedia and FM Labs has formulas for a lot of technical analysis indicators.
First you will need a solid math background: statistics in general, correlation analysis, linear algebra... If you really want to push it, check out dimensional transposition. Then you will need a solid basis in data mining. Association mining can be useful if you want to link strictly numerical data with news headlines and other events.
One thing is for sure: you will most likely not find pre-digested algorithms out there that will make you rich...
I know someone who is trying just that... He is somewhat successful (meaning he is not losing money and is making a bit) and writes his own algorithms... I should mention he has a doctorate in actuarial science.
Here are a few more links... hope they help out a bit
http://mathworld.wolfram.com/ActuarialScience.html
http://www.actuary.com/actuarial-science/
http://www.actuary.ca/
Best of luck to you
Save yourself time and use programs like NinjaTrader and Wealth-Lab. Both of them are great technical analysis platforms and accept C# as a programming language for defining your trading rules. Every possible technical indicator you can imagine is already included and if you need something more advanced you can always write your own indicator. You would also need a lot of data in order for your analysis to be statistically significant. For US stocks and ETFs, visit www.Kibot.com. We have good experience using their data.
Here's a pattern for ya
http://ddshankar.files.wordpress.com/2008/02/image001.jpg
I'd start with a good introduction to time series analysis and go from there. If you're interested in finding patterns, then the interesting term is "1D pattern matching". But for that you need good features, so google for "feature extraction in time series". Remember GIGO (garbage in, garbage out): make sure you have error-free stock price data for a sufficiently long time period before you start.
May I suggest that you do a little reading with respect to the Kalman filter? Wikipedia is a pretty good place to start:
http://en.wikipedia.org/wiki/Kalman_filter/
This should give you a little background on the problem of estimating and predicting the variables of some system (the stock market in this case).
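To get a feel for the mechanics, here is a minimal one-dimensional Kalman filter sketch; it uses a constant-value model with made-up noise parameters and is not specific to stock data:

    # Minimal 1-D Kalman filter (constant-value model); noise values are invented.
    def kalman_1d(measurements, process_var=1e-4, measurement_var=0.25):
        estimate, error = 0.0, 1.0  # initial state estimate and its variance
        estimates = []
        for z in measurements:
            # Predict step: the state stays the same, uncertainty grows.
            error += process_var
            # Update step: blend prediction and measurement via the Kalman gain.
            gain = error / (error + measurement_var)
            estimate += gain * (z - estimate)
            error *= (1 - gain)
            estimates.append(estimate)
        return estimates

    noisy = [0.9, 1.1, 1.05, 0.95, 1.2, 1.0, 0.98]
    print(kalman_1d(noisy))  # smoothed estimates converging toward ~1.0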
But the stock market is not very well behaved, so you may want to familiarize yourself with non-linear extensions to the KF. Yes, the Wikipedia entry has sections on the extended KF and the unscented KF, but here is an introduction that is just a little more in-depth:
http://cslu.cse.ogi.edu/nsel/ukf/
I suppose if anyone had ever tried this before then it would have been all over the news and very well known. So you may very well be on to something.
Use TradeStation
It is a platform that lets you write software to analyze historical stock data. You can even write programs that would trade the stock, and you can back-test your program on historical data or run it in real time throughout the day.

Minimum CompSci Knowledge Needed for Writing Desktop Apps

Having been a hobbyist programmer for 3 years (mainly Python and C) and never having written an application longer than 500 lines of code, I find myself faced with two choices:
(1) Learn the essentials of data structures and algorithm design so I can become a l33t computer scientist.
(2) Learn Qt, which would help me build projects I have been itching to build for a long time.
For learning (1), everyone seems to recommend reading CLRS. Unfortunately, reading CLRS would take me at least a year of study (or more; I'm not Peter Krumins). I also understand that to accomplish any moderately complex task using (2), I will need to understand at least the fundamentals of (1), which brings me to my question: assuming I use C++ as the programming language of choice, which parts of CLRS would give me sufficient knowledge of algorithms and data structures to work on large projects using (2)?
In other words, I need a list of theoretical CompSci topics absolutely essential for everyday application programming tasks. Also, I want to use CLRS as a handy reference, so I don't want to skip any material critical to understanding the later sections of the book.
Don't get me wrong here. Discrete math and the theoretical underpinnings of CompSci have been on my "TODO: URGENT" list for about 6 months now, but I just don't have enough time owing to college work. After a long time, I have 15 days off to do whatever the hell I like, and I want to spend these 15 days building applications I really want to build rather than sitting at my desk, pen and paper in hand, trying to write down the solution to a textbook problem.
(BTW, a less-math-more-code resource on algorithms will be highly appreciated. I'm just out of high school and my math is not at the level it should be.)
Thanks :)
This could be considered heresy, but the vast majority of application code does not require much understanding of algorithms and data structures. Most languages provide libraries which contain collection classes, searching and sorting algorithms, etc. You generally don't need to understand the theory behind how these work, just use them!
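For instance, everyday needs like sorting, binary search, and keyed lookup are usually one call into the standard library (shown here in Python, but most languages have close equivalents):

    # Everyday use of library data structures and algorithms; no theory required.
    import bisect
    from collections import Counter

    scores = [42, 7, 19, 88, 3]
    scores.sort()                           # library sort instead of writing your own
    pos = bisect.bisect_left(scores, 19)    # binary search from the standard library
    word_counts = Counter("the cat sat on the mat".split())  # hash-based map

    print(scores, pos, word_counts["the"])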
However, if you've never written anything longer than 500 lines, then there are a lot of things you DO need to learn, such as how to write your application's code so that it's flexible, maintainable, etc.
For a less-math, more code resource on algorithms than CLRS, check out Algorithms in a Nutshell. If you're going to be writing desktop applications, I don't consider CLRS to be required reading. If you're using C++ I think Sedgewick is a more appropriate choice.
Try some online comp sci courses. Berkeley has some, as does MIT. Software engineering radio is a great podcast also.
See these questions as well:
What are some good computer science resources for a blind programmer?
https://stackoverflow.com/questions/360542/plumber-programmers-vs-computer-scientists#360554
Heed the wisdom of Don and just do it. Can you define the features that you want your application to have? Can you break those features down into smaller tasks? Can you organize the code produced by those tasks into a coherent structure?
Of course you can. Identify any 'risky' areas (areas that you do not understand, e.g. something that requires more math than you know, or special algorithms you would have to research) and either find another solution, prototype a solution, or come back to SO and ask specific questions.
Moving from 500 LOC to a real (even if small) application is not that easy.
As Don pointed out, you'll need to learn a lot of things about code (flexibility, reuse, etc.), and you'll need to learn some basics of configuration management as well (Visual SourceSafe, SVN?).
But the main issue is that you need a way to avoid being overwhelmed by the combination of your functionality and code. That is not easy. What I can suggest is to put something in place to 'automatically' test your code (even in a very basic way) via some regression tests. Otherwise it's going to be hard.
As you can see, I think it's not related at all to data structures, algorithms, or whatever.
Good luck and let us know
I must say that sitting down with a dry old textbook and reading it through is not the way to learn how to do anything effectively, even if you are making notes. Doing it is the best way to learn, using the textbooks as a reference. Indeed, using sites like this as a reference.
As for data structures - learn which one is good for whatever situation you envision: Sets (sorted and unsorted), Lists (ArrayList, LinkedList), Maps (HashMap, TreeMap). Complexity of doing basic operations - adding, removing, searching, sorting, etc. That will help you to select an appropriate library data structure to use in your application.
And also make sure you're reasonably comfortable with MVC - i.e., ensure your model is separate from your view (the Qt front-end) as far as possible. Best would be to have the model and algorithms working on their own, and then put the GUI on top. Or a unit test on top. Etc...
Good luck!
It's like saying you want to move to France: should you learn French from a book, and what are the essential words - or should you just go to France and find out which words you need to know from experience and from copying the locals?
Writing code is part of learning computer science. I was writing code long before I'd even heard of the term, and lots of people were writing code before the term was invented.
Besides, you say you're itching to write certain applications. That can't be taught, so just go ahead and do it. Some things you only learn by doing.
(The theoretical foundations will just give you a deeper understanding of what you wind up doing anyway, which will mainly be copying other people's approaches. The only caveat is that in some cases the theoretical stuff will tell you what's futile to attempt - e.g. if one of your itches is to solve an NP-complete problem, you probably won't succeed. :-)
I would say the practical aspects of coding are more important. In particular, source control is vital if you don't use that already. I like bzr as an easy to set up and use system, though GUI support isn't as mature as it could be.
I'd then move on to one or both of the classics about the craft of coding, namely
The Pragmatic Programmer
Code Complete 2
You could also check out the list of recommended books on Stack Overflow.
