I want to make a program that can classify news into real or fake news. In this project I've used Naive Bayes Algorithm to solve it. I have categorized fake news with sentimental analysis and I already know how to classify it.
But now I want to add constraint in this project, which is I want to have punctuation amount to decide if its a real or fake news but I am confused how to mix it. In the like-hood or what?
Thank you. Before it I have already used n-gram and laplace smoothing in the sentimental analysis of the news.
Related
I recently finished a machine learning course and would like to make a forum sentiment analysis tool, to apply it in stock-related forums.
The idea is to:
Capture (text mining) users with their comments, and evaluate their comment's sentiment (positive, negative, neutral).
Capture what happens (stock market) after those comments, and assign a weight to the user accordingly (bigger weight if the user's sentiments is spot-on and the market follows the same direction)
Use the comments as a tool to predict market direction.
Actually, I do this myself (pay attention on forums) plus my own technical analysis and the obligatory due diligence, and it has been working very well for me. I just wanted to try to automate it a little bit and maybe even allow a program to play with some of my accounts (paper trading first, and if it performs decently assign some money in a real account)
This would be my first machine learning project (just as a proof-of-concept) so any comments would be very kindly appreciated.
The biggest problem that I find is that I would like to make an unsupervised training, and I need a sample dataset to do the training.
Question: Is there any known forum-sentiment dataset available to be used for unsupervised training?
I've found several sentiment datasets (twitter, imbd, amazon reviews) but they are very specific to their niche (short messages, movies, products...) but I'm looking for something more general.
Since you are looking for an unsupervised approach you can use any set of data that matches your "real case scenario". Text mining and sentiment analysis are are often tailored to the problem at hand so it is easy to start directly with the real data. The best approach is to built a scraper that grabs directly the forum posts that you want to analyze. You can build the scraper easily enough with Python (beautifulsoup/selenium). Online is full of nice tutorial eg: https://www.dataquest.io/blog/web-scraping-tutorial-python/
We are looking at building recommender system for our brand-new Learning Management System. There are a bunch of users and items (learning modules) onboarded, but no ratings yet - typical cold start problem.
To begin with, we are thinking of using a simple item-based similarity using item attributes (tags, category, etc.) The idea is to switch to more robust collaborative filtering as the ratings start coming in.
Questions:
Is this a good approach? Is there a recommended ML pattern to handle such cold-start conditions?
To realise item-based similarity, which is the right algorithm? Say, cosine similarity. However, please note there is no "matrix". Should we try to use a standard ML algorithm or maybe roll our own?
Your approach is good. I would start with an unsupervised learning algorithm such as 'k-Nearest Neighbors classifier'. If your team doesn't know the first thing about ML, I recommend you to read this tutorial http://www.astroml.org/sklearn_tutorial/general_concepts.html . It uses python and a great library called scikit-learn. From there you could do Andrew's NG course (https://www.coursera.org/learn/machine-learning/) although it does not cover any recommendation systems.
I usually go with a Pearson Correlation algorithm (https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient) and that suffices me for my problems. The problem with this approach is that it is linear. I have read that the Orange data mining tool provides many correlation measures. Using it you could find which one is best for your data. I would advice against using your own algorithm.
There is an older question which provides further information on the matter: How can I implement a recommendation engine?
I tried naive bayes classifier and it's working very bad. SVM works a little better but still horrible. Most of the papers which i read about SVM and naive bayes with some variations(n-gram, POS etc) but all of them gives results close to 50% (authors of articles talk about 80% and high but i cannt to get same accurate on real data).
Is there any more powerfull methods except lexixal analys? SVM and Bayes suppose that words independet. These approach called "bag of words". What if we suppose that words are associated?
For example: Use apriory algorithm to detect that if sentences contains "bad and horrible" then 70% probality that sentence is negative. Also we can use distance between words and so on.
Is it good idea or i'm inventing bicycle?
You're confusing a couple of concepts here. Neither Naive Bayes nor SVMs are tied to the bag of words approach. Neither SVMs nor the BOW approach have an independence assumption between terms.
Here's some things you can try:
include punctuation marks in your bags of words; esp. ! and ? can be helpful for sentiment analysis, while many feature extractors geared toward document classification throw them away
same for stop words: words like "I" and "my" may be indicative of subjective text
build a two-stage classifier; first determine whether any opinion is expressed, then whether it's positive or negative
try a quadratic kernel SVM instead of a linear one to capture interactions between features.
Algorithms like SVM, Naive Bayes and maximum entropy ones are supervised machine learning algorithms and the output of your program depends on the training set you have provided.
For large scale sentiment analysis I prefer using unsupervised learning method in which one can determine the sentiments of the adjectives by clustering documents into same-oriented parts, and label the clusters positive or negative. More information can be found out from this paper.
http://icwsm.org/papers/3--Godbole-Srinivasaiah-Skiena.pdf
Hope this helps you in your work :)
You can find some useful material on Sentimnetal analysis using python.
This presentation summarizes Sentiment Analysis as 3 simple steps
Labeling data
Preprocessing &
Model Learning
Sentiment Analysis is an area of ongoing research. And there is a lot of research going on right now. For an overview of the most recent, most successful approaches, I would generally advice you to have a look at the shared tasks of SemEval. Usually, every year they run a competition on Sentiment Analysis in Twitter. You can find the paper describing the task, and the results for 2016 here (might be a bit technical though): http://alt.qcri.org/semeval2016/task4/data/uploads/semeval2016_task4_report.pdf
Starting from there, you can have a look in the papers describing the individual systems (as referenced there).
I'd like you to give me some advice in order to tackle this problem. At college I've been solving opinion mining tasks but with Twitter the approach is quite different. For example, I used an ensemble learning approach to classify users opinions about a certain Hotel in Spain. Of course, I was given a training set with positive and negative opinions and then I tested with the test set. But now, with twitter, I've found this kind of categorization very difficult.
Do I need to have a training set? and if the answer to this question is positive, don't you think twitter is so temporal so if I have that set, my performance on future topics will be very poor?
I was thinking in getting a dictionary (mainly adjectives) and cross my tweets with it and obtain a term-document matrix but I have no class assigned to any twitter. Also, positive adjectives and negative adjectives could vary depending on the topic and time. So, how to deal with this?
How to deal with the problem of languages? For instance, I'd like to study tweets written in English and those in Spanish, but separately.
Which programming languages do you suggest to do something like this? I've been trying with R packages like tm, twitteR.
Sure, I think the way sentiment is used will stay constant for a few months. worst case you relabel and retrain. Unsupervised learning has a shitty track record for industrial applications in my experience.
You'll need some emotion/adj dictionary for sentiment stuff- there are some datasets out there but I forget where they are. I may have answered previous questions with better info.
Just do English tweets, it's fairly easy to build a language classifier, but you want to start small, so take it easy on yourself
Python (NLTK) if you want to do it easily in a small amount of code. Java has good NLP stuff, but Python and it's libraries are way more user friendly
This site: https://sites.google.com/site/miningtwitter/questions/sentiment provides 3 ways to do sentiment analysis using R.
The twitter package is now updated to work with the new twitter API. I'd you download the source version of the package to avoid getting duplicated tweets.
I'm working on a spanish dictionary for opinion mining, and would publish somewhere accesible.
cheers!
Sentiment Analysis will give only 3 results as said above - positive, negative and neutral. I found a tutorial on Twitter Sentiment analysis and it's quiet easy.
I found it here - https://www.ai-ml.tech/twitter-sentiment-analysis/
Only 3 dependencies, i downloaded and lesser code, done. Just go through it, you will get the solution.
Does anyone know how to build automatic tagging (blog post/document) algorithm? Any example will be appreciated.
I agree with what Wooble is saying. However the naïve solution is to simply write an algorithm that calculates the lexical similarities and differences of the given blog post compared to a corpus of text. This lexical difference will give you words that are found in the blog post with more frequency than those found in the corpus. And from those words, you can infer a tag.
But I strongly recommend against it. Automatic tagging doesn't seem to work in practice. Just outsource the tagging work to your users or to services like Mechanical Turk
Late response but also had this task for a course - so in case someone else is looking to explore this, here is a starting point:
If you are looking for simple solutions or perhaps as a machine learning exercise, you might view automatic tagging as a text categorization/classification task. Naive Bayes classifiers are simple tools to figure out and there is plenty of pseudocode and material to understand these. TFIDF (term frequency-inverse document frequency) metric is something else you can look into - although commonly associated with information retrieval it can be tasked for this problem when combined with other machine learning techniques.
However, instead of assigning the new sample a single label based on a the definition of NB classifier, you will have to determine multiple labels. You can probably use the tag co-occurrence information from training set to help you with this.
This is a simplistic and naive solution and there are a lot of details on feature selection left out (stemming to reduce independent parameters, information gain, etc). Plenty of easily accessible papers on this research topic to try it out!