Unsupervised automatic tagging algorithms? - algorithm

I want to build a web application that lets users upload documents, videos, images, music, and then give them an ability to search them. Think of it as Dropbox + Semantic Search.
When user uploads a new file, e.g. Document1.docx, how could I automatically generate tags based on the content of the file? In other words no user input is needed to determine what the file is about. If suppose that Document1.docx is a research paper on data mining, then when user searches for data mining, or research paper, or document1, that file should be returned in search results, since data mining and research paper will most likely be potential auto-generated tags for that given document.
1. Which algorithms would you recommend for this problem?
2. Is there an natural language library that could do this for me?
3. Which machine learning techniques should I look into to improve tagging precision?
4. How could I extend this to video and image automatic tagging?
Thanks in advance!

The most common unsupervised machine learning model for this type of task is Latent Dirichlet Allocation (LDA). This model automatically infers a collection of topics over a corpus of documents based on the words in those documents. Running LDA on your set of documents would assign words with probability to certain topics when you search for them, and then you could retrieve the documents with the highest probabilities to be relevant to that word.
There have been some extensions to images and music as well, see http://cseweb.ucsd.edu/~dhu/docs/research_exam09.pdf.
LDA has several efficient implementations in several languages:
many implementations from the original researchers
http://mallet.cs.umass.edu/, written in Java and recommended by others on SO
PLDA: a fast, parallelized C++ implementation

These guys propose an alternative to LDA.
Automatic Tag Recommendation Algorithms for
Social Recommender Systems
http://research.microsoft.com/pubs/79896/tagging.pdf
Haven't read thru the whole paper but they have two algorithms:
Supervised learning version. This isn't that bad. You can use Wikipedia to train the algorithm
"Prototype" version. Haven't had a chance to go thru this but this is what they recommend
UPDATE: I've researched this some more and I've found another approach. Basically, it's a two-stage approach that's very simple to understand and implement. While too slow for 100,000s of documents, it (probably) has good performance for 1000s of docs (so it's perfect for tagging a single user's documents). I'm going to try this approach and will report back on performance/usability.
In the mean time, here's the approach:
Use TextRank as per http://qr.ae/36RAP to generate a tag list for a single document. This generates a tag list for a single document independent of other documents.
Use the algorithm from "Using Machine Learning to Support Continuous
Ontology Development" (https://www.researchgate.net/publication/221630712_Using_Machine_Learning_to_Support_Continuous_Ontology_Development) to integrate the tag list (from step 1) into the existing tag list.

Text documents can be tagged using this keyphrase extraction algorithm/package.
http://www.nzdl.org/Kea/
Currently it supports limited type of documents (Agricultural and medical I guess) but you can train it according to your requirements.
I'm not sure how would the image/video part work out, unless you're doing very accurate object detection (which has it's own shortcomings). How are you planning to do it ?

You want Doc-Tags (https://www.Doc-Tags.com) which is a commercial product that automatically and Unsupervised - generates Contextually Accurate Document Tags. The built-in Reporting functionality makes the product a light-weight document management system.
For Developers wanting to customize their own approach - the source code is available (very cheap) and the back-end service xAIgent (https://xAIgent.com) is very inexpensive to use.

I posted a blog article today to answer your question.
http://scottge.net/2015/06/30/automatic-image-and-video-tagging/
There are basically two approaches to automatically extract keywords from images and videos.
Multiple Instance Learning (MIL)
Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), and the variants
In the above blog article, I list the latest research papers to illustrate the solutions. Some of them even include demo site and source code.
Thanks, Scott

Related

Is there any sentiment forum dataset for unsupervised training available?

I recently finished a machine learning course and would like to make a forum sentiment analysis tool, to apply it in stock-related forums.
The idea is to:
Capture (text mining) users with their comments, and evaluate their comment's sentiment (positive, negative, neutral).
Capture what happens (stock market) after those comments, and assign a weight to the user accordingly (bigger weight if the user's sentiments is spot-on and the market follows the same direction)
Use the comments as a tool to predict market direction.
Actually, I do this myself (pay attention on forums) plus my own technical analysis and the obligatory due diligence, and it has been working very well for me. I just wanted to try to automate it a little bit and maybe even allow a program to play with some of my accounts (paper trading first, and if it performs decently assign some money in a real account)
This would be my first machine learning project (just as a proof-of-concept) so any comments would be very kindly appreciated.
The biggest problem that I find is that I would like to make an unsupervised training, and I need a sample dataset to do the training.
Question: Is there any known forum-sentiment dataset available to be used for unsupervised training?
I've found several sentiment datasets (twitter, imbd, amazon reviews) but they are very specific to their niche (short messages, movies, products...) but I'm looking for something more general.
Since you are looking for an unsupervised approach you can use any set of data that matches your "real case scenario". Text mining and sentiment analysis are are often tailored to the problem at hand so it is easy to start directly with the real data. The best approach is to built a scraper that grabs directly the forum posts that you want to analyze. You can build the scraper easily enough with Python (beautifulsoup/selenium). Online is full of nice tutorial eg: https://www.dataquest.io/blog/web-scraping-tutorial-python/

How to tackle twitter sentiment analysis?

I'd like you to give me some advice in order to tackle this problem. At college I've been solving opinion mining tasks but with Twitter the approach is quite different. For example, I used an ensemble learning approach to classify users opinions about a certain Hotel in Spain. Of course, I was given a training set with positive and negative opinions and then I tested with the test set. But now, with twitter, I've found this kind of categorization very difficult.
Do I need to have a training set? and if the answer to this question is positive, don't you think twitter is so temporal so if I have that set, my performance on future topics will be very poor?
I was thinking in getting a dictionary (mainly adjectives) and cross my tweets with it and obtain a term-document matrix but I have no class assigned to any twitter. Also, positive adjectives and negative adjectives could vary depending on the topic and time. So, how to deal with this?
How to deal with the problem of languages? For instance, I'd like to study tweets written in English and those in Spanish, but separately.
Which programming languages do you suggest to do something like this? I've been trying with R packages like tm, twitteR.
Sure, I think the way sentiment is used will stay constant for a few months. worst case you relabel and retrain. Unsupervised learning has a shitty track record for industrial applications in my experience.
You'll need some emotion/adj dictionary for sentiment stuff- there are some datasets out there but I forget where they are. I may have answered previous questions with better info.
Just do English tweets, it's fairly easy to build a language classifier, but you want to start small, so take it easy on yourself
Python (NLTK) if you want to do it easily in a small amount of code. Java has good NLP stuff, but Python and it's libraries are way more user friendly
This site: https://sites.google.com/site/miningtwitter/questions/sentiment provides 3 ways to do sentiment analysis using R.
The twitter package is now updated to work with the new twitter API. I'd you download the source version of the package to avoid getting duplicated tweets.
I'm working on a spanish dictionary for opinion mining, and would publish somewhere accesible.
cheers!
Sentiment Analysis will give only 3 results as said above - positive, negative and neutral. I found a tutorial on Twitter Sentiment analysis and it's quiet easy.
I found it here - https://www.ai-ml.tech/twitter-sentiment-analysis/
Only 3 dependencies, i downloaded and lesser code, done. Just go through it, you will get the solution.

Prioritizing text based on content

If you have a list of texts and a person interested in certain topics what are the algorithms dealing with choosing the most relevant text for a given person?
I believe that this is quite a complex topic and as an answer I expect a few directions to study various methodologies of text analysis, text statistics, artificial intelligence etc.
thank you
There are quite a few algorithms out there for this task. At least way too many to mention them all here. First some starting points:
Topic discovery and recommendation are two quite distinctive tasks, although they often overlap. If you have a stable userbase, you might be able to give very good recommendations without any topic discovery.
Discovering topics and assigning names to them are also two different tasks. This means it is often easier to be able to tell that text A and text B share a similar topic, than to explicetly be able to state what this common topic might be. Giving names to the topics is best done by humans, for example by having them tag the items.
Now to some actual examples.
TF-IDF is often a good starting point, however it also has severe drawbacks. For example it will not be able to tell that "car" and "truck" in two texts mean that these two probably share a topic.
http://websom.hut.fi/websom/ A Kohonen map for automatically clustering data. It learns the topics and then organizes the texts by topics.
http://de.wikipedia.org/wiki/Latent_Semantic_Analysis Will be able to boost TF-IDF by detecting semantic similarity among different words. Also note, that this has been patented, so you might not be able to use it.
Once you have a set of topics assigned by users or experts, you can also try almost any kind of machine learning method (for example SVM) to map the TF-IDF data to topics.
As a search engine engieneer I think this problem is best solved using two techniques in conjuction.
Technology 1, Search (TF-IDF or other algorithms)
Use search to create a baseline model for content where you dont have user statistics. There are a number of technologies out there but I think the Apache Lucene/Solr code base is by fare the most mature and stable.
Technology 2, User based recommenders (k-nearest neighborhood other algorithms)
When you start getting user statistics use this to enhance your relevance model used by the text analysis system. A fast growing codebase to solv these kinds of problem is the Apache Mahout project.
Check out Programming Collective Intelligence, a really good overview of various techniques along these lines. Also very readable.

Media recommendation engine - Single user system - How to start

I want to implement a media recommendation engine. I saw a similar posts on this, but I think my requirements are bit different from those, so posting here.
Here is the deal.
I want to implement a recommendation engine for media players like VLC, which would be an engine that has to care for only single user. Like, it would be embedded in a media player on a PC which is typically used by single user. And it will start learning the likes and dislikes of the user and gradually learns what a user likes. Here it will not be able to find similar users for using their data for recommendation as its a single user system. So how to go about this?
Or you can consider it as a recommendation engine that has to be put in say iPods, which has to learn about a single user and recommend music/Movies from the collections it has.
I thought of start collecting the genre of music/movies (maybe even artist name) that user watches and recommend movies from the most watched Genre, but it look very crude, isn't it?
So is there any algorithms I can use or any resources I can refer up to?
Regards,
MicroKernel :)
What you're trying to do is quite challenging... particularly because it's still in the research stage and a lot of PHDs from reputable universities across the world are trying to get a good solution for that.
SO here are some things that you might need:
Data that you can analyze:
Lots, and lots, and lots of data!
It could be meta data about the media (name, duration, title, author, style, etc.)
Or you can try to do some crazy feature extraction from the media itself.
References to correlate the data to.
Since you can't get other users, you always need the user feedback.
If you don't want to annoy your user to death with feedback questions, then make your application connect to a central server so you can compare users.
An algorithm that can model your data sufficiently well.
If you have no experience at all, then try k-nearest neighbor (the simplest one).
Collaborative filtering
Pearson Correlation
Matrix Factorization/Decomposition
Singular value decomposition (SVD)
Ensemble learning <-- Allows you to combine multiple algorithms and take advantage of their strengths.
The winners of the NetFlix prize said this:
Predictive accuracy is substantially
improved when blending multiple
predictors. Our experience is that
most efforts should be concentrated in
deriving substantially different
approaches, rather than refining a
single technique. Consequently, our
solution is an ensemble of many
methods.
Conclusion:
There is no silver bullet for recommendation engines and it takes years of exploration to find a good combination of algorithms that produce sufficient results. :)

Search ranking/relevance algorithms

When developing a database of articles in a Knowledge Base (for example) - what are the best ways to sort and display the most relevant answers to a users' question?
Would you use additional data such as keyword weighting based on whether previous users found the article of help, or do you find a simple keyword matching algorithm to be sufficient?
Perhaps the easiest and most naive approach that will give immediately useful results would be to implement *tf-idf:
Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.
In a recent related question of mine here I learned of an excellent free book on this topic which you can download or read online:
An Introduction to Information Retrieval
That's a hard question, and companies like Google are pushing a lot of efforts to address this question. Have a look at Google Enterprise Search Appliance or Exalead Enterprise Search.
Then, as a personal opinion, I don't think that any "naive" approach is going to improve much the result compared to naive keyword search and ordering by the number of views on the documents.
If you have the possibility to expose your knowledge base to the web, then, just do it, and let your favorite search engine handles the search for you.
I think the angle here is not the retrieval itself... its about scoring the relevence of the information retrieved (A more reactive and passive approach) which can be later used to improve the search engine.
I guess you can try -
knn on tfidf for retrieving information
Hand tagging these retrieved info a relevency score
Then regress that score to predict the score for an unknwon search result and sort it.
Just a thought...
The third point is actually based on Rocchio algorithm. You can see it here
A little more specificity of your exact problem would be good. There are a lot of different techniques that you can use. Many of these are driven by other pieces of data. You can of course use Lucene and build your own indexes. There are bindings for many languages to lucene. Moving up there is also the Solr project which is Lucene with a lot of tools and extra functionality around it. That may be more along the lines of what you are looking for.
Intent is tricky and most modern search engines rely on statistical intent to aid in the ordering of results. You can always have an is this article useful button and store the query text that leads to useful documents. You could then add a layer of information to the index to boost specific words or phrases and help them point to certain documents.
Some things to think about...How many documents? What is the average length? Are they updated frequently? What do users do with the documents? What does the spread of unique words to documents look like? (More simply is it easy to match a query with a specific document(s) based on common unique features.)
If it is on the web you can always make a google custom search engine that just searches your site although you may find this to be sub-optimal for a variety of reasons.
You can always start with a simple index and gradually make it more sophisticated by talking with users and capturing data.
keyword matching is not enough when dealing with questions, you need to understand intent, as joannes say a very hot topic in search

Resources