Mahout Content Based Recommendation Engine - algorithm

I am working on a recommendation problem (content-based recommendation). I have my data set in MongoDB in JSON format.
Problem Statement
There are items which have their own properties, and users have preferences regarding each of those properties. I am thinking of predicting how much item x will be liked by a user, based on the item's properties and a comparison with the user's preferences for those same properties. I want to build a recommendation system that recommends items to users based on their preferences.
I am thinking of using Mahout and the CBayes classifier algorithm to predict "how much item x will be liked by User A", but I haven't found any example or data set for implementing CBayes using Mahout.
If you have any suggestion for another classifier algorithm, please recommend it.

You can calculate "how much item x will be liked by User A" by using cosine similarity. Please refer to the following link for more information.
Reference link: What's difference between Collaborative Filtering Item-based recommendation and Content-based recommendation
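As a rough sketch of that idea (the property names and scores below are invented for illustration, not taken from your data set), cosine similarity between a user's preference vector and an item's property vector can be computed like this in Python:

import math

# Hypothetical property scores: how strongly item x exhibits each property
# and how much user A cares about that property, on the same scale.
item_x = {'rock': 0.9, 'acoustic': 0.1, 'vocals': 0.7}
user_a = {'rock': 0.8, 'acoustic': 0.4, 'vocals': 0.6}

def cosine_similarity(a, b):
    # Only properties present in both vectors contribute to the dot product.
    common = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity(user_a, item_x))  # closer to 1.0 = stronger predicted liking

The closer the score is to 1, the better item x's properties line up with user A's preferences.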
Regards,
Rajasekar

Related

How do I access h2o xgb model input features after saving a model to disk and reloading it?

I'm using h2o's xgboost implementation in Python. I've saved a model to disk and I'm trying to load it later on for analysis and prediction. I'm trying to access the list of input features or, even better, the list of features the model actually used (which does not include the features it decided not to use). The usual advice is to use the varimp function to get the variable importances, and while this does remove features that aren't used in the model, it actually gives you the importance of the intermediate features created by one-hot encoding (OHE) the categorical features, not the original categorical feature names.
I've searched for how to do this and so far I've found the following but no concrete way to do this:
Someone asking something very similar to this and being told the feature has been requested in Jira
Said Jira ticket, which has been marked resolved, but which I believe says this was implemented without being made customer-visible.
A similar ticket requesting this feature (original categorical feature importance) for variable importance heatmaps but it is still open.
Someone else who found an unofficial way to access the columns with model._model_json['output']['names'] but that doesn't give the features that weren't used by the model and they are told to use a different method that doesn't work if you have saved the model to disk and reloaded it (which I am doing).
The only option I see is to just take the varimp features, split on the period character to break up the OHE feature names, take the first part of each split, and then run a set over everything to get the unique column names. But I'm hoping there's a better way to do this.
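For what it's worth, a rough sketch of that workaround (assuming varimp() returns rows whose first element is the feature name, which is how it has behaved in the versions I've used; the model path is hypothetical):

import h2o

h2o.init()
# Hypothetical path; substitute the path you saved the model to.
model = h2o.load_model("/path/to/saved_model")

# Recover original column names from the one-hot-encoded feature names
# reported by varimp(), e.g. "color.red" -> "color".
varimp_rows = model.varimp()  # rows of (feature, relative_importance, ...)
original_columns = {row[0].split('.')[0] for row in varimp_rows}
print(sorted(original_columns))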

Google cloud natural language API adding own context classifier

I have been searching for how to create a new entity in the Google Natural Language API and found nothing. Can anybody explain how to create a new classifier so that if I pass a sentence I can detect, say, 'python' as a programming language? Currently the API classifies 'python' as 'other'.
I have also looked into the Cloud AutoML API for my solution and tried to create and train a model, but it was only able to do sentiment analysis, not entity detection. It was giving me a score rather than telling me that Java is a programming language.
Thanks in advance. Your help will be appreciated.
AutoML content classification classifies your data into the labels specified in the training set. It does not do entity detection. But it seems like what you need is closer to content classification than entity detection. My understanding from the description you provided is that you have content (which may be words, phrases, or short sentences) and you want to classify it into some labels (e.g. programmingLanguage). If you put together a good training set, the AutoML model should be able to do this.
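Purely for illustration (the example rows and labels below are invented, and you should check the current AutoML docs for the exact CSV layout it expects), a classification training set is essentially a list of text/label pairs:

import csv

# Hypothetical training rows: short text snippets mapped to the label
# you want the model to learn (here, programmingLanguage vs. other).
rows = [
    ("python", "programmingLanguage"),
    ("java", "programmingLanguage"),
    ("I wrote this service in Go", "programmingLanguage"),
    ("the weather is nice today", "other"),
]

with open("training_data.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)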
The number it provides in eval is not sentiment; it's the probability of the predicted label. As you can see in the eval page you posted, it's telling you that java is a programmingLanguage with a probability of 1 (so it's very certain about it).

text mining/analyse user commands/questions algorithm or library

I have a financial application and I wish to add the ability to take a user command or input in a textbox and then take the right action. For example, the user writes "show the revenue in the last 10 days" and the application shows the revenue to him/her. The point is that I want it to really understand the meaning of the question, so the previous statement would bring the same results as "do I got any revenue in the last 10 days" or something like that - BI (something like the Wolfram|Alpha engine).
I wonder if there's any open-source library, algorithm book, or anything else that I can use to learn the subject. Regarding open-source libraries, I don't mind which language they're written in.
I've read about this subject and have seen many engines and services (OpenNLP, Apache UIMA, CoreNLP etc.) but could not figure out if they're right for my needs.
Any answer or suggestion is welcome.
Many thanks!
The field you're talking about is usually called "natural language processing". It's hard, and an active field of research. There are various libraries which you could consider based on your preferred programming language and use case:
http://en.wikipedia.org/wiki/List_of_natural_language_processing_toolkits
I've used NLTK a little bit. This field is seriously difficult to get right, so you might want to try to restrict your application to some small set of verbs and nouns such that people are using a controlled vocabulary in the first instance, and then try to extend it beyond that.
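As a toy illustration of the "controlled vocabulary" idea (the verbs, nouns, and example queries below are invented), you can get surprisingly far with plain keyword matching before reaching for a full NLP toolkit:

import re

# Tiny controlled vocabulary: known verbs/phrasings and known metrics.
VERBS = {"show", "display", "do i got", "do i have"}
NOUNS = {"revenue", "expenses", "profit"}

def parse_command(text):
    text = text.lower()
    noun = next((n for n in NOUNS if n in text), None)
    verb = next((v for v in VERBS if v in text), None)
    # Pull out a time range like "last 10 days" if present.
    match = re.search(r"last (\d+) days", text)
    days = int(match.group(1)) if match else None
    if verb and noun:
        return {"action": "report", "metric": noun, "days": days}
    return None

print(parse_command("show the revenue in the last 10 days"))
print(parse_command("do I got any revenue in the last 10 days"))

Both queries map to the same structured action, which is the behaviour you described; extending coverage is then a matter of growing the vocabulary or swapping in a real NLP library.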

How can I do "related tags"?

I have tags on my website, and I input them one by one when I create a blog post. I love Gmail's new feature that asks whether you want to include X in a mail when you type Y's name, because you often include both of them in the same messages.
I'd like to do something similar on my website, but I don't know how to represent the tags' relatedness in an object or database... thoughts?
It all boils down to creating associations between certain characteristics of your posts and certain tags, and then - when you press the "publish" button - analysing the new post and proposing all tags matched with your post's characteristics.
This can be done in several ways from a "totally hard-coded" association to some sort of "learning AI"... and everything in-between.
Hard-coded solutions
These are the simplest algorithms to implement. You should first decide which characteristics of your post are relevant for tagging (e.g. its length if you tag posts "short" or "long", the presence of photos or videos if you tag them "multimedia-content", etc...). The most obvious, however, is to focus on which words are used in posts. For example you could build a mapping like this:
tag_hint_words = {'code-development': ['programming',
                                       'language', 'python', 'function',
                                       'object', 'method'],
                  'family': ['Theresa', 'kids',
                             'uncle Ben', 'holidays']}
Then you would check your post for the presence of the words in the list (the code between [ and ] ) and propose the tag (the word before :) as a possible candidate.
A common approach is to give "scores", or in other words to attach a number that indicates the probability a given tag is the right one. For example: if your post contained the sentence...
After months of programming, we finally left for the summer holidays at uncle Ben's cottage. Theresa and the kids were ecstatic!
...despite the presence of the word "programming", the program should indicate family as the most likely tag to use, as there are many more words hinting at it.
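A minimal sketch of that scoring idea, reusing the tag_hint_words mapping above (the "score" here is just a raw hit count, not a real probability):

def score_tags(post_text, tag_hint_words):
    # Count how many hint words of each tag appear in the post.
    text = post_text.lower()
    return {tag: sum(text.count(word.lower()) for word in words)
            for tag, words in tag_hint_words.items()}

post = ("After months of programming, we finally left for the summer "
        "holidays at uncle Ben's cottage. Theresa and the kids were ecstatic!")
scores = score_tags(post, tag_hint_words)
print(max(scores, key=scores.get))  # 'family' wins despite the word "programming"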
Learning AI's
One of the obvious limitations of the above method is that - say one day you pick up java besides python - you would probably need to change your code and include words like "java" or "oracle" too. The same applies if you create new tags.
To circumvent this limitation (and have some fun!!) you could try to implement a learning algorithm. Learning algorithms are those that refine their outcome the more you use them (so they indeed... learn!). Some algorithms require initial training (many spam filters and voice recognition programs need this initial "primer"). Some don't.
I am absolutely no expert on the subject, but two common AI's are: the Naive Bayes Classifier and some flavour of Neural network.
Although the WP pages might look scary, they are surprisingly easy to implement (at least in Python). Here's the recording of a lecture at PyCon 2009 on the subject "Easy AI with Python". I found it very informative and even somehow inspiring! :)
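If you want to experiment with the Naive Bayes route concretely, here's a minimal sketch using scikit-learn (just one convenient library choice, not the only one; the example posts and tags are invented):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training set: past posts and the tag each one received.
posts = [
    "refactored the python function into a cleaner object method",
    "new language features landed in the latest release",
    "holidays with uncle Ben, Theresa and the kids",
    "family dinner and board games with the kids",
]
tags = ["code-development", "code-development", "family", "family"]

# Bag-of-words features feeding a Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(posts, tags)

print(model.predict(["spent the weekend debugging a python method"]))  # likely 'code-development'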
HTH!
You should have a look at this post :
Any suggestions for a db schema for storing related keywords?
If you're looking for a schema for storing related tags it will help.
Relevancy searches where multiple agents play a part are usually done using Collaborative filtering. You might want to give that a look see.
Look up Clustering (Machine Learning algorithm). Don't be intimidated by math, it's a pretty straightforward algorithm. Check out Machine Learning for Hackers for simpler explanations of many Machine Learning algorithms and methods.

Tag/Keyword based recommendation

I am wondering what algorithm would be clever to use for a tag-driven e-commerce environment:
Each item has several tags. IE:
Item name: "Metallica - Black Album CD", Tags: "metallica", "black-album", "rock", "music"
Each user has several tags and friends (other users) bound to them. IE:
Username: "testguy", Interests: "python", "rock", "metal", "computer-science"
Friends: "testguy2", "testguy3"
I need to generate recommendations for such users by checking their interest tags in a sophisticated way.
Ideas:
A hybrid recommendation algorithm can be used, as each user has friends (a mixture of collaborative + content-based recommendations).
Maybe using user tags, similar users (peers) can be found to generate recommendations.
Maybe directly matching user tags against item tags.
Any suggestion is welcome. Any Python-based library is also welcome, as I will be building this experimental engine in Python.
1) Weight your tags.
Tags fall into several groups of interest:
My tags that none of my friends share
Tags a number of my friends share, but I don't
My tags that are shared by a number of my friends.
(sometimes you may want to consider friend-of-a-friend tags too, but in my experience the effort hasn't been worth it. YMMV.)
Identify all tags that the person and/or the person's friends have in interests, and attach a weight to the tags for this individual. One simple possible formula for tag weight is
(tag_is_in_my_list) * 2 + (friends_with_tag)/(number_of_friends)
Note the magic number 2, which makes your own opinion worth twice as much as that of all of your friends put together. Feel free to tweak :-)
2) Weight your items
For each item that has any of the tags in your list, just add up all of the weighted values of the tags. A higher value = more interest.
3) Apply a threshold.
The simplest way is to show the user the top n results.
More sophisticated systems also apply anti-tags (i.e. topics of non-interest) and do many other things, but I have found this simple formula effective and quick.
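A compact sketch of steps 1-3 (the users, items, and tags below are invented; the tag-weight formula is the one given above):

def tag_weights(my_tags, friends_tags):
    # friends_tags: list of tag sets, one per friend.
    all_tags = set(my_tags).union(*friends_tags) if friends_tags else set(my_tags)
    weights = {}
    for tag in all_tags:
        friends_with_tag = sum(1 for f in friends_tags if tag in f)
        weights[tag] = (tag in my_tags) * 2 + friends_with_tag / max(len(friends_tags), 1)
    return weights

def rank_items(items, weights, top_n=3):
    # items: mapping of item name -> set of tags; score = sum of weighted tags.
    scored = {name: sum(weights.get(t, 0) for t in tags) for name, tags in items.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_n]

me = {"rock", "metal", "python"}
friends = [{"rock", "jazz"}, {"metal", "rock"}]
items = {"Metallica - Black Album CD": {"metallica", "rock", "music"},
         "Miles Davis - Kind of Blue": {"jazz", "music"}}

print(rank_items(items, tag_weights(me, friends)))  # the Metallica CD scores highest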
If you can, track down a copy of O'Reilly's Programming Collective Intelligence, by Toby Segaran. There's a model solution in it for exactly this problem (with a whole bunch of really, really good other stuff).
Your problem is similar to product recommendation engines, such as Amazon's well publicized site. These use a learning algorithm called association rules, which basically build a conditional probability of user X buying product Y based on common features Z between the user and product. A lot of open source toolkits implement association rules, such as Orange and Weka.
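Just to give the flavour of what an association rule measures, here is a bare-bones sketch of the conditional probability (the "confidence" of a rule) computed over invented purchase data; real toolkits like Orange and Weka do this far more efficiently and add support thresholds, rule mining, and so on:

# Invented purchase history: each entry is the set of tags attached to
# the items one user bought.
purchases = [
    {"rock", "metal", "cd"},
    {"rock", "cd"},
    {"jazz", "vinyl"},
    {"rock", "metal", "vinyl"},
]

def confidence(antecedent, consequent, transactions):
    # P(consequent | antecedent): how often the consequent tag appears
    # among transactions that contain the antecedent tag.
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    return sum(1 for t in with_antecedent if consequent in t) / len(with_antecedent)

print(confidence("rock", "metal", purchases))  # 2/3 of "rock" buyers also bought "metal"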
You can use the Python Semantic module for Drools to specify your rules in the Python scripting language. You can accomplish this easily using Drools. It is a terrific rules engine that we have used to build several recommendation engines.
I would use a Restricted Boltzmann Machine. Gets around the problem of similar but not identical tags quite neatly.
