Mahout recommender returns no results for a user

I'm curious why in the example below the Mahout recommender isn't returning a recommendation for user 1.
My input file is below. I added blank lines between users to enhance readability; the blank lines will need to be removed before the file is run through Mahout.
The columns in this file are:
User ID | item number | item rating
1 101 0
1 102 0
1 103 5
1 104 0

2 101 4
2 102 5
2 103 4
2 104 0

3 101 0
3 102 5
3 103 5
3 104 3
You'll note that item 103 is the only common item that all 3 users rated.
I ran:
hadoop jar C:\hdp\mahout-0.9.0.2.1.3.0-1981\core\target\mahout-core-0.9.0.2.1.3.0-1981-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_COOCCURRENCE --input small_data_set.txt --output small_data_set_output
The Mahout recommendation output file shows:
2 [104:4.5]
3 [101:5.0]
Which I believe means:
User 2 would be recommended item 104. Since user 3 rated item 104 a 3, this may account for the 4.5 recommendation score vs. the result below…
User 3 would be recommended item 101. Since user 2 rated item 101 a 4, this may account for the slightly higher recommendation score of 5.
Is this correct?
Why isn't user 1 included in the recommendation output file? User 1 could have received a recommendation for Item 102 because user 2 and user 3 rated it. Is the data set too small?
Thanks in advance.

Several mistakes may be present in your data; the first two here will cause undefined behavior:
IDs must be contiguous non-negative integers starting at 0, so you need to map your IDs above somehow. Your-user-ID = 1 will be Mahout-user-ID = 0. The same goes for items: your-item-ID = 101 will be Mahout-item-ID = 0.
You should omit the 0 values from the input altogether if you mean that the user has expressed no preference; this makes the preference "undefined" in a sense. To do this, omit those lines entirely.
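A minimal preprocessing sketch for the two fixes above (Python and the output file name are assumptions here; keep the returned ID maps so Mahout's 0-based output can be translated back to the original IDs):

import csv

def preprocess(in_path, out_path):
    # Remap user/item IDs to contiguous 0-based integers and drop
    # 0-valued rows, i.e. preferences that were never expressed.
    user_ids, item_ids = {}, {}
    with open(in_path) as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for line in src:
            fields = line.split()
            if len(fields) != 3:
                continue                      # skip blank/malformed lines
            user, item, rating = fields[0], fields[1], float(fields[2])
            if rating == 0:
                continue                      # undefined preference: omit
            u = user_ids.setdefault(user, len(user_ids))
            i = item_ids.setdefault(item, len(item_ids))
            writer.writerow([u, i, rating])
    return user_ids, item_ids

user_map, item_map = preprocess("small_data_set.txt", "small_data_set_mapped.csv")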
Always use SIMILARITY_LOGLIKELIHOOD; it is widely measured as doing significantly better than the other methods, unless you are trying to predict ratings, in which case use cosine.
If you use LLR similarity you should omit the values since they will be ignored.
There are very few uses for preference values unless you are trying to predict a user's rating for an item. The preference weights are useless in determining recommendation ranking, which is the typical thing to optimize. If you want to recommend the right things in the right order toss the values and use LLR.
The other thing people sometimes do with values is encode some weight of preference, so 1 = a view of a product page and 5 = a product purchase. This will not work! I tried this with a large e-commerce dataset and found the recommendations got worse when product views were added in, even though there was 100 times more data. Views and purchases are fundamentally different user actions with different user intent, and so can't be mixed in this way.
If you really do want to mix different actions, use the new multimodal recommender based on Mahout, Spark, and Solr described on the Mahout site here: it allows cross-cooccurrence-type indicator calculations, so you can use user location, likes and dislikes, views and purchases. Virtually the entire user clickstream can be used, but only with cross-cooccurrence correlating each action to the canonical "best" action, the one you want to recommend.

Related

Amazon QuickSight - Single Activity ID has Multiple entries - need to SUM one column but display only ONE VALUE from another column

I have been working on this for some time, but have come up empty.
I have a data set called 'Technical Assistance'.
In that data set, there is a column with the 'ActivityID', another column 'NumberAssisted', and a third column 'ContactHours'.
The issue is that for each ActivityID, there can be multiple entries, each with its own NumberAssisted and ContactHours.
Additionally, for EACH ACTIVITYID I need to show NUMBERASSISTED as a SUM, but only display CONTACTHOURS once (no sum, no calculation at all, just display).
In my example scenario, I have ONE Activity ID with FOUR entries - each entry has a Number Assisted and Contact Hours. By using SUM, I can get the correct Number Assisted (5), but cannot figure out how to get Contact Hours to display what I need. It SHOULD DISPLAY as 0.5, based on the scenario below:
ActivityID NumberAssisted ContactHours
101 1 0.5
101 1 0.5
101 1 0.5
101 2 0.5
TOTAL: 5 0.5
Thank you for any guidance!!
Troy
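The logic being asked for is "sum one column, show a single representative value of another". A pandas sketch of that aggregation, purely to illustrate (pandas is an assumption here): since ContactHours is identical within an ActivityID, any of min, max, or first returns the single 0.5, which in QuickSight terms suggests aggregating ContactHours with max() rather than sum().

import pandas as pd

df = pd.DataFrame({
    "ActivityID":     [101, 101, 101, 101],
    "NumberAssisted": [1, 1, 1, 2],
    "ContactHours":   [0.5, 0.5, 0.5, 0.5],
})

# Sum NumberAssisted, but display ContactHours only once per activity.
result = df.groupby("ActivityID").agg(
    NumberAssisted=("NumberAssisted", "sum"),
    ContactHours=("ContactHours", "max"),  # all rows share the same value
)
print(result)  # ActivityID 101 -> NumberAssisted 5, ContactHours 0.5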

Finding mutually exclusive data to fulfill multiple criteria

I am trying to figure out an algorithm which would allow me to programmatically resolve the following problem.
There is a book club that has a set of rules for permission to join. You must have read 3 separate books, each conforming to at least one of the following rules, and all 3 rules must be fulfilled, each by a different book. Those rules are as follows:
Rules:
id  field            comparator  value
1   Author           IN          ["John Steinbeck", "J.D. Salinger"]
2   Rating           >=          4.0
3   Number of Pages  >           300
The following is the list of books a user has read thus far:
Books:
id  title                   author          rating  pages
1   Of Mice and Men         John Steinbeck  3.88    107
2   To Kill a Mockingbird   Harper Lee      4.28    281
3   Animal Farm             George Orwell   3.96    112
4   The Grapes of Wrath     John Steinbeck  3.98    464
5   1984                    George Orwell   4.19    328
6   The Catcher in the Rye  J.D. Salinger   3.81    277
I've written a function which allows me to check one specific rule against one specific book, returning a boolean response indicating whether that book satisfies the rule. The function has the following signature:
function is_valid(rule, book)
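A sketch of such a function, assuming rules and books are plain dicts and that the rule's field name has been normalized to match the book's keys:

def is_valid(rule, book):
    """Return True if the book satisfies the single rule."""
    value = book[rule["field"]]
    if rule["comparator"] == "IN":
        return value in rule["value"]
    if rule["comparator"] == ">=":
        return value >= rule["value"]
    if rule["comparator"] == ">":
        return value > rule["value"]
    raise ValueError("unknown comparator: " + rule["comparator"])

rule = {"field": "author", "comparator": "IN",
        "value": ["John Steinbeck", "J.D. Salinger"]}
book = {"title": "Of Mice and Men", "author": "John Steinbeck",
        "rating": 3.88, "pages": 107}
print(is_valid(rule, book))  # True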
By running each book against each rule, I come up with the following list of books which match each rule:
rule_id  book_id
1        1
1        4
1        6
2        2
2        5
3        4
3        5
3        6
Now by manually looking through the resulting list of matching rules & books, I can tell that one of the combinations that would work is:
rule_id  book_id
1        1
2        2
3        4
Therefore the user who had read these 6 books would qualify to join the book club because they had read 3 separate books to satisfy each of the 3 rules.
What I'm hoping to find here is someone with expertise in data analysis who can point me in the direction of an algorithm to solve this problem programmatically, and that will scale to many more rules and to books potentially numbering in the tens of thousands.
Any guidance would be greatly appreciated.
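This is maximum bipartite matching: rules on one side, books on the other, an edge wherever is_valid(rule, book) holds, and the user qualifies exactly when some matching covers every rule. A sketch using Kuhn's augmenting-path algorithm (the edges below are taken from the rule/book table above; for tens of thousands of books, Hopcroft-Karp or any off-the-shelf matching routine computes the same thing faster):

def full_matching(edges, rule_ids):
    """edges: rule_id -> list of book_ids that satisfy it.
    Returns {rule_id: book_id} covering every rule, or None."""
    match = {}  # book_id -> rule_id currently holding that book

    def try_assign(rule, visited):
        for book in edges.get(rule, []):
            if book not in visited:
                visited.add(book)
                # Take the book if it is free, or if its current holder
                # can be reassigned to some other book.
                if book not in match or try_assign(match[book], visited):
                    match[book] = rule
                    return True
        return False

    for rule in rule_ids:
        if not try_assign(rule, set()):
            return None  # some rule cannot get its own book
    return {r: b for b, r in match.items()}

edges = {1: [1, 4, 6], 2: [2, 5], 3: [4, 5, 6]}  # from the table above
print(full_matching(edges, [1, 2, 3]))  # {1: 1, 2: 2, 3: 4}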

Store a tree structure as a table

Let me describe the one record structure in pseudo-code:
Record
    UserName
    E-mail
    Items[8]
        ItemPropertyA
        ItemPropertyB
        ItemPropertyC
        ItemPropertyD
        ItemPropertyE
There are 1-8 items in a record and exactly 5 properties in each item. I need to store many of these records as an (Excel) table, and I want it to be human-readable if possible. The straightforward approach is to put items and properties in 8 * 5 = 40 columns, but this is difficult to review. I'm going to place a JSON array of properties in each cell (one cell per item), using as many cells in each row as needed. I'm just curious about other tree-to-table possibilities.
There is an alternative to 40 separate columns (some of which may be unused if there are fewer than 8 items in a record): you can use database-style normalized records:
SHEET 1
RecordId  UserName  Email
1         Bobby     bobby@example.com
2         Susan     sueb@example.com
SHEET 2
RecordId  ItemId  PropertyA  PropertyB  PropertyC  PropertyD  PropertyE
1         1       Chocolate  Electric   Round      Silver     Hebrew
1         2       Raspberry  Steam      Trapezoid  Brass      Esperanto
1         3       Durian     Gravity    Bezier     Titanium   Bahasa Melayu
2         1       Vanilla    Solar      Rhombus    Copper     Pashto
Of course you could normalize even further and have just a single Property column, but the above seems sufficient when you know each item has exactly the same five properties.
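A small sketch of flattening nested records into the two normalized sheets (Python and plain CSV files are assumptions here, standing in for the Excel sheets; records is a hypothetical in-memory structure):

import csv

records = [
    {"UserName": "Bobby", "Email": "bobby@example.com",
     "Items": [["Chocolate", "Electric", "Round", "Silver", "Hebrew"],
               ["Raspberry", "Steam", "Trapezoid", "Brass", "Esperanto"]]},
]

with open("sheet1.csv", "w", newline="") as s1, \
     open("sheet2.csv", "w", newline="") as s2:
    users = csv.writer(s1)
    items = csv.writer(s2)
    users.writerow(["RecordId", "UserName", "Email"])
    items.writerow(["RecordId", "ItemId"] + [f"Property{c}" for c in "ABCDE"])
    for rec_id, rec in enumerate(records, start=1):
        users.writerow([rec_id, rec["UserName"], rec["Email"]])
        # One row per item, keyed back to the owning record.
        for item_id, props in enumerate(rec["Items"], start=1):
            items.writerow([rec_id, item_id] + props)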

Generate pairs from list that hasn't already historically existed

I'm building a pairing system that is supposed to create a pairing between two users and schedule them in a meeting. The selection is based on a criterion that I am having a hard time figuring out: an earlier match cannot have existed between the pair.
My input is a list of size n that contains email addresses. This list is supposed to be split into pairs. The restriction is that the pair hasn't occurred previously.
So for example, my list would contain a couple of user ids
list = {1,5,6,634,533,515,61,53}
At the same time I have database tables where the old pairs exist:
previous_pairs
---------------------
id date status
1 2016-10-14 12:52:24.214 1
2 2016-10-15 12:52:24.214 2
3 2016-10-16 12:52:24.214 0
4 2016-10-17 12:52:24.214 2
previous_pair_users
---------------------
id userid
1 1
1 5
2 634
2 553
3 515
3 61
4 53
4 1
What would be a good approach to solve this problem? My test solution right now is to pop two random users and check them for a previous match. If they have already been matched, I pop a new random user (if possible) and push one of the clashing users back onto the list. If two people are the last ones left, they get matched regardless. This doesn't sound good to me, since I should be able to predict which matches cannot occur from the list of already existing pairs.
Do you have any idea how to get me going with building this procedure? Java 8 streams look interesting and might be a way to solve this, but I am very new to them, unfortunately.
The solution here was to create a list of tuples containing the old matches, using the group_concat feature of MySQL:
SELECT group_concat(MatchProfiles.ProfileId) FROM Matches
INNER JOIN MatchProfiles ON Matches.MatchId = MatchProfiles.MatchId
GROUP BY Matches.MatchId
old_matches = ((42,52),(12,52),(19,52),(10,12))
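Turning those group_concat rows into order-independent tuples looks roughly like this (the rows list mimics what a database cursor would return; sorting each pair makes (42, 52) and (52, 42) compare equal later):

# Rows as returned by the group_concat query above (assumed shape).
rows = [("42,52",), ("12,52",), ("19,52",), ("10,12",)]
old_matches = {tuple(sorted(int(x) for x in r[0].split(","))) for r in rows}
print(old_matches)  # {(10, 12), (12, 52), (19, 52), (42, 52)}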
After that I select the candidates and generate a new list of tuples using my pop_random()
new_matches = ((42,12),(19,48),(10,36))
When both lists are done, I look at the intersection to find any duplicates:
duplicates = list(set(new_matches) & set(old_matches))
If we have duplicates, we simply run the randomizer again, up to X attempts, until I conclude it's impossible.
I know that this is not very efficient for a large set of numbers, but my dataset will never be that large, so I think it will be good enough.
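A runnable sketch of the whole retry approach (Python assumed; pairs are normalized with sorted() so the intersection test is order-independent, and an even number of users is assumed):

import random

def pair_up(users, old_matches, max_attempts=1000):
    """Randomly pair users, retrying until no pair repeats an old match."""
    old = {tuple(sorted(p)) for p in old_matches}
    for _ in range(max_attempts):
        shuffled = random.sample(users, len(users))     # random permutation
        candidate = [tuple(sorted(shuffled[i:i + 2]))   # consecutive pairs
                     for i in range(0, len(shuffled), 2)]
        if not old.intersection(candidate):
            return candidate
    return None  # no clash-free pairing found within the attempt budget

users = [1, 5, 6, 634, 533, 515, 61, 53]
old_matches = [(1, 5), (634, 553), (515, 61), (53, 1)]
print(pair_up(users, old_matches))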

Real time data processing

I am parsing keywords several times per second; every second I have 1,000-5,000 keywords. I want to find outliers, growing keywords, and the other things that fall under technical analysis. One of the problems is how to store the data.
I would be able to do something like:
        20-01  20-02  20-03
brother     0      3      4
table       1      0      0
cup        34     54     78
But there might be a lot of keywords. For every new batch of data I need to check whether each word already exists; if it doesn't, I must add the new word and a new row for it. What is the right way to organize this store? Should I use a key/value database, NoSQL, or something else?
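In memory, the exists-or-insert step is just an upsert into a map from keyword to per-interval counts; a minimal Python sketch (a key/value store such as Redis hashes would have the same shape, one hash per keyword):

from collections import defaultdict

# keyword -> {time_bucket -> count}; unseen keywords appear on first write,
# so there is no separate "does this word exist?" check.
counts = defaultdict(lambda: defaultdict(int))

def ingest(time_bucket, keywords):
    for word in keywords:
        counts[word][time_bucket] += 1

ingest("20-01", ["cup", "table", "cup"])
ingest("20-02", ["cup", "brother"])
print(dict(counts["cup"]))  # {'20-01': 2, '20-02': 1}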
