Finding mutually exclusive data to fulfill multiple criteria - algorithm

I am trying to figure out an algorithm which would allow me to programmatically resolve the following problem.
There is a book club that has a set of rules for permission to join. You must have read 3 separate books which each conform to at least one of the following rules. All 3 rule criteria must be fulfilled by a separate book. Those rules are as follows:
Rules:
id  field            comparator  value
1   Author           IN          ["John Steinbeck", "J.D. Salinger"]
2   Rating           >=          4.0
3   Number of Pages  >           300
The following is the list of books a user has read thus far:
Books:
id  title                   author          rating  pages
1   Of Mice and Men         John Steinbeck  3.88    107
2   To Kill a Mockingbird   Harper Lee      4.28    281
3   Animal Farm             George Orwell   3.96    112
4   The Grapes of Wrath     John Steinbeck  3.98    464
5   1984                    George Orwell   4.19    328
6   The Catcher in the Rye  J.D. Salinger   3.81    277
I've written a function that checks one specific rule against one specific book and returns a boolean indicating whether that book satisfies the rule. The function has the following signature:
function is_valid(rule, book)
Running each book against each rule gives me the following list of books which match each rule (book 6 has 277 pages, so it does not match rule 3):
rule_id  book_id
1        1
1        4
1        6
2        2
2        5
3        4
3        5
Now by manually looking through the resulting list of matching rules & books, I can tell that one of the combinations that would work is:
rule_id  book_id
1        1
2        2
3        4
Therefore the user who had read these 6 books would qualify to join the book club because they had read 3 separate books to satisfy each of the 3 rules.
What I'm hoping to find here is someone with expertise in data analysis who can point me towards an algorithm that solves this problem programmatically and scales to many more rules and books, potentially numbering in the tens of thousands.
Any guidance would be greatly appreciated.
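One way to frame this is as a maximum bipartite matching problem: rules on one side, books on the other, with an edge wherever is_valid(rule, book) is true. The user qualifies if every rule can be matched to a distinct book. The following is a minimal sketch using the simple augmenting-path approach (sometimes called Kuhn's algorithm), not a definitive implementation. The dictionary layout, the COMPARATORS table, the body of is_valid, and the name assign_books are all assumptions made for illustration; in particular, the rule "field" values are assumed to match the book keys.

```python
import operator

# Hypothetical data layout: rule "field" values are assumed to match book dict keys.
RULES = [
    {"id": 1, "field": "author", "comparator": "IN",
     "value": ["John Steinbeck", "J.D. Salinger"]},
    {"id": 2, "field": "rating", "comparator": ">=", "value": 4.0},
    {"id": 3, "field": "pages", "comparator": ">", "value": 300},
]
BOOKS = [
    {"id": 1, "title": "Of Mice and Men", "author": "John Steinbeck", "rating": 3.88, "pages": 107},
    {"id": 2, "title": "To Kill a Mockingbird", "author": "Harper Lee", "rating": 4.28, "pages": 281},
    {"id": 3, "title": "Animal Farm", "author": "George Orwell", "rating": 3.96, "pages": 112},
    {"id": 4, "title": "The Grapes of Wrath", "author": "John Steinbeck", "rating": 3.98, "pages": 464},
    {"id": 5, "title": "1984", "author": "George Orwell", "rating": 4.19, "pages": 328},
    {"id": 6, "title": "The Catcher in the Rye", "author": "J.D. Salinger", "rating": 3.81, "pages": 277},
]

# Assumed mapping from the comparator strings in the rules table to Python checks.
COMPARATORS = {
    "IN": lambda value, allowed: value in allowed,
    ">=": operator.ge,
    ">": operator.gt,
}


def is_valid(rule, book):
    """Stand-in for the is_valid(rule, book) function described in the question."""
    return COMPARATORS[rule["comparator"]](book[rule["field"]], rule["value"])


def assign_books(rules, books, is_valid):
    """Return {rule_id: book_id} using a distinct book per rule, or None if impossible."""
    # For every rule, list the books that satisfy it (the edges of the bipartite graph).
    candidates = {r["id"]: [b["id"] for b in books if is_valid(r, b)] for r in rules}
    book_owner = {}  # book_id -> rule_id currently holding that book

    def try_assign(rule_id, seen):
        # Try each candidate book; if it is taken, try to move its owner elsewhere.
        for book_id in candidates[rule_id]:
            if book_id in seen:
                continue
            seen.add(book_id)
            if book_id not in book_owner or try_assign(book_owner[book_id], seen):
                book_owner[book_id] = rule_id
                return True
        return False

    for rule_id in candidates:
        if not try_assign(rule_id, set()):
            return None  # this rule cannot get a book of its own
    return {rule_id: book_id for book_id, rule_id in book_owner.items()}


if __name__ == "__main__":
    print(assign_books(RULES, BOOKS, is_valid))
```

On the sample data this returns rule 1 with book 1, rule 2 with book 2, and rule 3 with book 4, the same combination found by hand above. For tens of thousands of rules the recursion here may get too deep, so an iterative version or a library implementation of Hopcroft-Karp (for example networkx.bipartite.hopcroft_karp_matching) may be a better fit.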

Related

Store a tree structure as a table

Let me describe the one record structure in pseudo-code:
Record
    UserName
    E-mail
    Items[8]
        ItemPropertyA
        ItemPropertyB
        ItemPropertyC
        ItemPropertyD
        ItemPropertyE
Each record has 1-8 items, and each item has exactly 5 properties. I need to store many such records as an (Excel) table, and I want it to be human readable if possible. The straightforward approach is to put the items and properties in 8 * 5 = 40 columns, but this is difficult to review. I'm going to place a JSON array of properties in each cell (one cell per item), using as many cells in each row as needed. I'm just curious about other tree-to-table possibilities.
There is an alternative to 40 separate columns (some of which may be unused if there are fewer than 8 items in a record). You can use database style normalized records:
SHEET 1
RecordId  UserName  Email
1         Bobby     bobby@example.com
2         Susan     sueb@example.com
SHEET 2
RecordId  ItemId  PropertyA  PropertyB  PropertyC  PropertyD  PropertyE
1         1       Chocolate  Electric   Round      Silver     Hebrew
1         2       Raspberry  Steam      Trapezoid  Brass      Esperanto
1         3       Durian     Gravity    Bezier     Titanium   Bahasa Melayu
2         1       Vanilla    Solar      Rhombus    Copper     Pashto
Of course you could normalize even further and have just a single Property column, but the above seems perhaps enough when you know each item has exactly the same five properties.
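As a rough illustration of the two-sheet layout, the sketch below flattens nested records into one list of user rows and one list of item rows. The input shape (a dict with "UserName", "Email", and "Items" keys) and the function name normalize are assumptions for the example; writing the rows to an actual spreadsheet is left out.

```python
def normalize(records):
    """Split nested records into (user_rows, item_rows) keyed by RecordId."""
    user_rows = []
    item_rows = []
    for record_id, record in enumerate(records, start=1):
        # Sheet 1: one row per record.
        user_rows.append((record_id, record["UserName"], record["Email"]))
        # Sheet 2: one row per item, carrying the RecordId as a foreign key.
        for item_id, properties in enumerate(record["Items"], start=1):
            item_rows.append((record_id, item_id, *properties))
    return user_rows, item_rows


if __name__ == "__main__":
    sample = [
        {
            "UserName": "Bobby",
            "Email": "bobby@example.com",
            "Items": [
                ["Chocolate", "Electric", "Round", "Silver", "Hebrew"],
                ["Raspberry", "Steam", "Trapezoid", "Brass", "Esperanto"],
            ],
        },
    ]
    users, items = normalize(sample)
    for row in users + items:
        print(row)
```

Each record only produces as many item rows as it has items, so nothing is wasted on unused columns.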

Generate pairs from list that hasn't already historically existed

I'm building a pairing system that is supposed to pair up two users and schedule them in a meeting. The selection is based on a criterion that I am having a hard time implementing: the two users must not have been matched with each other before.
My input is a list of size n that contains email addresses. This list is supposed to be split into pairs. The restriction is that each pair has not occurred previously.
So for example, my list would contain a couple of user ids
list = {1,5,6,634,533,515,61,53}
At the same time I have database tables where the old pairs are stored:
previous_pairs
---------------------
id  date                     status
1   2016-10-14 12:52:24.214  1
2   2016-10-15 12:52:24.214  2
3   2016-10-16 12:52:24.214  0
4   2016-10-17 12:52:24.214  2
previous_pair_users
---------------------
id  userid
1   1
1   5
2   634
2   553
3   515
3   61
4   53
4   1
What would be a good approach to solve this problem? My test solution right now is to pop two random users and check them for a previous match. If a previous match exists, I pop a new random user (if possible) and push one of the rejected users back onto the list. If the two people are the last ones left, they get matched anyhow. This doesn't sound good to me, since I should be able to predict which matches cannot occur based on my list of already existing pairs.
Do you have any ideas on how to get me going with building this procedure? Java 8 streams look interesting and might be a way to solve this, but I am very new to them, unfortunately.
The solution here was to create a list of tuples that contain the old matches, using the group_concat feature of MySQL:
SELECT group_concat(MatchProfiles.ProfileId) FROM Matches
INNER JOIN MatchProfiles ON Matches.MatchId = MatchProfiles.MatchId
GROUP BY Matches.MatchId
old_matches = ((42,52),(12,52),(19,52),(10,12))
After that I select the candidates and generate a new list of tuples using my pop_random()
new_matches = ((42,12),(19,48),(10,36))
When both lists are done I look at the intersection to find any duplicates
duplicates = list(set(new_matches) & set(old_matches))
If there are duplicates, I simply run the randomizer again, for up to X attempts, until I conclude that a valid pairing is impossible.
I know that this is not very efficient for a large set of numbers, but my dataset will never be that large, so I think it will be good enough.
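The retry approach described above can be sketched roughly as follows. One caveat with comparing tuples directly is that (42, 12) and (12, 42) count as different, so this sketch stores each pair as a frozenset to make the comparison order-insensitive. The function name pair_up and the max_attempts parameter are assumptions for illustration, standing in for pop_random() and the "X attempts" limit.

```python
import random


def pair_up(users, old_matches, max_attempts=1000, rng=random):
    """Randomly pair users, avoiding any pair listed in old_matches.

    Returns a list of (a, b) tuples, or None if no valid pairing was found
    within max_attempts. With an odd number of users, one is left unpaired.
    """
    # Unordered pairs, so (1, 5) and (5, 1) are treated as the same match.
    old = {frozenset(pair) for pair in old_matches}
    pool = list(users)
    for _ in range(max_attempts):
        rng.shuffle(pool)
        # Take neighbours in the shuffled list as candidate pairs.
        pairs = [frozenset(pool[i:i + 2]) for i in range(0, len(pool) - 1, 2)]
        if not any(pair in old for pair in pairs):
            return [tuple(sorted(pair)) for pair in pairs]
    return None


if __name__ == "__main__":
    users = [1, 5, 6, 634, 533, 515, 61, 53]
    old_matches = [(1, 5), (634, 553), (515, 61), (53, 1)]
    print(pair_up(users, old_matches))
```

Giving up after a fixed number of attempts does not prove that no valid pairing exists; it only means none was found by random shuffling.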

Mahout recommender returns no results for a user

I'm curious why in the example below the Mahout recommender isn't returning a recommendation for user 1.
My input file is below. I added blank lines to enhance readability. This file will need the blank lines removed before it's run through Mahout.
The columns in this file are:
User ID | item number | item rating
1 101 0
1 102 0
1 103 5
1 104 0

2 101 4
2 102 5
2 103 4
2 104 0

3 101 0
3 102 5
3 103 5
3 104 3
You'll note that item 103 is the only common item that all 3 users rated.
I ran:
hadoop jar C:\hdp\mahout-0.9.0.2.1.3.0-1981\core\target\mahout-core-0.9.0.2.1.3.0-1981-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_COOCCURRENCE --input small_data_set.txt --output small_data_set_output
The Mahout recommendation output file shows:
2 [104:4.5]
3 [101:5.0]
Which I believe means:
User 2 would be recommended item 104. Since user 3 rated item 104 a 3 this may account for the 4.5 recommendation score vs. the result below…
User 3 would be recommended item 101. Since user 2 rated item 101 a "4" this may account for a slightly higher recommendation score of 5.
Is this correct?
Why isn't user 1 included in the recommendation output file? User 1 could have received a recommendation for Item 102 because user 2 and user 3 rated it. Is the data set too small?
Thanks in advance.
Several mistakes may be present in your data; the first two here will cause undefined behavior:
IDs must be contiguous non-negative integers starting at 0, so you need to map your IDs above somehow. Your user ID 1 becomes Mahout user ID 0. The same goes for items: your item ID 101 becomes Mahout item ID 0.
You should omit the 0 values from the input altogether if you mean that the user has expressed no preference; this makes the preference "undefined" in a sense. To do this, omit the lines entirely.
Always use SIMILARITY_LOGLIKELIHOOD; it is widely measured as doing significantly better than the other methods unless you are trying to predict ratings, in which case use cosine.
If you use LLR similarity you should omit the values, since they will be ignored.
There are very few uses for preference values unless you are trying to predict a user's rating for an item. The preference weights are useless in determining recommendation ranking, which is the typical thing to optimize. If you want to recommend the right things in the right order toss the values and use LLR.
The other thing that people sometimes do with values is show some weight of preference so 1 = a view of a product page and 5 = a product purchase. This will not work! I tried this with a large ecommerce dataset and found the recommendations were worse when adding in product views, even though there was 100 times more data. They are fundamentally different user actions with different user intent and so can't be mixed in this way.
If you really do want to mix different actions, use the new multimodal recommender based on Mahout, Spark, and Solr described on the Mahout site. It allows cross-cooccurrence type indicator calculations, so you can use user location, likes and dislikes, views and purchases. Virtually the entire user clickstream can be used, but only with cross-cooccurrence correlating each action to the canonical "best" action, the one you want to recommend.
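The first two points (dropping the 0 ratings and remapping IDs to contiguous integers starting at 0) could be handled by a small preprocessing step before the file is handed to Mahout. This is a sketch under assumptions: the function name remap_ids is hypothetical, the input is taken to be whitespace-separated as shown in the question, and how the rows are written back out depends on the delimiter your job expects.

```python
def remap_ids(lines):
    """Drop zero ratings and remap user/item IDs to contiguous integers from 0.

    Returns (rows, user_map, item_map), where rows holds
    (mahout_user_id, mahout_item_id, rating) tuples.
    """
    user_map = {}  # original user ID -> contiguous ID
    item_map = {}  # original item ID -> contiguous ID
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        user, item, rating = parts
        if float(rating) == 0:
            continue  # 0 is treated as "no preference expressed", so omit it
        # Assign the next free ID the first time an original ID is seen.
        user_id = user_map.setdefault(user, len(user_map))
        item_id = item_map.setdefault(item, len(item_map))
        rows.append((user_id, item_id, rating))
    return rows, user_map, item_map


if __name__ == "__main__":
    data = """
    1 101 0
    1 102 0
    1 103 5
    1 104 0

    2 101 4
    2 102 5
    2 103 4
    2 104 0
    """.splitlines()
    rows, user_map, item_map = remap_ids(data)
    for row in rows:
        print(*row)
```

Keeping user_map and item_map around makes it possible to translate the IDs in the recommendation output back to the original ones.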

How can I count the nested replies?

I have a table structure like this.
--comments
id  article_id  comment_parent
1   9           0
2   0           1
3   0           1
4   0           2
5   0           4
Basically, a top-level comment is attached to an article through article_id, and a reply is attached to its parent comment through comment_parent. The rows above create nested comments like this:
- Comment 1
  - Comment 2
    - Comment 4
      - Comment 5
  - Comment 3
The problem is, I couldn't find how to determine how many comments are on the article. Right now, article 9 has 5 comments.
I believe a recursive function would solve this issue, but my Eloquent experience is pretty basic.
How can I do something like this?
Article::find(9)->getAllCommentsAmount(); //5
I would suggest adding the article_id to the child comments as well, if that is allowed. It will make it easier to count the comments for a certain article.
You may want to look at implementing a Nested Set pattern in your database.
There is a fairly popular Laravel/Eloquent implementation available here:
https://github.com/etrepat/baum
Nested Sets are specifically designed for data that is heavily nested like yours, and allow you to query your data quickly and easily (i.e. without heavy recursion).
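If the schema has to stay as it is, the count can also be obtained by walking the parent links. The sketch below is plain Python rather than Eloquent, and it assumes all comment rows have already been loaded as (id, article_id, comment_parent) tuples; count_comments is a hypothetical name. It uses a queue instead of recursion, so deep nesting is not a problem.

```python
from collections import defaultdict, deque


def count_comments(rows, article_id):
    """Count every comment under an article, including nested replies."""
    children = defaultdict(list)  # parent comment id -> ids of direct replies
    roots = []                    # top-level comments on the requested article
    for comment_id, row_article_id, parent_id in rows:
        if parent_id == 0:
            if row_article_id == article_id:
                roots.append(comment_id)
        else:
            children[parent_id].append(comment_id)

    # Breadth-first walk from the top-level comments down through all replies.
    count = 0
    queue = deque(roots)
    while queue:
        comment_id = queue.popleft()
        count += 1
        queue.extend(children[comment_id])
    return count


if __name__ == "__main__":
    comments = [(1, 9, 0), (2, 0, 1), (3, 0, 1), (4, 0, 2), (5, 0, 4)]
    print(count_comments(comments, 9))
```

Loading every row works for small tables; for large ones, storing article_id on the replies or using a nested set as suggested above avoids the full scan.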

Crystal reports foreach loop in the formula field

Please help me get the count of profile IDs for each category in a Crystal Reports formula field.
I need to display something like: 2 people registered for Electrical Category
This is my SQL query result in the report.
ProID  CATID  Description
1      2      Inspection
1      4      Fabric Maintenance
1      6      Electrical
1      10     General Qualifications
3      6      Electrical
3      10     General Qualifications
4      1      QA /QC Vendor Inspection
6      1      QA /QC Vendor Inspection
11     1      QA /QC Vendor Inspection
12     1      QA /QC Vendor Inspection
12     2      Inspection
12     3      Coatings Inspection
12     10     General Qualifications
Thanks in advance
First, create a new group on the category; you can then use a summary function such as sum(,) on that group to meet the above requirement.
http://answers.yahoo.com/question/index?qid=20080918181105AAsqVoC
http://social.msdn.microsoft.com/Forums/en-US/vscrystalreports/thread/9da5e3fb-a2ea-4337-b098-33f2b142fd0c/
Conditional group SUM in Crystal Reports
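To make the grouping logic concrete, here is a small sketch of what a per-category count computes. It is plain Python, not Crystal Reports formula syntax, and the function name count_profiles_per_category is hypothetical; it simply groups the query rows by category description and counts the distinct profile IDs in each group.

```python
from collections import defaultdict


def count_profiles_per_category(rows):
    """Map each category description to the number of distinct ProIDs in it."""
    profiles = defaultdict(set)
    for pro_id, cat_id, description in rows:
        profiles[description].add(pro_id)  # a set ignores repeated ProIDs
    return {description: len(ids) for description, ids in profiles.items()}


if __name__ == "__main__":
    rows = [
        (1, 2, "Inspection"), (1, 4, "Fabric Maintenance"), (1, 6, "Electrical"),
        (1, 10, "General Qualifications"), (3, 6, "Electrical"),
        (3, 10, "General Qualifications"), (4, 1, "QA /QC Vendor Inspection"),
        (6, 1, "QA /QC Vendor Inspection"), (11, 1, "QA /QC Vendor Inspection"),
        (12, 1, "QA /QC Vendor Inspection"), (12, 2, "Inspection"),
        (12, 3, "Coatings Inspection"), (12, 10, "General Qualifications"),
    ]
    for description, count in count_profiles_per_category(rows).items():
        print(f"{count} people registered for {description} Category")
```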

Resources