Sorting sets Q of a set S such that all members are represented without bias (shuffling music titles of different artists within a directory) - algorithm

Starting position:
My wife has a collection of music titles stored on her walkman. With the help of a tool I wrote plus some manual assistance, the files are all stored in a single directory in the format
Artist1 - Title1
Artist1 - Title2
...
Artist1 - TitleN
Artist2 - Title1
...
ArtistM - TitleZ
Some artists have as few as a handful of titles (one features just a single title), while others sport as many as 56. All in all, the growing collection now encompasses a four-digit number n of titles (about 1200).
These titles are all sorted alphabetically within that single directory, so that a title can be found quite easily, mainly for the purpose of removing a now-disliked title from the collection.
This said, it should be obvious that this collection is meant to be played randomly, using the walkman's built-in shuffle mode.
Problem:
The shuffle mode of this walkman is really not worth the name.
It is quite common that songs from the same artist (stored sequentially in the folder) are played consecutively, not rarely even three in a row. This behaviour is certainly not perceived as "random".
Approach:
My approach was to pre-shuffle these titles programmatically, to be played sequentially. The strategy was to ensure that no artist plays two of her songs in a row.
For this I let my program count the titles present for each artist, to determine the artist with the most titles, and found m = 56. From that number I programmatically find the next prime, p = 59.
Now the idea was to iterate through the original alphabetically sorted list with a step width of p, taking indices modulo n, so that eventually all titles get copied in an order that does not expose two consecutive titles of the same artist.
This works as expected.
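For reference, the stride construction can be sketched like this (a hypothetical reimplementation, not the original program): since p is prime and chosen not to divide n, the indices (i * p) mod n visit every slot exactly once, and a stride longer than any artist's run keeps same-artist titles apart.

```python
def next_prime(m):
    """Smallest prime >= m (trial division is fine at this scale)."""
    def is_prime(k):
        return k >= 2 and all(k % d for d in range(2, int(k ** 0.5) + 1))
    while not is_prime(m):
        m += 1
    return m

def prime_stride_shuffle(titles, max_per_artist):
    """Reorder titles by visiting index (i * p) mod n, with p prime and p > max_per_artist."""
    n = len(titles)
    p = next_prime(max_per_artist + 1)
    while n % p == 0:      # p must not divide n, or the stride won't visit every slot
        p = next_prime(p + 1)
    return [titles[(i * p) % n] for i in range(n)]

# Small example: 4 artists with 3 titles each, sorted alphabetically.
titles = [f"Artist{i // 3} - Title{i % 3}" for i in range(12)]
playlist = prime_stride_shuffle(titles, 3)
```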
However, after about n/p (currently about 30) titles, the artists repeat (with different titles, but still giving the impression "just heard her"), while many other artists featuring fewer than m titles are almost never played.
Question:
Is there a mathematical approach to improve my psychologically quite naive one, so as to play as many titles from different artists as possible before playing another arbitrary title of an artist already heard (even at the expense of playing underrepresented artists more often) - other than trivially keeping track of every artist's already-copied titles and duplicating titles in the playlist?

Related

Create a Dynamic Array formula (Excel) to combine multiple results columns into one column that is filtered & sorted using multiple criteria?

The sample data in the image below is collected from a round robin tournament.
There is a Round column, plus Home team & Away team columns listing who is playing whom. A team could be either Home or Away.
For each match in a round (including any "Bye" match) the number of games won for the Home and Away team are recorded in separate columns respectively.
"Ff" = forfeit and has a value of 0. "Bye" result is left blank (at this stage).
Output columns are "Won, Lost, Round".
Required output (shown in the image) is, for any selected team, the top n most-games-won matches (from both Home & Away) sorted in descending order and then the corresponding games lost but sorted in ascending order where the games won are equal. Finally show the rounds where those scores occurred.
These are the challenges I've faced in going from data to output in one step using dynamic array formula:
Collating/combining the Win results into one column; likewise the Losses.
Getting the array to ignore blanks or convert "Ff" to 0 without getting #NUM or #VALUE errors.
Ensuring that, if I used separate single-column arrays, the corresponding Loss and Round matched the Win result.
"Round, Won, Lost" would also be acceptable, but I wasn't able to get the Dynamic Array capability to give the required output in that order.
SUMPRODUCT, INDEX(MATCH), SORT(FILTER) functions all hint at a possible one step formula solution.
Solutions are numerous for sorting & filtering where the existing values are already in one column. One solution that dealt with two columns of values was somewhat useful: How to get the highest values from 2 columns in excel - Stackoverflow, 2013.
Many other responses are around the use of concatenation, combining/merging array sets, aggregation etc.
My workaround solution is to use a Helper Sheet to combine the Wins from the separate results columns and convert blanks & "Ff" to -1; likewise for Losses. I use this formula for each line:
=IF($C5=L$2,IF($F5="",-1,IF($F5="Ff",0,$F5)),IF($D5=L$2,IF($G5="",-1,IF($G5="Ff",0,$G5)),-1))
Example Helper Sheet
To get the final output the Dynamic Array formula was used on the Helper Sheet data
=SORT(FILTER(L$26:N$40,L$26:L$40>=LARGE(L$26:L$40,$J$3),""),{1,2},{-1,1},FALSE)
I'm trying to avoid using pivottable, VBA solutions. Powerquery possible but not preferred.
Apologies for the screenshots, but I couldn't work out how to attach the sample spreadsheet file. (Unfortunately Stackoverflow Help didn't help me figure out whether or how to do this.)
Based on the comments I changed my answer with a different approach:
=LET(data,A5:F19,
round,INDEX(data,,1),
ha,CHOOSECOLS(data,3,4),
HAwonR,CHOOSECOLS(data,5,6,1),
w,BYROW(ha,LAMBDA(h,IFERROR(XMATCH(L2,h),0))),
clm,CHOOSE(w,{1,2},{2,1}),
srtwon,DROP(REDUCE(0,SEQUENCE(ROWS(data)),LAMBDA(y,z,VSTACK(y,INDEX(HAwonR,z,HSTACK(INDEX(clm,z,),3))))),1),
res,FILTER(srtwon,w),
TAKE(SORT(res,{1,2},{-1,1}),J3))
Old answer:
=LET(data,A5:F19,
round,INDEX(data,,1),
home,INDEX(data,,3),
away,INDEX(data,,4),
HAwonR,CHOOSECOLS(data,5,6,1),
w,MAP(home,away,LAMBDA(h,a,OR(h=L2,a=L2))),
won,FILTER(HAwonR,w),
TAKE(SORT(won,{1,2},{-1,1}),J3))
In your example you selected round 3 for the third result, but that match wasn't won, so I guess that was a mistake.
As you can see, using LET avoids helpers. LET allows you to create names (helpers) that are stored, and because you can name them, complex formulas become more readable.
Basically what it does is filter the columns Home, Away and Round (in that order) for either Home or Away equal to the team in cell L2. That result is sorted by column 1 descending and column 2 ascending. Then the number of rows given in cell J3 is displayed from that sorted array.
Here is my solution, based on the excellent contribution by @P.b. Thank you, much appreciated.
The wins (likewise losses) required mapping the presence of the team in question as hT (home team) to the games it won (hG), and adding to that a second mapping of the games it won (aG) when it was the away team (aT). Essentially what was being done on the Helper Sheet. The result was a one-column array for game wins and a one-column array for game losses.
In the process I was able to convert the "Ff" text to 0. I attempted it without the conversion and it threw an error.
Instead of CHOOSECOLS I used HSTACK to create the new array (wins, losses & round) for the FILTER, SORT, TAKE to work on.
If it could be made more concise, that is the next challenge. Overall (not just my solution), this exercise has provided greater flexibility and solved the problems stated. I'm happy!
=LET(data,A5:G19,
round,INDEX(data,,1),
hT,INDEX(data,,3),
aT,INDEX(data,,4),
hG,INDEX(data,,6),
aG,INDEX(data,,7),
wins,MAP(hG,
MAP(hT,LAMBDA(h,h=L2)),
LAMBDA(w,t,IF(w="Ff",0,w)*IF(t=TRUE,1,0))) +
MAP(aG,
MAP(aT,LAMBDA(a,a=L2)),
LAMBDA(w,t,IF(w="Ff",0,w)*IF(t=TRUE,1,0))),
losses,MAP(aG,
MAP(hT,LAMBDA(h,h=L2)),
LAMBDA(w,t,IF(w="Ff",0,w)*IF(t=TRUE,1,0))) +
MAP(hG,
MAP(aT,LAMBDA(a,a=L2)),
LAMBDA(w,t,IF(w="Ff",0,w)*IF(t=TRUE,1,0))),
HAwonR,HSTACK(wins,losses,round),
w,MAP(hT,aT,LAMBDA(h,a,OR(h=L2,a=L2))),
won,FILTER(HAwonR,w),
TAKE(SORT(won,{1,2},{-1,1}),J3))

evenly distributing duplicates throughout a randomized list (playlist shuffle)

I need to shuffle about 1,000 strings of the form "title - artist", where a number of the titles repeat (say "Silent Night"), and a number of the artists repeat (say, "Bing Crosby"). None of the "title - artist" combos repeat, and there are no additional hyphens.
I'd like to end up with a list with as much space as possible between identical titles and between identical artists.
I'm leaning toward just randomly shuffling the whole list many thousands of times, and keeping whichever one has the greatest distance between the closest pair of identical repeats.
Another brute force: shuffle (just once), then repeat tons of times: find the closest pair and swap one of its members into a different random spot.
Does one seem better than the other? Anything a little smarter, but still easy?
Thanks a ton!
Perfect John Coleman! Thank you!
Identical question, with suggestions, here:
https://softwareengineering.stackexchange.com/q/194480/233981
------------ EDIT ------------
I couldn't be happier with my result!
Contrary to the OP (me), I ignored repeated artists. Two "Bing Crosby"s in a row really doesn't matter compared with two "White Christmas"es.
I took the most popular song and spread it evenly across an unpopulated array. Then I took the second most popular song and did the same, incrementing past any collisions with #1. I repeated this through the entire list, with songs that have no repeats filling in the last empty slots.
So every 34th track is (a different) "Silent Night" (#1), which is impossible to detect as a listener. Every 46th track is "White Christmas" (#2, starting at a different location, with collisions bumped to 1 past "Silent Night")....
For the listener it couldn't be more random, and has zero of the gotcha/repeats I always get with rand(), noise(), and all of their sisters. ;)
Something like this:
populate the result list with null's
for each unique title, sorted from greatest # of repeats -> fewest # of repeats {
perfectSpacing = list.size() div (number of times this title repeats)
i = random(list.size()) // random starting location for this title
for each unique version of this title {
while (list[i] != null) // bump past populated slots, wrapping around
i = (i + 1) % list.size()
list[i] = current version of the current title
i = (i + perfectSpacing) % list.size()
// jump the ideal distance for the next version of this title
}
}
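The pseudocode above, as a runnable sketch in Python (assuming "title - artist" strings and, as noted below, few enough repeats that bumping never wraps into a back-to-back pair):

```python
import random

def spread_shuffle(tracks):
    """Place each title's versions evenly around the result, most-repeated first."""
    n = len(tracks)
    result = [None] * n
    groups = {}
    for t in tracks:
        groups.setdefault(t.split(" - ")[0], []).append(t)  # group versions by title
    # Most-repeated titles go first, so they get first pick of evenly spaced slots.
    for title, versions in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        spacing = n // len(versions)          # ideal gap between this title's versions
        i = random.randrange(n)               # random starting location for this title
        for version in versions:
            while result[i] is not None:      # bump past populated slots, wrapping around
                i = (i + 1) % n
            result[i] = version
            i = (i + spacing) % n             # jump the ideal distance for the next version
    return result
```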
Although "bumping past collisions" introduces the risk of wrapping around and creating a back-to-back repeat, the number of repeats in the data set is too small to allow it. Almost half of the titles don't repeat at all, and for those that do, the curve from "lots of repeats" to "only a few repeats" falls off very rapidly.
Hope that helps somebody some day!

Removing duplicate chess games and then storing unique games in Postgresql

I have a very large number of chess games (around 5 million) stored in several PGN (Portable Game Notation) files. If you aren't familiar with PGN, the result when parsed will basically be a CSV file, with several fields containing info about the players, location, etc., and then one larger text field with the moves separated by some delimiter, possibly a space. There will be one row of such data per game.
The catch is there may be duplicate games. Ultimately, I would like to store the unique set in Postgres, but what is the best way to get there? I had two approaches in mind:
1. Insert one game at a time, and with each subsequent insert run a uniqueness-test script that only inserts the game if it is unique. Of course I would index fields as necessary to optimize this process (should I index all fields, or just the 'cheap' ones like rating, which are just integers?).
2. Do a batch insert from the generated CSV and only then check for duplicates. The algorithm I was thinking of was to loop through the ids 1..(# games), find the game with that unique id in Postgres (if not already deleted), then look forward for all games that are identical, delete all but one, and move on to the next id/game.
The second method would insert much faster but would require searching n games on each pass. The first would insert more slowly, but would only search through n/2 games on average. What are people's expectations about the efficiency of each approach?
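A possible middle ground between the two approaches: compute a hash of the whitespace-normalized move text for each game and let that decide uniqueness, either in application code before the batch insert or via a unique index so Postgres itself rejects duplicates per row. A sketch (field, table, and column names are illustrative):

```python
import hashlib

def game_key(moves):
    """Stable key for a game: hash of the whitespace-normalized move text."""
    return hashlib.sha256(" ".join(moves.split()).encode()).hexdigest()

def dedupe(rows, moves_field="moves"):
    """Yield only the first occurrence of each distinct game (field name illustrative)."""
    seen = set()
    for row in rows:
        k = game_key(row[moves_field])
        if k not in seen:
            seen.add(k)
            yield row

# The same idea inside Postgres: store the hash and let a unique index do the test,
#   CREATE UNIQUE INDEX games_moves_hash ON games (moves_hash);
#   INSERT INTO games (...) VALUES (...) ON CONFLICT (moves_hash) DO NOTHING;
```

This keeps the fast batch-style loading of approach 2 while doing the per-game uniqueness test of approach 1 in O(1) per game instead of a forward scan.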

Efficiently and dynamically rank many users in memory?

I run a Java game server where I need to efficiently rank players in various ways. For example, by score, money, games won, and other achievements. This is so I can recognize the top 25 players in a given category to apply medals to those players, and dynamically update them as the rankings change. Performance is a high priority.
Note that this cannot easily be done in the database only, as the ranks will come from different sources of data and different database tables, so my hope is to handle this all in memory, and call methods on the ranked list when a value needs to be updated. Also, potentially many users can tie for the same rank.
For example, let's say I have a million players in the database. A given player might earn some extra points and instantly move from 21,305th place to 23rd place, and then later drop back off the top 25 list. I need a way to handle this efficiently. I imagine that some kind of doubly-linked list would be used, but am unsure of how to handle quickly jumping many spots in the list without traversing it one at a time to find the correct new ranking. The fact that players can tie complicates things a little bit, as each element in the ranked list can have multiple users.
How would you handle this in Java?
I don't know whether there is a library that may help you, but I think you can maintain a minimum heap in memory. When a player's points update, compare them to the root of the heap: if less, do nothing; otherwise adjust the heap.
That means you can maintain a minimum heap of 25 nodes holding the top 25 of all the players in one category.
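A minimal sketch of this bounded min-heap idea, in Python for brevity (the same scheme maps to Java's PriorityQueue). Note it only handles scores going up: if a player already inside the heap changes score, you would additionally need to remove or lazily invalidate their old entry.

```python
import heapq

def update_top(heap, player, score, k=25):
    """Maintain a size-k min-heap of (score, player) pairs.

    The root heap[0] holds the lowest score still inside the top k, so a new
    score only needs to beat the root to enter the leaderboard.
    """
    if len(heap) < k:
        heapq.heappush(heap, (score, player))
    elif score > heap[0][0]:
        # Drop the current minimum and insert the new entry in one O(log k) step.
        heapq.heapreplace(heap, (score, player))

# Usage: stream through score updates; the heap never exceeds 25 entries.
top25 = []
for i in range(1000):
    update_top(top25, f"player{i}", i * 7 % 1000)
leaderboard = sorted(top25, reverse=True)  # highest score first
```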
Forget linked lists. They allow fast insertions but no efficient searching, so they're of no use.
Use the following data:
double threshold;
ArrayList<Player> top;
ArrayList<Player> others; // (3)
and manage the following properties
each player in top has a score greater or equal to threshold
each player in others has a score lower than threshold
top is sorted
top.size() >= 25
top.size() < 25 + N where N is some arbitrary limit (e.g., 50)
Whenever some player raises their score, do the following:
if they're in top, sort top (1)
if they're in others, check if their score promotes them to top
if so, remove them from others, insert in top, and sort top
if top grew too big, move the N/2 worst players from top to others and update threshold
Whenever some player lowers their score, do the following:
- if they're in others, do nothing
- if they're in top, check if their new score allows them to stay in top
- if so, sort top (1)
- otherwise, demote them to others, and check if top got too small
- if so, determine an appropriate new threshold and move all corresponding players to top. (2)
(1) Sorting top is cheap as it's small. Moreover, TimSort (i.e., the algorithm behind Arrays.sort(Object[])) works very well on partially sorted sequences. Instead of sorting, you can simply remember that top is unsorted and sort it later when needed.
(2) Determining a proper threshold can be expensive, and so can moving the players. That's why only N/2 players get moved away when top grows too big. This leaves some spare players and makes this case pretty improbable, assuming that players rarely lose score.
EDIT
For managing the objects, you also need to be able to find them in the lists. Either add a corresponding field to Player or use a TObjectIntHashMap.
EDIT 2
(3) When removing an element from the middle of others, simply replace it with the last element and shorten the list by one. You can do this because the order doesn't matter, and you must do it for speed. (4)
(4) The whole others list needn't actually be stored anywhere. All you need is the ability to iterate over all players not contained in top. This can be done with an additional Set, or by simply iterating through all the players and skipping those scoring above threshold.
FINAL RECOMMENDATIONS
Forget the others list (unless I'm overlooking something, you won't need it).
I guess you will need no TObjectIntHashMap either.
Use a list top and a boolean isTopSorted, which gets cleared whenever a top score changes or a player gets promoted to top (simple condition: oldScore >= threshold | newScore >= threshold).
For handling ties, make top contain at least 25 differently scored players. You can check this condition easily when printing the top players.
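These final recommendations can be sketched as follows (Python for brevity; `sorted` plays the role of the isTopSorted flag, and the trim size is illustrative):

```python
class TopList:
    """Sorted-on-demand top list with a score threshold, per the scheme above."""

    def __init__(self, scores, size=25):
        self.scores = scores          # player -> score, the authoritative store
        self.size = size
        self.top = sorted(scores, key=scores.get, reverse=True)[:size]
        self.threshold = min(scores[p] for p in self.top) if self.top else 0
        self.sorted = True            # the isTopSorted flag

    def on_score_change(self, player, old, new):
        self.scores[player] = new
        # Simple condition from above: old or new score reaches the threshold.
        if old >= self.threshold or new >= self.threshold:
            if player not in self.top:
                self.top.append(player)   # provisional promotion
            self.sorted = False           # defer sorting until the top is read

    def get_top(self):
        if not self.sorted:
            self.top.sort(key=self.scores.get, reverse=True)
            del self.top[self.size:]      # trim demoted players
            self.threshold = self.scores[self.top[-1]]
            self.sorted = True
        return list(self.top)
```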
I assume you may use plenty of memory for this, or that memory is not a concern for you. Now, as you want only the top 25 entries for any category, I would suggest the following:
Have a HashSet of Player objects. Player objects have the info like name, games won, money etc.
Now have a HashMap of category name vs a TreeSet of the top 25 player objects in that category. The category name may be a checksum of some columns, say gameswon, money, achievement etc.
HashMap<String /*for category name */, TreeSet /*Sort based on the criteria */> ......
Whenever you update a player object, you update the common HashSet first and then check whether the player object is a candidate for the top 25 entries in any of the categories. If it is a candidate, some other player object may unfortunately lose their ranking and hence get kicked out of the corresponding TreeSet.
>> if you make the TreeSet sorted by the score, it'll break whenever the score changes (and the player will not be found in it)
Correct. Now I get the point :). So, I will do the following to mitigate the problem. The player object will have a field indicating which categories it is already in - basically a set of categories. While updating a player object, we check whether the player is already in some categories; if so, we rearrange the corresponding TreeSets first, i.e. remove the player object, adjust the score, and add it back to the TreeSet. Whenever a player object is kicked out of a category, we remove that category from the field holding the set of categories the player is in.
Now, what do you do if the look-up is done with a brand-new search criterion (meaning the top 25 has not been computed for this criterion already)?
Traverse the HashMap and build the top entries for this category from scratch. This will be an expensive operation, just like indexing something afresh.

Multi Attribute Matching of Profiles

I am trying to solve a problem for a dating site. Here is the problem:
Each user of the app will have some attributes - like the books he reads, the movies he watches, music, TV shows etc. These are predefined top-level attribute categories. Each of these categories can have any number of values, e.g. in books: Fountain Head, Love Story ...
Now, I need to match users based on profile attributes. Here is what I am planning to do :
Store the data with reverse indexing, i.e. each of Fountain Head, Love Story etc. is an index key to the set of users with that attribute.
When a new user joins, get the attributes of this user, find which index keys apply to this user, get all the users for these keys, then bucket sort (or radix sort or similar) based on how many times each user appears in this merged list.
Is this good, bad, worse? Any other suggestions?
Thanks
Ajay
The algorithm you described is not bad, although it uses a very simple notion of similarity between people.
Let us make it more adjustable without creating complicated matching criteria. Let's say people who like the same book are more similar than people who listen to the same music. The same goes for every interest. That is, similarity in different fields has different weights.
Like you said, you can keep a list for each interest (like a book, a song etc) to the people who have that in their profile. Then, say you want to find matches of guy g:
for each interest i in g's interests:
for each person p in list of i
if p and g have mismatching sexual preferences
continue
if p is already in g's match list
g->match_list[p].score += i->match_weight
else
add p to g->match_list with score i->match_weight
sort g->match_list based on score
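That loop, as a runnable sketch in Python (field names like "preference" and the weight values are illustrative assumptions, and the mutual-preference test is deliberately simplified):

```python
def match_scores(g, interest_index, weights):
    """Rank candidates for user g by summing per-category weights of shared interests.

    interest_index maps each interest to the list of users who have it;
    weights maps a category name to its weight (both illustrative).
    """
    scores = {}
    for category, interests in g["interests"].items():
        for interest in interests:
            for p in interest_index.get(interest, ()):
                if p is g:
                    continue
                # Simplified mutual-preference check, standing in for the real rules.
                if p["preference"] != g["gender"] or g["preference"] != p["gender"]:
                    continue
                scores[p["name"]] = scores.get(p["name"], 0) + weights[category]
    return sorted(scores.items(), key=lambda kv: -kv[1])
```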
The choice of weights is not a simple task, though. You would need a lot of psychology to get it right. Using your common sense, however, you could get values that are not that far off.
In general, matching people is much more complicated than summing scores. For example, a certain set of matching interests may have more (or in some cases less) effect than the sum of them individually. Also, an interest in one may result in outright rejection from the other, no matter what other matching interests exist (take two very similar people where one loves and the other hates Twilight, for example).
