Can't distribute items between arrays - algorithm

Imagine you have a list of objects. Each object looks like:
{'itemName':'name',
'totalItemAppearance':100,
'appearancePerList': 20}
and some number X, which stands for the number of lists that can contain such items.
What I need to do is randomly pick items and put them into lists while respecting each item's parameters.
In the end I expect X lists, with each item used (across all lists) exactly 'totalItemAppearance' times, but appearing in each single list at most 'appearancePerList' times.
It looks simple, but I don't know how to build a proper algorithm for it, and I can't classify the type of "distribution problem" this is, so I can't ask Google the right question.
Thank you for your replies!

First of all, you need not consider all the different types of objects at the same time: there are no relations between different kinds of objects, so I will only consider the case where there is a single type of object.
What you want to do is pick a uniform random sample from a set of configurations satisfying some condition. The configurations here are all possible distributions of the items to the lists, and the condition is that the total number of items is 'totalItemAppearance' and that no list contains more than 'appearancePerList' items.
If 'appearancePerList' is not too small, you can apply the following algorithm (and not wait for an eternity):
--> Pick a uniform random distribution of 'totalItemAppearance' items to the lists (much easier to do)
--> If each list contains at most 'appearancePerList' items, accept
--> Otherwise, repeat
This algorithm produces the uniform samples you want. It is a special case of rejection sampling.
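A minimal Python sketch of this rejection loop, assuming the balls-into-bins reading of "uniform random distribution" (each item is placed independently into a uniformly chosen list); all names are illustrative:

import random

def distribute(total_appearance, num_lists, per_list_cap):
    # Assumes total_appearance <= num_lists * per_list_cap,
    # otherwise no valid assignment exists and the loop never ends.
    while True:
        # Throw each item into a uniformly chosen list.
        counts = [0] * num_lists
        for _ in range(total_appearance):
            counts[random.randrange(num_lists)] += 1
        # Accept only if no list exceeds the per-list cap.
        if all(c <= per_list_cap for c in counts):
            return counts

# e.g. distribute(100, 7, 20) might return [15, 14, 16, 13, 15, 14, 13]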

Related

Can we merge rankings from somewhat-similar data sets to produce a global rank?

Another way of asking this is: can we use relative rankings from separate data sets to produce a global rank?
Say I have a variety of data sets with their own rankings based upon the criterion of cuteness for baby animals: 1) Kittens, 2) Puppies, 3) Sloths, and 4) Elephants. I used pairwise comparisons (i.e., showing people two random pictures of the animals and asking them to select the cuter one) to obtain these rankings. I also have the full set of comparisons within each data set (i.e., all puppies were compared with each other in the puppy data set).
I'm now trying to merge the data sets together to produce a global ranking of the cutest animal.
The main issue with relative ranking is that the cutest animal in one set may not be the cutest in another set. For example, let's say that baby elephants are considered less than attractive, so the least cute kitten will always beat the cutest elephant. How should I get around this problem?
I am thinking of doing a few cross comparisons across data sets (Kittens vs Elephants, Puppies vs Kittens, etc.) to create some sort of baseline importance, but this may become problematic as the number and variety of animals grows.
I was also thinking of looking further into filling in sparse matrices, but I think that is only applicable within one data set, as opposed to comparing across multiple data sets?
You can achieve your task using a rating system, such as the well-known Elo or Glicko, or our own rankade. A rating system lets you build a ranking starting from pairwise comparisons, and:
you don't need to perform every possible comparison, nor does every animal need to be involved in the same number of comparisons,
comparisons don't need to stay inside a single data set (let all animals 'play' against all other animals; if you then need a ranking for one data set, just take the global ranking and ignore the animals from the others).
Using rankade (here's a comparison with the aforementioned rating systems and Microsoft's TrueSkill) you can also record outcomes for comparisons of more than two items, which Elo and Glicko cannot. Asking people to rank many items at once is messy and difficult, but a small multiple comparison (e.g. 3-5 animals) should be suitable and useful in your work.
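For illustration, a minimal Elo update for a single pairwise comparison; the K-factor of 32 and the 400-point logistic scale are the conventional defaults, not requirements:

def elo_update(rating_a, rating_b, a_won, k=32):
    # Expected score of A against B on the standard logistic curve.
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_won else 0.0
    return (rating_a + k * (score_a - expected_a),
            rating_b + k * ((1 - score_a) - (1 - expected_a)))

# e.g. a kitten (1500) is picked over an elephant (1500):
# elo_update(1500, 1500, a_won=True) -> (1516.0, 1484.0)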

Sorting a list based on multiple indices and weights

Sort of a very long-winded explanation of what I'm looking at, so I apologize in advance.
Let's consider a Recipe:
Take the bacon and weave it ...blahblahblah...
This recipe has 3 Tags
author (most important) - Chandler Bing
category (medium importance) - Meat recipe (out of meat/vegan/raw/etc categories)
subcategory (lowest importance) - Fast food (out of fast food / haute cuisine etc)
I am a new user who sees a list of randomly sorted recipes (my palate/profile isn't formed yet). I start interacting with different recipes (reading them, saving them, sharing them) and each interaction adds to my profile (each time I read a recipe, a point gets added to the respective category/author/subcategory). After a while my profile starts to look something like this:
Chandler Bing - 100 points
Gordon Ramsay - 49 points
Haute cuisine - 12 points
Fast food - 35 points
... and so on
Now, the point of this whole exercise is to sort the recipe list based on the individual user's preferences. In this case, for example, I will always see Chandler Bing's recipes at the top (regardless of category), then Ramsay's recipes. At the same time, Bing's recipes will be sorted based on my preferred categories and subcategories, so his fast food recipes appear higher than his haute cuisine ones.
What am I looking at here in terms of a sorting algorithm?
I hope that my question has enough information but if there's anything unclear please let me know and I'll try to add to it.
I would give the "Tags" with the most importance the greatest range of points. Example: give author a starting value of 50 points, with a range of 0-100 points. Give category a starting value of 25 points, with a possible range of 0-50 points. Give subcategory a starting value of 12.5 points, with a possible range of 0-25 points. That way, if the user's palate changes over time, they only have to work down from the maximum, or up from the minimum.
From there, you can simply add up the points for each "Tag", and use one of many languages' sort() methods to compare each recipe.
You can write a comparison function to use in your sort(). The point is that when you compare two recipes, you just add up each recipe's points based on its tags and do a simple comparison. That, plus whatever sorting algorithm you choose, should do just fine.
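A minimal sketch of that idea in Python, assuming each recipe is a dict of tags and the profile maps a tag value to its points (both names are illustrative):

def total_points(recipe, profile):
    # Sum the user's points for the recipe's author, category and subcategory.
    return sum(profile.get(recipe[tag], 0)
               for tag in ("author", "category", "subcategory"))

profile = {"Chandler Bing": 100, "Gordon Ramsay": 49, "Fast food": 35}
recipes = [{"author": "Gordon Ramsay", "category": "Meat recipe", "subcategory": "Haute cuisine"},
           {"author": "Chandler Bing", "category": "Meat recipe", "subcategory": "Fast food"}]
# Highest-scoring recipes first.
recipes.sort(key=lambda r: total_points(r, profile), reverse=True)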
You can use a recursively subdividing MSD sort (a kind of radix sort). It works as follows:
Take the most significant category of each recipe.
Sort the list of elements based on that category, grouping elements with the same category into one bucket (Ramsay bucket, Bing bucket etc).
Recursively sort each bucket, starting with the next category of importance (Meat bucket etc).
Concatenate the buckets together in order.
Complexity: O(kn), where k is the number of category types and n is the number of recipes.
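In Python, this most-significant-first ordering falls out of a tuple sort key, since tuples compare lexicographically; a sketch under the same assumptions as above (recipes as dicts, a profile mapping tag values to points):

def msd_sort(recipes, profile):
    # Author points dominate; category breaks ties; subcategory breaks
    # the remaining ties -- exactly the bucket-by-bucket order above.
    return sorted(recipes,
                  key=lambda r: (profile.get(r["author"], 0),
                                 profile.get(r["category"], 0),
                                 profile.get(r["subcategory"], 0)),
                  reverse=True)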
I think what you're looking for is not a sorting algorithm, but a rating scheme.
You say you want to sort by preferences. Let's assume these preferences have different "dimensions", like level of complexity, type of cuisine, etc.
These dimensions have different levels of measurement. These can be e.g. numeric or simple categories/tags. It would be your job to:
Create a scheme of dimensions and scales that can represent a user's preferences.
Operationalize real-world data to fit into this scheme.
Create a profile for the users which reflects their preferences. Same for the chefs; treat them just like normal users here.
To actually match a user to a chef (or even to another user), create a sorting callback that compares all your dimensions and checks that, in each dimension, the two profiles have a similar value (on a numeric scale) or an overlapping set of properties (on a nominal scale, like tags). Then sort the results by best match.
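A hedged sketch of such a callback, assuming numeric dimensions are normalized to [0, 1] and nominal dimensions are sets of tags; the dimension names are made up:

def match_score(user, chef):
    score = 0.0
    for dim in ("complexity",):              # numeric dimensions
        score += 1.0 - abs(user[dim] - chef[dim])
    for dim in ("cuisines",):                # nominal (tag) dimensions
        a, b = user[dim], chef[dim]
        # Jaccard overlap: shared tags relative to all tags seen.
        score += len(a & b) / len(a | b) if (a or b) else 0.0
    return score

user = {"complexity": 0.3, "cuisines": {"fast food", "meat"}}
chefs = [{"complexity": 0.9, "cuisines": {"haute cuisine"}},
         {"complexity": 0.4, "cuisines": {"fast food"}}]
chefs.sort(key=lambda c: match_score(user, c), reverse=True)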

list-of-list vs. hash-of-hashes

Setup:
I need to store feature vectors associated with string-string pairs. The string-string pairs encode an input-output relationship. There will be a relatively small number of inputs X (e.g. 5), and for each input x, there will be a relatively small number of outputs Y|x (e.g. 10).
The question is, what data structure is fastest?
Additional relevant information:
The outputs are generally different for each input, and it cannot be assumed that each X has the same number of outputs.
Lookup will be done "many" times (perhaps 1000).
Inputs will be sampled equally often, but for each input, usually one or two outputs will be accessed frequently, and the remainder will be accessed infrequently or not at all.
At present, I am considering three possibilities:
list-of-lists: access outer list with index (representing input X[i]), access inner list with index (representing output Y[i][j]).
hash-of-hashes: same as above.
flat hash: key = (input,output).
If you have strings, it's unclear how you would look up the index for a list of lists efficiently without using hashing anyway. If you can pass around something that keeps a reference to the index (e.g. if the set of outputs is fixed and you can define an enumeration of them) instead of the string, a list of lists would be faster (assuming you mean 'list' in the not-necessarily-linked sense, with O(1) element access). Otherwise you may as well hash directly and save yourself the effort.
If not, that leaves hash-of-hashes vs. flat hash. What's your access pattern like? Will you always ask for (X, Y), or will you ever need all outputs for a given X? hash(X+Y) is likely roughly equivalent to hash(X) + hash(Y) (both generally walk over all the characters to generate the hash). So individual hashes are more flexible, at a slight (almost certainly negligible) overhead. From point 3, it sounds like you might need the hash of hashes anyhow.
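A sketch of the two hash layouts as Python dicts (the keys and feature vectors are illustrative):

# hash-of-hashes: outer key is the input, inner key is the output.
nested = {"x1": {"y1": [0.1, 0.2], "y2": [0.3, 0.4]}}

# flat hash: the key is the (input, output) pair.
flat = {("x1", "y1"): [0.1, 0.2], ("x1", "y2"): [0.3, 0.4]}

vec = nested["x1"]["y1"]       # two lookups
vec = flat[("x1", "y1")]       # one lookup on a tuple key

# Getting all outputs for one input is free with the nested layout,
# but needs a scan of every key with the flat one.
outputs = nested["x1"]
outputs_flat = {k[1]: v for k, v in flat.items() if k[0] == "x1"}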

Multi-criteria sorting/distribution into sets

I'm trying to figure out an algorithm...
Input is a bunch of objects that have multiple values (e.g. 3 values per object: colour/taste/age, though there could be more).
The algorithm would then distribute the objects into a pre-defined number of sets. Each set should end up with almost the same number of objects (preferably the object count per set shouldn't differ by more than 1), while achieving as fair a distribution of values per set as possible (e.g. try to have close to the same number of red objects in each set, and likewise for the other colours, as well as for tastes and ages).
Values are tied to objects and cannot be changed. If you move an object from one set to another it brings all its values.
I found this related question: Algorithm for fair distribution of numbers into two sets
and the "number partitioning problem" suggested seems to help with single value distributions, but I'm looking for information/algorithms with multiple values per object (as described above).
Also note that the values cannot be normalized, i.e. each object cannot be totalled up into a single value.
Thank you kindly for any assistance.
IMHO, you should approach this as a clustering problem: http://en.wikipedia.org/wiki/Cluster_analysis
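One way to turn that idea into a fair split, sketched under assumptions the answer doesn't spell out: first cluster similar objects (e.g. by their colour/taste/age values), then deal each cluster out round-robin so every set receives a near-equal share of each cluster; cluster_of stands in for a hypothetical labelling function produced by whatever clustering you choose:

def fair_split(objects, num_sets, cluster_of):
    # Group objects by cluster label.
    by_cluster = {}
    for obj in objects:
        by_cluster.setdefault(cluster_of(obj), []).append(obj)
    # Deal the objects out one by one, cluster by cluster; the running
    # index keeps overall set sizes within 1 of each other.
    sets = [[] for _ in range(num_sets)]
    i = 0
    for cluster in by_cluster.values():
        for obj in cluster:
            sets[i % num_sets].append(obj)
            i += 1
    return sets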

Algorithm for grouping RESTful routes

Given a list of URLs known to be somewhat "RESTful", what would be a decent algorithm for grouping them so that URLs mapping to the same "controller/action/view" are likely to be grouped together?
For example, given the following list:
http://www.example.com/foo
http://www.example.com/foo/1
http://www.example.com/foo/2
http://www.example.com/foo/3
http://www.example.com/foo/1/edit
http://www.example.com/foo/2/edit
http://www.example.com/foo/3/edit
It would group them as follows:
http://www.example.com/foo

http://www.example.com/foo/1
http://www.example.com/foo/2
http://www.example.com/foo/3

http://www.example.com/foo/1/edit
http://www.example.com/foo/2/edit
http://www.example.com/foo/3/edit
Nothing is known about the order or structure of the URLs ahead of time. In my example, it would be somewhat easy since the IDs are obviously numeric. Ideally, I'd like an algorithm that does a good job even if IDs are non-numeric (as in http://www.example.com/products/rocket and http://www.example.com/products/ufo).
It's really just an effort to say, "Given these URLs, I've grouped them by removing what I think is the 'variable' ID part of the URL."
Aliza has the right idea: you want to look for the 'articulation points' (in REST, basically where a parameter is being passed). Looking only for a single point of change gets tricky, though.
Example
http://www.example.com/foo/1/new
http://www.example.com/foo/1/edit
http://www.example.com/foo/2/edit
http://www.example.com/bar/1/new
These can be grouped in several equally good ways, since we have no idea of the URL semantics. It really boils down to one question: is this piece of the URL a REST descriptor, or a parameter? If we know what all the descriptors are, the rest are parameters and we are done.
Given a sufficiently large dataset, we'd want to look at the statistics of all URLs at each depth, e.g. /x/y/z/t/. We would count the number of occurrences in each slot and generate a large joint probability distribution table.
We can now look at the distribution of symbols. A large number of distinct values in a slot suggests it's a parameter. We would start from the bottom and look for conditional probability events, i.e., what is the probability of x being foo, then what is the probability of y being something given x, and so on. I'd have to think more to find a systematic way of extracting these, but it seems like a promising start.
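A hedged sketch of those slot statistics: split each URL into path segments and, per depth, compare the number of distinct values against the number of occurrences; the 0.4 threshold is an arbitrary assumption:

from urllib.parse import urlparse
from collections import defaultdict

def parameter_slots(urls, threshold=0.4):
    values = defaultdict(set)   # distinct segments seen at each depth
    counts = defaultdict(int)   # how many URLs have a segment at that depth
    for url in urls:
        for depth, segment in enumerate(urlparse(url).path.strip("/").split("/")):
            values[depth].add(segment)
            counts[depth] += 1
    # Slots where almost every occurrence is a new value look like parameters.
    return {d for d in values if len(values[d]) / counts[d] > threshold}

# For the example URLs above, depth 1 ('1', '2', '3') comes out as a
# parameter slot, while 'foo' and 'edit' come out as descriptors.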
Split each URL into an array of strings, using '/' as the delimiter.
e.g. http://www.example.com/foo/1/edit will give the array [http:,www.example.com,foo,1,edit]
If two arrays (URLs) share the same value in all indices except one, they belong to the same group.
e.g. http://www.example.com/foo/1/edit = [http:,www.example.com,foo,1,edit] and
http://www.example.com/foo/2/edit = [http:,www.example.com,foo,2,edit]. The arrays match in all indices except for #3 which is 1 in the first array and 2 in the second array. Therefore, the urls belong to the same group.
It is easy to see that URLs like http://www.example.com/foo/3 and http://www.example.com/foo/1/edit will not belong to the same group according to this algorithm.
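A sketch of this rule in Python, using union-find to merge URLs transitively (note that splitting on '/' keeps an empty segment after 'http:', but since it is constant across URLs it doesn't affect the grouping):

def group_urls(urls):
    parts = [u.split("/") for u in urls]
    parent = list(range(len(urls)))

    def find(i):
        # Walk to the root representative, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def differ_in_one(a, b):
        return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

    # Merge every pair of URLs whose arrays match in all indices but one.
    for i in range(len(urls)):
        for j in range(i + 1, len(urls)):
            if differ_in_one(parts[i], parts[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i, url in enumerate(urls):
        groups.setdefault(find(i), []).append(url)
    return list(groups.values())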
