How do you group data objects subject to several constraints? - algorithm

I'm writing an application for a society on my campus, and the main job of this application is to put members into groups subject to several constraints.
I need to determine the number of groups which depends on the number of members.
Then I need to arbitrarily/randomly select leaders for each group.
Then I need to add members to each group ensuring that each group satisfies the following constraints:
The number of members (including the leader) should be less than or equal to 7 and greater than or equal to 4.
No more than 2/3 of the group should be the same gender.
No more than 2/3 of the group should be the same year of study.
Each member is classified as coming from a certain region depending on their place of residency. All members of the group should come from the same region.
Now I'd like to know how to go about this. What known data structures and abstract data types can I use? What known algorithms can come in handy? Is there already a known computer science problem similar to mine that I can read up on? I think you get the question. I've done some googling around the web but found nothing really helpful so far.
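As a starting point, the stated constraints can be written down as a validity check that any search or assignment strategy could call. This is only a minimal sketch: the field names `gender`, `year`, and `region` and both function names are assumptions made for illustration, and because every group must come from a single region, the group count would be worked out per region.

```python
import math
from collections import Counter

MIN_SIZE, MAX_SIZE = 4, 7  # group size limits from the question


def group_count_range(n_members):
    """Return (fewest, most) groups that keep every group within 4..7 members.

    If fewest > most (for example with 3 members), no valid split exists.
    """
    fewest = math.ceil(n_members / MAX_SIZE)
    most = n_members // MIN_SIZE
    return fewest, most


def is_valid_group(group):
    """Check one group (a list of dicts) against the four stated constraints."""
    size = len(group)
    if not MIN_SIZE <= size <= MAX_SIZE:
        return False
    # All members must come from the same region.
    if len({m["region"] for m in group}) != 1:
        return False
    # No single gender or year of study may exceed 2/3 of the group.
    for key in ("gender", "year"):
        most_common = Counter(m[key] for m in group).most_common(1)[0][1]
        if 3 * most_common > 2 * size:
            return False
    return True
```

With a check like this in place, one simple strategy is to split the members by region first and then build groups within each region, testing candidates with `is_valid_group`.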

Related

How to design an algorithm to put elements into groups with constraints?

I was given the task of putting students into groups (to prepare a coding camp), but with several constraints. Though I've finished the task by hand, I'd like to know whether algorithms for tasks like this already exist, or how I could design one.
Background: 40 students in total, with these attributes:
gender: F/M
grade: Year 1/2
school: School 1/School 2/...
early assessment result: Rank from 1 to 40
Constraints: All of them need to be satisfied.
Exactly 4 people per group
Each group needs to have at least a girl
Each group needs to have at least a Year 2 student
The 4 group members need to come from 4 different schools
Each group needs to have at least a student who ranked top 10 in early assessment
What I'm expecting:
The Best: An existing algorithm/program for this kind of problem
Or, An algorithm for this specific problem
Or at least, Some ideas of creating an algorithm for this specific problem
My thoughts:
Since I've succeeded in making groups by hand, I know that a solution exists for my current dataset. But an algorithm should first check whether a solution exists at all, for example by checking (via the pigeonhole principle) that the number of girls and the number of Year 2 students are each at least 10, along with some other conditions. Constraint 5 is obviously the easiest and can provide a base solution for the rest. However, I still cannot find a systematic way of doing it. Perhaps brute force and randomization can help? I'm not sure.
And sorry, since the data is confidential, I cannot post it.
Update: After consulting a friend, here is a possible method:
First put the top 1 to 10 into 10 different groups.
Then iterate through groups. If the only person in the group is a boy/girl, try to add a girl/boy from a different school.
Then the problem size is reduced from 2^40 to 2^20, making brute force a viable solution.
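The pigeonhole-style existence check mentioned above could look roughly like the following. It is a sketch under assumptions: the field names `gender`, `grade`, `school`, and `rank` and the function name are invented for illustration, and the conditions are necessary but not sufficient, so passing them does not guarantee that a valid grouping exists.

```python
from collections import Counter


def necessary_conditions(students, group_size=4, top_cutoff=10):
    """Return False if the constraints are provably impossible to satisfy."""
    n_groups, remainder = divmod(len(students), group_size)
    if not students or remainder:
        return False  # groups must contain exactly group_size people
    girls = sum(s["gender"] == "F" for s in students)
    year2 = sum(s["grade"] == 2 for s in students)
    top = sum(s["rank"] <= top_cutoff for s in students)
    # Members of a group come from different schools, so no school
    # can contribute more students than there are groups.
    largest_school = max(Counter(s["school"] for s in students).values())
    return (
        girls >= n_groups
        and year2 >= n_groups
        and top >= n_groups
        and largest_school <= n_groups
    )
```

If this returns True, a randomized or backtracking search is still needed to actually build the groups.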

Using scoring to find customers

I have a site where customers purchase items that are tagged with a variety of taxonomy terms. I want to create a group of customers who might be interested in the same items by considering the tags associated with purchases they've made. Rather than comparing a list of tags for each customer each time I want to build the group, I'm wondering if I can use some type of scoring to solve the problem.
The way I'm thinking about it, each tag would have some unique number assigned to it. When I perform a scoring operation it would render a number that could only be achieved by combining a specific set of tags.
I could update a customer's "score" periodically so that it remains relevant.
Am I on the right track? Any ideas?
Your description of the problem looks much more like a clustering or recommendation problem. I am not sure whether those tags carry enough information for clustering or recommendation, though.
Your idea of the score doesn't look promising to me, because the same sum could be achieved in several ways if those numbers aren't chosen carefully enough.
What I would suggest:
You can store the tags for each user. When a user purchases a new item, add the item's tags to the user's tags. Periodically, update the user profiles. Say we have users A and B. If at update time the similarity between A and B is greater than some threshold, add a relation between them indicating that the two users are similar. If it is lower, remove the relation (if they were previously related). The similarity could be either the number of common tags or num_common_tags / num_of_tags_assigned_either_in_A_or_B.
Later on, when you want to get users with a particular set of tags, you just run a query that checks which users have that set of tags. You can also find users similar to a given user by looking up which users are linked to that user.
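The ratio of common tags to tags assigned to either user is known as the Jaccard index. A minimal sketch of the periodic update, assuming tags are kept as Python sets in a dict keyed by user (the function names and the data layout are illustrative assumptions):

```python
from itertools import combinations


def jaccard(tags_a, tags_b):
    """Common tags divided by tags assigned to either user."""
    union = tags_a | tags_b
    if not union:
        return 0.0
    return len(tags_a & tags_b) / len(union)


def similar_pairs(user_tags, threshold):
    """Return the user pairs whose similarity is greater than the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(user_tags), 2)
        if jaccard(user_tags[a], user_tags[b]) > threshold
    ]
```

Comparing every pair is quadratic in the number of users, which is one reason to run it as a periodic batch job rather than on every purchase.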
If you assign a unique power of two to each tag, then you can sum the values corresponding to the tags, and users with the exact same sets of tags will get identical values.
red = 1
green = 2
blue = 4
yellow = 8
For example, only customers who have the set of { red, blue } will have a value of 5.
This is essentially using a bitmap to represent a set. The drawback is that if you have many tags, you'll quickly run out of integers. For example, if your (unsigned) integer type is four bytes, you'd be limited to 32 tags. There are libraries and classes that let you represent much larger bitsets, but, at that point, it's probably worth considering other approaches.
Another problem with this approach is that it doesn't help you cluster members that are similar but not identical.
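A short sketch of the power-of-two encoding, using the four example tags above (the function name is an assumption). Python integers are arbitrary precision, so the 32-tag limit applies to fixed-width integer types rather than to this snippet.

```python
TAG_BITS = {"red": 1, "green": 2, "blue": 4, "yellow": 8}


def encode(tags):
    """Combine a set of tags into one integer, one bit per tag."""
    value = 0
    for tag in tags:
        value |= TAG_BITS[tag]
    return value


# Identical tag sets always produce identical values.
red_blue = encode({"red", "blue"})

# Numerically close values can still describe unrelated customers:
# these two differ by 1 yet share no tags at all.
rgb = encode({"red", "green", "blue"})
only_yellow = encode({"yellow"})
```

The last two values illustrate the clustering drawback: the distance between scores says nothing about how many tags two customers share.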

Formulating an algorithm for a group sorting program with exclusion factors

I'm trying to formulate an equation/algorithm to solve this problem (for a program I'm writing):
Rules:
A person, p, who is to be sorted can exclude n people from the list. The excluded people cannot be in the same group as p.
The list will contain around 100-150 people.
A group should contain 5-7 people (ideally 6)
My current thoughts:
Take the list count and divide it by 6, which will give me the number of groups.
Feed people into the groups until an exclusion occurs. When this happens, try to move the mismatched persons into other groups, based on some sort of scoring system, until proper groups are formed.
However, I still feel like I need to put a limit on the number of people each person is allowed to exclude.
My question is basically how I would figure out how many people a certain person can exclude to make this endeavor possible. Considering there will be around 150 people, each with their own list of persons to exclude, is it even possible? However, some exceptions are of course allowed. Ideas and thoughts are also appreciated!
I'm planning to write the program in Java.
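The problem resembles graph colouring with a cap on group size: people are vertices, exclusions are edges, and groups are colours. The sketch below is one simple greedy heuristic, not a guaranteed solver; it is written in Python for brevity, the function name and parameters are assumptions, exclusions are treated as mutual, and every excluded name is assumed to appear in `people`. It returns None when it gets stuck, which does not prove that no valid grouping exists.

```python
def assign_groups(people, excludes, ideal=6, max_size=7):
    """Greedily place people into groups while honouring exclusions."""
    # Treat exclusions as mutual: if p excludes q, they never share a group.
    conflicts = {p: set() for p in people}
    for p, banned in excludes.items():
        for q in banned:
            conflicts[p].add(q)
            conflicts[q].add(p)

    n_groups = max(1, round(len(people) / ideal))
    groups = [[] for _ in range(n_groups)]

    # Place the most constrained people first; they have the fewest options.
    for p in sorted(people, key=lambda person: -len(conflicts[person])):
        candidates = [
            g for g in groups
            if len(g) < max_size and conflicts[p].isdisjoint(g)
        ]
        if not candidates:
            return None  # stuck; a swap or retry strategy would go here
        # Filling the smallest open group keeps sizes balanced.
        min(candidates, key=len).append(p)
    return groups
```

Running this on randomly generated exclusion lists of increasing length is one empirical way to estimate how many exclusions per person remain workable for 100 to 150 people.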

Logic Implementation: Determining availability by resource type, when a resource can belong to multiple types

Consider a hotel which has multiple room types (e.g. single, double, twin, family), and multiple rooms. Each room can be a combination of room types (e.g. one particular room can be a double/twin room).
The problem I'm facing is how to determine availability of rooms based on what is booked already. Consider a hotel with 2 rooms:
Single / Double
Double / Family
We have a basic availability of:
Single: 1
Double: 2
Family: 1
(yes, it seems like there are four rooms, but so long as the availability is at least 1, it can be assigned; that's the premise I'm working on right now)
In this way, I can sell any combination of rooms, and only when a room availability counter hits zero will it affect the other rooms. E.g. I can sell a double room, and still keep the option of single or family room available. Only when another room is sold will everything close off.
So far, so good.
Except that when I have multiple S/D rooms (two or more) and sell them separately (e.g. a single, then a double), the counter doesn't reach 0 (so I can't use that as a trigger to close off other rooms), even though I've sold the maximum number of physical rooms the hotel has.
Clearly there's some fault in my approach to how I'm determining what's available, and I'd appreciate any pointers if this issue has been resolved before (as pseudo-code for now, I'll translate to MySQL/PHP once I've got my head around it).
Thanks
I managed to resolve this eventually through SQL.
My reservations table holds a room_type_id, and a room_id. Depending on whether a room is assigned, I either join the pivot table and then room_types table, or the room_types table directly using the room_type_id. And then I just SUM() 1 for each tuple which thankfully returns the right amount when you group by room_type.id in the end.
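For bookings that name a room type but have not yet been assigned a physical room, availability can also be framed as a bipartite matching question: can every booking, plus one more of the requested type, be given its own room? The sketch below uses that different technique (augmenting-path matching), not the SQL approach described above; the function names and the data layout are assumptions.

```python
def can_assign(rooms, bookings):
    """True if every booked type can be given a distinct compatible room.

    rooms: list of sets of room types each physical room can be sold as.
    bookings: list of requested room types.
    """
    match = [None] * len(rooms)  # room index -> booking index

    def try_place(b, seen):
        for r, types in enumerate(rooms):
            if bookings[b] in types and r not in seen:
                seen.add(r)
                # Take a free room, or move its current booking elsewhere.
                if match[r] is None or try_place(match[r], seen):
                    match[r] = b
                    return True
        return False

    return all(try_place(b, set()) for b in range(len(bookings)))


def availability(rooms, bookings):
    """How many more rooms of each type could still be sold."""
    result = {}
    for room_type in sorted(set().union(*rooms)):
        extra = 0
        while can_assign(rooms, bookings + [room_type] * (extra + 1)):
            extra += 1
        result[room_type] = extra
    return result
```

For the two-room hotel in the question, `availability([{"single", "double"}, {"double", "family"}], [])` gives 1 single, 2 doubles, and 1 family. With two single/double rooms and one single plus one double already sold, every count drops to zero, which is the case the per-type counters missed.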

Sorting and merging in Stata on categorical variables

I am in the process of merging two data sets together in Stata and came up with a potential concern.
I am planning on sorting each data set in exactly the same manner on several categorical variables that are common to both sets of data. HOWEVER, several of the categorical variables have more categories present in one data set over the other. I have been careful enough to ensure that the coding matches up in both data sets (e.g. Red is coded as 1 in both data set A and B, but data set A has only Red, Green and Blue whereas data set B has Red, Green, Blue, and Yellow).
If I were to sort each data set the same way and generate an id variable (gen id = _n) and merge on that, would I run into any problems?
There is no statistical question here, as this is purely about data management in Stata, so I too shall shortly vote for this to be migrated to Stack Overflow, where I would be one of those who might try to answer it, so I will do that now.
What you describe to generate identifiers is not how to think of merging data sets, regardless of any of the other details in your question.
Imagine any two data sets, and then in each data set, generate an identifier that is based on the observation numbers, as you propose. Generating such similar identifiers does not create a genuine merge key. You might as well say that four values "Alan" "Bill" "Christopher" "David" in one data set can be merged with "William" "Xavier" "Yulia" "Zach" in another data set because both can be labelled with observation numbers 1 to 4.
My advice is threefold:
Try what you are proposing with your data and try to understand the results.
Consider whether you have something else altogether, namely an append problem. It is quite common to confuse the two.
If both of those fail, come back with a real problem and real code and real results for a small sample, rather than abstract worries.
I think I may have solved my problem - I figured I would post an answer specifically relating to the problem in case anybody has the same issue.
~~
I have two data sets: One containing information about the amount of time IT help spent at a customer and another data set with how much product a customer purchased. Both data sets contain unique ID numbers for each company and the fiscal quarter and year that link the sets together (e.g. ID# 1001 corresponds to the same company in both data sets). Additionally, the IT data set contains unique ID numbers for each IT person and the customer purchases data set contains a unique ID number for each purchase made. I am not interested in analysis at the individual employee level, so I collapsed the IT time data set to the total sum of time spent at a given company regardless of who was there.
I was interested in merging both data sets so that I could perform analysis to estimate some sort of "responsiveness" (or elasticity) function linking together IT time spent and products purchased.
I am certain this is a case of "merging" data because I want to add more VARIABLES not OBSERVATIONS - that is, I wish to horizontally elongate not vertically elongate my final data set.
Stata 12 has many options for merging: one to one, many to one, and one to many. Supposing that I treat my IT time data set as my master and my purchases data set as my using (merging) set, I would perform a "1:m", or one-to-many, merge. This is because one observation per quarter per company in the master corresponds to MANY purchases in the using data set.
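The collapse-then-merge logic can be illustrated outside Stata. The following is a plain Python sketch, not Stata code, and the field names `company`, `year`, `quarter`, and `hours` are assumptions. It merges on a genuine key (company plus fiscal period) rather than on observation numbers. With purchase rows driving the loop, each of the many purchases picks up the one matching IT total; swapping the roles of the two data sets describes the same relationship as one-to-many.

```python
from collections import defaultdict


def collapse_it_time(it_rows):
    """Sum hours per (company, year, quarter), dropping per-person detail."""
    totals = defaultdict(float)
    for row in it_rows:
        totals[(row["company"], row["year"], row["quarter"])] += row["hours"]
    return dict(totals)


def merge_many_to_one(purchases, it_totals):
    """Attach the single matching IT total to each of the many purchase rows."""
    merged = []
    for purchase in purchases:
        key = (purchase["company"], purchase["year"], purchase["quarter"])
        # Purchases with no matching IT record get None, not a dropped row.
        merged.append({**purchase, "it_hours": it_totals.get(key)})
    return merged
```

The merged result has one row per purchase and one extra variable, so the data set grows wider rather than longer.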
