Appropriate data structure for combination of options? - data-structures

I want to do something similar to a product options selector resulting in a single product variant:
For instance, maybe I want a shirt for my favorite sports team and have size, color, and team option groups. I may have various options in each group, some combinations aren’t available, and I could choose in any order. What might be a good structure for this?
I’ve considered a hash table containing various combinations of option values as a key and variant as a value. I’ve considered a trie containing ordered combinations of options leading down to a variant.
Neither really ‘feels’ right.
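For reference, the hash table idea mentioned above can be sketched roughly as follows. This is only an illustration under assumptions: the option values, the SKU names, and the helper functions are all hypothetical. Keying each variant by a frozenset of (group, value) pairs makes the order of selection irrelevant, and a partial selection is matched with a subset test.

```python
# Hypothetical catalogue: each available variant is keyed by its full
# combination of (option group, option value) pairs.
VARIANTS = {
    frozenset({("size", "M"), ("color", "red"), ("team", "Lions")}): "SKU-1",
    frozenset({("size", "L"), ("color", "red"), ("team", "Lions")}): "SKU-2",
    frozenset({("size", "M"), ("color", "blue"), ("team", "Tigers")}): "SKU-3",
}


def matching_variants(selection):
    """Return the variants compatible with a partial selection (any order)."""
    chosen = frozenset(selection.items())
    return sorted(sku for combo, sku in VARIANTS.items() if chosen <= combo)


def remaining_options(selection, group):
    """Return the values in `group` that are still available for the selection."""
    chosen = frozenset(selection.items())
    return sorted(
        {value
         for combo in VARIANTS if chosen <= combo
         for g, value in combo if g == group}
    )
```

Each lookup scans every variant, which is probably acceptable for a single product's variants but would need an index for a large catalogue.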

Related

Nested sorting in dimension hierarchies (Tableau)

I am working on a visualization in Tableau that has a dimension hierarchy (product category, product sub-category, product type, etc.) sorted descending by number of orders. I want my viz to show by default only the first product level (product category) sorted the same way, but give an option to drill down (using "+" on the dimension) to detailed product levels using nested sorting (again, descending by number of orders).
[Image: Superstore data sample]
I tried using the nested sorting option for each product level, but when I drill up and down again, the sorting is wrong again, as if it had been cleared. I cannot find an option to keep the sorts fixed unless I keep all product levels visible in the viz (without the drill-down option).
Does anyone know how I can do this? I also tried different indexing and ranking calculations, but nothing seems to work. I know there is an option to combine the hierarchy dimensions and sort on the combined field, but it makes the viz really untidy.
Thanks in advance!
Tableau will always sort based on the leftmost column. With the newer nested sorting you can more easily do a secondary sort. However, when you expand and collapse hierarchies as you describe, the sorting might not be retained.
The "classic" way to do this is to create a rank by number of orders (it sounds like you were close on this one): rank(COUNT([Order ID]),'desc'). Make this a discrete measure and put it to the left of all the other dimensions.
To clean it up, you can uncheck "Show Header" on the rank pill.
And if you expand or collapse the hierarchy, it keeps the sorting.
EDIT: Here is another way to try to accomplish this. It seems to work for 3 levels but starts to break down after that. (It also didn't seem to work well on grouped dimensions.)
Expand the hierarchy to all three levels.
On each dimension, enforce a sort order of Count of Order ID Descending.

Using scoring to find customers

I have a site where customers purchase items that are tagged with a variety of taxonomy terms. I want to create a group of customers who might be interested in the same items by considering the tags associated with purchases they've made. Rather than comparing a list of tags for each customer each time I want to build the group, I'm wondering if I can use some type of scoring to solve the problem.
The way I'm thinking about it, each tag would have some unique number assigned to it. When I perform a scoring operation it would render a number that could only be achieved by combining a specific set of tags.
I could update a customer's "score" periodically so that it remains relevant.
Am I on the right track? Any ideas?
Your description of the problem looks much more like a clustering or recommendation problem. I am not sure whether those tags carry enough information for clustering or recommendation, though.
Your idea of the score doesn't look promising to me, because the same sum could be reached in several ways unless the numbers are chosen carefully.
Here is what I would suggest:
You can store the tags for each user. When a user purchases a new item, add the item's tags to the user's tags. Update the user profiles periodically. Say we have users A and B. If, at the time of the update, the similarity between A and B is greater than some threshold, add a relation between them to indicate that the two users are similar. If it is lower, remove the relation (if they were previously related). The similarity could be either the number of common tags or num_common_tags / num_of_tags_assigned_either_in_A_or_B.
Later on, when you want to get users with a particular set of tags, just run a query that checks which users have that set of tags. You can also find users similar to a given user by looking up which users are linked to that user.
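The ratio of common tags to tags assigned to either user is commonly called the Jaccard index. A minimal sketch of the periodic update, assuming each user's tags are kept in a Python set (the function names and the dictionary layout are illustrative assumptions, not part of the suggestion above):

```python
def jaccard(tags_a, tags_b):
    """num_common_tags / num_of_tags_assigned_either_in_A_or_B."""
    union = tags_a | tags_b
    if not union:
        return 0.0  # two users without tags: treat as not similar
    return len(tags_a & tags_b) / len(union)


def similar_pairs(user_tags, threshold):
    """Return the pairs of users whose similarity exceeds the threshold.

    `user_tags` maps a user id to that user's set of tags. Rebuilding the
    whole set of pairs on each update both adds new relations and drops
    stale ones.
    """
    users = sorted(user_tags)
    pairs = set()
    for i, a in enumerate(users):
        for b in users[i + 1:]:
            if jaccard(user_tags[a], user_tags[b]) > threshold:
                pairs.add((a, b))
    return pairs
```

Comparing every pair is quadratic in the number of users, so this is only a starting point for small user bases.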
If you assign a unique power of two to each tag, then you can sum the values corresponding to the tags, and users with the exact same sets of tags will get identical values.
red = 1
green = 2
blue = 4
yellow = 8
For example, only customers who have the set of { red, blue } will have a value of 5.
This is essentially using a bitmap to represent a set. The drawback is that if you have many tags, you'll quickly run out of integers. For example, if your (unsigned) integer type is four bytes, you'd be limited to 32 tags. There are libraries and classes that let you represent much larger bitsets, but, at that point, it's probably worth considering other approaches.
Another problem with this approach is that it doesn't help you cluster members that are similar but not identical.
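As a small illustration of the power-of-two scheme, here is a sketch using the four example tags; the function names are made up for this example. Note that Python integers are arbitrary precision, so this sketch does not hit the 32-tag limit itself, but a fixed-width column or integer type would.

```python
# Tag values taken from the example above: one distinct power of two per tag.
TAG_BITS = {"red": 1, "green": 2, "blue": 4, "yellow": 8}


def tag_score(tags):
    """Combine tags into one integer; equal tag sets give equal scores.

    Bitwise OR equals the sum when each tag is counted once, and it also
    ignores duplicates in the input.
    """
    score = 0
    for tag in tags:
        score |= TAG_BITS[tag]
    return score


def tags_from_score(score):
    """Recover the set of tags encoded in a score."""
    return {tag for tag, bit in TAG_BITS.items() if score & bit}
```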

Seeking appropriate clustering algorithm

I'm analyzing the GDELT dataset and I want to determine thematic clusters. Simplifying considerably, GDELT parses news articles and extracts events. As part of that, it recognizes, let's say, 250 "themes" and, for each "event" it records, stores in one column a semicolon-separated list of all themes identified in the article.
With that preamble, I've extracted, for 2016, a list of approximately 350,000 semi-colon separated theme lists, such as these two:
TAX_FNCACT;TAX_FNCACT_QUEEN;CRISISLEX_T11_UPDATESSYMPATHY;CRISISLEX_CRISISLEXREC;MILITARY;TAX_MILITARY_TITLE;TAX_MILITARY_TITLE_SOLDIER;TAX_FNCACT_SOLDIER;USPEC_POLITICS_GENERAL1;WB_1458_HEALTH_PROMOTION_AND_DISEASE_PREVENTION;WB_1462_WATER_SANITATION_AND_HYGIENE;WB_635_PUBLIC_HEALTH;WB_621_HEALTH_NUTRITION_AND_POPULATION;MARITIME_INCIDENT;MARITIME;MANMADE_DISASTER_IMPLIED;
CRISISLEX_CRISISLEXREC;EDUCATION;SOC_POINTSOFINTEREST;SOC_POINTSOFINTEREST_COLLEGE;TAX_FNCACT;TAX_FNCACT_MAN;TAX_ECON_PRICE;SOC_POINTSOFINTEREST_UNIVERSITY;TAX_FNCACT_JUDGES;TAX_FNCACT_CHILD;LEGISLATION;EPU_POLICY;EPU_POLICY_LAW;TAX_FNCACT_CHILDREN;WB_470_EDUCATION;
As you can see, both of these lists contain "TAX_FNCACT" and "CRISISLEX_CRISISLEXREC". Thus, "TAX_FNCACT;CRISISLEX_CRISISLEXREC" is a 2-item cluster. A better understanding of GDELT informs us that it isn't a particularly useful cluster, but it is one nevertheless.
What I'd like to do, ideally, is compose a dictionary of lists. The key for the dictionary is the number of items in the cluster and value is a list of tuples of all theme clusters with that "key" number of elements paired with the number of times that cluster appeared. This ideal algorithm would run until it identified the largest cluster.
Does an algorithm already exist that I can use for this purpose and if so, what is it named? If I had to guess, I would imagine we've created something to extract x-item clusters and then I would just loop from 2->? until I don't get any results.
Clustering won't work well here.
What you describe looks rather like frequent itemset mining, where the task is to find frequent combinations of 'items' in lists.
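To make the idea of frequent itemset mining concrete, here is a deliberately naive sketch that counts every k-item combination per list and keeps those that occur at least `min_support` times. It is brute-force counting, not an optimized algorithm such as Apriori or FP-growth, and the function names and default parameters are assumptions for illustration. The result has the shape asked for in the question: a dictionary keyed by cluster size, with a list of (themes, count) tuples as the value.

```python
from collections import Counter
from itertools import combinations


def parse_themes(line):
    """Split one semicolon-separated theme list, dropping empty entries."""
    return frozenset(theme for theme in line.split(";") if theme)


def frequent_theme_sets(lines, min_support=2, max_size=3):
    """Map itemset size -> sorted list of (themes, count) with count >= min_support."""
    transactions = [parse_themes(line) for line in lines]
    result = {}
    for size in range(2, max_size + 1):
        counts = Counter()
        for themes in transactions:
            # Sorting gives each combination one canonical tuple form.
            counts.update(combinations(sorted(themes), size))
        frequent = [(combo, n) for combo, n in counts.items() if n >= min_support]
        if not frequent:
            # No frequent sets of this size means none of any larger size.
            break
        result[size] = sorted(frequent)
    return result
```

The number of combinations grows very quickly with the size of each list, so for 350,000 lists an implementation that prunes candidates is likely to be necessary beyond small values of `max_size`.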

Design DS for Multi-pattern Object

There are various varieties of Shirt. Varieties are based on parameters like pattern, size, colour, etc.
Assume you have all types of shirts available. Now there are various queries like:
Show all types of shirt having colour “red”.
Show all types of shirt having size “small” and pattern “checks” etc. etc.
So, assuming we have 'K' different varieties and N shirts, what data structure can we design to store this data and answer the above queries in the most optimal manner?
One obvious solution I thought of is to store 'K' instances of the data, each grouped according to one variety. But that would be very space-inefficient.
What better can we do, keeping the space/time bounds in mind?
How about storing K pointers for each item, where the k-th pointer indicates the next item with the same value for the k-th variety?
Then, for each query, pick one variety, enumerate all the items that satisfy it, check whether each one meets the other constraints, and show it if so. This takes O(NK) for each query and O(1) for adding a new item, while the space is O(NK).
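A rough sketch of this approach follows. As an assumption for illustration, it keeps one list of shirt ids per (variety, value) pair instead of raw next-pointers; the lists play the same role as the pointer chains. The class and method names are hypothetical.

```python
from collections import defaultdict


class ShirtIndex:
    """Stores shirts plus one list of shirt ids per (variety, value) pair."""

    def __init__(self):
        self.shirts = []                   # shirt id -> dict of its varieties
        self.by_value = defaultdict(list)  # (variety, value) -> shirt ids

    def add(self, **varieties):
        shirt_id = len(self.shirts)
        self.shirts.append(varieties)
        for key, value in varieties.items():
            self.by_value[(key, value)].append(shirt_id)
        return shirt_id

    def query(self, **wanted):
        if not wanted:
            return list(range(len(self.shirts)))
        # Enumerate the shortest matching list, then check the other constraints.
        candidates = min(
            (self.by_value.get(item, []) for item in wanted.items()), key=len
        )
        return [
            shirt_id for shirt_id in candidates
            if all(self.shirts[shirt_id].get(k) == v for k, v in wanted.items())
        ]
```

Starting from the shortest list is an optional refinement: any one of the requested varieties would work as the starting point.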
Your scenario looks just like a database; why don't you look into that?

Multi Attribute Matching of Profiles

I am trying to solve a problem of a dating site. Here is the problem
Each user of app will have some attributes - like the books he reads, movies he watches, music, TV show etc. These are defined top level attribute categories. Each of these categories can have any number of values. e.g. in books : Fountain Head, Love Story ...
Now, I need to match users based on profile attributes. Here is what I am planning to do :
Store the data with reverse indexing, i.e. each of Fountain Head, Love Story, etc. is an index key to the set of users with that attribute.
When a new user joins, get the attributes of this user, find the index keys for those attributes, get all the users for these keys, and then bucket sort (or radix sort, or similar) the merged list by how many times each user appears in it.
Is this good, bad, worse? Any other suggestions?
Thanks
Ajay
The algorithm you described is not bad, although it uses a very simple notion of similarity between people.
Let us make it more adjustable, without creating complicated matching criteria. Let's say people who like the same book are more similar than people who listen to the same music. The same goes for every interest. That is, similarity in different fields has different weights.
Like you said, you can keep a list for each interest (like a book, a song etc) to the people who have that in their profile. Then, say you want to find matches of guy g:
for each interest i in g's interests:
    for each person p in list of i
        if p and g have mismatching sexual preferences
            continue
        if p is already in g's match list
            g->match_list[p].score += i->match_weight
        else
            add p to g->match_list with score i->match_weight
sort g->match_list based on score
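The pseudocode above could be written in Python roughly as follows. The data layout (dictionaries of sets), the `compatible` callback standing in for the preference check, and the default weight of 1.0 are assumptions made for this sketch.

```python
from collections import defaultdict


def build_index(interests):
    """Reverse index: interest -> set of people who list it."""
    index = defaultdict(set)
    for person, items in interests.items():
        for item in items:
            index[item].add(person)
    return index


def find_matches(person, interests, index, weights, compatible):
    """Score everyone sharing an interest with `person`, best match first.

    `weights` maps an interest to its match weight (1.0 if missing), and
    `compatible(a, b)` returns False when the preferences of a and b mismatch.
    """
    scores = defaultdict(float)
    for interest in interests[person]:
        for other in index.get(interest, ()):
            if other == person or not compatible(person, other):
                continue
            scores[other] += weights.get(interest, 1.0)
    # Highest score first; ties broken by name for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```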
The choice of weights is not a simple task though. You would need a lot of psychology to get that right. Using your common sense however, you could get values that are not that far off.
In general, matching people is much more complicated than summing some scores. For example, a certain set of matching interests may have more (or in some cases less) effect than the sum of them individually. Also, an interest of one person may result in outright rejection by the other, no matter what other matching interests exist (take two very similar people, one of whom loves Twilight while the other hates it, for example).
