When using field collapsing, is there a way to filter out results whose inner hits count is less than a given threshold?
In a hotel database I want to find hotels with their three cheapest available rooms priced below X. Each document has a hotel_id, a room_id and a price. If a hotel does not have 3 available rooms cheaper than X, I cannot do anything with it.
So I search for rooms cheaper than X, sorted by price and collapsed on hotel_id, but I only want to see groups whose inner hits contain 3 rooms; otherwise that hotel result is unusable. With the size parameter I can define a maximum, but I cannot find a way to define a minimum.
Aggregation is not an option due to performance constraints.
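For reference, the search I'm running looks roughly like this (a sketch using the Python client; the index name, the example price and the inner hits name are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch()
max_price = 100  # X, example value

body = {
    "query": {"range": {"price": {"lt": max_price}}},
    "sort": [{"price": "asc"}],
    "collapse": {
        "field": "hotel_id",
        "inner_hits": {
            "name": "cheapest_rooms",      # arbitrary name
            "size": 3,                     # caps each group at 3, but is only a maximum
            "sort": [{"price": "asc"}],
        },
    },
}
response = es.search(index="rooms", body=body)  # "rooms" is a placeholder index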
I am looking for an approach or algorithm that can help with the following requirements:
Partition the elements into a defined number of partitions, X. The number of partitions might be redefined manually over time if needed.
Each partition should not have more than Y elements
Elements have a "category Id" and "element Id". Ideally all elements with the same category Id should be within the same partition. They should overflow to as few partitions as possible only if a given category has more than Y elements. Number of categories is orders of magnitude larger than number of partitions.
If an element has previously been assigned to a given partition, it should continue to be assigned to the same partition
Account for changes in the data: existing elements might be removed and new elements can be added within each of the categories.
So far my naive approach (sketched in code after the steps below) is to:
sort the categories descending by their number of elements
keep a variable with a count-of-elements for a given partition
assign the rows from the first category to the first partition and increase the count-of-elements
if count-of-elements > Y: assign the overflowing elements to the next partition, but only if the number of elements in the category is bigger than Y; otherwise assign all elements from that category to the next partition
continue until all elements are assigned to partitions
In order to persist the assignments, store all (element Id, partition Id) pairs in the database.
On consecutive re-assignments:
remove from the database any elements that were deleted
assign existing elements to the partitions based on (element Id, partition Id)
for any new elements follow the above algorithm
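In code, the initial assignment would look roughly like this (a rough Python sketch; categories is assumed to be a dict mapping category Id to its list of element Ids):

def initial_assignment(categories, num_partitions, max_per_partition):
    # categories: dict of category Id -> list of element Ids (assumed shape)
    assignment = {}                      # element Id -> partition Id
    counts = [0] * num_partitions
    partition = 0
    # Sort the categories descending by their number of elements.
    ordered = sorted(categories.items(), key=lambda kv: len(kv[1]), reverse=True)
    for category_id, elements in ordered:
        # If the whole category no longer fits (and is itself <= Y), start the next partition.
        if (counts[partition] + len(elements) > max_per_partition
                and len(elements) <= max_per_partition
                and partition < num_partitions - 1):
            partition += 1
        for element_id in elements:
            # Overflow to the next partition once the current one is full.
            if counts[partition] >= max_per_partition and partition < num_partitions - 1:
                partition += 1
            assignment[element_id] = partition
            counts[partition] += 1
    return assignment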
My main worry is that after a few such runs we will end up with categories spread across all the partitions, as the initial partitions fill up. Perhaps adding a buffer (of 20% or so) to Y might help. Also, if one of the categories sees a sudden increase in its number of elements, the partitions will need rebalancing.
Are there any existing algorithms that might help here?
This is NP-hard (knapsack) on top of NP-hard (finding the optimal way to split too-large categories) on top of the currently unknowable (future data changes). Obviously the best you can do is a heuristic.
Sort the categories by descending size. Using a heap/priority queue for the partitions, put each category into the least full available partition. If the category won't fit, then split it as evenly as you can across the smallest possible number of partitions. My guess (experiment!) is that trying to leave partitions at the same fill is best.
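A minimal Python sketch of that initial assignment (working on category sizes only; the split here is done greedily, least full first, rather than perfectly evenly; names are made up):

import heapq

def assign(categories, num_partitions, capacity):
    # categories: dict category id -> size; returns category id -> list of (partition, count)
    heap = [(0, p) for p in range(num_partitions)]   # (current fill, partition id)
    heapq.heapify(heap)
    placement = {}
    for cat, size in sorted(categories.items(), key=lambda kv: kv[1], reverse=True):
        fill, part = heapq.heappop(heap)
        if fill + size <= capacity:
            # The whole category fits into the least full partition.
            placement[cat] = [(part, size)]
            heapq.heappush(heap, (fill + size, part))
            continue
        # Otherwise split it over the fewest partitions that can take it,
        # filling the least full ones first.
        heapq.heappush(heap, (fill, part))
        chunks, remaining, updated = [], size, []
        while remaining > 0 and heap:
            fill, part = heapq.heappop(heap)
            take = min(capacity - fill, remaining)
            if take > 0:
                chunks.append((part, take))
                remaining -= take
            updated.append((fill + take, part))
        for entry in updated:
            heapq.heappush(heap, entry)
        placement[cat] = chunks   # remaining > 0 here would mean total capacity was exceeded
    return placement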
On reassignment, delete the deleted elements first. Then group new elements by category. Sort the categories by how many preferred locations they have ascending, and then by descending size. Now move the categories with 0 preferred locations to the end.
For each category, if possible split its new elements across the preferred partitions, leaving them equally full. If this is not possible, put them into the emptiest possible partition. If that is not possible, then split them to put them across the fewest possible partitions.
It is, of course, possible to come up with data sets that eventually turn this into a mess. But it makes a pretty good, good-faith effort to come out well.
So I have a large list of items, each of which has an ID assigned. Now I need to pick N items from the list, such that the number of items from each ID follows a given ratio.
Let's say:
There are 3 IDs, and their weights are in the ratio 1:3:2,
so if N = 6,
I'll pick 1 item of ID 1, 3 of ID 2, and 2 of ID 3.
However, in some cases there might not be enough items of a particular ID; in those cases the shortfall has to be redistributed among the other IDs. The total number of items picked has to be N.
One possible solution I thought of was to convert this problem into a weighted sampling problem. However, I believe converting the weights of the IDs into weights for each individual item would add a lot of complexity.
Conceptually it is not that difficult, although you will have to handle a few edge cases.
Compute the actual quantity of items needed for each ID by scaling its weight by N over the sum of the input weights. You may have to take care of rounding issues, so one quantity (perhaps the largest one) may need to be adjusted.
Scan over your list, and for each ID create a list of "selected" items which will go in the final result, and a list of "available" items which may be used later, in case some IDs don't reach the requested quantity.
One possibility is that, at some point, all IDs will have reached their requested quantity, in which case we don't even need to loop over the full input list.
If this is not the case, then compute how many items are still needed, recalculate the new ratios for the IDs which still have available items, and use those items to reach the total requested quantity; repeat until N is reached.
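A rough Python sketch of the idea (a simplified variant that just tops up from the leftover "available" items rather than re-applying the ratios; names are made up):

from collections import defaultdict

def pick(items, weights, n):
    # items: list of (item, id); weights: dict id -> ratio weight, e.g. {1: 1, 2: 3, 3: 2}
    by_id = defaultdict(list)
    for item, item_id in items:
        by_id[item_id].append(item)

    total_weight = sum(weights.values())
    # Initial quotas from the ratios; rounding drift is absorbed by the top-up below.
    quota = {i: round(n * w / total_weight) for i, w in weights.items()}

    selected, available = [], []
    for i in weights:
        bucket = by_id.get(i, [])
        take = min(quota[i], len(bucket))
        selected.extend(bucket[:take])
        available.extend(bucket[take:])

    # Some IDs may not have enough items: top up from the leftovers of the other IDs.
    shortfall = n - len(selected)
    if shortfall > 0:
        selected.extend(available[:shortfall])
    return selected[:n]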
For the last few days, I've tried to accomplish the following task regarding the analysis of a set of objects, and the solutions I've come up with either rely heavily on memory (throwing OutOfMemory exceptions in some cases) or take an incredibly long time to process. I now think it is a good idea to post it here, as I'm out of ideas. I will explain the problem in detail and provide the logic I've followed so far.
Scenario:
First, we have an object, which we'll name Individual, that contains the following properties:
A date
A Longitude - Latitude pair
Second, we have another object, which we'll name Group, whose definition is:
A set of Individuals that, together, match the following conditions:
All Individuals in the set have dates that are within 10 days of each other. This means that any two Individuals in the set, compared pairwise, don't differ by more than 10 days.
The distance between each pair of Individuals is less than Y meters.
A Group can have N > 1 Individuals, as long as every pair of Individuals satisfies the conditions.
All individuals are stored in a database.
All groups would also be stored in a database.
The task:
Now, consider a new individual.
The system has to check whether the new Individual:
belongs to an existing Group or Groups, or
forms one or multiple new Groups with other Individuals.
Notes:
The new individual could be in multiple existing groups, or could create multiple new groups.
SubGroups of Individuals are not allowed; for example, if we have a Group that contains Individuals {A,B,C}, there cannot exist a Group that contains {A,B}, {A,C} or {B,C}.
Solution (limited by processing time and memory)
First, we filter the database for all the Individuals that match the initial conditions against the new Individual. This outputs a FilteredIndividuals enumerable, containing all the Individuals that we know will form a Group (of 2) with the new one.
Briefly, a powerset is a set that contains all the possible subsets of a particular set. For example, the powerset of {A,B,C} would be:
{[empty], A, B, C, AB, AC, BC, ABC}
Note: A powerset will output a new set with 2^N combinations, where N is the length of the originating set.
The idea with using powersets is the following:
First, we create a powerset of the FilteredIndividuals list. This will give all possible combinations of Groups within the FilteredIndividuals list. For analysis purposes and by definition, we can omit all the combinations that have fewer than 2 Individuals in them.
We check whether the Individuals in each combination of the powerset match the conditions with each other.
If they match, that means that all of the Individuals in that combination form a Group with the new Individual. Then, to avoid SubGroups, we can eliminate all of the subsets of the checked combination. I do this by creating a powerset of the checked combination, and then eliminating that powerset from the original one.
At this point, we have a list of sets that match the conditions to form a Group.
Before formally creating a Group, I compare the DB with other existing Groups that contain the same elements as the new sets:
If I find a match, I eliminate the newly created set, and add the new Individual to the old Group.
If I don't find a match, it means they are new Groups. So I add the new Individual to the sets and finally create the new Groups.
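In Python (for brevity), the powerset step looks roughly like this; is_group stands in for the pairwise date/distance check, and I discard SubGroups in a slightly different but equivalent way by visiting the largest combinations first. Note the 2^N blowup, which is exactly what causes the memory problems described below:

from itertools import chain, combinations

def powerset(individuals):
    # All combinations with at least 2 Individuals, as described above.
    return chain.from_iterable(
        combinations(individuals, r) for r in range(2, len(individuals) + 1)
    )

def candidate_groups(filtered_individuals, is_group):
    # is_group(subset) -> True when every pair meets the date and distance conditions.
    groups = []
    # Largest combinations first, so accepted groups automatically suppress their subsets.
    for subset in sorted(powerset(filtered_individuals), key=len, reverse=True):
        if any(set(subset) <= set(g) for g in groups):
            continue   # skip SubGroups of an already accepted combination
        if is_group(subset):
            groups.append(subset)
    return groups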
This solution works well when the FilteredIndividuals enumerable has fewer than 52 Individuals. After that, memory exceptions are thrown (I know this is because of the maximum size allowed for the data types, but increasing that size does not help with very big sets; for your consideration, the largest number of Individuals matching the conditions that I've found is 345).
Note: I have access to the definition of both entities. If there's a new property that would reduce the processing time, we can add it.
I'm using the .NET Framework with C#, but if the language is something that requires changing, we can accept this, as long as we can later convert the results to objects understandable by our main system.
All Individuals in the set have dates that are within 10 days of each other. This means that any two Individuals in the set, compared pairwise, don't differ by more than 10 days.
The distance between each pair of Individuals is less than Y meters.
So your problem becomes how to cluster these points in 3-space: a partitioning where X and Y are your latitude and longitude, Z is the time coordinate, and your metric is an appropriately scaled variant of the Manhattan distance. Specifically, you scale Z so that a difference of 10 days equals your maximum distance of Y meters.
One possible shortcut would be to divide and conquer, and classify your points (Individuals) into buckets Y meters wide and 10 days high. You do so by dividing their coordinates by Y and by 10 days (you can use Julian dates for that). If an individual is in bucket H { X=5, Y=3, Z=71 }, then no individual in a bucket with X < (5-1) or X > (5+1), Y < (3-1) or Y > (3+1), or Z < (71-1) or Z > (71+1) can be in its group, because their distance would certainly be above the threshold. This means that you can quickly select a subset of 27 "buckets" and worry about only those individuals in there.
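A sketch of the bucket key and the 27-cell neighbourhood lookup, in Python (meters_x / meters_y would come from projecting longitude/latitude to meters, and julian_day from the date; names are made up):

def bucket_key(meters_x, meters_y, julian_day, y_meters):
    # Each cell is Y meters wide in both spatial axes and 10 days "high" in time.
    return (int(meters_x // y_meters), int(meters_y // y_meters), int(julian_day // 10))

def neighbour_buckets(key):
    # The 3 x 3 x 3 block of cells around the individual's own cell.
    bx, by, bz = key
    return [
        (bx + dx, by + dy, bz + dz)
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        for dz in (-1, 0, 1)
    ]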
At this point you can enumerate the possible groups your new individual can be in (if you use a database back end, that would be SELECT groups.* FROM groups JOIN iig USING (gid) JOIN individuals USING (uid) WHERE individuals.bucketId IN ( #bucketsId )), and compare those with the group your individual may form from other individuals (SELECT individuals.id FROM individuals WHERE bucketId IN ( #bucketsId ) AND ((x-#newX)*(x-#newX)+(y-#newY)*(y-#newY)) < #YSquared AND ABS(z - #newZ) < 10).
This approach is not very performant (it depends on the database, and you'll want an index on bucketId at a minimum), but it has the advantage of using as little memory as possible.
On some database backends with geographical extensions, you might want to use the native latitude and longitude functions instead of implicitly converting to meters.
My problem is the following. I have a small but dense network in Neo4j (~280 nodes, ~3600 relationships). There is only one type of node and one type of edge (i.e. a single label for each). Now, I'd like to specify two distinct groups of nodes, given by values of their "group" property, and match the subgraph consisting of all paths up to a certain length connecting the two groups. In addition, I would like to add constraints on the relationships. So, at the moment I have this:
MATCH (n1) WHERE n1.group={group1}
MATCH (n2) WHERE n2.group={group2}
MATCH p=(n1)-[r*1..3]-(n2)
WHERE ALL(c IN r WHERE c.weight > {w})
AND ALL(n in NODES(p) WHERE 1=length(filter(m in NODES(p) WHERE m=n)))
WITH DISTINCT r AS dr, NODES(p) AS ns
UNWIND dr AS udr UNWIND ns AS uns
RETURN COLLECT(DISTINCT udr), COLLECT(DISTINCT uns)
which achieves what I want but in some cases seems to be too slow. Here the WHERE statement filters out paths with relationships whose weight property is below a threshold as well as those containing cycles.
The last three lines have to do with the desired output format. Given the matching subgraph (paths), I want all unique relationships in one list, and all unique nodes in another (for visualization with d3.js). The only way I found to do this is to UNWIND all elements and then COLLECT them as DISTINCT.
Also note that the group properties and the weight limit are passed in as query parameters.
Now, is there any way to achieve the same result faster? E.g., with paths up to a length of 3 the query takes about 5-10 seconds on my local machine (depending on the connectedness of the chosen node groups), and returns on the order of ~50 nodes and a few hundred relationships. This seems to be in reach of acceptable performance. Paths up to length 4 however are already prohibitive (several minutes or never returns).
Bonus question: is there any way to specify the upper limit on path length as a parameter? Or does a different limit imply a totally different query plan?
This probably won't work at all, but it might give you something to play with. I tried changing a few things that may or may not work.
MATCH (n1) WHERE n1.group={group1}
MATCH (n2) WHERE n2.group={group2}
MATCH p=(n1)-[r*1..3]-(n2)
WHERE r.weight > {w}
WITH n1, NODES(p) AS ns, n2, DISTINCT r AS dr
WHERE length(ns) = 1
UNWIND dr AS udr UNWIND ns AS uns
RETURN COLLECT(DISTINCT udr), COLLECT(DISTINCT uns)
I'm designing a piece of a game where the AI needs to determine which combination of armor will give the best overall stat bonus to the character. Each character will have about 10 stats, of which only 3-4 are important, and of those important ones, a few will be more important than the others.
Armor will also give a boost to one or more stats. For example, a shirt might give +4 to the character's Int and +2 Stamina, while at the same time a pair of pants may have +7 Strength and nothing else.
So let's say that a character has a healthy choice of armor to use (5 pairs of pants, 5 pairs of gloves, etc.) We've designated that Int and Perception are the most important stats for this character. How could I write an algorithm that would determine which combination of armor and items would result in the highest of any given stat (say in this example Int and Perception)?
Targeting one statistic
This is pretty straightforward. First, a few assumptions:
You didn't mention this, but presumably one can only wear at most one kind of armor for a particular slot. That is, you can't wear two pairs of pants, or two shirts.
Presumably, also, the choice of one piece of gear does not affect or conflict with others (other than the constraint of not having more than one piece of clothing in the same slot). That is, if you wear pants, this in no way precludes you from wearing a shirt. But notice, more subtly, that we're assuming you don't get some sort of synergy effect from wearing two related items.
Suppose that you want to target statistic X. Then the algorithm is as follows:
Group all the items by slot.
Within each group, sort the potential items in that group by how much they boost X, in descending order.
Pick the first item in each group and wear it.
The set of items chosen is the optimal loadout.
Proof: The only way to get a higher X stat would be if there was an item A which provided more X than some other in its group. But we already sorted all the items in each group in descending order, so there can be no such A.
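In code, the single-statistic version is just a max per slot (a Python sketch; the item shapes are made up):

def best_for_stat(items_by_slot, stat):
    # items_by_slot: dict slot -> list of (name, stats dict); one item per slot (assumption 1).
    loadout = {}
    for slot, candidates in items_by_slot.items():
        # "Sort descending by the boost to X and take the first" is just a max over the group.
        loadout[slot] = max(candidates, key=lambda c: c[1].get(stat, 0))
    return loadout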
What happens if the assumptions are violated?
If assumption one isn't true -- that is, you can wear multiple items in each slot -- then instead of picking the first item from each group, pick the first Q(s) items from each group, where Q(s) is the number of items that can go in slot s.
If assumption two isn't true -- that is, items do affect each other -- then we don't have enough information to solve the problem. We'd need to know specifically how items can affect each other, or else be forced to try every possible combination of items through brute force and see which ones have the best overall results.
Targeting N statistics
If you want to target multiple stats at once, you need a way to tell "how good" something is. This is called a fitness function. You'll need to decide how important the N statistics are, relative to each other. For example, you might decide that every +1 to Perception is worth 10 points, while every +1 to Intelligence is only worth 6 points. You now have a way to evaluate the "goodness" of items relative to each other.
Once you have that, instead of optimizing for X, you instead optimize for F, the fitness function. The process is then the same as the above for one statistic.
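As a sketch (Python, with made-up stat weights), the fitness function plugs into the same per-slot pick; targeting a single statistic is just the special case where only one stat has a nonzero weight:

def fitness(stats, weights):
    # weights: e.g. {"per": 10, "int": 6} - every +1 Perception worth 10, Intelligence 6.
    return sum(weights.get(stat, 0) * bonus for stat, bonus in stats.items())

def best_loadout(items_by_slot, weights):
    # Same per-slot greedy pick as before, but maximizing F instead of a single stat.
    return {
        slot: max(candidates, key=lambda c: fitness(c[1], weights))
        for slot, candidates in items_by_slot.items()
    }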
If there is no restriction on the number of items per category, the following will work for multiple statistics and multiple items.
Data preparation:
Give each statistic (Int, Perception) a weight, according to how important you determine it is
Store this as a 1-D array statImportance
Give each item-statistic combination a value, according to how much said item boosts said statistic for the player
Store this as a 2-D array itemStatBoost
Algorithm:
In pseudocode. Here assume that itemScore is a sortable Map with Item as the key and a numeric value as the value, and values are initialised to 0.
Assume that the sort method is able to sort this Map by values (not keys).
//Score each item and rank them
for each statistic as S
    for each item as I
        score = itemScore.get(I) + (statImportance[S] * itemStatBoost[I, S])
        itemScore.put(I, score)
sort(itemScore)

//Decide which items to use
maxEquippableItems = 10 //use the appropriate value
selectedItems = new array[maxEquippableItems]
for 0 <= idx < maxEquippableItems
    selectedItems[idx] = itemScore.getByIndex(idx)
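A direct Python rendering of the above, using plain dicts in place of the arrays and the sortable Map (names mirror the pseudocode):

def select_items(items, statImportance, itemStatBoost, maxEquippableItems):
    # items: list of item ids; statImportance: dict stat -> weight;
    # itemStatBoost: dict (item, stat) -> boost value.
    itemScore = {i: 0 for i in items}
    for s, weight in statImportance.items():
        for i in items:
            itemScore[i] += weight * itemStatBoost.get((i, s), 0)
    # Sort by score, descending, and keep the top maxEquippableItems.
    ranked = sorted(itemScore, key=itemScore.get, reverse=True)
    return ranked[:maxEquippableItems]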