I want to implement weighted random selection in Elasticsearch. In my index each document has a weight from 1 to N, so a document with weight 1 should appear in results half as often as a document with weight 2.
For example, I have 3 documents (one with weight 2, two with weight 1):
[
  {
    "_index": "we_recommend_on_main",
    "_type": "we_recommend_on_main",
    "_id": "5-0",
    "_score": 1.1245852,
    "_source": {
      "id_map_placement": 6151,
      "image": "/upload/banner1",
      "weight": 2
    }
  },
  {
    "_index": "we_recommend_on_main",
    "_type": "we_recommend_on_main",
    "_id": "8-0",
    "_score": 0.14477867,
    "_source": {
      "id_map_placement": 6151,
      "image": "/upload/banner1",
      "weight": 1
    }
  },
  {
    "_index": "we_recommend_on_main",
    "_type": "we_recommend_on_main",
    "_id": "8-1",
    "_score": 0.0837487,
    "_source": {
      "id_map_placement": 6151,
      "image": "/upload/banner2",
      "weight": 1
    }
  }
]
I found a solution with a search like this:
{
  "size": 1,
  "query": {
    "function_score": {
      "functions": [
        {
          "random_score": {}
        },
        {
          "field_value_factor": {
            "field": "weight",
            "modifier": "none",
            "missing": 1
          }
        }
      ],
      "score_mode": "multiply",
      "boost_mode": "replace"
    }
  },
  "sort": [
    {
      "_score": "desc"
    }
  ]
}
After running this query 10,000 times, the result is
{
  "5-0": 6730,
  "8-1": 1613,
  "8-0": 1657
}
But not
{
  "5-0": 5000,
  "8-1": 2500,
  "8-0": 2500
}
as I expected. What is wrong?
Unfortunately, the problem here is that your assumption about this distribution is wrong. This is a classic probability-theory problem. Let A, B and C be uniformly distributed random variables (A and B between 0 and 1, C between 0 and 2). We need the probability that C is greater than both A and B.
Explanation: since C is uniformly distributed between 0 and 2, with 50% probability it falls between 1 and 2, which automatically makes it greater than both A and B.
However, there are also cases when C is less than 1 but still greater than both A and B, which makes the probability strictly greater than 50%, and in fact much more than 50%.
The second part of the distribution is when all 3 variables are between 0 and 1. The probability that C is the greatest of the three is then 1/3, but C falls in this range only 50% of the time, which contributes 1/2 * 1/3 = 1/6. The total probability is 1/2 + 1/6 = 2/3, which roughly matches the numbers you got from your Monte-Carlo simulation.
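For reference, here is a small Monte-Carlo sketch of this argument in plain Python (the script is my own illustration, reusing the document IDs as labels; it is not an Elasticsearch query):

import random

trials = 100_000
wins = {"5-0": 0, "8-0": 0, "8-1": 0}

for _ in range(trials):
    # random_score is uniform in [0, 1); multiplying by the weight gives
    # uniform [0, 2) for "5-0" and uniform [0, 1) for the other two documents
    scores = {
        "5-0": 2 * random.random(),
        "8-0": random.random(),
        "8-1": random.random(),
    }
    wins[max(scores, key=scores.get)] += 1

print({doc: count / trials for doc, count in wins.items()})
# "5-0" wins roughly 2/3 of the time (about 0.667), not 1/2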
Update: it's not possible to achieve the expected behaviour directly, since you have no control over scoring at the point where you would need aggregated information such as the sum of the weights. I would recommend doing it in a rescore fashion: first request the sum aggregation on the field, and then re-use it in a second query.
Suppose I have a random sequence (an ordered array) containing n positive floats.
How do I find a subsequence of size k such that the minimum distance between all pairs of floats in the subsequence is maximized, i.e. they are as far apart as possible?
Note: A subsequence of a sequence is an ordered subset of the sequence's elements having the same sequential ordering as the original sequence.
CONSTRAINTS
n>10^5
n>k>2
Example:
sequence a[] = {1.1, 2.34, 6.71, 7.01, 10.71} and k = 3,
subsequence = {1.1, 6.71, 10.71}; the minimum distance is 4, between 10.71 and 6.71.
Wrong subsequences:
{1.1, 7.01, 10.71}, minimum distance is 3.7
{1.1, 2.34, 6.71}, minimum distance is 1.24
I came up with a solution:
1) Sort the array.
2) Select a[0], then find ceil(a[0] + x) = Y in the array, then ceil(Y + x), and
so on, k-1 times; the kth element will be a[n-1].
To find x:
Let dp[i][j] be the x for selecting j elements from the first i elements.
Finally we want dp[n][k], which is x.
But I am facing problems in finding x and in restoring the original ordering of the indexes.
dp[i][j] = max( min( dp[k][j-1], A[i] - A[k] ) )
over k = 1 to i-1, i = 2 to n, j = 2 to i
dp[i][1] = 0 for i = 1 to n
I want to correct the dynamic programming solution. I know x can be found by binary searching over x, but by sorting I lose the ordering of the sequence, and it is time-consuming (O(n^2)). How do I overcome these problems?
If there is a solution involving a sort, you first want to map the array to an array of tuples that contain each value and its original position. Then when you sort the array you know the original positions as well.
However, I don't believe that sorting actually helps you in the end.
The approach that I see working is, for each 0 <= i < n and each 1 < j <= min(k, i+1), to store the minimum distance and the previous entry for the best subsequence of length j ending at i.
You then look for the best subsequence of length k, and then decode the subsequence.
Using JSON notation (for clarity, not because I think this is the right data structure), and your example, you could wind up with a data structure like this:
[
  {"index": 0, "value": 1.1},
  {"index": 1, "value": 2.34,
   "seq": {2: {"dist": 1.24, "prev": 0}}},
  {"index": 2, "value": 6.71,
   "seq": {2: {"dist": 5.61, "prev": 0},
           3: {"dist": 1.24, "prev": 1}}},
  {"index": 3, "value": 7.01,
   "seq": {2: {"dist": 5.91, "prev": 0},
           3: {"dist": 1.24, "prev": 1}}},
  {"index": 4, "value": 10.71,
   "seq": {2: {"dist": 9.61, "prev": 0},
           3: {"dist": 4, "prev": 2}}}
]
And now we find that the biggest dist for length 3 is 4, at index 4. Walking backwards we want indices 4, 2 and 0. Pull those out and reverse them to get the solution [1.1, 6.71, 10.71].
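For illustration, here is a minimal Python sketch of that approach (the function name and the list-of-lists layout are my own; like the example, it assumes the values are given in increasing order). It runs in O(n^2 * k), so it demonstrates the idea rather than meeting the n > 10^5 constraint:

def max_min_distance_subsequence(a, k):
    n = len(a)
    # best[i][j] holds (min pairwise gap, previous index) for the best
    # subsequence of length j ending at position i, or None if impossible
    best = [[None] * (k + 1) for _ in range(n)]
    for i in range(n):
        best[i][1] = (float("inf"), None)  # a single element has no pair yet
        for j in range(2, min(k, i + 1) + 1):
            for p in range(i):  # candidate previous element of the subsequence
                if best[p][j - 1] is None:
                    continue
                cand = min(best[p][j - 1][0], a[i] - a[p])
                if best[i][j] is None or cand > best[i][j][0]:
                    best[i][j] = (cand, p)
    # the answer ends at the position with the biggest "dist" for length k
    end = max(range(n), key=lambda i: best[i][k][0] if best[i][k] else float("-inf"))
    # decode by walking the "prev" pointers backwards
    result, i, j = [], end, k
    while i is not None:
        result.append(a[i])
        i, j = best[i][j][1], j - 1
    return result[::-1]

print(max_min_distance_subsequence([1.1, 2.34, 6.71, 7.01, 10.71], 3))
# expected output: [1.1, 6.71, 10.71]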
For example, I have a dataset like this
test = spark.createDataFrame([
    (0, 1, 5, "2018-06-03", "Region A"),
    (1, 1, 2, "2018-06-04", "Region B"),
    (2, 2, 1, "2018-06-03", "Region B"),
    (3, 3, 1, "2018-06-01", "Region A"),
    (3, 1, 3, "2018-06-05", "Region A"),
])\
    .toDF("orderid", "customerid", "price", "transactiondate", "location")

test.show()
and I can obtain the customer-region order count matrix by
from pyspark.sql.functions import count, col

overall_stat = test.groupBy("customerid").agg(count("orderid"))\
    .withColumnRenamed("count(orderid)", "overall_count")

temp_result = test.groupBy("customerid").pivot("location").agg(count("orderid")).na.fill(0).join(overall_stat, ["customerid"])

for field in temp_result.schema.fields:
    if str(field.name) not in ['customerid', "overall_count", "overall_amount"]:
        name = str(field.name)
        temp_result = temp_result.withColumn(name, col(name)/col("overall_count"))

temp_result.show()
The data would look like this
Now, I want to calculate the weighted average by the overall_count. How can I do it?
The result should be (0.66*3+1*1)/4 for region A, and (0.33*3+1*1)/4 for region B
My thoughts:
It can certainly be achieved by converting the data to Python/pandas and then doing some calculation, but in what cases should we use PySpark?
I can get something like
temp_result.agg(sum(col("Region A") * col("overall_count")), sum(col("Region B")*col("overall_count"))).show()
but it doesn't feel right, especially if there are many regions to count.
You can achieve a weighted average by breaking your above steps into multiple stages.
Consider the following:
Dataframe Name: sales_table
[ total_sales, count_of_orders, location]
[ 50 , 9 , A ]
[ 80 , 4 , A ]
[ 90 , 7 , A ]
Calculating the grouped weighted average of the above (70) breaks down into three steps:
Multiplying sales by importance (count_of_orders)
Aggregating the sales_x_count product
Dividing the summed sales_x_count by the sum of the weights (count_of_orders)
If we break the above into several stages within our PySpark code, we get the following:
import pyspark.sql.functions as sf
from pyspark.sql.functions import col

new_sales = sales_table \
    .withColumn("sales_x_count", col("total_sales") * col("count_of_orders")) \
    .groupBy("location") \
    .agg(sf.sum("count_of_orders").alias("sum_count_of_orders"),
         sf.sum("sales_x_count").alias("sum_sales_x_count")) \
    .withColumn("count_weighted_average", col("sum_sales_x_count") / col("sum_count_of_orders"))
So... no fancy UDF is really necessary here (and would likely slow you down).
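As a quick sanity check, here is a self-contained version of the same computation (the SparkSession setup and the literal example rows are my own additions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import pyspark.sql.functions as sf

spark = SparkSession.builder.getOrCreate()

# the sales_table example from above
sales_table = spark.createDataFrame(
    [(50, 9, "A"), (80, 4, "A"), (90, 7, "A")],
    ["total_sales", "count_of_orders", "location"])

new_sales = sales_table \
    .withColumn("sales_x_count", col("total_sales") * col("count_of_orders")) \
    .groupBy("location") \
    .agg(sf.sum("count_of_orders").alias("sum_count_of_orders"),
         sf.sum("sales_x_count").alias("sum_sales_x_count")) \
    .withColumn("count_weighted_average", col("sum_sales_x_count") / col("sum_count_of_orders"))

new_sales.show()
# location A should come out to (50*9 + 80*4 + 90*7) / (9 + 4 + 7) = 70.0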
If there is an unlimited number of every coin, then the complexity is O(n*m), where n is the total change and m is the number of coin types. Now, when the coins of every type are limited, we have to take the remaining coins into account. I managed to make it work with a complexity of O(n*m^2), using another loop of size n so I can track the remaining coins for each type. Is there a way or trick to make the complexity better? EDIT: The problem is to compute the least amount of coins required to make the exact given change, together with the number of times each coin type is used.
There is no need for an extra loop. You need to:
Recurse with a depth of at most m (the number of coin types) levels, dealing with one specific coin per recursion level.
Loop at most n times at each recursion level in order to decide how many you will take of a given coin.
Here is how the code would look in Python 3:
def getChange(coins, amount, coinIndex = 0):
    if amount == 0:
        return [] # success
    if coinIndex >= len(coins):
        return None # failure
    coin = coins[coinIndex]
    coinIndex += 1
    # Start by taking as many as possible from this coin
    canTake = min(amount // coin["value"], coin["count"])
    # Reduce the number taken from this coin until success
    for count in range(canTake, -1, -1): # count will go down to zero
        # Recurse to decide how many to take from the next coins
        change = getChange(coins, amount - coin["value"] * count, coinIndex)
        if change != None: # We had success
            if count: # Register this number for this coin:
                return change + [{ "value": coin["value"], "count": count }]
            return change

# Example data and call:
coins = [
    { "value": 20, "count": 2 },
    { "value": 10, "count": 2 },
    { "value": 5, "count": 3 },
    { "value": 2, "count": 2 },
    { "value": 1, "count": 10 }
]
result = getChange(coins, 84)
print(result)
Output for the given example:
[
    {'value': 1, 'count': 5},
    {'value': 2, 'count': 2},
    {'value': 5, 'count': 3},
    {'value': 10, 'count': 2},
    {'value': 20, 'count': 2}
]
Minimising the number of coins used
As stated in the comments, the above algorithm returns the first solution it finds. If there is a requirement that the number of individual coins must be minimised when there are multiple solutions, then you cannot return halfway through the loop, but must retain the "best" solution found so far.
Here is the modified code to achieve that:
def getchange(coins, amount):
    minCount = None

    def recurse(amount, coinIndex, coinCount):
        nonlocal minCount
        if amount == 0:
            if minCount == None or coinCount < minCount:
                minCount = coinCount
                return [] # success
            return None # not optimal
        if coinIndex >= len(coins):
            return None # failure
        bestChange = None
        coin = coins[coinIndex]
        # Start by taking as many as possible from this coin
        cantake = min(amount // coin["value"], coin["count"])
        # Reduce the number taken from this coin until 0
        for count in range(cantake, -1, -1):
            # Recurse, taking out this coin as a possible choice
            change = recurse(amount - coin["value"] * count, coinIndex + 1,
                             coinCount + count)
            # Do we have a solution that is better than the best so far?
            if change != None:
                if count: # Does it involve this coin?
                    change.append({ "value": coin["value"], "count": count })
                bestChange = change # register this as the best so far
        return bestChange

    return recurse(amount, 0, 0)

coins = [{ "value": 10, "count": 2 },
         { "value": 8, "count": 2 },
         { "value": 3, "count": 10 }]
result = getchange(coins, 26)
print(result)
Output:
[
    {'value': 8, 'count': 2},
    {'value': 10, 'count': 1}
]
Here's an implementation of an O(nm) solution in Python.
If one defines C(c, k) = 1 + x^c + x^(2c) + ... + x^(kc), then the program calculates the first n+1 coefficients of the polynomial product(C(c[i], k[i]), i = 1...ncoins). The j'th coefficient of this polynomial is the number of ways of making change for j.
When all the ks are unlimited, this polynomial product is easy to calculate (see, for example: https://stackoverflow.com/a/20743780/1400793). When limited, one needs to be able to calculate running sums of k terms efficiently, which is done in the program using the rs array.
# cs is a list of pairs (c, k) where there's k
# coins of value c.
def limited_coins(cs, n):
    r = [1] + [0] * n
    for c, k in cs:
        # rs[i] will contain the sum r[i] + r[i-c] + r[i-2c] + ...
        rs = r[:]
        for i in xrange(c, n+1):
            rs[i] += rs[i-c]
            # This line effectively performs:
            # r'[i] = sum(r[i-j*c] for j=0...k)
            # but using rs[] so that the computation is O(1)
            # and in place.
            r[i] += rs[i-c] - (0 if i<c*(k+1) else rs[i-c*(k+1)])
    return r[n]

for n in xrange(50):
    print n, limited_coins([(1, 3), (2, 2), (5, 3), (10, 2)], n)
I have the following code where eratosthenes(N) returns an array of primes from 1 to N. What I want to do is remove any numbers from this list that contain the digits 0, 2, 4, 5, 6, 8. My code seems quite inefficient and wrong as it takes about 20 seconds (eratosthenes(N) is instantaneous) to get to just 100,000 and doesn't remove all the numbers I want it to. Is there a better, scalable solution to this problem?
N = 1_000_000
primes = eratosthenes(N)
primes.each do |num|
  if ["0", "2", "4", "5", "6", "8"].any? { |digit| num.to_s.include?(digit) }
    primes.delete(num)
  end
end
The problem with your approach is that each delete rewrites the array, and it is called once for every deleted item, so the complexity of the algorithm is O(n^2) instead of O(n). Deleting from the array while you are iterating over it is also why some of the numbers you expect to be removed get skipped.
You should do something like this:
primes.reject!{|num| ["0", "2", "4", "5", "6", "8"].any? { |digit| num.to_s.include?(digit) }}
Or simply:
primes.reject!{|num| num.to_s[/[024568]/]}
It's just a matter of style, but I'd put everything together in one line (note the lack of ! in reject here):
primes = eratosthenes(N).reject{|num| num.to_s[/[024568]/]}
I should think that you're looking for something like:
primes.reject!{|num| num % 2 == 0}
In MongoDB, a field can have multiple values (an array of values). Each of them is indexed, so you can filter on any of the values. But can you also "order by" a field with multiple values and what is the result?
Update:
> db.test.find().sort({a:1})
{ "_id" : ObjectId("4f27e36b5eaa9ebfda3c1c53"), "a" : [ 0 ] }
{ "_id" : ObjectId("4f27e3845eaa9ebfda3c1c54"), "a" : [ 0, 1 ] }
{ "_id" : ObjectId("4f27df6e5eaa9ebfda3c1c4c"), "a" : [ 1, 1, 1 ] }
{ "_id" : ObjectId("4f27df735eaa9ebfda3c1c4d"), "a" : [ 1, 1, 2 ] }
{ "_id" : ObjectId("4f27df795eaa9ebfda3c1c4e"), "a" : [ 2, 1, 2 ] }
{ "_id" : ObjectId("4f27df7f5eaa9ebfda3c1c4f"), "a" : [ 2, 2, 1 ] }
{ "_id" : ObjectId("4f27df845eaa9ebfda3c1c50"), "a" : [ 2, 1 ] }
{ "_id" : ObjectId("4f27e39a5eaa9ebfda3c1c55"), "a" : [ 2 ] }
With unequal length arrays the longer array is "lower" than the shorter array
So, why is [0] before [0, 1], but [2] after [2, 1]?
Is sorting maybe only done on the first array element? Or the lowest one? And after that, is it insertion order?
Also, how is this implemented in the case of an index scan (as opposed to a table scan)?
Sorting of array elements is pretty complicated. Since array elements are indexed separately, sorting on an array field will actually result in some interesting situations. What happens is that MongoDB will sort the documents based on the lowest or highest value in the array (depending on sort direction). Beyond that, the order is natural.
This leads to things like:
> db.test.save({a:[1]})
> db.test.save({a:[0,2]})
> db.test.find().sort({a:1})
{ "_id" : ObjectId("4f29026f5b6b8b5fa49df1c3"), "a" : [ 0, 2 ] }
{ "_id" : ObjectId("4f2902695b6b8b5fa49df1c2"), "a" : [ 1 ] }
> db.test.find().sort({a:-1})
{ "_id" : ObjectId("4f29026f5b6b8b5fa49df1c3"), "a" : [ 0, 2 ] }
{ "_id" : ObjectId("4f2902695b6b8b5fa49df1c2"), "a" : [ 1 ] }
In other words: the same order for reversed sorts. This is because the "a" field of the top document holds both the lowest and the highest value.
So effectively, for the sort, MongoDB ignores all values in the array that are not either the highest (for a {field: -1} sort) or the lowest (for a {field: 1} sort) and orders the documents by the remaining values.
To paint an (oversimplified) picture, it works something like this:
flattened b-tree for index {a:1} given the above sample docs:
"a" value 0 -> document 4f29026f5b6b8b5fa49df1c3
"a" value 1 -> document 4f2902695b6b8b5fa49df1c2
"a" value 2 -> document 4f29026f5b6b8b5fa49df1c3
As you can see scanning from both top to bottom and bottom to top will result in the same order.
Empty arrays are the "lowest" possible array value and thus will appear at the top and bottom of the above queries respectively.
Indexes do not change the behaviour of sorting on arrays.
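For completeness, the same behaviour can be reproduced from a driver; here is a small PyMongo sketch (the connection, database and collection names are assumptions):

from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient()  # assumes a local mongod on the default port
coll = client["test"]["test"]
coll.drop()
coll.insert_many([{"a": [1]}, {"a": [0, 2]}])

# An ascending sort compares documents by the lowest value in "a",
# a descending sort by the highest value in "a".
print(list(coll.find().sort("a", ASCENDING)))   # {"a": [0, 2]} first (lowest element is 0)
print(list(coll.find().sort("a", DESCENDING)))  # {"a": [0, 2]} first again (highest element is 2)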