How do MongoDB multi-keys sort?

In MongoDB, a field can have multiple values (an array of values). Each of them is indexed, so you can filter on any of the values. But can you also "order by" a field with multiple values, and what is the result?
Update:
> db.test.find().sort({a:1})
{ "_id" : ObjectId("4f27e36b5eaa9ebfda3c1c53"), "a" : [ 0 ] }
{ "_id" : ObjectId("4f27e3845eaa9ebfda3c1c54"), "a" : [ 0, 1 ] }
{ "_id" : ObjectId("4f27df6e5eaa9ebfda3c1c4c"), "a" : [ 1, 1, 1 ] }
{ "_id" : ObjectId("4f27df735eaa9ebfda3c1c4d"), "a" : [ 1, 1, 2 ] }
{ "_id" : ObjectId("4f27df795eaa9ebfda3c1c4e"), "a" : [ 2, 1, 2 ] }
{ "_id" : ObjectId("4f27df7f5eaa9ebfda3c1c4f"), "a" : [ 2, 2, 1 ] }
{ "_id" : ObjectId("4f27df845eaa9ebfda3c1c50"), "a" : [ 2, 1 ] }
{ "_id" : ObjectId("4f27e39a5eaa9ebfda3c1c55"), "a" : [ 2 ] }
With unequal-length arrays, the longer array is "lower" than the shorter array.
So why is [0] before [0,1], but [2] after [2,1]?
Is sorting maybe done only on the first array element? Or the lowest one? And after that, is it insertion order?
Also, how is this implemented in the case of an index scan (as opposed to a table scan)?

Sorting on array fields is pretty complicated. Since array elements are indexed separately, sorting on an array field actually produces some interesting situations. MongoDB sorts documents based on the lowest or highest value in the array (depending on sort direction). Beyond that, the order is natural.
This leads to results like this:
> db.test.save({a:[1]})
> db.test.save({a:[0,2]})
> db.test.find().sort({a:1})
{ "_id" : ObjectId("4f29026f5b6b8b5fa49df1c3"), "a" : [ 0, 2 ] }
{ "_id" : ObjectId("4f2902695b6b8b5fa49df1c2"), "a" : [ 1 ] }
> db.test.find().sort({a:-1})
{ "_id" : ObjectId("4f29026f5b6b8b5fa49df1c3"), "a" : [ 0, 2 ] }
{ "_id" : ObjectId("4f2902695b6b8b5fa49df1c2"), "a" : [ 1 ] }
In other words: the same order for reversed sorts. This is because the "a" field of the top document holds both the lowest and the highest value.
So, effectively, for the sort MongoDB ignores all values in the array that are not either the highest (for a {field:-1} sort) or the lowest (for a {field:1} sort) and orders the remaining values.
To paint an (oversimplified) picture, it works something like this:
flattened B-tree for index {a:1} given the above sample docs:
"a" value 0 -> document 4f29026f5b6b8b5fa49df1c3
"a" value 1 -> document 4f2902695b6b8b5fa49df1c2
"a" value 2 -> document 4f29026f5b6b8b5fa49df1c3
As you can see scanning from both top to bottom and bottom to top will result in the same order.
Empty arrays are the "lowest" possible array value and thus will appear at the top and bottom of the above queries respectively.
Indexes do not change the behaviour of sorting on arrays.
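As an oversimplified model of this rule (my own sketch, not MongoDB code): an ascending sort keys each document on the array's minimum, a descending sort on its maximum, and ties keep their natural order.

```python
# Sketch of MongoDB's multikey sort rule (an illustration, not MongoDB itself):
# ascending sorts key on the array's minimum, descending sorts on its maximum.
docs = [{"a": [1]}, {"a": [0, 2]}]

# Ascending: key is min(array); sorted() is stable, so ties keep natural order.
asc = sorted(docs, key=lambda d: min(d["a"]))

# Descending: key is max(array), highest first.
desc = sorted(docs, key=lambda d: max(d["a"]), reverse=True)

print([d["a"] for d in asc])   # [[0, 2], [1]]
print([d["a"] for d in desc])  # [[0, 2], [1]] -- same order both ways
```

The document holding both the lowest and the highest value wins in both directions, matching the shell output above.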

Related

elasticsearch weighted random distribution

I want to implement a weighted random distribution in Elasticsearch. In my index each document has a weight from 1 to N, so a document with weight 1 must appear in the results half as often as a document with weight 2.
For example, I have 3 documents (one with weight 2, two with weight 1):
[
  {
    "_index": "we_recommend_on_main",
    "_type": "we_recommend_on_main",
    "_id": "5-0",
    "_score": 1.1245852,
    "_source": {
      "id_map_placement": 6151,
      "image": "/upload/banner1",
      "weight": 2
    }
  },
  {
    "_index": "we_recommend_on_main",
    "_type": "we_recommend_on_main",
    "_id": "8-0",
    "_score": 0.14477867,
    "_source": {
      "id_map_placement": 6151,
      "image": "/upload/banner1",
      "weight": 1
    }
  },
  {
    "_index": "we_recommend_on_main",
    "_type": "we_recommend_on_main",
    "_id": "8-1",
    "_score": 0.0837487,
    "_source": {
      "id_map_placement": 6151,
      "image": "/upload/banner2",
      "weight": 1
    }
  }
]
I found a solution with a search like this:
{
  "size": 1,
  "query": {
    "function_score": {
      "functions": [
        {
          "random_score": {}
        },
        {
          "field_value_factor": {
            "field": "weight",
            "modifier": "none",
            "missing": 1
          }
        }
      ],
      "score_mode": "multiply",
      "boost_mode": "replace"
    }
  },
  "sort": [
    {
      "_score": "desc"
    }
  ]
}
After testing this query 10000 times, the result is
{
  "5-0": 6730,
  "8-1": 1613,
  "8-0": 1657
}
But not
{
  "5-0": 5000,
  "8-1": 2500,
  "8-0": 2500
}
as I expected. What is wrong?
Unfortunately, the problem here is that your assumption about this distribution is wrong. This is a classic probability theory problem: variables A, B, C are uniformly distributed (A and B between 0 and 1, C between 0 and 2), and we need the probability that C is greater than both A and B.
Explanation: since C is uniformly distributed between 0 and 2, with 50% probability it lies between 1 and 2, which automatically makes it greater than both A and B.
However, there are also cases where C is less than 1 but still greater than both A and B, which pushes the probability strictly above 50%.
The second part of the distribution is where all 3 variables lie between 0 and 1. The probability that C is the greatest of the three is then 1/3, but C lies in this range only 50% of the time, contributing 1/2 × 1/3 = 1/6. The total probability is 1/2 + 1/6 = 2/3 ≈ 0.67, which roughly matches the numbers you got with your Monte-Carlo test.
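A quick Monte-Carlo check of this argument (an illustration of the math only, not of Elasticsearch internals):

```python
import random

# A and B are uniform on [0, 1), C is uniform on [0, 2); we estimate the
# probability that C beats both, which the derivation above puts at 2/3.
random.seed(42)  # fixed seed so the run is repeatable
trials = 100_000
wins = 0
for _ in range(trials):
    a = random.random()
    b = random.random()
    c = 2 * random.random()
    if c > a and c > b:  # C wins the "sort by score" race
        wins += 1
print(wins / trials)  # close to 2/3, matching the ~0.67 observed above
```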
Upd. It's not possible to achieve the expected behaviour this way, since you have no control over scoring at the point where you would need aggregated information such as the sum of all weights. I would recommend doing it in a rescore fashion: first request the sum aggregation on the field, then reuse that sum.

Elastic search aggregator price range from 0 to 0

I am using an Elasticsearch aggregation query to get a list of available products based on price ranges.
This is what my aggregation query looks like:
'aggs': {
  'prices': {
    'range': {
      'field': 'price',
      'ranges': [
        {'from': 0, 'to': 0},
        {'to': 4.99},
        {'from': 5, 'to': 9.99},
        {'from': 10}
      ]
    }
  }
}
I want to get the number of products that are free, so I have the range from 0 to 0. But that didn't work. The rest of the ranges work fine. How can I get an aggregation bucket for price 0?
Quoting from the Range Aggregation documentation:
Note that this aggregation includes the from value and excludes the to value for each range.
So the range aggregation excludes the to value you entered. That is why you didn't get any documents in the 0-0 bucket.
Again, from: 0, to: 1 means the bucket 0 ≤ value < 1, while from: 0, to: 0 means the bucket 0 ≤ value < 0, which is empty and therefore cannot include 0.
Solution:
If you want a bucket of 0 values with the range aggregation, you can set the range to from: 0, to: 0.000000001, where the to value is just some tiny value greater than 0 (choose one appropriate for your application).
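The half-open rule can be sketched like this (my own illustration, not Elasticsearch code; the bucket keys are just from-to strings I made up):

```python
# Each bucket keeps values with lo <= value < hi (the "includes from,
# excludes to" rule quoted above); None would mean an open-ended bound.
def bucket_counts(prices, ranges):
    counts = {}
    for lo, hi in ranges:
        counts[f"{lo}-{hi}"] = sum(
            1 for p in prices
            if (lo is None or p >= lo) and (hi is None or p < hi)
        )
    return counts

prices = [0, 0, 3.5, 7, 12]
print(bucket_counts(prices, [(0, 0), (0, 1e-9), (0, 4.99)]))
# {'0-0': 0, '0-1e-09': 2, '0-4.99': 3}
# (0, 0) is empty because 0 <= p < 0 matches nothing;
# (0, 1e-9) catches the two free products.
```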

Sorting a Maple dataframe by the contents of a column

I have a dataset stored in a Maple dataframe that I'd like to sort by values in a given column. My example is larger, but the data is such that I have two columns of data, one that has some numeric values, and the other that has strings. So for example, say if I have a dataframe constructed as:
Mydata := DataFrame(<<2,1,3,0>|<"Red","Blue","Green","Orange">>, columns = [Value,Color] );
I'd like something like the sort command to be able to return the same dataframe with the numbers in the Value column sorted in ascending or descending order, but the sort command doesn't seem to support dataframes. Any ideas on how I can sort this?
You're right that the sort command doesn't currently support DataFrames (but it should!). I've gotten around this by converting the DataFrame column (a DataSeries) to a Vector, sorting the Vector using the output = permutation option, and then indexing the DataFrame by the result. Using your example:
Mydata := DataFrame(<<2,1,3,0>|<"Red","Blue","Green","Orange">>, columns = [Value,Color] );
sort( convert( Mydata[Value], Vector ), output = permutation );
Which returns:
[4, 2, 1, 3]
Indexing the original DataFrame by this result then returns the sorted DataFrame in ascending order of the Value column:
Mydata[ sort( convert( Mydata[Value], Vector ), output = permutation ), .. ];
Mydata[ [4, 2, 1, 3], .. ];
returns:
[ Value Color ]
[ ]
[4 0 "Orange"]
[ ]
[2 1 "Blue" ]
[ ]
[1 2 "Red" ]
[ ]
[3 3 "Green" ]
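The same permutation-index idea can be sketched in plain Python for comparison (list indices here are 0-based where Maple's are 1-based):

```python
values = [2, 1, 3, 0]
colors = ["Red", "Blue", "Green", "Orange"]

# Indices that would sort `values` ascending -- the analogue of Maple's
# sort(..., output = permutation).
perm = sorted(range(len(values)), key=lambda i: values[i])
print(perm)                       # [3, 1, 0, 2] (Maple's 1-based [4, 2, 1, 3])
print([values[i] for i in perm])  # [0, 1, 2, 3]
print([colors[i] for i in perm])  # ['Orange', 'Blue', 'Red', 'Green']
```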
That said, I have needed to sort DataFrames a number of times, so I have also created a procedure that seems to work for most of my data sets. This procedure uses a similar approach with the sort command, but it doesn't require any data conversions since it works on the Maple DataFrame object itself. To do so, I need to set kernelopts(opaquemodules = false) in order to work directly with the internal DataFrame data object (you could also make a bunch of conversions to intermediate Matrices and Vectors, but this approach limits the amount of duplicate internal data being created):
DSort := proc( self::{DataFrame,DataSeries}, {ByColumn := NULL} )
  local i, opacity, orderindex;
  opacity := kernelopts('opaquemodules' = false):
  if type( self, ':-DataFrame' ) and ByColumn <> NULL then
    orderindex := sort( self[ByColumn]:-data, ':-output' = ':-permutation', _rest );
  elif type( self, ':-DataSeries' ) and ByColumn = NULL then
    orderindex := sort( self:-data, ':-output' = ':-permutation', _rest );
  else
    return self;
  end if;
  kernelopts(opaquemodules = opacity): # Set opaquemodules back to original setting
  if type( self, ':-DataFrame' ) then
    return DataFrame( self[ orderindex, .. ] );
  else
    return DataSeries( self[ orderindex ] );
  end if;
end proc:
For example:
DSort( Mydata, ByColumn=Value );
returns:
[ Value Color ]
[ ]
[4 0 "Orange"]
[ ]
[2 1 "Blue" ]
[ ]
[1 2 "Red" ]
[ ]
[3 3 "Green" ]
This also works on strings, so DSort( Mydata, ByColumn=Color ); returns:
[ Value Color ]
[ ]
[2 1 "Blue" ]
[ ]
[3 3 "Green" ]
[ ]
[4 0 "Orange"]
[ ]
[1 2 "Red" ]
In this procedure, I pass additional arguments to the sort command, which means that you can also add in the ascending or descending options, so you could do DSort( Mydata, ByColumn=Value, `>` ); to return the DataFrame in descending 'Value' order (this doesn't seem to play well with strings though).

How do I sort a coffeescript Map object by property value?

I have a list of articles each of which has a simple string array of Tags. I count the tag frequency like this:
Count the Tag Frequency
getTags = (articles) ->
  tags = {}
  for article in articles
    for tag in article.Tags
      tags[tag] = (tags[tag] or 0) + 1
  tags
Example Result
The tags map produced is an object with property names set to the tag name and property values set to the frequency count.
Question
I would like to order this list by the property value (the frequency count), how can I achieve this?
Note: I am happy to change the counting method if required
Edit 1
Thanks to @LeonidBeschastny I now have working code:
getTags = (articles) ->
  tags = {}
  for article in articles
    for tag in article.Tags
      tags[tag] = (tags[tag] or 0) + 1
  tags = do (tags) ->
    keys = Object.keys(tags).sort (a, b) -> tags[b] - tags[a]
    {name, count: tags[name]} for name in keys
  tags
You can see that I am having to project the unsorted tags map object into a new array of sorted {name:value} objects.
This feels like it is too much work and I think maybe the original unsorted object was a mistake. Is there a way to get to the sorted array without going through this intermediate step?
Edit 2
Thanks to @hpaulj for doing some time tests and discovering that the code above is actually reasonably efficient compared to other potential solutions, such as a running sorted heap.
I have now put this code into production and it is working well.
You may sort your tags using Array::sort and then rebuild the tags object:
tags = do (tags) ->
  res = {}
  keys = Object.keys(tags).sort (a, b) -> tags[b] - tags[a]
  for k in keys
    res[k] = tags[k]
  res
Update
As for insertion order, mu is too short is right, it's not guaranteed by ECMA specification. V8 maintains it for literal (non-numerical) keys, but I'm not so sure about other JS engines.
So, the right solution is to use arrays anyway:
tags = do (tags) ->
  keys = Object.keys(tags).sort (a, b) -> tags[b] - tags[a]
  {name, count: tags[name]} for name in keys
Using a heapq. This is more complex than simply counting followed by sorting, but may be useful if we need a running sorted count.
Using the Coffeescript translation of Python heapq, https://github.com/qiao/heap.js
heap = require './heap'

# adapted from
# http://docs.python.org/2/library/heapq.html#priority-queue-implementation-notes

pq = []            # list of entries arranged in a heap
entry_finder = {}  # mapping of tasks to entries
REMOVED = '<removed-task>'
counter = [0]

remove_task = (task) ->
  # Mark an existing task as REMOVED. Return null if not found.
  entry = entry_finder[task]
  if entry?
    delete entry_finder[task]
    entry[entry.length-1] = REMOVED
  return entry

count_task = (task) ->
  entry = remove_task(task)
  if entry?
    [priority, count, _] = entry
    priority += 1
  else
    counter[0] += 1
    count = counter[0]
    priority = 1
  entry = [priority, count, task]
  entry_finder[task] = entry
  heap.push(pq, entry)

console.log h = ['one','two','one','three','four','two','one']
for task in h
  count_task(task)
console.log entry_finder
console.log pq

alist = heap.nlargest(pq, 10)
for x in alist
  [priority, count, task] = x
  if task != REMOVED
    console.log task, priority, count
produces
[ 'one', 'two', 'one', 'three', 'four', 'two', 'one' ]
{ three: [ 1, 3, 'three' ],
four: [ 1, 4, 'four' ],
two: [ 2, 2, 'two' ],
one: [ 3, 1, 'one' ] }
[ [ 1, 1, '<removed-task>' ],
[ 1, 2, '<removed-task>' ],
[ 2, 1, '<removed-task>' ],
[ 1, 3, 'three' ],
[ 1, 4, 'four' ],
[ 2, 2, 'two' ],
[ 3, 1, 'one' ] ]
one 3 1
two 2 2
four 1 4
three 1 3

What is the best in place sorting algorithm to sort a singly linked list

I've been reading about in-place algorithms for sorting linked lists. Per Wikipedia:
Merge sort is often the best choice for sorting a linked list: in this situation it is relatively easy to implement a merge sort in such a way that it requires only Θ(1) extra space, and the slow random-access performance of a linked list makes some other algorithms (such as quicksort) perform poorly, and others (such as heapsort) completely impossible.
To my knowledge, merge sort is not an in-place sorting algorithm: it has a worst-case auxiliary space complexity of O(n). With this taken into consideration, I am unable to decide whether there exists a suitable algorithm to sort a singly linked list with O(1) auxiliary space.
As pointed out by Fabio A. in a comment, the sorting algorithm implied by the following implementations of merge and split in fact requires O(log n) extra space in the form of stack frames to manage the recursion (or their explicit equivalent). An O(1)-space algorithm is possible using a quite different bottom-up approach.
Here's an O(1)-space merge algorithm that simply builds up a new list by moving the lower item from the top of each list to the end of the new list:
struct node {
    WHATEVER_TYPE val;
    struct node* next;
};

node* merge(node* a, node* b) {
    node* out;
    node** p = &out;  // Address of the next pointer that needs to be changed
    while (a && b) {
        if (a->val < b->val) {
            *p = a;
            a = a->next;
        } else {
            *p = b;
            b = b->next;
        }
        // Next loop iter should write to final "next" pointer
        p = &(*p)->next;
    }
    // At least one of the input lists has run out.
    if (a) {
        *p = a;
    } else {
        *p = b;  // Works even if b is NULL
    }
    return out;
}
It's possible to avoid the pointer-to-pointer p by special-casing the first item to be added to the output list, but I think the way I've done it is clearer.
Here is an O(1)-space split algorithm that simply breaks a list into 2 equal-sized pieces:
node* split(node* in) {
    if (!in) return NULL;  // Have to special-case a zero-length list
    node* half = in;       // Invariant: half != NULL
    while (in) {
        in = in->next;
        if (!in) break;
        half = half->next;
        in = in->next;
    }
    node* rest = half->next;
    half->next = NULL;
    return rest;
}
Notice that half is only moved forward half as many times as in is. Upon this function's return, the list originally passed as in will have been changed so that it contains just the first ceil(n/2) items, and the return value is the list containing the remaining floor(n/2) items.
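For completeness, the bottom-up O(1)-extra-space variant mentioned at the top can be sketched in Python (a hypothetical Node class of my own; runs of doubling width are merged iteratively, so no recursion stack is needed):

```python
class Node:
    def __init__(self, val, nxt=None):
        self.val = val
        self.next = nxt

def merge(a, b):
    # Same O(1)-space merge idea as the C version: splice existing nodes
    # onto a growing output list instead of allocating new ones.
    dummy = Node(None)
    tail = dummy
    while a and b:
        if a.val < b.val:
            tail.next, a = a, a.next
        else:
            tail.next, b = b, b.next
        tail = tail.next
    tail.next = a or b
    return dummy.next

def cut(head, k):
    # Detach the first k nodes and return the head of the remainder.
    while head and k > 1:
        head = head.next
        k -= 1
    if not head:
        return None
    rest, head.next = head.next, None
    return rest

def sort_list(head):
    # Bottom-up merge sort: merge runs of width 1, 2, 4, ... using only a
    # constant number of pointers -- no recursion, so O(1) auxiliary space.
    n = 0
    node = head
    while node:
        n, node = n + 1, node.next
    width = 1
    while width < n:
        dummy = Node(None)
        tail = dummy
        cur = head
        while cur:
            left = cur
            right = cut(left, width)  # first run of `width` nodes
            cur = cut(right, width)   # second run; cur moves past both
            tail.next = merge(left, right)
            while tail.next:
                tail = tail.next
        head = dummy.next
        width *= 2
    return head

# Usage:
head = Node(3, Node(1, Node(2)))
head = sort_list(head)  # head now points at 1 -> 2 -> 3
```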
This somehow reminds me of my answer to a Dutch National Flag Problem question.
After giving it some thought, this is what I came up with; let's see if it works out. I suppose the main problem is the merging step of the merge sort in O(1) extra space.
Our representation of a linked-list:
[ 1 ] => [ 3 ] => [ 2 ] => [ 4 ]
^head ^tail
You end up with this merging step:
[ 1 ] => [ 3 ] => [ 2 ] => [ 4 ]
^p ^q ^tail
Being p and q the pointers for the segments we want to merge.
Simply add your nodes after the tail pointer. If *p <= *q you add p at the tail.
[ 1 ] => [ 3 ] => [ 2 ] => [ 4 ] => [ 1 ]
^p ^pp ^q/qq ^tail ^tt
Otherwise, add q.
[ 1 ] => [ 3 ] => [ 2 ] => [ 4 ] => [ 1 ] => [ 2 ]
^p ^pp ^q ^qq/tail ^tt
(Keeping track of the ending of our list q becomes tricky)
Now, if you move them you will rapidly lose track of where you are. You can beat this by having a clever way to move your pointers or by adding the lengths into the equation. I definitely prefer the latter. The approach becomes:
[ 1 ] => [ 3 ] => [ 2 ] => [ 4 ]
^p(2) ^q(2) ^tail
[ 3 ] => [ 2 ] => [ 4 ] => [ 1 ]
^p(1) ^q(2) ^tail
[ 3 ] => [ 4 ] => [ 1 ] => [ 2 ]
^p(1) ^q(1) ^tail
[ 4 ] => [ 1 ] => [ 2 ] => [ 3 ]
^p(0)/q(1) ^tail
[ 1 ] => [ 2 ] => [ 3 ] => [ 4 ]
^q(0) ^tail
Now, you use that O(1) extra space to be able to move your elements.
