ElasticSearch sort buckets based on a list

In a terms aggregation query, I want to order the buckets based on two criteria:
an external sorted list: if the bucket's key exists in this external sorted list, the bucket should follow the list order
the bucket's internal aggregations, like sum: when the key does not exist in the external sorted list, use these aggregations
Example: Let's say in ES we get the buckets for A1, A2, A3, A4. The external sorted list is [A3, A1]. Each bucket has a sum aggregation as follows:
A1: 20
A2: 30
A3: 10
A4: 60
I want the final order as [A3, A1, A4, A2].
How do I write this query? I have seen this use case at the document level, where people have solved it with a script query. But I cannot find a way to do the same for sorting buckets.

The idea would be to use a bucket_sort pipeline aggregation with script-based sorting.
However, this is not yet supported, although there's an open issue that aims to tackle it.
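Until something like that is available, one workaround is to reorder the buckets on the client after the response comes back. The following is a minimal sketch, not an Elasticsearch feature: the function name order_buckets and the sub-aggregation name "total" are hypothetical, and it assumes that unlisted buckets should be sorted by their sum in descending order, as the expected result in the question suggests.

```python
def order_buckets(buckets, preferred_keys, metric="total"):
    """Put buckets whose key is in preferred_keys first (in list order),
    then the remaining buckets by their metric value, descending."""
    # Position of each key in the external sorted list.
    rank = {key: pos for pos, key in enumerate(preferred_keys)}
    listed = sorted(
        (b for b in buckets if b["key"] in rank),
        key=lambda b: rank[b["key"]],
    )
    unlisted = sorted(
        (b for b in buckets if b["key"] not in rank),
        key=lambda b: b[metric]["value"],
        reverse=True,
    )
    return listed + unlisted


# Buckets shaped like a terms aggregation response with a sum
# sub-aggregation (hypothetically named "total").
buckets = [
    {"key": "A1", "total": {"value": 20}},
    {"key": "A2", "total": {"value": 30}},
    {"key": "A3", "total": {"value": 10}},
    {"key": "A4", "total": {"value": 60}},
]

ordered = order_buckets(buckets, ["A3", "A1"])
print([b["key"] for b in ordered])  # ['A3', 'A1', 'A4', 'A2']
```

This only works if the terms aggregation returns every bucket you care about, so the size parameter has to be large enough to include the keys from the external list.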

Related

Optimization algorithm to restore order of heterogeneous list separated into two homogeneous lists

We started with a heterogeneous list of items of types A and B, recorded in a single physical run in a building, e.g.: a1, b1, b2, a2, a3, b3....
We then split the list into two separate homogeneous lists: a1, a2,... and b1, b2, ... (keeping the internal order of each group).
We're looking to merge the lists back, while keeping the original order as much as possible.
Each item has a location attribute. Usually:
Items in the same location are sequential, e.g. let's say a1 and b1 are both in the kitchen, then a1.location = "kitchen" and b1.location = "kitchen".
One item of type A is followed by one item of type B in the same location
Essentially we would like to maximize the number of A objects followed by a B object with the same location while keeping the internal order of each group, i.e. a(n) must be before a(n+1)
There can be multiple items from each type in the same location.
The number of items of each type in each location can be different or even 0.
How can I find an optimal ordering?
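One possible approach, which is not from the original post, is dynamic programming over how many items of each list have been placed so far. The sketch below assumes the objective is exactly "number of A items immediately followed by a B item with the same location", and that each list is given as a plain list of location strings. The function name max_matched_pairs is hypothetical. It returns only the best score; keeping back-pointers alongside the table would recover an actual ordering.

```python
def max_matched_pairs(a_locs, b_locs):
    """Best achievable count of 'A immediately followed by B in the same
    location' over all merges that keep each list's internal order."""
    n, m = len(a_locs), len(b_locs)
    NEG = float("-inf")
    # best[i][j][t]: best score after placing the first i A's and first j B's,
    # where the last placed item has type t (0 = A, 1 = B).
    best = [[[NEG, NEG] for _ in range(m + 1)] for _ in range(n + 1)]
    # Empty prefix: treat it like "last item was a B", so the next item
    # cannot earn a bonus.
    best[0][0][1] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:
                # Place a[i-1] next; an A never completes a pair.
                best[i][j][0] = max(best[i - 1][j])
            if j > 0:
                # Place b[j-1] after another B: no bonus.
                score = best[i][j - 1][1]
                if i > 0:
                    # Place b[j-1] right after a[i-1]: bonus if locations match.
                    bonus = 1 if a_locs[i - 1] == b_locs[j - 1] else 0
                    score = max(score, best[i][j - 1][0] + bonus)
                best[i][j][1] = score
    return max(best[n][m])


a_locs = ["kitchen", "kitchen", "hall"]
b_locs = ["kitchen", "hall", "hall"]
print(max_matched_pairs(a_locs, b_locs))
```

The table has (n + 1) * (m + 1) * 2 entries and each is filled in constant time, so the running time is proportional to the product of the two list lengths.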

Get neighboring results for an ElasticSearch query

Is it possible to get nearby results for an ElasticSearch query?
Example 1. If I have items, named as:
one
two
three
four
then a search for "three", ordered, for example, by name ascending, should return something like this (given that the number of neighbors is 1):
one
*three*
two
Example 2. I have query results and the ID of a document in them, and I want the IDs of the next and previous documents. The order is set in the query.

How to achieve Union All in Pig?

I have 3 data sets, each with 415 GB of data and each from a different domain.
I need to union all of them using Pig, but all I can use is the UNION clause, which launches reducers at the end of the job to remove duplicates.
a = union a1, a2;
data = union a, a3;
Is there a way to skip the reducer part, as the data is already distinct?
From the docs on UNION:
Use the UNION operator to merge the contents of two or more relations.
The UNION operator:
Does not preserve the order of tuples. Both the input and output relations are interpreted as unordered bags of tuples.
Does not ensure (as databases do) that all tuples adhere to the same schema or that they have the same number of fields. In a typical scenario, however, this should be the case; therefore, it is the user's responsibility to either (1) ensure that the tuples in the input relations have the same schema or (2) be able to process varying tuples in the output relation.
*Does not eliminate duplicate tuples.*
Emphasis is mine. This indicates to me there wouldn't need to be a reducer step to complete the UNION since it doesn't need to remove duplicate rows. Are you sure that the reducer job is a result of the UNION? It could be the result of another operator.
BONUS: You can simplify your example to:
B = UNION a1, a2, a3 ;

Merge Sorting 3 Sorted Arrays

Say I have 3 sorted arrays A1, A2, A3. I want to merge them using the merge step of merge sort. How would I find the runtime?
I can't even suggest a solution, I'm completely stuck...
Thanks!
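A minimal sketch of one way to do this, assuming the three arrays are merged pairwise with the standard two-way merge step (the function names merge_two and merge_three are hypothetical):

```python
def merge_two(left, right):
    """Standard merge step of merge sort: one linear pass over both inputs."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # At most one of these still has elements left.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_three(a1, a2, a3):
    """Merge A1 with A2 first, then merge that result with A3."""
    return merge_two(merge_two(a1, a2), a3)


print(merge_three([1, 4, 7], [2, 5, 8], [0, 3, 9]))
# [0, 1, 2, 3, 4, 5, 7, 8, 9]
```

Each call to merge_two does work proportional to the total length of its two inputs. With lengths n1, n2 and n3, the first merge costs about n1 + n2 steps and the second about n1 + n2 + n3, so the total is O(n1 + n2 + n3), which is linear in the combined input size.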

Alternative of ORDER BY in hive

When using ORDER BY in Hive, only a single reducer is used, so ORDER BY is inefficient. Is there any alternative to ORDER BY?
Regards,
Ratto
You will probably want to use the combination of DISTRIBUTE BY and SORT BY. DISTRIBUTE BY will ensure that all rows with the same value of the distribution key end up on the same reducer. SORT BY will then sort the data within each reducer.
For example:
SELECT a, b, c
FROM table
DISTRIBUTE BY a
SORT BY a, b;
ORDER BY will sort all of the data together, which is why it has to pass through one reducer.
SORT BY should do the trick. This will sort the data within each reducer, so the values for a given key will be in order, but the keys are not guaranteed to be in order. You can use any number of reducers for SORT BY.
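To illustrate the difference, here is a small Python simulation of what DISTRIBUTE BY a followed by SORT BY a, b does conceptually. It is a simplified sketch, not Hive's actual implementation: the function name distribute_and_sort is hypothetical, and assigning rows to reducers with hash(key) % num_reducers is an assumption made for the illustration.

```python
from collections import defaultdict


def distribute_and_sort(rows, num_reducers):
    """Simulate DISTRIBUTE BY a + SORT BY a, b on (a, b) tuples."""
    reducers = defaultdict(list)
    for row in rows:
        # DISTRIBUTE BY a: rows with the same a go to the same reducer.
        reducers[hash(row[0]) % num_reducers].append(row)
    # SORT BY a, b: each reducer sorts only its own rows.
    return [
        sorted(reducers[r], key=lambda row: (row[0], row[1]))
        for r in range(num_reducers)
    ]


rows = [(3, "x"), (1, "y"), (2, "z"), (1, "a"), (4, "q"), (3, "b")]
for reducer_id, part in enumerate(distribute_and_sort(rows, 2)):
    print(reducer_id, part)
```

In this simulation each reducer's output is sorted, but concatenating the outputs gives a values in the order 2, 4, 1, 1, 3, 3, which is not a global order. That is the trade-off against ORDER BY.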

Resources