Alternative to ORDER BY in Hive - hadoop

Using ORDER BY in Hive forces the query through a single reducer, which makes ORDER BY inefficient. Is there an alternative to ORDER BY?
Regards,
Ratto

You will probably want to use a combination of DISTRIBUTE BY and SORT BY. DISTRIBUTE BY ensures that all rows with the same value of the distribution key end up on the same reducer. SORT BY then sorts the data within each reducer.
For example:
SELECT a, b, c
FROM table
DISTRIBUTE BY a
SORT BY a, b
ORDER BY will sort all of the data together, which is why it has to pass through one reducer.
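The effect of DISTRIBUTE BY followed by SORT BY can be imitated outside Hive. The following is only a rough Python sketch, not Hive's actual implementation: the function name, the lambda keys and the use of Python's built-in hash for partitioning are all assumptions made for illustration.

```python
from collections import defaultdict


def distribute_and_sort(rows, num_reducers, dist_key, sort_key):
    """Imitate DISTRIBUTE BY (hash partitioning) followed by SORT BY (sort per reducer)."""
    partitions = defaultdict(list)
    for row in rows:
        # Rows with the same distribution key always land in the same partition.
        partitions[hash(dist_key(row)) % num_reducers].append(row)
    # Each "reducer" sorts only its own rows; there is no global order.
    return [sorted(partitions[i], key=sort_key) for i in range(num_reducers)]


rows = [(3, "x"), (1, "b"), (3, "a"), (2, "c"), (1, "a")]
parts = distribute_and_sort(
    rows,
    num_reducers=2,
    dist_key=lambda r: r[0],        # DISTRIBUTE BY a
    sort_key=lambda r: (r[0], r[1]),  # SORT BY a, b
)
for part in parts:
    print(part)
```

Every partition comes out sorted, and no key is split across partitions, but concatenating the partitions does not in general give a globally sorted result. That global order is what ORDER BY provides, at the price of a single reducer.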

SORT BY should do the trick. It sorts the data within each reducer, so each reducer's output is in order, but there is no guaranteed order across reducers. You can use any number of reducers with SORT BY.

Related

Database index that supports arbitrary sort order

Is there a database index type (or a data structure in general, not just a B-tree) that provides efficient enumeration of objects sorted in an arbitrarily customizable order?
To execute a query like the one below efficiently,
select *
from sample
order by column1 asc, column2 desc, column3 asc
offset :offset rows fetch next :pagesize rows only
DBMSes usually require a composite index on the fields mentioned in the "order by" clause, in the same order and with the same asc/desc directions, i.e.
create index c1_asc_c2_desc_c3_asc on sample(column1 asc, column2 desc, column3 asc)
The order of the index columns does matter, and the index can't be used if the order of the columns in the "order by" clause does not match.
To make queries with every possible "order by" clause efficient, we could create an index for every possible combination of sort columns. But this is not feasible, since the number of indexes grows at least exponentially with the number of sort columns.
Namely, if k is the number of sort columns, there are k! permutations of the sort columns and 2^k combinations of asc/desc directions, so the number of indexes is k!·2^(k-1). (We use 2^(k-1) instead of 2^k because we assume the DBMS is smart enough to scan the same index in both forward and reverse directions, but unfortunately this doesn't help much.)
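To get a feel for how quickly k!·2^(k-1) grows, here is a small Python calculation. The function name index_count is made up for this illustration.

```python
from math import factorial


def index_count(k):
    """Number of plain composite indexes needed to cover every ORDER BY over k columns.

    k! column permutations times 2^(k-1) asc/desc patterns (an index and its
    exact reverse are counted once, assuming the DBMS can scan backwards).
    """
    return factorial(k) * 2 ** (k - 1)


for k in range(1, 7):
    print(k, index_count(k))
```

For k=3 this gives 3!·2^2 = 24, and for k=5 it is already 1920.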
So, I wish to have something like
create universal index c1_c2_c3 on sample(column1, column2, column3)
that would have the same effect as 24 (k=3) plain indexes (covering every "order by") but consume a reasonable amount of disk/memory space. By reasonable I mean that O(k·n) is OK, where k is the number of sort columns and n is the number of rows/entries, assuming that an ordinary index consumes O(n). In other words, a universal index with k sort columns should consume approximately as much as k ordinary indexes.
What I want looks to me like a multidimensional index, but when I googled this term I found pages that relate to either
ordinary composite indexes, which are not what I need for the obvious reason; or
spatial structures like the k-d tree, quadtree/octree, R-tree and so on, which are more suitable for nearest-neighbor search than for sorting.

How to join two already sorted arrays into one sorted array in Microsoft Flow efficiently

Microsoft Flow doesn't support any sort function for arrays or lists.
For my problem I can use the sort function within an OData request to have some data presorted by the databases I'm accessing. In my case, I want to have a list of all start and end dates from a SharePoint calendar in a single array.
I can pull all dates sorted by the start date and I can pull all dates sorted by the end date into separate arrays. Now I have two sorted arrays which I want to join into a single array.
There are very few possibilities for iterating over an array, but the task has some properties which could ease the problem:
There are two arrays.
Both are presorted by the same property as the desired final array.
Both have the same size.
Perhaps I'm missing some feature of the OData request or there's a simple workaround. I'd prefer not to use the REST API or to mess around with the JSON manually, but if there's a really elegant solution I won't reject it.
I have a solution, but I don't think it is a good one.
Prerequisites are the two already sorted arrays and two additional arrays.
Let's call the two sorted arrays I have extracted from the SharePoint list arrays A and B,
and let's call the additional arrays S1 and S2.
Then I set up a foreach loop on array B.
Within that loop I filter array A for all elements less than or equal to the current item of array B.
The output of the filter operation is saved to array S1.
The current item of array B is appended to array S1.
Again I filter array A, but this time for all elements greater than the current item of array B.
The output of this filter operation is saved to array S2.
I make a union of S1 and S2.
The output of the union expression is saved to array A.
As every element of array A has to be copied n times for an n-element array, the effort for processing two arrays of n elements is not optimal, especially considering that both arrays are already sorted in advance:
n² comparisons
2n²+n copy operations (not taking into account the imperfections of the implementation of flow)
If I implemented a complete sort from scratch it would perform better, I think, but I also think there must be other means to join two presorted arrays of compatible content.
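For comparison, the textbook way to join two presorted arrays is a two-pointer merge, which needs at most 2n-1 comparisons and copies each element once. The sketch below is plain Python, not a Flow expression; whether Flow's loop and variable actions can express it conveniently is an open assumption, and the function name and sample dates are made up.

```python
def merge_sorted(a, b):
    """Merge two already sorted lists into one sorted list in a single pass."""
    merged = []
    i = j = 0
    # Repeatedly take the smaller of the two current front elements.
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    # One input is exhausted; the rest of the other is already in order.
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged


# Hypothetical ISO-formatted dates, which sort correctly as plain strings.
starts = ["2024-01-01", "2024-01-05", "2024-02-01"]
ends = ["2024-01-02", "2024-01-07", "2024-02-03"]
print(merge_sorted(starts, ends))
```

Each loop iteration moves exactly one element into the result, so the work grows linearly with the total number of elements rather than quadratically.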

Select and filter algorithm

I would like to select the top n values from a dataset, but ignore elements based on what I have already selected - i.e., given a set of points (x,y), I would like to select the top 100 values of x (which are all distinct) but not select any points such that y equals the y of any already-selected point. I would like to make sure that the highest values of x are prioritized.
Is there any existing algorithm for this, or at least similar ones? I have a huge amount of data and would like to do this as efficiently as possible. Memory is not as much of a concern.
You can do this in O(n log k) time, where n is the number of values in the dataset and k is the number of top values you'd like to get.
Store the values you wish to exclude in a hash table.
Make an empty min-heap.
Iterate over all of the values and for each value:
If it is in the hash table skip it.
If the heap contains fewer than k values, add it to the heap.
If the heap contains k or more values and the value you're looking at is greater than the smallest member of the min-heap, pop that smallest value and add the new one.
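The steps above can be sketched in Python with the standard heapq module. This follows the answer as written, so it assumes the set of excluded values is known up front; the function name and sample data are invented for the example.

```python
import heapq


def top_k_excluding(values, excluded, k):
    """Return the k largest values not in `excluded`, largest first."""
    if k <= 0:
        return []
    heap = []  # min-heap holding the best candidates seen so far
    for v in values:
        if v in excluded:
            continue  # skip values found in the hash table
        if len(heap) < k:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            # heap[0] is the smallest kept value; swap it for the better one.
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)


print(top_k_excluding([5, 1, 9, 7, 3, 8], excluded={9}, k=3))
```

The heap never holds more than k items, which is where the O(n log k) bound comes from.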
I will share my thoughts. Since the author has not specified the scale of the data to be processed, I will assume that it is too large to be handled by a single machine, and I will also assume that the author is familiar with Hadoop.
So I would suggest using MapReduce as follows:
Mappers simply emit pairs (x,y).
Combiners select the k pairs with the largest values of x (k=100 in this case), maintaining the unique y's in a hash set to avoid duplicates, then emit the k pairs found.
There should be only one reducer in this job, since it has to get all pairs from the combiners to finalize the job by selecting k pairs one last time. The reducer's implementation is identical to the combiner's.
The number of combiners should be chosen with regard to the memory needed to select the top k pairs out of the incoming dataset, since whichever method is used (sorting, heap or anything else) the selection is done in memory, as is keeping the hash set of unique y's.
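The combiner and reducer logic described here can be imitated in a single Python process. This is only a sketch of the selection logic, not Hadoop code: the function names are made up, the "chunks" stand in for the input splits seen by each combiner, and sorting is used for brevity where a heap would also work.

```python
def select_top_k_unique_y(pairs, k):
    """Pick up to k (x, y) pairs with the largest x, keeping at most one pair per y."""
    if k <= 0:
        return []
    seen_y = set()
    selected = []
    # Visit pairs from largest x to smallest so the highest x wins for each y.
    for x, y in sorted(pairs, key=lambda p: p[0], reverse=True):
        if y in seen_y:
            continue
        seen_y.add(y)
        selected.append((x, y))
        if len(selected) == k:
            break
    return selected


def run_job(chunks, k):
    """Apply the selection per chunk (combiner), then once more on the union (reducer)."""
    combined = []
    for chunk in chunks:
        combined.extend(select_top_k_unique_y(chunk, k))
    return select_top_k_unique_y(combined, k)


chunks = [
    [(10, "a"), (9, "a"), (4, "b")],
    [(8, "c"), (7, "b"), (6, "d")],
]
print(run_job(chunks, k=3))
```

Because the pair that survives for each y is always the one with the largest x, running the same selection on each chunk and then on the combined output gives the same result as one pass over all the data.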

Iterating over unordered_map C++

Is it true that keys inserted in a particular order into an unordered_map will come out in the same order when iterating over the map with an iterator?
For example, if we insert (4,3), (2,5), (6,7) into B
and iterate like:
for (auto it = B.begin(); it != B.end(); ++it) {
    cout << it->first;
}
will it give us 4, 2, 6, or may the keys come in any order?
From the cplusplus.com page about the begin member function of unordered_map:
Notice that an unordered_map object makes no guarantees on which specific element is considered its first element.
So no, there is no guarantee the elements will be iterated over in the order they were inserted.
FYI, you can iterate over an unordered_map more simply:
for (const auto& kv : B) {
    // Do stuff
    cout << kv.first;
}
Information added to the answer provided by @Aimery:
Unordered map is an associative container that contains key-value pairs with unique keys. Search, insertion, and removal of elements have average constant-time complexity.
Internally, the elements are not sorted in any particular order but organized into buckets. Which bucket an element is placed into depends entirely on the hash of its key. This allows fast access to individual elements since once the hash is computed, it refers to the exact bucket the element is placed into.
See the ref. from https://en.cppreference.com/w/cpp/container/unordered_map.
According to an answer by Sumod Mathilakath on Quora:
If you prefer to keep intermediate data in sorted order, use std::map<key,value> instead of std::unordered_map. It sorts on the key by default using std::less<>, so you will get the result in ascending order.
std::unordered_map is an implementation of the hash table data structure, so it arranges the elements internally according to their hash values. std::map, in contrast, is usually implemented as a red-black tree.
See the ref. from What will be order of key in unordered_map in c++ and why?.
So I think the answer is now clearer.

Sorting a table in Lua based on an inner table's value

So currently I have a table in Lua that contains another table (much like a hashtable). It's called email_table, and I have my person_table inside it. The email_table's keys are email_addresses and the person_table holds all information about a person.
Currently what I'm trying to do is sort my email_table based on a value that's inside of person_table. The built in sort function for Lua does not work with such values unfortunately. How would I get started?
You cannot sort something that isn't an array. If your keys aren't monotonically increasing integers, then you can't sort it. Sorting implies order, and there is no ordering on non-integer keys of tables.
If "The email_table's keys are email_addresses", then email_table cannot be sorted. You can have another table that is a sorted list of email addresses. But this must be a list: the keys must be monotonically increasing integer values (1, 2, 3, 4, etc). Those have an explicit order.
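The idea of keeping a separate sorted list of email addresses can be sketched as follows. The example is in Python rather than Lua, with a dict standing in for the Lua table. The "age" field and the sample data are hypothetical, since the question does not say which inner value to sort on.

```python
# Keyed by email address, like email_table in the question.
email_table = {
    "carol@example.com": {"name": "Carol", "age": 41},
    "alice@example.com": {"name": "Alice", "age": 29},
    "bob@example.com": {"name": "Bob", "age": 35},
}

# A separate list of keys, ordered by a value inside each person's record.
sorted_emails = sorted(email_table, key=lambda email: email_table[email]["age"])

for email in sorted_emails:
    print(email, email_table[email]["age"])
```

The original table is left untouched; only the list of keys has an order, and it is used to walk the table in that order.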
