What is an efficient way of sorting columnar data?

In my program, I need an efficient way of sorting in-memory columnar data.
Let me explain this problem.
The data consists of four objects:
(1, 15, 'apple'), (9, 27, 'pear'), (7, 38, 'banana'), (4, 99, 'orange')
And the four objects are kept in memory in a columnar layout, which looks like this:
[1, 9, 7, 4], [15, 27, 38, 99], ['apple', 'pear', 'banana', 'orange']
I need to sort this list according to the second column in ascending order and the third in descending order.
With only one column, this is a simple sorting problem.
But the situation is different when two or more columns exist for in-memory columnar data.
The swap function may incur too much overhead when sorting columnar data.
I've checked several open-source implementations to find the best practice, e.g., Apache Arrow, Presto, TDengine, and other projects.
I found that index sort avoids the overhead introduced by swapping, since only the indices, rather than the columnar data, are swapped.
I'm wondering: is index sort the most efficient way to handle this problem?

If you want speed, then C++ is one of the fastest languages for this.
You can use std::sort with the parallel execution policy std::execution::par_unseq, which enables multi-threaded, multi-core parallel sorting.
As you can see in my code below, I did an arg-sort because you asked for it. But in C++ it is not really necessary; a regular sort is enough, for two reasons.
One is cache locality: swapping the data elements themselves instead of indices is faster, because sorting algorithms are more or less cache friendly, meaning that swaps and comparisons often involve nearby-in-memory elements.
The second is that element swaps in std::sort go through std::swap, which in turn uses std::move; this move swaps classes like std::string very efficiently by exchanging only the pointers to the data instead of copying the data itself.
From the above it follows that doing an arg-sort instead of a regular sort might even be slower.
The following code snippet first creates a small example file with the data tuples you provided. In a real use case you should remove this file-writing code so that you don't overwrite your own file. At the end, it writes the sorted result to a new file.
After the program finishes, see the created files data.in and data.out.
Try it online!
#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <tuple>
#include <execution>
#include <algorithm>

int main() {
    {
        std::ofstream f("data.in");
        f << R"(
7, 38, banana
9, 27, pear
4, 99, orange
1, 15, apple
)";
    }
    std::vector<std::tuple<int, int, std::string>> data;
    {
        std::ifstream f("data.in");
        std::string c;
        while (true) {
            int a = 0, b = 0;
            char comma = 0;
            c.clear();
            f >> a >> comma >> b >> comma >> c;
            if (c.empty() && !f)
                break;
            data.push_back({a, b, c});
        }
    }
    std::vector<size_t> idx(data.size());
    for (size_t i = 0; i < idx.size(); ++i)
        idx[i] = i;
    std::sort(std::execution::par_unseq, idx.begin(), idx.end(),
        [&data](auto const & i, auto const & j){
            auto const & [_0, x0, y0] = data[i];
            auto const & [_1, x1, y1] = data[j];
            if (x0 < x1)
                return true;
            else if (x0 == x1)
                return y0 > y1;
            else
                return false;
        });
    {
        std::ofstream f("data.out");
        for (size_t i = 0; i < idx.size(); ++i) {
            auto const & [x, y, z] = data[idx[i]];
            f << x << ", " << y << ", " << z << std::endl;
        }
    }
}
Input:
7, 38, banana
9, 27, pear
4, 99, orange
1, 15, apple
Output:
1, 15, apple
9, 27, pear
7, 38, banana
4, 99, orange

Normally this kind of activity is the duty of a database.
A database can store data in any sorted order; then there is no need to sort the data every time you want to retrieve it. If you use SQL Server as the database, you can create a table like:
Create Table TableName (
    FirstColumn int not null,
    SecondColumn int not null,
    ThirdColumn nvarchar(1000) not null,
    Primary Key(SecondColumn ASC, ThirdColumn DESC))
The most important part is the clustered index, which I chose to be the combination of SecondColumn ascending and ThirdColumn descending, as you requested in the question. Why? Because the clustered index (here, the primary key) specifies the physical order of the data on disk, so the data is already stored sorted in your preferred order. You can also add nonclustered indexes that cover your other queries, so that queries whose sort order differs from the physical order still retrieve fast enough.
Be aware that a bad design performs poorly, so if you are not familiar with database design, get help from database developers or administrators.

Related

CUDA: Sort an array according to the order defined by another array using thrust

I have 10 arrays. I want to sort them. But since their elements have the same behavior, I want to save computation and sort only one; the others will be reordered based on the sorted array.
I'm using thrust.
Is there an optimal way to do it?
Thank you in advance.
From the comments, my suggestion was:
Use thrust::sort_by_key on the first data set (array), passing the first data set as the keys, and an index sequence (0, 1, 2, ...) as the values. Then use the rearranged index sequence in a thrust gather or scatter operation to rearrange the remaining arrays.
As requested, here is a worked example:
$ cat t282.cu
#include <thrust/sort.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <thrust/sequence.h>
#include <iostream>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/zip_iterator.h>
const size_t ds = 5;
typedef float ft;
int main(){

  ft a1[ds] = {0.0f, -3.0f, 4.0f, 2.0f, 1.0f};
  // data setup
  thrust::device_vector<ft> d_a1(a1, a1+ds);
  thrust::device_vector<ft> d_a2(ds);
  thrust::device_vector<ft> d_a3(ds);
  thrust::device_vector<ft> d_a2r(ds);
  thrust::device_vector<ft> d_a3r(ds);
  thrust::device_vector<size_t> d_i(ds);
  thrust::sequence(d_i.begin(), d_i.end());
  thrust::sequence(d_a2.begin(), d_a2.end());
  thrust::sequence(d_a3.begin(), d_a3.end());
  // sort
  thrust::sort_by_key(d_a1.begin(), d_a1.end(), d_i.begin());
  // copy, using sorted indices
  thrust::copy_n(thrust::make_permutation_iterator(
                   thrust::make_zip_iterator(thrust::make_tuple(d_a2.begin(), d_a3.begin())),
                   d_i.begin()),
                 ds,
                 thrust::make_zip_iterator(thrust::make_tuple(d_a2r.begin(), d_a3r.begin())));
  // output results
  thrust::host_vector<ft> h_a1 = d_a1;
  thrust::host_vector<ft> h_a2 = d_a2r;
  thrust::host_vector<ft> h_a3 = d_a3r;
  std::cout << "a1: ";
  thrust::copy_n(h_a1.begin(), ds, std::ostream_iterator<ft>(std::cout, ","));
  std::cout << std::endl << "a2: ";
  thrust::copy_n(h_a2.begin(), ds, std::ostream_iterator<ft>(std::cout, ","));
  std::cout << std::endl << "a3: ";
  thrust::copy_n(h_a3.begin(), ds, std::ostream_iterator<ft>(std::cout, ","));
  std::cout << std::endl;
}
$ nvcc -o t282 t282.cu
$ cuda-memcheck ./t282
========= CUDA-MEMCHECK
a1: -3,0,1,2,4,
a2: 1,0,4,3,2,
a3: 1,0,4,3,2,
========= ERROR SUMMARY: 0 errors
$
Here, in lieu of a thrust::gather or thrust::scatter operation, I'm simply doing a thrust::copy_n with a thrust::permutation_iterator, in order to effect the reordering. I combine the remaining arrays to be reordered using thrust::zip_iterator, but this isn't the only way to do it.
Note that I'm not doing it for 10 arrays but for 3, however this should illustrate the method. The extension to 10 arrays should be just mechanical. Note however that the method would have to be modified somewhat for more than 10-11 arrays, as thrust::tuple is limited to 10 items. As a modification, you could simply call thrust::copy_n in a loop, once for each array to be reordered, rather than using zip_iterator. I don't think this should make a large difference in efficiency.
A few ways to go about doing this (irrespective of Thrust):

1. Arg-sort:
1.1. Initialize an array of indices, indices, to 0, 1, 2, 3, etc.
1.2. Sort indices, with the comparison function accessing elements in one of the arrays (the one for which comparisons are cheapest) and comparing those.
1.3. For each one of the 10 arrays arr, apply a gather operation using the sorted indices and arr as the data to gather from, i.e. sorted_arr[i] = arr[indices[i]] for all i.

2. Index log-keeping:
2.1. Adapt one of the sort implementations to also do "index log-keeping", i.e. whenever you swap or position data in the "real" array, also set an index in an indices array.
2.2. Run this index-keeping sort on one of the 10 arrays (the one that's cheapest to sort).
2.3. Apply the gather from 1.3 to the other 9 arrays.

3. Pair sort:
3.1. Let cheap be the cheapest array to sort (or to compare elements of).
3.2. Create an array of pairs pairs[i] = { i, cheap[i] } of the appropriate type.
3.3. Have the comparison of these pairs use only the second element of the pair.
3.4. Sort pairs.
3.5. Project pairs onto its first element: indices[i] = pairs[i].first.
3.6. Project pairs onto its second element: sorted_cheap[i] = pairs[i].second.
3.7. Apply the gather from 1.3 to the other nine arrays.

The second option should be the fastest, but would require more effort; and with Thrust it's probably quite difficult. Either the first or the third should be the easiest, and Thrust accepts custom comparators, so if needed you can define a functor with the appropriate comparison.

why is the standard merge sort not in place?

In the merge step of merge sort, I don't understand why we have to use the auxiliary arrays L and R. Why can't we just keep two pointers tracking which elements we're comparing in the two subarrays L and R, so that the merge-sort algorithm remains in place?
Thanks.
Say you split your array such that L uses the first half of the original array and R uses the second half.
Then say that during the merge, the first few elements of R are smaller than the smallest element of L. If you want to put them in the correct place in the merge result, you will have to overwrite elements of L that have not yet been processed in the merge step.
Of course you can make a different split, but you can always construct such a (then slightly different) counterexample.
My first post here. Be gentle!
Here's my solution for a simple and easy-to-understand stable in-place merge sort. I wrote it yesterday. I'm not sure it hasn't been done before, but I've not seen it around, so maybe?
The one drawback is that the following in-place merge algorithm can degenerate into O(n²) under certain conditions, though it is typically O(n·log₂n) in practice. The degeneracy can be mitigated with certain changes, but I wanted to keep the base algorithm pure in the code sample so it can be easily understood.
Coupled with the O(log₂n) recursion depth of the driving merge_sort() function, this gives a typical overall time complexity of O(n·(log₂n)²), and O(n²·log₂n) in the worst case. That is not fantastic, but with some tweaks it can be made to run in O(n·(log₂n)²) almost always, and thanks to its good CPU cache locality it is decent even for n up to 1M. It will always be slower than quicksort, though.
// Stable Merge In Place Sort
//
// The following code is written to illustrate the base algorithm. A good
// number of optimizations can be applied to boost its overall speed.
// For all its simplicity, it does still perform somewhat decently.
// Average case time complexity appears to be: O(n.(log₂n)²)

#include <stddef.h>
#include <stdio.h>

#define swap(x, y) (t=(x), (x)=(y), (y)=t)

// Both sorted sub-arrays must be adjacent in 'a'
// Assumes that both 'an' and 'bn' are always non-zero
// 'an' is the length of the first sorted section in 'a', referred to as A
// 'bn' is the length of the second sorted section in 'a', referred to as B
static void
merge_inplace(int A[], size_t an, size_t bn)
{
    int t, *B = &A[an];
    size_t pa, pb; // Swap partition pointers within A and B

    // Find the portion to swap. We're looking for how much from the
    // start of B can swap with the end of A, such that every element
    // in A is less than or equal to any element in B. This is quite
    // simple when both sub-arrays come at us pre-sorted
    for (pa = an, pb = 0; pa > 0 && pb < bn && B[pb] < A[pa-1]; pa--, pb++);

    // Now swap the last part of A with the first part of B according to the
    // indices we found
    for (size_t index = pa; index < an; index++)
        swap(A[index], B[index-pa]);

    // Now merge the two sub-array pairings. We need to check that either array
    // didn't wholly swap out the other and cause the remaining portion to be zero
    if (pa > 0 && (an-pa) > 0)
        merge_inplace(A, pa, an-pa);
    if (pb > 0 && (bn-pb) > 0)
        merge_inplace(B, pb, bn-pb);
} // merge_inplace

// Implements a recursive merge-sort algorithm (an insertion sort could be
// added for when the splits get too small). 'n' must ALWAYS be 2 or more.
// It enforces this when calling itself
static void
merge_sort(int a[], size_t n)
{
    size_t m = n/2;

    // Sort first and second halves only if the target 'n' will be > 1
    if (m > 1)
        merge_sort(a, m);
    if ((n-m) > 1)
        merge_sort(a+m, n-m);

    // Now merge the two sorted sub-arrays together. We know that since
    // n > 1, then both m and n-m MUST be non-zero, and so we will never
    // violate the condition of not passing in zero length sub-arrays
    merge_inplace(a, m, n-m);
} // merge_sort

// Print an array
static void
print_array(int a[], size_t size)
{
    if (size > 0) {
        printf("%d", a[0]);
        for (size_t i = 1; i < size; i++)
            printf(" %d", a[i]);
    }
    printf("\n");
} // print_array

// Test driver
int
main()
{
    int a[] = { 17, 3, 16, 5, 14, 8, 10, 7, 15, 1, 13, 4, 9, 12, 11, 6, 2 };
    size_t n = sizeof(a) / sizeof(a[0]);

    merge_sort(a, n);
    print_array(a, n);
    return 0;
} // main
If you ever tried to write a merge sort in place, you soon found out why you can't when merging the two subarrays: you basically need to read from and write to the same range of the array, and the values would overwrite each other. Hence we need an auxiliary array:
#include <vector>
using std::vector;

vector<int>& merge_sort(vector<int>& vs, int l, int r, vector<int>& temp)
{
    if (l == r) return vs; // recursion must have an end condition
    int m = (l + r) / 2;
    merge_sort(vs, l, m, temp);
    merge_sort(vs, m + 1, r, temp);
    int il = l, ir = m + 1, i = l;
    while (il <= m && ir <= r)
    {
        if (vs[il] <= vs[ir])
            temp[i++] = vs[il++];
        else
            temp[i++] = vs[ir++];
    }
    // copy leftover items (only one of the loops below will run)
    while (il <= m) temp[i++] = vs[il++];
    while (ir <= r) temp[i++] = vs[ir++];
    for (i = l; i <= r; ++i) vs[i] = temp[i];
    return vs;
}

Is row-major and column-major order really a property of a programming language

I think I have discovered a widespread misunderstanding (even professors get it wrong!). People say that C and C++ represent matrices in row-major order and Fortran in column-major order. But I doubt that C and C++ have a built-in major order, because there is no true matrix type. If I enter
int A[2][3] = { {1, 2, 3}
, {4, 5, 6} };
the order is row-major just because my editor is row-oriented rather than column-oriented. This has nothing to do with the language itself, or has it? If the editor were column-oriented:
i {
n { {
t 1 4
, ,
A 2 5
[ , ,
2 3 6
] } }
[ ;
3
]
=
Now the matrix A has two columns and three rows.
To illustrate further, consider a matrix print loop
for (int k = 0; k < M; ++k)
{
    for (int l = 0; l < N; ++l)
        { printf("%.7g\t", A[k][l]); }
    putchar('\n');
}
Why does it print by row? Because '\n' moves to the next row rather than the next column. If '\n' were interpreted as "go to the next column, first row" and '\t' as "go to the next row", then A would be printed column-wise. But I know that my terminal is row-oriented, so if I want to print column-wise, the only way is to swap the loops.
Whether A[k] logically represents a row or a column depends on the functions that operate on A, and there is a trade-off in choosing the order. For example, Gaussian elimination walks rows{column, rows{column}}. The advantage of placing the row index first is that it makes it easier to swap rows when pivoting. However, to perform the pivoting one has to loop through all rows in the same column, which should be faster with the opposite choice. The innermost elimination loop accesses two rows at a time, and neither order is really good for it.
A better terminology is probably first-index versus last-index ordering. This is a pure language feature: first-index ordering means the first given index increments slowest, while last-index ordering is the opposite. "Rows" and "columns" are an interpretation issue, much like byte order and character encodings: the compiler never knows what a row or column is, but it may have a language-defined input order (most languages happen to accept numeric constants with the most significant digit first, but my computer wants little endian). These terms come from conventions in the environment and library routines.
This has nothing to do with how your text editor works, and everything to do with how the elements of the 2D array are laid out in memory. That, in turn, determines whether nested loops over all the elements of the matrix are more efficient with the row loop or the column loop as the inner loop.
As one commenter suggested, it's really just the order of indices in the array access syntax that makes C row-major.
Here's a better example program that initialises a 2D array using a flat list of values.
#include <stdio.h>
#include <string.h>
int main() {
    int data[9] = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };
    int arr[3][3];
    memcpy(arr, data, sizeof(int)*9);
    printf("arr[0][1] = %d\n", arr[0][1]);
}
So now we can avoid any confusion added by the 2D array declaration syntax, or how that syntax is laid out in the text editor. We are just concerned with how C interprets the linear list of values that we have shoved into memory.
And if we run the program we will see:
$ ./a.out
arr[0][1] = 2
This is what makes C row-major: the array syntax is interpreted as [row][column] when accessing the data in memory.

Modified Parallel Scan

This is more of an algorithms question than a programming one. I'm wondering whether the prefix-sum (or any) parallel algorithm can be modified to accomplish the following. I'd like to generate a result from two input lists on a GPU in less than O(N) time.
The rule is: carry forward the first number from data until the same index in keys contains a lesser value.
Whenever I try mapping it to a parallel scan, it doesn't work, because I can't be sure which values of data to propagate in the upsweep: it's not possible to know which prior data might have carried far enough to compare against the current key. The problem reminds me of a ripple carry, where we need to consider the current index AND all past indices.
Again, I don't need code for a parallel scan (though that would be nice); I'm more looking to understand how it can be done, or why it can't.
int data[N] = {5, 6, 5, 5, 3, 1, 5, 5};
int keys[N] = {5, 6, 5, 5, 4, 2, 5, 5};
int result[N];
serial_scan(N, keys, data, result);
// Print result. Should be {5, 5, 5, 5, 3, 1, 1, 1}
The code to do the scan serially is below:
void serial_scan(int N, int *k, int *d, int *r)
{
    r[0] = d[0];
    for (int i = 1; i < N; i++)
    {
        if (k[i] >= r[i-1]) {
            r[i] = r[i-1];
        } else if (k[i] >= d[i]) {
            r[i] = d[i];
        } else {
            r[i] = 0;
        }
    }
}
The general technique for a parallel scan can be found here, described in the functional language Standard ML. It can be done for any associative operator, and I think yours fits the bill.
One intuition pump: you can calculate the sum of an array in O(log n) span (running time with infinite processors) by recursively calculating the sums of the two halves of the array and adding them together. In calculating the scan, you just need to know the sum of the array before the current point.
We could calculate the scan of an array by doing the two halves in parallel: calculate the sum of the first half using the above technique, then calculate the scans of the two halves sequentially, where the first half starts at 0 and the second half starts at the sum calculated before. The full algorithm is a little trickier, but uses the same idea.
Here's some pseudo-code for a parallel scan (shown for the specific case of ints and addition, but the logic is identical for any associative operator):
// assume input.length is a power of 2
int[] scanadd(int[] input) {
    if (input.length == 1)
        return input
    else {
        // calculate a new collapsed sequence which is the sum of sequential even/odd pairs
        // assume this for loop is done in parallel
        int[] collapsed = new int[input.length/2]
        for (i <- 0 until collapsed.length)
            collapsed[i] = input[2*i] + input[2*i+1]

        // recursively scan the collapsed values
        int[] scancollapse = scanadd(collapsed)

        // now we can use the scan of the collapsed seq to calculate the full sequence
        // also assume this for loop is in parallel
        int[] output = new int[input.length]
        for (i <- 0 until input.length)
            // an odd index ends a pair, so its inclusive sum sits directly
            // in the collapsed scan; an even index takes the scan value just
            // before its pair and adds the current element
            if (i % 2 == 1)
                output[i] = scancollapse[(i-1)/2]
            else if (i == 0)
                output[i] = input[0]
            else
                output[i] = scancollapse[i/2 - 1] + input[i]
        return output
    }
}

Algorithm to find if two sets intersect

Let's say I have two arrays:
int ArrayA[] = {5, 17, 150, 230, 285};
int ArrayB[] = {7, 11, 57, 110, 230, 250};
Both arrays are sorted and can be any size. I am looking for an efficient algorithm to find if the arrays contain any duplicated elements between them. I just want a true/false answer, I don't care which element is shared or how many.
The naive solution is to loop through each item in ArrayA, and do a binary search for it in ArrayB. I believe this complexity is O(m * log n).
Because both arrays are sorted, it seems like there should be a more efficient algorithm.
I would also like a generic solution that doesn't assume that the arrays hold numbers (i.e. the solution should also work for strings). However, the comparison operators are well defined and both arrays are sorted from least to greatest.
Pretend that you are doing a mergesort, but don't send the results anywhere. If you get to the end of either source, there is no intersection. Each time you compare the next element of each, if they are equal, there is an intersection.
For example:
counterA = 0;
counterB = 0;
for (;;) {
    if (counterA == ArrayA.length || counterB == ArrayB.length)
        return false;
    else if (ArrayA[counterA] == ArrayB[counterB])
        return true;
    else if (ArrayA[counterA] < ArrayB[counterB])
        counterA++;
    else if (ArrayA[counterA] > ArrayB[counterB])
        counterB++;
    else
        halt_and_catch_fire();
}
Since someone wondered about the STL: out of the box, the set_intersection algorithm would do more than you want, since it finds all the common values.
#include <vector>
#include <algorithm>
#include <iterator>
using namespace std;

// ...
int ArrayA[] = {5, 17, 150, 230, 285};
int ArrayB[] = {7, 11, 57, 110, 230, 250};

vector<int> intersection;
set_intersection(ArrayA, ArrayA + sizeof(ArrayA)/sizeof(int),
                 ArrayB, ArrayB + sizeof(ArrayB)/sizeof(int),
                 back_insert_iterator<vector<int> >(intersection));
return !intersection.empty();
This runs in O(m+n) time, but it requires storing all the common values and doesn't stop when it finds the first one.
Now, by modifying the code from the GNU implementation of the STL, we can get more precisely what you want:
template<typename InputIterator1, typename InputIterator2>
bool
has_intersection(InputIterator1 first1, InputIterator1 last1,
                 InputIterator2 first2, InputIterator2 last2)
{
    while (first1 != last1 && first2 != last2)
    {
        if (*first1 < *first2)
            ++first1;
        else if (*first2 < *first1)
            ++first2;
        else
            return true;
    }
    return false;
}
If one list is much much shorter than the other, binary search is the way to go. If the lists are of similar length and you're happy with O(m+n), a standard "merge" would work. There are fancier algorithms that are more flexible. One paper I've come across in my own searches is:
http://www.cs.uwaterloo.ca/~ajsaling/papers/paper-spire.pdf
If you don't care about memory consumption, you can achieve good performance by using a hash: create a hash set whose keys are the values of one array, then test the values of the second array for membership in it.
If you are using C# 3.0, then why not take advantage of LINQ here?
ArrayA.Intersect(ArrayB).Any()
Not only is this generic (it works for any type with proper equality semantics), the implementation under the hood is pretty efficient (it uses hashing internally).
If the range of values is small, you could build a lookup table for one of the lists (time cost O(N)) and then check whether the corresponding bit is set for each element of the other list (time cost O(N)). If the range is large, you could do something similar with a hash table.
The mergesort trick from Glomek is an even better idea.
Glomek is on the right track, but kind of glossed over the algorithm.
Start by comparing ArrayA[0] to ArrayB[0]. If they are equal, you're done.
If ArrayA[0] is less than ArrayB[0], then move to ArrayA[1].
If ArrayA[0] is more than ArrayB[0], then move to ArrayB[1].
Keep stepping through until you reach the end of one array or find a match.
