Algorithm to simplify an SQL query over many combinations of values

Suppose you have a set of data you want to query from a database table based on two or more columns and you have various combinations of values from those columns to query over.
For example, with these combinations:
A 1
A 2
A 3
A 4
B 2
B 3
B 4
C 2
C 3
C 4
C 5
the simplest automated way to write the query would be
...WHERE column1=A AND column2=1
OR column1=A AND column2=2
OR ...
This is accurate, but verbose.
Another approach would be to choose one list and group by that.
...WHERE column1=A AND column2 IN (1,2,3,4)
OR column1=B AND column2 IN (2,3,4)
OR column1=C AND column2 IN (2,3,4,5)
This can also be pretty verbose if the lists are large.
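The group-by approach above can be generated mechanically. Below is a minimal Python sketch; the function name grouped_where and its parameters are made up for illustration, and values are printed unquoted only to mirror the example (real code should use bound parameters rather than string concatenation).

```python
from collections import defaultdict


def grouped_where(pairs, col1="column1", col2="column2"):
    """Build a WHERE clause that groups the second column's values by the first."""
    groups = defaultdict(list)
    for first, second in pairs:
        groups[first].append(second)

    clauses = []
    for first, seconds in groups.items():  # dicts keep insertion order
        values = ",".join(str(v) for v in seconds)
        clauses.append(f"{col1}={first} AND {col2} IN ({values})")
    return "WHERE " + "\nOR ".join(clauses)


pairs = [("A", 1), ("A", 2), ("A", 3), ("A", 4),
         ("B", 2), ("B", 3), ("B", 4),
         ("C", 2), ("C", 3), ("C", 4), ("C", 5)]
print(grouped_where(pairs))
```

Grouping by whichever column has fewer distinct values tends to print fewer values overall, so it may be worth building both variants and keeping the shorter one.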
What I'm looking for is a good, programmatic/algorithmic way to simplify this to something more optimal, like
...WHERE column1 IN (A,B,C) AND column2 IN (2,3,4,5) AND NOT (column1 IN (A,B) AND column2=5)
OR column1=A AND column2=1
Realistically, I guess I'm looking for the shortest string representation, but rather than worry about the lengths of column names or values or the other code in between, I'd be more than satisfied with a way to minimize the number of times that values have to be printed, or something along those lines.
The given example is for two columns, but I would like to find an approach that can generalize to more columns.
It also doesn't have to be perfect. Pretty good is good enough for me.

Related

Database index that support arbitrary sort order

Is there a database index type (or a data structure in general, not just a B-tree) that provides efficient enumeration of objects sorted in an arbitrarily customizable order?
In order to execute a query like the one below efficiently
select *
from sample
order by column1 asc, column2 desc, column3 asc
offset :offset rows fetch next :pagesize rows only
DBMSes usually require a composite index on the fields mentioned in the "order by" clause, in the same order and with the same asc/desc directions, i.e.
create index c1_asc_c2_desc_c3_asc on sample(column1 asc, column2 desc, column3 asc)
The order of index columns does matter, and the index can't be used if the order of columns in "order by" clause does not match.
To make queries with every possible "order by" clause efficient we could create indexes for every possible combination of sort columns. But this is not feasible, since the number of indexes grows exponentially with the number of sort columns.
Namely, if k is the number of sort columns, k! is the number of permutations of the sort columns and 2^k is the number of combinations of asc/desc directions, so the number of indexes is k!·2^(k-1). (Here we use 2^(k-1) instead of 2^k because we assume that the DBMS is smart enough to scan the same index in both forward and reverse directions, but unfortunately this doesn't help much.)
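As a quick check of that count, here is a small Python sketch (the helper name index_count is only illustrative) that evaluates k!·2^(k-1) for a few values of k:

```python
from math import factorial


def index_count(k):
    """Number of plain indexes needed to cover every ORDER BY over k columns,
    assuming one index serves both a sort order and its exact reverse."""
    return factorial(k) * 2 ** (k - 1)


for k in range(1, 6):
    print(k, index_count(k))
```

For k=3 this gives 24, and the count grows quickly from there (192 for k=4, 1920 for k=5).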
So, I wish to have something like
create universal index c1_c2_c3 on sample(column1, column2, column3)
that would have the same effect as 24 (k=3) plain indexes (that cover every "order by"), but consume reasonable disk/memory space. As for reasonable disk/memory space I think that O(k·n) is ok, where k is the number of sort columns, n is the number of rows/entries and assuming that ordinary index consumes O(n). In other words, universal index with k sort columns should consume approximately as much as k ordinary indexes.
What I want looks to me like a multidimensional index, but when I googled this term I found pages that relate to either
ordinary composite indexes - this is not what I need for obvious reason;
spatial structures like k-d tree, quad/octo- tree, R-tree and so on, which are more suitable for the nearest-neighbor search problem rather than sorting.

Algorithm to find occurrence of subsequence with the least number of breaks

Given a query (subsequence) and a string (sequence) I'd like to know the minimal number of groups of consecutive characters (or the number of breaks between them) in the string that, when added together, produce the query. The order of characters matters: abc does not occur in cba.
Example 1: Given the query abcjkl and the string a_b_c_abc_j_k_l_jkl I'd like the algorithm to find an occurrence of abc and then of jkl, which makes 2 groups. It should prefer the set of groups (abc jkl) over, say, (a b c jkl), since it has fewer breaks. I don't care about the number of characters between abc and jkl in the string; it still counts as 1 break.
Example 2: Given the query abcjkl and the string abc_jkl_abcj_kl I'd also like the algorithm to find 2 groups. I don't really care which set it picks — (abc jkl) or (abcj kl).
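One possible way to compute this number is a dynamic program over (query position, string position). The Python sketch below is not taken from the question: the function name min_groups and the choice to return None when the query does not occur are assumptions made for illustration. It only returns the minimal group count, not the groups themselves.

```python
def min_groups(query, text):
    """Fewest runs of consecutive characters of `text` that spell `query` in order.

    Returns None if `query` is not a subsequence of `text`.
    """
    INF = float("inf")
    if not query:
        return 0
    n = len(text)
    # prev[j]: fewest groups for the query prefix so far, with its last
    # character matched exactly at text[j] (INF if impossible).
    prev = [1 if ch == query[0] else INF for ch in text]
    for qc in query[1:]:
        cur = [INF] * n
        best_before = INF  # min(prev[0..j-1]) while scanning j
        for j in range(n):
            if j > 0 and text[j] == qc:
                extend = prev[j - 1]          # continue the current group
                new_group = best_before + 1   # start a new group after a break
                cur[j] = min(extend, new_group)
            best_before = min(best_before, prev[j])
        prev = cur
    best = min(prev, default=INF)
    return None if best == INF else best


print(min_groups("abcjkl", "a_b_c_abc_j_k_l_jkl"))
print(min_groups("abcjkl", "abc_jkl_abcj_kl"))
print(min_groups("abc", "cba"))
```

The table has one cell per (query character, string character) pair, so the running time is proportional to the product of the two lengths.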

Largest subset of lines with two unique columns

Given a text file with two columns, produce the largest possible subset of lines for which no value is repeated within either column.
For example, given these four lines :
1 a
1 b
2 a
2 b
One can use something like "sort -u" on the command line, to unique first on column 1, leaving
1 a
2 a
and then on column two, leaving just
1 a
This satisfies "no value is repeated" but not "largest possible subset".
In an ideal world, I would have produced either
1 a
2 b
or
1 b
2 a
Given the further constraint that these files might be many gigabytes (i.e. much larger than available RAM, but much smaller than available disk), I can't just keep all the values in a data structure.
Can anyone think of an approach?
I would also be happy with "a pretty large subset", if I can't literally get "the largest possible subset"
If I sort by (column 1 ascending and then column 2 random), uniq'ing on column 1 will give me slightly better results, but I feel like there's something simple that I'm missing.
For each unique item in column 1, create a list of the unique items from column 2 that appear with it. Then, starting with the shortest list, build the final output by taking, for each column-1 item, the first value from its list that has not been used in the output yet.
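The greedy idea just described can be sketched in Python as follows. The function name greedy_unique_pairs is made up for illustration, and the sketch makes two simplifying assumptions: all pairs fit in memory (which the question rules out for multi-gigabyte files), and a greedy pass is acceptable even though it is not guaranteed to find the largest possible subset.

```python
from collections import defaultdict


def greedy_unique_pairs(lines):
    """Pick lines so that no value repeats in either column (greedy, not optimal)."""
    # Column-1 value -> insertion-ordered set of column-2 values seen with it.
    candidates = defaultdict(dict)
    for line in lines:
        left, right = line.split()
        candidates[left][right] = None

    used_right = set()
    result = []
    # Handle the column-1 values with the fewest options first.
    for left in sorted(candidates, key=lambda k: len(candidates[k])):
        for right in candidates[left]:
            if right not in used_right:
                used_right.add(right)
                result.append((left, right))
                break
    return result


print(greedy_unique_pairs(["1 a", "1 b", "2 a", "2 b"]))
```

On the four-line example this keeps one line per column-1 value, pairing 1 with a and 2 with b.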

How to select subset of values in bash

I have a file say input.dat like this
column1 column2
0 0
1.3 1.6
1.8 2.1
2.0
2.6
I need to extract the subset of values from the 1st column that are closest to those in column 2, so that the total number of entries in both columns is equal.
In this example, the output I need to obtain
column1 column2
0 0
1.8 1.6
2.0 2.1
How can I get this?
It's possible to do this with bash scripts if that is what you are limited to, but it would be easier to handle a problem like this with Python / C++ / Java, because this is a version of the optimized bipartite matching problem (in a script you'd have to read each line repeatedly, or use a lot of helper variables).
If we can assume that values in both columns are sorted and increasing, a naive solution would be:
For every value in the 2nd column:
Read over values in the 1st column sequentially until the difference of col2_value - col1_value goes from negative to positive
Then find min( abs(negative_difference), positive_difference ) and pick the col1_value that corresponds to the smaller difference
Remove both entries from col1 and col2 and add them to the result table
Repeat this process until there is nothing left in col2 of the original table
This has a worst-case run time of O(m·n), where m is the number of entries in col1 and n is the number of entries in col2, and an average run time of O(n) if you are clever and do a constant-time alternating check (compare -1, +1 from the index of the last chosen col1_value, since -2, +2, etc. would of course result in bigger differences) instead of a sequential one to find the minimal difference between the current value in col2 and the values in col1.
This is a naive solution because it does not minimize the overall difference in the system. The optimum solution is a minimum-cost bipartite matching (the assignment problem), which can be solved in polynomial time, for example with the Hungarian algorithm, but costs considerably more than the greedy pass; for large datasets, an approximate matching algorithm may be the practical choice.
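The naive pass described above can be sketched in Python as follows. The function name match_nearest is hypothetical, and the sketch swaps the sequential scan for a binary search (bisect) to find where the difference changes sign; it then compares the two neighbouring values, as in the steps above. It is a greedy illustration, not an optimal matching.

```python
import bisect


def match_nearest(col1, col2):
    """For each col2 value, take the closest still-unused col1 value (greedy)."""
    remaining = sorted(col1)
    pairs = []
    for target in sorted(col2):
        if not remaining:
            break
        # First position whose value is >= target; the best candidate is
        # either that value or the one just before it.
        pos = bisect.bisect_left(remaining, target)
        neighbours = [i for i in (pos - 1, pos) if 0 <= i < len(remaining)]
        best = min(neighbours, key=lambda i: abs(remaining[i] - target))
        # Remove the chosen col1 value so it cannot be matched twice.
        pairs.append((remaining.pop(best), target))
    return pairs


col1 = [0, 1.3, 1.8, 2.0, 2.6]
col2 = [0, 1.6, 2.1]
for a, b in match_nearest(col1, col2):
    print(a, b)
```

With the sample data from the question this pairs 0 with 0, 1.8 with 1.6 and 2.0 with 2.1.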

what is natural ordering when we talk about sorting?

What is meant by natural ordering? Suppose I have an Employee object with name, age and date of joining; sorting by what would be the natural ordering?
Natural ordering is a kind of alphanumerical sort which seems natural to humans.
In a classical alphanumerical sort we will have something like :
1 10 11 12 2 20 21 3 4 5 6 7
If you're using Natural ordering, it will be :
1 2 3 4 5 6 7 10 11 12 20 21
Depending on the language, natural ordering sometimes ignores capitalization and accents (i.e. accented letters are treated like their unaccented counterparts).
Many languages have a function to order strings naturally. However, an Employee is too "high level" for the language: you must decide what it means for you to order them naturally and write the corresponding comparison function.
In my point of view, ordering Employee will start by ordering them by name using a natural sort, then age and finally date of joining.
According to statistics there are two types of categorical variables. Variables having categories without a numerical ordering (nominal) and those which do have ordered categories (ordinal). The example of an Employee's name, age and date of joining is actually considered a nominal variable so there can be no sorting by natural ordering. Natural ordering could exist for example in age had you categorized it in levels of child, teenager, adult, in which one can observe an ascending type of sorting.
For strings containing numbers it means 1,2,3,4,5,6,7,8,9,10,11 instead of 1,10,11,2,3,4,5,6,7,8,9
Quite an old question, but very simply put, the Natural Order is an ascending order of the enumerable collection of the comparable elements:
For the numbers: 1, 2, 3...
For the characters: A, B, C...
If, like me, you find yourself reading the following article:
https://www.copterlabs.com/natural-sorting-in-mysql/
(which by the way is really useful), be aware that it describes a different method of sorting.
A correct natural sorting algorithm states that you order alphabetically but when you encounter a digit you will order that digit and all the subsequent digits as a single character.
Natural sorting has nothing to do with sorting by string length first, and then alphabetically when two strings have the same length. Though the article I linked is interesting, don't make the mistake I made and think that that's the correct way to sort naturally.
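A common way to implement the rule described above (treat each run of digits as a single unit) is to split every string into alternating text and number chunks and compare those. The Python sketch below is one such illustration; the name natural_key is made up here, and lower-casing the text chunks is an assumption to make the comparison case-insensitive.

```python
import re


def natural_key(text):
    """Sort key that compares digit runs as numbers and the rest as text."""
    # re.split with a capturing group alternates: text, digits, text, ...
    # so odd positions are always the digit runs.
    parts = re.split(r"(\d+)", text)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


values = ["1", "10", "11", "12", "2", "20", "21", "3", "4", "5", "6", "7"]
print(sorted(values))                   # plain alphanumerical sort
print(sorted(values, key=natural_key))  # natural sort
```

Because text and number chunks always sit at the same positions in every key, the lists can be compared element by element without mixing strings and integers.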
In Java, the ordering provided by the Comparable interface is called the natural ordering, so the Comparator interface provides, so to speak, an unnatural ordering.
