How to combine ascending and descending sorting? - bash

I have a very big file (many gigabytes) which looks like
input.txt
a|textA|2
c|textB|4
b|textC|5
e|textD|1
d|textE|4
b|textF|5
As the first step, I want to sort lines numerically by the third column in descending order; lines with the same value in the third column must then be sorted by the text of the first column in ascending order. And if lines have equal values in their 1st and 3rd columns, they must be sorted by the 2nd column in ascending order. The values of the second column are guaranteed to be unique.
So, I want the result to be:
desiredOutput.txt
b|textC|5
b|textF|5
c|textB|4
d|textE|4
a|textA|2
e|textD|1
I can take the first step:
sort -t\| -bfrnk3 path/to/input.txt > path/to/output.txt
But what are the next steps? And can the result perhaps be achieved in a single pass?
EDIT
I tested sort -t '|' -k 3,3nr -k 1,1 -k 2,2 input.txt > output.txt. It gives the following "output.txt":
b|textF|5
b|textC|5
c|textB|4
d|textE|4
a|textA|2
e|textD|1
which is not what I want.

$ cat file
a|textA|2
c|textB|4
b|textC|5
e|textD|1
d|textE|4
b|textF|5
$ sort -t '|' -k 3,3nr -k 1,1 -k 2,2 file
b|textC|5
b|textF|5
c|textB|4
d|textE|4
a|textA|2
e|textD|1
$ sort -t '|' -k 3,3nr file
b|textC|5
b|textF|5
c|textB|4
d|textE|4
a|textA|2
e|textD|1
$
The n in 3,3nr means numeric sorting and r means reverse. -k 1,1 -k 2,2 looks optional here only because, when the numeric keys compare equal, sort falls back to comparing whole lines, which happens to give ascending order for this input; the explicit keys make that tie-breaking order guaranteed.
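The fallback is not quite equivalent to the explicit keys, though, because the whole-line comparison also includes the | delimiter. A small illustration with a made-up two-line file demo, using the C locale so the byte comparison is explicit:
$ printf 'ab|x|5\na|y|5\n' > demo
$ LC_ALL=C sort -t '|' -k 3,3nr demo
ab|x|5
a|y|5
$ LC_ALL=C sort -t '|' -k 3,3nr -k 1,1 -k 2,2 demo
a|y|5
ab|x|5
In the first case the tie is broken by the whole line, where 'b' sorts before '|'; in the second case field 1 alone is compared, and "a" sorts before "ab".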

If this is UNIX:
sort -t'|' -k 3 path/to/input.txt > path/to/output.txt
You can use multiple -k flags to sort on more than one column. For example, to sort by the 3rd column and then by the 1st column as a tie-breaker:
sort -t'|' -k 3,3 -k 1,1 input.txt > output.txt
Relevant options from "man sort":
-k, --key=POS1[,POS2]
start a key at POS1, end it at POS2 (origin 1)
POS is F[.C][OPTS], where F is the field number and C the character position in the field. OPTS is one or more single-letter ordering options, which override global ordering options for that key. If no key is given, use the entire line as the key.
-t, --field-separator=SEP
use SEP instead of non-blank to blank transition.
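As an illustration of the F[.C][OPTS] syntax described above, here is a small example using the input.txt from the question: the key starts at the 5th character of field 2 (the letter after "text") and the per-key r reverses just that key:
$ sort -t'|' -k2.5,2r path/to/input.txt
b|textF|5
d|textE|4
e|textD|1
b|textC|5
c|textB|4
a|textA|2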

You can do it with the sort command alone:
sort -t"|" -k3,3nr -k1,1 -k2,2 inputFile.txt
-k3,3nr sorts numerically by the 3rd column in reverse (descending) order, and -k1,1 and -k2,2 break ties by the 1st and 2nd columns in ascending order.

Related

get unique combination of values of two columns

I can get the unique values from a column using the command below:
cut -d',' -f3 file.txt | uniq -c
This gives me the unique values in field 3.
But if I want the unique combinations of two fields, how can I get that?
input
A,B,C
B,C,D
D,B,C
H,C,D
K,C,D
output
2 B,C
3 C,D
You can specify a range of fields using -f2-3 or -f2,3:
cut -d',' -f2-3 file.txt | sort | uniq -c
uniq does not detect repeated lines unless they are adjacent, so the input should be sorted before it is passed to uniq.
Output
2 B,C
3 C,D
Another option that gives you more flexibility in processing the input is awk. You can use a concatenation of the fields in question as the index of an array, count the occurrences of each unique combination, and then output the results in an END rule, e.g.
awk -F, '{a[$2","$3]++} END{for(i in a)print a[i], i}' file
Example Use/Output
With your example file in input you would have:
$ awk -F, '{a[$2","$3]++} END{for(i in a)print a[i], i}' input
3 C,D
2 B,C
awk arrays are associative rather than ordered, so the output order is unspecified, but you can preserve the order of appearance with an additional array if needed. Or you can simply pipe the output to sort for whatever order you like.
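For example, a sketch of the order-preserving variant (the order array and counter n are names added here purely for illustration):
$ awk -F, '!(($2","$3) in a){order[++n]=$2","$3} {a[$2","$3]++} END{for(i=1;i<=n;i++) print a[order[i]], order[i]}' input
2 B,C
3 C,D
Here order[] records each key the first time it is seen, and the END loop prints the counts in first-appearance order.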

Unix- smallest value in column

I have a txt file which is just three columns of numbers separated by space. I need to use "sort" to display the smallest value of column 3 and only that value.
I tried
sort -k3 file.txt|head -1
but it shows the first value of all three columns.
This is what's expected. sort -k3 file.txt | head -1 says "show me the first line of output"
Use just plain sort -k3 file.txt | head to get the first 10 lines.
What were you expecting or wanting?
In response to the comment: No worries! We're all beginners at the beginning :-)
sort -r file.txt will sort in reverse order, and, as @shellter says, sort -k3,3n file.txt | awk 'NR==1{print $3}' will print just the third value from the first line, i.e. the smallest value in column 3.
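If you prefer to avoid sorting the whole file, an awk-only alternative is possible; a sketch, assuming whitespace-separated numeric columns as in the question:
awk 'NR==1 || $3+0 < min {min=$3+0} END {print min}' file.txt
This keeps a running minimum of column 3 and prints it once at the end.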

shell script to print and sort the date column in csv

I have a large csv file with the date column at column number 4.
The format of the data is YYYY-MM-DD HH:MM:SS.0000000 +11:30.
I want to sort by this date in ascending order and dump the top 10 entries into another csv file, or print them.
I have tried with the following command:
sort -t, nk4 file.csv >/tmp/s.csv
It should be sort -t, -nk4 (- is missing before options).
To output only the 10 first lines, you can pipe your sort to head:
sort -t, -nk4 file.csv | head -n10 > /tmp/s.csv
The same command, maybe a bit more readable:
sort -t "," -k 4 -n file.csv | head -n10 > /tmp/s.csv
head -n10 prints only the first 10 lines of the sort output.
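Note that timestamps in YYYY-MM-DD HH:MM:SS format sort chronologically as plain text, so a key restricted to the 4th field without -n should also work, assuming the date field contains no embedded commas and all rows use the same UTC offset:
sort -t, -k4,4 file.csv | head -n10 > /tmp/s.csv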

Why does coreutils sort give a different result when I use a different field delimiter?

When using sort on the command line, why does the sorted order depend on which field delimiter I use? As an example,
$ # The test file:
$ cat test.csv
2,az,a,2
3,a,az,3
1,az,az,1
4,a,a,4
$ # sort based on fields 2 and 3, comma separated. Gives correct order.
$ LC_ALL=C sort -t, -k2,3 test.csv
4,a,a,4
3,a,az,3
2,az,a,2
1,az,az,1
$ # replace , by ~ as field separator, then sort as before. Gives incorrect order.
$ tr "," "~" < test.csv | LC_ALL=C sort -t"~" -k2,3
2~az~a~2
1~az~az~1
4~a~a~4
3~a~az~3
The second case not only gets the ordering wrong, but is inconsistent between field 2 (where az < a) and field 3 (where a < az).
There is a mistake in -k2,3. It tells sort to use a key that starts at the 2nd field and ends at the 3rd field, which means the delimiter between them is part of the key and is compared like any other character. That's why you get different orderings with different delimiters.
What you want is the following:
LC_ALL=C sort -t"," -k2,2 -k3,3 file
And:
tr "," "~" < file | LC_ALL=C sort -t"~" -k2,2 -k3,3
That means sort orders by the 2nd field and, where the 2nd field has duplicates, breaks ties using the 3rd field.
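You can see the delimiter's effect by sorting just the composite keys by hand; in the C locale ',' (ASCII 44) sorts before the lowercase letters while '~' (ASCII 126) sorts after them:
$ printf 'a,a\na,az\naz,a\naz,az\n' | LC_ALL=C sort
a,a
a,az
az,a
az,az
$ printf 'a~a\na~az\naz~a\naz~az\n' | LC_ALL=C sort
az~a
az~az
a~a
a~az
That is exactly the reordering you observed between the comma and tilde versions of the file.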

Bash: sort csv file by first 4 columns

I have a csv file with fields delimited by ";". There are 8 fields, and I want to sort my data by the first 4 columns in increasing order (first by column 1, then by column 2, etc.).
How can I do this from the command line in Linux?
I tried OpenOffice, but it only lets me select 3 columns.
EDIT: among the fields I want to sort on, three contain numeric values and one contains only strings. How can I specify this with the sort command?
Try:
sort -t\; -k 1,1n -k 2,2n -k 3,3n -k 4,4n test.txt
e.g., given:
1;2;100;4
1;2;3;4
10;1;2;3
9;1;2;3
> sort -t\; -k 1,1n -k 2,2n -k 3,3n -k 4,4n temp3
1;2;3;4
1;2;100;4
9;1;2;3
10;1;2;3
sort -k will allow you to define the sort key. From man sort:
-k, --key=POS1[,POS2]
start a key at POS1 (origin 1), end it at POS2 (default end of line).
So
$ sort -t\; -k1,4
should do it. Note that I've escaped the semi-colon, otherwise the shell will interpret it as an end-of-statement.
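To address the EDIT, each key can carry its own type; a sketch, assuming the three numeric fields are columns 1-3 and the string field is column 4 (adjust the key numbers and the placeholder name file.csv to your data). Bounding every key to a single field also keeps the ';' delimiter out of the comparison, as discussed in the coreutils question above:
sort -t\; -k1,1n -k2,2n -k3,3n -k4,4 file.csv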
