Hi, I just ran into either a bug or, more likely, an error on my part.
I am trying to sort a file that has 5 columns by three specific columns.
I am using the -k option.
sort -k1,1 -k3,3 -k4,4 < abundance_key_60.tsv
SO90 TARA_031_SRF M00370 0.0004796352593680699 5380.716788521779
SO90 TARA_072_MES M00370 6.704622779795495 889.5003464019538
WDU TARA_072_MES M00165 0.00010342611234558623 1372.1512123790574
WDU TARA_046_SRF M00165 0.00011353279569781544 582.9204804414709
WDU TARA_025_DCM M00165 0.00028966684296873025 2486.7113286682593
Everything works fine. Then I realised one of my columns is numeric, so I added the -g option for that column. At this point sort seems to sort only by that column:
sort -k1,1 -k3,3 -gk4,4 < test_.sort.txt
SO90 TARA_031_SRF M00370 0.0004796352593680699 5380.716788521779
WDU TARA_025_DCM M00165 0.00028966684296873025 2486.7113286682593
WDU TARA_046_SRF M00165 0.00011353279569781544 582.9204804414709
WDU TARA_072_MES M00165 0.00010342611234558623 1372.1512123790574
SO90 TARA_072_MES M00370 6.704622779795495 889.5003464019538
I tried the -s option, but it did not change the results.
Any help appreciated!
PS: this is a sample from my file that reproduces the bug.
I am on Ubuntu 16.04 with the default bash and sort for this distribution.
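A -g given on its own is a global option, and every -k key without ordering options of its own inherits it, so columns 1 and 3 were being compared numerically as well. You want to specify the g only for -k4,4, like this: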
bash$ sort -k1,1 -k3,3 -k4,4g test_.sort.txt
SO90 TARA_031_SRF M00370 0.0004796352593680699 5380.716788521779
SO90 TARA_072_MES M00370 6.704622779795495 889.5003464019538
WDU TARA_072_MES M00165 0.00010342611234558623 1372.1512123790574
WDU TARA_046_SRF M00165 0.00011353279569781544 582.9204804414709
WDU TARA_025_DCM M00165 0.00028966684296873025 2486.7113286682593
(Experimentally verified by changing the number to 6.704622779795495E-10 and observing how that changes the sort order. A better test case would contain samples which trivially reveal when you get the correct result.)
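To see why the placement of g matters, here is a minimal sketch on a hypothetical three-line sample (the data and the variable names are made up for illustration). It relies on the documented GNU sort rule that a global ordering option such as -g is inherited by every -k key that has no ordering option of its own; LC_ALL=C is set only to keep the text comparison predictable.

```shell
# Hypothetical sample; columns: name, tag, value
data='b x 10
a y 9
a x 100'

# Global -g: the keys -k1,1 and -k3,3 carry no options of their own,
# so both inherit -g and column 1 is compared as a number too.
global=$(printf '%s\n' "$data" | LC_ALL=C sort -g -k1,1 -k3,3)
printf 'global -g:\n%s\n' "$global"

# Per-key g: column 1 is compared as text, only column 3 as a number.
per_key=$(printf '%s\n' "$data" | LC_ALL=C sort -k1,1 -k3,3g)
printf 'per-key g:\n%s\n' "$per_key"
```

In the first command none of the names in column 1 parses as a number, so that key should no longer separate a from b, which is the same symptom as in the question.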
On my OS, sort offers this option:
-h, --human-numeric-sort
compare human readable numbers (e.g., 2K 1G)
And I have a file aaa.txt:
2M
5904K
1G
Then I type
sort -h aaa.txt
The output is
5904K
2M
1G
It's wrong. It should be
2M
5904K
1G
Questions:
Why does sort -h not work? The result is wrong even from a lexicographic perspective. How can I sort the aaa.txt file by human-readable numbers?
Or does it work only with du -h? But the most voted answer seems to work with awk.
With du -h, sort does not need to be told which field to use, as in sort -k1h,1. Why? What would happen if the memory size were not in the first field?
Why does sort -h not work?
Below is a comment from GNU sort's source code.
/* Compare numbers ending in units with SI xor IEC prefixes
<none/unknown> < K/k < M < G < T < P < E < Z < Y
Assume that numbers are properly abbreviated.
i.e. input will never have both 6000K and 5M. */
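It's not mentioned in the man page, but -h is not supposed to work with your input: it compares the unit suffix first, so 5904K (which should have been abbreviated to about 5.9M) sorts before 2M.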
How to sort the aaa.txt file in human readable numbers.
You can use numfmt to perform a Schwartzian transform as shown below.
$ numfmt --from=auto < aaa.txt | paste - aaa.txt | sort -n | cut -f2
2M
5904K
1G
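The pipeline above can be broken into its three steps. This is only a sketch: the temporary file and the variable names are assumptions, and numfmt --from=auto reads K, M, G as powers of 1000 (and Ki, Mi, Gi as powers of 1024).

```shell
# Hypothetical temporary copy of aaa.txt
tmp=$(mktemp)
printf '2M\n5904K\n1G\n' > "$tmp"

# 1. Decorate: put the plain number in front of each original value (tab-separated)
decorated=$(numfmt --from=auto < "$tmp" | paste - "$tmp")

# 2. Sort numerically on the plain number; 3. undecorate by keeping only column 2
sorted=$(printf '%s\n' "$decorated" | LC_ALL=C sort -n | cut -f2)
printf '%s\n' "$sorted"

rm -f "$tmp"
```

Because the original text travels along as a second column, the same pattern can be adapted when the size is not the only field on the line.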
I'm working with fairly large gzipped TSV files, each of which has only 3 columns. I would like to count the number of unique occurrences of a particular regex (which is contained in column 3) across all files.
How do I make sure the count in the output excludes duplicates, based on the values in column 1?
I tried both of these, but I am not sure if they are correct:
zgrep -c ",80447," AU_AAID_201812*.tsv.gz | uniq -c
zgrep -c ",80447," AU_AAID_201812*.tsv.gz
I want to get the unique count number so that if:
Column 1/Row 1 = "xyz123" and Column 3/Row 1 = ",80447,"
Column 1/Row 2 = "xyz123" and Column 3/Row 2 = ",80447,"
Then my output would still be "1".
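Use cut to get just column 1 and column 3, use sort -u to remove duplicates, and then use wc -l to get the count. Since the files are tab-separated, cut's default delimiter already splits the columns, and -h keeps zgrep from prefixing each line with its file name, which would otherwise make identical rows from different files look distinct:
zgrep -h ',80447,' AU_AAID_201812*.tsv.gz | cut -f1,3 | sort -u | wc -l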
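As a quick check of the counting logic, here is a sketch on hypothetical uncompressed rows, with plain grep standing in for zgrep. It assumes tab-separated columns and that only the column 1 value should decide whether two matching rows are duplicates.

```shell
# Hypothetical sample: three tab-separated columns, pattern of interest in column 3
sample=$(printf 'xyz123\tfoo\t,80447,\nxyz123\tbar\t,80447,\nabc999\tbaz\t,80447,\nqqq000\tbaz\t,11111,\n')

# Keep matching rows, reduce them to column 1, drop duplicates, count what is left
count=$(printf '%s\n' "$sample" | grep ',80447,' | cut -f1 | LC_ALL=C sort -u | wc -l | tr -d ' ')
echo "$count"
```

The two xyz123 rows collapse into one, so only the distinct column 1 values among the matching rows are counted.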
I tried this solution on my list, but I can't get what I want after sorting.
I have this list:
m_2_mdot_3_a_1.dat ro= 303112.12
m_1_mdot_2_a_0.dat ro= 300.10
m_2_mdot_1_a_3.dat ro= 221.33
m_3_mdot_1_a_1.dat ro= 22021.87
I used sort -k 2 -n >name.txt
I would like to get the list ordered from the lowest ro to the highest ro. What did I do wrong?
I got a sort, but either by the names in column 1 or by the last value in an order like 1000, 100001, 1000.2 ... It looks as if only the first 4 significant digits were compared, or something similar.
cat test.txt | tr . , | sort -k3 -g | tr , .
The following linked question has a good answer: Sort scientific and float
In brief:
the -g option sorts on general numeric values (decimals and scientific notation);
the -k option counts fields from 1, not 0;
and, depending on your locale, sort may expect , as the decimal separator instead of .
However, be careful if your name.txt contains , characters.
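An alternative to swapping the characters with tr is to run sort in the C locale, where . is the decimal point. A minimal sketch, assuming the four lines from the question and that the columns are separated by blanks:

```shell
# The sample lines from the question
list='m_2_mdot_3_a_1.dat ro= 303112.12
m_1_mdot_2_a_0.dat ro= 300.10
m_2_mdot_1_a_3.dat ro= 221.33
m_3_mdot_1_a_1.dat ro= 22021.87'

# LC_ALL=C makes "." the decimal point for this one command only;
# -k3,3g restricts the key to column 3 and compares it as a number
by_ro=$(printf '%s\n' "$list" | LC_ALL=C sort -k3,3g)
printf '%s\n' "$by_ro"
```

This leaves the file names untouched, so it also works when they contain , characters.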
Since there's a space or a tab between ro= and the numeric value, you need to sort on the 3rd column instead of the 2nd. So your command will become:
cat input.txt | sort -k 3 -n
I am trying to obtain the smallest $2 value for every $1 value. My data looks as follows:
0 0
23.9901 13.604
23.9901 13.604
23.9901 3.364
23.9901 3.364
24.054 18.5279
25.0981 17.4839
42.582 0
45.79 0
45.79 15.36
45.7902 12.1518
51.034 12.028
54.11 14.072
54.1102 14.0718
The output must look like:
0 0
23.9901 3.364
24.054 18.5279
25.0981 17.4839
42.582 0
45.79 0
45.7902 12.1518
51.034 12.028
54.11 14.072
54.1102 14.0718
I can manage this by creating multiple files for each $1 value and finding the min in each file. But I am wondering if there might be a more elegant solution for this?
Thanks.
With GNU or FreeBSD sort, you can do it as follows:
sort -k1,1 -k2,2g file | sort -k1,1g -su
The first sort orders the file by the first column and then by the second. The second sort uniquifies the file (-u) using only the first column to determine uniqueness. It also uses the -s flag to guarantee that the second column is still in order. In both cases, the sort uses the -g flag when it matters (see below), which does general numeric comparison and therefore also understands exponent notation such as 1e3, unlike the POSIX-standard -n flag, which only handles plain decimal numbers.
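The two-pass idea can be tried on a few lines first. This sketch uses a shortened, shuffled excerpt of the question's data (the variable names are made up) and fixes LC_ALL=C so that the text comparison of column 1 is predictable:

```shell
data='23.9901 13.604
45.79 15.36
23.9901 3.364
0 0
45.7902 12.1518
45.79 0'

# Pass 1: group by column 1 (as text), smallest column 2 first within each group
# Pass 2: -u keeps one line per column 1 value; -s preserves pass 1's order,
#         so the line that survives is the one with the smallest column 2
mins=$(printf '%s\n' "$data" | LC_ALL=C sort -k1,1 -k2,2g | LC_ALL=C sort -k1,1g -su)
printf '%s\n' "$mins"
```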
Performance note: (And thanks to OP for spurring me to do the measurements):
Leaving the g off of -k1,1 in the first sort is not a typo; it actually speeds the sort up considerably (on large files, with GNU sort). Standard or integer (-n) sorts are much faster than general numeric sorts, perhaps 10 times as fast. However, all key types are about twice as fast for files which are "mostly sorted". For more-or-less uniformly sampled random numbers, a lexicographic sort is a close approximation to a general numeric sort; close enough that the result shows the "mostly sorted" speed-up.
It would have been possible to sort only by the second field in the first sort: sort -k2,2g file | sort -k1,1g -su. But this is much slower, both because the primary sort in the first pass is general numeric instead of lexicographic and because the file is no longer mostly sorted for the second pass.
Here's just one sample point, although I did a few tests with similar results. The input file consists of 299,902 lines, each containing two numbers in the range 0 to 1,000,000, with three decimal digits. There are precisely 100,000 distinct numbers in the first column; each appears from one to five times with different numbers in the second column. (All numbers in the second column are distinct, as it happens.)
All timings were collected with bash's time verb, taking the real (wallclock) time. (Sort multithreads nicely so the user time was always greater).
With the first column correctly sorted and the second column randomised:
sort -k1,1 -k2,2g sorted | sort -k1,1g -su       1.24s
sort -k1,1g -k2,2g sorted | sort -k1,1g -su      1.78s
sort -k2,2g sorted | sort -k1,1g -su             3.00s
With the first column randomised:
sort -k1,1 -k2,2g unsorted | sort -k1,1g -su     1.42s
sort -k1,1g -k2,2g unsorted | sort -k1,1g -su    2.19s
sort -k2,2g unsorted | sort -k1,1g -su           3.01s
You can use this gnu-awk command:
awk '!($1 in m) || m[$1]>$2{m[$1]=$2} END{for (i in m) print i, m[i]}' file
Or, to get the output in the same order as the input file (which is sorted numerically by the first column):
awk 'BEGIN{PROCINFO["sorted_in"]="@ind_num_asc"} !($1 in m) || m[$1] > $2 {m[$1] = $2}
END{for (i in m) print i, m[i]}' file
BEGIN{PROCINFO["sorted_in"]="@ind_num_asc"} is used to traverse the associative array in ascending numeric order of its indices.
Output:
0 0
23.9901 3.364
24.054 18.5279
25.0981 17.4839
42.582 0
45.79 0
45.7902 12.1518
51.034 12.028
54.11 14.072
54.1102 14.0718
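PROCINFO["sorted_in"] is specific to gawk. A sketch of an alternative that should also work with other awk implementations is to let awk print the minima in whatever order it likes and restore the numeric order with sort afterwards (the sample data and the variable names are illustrative):

```shell
data='23.9901 13.604
23.9901 3.364
0 0
45.79 15.36
45.79 0'

# awk keeps the smallest column 2 seen for each column 1 value (+0 forces a numeric comparison);
# sort -k1,1g then orders the result numerically by column 1
mins=$(printf '%s\n' "$data" |
  awk '!($1 in m) || $2+0 < m[$1]+0 { m[$1] = $2 }
       END { for (k in m) print k, m[k] }' |
  LC_ALL=C sort -k1,1g)
printf '%s\n' "$mins"
```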
You can do that:
awk 'NR==1{k=$1;v=$2;next} k==$1 { if (v>$2) v=$2; next} {print k,v; k=$1;v=$2}END{print k,v}'
Indented (this assumes the input is already grouped by the first field):
awk '
# for the first record, store the two fields
NR==1 {
    k=$1
    v=$2
    next
}
# when the first field does not change
k==$1 {
    # keep the second field if it is lower
    if (v>$2)
        v=$2
    next
}
{
    # otherwise print the stored fields and reinitialize them
    print k,v
    k=$1
    v=$2
}
END {
    print k,v
}'
In Perl:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my %min;
while (<>) {
chomp;
my ($key, $value) = split;
if (!exists $min{$key} or $value < $min{$key}) {
$min{$key} = $value;
}
}
for (sort { $a <=> $b } keys %min) {
say "$_ $min{$_}";
}
It's written as a Unix filter, so it reads from STDIN and writes to STDOUT. Call it as:
$ ./get_min < input_file > output_file
If you want to use sort, you first have to fix the ordering. sort may not understand the decimal point (this depends on the locale), so temporarily replace it with an x.
Then sort numerically on the numeric fields and put the decimal point back.
The resulting list is sorted correctly; take the first value of each key.
sed 's/\./ x /g' inputfile | sort -n -k1,3 -k4,6 | sed 's/ x /./g' | sort -u -k1,1
List of files:
sysbench-size-256M-mode-rndrd-threads-1
sysbench-size-256M-mode-rndrd-threads-16
sysbench-size-256M-mode-rndrd-threads-4
sysbench-size-256M-mode-rndrd-threads-8
sysbench-size-256M-mode-rndrw-threads-1
sysbench-size-256M-mode-rndrw-threads-16
sysbench-size-256M-mode-rndrw-threads-4
sysbench-size-256M-mode-rndrw-threads-8
sysbench-size-256M-mode-rndwr-threads-1
sysbench-size-256M-mode-rndwr-threads-16
sysbench-size-256M-mode-rndwr-threads-4
sysbench-size-256M-mode-rndwr-threads-8
sysbench-size-256M-mode-seqrd-threads-1
sysbench-size-256M-mode-seqrd-threads-16
sysbench-size-256M-mode-seqrd-threads-4
sysbench-size-256M-mode-seqrd-threads-8
sysbench-size-256M-mode-seqwr-threads-1
sysbench-size-256M-mode-seqwr-threads-16
sysbench-size-256M-mode-seqwr-threads-4
sysbench-size-256M-mode-seqwr-threads-8
I would like to sort them by mode (rndrd, rndwr etc.) and then number:
sysbench-size-256M-mode-rndrd-threads-1
sysbench-size-256M-mode-rndrd-threads-4
sysbench-size-256M-mode-rndrd-threads-8
sysbench-size-256M-mode-rndrd-threads-16
sysbench-size-256M-mode-rndrw-threads-1
sysbench-size-256M-mode-rndrw-threads-4
sysbench-size-256M-mode-rndrw-threads-8
sysbench-size-256M-mode-rndrw-threads-16
....
I've tried the following loop, but it sorts by number only, whereas I need a sequence like 1,4,8,16 within each mode:
$ for f in $(ls -1A); do echo $f; done | sort -t '-' -k 7n
EDIT:
Please note that numeric sort (-n) sorts by number (1,1,1,1,4,4,4,4...), but I need a sequence like 1,4,8,16,1,4,8,16...
Sort by more columns:
sort -t- -k5,5 -k7n
The primary sort is by the 5th column only (that's why 5,5, which stops the key at the end of that column); the secondary sort is numeric, on the 7th column.
The for loop is completely unnecessary as is the -1 argument to ls when piping its output. This yields
ls -A | sort -t- -k 5,5 -k 7,7n
where the first key begins and ends at column 5 and the second key begins and ends at column 7 and is numeric.
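A quick way to check the two keys is to feed a few of the names to sort directly. The four names below are taken from the list in the question; LC_ALL=C is an assumption added to keep the text comparison of the mode predictable:

```shell
names='sysbench-size-256M-mode-rndrw-threads-16
sysbench-size-256M-mode-rndrd-threads-16
sysbench-size-256M-mode-rndrd-threads-4
sysbench-size-256M-mode-rndrw-threads-1'

# -t- splits on "-"; key 1 is field 5 (the mode) as text,
# key 2 is field 7 (the thread count) as a number
ordered=$(printf '%s\n' "$names" | LC_ALL=C sort -t- -k5,5 -k7,7n)
printf '%s\n' "$ordered"
```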