Unix sort: inconsistent between 2 files - sorting

[1.txt]
Sample10_1.fq.gz
Sample11_1.fq.gz
Sample12_1.fq.gz
Sample1_1.fq.gz
Sample13_1.fq.gz
[2.txt]
Sample10_2.fq.gz
Sample11_2.fq.gz
Sample12_2.fq.gz
Sample1_2.fq.gz
Sample13_2.fq.gz
As you can see, the only difference is the digit after the "_".
Anyway, here are the results of sort:
[sort 1.txt]
Sample10_1.fq.gz
Sample11_1.fq.gz
Sample1_1.fq.gz
Sample12_1.fq.gz
Sample13_1.fq.gz
[sort 2.txt]
Sample10_2.fq.gz
Sample11_2.fq.gz
Sample12_2.fq.gz
Sample1_2.fq.gz
Sample13_2.fq.gz
Discrepancy: "Sample1_" is sorted between "Sample11" and "Sample12" in 1.txt, but it's between "Sample12" and "Sample13" in 2.txt.
Am I doing something wrong to make this inconsistency happen?

Use sort -V (version sort). You aren't doing anything wrong: the default sort uses your locale's collation rules, which typically ignore punctuation such as the underscore, so the digit that follows the "_" ends up deciding where "Sample1_" lands, and that digit differs between the two files. Version sort compares the embedded numbers instead:
cat 1.txt | sort -V
Sample1_1.fq.gz
Sample10_1.fq.gz
Sample11_1.fq.gz
Sample12_1.fq.gz
Sample13_1.fq.gz
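If you would rather force plain byte-wise ordering than use version sort, setting the C locale also makes both files sort consistently. This is an aside of mine, not part of the original answer; note that "_" (0x5F) then sorts after the digits, so Sample1_ ends up last in both files rather than first:
# byte-wise collation: consistent across files, but not the natural numeric order of -V
LC_ALL=C sort 1.txt
LC_ALL=C sort 2.txt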

Related

find smallest $2 for every unique $1 [duplicate]

I am trying to obtain the smallest $2 value for every $1 value. My data looks as follows:
0 0
23.9901 13.604
23.9901 13.604
23.9901 3.364
23.9901 3.364
24.054 18.5279
25.0981 17.4839
42.582 0
45.79 0
45.79 15.36
45.7902 12.1518
51.034 12.028
54.11 14.072
54.1102 14.0718
The output must look like:
0 0
23.9901 3.364
24.054 18.5279
25.0981 17.4839
42.582 0
45.79 0
45.7902 12.1518
51.034 12.028
54.11 14.072
54.1102 14.0718
I can manage this by creating multiple files for each $1 value and finding the min in each file. But I am wondering if there might be a more elegant solution for this?
Thanks.
With GNU or FreeBSD sort, you can do it as follows:
sort -k1,1 -k2,2g file | sort -k1,1g -su
The first sort orders the file by the first and then the second column value. The second sort uniquifies the file (-u) using only the first column to determine uniqueness; it also uses the -s (stable) flag to guarantee that the second column stays in order. In both cases the sort uses the -g flag where it matters (see below), which does general numeric comparison, unlike the POSIX-standard -n flag, which handles plain decimal numbers but not exponent notation.
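A quick way to see the difference between the two numeric flags (my example, not part of the original answer): -n reads only a plain leading decimal, while -g parses exponent notation:
printf '1e3\n20\n' | sort -n   # -n sees 1 and 20, so 1e3 comes first
printf '1e3\n20\n' | sort -g   # -g sees 1000 and 20, so 20 comes first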
Performance note (and thanks to the OP for spurring me to do the measurements):
Leaving the g off of -k1,1 in the first sort is not a typo; it actually considerably speeds the sort up (on large files, with Gnu sort). Standard or integer (-n) sorts are much faster than general numeric sorts, perhaps 10 times as fast. However, all key types are about twice as fast for files which are "mostly sorted". For more-or-less uniformly sampled random numbers, a lexicographic sort is a close approximation to a general numeric sort; close enough that the result shows the "mostly sorted" speed-up.
It would have been possible to only sort by the second field in the first sort: sort -k2,2g file | sort -k1,1g -su but this is much slower, both because the primary sort in the first pass is general numeric instead of lexicographic and because the file is no longer mostly sorted for the second pass.
Here's just one sample point, although I did a few tests with similar results. The input file consists of 299,902 lines, each containing two numbers in the range 0 to 1,000,000, with three decimal digits. There are precisely 100,000 distinct numbers in the first column; each appears from one to five times with different numbers in the second column. (All numbers in the second column are distinct, as it happens.)
All timings were collected with bash's time verb, taking the real (wallclock) time. (Sort multithreads nicely so the user time was always greater).
With the first column correctly sorted and the second column randomised:
sort -k1,1 -k2,2g sorted | sort -k1,1g -su 1.24s
sort -k1,1g -k2,2g sorted | sort -k1,1g -su 1.78s
sort -k2,2g sorted | sort -k1,1g -su 3.00s
With the first column randomised:
sort -k1,1 -k2,2g unsorted | sort -k1,1g -su 1.42s
sort -k1,1g -k2,2g unsorted | sort -k1,1g -su 2.19s
sort -k2,2g unsorted | sort -k1,1g -su 3.01s
You can use this gnu-awk command:
awk '!($1 in m) || m[$1]>$2{m[$1]=$2} END{for (i in m) print i, m[i]}' file
Or to get the order same as the input file:
awk 'BEGIN{PROCINFO["sorted_in"]="#ind_num_asc"} !($1 in m) || m[$1] > $2 {m[$1] = $2}
END{for (i in m) print i, m[i]}' file
BEGIN{PROCINFO["sorted_in"]="#ind_num_asc"} is used to order the associative array by numerical index.
Output:
0 0
23.9901 3.364
24.054 18.5279
25.0981 17.4839
42.582 0
45.79 0
45.7902 12.1518
51.034 12.028
54.11 14.072
54.1102 14.0718
You can do it like this:
awk 'NR==1{k=$1;v=$2;next} k==$1 { if (v>$2) v=$2; next} {print k,v; k=$1;v=$2}END{print k,v}'
indented:
awk '
# for the first record, store the two fields
NR==1 {
    k=$1
    v=$2
    next
}
# when the first field doesn't change
k==$1 {
    # check if the second field is lower
    if (v>$2)
        v=$2
    next
}
{
    # otherwise print the stored fields and reinitialize them
    print k,v
    k=$1
    v=$2
}
END {
    print k,v
}'
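One caveat worth adding (my note, not part of the original answer): the script only compares consecutive lines, so it assumes the input is already grouped by the first field, as the sample data is. If it is not, pre-sort it first, for example:
# group rows by the first field before feeding them to the awk script above
sort -k1,1g file | awk 'NR==1{k=$1;v=$2;next} k==$1 { if (v>$2) v=$2; next} {print k,v; k=$1;v=$2}END{print k,v}'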
In Perl:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

my %min;

while (<>) {
    chomp;
    my ($key, $value) = split;
    if (!exists $min{$key} or $value < $min{$key}) {
        $min{$key} = $value;
    }
}

for (sort { $a <=> $b } keys %min) {
    say "$_ $min{$_}";
}
It's written as a Unix filter, so it reads from STDIN and writes to STDOUT. Call it as:
$ ./get_min < input_file > output_file
When you want to use sort, you first have to fix the ordering. sort will not understand the decimal point, so temporarily change it to an x.
Now sort numerically on the numeric fields and put the decimal point back.
The resulting list is sorted correctly; take the first value of each key.
sed 's/\./ x /g' inputfile | sort -n -k1,3 -k4,6 | sed 's/ x /./g' | sort -u -k1,1

Join in unix when field is numeric in a huge file

So I have two files, File A and File B. File A is huge (>60 GB): it has 16 fields (a mix of numeric and string values), is separated by "|", and has over 600,000,000 lines. Field 3 in this file is the ID; it is a numeric field of varying length (e.g., one person's ID can be 1, and someone else's can be 100).
File B just has a bunch of IDs (~1,000,000), and I want to extract all the rows from File A that have an ID that is in File B. I have started doing this on Linux with the following code:
sort -k3,3 -t'|' FileA.txt > FileASorted.txt
sort -k1,1 -t'|' FileB.txt > FileBSorted.txt
join -1 3 -2 1 -t'|' FileASorted.txt FileBSorted.txt > merged.txt
The problem I have is that merged.txt is empty (when I know for a fact there are at least 10 matches)... I have googled this and it seems like the issue is that the join field (the ID) is numeric. Some people propose padding the field with zeros, but 1) I'm not entirely sure how to do this, and 2) it seems very slow/time-inefficient.
Any other ideas out there? Or help on how to add the zero-padding only to the relevant field?
I would first sort file.b using the unique flag (-u):
sort -u file.b > sortedfile.b
Then loop through sortedfile.b and grep file.a for each entry. In zsh I would do:
foreach C (`cat sortedfile.b`)
    grep "$C" file.a > /dev/null
    if [ $? -eq 0 ]; then
        echo $C >> res.txt
    fi
end
Redirect the output from grep to /dev/null, test whether there was a match ($? -eq 0), and append (>>) the matching ID to res.txt. A single > would overwrite the file. I'm a bit rusty with zsh now, so there might be a typo. You may be using bash, which has a slightly different loop syntax.
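For a file this size, a hedged alternative worth mentioning (not part of the original answer): a single awk pass that loads the IDs from File B into memory and filters File A on its third field avoids sorting 60 GB altogether:
# read FileB.txt first (NR==FNR) and remember each ID as an array key,
# then print every FileA.txt line whose third "|"-separated field is a known ID
awk -F'|' 'NR==FNR { ids[$1]; next } $3 in ids' FileB.txt FileA.txt > merged.txt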

compare 2 csv find a match and output it using linux shell

I am really confused with uniq, sort, awk, so...
I've got 2 CSV files:
tail 300513-code.csv
11916
11922
11896
11897
128647
1319760
1321176
1017022
1017017
1220901
tail 30-05-4UTF.csv
131318,"...","st365-3",0,5
1220357,"Ящик алюминиевый зимний",,0,1
,"!!Марко Поло",,,
1014492,"Коробка Марко Поло TF1331D 13.8х7.7х3.1см.","1694.13.31"," 16,00",1
1017795,"Ящик Марко Поло FS2000 white-black 2-х полочный 29х16х14см.","1694.20.01"," 122,00",5
10923,"Ящик Марко Поло TR2045 red 2-х секционый большой 51.5х39.5х56.5см.","1694.20.45"," 351,00",4
10925,"Ящик Марко Поло TR2045 yellow 2-х секционый большой 51.5х39.5х56.5см.","1694.20.47"," 351,00",1
12717,"Металоискатель CARRETT",," 4050,00",1
1319913,"Пакет 50 коп.","01.янв",0,269
17596,"Пакет полиэтиленовый 40х50",1," 1,00",4843
So the first one is a list of codes for which I need to find a match and output only the matching lines. Example output.csv:
12717,"Металоискатель CARRETT",," 4050,00",1
1319913,"Пакет 50 коп.","01.янв",0,269
17596,"Пакет полиэтиленовый 40х50",1," 1,00",4843
suppose these 3 lines had a match
Your given input and output don't match: I cannot find 12717, 1319913 or 17596 in your first file, so I assume they are just examples. I think the following line is what you are looking for, so try it:
awk -F, 'NR==FNR{a[$0];next}$1 in a' 300513-code.csv 30-05-4UTF.csv
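A brief gloss on how that works (my annotation, not part of the original answer):
awk -F, '
    NR==FNR { a[$0]; next }   # first file: store every code as an array key
    $1 in a                   # second file: the default action prints lines whose
                              # first comma-separated field is a stored code
' 300513-code.csv 30-05-4UTF.csv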
If you are attempting to link using the first field from each file (bash on Linux):
join -1 1 -2 1 -t, <(sort -k1,1 -t, 300513-code.csv) <(sort -k1,1 -t, 30-05-4UTF.csv)

multiple field and numeric sort

List of files:
sysbench-size-256M-mode-rndrd-threads-1
sysbench-size-256M-mode-rndrd-threads-16
sysbench-size-256M-mode-rndrd-threads-4
sysbench-size-256M-mode-rndrd-threads-8
sysbench-size-256M-mode-rndrw-threads-1
sysbench-size-256M-mode-rndrw-threads-16
sysbench-size-256M-mode-rndrw-threads-4
sysbench-size-256M-mode-rndrw-threads-8
sysbench-size-256M-mode-rndwr-threads-1
sysbench-size-256M-mode-rndwr-threads-16
sysbench-size-256M-mode-rndwr-threads-4
sysbench-size-256M-mode-rndwr-threads-8
sysbench-size-256M-mode-seqrd-threads-1
sysbench-size-256M-mode-seqrd-threads-16
sysbench-size-256M-mode-seqrd-threads-4
sysbench-size-256M-mode-seqrd-threads-8
sysbench-size-256M-mode-seqwr-threads-1
sysbench-size-256M-mode-seqwr-threads-16
sysbench-size-256M-mode-seqwr-threads-4
sysbench-size-256M-mode-seqwr-threads-8
I would like to sort them by mode (rndrd, rndwr etc.) and then number:
sysbench-size-256M-mode-rndrd-threads-1
sysbench-size-256M-mode-rndrd-threads-4
sysbench-size-256M-mode-rndrd-threads-8
sysbench-size-256M-mode-rndrd-threads-16
sysbench-size-256M-mode-rndrw-threads-1
sysbench-size-256M-mode-rndrw-threads-4
sysbench-size-256M-mode-rndrw-threads-8
sysbench-size-256M-mode-rndrw-threads-16
....
I've tried the following loop, but it sorts only by the number, while I need the sequence 1,4,8,16 within each mode:
$ for f in $(ls -1A); do echo $f; done | sort -t '-' -k 7n
EDIT:
Please note that a plain numeric sort (-n) orders them by number across all modes (1,1,1,1,4,4,4,4,...), but I need the sequence 1,4,8,16,1,4,8,16,...
Sort by more columns:
sort -t- -k5,5 -k7n
The primary sort is on the 5th field (and only that field, hence 5,5); the secondary sort is numeric on the 7th field.
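To see why the first key is limited to 5,5 (my illustration, not from the original answers): if it is left open-ended as -k5, the key runs to the end of the line, so the thread count is compared as part of that text key and the numeric secondary key never gets a chance:
# the open-ended first key already distinguishes the lines lexicographically,
# so "...threads-16" still sorts before "...threads-4" despite -k7,7n
ls -A | sort -t- -k5 -k7,7n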
The for loop is completely unnecessary, as is the -1 argument to ls when its output is piped. This yields
ls -A | sort -t- -k 5,5 -k 7,7n
where the first key begins and ends at field 5, and the second key begins and ends at field 7 and is compared numerically.

Print only values smaller than certain threshold in bash

I have a file with more than 10000 lines like this, mostly numbers and some strings:
-40
-50
stringA
100
20
-200
...
I would like to write a bash (or other) script that reads this file and outputs only numbers (no strings), and only those values smaller than zero (or some other predefined number). How can this be done?
In this case the output (sorted) would be:
-40
-50
-200
...
cat filename | awk '{if($1==$1+0 && $1<THRESHOLD_VALUE)print $1}' | sort -n
The $1==$1+0 test ensures the field is a number; the value is then checked against THRESHOLD_VALUE (change this to whatever number you wish), printed if it passes, and sorted.
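Since THRESHOLD_VALUE sits inside the single quotes, you have to edit the awk script itself to change it; a hedged variant of mine passes the threshold in as an awk variable instead:
# -v t=0 sets the threshold from the shell; adjust 0 as needed
awk -v t=0 '$1==$1+0 && $1<t { print $1 }' filename | sort -n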
awk '$1 < NUMBER { print }' FILENAME | sort -n
where NUMBER is the number that you want to use as an upper bound and FILENAME is your file with 10000+ lines of numbers. You can drop the | sort -n if you don't want to sort the numbers.
edit: One small caveat. If your string starts with a number, it will treat it as that number. Otherwise it should ignore it.
Another alternative is as follows:
function compare() {
    if test $1 -lt $MAX_VALUE; then
        echo $1
    fi
} 2> /dev/null
Have a look at help test and man bash for further help on this. The 2> /dev/null redirects errors thrown by test when you try to compare something other than two integers. Call the function like:
compare 1
compare -1
compare string A
Only the middle line will give output.
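To actually run it over the file (a sketch of my own; the original answer only defines the function):
MAX_VALUE=0              # the threshold the function compares against
while IFS= read -r line; do
    compare "$line"      # prints the line only if it is an integer below MAX_VALUE
done < filename | sort -n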
