Unix sort command doesn't give me correct output - sorting

I have input file (test) which looks like this:
MarkerName Allele1 Allele2 Weight Zscore P-value Direction
rs8065651 t c 2.00 -1.345 0.1787 --
rs12450876 a g 2.00 -0.496 0.6201 +-
rs7209239 a t 2.00 1.134 0.2569 ++
rs7210970 a g 2.00 1.724 0.08462 ++
rs4791114 a g 2.00 -1.156 0.2476 --
rs10853140 a g 2.00 0.989 0.3229 ++
rs237316 a g 2.00 0.738 0.4607 ++
rs11871508 a g 2.00 -5.527 3.265e-08 --
I am running the sort command to find the 3 smallest P-values:
sort -nk 6 test | head -3 > output.txt
but in my result (output.txt) I am getting this:
MarkerName Allele1 Allele2 Weight Zscore P-value Direction
rs7210970 a g 2.00 1.724 0.08462 ++
rs8065651 t c 2.00 -1.345 0.1787 --
This is obviously not the right result.
Can you please help with this?

First you need to remove the header line from the file.
tail -n +2 test
Then sort. For sorting floating-point values, including ones in scientific notation such as 3.265e-08 (which -n does not handle), use the -g flag.
Also make sure that your locale is set correctly; otherwise collation rules will influence the result.
LC_ALL=C sort -bg --key=6,6
So:
tail -n +2 test | LC_ALL=C sort -bg --key=6,6 | head -3
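With the sample input above, this should return the three rows with the smallest P-values:
rs11871508 a g 2.00 -5.527 3.265e-08 --
rs7210970 a g 2.00 1.724 0.08462 ++
rs8065651 t c 2.00 -1.345 0.1787 --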

Related

unix sort groups by their associated maximum value?

Let's say I have this input file 49142202.txt:
A 5
B 6
C 3
A 4
B 2
C 1
Is it possible to sort the groups in column 1 by their maximum value in column 2? The desired output is as follows:
B 6 <-- B group at the top, because 6 is larger than 5 and 3
B 2 <-- 2 less than 6
A 5 <-- A group in the middle, because 5 is smaller than 6 and larger than 3
A 4 <-- 4 less than 5
C 3 <-- C group at the bottom, because 3 is smaller than 6 and 5
C 1 <-- 1 less than 3
Here is my solution:
join -t$'\t' -1 2 -2 1 \
<(cat 49142202.txt | sort -k2nr,2 | sort --stable -k1,1 -u | sort -k2nr,2 \
| cut -f1 | nl | tr -d " " | sort -k2,2) \
<(cat 49142202.txt | sort -k1,1 -k2nr,2) \
| sort --stable -k2n,2 | cut -f1,3
The first input to join sorted by column 2 is this:
2 A
1 B
3 C
The second input to join sorted by column 1 is this:
A 5
A 4
B 6
B 2
C 3
C 1
The output of join is:
A 2 5
A 2 4
B 1 6
B 1 2
C 3 3
C 3 1
This is then sorted by the nl line number in column 2, and then cut keeps the original input columns 1 and 3.
I know it can be done a lot more easily with, for example, pandas groupby in Python, but is there a more elegant way of doing it while sticking to GNU Coreutils such as sort, join, cut, tr and nl? Preferably I want to avoid a memory-inefficient awk solution, but please share those as well. Thanks!
As explained in the comments, my solution tries to reduce the number of pipes, unnecessary cat commands, and especially the number of sort operations in the pipeline, since sorting is a complex/time-consuming operation:
I reached the following solution where f_grp_sort is the input file:
for elem in $(sort -k2nr f_grp_sort | awk '!seen[$1]++{print $1}')
do
grep $elem <(sort -k2nr f_grp_sort)
done
OUTPUT:
B 6
B 2
A 5
A 4
C 3
C 1
Explanations:
sort -k2nr f_grp_sort will generate the following output:
B 6
A 5
A 4
C 3
B 2
C 1
and sort -k2nr f_grp_sort | awk '!seen[$1]++{print $1}' will generate the output:
B
A
C
The awk just prints, in that same order, one unique element from the first column of the sorted output.
Then the for elem in $(...); do grep $elem <(sort -k2nr f_grp_sort); done
will grep for the lines containing B, then A, then C, which provides the required output.
Now, as an enhancement, you can use a temporary file to avoid running the sort -k2nr f_grp_sort operation twice:
$ sort -k2nr f_grp_sort > tmp_sorted_file && for elem in $(awk '!seen[$1]++{print $1}' tmp_sorted_file); do grep $elem tmp_sorted_file; done && rm tmp_sorted_file
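Note that grep $elem matches the key anywhere on the line; if a key value could also appear in the second column, you can anchor the pattern to the first column, for example:
grep "^${elem}[[:blank:]]" tmp_sorted_file  # anchor the match to column 1 (a suggested refinement, not part of the original answer)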
So, this won't work for all cases, but if the values in your first column can be turned into bash variable names, we can use dynamically named arrays to do this instead of a bunch of joins. It should be pretty fast.
The first while block reads in the contents of the file, getting the first two space-separated strings and putting them into col1 and col2. We then create a series of arrays named like ARR_A and ARR_B, where A and B are the values from column 1 (but only if $col1 contains only characters that can be used in bash variable names). Each array contains the column 2 values associated with that column 1 value.
I use your fancy sort chain to get the order in which the column 1 values should print; we then just loop through them, and for each column 1 array we sort the values and echo out column 1 and column 2.
The dynamic variable bits can be hard to follow, but for the right values in column 1 it will work. Again, if column 1 contains any characters that can't be part of a bash variable name, this solution will not work.
file=./49142202.txt
while read col1 col2 extra
do
if [[ "$col1" =~ ^[a-zA-Z0-9_]+$ ]]
then
eval 'ARR_'${col1}'+=("'${col2}'")'
else
echo "Bad character detected in Column 1: '$col1'"
exit 1
fi
done < "$file"
sort -k2nr,2 "$file" | sort --stable -k1,1 -u | sort -k2nr,2 | while read col1 extra
do
for col2 in $(eval 'printf "%s\n" "${ARR_'${col1}'[@]}"' | sort -r)
do
echo $col1 $col2
done
done
This was my test, a little more complex than your provided example:
$ cat 49142202.txt
A 4
B 6
C 3
A 5
B 2
C 1
C 0
$ ./run
B 6
B 2
A 5
A 4
C 3
C 1
C 0
Thanks a lot @JeffBreadner and @Allan! I came up with yet another solution, which is very similar to my first one but gives a bit more control, because it allows for easier nesting with for loops:
for x in $(sort -k2nr,2 $file | sort --stable -k1,1 -u | sort -k2nr,2 | cut -f1); do
awk -v x=$x '$1==x' $file | sort -k2nr,2
done
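Assuming the file is tab-separated (which the cut -f1 in the first pipeline implies), this should print the same desired output as before:
B 6
B 2
A 5
A 4
C 3
C 1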
Do you mind if I don't accept either of your answers until I have had time to evaluate the time and memory performance of your solutions? Otherwise I would probably just go for the awk solution by @Allan.

Passing a variable to the grep command

Given text files (say foo*.txt) with data as follows:
1 g = 0.54 0.00
2 g = 0.32 0.00
3 g = 0.45 0.00
...
5000 g = 0.5 0.00
Basically, I want to extract 10 lines before and after the matching line (including the matching line). The matching line contains 59 characters, consisting of strings, spaces and numbers.
I have a script as follows:
#!/usr/bin/bash
for file in foo*.txt;
do
var=$(command_to_extract_var) # 59 characters containing strings, spaces and numbers
# to get this var, I use grep and head
grep -C 10 "$var" "$file"
done > bar.csv
Running the script with bash -x script_name.sh gives the following:
+ for file in 'foo*.txt'
++ grep 'match_pattern' foo1.txt
++ awk '{print $6}'
++ head -n1
++ grep '[0-9]'
+ basis=150
++ grep 'match_pattern' foo1.txt
++ tail -n1
++ awk '{print $3}'
+ number=25
++ grep '[0-9] f = ' foo.txt
++ tail -n150
This is followed by a number of lines (even up to 1000) like
001 h = 0.000000000000000E+00 e = 3.543218084205956E+00
Finally,
File name too long
+ final=
+ grep -C 10 '' foo1.txt
The output I expect is (one column from each file):
0.54 0.62 0.36 ... 0.45
0.32 3.25 0.89 ... 0.25
0.45 0.96 0.14 ... 0.14
... .... .... ... 0.96
0.25 0.00 7.23 ... 0.77

Calculating mean from values in columns specified on the first line using awk

I have a huge file (hundreds of lines, ca. 4,000 columns) structured like this
locus 1 1 1 2 2 3 3 3
exon 1 2 3 1 2 1 2 3
data1 17.07 7.11 10.58 10.21 19.34 14.69 3.32 21.07
data2 21.42 11.46 7.88 9.89 27.24 12.40 0.58 19.82
and I need to calculate the mean of all values with the same locus number (i.e., the same number in the first line), separately for each data line, i.e.
data1: the mean of the first three values (the three columns with locus '1': 17.07, 7.11, 10.58), then of the next two values (10.21, 19.34) and then of the next three values (14.69, 3.32, 21.07)
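(That is, for data1: (17.07 + 7.11 + 10.58) / 3 ≈ 11.59, (10.21 + 19.34) / 2 = 14.775 and (14.69 + 3.32 + 21.07) / 3 ≈ 13.03.)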
I would like to have output like this:
data1 mean1 mean2 mean3
data2 mean1 mean2 mean3
I was thinking about using bash and awk...
Thank you for your advice.
You can use GNU datamash version 1.1.0 or newer (I used the latest version, 1.1.1):
#!/bin/bash
lines=$(wc -l < "$1")
datamash -W transpose < "$1" |
datamash -H groupby 1 mean 3-"$lines" |
datamash transpose
Usage: mean_value.sh input.txt | column -t (column -t is only needed for a pretty view; it is not required)
Output:
GroupBy(locus) 1 2 3
mean(data1) 11.586666666667 14.775 13.026666666667
mean(data2) 13.586666666667 18.565 10.933333333333
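For reference, the intermediate table after the first datamash -W transpose (before grouping) should look like this, with one row per original column:
locus exon data1 data2
1 1 17.07 21.42
1 2 7.11 11.46
1 3 10.58 7.88
2 1 10.21 9.89
2 2 19.34 27.24
3 1 14.69 12.40
3 2 3.32 0.58
3 3 21.07 19.82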
If it were me, I would use R, not awk:
library(data.table)
x = fread('data.txt')
#> x
# V1 V2 V3 V4 V5 V6 V7 V8 V9
#1: locus 1.00 1.00 1.00 2.00 2.00 3.00 3.00 3.00
#2: exon 1.00 2.00 3.00 1.00 2.00 1.00 2.00 3.00
#3: data1 17.07 7.11 10.58 10.21 19.34 14.69 3.32 21.07
#4: data2 21.42 11.46 7.88 9.89 27.24 12.40 0.58 19.82
# save first column of names for later
cnames = x$V1
# remove first column
x[,V1:=NULL]
# matrix transpose: makes rows into columns
x = t(x)
# convert back from matrix to data.table
x = data.table(x,keep.rownames=F)
# set the column names
colnames(x) = cnames
#> x
# locus exon data1 data2
#1: 1 1 17.07 21.42
#...
# ditch useless column
x[,exon:=NULL]
#> x
# locus data1 data2
#1: 1 17.07 21.42
# apply mean() function to each column, grouped by locus
x[,lapply(.SD,mean),locus]
# locus data1 data2
#1: 1 11.58667 13.58667
#2: 2 14.77500 18.56500
#3: 3 13.02667 10.93333
for convenience, here's the whole thing again without comments:
library(data.table)
x = fread('data.txt')
cnames = x$V1
x[,V1:=NULL]
x = t(x)
x = data.table(x,keep.rownames=F)
colnames(x) = cnames
x[,exon:=NULL]
x[,lapply(.SD,mean),locus]
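If you want to run this non-interactively, you could save it as, say, mean_by_locus.R (a hypothetical filename) and run:
Rscript mean_by_locus.R  # example name; fread('data.txt') also assumes your input is saved as data.txt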
awk ' NR==1{for(i=2;i<NF+1;i++) multi[i]=$i}
NR>2{
for(i in multi)
{
data[multi[i]] = 0
count[multi[i]] = 0
}
for(i=2;i<NF+1;i++)
{
data[multi[i]] += $i
count[multi[i]] += 1
};
printf "%s ",$1;
for(i in data)
printf "%s ", data[i]/count[i];
print ""
}' <file_name>
Replace <file_name> with your data file
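With the sample data above this should print something like the following, though the order of the per-locus columns is not guaranteed, because awk's for (i in data) loop iterates in an unspecified order:
data1 11.5867 14.775 13.0267
data2 13.5867 18.565 10.9333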

Custom Sort Multiple Files

I have 10 files (1Gb each). The contents of the files are as follows:
head -10 part-r-00000
a a a c b 1
a a a dumbbell 1
a a a f a 1
a a a general i 2
a a a glory 2
a a a h d 1
a a a h o 4
a a a h z 1
a a a hem hem 1
a a a k 3
I need to sort each file based on the last column of each line (descending order); the lines are of variable length. If there is a tie on the numerical value, then sort alphabetically by the 2nd-last column. The following Bash command works on small datasets (not complete files) and takes 3 seconds to sort only 10 lines from one file.
cat part-r-00000 | awk '{print $NF,$0}' | sort -nr | cut -f2- -d' ' > FILE
I want the output in a separate file, FILE. Can someone help me speed up the process?
No, once you get rid of the UUOC (useless use of cat) that's as fast as it's going to get. Obviously you need to prepend the 2nd-last field as well, e.g. something like:
awk '{print $NF,$(NF-1),$0}' part-r-00000 | sort -k1,1nr -k2,2 | cut -f3- -d' '
Check the sort args; I always get mixed up with those.
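With the ten sample lines above, that pipeline should produce:
a a a h o 4
a a a k 3
a a a glory 2
a a a general i 2
a a a f a 1
a a a c b 1
a a a h d 1
a a a dumbbell 1
a a a hem hem 1
a a a h z 1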
Reverse the field order, sort, and reverse the field order again:
awk '{for (i=NF;i>0;i--){printf "%s ",$i};printf "\n"}' file | sort -nr | awk '{for (i=NF;i>0;i--){printf "%s ",$i};printf "\n"}'
Output:
a a a h o 4
a a a k 3
a a a general i 2
a a a glory 2
a a a h z 1
a a a hem hem 1
a a a dumbbell 1
a a a h d 1
a a a c b 1
a a a f a 1
You can use a Schwartzian transform to accomplish your task:
awk '{print -$NF, $(NF-1), $0}' input_file | sort -n | cut -d' ' -f3-
The awk command prepends each record with the negative of the last field and the second last field.
The sort -n command sorts the record stream in the required order because we used the negative of the last field.
The cut command splits on spaces and drops the first two fields, i.e., the ones we prepended to normalize the sort.
Example
$ echo 'a a a c b 1
a a a dumbbell 1
a a a f a 1
a a a general i 2
a a a glory 2
a a a h d 1
a a a h o 4
a a a h z 1
a a a hem hem 1
a a a k 3' | awk '{print -$NF, $(NF-1), $0}' | sort -n | cut -d' ' -f3-
a a a h o 4
a a a k 3
a a a glory 2
a a a general i 2
a a a f a 1
a a a c b 1
a a a h d 1
a a a dumbbell 1
a a a hem hem 1
a a a h z 1
$

The simplest way to join 2 files using bash so that both of their keys appear in the result

I have 2 input files
file1
A 0.01
B 0.09
D 0.05
F 0.08
file2
A 0.03
C 0.01
D 0.04
E 0.09
The output I want is
A 0.01 0.03
B 0.09 NULL
C NULL 0.01
D 0.05 0.04
E NULL 0.09
F 0.08 NULL
The best that I can do is
join -t' ' -a 1 -a 2 -1 1 -2 1 -o 1.1,1.2,2.2 file1 file2
which doesn't give me what I want
You can write:
join -t $'\t' -a 1 -a 2 -1 1 -2 1 -e NULL -o 0,1.2,2.2 file1 file2
where I've made these changes:
In the output format, I changed 1.1 ("first column of file #1") to 0 ("join field"), so that values from file #2 can show up in the first field when necessary. (Specifically, so that C and E will.)
I added the -e option to specify a value (NULL) for missing/empty fields.
I used $'\t', which Bash converts to a tab, instead of typing an actual tab. I find this easier to use than a tab in the middle of the command. But if you disagree, and the actual tab is working for you, then by all means, you can keep using it. :-)
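Assuming file1 and file2 are tab-separated (as the -t $'\t' implies) and sorted on the key, this should print exactly the output you asked for:
A 0.01 0.03
B 0.09 NULL
C NULL 0.01
D 0.05 0.04
E NULL 0.09
F 0.08 NULL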
