Bash Sorting a List by Columns

I'm reverse-sorting on column 2.
When multiple lines have the same value in column 2 ($2), I want those lines sorted in reverse order by column 1 as well. The list is currently stored in a variable in a bash script. Is there a sed or awk way to do this?
My output right now, for example, is:
123, 3
124, 3
12345, 2
898, 1
1010, 1
What I want is:
124, 3
123, 3
12345, 2
1010, 1
898, 1

Use a combination of Perl one-liners and sort. The one-liners convert the , delimiter into a tab (and back). sort uses the -r option for reverse order and the -g option for general numeric sort. The -kN,N option restricts a sort key to field N: here we sort by the 2nd field, then by the 1st.
perl -pe 's/, /\t/' in_file | sort -k2,2gr -k1,1gr | perl -pe 's/\t/, /' > out_file
For example:
Create example input file:
cat > foo <<EOF
123, 3
124, 3
12345, 2
898, 1
1010, 1
EOF
Run the command:
cat foo | perl -pe 's/, /\t/' | sort -k2,2gr -k1,1gr | perl -pe 's/\t/, /'
Output:
124, 3
123, 3
12345, 2
1010, 1
898, 1
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlrequick: Perl regular expressions quick start

It's not a trivial awk script, but it's not hard either. The script below uses an array a[] to collect the first-field values for each run of equal second-field values. If last is set (i.e. this is not the first record) and the second field changes, the current array is printed and then reset (that is Rule 1).
Rule 2 scans through the existing array and inserts the current first field in (descending) order. The last value of the second field is saved so you know when it changes, and the END rule outputs the final group of values, e.g.
awk -F, '
last && $2 != last {
    for (i=1; i<=n; i++)
        print a[i] "," last
    delete a
    n = 0
}
{
    swapped = 0
    for (i=1; i<=n; i++)
        if ($1 > a[i]) {
            swapped = 1
            for (j=n+1; j>i; j--)
                a[j] = a[j-1]
            a[i] = $1
            break
        }
    if (!swapped)
        a[++n] = $1
    else
        n++
    last = $2
}
END {
    for (i=1; i<=n; i++)
        print a[i] "," last
}
' file
The swapped flag just tells you whether the current first-field was inserted into the array before an existing element (swapped == 1) or if it was just added at the end (swapped == 0).
Example Use/Output
With your sample data in a file named file, running the script produces:
124, 3
123, 3
12345, 2
1010, 1
898, 1
Look things over and let me know if you have questions.

Also with awk, you can try this. Note that it reads the whole file as a single record (RS="") with each line as a field (FS="\n"), then simply swaps lines 1 and 2 and lines 4 and 5, so it reproduces the desired output only for this exact input:
awk 'BEGIN{RS=""; OFS=FS="\n"} {tmp2 = $2; $2 = $1; $1 = tmp2; tmp5=$5; $5=$4; $4=tmp5}1' file
124, 3
123, 3
12345, 2
1010, 1
898, 1
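If the fields are plain integers, as in the example, a single sort call on the comma-delimited fields would also do the job (a sketch, not taken from the answers above; -n is used instead of -g since the values are integers, and numeric comparison skips the blank after the comma):
sort -t, -k2,2nr -k1,1nr file
This prints the same five lines in the desired order.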

Related

Loop to create a DF from values in bash

I'm creating various text files from a file like this:
Chrom_x,Pos,Ref,Alt,RawScore,PHRED,ID,Chrom_y
10,113934,A,C,0.18943,5.682,rs10904494,10
10,126070,C,T,0.030435000000000007,3.102,rs11591988,10
10,135656,T,G,0.128584,4.732,rs10904561,10
10,135853,A,G,0.264891,6.755,rs7906287,10
10,148325,A,G,0.175257,5.4670000000000005,rs9419557,10
10,151997,T,C,-0.21169,0.664,rs9286070,10
10,158202,C,T,-0.30357,0.35700000000000004,rs9419478,10
10,158946,C,T,2.03221,19.99,rs11253562,10
10,159076,G,A,1.403107,15.73,rs4881551,10
What I am trying to do is extract, in bash, all rows whose value falls between two bounds:
gawk '$6>=0 && $NF<=5 {print $0}' file.csv > 0_5.txt
and then create files for 6 to 10, 11 to 15, ... up to 95 to 100. I was thinking of creating a loop for this with something like
#!/usr/bin/env bash
n=( 0,5,6,10...)
if i in n:
gawk '$6>=n && $NF<=n+1 {print $0}' file.csv > n_n+1.txt
and so on.
How can I turn this into a loop and create the files for these specific ranges?
While you could use a shell loop to provide inputs to an awk script, you could also just use awk to natively split the values into buckets and write the lines to those "bucket" files itself:
awk -F, 'NR > 1 {
    i = int(($6 - 1) / 5)
    fname = (i*5) "_" ((i+1)*5) ".txt"
    print $0 > fname
}' < input
The code skips the header line (NR > 1) and then computes a "bucket index" by dividing the value in column six by five. The filename is then constructed by multiplying that index (and its increment) by five, and the whole line is printed to that file. For example, $6 = 19.99 gives i = int(18.99 / 5) = 3, so that line is written to 15_20.txt.
To use a shell loop (and call awk 20 times on the input), you could use something like this:
for ((i=0; i<=19; i++))
do
    floor=$((i * 5))
    ceiling=$(( (i+1) * 5 ))
    awk -F, -v floor="$floor" -v ceiling="$ceiling" \
        'NR > 1 && $6 >= floor && $6 < ceiling { print }' < input \
        > "${floor}_${ceiling}.txt"
done
The basic idea is the same; here, we're creating the bucket index with the outer loop and then passing the range into awk as the floor and ceiling variables. We're only asking awk to print the matching lines; the output from awk is captured by the shell as a redirection into the appropriate file.

Add sequence lengths to headers in a fasta file

I have a multifasta file and would like to add the sequence lengths to the headers by keeping the sequences.
>Seq1
MADKLTRIAIVNHDKCKPKKCRQECKKSCPVVRMGKLCIEVTPQSKIAWISETLCIGCGI
KILAGKQKPNLGKYDDPPDWQEILTYFRGSELQNYFTKILEDDLKAIIKPQYVDQIPKAA
KGTVGSILDRKDETKTQAIVCQQLDLTHLKERNVEDLSGGELQRFACAVVCIQK
>Seq2
MADKLTRIAIVNHDKCKPKKCRQECKKSCPVVRMGKLCIEVTSQSKIAWISETLCIGCGI
CIKKCPFGALSIVNLPSNLEKETTHRYCANAFKLHRLPIPRPGEVLGLVGTNGIGKSTAL
KGTVGSILDRKDETKTQTVVCQQLDLTHLKERNVEDLSGGELQRFACAVVCIQKADIFMF
DEPSSYLDVKQRLKAAITIRSLINPDRYIIV
My desired output
>Seq1_174
MADKLTRIAIVNHDKCKPKKCRQECKKSCPVVRMGKLCIEVTPQSKIAWISETLCIGCGI
KILAGKQKPNLGKYDDPPDWQEILTYFRGSELQNYFTKILEDDLKAIIKPQYVDQIPKAA
KGTVGSILDRKDETKTQAIVCQQLDLTHLKERNVEDLSGGELQRFACAVVCIQK
>Seq2_211
MADKLTRIAIVNHDKCKPKKCRQECKKSCPVVRMGKLCIEVTSQSKIAWISETLCIGCGI
CIKKCPFGALSIVNLPSNLEKETTHRYCANAFKLHRLPIPRPGEVLGLVGTNGIGKSTAL
KGTVGSILDRKDETKTQTVVCQQLDLTHLKERNVEDLSGGELQRFACAVVCIQKADIFMF
DEPSSYLDVKQRLKAAITIRSLINPDRYIIV
I tried to use this command
awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' file.fasta | paste - - | sed 's/\t/_/' | >seq_len.fasta
but it only shows the length without the sequence.
Can you help me to fix that without using biopython or seqkit?
When a line doesn't begin with >, accumulate the sequence data in a variable and add its length to a running total. When a line begins with >, print the sequence you have been accumulating and save the current line as the name of the next sequence. Finally, at the end of the file, print the last sequence. For example:
awk '/^>/ { if (name) {printf("%s_%d\n%s", name, len, seq)} name=$0; seq=""; len = 0; next}
NF > 0 {seq = seq $0 "\n"; len += length()}
END { if (name) {printf("%s_%d\n%s", name, len, seq)} }' file.fasta > seq_len.fasta
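To spot-check just the rewritten headers (a small verification step, not part of the original answer):
grep '^>' seq_len.fasta
which, for the sample above, should print:
>Seq1_174
>Seq2_211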

How to merge two .txt files into one, matching each line by timestamp, using awk

Summary:
I currently have two .txt files imported from a survey system I am testing. Column 1 of each data file is a timestamp of the format "HHMMSS.SSSSSS". In file1, there is a second column of field intensity readings. In file2, there are two additional columns of positional information. I'm attempting to write a script that matches data points between these files by lining the timestamps up. The issue is that at no point are any of the timestamps the exact same value. The script must be able to match data points (lines in each .txt file) based on the timestamp of its closest counterpart in the other file (i.e. the time 125051.354948 from file1 should "match" the nearest timestamp in file2, which is 125051.112784).
If anyone with a little bit more awk/sed/join/regex/Unix knowledge could point me in the right direction, I would be very appreciative.
What I have so far:
(Please note that the exact syntax shown here may not make sense for the sample .txt files attached to this question; more extensive versions of these files, with more columns, were used when testing these scripts.)
I'm new to awk/Unix/shell scripting so please bear with me if some of these trial solutions don't work or don't make a whole lot of sense.
I have already attempted some solutions posted here on stack overflow using join, but it doesn't seem to want to properly sort or join either of these files:
{
join -o 1.1,2.2 -1 2 -2 1 <(sort -k 2 file1) <(sort -k 1 file2)
join -v 1 -o 1.1,1.2 -1 2 -2 1 <(sort -k 2 file1) <(sort -k 1 file2)
} | sort -k 1
Result: only outputs a similar version of the original file2
I attempted to reconfigure existing awk solutions that were posted here as well:
awk 'BEGIN {FS=OFS="\t"} NR==FNR {v[$3]=$2; next} {print $1, (v[$3] ? v[$3] : 0)}' file1 file2 > file3
awk 'BEGIN {FS=OFS="\t"} NR==FNR {v[$1]=$2; next} {print $1, (v[$1] ? v[$1] : 0)}' file1 file2 > file3
Result: both of these awk commands result in the output of file2's data with nothing from file1 included (or so it seems).
awk -F '
FNR == NR {
    time[$3]
    next
}
{
    for (i in time)
        if (index($3, i) == 1) {
            print
            next
        }
}' file1 file2 > file3
Result: keeps returning a syntax error regarding the "." of ".txt"
I looked into integrating some sort of regex or split command into the script... but was confused as to how to proceed and didn't come up with anything of substance.
Sample Data
$ cat file1.txt
125051.354948 058712.429
125052.352475 058959.934
125054.354322 058842.619
125055.352671 058772.045
125057.351794 058707.281
125058.352678 058758.959
$ cat file2.txt
125050.105886 4413.34358 07629.87620
125051.112784 4413.34369 07629.87606
125052.100811 4413.34371 07629.87605
125053.097826 4413.34373 07629.87603
125054.107361 4413.34373 07629.87605
125055.107038 4413.34375 07629.87604
125056.093783 4413.34377 07629.87602
125057.097928 4413.34378 07629.87603
125058.098475 4413.34378 07629.87606
125059.095787 4413.34376 07629.87602
Expected Result:
(Format: Column1File1 Column1File2 Column2File1 Column2File2 Column3File2)
$ cat file3.txt
125051.354948 125051.112784 058712.429 4413.34358 07629.87620
125052.352475 125052.100811 058959.934 4413.34371 07629.87605
125054.354322 125054.107361 058842.619 4413.34373 07629.87605
125055.352671 125055.107038 058772.045 4413.34375 07629.87604
125057.351794 125057.097928 058707.281 4413.34378 07629.87603
125058.352678 125058.098475 058758.959 4413.34378 07629.87606
As shown, not every data point from each file will find a match; only pairs of lines whose timestamps are closest to one another will be written to the new file.
As previously mentioned, my current attempts result in file3 being entirely blank, or containing information from only one of the two files (but not both).
Please try the following:
awk '
# find the closest element in "a" to val and return the index
function binsearch(a, val, len,    low, high, mid) {
    if (val < a[1])
        return 1
    if (val > a[len])
        return len
    low = 1
    high = len
    while (low <= high) {
        mid = int((low + high) / 2)
        if (val < a[mid])
            high = mid - 1
        else if (val > a[mid])
            low = mid + 1
        else
            return mid
    }
    return (val - a[low]) < (a[high] - val) ? high : low
}
NR == FNR {
    time[FNR] = $1
    position[FNR] = $2
    intensity[FNR] = $3
    len++
    next
}
{
    i = binsearch(time, $1, len)
    print $1 " " time[i] " " $2 " " position[i] " " intensity[i]
}
' file2.txt file1.txt
Result:
125051.354948 125051.112784 058712.429 4413.34369 07629.87606
125052.352475 125052.100811 058959.934 4413.34371 07629.87605
125054.354322 125054.107361 058842.619 4413.34373 07629.87605
125055.352671 125055.107038 058772.045 4413.34375 07629.87604
125057.351794 125057.097928 058707.281 4413.34378 07629.87603
125058.352678 125058.098475 058758.959 4413.34378 07629.87606
Note that the 4th and 5th values in your expected result may be wrongly copy-and-pasted.
[How it works]
The key is the binsearch function, which finds the value in the array closest to val and returns its index. I won't describe the algorithm in detail, as it is the standard "binary search" technique. For example, file1's first timestamp 125051.354948 falls between file2's 125051.112784 and 125052.100811, and the function returns the index of 125051.112784, the closer of the two.
#!/bin/bash
if [[ $# -lt 2 ]]; then
    echo "wrong args, it should be: $0 file1 file2"
    exit 1
fi
# strip blank lines, add an extra column 'm' to file1 lines, merge file1 and file2, sort by timestamp
{ awk 'NF{print $0, "m"}' "$1" ; awk 'NF' "$2"; } | sort -nk1,1 | \
awk '# record lines and fields into a
{a[NR] = $0; a[NR,1] = $1; a[NR,2] = $2; a[NR,3] = $3}
END{
    for(i=1; i<=NR; ++i){
        # a 3rd field of "m" marks a line that came from file1
        if(a[i,3] == "m"){
            # difference of column 1 between the current record and the previous/next record
            # (-1 means that neighbour is missing or is itself a file1 line)
            prevDiff = ((i-1) in a && a[i-1,3] != "m") ? a[i,1] - a[i-1,1] : -1
            nextDiff = ((i+1) in a && a[i+1,3] != "m") ? a[i+1,1] - a[i,1] : -1
            # compare the differences, choose the closer neighbour and print
            if(prevDiff != -1 && (nextDiff == -1 || prevDiff < nextDiff))
                print a[i,1], a[i-1,1], a[i,2], a[i-1,2], a[i-1,3]
            else if(nextDiff != -1 && (prevDiff == -1 || nextDiff < prevDiff))
                print a[i,1], a[i+1,1], a[i,2], a[i+1,2], a[i+1,3]
            else
                print a[i]
        }
    }
}'
The output of { awk 'NF{print $0, "m"}' "$1" ; awk 'NF' "$2"; } | sort -nk1,1 is:
125050.105886 4413.34358 07629.87620
125051.112784 4413.34369 07629.87606
125051.354948 058712.429 m
125052.100811 4413.34371 07629.87605
125052.352475 058959.934 m
125053.097826 4413.34373 07629.87603
125054.107361 4413.34373 07629.87605
125054.354322 058842.619 m
125055.107038 4413.34375 07629.87604
125055.352671 058772.045 m
125056.093783 4413.34377 07629.87602
125057.097928 4413.34378 07629.87603
125057.351794 058707.281 m
125058.098475 4413.34378 07629.87606
125058.352678 058758.959 m
125059.095787 4413.34376 07629.87602
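To try the script, save it under a name of your choice (merge.sh is used here purely as an example), then pass file1 and file2 in that order; the matched lines go to standard output:
bash merge.sh file1.txt file2.txt > file3.txt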

Average of first ten numbers of text file using bash

I have a file of two columns. The first column is dates and the second contains a corresponding number. The two columns are separated by a comma. I want to take the average of the first three numbers and print it to a new file, then do the same for the 2nd through 4th numbers, then the 3rd through 5th, and so on. For example:
File1
date1,1
date2,1
date3,4
date4,1
date5,7
Output file
2
2
4
Is there any way to do this using awk or some other tool?
Input
akshay#db-3325:/tmp$ cat file.txt
date1,1
date2,1
date3,4
date4,1
date5,7
akshay#db-3325:/tmp$ awk -v n=3 -v FS=, '{
    x = $2;
    i = NR % n;
    ma += (x - q[i]) / n;
    q[i] = x;
    if (NR >= n) print ma;
}' file.txt
2
2
4
Or the following, which is useful for plotting, since it keeps the reference axis value (in your case, the date) at the center of each averaging window.
Script
akshay#db-3325:/tmp$ cat avg.awk
BEGIN {
    m = int((n+1)/2)
}
{ L[NR]=$2; sum+=$2 }
NR>=m { d[++i]=$1 }
NR>n  { sum-=L[NR-n] }
NR>=n {
    a[++k] = sum/n
}
END {
    for (j=1; j<=k; j++)
        print d[j], a[j]   # remove d[j], if you just want values only
}
Output
akshay#db-3325:/tmp$ awk -v n=3 -v FS=, -v OFS=, -f avg.awk file.txt
date2,2
date3,2
date4,4
$ awk -F, '{a[NR%3]=$2} (NR>=3){print (a[0]+a[1]+a[2])/3}' file
2
2
4
A small math trick here: each record's $2 is stored in a[NR % 3], so the three array elements are updated cyclically and a[0] + a[1] + a[2] is always the sum of the last three numbers.
Updated based on helpful feedback from Ed Morton.
Here's a quick and dirty script to do what you've asked for. It doesn't have much flexibility, but you can easily figure out how to extend it.
To run it, save it into a file and execute it as an awk script, either with a shebang line or by calling awk -f:
// {
    Numbers[NR] = $2;
    if (NR >= 3) {
        printf("%i\n", (Numbers[NR] + Numbers[NR-1] + Numbers[NR-2])/3)
    }
}
BEGIN {
    FS = ","
}
Explanation:
// : matches every line; "/" is the match operator, and an empty match means "do this on every line".
Numbers[NR]=$2 : uses the record number (NR) as the key and stores the value from column 2.
if ( NR >= 3 ) : once three or more values have been read from the file...
printf(...) : ...do the maths and print the result as an integer.
BEGIN block : changes the field separator to a comma ",".
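For instance, saved under a hypothetical name such as movavg.awk, the script can be run as:
awk -f movavg.awk file.txt
or, alternatively, given a first line of #!/usr/bin/awk -f and made executable with chmod +x movavg.awk so it can be invoked directly.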

Add leading zeroes to awk variable

I have the following awk command within a "for" loop in bash:
awk -v pdb="$pdb" 'BEGIN {file = 1; filename = pdb"_" file ".pdb"}
/ENDMDL/ {getline; file ++; filename = pdb"_" file ".pdb"}
{print $0 > filename}' < ${pdb}.pdb
This reads a series of files with the name $pdb.pdb and splits them into files called $pdb_1.pdb, $pdb_2.pdb, ..., $pdb_21.pdb, etc. However, I would like to produce files with names like $pdb_01.pdb, $pdb_02.pdb, ..., $pdb_21.pdb, i.e., to add padding zeros to the "file" variable.
I have tried without success using printf in different ways. Help would be much appreciated.
Here's how to create leading zeros with awk:
# echo 1 | awk '{ printf("%02d\n", $1) }'
01
# echo 21 | awk '{ printf("%02d\n", $1) }'
21
Replace the 2 in %02d with the total number of digits you need (including the leading zeros).
Replace file in the filename construction with sprintf("%02d", file).
Or even the whole assignment with filename = sprintf("%s_%02d.pdb", pdb, file);.
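Applied to the splitting command from the question, that suggestion would look roughly like this (a sketch that only swaps sprintf into the filename assignments and otherwise keeps the original logic, including the getline call, unchanged):
awk -v pdb="$pdb" 'BEGIN {file = 1; filename = sprintf("%s_%02d.pdb", pdb, file)}
/ENDMDL/ {getline; file++; filename = sprintf("%s_%02d.pdb", pdb, file)}
{print $0 > filename}' < "${pdb}.pdb"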
This does it without resorting to printf, which is expensive. The first parameter is the string to pad, the second is the total length after padding.
echo 722 8 | awk '{ for(c = 0; c < $2; c++) s = s"0"; s = s$1; print substr(s, 1 + length(s) - $2); }'
If you know in advance the length of the result string, you can use a simplified version (say 8 is your limit):
echo 722 | awk '{ s = "00000000"$1; print substr(s, 1 + length(s) - 8); }'
The result in both cases is 00000722.
Here is a function that left- or right-pads values with zeroes depending on the parameters: zeropad(value, count, direction)
function zeropad(s, c, d) {
    if (d != "r")
        d = "l"            # l is the default and fallback value
    return sprintf("%" (d=="l" ? "0" c : "") "d" (d=="r" ? "%0" c-length(s) "d" : ""), s, "")
}
{   # test main
    print zeropad($1, $2, $3)
}
Some tests:
$ cat test
2 3 l
2 4 r
2 5
a 6 r
The test:
$ awk -f program.awk test
002
2000
00002
000000
It's not fully battle-tested, so strange parameters may yield strange results.
