bash cycle - output according to string from file

How can I name the output file after the string in the 4th column of the input (i.e., according to the 4th column of the i-th row)?
I tried:
for i in {1..321}; do
awk '(FNR==i) {outfile = $4 print $0 >> outfile}' RV1_phase;
done
or
for i in {1..321}; do
awk '(FNR==i) {outfile = $4; print $0}' RV1_phase > "$outfile";
done
input file:
1 2 2 a
4 5 6 f
4 4 5 f
....
....
desired output for i=1
file name: a
1 2 2 a
The aim: I have data that I plotted in gnuplot, and I would like to produce a set of files named after the string so that I know which point comes from which file. The points will be coloured. I need these files for plotting in gnuplot, so I would like to create them using the loop from my question.

Simply
for i in {1..321}; do
awk -v i="$i" '(FNR==i) {print $0 >> $4}' RV1_phase;
done
The problem with your first attempt was that you didn't use a ; to separate the assignment to outfile from the print command; the separate variable isn't necessary, though. Note that the loop variable i also has to be passed to awk with -v, because awk cannot see shell variables directly.
You don't need a bash loop, either:
awk '1 <= FNR && FNR <= 321 {print $0 >> $4}' RV1_phase;
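If the fourth column takes many distinct values, note that some awk implementations limit the number of files that can be open at once; a variant that closes each output file after writing (a sketch, assuming the same input file RV1_phase) avoids that:
awk 'FNR <= 321 { print $0 >> $4; close($4) }' RV1_phase
Because >> appends, remember to remove the old output files before re-running either version.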

Related

Loop to create a DF from values in bash

I'm creating various text files from a file like this:
Chrom_x,Pos,Ref,Alt,RawScore,PHRED,ID,Chrom_y
10,113934,A,C,0.18943,5.682,rs10904494,10
10,126070,C,T,0.030435000000000007,3.102,rs11591988,10
10,135656,T,G,0.128584,4.732,rs10904561,10
10,135853,A,G,0.264891,6.755,rs7906287,10
10,148325,A,G,0.175257,5.4670000000000005,rs9419557,10
10,151997,T,C,-0.21169,0.664,rs9286070,10
10,158202,C,T,-0.30357,0.35700000000000004,rs9419478,10
10,158946,C,T,2.03221,19.99,rs11253562,10
10,159076,G,A,1.403107,15.73,rs4881551,10
What I am trying to do is extract, in bash, all values between two values:
gawk '$6>=0 && $NF<=5 {print $0}' file.csv > 0_5.txt
And create files from 6 to 10, from 11 to 15 ... up to 95 to 100. I was thinking of creating a loop for this, something like
#!/usr/bin/env bash
n=( 0,5,6,10...)
if i in n:
gawk '$6>=n && $NF<=n+1 {print $0}' file.csv > n_n+1.txt
and so on.
How can I turn this into a loop and create files for these specific ranges?
While you could use a shell loop to provide inputs to an awk script, you could also just use awk to natively split the values into buckets and write the lines to those "bucket" files itself:
awk -F, ' NR > 1 {
i=int((($6 - 1) / 5))
fname=(i*5) "_" (i+1)*5 ".txt"
print $0 > fname
}' < input
The code skips the header line (NR > 1) and then computes a "bucket index" by subtracting one from the value in column six and dividing by five. The filename is then constructed by multiplying that index (and its increment) by five, and the whole line is printed to that file. For example, a PHRED of 19.99 gives i = int((19.99 - 1) / 5) = 3, so that line goes to 15_20.txt.
To use a shell loop (and call awk 20 times on the input), you could use something like this:
for((i=0; i <= 19; i++))
do
floor=$((i * 5))
ceiling=$(( (i+1) * 5))
awk -F, -v floor="$floor" -v ceiling="$ceiling" \
'NR > 1 && $6 >= floor && $6 < ceiling { print }' < input \
> "${floor}_${ceiling}.txt"
done
The basic idea is the same; here, we're creating the bucket index with the outer loop and then passing the range into awk as the floor and ceiling variables. We're only asking awk to print the matching lines; the output from awk is captured by the shell as a redirection into the appropriate file.
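For example, with just the nine sample rows shown in the question, the first awk script would create four bucket files: 0_5.txt (six lines), 5_10.txt, 10_15.txt and 15_20.txt (one line each). The shell-loop version creates all twenty files, leaving unmatched buckets empty, and its floor <= $6 < ceiling test draws the boundaries slightly differently from the index formula.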

piping commands of awk and sed is too slow! any ideas on how to make it work faster?

I am trying to convert a file containing a column with scaffold numbers and another one with corresponding individual sites into a bed file which lists sites in ranges. For example, this file ($indiv.txt):
SCAFF SITE
1 1
1 2
1 3
1 4
1 5
3 1
3 2
3 34
3 35
3 36
should be converted into $indiv.bed:
SCAFF SITE-START SITE-END
1 1 5
3 1 2
3 34 36
Currently, I am using the following code but it is super slow so I wanted to ask if anybody could come up with a quicker way??
COMMAND:
for scaff in $(awk '{print $1}' $indiv.txt | uniq)
do
awk -v I=$scaff '$1 == I { print $2 }' $indiv.txt | awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' | sed "s/^/$scaff\t/" >> $indiv.bed
done
DESCRIPTION:
awk '{print $1}' $indiv.txt | uniq #outputs a list with the unique scaffold numbers
awk -v I=$scaff '$1 == I { print $2 }' $indiv.txt #extracts the values from column 2 if the value in the first column equals the variable $scaff
awk 'NR==1{first=$1;last=$1;next} $1 == last+1 {last=$1;next} {print first,last;first=$1;last=first} END{print first,last}' #converts the list of sequential numbers into ranges as described here: https://stackoverflow.com/questions/26809668/collapse-sequential-numbers-to-ranges-in-bash
sed "s/^/$scaff\t/" >> $indiv.bed #adds a column with the respective scaffold number and then outputs the file into $indiv.bed
Thanks a lot in advance!
Calling several programs for every value in the first column, each of them re-reading the whole input, is bound to be slow. It's usually better to find a way to process all the lines in one call.
I'd reach for Perl:
tail -n+2 indiv.txt \
| sort -u -nk1,1 -nk2,2 \
| perl -ane 'END {print " $F[1]"}
next if $p[0] == $F[0] && $F[1] == $p[1] + 1;
print " $p[1]\n@F";
} continue { @p = @F;' > indiv.bed
The first two lines sort the input so that the groups are always adjacent (this might be unnecessary if your input is already sorted that way). Perl then reads the lines; -a splits each line into the @F array, and the @p array is used to keep the previous line: if the current line has the same first element and its second element is greater by 1, we go to the continue section, which just stores the current line into @p. Otherwise, we print the last element of the previous section and the first line of the current one. The END block is responsible for printing the last element of the last section.
The output is different from yours for sections that have only a single member.
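For comparison, the same single-pass idea can be written in awk; this is only a sketch, assuming the input is already sorted by scaffold and site as in the example, and (like the Perl version) it does not reproduce the header line:
awk 'NR == 1 { next }                        # skip the SCAFF SITE header
     $1 != scaff || $2 != last + 1 {         # new scaffold or a gap in the sites
         if (NR > 2) print scaff, first, last
         scaff = $1; first = $2
     }
     { last = $2 }
     END { if (NR > 1) print scaff, first, last }' indiv.txt > indiv.bed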

Sum values of specific columns using awk

So I have a file which looks like this:
1 4 6
2 5
3
I want to sum only specific columns, let's say the first and third.
And the output should look like this:
7
2
3
I store the numbers of the columns (arguments) in a variable:
x=${@:2} (because I omit the first passed argument, which is the filename)
How can I calculate this using awk in a bash script?
I was thinking about something like this
for i in ${@:2}
do
awk -v c=$i '{sum+=$c;print sum}' $fname
done
But it does not work properly.
How about something like this:
$ awk -v c="1 3" 'BEGIN{split(c,a)}{c=0;for(i in a) c+=$a[i]; print c}' file
7
2
3
Explained:
$ awk -v c="1 3" ' # the desired column list space-separated
BEGIN {
split(c,a) # if not space-separated, change it here
}
{
c=0; # reusing col var as count var. recycle or die!
for(i in a) # after split desired cols are in a arr, ie. a[1]=1, a[2]=3
c+=$a[i]; # sum em up
print c # print it
}' file
EDIT: changed comma-separation to space-separation.
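To tie this back to the question's argument handling (x=${@:2}), here is a minimal wrapper sketch around the same awk one-liner; the script name sumcols.sh is only illustrative:
#!/usr/bin/env bash
# usage: ./sumcols.sh file 1 3
fname=$1
shift
awk -v c="$*" 'BEGIN{split(c,a)} {s=0; for(i in a) s+=$a[i]; print s}' "$fname"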
awk '{print $1 + $3}' file
7
2
3

AWK: parsing arguments in a loop

I'm trying to write a simple script that will display the fields specified by the user as bash arguments. For example, I've got a text file that looks like this:
1 2 3 4 5
1 2 3 4 5
a b c d e
And, for example, the user types:
./script.sh text 1 2 5
Where $1 = text and the other parameters (like $2, $3 and $4) are the fields, so the output will look like this:
1 2 5
1 2 5
a b e
I've got this code, which prints all the columns given as arguments, but one below the other:
#!/bin/bash
text="$1"
shift
for x in "$@"; do
awk '{print $var}' var="$x" $text
done
Output for example ./script.sh text 1 2 5:
1
1
a
2
2
b
5
5
e
I guess the output looks like that because the for loop is outside of awk. Would placing the loop inside awk be a good solution for this task? I tried a few things but always have trouble with the syntax.
Thank you for your time and help!
file="$1"
shift
awk -v flds="$*" 'BEGIN{n=split(flds,f)} {for (i=1;i<=n;i++) printf "%s%s", $(f[i]), (i<n?OFS:ORS)}' "$file"
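For example, saving this as script.sh and running it on the question's sample file:
./script.sh text 1 2 5
1 2 5
1 2 5
a b e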
You don't need to loop over the params; pass all of them to awk with the -v option:
awk -v v1=$2 -v v2=$3 -v v3=$4 '{print $v1, $v2, $v3;}' $1
You may want to perform additional checks, such as whether the file ($1) contains enough fields, whether the file exists, etc. But the idea is the same.
In your code, you are reading the file multiple times, each pass printing only one field; to get your desired output, every requested field of a line has to be printed at the same time.
Pass the columns to awk and split them into an array and print the column corresponding to each value in the array:
file=$1
shift
p="$#"
awk -v l="$p" '{t=split(l,a," "); for (i=1;i<=t;i++) printf $(a[i]) " ";printf "\n";}' $file

Print a comma except on the last line in Awk

I have the following script
awk '{printf "%s", $1"-"$2", "}' $a >> positions;
where $a stores the name of the file. I am actually writing multiple column values into one row. However, I would like to print a comma only if I am not on the last line.
Single pass approach:
cat "$a" | # look, I can use this in a pipeline!
awk 'NR > 1 { printf(", ") } { printf("%s-%s", $1, $2) }'
Note that I've also simplified the string formatting.
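With the three-line input.txt shown further down in this thread, this prints 3-2, 5-4, 1-8 with no separator after the last pair (and no trailing newline, since nothing is printed after the final record).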
Enjoy this one:
awk '{printf t $1"-"$2} {t=", "}' $a >> positions
Yeah, it looks a bit tricky at first sight, so I'll explain. First of all, let's change printf to print for clarity:
awk '{print t $1"-"$2} {t=", "}' file
and have a look what it does, for example, for file with this simple content:
1 A
2 B
3 C
4 D
so it will produce the following:
1-A
, 2-B
, 3-C
, 4-D
The trick is the leading t variable, which is empty at the beginning. The variable is only set ({t=", "}) after it has already been printed ({print t ...}) for the current line. So as awk keeps iterating, we get the desired sequence.
I would do it by finding the number of lines before running the script, e.g. with coreutils and bash:
awk -v nlines=$(wc -l < $a) '{printf "%s", $1"-"$2} NR != nlines { printf ", " }' $a >>positions
If your file only has 2 columns, the following coreutils alternative also works. Example data:
paste <(seq 5) <(seq 5 -1 1) | tee testfile
Output:
1 5
2 4
3 3
4 2
5 1
Now, replacing tabs with newlines, paste easily assembles the data into the desired format:
<testfile tr '\t' '\n' | paste -sd-,
Output:
1-5,2-4,3-3,4-2,5-1
You might think that awk's ORS and OFS would be a reasonable way to handle this:
$ awk '{print $1,$2}' OFS="-" ORS=", " input.txt
But this results in a final ORS because the input contains a newline on the last line. The newline is a record separator, so from awk's perspective there is an empty last record in the input. You can work around this with a bit of hackery, but the resultant complexity eliminates the elegance of the one-liner.
So here's my take on this. Since you say you're "writing multiple column values", it's possible that mucking with ORS and OFS would cause problems. So we can achieve the desired output entirely with formatting.
$ cat input.txt
3 2
5 4
1 8
$ awk '{printf "%s%d-%d",t,$1,$2; t=", "} END{print ""}' input.txt
3-2, 5-4, 1-8
This is similar to Michael's and rook's single-pass approaches, but it uses a single printf and correctly uses the format string for formatting.
This will likely perform negligibly better than Michael's solution because an assignment should take less CPU than a test, and noticeably better than any of the multi-pass solutions because the file only needs to be read once.
Here's a better way, without resorting to coreutils:
awk 'FNR==NR { c++; next } { ORS = (FNR==c ? "\n" : ", "); print $1, $2 }' OFS="-" file file
awk '{a[NR]=$1"-"$2} END{for(i=1;i<NR;i++) printf "%s, ", a[i]; print a[NR]}' $a > positions
