Loop to create a DF from values in bash - bash

I'm creating various text files from a file like this:
Chrom_x,Pos,Ref,Alt,RawScore,PHRED,ID,Chrom_y
10,113934,A,C,0.18943,5.682,rs10904494,10
10,126070,C,T,0.030435000000000007,3.102,rs11591988,10
10,135656,T,G,0.128584,4.732,rs10904561,10
10,135853,A,G,0.264891,6.755,rs7906287,10
10,148325,A,G,0.175257,5.4670000000000005,rs9419557,10
10,151997,T,C,-0.21169,0.664,rs9286070,10
10,158202,C,T,-0.30357,0.35700000000000004,rs9419478,10
10,158946,C,T,2.03221,19.99,rs11253562,10
10,159076,G,A,1.403107,15.73,rs4881551,10
What I am trying to do is extract, in bash, all values between two values:
gawk '$6>=0 && $NF<=5 {print $0}' file.csv > 0_5.txt
And create files from 6 to 10, from 11 to 15... up to 95 to 100. I was thinking of creating a loop for this with something like
#!/usr/bin/env bash
n=( 0,5,6,10...)
if i in n:
gawk '$6>=n && $NF<=n+1 {print $0}' file.csv > n_n+1.txt
and so on.
How can I convert this into a loop and create files with these specific values?

While you could use a shell loop to provide inputs to an awk script, you could also just use awk to natively split the values into buckets and write the lines to those "bucket" files itself:
awk -F, 'NR > 1 {
    i = int(($6 - 1) / 5)
    fname = (i*5) "_" (i+1)*5 ".txt"
    print $0 > fname
}' < input
The code skips the header line (NR > 1) and then computes a "bucket index" by subtracting one from the value in column six and doing integer division by five. The filename is then constructed by multiplying that index (and its increment) by five. The whole line is then printed to that file.
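For example, taking a few of the PHRED values from the sample data above, the bucket arithmetic works out as:

5.682  -> int((5.682 - 1) / 5) = 0 -> 0_5.txt
6.755  -> int((6.755 - 1) / 5) = 1 -> 5_10.txt
19.99  -> int((19.99 - 1) / 5) = 3 -> 15_20.txt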
To use a shell loop (and call awk 20 times on the input), you could use something like this:
for ((i=0; i<=19; i++))
do
    floor=$((i * 5))
    ceiling=$(( (i+1) * 5 ))
    awk -F, -v floor="$floor" -v ceiling="$ceiling" \
        'NR > 1 && $6 >= floor && $6 < ceiling { print }' < input \
        > "${floor}_${ceiling}.txt"
done
The basic idea is the same; here, we're creating the bucket index with the outer loop and then passing the range into awk as the floor and ceiling variables. We're only asking awk to print the matching lines; the output from awk is captured by the shell as a redirection into the appropriate file. Note that the test uses $6 < ceiling (half-open intervals), so each value lands in exactly one bucket, unlike the overlapping <= boundaries in the original attempt.

Related

bash cycle - output according to string from file

How can I name the output file after the string in the 4th column of the output (i.e., according to the 4th column of the i-th row of the input)?
I tried:
for i in {1..321}; do
awk '(FNR==i) {outfile = $4 print $0 >> outfile}' RV1_phase;
done
or
for i in {1..321}; do
awk '(FNR==i) {outfile = $4; print $0}' RV1_phase > "$outfile";
done
input file:
1 2 2 a
4 5 6 f
4 4 5 f
....
....
desired output for i=1
name: a
1 2 2 a
The aim: I have data that I plotted in gnuplot, and I would like to plot a set of figures named after the string so I know which point comes from which file. The points will be coloured. I need to get files for plotting in gnuplot, so I would like to create them using the loop from my question.
Simply
for i in {1..321}; do
    awk -v i="$i" '(FNR==i) {print $0 >> $4}' RV1_phase;
done
The problem with your first attempt was that you didn't use a ; to separate the assignment to outfile from the print command; you also need to pass the shell variable i into awk (e.g. with -v), since inside single quotes i is just an uninitialized awk variable. The separate variable isn't necessary, though.
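For reference, a minimal fix of that first attempt (keeping the outfile variable) would look something like:

for i in {1..321}; do
    awk -v i="$i" '(FNR==i) {outfile = $4; print $0 >> outfile}' RV1_phase;
done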
You don't need a bash loop, either:
awk '1 <= FNR && FNR <= 321 {print $0 >> $4}' RV1_phase;
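With just the three sample rows shown above, either version should leave you with two files named after the fourth column:

$ cat a
1 2 2 a
$ cat f
4 5 6 f
4 4 5 f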

awk script for decimal values

I am using this script to extract lines where column 7 is < 1.0E-08 AND column eight has one or more values > 0.2 or 0.3.
Is this the right approach?
InputFile: head -2 test.txt
A2 DR28 P3379 72 7 5.008 8.252e-14 0.05132,0.04248,0.002704,0.116,0.04439,0.2,0.3
A2 DR28 P3379 72 7 5.008 0.05 0.05132,0.04248,0.002704,0.116,0.04439,0.006,0.004
Script: first I did
awk '{if($7 < 1.0E-08 || $8 > 0.2) print}' test.txt
This gives the first line as output, but I want to use && (AND) instead of || (OR).
when I use AND (&&)
awk '{if($7 < 1.0E-08 && $8 > 0.2) print}' test.txt
there is no result, even though line one fits these criteria.
I also tried this, but here I am only considering column eight as a cut-off point
awk -F',' '$8 > 0.2' test.txt
this works fine, but I need to consider column 7 too; I get only a few lines in the output, so I just want to make sure that I am not missing anything
not tested, but something like this should work
$ awk 'function anyGreater(x,v) {
           n=split(x,f8,",");
           for(i=1;i<=n;i++) if(f8[i]>v) return 1;
           return 0
       }
       $7<1.0E-08 && anyGreater($8,0.2)' file
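As a quick check against the two sample records above (with each record on a single line): the first has $7 = 8.252e-14, which is below 1.0E-08, and its comma-separated column eight contains 0.3 > 0.2, so it should print; the second has $7 = 0.05, which fails the first test, so it should not.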

Bash script to print X lines of a file in sequence

I'd be very grateful for your help with something probably quite simple.
I have a table (table2.txt), which has a single column of randomly generated numbers, and is about a million lines long.
2655087
3721239
5728533
9082076
2016819
8983893
9446748
6607974
I want to create a loop that repeats 10,000 times, so that for iteration 1, I print lines 1 to 4 to a file (file0.txt), for iteration 2, I print lines 5 to 8 (file1.txt), and so on.
What I have so far is this:
#!/bin/bash
for i in {0..10000}
do
awk 'NR==((4 * "$i") +1)' table2.txt > file"$i".txt
awk 'NR==((4 * "$i") +2)' table2.txt >> file"$i".txt
awk 'NR==((4 * "$i") +3)' table2.txt >> file"$i".txt
awk 'NR==((4 * "$i") +4)' table2.txt >> file"$i".txt
done
Desired output for file0.txt:
2655087
3721239
5728533
9082076
Desired output for file1.txt:
2016819
8983893
9446748
6607974
Something is going wrong with this, because I am getting identical outputs from all my files (i.e. they all look like the desired output of file0.txt). Hopefully you can see from my script that during the second iteration, i.e. when i=1, I want the output to be the values of rows 5, 6, 7 and 8.
This is probably a very simple syntax error, and I would be grateful if you can tell me where I'm going wrong (or give me a less cumbersome solution!)
Thank you very much.
The beauty of awk is that you can do this in one awk call:
awk '{ print > ("file" (c+0) ".txt") }
     (NR % 4 == 0) { ++c }
     (c == 10001)  { exit }' <file>
This can be slightly more optimized and more file-handling friendly (cf. James Brown):
awk 'BEGIN{ f="file0.txt" }
     { print > f }
     (NR % 4 == 0) { close(f); f = "file" (++c) ".txt" }
     (c == 10001)  { exit }' <file>
Why did your script fail?
The reason your script is failing is that you used single quotes, so the shell variable is never expanded inside the awk program. Your lines should read:
awk 'NR==((4 * '$i') +1)' table2.txt > file"$i".txt
but this is very ugly and should be improved with
awk -v i=$i 'NR==(4*i+1)' table2.txt > file"$i".txt
Why is your script slow?
The way you are processing your file is by doing a loop of 10001 iterations. Per iteration, you perform 4 awk calls. Each awk call reads the full file and writes out a single line. So in the end you read your file 40004 times.
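To put numbers on that (assuming the file really is about a million lines long, as stated): 10001 iterations × 4 awk calls = 40004 full passes, i.e. roughly 40004 × 1,000,000 ≈ 4 × 10^10 line reads, compared with a single pass of about 10^6 line reads for the one-awk solution at the top of this answer.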
To optimise your script step by step, I would do the following:
Tell awk to stop reading the file once the requested line has been printed:
#!/bin/bash
for i in {0..10000}; do
awk -v i=$i 'NR==(4*i+1){print; exit}' table2.txt > file"$i".txt
awk -v i=$i 'NR==(4*i+2){print; exit}' table2.txt >> file"$i".txt
awk -v i=$i 'NR==(4*i+3){print; exit}' table2.txt >> file"$i".txt
awk -v i=$i 'NR==(4*i+4){print; exit}' table2.txt >> file"$i".txt
done
Merge the 4 awk calls into a single one. This prevents reading the first lines over and over per loop cycle.
#!/bin/bash
for i in {0..10000}; do
awk -v i=$i '(NR <= 4*i)     { next } # skip line
             (NR >  4*(i+1)) { exit } # stop reading
             1' table2.txt > file"$i".txt # print line
done
Finally, remove the shell loop entirely (see the one-awk solution at the top of this answer).
This is functionally the same as #JamesBrown's answer, just written in a more awk-ish way, so don't accept this; I only posted it to show the more idiomatic awk syntax, since you can't put formatted code in a comment.
awk '
(NR%4)==1 { close(out); out="file" c++ ".txt" }
c > 10000 { exit }
{ print > out }
' file
See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for some of the reasons why you should avoid shell loops for manipulating text.
With just head and split you can do it very simply:
chunk=4
files=10000
head -n $(($chunk*$files)) table2.txt |
split -d -a 5 --additional-suffix=.txt -l $chunk - file
Basically, read the first $((chunk*files)) lines (40,000 here) and split them into chunks of 4 consecutive lines, using file as the prefix and .txt as the suffix for the new files.
If you want a numeric identifier, you will need 5 digits (-a 5), as pointed out in the comments (credit: #kvantour).
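If I'm reading the split options right (-d for numeric suffixes, -a 5 for five digits, --additional-suffix=.txt for the extension), the generated names should look like:

file00000.txt
file00001.txt
...
file09999.txt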
Another awk:
$ awk '{if(NR%4==1){if(i==10000)exit;close(f);f="file" i++ ".txt"}print > f}' file
$ ls
file file0.txt file1.txt
Explained:
awk ' {
if(NR%4==1) { # use mod to recognize first record of group
if(i==10000) # exit after 10000 files
exit # test with 1
close(f) # close previous file
f="file" i++ ".txt" # make a new filename
}
print > f # output record to file
}' file

Sorting awk results into multiple files based on value ranges

I have the following data, potentially spread across multiple input files:
colB1:colB2:colB3:$130.00:...
colC1:colC2:colC3:$83.00:...
colD1:colD2:colD3:$80.00:...
My goal is to write the first two columns from each line of input into an output file whose name contains the number beginning a $10 range.
Thus, for the above input, I would have two output files, one with two lines, the other with one:
balance-80.out
===
colC1:colC2
colD1:colD2
balance-130.out
===
colB1:colB2
My current attempt looks like the following:
#!/bin/bash
for ((value_1=90; value_1<200; value_1+=10)); do
awk -F: \
-v var=$value_1 \
-v var1=$((value_1+10)) \
'{ sub(/\$/, "", $4) }
($4+0) >= var && ($4+0) < var1 { print $1":"$2 }' \
*.txt > balance_$value_1.txt
done
This works, but it's obviously very kludgy. How could this be done in a single pass?
A few points:
You can create output files from awk -- and high-quality awk implementations will cache file descriptors automatically, thus keeping output files open between individual lines (which similar shell scripts won't do)
Modulo arithmetic is the appropriate tool for selecting a filename as an appropriate bucket for any given value.
Thus:
awk -F: -v min=80 -v max=200 '
    { sub(/\$/, "", $4); val = (0 + $4) }
    val >= min && val < max {
        print $1 ":" $2 > ("balance-" (val - (val % 10)) ".out")
    }
' *.txt
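As a quick check with the sample values: $83.00 gives 83 - (83 % 10) = 80, so balance-80.out; $80.00 likewise maps to balance-80.out; and $130.00 gives 130 - (130 % 10) = 130, so balance-130.out, matching the desired output above.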

How to efficiently sum two columns in a file with 270,000+ rows in bash

I have two columns in a file, and I want to automate summing both values per row
for example
read write
5 6
read write
10 2
read write
23 44
I want to then sum the "read" and "write" of each row. Eventually, after summing, I'm finding the max sum and putting that max value in a file. I feel like I have to use grep -v to get rid of the repeated column headers, which, as stated in the answers, makes the code inefficient since I'm grepping the entire file just to read a single line.
I currently have this in a bash script (within a for loop where $x is the file name) to sum the columns line by line
lines=`grep -v READ $x|wc -l | awk '{print $1}'`
line_num=1
arr_num=0
while [ $line_num -le $lines ]
do
arr[$arr_num]=`grep -v READ $x | sed $line_num'q;d' | awk '{print $2 + $3}'`
echo $line_num
line_num=$[$line_num+1]
arr_num=$[$arr_num+1]
done
However, the file to be summed has 270,000+ rows. The script has been running for a few hours now, and it is nowhere near finished. Is there a more efficient way to write this so that it does not take so long?
Use awk instead and take advantage of the modulus operator:
awk '!(NR%2){print $1+$2}' infile
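With the sample rows above, that should print one sum per data line:

$ awk '!(NR%2){print $1+$2}' infile
11
12
67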
awk is probably faster, but the idiomatic bash way to do this is something like:
while read -a line; do # read each line one-by-one, into an array
# use arithmetic expansion to add col 1 and 2
echo "$(( ${line[0]} + ${line[1]} ))"
done < <(grep -v READ input.txt)
Note that the input file is only read once (by grep) and the number of externally forked programs is kept to a minimum (just grep, called only once for the whole input file). The rest of the commands are bash builtins.
The <( ) process substitution is used in case variables set in the while loop are required outside the loop's scope; otherwise a | pipe could be used.
Your question is pretty verbose, yet your goal is not clear. The way I read it, your numbers are on every second line, and you want only to find the maximum sum. Given that:
awk '
NR%2 == 1 {next}
NR == 2 {max = $1+$2; next}
$1+$2 > max {max = $1+$2}
END {print max}
' filename
You could also use a pipeline with tools that implicitly loop over the input like so:
grep -v read INFILE | tr -s ' ' + | bc | sort -rn | head -1 > OUTFILE
This assumes there are spaces between your read and write data values.
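A quick trace of the middle of that pipeline with the sample rows (headers already removed by grep -v):

$ printf '5 6\n10 2\n23 44\n' | tr -s ' ' + | bc
11
12
67

sort -rn | head -1 then picks out the largest sum (67 here).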
Why not run:
awk 'NR==1 { print "sum"; next } !/read/ { print $1 + $2 }'
You can afford to run it on the file while the other script is still running. It'll be complete in a few seconds at most (prediction). When you're confident it's right, you can kill the other process.
You can use Perl or Python instead of awk if you prefer.
Your code is running grep, sed and awk on each line of the input file; that's damnably expensive. And it isn't even writing the data to a file; it is creating an array in Bash's memory that'll need to be printed to the output file later.
Assuming that it's always one 'header' row followed by one 'data' row:
awk '
    BEGIN{ max = 0 }
    {
        if( NR%2 == 0 ){
            sum = $1 + $2;
            if( sum > max ) { max = sum }
        }
    }
    END{ print max }' input.txt
Or simply trim out all lines that do not conform to what you want:
grep '^[0-9]\+\s\+[0-9]\+$' input.txt | awk '
    BEGIN{ max = 0 }
    {
        sum = $1 + $2;
        if( sum > max ) { max = sum }
    }
    END{ print max }'
