Another split file in bash - based on difference between rows of column x - bash

Hello stackoverflow users!
I would like to tune up the script I am using to make it more robust to missing data.
My example data looks like this (tab delimited csv file with headers):
ColA ColB ColC
6 0 0
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
4 0 0
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
3 0 0
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100
I use an awk script found elsewhere, as follows:
awk 'BEGIN { fn=0 }
NR==1 { next }
NR==2 { delim=$2 }
$2 == delim {
f=sprintf("file_no%02d.txt",fn++);
print "Creating " f
}
{ print $0 > f }'
This gives me the output I want: it skips the 1st line, takes the value in the 2nd column of the first data row as the delimiter (in this example '0'), and starts a new file each time that value appears again:
file_no00.txt
6 0 0
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
file_no01.txt
4 0 0
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
file_no02.txt
3 0 0
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100
To make the script more robust (imagine that the rows with 0's have been deleted), I need to split the file based on the difference between consecutive rows of ColB: whenever (value_row_n+1) - value_row_n < 0, a new file should start. The file naming should stay the same. My preferred approach is bash with awk. Any advice? Thanks in advance!
Cheers!

Here is an awk command that you can use (shown on the sample file with the rows containing 0 removed):
cat file
ColA ColB ColC
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100
awk 'NR == 1 {
next
}
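NR == 2 || $2 < p { # first data row, or ColB dropped below the previous row (testing !p would misfire when the previous value is 0)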
f = sprintf("file_no%02d.txt",fn++);
print "Creating " f
}
{
p = $2;
print $0 > f
}' file

I suggest small modifications to your current script:
awk 'BEGIN { fn=0; f=sprintf("file_no%02d.txt",fn++); print "Creating " f }
NR==1 { next }
NR==2 { delim=$2 }
$2 - delim < 0 {
f=sprintf("file_no%02d.txt",fn++);
print "Creating " f
}
{ print $0 > f; delim = $2 }' infile
First, create the first file name in the BEGIN block, before any input is processed.
Second, in the last action save the value of the current line so it can be compared with the value of the next line.
Third, instead of comparing with zero, subtract the previous value from the current one and start a new file when the result is less than zero.
It yields:
==> file_no00.txt <==
6 0 0
3 5.16551 12.1099
1 10.2288 19.4769
6 20.0249 30.6543
3 30.0499 40.382
1 59.9363 53.2281
2 74.9415 57.1477
2 89.9462 61.3308
6 119.855 64.0319
==> file_no01.txt <==
4 0 0
8 5.06819 46.8086
6 10.0511 60.1357
9 20.0363 71.679
6 30.0228 82.1852
6 59.8738 98.4446
3 74.871 100.648
1 89.9973 102.111
6 119.866 104.148
==> file_no02.txt <==
3 0 0
1 5.07248 51.9168
2 9.92203 77.3546
2 19.9233 93.0228
6 29.9373 98.7797
6 59.8709 100.518
6 74.7751 100.056
3 89.9363 99.5933
1 119.872 100
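The answers above show their output on data that still contains the rows with 0. As a minimal sketch of the same split-on-decrease idea on data where those rows are missing, the following may help. The scratch directory, the shortened sample rows, and the variable names prev and f are only illustrative assumptions, not part of the original scripts.

```shell
# Work in a scratch directory so the generated pieces do not clutter anything.
dir=$(mktemp -d)
cd "$dir" || exit 1

# Header plus two runs of ColB values; the rows with 0 are deliberately absent.
printf '%s\n' 'ColA ColB ColC' \
    '3 5.16551 12.1099' '1 10.2288 19.4769' '6 119.855 64.0319' \
    '8 5.06819 46.8086' '6 10.0511 60.1357' > data.txt

awk 'NR == 1 { next }                      # skip the header line
     NR == 2 || $2 + 0 < prev {            # first data row, or ColB decreased
         if (f != "") close(f)             # finished with the previous piece
         f = sprintf("file_no%02d.txt", fn++)
     }
     { print > f; prev = $2 + 0 }' data.txt

nfiles=$(ls file_no*.txt | wc -l)
echo "files created: $((nfiles))"
```

Forcing the comparison to be numeric with "+ 0" and closing each finished file are optional safeguards; the second one matters mainly when the input splits into many pieces.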


AWK assign upper value for rank assignment during tie

I'm working on rank assignment to a list of values that is sorted in a file.
A miniature example is
Input:
1
2
2
2
3
4
Instead of normal ranking when there is a tie, I need to assign the upper value. So the required output is
1 1
2 4 #Note that it is not 2, since we have three 2's the upper bound is 4
2 4
2 4
3 5
4 6
I tried something like below, but it is not consistent.
$ awk ' BEGIN{t=0} NR==FNR { a[$1]++; next } { print $1,a[$1]+t; t=a[$1] } ' rank_in.txt rank_in.txt
1 1
2 4
2 6
2 6
3 4
4 2
This answer does normal ranking, so this question is not duplicate.
Instead of doing a double pass or keeping everything in memory, we can just use uniq -c and reconstruct everything:
uniq -c file | awk '{n=n+$1;for(i=1;i<=$1;++i) print $2,n}' -
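As a quick check of the uniq -c approach on the sample input from the question, here is a small sketch. The printf only stands in for the sorted input file, and the variable out is introduced purely for illustration.

```shell
# uniq -c emits "count value" for each run of equal lines. n accumulates the
# counts, so after each run it equals the highest rank within that run.
out=$(printf '%s\n' 1 2 2 2 3 4 | uniq -c |
      awk '{ n += $1; for (i = 1; i <= $1; ++i) print $2, n }')
printf '%s\n' "$out"
```

Like the other answers, this relies on the input already being sorted, since uniq only merges adjacent duplicates.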
Two passes with just awk:
$ awk 'NR==FNR{rank[$1]=NR; next} {print $1, rank[$1]}' file file
1 1
2 4
2 4
2 4
3 5
4 6
or one pass with a pipe:
$ nl file | sort -k2,2 -k1,1nr | awk '$2!=prev{rank=$1; prev=$2} {print $2, rank}'
1 1
2 4
2 4
2 4
3 5
4 6
If you don't have nl on your system you could use cat -n or awk '{print NR, $0}' to generate the line numbers.
Try this awk:
awk 'FNR==NR {++fq[$1]; next} p != $1{s+=fq[$1]} {print p=$1, s}' file file
1 1
2 4
2 4
2 4
3 5
4 6
Assumptions:
input data is already sorted
Sample data:
$ cat rank.dat
1
2
2
2
3
4
One awk idea requiring a single pass through the file:
awk '
function print_rank() {
for ( i=1 ; i<=cnt ; i++ )
print id,rank
}
$1 != id { print_rank() # if we have a new id, print last id
cnt=0 # reset counter
}
{ id=$1 # keep track of current id
rank++ # increment rank by 1 for each new row processed
cnt++ # keep track of number of times we see this id
}
END { print_rank() } # flush last id to stdout
' rank.dat
This generates:
1 1
2 4
2 4
2 4
3 5
4 6
Another awk
$ awk ' NR==FNR { a[$1]++; next } { print $1, FNR + --a[$1] } ' rank_in.txt rank_in.txt
1 1
2 4
2 4
2 4
3 5
4 6
$

how to compare two column from same file?

I have long data file, file.txt
1 3
3 2
2 3
5 5
8 9
so out file should be, out.txt
1 3
1 2
1 5
1 9
3 3
3 2
3 5
Could you please try following.
awk '
FNR==NR{
a[++count]=$2
next
}
{
for(i=1;i<=count;i++){
print $1,a[i]
}
}
' Input_file Input_file

Average of multiple files in shell

I want to calculate the average of 15 files:- ifile1.txt, ifile2.txt, ....., ifile15.txt. Number of columns and rows of each file are same. Part of the data looks as
ifile1.txt      ifile2.txt      ifile3.txt
3 5 2 2 .       1 2 1 3 .       4 3 4 1 .
1 4 2 1 .       1 3 0 2 .       5 3 1 5 .
4 6 5 2 .       2 5 5 1 .       3 4 3 1 .
5 5 7 1 .       0 0 1 1 .       4 3 4 0 .
. . . . .       . . . . .       . . . . .
I would like to produce a new file that shows the average of these 15 files.
ofile.txt
2.66 3.33 2.33 2 . (i.e. average of 3 1 4, average of 5 2 3 and so on)
2.33 3.33 1 2.66 .
3 5 4.33 1.33 .
3 2.33 4 0.66 .
. . . . .
I was trying with following, but getting error
awk'{for (i=1; i<=NF; i++)} rows=FNR;cols=NF} END
{for (i=1; i<=rows; i++){for (j=1; j<=cols; j++)
s+=$i;print $0,s/NF;s=0}}' ifile* > ofile.txt
As written:
awk'{for (i=1; i<=NF; i++)} rows=FNR;cols=NF} END
…
you get 'command not found' as the error because you must leave a space between awk and the script inside the quotes. When you fix that, you start getting into problems because there are two } and only one { on the first line of the script.
When you get around to tackling the problem, you're going to need a 2D array, indexed by line number and column number, summing the values from the files. You'll also need to know the number of files processed, and the number of columns. You can then arrange to iterate over the 2D array in the END block.
awk 'FNR == 1 { nfiles++; ncols = NF }
{ for (i = 1; i <= NF; i++) sum[FNR,i] += $i
if (FNR > maxnr) maxnr = FNR
}
END {
for (line = 1; line <= maxnr; line++)
{
for (col = 1; col <= ncols; col++)
printf " %f", sum[line,col]/nfiles;
printf "\n"
}
}' ifile*.txt
Given the three data files from the question:
ifile1.txt
3 5 2 2
1 4 2 1
4 6 5 2
5 5 7 1
ifile2.txt
1 2 1 3
1 3 0 2
2 5 5 1
0 0 1 1
ifile3.txt
4 3 4 1
5 3 1 5
3 4 3 1
4 3 4 0
The script I showed produces:
2.666667 3.333333 2.333333 2.000000
2.333333 3.333333 1.000000 2.666667
3.000000 5.000000 4.333333 1.333333
3.000000 2.666667 4.000000 0.666667
If you want to control the number of decimal places to 2, then use %.2f in place of %f.
$ { head -n1 ifile1.txt; paste ifile*.txt;} | awk 'NR==1{d=NF; next;} {for (i=1;i<=d;i++) {s=0; for (j=i;j<=NF;j+=d) s+=$j; printf "%.2f%s",s/(NF/d),j==NF+d?"\n":"\t";}}'
2.67 3.33 2.33 2.00
2.33 3.33 1.00 2.67
3.00 5.00 4.33 1.33
3.00 2.67 4.00 0.67
This script computes each row and prints the results before moving on to the next row. Because of this, the script does not need to hold all the data in memory at once. This is important if the data files are large.
How it works
{ head -n1 ifile1.txt; paste ifile*.txt;}
This prints just the first line of ifile1.txt. Then, the paste command causes it to print the first row of all files merged, then the second row merged, and so on:
$ paste ifile*.txt
3 5 2 2 1 2 1 3 4 3 4 1
1 4 2 1 1 3 0 2 5 3 1 5
4 6 5 2 2 5 5 1 3 4 3 1
5 5 7 1 0 0 1 1 4 3 4 0
|
The pipe symbol causes the output of the above commands to be sent as input to awk. Addressing each of the awk commands in turn:
NR==1{d=NF; next;}
For the first row, we save the number of columns in variable d. Then, we skip the rest of the commands and start over on the next line of input.
for (i=1;i<=d;i++) {s=0; for (j=i;j<=NF;j+=d) s+=$j; printf "%.2f%s",s/(NF/d),j==NF+d?"\n":"\t";}
This adds up the numbers from the respective files and prints the average.
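To make the strided inner loop concrete, here is a small sketch that averages a single pasted row. It assumes three input files with four columns each and passes d with -v instead of reading it from a first line, which is a simplification for illustration only. The end-of-line test is also written as i < d rather than j==NF+d; both pick the last column.

```shell
# One pasted row from three 4-column files; d is the per-file column count.
out=$(echo '3 5 2 2 1 2 1 3 4 3 4 1' |
      awk -v d=4 '{
          for (i = 1; i <= d; i++) {
              s = 0
              for (j = i; j <= NF; j += d) s += $j   # fields i, i+d, i+2d, ...
              printf "%.2f%s", s / (NF / d), (i < d ? "\t" : "\n")
          }
      }')
printf '%s\n' "$out"
```

For column 1 the loop visits fields 1, 5 and 9 (the values 3, 1 and 4), and NF/d is the number of files, so the first number printed is 8/3 rounded to 2.67.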
As a multiline script:
{
head -n1 ifile1.txt
paste ifile*.txt
} |
awk '
NR==1 {d=NF; next;}
{
for (i=1;i<=d;i++)
{
s=0; for (j=i;j<=NF;j+=d)
s+=$j;
printf "%.2f%s",s/(NF/d),j==NF+d?"\n":"\t";
}
}'
You need to accumulate the sums of the fields in an array while you're reading the input files. You can't access $0 and $i in the END block, since there's no input line then.
awk '{rows=FNR; cols=NF; for (i = 1; i <= NF; i++) { total[FNR, i] += $i }}
FILENAME != lastfn { count++; lastfn = FILENAME }
END { for (i = 1; i <= rows; i++) {
for (j = 1; j <= cols; j++) {
printf("%s ", total[i, j]/count)
}
printf("\n")
}
}' ifile* > ofile.txt

Combine multiple columns of different lengths into one column in BASH

I need to combine columns of different lengths into one column using BASH. Here is an example input file:
1 2   3 4   5 6   7 8
1 2   3 4   5 6   7 8
1 2   3 4   5 6   7 8
1 2         5 6   7 8
1 2               7 8
And my desired output:
1
1
1
1
1
3
3
3
5
5
5
5
7
7
7
7
7
The input data is pairs of columns as shown. Each pair is separated from another by a fixed number of spaces. Values within a pair of columns are separated by one space. Thanks in advance!
Using GNU awk for fixed width field handling:
$ cat file
1 2   3 4   5 6   7 8
1 2   3 4   5 6   7 8
1 2   3 4   5 6   7 8
1 2         5 6   7 8
1 2               7 8
$ cat tst.awk
BEGIN{ FIELDWIDTHS="1 1 1 3 1 1 1 3 1 1 1 3 1 1 1" }
{
for (i=1;i<=NF;i++) {
a[NR,i] = $i
}
}
END {
for (i=1;i<=NF;i+=4)
for (j=1;j<=NR;j++)
if ( a[j,i] != " " )
print a[j,i]
}
$ gawk -f tst.awk file
1
1
1
1
1
3
3
3
5
5
5
5
7
7
7
7
7
You may try the following (it needs GNU awk, because the three-argument form of match() is a gawk extension):
gawk -f ext.awk input.txt
where input.txt is your input data file and ext.awk is:
BEGIN {
ncols=4 # number of columns
nspc=3 # number of spaces that separates the columns
}
{
str=$0;
for (i=1; i<=ncols; i++) {
pos=match(str,/^([0-9]+) ([0-9]+)/,a)
if (pos>0) {
b[NR,i]=a[1]
if (NR==1) colw[i]=RLENGTH; #assume col width are given as in first row
}
str=substr(str,colw[i]+1+nspc);
}
}
END {
for (i=1;i<=ncols;i++)
for (j=1;j<=NR;j++) {
if (b[j,i]) print b[j,i];
}
}
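Both answers above rely on GNU awk features (FIELDWIDTHS and the three-argument match()). As a rough sketch for other awk implementations, the fixed layout can also be read with substr(). This assumes single-character values and pairs that start every 6 characters (a 3-character pair plus 3 separating spaces); npairs, stride and the shortened sample rows are illustrative names and data, not from the original post.

```shell
out=$(printf '%s\n' '1 2   3 4   5 6   7 8' \
                    '1 2         5 6   7 8' \
                    '1 2               7 8' |
      awk -v npairs=4 -v stride=6 '
          { for (k = 1; k <= npairs; k++)        # first value of pair k
                v[NR, k] = substr($0, 1 + (k - 1) * stride, 1) }
          END {
              for (k = 1; k <= npairs; k++)      # column by column ...
                  for (r = 1; r <= NR; r++)      # ... top to bottom
                      if (v[r, k] != " " && v[r, k] != "") print v[r, k]
          }')
printf '%s\n' "$out"
```

Wider values would need a larger substr() length and stride, so this is a starting point rather than a general solution.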

how to subtract fields pairwise in bash?

I have a large dataset that looks like this:
5 6 5 6 3 5
2 5 3 7 1 6
4 8 1 8 6 9
1 5 2 9 4 5
For every line, I want to subtract the first field from the second, the third from the fourth, and so on, depending on the number of fields (always even). Then I want to report the lines in which the difference for every pair exceeds a certain limit (say 2). I should also be able to report the next best lines, i.e., lines in which one pairwise comparison fails to meet the limit but all other pairs meet it.
from the above example, if I set a limit to 2 then, my output file should contain
best lines:
2 5 3 7 1 6 # because (5-2), (7-3), (6-1) are all > 2
4 8 1 8 6 9 # because (8-4), (8-1), (9-6) are all > 2
next best line(s)
1 5 2 9 4 5 # because except (5-4), both (5-1) and (9-2) are > 2
My current approach is to read every line, save each field as a variable, do subtraction.
But I don't know how to proceed further.
Thanks,
Prints "best" lines to the file "best", and prints "next best" lines to the file "nextbest"
awk '
{
fail_count=0
for (i=1; i<NF; i+=2){
if ( ($(i+1) - $i) <= threshold )
fail_count++
}
if (fail_count == 0)
print $0 > "best"
else if (fail_count == 1)
print $0 > "nextbest"
}
' threshold=2 inputfile
Pretty straightforward stuff.
Loop through fields 2 at a time.
If (next field - current field) does not exceed threshold, increment fail_count
If that line's fail_count is zero, that means it belongs to "best" lines.
Else if that line's fail_count is one, it belongs to "next best" lines.
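The steps above can be tried on the sample data from the question with a small variation that tags each line on standard output instead of writing the files "best" and "nextbest". The labels and the variable names fails and out are illustrative assumptions.

```shell
out=$(printf '%s\n' '5 6 5 6 3 5' '2 5 3 7 1 6' '4 8 1 8 6 9' '1 5 2 9 4 5' |
      awk -v threshold=2 '{
          fails = 0
          for (i = 1; i < NF; i += 2)                  # walk the pairs
              if ($(i+1) - $i <= threshold) fails++    # pair does not exceed the limit
          if (fails == 0)      print "best:", $0
          else if (fails == 1) print "nextbest:", $0
      }')
printf '%s\n' "$out"
```

The first sample line fails on all three pairs, so it is not reported in either category.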
Here's a bash-way to do it:
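#!/bin/bash
threshold=$1
shift
file="$@"
a=($(cat "$file")) # every field of the file, flattened into one array
b=$(( ${#a[@]}/$(cat "$file" | wc -l) )) # number of fields per line
for ((r=0; r<${#a[@]}/b; r++)); do
br=$((b*r))
for ((c=0; c<b; c+=2)); do
# arithmetic (not string) comparison: skip the line as soon as one pair does not exceed the threshold
if (( ${a[br + c+1]} - ${a[br + c]} <= threshold )); then
break; fi
if (( c+2 == b )); then
echo "${a[@]:$br:$b}"; fi
done
done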
Usage:
$ ./script.sh 2 yourFile.txt
2 5 3 7 1 6
4 8 1 8 6 9
This output can then easily be redirected:
$ ./script.sh 2 yourFile.txt > output.txt
NOTE: this does not work properly if you have those empty lines between each line...But I'm sure the above will get you well on your way.
I probably wouldn't do that in bash. Personally, I'd do it in Python, which is generally good for those small quick-and-dirty scripts.
If you have your data in a text file, you can read here about how to get that data into Python as a list of lines. Then you can use a for-loop to process each line:
threshold = 2
results = []
for line in content:  # content: the list of lines read from the file
    numbers = [int(n) for n in line.split()]  # Split it into a list of numbers
    pairs = zip(numbers[::2], numbers[1::2])  # Pair up the numbers two and two.
    result = [y - x for (x, y) in pairs]  # Subtract the first number in each pair from the second.
    if result and all(d > threshold for d in result):  # every pair must exceed the limit
        results.append(numbers)
Yet another bash version:
First, a check function that returns nothing but an exit status:
function getLimit() {
local pairs=0 count=0 limit=$1 wantdiff=$2
shift 2
while [ "$1" ] ;do
# -gt rather than -ge, because a pair must exceed the limit
[ $(( $2-$1 )) -gt $limit ] && : $((count++))
: $((pairs++))
shift 2
done
test $((pairs-count)) -eq $wantdiff
}
Then:
while read line ;do getLimit 2 0 $line && echo $line;done <file
2 5 3 7 1 6
4 8 1 8 6 9
and
while read line ;do getLimit 2 1 $line && echo $line;done <file
1 5 2 9 4 5
If you can use awk
$ cat del1
5 6 5 6 3 5
2 5 3 7 1 6
4 8 1 8 6 9
1 5 2 9 4 5
1 5 2 9 4 5 3 9
$ cat del1 | awk '{
> printf "%s _ ",$0;
> for(i=1; i<=NF; i+=2){
> printf "%d ",($(i+1)-$i)};
> print NF
> }' | awk '{
> upper=0;
> for(i=1; i<=($NF/2); i++){
> if($(NF-i)>threshold) upper++
> };
> printf "%d _ %s\n", upper, $0}' threshold=2 | sort -nr
3 _ 4 8 1 8 6 9 _ 4 7 3 6
3 _ 2 5 3 7 1 6 _ 3 4 5 6
3 _ 1 5 2 9 4 5 3 9 _ 4 7 1 6 8
2 _ 1 5 2 9 4 5 _ 4 7 1 6
0 _ 5 6 5 6 3 5 _ 1 1 2 6
You can process result further according to your needs. The result is sorted by ‘goodness’ order.
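As one possible way to process that result further, the sketch below classifies each line of the sorted output. It assumes the exact "count _ line _ differences fieldcount" layout shown above; the labels and the variable names pairs and out are illustrative.

```shell
out=$(printf '%s\n' '3 _ 4 8 1 8 6 9 _ 4 7 3 6' \
                    '3 _ 1 5 2 9 4 5 3 9 _ 4 7 1 6 8' \
                    '2 _ 1 5 2 9 4 5 _ 4 7 1 6' \
                    '0 _ 5 6 5 6 3 5 _ 1 1 2 6' |
      awk -F' _ ' '{
          n = split($3, d, " ")     # d[n] is the field count of the original line
          pairs = d[n] / 2
          if ($1 == pairs)          print "best:", $2
          else if ($1 == pairs - 1) print "nextbest:", $2
      }')
printf '%s\n' "$out"
```

Comparing the count against the number of pairs, rather than against a fixed 3, is what keeps the 8-field line from being treated as a best line.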
