Awk - find the minimum in a given row - bash

I have a file organized in rows and columns. I want to find the minimum
in a given row, for example row number 4, and then store the corresponding column number in a bash variable (lev).
However, the small script I wrote is not working:
lev=`echo - |awk '{
m=100; l=1;
{If (NR==4)
for(i=2;i<=NF;i++)
{
if( $i <m)
m=$i;
l=i
}
}
print l
}' file.txt`

There are multiple things wrong with your script. Perhaps you can figure them out using this sample:
$ lev=$(awk 'NR==4{min=99999;
for(i=1;i<=NF;i++)
if($i < min) {min=$i; ix=i}
print ix}' file.txt)
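If a row value can exceed the 99999 sentinel, a variant that seeds the minimum from the first compared field avoids the arbitrary bound. A minimal sketch, scanning from column 2 as in your original attempt and exiting once row 4 has been handled:
lev=$(awk 'NR==4{min=$2; ix=2            # seed min from the first compared field
                 for(i=3;i<=NF;i++)
                   if($i < min){min=$i; ix=i}
                 print ix; exit          # row 4 done, no need to read the rest
                }' file.txt)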

Related

SUM up all values of each row and write the results in a new column using Bash

I have a big file (many columns) that generally looks like:
Gene,A,B,C
Gnai3,2,3,4
P53,5,6,7
H19,4,4,4
I want to sum every row of the data frame and add it as a new column as below:
Gene,A,B,C,total
Gnai3,2,3,4,9
P53,5,6,7,18
H19,4,4,4,12
I tried awk -F, '{sum=0; for(i=1; i<=NF; i++) sum += $i; print sum}' but then I am not able to make a new column for the total counts.
Any help would be appreciated.
Could you please try the following.
awk '
BEGIN{
  FS=OFS=","
}
FNR==1{
  print $0,"total"
  next
}
{
  sum=0
  for(j=2;j<=NF;j++){
    sum+=$j
  }
  $(NF+1)=sum
}
1
' Input_file
2nd solution: as per the OP's comment, printing only the first column and the sum.
awk '
BEGIN{
  FS=OFS=","
}
FNR==1{
  print $0,"total"
  next
}
{
  for(j=2;j<=NF;j++){
    sum+=$j
  }
  print $1,sum
  sum=""
}
' Input_file
Can use perl here:
perl -MList::Util=sum0 -F, -lane '
print $_, ",", ($. == 1 ? "total" : sum0( @F[1..$#F] ));
' file
To add a new column, just increment the number of columns and assign the new column a value:
NF++; $NF=sum
do:
awk -v OFS=, -F, 'NR>1{sum=0; for(i=1; i<=NF; i++) sum += $i; NF++; $NF=sum } 1'
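Note that this one-liner leaves the header row without a total label and relies on the text in the first column evaluating to 0 in the sum. A sketch of the same NF++ idiom that labels the header and sums only from column 2 (works at least with GNU awk; increasing NF is handled slightly differently by some older awks):
awk -v OFS=, -F, 'NR==1{NF++; $NF="total"} NR>1{sum=0; for(i=2; i<=NF; i++) sum += $i; NF++; $NF=sum} 1'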
Using only bash:
#!/bin/bash
while read -r row; do
sum=
if [[ $row =~ (,[0-9]+)+ ]]; then
numlist=${BASH_REMATCH[0]}
sum=,$((${numlist//,/+}))
fi
echo "$row$sum"
done < datafile
There are a few assumptions here about the rows in the data file: the numeric fields to be summed are non-negative integers, and the first field is assumed not to be numeric (even if it is, it will not participate in the sum). Also, the numeric fields are consecutive, that is, there is no non-numeric field between two numeric fields. And the sum won't overflow. For example, for the row Gnai3,2,3,4 the regex captures ,2,3,4, the substitution ${numlist//,/+} turns it into +2+3+4, and $((+2+3+4)) evaluates to 9, so ,9 is appended.

How to select the minimum value, including exponential values, for each ID based on the fourth column?

Can you please tell me how to select rows with the minimum value (including exponential values) based on the fourth column, grouped by the first column, in Linux?
Original file
ID,y,z,p-value
1,a,b,0.22
1,a,b,5e-10
1,a,b,1.2e-10
2,c,d,0.06
2,c,d,0.003
2,c,d,3e-7
3,e,f,0.002
3,e,f,2e-8
3,e,f,1.0
The file I want is as below.
ID,y,z,p-value
1,a,b,1.2e-10
2,c,d,3e-7
3,e,f,2e-8
Actually this worked fine, so thanks, everybody!
tail -n +2 original_file > txt
sort -t, -k 4g txt | awk -F, '!visited[$1]++' | sort -k2,2 -k3,3 >> final_file
You can do it fairly easily in awk just by keeping the current record with the minimum 4th field for a given 1st field. You have to handle outputting the header-row and storing the first record to begin the comparison, which you can do by operating on the first record NR==1 (or first in each file processed, FNR==1).
You can store the first minimum in an array indexed by the first field, and save the initial record, when operating on the 2nd record. Then it is just a matter of checking whether the first field is the same as the last one; if it is not, output the minimum record for the last one and keep going until you run out of records. (Note: this presumes the first fields appear in increasing order, as they do in your file.) Then you use the END rule to output the final record.
You can put that together as follows:
awk -F, '
FNR==1 {print; next}
FNR==2 {rec=$0; m[$1]=$4; next}
{
if ($1 in m) {
if ($4 < m[$1]) {
rec=$0
m[$1]=$4
}
}
else {
print rec
rec=$0
m[$1]=$4
}
}
END {
print rec
}' file
(where your data is in the file file)
If your first field is not in increasing order, then you will need to save the current minimum record in an array as well (e.g. turn rec into an array indexed by the first field, holding the whole record as its value). You would then delay the output until the END rule, looping over both arrays to print the minimum record for each first field, as in the sketch below.
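A minimal sketch of that variant (illustrative names m, rec and order; the output keeps each ID in the order it first appears, after the header):
awk -F, '
FNR==1     {print; next}                                   # header
!($1 in m) {order[++n]=$1; m[$1]=$4; rec[$1]=$0; next}     # first record for this ID
$4 < m[$1] {m[$1]=$4; rec[$1]=$0}                          # new minimum for this ID
END        {for (i=1; i<=n; i++) print rec[order[i]]}
' file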
Example Use/Output
You can update the filename to match the filename containing your data, and then to test, all you need to do is select-copy the awk expression and middle-mouse paste it into an xterm in the directory containing your file, e.g.
$ awk -F, '
> FNR==1 {print; next}
> FNR==2 {rec=$0; m[$1]=$4; next}
> {
> if ($1 in m) {
> if ($4 < m[$1]) {
> rec=$0
> m[$1]=$4
> }
> }
> else {
> print rec
> rec=$0
> m[$1]=$4
> }
> }
> END {
> print rec
> }' file
ID,y,z,p-value
1,a,b,1.2e-10
2,c,d,3e-7
3,e,f,2e-8
Look things over and let me know if you have questions.
A non-awk approach, using GNU datamash:
$ datamash -H -f -t, -g1 min 4 < input.txt | cut -d, -f1-4
ID,y,z,p-value
1,a,b,1.2e-10
2,c,d,3e-7
3,e,f,2e-8
(The cut is needed because, with the -f option, datamash adds a fifth column that is a duplicate of the 4th; without -f it would show only the first and fourth column values. Minor annoyance.)
This does require that your data is sorted on the first column like in your sample.
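If it is not, datamash can sort the input for you; assuming your version supports the -s (--sort) option, a sketch:
$ datamash -s -H -f -t, -g1 min 4 < input.txt | cut -d, -f1-4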

Detecting semi-duplicate records in Bash/AWK

Right now I have a script that rifles through tabulated data for cross-referencing record by record (using AWK). But I've run into a problem. AWK is great for line-by-line comparisons to run through formatted data, but I also want to detect semi-duplicate records. Unfortunately, uniq will not work by itself as the record is not 100% carbon-copy.
This is an orderly list, sorted by the second and third columns. What I want to detect is the same values in columns 3, 6 and 7.
Here's an example:
JJ 0072 0128 V7589 N 22.35 22.35 0.00 Auth
JJ 0073 0128 V7589 N 22.35 22.35 0.00 Auth
The second number is different while the other information is exactly the same, so uniq will not find it solo.
Is there something in AWK that lets me reference the previous line? I already have this code block from AWK going line-by-line. (EDIT: the awk statement below was an older version that was terrible.)
awk '{printf "%s", $0; if($6 != $7 && $9 != "Void" && $5 == "N") {printf "****\n"} else {printf "\n"}}' /tmp/verbout.txt
Is there something in AWK that lets me reference the previous line?
No, but there's nothing stopping you from explicitly saving certain info from the last line and using that later:
{
    if (last3 != $3 || last6 != $6 || last7 != $7) {
        print
    } else {
        # handle duplicate here
    }
    last3 = $3
    last6 = $6
    last7 = $7
}
The lastN variables all (effectively) default to an empty string at the start, then we just compare each line with those and print that line if any are different.
Then we store the fields from that line to use for the next.
That is, of course, assuming duplicates should only be detected if they're consecutive. If you want to remove duplicates when order doesn't matter, you can sort on those fields first, as in the sketch below.
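A sketch of that combination, sorting on columns 3, 6 and 7 and then keeping only the first record of each run of identical keys (file name taken from your question):
sort -k3,3 -k6,6 -k7,7 /tmp/verbout.txt |
awk 'last3 != $3 || last6 != $6 || last7 != $7 { print }   # key changed: keep this record
     { last3 = $3; last6 = $6; last7 = $7 }'               # remember the key for the next line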
If order needs to be maintained, you can use an associative array to store the fact that the key has been seen before, something like:
{
    seenkey = $3" "$6" "$7
    if (seen[seenkey] == 0) {
        print
        seen[seenkey] = 1
    } else {
        # handle duplicate here
    }
}
One way of doing this with awk is
$ awk '{print $0, (a[$3,$6,$7]++?"duplicate":"")}' file
This will mark the duplicate records; note that you don't need to sort the file.
If you want to print just the unique records, the idiomatic way is
$ awk '!a[$3,$6,$7]++' file
Again, sorting is not required.
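For reference, the same idiom written out long-hand (illustrative names; awk joins the comma-separated subscripts with SUBSEP internally):
$ awk '{
    key = $3 SUBSEP $6 SUBSEP $7   # same composite key as a[$3,$6,$7]
    if (a[key] == 0)               # first time this key is seen
      print                        # print the record
    a[key]++                       # count it so later duplicates are skipped
  }' file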

Hi, trying to obtain the mean from the array values using awk?

I'm new to bash programming. Here I'm trying to obtain the mean from the array values.
Here's what I'm trying:
${GfieldList[@]} | awk '{ sum += $1; n++ } END { if (n > 0) print "mean: " sum / n; }';
Using $1 I'm not able to get all the values. Could you please help me out with this?
For each non-empty line of input, this will sum everything on the line and print the mean:
$ echo 21 20 22 | awk 'NF {sum=0;for (i=1;i<=NF;i++)sum+=$i; print "mean=" sum / NF; }'
mean=21
How it works
NF
This serves as a condition: the statements which follow will only be executed if the number of fields on this line, NF, evaluates to true, meaning non-zero.
sum=0
This initializes sum to zero. This is only needed if there is more than one line.
for (i=1;i<=NF;i++)sum+=$i
This sums all the fields on this line.
print "mean=" sum / NF
This prints the sum of the fields divided by the number of fields.
The bare
${GfieldList[@]}
will not print the array to the screen. You want this:
printf "%s\n" "${GfieldList[@]}"
All those quotes are definitely needed.
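Putting the two pieces together, a sketch of the whole pipeline (assuming GfieldList holds one numeric value per element):
printf "%s\n" "${GfieldList[@]}" |
  awk '{ sum += $1; n++ } END { if (n > 0) print "mean: " sum / n }'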

Comparison shell script for large text/csv files - improvement needed

My task is the following - I have two CSV files:
File 1 (~9.000.000 records):
type(text),number,status(text),serial(number),data1,data2,data3
File 2 (~6000 records):
serial_range_start(number),serial_range_end(number),info1(text),info2(text)
The goal is to add to each entry in File 1 the corresponding info1 and info2 from File 2:
type(text),number,status(text),serial(number),data1,data2,data3,info1(text),info2(text)
I use the following script:
#!/bin/bash
USER="file1.csv"
RANGE="file2.csv"
for SN in `cat $USER | awk -F , '{print $4}'`
do
#echo \n "$SN"
for LINE in `cat $RANGE`
do
i=`grep $LINE $RANGE| awk -F, '{print $1}'`
#echo \n "i= " "$i"
j=`grep $LINE $RANGE| awk -F, '{print $2}'`
#echo \n "j= " "$j"
k=`echo $SN`
#echo \n "k= " "$k"
if [ $k -ge $i -a $k -le $j ]
then
echo `grep $SN $USER`,`grep $i $RANGE| cut -d',' -f3-4` >> result.csv
break
#else
#echo `grep $SN $USER`,`echo 'N/A','N/A'` >> result.csv
fi
done
done
The script works rather well on small files, but I'm sure there is a way to optimize it because I am running it on an i5 laptop with 4 GB of RAM.
I am a newbie in shell scripting and I came up with this script after hours and hours of research, trial and error but now I am out of ideas.
Note: not all the info in file 1 can be found in file 2.
Thank you!
Adrian.
FILE EXAMPLES and additional info:
File 1 example:
prep,28620026059,Active,123452010988759,No,No,No
post,28619823474,Active,123453458466109,Yes,No,No
post,28619823474,Inactive,123453395270941,Yes,Yes,Yes
File 2 example:
123452010988750,123452010988759,promo32,1.11
123453458466100,123453458466199,promo64,2.22
123450000000000,123450000000010,standard128,3.333
Result example (currently):
prep,28620026059,Active,123452010988759,No,No,No,promo32,1.11
post,28619823474,Active,123453458466109,Yes,No,No,promo64,2.22
Result example (nice to have):
prep,28620026059,Active,123452010988759,No,No,No,promo32,1.11
post,28619823474,Active,123453458466109,Yes,No,No,promo64,2.22
post,28619823474,Inactive,123453395270941,Yes,Yes,Yes,NA,NA
File 1 is sorted on the 4th column.
File 2 is sorted on the first column.
File 2 does not have ranges that overlap.
Not all the info in file 1 can be found in a range in file 2.
Thanks again!
Later edit:
The script provided by Jonathan seems to have an issue on some records, as follows:
file 2:
123456780737000,123456780737012,ONE 32,1.11
123456780016000,123456780025999,ONE 64,2.22
file 1:
Postpaid,24987326427,Active,123456780737009,Yes,Yes,Yes
Postpaid,54234564719,Active,123456780017674,Yes,Yes,Yes
The output is the following:
Postpaid,24987326427,Active,123456780737009,Yes,Yes,Yes,ONE 32,1.11
Postpaid,54234564719,Active,123456780017674,Yes,Yes,Yes,ONE 32,1.11
and it should be:
Postpaid,24987326427,Active,123456780737009,Yes,Yes,Yes,ONE 32,1.11
Postpaid,54234564719,Active,123456780017674,Yes,Yes,Yes,ONE 64,2.22
It seems that the search returns 0 and writes the info from the first record of file2...
I think this will work reasonably well:
awk -F, 'BEGIN { n = 0; OFS = ","; }
NR==FNR { lo[n] = $1; hi[n] = $2; i1[n] = $3; i2[n] = $4; n++ }
NR!=FNR { for (i = 0; i < n; i++)
{
if ($4 >= lo[i] && $4 <= hi[i])
{
print $1, $2, $3, $4, $5, $6, $7, i1[i], i2[i];
break;
}
}
}' file2 file1
Given file2 containing:
1,10,xyz,pqr
11,20,abc,def
21,30,ambidextrous,warthog
and file1 containing:
A,123,X2,1,data01_1,data01_2,data01_3
A,123,X2,2,data02_1,data02_2,data02_3
A,123,X2,3,data03_1,data03_2,data03_3
A,123,X2,4,data04_1,data04_2,data04_3
A,123,X2,5,data05_1,data05_2,data05_3
A,123,X2,6,data06_1,data06_2,data06_3
A,123,X2,7,data07_1,data07_2,data07_3
A,123,X2,8,data08_1,data08_2,data08_3
A,123,X2,9,data09_1,data09_2,data09_3
A,123,X2,10,data10_1,data10_2,data10_3
A,123,X2,11,data11_1,data11_2,data11_3
A,123,X2,12,data12_1,data12_2,data12_3
A,123,X2,13,data13_1,data13_2,data13_3
A,123,X2,14,data14_1,data14_2,data14_3
A,123,X2,15,data15_1,data15_2,data15_3
A,123,X2,16,data16_1,data16_2,data16_3
A,123,X2,17,data17_1,data17_2,data17_3
A,123,X2,18,data18_1,data18_2,data18_3
A,123,X2,19,data19_1,data19_2,data19_3
A,223,X2,20,data20_1,data20_2,data20_3
A,223,X2,21,data21_1,data21_2,data21_3
A,223,X2,22,data22_1,data22_2,data22_3
A,223,X2,23,data23_1,data23_2,data23_3
A,223,X2,24,data24_1,data24_2,data24_3
A,223,X2,25,data25_1,data25_2,data25_3
A,223,X2,26,data26_1,data26_2,data26_3
A,223,X2,27,data27_1,data27_2,data27_3
A,223,X2,28,data28_1,data28_2,data28_3
A,223,X2,29,data29_1,data29_2,data29_3
the output of the command is:
A,123,X2,1,data01_1,data01_2,data01_3,xyz,pqr
A,123,X2,2,data02_1,data02_2,data02_3,xyz,pqr
A,123,X2,3,data03_1,data03_2,data03_3,xyz,pqr
A,123,X2,4,data04_1,data04_2,data04_3,xyz,pqr
A,123,X2,5,data05_1,data05_2,data05_3,xyz,pqr
A,123,X2,6,data06_1,data06_2,data06_3,xyz,pqr
A,123,X2,7,data07_1,data07_2,data07_3,xyz,pqr
A,123,X2,8,data08_1,data08_2,data08_3,xyz,pqr
A,123,X2,9,data09_1,data09_2,data09_3,xyz,pqr
A,123,X2,10,data10_1,data10_2,data10_3,xyz,pqr
A,123,X2,11,data11_1,data11_2,data11_3,abc,def
A,123,X2,12,data12_1,data12_2,data12_3,abc,def
A,123,X2,13,data13_1,data13_2,data13_3,abc,def
A,123,X2,14,data14_1,data14_2,data14_3,abc,def
A,123,X2,15,data15_1,data15_2,data15_3,abc,def
A,123,X2,16,data16_1,data16_2,data16_3,abc,def
A,123,X2,17,data17_1,data17_2,data17_3,abc,def
A,123,X2,18,data18_1,data18_2,data18_3,abc,def
A,123,X2,19,data19_1,data19_2,data19_3,abc,def
A,223,X2,20,data20_1,data20_2,data20_3,abc,def
A,223,X2,21,data21_1,data21_2,data21_3,ambidextrous,warthog
A,223,X2,22,data22_1,data22_2,data22_3,ambidextrous,warthog
A,223,X2,23,data23_1,data23_2,data23_3,ambidextrous,warthog
A,223,X2,24,data24_1,data24_2,data24_3,ambidextrous,warthog
A,223,X2,25,data25_1,data25_2,data25_3,ambidextrous,warthog
A,223,X2,26,data26_1,data26_2,data26_3,ambidextrous,warthog
A,223,X2,27,data27_1,data27_2,data27_3,ambidextrous,warthog
A,223,X2,28,data28_1,data28_2,data28_3,ambidextrous,warthog
A,223,X2,29,data29_1,data29_2,data29_3,ambidextrous,warthog
This uses a linear search on the list of ranges; you can write functions in awk and a binary search looking for the correct range would perform better on 6,000 entries. That part, though, is an optimization — exercise for the reader. Remember that the first rule of optimization is: don't. The second rule of optimization (for experts only) is: don't do it yet. Demonstrate that it is a problem. This code shouldn't take all that much longer than the time it takes to copy the 9,000,000 record file (somewhat longer, but not disastrously so). Note, though, that if the file1 data is sorted, the tail of the processing will take longer than the start because of the linear search. If the serial numbers are in a random order, then it will all take about the same time on average.
If your CSV data has commas embedded in the text fields, then awk is no longer suitable; you need a tool with explicit support for CSV format — Perl and Python both have suitable modules.
Answer to Exercise for the Reader
awk -F, 'BEGIN { n = 0; OFS = ","; }
NR==FNR { lo[n] = $1; hi[n] = $2; i1[n] = $3; i2[n] = $4; n++ }
NR!=FNR { i = search($4)
          print $1, $2, $3, $4, $5, $6, $7, i1[i], i2[i];
        }
function search(i, l, h, m)
{
    l = 0; h = n - 1;
    while (l <= h)
    {
        m = int((l + h)/2);
        if (i >= lo[m] && i <= hi[m])
            return m;
        else if (i < lo[m])
            h = m - 1;
        else
            l = m + 1;
    }
    return 0;   # Should not get here
}' file2 file1
Not all that hard to write the binary search. This gives the same result as the original script on the sample data. It has not been exhaustively tested, but appears to work.
Note that the code does not really handle missing ranges in file2; it assumes that the ranges are contiguous but non-overlapping and in sorted order and cover all the values that can appear in the serial column of file1. If those assumptions are not valid, you get erratic behaviour until you fix either the code or the data.
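If file2 is not already sorted numerically on its first column (as in the example added above, where 123456780737000 precedes 123456780016000), the binary search can land on the wrong range. Sorting it first restores that assumption; a sketch, with the file name taken from the original script:
sort -t, -k1,1n file2.csv > file2.sorted.csv
Then pass file2.sorted.csv to the awk script in place of file2.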
In Unix you may use the join command (type 'man join' for more information), which can be configured to work similarly to a join operation in databases. That may help you add the information from File 2 to File 1.
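For reference, a minimal sketch of typical join usage on comma-separated files (illustrative file names; note that join matches rows whose key fields are exactly equal, so it only applies here if File 2 is first expanded to one row per serial rather than per serial range):
# both inputs must be sorted on their join fields
sort -t, -k4,4 file1.csv > file1.sorted
sort -t, -k1,1 file2.csv > file2.sorted
# -t, sets the separator; -1 4 -2 1 pick field 4 of the first file and field 1 of the second as the keys
join -t, -1 4 -2 1 file1.sorted file2.sorted > result.csv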
