Fast way to modify a field for rows having the same values in multiple columns (in bash)?

(It doesn't seem complicated, so I'm sorry if this is a duplicate question; I just didn't manage to formulate it well enough for a good search!)
I have a file with five columns:
dog 5 red car 10
dog 9 blue car 10
cat 1 blue car 18
owl 5 red car 15
bla 8 blue train 100
bla 2 red train 100
...
I want to find rows matching the same values in columns 1, 4 and 5 (rows (1,2) and (5,6) in the example above), and modify their value in column 2 to "0", as in:
dog 0 red car 10
dog 0 blue car 10
cat 1 blue car 18
owl 5 red car 15
bla 0 blue train 100
bla 0 red train 100
...
I am doing that by looping through the unique values of column 1 and using grep + awk to find the matches and modify their values. However, I actually have a large file (more than 10,000 rows, 15 columns) and I would like a way of finding the matches without having to read the file multiple times in the loop. Any ideas for a bash or Perl one-liner to speed that up?
Thanks a lot!

Using awk (and assuming columns are delimited by tabs):
$ awk 'BEGIN { FS=OFS="\t"}
NR == FNR { dups[$1,$4,$5]++; next }
dups[$1,$4,$5] > 1 { $2 = 0 }
1' input.txt input.txt
dog 0 red car 10
dog 0 blue car 10
cat 1 blue car 18
owl 5 red car 15
bla 0 blue train 100
bla 0 red train 100
This does read the file twice in order to find all the lines that share the same three fields, but I don't think there's a good way to avoid two passes over the data while preserving order. For example, here is a Perl version that only reads the file once, but it stores all the contents in memory to iterate over at the end, so it's still effectively two passes:
perl -F"\t" -lane '$dups->{$F[0]}{$F[3]}{$F[4]}++;
push #lines, [#F];
END {
for my $l (#lines) {
$l->[1] = 0 if $dups->{$l->[0]}{$l->[3]}{$l->[4]} > 1;
print join("\t", #$l);
}
}' input.txt
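The same buffer-in-memory idea can also be sketched in awk, reading the file only once; a minimal sketch, assuming tab-delimited input as above:
awk 'BEGIN { FS = OFS = "\t" }
{
    lines[NR] = $0        # buffer every line
    dups[$1,$4,$5]++      # count occurrences of the (col1, col4, col5) key
}
END {
    for (i = 1; i <= NR; i++) {
        $0 = lines[i]                     # re-split the buffered line
        if (dups[$1,$4,$5] > 1) $2 = 0    # zero column 2 for duplicated keys
        print
    }
}' input.txt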

Related

Print each first-column element (as a number) the number of times given in the second column

I don't have much experience using Unix tools and I was wondering how to do this:
I have a file with 2 columns like this (space- or tab-separated):
Agent 2
Person 3
Place 1
Location 4
Each distinct element of the first column is assigned a number (Agent -> 1, Person -> 2, Place -> 3, Location -> 4).
I want to print each first-column element's number as many times as the value in the second column. In this case:
1
1
2
2
2
3
4
4
4
4
Explanation: Agent (1) appears 2 times, Person (2) appears 3 times, etc.
Hope you can help me. Thanks in advance.
$ awk '{for (i=1; i<=$2; i++) print NR}' file
1
1
2
2
2
3
4
4
4
4
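Note that print NR works here because every first-column value is distinct and the file is already in the order of the numeric codes. A variant that assigns the numbers explicitly as new first-column values appear (same output for this input) would be:
awk '!($1 in id) { id[$1] = ++n }     # number distinct first-column values in order of appearance
     { for (i = 1; i <= $2; i++) print id[$1] }' file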

Find mean and maximum in 2nd column for a selection in 1st column

I have two columns as follows
ifile.dat
1 10
3 34
1 4
3 32
5 3
2 2
4 20
3 13
4 50
1 40
2 20
5 2
I would like to calculate the mean and maximum of the 2nd-column values for selected ranges of the 1st column.
ofile.dat
1-2 40 15.2 #Here 1-2 means all values in 1st column ranging from 1 to 2;
#40 is the maximum of corresponding values in 2nd column and 15.2 is their mean i.e. (10+4+2+40+20)/5
3-4 50 29.8 #Here 3-4 means all values in 1st column ranging from 3 to 4;
#50 is their maximum and 29.8 is their mean i.e. (34+32+20+13+50)/5
5-6 3 2.5 #Here 5-6 means all values in 1st column ranging from 5 to 6;
#3 is their maximum and 2.5 is their mean i.e. (3+2)/2
Similarly, if I choose a selection range of 3 numbers, then the desired output will be
ofile.dat
1-3 40 19.37
4-6 50 18.7
I have the following script, which calculates these for single values in the 1st column, but I am looking for range selections from the 1st column.
awk '{
  if (a[$1] < $2) { a[$1] = $2 }
  b[$1] += $2; c[$1]++
}
END {
  for (i in b)
    printf "%d %2s %5s %5.2f\n", i, OFS, a[i], b[i]/c[i]
}' ifile.dat
The original data has the values in the 1st column varying from 1 to 100000. So I need to stratify with an interval of 1000. i.e. 1-1000, 1001-2000, 2001-3000,...
The following awk script will provide basic descriptive statistics with grouping.
I suggest looking into a more robust solution (Python, Perl, R, ...) that supports additional measures and more flexibility; there's no point reinventing the wheel.
The grouping logic is 1-1000, 1001-2000, as per the comment above. The code is verbose for clarity.
awk '
{
# Total Counter
nn++ ;
# Group id
gsize = 1000
gid = int(($1-1)/gsize )
v = $2
# Setup new group, if needed
if ( !n[gid] ) {
n[gid] = 0
sum[gid] = 0
max[gid] = min[gid] = v
name[gid] = (gid*gsize +1) "-" ((gid+1)*gsize)
}
if ( v > max[gid] ) max[gid] = v
sum[gid] += v
n[gid]++
}
END {
# Print all groups
for (gid in name) {
printf "%-20s %4d %6.1f %5.1F\n", name[gid], max[gid], sum[gid]/n[gid], n[gid]/nn ;
}
}
'
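If the group width should be configurable, gsize could also be taken from the command line instead of being hard-coded; a minimal sketch, assuming the script above is saved as group_stats.awk (an illustrative name) with the gsize = 1000 assignment moved out of the main block:
BEGIN { if (gsize == "") gsize = 1000 }   # default when -v gsize=... is not supplied
and then invoked as, for example:
awk -v gsize=500 -f group_stats.awk ifile.dat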
Could you please try the following, tested and written with the shown samples only.
sort -k1 Input_file |
awk -v range="1" '
!b[$1]++{
c[++count]=$1
}
{
a[$1]=a[$1]>$2?a[$1]:$2
d[$1]+=$2
e[$1]++
till=$1
}
END{
for(i=1;i<=till;i+=(range+1)){
for(j=i;j<=i+range;j++){
max=max>a[c[j]]?max:a[c[j]]
total+=d[c[j]]
occr+=e[c[j]]
}
print i"-"i+range,max,occr?total/occr:0
occr=total=max=""
}
}
'
For the shown samples, the output will be as follows.
1-2 40 15.2
3-4 50 29.8
5-6 3 2.5
I have kept the range variable as 1 since the groups in the sample span 2 values each; in your case, with groups like 1-1000, 1001-2000 and so on, set the range variable to 999 instead.

Compute percentile and max value per variable

Bash Gurus, I need to compute the max and percentile numbers for each item in the list, using awk
aa 1
ab 3
aa 4
ac 5
aa 3
ad 2
ab 4
ac 2
ae 2
ac 5
Expected output
Item 90th percentile max value
aa 3.8 4
ab 3.9 4
ac 5 5
ad 2 2
ae 2 2
I am able to get the sum and max using the code below, but not the percentile.
awk '{
item[$1]++;
count[$1]+=$2;
max[$1]=$2;
percentile[$1,.9]=$2
}
END{
for (var in item)
print var,count[var],max[var],percentile[var]
}
'
Please suggest.
Percentile calculation from Statistics for Dummies, 2nd ed. :) In GNU awk:
$ cat mnp.awk
BEGIN {
PROCINFO["sorted_in"]="#ind_num_asc" # for order in output
if(p=="") # if p not defined it's median
p=0.5
else
p=p/100 # if 90th percentile: p=0.9
}
{
v[$1][NR]=$2 # values stored per keyword. NR for unique
if($2>m[$1]) # find max val
m[$1]=$2
}
END {
for(i in v) { # for all keywords
n=asort(v[i]) # sort values, n is count
prc=p*n; # percentile figuration
if(prc==int(prc))
w=(v[i][prc]+v[i][prc+1])/2
else
w=v[i][int(prc)+1]
print i, m[i], w # print keyword, max and nth value
}
}
Run it:
$ awk -v p=90 -f mnp.awk data.txt
aa 4 4
ab 4 4
ac 5 5
ad 2 2
ae 2 2
TODO: if the data file was sorted, this could be streamlined and not all data would need to be stored to memory.
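A rough sketch of that streamed version, assuming the file is pre-sorted by keyword and then numerically by value, so only the current group's values need to be kept in memory (mnp_stream.awk is just an illustrative name):
# usage: sort -k1,1 -k2,2n data.txt | awk -v p=90 -f mnp_stream.awk
BEGIN { if (p == "") p = 0.5; else p = p / 100 }
function flush() {
    if (n == 0) return
    prc = p * n                                  # same interpolation as above
    if (prc == int(prc)) w = (v[prc] + v[prc+1]) / 2
    else                 w = v[int(prc) + 1]
    print key, v[n], w                           # keyword, max (last sorted value), percentile
}
$1 != key { flush(); key = $1; n = 0; delete v } # new keyword: report previous group
{ v[++n] = $2 }                                  # values arrive already sorted
END { flush() }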
datamash is a lovely tool, although it doesn't support the percentile part.
$ datamash -W --sort --group=1 max 2 min 2 < INPUT
aa 4 1
ab 4 3
ac 5 2
ad 2 2
ae 2 2
It supports the following operations:
File operations:
transpose, reverse
Numeric Grouping operations:
sum, min, max, absmin, absmax
Textual/Numeric Grouping operations:
count, first, last, rand
unique, collapse, countunique
Statistical Grouping operations:
mean, median, q1, q3, iqr, mode, antimode
pstdev, sstdev, pvar, svar, mad, madraw
pskew, sskew, pkurt, skurt, dpo, jarque
Here is an elegant solution I found floating around the internet for finding the max value:
{
max[$1] = !($1 in max) ? $2 : ($2 > max[$1]) ? $2 : max[$1]
}
END {
for (i in max)
print i, max[i]
}
Output:
ab 4
ac 5
ad 2
ae 2
aa 4
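Assuming that block is saved to a file, say max.awk (a name chosen here for illustration), it could be run as:
awk -f max.awk data.txt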

How to calculate gradient with AWK

I have a file which includes two columns such as:
A B
1 2
10 20
100 200
.
.
.
I want to calculate the gradient (or slope) dB/dA with awk. That is, the third column should be the difference between adjacent rows in column B divided by the difference between the corresponding adjacent rows in column A. The result for the above data should be:
A B dB/dA
1 2 (20-2)/(10-1)=2
10 20 (200-20)/(100-10)=2
100 200
.
.
.
How can I do that?
Given your file, you can do this:
$cat file
A B
1 2
10 20
100 200
awk 'BEGIN{OFS="\t"}NR==1{print $1,$2,"dB/dA"}NR>2{print a,b,($2-b)/($1-a)}{a=$1;b=$2}' file
A B dB/dA
1 2 2
10 20 2
With :
BEGIN{OFS="\t"} to set Output Field Separator to tab
NR==1{print $1,$2,"dB/dA"} to copy the header and add the gradient column
NR>2 to skip the header and the first data line, since a gradient needs the previous row's values
{a=$1;b=$2} to save the current values in a and b for the next line (this runs on every line, starting from the 1st)
{print a,b,($2-b)/($1-a)} to print the previous line together with the gradient between it and the current line
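Note that with NR>2 the final data row (which, per the expected output, has no gradient of its own) is never printed. If it should appear too, a small variation printing the last saved pair in an END block could be used; a sketch on the same sample file:
awk 'BEGIN{OFS="\t"}NR==1{print $1,$2,"dB/dA";next}NR>2{print a,b,($2-b)/($1-a)}{a=$1;b=$2}END{print a,b}' file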
Hope this helps

using awk to average specified rows

I have a data file set up like
a 1
b 2
c 3
d 4
a 5
b 6
c 7
d 6
etc
and I would like to output to a new file
a average of 2nd column from all "a" rows
b average of 2nd column from all "b" rows
etc
where a, b, c... are also numbers.
I have been able to do this for specific values (1.4 in the example below) of the 1st column using awk:
awk '{ if ( $1 == 1.4) total += $2; count++ }
END {print total/10 }' data
though count is not giving me the correct number of rows (count should be 10, which is why I have manually put 10 into the last line to do the average).
I assume a for loop will be required but I have not been able to implement that correctly.
Please help. Thanks.
awk '{a[$1]+=$2;c[$1]++}END{for(x in a)printf "average of %s is %.2f\n",x,a[x]/c[x]}'
The output of the above line (with your example input) is:
average of a is 3.00
average of b is 4.00
average of c is 5.00
average of d is 5.00
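Since the result should go to a new file, the output of that one-liner can simply be redirected; for example (the file names data and averages.txt are just placeholders):
awk '{a[$1]+=$2;c[$1]++}END{for(x in a)printf "average of %s is %.2f\n",x,a[x]/c[x]}' data > averages.txt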
