Optimizing grep -f piping commands [duplicate] - bash

I have two files.
file1 has some keys that have abc in the second column:
et1 abc
et2 abc
et55 abc
file2 has those column-1 keys in its last column, plus some other numbers I need to add up:
1 2 3 4 5 et1
5 5 5 5 5 et100
3 3 3 3 3 et55
5 5 5 5 4 et1
6 6 6 6 3 et1
For the keys extracted from file1, I need to add up the corresponding column-5 values where the key matches. file2 itself is very large.
This command seems to be working but it is very slow:
egrep -isr "abc" file1.tcl | awk '{print $1}' | grep -vwf /dev/stdin file2.tcl | awk '{tl+=$5} END {print tl}'
How would I go about optimizing the pipe? Also, what am I doing wrong with grep -f? Is it generally not recommended to do something like this?
Edit: The expected output is the sum of all column-5 values in file2 whose column-6 key is present in file1.
Edit 2: Since file1 has the keys et1, et2 and et55, adding up column 5 of file2's matching rows 1, 3, 4 and 5 gives the expected output: 5+3+4+3=15.

Use a single awk to read file1 into the keys of an array. Then when reading file2, add $5 to a total variable when $6 is in the array.
awk 'NR==FNR {if ($2 == "abc") a[$1] = 0; next}
     $6 in a {total += $5}
     END {print total}' file1.tcl file2.tcl
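Run against the sample data above (assuming it is saved as file1.tcl and file2.tcl), this prints the expected total:
$ awk 'NR==FNR {if ($2 == "abc") a[$1] = 0; next} $6 in a {total += $5} END {print total}' file1.tcl file2.tcl
15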

Could you please try the following, which reads file2.tcl first and uses fewer loops. Since your expected output was not clear, I haven't completely tested it.
awk 'FNR==NR{a[$NF]+=$(NF-1);next} $2=="abc"{print $1,a[$1]+0}' file2.tcl file1.tcl
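For what it's worth, with the sample data this prints a per-key breakdown rather than the single total asked for; summing the second column of its output gives the same 15:
$ awk 'FNR==NR{a[$NF]+=$(NF-1);next} $2=="abc"{print $1,a[$1]+0}' file2.tcl file1.tcl
et1 12
et2 0
et55 3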

Related

awk with loops to reorder columns

I am trying to reorder the columns of a comma-separated file by writing an awk program.
My little program to reorder the columns is:
awk -v column=number 'BEGIN {FS=","; ORS="\n"; OFS=","; n=column} {for (i=1; i<=NF; i++){if (i!=n) $(i+1)=$i else $1=$i} {print $0}' file_name
I would like to put the column given by number first and then the remaining ones, but it does not work.
You are overwriting fields as you iterate. You should instead "bubble" the value from position column down to the first position.
Consider how you would move column 3 here:
1 2 3 4
to
3 1 2 4
For example with this input file:
$ cat table
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
You could do it like this (separators changed to whitespace for readability):
$ awk -v col=3 '{val=$col; for (i=col; i>1; i--) $i=$(i-1); $1=val; print $0}' table
3 1 2 4
4 2 3 5
5 3 4 6
6 4 5 7
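Since the question's file is comma-separated, the same bubbling works with the separators set explicitly (a sketch, run against the CSV sample used in the next answer):
$ awk -v col=3 -F, -v OFS=, '{val=$col; for (i=col; i>1; i--) $i=$(i-1); $1=val; print}' file
meow,10,Hello,20,30
34,hello,world,meow,60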
You can simply print the required column first and then the rest of the columns.
$ awk -v column=3 -F "," '{n=column; printf $column; for (i=1; i<NF; i++){if (i!=n) printf ","$i} print ","$NF}' file
Input:
10,Hello,meow,20,30
hello,world,34,meow,60
Output:
meow,10,Hello,20,30
34,hello,world,meow,60
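One caveat with this approach: the loop stops before the last field and $NF is always printed at the end, so picking the last column (column equal to NF) would print that field twice. A sketch of a variant that avoids the special-casing of the last field:
$ awk -v n=5 -F "," '{printf "%s", $n; for (i=1; i<=NF; i++) if (i!=n) printf ",%s", $i; print ""}' file
30,10,Hello,meow,20
60,hello,world,34,meow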

How to print and store specific named columns from csv file with new row numbers

I'll start by saying I'm very new to using bash and any sort of script writing in general.
I have a csv file that has basic column headers and values underneath which looks something like this as an example:
a b c d
3 3 34 4
2 5 4 94
4 5 8 3
9 8 5 7
Is there a way to extract only the numerical values from a specific column and add a number for each row? For example, the first numbered row (starting from 1, after the column header) would be 1, then 2, then 3, etc. For column b the output would be:
1 3
2 5
3 5
4 8
I would like to be able to do this for various different named column headers.
Any help would be appreciated,
Chris
Like this? Using awk:
$ awk 'NR>1{print NR-1, $2}' file
1 3
2 5
3 5
4 8
Explained:
$ awk ' # using awk for the job
NR>1 { # for the records or rows after the first
print NR-1, $2 # output record number minus one and the second field or column
}' file # state the file
"I would like to be able to do this for various different named column headers." With awk you don't specify the column header name but the column number: you don't state b, you state $2.
awk 'NR>1 {print i=1+i, $2}' file
NR>1 skips the first line, in your case the header.
print outputs what follows.
i=1+i increments i before printing it: i starts at 0 and becomes 1 on the first line, 2 on the next, and so on.
$2 prints the second column.
file is the path to your file.
If you have a simple multi-space delimited file (as in your example) awk is the best tool for the job. To select the column by name in awk you can do something like:
$ awk -v col="b" 'FNR==1 { for (i=1;i<=NF;i++) if ($i==col) x=i; next }
{print FNR-1 OFS $x}' file
1 3
2 5
3 5
4 8
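Because the header name is a parameter, the same command works for any column; for example, selecting column d from the sample file:
$ awk -v col="d" 'FNR==1 { for (i=1;i<=NF;i++) if ($i==col) x=i; next }
{print FNR-1 OFS $x}' file
1 4
2 94
3 3
4 7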

Count how many occurrences in a line are greater than or equal to a defined value

I've a file (F1) with N=10000 lines, each line containing M=20000 numbers. I've another file (F2) with N=10000 lines and only 1 column. How can I count the number of values in line i of file F1 that are greater than or equal to the number found at line i of file F2? I tried using a bash loop with awk/sed but my output is empty.
Edit:
For now I've only succeeded in printing the number of occurrences that are greater than a defined value. Here's an example with a 3-line file and a defined value of 15 (sorry, it's very dirty code):
for i in {1..3};do sed -n "$i"p tmp.txt | sed 's/\t/\n/g' | awk '{if($1 > 15){print $1}}' | wc -l; done;
Thanks in advance,
awk 'FNR==NR {a[FNR]=$1; next}
     {count=0
      for (i=1; i<=NF; i++)
          if ($i >= a[FNR]) count++
      print count}' file2 file1
While processing file2 (the first file read, where FNR equals NR), store each line's value in array a, indexed by the current record number.
For each line of file1, initialize count to 0.
Loop through the fields, incrementing the counter when a value is greater than or equal to the one stored at index FNR in array a.
Print the count value.
$ cat file1
1 3 5 7 3 6
2 5 6 8 7 7
4 6 7 8 9 4
$ cat file2
6
3
1
$ awk -f file.awk file2 file1
2
5
6
You could do it in a single awk command:
awk 'NR==FNR{a[FNR]=$1;next}{c=0;for(i=1;i<=NF;i++)c+=($i>=a[FNR]);print c}' file2 file1
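With the sample files above, it produces the same counts (note the comparison must be >=, not >, to match the expected results):
$ awk 'NR==FNR{a[FNR]=$1;next}{c=0;for(i=1;i<=NF;i++)c+=($i>=a[FNR]);print c}' file2 file1
2
5
6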

compare single column of two files

I have two files, each with two columns separated by a space.
I'd like to find the lines in which column 2 is not the same in both files and output them to a third file.
file A:
1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
2 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
3 CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
4 DDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
5 EEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
6 FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
7 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
8 HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
file B:
1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
2 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
3 CCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
4 DDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
5 WWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
6 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
7 YYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
8 ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
9 EEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
10 FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
11 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
12 HHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
desired output:
5 WWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
6 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
7 YYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
8 ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
I assumed the easiest way to do this was to grep each line from file A in file B, but I'm new to bash and can't figure out the next step. Any help is greatly appreciated!
You can use awk for this:
$ awk 'FNR==NR {a[$1]=$2; next} $1 in a && a[$1] != $2' fileA fileB
5 WWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
6 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
7 YYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
8 ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
It loops through the first file storing the values in an array a[1st col] = 2nd col. Then, it loops through the second file and prints those lines matching these conditions:
The first column is present in the first file.
The second column value is different from the one in the first file.
To store the result in a new file, just redirect the command's output:
awk 'FNR==NR {a[$1]=$2; next} $1 in a && a[$1] != $2' fileA fileB > fileC
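If you also want the rows whose key never appears in fileA at all (keys 9-12 in this sample), a sketch of a variant that simply drops the membership test:
awk 'FNR==NR {a[$1]=$2; next} !($1 in a) || a[$1] != $2' fileA fileB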

How can I use awk to sort columns by the values of the last line?

I have a file like this (with hundreds of lines and columns)
1 2 3
4 5 6
7 88 9
and I would like to reorder the columns based on the last line's values (or the values of a specific line)
1 3 2
4 6 5
7 9 88
How can I use awk (or other) to accomplish this task?
Thank you in advance for your help
EDIT: I would like to thank everybody and to apologize if I wasn't clear enough.
What I would like to do is:
take a line (for example the last one);
reorder the columns of the matrix using the sorted values of the chosen line to determine the order.
So, the last line is 7 88 9, which sorted is 7 9 88; the three columns therefore have to be reordered in a way such that, in this case, the last two columns are swapped.
A four-column more generic example, based on the last line again:
Input:
1 2 3 4
4 5 6 7
7 88.0 9 -3
Output:
4 1 3 2
7 4 6 5
-3 7 9 88.0
Here's a quick, dirty and improvable solution (edited because the OP clarified that the numbers are floating point):
$ cat test.dat
1 2 3
4 5 6
.07 .88 -.09
$ awk "{print $(printf '$%d%.0s\n' \
$(i=0; for x in $(tail -n1 test.dat); do
echo $((++i)) $x
done |
sort -k2g) | paste -sd,)}" test.dat
3 1 2
6 4 5
-.09 .07 .88
To see what's going on there (or at least part of it):
$ echo "{print $(printf '$%d%.0s\n' \
$(i=0; for x in $(tail -n1 test.dat); do
echo $((++i)) $x
done |
sort -k2g) | paste -sd,)}" test.dat
{print $3,$1,$2} test.dat
To make it work for an arbitrary line, replace tail -n1 with tail -n+$L|head -n1
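Equivalently, sed -n "${L}p" test.dat selects that one line (with L being a hypothetical shell variable holding the chosen line number).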
This problem can be elegantly solved using GNU awk's array-sorting feature: gawk lets you control array traversal order through PROCINFO["sorted_in"]. Two passes over the file are required: the first pass splits the last record into an array, and the second pass loops through the indices of the array in value order, outputting fields based on those indices. The code below probably explains it better than I do.
awk 'BEGIN {PROCINFO["sorted_in"] = "@val_num_asc"}
NR == FNR {for (x in arr) delete arr[x]; split($0, arr)}
NR != FNR {sep=""; for (x in arr) {printf "%s%s", sep, $x; sep=" "} print ""}' file.txt file.txt
4 1 3 2
7 4 6 5
-3 7 9 88.0
Update:
Create a file called transpose.awk like this:
{
for (i=1; i<=NF; i++) {
a[NR,i] = $i
}
}
NF>p { p = NF }
END {
for(j=1; j<=p; j++) {
str=a[1,j]
for(i=2; i<=NR; i++){
str=str OFS a[i,j];
}
print str
}
}
Now here is the script that should do work for you:
awk -f transpose.awk file | sort -n -k $(wc -l < file) | awk -f transpose.awk
1 3 2
4 6 5
7 9 88
I am using transpose.awk twice here: once to transpose rows to columns, then sorting numerically on the field that corresponds to the last line (after transposing, each original line becomes a field, so the sort key is the file's line count), and then transposing back again. It may not be the most efficient solution, but it works as per the OP's requirements.
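For what it's worth, the same pipeline run against the earlier four-column example also reproduces the expected output:
$ awk -f transpose.awk file | sort -n -k $(wc -l < file) | awk -f transpose.awk
4 1 3 2
7 4 6 5
-3 7 9 88.0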
The transposing awk script is courtesy of @ghostdog74, from An efficient way to transpose a file in Bash.
