match awk column value to a column in another file - shell

I need to know whether I can match a value computed in awk against another file while I am inside a piped command, like below:
somebinaryGivingOutputToSTDOUT | grep -A3 "sometext" | grep "somemoretext" | awk -F '[:|]' 'BEGIN{OFS=","; print "Col1,Col2,Col3,Col4"}{print $4,$6,$4*10^10+$6,$8}'
From here I need to check whether the computed value $4*10^10+$6 matches a column value in another file. If it does, print the row; otherwise just move on.
The file to match against looks like this:
a,b,c,d,e
1,2,30000000000,3,4
I need to match with the 3rd column of the above file.
I would ideally like this to be in the same command, because if this check is not applied, it prints more than 100 million rows (and a large file).
I have already read this question.
Adding more info:
Breaking my command into parts
part1-command:
somebinaryGivingOutputToSTDOUT | grep -A3 "sometext" | grep "Something:"
part1-output(just showing 1 iteration output):
Something:38|Something1:1|Something2:10588429|Something3:1491539456372358463
part2-command (now I use awk):
awk -F '[:|]' 'BEGIN{OFS=","; print "Col1,Col2,Col3,Col4"}{print $4,$6,$4*10^10+$6,$8}'
part2-command output: currently the values below are printed (note how 1*10^10+10588429 gives 10010588429):
1,10588429,10010588429,1491539456372358463
3,12394810,30012394810,1491539456372359082
1,10588430,10010588430,1491539456372366413
Now here I need to put a check (within the command [near awk]) to print only if 10010588429 was present in another file (say another_file.csv as below)
another_file.csv
A,B,C,D,E
1,2, 10010588429,4,5
x,y,z,z,k
10,20, 10010588430,40,50
output should only be
1,10588429,10010588429,1491539456372358463
1,10588430,10010588430,1491539456372366413
So for every row awk produces, we check for an entry in column C of the second file.

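Using the associative array approach from the previous question, put a hyphen in place of the first file to direct awk to the input stream. The second file is comma separated, so FS is switched to a comma before it is read, and $3+0 forces a numeric key so the space before the value in column C does not matter. Note that this keeps every row from the stream in memory until the second file is read.
Example:
somebinaryGivingOutputToSTDOUT | grep -A3 "sometext" | grep "somemoretext" | awk -F '[:|]' '
BEGIN { OFS=","; print "Col1,Col2,Col3,Col4" }
NR==FNR {                               # first input: the piped stream ("-")
key = $4*10^10+$6
out[key] = $4 OFS $6 OFS key OFS $8     # OFS, not FS: FS is the regex [:|]
next
}
($3+0) in out {                         # second input: another_file.csv
print out[$3+0]
}' - FS=',' another_file.csv > output.csv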
More info on the merging process in the answer cited in the question:
Using AWK to Process Input from Multiple Files

I'll post a template which you can utilize for your computation
awk 'BEGIN {FS=OFS=","}
NR==FNR {lookup[$3+0]; next}
/sometext/ {c=4}
c&&c--&&/somemoretext/ {value=$4*10^10+$6   # the computation from the question
if(value in lookup)
print $4,$6,value,$8}' lookup.file FS='[:|]' grep.files...
Here awk loads the values in the third column of the first file (which is comma delimited) into the lookup array (a hashmap in disguise). For the remaining files it switches the delimiter to : or | and, like grep -A3, looks for the second pattern on the line matching the first pattern and on the 3 lines after it, does the computation, and prints the row when the value is in the lookup.
In awk you also have more control over which column your pattern matches; here I just replicated the grep example.
This is another simplified example to focus on the core of the problem.
awk 'BEGIN{for(i=1;i<=1000;i++) print int(rand()*1000), rand()}' |
awk 'NR==FNR{lookup[$1]; next}
$1 in lookup' perfect.numbers -
The first process creates 1000 random records, and the second one keeps only those whose first field is in the lookup table.
28 0.736027
496 0.968379
496 0.404218
496 0.151907
28 0.0421234
28 0.731929
for the lookup file
$ head perfect.numbers
6
28
496
8128
the piped data is substituted as the second file at -.
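Applied to the sample data from the question, the same lookup-first pattern might look like the sketch below. The sample_stream function stands in for the real binary and the two greps, and the temporary file stands in for another_file.csv; both are assumptions for illustration only. The key is built with sprintf("%.0f", ...) because some awk builds may print large integers in exponent notation otherwise.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for another_file.csv (column C holds the keys).
lookup_file=$(mktemp)
cat > "$lookup_file" <<'EOF'
A,B,C,D,E
1,2, 10010588429,4,5
x,y,z,z,k
10,20, 10010588430,40,50
EOF

# Hypothetical stand-in for: somebinary | grep -A3 ... | grep "Something:"
sample_stream() {
  printf '%s\n' \
    'Something:38|Something1:1|Something2:10588429|Something3:1491539456372358463' \
    'Something:38|Something1:3|Something2:12394810|Something3:1491539456372359082' \
    'Something:38|Something1:1|Something2:10588430|Something3:1491539456372366413'
}

# Lookup file first (FS=','), then the stream on "-" (FS='[:|]').
result=$(sample_stream | awk '
  BEGIN { OFS = "," }
  # First file: remember column C, normalised to a plain integer string.
  NR == FNR { if (FNR > 1) lookup[sprintf("%.0f", $3)]; next }
  # Stream: compute the key the same way and print only known keys.
  {
    key = sprintf("%.0f", $4 * 10^10 + $6)
    if (key in lookup) print $4, $6, key, $8
  }
' FS=',' "$lookup_file" FS='[:|]' -)

printf '%s\n' "$result"
rm -f "$lookup_file"
```

Because only the lookup file is held in memory, the stream can be arbitrarily long; rows are printed as they arrive.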

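You can pipe your grep or awk output into a while read loop, which gives you some degree of freedom. There you can decide whether to forward a line. Note that this starts several processes per input line, so it will be much slower than a single awk on very large inputs:
somebinaryGivingOutputToSTDOUT | grep -A3 "sometext" | grep "somemoretext" | while IFS= read -r LINE; do
COMPUTED=$(printf '%s\n' "$LINE" | awk -F '[:|]' 'BEGIN{OFS=","}{print $4,$6,$4*10^10+$6,$8}')
KEY=$(printf '%s\n' "$COMPUTED" | cut -d, -f3)    # only the computed value is looked up
if cut -d, -f3 /the/file/to/search | grep -qwF -- "$KEY"; then
printf '%s\n' "$COMPUTED"
fi
done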

Related

awk to do group by sum of column

I have this csv file and I am trying to write a shell script that groups by a column and counts the rows in each group. The column is the 11th (STATUS).
My script is
awk -F, 'NR>1{arr[$11]++}END{for (a in arr) print a, arr[a]}' $f > $parentdir/outputfile.csv;
File output expected is
COMMITTED 2
but actual output is just 2.
It prints only count and not group by sum. If I delete any other columns and run same query then it works fine but not with below sample data.
FILE NAME;SEQUENCE NR;TRANSACTION ID;RUN NUMBER;START EDITCREATION;END EDITCREATION;END COMMIT;EDIT DURATION;COMMIT DURATION;HAS DEPENDENCY;STATUS;DETAILS
Buldhana_Refinesource_FG_IW_ETS_000001.xml;1;4a032127-b20d-4fa8-9f4d-7f2999c0c08f;1;20180831130210345;20180831130429638;20180831130722406;140;173;false;COMMITTED;
Buldhana_Refinesource_FG_IW_ETS_000001.xml;2;e4043fc0-3b0a-46ec-b409-748f98ce98ad;1;20180831130722724;20180831130947144;20180831131216693;145;150;false;COMMITTED;
Change the FS to ; in your script:
awk -F';' 'NR>1{arr[$11]++}END{for (a in arr) print a, arr[a]}' file
COMMITTED 2
You're using the wrong field separator. Use
awk -F\;
The ; must be escaped (or quoted) because the shell would otherwise treat it as a command separator. Apart from this, your approach seems OK.
Besides awk, you may also use
tail -n +2 "$f" | cut -f11 -d\; | sort | uniq -c
or
datamash --header-in -t \; -g 11 count 11 < "$f"
to do the same thing.
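To see why the original command printed only the count, the sketch below runs the same awk program with both separators. The rows are shortened, made-up stand-ins with the same 11-column layout, not the real file.

```shell
#!/usr/bin/env bash
# Shortened, hypothetical rows in the same semicolon-separated layout.
data='H1;H2;H3;H4;H5;H6;H7;H8;H9;H10;STATUS
a;1;x;1;s;e;c;140;173;false;COMMITTED
b;2;y;1;s;e;c;145;150;false;COMMITTED'

# With -F, there is no comma to split on, so NF is 1 and $11 is empty:
# every row is counted under the empty key.
wrong=$(printf '%s\n' "$data" | awk -F, 'NR>1{arr[$11]++} END{for (a in arr) print a, arr[a]}')

# With -F';' the 11th field is the STATUS value.
right=$(printf '%s\n' "$data" | awk -F';' 'NR>1{arr[$11]++} END{for (a in arr) print a, arr[a]}')

printf 'wrong: [%s]\nright: [%s]\n' "$wrong" "$right"
```

The brackets in the output make the empty group name in the first case visible.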

Extract the last three columns from a text file with awk

I have a .txt file like this:
ENST00000000442 64073050 64074640 64073208 64074651 ESRRA
ENST00000000233 127228399 127228552 ARF5
ENST00000003100 91763679 91763844 CYP51A1
I want to get only the last 3 columns of each line.
As you can see, there are sometimes empty lines between two lines, which must be ignored. Here is the output that I want:
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
awk  '/a/ {print $1- "\t" $-2 "\t" $-3}'  file.txt.
It does not return what I want. Do you know how to correct the command?
The following awk command may help.
awk 'NF{print $(NF-2),$(NF-1),$NF}' OFS="\t" Input_file
Output will be as follows.
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
EDIT: Adding an explanation of the command (the logic is the same as above, just spread out and commented).
awk '
NF {    # NF is the built-in count of fields in the current line; it is 0 for empty lines, so they are skipped
print $(NF-2), $(NF-1), $NF    # third-last, second-last and last field of the current line
}
' OFS="\t" Input_file    # set OFS (output field separator) to TAB, then name the input file
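One edge case the NF test does not cover is a line with fewer than three fields. A slightly stricter variant, sketched here on made-up input that includes a blank line and a two-field line (both assumptions, not from the question), is to require NF >= 3:

```shell
#!/usr/bin/env bash
# Sample lines from the question plus a blank line and a hypothetical short line.
data='ENST00000000442 64073050 64074640 64073208 64074651 ESRRA

ENST00000000233 127228399 127228552 ARF5
short line'

# NF >= 3 skips blank lines and any line too short to have three trailing fields.
last_three=$(printf '%s\n' "$data" | awk -v OFS='\t' 'NF >= 3 {print $(NF-2), $(NF-1), $NF}')

printf '%s\n' "$last_three"
```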
You can use sed too
sed -E '/^$/d;s/.*\t(([^\t]*[\t|$]){2})/\1/' infile
With some piping:
$ cat file | tr -s '\n' | rev | cut -f 1-3 | rev
64073208 64074651 ESRRA
127228399 127228552 ARF5
91763679 91763844 CYP51A1
First, cat the file to tr to squeeze out repeated \ns and get rid of empty lines. Then reverse the lines, cut the first three fields and reverse again. You could replace the useless cat with the first rev.

passing a parameter in awk command won't work

I run the script below with ./command script.sh 11. The first line of code stores the output (321) successfully in variable x (checked with echo on line 2). On line 3 I try to use x to retrieve the last two columns of all lines in doc2.csv where the value in the first column equals x. This does not work, but when I replace z=$x with z=321 it works fine. Why doesn't this code work when passing the variable?
#!/bin/bash
x="$(awk -v y=$1 -F\; '$1 == y' ~/Documents/doc1.csv | cut -d ';' -f2)"
echo $x
awk -v z=$x -F, '$1 == z' ~/Documents/doc2.csv | cut -d ',' -f2,3
doc1.csv (all columns have unique values)
33;987
22;654
11;321
...
doc2.csv
321,156843,ABCD
321,637253,HYEB
123,256843,BHJN
412,486522,HDBC
412,257843,BHJN
862,256843,BHLN
...
Like others have mentioned, there are probably some extra characters coming along for the ride in field 2 of your cut command.
If you use awk to print just the column you want, instead of printing the entire line and cutting it, you shouldn't have any problems. If you still do, then you will need to look into dos2unix.
n=33
x=$(awk -v y="$n" -F\; '$1 == y {print $2}' d1)
echo "$x"
awk -v z="$x" -F, '$1 == z' d2
d1 and d2 contain doc1 and doc2 contents as you outlined.
As you can see all I did was stop using cut on the output of awk and just told awk to print the second field if the first field is equal to the input variable.
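One plausible source of such extra characters (an assumption; the question does not confirm it) is Windows line endings, where every line ends in a carriage return. The sketch below builds a made-up CRLF copy of doc1.csv and shows how the stray character can be detected by length and stripped inside awk:

```shell
#!/usr/bin/env bash
# Hypothetical doc1.csv saved with Windows (CRLF) line endings.
d1=$(mktemp)
printf '33;987\r\n22;654\r\n11;321\r\n' > "$d1"

# The last field silently carries a trailing carriage return.
raw=$(awk -v y=11 -F';' '$1 == y {print $2}' "$d1")

# Stripping a trailing \r from each record before testing fixes the value.
clean=$(awk -v y=11 -F';' '{sub(/\r$/, "")} $1 == y {print $2}' "$d1")

printf 'raw length:   %d\n' "${#raw}"
printf 'clean length: %d\n' "${#clean}"
rm -f "$d1"
```

A length one greater than expected is a quick hint that an invisible character is attached to the value.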
By the way, awk is pretty powerful if you weren't aware. You can do this entire program within awk.
n=11; awk -v x="$n" -F\; 'NR==FNR{ if($1==x){ y[$2]; } next} $1 in y{print $2, $3}' d1 <( sed 's/,/;/g' d2)
NR==FNR Is a trick that effectively says "If we are still in the first file, do this"... the key is not forgetting to use next to skip the rest of the awk command. Once we get to the second file FNR flips back to 1 but NR keeps incrementing up so they'll never be equal again.
So for the first file we just load up the second column values into an array where the first column matches our passed variable. You could optimize this since you said d1 was always unique lines.
So once we get into the next file the logic skips everything and runs $1 in y. This just checks if the first column is in the array we have created. If it is awk prints column 2 and 3.
<( sed 's/,/;/g' d2) just means we want to treat the output of the sed command as a file. The sed command is just converting the commas in d2 to semicolons so that it matches the FS that awk expects.
Hopefully you've learned a bit about awk, read more here http://www.catonmat.net/blog/ten-awk-tips-tricks-and-pitfalls/ and a great redirection cheat sheet is available here http://www.catonmat.net/download/bash-redirections-cheat-sheet.pdf .

Greping asterisk through bash

I am validating few columns in a pipe delimited file. My second column is defaulted with '*'.
E.g. data of file to be validated:
abc|* |123
def|** |456
ghi|* |789
2nd record has 2 stars due to erroneous data.
I tried it as:
Value_to_match="*"
unmatch_count=$(cat <filename> | cut -d'|' -f2 | awk '{$1=$1};1' | grep -vw "$Value_to_match" | sort -n | uniq | wc -l)
echo "$unmatch_count"
This gives me count as 0 whereas I am expecting 1 (for **) as I have used -w with grep which is exact match and -v which is invert match.
How can I grep **?
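The problem here is that -w is not an exact match of the whole line: it only requires the match to be delimited by non-word characters, and * is itself a non-word character, so a line containing ** still matches the pattern *. In addition, grep treats the pattern as a regular expression. To search for the literal string **, use -F to use fixed strings: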
grep -F '**' file
However, you have an unnecessarily big set of piped operations, while awk alone can handle it quite well.
If you want to check lines containing ** in the second column, say:
$ awk -F"|" '$2 ~ /\*\*/' file
def|** |456
If you want to count how many of such lines you have, say:
$ awk -F"|" '$2 ~ /\*\*/ {sum++} END {print sum}' file
1
Note the usage of awk:
-F"|" sets the field separator to |.
$2 ~ /\*\*/ checks, for every line, whether the second field contains two asterisks (remember we split lines on |). The * is escaped because it has a special meaning in a regular expression.
If you want to output those lines that have just one asterisk as second field, say:
$ awk -F"|" '$2 ~ /^\*[[:space:]]*$/' file
abc|* |123
ghi|* |789
Or check for those not matching this regex with !~:
$ awk -F"|" '$2 !~ /^\*[[:space:]]*$/' file
def|** |456
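If you prefer to stay close to the original pipeline, a minimal sketch is to combine -F (fixed string) with -x (match the whole line) instead of -w. The variable names follow the question; the data is inlined here only to keep the example self-contained.

```shell
#!/usr/bin/env bash
# Sample rows from the question.
data='abc|* |123
def|** |456
ghi|* |789'

value_to_match='*'

# cut takes column 2, awk trims the padding, and grep -F -x -v -c counts the
# values that are not exactly the literal string in value_to_match.
unmatch_count=$(printf '%s\n' "$data" \
  | cut -d'|' -f2 \
  | awk '{$1=$1};1' \
  | grep -Fxvc -- "$value_to_match")

echo "$unmatch_count"
```

Note that -c counts non-matching lines, whereas the original sort | uniq | wc -l counted distinct non-matching values; for this sample both give the same number.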

Get the contents of one column given another column

I have a tab separated file with 3 columns. I'd like to get the contents of the first column, but only for the rows where the 3rd column is equal to 8. How do I extract these values? If I just wanted to extract the values in the first column, I would do the following:
cat file1 | tr "\t" "~" | cut -d"~" -f1 >> file_with_column_3
I'm thinking something like:
cat file1 | tr "\t" "~" | if cut -d"~" -f3==8; then cut -d"~" -f1 ; fi>> file_with_column_3
But that doesn't quite seem to work.
Given that your file is tab delimited, it seems like this problem would be well suited for awk.
Something simple like below should work for you, though without any sample data I can't say for sure (try to always include this on questions on SO)
awk -F'\t' '$3==8 {print $1}' inputfile > outputfile
The -F'\t' sets the input delimiter as tab.
$3==8 compares if the 3rd column based on that delimiter is 8.
If so, the {print $1} is executed, which prints the first column.
Otherwise, nothing is done and awk proceeds to the next line.
If your file has a header you want to preserve, modify the command as follows, which tells awk to print the whole line when the current record number is 1 and then skip to the next record.
awk -F'\t' 'NR==1 {print; next} $3==8 {print $1}' inputfile > outputfile
awk can handle this better:
awk -F '\t' '$3 == 8 { print $1 }' file1
You can do it with bash only too:
while read -r y; do split=($y); [ "${split[2]}" = '8' ] && echo "${split[0]}"; done < x
The input is read into variable y, then split into an array. IFS (the internal field separator) defaults to <space><tab><newline>, so it splits on tabs too. The third element of the array is then compared to '8'. If they are equal, the first element is printed. Remember that array indices start at zero.
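A variant that lets read do the splitting directly is sketched below. The file contents and the variable names first, second and third are made up for illustration. Setting IFS to a tab for the read command means fields containing spaces stay intact.

```shell
#!/usr/bin/env bash
# Hypothetical three-column, tab-separated input.
input=$(mktemp)
printf 'alpha\tx\t8\nbeta\ty\t7\ngamma\tz\t8\n' > "$input"

# IFS=$'\t' splits only on tabs; read -r keeps backslashes literal.
matches=$(while IFS=$'\t' read -r first second third; do
  [ "$third" = "8" ] && printf '%s\n' "$first"
done < "$input")

printf '%s\n' "$matches"
rm -f "$input"
```

Because tab counts as IFS whitespace, consecutive tabs are treated as one separator, so this form is only suitable when no column is empty.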
