find lines that aren't matching in directory with grep - bash

I have this bash script:
function getlist() {
grep -E 'pattern' ../fileWithInput.js | sed "s#^regexPattern#\1 \2#" | grep -v :
}
getlist | while read line; do
method=$(echo $line | awk '{ print $1 }')
uri=$(echo $line | awk '{ print $2 }')
`grep "$method" -vr .
#echo method: $method uri: $uri
done
Question:
Currently I have many 'pattern' strings. How to check with directory and output only 'pattern' strings that doesn't match.
What I have example in fileWithInput.js:
'foo','bar','hello'.
~/repo/anotherDirectory:
'foo','bar'.
How to print only strings from fileWithInput.js that are not in /repo/anotherDirectory?
Final output have to be like this:
'hello': 0 matches.
Please help with grep command to do this. Or maybe you have another idea. Thanks for attention and have a nice day!

file1.txt
'foo','bar','hello'.
filem.txt
'foo','bar'.
with awk
awk 'BEGIN{RS="[,\\.]"} NR==FNR{a[$0];next} {delete a[$0]} END{for(i in a){print i": 0 matches."}} ' filei.txt filem.txt
code breakdown:
BEGIN{RS="[,\\.]"} # Record seperator , or .
NR==FNR{a[$0];next} # store values ina array a and skip from next process
{delete a[$0]} # delete from array if file1 exists in file2
END{
for(i in a){
print i": 0 matches."} # print missing items
}
output:
'hello': 0 matches.

Related

Using awk to extract two separate strings

MacOS, Unix
So I have a file in the following stockholm format:
# STOCKHOLM 1.0
#=GS WP_002855993.1/5-168 DE [subseq from] MULTISPECIES: AAC(3) family N-acetyltransferase [Campylobacter]
#=GS WP_002856586.1/5-166 DE [subseq from] MULTISPECIES: aminoglycoside N(3)-acetyltransferase [Campylobacter]
WP_002855993.1/5-168 ------LEHNGKKYSDKDLIDAFYQLGIKRGDILCVHTELmkfgKALLT.K...NDFLKTLLECFFKVLGKEGTLLMP-TF---TYSF------CKNE------VYDKVHSKG--KVGVLNEFFRTSGgGVRRTSDPIFSFAVKGAKADIFLKEN--SSCFGKDSVYEILTREGGKFMLLGLNYG-HALTHYAEE-----
#=GR WP_002855993.1/5-168 PP ......6788899999***********************9333344455.6...8999********************.33...3544......4555......799999975..68********98626999****************999865..689*********************9875.456799996.....
WP_002856586.1/5-166 ------LEFENKKYSTYDFIETFYKLGLQKGDTLCVHTEL....FNFGFpLlsrNEFLQTILDCFFEVIGKEGTLIMP-TF---TYSF------CKNE------VYDKINSKT--KMGALNEYFRKQT.GVKRTNDPIFSFAIKGAKEELFLKDT--TSCFGENCVYEVLTKENGKYMTFGGQG--HTLTHYAEE-----
#=GR WP_002856586.1/5-166 PP ......5566677788889999******************....**9953422246679*******************.33...3544......4455......799998876..589**********.******************99999886..689******************999765..5666***96.....
#=GC PP_cons ......6677788899999999*****************9....77675.5...68889*******************.33...3544......4455......799999976..689*******998.8999**************99999876..689******************9998765.466699996.....
#=GC RF xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx....xxxxx.x...xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
WP_002855993.1/5-168 -----------------------------------------------------------------------------------------------------
#=GR WP_002855993.1/5-168 PP .....................................................................................................
WP_002856586.1/5-166 -----------------------------------------------------------------------------------------------------
#=GR WP_002856586.1/5-166 PP .....................................................................................................
#=GC PP_cons .....................................................................................................
#=GC RF xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
//
And I've created a script to extract the IDs I want, in this case, WP_002855993.1 and WP_002856586.1, and search through another file to extract DNA sequences with the appropriate IDs. The script is as follows:
#!/bin/bash
for fileName in *.sto;
do
protID=$(grep -o "WP_.\{0,11\}" $fileName | sort | uniq)
echo $protID
file=$(echo $fileName | cut -d '_' -f 1,2,3)
file=$(echo $file'_protein.faa')
echo $file
if [ -n "$protID" ]; then
gawk "/^>/{N=0}/^.*$protID/{N=1} {if(N)print}" $file >>
sequence_protein.file
fi
done
And here's an example of the type of file I'm looking through:
>WP_002855993.1 MULTISPECIES: AAC(3) family N-acetyltransferase [Campylobacter]
MKYFLEHNGKKYSDKDLIDAFYQLGIKRGDILCVHTELMKFGKALLTKNDFLKTLLECFFKVLGKEGTLLMPTFT
>WP_002856586.1 MULTISPECIES: aminoglycoside N(3)-acetyltransferase [Campylobacter]
MKYLLEFENKKYSTYDFIETFYKLGLQKGDTLCVHTELFNFGFPLLSRNEFLQTILDCFFEVIGKEGTLIMPTFT
YSFCKNEVYDKINSKTKMGALNEYFRKQTGVKRTNDPIFSFAIKGAKEELFLKDTTSCFGENCVYEVLTKENGKY
>WP_002856595.1 MULTISPECIES: acetyl-CoA carboxylase biotin carboxylase subunit [Campylobacter]
MNQIHKILIANRAEIAVRVIRACRDLHIKSVAVFTEPDRECLHVKIADEAYRIGTDAIRGYLDVARIVEIAKACG
This script works if I have one ID, but in some cases I get two IDs, and I get an error, because I think it's looking for an ID like "WP_002855993.1 WP_002856586.1". Is there a way to modify this script so it looks for two separate occurrences? I guess it's something with the gawk command, but I'm not sure what exactly. Thanks in advance!
an update to the original script:
#!/usr/bin/env bash
for file_sto in *.sto; do
file_faa=$(echo $file_sto | cut -d '_' -f 1,2,3)
file_faa=${file_faa}"_protein.faa"
awk '(NR==FNR) { match($0,/WP_.\{0,11\}/);
if (RSTART > 0) a[substr($0,RSTART,RLENGTH)]++
next; }
($1 in a){ print RS $0 }' $file_sto RS=">" $file_faa >> sequence_protein.file
done
The awk part can probably even be reduced to :
awk '(NR==FNR) { if ($0 ~ /^WP_/) a[$1]++; next }
($1 in a) { print RS $0 }' FS='/' $file_sto FS=" " RS=">" $file_faa
This awk script does the following:
Set the field separator FS to / and read file $file_sto.
When reading $file_sto the record number NR is the same as the file record number FNR.
(NR==FNR) { if ($0 ~ /^WP_/) a[$1]++; next }: this line works only one $file_sto due to the condition in the front. It checks if the line starts with WP_. If it does, it stores the first field $1 (separated by FS which is a /) in an array a; it then skips to the next record in the file (next).
If we finished reading file $file_sto, we set the field separator back to a single space FS=" " (see section Regular expression) and the record separator RS to > and start reading file $file_faa The latter implies that $0 will contain all lines between > and the first field $1 is the protID.
Reading $file_faa, the file record number FNR is restarted from 1 while NR is not reset. Hence the first awk line is skipped.
($1 in a){ print RS $0 } if the first field is in the array a, print the record with the record separator in front of it.
fixing the original script:
If you want to keep your original script, you could store the protID in a list and then loop the list :
#!/bin/bash
for fileName in *.sto; do
protID_list=( $(grep -o "WP_.\{0,11\}" $fileName | sort | uniq) )
echo ${protID_list[#]}
file=$(echo $fileName | cut -d '_' -f 1,2,3)
file=$(echo $file'_protein.faa')
echo $file
for protID in ${protID_list[#]}; do
if [ -n "$protID" ]; then
gawk "/^>/{N=0}/^.*$protID/{N=1} {if(N)print}" $file >>
sequence_protein.file
fi
done
done
Considering your output file is test.
Using following command gives you only file names:
>>cat text | awk '{print $1}' | grep -R 'WP*' | cut -d":" -f2
gives me output:
WP_002855993.1/5-168
WP_002856586.1/5-166
WP_002855993.1/5-168
WP_002856586.1/5-166
Do you want output like that?

awk print something if column is empty

I am trying out one script in which a file [ file.txt ] has so many columns like
abc|pqr|lmn|123
pqr|xzy|321|azy
lee|cha| |325
xyz| |abc|123
I would like to get the column list in bash script using awk command if column is empty it should print blank else print the column value
I have tried the below possibilities but it is not working
cat file.txt | awk -F "|" {'print $2'} | sed -e 's/^$/blank/' // Using awk and sed
cat file.txt | awk -F "|" '!$2 {print "blank"} '
cat file.txt | awk -F "|" '{if ($2 =="" ) print "blank" } '
please let me know how can we do that using awk or any other bash tools.
Thanks
I think what you're looking for is
awk -F '|' '{print match($2, /[^ ]/) ? $2 : "blank"}' file.txt
match(str, regex) returns the position in str of the first match of regex, or 0 if there is no match. So in this case, it will return a non-zero value if there is some non-blank character in field 2. Note that in awk, the index of the first character in a string is 1, not 0.
Here, I'm assuming that you're interested only in a single column.
If you wanted to be able to specify the replacement string from a bash variable, the best solution would be to pass the bash variable into the awk program using the -v switch:
awk -F '|' -v blank="$replacement" \
'{print match($2, /[^ ]/) ? $2 : blank}' file.txt
This mechanism avoids problems with escaping metacharacters.
You can do it using this sed script:
sed -r 's/\| +\|/\|blank\|/g' File
abc|pqr|lmn|123
pqr|xzy|321|azy
lee|cha|blank|325
xyz|blank|abc|123
If you don't want the |:
sed -r 's/\| +\|/\|blank\|/g; s/\|/ /g' File
abc pqr lmn 123
pqr xzy 321 azy
lee cha blank 325
xyz blank abc 123
Else with awk:
awk '{gsub(/\| +\|/,"|blank|")}1' File
abc|pqr|lmn|123
pqr|xzy|321|azy
lee|cha|blank|325
xyz|blank|abc|123
You can use awk like this:
awk 'BEGIN{FS=OFS="|"} {for (i=1; i<=NF; i++) if ($i ~ /^ *$/) $i="blank"} 1' file
abc|pqr|lmn|123
pqr|xzy|321|azy
lee|cha|blank|325
xyz|blank|abc|123

How to print variable value always as last column in CSV file

I have a list of CSV files, I have to print a variable name (dynamically; it will change), to last column in the CSV files.
Here is the code:
addProgramtypeID () {
for csv in $1
do
file_name="$csv"
echo $file_name
f=`echo $file_name | cut -d '_' -f3 | cut -d '.' -f1`
echo $f
k=`grep -i $f Program_type.csv | cut -d ',' -f3`
echo $k
awk '{ print $0 "," "'"$k"'" }' "$csv" > tempfile && mv tempfile "$csv"
done
}
addProgramtypeID "T_H_EDCGO.csv"
As of now the variable value K is being printed at the 1st column of the CSV file , also it is removing the first 2 characters of the first column in the file. My requirement is that the variable value should always come as the last column in the CSV file.
input :
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID
123,3,334,234,3
545,2,444,456,5
if suppose $k=2
output:
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID,2
123,3,334,234,3,2
545,2,444,456,5,2
Program_type.csv
type,desc,id
EDC,Alb,1
EDG,Gsc,2
Assuming there is is nothing nasty in your CSV file, you can use awk as follows:
for csv_file in $ALL_MY_FILES
do
cat csv_file | awk 'BEGIN{FS=","}; {print($(NF))}'
done
Or even just
cat $ALL_MY_FILES | awk 'BEGIN{FS=","}; {print($(NF))}'
Both of these will print the last line column of all the csv files. The results from each CSV are just appended together (is that really what you want?).
The difficulties are on the awk side. This completely unaware of things like quited strings
or extra whitespace. My recommendation is to try the line above, see what goes wrong (if anything) and then start tweaking.
It looks like what you want is just:
$ cat tst.sh
addProgramtypeID () {
csv="$1"
awk -v csv="$csv" '
BEGIN{ FS=OFS=","; split(csv,csvA,/[_.]/); f=csvA[3] }
NR==FNR { if ($0 ~ f) { k = $3 }; next }
{ print $0, k }
' Program_type.csv "$csv" > tempfile && mv tempfile "$csv"
}
addProgramtypeID "T_H_EDC.csv"
$ cat Program_type.csv
type,desc,id
EDC,Alb,1
EDG,Gsc,2
$ cat T_H_EDC.csv
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID
123,3,334,234,3
545,2,444,456,5
$ ./tst.sh
$ cat T_H_EDC.csv
TX_ID,SEQUENCE,PROGRAM_ID,CA_ID,C_ID,1
123,3,334,234,3,1
545,2,444,456,5,1
but it's hard to tell since your posted sample input could not produce your posted desired output so I had to make some up.
if ($0 ~ f) should probably just be if ($1 == f), I just copied what your original grep f <file> logic would do.

bash awk first 1st column and 3rd column with everything after

I am working on the following bash script:
# contents of dbfake file
1 100% file 1
2 99% file name 2
3 100% file name 3
#!/bin/bash
# cat out data
cat dbfake |
# select lines containing 100%
grep 100% |
# print the first and third columns
awk '{print $1, $3}' |
# echo out id and file name and log
xargs -rI % sh -c '{ echo %; echo "%" >> "fake.log"; }'
exit 0
This script works ok, but how do I print everything in column $3 and then all columns after?
You can use cut instead of awk in this case:
cut -f1,3- -d ' '
awk '{ $2 = ""; print }' # remove col 2
If you don't mind a little whitespace:
awk '{ $2="" }1'
But UUOC and grep:
< dbfake awk '/100%/ { $2="" }1' | ...
If you'd like to trim that whitespace:
< dbfake awk '/100%/ { $2=""; sub(FS "+", FS) }1' | ...
For fun, here's another way using GNU sed:
< dbfake sed -r '/100%/s/^(\S+)\s+\S+(.*)/\1\2/' | ...
All you need is:
awk 'sub(/.*100% /,"")' dbfake | tee "fake.log"
Others responded in various ways, but I want to point that using xargs to multiplex output is rather bad idea.
Instead, why don't you:
awk '$2=="100%" { sub("100%[[:space:]]*",""); print; print >>"fake.log"}' dbfake
That's all. You don't need grep, you don't need multiple pipes, and definitely you don't need to fork shell for every line you're outputting.
You could do awk ...; print}' | tee fake.log, but there is not much point in forking tee, if awk can handle it as well.

How to replace all but last matching in a file using bash?

Assuming using bash, having a configuration file like:
param-a=aaaaaa
param-b=bbbbbb
param-foo=first occurence <-- Replace
param-c=cccccc
# param-foo=first commented foo <-- Commented: don't replace
param-d=dddddd
param-e=eeeeee
param-foo=second occurence <-- Rreplace
param-foo=third occurence <-- Last active: don't replace
param-x=xxxxxx1
param-f=ffffff
# param-foo=second commented foo <-- Commented: don't replace
param-x=xxxxxx2
In which you can find multiple commented or uncommented lines of the param-foo,
how can you comment all the uncommented param-foos except the very last active one,
resulting in:
param-a=aaaaaa
param-b=bbbbbb
# param-foo=first occurence <-- Replaced
param-c=cccccc
# param-foo=commented foo <-- Left
param-d=dddddd
param-e=eeeeee
# param-foo=second occurence <-- Replaced
param-foo=third occurence <-- Left
param-x=xxxxxx1
param-f=ffffff
# param-foo=second commented foo <-- Left
param-x=xxxxxx2
Two parts of the question:
1. How to do it with only one known repeating param?
(only param-foo in the example above)
2. How to do it with all multiple active params at once?
(param-foo + param-x in the example above)
Attention: In this case I don't know previously the name of the repeating params!
Thanks
If awk is acceptable, this will do it for param-foo and param-x:
awk -F= -v p='param-foo param-x' 'BEGIN {
ARGV[ARGC++] = ARGV[ARGC - 1]
n = split(p, t, OFS)
for (i = 0; ++i <= n;) _p[t[i]]
}
NR == FNR {
$1 in _p && nr[$1] = NR
next
}
$1 in nr && FNR != nr[$1] {
$0 = "# " $0
}1' infile
You may use a single parameter: p=param-x or add more parameters separated by spaces: p='param-1 param-2 ... param-n'.
Edit: I'm assuming the real input file looks like this:
param-a=aaaaaa
param-b=bbbbbb
param-foo=first occurence
param-c=cccccc
# param-foo=commented foo
param-d=dddddd
param-e=eeeeee
param-foo=second occurence
param-foo=third occurence
param-x=xxxxxx1
param-f=ffffff
param-x=xxxxxx2
Let me know if it's different.
Second edit: providing a solution for mawk users:
awk -F= -v p='param-foo param-x' 'BEGIN {
n = split(p, t, OFS)
for (i = 0; ++i <= n;) _p[t[i]]
}
NR == FNR {
$1 in _p && nr[$1] = NR
next
}
$1 in nr && FNR != nr[$1] {
$0 = "# " $0
}1' infile infile
Adding solution for the latest requirement:
awk -F= 'NR == FNR {
if (NF && !/^#/)
_p[$1]++ && nr[$1] = NR
next
}
$1 in nr && FNR != nr[$1] {
FNR != nr[$1] && $0 = "# " $0
}1' infile infile
I have not tested fully the script, but it worked on the first example:
#!/bin/bash
input_file=/path/to/your/input/file
last_occurence=`nl $input_file | grep 'param-foo' | grep -v '#' | tail -1 | awk -F" " '{print $1}'`
sed -i '/#/!s/param-foo/# param-foo/g' $input_file
sed -i "${last_occurence}s/# param-foo/param-foo/" $input_file
It's very straight forward logic. First we get the last occurrence of param-foo, which is not commented.
The first sed goes and comments all param-foo, which are not commented.
The second sed uses the line_number of last occurence of param-foo and removes the # character. You can easily wrap that in a function and use it inside a loop, providing a list of parameters, instead of only one.
A bit slow for long files, but should work for all the parameters:
grep -v ^# $file |
cut -f1 -d= |
sort -u |
sed 's/^/grep -n . '$file' |
tac |
grep -m1 :/;s/$/= /' |
bash |
sed -r 's%([0-9]+):(.*)=(.*)%\1!s/^\2=/# \2=/%' |
sed -f- $file
This might work:
param="param-foo"
tac input_file |sed '/#/!{/'"$param"'/{x;/./{x;s/'"$param"'/# &/;t};x;h;}}'|tac >output_file
For multiple params:
cp input_file{,.backup}
params=(param-{foo,bar,baz})
tac input_file >backwards_file
for param in "${params[#]}"; do
sed -i '/#/!{/'"$param"'/{x;/./{x;s/'"$param"'/# &/;t};x;h;}}' backwards_file
done
tac backwards_file >output_file
Turn input_file backwards, preprend all but the first occurrence of $param with a comment #,then revert the file.
EDIT:
To extract the params from the file use this piece of code:
params=($(sed -rn '/^#/d;/^$/!s/^\s*([^=]*).*/\1/gp' input_file | sort | uniq))

Resources