How to apply 'awk' to all files in a folder? - bash

I am new to awk, so please pardon my ignorance. I am using awk to extract tag values from a file. The following command works for a single file:
awk -F"<NAME>|</NAME>" '{print $2; exit;}' file.txt
but I am not sure how to run it for all files in a folder.
A sample file looks like this:
<HEADER><H1></H1></HEADER><BODY><NAME>XYZ</NAME><DATE>2015-12-11</DATE></BODY>

#!/bin/bash
STRING=ABC
DATE=$(date +%Y/%m/%d | tr '/' '-')
changedate(){
    for a in /root/Working/awk/*
    do
        for b in $(awk -F"<NAME>|</NAME>" '{print $2;}' "$a")
        do
            if [ "$b" == "$STRING" ]; then
                for c in $(awk -F"<DATE>|</DATE>" '{print $2;}' "$a")
                do
                    sed "s/$c/$DATE/g" "$a";
                done
            else
                echo "Strings are not a match";
            fi
        done
    done
}
changedate
When you run it -
root@revolt:~# cat /root/Working/awk/*
<HEADER><H1></H1></HEADER><BODY><NAME>ABC</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>DEF</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>GHI</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>JKL</NAME><DATE>2015-12-11</DATE></BODY>
String in code is set to ABC
root@revolt:~# ./ANSWER
<HEADER><H1></H1></HEADER><BODY><NAME>ABC</NAME><DATE>2015-07-24</DATE></BODY>
Strings are not a match
Strings are not a match
Strings are not a match
String in code is set to DEF
root@revolt:~# ./ANSWER
Strings are not a match
<HEADER><H1></H1></HEADER><BODY><NAME>DEF</NAME><DATE>2015-07-24</DATE></BODY>
Strings are not a match
Strings are not a match
Alright. Here you would set STRING=ABC, or whatever your desired string is. You could also extend this to check against a list of strings.
The DATE variable captures the current date in the same Y/m/d format as your string, and the tr command then replaces the forward slashes with hyphens (a simpler equivalent is date +%Y-%m-%d).
First we create a function called "changedate". Within this function we nest a few for loops to do different things. The first for loop iterates over every file in /root/Working/awk/, assigning each path to the variable a (using the glob directly rather than parsing ls output, which breaks on filenames containing spaces).
The next for loop grabs whatever sits between the NAME tags of the current file and prints it. Notice we're still using "$a" as the file, because that holds the path of each file in turn. Then an if statement checks the value against your string. If it matches, another for loop substitutes today's date for the date in file a. If it doesn't, we echo "Strings are not a match".
Lastly, we call the "changedate" function, which runs the entire looping sequence above.
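As an aside (not part of the original answer): the script above only prints the changed lines; if you also want the files modified on disk, a minimal sketch using GNU sed's -i would look like this, with the directory path and STRING value carried over from the script above:
#!/bin/bash
STRING=ABC
DATE=$(date +%Y-%m-%d)

for f in /root/Working/awk/*; do
    # extract the NAME value once per file
    name=$(awk -F"<NAME>|</NAME>" '{print $2}' "$f")
    if [ "$name" = "$STRING" ]; then
        # rewrite whatever sits between the DATE tags (GNU sed -i)
        sed -i -E "s|<DATE>[^<]*</DATE>|<DATE>$DATE</DATE>|g" "$f"
    fi
done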

To answer your question about running awk on multiple
files somewhat generically, imagine we have these files:
$ cat file1.txt
<HEADER><H1></H1></HEADER><BODY><NAME>XYZ</NAME><DATE>2015-12-11</DATE></BODY>
$ cat file2.txt
<HEADER><H1></H1></HEADER><BODY><NAME>ABC</NAME><DATE>2015-12-11</DATE></BODY>
$ cat file3.txt
<HEADER><H1></H1></HEADER><BODY><NAME>123</NAME><DATE>2015-12-11</DATE></BODY>
One thing you can do is simply supply awk with multiple files as with almost any command (like ls *.txt):
$ awk -F"<NAME>|</NAME>" '{print $2}' *.txt
XYZ
ABC
123
Awk just reads lines from each file in turn. As mentioned in the comments,
be careful with exit because it will stop processing altogether after the first match:
$ awk -F"<NAME>|</NAME>" '{print $2; exit}' *.txt
XYZ
However, if for efficiency or some other reason you want to stop
processing in the current file and move on immediately to the next one,
you can use the gawk-only nextfile statement:
$ # GAWK ONLY!
$ gawk -F"<NAME>|</NAME>" '{print $2; nextfile}' *.txt
XYZ
ABC
123
Sometimes the results on multiple files are not useful without knowing
which lines came from which file. For that you can use the built-in FILENAME
variable:
$ awk -F"<NAME>|</NAME>" '{print FILENAME, $2}' *.txt
file1.txt XYZ
file2.txt ABC
file3.txt 123
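A related aside (not from the original answer): FNR holds the line number within the current file, while NR counts across all files, which becomes handy once the files have more than one line each:
$ awk -F"<NAME>|</NAME>" '{print FILENAME, FNR, $2}' *.txt
file1.txt 1 XYZ
file2.txt 1 ABC
file3.txt 1 123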
Things get trickier when you want to modify the files you are working
on. Imagine you want to convert the name to lower case:
$ awk -F"<NAME>|</NAME>" '{print tolower($2)}' *.txt
xyz
abc
123
With traditional awk, the usual pattern is to save to a temp file and copy
the temp file back over the original (obviously you want to be careful with
this, keeping copies of the originals!):
$ cat file1.txt
<HEADER><H1></H1></HEADER><BODY><NAME>XYZ</NAME><DATE>2015-12-11</DATE></BODY>
$ awk -F"<NAME>|</NAME>" '{ sub($2,tolower($2)); print }' file1.txt > tmp && mv tmp file1.txt
$ cat file1.txt
<HEADER><H1></H1></HEADER><BODY><NAME>xyz</NAME><DATE>2015-12-11</DATE></BODY>
To use this style on multiple files, it's probably easier to drop back to
the shell and run awk in a loop on single files:
$ cat file1.txt file2.txt file3.txt
<HEADER><H1></H1></HEADER><BODY><NAME>XYZ</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>ABC</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>123</NAME><DATE>2015-12-11</DATE></BODY>
$ for f in file*.txt; do
> awk -F"<NAME>|</NAME>" '{ sub($2,tolower($2)); print }' "$f" > tmp && mv tmp "$f"
> done
$ cat file1.txt file2.txt file3.txt
<HEADER><H1></H1></HEADER><BODY><NAME>xyz</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>abc</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>123</NAME><DATE>2015-12-11</DATE></BODY>
Finally, with gawk you have the option of in-place editing (much like sed -i):
$ cat file1.txt file2.txt file3.txt
<HEADER><H1></H1></HEADER><BODY><NAME>XYZ</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>ABC</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>123</NAME><DATE>2015-12-11</DATE></BODY>
$ # GAWK ONLY!
$ gawk -v INPLACE_SUFFIX=.sav -i inplace -F"<NAME>|</NAME>" '{ sub($2,tolower($2)); print }' *.txt
$ cat file1.txt file2.txt file3.txt
<HEADER><H1></H1></HEADER><BODY><NAME>xyz</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>abc</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>123</NAME><DATE>2015-12-11</DATE></BODY>
The recommended INPLACE_SUFFIX variable tells gawk to make backups of
each file with that extension:
$ cat file1.txt.sav file2.txt.sav file3.txt.sav
<HEADER><H1></H1></HEADER><BODY><NAME>XYZ</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>ABC</NAME><DATE>2015-12-11</DATE></BODY>
<HEADER><H1></H1></HEADER><BODY><NAME>123</NAME><DATE>2015-12-11</DATE></BODY>
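One caveat worth noting (an aside, not from the original answer): sub($2, tolower($2)) treats $2 as a regular expression, so a NAME containing regex metacharacters could substitute in the wrong place. Since -F already split the line around the tags, rebuilding the line from the pieces avoids regex matching entirely (a sketch, assuming exactly one NAME element per line, shown against the original uppercase file1.txt):
$ awk -F"<NAME>|</NAME>" '{ print $1 "<NAME>" tolower($2) "</NAME>" $3 }' file1.txt
<HEADER><H1></H1></HEADER><BODY><NAME>xyz</NAME><DATE>2015-12-11</DATE></BODY>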

Related

Replace one line of a file with another line in a second file if it matches the condition

I am wondering if I can read each line of a.txt and compare it to each line in b.txt. If any line in a.txt matches the beginning of a line in b.txt, we replace the matched line with the line from a.txt. So let's say there are two lines, alias cd /correct/path/ in a.txt and alias cd /wrong/path/sth in b.txt. After I execute my command I would like both files to contain alias cd /correct/path/. My own solution uses two nested while ... read loops and sed -i to replace the line, but I think it is very clumsy and inefficient. I am looking to be enlightened with a cleaner and more efficient solution. Here is my code, if it helps:
awk 'NR==FNR { array[$0]; next } { delete array[$0] } END{for (key in array) { print key } }' a.txt b.txt > tmp
input="tmp"
while IFS= read -r line
do
echo "$line"
cat b.txt > n_tmp
n_input="$n_tmp"
while IFS= read -r n_line
do
if $n_line | awk '{print $1, $2}' == $line | awk '{print $1, $2}'; then
sed -i "s/$n_line/$line/" b.txt
fi
done < "$n_input"
rm -rf n_tmp
done < "$input"
rm -rf tmp
There are a few mistakes in this script, and most of them are within the line if $n_line | awk '{print $1, $2}' == $line | awk '{print $1, $2}'; then. First of all, $n_line | awk '{print $1, $2}' is wrong: $n_line is not a command, so nothing is piped to awk. An echo needs to be added so the string's contents reach awk. Secondly, there are no double quotes around the strings being compared. Lastly, double brackets are needed to wrap the two sides of the comparison. So in the end it should look something like this:
a_string=$(echo "$line" | awk '{print $1, $2}')
b_string=$(echo "$n_line" | awk '{print $1, $2}')
if [[ "$a_string" == "$b_string" ]]; then
I figured I would capture the echoed parts in variables as well; it looks a bit cleaner and is easier to handle. There are still some other problems with this script, but as of now I think the primary issue is solved.
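Since the question asked for something cleaner than the nested while loops, a one-pass awk alternative may be worth sketching (an assumption on my part: the first two fields are what identify a matching line, as in the alias example):
awk 'NR==FNR { fix[$1" "$2] = $0; next }
     ($1" "$2) in fix { $0 = fix[$1" "$2] }
     { print }' a.txt b.txt > b.tmp && mv b.tmp b.txt
It reads a.txt once, remembers each line under its first two fields, and rewrites any line of b.txt that shares those fields; no per-line sed processes are spawned.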

print lines whose first column is not in the list

I have a list of numbers in a file
cat to_delete.txt
2
3
6
9
11
and many txt files in one folder. Each file has tab-delimited lines (there can be more lines than this).
3 0.55667 0.66778 0.54321 0.12345
6 0.99999 0.44444 0.55555 0.66666
7 0.33333 0.34567 0.56789 0.34543
I want to remove the lines whose first number ($1 for awk) is in to_delete.txt and print only the lines whose first number is not in to_delete.txt. The change should replace the old file.
Expected output
7 0.33333 0.34567 0.56789 0.34543
This is what I got so far, which doesn't remove anything:
for file in *.txt; do awk '$1 != /2|3|6|9|11/' "$file" > "$tmp" && mv "$tmp" "$file"; done
I've looked through so many similar questions here but still cannot make it work. I also tried grep -v -f to_delete.txt and sed -n -i '/$to_delete/!p'
Any help is appreciated. Thanks!
In awk:
$ awk 'NR==FNR{a[$1];next}!($1 in a)' delete file
Output:
7 0.33333 0.34567 0.56789 0.34543
Explained:
$ awk '
NR==FNR { # while reading the first file (delete), store each $1 as a key in a
a[$1]
next
}
!($1 in a) # records of later files are printed if their $1 was not stored
' delete files* # mind the file order
My first idea was this:
printf "%s\n" *.txt | xargs -n1 sed -i "$(sed 's!.*!/& /d!' to_delete.txt)"
printf "%s\n" *.txt - outputs the *.txt files each on separate lines
| xargs -n1 execute the following command for each line passing the line content as the input
sed -i - edit file in place
$( ... ) - command substitution
sed 's!.*!/^& /d!' to_delete.txt - for each line in to_delete.txt, append the line with /^ and suffix with /d. That way from the list of numbers I get a list of regexes to delete, like:
/^2 /d
/^3 /d
/^6 /d
and so on. Which tells sed to delete lines matching the regex - line starting with the number followed by a space.
But I think awk would be simpler. You could do:
awk '$1 != 2 && $1 != 3 && $1 != 6 ...' and so on
but that would be longish, unreadable. It's easier to read the map from the file and then check if the number is in the array:
awk 'FNR==NR{ map[$1] } FNR!=NR && !($1 in map)' to_delete.txt "$file"
The FNR==NR is true only for the first file. So when we read it, we set map[$1] (we "set" it just so the element exists). Then FNR!=NR is true for the second file, for which we check whether the first field is a key in the map. If it is not, the expression is true and the line gets printed.
all together:
for file in *.txt; do awk 'FNR==NR{ map[$1] } FNR!=NR && !($1 in map)' to_delete.txt "$file" > tmp && mv tmp "$file"; done
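Two side notes on this loop (mine, not from the original answer): the *.txt glob also matches to_delete.txt itself, so it is safest to skip it; and the grep -v -f attempt from the question can be made to work by anchoring each number to the start of the line followed by a tab (the \t in the sed replacement assumes GNU sed):
for file in *.txt; do
    [ "$file" = to_delete.txt ] && continue   # don't filter the list itself
    grep -v -f <(sed 's/^/^/; s/$/\t/' to_delete.txt) "$file" > tmp && mv tmp "$file"
done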

Read each line of a column of a file and execute grep

I have this example file.txt:
This line contains ABC
This line contains DEF
This line contains GHI
and here the following list.txt:
contains ABC<TAB>ABC
contains DEF<TAB>DEF
Now I am writing a script that executes the following commands for each line of this external file list.txt:
take the string from column 1 of list.txt and search for it in file.txt
if the search succeeds, return the string from column 2 of list.txt
So my output.txt is:
ABC
DEF
This is my code for grep/echo with the query/return strings put in manually:
if grep -i -q 'contains abc' file.txt
then
echo ABC >output.txt
else
echo -n
fi
if grep -i -q 'contains def' file.txt
then
echo DEF >>output.txt
else
echo -n
fi
I have about 100 search terms, which makes the task laborious if done manually. So how do I include while read line; do [commands]; done<list.txt together with the commands about column1 and column2 inside that script?
I would like to use simple grep/echo/awk commands if possible.
Something like this?
$ awk -F'\t' 'FNR==NR { a[$1] = $2; next } {for (x in a) if (index($0, x)) {print a[x]}} ' list.txt file.txt
ABC
DEF
For the lines of the first file (FNR==NR), read the key-value pairs into array a. Then for the lines of the second file, loop through the array, check if the key is found on the line, and if so, print the stored value. index($0, x) looks for the contents of x within (the current line) $0; $0 ~ x would instead treat x as a regex to match against.
If you want to do it in the shell, starting a separate grep for each and every line of list.txt, something like this:
while IFS=$'\t' read k v ; do
grep -qFe "$k" file.txt && echo "$v"
done < list.txt
read k v reads a line of input and splits it (based on IFS) into k and v.
grep -F takes the pattern as a fixed string, not a regex, and -q prevents it from outputting the matching line. grep returns true if any matching lines are found, so $v is printed if $k is found in file.txt.
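With the sample file.txt and list.txt from the question, the loop behaves like this (a transcript reconstructed from the logic above, assuming real tab characters in list.txt):
$ while IFS=$'\t' read k v ; do
>     grep -qFe "$k" file.txt && echo "$v"
> done < list.txt
ABC
DEF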
Using awk and grep:
for text in $(awk '{print $4}' file.txt)
do
    grep "contains $text" list.txt | awk -F $'\t' '{print $2}'
done

output of odd lines in sed not appearing on separate lines

I have the following file:
>A6NGG8_201_I_F
line2
>B1AK53_719_S_R
line4
>B1AK53_744_D_N
line5
>B7U540_205_R_H
line6
>B7U540_354_T_M
line7
where I want to print out all odd lines. I can do this by:
$ sed -n 1~2p file
>A6NGG8_201_I_F
>B1AK53_719_S_R
>B1AK53_744_D_N
>B7U540_205_R_H
>B7U540_354_T_M
and so I want to store the number in each line as a variable in bash, however I run into a problem - storing the result of sed puts the output all on one line:
#!/bin/bash
line1=$(sed -n 1~2p file)
echo ${line1}
in which the output is:
>A6NGG8_201_I_F >B1AK53_719_S_R >B1AK53_744_D_N >B7U540_205_R_H >B7U540_354_T_M
so that when I do something like:
#!/bin/bash
line1=$(sed -n 1~2p file)
pos=$(echo ${line1} | awk -F"[__]" 'NF>2{print $2}')
echo ${pos}
I get
201
where I of course want:
201
719
744
205
354
How do I store the result of sed in separate lines so that they are processed properly when piped into my awk statement? I see you can use the /a notation; however, when I tried sed -n '/1~2p/a' file this does not work in my bash script. Thanks
As said in comments, you need to quote the variable to make this happen:
echo "${line1}"
instead of
echo ${line1}
However, you can directly say:
awk -F_ 'NR%2 && NF>2 {print $2}' file
This will process the odd lines and, in them, print the 2nd field using _ as the separator, but only if there are more than 2 fields.
From tripleee's answer I observe that a FASTA file can contain a different format. If so, I guess you will still want to get the ID in the lines starting with ">". This can be translated as:
awk -F_ '/^>/ && NF>2 {print $2}' file
See an example of how quoting preserves the format:
The file:
$ cat a
hello
bye
Read it into a variable:
$ var=$(< a)
echo without quoting:
$ echo $var
hello bye
Let's quote!
$ echo "$var"
hello
bye
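Back to the original goal of processing each value separately: rather than one big string, a bash array keeps one element per line. A minimal sketch using mapfile (bash 4+; the variable names are mine):
mapfile -t headers < <(sed -n '1~2p' file)   # one array element per odd line
for h in "${headers[@]}"; do
    echo "$h" | awk -F"_" 'NF>2 {print $2}'
done
This prints 201, 719, 744, 205 and 354 on separate lines, because each header is quoted individually when echoed.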
If you are trying to get the header lines out of a FASTA file, your problem statement is wrong -- the data between the headers could be more than one line. You could simply do
sed -n '/^>/!d;s/^[^_]*_//;s/_.*//p' file.fasta
to get just the second underscore-delimited field out of each header line; or equivalently, in Awk,
awk -F _ '/^>/ { print $2 }' file.fasta

bash script: check if all words from one file are contained in another, otherwise issue error

I was wondering if you could help. I am new to bash scripting.
I want to be able to compare two lists. File1.txt will contain a list of a lot of parameters and file2.txt will only contain a section of those parameters.
File1.txt
dbipAddress=192.168.175.130
QAGENT_QCF=AGENT_QCF
QADJUST_INVENTORY_Q=ADJUST_INVENTORY_Q
QCREATE_ORDER_Q=CREATE_ORDER_Q
QLOAD_INVENTORY_Q=LOAD_INVENTORY_Q
File2.txt
AGENT_QCF
ADJUST_INVENTORY_Q
CREATE_ORDER_Q
I want to check if all the Qs in file1.txt are contained in file2.txt (after the =). If they aren't, then the bash script should stop and echo a message.
So, in the example above the script should stop as File2.txt does not contain the following Q: LOAD_INVENTORY_Q.
The Qs in file1.txt or file2.txt do not follow any particular order.
The following command will print out lines in file1.txt with values (anything appearing after =) that do not appear in file2.txt.
[me#home]$ awk -F= 'FNR==NR{keys[$0];next};!($2 in keys)' file2.txt file1.txt
dbipAddress=192.168.175.130
QLOAD_INVENTORY_Q=LOAD_INVENTORY_Q
Breakdown of the command:
awk -F= 'FNR==NR{keys[$0];next};!($2 in keys)' file2.txt file1.txt
-F= - change the delimiter to '='
FNR==NR{keys[$0];next} - store each line in file2.txt as a key in the keys[] array (file2.txt is read first)
!($2 in keys) - target lines in file1.txt where the second column (delimited by '=') does not exist in the keys[] array
To do something more elaborate, say if you wish to run the command as a pre-filter to make sure the file is valid before proceeding with your script, you can use:
awk -F= 'FNR==NR{K[$0];N++;next};!($2 in K) {print "Line "(NR-N)": "$0; E++};END{exit E}' file2.txt file1.txt
ERRS=$?
if [ $ERRS -ne 0 ]; then
# errors found, do something ...
fi
That will print out all lines (including line numbers) in file1.txt that do not meet the bill, and returns an exit code that matches the number of non-conforming lines. That way your script can detect the errors easily by checking $? and respond accordingly.
Example output:
[me#home]$ awk -F= 'FNR==NR{K[$0];N++;next};!($2 in K) {print "Line "(NR-N)": "$0;E++};END{exit E}' file2.txt file1.txt
Line 1: dbipAddress=192.168.175.130
Line 5: QLOAD_INVENTORY_Q=LOAD_INVENTORY_Q
[me#home]$ echo $?
2
You can use cut to get only the part after =. comm can be used to output the lines contained in the first file but not the second one:
grep ^Q File1.txt | cut -d= -f2- | sort | comm -23 - <(sort File2.txt)
The following command line expression will print the Q lines of file1.txt whose values do not occur anywhere in file2.txt:
grep -Fvf file2.txt file1.txt | grep '^Q'
explanation:
-F : match patterns as fixed strings (no regex interpretation); much faster
-v : only print lines that don't match
-f : get your patterns from the file specified
| grep '^Q' : pipe the output into grep, and look for lines that start with "Q"
This isn't exactly "stop the bash script when..." since it will process and print every mismatch; also, it doesn't test that there's an "=" in front of the pattern - but I hope it's useful.
Here's another way:
missing=($(comm -23 <(awk -F= '/^Q/ {print $2}' file1.txt | sort) <(sort file2.txt)))
if (( ${#missing[@]} )); then
echo >&2 "The following items are missing from file2.txt:"
printf '%s\n' "${missing[@]}"
exit 1
fi
Assuming that the relevant lines in file1.txt always start with a Q:
grep "^Q" file1.txt | while IFS= read -r line
do
what=${line#*=}
grep -Fxq "$what" file2.txt || echo "error: $what not found"
done
Output:
error: LOAD_INVENTORY_Q not found
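Since the question asked for the script to stop on the first missing Q, note that an exit inside a pipeline-fed while loop only leaves the subshell. Feeding the loop via process substitution keeps it in the current shell, so exit really aborts the script (a sketch built on the loop above):
while IFS= read -r line
do
    what=${line#*=}
    if ! grep -Fxq "$what" file2.txt; then
        echo "error: $what not found" >&2
        exit 1   # runs in the current shell, so this stops the whole script
    fi
done < <(grep '^Q' file1.txt)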
