bash script: check if all words from one file are contained in another, otherwise issue error - bash

I was wondering if you could help. I am new to bash scripting.
I want to be able to compare two lists. File1.txt will contain a list of a lot of parameters and file2.txt will only contain a section of those parameters.
I want to check if all the Qs in file1.txt are contained in file2.txt (after the =). If they aren't, then the bash script should stop and echo a message.
So, in the example above the script should stop as File2.txt does not contain the following Q: LOAD_INVENTORY_Q.
The Qs in file1.txt or file2.txt do not follow any particular order.

The following command will print out lines in file1.txt with values (anything appearing after =) that do not appear in file2.txt.
[me@home]$ awk -F= 'FNR==NR{keys[$0];next};!($2 in keys)' file2.txt file1.txt
Breakdown of the command:
awk -F= 'FNR==NR{keys[$0];next};!($2 in keys)' file2.txt file1.txt
    ---  ---------------------- -------------
     |              |                 |
 change the         |     Target lines in file1.txt where
 delimiter          |     the second column (delimited by `=`) does
 to '='             |     not exist in the keys[] array.
            Store each line in
            file2.txt as a key
            in the keys[] array
To do something more elaborate, say if you wish to run the command as a pre-filter to make sure the file is valid before proceeding with your script, you can use:
awk -F= 'FNR==NR{K[$0];N++;next};!($2 in K) {print "Line "(NR-N)": "$0; E++};END{exit E}' file2.txt file1.txt
ERRS=$?   # awk's exit status, i.e. the number of non-conforming lines
if [ "$ERRS" -ne 0 ]; then
    # errors found, do something ...
    exit 1
fi
That will print out all lines (including line numbers) in file1.txt that do not fit the bill, and return an exit code that matches the number of non-conforming lines. That way your script can detect the errors easily by checking $? and respond accordingly.
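One caveat with `exit E`: on most systems an exit status is kept modulo 256, so a file with exactly 256 bad lines would look like success. A minimal sketch of a variant that only ever returns 0 or 1 follows. The function name `check_values` and the sample data are assumptions for illustration, not part of the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if every value (second "="-separated
# field) in the parameter file appears as a whole line in the allowed file.
check_values() {
    local allowed=$1 params=$2
    awk -F= '
        FNR == NR { K[$0]; N++; next }    # first file: remember each line
        !($2 in K) { print "Line " (NR - N) ": " $0; bad = 1 }
        END { exit bad + 0 }              # 0 = all found, 1 = something missing
    ' "$allowed" "$params"
}

# Demo with made-up data in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 'A_Q' 'B_Q' > "$tmp/file2.txt"
printf '%s\n' 'Q1=A_Q' 'Q2=C_Q' > "$tmp/file1.txt"

if ! check_values "$tmp/file2.txt" "$tmp/file1.txt"; then
    echo "file1.txt has values missing from file2.txt" >&2
    # a real script would "exit 1" here
fi
rm -rf "$tmp"
```

Like the original one-liner, this relies on the first file being non-empty, because `FNR == NR` is how awk tells the two files apart.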
Example output:
[me@home]$ awk -F= 'FNR==NR{K[$0];N++;next};!($2 in K) {print "Line "(NR-N)": "$0;E++};END{exit E}' file2.txt file1.txt
Line 1: dbipAddress=
[me@home]$ echo $?

You can use cut to get only the part after =. comm can be used to output the lines contained in the first file but not the second one:
grep ^Q File1.txt | cut -d= -f2- | sort | comm -23 - <(sort File2.txt)

The following command prints the lines of file1.txt that start with Q and do not contain any of the strings listed in file2.txt:
cat file1.txt | grep -Fvf file2.txt | grep '^Q'
-F : match patterns exactly (no expansion etc.) ; much faster
-v : only print lines that don't match
-f : get your patterns from the file specified
| grep '^Q' : pipe the output into grep, and look for lines that start with "Q"
This isn't exactly "stop the bash script when..." since it will process and print every mismatch; also, it doesn't test that there's an "=" in front of the pattern - but I hope it's useful.
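One possible way to address both caveats is to compare only the part after `=` and to require a whole-line match with `-x`. The sketch below does that. The function name `missing_values` and the sample file contents are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every value (text after the first "=") from the
# Q lines of file1 that is not present as a complete line in file2.
missing_values() {
    local file1=$1 file2=$2
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
    grep '^Q' "$file1" | cut -d= -f2- | grep -Fxvf "$file2"
}

tmp=$(mktemp -d)
printf '%s\n' 'Q1=A_Q' 'Q2=LOAD_INVENTORY_Q' 'dbipAddress=' > "$tmp/file1.txt"
printf '%s\n' 'A_Q' 'INVENTORY_Q' > "$tmp/file2.txt"

# The last grep exits 0 when it printed something, i.e. when values are missing.
if missing=$(missing_values "$tmp/file1.txt" "$tmp/file2.txt"); then
    echo "missing from file2.txt: $missing" >&2
    # a real script would "exit 1" here
fi
rm -rf "$tmp"
```

Without `-x`, `LOAD_INVENTORY_Q` would count as found here, because it contains the string `INVENTORY_Q`.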

Here's another way:
missing=($(comm -23 <(awk -F= '/^Q/ {print $2}' file1.txt | sort) <(sort file2.txt)))
if (( ${#missing[@]} )); then
    echo >&2 "The following items are missing from file2.txt:"
    printf '%s\n' "${missing[@]}"
    exit 1
fi

Assuming that the relevant lines in file1.txt always start with a Q:
grep "^Q" file1.txt | while IFS= read -r line; do
    what=${line#*=}   # keep only the part after the first "="
    grep -Fxq "$what" file2.txt || echo "error: $what not found"
done
error: LOAD_INVENTORY_Q not found


xargs and cut: getting `cut` fields of a csv to bash variable

I am using xargs in conjunction with cut, but I am unsure how to get the output of cut into variables that I can use for further processing.
So, I have a text file like so:
I do this:
cat test.txt | xargs -L1 | cut -d, -f 1,2
but what I'd like to do is:
cat test.txt | xargs -L1 | cut -d, -f 1,2 | echo $1 $2
where $1 and $2 are /some/path/to/dir and filename.jpg
I am stumped that I cannot seem to be able to achieve this.
You may want to say something like:
while IFS=, read -r f1 f2; do
echo ./mypgm -i "$f1" -o "$f2"
done < test.txt
IFS=, read -r f1 f2 reads test.txt one line at a time,
splits each line on commas, then assigns the fields
to the variables f1 and f2.
The echo line is for demonstration purposes only. Replace it
with your desired command using $f1 and $f2.
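One detail worth knowing: when a line has more fields than there are variables, `read` puts everything that is left, commas included, into the last variable. A minimal sketch that absorbs the extra fields in a third variable follows. The function name `first_two_fields` and the extra columns in the sample line are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "dir/name" for the first two comma-separated
# fields of every line; any further fields land in "rest" and are ignored.
first_two_fields() {
    local f1 f2 rest
    while IFS=, read -r f1 f2 rest; do
        printf '%s/%s\n' "$f1" "$f2"
    done < "$1"
}

tmp=$(mktemp)
# The path and file name come from the question; 800 and 600 are invented.
printf '%s\n' '/some/path/to/dir,filename.jpg,800,600' > "$tmp"
first_two_fields "$tmp"    # prints /some/path/to/dir/filename.jpg
rm -f "$tmp"
```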
Try this:
cat test.txt | awk -F, '{print $1, $2}'
From man xargs:
xargs [-L number] [utility [argument ...]]
-L number
Call utility for every number non-empty lines read.
From man awk:
Awk scans each input file for lines that match any of a set of patterns specified literally in prog or in one or more files specified as -f progfile.
So you don't have to use xargs -L1 as you don't pass the utility to call.
Also from man awk:
The -F fs option defines the input field separator to be the regular expression fs.
So awk -F, can replace the cut -d, part.
The fields are denoted $1, $2, ..., while $0 refers to the entire line.
So $1 is for the first column, $2 is for the second one.
An action is a sequence of statements. A statement can be one of the following:
print [ expression-list ] [ > expression ]
An empty expression-list stands for $0.
The print statement prints its argument on the standard output (or on a file if > file or >> file is present or on a pipe if | cmd is present), separated by the current output field separator, and terminated by the output record separator.
Putting all of this together, cat test.txt | awk -F, '{print $1, $2}' would achieve what you want.

How do I merge different text files?

I have 3 txt files:
I want to combine the 3 text files into a single file and put a comma between them.
endfile.txt should be as follows:
I'd try:
cat file1.txt; cat file2.txt; cat file3.txt > endfile.txt
That writes the contents line by line, but I want them printed side by side with a comma between them.
Could you help?
cat file1.txt | cat - file2.txt | cat - file3.txt | tr '\n' ',' | head --bytes -1
A very easy approach simply uses printf:
(printf "%s" $(cat file1.txt); printf ",%s" $(cat file2.txt file3.txt)) > endfile.txt
This would result in 11,22,33 in endfile.txt. The two printf groups are used to prevent a comma from being written before 11, and the entire line is executed as a subshell so that the output of all commands is redirected to endfile.txt. You may also want to write a final '\n' after file3.txt to ensure the resulting endfile.txt ends with a POSIX line ending.
My answer is as follows.
$ cat *.txt | sed -z 's/\n\(.\)/,\1/g'
If you want to fix the order explicitly, use the following.
$ cat file{1,2,3}.txt | sed -z 's/\n\(.\)/,\1/g'
My sed is version 4.8.
$ sed --version | head -n 1
sed (GNU sed) 4.8
Use paste:
cat file1.txt file2.txt file3.txt | paste -sd, - > endfile.txt
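Note that `paste -s` works through its input files one at a time and writes one output line per file, so to get a single joined line the files have to reach it as one stream. A minimal sketch follows. The function name `join_lines` is an assumption, and the file contents 11, 22 and 33 are taken from the printf answer above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join every line of the given files into one
# comma-separated line. Reading from "-" makes paste see a single stream.
join_lines() {
    cat -- "$@" | paste -sd, -
}

tmp=$(mktemp -d)
echo 11 > "$tmp/file1.txt"
echo 22 > "$tmp/file2.txt"
echo 33 > "$tmp/file3.txt"

join_lines "$tmp/file1.txt" "$tmp/file2.txt" "$tmp/file3.txt"    # prints 11,22,33
# For files with one line each, parallel mode gives the same result:
paste -d, "$tmp/file1.txt" "$tmp/file2.txt" "$tmp/file3.txt"     # prints 11,22,33
rm -rf "$tmp"
```

With multi-line files the two forms differ: `join_lines` still produces one line, while `paste -d,` puts the first lines side by side, then the second lines, and so on.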

Read each line of a column of a file and execute grep

I have a file.txt, for example:
This line contains ABC
This line contains DEF
This line contains GHI
and here the following list.txt:
contains ABC<TAB>ABC
contains DEF<TAB>DEF
Now I am writing a script that executes the following commands for each line of this external file list.txt:
take the string from column 1 of list.txt and search in a third file file.txt
if the first command is positive, return the string from column 2 of list.txt
So my output.txt is:
This is my code for grep/echo with putting the query/return strings manually:
if grep -i -q 'contains abc' file.txt
then
    echo ABC >output.txt
else
    echo -n
fi
if grep -i -q 'contains def' file.txt
then
    echo DEF >>output.txt
else
    echo -n
fi
I have about 100 search terms, which makes the task laborious if done manually. So how do I include while read line; do [commands]; done<list.txt together with the commands about column1 and column2 inside that script?
I would like to use simple grep/echo/awk commands if possible.
Something like this?
$ awk -F'\t' 'FNR==NR { a[$1] = $2; next } {for (x in a) if (index($0, x)) {print a[x]}} ' list.txt file.txt
For the lines of the first file (FNR==NR), read the key-value pairs into the array a. Then, for the lines of the second file, loop through the array, check whether the key is found on the line, and if so, print the stored value. index($0, x) looks for the contents of x in $0 (the current line). $0 ~ x would instead treat x as a regex to match against.
If you want to do it in the shell, starting a separate grep for each and every line of list.txt, something like this:
while IFS=$'\t' read k v ; do
grep -qFe "$k" file.txt && echo "$v"
done < list.txt
read k v reads a line of input and splits it (based on IFS) into k and v.
grep -F takes the pattern as a fixed string, not a regex, and -q prevents it from outputting the matching line. grep returns true if any matching lines are found, so $v is printed if $k is found in file.txt.
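The loop above can be wrapped in a function and tried on the sample data from the question. This is only a sketch: the function name `labels_found` is an assumption, and `-i` is added to mirror the case-insensitive grep in the question's own code.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for each "pattern<TAB>label" line in the list file,
# print the label when the pattern occurs (ignoring case) in the data file.
labels_found() {
    local list=$1 data=$2 k v
    while IFS=$'\t' read -r k v; do
        if grep -qiF -e "$k" "$data"; then
            printf '%s\n' "$v"
        fi
    done < "$list"
}

tmp=$(mktemp -d)
printf '%s\n' 'This line contains ABC' 'This line contains DEF' \
    'This line contains GHI' > "$tmp/file.txt"
printf 'contains ABC\tABC\ncontains DEF\tDEF\n' > "$tmp/list.txt"

labels_found "$tmp/list.txt" "$tmp/file.txt" > "$tmp/output.txt"
cat "$tmp/output.txt"    # prints ABC and DEF, one per line
rm -rf "$tmp"
```

This starts one grep per line of list.txt, which is fine for about 100 search terms. For much longer lists the single awk pass shown earlier is likely to be faster.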
Using awk and grep:
for text in `awk '{print $4}' file.txt`; do
    grep "contains $text" list.txt | awk -F $'\t' '{print $2}'
done

How to apply 'awk' for all files in folder?

I am new to awk, so please pardon my ignorance. I am using awk to extract tag values from a file. The following code works for a single file
awk -F"<NAME>|</NAME>" '{print $2; exit;}' file.txt
but I am not sure how I can run it for all files in folder.
File sample is as follows
STRING=ABC
DATE=$(date +%Y/%m/%d | tr '/' '-')
changedate() {
    for a in $(ls /root/Working/awk/*); do
        for b in $(awk -F"<NAME>|</NAME>" '{print $2;}' "$a"); do
            if [ "$b" == "$STRING" ]; then
                for c in $(awk -F"<DATE>|</DATE>" '{print $2;}' "$a"); do sed "s/$c/$DATE/g" "$a"; done
            else
                echo "Strings are not a match";
            fi
        done
    done
}
changedate
When you run it -
root@revolt:~# cat /root/Working/awk/*
String in code is set to ABC
root@revolt:~# ./ANSWER
Strings are not a match
Strings are not a match
Strings are not a match
String in code is set to DEF
root@revolt:~# ./ANSWER
Strings are not a match
Strings are not a match
Strings are not a match
Alright. In this script you set STRING=ABC, or whatever your desired string is. You could also extend it to check a list of strings.
The DATE variable holds today's date in the same format (Y/m/d) as your string. The tr command then replaces all instances of forward slashes with hyphens.
First we create a function called "changedate". Within this function we nest a few for loops to do different things. The first for loop assigns each entry of ls /root/Working/awk/* to the variable a in turn. This means: for each file/directory in /root/Working/awk/, do the following.
The next for loop takes each file, grabs the text between the NAME tags and assigns it to b. Notice we're still using $a as the file, because that holds the path of each file. Then an if statement checks for your string. If it matches, another for loop substitutes the date in file a. If it doesn't, the script echoes Strings are not a match.
Lastly, we call our "changedate" function which basically runs the entire looping sequence above.
To answer somewhat generically your question about running awk on multiple
files, imagine we have these files:
$ cat file1.txt
$ cat file2.txt
$ cat file3.txt
One thing you can do is simply supply awk with multiple files as with almost any command (like ls *.txt):
$ awk -F"<NAME>|</NAME>" '{print $2}' *.txt
Awk just reads lines from each file in turn. As mentioned in the comments,
be careful with exit because it will stop processing altogether after the first match:
$ awk -F"<NAME>|</NAME>" '{print $2; exit}' *.txt
However, if for efficiency or some other reason you want to stop
processing in the current file and move on immediately to the next one,
you can use the gawk-only nextfile statement:
$ gawk -F"<NAME>|</NAME>" '{print $2; nextfile}' *.txt
Sometimes the results on multiple files are not useful without knowing
which lines came from which file. For that you can use the built-in FILENAME variable:
$ awk -F"<NAME>|</NAME>" '{print FILENAME, $2}' *.txt
file1.txt XYZ
file2.txt ABC
file3.txt 123
Things get trickier when you want to modify the files you are working
on. Imagine you want to convert the name to lower case:
$ awk -F"<NAME>|</NAME>" '{print tolower($2)}' *.txt
With traditional awk, the usual pattern is to save to a temp file and copy
the temp file back to the original (obviously you want to be careful with
this, keeping copies of the originals!)
$ cat file1.txt
$ awk -F"<NAME>|</NAME>" '{ sub($2,tolower($2)); print }' file1.txt > tmp && mv tmp file1.txt
$ cat file1.txt
To use this style on multiple files, it's probably easier to drop back to
the shell and run awk in a loop on single files:
$ cat file1.txt file2.txt file3.txt
$ for f in file*.txt; do
> awk -F"<NAME>|</NAME>" '{ sub($2,tolower($2)); print }' "$f" > tmp && mv tmp "$f"
> done
$ cat file1.txt file2.txt file3.txt
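The temp-file loop can be made a little more defensive. The sketch below uses `mktemp` instead of a fixed file name and only rewrites a file when awk succeeded. It also rebuilds the line from its fields instead of calling `sub($2, ...)`, since `sub` treats its first argument as a regex. The function name `lowercase_names` and the one-tag-per-line file layout (`<NAME>XYZ</NAME>`) are assumptions, as the sample files are not shown.

```shell
#!/usr/bin/env bash
# Hypothetical helper: lower-case the text between <NAME> and </NAME> in each
# file given. Lines without exactly one such tag pair are left unchanged.
lowercase_names() {
    local f tmp
    for f in "$@"; do
        tmp=$(mktemp) || return 1
        if awk -F'<NAME>|</NAME>' '
               NF == 3 { $0 = $1 "<NAME>" tolower($2) "</NAME>" $3 }
               { print }
           ' "$f" > "$tmp"; then
            cat "$tmp" > "$f"    # overwrite in place, keeping file permissions
        fi
        rm -f -- "$tmp"
    done
}

tmp_dir=$(mktemp -d)
printf '<NAME>XYZ</NAME>\n' > "$tmp_dir/file1.txt"
lowercase_names "$tmp_dir"/file*.txt
cat "$tmp_dir/file1.txt"    # prints <NAME>xyz</NAME>
rm -rf "$tmp_dir"
```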
Finally, with gawk you have the option of in-place editing (much like sed -i):
$ cat file1.txt file2.txt file3.txt
$ gawk -v INPLACE_SUFFIX=.sav -i inplace -F"<NAME>|</NAME>" '{ sub($2,tolower($2)); print }' *.txt
$ cat file1.txt file2.txt file3.txt
The recommended INPLACE_SUFFIX variable tells gawk to make backups of
each file with that extension:
$ cat file1.txt.sav file2.txt.sav file3.txt.sav

output of oddlines in sed not appearing on separate lines

I have the following file:
where I want to print out all odd lines. I can do this by:
$ sed -n 1~2p file
and so I want to store the number in each line as a variable in bash, however I run into a problem - storing the result of sed puts the output all on one line:
line1=$(sed -n 1~2p)
echo ${line1}
in which the output is:
>A6NGG8_201_I_F >B1AK53_719_S_R >B1AK53_744_D_N >B7U540_205_R_H >B7U540_354_T_M
so that when I do something like:
line1=$(sed -n 1~2p)
pos=$(echo ${line1} | awk -F"[__]" 'NF>2{print $2}')
echo ${pos}
I get
where I of course want:
How do I store the result of sed as separate lines so that they are processed properly when piped into my awk statement? I see you can use the /a notation, however when I tried sed -n '/1~2p/a' file this does not work in my bash script. Thanks
As said in comments, you need to quote the variable to make this happen:
echo "${line1}"
instead of
echo ${line1}
However, you can directly say:
awk -F_ 'NR%2 && NF>2 {print $2}' file
This processes the odd lines (NR%2 is non-zero for lines 1, 3, 5, ...) and, for those with more than 2 fields, prints the 2nd _-separated field.
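If the values are needed as separate items in the script afterwards, one option is to read them into an array, which keeps one line per element regardless of quoting. A minimal sketch follows, assuming bash 4 or later for `mapfile`. The function name `odd_line_positions` and the sequence lines in the sample are assumptions. The header lines are taken from the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the second "_"-separated field of every
# odd-numbered line that has more than two such fields.
odd_line_positions() {
    awk -F_ 'NR % 2 && NF > 2 { print $2 }' "$1"
}

tmp=$(mktemp)
# Header lines from the question; the sequence lines are invented placeholders.
printf '%s\n' '>A6NGG8_201_I_F' 'MKV' '>B1AK53_719_S_R' 'MLA' > "$tmp"

# mapfile stores one line per array element, so nothing is merged.
mapfile -t positions < <(odd_line_positions "$tmp")
printf 'position: %s\n' "${positions[@]}"    # position: 201, then position: 719
rm -f "$tmp"
```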
From tripleee's answer I observe that a FASTA file can contain a different format. If so, I guess you will still want to get the ID in the lines starting with ">". This can be translated as:
awk -F_ '/^>/ && NF>2 {print $2}' file
See an example of how quoting preserves the format:
The file:
$ cat a
Read it into a variable:
$ var=$(< a)
echo without quoting:
$ echo $var
hello bye
Let's quote!
$ echo "$var"
If you are trying to get the header lines out of a FASTA file, your problem statement is wrong -- the data between the headers could be more than one line. You could simply do
sed -n '/^>/!d;s/^[^_]*_//;s/_.*//p' file.fasta
to get just the second underscore-delimited field out of each header line; or equivalently, in Awk,
awk -F _ '/^>/ { print $2 }' file.fasta
