Using NOT operator in an IF with multiple grep commands - bash

I am new to Shell scripting, and am writing a Korn shell script.
My aim is to search for each line of fileA.txt in 4 separate files (let's call them fileA.txt, fileB.txt, fileC.txt and fileD.txt). For the lines from fileA.txt that were found in none of the four files, I need to print "not found" to a separate file.
So I came up with the following if statement. I am trying to combine the 4 grep commands using && and apply a logical NOT (!), since I only need the lines that were found in none of the 4 files.
for i in $(<fileA.txt);
do
if !((grep -q $i fileB.txt) && (grep -q $i fileB.txt) && (grep -q $i fileC.txt) && (grep -q $i fileD.txt)); then
print "$i not found in either of 4 files"
fi
done
I know there's something definitely wrong with the syntax, but being a beginner in shell scripting, I can't figure it out.

It doesn't answer the question you asked, and thus violates SO policy, but there's a way to solve your actual problem with awk in one pass that I can't fit in a reasonable comment:
awk 'FNR==NR{a[$0];next} {for(p in a)if($0~p){delete a[p]}} \
END{for(p in a)print "notfound: ",p}' patternfile data1 data2 data3 etc
The notfound: is just for clarity, you can change or omit it as desired.
The output values (patterns that were not found in any data file) are not necessarily in the same order as they were in patternfile; if you care about that:
awk 'FNR==NR{a[$0]=FNR;next} {for(p in a)if($0~p){delete a[p]}} \
END{for(p in a)print a[p],p}' patternfile data1 data2 data3 etc | sort -k1n | cut -d' ' -f2-
# or in GNU awk v4+ only
awk 'FNR==NR{a[$0]=FNR;next} {for(p in a)if($0~p){delete a[p]}} \
END{PROCINFO["sorted_in"]="@val_num_asc";for(p in a)print p}' patternfile data1 data2 data3 etc
Your question is also ambiguous about 'lines'; do you mean each line in patternfile should occur as a line in one of the data files, or can it occur within a line but not necessarily the whole line? Also, are the values in the patternfile only data characters or are any of them special characters that match something different in the data? For example with grep defaults as you posted (or awk with ~ as I have above) if patternfile contains a line boojum.. that item will be considered found if a data file contains any of the following lines:
boojum..
boojumXY
the snark was a boojum!!
OTOH the patternfile line ^abc will match:
abc
abcdefghi
but will NOT match:
^abc
You can get full-line match in grep with option -x, literal (non-regex) match with -F, or both. These can also be achieved in awk but differently.
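To make the difference between these options concrete, here is a minimal sketch. The file data.txt and its two lines are made-up sample values taken from the examples above; only the standard grep options -c, -F and -x are used.

```shell
#!/usr/bin/env bash
# Sketch: how -F and -x change what counts as a match.
# data.txt and its contents are made-up sample values.
tmpdir=$(mktemp -d)
printf '%s\n' 'boojumXY' 'the snark was a boojum!!' > "$tmpdir/data.txt"

pattern='boojum..'

# Default: the pattern is a regex, so each "." matches any character.
regex_hits=$(grep -c -- "$pattern" "$tmpdir/data.txt" || true)

# -F: the pattern is a literal string, so the dots must appear as dots.
literal_hits=$(grep -c -F -- "$pattern" "$tmpdir/data.txt" || true)

# -x: the regex must match the entire line, not just a part of it.
whole_line_hits=$(grep -c -x -- "$pattern" "$tmpdir/data.txt" || true)

echo "regex: $regex_hits, literal: $literal_hits, whole line: $whole_line_hits"
rm -rf "$tmpdir"
```

Combining -F and -x would require a data line to be exactly the characters boojum.. and nothing else.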

You don't need the parentheses. In fact, since grep accepts several files at once and succeeds if the pattern is found in any of them, you don't need 3 separate calls to grep.
while IFS= read -r line; do
if ! grep -q "$line" fileB.txt fileC.txt fileD.txt; then
print "$line not found in any of the 3 files"
fi
done < fileA.txt
You don't even need the loop; this pattern is covered by the -f option:
if ! grep -f fileA.txt fileB.txt fileC.txt fileD.txt; then
...
fi


Concatenate files based on numeric sort of name substring in awk w/o header

I want to concatenate many files in the numeric order of the number in their names, and also remove the first line of each.
e.g. chr1_smallfiles then chr2_smallfiles then chr3_smallfiles.... etc (each without the header)
Note that chr10_smallfiles needs to come after chr9_smallfiles -- that is, this needs to be numeric sort order.
When run separately, awk and ls -v1 each do their job properly, but when I put them together, it doesn't work. Please help, thanks!
awk 'FNR>1' | ls -v1 chr*_smallfiles > bigfile
The issue is with the way that you're trying to pass the list of files to awk. At the moment, you're piping the output of awk to ls, which makes no sense.
Bear in mind that, as mentioned in the comments, ls is a tool for interactive use, and in general its output shouldn't be parsed.
If sorting weren't an issue, you could just use:
awk 'FNR > 1' chr*_smallfiles > bigfile
The shell will expand the glob chr*_smallfiles into a list of files, which are passed as arguments to awk. For each filename argument, all but the first line will be printed.
Since you want to sort the files, things aren't quite so simple. If you're sure the full range of files exist, just replace chr*_smallfiles with chr{1..99}_smallfiles in the original command.
Using some Bash-specific and GNU sort features, you can also achieve the sorting like this:
printf '%s\0' chr*_smallfiles | sort -z -n -k1.4 | xargs -0 awk 'FNR > 1' > bigfile
printf '%s\0' prints each filename followed by a null-byte
sort -z sorts records separated by null-bytes
-n -k1.4 does a numeric sort, starting from the 4th character (the numeric part of the filename)
xargs -0 passes the sorted, null-separated output as arguments to awk
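The pipeline described above can be tried on throwaway data. This is only a sketch: the three files and their contents are invented for the demonstration, and it assumes GNU sort and an xargs that supports -0.

```shell
#!/usr/bin/env bash
# Sketch: numeric-order concatenation without headers (assumes GNU sort -z).
# The sample files created here are made up for the demonstration.
tmpdir=$(mktemp -d)
(
    cd "$tmpdir" || exit 1
    for n in 1 2 10; do
        # Each file has one header line and one data line.
        printf 'header\nrow from chr%s\n' "$n" > "chr${n}_smallfiles"
    done
    # The glob alone would expand as chr10, chr1, chr2; the sort fixes that.
    printf '%s\0' chr*_smallfiles | sort -z -n -k1.4 | xargs -0 awk 'FNR > 1' > bigfile
)
result=$(cat "$tmpdir/bigfile")
echo "$result"
rm -rf "$tmpdir"
```

The rows should come out in the order chr1, chr2, chr10, with no header lines.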
Otherwise, if you want to go through the files in numerical order, and you're not sure whether all the files exist, then you can use a shell loop (although it'll be significantly slower than a single awk invocation):
for file in chr{1..99}_smallfiles; do # 99 is the maximum file number
[ -f "$file" ] || continue # skip missing files
awk 'FNR > 1' "$file"
done > bigfile
You can also use tail to concatenate all the files without header
tail -q -n+2 chr*_smallfiles > bigfile
In case you want to concatenate the files in a natural sort order as described in your question, you can pipe the result of ls -v1 to xargs using
ls -v1 chr*_smallfiles | xargs -d $'\n' tail -q -n+2 > bigfile
(Thanks to Charles Duffy) xargs -d $'\n' sets the delimiter to a newline \n in case the filename contains white spaces or quote characters
Using a bash 4 associative array to extract only the numeric substring of each filename; sort those individually; and then retrieve and concatenate the full names in the resulting order:
#!/usr/bin/env bash
case $BASH_VERSION in ''|[123].*) echo "Requires bash 4.0 or newer" >&2; exit 1;; esac
# when this is done, you'll have something like:
# files=( [1]=chr1_smallfiles.txt
# [10]=chr10_smallfiles.txt
# [9]=chr9_smallfiles.txt )
declare -A files=( )
for f in chr*_smallfiles.txt; do
files[${f//[![:digit:]]/}]=$f
done
# now, emit those indexes (1, 10, 9) to "sort -n -z" to sort them as numbers
# then read those numbers, look up the filenames associated, and pass to awk.
while read -r -d '' key; do
awk 'FNR > 1' <"${files[$key]}"
done < <(printf '%s\0' "${!files[@]}" | sort -n -z) >bigfile
You can do it with a for loop like the one below, which works for me:
for file in chr*_smallfiles
do
tail -n +2 "$file" >> bigfile
done
How does it work? The for loop iterates over all files in the current directory that match the wildcard pattern chr*_smallfiles, assigning each file name to the variable file. tail -n +2 "$file" outputs all lines of that file except the first and appends them to bigfile. In the end, all files are merged (except the first line of each) into the single file bigfile.
Just for completeness, how about a sed solution?
for file in chr*_smallfiles
do
sed -n '2,$p' "$file" >> bigfile
done
Hope it helps!

read values of txt file from bash [duplicate]

I'm trying to read values from a text file.
I have test1.txt which looks like
sub1 1 2 3
sub8 4 5 6
I want to obtain values '1 2 3' when I specify 'sub1'.
The closest I get is:
subj="sub1"
grep "$subj" test1.txt
But the answer is:
sub8 4 5 6
I've read that grep gives you the next line to the match, so I've tried to change the text file to the following:
test2.txt looks like:
sub1
1 2 3
sub8
4 5 6
However, when I type
grep "$subj" test2.txt
The answer is:
sub1
It should be something super simple, but I've tried awk, sed, grep, egrep and cat, and none is working... I've also read some related posts, but none was really helpful.
Awk works: awk '$1 == "'"$subj"'" { print $2, $3, $4 }' test1.txt
The command outputs fields two, three, and four for all lines in test1.txt where the first field is $subj (i.e.: the contents of the variable named subj).
With your original text file format:
target=sub1
while IFS=$' \t\n' read -r key values; do
if [[ $key = "$target" ]]; then
echo "Found values: $values"
fi
done <test1.txt
This requires no external tools, using only functionality built into bash itself. See BashFAQ #1.
As has come up during debugging in comments, if you have a traditional Apple-format text file (CR newlines only), then you might want something more like:
target=sub1
while IFS=$' \t\n' read -r -d $'\r' key values || [[ $key ]]; do
if [[ $key = "$target" ]]; then
echo "Found values: $values"
fi
done <test1.txt
Alternately, using awk (for a standard UNIX text file):
target="sub1"
awk -v target="$target" '$1 == target { $1 = ""; print; }' <test1.txt
...or, for a file with CR-only newlines:
target="sub1"
tr '\r' '\n' <test1.txt | awk -v target="$target" '$1 == target { $1 = ""; print; }'
This version will be slower if the text file being read is small (since awk, like any other external tool, takes time to start up); but faster if it's large (since awk's operation is much faster than that of bash's built-ins once it's done starting up).
grep "sub1" test1.txt | cut -c6-
or
grep -A 1 "sub1" test2.txt | tail -n 1
You're doing it right, but it seems like test1.txt has a wrong value in it.
With grep foo you get all lines with foo in them. Use grep -m1 foo to find only the first line with foo in it.
Then you can use cut -d" " -f2- to get all the values after foo, as long as they are separated by single spaces.
In the end the command would look like this ...
$ subj="sub1"
$ grep -m1 "$subj" test1.txt | cut -d" " -f2-
But this doesn't explain why you could not find sub1 in the first place.
Did you read the proper file?
There's a bunch of ways to do this (and shorter/more efficient answers than what I'm giving you), but I'm assuming you're a beginner at bash, and therefore I'll give you something that's easy to understand:
egrep "^$subj\>" file.txt | sed "s/^\S*\>\s*//"
or
egrep "^$subj\>" file.txt | sed "s/^[^[:blank:]]*\>[[:blank:]]*//"
The first part, egrep, will search for your subject at the beginning of the line in file.txt (that's what the ^ symbol does in the grep string). It also looks for a whole word (the \> matches an end-of-word boundary, so that sub1 doesn't match sub12 in the file). Note that \> is a GNU extension rather than part of POSIX; GNU grep accepts it in both grep and egrep. Once it has found the lines, egrep passes its output to sed, which strips the first word and the whitespace following it from each line. Again, the ^ symbol in the sed command specifies that it should only match at the beginning of the line. The \S* tells it to read as many non-whitespace characters as it can. Then the \s* tells sed to gobble up as much whitespace as it can. sed then replaces everything it matched with nothing, leaving the other stuff behind.
BTW, there's a help page in Stack overflow that tells you how to format your questions (I'm guessing that was the reason you got a downvote).
-------------- EDIT ---------
As pointed out, if you are on a Mac or something like that, you have to use [^[:blank:]] instead of \S and [[:blank:]] instead of \s in your sed expression, as in the second command above (these are portable to all platforms).
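The whole-word concern is easy to see on a small sample. The sketch below is not the exact command from this answer: it swaps \> for a trailing [[:blank:]] so that only POSIX character classes are used, which assumes the key is always followed by a blank. The file sample.txt and its sub12 line are made up.

```shell
#!/usr/bin/env bash
# Sketch: why the match should stop at the end of the key.
# sample.txt is invented; the sub12 line is there to show the pitfall.
tmpdir=$(mktemp -d)
printf '%s\n' 'sub12 7 8 9' 'sub1 1 2 3' 'sub8 4 5 6' > "$tmpdir/sample.txt"

subj=sub1

# A plain substring search also picks up the sub12 line.
loose_hits=$(grep -c -- "$subj" "$tmpdir/sample.txt" || true)

# Anchor at line start and require a blank right after the key,
# then strip the key and the blanks that follow it.
values=$(grep -E "^${subj}[[:blank:]]" "$tmpdir/sample.txt" |
    sed 's/^[^[:blank:]]*[[:blank:]]*//')

echo "lines matched by the plain search: $loose_hits"
echo "values for $subj: $values"
rm -rf "$tmpdir"
```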
awk '/sub1/{ print $2,$3,$4 }' file
1 2 3
What happens? For every line matching the regexp /sub1/, fields two to four are printed.
Any drawbacks? The original spacing is not preserved: the fields are printed separated by single spaces.
Sed also works: sed -n -e 's/^'"$subj"' *//p' file1.txt
It outputs all lines matching $subj at the beginning of a line after having removed the matching word and the spaces following. If TABs are used the spaces should be replaced by something like [[:space:]].

How to capture first column values of a command?

I am new to shell scripting. I am trying to write a script that is supposed to run a command, use a for loop to capture the first column of the output, and do further processing.
command: tst get files
output of this command is something like
NAME COUNT ADMIN
FileA.txt 30 adminA
FileB.txt 21 local
FileC.txt 9 local
FileD.txt 90 adminA
Here is what I have tried so far : UPDATED also want to run additional commands
#!/bin/bash
for f in $(tst get files)
do
echo "FILE :[${f}]"
tst setprimary ${f} && tst get dataload
done
the output I am seeing is something like
FILE :[NAME]
FILE :[COUNT]
FILE :[ADMIN]
FILE :[FileA.txt]
FILE :[30]
FILE :[adminA]
FILE :[FileB.txt]
FILE :[21]
FILE :[local]
FILE :[FileC.txt]
FILE :[9]
FILE :[local]
FILE :[FileD.txt]
FILE :[90]
FILE :[adminA]
I am looking for an output something like
FILE :[FileA.txt]
FILE :[FileB.txt]
FILE :[FileC.txt]
FILE :[FileD.txt]
What should I modify in the shell script to only capture NAME column values? Am I executing the tst get files command correctly in the for loop or is there a better way to execute a command and loop thru the results?
EDIT (Samuel Kirschner): you can do without the for loop entirely and just use awk to print the lines you're interested in
tst get files | awk 'NR > 1 {print "FILE :[" $1 "]"}'
If you want to keep the for loop for some reason and just extract the file name from the lines while skipping the header, you have a few choices. Awk is probably the easiest because of the NR builtin variable (which counts lines) and automatic field-splitting ($1 refers to the first field in the line, for instance), but you can use sed and cut as well.
You can use awk 'NR > 1 {print $1}' to get the first column (splitting on any whitespace while skipping the first line), or sed 1d | cut -d$'\t' -f1. Note that $'\t' is bash-specific syntax for a literal tab character. If the output is padded with spaces rather than delimited by tabs, you can't use the sed ... | cut ... example.
i.e.
#!/bin/bash
for f in $(tst get files | awk 'NR > 1 {print $1}')
do
echo "FILE :[${f}]"
done
or
#!/bin/bash
for f in $(tst get files | sed 1d | cut -d$'\t' -f1)
do
echo "FILE :[${f}]"
done
To avoid unnecessary splitting in the for loop, it's best to set IFS to something specific outside the loop body, so that 'a file with whitespace.txt' is not broken up.
OLD_IFS=$IFS
IFS=$'\n\t'
for f in $(tst get files | sed 1d | cut -d$'\t' -f1)
do
echo "FILE :[${f}]"
done
IFS=$OLD_IFS
You can just do:
tst get files | awk 'NR > 1 { printf "FILE :[%s]\n", $1 }'
Update: To answer extended problem as per comments below by OP:
while read -r file _; do
tst setprimary "$file" && tst get dataload
done < <(tst get files | sed 1d)
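Since tst is not generally available, the loop can be tried with a stand-in. In this sketch, list_files is a hypothetical function that just prints sample output in the format shown in the question; the header line is skipped with tail -n +2, and the loop is fed through a pipe inside a command substitution rather than process substitution so that it also runs in a plain POSIX shell.

```shell
#!/usr/bin/env bash
# Sketch: read only the first column, skipping the header line.
# list_files is a hypothetical stand-in for "tst get files".
list_files() {
    printf '%s\n' \
        'NAME COUNT ADMIN' \
        'FileA.txt 30 adminA' \
        'FileB.txt 21 local'
}

# "read -r file _" puts the first field in $file and the rest of the line in $_.
names=$(
    list_files | tail -n +2 | while read -r file _; do
        printf 'FILE :[%s]\n' "$file"
    done
)
echo "$names"
```

In a real script, the printf inside the loop would be replaced by whatever per-file commands are needed.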
Or perl:
tst ... | perl -lanE 'say "File: [$F[0]]" if $.>1'
the variable $. contains the current line number

sed - unterminated `s' command

I have this piece of code:
cat BP.csv | while read line ; do
goterm=$(awk '{print $1}') ;
name=$(awk '{print $2}') ;
grep -w "$goterm" GOEA.csv | sed "s/$goterm/pi/g" ;
done
file BP.csv has this format:
GO:0008283 cell proliferation
GO:0009405 pathogenesis
GO:0010201 response to continuous far red light stimulus by the high-irradiance response system
GO:0009641 shade avoidance
while GOEA.csv has this format:
4577 GO:0006807 0.994 2014_06_01
4577 GO:0016788 0.989 2014_06_01
4577 GO:0043169 0.977 2014_06_01
4577 GO:0043170 0.963 2014_06_01
sed doesn't work. I want to change GO:0043170 for example, to string "pi", but it gives:
sed: -e expression #1, char 12: unterminated `s' command
Why?
Thanks.
You're running your awk commands without giving them the line you just read (they read from the loop's standard input instead). Try this:
cat BP.csv | while read line ; do
goterm=$(awk '{print $1}' <<< "$line") ;
name=$(awk '{print $2}' <<< "$line" ) ;
grep -w "$goterm" GOEA.csv | sed "s/$goterm/pi/g" ;
done
Let's clean up this code a bit:
while read goterm name
do
grep -w "$goterm" GOEA.csv | sed "s/$goterm/pi/g"
done < BP.csv
The problem is that your awk statements are attempting to read in from STDIN just like your while is doing. You're reading from the same input stream.
What you want to do is to pull out the values from your line. I'm using read to do this. The read statement uses the values in $IFS to separate out the input. This is normally spaces, tabs, and newlines. The read reads each variable you put on the line, and the last value read in contains the entire rest of the line.
Thus:
while read line
reads in the entire line while:
while read goterm name
will break the line as
goterm="GO:0008283"
name="cell proliferation"
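The splitting behaviour of read can be checked without any files. This is a small sketch using one sample line from BP.csv in the question, fed through a here-document.

```shell
#!/usr/bin/env bash
# Sketch: how read distributes a line over its variables.
# The sample line is copied from BP.csv in the question.
line='GO:0008283 cell proliferation'

# Two variables: the first word goes to goterm, the rest of the line to name.
read -r goterm name <<EOF
$line
EOF
echo "goterm=[$goterm]"
echo "name=[$name]"

# One variable: it receives the entire line.
read -r whole <<EOF
$line
EOF
echo "whole=[$whole]"
```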
One more thing. When you use grep and sed together, you probably can get away with just sed:
while read goterm name
do
sed -n "/$goterm/s/$goterm/pi/gp" GOEA.csv
done < BP.csv
The format for the sed command is:
/lines/command/parameters/
So, I'm searching for lines with $goterm in them, then replacing $goterm with pi. The -n means don't print the lines as sed processes them, and p means print the lines where the substitution was made.
By the way, csv as a file suffix means comma-separated values, but neither file looks comma-separated. Are tabs separating the fields? If so, you'll need to set $IFS to a tab.
I would restructure that whole thing more like this:
while read goterm restofline
do
grep -w "${goterm}" GOEA.csv | sed -e "s/${goterm}/pi/g"
done < BP.csv
No reason for the awk things, as the bash read builtin will do rudimentary field splitting for you if you give it multiple variables. Also, you aren't using name anyway, so it's not needed. cat is unnecessary as well.
Depending on your exact use case, even the grep may be unnecessary, making the inner command simply sed -ne "s/${goterm}/pi/gp" GOEA.csv. Unless your purpose for the grep -w is eliminating lines where ${goterm} is a substring of a word instead of the whole word...
For future reference, inserting a set -x above your loop in your script would show you the exact commands that are being run, so that you can compare them with your expectations.

"while read LINE do" and grep problems

I have two files.
file1.txt:
Afghans
Africans
Alaskans
...
where file2.txt contains the output from a wget on a webpage, so it's a big sloppy mess, but does contain many of the words from the first list.
Bashscript:
cat file1.txt | while read LINE; do grep $LINE file2.txt; done
This did not work as expected. I wondered why, so I echoed out the $LINE variable inside the loop and added a sleep 1, so I could see what was happening:
cat file1.txt | while read LINE; do echo $LINE; sleep 1; grep $LINE file2.txt; done
The output in the terminal looks something like this:
Afghans
Africans
Alaskans
Albanians
Americans
grep: Chinese: No such file or directory
: No such file or directory
Arabians
Arabs
Arabs/East Indians
: No such file or directory
Argentinans
Armenians
Asian
Asian Indians
: No such file or directory
file2.txt: Asian Naruto
...
So you can see it did finally find the word "Asian". But why does it say:
No such file or directory
?
Is there something weird going on or am I missing something here?
What about
grep -f file1.txt file2.txt
@OP, first, use dos2unix as advised. Then use awk
awk 'FNR==NR{a[$1];next}{ for(i=1;i<=NF;i++){ if($i in a) {print $i} } } ' file1 file2_wget
Note: using while loop and grep inside the loop is not efficient, since for every iteration, you need to invoke grep on the file2.
@OP, crude explanation:
For the meaning of FNR and NR, please refer to the gawk manual. FNR==NR{a[$1];next} stores the contents of file1 in array a. When FNR is not equal to NR (which means the 2nd file is now being read), it checks whether each word in the line is in array a and, if so, prints it. (The for loop iterates over each word.)
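The FNR==NR idiom can be tried on two tiny files. In this sketch, words.txt and page.txt are made-up stand-ins for file1 and the downloaded page.

```shell
#!/usr/bin/env bash
# Sketch: the two-file FNR==NR idiom on invented sample data.
tmpdir=$(mktemp -d)
printf '%s\n' 'Afghans' 'Asian' > "$tmpdir/words.txt"
printf '%s\n' 'some text Asian Naruto' 'nothing here' > "$tmpdir/page.txt"

found=$(awk '
    # While reading the first file, FNR (per-file line number) equals NR (overall).
    FNR == NR { a[$1]; next }
    # Second file: print every word that was stored as a key of a.
    { for (i = 1; i <= NF; i++) if ($i in a) print $i }
' "$tmpdir/words.txt" "$tmpdir/page.txt")

echo "$found"
rm -rf "$tmpdir"
```

Note that this compares whole whitespace-separated words, so a word followed by punctuation in the page would not count as a match.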
Use more quotes and use less cat
while IFS= read -r LINE; do
grep "$LINE" file2.txt
done < file1.txt
As well as the quoting issue, the file you've downloaded contains CRLF line endings which are throwing read off. Use dos2unix to convert file1.txt before iterating over it.
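The effect of the carriage returns can be reproduced on sample data. This sketch uses tr -d '\r' as a stand-in for dos2unix (which may not be installed everywhere); the file names, their contents and the count_hits helper are all made up for the demonstration.

```shell
#!/usr/bin/env bash
# Sketch: a word list with CRLF line endings fails to match until the CRs are removed.
# All file names and contents here are invented.
tmpdir=$(mktemp -d)
printf 'Asian\r\nArabs\r\n' > "$tmpdir/words_crlf.txt"
printf '%s\n' 'Asian Naruto' > "$tmpdir/page.txt"

# Hypothetical helper: count how many words from the list ($1) occur in page.txt.
count_hits() {
    hits=0
    while IFS= read -r LINE; do
        if grep -q -F -- "$LINE" "$tmpdir/page.txt"; then
            hits=$((hits + 1))
        fi
    done < "$1"
    echo "$hits"
}

# With CRLF endings, each pattern carries a trailing carriage return.
raw_hits=$(count_hits "$tmpdir/words_crlf.txt")

# tr -d '\r' is used here in place of dos2unix.
tr -d '\r' < "$tmpdir/words_crlf.txt" > "$tmpdir/words_unix.txt"
fixed_hits=$(count_hits "$tmpdir/words_unix.txt")

echo "matches with CRLF list: $raw_hits, after removing CRs: $fixed_hits"
rm -rf "$tmpdir"
```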
Although using awk is faster, grep produces a lot more detail with less effort. So, after running dos2unix, use:
grep -F -i -n -f <file_containing_pattern> <file_containing_data_blob>
You will have all the matches + line numbers (case insensitive)
At minimum this will suffice to find all the words from file_containing_pattern:
grep -F -f <file_containing_pattern> <file_containing_data_blob>
