Ignoring lines from grep matching any element in a bash array

I have an array (superStringIgnoreArray) containing superstrings like "formula", "forest", "foreign", "fortify", and I am running the following grep lines:
eval 'find "$SEARCH_DIR" -type f -print0 | xargs -0 grep -HniI "$hitWord" >> "$OUTPUT_FILE"'
eval 'find "$SEARCH_DIR" -type f -print0 | xargs -0 grep -HniI --color=always "$hitWord" | more'
Where hitWord in this instance would be "for".
How can I return all hits that do not match any entry in my superStringIgnoreArray? (So lines containing "for", "form", "fort", "fork", "forming" would be returned, but "fortify", "forest", etc. would not.)
Example output:
srcToSearch/open_source_licenses.txt:12:source software packages. One or more such open_source_licenses.txt files may there**for**e
srcToSearch/open_source_licenses.txt:19:-- **For** vCenter Server 5.5u2 GA, the license in**for**mation listed in Parts 2,
srcToSearch/open_source_licenses.txt:22:-- **For** vCenter Server on Linux Virtual Appliance 5.5u2 GA, the license
srcToSearch/open_source_licenses.txt:23:in**for**mation listed in Parts 1, 2, 3, 4, 5 and 6 are applicable.
srcToSearch/open_source_licenses.txt:29:document. This list is provided **for** your convenience; please read further if

grep + bash solution:
superStringIgnoreArray=("formula" "forest" "foreign" "fortify")
grep -HniIr "$hitWord" "$SEARCH_DIR"/* \
| grep -iv -f <(printf '%s\n' "${superStringIgnoreArray[@]}") | tee "$OUTPUT_FILE"
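For reference, the process substitution hands grep -f one pattern per line:
$ printf '%s\n' "${superStringIgnoreArray[@]}"
formula
forest
foreign
fortify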

Since you're outputting the filenames, chaining another grep isn't trivial, but you can achieve the same with awk:
$ grep -HniIFr "$hitWord" "$SEARCH_DIR" |
    awk 'BEGIN {OFS=FS=":"}
         NR==FNR {a[tolower($0)]; next}
         {f=$1; n=$2; $1=$2=""
          for(k in a) if(tolower($0)~k) next}
         {$1=f; $2=n; print}' blacklist -
Here awk limits the matching to the text after the ":"-delimited filename and line number. If your "$hitWord" is a literal string, the -F flag helps (used above). awk is still doing pattern matching on the blacklist entries, though. tolower() makes the second step case-insensitive too.
Since the ":" delimiter can also appear within the body, we can't depend on $3 in awk; instead we store $1 and $2, remove them from the line, match, and add them back before printing. At this point you could fold the first grep's functionality into this awk as well; see the sketch below.
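For example (a hedged sketch, not a drop-in replacement: -i is emulated with tolower(), -H/-n with FILENAME/FNR, and grep's -I binary skipping is not reimplemented; blacklist entries are matched as literal substrings via index()):
find "$SEARCH_DIR" -type f -exec awk -v w="$hitWord" '
  NR==FNR { bad[tolower($0)]; next }        # first input: the blacklist
  {
    line = tolower($0)
    if (index(line, tolower(w))) {          # line contains the hit word
      for (k in bad)
        if (index(line, k)) next            # drop blacklisted superstrings
      print FILENAME ":" FNR ":" $0
    }
  }' <(printf '%s\n' "${superStringIgnoreArray[@]}") {} +
Caveat: if find batches the files into more than one awk invocation, only the first invocation can read the process substitution; write the blacklist to a temporary file if that is a risk.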
However, I think that without the -o flag, this and other line-based solutions will fail when an actual match and an unwanted match occur on the same line. If the unwanted superstrings are few, perhaps a negative lookbehind/lookahead pattern is the better solution.
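A hedged sketch of that idea, assuming GNU grep built with PCRE support (-P); the alternation lists, by hand, the remainders of the superstrings after the hit word:
find "$SEARCH_DIR" -type f -print0 |
  xargs -0 grep -HniIP 'for(?!mula|est|eign|tify)' >> "$OUTPUT_FILE"
This matches every "for" that does not start a blacklisted superstring, so an actual match and an unwanted match on the same line are handled correctly.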
If your blacklist is not a file but an array, you can use process substitution as in the other answer; replace the blacklist file with
... | awk '...' <(printf '%s\n' "${superStringIgnoreArray[@]}") -

Related

bash check for words in first file not contained in second file

I have a txt file containing multiple lines of text, for example:
This is a
file containing several
lines of text.
Now I have another file containing just words, like so:
this
contains
containing
text
Now I want to output the words which are in file 1, but not in file 2. I have tried the following:
cat file_1.txt | xargs -n1 | tr -d '[:punct:]' | sort | uniq | comm -i23 - file_2.txt
xargs -n1 to put each space-separated substring on its own line.
tr -d '[:punct:]' to remove punctuation.
sort and uniq to make a sorted file to use with comm, which is used with the -i flag to make it case insensitive.
But somehow this doesn't work. I've looked around online and found similar questions, however, I wasn't able to figure out what I was doing wrong. Most answers to those questions were working with 2 files which were already sorted, stripped of newlines, spaces, and punctuation while my file_1 may contain any of those at the start.
Desired output:
is
a
file
several
lines
of
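For the record, the original comm attempt can be repaired; a hedged sketch, assuming GNU tools (comm requires both inputs to be sorted the same way, and GNU comm has no -i flag, so both sides are lowercased first; the output comes out in sorted rather than file order):
comm -23 \
  <(tr -d '[:punct:]' < file_1.txt | xargs -n1 | tr '[:upper:]' '[:lower:]' | sort -u) \
  <(tr '[:upper:]' '[:lower:]' < file_2.txt | sort -u)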
paste + grep approach:
grep -Eiv "($(paste -sd'|' <file2.txt))" <(grep -wo '\w*' file1.txt)
The output:
is
a
file
several
lines
of
I would try something more direct:
for A in $(tr -d '[:punct:]' < file1); do grep -wqi "$A" file2 || echo "$A"; done
flags used for grep: -q for quiet (we don't need the output), -w for word match, -i for case insensitivity
One in awk:
$ awk -F"[^A-Za-z]+" ' # anything but a letter is a field delimiter
NR==FNR { # process the word list
a[tolower($0)]
next
}
{
for(i=1;i<=NF;i++) # loop all fields
if(!(tolower($i) in a)) # if word was not in the word list
print $i # print it. duplicates are printed also.
}' another_file txt_file
Output:
is
a
file
several
lines
of
grep:
$ grep -vwi -f another_file <(cat txt_file | tr -s -c '[a-zA-Z]' '\n')
is
a
file
several
lines
of
This pipeline will take the original file, replace spaces with newlines, convert to lowercase, then use grep to filter (-v) full words (-w) case insensitive (-i) using the lines in the given file (-f file2):
cat file1 | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | grep -vwif file2

Grep to exclude comments like # and -- with trailing spaces and within a line

I tried to grep for a word inside files that contain # and -- as comments. The command that I used is
grep "^[^#]" -H -R -I "pathtofile" | grep "^[^--]" | grep -in ${1} | awk -F : ' { print $2 } ' | uniq
which should print the names of files containing the given word. However, if there is a line like this:
--test_specific_word_test test
The command above treats that line as one not to skip. The same happens when the comment is inline with the code, like var=1 --comment.
Should I use sed to delete comment lines first, or can this be done with grep alone?
The downside is that I have a significant number of files to search, and GNU grep is version 2.0, which I can't upgrade because I don't have permission.
The command you've provided uses grep 4 times. You can skip commented lines with a single grep command:
grep -v "^ *\(--\|#\)" "pathtofile"
To print the filenames containing word1 use cut like so:
grep -Hv "^ *\(--\|#\)" filenames | grep "word1" | cut -d: -f1
To skip inline comments use sed:
sed "s/\(.*\)\(--\|#\).*/\1/g" inputfile
Sample input:
word1
word2
-word3 # inline comment
#comment1
--comment2
#comment3
output:
word1
word2
-word3
If in fact you are attempting to parse a programming language's source files, you are probably better off using a proper parser. Here is an attempt at refactoring your code into an Awk script, with several guesses as to what exactly the script should actually do.
find pathtofile -type f -exec awk -v word="$1" -F : '
# this doesn't reimplement grep -I though
{ sub("(#|--).*", "") } # remove comments
tolower($0) ~ tolower(word) && !($2 in a) { print FILENAME ":" FNR ":" $2; a[$2] }' {} +
This has the obvious flaw that if the programming language allows for # or -- in quoted strings and doesn't regard those as comments, the script will do the wrong thing.
There are no word boundaries in your script, so I didn't put any in mine either. This means that if word="dog", it will print any string which contains the three adjacent letters d-o-g in this order, even in substring matches like "doggone" or "endogenous". If that's not what you want, you can add word boundary markers: if you have GNU Awk, you can say BEGIN { word = "\\<" word "\\>" } at the beginning of the script; or see here.
The technique of adding the key to an array and printing only if the key wasn't already there is a common way to implement uniq. This will fail if find returns so many files that it ends up running more than one instance of awk; that is controlled by your kernel's ARG_MAX.
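For reference, the same array-membership trick is behind the classic awk de-duplication one-liner:
awk '!seen[$0]++' file    # print each distinct line only the first time it appears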

How to truncate trailing space in xargs

I would like to use xargs to list the contents of some files based on the output of command A. xargs replace-str seems to be adding a space to the end, causing the command to fail. Any suggestions? I know this can be worked around with a for loop, but I'm curious how to do it with xargs.
lsscsi |awk -F\/ '/ATA/ {print $NF}' | xargs -L 1 -I % cat /sys/block/%/queue/scheduler
cat: /sys/block/sda /queue/scheduler: No such file or directory
The problem is not with xargs -I, which does not append a space to each argument, which can be verified as follows:
$ echo 'sda' | xargs -I % echo '[%]'
[sda]
Incidentally, specifying -L 1 in addition to -I is pointless: -I implies line-by-line processing.
Therefore, it must be the output from the command that provides input to xargs that contains the trailing space.
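You can reproduce the symptom by feeding xargs a name with a trailing space (a minimal demonstration, not your actual lsscsi output):
$ printf 'sda \n' | xargs -I % echo '[/sys/block/%/queue/scheduler]'
[/sys/block/sda /queue/scheduler]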
You can adapt your awk command to fix that:
lsscsi |
awk -F/ '/ATA/ {sub(/ $/,"", $NF); print $NF}' |
xargs -I % cat '/sys/block/%/queue/scheduler'
sub(/ $/,"", $NF) replaces a trailing space in field $NF with the empty string, thereby effectively removing it.
Note how I've (single-)quoted cat's argument so as to make it work even with filenames with spaces.
lsscsi |awk -F\/ '/ATA/ {print $NF}'| awk '{print $NF}' | xargs -L 1 -I % cat /sys/block/%/queue/scheduler
The first awk statement splits on "/", so anything between slashes becomes a field; in this case the last field is "sda " including the trailing space. The second awk uses the default field separator, which splits on whitespace and ignores leading and trailing blanks, so printing $NF (the last word of the line) drops the space. awk '{ print $1 }' would do the same here, because "sda" is the only word and is therefore both the first and the last field.
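The whitespace-trimming behavior of awk's default field splitting can be seen in isolation:
$ printf ' sda \n' | awk '{print "[" $NF "]"}'
[sda]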

get last line from grep search on multiple files

I'm currently having a problem with a grep command.
I've found how to show only the last line of a grep search:
grep PATTERN FILE_NAME | tail -1
I also found how to run a grep search over multiple selected files:
find . -name "FILE_NAME" | xargs -I name grep PATTERN name
Now I would like to get only the last line of the grep result for each single file.
I tried this:
find . -name "FILE_NAME" | xargs -I name grep PATTERN name | tail -1
This returns only the last match from the last file, whereas I would like the last matching line for every file.
for f in $(find . -name "FILE_NAME"); do grep PATTERN $f | tail -1; done
sort has a unique option (-u) that allows you to select just one line from many. Try this:
grep PATTERN FILENAMES* | tac | sort -u -t: -k1,1
Explanation:
Grep will return one line for each match in a file. This looks like:
$ grep match file*
file1.txt:match
file1.txt:match2
file2.txt:match3
file2.txt:match4
And what we want is two lines from that output:
$ ???
file1.txt:match2
file2.txt:match4
You can treat this as a sort of table, in which the first column is the filename and the second is the match, where the column separator is the ':' character.
Our first pipe reverses the output:
$ grep match file* | tac
file2.txt:match4
file2.txt:match3
file1.txt:match2
file1.txt:match
Our second pipe, to sort, says: pull out the first unique line (-u), where the key to group by is the first column (-k1,1, key from column 1 to column 1), and split the data into columns with ':' as the delimiter (-t:). It will also sort the output. And its output:
$ grep match file* | tac | sort -u -t: -k1,1
file1.txt:match2
file2.txt:match4
An alternative could be done with awk instead of grep. A POSIX version would read:
awk '(FNR==1)&&s{print s; s=""}/PATTERN/{s=$0}END{if(s) print s}' file1 file2 file3 ...
Using GNU awk, you can use ENDFILE
awk 'BEGINFILE{s=""}/PATTERN/{s=$0}ENDFILE{if(s) print s}' file1 file2 file3 ...
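A usage sketch feeding it the files from find (BEGINFILE/ENDFILE require GNU awk; prefixing FILENAME is an addition here):
find . -name "FILE_NAME" -exec gawk '
  BEGINFILE { s = "" }
  /PATTERN/ { s = $0 }
  ENDFILE   { if (s) print FILENAME ": " s }' {} +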
you can use find to execute commands too:
find . -name "<file-name-to-find>" -exec grep "<pattern-to-match>" "{}" ";" | tail -1
"{}" is the file name, take care with shell globing and expasion when writing the command
Another way to find the last line is to reverse the file and output the first match.
find . -name "FILE_NAME" | xargs -I name sh -c 'tac name|sed -n "/PATTERN/{p;q}"'
You could start with grep's -B (before) parameter. For example to get 5 lines before the match:
duli@i5 /etc/php5/apache2 $ grep -i -B5 timezone php.ini
[CLI Server]
; Whether the CLI web server uses ANSI color coding in its terminal output.
cli_server.color = On
[Date]
; Defines the default timezone used by the date functions
; http://php.net/date.timezone
;date.timezone =
Get last line of each file (prefixed with file name). Then, filter output based on pattern.
find . -name "*" -exec tail -v -n1 {} \; | grep "some_string" -B1
On macOS, you have to do it in a slightly different way:
find . -name "*" | xargs tail -1 | grep "some_string" -B1
Seven years late to the party. A slow way is to modify the command line:
find . -name "FILE_NAME" | xargs -I name sh -c "grep PATTERN name | tail -1"
If you need to show the file name in each line:
find . -name "FILE_NAME" | xargs -I name sh -c "grep -H PATTERN name | tail -1"
There is a solution without the need for loops, this gives what the OP wants.
find . -type f -exec sh -c "fgrep print {} /dev/null |tail -1" \;
./tway.pl:print map(lambda x : x[1], filter(lambda x : x[0].startswith('volume'), globals().items()))
./txml.py: print("%s does not exist: %s\n" % (host, error))
./utils.py:print combine_dicts(a, b, operator.mul)
./xml_example.py:print ET.tostring(root, method="text")
Compare without the tail -1: it gives too many lines per file, but proves that the above works.
find . -type f -exec sh -c "fgrep print {} /dev/null" \;
gives:
./tway.pl:print map(lambda x : x[1], filter(lambda x : x[0].startswith('volume'), globals().items()))
./txml.py: print("%s resolved to --> %s\n" % (host, ip))
./txml.py: print("%s does not exist: %s\n" % (host, error))
./utils.py:print "a", a
./utils.py:print "b", b
./utils.py:print combine_dicts(a, b, operator.mul)
./xml_example.py: print ">>"
./xml_example.py: print ET.tostring(e, method="text")
./xml_example.py: print "<<"
./xml_example.py:print ET.tostring(root, method="text")
EDIT - remove the /dev/null if you don't want the filename included in the output.
The sed version
# As soon as we find pattern
# we save that line in hold space
save_pattern_line='/PATTERN/{h;d}'
# switch pattern and hold space
switch_spaces='x'
# At the end of the file
# if the pattern is in the pattern space
# (which we swapped with our hold space)
# switch again, print and exit
eof_print='${/PATTERN/{x;p;d}}'
# Else, switch pattern and hold space
switch_spaces='x'
find . -name 'FILE_NAME' |
xargs sed -s -n -e "$save_pattern_line" \
              -e "$switch_spaces" \
              -e "$eof_print" \
              -e "$switch_spaces"
The quickest way to do this would be to get the last line (or more) of each file first and then grep through that:
tail -1 filenames.* | grep "what you want to grep for"

Only get hash value using md5sum (without filename)

I use md5sum to generate a hash value for a file.
But I only need to receive the hash value, not the file name.
md5=`md5sum ${my_iso_file}`
echo ${md5}
Output:
3abb17b66815bc7946cefe727737d295 ./iso/somefile.iso
How can I 'strip' the file name and only retain the value?
A simple array assignment works... Note that the first element of a Bash array can be addressed by just the name without the [0] index, i.e., $md5 contains only the 32-character hash from md5sum.
md5=($(md5sum file))
echo $md5
# 53c8fdfcbb60cf8e1a1ee90601cc8fe2
Using AWK:
md5=$(md5sum "${my_iso_file}" | awk '{ print $1 }')
You can use cut to split the line on spaces and return only the first such field:
md5=$(md5sum "$my_iso_file" | cut -d ' ' -f 1)
On Mac OS X:
md5 -q file
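If a script has to run on both Linux and macOS, a hedged portability sketch (hash_file is a hypothetical helper name):
hash_file() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$1" | cut -d ' ' -f 1    # GNU coreutils
  else
    md5 -q "$1"                      # macOS
  fi
}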
md5="$(md5sum "${my_iso_file}")"
md5="${md5%% *}" # remove the first space and everything after it
echo "${md5}"
Another way is to do:
md5sum filename | cut -f 1 -d " "
cut will split the line to each space and return only the first field.
By leaning on head:
md5_for_file=$(md5sum "${my_iso_file}" | head -c 32)
One way:
set -- $(md5sum $file)
md5=$1
Another way:
md5=$(md5sum $file | while read sum file; do echo $sum; done)
Another way:
md5=$(set -- $(md5sum $file); echo $1)
(Do not try that with backticks unless you're very brave and very good with backslashes.)
The advantage of these solutions over other solutions is that they only invoke md5sum and the shell, rather than other programs such as awk or sed. Whether that actually matters is then a separate question; you'd probably be hard pressed to notice the difference.
If you need to print it and don't need a newline, you can use:
printf $(md5sum filename)
md5=$(md5sum < $file | tr -d ' -')
md5=$(md5sum "${my_iso_file}" | cut -b-32)
md5sum puts a backslash before the hash if there is a backslash in the file name. The first 32 characters or anything before the first space may not be a proper hash.
It will not happen when using standard input (file name will be just -), so pixelbeat's answer will work, but many others will require adding something like | tail -c 32.
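A quick demonstration with GNU coreutils (the hash shown is the well-known MD5 of the empty file):
$ touch 'back\slash'
$ md5sum 'back\slash'
\d41d8cd98f00b204e9800998ecf8427e  back\\slash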
If you're concerned about screwy filenames:
md5sum < "${file_name}" | awk NF=1
f244e67ca3e71fff91cdf9b8bd3aa7a5
Other, messier ways to deal with this:
md5sum "${file_name}" | awk NF=NF OFS= FS=' .*$'
or
| awk '_{ exit }++_' RS=' '
f244e67ca3e71fff91cdf9b8bd3aa7a5
To do it entirely inside awk:
mawk 'BEGIN {
__ = ARGV[ --ARGC ]
_ = sprintf("%c",(_+=(_^=_<_)+_)^_+_*++_)
RS = FS
gsub(_,"&\\\\&",__)
( _=" md5sum < "((_)(__)_) ) | getline
print $(_*close(_)) }' "${file_name}"
f244e67ca3e71fff91cdf9b8bd3aa7a5
Well, I had the same problem today, but I was trying to get the file MD5 hash when running the find command.
I took the most-voted answer and wrapped it in a function called md5 to run in the find command. The mission for me was to calculate the hash of every file in a folder and output it as hash:filename.
md5() { md5sum "$1" | awk '{ printf "%s", $1 }'; }
export -f md5
find -type f -exec bash -c 'md5 "$0"' {} \; -exec echo -n ':' \; -print
So, I got some pieces from here and also from 'find -exec' a shell function in Linux.
For the sake of completeness, a way with sed using a regular expression and a capture group:
md5=$(md5sum "${my_iso_file}" | sed -r 's:\\*([^ ]*).*:\1:')
The regular expression captures everything up to the first space into a group. For the substitution to output only the group, the pattern must match the whole line, which is why it ends with .* .
(More about sed and capture groups here: How can I output only captured groups with sed?)
As the delimiter in sed I use colons, so I don't have to escape the slashes in the file path.
Another way:
md5=$(md5sum "${my_iso_file}" | sed 's/ .*//')
md5=$(md5sum < index.html | head -c -4)
