Grep with a variable - bash

I would like to grep a number of documents using a set of search terms, and to specify the number of characters to return after each match. Here is what I tried:
grep -F -o -P "$(<search.txt).{0,4}" foo.txt
but I get the message 'grep: conflicting matchers specified' because -F and '-oP' cannot be combined. It does not work with '-E' either.

-F and -P are conflicting options, simple as that. The first means that the patterns are fixed strings, the second means that the patterns are Perl-compatible regular expressions. Perhaps you meant to use -f instead, which reads patterns from a file or a process substitution.
If you want to match any of the patterns in your file, followed by 4 characters, you could use something like this
grep -oP -f <(awk '{print $0 ".{4}"}' search.txt) file
This appends the .{4} pattern to each line of search.txt on the fly, so every search term is followed by exactly 4 characters (use .{0,4} for up to 4).
Alternatively, a more portable and concise version would be this:
sed 's/$/.{0,4}/' search.txt | grep -f - -oP file
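As a rough illustration of the approach above, here is a minimal sketch that uses made-up sample data (the contents of search.txt and foo.txt are assumptions, as is the temporary directory). It swaps -P for -E, since .{0,4} is also valid in extended regular expressions. The search terms are still treated as regular expressions, so terms containing special characters would need escaping first.

```shell
#!/bin/sh
# Hypothetical sample data; only the file names come from the question.
workdir=$(mktemp -d)
printf 'foo\nbar\n' > "$workdir/search.txt"
printf 'xxfoo12345 yybarAB\nnothing here\n' > "$workdir/foo.txt"

# Append ".{0,4}" to every search term and store the resulting patterns.
sed 's/$/.{0,4}/' "$workdir/search.txt" > "$workdir/patterns.txt"

# -f reads one pattern per line, -o prints only the matched text,
# -E (used here instead of -P) enables extended regular expressions.
matches=$(grep -oE -f "$workdir/patterns.txt" "$workdir/foo.txt")

printf '%s\n' "$matches"
rm -rf "$workdir"
```

Each search term is printed together with up to four characters that follow it on the line.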

Related

What is the best way to process multiple lines and extract a large list of specific strings? [duplicate]

I'm after a grep-type tool to search for purely literal strings. I'm looking for the occurrence of a line of a log file, as part of a line in a separate log file. The search text can contain all sorts of regex special characters, e.g., []().*^$-\.
Is there a Unix search utility which would not use regex, but just search for literal occurrences of a string?
You can use grep for that, with the -F option.
-F, --fixed-strings PATTERN is a set of newline-separated fixed strings
That's either fgrep or grep -F which will not do regular expressions. fgrep is identical to grep -F but I prefer to not have to worry about the arguments, being intrinsically lazy :-)
grep -> grep
fgrep -> grep -F (fixed)
egrep -> grep -E (extended)
rgrep -> grep -r (recursive, on platforms that support it).
Pass -F to grep.
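To make the difference concrete, here is a small sketch with two made-up log lines (the sample text and variable names are assumptions, not from the question). It counts matching lines with and without -F.

```shell
#!/bin/sh
# Two made-up log lines: only the first contains the literal text "a.b".
log=$(printf 'value is a.b\nvalue is axb\n')

# As a regular expression, "." matches any character, so both lines match.
regex_count=$(printf '%s\n' "$log" | grep -c 'a.b')

# With -F the pattern is a plain string, so only the first line matches.
fixed_count=$(printf '%s\n' "$log" | grep -c -F 'a.b')

echo "regex: $regex_count, fixed: $fixed_count"
```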
You can also use awk, which can compare against fixed strings and gives you programming capabilities as well, e.g.:
awk '{for(i=1;i<=NF;i++) if($i == "mystring") {print "do data manipulation here"} }' file
cat list.txt
one:hello:world
two:2:nothello
three:3:kudos
grep --color=always -F "hello
three" list.txt
output
one:hello:world
two:2:nothello
three:3:kudos
I really like the -P flag available in GNU grep for selective ignoring of special characters.
It makes grep -P "^some_prefix\Q[literal]\E$" possible
from grep manual
-P, --perl-regexp
Interpret PATTERNS as Perl-compatible regular
expressions (PCREs). This option is experimental when
combined with the -z (--null-data) option, and grep -P may
warn of unimplemented features.

Grep multiple strings from text file

Okay so I have a textfile containing multiple strings, example of this -
Hello123
Halo123
Gracias
Thank you
...
I want grep to use these strings to find lines with matching strings/keywords from other files within a directory
example of text files being grepped -
123-example-Halo123
321-example-Gracias-com-no
321-example-match
so in this instance the output should be
123-example-Halo123
321-example-Gracias-com-no
With GNU grep:
grep -f file1 file2
-f FILE: Obtain patterns from FILE, one per line.
Output:
123-example-Halo123
321-example-Gracias-com-no
You should probably look at the manpage for grep to get a better understanding of what options the grep utility supports. However, there are a number of ways to achieve what you're trying to accomplish. Here's one approach:
grep -e "Hello123" -e "Halo123" -e "Gracias" -e "Thank you" list_of_files_to_search
However, since your search strings are already in a separate file, you would probably want to use this approach:
grep -f patternFile list_of_files_to_search
I can think of two possible solutions for your question:
Use multiple regular expressions - a regular expression for each word you want to find, for example:
grep -e Hello123 -e Halo123 file_to_search.txt
Use a single regular expression with an "or" operator. Using Perl regular expressions, it will look like the following:
grep -P "Hello123|Halo123" file_to_search.txt
EDIT:
As you mentioned in your comment, you want to use a list of words to find from a file and search in a full directory.
You can manipulate the words-to-find file to look like -e flags concatenation:
cat words_to_find.txt | sed 's/^/-e "/;s/$/"/' | tr '\n' ' '
This will return something like -e "Hello123" -e "Halo123" -e "Gracias" -e "Thank you", which you can then pass to grep using xargs:
cat words_to_find.txt | sed 's/^/-e "/;s/$/"/' | tr '\n' ' ' | xargs grep dir_to_search/*
As you can see, the last command also searches in all of the files in the directory.
SECOND EDIT: as PesaThe mentioned, the following command would do this in a much more simple and elegant way:
grep -f words_to_find.txt dir_to_search/*
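A minimal sketch of that last command, run against the sample data from the question in a temporary directory (the directory layout and the file name a.txt are assumptions). The -F flag is added here on the assumption that the words should be matched literally, and -h is added only to suppress the file name prefix.

```shell
#!/bin/sh
# Recreate the question's sample data in a scratch directory.
workdir=$(mktemp -d)
mkdir "$workdir/dir_to_search"
printf 'Hello123\nHalo123\nGracias\nThank you\n' > "$workdir/words_to_find.txt"
printf '123-example-Halo123\n321-example-Gracias-com-no\n321-example-match\n' \
  > "$workdir/dir_to_search/a.txt"

# -f reads one pattern per line, -F treats each as a literal string,
# -h omits the file name prefix from the output.
found=$(grep -h -F -f "$workdir/words_to_find.txt" "$workdir/dir_to_search"/*)

printf '%s\n' "$found"
rm -rf "$workdir"
```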

Grep/Sed/Awk Options

How could you grep or use sed or awk to parse for a dynamic length substring? Here are some examples:
I need to parse out everything except for the "XXXXX.WAV" in these strings, but the strings are not a set length.
Sometimes it's like this:
{"filename": "/assets/JFM/imaging/19001.WAV"},
{"filename": "/assets/JFM/imaging/19307.WAV"},
{"filename": "/assets/JFM/imaging/19002.WAV"}
And sometimes like this:
{"filename": "/assets/JFM/LN_405999/101.WAV"},
{"filename": "/assets/JFM/LN_405999/102.WAV"},
{"filename": "/assets/JFM/LN_405999/103.WAV"}
Is there a great dynamic way to parse for just the .WAV? Maybe if I start at "/" and parse until "?
Edit:
Expected output like this:
19001.WAV
19307.WAV
19002.WAV
Or:
101.WAV
102.WAV
103.WAV
Just use grep as proposed in comments:
grep -o '[^/]\{1,\}\.WAV' yourfile
If the wav file always contains numbers, this seems more explicit (same result):
grep -o '[0-9]\{1,\}\.WAV'
Assuming there are [ and ] lines at the beginning and end of your file, it looks like your input is JSON, in which case I would recommend installing and using jq rather than text-based utilities, and doing something like this:
jq -r '.[]|.filename|split("/")[-1]'
But failing that, any of the tools listed will work just fine.
grep -o '[^/]*\.WAV'
or
sed -ne 's,.*/\([^/]*\.WAV\).*$,\1,p'
or
awk -F'"' '/WAV/ {split($4,a,"/"); print a[length(a)]}'
In each case there are a variety of other possible solutions as well.
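As a quick check of the grep variant, here is a sketch that feeds one line from each of the question's two samples through the POSIX-style interval pattern (the variable names are made up for illustration).

```shell
#!/bin/sh
# One line from each sample in the question.
input=$(printf '%s\n' \
  '{"filename": "/assets/JFM/imaging/19001.WAV"},' \
  '{"filename": "/assets/JFM/LN_405999/101.WAV"}')

# [^/]\{1,\} matches one or more non-slash characters, so the match can
# only start after the last "/" on the line; \.WAV is the literal extension.
names=$(printf '%s\n' "$input" | grep -o '[^/]\{1,\}\.WAV')

printf '%s\n' "$names"
```

Because the pattern does not depend on the directory depth or the length of the file name, the same command handles both layouts.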
Or with sed
$ sed 's,.*/,,; s,".*,,' x
101.WAV
102.WAV
103.WAV
Explanation:
s,.*/,, - delete everything up to and including the rightmost /
s,".*,, - delete everything starting with the leftmost " to the end of the line
another awk
awk -F'[/"]' '{print $(NF-1)}' file
19001.WAV
19307.WAV
19002.WAV
Try this -
awk -F'[{":}/]' '{print $(NF-2)}' f
19001.WAV
19307.WAV
19002.WAV
OR
egrep -o '[[:digit:]]{5}\.WAV' f
19001.WAV
19307.WAV
19002.WAV
OR
egrep -o '[[:digit:]]{5}\.[[:alpha:]]{3}' f
19001.WAV
19307.WAV
19002.WAV
You can easily change the digit and letter counts in the egrep patterns to suit other inputs, but the awk command works for both cases as is.
All of the programs you listed use regex to parse the names, so I will show you an example using grep, being probably the most basic one for this case.
There are a couple of options, depending on the exact way you define the XXX part before the ".wav".
Option 1, as you pointed out is just the file name, i.e., everything after the last slash:
grep -hoi "[^/]\+\.WAV"
This reads as "any character besides slash" ([^/]) repeated at least once (\+), followed by a literal .WAV (\.WAV).
Option 2 would be to only grab the digits before the extension:
grep -hoi "[[:digit:]]\+\.WAV"
OR
grep -hoi "[0-9]\+\.WAV"
These read as "digits" ([[:digit:]] and [0-9] mean the same thing) repeated at least once (\+), followed by a literal .WAV (\.WAV).
In all cases, I recommend using the flags -h, -o, -i, which I have concatenated into a single option -hoi. -h suppresses the file name from the output. -o makes grep only output the portion that matches. -i makes the match case insensitive, so should your extension ever change to .wav instead of .WAV, you'll be fine.
Also, in all cases, the input is up to you. You can pipe it in from another program, which will look like
program | grep -hoi "[^/]\+\.WAV"
You can get it from a file using stdin redirection:
grep -hoi "[^/]\+\.WAV" < somefile.txt
Or you can just pass the filename to grep:
grep -hoi "[^/]\+\.WAV" somefile.txt
awk -F/ '{print substr($5,1,7)}' file
101.WAV
102.WAV
103.WAV

How to grep and match the first occurrence of a line?

Given the following content:
title="Bar=1; Fizz=2; Foo_Bar=3;"
I'd like to match the first occurrence of the Bar value, which is 1. Also, I don't want to rely on the surroundings of the word (like the double quote in front), because the pattern could be in the middle of the line.
Here is my attempt:
$ grep -o -m1 'Bar=[ ./0-9a-zA-Z_-]\+' input.txt
Bar=1
Bar=3
I've used -m/--max-count, which is supposed to stop reading the file after num matches, but it didn't work. Why doesn't this option work as expected?
I could combine it with head -n1, but I'm wondering whether it is possible to achieve that with grep alone?
grep is line-oriented, so it apparently counts matches in terms of lines when using -m [1], even if multiple matches are found on the line (and are output individually with -o).
While I wouldn't know how to solve the problem with grep alone (except with GNU grep's -P option - see anubhava's helpful answer), awk can do it (in a portable manner):
$ awk -F'Bar=|;' '{ print $2 }' <<<"Bar=1; Fizz=2; Foo_Bar=3;"
1
Use print "Bar=" $2, if the field name should be included.
Also note that the <<< method of providing input via stdin (a so-called here-string) is specific to Bash, Ksh, Zsh; if POSIX compliance is a must, use echo "..." | grep ... instead.
[1] Options -m and -o are not part of the grep POSIX spec., but both GNU and BSD/OSX grep support them and have chosen to implement the line-based logic.
This is consistent with the standard -c option, which counts "selected lines", i.e., the number of matching lines:
grep -o -c 'Bar=[ ./0-9a-zA-Z_-]\+' <<<"Bar=1; Fizz=2; Foo_Bar=3;" yields 1.
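The line-based counting described above can be seen in a small sketch (the simplified pattern Bar=[0-9][0-9]* and the variable names are assumptions for illustration). With -m1, grep stops after the first matching line, but -o still prints every match on that line. Piping to head -n 1, as the question suggests, keeps only the first.

```shell
#!/bin/sh
line='Bar=1; Fizz=2; Foo_Bar=3;'

# -m1 limits the number of matching LINES, not matches, so both
# occurrences on this single line are printed by -o.
all=$(printf '%s\n' "$line" | grep -o -m1 'Bar=[0-9][0-9]*')

# Keep only the first match by truncating the output with head.
first=$(printf '%s\n' "$line" | grep -o 'Bar=[0-9][0-9]*' | head -n 1)

printf 'with -m1:\n%s\nwith head:\n%s\n' "$all" "$first"
```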
Using perl based regex flavor in gnu grep you can use:
grep -oP '^(.(?!Bar=\d+))*Bar=\d+' <<< "Bar=1; Fizz=2; Foo_Bar=3;"
Bar=1
(.(?!Bar=\d+))* will match 0 or more of any characters that don't have Bar=\d+ pattern thus making sure we match first Bar=\d+
If intent is to just print the value after = then use:
grep -oP '^(.(?!Bar=\d+))*Bar=\K\d+' <<< "Bar=1; Fizz=2; Foo_Bar=3;"
1
You can use grep -P (assuming you are on gnu grep) and positive look ahead ((?=.*Bar)) to achieve that in grep:
echo "Bar=1; Fizz=2; Foo_Bar=3;" | grep -oP -m 1 'Bar=[ ./0-9a-zA-Z_-]+(?=.*Bar)'
First use a grep to make the line start with Bar, and then get the Bar at the start of the line:
grep -o "Bar=.*" input.txt | grep -o -m1 "^Bar=[ ./0-9a-zA-Z_-]\+"
When you have a large file, you can optimize with
grep -o -m1 "Bar=.*" input.txt | grep -o -m1 "^Bar=[ ./0-9a-zA-Z_-]\+"

bash grep newline

Hi, I need to extract from the file:
first
second
third
using the grep command, the following line:
second
third
What should the grep command look like?
Instead of grep, you can use pcregrep which supports multiline patterns
pcregrep -M 'second\nthird' file
-M allows the pattern to match more than one line.
Your question abstract "bash grep newline", implies that you would want to match on the second\nthird sequence of characters - i.e. something containing newline within it.
Since the grep works on "lines" and these two are different lines, you would not be able to match it this way.
So, I'd split it into several tasks:
you match the line that contains "second" and output the line that has matched and the subsequent line:
grep -A 1 "second" testfile
you translate every other newline into the sequence that is guaranteed not to occur in the input. I think the simplest way to do that would be using perl:
perl -npe '$x=1-$x; s/\n/##UnUsedSequence##/ if $x;'
you do a grep on these lines, this time searching for string ##UnUsedSequence##third:
grep "##UnUsedSequence##third"
you unwrap the unused sequences back into the newlines, sed might be the simplest:
sed -e 's/##UnUsedSequence##/\n/'
So the resulting pipe command to do what you want would look like:
grep -A 1 "second" testfile | perl -npe '$x=1-$x; s/\n/##UnUsedSequence##/ if $x;' | grep "##UnUsedSequence##third" | sed -e 's/##UnUsedSequence##/\n/'
Not the most elegant by far, but should work. I'm curious to know of better approaches, though - there should be some.
I don't think grep is the way to go on this.
If you just want to strip the first line from any file (to generalize your question), I would use sed instead.
sed '1d' INPUT_FILE_NAME
This will send the contents of the file to standard output with the first line deleted.
Then you can redirect the standard output to another file to capture the results.
sed '1d' INPUT_FILE_NAME > OUTPUT_FILE_NAME
That should do it.
If you have to use grep and just don't want to display the line with first on it, then try this:
grep -v first INPUT_FILE_NAME
By passing the -v switch, you are telling grep to show you everything but the expression that you are passing. In effect show me everything but the line(s) with first in them.
However, the downside is that a file with multiple first's in it will not show those other lines either and may not be the behavior that you are expecting.
To shunt the results into a new file, try this:
grep -v first INPUT_FILE_NAME > OUTPUT_FILE_NAME
Hope this helps.
I don't really understand what you want to match. I would not use grep, but one of the following:
tail -n 2 file # to get last two lines
tail -n +2 file # to get all but first line
sed -e '2,3p;d' file # to get lines from second to third
(not sure how standard it is, it works in GNU tools for sure)
So you just don't want the line containing "first"? -v inverts the grep results.
$ echo -e "first\nsecond\nthird" | grep -v first
second
third
Line? Or lines?
Try
grep -E -e '(second|third)' filename
Edit: grep is line oriented. you're going to have to use either Perl, sed or awk to perform the pattern match across lines.
BTW, -E tells grep that the regexp is an extended RE.
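For the awk route mentioned above, one possible sketch (not taken from any of the answers here) remembers the previous line and prints the pair only when "second" is immediately followed by "third". It assumes both lines must match exactly; the variable names are made up.

```shell
#!/bin/sh
# Sample input from the question.
lines=$(printf 'first\nsecond\nthird\n')

# prev holds the previous line; print both lines when the previous one
# is exactly "second" and the current one is exactly "third".
pair=$(printf '%s\n' "$lines" |
  awk 'prev == "second" && $0 == "third" { print prev; print $0 }
       { prev = $0 }')

printf '%s\n' "$pair"
```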
grep -A1 "second" | grep -B1 "third" works nicely, and if you have multiple matches it will even get rid of the original -- match delimiter
grep -E '(second|third)' /path/to/file
egrep -w 'second|third' /path/to/file
you could use
$ grep -1 third filename
This prints the matching line plus one line of context before and after it. Since "third" is on the last line, you get the last two lines.
I like notnoop's answer, but building on AndrewY's answer (which is better for those without pcregrep, but way too complicated), you can just do:
RESULT=`grep -A1 -s -m1 '^\s*second\s*$' file | grep -s -B1 -m1 '^\s*third\s*$'`
grep -v '^first' filename
Where the -v flag inverts the match.

Resources