grep find letters between two spaces - bash

I have to find words.
In my assignment a word is defined as letters between two spaces (" bla "). I have to find a decimalIntegerConstant like this but it has to be a word.
I use
grep -E -o " (0|[1-9]+[0-9]*)([Ll]?) "
but it doesn't work on, for example:
bla 0l labl 2 3 abla0La 0L sfdgpočítačsd
Output is
0l
2
0L
but 3 is missing.

Matches don't overlap. Your regex has already matched " 2 ", trailing blank included. That blank is consumed, so it is no longer available as the leading blank of " 3 ".
POSIX grep cannot do what you want in one step, but you can do it in two stages (simplified from your regex; it doesn't support [lL]):
grep -o ' [0-9 ]* ' | grep -E -o '[0-9]+'
That is, match a sequence of space-separated numbers with leading and trailing spaces, and from that, match individual numbers regardless of spaces. De-simplify the definition of number to suit your needs.
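As a rough sketch of that de-simplification, the two stages below also accept the optional l/L suffix. The variable names and the ASCII-only sample line are assumptions made for illustration, and -o is assumed to be available (GNU or BSD grep).

```shell
# ASCII-only variant of the sample line from the question.
line='bla 0l labl 2 3 abla0La 0L end'

# Stage 1: grab each run of space-separated constants, spaces included.
# Stage 2: split every run into its individual constants.
two_stage=$(printf '%s\n' "$line" |
  grep -E -o ' ((0|[1-9][0-9]*)[Ll]? )+' |
  grep -E -o '(0|[1-9][0-9]*)[Ll]?')

printf '%s\n' "$two_stage"   # prints 0l, 2, 3 and 0L on separate lines
```

Stage 1 guarantees that every token it passes on is a complete, space-delimited constant, which is why stage 2 can ignore the spaces.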
Perl-compatible regular expressions have a way to match stuff without consuming it, for example, as mentioned in the comments:
grep -oP " (0|[1-9]+[0-9]*)[Ll]?(?= )"
(?= ) is a lookahead assertion, which means grep will look ahead in the input stream and make sure the match is followed by a space. The space will not be considered a part of the match and will not be consumed. When no space is found, the match fails.
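A possible variation, assuming a grep built with PCRE support: adding a lookbehind as well keeps the leading space out of the output too. The sample line is an ASCII-only adaptation of the one in the question, and the variable names are made up for the sketch.

```shell
line='bla 0l labl 2 3 abla0La 0L end'

# (?<= ) requires a space before the match and (?= ) a space after it;
# neither space becomes part of the match, so neighbours can share one.
lookaround=$(printf '%s\n' "$line" |
  grep -oP '(?<= )(0|[1-9][0-9]*)[Ll]?(?= )')

printf '%s\n' "$lookaround"   # prints 0l, 2, 3 and 0L on separate lines
```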
PCRE support is not guaranteed in every implementation of grep.
Edit: -o is not specified by POSIX either.
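If neither -P nor -o can be relied on, one alternative (not part of the answer above, just a sketch) is to let awk split the line on single spaces and test each inner field. The sample line and variable names are assumptions.

```shell
line='bla 0l labl 2 3 abla0La 0L end'

# FS='[ ]' splits on every single space, so fields 2..NF-1 are exactly
# the tokens that have a space on both sides.
portable=$(printf '%s\n' "$line" | awk -F'[ ]' '{
  for (i = 2; i < NF; i++)
    if ($i ~ /^(0|[1-9][0-9]*)[Ll]?$/) print $i
}')

printf '%s\n' "$portable"   # prints 0l, 2, 3 and 0L on separate lines
```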

Related

Grep pattern matching at most n times using Perl flag

I am trying to get a specific pattern from a text file. The pattern should start with chr followed by a digit appearing at most 2 times, or a letter X or Y appearing exactly 1 time, and then an underscore appearing also one time. Input example:
chr5_ 16560869
chrX 46042911
chr12_ 131428407
chr22_ 13191864
chr5 165608
chrX_ 96055593
I am running this code on the console: grep -P "^chr(\d{,2}|X{1}|Y{1})_" input_file.txt, which only gives me back the lines that start with chrX_ or chrY_, but not chr2_ (I wrote 2 but could be any digit/s).
The thing is that if I run grep -P "^chr(\d{1,2}|X{1}|Y{1})_" input_file.txt (note that I changed from {,2} to {1,2}) then I get back what I expected. I can't figure out why the first option is not working. I thought in regexp you could specify that a pattern was matched at most n times with the syntax {,N}.
Thanks in advance!
Note that you chose the PCRE regex engine with your grep due to the -P option.
The \d{,2} does not match zero to two digits; it matches a digit followed by the literal string {,2}.
See the PCRE documentation:
An opening curly bracket that appears in a position where a quantifier is not allowed, or one that does not match the syntax of a quantifier, is taken as a literal character. For example, {,6} is not a quantifier, but a literal string of four characters.
Also, see the limiting quantifier definition there:
The general repetition quantifier specifies a minimum and maximum number of permitted matches, by giving the two numbers in curly brackets (braces), separated by a comma. The numbers must be less than 65536, and the first must be less than or equal to the second.
Note that the POSIX-style engines in GNU grep (-G and -E) are not that strict: as a GNU extension, they allow omitting the minimum from the limiting quantifier (it is then assumed to be 0):
grep -oE '[0-9]{,2}_' <<< "12_ 21"
## => 12_
grep -oP '[0-9]{,2}_' <<< "21_ 1{,2}_"
## => 1{,2}_
Note: I'd advise always specifying the 0 minimum explicitly, since the behavior varies from engine to engine. In the TRE regex flavor, the default regex engine in base R, omitting the zero leads to a bug.
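A small check of that advice, assuming GNU grep with PCRE support: with the minimum spelled out, both flavors read {0,2} as a quantifier and agree on the result. The variable names are illustrative.

```shell
sample='12_ 21'

# {0,2} is an explicit quantifier in ERE and in PCRE alike.
ere=$(printf '%s\n' "$sample" | grep -oE '[0-9]{0,2}_')
pcre=$(printf '%s\n' "$sample" | grep -oP '[0-9]{0,2}_')

printf 'ERE:  %s\nPCRE: %s\n' "$ere" "$pcre"   # both report 12_
```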
-P, --perl-regexp
Interpret PATTERNS as Perl-compatible regular expressions (PCREs).
It would seem {,n} is not supported in Perl-compatible regexes.
Using grep ERE instead:
$ grep -E 'chr([0-9]{,2}|[XY])_' input_file
chr5_ 16560869
chr12_ 131428407
chr22_ 13191864
chrX_ 96055593
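One caveat worth sketching: {,2} also allows zero digits, so a line such as chr_ would be accepted too. Assuming at least one digit is required, as the question describes, {1,2} plus a ^ anchor is a tighter variant. The chr_ line below is a hypothetical addition to the sample input.

```shell
matches=$(printf '%s\n' \
  'chr5_ 16560869' 'chrX 46042911' 'chr12_ 131428407' \
  'chr22_ 13191864' 'chr5 165608' 'chrX_ 96055593' 'chr_ 1' |
  grep -E '^chr([0-9]{1,2}|[XY])_')

# Keeps the chr5_, chr12_, chr22_ and chrX_ lines; chr_ is rejected.
printf '%s\n' "$matches"
```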
There is a missing digit in the {,2}.
Try:
grep -P "^chr(\d{1,2}|X{1}|Y{1})_" input_file.txt
My first guess was to use egrep instead.
This alternative seems OK too:
egrep "^chr([[:digit:]]{1,2}|X{1}|Y{1})_" input_file.txt

ignore spaces within/around brackets to count occurrences

(to LaTeX users) I want to search for manually labeled items
(to whom it may concern) script file on GitHub
I tried to find a solution, but what I found suggested removing the spaces first. In my case, I think there should be a simpler solution. It could use grep, awk, or some other tool.
Consider the following lines:
\item[a)] some text
\item [i) ] any text
\item[ i)] foo and faa
\item [ 1) ] foo again
I want to find (or count) if there are items with a single ) inside brackets. The format could have blank spaces inside the brackets and/or around it. Also, the char before the closing parentheses could be any letter or number.
Edit: I tried grep "\[a)\]" but it missed [ a) ].
Since there are many possible ways to write an item, I cannot settle on a single pattern. I think something like this is enough for me:
\item<blank spaces>[<blank spaces><letter or number>)<blank spaces>]
Replacing blank spaces would not work, because the pattern above generally has text around it (for example: \item[ a)] consider the function...)
The output should indicate whether there are such patterns or not. It could be zero or the number of occurrences.
So to do it all in the grep itself:
grep -c -E '\\item\s*\[\s*\w+\)\s*\]' file.txt
Note that each \s* allows for optional whitespace, and -c prints the count of matching lines.
Breaking it down:
\\ a backslash (needs escape in grep)
item "item"
\s* optional whitespaces
\[ "[" (needs escape in -E)
\s* optional whitespaces
\w+ at least one 'word' char
\) ")" (needs escape in -E)
\s* optional whitespaces
\] "]" (needs escape in -E)
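Since \s and \w are GNU extensions, a version of the same breakdown written with POSIX character classes may travel better. This is only a sketch: the last sample line is a hypothetical non-matching label, and [[:alnum:]] (unlike \w) does not accept an underscore.

```shell
count=$(printf '%s\n' \
  '\item[a)] some text' \
  '\item [i) ] any text' \
  '\item[ i)] foo and faa' \
  '\item [ 1) ] foo again' \
  '\item[note] a label without a parenthesis' |
  grep -c -E '\\item[[:space:]]*\[[[:space:]]*[[:alnum:]]+\)[[:space:]]*]')

printf '%s\n' "$count"   # prints 4
```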
The following awk may also help here (I simply remove the spaces between [ and ], then look for a run of letters or digits followed by ) in it):
awk '
match($0, /\[.*\]/) {
  # Text between the brackets, with all whitespace removed.
  val = substr($0, RSTART + 1, RLENGTH - 2)
  gsub(/[[:space:]]+/, "", val)
  if (val ~ /[A-Za-z0-9]+\)/) { count++ }
}
END {
  # count + 0 prints 0 instead of an empty line when nothing matched.
  print count + 0
}' Input_file
So I am thinking something like this:
tr -d " \t" < file.txt | grep -c '\\item\[[0-9A-Za-z])\]'
This will count the number of matches for you.
Edit: Added \t to tr call. Now removes all spaces and tabs.
Here is a grep only version. This could be useful for printing out all of the matches (by removing -c) as well since the above version modifies the input:
grep -c '\\item *\[ *[0-9A-Za-z]) *\]' file.txt
Here is a more versatile answer, if this is what you are looking for. Here, we write the matches to a file and count the lines in that file to get the number of matches:
grep '\\item *\[ *[0-9A-Za-z]) *\]' file.txt > matches.txt
wc -l < matches.txt

grep for a variable content with a dot

I found many similar questions about my issue, but I still haven't found the right answer for my case.
I need to grep for the content of a variable plus a dot, but it doesn't work when I escape the dot after the variable. For example:
The file content is
item.
newitem.
My variable content is item. and I want to grep for the exact word, therefore I must use -w and not -F, but with this command I can't obtain the correct output:
cat file | grep -w "$variable\."
Do you have suggestions please?
Hi, I have to correct my scenario. My file contains some FQDNs and, for some reason, I have to look for hostname. with the dot.
Unfortunately, grep -wF doesn't work:
My file is
hostname1.domain.com
hostname2.domain.com
and the command
cat file | grep -wF hostname1.
doesn't show any output. I have to find another solution and I'm not sure that grep could help.
If $variable contains item., you're searching for item.\. which is not what you want. In fact, you want -F which interprets the pattern literally, not as a regular expression.
var=item.
echo $'item.\nnewitem.' | grep -F "$var"
Try:
grep "\b$word\."
\b: word boundary
\.: a literal dot; since the dot is not a word character, the word also ends there
The following awk solution may also help:
awk -v var="item." '$0==var' Input_file
You are dereferencing the variable and appending \. to it, which results in calling
cat file | grep -w "item.\.".
Since grep accepts files as parameters, calling grep "item\." file should do.
from man grep
-w, --word-regexp
Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line, or preceded by a non-word constituent character. Similarly, it must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore.
and
The Backslash Character and Special Expressions
The symbols \< and \> respectively match the empty string at the beginning and end of a word. The symbol \b matches the empty string at the edge of a word, and \B matches the empty string provided it's not at the edge of a word. The symbol \w is a synonym for [_[:alnum:]] and \W is a synonym for [^_[:alnum:]].
With -w, since the last character of the pattern is a ., the match must be followed by a non-word character (one outside [A-Za-z0-9_]); however, the next character is d, so nothing matches.
grep '\<hostname1\.'
should work, as \< ensures the previous character is not a word constituent.
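The difference can be checked on a single line. This is a small sketch with illustrative variable names; \< is assumed to be supported by the grep in use, as it is in GNU grep.

```shell
fqdn='hostname1.domain.com'

# -w rejects the match because the character after the dot is a word character.
# grep -c exits non-zero on a count of 0, hence the "|| true".
with_w=$(printf '%s\n' "$fqdn" | grep -c -wF 'hostname1.' || true)

# \< only constrains the left edge, so the trailing dot is no obstacle.
with_left_anchor=$(printf '%s\n' "$fqdn" | grep -c '\<hostname1\.' || true)

printf 'with -wF: %s, with left anchor: %s\n' "$with_w" "$with_left_anchor"   # 0 and 1
```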
You can dynamically construct the search pattern and then call grep
rexp='^hostname1\.'
grep "$rexp" file.txt
The single quotes tell bash not to interpret special characters in the variable. Double quotes tell bash to allow replacing $rexp with its value. The caret ( ^ ) in the expression tells grep to look for lines starting with 'hostname1.'
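When the host name arrives in a variable, the pattern can be built from it by escaping the regex metacharacters first. The following is a sketch under assumptions: the variable names and the two extra host lines are hypothetical, and the left edge is checked with an explicit bracket expression instead of \<.

```shell
host='hostname1.'   # value supplied at run time; its dot must stay literal

# Backslash-escape every ERE metacharacter found in the variable.
escaped=$(printf '%s\n' "$host" | sed 's/[][\.*^$+?(){}|]/\\&/g')

# Require start of line or a non-word character before the host name.
found=$(printf '%s\n' \
  hostname1.domain.com hostname10.domain.com myhostname1.domain.com |
  grep -E "(^|[^[:alnum:]_])${escaped}")

printf '%s\n' "$found"   # prints hostname1.domain.com
```

Without the escaping step, the dot would match any character and hostname10.domain.com would be reported as well.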

grep for a specific pattern in a file?

I have a file textFile.txt
abc_efg#qwe.asd
abc_aer#
#avret
afd_wer_asd#qweasd.zxcasd
wqe_a#qwea.cae
qwe.caer
I want to grep to get specific lines :
abc_efg#qwe.asd
afd_wer_asd#qweasd.zxcasd
wqe_a#qwea.cae
That is the ones that have
[a-z]_[a-z]#[a-z].[a-z]
but the part before the # can have any number of "_"
So far this is what I have :
grep "[a-z]_[a-z]#[a-z].[a-z]" textFile.txt
But I got only one line as the output.
wqe_a#qwea.cae
Could I know a better way to do this ? :)
You can simply add the _ inside the brackets, [a-z_], so the new command is:
grep "[a-z_]#[a-z].[a-z]" textFile.txt
or, if you want it to start with a character other than _, you can use
grep "[a-z][a-z_]#[a-z].[a-z]" textFile.txt
I would suggest keeping it simple by checking only one # is present in each line:
grep -E '^[^#]+#[^#]+$' file
abc_efg#qwe.asd
afd_wer_asd#qweasd.zxcasd
wqe_a#qwea.cae
The following selects lines that have at least one underscore followed by letters before the # sign, and one or more letters followed by a literal period after the # sign:
$ grep '_[a-z]\+#[a-z]\+\.' textFile.txt
abc_efg#qwe.asd
afd_wer_asd#qweasd.zxcasd
wqe_a#qwea.cae
Notes
An unescaped period matches any character. If you want to match a literal period, it must be escaped with a backslash: \.
Thus, #[a-z].[a-z] matches a # sign, followed by a letter, followed by anything at all, followed by a letter.
[a-z] matches a single letter. Thus _[a-z]# would match only if there was exactly one letter between the underscore and the # sign. To match one or more letters, use [a-z]\+.
#[a-z]\+\. will match a # sign, followed by one or more letters, followed by a literal period character.
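A quick way to see the effect of the escape, using a made-up test string where an X sits in the place of the period:

```shell
# The unescaped . accepts the X; the escaped \. does not.
# grep -c exits non-zero on a count of 0, hence the "|| true".
unescaped=$(printf '%s\n' 'a#bXc' | grep -c '#[a-z].[a-z]' || true)
escaped=$(printf '%s\n' 'a#bXc' | grep -c '#[a-z]\.[a-z]' || true)

printf 'unescaped: %s, escaped: %s\n' "$unescaped" "$escaped"   # 1 and 0
```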
When you do [a-z] it only matches one character of that set. That's why you are only getting wqe_a#qwea.cae back from your grep call because there is only one character between the _ and the #.
To match more than one character, you can use + or *. + means one or more of the set and * any number of that set. As well, an unescaped . means any character.
So something like:
grep "[a-z]\+_[a-z]\+#[a-z]\+\.[a-z]\+" textFile.txt
would work for this. There are shorter, less specific ways of doing this as well (as other answers have shown).
Note the escapes before the + signs and the . .
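For comparison, here is an ERE sketch that needs no backslashes before + and anchors the whole line, so any number of _-separated parts is allowed before the #. It assumes every part consists of lowercase letters only, as in the sample file.

```shell
wanted=$(printf '%s\n' \
  'abc_efg#qwe.asd' 'abc_aer#' '#avret' \
  'afd_wer_asd#qweasd.zxcasd' 'wqe_a#qwea.cae' 'qwe.caer' |
  grep -E '^[a-z]+(_[a-z]+)+#[a-z]+\.[a-z]+$')

# Keeps abc_efg#qwe.asd, afd_wer_asd#qweasd.zxcasd and wqe_a#qwea.cae.
printf '%s\n' "$wanted"
```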
This regex should get all valid email from a text file:
grep -E -o "\b[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" file
abc_efg#qwe.asd
afd_wer_asd#qweasd.zxcasd
wqe_a#qwea.cae
This greps for patterns like text#text.some_more_text

searching specific word in shell script

I have a problem. Please give me a solution.
I have to run the command given below, which lists all the files that contain the given string "abcde1234".
find /path/to/dir/ * | xargs grep abcde1234
But it also displays the files that contain the string "abcde1234567". I need only the files that contain the word "abcde1234". What modification do I need to make to the command?
When I need something like that, I use the \< and \> which mean word boundary. Like this:
grep '\<abcde1234\>'
The symbols \< and \> respectively match the empty string at the beginning and end of a word.
But that's me. The correct way might be to use the -w switch instead (which I tend to forget about):
-w, --word-regexp
Select only those lines containing matches that form whole words. The test is that the matching substring must either be at the beginning of the line, or preceded by a non-word constituent character. Similarly, it
must be either at the end of the line or followed by a non-word constituent character. Word-constituent characters are letters, digits, and the underscore.
One more thing: instead of find + xargs you can use just find with -exec. Or actually just grep with -r:
grep -w -r abcde1234 /path/to/dir/
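To try this out without touching real data, a throwaway directory is enough. The file names and contents below are invented for the sketch, and -l is added so that only file names are printed.

```shell
dir=$(mktemp -d)
printf '%s\n' 'id abcde1234 ok' > "$dir/exact.txt"
printf '%s\n' 'id abcde1234567 ok' > "$dir/longer.txt"

# -r recurses, -l prints only file names, -w keeps whole-word matches only.
hits=$(grep -rlw 'abcde1234' "$dir")

printf '%s\n' "${hits##*/}"   # prints exact.txt
rm -rf "$dir"
```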
$ grep abcde1234 *
This will grep for the string abcde1234 in the current directory and print each match together with the name of the file it was found in.
Ex:
abc.log: abcde1234 found
Hi, I got the answer for this. By appending $ to the word being searched for, it will display only the files that contain that word (this works when the word is at the end of a line, since $ anchors the match there).
The command will be like this.
find /path/to/dir/ * | xargs grep abcde1234$
Thanks.
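A short comparison of the two approaches on made-up input lines suggests that $ only covers the case where the word ends the line, while -w also finds it mid-line:

```shell
lines=$(printf '%s\n' 'abcde1234 found' 'id abcde1234' 'id abcde1234567')

# $ anchors to the end of the line, so the first line is missed.
end_anchor=$(printf '%s\n' "$lines" | grep -c 'abcde1234$')

# -w matches the whole word wherever it appears on the line.
whole_word=$(printf '%s\n' "$lines" | grep -c -w 'abcde1234')

printf 'with $: %s, with -w: %s\n' "$end_anchor" "$whole_word"   # 1 and 2
```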

Resources